STANDARDISED TESTING
Fulcher (2010)
Presenter: Melika Rajabi
Two paradigms
- Norm-referenced testing is the normative approach in educational testing: individuals are compared to each other.
- The meaning of the score on a test is derived from the position of an individual in relation to others. If the purpose of testing is to distribute scarce resources fairly, like university places, we need a test that separates out the test takers very effectively.
īļ The primary requirement of the test is that it should discriminate between
test takers.
The high-scoring test takers are offered places at the most prestigious
institutions, while those on lower scores may not be so fortunate. The
decision makers need the test takers to be ‘spread out’ over the range of test
scores, and this spread is called the distribution.
It has a long history, and its principles and assumptions still dominate
the testing industry today.
- Criterion-referenced testing
  - The idea was first discussed in the 1960s.
  - It informs testing and assessment related to instructional decisions much more than norm-referenced testing does.
  - The purpose of a criterion-referenced test is to make a decision about whether an individual test taker has achieved a pre-specified criterion, or standard, required for a particular decision context (e.g., mastery tests).
Testing as science
- The rise of testing began during the First World War.
- For the first time in history, industry was organised on a large and efficient scale in order to produce the materials necessary for military success.
- With it came the need to count, measure, and quantify on a completely new scale. In an early study of labour efficiency, Greenwood (1919: 186) takes as his rationale the dictum of the natural scientist, Kelvin:

    When you can measure what you are speaking about and express it in numbers, you know something about it, but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind.

- The First World War saw the dramatic rise of large-scale testing.
- Psychologists wished to contribute to the war effort and show that testing was a scientific discipline (Kevles, 1968).
- Tests are about measuring knowledge, skills or abilities ('KSAs') and expressing their existence, or degree of presence, in numerical form.
- The assumption is that, once we are able to do this, we have 'genuine' knowledge.
- Shohamy (2001: 21) correctly identifies this as one of the key features of the 'power of tests':

    The language of science in Western societies grants authority, status and power. Testing is perceived as a scientific discipline because it is experimental, statistical and uses numbers. It is viewed as objective, fair, true.
When the Great War broke out, measurement and scientific progress went hand in hand. Measurement had in fact been seen as the most important development in scientific research since the early nineteenth century.
- Cattell (1893) said:

    The history of science is the history of measurement. Those departments of knowledge in which measurement could be used most readily were the first to become sciences, and those sciences are at the present time the furthest advanced in which measurement is the most extended and the most exact.
What did the early testers believe they could achieve that led to the explosion of testing theory and practice during the Great War? It is best explained with reference to a comment that Francis Galton was asked to write on one of Cattell's early papers (Cattell and Galton, 1890: 380):

    One of the most important objects of measurement … is to obtain a general knowledge of the capacities of a man by sinking shafts, as it were, at a few critical points. In order to ascertain the best points for the purpose, the sets of measures should be compared with an independent estimate of the man's powers. We thus may learn which of the measures are the most instructive.
Galton is suggesting that the use of tests is like drilling a hole into the test taker to discover what is inside. In language testing this implies a strong 'trait theory': what we think our test measures is a real, stable part of the test taker. Comparing our measures with an independent estimate is the root of the concept of validity.
- In order to make a decision about which tests are the best measures, we need to compare the results of the test with an independent estimate of whatever the test is designed to measure.
- This is the external aspect of validity: the value of our measurements can be judged by their relation with other measures of the same property.
A norm-referenced test is a scientific tool designed to discriminate among test takers. For example:
Past: In the Army tests the higher scorers were considered for officer positions. The testers genuinely believed that through their efforts the war would be more successfully prosecuted, and victory would be achieved sooner.
Today: Test takers who get higher grades on modern international language tests have more chances of promotion, or can obtain places in the universities of their choice.
- A view of testing as 'scientific' has always been controversial. Lippmann (1922) argued that a strong trait theory is untenable. In fact, most of the traits or constructs that we work with are extremely difficult to define, and if we are not able to define them, measurement is even more problematic.
What is in a curve?
It is possible for an illiterate enlisted man to get a score that
is similar to that obtained by a low-scoring – or even a
middle-scoring – officer. But it is highly unlikely.
Also, someone who is really officer material may get a
score similar to that expected of a high scoring illiterate.
Once again, this is highly unlikely, but possible.
If we look at the intersection between the officer curve and
the sergeants’ curve, however, the possibility that an officer
would get the same score as a sergeant and vice versa is
much higher and placement into an appropriate category
becomes more difficult. This is the problem of setting cut
scores on tests in order to make decisions.
The curve and score meaning
- In norm-referenced testing the meaning of a score is directly related to its place in the curve of the distribution from which it is drawn, because it tells us how an individual with this score relates to the rest of the test-taking population.
- Each individual is compared with all other individuals in the distribution.
Around 68 per cent of all scores cluster closely to the mean, with approximately 34
per cent just above the mean, and 34 per cent just below the mean. As we move
away from the mean the scores in the distribution become more extreme, and so
less common.
The curve of normal distribution tells us what the probability is that a test taker could have got the score they have, given the place of the score in a particular distribution.
Putting it into practice
A test score is arrived at by giving the test
takers a number of items or tasks to do. The
responses to these items are scored, usually
as correct (1) or incorrect (0). The number of
correct responses for each individual is then
added up to arrive at a total raw score.
Mode: the most frequent score in a distribution.
Median: the score that falls in the middle of the distribution.
Mean: calculated by adding all the scores together and dividing the total by the number of test takers.
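These three measures of central tendency can be sketched in Python; the score list below is hypothetical, purely for illustration:

```python
from statistics import mean, median, mode

# Hypothetical raw scores for ten test takers (illustrative only)
scores = [8, 9, 11, 11, 11, 12, 13, 14, 15, 16]

print(mode(scores))    # 11   - the most frequent score
print(median(scores))  # 11.5 - the middle of the ordered distribution
print(mean(scores))    # 12   - the total divided by the number of test takers
```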
- We can present these scores visually in histograms.
- The mean is the most important measure of the centre of the distribution of test scores.
- If we subtract the mean from each of the individual scores we get a deviation score from the mean.
- Deviation score: the score minus the mean. This number shows how far an individual score is from the mean, may be a negative or positive number, and the mean of these deviation scores is always zero.
The standard deviation is the square root of the sum of the squared deviation scores divided by N – 1.
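The deviation scores and the standard deviation can be computed directly from these definitions (the scores are hypothetical):

```python
from math import sqrt

scores = [8, 9, 11, 11, 11, 12, 13, 14, 15, 16]  # hypothetical raw scores
n = len(scores)
m = sum(scores) / n                               # the mean
deviations = [x - m for x in scores]              # deviation scores; their mean is always zero
sd = sqrt(sum(d * d for d in deviations) / (n - 1))
print(round(sd, 2))  # 2.54
```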
- A z-score is simply the raw score expressed in standard deviations. So, if your score on the test was 11.72 (the mean), you would in fact score zero. And if your score was 19.2, you would score 1. There is a very straightforward formula for transforming any raw score to a z-score:

    z = (X – mean) / standard deviation, where X = the raw score
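A sketch of the transformation in Python. The mean of 11.72 comes from the slide; the standard deviation of 7.48 is inferred from the fact that a raw score of 19.2 corresponds to z = 1 (an assumption, not a figure given in the text):

```python
def z_score(x, mean, sd):
    """z = (raw score - mean) / standard deviation."""
    return (x - mean) / sd

# mean taken from the slide; SD of 7.48 is an inferred, illustrative value
print(z_score(11.72, mean=11.72, sd=7.48))           # 0.0
print(round(z_score(19.2, mean=11.72, sd=7.48), 2))  # 1.0
```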
Test scores in a consumer age
The Gaokao as a modern standardised test:
The reported score ranges from 100 to 900. The reported score bears little relationship to the actual number of items on the test. Indeed, the actual number of items may vary from year to year, but the score meaning on the scale remains the same.
The descriptive statistics for the standardised Gaokao examination every year are as follows:
- Mean = 500
- Standard deviation = 100
- A scaled score is the z-score multiplied by the new standard deviation, plus the new mean. The raw scores are first converted to z-scores (mean of 0, SD of 1), and then:

    scaled score = z * 100 + 500
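The whole rescaling step can be sketched using the reported Gaokao scale (mean 500, SD 100):

```python
def scaled_score(z, new_mean=500, new_sd=100):
    """Scaled score = z * new SD + new mean."""
    return z * new_sd + new_mean

print(scaled_score(0))    # 500 - a test taker exactly at the mean
print(scaled_score(1.5))  # 650.0 - one and a half SDs above the mean
print(scaled_score(-2))   # 300 - two SDs below the mean
```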
Consider item 29, which we would probably all agree is the most difficult on the test. The comparison of native speaker responses with those of the target non-native test-taking population has become a basic source of information on whether the test is a language test, rather than a test of other cognitive or non-linguistic abilities (as in the TOEFL test).
Introducing reliability
- The most prized quality of standardised language tests that are designed to implement meritocracy is reliability.
- Classic definition by Lado (1961: 31):

    Does a test yield the same scores one day and the next if there has been no instruction intervening? That is, does the test yield dependable scores in the sense that they will not fluctuate very much so that we may know that the score obtained by a student is pretty close to the score he would obtain if we gave the test again? If it does, the test is reliable.
Lado specifically separated the quality of reliability from scorability. Scorability is the ease with which a test item or task can be scored (for example, a set of multiple-choice items). It is quite possible that items are easily scored, but produce unreliable scores, as defined by Lado.
Sources of unreliability according to Lado
- Lado assumed that reliability is about the lack of fluctuation, or consistency: if no learning has taken place between two administrations of a test, the scores should not differ.
1. The first source of unreliability is the administration: the test is held in two different places, or under slightly different conditions (such as a different room, or with a different invigilator), and the score changes as a direct result.
2. The test itself is the second source of unreliability. Lado pointed to problems with sampling what language to test: we can't test everything in a single test.
In standardised tests, any group of items from which responses are added together to create a single score is assumed to test the same ability, skill or knowledge: this is item homogeneity. If a test consists of items that test very different things, reliability is reduced.
Finally, unreliability can be caused by the scoring:
- If humans are scoring multiple-choice items they may become fatigued and make mistakes, or transfer marks inaccurately from scripts to computer records.
- However, there is more room for variation when humans are asked to make judgments about the quality of a piece of writing or a sample of speech, and give it a rating.
Why testing was viewed as a 'science':
Tests, like scientific instruments, provide the means by which we can observe and measure consistencies in human ability. Any observed score on our tests is therefore assumed to be a composite, explained by the formula:

    X = T + E

where X is the observed score, T is the 'true' score of an individual's ability on what the test measures, and E is the error, which can come from a variety of sources like those identified by Lado.
Calculating reliability
- How reliability is calculated depends upon what kind of error we wish to focus on.
- A reliability coefficient is calculated that ranges from 0 to 1.
- No test is 'perfectly' reliable. There is always error.
Three areas identified by Lado
- Test administrations
- The test itself
- Marking or rating
Test administration
The Pearson Product Moment correlation is a measure of the strength of the relation between two interval-level variables, such as test scores. The strength of the relation between the two sets of scores can easily be quantified on a scale from:
1. –1 (there is an inverse relationship between the scores: as one goes up, the other comes down),
2. through 0 (there is no relation between the two sets of scores),
3. to 1 (the scores are exactly the same on both administrations of the test).
The closer the result is to 1, the more test–retest reliability we have.
Consider a scatter plot of the scores from the same test given at two different times, in which the score that each student got at administrations 1 and 2 is plotted. We can see visually that there is a strong positive relationship between the two sets of scores: it is highly likely that a student who scored highly on one administration will score highly on another administration, but there will be some fluctuation.
The formula for the correlation between the two sets of raw scores is:

    r = Σ(x – x̄)(y – ȳ) / √[Σ(x – x̄)² × Σ(y – ȳ)²]

- In order to interpret the correlation coefficient we square the result: .87² = .76.
- This number (r²) can be interpreted as the percentage of variance shared by the two sets of scores, or the degree to which they vary together (as the score on one test increases, so it increases proportionally on the other test). In our case, 76 per cent of the variance is shared.
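Pearson's r and the shared variance r² can be computed from first principles; the two score sets below are hypothetical:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson Product Moment correlation between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) *
                      sum((y - my) ** 2 for y in ys))

# Hypothetical scores from two administrations of the same test
admin1 = [10, 12, 13, 15, 18, 20]
admin2 = [11, 11, 14, 16, 17, 21]
r = pearson_r(admin1, admin2)
print(round(r, 2), round(r ** 2, 2))  # r and the shared variance r-squared
```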
The shared variance is represented
by the shaded area of overlap
between the two boxes.
the white area of each box represents
variance that is unique to each
administration.
The test itself
The items must be homogeneous. In technical terms, they must all be highly correlated.
īļTwo ways of addressing reliability in terms of item homogeneity.
1. the split-half method
After a test has been administered the first task is to split it into two equal halves. This might be
done by placing item 1 in test A, item 2 in test B, item 3 in test A, item 4 in test B, and so on.
Then we have two tests, each of which is half the length of the original test. We then calculate
the correlation between the two halves in exactly the same way.
The correlation coefficient is the reliability coefficient for a test half the length of the one you have actually given. Reliability is directly related to the length of a test: the longer a test, the more reliable it becomes. To estimate the reliability of the full-length test we use the Spearman-Brown correction formula, which is:

    r = (2 × rhh) / (1 + rhh)

where rhh is the correlation between the two halves of the test.
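The Spearman-Brown correction might be sketched like this (the half-test correlation of .6 is an illustrative value):

```python
def spearman_brown(r_hh):
    """Estimate full-test reliability from the half-test correlation r_hh."""
    return 2 * r_hh / (1 + r_hh)

# If the two halves correlate at .6, the full-length test's estimated reliability is:
print(round(spearman_brown(0.6), 2))  # 0.75
```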
2. The most frequently used and reported reliability coefficient is Cronbach's alpha. The formula for dichotomously scored items (scored 'right' or 'wrong') is:

    alpha = (k / (k – 1)) × (1 – (sum of item variances / test score variance))

where k is the number of items on the test, the test score variance is the square of the standard deviation, and the sum of item variances is the sum of the variances of the individual items. We know for the linguality test that k = 29 and S² = 56.25.
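A sketch of the alpha computation. The values k = 29 and S² = 56.25 come from the slide, but the sum of the item variances is not given in the text, so the figure below is a made-up placeholder:

```python
def cronbach_alpha(k, sum_item_variances, test_variance):
    """alpha = (k / (k - 1)) * (1 - sum of item variances / test score variance)."""
    return (k / (k - 1)) * (1 - sum_item_variances / test_variance)

# k and S^2 from the slide; the item-variance sum of 6.5 is purely illustrative
alpha = cronbach_alpha(k=29, sum_item_variances=6.5, test_variance=56.25)
print(round(alpha, 2))
```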
Marking or rating
If we have two raters and we need to discover their inter-rater reliability, we can apply the same alpha formula, with k now standing for the number of raters and the item variances replaced by the variances of each rater's scores (R1 and R2 merely stand for rater 1 and rater 2).
The scoring of closed-response items like multiple choice is much easier than open-response items because there is only one correct response. Rating is much more complex because there is an assumption that whichever rater is making the judgment should be a matter of indifference to the test taker. If there is variation by rater, this is considered to be a source of unreliability, or error.
Living with uncertainty
While the reliability coefficient tells us how much error there might be in the measurement, the standard error of measurement tells us what this might mean for a specific observed score. The formula for the standard error is:

    SEM = SD × √(1 – reliability coefficient)

where SD is the standard deviation. We use the standard error to calculate a confidence interval around an observed test score, which tells us by how much the true score may be above or below the observed score that the test taker has actually got on our test.
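The standard error and a 95 per cent confidence interval can be sketched as follows; the SD and reliability figures are illustrative, not taken from the text:

```python
from math import sqrt

def standard_error(sd, reliability):
    """SEM = SD * sqrt(1 - reliability coefficient)."""
    return sd * sqrt(1 - reliability)

def confidence_interval(observed, sd, reliability, z=1.96):
    """Band within which the true score is likely to fall (95% by default)."""
    margin = z * standard_error(sd, reliability)
    return observed - margin, observed + margin

# Illustrative figures: SD = 7.5 and reliability = .84 give SEM = 3.0
low, high = confidence_interval(observed=20, sd=7.5, reliability=0.84)
print(round(low, 2), round(high, 2))  # 14.12 25.88
```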
Reliability and test length
- The reliability of a test is determined mostly by the quality of the items and by the length of the test.
In standardised tests with many items, each item provides a piece of information about the ability of the test taker. If you were to increase the value of k in the alpha formula you would see that reliability steadily increases. The more independent pieces of information we collect, the more reliable the measurement becomes. So the response to any specific item must be independent of the response to any other item. The technical term for this is the stochastic independence of items.
Lado (1961: 339) provides us with the following formula for looking at the relationship between reliability and test length:

    A = (rd × (1 – ro)) / (ro × (1 – rd))

where A is the proportion by which you would have to lengthen the test to get the desired reliability, rd is the desired reliability, and ro is the reliability of the current test.
Imagine a test with a reliability of .7, and you wish to raise this to .85. The illustrative calculation is as follows:

    A = (.85 × (1 – .7)) / (.7 × (1 – .85)) = .255 / .105 ≈ 2.43

so the test would need to be about two and a half times its current length.
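Lado's lengthening formula, with the worked example from the text:

```python
def lengthening_factor(current_r, desired_r):
    """Proportion by which to lengthen a test to reach a desired reliability (Lado, 1961)."""
    return desired_r * (1 - current_r) / (current_r * (1 - desired_r))

# Raising a reliability of .7 to .85:
print(round(lengthening_factor(0.7, 0.85), 2))  # 2.43
```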
Relationships with other measures
īļ The comparison of two measures of the same construct was key part of
standardised testing.
It has been assumed in Galton’s time that if two different measures were
highly correlated this provided evidence of validity (Pearson, 1920).
This aspect of external validity is criterion-related evidence, or evidence that
shows the scores on two measures are highly correlated, or that one test is
highly correlated with a criterion that is already known to be a valid measure
of its construct.
It is also sometimes called evidence for convergent validity.
Measurement
Up to here we have talked about measurement as understood in Classical Test Theory, which assumes that scores are normally distributed. The measurement theory described in this chapter is therefore central to understanding how all standardised tests are built, and how they work. Content analysis also played a role. Yerkes (1921: 355), for example, comments on the content of the linguality test in this way:

    A fairly accurate individual directions test was arranged, and a less accurate but usable group test, modelled after examination beta. The group test is too difficult at the start and is extended to an unnecessarily high level for purposes of calibration. The individual test is preferable, not only on the score of accuracy and level, but also because of its military nature.
- Deciding what to test is now seen as just as important as how to test it. However, we must acknowledge that the basic technology of language testing and assessment is drawn from measurement theory, which in turn models itself upon the measurement tools of the physical sciences.
Thanks for your attention
 
Adler clark 4e ppt 06
Adler clark 4e ppt 06Adler clark 4e ppt 06
Adler clark 4e ppt 06arpsychology
 
Measurment and scale
Measurment and scaleMeasurment and scale
Measurment and scalejas maan
 

Similar to Fulcher standardized testing (20)

Unit. 6.doc
Unit. 6.docUnit. 6.doc
Unit. 6.doc
 
validity and reliability
validity and reliabilityvalidity and reliability
validity and reliability
 
Validity and reliability
Validity and reliabilityValidity and reliability
Validity and reliability
 
Norms[1]
Norms[1]Norms[1]
Norms[1]
 
01 validity and its type
01 validity and its type01 validity and its type
01 validity and its type
 
01 validity and its type
01 validity and its type01 validity and its type
01 validity and its type
 
Reliability and validity
Reliability and validityReliability and validity
Reliability and validity
 
Rep
RepRep
Rep
 
Presentation1
Presentation1Presentation1
Presentation1
 
Indexes scales and typologies
Indexes scales and typologiesIndexes scales and typologies
Indexes scales and typologies
 
Running head ASSESSING A CLIENT .docx
Running head ASSESSING A CLIENT                                  .docxRunning head ASSESSING A CLIENT                                  .docx
Running head ASSESSING A CLIENT .docx
 
Chapter 8 compilation
Chapter 8 compilationChapter 8 compilation
Chapter 8 compilation
 
Advanced statistics for librarians
Advanced statistics for librariansAdvanced statistics for librarians
Advanced statistics for librarians
 
Session 2 2018
Session 2 2018Session 2 2018
Session 2 2018
 
The Developmental Coordination Disorder Questionnaire
The Developmental Coordination Disorder QuestionnaireThe Developmental Coordination Disorder Questionnaire
The Developmental Coordination Disorder Questionnaire
 
Business Research Methods Unit 3 notes
Business Research Methods Unit 3 notesBusiness Research Methods Unit 3 notes
Business Research Methods Unit 3 notes
 
Reasrch methodology for MBA
Reasrch methodology for MBAReasrch methodology for MBA
Reasrch methodology for MBA
 
Adler clark 4e ppt 06
Adler clark 4e ppt 06Adler clark 4e ppt 06
Adler clark 4e ppt 06
 
Quantitative analysis
Quantitative analysisQuantitative analysis
Quantitative analysis
 
Measurment and scale
Measurment and scaleMeasurment and scale
Measurment and scale
 

More from Melikarj

Presentation
PresentationPresentation
PresentationMelikarj
 
Presentation goffman
Presentation goffmanPresentation goffman
Presentation goffmanMelikarj
 
A CA of english and persian intonation melika rajabi
A CA of english and persian intonation melika rajabiA CA of english and persian intonation melika rajabi
A CA of english and persian intonation melika rajabiMelikarj
 
The role of materials dudly
The role of materials dudlyThe role of materials dudly
The role of materials dudlyMelikarj
 
Materials design hutchinson
Materials design hutchinsonMaterials design hutchinson
Materials design hutchinsonMelikarj
 
Discourse analysis
Discourse analysisDiscourse analysis
Discourse analysisMelikarj
 

More from Melikarj (6)

Presentation
PresentationPresentation
Presentation
 
Presentation goffman
Presentation goffmanPresentation goffman
Presentation goffman
 
A CA of english and persian intonation melika rajabi
A CA of english and persian intonation melika rajabiA CA of english and persian intonation melika rajabi
A CA of english and persian intonation melika rajabi
 
The role of materials dudly
The role of materials dudlyThe role of materials dudly
The role of materials dudly
 
Materials design hutchinson
Materials design hutchinsonMaterials design hutchinson
Materials design hutchinson
 
Discourse analysis
Discourse analysisDiscourse analysis
Discourse analysis
 

Recently uploaded

Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfUmakantAnnand
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsKarinaGenton
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docxPoojaSen20
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 

Recently uploaded (20)

Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.Compdf
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its Characteristics
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docx
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 

Fulcher standardized testing

  • 2. Two paradigms: norm-referenced testing is the normative approach in educational testing. Individuals are compared to each other: the meaning of a score on a test is derived from the position of an individual in relation to others. If the purpose of testing is to distribute scarce resources, such as university places, fairly, we need a test that separates out the test takers very effectively.
  • 3. Two paradigms: the primary requirement of such a test is that it should discriminate between test takers. The high-scoring test takers are offered places at the most prestigious institutions, while those on lower scores may not be so fortunate. The decision makers need the test takers to be 'spread out' over the range of test scores, and this spread is called the distribution. Norm-referenced testing has a long history, and its principles and assumptions still dominate the testing industry today.
  • 4. Two paradigms: criterion-referenced testing. The idea was first discussed in the 1960s. It informs testing and assessment related to instructional decisions much more than norm-referenced testing does. The purpose of a criterion-referenced test is to decide whether an individual test taker has achieved a pre-specified criterion, or standard, that is required for a particular decision context (as in mastery tests).
  • 5. Testing as science: the rise of testing began during the First World War. For the first time in history, industry was organised on a large and efficient scale in order to produce the materials necessary for military success. With it came the need to count, measure and quantify on a completely new scale. In an early study of labour efficiency, Greenwood (1919: 186) takes as his rationale the dictum of the natural scientist Kelvin: 'When you can measure what you are speaking about and express it in numbers, you know something about it, but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind.' The First World War thus saw the dramatic rise of large-scale testing: psychologists wished to contribute to the war effort and show that testing was a scientific discipline (Kevles, 1968).
  • 6. Testing as science: tests are about measuring knowledge, skills or abilities ('KSAs') and expressing their existence, or degree of presence, in numerical form. The assumption is that, once we are able to do this, we have 'genuine' knowledge.
  • 7. Testing as science: Shohamy (2001: 21) correctly identifies this as one of the key features of the 'power of tests': 'The language of science in Western societies grants authority, status and power. Testing is perceived as a scientific discipline because it is experimental, statistical and uses numbers. It is viewed as objective, fair, true …'
  • 8. Testing as science: when the Great War broke out, measurement and scientific progress went hand in hand. Measurement had in fact been seen as the most important development in scientific research since the early nineteenth century. Cattell (1893) said: 'The history of science is the history of measurement. Those departments of knowledge in which measurement could be used most readily were the first to become sciences, and those sciences are at the present time the furthest advanced in which measurement is the most extended and the most exact.'
  • 9. What did the early testers believe they could achieve that led to the explosion of testing theory and practice during the Great War? It is best explained with reference to a comment that Francis Galton was asked to write on one of Cattell's early papers (Cattell and Galton, 1890: 380): 'One of the most important objects of measurement … is to obtain a general knowledge of the capacities of a man by sinking shafts, as it were, at a few critical points. In order to ascertain the best points for the purpose, the sets of measures should be compared with an independent estimate of the man's powers. We thus may learn which of the measures are the most instructive.'
  • 10. Testing as science: Galton is suggesting that the use of tests is like drilling a hole into the test taker to discover what is inside. In language testing this implies a strong 'trait theory': what we think our test measures is a real, stable part of the test taker. This is the question of validity.
  • 11. Testing as science: in order to decide which tests are the best measures, we need to compare the results of the test with an independent estimate of whatever the test is designed to measure. This is the external aspect of validity: the value of our measurements can be judged by their relation with other measures of the same property.
  • 12. Testing as science: a norm-referenced test is a scientific tool designed to discriminate among test takers. In the past, the Army tests were used in this way: the higher scorers were considered for officer positions, and the testers genuinely believed that through their efforts the war would be more successfully prosecuted and victory achieved sooner. Today, test takers who get higher grades on modern international language tests have better chances of promotion, or can obtain places in the universities of their choice.
  • 13. Testing as science: a view of testing as 'scientific' has always been controversial. Lippmann (1922) argued that a strong trait theory is untenable. In fact, most of the traits or constructs that we work with are extremely difficult to define, and if we are not able to define them, measurement is even more problematic.
  • 14. What is in a curve?
  • 15. What is in a curve? It is possible for an illiterate enlisted man to get a score similar to that obtained by a low-scoring, or even a middle-scoring, officer; but it is highly unlikely. Equally, someone who really is officer material may get a score similar to that expected of a high-scoring illiterate. Once again, this is highly unlikely, but possible. If we look at the intersection between the officers' curve and the sergeants' curve, however, the possibility that an officer would get the same score as a sergeant, and vice versa, is much higher, and placement into an appropriate category becomes more difficult. This is the problem of setting cut scores on tests in order to make decisions.
  • 16. The curve and score meaning: in norm-referenced testing the meaning of a score is directly related to its place in the curve of the distribution from which it is drawn, because it tells us how an individual with this score relates to the rest of the test-taking population. Each individual is compared with all other individuals in the distribution.
  • 17. The curve and score meaning: around 68 per cent of all scores cluster closely around the mean, with approximately 34 per cent just above the mean and 34 per cent just below it. As we move away from the mean, the scores in the distribution become more extreme, and so less common. The curve of normal distribution tells us the probability that a test taker could have got the score they have, given the place of that score in a particular distribution.
  • 18. Putting it into practice: a test score is arrived at by giving the test takers a number of items or tasks to do. The responses to these items are scored, usually as correct (1) or incorrect (0). The number of correct responses for each individual is then added up to arrive at a total raw score.
  • 19. Putting it into practice. Mode: the most frequent score in a distribution (11 in the slide's example). Median: the score that falls in the middle of the distribution. Mean: calculated by adding all the scores together and dividing the total by the number of test takers.
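The three measures of central tendency can be computed directly. A minimal sketch with invented scores (the slide's own data are not reproduced here):

```python
from statistics import mean, median, mode

# Hypothetical raw scores for eleven test takers (invented for illustration)
scores = [8, 9, 10, 11, 11, 11, 12, 13, 14, 15, 18]

print(mode(scores))    # the most frequent score in the distribution -> 11
print(median(scores))  # the score in the middle of the ordered distribution -> 11
print(mean(scores))    # sum of all scores divided by the number of test takers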
  • 20. Putting it into practice: we can present these scores visually in histograms.
  • 21. Putting it into practice: the mean is the most important measure of the centre of the distribution of test scores. If we subtract the mean from each of the individual scores we get a deviation score, which shows how far an individual score is from the mean; it may be a negative or a positive number, and the mean of the deviation scores is always zero. The standard deviation is the square root of the sum of the squared deviation scores divided by N − 1.
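A minimal sketch of the deviation-score and standard-deviation calculation just described, again with invented scores:

```python
import math

scores = [8, 9, 10, 11, 11, 11, 12, 13, 14, 15, 18]  # hypothetical raw scores
m = sum(scores) / len(scores)

# Deviation scores: each raw score minus the mean; their sum is always zero
deviations = [x - m for x in scores]

# Standard deviation: square root of the sum of squared deviations over N - 1
sd = math.sqrt(sum(d * d for d in deviations) / (len(scores) - 1))
print(round(sd, 2))  # about 2.86 for these scores
```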
  • 22. Putting it into practice: a z-score is simply the raw score expressed in standard deviations. So, if your score on the test was 11.72 (the mean), your z-score would be zero; and if your score was 19.2 (one standard deviation above the mean), your z-score would be 1. There is a very straightforward formula for converting any raw score to a z-score: z = (X − mean) / SD, where X is the raw score.
  • 23. Test scores in a consumer age: the Gaokao's modern standardised test reports scores ranging from 100 to 900. The reported score bears little relationship to the actual number of items on the test; indeed, the number of items may vary from year to year, but the score meaning on the scale remains the same. The descriptive statistics for the standardised Gaokao examination are the same every year: mean = 500, standard deviation = 100.
  • 24. Test scores in a consumer age: the standardised score is the z-score multiplied by the new standard deviation, plus the new mean. The raw scores (mean 0, SD 1 once converted to z-scores) are reported as z × 100 + 500.
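Putting the two formulas together, a sketch of the raw-to-z and z-to-reported-scale conversions. The raw mean of 11.72 and the implied SD of 7.48 come from the slide's example (19.2 sits one SD above 11.72); the 500/100 reporting scale is the Gaokao one described above:

```python
def z_score(raw, mean, sd):
    # A z-score is the raw score expressed in standard deviations from the mean
    return (raw - mean) / sd

def reported(raw, mean, sd, new_mean=500, new_sd=100):
    # Standardised reporting scale: z times the new SD, plus the new mean
    return z_score(raw, mean, sd) * new_sd + new_mean

mean, sd = 11.72, 7.48
print(round(z_score(11.72, mean, sd), 2))  # 0.0
print(round(z_score(19.2, mean, sd), 2))   # 1.0
print(round(reported(19.2, mean, sd)))     # 600
```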
  • 25. If we look at item 29, we would probably all agree that it is the most difficult on the test. The comparison of native-speaker responses with those of the target non-native test-taking population has become a basic source of information on whether the test is a language test, rather than a test of other cognitive or non-linguistic abilities (as with the TOEFL test).
  • 26. Introducing reliability: the most prized quality of standardised language tests that are designed to implement meritocracy is reliability. The classic definition is by Lado (1961: 31): 'Does a test yield the same scores one day and the next if there has been no instruction intervening? That is, does the test yield dependable scores in the sense that they will not fluctuate very much, so that we may know that the score obtained by a student is pretty close to the score he would obtain if we gave the test again? If it does, the test is reliable.' Lado specifically separated the quality of reliability from scorability, the ease with which a test item or task can be scored. A set of multiple-choice items may be easily scored, yet still produce unreliable scores as defined by Lado.
  • 27. Sources of unreliability according to Lado. Lado assumed that reliability is about the lack of fluctuation, or consistency: if no learning has taken place between two administrations of a test, there should be no difference in scores. 1. The first source of unreliability arises if the test is held in two different places, or under slightly different conditions (such as a different room, or with a different invigilator), and the score changes as a direct result. 2. The test itself is the second source of unreliability: here Lado pointed to problems with sampling what language to test, since we cannot test everything in a single test.
  • 28. Introducing reliability: in standardised tests, any group of items from which responses are added together to create a single score is assumed to test the same ability, skill or knowledge. This is item homogeneity; if a test consists of items that test very different things, reliability is reduced. Finally, unreliability can be caused by the scoring. If humans are scoring multiple-choice items they may become fatigued and make mistakes, or transfer marks inaccurately from scripts to computer records. There is even more room for variation when humans are asked to make judgements about the quality of a piece of writing or a sample of speech, and give it a rating.
  • 29. Why testing was viewed as a 'science': tests, like scientific instruments, provide the means by which we can observe and measure consistencies in human ability. Any observed score on our tests is therefore assumed to be a composite, explained by the formula X = T + E, where X is the observed score, T is the 'true' score of an individual's ability on what the test measures, and E is the error, which can come from a variety of sources like those identified by Lado.
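The X = T + E decomposition can be illustrated with a quick simulation (all numbers invented): because true score and error are independent, the variance of observed scores is approximately the sum of true-score variance and error variance.

```python
import random
from statistics import pvariance

random.seed(42)
true_scores = [random.gauss(50, 10) for _ in range(10_000)]  # T: stable ability
errors = [random.gauss(0, 4) for _ in true_scores]           # E: random error
observed = [t + e for t, e in zip(true_scores, errors)]      # X = T + E

# Observed variance should come out close to 10**2 + 4**2 = 116
print(round(pvariance(observed)))
```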
  • 30. Calculating reliability: how we calculate it depends upon what kind of error we wish to focus on. A reliability coefficient is calculated that ranges from 0 to 1; no test is 'perfectly' reliable, as there is always error. The three areas identified by Lado are test administrations, the test itself, and marking or rating.
  • 31. Test administration: the correlation coefficient is a measure of the strength of the relation between two interval-level variables, such as test scores. The full name of this statistic is the Pearson product-moment correlation. The strength of the relation between two sets of scores can easily be quantified on a scale from −1 (there is an inverse relationship between the scores: as one goes up, the other comes down), through 0 (there is no relation between the two sets of scores), to 1 (the scores are exactly the same on both administrations of the test). The closer the result is to 1, the more test–retest reliability we have.
  • 32. A scatter plot of the scores from the same test given at two different times plots the score that each student got at administrations 1 and 2. We can see visually that there is a strong positive relationship between the two sets of scores: it is highly likely that a student who scored highly on one administration will score highly on the other, but there will be some fluctuation.
  • 33. The formula for the correlation between the two sets of raw scores is r = Σ(X − X̄)(Y − Ȳ) / √[Σ(X − X̄)² × Σ(Y − Ȳ)²].
  • 34. In order to interpret the correlation coefficient we square the result: .87² ≈ .76. This number (r²) can be interpreted as the percentage of variance shared by the two sets of scores, or the degree to which they vary together (as the score on one test increases, the score on the other increases proportionally). In our case, 76 per cent of the variance is shared. The shared variance is represented by the shaded area of overlap between the two boxes, while the white area of each box represents variance that is unique to each administration.
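A sketch of the Pearson product-moment calculation and the r² interpretation, using two invented sets of administration scores (the .87 in the slides comes from the book's own data, not these numbers):

```python
def pearson(x, y):
    # Pearson product-moment correlation between two sets of scores
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical scores from two administrations of the same test
admin1 = [12, 15, 9, 20, 17, 11, 14, 18, 10, 16]
admin2 = [13, 14, 10, 19, 18, 10, 15, 17, 11, 15]

r = pearson(admin1, admin2)
print(round(r, 2), round(r ** 2, 2))  # r, then the shared variance r-squared
```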
  • 35. The test itself: the items must be homogeneous; in technical terms, they must all be highly correlated. There are two ways of addressing reliability in terms of item homogeneity. 1. The split-half method: after a test has been administered, the first task is to split it into two equal halves. This might be done by placing item 1 in test A, item 2 in test B, item 3 in test A, item 4 in test B, and so on. We then have two tests, each half the length of the original, and we calculate the correlation between the two halves in exactly the same way as before. The resulting coefficient is the reliability of a test half the length of the one actually given. Reliability is directly related to the length of a test: the longer a test, the more reliable it becomes. To estimate the reliability of the full-length test we use the Spearman–Brown correction formula, r = 2r_hh / (1 + r_hh), where r_hh is the correlation between the two halves of the test.
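The split-half procedure and the Spearman–Brown correction sketched in code, with an invented 0/1 response matrix:

```python
def pearson(x, y):
    # Pearson product-moment correlation between two sets of scores
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical item responses (1 = correct): rows are test takers, columns items
responses = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

# Alternate items into two half-tests (item 1 in A, item 2 in B, and so on)
half_a = [sum(row[0::2]) for row in responses]
half_b = [sum(row[1::2]) for row in responses]

r_hh = pearson(half_a, half_b)  # reliability of a half-length test
r_full = 2 * r_hh / (1 + r_hh)  # Spearman-Brown corrected full-test estimate
print(round(r_hh, 2), round(r_full, 2))
```

Note that the corrected value is always higher than the half-test correlation, reflecting the point above that longer tests are more reliable.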
  • 36. 2. The most frequently used and reported reliability coefficient is Cronbach's alpha. The formula for dichotomously scored items (scored 'right' or 'wrong') is α = (k / (k − 1)) × (1 − (sum of item variances) / S²), where k is the number of items on the test and S² is the test score variance, which is the square of the standard deviation. We know for the linguality test that k = 29 and S² = 56.25.
  • 37. Marking or rating: the scoring of closed-response items like multiple choice is much easier than open-response items, because there is only one correct response. Rating is much more complex, because there is an assumption that whichever rater makes the judgement should be a matter of indifference to the test taker. If there is variation by rater, this is considered a source of unreliability, or error. If we have two raters and we need to discover their inter-rater reliability, we can apply the alpha formula with raters in place of items, using the number of raters and the variance of their scores, where R1 and R2 merely stand for rater 1 and rater 2.
  • 38. Living with uncertainty: while the reliability coefficient tells us how much error there might be in the measurement, the standard error of measurement tells us what this might mean for a specific observed score. The formula for the standard error is SEM = SD × √(1 − r), where SD is the standard deviation and r is the reliability coefficient. We use the standard error to calculate a confidence interval around an observed test score, which tells us by how much the true score may lie above or below the observed score that the test taker has actually got on our test.
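A sketch of the standard error and the confidence interval it yields. The SD of 7.5 is the square root of the S² = 56.25 given earlier; the reliability of .85 and the observed score of 24 are assumed for illustration:

```python
import math

def sem(sd, reliability):
    # Standard error of measurement: SD times the square root of (1 - r)
    return sd * math.sqrt(1 - reliability)

s = sem(7.5, 0.85)
observed = 24
# A 95% confidence interval for the true score: observed score +/- 1.96 * SEM
low, high = observed - 1.96 * s, observed + 1.96 * s
print(round(s, 2), round(low, 1), round(high, 1))  # 2.9 18.3 29.7
```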
  • 39. Reliability and test length: the reliability of a test is determined mostly by the quality of the items and by the length of the test. In standardised tests with many items, each item provides a piece of information about the ability of the test taker. If you were to increase the value of k in the α formula you would see that reliability steadily increases: the more independent pieces of information we collect, the more reliable the measurement becomes. The response to any specific item must therefore be independent of the response to any other item; the technical term for this is the stochastic independence of items.
  • 40. Lado (1961: 339) provides a formula for the relationship between reliability and test length: A = (r_desired × (1 − r_current)) / (r_current × (1 − r_desired)), where A is the proportion by which you would have to lengthen the test to get the desired reliability, r_desired is the desired reliability and r_current is the reliability of the current test. Imagine a test with a reliability of .7 that you wish to raise to .85: A = (.85 × .3) / (.7 × .15) = .255 / .105 ≈ 2.43, so the test would need to be about two and a half times its current length.
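Lado's lengthening formula as a one-line function, reproducing the .7 to .85 illustration:

```python
def lengthen_factor(r_current, r_desired):
    # Proportion by which the test must be lengthened to reach r_desired
    return (r_desired * (1 - r_current)) / (r_current * (1 - r_desired))

print(round(lengthen_factor(0.7, 0.85), 2))  # 2.43: about 2.5 times longer
```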
  • 41. Relationships with other measures: the comparison of two measures of the same construct was a key part of standardised testing. It was assumed in Galton's time that if two different measures were highly correlated, this provided evidence of validity (Pearson, 1920). This aspect of external validity is criterion-related evidence: evidence that the scores on two measures are highly correlated, or that one test is highly correlated with a criterion that is already known to be a valid measure of its construct. It is also sometimes called evidence for convergent validity.
  • 42. Measurement: up to this point we have discussed measurement as understood in Classical Test Theory. All of this test theory assumes that scores are normally distributed. The measurement theory described in this chapter is therefore central to understanding how all standardised tests are built, and how they work. Content analysis also played a role. Yerkes (1921: 355), for example, comments on the content of the linguality test in this way: 'A fairly accurate individual directions test was arranged, and a less accurate but usable group test, modelled after examination beta. The group test is too difficult at the start and is extended to an unnecessarily high level for purposes of calibration. The individual test is preferable, not only on the score of accuracy and level, but also because of its military nature.'
  • 43. Measurement: deciding what to test is now seen as just as important as deciding how to test it. However, we must acknowledge that the basic technology of language testing and assessment is drawn from measurement theory, which in turn models itself upon the measurement tools of the physical sciences.
  • 44. Thanks for your attention