RELIABILITY
Imagine that a group of students took the same test twice, once on a Wednesday and again on the following Thursday. We would not expect every individual to get precisely the same score
on the Wednesday as they got on the Thursday. Human beings are not like that;
they simply do not behave in exactly the same way on every occasion, even 
when the circumstances seem identical. But if this is the case, it implies that we 
can never have complete trust in any set of test scores. We know that the 
scores would have been different if the test had been administered on the 
previous or the following day. This is inevitable, and we must accept it. What we 
have to do is construct, administer and score tests in such a way that the scores 
actually obtained on a test on a particular occasion are likely to be very similar 
to those which would have been obtained if it had been administered to the 
same students with the same ability, but at a different time. The more similar the 
scores would have been, the more reliable the test is said to be. 
The reliability coefficient 
 Reliability coefficients are like validity coefficients 
(chapter 4). They allow us to compare the reliability of 
different tests. The ideal reliability coefficient is 1. A 
test with a reliability coefficient of 1 is one which would 
give precisely the same results for a particular set of 
candidates regardless of when it happened to be
administered. A test which had a reliability coefficient of
zero (and let us hope that no such test exists!) would 
give sets of results quite unconnected with each other, 
in the sense that the score that someone actually got 
on a Wednesday would be no help at all in attempting 
to predict the score he or she would get if they took 
the test the day after. It is between the two extremes 
of 1 and zero that genuine test reliability coefficients
are to be found.
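One common way of estimating such a coefficient is the test-retest method: administer the same test twice to the same candidates and correlate the two sets of scores. The sketch below (in Python, with invented scores) illustrates the idea; it is only one of several ways of estimating reliability, not a procedure prescribed by the text.

```python
# A minimal sketch, not from the text above: estimating a reliability
# coefficient by the test-retest method, i.e. the correlation between two
# administrations of the same test. All scores are invented for illustration.
from statistics import correlation  # requires Python 3.10+

wednesday_scores = [42, 55, 61, 48, 70, 39, 66, 58]
thursday_scores = [44, 53, 63, 47, 69, 41, 64, 60]

r = correlation(wednesday_scores, thursday_scores)
print(f"Estimated reliability coefficient: {r:.2f}")  # a value near 1 means highly reliable
```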
 Certain authors have suggested how high a reliability 
coefficient we should expect for different types of language 
tests. Lado (1961), for example, says that good vocabulary, 
structure and reading tests are usually in the .90 to .99 range, 
while auditory comprehension tests are more often in the .80
to .89 range. Oral production tests may be in the .70 to .79 
range. He adds that a reliability coefficient of .85 might be 
considered high for an oral production test but low for a
reading test. These suggestions reflect what Lado sees as 
the difficulty in achieving reliability in the testing of the 
different abilities. In fact, the reliability coefficient that is to be
sought will depend also on other considerations, most
particularly the importance of the decisions taken on the basis
of the test. The more important the decisions, the greater the
reliability we must demand: if we are to refuse someone the
opportunity to study overseas because of their score on a
language test, then we have to be pretty sure that their score
would not have been much different if they had taken the test
a day or two earlier or later.
The standard error of measurement and the 
true score 
 While the reliability coefficient allows us to compare the reliability of 
tests, it does not tell us directly how close an individual’s actual score is 
to what he or she might have scored on another occasion. With a little 
further calculation, however, it is possible to estimate how close a 
person’s actual score is to what is called their ‘true score’. Imagine that 
it were possible for someone to take the same language test over and 
over again, an indefinitely large number of times, without their
performance being affected by having already taken the test, and 
without their ability in the language changing. Unless the test is 
perfectly reliable, and provided that it is not so easy or difficult that the 
student always gets full marks or zero, we would expect their scores on 
the various administrations to vary. If we had all of these scores we 
would be able to calculate their average score, and it would seem not 
unreasonable to think of this average as the one that best represents 
the student’s ability with respect to this particular test. It is this score, 
which for obvious reasons we can never know for certain, which is 
referred to as the candidate’s true score.
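The sketch below simply simulates this thought experiment. The candidate's 'true score' of 60 and the spread of the measurement error are invented figures; the point is that the average over a very large number of hypothetical administrations settles on the true score.

```python
# Illustrative simulation only: a candidate with an assumed true score of 60
# takes the test many times, each time with some random measurement error.
# The mean of the observed scores approaches the true score.
import random

random.seed(1)
true_score = 60
error_spread = 5  # assumed standard deviation of the measurement error

observed = [random.gauss(true_score, error_spread) for _ in range(10_000)]
average = sum(observed) / len(observed)
print(f"Average over {len(observed)} administrations: {average:.1f}")  # close to 60
```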
The standard error of measurement is an estimate, derived from the reliability coefficient and the spread of scores, of how far observed scores are likely to lie from the true score. Suppose that a test has a standard error of measurement of 5. An individual scores 56 on that test. We are then in a position to make the following statements.
• We can be about 68 per cent certain that the person's true score lies in the range 51-61 (i.e. within one standard error of measurement of the score actually obtained on this occasion).
• We can be about 95 per cent certain that their true score is in the range 46-66 (i.e. within two standard errors of measurement of the score actually obtained).
• We can be 99.7 per cent certain that their true score is in the range 41-71 (i.e. within three standard errors of measurement of the score actually obtained).
These statements are based on what is known about the pattern of scores that would occur if it were in fact possible for someone to take the test repeatedly in the way described above. About 68 per cent of their scores would be within one standard error of measurement of the true score, and so on. If in fact they only take the test once, we cannot be sure how their score on that occasion relates to their true score, but we are still able to make probabilistic statements as above.
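A minimal sketch of the calculations behind these statements, using the figures from the example (an observed score of 56 and a standard error of measurement of 5):

```python
# Reproduces the 68% / 95% / 99.7% bands quoted above.
observed_score = 56
sem = 5  # standard error of measurement

for k, certainty in [(1, "68"), (2, "95"), (3, "99.7")]:
    low, high = observed_score - k * sem, observed_score + k * sem
    print(f"About {certainty} per cent certain the true score lies in {low}-{high}")
```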
Scorer reliability
In the first example given in this chapter we spoke about scores on a multiple choice test. It was most unlikely, we thought, that every candidate would get precisely the same score on both of two possible administrations of the test. We assumed, however, that the scoring of the test would be 'perfect'. That is, if a particular candidate did perform in exactly the same way on the two occasions, they would be given the same score on both occasions. In other words, any one scorer would give the same score on the two occasions, and this would be the same score as would be given by any other scorer on either occasion.
How to make tests more reliable 
 As we have seen, there are two components 
of test reliability: the performance of 
candidates from occasion to occasion, and 
the reliability of the scoring. We will begin by 
suggesting ways of achieving consistent 
performances from candidates and then turn
our attention to scorer reliability.
Exclude items which do not discriminate well 
between weaker and stronger students 
 Items on which strong students and weak 
students perform with similar degrees of 
success contribute little to the reliability of 
a test. Statistical analysis of items 
(Appendix 1) will reveal which items do not 
discriminate well.
 These are likely to include items which are too 
easy or too difficult for the candidates, but not 
only such items. A small number of easy, non-discriminating 
items may be kept at the 
beginning of a test to give candidates 
confidence and reduce the stress they feel.
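As a rough illustration only (this is not the procedure of Appendix 1, just one simple discrimination index), the sketch below compares how the top third and the bottom third of candidates, ranked by total test score, performed on a single item. An index near zero means the item does not discriminate; the responses shown are invented.

```python
# Simple discrimination index: proportion correct in the top third minus
# proportion correct in the bottom third of candidates.
def discrimination_index(item_correct_ranked):
    """item_correct_ranked: 1/0 correctness on one item, ordered from the
    strongest to the weakest candidate by total test score."""
    third = len(item_correct_ranked) // 3
    top = item_correct_ranked[:third]
    bottom = item_correct_ranked[-third:]
    return sum(top) / third - sum(bottom) / third

# Invented responses for one item, strongest candidates first.
responses = [1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
print(f"Discrimination index: {discrimination_index(responses):.2f}")  # 0.60
```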
Do not allow candidates too much 
freedom 
 In some kinds of language test there is a 
tendency to offer candidates a choice of 
questions and then to allow them a great deal 
of freedom in the way that they answer the 
ones that they have chosen. An example
would be a test of writing where the 
candidates are simply given a selection of 
titles from which to choose. Such a procedure 
is likely to have a depressing effect on the 
reliability of the test.
 The more freedom that is given, the greater is 
likely to be the difference between the 
performance actually elicited and the 
performance that would have been elicited had 
the test been taken, say, a day later. In general, 
therefore, candidates should not be given a 
choice, and the range over which possible 
answers might vary should be restricted. 
Compare the following writing tasks:
 Write a composition on tourism. 
 Write a composition on tourism in this country. 
 Write a composition on how we might 
develop the tourist industry in this country.
 Discuss the following measures intended to 
increase the number of foreign tourists coming 
to this country:
 More/better advertising and/or information 
(Where? What form should it take?)
 Improve facilities (hotels, transportation, 
communication, etc.)
 Training of personnel (guides, hotel managers, 
etc.)
Provide clear and explicit 
instructions 
 This applies both to written and oral 
instructions. If it is possible for candidates to 
misinterpret what they are asked to do, then 
on some occasions some of them certainly 
will. It is by no means always the weakest 
candidates who are misled by ambiguous 
instructions; indeed, it is often the better
candidate who is able to provide the 
alternative interpretation.
 A common fault of tests written for the students 
of a particular teaching institution is the 
supposition that the students all know what is 
intended by carelessly worded instructions. 
The frequency of the complaint that students
are unintelligent or have wilfully misunderstood
what they were asked to do reveals that the
supposition is often unwarranted.
Test writers should not rely on the students'
powers of telepathy to elicit the desired
behaviour. Again, the use of colleagues to 
criticise drafts of instructions (including those 
which will be spoken) is the best means of 
avoiding problems. Spoken instructions should 
always be read from a prepared text in order to 
avoid introducing confusion.
Ensure that tests are well laid out and perfectly 
legible 
 Too often, institutional tests are badly typed 
(or handwritten), have too much text in too 
small a space, and are poorly reproduced. As 
a result, students are faced with additional 
tasks which are not ones meant to measure 
their language ability. Their variable 
performance on the unwanted tasks will lower 
the reliability of a test.
Make candidates familiar with format and 
testing techniques 
 If any aspect of a test is unfamiliar to 
candidates, they are likely to perform less well 
than they would do otherwise (on 
subsequently taking a parallel version, for 
example). For this reason, every effort must be
made to ensure that all candidates have the 
opportunity to learn just what will be required 
of them. This may mean the distribution of 
sample tests (or of past test papers) or at 
least the provision of practice materials in the 
case of tests set within teaching institutions.
Provide uniform and non-distracting 
conditions of administration 
 The greater the differences between one 
administration of a test and another, the greater 
the differences one can expect between a 
candidate’s performance on the two occasions. 
Great care should be taken to ensure uniformity. 
For example, timing should be specified and 
strictly adhered to; the acoustic conditions should 
be similar for all administrations of a listening test. 
Every precaution should be taken to maintain a 
quiet setting with no distracting sounds or 
movements. 
 We turn now to ways of obtaining scorer reliability, 
which, as we saw above, is essential to test 
reliability.
Make comparisons between candidates as 
direct as possible 
 This reinforces the suggestion already made 
that candidates should not be given a choice 
of items and that they should be limited in the 
way that they are allowed to respond. Scoring 
the compositions all on one topic will be more
reliable than if the candidates are allowed to 
choose from six topics, as has been the case 
in some well-known tests. The scoring should 
be all the more reliable if the compositions are
guided, as in the example above in the
section 'Do not allow candidates too much
freedom'.
Provide a detailed scoring key 
 This should specify acceptable answers and 
assign points for partially correct
responses. For high scorer reliability the key
should be as detailed as possible in its 
assignment of points. It should be the outcome 
of efforts to anticipate all possible responses 
and have been subjected to group criticism. 
(This advice applies only where responses can
be classed as partially or totally ‘correct’, not in 
the case of compositions, for instance.)
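By way of illustration only, a detailed key for short-answer items might be represented as below; the items, acceptable answers and point values are all invented, and partially correct responses earn fewer points.

```python
# Hypothetical scoring key: for each item, acceptable responses and the
# points they earn. Unanticipated responses score zero, which is why the
# key should be trialled and subjected to group criticism beforehand.
scoring_key = {
    "item_1": {"went": 2, "gone": 1},                      # full vs. partial credit
    "item_2": {"because of the rain": 2, "the rain": 1},
}

def score_response(item, response):
    return scoring_key.get(item, {}).get(response.strip().lower(), 0)

print(score_response("item_1", "gone"))   # 1
print(score_response("item_2", "snow"))   # 0
```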
Train scorers 
 This is especially important where scoring 
is most subjective. The scoring of 
compositions, for example, should not be 
assigned to anyone who has not learned to 
score compositions from past administrations
accurately. After each administration,
patterns of scoring should be analysed. 
Individuals whose scoring deviates 
markedly and inconsistently from the norm 
should not be used again.
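A sketch of the kind of post-administration analysis this implies, with invented marks: each scorer's average deviation from the mean mark awarded to the same scripts shows who is drifting markedly from the norm.

```python
# Each scorer has marked the same four scripts; compare each scorer's marks
# with the mean mark per script to spot unusually severe or generous scoring.
scores_by_scorer = {
    "scorer_A": [12, 15, 9, 14],
    "scorer_B": [13, 14, 10, 15],
    "scorer_C": [18, 20, 16, 19],  # consistently more generous than the others
}

n_scripts = len(next(iter(scores_by_scorer.values())))
script_means = [
    sum(marks[i] for marks in scores_by_scorer.values()) / len(scores_by_scorer)
    for i in range(n_scripts)
]

for scorer, marks in scores_by_scorer.items():
    mean_deviation = sum(m - avg for m, avg in zip(marks, script_means)) / n_scripts
    print(f"{scorer}: average deviation from the norm = {mean_deviation:+.2f}")
```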
Identify candidates by number, not name 
 Scorers inevitably have expectations of 
candidates that they know. Except in purely 
objective testing, this will affect the way that 
they score. Studies have shown that even
where the candidates are unknown to the 
scorers, the name on a script (or a 
photograph) will make a significant difference 
to the scores given. For example, a scorer 
may be influenced by the gender or nationality 
of a name into making predictions which can 
affect the score given. The identification of 
candidates only by number will reduce such 
effects.
Employ multiple, independent 
scoring 
 As a general rule, and certainly where testing 
is subjective, all scripts should be scored by at 
least two independent scorers. Neither scorer 
should know how the other has scored a test 
paper. Scores should be recorded on separate 
score sheets and passed to a third, senior, 
colleague, who compares the two sets of 
scores and investigates discrepancies.
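As a sketch of how that comparison might be organised (candidate numbers, marks and the tolerance are all invented):

```python
# The senior colleague compares the two independent sets of marks and flags
# any script where the scorers disagree by more than an agreed tolerance.
TOLERANCE = 2  # assumed maximum acceptable difference in marks

scorer_1 = {"cand_001": 14, "cand_002": 11, "cand_003": 17}
scorer_2 = {"cand_001": 15, "cand_002": 16, "cand_003": 17}

for candidate, mark_1 in scorer_1.items():
    difference = abs(mark_1 - scorer_2[candidate])
    if difference > TOLERANCE:
        print(f"{candidate}: discrepancy of {difference} marks - investigate")
```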
THE END
