Types of tests: proficiency, achievement, diagnostic, placement
Types of testing: direct vs indirect tests, discrete point vs integrative tests, criterion-referenced vs norm-referenced tests, objective vs subjective tests
Kinds of tests and testing
Proficiency tests, achievement tests, diagnostic tests, placement tests; direct and indirect testing, discrete-point and integrative testing, norm-referenced and criterion-referenced testing, objective and subjective testing, computer-adaptive testing
For the presentation transcription which contains more information, click here:
http://www.4shared.com/file/bLzJpPYqce/presentation_transcription__2_.html
PowerPoint based on the article "Testing for Language Teachers" (Arthur Hughes), pages 83 to 112 (Chapter 9: Testing writing). This work was done by Idoia Argudo and Marta Ribas for a subject at Universidad de Cantabria.
A Brief History on the Approaches to
Language Testing
In the 1950s, an era of behaviorism and special attention to contrastive analysis, testing focused on specific language elements such as the phonological, grammatical, and lexical contrasts between two languages.
Between the 1970s and 1980s, communicative theories of language brought with them a more integrative view of testing, in which specialists claimed that the whole of the communicative event was considerably greater than the sum of its linguistic elements (Clark, 1983; Brown, 2004: 8).
Definition of Language Testing
According to Oller (1979: 1-2), a language test is a device that tries to assess how much has been learned in a foreign language course, or some part of a course, by learners.
According to Brown (2004: 3), a language test is a method of measuring a person's ability, knowledge, or performance in a given domain.
Testing is an important part of English Language Teaching. It helps teachers to construct effective tests and to take the testing system to new heights.
2. TEACHING AND TESTING
BACKWASH
The effect of testing on teaching and learning
If the test is important, it can dominate all teaching and learning activities
It can be harmful or beneficial
Harmful backwash: the test content and testing techniques are at variance with the objectives of the course
Beneficial backwash: the test content and techniques are consistent with the objectives of the course, so testing has an immediate positive effect on teaching
3. All measures of mental ability are necessarily
indirect,
incomplete,
imprecise,
subjective,
relative
To minimize the effects of these limitations:
A. Provide clear theoretical definitions of the abilities we want to measure;
B. Specify precisely the conditions, or operations, that we will follow in eliciting and observing performance;
C. Quantify the observations so as to ensure that our measurement scales have the properties we require.
4. GENERAL TYPES OF TESTS
Proficiency tests
Achievement tests
Diagnostic tests
Placement tests
Selection Tests
Competition tests
Aptitude tests
Language aptitude tests
Vocational aptitude tests
KINDS OF TESTS AND TESTING
5. PROFICIENCY TESTS
measure language ability regardless of any previous training
are not based on the content or objectives of language courses
are based on a specification of what candidates must be able to do to be considered proficient
Proficiency: having sufficient command of the language
used for a particular purpose, such as:
a translator in the United Nations
a student seeking admission to American or British universities
used for a general purpose, such as:
general proficiency tests
FCE, CPE, TOEFL, IELTS
6. ACHIEVEMENT TESTS
are directly related to language courses
Used to determine whether students have achieved
the objectives of the course or not.
Kinds of Achievement tests
1. Final achievement tests
2. Progress achievement tests
7. Final achievement tests
are administered at the end of a course of study.
Their contents are related to the course concerned.
Syllabus-content approach:
Should the test be based directly on a detailed course syllabus?
Disadvantage: if the syllabus is badly designed, the results of the test could be misleading
Course-objectives approach:
Should the test be based on course objectives?
Advantages:
compelling course designers to be explicit about course objectives
making it possible for the test to show how far objectives have been achieved
compelling course designers to choose a syllabus which is consistent with the course objectives
working against poor teaching practice
promoting a more beneficial backwash effect
8. PROGRESS ACHIEVEMENT TESTS
measure the progress of students; one way to measure progress is to administer the final achievement test repeatedly
Disadvantage:
the low scores in the early stages are discouraging
The alternative is to establish a series of well-defined short-term objectives. These should make a clear progression towards the final achievement test based on course objectives
Pop quizzes
make a rough check on students' progress
keep students on their toes
9. DIAGNOSTIC TESTS
• are used to identify learners' strengths and weaknesses
• are intended to ascertain what learning still needs to take place
• can tell us that someone is particularly weak in, say, speaking as opposed to reading in a language
• Proficiency tests may prove adequate for this purpose
• Teachers may even need to analyze samples of a person's performance in writing or speaking in order to create profiles of the student's ability in certain categories
10. PLACEMENT TESTS
are intended to place students at the stage of the teaching program most appropriate to their abilities
are used to assign students to classes at different levels
are constructed for particular situations
depend on the identification of the key features at different levels of teaching
APTITUDE TESTS
indicate an individual's facility for acquiring specific skills and learning
are used to measure aptitude for learning and to predict future performance
11. DIRECT VS. INDIRECT TESTING
Direct tests require the candidate to perform precisely the skill that we wish to measure
If we want to know how well candidates can write compositions, we get them to write compositions.
If we want to know how well they pronounce a language, we ask them to speak.
The tasks, and the texts used, should be as authentic as possible.
Direct testing is easier to carry out for measuring the productive skills.
12. Attractions of direct testing
1. It is straightforward to create the conditions that elicit the required behaviors
2. Assessment and interpretation are straightforward
3. It produces a helpful backwash effect
SEMI-DIRECT TESTING
e.g. speaking tests where candidates respond to tape-recorded stimuli, their own responses being recorded and later scored
13. INDIRECT TESTING
measures the abilities that underlie the skills tested.
EXAMPLE: one section of the TOEFL serves as an indirect measure of writing ability; candidates identify the faulty part of items such as:
At first the old woman seemed unwilling to accept anything that was offered her by my friend and I.
The main appeal of indirect testing:
testing a large number of elements in one test
giving it to a large number of students
scoring it objectively
The main problem of indirect testing:
the relationship between performance on the test and actual performance of the skills being tested is weak in strength and uncertain in nature
14. DISCRETE POINT VS. INTEGRATIVE TESTING
A. DISCRETE-POINT TESTING
refers to the testing of one element at a time, item by item.
might take the form of a series of items, each testing a particular grammatical structure.
is a testing approach which cuts language skills and components up into smaller parts and then tests them one by one.
is an atomistic approach to language teaching and learning.
15. B. INTEGRATIVE TESTING
requires the candidate to combine many language elements in the completion of a task, e.g.:
writing a composition
taking notes while listening to a lecture
taking a dictation
completing a cloze passage
Unlike discrete-point tests, integrative tests tend to be direct, although some integrative methods, such as the cloze procedure, are indirect.
Diagnostic tests of grammar tend to be discrete-point.
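A fixed-ratio cloze passage of the kind mentioned above can be generated mechanically. The sketch below is a minimal illustration in Python: the sample passage, the deletion ratio, and the function name are assumptions, and real cloze tests often delete words on a rational rather than fixed basis.

```python
def make_cloze(text, n=5, lead_in=2):
    """Fixed-ratio cloze: leave the first `lead_in` words intact, then
    replace every nth word with a gap. Returns the gapped passage and
    the answer key (the deleted words, in order)."""
    words = text.split()
    answers = []
    for i in range(lead_in + n - 1, len(words), n):
        answers.append(words[i])
        words[i] = "_____"
    return " ".join(words), answers

passage = ("Testing is an important part of language teaching because it "
           "shows teachers how far the objectives of a course have been achieved")
gapped, key = make_cloze(passage, n=5)
# every 5th word after the first two becomes a gap; `key` holds the answers
```

Candidates fill each gap from context, which is why the cloze procedure counts as integrative even though it measures writing and reading only indirectly.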
16. NORM-REFERENCED VS. CRITERION-REFERENCED TESTING
A. NORM-REFERENCED TESTING (NRT)
relates one candidate's performance to that of other candidates.
We are not told directly what the student is capable of doing in the language.
B. CRITERION-REFERENCED TESTING (CRT)
provides direct information about what a candidate can actually do in the language.
17. OBJECTIVE VS. SUBJECTIVE TESTING
A. OBJECTIVE TESTING
No judgment is required on the part of the scorer (multiple-choice tests)
B. SUBJECTIVE TESTING
Judgment is called for on the part of the scorer (compositions)
There are different degrees of subjectivity in testing.
Scoring compositions is more subjective than scoring short-answer items.
Objectivity in scoring brings greater reliability to testing.
Scoring rubrics can increase the reliability of subjective tests such as compositions.
18. COMPUTER ADAPTIVE TESTING
an efficient way of collecting information on testees' ability: there is no real need for strong candidates to attempt easy items, and no need for weak candidates to attempt difficult items.
Items of average difficulty are presented initially.
Those who respond correctly are presented with a more difficult item.
Those who respond incorrectly are presented with an easier item.
The computer adapts the items to the testee's level.
Oral interviews are typically a form of adaptive testing.
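The adaptive procedure described on this slide can be sketched as a simple loop. This is a deliberately minimal Python illustration: the three-level item bank, the fixed test length, and the up-one/down-one rule are all assumptions, and operational adaptive tests estimate ability with item response theory rather than a ladder of difficulty levels.

```python
import random

def adaptive_test(item_bank, answers_correctly, n_items=5):
    """Minimal computer-adaptive loop: start at average difficulty,
    move up a level after a correct response, down after an incorrect one.
    `item_bank` maps difficulty (0 = easy, 1 = average, 2 = hard) to items;
    `answers_correctly(item)` simulates the candidate's response."""
    level = 1  # initial item of average difficulty
    log = []
    for _ in range(n_items):
        item = random.choice(item_bank[level])
        correct = answers_correctly(item)
        log.append((item, level, correct))
        level = min(level + 1, 2) if correct else max(level - 1, 0)
    return log

bank = {0: ["easy-1"], 1: ["avg-1"], 2: ["hard-1"]}
# a strong candidate (always correct) climbs to and stays at the hardest level
strong_run = adaptive_test(bank, lambda item: True, n_items=4)
```

The same mechanism is what a live interviewer does intuitively when pitching questions at the candidate's apparent level, which is why oral interviews count as adaptive testing.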
19. COMMUNICATIVE LANGUAGE TESTING
measures the ability to take part in acts of communication, including reading and listening
It is assumed that it is usually communicative ability that we want to test.
20. VALIDITY: Definition
A test is valid if it measures accurately what it is intended
to measure
Types of validity
Construct
Content
Criterion-related
Face
CHAPTER 4 : VALIDITY
22. Construct Validity
the degree to which a test measures what it claims, or purports, to be measuring
Construct: an attribute, ability, or skill that happens in the human brain and is defined by established theories.
Intelligence, motivation, anxiety, proficiency, and fear are all examples of constructs.
They exist in theory and have been observed to exist in practice.
Constructs exist in the human brain and are not directly observable.
There are two types of construct validity: convergent and discriminant validity. Construct validity is established by looking at numerous studies that use the test being evaluated.
23. 2. CONTENT VALIDITY
The test content is a representative sample of the language skills being tested.
The test is content valid if it includes a proper sample.
Importance of content validity:
the greater a test's content validity, the more likely its construct validity
a test without content validity is likely to have a harmful backwash effect, since areas that are not tested are likely to become ignored in teaching and learning
24. 3. CRITERION-RELATED VALIDITY
the degree to which results on the test agree with those provided by an independent criterion
Kinds of criterion-related validity:
A. Concurrent validity
is established when the test and the criterion are administered at the same time
B. Predictive validity
concerns the degree to which a test can predict candidates' future performance
25. VALIDITY COEFFICIENT
a mathematical measure of similarity that shows the degree of validity
Perfect validity will result in a coefficient of 1.00
Total lack of validity results in a coefficient of 0.00
What counts as satisfactory validity depends on the test's purpose and importance
A coefficient of 0.70 might be considered low if the test is important
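In practice the validity coefficient is a correlation, typically Pearson's r, between scores on the test and scores on the independent criterion. A minimal sketch (the score lists below are invented for illustration):

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two paired score lists,
    e.g. scores on a new test (xs) and on a criterion measure (ys)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

test_scores = [55, 60, 70, 80, 90]       # invented test scores
criterion_scores = [50, 62, 68, 85, 88]  # invented criterion scores
validity_coefficient = pearson(test_scores, criterion_scores)
# a value near 1.00 indicates high concurrent validity
```

The same statistic underlies the 1.00 and 0.00 endpoints quoted above: identical rank orderings push r towards 1, unrelated orderings towards 0.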
VALIDITY IN SCORING
A reading test may call for short written responses.
If the scoring of these responses takes spelling and grammar into account, then it is not valid in scoring.
26. 4. FACE VALIDITY
the way the test looks to the examinees, test administrators, educators, and the like
If you want to test students' pronunciation but you do not ask them to speak, your test lacks face validity.
If your test contains items or materials which are not acceptable to candidates, teachers, educators, etc., your test lacks face validity.
HOW TO MAKE TESTS MORE VALID?
Write explicit specifications for the test which include all the constructs to be measured.
Make sure that you include a representative sample of the content.
Use direct testing.
Make sure the scoring is valid.
Make the test reliable.
27. RELIABILITY
refers to the stability or consistency of scores
nearly the same scores for the same individuals in two sessions
Multiple-choice tests have a high coefficient of reliability
Look at the tables on p. 37
RELIABILITY COEFFICIENT
The ideal coefficient is 1.00
Total lack of reliability is 0.00
What counts as satisfactory reliability depends on the purpose and importance of the test:
Vocabulary, structure, and reading tests: .90 - .99
Auditory comprehension tests: .80 - .89
Oral production tests: .70 - .79
CHAPTER 5 : RELIABILITY
28. HOW TO ESTIMATE RELIABILITY?
the way in which the reliability coefficient is arrived at
Test-retest method
The same students take the same test twice, and the two sets of scores are then compared.
Drawbacks of this method:
If the second administration is too soon, the students will remember items, and their scores will be inflated.
If the interval is too long, the students will have forgotten or improved, and that will affect the scores.
29. The alternate forms method
requires two equivalent forms of the test, but the problem is that such forms are usually not available
The split-half method
the most common method of estimating reliability
The subjects take the test one time, but each subject is given two scores, one for each half of the test.
The two sets of scores are then used to obtain the reliability coefficient.
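The split-half method can be sketched numerically. In the Python sketch below, the odd/even split, the 0/1 item data, and the Spearman-Brown correction (the standard adjustment for having correlated two half-length tests) are all illustrative choices rather than the only way to carry the method out:

```python
from statistics import mean

def split_half_reliability(item_scores):
    """Estimate reliability from a single administration: split each
    candidate's items into odd- and even-numbered halves, correlate the
    half-test totals, then apply the Spearman-Brown correction."""
    odd = [sum(row[0::2]) for row in item_scores]
    even = [sum(row[1::2]) for row in item_scores]
    mo, me = mean(odd), mean(even)
    cov = sum((o - mo) * (e - me) for o, e in zip(odd, even))
    var_o = sum((o - mo) ** 2 for o in odd)
    var_e = sum((e - me) ** 2 for e in even)
    r_half = cov / (var_o * var_e) ** 0.5
    return 2 * r_half / (1 + r_half)  # correct for the halved test length

# one row of right(1)/wrong(0) item scores per candidate (invented data)
scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]
reliability = split_half_reliability(scores)
```

Because the two halves come from the same sitting, this method avoids the memory and forgetting problems of test-retest.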
30. THE STANDARD ERROR OF MEASUREMENT AND THE TRUE SCORE
All test scores are estimates
All tests contain some degree of error
You have to use a statistic known as the standard error of measurement to estimate the limits within which an obtained score is likely to diverge from the true score.
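The standard error of measurement has a simple standard formula, SEM = s * sqrt(1 - r), where s is the standard deviation of the test scores and r the reliability coefficient. A minimal sketch (the standard deviation, reliability, and obtained score below are invented values for illustration):

```python
def standard_error_of_measurement(sd, reliability):
    """SEM = standard deviation * sqrt(1 - reliability). Roughly 68% of
    the time, the true score lies within one SEM of the obtained score."""
    return sd * (1 - reliability) ** 0.5

# a test with standard deviation 10 and reliability 0.91 (invented values)
sem = standard_error_of_measurement(10, 0.91)
obtained = 65
likely_band = (obtained - sem, obtained + sem)  # plausible range of the true score
```

Note how the formula ties the two chapters together: the higher the reliability coefficient, the smaller the SEM, and the closer an obtained score sits to the true score.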
31. SCORER RELIABILITY
consistency of scoring
nearly the same score for the same test
in other words, comparing the scores of two or more scorers for the same students
In composition tests, scores usually fluctuate.
In multiple-choice tests, scorer reliability is nearly perfect.
If the scoring of a test is not reliable, then the test results cannot be reliable either.
32. HOW TO MAKE TESTS MORE RELIABLE?
1) Take enough samples of behavior
The more items you have on a test, the more reliable the test will be.
Considerations to be taken into account when adding extra items:
Additional items should be independent of each other and of existing items.
Each additional item should represent a fresh start for the candidate.
Tests should be neither too long nor too short.
33. HOW TO MAKE TESTS MORE RELIABLE?
2) Exclude items which do not discriminate well between weaker and stronger students
Items on which strong students and weak students perform with a similar degree of success contribute little to the reliability of a test.
Items that are too easy or too difficult should be excluded.
A small number of easy items may be kept at the beginning of a test to give candidates confidence and reduce the stress they feel.
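One common way to quantify how well an item separates weaker from stronger students is a discrimination index: the proportion of a high-scoring group answering the item correctly minus the proportion of a low-scoring group. The Python sketch below is illustrative; the half-and-half grouping and the score data are assumptions (item-analysis practice often uses the top and bottom 27%).

```python
def discrimination_index(item_correct, total_scores, fraction=0.5):
    """D = p(upper group answered correctly) - p(lower group answered
    correctly). `item_correct[i]` is 1 if candidate i got this item right;
    `total_scores[i]` is that candidate's whole-test score. Items with D
    near 0 (or negative) discriminate poorly and are candidates for removal."""
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    k = max(1, int(len(order) * fraction))
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item_correct[i] for i in upper) / k
    p_lower = sum(item_correct[i] for i in lower) / k
    return p_upper - p_lower

totals = [10, 20, 30, 40]  # invented whole-test scores for four candidates
too_easy = discrimination_index([1, 1, 1, 1], totals)   # everyone right
good_item = discrimination_index([0, 0, 1, 1], totals)  # only the strong right
```

An item everyone answers correctly yields D = 0 and adds nothing to reliability, which is the point this slide is making.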
34. HOW TO MAKE TESTS MORE RELIABLE?
3) Do not allow candidates too much freedom
The procedure of giving candidates a choice of questions has a negative effect on reliability.
In general, candidates should not be given a choice.
4) Write unambiguous items
5) Provide clear and explicit instructions
6) Ensure that tests are well laid out and perfectly legible
7) Make candidates familiar with the format and testing techniques
35. HOW TO MAKE TESTS MORE RELIABLE?
8) Provide uniform and non-distracting conditions of administration
9) Use items that permit scoring which is as objective as possible
10) Provide a detailed scoring key
11) Train scorers
12) Ensure all scorers follow the same criteria for scoring
13) Identify candidates by number, not name
14) Employ multiple, independent scoring
36. RELIABILITY AND VALIDITY
A valid test must be reliable.
However, a reliable test may not be valid at all.
Increasing the reliability of a test may come at the expense of validity.
There will always be some tension between reliability and validity.
The tester has to balance gains in one against losses in the other.
37. CHAPTER 6 :
ACHIEVING BENEFICIAL BACKWASH
Test the abilities whose development you want to encourage.
Beware of the usual reasons for not testing particular abilities:
the attraction of objectively scored formats such as multiple choice
the difficulty of subjective scoring in the case of subjective tests
the expense involved in terms of time and money
Determine the points that should be tested and give them sufficient weight in relation to the other abilities.
38. How to achieve beneficial backwash:
I. Sample widely and unpredictably.
II. Use direct testing.
III. Make testing criterion-referenced.
IV. Base achievement tests on objectives.
V. Ensure the test is known and understood by students and teachers.
VI. Where necessary, provide assistance to teachers.
VII. Count the cost.
39. CHAPTER 7:
STAGES OF TEST DEVELOPMENT
1) Make a full and clear statement of the testing ‘problem’
2) Write complete specifications for the test
3) Write and moderate items
4) Try the items on native speakers
5) Try the items on non-native speakers
6) Analyze the results of the trial and make necessary
changes
7) Calibrate scales
8) Validate
9) Write handbooks for test takers, test users, and staff
10) Train any necessary staff (interviewers, raters, etc.)
40. Stating the problem
The questions to be answered in order to state the problem:
i) What kind of test is it to be?
ii) What is its precise purpose?
iii) What abilities are to be tested?
iv) How detailed must the results be?
v) How accurate must the results be?
vi) How important is backwash?
vii) What constraints are set by the unavailability of expertise, facilities, and time?
41. 2) Writing specifications for the test
i) Determining content
Specifying instructional objectives
Preparing a table of specifications
Determining the number of items
ii) Necessary operations by the test developer
Specification of text types: letters, forms, academic essays
Addressees of texts
Length of text(s)
Topics (familiar/unfamiliar)
Readability
Structural and vocabulary range
Dialect, accent, style
Speed of processing: words to be read per minute, rate of speech
42. iii) Structure, timing, medium/channel and techniques
Test structure (test sections: grammar, vocabulary, reading)
Number of items
Number of passages
Medium/channel (tape, paper & pencil, ...)
Timing
Techniques
iv) Critical levels of performance
v) Scoring procedures: subjective or objective
3) Writing and moderating items
i) Sampling (based on the contents)
ii) Writing items
iii) Moderating items (reviewing)
43. 4) and 5) Pretesting
Informal trialling of items on native speakers
Trialling items on non-native speakers (pretesting)
6) Item analysis (analysis of the results of the trial)
reliability
level of difficulty
discrimination index
distractors
clarity of instructions and items
timing
7) Calibration of scales
8) Validation
9) Writing handbooks for test takers, test users, and staff
10) Training staff
44. 7) Calibration of scales
For testing speaking and writing, a team of experts looks at samples of the skills and assigns each of them to a point on the relevant scale.
8) Validation
This is essential for proficiency tests and repeatedly used tests.
9) Writing handbooks for test takers, test users, and staff
This is essential for proficiency tests and repeatedly used tests.
10) Training staff
This is essential for proficiency tests and repeatedly used tests.
See pp. 66 - 72 for examples of test development.