In an era of communicative language teaching:
Tests should measure up to standards of authenticity and meaningfulness.
Teachers should design tests that serve as motivating learning experiences rather
than anxiety-provoking threats.
Tests:
should be positive experiences
should build a person's confidence and become learning experiences
should bring out the best in students
shouldn't be degrading
shouldn't be artificial
shouldn't be anxiety-provoking
Language assessment aims:
to create more authentic, intrinsically motivating assessment procedures that
are appropriate for their context and are designed to offer constructive feedback
to students.
What is a test?
A test is a method of measuring a person's ability, knowledge or performance in
a given domain.
1. A test is a method: a set of techniques, procedures or items.
To qualify as a test, the method must be explicit and structured, for example:
Multiple-choice questions with prescribed correct answers
A writing prompt with a scoring rubric
An oral interview based on a question script and a checklist of
expected responses to be filled by the administrator
2. A test measures. It must be a means for offering the test-taker some kind of result.
If an instrument does not specify a form of reporting measurement, then that
technique cannot be defined as a test.
Scoring may take various forms:
A classroom-based short-answer essay test may earn the test-taker a letter grade
accompanied by the instructor's marginal comments.
Large-scale standardized tests provide a total numerical score, a percentile
rank, and perhaps some sub-scores.
3. The test-taker (the individual) = the person who takes the test.
Testers need to understand:
who the test-takers are,
what their previous experience and background is,
whether the test is appropriately matched to their abilities, and
how test-takers should interpret their scores.
4. A test measures performance, but the results imply the test-taker's ability or
competence.
Some language tests measure one's ability to perform language:
to speak, write, read or listen to a subset of language.
Others measure a test-taker's knowledge about language:
defining a vocabulary item, reciting a grammatical rule or identifying a
rhetorical feature in written discourse.
5. Measuring a given domain
It means measuring the desired criterion and not including other factors.
In the case of a proficiency test, even though the actual performance on the test
involves only a sampling of skills, the domain is overall proficiency in a
language – general competence in all skills of a language.
Classroom-based performance tests:
These have more specific criteria. For example:
A test of pronunciation might well be a test of only a limited set of
phonemic minimal pairs.
A vocabulary test may focus on only the set of words covered in a particular
lesson or unit.
A well-constructed test is an instrument that provides an accurate measure of
the test-taker's ability within a particular domain.
TESTING, ASSESSMENT & TEACHING
Tests:
Tests are prepared administrative procedures that occur at identifiable times in
a curriculum.
When tested, learners know that their performance is being measured and
evaluated.
When tested, learners muster all their faculties to offer peak performance.
Tests are a subset of assessment. They are only one among many procedures and
tasks that teachers can ultimately use to assess students.
Tests are usually time-constrained (usually spanning a class period or at most
several hours) and draw on a limited sample of behavior.
Assessment:
Assessment is an ongoing process that encompasses a much wider domain.
A good teacher never ceases to assess students, whether those assessments are
incidental or intended.
Whenever a student responds to a question, offers a comment, or tries out a new
word or structure, the teacher subconsciously makes an assessment of the
student's performance.
Assessment includes testing. Assessment is more extended and includes many
more components.
What about TEACHING?
For optimal learning to take place, learners must have opportunities to “play”
with language without being formally graded.
Teaching sets up the practice games of language learning:
the opportunities for learners to listen, think, take risks, set goals,
and process feedback from the teacher (coach)
and then recycle through the skills that they are trying to master.
During these practice activities, teachers are indeed observing students'
performance and making various evaluations of each learner.
Then, it can be said that testing and assessment are subsets of teaching.
Informal assessment:
Informal assessments are incidental, unplanned comments and responses.
Examples include: "Nice job!" "Well done!" "Good work!" "Did you say
can or can't?" "Broke or break?", or putting a ☺ on some homework.
Classroom tasks are designed to elicit performance without recording results
and making fixed judgements about a student's competence.
Examples of unrecorded assessment: marginal comments on papers, responding
to a draft of an essay, advice about how to better pronounce a word, a
suggestion for a strategy for compensating for a reading difficulty, and
showing how to modify a student's note-taking to better remember the content
of a lecture.
Formal assessment:
Formal assessments are exercises or procedures specifically designed to tap into
a storehouse of skills and knowledge.
They are systematic, planned sampling techniques constructed to give teachers
and students an appraisal of student achievement.
They are tournament games that occur
periodically in the course of teaching.
It can be said that all tests are formal
assessments, but not all formal
assessment is testing.
Example 1: A student's journal or portfolio of materials can be used as a formal
assessment of attainment of certain course objectives, but it is problematic to
call those two procedures "tests".
Example 2: A systematic set of observations
of a student‟s frequency of oral
participation in class is certainly a formal
assessment, but not a “test”.
THE FUNCTION OF AN ASSESSMENT
Formative assessment:
Evaluating students in the process of "forming" their competencies and skills,
with the goal of helping them to continue that growth process.
It provides for the ongoing development of the learner's language.
Example: When you give students a comment or a suggestion, or call attention
to an error, that feedback is offered to improve the learner's language ability.
Virtually all kinds of informal assessment are formative.
Summative assessment:
It aims to measure, or summarize, what a student has grasped, and typically
occurs at the end of a course.
It does not necessarily point the way to future progress.
Examples: Final exams in a course and general proficiency exams.
All tests/formal assessments (quizzes, periodic review tests, midterm exams,
etc.) tend to be summative.
As far as summative assessment is concerned, in the aftermath of any
test, students tend to think, "Whew! I'm glad that's over.
Now I don't have to remember that stuff anymore!"
An ideal teacher should try to change this attitude among students.
A teacher should:
· instill a more formative quality to his lessons
· offer students an opportunity to convert tests into "learning experiences".
Norm-Referenced Tests:
Each test-taker's score is interpreted in relation to a mean (average score),
median (middle score), standard deviation (extent of variance in scores),
and/or percentile rank.
The purpose is to place test-takers along a mathematical continuum in rank
order.
Scores are usually reported back to the test-taker in the form of a numerical
score (230 out of 300, 84%, etc.).
Typical of these tests are standardized tests like the SAT, TOEFL, ÜDS, KPDS, etc.
These tests are intended to be administered to large audiences, with results
efficiently disseminated to test-takers.
They must have fixed, predetermined responses in a format that can be scored
quickly at minimum expense.
Money and efficiency are primary concerns.
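The statistics used in norm-referenced score interpretation (mean, median, standard deviation, percentile rank) are straightforward to compute. Here is a minimal sketch in Python; the score list and the simple "percentage of scores below" definition of percentile rank are illustrative assumptions, not the procedure of any particular standardized test:

```python
import statistics

def norm_referenced_report(scores, student_score):
    """Interpret one student's score relative to the group (norm-referenced)."""
    mean = statistics.mean(scores)      # average score
    median = statistics.median(scores)  # middle score
    stdev = statistics.stdev(scores)    # extent of variance in scores
    # Percentile rank: percentage of scores falling below the student's score.
    below = sum(1 for s in scores if s < student_score)
    percentile = 100 * below / len(scores)
    return mean, median, stdev, percentile

# Hypothetical scores out of 300 for eight test-takers.
scores = [230, 210, 250, 270, 190, 240, 260, 220]
mean, median, stdev, percentile = norm_referenced_report(scores, 250)
```

The point of norm-referencing is visible here: a score of 250 is "above the mean, at roughly the 63rd percentile" only relative to these particular test-takers; with a different group, the same raw score would be interpreted differently.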
Criterion-Referenced Tests:
They are designed to give test-takers feedback, usually in the form of grades,
on specific course or lesson objectives.
Tests that involve the students of only one class, and that are connected to a
curriculum, are typical of criterion-referenced testing.
Much time and effort on the part of the teacher are required to deliver useful,
appropriate feedback to students.
The distribution of students' scores across a continuum may be of little concern
as long as the instrument assesses the appropriate objectives.
As opposed to standardized, large-scale testing with its emphasis on norm
referencing, criterion-referenced testing is of more prominent interest in
classroom-based testing.
Approaches to Language Testing: A Brief History
Historically, language-testing trends have followed the trends of language
teaching methodology.
During the 1950s: an era of behaviourism and special attention to contrastive
analysis. Testing focused on specific language elements such as the
phonological, grammatical, and lexical contrasts between two languages.
During the 1970s and 80s: communicative theories of language were widely
accepted, bringing a more integrative view of testing.
Today: test designers are trying to form authentic, valid instruments that
simulate real-world interaction.
APPROACHES TO LANGUAGE TESTING
A) Discrete-Point Testing
Language can be broken down into its component parts, and those parts can be
tested successfully.
Component parts: listening, speaking, reading and writing.
Units of language (discrete points): morphology, lexicon, syntax and phonology.
A language proficiency test should sample all four skills and as many linguistic
discrete points as possible.
B) Integrative Testing
Language competence is a unified set of interacting abilities that cannot be
tested separately.
Communicative competence is so global and requires such integration that it
cannot be captured in additive tests of grammar, reading, vocabulary, and
other discrete points of language.
Two types of tests are classic examples of integrative testing: *cloze tests and
**dictation.
Unitary trait hypothesis: It suggests an "indivisible" view of language
proficiency: vocabulary, grammar, phonology, the "four skills", and other
discrete points of language could not be disentangled from each other.
In the face of evidence from a study in which each student scored differently in
various skills depending on his background, country and major field, Oller
admitted that the "unitary trait hypothesis was wrong."
*Cloze Test: Cloze test results are good measures of overall proficiency.
The ability to supply appropriate words in blanks requires a number of
abilities that lie at the heart of competence in a language:
knowledge of vocabulary, grammatical structure,
discourse structure, reading skills and strategies.
It was argued that successful completion of cloze items taps into all of those
abilities, which were said to be the essence of global language proficiency.
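A classic fixed-ratio cloze passage deletes roughly every sixth or seventh word of a text. The mechanical part of that procedure can be sketched as follows; the passage, the deletion ratio, and the blank marker are illustrative choices, not a prescribed format:

```python
def make_cloze(text, n=7, first=2):
    """Replace every nth word with a blank (fixed-ratio cloze).
    Returns the gapped passage and the answer key."""
    words = text.split()
    answers = []
    for i in range(first, len(words), n):
        answers.append(words[i])
        words[i] = "______"
    return " ".join(words), answers

passage = ("Language competence is a unified set of interacting abilities "
           "that cannot be tested separately according to the integrative view")
gapped, key = make_cloze(passage, n=5)
```

Scoring can then accept only the exact deleted word, or, in the more lenient acceptable-word method, any alternative that fits the context.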
**Dictation: Essentially, learners listen to a passage of 100 to 150 words read
aloud by an administrator (or audiotape) and write what they hear, using
correct spelling.
Supporters argue that dictation is an integrative test because
success on a dictation requires careful listening,
reproduction in writing of what is heard, efficient short-term memory,
to an extent, some expectancy rules to aid the short-term memory.
c) Communicative Language Testing (a recent approach, after the mid-1980s)
What does it criticise?
In order for a particular language test to be useful for its intended purposes,
test performance must correspond in demonstrable ways to language use in
non-test situations.
Integrative tests such as cloze only tell us about a candidate's linguistic
competence.
They do not tell us anything directly about a student's performance ability
(knowledge about a language, not the use of language).
There was a quest for authenticity, as test designers centered on communicative
performance.
The supporters emphasized the importance of strategic competence (the ability to
employ communicative strategies to compensate for breakdowns as well as to enhance
the rhetorical effect of utterances) in the process of communication.
Any problem in using this approach?
Yes, communicative testing presented challenges to test designers, because they
began to identify the real-world tasks that language learners were called upon
to perform.
It was clear that the contexts for those tasks were extraordinarily varied and
that the sampling of tasks for any one assessment procedure needed to be
validated by what language users actually do with language.
As a result:
The assessment field became more and more concerned with the authenticity of tasks
and the genuineness of texts.
d) Performance-Based Assessment
Performance-based assessment of language typically involves oral production,
written production, open-ended responses, integrated performance (across skill
areas), group performance, and other interactive tasks.
It is time-consuming and expensive, but those extra efforts pay off in more
direct testing, because students are assessed as they perform actual or
simulated real-world tasks.
The advantage of this approach?
Higher content validity is achieved because learners are measured in the process
of performing the targeted linguistic acts.
An important implication: performance-based assessment means that teachers
should rely a little less on formally structured tests and a little more on
evaluation while students are performing various tasks.
In performance-based assessment:
Interactive tests (speaking, requesting, responding, etc.) are IN ☺;
paper-and-pencil tests are OUT.
Result: the test tasks can approach the authenticity of real-life language use.
CURRENT ISSUES IN CLASSROOM TESTING
The design of communicative, performance-based assessment continues to
challenge both assessment experts and classroom teachers.
There are three issues which are helping to shape our current understanding of
effective assessment. These are:
· The effect of new theories of intelligence on the testing industry
· The advent of what has come to be called "alternative assessment"
· The increasing popularity of computer-based testing
New Views on Intelligence
In the past:
Intelligence was once viewed strictly as the ability to perform linguistic and
logical-mathematical problem solving.
For many years, we have lived in a world of standardized, norm-referenced tests
that are timed, in a multiple-choice format, consisting of a multiplicity of
logic-constrained items, many of which are inauthentic.
We were relying on timed, discrete-point, analytical tests in measuring language.
We were forced to stay within the limits of objectivity and give impersonal
responses.
EQ (Emotional Quotient) underscores the role of emotions in our cognitive
processing.
Those who manage their emotions tend to be more capable of fully intelligent
processing, because anger, grief, resentment, other feelings can easily impair
peak performance in everyday tasks as well as higher-order problem solving.
The intuitive appeal of these conceptualizations of intelligence infused the
1990s with a sense of both freedom and responsibility in our testing agenda.
Our new challenge was to test interpersonal, creative, communicative,
interactive skills, and in doing so to place some trust in our subjectivity
and intuition.
Traditional and "Alternative" Assessment
Traditional assessment:
-One-shot, standardized exams
-Timed, multiple-choice format
-Decontextualized test items
-Scores suffice for feedback
-Focus on the "right" answer
-Oriented to product
-Fosters extrinsic motivation
"Alternative" assessment:
-Continuous long-term assessment
-Untimed, free-response format
-Contextualized communicative tests
-Individualized feedback and washback
-Open-ended, creative answers
-Oriented to process
-Fosters intrinsic motivation
It is difficult to draw a clear line of distinction between
traditional and alternative assessment.
Many forms of assessment fall in between the two, and some
combine the best of both.
More time and higher institutional budgets are required to
administer and score assessments that presuppose more
subjective evaluation, more individualization, and more
interaction in the process of offering feedback.
But the payoff of "alternative assessment" comes with more useful feedback to
students, the potential for intrinsic motivation, and ultimately a more
complete description of a student's ability.
Computer-Based Testing
Some computer-based tests are small-scale. Others are standardized, large-scale
tests (e.g. TOEFL) in which thousands of test-takers are involved.
A type of computer-based test, the Computer-Adaptive Test (CAT), is also
available.
In CAT, the test-taker sees only one question at a time, and the computer scores each
question before selecting the next one.
Test-takers cannot skip questions, and, once they have entered and confirmed their
answers, they cannot return to questions.
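The adaptive loop described above can be sketched in a few lines. The item bank, the one-step difficulty ladder, and the up/down rule below are deliberately simplified illustrations, not the scoring model of any real test such as the TOEFL:

```python
def run_cat(item_bank, answers, start_level=3):
    """Minimal computer-adaptive test loop: each answer is scored before the
    next item is chosen; a correct answer raises the difficulty, an incorrect
    one lowers it. Items cannot be skipped or revisited."""
    level = start_level
    administered = []
    for correct in answers:  # the test-taker sees one question at a time
        administered.append((item_bank[level], level))
        # Adjust difficulty within the bank's range before the next item.
        if correct:
            level = min(level + 1, max(item_bank))
        else:
            level = max(level - 1, min(item_bank))
    return administered, level

bank = {1: "very easy", 2: "easy", 3: "medium", 4: "hard", 5: "very hard"}
history, final_level = run_cat(bank, [True, True, False, True])
```

Real CATs estimate ability with item response theory rather than a one-step ladder, but the administrative logic (score first, then select the next item) is the same.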
Advantages of Computer-Based Testing:
o Classroom-based testing
o Self-directed testing on various aspects of a lang (vocabulary, grammar, discourse, etc)
o Practice for upcoming high-stakes standardized tests
o Some individualization, in the case of CATs.
o Scored electronically for rapid reporting of results.
Disadvantages of Computer-Based Testing:
o Lack of security and the possibility of cheating in unsupervised computerized
tests.
o Home-grown quizzes may be mistaken for validated assessments.
o Open-ended responses are less likely to appear because of the need for human
scorers.
o The human interactive element is absent.
An Overall Summary
Assessment is an integral part of the teaching-learning cycle.
In an interactive, communicative curriculum, assessment is almost constant.
Tests can provide authenticity, motivation, and feedback to the learner.
Tests are essential components of a successful curriculum and learning process.
Periodic assessments can increase motivation as milestones of student progress.
Appropriate assessments aid in the reinforcement and retention of information.
Assessments can confirm strength and pinpoint areas needing further work.
Assessments provide sense of periodic closure to modules within a curriculum.
Assessments promote student autonomy by encouraging self-evaluation of progress.
Assessments can spur learners to set goals for themselves.
Assessments can aid in evaluating teaching effectiveness.
Decide whether the following statements are TRUE or FALSE.
1. It's possible to create authentic and motivating assessment to offer
constructive feedback to the students. ----------
2. All tests should offer the test-takers some kind of measurement or result. ----------
3. Performance-based tests measure test-takers' knowledge about language. ----------
4. Tests are the best tools to assess students. ----------
5. Assessment and testing are synonymous terms. ----------
6. Teachers' incidental and unplanned comments and responses to students are an
example of formal assessment. ----------
7. Most of our classroom assessment is summative assessment. ----------
8. Formative assessment always points toward future formation of learning. ----------
9. The distribution of students' scores across a continuum is a concern in
norm-referenced tests. ----------
10. Criterion-referenced testing has more instructional value than
norm-referenced testing for classroom teachers. ----------
Answers:
1. TRUE
3. FALSE (They are designed to test actual use of language, not knowledge about
language.)
4. FALSE (We cannot say they are the best, but one of many useful devices to
assess students.)
5. FALSE (They are not.)
6. FALSE (They are informal assessment.)
7. FALSE (Most classroom assessment is formative.)
There are five criteria for "testing a test":
1. Practicality 2. Reliability 3. Validity 4. Authenticity 5. Washback
A practical test
· is not excessively expensive,
· stays within appropriate time constraints,
· is relatively easy to administer, and
· has a scoring/evaluation procedure that is specific and time-efficient.
For a test to be practical
· administrative details should clearly be established before the test,
· sts should be able to complete the test reasonably within the set time frame,
· the test should be able to be administered smoothly, without procedural
complications,
· all materials and equipment should be ready,
· the cost of the test should be within budgeted limits,
· the scoring/evaluation system should be feasible in the teacher‟s time frame.
· methods for reporting results should be determined in advance.
A reliable test is consistent and dependable.
The issue of reliability of a test may best be addressed by considering a number
of factors that may contribute to the unreliability of a test.
Consider the following possibilities: fluctuations
· in the student (Student-Related Reliability),
· in scoring (Rater Reliability),
· in test administration (Test Administration Reliability), and
· in the test (Test Reliability) itself.
Student-Related Reliability:
Temporary illness, fatigue, a bad day, anxiety, or other physical or
psychological factors may make an "observed" score deviate from one's "true"
score.
A test-taker's "test-wiseness", or strategies for efficient test-taking, can
also be included in this category.
Rater Reliability:
Human error, subjectivity, lack of attention to scoring criteria, inexperience,
inattention, or even preconceived biases may enter into the scoring process.
Inter-rater unreliability occurs when two or more scorers yield inconsistent
scores for the same test.
Intra-rater unreliability results from unclear scoring criteria, fatigue, bias
toward particular "good" and "bad" students, or simple carelessness.
One solution to such intra-rater unreliability is to read through about half of
the tests before rendering any final scores or grades, then to recycle back
through the whole set of tests to ensure an even-handed judgment.
The careful specification of an analytical scoring instrument can increase
rater reliability.
Test Administration Reliability:
Unreliability may also result from the conditions in which the test is administered.
Street noise, photocopying variations, poor light, temperature, desks and chairs.
Test Reliability:
Sometimes the nature of the test itself can cause measurement errors.
Timed tests may discriminate against students who do not perform well under a
time limit.
Poorly written test items may be a further source of test unreliability.
Validity:
The extent to which the assessment requires students to perform tasks that were
included in the previous classroom lessons.
How is the validity of a test established?
There is no final, absolute measure of validity, but several different kinds of
evidence may be invoked in support.
It may be appropriate to examine the extent to which a test calls for
performance that matches that of the course or unit of study being tested.
In other cases we may be concerned with how well a test determines whether or
not students have reached an established set of goals or level of competence.
It could also be appropriate to study statistical correlations with other
related but independent measures.
Other concerns about a test's validity may focus on the consequences of a test
(beyond measuring the criteria themselves), or even on the test-taker's
perception of validity.
We will look at these five types of evidence below.
· Content Validity:
If a test requires the test-taker to perform the behaviour that is being
measured, it can claim content-related evidence of validity, often popularly
referred to as content validity.
If you want to assess a person's ability to speak the target language, asking
students to answer paper-and-pencil multiple-choice questions requiring
grammatical judgements does not achieve content validity.
For content validity to be achieved, the following conditions should hold:
· Classroom objectives should be identified and appropriately framed. The first
measure of an effective classroom test is the identification of objectives.
· Lesson objectives should be represented in the form of test specifications. A
test should have a structure that follows logically from the lesson or unit you
are testing.
If you clearly perceive the performance of test-takers as reflective of the
classroom objectives, then you can argue that content validity has probably
been achieved.
To understand content validity, consider the difference between direct and
indirect testing.
Direct testing involves the test-taker in actually performing the target task.
Indirect testing involves performing not the target task itself, but a task
related to it in some way.
Direct testing is the most feasible way to achieve content validity in
assessment.
· Criterion-Related Validity:
It examines the extent to which the criterion of the test has actually been
achieved.
For example, a classroom test designed to assess a point of grammar in
communicative use will have criterion validity if test scores are corroborated
either by observed subsequent behavior or by other communicative measures
of the grammar point in question.
Criterion-related evidence usually falls into one of two categories:
· Concurrent validity:
A test has concurrent validity if its results are supported by other concurrent
performance beyond the assessment itself.
For example, the validity of a high score on the final exam of a foreign
language course will be substantiated by actual proficiency in the language.
· Predictive validity:
The assessment criterion in such cases is not to measure concurrent ability but
to assess (and predict) a test-taker‟s likelihood of future success.
For example, the predictive validity of an assessment becomes important in the
case of placement tests, language aptitude tests, and the like.
· Construct Validity:
Every issue in language learning and teaching involves theoretical constructs.
In the field of assessment, construct validity asks, "Does this test actually tap
into the theoretical construct as it has been identified?" (That is, does the
test really have the structural properties needed to test the topic or skill it
is meant to test?)
Imagine that you have been given a procedure for conducting an oral
interview. The scoring analysis for the interview includes several factors in
the final score: pronunciation, fluency, grammatical accuracy, vocabulary use,
and sociolinguistic appropriateness. The justification for these five factors lies
in a theoretical construct that claims those factors to be major components of
oral proficiency. So if you were asked to conduct an oral proficiency
interview that evaluated only pronunciation and grammar, you could be
justifiably suspicious about the construct validity of that test.
Tests we describe as "large-scale standardized tests" are not very strong in
terms of construct validity, because for the sake of practicality (for reasons
of both time and cost) they cannot measure all the language skills that ought
to be measured. For example, the lack of an oral production section in the
TOEFL is a major obstacle in terms of construct validity.
· Consequential Validity:
Consequential validity encompasses all the consequences of a test,
including such considerations as its accuracy in measuring intended
criteria, its impact on the preparation of test-takers, its effect on the
learner, and the (intended and unintended) social consequences of a test's
interpretation and use.
McNamara (2000, p. 54) cautions against test results that may reflect
socioeconomic conditions such as opportunities for coaching. For example, only
some families can afford coaching, or children with more highly educated
parents get help from their parents.
Teachers should consider the effect of assessments on students'
motivation, subsequent performance in a course, independent
learning, study habits, and attitude toward school work.
· Face Validity:
The degree to which a test looks right, and appears to measure the knowledge
or abilities it claims to measure, based on the subjective judgment of
test-takers.
· Face validity means that the students perceive the test to be valid. Face
validity asks the question, "Does the test, on the 'face' of it, appear from
the learner's perspective to test what it is designed to test?"
· Face validity is not something that can be empirically tested by a teacher or
even by a testing expert. It depends on subjective evaluation of the test-taker.
· A classroom test is not the time to introduce new tasks.
· If a test samples the actual content of what the learner has achieved or expects
to achieve, face validity will be more likely to be perceived.
· Content validity is a very important ingredient in achieving face validity.
· Students will generally judge a test to be face valid if directions are clear, the
structure of the test is organized logically, its difficulty level is appropriately
pitched, the test has no “surprises”, and timing is appropriate.
· To give an assessment procedure that is “biased for best” a teacher offers
students appropriate review and preparation for the test, suggests strategies
that will be beneficial, and structures the test so that the best students will be
modestly challenged and the weaker students will not be overwhelmed.
In an authentic test
· the language is as natural as possible,
· items are as contextualized as possible,
· topics and situations are interesting, enjoyable and/or humorous,
· some thematic organization, such as through a story line or episode, is
provided,
· tasks represent real-world tasks.
Reading passages are selected from real-world sources that test-takers are
likely to have encountered or will encounter.
Listening comprehension sections feature natural language with
hesitations, white noise, and interruptions.
More and more tests offer items that are “episodic” in that they are sequenced
to form meaningful units, paragraphs, or stories.
Washback includes the effects of an assessment on teaching and learning prior
to the assessment itself, that is, on preparation for the assessment.
Informal performance assessment is by nature more likely to have built-in
washback effects because the teacher is usually providing interactive feedback.
Formal tests can also have positive washback, but they provide no washback if
the students receive a simple letter grade or a single overall numerical score.
Tests should serve as learning devices through which washback is achieved.
Students' incorrect responses can become windows of insight into further work.
Their correct responses need to be praised, especially when they represent
accomplishments in a student's interlanguage.
Washback enhances a number of basic principles of language acquisition:
intrinsic motivation, autonomy, self-confidence, language ego, interlanguage,
and strategic investment, among others.
To enhance washback, comment generously and specifically on test performance.
Washback implies that students have ready access to the teacher to discuss the
feedback and evaluation he has given.
Teachers can raise the washback potential by asking students to use test results
as a guide to setting goals for their future effort.
What is washback?
In general terms: The effect of testing on teaching and learning
In large-scale assessment: Refers to the effects that the tests have on
instruction in terms of how students prepare for the test
In classroom assessment: The information that washes back to students
in the form of useful diagnoses of strengths and weaknesses
What does washback enhance?
What should teachers do to enhance washback?
Comment generously and specifically on test performance
Respond to as many details as possible
Criticize weaknesses constructively
Give strategic hints to improve performance
Decide whether the following statements are TRUE or FALSE.
1. An expensive test is not practical.
2. One of the sources of unreliability of a test is the school.
3. Students, raters, the test, and its administration may affect the test's
reliability.
4. In indirect tests, students do not actually perform the task.
5. If students are aware of what is being tested when they take a test, and think
that the questions are appropriate, the test has face validity.
6. Face validity can be tested empirically.
7. Diagnosing strengths and weaknesses of students in language learning is a
facet of washback.
8. One way of achieving authenticity in testing is to use simplified language.
Decide which type of validity each sentence belongs to.
1. It is based on subjective judgment. ----------
2. It questions the accuracy of measuring the intended criteria. ----------
3. It appears to measure the knowledge and abilities it claims to measure. ----------
4. It measures whether the test meets the classroom objectives. ----------
5. It requires the test to be based on a theoretical background. ----------
6. Washback is part of it. ----------
7. It requires the test-taker to perform the behavior being measured. ----------
8. The students (test-takers) think they are given enough time to do the test. ----------
9. It assesses a test-taker's likelihood of future success (e.g. placement
tests). ----------
10. The students' psychological mood may affect it negatively or positively. ----------
11. It includes the consideration of the test's effect on the learner. ----------
12. Items of the test do not seem to be complicated. ----------
13. The test covers the objectives of the course. ----------
14. The test has clear directions. ----------
Answers:
1. Face validity
4. Content validity
5. Construct validity
6. Consequential validity
7. Content validity
9. Criterion-related validity
12. Face validity
13. Content validity
14. Face validity
Decide which type of reliability each sentence could be related to.
1. There are ambiguous items.
2. The student is anxious.
3. The tape is of bad quality.
4. The teacher is tired but continues scoring.
5. The test is too long.
6. The room is dark.
7. The student has had an argument with the teacher.
8. The scorers interpret the criteria differently.
9. There is a lot of noise outside the building.
1. Test reliability
2. Student-related reliability
3. Test administration reliability
4. Rater reliability
5. Test reliability
6. Test administration reliability
7. Student-related reliability
8. Rater reliability
9. Test administration reliability
We examine test types and learn how to design tests and revise existing ones.
To start the process of designing tests, we will ask some critical questions.
Five questions should form the basis of your approach to designing tests for your class.
Question 1: What is the purpose of the test?
· Why am I creating this test?
· For an evaluation of overall proficiency? (Proficiency Test)
· To place students into a course? (Placement Test)
· To measure achievement within a course? (Achievement Test)
Once you have established the major purpose of a test, you can determine its objectives.
Question 2: What are the objectives of the test?
· What specifically am I trying to find out?
· What language abilities are to be assessed?
Question 3: How will test specifications reflect both purpose and objectives?
· When a test is designed, the objectives should be incorporated into a structure
that appropriately weights the various competencies being assessed.
Question 4: How will test tasks be selected and the separate items arranged?
· The tasks need to be practical.
· They should also achieve content validity by presenting tasks that mirror
those of the course being assessed.
· They should be evaluated reliably by the teacher or scorer.
· The tasks themselves should strive for authenticity, and the progression of
tasks ought to be biased for best performance.
Question 5: What kind of scoring, grading, and/or feedback is expected?
· Tests vary in the form and function of feedback, depending on their purpose.
· For every test, the way results are reported is an important consideration.
· Under some circumstances a letter grade or a holistic score may be appropriate;
other circumstances may require that a teacher offer substantive washback to the learner.
Defining your purpose will help you choose the right kind of test, and it will
also help you to focus on the specific objectives of the test.
Below are the test types to be examined:
1. Language Aptitude Tests
2. Proficiency Tests
3. Placement Tests
4. Diagnostic Tests
5. Achievement Tests
1. Language Aptitude Tests
They predict a person's success prior to exposure to the second language.
An aptitude test is designed to measure capacity or general ability to learn a FL.
They are designed to apply to the classroom learning of any language.
Two standardized aptitude tests have been used in the US.
The Modern Language Aptitude Test (MLAT),
Pimsleur Language Aptitude Battery (PLAB)
Tasks in the MLAT include: number learning, phonetic script, spelling clues,
words in sentences, and paired associates.
There's no unequivocal evidence that language aptitude tests predict
communicative success in a language.
Any test that claims to predict success in learning a language is undoubtedly
flawed because we now know that with appropriate self-knowledge, and
active strategic involvement in learning, everyone can succeed eventually.
2. Proficiency Tests
A proficiency test is not limited to any one course, curriculum, or single skill in
the language; rather, it tests overall ability.
It includes: standardized multiple choice items on grammar, vocabulary,
reading comprehension, and aural comprehension. Sometimes a sample of
writing is added, and more recent tests also include oral production.
Such tests often have content validity weaknesses.
Proficiency tests are almost always summative and norm-referenced.
They are usually not equipped to provide diagnostic feedback.
Their role is to accept or deny someone's passage into the next stage of a journey.
TOEFL is a typical standardized proficiency test.
Creating and validating them with research is a time-consuming and costly process.
Choosing one of the commercially available proficiency tests is a far more
practical method for classroom teachers.
3. Placement Tests
The objective of a placement test is to correctly place sts into a course or level.
Certain proficiency tests can act in the role of placement tests.
A placement test usually includes a sampling of the material to be covered in
the various courses in a curriculum.
Sts should find the test neither too easy nor too difficult but challenging.
ESL Placement Test (ESLPT) at San Francisco State University has three parts.
Part 1: sts read a short article and then write a summary essay.
Part 2: sts write a composition in response to an article.
Part 3: multiple-choice; sts read an essay and identify grammar errors in it.
The ESLPT is more authentic but less practical, because human evaluators are
required for the first two parts.
Reliability problems are present, but they are mitigated by conscientious training of evaluators.
What is lost in practicality and reliability is gained in the diagnostic
information that the ESLPT provides.
4. Diagnostic Tests
A diagnostic test is designed to diagnose specified aspects of a language.
A diagnostic test can help a student become aware of errors and encourage the
adoption of appropriate compensatory strategies.
A pronunciation test diagnoses phonological features that are difficult for sts
and should become part of a curriculum. Such tests offer a checklist of features
for the administrator to use in pinpointing difficulties.
A writing diagnostic elicits a writing sample from sts that allows Ts to
identify the rhetorical and linguistic features on which the course needs to
focus special attention.
A diagnostic test of oral production was created by Clifford Prator (1972) to
accompany a manual of English pronunciation. In the test;
Test-takers are directed to read a 150-word passage while they are tape-recorded.
The test administrator then refers to an inventory of phonological items for
analyzing a learner's production.
After multiple listenings, the administrator produces a checklist of errors in 5 categories:
stress and rhythm, intonation, vowels, consonants, and other factors.
This information helps Ts make decisions about aspects of English phonology.
5. Achievement Tests
An achievement test is related directly to lessons, units, or even a total curriculum.
Achievement tests should be limited to particular material addressed in a
curriculum within a particular time frame and should be offered after a course
has focused on the objectives in question.
There is a fine line of difference between a diagnostic test and an achievement test.
Achievement tests analyze the extent to which students have acquired
language features that have already been taught. (They analyze the past.)
Diagnostic tests should elicit information on what students need to work
on in the future. (They analyze the future.)
Primary role of achievement test is to determine whether course objectives
have been met – and appropriate knowledge and skills acquired – by the end
of a period of instruction.
They are often summative because they are administered at the end of a unit or term.
But effective achievement tests can serve as useful washback by showing the
errors of students and helping them analyze their weaknesses and strengths.
Achievement tests range from five- or ten-minute quizzes to three-hour final
examinations, with an almost infinite variety of item types and formats.
Practical steps in constructing classroom tests:
A) Assessing Clear, Unambiguous Objectives
Before giving a test;
examine the objectives for the unit you‟re testing.
Your first task in designing a test, then, is to determine appropriate objectives.
“Students will recognize and produce tag questions, with the correct
grammatical form and final intonation pattern, in simple social conversations."
B) Drawing Up Test Specifications
Test specifications will simply comprise
a) a broad outline of the test
b) what skills you will test
c) what the items will look like
This is an example for test specifications based on the objective stated above:
“Students will recognize and produce tag questions, with the correct
grammatical form and final intonation pattern, in simple social conversations.”
C) Devising Test Tasks
When devising test tasks, consider:
how students will perceive them (face validity)
the extent to which authentic language and contexts are present
potential difficulty caused by cultural schemata
In revising your draft, you should ask yourself some important questions:
1. Are the directions to each section absolutely clear?
2. Is there an example item for each section?
3. Does each item measure a specified objective?
4. Is each item stated in clear, simple language?
5. Does each multiple-choice item have appropriate distractors; that is, are the wrong options
clearly wrong and yet sufficiently "alluring" that they aren't ridiculously easy?
6. Is the difficulty of each item appropriate for your students?
7. Is the language of each item sufficiently authentic?
8. Do the sum of items and the test as a whole adequately reflect the learning objectives?
In the final revision of your test:
time yourself as you take the test
if the test should be shortened or lengthened, make the necessary adjustments
make sure your test is neat and uncluttered on the page
if there is an audio component, make sure that the script is clear.
D) Designing Multiple-Choice Test Items
There are a number of weaknesses in multiple-choice items:
The technique tests only recognition knowledge.
Guessing may have a considerable effect on test scores.
The technique severely restricts what can be tested.
It is very difficult to write successful items.
Washback may be harmful.
Cheating may be facilitated.
Two principles that support multiple-choice formats are practicality and reliability.
Some important terms in multiple-choice items:
Multiple-choice items are all receptive, or selective; that is, the test-taker chooses
from a set of responses rather than creating a response. Other receptive item
types include true-false questions and matching lists.
Every multiple-choice item has a stem, which presents a stimulus, and several
options or alternatives to choose from.
One of those options, the key, is the correct response; the others serve as distractors.
Consider the following four guidelines for designing multiple-choice items for
both classroom-based and large-scale situations:
1. Design each item to measure a specific objective. (Do not measure, for example,
knowledge of modals and knowledge of articles in the same item.)
2. State both stem and options as simply and directly as possible. Do not use
superfluous words, and another rule of succinctness is to remove needless
redundancy from your options.
3. Make certain that the intended answer is clearly the only correct one.
Eliminating unintended possible answers is often the most difficult problem of
designing multiple-choice items. With only a minimum of context in each
stem, a wide variety of responses may be perceived as correct.
4. Use item indices to accept, discard, or revise items: The
appropriate selection and arrangement of suitable multiple-choice items on a
test can best be accomplished by measuring items against three indices: a)
item facility (IF), or item difficulty, b) item discrimination (ID), or
item differentiation, and c) distractor analysis.
a) Item facility (IF) is the extent to which an item is easy or difficult for the proposed
group of test-takers.
Example: if 13 of 20 students answer an item correctly, IF = 13/20 = 0.65 (65%). Items
with an IF between 15% and 85% are generally considered acceptable.
Two good reasons for including a very easy item (IF of 85% or higher) are to build in some
affective feelings of "success" among lower-ability students and to serve as warm-up
items. And very difficult items can provide a challenge to high-ability sts.
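The IF computation can be sketched in Python (a minimal illustration; the response data are hypothetical):

```python
def item_facility(responses):
    """IF = number of correct answers / number of test-takers.

    `responses` is a list of 1 (correct) and 0 (incorrect), one per student.
    """
    return sum(responses) / len(responses)

# Hypothetical item: 13 of 20 students answered correctly.
responses = [1] * 13 + [0] * 7
if_value = item_facility(responses)
print(if_value)  # 0.65, i.e. 65%

# Rule of thumb from the notes: keep items whose IF falls between .15 and .85.
print(0.15 <= if_value <= 0.85)  # True -> acceptable difficulty
```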
b) Item discrimination (ID) is the extent to which an item differentiates between high- and low-ability test-takers.
An item on which high-ability students and low-ability students score equally well
has poor ID because it does not discriminate between the two groups.
An item that garners correct responses from most of the high-ability group
and incorrect responses from most of the low-ability group has good discrimination power.
Example: rank 30 students from the highest to the lowest total score and divide them into
three equal groups. Take the 10 highest scorers (high-ability students, top 10) and the
10 lowest scorers (low-ability students, bottom 10). Suppose 7 of the high group and 2 of
the low group answer an item correctly:
ID = (7 - 2) / 10 = 0.50. The result tells us that the item has a moderate level of ID.
A highly discriminating item would approach 1.0, and one with no discriminating power at all would
be zero. In most cases, you would want to discard an item that scores near zero.
No absolute rule governs the establishment of acceptable and unacceptable ID indices.
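The ID formula from the example can be sketched the same way (the counts of 7 and 2 correct answers are the hypothetical ones from the notes):

```python
def item_discrimination(high_correct, low_correct, group_size):
    """ID = (correct in high group - correct in low group) / group size."""
    return (high_correct - low_correct) / group_size

# Hypothetical item: 7 of the top 10 and 2 of the bottom 10 answered correctly.
id_value = item_discrimination(high_correct=7, low_correct=2, group_size=10)
print(id_value)  # 0.5 -> a moderate level of discrimination

# An item both groups answer equally well has no discriminating power.
print(item_discrimination(high_correct=8, low_correct=8, group_size=10))  # 0.0
```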
c) Distractor efficiency (DE) is the extent to which the distractors “lure” a
sufficient number of test-takers, especially lower-ability ones, and those
responses are somewhat evenly distributed across all distractors.
Example: for one item, the choices of 10 high-ability students and 10 low-ability
students are tallied across the options A-E; C is the correct response.
The item might be improved in two ways:
a) Distractor D doesn't fool anyone. Therefore it probably has no utility. A
revision might provide a distractor that actually attracts a response or two.
b) Distractor E attracts more responses (2) from the high-ability group than
the low-ability group (0). Why are good students choosing this one?
Perhaps it includes a subtle reference that entices the high group but is "over
the head" of the low group, so the latter don't even consider it.
The other two distractors (A and B) seem to be fulfilling their function of
attracting some attention from the lower-ability students.
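A distractor analysis is simply a tally of option choices by ability group. Here is a minimal Python sketch; the response lists are hypothetical, chosen to match the pattern discussed above (C is the key, D lures nobody, E attracts only high-ability students, A and B attract some low-ability students):

```python
from collections import Counter

def distractor_table(high_choices, low_choices, key):
    """Tally how many students in each ability group chose each option."""
    high, low = Counter(high_choices), Counter(low_choices)
    options = sorted(set(high) | set(low) | {key})
    return {opt: (high[opt], low[opt]) for opt in options}

# Hypothetical responses for one item; C is the correct response.
high = ["C"] * 7 + ["E"] * 2 + ["A"]          # 10 high-ability students
low = ["C"] * 4 + ["A"] * 3 + ["B"] * 3       # 10 low-ability students

for opt, (h, l) in distractor_table(high, low, key="C").items():
    print(opt, h, l)
# D never appears in the tally (it attracts no one, so it has no utility);
# E attracts more high- than low-ability students, so it should be revised.
```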
SCORING, GRADING AND GIVING FEEDBACK
As you design a test, you must consider how the test will be scored and graded.
Your scoring plan reflects the relative weight that you place on each section and its items.
Assign more points to the skills the course has emphasized most, for example:
oral production 30%, listening 30%, reading 20%, and writing 20%.
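A weighting scheme like this can be applied mechanically; a minimal Python sketch (the section names and scores are hypothetical):

```python
def weighted_total(section_scores, weights_percent):
    """Combine 0-100 section scores using percentage weights that sum to 100."""
    assert sum(weights_percent.values()) == 100
    return sum(section_scores[s] * w for s, w in weights_percent.items()) / 100

# Weighting from the notes: speaking and listening count more than
# reading and writing.
weights = {"oral": 30, "listening": 30, "reading": 20, "writing": 20}
scores = {"oral": 80, "listening": 70, "reading": 90, "writing": 60}
print(weighted_total(scores, weights))  # 75.0
```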
Grading doesn't mean just giving an "A" for 90-100. It's not that simple.
How you assign letter grades is a product of the country, culture, and context of your class, including:
institutional expectations (most of them unwritten),
explicit and implicit definitions of grades that you have set forth,
the relationship you have established with the class,
sts' expectations that have been engendered by previous tests and quizzes in class.
C) Giving Feedback
Feedback should become beneficial washback. These are some examples of feedback:
1. a letter grade
2. a total score
3. four subscores (speaking, listening, reading, writing)
4. for the listening and reading sections
a. an indication of correct/incorrect responses
b. marginal comments
5. for the oral interview
a. scores for each element being rated
b. a checklist of areas needing work
c. oral feedback after the interview
d. post-interview conference to go over results
6. on the essay
a. scores for each element being rated
b. a checklist of areas needing work
c. marginal and end-of-essay comments, suggestions
d. post-test conference to go over results
e. a self-assessment
7. on all or selected parts of the test, peer checking of results
8. a whole-class discussion of results of the test
9. individual conferences with each student to review the whole test
Decide whether the following statements are TRUE or FALSE.
1. A language aptitude test measures a learner's future success in learning a FL.
2. Language aptitude tests are very common today.
3. A proficiency test is limited to a particular course or curriculum.
4. The aim of a placement test is to place a student into a particular level.
5. Placement tests have many varieties.
6. Any placement test can be used at a particular teaching program.
7. Achievement tests are related to classroom lessons, units, or curriculum.
8. A five-minute quiz can be an achievement test.
9. The first task in designing a test is to determine test specification.
6. FALSE (Not all placement tests suit every teaching program.)
9. FALSE (The first task is to determine appropriate objectives.)
Decide whether the following statements are TRUE or FALSE.
1. It is very easy to develop multiple-choice tests.
2. Multiple-choice tests are practical but not reliable.
3. Multiple-choice tests are time-saving in terms of scoring and grading.
4. Multiple-choice items are receptive.
5. Each multiple-choice item in a test should measure a specific objective.
6. The stem of a multiple-choice item should be as long as possible in order to
help students to understand the context.
7. If the Item Facility value is .10 (10%), it means the item is very easy.
8. Item discrimination index differentiates between high and low-ability sts.
1. FALSE (It seems easy, but is not very easy.)
2. FALSE (They can be both practical and reliable.)
6. FALSE (It should be short and to the point.)
7. FALSE (An item with an IF value of .10 is a very difficult one.)
WHAT IS STANDARDIZATION?
A standardized test presupposes certain standard objectives or criteria that are held
constant from one form of the test to another.
Standardized tests measure a broad band of competencies rather than one particular curriculum.
They are norm-referenced, and the main goal is to place sts in a rank order.
Scholastic Aptitude Test (SAT):
college entrance exam seeking further information
The Graduate Record Exam (GRE):
test for entry into many graduate school programs
Graduate Management Admission Test (GMAT) & Law School Aptitude Test (LSAT):
tests that specialize in particular disciplines
Test of English as a Foreign Language (TOEFL):
produced by the Educational Testing Service (ETS)
The tests are standardized because they specify a set of competencies for a given
domain and, through a process of construct validation, design a set of tasks to measure them.
In general, standardized test items are in the form of MC.
They provide an "objective" means for determining correct and incorrect responses.
However, MC is not the only item type in standardized tests:
human-scored tests of oral and written production are also involved.
ADVANTAGES AND DISADVANTAGES OF STANDARDIZED TESTS:
Advantages:
* Ready-made (Ts don't need to spend time preparing it)
* It can be administered to a large number of sts within a time constraint
* Easy to score thanks to the MC format (computerized or hole-punched scoring)
* It has face validity
Disadvantages:
* Inappropriate use of tests
* Misunderstanding of the difference between direct and indirect testing
DEVELOPING A STANDARDIZED TEST:
- Knowing how to develop a standardized test can be helpful to
revise an existing test, adapt or expand an existing test, create a
smaller-scale standardized test
(A) The Test of English as a Foreign Language (TOEFL): 'general
ability or proficiency' test
(B) The English as a Second Language Placement Test (ESLPT),
San Francisco State University (SFSU): 'placement test at a university'
(C) The Graduate Essay Test (GET), SFSU: 'gate-keeping essay test'
1. Determine the purpose and objectives of the test.
- Standardized tests are expected to be valid and practical
*(A) TOEFL: to evaluate the English proficiency of people whose NL is not English.
Colleges and universities in the US use the TOEFL score to admit or
refuse international applicants for admission.
*(B) ESLPT: to place already admitted sts at SFSU in an appropriate course in academic
writing and oral production, and to provide Ts some diagnostic information about sts.
*(C) GET: to determine whether sts' writing ability is sufficient to permit them to enter
graduate-level courses in their programs (it is offered at the beginning of each term).
2. Design test specification.
The first step is to define the construct of language proficiency.
After breaking language competence down into a subset of four skills, each performance
mode can be examined on a continuum of linguistic units (pronunciation, spelling, word, grammar).
The oral production section tests fluency and pronunciation by using imitation.
The listening section focuses on a particular feature of language or on overall listening comprehension.
The reading section aims to test comprehension of long or short passages, single
sentences, phrases, or words.
The writing section tests writing ability in an open-ended form (free composition), or it
can be structured to elicit anything from correct spelling to discourse-level competence.
Designing test specs for the ESLPT was a simpler task. Its purpose is placement, and construct
validation of the test consisted of an examination of the content of the ESL courses.
*In the recent revision of the ESLPT, content and face validity are important theoretical issues,
and practicality and reliability in tasks and item response formats are equally important.
The specifications mirrored the reading-based and process writing approach used in class.
The specifications for the GET are the skills of writing grammatically and rhetorically acceptable
prose on a topic, with clearly produced organization of ideas and logical development.
3. Design, select, and arrange test tasks/items.
• Content coding: the skills and a variety of subject matter without biasing (the
content must be universal and as neutral as possible)
• Statistical characteristics: these include IF and ID
• Before administration, items are piloted and scientifically selected to meet difficulty
specifications within each subsection, each section, and the test overall.
For written parts, the main problems are:
a) selecting appropriate passages (which conform to the standards of content validity)
b) providing appropriate prompts (they should fit the passages)
c) processing data from pilot testing
• In the MC editing test, the first (easier) task is to choose an appropriate essay within
which to embed errors. A more complicated task is to embed a specified number of errors
from pre-determined error categories. (Ts can derive the categories from sts'
previous errors in written work, and sts' errors can be used as distractors.)
Topics should be appealing and capable of yielding the intended product: an essay with an
organized, logical argument and conclusion. No pilot testing of prompts is conducted.
• Be careful about the potential cultural effect on the numerous international students
who must take the GET.
4. Make appropriate evaluations of different kinds of items.
- IF, ID, and distractor analysis may not be necessary for a classroom (one-time)
test, but they are a must for standardized MC tests.
- For production responses, different forms of evaluation become important
(i.e., practicality, reliability, and facility).
*practicality: clarity of directions, timing of the test, ease of administration, and how
much time is required to score
*reliability: a major player in instances where more than one scorer is
employed, and to a lesser extent when a single scorer has to evaluate tests over
long spans of time, which could lead to deterioration of standards
*facility: key for valid and successful items; unclear directions, complex
language, obscure topics, fuzzy data, or culturally biased information may lead to a
higher level of difficulty
*No data are collected from sts on their perceptions, but the scorers have an
opportunity to reflect on the validity of a given topic.
5. Specify scoring procedures and reporting formats.
-Scores are calculated and reported for
*the three sections of the TOEFL
*a total score
*a separate score
*The ESLPT reports a score for each of the essay sections (each essay is read by two readers).
*The editing section is machine-scanned.
*It provides data to place sts, as well as diagnostic information.
*Sts don't receive their essays back.
*Each GET is read by two trained readers, each giving a score from 1 to 4.
*The recommended threshold score for allowing sts to pursue graduate-level courses is 6.
*If a st scores below 6, he or she either repeats the test or takes a remedial course.
6. Performing ongoing construct validation studies.
Any standardized test must be accompanied by systematic periodic
corroboration of its effectiveness and by steps toward its improvement
*The latest study on the TOEFL examined the content characteristics of the test
from a communicative perspective, based on current research in applied
linguistics and language proficiency assessment.
*The development of the new ESLPT involved a lengthy process of both content
and construct validation, along with facing such practical issues as scoring the
written sections and a machine-scorable MC answer sheet.
*There is no research to validate the GET itself. Administrators rely on
research on university-level academic writing tests such as the TWE.
*Some criticism of the GET has come from international test-takers who posit
that the topics and time limits of the GET work to the disadvantage of writers
whose native language is not English.
Profiles of four commercial proficiency tests:

TOEFL (Test of English as a Foreign Language)
Used by: U.S. universities and colleges for admission purposes
Format: computer-based and paper-based; multiple-choice responses and essay
Length: up to 4 hours (CB); 3 hours (PB)
Content (CB): a listening section which includes dialogs, short conversations, academic
discussions, and mini-lectures; a structure section which tests formal language with two
types of questions (completing incomplete sentences and identifying one of four underlined
words or phrases that is not acceptable in English); a reading section which includes four
to five passages on academic subjects with 10-14 questions for each passage; a writing
section which requires examinees to compose an essay on a given topic

MELAB (Michigan English Language Assessment Battery)
Used by: U.S. and Canadian language programs and colleges; some worldwide
Format: multiple-choice responses and essay
Length: 2.5 to 3.5 hours
Content: a 30-minute impromptu essay on a given topic; a 25-minute multiple-choice
listening comprehension test; a 100-item, 75-minute multiple-choice test of grammar,
cloze reading, vocabulary, and reading comprehension; an optional oral interview

IELTS (International English Language Testing System)
Used by: Australian, British, Canadian, and New Zealand academic institutions and
professional organizations, and some American academic institutions
Format: computer-based for the Reading and Writing sections; paper-based for the Listening
and Speaking parts; multiple-choice responses, essay, and oral production
Length: 2 hours, 45 minutes
Content: a 60-minute reading; a 60-minute writing; a 30-minute listening of four sections;
a 10- to 15-minute speaking of five sections

TOEIC (Test of English for International Communication)
Used by: organizations worldwide, in workplace settings
Format: computer-based and paper-based
Content: a 100-item, approximately 45-minute listening administered by audiocassette,
which includes statements, questions, short conversations, and short talks;
a 100-item, 75-minute reading which includes cloze sentences, error
recognition, and reading comprehension
Mid 20th Century
Standardized tests had unchallenged popularity and growth.
Standardized tests brought convenience, efficiency, air of empirical science.
Tests were considered to be a way of making reforms in education.
Quickly and cheaply assessing students became a political issue.
Late 20th Century
*There was possible inequity and disparity between the content of such tests and
what teachers teach in their classes.
*The claims in mid-20th century began to be questioned/criticised in all areas.
*Teachers were in the leading position of those challenges.
The Last 20 Years
*Educators become aware of weaknesses in standardized testing: They were
not accurate measures of achievement and success and they were not based on
carefully framed, comprehensive and validated standards of achievement.
*A movement has started to establish standards to assess students of all ages
and subject-matter areas.
*There have been efforts on basing the standardised tests on clearly specified
criteria for each content area being measured.
Some teachers claimed that those tests were unfair because there was dissimilarity
between the content and tasks of the tests and what they were teaching in their classes.
Becoming aware of these weaknesses, educators started to establish standards
on which sts of all ages and subject-matter areas might be assessed.
Most departments of education at the state level in the US have specified the
appropriate standards (criteria, objectives) for each grade level (pre-school to
grade 12) and each content area (math, science, arts...).
The construction of standards makes possible concordance between
standardized test specifications and those goals and objectives.
(ESL, ESOL, ELD, ELLs) (LEP is discarded because of the negative connotation
of the word "limited")
In creating benchmarks for accountability, there is a tremendous responsibility
to carry out a comprehensive study of a number of domains:
Categories of language: phonology, discourse, pragmatic and functional features
Specification of what ELD students‟ needs are.
A realistic scope of standards to be included in the curriculum.
Standards for teachers (qualifications, expertise, training).
A thorough analysis of the means available to assess student attainment of
those standards.
The development of standards obviously implies the responsibility for
correctly assessing their attainment.
When it was found that the standardized tests of the past decades were not in line with
newly developed standards, the interactive process not only of developing
standards but also of creating standards-based assessment started.
Specialists design, revise and validate many tests.
The California English Language Development Test (CELDT) is a battery of
instruments designed to assess attainment of ELD standards across grade
level. (not publicly available)
A language and literacy assessment rubric was used to collect students' work.
Teachers' observations were recorded on scannable forms.
This provided useful data on students' performance in oral production, reading,
and writing in different grades.
CASAS AND SCANS
CASAS: (Comprehensive Adult Student Assessment System):
Designed to provide broadly based assessments of ESL curricula across the US.
It includes more than 80 standardized assessment instruments used to;
*place sts in programs *diagnose learners‟ needs
*certify mastery of functional skills
Used at higher levels of education (colleges, adult and language schools, the workplace).
SCANS (Secretary's Commission on Achieving Necessary Skills):
outlines the competencies necessary for language in the workplace.
The competencies are acquired and maintained through training in basic skills (the four skills);
thinking skills (reasoning and problem solving);
personal qualities (self-esteem and sociability)
Resources: allocating time, materials, staff, etc.
Interpersonal skills: teamwork, customer service, etc.
Information: processing and evaluating data, organising files, etc.
Systems: understanding social and organizational systems
Technology: use and application
TEACHER STANDARDS (what a teacher should be like)
Linguistic and language development
Culture and interrelationship between language and culture
Planning and managing instruction
Consequences of standards-based and standardized testing
Advantages:
High level of practicality and reliability
Provides insights into academic performance
Accuracy in placing large numbers of test-takers onto a norm-referenced scale
Ongoing construct validation studies
Disadvantages:
They involve a number of test biases
A small but significant number of test-takers are assessed neither fairly nor accurately
Fosters extrinsic motivation
Multiple intelligences are not considered
There is a danger of test-driven learning and teaching
In general, performance is not directly assessed
Standardized tests involve many test biases (language, culture, race, gender, learning styles).
The National Center for Fair and Open Testing has collected claims of test bias (e.g., in
reading texts and listening stimuli) from teachers, parents, students, and legal consultants.
Standardised tests promote logical-mathematical and verbal-linguistic intelligences to the
virtual exclusion of the other contextualised, integrative intelligences. (Some learners
may need to be assessed with interviews, portfolios, samples of
work, demonstrations, or observation reports: more formative assessment rather than summative.)
That would mitigate test bias problems, but it is difficult to control in standardized items.
Those who use standardised tests for gate-keeping purposes, with few if any other
assessments, would do well to consider multiple measures before attributing infallible
predictive power to standardised tests.
Test-driven learning and teaching
This is another consequence of standardized testing. When students know that one single
measure of performance will determine their lives, they are less likely to take positive
attitudes toward learning: extrinsic motivation, not intrinsic.
Teachers are also affected by test-driven policies. They are under pressure to make sure
their students excel in the exam, at the risk of ignoring other objectives in the curriculum.
A more serious effect has been to punish schools in lower-socioeconomic neighbourhoods.
ETHICAL ISSUES: CRITICAL LANGUAGE TESTING
One of the by-products of the rapidly growing testing industry is the danger of an abuse of power.
'Tests represent a social technology deeply embedded in education, government and
business; tests are most powerful as they are often the single indicators for determining
the future of individuals' (Shohamy)
Standards, specified by client educational institutions, bring with them certain ethical issues
surrounding the gate-keeping nature of standardized tests.
Teachers can demonstrate standards in their teaching.
Teachers can be assessed through their classroom performance.
Performance can be detailed with 'indicators': examples of evidence that the teacher can
meet a part of a standard.
Indicators are more than 'how to' statements (they are complex evidence of performance).
Performance-based assessment is integrated (not a checklist of discrete assessments).
Each assessment has performance criteria against which performance can be measured.
Performance criteria identify to what extent the teacher meets the standard.
Student learning is at the heart of the teacher‟s performance.
OBSERVING THE PERFORMANCE OF FOUR SKILLS
1. Two interacting concepts:
Sometimes the performance does not indicate true competence:
a bad night's rest, illness, an emotional distraction, test anxiety, a memory
block, or other student-related reliability factors.
One important principle for assessing a learner's competence is to consider the
fallibility of the results of a single performance, such as that produced in a test.
Measurement should involve several performances and contexts, such as:
Several tests that are combined to form an assessment
(e.g., listening tasks designed to assess the candidate's ability to process various
forms of spoken English)
A single test with multiple test tasks to account for learning styles and
performance variables
In-class and extra-class graded work
Alternative forms of assessment (e.g., journal, portfolio, conference,
observation, self-assessment, peer assessment)
Multiple measures give more reliable & valid assessment than a single measure
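As a rough sketch of the multiple-measures idea (the measure names and weights below are hypothetical, not taken from any actual assessment scheme), several measures can be combined into one weighted composite score:

```python
# Combine several assessment measures into a weighted composite score.
# Measure names and weights are illustrative assumptions only.
def composite_score(measures, weights):
    """measures and weights are dicts keyed by measure name; scores on 0-100."""
    total_weight = sum(weights.values())
    return sum(measures[m] * weights[m] for m in measures) / total_weight

scores = {"test": 72, "portfolio": 85, "self_assessment": 80, "peer_assessment": 78}
weights = {"test": 0.4, "portfolio": 0.3, "self_assessment": 0.15, "peer_assessment": 0.15}
print(round(composite_score(scores, weights), 1))  # -> 78.0
```

The weighting itself is a policy decision; the point is only that no single measure determines the result.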
For the receptive skills, we can observe neither the process of performing nor a product:
1. Receptive skills – listening performance:
The process of listening performance is
invisible and inaudible: internalizing meaning from the auditory
signals being transmitted to the ear and brain.
2. The productive skills allow us to hear and see the process as it is performed;
writing gives a permanent product, the written piece.
But unless speech is recorded, there is no permanent observable product for speaking.
THE IMPORTANCE OF LISTENING
Listening has often played second fiddle to its counterpart, speaking, but it is
rare to find a listening-only test.
Listening is often implied as a component of speaking.
Oral production ability (other than monologues, speeches, reading aloud, and
the like) is only as good as one's listening comprehension.
Input in the aural-oral mode accounts for a large proportion of successful language acquisition.
BASIC TYPES OF LISTENING
For an effective test, designing appropriate assessment tasks in listening begins
with the specification of objectives, or criteria.
The following processes flash through your brain:
1. Recognize speech sounds and hold a temporary "imprint" of them in
short-term memory.
2. Simultaneously determine the type of speech event.
3. Use (bottom-up) linguistic decoding skills and/or (top-down)
background schemata to bring a plausible interpretation to the message, and
assign a literal and intended meaning to the utterance. (Jeremy Harmer, p. 305,
similarly stresses the importance of activating students' schemata.)
4. In most cases, delete the exact linguistic form in which the message
was originally received in favor of conceptually retaining important or
relevant information in long-term memory.
Four commonly identified types of listening performance:
Listening for perception of the components.
Teachers use audio material on tape or hard disk when they want their students
to practice listening skills.
Extensive listening will usually take place outside the classroom.
Material for extensive listening can be obtained from a number of sources.
Micro and Macro skills
Attending to smaller bits and chunks, in more of a bottom-up process:
Discriminate among the sounds of English.
Retain chunks of language of different lengths in short-term memory.
Recognize stress patterns, words in stressed/unstressed positions, rhythmic
structure, intonation contours, and their role in signaling information.
Recognize reduced forms of words.
Distinguish word boundaries, recognize the core of a word, and interpret
word order patterns and their significance.
Process speech at different rates of delivery.
Process speech containing pauses, errors, corrections, and other performance
variables.
Recognize grammatical word classes (nouns, verbs, etc.), systems (e.g.,
tense, agreement, pluralization), patterns, rules, and elliptical forms.
Detect sentence constituents and distinguish between major and minor constituents.
Recognize that a particular meaning may be expressed in different grammatical forms.
Recognize cohesive devices in spoken discourse.
Focusing on larger elements involved in a top-down approach:
Recognize the communicative functions of utterances, according to situations,
participants, and goals.
Infer situations, participants, and goals using real-world knowledge.
From events, ideas, and so on described, predict outcomes, infer links and
connections between events, deduce causes and effects, and detect such
relations as main idea, supporting idea, new information, given
information, generalization, and exemplification.
Distinguish between literal and implied meanings.
Use facial, kinesic, body-language, and other nonverbal clues to decipher meanings.
Develop and use a battery of listening strategies, such as detecting key
words, guessing the meaning of words from context, appealing for help, and signaling
comprehension or lack thereof.
What Makes Listening Difficult
1. Clustering:
Chunking into phrases, clauses, constituents
2. Redundancy:
Repetitions, rephrasings, elaborations, and insertions
3. Reduced Forms:
Understanding reduced forms that may not be a part of learners' past
experience in classes where only formal "textbook" language has been presented
4. Performance Variables:
Hesitations, false starts, corrections, diversions
5. Colloquial Language:
Idioms, slang, reduced forms, shared cultural knowledge
6. Rate of Delivery:
Keeping up with the speed of delivery, processing automatically as the speaker continues
7. Stress, Rhythm, and Intonation:
Correctly understanding the prosodic elements of spoken language, which is more
difficult than understanding the smaller phonological bits and pieces
8. Interaction:
Negotiation, clarification, attending signals, turn-taking, maintenance, termination
Designing Assessment Tasks
• Recognizing Phonological and Morphological Elements
Phonemic pair, consonants
He’s from California
A. He’s from California
B. She’s from California
Phonemic pair, vowels
Is he living?
A. Is he leaving?
B. Is he living?
Morphological pair, -ed ending
I missed you very much.
A. I missed you very much
B. I miss you very much
Stress pattern in can’t
My girlfriend can’t go to the party
A. My girlfriend can go to the party
B. My girlfriend can’t go to the party
One word stimulus
– Sentence Paraphrase
Hello, my name is Keiko. I come from Japan.
A. Keiko is comfortable in Japan
B. Keiko wants to come to Japan
C. Keiko is Japanese
D. Keiko likes Japan
– Dialogue paraphrase
man: Hi, Maria, my name is George.
woman: Nice to meet you, George. Are you American?
man: No, I'm Canadian.
A. George lives in the United States
B. George is American
C. George comes from Canada
D. Maria is Canadian
Designing Assessment Tasks
• Appropriate response to a question
How much time did you take to do your homework?
A. In about an hour.
B. About an hour.
C. About $10.
D. Yes, I did.
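A minimal sketch of how such multiple-choice items could be machine-scored against an answer key (the item numbering and the sample responses are made up for illustration):

```python
# Score multiple-choice listening items against an answer key.
# Item numbers and the response set are hypothetical.
answer_key = {1: "B", 2: "C"}   # item number -> correct option
responses = {1: "B", 2: "A"}    # one test-taker's choices

score = sum(1 for item, correct in answer_key.items()
            if responses.get(item) == correct)
print(f"{score}/{len(answer_key)} correct")  # -> 1/2 correct
```

This mechanical scorability is exactly what gives selected-response listening items their high practicality and reliability.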
• Open-ended response to a question
Test-takers write or speak
How much time did you take to do your homework?
Designing Assessment Tasks : Selective Listening
The test-taker listens to a limited quantity of aural input and discerns some specific information.
Listening Cloze (cloze dictation or partial dictation)
Listening cloze tasks require the test-taker to listen to a story, monologue,
or conversation and simultaneously read a written text in which selected words or
phrases have been deleted.
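The deletion step could be sketched as follows; a fixed every-nth-word deletion is assumed here for simplicity, though listening cloze tasks more often delete selected content words:

```python
# Build a listening-cloze ("partial dictation") sheet by blanking every nth word.
# Fixed-ratio deletion is an assumption; teachers often hand-pick words instead.
def make_cloze(text, n=7):
    words = text.split()
    key = []
    for i in range(n - 1, len(words), n):   # every nth word (0-indexed)
        key.append(words[i])
        words[i] = "_____"
    return " ".join(words), key             # written sheet + answer key

transcript = ("The family began their journey on a warm sunny morning "
              "and drove slowly through the quiet streets of the town")
sheet, key = make_cloze(transcript)
```

Note how a fixed ratio can blank function words like "a", which is one reason hand-selected deletions are preferred in practice.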
One Potential Weakness of the listening cloze technique:
Such tasks may simply become reading comprehension tasks. Test-takers who are asked
to listen to a story with periodic deletions in the written version may not need to listen
at all, yet may still be able to respond with the appropriate word or phrase.
Information transfer: aurally processed information must be transferred to a visual
representation, e.g., labelling a diagram, identifying an element in a picture,
completing a form, or showing routes on a map.
(Example: test-takers see a chart of Lucy's daily schedule and fill in the schedule.)
Sentence repetition: test-takers must retain a stretch of language long enough to
reproduce it, and then must respond with an oral repetition of that stimulus.
DESIGNING ASSESSMENT TASKS: EXTENSIVE LISTENING
Dictation: Test-takers hear a passage, typically 50-100 words, recited three times:
First reading: natural speed, no pauses; test-takers listen for the gist.
Second reading: slowed speed, with a pause at each break; test-takers write.
Third reading: natural speed; test-takers check their work.
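Dictation is often scored by counting correctly reproduced words. A simplified sketch follows; real scoring schemes typically distinguish spelling errors from grammatical ones, which this deliberately ignores:

```python
# Crude dictation scorer: proportion of target words reproduced in order.
# Ignores spelling-vs-grammar distinctions that real schemes usually make.
from difflib import SequenceMatcher

def dictation_score(target, response):
    t = target.lower().split()
    r = response.lower().split()
    matched = sum(b.size for b in
                  SequenceMatcher(None, t, r).get_matching_blocks())
    return matched / len(t)

# "was" -> "is" is the only mismatch: 5 of 6 words match.
print(dictation_score("the weather was cold that morning",
                      "the weather is cold that morning"))
```

Even this crude version illustrates why dictation scoring needs explicit specifications: the same response earns different marks depending on what counts as a "correct" word.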
Communicative Stimulus-Response Tasks
The test-takers are presented with a stimulus monologue or conversation and
then are asked to respond to a set of comprehension questions.
First: Test-takers hear the instructions and the dialogue or monologue.
Second: Test-takers read the multiple-choice comprehension questions and
then choose the correct answer.
Authentic Listening Tasks
Buck (2001, p. 92): "Every test requires some components of communicative
language ability, and no test covers them all. Similarly, every task shares some
characteristics with target-language tasks, and no test is completely authentic."
Alternatives to assess comprehension in a truly communicative context
Listening to a lecture and writing down the important ideas.
Disadvantage: scoring is time-consuming.
Advantages: it mirrors the real classroom situation and fulfills the criteria of
cognitive demand, communicative language, and authenticity.
Editing a written version of an aural stimulus.
Paraphrasing a story or conversation.
Potential stimuli include song lyrics, poetry, radio and TV news reports, etc.
Listening to a story and simply retelling it, either orally or in writing, shows full comprehension.
Difficulties: scoring and reliability.
Validity, cognitive demand, communicative ability, and authenticity are well incorporated
into the task.
Interactive listening (face to face conversations)
Challenges of testing speaking:
1- The interaction of speaking and listening
2- Elicitation techniques
BASIC TYPES OF SPEAKING
1.Imitative: (parrot back) Testing the ability to imitate a word, phrase, sentence.
Pronunciation is tested. Examples: Word, phrase, sentence repetition
2. Intensive: The purpose is to produce short stretches of oral language, designed
to demonstrate competence in a narrow band of grammatical, phrasal, lexical, or
phonological relationships (stress / rhythm / intonation).
3. Responsive: (interacting with an interlocutor) includes interaction and tests
comprehension, but at the somewhat limited level of very short conversations: standard
greetings, small talk, simple requests and comments, and the like.
4. Interactive: The difference between responsive and interactive speaking is the length and
complexity of the interaction, which includes multiple exchanges and/or multiple participants.
5. Extensive (monologue): Extensive oral production tasks include speeches, oral
presentations, and story-telling, during which the opportunity for oral interaction from
listeners is either highly limited (perhaps to nonverbal responses) or ruled out altogether.
Micro- and Macroskills of Speaking
The microskills of speaking refer to producing small chunks of language such as
phonemes, morphemes, words, and phrasal units. The macroskills involve the
speaker's focus on the larger elements: fluency, discourse, function,
style, cohesion, nonverbal communication, and strategic options.
1. Appropriately accomplish communicative functions according to situations,
participants, and goals.
2. Use appropriate styles, registers, implicature, redundancies, pragmatic
conventions, conversation rules, floor-keeping and floor-yielding, interrupting, and
other sociolinguistic features in face-to-face conversations.
3. Convey links and connections between events and communicate such
relations as focal and peripheral ideas, events and feelings, new information
and given information, generalization and exemplification.
4. Convey facial features, body language, and other nonverbal cues along with
verbal language.
5. Develop and use a battery of speaking strategies, such as emphasizing key
words, rephrasing, providing a context for interpreting the meaning of words,
appealing for help, and accurately assessing how well your interlocutor is
understanding you.
1. Produce differences among English phonemes and allophonic variants.
2. Produce chunks of language of different lengths.
3. Produce English stress patterns, words in stressed and unstressed positions,
rhythmic structure, and intonation contours.
4. Produce reduced forms of words and phrases.
5. Use an adequate number of lexical units (words) to accomplish pragmatic
purposes.
6. Produce fluent speech at different rates of delivery.
7. Monitor one's own oral production and use various devices (pauses, fillers,
self-corrections, backtracking) to enhance the clarity of the message.
8. Use grammatical word classes (nouns, verbs, etc.), systems
(tense, agreement, pluralization), word order, patterns, rules, and elliptical
forms.
9. Produce speech in natural constituents: in appropriate phrases, pause
groups, breath groups, and sentence constituents.
10. Express a particular meaning in different grammatical forms.
11. Use cohesive devices in spoken discourse.
Three important issues as you set out to design tasks;
1. No speaking task is capable of isolating the single skill of oral
production. Concurrent involvement of the additional performance of aural
comprehension, and possibly reading, is usually necessary.
2.Eliciting the specific criterion you have designated for a task can be
tricky because beyond the word level, spoken language offers a number of
productive options to test-takers. Make sure your elicitation prompt achieves
its aims as closely as possible.
3.It is important to carefully specify scoring procedures for a response
so that ultimately you achieve as high a reliability index as possible.
interaction between speaking and listening or reading is unavoidable.
Interaction effect: impossibility of testing speaking in isolation
Elicitation techniques: to elicit specific criterion we expect from test takers.
Scoring: to achieve reliability
Designing Assessment Tasks: Imitative Speaking
Paying more attention to pronunciation, especially suprasegmentals, is an
attempt to help learners be more comprehensible.
As long as repetition tasks are not allowed to occupy a dominant role in an overall oral
production assessment, and as long as a negative washback effect is avoided, they can be useful.
In a simple repetition task, test-takers repeat the stimulus, whether it is a pair
of words, a sentence, or perhaps a question ( to test for intonation production.)
Word repetition task:
Scoring specifications must be clear in order to avoid reliability breakdowns. A common
form of scoring simply indicates a 2- or 3-point system for each response.
Scoring scale for repetition tasks:
2 acceptable pronunciation
1 comprehensible, partially correct pronunciation
0 silence, seriously incorrect pronunciation
The longer the stretch of language, the more possibility for error and therefore
the more difficult it becomes to assign a point system to the text.
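Applying the 2-1-0 scale above across a set of repetition items, a total can be reported as a percentage. A small sketch, with hypothetical rater judgments:

```python
# Aggregate 0-2 rubric judgments across repetition items into a percentage.
# The five judgments below are hypothetical rater decisions.
RUBRIC = {
    2: "acceptable pronunciation",
    1: "comprehensible, partially correct pronunciation",
    0: "silence or seriously incorrect pronunciation",
}

def percentage(item_scores, max_per_item=2):
    return 100 * sum(item_scores) / (max_per_item * len(item_scores))

judgments = [2, 2, 1, 0, 2]    # rater's scores for five repetition items
print(percentage(judgments))   # -> 70.0
```

The arithmetic is trivial; the reliability problem lies entirely in whether two raters assign the same 0, 1, or 2 to a given response.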
Research on the PhonePass test has supported the construct validity of its repetition tasks,
not just for phonological ability but also for discourse and overall oral production ability.
The PhonePass tests elicits computer-assisted oral production over a telephone.
Test-takers read aloud, repeat sentences, say words, and answer questions.
Test-takers are directed to telephone a designated number and listen for directions.
The test has five sections.
Part A: Test-takers read aloud selected sentences from among those printed on the test sheet.
Part B: Test-takers repeat sentences dictated over the phone.
Part C: Test-takers answer questions with a single word or a short phrase of two or three words.
Part D: Test-takers hear three word groups in random order and link them in a correctly
ordered sentence.
Part E: Test-takers have 30 seconds to talk about their opinion of a topic that is
dictated over the phone.
Scores are calculated by a computerized scoring template and reported back to the test-taker
within minutes.
Pronunciation, reading fluency, repeat accuracy and fluency, and listening vocabulary are
the sub-skills scored.
The scoring procedure has been validated against human scoring, with extraordinarily
high reliabilities and correlation statistics.
Designing Assessment Tasks: Intensive Speaking
Test-takers are prompted to produce short stretches of discourse (no more than a
sentence) through which they demonstrate linguistic ability at a specified level of language.
Intensive tasks may also be described as limited-response tasks, mechanical tasks, or
what classroom pedagogy would label controlled responses.
Directed Response Tasks
The administrator elicits a particular grammatical form or a transformation of a sentence.
Such tasks are clearly mechanical and not communicative (a possible drawback), but they
do require minimal processing of meaning in order to produce the correct
grammatical output (a practical advantage).
Read – Aloud Tasks (to improve pronunciation and fluency)
Read-aloud tasks go beyond the sentence level, up to a paragraph or two. The task is easily
administered by selecting a passage that incorporates the test specs and by recording the
test-taker's output; scoring is easy because all of the test-taker's oral production is controlled.
While reading aloud offers certain practical advantages (predictable
output, practicality, reliability in scoring), there are several drawbacks:
reading aloud is somewhat inauthentic in that we seldom read anything aloud to
someone else in the real world, with the exception of a parent reading to a child.
Sentence/Dialogue Completion Tasks and Oral Questionnaires
(to produce omitted lines or words in a dialogue appropriately)
Test-takers read a dialogue in which one speaker's lines have been omitted. Test-takers
are first given time to read through the dialogue to get its gist and to
think about appropriate lines to fill in.
An advantage of this technique lies in its moderate control of the output of the
test-taker (a practical advantage).
One disadvantage of this technique is its reliance on literacy and on the ability to
transfer easily from written to spoken English (a possible drawback).
Another disadvantage is the contrived, inauthentic nature of the task (a drawback).
Picture – Cued Tasks (to elicit oral production by using pictures)
One of the more popular ways to elicit oral language performance at both
intensive and extensive levels is a picture-cued stimulus that requires a
description from the test-taker.
Assessment of oral production may be stimulated through a more elaborate
picture. (practical advantages)
Maps are another visual stimulus that can be used to assess the language
forms needed to give directions and specify locations.(practical advantage)
Scoring may be problematic depending on the expected performance.
Scoring scale for intensive tasks
2 comprehensible; acceptable target form
1 comprehensible; partially correct target form
0 silence, or seriously incorrect target form
Translation (of Limited Stretches of Discourse) (to translate from the native
language into the target language)
The test-takers are given a native language word, phrase, or sentence and are
asked to translate it.
As an assessment procedure, the advantages of translation lie in its control of
the output of the test-taker, which of course means that scoring is more easily specified.
Designing Assessment Tasks: Responsive Speaking
Assessment involves brief interactions with an interlocutor, differing from
intensive tasks in the increased creativity given to the test-taker and from
interactive tasks by the somewhat limited length of utterances.
Question and Answer
Question and answer tasks can consist of one or two questions from an
interviewer, or they can make up a portion of a whole battery of questions and
prompts in an oral interview.
The first question is intensive in its purpose; it is a display question intended
to elicit a predetermined correct response.
Questions at the responsive level tend to be genuine referential questions in
which the test-taker is given more opportunity to produce meaningful
language in response.
Test-takers respond with a few sentences at most.
Test-takers respond with questions.
A potentially tricky form of oral production assessment involves more than
one test-taker with an interviewer. With two students in an interview
context, both test-takers can ask questions of each other.
Giving Instruction and Directions
The technique is simple: the administrator poses the problem, and the test-taker
responds. Scoring is based primarily on comprehensibility and
secondarily on other specified grammatical or discourse categories.
Eliciting instructions or directions
Paraphrasing: test-takers read or hear a number of sentences and produce a paraphrase of each.
Advantages: paraphrasing tasks elicit short stretches of output and perhaps tap into the
test-taker's ability to practice the conversational art of conciseness by reducing the output/input ratio.
If you use short paraphrasing tasks as an assessment procedure, it‟s important
to pinpoint objective of task clearly. In this case, the integration of listening
and speaking is probably more at stake than simple oral production alone.
TEST OF SPOKEN ENGLISH (TSE)
The TSE is a 20-minute audio-taped test of oral language ability within an
academic or professional environment.
The scores are also used for selecting and certifying health professionals such
as physicians, nurses, pharmacists, physical therapists, and veterinarians.
The tasks on the TSE are designed to elicit oral production in various discourse
categories rather than in selected phonological, grammatical, or lexical targets.
Designing Assessment Tasks: Interactive Speaking
Tasks include long interactive discourse ( interview, role plays, discussions, games).
A test administrator and a test-taker sit down in a direct face-to-face exchange and
proceed through a protocol of questions and directives. The interview is then scored on
accuracy in pronunciation and/or grammar, vocabulary usage, fluency, pragmatic
appropriateness, task accomplishment, and even comprehension.
Placement interviews, designed to get a quick spoken sample from a student to verify
placement into a course.
1. Warm-up: (small talk) The interviewer directs mutual introductions, helps the test-taker
become comfortable, apprises the test-taker of the format, and allays anxieties. (No scoring.)
2. Level check: The interviewer stimulates the test-taker to respond using expected or
predicted forms and functions. This stage gives the interviewer a picture of the test-taker's
extroversion, readiness to speak, and confidence. Linguistic target criteria are scored in this stage.
3. Probe: Probe questions and prompts challenge test-takers to reach the heights of their
ability, to extend beyond the limits of the interviewer's expectation, through difficult questions.
4. Wind-down: This phase is a short period of time during which the interviewer
encourages the test-taker to relax with easy questions and sets the test-taker at ease.
The success of an oral interview will depend on:
*clearly specifying administrative procedures of the assessment
*focusing the questions and probes on the purpose of the assessment
*appropriately eliciting an optimal amount and quality
of oral production from the test-taker (biased for best performance)
*creating a consistent, workable scoring system
Role playing is a popular pedagogical activity in communicative language-teaching classes.
Within the constraints set forth by the guidelines, it frees students to be somewhat
creative in their linguistic output.
While role play can be controlled or „‟guided‟‟ by the interviewer, this
technique takes test-takers beyond simple intensive and responsive levels to a
level of creativity and complexity that approaches real-world pragmatics.
Scoring presents the usual issues in any task that elicits somewhat
unpredictable responses from test-takers.
Discussions and Conversations
As formal assessment devices, discussions and conversations with and among
students are difficult to specify and even more difficult to score.
But as informal techniques to assess learners, they offer a level of authenticity
and spontaneity that other assessment techniques may not provide.
Assessing the performance of participants through score or checklists should
be carefully designed to suit the objectives of the observed discussion.
Discussion is an integrative task, and so it is also advisable to give some cognizance to
comprehension performance in evaluating learners.
Among informal assessment devices are a variety of games that directly
involve language production.
1. 'Tinkertoy' game (Logo block)
3.Information gap grids
ORAL PROFICIENCY INTERVIEW (OPI)
The best-known oral interview format is the Oral Proficiency Interview (OPI).
The OPI is the result of a historical progression of revisions under the auspices of
several agencies, including the Educational Testing Service and the American
Council on the Teaching of Foreign Languages (ACTFL).
The OPI is carefully designed to elicit pronunciation, fluency and integrative
ability, sociolinguistic and cultural knowledge, grammar, and vocabulary.
Performance is judged by the examiner to be at one of ten possible levels on
the ACTFL-designated proficiency guidelines for speaking: Superior;
Advanced-high, mid, low; Intermediate-high, mid,low; Novice-high, mid,low.
Designing Assessment Tasks: Extensive Speaking
Extensive speaking involves complex, relatively lengthy stretches of discourse.
Tasks are variations on monologues, with minimal verbal interaction.
In the workplace it would not be uncommon to be called on to present a report, a paper,
a marketing plan, a sales idea, a design of a new product, or a method.
Once again the rules for effective assessment must be invoked:
a- specify the criterion,
b-set appropriate tasks,
c- elicit optimal output,
d-establish practical, reliable scoring procedures.
Scoring is the key assessment challenge.
Picture-Cued Story-Telling
One of the most common techniques for eliciting oral production is through visual stimuli:
pictures, photographs, diagrams, and charts.
Consider a picture or series of pictures as a stimulus for a longer story or description.
Criteria for scoring need to be clear about what it is you are hoping to assess.
Retelling a Story, News Event
In this type of task, test-takers hear or read a story or news event that they are
asked to retell.
The objectives in assigning such a task vary from listening comprehension of
the original to production of a number of oral discourse features (communicating
sequences and relationships of events, stress and emphasis patterns,
"expression" in the case of a dramatic story), fluency, and interaction with the hearer.
Scoring should meet the intended criteria.
Translation (of Extended Prose)
Longer texts are presented for the test-taker to read in the native language and then
translate into English (dialogues, directions for assembly of a product, a synopsis of a
story, play, or movie, directions on how to find something on a map, and other genres).
The advantage of translation is in the control of the content, vocabulary, and, to
some extent, the grammatical and discourse features.
The disadvantage is that translation of longer texts is a highly specialized skill
for which some individuals obtain post-baccalaureate degrees.
Criteria for scoring should take into account not only the purpose in stimulating a
translation but also the possibility of errors that are unrelated to oral production ability.