Fundamentals of Language Assessment

Christine Coombe & Nancy Hubley

Table of Contents
Introduction
About the Authors
The Cornerstones of Testing
Cornerstones Checklist
Test Types
Test Development Process
Guidelines for Classroom Testing
Writing Objective Test Items
Assessing Writing
Assessing Reading
Assessing Listening
Assessing Speaking
Student Test-Taking Strategies
Statistics
Sample Classroom Statistics Worksheet
Technology for Testing
Internet Resources
Alternative Assessment
Testing Acronyms
Glossary of Important Testing Terms
Annotated Bibliography
Contact Information
Introduction

The past ten years have seen a number of developments in the exam production and evaluation process. Although the specialist testing literature has burgeoned in this decade, information about recent developments and issues has been slow to filter down to the classroom level. This workshop provides an opportunity for teachers and administrators to explore and evaluate aspects of the testing process and discuss current issues. "Fundamentals of Language Assessment" focuses on the principles of test design, construction, administration and analysis that underpin good testing practice, but this session presupposes no prior knowledge of testing or statistics. Participants will be provided with the essential theoretical and practical background that they need to construct and analyze their own tests or to evaluate other tests.

During the session, basic testing techniques will be covered in a brief presentation immediately followed by practice in small groups. Participants will receive a "kit" of materials which they will actively explore during the workshop and later take away to guide them when they apply their new skills. In addition to becoming familiar with standard approaches to classroom testing, participants will be introduced to alternative forms of assessment such as self-assessment, portfolio writing, and student-designed tests.

The organizers believe in learning through doing. A key activity in the workshop is reviewing and critiquing tests that exemplify good and bad testing practices. Through this activity, participants will acquire the useful skill of fixing bad test items and salvaging the best aspects of a test. Experience has shown that teachers value something that they can put into practice, something that immediately enhances their skills. We hope that attendees will leave the workshop equipped with a heightened awareness of current testing issues and the means to put them into practice by creating effective classroom tests.
About the Authors

Christine Coombe teaches English at Dubai Men's College and is an Assessment Leader for the UAE's Higher Colleges of Technology. Nancy Hubley formerly worked in these roles for HCT. In 1997, Christine and Nancy founded the Current Trends in English Language Testing (CTELT) Conference, now an annual international event. They frequently provide assessment training and serve as English Language Specialists for the U.S. Department of State. Together, they edited the Assessment Practices (2003) volume for the TESOL Case Studies series. Their co-authored volume entitled A Practical Guide to Assessing English Language Learners was published by the University of Michigan Press in March 2007.

Christine is the Past President of TESOL Arabia, and the winner of the 2002 ETS Outstanding Young Scholar Award and the Mary Spann Fellowship. Her current research is on test preparation strategies. In 2006 she chaired the TESOL Convention held in Tampa, Florida.

In 2001, Nancy received the HCT Chancellor's Award as Teacher of the Year. In 2003, she was part of the writing team for the nationwide Grade 12 textbook in the UAE. She is now based in the US as a freelance materials writer and assessment consultant.
The Cornerstones of Testing

Language testing at any level is a highly complex undertaking that must be based on theory as well as practice. Although this manual focuses on practical aspects of classroom testing, an understanding of the basic principles of larger-scale testing is essential. The guiding principles that govern good test design, development and analysis are validity, reliability, practicality, washback, authenticity, transparency, security and usefulness. Constant references to these important "cornerstones" of language testing will be made throughout the workshops.

Usefulness

For Bachman and Palmer (1996), the most important consideration in designing and developing a language test is the use for which it is intended. Hence, for them, usefulness is the most important quality or cornerstone of testing. They state that "test usefulness provides a kind of metric by which we can evaluate not only the tests that we develop and use, but also all aspects of test development and use" (p. 17). Bachman and Palmer's model of test usefulness requires that any language test be developed with a specific purpose, a particular group of test takers and a specific language use in mind.

Validity

The term validity refers to the extent to which a test measures what it says it measures. In other words: test what you teach, how you teach it! Types of validity include content, construct, and face. For classroom teachers, content validity means that the test assesses the course content and outcomes using formats familiar to the students. Construct validity refers to the "fit" between the underlying theories and methodology of language learning and the type of assessment. For example, a communicative language learning approach must be matched by communicative language testing. Face validity means that the test looks as though it measures what it is supposed to measure. This is an important factor for both students and administrators. Other types of validity are more appropriate to large-scale assessment and are defined in the glossary.
It is important that we be clear about what we want to assess and be sure that we are assessing that and not something else. Making sure that clear assessment objectives are met is of primary importance in achieving test validity. The best way to ensure validity is to produce tests to specifications.

Reliability

Reliability refers to the consistency of test scores. It simply means that a test would give similar results if it were given at another time. For example, if the same test were to be administered to the same group of students at two different times, in two different settings, it should not make any difference to the test taker whether he/she takes the test on one occasion and in one setting or the other. Similarly, if we develop two forms of a test that are intended to be used interchangeably, it should not make any difference to the test taker which form or version of the test he/she takes. The student should obtain about the same score on either form or version of the test.

Three important factors affect test reliability. Test factors such as the formats and content of the questions and the length of the exam must be consistent. For example, testing research shows that longer exams produce more reliable results than very brief quizzes. In general, the more items on a test, the more reliable it is considered to be. Administrative factors are also important for reliability. These include the classroom setting (lighting, seating arrangements, acoustics, lack of intrusive noise, etc.) and how the teacher manages the exam administration. Affective factors in the response of individual students can also affect reliability. Test anxiety can be allayed by coaching students in good test-taking strategies.

A fundamental concern in the development and use of language tests is to identify potential sources of error in a given measure of language ability and to minimize the effect of these factors. Henning (1987) describes a number of threats to test reliability. These factors have been shown to introduce fluctuations in test scores and thus reduce reliability.

• Fluctuations in the Learner: A variety of changes may take place within the learner that will either introduce error or change the learner's true score from test to test. Examples of this type of change might be further learning or forgetting. Influences such as fatigue, sickness, emotional problems and practice effect may cause the test taker's score to deviate from the score which reflects his/her actual ability.

• Fluctuations in Scoring: Subjectivity in scoring or mechanical errors in the scoring process may introduce error into scores and affect the reliability of the test's results. These kinds of errors usually occur within (intra-rater) or between (inter-rater) the raters themselves.

• Fluctuations in Test Administration: Inconsistent administrative procedures and testing conditions may reduce test reliability. This is most common in institutions where different groups of students are tested in different locations on different days.
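Internal consistency, one practical facet of reliability, can be estimated directly from item-level results. Below is a minimal Python sketch of the Kuder-Richardson 20 (KR-20) coefficient for a dichotomously scored test; the score matrix is invented purely for illustration.

# KR-20 internal-consistency estimate for a dichotomously scored test.
# Invented data: rows are students, columns are items (1 = correct, 0 = incorrect).

scores = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1],
]

k = len(scores[0])                      # number of items
totals = [sum(row) for row in scores]   # each student's total score
n = len(totals)

mean = sum(totals) / n
variance = sum((t - mean) ** 2 for t in totals) / n   # variance of total scores

# Sum of item variances: p = proportion correct, q = 1 - p.
pq_sum = 0.0
for i in range(k):
    p = sum(row[i] for row in scores) / n
    pq_sum += p * (1 - p)

kr20 = (k / (k - 1)) * (1 - pq_sum / variance)
print(f"KR-20 reliability: {kr20:.2f}")   # closer to 1.0 = more consistent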
Reliability is an essential quality of test scores, because unless test scores are relatively consistent, they cannot provide us with information about the abilities we want to measure. A common theme in the assessment literature is the idea that reliability and validity are closely interlocked. While reliability focuses on the empirical aspects of the measurement process, validity focuses on the theoretical aspects and seeks to interweave these concepts with the empirical ones (Davies et al., 1999, p. 169). For this reason, it is easier to assess reliability than validity.

Practicality

Another important feature of a good test is practicality. Classroom teachers are familiar with practical issues, but they need to think of how practical matters relate to testing. A good classroom test should be "teacher-friendly". A teacher should be able to develop, administer and mark it within the available time and with available resources. Classroom tests are only valuable to students when they are returned promptly and when the feedback from assessment is understood by the student. In this way, students can benefit from the test-taking process. Practical issues include cost of test development and maintenance, time (for development and test length), resources (everything from computer access, copying facilities and AV equipment to storage space), ease of marking, availability of suitable/trained markers and administrative logistics. It is important to remember that assessment is only one aspect of a teacher's job and cannot be allowed to detract from teaching or preparation time.

Washback

Washback refers to the effect of testing on teaching and learning. Washback is generally said to be either positive or negative. Unfortunately, students and teachers tend to think of the negative effects of testing such as "test-driven" curricula and only studying and learning "what they need to know for the test". Positive washback, or what we prefer to call "guided washback", can benefit teachers, students and administrators. Positive washback assumes that testing and curriculum design are both based on clear course outcomes which are known to both students and teachers/testers. If students perceive that tests are markers of their progress towards achieving these outcomes, they have a sense of accomplishment. In short, tests must be part of learning experiences for all involved. Positive washback occurs when a test encourages good teaching practice.

Authenticity

Language learners are motivated to perform when they are faced with tasks that reflect real world situations and contexts. Good testing or assessment strives to use formats and tasks that mirror the types of situations in which students would authentically use the target language.
Whenever possible, teachers should attempt to use authentic materials in testing language skills.

Transparency

Transparency refers to the availability of clear, accurate information to students about testing. Such information should include outcomes to be evaluated, formats used, weighting of items and sections, time allowed to complete the test, and grading criteria. Transparency dispels the myths and mysteries surrounding secretive testing and the adversarial relationship between learning and assessment. Transparency makes students part of the testing process.

Security

Most teachers feel that security is an issue only in large-scale, high-stakes testing. However, security is part of both reliability and validity. If a teacher invests time and energy in developing good tests that accurately reflect the course outcomes, then it is desirable to be able to recycle the tests or similar materials. This is especially important if analyses show that the items, distractors and test sections are valid and discriminating. In some parts of the world, cultural attitudes towards "collaborative test-taking" are a threat to test security and thus to reliability and validity. As a result, there is a trade-off between letting tests into the public domain and giving students adequate information about tests.
Cornerstones Checklist

When developing, administering and grading exams, ask yourself the following questions:

• Does your exam test the curriculum content?
• Does your exam contain formats familiar to the students?
• Does your test reflect your philosophy of teaching?
• Would this test yield the same results if you gave it again?
• Will the administration of your test be the same for all classes?
• Have you helped students reduce test anxiety through test-taking strategies?
• Do you have enough time to write, grade and analyze your test?
• Do you have all the resources (equipment, paper, storage) you need?
• Will this test have a positive effect on teaching and learning?
• Are the exam tasks authentic and meaningful?
• Do students have accurate information about this test?
• Have you taken measures to ensure test security?
• Is your test a good learning experience for all involved?
Test Types

The most common use of language tests is to identify strengths and weaknesses in students' abilities. For example, through testing we can discover that a student has excellent oral abilities but a relatively low level of reading comprehension. Information gleaned from tests also assists us in deciding who should be allowed to participate in a particular course or program area. Another common use of tests is to provide information about the effectiveness of programs of instruction.

Henning (1987) identifies six kinds of information that tests provide about students. They are:

• Diagnosis and feedback
• Screening and selection
• Placement
• Program evaluation
• Providing research criteria
• Assessment of attitudes and socio-psychological differences

Alderson, Clapham and Wall (1995) have a different classification scheme. They sort tests into these broad categories: placement, progress, achievement, proficiency and diagnostic.

Placement Tests

These tests are designed to assess students' level of language ability for placement in an appropriate course or class. This type of test indicates the level at which a student will learn most effectively. The main aim is to create groups which are homogeneous in level. In designing a placement test, the test developer may choose to base the test content either on a theory of general language proficiency or on the learning objectives of the curriculum. In the former case, institutions may choose to use a well-established proficiency test such as the TOEFL or IELTS exam and link it to curricular benchmarks. In the latter, tests are based on aspects of the syllabus taught at the institution concerned.

In some contexts, students are placed according to their overall rank in the test results. At other institutions, students are placed according to their level in each individual skill area. Elsewhere, placement test scores are used to determine whether a student needs any further instruction in the language or could matriculate directly into an academic program.

Diagnostic Tests

Diagnostic tests seek to identify those language areas in which a student needs further help. Harris and McCann (1994, p. 29) point out that where "other types of tests are based on success, diagnostic tests are based on failure." The information gained from diagnostic tests is crucial for further course activities and providing students with remediation.
Because diagnostic tests are difficult to write, placement tests often serve a dual function of both placement and diagnosis (Harris & McCann, 1994; Davies et al., 1999).

Progress Tests

These tests measure the progress that students are making towards defined course or program goals. They are administered at various stages throughout a language course to see what the students have learned, perhaps after certain segments of instruction have been completed. Progress tests are generally teacher-produced and are narrower in focus than achievement tests because they cover a smaller amount of material and assess fewer objectives.

Achievement Tests

Achievement tests are similar to progress tests in that their purpose is to see what a student has learned with regard to stated course outcomes. However, they are usually administered at the mid-point and end of the semester or academic year. The content of achievement tests is generally based on the specific course content or on the course objectives. Achievement tests are often cumulative, covering material drawn from an entire course or semester.

Proficiency Tests

Proficiency tests, on the other hand, are not based on a particular curriculum or language program. They are designed to assess the overall language ability of students at varying levels. They may also tell us how capable a person is in a particular language skill area. Their purpose is to describe what students are capable of doing in a language. Proficiency tests are usually developed by external bodies such as examination boards like Educational Testing Service (ETS) or Cambridge ESOL. Some proficiency tests have been standardized for international use, such as the American TOEFL test, which is used to measure the English language proficiency of foreign college students who wish to study in North American universities, or the British-Australian IELTS test, designed for those who wish to study in the UK or Australia (Davies et al., 1999).

Other Test Types

Objective vs. Subjective Tests

Sometimes tests are distinguished on the basis of the manner in which they are scored. An objective test is one that is scored by comparing a student's responses with an established set of acceptable/correct responses on an answer key. With objectively scored tests, no particular knowledge or training in the examined area is required of the scorer. Conversely, a subjective test requires scoring by opinion or personal judgment. In this type of test, the human element is very important.
Testing formats associated with objective tests are MCQs, T/F/Ns and cloze. Objectively scored tests are ideal for computer scanning. Examples of subjectively scored tests are essay tests, interviews and comprehension questions. Even experienced scorers or markers need moderation sessions to ensure inter-rater reliability.

Criterion- vs. Norm-Referenced or Standardized Tests

Criterion-referenced tests (CRTs) are designed to enable the test user to interpret a test score with reference to a criterion level of ability or domain of content (Bachman, 1990). True CRTs are devised before instruction itself is designed so that the test will match the teaching objectives. This lessens the possibility that teachers will 'teach to the test'. The criterion or cut-off score is set in advance. Student achievement is measured with respect to the degree of their learning or mastery of the pre-specified content. A primary concern of a CRT is that it be sensitive to different ability levels.

Norm-referenced tests (NRTs), or standardized tests, differ from criterion-referenced tests in a number of ways. By definition, an NRT must have been previously administered to a large sample of people from the target population. Acceptable standards of achievement are determined after the test has been developed and administered. Test results are interpreted with reference to the performance of a given group or norm. The 'norm' is typically a large group of students who are similar to the individuals for whom the test is designed.

High- vs. Low-Stakes Tests

High-stakes tests are those where the results are likely to have a major impact on the lives of large numbers of individuals, or on large programs. For example, a test like the TOEFL is high-stakes in that admission to a university program is often contingent upon receiving a sufficient language proficiency score. Low-stakes tests are those where the results have a relatively minor impact on the lives of the individual or on small programs. In-class progress tests or short quizzes are examples of low-stakes tests.
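The difference between the criterion-referenced and norm-referenced frames can be made concrete in a few lines of code. The sketch below interprets one raw score both ways; the cut-off score and the norm-group scores are invented for illustration.

# Interpreting one raw score two ways. All numbers are invented.

raw_score = 34          # a student's raw score out of 50
cut_off = 30            # criterion-referenced: mastery threshold set in advance

# Criterion-referenced: compare against the pre-set criterion.
print("CRT decision:", "mastery" if raw_score >= cut_off else "non-mastery")

# Norm-referenced: compare against a norm group (here, a small invented sample).
norm_group = [22, 25, 28, 30, 31, 33, 35, 36, 40, 44]
below = sum(1 for s in norm_group if s < raw_score)
percentile = 100 * below / len(norm_group)
print(f"NRT interpretation: better than {percentile:.0f}% of the norm group")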
Test Development Process

Planning

• Establish purpose of test
  – place students in program
  – achievement of course outcomes
  – diagnosis of strengths and areas for improvement
  – international benchmark
• Identify objectives
  – operationalize outcomes
• Decide on cutoffs
  – grade, mastery
• Inventory course content and materials
  – consider appropriate formats
  – establish overall weighting
• Scheduling
• Write test specifications

Test Content and Development

• Map the exam
  – decide on sections, formats, weighting
• Construct items according to test specifications
• Establish grading criteria
  – prepare an answer key
• Vet the exam
• Pilot the exam

Before the Test

• Provide information to students
  – coverage, weighting, formats, logistics
• Prepare students
  – student test-taking strategies
  – practice exam activities
Test Administration

• Decide on test conditions and procedures
• Organize equipment needed
• Establish makeup policy
• Inform students about availability of results

After the Test

• Grade tests
  – calibrate if more than one teacher is involved
  – adjust answer key if needed
• Compute basic statistics
• Get results to students
  – provide feedback for remediation
• Conduct exam analysis (see the item analysis sketch below)
  – overall exam analysis
  – item and distractor analysis
  – error analysis
• Report on exam results
  – channel washback

Reflect on the Testing Process

• Learn from each exam
  – Did it serve its purpose?
  – What was the "fit" with curricular outcomes?
  – Was it a valid and reliable test?
  – Was it part of the students' learning experience?
  – What future changes would you make?
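For the item and distractor analysis step above, two classroom statistics go a long way: item facility (the proportion of students answering correctly) and item discrimination (how well the item separates stronger from weaker students). Below is a minimal sketch with invented response data; the upper-lower method shown is one common approach, not the only one.

# Item facility and discrimination for one dichotomously scored item.
# Invented data: (item score, total test score) for each student.

results = [(1, 45), (1, 42), (0, 40), (1, 38), (1, 35),
           (0, 30), (1, 28), (0, 25), (0, 22), (0, 18)]

n = len(results)
facility = sum(item for item, _ in results) / n   # 0.3-0.7 is a common target

# Upper-lower discrimination: compare the top and bottom thirds by total score.
ranked = sorted(results, key=lambda r: r[1], reverse=True)
third = n // 3
upper = ranked[:third]
lower = ranked[-third:]
discrimination = (sum(i for i, _ in upper) - sum(i for i, _ in lower)) / third

print(f"Facility: {facility:.2f}  Discrimination: {discrimination:.2f}")
# Items with near-zero or negative discrimination deserve review or rewriting.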
Guidelines for Classroom Testing

• Test to course outcomes
• Test what has been taught, how it has been taught
• Weight the exam according to outcomes and course emphases
• Organize the exam with student time allocation in mind
• Test one language skill at a time unless integrative testing is the intent
• Set tasks in context wherever possible
• Choose formats that are authentic for tasks and skills
• Avoid mixing formats within one exam task
• Distinguish between recognition, recall, and production in selecting formats
• Design the test with entire test sections and tasks in mind
• Prepare unambiguous items well in advance
• Sequence items from easy to more difficult
• Give equal weight only to items of equal difficulty
• Write clear directions and rubrics
• Provide examples for each format
• Write more items than you need
• Avoid sequential items
• Take your test as a student before finalizing it
• Make the test easy and fair to grade
• Develop practice tests and answer keys simultaneously
• Specify the material to be tested to the students
• Acquaint students with techniques and formats
• Administer the test in uniform, non-distracting conditions
• For subjective formats, use multiple raters whenever possible
• Provide timely feedback to students
• Reflect on the exam without delay
Writing Objective Test Items

Multiple Choice Questions (MCQs)

Multiple-choice questions are the hardest type of objective question for classroom teachers to write. Although many people believe MCQs are simplistic, the format can actually be used for intellectually challenging tasks. Teachers should keep the following guiding principles in mind when writing MCQs:

• The optimum number of response options for F/SL testing is four.
• With four response options, one should be an unambiguous correct or best answer. The three remaining options function as distractors.
• Distractors should attract students who are unsure of the answer.
• All response options should be the same length and level of difficulty.
• All distractors should be related in some way (e.g. same part of speech).
• The question or task should be clear from the stem of the MCQ.
• The language of the stem and response options should be as simple as possible to avoid skill contamination.
• The selection of the correct or best answer should involve interpretation of the passage/stem, not merely the activation of background knowledge or "verbatim selection".
• Avoid using "all of the above", "none of the above", or "a, b, and sometimes c, but never d" options.
• All response options should be grammatically correct unless error identification is part of your course outcomes.
• Correct answers should appear equally in all positions (an easy check is sketched below).
• Make sure there is an unambiguous correct answer for each item.
• As much context as possible should be provided.
• Recurring information in response options should be moved to the stem.
• Avoid writing absurd or "giveaway" distractors.
• Avoid extraneous clues.
• Avoid sequential items where the successful completion of one question presupposes a correct answer to the preceding question.

Main Idea MCQ Format

The testing of the main idea of a text is frequently done via MCQs. The recommended word count of the paragraph or text itself should be based on course materials. One standard way to test main idea employs an MCQ format with the response options written in the following way:

• JR (just right): This option should be the correct or best answer.
• TG (too general): This distractor relates an option that is too broad.
• TS (too specific): This distractor focuses on one detail within the text or paragraph.
• OT (off topic): Depending on the level of the students, this distractor is written so that it reflects an idea that is not developed in the paragraph or text. For more advanced students, the idea would be related in some way.

Main idea can also be tested via the TFN format by using the "This text/paragraph is mostly about..." prompt.
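One MCQ guideline above, that correct answers should appear equally in all positions, is easy to slip on when assembling an exam by hand. Below is a quick sketch that tallies key positions across an answer key; the key itself is invented for illustration.

# Check whether MCQ keys are balanced across positions A-D.
# The answer key below is invented for illustration.

from collections import Counter

answer_key = ["A", "C", "B", "D", "C", "A", "B", "D", "C", "B",
              "A", "D", "B", "C", "A", "D", "C", "B", "A", "D"]

counts = Counter(answer_key)
expected = len(answer_key) / 4
for option in "ABCD":
    flag = "" if abs(counts[option] - expected) <= 1 else "  <- check balance"
    print(f"{option}: {counts[option]}{flag}")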
True/False/Not Given (TFN)

True/False/Not Given questions are a reliable way of testing reading comprehension provided that there are enough questions. They have the added advantage of being easier and quicker to write than MCQs. Teachers should keep the following guidelines in mind when writing TFNs:

• Questions should be written in language at a lower level of difficulty than the text.
• Questions should appear in the same order as the information appears in the text.
• The first question should be an "easy" question. This serves to reduce test anxiety.
• Avoid using absolutes like "always" or "never" in TFNs.
• Have students circle T, F or N rather than write a letter in a blank.
• To increase the discrimination or reduce the guessing factor, add the Not Given option. It means that the information necessary to answer the question is not included in the text.
• Successful completion of TFN items should depend on the student's reading of the text, not on background knowledge.
• Avoid discernible patterns for marking (e.g. TTTFFFNNN).
• Avoid verbatim selection, or simply matching the question to words/phrases in the text.
• Paraphrase questions by using vocabulary and grammar from course materials.
• The TFN format is effectively used to test reading, but should be avoided for listening comprehension.

Matching

Matching is an extended form of MCQ that draws upon the student's ability to make connections between ideas, vocabulary and structure. The advantage over MCQs is that the student has more distractors per item. Additionally, writing items in the matching format is somewhat easier for teachers than writing either MCQs or TFNs. These are some important points to bear in mind:

• Include more items in the answer group than in the question group.
• Never write items that rely on direct one-to-one matching. The consequence of one-to-one matching is that if a student gets one item wrong, at least two are wrong by default. By contrast, if the student gets all previous items right, the last item is a "process of elimination freebie".
• Matching can be used very effectively with related items for gap-fill paragraphs instead of two lists. In this way, students focus on meaning in context and attend to features such as collocation.
• If a two-column format is used for matching, number the questions and letter the answer options. Leave a space for students to write the letter of the chosen answer. This prevents lines being drawn from the question column to the answer column.
• Two-column matching formats should be used sparingly for word association tasks. When this is the specific testing objective, be sure that the syntax between the two columns is correct and unambiguous.
• Avoid extraneous clues such as using "an" when the correct answer starts with a vowel.
Assessing Writing

Most teachers find that it is relatively easy to write subjective test item prompts as contrasted to objective ones. The difficulty lies in clearly specifying the task for the student so that grading is fair and equitable to all students. Some teachers find that the best approach is to write a sample answer and then analyze the elements of that answer. Alternatively, it is useful to ask a colleague to write a sample answer and critique the prompt. Writing good subjective items is an interactive, negotiated process.

The F/SL literature generally addresses two types of writing: free writing and guided writing. The former requires students to read a prompt that poses a situation and write a planned response based on a combination of background knowledge and knowledge learned from the course. Guided writing, however, requires students to manipulate content that is provided in the prompt, usually in the form of a chart or diagram.

Guided Writing

Guided writing is a bridge between objective and subjective formats. This task requires teachers to be very clear about what they expect students to do. Decide in advance whether mechanical issues like spelling, punctuation and capitalization matter when the task focuses on comprehension. Some important points to keep in mind for guided writing are:

• Be clear about the expected form and length of response (one paragraph, a 250-word essay, a letter, etc.).
• If you want particular information included, clearly specify it in the prompt (e.g. three causes and effects, two supporting details).
• Similarly, specify the discourse pattern(s) the students are expected to use (e.g. compare and contrast, cause and effect, description).
• Since guided writing depends on the student's manipulation of the information provided, be sure to ask them to provide something beyond the prompt such as an opinion, an inference, or a prediction.
• Be amenable to revising the anticipated answer even as you grade.
Free Writing

All of the above suggestions are particularly germane to free writing. The goal for teachers is to elicit comparable products from students of different ability levels.

• The use of multiple raters is especially important in evaluating free writing. Agree on grading criteria in advance and calibrate before the actual grading session.
• Decide whether to use a holistic scale, an analytical scale or a combination of the two for marking.
• If using a band scale, adjust it to the task.
• Acquaint students with the marking scheme in advance by using it for teaching, grading homework and providing feedback.
• Subliminally teach good writing strategies by providing students with enough space for an outline, a draft and the finished product.
• In ES/FL classrooms, be aware of cultural differences and sensitivities among students. Avoid contentious issues that might offend or disadvantage students.

Writing Assessment Scales

The F/SL assessment literature generally recognises two different types of writing scales for assessing student written proficiency: holistic marking and analytical marking.

Holistic Marking Scales

Holistic marking is where the scorer "records a single impression of the impact of the performance as a whole" (McNamara, 2000, p. 43). In short, holistic marking is based on the marker's total impression of the essay as a whole. Holistic marking is variously termed impressionistic, global or integrative marking.

Experts in holistic marking scales report that this type of marking is quick and reliable if three to four people mark each script. The general rule of thumb for holistic marking is to mark for two hours and then take a rest, grading no more than 20 scripts per hour. Holistic marking is most successful using scales of a limited range (e.g. 0-6).

FL/SL educators have identified a number of advantages to this type of marking. First, it is reliable if done under no time constraints and if teachers receive adequate training. Also, this type of marking is generally perceived to be quicker than other types of writing assessment and enables a large number of scripts to be scored in a short period of time.
Thirdly, since overall writing ability is assessed, students are not disadvantaged by one lower component, such as poor grammar, bringing down a score.

Several disadvantages of holistic marking have also been identified. First of all, this type of marking can be unreliable if marking is done under short time constraints and with inexperienced, untrained teachers (Heaton, 1990). Secondly, Cohen (1994) has cautioned that longer essays often tend to receive higher marks. Testers point out that reducing a score to one figure tends to reduce the reliability of the overall mark. The most serious problem associated with holistic marking is its inability to provide feedback to those involved. More specifically, when marks are gathered through a holistic marking scale, no information or washback on how those marks were awarded is available. Thus, testers often find it difficult to justify the rationale for the mark. Hamp-Lyons (1990) has stated that holistic marking is severely limited in that it does not provide a profile of the student's writing ability.

Analytical Marking Scales

Analytical marking is where "raters provide separate assessments for each of a number of aspects of performance" (Hamp-Lyons, 1991). In other words, raters mark selected aspects of a piece of writing and assign point values to quantifiable criteria (Coombe & Evans, 2001). In the literature, analytical marking has also been termed discrete point marking and focused holistic marking. Analytical marking scales are generally more effective with inexperienced teachers, and they are more reliable when they have a larger point range.

A number of advantages have been identified with analytical marking. Firstly, unlike holistic marking, analytical writing scales provide teachers with a "profile" of their students' strengths and weaknesses in the area of writing. Additionally, this type of marking remains reliable even with a population of inexperienced teachers who have had little training and grade under short time constraints (Heaton, 1990). Finally, training raters is easier because the scales are more explicit and detailed.

Just as there are advantages to analytical marking, educators point out a number of disadvantages associated with using this type of scale. Analytical marking is perceived to be more time consuming because it requires teachers to rate various aspects of a student's essay. It also necessitates a set of specific criteria to be written, and requires markers to be trained and to attend frequent moderation or calibration sessions. These moderation sessions are meant to ensure that inter-marker differences are reduced, thereby increasing reliability. Also, because teachers look at specific areas in a given essay, the most common being content, organization, grammar, mechanics and vocabulary, marks are often lower than for their holistically marked counterparts. Another disadvantage is that analytical marking scales remove the integrative nature of writing assessment.
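To make the analytic approach concrete, the sketch below combines separate criterion scores into a single mark. The criteria, weights and band scores are invented for illustration and would come from your own institution's scale.

# Combining analytic criterion scores into a single writing mark.
# Criteria, weights and scores are invented; substitute your own scale.

weights = {"content": 0.30, "organization": 0.20, "grammar": 0.20,
           "vocabulary": 0.20, "mechanics": 0.10}

# One student's band scores on a 0-10 scale, one per criterion.
scores = {"content": 7, "organization": 6, "grammar": 5,
          "vocabulary": 6, "mechanics": 8}

total = sum(weights[c] * scores[c] for c in weights)
print(f"Weighted analytic mark: {total:.1f} / 10")   # 6.3 for these numbers

# The per-criterion scores double as the diagnostic "profile" described
# above: here grammar (5) is the student's weakest area.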
Selecting the Appropriate Marking Scale

Selecting the appropriate marking scale depends upon the context in which a teacher works. This includes the availability of resources, the amount of time allocated to getting reliable writing marks to administration, the teacher population and the management structure of the institution. Reliability can be increased by using multiple marking, which reduces the scope for error that is inherent in a single score.

Writing Moderation/Calibration Process

For test reliability, it is recommended that clear criteria for grading be established and that rater training in using these criteria take place prior to marking. The criteria can be based on holistic or analytical rating scales. However, whatever scale is chosen, it is crucial that all raters adhere to the same scale regardless of their personal preference.

The best way to achieve inter-rater reliability is to practice. Start early in the academic year by employing the marking criteria in non-test situations. Make students aware from the outset of the criteria and expectations for their work. Collect a range of student writing samples on the same task and have teachers evaluate and discuss them until they arrive at a consensus score. Involve students in peer-grading of classroom writing to familiarize them with marking criteria. This has the benefit of making students more aware of ways in which they can edit and improve their writing. A simple numeric check of rater agreement, sketched below, can support these discussions.

Recommendations for Writing Assessment

As always, assessment should first and foremost reflect the goals of the course. In order for writing assessment to be fair for students, they should have plenty of opportunities to practice a variety of different writing skills of varying lengths. In other words, tests of writing should be shorter and more frequent, not just a "snapshot" approach at midterm and final exams.
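The numeric check mentioned above can be as simple as counting exact and adjacent (within one band) agreement between two raters on a shared batch of calibration scripts. The band scores below are invented.

# Exact and adjacent (within one band) agreement between two raters.
# The band scores below are invented calibration data.

rater_a = [5, 6, 4, 7, 5, 6, 3, 5, 6, 4]
rater_b = [5, 5, 4, 6, 5, 7, 4, 5, 6, 3]

pairs = list(zip(rater_a, rater_b))
exact = sum(1 for a, b in pairs if a == b)
adjacent = sum(1 for a, b in pairs if abs(a - b) <= 1)

print(f"Exact agreement: {exact / len(pairs):.0%}")
print(f"Within one band: {adjacent / len(pairs):.0%}")
# Low agreement on particular scripts is a cue for a moderation discussion.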
Assessing Reading

Most language teachers assess reading through its component subskills. Since reading is a receptive language skill, we can only get an idea of how students actually process texts through techniques such as think-aloud protocols. It is not possible to observe reading behavior directly. For assessment, we normally focus on certain important skills which can be divided into major and minor (or contributing) reading skills.

Major reading skills include:
– Reading quickly to skim for gist, scan for specific details, and establish the overall organization of the passage
– Reading carefully for main ideas, supporting details, the author's argument and purpose, the relationship of paragraphs, and fact vs. opinion
– Information transfer from nonlinear texts

Minor reading skills include:
– Understanding at the sentence level: syntax, vocabulary, cohesive markers
– Understanding at the inter-sentence level: reference, discourse markers
– Understanding the components of nonlinear texts: the meaning of graph or chart labels and keys, and the ability to find and interpret intersection points

It should be noted that the designations major and minor largely relate to whether the skills pertain to large segments of the text or whether they focus on certain local structural or lexical points. Increasingly, grammar and vocabulary are contextualized as part of reading passages instead of being assessed separately in a discrete-point fashion. However, there are times when it is appropriate to assess structure, vocabulary, and language-in-use separately.

Reading texts include both prose passages and nonlinear texts such as tables, graphs, schedules, advertisements and diagrams. Texts for assessment should be carefully chosen to fit the purpose of assessment and the level of the students, taking factors such as text length, density and readability into account. For assessment, avoid texts with controversial or biased material because they can upset students and affect the reliability of test results. Ninety percent of the vocabulary in a prose passage should be known to the students (Nation, 1990).

Reading tests use many of the formats already discussed. Recognition formats include MCQs, TFNs, matching and cloze with answers provided. If limited production formats such as short answer are used, the emphasis is usually on meaning, not spelling. Of course, there will be authentic tasks such as reading directions for form-filling where accuracy is important.
Specifications

As with all skills assessment, it is important to start with a clear understanding of program objectives, intended outcomes and target uses of English. Once these are clear, you can develop specifications or frameworks for developing assessment. Specifications clearly state what and how you will assess, what the conditions of assessment will be (length and overall design of the test), and the criteria for marking or grading. Here are typical features of specifications:

Content
• What material will the test cover? What aspects of this material?
• What does the student have to be able to do? For example, in reading, perhaps a student has to scan for detailed information.
• For reading passages, specifications state the type of text (prose or nonlinear), the number of words in the passage and the readability level.
• Acceptable topics and the treatment of vocabulary are usually set forth in specifications. For instance, topics may be restricted to those covered in the student book and vocabulary may focus on core vocabulary in the course.

Conditions
• Specifications usually provide information about the structure of the examination and its component parts. For example, a reading examination may include five subsections which use different formats and texts to test different subskills.
• Specific formats or a range of formats are usually given in specifications, in addition to the number of questions for each format or section.
• Timing is another condition which specifications state. The time may be given for the entire test or sometimes for each individual subsection. For example, you can place time-dependent skills such as skimming and scanning in separately timed sections, or you can place them at the end of a longer reading test where students typically are reading faster to finish within the allocated time.

Grading criteria
• Specifications indicate how the assessment instrument will be marked. For instance, the relative importance of marks for communication as contrasted to those for mechanics (spelling, punctuation, capitalization) should reflect the overall approach and objectives of the instructional program. Similarly, if some skills are deemed more important or require more processing than other skills, they may be weighted more heavily.

In short, specifications help teachers and administrators establish a clear linkage between the overall objectives for the program and the design of particular assessment instruments. Specifications are especially useful for ensuring even coverage of the main skills and content of courses as well as for developing tests that are comparable to one another because they are based on the same guidelines.
Recommendations for Reading Assessment

Texts

Texts can be purpose-written, taken directly from authentic material or adapted. The best way to develop good reading assessments is to constantly be on the watch for appropriate material. Keep a file of authentic material from newspapers, magazines, brochures, instruction guides – anything that is a suitable source of real texts. Other ways to find material on particular topics are to use an encyclopedia written at an appropriate readability level or to use an Internet search engine. Whatever the source, cite it properly.

Microsoft Word provides word counts and readability statistics. First, highlight the passage, and then select Word Count from the Tools menu. To access readability information, go to Options under the Tools menu, then Spelling and Grammar, and tick "Show Readability Statistics". The readability scores are based on sentence length and word length; the percentage of passive sentences is reported alongside. You can raise or lower the level of a text by changing these features. You can also add line numbers and other special features to texts. (A scripted alternative to Word's readability statistics is sketched below.)

Questions

Make sure that questions are written at a slightly lower level than the reading passages. Reading questions should be in the same order as the material in the passage itself. If you have two types of questions or two formats based on one text, go through the text with different colored markers to check that you have evenly covered the material in order.

For objective formats such as multiple choice and true/false/not given, try to make all statements positive. If you phrase a statement negatively and an option is negative as well, students have to deal with the logical problems of double negatives. Whenever possible, rephrase material using synonyms to avoid students scanning for verbatim matches. Paraphrasing encourages vocabulary growth as positive washback.
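If Word is not available, the same kind of readability estimate can be scripted. The sketch below implements the standard Flesch Reading Ease formula; the syllable counter is a rough vowel-group heuristic, so treat the output as approximate.

# Approximate Flesch Reading Ease:
#   206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
# The syllable count is a rough vowel-group heuristic, not a dictionary lookup.

import re

def syllables(word):
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syll = sum(syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syll / len(words))

passage = "The cat sat on the mat. It was a warm day, and the cat was happy."
print(f"Flesch Reading Ease: {flesch_reading_ease(passage):.0f}")
# Higher scores mean easier text; 90+ is very easy, 30 and below is difficult.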
Assessing Listening

The assessment of listening abilities is one of the least understood and least developed, and yet one of the most important, areas of language testing and assessment (Alderson & Bachman, 2001). In fact, Nunan (2002) calls listening comprehension "the poor cousin amongst the various language skills" because it is the most neglected skill area. As teachers we recognize the importance of teaching and then assessing the listening skills of our students, but, for a number of reasons, we are often unable to do this effectively. One reason for this neglect is the scarcity of culturally appropriate listening materials suitable for EF/SL contexts. The biggest challenges for teaching and assessing listening comprehension center around the production of listening materials. Indeed, listening comprehension is often avoided because of the time, effort and expense required to develop, rehearse, record and produce high quality audio tapes or CDs.

Approaches to Listening Assessment

Buck (2001) has identified three major approaches to the assessment of listening abilities: discrete-point, integrative and communicative approaches.

The discrete-point approach became popular during the early 1960s with the advent of the Audiolingual Method. This approach identified and isolated listening into separate elements. Some of the question types utilized in this approach included phonemic discrimination, paraphrase recognition and response evaluation. An example of phonemic discrimination is assessing students by their ability to distinguish minimal pairs like ship/sheep. Paraphrase recognition is a format that requires students to listen to a statement and then select the option closest in meaning to the statement. Response evaluation is an objective format that presents students with questions and then four response options. The underlying rationale for the discrete-point approach stemmed from two beliefs: first, that it was important to be able to isolate one element of language from a continuous stream of speech; and secondly, that spoken language is the same as written language, only presented orally.

The integrative approach, which emerged in the early 1970s, called for integrative testing. The underlying rationale for this approach is best explained by Oller (1979, p. 37), who stated that "whereas discrete items attempt to test knowledge of language one bit at a time, integrative tests attempt to assess a learner's capacity to use many bits at the same time." Proponents of the integrative approach to listening assessment believed that the whole of language is greater than the sum of its parts. Common question types in this approach were dictation and cloze.
The third approach, the communicative approach, arose at approximately the same time as the integrative approach as a result of the Communicative Language Teaching movement. In this approach, the listener must be able to comprehend the message and then use it in context. Communicative question formats must be authentic in nature.

Issues in Listening Assessment

A number of issues make the assessment of listening different from the assessment of other skills. Buck (2001) has identified several issues that need to be taken into account: setting, rubric, input, voiceovers, test structure, formats, timing, scoring and finding texts. Each is briefly described below and recommendations are offered.

Setting

The physical characteristics of the test setting or venue can affect the validity and/or reliability of the test. Exam rooms must have good acoustics and minimal background noise. Equipment used in test administrations should be well maintained and checked beforehand. In addition, an AV technician should be available for any potential problems during the administration.

Rubric

Context is extremely important in the assessment of listening comprehension, as test takers do not have access to the text as they do in reading. Context can be written into the rubric, which enhances the authenticity of the task. Instructions to students should be in the students' L1 whenever possible. However, in many teaching situations, L1 instructions are not allowed. When L2 instructions are used, they should be written at one level of difficulty lower than the actual test. Clear examples should be provided for students, and point values for questions should be included in the rubrics.

Input

Input should have a communicative purpose. In other words, the listener must have a reason for listening. Background or prior knowledge needs to be taken into account: there is a considerable body of research suggesting that background knowledge affects comprehension and test performance. In a testing situation, we must take care to ensure that students are not able to answer questions based on their background knowledge rather than on their comprehension.

Voiceovers

Anyone recording a segment for a listening test should receive training and practice beforehand. In large-scale testing, it is advisable to use a mixture of genders, accents and dialects. To be fair to all students, listening voiceovers should match the demographics of the teacher population. Other issues are the use of non-native speakers for voiceovers and the speed of delivery. Our belief is that non-native speakers of English constitute the majority of English-speaking people in the world. Whoever is used for listening test voiceovers, whether native or non-native speakers, should speak clearly and enunciate carefully.
The speed of delivery of a listening test should be consistent with the level of the students and the materials used for instruction. If your institution espouses a communicative approach, then the speed of delivery for listening assessments should be native or near-native. The delivery of the test should be standard for all test takers. If live readers are used, they should practice reading the script before the test and standardize with other readers.

Test Structure

The way a test is structured depends largely on who constructs it. There are generally two schools of thought on this: the British and the American perspectives. British exam boards generally grade input from easy to difficult across a test and mix formats within a section. This means that the easier sections come first, with the more difficult sections later. American exam boards, on the other hand, usually grade question difficulty within each section of an exam and follow the 30/40/30 rule. This rule states that 30% of the questions within a test or test section are of an easy level of difficulty; 40% of the questions represent mid-range levels of difficulty; and the remaining 30% of the questions are of an advanced level of difficulty (a quick way to audit this distribution is sketched below). American exam boards usually use one format within each section. The structure you use should be consistent with the external benchmarks you use in your program. It is advisable to start the test with an 'easy' question. This will lower students' test anxiety by relaxing them at the outset of the test.

Within a listening test, it is important to test as wide a range of skills as possible. Questions should be ordered as the answers are heard in the passage, and should be well spaced out across the passage for good content coverage. It is recommended that no content from the first 15-20 seconds of the recording be tested, to allow students to adjust to the listening. Many teachers only include test content which is easy to test, such as dates and numbers. Include some paraphrased content to challenge students.

Formats

Perhaps the most important piece of advice here is that students should never be exposed to a new format in a testing situation. If new formats are to be used, they should first be practiced in a teaching situation and then introduced into the testing repertoire. Objective formats like MCQs and T/F are often used because they are more reliable and easier to mark and analyze. When using these formats, make sure that the N option is dropped from T/F/N and that three response options instead of four are utilized for MCQs. Remember that with listening comprehension, memory plays a role: since students do not have repeated access to the text, more options add to the memory load and affect the difficulty of the task and question. Visuals are often used as part of listening comprehension assessment. When using them as input, make certain that you use clear copies that reproduce well.
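Once facility values are available for an item bank (see the item analysis sketch earlier), auditing a section against the 30/40/30 rule takes only a few lines. The facility values and band cut-offs below are invented; choose cut-offs that match your own definition of 'easy' and 'difficult'.

# Audit a section's difficulty mix against the 30/40/30 rule.
# Facility values are invented; the band cut-offs are one plausible choice.

facilities = [0.85, 0.80, 0.74, 0.66, 0.61, 0.55, 0.48, 0.42, 0.33, 0.28]

easy = sum(1 for f in facilities if f >= 0.70)
hard = sum(1 for f in facilities if f < 0.40)
mid = len(facilities) - easy - hard

for label, count in (("easy", easy), ("mid", mid), ("hard", hard)):
    print(f"{label}: {100 * count / len(facilities):.0f}%  (target 30/40/30)")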
Skill contamination is an issue that is regularly discussed with regard to listening comprehension. Skill contamination is the idea that a test taker must use other language skills in order to answer questions on a listening test; for example, a test taker must first read the question and then write the answer. Whereas skill contamination used to be viewed negatively in the testing literature, it is now viewed more positively and termed 'skill integration.'

Timing
The length of a listening test is generally determined by one of two things: the length of the tape or the number of repetitions of the passages. Most published listening tests do not require the proctor to attend to timing; he or she simply inserts the tape or CD into the machine, and the test is over when the proctor hears a pre-recorded "This is the end of the listening test" statement. For teacher-produced listening tests, the timing will usually be determined by how many times the test takers are permitted to hear each passage. Proficiency tests like the TOEFL usually play the input once, whereas achievement tests usually play it twice. Buck (2001) recommends that if you are assessing main ideas, input should be heard once, and if you are assessing detail, input should be heard twice. According to Carroll (1972), listening tests should not exceed 30 minutes.

It is important to remember to give students time to pre-read the questions before the test and to answer the questions throughout the test. If students are required to transfer their answers from the test paper to an answer sheet, extra time to do this should be built into the exam.

Scoring
The scoring of listening tests presents numerous challenges to the teacher/tester. Dichotomous scoring (questions that are either right or wrong) is easier and more reliable. However, it doesn't lend itself to many of the communicative formats such as note-taking. Other issues are whether points are deducted for grammar or spelling mistakes or for non-adherence to word counts. When more than one teacher participates in the marking of a listening test, calibration or standardization training should be completed to ensure fairness to all students.

Finding Suitable Texts
Many teachers feel that the unavailability of suitable texts is listening comprehension's most pressing issue. The reason for this is that creating scripts which have the characteristics of oral language is not an easy task. Some teachers simply take a reading text and 'transform' it into a listening script. The transformation of reading texts into listening scripts results in contrived and inauthentic listening tasks because written texts often lack the redundant features which are so important in helping us understand speech. A better strategy is to look for texts that concentrate on characteristics that are unique to listening. If you start collecting texts that have the right oral features, you can then construct tasks around them. When graphics or visuals are used as test context, teachers often find themselves 'driven by clip art'. This occurs when teachers build a listening script around readily available graphics.
It is best to inventory the topics in a course and collect appropriate material well in advance of exam construction.

To produce more extemporaneous listening recordings, use programs available on your computer, such as Sound Recorder, or shareware like Audacity and PureVoice, to record scripts for use as listening assessments in the classroom.

Vocabulary
Research suggests that students must know between 90-95% of the words in a text or script in order to understand it. Indeed, the level of the vocabulary that you use in your scripts can affect the difficulty of the task and hence students' comprehension. If your institution employs word lists, it is recommended that you seed vocabulary from your own word lists into listening scripts whenever possible. To determine the vocabulary profile of your text or script, go to http://www.er.uqam.ca/nobel/r21270/cgi-bin/webfreqs/web_vp.cgi for Vocabulary Profiler, a very user-friendly piece of software. By simply pasting your text into the program, you will receive information about the percentage of words that come from Nation's 1000 Word List and the Academic Word List. (For teachers who prefer to script this kind of coverage check themselves, a small sketch appears at the end of this section.)

Another thing to remember about vocabulary is that 'lexical overlap' can affect difficulty. Lexical overlap occurs when words used in the passage are also used in the questions and response options. When words from the passage appear in the correct answer or key, the question is easier. The question becomes more difficult if lexical overlap occurs between the passage/script and the distractors. A final thought on vocabulary is that unknown vocabulary should never occur as a keyable response (the actual answer) in a listening test.

Final Recommendations for Listening Assessment
No matter what the skill area, test developers should always be guided by the cornerstones of good testing practice when constructing tests.
• Validity (Does it measure what it says it does?)
• Reliability (Are the results consistent?)
• Practicality (Is the test "teacher-friendly"?)
• Washback (Is feedback channeled to everyone concerned?)
• Authenticity (Do the tasks mirror real-life contexts?)
• Transparency (Are expectations clear to students? Do students and teachers have access to information about the test/assessment?)
• Security (Are exams and item banks secure? Can they be reused?)
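As promised above, here is a minimal sketch of the idea behind vocabulary coverage tools such as the Vocabulary Profiler. It is an illustration only: the file name "wordlist.txt", the sample script and the use of the 95% figure are assumptions for the example, not features of the actual tool.

    # A minimal vocabulary coverage check (a sketch, not the real Vocabulary Profiler).
    # Assumes a plain-text word list saved as "wordlist.txt", one word per line.
    import re

    def coverage(script_text, known_words):
        """Return the percentage of tokens in script_text found in known_words."""
        tokens = re.findall(r"[a-z']+", script_text.lower())
        if not tokens:
            return 0.0
        known = sum(1 for t in tokens if t in known_words)
        return known / len(tokens) * 100

    with open("wordlist.txt") as f:
        word_list = {line.strip().lower() for line in f if line.strip()}

    script = "Good morning. Today we will talk about the weather in your city."
    pct = coverage(script, word_list)
    print(f"{pct:.0f}% of the script's words are on the list")
    if pct < 95:
        print("Consider simplifying the script or pre-teaching key vocabulary.")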
Assessing Speaking

Always keeping the cornerstones of good assessment in mind, why do we want to test speaking? In a general English program, speaking is an important channel of communication in daily life. We want to simulate real-life situations in which students engage in conversation, ask and answer questions, and give information. In an academic English program, the emphasis might be on participating in class discussions and debates or giving academic presentations. In a Business English course, students might develop telephone skills and interact in a number of common situations involving meetings, travel, and sales, as well as make reports. Whatever the teaching focus, valid assessment should reflect the course objectives and the eventual target language use.

Speaking is a productive language skill like writing and thus shares many issues, such as whether to grade holistically or analytically. However, unlike writing, speaking is ephemeral unless measures are taken to record student performance. Yet the presence of recording equipment can inhibit students, and often recording is not practical or feasible. To score reliably, it is often necessary to have two teachers assess together. When this happens, one is the interlocutor who interacts with the speaker(s) while the other teacher, the assessor, tracks the student's performance.

Based on Bygate's categories, Weir (1993) divides oral skills into two main groups: speaking skills that are part of a repertoire of routines for exchanging information or interacting, and improvisational skills such as negotiating meaning and managing the interaction. The routine skills are largely associated with language functions and the spoken language required in certain situations. By contrast, the improvisational skills are more general and may be brought into play at any time to seek clarification, to keep a conversation flowing, to change topics or to take turns. In circumstances where presentation skills form an important component of a program, naturally they should be assessed. However, avoid situations where a student simply memorizes a prepared speech. Decide which speaking skills are most germane to a particular program and then create assessment tasks that sample those skills widely with a variety of tasks.

While it is possible to assess speaking skills on an individual basis, most large exam boards opt to test pairs of students with pairs of testers. Within tests organized in this way, there are times when only one student speaks and other times when the students interact in a conversation. This setup makes it possible to test common routine functions as well as a range of improvisational skills. For reliability, interlocutors should work from a script so that all students get similar questions framed in the same way. In general, the teacher or interlocutor should keep in the background and only intercede if truly necessary.
Common speaking assessment formats

It is good practice to start the speaking assessment with a simple task that puts students at ease so they perform better. Often this takes the form of asking the students for some personal information.

Interview: can be teacher-to-student or student-to-student. Teacher-to-student is more reliable when the questions are scripted.

Description of a photograph or item: Students describe what they see.

Narration: This is often an elaboration of a description. The student is given a series of pictures or a cartoon strip depicting the major events in a story.

Information gap activity: One student has information the other lacks, and vice versa. Students have to exchange information to see how it fits together.

Negotiation task: Students work together on a task about which they may have different opinions. They have to reach a conclusion in a limited period of time.

Roleplays: Students are given cue cards with information about their "character" and the setting. Some students find it difficult to project themselves into an imaginary situation, and this lack of "acting ability" may affect reliability.

Oral presentations: Strive to make them impromptu instead of rehearsed.

Recommendations for Speaking Assessment

Decide with your colleagues which speaking subskills are most important and adopt a grading scale that fits your program. Whether you adopt a holistic or analytical approach to grading, create a recording form that enables you to track students' production and later give feedback for improvement.

Think about these factors: fluency vs. accuracy, appropriate responses (indicating comprehension), pronunciation, accent and intonation, and use of repair strategies.

Train teachers in scoring and practice together until there is a high rate of inter-rater reliability; one quick way to check consistency is sketched at the end of this section. Use moderation sessions with high-stakes exams.

Keep skill contamination in mind. Don't give students lengthy written instructions which must be read and understood before speaking.

Remember that larger samples of language are more reliable. Make sure that students speak long enough on a variety of tasks.

Choose tasks that generate positive washback for teaching and learning!
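For teachers comfortable with a few lines of code, here is the consistency check mentioned above: a simple correlation between two raters' scores for the same students. This is a sketch under stated assumptions: the scores are invented, the 0-9 band scale is illustrative, and Pearson's r (available in Python 3.10+) is only one of several possible agreement statistics.

    # A minimal sketch: checking agreement between two raters' scores
    # for the same ten students on a 0-9 band scale (invented data).
    from statistics import correlation  # Python 3.10 or later

    rater_a = [6, 7, 5, 8, 6, 7, 4, 9, 5, 6]
    rater_b = [6, 6, 5, 8, 7, 7, 5, 8, 5, 7]

    r = correlation(rater_a, rater_b)  # Pearson's r
    print(f"Inter-rater correlation: r = {r:.2f}")
    # Values near 1.0 suggest the raters rank students similarly;
    # lower values signal a need for further standardization sessions.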
Student Test-Taking Strategies

In today's universities, grades are substantially determined by test results. So much importance is placed on students' test results that often just the word "test" makes students afraid. The best way for students to overcome this feeling of fear or nervousness is to prepare themselves with test-taking strategies. This process should begin during the first week of each semester and continue throughout the school year. The key to successful test taking lies in a student's ability to use time wisely and to develop practical study habits.

Actually, effective test-taking strategies are synonymous with effective learning strategies. This section is intended to provide suggestions for long-term successful learning techniques and test-taking strategies, not quick "tricks". There is nothing that can replace the development of good study skills.

The following steps will help students approach tests with confidence:
• Make a semester study plan.
• Come to class regularly.
• Use good review techniques.
• Organize pre-exam hours wisely.
• Plan out how to take the exam.
• Use strategies appropriate to the skill area.
• Learn from each exam experience.

Make A Semester Study Plan
Students need to plan their study time for each week of their courses. They should make schedules for themselves and revise these schedules when necessary. These schedules should:
• BE REALISTIC. Keep a balance between classes and studying. Block out space for study time, class time, family time and recreation time.
• INCLUDE A STUDY PLACE. Finding a good place to study will help students get started; don't forget to have all the materials needed (i.e. pens, paper, textbooks, highlighter pens etc.).
• INCLUDE A DAILY STUDY TIME. Students forget things almost at once after learning them, so they should immediately review materials learned in class. Students should go over the main points from each class and/or textbook for a few minutes each night. Encourage students to do homework assignments during this time as a good way to remember important points made in class.

Come To Class Regularly
In order for language learning to take place, students need to come to class on a regular basis. It is not surprising that poor attendance correlates highly with poor test results.

Teachers need to point out early in the semester what constitutes legitimate reasons to be absent and stress the advantages of regular attendance.

Use Good Review Techniques
If students make a semester study plan and follow it, preparing for exams should really be a matter of reviewing materials. Research shows that the time spent reviewing should be no more than 15 minutes for weekly quizzes, 2 to 3 hours for a midterm exam, and 5 to 8 hours for a final exam.

When reviewing for a test, students should do the following:
• PLAN REVIEW SESSIONS. Look at the course outline, notes and textbooks. What are the major topics? Make a list of them. How much time was spent on each of these topics in class? Did the teacher note that some topics were more important than others? If so, these should be emphasized in review sessions.
• TAKE THE PRACTICE EXAM. By taking the practice exam, students will have an idea of the tasks and activities that they will encounter on the real exam. They will also know the point allocation for each section. This information can help them plan their time wisely.
• REVIEW WITH FRIENDS. Another way of studying for an exam is to create a "study group". Studying with friends has the advantage of sharing information with others who are reviewing the same material. A study group, however, should not take the place of studying individually.

Organize Pre-Exam Hours Wisely
Students who have regularly reviewed course materials throughout the semester don't have to "cram" at the last minute. They can concentrate their efforts on particular areas of difficulty and conduct an overall review of the material to be tested.

Physical and mental fitness are important considerations for good test taking. These can best be achieved by making sure that the student has adequate rest and nutrition in the hours preceding the exam. A well-rested and well-fed student who has prepared thoroughly is likely to be calm and self-confident, two other important factors for successful test taking.

Some teachers have found it useful to encourage students in stress-reducing activities.

Strategize Your Exam Plan
An important factor in test taking is exam planning. Students should arrive early at the designated exam room and find a seat. All books and personal effects (with the exception of student ID cards and writing materials) should be left at the front of the room. Students should come prepared with several pens or pencils and an eraser.

As soon as the exams have been distributed and students have been told to start, each student should write his/her name and ID number on all pages of the test paper.

If one section is given first, such as the listening portion of English exams, the student should focus attention on that section. With any section of the exam, the student is well advised to do an overview of the questions, their values, and the tasks required. At this point, students should determine if the exam must be done in order (i.e. listening first) or if they can skip around between sections. The latter is not possible on some standardized exams where students must complete one section before moving on to the next.

An important consideration in effective test taking is time management. When exams are written, review time is usually factored into the overall exam design. Students should be encouraged to allocate their time in proportion to the value of each exam section and to allow time to review their work. When proctoring, teachers can assist students with time management by alerting them to the time remaining in the exam. Computer-based tests (such as the new TOEFL) often show a countdown of the remaining time; students should be made aware of this feature during practice exams.

A recent research project investigating the reading skills of English students has yielded several disturbing findings. First, students frequently fail to read directions, or read them only superficially. Teachers can acquaint students with the requisite metalanguage of rubrics and encourage them to attend to the important points of the task. For example, teachers should point out that reading for main ideas requires very different strategies than scanning for specific information.
Brainstorm with your students on the key terms found in rubrics.

Another finding of this project is that students don't spend enough time on the reading and don't refer back to the text as often as they perhaps should. When students are reading for specific detail, it is important that they refer back to the main text for each question.

Use Strategies Appropriate To The Skill Area
Teachers should train students in effective strategies for the various skill areas to be tested. Important activities (like note-taking for listening and writing tasks) should be demonstrated to students during classroom activities. Representative strategies for English skill areas will be modeled in today's workshop.

Learn From Each Exam Experience
Each test should be a learning experience. Go over test results with students. Teachers should note specific students' strengths and weaknesses. The analyses that teachers receive right after computer-based exams provide invaluable information in a timely manner. Teachers should use this information to send students to student support services for remediation. Each exam that students take should help them do better on the next one.
Statistics

Statistics are simply mathematical summaries of exam results. Unfortunately, the term statistics often has negative connotations for language teachers. Yet teachers can easily learn how to use statistics to get information about their students' performance on a test and to check the test's reliability. Basic statistics provide information on individual students, the class as a whole, the course content and how it has been taught. Every teacher can benefit from this feedback. The most important statistics are simple arithmetic concepts which are easy to compute with a hand calculator.

Basic Statistics for Classroom Testing
The most useful statistics for classroom teachers are known as descriptive statistics. They "describe" the population of students taking the test. The mean, mode, median, standard deviation, and range are common descriptive statistics. Of these, the mean is the most important for classroom teachers. Other descriptive statistics are important for large-scale, high-stakes testing and can easily be obtained with computer applications such as Excel. See the annotated bibliography for suggestions on testing books that cover the use of other statistics.

The Mean: Once a test has been administered to a group of students, the first step for any classroom teacher should be to compute the mean score, or arithmetic average. The mean is the sum of all the scores divided by the number of scores.

Mean scores can be computed for the test as a whole or for each section (i.e. listening, reading, writing etc.) of a test. Computing a mean score can give you information about the reliability of the test. In general, mean scores that fall in the 70s (i.e. from 70 to 79) are said to be valid indicators of test reliability. For shorter or mastery quizzes, however, teachers can expect higher means.

Pass/Fail Rate: Another useful statistic to compute is the pass or failure rate for a given test or quiz. This is most simply done by a grade breakdown. The first step in this process is to count the number of As, Bs, Cs, and Ds received on the test; this total represents the number of passing grades. Divide this number by the total number of students who took the test and you have the pass rate. To compute the failure rate, count the number of Fs or failing grades and divide this number by the total number of students who took the exam.
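For teachers who prefer a few lines of code to a hand calculator, here is a minimal sketch of the calculations just described. The scores and the pass mark of 60 are illustrative assumptions; substitute your own grades and grade boundaries.

    # A minimal sketch: mean and pass/fail rates for one class (invented scores).
    scores = [85, 72, 64, 91, 58, 77, 69, 83, 49, 75]

    mean = sum(scores) / len(scores)          # sum of all scores / number of scores
    passing = [s for s in scores if s >= 60]  # 60 and above counts as passing here
    pass_rate = len(passing) / len(scores) * 100
    fail_rate = 100 - pass_rate

    print(f"Mean: {mean:.1f}")
    print(f"Pass rate: {pass_rate:.0f}%   Fail rate: {fail_rate:.0f}%")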
Histograms: Histograms are visual representations of how well a group of students did on a test or quiz. Histograms can easily be drawn from a chart of grade breakdowns (the number of A, B, C, D and F grades received on a test). These totals are then graphed on a chart. The resulting curve represents how the class did as a whole on the test.

Computing Basic Statistics for Classroom Use

Figuring the mean
1. Add the grades of all students.
2. Divide the total of the grades by the number of students.
3. The result is the mean for that test or quiz.

Figuring the pass rate
1. Count the number of students in each grade category. In some systems, this will be A, B, C, D, F. Note that test and quiz grades can be out of any number, not just 100.
2. Divide the number of students who received a grade in any passing category by the number of students who took the test.
3. The result is the pass rate for that class for that test.

Figuring the failure rate
1. Count the number of students in each grade category. In some systems, this will be A, B, C, D, F.
2. Divide the number of students who received a grade in any failing category by the number of students who took the test.
3. The result is the failure rate for that class for that test.

Plotting a histogram
1. A histogram is a picture of your grade distribution or breakdown. It is a simple graph with two axes. One side (the vertical) represents the number of students who took the exam. The horizontal side has the range of grades received on the exam.
2. Create the vertical axis by showing how many students took the exam. For example, if you have 25 students in your class, have the bottom represent 0 and the top 25, with intervals of 5 students.
3. Create the horizontal axis by showing the range of possible grades. For example, you may have F on the left side, followed by D, C, B, and A in ascending order.
4. Plot the number of students who received each grade. Remember to note grades for which there were no scores at the zero level. Then you can either use bars to depict the number of scores in each category or connect the dots at the top of each column (include the zeroes!).
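For those who would rather plot electronically than on paper, here is a minimal sketch of the same histogram using Python with the matplotlib library (an assumption; a spreadsheet chart works equally well). The grade counts are invented for a class of 25.

    # A minimal sketch: plotting a grade-distribution histogram (invented counts).
    import matplotlib.pyplot as plt

    grades = ["F", "D", "C", "B", "A"]
    counts = [2, 4, 9, 7, 3]  # include zeroes for grades no one received

    plt.bar(grades, counts)
    plt.xlabel("Grade")               # horizontal axis: range of grades
    plt.ylabel("Number of students")  # vertical axis: how many students
    plt.ylim(0, 25)                   # top of scale = class size (25 here)
    plt.title("Grade distribution")
    plt.show()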
Sample Classroom Statistics Worksheet

Teacher: ___________________________    Class: ________________
Exam: _____________________________     Date: _________________

Grade bands:
  Failing (F): 59 and below
  D grades: 60 to 69
  C grades: 70 to 79
  B grades: 80 to 89
  A grades: 90 to 100

Total number of students taking exam: _________
  Number passed: _________    Pass rate: _____%
  Number failed: _________    Fail rate: _____%

Grade breakdown:
  A: n = ____  % = ____
  B: n = ____  % = ____
  C: n = ____  % = ____
  D: n = ____  % = ____  (cusp: n = ____  % = ____)
  F: n = ____  % = ____

Class mean:
  Total of all scores: ___________ divided by number of students: _______ = __________ mean

[Blank histogram grid for plotting the grade distribution: vertical axis 0 to 25 students in intervals of 5; horizontal axis grades F, D, C, B, A]
Technology for Testing

Technology is increasingly employed in testing. The last decade has seen a progression from scanned examinations to the recent widespread use of computer adaptive testing. Each institution embraces technology according to available resources, but whatever the use, technology should always be in the service of teaching and testing, not the other way around.

Scanned Tests
Many schools and colleges use scanners to quickly and accurately grade and analyze objective tests. From a testing perspective, the most important consideration is that items and tasks follow good testing practice, since the results will only be as good and valid as the test itself. The use of a scanner requires that students fill in a special answer or "bubble" sheet that is machine-readable; these sheets are read by an OMR, or Optical Mark Reader. Some students may inadvertently transfer information incorrectly from the question paper to the answer sheet, so students should receive training before scan sheets are used for high-stakes examinations.

Computer-Based Testing
Computer-based testing (CBT) has numerous advantages, including:
• quick, accurate results and feedback
• detailed statistical analysis
• easy administration with a high level of security
• item banks of validated test items
• encouragement of certain effective test-taking skills

However, the use of CBTs requires access to hardware and software and special training in test construction using formats that are amenable to machine scoring. Additionally, some skills such as writing and speaking can only be computer tested with sophisticated equipment.

Computer Adaptive Testing
Recently, institutions such as ETS have developed language tests using computer adaptive testing (CAT). CATs are tailored to individual ability levels, since each question presented to the candidate is based on the response to the previous question. In addition to all the CBT features noted above, advantages include shorter testing time and the establishment of the individual's unique level. CAT requires sophisticated equipment and test-writing skills based on item response theory.
Internet Resources

The Internet provides access to global resources on testing. Some Internet sites address generic testing issues while others are specific to English language testing. Most of the sites listed below "point to" other Internet testing resources. Please note that URLs change often; those provided were accurate as of mid-March 2004.

Language testers consider Glenn Fulcher's Resources in Language Testing site (http://www.dundee.ac.uk/languagestudies/ltest/ltr.html) the premier Internet resource. Dr. Fulcher maintains this site, which directs the searcher to virtually all important general and language assessment sites on the Internet. Information about language testing conferences as well as a range of articles on different aspects of assessment can be found here. In addition, Dr. Fulcher maintains the International Language Testing Association's (ILTA) website and provides links to the testing listserv LTEST-L.

The Clearinghouse on Assessment and Evaluation's site (http://www.ericae.net), provided by ERIC, the Educational Resources Information Center, offers links to a wide range of resources, including the ERIC Search Wizard. Useful aspects of the ERIC site include the online ERIC Thesaurus, which facilitates finding keywords for searches, and RIE, Resources in Education. It is also the access point for an excellent peer-reviewed electronic journal entitled Practical Assessment, Research and Evaluation.

CRESST, the National Center for Research on Evaluation, Standards and Student Testing (http://cresst96.cse.ucla.edu/index.htm), is funded by the U.S. Department of Education. It is primarily useful for K-12 teachers.

Information about North American high-stakes testing is found at the site maintained by Educational Testing Service (http://www.ets.org/), the developers of TOEFL, GRE and other standardized tests. These pages help teachers determine which tests are most appropriate for their students and also provide specific information on test specifications for preparing students to sit the exams. Look here for information on the next-generation TOEFL exam to be launched in 2005.

Language Testing Update from the University of Lancaster provides online summaries of recent publications and news of professional conferences at its site (http://www.ling.lancs.ac.uk/pubs/ltu/ltumain.htm).

Teachers who want to become more familiar with Computer Adaptive Testing can learn more about it at the University of Minnesota's CARLA website, which employs a FAQ (frequently asked questions) format (http://carla.acad.umn.edu/CAT.html).
Alternative Assessment

Traditional vs. Alternative Assessment
One useful way of understanding the concept of alternative assessment is to contrast it with traditional testing. Alternative assessment is different from traditional assessment in that it actually asks students to show what they can do. Students are evaluated on what they integrate and produce rather than on what they are able to recall and reproduce (Huerta-Macias, 1994). Tests are a means of determining whether students have learned what they have been taught. Sometimes tests serve as feedback to students on their overall progress in a language course. By contrast, alternative assessment provides alternatives to traditional testing in that it:
• does not intrude on regular classroom activities
• reflects the curriculum that is actually being implemented in the classroom
• provides information on the strengths and weaknesses of each individual student
• provides multiple indices that can be used to gauge student progress
• is more multiculturally sensitive and free of the norm, linguistic, and cultural biases found in traditional testing (Huerta-Macias, 1994)

Bailey (1998, p. 207) provides a very useful chart that effectively contrasts traditional and alternative assessment (see Figure 1).

Figure 1: Contrasting Traditional and Alternative Assessment

Traditional Assessment                  Alternative Assessment
One-shot tests                          Continuous, longitudinal assessment
Indirect tests                          Direct tests
Inauthentic tests                       Authentic assessment
Individual projects                     Group projects
No feedback provided to learners        Feedback provided to learners
Timed exams                             Untimed exams
Decontextualized test tasks             Contextualized test tasks
Norm-referenced score interpretation    Criterion-referenced score interpretation
Standardized tests                      Classroom-based tests
Types of Alternative Assessment

Self-assessment
Self-assessment plays a central role in student monitoring of progress in a language program. It refers to the student's evaluation of his or her own performance at various points in a course. An advantage of self-assessment is that it enhances students' awareness of outcomes and progress.

Portfolio Assessment
Portfolios are collections, assembled by both teacher and student, of representative samples of ongoing work over a period of time. The best portfolios are more than a scrapbook or 'folder of all my papers'; they contain a variety of work in various stages and utilize multiple media.

Student-designed Tests
A novel approach within alternative assessment is to have students write tests on course material. This process results in greater learner awareness of course content, test formats, and test strategies. Student-designed tests are good practice and review activities that encourage students to take responsibility for their own learning.

Learner-centered Assessment
Learner-centered assessment advocates using input from learners in many areas of testing. For example, students can select the themes, formats and marking schemes to be used. Involving learners in aspects of classroom testing results in reduced test anxiety and greater student motivation.

Projects
Typically, projects are content-based and involve a group of students working together to find information about a topic. In the process, they use authentic information sources and have to evaluate what they find. Projects usually culminate in a final product in which the information is given to other people. This product could be a presentation, a poster, a brochure, a display or many other options. An additional advantage of projects is that they integrate language skills in a real-life context.
Presentations
Presentations can be an assessment tool in themselves for speaking, but more often they are integrated into other forms of alternative assessment. Increasingly, students make use of computer presentation software, which helps them to clarify the organization and sequence of their presentation. Presentations are another real-life skill that gives learners an opportunity to address some of the socio-cultural aspects of communication, such as using appropriate register and discourse.

Future Directions
Because language performance depends heavily on the purpose for language use and the context in which it takes place, it makes sense to provide students with assessment opportunities that reflect these practices. In addition, we as language testers must be responsive to the differing learning styles of students. In the real world, we must demonstrate that we can complete tasks using the English language effectively both at work and in social settings. Our assessment practices must reflect the importance of using language both in and outside of the language classroom.
Testing Acronyms: How to Sound Like an Expert

Acronym      What it stands for
ACTFL        American Council on the Teaching of Foreign Languages
ALTE         Association of Language Testers in Europe
ASLPR        Australian Second Language Proficiency Ratings
CAE          Certificate in Advanced English
CAT          Computer-Adaptive Testing
CALT         Computer-Assisted Language Testing
CBT          Computer-Based Testing
CCSE         Certificates in Communicative Skills in English
CPE          Certificate of Proficiency in English
CRT          Criterion-Referenced Testing
EPT          English Placement Test (UMich)
ETS          Educational Testing Service
FCE          First Certificate in English
FSI Scales   Foreign Service Institute scales
IELTS        International English Language Testing System
ILTA         International Language Testing Association
IRT          Item Response Theory
KET          Key English Test
LAB          Language Aptitude Battery (Pimsleur)
LTRC         Language Testing Research Colloquium
LTU          Language Testing Update
MCQ          Multiple-Choice Question
MELAB        Michigan English Language Assessment Battery
MLAT         Modern Language Aptitude Test
MTELP        Michigan Test of English Language Proficiency
NRT          Norm-Referenced Testing
OPI          Oral Proficiency Interview (best-known ACTFL test)
PET          Preliminary English Test
SEM          Standard Error of Measurement
T/F/N        True/False/Not Given question
TOEFL        Test of English as a Foreign Language
TOEIC        Test of English for International Communication
TWE          Test of Written English
UCLES        University of Cambridge Local Examinations Syndicate
YLE          Cambridge Young Learners English Test
Glossary Of Important Testing Terms

Achievement test: measures what a learner knows of what he/she has been taught; this type of test is typically given by the teacher at a particular time during a course and covers a certain amount of material.

Alignment: the process of linking content and performance standards to assessment, instruction, and learning in classrooms.

Alternative assessment: refers to a non-conventional way of evaluating what students know and can do with the language; it is informal and usually administered in class; examples of this type of assessment include self-assessment and portfolio assessment.

Alternate forms: different editions of the same assessment written to meet common specifications and comparable in most respects, except that some or all of the questions differ in content.

Analytical scale: a type of rating scale that requires teachers to give separate ratings for the different components of language ability (e.g. content, grammar, vocabulary). This type of evaluation requires teachers to consider multiple dimensions of performance rather than give an overall impression.

Anchor items: a set of items that remains the same in two or more forms of a test for the purposes of equating; a characteristic found in computer adaptive tests and IRT.

Aptitude test: a test of general ability that is usually not closely related to a specific curriculum and that is used primarily to predict future performance.

Assessment: the process of gathering, describing or quantifying information about performance.

Authenticity: refers to evaluation based mainly on real-life experiences; students show what they have learned by performing tasks similar to those required in real-life contexts; one of the cornerstones of good testing practice.

Banding scale: a type of holistic scale that measures language competence via descriptors of language ability; an example is the IELTS band scale.
Benchmark: a detailed description of a specific level of student performance expected of students at particular ages, grades or developmental levels.

Bias: in general usage, this term refers to unfairness.

Branching test: an assessment in which test takers may be given different sets of items, depending on their responses to earlier items; this is a characteristic of computer adaptive testing.

Ceiling effect: the phenomenon where most test takers score near the top of the scale on a particular test; the test does not discriminate adequately at the higher ability levels.

Composite score: a score that is the combination of two or more scores by some specified formula.

Computer-based testing (CBT): tests that are administered to students on computer; question formats are frequently objective, discrete-point items; these tests are subsequently scored electronically.

Computer-adaptive testing (CAT): presents language items to the learner via computer; subsequent questions on the exam are "adapted" based on a student's responses to previous questions.

Concurrent validity: the relationship between a test and another existing measure.

Construct: the complete set of knowledge, skills, abilities, or traits an assessment is intended to measure.

Content validity: this type of validity refers to testing what you teach, the way you teach it; testing content covered in some way in the course materials, using formats that are familiar to the student.

Cornerstones of good testing practice: concepts that underpin good testing practice; they include usefulness, validity, reliability, practicality, transparency, authenticity, security and washback.

Construct validity: refers to the fit between the theoretical and methodological approaches used in a program and the assessment instruments administered.

Constructed-response item: a type of test item requiring students to produce their own responses, rather than select from a range of responses provided.

Criterion-referenced test: compares a student's performance to particular outcomes or expectations.
Curve grades: this refers to a practice whereby teachers add or subtract points on a test in order to make the results seem more acceptable; sometimes referred to as adjusting scores.

Cut score: a point on a scale above which test takers are classified in one way and below which they are classified in a different way.

Descriptive statistics: statistics that describe, or provide summary data on, the population taking the test; the most common descriptive statistics are the mean, mode, median, standard deviation and range. The mean, mode and median are known as measures of central tendency, while the standard deviation and range describe the spread of scores.

Diagnostic test: a type of formative evaluation that attempts to diagnose students' strengths and weaknesses; typically students receive no grades on diagnostic instruments.

Difficulty: the extent to which an item is within the ability range of the student.

Directed-response item: a test item that is designed to elicit an answer from a closed or constrained set of options.

Direct test: a test which measures ability through a performance that approximates an authentic language use scenario.

Discrete-point test: an objective test that measures the student's ability to answer questions on a particular aspect of language; discrete-point items are very popular with teachers because they are easy to score.

Discrimination: the power of an item to differentiate among test takers at different levels of ability.

Distractor: a response option in a forced-choice item which is not the key.

Distribution: the spread and pattern of a set of test scores or data.

Equating: a statistical procedure used to adjust scores on two or more alternate forms of an assessment so that the scores may be used interchangeably.

Equity: the concern for fairness, i.e. that assessments are free from bias or favoritism. At minimum, all assessments should be reviewed for a) stereotypes, b) situations that favor one culture over another, c) excessive language demands that prevent some students from showing their knowledge and d) the assessment's potential to include students with disabilities or limited English proficiency.

Equivalent forms: see alternate forms.
Error: nonsystematic fluctuations in scores caused by such factors as guessing or unreliable scoring; error is the difference between an observed score and the true score for an individual.

Evaluation: in most educational settings, evaluation means to measure, compare, and judge the quality of student work, schools or a specific educational program.

Face validity: refers to the overall appearance of the test; it is the extent to which a test appeals to test takers.

Fairness: the extent to which a test is appropriate for members of different groups regardless of gender, ethnicity etc.

Forced-choice item: an item which requires the test taker to choose from response options that are provided.

Formative evaluation: refers to tests that are designed to measure students' achievement of instructional objectives; these tests give feedback on the extent to which students have mastered the course materials. Examples of this type of evaluation include achievement tests and mastery tests.

Grade inflation: refers to the practice of giving students higher grades than they deserve or grades that are not commensurate with their language ability levels.

Halo effect: the tendency of a rater to let overall impressions/judgments of a person influence judgments on more specific criteria.

High-stakes test: a test whose outcome can affect the test taker's future; a high-stakes test is one where the test taker's future hinges on passing or failing.

Histogram: a graphic method of presenting statistical information.

Holistic scoring: is based on an impressionistic method of scoring; examples are the scoring used with the TOEFL Test of Written English (TWE) and the IELTS banding system.

Impact: the effect that a test has on an individual student, an educational system and on society.

Indirect test: a test that does not require the student to perform tasks that directly relate to the kind of language use targeted in the classroom.

Integrative testing: goes beyond discrete-point test items and contextualizes language ability; test takers are required to combine various skills to answer the test questions. Partial dictation is an example.
Inter-rater reliability: the extent to which marks are consistent between multiple graders or raters; it is established through rater training and calibration.

Intra-rater reliability: the extent to which a rater is consistent in using a proficiency rating scale; refers to consistency within a single rater.

Item: an individual question or exercise in an assessment or evaluative instrument.

Item analysis: a procedure whereby test items and distractors are examined based on the level of difficulty of the item and the extent to which they discriminate between high-achieving and low-achieving students. Results of item analyses are used in the upkeep and revision of item banks.

Item bank: a large bank or number of items measuring the same skill or competency; item banks are most frequently found in objective testing, particularly CBT and CAT.

Item response theory (IRT): a mathematical model relating performance on questions to characteristics of the test takers and characteristics of the item.

Item violation: refers to a common mistake that teachers/testers make when writing test items.

Key: the correct answer to a question.

Live pilot: a practice used by institutions that do not have the time or resources to pilot test items; it refers to administering a test that has not previously been piloted or pre-tested but is solidly based on empirically validated specifications and written by trained testers.

Live test: a test that is currently in use or one that is being stored for a future administration.

Mean: known as the arithmetic average; to obtain the mean, the scores are added together and then divided by the number of students who took the test; the mean is a descriptive statistic.

Median: one of the measures of central tendency; it represents the 50th percentile or the middle score.

Mode: the most frequently received score in a distribution.

Moderation: refers to a process of review or evaluation of test materials and rating performance.
Monkey score: refers to the score obtainable by random guessing, literally the score that a monkey would receive on an item should it randomly point to an answer; for an MCQ with four response options, the monkey score is 25%.

Multiple-choice test: a test of items where the student is required to select the correct/best answer from a selection of response options. MCQs include a stem (the question to be answered or the sentence to be completed) and a number of response options. One response option is the key while the others are distractors.

Norm-referenced test: measures language ability against the standard or "norm" performance of a group; standardized tests like the TOEFL are norm-referenced tests because they are normed through prior administrations to large numbers of students.

Objective test: can be scored based solely on an answer key; it requires no expert judgment or subjectivity on the part of the scorer.

Observed score: the score a person happens to obtain on a particular form of an assessment at a particular administration.

Ordering: refers to the sequencing of test items on a given test; considered to be an important factor in test development which can affect scores. There are generally two ways to order or sequence items: (1) a few easy items are placed at the beginning of the test and the rest are sequenced at random throughout; (2) the items are sequenced from easy to difficult.

Outlier: an extreme or 'rogue' score which does not seem to belong to the general answer pattern of the population; outliers may skew the distribution, as the mean is very sensitive to them.

Parallel tests: multiple versions of a test; they are written with test security in mind; they share the same framework, but the exact items differ.

Patching: a practice in high-stakes institutional testing whereby separate sub-scores are accepted from different test administrations; a student might take an exam and pass two of the three sections; the two sections that he/she passed would not need to be repeated.

Performance-based test: requires students to show what they can do with the language as opposed to what they know about the language; such tests are often referred to as task-based.

Performance standards: explicit definitions of what students must do to demonstrate proficiency at a specific level.

Piloting: a common practice among language testers whereby an item or a format is administered to a small random or representative selection of the population to be tested; information from piloting is commonly used to revise items and improve them; also known as field testing or trialing.
Placement test: administered to incoming students in order to place them at the correct ability level; content on placement tests is specific to a given curriculum; placement tests are most successfully produced in-house.

Portfolio assessment: one type of alternative assessment; portfolios are a representative collection of a student's work throughout an extended period of time; the aim is to document the student's progress in language learning via the completion of such tasks as reports, projects, artwork, and essays.

Practicality: one of the cornerstones of good testing practice; practicality refers to the practical issues that teachers and administrators must keep in mind when developing and administering tests; examples include time and available resources.

Practice effect: the phenomenon of taking two tests with the same or similar content, the result being a higher score on the second test with no actual increase in language ability.

Predictive validity: measures how well a test predicts performance on an external criterion.

Pretest: administering a test or set of test items before it goes live, for the purpose of collecting information about the students or identifying problems with the items.

Proficiency test: not specific to a particular curriculum; it assesses a student's general ability level in the language as compared to all other students who study that language. An example is the TOEFL.

Profile marking: sometimes called analytical marking; after marking, teachers have a 'profile' of each student's marks.

Range: one of the descriptive statistics; the range, or min/max, is the span from the lowest to the highest score in a distribution.

Rater: a person who evaluates or judges student performance on an assessment against specific performance criteria.

Rater training: the process of educating raters to evaluate student work and produce dependable scores.

Rating scale: an instrument used for the evaluation of writing and speaking; rating scales are either analytical or holistic, or a combination of the two.

Raw score: the number of items answered correctly.
Readability: the level of reading difficulty of a given text; most readability indices are based on vocabulary (frequency or length) and syntax (average sentence length); well-known readability formulas include Flesch-Kincaid and Fry.

Reliability: one of the cornerstones of good testing practice; reliability refers to the consistency of exam results over repeated administrations and the degree to which the results of an assessment are dependable and consistently measure particular student knowledge and/or skills.

Reported score: the actual score that is reported to the student.

Retired test: a test that is no longer live and may be released into the public domain; the term implies that the test was once secure and statistically validated; retired tests are often used as practice materials.

Security: measures taken to ensure that the test remains live and operational and does not fall into the hands of test takers.

Self-assessment: asks students to judge their own ability level in a language; one type of alternative assessment.

Severity: a characteristic of a rater; some raters may be consistently generous or lenient with scores, while others may be consistently harsh.

Specifications: a document that states what the test should be used for and whom it is aimed at; test specifications usually contain all instructions, examples of test formats/items, weighting information and pass/fail criteria.

Speededness: the extent to which test takers lack sufficient time to respond to items; for most tests, speededness is not a desirable characteristic.

Stakeholders: all those who have a stake or an interest in the use or effect of a particular test or assessment.

Standardized test: measures language ability against a norm or standard.

Standard error of measurement (SEM): a way of expressing test reliability.

Stem: the first part of a multiple-choice question; usually takes the form of a question or a sentence completion.

Stimulus: material provided as part of the test or task that the test taker has to respond to.

Subjective test: requires knowledge of the content area being tested; a subjective test frequently depends on impression, human judgment and opinion at the time of scoring.
Summative evaluation: refers to a test that is given at the end of a course or course segment; the aim of summative evaluation is to give the student a grade that represents his/her mastery of the course content.

Test anxiety: a feeling of nervousness or fear surrounding an assessment; it can occur before, during or after a test and has the potential to affect test performance.

Test equivalence: tests that are constructed from the same set of test specifications with the goal of testing the same skills; the scores on these tests are expected to be the same or similar.

Test-retest: parallel tests are administered before learning has occurred and after it has taken place, for the purpose of determining or measuring how much language has been learned over time.

Test wiseness: refers to the amount and type of preparation for, or prior experience with, the test that the test taker has.

Transparency: the idea that teachers and students have a right to know how they will be assessed and which criteria will be used to evaluate them.

True score: the score a person would receive if the test were perfectly reliable or the SEM were zero.

Validity: one of the cornerstones of good testing practice; refers to the degree to which a test measures what it is supposed to measure.

Washback: one of the cornerstones of good testing practice; refers to the impact a test or testing program may have on the curriculum.

Weighting: refers to the value that is placed on certain skills or sections within the exam.
Annotated Bibliography

OUR FAVORITE BOOKS ON LANGUAGE ASSESSMENT

Alderson, J. Charles, Caroline Clapham and Dianne Wall. 1995. Test Construction and Evaluation. Cambridge, U.K.: Cambridge University Press.
This volume describes and illustrates principles of test design, construction and evaluation. Each chapter deals with one stage of the test construction process. The final chapter examines current practice in EFL assessment.

Bachman, Lyle F. 1990. Fundamental Considerations in Language Testing. Oxford, U.K.: Oxford University Press.
This book explores the basic considerations that underlie the practical development and use of language tests: the nature of measurement, the contexts that determine the use of language tests, and the nature of both the language abilities to be measured and the testing methods that are used to measure them. Bachman also provides a synthesis of testing research.

Bachman, Lyle F. and Adrian S. Palmer. 1996. Language Testing in Practice: Designing and Developing Useful Language Tests. Oxford, U.K.: Oxford University Press.
This book relates language testing practice to current views of communicative language teaching and testing. It builds on the theoretical background set forth in Bachman's 1990 volume. The authors discuss the design, planning and organization of tests.

Bailey, Kathleen. 1998. Learning About Language Assessment: Dilemmas, Decisions, and Directions. TeacherSource Series, (ed.) Donald Freeman. Boston: Heinle.
This text provides a practical analysis of language assessment theory and accessible explanations of the statistics involved.

Brown, H. Douglas. 2003. Language Assessment: Principles and Classroom Practice. Hertfordshire, UK: Prentice Hall.
An accessible book on assessment by an experienced teacher and teacher trainer.
Cambridge Language Assessment Series. (Series Editors) J. Charles Alderson and Lyle Bachman. Cambridge: Cambridge University Press.
This excellent series of professional volumes includes:
Assessing Languages for Specific Purposes, Dan Douglas
Assessing Vocabulary, John Read
Assessing Reading, Charles Alderson
Assessing Listening, Gary Buck
Assessing Writing, Sara Cushing Weigle
Assessing Speaking, Sari Luoma
Assessing Grammar, Jim Purpura

Cohen, Andrew D. 1994. Assessing Language Ability in the Classroom. Boston, MA: Heinle.
This second edition presents various principles for guiding teachers through the assessment process (dictation, cloze summary, oral interview, roleplays, portfolio assessment). Cohen deals with issues in assessment, not just with testing. He also examines the test-taking process and presents up-to-date topics in language assessment.

Coombe, Christine and Nancy Hubley. 2003. Assessment Practices. TESOL Case Studies Series. TESOL Publications.
This edited volume includes case studies of successful language assessment practices from a global perspective.

Coombe, Christine, Keith Folse and Nancy Hubley. 2007. A Practical Guide to Assessing English Language Learners. Ann Arbor, MI: University of Michigan Press.
This co-authored volume includes chapters on the basics of language assessment. The content revolves around two fictitious language teachers: one who has very good instincts about how students should be assessed, and another who is just starting out and makes mistakes along the way. The manual you are reading today is the basis for this book.

Davidson, Fred and Brian Lynch. 2002. Testcraft: A Teacher's Guide to Writing and Using Language Test Specifications. New Haven: Yale University Press.
This book is about language test development using test specifications. It is intended for language teachers at all career levels.

Fulcher, Glenn. 2003. Testing Second Language Speaking. London: Longman/Pearson Education.
This book offers a comprehensive treatment of testing speaking in a second language. It will be useful for anyone who has to develop speaking tests in their own institutions.

Genesee, Fred and John A. Upshur. 1996. Classroom-Based Evaluations in Second Language Education. Cambridge, UK: Cambridge University Press.

The authors emphasize the value of classroom-based assessment as a tool for improving both teaching and learning. The book is non-technical and presupposes no specialized knowledge in testing or statistics. The suggested assessment procedures are useful for a broad range of proficiency levels, teaching situations, and instructional approaches.

Harris, Michael and Paul McCann. 1994. Assessment. Oxford, UK: Heinemann Publishers.

This volume examines the areas of formal and informal assessment as well as self-assessment. Within each section, practical guidance is given on the issues of purpose, timing, methods and content. The ready-to-use materials include model tests, self-assessment and assessment instruments which teachers can adapt to suit their instructional context.

Heaton, J. B. 1988. Writing English Language Tests. Harlow, England: Longman Press.

This volume gives detailed and practical suggestions on methods of classroom testing and shows how both students and teachers can gain the maximum benefit from testing. Examples of useful testing techniques are included as well as practical advice on using them in the classroom.

Hughes, Arthur. 2003. Testing for Language Teachers (2nd Edition). Cambridge, UK: Cambridge University Press.

This practical guide is designed for teachers who want to have a better understanding of the role of testing in language teaching. The principles and practice of testing are presented in a logical, accessible way and guidance is given for teachers who devise their own tests.

McNamara, Tim. 2000. Language Testing. Oxford Introductions to Language Study, ed. H.G. Widdowson. Oxford: Oxford University Press.

This book examines issues such as test design, the rating process, validity and measurement. The book looks at both traditional and newer forms of language assessment, the wider social and political context of testing and the challenges posed by new ideas.

Studies in Language Testing Series. (series editors) Michael Milanovic and Cyril Weir. Cambridge: Cambridge University Press.
This series focuses on important developments in language testing. The series has been produced by UCLES in conjunction with Cambridge University Press. Titles in the series are of considerable interest to test users, language test developers and researchers.

Some of the excellent volumes in this series include:

Using Verbal Protocols in Language Test Validation: A Handbook, Alison Green
Dictionary of Language Testing, Alan Davies et al.
Fairness and Validation in Language Assessment, Antony John Kunnan
Experimenting with Uncertainty: Language Testing Essays in Honour of Alan Davies, Catherine Elder et al.
The Equivalence of Direct and Semi-Direct Speaking Tests, Kieran O'Loughlin
The Development of IELTS: A Study of the Effect of Background on Reading Comprehension, Caroline Clapham
The Multilingual Glossary of Language Testing Terms
Learner Strategy Use and Performance on Language Tests: A Structural Equation Modelling Approach, Jim Purpura
Issues in Computer-Adaptive Testing of Reading Proficiency, Micheline Chalhoub-Deville
A Qualitative Approach to the Validation of Oral Language Tests, Anne Lazaraton

Weir, Cyril. 1993. Understanding and Developing Language Tests. Hertfordshire, UK: Prentice Hall.

This book is designed for language teachers, teacher educators, and language teacher trainees interested in the theory and practice of language tests. The book takes a critical look at a range of published exams and helps readers understand not only how tests are, and should be, constructed, but how they relate to classroom teaching.
Contact Information

The presenters can be contacted in the following ways:

By mail

Christine Coombe
Dubai Men's College, HCT
P.O. Box 15825
Dubai, United Arab Emirates

Nancy Hubley
Alice Lloyd College
100 Purpose Road, #83
Pippa Passes, KY 41844

By email

Christine Coombe: christine.coombe@hct.ac.ae
Nancy Hubley: njhubleyae@yahoo.com

Acknowledgments: The authors are grateful to the many people who have provided support for this project. Thanks are particularly due to our colleagues at UAE University, Zayed University and the Higher Colleges of Technology who participated in numerous workshops and piloted these materials. We are grateful to our students who have taught us so much about testing. Lastly, we appreciate the feedback from ELT colleagues in many countries who have shared their assessment experiences with us.