I would like to begin today’s presentation with a quote which, taken to an extreme, illustrates the effect that high stakes testing can have on students.
1. It is a common assumption that well known testing tools are useful for our purposes, because we believe that they are technically sound and ongoing research is being carried out. However… 2. What is the “best” test to use? 3. Given that we use the results of language proficiency tests to determine, in large part, the academic future of our students, we must ask what relevant information the test provides that would justify this use. 4. 5. These are questions that I will begin to address in today’s presentation and we will return to them at the end.
REFER AUDIENCE TO HANDOUT 1. examples: end of course tests, portfolio assessments 2. We accumulate evidence during, or at the end of a course of study in order to see whether and where progress has been made in terms of the goals of learning 3. designed to measure how much of a syllabus a learner has mastered and thus they are only valid to the extent to which the content of the test matches the content of the syllabus 4. The use of achievement tests allows instructors to be innovative and to reflect progressive aspects of the curriculum = they are thus associated with some interesting new developments , a movement known as alternative assessment 5. this approach stresses the need for assessment to be integrated with the goals of the curriculum -learners may be encouraged to share responsibility in assessment and be trained to evaluate their own capacities -known as self-assessment Refer to Brown and Hudson for a detailed discussion of alternative assessment
1. This is established for university admission, professional certification, the workplace, etc. 2. “language ability” - consequently not reflective of a specific syllabus 3. “stable trait” - this means that scores tend not to change within a short period of time; thus this type of test would not be useful for assessing learning over a few weeks. - indeed, such a change would mainly indicate statistical variance - however, programs are often pressured to employ such tests in order to determine the effectiveness of teaching 4. “predictions” - this is why such tests are used for admissions decisions and consequently are high stakes - they determine in great part a student’s academic and economic future - Interestingly, Hamp-Lyons notes that the vast majority of people who interpret test scores are neither teachers nor testing professionals; they are administrators.
Objective = no human interference, very highly reliable; subjective = individuals are involved in the evaluation process. Indirect = we make inferences from the test tasks, e.g. using a sentence structure question to infer the writing ability of a test taker; direct = no gap between test task and target language situation, e.g. assessing speaking skills in an interview. Discrete-point = multiple-choice, often isolated items; integrative = different skills are not separated but assessed holistically. External / internal. Norm-referenced = a test taker’s performance is evaluated against the range of performances typical of a population of similar test takers; criterion-referenced = performances are compared to one or more descriptions of adequate performance at a given level, e.g. band scores. Describing and evaluating tests on a continuum allows us to steer away from a black-and-white judgment.
REFER AUDIENCE TO HANDOUT 1. In order to determine which test is “best” for a given assessment situation, we need to evaluate its overall usefulness. 2. Bachman and Palmer include six qualities in their definition of usefulness (list): reliability = consistency of measurement; validity = the extent to which the inferences that we make on the basis of the test are valid given the target language use situation; authenticity = how closely the test resembles the actual language use situation; interactiveness = to what extent the test taker is involved in active communication; impact = the effect of the test on test takers, test users, teachers, etc. 3. These qualities are not all given equal weight, but they must all be considered in order to achieve a desired balance - consequently the balance will vary from one testing situation to another. - these elements cannot be evaluated independently but must be looked at in terms of their combined effect - it is OVERALL usefulness that needs to be emphasized rather than individual qualities. - evaluation of test usefulness is essentially subjective because it is based on judgements on the part of the test user REFER AUDIENCE TO QUESTIONS FOR EVALUATION ON HANDOUT
1. Two essential considerations in the evaluation of test usefulness are reliability and validity. 2. Reliability is necessary because we want to ensure that tests are scored in a reliable and consistent manner. However, strong reliability without validity tells us essentially nothing. 3. Therefore, construct validity is of specific interest to us because it is concerned with the extent to which we can interpret a given test score as an indicator of the ability we want to measure - thus, it addresses the meaningfulness and appropriateness of the interpretations that we make. 4. Threats to construct validity can occur when real requirements of the TLU domain are not fully represented in the test. We frequently hear people complain that even though students score very high on the TOEFL they lack basic communication skills. This is probably the case because interaction is not required by the test. The TSE is sometimes employed to remedy this; however, describing how a tourist can find the way to the train station will not necessarily translate into the ability to take part in round-table discussions.
Threats to content validity: the issue is to what extent the test content forms a satisfactory basis for the inferences to be made from performance, e.g. using the TOEFL to make inferences about the ability of an international student to act as a teaching assistant. If we want to use the scores from a language test to make inferences about individuals’ language ability, and possibly make various types of decisions, we must be able to demonstrate how performance on that language test is related to language use in specific situations other than the language test itself. That is why, when considering the six qualities just addressed, we always need to examine them in connection to the test taker, the test task and the Target Language Use. Ideally there should be a seamless connection between these three elements: the greater the distance, the less useful the inferences that we can make.
1. The largest language test prep industry has developed around this test. The introduction to one test prep book states: “you are well aware that the TOEFL is one of the most important examinations that you will ever take. Your entire future may well depend on your performance in the TOEFL. The results of this test will determine whether you will be admitted to the school of your choice.” 2. 1 million test takers 3. the TOEFL is 100% multiple choice - it uses “generic, or neutral” language and does not specify a context 4. Four sections - Listening section: test takers are not given the opportunity to preview questions, to see them while listening, or to take notes 5. Research at TOEFL places heavy emphasis on reliability but provides inadequate validity evidence. New developments include automatic essay scoring done by computer analysis of written structures - the TOEFL 2000 project aims to make changes to the construct of the test, which dates back to the 1960s.
1. Does not reflect current teaching and learning practices and could thus have negative effects on students and teachers because it is in conflict with them. 2. Passive recognition. Students who “pass” the test are often unable to communicate. However, institutions and other TOEFL score recipients that note inconsistencies such as high TOEFL scores and apparent weak English proficiency, should refer to the photo on the Official Score Report for evidence of impersonation 3. Cutoff scores. CPA called upon Canadian universities to refrain from using TOEFL as a standard for university admission - contrary to recommendations, decisions are often based solely on the score - interpretation of scores is difficult because the test is norm-referenced and simply provides a number - many have increased TOEFL cutoffs ranging from 580-600 - many who would otherwise be qualified for university admission are denied access - after an 8-week summer university orientation program given in English, students’ scores on the TOEFL itself increased from an average of 570 to 601 - mean score of native speakers reported by ETS is 590 4. General proficiency. In his critique of language tests and admission procedures, Elson quoted several studies that have found that merely knowing how a student scored on TOEFL will tell us practically nothing we need to know to predict the student’s academic performance 5. dissatisfaction has led to disuse of TOEFL by some, e.g. Australia - misuse of the TOEFL, cycles of raising and lowering requirements - TOEFL is used as an initial screen but other tests have to be taken upon arrival
1. listening section includes a variety of statements, questions, short conversations 2. reading section includes incomplete sentences, error recognition, and reading comprehension 3. Content is drawn from a wide variety of areas 4. tailored to provide rapid, affordable, and convenient service; therefore it only measures listening and reading, since these can be tested objectively. Testing writing and speaking requires time and expense, and these are “less objective and less reliable”
1.Concern with lack of correspondence between test tasks and target language use. Does not measure speaking - how do you know that person will be able to communicate in a business setting? 2. It only measures listening and reading but makes inferences to communicative ability 3. the test content is extremely broad and may in the end not provide any useful information to any of the fields that use this test
1. 205 test centers in over 100 countries 2. Test is divided into four modules, which have no central theme or topic but offer separate reading and writing tasks for either general or academic English use 3. listening: number of recorded texts which increase in difficulty as the test progresses, mixture of conversations and dialogues - allowed to preview 4. readings are taken from books, magazines, journals 5. writing includes two tasks 1. Write a 150 word report based on material found in a table or diagram, demonstrating ability to describe and explain. - Short essay of 250 words in response to an opinion or a problem expected to demonstrate ability to discuss issues, construct an argument, and use appropriate tone and register 6. Speaking is assessed during a 10-15min one-on-one interview. Requires the test taker to describe, narrate, and provide explanations on a variety of personal and general interest topics - objective key for listening and reading components, speaking and writing components are marked on a subjective key 7. The test includes a variety of task and response types
1. The actual tasks are reflective of academic tasks 2. Comprehensive scoring structure has advantage of giving students knowledge of what specific area of language needs special attention - when asked whether the subjective component of the assessment procedure might introduce a degree of unfairness into the testing process, Jill Richardson said that if the test is truly to be regarded as a communication oriented process, personal interaction is a necessary ingredient without which it is difficult to truly establish a person’s capacity to use language 3.need for more reliability research. -emphasis for UCLES has been on validity and this is also reflected in their certificate exams. It comes from a tradition where teaching professionals are trusted to make fair judgements. 4. It is one of the two tests accepted by the Canadian government for immigration purposes.
1. Was designed by Carleton U. in response to what they perceived as the failure of standardized tests to effectively identify students who were able to use English at the levels required for university study 2. test is grounded in day-to-day use of language within first-year courses at the university - this test is designed not for global knowledge of English but for English-medium academic contexts - attempts to recreate for the test taker the experience of joining an introductory first-year course 3. Integrated, criterion-referenced, topic-based test for EAP - uses constructed response rather than multiple-choice items - there is direct overlap between taking a CAEL assessment, taking an academically oriented ESL course or taking a first-year course at a university. The overlap is clear in the tasks and activities of the test - in this way the test aims to promote positive and useful learning - When completing practice tests, students are provided with a conversion key that states which skill is tested by each question
1. The nature of the test tasks encourages students to make use of their language knowledge and actively engages them 2. The language skills that are promoted by the test are in line with what a teacher would use in an EAP classroom 3. Research has shown that teachers evaluate their students’ in-class performances similarly 4. There is an ongoing tracking study that aims to link test performance with future academic performance 5. Even though the test was designed to create positive washback for language learners and teachers, some students reportedly have the same studying habits as for the TOEFL: staying at home for independent cramming. This demonstrates that a “positive” test does not have the same impact on all students.
1. Bailey “ there is a natural tendency for both teachers and students to tailor their classroom activities to the demands of the test, especially when the test is very important to the future of the students” 2. washback can be either positive or negative to the extent that it promotes or hinders achievement of language learning goals held by learners and educators 3. Complex interaction of factors. 4. The more information is available to teachers, learners, test users, and the more they are involved in the testing process, the more likely we will be creating positive impact
- considering that proficiency tests are the most powerful indicator for determining the academic future of ESL students, discussion needs to start focusing on the ethics and consequences of test use - Shohamy introduced the concept of critical language testing - this concept builds on a critical pedagogy perspective and emphasizes that the act of testing is both a product and agent of cultural, social, and political agendas - consequently the notion of “just a test” does not exist - what sort of vision of society does the test create? This question puts at the center the responsibility that test users carry with regard to the consequences of test use - need to examine the extent to which test agendas reflect the interests of the field of language teaching and learning - it calls into question traditional testing knowledge that views numbers as symbols of objectivity and truth - these numbers are powerful not only because those who use them consider them truthful but also because they allow classification, quantification and judgement. Success and failure are determined by arbitrary cutting scores and all test takers are judged according to the same yardstick - there is research to suggest that academic achievement in selected disciplines is hardly affected by degree of English language proficiency - how much do we actually know about the degree of English facility that is required for successful completion? Test developers and experts cannot agree what indeed the tests measure and they do not have a clear sense of We must accept responsibility for all the consequences that we are aware of.
1. there is what the receiving institution wants to know from a test, and there is what the test actually tests; these interests are not necessarily compatible 2. There is no “best” test. We need to consider all variables to make an appropriate choice 3. different tests produce different information. What connection is there between test items that measure surface structure recognition and the ability to be a successful student? If a test is isolated from the reality that the student will experience as a learner, it becomes accordingly less relevant 4. Impact = many of us may have encountered the answer to this question in our classrooms, when students demand to be taught to the test 5. language testing is used as a basis for refusing or admitting a student and thus shifts responsibility away from the institution itself. If the student meets the admission requirements to which native speakers are subject, then they should be admitted on the same basis. The provision of opportunities to continue developing English facility is part of a commitment to learning 6. AERA standards state that test developers should provide information on the strengths and the weaknesses of their instruments. However, the ultimate responsibility for appropriate test use and interpretation lies predominantly with the test user. 7. I hope that this brief overview of language proficiency testing will lead to further reflection on language testing and that these testing questions remain with us.
Please God may I not fail. Please God may I get over sixty per cent. Please God may I get a high place. Please God may all those likely to beat me get killed in road accidents and may they die roaring. (Irish novelist McGahern)
Overview: Types of language tests; Ways of describing tests; Evaluating the usefulness of language tests; Overview of common language tests: TOEFL, TOEIC, IELTS, and CAEL; Impact of testing on learning and teaching; Critical use of language tests; Testing Questions
Testing Questions: What is actually being tested by the test we are using? What is the “best” test to use? What relevant information does the test provide? How is testing affecting teaching and learning behaviour? Is language testing “fair”?
Types of Language Tests: Achievement test: associated with process of instruction; assesses where progress has been made; should support the teaching to which it relates. Alternative Assessment: need for assessment to be integrated with the goals of the curriculum
Proficiency test: aims to establish a test taker’s readiness for a particular communicative role; general measure of “language ability”; measures a relatively stable trait; used to make predictions about future language performance (Hamp-Lyons, 1998); high-stakes test
Some ways of describing tests: Objective vs. Subjective; Indirect vs. Direct; Discrete-point vs. Integrative; Aptitude / Achievement / Proficiency vs. Performance; External vs. Internal; Norm-Referenced vs. Criterion-Referenced
Evaluating the usefulness of a language test: Usefulness = reliability + validity + impact + authenticity + interactiveness + practicality (Bachman and Palmer, 1996). [Diagram: test usefulness comprising reliability, validity, impact, authenticity, practicality, interactiveness]
Evaluating the usefulness of a language test: Essential measurement qualities: reliability, construct validity; Evaluation: test taker - test task - Target Language Use (TLU). [Diagram: TLU, Test Task, Test Taker]
Overview of common language proficiency tests: TOEFL; TOEIC; IELTS; CAEL
Test of English as a Foreign Language: One million test takers per year; P&P 310-677 / CBT 0-300; Three sections: Listening, Structure and Written Expression, Reading Comprehension; TWE
Test of English as a Foreign Language: Objective vs. Subjective; Discrete-point vs. Integrative; Proficiency vs. Achievement; discord between test and understanding of language and communication; passive recognition of language; cutoff scores are very problematic; general proficiency ≠ academic proficiency
Test of English for International Communication: TOEFL equivalent for workplace setting; two sections, 200 q.: listening, reading; entertainment, manufacturing, health, travel, finance, etc.; “objective and cost-efficient”
Test of English for International Communication: Objective vs. Subjective; Discrete-point vs. Integrative; Proficiency vs. Achievement; lack of correspondence with TLU
International English Language Testing System: Academic/General; Results reported in band scores 1-9; Listening; G. Reading / A. Reading; G. Writing / A. Writing; Speaking
International English Language Testing System: Objective vs. Subjective; Discrete-point vs. Integrative; Proficiency vs. Achievement; test tasks reflective of academic tasks
Canadian Academic English Language Assessment: Mirrors language use in university; Topic-based, integrated reading, listening, and writing tasks; provides specific diagnostic information; scores are reported in bands 10-90
Canadian Academic English Language Assessment: Objective vs. Subjective; Discrete-point vs. Integrative; Proficiency vs. Achievement; tests performance and use; diminished gap between test and classroom; validity is supported by teacher evaluations; studies on predicting academic success
Washback: The Impact of Tests on Teaching and Learning: “The power of tests has a strong influence on curriculum and learning outcomes” (Shohamy, 1993); good test ≠ positive washback; form of test impact depends on: antecedent (educational context and condition), process, consequences (Wall, 2000)
Critical Language Testing: Focus on consequence and ethics of test use; Tests are embedded in cultural, educational, and political arenas: whose agenda?; Questions traditional testing knowledge; English proficiency = academic success?; English: got it or get it!; Responsible test use (Hamp-Lyons, 2000)
Testing Questions: What is actually being tested by the test we are using? What is the “best” test to use? What relevant information does the test provide? How is testing affecting teaching and learning behaviour? Is language testing “fair”?
Test design criteria: Usefulness = reliability + validity + impact + authenticity + interactiveness + practicality; reliability = consistency of measurement; validity = the extent to which the inferences that we make on the basis of the test are valid given the target language use situation; authenticity = how closely does the test resemble the actual language use situation; interactiveness = to what extent is the test taker involved in active communication; impact = what is the effect of the test on test takers, test users, teachers etc.
Time – language level – design; Layout; Theoretical support (one page to explain the test: explain why your test is useful, the type of test); Score 1 – 5 (create bands for scores); Make copies for the whole group; 15 minutes per skill (except speaking)