PTLC2005 J. Szpyra-Kozłowska, J. Frankiewicz, M. Nowacka, L. Stadnicka, Assessing Assessment Methods

Assessing assessment methods – on the reliability of pronunciation tests in EFL

Jolanta Szpyra-Kozłowska, Justyna Frankiewicz, Marta Nowacka, Lidia Stadnicka
Maria Curie-Skłodowska University, Lublin, Poland

1. Introductory remarks

Teaching another language is inevitably tied to testing. Teachers have to assess the learners’ linguistic ability, their progress and achievements. In this respect pronunciation is no different from other language skills; if we regard it as an important element of communicative competence which deserves a place in language instruction, we should also be able to evaluate the process of teaching and learning it as well as its outcome. Yet, as pointed out by Celce-Murcia et al. (1996: 341), ‘in the existing literature on teaching pronunciation, little attention is paid to issues of testing and evaluation.’ The major reason for this neglect is the fact that, as argued by Heaton (1988: 88), speaking, which obviously comprises pronunciation, is too complex a skill ‘to permit any reliable analysis to be made for the purpose of objective testing.’

The present paper addresses the issue of the reliability of the most frequently employed methods of assessing EFL learners’ pronunciation. First we examine impression-based pronunciation testing in the internationally recognized Cambridge English Examinations and point to its various shortcomings. Next we report on an experiment which compares two approaches to pronunciation testing: holistic (global, impressionistic) and atomistic (analytic). We point to their strengths and weaknesses, and show that they are not equivalent and lead to different results.

2. Pronunciation assessment in Cambridge English Examinations

In evaluating different methods of pronunciation testing, it seems useful to start by analyzing the way in which it is done in international English language examinations. Pronunciation does not play any important role in the majority of them (for a detailed analysis see Szpyra-Kozłowska 2003). Cambridge examinations are no exception to this rule; candidates get only 5%-6% of the total score for this skill. The assessment is impressionistic in nature. The following criteria have been adopted for the 5 basic examinations:

• KET (Key English Test) – pronunciation is heavily influenced by L1 features and may at times be difficult to understand;
• PET (Preliminary English Test) – pronunciation is generally intelligible, but L1 features may put a strain on the listener;
• FCE (First Certificate in English) – although pronunciation is easily understood, L1 features may be intrusive;
• CAE (Certificate in Advanced English) – L1 accent may be evident but does not affect the clarity of the message;
• CPE (Certificate of Proficiency in English) – pronunciation is easily understood and prosodic features are used effectively; many features, including pausing and hesitation, are ‘native-like.’

It is obvious that these requirements are very general and impression-based. Also, the comments addressed to examiners make constant reference to the vague notions of intelligibility and the amount of strain a candidate’s pronunciation puts on the listener. In the manual, evaluators, who are usually experienced nonnative teachers of
English, are instructed as follows: ‘when assessing pronunciation, examiners should try to put themselves in the position of a non-EFL specialist, native speaker of English and assess the amount of strain on the listener and the degree of patience and effort required to understand the candidate.’ This procedure raises the following doubts:

1. A professional teacher of English cannot be required to pretend to be a non-EFL specialist who, in addition, is a native speaker of English; not everyone has a talent for pretending to be a completely different person (and what if the examiner fails?).
2. It is not clear what kind of native speaker the examiner is supposed to impersonate: a well-travelled university professor, familiar with many nonnative varieties of English, or a small-town housewife who has never left her birthplace?
3. A nonnative teacher can in most cases understand even very bad English spoken by his or her fellow countrymen because of frequent exposure to it. Such a teacher is, therefore, in no position to judge its intelligibility to users of English of other nationalities.
4. Having no precise criteria of pronunciation assessment, the examiner is likely to adopt his or her own subjective principles of evaluation (see section 3). This often happens in spite of standardization procedures and examiners’ training.

We can conclude that the examinations under analysis do not provide clear-cut criteria for assessing the examinees’ pronunciation: they rely too heavily on very imprecise impressionistic judgements and make unreasonable demands on nonnative examiners. This, in turn, seriously undermines their inter-rater reliability.

3. Holistic versus atomistic pronunciation testing

As shown in the preceding section, Cambridge English Examinations, like many other language tests, employ rather objectionable impressionistic evaluation.
It is therefore crucial to examine its logical alternative, i.e. analytic testing. In this section the two approaches to pronunciation assessment are compared and evaluated.

In the holistic approach to language testing (Alderson et al. 1996: 289), ‘examiners are asked not to pay too much attention to any one aspect of a candidate’s performance, but rather to judge its overall effectiveness.’ The greatest advantage of this procedure is that it can be administered to large groups of learners within a short period of time. Moreover, according to Underhill (1987: 101), ‘impression marking is used for the kind of categories that are very hard to define but everybody agrees are important: fluency, ability to communicate, style, naturalness of speech, and so on.’ For these reasons it is advocated by many researchers (e.g. Celce-Murcia et al. 1996, Hughes 1991, Koren 1995).

Nevertheless, global pronunciation testing has many drawbacks. It is often too general and imprecise, since the assessment criteria in the rating scales, as has been shown in section 2, tend to be vague. In consequence, different raters might adopt their own criteria of evaluation. Finally, as pointed out by Underhill (1987: 101), ‘making accurate impression-based assessments requires a lot of experience. (…) Even experienced assessors find it difficult to make consistent impression-based judgements.’ In other words, this procedure raises problems of both intra-rater and inter-rater reliability.

Analytic evaluation consists in establishing a detailed marking scheme in which specific aspects of the learner’s performance are evaluated separately. Subsequently
these different ratings are combined to provide an overall mark. An atomistic approach to pronunciation testing thus involves judgements on the correctness of the learner’s production of particular vowels, consonants, stress, rhythm, intonation, etc. This method of pronunciation testing is claimed to be more objective than the holistic approach as it provides a more detailed diagnosis of the learner’s problems and achievements. It is generally preferred by pronunciation specialists and phoneticians (e.g. Vaughan-Rees 1989).

On the other hand, the atomistic procedure is not without its problems. It is extremely time-consuming: the learners’ speech samples have to be recorded and the raters have to listen to them several times. For these reasons this approach seems unsuitable for large classes and examinations with many participants.

According to Hughes (1991), the choice between holistic and analytic scoring depends to some extent on the purpose of testing. Atomistic tests are more reliable for diagnostic purposes in the language classroom and in situations in which scoring is carried out in many places by different judges, while holistic evaluation, which is faster, is more appropriate for experienced scorers who are thoroughly familiar with the grading system.

In order to compare the two approaches, we carried out an experiment whose primary goal was to examine whether the holistic and atomistic procedures of pronunciation testing are equivalent and yield the same results.

In the experiment reported here 10 judges, all teachers of English, evaluated the pronunciation of 10 randomly selected intermediate Polish learners, secondary school pupils, who were asked to read aloud a short passage, which was recorded.
The raters were first asked to evaluate the pupils’ recorded pronunciation holistically, using the ordinary scale of Polish school marks of 1, 2, 2.5, 3, 3.5, 4, 4.5, 5 and 6, where 1 = failure and 6 = excellent. After a break of two weeks the same group of raters assessed the recordings once again. On this occasion they were given the following 6 criteria to be employed in the evaluation: pronunciation of individual words, vowel quality (the /i/ - /i:/ distinction in particular), the interdental fricatives, the -ing suffix, word stress and other phonetic features. Each of these aspects was rated individually using the same scoring scale as before. Subsequently, the means were calculated. Finally, the assessors were asked to comment on the strengths and weaknesses of both approaches.

The questionnaires revealed that in making the holistic evaluation the raters in fact adopted various analytic criteria (such as the pronunciation of ‘silent’ letters, intonation, pauses, devoicing of final obstruents, etc.), which differed from person to person. Moreover, 90% of the assessors regarded atomistic testing as more reliable and objective.

The table below contains the results of the experiment. We provide the averaged atomistic and holistic marks given by the raters.

             ASSESSMENT
Learners     Holistic     Atomistic
L1           3.7          3
L2           3            2.5
L3           4            3.2
L4           3.2          3.1
L5           4            3.3
L6           4.4          4.1
L7           4.1          3.6
L8           3            3
L9           2.5          2.8
L10          3.4          3.1
Mean         3.53         3.17

Table 1. Results of holistic and atomistic assessment

As can clearly be observed, in 8 cases out of 10 the mean atomistic marks are lower than the holistic marks. In one case the results are reversed and in one they are the same. The obtained means are 3.53 in the holistic evaluation and 3.17 in the analytic procedure.

To verify the obtained results, another experiment, a replica of the previous one, was conducted with a different group of 5 raters and 5 other learners. This time the mean scores were 3.56 in the holistic and 3.04 in the analytic assessment.

Thus, a conclusion can be drawn that the holistic and atomistic approaches to pronunciation testing are not equivalent; the former usually results in higher scores than the latter. This means that raters generally tend to be more lenient in their overall impressions than in judgements made on the basis of more specific criteria. A likely explanation is that atomistic testing focuses the rater’s attention on finding errors, whereas in the holistic procedure the criterion of intelligibility is employed, which allows for a more tolerant approach to phonetic inaccuracies.

4. Final remarks

Pronunciation is extremely difficult to test in an objective and reliable fashion. We have demonstrated that Cambridge English Examinations, just like other similar tests, are based entirely on impressionistic evaluation and raise many objections with regard to their reliability. We have considered an alternative procedure of analytic evaluation and demonstrated that the two methods are not exactly equivalent, holistic assessment being more lenient and permissive than analytic assessment.
The atomistic approach can be regarded as more objective and reliable, and is particularly well-suited for diagnostic purposes as it allows the teacher to identify the learners’ specific pronunciation problems, to be dealt with in the course of subsequent instruction. It is, however, time-consuming and not easy to carry out with large groups of learners or examinees. Holistic testing, on the other hand, is technically simpler to carry out. It is invaluable in assessing the overall impression, the intelligibility of the learner’s speech and other aspects of his or her pronunciation which cannot be easily expressed by means of definite, clear-cut criteria. Its reliability, however, is questionable. Apparently, neither of these two methods can be viewed as fulfilling all the necessary requirements of objectivity, reliability and practicality.
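The summary figures reported for Table 1 in section 3 can be re-derived from the per-learner marks. The following is a minimal sketch in Python, offered only as an arithmetic check and not as the procedure actually used in the study; the names `TABLE_1` and `summarize` are illustrative assumptions, and the marks are copied from Table 1.

```python
from statistics import mean

# Mean marks per learner, copied from Table 1 as (holistic, atomistic).
# The dictionary name and layout are illustrative, not part of the study.
TABLE_1 = {
    "L1": (3.7, 3.0),
    "L2": (3.0, 2.5),
    "L3": (4.0, 3.2),
    "L4": (3.2, 3.1),
    "L5": (4.0, 3.3),
    "L6": (4.4, 4.1),
    "L7": (4.1, 3.6),
    "L8": (3.0, 3.0),
    "L9": (2.5, 2.8),
    "L10": (3.4, 3.1),
}


def summarize(scores):
    """Return the overall means and the direction of each learner's difference."""
    pairs = list(scores.values())
    return {
        # Overall means across learners, rounded to two decimals as in Table 1.
        "holistic_mean": round(mean(h for h, _ in pairs), 2),
        "atomistic_mean": round(mean(a for _, a in pairs), 2),
        # Mean of the per-learner differences (holistic minus atomistic).
        "mean_difference": round(mean(h - a for h, a in pairs), 2),
        # Number of learners whose atomistic mark is lower, higher or equal.
        "atomistic_lower": sum(1 for h, a in pairs if a < h),
        "atomistic_higher": sum(1 for h, a in pairs if a > h),
        "equal": sum(1 for h, a in pairs if a == h),
    }


if __name__ == "__main__":
    for key, value in summarize(TABLE_1).items():
        print(f"{key}: {value}")
```

With the marks from Table 1 this gives means of 3.53 and 3.17, with the atomistic mark lower for 8 learners, higher for 1 and equal for 1, in agreement with the figures in section 3. The sketch makes no claim about statistical significance, which the paper does not report.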

References

Alderson, C. J., Wall, D. & C. Clapham. (1996). Language Test Construction and Evaluation. Cambridge: Cambridge University Press.
Celce-Murcia, M., Brinton, D. & J. Goodwin. (1996). Teaching Pronunciation: a Reference for Teachers of English to Speakers of Other Languages. Cambridge: Cambridge University Press.
Heaton, J. B. (1988). Writing English Language Tests. London: Longman.
Hughes, A. (1991). Testing for Language Teachers. Cambridge: Cambridge University Press.
Koren, S. (1995). “Foreign language pronunciation testing: a new approach.” System 23 (3). 387-400.
Szpyra-Kozłowska, J. (2003). “Miejsce i rola fonetyki w międzynarodowych egzaminach Cambridge, TOEFL i TSE” [The place and role of phonetics in the international Cambridge, TOEFL and TSE examinations]. Zeszyty Naukowe PWSZ w Płocku. Neofilologia. Tom V. 181-191.
Underhill, N. (1987). Testing Spoken Language. A handbook of oral testing techniques. Cambridge: Cambridge University Press.
Vaughan-Rees, M. (1989). “The testing of pronunciation – receptive skills.” Speak Out! 4. p. 8.