11. TEEP 2004: 54% achieved a grade of at least 6.5 (compared with 70% in 2003 and 66% in 2001)
25. Sample “Principles” from ILTA’s Code of Ethics (available for public consultation on ILTA’s webpage at http://www.iltaonline.com)
Principle 1: “Language testers shall have respect for the humanity and dignity of each of their test takers. They shall provide them with the best possible professional consideration and shall respect all persons’ needs, values and cultures in the provision of their language testing service.”
Principle 6: “Language testers shall share the responsibility of upholding the integrity of the language testing profession.”
Principle 9: “Language testers shall regularly consider the potential effects, both short and long term, on all stakeholders of their projects, reserving the right to withhold their professional service on the grounds of conscience.”
27. Sample items from ILTA’s Code of Practice (available for public consultation on ILTA’s webpage at http://www.iltaonline.com)
Item A2: “All tests, regardless of their purpose or use, must provide information which allows valid inferences to be made. Validity refers to the accuracy of the inferences and uses that are made on the basis of the test’s scores. If, for example….” (Item continues for 5 more lines)
Item B2: “A test designer must decide on the construct to be measured and state explicitly how that construct is to be operationalised.”
Item B6: “Those doing the scoring should be trained for the task and both inter- and intra-rater reliability should be calculated and published.”
Item D3: “Those preparing and administering publicly available tests should publish validity and reliability estimates and bias reports for the test, along with sufficient explanation to allow potential test takers and test users to decide if the test is suitable in their situation.”
34. Thank you… for your time & attention. Here’s something to make you smile…
Editor's Notes
Welcome to our presentation about Standards in Language Testing and, specifically, working with the EALTA Guidelines. It’s pretty long, so we’d recommend coffee, paracetamol, pillows….
This presentation comes under the theme of standards in language testing. As we’ve been reading this week, the term standards can have different uses in the field. We are thinking about sets of principles that attempt to define good practice in language testing, and refer you to a description given by Alderson, Clapham & Wall of standards, that is: “an agreed set of guidelines which should be consulted and, as far as possible, heeded in the construction and evaluation of a test” . In LTCE, those esteemed authors pose questions for further consideration, such as Can holistic standards be applied to all tests?, What ideals should they describe?, and How prescriptive should they be? Perhaps we can consider these questions more in our discussion online.
We’ve split our presentation into 3 parts. In Part One, we set out to accomplish the main task – that is, to apply the EALTA Guidelines to a test known as TEEP. Following that, we will then take a critical look at the EALTA Guidelines outlining sections that we think are unclear or problematic; and then, in Part Three, we make some conclusions about the most important things that David and I learned in this assignment, as well as some final thoughts on standards and language testing.
So, Part 1. Most of you will, by now, be familiar with EALTA, the European Association for Language Testing and Assessment. EALTA is an independent, professional association supported financially by the European Community. Its declared aims are to promote understanding of the theoretical principles of language testing and assessment, and to improve and share assessment practices. Furthermore, EALTA promotes adherence to the principles of transparency, accountability and quality.
Hopefully by now you will have had a chance to look at the EALTA Guidelines and will have your own ideas about their applicability. They are aimed at three groups of people but, for our purposes, our main focus was section C – test development in national or institutional testing centres. Two further points: firstly, the guidelines appear as questions for consideration rather than principled statements. Secondly, guidelines such as these need to be tailored to the context of each particular testing situation or ‘testing culture’ that uses them. This last point brings us to one of the main issues in adopting standards, which Alan Davies describes as the struggle of “maintaining a balance between the demands of the social on the one hand and the rights of the individual on the other”. In other words, standards need to be general enough to be understood by everyone, but specific enough to apply to each individual test situation. We’ll revisit this theme in Part 2.
We were asked to use EALTA’s Guidelines to learn more about TEEP. But what is TEEP, you’re asking… As you can see from the slide, TEEP stands for Test of English for Educational Purposes and is accepted by many Admissions Departments of higher education institutions in Britain. TEEP aims to assess English language proficiency for academic purposes through three sub-tests – reading, writing and listening.
So, on to our task – to apply the EALTA Guidelines to TEEP. In the following slides we describe our findings, setting out, for each area, first which information was available in the TEEP documentation, and then which information was only partially stated or missing.
EALTA state that “test developers are encouraged to engage in dialogue with decision makers in their institutions and ministries to ensure that decision makers are aware of both good and bad practice, in order to enhance the quality of assessment systems and practices.” TEEP has been subjected to one major revision and, as this “was sanctioned by The University of Reading” in 1999 and involved language testing experts, we feel TEEP has engaged with decision-makers and considered its practices. The revision project was prompted by: suggestions that particular items were not functioning as predicted, which highlighted difficulties in the test’s quality systems; and a feeling that the original needs analysis required updating to reflect the current language needs of overseas students in British universities and present ways of looking at language competence.
We applied the EALTA Guidelines to TEEP under different headings – the first being Test Purpose and Specification. TEEP does have a clearly stated purpose and clearly describes its test-takers (see slide 6 for a reminder), of whom there were 347 in 2004. Several handbooks have been produced which describe the test specifications for different audiences – the candidates, teachers on preparation courses, and general readers – and the test methods and tasks are described and exemplified. Descriptions of the constructs underlying the sub-tests are also included, and information about the performance of TEEP is given for 2001, 2003 and 2004.
TEEP scores are given in the form of an average of the three papers – listening, reading and writing – on a scale of 0 to 9. The majority of its candidature comes from the pre-sessional courses at the Centre for Applied Language Studies (CALS) at Reading University in the UK. Only applicants at a certain level are accepted onto the course, which aims to ensure participants reach a 6.5 or 7.0 score – typically the required scores for entry into university institutions. Interestingly, CALS uses a combination of TEEP score and continuous assessment to provide an overall assessment of performance.
In the histogram here we can see the results of the 2004 administrations of TEEP. Given that most candidates have prepared for TEEP in pre-sessional courses, it is unsurprising to find a bunching of scores around the 6.5/7.0 level, and very few at the lower end. Approximately 54% of the candidates achieved a grade of at least 6.5. This compares with 70% in 2003 and 66% in 2001.
Under the heading of Test Purpose and Specification, several areas of information are only partially given or missing from the TEEP materials. For example: there is no description of possible misuse of TEEP. There is no explicitly-stated reference to the Common European Framework of Reference, apart from suggestions that candidates below B1, or an intermediate level, should improve their language competence before sitting TEEP. Presumably, then, TEEP is intended for candidates at level B2 and above, but this isn’t made clear in the Candidate Handbook. And no rating scales are published. Evaluation Criteria for writing are included – although the scales themselves are not available for public consultation – having been revised after it was decided they should “be improved technically to avoid reporting unreliable test results.” However, it is unclear what factors influenced the changes, or how the listening and reading papers are scored.
Regarding test design and item writing, we found very little information within the TEEP literature. The three Examiner’s Reports for 2001, 2003 and 2004 imply that, at least for those years, ‘systematic procedures’ are indeed in place. However, whether these procedures “match the test specifications and comply with item writer guidelines” we cannot say, since there is no mention of their existence. There were no references to relevant teaching or testing experience, nothing about any training that test developers and item writers receive, and no information about item writing guidelines or feedback given to writers. We found no information about reviews and revisions of items to ensure they match test specifications. While there are studies of items, and a comment that “items were working at a very acceptable level”, we found contradictions in the descriptions of item facility values. For example, analysis of the 2001 administrations showed that items 1 and 2 had facility values of 0.25 and 0.5 respectively, accompanied by a comment that this was “somewhat of a surprise” as they were designed to be an easy introduction to the test. Obviously, this seemed like poor item design to us, and we were unable to find any information showing how, or if, these issues are tackled.
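For anyone unfamiliar with the statistic, a facility value is simply the proportion of candidates who answered an item correctly, so values of 0.25 and 0.5 mean only a quarter and a half of candidates got those supposedly easy items right. A minimal sketch of the calculation – the response data below are invented purely for illustration, not taken from TEEP:

```python
# Facility value (F.V.): the proportion of candidates answering an item
# correctly. Values near 1.0 indicate an easy item; values near 0.0 a hard one.

def facility_value(responses):
    """responses: list of 1 (correct) / 0 (incorrect) for one item."""
    return sum(responses) / len(responses)

# Hypothetical responses from eight candidates to two items:
item1 = [1, 0, 0, 0, 1, 0, 0, 0]   # 2 of 8 correct
item2 = [1, 0, 1, 0, 1, 0, 1, 0]   # 4 of 8 correct

print(facility_value(item1))  # 0.25
print(facility_value(item2))  # 0.5
```

An item intended as “an easy introduction” would normally be expected to show a facility value much closer to 1.0, which is why the reported figures surprised the developers.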
In terms of quality control, we can say that TEEP uses several procedures to try to maintain or improve performance, as you can see from the slide. Interestingly, and new to us, TEEP reports a Standard Error of Measurement (SEM), which describes the accuracy of its assessment. Since the three test sections combine to form the overall score, TEEP suggests it is important to produce a single estimate of the SEM for the whole test. As a result of their analyses, their calculations for 2004 report that TEEP is accurate to approximately 0.27 of a band scale. (This means that they are 68% certain that a score of 7.0 on the TEEP will lie within the range 6.73 to 7.27.) It might well be interesting to investigate the SEM of tests further.
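In classical test theory the SEM is conventionally derived from the standard deviation of the scores and the test’s reliability (SEM = SD × √(1 − reliability)), and a band of ±1 SEM around a score corresponds to roughly 68% confidence. TEEP does not publish the figures behind its 0.27 estimate, so the SD and reliability below are invented values chosen only to reproduce that band width:

```python
import math

def sem(sd, reliability):
    """Standard Error of Measurement: sd * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def band(score, sem_value):
    """Approximate 68% confidence interval: score +/- 1 SEM."""
    return (score - sem_value, score + sem_value)

# Hypothetical inputs (not TEEP's published figures):
s = sem(sd=1.35, reliability=0.96)   # 1.35 * sqrt(0.04) = 0.27
low, high = band(7.0, s)

print(round(s, 2))                    # 0.27
print(round(low, 2), round(high, 2))  # 6.73 7.27
```

The point of the sketch is simply that a single SEM figure lets a score user translate any reported band into an interval, which is what makes TEEP’s decision to publish one so welcome.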
The TEEP Handbooks are full of statistical analyses, many of which were unfamiliar to us – although we are novices at this game, obviously. I’m not going to pronounce them – see if you know them. By 2004, we found some examples of classical item analysis that were recognisable, although the statistics were accompanied by some fairly subjective comments.
Within the area of quality control, the EALTA Guidelines suggest there are areas where information is still outstanding. We couldn’t find anything about piloting – no trials data whatsoever – and therefore no information about revisions to items following trials, although there were some implications in the 2004 report, as you can see from the quote. There is only a brief mention of the three test versions – yes, three versions in 25 years – but no descriptions of version equivalence. We couldn’t see details about rater training or monitoring, and there was nothing about a complaints or appeals process.
Regarding Administration & Security, we had access to the three Examiner’s Reports produced since TEEP’s inception 25 years ago, although there is a stated “intention” in the 2004 report that they will become an annual publication. There was no information about the training or monitoring of administrators, except for the quote you can see on-screen. The TEEP certificate has only basic security features – an original signature and a stamp – and yet the literature claims that “At all stages, the TEEP test is secure”, without clarifying further.
To conclude our findings on TEEP, we should point out that TEEP is considered high stakes for candidates, the majority of whom study at CALS prior to taking the test. We couldn’t find any concrete information about how TEEP keeps pace with changes in the CALS curriculum, or whether the curriculum keeps pace with TEEP. And there was no information about how alternative assessment (including an optional Speaking test) is conducted and how it impacts on candidates. All this seems hugely relevant to washback, to which there was no reference. On these issues, there appears to be a certain lack of transparency. Now, if you’re still with us, it’s about to get interesting as we move on to Part 2 of the presentation, in which we take more of a critical look at the EALTA Guidelines.
In Part 2 of this presentation we will take a brief look at some of the problems we encountered when using the EALTA Guidelines, and then we will look at how another organization (the International Language Testing Association or ILTA) deals with the issue of standards.
As a basis for critiquing the EALTA Guidelines from a test developer’s perspective, we felt it might be interesting to measure the Guidelines in terms of their own ‘construct validity’, albeit from a rather surface or superficial point of view. We identified the construct here as the EALTA Mission Statement, which appears on the first page of the link we were given in Part 3 of Task 8.1 of this unit, and which we have reproduced for you here. (Narrator: Read Mission Statement out loud). Upon closer analysis we noticed that it would be possible to divide the Mission Statement into three separate parts or goals in terms of what EALTA intends to promote: A) an understanding of the theoretical principles of language testing and assessment; B) the sharing of testing and assessment practices; and C) the improvement of testing and assessment practices. All three of these goals are geared to the general European community. Finally, we then applied these goals to the EALTA Guidelines to see how well the Guidelines live up to the EALTA Mission Statement or, stated differently, how well the Guidelines fulfill their purpose or construct.
Again, from a rather surface point of view (meaning that we have no research or statistics to back up our claims), we felt that nevertheless it might be possible to make some basic assumptions about how well the EALTA Guidelines fulfill the three goals of the Mission Statement. For example we felt that the Guidelines most likely do fulfill the 1st and 2nd goals of the Mission Statement in that they do promote the understanding and sharing of theoretical principles and practices of language testing. Why? Simply because the EALTA Guidelines have been made public (on the internet and through other forms of communication) since 2006, and therefore it seems reasonable to assume that they have indeed helped people to understand and share different theories and concepts of language assessment.
We felt, however, that it is harder to ascertain (from a surface level) whether or not the Guidelines fulfill the 3rd goal of the Mission Statement – helping to improve testing and assessment practices. Perhaps merely by the fact that they have fulfilled goals A and B (promoting understanding and sharing) one can assume that the Guidelines have also helped improve testing and assessment practices. But we think that the Guidelines have one significant weakness, which can be seen as an obstacle to their actually helping improve testing and assessment practices at a realistic, day-to-day level. This has to do with the fact that the guidelines are in the form of QUESTIONS or CONSIDERATIONS rather than STATEMENTS. It seems to us that unless you are actually willing to convert questions into actions (“to make them explicit”, as Alan Davies says), then the questions serve mainly as considerations, and may or may not move test developers further along the path towards actually taking concrete steps to improve their language tests.
Our particular stance on this point is taken from the words of Alan Davies in the substantial work he has done on standards and ethics. In his article entitled Ethics, Professionalism, Rights and Codes (from Volume 7 of the 2nd edition of the Encyclopedia of Language and Education, 2008), Davies makes the argument that if standards are to be taken seriously, then they need to be converted into explicit statements about: a) what specific objectives or goals are being considered, and b) how those goals or objectives will be reached. In summarizing his thoughts on the subject, he claims that the three key tasks involved in setting and adhering to standards for language testing are to “describe, measure and report”. This, then, is precisely the problem we have with the EALTA Guidelines: they aren’t explicit enough, because they do not tell users how these ‘norms’ or considerations (in the form of questions) are actually to be met. This also brings us to our next point, which is the distinction in language testing standards between Codes of Ethics on the one hand and Codes of Practice on the other.
According to Davies (2008), a professional Code of Ethics is (quote) “a set of principles which draws upon moral philosophy and serves to guide good professional conduct. It is neither a statute nor a regulation and it does not provide guidelines for practice, but it is intended to offer a benchmark of satisfactory ethical behavior by members of the profession. A Code of Ethics is based on a blend of the principles of benevolence, non-maleficence, justice, a respect for autonomy and for civil society.” (unquote) (p.433). So, according to this definition, the EALTA Guidelines represent a clear CODE OF ETHICS and deserve to be judged as such.
As a means of illustrating what a Code of Ethics actually looks like, we wanted to share with you three of the nine principles from the International Language Testing Association’s Code of Ethics. As you can see, all three of them are rather general statements that prescribe the types of things that language testers “ought to do” or “ought to aspire to”. Taken as a whole, the general feeling that these principles convey is a rather lofty or idealistic one, but not in a pejorative sense. They set a high standard of what language testers should be aspiring towards and, in our opinion, this is definitely a positive characteristic.
According to Davies, a Code of Practice is meant to specify or “instantiate” the Code of Ethics. He states (quote) “while the Code of Ethics focuses on the morals and ideals in the profession, the Code of Practice identifies the minimum requirements for practice in the profession and focuses on the clarification of professional misconduct and unprofessional conduct.” (unquote). Regarding this last point and, as Ben pointed out earlier, we feel that one of the weaknesses in the TEEP documentation is that it fails to mention anything pertaining to the potential misuse of the test. To conclude this part of the presentation, we would like to briefly share with you in the next two slides an alternative to the EALTA Guidelines (or Code of Ethics), along with a brief example of a professional Code of Practice. Both documents are from the International Language Testing Association (ILTA) and are available for public consultation on ILTA’s webpage at http://www.iltaonline.com
As can be seen in the present slide, the breadth of detail exemplified in ILTA’s Code of Practice is quite remarkable. It includes seven sections that move from the general (such as “Basic Considerations for good testing practice in all situations”), through the more specific (such as “Obligations of institutions preparing or administering high stakes examinations”), to the separate “Rights & Responsibilities” not only of test-takers but of test-users as well. What seems especially noteworthy to us is the fact that key notions such as ‘test validity’ and ‘reliability’ are not just mentioned but actually explained in easy-to-understand terms, making them very difficult to ignore for anyone taking the time to consult the document. We would HIGHLY urge anyone involved in language testing to consult the full texts of the sample Codes we have just shared with you by going to ILTA’s webpage at www.iltaonline.com
We’d now like to close our presentation with three further observations about standards in language testing.
We begin with a personal view of some of the frustration we encountered when doing this assignment. The main part of the assignment, applying the EALTA Guidelines to the TEEP, took up approximately 75% of our time. Part of the reason for this is that the information on the TEEP webpage is organized very differently from the order of considerations in the EALTA Guidelines. Consequently we spent hours upon hours sifting through the myriad pages of information on the TEEP webpage to try to find answers to the EALTA questions and, in many cases, we were simply unable to answer those questions or needed to apply a fair amount of supposition and inference in order to answer them. Some may argue that this type of complaint is inappropriate for a presentation of this type, or that TEEP decision-makers (or the decision-makers of any exam board, for that matter) reserve the right to organize and design their webpages in any way they see fit. We, however, would beg to differ. We see this once again as an issue that is central to standards. In an ideal world, a given exam board would endorse a particular set of standards and would make it known which set of standards they were endorsing. Likewise, they would agree to have a section of their webpage devoted to the discussion of those standards, which would make it easier for potential clients and other interested parties to access that information. Whether that ideal world will ever exist is a separate matter. But the main thing we want to emphasize here is that no-one should have to go through the trouble that we did in trying to find basic information about an exam board’s adherence (or not) to a set of standards. On the contrary, that information should be easy to access, and clearly stated.
Another important issue is the one surrounding TEEP’s decision not to include a Speaking Test as one of their sub-tests. As Ben mentioned in Part 1 of this presentation, TEEP’s comments about whether or not to include a Speaking Test are rather contradictory. In their Examiner’s Report for 2001, they state their intention to include an “optional test of proficiency in speaking” in the 2002 administration. This is the last we hear about a speaking test until TEEP’s 2004 Examiner’s Report, when they state that (quote) “at present it is not practical to implement the involved academic-style speaking tasks that the developers would like to use.” (endquote). This is not the first time we have heard the word “practicality” mentioned in relation to standards for language tests. As we mentioned at the beginning of our presentation, when first referring to the EALTA Guidelines, standards need to be filtered through the particular context of each individual ‘test situation’, and practical concerns, as well as political and financial ones, are part of what makes each context different. In this sense, it is not unusual for TEEP to tell us that, due to practical constraints, they chose not to include a sub-test of Speaking. What does seem unusual, however, is that there is no further explanation beyond this. And we bring the point up because we feel that TEEP’s omission here pushes the boundaries a bit too far, in our opinion, towards the rights of the individual (to paraphrase Davies). In other words, while we respect the rights of individual exam boards to make their own informed decisions based on their own context, we also feel that exam boards, in the end, need to be held accountable to something or to someone, and this is where standards come in.
It just doesn’t seem even-handed or professional that an exam board should first state that they are in the process of developing a certain sub-test and then, two years later, turn around and say that out of practical concerns they changed their mind. At the very least a further explanation seems justified – otherwise it seems that we’re not all playing by the same rules.
Finally, we would like to leave you with one last consideration about standards, and it has to do with the analogy of ‘coming back to where we started’. When all is said and done, certain scholars (such as Davies, Messick and Gipps) have suggested that the concepts underlying standards and ethics in language testing are really about the same thing that writing professional language exams is about: making sure that our exams really test what we say they test. In other words, as Alan Davies writes in his article on Ethics, Professionalism, Rights and Codes, the concepts of standards and ethics, when they are stretched out to their fullest meaning, may end up touching the outer confines of construct validity. Perhaps this is something we will discuss further in the last unit for this Module on ‘Changing Views of Validity’, but for now we leave you to consider this idea with a quotation by Mr. Davies. Please take the time to read the slide at your leisure, since we have now reached the end of our presentation. We hope that you’ve enjoyed this presentation and we thank you for your time and attention.