11. TEEP 2004: 54% achieved a grade of at least 6.5 (compared with 70% in 2003 and 66% in 2001)
25. Sample “Principles” from ILTA’s Code of Ethics (available for public consultation on ILTA’s webpage at http://www.iltaonline.com)
Principle 1: “Language testers shall have respect for the humanity and dignity of each of their test takers. They shall provide them with the best possible professional consideration and shall respect all persons’ needs, values and cultures in the provision of their language testing service.”
Principle 6: “Language testers shall share the responsibility of upholding the integrity of the language testing profession.”
Principle 9: “Language testers shall regularly consider the potential effects, both short and long term, on all stakeholders of their projects, reserving the right to withhold their professional service on the grounds of conscience.”
27. Sample items from ILTA’s Code of Practice (available for public consultation on ILTA’s webpage at http://www.iltaonline.com)
Item A2: “All tests, regardless of their purpose or use, must provide information which allows valid inferences to be made. Validity refers to the accuracy of the inferences and uses that are made on the basis of the test’s scores. If, for example….” (Item continues for 5 more lines)
Item B2: “A test designer must decide on the construct to be measured and state explicitly how that construct is to be operationalised.”
Item B6: “Those doing the scoring should be trained for the task and both inter- and intra-rater reliability should be calculated and published.”
Item D3: “Those preparing and administering publicly available tests should publish validity and reliability estimates and bias reports for the test, along with sufficient explanation to allow potential test takers and test users to decide if the test is suitable in their situation.”
34. Thank you… for your time & attention. Here’s something to make you smile…
Editor's Notes
Welcome to our presentation about Standards in Language Testing and, specifically, working with the EALTA Guidelines. It’s pretty long, so we’d recommend coffee, paracetamol, pillows….
This presentation comes under the theme of standards in language testing. As we’ve been reading this week, the term standards can have different uses in the field. We are thinking about sets of principles that attempt to define good practice in language testing, and refer you to a description given by Alderson, Clapham & Wall of standards, that is: “an agreed set of guidelines which should be consulted and, as far as possible, heeded in the construction and evaluation of a test” . In LTCE, those esteemed authors pose questions for further consideration, such as Can holistic standards be applied to all tests?, What ideals should they describe?, and How prescriptive should they be? Perhaps we can consider these questions more in our discussion online.
We’ve split our presentation into 3 parts. In Part One, we set out to accomplish the main task – that is, to apply the EALTA Guidelines to a test known as TEEP. Following that, we will then take a critical look at the EALTA Guidelines outlining sections that we think are unclear or problematic; and then, in Part Three, we make some conclusions about the most important things that David and I learned in this assignment, as well as some final thoughts on standards and language testing.
So, Part 1. Most of you will, by now, be familiar with EALTA, the European Association for Language Testing and Assessment. EALTA is an independent, professional association supported financially by the European Community. Its declared aims are to promote understanding of the theoretical principles of language testing and assessment, and to improve and share assessment practices. Furthermore, EALTA promotes adherence to the principles of transparency, accountability and quality.
Hopefully by now you will have had a chance to look at the EALTA Guidelines and will have your own ideas about their applicability. They are aimed at three groups of people but, for our purposes, our main focus was section C – test development in national or institutional testing centres. Two further points: firstly, the guidelines appear as questions for consideration rather than principled statements. Secondly, guidelines such as these need to be tailored to the context of each particular testing situation or ‘testing culture’ that uses them. This last point brings us to one of the main issues in adopting standards, which Alan Davies describes as the struggle of “maintaining a balance between the demands of the social on the one hand and the rights of the individual on the other”. In other words, standards need to be general enough to be understood by everyone, but specific enough to apply to each individual test situation. We’ll revisit this theme in Part 2.
We were asked to use EALTA’s Guidelines to learn more about TEEP. But what is TEEP, you’re asking… As you can see from the slide, TEEP stands for Test of English for Educational Purposes and is accepted by many Admissions Departments of higher education institutions in Britain. TEEP aims to assess English language proficiency for academic purposes through three sub-tests – reading, writing and listening.
So, on to our task – to apply the EALTA Guidelines to TEEP. In the following slides we describe our findings, setting out, for each area, first which information was available in the TEEP documentation, and then which information was only partially stated or missing.
EALTA state that “test developers are encouraged to engage in dialogue with decision makers in their institutions and ministries to ensure that decision makers are aware of both good and bad practice, in order to enhance the quality of assessment systems and practices.” TEEP has been subjected to one major revision and, as this “was sanctioned by The University of Reading” in 1999 and involved language testing experts, we feel TEEP has engaged with decision-makers and considered its practices. The revision project was prompted by: suggestions that particular items were not functioning as predicted, which highlighted difficulties in the test’s quality systems; and a feeling that the original needs analysis required updating to reflect the current language needs of overseas students in British universities and present ways of looking at language competence.
We applied the EALTA Guidelines to TEEP under different headings – the first being Test Purpose and Specification. TEEP does have a clearly stated purpose and clearly describes its test-takers (see slide 6 for a reminder), of whom there were 347 in 2004. Several handbooks have been produced which describe the test specifications for different audiences – the candidates, teachers on preparation courses, and general readers – and the test methods and tasks are described and exemplified. Descriptions of the constructs underlying the sub-tests are also included, and information about the performance of TEEP is given for 2001, 2003 and 2004.
TEEP scores are given in the form of an average of the three papers – listening, reading and writing – on a scale of 0 to 9. The majority of its candidature comes from the pre-sessional courses at the Centre for Applied Language Studies (CALS) at Reading University in the UK. Only applicants at a certain level are accepted onto the course, which aims to ensure participants reach a 6.5 or 7.0 score – typically the required scores for entry into university institutions. Interestingly, CALS uses a combination of TEEP score and continuous assessment to provide an overall assessment of performance.
In the histogram here we can see the results of the 2004 administrations of TEEP. Given that most candidates have prepared for TEEP in pre-sessional courses, it is unsurprising to find a bunching of scores around the 6.5/7.0 level, and very few at the lower end. Approximately 54% of the candidates achieved a grade of at least 6.5. This compares with 70% in 2003 and 66% in 2001.
Under the heading of Test Purpose and Specification, several areas of information are only partially given or missing from the TEEP materials. For example: there is no description of possible misuse of TEEP. There is no explicitly-stated reference to the Common European Framework of Reference, apart from suggestions that candidates below B1, or an intermediate level, should improve their language competence before sitting TEEP. Presumably, then, TEEP is intended for candidates at level B2 and above, but this isn’t made clear in the Candidate Handbook. And no rating scales are published. Evaluation Criteria for writing are included – although the scales themselves are not available for public consultation – having been revised after it was decided they should “be improved technically to avoid reporting unreliable test results.” However, it is unclear what factors influenced the changes, or how the listening and reading papers are scored.
Regarding test design and item writing, we found very little information within the TEEP literature. The three Examiner’s Reports for 2001, 2003 and 2004 imply that, at least for those years, ‘systematic procedures’ are indeed in place. However, whether these procedures “match the test specifications and comply with item writer guidelines” we cannot say, since there is no mention of their existence. There were no references to relevant teaching or testing experience, nothing about any training that test developers and item writers receive, and no information about item writing guidelines or feedback given to writers. We found no information about reviews and revisions of items to ensure they match test specifications. While there are studies of items, and a comment that “items were working at a very acceptable level”, we found contradictions in the descriptions of item facility values. For example, analysis of the 2001 administrations showed that items 1 and 2 had facility values of 0.25 and 0.5 respectively, accompanied by a comment that this was “somewhat of a surprise” as they were designed to be an easy introduction to the test. Obviously, this seemed like poor item design to us, and we were unable to find any information showing how, or if, these issues are tackled.
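For anyone unfamiliar with the statistic, a facility value is simply the proportion of candidates who answered an item correctly, so values of 0.25 and 0.5 mean only a quarter and a half of candidates got those supposedly easy items right. A minimal sketch of the calculation – the response data below are invented purely for illustration, not taken from TEEP:

```python
# Facility value (F.V.): the proportion of candidates answering an item
# correctly. Values near 1.0 indicate an easy item; values near 0.0 a hard one.

def facility_value(responses):
    """responses: list of 1 (correct) / 0 (incorrect) for one item."""
    return sum(responses) / len(responses)

# Hypothetical responses from eight candidates to two items:
item1 = [1, 0, 0, 0, 1, 0, 0, 0]   # 2 of 8 correct
item2 = [1, 0, 1, 0, 1, 0, 1, 0]   # 4 of 8 correct

print(facility_value(item1))  # 0.25
print(facility_value(item2))  # 0.5
```

An item intended as “an easy introduction” would normally be expected to show a facility value much closer to 1.0, which is why the reported figures surprised the developers.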
In terms of quality control, we can say that TEEP uses several procedures to try to maintain or improve performance, as you can see from the slide. Interestingly, and new to us, TEEP reports a Standard Error of Measurement (SEM), which describes the accuracy of its assessment. Since the three test sections combine to form the overall score, TEEP suggests it is important to produce a single estimate of the SEM for the whole test. As a result of their analyses, their calculations for 2004 report that TEEP is accurate to approximately 0.27 of a band scale. (This means that they are 68% certain that a score of 7.0 on the TEEP will lie within the range 6.73 to 7.27.) It might well be interesting to investigate the SEM of tests further.
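In classical test theory the SEM is conventionally derived from the standard deviation of the scores and the test’s reliability (SEM = SD × √(1 − reliability)), and a band of ±1 SEM around a score corresponds to roughly 68% confidence. TEEP does not publish the figures behind its 0.27 estimate, so the SD and reliability below are invented values chosen only to reproduce that band width:

```python
import math

def sem(sd, reliability):
    """Standard Error of Measurement: sd * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def band(score, sem_value):
    """Approximate 68% confidence interval: score +/- 1 SEM."""
    return (score - sem_value, score + sem_value)

# Hypothetical inputs (not TEEP's published figures):
s = sem(sd=1.35, reliability=0.96)   # 1.35 * sqrt(0.04) = 0.27
low, high = band(7.0, s)

print(round(s, 2))                    # 0.27
print(round(low, 2), round(high, 2))  # 6.73 7.27
```

The point of the sketch is simply that a single SEM figure lets a score user translate any reported band into an interval, which is what makes TEEP’s decision to publish one so welcome.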
The TEEP Handbooks are full of statistical analyses, many of which were unfamiliar to us – although we are novices at this game, obviously. I’m not going to pronounce them – see if you know them. By 2004, we found some examples of classical item analysis that were recognisable, although the statistics were accompanied by some fairly subjective comments.
Within the area of quality control, the EALTA Guidelines suggest there are areas where information is still outstanding. We couldn’t find anything about piloting – no trials data whatsoever – and therefore no information about revisions to items following trials, although there were some implications in the 2004 report, as you can see from the quote. There is only a brief mention of the three test versions – yes, three versions in 25 years – but no descriptions of version equivalence. We couldn’t see details about rater training or monitoring, and there was nothing about a complaints or appeals process.
Regarding Administration & Security, we had access to the three Examiner’s Reports produced since TEEP’s inception 25 years ago, although there is a stated “intention” in the 2004 report that they will become an annual publication. There was no information about the training or monitoring of administrators, except for the quote you can see on-screen. The TEEP certificate has only basic security features – an original signature and a stamp – and yet the literature claims that “At all stages, the TEEP test is secure”, without clarifying further.
To conclude our findings on TEEP, we should point out that TEEP is considered high stakes for candidates, the majority of whom study at CALS prior to taking the test. We couldn’t find any concrete information about how TEEP keeps pace with changes in the CALS curriculum, or whether the curriculum keeps pace with TEEP. And there was no information about how alternative assessment (including an optional Speaking test) is conducted and how it impacts on candidates. All this seems hugely relevant to washback, to which there was no reference. On these issues, there appears to be a certain lack of transparency. Now, if you’re still with us, it’s about to get interesting as we move on to Part 2 of the presentation, in which we take more of a critical look at the EALTA Guidelines.
In Part 2 of this presentation we will take a brief look at some of the problems we encountered when using the EALTA Guidelines, and then we will look at how another organization (the International Language Testing Association or ILTA) deals with the issue of standards.
As a basis for critiquing the EALTA Guidelines from a test developer’s perspective, we felt it might be interesting to measure the Guidelines in terms of their own ‘construct validity’, albeit from a rather surface or superficial point of view. We identified the construct here as the EALTA Mission Statement, which appears on the first page of the link we were given in Part 3 of Task 8.1 of this unit, and which we have reproduced for you here. (Narrator: Read Mission Statement out loud). Upon closer analysis we noticed that it would be possible to divide the Mission Statement into three separate parts or goals in terms of what EALTA intends to promote: A) an understanding of the theoretical principles of language testing and assessment; B) the sharing of testing and assessment practices; and C) the improvement of testing and assessment practices. All three of these goals are geared to the general European community. Finally, we then applied these goals to the EALTA Guidelines to see how well the Guidelines live up to the EALTA Mission Statement or, stated differently, how well the Guidelines fulfill their purpose or construct.
Again, from a rather surface point of view (meaning that we have no research or statistics to back up our claims), we felt that nevertheless it might be possible to make some basic assumptions about how well the EALTA Guidelines fulfill the three goals of the Mission Statement. For example we felt that the Guidelines most likely do fulfill the 1st and 2nd goals of the Mission Statement in that they do promote the understanding and sharing of theoretical principles and practices of language testing. Why? Simply because the EALTA Guidelines have been made public (on the internet and through other forms of communication) since 2006, and therefore it seems reasonable to assume that they have indeed helped people to understand and share different theories and concepts of language assessment.
We felt, however, that it is harder to ascertain (from a surface level) whether or not the Guidelines fulfill the 3rd goal of the Mission Statement – helping to improve testing and assessment practices. Perhaps merely by the fact that they have fulfilled goals A and B (promoting understanding and sharing) one can assume that the Guidelines have also helped improve testing and assessment practices. But we think that the Guidelines have one significant weakness, which can be seen as an obstacle to their actually helping improve testing and assessment practices at a realistic, day-to-day level. This has to do with the fact that the guidelines are in the form of QUESTIONS or CONSIDERATIONS rather than STATEMENTS. It seems to us that unless you are actually willing to convert questions into actions (“to make them explicit”, as Alan Davies says), then the questions serve mainly as considerations, and may or may not move test developers further along the path towards actually taking concrete steps to improve their language tests.
Our particular stance on this point is taken from the words of Alan Davies in the substantial work he has done on standards and ethics. In his article entitled Ethics, Professionalism, Rights and Codes (from Volume 7 of the 2nd edition of the Encyclopedia of Language and Education, 2008), Davies makes the argument that if standards are to be taken seriously, then they need to be converted into explicit statements about: a) what specific objectives or goals are being considered, and b) how those goals or objectives will be reached. In summarizing his thoughts on the subject, he claims that the three key tasks involved in setting and adhering to standards for language testing are to “describe, measure and report”. This, then, is precisely the problem we have with the EALTA Guidelines: they aren’t explicit enough, because they do not tell users how these ‘norms’ or considerations (in the form of questions) are actually to be met. This also brings us to our next point, which is the distinction in language testing standards between Codes of Ethics on the one hand and Codes of Practice on the other.
According to Davies (2008), a professional Code of Ethics is (quote) “a set of principles which draws upon moral philosophy and serves to guide good professional conduct. It is neither a statute nor a regulation and it does not provide guidelines for practice, but it is intended to offer a benchmark of satisfactory ethical behavior by members of the profession. A Code of Ethics is based on a blend of the principles of benevolence, non-maleficence, justice, a respect for autonomy and for civil society.” (unquote) (p.433). So, according to this definition, the EALTA Guidelines represent a clear CODE OF ETHICS and deserve to be judged as such.
As a means of illustrating what a Code of Ethics actually looks like, we wanted to share with you three of the nine principles from the International Language Testing Association’s Code of Ethics. As you can see, all three of them are rather general statements that prescribe the types of things that language testers “ought to do” or “ought to aspire to”. Taken as a whole, the general feeling that these principles convey is a rather lofty or idealistic one, but not in a pejorative sense. They set a high standard of what language testers should be aspiring towards and, in our opinion, this is definitely a positive characteristic.
According to Davies, a Code of Practice is meant to specify or “instantiate” the Code of Ethics. He states (quote) “while the Code of Ethics focuses on the morals and ideals in the profession, the Code of Practice identifies the minimum requirements for practice in the profession and focuses on the clarification of professional misconduct and unprofessional conduct.” (unquote). Regarding this last point and, as Ben pointed out earlier, we feel that one of the weaknesses in the TEEP documentation is that it fails to mention anything pertaining to the potential misuse of the test. To conclude this part of the presentation, we would like to briefly share with you in the next two slides an alternative to the EALTA Guidelines (or Code of Ethics), along with a brief example of a professional Code of Practice. Both documents are from the International Language Testing Association (ILTA) and are available for public consultation on ILTA’s webpage at http://www.iltaonline.com
As can be seen in the present slide, the breadth of detail exemplified in ILTA’s Code of Practice is quite remarkable. It includes seven sections that move from the general (such as “Basic Considerations for good testing practice in all situations”), through the more specific (such as “Obligations of institutions preparing or administering high stakes examinations”), to the separate “Rights & Responsibilities” not only of test-takers but of test-users as well. What seems especially noteworthy to us is the fact that key notions such as ‘test validity’ and ‘reliability’ are not just mentioned but actually explained in easy-to-understand terms, making them very difficult to ignore for anyone taking the time to consult the document. We would HIGHLY urge anyone involved in language testing to consult the full texts of the sample Codes we have just shared with you by going to ILTA’s webpage at www.iltaonline.com
We’d now like to close our presentation with three further observations about standards in language testing.
We begin with a personal view of some of the frustration we encountered when doing this assignment. The main part of the assignment, applying the EALTA Guidelines to the TEEP, took up approximately 75% of our time. Part of the reason for this is that the information on the TEEP webpage is organized very differently from the order of considerations in the EALTA Guidelines. Consequently we spent hours upon hours sifting through the myriad pages of information on the TEEP webpage to try to find answers to the EALTA questions and, in many cases, we were simply unable to answer those questions or needed to apply a fair amount of supposition and inference in order to answer them. Some may argue that this type of complaint is inappropriate for a presentation of this type, or that TEEP decision-makers (or the decision-makers of any exam board, for that matter) reserve the right to organize and design their webpages in any way they see fit. We, however, would beg to differ. We see this once again as an issue that is central to standards. In an ideal world, a given exam board would endorse a particular set of standards and would make it known which set of standards they were endorsing. Likewise, they would agree to have a section of their webpage devoted to the discussion of those standards, which would make it easier for potential clients and other interested parties to access that information. Whether that ideal world will ever exist is a separate matter. But the main thing we want to emphasize here is that no-one should have to go through the trouble that we did in trying to find basic information about an exam board’s adherence (or not) to a set of standards. On the contrary, that information should be easy to access, and clearly stated.
Another important issue is the one surrounding TEEP’s decision not to include a Speaking Test as one of their sub-tests. As Ben mentioned in Part 1 of this presentation, TEEP’s comments about whether or not to include a Speaking Test are rather contradictory. In their Examiner’s Report for 2001, they state their intention to include an “optional test of proficiency in speaking” in the 2002 administration. This is the last we hear about a speaking test until TEEP’s 2004 Examiner’s Report, when they state that (quote) “at present it is not practical to implement the involved academic-style speaking tasks that the developers would like to use.” (endquote). This is not the first time we have heard the word “practicality” mentioned in relation to standards for language tests. As we mentioned at the beginning of our presentation, when first referring to the EALTA Guidelines, standards need to be filtered through the particular context of each individual ‘test situation’, and practical concerns, as well as political and financial ones, are part of what makes each context different. In this sense, it is not unusual for TEEP to tell us that, due to practical constraints, they chose not to include a sub-test of Speaking. What does seem unusual, however, is that there is no further explanation beyond this. And we bring the point up because we feel that TEEP’s omission here pushes the boundaries a bit too far, in our opinion, towards the rights of the individual (to paraphrase Davies). In other words, while we respect the rights of individual exam boards to make their own informed decisions based on their own context, we also feel that exam boards, in the end, need to be held accountable to something or to someone, and this is where standards come in.
It just doesn’t seem even-handed or professional that an exam board should first state that they are in the process of developing a certain sub-test and then, two years later, turn around and say that out of practical concerns they changed their mind. At the very least a further explanation seems justified – otherwise it seems that we’re not all playing by the same rules.
Finally, we would like to leave you with one last consideration about standards, and it has to do with the analogy of ‘coming back to where we started’. When all is said and done, certain scholars (such as Davies, Messick and Gipps) have suggested that the concepts underlying standards and ethics in language testing are really about the same thing that writing professional language exams is about: making sure that our exams really test what we say they test. In other words, as Alan Davies writes in his article on Ethics, Professionalism, Rights and Codes, the concepts of standards and ethics, when they are stretched out to their fullest meaning, may end up touching the outer confines of construct validity. Perhaps this is something we will discuss further in the last unit for this Module on ‘Changing Views of Validity’, but for now we leave you to consider this idea with a quotation by Mr. Davies. Please take the time to read the slide at your leisure, since we have now reached the end of our presentation. We hope that you’ve enjoyed this presentation and we thank you for your time and attention.