7.2 Assessment and the CEFR (2)
2. Chapters from Hughes’ Testing for Language Teachers
8. Common Test Techniques: Elaine, 24th
9. Testing Writing: Marta, Idoia, 22nd
10. Testing Oral Abilities: Paula, Ángela, 24th
11. Testing Reading: Lucía, 24th
12. Testing Listening: Lorena, 22nd
13. Testing Grammar and Vocabulary: Clara, Cristina,
22nd
14. Testing Overall Ability: Jefferson, 22nd
15. Tests for Young Learners: Tania, Diego, 24th
3. 8. Common Test Techniques
• Features:
Reliable
Valid
Reliably scored
Economical
Beneficial washback effect
• Multiple choice:
Advantages/Disadvantages
• Yes/No, True/False: like multiple choice
• Short Answer: Adv./Disadv.
• Gap-Filling
4. 1. Set representative tasks
1. Specify all possible content
2. Include a representative sample of the specified content
2. Elicit valid samples of writing ability
1. Set as many separate tasks as feasible
2. Test only writing ability and nothing else
3. Restrict candidates
3. Ensure valid and reliable scoring:
1. Set as many tasks as possible
2. Restrict candidates
3. Give no choice of tasks
4. Ensure long enough samples
5. Create appropriate scales for scoring: HOLISTIC/ANALYTIC
6. Calibrate the scale to be used
7. Select and train scorers
8. Follow acceptable scoring procedures
9. Avoid taboo topics
5. • “The most highly prized language skill”, Lado’s Language
Testing (1961).
• Challenges:
Ephemeral, intangible.
Simultaneous assessment. Solutions?
Very stressful!
• Features of oral speech: inaccuracy, unfinished sentences,
less precision, generic vocabulary, pauses
• Contrast US/UK: the Certificate of Proficiency in English (1913)
already included it; TOEFL only did so with the 2005 iBT
• Key notion: not accent, but intelligibility
• Very different approaches.
Indirect
Direct (Cambridge, EOIs) or Semi-direct (TOEFL iBT, OTE,
Aptis). Conflict with the American tradition.
The future?: Fully automated L2 speaking tests: Versant test,
Speechrater.
• Not only speaking, also interaction
6. 1. Set representative tasks
1. Specify all possible content
2. Include a representative sample of the specified content
2. Elicit valid samples of oral ability.
1. Techniques:
1. Interview: questions, pictures, role play, interpreting (L1 to L2),
prepared monologue, reading aloud
2. Interaction: discussion, roleplay
3. Responses to audio- or video-recordings (semi-direct)
2. Plan and structure the test carefully
1. Make the oral test as long as it is feasible
2. Plan the test carefully
3. As many tasks (“fresh starts”) as possible
4. Use a second tester
5. Set only tasks that candidates could do easily in L1
7. Plan and structure the test carefully (2)
1. Quiet room with good acoustics
2. Put candidates at ease (at first, easy questions, not assessed,
problem with note-taking?)
3. Collect enough relevant information
4. Do not talk too much
5. (select interviewers carefully and train them)
1. Ensure valid and reliable scoring:
1. Create appropriate scales for scoring: HOLISTIC/ANALYTIC. Calibrate
the scale to be used
2. Select and train scorers (different from interviewers if possible)
3. Follow acceptable scoring procedures
8. PROBLEMS:
Indirect assessment:
We read in very different ways: scanning, skimming, inferring,
intensive, extensive reading…
SOME TIPS
As many texts and operations as possible (Dialang).
Avoid texts which deal with general knowledge
Avoid disturbing topics, or texts students might have read
Use authentic texts
Techniques: better short answer and gap filling than multiple
choice
Task difficulty can be lower than text difficulty
Items should follow the order of the text
Make items independent of each other
Do not take into account errors of grammar or spelling
Instructions: Easy to understand, even in L1
If higher stakes, trialling/piloting is essential
Provide an example of the task.
9. PROBLEMS
As in reading: Indirect assessment and different ways of listening
As in speaking: Transient nature of speech, Redundancy is typical
Anxiety!!! (everything in real time, no re-reading, can’t stop, or slow
down)
http://www.usingenglish.com/articles/why-your-students-have-problems-with-listening-comprehension.html
TIPS:
Same as in reading
If recording is used, make it as natural as possible
Items should be far apart in the text
Give students time to become familiar with the tasks
Techniques: apart from multiple choice, short answers and gap filling,
information transfer, note taking, partial dictation, transcription
Moderation is essential
Write items after listening, without looking at the script
No general knowledge, no “common-sense” questions
Follow the order of the speech
Independent items
How many times?
10. GRAMMAR
Why? Easy to test, Content validity
Why not? Harmful washback effect
It depends on the type of test.
Specifications: Core Inventory/English Profile
Techniques: Gap filling, rephrasings, completion
Don’t penalize for mistakes that were not tested (-s if
the item is testing relatives, for example)
VOCABULARY
Why (not)?
Specifications: use frequency considerations (English
Profile)
Techniques:
Recognition: recognise synonyms, recognise definitions,
recognise appropriate word for context
Production: pictures, definitions, gap filling
11. Useful in particular tests where washback is not
important (placement tests, for example)
Cloze test (from closure). Based on the idea
of “reduced redundancy”. Subtypes:
Selected deletion cloze
Conversational cloze
C-Tests: second half of every second word
deleted. Example:
“The passen___ sits bes___ the dri___”
Dictation: Long tradition. Easy to create and
administer, but difficult to score properly.
Main problem of all these techniques: horrible
washback effect.
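The two deletion techniques above can be sketched as simple text transformations. This is a minimal illustration, and conventions differ on where deletions start and how odd-length words are halved (here the larger half is kept, so the output may differ by a letter from the example on the slide):

```python
import math

def cloze(text, n=7, start=7):
    """Classic cloze: replace every nth word with a gap,
    beginning at word `start` (1-based)."""
    words = text.split()
    for i in range(start - 1, len(words), n):
        words[i] = "____"
    return " ".join(words)

def c_test(text, start=2):
    """C-test: delete the second half of every second word,
    beginning at word `start` (1-based); the larger half is kept
    when the word has an odd number of letters."""
    words = text.split()
    for i in range(start - 1, len(words), 2):
        keep = math.ceil(len(words[i]) / 2)
        words[i] = words[i][:keep] + "___"
    return " ".join(words)
```

For instance, c_test("The passenger sits beside the driver") mutilates "passenger", "beside" and "driver" while leaving the other words intact, which is what makes the format puzzle-like.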
12. TIPS
- Make testing and assessment an integral part of teaching
- Feedback: immediate and positive
- Self assessment
- Washback more important than ever
- Short tasks (short attention span)
- Use stories and games
- Use pictures and colour
- Don’t forget that children are still developing L1 and cognitive
abilities
- Include interaction
- Use colour and drawing
- Use cartoon stories
- Long warm-ups in speaking
- Use cards and pictures
Editor's Notes
Common Test Techniques
We need techniques which:
- will elicit behaviour which is a reliable and valid indicator of the ability in which we are interested;
- will elicit behaviour which can be reliably scored;
- are as economical of time and effort as possible;
- will have a beneficial backwash effect, where this is relevant.
MULTIPLE CHOICE
Advantages:
Reliable
Economical
Good for receptive skills
(It used to be seen as the perfect, almost the only, way to test)
Disadvantages:
Only for recognition
Guessing may have a considerable but unknowable effect
The technique severely restricts what can be tested
It is very difficult to write successful items
Washback may be harmful
Cheating may be facilitated
YES/NO TRUE/FALSE ITEMS
Essentially multiple choice, but with a 50% chance of getting it right. OK in class activities. Not appropriate in real testing.
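As a rough illustration of why that 50% guessing chance matters, the probability of a blind guesser reaching a given pass mark on n true/false items follows a binomial tail. This is a sketch added for illustration, not a formula from Hughes:

```python
from math import comb

def p_pass_by_guessing(n_items, pass_mark, p_correct=0.5):
    """Probability of scoring at least `pass_mark` out of `n_items`
    by blind guessing: p_correct is 0.5 for true/false items and
    0.25 for four-option multiple choice."""
    return sum(
        comb(n_items, k) * p_correct**k * (1 - p_correct)**(n_items - k)
        for k in range(pass_mark, n_items + 1)
    )
```

On a 10-item true/false test with a pass mark of 5, a blind guesser passes roughly 62% of the time; with four-option multiple choice (p_correct=0.25) the figure drops sharply, which is why true/false is the weaker format for real testing.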
SHORT-ANSWER ITEMS
Advantages:
Less guessing
No need for distractors
Less cheating
Items are easier to write
Disadvantages
Responses may take longer
The test taker has to produce language (mixture of skills in a receptive test) (TRY TO MAKE RESPONSES REALLY SHORT)
Judging may be required (less validity or reliability)
Scoring may take longer (SOLUTIONS: MAKE THE REQUIRED RESPONSE UNIQUE)
GAP FILLING ITEMS very similar to short-answer items
Set representative tasks
Specify all possible content (in the specifications)
Include a representative sample of the specified content (in the test)
Elicit valid samples of writing ability
Set as many separate tasks as feasible
Test only writing ability and nothing else (creativity, imagination, etc. No extra long instructions with complicated reading)
Restrict candidates
Ensure valid and reliable scoring:
Set as many tasks as possible
Restrict candidates
Give no choice of tasks
Ensure long enough samples
Create appropriate scales for scoring: HOLISTIC/ANALYTIC. See examples. HOLISTIC: good if there are many scorers. ANALYTIC: equal or unequal weight to the different parts; main disadvantage: time-consuming, and if too much attention is paid to the parts, one may forget the general impression. IMPORTANT POTENTIAL FOR WASHBACK.
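The equal-or-unequal weighting of an analytic scale amounts to a weighted average of the sub-scores. A minimal sketch (the category names are illustrative, not a prescribed scale):

```python
def analytic_score(sub_scores, weights=None):
    """Combine analytic sub-scores into one mark: equal weights by
    default, unequal weights if a weight per part is supplied."""
    if weights is None:
        weights = {part: 1.0 for part in sub_scores}
    total_weight = sum(weights[part] for part in sub_scores)
    weighted = sum(sub_scores[part] * weights[part] for part in sub_scores)
    return weighted / total_weight
```

For example, passing weights of {"content": 2, "organisation": 1, "language": 1} makes content count double, which is exactly the "unequal weight to the different parts" decision mentioned above.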
Calibrate the scale to be used (collect samples. Choose representative ones. Use them as reference points. This is called “benchmarking”)
Select and train scorers
Follow acceptable scoring procedures: benchmarking, two scorers (and a third, senior one for discrepancies), carry out statistical analysis
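The two-scorer procedure with a senior referee for discrepancies can be sketched as follows; the tolerance value and the averaging rule are assumptions for illustration, not Hughes' specification:

```python
def final_mark(score_a, score_b, senior_score=None, tolerance=1.0):
    """Two independent scorers: average their marks when they agree
    within `tolerance`; otherwise the senior scorer's mark decides."""
    if abs(score_a - score_b) <= tolerance:
        return (score_a + score_b) / 2
    if senior_score is None:
        raise ValueError("Discrepancy: a senior scorer's mark is needed")
    return senior_score
```

In practice the discrepancy threshold and whether the senior scorer overrides or re-marks blind are decisions each testing programme makes explicitly.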
“The most highly prized language skill”, a source of cultural capital, Lado’s Language Testing (1961). However, it hasn’t always been properly assessed.
Challenges: ephemeral, intangible. Solutions: recording it, and also sound waves, spectrographs
Some tests (TOEFL in particular) have a long history of ignoring it: Only in 2005 TOEFL iBT/Contrast with Cambridge Certificate of Proficiency in English (1913) which already included it. However, Grammar-Translation approaches ignored it almost completely. Kaulfers 1944 created the first scales used to assess oral proficiency, designed for the military abroad
Key notion: not accent, but intelligibility (the ease or difficulty with which a listener understands L2 speech). You can be highly intelligible with a non-native accent. It is only when the accent interferes with a learner's ability to be understood that it should be considered in speaking scales.
Very different approaches.
Indirect (multiple choice as an indicator, not really valid or reliable)
Direct or Semi-direct (responding to stimulus from a computer: TOEFL iBT, OTE, Aptis). Problems: raters and rating scales (which oversimplify the complexity of oral speech). Despite the practical challenges, they are the only valid formats for assessing L2 speech today. Conflict with the American "psychometrically influenced assessment tradition", which focuses on the technical (statistical) reliability of test items (multiple choice) and the most administratively feasible test formats and item types in the context of large-scale, high-stakes tests (GRE?)
The future?: Fully automated L2 speaking tests: Versant test, Speechrater. Automatic scoring systems (measuring grammatical accuracy, lexical frequency, acoustic variables, temporal variables)
Not only speaking, also interaction (listening and speaking): Cambridge included interaction in 1996. Washback effect (usual practice in class, pairwork, groupwork). Problems: peer interlocutor variables (L2 proficiency, L1 background, gender, personality, etc). Solutions: more tasks.
Set representative tasks
Specify all possible content
Include a representative sample of the specified content
Elicit valid samples of oral ability.
Techniques:
Interview (the candidate may feel intimidated): Questions, pictures, role play, interpreting (L1 to L2), prepared monologue, reading aloud
Interaction with fellow candidates: discussion, roleplay
Responses to audio- or video-recordings (semi-direct)
Plan and structure the test carefully
Make the oral test as long as it is feasible
Plan the test carefully
As many tasks (“fresh starts”) as possible
Use a second tester
Set only tasks that candidates could do easily in L1
Quiet room with good acoustics
Put candidates at ease (at first, easy questions, not assessed, problem with note-taking?)
Collect enough relevant information
Do not talk too much
(select interviewers carefully and train them)
Ensure valid and reliable scoring:
Create appropriate scales for scoring: HOLISTIC/ANALYTIC. Used as a check on each other
Calibrate the scale to be used
Select and train scorers (different from interviewers if possible)
Follow acceptable scoring procedures
PROBLEMS:
Indirect assessment: the exercise of receptive skills does not manifest itself directly. We need an instrument.
We read in very different ways: scanning, skimming, inferring, intensive, extensive reading… All of them should be specified and tested
SOME TIPS
As many texts and operations as possible (Dialang). (Time limits for scanning or skimming?)
Avoid texts which deal with general knowledge (answers will be guessed)
Avoid disturbing topics, or texts students might have read
Use, as much as possible, authentic texts
Techniques: better short answer and gap filling than multiple choice. Also information transfer.
Task difficulty can be lower than text difficulty
Items should follow the order of the text
Make items independent of each other
Do not take into account errors of grammar or spelling.
Similar PROBLEMS to reading:
Indirect assessment: the exercise of receptive skills does not manifest itself directly. We need an instrument.
We listen in very different ways: scanning, skimming, inferring, intensive, extensive listening… All of them should be specified and tested
And to Speaking:
Transient nature of speech
http://www.usingenglish.com/articles/why-your-students-have-problems-with-listening-comprehension.html
Similar tips from Reading (go back to the list)
If recording is used, make it as natural as possible (with typical spoken redundancy). Don’t read aloud written texts.
Items should be far apart in the text (to have time to write them down)
Give students time to become familiar with the tasks
Techniques: apart from multiple choice, short answers and gap filling, information transfer (draw a map of the accident), note taking, partial dictation (problem: do you consider spelling?), transcription (spelling names, numbers: real life task)
Moderation (more teachers, trialling) is essential
How many times? Why two? Never three
GRAMMAR:
Why? Easy to test, Content validity: more than in any of the skills (Skills: we just cover a few of the topics, or operations from the specifications. Grammar: we can cover many more items)
Why not? Harmful washback effect
Maybe not in proficiency tests, but, if grammar is taught (and it almost always is), it should be included in achievement tests, placement and diagnostic tests. However, because of the potential harmful washback effect, it should not be given too much proportional prominence.
Specifications: from the Council of Europe books (Threshold, etc.)
Techniques: Gap filling, rephrasings, completion
Don’t penalize for mistakes that were not tested (-s if the item is testing relatives, for example)
VOCABULARY
Why (not)? Similar arguments as for grammar.
Specifications: use frequency considerations (COBUILD dictionaries)
Techniques:
Recognition: recognise synonyms, recognise definitions, recognise appropriate word for context
Production: pictures, definitions, gap filling
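One way to turn those frequency considerations into a concrete item-sampling procedure is to draw candidate words from each frequency band of a ranked list. The band size, sample size and word list here are illustrative assumptions, not part of the source:

```python
import random

def sample_by_band(ranked_words, band_size=1000, per_band=3, seed=0):
    """Draw a few candidate test words from each frequency band of a
    frequency-ranked word list (most frequent words first)."""
    rng = random.Random(seed)  # fixed seed so a draft test is reproducible
    bands = [ranked_words[i:i + band_size]
             for i in range(0, len(ranked_words), band_size)]
    return [rng.sample(band, min(per_band, len(band))) for band in bands]
```

Sampling per band rather than from the whole list guarantees that the test covers both high-frequency and lower-frequency vocabulary in the proportions the specifications call for.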
Special techniques which are more useful in tests where washback is not important: placement tests, for example
Types:
Cloze test (from "closure"). Based on the idea of "reduced redundancy". Texts are always redundant. If we reduce the redundancy (by deleting a few words), native speakers are easily able to cope and guess the missing words. Originally, every seventh word was deleted. In the 80s it used to be considered a language-testing panacea: easy to construct, administer and score. Unfortunately, it has poor validity; native speakers cannot always guess the words. SUBTYPES:
Selected deletion cloze
Conversational cloze
The C-Test: a variety of cloze, with the second half of every second word deleted. Puzzle-like
Dictation: traditionally used (particularly in places like France, but not only). However, in the 60s dictation testing was considered misguided. Later, nevertheless, research showed correlation between scores on dictation tests and scores on more complex tests, or on cloze tests. They are easy to create and easy to administer, but very difficult to score properly.
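One common way to make dictation scoring more tractable is to count word-level edit operations between the dictated text and the candidate's transcription. This is a sketch of that idea (whether to normalise spelling, case or punctuation before comparing is a separate decision the scorer must make):

```python
def word_error_count(reference, attempt):
    """Minimum number of word insertions, deletions and substitutions
    needed to turn the candidate's transcription into the dictated
    text (Levenshtein distance over words)."""
    ref, att = reference.split(), attempt.split()
    d = [[0] * (len(att) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(att) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(att) + 1):
            cost = 0 if ref[i - 1] == att[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # word deleted
                          d[i][j - 1] + 1,        # word inserted
                          d[i - 1][j - 1] + cost) # word substituted
    return d[len(ref)][len(att)]
```

For example, against the dictated text "the cat sat", the transcription "a cat sat down" counts two errors (one substitution, one insertion): a transparent penalty rule, which is the hard part of scoring dictation by hand.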
Main problem with all of these tests: horrible washback effect.
Primary School: Other types of assessment are more appropriate. However, a common yardstick at the end is necessary: standardised tests (Pruebas Estandarizadas).
Good opportunity to develop good attitudes towards assessment. Recommendations:
Make testing an integral part of assessment, and assessment an integral part of the teaching program
Feedback from tests should be immediate and positive
Self assessment should be part of the teaching program
Washback is more important than ever
TIPS
Short tasks: Short attention span
Use stories and games
Use pictures and colour
Don’t forget that children are still developing L1 and cognitive abilities
Include interaction
SOME TECHNIQUES:
Placing objects or identifying people
Multiple choice pictures
Colour and draw
Use pictures in reading and in writing
Cartoon stories for writing
Long warm-ups in speaking
Use cards and pictures