7 assessment and the cefr

 Assessment: Assessment of the proficiency of
the language user
 3 key concepts:
• Validity: the information gained is an accurate
representation of the proficiency of the candidates
• Reliability: A student being tested twice will get the
same result (technical concept: the rank order of the
candidates is replicated in two separate—real or
simulated—administrations of the same assessment )
• Feasibility: The procedure needs to be practical,
adapted to the available elements and features

 If we want assessment to be valid, reliable,
and feasible, we need to specify:
• What is assessed: according to the CEFR,
communicative activities (contexts, texts, and tasks).
See examples.
• How performance is interpreted: assessment criteria.
See examples
• How to make comparisons between different tests
and ways of assessment (for example, between public
examinations and teacher assesment). Two main
procedures:
 Social moderation: discussion between experts
 Benchmarking: comparison of samples in relation to
standardized definitions and examples, which become
reference points (benchmarks)
• Guidelines for good practice: EALTA

TYPES OF ASSESSMENT
1 Achievement assessment / Proficiency assessment
2 Norm-referencing (NR)/ Criterion-referencing (CR)
3 Mastery learning CR / Continuum CR
4 Continuous assessment / Fixed assessment points
5 Formative assessment / Summative assessment
6 Direct assessment / Indirect assessment
7 Performance assessment / Knowledge assessment
8 Subjective assessment / Objective assessment
9 Checklist rating / Performance rating
10 Impression / Guided judgement
11 Holistic assessment/ Analytic assessment
12 Series assessment / Category assessment
13 Assessment by others / Self-assessment

Types of tests:
• Proficiency tests
• Achievement tests. 2 approaches:
 To base achievement tests on the textbook/syllabus
 To base them on course objectives. More beneficial
washback.
• Diagnostic tests
• Placement tests

Validity Types:
• Construct validity (very general, the information
gained is accurate representation of the
proficiency of the candidate. It checks the validity
of the construct, the thing we want to measure)
• Content validity. This checks it the test’s content is
a representative simple of the skills or structures
that it wants to measure. In order to check this we
need a complete specification of all the skills or
structures we want to cover. If it covers 5% only, it
has less content validity than if it covers 25 %.

 Validity Types:
• Criterion-related validity: Results on the test agree with
other dependable results (criterion test)
 Concurrent validity. We compare the test results with the
criterion test.
 Predictive validity. A placement test is validated by the
teachers who teach the selected students.
• Validity in scoring. Not only the items need to be valid,
but also the way in which responses are scored
(taking into account grammar mistakes in a reading
comprehension exam is not valid)
• Face validity: the test has to look as if it measures
what it is supposed to measure. A written test to check
pronunciation has little face validity.

How to make tests more valid (Hughes)
Write specifications for the test.
Include a representative sample ot the
content of the specifications in the text
Whenever feasible, use direct testing
Make sure that the scoring relates directly
to what is being tested
Try to make the test reliable

Reliability: A student being tested twice will get the same
result (technical concept: the rank order of the candidates
is replicated in two separate—real or simulated—
administrations of the same assessment )
- We compare two tests. Methods:
- Test-Retest: the student takes the same test again
- Alternate Forms: the students take two alternate forms
of the same test
- Split.Half: you split the test into two equivalent halves
and compare them as if they were two different tests.

- Reliability coefficient / Standard Error of Measurement
A High Stakes Test needs a high reliability coefficient
(highest is 1), and therefore a very low standard error of
measurement (a number obtained by statistical analysis). A
Lower Stakes exam does not need those coefficients.
- True Score: the real score that a student would get in a
perfectly reliable test. In a very reliable test, the true
score is clearly defined (the student will always get a
similar result, for example 65-67). In a less reliable test,
the range is wider (55-75).
- Scorer reliability (coefficient). You compare the scores
given by different scorers (examiners). The more
agreement, the more reliable their reliability coefficient.

Item analysis:
 Facility value
 Discrimination indices: drop some, improve
others
 Analyse distractors
 Item banking

1.Take enough samples of behaviour.
2.Exclude items which do not descriminate well
3.Do not allow candidates too much freedom.
4.Write unambiguous items
5.Provide clear and explicit instructions
6.Ensure that tests are well laid out and perfectly
legible
7.Make candidates familiar with format and testing
techniques
8.Provide uniform and non-distracting conditions of
administration

9. Use items which permit scoring which is as
objective as possible
10. Make comparisons between candidates as direct
as possible
11. Provide a detailed scoring key
12. Train scorers
13. Agree acceptable responses and appropriate
scores at the beginning of the scoring process.
14. Identifty candidates by number not by name
15. Employ multiple, independent scorers..

To be valid a test must be reliable (provide
accurate measurement)
A reliable test may not be valid at all
(technically perfect, but globally wrong: it
does not test what it is supposed to test)

Washback/Backwash
 Test the abilities/skills you want to encourage.
 Sample widely and unpredictably
 Use direct testing
 Make testing criterion-referenced (CEFR)
 Base achievement tests on objectives
 Ensure that the test is known and understood by
students and teachers
 Counting the cost

1. Make a full and clear statement of the testing
‘problem’.
2. Write complete specifications for the test.
3. Write and moderate items.
4. Trial the items informally on native speakers
and reject or modify problematic ones as
necessary.
5. Trial the test on a group of non-native
speakers similar to those for whom the test is
intended.
6. Analyse the results of the trial and make any
necessary changes.
7. Calibrate scales.
8. Validate.
9. Write handbooks for test takers, test users
and staff.
10. Train any necessary staff (interviewers,
raters, etc.).

Common Test Techniques
• Multiple choice
• Yes/No, True/False
• Short Answer
• Gap-Filling

Chapters from Hughes’ Testing for Language Teachers
9. Testing Writing
10. Testing Oral Abilities
11. Testing Reading
12. Testing Listening
13. Testing Grammar and Vocabulary
14. Testing Overall Ability
15. Tests for Young Learners

1. Set representative tasks
1. Specify all possible content
2. Include a representative sample of the specified content
2. Elicit valid samples of writing ability
1. Set as many separate tasks as feasible
2. Test only writing ability and nothing else
3. Restrict candidates
3. Ensure valid and reliable scoring:
1. Set as many tasks as possible
2. Restrict candidates
3. Give no choice of tasks
4. Ensure long enough samples
5. Create appropriate scales for scoring: HOLISTIC/ANALYTIC
6. Calibrate the scale to be used
7. Select and train scorers
8. Follow acceptable scoring procedures

• “The most highly prized language skill”, Lado’s
Language Testing (1961).
• Challenges: ephemeral, intangible.
• Contrast US/UK: Certificate of Proficiency in English
(1913) already included it, TOEFL only in 2005 iBT
• Key notion: not accent, but intelligibility
• Very different approaches.
 Indirect
 Direct (Cambridge, EOIs) or Semi-direct (TOEFL ibt, OTE,
Aptis). Conflict with the American tradition.
 The future?: Fully automated L2 speaking tests: Versant
test, Speechrater.
• Not only speaking, also interaction

1. Set representative tasks
1. Specify all possible content
2. Include a representative sample of the specified content
2. Elicit valid samples of oral ability.
1. Techniques:
1. Interview :Questions, pictures, role play, interpreting (L1 to L2),
prepared monologue, reading aloud
2. Interaction: discussion, roleplay
3. Responses to audio- or video-recordings (semi-direct)
2. Plan and structure the test carefully
1. Make the oral test as long as it is feasible
2. Plan the test carefully
3. As many tasks (“fresh starts”) as possible
4. Use a second tester
5. Set only tasks that candidates could do easily in L1

Plan and structure the test carefully
1. Set only tasks that candidates could do easily in L1
2. Quiet room with good acoustics
3. Put candidates at ease (at first, easy questions, not assessed,
problem with note-taking?)
4. Collect enough relevant information
5. Do not talk too much
6. (select interviewers carefully and train them)
1. Ensure valid and reliable scoring:
1. Create appropriate scales for scoring: HOLISTIC/ANALYTIC. Calibrate
the scale to be used
2. Select and train scorers (different from interviewers if possible)
3. Follow acceptable scoring procedures

PROBLEMS:
 Indirect assessment:
 We read in very different ways: scanning, skimming,
inferring, intensive, extensive reading…
SOME TIPS
 As many texts and operations as possible (Dialang).
 Avoid texts which deal with general knowledge
 Avoid disturbing topics, or texts students might have
read
 Use authentic texts
 Techniques: better short answer and gap filling than
multiple choice
 Task difficulty can be lower than text difficulty
 Items should follow the order of the text
 Make items independent of each other
 Do not take into account errors of grammar or spelling

PROBLEMS
 As in listening: Indirect assessment and different ways
of listening
 As in speaking: Transient nature of speech
http://www.usingenglish.com/articles/why-your-students-have-problems-with-listening-comprehension.html
TIPS:
 Same as in reading
 If recording is used, make it as natural as possible
 Items should be far apart in the text
 Give students time to become familiar with the tasks
 Techniques: apart from multiple choice, shot answers
and gap filling, information transfer, note taking, partial
dictation, transcription
 Moderation is essential
 How many times?

GRAMMAR
 Why? Easy to test, Content validity
 Why not? Harmful washback effect
 It depends on the type of test.
 Specifications: from the Council of Europe books
 Techniques: Gap filling, rephrasings, completion
 Don’t penalize for mistakes that were not tested (-s if
the item is testing relatives, for example)
VOCABULARY
 Why (not)?
 Specifications: use frequency considerations
 Techniques:
 Recognition: Recognise synonims, recognise definitions,
recognise appropriate word for context
 Production: pictures, definitions, gap filling,

Useful in particular tests where washback is
not important
 Cloze test (from closure). Based on the
idea of “reduced redundancy”. Subtypes:
 Selected deletion cloze
 Conversational cloze
 C-Tests
 Dictation
Main problem : horrible washback effect.

TIPS
- Testing-assessment-teaching
- Feedback
- Self assessment
- Washback
- Short tasks
- Use stories and games
- Use pictures and color
- Don’t forget that children are still developing L1 and cognitive
abilities
- Include interaction
- Use colour and drawing
- Use cartoon stories
- Long warm-ups in speaking
- Use cards eotj pictures

7 assessment and the cefr

More Related Content

What's hot

Viewers also liked

Similar to 7 assessment and the cefr

More from Jesús Ángel González López

Recently uploaded

7 assessment and the cefr

Editor's Notes