7.1 Assessment and the CEFR (1)
Jesús Ángel González
What does the word suggest?
What sort of emotions does it convey?
Try to write a definition. What does it
imply?
Which characteristics should it have?
 What does the word suggest?
 What sort of emotions does it convey?
 Try to write a definition. “A purposeful activity that
gathers information about students’ language
development” (Jang)
 What does it imply?
• Collecting information
• Analyzing the information and making an assessment
• Making decisions based on the assessment:
 Pedagogical decisions (formative assessment)
 Social decisions
 Which characteristics should it have?
• Validity, reliability, feasibility
• Fairness
• Pedagogical purpose (Washback effect)
 Also known as “assessment for learning”, “assessment
as learning”, or formative assessment (Jang, 7)
 Using a range of assessment methods:
• tests
• portfolios
• performance observation
• Standards-based assessments (CEFR: The European Standard)
 Self-Assessment and Peer-Assessment: excellent
complements
 The importance of feedback: descriptive, not only
evaluative; indirect-facilitative feedback works better than
direct-corrective.
 Washback Effect: effect on teaching/learning
 Disadvantages:
• Not always reliable (learners may underestimate or overestimate
their abilities; cheating?). Hard to integrate into the assessment
process
• Difficulty of the task itself (self-assessment)
 Advantages: It encourages:
• Learner Autonomy (responsibility, planning)
• Awareness (vertical/horizontal, levels/features of language
learning)
• Goal-orientation (learning objectives)
• Motivation
• Better learning: learner-centred learning (supported by cognitive
and constructivist theories of learning)
• It fits perfectly into the modern comprehensive model of
‘educational assessment’ which encourages validity, learner-
support, performance-oriented test tasks, real-world language use
and classroom life-oriented activities (Mats Oscarson,
“Assessment in the Classroom” 715)
 Test the abilities/skills you want to encourage. Give them
sufficient weight.
 Sample widely and unpredictably (from the specifications)
 Use direct testing
 Make testing criterion-referenced (CEFR)
 Base achievement tests on learning objectives (NOT
CONTENT)
 Ensure that the test is known and understood by students
and teachers (the more transparent, the better)
 Counting the cost: individual direct testing is expensive
and time-consuming, but what is the cost of harmful
washback? (PAU)
(EVERY GOOD TEACHER SHOULD DO THIS)
 Assessment: Assessment of the proficiency of
the language user
 3 key concepts:
• Validity: the information gained is an accurate
representation of the proficiency of the candidates
• Reliability: A student being tested twice will get the
same result (technical concept: the rank order of the
candidates is replicated in two separate—real or
simulated—administrations of the same assessment )
• Feasibility (practicality): The procedure needs to be
practical, adapted to the available elements and
features
(relative terms)
 If we want assessment to be valid, reliable,
and feasible, we need to specify:
• What is assessed: according to the CEFR,
communicative activities (contexts, texts, and tasks).
See examples.
• How performance is interpreted: assessment criteria.
See examples
• How to make comparisons between different tests
and ways of assessment (for example, between public
examinations and teacher assessment). Two main
procedures:
 Social/Teacher “moderation”: discussion between experts
 Benchmarking: comparison of samples in relation to
standardized definitions and examples, which become
reference points (benchmarks)
• Guidelines for good practice: EALTA
TYPES OF ASSESSMENT (1)
1 Achievement assessment (achievement of specific objectives,
previously taught content)/ Proficiency assessment (what someone
can do in the real world) Ideally, they should be as close as possible.
2 Norm-referencing (NR: students are placed in rank order; in the US,
grades are sometimes fitted to a previous norm, a “curve”)/
Criterion-referencing (CR: the criterion is a standard, like the CEFR).
See the sketch after this list.
3 Mastery learning CR / Continuum CR
4 Continuous assessment (grades are based on a number of
performances, papers, tests throughout the course) / Fixed
assessment points
5 Formative assessment (ongoing process: to check on the progress
and improve teaching and learning)/ Summative assessment
(designed to summarize students’ progress at a particular time,
normally the end of the course)
6 Direct assessment (assessing what the candidate is actually doing:
speaking, writing)/ Indirect assessment (assessing through an
instrument: reading and listening comprehension)
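The contrast in point 2 above can be made concrete with a little arithmetic. Below is a minimal sketch in Python (all names, scores, and the cut score are invented for illustration): norm-referencing grades each candidate against the others, while criterion-referencing compares each score with a fixed standard, such as the performance expected at a given CEFR level.

```python
# Minimal sketch: norm-referencing vs criterion-referencing.
# All names, scores, and the cut score below are invented.
scores = {"Ana": 78, "Ben": 62, "Carla": 91, "Dev": 55}

# Norm-referenced: candidates are placed in rank order, so each
# result depends on how the other candidates performed.
for position, name in enumerate(sorted(scores, key=scores.get, reverse=True), start=1):
    print(f"NR rank {position}: {name} ({scores[name]})")

# Criterion-referenced: each score is compared with a fixed standard
# (here a hypothetical cut score standing in for a CEFR descriptor),
# regardless of how the others did.
CUT_SCORE = 60
for name, score in scores.items():
    verdict = "meets the criterion" if score >= CUT_SCORE else "does not meet it"
    print(f"CR: {name} ({score}) {verdict}")
```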
TYPES OF ASSESSMENT (2)
7 Performance assessment (providing a sample of
language)/ Knowledge assessment (providing evidence
of knowledge: impossible?)
8 Subjective assessment (judgement by an assessor) /
Objective assessment (subjectivity is removed: indirect
tests. Really objective?) (*)
9 Rating on a scale/ Rating on a checklist
10 Impression / Guided judgement
11 Holistic assessment (global synthetic judgement)/
Analytic assessment (looking at different aspects
separately)
12 Series assessment (tasks in series)/ Category
assessment (one task with different categories)
13 Peer Assessment/ Self-assessment (very useful
complements)
• Proficiency tests
• Achievement tests. 2 approaches:
 To base achievement tests on the textbook/syllabus
(contents)
 To base them on course learning objectives. More
beneficial washback.
• Diagnostic tests
• Placement tests
 Validity: the information gained is an accurate
representation of the proficiency of the
candidates
 Validity Types:
• Construct validity (very general, the information gained
is an accurate representation of the proficiency of the
candidate. It checks the validity of the construct, the
thing we want to measure, PAU?)
• Content validity. This checks whether the test’s content is a
representative sample of the skills or structures it is meant
to measure. To check this we need a complete specification
of all the skills or structures we want to cover. A test that
covers only 5% of them has less content validity than one
that covers 25%.
 Validity Types:
• Criterion-related validity: Results on the test agree with
other dependable results (criterion test)
 Concurrent validity. We compare the test results with the
criterion test (a longer or more standardized test). See the
sketch after this list.
 Predictive validity. The test predicts future performance. A
placement test is validated by the teachers who teach the
selected students.
• Validity in scoring. It is not only the items that need to
be valid, but also the way in which responses are
scored (Example: taking into account grammar
mistakes in a reading comprehension exam is not
valid)
• Face validity: the test has to look as if it measures
what it is supposed to measure. A written test to check
pronunciation has little face validity.
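As a rough illustration of the concurrent-validity check above, the sketch below correlates invented scores on a short test with invented scores on a longer criterion test; statistics.correlation (Python 3.10+) returns Pearson's r.

```python
# Minimal sketch: concurrent validity as the correlation between the
# test being validated and the criterion test. All scores are invented.
from statistics import correlation  # Python 3.10+

short_oral_test = [14, 11, 18, 9, 16, 12]   # e.g. 10-minute interviews
criterion_test  = [55, 47, 70, 40, 63, 50]  # e.g. a longer 45-minute oral exam

r = correlation(short_oral_test, criterion_test)
print(f"concurrent validity coefficient ~ {r:.2f}")
# The closer to 1, the more safely the short test can stand in for the criterion.
```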
How to make tests more valid (Hughes)
Write specifications for the test
(transparency)
Include a representative sample of the
content of the specifications in the test
(content validity)
Whenever feasible, use direct testing
Make sure that the scoring relates directly
to what is being tested
Try to make the test reliable
Reliability: A student being tested twice will get the same
result (technical concept: the rank order of the candidates
is replicated in two separate—real or simulated—
administrations of the same assessment. Result: a
reliability coefficient, theoretical maximum 1, if all the
students get exactly the same result)
- We compare two tests. Methods:
- Test-Retest: the student takes the same test again
- Alternate Forms: the students take two alternate forms
of the same test
- Split Half: you split the test into two equivalent halves
and compare them as if they were two different tests.
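A minimal sketch of these reliability estimates, with invented scores; statistics.correlation (Python 3.10+) returns Pearson's r, and a coefficient of 1 would mean the rank order of the candidates is perfectly replicated.

```python
# Minimal sketch: estimating reliability from two administrations.
# All scores below are invented for illustration.
from statistics import correlation  # Python 3.10+

# Test-retest (or alternate forms): one score per candidate per sitting.
first_sitting  = [62, 71, 55, 80, 68, 74]
second_sitting = [60, 73, 57, 78, 70, 71]
print(f"test-retest reliability ~ {correlation(first_sitting, second_sitting):.2f}")

# Split half: correlate two equivalent halves of one test, then apply
# the Spearman-Brown correction to estimate full-length reliability.
odd_items  = [30, 36, 27, 41, 34, 37]
even_items = [32, 35, 28, 39, 34, 37]
r_half = correlation(odd_items, even_items)
r_full = (2 * r_half) / (1 + r_half)   # Spearman-Brown prophecy formula
print(f"split-half reliability ~ {r_full:.2f}")
```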
- Reliability coefficient / Standard Error of Measurement
A high-stakes test (high impact or consequences)
needs a high “reliability coefficient” (the maximum is 1), and
therefore a very low “standard error of measurement” (a
number obtained by statistical analysis). A lower-stakes
exam does not need such a high coefficient.
- True Score: the real score that a student would get in a
perfectly reliable test. In a very reliable test, the true
score is clearly defined (the student will always get a
similar result, for example 65-67). In a less reliable test,
the range is wider (55-75). See the sketch below.
- Scorer reliability (coefficient). You compare the scores
given by different scorers (examiners). The more they
agree, the higher the scorer reliability coefficient.
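The link between the reliability coefficient, the standard error of measurement, and the width of the true-score range can be sketched with the standard formula SEM = SD × √(1 − r) and a 95% band of roughly ±2 SEM. All numbers below are invented.

```python
# Minimal sketch: higher reliability -> smaller standard error of
# measurement (SEM) -> narrower band around the true score.
sd = 10.0        # standard deviation of the test's scores (invented)
observed = 66    # one candidate's observed score (invented)

for reliability in (0.97, 0.60):
    sem = sd * (1 - reliability) ** 0.5          # SEM = SD * sqrt(1 - r)
    low, high = observed - 1.96 * sem, observed + 1.96 * sem
    print(f"r = {reliability}: true score probably within {low:.0f}-{high:.0f}")
# r = 0.97 gives roughly 63-69 (narrow); r = 0.60 gives roughly 54-78 (wide),
# matching the 65-67 vs 55-75 contrast described above.
```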
Item analysis (example on p. 20; a sketch follows the list):
 Facility value
 Discrimination indices: drop some, improve
others
 Analyse distractors
 Item banking
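A minimal sketch of the first two statistics (facility value, and a simple top-half/bottom-half discrimination index) on a made-up response matrix; real item analysis would use larger samples and typically the upper and lower 27% of candidates.

```python
# Minimal sketch: facility value and discrimination index per item.
# 1 = correct, 0 = wrong; one row per candidate, rows already sorted
# from strongest to weakest on the whole test. All data are invented.
responses = [
    [1, 1, 1, 0],   # strongest candidate
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],   # weakest candidate
]
n = len(responses)
top, bottom = responses[: n // 2], responses[n // 2 :]

for item in range(len(responses[0])):
    # Facility value: proportion of all candidates answering correctly.
    facility = sum(row[item] for row in responses) / n
    # Discrimination: top-group success rate minus bottom-group success rate.
    disc = (sum(row[item] for row in top) / len(top)
            - sum(row[item] for row in bottom) / len(bottom))
    print(f"item {item + 1}: facility {facility:.2f}, discrimination {disc:+.2f}")
# Items with low or negative discrimination (item 4 here) should be dropped
# or rewritten; extreme facility values (near 0 or 1) tell us little.
```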
1. Take enough samples of behaviour (example).
2. Exclude items which do not discriminate well
3. Do not allow candidates too much freedom
(example)
4. Write unambiguous items (example)
5. Provide clear and explicit instructions
6. Ensure that tests are well laid out and perfectly
legible
7. Make candidates familiar with format and testing
techniques
8. Provide uniform and non-distracting conditions of
administration
9. Use items which permit scoring which is as
objective as possible
10. Make comparisons between candidates as direct
as possible
11. Provide a detailed scoring key
12. Train scorers
13. Agree acceptable responses and appropriate
scores at the beginning of the scoring process.
14. Identify candidates by number not by name
15. Employ multiple, independent scorers.
 To be valid a test must be reliable (it must measure
accurately; untrained or biased assessors, or a wrong
scoring key, undermine this)
 The Validity/Reliability Paradox: A perfectly reliable test
may not be valid at all (technically perfect, but globally
wrong: it does not test what it is supposed to test; for
example, a driving test without a practical exam, or a
multiple-choice test of vocabulary knowledge, where
students never actually use the vocabulary)
 “Validity concerns outweigh reliability concerns in
current assessment culture” (Jang 97): more
performance-based, direct assessment. More time-
consuming and more difficult to administer, but better
washback effect and pedagogical use. Use materials
from standards-based assessment (rubrics, proficiency-
level descriptors)
Standards are a set of benchmarks for
students to achieve (sometimes turned into
curricular goals): proficiency-level
descriptors later used in rubrics of teacher
observation checklists. Examples:
• NLLIA ESL Bandscales from Australia
• STEPs to English Proficiency in Canada
• Council of Europe: CEFR (+ELP)
• USA Standards derived from NCLB Act (2001)
Chapters from Hughes’ Testing for Language Teachers
Nov 7
8. Common Test Techniques
9. Testing Writing
10. Testing Oral Abilities
11. Testing Reading
Nov 8
12. Testing Listening
13. Testing Grammar and Vocabulary
14. Testing Overall Ability
15. Tests for Young Learners
Editor's Notes
1. Washback/Backwash: (one of the) main reasons for a language teacher/school/department to use appropriate forms of assessment. Test the abilities/skills you want to encourage. Give them sufficient weight in relation to other skills. Sample widely and unpredictably: test across the full range of the specifications. Use direct testing. Make testing criterion-referenced (CEFR). Base achievement tests on objectives. Ensure that the test is known and understood by students and teachers (the more transparent, the better). (Where necessary, provide assistance to teachers.) Counting the cost: individual direct testing is expensive, but what is the cost of not achieving beneficial washback?
2. If we want assessment to be valid, reliable, and feasible, we need to specify: what is assessed (according to the CEFR, communicative activities: contexts, texts, and tasks; see examples); how performance is interpreted (assessment criteria; see examples); and how to make comparisons between different tests and ways of assessment (for example, between public examinations and teacher assessment). Two main procedures: social moderation (discussion between experts) and benchmarking (comparison of samples in relation to standardized definitions and examples). Guidelines for good practice: EALTA.
3. Types of tests: Proficiency tests: designed to measure people’s ability in a language, regardless of any training. “Proficient”: command of the language, for a particular purpose or for general purposes. Achievement tests: most teachers are not responsible for proficiency tests, but for achievement tests. They are normally related to language courses. Two approaches: to base achievement tests on the textbook (or the syllabus), so that only what is covered in the classes is tested, or, much better, to base test content on course objectives. More beneficial washback; the long-term interests of the students are best served by this approach. Two types: final achievement tests, and progress achievement tests (formative assessment). Diagnostic tests: used to identify learners’ strengths and weaknesses (example: DIALANG). Placement tests: to place students at the stage most appropriate to their abilities.
4. A test is valid if it measures accurately what it is intended to measure; or, the information gained is an accurate representation of the proficiency of the candidate. This general type of validity is called “construct validity”, the validity of the construct, the thing we want to measure. Content validity: a test has it if its content constitutes a representative sample of the language skills, structures, etc. that it wants to measure. So, first, we need a specification of the skills or structures that we want to cover, and compare them with the test itself. For example, for B2 writing skills, writing formal letters is one of the subskills shown in the specification; there are more, and the more we cover, the more valid the test will be. The more content validity, the more construct validity and the more backwash effect. Criterion-related validity: results on the test agree with other (independent and highly dependable) results. This independent assessment is the criterion measure. Two types: Concurrent validity: we compare the criterion test and the test that we want to check; they both take place at about the same time. Example 1: we administer a 45-minute oral test where all the subskills, tasks, and operations are tested, but only to a sample of the students. This is the criterion test. Then we do 10-minute interviews with the whole level of students. We compare the results, and they tell us whether 10 minutes is enough or not. This is expressed in a “correlation coefficient” between the criterion and the test being validated. Example 2: we compare the results of a general test (Pruebas Estandarizadas) with teachers’ assessment. Predictive validity: the test predicts future performance of the students. A placement test can easily be validated by the teachers teaching the students, by checking if the students are well placed or not. Validity in scoring: not only the items need to be valid, but also the way in which the responses are scored. For example, a reading test may call for short written responses; if the scoring of these responses takes into account spelling and grammar, then it is not valid (it is not measuring what it is intended to measure). The same goes for the scoring of writing or speaking. Face validity: the test has to look as if it measures what it is supposed to measure. It is not a scientific notion, but it is important (for candidates, teachers, employers). For example, a written test to check pronunciation has little face validity.
5. Reliability: a student being tested twice will get the same result (technical concept: the rank order of the candidates is replicated in two separate—real or simulated—administrations of the same assessment). We compare two tests taken by the same group of students, and get a reliability coefficient: if all the students get exactly the same result, the coefficient is 1 (it never happens). High-stakes tests need a higher coefficient than lower-stakes exams; they shouldn’t depend on chance or particular circumstances. In order to get two comparable tests, there are three procedures: Test-retest method: the students take the same test again. Alternate forms method: the students take two alternate forms of the same test. Split half method: you split the test into two (equivalent) halves and compare them as if they were two different tests; you get a “coefficient of internal consistency”. We also need to know the standard error of measurement of a test. This is actually the opposite of the reliability coefficient and you can get it through statistical analysis. With this number, we can find out what the true score of a student is. For example, if we have a very reliable test, it will have a low standard error of measurement, and therefore the student will always get a very similar result no matter how many times he takes the test. In a less reliable test, his true score would be less defined; the true score lies in a range that varies depending on the standard error of measurement of the test. These numbers are important to compare tests and to make decisions (by companies, governments, etc.) based on those results. Another statistical procedure commonly used now is Item Response Theory (very technical). Scorer reliability: there is also a scorer reliability coefficient, the level of agreement given by the same or different scorers on different occasions. If the scoring is not reliable, the test results cannot be reliable.
6. Item analysis: Facility value. Discrimination indices: drop some, improve others. Analyse distractors. Item banking. SEE EXAMPLE FROM FUENSANTA.
7. How to make tests more reliable (Hughes): Take enough samples of behaviour: the more items, the more reliable; the higher the stakes, the longer the test should be (example from the Bible, p. 45). Exclude items which do not discriminate well between weaker and stronger students. Do not allow candidates too much freedom (example p. 46). Write unambiguous items: critical scrutiny by colleagues, pre-testing (trialling, piloting). Provide clear and explicit instructions: write them down, read them aloud; no problem with writing them in L1. Ensure that tests are well laid out and perfectly legible. Make candidates familiar with format and testing techniques. Provide uniform and non-distracting conditions of administration (specified timing, good acoustic conditions).
8. Use items which permit scoring that is as objective as possible (better one-word responses than multiple choice). Make comparisons between candidates as direct as possible (no choice of items). Provide a detailed scoring key. Train scorers. Agree acceptable responses and appropriate scores at the beginning of the scoring process: score a sample, choose representative examples, agree; then scorers can begin to score. Identify candidates by number, not by name. Employ multiple, independent scorers: at least two, independently; then a third, senior scorer gets the results and investigates discrepancies.