 Assessment: Assessment of the proficiency of
the language user
 3 key concepts:
• Validity: the information gained is an accurate
representation of the proficiency of the candidates
• Reliability: A student being tested twice will get the
same result (technical concept: the rank order of the
candidates is replicated in two separate—real or
simulated—administrations of the same assessment)
• Feasibility: The procedure needs to be practical,
adapted to the available elements and features
 If we want assessment to be valid, reliable,
and feasible, we need to specify:
• What is assessed: according to the CEFR,
communicative activities (contexts, texts, and tasks).
See examples.
• How performance is interpreted: assessment criteria.
See examples
• How to make comparisons between different tests
and ways of assessment (for example, between public
examinations and teacher assessment). Two main
procedures:
 Social moderation: discussion between experts
 Benchmarking: comparison of samples in relation to
standardized definitions and examples, which become
reference points (benchmarks)
• Guidelines for good practice: EALTA
TYPES OF ASSESSMENT
1 Achievement assessment / Proficiency assessment
2 Norm-referencing (NR)/ Criterion-referencing (CR)
3 Mastery learning CR / Continuum CR
4 Continuous assessment / Fixed assessment points
5 Formative assessment / Summative assessment
6 Direct assessment / Indirect assessment
7 Performance assessment / Knowledge assessment
8 Subjective assessment / Objective assessment
9 Checklist rating / Performance rating
10 Impression / Guided judgement
11 Holistic assessment/ Analytic assessment
12 Series assessment / Category assessment
13 Assessment by others / Self-assessment
Types of tests:
• Proficiency tests
• Achievement tests. 2 approaches:
 To base achievement tests on the textbook/syllabus
 To base them on course objectives. More beneficial
washback.
• Diagnostic tests
• Placement tests
Validity Types:
• Construct validity (very general: the information
gained is an accurate representation of the
proficiency of the candidate. It checks the validity
of the construct, the thing we want to measure)
• Content validity. This checks whether the test’s content
is a representative sample of the skills or structures
that it wants to measure. In order to check this we
need a complete specification of all the skills or
structures we want to cover. If it covers only 5% of
them, it has less content validity than if it covers 25%.
 Validity Types:
• Criterion-related validity: Results on the test agree with
other dependable results (criterion test)
 Concurrent validity. We compare the test results with the
criterion test.
 Predictive validity. The test predicts future performance;
a placement test, for example, is validated by the
teachers who later teach the placed students.
• Validity in scoring. Not only the items need to be valid,
but also the way in which responses are scored
(taking into account grammar mistakes in a reading
comprehension exam is not valid)
• Face validity: the test has to look as if it measures
what it is supposed to measure. A written test to check
pronunciation has little face validity.
How to make tests more valid (Hughes)
Write specifications for the test.
Include a representative sample of the
content of the specifications in the test
Whenever feasible, use direct testing
Make sure that the scoring relates directly
to what is being tested
Try to make the test reliable
Reliability: A student being tested twice will get the same
result (technical concept: the rank order of the candidates
is replicated in two separate—real or simulated—
administrations of the same assessment)
- We compare two tests. Methods:
- Test-Retest: the student takes the same test again
- Alternate Forms: the students take two alternate forms
of the same test
- Split-Half: you split the test into two equivalent halves
and compare them as if they were two different tests
(see the sketch below).
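To make the split-half procedure concrete, here is a minimal Python sketch (the response matrix is invented for illustration): it correlates candidates' scores on the odd- and even-numbered items, then applies the standard Spearman-Brown correction, r_full = 2r / (1 + r), to estimate whole-test reliability.

```python
# Split-half reliability sketch (invented data). Candidates' scores on the
# odd- and even-numbered items are correlated, then the Spearman-Brown
# correction r_full = 2r / (1 + r) estimates whole-test reliability.
from statistics import correlation  # Python 3.10+

responses = [           # rows = candidates, columns = items (1 = correct)
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 1, 1, 1],
]

odd_half = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, 7
even_half = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, 8

r_half = correlation(odd_half, even_half)  # coefficient for the half-tests
r_full = 2 * r_half / (1 + r_half)         # Spearman-Brown correction
print(f"half-test r = {r_half:.2f}, estimated full-test reliability = {r_full:.2f}")
```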
- Reliability coefficient / Standard Error of Measurement
A High Stakes Test needs a high reliability coefficient
(highest is 1), and therefore a very low standard error of
measurement (a number obtained by statistical analysis). A
lower-stakes exam does not require such a high coefficient.
- True Score: the real score that a student would get in a
perfectly reliable test. In a very reliable test, the true
score is clearly defined (the student will always get a
similar result, for example 65-67). In a less reliable test,
the range is wider (55-75).
- Scorer reliability (coefficient). You compare the scores
given by different scorers (examiners). The more
agreement, the higher the scorer reliability coefficient.
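A small sketch tying these numbers together (all figures hypothetical): the standard error of measurement is commonly estimated as SEM = SD × √(1 − reliability), the true score lies within roughly ±1.96 SEM of the observed score, and scorer reliability can be approximated as the correlation between two examiners' marks on the same scripts.

```python
# Sketch with invented figures. SEM = SD * sqrt(1 - reliability); the band
# observed +/- 1.96 * SEM covers the true score with roughly 95% confidence.
import math
from statistics import correlation  # Python 3.10+

def true_score_band(observed, sd, reliability, z=1.96):
    sem = sd * math.sqrt(1 - reliability)  # standard error of measurement
    return observed - z * sem, observed + z * sem

# High-stakes test (reliability 0.95): a narrow band, about 62-70.
print(true_score_band(observed=66, sd=10, reliability=0.95))
# Less reliable test (reliability 0.70): a much wider band, about 55-77.
print(true_score_band(observed=66, sd=10, reliability=0.70))

# Scorer reliability: correlation between two examiners' marks on the
# same scripts; values near 1 mean the rank order is nearly identical.
rater_a = [62, 71, 55, 80, 68, 59]
rater_b = [60, 74, 52, 78, 70, 61]
print(f"scorer reliability coefficient ~ {correlation(rater_a, rater_b):.2f}")
```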
Item analysis:
 Facility value
 Discrimination indices: drop some, improve
others
 Analyse distractors
 Item banking
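A sketch of the first two steps of item analysis (the response matrix is hypothetical): the facility value is the proportion of candidates answering an item correctly, and the discrimination index here uses the simple upper-lower groups method; a negative index flags an item to drop.

```python
# Item analysis sketch (hypothetical response matrix). Facility value is the
# proportion of candidates answering an item correctly; the discrimination
# index compares the top- and bottom-scoring thirds (upper-lower method).
responses = [  # rows = candidates, columns = items; 1 = correct
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

totals = [sum(row) for row in responses]
ranked = sorted(range(len(responses)), key=lambda i: totals[i], reverse=True)
third = len(responses) // 3
upper, lower = ranked[:third], ranked[-third:]

for item in range(len(responses[0])):
    facility = sum(row[item] for row in responses) / len(responses)
    disc = (sum(responses[i][item] for i in upper) -
            sum(responses[i][item] for i in lower)) / third
    print(f"item {item + 1}: facility = {facility:.2f}, discrimination = {disc:+.2f}")
```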
How to make tests more reliable (Hughes):
1. Take enough samples of behaviour.
2. Exclude items which do not discriminate well.
3. Do not allow candidates too much freedom.
4. Write unambiguous items.
5. Provide clear and explicit instructions.
6. Ensure that tests are well laid out and perfectly
legible.
7. Make candidates familiar with format and testing
techniques.
8. Provide uniform and non-distracting conditions of
administration.
9. Use items which permit scoring which is as
objective as possible
10. Make comparisons between candidates as direct
as possible
11. Provide a detailed scoring key
12. Train scorers
13. Agree acceptable responses and appropriate
scores at the beginning of the scoring process.
14. Identify candidates by number, not by name.
15. Employ multiple, independent scorers.
To be valid a test must be reliable (provide
accurate measurement)
A reliable test may not be valid at all
(technically perfect, but globally wrong: it
does not test what it is supposed to test)
Washback/Backwash
 Test the abilities/skills you want to encourage.
 Sample widely and unpredictably
 Use direct testing
 Make testing criterion-referenced (CEFR)
 Base achievement tests on objectives
 Ensure that the test is known and understood by
students and teachers
 Counting the cost: individual direct testing is expensive,
but weigh it against the cost of not achieving beneficial
washback
1. Make a full and clear statement of the testing
‘problem’.
2. Write complete specifications for the test.
3. Write and moderate items.
4. Trial the items informally on native speakers
and reject or modify problematic ones as
necessary.
5. Trial the test on a group of non-native
speakers similar to those for whom the test is
intended.
6. Analyse the results of the trial and make any
necessary changes.
7. Calibrate scales.
8. Validate.
9. Write handbooks for test takers, test users
and staff.
10. Train any necessary staff (interviewers,
raters, etc.).
Common Test Techniques
• Multiple choice
• Yes/No, True/False
• Short Answer
• Gap-Filling
Chapters from Hughes’ Testing for Language Teachers
9. Testing Writing
10. Testing Oral Abilities
11. Testing Reading
12. Testing Listening
13. Testing Grammar and Vocabulary
14. Testing Overall Ability
15. Tests for Young Learners
TESTING WRITING
1. Set representative tasks
1. Specify all possible content
2. Include a representative sample of the specified content
2. Elicit valid samples of writing ability
1. Set as many separate tasks as feasible
2. Test only writing ability and nothing else
3. Restrict candidates
3. Ensure valid and reliable scoring:
1. Set as many tasks as possible
2. Restrict candidates
3. Give no choice of tasks
4. Ensure long enough samples
5. Create appropriate scales for scoring: HOLISTIC/ANALYTIC (see the sketch after this list)
6. Calibrate the scale to be used
7. Select and train scorers
8. Follow acceptable scoring procedures
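To illustrate the holistic/analytic distinction, here is a minimal analytic-scoring sketch (the criteria and weights are invented, not Hughes'): each script gets a band per criterion, and the weighted sum yields the final mark. A holistic scale would instead assign one overall band per script.

```python
# Analytic scoring sketch: each criterion gets a band (0-5) and a weight.
# The criteria and weights are hypothetical; equal weights give a plain average.
CRITERIA_WEIGHTS = {
    "grammar": 0.25,
    "vocabulary": 0.25,
    "organization": 0.2,
    "content": 0.3,
}

def analytic_score(bands: dict[str, int]) -> float:
    # Weighted sum of the per-criterion bands, still on the 0-5 scale.
    return sum(CRITERIA_WEIGHTS[c] * band for c, band in bands.items())

print(analytic_score({"grammar": 4, "vocabulary": 3, "organization": 4, "content": 5}))
# -> 4.05
```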
TESTING ORAL ABILITIES
• “The most highly prized language skill”, Lado’s
Language Testing (1961).
• Challenges: ephemeral, intangible.
• Contrast US/UK: the Certificate of Proficiency in English
(1913) already included it; TOEFL did so only with the 2005 iBT
• Key notion: not accent, but intelligibility
• Very different approaches.
 Indirect
 Direct (Cambridge, EOIs) or Semi-direct (TOEFL iBT, OTE,
Aptis). Conflict with the American tradition.
 The future?: Fully automated L2 speaking tests: Versant
test, Speechrater.
• Not only speaking, also interaction
1. Set representative tasks
1. Specify all possible content
2. Include a representative sample of the specified content
2. Elicit valid samples of oral ability.
1. Techniques:
1. Interview: questions, pictures, role play, interpreting (L1 to L2),
prepared monologue, reading aloud
2. Interaction: discussion, roleplay
3. Responses to audio- or video-recordings (semi-direct)
2. Plan and structure the test carefully
1. Make the oral test as long as is feasible
2. Plan the test carefully
3. As many tasks (“fresh starts”) as possible
4. Use a second tester
5. Set only tasks that candidates could do easily in L1
Plan and structure the test carefully
1. Set only tasks that candidates could do easily in L1
2. Quiet room with good acoustics
3. Put candidates at ease (at first, easy questions, not assessed,
problem with note-taking?)
4. Collect enough relevant information
5. Do not talk too much
6. (select interviewers carefully and train them)
3. Ensure valid and reliable scoring:
1. Create appropriate scales for scoring: HOLISTIC/ANALYTIC. Calibrate
the scale to be used
2. Select and train scorers (different from interviewers if possible)
3. Follow acceptable scoring procedures
TESTING READING
PROBLEMS:
 Indirect assessment: receptive skills do not manifest
themselves directly; we need an instrument
 We read in very different ways: scanning, skimming,
inferring, intensive, extensive reading…
SOME TIPS
 As many texts and operations as possible (Dialang).
 Avoid texts which deal with general knowledge
 Avoid disturbing topics, or texts students might have
read
 Use authentic texts
 Techniques: better short answer and gap filling than
multiple choice
 Task difficulty can be lower than text difficulty
 Items should follow the order of the text
 Make items independent of each other
 Do not take into account errors of grammar or spelling
TESTING LISTENING
PROBLEMS
 As in reading: indirect assessment and different ways
of listening
 As in speaking: transient nature of speech
http://www.usingenglish.com/articles/why-your-students-have-problems-with-listening-comprehension.html
TIPS:
 Same as in reading
 If recording is used, make it as natural as possible
 Items should be far apart in the text
 Give students time to become familiar with the tasks
 Techniques: apart from multiple choice, short answers
and gap filling: information transfer, note taking, partial
dictation, transcription
 Moderation is essential
 How many times should the recording be played?
GRAMMAR
 Why? Easy to test, Content validity
 Why not? Harmful washback effect
 It depends on the type of test.
 Specifications: from the Council of Europe books
 Techniques: Gap filling, rephrasings, completion
 Don’t penalize mistakes that were not tested (a missing
-s when the item is testing relatives, for example)
VOCABULARY
 Why (not)?
 Specifications: use frequency considerations
 Techniques:
 Recognition: recognise synonyms, recognise definitions,
recognise appropriate word for context
 Production: pictures, definitions, gap filling
TESTING OVERALL ABILITY
Techniques useful in particular tests where washback is
not important (placement tests, for example)
 Cloze test (from closure). Based on the
idea of “reduced redundancy”. Subtypes:
 Selected deletion cloze
 Conversational cloze
 C-Tests
 Dictation
Main problem: horrible washback effect.
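A rough sketch of how such tests can be generated (the passage and function names are illustrative, and punctuation is left attached to words for simplicity): the classic cloze deletes every nth word, while the C-test removes the second half of every second word.

```python
def make_cloze(text: str, n: int = 7):
    # Classic fixed-ratio cloze: delete every nth word, keep the key.
    words = text.split()
    key = []
    for i in range(n - 1, len(words), n):
        key.append(words[i])
        words[i] = "_____"
    return " ".join(words), key

def make_c_test(text: str):
    # C-test variant: delete the second half of every second word.
    words = text.split()
    for i, w in enumerate(words):
        if i % 2 == 1 and len(w) > 1:
            keep = (len(w) + 1) // 2
            words[i] = w[:keep] + "_" * (len(w) - keep)
    return " ".join(words)

passage = ("Texts are naturally redundant, so a reader can often restore "
           "a deleted word from the surrounding context alone.")
cloze, answers = make_cloze(passage)
print(cloze)              # passage with every seventh word blanked
print(answers)            # the deleted words, used as the scoring key
print(make_c_test(passage))
```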
TESTS FOR YOUNG LEARNERS
TIPS
- Testing-assessment-teaching
- Feedback
- Self assessment
- Washback
- Short tasks
- Use stories and games
- Use pictures and colour
- Don’t forget that children are still developing L1 and cognitive
abilities
- Include interaction
- Use colour and drawing
- Use cartoon stories
- Long warm-ups in speaking
- Use cards and pictures
Editor's Notes
  1. If we want assessment to be valid, reliable, and feasible, we need to specify: What is assessed: according to the CEFR, communicative activities (contexts, texts, and tasks). See examples. How performance is interpreted: assessment criteria. See examples How to make comparisons between different tests and ways of assessment (for example, between public examinations and teacher assessment). Two main procedures: Social moderation: discussion between experts Benchmarking: comparison of samples in relation to standardized definitions and examples Guidelines for good practice: EALTA
  2. Types of tests: Proficiency tests: designed to measure people’s ability in a language, regardless of any training. “Proficient”: command of the language, for a particular purpose or for general purposes. Achievement tests: most teachers are not responsible for proficiency tests, but for achievement tests. They are normally related to language courses. Two approaches: to base achievement tests on the textbook (or the syllabus), so that only what is covered in the classes is tested, or, much better, to base test content on course objectives. More beneficial washback. The long-term interests of the students are best served by this approach. Two types: final achievement tests, and progress achievement tests (formative assessment) Diagnostic tests: Used to identify learners’ strengths and weaknesses (example: Dialang) Placement tests: to place students at the stage most appropriate to their abilities
  3. A test is valid if it measures accurately what it is intended to measure. Or, the information gained is an accurate representation of the proficiency of the candidate. This general type of validity is called “construct validity”, the validity of the construct, the thing we want to measure. Content validity: A test has it if its content constitutes a representative sample of the language skills or structures, etc. that it wants to measure. So, first, we need a specification of the skills or structures that we want to cover, and compare them with the test itself. For example, for B2 writing skills, writing formal letters is one of the subskills shown in the specification; there are more, and the more we cover, the more valid the test will be. The more content validity, the more construct validity and the more backwash effect. Criterion-related validity: Results on the test agree with other (independent and highly dependable) results. This independent assessment is the criterion measure. Two types: Concurrent validity: we compare the criterion test and the test that we want to check. They both take place at about the same time. Example 1: we administer a 45 m. oral test where all the subskills, tasks, operations, are tested. But only to a sample of the students. This is the criterion test. Then we do 10 m. interviews with the whole level of students. We compare the results, and they tell us whether 10 m. is enough or not. This is expressed in a “correlation coefficient” between the criterion and the test being validated. Example 2: we compare the results of a general test (Pruebas Estandarizadas) with teachers’ assessment. Predictive validity: the test predicts future performance of the students. A placement test can easily be validated by the teachers teaching the students by checking if the students are well placed or not. Validity in scoring: not only the items need to be valid, but also the way in which the responses are scored. For example, a reading test may call for short written responses. If the scoring of these responses takes into account spelling and grammar, then it is not valid (it is not measuring what it is intended to measure). Same for the scoring of writing or speaking. Face validity: the test has to look as if it measures what it is supposed to measure. It is not a scientific notion, but it is important (for candidates, teachers, employers). For example, a written test to check pronunciation.
  5. Reliability: A student being tested twice will get the same result (technical concept: the rank order of the candidates is replicated in two separate—real or simulated—administrations of the same assessment) We compare two tests taken by the same group of students, and get a reliability coefficient: if all the students get exactly the same result, the coefficient is 1 (It never happens). High Stakes Tests need a higher coefficient than Lower Stakes exams. They shouldn’t depend on chance, or particular circumstances. In order to get two comparable tests, there are three procedures: Test-retest method: the students take the same test again Alternate forms method: the students take two alternate forms of the same test Split half method: you split the test into two (equivalent) halves and compare them as if they were two different tests. You get a “coefficient of internal consistency”. We also need to know the standard error of measurement of a test. This is actually the opposite of the reliability coefficient and you can get it through statistical analysis. With this number, we can find out what the true score of a student is. For example, if we have a very reliable test, it will have a low standard error of measurement, and therefore, the student will always get a very similar result no matter how many times he takes the test. In a less reliable test, his true score would be less defined. The true score lies in a range that varies depending on the standard error of measurement of the test. These numbers are important to compare tests and to take decisions (by companies, governments, etc.) based on those results. Another statistical procedure commonly used now is Item Response Theory. Very technical. Scorer reliability. There is also a scorer reliability coefficient, the level of agreement given by the same or different scorers on different occasions. If the scoring is not reliable, the test results cannot be reliable.
  7. Item analysis: Facility value Discrimination indices: drop some, improve others Analyse distractors Item banking SEE EXAMPLE FROM FUENSANTA
  8. How to make tests more reliable (Hughes) Take enough samples of behaviour. The more items, the more reliable. The higher the stakes, the longer it should be. Example from the Bible. P. 45 Exclude items which do not discriminate well between weaker and stronger students Do not allow candidates too much freedom. Example p. 46 Write unambiguous items: Critical scrutiny of colleagues, pre-testing (trialling, piloting) Provide clear and explicit instructions: write them down, read them aloud. No problem with writing them in L1. Ensure that tests are well laid out and perfectly legible Make candidates familiar with format and testing techniques Provide uniform and non-distracting conditions of administration (specified timing, good acoustic conditions)
  9. Use items which permit scoring which is as objective as possible (better one-word response than multiple choice) Make comparisons between candidates as direct as possible (no choice of items) Provide a detailed scoring key Train scorers Agree acceptable responses and appropriate scores at the beginning of the scoring process. Score a sample. Choose representative examples. Agree. Then scorers can begin to score. Identify candidates by number, not by name Employ multiple, independent scorers. At least two, independently. Then, a third, senior scorer gets the results, and investigates discrepancies.
  10. Washback/Backwash: (One of the) main reasons for a language teacher/school/department to use appropriate forms of assessment. Test the abilities/skills you want to encourage. Give them sufficient weight in relation to other skills. Sample widely and unpredictably: Test across the full range of the specifications Use direct testing Make testing criterion-referenced (CEFR) Base achievement tests on objectives Ensure that the test is known and understood by students and teachers (the more transparent, the better) (Where necessary, provide assistance to teachers) Counting the cost: Individual direct testing is expensive, but what is the cost of not achieving beneficial washback?
  11. Calibrate scales: collect samples of performance, and use them as models, reference points (European Study)
  12. Common Test Techniques We need techniques which: - will elicit behaviour which is a reliable and valid indicator of the ability in which we are interested; - will elicit behaviour which can be reliably scored; - are as economical of time and effort as possible; - will have a beneficial backwash effect, where this is relevant. MULTIPLE CHOICE Advantages: Reliable Economical Good for receptive skills (It used to be seen as the perfect, almost the only way to test) Disadvantages: Only for recognition Guessing may have a considerable but unknowable effect The technique severely restricts what can be tested It is very difficult to write successful items Washback may be harmful Cheating may be facilitated YES/NO TRUE/FALSE ITEMS Essentially multiple choice, but with a 50% chance of getting it right. OK in class activities. Not appropriate in real testing. SHORT-ANSWER ITEMS Advantages: Less guessing No need for distractors Less cheating Items are easier to write Disadvantages: Responses may take longer The test taker has to produce language (mixture of skills in a receptive test) (TRY TO MAKE RESPONSES REALLY SHORT) Judging may be required (less validity or reliability) Scoring may take longer (SOLUTIONS: MAKE THE REQUIRED RESPONSE UNIQUE) GAP-FILLING ITEMS Very similar to short-answer items
  13. Set representative tasks Specify all possible content (in the specifications) Include a representative sample of the specified content (in the test) Elicit valid samples of writing ability Set as many separate tasks as feasible Test only writing ability and nothing else (creativity, imagination, etc. No extra long instructions with complicated reading) Restrict candidates Ensure valid and reliable scoring: Set as many tasks as possible Restrict candidates Give no choice of tasks Ensure long enough samples Create appropriate scales for scoring: HOLISTIC/ANALYTIC See examples. HOLISTIC: good if many scorers. ANALYTIC: equal or unequal weight to the different parts; main disadvantage: time-consuming, and if too much attention is paid to the parts, one may forget the general impression. IMPORTANT POTENTIAL FOR WASHBACK. Calibrate the scale to be used (collect samples. Choose representative ones. Use them as reference points. This is called “benchmarking”) Select and train scorers Follow acceptable scoring procedures: benchmarking, two scorers (and a third, senior one for discrepancies), carry out statistical analysis
  14. “The most highly prized language skill”, a source of cultural capital, Lado’s Language Testing (1961). However, it hasn’t always been properly assessed. Challenges: ephemeral, intangible. Solutions: recording it, and also sound waves, spectrographs Some tests (TOEFL in particular) have a long history of ignoring it: Only in 2005 TOEFL iBT/Contrast with Cambridge Certificate of Proficiency in English (1913) which already included it. However, Grammar-Translation approaches ignored it almost completely. Kaulfers 1944 created the first scales used to assess oral proficiency, designed for the military abroad Key notion: not accent, but intelligibility (the ease or difficulty with which a listener understands L2 speech. You can be highly intelligible with a non-native accent. It is only when the accent interferes with a learner’s ability that it should be considered in speaking scales. Very different approaches. Indirect (multiple choice as an indicator, not really valid or reliable) Direct or Semi-direct (responding to stimulus from a computer, TOEFL iBT, OTE, Aptis). Problems: raters and rating scales (which oversimplify the complexity of oral speech). Despite the practical challenges, they are the only valid formats for assessing L2 speech today. Conflict with the American tradition of “psychometrically influenced assessment tradition” focusing on the technical (statistical) reliability of test items (multiple choice) and the most administratively feasible test formats and item types in the context of large-scale, high-stakes tests (GRE?) The future?: Fully automated L2 speaking tests: Versant test, Speechrater. Automatic scoring systems (measuring grammatical accuracy, lexical frequency, acoustic variables, temporal variables) Not only speaking, also interaction (listening and speaking): Cambridge included interaction in 1996. Washback effect (usual practice in class, pairwork, groupwork). Problems: peer interlocutor variables (L2 proficiency, L1 background, gender, personality, etc). Solutions: more tasks.
  15. Set representative tasks Specify all possible content Include a representative sample of the specified content Elicit valid samples of oral ability. Techniques: Interview (the candidate may feel intimidated): Questions, pictures, role play, interpreting (L1 to L2), prepared monologue, reading aloud Interaction with fellow candidates: discussion, roleplay Responses to audio- or video-recordings (semi-direct) Plan and structure the test carefully Make the oral test as long as it is feasible Plan the test carefully As many tasks (“fresh starts”) as possible Use a second tester Set only tasks that candidates could do easily in L1
  16. Quiet room with good acoustics Put candidates at ease (at first, easy questions, not assessed, problem with note-taking?) Collect enough relevant information Do not talk too much (select interviewers carefully and train them) Ensure valid and reliable scoring: Create appropriate scales for scoring: HOLISTIC/ANALYTIC. Used as a check on each other Calibrate the scale to be used Select and train scorers (different from interviewers if possible) Follow acceptable scoring procedures
  17. PROBLEMS: Indirect assessment: the exercise of receptive skills does not manifest itself directly. We need an instrument. We read in very different ways: scanning, skimming, inferring, intensive, extensive reading… All of them should be specified and tested SOME TIPS As many texts and operations as possible (Dialang). (Time limits for scanning or skimming?) Avoid texts which deal with general knowledge (answers will be guessed) Avoid disturbing topics, or texts students might have read Use, as much as possible, authentic texts Techniques: better short answer and gap filling than multiple choice. Also information transfer. Task difficulty can be lower than text difficulty Items should follow the order of the text Make items independent of each other Do not take into account errors of grammar or spelling.
  18. Similar PROBLEMS to reading: Indirect assessment: the exercise of receptive skills does not manifest itself directly. We need an instrument. We listen in very different ways: scanning, skimming, inferring, intensive, extensive listening… All of them should be specified and tested And to Speaking: Transient nature of speech http://www.usingenglish.com/articles/why-your-students-have-problems-with-listening-comprehension.html Similar tips from Reading (go back to the list) If recording is used, make it as natural as possible (with typical spoken redundancy). Don’t read aloud written texts. Items should be far apart in the text (to have time to write them down) Give students time to become familiar with the tasks Techniques: apart from multiple choice, short answers and gap filling, information transfer (draw a map of the accident), note taking, partial dictation (problem: do you consider spelling?), transcription (spelling names, numbers: real life task) Moderation (more teachers, trialling) is essential How many times? Why two? Never three
  19. GRAMMAR: Why? Easy to test, Content validity: more than in any of the skills (Skills: we just cover a few of the topics, or operations from the specifications. Grammar: we can cover many more items) Why not? Harmful washback effect Maybe not in proficiency tests, but, if grammar is taught (and it almost always is), it should be included in achievement tests, placement and diagnostic tests. However, because of the potential harmful washback effect, it should not be given too much (proportional) prominence. Specifications: from the Council of Europe books (Threshold, etc.) Techniques: Gap filling, rephrasings, completion Don’t penalize for mistakes that were not tested (-s if the item is testing relatives, for example) VOCABULARY Why (not)? Similar arguments as for grammar. Specifications: use frequency considerations (COBUILD dictionaries) Techniques: Recognition: Recognise synonyms, recognise definitions, recognise appropriate word for context Production: pictures, definitions, gap filling
  20. Special techniques which are more useful in tests where washback is not important: placement tests, for example Types: Cloze test (from closure). Based on the idea of “reduced redundancy”. Texts are always redundant. If we reduce the redundancy (by deleting a few words), native speakers are easily able to cope and guess the missing words. Originally, every seventh word. In the 80s it used to be considered a language testing panacea. Easy to construct, administer and score. Unfortunately, poor validity. Native speakers cannot always guess the words. SUBTYPES: Selected deletion cloze Conversational cloze The C-Test: a variety of cloze, with the second half of every second word deleted. Puzzle-like Dictation: traditionally used (particularly in places like France, but not only). However, in the 60s, dictation testing was considered misguided. Later, nevertheless, research showed correlation between scores on dictation tests and scores on more complex tests, or on cloze tests. They are easy to create, and easy to administer, but very difficult to score properly. Main problem with all of these tests: horrible washback effect.
  21. Primary School: Other types of assessment are more appropriate. However, a common yardstick at the end is necessary: Pruebas Estandarizadas. Good opportunity to develop good attitudes towards assessment. Recommendations: Make testing an integral part of assessment, and assessment an integral part of the teaching program Feedback from tests should be immediate and positive Self assessment should be part of the teaching program Washback is more important than ever TIPS Short tasks: Short attention span Use stories and games Use pictures and color Don’t forget that children are still developing L1 and cognitive abilities Include interaction SOME TECHNIQUES: Placing objects or identifying people Multiple choice pictures Colour and draw Use pictures in reading and in writing Cartoon stories for writing Long warm-ups in speaking Use cards and pictures