7.1 Assessment and the CEFR (1)
Jesús Ángel González
What does the word suggest?
What sort of emotions does it convey?
Try to write a definition. What does it
imply?
Which characteristics should it have?
 What does the word suggest?
 What sort of emotions does it convey?
 Try to write a definition. “A purposeful activity that
gathers information about students’ language
development” (Jang)
 What does it imply?
• Collecting information
• Analyzing the information and making an assessment
• Making decisions based on the assessment:
 Pedagogical decisions (formative assessment)
 Social decisions
 Which characteristics should it have?
• Validity, reliability, feasibility
• Fairness
• Pedagogical purpose (Washback effect)
 Also known as “assessment for learning”, “assessment
as learning”, or formative assessment (Jang, 7)
 Using a range of assessment methods:
• tests
• portfolios
• performance observation
• Standards-based assessments (CEFR: The European Standard)
 Self-Assessment and Peer-Assessment: excellent
complements
 The importance of feedback: descriptive, not only
evaluative; indirect-facilitative feedback works better than
direct-corrective.
 Washback Effect: effect on teaching/learning
 Disadvantages:
• Not always reliable (learners may underestimate or overestimate
their abilities; cheating?). Hard to integrate into the assessment
process
• Difficulty of the task itself (self-assessment)
 Advantages: It encourages:
• Learner Autonomy (responsibility, planning)
• Awareness (vertical/horizontal, levels/features of language
learning)
• Goal-orientation (learning objectives)
• Motivation
• Better learning: learner-centred learning (supported by cognitive
and constructivist theories of learning)
• It fits perfectly into the modern comprehensive model of
‘educational assessment’ which encourages validity, learner-
support, performance-oriented test tasks, real-world language use
and classroom life-oriented activities (Mats Oscarson,
“Assessment in the Classroom” 715)
 Test the abilities/skills you want to encourage. Give them
sufficient weight.
 Sample widely and unpredictably (from the specifications)
 Use direct testing
 Make testing criterion-referenced (CEFR)
 Base achievement tests on learning objectives (NOT
CONTENT)
 Ensure that the test is known and understood by students
and teachers (the more transparent, the better)
 Counting the cost: individual direct testing is expensive
and time-consuming, but what is the cost of harmful
washback? (PAU)
(EVERY GOOD TEACHER SHOULD DO THIS)
 Assessment: Assessment of the proficiency of
the language user
 3 key concepts:
• Validity: the information gained is an accurate
representation of the proficiency of the candidates
• Reliability: A student being tested twice will get the
same result (technical concept: the rank order of the
candidates is replicated in two separate—real or
simulated—administrations of the same assessment )
• Feasibility (practicality): The procedure needs to be
practical, adapted to the available elements and
features
(relative terms)
 If we want assessment to be valid, reliable,
and feasible, we need to specify:
• What is assessed: according to the CEFR,
communicative activities (contexts, texts, and tasks).
See examples.
• How performance is interpreted: assessment criteria.
See examples
• How to make comparisons between different tests
and ways of assessment (for example, between public
examinations and teacher assessment). Two main
procedures:
 Social/Teacher “moderation”: discussion between experts
 Benchmarking: comparison of samples in relation to
standardized definitions and examples, which become
reference points (benchmarks)
• Guidelines for good practice: EALTA
TYPES OF ASSESSMENT (1)
1 Achievement assessment (achievement of specific objectives,
previously taught content)/ Proficiency assessment (what someone
can do in the real world) Ideally, they should be as close as possible.
2 Norm-referencing (NR: students are placed in rank order; in the US,
grades are sometimes fitted to a previous norm, a “curve”)/
Criterion-referencing (CR: the criterion is a standard, like the CEFR).
See the sketch after this list.
3 Mastery learning CR / Continuum CR
4 Continuous assessment (grades are based on a number of
performances, papers, tests throughout the course) / Fixed
assessment points
5 Formative assessment (ongoing process: to check on the progress
and improve teaching and learning)/ Summative assessment
(designed to summarize students’ progress at a particular time,
normally the end of the course)
6 Direct assessment (assessing what the candidate is actually doing:
speaking, writing)/ Indirect assessment (assessing through an
instrument: reading and listening comprehension)
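The contrast in point 2 above can be made concrete with a little arithmetic. Below is a minimal sketch in Python (all names, scores, and the cut score are invented for illustration): norm-referencing grades each candidate against the others, while criterion-referencing compares each score with a fixed standard, such as the performance expected at a given CEFR level.

```python
# Minimal sketch: norm-referencing vs criterion-referencing.
# All names, scores, and the cut score below are invented.
scores = {"Ana": 78, "Ben": 62, "Carla": 91, "Dev": 55}

# Norm-referenced: candidates are placed in rank order, so each
# result depends on how the other candidates performed.
for position, name in enumerate(sorted(scores, key=scores.get, reverse=True), start=1):
    print(f"NR rank {position}: {name} ({scores[name]})")

# Criterion-referenced: each score is compared with a fixed standard
# (here a hypothetical cut score standing in for a CEFR descriptor),
# regardless of how the others did.
CUT_SCORE = 60
for name, score in scores.items():
    verdict = "meets the criterion" if score >= CUT_SCORE else "does not meet it"
    print(f"CR: {name} ({score}) {verdict}")
```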
TYPES OF ASSESSMENT (2)
7 Performance assessment (providing a sample of
language)/ Knowledge assessment (providing evidence
of knowledge: impossible?)
8 Subjective assessment (judgement by an assessor) /
Objective assessment (subjectivity is removed: indirect
tests. Really objective?) (*)
9 Rating on a scale/ Rating on a checklist
10 Impression / Guided judgement
11 Holistic assessment (global synthetic judgement)/
Analytic assessment (looking at different aspects
separately)
12 Series assessment (tasks in series)/ Category
assessment (one task with different categories)
13 Peer Assessment/ Self-assessment (very useful
complements)
• Proficiency tests
• Achievement tests. 2 approaches:
 To base achievement tests on the textbook/syllabus
(contents)
 To base them on course learning objectives. More
beneficial washback.
• Diagnostic tests
• Placement tests
 Validity: the information gained is an accurate
representation of the proficiency of the
candidates
 Validity Types:
• Construct validity (very general, the information gained
is an accurate representation of the proficiency of the
candidate. It checks the validity of the construct, the
thing we want to measure, PAU?)
• Content validity. This checks whether the test’s content is a
representative sample of the skills or structures it is meant
to measure. To check this we need a complete specification
of all the skills or structures we want to cover. A test that
covers only 5% of them has less content validity than one
that covers 25%.
 Validity Types:
• Criterion-related validity: Results on the test agree with
other dependable results (criterion test)
 Concurrent validity. We compare the test results with the
criterion test (a longer or more standardized test). See the
sketch after this list.
 Predictive validity. The test predicts future performance. A
placement test is validated by the teachers who teach the
selected students.
• Validity in scoring. It is not only the items that need to
be valid, but also the way in which responses are
scored (Example: taking into account grammar
mistakes in a reading comprehension exam is not
valid)
• Face validity: the test has to look as if it measures
what it is supposed to measure. A written test to check
pronunciation has little face validity.
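As a rough illustration of the concurrent-validity check above, the sketch below correlates invented scores on a short test with invented scores on a longer criterion test; statistics.correlation (Python 3.10+) returns Pearson's r.

```python
# Minimal sketch: concurrent validity as the correlation between the
# test being validated and the criterion test. All scores are invented.
from statistics import correlation  # Python 3.10+

short_oral_test = [14, 11, 18, 9, 16, 12]   # e.g. 10-minute interviews
criterion_test  = [55, 47, 70, 40, 63, 50]  # e.g. a longer 45-minute oral exam

r = correlation(short_oral_test, criterion_test)
print(f"concurrent validity coefficient ~ {r:.2f}")
# The closer to 1, the more safely the short test can stand in for the criterion.
```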
How to make tests more valid (Hughes)
Write specifications for the test
(transparency)
Include a representative sample of the
content of the specifications in the test
(content validity)
Whenever feasible, use direct testing
Make sure that the scoring relates directly
to what is being tested
Try to make the test reliable
Reliability: A student being tested twice will get the same
result (technical concept: the rank order of the candidates
is replicated in two separate—real or simulated—
administrations of the same assessment. Result: a
reliability coefficient, theoretical maximum 1, if all the
students get exactly the same result)
- We compare two tests. Methods:
- Test-Retest: the student takes the same test again
- Alternate Forms: the students take two alternate forms
of the same test
- Split Half: you split the test into two equivalent halves
and compare them as if they were two different tests.
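A minimal sketch of these reliability estimates, with invented scores; statistics.correlation (Python 3.10+) returns Pearson's r, and a coefficient of 1 would mean the rank order of the candidates is perfectly replicated.

```python
# Minimal sketch: estimating reliability from two administrations.
# All scores below are invented for illustration.
from statistics import correlation  # Python 3.10+

# Test-retest (or alternate forms): one score per candidate per sitting.
first_sitting  = [62, 71, 55, 80, 68, 74]
second_sitting = [60, 73, 57, 78, 70, 71]
print(f"test-retest reliability ~ {correlation(first_sitting, second_sitting):.2f}")

# Split half: correlate two equivalent halves of one test, then apply
# the Spearman-Brown correction to estimate full-length reliability.
odd_items  = [30, 36, 27, 41, 34, 37]
even_items = [32, 35, 28, 39, 34, 37]
r_half = correlation(odd_items, even_items)
r_full = (2 * r_half) / (1 + r_half)   # Spearman-Brown prophecy formula
print(f"split-half reliability ~ {r_full:.2f}")
```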
- Reliability coefficient / Standard Error of Measurement
A high-stakes test (high impact or consequences)
needs a high “reliability coefficient” (the maximum is 1), and
therefore a very low “standard error of measurement” (a
number obtained by statistical analysis). A lower-stakes
exam does not need such a high coefficient.
- True Score: the real score that a student would get in a
perfectly reliable test. In a very reliable test, the true
score is clearly defined (the student will always get a
similar result, for example 65-67). In a less reliable test,
the range is wider (55-75). See the sketch below.
- Scorer reliability (coefficient). You compare the scores
given by different scorers (examiners). The more they
agree, the higher the scorer reliability coefficient.
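The link between the reliability coefficient, the standard error of measurement, and the width of the true-score range can be sketched with the standard formula SEM = SD × √(1 − r) and a 95% band of roughly ±2 SEM. All numbers below are invented.

```python
# Minimal sketch: higher reliability -> smaller standard error of
# measurement (SEM) -> narrower band around the true score.
sd = 10.0        # standard deviation of the test's scores (invented)
observed = 66    # one candidate's observed score (invented)

for reliability in (0.97, 0.60):
    sem = sd * (1 - reliability) ** 0.5          # SEM = SD * sqrt(1 - r)
    low, high = observed - 1.96 * sem, observed + 1.96 * sem
    print(f"r = {reliability}: true score probably within {low:.0f}-{high:.0f}")
# r = 0.97 gives roughly 63-69 (narrow); r = 0.60 gives roughly 54-78 (wide),
# matching the 65-67 vs 55-75 contrast described above.
```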
Item analysis (example on p. 20; a sketch follows the list):
 Facility value
 Discrimination indices: drop some, improve
others
 Analyse distractors
 Item banking
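A minimal sketch of the first two statistics (facility value, and a simple top-half/bottom-half discrimination index) on a made-up response matrix; real item analysis would use larger samples and typically the upper and lower 27% of candidates.

```python
# Minimal sketch: facility value and discrimination index per item.
# 1 = correct, 0 = wrong; one row per candidate, rows already sorted
# from strongest to weakest on the whole test. All data are invented.
responses = [
    [1, 1, 1, 0],   # strongest candidate
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],   # weakest candidate
]
n = len(responses)
top, bottom = responses[: n // 2], responses[n // 2 :]

for item in range(len(responses[0])):
    # Facility value: proportion of all candidates answering correctly.
    facility = sum(row[item] for row in responses) / n
    # Discrimination: top-group success rate minus bottom-group success rate.
    disc = (sum(row[item] for row in top) / len(top)
            - sum(row[item] for row in bottom) / len(bottom))
    print(f"item {item + 1}: facility {facility:.2f}, discrimination {disc:+.2f}")
# Items with low or negative discrimination (item 4 here) should be dropped
# or rewritten; extreme facility values (near 0 or 1) tell us little.
```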
1. Take enough samples of behaviour (example).
2. Exclude items which do not discriminate well
3. Do not allow candidates too much freedom
(example)
4. Write unambiguous items (example)
5. Provide clear and explicit instructions
6. Ensure that tests are well laid out and perfectly
legible
7. Make candidates familiar with format and testing
techniques
8. Provide uniform and non-distracting conditions of
administration
9. Use items which permit scoring which is as
objective as possible
10. Make comparisons between candidates as direct
as possible
11. Provide a detailed scoring key
12. Train scorers
13. Agree acceptable responses and appropriate
scores at the beginning of the scoring process.
14. Identify candidates by number not by name
15. Employ multiple, independent scorers.
 To be valid a test must be reliable (it must measure
accurately; untrained or biased assessors, or a wrong
scoring key, undermine this)
 The Validity/Reliability Paradox: A perfectly reliable test
may not be valid at all (technically perfect, but globally
wrong: it does not test what it is supposed to test; for
example, a driving test without a practical exam, or a
multiple-choice test of vocabulary knowledge, where
students never actually use the vocabulary)
 “Validity concerns outweigh reliability concerns in
current assessment culture” (Jang 97): more
performance-based, direct assessment. More time-
consuming and more difficult to administer, but better
washback effect and pedagogical use. Use materials
from standards-based assessment (rubrics, proficiency-
level descriptors)
Standards are a set of benchmarks for
students to achieve (sometimes turned into
curricular goals): proficiency-level
descriptors later used in rubrics of teacher
observation checklists. Examples:
• NLLIA ESL Bandscales from Australia
• STEPs to English Proficiency in Canada
• Council of Europe: CEFR (+ELP)
• USA Standards derived from NCLB Act (2001)
Chapters from Hughes’ Testing for Language Teachers
Nov 7
8. Common Test Techniques
9. Testing Writing
10. Testing Oral Abilities
11. Testing Reading
Nov 8
12. Testing Listening
13. Testing Grammar and Vocabulary
14. Testing Overall Ability
15. Tests for Young Learners
Editor's Notes
1. Washback/Backwash: (one of the) main reasons for a language teacher/school/department to use appropriate forms of assessment. Test the abilities/skills you want to encourage. Give them sufficient weight in relation to other skills. Sample widely and unpredictably: test across the full range of the specifications. Use direct testing. Make testing criterion-referenced (CEFR). Base achievement tests on objectives. Ensure that the test is known and understood by students and teachers (the more transparent, the better). (Where necessary, provide assistance to teachers.) Counting the cost: individual direct testing is expensive, but what is the cost of not achieving beneficial washback?
2. If we want assessment to be valid, reliable, and feasible, we need to specify: what is assessed (according to the CEFR, communicative activities: contexts, texts, and tasks; see examples); how performance is interpreted (assessment criteria; see examples); and how to make comparisons between different tests and ways of assessment (for example, between public examinations and teacher assessment). Two main procedures: social moderation (discussion between experts) and benchmarking (comparison of samples in relation to standardized definitions and examples). Guidelines for good practice: EALTA.
3. Types of tests: Proficiency tests: designed to measure people’s ability in a language, regardless of any training. “Proficient”: command of the language, for a particular purpose or for general purposes. Achievement tests: most teachers are not responsible for proficiency tests, but for achievement tests. They are normally related to language courses. Two approaches: to base achievement tests on the textbook (or the syllabus), so that only what is covered in the classes is tested, or, much better, to base test content on course objectives. More beneficial washback; the long-term interests of the students are best served by this approach. Two types: final achievement tests, and progress achievement tests (formative assessment). Diagnostic tests: used to identify learners’ strengths and weaknesses (example: DIALANG). Placement tests: to place students at the stage most appropriate to their abilities.
4. A test is valid if it measures accurately what it is intended to measure; or, the information gained is an accurate representation of the proficiency of the candidate. This general type of validity is called “construct validity”, the validity of the construct, the thing we want to measure. Content validity: a test has it if its content constitutes a representative sample of the language skills, structures, etc. that it wants to measure. So, first, we need a specification of the skills or structures that we want to cover, and compare them with the test itself. For example, for B2 writing skills, writing formal letters is one of the subskills shown in the specification; there are more, and the more we cover, the more valid the test will be. The more content validity, the more construct validity and the more backwash effect. Criterion-related validity: results on the test agree with other (independent and highly dependable) results. This independent assessment is the criterion measure. Two types: Concurrent validity: we compare the criterion test and the test that we want to check; they both take place at about the same time. Example 1: we administer a 45-minute oral test where all the subskills, tasks, and operations are tested, but only to a sample of the students. This is the criterion test. Then we do 10-minute interviews with the whole level of students. We compare the results, and they tell us whether 10 minutes is enough or not. This is expressed in a “correlation coefficient” between the criterion and the test being validated. Example 2: we compare the results of a general test (Pruebas Estandarizadas) with teachers’ assessment. Predictive validity: the test predicts future performance of the students. A placement test can easily be validated by the teachers teaching the students, by checking if the students are well placed or not. Validity in scoring: not only the items need to be valid, but also the way in which the responses are scored. For example, a reading test may call for short written responses; if the scoring of these responses takes into account spelling and grammar, then it is not valid (it is not measuring what it is intended to measure). The same goes for the scoring of writing or speaking. Face validity: the test has to look as if it measures what it is supposed to measure. It is not a scientific notion, but it is important (for candidates, teachers, employers). For example, a written test to check pronunciation has little face validity.
5. Reliability: a student being tested twice will get the same result (technical concept: the rank order of the candidates is replicated in two separate—real or simulated—administrations of the same assessment). We compare two tests taken by the same group of students, and get a reliability coefficient: if all the students get exactly the same result, the coefficient is 1 (it never happens). High-stakes tests need a higher coefficient than lower-stakes exams; they shouldn’t depend on chance or particular circumstances. In order to get two comparable tests, there are three procedures: Test-retest method: the students take the same test again. Alternate forms method: the students take two alternate forms of the same test. Split half method: you split the test into two (equivalent) halves and compare them as if they were two different tests; you get a “coefficient of internal consistency”. We also need to know the standard error of measurement of a test. This is actually the opposite of the reliability coefficient and you can get it through statistical analysis. With this number, we can find out what the true score of a student is. For example, if we have a very reliable test, it will have a low standard error of measurement, and therefore the student will always get a very similar result no matter how many times he takes the test. In a less reliable test, his true score would be less defined; the true score lies in a range that varies depending on the standard error of measurement of the test. These numbers are important to compare tests and to make decisions (by companies, governments, etc.) based on those results. Another statistical procedure commonly used now is Item Response Theory (very technical). Scorer reliability: there is also a scorer reliability coefficient, the level of agreement given by the same or different scorers on different occasions. If the scoring is not reliable, the test results cannot be reliable.
6. Item analysis: Facility value. Discrimination indices: drop some, improve others. Analyse distractors. Item banking. SEE EXAMPLE FROM FUENSANTA.
7. How to make tests more reliable (Hughes): Take enough samples of behaviour: the more items, the more reliable; the higher the stakes, the longer the test should be (example from the Bible, p. 45). Exclude items which do not discriminate well between weaker and stronger students. Do not allow candidates too much freedom (example p. 46). Write unambiguous items: critical scrutiny by colleagues, pre-testing (trialling, piloting). Provide clear and explicit instructions: write them down, read them aloud; no problem with writing them in L1. Ensure that tests are well laid out and perfectly legible. Make candidates familiar with format and testing techniques. Provide uniform and non-distracting conditions of administration (specified timing, good acoustic conditions).
8. Use items which permit scoring that is as objective as possible (better one-word responses than multiple choice). Make comparisons between candidates as direct as possible (no choice of items). Provide a detailed scoring key. Train scorers. Agree acceptable responses and appropriate scores at the beginning of the scoring process: score a sample, choose representative examples, agree; then scorers can begin to score. Identify candidates by number, not by name. Employ multiple, independent scorers: at least two, independently; then a third, senior scorer gets the results and investigates discrepancies.