This document discusses key concepts in language assessment including validity, reliability, and feasibility. It defines validity as the accuracy of a test in measuring the intended proficiency. There are different types of validity including content, criterion-related, and construct validity. Reliability refers to a test producing consistent results, which can be measured using methods like test-retest. Feasibility means a test is practical to administer. The document also discusses types of language tests, how to improve validity and reliability, and item analysis. Chapters from a book on language testing techniques are assigned for discussion.
The European Association for Language Testing and Assessment (EALTA) aims to promote understanding of language testing principles and improve testing practices across Europe. EALTA's guidelines provide best practices for those involved in teacher training, classroom assessment, and test development. The guidelines stress respect, responsibility, fairness, reliability, and validity. They also recommend clarifying purposes and ensuring appropriateness, accuracy, feedback, and stakeholder involvement in the assessment process. EALTA encourages engagement with decision makers to enhance quality of assessment systems.
This document discusses key concepts in language assessment including validity, reliability, and feasibility. It provides definitions and examples of different types of validity including construct, content, criterion-related, and face validity. Reliability is discussed in terms of test-retest, alternate forms, and split-half methods. The document also covers types of language assessment such as proficiency tests, achievement tests, and diagnostic tests. Specific techniques for assessing writing, speaking, reading, listening, grammar, and vocabulary are outlined. Guidelines are provided for developing valid and reliable language tests.
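The split-half method mentioned above can be made concrete with a short computation: score the odd-numbered and even-numbered items separately, correlate the two half-test scores, and apply the Spearman-Brown correction to estimate full-length reliability. The sketch below uses an invented 0/1 response matrix purely for illustration.

```python
# Split-half reliability: correlate odd-item vs. even-item half scores,
# then apply the Spearman-Brown correction for full test length.
# The response matrix is invented for illustration.

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def split_half_reliability(responses):
    """responses: one list of 0/1 item scores per student."""
    odd = [sum(row[0::2]) for row in responses]
    even = [sum(row[1::2]) for row in responses]
    r_half = pearson(odd, even)
    # Spearman-Brown: reliability of a half-test understates the full test
    return 2 * r_half / (1 + r_half)

responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 0, 1, 0],
]
print(round(split_half_reliability(responses), 2))
```

With real data the same routine applies unchanged; only the response matrix comes from an actual administration.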
This document discusses different types of test questions used in education measurement and evaluation. It describes supply type tests where students must supply missing information, including short answer and extended answer varieties. Short answer questions assess basic knowledge through one word to short responses, while extended/essay questions allow lengthier, paragraph responses to measure higher-order thinking. Selection type tests involve choosing from options, including true/false, matching, and multiple choice questions. The advantages and disadvantages of each question type are outlined.
This document provides guidance on constructing practical test questions in exams. It defines a practical question as one that requires students to perform a task to verify a hypothesis or established law. Safety should be ensured when conducting practical exams. Questions should test psychomotor skills and use clear wording. It is important to check that the necessary equipment, tools, and chemicals are available and that poisonous or explosive substances are avoided. Students should work under supervision. Practical exams can assess the cognitive, affective, and psychomotor domains and evaluate students' ability to handle apparatus, but they require skilled teachers and available resources. Examples are provided of verifying whether a substance is an acid or a base, and of finding the volume of a cube.
This document discusses validity, reliability, and washback in language testing. Validity refers to a test measuring what it intends to measure, which includes content validity (testing relevant skills and concepts) and criterion-related validity (how test results agree with other assessment results). Reliability means a test is repeatable, which can be measured through reliability coefficients. Washback refers to how a test influences teaching and learning, with the goal of achieving positive washback that encourages effective preparation. Ensuring validity, reliability, and beneficial washback requires careful test construction and use of techniques like setting test specifications, direct testing of objectives, and providing clear scoring criteria.
The document outlines a test specification for a reading comprehension test for 7th grade students. It details the test blueprint covering content from grades 1-12, developed by four individuals. It provides information on scoring methods, time allotment, instructions, test purpose and construct. It also summarizes measures of central tendency, frequency distribution, standard deviation, reliability, validity, difficulty level, discriminating power and distractor analysis that will be used to evaluate the test. The conclusion summarizes the analysis of a sample test including highest/lowest scores, modes, median, mean, reliability, validity, item validity, difficulty levels and items requiring revision.
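The descriptive statistics named in that blueprint (central tendency, standard deviation) are straightforward to compute. The following minimal sketch, with an invented score list, shows the mean, median, mode(s), and population standard deviation used in such an analysis.

```python
# Descriptive statistics for a set of test scores: mean, median, mode(s),
# and population standard deviation. The scores are invented for illustration.
from collections import Counter

def describe(scores):
    n = len(scores)
    mean = sum(scores) / n
    ordered = sorted(scores)
    mid = n // 2
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    counts = Counter(scores)
    top = max(counts.values())
    modes = sorted(s for s, c in counts.items() if c == top)
    sd = (sum((s - mean) ** 2 for s in scores) / n) ** 0.5
    return mean, median, modes, sd

scores = [12, 15, 15, 18, 20, 22, 22, 22, 25, 29]
mean, median, modes, sd = describe(scores)
print(mean, median, modes, round(sd, 2))
```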
This document discusses key concepts related to language testing and assessment. It begins by defining tests as methods used to measure a person's ability, knowledge, or performance in a given domain. Well-constructed tests provide an accurate measure of a test-taker's proficiency. The document then examines different types of language assessment methods, both formal and informal. It also explores theoretical approaches to language testing, including behavioral, communicative, integrative, and performance-based assessments. Current issues in classroom testing are discussed, such as theories of multiple intelligences and alternative forms of assessment. Principles of effective language assessment are outlined, emphasizing practicality, reliability, validity, authenticity, and washback.
The document outlines the stages of test construction: planning, preparing, reviewing, and revising. In the planning stage, test objectives and techniques are determined. The preparing stage involves developing test items addressing content, format, and scoring procedures. Experts then review the test for validity, reliability, and usability, providing feedback. In the revising stage, test items are modified based on the review before pretesting. The stages ensure tests accurately measure learners' knowledge and skills.
There are several types of language tests that serve different purposes: proficiency tests measure overall language ability, diagnostic tests identify specific strengths and weaknesses, placement tests determine what level is appropriate, achievement tests are limited to material covered in a particular course, and aptitude tests predict future success in learning a foreign language before instruction begins. Each type of test has a distinct goal to help evaluate, diagnose, or place students in a way that benefits their language education.
The document outlines the steps for developing a valid and reliable test: 1) determining test specifications, 2) planning by preparing a table of specifications, 3) writing test items, 4) preparing appropriate test formats, 5) reviewing test items, 6) pre-testing the test, and 7) validating test items through analyzing item difficulty, discrimination, and facility. The goal is to design a test that accurately measures the intended objectives and skills at an appropriate level of difficulty without cultural bias.
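Step 2, the table of specifications, is essentially a matrix of content areas against objective levels whose cells hold planned item counts. A hedged sketch follows; the content areas, levels, and counts are all invented for illustration.

```python
# A table of specifications (test blueprint) as a nested dict: rows are
# content areas, columns are objective levels, cells are planned item counts.
# All names and weights here are invented for illustration.

blueprint = {
    "Reading":    {"knowledge": 4, "comprehension": 6, "application": 2},
    "Grammar":    {"knowledge": 5, "comprehension": 3, "application": 2},
    "Vocabulary": {"knowledge": 6, "comprehension": 2, "application": 0},
}

def content_weights(blueprint):
    """Percentage of total items allotted to each content area."""
    total = sum(sum(row.values()) for row in blueprint.values())
    return {area: round(100 * sum(row.values()) / total, 1)
            for area, row in blueprint.items()}

print(content_weights(blueprint))
```

Checking these percentages against the weight each area received in instruction is one concrete way the blueprint supports content validity.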
The document discusses how to improve test reliability by providing clear instructions, unambiguous questions, familiar formatting, and objective scoring. It recommends training scorers, reviewing items for errors, using parallel distractors of similar length, and avoiding subjectively scored items or those that provide clues. Tables of specifications can improve validity by matching test questions to course content and objectives. Formative evaluations provide ongoing feedback, while summative evaluations assess effectiveness after full implementation.
There are several types of tests used to measure student performance and abilities, including diagnostic tests, proficiency tests, achievement tests, aptitude tests, placement tests, personality tests, and intelligence tests. Tests can also be objective or subjective, oral or written, criterion-referenced or norm-referenced, formative or summative, and administered individually or to groups. The document provides descriptions of the various types of tests.
This document outlines the steps to design an effective test. It discusses that tests should be valid in measuring the skills and content taught, reliable in producing consistent results, and practical to develop without excessive time or resources. The planning stage involves specifying the test's use and ensuring authentic tasks. Tests should sample across language skills and content areas. The development stage includes compiling materials, selecting appropriate question formats and clear instructions, setting scoring criteria, and analyzing and revising based on results to improve teaching.
Advantages and limitations of subjective test items (Test Generator)
In the world of test creation software and online exam makers, we often hear talk of objective and subjective questions and their differing effects on test takers. Take a look at our presentation for a quick overview.
The document outlines the 5 main steps in test development: 1) test conceptualization which includes defining what will be measured and pilot studies, 2) test construction including scaling methods, writing items, and approaches, 3) test tryout, 4) item analysis to evaluate item difficulty, reliability, validity, and discrimination, and 5) test revision to ensure quality over time as needed. Key aspects include defining the construct being measured, using various scaling and scoring models, analyzing item performance, and revalidating tests periodically.
This document provides guidelines for conducting a try-out test and performing item analysis on a test. It explains that a try-out test should be conducted on a sample similar to the actual test takers and under similar conditions to prepare for the real test administration. Item analysis evaluates the quality of test items using metrics like item difficulty index and item discrimination index calculated based on comparing response rates of high-scoring and low-scoring test takers. The document provides examples of computing these indexes and interpreting their values to identify items that need revision or removal.
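The two indexes described above can be sketched in a few lines. In classical item analysis, difficulty is the proportion of examinees answering an item correctly, and discrimination is the difference between the correct-answer rates of the upper and lower scoring groups. The responses below are invented for illustration.

```python
# Item difficulty (p) and item discrimination (D) from the responses of
# upper-scoring and lower-scoring groups on one item. 1 = correct, 0 = wrong.
# The data are invented for illustration.

def item_difficulty(upper, lower):
    """Proportion of all examinees (both groups) answering correctly."""
    return (sum(upper) + sum(lower)) / (len(upper) + len(lower))

def item_discrimination(upper, lower):
    """p(upper) - p(lower); values near +1 separate the groups well."""
    return sum(upper) / len(upper) - sum(lower) / len(lower)

# Responses of the top-scoring and bottom-scoring groups on one item
upper = [1, 1, 1, 1, 0]
lower = [1, 0, 0, 0, 0]
p = item_difficulty(upper, lower)      # moderate difficulty
d = item_discrimination(upper, lower)  # positive: favors high scorers
print(p, round(d, 2))
```

An item with a very low or negative discrimination index is the kind flagged for revision or removal, as the summary notes.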
The document outlines different types of language tests: proficiency tests measure general language ability regardless of training; achievement tests relate to language courses and assess whether objectives were achieved; diagnostic tests identify strengths and weaknesses; placement tests determine what language level is appropriate. It also distinguishes between direct and indirect testing, discrete point and integrative testing, norm-referenced and criterion-referenced testing, and objective and subjective scoring. The document concludes by mentioning computer adaptive testing and communicative language testing.
Here are the key steps in effective feedback:
1. Be timely
2. Focus on the task/criteria, not the person
3. Explain what was done well and areas for improvement
4. Suggest strategies for improvement
5. Allow opportunity for questions/discussion
Topic: Assembling The Test
Student Name: Latif Qureshi
Class: M.Ed
Project Name: "Young Teachers' Professional Development (TPD)"
Project Founder: Prof. Dr. Amjad Ali Arain
Faculty of Education, University of Sindh, Pakistan
The document discusses different types of assessment including formal, informal, and self-assessment. It then describes various types of tests such as diagnostic tools, formal tests, informal tests, summative tests, formative tests, norm-referenced tests, and criterion-referenced tests. The final section outlines principles of test construction including validity, reliability, objectivity, discrimination, comprehensiveness, ease of administration, practicality and scoring, and usability.
This document discusses various methods for analyzing items on language tests, including item facility (IF), item discrimination (ID), difference index (DI), and B-index. IF measures the percentage of correct answers on an item. ID measures how well an item separates high-scoring from low-scoring students. DI compares pre-test and post-test IFs to measure sensitivity to instruction. B-index compares pass/fail groups' IFs. While norm-referenced tests aim for a normal score distribution, criterion-referenced tests use DI and B-index to relate items to instructional objectives and passing standards.
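The criterion-referenced statistics named above (IF, DI, B-index) reduce to simple proportion differences. The sketch below uses invented pre/post and pass/fail response data purely to show the arithmetic.

```python
# Criterion-referenced item statistics: IF (item facility), DI (difference
# index: post-instruction IF minus pre-instruction IF), and the B-index
# (IF of passers minus IF of failers). Data are invented for illustration.

def item_facility(responses):
    """Proportion of examinees answering the item correctly (0/1 scores)."""
    return sum(responses) / len(responses)

def difference_index(pre, post):
    """DI: sensitivity to instruction; large positive values are desirable."""
    return item_facility(post) - item_facility(pre)

def b_index(passers, failers):
    """B-index: how strongly the item relates to the passing standard."""
    return item_facility(passers) - item_facility(failers)

pre  = [0, 0, 1, 0, 0, 1, 0, 0]   # same item before instruction
post = [1, 1, 1, 0, 1, 1, 1, 0]   # ... and after instruction
print(round(difference_index(pre, post), 2))

passers = [1, 1, 1, 0, 1]  # examinees at or above the cut score
failers = [0, 1, 0, 0, 0]  # examinees below the cut score
print(round(b_index(passers, failers), 2))
```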
The document discusses the characteristics of a good test. A good test is both valid and reliable. Validity means a test measures what it is intended to measure, such as a math test measuring math ability not reading ability. Reliability means test scores are consistent and not due to random chance. Tests can be made more reliable by including more test items and using objective scoring methods. Characteristics like a large number of test items, objective scoring, and piloting a test widely increase reliability.
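The claim that adding items increases reliability can be quantified with the Spearman-Brown prophecy formula, r_new = k·r / (1 + (k − 1)·r), where k is the factor by which the test is lengthened. A minimal sketch with an illustrative starting reliability of 0.60:

```python
# Spearman-Brown prophecy formula: predicted reliability of a test
# lengthened by factor k with comparable items. r = 0.60 is illustrative.

def spearman_brown(r, k):
    """Predicted reliability after lengthening a test by factor k."""
    return k * r / (1 + (k - 1) * r)

r = 0.60  # reliability of the original test
for k in (1, 2, 3):
    print(k, round(spearman_brown(r, k), 2))
```

Doubling a test with r = 0.60 lifts the predicted reliability to 0.75, which is why item count appears in the list of reliability-boosting characteristics.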
The document discusses test appraisal systems and item analysis. It defines test appraisal as the process used by educational institutions to evaluate how effectively they have conducted and evaluated students. Item analysis is a statistical technique used to select and reject test items based on their difficulty and ability to discriminate between stronger and weaker students. Item analysis provides benefits such as improving test construction, identifying weaknesses, and enhancing teaching methods.
Testing is a matter of using data to establish evidence of learning. Evidence, however, does not occur concretely in a natural state; it is an abstract inference, and therefore a matter of judgment.
This document discusses key principles of language assessment, including reliability, validity, practicality, authenticity, and washback. It provides definitions and explanations of these principles in 3-7 sentences each. Reliability refers to a test producing consistent scores and being error-free. Validity is the correspondence between a test's content and the material being tested. Practicality balances the resources required to design, develop, and use a test with the available resources. Authenticity is the similarity between test tasks and real-life language use. Washback describes the influence of a test on teaching and learning, which can be positive or negative.
This document discusses how to achieve beneficial backwash from tests. It provides several recommendations: test the abilities you want to encourage; sample widely and unpredictably in tests; use direct testing of skills; make tests criterion-referenced; base achievement tests on objectives; ensure students and teachers understand tests; and provide teacher assistance. It also mentions the Cambridge English Proficiency exam and cites various sources.
The Common European Framework of Reference (CEFR) is a standard developed by the Council of Europe to describe language ability. It introduces six common reference levels (A1 to C2) to standardize language education across Europe. The CEFR provides clear definitions of what language learners can do at each level to facilitate cooperation in language education.
CEFR-based tools and resources: latest developments (Mila Angelova, EAQUALS)
This document provides an overview of an EAQUALS session on CEFR-based curriculum and assessment. It discusses developing resources like the Core Inventory for French and reading/listening scenarios. The EAQUALS Certificate of Achievement scheme guarantees quality CEFR-implemented assessment and curriculum design through screening processes. Benefits include differentiating members and demonstrating academic competence. Main prerequisites for certification include a CEFR-based curriculum, standardization training, and moderation techniques. The session aims to help members implement CEFR-based approaches and identify areas of interest.
This document discusses strategies for assessing and grading students according to the Common European Framework of Reference (CEFR). It recommends using CEFR criteria for low-stakes, in-class assessment but standardized exams for high-stakes certification. When assessing CEFR levels, schools should relate their curriculum to CEFR descriptors and develop assessment tasks aligned to the descriptors. Moderation is important to counteract subjectivity and ensure consistent standards are applied. Criterion-referenced assessment according to CEFR criteria provides explicit information about students' abilities independent of peer performance.
This document summarizes key points about testing various language skills from Hughes' book on language testing. It discusses techniques for testing writing, oral abilities, reading, listening, grammar, vocabulary, and overall ability. Tips are provided for each skill area as well as for testing young learners. Common issues with indirect assessment are addressed. The importance of test design, task selection, task difficulty, and reliable scoring procedures are emphasized throughout.
The Common European Framework of Reference for Languages (CEFR) provides a common basis for describing language ability across Europe. It describes what language learners need to know and be able to do to use a language for communication. The CEFR defines six reference levels of language proficiency from A1 for basic users to C2 for mastery. It also outlines the grammatical structures and competencies required at each level. The CEFR takes a communicative approach, focusing on learners' needs and basing teaching on developing communicative competence through everyday interactions and cultural understanding.
The document discusses different types of assessments including formative assessment, which is used to identify if students have achieved the lesson objective and determine gaps. Examples of formative assessments include questioning students and collecting assignments. Summative assessment provides grades based on performance over a period of time, such as final exams. Performance assessment evaluates what students can do in real-world scenarios through demonstrations and projects.
This document discusses different types of assessment and evaluation tools used in education. It describes diagnostic, formative, and summative assessments and their purposes. Diagnostic assessments identify student strengths and weaknesses at the start of instruction. Formative assessments evaluate student learning throughout instruction to help students improve. Summative assessments make judgments about student achievement at the end of a learning period. The document also outlines specific tools like observations, checklists, interviews, and projects that can be used for assessment and evaluation.
The document discusses assessment practices and formative assessment. It provides an overview of assessment types including formative, summative, and diagnostic assessments. Formative assessment identifies student needs, guides ongoing instruction, and provides feedback to improve learning, while summative assessment evaluates learning at the end of a unit. The document emphasizes that formative assessment, when used to adapt teaching to meet student needs, has a strong positive effect on learning.
This document discusses different types of assessment used to evaluate learner performance. Formal assessment includes tests and exams with numerical grades, while informal assessment observes learner skills through comments without grades. Self-assessment and peer assessment also allow learners to evaluate themselves and each other. Examples of assessment tasks that can be used formally or informally include gap fills, multiple choice questions, interviews, compositions and dictation. Tasks vary in what they measure, from communication skills to language accuracy, and in how easy they are to score objectively versus subjectively. Informal methods include observation, note-taking and self/peer evaluation sheets.
This document discusses assessment in language learning. It defines assessment as collecting information about students' language development through various methods such as tests, portfolios, and observations. This information is then analyzed and used to make pedagogical decisions. For assessment to be effective, it should be valid, reliable, and feasible. Validity means the assessment accurately measures proficiency. Reliability means a student would achieve similar results on multiple attempts. Feasibility means the assessment is practical to implement. The document also discusses formative assessment, self-assessment, and the importance of feedback in the learning process.
1. Research instruments are required in research to systematically collect and measure data relevant to the research problem or questions.
2. The key qualities of a good research instrument are validity, reliability, and usability. Validity ensures an instrument measures what it intends to measure. Reliability means an instrument produces consistent results. Usability means an instrument can be used practically.
3. Common types of instruments include questionnaires, interviews, checklists, tests, and observations. Quantitative instruments like questionnaires use closed-form questions while qualitative instruments like interviews use open-form questions. Standardized tests are published and validated over time while researcher-made tools require validation.
This document discusses the validity and reliability of questionnaires. It defines validity as the ability of a questionnaire to measure what it intends to measure. There are several types of validity discussed, including content validity, face validity, criterion validity (concurrent and predictive), and construct validity. Steps for validating a questionnaire include evaluating face validity and getting expert feedback to establish content validity. Reliability is the ability to get consistent results and is measured through test-retest reliability, internal consistency (split-half), and inter-rater reliability. Establishing both validity and reliability is important for developing a high-quality questionnaire.
This document discusses the key characteristics of effective assessment: validity, reliability, practicality, and accuracy. It defines each characteristic and provides examples. Validity means a test measures what it intends to measure. Reliability means a test produces consistent results. Practicality means a test is usable in terms of time and cost. Accuracy means a test is free from errors. The document also discusses factors that affect the acceptability of a test like length, technique, administration conditions, and presentation quality. Overall, the document provides an overview of the essential features of assessment and testing.
Characteristics of a Good Evaluation Instrument (Suresh Babu)
1. Validity, reliability, objectivity, adequacy, discrimination power, practicability, comparability, utility, and comprehensiveness are key characteristics of a good evaluation instrument.
2. Validity refers to a test accurately measuring what it is intended to measure. Reliability is consistency in a test's measurements. Objectivity means a test's scores are not affected by scorers' biases.
3. Other important characteristics include a test being adequate to measure objectives, able to discriminate levels of performance, practical to administer, allowing comparability of scores, useful for its intended purpose, and comprehensive in assessing objectives.
This document outlines various topics related to language testing, including types of tests, approaches to testing, validity and reliability, and achieving beneficial backwash effects. It discusses proficiency tests, achievement tests, and diagnostic tests. It also covers direct and indirect testing, norm-referenced and criterion-referenced testing, and objective and subjective testing. Validity is defined as accurately measuring the intended abilities, while reliability is consistency of results. Achieving beneficial backwash means testing abilities you want to foster and ensuring students and teachers understand the test.
The document discusses developing assessment instruments for measuring learner progress and instructional quality. It covers criterion-referenced assessments that measure performance against specific standards or levels. The objectives are to describe criterion-referenced tests and different types of pre- and post-instruction assessments. It also discusses developing quality criterion-referenced test items and assessments of products, performances, and attitudes.
The document discusses developing assessment instruments for measuring learner progress and instructional quality. It describes criterion-referenced assessments that measure performance against specific standards or levels of mastery. The objectives are to describe criterion-referenced tests and how various assessment types (entry tests, pretests, practice tests, posttests) are used. It also discusses developing quality criterion-referenced test items in four categories: goal-centered, learner-centered, context-centered, and assessment-centered.
The document discusses developing criterion-referenced assessments. It explains that criterion-referenced assessments directly measure skills described in behavioral objectives and focus on gauging learner performance and instructional quality. The document provides guidance on writing test items, developing different types of assessments, setting mastery criteria, and ensuring assessments are congruent with objectives and instructional analyses. It emphasizes the importance of criterion-referenced assessments for evaluating both learners and instruction.
The document discusses principles of testing including practicality, reliability, validity, and different types of tests. It addresses how to make tests more reliable and valid. Reliability refers to consistency and dependability, and can be improved through clear instructions, uniform conditions, and objective scoring. Validity means a test accurately measures what it intends to. Communicative competence and practical issues in testing are also covered.
This document provides an outline for a course on testing for language teachers. It covers various topics related to language testing including the purposes of different types of tests, approaches to testing, ensuring validity and reliability, and achieving beneficial backwash effects. The key points covered are the types of tests (proficiency, achievement, diagnostic, placement), approaches to testing (direct vs indirect, discrete point vs integrative), factors of validity and reliability, and how to design tests that motivate effective teaching practices.
Validity refers to the appropriateness and usefulness of assessment interpretations and results, while reliability refers to the consistency of measurements. There are various types of validity evidence including content, criterion, and construct validity. Reliability can be estimated through methods like test-retest, equivalent forms, and internal consistency. Ensuring both validity and reliability of assessments is important for making fair and meaningful evaluations of students.
This document discusses standardized tests and test construction. It defines standardized tests as tests where all students answer the same questions in the same way, allowing performance to be compared. The main types of standardized tests are norm-referenced tests, which compare performance to others, and criterion-referenced tests, which compare performance to objectives. Good test construction involves planning test objectives, writing clear and valid questions, and revising the test based on analysis to ensure it reliably measures the desired content.
This document discusses key concepts and principles of assessment for English language learners. It begins by explaining why assessment should take place, noting that it is used to measure learning and improve instruction. It then covers key concepts involved in assessment like accountability, achievement, and different assessment types and strategies. Several principles of assessment are outlined, including being ethical, fair, valid, reliable and practical. The document concludes by providing checklists to evaluate if classroom tests are applying these principles of practicality, reliability, validity, authenticity, and having a beneficial washback effect on learning.
This document discusses bilingual education programs at higher education institutions that use English as the primary language of instruction, known as EMI programs. It notes the increasing trend of EMI programs in Europe and reasons for their growth, including internationalization, improving English skills, and prestige. Potential threats of EMI discussed include lack of English proficiency among students and teachers leading to ineffective teaching and learning, and EMI limiting classroom discourse. Solutions proposed include screening language levels, additional training, and bilingual degrees. Research on EMI programs found small improvements in students' English skills. Examples of EMI programs in Spanish universities are also provided.
This document outlines 3 modules for language teaching. Module 1 provides background information on language teaching. Module 2 covers lesson planning and use of resources. Module 3 focuses on managing the teaching and learning process in the classroom.
The document outlines an action-oriented language model used in Cantabria, Spain for teaching language. The model uses task-based learning principles including using authentic materials and real-life tasks to reproduce natural language acquisition. Some key methodological principles are integrating skills and contents through communicative tasks, using texts close to students' experiences, promoting student autonomy, and treating errors positively as part of the learning process. An example task is provided focusing on family topics, with associated learning goals and activities involving listening, reading, writing, and oral interaction.
The document describes the European Language Portfolio (ELP), which aims to help language learners improve their learning process through self-assessment and reflection. It discusses the ELP's objectives, sections (passport, biography, dossier), and types for different age groups in Spain. It then details the implementation of ELPs at two Spanish schools, including developing activities, trials with students, and addressing problems like the difficulty of self-assessment and the ELP's bulkiness. Teachers found that ELPs increased student autonomy and awareness of language learning as a process. The document concludes by discussing ways ELP principles have been incorporated into language courses, and the impact of ELPs on teacher training.
This document provides instructions for a training program to familiarize users with the Common European Framework of Reference for Languages (CEF). The training involves 6 steps: 1) selecting a language skill, 2) choosing a communicative task, 3) reading the task description, 4) completing the task, 5) rating the task difficulty, and 6) checking the user's rating against the trainer's rating. The goal is to help users scale language skills and assess task and performance levels as defined by the CEF.
The Common European Framework provides a common basis for language education across Europe by establishing common reference levels for languages. It aims to promote plurilingualism, lifelong learning, and greater mobility and cooperation through common standards. The Framework describes language ability through communicative competences and sets out descriptive levels from A1 for basic users up to C2 for mastery. It takes a comprehensive approach to language skills including reception, production, interaction and mediation.
The document discusses language proficiency levels based on the Common European Framework and strategies for improving English skills at a university. It outlines the six proficiency levels from A1 to C2 and provides examples of certifications and exams. It then proposes three ways for students to demonstrate a B2 level: taking additional English classes, taking subjects taught in English, or obtaining other qualifications involving English study.
The document presents a summary of the Instituto Cervantes' 2006 Plan Curricular. It establishes three general objectives centred on the learner as a social agent, an intercultural speaker, and an autonomous learner. It also describes the five components of language covered by the plan: grammatical, pragmatic-discursive, notional, cultural, and learning-related. Finally, it offers examples of the inventories for the grammatical, pragmatic-discursive, and cultural components.
The document describes two main applications of the Common European Framework of Reference for Languages: DIALANG, an online language assessment system available for 14 languages and three skills, and Europass, a set of documents that help communicate qualifications and degrees to facilitate mobility within the European Union. It also mentions some other online resources related to language learning and assessment.
The document describes the key assessment concepts in the CEFR: validity, reliability, and feasibility. It explains that for assessment to be valid, reliable, and feasible, it is necessary to specify what is assessed (communicative activities) and how performance is interpreted (assessment criteria), and to follow codes of good practice. It also lists 13 different types of assessment.
This document presents the methodology for new language curricula based on the Common European Framework of Reference for Languages. It describes 11 methodological principles, such as the use of communicative tasks, the integration of skills and content, and texts close to the student's experience. It also includes examples of objectives, activities, and resources for level B1.
This document provides information about familiarizing oneself with the Common European Framework of Reference for Languages (CEF). It outlines a 6 step process for users to get acquainted with the CEF scales through familiarization exercises, training materials, and practice rating sample language tasks and comparing their ratings to expert ratings. The goal is to help users understand the CEF levels, scales, and assessment of language task and performance levels.
2. What does the word suggest?
What sort of emotions does it convey?
Try to write a definition. What does it imply?
Which characteristics should it have?
3. What does the word suggest?
What sort of emotions does it convey?
Try to write a definition. What does it imply?
• Collecting information
• Analyzing the information and making an assessment
• Taking decisions according to the assessment made:
Pedagogical decisions (formative assessment)
Social decisions
Which characteristics should it have?
• Validity, reliability, feasibility
4. Assessment: assessment of the proficiency of the language user
3 key concepts:
• Validity: the information gained is an accurate representation of the proficiency of the candidates
• Reliability: a student being tested twice will get the same result (technical concept: the rank order of the candidates is replicated in two separate, real or simulated, administrations of the same assessment)
• Feasibility: the procedure needs to be practical, adapted to the available elements and features
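The technical definition of reliability above (the candidates' rank order is replicated across two administrations) can be quantified with Spearman's rank correlation. A minimal Python sketch with hypothetical score lists; the difference formula used here assumes no tied scores:

```python
# Spearman's rank correlation between two administrations of the same test.
# A coefficient near 1 means the candidates' rank order was replicated.
# Scores are hypothetical; the formula below assumes no ties.

def ranks(scores):
    """Rank each score from 1 (lowest) to n (highest)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman_rho(first_admin, second_admin):
    """rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)), d = rank difference per candidate."""
    n = len(first_admin)
    d2 = sum((a - b) ** 2
             for a, b in zip(ranks(first_admin), ranks(second_admin)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Identical rank order on both administrations gives rho = 1.0:
print(spearman_rho([40, 55, 62, 78], [43, 51, 66, 80]))  # → 1.0
```

Exact scores may differ between the two sittings; what the coefficient tracks is whether the ordering of candidates is preserved.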
5. If we want assessment to be valid, reliable, and feasible, we need to specify:
• What is assessed: according to the CEFR, communicative activities (contexts, texts, and tasks). See examples.
• How performance is interpreted: assessment criteria. See examples.
• How to make comparisons between different tests and ways of assessment (for example, between public examinations and teacher assessment). Two main procedures:
Social moderation: discussion between experts
Benchmarking: comparison of samples in relation to standardized definitions and examples, which become reference points (benchmarks)
• Guidelines for good practice: EALTA
7. Types of tests:
• Proficiency tests
• Achievement tests. 2 approaches:
To base achievement tests on the textbook/syllabus (contents)
To base them on course objectives. More beneficial washback.
• Diagnostic tests
• Placement tests
8. Validity: the information gained is an accurate representation of the proficiency of the candidates
Validity types:
• Construct validity (very general: the information gained is an accurate representation of the proficiency of the candidate. It checks the validity of the construct, the thing we want to measure)
• Content validity. This checks whether the test's content is a representative sample of the skills or structures that it wants to measure. In order to check this we need a complete specification of all the skills or structures we want to cover. If the test covers only 5% of them, it has less content validity than if it covers 25%.
9. Validity types:
• Criterion-related validity: results on the test agree with other dependable results (criterion test)
Concurrent validity: we compare the test results with the criterion test.
Predictive validity: the test predicts future performance. A placement test is validated by the teachers who teach the selected students.
• Validity in scoring. Not only the items need to be valid, but also the way in which responses are scored (taking grammar mistakes into account in a reading comprehension exam is not valid).
• Face validity: the test has to look as if it measures what it is supposed to measure. A written test to check pronunciation has little face validity.
10. How to make tests more valid (Hughes):
Write specifications for the test.
Include a representative sample of the content of the specifications in the test.
Whenever feasible, use direct testing.
Make sure that the scoring relates directly to what is being tested.
Try to make the test reliable.
11. Reliability: a student being tested twice will get the same result (technical concept: the rank order of the candidates is replicated in two separate, real or simulated, administrations of the same assessment. Result: a reliability coefficient, theoretical maximum 1, if all the students get exactly the same result).
We compare two tests. Methods:
- Test-retest: the student takes the same test again
- Alternate forms: the students take two alternate forms of the same test
- Split-half: you split the test into two equivalent halves and compare them as if they were two different tests.
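As an illustration of the split-half method, here is a hedged Python sketch: hypothetical 0/1 item scores are split into odd- and even-numbered halves, the two half scores are correlated, and the Spearman-Brown correction estimates the reliability of the full-length test. The odd/even split and the correction formula are standard conventions, not prescribed by the slides:

```python
# Split-half reliability sketch: correlate two halves of one test, then
# correct upward for full test length (Spearman-Brown).
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def split_half_reliability(item_scores):
    """item_scores: one row of 0/1 item results per candidate.
    Splits items into odd/even halves and applies Spearman-Brown."""
    odd = [sum(row[::2]) for row in item_scores]
    even = [sum(row[1::2]) for row in item_scores]
    r_half = pearson(odd, even)
    return 2 * r_half / (1 + r_half)  # Spearman-Brown correction

# Hypothetical data: three candidates, four items each.
scores = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 0, 0]]
print(split_half_reliability(scores))  # → 1.0 (the two halves agree perfectly)
```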
12. Reliability coefficient / standard error of measurement
- A high-stakes test needs a high reliability coefficient (highest is 1), and therefore a very low standard error of measurement (a number obtained by statistical analysis). A lower-stakes exam does not need those coefficients.
- True score: the real score that a student would get in a perfectly reliable test. In a very reliable test, the true score is clearly defined (the student will always get a similar result, for example 65-67). In a less reliable test, the range is wider (55-75).
- Scorer reliability (coefficient). You compare the scores given by different scorers (examiners). The more agreement between scorers, the higher their reliability coefficient.
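The standard error of measurement named on this slide follows directly from the reliability coefficient: SEM = SD × sqrt(1 − reliability). A small sketch (the standard deviation, reliability values, and observed score below are hypothetical, and the 95% band assumes approximately normal error):

```python
# Standard error of measurement: the less reliable the test, the wider the
# band around an observed score in which the true score probably lies.
from math import sqrt

def standard_error(score_sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return score_sd * sqrt(1 - reliability)

def true_score_band(observed, score_sd, reliability, z=1.96):
    """Approximate 95% band for the true score (normal error assumed)."""
    margin = z * standard_error(score_sd, reliability)
    return observed - margin, observed + margin

# With SD = 10, a highly reliable test gives a narrow band and a less
# reliable one a wide band, as in the 65-67 vs 55-75 contrast on the slide.
print(true_score_band(66, 10, 0.96))  # narrow band
print(true_score_band(66, 10, 0.70))  # wide band
```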
13. Item analysis:
Facility value
Discrimination indices: drop some items, improve others
Analyse distractors
Item banking
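The facility value and discrimination index on this slide can be computed per item. A sketch with hypothetical 0/1 responses; the upper/lower split of one third is a common convention, not the only possible one:

```python
# Item analysis: facility value (how easy the item is) and a discrimination
# index (does the item separate strong from weak candidates?).

def facility_value(item_responses):
    """Proportion of candidates who answered the item correctly (0..1)."""
    return sum(item_responses) / len(item_responses)

def discrimination_index(item_responses, total_scores, fraction=1/3):
    """D = p(correct in top group) - p(correct in bottom group).
    Items with D near zero or negative are candidates for dropping."""
    n = max(1, round(len(total_scores) * fraction))
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    lower, upper = order[:n], order[-n:]
    p_upper = sum(item_responses[i] for i in upper) / n
    p_lower = sum(item_responses[i] for i in lower) / n
    return p_upper - p_lower

responses = [1, 1, 1, 0, 0, 0]        # per-candidate result on one item
totals = [90, 80, 70, 30, 20, 10]     # per-candidate total test score
print(facility_value(responses))                 # → 0.5
print(discrimination_index(responses, totals))   # → 1.0
```

Here the item is of medium difficulty (facility 0.5) and discriminates perfectly, since only the top scorers got it right.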
14. 1.Take enough samples of behaviour.
2.Exclude items which do not descriminate well
3.Do not allow candidates too much freedom.
4.Write unambiguous items
5.Provide clear and explicit instructions
6.Ensure that tests are well laid out and perfectly
legible
7.Make candidates familiar with format and testing
techniques
8.Provide uniform and non-distracting conditions of
administration
15. 9. Use items which permit scoring which is as
objective as possible
10. Make comparisons between candidates as direct
as possible
11. Provide a detailed scoring key
12. Train scorers
13. Agree acceptable responses and appropriate
scores at the beginning of the scoring process.
14. Identify candidates by number not by name
15. Employ multiple, independent scorers.
16. To be valid, a test must be reliable (provide
accurate measurement)
A reliable test may not be valid at all
(technically perfect, but globally wrong: it
does not test what it is supposed to test)
17. Test the abilities/skills you want to encourage.
Sample widely and unpredictably
Use direct testing
Make testing criterion-referenced (CEFR)
Base achievement tests on objectives
Ensure that the test is known and understood by
students and teachers
Counting the cost
18. 1. Make a full and clear statement of the testing
‘problem’.
2. Write complete specifications for the test.
3. Write and moderate items.
4. Trial the items informally on native speakers and
reject or modify problematic ones as necessary.
5. Trial the test on a group of non-native speakers
similar to those for whom the test is intended.
6. Analyse the results of the trial and make any
necessary changes.
7. Calibrate scales: collect samples of performance,
use them as models (benchmarking)
8. Validate.
9. Write handbooks for test takers, test users and
staff.
10. Train any necessary staff (interviewers, raters,
etc.).
19. Chapters from Hughes’ Testing for Language Teachers
8. Common Test techniques: Elaine, 24th
9. Testing Writing: Marta, Idoia, 22nd
10. Testing Oral Abilities: Paula, Ángela, 24th
11. Testing Reading: Lucía, 24th
12. Testing Listening: Lorena, 22nd
13. Testing Grammar and Vocabulary: Clara, Cristina,
22nd
14. Testing Overall Ability: Jefferson, 22nd
15. Tests for Young Learners: Tania, Diego, 24th
Editor's Notes
If we want assessment to be valid, reliable, and feasible, we need to specify:
What is assessed: according to the CEFR, communicative activities (contexts, texts, and tasks). See examples.
How performance is interpreted: assessment criteria. See examples
How to make comparisons between different tests and ways of assessment (for example, between public examinations and teacher assessment). Two main procedures:
Social moderation: discussion between experts
Benchmarking: comparison of samples in relation to standardized definitions and examples
Guidelines for good practice: EALTA
Types of tests:
Proficiency tests: designed to measure people’s ability in a language, regardless of any training. “Proficient”: command of the language, for a particular purpose or for general purposes.
Achievement tests: most teachers are not responsible for proficiency tests, but for achievement tests. They are normally related to language courses. Two approaches:
to base achievement tests on the textbook (or the syllabus), so that only what is covered in the classes is tested,
or, much better, to base test content on course objectives. More beneficial washback. The long-term interests of the students are best served by this approach.
Two types: final achievement tests, and progress achievement tests (formative assessment)
Diagnostic tests: Used to identify learners’ strengths and weaknesses (example: Dialang)
Placement tests: to place students at the stage most appropriate to their abilities
A test is valid if it measures accurately what it is intended to measure. Or, the information gained is an accurate representation of the proficiency of the candidate. This general type of validity is called “construct validity”, the validity of the construct, the thing we want to measure
Content validity: A test has it if its content constitutes a representative sample of the language skills or structures, etc. that it wants to measure. So, first, we need a specification of the skills or structures that we want to cover, and compare them with the test itself. For example, for B2 writing skills, writing formal letters is one of the subskills shown in the specification; there are more, and the more of them we cover, the more valid the test will be. The more content validity, the more construct validity and the stronger the backwash effect.
Criterion-related validity: Results on the test agree with other (independent and highly dependable) results. This independent assessment is the criterion measure.
Two types:
Concurrent validity: we compare the criterion test and the test that we want to check. They both take place at about the same time.
Example 1: we administer a 45 m. oral test where all the subskills, tasks, operations, are tested, but only to a sample of the students. This is the criterion test. Then we do 10 m. interviews with the whole level of students. We compare the results, and they tell us whether 10 m. is enough or not. This is expressed in a “correlation coefficient” between the criterion and the test being validated.
Example 2: we compare the results of a general test (Pruebas Estandarizadas) with teachers’ assessment.
Predictive validity: the test predicts future performance of the students. A placement test can easily be validated by asking the teachers who teach the students whether the students are well placed or not.
Validity in scoring: not only the items need to be valid, but also the way in which the responses are scored. For example, a reading test may call for short written responses. If the scoring of these responses takes into account spelling and grammar, then it is not valid (it is not measuring what it is intended to measure). Same for the scoring of writing or speaking.
Face validity: the test has to look as if it measures what it is supposed to measure. It is not a scientific notion, but it is important (for candidates, teachers, employers). For example, a written test to check pronunciation.
Reliability: A student being tested twice will get the same result (technical concept: the rank order of the candidates is replicated in two separate—real or simulated—administrations of the same assessment )
We compare two tests taken by the same group of students, and get a reliability coefficient: if all the students get exactly the same result, the coefficient is 1 (It never happens). High Stakes Tests need a higher coefficient than Lower Stakes exams. They shouldn’t depend on chance, or particular circumstances.
In order to get two comparable tests, there are two procedures:
Test-retest method: the students take the same test again
Alternate forms method: the students take two alternate forms of the same test
Split half method: you split the test into two (equivalent) halves and compare them as if they were two different tests. You get a “coefficient of internal consistency”.
We also need to know the standard error of measurement of a test. It is inversely related to the reliability coefficient and is obtained through statistical analysis. With this number, we can estimate a student's true score. A very reliable test has a low standard error of measurement, and therefore the student will always get a very similar result no matter how many times they take the test. In a less reliable test, the true score is less well defined. The true score lies in a range that varies depending on the standard error of measurement of the test.
These numbers are important to compare tests and to make decisions (by companies, governments, etc.) based on those results.
Another statistical procedure commonly used now is Item Response Theory. Very technical.
Scorer reliability. There is also a scorer reliability coefficient, the level of agreement given by the same or different scorers on different occasions. If the scoring is not reliable, the test results cannot be reliable.
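A rough way to inspect scorer agreement before computing a formal coefficient is to count exact and adjacent agreement between two independent scorers. A minimal sketch with invented marks for six scripts (a formal study would report a correlation or kappa coefficient instead):

```python
# Invented marks given by two independent scorers to the same six
# scripts, on a 1-5 band scale (hypothetical data).
scorer_a = [4, 3, 5, 2, 4, 3]
scorer_b = [4, 3, 4, 2, 5, 3]

n = len(scorer_a)
# Exact agreement: both scorers give the same band.
exact = sum(a == b for a, b in zip(scorer_a, scorer_b)) / n
# Adjacent agreement: the scorers differ by at most one band.
adjacent = sum(abs(a - b) <= 1 for a, b in zip(scorer_a, scorer_b)) / n
print(exact, adjacent)  # 4 of 6 exact, all within one band
```

Scripts where the two scorers diverge by more than one band would go to a third, senior scorer, as recommended in the reliability guidelines above.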
Item analysis:
Facility value
Discrimination indices: drop some, improve others
Analyse distractors
Item banking
SEE EXAMPLE FROM FUENSANTA
How to make tests more reliable (Hughes)
Take enough samples of behaviour. The more items, the more reliable. The higher stakes, the longer it should be. Example from the Bible. P. 45
Exclude items which do not discriminate well between weaker and stronger students
Do not allow candidates too much freedom. Example p. 46
Write unambiguous items: Critical scrutiny of colleagues, pre-testing (trialling, piloting)
Provide clear and explicit instructions: write them down, read them aloud. No problem with writing them in L1.
Ensure that tests are well laid out and perfectly legible
Make candidates familiar with format and testing techniques
Provide uniform and non-distracting conditions of administration (specified timing, good acoustic conditions)
Use items which permit scoring which is as objective as possible (better one-word response than multiple choice)
Make comparisons between candidates as direct as possible (no choice of items)
Provide a detailed scoring key
Train scorers
Agree acceptable responses and appropriate scores at the beginning of the scoring process. Score a sample. Choose representative examples. Agree. Then scorers can begin to score.
Identify candidates by number not by name
Employ multiple, independent scorers. At least two, independently. Then, a third, senior scorer gets the results, and investigates discrepancies.
Washback/Backwash: (One of the) main reasons for a language teacher/school/department to use appropriate forms of assessment.
Test the abilities/skills you want to encourage. Give them sufficient weight in relation to other skills.
Sample widely and unpredictably: Test across the full range of the specifications
Use direct testing
Make testing criterion-referenced (CEFR)
Base achievement tests on objectives
Ensure that the test is known and understood by students and teachers (the more transparent, the better)
(Where necessary, provide assistance to teachers)
Counting the cost: Individual direct testing is expensive, but what is the cost of not achieving beneficial washback?
Calibrate scales: collect samples of performance, and use them as models, reference points (European Study)