TEST DEVELOPMENT AND
EVALUATION (6462)
CLASSROOM TESTING AND HIGH-STAKES TESTING
Department of Secondary Teacher Education
ALLAMA IQBAL OPEN UNIVERSITY, ISLAMABAD
OBJECTIVES OF THE UNIT
After studying this unit, students will be able to:
1. understand the concept of classroom testing and its techniques
2. understand the need for and scope of high-stakes testing
3. differentiate between teacher-made tests (classroom/low-stakes tests) and standardized (high-stakes) tests
4. enumerate the advantages and limitations of low-stakes and high-stakes tests
5. prepare tests using Bloom’s Taxonomy and the SOLO Taxonomy
6. elaborate the procedure for test development
7. provide examples of standardized tests and their characteristics
8. list some current trends in high-stakes testing
3.1 CONCEPT OF CLASSROOM TESTING AND ITS TECHNIQUES
Classroom assessment is the process, usually conducted by teachers, of designing, collecting,
interpreting and applying information about student learning and attainment to make educational
decisions. There are four interrelated steps to the classroom assessment process.
 The first step is to define the purposes for the information. During this period, the teacher
considers how the information will be used and how the assessment fits in the students'
educational program.
 The next step in the assessment process is to measure student learning or attainment.
Measurement involves using tests, surveys, observation or interviews to produce either numeric
or verbal descriptions of the degree to which a student has achieved academic goals.
 The third step is to evaluate the measurement data, which entails making judgments about the
information. During this stage, the teacher interprets the measurement data to determine if
students have certain strengths or limitations or whether the student has sufficiently attained the
learning goals.
 In the last stage, the teacher applies the interpretations to fulfill the aims of assessment that
were defined in the first stage. The teacher uses the data to guide instruction, assign grades, or help
students with any particular learning deficiencies or barriers.
3.2 HIGH STAKE TESTING: ITS NATURE, NEED AND SCOPE
 High-stakes testing attaches consequences to test results. For example, high-stakes tests
can be used to determine students’ promotion from grade to grade or graduation from high
school (Resnick, 2004; Cizek, 2001).
 The use and misuse of high-stakes tests is a controversial topic in public education, in
advanced countries and even in Pakistan, because such tests are used not only to assess students
but also in attempts to increase teacher accountability.
More precisely, a high-stakes test is a test that:
o is a single, defined assessment,
o has a clear line drawn between those who pass and those who fail, and
o has direct consequences for passing or failing (something "at stake").
• What is the need for high-stakes testing?
• What is the nature of high-stakes testing?
Teacher-Made vs. Standardized Tests
Differences Between Standardized and Teacher-Made Tests
3.5.2 Advantages and Disadvantages of High-Stakes Testing
Advantages of High-Stakes Testing
 It holds teachers accountable for ensuring that all students learn what they are expected to learn.
 Motivates students to work harder, learn more, and take the tests more seriously, which can promote higher
student achievement.
 Establishes high expectations for both educators and students, which can help reverse the cycles of low
educational expectations, achievement, and attainment that have historically disadvantaged some student
groups, particularly students of color, and that have characterized some schools in poorer communities or
more troubled urban areas.
 Reveals areas of educational need that can be targeted for reform and improvement, such as programs for
students who may be underperforming academically or being underserved by schools.
 Provides easily understandable information about school and student performance in the form of numerical
test scores that reformers, educational leaders, elected officials and policy makers can use to develop new
laws, regulations, and school-improvement strategies.
 Gives parents, employers, colleges and others more confidence that students are learning at a high level or
that high school graduates have acquired the skills they will need to succeed in adulthood.
Disadvantages of High-Stakes Testing
 It forces educators to “teach to the test.”
 It promotes a narrower academic program in schools.
 It may contribute to higher, or even much higher, rates of cheating.
 It has been correlated in some research studies with increased failure rates,
lower graduation rates, and higher dropout rates.
 It may diminish the overall quality of teaching and learning.
 It can exacerbate negative stereotypes about the intelligence and academic
ability of minority students.
3.6 CONCEPT OF USE OF TAXONOMIES IN TEST
DEVELOPMENT
Using Bloom’s Taxonomy in Test Development
Using SOLO Taxonomy in Test Development
Bloom’s Taxonomy (1956) question samples:
•Knowledge: How many…? Who was it that…? Can you name the…?
•Comprehension: Can you write in your own words…? Can you write a brief outline…? What do you
think could have happened next…?
•Application: Choose the best statements that apply. Judge the effects of… What would result…?
•Analysis: Which events could have happened…? If … happened, how might the ending have been
different? How was this similar to…?
•Synthesis: Can you design a … to achieve …? Write a poem, song or creative presentation about…?
Can you see a possible solution to…?
•Evaluation: What criteria would you use to assess…? What data was used to evaluate…? How could
you verify…?
SOLO Taxonomy
 The SOLO taxonomy, developed by Biggs and Collis (1982), stands for Structure of Observed Learning Outcomes. It describes five levels of increasing complexity in students’ responses: prestructural, unistructural, multistructural, relational, and extended abstract.
3.7 PROCEDURE OR STEPS FOR A STANDARDIZED TEST
DEVELOPMENT PROCESS
1. Purpose
2. Specifications
3. Development
4. Review
5. Pilot
6. Forms, Scoring and Analysis
3.8 EXAMPLES OF STANDARDIZED TESTS WITH
CHARACTERISTICS
Standardized tests can be classified according to their functions as:
• Group and Individual Tests
• Norm-referenced
• Achievement Tests
• Criterion-referenced
• Aptitude
• Personality
• Projective
• Interest Inventories
• Intelligence tests
Reliability
Reliability refers to the consistency of scores obtained by the same individuals when
re-examined with the same test on different occasions, or with different sets of
equivalent items.
Types of Reliability
Inter-Rater or Inter-Observer Reliability
Inter-rater reliability is estimated by considering the similarity of the scores
awarded by two independent observers.
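As a minimal sketch (the essay scores and the `pearson` helper below are invented for illustration, not taken from the unit), two raters' similarity can be checked both by correlating their scores and by counting exact agreements:

```python
from statistics import mean, stdev

# Hypothetical scores (out of 10) that two raters awarded to the
# same eight student essays; the data are made up for illustration.
rater_a = [7, 5, 9, 6, 8, 4, 7, 6]
rater_b = [8, 5, 9, 5, 8, 4, 6, 6]

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

# Exact-agreement rate: proportion of essays given identical scores.
agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

print(f"Inter-rater correlation: {pearson(rater_a, rater_b):.2f}")
print(f"Exact agreement: {agreement:.0%}")
```

A high correlation with a low exact-agreement rate would suggest the raters rank students similarly but apply different severity; both indices are worth inspecting.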
Test-Retest Reliability
⚫ It is used to judge the consistency of
results across repeated administrations
of the same test.
⚫ We estimate test-retest reliability when
we administer the same test to the same
sample on two different occasions.
⚫ The amount of time allowed between
measures is critical.
⚫ In general, the shorter the time gap, the
higher the correlation; the longer the
time gap, the lower the correlation.
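The test-retest estimate is simply the correlation between scores from the two occasions. The sketch below uses invented scores for ten students tested twice; the data and helper function are illustrative only:

```python
from statistics import mean, stdev

# Hypothetical scores for the same ten students on the same test,
# administered two weeks apart (illustrative data only).
first_admin  = [55, 62, 48, 71, 66, 59, 80, 45, 68, 52]
second_admin = [58, 60, 50, 73, 64, 61, 78, 47, 70, 55]

def pearson(x, y):
    """Pearson correlation: the test-retest reliability estimate."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

r = pearson(first_admin, second_admin)
print(f"Test-retest reliability: {r:.2f}")
```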
Split-Half Reliability
⚫ In split-half reliability, we randomly divide all items that claim to
measure the same content into two sets.
⚫ The split-half reliability estimate is the correlation between the two
half-test total scores, usually adjusted upward with the Spearman-Brown
formula to estimate full-length reliability.
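The split-half procedure can be sketched as follows. The item responses are made up, the halves are formed by the common odd/even split rather than a random one, and the standard Spearman-Brown correction (2r / (1 + r)) is applied to estimate full-length reliability:

```python
from statistics import mean, stdev

# Hypothetical item scores (1 = correct, 0 = wrong) for six students
# on an eight-item test; rows are students, columns are items.
responses = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 1, 1, 0, 1],
    [1, 1, 0, 0, 1, 0, 1, 0],
]

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

# Split the items into odd- and even-numbered halves, totalling each half.
odd_totals  = [sum(row[0::2]) for row in responses]
even_totals = [sum(row[1::2]) for row in responses]

half_r = pearson(odd_totals, even_totals)

# Spearman-Brown correction: estimates the reliability of the
# full-length test from the half-test correlation.
full_r = 2 * half_r / (1 + half_r)
print(f"Half-test r: {half_r:.2f}, corrected: {full_r:.2f}")
```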
Parallel-Form Reliability
⚫ In parallel-form reliability, we create two different tests from
the same content to measure the same learning outcomes.
⚫ The correlation between the two parallel forms is the estimate of
reliability.
Internal Consistency Reliability
● It is the degree to which the items on an instrument are consistent among
themselves and with the instrument as a whole.
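A widely used index of internal consistency is Cronbach's alpha, which compares the sum of the individual item variances with the variance of the total scores. The sketch below computes it from scratch on invented Likert-scale responses:

```python
from statistics import pvariance

# Hypothetical Likert responses (1-5) from five students to four
# attitude items intended to measure the same construct.
responses = [
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [4, 4, 5, 4],
]

def cronbach_alpha(rows):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / total-score variance)."""
    k = len(rows[0])
    items = list(zip(*rows))                      # one tuple of scores per item
    item_var = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(row) for row in rows])
    return k / (k - 1) * (1 - item_var / total_var)

print(f"Cronbach's alpha: {cronbach_alpha(responses):.2f}")
```

When the items covary strongly (students who score high on one item score high on the others), the total-score variance dwarfs the summed item variances and alpha approaches 1.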
Validity
 The validity of an assessment tool is the degree to which it measures
what it is designed to measure.
 The concept refers to the appropriateness, meaningfulness, and
usefulness of the specific inferences made from test scores.
Methods of Measuring Validity
The main forms discussed below are content validity, face validity, construct
validity (including convergent validity), and criterion validity (including
concurrent and predictive validity).
Content Validity
 Content validity evidence involves the degree to which the content of the test
matches the content domain associated with the construct.
 The items in a test should cover the whole domain.
Face Validity
Face validity is an estimate of whether a test appears to measure a certain
criterion; it concerns the surface appearance of the test.
Construct Validity
 A construct is the concept or characteristic that a test is designed to measure.
 According to Howell (1992), construct validity is a test’s ability to measure
factors which are relevant to the field of study.
Convergent Validity
Convergent validity refers to the degree to which a measure is correlated with
other measures of the same construct.
Criterion Validity
 Criterion validity evidence involves
the correlation between the test and a
criterion variable (or variables) taken
as representative of the construct.
 It compares the test with other
measures or outcomes (the criteria)
already held to be valid.
Concurrent Validity
 Concurrent validity refers to the degree to which scores taken at one point
correlate with other measures (test, observation or interview) of the same
construct measured at the same time.
Predictive Validity
 Predictive validity indicates how well the
test predicts some future behaviour of the
examinee.
 If higher scores on the Board exams are
positively correlated with higher GPAs at
university, and lower scores with lower
GPAs, then the Board exams are said to
have predictive validity.
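The Board-exam example can be sketched numerically: predictive validity is estimated by correlating exam scores with the criterion measured later. All figures below are invented for illustration:

```python
from statistics import mean, stdev

# Hypothetical data: Board exam scores for eight students and the
# university GPAs they earned two years later (illustrative only).
board_scores = [720, 650, 810, 580, 700, 760, 620, 690]
later_gpa    = [3.1, 2.8, 3.8, 2.4, 3.0, 3.4, 2.6, 3.0]

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

# A strong positive correlation would support the claim that the
# Board exam has predictive validity for university performance.
r = pearson(board_scores, later_gpa)
print(f"Predictive validity coefficient: {r:.2f}")
```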
Factors Affecting Validity
 Unclear Instructions for Taking the Test
 Difficult Language Structure
 Inappropriate Level of Difficulty
 Poorly Constructed Test Items
 Ambiguity in Items Statements
 Length of the Test
 Improper Arrangement of Items
 Identifiable Pattern of Answers
Relationship between Validity and Reliability
 Reliability is a necessary requirement for validity
 Establishing good reliability is only the first part of establishing validity
 Reliability is necessary but not sufficient for validity.
3.8.3 Usability of Tests
 Usability testing refers to evaluating a product or service by testing it with
representative users. Typically, during a test, participants try to complete
typical tasks while observers watch, listen and take notes. You should also
select tests based on how easy each test is to use. In addition to reliability and
validity, you need to consider how much time you have to create, administer
and grade a test, and how you will interpret and use the scores. Finally, check
that the test questions and directions are written clearly, that the test is short
enough not to overwhelm the students, that the questions do not include
stereotypes or personal biases, and that they are interesting and make the
students think.
Department of Secondary Teacher Education
ALLAMA IQBAL OPEN UNIVERSITY, ISLAMABAD
Dr. Hina Jalal
hinansari23@gmail.com
