Purposes and Characteristics
of Educational Assessment
Focus Questions
After reading this chapter, you should be able to answer the
following questions:
1. What are the four principal uses of educational assessment?
2. What are the main characteristics of good educational
assessment?
3. What makes a test unfair?
4. Can a test be fair but invalid?
5. Can a test be valid but unreliable?
6. How can you determine the reliability of a test?
7. How can you improve test validity and reliability?
He uses statistics as a drunken man uses lamp-posts—
for support rather than illumination.
—Andrew Lang
Not to imply that many teachers are like drunken men leaning
on their lampposts, but it is true
that many use the results of their assessments a little like
crutches to support their summaries
of the characteristics, virtues, and vices of their students. For
these teachers, test results and
grades serve as handy statistics that summarize all the important
things that can be commu-
nicated to school administrators and parents.
Others use the results of their assessments in very different
ways. They see them not as sum-
marizing important accomplishments, but as tools that suggest
to both the learner and the
teacher how the teaching–learning process can be improved. For
these teachers, assessments
are less like lampposts against which a tired (but sober) teacher
can lean and rest: They are
more like the light on top of the posts that throws back the
darkness and shows teacher and
student where the paths are, so that they need not stumble about
like blindfolded pigs in a
supermarket.
Chapter Outline
2.1 Four Purposes of Educational
Assessment
Planning for Assessment
Test Blueprints
2.2 Test Fairness
Content Problems
Trick Questions
Opportunity to Learn
Insufficient Time
Failure to Make Accommodations for
Special Needs
Biases and Stereotypes
Inconsistent Grading
Cheating and Test Fairness
2.3 Validity
Face Validity
Content Validity
Construct Validity
Criterion-Related Validity
2.4 Reliability
Test-Retest Reliability
Parallel-Forms Reliability
Split-Half Reliability
Factors That Affect Reliability
How to Improve Test Reliability
Chapter 2 Themes and Questions
Section Summaries
Applied Questions
Key Terms
2.1 Four Purposes of Educational Assessment
These distinctions highlight the difference between summative
assessment and formative
assessment.
As we saw in Chapter 1, summative assessment is assessment
that occurs mainly at the end
of an instructional period. Its chief purpose is to provide a
grade. In a real sense, it is a sum-
mation of the learner’s achievements.
Formative assessment is assessment that is an integral and
ongoing part of instruction. Its
central purpose is to provide guidance for both the teacher and
the learner in an effort to
improve learning. Its goal is formative rather than summative.
Assessment designed specifi-
cally to enhance learning—in other words, formative
assessment—is the main emphasis of
current approaches to educational assessment.
Diagnostic assessment in education, like diagnosis in medicine,
has to do with finding out
what is wrong, why it’s wrong, and how it can be fixed.
Diagnosis of automobile mechani-
cal difficulties is much the same: The mechanic runs tests to
find out what is wrong and to
develop hypotheses about how the problem can be repaired.
Diagnostic assessment in educa-
tion is no different. It has three purposes:
1. Finding out what is wrong (uncovering which learning goals
have not been met)
2. Developing hypotheses about why something is wrong
(suggesting possible reasons why
these goals have not been accomplished)
3. Suggesting ways of repairing what is wrong (devising
interventions that might increase the
likelihood that instructional objectives will be reached)
The examiners look for the causes of failure so they can suggest
ways of fostering success. Like the physician and the mechanic, the educator diagnoses in order to repair.
A fourth kind of assessment is placement assessment. It
describes assessment undertaken
before instruction. Its main purpose is to provide educators with
information about student
placement and about learning readiness. Placement assessment
can also influence choice of
content and of instructional approaches (Figure 2.1).
Planning for Assessment
While we can distinguish between assessment used for
placement, diagnosis, formation,
or summation, these distinctions do not contradict the fact that
all forms of assessment
share the same goals. Simply put, the general aim of all
assessment is to help learners learn.
Accordingly, educational assessment and instruction are part of
the same process; assessment
is for learning rather than simply of learning.
It is worth noting that the assessment of learners is, in a sense,
an assessment of the effec-
tiveness of the teacher and of the soundness and appropriateness
of instructional and assess-
ment strategies. This doesn’t invariably mean, of course, that
when students do well on their
tests they have the good fortune of having an astonishingly
good teacher. Some students do
remarkably well with embarrassingly inadequate teachers, and
others perform poorly with the
most gifted of instructors.
Still, if your teaching is aimed at admirable, clearly stated, and
well-understood instructional
objectives, and if most of your charges attain these targets, the
assessments that tell you (and
the world) that this is so also reflect positively on your
teaching.
As you plan your instruction and your assessment, the following
guidelines can be highly
useful.
Communicating Instructional Objectives
Know and communicate your instructional objectives. Not only
do teachers need to under-
stand what their learning objectives are, but students need to
know what is expected of
them—hence the importance of having clear instructional
objectives and of communicating
them to students. If students understand what the most
important learning goals are, they
are far more likely to reach them than if they are just feeling
their way around in the dark. For
example, at the beginning of a unit on geography, Mrs. Wyatt
tells her seventh graders that
at the end of the unit, they will be able to create a map of their
town to scale. To do this, Mrs.
Wyatt explains, she will have to teach them some map-making
skills and the math they will
need to calculate the scaling for the map. She then begins a
lesson on ratios.
Figure 2.1: Four kinds of educational assessment based on their main purposes
Assessment can serve at least four distinct purposes in education. However, the same assessment procedures and instruments can be used for all these purposes. Further, there is often no clear distinction among these purposes. For example, diagnosis is often part of placement and can also have formative functions.
• Summative assessment: summarizes the extent to which instructional objectives have been met; provides a basis for grading; useful for further educational or career decisions.
• Diagnostic assessment: identifies instructional objectives that have not been met; identifies reasons why targets have not been met; suggests approaches to remediation.
• Formative assessment: designed to improve the teaching/learning process; provides feedback for teachers and learners to enhance learning and motivation; enhances learning, fosters self-regulated learning, and increases motivation.
• Placement assessment: assesses pre-existing knowledge and skills; provides information for making decisions about a learner's readiness; useful for placement and selection decisions.

Aligning Goals, Instruction, and Assessments
Match instruction, assessment, and grading to goals. In theory, this guideline might seem obvious; but in practice, it is not always highly apparent. For example, one of my high school teachers, Sister Ste. Mélanie, delighted in teaching us obscure details about the lives and times of the English authors whose poems and essays were part of our curriculum. We naturally assumed that some of her most important objectives had to do with learning these intriguing details.
Her tests matched her instructional objectives. As we expected,
most of the items on the
tests she gave us asked questions like Name three different
kinds of food that would have
been common in Shakespeare’s time. (Among the correct
answers were responses like cherry
cordial, half-moon-shaped meat and potato pies called pasties,
toasted hazelnuts, and stuffed
game hens. We often left her classes very hungry.)
But sadly, Sister Ste. Mélanie’s grading was based less on what
we had learned in her
classes than on the quality of our English. She had developed an
elaborate system for sub-
tracting points based on the number and severity of our
grammatical and spelling errors.
Our grades reflected the accuracy of our responses only when
our grammar and spelling
were impeccable. Her grading did not match her instruction; it
exemplified poor educa-
tional alignment.
Educational alignment is the expression used to describe an
approach to education that
deliberately matches learning objectives with instruction and
assessment. As Biggs and Tang
(2011) describe it, alignment involves three key components:
1. A conscious attempt to provide learners with clearly
specified goals
2. The deliberate use of instructional strategies and learning
activities designed to foster
achievement of instructional goals
3. The development and use of assessments that provide
feedback to improve learning and
to gauge the degree of alignment between goals, instruction, and
assessment
Good alignment happens when teachers think through their
whole unit before beginning
instruction. They identify the learning objectives, the evidence
they will collect to document
student learning (assessments), and the sequence in which they
will have students access
and interact with information before administering assessments.
One elementary teacher
did just that when she designed a unit on the watersheds. Her
goal was the Virginia science
standard: Science 4.8—The student will investigate and
understand important VA natural
resources (a) watershed and water resources. To organize her
unit, she used a focusing
question: “What happens to the water flowing down your street
after a big rainstorm?”
She wanted students to understand the overarching concept that
every action has a con-
sequence—in this case, that the flow of water affects areas
downstream. Throughout the
course of her unit, she had children actively engaged in a
variety of meaningful tasks. The
children discussed and debated the issues of pollution and their
responsibilities to avoid pol-
luting their water. They created tangible vocabulary tools to
learn the vocabulary of the unit.
They responded to academic prompts to explain key concepts.
They built models of water-
sheds. As the students performed these tasks, the teacher noted
the level of performance of
each child and documented individual knowledge, skill, and
understanding. She used these
instructional strategies as formative assessments as she
provided feedback to each student.
In addition to multiple quizzes throughout the unit, the students
demonstrated their under-
standing of important concepts by completing a performance-
based task. Everything was
aligned so that the teacher could infer that students truly
understood the importance of
Virginia’s watersheds and water resources.
Using Assessment to Improve Instruction
Use assessment as an integral part of the teaching–learning
process. Good formative assess-
ment is designed to provide frequent and timely feedback that is
of immediate assistance to
learners and teachers. As we see in Chapter 5, this doesn’t mean
that teachers need to make
up specially designed and carefully constructed placement and
formative tests to assess their
learners’ readiness for instruction, gauge their strengths and
weaknesses, and monitor their
progress. The best formative assessment will often consist of
brief, informal assessments,
perhaps in the form of oral questions or written problems that
provide immediate feedback
and inform both teaching and learning. Formative feedback
might then lead the teacher to
modify instructional approaches and learners to adjust their
learning activities and strategies.
Using Different Approaches to Assessment
Employ a variety of assessments, especially when important
decisions depend on their out-
comes. Test results are not always entirely valid (they don’t
always measure what they are
intended to measure) or reliable (they don’t always measure
very accurately). The results of
a single test might reflect temporary influences such as those
related to fatigue, test anxiety,
illness, situational distractions, current preoccupations, or other
factors. Grades and decisions
based on a variety of assessments are more likely to be fair and
valid. For example, when
someone is ready to demonstrate driving knowledge and skills,
multiple assessments are given
by the Department of Motor Vehicles (DMV). Drivers need to
know the rules of the road, and
they also need to know how to parallel park. So DMV
assessments include both a written and
a driving field test.
Constructing Tests According to Blueprints
A house construction blueprint describes in detail the
components of a house—its dimen-
sions, the number of rooms and their placement, the materials to
be used for building it, the
pitch of its roof, the depth of its basement, its profile from
different directions. A skilled con-
tractor can read a blueprint and almost see the completed house.
In much the same way, a test blueprint describes in detail the
nature of the items to be used
in building the test. It includes information about the number of
items it will contain, the con-
tent areas they will tap, and the intellectual processes that will
be assessed. A skilled educator
can look at a test blueprint and almost see the completed test
(Figure 2.2).
Test Blueprints
It’s important to keep in mind that tests are only one form of
educational assessment.
Assessment is a broad term referring to all the various methods
that might be used to obtain
information about different aspects of teaching and learning.
The word test has a more spe-
cific meaning: In education, it refers to specific instruments or
procedures designed to mea-
sure student achievement or progress, or various student
characteristics.
Educational tests are quite different from many of the other
measuring instruments we use—
instruments like rulers, tape measures, thermometers, and
speedometers. These instruments
measure directly and relatively exactly: We don’t often have
reason to doubt them.
Our psychological and educational tests aren’t like that: They
measure indirectly and with
varying accuracy. In effect, they measure a sample of behaviors.
And from students’ behaviors
(responses), we make inferences about qualities we can’t really
measure directly at all. Thus,
from a patient’s responses to questions like “What is the first
word you think of when I say
mother?” the psychologist makes inferences about hidden
motives and feelings—and perhaps
eventually arrives at a diagnosis.
In much the same way, the teacher makes inferences about what
the learner knows—and
perhaps inferences about the learner’s thought processes as
well—from responses to a hand-
ful of questions like this one:
Which of the following is most likely to be correct?
1. Mr. Wilson will still be alive at the end of the story.
2. Mr. Wilson will be in jail at the end of the story.
3. Mr. Wilson will have died within the next 30 pages.
4. Mr. Wilson will not be mentioned again.
Tests that are most likely to allow the teacher to make valid and
useful inferences are those
that actually tap the knowledge and skills that make up course
objectives. And the best way
of ensuring that this is the case is to use test construction
blueprints that take these objectives
into consideration (see Tables 4.3 and 4.4 for examples of test
blueprints).
Figure 2.2: Guidelines for assessment
These guidelines are most useful when planning for assessment. Many other considerations have to be kept in mind when devising, administering, grading, and interpreting teacher-made tests. Some guidelines for assessment:
• Know and communicate learning targets.
• Align instruction, goals, and assessment.
• Use assessment to improve instruction.
• Use a variety of assessments.
• Develop blueprints to construct tests.
Guidelines for Constructing Test Blueprints
A good test blueprint will contain most of the following:
• A clear statement of the test content related directly to
instructional objectives
• The performance, affective, or cognitive skills to be
tapped
• An indication of the test format, describing the kinds of
test items to be used or the
nature of the performances required
• A summary of how marks are to be allocated in relation to
different aspects of
the content
• Some notion of the achievement levels expected of
learners
• An indication of how achievement levels will be graded
• A review of the implications of different grades
Regrettably, not all teachers use test blueprints. Instead, when a
test is required, many find
it less trouble to sit down and simply write a number of test
items that seem to them a rea-
sonable examination of what they have taught. And sadly, in too
many cases what they have
taught is aimed loosely at what are often implied and vague
rather than specific instructional
objectives.
Using test blueprints has a number of important advantages and
benefits. Among them is
that they force the teacher to clarify learning objectives and to
make decisions about the
importance of different aspects of content. They also encourage
teachers to become more
aware of the learner’s cognitive processes and, by the same
token, to pay more attention to
the development of higher cognitive skills.
At a more practical level, using test blueprints makes it easier
for teachers to produce similar
tests at different times, thus maintaining uniform standards and
allowing for comparisons
among different classes and different students. Also, good test
blueprints serve as a useful
guide for constructing test items and perhaps, in the long run,
make the teacher’s work easier.
Figure 2.3 summarizes some of the many benefits of using test
blueprints.
2.2 Test Fairness
Determining what the best assessment procedures and
instruments are is no simple matter
and is not without controversy. But although educators and
parents don’t always agree about
these matters, there is general agreement about the
characteristics of good measuring instru-
ments. Most important among these is that evaluative
instruments be fair and that students
see them as being fair. The most common student complaint
about tests and testing practices
has to do with their imagined or real lack of fairness (Bouville,
2008; Felder, 2002).
The importance of test fairness was highlighted during the
Vietnam War in the 1960s.
President Kennedy’s decision to send troops to Vietnam led to
the drafting of large numbers
of age-eligible men, some of whom died or were seriously
injured in Vietnam. But men who
went to college were usually exempt from the draft—or their
required military service was at
least deferred. So, for many, it became crucial to be admitted to
undergraduate or postgradu-
ate studies. For some, passing college or graduate entrance
exams was literally a matter of life
and death. That the exams upon which admission decisions
would be based should be as fair
as possible seemed absolutely vital.
Just how fair are our educational assessments? We don’t always
know. But science provides
ways of defining and sometimes of actually measuring the
characteristics of tests. It says, for
example, that the best assessment instruments have three
important qualities:
1. Fairness
2. Validity
3. Reliability
As we saw, from the student’s point of view, the most important
of these is the apparent—
and real—fairness of the test.
There are two ways of looking at test fairness, explains Bouville (2008): On the one hand, there is fairness of treatment; on the other, there is fairness of opportunity. Fairness of treatment issues include problems relating to not making accommodations for children with special needs, biases and stereotypes, the use of misleading "trick" questions, and inconsistent grading. Fairness of opportunity problems include testing students on material not covered, not providing an opportunity to learn, not allowing sufficient time for the assessment, and not guarding against cheating. We look at each of these issues in the following sections (Figure 2.4).

Figure 2.3: Advantages of test blueprints
Making and using test blueprints presents a number of distinct benefits. And, although developing blueprints can be time-consuming, contrary to what some think, it can make the teacher's task easier rather than more difficult and complicated. Advantages of devising and using test blueprints:
• Forces the teacher to clarify learning targets.
• Promotes decisions about the relative importance of different aspects of content.
• Encourages teachers to become more aware of learners' cognitive activity.
• Promotes the development of thinking rather than mainly remembering skills.
• Increases test validity and reliability.
• Simplifies test construction.
• Leads to more consistency among different tests, allowing more meaningful comparisons.
Content Problems
Tests are—or at the very least, seem—highly unfair when they
ask questions or pose prob-
lems about matters that have not been covered or assigned. This
issue sometimes has to do
with bad teaching; at other times, it simply relates to bad test
construction. For example, in
my second year in high school, we had a teacher who almost
invariably peppered her quizzes
and exams with questions about matters we had never heard
about in class. “We didn’t have
time,” she would protest when someone complained and pointed
out that she had never
mentioned rhombuses and trapezoids and quadrilaterals. “But
it’s important and it’s in the
book and it might be on the final exam,” she would add.
Had she simply told us that we were responsible for the content
in Chapter 6, we would
not have felt so unfairly treated. This example illustrates bad
teaching as much as bad test
construction.
In connection with content problems that affect test fairness, it
is interesting to note that
when test results are higher, students tend to perceive the test as
being fairer. It’s an intrigu-
ing observation that, it turns out, may have a grain of truth in it.
As Oller (2012) points out,
higher scores are evidence that there is agreement between test
makers and the better stu-
dents about the content that is most important. This agreement
illustrates what we termed
educational alignment: close correspondence among goals,
instructional approaches, and
assessments.
Conversely, exams that yield low scores for all students may
reflect poor educational align-
ment: They indicate that what the teacher chose to test is not
what even the better learners
have learned. Hence there is good reason to believe that tests
that yield higher average scores
are, in fact, fairer than those on which most students do very
poorly. And raising the marks,
perhaps by scaling them so that they approximate a normal
distribution with an acceptably
high average, will do little to alter the apparent fairness of the
test.
Figure 2.4: Issues affecting test fairness
That a test is fair, and that it seems to be fair, is one of the most important characteristics of good assessment.
Issues of fairness of opportunity:
• Testing material not covered
• Not providing an opportunity to learn
• Not allowing sufficient time to complete the test
• Not guarding against cheating
Issues of fairness of treatment:
• Not accommodating to special needs
• Being influenced by biases and stereotypes
• Using misleading, trick questions
• Grading inconsistently
Trick Questions
Trick questions illustrate problems that have less to do with test
content than with test
construction—which, of course, doesn’t mean that the test
maker is always unaware that one
or more questions might be considered trick questions.
Trick questions are questions that mislead and deceive,
regardless of whether the deception is
intentional or is simply due to poor item construction. Trick
questions do not test the intended
learning targets, but rather a student’s ability to navigate a
deceptive test. Items that students
are most likely to consider trick questions include:
1. Questions that are ambiguous (even when the ambiguity is
accidental rather than deliber-
ate). Questions are ambiguous when they have more than one
possible interpretation. For
example, “Did you see the man in your car?” might mean, “Did
you see the man who is in
your car?” or “Did you see the man when you were in your
car?”
2. Multiple-choice items where two nearly identical alternatives
seem correct. Or, as in the
following example, where all alternatives are potentially
correct:
The Spanish word fastidiar means:
annoy
damage
disgust
harm
3. Items deliberately designed to catch students off their guard.
For example, consider this
item from a science test:
During a very strong north wind, a rooster lays an egg on a flat
roof: On what side of the
roof is the egg most likely to roll off?
North
South
East
West
No egg will roll off the roof
Students who aren’t paying sufficient attention on this fast-
paced, timed test might well
say South. Seems reasonable. (But no; apparently, roosters
rarely lay eggs.)
4. Questions that use double negatives. For example: Is it true
that people should never not
eat everything they don’t like?
5. Items in which some apparently trivial word turns out to be
crucial. That is often the case
for words such as always, never, all, and most, as in this item:
True or False? Organic prod-
ucts are always better for you than those that are nonorganic.
6. Items that make a finer discrimination than expected. For
example, say a teacher has
described the speed of sound in dry air at 20 degrees centigrade
as being right around 340
meters per second. Now she presents her students with this
item:
What is the speed of sound in dry air at 20 degrees centigrade?
A. 300 meters per second
B. around 340 meters per second
C. 343.2 meters per second
D. 343.8 meters per second
Because the alternatives contain both the correct answer (C)
and the less precise informa-
tion given by the teacher (B), the item is deceiving.
7. Long stems in multiple-choice questions that include
extraneous and irrelevant information
but that serve to distract. Consider, for example, this multiple-
choice item:
A researcher found that the average score of a sample
consisting of 106 females was 52.
The highest score was 89 while the lowest score was 34. In this
study, the median score
was 55 and the two most frequent scores were 53 and 58. What
was the sum of all the
scores?
A. 5,512
B. 5,830
C. 5,618
D. 6,148
All the information required to answer this item correctly (A)
is included in the first sen-
tence. Everything after that sentence is irrelevant and, for that
reason, misleading.
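(Because the mean is the sum of all the scores divided by the number of scores, the sum here is simply the mean times the number of scores: 52 × 106 = 5,512—alternative A.)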
Opportunity to Learn
Tests are patently unfair when they sample concepts, skills, or
cognitive processes that stu-
dents have not had an opportunity to acquire. Lack of
opportunity to learn might reflect an
instructional problem. For example, it might result from not
being exposed to the material
either in class or through instructional resources. It might also
result from not having sufficient
time to learn. Bloom (1976), for example,
believed that there are faster and slower
learners (not gifted learners and those less
gifted), and that all, given sufficient time,
can master what schools offer.
If Bloom is mostly correct, the results of
many of our tests indicate that we sim-
ply don’t allow some of our learners suf-
ficient time for what we ask of them.
Bloom’s mastery learning system offers
one solution. Mastery learning describes
an instructional approach in which course
content is broken into small, sequential
units and steps are taken to ensure that
all learners eventually master instructional
objectives (see Chapter 6 for a discussion
of mastery learning).
Another solution, suggests Beem (2010),
is the expanded use of technology and of
virtual reality instructional programs. These are instructional
computer-based simula-
tions designed to provide a sensation of realism. She argues that
these, along with other
digital technologies including computers and handheld devices,
offer students an opportunity
to learn at their own rate. Besides, digital technology might also
reduce the negative influence
of poorly qualified teachers—if there are any left.
▲ Ambiguous questions, misleading items, items about material not covered or assigned, overly long tests—all of these contribute to the perceived unfairness of tests.
As Ferlazzo (2010) argues, great teaching is about giving
students the opportunity to learn.
Poor and unfair testing is about assessing the extent to which
they have reached instruc-
tional objectives they have never had an opportunity to reach.
See Applications: Addressing
Unfair Tests.
Insufficient Time
Closely related to the unfairness that results from not having an
opportunity to learn is the
injustice of a test that doesn’t give students an opportunity to
demonstrate what they actu-
ally have learned. For some learners, this is a common
occurrence simply because they tend
to respond more slowly than others and, as a result, often find
that they don’t have enough
time to complete the test.
APPLICATIONS: Addressing Unfair Tests
In spring 2013, students in New York City had their first
experience with English Language Arts
tests designed to tap the curriculum of the Common Core State
Standards. After adopting these
standards in 2010, New York hired a test-publishing company to
design a test that would reflect the
knowledge and skills within the Common Core Standards.
After witnessing their students’ anguish following the initial
testing experience, 21 principals
were so outraged that they felt compelled to issue a formal
protest through a letter to the State’s
Commissioner of Education. In that letter, they highlighted how
the English/Language Arts test
was unfair. One of their major concerns was the lack of
alignment between the types of questions
asked and the critical thinking skills valued in the Common
Core State Standards. The Common
Core State Standards emphasize deep and rich analysis of
fiction and nonfiction. But the ELA tests
focused mostly on specific lines and words rather than on the
wider standards. What was taught
in the classrooms was not assessed on the test: The test failed to
meet the criterion of fairness of
opportunity.
While alignment between what was tested and what is in the
Standards is important, this was not
the administrators’ only complaint. In reviewing the tests taken
by the students, they concluded
that the structure of the tests was not developmentally
appropriate. For example, testing required
90-minute sessions on each of three consecutive days—a
difficult undertaking for a mature stu-
dent, let alone for a 10-year-old. Clearly, there is a violation
here of the criterion of fairness of
opportunity.
Finally, the principals expressed concern that too much was
riding on a flawed test developed by a
company with a track record of errors. They feared that the tests
might not be valid. Yet students’
promotion to the next grade, entry into middle and secondary
school, and admission to special
programs are often based on these tests. In addition, teachers
and schools are evaluated in terms of
how well their students perform, even though that is not an
intended use of the tests. As a result,
scores on these tests can affect the extent to which schools
receive special funds or are put on
improvement plans. These complications raise questions about
the test’s validity for these purposes.
Clearly, as the principals reflected on the new
English/Language Arts test, they saw problems with
both fairness of opportunity and validity.
Suppose that a 100-item test is designed to sample all the target
skills and information that
define a course of study. If a student has time to respond to only
80 of these 100 items, then
only 80% of the instructional objectives have been tested. That
test is probably unfair for
that student.
There is clearly a place for speeded testing, particularly with
respect to standardized tests such
as those that assess some of the abilities that contribute to
intelligence. (We look at some
of these tests in Chapter 10.) But as a rule of thumb, teacher-
made tests should always be
of such a length and difficulty level that most, if not all,
students can easily complete them
within the allotted time (van der Linden, 2011).
Failure to Make Accommodations for Special Needs
Timed tests can be especially unfair to some learners with
special needs. For example, Gregg
and Nelson (2012) reviewed a large number of studies that
looked at performance on timed
graduation tests—a form of high-stakes testing (so called
because results on these tests can
have important consequences relating to transition from high
school, school funding, and
even teaching and administrative careers). These researchers
found that whereas students
with learning disabilities would normally be expected to
achieve at a lower than average level
on these tests, when they are given the extra time they require,
their test scores are often
comparable to those of students without disabilities.
Giving students with special needs extra time is the most
common of possible accommoda-
tions. It is also one of the most effective and fairest
adjustments. Even for more gifted and
talented learners, additional time may be important. Coskun
(2011) reports a study where the
number of valuable ideas produced in creative brainstorming
groups was positively related to
the amount of time allowed.
Accommodations for Test Anxiety
In addition to being given extra time for learning and
assessment, many other accommoda-
tions for learners with special needs are possible and often
desirable. For example, steps can
be taken to improve the test performance of learners with severe
test anxiety. Geist (2010)
suggests that one way of doing this is to reduce negative
attitudes toward school subjects
such as mathematics. As Putwain and Best (2011) showed, when
elementary school students
are led to fear a subject by being told that it will be difficult
and that important decisions will
be based on how well they do, their performance suffers. The
lesson is clear: Teachers should
not try to motivate their students by appealing to their fears.
For severe cases of test anxiety, certain cognitive and
behavioral therapies, in the hands of
a skilled therapist, are sometimes highly effective (e.g., Brown
et al., 2011). And even in less
skilled hands, the use of simple relaxation techniques might be
helpful (for example, Larson
et al., 2010).
It is worth keeping in mind, too, that test anxiety often results
from inadequate instruction
and learning. Not surprisingly, after Faber (2010) had exposed
his “spelling-anxious” students
to a systematic remedial spelling training program, their
spelling performance increased and
their test anxiety scores decreased.
Accommodations for Minority Languages
Considerable research indicates that children whose first
language is not the dominant school
language are often at a measurable disadvantage in school. And
this disadvantage can become
very apparent if no accommodations are made in assessment
instruments and procedures—as
is sometimes the case for standardized tests given to children
whose dominant language is
not the test language (Sinharay, Dorans, & Liang, 2011). As
Lakin and Lai (2012) note, there are
some serious issues with the fairness and reliability of ability
measures given to these children
without special accommodations. As we saw in Chapter 1,
accommodations in these cases are
mandated by law (see In the Classroom: Culturally Unfair
Assessments).
Accommodations for Other Special Needs
Teachers must be sensitive to, and they must make
accommodations for, many other “special
needs.” These might include medical problems, sensory
disabilities such as vision and hearing
problems, emotional exceptionalities, learning disabilities, and
intellectual disabilities. They
might also include cultural and ethnic differences among
learners.
Figure 2.5 describes some of the accommodations that fair
assessments of students with spe-
cial needs might require.
IN THE CLASSROOM: Culturally Unfair Assessments
Joseph Born-With-One-Tooth knew all the legends his
grandfather and the other elders told—even
those he had heard only once. His favorites were the legend of
the Warriors of the Rainbow, and
the legend of Kuikuhâchâu, the man who took the form of the
wolverine. These legends are long,
complicated stories, but Joseph never forgot a single detail,
never confused one with the other.
The elders named him ôhô, which is the word for owl, the wise
one. They knew that Joseph was
extraordinarily gifted.
But in school, it seemed that Joseph was unremarkable. He read
and wrote well, and he performed
better than many. But no one even bothered to give him the tests
that singled out those who were
gifted and talented. Those who are talented and gifted are often
identified through a combina-
tion of methods, beginning with teacher nominations that then
lead to further testing and perhaps
interviews and auditions (Pfeiffer & Blei, 2008). Those who
don’t do as well in school, sometimes
because of cultural or language differences, tend to be
overlooked.
Joseph Born-With-One-Tooth is not alone. Aboriginal and other
culturally different children are
vastly underrepresented among the gifted and the talented
(Baldwin & Reis, 2004). By the same
token, they tend to be overrepresented among programs for
those with learning disabilities and
emotional disorders (Briggs, Reis, & Sullivan, 2008).
There is surely a lesson here for those concerned with the
fairness of assessments.
Biases and Stereotypes
Accommodations for language differences are not especially
difficult. But overcoming the
many biases and stereotypes that can affect the fairness of
assessments often is.
Biases are preconceived judgments usually in favor of or
against some person, thing, or idea.
For example, I might think that Early Girl tomatoes are better
than Big Boys. That is a harmless
bias. And like most biases, it is a personal tendency. But if we
North Americans tend to believe
that all Laplanders are such and such, and most Roma are this
and that (such and such and
this and that of course being negative), then we hold some
stereotypes that are potentially
highly detrimental.
Closer to home, historically there have been gender stereotypes
about male–female differ-
ences whose consequences can be unfair to both genders. Some
of these stereotypes are
based on long-held beliefs rooted in culture and tradition and
propagated through centuries
of recorded “expert” opinion. And some are based on various
controversial and often con-
tested findings of science.
It’s clear that males and females have some biologically linked
sex differences, mainly in physi-
cal skills requiring strength, speed, and stamina. But it’s not
quite so clear whether we also
have important, gender-linked psychological differences. Still,
early research on male–female
differences (Maccoby & Jacklin, 1974) reported significant
differences in four areas: verbal abil-
ity, favoring females; mathematical ability, favoring males;
spatial–visual ability (evident, for
example, in navigation and orientation skills), favoring males;
and aggression (higher in males).
Figure 2.5: Fair assessment accommodations for children with special needs
These are only a few of the many possible accommodations that might be required for fair assessment of children with special needs. Each child's requirements might be different. Note, too, that some of these accommodations might increase the fairness of assessments for all children.
Possible accommodations for fair assessment of students with special needs:
Instructional accommodations
• teacher aides and other professional assistance
• special classes and programs
• individual education plans
• special materials such as large print or audio devices
• provisions for reducing test anxiety
• increased time for learning
Testing accommodations
• increased time for test completion
• special equipment for test-taking
• different form of test (for example, verbal rather than written)
• giving test in different setting
• testing in a different language
Many of these differences are no longer as apparent now as they
were in 1974. There is
increasing evidence that when early experiences are similar,
differences are minimal or non-
existent (Strand, 2010).
But the point is that experiences are not always similar, nor are
opportunities and expecta-
tions. In the results of many assessments, there are still gender
differences. These often favor
males in mathematics and females in language arts (e.g., De
Lisle, Smith, Keller, & Jules, 2012).
And there is evidence that the stereotypes many people still
hold regarding, say, girls’ inferior-
ity in mathematics might unfairly affect girls’ opportunities and
their outcomes.
In an intriguing study, Jones (2011) found that when women
were shown a video supporting
the belief that females perform more poorly than males in
mathematics, subsequent tests
revealed a clear gender difference in favor of males on a
mathematics achievement test. But
when they were shown a video indicating that women performed
as well as men, no sex dif-
ferences were later apparent.
Inconsistent Grading
Approaches to grading can vary enormously in different schools
and even in different class-
rooms within the same school. They might involve an enormous
range of practices, including
• Giving or deducting marks for good behavior
• Giving or deducting marks for class participation
• Giving or deducting marks for punctuality
• Using well-defined rubrics for grading
• Basing grades solely on test results
• Giving zeros for missed assignments
• Ignoring missed assignments
• Using grades as a form of reward or punishment
• Grading on any of a variety of letter, number, percentage,
verbal descriptor, or other
systems
• Allowing students to disregard their lowest grade
• And on and on . . .
No matter what practices are used in a given school, for
assessments to be fair, grades need to
be arrived at in a predictable and transparent manner. Moreover,
the rules and practices that
underlie their calculation need to be consistent. This approach
is also critical for describing
what students know and are able to do. If a math grade is
polluted with behavioral objec-
tives such as participation, how will the student and parents
know what the student’s math
skills are?
Inconsistent grading practices are sometimes evident in
disparities within schools, where dif-
ferent teachers grade their students using very different rules. In
one class, for example, stu-
dents might be assured of receiving relatively high grades if
they dutifully complete and hand
in all their assignments as required. But in another class, grades
might depend entirely on test
results. And in yet another, grades might be strongly influenced
by class participation or by
spelling and grammar.
Inconsistent grading within a class can also present serious
problems of fairness for students.
A social studies teacher should not ignore grammatical and
spelling errors on a short-answer
test one week and deduct handfuls of marks for the same sorts
of errors the following week.
Simply put, the criteria that govern grading should be clearly
understood by both the teacher
and students, and those criteria should be followed consistently.
Cheating and Test Fairness
Most of us, even the cheaters among us, believe that cheating is
immoral. Sometimes it is
even illegal—such as when you do it on your income tax return.
And clearly, cheating is unfair.
First, if cheating results in a higher than warranted grade, then
it does not represent the stu-
dent’s progress or accomplishments—which hardly seems fair.
Second, those who cheat, by that very act, cheat other students.
I once took a statistics
course where, in the middle of a dark night, a fellow student
sneaked up the brick wall of
the education building, jimmied open the window to Professor
Clark’s office, and copied the
midterm exam we were about to take. He then wrote out what he
thought were all the cor-
rect answers and sold copies to a bunch of his classmates.
I didn’t buy. No money, actually. And I didn’t do nearly as well
on the test as I expected. I
thought I had answered most of the questions correctly; but, this
being a statistics course, the
raw scores (original scores) were scaled so that the class
average would be neither distress-
ingly low nor alarmingly high.
The deception was soon uncovered. Some unnamed person later
informed Professor Clark
who, after reexamining the test papers, discovered that 10 of his
35 students had nearly iden-
tical marks. More telling was that on one item, all 10 of these
students made the same, highly
unlikely, computational error.
Cheating is not uncommon in schools, especially in higher
grades and in postsecondary pro-
grams where the stakes are so much higher. In addition, today
there are far more oppor-
tunities for cheating than there were in the days of our
grandparents. Wireless electronic
communication; instant transmission of photos, videos, and
messages; and wide-scale access
to Internet resources have seen to that.
High-Stakes Tests and Cheating
There is evidence, too, that high-stakes testing may be
contributing to increased cheating,
especially when the consequences of doing well or poorly can
dramatically affect entire school
systems. For example, state investigators in Georgia found that
178 administrators and teach-
ers in 44 Atlanta schools who had early access to standardized
tests systematically cheated to
improve student scores (Schachter, 2011).
Some school systems cheat on high-stakes tests by excluding
certain students who are not
expected to do well; others cheat by not adhering to guidelines
for administering the tests,
perhaps by giving students more time or even by giving them
hints and answers (Ehren &
Swanborn, 2012).
A more subtle form of administrative and teacher cheating on
high-stakes tests takes the
form of “narrowing” the curriculum. In effect, instructional
objectives are narrowed to topics
covered by the tests, and instruction is focused specifically on
those targets to the exclusion
of all others. This practice, notes Berliner (2011), is a
rational—meaning “reasonable or intel-
ligent”—reaction to high-stakes testing.
With the proliferation of online courses and online universities,
the potential for electronic
cheating has also increased dramatically (Young, 2012). For
example, online tests can be taken
by the student, the student’s friend, or even some paid expert,
with little fear of detection.
Preventing Cheating
Among the various suggestions for preventing or reducing
cheating on exams are the following:
• Encourage students to value honesty.
• Be aware of school policy regarding the consequences of
cheating, and communicate
them to students.
• Clarify for students exactly what cheating is.
• When possible, use more than one form of an exam so that
no two adjacent students
have the same form.
• Stagger seats so that seeing other students’ work is
unlikely.
• Randomize and assign seating for exams.
• Guard the security of exams and answer sheets.
• Monitor exams carefully.
• Prohibit talking or other forms of communication during
exams.
Of course, none of these tactics, or even all of them taken
together, is likely to guarantee
that none of your students cheat. In fact, one large-scale study
found that 21% of 40,000
undergraduate students surveyed had cheated on tests, and an
astonishing 51% had cheated
at least once on their written work (McCabe, 2005; Figure 2.6).
Sadly, that cheating is prevalent does not justify it. Nor does it
do anything to increase the
fairness of our testing practices.
Figure 2.6: Cheating among college undergraduates
Percentage of undergraduate students who admitted having cheated at least once: of 40,000 students surveyed, 51% admitted cheating on written work and 21% admitted cheating on tests.
Source: Based on McCabe, D. (2005). It takes a village: Academic dishonesty. Liberal Education. Retrieved September 2, 2012, from http://www.middlebury.edu/media/view/257515/original/It_takes_a_village.pdf
Figure 2.7 summarizes the main characteristics of fair
assessment practices. Related to this,
Table 2.3 presents the American Psychological Association
(APA) Code of Fair Testing Practices
in Education. Because of its importance, the code is reprinted in
its entirety at the end of
this chapter.
2.3 Validity
In addition to the characteristics of fair assessment practices
listed in Figure 2.7 and Table 2.3,
the fairness of a test or assessment system depends on the
reliability of the test instruments
and the validity of the inferences made from the test results.
Simply put, a test is valid if it measures what it is intended to
measure. For example, a high
schooler’s ACT scores should not be used to decide if a student
should have a driver’s license.
The test is designed to predict college performance rather than
readiness to drive. From a
measurement point of view, validity is the most important
characteristic of a measuring
instrument. If a test does not measure what it is intended to
measure, the scores derived from
it are of no value whatsoever, no matter how consistent and
predictable they are.
Test validity has to do not only with what the test measures, but
also with how the test results
are used. It relates to the inferences we base on test results and
the consequences that fol-
low. In effect, interpreting test scores amounts to making an
inference about some quality or
characteristic of the test taker.
For example, based on Nora’s brilliant performance on a
mathematics test, her teacher infers
that Nora has commendable mathematical skills and
understanding. And one consequence
of this inference might be that Nora is invited to join the
lunchtime mathematics enrichment
group. But note that the inference and the consequence are
appropriate and defensible only
if the test on which Nora performed so admirably actually
measures relevant mathematical
skills and understanding.
The important point is that in educational assessment, validity
is closely related to the way
test results are used. Accordingly, a test may be valid for some
purposes but totally invalid
for others.
Figure 2.7: Fair assessment practices
Assessments are not always fair for all learners. But their fairness can be improved by paying attention to some simple guidelines. The fairest assessment practices:
• Cover material that every student has had an opportunity to learn.
• Reflect learning targets for that course.
• Allow sufficient time for students to finish the test.
• Discourage cheating.
• Provide accommodations for learners with special needs.
• Ensure that tests are free of biases and stereotypes.
• Avoid misleading questions.
• Follow consistent and clearly understood grading practices.
• Base important decisions on a variety of different assessments.
• Take steps to ensure the validity and reliability of assessments.
Face Validity
How can you determine whether a test is valid? Put another
way, how do you know a test
measures what it says it measures? Or what it is intended to
measure?
There are a number of ways of answering these questions. One
of the most obvious is to look
at the items that make up the test. Does the mathematics test
look like it measures mathemat-
ics? Does the grammar test appear to be a grammar test?
Answers to these sorts of questions determine the face validity
of the test. Basically, face
validity is the extent to which the test appears to measure what
it is supposed to measure. If
the mathematics test consists of appropriate mathematical
problems, it has face validity.
Face validity is especially important for teacher-made tests. Just
by looking at a test, students
should immediately know that they are being tested on the right
things. A mathematics test
that has face validity will not ask a series of questions based on
Shakespeare’s Julius Caesar.
Occasionally, however, test makers are careful to avoid any hint
of face validity. For example,
if you wanted to construct a test designed to measure a
personality characteristic such as
honesty, you probably wouldn’t want your test participants to
know what is being measured.
If your instrument had face validity—that is, if it looked like it
was measuring honesty—the
scoundrels who take it might actually lie and act as if they are
honest when they really aren’t.
Better to deceive them, lie to them, pretend you are testing
motivational qualities or character
strength, so you can determine what liars and rogues they really
are.
Content Validity
Of course, a test must not only look as though it measures what
it is intended to measure,
but should actually do so. That is, its content should reflect the
instructional objectives it is
designed to assess. This indicator of validity, termed content
validity, is assessed by analyz-
ing the content of test items in relation to the objectives of the
course, unit, or lesson.
Determining Content Validity
Content validity is one of the most important kinds of validity
for measurements of school
achievement. A test with high content validity includes items
that sample all important course
objectives in proportion to their importance. Thus, if some of
the objectives of an instruc-
tional sequence have to do with the development of cognitive
processes, a relevant test will
have content validity to the extent that it samples these
processes. And if 40% of the course
content (and, consequently, of the course objectives) deals with
knowledge (rather than with
comprehension, analysis, and so on), 40% of the test items
should assess knowledge.
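For example, on a 50-item unit test for a unit whose objectives are weighted roughly 40% knowledge, 30% comprehension, and 30% analysis (illustrative figures, not from the chapter), a content-valid test would devote about 0.40 × 50 = 20 items to knowledge and 0.30 × 50 = 15 items each to comprehension and analysis.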
Determining the content validity of a test is largely a matter of
careful, logical analysis of the
items it comprises. Basically, the test needs to include a sample
of items that tap the knowl-
edge and skills that define course objectives.
Increasing Content Validity
As Wilson, Pan, and Schumsky (2012) explain, the basic
process the test maker should follow
to ensure content validity involves the following steps:
1. Define the content (the instructional objectives).
2. Define the level of difficulty or abstraction for the items.
3. Develop a pool of representative items.
4. Determine what ratio of different items best represents the
instructional objectives.
5. Develop a test blueprint.
One of the main advantages of preparing a test blueprint (also
referred to as a table of specifi-
cations) is that it ensures a relatively high degree of content
validity (providing, of course, that
the test maker follows the blueprint).
It’s important to realize that tests and test items do not possess
validity as a sort of intrin-
sic quality; a test is not generally valid or generally invalid in
and of itself. Rather, it is valid
for certain purposes and with certain individuals, and it is
invalid for others. For example, if
the following item is intended to measure comprehension, it
does not have content validity,
because it measures only simple recall:
How many different kinds of validity are discussed in this
chapter?
A. 1
B. 2
C. 3
D. 5
E. 10
If, on the other hand, the item were intended to
measure knowledge of specifics, it would have
content validity. And an item such as the follow-
ing might have content validity with respect to
measuring comprehension:
Explain why face validity is important for
teacher-constructed tests.
Note, however, that this last item measures
comprehension only if students have not been
explicitly taught an appropriate answer. It is
quite possible to teach principles, applications,
analyses, and so on as specifics, so that ques-
tions of this sort require no more than recall of
knowledge. What an item measures is not inher-
ent in the item itself so much as in the relation-
ship between the material as the student has
learned it and what the item requires.
Construct Validity
A third type of validity, construct validity, is somewhat less
relevant for teacher-constructed
tests but highly relevant for many other psychological measures
(e.g., personality and intel-
ligence tests).
Ryan McVay/Photodisc/Thinkstock
▲ One measure of validity is reflected in the extent to
which the predictions we base on test results are borne
out. If this boy does exceptionally well on this standard-
ized battery of tests, will he also do well next year in
fourth grade? In high school? In college?
In essence, a construct is a hypothetical variable—an
unobservable characteristic or qual-
ity, often inferred from theory. For example, a theory might
argue that individuals who are
highly intelligent should be reflective rather than impulsive—
reflectivity being evident in the
care and caution with which they solve problems or make
decisions. Impulsivity would be
apparent in a person’s hastiness and in failure to consider all
aspects of a situation. One way
to determine the construct validity of a test designed to measure
intelligence would then be
to look at how well it correlates with measures of reflection and
impulsivity (see Chapter 9 for
a discussion of correlation—a mathematical index of
relationships).
Criterion-Related Validity
If Harold does exceptionally well on all his 12th-grade year-end
tests, his teachers might be
justified in predicting that he will do well in college. Colleges
that subsequently admit Harold
into one of their programs because of his grade 12 marks are
also making the same prediction.
Predictive Validity
At all levels, prediction is one of the main uses of summative
(rather than formative) assess-
ments. We assume that all students who do well on year-end
fifth-grade achievement tests
will do reasonably well in sixth grade. We also predict that
those who perform poorly on these
tests will not do well in sixth grade, and we might use this
prediction as justification for having
them undertake remedial work.
The extent to which our predictions are accurate reflects
criterion-related validity. One
component of this form of validity, just described, is labeled
predictive validity. Predictive
validity is easily measured by looking at the relationship
between actual performance on a
test and subsequent performance. Thus, a college entrance
examination designed to identify
students whose chances of college success are high has
predictive validity to the extent that
its predictions are borne out.
Concurrent Validity
Concurrent validity, a second aspect of criterion-related
validity, is the relationship between
a given test and other measures of the same behaviors or
characteristics. For example, as we
see in Chapter 10, the most accurate way to measure
intelligence is to administer a time-
consuming and expensive individual test. A second option is to
administer a quick, inexpen-
sive group test; a third, far less consistent approach, is to have
teachers informally assess
intelligence based on what they know of their students’
achievements and effort. Teachers’
assessments are said to have concurrent validity to the extent
that they are similar to the more
formal measures. In the same way, a group or an individual test
is said to have concurrent
validity if it agrees well with measures obtained using a
different and presumably valid test.
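Both aspects of criterion-related validity reduce, in practice, to a correlation between the test and some criterion measure. The brief Python sketch below illustrates the idea; all of the scores, GPAs, and test results are hypothetical numbers chosen only to show the computation.

import numpy as np

# Hypothetical data: entrance-exam scores and later first-year GPAs for ten students.
exam_scores = np.array([72, 84, 56, 79, 55, 88, 91, 63, 70, 77])
college_gpa = np.array([2.9, 3.4, 2.1, 3.1, 2.3, 3.6, 3.7, 2.6, 2.8, 3.0])

# Predictive validity: how well do exam scores predict later performance?
predictive_r = np.corrcoef(exam_scores, college_gpa)[0, 1]

# Concurrent validity: agreement between a quick group test and an individual test
# given at about the same time (again, invented numbers).
group_test = np.array([101, 115, 93, 108, 96, 120, 124, 99, 104, 110])
individual_test = np.array([104, 112, 95, 111, 90, 118, 127, 101, 102, 113])
concurrent_r = np.corrcoef(group_test, individual_test)[0, 1]

print(f"predictive validity coefficient: {predictive_r:.2f}")
print(f"concurrent validity coefficient: {concurrent_r:.2f}")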
Figure 2.8 summarizes the various approaches to determining
test validity.
2.4 Reliability
Reliability is what we want in our cars, our computers, our
spouses, our dogs. We want
our cars and our computers to start when we go through the
appropriate motions, and we
want them to function as they were designed to function. So,
too, with dogs and spouses:
Reliability is predictability and consistency. If you stepped on
your bathroom scale five times
in a row, you would expect it to display the same weight each
time.
Reliability in educational measurement is no different.
Basically, it has to do with consistency.
Good measuring instruments must not only measure what they
are intended to measure (they
must have validity); they must also provide consistent,
dependable, reliable measures.
Reliability in testing has to do with the accuracy of our
measurements. The more errors there
are in our measurements, the less reliable will be our test
results. A reliable intelligence test,
for example, should yield similar results from one week to the
next. Or even from one year to
the next.
But the reliability of most of our educational and psychological
measures is never perfect. If
you give Roberta an intelligence test this week and another in
two weeks, it is highly unlikely
that her scores will be identical. No matter, we say, as long as
the difference between the two
scores is not too great. After all, many factors can account for
this error of measurement.
Figure 2.8: Types of test validity
Validity is closely related to the ways that a test is used. If a test is not valid, it is also likely to be unreliable and unfair.
• Face validity: the test appears to measure what it says it measures.
• Content validity: the test samples behaviors that represent both the topics and the processes implicit in course objectives.
• Construct validity: the test taps hypothetical variables that underlie the property being tested.
• Criterion-related validity:
  – Predictive: test scores are valuable predictors of future performance in related areas.
  – Concurrent: test scores are closely related to similar measures based on other presumably valid tests.
Say Roberta scored 123 the first week but only 102 the second.
The difference between
the two scores may be because Roberta had a headache at the
time of the second testing.
Perhaps she was distracted by personal problems or tired from a
long trip or anxious about
the test or confused by some new directions.
In psychology and education, we tend to assume that the things
we measure are relatively
stable. But the emphasis should be on the word relatively
because we know that much of
what we measure is variable. So at least some of the error in our
measurements is likely due
to instability and change in what we measure. But if two similar
measures of achievement in
chemistry yield a score of 82% one week but only 53% the next
week for the same student,
then the test we are using may well have a reliability problem.
How can we assess the reli-
ability of our tests?
Test-Retest Reliability
If a test measures what it purports to (that is, if it is valid), and
if what it measures does not
fluctuate unpredictably, no matter how often it is given, the test
should yield similar scores. If
it doesn’t, it is not only unreliable but probably invalid as well.
In fact, a test cannot be valid
without being reliable. If it yields inconsistent scores for a
stable characteristic, we can hardly
insist that it is measuring what it is supposed to measure.
That a test should yield similar scores from one testing to the
next—unless, of course, the
test is simple enough that the student learns and remembers
appropriate responses—is the
basis for one of the most common measures of reliability.
Giving the same test two or more
times and comparing the results obtained at each testing yields a
measure of what is known
as test-retest reliability (sometimes also called repeated-
measures reliability or stability
reliability).
Say, for example, that I give a group of first-grade students a
standardized language profi-
ciency test (let’s call it October Test) at the end of October and
then give them the same test
again at the end of November. Assume the results are as shown
in columns 2 and 3 of Table
2.1 (“October Test Results” and “Hypothetical November Test
Results”). We can see immedi-
ately that the test yields consistent, stable scores and is
therefore highly reliable. Students who
scored high in October continue to score high in November—as
we would expect given our
assumption that language proficiency should not change
dramatically in one month.
Table 2.1 Test-retest reliability
Student  October Test Results  Hypothetical November Test Results  Alternate November Test Results
A 72 75 92
B 84 83 55
C 56 57 80
D 79 82 72
E 55 57 78
F 84 79 48
G 91 88 66
Suppose, however, the results were as shown in columns 2 and 4
(“October Test Results” and
“Alternate November Results”). Unless we have some other
logical explanation, we would
have legitimate questions about the reliability of this language
proficiency test. Now some of
the students who scored high in October do very poorly in
November; and others who did
poorly in October do astonishingly well in November.
Statistically, the reliability of this test would be obtained by
looking at the correlation between
scores obtained on the test and those obtained on the retest (see
Chapter 9 for an explana-
tion of correlation). The first chart in Figure 2.9 shows how the
hypothetical November results
closely parallel the October results. In fact, there is a high
positive correlation (+.98) between
these results. The second chart in Figure 2.9 shows how the
alternate November results do
not parallel October results. In fact, the correlation between the
two is negative (–.63).
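The same computation can be reproduced directly from the Table 2.1 scores. The short Python sketch below uses those numbers and recovers approximately the coefficients reported above.

import numpy as np

# Scores from Table 2.1 for students A through G.
october = np.array([72, 84, 56, 79, 55, 84, 91])
november_hypothetical = np.array([75, 83, 57, 82, 57, 79, 88])
november_alternate = np.array([92, 55, 80, 72, 78, 48, 66])

# Test-retest reliability is the correlation between the two administrations.
r_stable = np.corrcoef(october, november_hypothetical)[0, 1]   # approximately +.98
r_unstable = np.corrcoef(october, november_alternate)[0, 1]    # approximately -.63

print(f"test-retest reliability (hypothetical results): {r_stable:+.2f}")
print(f"test-retest reliability (alternate results):    {r_unstable:+.2f}")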
Figure 2.9: Test-retest reliability
If a test is reliable, it should yield similar scores when given to the same students at different times. Chart 1 (based on Table 2.1) shows high reliability (correlation +.98); Chart 2 illustrates low reliability (correlation –.63).
[Chart 1 plots students A–G's October scores alongside the hypothetical November results; Chart 2 plots the same October scores alongside the alternate November results.]
Parallel-Forms Reliability
Test-retest measures of reliability look at the correlation
between results obtained by giving
the same test twice to the same individuals. But in some cases,
it isn’t possible or convenient
to administer the same test twice. If the test is very simple, or if
it contains striking and highly
memorable questions (or answers), some learners may improve
dramatically from one test-
ing to the next. Or, if the teacher goes over the test and
discusses possible responses, some
students might learn enough to improve, and others might not.
A second approach to estimating test reliability gets around this
problem by administering a
different form of the test the second time. The different form of
the test is designed to be
highly similar to the first and is expected to yield similar
scores. It is therefore labeled a paral-
lel form. The correlation between these parallel forms of the
same test yields a measure of
parallel-forms reliability (also termed alternate-form
reliability). Figure 2.10 plots the scores
obtained by seven students on parallel forms of a test. Note how
the results follow each other.
That is, a student who scores high on form A of the test is also
likely to score high on form B.
In this case, the correlation between the two forms is .86.
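Computationally, parallel-forms reliability is again just a correlation, this time between the two forms. A few lines of Python applied to the Figure 2.10 scores illustrate the point.

import numpy as np

# Scores on the two parallel forms from Figure 2.10 (students A through G).
form_a = np.array([62, 74, 66, 34, 79, 23, 91])
form_b = np.array([60, 83, 57, 55, 86, 44, 79])

# Parallel-forms reliability is the correlation between the two forms.
r = np.corrcoef(form_a, form_b)[0, 1]
print(f"parallel-forms reliability: {r:.2f}")   # approximately .86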
Split-Half Reliability
Teachers seldom go to the trouble of making up two forms of
the same test and establishing
that they are equivalent. Fortunately, there is another clever
way of calculating test reliability.
The reasoning goes something like this: If I prepare a
comprehensive test made up of a large
number of items, then many of the items on this test will
overlap in what they assess. It is
therefore reasonable to assume that if I were to split the test in
two and administer each half
to my students, their scores on the two halves would be highly
similar. But I don’t really need
to split the test: All I need to do is give the entire test to all
students and then score the test
as though I had actually split it.
Figure 2.10: Parallel-forms reliability
The relationship between scores on two parallel forms of the same test given to the same group is an indication of how dependably and consistently (reliably) the test measures.

Student  Test A Results  Test B Results
A  62  60
B  74  83
C  66  57
D  34  55
E  79  86
F  23  44
G  91  79
Suppose, for example, that my original test consisted of 100
multiple-choice items, carefully
constructed to tap all my instructional objectives. When I score
the test, I might consider the
50 even-numbered items as one test, and the other 50 as a
separate test. I can now easily
generate two scores for each student, one for each half of my
split test. And when I calculate
the correlation between these two halves of the test, I will have
determined what is called
split-half reliability. Figure 2.11 illustrates split-half reliability
based on a 90-item test split
into two 45-item halves. Note that the longer the test, the more
accurate the measure of reli-
ability. Figure 2.12 summarizes the various ways of assessing
test reliability.
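A minimal Python sketch of the odd–even split appears below. The item-level responses are simulated (each student is assigned a different invented "ability"), and the final Spearman-Brown step is a standard psychometric adjustment that the chapter itself does not cover.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical item-level results for 11 students on a 90-item test (the Figure 2.11
# example splits 90 items into two 45-item halves). Stronger students answer more
# items correctly, so there are real individual differences to detect.
ability = np.linspace(0.35, 0.90, 11)                   # per-student probability of success
responses = rng.random((11, 90)) < ability[:, None]     # True/False item matrix

odd_half = responses[:, 0::2].sum(axis=1)    # score on items 1, 3, 5, ...
even_half = responses[:, 1::2].sum(axis=1)   # score on items 2, 4, 6, ...

r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown adjustment (standard practice, not discussed in the chapter):
# estimates full-length reliability from the half-test correlation.
r_full = 2 * r_half / (1 + r_half)

print(f"split-half correlation: {r_half:.2f}")
print(f"estimated full-test reliability: {r_full:.2f}")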
Figure 2.11: Split-half reliability
A single test scored as though it were two separate tests provides information for judging its internal consistency (reliability). In this case, the correlation between the two test halves is .80.

Student  Odd items  Even items
A  42  45
B  33  35
C  23  40
D  34  34
E  36  42
F  47  48
G  34  28
H  44  41
I  22  25
J  18  16
K  33  37

Figure 2.12: Measures of test reliability
Test reliability reflects the stability and consistency of a measure. It, along with fairness and validity, is an extremely important quality of educational assessments.
• Test-retest reliability: the correlation is between scores obtained on the same test given to the same students on two different occasions.
• Parallel-forms reliability: the correlation is between two forms of a test given to the same examinees.
• Split-half reliability: the correlation is between halves of a single test.
Factors That Affect Reliability
Say you require a unit-end assessment and ranking of your
students in a 12th-grade physics
course. But you have stupidly put off building your final exam
until the night before it is to be
administered. So you write out a single question. Then, having
just completed a measurement
course, you are clever enough to devise a list of detailed scoring
criteria. The question and
scoring criteria are shown in Table 2.2.
Table 2.2 Illustrative single-question 12th-grade physics exam

Question: Explain, in your own words, the details of vertical projectile motion.

Scoring Criteria                                               Points
Describes what is meant by motion in a gravitational field         10
Explains acceleration                                              10
Mentions zero velocity at zenith                                     5
Includes free-fall equation                                         10
Applies free-fall equation to hypothetical situation                20
Includes graph of vertical projectile motion                        10
Maximum Points                                                      65
Length of Test
If your physics unit covered only vertical projectile motion, and
if your instructional objectives
are well represented in your scoring criteria, your one-item
exam might be quite good. Under
these circumstances, it might actually measure what you intend
to measure (it would have
high validity). And, given careful application of your scoring
criteria, the results might be con-
sistent and stable (it would have reasonable reliability).
But if your unit also covered topics such as elastic and inelastic
collisions, relative velocity,
notions of frames of reference, and other related topics, your
single-item test would be about
as useful as a snowmobile in Los Angeles.
Although it might occasionally be possible to achieve an
acceptable level of validity and reli-
ability with a single item, in most cases it is not possible. Poor
reliability, of course, is especially
likely if your test consists of objective test items such as
multiple-choice questions, matching
problems, or true-false exercises. It is difficult to imagine that a
single multiple-choice item
could measure all your instructional objectives. In most cases,
the more items in your test, the
more valid and reliable it is likely to be.
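One common way to quantify the effect of test length on reliability is the Spearman-Brown prophecy formula. It is not presented in this chapter, but the short Python sketch below shows the general pattern, starting from a reliability of about .50 (roughly the figure cited later for typical teacher-made tests).

def projected_reliability(current_reliability: float, length_factor: float) -> float:
    """Spearman-Brown prophecy formula: estimated reliability if a test is lengthened
    (or shortened) by the given factor. A standard psychometric formula, offered here
    as background; it is not derived in the chapter."""
    r, k = current_reliability, length_factor
    return (k * r) / (1 + (k - 1) * r)

# Doubling, then quadrupling, a test whose current reliability is .50:
print(projected_reliability(0.50, 2))   # about 0.67
print(projected_reliability(0.50, 4))   # about 0.80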
Stability of Characteristics
The stability of what is being measured also affects the
reliability of a test. If what we are
measuring is unstable and unpredictable, our measures are also
likely to be inconsistent and
unpredictable. However, we assume that most of what we
measure in education will not
fluctuate unpredictably. We know, for example, that cognitive
strategies develop over time
and that knowledge increases. Tests that are both valid and
reliable are expected to reflect
these changes. These are predictable changes that don’t reduce
the reliability of our measur-
ing instruments.
The Effects of Chance
Another factor that can affect the reli-
ability of a test is chance, especially with
respect to objective, teacher-made tests.
We know, for example, that the chance of
getting a true-false item correct, all other
things being equal, is 50–50. If you give a
60-item, true-false, graduate-level plasma
physics test to a large group of intelligent
fourth-graders, they can be expected to
answer an average of around 30 items
correctly by chance—unless Lady Luck is
looking pointedly in the other direction.
And a few of the luckier individuals in this
class may have astonishingly high scores.
But a later administration of this test
might lead to startlingly different scores,
resulting in an extraordinarily low measure
of test-retest reliability.
One way to reduce the effects of chance
is to make tests longer or to use a larger
number of short tests. The important point is that teachers
should not base any important
decision on only one or two measures.
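A quick simulation makes the point. In the hypothetical sketch below, a class guesses blindly on a 60-item true-false test: scores cluster near 30, a few guessers look surprisingly capable, and a second sitting correlates with the first at close to zero.

import numpy as np

rng = np.random.default_rng(42)

# 30 students guess blindly on a 60-item true-false test (each item: a 50-50 chance).
scores = rng.binomial(n=60, p=0.5, size=30)

print("mean score from pure guessing:", scores.mean())   # close to 30
print("highest lucky score:", scores.max())              # a few students look deceptively good

# A second sitting produces a fresh set of guesses, essentially unrelated to the first,
# so the test-retest correlation hovers near zero.
retest = rng.binomial(n=60, p=0.5, size=30)
print("test-retest correlation:", round(np.corrcoef(scores, retest)[0, 1], 2))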
Item Difficulty
Test reliability is also affected by the difficulty of items. Tests
that are made up of excessively
easy or impossibly difficult items will almost invariably have
lower measured reliability scores
than tests composed of items of moderate difficulty. Other
things being equal, very easy and
very difficult items tend to result in less consistent patterns of
responding.
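Item difficulty is conventionally summarized as the proportion of examinees who answer an item correctly. The brief sketch below, using an invented response matrix, flags items so easy or so hard that they contribute little to consistent measurement.

import numpy as np

# Hypothetical 0/1 results for 8 students on 5 items (rows = students, columns = items).
results = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 0, 0, 0],
])

# Item difficulty: the proportion of examinees answering each item correctly.
difficulty = results.mean(axis=0)
for i, p in enumerate(difficulty, start=1):
    note = "very easy" if p > 0.9 else "very hard" if p < 0.1 else "moderate"
    print(f"item {i}: p = {p:.2f} ({note})")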
Relationship Between Validity and Reliability
It’s important to realize that a test cannot be valid without also
being reliable. If what we
want to measure is a stable characteristic, and if the measures
we obtain are inconsistent
and unpredictable (hence, unreliable), then we clearly aren’t
measuring what we intend to
measure. Figure 2.13 summarizes the meanings of validity,
reliability, and fairness, and the
relationship among them.
On the other hand, a test can be highly reliable without being
valid. Consider the test and
scoring guide shown in Figure 2.14. This is an extremely
reliable test: Examinees invariably
answer all questions correctly and always obtain the same score.
But as a measure of intel-
ligence, it clearly lacks face, content, construct, and criterion-
related validity. It measures reli-
ably, but it does not measure what it is intended to measure.
How to Improve Test Reliability
Test reliability is not something that most teachers are likely to
calculate for their tests. For
several reasons, however, teachers should understand what
reliability is, how important it is
in educational assessment, and how it can be improved.
© I Love Images/Corbis
▲ If the validity, reliability, and fairness of the educational
assessment whose results this young man has just seen are sus-
pect, an injustice may have occurred.
When you need to select from among a number of different
standardized tests, it’s important
that you have, and understand, information about their
reliability. Similarly, when you are
making decisions about your students, you need to have some
knowledge of the reliability of
the assessments on which you base your decisions.
It might be useful to know that the internal consistency (split-
half reliability, for example) of
teacher-made tests is around .50. As we see in Chapter 9, which
deals with statistical mea-
sures, this is a modest index of reliability. The fact is that most
teacher-made tests have a
relatively high degree of measurement error.
Standardized tests, on the other hand, tend to have reliabilities
of around .90 (Frisbie, 1988).
As a result, the most important decisions that affect the lives of
students should be based on
carefully selected standardized tests—and on professional
opinion where necessary—rather
than on teacher-constructed tests, hunches, or intuitive
impressions.
Figure 2.15 summarizes a number of ways in which the
reliability and validity of educational
assessments can be increased. Table 2.3 is the American
Psychological Association’s Code of
Fair Testing Practices in Education (2004). Especially important
are suggested guidelines for
test users with respect to selecting tests, administering them,
and interpreting and reporting
test results.
Figure 2.13: Three essential qualities of educational assessment
The most subjective of these qualities, fairness, is often the one students think is most important.

Reliability: consistency; accuracy of measurement. Estimated by testing and retesting, parallel-forms tests, and split-half tests. Reliability is necessary for validity.

Validity: the extent to which a test measures what it is meant to measure. Estimated by face (appearance), content, construct, and criterion-related (predictive and concurrent) evidence.

Fairness: a subjective estimate influenced by the extent to which the material tested has been covered; all students have had an equal opportunity to learn; sufficient time is allowed for testing; there are safeguards against cheating; assessments are free of biases and stereotypes; misleading and trick questions have been avoided; accommodations are made for special needs; and grading is consistent.
Figure 2.14: Human intelligence scale
Not really an intelligence test. Simply illustrates that highly reliable (consistent) measures can be desperately invalid.

The 23rd-Century Human Intelligence Scale
Please answer each question as briefly as possible.
Name   Age   Address

Questions                                     Acceptable Answers
What is your name?                            Correct if matches above
What is your address?                         Correct if matches above
How old were you on your last birthday?       Correct if matches above
What is your mother's name?                   Any name, blank or "Don't know" accepted
What is your father's name?                   Any name, blank or "Don't know" accepted
Do you have a dog?                            "Yes" or "No" or "Don't know"
Do you have a cat?                            "Yes" or "No" or "Don't know"
Would you like a dog?                         "Yes" or "No" or "Don't know"
Would you like a cat?                         "Yes" or "No" or "Don't know"
What is your name?                            Should match first question

Scoring: 10 IQ points per answer. Maximum: 100
Figure 2.15: Improving reliability and validity
For important educational decisions to be fair, they must be based on the most valid and reliable assessments possible.

Suggestions for Improving Test Validity
• Use clear and easily understood tasks
• Sample from all skill and content areas
• Select items to reflect the importance of specific objectives
• Allow sufficient time for all students to complete the test
• Use blueprints to guide instruction and test construction
• Analyze items to determine how well they match learning targets
• Check to see if students who do well on your tests also do well in other comparable classes
• Use a variety of approaches to assessment

Suggestions for Improving Test Reliability
• Make tests longer
• Enlist the assistance of other raters when using performance assessments
• Develop moderately difficult rather than excessively easy or very difficult items
• Try to eliminate subjective influences in scoring
• Develop and use clear rubrics and checklists for scoring performance assessments
• Restrict distracting influences whenever possible
• Use a variety of different assessments
• Eliminate or reduce the possibility that chance might affect test outcomes
Table 2.3 The APA Code of Fair Testing Practices in
Education*
A. Developing and Selecting Appropriate Tests
Test Developers Test Users
Test developers should provide the information and
supporting evidence that test users need to select
appropriate tests.
Test users should select tests that meet the intended
purpose and that are appropriate for the intended test
takers.
A-1. Provide evidence of what the test measures, the
recommended uses, the intended test takers, and the
strengths and limitations of the test, including the level
of precision of the test scores.
A-1. Define the purpose for testing, the content and
skills to be tested, and the intended test takers. Select
and use the most appropriate test based on a thorough
review of available information.
A-2. Describe how the content and skills to be tested
were selected and how the tests were developed.
A-2. Review and select tests based on the appropriate-
ness of test content, skills tested, and content coverage
for the intended purpose of testing.
A-3. Communicate information about a test’s charac-
teristics at a level of detail appropriate to the intended
test users.
A-3. Review materials provided by test developers and
select tests for which clear, accurate, and complete
information is provided.
A-4. Provide guidance on the levels of skills,
knowledge, and training necessary for appropriate
review, selection, and administration of tests.
A-4. Select tests through a process that includes
persons with appropriate knowledge, skills, and
training.
A-5. Provide evidence that the technical quality,
including reliability and validity, of the test meets its
intended purposes.
A-5. Evaluate evidence of the technical quality of the
test provided by the test developer and any indepen-
dent reviewers.
A-6. Provide to qualified test users representative
samples of test questions or practice tests, directions,
answer sheets, manuals, and score reports.
A-6. Evaluate representative samples of test questions
or practice tests, directions, answer sheets, manuals,
and score reports before selecting a test.
A-7. Avoid potentially offensive content or language
when developing test questions and related materials.
A-7. Evaluate procedures and materials used by test
developers, as well as the resulting test, to ensure that
potentially offensive content or language is avoided.
A-8. Make appropriately modified forms of tests or
administration procedures available for test takers with
disabilities who need special accommodations.
A-8. Select tests with appropriately modified forms or
administration procedures for test takers with disabili-
ties who need special accommodations.
A-9. Obtain and provide evidence on the performance
of test takers of diverse subgroups, making significant
efforts to obtain sample sizes that are adequate for
subgroup analyses. Evaluate the evidence to ensure
that differences in performance are related to the skills
being assessed.
A-9. Evaluate the available evidence on the perfor-
mance of test takers of diverse subgroups. Determine
to the extent feasible which performance differences
may have been caused by factors unrelated to the skills
being assessed.
B. Administering and Scoring Tests
Test Developers Test Users
Test developers should explain how to administer and
score tests correctly and fairly.
Test users should administer and score tests correctly
and fairly.
*Copyright 2004 by the Joint Committee on Testing Practices.
This material may be reproduced in its entirety without fees or
permission,
provided that acknowledgment is made to the Joint Committee
on Testing Practices. Any exceptions to this, including requests
to excerpt or
paraphrase this document, must be presented in writing to
Director, Testing and Assessment, Science Directorate, APA.
This edition replaces
the first edition of the Code, which was published in 1988.
Source: From Code of Fair Testing Practices in Education.
(2004). Washington, DC: Joint Committee on Testing Practices.
(Mailing address:
Joint Committee on Testing Practices, Science Directorate,
American Psychological Association, 750 First Street, NE,
Washington, DC
20002-4242)
B-1. Provide clear descriptions of detailed procedures
for administering tests in a standardized manner.
B-1. Follow established procedures for administering
tests in a standardized manner.
B-2. Provide guidelines on reasonable procedures for
assessing persons with disabilities who need special
accommodations or those with diverse linguistic
backgrounds.
B-2. Provide and document appropriate procedures for
test takers with disabilities who need special accom-
modations or those with diverse linguistic backgrounds.
Some accommodations may be required by law or
regulation.
B-3. Provide information to test takers or test users on
test question formats and procedures for answering
test questions, including information on the use of any
needed materials and equipment.
B-3. Provide test takers with an opportunity to become
familiar with test question formats and any materials or
equipment that may be used during testing.
B-4. Establish and implement procedures to ensure the
security of testing materials during all phases of test
development, administration, scoring, and reporting.
B-4. Protect the security of test materials, including
respecting copyrights and eliminating opportunities for
test takers to obtain scores by fraudulent means.
B-5. Provide procedures, materials and guidelines for
scoring the tests, and for monitoring the accuracy of
the scoring process. If scoring the test is the responsi-
bility of the test developer, provide adequate training
for scorers.
B-5. If test scoring is the responsibility of the test user,
provide adequate training to scorers and ensure and
monitor the accuracy of the scoring process.
B-6. Correct errors that affect the interpretation of
the scores and communicate the corrected results
promptly.
B-6. Correct errors that affect the interpretation of
the scores and communicate the corrected results
promptly.
B-7. Develop and implement procedures for ensuring
the confidentiality of scores.
B-7. Develop and implement procedures for ensuring
the confidentiality of scores.
C. Reporting and Interpreting Test Results
Test Developers Test Users
Test developers should report test results accurately
and provide information to help test users interpret test
results correctly.
Test users should report and interpret test results accu-
rately and clearly.
C-1. Provide information to support recommended
interpretations of the results, including the nature of
the content, norms or comparison groups, and other
technical evidence. Advise test users of the benefits
and limitations of test results and their interpreta-
tion. Warn against assigning greater precision than is
warranted.
C-1. Interpret the meaning of the test results, taking
into account the nature of the content, norms or
comparison groups, other technical evidence, and
benefits and limitations of test results.
C-2. Provide guidance regarding the interpretations
of results for tests administered with modifications.
Inform test users of potential problems in interpreting
test results when tests or test administration proce-
dures are modified.
C-2. Interpret test results from modified test or test
administration procedures in view of the impact those
modifications may have had on test results.
C-3. Specify appropriate uses of test results and warn
test users of potential misuses.
C-3. Avoid using tests for purposes other than those
recommended by the test developer unless there is
evidence to support the intended use or interpretation.
C-4. When test developers set standards, provide
the rationale, procedures, and evidence for setting
performance standards or passing scores. Avoid using
stigmatizing labels.
C-4. Review the procedures for setting performance
standards or passing scores. Avoid using stigmatizing
labels.
C-5. Encourage test users to base decisions about test
takers on multiple sources of appropriate information,
not on a single test score.
C-5. Avoid using a single test score as the sole deter-
minant of decisions about test takers. Interpret test
scores in conjunction with other information about
individuals.
C-6. Provide information to enable test users to accu-
rately interpret and report test results for groups of
test takers, including information about who were and
who were not included in the different groups being
compared, and information about factors that might
influence the interpretation of results.
C-6. State the intended interpretation and use of test
results for groups of test takers. Avoid grouping test
results for purposes not specifically recommended
by the test developer unless evidence is obtained to
support the intended use. Report procedures that were
followed in determining who were and who were not
included in the groups being compared and describe
factors that might influence the interpretation of
results.
C-7. Provide test results in a timely fashion and in a
manner that is understood by the test taker.
C-7. Communicate test results in a timely fashion and in
a manner that is understood by the test taker.
C-8. Provide guidance to test users about how to
monitor the extent to which the test is fulfilling its
intended purposes.
C-8. Develop and implement procedures for moni-
toring test use, including consistency with the intended
purposes of the test.
D. Informing Test Takers
Under some circumstances, test developers have direct
communication with the test takers and/or control of the
tests, testing process, and test results. In other circumstances
the test users have these responsibilities.
Test developers or test users should inform test takers about the
nature of the test, test taker rights and responsi-
bilities, the appropriate use of scores, and procedures for
resolving challenges to scores.
D-1. Inform test takers in advance of the test administration
about the coverage of the test, the types of question
formats, the directions, and appropriate test-taking strategies.
Make such information available to all test takers.
D-2. When a test is optional, provide test takers or their
parents/guardians with information to help them judge
whether a test should be taken—including indications of any
consequences that may result from not taking the
test (e.g., not being eligible to compete for a particular
scholarship) —and whether there is an available alternative
to the test.
D-3. Provide test takers or their parents/guardians with
information about rights test takers may have to obtain
copies of tests and completed answer sheets, to retake tests, to
have tests rescored, or to have scores declared
invalid.
D-4. Provide test takers or their parents/guardians with
information about responsibilities test takers have, such as
being aware of the intended purpose and uses of the test,
performing at capacity, following directions, and not
disclosing test items or interfering with other test takers.
D-5. Inform test takers or their parents/guardians how long
scores will be kept on file and indicate to whom, under
what circumstances, and in what manner test scores and related
information will or will not be released. Protect
test scores from unauthorized release and access.
D-6. Describe procedures for investigating and resolving
circumstances that might result in canceling or with-
holding scores, such as failure to adhere to specified testing
procedures.
D-7. Describe procedures that test takers, parents/guardians,
and other interested parties may use to obtain more
information about the test, register complaints, and have
problems resolved.
Chapter 2 Themes and Questions
Section Summaries
2.1 Four Purposes of Educational Assessment Assessments can
be used to summarize
student achievement (summative); for selection and placement purposes (placement); as a
basis for diagnosing strengths, problems, and deficits
(diagnostic); or as a means of provid-
ing feedback to improve teaching and learning (formative).
Despite these different emphases,
the overriding purpose of all forms of educational assessment is
to help the learner reach
important goals. Planning for effective assessment requires
clarifying and communicating
instructional objectives and aligning instruction and assessment
with these goals. As much
as possible, assessment should be an integral part of the
instructional process, designed to
provide immediate feedback for both teachers and learners to
assist and improve minute-to-
minute decisions. Teachers should use a variety of approaches
to assessment.
2.2 Test Fairness Tests are fair when they treat all learners in an
evenhanded manner and
when they provide all learners with an equal opportunity to
learn. Tests are unfair when they
examine content that has neither been covered nor assigned; if
they deliberately or uninten-
tionally use misleading “trick” questions; when some learners
have not been given sufficient
opportunity to learn the material; if they don’t allow sufficient
time for all learners to finish;
when they fail to accommodate to the special needs of
individual learners; when they reflect
biases and stereotypes in test construction; if scoring is
influenced by stereotypes and teacher
expectations; when steps are not taken to guard against
cheating; and if they are graded
inconsistently.
2.3 Validity A test is valid to the extent that it measures what it
is intended to measure. Tests
are seldom fair if they are not also valid. Face validity is a
measure of the extent to which a
test looks as though it measures what it is meant to measure.
However, for some assessments
(such as assessments of responder honesty), it is sometimes
essential that an instrument not
appear to measure what it is intended to measure. Content
validity is determined by the extent
to which items on the test reflect course objectives, and it is
normally determined through a
careful analysis and selection of items. Construct validity
relates to how closely a test agrees
with its theoretical underpinnings. It is a measure of how
consistent it is with the thinking that
gave rise to it, and it is more relevant for psychological
assessments (e.g., personality or intel-
ligence tests) than for teacher-made tests. Criterion-related
validity has two aspects: One is
defined by agreement between test results and predictions based
on those results (predictive
validity); the other is evident in the extent to which the results
of a test agree with the results
of other tests that measure the same characteristics (concurrent validity).
2.4 Reliability Consistency and predictability are the hallmarks
of high reliability. Reliability
has to do with error of measurement; the greater the errors in
our assessments, the lower the
reliability. Reliability can be estimated by giving the same test
to the same participants on more
than one occasion. High reliability would be reflected in similar
scores for each individual at
different testings (test-retest reliability). Or it can be calculated
by giving different forms of a
test to the same subjects and comparing their results (parallel-
forms reliability). A third option
is to take a single test administered once, divide it into halves,
score each half, and compare
those scores. Providing the test is long enough, scores on each
half will be similar if it is reli-
able. Reliability is affected by the length of a test (generally
higher with longer tests or with
the frequent use of many shorter tests); by the stability of what
is being measured (unstable
characteristics tend to yield lower reliabilities); chance factors;
and item difficulty (moderately
difficult items usually lead to higher test reliability than very
easy or very difficult items).
Applied Questions
1. What are the four principal uses of educational assessment?
Research and defend the prop-
osition that the most important function of educational
assessment should be formative.
2. What are the main characteristics of good educational
assessment? Explain in your own
words why you think one of these characteristics is more
important than the other two in
educational measurement.
3. What makes a test unfair? List the things you think a teacher
can do to improve the fairness
of teacher-made tests.
4. Can a test be fair but invalid? Write a brief position paper
detailing why you think the
answer is yes, no, or maybe.
5. Can a test be valid but unreliable? Generate a test item to
illustrate and support your
answer for this question.
6. How can you determine the reliability of a test? Outline the
steps a teacher might take to
increase test reliability without having to go to the trouble of
actually calculating it.
7. How can you improve test validity and reliability? Make a
list of what you consider to be
guidelines for sound testing practices. Identify those that would
improve test validity and
reliability.
Key Terms
bias Personal prejudice (prejudgment) in favor of or against one person, thing, or idea rather than another, when such prejudgment is typically unfair.
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docxPoojaSen20
 

Recently uploaded (20)

Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
Class 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfClass 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdf
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docx
 

Purposes and Characteristics of Educational AssessmentF.docx

Diagnostic assessment in education is no different. It has three purposes:

1. Finding out what is wrong (uncovering which learning goals have not been met)
2. Developing hypotheses about why something is wrong (suggesting possible reasons why these goals have not been accomplished)
3. Suggesting ways of repairing what is wrong (devising interventions that might increase the likelihood that instructional objectives will be reached)

The examiners look for the causes of failure so they can suggest ways of fostering success. Like the physician and the mechanic who diagnose in order to repair, so, too, does the educator.

A fourth kind of assessment is placement assessment. It describes assessment undertaken before instruction. Its main purpose is to provide educators with information about student placement and about learning readiness. Placement assessment can also influence the choice of content and of instructional approaches (Figure 2.1).
Planning for Assessment

While we can distinguish between assessment used for placement, diagnosis, formation, or summation, these distinctions do not contradict the fact that all forms of assessment share the same goals. Simply put, the general aim of all assessment is to help learners learn. Accordingly, educational assessment and instruction are part of the same process; assessment is for learning rather than simply of learning.

It is worth noting that the assessment of learners is, in a sense, an assessment of the effectiveness of the teacher and of the soundness and appropriateness of instructional and assessment strategies. This doesn't invariably mean, of course, that when students do well on their tests they have the good fortune of having an astonishingly good teacher. Some students do remarkably well with embarrassingly inadequate teachers, and others perform poorly with the most gifted of instructors.

Still, if your teaching is aimed at admirable, clearly stated, and well-understood instructional objectives, and if most of your charges attain these targets, the assessments that tell you (and the world) that this is so also reflect positively on your teaching.
Figure 2.1: Four kinds of educational assessment based on their main purposes
Assessment can serve at least four distinct purposes in education. However, the same assessment procedures and instruments can be used for all these purposes. Further, there is often no clear distinction among these purposes. For example, diagnosis is often part of placement and can also have formative functions.
• Summative assessment: summarizes the extent to which instructional objectives have been met; provides a basis for grading; useful for further educational or career decisions.
• Diagnostic assessment: identifies instructional objectives that have not been met; identifies reasons why targets have not been met; suggests approaches to remediation.
• Formative assessment: designed to improve the teaching/learning process; provides feedback for teachers and learners to enhance learning and motivation; enhances learning, fosters self-regulated learning, and increases motivation.
• Placement assessment: assesses pre-existing knowledge and skills; provides information for making decisions about a learner's readiness; useful for placement and selection decisions.

As you plan your instruction and your assessment, the following guidelines can be highly useful.

Communicating Instructional Objectives

Know and communicate your instructional objectives. Not only do teachers need to understand what their learning objectives are, but students need to know what is expected of them—hence the importance of having clear instructional objectives and of communicating them to students. If students understand what the most important learning goals are, they are far more likely to reach them than if they are just feeling their way around in the dark.

For example, at the beginning of a unit on geography, Mrs. Wyatt tells her seventh graders that at the end of the unit, they will be able to create a map of their town to scale. To do this, Mrs. Wyatt explains, she will have to teach them some map-making skills and the math they will need to calculate the scaling for the map. She then begins a lesson on ratios.

Aligning Goals, Instruction, and Assessments

Match instruction, assessment, and grading to goals. In theory, this guideline might seem obvious; but in practice, it is not always highly apparent.
  • 11. tests she gave us asked questions like Name three different kinds of food that would have been common in Shakespeare’s time. (Among the correct answers were responses like cherry cordial, half-moon-shaped meat and potato pies called pasties, toasted hazelnuts, and stuffed game hens. We often left her classes very hungry.) But sadly, Sister Ste. Mélanie’s grading was based less on what we had learned in her classes than on the quality of our English. She had developed an elaborate system for sub- tracting points based on the number and severity of our grammatical and spelling errors. Our grades reflected the accuracy of our responses only when our grammar and spelling were impeccable. Her grading did not match her instruction; it exemplified poor educa- tional alignment. Educational alignment is the expression used to describe an approach to education that deliberately matches learning objectives with instruction and assessment. As Biggs and Tang (2011) describe it, alignment involves three key components: 1. A conscious attempt to provide learners with clearly specified goals 2. The deliberate use of instructional strategies and learning activities designed to foster achievement of instructional goals 3. The development and use of assessments that provide feedback to improve learning and to gauge the degree of alignment between goals, instruction, and
  • 12. assessment Good alignment happens when teachers think through their whole unit before beginning instruction. They identify the learning objectives, the evidence they will collect to document student learning (assessments), and the sequence in which they will have students access and interact with information before administering assessments. One elementary teacher did just that when she designed a unit on the watersheds. Her goal was the Virginia science standard: Science 4.8—The student will investigate and understand important VA natural resources (a) watershed and water resources. To organize her unit, she used a focusing question: “What happens to the water flowing down your street after a big rainstorm?” She wanted students to understand the overarching concept that every action has a con- sequence—in this case, that the flow of water affects areas downstream. Throughout the course of her unit, she had children actively engaged in a variety of meaningful tasks. The children discussed and debated the issues of pollution and their responsibilities to avoid pol- luting their water. They created tangible vocabulary tools to learn the vocabulary of the unit. They responded to academic prompts to explain key concepts. They built models of water- sheds. As the students performed these tasks, the teacher noted the level of performance of each child and documented individual knowledge, skill, and understanding. She used these instructional strategies as formative assessments as she provided feedback to each student.
  • 13. In addition to multiple quizzes throughout the unit, the students demonstrated their under- standing of important concepts by completing a performance- based task. Everything was aligned so that the teacher could infer that students truly understood the importance of Virginia’s watersheds and water resources. Four Purposes of Educational Assessment Chapter 2 Using Assessment to Improve Instruction Use assessment as an integral part of the teaching–learning process. Good formative assess- ment is designed to provide frequent and timely feedback that is of immediate assistance to learners and teachers. As we see in Chapter 5, this doesn’t mean that teachers need to make up specially designed and carefully constructed placement and formative tests to assess their learners’ readiness for instruction, gauge their strengths and weaknesses, and monitor their progress. The best formative assessment will often consist of brief, informal assessments, perhaps in the form of oral questions or written problems that provide immediate feedback and inform both teaching and learning. Formative feedback might then lead the teacher to modify instructional approaches and learners to adjust their learning activities and strategies. Using Different Approaches to Assessment Employ a variety of assessments, especially when important decisions depend on their out- comes. Test results are not always entirely valid (they don’t
  • 14. always measure what they are intended to measure) or reliable (they don’t always measure very accurately). The results of a single test might reflect temporary influences such as those related to fatigue, test anxiety, illness, situational distractions, current preoccupations, or other factors. Grades and decisions based on a variety of assessments are more likely to be fair and valid. For example, when someone is ready to demonstrate driving knowledge and skills, multiple assessments are given by the Department of Motor Vehicles (DMV). Drivers need to know the rules of the road, and they also need to know how to parallel park. So DMV assessments include both a written and a driving field test. Constructing Tests According to Blueprints A house construction blueprint describes in detail the components of a house—its dimen- sions, the number of rooms and their placement, the materials to be used for building it, the pitch of its roof, the depth of its basement, its profile from different directions. A skilled con- tractor can read a blueprint and almost see the completed house. In much the same way, a test blueprint describes in detail the nature of the items to be used in building the test. It includes information about the number of items it will contain, the con- tent areas they will tap, and the intellectual processes that will be assessed. A skilled educator can look at a test blueprint and almost see the completed test (Figure 2.2). Test Blueprints
  • 15. It’s important to keep in mind that tests are only one form of educational assessment. Assessment is a broad term referring to all the various methods that might be used to obtain information about different aspects of teaching and learning. The word test has a more spe- cific meaning: In education, it refers to specific instruments or procedures designed to mea- sure student achievement or progress, or various student characteristics. Educational tests are quite different from many of the other measuring instruments we use— instruments like rulers, tape measures, thermometers, and speedometers. These instruments measure directly and relatively exactly: We don’t often have reason to doubt them. Our psychological and educational tests aren’t like that: They measure indirectly and with varying accuracy. In effect, they measure a sample of behaviors. And from students’ behaviors (responses), we make inferences about qualities we can’t really measure directly at all. Thus, from a patient’s responses to questions like “What is the first word you think of when I say Four Purposes of Educational Assessment Chapter 2 mother?” the psychologist makes inferences about hidden motives and feelings—and perhaps eventually arrives at a diagnosis.
  • 16. In much the same way, the teacher makes inferences about what the learner knows—and perhaps inferences about the learner’s thought processes as well—from responses to a hand- ful of questions like this one: Which of the following is most likely to be correct? 1. Mr. Wilson will still be alive at the end of the story. 2. Mr. Wilson will be in jail at the end of the story. 3. Mr. Wilson will have died within the next 30 pages. 4. Mr. Wilson will not be mentioned again. Tests that are most likely to allow the teacher to make valid and useful inferences are those that actually tap the knowledge and skills that make up course objectives. And the best way of ensuring that this is the case is to use test construction blueprints that take these objectives into consideration (see Tables 4.3 and 4.4 for examples of test blueprints). Figure 2.2: Guidelines for assessment These guidelines are most useful when planning for assessment. Many other considerations have to be kept in mind when devising, administering, grading, and interpreting teacher-made tests. f02.02_EDU645.ai Some Guidelines
  • 17. for Assessment Use a variety of assessments Use assessment to improve instruction Align instruction, goals, and assessment Develop blueprints to construct tests Know and communicate learning targets Test Fairness Chapter 2 Guidelines for Constructing Test Blueprints A good test blueprint will contain most of the following: • A clear statement of the test content related directly to instructional objectives
  • 18. • The performance, affective, or cognitive skills to be tapped • An indication of the test format, describing the kinds of test items to be used or the nature of the performances required • A summary of how marks are to be allocated in relation to different aspects of the content • Some notion of the achievement levels expected of learners • An indication of how achievement levels will be graded • A review of the implications of different grades Regrettably, not all teachers use test blueprints. Instead, when a test is required, many find it less trouble to sit down and simply write a number of test items that seem to them a rea- sonable examination of what they have taught. And sadly, in too many cases what they have taught is aimed loosely at what are often implied and vague rather than specific instructional objectives. Using test blueprints has a number of important advantages and benefits. Among them is that they force the teacher to clarify learning objectives and to make decisions about the importance of different aspects of content. They also encourage teachers to become more aware of the learner’s cognitive processes and, by the same
  • 19. token, to pay more attention to the development of higher cognitive skills. At a more practical level, using test blueprints makes it easier for teachers to produce similar tests at different times, thus maintaining uniform standards and allowing for comparisons among different classes and different students. Also, good test blueprints serve as a useful guide for constructing test items and perhaps, in the long run, make the teacher’s work easier. Figure 2.3 summarizes some of the many benefits of using test blueprints. 2.2 Test Fairness Determining what the best assessment procedures and instruments are is no simple matter and is not without controversy. But although educators and parents don’t always agree about these matters, there is general agreement about the characteristics of good measuring instru- ments. Most important among these is that evaluative instruments be fair and that students see them as being fair. The most common student complaint about tests and testing practices has to do with their imagined or real lack of fairness (Bouville, 2008; Felder, 2002). The importance of test fairness was highlighted during the Vietnam War in the 1960s. President Kennedy’s decision to send troops to Vietnam led to the drafting of large numbers of age-eligible men, some of whom died or were seriously injured in Vietnam. But men who went to college were usually exempt from the draft—or their required military service was at
  • 20. least deferred. So, for many, it became crucial to be admitted to undergraduate or postgradu- ate studies. For some, passing college or graduate entrance exams was literally a matter of life and death. That the exams upon which admission decisions would be based should be as fair as possible seemed absolutely vital. Test Fairness Chapter 2 Just how fair are our educational assessments? We don’t always know. But science provides ways of defining and sometimes of actually measuring the characteristics of tests. It says, for example, that the best assessment instruments have three important qualities: 1. Fairness 2. Validity 3. Reliability As we saw, from the student’s point of view, the most important of these is the apparent— and real—fairness of the test. There are two ways of looking at test fairness, explains Bouville (2008): On the one hand, there is fairness of treatment; on the other, there is fairness of opportunity. Fairness of treat- ment issues include problems relating to not making accommodations for children with spe- cial needs, biases and stereotypes, the use of misleading “trick”
  • 21. questions, and inconsistent Figure 2.3: Advantages of test blueprints Making and using test blueprints presents a number of distinct benefits. And, although develop- ing blueprints can be time-consuming, contrary to what some think, it can make the teacher’s task easier rather than more difficult and complicated. f02.03_EDU645.ai Advantages of Devising and Using Test Blueprints Forces teacher to clarify learning targets Increases test validity and reliability Promotes the development of thinking rather than mainly remembering
  • 22. skills Encourages teachers to become more aware of learners’ cognitive activity Simplifies test construction Leads to more consistency among different tests, allowing more meaningful comparisons Promotes decisions about the relative importance of different aspects of
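A test blueprint is usually drafted as a simple table, but its logic can also be spelled out explicitly. The short Python sketch below is purely illustrative: the content areas, cognitive levels, and item counts are hypothetical and are not taken from any particular test. It shows how a blueprint fixes, before any items are written, how many items of each kind the test will contain and how marks will be allocated across the content.

```python
# A hypothetical blueprint for a short unit test.
# Rows are content areas; columns are the cognitive processes to be tapped;
# the numbers are planned item counts, decided before any items are written.
blueprint = {
    "Ratios and proportion": {"Remember": 3, "Apply": 5, "Analyze": 2},
    "Scale calculations":    {"Remember": 2, "Apply": 5, "Analyze": 3},
    "Map conventions":       {"Remember": 4, "Apply": 3, "Analyze": 1},
}

MARKS_PER_ITEM = 2  # assumed uniform weighting, for simplicity

def summarize(bp, marks_per_item):
    """Print items and marks per content area, plus totals by cognitive level."""
    level_totals = {}
    grand_total = 0
    for area, levels in bp.items():
        n = sum(levels.values())
        grand_total += n
        for level, count in levels.items():
            level_totals[level] = level_totals.get(level, 0) + count
        print(f"{area:<22} {n:>3} items, {n * marks_per_item:>3} marks")
    print(f"{'Total':<22} {grand_total:>3} items, {grand_total * marks_per_item:>3} marks")
    print("Items by cognitive level:", level_totals)

summarize(blueprint, MARKS_PER_ITEM)
```

Laying the plan out this way makes one of the advantages noted above concrete: it is immediately visible whether higher-level thinking is represented in the test or whether it has quietly collapsed into recall items.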
2.2 Test Fairness

Determining what the best assessment procedures and instruments are is no simple matter and is not without controversy. But although educators and parents don't always agree about these matters, there is general agreement about the characteristics of good measuring instruments. Most important among these is that evaluative instruments be fair and that students see them as being fair. The most common student complaint about tests and testing practices has to do with their imagined or real lack of fairness (Bouville, 2008; Felder, 2002).

The importance of test fairness was highlighted during the Vietnam War in the 1960s. President Kennedy's decision to send troops to Vietnam led to the drafting of large numbers of age-eligible men, some of whom died or were seriously injured in Vietnam. But men who went to college were usually exempt from the draft—or their required military service was at least deferred. So, for many, it became crucial to be admitted to undergraduate or postgraduate studies. For some, passing college or graduate entrance exams was literally a matter of life and death. That the exams upon which admission decisions would be based should be as fair as possible seemed absolutely vital.

Just how fair are our educational assessments? We don't always know. But science provides ways of defining and sometimes of actually measuring the characteristics of tests. It says, for example, that the best assessment instruments have three important qualities:

1. Fairness
2. Validity
3. Reliability

As we saw, from the student's point of view, the most important of these is the apparent—and real—fairness of the test.

There are two ways of looking at test fairness, explains Bouville (2008): On the one hand, there is fairness of treatment; on the other, there is fairness of opportunity. Fairness of treatment issues include problems relating to not making accommodations for children with special needs, biases and stereotypes, the use of misleading "trick" questions, and inconsistent grading. Fairness of opportunity problems include testing students on material not covered, not providing an opportunity to learn, not allowing sufficient time for the assessment, and not guarding against cheating. We look at each of these issues in the following sections (Figure 2.4).

Figure 2.4: Issues affecting test fairness
That a test is fair, and that it seems to be fair, is one of the most important characteristics of good assessment.
• Issues of fairness of opportunity: testing material not covered; not providing an opportunity to learn; not allowing sufficient time to complete the test; not guarding against cheating.
• Issues of fairness of treatment: not accommodating special needs; being influenced by biases and stereotypes; using misleading, trick questions; grading inconsistently.

Content Problems

Tests are—or at the very least, seem—highly unfair when they ask questions or pose problems about matters that have not been covered or assigned. This issue sometimes has to do with bad teaching; at other times, it simply relates to bad test construction. For example, in my second year in high school, we had a teacher who almost invariably peppered her quizzes and exams with questions about matters we had never heard about in class. "We didn't have time," she would protest when someone complained and pointed out that she had never mentioned rhombuses and trapezoids and quadrilaterals. "But it's important and it's in the book and it might be on the final exam," she would add. Had she simply told us that we were responsible for the content in Chapter 6, we would not have felt so unfairly treated. This example illustrates bad teaching as much as bad test construction.

In connection with content problems that affect test fairness, it is interesting to note that when test results are higher, students tend to perceive the test as being fairer. It's an intriguing observation that, it turns out, may have a grain of truth in it. As Oller (2012) points out, higher scores are evidence that there is agreement between test makers and the better students about the content that is most important. This agreement illustrates what we termed educational alignment: close correspondence among goals, instructional approaches, and assessments.

Conversely, exams that yield low scores for all students may reflect poor educational alignment: They indicate that what the teacher chose to test is not what even the better learners have learned. Hence there is good reason to believe that tests that yield higher average scores are, in fact, fairer than those on which most students do very poorly. And raising the marks, perhaps by scaling them so that they approximate a normal distribution with an acceptably high average, will do little to alter the apparent fairness of the test.
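The point about scaling can be made concrete with a small amount of arithmetic. The Python sketch below is only an illustration, and the raw scores in it are invented. It linearly rescales a set of marks so that they center on a more comfortable average; notice that every number changes, but the rank order of the students, and the mismatch between what was taught and what was tested, stay exactly as they were.

```python
from statistics import mean, stdev

def rescale(scores, target_mean=70.0, target_sd=10.0):
    """Linearly transform raw scores to a chosen mean and standard deviation."""
    m, s = mean(scores), stdev(scores)
    return [round(target_mean + target_sd * (x - m) / s, 1) for x in scores]

raw = [31, 38, 40, 44, 52, 55, 58, 63]  # hypothetical raw scores on a poorly aligned test
print("Raw:   ", raw)
print("Scaled:", rescale(raw))  # higher, friendlier numbers; identical rank order
```

A true normalization to a bell-shaped distribution would work from percentile ranks rather than a linear transform, but the conclusion is the same: rescaling changes the optics of the results, not the fairness of the test that produced them.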
Trick Questions

Trick questions illustrate problems that have less to do with test content than with test construction—which, of course, doesn't mean that the test maker is always unaware that one or more questions might be considered trick questions. Trick questions are questions that mislead and deceive, regardless of whether the deception is intentional or is simply due to poor item construction. Trick questions do not test the intended learning targets, but rather a student's ability to navigate a deceptive test. Items that students are most likely to consider trick questions include:

1. Questions that are ambiguous (even when the ambiguity is accidental rather than deliberate). Questions are ambiguous when they have more than one possible interpretation. For example, "Did you see the man in your car?" might mean, "Did you see the man who is in your car?" or "Did you see the man when you were in your car?"

2. Multiple-choice items where two nearly identical alternatives seem correct. Or, as in the following example, where all alternatives are potentially correct:
The Spanish word fastidiar means:
A. annoy
B. damage
C. disgust
D. harm

3. Items deliberately designed to catch students off their guard. For example, consider this item from a science test:
During a very strong north wind, a rooster lays an egg on a flat roof. On what side of the roof is the egg most likely to roll off?
A. North
B. South
C. East
D. West
E. No egg will roll off the roof
Students who aren't paying sufficient attention on this fast-paced, timed test might well say South. Seems reasonable. (But no; roosters don't lay eggs.)

4. Questions that use double negatives. For example: Is it true that people should never not eat everything they don't like?

5. Items in which some apparently trivial word turns out to be crucial. That is often the case for words such as always, never, all, and most, as in this item: True or False? Organic products are always better for you than those that are nonorganic.

6. Items that make a finer discrimination than expected. For example, say a teacher has described the speed of sound in dry air at 20 degrees centigrade as being right around 340 meters per second. Now she presents her students with this item:
What is the speed of sound in dry air at 20 degrees centigrade?
A. 300 meters per second
B. around 340 meters per second
C. 343.2 meters per second
D. 343.8 meters per second
  • 29. C. 5,618 D. 6,148 All the information required to answer this item correctly (A) is included in the first sen- tence. Everything after that sentence is irrelevant and, for that reason, misleading. Opportunity to Learn Tests are patently unfair when they sample concepts, skills, or cognitive processes that stu- dents have not had an opportunity to acquire. Lack of opportunity to learn might reflect an instructional problem. For example, it might result from not being exposed to the material either in class or through instructional resources. It might also result from not having sufficient time to learn. Bloom (1976), for example, believed that there are faster and slower learners (not gifted learners and those less gifted), and that all, given sufficient time, can master what schools offer. If Bloom is mostly correct, the results of many of our tests indicate that we sim- ply don’t allow some of our learners suf- ficient time for what we ask of them. Bloom’s mastery learning system offers one solution. Mastery learning describes an instructional approach in which course content is broken into small, sequential units and steps are taken to ensure that
  • 30. all learners eventually master instructional objectives (see Chapter 6 for a discussion of mastery learning). Another solution, suggests Beem (2010), is the expanded use of technology and of virtual reality instructional programs. These are instructional computer-based simula- tions designed to provide a sensation of realism. She argues that these, along with other digital technologies including computers and handheld devices, offer students an opportunity to learn at their own rate. Besides, digital technology might also reduce the negative influence of poorly qualified teachers—if there are any left. iStockphoto/Thinkstock ▲ Ambiguous questions, misleading items, items about mate- rial not covered or assigned, overly long tests—all of these contribute to the perceived unfairness of tests. Test Fairness Chapter 2 As Ferlazzo (2010) argues, great teaching is about giving students the opportunity to learn. Poor and unfair testing is about assessing the extent to which they have reached instruc- tional objectives they have never had an opportunity to reach. See Applications: Addressing Unfair Tests. Insufficient Time
  • 31. Closely related to the unfairness that results from not having an opportunity to learn is the injustice of a test that doesn’t give students an opportunity to demonstrate what they actu- ally have learned. For some learners, this is a common occurrence simply because they tend to respond more slowly than others and, as a result, often find that they don’t have enough time to complete the test. A P P L I C A T I O N S : Addressing Unfair Tests In spring 2013, students in New York City had their first experience with English Language Arts tests designed to tap the curriculum of the Common Core State Standards. After adopting these standards in 2010, New York hired a test-publishing company to design a test that would reflect the knowledge and skills within the Common Core Standards. After witnessing their students’ anguish following the initial testing experience, 21 principals were so outraged that they felt compelled to issue a formal protest through a letter to the State’s Commissioner of Education. In that letter, they highlighted how the English/Language Arts test was unfair. One of their major concerns was the lack of alignment between the types of questions asked and the critical thinking skills valued in the Common Core State Standards. The Common Core State Standards emphasize deep and rich analysis of fiction and nonfiction. But the ELA tests focused mostly on specific lines and words rather than on the
  • 32. wider standards. What was taught in the classrooms was not assessed on the test: The test failed to meet the criterion of fairness of opportunity. While alignment between what was tested and what is in the Standards is important, this was not the administrators’ only complaint. In reviewing the tests taken by the students, they concluded that the structure of the tests was not developmentally appropriate. For example, testing required 90-minute sessions on each of three consecutive days—a difficult undertaking for a mature stu- dent, let alone for a 10-year-old. Clearly, there is a violation here of the criterion of fairness of opportunity. Finally, the principals expressed concern that too much was riding on a flawed test developed by a company with a track record of errors. They feared that the tests might not be valid. Yet students’ promotion to the next grade, entry into middle and secondary school, and admission to special programs are often based on these tests. In addition, teachers and schools are evaluated in terms of how well their students perform, even though that is not an intended use of the tests. As a result, scores on these tests can affect the extent to which schools receive special funds or are put on improvement plans. These complications raise questions about the test’s validity for these purposes. Clearly, as the principals reflected on the new English/Language Arts test, they saw problems with both fairness of opportunity and validity.
  • 33. Test Fairness Chapter 2 Suppose that a 100-item test is designed to sample all the target skills and information that define a course of study. If a student has time to respond to only 80 of these 100 items, then only 80% of the instructional objectives have been tested. That test is probably unfair for that student. There is clearly a place for speeded testing, particularly with respect to standardized tests such as those that assess some of the abilities that contribute to intelligence. (We look at some of these tests in Chapter 10.) But as a rule of thumb, teacher- made tests should always be of such a length and difficulty level that most, if not all, students can easily complete them within the allotted time (van der Linden, 2011). Failure to Make Accommodations for Special Needs Timed tests can be especially unfair to some learners with special needs. For example, Gregg and Nelson (2012) reviewed a large number of studies that looked at performance on timed graduation tests—a form of high-stakes testing (so called because results on these tests can have important consequences relating to transition from high school, school funding, and even teaching and administrative careers). These researchers found that whereas students with learning disabilities would normally be expected to achieve at a lower than average level
  • 34. on these tests, when they are given the extra time they require, their test scores are often comparable to those of students without disabilities. Giving students with special needs extra time is the most common of possible accommoda- tions. It is also one of the most effective and fairest adjustments. Even for more gifted and talented learners, additional time may be important. Coskun (2011) reports a study where the number of valuable ideas produced in creative brainstorming groups was positively related to the amount of time allowed. Accommodations for Test Anxiety In addition to being given extra time for learning and assessment, many other accommoda- tions for learners with special needs are possible and often desirable. For example, steps can be taken to improve the test performance of learners with severe test anxiety. Geist (2010) suggests that one way of doing this is to reduce negative attitudes toward school subjects such as mathematics. As Putwain and Best (2011) showed, when elementary school students are led to fear a subject by being told that it will be difficult and that important decisions will be based on how well they do, their performance suffers. The lesson is clear: Teachers should not try to motivate their students by appealing to their fears. For severe cases of test anxiety, certain cognitive and behavioral therapies, in the hands of a skilled therapist, are sometimes highly effective (e.g., Brown et al., 2011). And even in less skilled hands, the use of simple relaxation techniques might be
  • 35. helpful (for example, Larson et al., 2010). It is worth keeping in mind, too, that test anxiety often results from inadequate instruction and learning. Not surprisingly, after Faber (2010) had exposed his “spelling-anxious” students to a systematic remedial spelling training program, their spelling performance increased and their test anxiety scores decreased. Test Fairness Chapter 2 Accommodations for Minority Languages Considerable research indicates that children whose first language is not the dominant school language are often at a measurable disadvantage in school. And this disadvantage can become very apparent if no accommodations are made in assessment instruments and procedures—as is sometimes the case for standardized tests given to children whose dominant language is not the test language (Sinharay, Dorans, & Liang, 2011). As Lakin and Lai (2012) note, there are some serious issues with the fairness and reliability of ability measures given to these children without special accommodations. As we saw in Chapter 1, accommodations in these cases are mandated by law (see In the Classroom: Culturally Unfair Assessments). Accommodations for Other Special Needs Teachers must be sensitive to, and they must make accommodations for, many other “special
  • 36. needs.” These might include medical problems, sensory disabilities such as vision and hearing problems, emotional exceptionalities, learning disabilities, and intellectual disabilities. They might also include cultural and ethnic differences among learners. Figure 2.5 describes some of the accommodations that fair assessments of students with spe- cial needs might require. I N T H E C L A S S R O O M : Culturally Unfair Assessments Joseph Born-With-One-Tooth knew all the legends his grandfather and the other elders told—even those he had heard only once. His favorites were the legend of the Warriors of the Rainbow, and the legend of Kuikuhâchâu, the man who took the form of the wolverine. These legends are long, complicated stories, but Joseph never forgot a single detail, never confused one with the other. The elders named him ôhô, which is the world for owl, the wise one. They knew that Joseph was extraordinarily gifted. But in school, it seemed that Joseph was unremarkable. He read and wrote well, and he performed better than many. But no one even bothered to give him the tests that singled out those who were gifted and talented. Those who are talented and gifted are often identified through a combina- tion of methods, beginning with teacher nominations that then lead to further testing and perhaps interviews and auditions (Pfeiffer & Blei, 2008). Those who
  • 37. don’t do as well in school, sometimes because of cultural or language differences, tend to be overlooked. Joseph Born-With-One-Tooth is not alone. Aboriginal and other culturally different children are vastly underrepresented among the gifted and the talented (Baldwin & Reis, 2004). By the same token, they tend to be overrepresented among programs for those with learning disabilities and emotional disorders (Briggs, Reis, & Sullivan, 2008). There is surely a lesson here for those concerned with the fairness of assessments. Test Fairness Chapter 2 Biases and Stereotypes Accommodations for language differences are not especially difficult. But overcoming the many biases and stereotypes that can affect the fairness of assessments often is. Biases are preconceived judgments usually in favor of or against some person, thing, or idea. For example, I might think that Early Girl tomatoes are better than Big Boys. That is a harmless bias. And like most biases, it is a personal tendency. But if we North Americans tend to believe that all Laplanders are such and such, and most Roma are this and that (such and such and this and that of course being negative), then we hold some stereotypes that are potentially
  • 38. highly detrimental. Closer to home, historically there have been gender stereotypes about male–female differ- ences whose consequences can be unfair to both genders. Some of these stereotypes are based on long-held beliefs rooted in culture and tradition and propagated through centuries of recorded “expert” opinion. And some are based on various controversial and often con- tested findings of science. It’s clear that males and females have some biologically linked sex differences, mainly in physi- cal skills requiring strength, speed, and stamina. But it’s not quite so clear whether we also have important, gender-linked psychological differences. Still, early research on male–female differences (Maccoby & Jacklin, 1974) reported significant differences in four areas: verbal abil- ity, favoring females; mathematical ability, favoring males; spatial–visual ability (evident, for example, in navigation and orientation skills), favoring males; and aggression (higher in males). Figure 2.5: Fair assessment accommodations for children with special needs These are only a few of the many possible accommodations that might be required for fair assess- ment of children with special needs. Each child’s requirements might be different. Note, too, that some of these accommodations might increase the fairness of assessments for all children. f02.05_EDU645.ai
  • 39. Instructional Accommodations • teacher aides and other professional assistance • special classes and programs • individual education plans • special materials such as large print or audio devices • provisions for reducing test anxiety • increased time for learning Testing Accommodations • increased time for test completion • special equipment for test-taking • different form of test (for example, verbal rather than written) • giving test in different setting • testing in a different language Possible Accommodations for Fair Assessment of Students with Special Needs Test Fairness Chapter 2 Many of these differences are no longer as apparent now as they were in 1974. There is increasing evidence that when early experiences are similar, differences are minimal or non- existent (Strand, 2010). But the point is that experiences are not always similar, nor are opportunities and expecta- tions. In the results of many assessments, there are still gender differences. These often favor males in mathematics and females in language arts (e.g., De Lisle, Smith, Keller, & Jules, 2012). And there is evidence that the stereotypes many people still
  • 40. hold regarding, say, girls’ inferior- ity in mathematics might unfairly affect girls’ opportunities and their outcomes. In an intriguing study, Jones (2011) found that when women were shown a video supporting the belief that females perform more poorly than males in mathematics, subsequent tests revealed a clear gender difference in favor of males on a mathematics achievement test. But when they were shown a video indicating that women performed as well as men, no sex dif- ferences were later apparent. Inconsistent Grading Approaches to grading can vary enormously in different schools and even in different class- rooms within the same school. They might involve an enormous range of practices, including • Giving or deducting marks for good behavior • Giving or deducting marks for class participation • Giving or deducting marks for punctuality • Using well-defined rubrics for grading • Basing grades solely on test results • Giving zeros for missed assignments • Ignoring missed assignments • Using grades as a form of reward or punishment
  • 41. • Grading on any of a variety of letter, number, percentage, verbal descriptor, or other systems • Allowing students to disregard their lowest grade • And on and on . . . No matter what practices are used in a given school, for assessments to be fair, grades need to be arrived at in a predictable and transparent manner. Moreover, the rules and practices that underlie their calculation need to be consistent. This approach is also critical for describing what students know and are able to do. If a math grade is polluted with behavioral objec- tives such as participation, how will the student and parents know what the student’s math skills are? Inconsistent grading practices are sometimes evident in disparities within schools, where dif- ferent teachers grade their students using very different rules. In one class, for example, stu- dents might be assured of receiving relatively high grades if they dutifully complete and hand in all their assignments as required. But in another class, grades might depend entirely on test results. And in yet another, grades might be strongly influenced by class participation or by spelling and grammar. Test Fairness Chapter 2
  • 42. Inconsistent grading within a class can also present serious problems of fairness for students. A social studies teacher should not ignore grammatical and spelling errors on a short-answer test one week and deduct handfuls of marks for the same sorts of errors the following week. Simply put, the criteria that govern grading should be clearly understood by both the teacher and students, and those criteria should be followed consistently. Cheating and Test Fairness Most of us, even the cheaters among us, believe that cheating is immoral. Sometimes it is even illegal—such as when you do it on your income tax return. And clearly, cheating is unfair. First, if cheating results in a higher than warranted grade, then it does not represent the stu- dent’s progress or accomplishments—which hardly seems fair. Second, those who cheat, by that very act, cheat other students. I once took a statistics course where, in the middle of a dark night, a fellow student sneaked up the brick wall of the education building, jimmied open the window to Professor Clark’s office, and copied the midterm exam we were about to take. He then wrote out what he thought were all the cor- rect answers and sold copies to a bunch of his classmates. I didn’t buy. No money, actually. And I didn’t do nearly as well on the test as I expected. I thought I had answered most of the questions correctly; but, this being a statistics course, the
  • 43. raw scores (original scores) were scaled so that the class average would be neither distress- ingly low nor alarmingly high. The deception was soon uncovered. Some unnamed person later informed Professor Clark who, after reexamining the test papers, discovered that 10 of his 35 students had nearly iden- tical marks. More telling was that on one item, all 10 of these students made the same, highly unlikely, computational error. Cheating is not uncommon in schools, especially in higher grades and in postsecondary pro- grams where the stakes are so much higher. In addition, today there are far more oppor- tunities for cheating than there were in the days of our grandparents. Wireless electronic communication; instant transmission of photos, videos, and messages; and wide-scale access to Internet resources have seen to that. High-Stakes Tests and Cheating There is evidence, too, that high-stakes testing may be contributing to increased cheating, especially when the consequences of doing well or poorly can dramatically affect entire school systems. For example, state investigators in Georgia found that 178 administrators and teach- ers in 44 Atlanta schools who had early access to standardized tests systematically cheated to improve student scores (Schachter, 2011). Some school systems cheat on high-stakes tests by excluding certain students who are not expected to do well; others cheat by not adhering to guidelines
  • 44. for administering the tests, perhaps by giving students more time or even by giving them hints and answers (Ehren & Swanborn, 2012). A more subtle form of administrative and teacher cheating on high-stakes tests takes the form of “narrowing” the curriculum. In effect, instructional objectives are narrowed to topics covered by the tests, and instruction is focused specifically on those targets to the exclusion Test Fairness Chapter 2 of all others. This practice, notes Berliner (2011), is a rational—meaning “reasonable or intel- ligent”—reaction to high-stakes testing. With the proliferation of online courses and online universities, the potential for electronic cheating has also increased dramatically (Young, 2012). For example, online tests can be taken by the student, the student’s friend, or even some paid expert, with little fear of detection. Preventing Cheating Among the various suggestions for preventing or reducing cheating on exams are the following: • Encourage students to value honesty. • Be aware of school policy regarding the consequences of cheating, and communicate them to students.
• Clarify for students exactly what cheating is.
• When possible, use more than one form of an exam so that no two adjacent students have the same form.
• Stagger seats so that seeing other students' work is unlikely.
• Randomize and assign seating for exams.
• Guard the security of exams and answer sheets.
• Monitor exams carefully.
• Prohibit talking or other forms of communication during exams.
Of course, none of these tactics, or even all of them taken together, is likely to guarantee that none of your students cheat. In fact, one large-scale study found that 21% of 40,000 undergraduate students surveyed had cheated on tests, and an astonishing 51% had cheated at least once on their written work (McCabe, 2005; Figure 2.6). Sadly, the fact that cheating is prevalent does not justify it. Nor does it do anything to increase the fairness of our testing practices.

Figure 2.6: Cheating among college undergraduates
Percentage of undergraduate students who admitted having cheated at least once.
[Figure 2.6 is a bar chart showing the percent of 40,000 students who admitted cheating, reported separately for written work and for tests.]
Source: Based on McCabe, D. (2005). It takes a village: Academic dishonesty. Liberal Education. Retrieved September 2, 2012, from http://www.middlebury.edu/media/view/257515/original/It_takes_a_village.pdf

Figure 2.7 summarizes the main characteristics of fair assessment practices. Related to this, Table 2.3 presents the American Psychological Association (APA) Code of Fair Testing Practices in Education. Because of its importance, the code is reprinted in its entirety at the end of this chapter.

2.3 Validity

In addition to the characteristics of fair assessment practices listed in Figure 2.7 and Table 2.3, the fairness of a test or assessment system depends on the reliability of the test instruments and the validity of the inferences made from the test results.
  • 47. Simply put, a test is valid if it measures what it is intended to measure. For example, a high schooler’s ACT scores should not be used to decide if a student should have a driver’s license. The test is designed to predict college performance rather than readiness to drive. From a measurement point of view, validity is the most important characteristic of a measuring instrument. If a test does not measure what it is intended to measure, the scores derived from it are of no value whatsoever, no matter how consistent and predictable they are. Test validity has to do not only with what the test measures, but also with how the test results are used. It relates to the inferences we base on test results and the consequences that fol- low. In effect, interpreting test scores amounts to making an inference about some quality or characteristic of the test taker. For example, based on Nora’s brilliant performance on a mathematics test, her teacher infers that Nora has commendable mathematical skills and understanding. And one consequence of this inference might be that Nora is invited to join the lunchtime mathematics enrichment group. But note that the inference and the consequence are appropriate and defensible only if the test on which Nora performed so admirably actually measures relevant mathematical skills and understanding. The important point is that in educational assessment, validity is closely related to the way
test results are used. Accordingly, a test may be valid for some purposes but totally invalid for others.

Figure 2.7: Fair assessment practices
Assessments are not always fair for all learners. But their fairness can be improved by paying attention to some simple guidelines.
The Fairest Assessment Practices
• Cover material that every student has had an opportunity to learn.
• Reflect learning targets for that course.
• Allow sufficient time for students to finish the test.
• Discourage cheating.
• Provide accommodations for learners with special needs.
• Ensure that tests are free of biases and stereotypes.
• Avoid misleading questions.
• Follow consistent and clearly understood grading practices.
• Base important decisions on a variety of different assessments.
• Take steps to ensure the validity and reliability of assessments.

Face Validity
  • 49. How can you determine whether a test is valid? Put another way, how do you know a test measures what it says it measures? Or what it is intended to measure? There are a number of ways of answering these questions. One of the most obvious is to look at the items that make up the test. Does the mathematics test look like it measures mathemat- ics? Does the grammar test appear to be a grammar test? Answers to these sorts of questions determine the face validity of the test. Basically, face validity is the extent to which the test appears to measure what it is supposed to measure. If the mathematics test consists of appropriate mathematical problems, it has face validity. Face validity is especially important for teacher-made tests. Just by looking at a test, students should immediately know that they are being tested on the right things. A mathematics test that has face validity will not ask a series of questions based on Shakespeare’s Julius Caesar. Occasionally, however, test makers are careful to avoid any hint of face validity. For example, if you wanted to construct a test designed to measure a personality characteristic such as honesty, you probably wouldn’t want your test participants to know what is being measured. If your instrument had face validity—that is, if it looked like it was measuring honesty—the scoundrels who take it might actually lie and act as if they are honest when they really aren’t. Better to deceive them, lie to them, pretend you are testing
  • 50. motivational qualities or character strength, so you can determine what liars and rogues they really are. Content Validity Of course, a test must not only look as though it measures what it is intended to measure, but should actually do so. That is, its content should reflect the instructional objectives it is designed to assess. This indicator of validity, termed content validity, is assessed by analyz- ing the content of test items in relation to the objectives of the course, unit, or lesson. Determining Content Validity Content validity is one of the most important kinds of validity for measurements of school achievement. A test with high content validity includes items that sample all important course objectives in proportion to their importance. Thus, if some of the objectives of an instruc- tional sequence have to do with the development of cognitive processes, a relevant test will have content validity to the extent that it samples these processes. And if 40% of the course content (and, consequently, of the course objectives) deals with knowledge (rather than with comprehension, analysis, and so on), 40% of the test items should assess knowledge. Determining the content validity of a test is largely a matter of careful, logical analysis of the items it comprises. Basically, the test needs to include a sample of items that tap the knowl- edge and skills that define course objectives.
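To make the idea of proportional sampling concrete, here is a minimal sketch in Python (not from the text; the objectives, weights, and item counts are all hypothetical) of how a test maker might tally a draft test against the share of the course each objective represents:

```python
from collections import Counter

# Hypothetical course objectives and the share of instruction devoted to each.
objective_weights = {"knowledge": 0.40, "comprehension": 0.35, "application": 0.25}

# Hypothetical draft test: the objective each item is meant to assess.
draft_items = ["knowledge"] * 22 + ["comprehension"] * 10 + ["application"] * 8

counts = Counter(draft_items)
total = len(draft_items)

print(f"{'Objective':<15}{'Target':>10}{'Actual':>10}")
for objective, weight in objective_weights.items():
    actual = counts[objective] / total
    # Flag any objective whose share of items drifts more than 5 points from its weight.
    flag = "" if abs(actual - weight) <= 0.05 else "  <- revise item mix"
    print(f"{objective:<15}{weight:>10.0%}{actual:>10.0%}{flag}")
```

A tally like this is essentially a one-row check of content coverage; the same bookkeeping extends naturally to the fuller blueprint described next.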
Increasing Content Validity
As Wilson, Pan, and Schumsky (2012) explain, the basic process the test maker should follow to ensure content validity involves the following steps:
1. Define the content (the instructional objectives).
2. Define the level of difficulty or abstraction for the items.
3. Develop a pool of representative items.
4. Determine what ratio of different items best represents the instructional objectives.
5. Develop a test blueprint.
One of the main advantages of preparing a test blueprint (also referred to as a table of specifications) is that it ensures a relatively high degree of content validity (providing, of course, that the test maker follows the blueprint).
It's important to realize that tests and test items do not possess validity as a sort of intrinsic quality; a test is not generally valid or generally invalid in and of itself. Rather, it is valid for certain purposes and with certain individuals, and it is invalid for others. For example, if the following item is intended to measure comprehension, it does not have content validity, because it measures only simple recall:
  • 52. How many different kinds of validity are discussed in this chapter? A. 1 B. 2 C. 3 D. 5 E. 10 If, on the other hand, the item were intended to measure knowledge of specifics, it would have content validity. And an item such as the follow- ing might have content validity with respect to measuring comprehension: Explain why face validity is important for teacher-constructed tests. Note, however, that this last item measures comprehension only if students have not been explicitly taught an appropriate answer. It is quite possible to teach principles, applications, analyses, and so on as specifics, so that ques- tions of this sort require no more than recall of knowledge. What an item measures is not inher- ent in the item itself so much as in the relation- ship between the material as the student has learned it and what the item requires. Construct Validity
A third type of validity, construct validity, is somewhat less relevant for teacher-constructed tests but highly relevant for many other psychological measures (e.g., personality and intelligence tests).
[Photo caption: One measure of validity is reflected in the extent to which the predictions we base on test results are borne out. If this boy does exceptionally well on this standardized battery of tests, will he also do well next year in fourth grade? In high school? In college?]
In essence, a construct is a hypothetical variable—an unobservable characteristic or quality, often inferred from theory. For example, a theory might argue that individuals who are highly intelligent should be reflective rather than impulsive—reflectivity being evident in the care and caution with which they solve problems or make decisions. Impulsivity would be apparent in a person's hastiness and in failure to consider all aspects of a situation. One way to determine the construct validity of a test designed to measure intelligence would then be to look at how well it correlates with measures of reflection and impulsivity (see Chapter 9 for a discussion of correlation—a mathematical index of relationships).
Criterion-Related Validity
  • 54. If Harold does exceptionally well on all his 12th-grade year-end tests, his teachers might be justified in predicting that he will do well in college. Colleges that subsequently admit Harold into one of their programs because of his grade 12 marks are also making the same prediction. Predictive Validity At all levels, prediction is one of the main uses of summative (rather than formative) assess- ments. We assume that all students who do well on year-end fifth-grade achievement tests will do reasonably well in sixth grade. We also predict that those who perform poorly on these tests will not do well in sixth grade, and we might use this prediction as justification for having them undertake remedial work. The extent to which our predictions are accurate reflects criterion-related validity. One component of this form of validity, just described, is labeled predictive validity. Predictive validity is easily measured by looking at the relationship between actual performance on a test and subsequent performance. Thus, a college entrance examination designed to identify students whose chances of college success are high has predictive validity to the extent that its predictions are borne out. Concurrent Validity Concurrent validity, a second aspect of criterion-related validity, is the relationship between a given test and other measures of the same behaviors or characteristics. For example, as we
see in Chapter 10, the most accurate way to measure intelligence is to administer a time-consuming and expensive individual test. A second option is to administer a quick, inexpensive group test; a third, far less consistent approach is to have teachers informally assess intelligence based on what they know of their students' achievements and effort. Teachers' assessments are said to have concurrent validity to the extent that they are similar to the more formal measures. In the same way, a group or an individual test is said to have concurrent validity if it agrees well with measures obtained using a different and presumably valid test.
Figure 2.8 summarizes the various approaches to determining test validity.

2.4 Reliability

Reliability is what we want in our cars, our computers, our spouses, our dogs. We want our cars and our computers to start when we go through the appropriate motions, and we want them to function as they were designed to function. So, too, with dogs and spouses: Reliability is predictability and consistency. If you stepped on your bathroom scale five times in a row, you would expect it to display the same weight each time. Reliability in educational measurement is no different.
Basically, it has to do with consistency. Good measuring instruments must not only measure what they are intended to measure (they must have validity); they must also provide consistent, dependable, reliable measures. Reliability in testing has to do with the accuracy of our measurements. The more errors there are in our measurements, the less reliable will be our test results. A reliable intelligence test, for example, should yield similar results from one week to the next. Or even from one year to the next.
But the reliability of most of our educational and psychological measures is never perfect. If you give Roberta an intelligence test this week and another in two weeks, it is highly unlikely that her scores will be identical. No matter, we say, as long as the difference between the two scores is not too great. After all, many factors can account for this error of measurement.

Figure 2.8: Types of test validity
Validity is closely related to the ways that a test is used. If a test is not valid, it is also likely to be unreliable and unfair.
• Face: The test appears to measure what it says it measures.
• Content: The test samples behaviors that represent both the topics and the processes implicit in course objectives.
• Construct: The test taps hypothetical variables that underlie the property being tested.
• Criterion-related (predictive): Test scores are valuable predictors of future performance in related areas.
• Criterion-related (concurrent): Test scores are closely related to similar measures based on other presumably valid tests.

Say Roberta scored 123 the first week but only 102 the second. The difference between the two scores may be because Roberta had a headache at the time of the second testing. Perhaps she was distracted by personal problems or tired from a long trip or anxious about the test or confused by some new directions. In psychology and education, we tend to assume that the things we measure are relatively stable. But the emphasis should be on the word relatively because we know that much of what we measure is variable. So at least some of the error in our measurements is likely due to instability and change in what we measure. But if two similar measures of achievement in chemistry yield a score of 82% one week but only 53% the next week for the same student, then the test we are using may well have a reliability problem. How can we assess the reliability of our tests?
Test-Retest Reliability
If a test measures what it purports to (that is, if it is valid), and if what it measures does not fluctuate unpredictably, no matter how often it is given, the test
  • 59. should yield similar scores. If it doesn’t, it is not only unreliable but probably invalid as well. In fact, a test cannot be valid without being reliable. If it yields inconsistent scores for a stable characteristic, we can hardly insist that it is measuring what it is supposed to measure. That a test should yield similar scores from one testing to the next—unless, of course, the test is simple enough that the student learns and remembers appropriate responses—is the basis for one of the most common measures of reliability. Giving the same test two or more times and comparing the results obtained at each testing yields a measure of what is known as test-retest reliability (sometimes also called repeated- measures reliability or stability reliability). Say, for example, that I give a group of first-grade students a standardized language profi- ciency test (let’s call it October Test) at the end of October and then give them the same test again at the end of November. Assume the results are as shown in columns 2 and 3 of Table 2.1 (“October Test Results” and “Hypothetical November Test Results”). We can see immedi- ately that the test yields consistent, stable scores and is therefore highly reliable. Students who scored high in October continue to score high in November—as we would expect given our assumption that language proficiency should not change dramatically in one month. Table 2.1 Test-retest reliability
Student    October Test Results    Hypothetical November Test Results    Alternate November Test Results
A          72                      75                                    92
B          84                      83                                    55
C          56                      57                                    80
D          79                      82                                    72
E          55                      57                                    78
F          84                      79                                    48
G          91                      88                                    66

Suppose, however, the results were as shown in columns 2 and 4 (“October Test Results” and “Alternate November Results”). Unless we have some other logical explanation, we would have legitimate questions about the reliability of this language proficiency test. Now some of the students who scored high in October do very poorly in November; and others who did poorly in October do astonishingly well in November.
Statistically, the reliability of this test would be obtained by looking at the correlation between scores obtained on the test and those obtained on the retest (see Chapter 9 for an explanation of correlation). The first chart in Figure 2.9 shows how the hypothetical November results closely parallel the October results. In fact, there is a high positive correlation (+.98) between these results. The second chart in Figure 2.9 shows how the alternate November results do not parallel October results. In fact, the correlation between the two is negative (–.63).

Figure 2.9: Test-retest reliability
If a test is reliable, it should yield similar scores when given to the same students at different times. Chart 1 (based on Table 2.1) shows high reliability (correlation +.98); Chart 2 illustrates low reliability (correlation –.63).
[Chart 1 plots the October and hypothetical November scores for students A–G; Chart 2 plots the October and alternate November scores.]
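The correlation itself is easy to verify. The sketch below (Python; not part of the original text) applies a hand-rolled Pearson coefficient to the scores in Table 2.1 and reproduces, within rounding, the +.98 and -.63 values just reported:

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Scores for students A through G, taken from Table 2.1.
october      = [72, 84, 56, 79, 55, 84, 91]
november     = [75, 83, 57, 82, 57, 79, 88]   # hypothetical November results
november_alt = [92, 55, 80, 72, 78, 48, 66]   # alternate November results

print(f"October vs. hypothetical November: {pearson(october, november):+.2f}")
print(f"October vs. alternate November:    {pearson(october, november_alt):+.2f}")
# Expected output, within rounding: +0.98 and -0.63
```

The first coefficient is the kind of test-retest reliability a teacher would hope to see; the second would raise exactly the questions described above.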
Parallel-Forms Reliability
Test-retest measures of reliability look at the correlation between results obtained by giving the same test twice to the same individuals. But in some cases, it isn't possible or convenient to administer the same test twice. If the test is very simple, or if it contains striking and highly memorable questions (or answers), some learners may improve dramatically from one testing to the next. Or, if the teacher goes over the test and discusses possible responses, some students might learn enough to improve, and others might not.
A second approach to estimating test reliability gets around this problem by administering a different form of the test the second time. The different form of the test is designed to be highly similar to the first and is expected to yield similar scores. It is therefore labeled a parallel form. The correlation between these parallel forms of the same test yields a measure of parallel-forms reliability (also termed alternate-form reliability). Figure 2.10 plots the scores obtained by seven students on parallel forms of a test. Note how the results follow each other. That is, a student who scores high on form A of the test is also likely to score high on form B. In this case, the correlation between the two forms is .86.
Split-Half Reliability
Teachers seldom go to the trouble of making up two forms of the same test and establishing that they are equivalent. Fortunately, there is another clever way of calculating test reliability. The reasoning goes something like this: If I prepare a comprehensive test made up of a large number of items, then many of the items on this test will overlap in what they assess. It is therefore reasonable to assume that if I were to split the test in two and administer each half to my students, their scores on the two halves would be highly similar. But I don't really need to split the test: All I need to do is give the entire test to all students and then score the test as though I had actually split it.

Figure 2.10: Parallel-forms reliability
The relationship between scores on two parallel forms of the same test given to the same group is an indication of how dependably and consistently (reliably) the test measures.
[A chart plots each student's Test A results against the corresponding Test B results.]

Suppose, for example, that my original test consisted of 100 multiple-choice items, carefully constructed to tap all my instructional objectives. When I score the test, I might consider the 50 even-numbered items as one test, and the other 50 as a separate test. I can now easily generate two scores for each student, one for each half of my split test. And when I calculate the correlation between these two halves of the test, I will have determined what is called split-half reliability. Figure 2.11 illustrates split-half reliability based on a 90-item test split into two 45-item halves. Note that the longer the test, the more accurate the measure of reliability. Figure 2.12 summarizes the various ways of assessing test reliability.

Figure 2.11: Split-half reliability
A single test scored as though it were two separate tests provides information for judging its internal consistency (reliability). In this case, the correlation between the two test halves is .80.
[A chart plots each student's scores on the two 45-item halves for students A–K.]

Figure 2.12: Measures of test reliability
Test reliability reflects the stability and consistency of a measure. It, along with fairness and validity, is an extremely important quality of educational assessments.
• Test-retest reliability: correlation between scores obtained on the same test given to the same students on two different occasions.
• Parallel-forms reliability: correlation between two forms of a test given to the same examinees.
• Split-half reliability: correlation between halves of a single test.
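Here is a rough sketch, in Python, of the even/odd scoring idea; the small item-response matrix is invented for illustration. Each student's responses are split into two half-test scores, and the two sets of scores are correlated exactly as in the test-retest example above. The Spearman-Brown step at the end is a standard psychometric correction that the chapter does not cover; it adjusts the half-test correlation upward to estimate the reliability of the full-length test.

```python
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

# Hypothetical scored responses: rows are students, columns are items (1 = correct).
responses = [
    [1, 1, 0, 1, 1, 1, 0, 1, 1, 1],
    [1, 0, 0, 1, 0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 0, 1, 1, 1, 1, 0, 1],
]

# Score the odd-numbered and even-numbered items as if they were two separate tests.
odd_half  = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
even_half = [sum(row[1::2]) for row in responses]   # items 2, 4, 6, ...

half_correlation = pearson(odd_half, even_half)
print(f"Split-half (odd vs. even) correlation: {half_correlation:.2f}")

# The half-test correlation describes a test only half as long as the real one.
# The Spearman-Brown correction (standard psychometrics, not derived in this
# chapter) estimates the reliability of the full-length test:
full_length_estimate = (2 * half_correlation) / (1 + half_correlation)
print(f"Estimated full-test reliability:       {full_length_estimate:.2f}")
```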
Factors That Affect Reliability
Say you require a unit-end assessment and ranking of your students in a 12th-grade physics course. But you have stupidly put off building your final exam until the night before it is to be administered. So you write out a single question. Then, having just completed a measurement course, you are clever enough to devise a list of detailed scoring criteria. The question and scoring criteria are shown in Table 2.2.

Table 2.2 Illustrative single-question 12th-grade physics exam
Question: Explain, in your own words, the details of vertical projectile motion.
Scoring Criteria                                                Points
Describes what is meant by motion in a gravitational field        10
Explains acceleration                                             10
Mentions zero velocity at zenith                                   5
Includes free-fall equation                                       10
Applies free-fall equation to hypothetical situation              20
Includes graph of vertical projectile motion                      10
Maximum Points                                                    65

Length of Test
If your physics unit covered only vertical projectile motion, and if your instructional objectives are well represented in your scoring criteria, your one-item exam might be quite good. Under these circumstances, it might actually measure what you intend to measure (it would have high validity). And, given careful application of your scoring criteria, the results might be consistent and stable (it would have reasonable reliability). But if your unit also covered topics such as elastic and inelastic collisions, relative velocity, notions of frames of reference, and other related topics, your single-item test would be about as useful as a snowmobile in Los Angeles. Although it might occasionally be possible to achieve an acceptable level of validity and reliability with a single item, in most cases it is not possible. Poor
  • 75. reliability, of course, is especially likely if your test consists of objective test items such as multiple-choice questions, matching problems, or true-false exercises. It is difficult to imagine that a single multiple-choice item could measure all your instructional objectives. In most cases, the more items in your test, the more valid and reliable it is likely to be. Stability of Characteristics The stability of what is being measured also affects the reliability of a test. If what we are measuring is unstable and unpredictable, our measures are also likely to be inconsistent and unpredictable. However, we assume that most of what we measure in education will not fluctuate unpredictably. We know, for example, that cognitive strategies develop over time and that knowledge increases. Tests that are both valid and reliable are expected to reflect these changes. These are predictable changes that don’t reduce the reliability of our measur- ing instruments. Reliability Chapter 2 The Effects of Chance Another factor that can affect the reli- ability of a test is chance, especially with respect to objective, teacher-made tests. We know, for example, that the chance of getting a true-false item correct, all other things being equal, is 50–50. If you give a 60-item, true-false, graduate-level plasma
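The earlier point that longer tests tend to be more reliable can be made quantitative with the Spearman-Brown prophecy formula, the same standard adjustment used in the split-half sketch above (it is not derived in this chapter). Assuming a starting reliability of .50, which the chapter later notes is typical of short teacher-made tests, the sketch below shows the predicted effect of lengthening a test with comparable items:

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by `length_factor`,
    assuming the added items are comparable to the originals."""
    k = length_factor
    return (k * reliability) / (1 + (k - 1) * reliability)

current = 0.50  # assumed reliability of a short teacher-made test
for factor in (1, 2, 3, 4):
    predicted = spearman_brown(current, factor)
    print(f"{factor}x as many items -> predicted reliability {predicted:.2f}")
# 1x -> 0.50, 2x -> 0.67, 3x -> 0.75, 4x -> 0.80 (all else being equal)
```

The prediction holds only when the added items measure the same thing as well as the originals do, which is exactly why blueprint-guided item writing matters.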
  • 76. physics test to a large group of intelligent fourth-graders, they can be expected to answer an average of around 30 items correctly by chance—unless Lady Luck is looking pointedly in the other direction. And a few of the luckier individuals in this class may have astonishingly high scores. But a later administration of this test might lead to startlingly different scores, resulting in an extraordinarily low measure of test-retest reliability. One way to reduce the effects of chance is to make tests longer or to use a larger number of short tests. The important point is that teachers should not base any important decision on only one or two measures. Item Difficulty Test reliability is also affected by the difficulty of items. Tests that are made up of excessively easy or impossibly difficult items will almost invariably have lower measured reliability scores than tests composed of items of moderate difficulty. Other things being equal, very easy and very difficult items tend to result in less consistent patterns of responding. Relationship Between Validity and Reliability It’s important to realize that a test cannot be valid without also being reliable. If what we want to measure is a stable characteristic, and if the measures we obtain are inconsistent and unpredictable (hence, unreliable), then we clearly aren’t measuring what we intend to
measure. Figure 2.13 summarizes the meanings of validity, reliability, and fairness, and the relationship among them.
On the other hand, a test can be highly reliable without being valid. Consider the test and scoring guide shown in Figure 2.14. This is an extremely reliable test: Examinees invariably answer all questions correctly and always obtain the same score. But as a measure of intelligence, it clearly lacks face, content, construct, and criterion-related validity. It measures reliably, but it does not measure what it is intended to measure.
How to Improve Test Reliability
Test reliability is not something that most teachers are likely to calculate for their tests. For several reasons, however, teachers should understand what reliability is, how important it is in educational assessment, and how it can be improved.
[Photo caption: If the validity, reliability, and fairness of the educational assessment whose results this young man has just seen are suspect, an injustice may have occurred.]
  • 78. making decisions about your students, you need to have some knowledge of the reliability of the assessments on which you base your decisions. It might be useful to know that the internal consistency (split- half reliability, for example) of teacher-made tests is around .50. As we see in Chapter 9, which deals with statistical mea- sures, this is a modest index of reliability. The fact is that most teacher-made tests have a relatively high degree of measurement error. Standardized tests, on the other hand, tend to have reliabilities of around .90 (Frisbie, 1988). As a result, the most important decisions that affect the lives of students should be based on carefully selected standardized tests—and on professional opinion where necessary—rather than on teacher-constructed tests, hunches, or intuitive impressions. Figure 2.15 summarizes a number of ways in which the reliability and validity of educational assessments can be increased. Table 2.3 is the American Psychological Association’s Code of Fair Testing Practices in Education (2004). Especially important are suggested guidelines for test users with respect to selecting tests, administering them, and interpreting and reporting test results. Figure 2.13: Three essential qualities of educational assessment The most subjective of these qualities, fairness, is often the one students think is most important.
The three panels of the figure read:
Fairness: a subjective estimate influenced by the extent to which
• material tested has been covered
• all students have had an equal opportunity to learn
• sufficient time is allowed for testing
• there are safeguards against cheating
• assessments are free of biases and stereotypes
• misleading and trick questions have been avoided
• accommodations are made for special needs
• grading is consistent
Reliability: consistency; accuracy of measurement. Estimated by
• testing and retesting
• parallel-forms tests
• split-half tests
Validity: the extent to which a test measures what it is meant to measure. Estimated by
• face (appearance)
• content
• construct
• criterion-related (predictive and concurrent)
(An arrow in the figure notes that reliability is necessary for validity.)

Figure 2.14: Human intelligence scale
Not really an intelligence test. Simply illustrates that highly reliable (consistent) measures can be desperately invalid.
The 23rd-Century Human Intelligence Scale
Please answer each question as briefly as possible. Name ___ Age ___ Address ___
Scoring: 10 IQ points per answer. Maximum: 100.
Question                                      Acceptable Answer
What is your name?                            Correct if matches above
What is your address?                         Correct if matches above
How old were you on your last birthday?       Correct if matches above
What is your mother's name?                   Any name, blank, or "Don't know" accepted
What is your father's name?                   Any name, blank, or "Don't know" accepted
Do you have a dog?                            "Yes" or "No" or "Don't know"
Do you have a cat?                            "Yes" or "No" or "Don't know"
Would you like a dog?                         "Yes" or "No" or "Don't know"
Would you like a cat?                         "Yes" or "No" or "Don't know"
What is your name?                            Should match first question

Figure 2.15: Improving reliability and validity
For important educational decisions to be fair, they must be based on the most valid and reliable assessments possible.
Suggestions for Improving Test Validity
• Use clear and easily understood tasks
• Sample from all skill and content areas
• Select items to reflect importance of specific objectives
• Allow sufficient time for all students to complete the test
• Use blueprints to guide instruction and test construction
• Analyze items to determine how well they match learning targets
• Check to see if students who do well on your tests also do well in other comparable classes
• Use a variety of approaches to assessment
Suggestions for Improving Test Reliability
• Make tests longer
• Enlist the assistance of other raters when using performance assessments
• Develop moderately difficult rather than excessively easy or very difficult items
• Try to eliminate subjective influences in scoring
• Develop and use clear rubrics and checklists for scoring performance assessments
• Restrict distracting influences whenever possible
• Use a variety of different assessments
• Eliminate or reduce the possibility that chance might affect test outcomes

Table 2.3 The APA Code of Fair Testing Practices in Education*
A. Developing and Selecting Appropriate Tests
(Left column: Test Developers; right column: Test Users.)
Test developers should provide the information and
  • 84. supporting evidence that test users need to select appropriate tests. Test users should select tests that meet the intended purpose and that are appropriate for the intended test takers. A-1. Provide evidence of what the test measures, the recommended uses, the intended test takers, and the strengths and limitations of the test, including the level of precision of the test scores. A-1. Define the purpose for testing, the content and skills to be tested, and the intended test takers. Select and use the most appropriate test based on a thorough review of available information. A-2. Describe how the content and skills to be tested were selected and how the tests were developed. A-2. Review and select tests based on the appropriate- ness of test content, skills tested, and content coverage for the intended purpose of testing. A-3. Communicate information about a test’s charac- teristics at a level of detail appropriate to the intended test users. A-3. Review materials provided by test developers and select tests for which clear, accurate, and complete information is provided. A-4. Provide guidance on the levels of skills, knowledge, and training necessary for appropriate review, selection, and administration of tests.
  • 85. A-4. Select tests through a process that includes persons with appropriate knowledge, skills, and training. A-5. Provide evidence that the technical quality, including reliability and validity, of the test meets its intended purposes. A-5. Evaluate evidence of the technical quality of the test provided by the test developer and any indepen- dent reviewers. A-6. Provide to qualified test users representative samples of test questions or practice tests, directions, answer sheets, manuals, and score reports. A-6. Evaluate representative samples of test questions or practice tests, directions, answer sheets, manuals, and score reports before selecting a test. A-7. Avoid potentially offensive content or language when developing test questions and related materials. A-7. Evaluate procedures and materials used by test developers, as well as the resulting test, to ensure that potentially offensive content or language is avoided. A-8. Make appropriately modified forms of tests or administration procedures available for test takers with disabilities who need special accommodations. A-8. Select tests with appropriately modified forms or administration procedures for test takers with disabili- ties who need special accommodations. A-9. Obtain and provide evidence on the performance
  • 86. of test takers of diverse subgroups, making significant efforts to obtain sample sizes that are adequate for subgroup analyses. Evaluate the evidence to ensure that differences in performance are related to the skills being assessed. A-9. Evaluate the available evidence on the perfor- mance of test takers of diverse subgroups. Determine to the extent feasible which performance differences may have been caused by factors unrelated to the skills being assessed. B. Administering and Scoring Tests Test Developers Test Users Test developers should explain how to administer and score tests correctly and fairly. Test users should administer and score tests correctly and fairly. *Copyright 2004 by the Joint Committee on Testing Practices. This material may be reproduced in its entirety without fees or permission, provided that acknowledgment is made to the Joint Committee on Testing Practices. Any exceptions to this, including requests to excerpt or paraphrase this document, must be presented in writing to Director, Testing and Assessment, Science Directorate, APA. This edition replaces the first edition of the Code, which was published in 1988. Source: From Code of Fair Testing Practices in Education. (2004). Washington, DC: Joint Committee on Testing Practices. (Mailing address:
  • 87. Joint Committee on Testing Practices, Science Directorate, American Psychological Association, 750 First Street, NE, Washington, DC 20002-4242) (continued) Reliability Chapter 2 B-1. Provide clear descriptions of detailed procedures for administering tests in a standardized manner. B-1. Follow established procedures for administering tests in a standardized manner. B-2. Provide guidelines on reasonable procedures for assessing persons with disabilities who need special accommodations or those with diverse linguistic backgrounds. B-2. Provide and document appropriate procedures for test takers with disabilities who need special accom- modations or those with diverse linguistic backgrounds. Some accommodations may be required by law or regulation. B-3. Provide information to test takers or test users on test question formats and procedures for answering test questions, including information on the use of any needed materials and equipment. B-3. Provide test takers with an opportunity to become familiar with test question formats and any materials or equipment that may be used during testing.
  • 88. B-4. Establish and implement procedures to ensure the security of testing materials during all phases of test development, administration, scoring, and reporting. B-4. Protect the security of test materials, including respecting copyrights and eliminating opportunities for test takers to obtain scores by fraudulent means. B-5. Provide procedures, materials and guidelines for scoring the tests, and for monitoring the accuracy of the scoring process. If scoring the test is the responsi- bility of the test developer, provide adequate training for scorers. B-5. If test scoring is the responsibility of the test user, provide adequate training to scorers and ensure and monitor the accuracy of the scoring process. B-6. Correct errors that affect the interpretation of the scores and communicate the corrected results promptly. B-6. Correct errors that affect the interpretation of the scores and communicate the corrected results promptly. B-7. Develop and implement procedures for ensuring the confidentiality of scores. B-7. Develop and implement procedures for ensuring the confidentiality of scores. C. Reporting and Interpreting Test Results Test Developers Test Users
  • 89. Test developers should report test results accurately and provide information to help test users interpret test results correctly. Test users should report and interpret test results accu- rately and clearly. C-1. Provide information to support recommended interpretations of the results, including the nature of the content, norms or comparison groups, and other technical evidence. Advise test users of the benefits and limitations of test results and their interpreta- tion. Warn against assigning greater precision than is warranted. C-1. Interpret the meaning of the test results, taking into account the nature of the content, norms or comparison groups, other technical evidence, and benefits and limitations of test results. C-2. Provide guidance regarding the interpretations of results for tests administered with modifications. Inform test users of potential problems in interpreting test results when tests or test administration proce- dures are modified. C-2. Interpret test results from modified test or test administration procedures in view of the impact those modifications may have had on test results. C-3. Specify appropriate uses of test results and warn test users of potential misuses. C-3. Avoid using tests for purposes other than those recommended by the test developer unless there is
  • 90. evidence to support the intended use or interpretation. C-4. When test developers set standards, provide the rationale, procedures, and evidence for setting performance standards or passing scores. Avoid using stigmatizing labels. C-4. Review the procedures for setting performance standards or passing scores. Avoid using stigmatizing labels. (continued) Reliability Chapter 2 C-5. Encourage test users to base decisions about test takers on multiple sources of appropriate information, not on a single test score. C-5. Avoid using a single test score as the sole deter- minant of decisions about test takers. Interpret test scores in conjunction with other information about individuals. C-6. Provide information to enable test users to accu- rately interpret and report test results for groups of test takers, including information about who were and who were not included in the different groups being compared, and information about factors that might influence the interpretation of results. C-6. State the intended interpretation and use of test results for groups of test takers. Avoid grouping test results for purposes not specifically recommended
  • 91. by the test developer unless evidence is obtained to support the intended use. Report procedures that were followed in determining who were and who were not included in the groups being compared and describe factors that might influence the interpretation of results. C-7. Provide test results in a timely fashion and in a manner that is understood by the test taker. C-7. Communicate test results in a timely fashion and in a manner that is understood by the test taker. C-8. Provide guidance to test users about how to monitor the extent to which the test is fulfilling its intended purposes. C-8. Develop and implement procedures for moni- toring test use, including consistency with the intended purposes of the test. D. Informing Test Takers Under some circumstances, test developers have direct communication with the test takers and/or control of the tests, testing process, and test results. In other circumstances the test users have these responsibilities. Test developers or test users should inform test takers about the nature of the test, test taker rights and responsi- bilities, the appropriate use of scores, and procedures for resolving challenges to scores. D-1. Inform test takers in advance of the test administration about the coverage of the test, the types of question formats, the directions, and appropriate test-taking strategies.
  • 92. Make such information available to all test takers. D-2. When a test is optional, provide test takers or their parents/guardians with information to help them judge whether a test should be taken—including indications of any consequences that may result from not taking the test (e.g., not being eligible to compete for a particular scholarship) —and whether there is an available alternative to the test. D-3. Provide test takers or their parents/guardians with information about rights test takers may have to obtain copies of tests and completed answer sheets, to retake tests, to have tests rescored, or to have scores declared invalid. D-4. Provide test takers or their parents/guardians with information about responsibilities test takers have, such as being aware of the intended purpose and uses of the test, performing at capacity, following directions, and not disclosing test items or interfering with other test takers. D-5. Inform test takers or their parents/guardians how long scores will be kept on file and indicate to whom, under what circumstances, and in what manner test scores and related information will or will not be released. Protect test scores from unauthorized release and access. D-6. Describe procedures for investigating and resolving circumstances that might result in canceling or with- holding scores, such as failure to adhere to specified testing procedures. D-7. Describe procedures that test takers, parents/guardians, and other interested parties may use to obtain more information about the test, register complaints, and have
  • 93. problems resolved. Section Summaries Chapter 2 Chapter 2 Themes and Questions Section Summaries 2.1 Four Purposes of Educational Assessment Assessments can be used to summarize student achievement (summative ); for selection and placement purposes (placement ); as a basis for diagnosing strengths, problems, and deficits (diagnostic); or as a means of provid- ing feedback to improve teaching and learning (formative). Despite these different emphases, the overriding purpose of all forms of educational assessment is to help the learner reach important goals. Planning for effective assessment requires clarifying and communicating instructional objectives and aligning instruction and assessment with these goals. As much as possible, assessment should be an integral part of the instructional process, designed to provide immediate feedback for both teachers and learners to assist and improve minute-to- minute decisions. Teachers should use a variety of approaches to assessment. 2.2 Test Fairness Tests are fair when they treat all learners in an evenhanded manner and when they provide all learners with an equal opportunity to learn. Tests are unfair when they examine content that has neither been covered nor assigned; if they deliberately or uninten-
  • 94. tionally use misleading “trick” questions; when some learners have not been given sufficient opportunity to learn the material; if they don’t allow sufficient time for all learners to finish; when they fail to accommodate to the special needs of individual learners; when they reflect biases and stereotypes in test construction; if scoring is influenced by stereotypes and teacher expectations; when steps are not taken to guard against cheating; and if they are graded inconsistently. 2.3 Validity A test is valid to the extent that it measures what it is intended to measure. Tests are seldom fair if they are not also valid. Face validity is a measure of the extent to which a test looks as though it measures what it is meant to measure. However, for some assessments (such as assessments of responder honesty), it is sometimes essential that an instrument not appear to measure what it is intended to measure. Content validity is determined by the extent to which items on the test reflect course objectives, and it is normally determined through a careful analysis and selection of items. Construct validity relates to how closely a test agrees with its theoretical underpinnings. It is a measure of how consistent it is with the thinking that gave rise to it, and it is more relevant for psychological assessments (e.g., personality or intel- ligence tests) than for teacher-made tests. Criterion-related validity has two aspects: One is defined by agreement between test results and predictions based on those results (predictive validity); the other is evident in the extent to which the results of a test agree with the results
  • 95. of other tests that measure the same characteristics. 2.4 Reliability Consistency and predictability are the hallmarks of high reliability. Reliability has to do with error of measurement; the greater the errors in our assessments, the lower the reliability. Reliability can be estimated by giving the same test to the same participants on more than one occasion. High reliability would be reflected in similar scores for each individual at different testings (test-retest reliability). Or it can be calculated by giving different forms of a test to the same subjects and comparing their results (parallel- forms reliability). A third option is to take a single test administered once, divide it into halves, score each half, and compare those scores. Providing the test is long enough, scores on each half will be similar if it is reli- able. Reliability is affected by the length of a test (generally higher with longer tests or with the frequent use of many shorter tests); by the stability of what is being measured (unstable characteristics tend to yield lower reliabilities); chance factors; and item difficulty (moderately difficult items usually lead to higher test reliability than very easy or very difficult items). Key Terms Chapter 2 Applied Questions 1. What are the four principal uses of educational assessment? Research and defend the prop- osition that the most important function of educational
  • 96. assessment should be formative. 2. What are the main characteristics of good educational assessment? Explain in your own words why you think one of these characteristics is more important than the other two in educational measurement. 3. What makes a test unfair? List the things you think a teacher can do to improve the fairness of teacher-made tests. 4. Can a test be fair but invalid? Write a brief position paper detailing why you think the answer is yes, no, or maybe. 5. Can a test be valid but unreliable? Generate a test item to illustrate and support your answer for this question. 6. How can you determine the reliability of a test? Outline the steps a teacher might take to increase test reliability without having to go to the trouble of actually calculating it. 7. How can you improve test validity and reliability? Make a list of what you consider to be guidelines for sound testing practices. Identify those that would improve test validity and reliability. Key Terms bias Personal prejudice (prejudgment) in favor of or against a person, thing, or idea rather than another, when such prejudgment is typically unfair.