Purposes and Characteristics
of Educational Assessment
Focus Questions
After reading this chapter, you should be able to answer the following questions:
1. What are the four principal uses of educational assessment?
2. What are the main characteristics of good educational assessment?
3. What makes a test unfair?
4. Can a test be fair but invalid?
5. Can a test be valid but unreliable?
6. How can you determine the reliability of a test?
7. How can you improve test validity and reliability?
He uses statistics as a drunken man uses lamp-posts—
for support rather than illumination.
—Andrew Lang
Not to imply that many teachers are like drunken men leaning on their lampposts, but it is true
that many use the results of their assessments a little like crutches to support their summaries
of the characteristics, virtues, and vices of their students. For these teachers, test results and
grades serve as handy statistics that summarize all the important things that can be communicated to school administrators and parents.
Others use the results of their assessments in very different ways. They see them not as summarizing important accomplishments, but as tools that suggest to both the learner and the
teacher how the teaching–learning process can be improved. For these teachers, assessments
are less like lampposts against which a tired (but sober) teacher can lean and rest: They are
more like the light on top of the posts that throws back the darkness and shows teacher and
student where the paths are, so that they need not stumble about like blindfolded pigs in a
supermarket.
Purposes and Characteristics of Educational Assessment Chapter 2
Chapter Outline
2.1 Four Purposes of Educational
Assessment
Planning for Assessment
Test Blueprints
2.2 Test Fairness
Content Problems
Trick Questions
Opportunity to Learn
Insufficient Time
Failure to Make Accommodations for
Special Needs
Biases and Stereotypes
Inconsistent Grading
Cheating and Test Fairness
2.3 Validity
Face Validity
Content Validity
Construct Validity
Criterion-Related Validity
2.4 Reliability
Test-Retest Reliability
Parallel-Forms Reliability
Split-Half Reliability
Factors That Affect Reliability
How to Improve Test Reliability
Chapter 2 Themes and Questions
Section Summaries
Applied Questions
Key Terms
Four Purposes of Educational Assessment Chapter 2
2.1 Four Purposes of Educational Assessment
These distinctions highlight the difference between summative assessment and formative
assessment.
As we saw in Chapter 1, summative assessment is assessment that occurs mainly at the end
of an instructional period. Its chief purpose is to provide a grade. In a real sense, it is a summation of the learner’s achievements.
Formative assessment is assessment that is an integral and ongoing part of instruction. Its central purpose is to provide guidance for both the teacher and the learner in an effort to improve learning. Its goal is formative rather than summative. Assessment designed specifically to enhance learning—in other words, formative assessment—is the main emphasis of current approaches to educational assessment.
Diagnostic assessment in education, like diagnosis in medicine,
has to do with finding out
what is wrong, why it’s wrong, and how it can be fixed.
Diagnosis of automobile mechanical difficulties is much the same: The mechanic runs tests to
find out what is wrong and to
develop hypotheses about how the problem can be repaired.
Diagnostic assessment in education is no different. It has three purposes:
1. Finding out what is wrong (uncovering which learning goals
have not been met)
2. Developing hypotheses about why something is wrong
(suggesting possible reasons why
these goals have not been accomplished)
3. Suggesting ways of repairing what is wrong (devising
interventions that might increase the
likelihood that instructional objectives will be reached)
The examiners look for the causes of failure so they can suggest
ways of fostering success. Like
the physician and the mechanic who diagnose in order to repair,
so, too, does the educator.
A fourth kind of assessment is placement assessment. It
describes assessment undertaken
before instruction. Its main purpose is to provide educators with
information about student
placement and about learning readiness. Placement assessment
can also influence choice of
content and of instructional approaches (Figure 2.1).
Planning for Assessment
While we can distinguish between assessment used for
placement, diagnosis, formation,
or summation, these distinctions do not contradict the fact that
all forms of assessment
share the same goals. Simply put, the general aim of all
assessment is to help learners learn.
Accordingly, educational assessment and instruction are part of
the same process; assessment
is for learning rather than simply of learning.
It is worth noting that the assessment of learners is, in a sense,
an assessment of the effectiveness of the teacher and of the soundness and appropriateness of instructional and assessment strategies. This doesn’t invariably mean, of course, that
when students do well on their
tests they have the good fortune of having an astonishingly
good teacher. Some students do
remarkably well with embarrassingly inadequate teachers, and
others perform poorly with the
most gifted of instructors.
Still, if your teaching is aimed at admirable, clearly stated, and
well-understood instructional
objectives, and if most of your charges attain these targets, the
assessments that tell you (and
the world) that this is so also reflect positively on your
teaching.
As you plan your instruction and your assessment, the following
guidelines can be highly
useful.
Communicating Instructional Objectives
Know and communicate your instructional objectives. Not only
do teachers need to understand what their learning objectives are, but students need to
know what is expected of
them—hence the importance of having clear instructional
objectives and of communicating
them to students. If students understand what the most
important learning goals are, they
are far more likely to reach them than if they are just feeling
their way around in the dark. For
example, at the beginning of a unit on geography, Mrs. Wyatt
tells her seventh graders that
at the end of the unit, they will be able to create a map of their
town to scale. To do this, Mrs.
Wyatt explains, she will have to teach them some map-making
skills and the math they will
need to calculate the scaling for the map. She then begins a
lesson on ratios.
Aligning Goals, Instruction, and Assessments
Match instruction, assessment, and grading to goals. In theory, this guideline might seem obvious; but in practice, it is not always highly apparent. For example, one of my high school teachers, Sister Ste. Mélanie, delighted in teaching us obscure details about the lives and times of the English authors whose poems and essays were part of our curriculum. We naturally assumed that some of her most important objectives had to do with learning these intriguing details.

Figure 2.1: Four kinds of educational assessment based on their main purposes

Assessment can serve at least four distinct purposes in education. However, the same assessment procedures and instruments can be used for all these purposes. Further, there is often no clear distinction among these purposes. For example, diagnosis is often part of placement and can also have formative functions.

• Summative assessment: summarizes extent to which instructional objectives have been met; provides basis for grading; useful for further educational or career decisions
• Diagnostic assessment: identifies instructional objectives that have not been met; identifies reasons why targets have not been met; suggests approaches to remediation
• Formative assessment: designed to improve the teaching/learning process; provides feedback for teachers and learners to enhance learning and motivation; enhances learning, fosters self-regulated learning, and increases motivation
• Placement assessment: assesses pre-existing knowledge and skills; provides information for making decisions about the learner’s readiness; useful for placement and selection decisions
Her tests matched her instructional objectives. As we expected,
most of the items on the
tests she gave us asked questions like Name three different
kinds of food that would have
been common in Shakespeare’s time. (Among the correct
answers were responses like cherry
cordial, half-moon-shaped meat and potato pies called pasties,
toasted hazelnuts, and stuffed
game hens. We often left her classes very hungry.)
But sadly, Sister Ste. Mélanie’s grading was based less on what
we had learned in her
classes than on the quality of our English. She had developed an
elaborate system for subtracting points based on the number and severity of our
grammatical and spelling errors.
Our grades reflected the accuracy of our responses only when
our grammar and spelling
were impeccable. Her grading did not match her instruction; it
exemplified poor educational alignment.
Educational alignment is the expression used to describe an
approach to education that
deliberately matches learning objectives with instruction and
assessment. As Biggs and Tang
(2011) describe it, alignment involves three key components:
1. A conscious attempt to provide learners with clearly
specified goals
2. The deliberate use of instructional strategies and learning
activities designed to foster
achievement of instructional goals
3. The development and use of assessments that provide
feedback to improve learning and
to gauge the degree of alignment between goals, instruction, and
assessment
Good alignment happens when teachers think through their
whole unit before beginning
instruction. They identify the learning objectives, the evidence
they will collect to document
student learning (assessments), and the sequence in which they
will have students access
and interact with information before administering assessments.
One elementary teacher
did just that when she designed a unit on the watersheds. Her
goal was the Virginia science
standard: Science 4.8—The student will investigate and
understand important VA natural
resources (a) watershed and water resources. To organize her
unit, she used a focusing
question: “What happens to the water flowing down your street
after a big rainstorm?”
She wanted students to understand the overarching concept that
every action has a consequence—in this case, that the flow of water affects areas
downstream. Throughout the
course of her unit, she had children actively engaged in a
variety of meaningful tasks. The
children discussed and debated the issues of pollution and their
responsibilities to avoid polluting their water. They created tangible vocabulary tools to
learn the vocabulary of the unit.
They responded to academic prompts to explain key concepts.
They built models of watersheds. As the students performed these tasks, the teacher noted
the level of performance of
each child and documented individual knowledge, skill, and
understanding. She used these
instructional strategies as formative assessments as she
provided feedback to each student.
In addition to multiple quizzes throughout the unit, the students demonstrated their understanding of important concepts by completing a performance-based task. Everything was
aligned so that the teacher could infer that students truly
understood the importance of
Virginia’s watersheds and water resources.
Using Assessment to Improve Instruction
Use assessment as an integral part of the teaching–learning
process. Good formative assessment is designed to provide frequent and timely feedback that is
of immediate assistance to
learners and teachers. As we see in Chapter 5, this doesn’t mean
that teachers need to make
up specially designed and carefully constructed placement and
formative tests to assess their
learners’ readiness for instruction, gauge their strengths and
weaknesses, and monitor their
progress. The best formative assessment will often consist of
brief, informal assessments,
perhaps in the form of oral questions or written problems that
provide immediate feedback
and inform both teaching and learning. Formative feedback
might then lead the teacher to
modify instructional approaches and learners to adjust their
learning activities and strategies.
Using Different Approaches to Assessment
Employ a variety of assessments, especially when important
decisions depend on their outcomes. Test results are not always entirely valid (they don’t always measure what they are
intended to measure) or reliable (they don’t always measure
very accurately). The results of
a single test might reflect temporary influences such as those
related to fatigue, test anxiety,
illness, situational distractions, current preoccupations, or other
factors. Grades and decisions
based on a variety of assessments are more likely to be fair and
valid. For example, when
someone is ready to demonstrate driving knowledge and skills,
multiple assessments are given
by the Department of Motor Vehicles (DMV). Drivers need to
know the rules of the road, and
they also need to know how to parallel park. So DMV
assessments include both a written and
a driving field test.
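To make the point concrete, here is a small sketch (all numbers invented for illustration): one unlucky score distorts a judgment far less when it is pooled with several other assessments.

```python
import statistics

# Hypothetical scores for one student across several assessments.
# The third score (58) might reflect a bad day: fatigue, anxiety, illness.
scores = [78, 74, 58, 81, 76]

single_test = scores[2]           # judging from one test alone
pooled = statistics.mean(scores)  # judging from the whole set

print(single_test)  # 58 -> looks like a weak student
print(pooled)       # 73.4 -> much closer to typical performance
```

Nothing here prescribes particular weights or instruments; it simply shows why decisions based on several assessments are steadier than decisions based on any single one.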
Constructing Tests According to Blueprints
A house construction blueprint describes in detail the
components of a house—its dimensions, the number of rooms and their placement, the materials to
be used for building it, the
pitch of its roof, the depth of its basement, its profile from
different directions. A skilled contractor can read a blueprint and almost see the completed house.
In much the same way, a test blueprint describes in detail the
nature of the items to be used
in building the test. It includes information about the number of
items it will contain, the content areas they will tap, and the intellectual processes that will
be assessed. A skilled educator
can look at a test blueprint and almost see the completed test
(Figure 2.2).
Test Blueprints
It’s important to keep in mind that tests are only one form of
educational assessment.
Assessment is a broad term referring to all the various methods
that might be used to obtain
information about different aspects of teaching and learning.
The word test has a more specific meaning: In education, it refers to specific instruments or procedures designed to measure student achievement or progress, or various student characteristics.
Educational tests are quite different from many of the other
measuring instruments we use—
instruments like rulers, tape measures, thermometers, and
speedometers. These instruments
measure directly and relatively exactly: We don’t often have
reason to doubt them.
Our psychological and educational tests aren’t like that: They
measure indirectly and with
varying accuracy. In effect, they measure a sample of behaviors.
And from students’ behaviors
(responses), we make inferences about qualities we can’t really
measure directly at all. Thus,
from a patient’s responses to questions like “What is the first
word you think of when I say
mother?” the psychologist makes inferences about hidden
motives and feelings—and perhaps
eventually arrives at a diagnosis.
In much the same way, the teacher makes inferences about what the learner knows—and perhaps inferences about the learner’s thought processes as well—from responses to a handful of questions like this one:
Which of the following is most likely to be correct?
1. Mr. Wilson will still be alive at the end of the story.
2. Mr. Wilson will be in jail at the end of the story.
3. Mr. Wilson will have died within the next 30 pages.
4. Mr. Wilson will not be mentioned again.
Tests that are most likely to allow the teacher to make valid and
useful inferences are those
that actually tap the knowledge and skills that make up course
objectives. And the best way
of ensuring that this is the case is to use test construction
blueprints that take these objectives
into consideration (see Tables 4.3 and 4.4 for examples of test
blueprints).
Figure 2.2: Guidelines for assessment

These guidelines are most useful when planning for assessment. Many other considerations have to be kept in mind when devising, administering, grading, and interpreting teacher-made tests.

Some guidelines for assessment:
• Know and communicate learning targets
• Align instruction, goals, and assessment
• Use assessment to improve instruction
• Use a variety of assessments
• Develop blueprints to construct tests
Test Fairness Chapter 2
Guidelines for Constructing Test Blueprints
A good test blueprint will contain most of the following:
• A clear statement of the test content related directly to
instructional objectives
• The performance, affective, or cognitive skills to be
tapped
• An indication of the test format, describing the kinds of
test items to be used or the
nature of the performances required
• A summary of how marks are to be allocated in relation to
different aspects of
the content
• Some notion of the achievement levels expected of
learners
• An indication of how achievement levels will be graded
• A review of the implications of different grades
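As a rough sketch of how the content-by-process core of such a blueprint might be laid out (the content areas, cognitive levels, and item counts below are invented for illustration, not taken from this chapter):

```python
# Illustrative test blueprint: each content area lists how many items
# will tap each cognitive process; marks could be allocated in proportion.
blueprint = {
    "map skills":       {"remember": 4, "apply": 6, "analyze": 2},
    "ratios and scale": {"remember": 2, "apply": 8, "analyze": 3},
}

# Totals show at a glance where the test places its weight.
total_items = sum(sum(row.values()) for row in blueprint.values())
apply_items = sum(row["apply"] for row in blueprint.values())

print(total_items)  # 25 items overall
print(apply_items)  # 14 of them ask students to apply what they learned
```

Laying the plan out this way makes the remaining blueprint elements (formats, expected achievement levels, grading) easy to attach to each cell.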
Regrettably, not all teachers use test blueprints. Instead, when a
test is required, many find
it less trouble to sit down and simply write a number of test
items that seem to them a reasonable examination of what they have taught. And sadly, in too
many cases what they have
taught is aimed loosely at what are often implied and vague
rather than specific instructional
objectives.
Using test blueprints has a number of important advantages and
benefits. Among them is
that they force the teacher to clarify learning objectives and to
make decisions about the
importance of different aspects of content. They also encourage
teachers to become more
aware of the learner’s cognitive processes and, by the same
token, to pay more attention to
the development of higher cognitive skills.
At a more practical level, using test blueprints makes it easier
for teachers to produce similar
tests at different times, thus maintaining uniform standards and
allowing for comparisons
among different classes and different students. Also, good test
blueprints serve as a useful
guide for constructing test items and perhaps, in the long run,
make the teacher’s work easier.
Figure 2.3 summarizes some of the many benefits of using test
blueprints.
2.2 Test Fairness
Determining what the best assessment procedures and
instruments are is no simple matter
and is not without controversy. But although educators and
parents don’t always agree about
these matters, there is general agreement about the
characteristics of good measuring instruments. Most important among these is that evaluative
instruments be fair and that students
see them as being fair. The most common student complaint
about tests and testing practices
has to do with their imagined or real lack of fairness (Bouville,
2008; Felder, 2002).
The importance of test fairness was highlighted during the
Vietnam War in the 1960s.
President Kennedy’s decision to send troops to Vietnam led to
the drafting of large numbers
of age-eligible men, some of whom died or were seriously
injured in Vietnam. But men who
went to college were usually exempt from the draft—or their
required military service was at
least deferred. So, for many, it became crucial to be admitted to undergraduate or postgraduate studies. For some, passing college or graduate entrance
exams was literally a matter of life
and death. That the exams upon which admission decisions
would be based should be as fair
as possible seemed absolutely vital.
Just how fair are our educational assessments? We don’t always
know. But science provides
ways of defining and sometimes of actually measuring the
characteristics of tests. It says, for
example, that the best assessment instruments have three
important qualities:
1. Fairness
2. Validity
3. Reliability
As we saw, from the student’s point of view, the most important
of these is the apparent—
and real—fairness of the test.
There are two ways of looking at test fairness, explains Bouville (2008): On the one hand, there is fairness of treatment; on the other, there is fairness of opportunity. Fairness of treatment issues include problems relating to not making accommodations for children with special needs, biases and stereotypes, the use of misleading “trick” questions, and inconsistent grading. Fairness of opportunity problems include testing students on material not covered, not providing an opportunity to learn, not allowing sufficient time for the assessment, and not guarding against cheating. We look at each of these issues in the following sections (Figure 2.4).

Figure 2.3: Advantages of test blueprints

Making and using test blueprints presents a number of distinct benefits. And, although developing blueprints can be time-consuming, contrary to what some think, it can make the teacher’s task easier rather than more difficult and complicated.

Advantages of devising and using test blueprints:
• Forces teacher to clarify learning targets
• Promotes decisions about the relative importance of different aspects of content
• Encourages teachers to become more aware of learners’ cognitive activity
• Promotes the development of thinking rather than mainly remembering skills
• Increases test validity and reliability
• Simplifies test construction
• Leads to more consistency among different tests, allowing more meaningful comparisons
Content Problems
Tests are—or at the very least, seem—highly unfair when they
ask questions or pose problems about matters that have not been covered or assigned. This
issue sometimes has to do
with bad teaching; at other times, it simply relates to bad test
construction. For example, in
my second year in high school, we had a teacher who almost
invariably peppered her quizzes
and exams with questions about matters we had never heard
about in class. “We didn’t have
time,” she would protest when someone complained and pointed
out that she had never
mentioned rhombuses and trapezoids and quadrilaterals. “But
it’s important and it’s in the
book and it might be on the final exam,” she would add.
Had she simply told us that we were responsible for the content
in Chapter 6, we would
not have felt so unfairly treated. This example illustrates bad
teaching as much as bad test
construction.
In connection with content problems that affect test fairness, it is interesting to note that when test results are higher, students tend to perceive the test as being fairer. It’s an intriguing observation that, it turns out, may have a grain of truth in it. As Oller (2012) points out, higher scores are evidence that there is agreement between test makers and the better students about the content that is most important. This agreement illustrates what we termed educational alignment: close correspondence among goals, instructional approaches, and assessments.

Conversely, exams that yield low scores for all students may reflect poor educational alignment: They indicate that what the teacher chose to test is not what even the better learners have learned. Hence there is good reason to believe that tests that yield higher average scores are, in fact, fairer than those on which most students do very poorly. And raising the marks, perhaps by scaling them so that they approximate a normal distribution with an acceptably high average, will do little to alter the apparent fairness of the test.
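For concreteness, scaling of this kind is typically a linear transformation to a chosen mean and spread; the target values below are arbitrary, and, as argued above, the transformation changes the numbers without making the test itself any fairer.

```python
import statistics

def rescale(scores, target_mean=75.0, target_sd=10.0):
    """Linearly rescale scores to a chosen mean and standard deviation."""
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    if sd == 0:                       # identical scores: just shift them
        return [target_mean] * len(scores)
    return [target_mean + target_sd * (s - mean) / sd for s in scores]

raw = [34, 41, 47, 52, 58, 66]        # a test on which everyone did poorly
scaled = rescale(raw)
print(round(statistics.mean(scaled), 1))  # 75.0
```

The students’ ranking is unchanged; only the reported numbers move.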
Figure 2.4: Issues affecting test fairness

That a test is fair, and that it seems to be fair, is one of the most important characteristics of good assessment.

Issues of fairness of opportunity:
• Testing material not covered
• Not providing an opportunity to learn
• Not allowing sufficient time to complete the test
• Not guarding against cheating

Issues of fairness of treatment:
• Not accommodating special needs
• Being influenced by biases and stereotypes
• Using misleading, trick questions
• Grading inconsistently
Trick Questions
Trick questions illustrate problems that have less to do with test
content than with test
construction—which, of course, doesn’t mean that the test
maker is always unaware that one
or more questions might be considered trick questions.
Trick questions are questions that mislead and deceive,
regardless of whether the deception is
intentional or is simply due to poor item construction. Trick
questions do not test the intended
learning targets, but rather a student’s ability to navigate a
deceptive test. Items that students
are most likely to consider trick questions include:
1. Questions that are ambiguous (even when the ambiguity is
accidental rather than deliberate). Questions are ambiguous when they have more than one
possible interpretation. For
example, “Did you see the man in your car?” might mean, “Did
you see the man who is in
your car?” or “Did you see the man when you were in your
car?”
2. Multiple-choice items where two nearly identical alternatives
seem correct. Or, as in the
following example, where all alternatives are potentially
correct:
The Spanish word fastidiar means:
annoy
damage
disgust
harm
3. Items deliberately designed to catch students off their guard.
For example, consider this
item from a science test:
During a very strong north wind, a rooster lays an egg on a flat
roof: On what side of the
roof is the egg most likely to roll off?
North
South
East
West
No egg will roll off the roof
Students who aren’t paying sufficient attention on this fast-paced, timed test might well
say South. Seems reasonable. (But no; apparently, roosters
rarely lay eggs.)
4. Questions that use double negatives. For example: Is it true
that people should never not
eat everything they don’t like?
5. Items in which some apparently trivial word turns out to be
crucial. That is often the case
for words such as always, never, all, and most, as in this item:
True or False? Organic prod-
ucts are always better for you than those that are nonorganic.
6. Items that make a finer discrimination than expected. For
example, say a teacher has
described the speed of sound in dry air at 20 degrees centigrade
as being right around 340
meters per second. Now she presents her students with this
item:
What is the speed of sound in dry air at 20 degrees centigrade?
A. 300 meters per second
B. around 340 meters per second
C. 343.2 meters per second
D. 343.8 meters per second
Because the alternatives contain both the correct answer (C)
and the less precise information given by the teacher (B), the item is deceiving.
7. Long stems in multiple-choice questions that include extraneous and irrelevant information that serves only to distract. Consider, for example, this multiple-choice item:
A researcher found that the average score of a sample
consisting of 106 females was 52.
The highest score was 89 while the lowest score was 34. In this
study, the median score
was 55 and the two most frequent scores were 53 and 58. What
was the sum of all the
scores?
A. 5,512
B. 5,830
C. 5,618
D. 6,148
All the information required to answer this item correctly (A)
is included in the first sentence. Everything after that sentence is irrelevant and, for that
reason, misleading.
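The sufficiency of that first sentence is easy to check: the sum of a set of scores is simply the mean times the number of scores, so the median, mode, and extremes contribute nothing.

```python
n = 106        # number of scores in the sample
average = 52   # reported mean score

# sum of all scores = mean * n; every other statistic in the stem is a distractor
print(n * average)  # 5512 -> alternative A
```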
Opportunity to Learn
Tests are patently unfair when they sample concepts, skills, or
cognitive processes that students have not had an opportunity to acquire. Lack of
opportunity to learn might reflect an
instructional problem. For example, it might result from not
being exposed to the material
either in class or through instructional resources. It might also
result from not having sufficient
time to learn. Bloom (1976), for example,
believed that there are faster and slower
learners (not gifted learners and those less
gifted), and that all, given sufficient time,
can master what schools offer.
If Bloom is mostly correct, the results of
many of our tests indicate that we simply don’t allow some of our learners sufficient time for what we ask of them.
Bloom’s mastery learning system offers
one solution. Mastery learning describes
an instructional approach in which course
content is broken into small, sequential
units and steps are taken to ensure that
all learners eventually master instructional
objectives (see Chapter 6 for a discussion
of mastery learning).
Another solution, suggests Beem (2010),
is the expanded use of technology and of
virtual reality instructional programs. These are instructional
computer-based simulations designed to provide a sensation of realism. She argues that
these, along with other
digital technologies including computers and handheld devices,
offer students an opportunity
to learn at their own rate. Besides, digital technology might also
reduce the negative influence
of poorly qualified teachers—if there are any left.
iStockphoto/Thinkstock
▲ Ambiguous questions, misleading items, items about material not covered or assigned, overly long tests—all of these contribute to the perceived unfairness of tests.
Test Fairness Chapter 2
As Ferlazzo (2010) argues, great teaching is about giving students the opportunity to learn. Poor and unfair testing is about assessing the extent to which they have reached instructional objectives they have never had an opportunity to reach. See Applications: Addressing Unfair Tests.

Insufficient Time
Closely related to the unfairness that results from not having an opportunity to learn is the injustice of a test that doesn't give students an opportunity to demonstrate what they actually have learned. For some learners, this is a common occurrence simply because they tend to respond more slowly than others and, as a result, often find that they don't have enough time to complete the test.
A P P L I C A T I O N S :
Addressing Unfair Tests
In spring 2013, students in New York City had their first experience with English Language Arts tests designed to tap the curriculum of the Common Core State Standards. After adopting these standards in 2010, New York hired a test-publishing company to design a test that would reflect the knowledge and skills within the Common Core Standards.

After witnessing their students' anguish following the initial testing experience, 21 principals were so outraged that they felt compelled to issue a formal protest through a letter to the State's Commissioner of Education. In that letter, they highlighted how the English/Language Arts test was unfair. One of their major concerns was the lack of alignment between the types of questions asked and the critical thinking skills valued in the Common Core State Standards. The Common Core State Standards emphasize deep and rich analysis of fiction and nonfiction. But the ELA tests focused mostly on specific lines and words rather than on the wider standards. What was taught in the classrooms was not assessed on the test: The test failed to meet the criterion of fairness of opportunity.

While alignment between what was tested and what is in the Standards is important, this was not the administrators' only complaint. In reviewing the tests taken by the students, they concluded that the structure of the tests was not developmentally appropriate. For example, testing required 90-minute sessions on each of three consecutive days—a difficult undertaking for a mature student, let alone for a 10-year-old. Clearly, there is a violation here of the criterion of fairness of opportunity.

Finally, the principals expressed concern that too much was riding on a flawed test developed by a company with a track record of errors. They feared that the tests might not be valid. Yet students' promotion to the next grade, entry into middle and secondary school, and admission to special programs are often based on these tests. In addition, teachers and schools are evaluated in terms of how well their students perform, even though that is not an intended use of the tests. As a result, scores on these tests can affect the extent to which schools receive special funds or are put on improvement plans. These complications raise questions about the test's validity for these purposes.

Clearly, as the principals reflected on the new English/Language Arts test, they saw problems with both fairness of opportunity and validity.
Suppose that a 100-item test is designed to sample all the target skills and information that define a course of study. If a student has time to respond to only 80 of these 100 items, then only 80% of the instructional objectives have been tested. That test is probably unfair for that student.

There is clearly a place for speeded testing, particularly with respect to standardized tests such as those that assess some of the abilities that contribute to intelligence. (We look at some of these tests in Chapter 10.) But as a rule of thumb, teacher-made tests should always be of such a length and difficulty level that most, if not all, students can easily complete them within the allotted time (van der Linden, 2011).
Failure to Make Accommodations for Special Needs
Timed tests can be especially unfair to some learners with special needs. For example, Gregg and Nelson (2012) reviewed a large number of studies that looked at performance on timed graduation tests—a form of high-stakes testing (so called because results on these tests can have important consequences relating to transition from high school, school funding, and even teaching and administrative careers). These researchers found that whereas students with learning disabilities would normally be expected to achieve at a lower than average level on these tests, when they are given the extra time they require, their test scores are often comparable to those of students without disabilities.

Giving students with special needs extra time is the most common of possible accommodations. It is also one of the most effective and fairest adjustments. Even for gifted and talented learners, additional time may be important. Coskun (2011) reports a study in which the number of valuable ideas produced in creative brainstorming groups was positively related to the amount of time allowed.
Accommodations for Test Anxiety
In addition to being given extra time for learning and assessment, many other accommodations for learners with special needs are possible and often desirable. For example, steps can be taken to improve the test performance of learners with severe test anxiety. Geist (2010) suggests that one way of doing this is to reduce negative attitudes toward school subjects such as mathematics. As Putwain and Best (2011) showed, when elementary school students are led to fear a subject by being told that it will be difficult and that important decisions will be based on how well they do, their performance suffers. The lesson is clear: Teachers should not try to motivate their students by appealing to their fears.

For severe cases of test anxiety, certain cognitive and behavioral therapies, in the hands of a skilled therapist, are sometimes highly effective (e.g., Brown et al., 2011). And even in less skilled hands, the use of simple relaxation techniques might be helpful (e.g., Larson et al., 2010).

It is worth keeping in mind, too, that test anxiety often results from inadequate instruction and learning. Not surprisingly, after Faber (2010) had exposed his "spelling-anxious" students to a systematic remedial spelling training program, their spelling performance increased and their test anxiety scores decreased.
Accommodations for Minority Languages
Considerable research indicates that children whose first language is not the dominant school language are often at a measurable disadvantage in school. And this disadvantage can become very apparent if no accommodations are made in assessment instruments and procedures—as is sometimes the case for standardized tests given to children whose dominant language is not the test language (Sinharay, Dorans, & Liang, 2011). As Lakin and Lai (2012) note, there are some serious issues with the fairness and reliability of ability measures given to these children without special accommodations. As we saw in Chapter 1, accommodations in these cases are mandated by law (see In the Classroom: Culturally Unfair Assessments).

Accommodations for Other Special Needs
Teachers must be sensitive to, and they must make accommodations for, many other "special needs." These might include medical problems, sensory disabilities such as vision and hearing problems, emotional exceptionalities, learning disabilities, and intellectual disabilities. They might also include cultural and ethnic differences among learners.

Figure 2.5 describes some of the accommodations that fair assessments of students with special needs might require.
I N T H E C L A S S R O O M :
Culturally Unfair Assessments
Joseph Born-With-One-Tooth knew all the legends his grandfather and the other elders told—even those he had heard only once. His favorites were the legend of the Warriors of the Rainbow, and the legend of Kuikuhâchâu, the man who took the form of the wolverine. These legends are long, complicated stories, but Joseph never forgot a single detail, never confused one with the other. The elders named him ôhô, which is the word for owl, the wise one. They knew that Joseph was extraordinarily gifted.

But in school, it seemed that Joseph was unremarkable. He read and wrote well, and he performed better than many. But no one even bothered to give him the tests that singled out those who were gifted and talented. Those who are talented and gifted are often identified through a combination of methods, beginning with teacher nominations that then lead to further testing and perhaps interviews and auditions (Pfeiffer & Blei, 2008). Those who don't do as well in school, sometimes because of cultural or language differences, tend to be overlooked.

Joseph Born-With-One-Tooth is not alone. Aboriginal and other culturally different children are vastly underrepresented among the gifted and the talented (Baldwin & Reis, 2004). By the same token, they tend to be overrepresented in programs for those with learning disabilities and emotional disorders (Briggs, Reis, & Sullivan, 2008).

There is surely a lesson here for those concerned with the fairness of assessments.
Biases and Stereotypes
Accommodations for language differences are not especially difficult. But overcoming the many biases and stereotypes that can affect the fairness of assessments often is.

Biases are preconceived judgments, usually in favor of or against some person, thing, or idea. For example, I might think that Early Girl tomatoes are better than Big Boys. That is a harmless bias. And like most biases, it is a personal tendency. But if we North Americans tend to believe that all Laplanders are such and such, and most Roma are this and that (such and such and this and that of course being negative), then we hold some stereotypes that are potentially highly detrimental.

Closer to home, historically there have been gender stereotypes about male–female differences whose consequences can be unfair to both genders. Some of these stereotypes are based on long-held beliefs rooted in culture and tradition and propagated through centuries of recorded "expert" opinion. And some are based on various controversial and often contested findings of science.

It's clear that males and females have some biologically linked sex differences, mainly in physical skills requiring strength, speed, and stamina. But it's not quite so clear whether we also have important, gender-linked psychological differences. Still, early research on male–female differences (Maccoby & Jacklin, 1974) reported significant differences in four areas: verbal ability, favoring females; mathematical ability, favoring males; spatial–visual ability (evident, for example, in navigation and orientation skills), favoring males; and aggression (higher in males).
Figure 2.5: Fair assessment accommodations for children with special needs
These are only a few of the many possible accommodations that might be required for fair assessment of children with special needs. Each child's requirements might be different. Note, too, that some of these accommodations might increase the fairness of assessments for all children.

Possible Accommodations for Fair Assessment of Students with Special Needs
Instructional Accommodations
• teacher aides and other professional assistance
• special classes and programs
• individual education plans
• special materials such as large print or audio devices
• provisions for reducing test anxiety
• increased time for learning
Testing Accommodations
• increased time for test completion
• special equipment for test-taking
• different form of test (for example, verbal rather than written)
• giving test in different setting
• testing in a different language
Many of these differences are no longer as apparent now as they were in 1974. There is increasing evidence that when early experiences are similar, differences are minimal or nonexistent (Strand, 2010).

But the point is that experiences are not always similar, nor are opportunities and expectations. In the results of many assessments, there are still gender differences. These often favor males in mathematics and females in language arts (e.g., De Lisle, Smith, Keller, & Jules, 2012). And there is evidence that the stereotypes many people still hold regarding, say, girls' inferiority in mathematics might unfairly affect girls' opportunities and their outcomes.

In an intriguing study, Jones (2011) found that when women were shown a video supporting the belief that females perform more poorly than males in mathematics, subsequent tests revealed a clear gender difference in favor of males on a mathematics achievement test. But when they were shown a video indicating that women performed as well as men, no sex differences were later apparent.
Inconsistent Grading
Approaches to grading can vary enormously in different schools and even in different classrooms within the same school. They might involve an enormous range of practices, including
• Giving or deducting marks for good behavior
• Giving or deducting marks for class participation
• Giving or deducting marks for punctuality
• Using well-defined rubrics for grading
• Basing grades solely on test results
• Giving zeros for missed assignments
• Ignoring missed assignments
• Using grades as a form of reward or punishment
• Grading on any of a variety of letter, number, percentage, verbal descriptor, or other systems
• Allowing students to disregard their lowest grade
• And on and on . . .

No matter what practices are used in a given school, for assessments to be fair, grades need to be arrived at in a predictable and transparent manner. Moreover, the rules and practices that underlie their calculation need to be consistent. This approach is also critical for describing what students know and are able to do. If a math grade is polluted with behavioral objectives such as participation, how will the student and parents know what the student's math skills are?

Inconsistent grading practices are sometimes evident in disparities within schools, where different teachers grade their students using very different rules. In one class, for example, students might be assured of receiving relatively high grades if they dutifully complete and hand in all their assignments as required. But in another class, grades might depend entirely on test results. And in yet another, grades might be strongly influenced by class participation or by spelling and grammar.
Inconsistent grading within a class can also present serious problems of fairness for students. A social studies teacher should not ignore grammatical and spelling errors on a short-answer test one week and deduct handfuls of marks for the same sorts of errors the following week. Simply put, the criteria that govern grading should be clearly understood by both the teacher and students, and those criteria should be followed consistently.
Cheating and Test Fairness
Most of us, even the cheaters among us, believe that cheating is immoral. Sometimes it is even illegal—such as when you do it on your income tax return. And clearly, cheating is unfair. First, if cheating results in a higher than warranted grade, then it does not represent the student's progress or accomplishments—which hardly seems fair.

Second, those who cheat, by that very act, cheat other students. I once took a statistics course where, in the middle of a dark night, a fellow student sneaked up the brick wall of the education building, jimmied open the window to Professor Clark's office, and copied the midterm exam we were about to take. He then wrote out what he thought were all the correct answers and sold copies to a bunch of his classmates.

I didn't buy. No money, actually. And I didn't do nearly as well on the test as I expected. I thought I had answered most of the questions correctly; but, this being a statistics course, the raw scores (original scores) were scaled so that the class average would be neither distressingly low nor alarmingly high.

The deception was soon uncovered. Some unnamed person later informed Professor Clark who, after reexamining the test papers, discovered that 10 of his 35 students had nearly identical marks. More telling was that on one item, all 10 of these students made the same, highly unlikely, computational error.
Cheating is not uncommon in schools, especially in higher grades and in postsecondary programs where the stakes are so much higher. In addition, today there are far more opportunities for cheating than there were in the days of our grandparents. Wireless electronic communication; instant transmission of photos, videos, and messages; and wide-scale access to Internet resources have seen to that.

High-Stakes Tests and Cheating
There is evidence, too, that high-stakes testing may be contributing to increased cheating, especially when the consequences of doing well or poorly can dramatically affect entire school systems. For example, state investigators in Georgia found that 178 administrators and teachers in 44 Atlanta schools who had early access to standardized tests systematically cheated to improve student scores (Schachter, 2011).

Some school systems cheat on high-stakes tests by excluding certain students who are not expected to do well; others cheat by not adhering to guidelines for administering the tests, perhaps by giving students more time or even by giving them hints and answers (Ehren & Swanborn, 2012).

A more subtle form of administrative and teacher cheating on high-stakes tests takes the form of "narrowing" the curriculum. In effect, instructional objectives are narrowed to topics covered by the tests, and instruction is focused specifically on those targets to the exclusion of all others. This practice, notes Berliner (2011), is a rational—meaning "reasonable or intelligent"—reaction to high-stakes testing.

With the proliferation of online courses and online universities, the potential for electronic cheating has also increased dramatically (Young, 2012). For example, online tests can be taken by the student, the student's friend, or even some paid expert, with little fear of detection.
Preventing Cheating
Among the various suggestions for preventing or reducing
cheating on exams are the following:
• Encourage students to value honesty.
• Be aware of school policy regarding the consequences of
cheating, and communicate
them to students.
45. • Clarify for students exactly what cheating is.
• When possible, use more than one form of an exam so that
no two adjacent students
have the same form.
• Stagger seats so that seeing other students’ work is
unlikely.
• Randomize and assign seating for exams.
• Guard the security of exams and answer sheets.
• Monitor exams carefully.
• Prohibit talking or other forms of communication during
exams.
Of course, none of these tactics, or even all of them taken
together, is likely to guarantee
that none of your students cheat. In fact, one large-scale study
found that 21% of 40,000
undergraduate students surveyed had cheated on tests, and an
astonishing 51% had cheated
at least once on their written work (McCabe, 2005; Figure 2.6).
Sadly, that cheating is prevalent does not justify it. Nor does it
do anything to increase the
fairness of our testing practices.
Figure 2.6: Cheating among college undergraduates
Percentage of 40,000 undergraduate students who admitted having cheated at least once: 51% on written work; 21% on tests.
Source: Based on McCabe, D. (2005). It takes a village: Academic dishonesty. Liberal Education. Retrieved September 2, 2012, from http://www.middlebury.edu/media/view/257515/original/It_takes_a_village.pdf
Validity Chapter 2
Figure 2.7 summarizes the main characteristics of fair assessment practices. Related to this, Table 2.3 presents the American Psychological Association (APA) Code of Fair Testing Practices in Education. Because of its importance, the code is reprinted in its entirety at the end of this chapter.

2.3 Validity
In addition to the characteristics of fair assessment practices listed in Figure 2.7 and Table 2.3, the fairness of a test or assessment system depends on the reliability of the test instruments and the validity of the inferences made from the test results.
Simply put, a test is valid if it measures what it is intended to measure. For example, a high schooler's ACT scores should not be used to decide if a student should have a driver's license. The test is designed to predict college performance rather than readiness to drive. From a measurement point of view, validity is the most important characteristic of a measuring instrument. If a test does not measure what it is intended to measure, the scores derived from it are of no value whatsoever, no matter how consistent and predictable they are.

Test validity has to do not only with what the test measures, but also with how the test results are used. It relates to the inferences we base on test results and the consequences that follow. In effect, interpreting test scores amounts to making an inference about some quality or characteristic of the test taker.

For example, based on Nora's brilliant performance on a mathematics test, her teacher infers that Nora has commendable mathematical skills and understanding. And one consequence of this inference might be that Nora is invited to join the lunchtime mathematics enrichment group. But note that the inference and the consequence are appropriate and defensible only if the test on which Nora performed so admirably actually measures relevant mathematical skills and understanding.

The important point is that in educational assessment, validity is closely related to the way test results are used. Accordingly, a test may be valid for some purposes but totally invalid for others.
Figure 2.7: Fair assessment practices
Assessments are not always fair for all learners. But their fairness can be improved by paying attention to some simple guidelines.

The Fairest Assessment Practices
• Cover material that every student has had an opportunity to learn.
• Reflect learning targets for that course.
• Allow sufficient time for students to finish the test.
• Discourage cheating.
• Provide accommodations for learners with special needs.
• Ensure that tests are free of biases and stereotypes.
• Avoid misleading questions.
• Follow consistent and clearly understood grading practices.
• Base important decisions on a variety of different assessments.
• Take steps to ensure the validity and reliability of assessments.
Face Validity
How can you determine whether a test is valid? Put another way, how do you know a test measures what it says it measures? Or what it is intended to measure?

There are a number of ways of answering these questions. One of the most obvious is to look at the items that make up the test. Does the mathematics test look like it measures mathematics? Does the grammar test appear to be a grammar test?

Answers to these sorts of questions determine the face validity of the test. Basically, face validity is the extent to which the test appears to measure what it is supposed to measure. If the mathematics test consists of appropriate mathematical problems, it has face validity.

Face validity is especially important for teacher-made tests. Just by looking at a test, students should immediately know that they are being tested on the right things. A mathematics test that has face validity will not ask a series of questions based on Shakespeare's Julius Caesar.

Occasionally, however, test makers are careful to avoid any hint of face validity. For example, if you wanted to construct a test designed to measure a personality characteristic such as honesty, you probably wouldn't want your test participants to know what is being measured. If your instrument had face validity—that is, if it looked like it was measuring honesty—the scoundrels who take it might actually lie and act as if they are honest when they really aren't. Better to deceive them, lie to them, pretend you are testing motivational qualities or character strength, so you can determine what liars and rogues they really are.
Content Validity
Of course, a test must not only look as though it measures what it is intended to measure, but should actually do so. That is, its content should reflect the instructional objectives it is designed to assess. This indicator of validity, termed content validity, is assessed by analyzing the content of test items in relation to the objectives of the course, unit, or lesson.

Determining Content Validity
Content validity is one of the most important kinds of validity for measurements of school achievement. A test with high content validity includes items that sample all important course objectives in proportion to their importance. Thus, if some of the objectives of an instructional sequence have to do with the development of cognitive processes, a relevant test will have content validity to the extent that it samples these processes. And if 40% of the course content (and, consequently, of the course objectives) deals with knowledge (rather than with comprehension, analysis, and so on), 40% of the test items should assess knowledge.

Determining the content validity of a test is largely a matter of careful, logical analysis of the items it comprises. Basically, the test needs to include a sample of items that tap the knowledge and skills that define course objectives.
Increasing Content Validity
As Wilson, Pan, and Schumsky (2012) explain, the basic process the test maker should follow to ensure content validity involves the following steps:
1. Define the content (the instructional objectives).
2. Define the level of difficulty or abstraction for the items.
3. Develop a pool of representative items.
4. Determine what ratio of different items best represents the instructional objectives.
5. Develop a test blueprint.

One of the main advantages of preparing a test blueprint (also referred to as a table of specifications) is that it ensures a relatively high degree of content validity (providing, of course, that the test maker follows the blueprint).
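The proportionality rule behind steps 4 and 5 can be checked mechanically. A minimal sketch (the objective categories, weights, and item counts below are hypothetical, not drawn from the chapter) compares a blueprint's intended weighting against the items actually written for a 50-item test:

```python
# Hypothetical blueprint: the share of course objectives at each cognitive
# level, and the number of items actually written for each level.
blueprint = {"knowledge": 0.40, "comprehension": 0.35, "application": 0.25}
items_written = {"knowledge": 20, "comprehension": 18, "application": 12}

total_items = sum(items_written.values())  # 50 items in all
for level, target_share in blueprint.items():
    actual_share = items_written[level] / total_items
    gap = actual_share - target_share  # positive = over-sampled, negative = under-sampled
    print(f"{level:14s} target {target_share:.0%}  actual {actual_share:.0%}  gap {gap:+.0%}")
```

A test maker who keeps every gap near zero is, in effect, following the blueprint, which is what gives the resulting test its claim to content validity.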
It’s important to realize that tests and test items do not possess
validity as a sort of intrin-
sic quality; a test is not generally valid or generally invalid in
and of itself. Rather, it is valid
for certain purposes and with certain individuals, and it is
invalid for others. For example, if
the following item is intended to measure comprehension, it
does not have content validity,
because it measures only simple recall:
52. How many different kinds of validity are discussed in this
chapter?
A. 1
B. 2
C. 3
D. 5
E. 10
If, on the other hand, the item were intended to
measure knowledge of specifics, it would have
content validity. And an item such as the follow-
ing might have content validity with respect to
measuring comprehension:
Explain why face validity is important for
teacher-constructed tests.
Note, however, that this last item measures
comprehension only if students have not been
explicitly taught an appropriate answer. It is
quite possible to teach principles, applications,
analyses, and so on as specifics, so that ques-
tions of this sort require no more than recall of
knowledge. What an item measures is not inher-
ent in the item itself so much as in the relation-
ship between the material as the student has
learned it and what the item requires.
Construct Validity
A third type of validity, construct validity, is somewhat less relevant for teacher-constructed tests but highly relevant for many other psychological measures (e.g., personality and intelligence tests).

Ryan McVay/Photodisc/Thinkstock
▲ One measure of validity is reflected in the extent to which the predictions we base on test results are borne out. If this boy does exceptionally well on this standardized battery of tests, will he also do well next year in fourth grade? In high school? In college?
Reliability Chapter 2
In essence, a construct is a hypothetical variable—an unobservable characteristic or quality, often inferred from theory. For example, a theory might argue that individuals who are highly intelligent should be reflective rather than impulsive—reflectivity being evident in the care and caution with which they solve problems or make decisions. Impulsivity would be apparent in a person's hastiness and in failure to consider all aspects of a situation. One way to determine the construct validity of a test designed to measure intelligence would then be to look at how well it correlates with measures of reflection and impulsivity (see Chapter 9 for a discussion of correlation—a mathematical index of relationships).
Criterion-Related Validity
If Harold does exceptionally well on all his 12th-grade year-end tests, his teachers might be justified in predicting that he will do well in college. Colleges that subsequently admit Harold into one of their programs because of his grade 12 marks are also making the same prediction.

Predictive Validity
At all levels, prediction is one of the main uses of summative (rather than formative) assessments. We assume that all students who do well on year-end fifth-grade achievement tests will do reasonably well in sixth grade. We also predict that those who perform poorly on these tests will not do well in sixth grade, and we might use this prediction as justification for having them undertake remedial work.

The extent to which our predictions are accurate reflects criterion-related validity. One component of this form of validity, just described, is labeled predictive validity. Predictive validity is easily measured by looking at the relationship between actual performance on a test and subsequent performance. Thus, a college entrance examination designed to identify students whose chances of college success are high has predictive validity to the extent that its predictions are borne out.
Concurrent Validity
Concurrent validity, a second aspect of criterion-related validity, is the relationship between a given test and other measures of the same behaviors or characteristics. For example, as we see in Chapter 10, the most accurate way to measure intelligence is to administer a time-consuming and expensive individual test. A second option is to administer a quick, inexpensive group test; a third, far less consistent approach, is to have teachers informally assess intelligence based on what they know of their students' achievements and effort. Teachers' assessments are said to have concurrent validity to the extent that they are similar to the more formal measures. In the same way, a group or an individual test is said to have concurrent validity if it agrees well with measures obtained using a different and presumably valid test.

Figure 2.8 summarizes the various approaches to determining test validity.
2.4 Reliability
Reliability is what we want in our cars, our computers, our
spouses, our dogs. We want
our cars and our computers to start when we go through the
appropriate motions, and we
want them to function as they were designed to function. So,
too, with dogs and spouses:
Reliability Chapter 2
Reliability is predictability and consistency. If you stepped on
your bathroom scale five times
in a row, you would expect it to display the same weight each
time.
Reliability in educational measurement is no different.
Basically, it has to do with consistency.
Good measuring instruments must not only measure what they
are intended to measure (they
must have validity); they must also provide consistent,
dependable, reliable measures.
Reliability in testing has to do with the accuracy of our
measurements. The more errors there
are in our measurements, the less reliable will be our test
results. A reliable intelligence test,
for example, should yield similar results from one week to the
next. Or even from one year to
the next.
But the reliability of most of our educational and psychological
measures is never perfect. If
you give Roberta an intelligence test this week and another in
two weeks, it is highly unlikely
that her scores will be identical. No matter, we say, as long as
the difference between the two
scores is not too great. After all, many factors can account for
this error of measurement.
Figure 2.8: Types of test validity
Validity is closely related to the ways that a test is used. If a
test is not valid, it is also likely to be unreliable and unfair.
[Diagram: Test Validity branches into four types. Face (the test
appears to measure what it says it measures); Content (the test
samples behaviors that represent both the topics and the processes
implicit in course objectives); Construct (the test taps
hypothetical variables that underlie the property being tested);
and Criterion-Related, which comprises Predictive (test scores are
valuable predictors of future performance in related areas) and
Concurrent (test scores are closely related to similar measures
based on other presumably valid tests).]
Say Roberta scored 123 the first week but only 102 the second.
The difference between
the two scores may be because Roberta had a headache at the
time of the second testing.
Perhaps she was distracted by personal problems or tired from a
long trip or anxious about
the test or confused by some new directions.
In psychology and education, we tend to assume that the things
we measure are relatively
stable. But the emphasis should be on the word relatively
because we know that much of
what we measure is variable. So at least some of the error in our
measurements is likely due
to instability and change in what we measure. But if two similar
measures of achievement in
chemistry yield a score of 82% one week but only 53% the next
week for the same student,
then the test we are using may well have a reliability problem.
How can we assess the reli-
ability of our tests?
Test-Retest Reliability
If a test measures what it purports to (that is, if it is valid), and
if what it measures does not
fluctuate unpredictably, no matter how often it is given, the test
should yield similar scores. If
it doesn’t, it is not only unreliable but probably invalid as well.
In fact, a test cannot be valid
without being reliable. If it yields inconsistent scores for a
stable characteristic, we can hardly
insist that it is measuring what it is supposed to measure.
That a test should yield similar scores from one testing to the
next—unless, of course, the
test is simple enough that the student learns and remembers
appropriate responses—is the
basis for one of the most common measures of reliability.
Giving the same test two or more
times and comparing the results obtained at each testing yields a
measure of what is known
as test-retest reliability (sometimes also called repeated-
measures reliability or stability
reliability).
Say, for example, that I give a group of first-grade students a
standardized language profi-
ciency test (let’s call it October Test) at the end of October and
then give them the same test
again at the end of November. Assume the results are as shown
in columns 2 and 3 of Table
2.1 (“October Test Results” and “Hypothetical November Test
Results”). We can see immedi-
ately that the test yields consistent, stable scores and is
therefore highly reliable. Students who
scored high in October continue to score high in November—as
we would expect given our
assumption that language proficiency should not change
dramatically in one month.
Table 2.1 Test-retest reliability

Student | October Test Results | Hypothetical November Test Results | Alternate November Test Results
A | 72 | 75 | 92
B | 84 | 83 | 55
C | 56 | 57 | 80
D | 79 | 82 | 72
E | 55 | 57 | 78
F | 84 | 79 | 48
G | 91 | 88 | 66
Suppose, however, the results were as shown in columns 2 and 4
(“October Test Results” and
“Alternate November Test Results”). Unless we have some other
logical explanation, we would
have legitimate questions about the reliability of this language
proficiency test. Now some of
the students who scored high in October do very poorly in
November; and others who did
poorly in October do astonishingly well in November.
Statistically, the reliability of this test would be obtained by
looking at the correlation between
scores obtained on the test and those obtained on the retest (see
Chapter 9 for an explana-
tion of correlation). The first chart in Figure 2.9 shows how the
hypothetical November results
closely parallel the October results. In fact, there is a high
positive correlation (+.98) between
these results. The second chart in Figure 2.9 shows how the
alternate November results do
not parallel October results. In fact, the correlation between the
two is negative (–.63).
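The arithmetic behind those two coefficients is easy to reproduce. A minimal Python sketch, using the Table 2.1 scores and a hand-rolled Pearson formula, yields both values:

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    syy = sum(v * v for v in y)
    sxy = sum(a * b for a, b in zip(x, y))
    return (n * sxy - sx * sy) / math.sqrt(
        (n * sxx - sx * sx) * (n * syy - sy * sy)
    )

# Scores from Table 2.1 for students A through G.
october = [72, 84, 56, 79, 55, 84, 91]
november_hypothetical = [75, 83, 57, 82, 57, 79, 88]
november_alternate = [92, 55, 80, 72, 78, 48, 66]

print(round(pearson(october, november_hypothetical), 2))  # high reliability: 0.98
print(round(pearson(october, november_alternate), 2))     # low reliability: -0.63
```

The same function works for any pair of testings, which is why the correlation coefficient serves as the common currency for all of the reliability measures discussed in this section.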
Figure 2.9: Test-retest reliability
If a test is reliable, it should yield similar scores when given to
the same students at different times. Chart 1 (based on Table 2.1)
shows high reliability (correlation +.98); Chart 2 illustrates low
reliability (correlation –.63).
[Charts: scores of students A through G. Chart 1 plots October test
results against the hypothetical November test results; Chart 2
plots October test results against the alternate November test
results.]
Parallel-Forms Reliability
Test-retest measures of reliability look at the correlation
between results obtained by giving
the same test twice to the same individuals. But in some cases,
it isn’t possible or convenient
to administer the same test twice. If the test is very simple, or if
it contains striking and highly
memorable questions (or answers), some learners may improve
dramatically from one test-
ing to the next. Or, if the teacher goes over the test and
discusses possible responses, some
students might learn enough to improve, and others might not.
A second approach to estimating test reliability gets around this
problem by administering a
different form of the test the second time. The different form of
the test is designed to be
highly similar to the first and is expected to yield similar
scores. It is therefore labeled a paral-
lel form. The correlation between these parallel forms of the
same test yields a measure of
parallel-forms reliability (also termed alternate-form
reliability). Figure 2.10 plots the scores
obtained by seven students on parallel forms of a test. Note how
the results follow each other.
That is, a student who scores high on form A of the test is also
likely to score high on form B.
In this case, the correlation between the two forms is .86.
Split-Half Reliability
Teachers seldom go to the trouble of making up two forms of
the same test and establishing
that they are equivalent. Fortunately, there is another clever
way of calculating test reliability.
The reasoning goes something like this: If I prepare a
comprehensive test made up of a large
number of items, then many of the items on this test will
overlap in what they assess. It is
therefore reasonable to assume that if I were to split the test in
two and administer each half
to my students, their scores on the two halves would be highly
similar. But I don’t really need
to split the test: All I need to do is give the entire test to all
students and then score the test
as though I had actually split it.
Figure 2.10: Parallel-forms reliability
The relationship between scores on two parallel forms of the same
test given to the same group is an indication of how dependably and
consistently (reliably) the test measures.
[Chart: Test A and Test B results for each student; the two sets of
scores closely track each other.]
Suppose, for example, that my original test consisted of 100
multiple-choice items, carefully
constructed to tap all my instructional objectives. When I score
the test, I might consider the
50 even-numbered items as one test, and the other 50 as a
separate test. I can now easily
generate two scores for each student, one for each half of my
split test. And when I calculate
the correlation between these two halves of the test, I will have
determined what is called
split-half reliability. Figure 2.11 illustrates split-half reliability
based on a 90-item test split
into two 45-item halves. Note that the longer the test, the more
accurate the measure of reli-
ability. Figure 2.12 summarizes the various ways of assessing
test reliability.
Figure 2.11: Split-half reliability
A single test scored as though it were two separate tests provides
information for judging its internal consistency (reliability). In
this case, the correlation between the two test halves is .80.
[Chart: each student's (A through K) score on each half of the
split test.]
Figure 2.12: Measures of test reliability
Test reliability reflects the stability and consistency of a
measure. It, along with fairness and validity, is an extremely
important quality of educational assessments.
Factors That Affect Reliability
Say you require a unit-end assessment and ranking of your
students in a 12th-grade physics
course. But you have stupidly put off building your final exam
until the night before it is to be
administered. So you write out a single question. Then, having
just completed a measurement
course, you are clever enough to devise a list of detailed scoring
criteria. The question and
scoring criteria are shown in Table 2.2.
Table 2.2 Illustrative single-question 12th-grade physics exam

Question: Explain, in your own words, the details of vertical
projectile motion.

Scoring Criteria | Points
Describes what is meant by motion in a gravitational field | 10
Explains acceleration | 10
Mentions zero velocity at zenith | 5
Includes free-fall equation | 10
Applies free-fall equation to hypothetical situation | 20
Includes graph of vertical projectile motion | 10
Maximum Points | 65
Length of Test
If your physics unit covered only vertical projectile motion, and
if your instructional objectives
are well represented in your scoring criteria, your one-item
exam might be quite good. Under
these circumstances, it might actually measure what you intend
to measure (it would have
high validity). And, given careful application of your scoring
criteria, the results might be con-
sistent and stable (it would have reasonable reliability).
But if your unit also covered topics such as elastic and inelastic
collisions, relative velocity,
notions of frames of reference, and other related topics, your
single-item test would be about
as useful as a snowmobile in Los Angeles.
Although it might occasionally be possible to achieve an
acceptable level of validity and reli-
ability with a single item, in most cases it is not possible. Poor
reliability, of course, is especially
likely if your test consists of objective test items such as
multiple-choice questions, matching
problems, or true-false exercises. It is difficult to imagine that a
single multiple-choice item
could measure all your instructional objectives. In most cases,
the more items in your test, the
more valid and reliable it is likely to be.
Stability of Characteristics
The stability of what is being measured also affects the
reliability of a test. If what we are
measuring is unstable and unpredictable, our measures are also
likely to be inconsistent and
unpredictable. However, we assume that most of what we
measure in education will not
fluctuate unpredictably. We know, for example, that cognitive
strategies develop over time
and that knowledge increases. Tests that are both valid and
reliable are expected to reflect
these changes. These are predictable changes that don’t reduce
the reliability of our measur-
ing instruments.
The Effects of Chance
Another factor that can affect the reli-
ability of a test is chance, especially with
respect to objective, teacher-made tests.
We know, for example, that the chance of
getting a true-false item correct, all other
things being equal, is 50–50. If you give a
60-item, true-false, graduate-level plasma
physics test to a large group of intelligent
fourth-graders, they can be expected to
answer an average of around 30 items
correctly by chance—unless Lady Luck is
looking pointedly in the other direction.
And a few of the luckier individuals in this
class may have astonishingly high scores.
But a later administration of this test
might lead to startlingly different scores,
resulting in an extraordinarily low measure
of test-retest reliability.
One way to reduce the effects of chance
is to make tests longer or to use a larger
number of short tests. The important point is that teachers
should not base any important
decision on only one or two measures.
Item Difficulty
Test reliability is also affected by the difficulty of items. Tests
that are made up of excessively
easy or impossibly difficult items will almost invariably have
lower measured reliability scores
than tests composed of items of moderate difficulty. Other
things being equal, very easy and
very difficult items tend to result in less consistent patterns of
responding.
Relationship Between Validity and Reliability
It’s important to realize that a test cannot be valid without also
being reliable. If what we
want to measure is a stable characteristic, and if the measures
we obtain are inconsistent
and unpredictable (hence, unreliable), then we clearly aren't
measuring what we intend to measure. Before making decisions about
your students, you need to have some knowledge of the reliability
of the assessments on which you base your decisions.
It might be useful to know that the internal consistency (split-
half reliability, for example) of
teacher-made tests is around .50. As we see in Chapter 9, which
deals with statistical mea-
sures, this is a modest index of reliability. The fact is that most
teacher-made tests have a
relatively high degree of measurement error.
Standardized tests, on the other hand, tend to have reliabilities
of around .90 (Frisbie, 1988).
As a result, the most important decisions that affect the lives of
students should be based on
carefully selected standardized tests—and on professional
opinion where necessary—rather
than on teacher-constructed tests, hunches, or intuitive
impressions.
Figure 2.15 summarizes a number of ways in which the
reliability and validity of educational
assessments can be increased. Table 2.3 is the American
Psychological Association’s Code of
Fair Testing Practices in Education (2004). Especially important
are suggested guidelines for
test users with respect to selecting tests, administering them,
and interpreting and reporting
test results.
Figure 2.13: Three essential qualities of educational assessment
The most subjective of these qualities, fairness, is often the one
students think is most important.
[Diagram: Fairness is a subjective estimate influenced by the
extent to which the material tested has been covered; all students
have had an equal opportunity to learn; sufficient time is allowed
for testing; there are safeguards against cheating; assessments are
free of biases and stereotypes; misleading and trick questions have
been avoided; accommodations are made for special needs; and
grading is consistent. Reliability is consistency and accuracy of
measurement, estimated by testing and retesting, parallel-forms
tests, and split-half tests. Validity is the extent to which a test
measures what it is meant to measure, estimated by face
(appearance), content, construct, and criterion-related (predictive
and concurrent) approaches. Reliability is necessary for validity.]
Figure 2.14: Human intelligence scale
Not really an intelligence test. Simply illustrates that highly
reliable (consistent) measures can be desperately invalid.

The 23rd-Century Human Intelligence Scale
Please answer each question as briefly as possible.
Name ______ Age ______
Address ______

Questions | Acceptable Answers
What is your name? | Correct if matches above
What is your address? | Correct if matches above
How old were you on your last birthday? | Correct if matches above
What is your mother’s name? | Any name, blank, or “Don’t know” accepted
What is your father’s name? | Any name, blank, or “Don’t know” accepted
Do you have a dog? | “Yes” or “No” or “Don’t know”
Do you have a cat? | “Yes” or “No” or “Don’t know”
Would you like a dog? | “Yes” or “No” or “Don’t know”
Would you like a cat? | “Yes” or “No” or “Don’t know”
What is your name? | Should match first question

Scoring: 10 IQ points per answer. Maximum: 100
Figure 2.15: Improving reliability and validity
For important educational decisions to be fair, they must be based
on the most valid and reliable assessments possible.

Suggestions for Improving Test Validity:
• Use clear and easily understood tasks
• Sample from all skill and content areas
• Select items to reflect the importance of specific objectives
• Allow sufficient time for all students to complete the test
• Use blueprints to guide instruction and test construction
• Analyze items to determine how well they match learning targets
• Check to see if students who do well on your tests also do well
in other comparable classes
• Use a variety of approaches to assessment

Suggestions for Improving Test Reliability:
• Make tests longer
• Enlist the assistance of other raters when using performance
assessments
• Develop moderately difficult rather than excessively easy or very
difficult items
• Try to eliminate subjective influences in scoring
• Develop and use clear rubrics and checklists for scoring
performance assessments
• Restrict distracting influences whenever possible
• Use a variety of different assessments
• Eliminate or reduce the possibility that chance might affect test
outcomes
Table 2.3 The APA Code of Fair Testing Practices in
Education*
A. Developing and Selecting Appropriate Tests
Test Developers Test Users
Test developers should provide the information and
supporting evidence that test users need to select
appropriate tests.
Test users should select tests that meet the intended
purpose and that are appropriate for the intended test
takers.
A-1. Provide evidence of what the test measures, the
recommended uses, the intended test takers, and the
strengths and limitations of the test, including the level
of precision of the test scores.
A-1. Define the purpose for testing, the content and
skills to be tested, and the intended test takers. Select
and use the most appropriate test based on a thorough
review of available information.
A-2. Describe how the content and skills to be tested
were selected and how the tests were developed.
A-2. Review and select tests based on the appropriate-
ness of test content, skills tested, and content coverage
for the intended purpose of testing.
A-3. Communicate information about a test’s charac-
teristics at a level of detail appropriate to the intended
test users.
A-3. Review materials provided by test developers and
select tests for which clear, accurate, and complete
information is provided.
A-4. Provide guidance on the levels of skills,
knowledge, and training necessary for appropriate
review, selection, and administration of tests.
A-4. Select tests through a process that includes
persons with appropriate knowledge, skills, and
training.
A-5. Provide evidence that the technical quality,
including reliability and validity, of the test meets its
intended purposes.
A-5. Evaluate evidence of the technical quality of the
test provided by the test developer and any indepen-
dent reviewers.
A-6. Provide to qualified test users representative
samples of test questions or practice tests, directions,
answer sheets, manuals, and score reports.
A-6. Evaluate representative samples of test questions
or practice tests, directions, answer sheets, manuals,
and score reports before selecting a test.
A-7. Avoid potentially offensive content or language
when developing test questions and related materials.
A-7. Evaluate procedures and materials used by test
developers, as well as the resulting test, to ensure that
potentially offensive content or language is avoided.
A-8. Make appropriately modified forms of tests or
administration procedures available for test takers with
disabilities who need special accommodations.
A-8. Select tests with appropriately modified forms or
administration procedures for test takers with disabili-
ties who need special accommodations.
A-9. Obtain and provide evidence on the performance
of test takers of diverse subgroups, making significant
efforts to obtain sample sizes that are adequate for
subgroup analyses. Evaluate the evidence to ensure
that differences in performance are related to the skills
being assessed.
A-9. Evaluate the available evidence on the perfor-
mance of test takers of diverse subgroups. Determine
to the extent feasible which performance differences
may have been caused by factors unrelated to the skills
being assessed.
B. Administering and Scoring Tests
Test Developers Test Users
Test developers should explain how to administer and
score tests correctly and fairly.
Test users should administer and score tests correctly
and fairly.
*Copyright 2004 by the Joint Committee on Testing Practices.
This material may be reproduced in its entirety without fees or
permission,
provided that acknowledgment is made to the Joint Committee
on Testing Practices. Any exceptions to this, including requests
to excerpt or
paraphrase this document, must be presented in writing to
Director, Testing and Assessment, Science Directorate, APA.
This edition replaces
the first edition of the Code, which was published in 1988.
Source: From Code of Fair Testing Practices in Education.
(2004). Washington, DC: Joint Committee on Testing Practices.
(Mailing address:
Joint Committee on Testing Practices, Science Directorate,
American Psychological Association, 750 First Street, NE,
Washington, DC
20002-4242)
B-1. Provide clear descriptions of detailed procedures
for administering tests in a standardized manner.
B-1. Follow established procedures for administering
tests in a standardized manner.
B-2. Provide guidelines on reasonable procedures for
assessing persons with disabilities who need special
accommodations or those with diverse linguistic
backgrounds.
B-2. Provide and document appropriate procedures for
test takers with disabilities who need special accom-
modations or those with diverse linguistic backgrounds.
Some accommodations may be required by law or
regulation.
B-3. Provide information to test takers or test users on
test question formats and procedures for answering
test questions, including information on the use of any
needed materials and equipment.
B-3. Provide test takers with an opportunity to become
familiar with test question formats and any materials or
equipment that may be used during testing.
B-4. Establish and implement procedures to ensure the
security of testing materials during all phases of test
development, administration, scoring, and reporting.
B-4. Protect the security of test materials, including
respecting copyrights and eliminating opportunities for
test takers to obtain scores by fraudulent means.
B-5. Provide procedures, materials and guidelines for
scoring the tests, and for monitoring the accuracy of
the scoring process. If scoring the test is the responsi-
bility of the test developer, provide adequate training
for scorers.
B-5. If test scoring is the responsibility of the test user,
provide adequate training to scorers and ensure and
monitor the accuracy of the scoring process.
B-6. Correct errors that affect the interpretation of
the scores and communicate the corrected results
promptly.
B-6. Correct errors that affect the interpretation of
the scores and communicate the corrected results
promptly.
B-7. Develop and implement procedures for ensuring
the confidentiality of scores.
B-7. Develop and implement procedures for ensuring
the confidentiality of scores.
C. Reporting and Interpreting Test Results
Test Developers Test Users
Test developers should report test results accurately
and provide information to help test users interpret test
results correctly.
Test users should report and interpret test results accu-
rately and clearly.
C-1. Provide information to support recommended
interpretations of the results, including the nature of
the content, norms or comparison groups, and other
technical evidence. Advise test users of the benefits
and limitations of test results and their interpreta-
tion. Warn against assigning greater precision than is
warranted.
C-1. Interpret the meaning of the test results, taking
into account the nature of the content, norms or
comparison groups, other technical evidence, and
benefits and limitations of test results.
C-2. Provide guidance regarding the interpretations
of results for tests administered with modifications.
Inform test users of potential problems in interpreting
test results when tests or test administration proce-
dures are modified.
C-2. Interpret test results from modified test or test
administration procedures in view of the impact those
modifications may have had on test results.
C-3. Specify appropriate uses of test results and warn
test users of potential misuses.
C-3. Avoid using tests for purposes other than those
recommended by the test developer unless there is
evidence to support the intended use or interpretation.
C-4. When test developers set standards, provide
the rationale, procedures, and evidence for setting
performance standards or passing scores. Avoid using
stigmatizing labels.
C-4. Review the procedures for setting performance
standards or passing scores. Avoid using stigmatizing
labels.
C-5. Encourage test users to base decisions about test
takers on multiple sources of appropriate information,
not on a single test score.
C-5. Avoid using a single test score as the sole deter-
minant of decisions about test takers. Interpret test
scores in conjunction with other information about
individuals.
C-6. Provide information to enable test users to accu-
rately interpret and report test results for groups of
test takers, including information about who were and
who were not included in the different groups being
compared, and information about factors that might
influence the interpretation of results.
C-6. State the intended interpretation and use of test
results for groups of test takers. Avoid grouping test
results for purposes not specifically recommended
by the test developer unless evidence is obtained to
support the intended use. Report procedures that were
followed in determining who were and who were not
included in the groups being compared and describe
factors that might influence the interpretation of
results.
C-7. Provide test results in a timely fashion and in a
manner that is understood by the test taker.
C-7. Communicate test results in a timely fashion and in
a manner that is understood by the test taker.
C-8. Provide guidance to test users about how to
monitor the extent to which the test is fulfilling its
intended purposes.
C-8. Develop and implement procedures for moni-
toring test use, including consistency with the intended
purposes of the test.
D. Informing Test Takers
Under some circumstances, test developers have direct
communication with the test takers and/or control of the
tests, testing process, and test results. In other circumstances
the test users have these responsibilities.
Test developers or test users should inform test takers about the
nature of the test, test taker rights and responsi-
bilities, the appropriate use of scores, and procedures for
resolving challenges to scores.
D-1. Inform test takers in advance of the test administration
about the coverage of the test, the types of question
formats, the directions, and appropriate test-taking strategies.
Make such information available to all test takers.
D-2. When a test is optional, provide test takers or their
parents/guardians with information to help them judge
whether a test should be taken—including indications of any
consequences that may result from not taking the
test (e.g., not being eligible to compete for a particular
scholarship) —and whether there is an available alternative
to the test.
D-3. Provide test takers or their parents/guardians with
information about rights test takers may have to obtain
copies of tests and completed answer sheets, to retake tests, to
have tests rescored, or to have scores declared
invalid.
D-4. Provide test takers or their parents/guardians with
information about responsibilities test takers have, such as
being aware of the intended purpose and uses of the test,
performing at capacity, following directions, and not
disclosing test items or interfering with other test takers.
D-5. Inform test takers or their parents/guardians how long
scores will be kept on file and indicate to whom, under
what circumstances, and in what manner test scores and related
information will or will not be released. Protect
test scores from unauthorized release and access.
D-6. Describe procedures for investigating and resolving
circumstances that might result in canceling or with-
holding scores, such as failure to adhere to specified testing
procedures.
D-7. Describe procedures that test takers, parents/guardians,
and other interested parties may use to obtain more
information about the test, register complaints, and have
problems resolved.
Section Summaries Chapter 2
Chapter 2 Themes and Questions
Section Summaries
2.1 Four Purposes of Educational Assessment Assessments can
be used to summarize
student achievement (summative); for selection and placement
purposes (placement); as a
basis for diagnosing strengths, problems, and deficits
(diagnostic); or as a means of provid-
ing feedback to improve teaching and learning (formative).
Despite these different emphases,
the overriding purpose of all forms of educational assessment is
to help the learner reach
important goals. Planning for effective assessment requires
clarifying and communicating
instructional objectives and aligning instruction and assessment
with these goals. As much
as possible, assessment should be an integral part of the
instructional process, designed to
provide immediate feedback for both teachers and learners to
assist and improve minute-to-
minute decisions. Teachers should use a variety of approaches
to assessment.
2.2 Test Fairness Tests are fair when they treat all learners in an
evenhanded manner and
when they provide all learners with an equal opportunity to
learn. Tests are unfair when they
examine content that has neither been covered nor assigned; if
they deliberately or unintentionally use misleading “trick”
questions; when some learners
have not been given sufficient
opportunity to learn the material; if they don’t allow sufficient
time for all learners to finish;
when they fail to accommodate to the special needs of
individual learners; when they reflect
biases and stereotypes in test construction; if scoring is
influenced by stereotypes and teacher
expectations; when steps are not taken to guard against
cheating; and if they are graded
inconsistently.
2.3 Validity A test is valid to the extent that it measures what it is intended to measure. Tests are seldom fair if they are not also valid. Face validity is a measure of the extent to which a test looks as though it measures what it is meant to measure. However, for some assessments (such as assessments of responder honesty), it is sometimes essential that an instrument not appear to measure what it is intended to measure. Content validity is determined by the extent to which items on the test reflect course objectives, and it is normally determined through a careful analysis and selection of items. Construct validity relates to how closely a test agrees with its theoretical underpinnings. It is a measure of how consistent it is with the thinking that gave rise to it, and it is more relevant for psychological assessments (e.g., personality or intelligence tests) than for teacher-made tests. Criterion-related validity has two aspects: One is defined by agreement between test results and predictions based on those results (predictive validity); the other, concurrent validity, is evident in the extent to which the results of a test agree with the results of other tests that measure the same characteristics.
2.4 Reliability Consistency and predictability are the hallmarks of high reliability. Reliability has to do with error of measurement; the greater the errors in our assessments, the lower the reliability. Reliability can be estimated by giving the same test to the same participants on more than one occasion. High reliability would be reflected in similar scores for each individual at different testings (test-retest reliability). Or it can be calculated by giving different forms of a test to the same subjects and comparing their results (parallel-forms reliability). A third option is to take a single test administered once, divide it into halves, score each half, and compare those scores (split-half reliability). Provided the test is long enough, scores on each half will be similar if it is reliable. Reliability is affected by the length of a test (generally higher with longer tests or with the frequent use of many shorter tests); by the stability of what is being measured (unstable characteristics tend to yield lower reliabilities); by chance factors; and by item difficulty (moderately difficult items usually lead to higher test reliability than very easy or very difficult items).
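The split-half procedure described above can be sketched in a few lines of code. This example uses invented item-level scores (rows are students, columns are ten right/wrong items), splits each student's test into odd- and even-numbered items, correlates the two half-scores, and then applies the Spearman-Brown correction, which adjusts for the fact that each half is only half as long as the full test. The data and function names are illustrative, not from the text:

```python
# Hypothetical quiz data: six students, ten items scored 0 (wrong) or 1 (right).
scores = [
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    [1, 0, 0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0, 0, 0],
]

def pearson_r(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Split each student's test into two halves: odd-numbered items
# (columns 0, 2, 4, ...) and even-numbered items (columns 1, 3, 5, ...).
odd_half = [sum(row[0::2]) for row in scores]
even_half = [sum(row[1::2]) for row in scores]

# Correlate the two half-test scores across students.
r_half = pearson_r(odd_half, even_half)

# Spearman-Brown correction: the half-test correlation underestimates
# the reliability of the full-length test.
split_half_reliability = (2 * r_half) / (1 + r_half)
```

For a positive half-test correlation, the corrected coefficient is always higher than the raw one, reflecting the point in the summary that longer tests are generally more reliable. Test-retest and parallel-forms estimates would reuse the same `pearson_r` step, correlating scores from two administrations or two forms instead of two halves.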
Applied Questions
1. What are the four principal uses of educational assessment? Research and defend the proposition that the most important function of educational assessment should be formative.
2. What are the main characteristics of good educational
assessment? Explain in your own
words why you think one of these characteristics is more
important than the other two in
educational measurement.
3. What makes a test unfair? List the things you think a teacher
can do to improve the fairness
of teacher-made tests.
4. Can a test be fair but invalid? Write a brief position paper
detailing why you think the
answer is yes, no, or maybe.
5. Can a test be valid but unreliable? Generate a test item to
illustrate and support your
answer for this question.
6. How can you determine the reliability of a test? Outline the
steps a teacher might take to
increase test reliability without having to go to the trouble of
actually calculating it.
7. How can you improve test validity and reliability? Make a
list of what you consider to be
guidelines for sound testing practices. Identify those that would
improve test validity and
reliability.
Key Terms
bias Personal prejudice (prejudgment) in favor of or against one person, thing, or idea over another, usually in a way that is unfair.