8/4/2014 I come not to bury summative assessments but to praise them | The Thomas B. Fordham Institute
http://edexcellence.net/commentary/education-gadfly-daily/common-core-watch/2012/i-come-not-to-bury-summative-assessments-but-to-praise-them.html# 1/3
I come not to bury summative assessments but to praise them
Kathleen Porter-Magee (/about-us/fordham-staff/kathleen-porter-magee)
February 10, 2012
The Northwest Evaluation Association recently surveyed parents and teachers (http://www.nwea.org/sites/www.nwea.org/files/PressReleaseAssessmentPerceptions.pdf) to gauge their support for various types of assessment. The results (http://www.edweek.org/ew/articles/2012/02/08/21tests.h31.html) indicated that just a quarter of teachers find summative assessments "'extremely' or 'very' valuable for determining whether students have a deep understanding of content." By contrast, 67 percent of teachers (and 85 percent of parents) found formative and interim assessments extremely or very valuable.

I can understand why teachers would find formative and interim assessments appealing. After all, teachers generally either create those assessments themselves, or are at least intimately involved with their creation. And they are, therefore, more flexible tools that can be tweaked depending on, for instance, the pace of classroom instruction.

But, while formative and interim assessments are critically important and should be used to guide instruction and planning, they cannot and should not be used to replace summative assessments, which play an equally critical role in a standards-driven system.
Summative assessments are designed to evaluate whether students have mastered knowledge and skills at a particular point in time. For instance, a teacher might give a summative assessment at the end of a unit to determine whether students have learned what they needed to in order to move forward. Similarly, an end-of-course or end-of-year summative assessment can help determine whether students mastered the content and skills outlined in a state's standards for that grade.
If you believe that we need standards to ensure that all students—regardless of their zip code or socioeconomic status—learn the same essential content and are held to the same standards, then it's essential to have an independent gauge that helps teachers, parents, administrators, and leaders understand where students are not reaching the goals we've set out for them.
Unfortunately, the NWEA survey does not make this clear, opting instead to narrowly define summative assessments only as "state or district-wide standardized tests that measure grade-level proficiency, and end-of-year subject or course exams."

It's hard to imagine many teachers who are going to be enthusiastic about the current "state or district-wide standardized tests" in use, which often include low-quality questions and the results of which typically don't reach teachers until it's too late to do anything with them. And so, by defining summative assessments in the particular rather than the general, the NWEA findings tell us less about how teachers feel about the value of summative assessments writ large, and more about how they feel about the current crop of state tests, which pretty much everyone agrees need significant improvement.

What's more, everyone has a natural bias in favor of the things they create themselves. And so, it's unsurprising that teachers find the assessments that they create and score (in real time) more useful than tests that are created and scored centrally.

Yet, having a set of common standards—whether common to all schools within a state, or common across all states—requires some independent measure of student learning. There needs to be some gauge—for
teachers, administrators, and parents—that helps show whether classroom instruction, materials, and even formative and interim assessments are aligned to the state standards in terms of both content and rigor, and that helps teachers and parents understand whether, in the end, students learned the essential content and skills they needed each year.
Of course, shifting the focus from teacher-created assessments to centrally developed state (or even district) assessments is difficult. And many teachers will resist being judged by something they had no hand in creating, and resist realigning instruction around standards that may look different from what they've taught in their classrooms for years.

In the end, if we want standards-driven reform to work, we need to get summative assessments right. Trading summative assessments for formative assessments isn't an option. They are different tools with very different roles in the system. That means policymakers and education leaders need to do a far better job of soliciting teacher feedback on these assessment tools, and they need to focus much more time and attention on delivering high-quality professional development that helps teachers use the data effectively to guide planning, instruction, and formative assessment development. But it also means that teachers in standards-driven schools need to accept that student learning will be measured by something other than the observations and assessments created within the four walls of their schools.
Teacher-Made Assessments

Focus Questions

After reading this chapter, you should be able to answer the following questions:

1. What are some important steps in planning for assessment?
2. What kinds of teacher-made assessment options are available?
3. What are some guidelines for constructing good selected-response assessments?
4. What are the advantages and limitations of selected-response assessments?
5. What are the advantages and limitations of constructed-response assessments?
A fool must now and then be right by chance.
—William Cowper

Even a blind squirrel sometimes finds a nut. And a fool is sometimes right by chance. But more often, the fool is wrong and the blind squirrel goes hungry—or ends up feeding a red-tailed hawk that is far from blind.

No one would ever have accused Leanne Crowder of being a fool—not only because she would have smacked you up the side of the head if you did, but also because she was clearly smarter than her more average classmates.

But she didn't always have time to study for the many little multiple-choice quizzes with which Mrs. Moskal liked to keep her classes on their toes. Yet she almost always did well on these tests.

"How d'ya do it?" asked Louis, who was trying hard to hang out with her.

"I guess," said Leanne. "I do well just by chance."
Chapter Outline

8.1 Planning for Teacher-Made Tests
    Goals and Instructional Objectives
    Test Blueprints
    Rubrics
    Approaches to Classroom Assessment

8.2 Performance-Based Assessments
    Types of Performance-Based Assessments
    Improving Performance-Based Assessment

8.3 Constructed- and Selected-Response Assessments
    What Are Selected-Response Assessments?
    What Are Constructed-Response Assessments?
    Objective Versus Essay and Short-Answer Tests
    Which Approach to Assessment Is Best?

8.4 Developing Selected-Response Assessments
    Multiple-Choice Items
    Matching Items
    True-False Items
    Interpretive Items

8.5 Developing Constructed-Response Assessments
    Short-Answer Constructed-Response Items
    Essay Constructed-Response Items
    Planning for Assessment

Chapter 8 Themes and Questions
    Section Summaries
    Applied Questions
    Key Terms
"That's a lie," said Louis, who was academically gifted but not especially socially intelligent. He went on to explain that by chance, Leanne might do well some of the time—as might any other student in the class. But if chance were the only factor determining her results, she should do very poorly most of the time. "If a multiple-choice item has four options," he expounded like a little professor, "and each of them is equally probable, if you have absolutely no idea which is correct, on average you should answer correctly 25% of the time. And you should be dead wrong three-quarters of the time."

"I'm dead right three-quarters of the time," Leanne smirked, "and I'm not going to any movie with you."
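Louis's argument can be checked with a short calculation: with four equally likely options per item, the number of correct answers from pure guessing follows a binomial distribution with p = 0.25, and scoring 75% that way is vanishingly unlikely. A minimal sketch in Python (the 20-item quiz length is an assumed figure, not from the anecdote):

```python
from math import comb

def prob_at_least(k, n, p=0.25):
    """Probability of getting at least k of n items right by blind guessing."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n = 20                          # assumed quiz length
expected = n * 0.25             # on average, 5 of 20 correct by chance alone
leanne = prob_at_least(15, n)   # chance of being "dead right three-quarters"
```

With p = 0.25, the probability of hitting 75% on a 20-item quiz is below one in a hundred thousand, which is exactly Louis's point: Leanne's results could not plausibly be chance.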
It turned out, as Louis eventually discovered, that Leanne had quickly noticed that Mrs. Moskal's test items were so poorly constructed that the clever application of a handful of guidelines almost always assured a high degree of success, even if you only knew a smattering of correct answers to begin with. For example, Mrs. Moskal made extensive use of terms like always, never, everywhere, and entirely in her multiple-choice options; Leanne knew that these are almost always false. She also knew that the longest, most inclusive options are more likely to be correct than shorter, very specific options. And she was clever enough to realize that options that don't match the question for grammatical or logical reasons are likely incorrect—as are silly or humorous options. And options like all of the above are always correct if two of the above are correct, and none of the above is more often incorrect than not.
Mrs. Moskal should have read this chapter!
8.1 Planning for Teacher-Made Tests

Reading the chapter might have improved Mrs. Moskal's construction of teacher-made tests (as opposed to standardized tests that are commercially prepared; these are discussed in Chapter 10).

Reading this chapter might have suggested to Mrs. Moskal that she should not rely solely on her memory and intuition when constructing a test, but that she should begin with a clear notion of what she is trying to teach. She then needs to decide on the best ways of determining the extent to which she has been successful. If her assessments are to be useful for determining how well her students have learned (the summative function of tests) and for improving their learning (the formative function of tests), she needs a clear notion of her instructional objectives, some detailed test blueprints, and perhaps some rubrics and checklists to help her evaluate student performances.
Goals and Instructional Objectives

Educational goals are the nation's, the state's, the school district's, or the teacher's general statements of the broad intended outcomes of the educational process. Instructional objectives are the more specific statements of intended learning outcomes relative to a lesson, a unit, or even a course. In most cases, instructional objectives reflect the broader goals of the curriculum. Whereas educational goals are often somewhat vague and idealistic, the most useful learning objectives for the classroom tend to be very explicit. Most are phrased in terms of behaviors that can be taught and learned, and that can be assessed.
National Learning Goals

The nation's educational goals, for example, are often detailed in legislation and regulations. As we saw in Chapter 1, in the United States, the No Child Left Behind Act expresses some very definite aims listed as five distinct goals. These are summarized in Figure 8.1.

The law, Public Law 107–110, states as its purpose, "To close the achievement gap with accountability, flexibility, and choice, so that no child is left behind." Among its broad targets are goals relating to

• Improving academic achievement of those with disadvantages
• Preparing, training, and recruiting high-quality teachers and school administrators
• Improving language instruction for those with limited proficiency in English
• Promoting informed parental choice and expanding available educational programs
• Increasing accountability and flexibility (NCLB, 2002)

Educational goals of this kind, laudable as they might be, are not easily reached. In fact, current statistics (and common sense) tell us that not a single one of NCLB's five goals as stated in Figure 8.1 has been reached. Nor will any be reached in our lifetime. It simply isn't reasonable to expect, for example, that all learners will become proficient in reading and mathematics, nor that all teachers will be highly qualified.

Still, these goals are worthwhile ideals. They tell us in what general direction we should direct our efforts so that most, even if not all, learners have a much higher probability of reaching the goals. National ideals such as these provide important guides for state educational goals.
Figure 8.1: NCLB educational goals

The educational goals that are explicit in the No Child Left Behind Act are lofty ideals that not all learners can reach. But that the educational machinery is aimed in their direction may herald some enormous improvements.

Goal 1: By 2013–2014, ALL students will reach high standards and attain proficiency in reading and mathematics.
Goal 2: ALL limited English proficient students will become proficient in English.
Goal 3: By 2005–2006, ALL students will be taught by highly qualified teachers.
Goal 4: ALL students will be educated in learning environments that are safe and drug-free.
Goal 5: ALL students will graduate from high school.

Source: Based on the No Child Left Behind Act of 2001. Retrieved April 10, 2013, from http://www2.ed.gov/policy/elsec/leg/esea02/107-110.pdf
Common Core State Standards

Virtually all states have published descriptions of criteria that can be used to assess the extent to which goals are being met. These are often referred to as standards. Following a nationwide education initiative, many states have adopted identical standards labeled Common Core State Standards. These standards describe what students should know at each grade level, and for each subject. For example, based on these Common Core State Standards, the state of Washington provides explicit learning targets for science at all levels from kindergarten to 12th grade (McClellan & Sneider, 2009). California, too, is one of more than 45 states that have adopted Common Core State Standards (California Department of Education, 2012). One intended result of adopting common core standards is to bring about a realignment of curricula in different states.

State standards serve as a guide for the broad goals and for the specific instructional objectives developed by local school jurisdictions and, ultimately, by classroom teachers (Crawford, 2011). For example, the California core reading standards for Literature at the grade 1 level specify that students should be able to do the following (Sacramento County Office of Education, 2012a):
• Ask and answer questions about key details in a text.
• Retell stories, including important particulars, and demonstrate understanding of their central message or lesson.
• Describe characters, settings, and major events in a story using key details.
• Identify who is telling the story at various points in a text.
• Confirm predictions about what will happen next in a text.
• Compare and contrast the adventures and experiences of characters in stories.
• Identify words and phrases that suggest feelings or appeal to the senses.

These are seven of the 10 general objectives listed for grade 1 in this area. Note that each of these suggests certain instructional activities. For example, the last objective—identifying words and phrases that suggest feelings or appeal to the senses—leads to a wide range of instructional possibilities. Teachers might take steps to ensure that students understand what emotions are and that they recognize words relating to them; perhaps direct teaching methods might be used to inform learners about the human senses; group activities might encourage learners to generate affect-related words; learners might be asked to search stories for words and phrases associated with feelings.

An objective such as this even suggests instructional activities related to other subject areas. For example, in art classes, students might be asked to draw facial expressions corresponding to emotional states described in the stories they are reading in language arts. And in mathematics, they might be asked to count the number of affect-linked words or phrases in different paragraphs or on different pages. And, depending on relevant mathematics objectives, they might be encouraged to add these or to subtract the smaller number from the larger.

Not only do state standards suggest a variety of instructional activities, but by the same token, they serve as indispensable guidelines for the school's and the teacher's instructional objectives. And these are basic to sound educational assessment. In the same way as the main purpose of all forms of instruction is to improve learning, so too, an overriding objective of assessment is to help learners reach instructional objectives.
Test Blueprints

The best way of ensuring that assessments are directed toward instructional objectives is to use test blueprints. As we saw in Chapter 4, these are basically tables of specifications for developing assessment instruments. They are typically based closely on the instructional objectives for a course or a unit. They may also reflect a list or a hierarchical arrangement of relevant intellectual or motor activities such as those provided by Bloom's Taxonomy (described in Chapter 4). Many states provide blueprints for large-scale testing (Johnstone & Thurlow, 2012).

Suppose, for example, you are teaching sixth-grade mathematics in California. California core standards list detailed objectives at that grade level for five different areas: ratios and proportional relationships, the number system, expressions and equations, geometry, and statistics and probability (SCOE, 2012b). The first of six core standards for geometry reads as follows:

    Find the area of right triangles, other triangles, special quadrilaterals, and polygons by composing into rectangles or decomposing into triangles and other shapes; apply these techniques in the context of solving real-world and mathematical problems. (p. 27)

Part of a test blueprint reflecting related learning objectives, based on Bloom's Taxonomy, might look something like that in Table 8.1. Numbers in the grid indicate the number of test items for each category. Questions in parentheses are examples of the sorts of items that might be used to assess a specific cognitive process with respect to a given topic. Test blueprints of this kind might also include the value assigned to each type of test item.
Table 8.1 Part of a sample test blueprint for a single geometry objective reflecting Bloom's Revised Taxonomy, cognitive domain

Right triangles
    Remembering: 4 items (e.g., What is the formula for finding the area of a right triangle?)
    Higher processes (applying, analyzing, evaluating, creating): 1 item (e.g., If you were building a house and could have a total of only 80 feet of perimeter wall, which of the following shapes would give you the largest area? Quadrilateral; polygon; square; right-angle triangle; other shape. Prove that your answer is correct.)

Quadrilaterals
    Remembering: 3 items

Other triangles
    Remembering: 3 items
    Understanding: 2 items (e.g., Illustrate how you would find the area of an isosceles triangle by sketching a solution.)
    Higher processes: 1 item
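The sample higher-process item in Table 8.1 can itself be worked through numerically. For a fixed perimeter P, a regular polygon with n sides encloses an area of P^2 / (4n tan(pi/n)), which grows with n; so a square beats a triangle, and a many-sided polygon beats the square. A short, illustrative sketch of that check:

```python
from math import tan, pi

def regular_polygon_area(perimeter, n_sides):
    """Area of a regular polygon with n_sides and the given total perimeter."""
    return perimeter**2 / (4 * n_sides * tan(pi / n_sides))

P = 80  # feet of perimeter wall, as in the sample item
triangle = regular_polygon_area(P, 3)  # equilateral triangle, about 308 sq ft
square = regular_polygon_area(P, 4)    # square, exactly 400 sq ft
hexagon = regular_polygon_area(P, 6)   # regular hexagon, about 462 sq ft
```

A student "proving" the answer would note that the square is the best of the simple shapes listed, but that the trend continues: the more sides, the larger the area, with a circle (P^2 / 4pi, about 509 square feet) as the limit.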
There are several other approaches to devising test blueprints. For example, the blueprint might list what learners are expected to understand, remember, or be able to do. In addition, the most useful blueprints will include an indication of how many items or questions there might be for each entry in the list and the test value for each. Figure 8.2 gives an example of a checklist blueprint for a unit covering part of the content of Chapter 2 in this text. (For other examples of test blueprints, see Tables 4.3 and 4.4 in Chapter 4.)
A blueprint such as that shown in Figure 8.2 is useful for more than simply organizing and writing items for a test. It not only serves to guide the instructor's efforts, but, if given to learners, it can also serve to direct their learning. And perhaps most important, it directs the attention of both teachers and learners toward the higher levels of mental activity.

In this connection, it is worth noting that despite teachers' best intentions and their most carefully prepared test blueprints, assessments don't always reflect instructional objectives. For a variety of reasons, including that they are much easier to assess, the lowest levels of cognitive activity in Bloom's Taxonomy (knowledge and comprehension) are often far more likely to be tapped by school assessments than are the higher levels (Badgett & Christmann, 2009). For example, following an analysis of alignment between instructional objectives and assessments in food sciences classes, Jideani and Jideani (2012) report that knowledge- and comprehension-based assessments predominated. And this was true even though instructors intended that their students would go beyond remembering and understanding—that they would also learn to apply, to analyze, to evaluate, and to create.
Rubrics

As we saw in Chapter 7, another important tool for assessment is the rubric. A rubric is a written guide for assessment. Rubrics are used extensively in performance assessments where, without such guides, evaluations are often highly subjective and unpredictable. Inconsistent assessments are the hallmark of a lack of test reliability. And measures that are unreliable are also invalid.

Rubrics, like test blueprints, are a guide not only for assessment but also for instruction. And, also like blueprints, they are typically given to the learner before instruction begins. They tell the student what is important and expected far more clearly than might be expressed verbally by most teachers.
Figure 8.2: Checklist test blueprint

A test blueprint for a short-answer test on Chapter 2 of this text. The instructor might also choose to indicate the relative value of questions relating to each objective listed.

Checklist Test Blueprint for a Unit on Characteristics of Good Testing Instruments
(Unless otherwise stated, there is one item for each question)

Fairness
• know what test fairness means
• be able to give examples of unfair test items
• understand the requirements of NCLB regarding accommodations for learners with special needs

Validity
• be able to define validity
• be able to name and explain the difference between each of the different kinds of validity
• understand how test validity can be improved

Reliability
• understand the importance of test reliability
• know how reliability is calculated
• be able to suggest how reliability can be improved
Table 8.2 is an example of a rubric that might be used for evaluating an analysis paper at the sixth-grade level. Developing detailed rubrics of this kind for every instructional unit simplifies the teacher's task enormously. It makes lesson planning straightforward and clear; it dramatically shortens the amount of time that might otherwise be spent in planning and developing assessment instruments; and it is one of the surest ways of increasing test reliability, validity, and fairness.

Table 8.2 Rubric for evaluation of an analysis paper

Your analysis paper will be evaluated for each of the following:

1. Purpose clearly stated in two or three sentences — 10 points
2. Information provided to support and justify the purpose — 10 points
3. Relevant information by way of facts, examples, and research included — 20 points
4. Absence of irrelevant information — 5 points
5. Analysis presented in coherent, logical fashion evident in paragraphing and sequencing — 20 points
6. Few grammatical and spelling errors (up to 10 points may be deducted) — 0 points
7. Clear, well-supported conclusions — 15 points
8. High interest level — 20 points

TOTAL: 100 points
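Scoring against a rubric like Table 8.2 reduces to summing per-criterion points, with the error criterion applied as a deduction. In the sketch below, the criterion names and the scoring function are illustrative assumptions, not from the text; only the point values come from Table 8.2:

```python
# Maximum points per criterion, following Table 8.2; criterion 6 is a deduction
MAX_POINTS = {
    "purpose": 10, "support": 10, "relevant_info": 20, "no_irrelevant_info": 5,
    "coherence": 20, "conclusions": 15, "interest": 20,
}
MAX_ERROR_DEDUCTION = 10  # up to 10 points off for grammar and spelling errors

def score_paper(awarded, error_deduction=0):
    """Sum awarded points, capped at each criterion's maximum, minus the deduction."""
    total = sum(min(points, MAX_POINTS[criterion])
                for criterion, points in awarded.items())
    return total - min(error_deduction, MAX_ERROR_DEDUCTION)
```

The seven positive criteria sum to the rubric's 100-point total, so a flawless, error-free paper scores exactly 100. Making the deduction explicit is part of what gives a rubric its reliability: two raters applying the same caps and the same deduction rule cannot drift far apart.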
Approaches to Classroom Assessment

In Chapter 4, we saw that assessment can serve at least four different functions in schools.

1. Assessment might be used for placement purposes before or after instruction—or, sometimes, during instruction (placement assessment).
2. It might assume a helping role when feedback from ongoing assessments is given to learners to help them improve their learning and where ongoing assessments suggest to the teacher how instructional strategies might be modified (formative assessment).
3. School assessments often serve to provide a summary of the learner's performance and achievements. These unit- or year-end assessments are usually the basis for grades and for decisions affecting future placement (summative assessment).
4. Assessments might also be used to identify problems, to determine strengths and weaknesses, and to suggest possible approaches to remediation (diagnostic assessment).

Teacher-made assessments, no matter to which of these uses they are put, can take any one of several forms. Among them are performance-based assessments, selected-response assessments, and constructed-response assessments.
A P P L I C A T I O N S :
New Assessment-Related CAEP Standards for Accreditation
of Teacher Preparation Programs
Until July 2013, two organizations were dedicated to ensuring
that teacher preparation programs
graduated highly qualified teachers for the nation’s PK-12
school systems: the National Council
for Accreditation of Teacher Education (NCATE) and the
Teacher Education Accreditation Council
(TEAC). Higher education institutions that had teacher
preparation programs could demonstrate
that they met either NCATE’s or TEAC’s standards for teacher
27. preparation to attain accreditation.
Accreditation was interpreted as proof of the quality of an
institution’s programs and enhanced its
credibility.
On July 1, 2013, these two organizations became a new entity:
the Council for the Accreditation of
Educator Preparation (CAEP). Their purpose was not just to
merge the two organizations to elimi-
nate duplication of efforts and reduce costs to higher education
institutions: In addition, they set as
their goal:
To create a model unified accreditation system…. CAEP’s goals
should be not only to
raise the performance of candidates as practitioners in the
nation’s PK-12 schools, but
also to raise the stature of the entire profession by raising the
standards for the evi-
dence the field relies on to support its claims of quality. (pp. 2
and 3)
In late August of 2013, the CAEP board of directors will meet
to ratify the standards that teacher
education programs will need to reach if they are to be
accredited. These standards were developed
by a committee whose membership reflected a broad spectrum
of interested parties, from public
school teachers to university deans to state school
superintendents. In addition, the draft standards
were made available for public comment so everyone had an
opportunity to react and contribute.
It is anticipated that teacher preparation programs will have
access to resources regarding the new
CAEP standards by January of 2014.
28. So what are the ramifications of these new standards for teacher
preparation programs? In terms
of assessment, the following table is a comparison of the old
and new standards associated with
assessment.
The new standards have a clear new emphasis: It is no longer
enough simply to have an assessment
system; now institutions must use assessment data to make
decisions and to evaluate how well they
are doing.
The new CAEP standards also recognize the importance of
having multiple tools for assessment
and of collecting data beyond the confines of the institution.
When the standards are approved by
the CAEP board of directors, teacher preparation programs will
need to offer proof that they solicit
information from schools and communities to inform their
practices. This will encourage close con-
tact between teacher preparation institutions and the systems
that hire their graduates and increase
responsiveness to the needs of the schools. Finally, the new
CAEP standards suggest that teacher
preparation programs should follow their graduates into the
schools to collect data on their perfor-
mance as teachers. Teacher preparation programs will be
charged with providing evidence that their
candidates can “walk the talk.”
It will be interesting to see how this new accreditation process
plays out. One initial purpose for
pursuing the consolidation of the two accrediting agencies was
to reduce the financial burden
teacher preparation programs incurred when seeking national
accreditation. Will CAEP with its
revised standards accomplish this goal? Or will the revisions
require teacher education programs to
expand the role of the assessment process, thereby increasing its
cost?
Performance-Based Assessments Chapter 8
8.2 Performance-Based Assessments
Performance-based assessments are covered in detail in Chapter
7 and are summarized
briefly here.
Types of Performance-Based Assessments
Basically, a performance-based assessment is one that asks the
student to perform a task
or produce something, often in a situation that approximates a
real-life setting as closely as
possible. Among the most common performance assessments are
developmental assess-
ments, demonstrations, exhibitions, and portfolios.
Performance-based assessments are
often referred to as authentic assessments, although the
expressions are not synonymous.
A performance assessment is judged to be authentic to the
extent that it asks the student
to perform in ways that are closer to the requirements of actual
performances in day-to-day
settings.
Comparison of old and new standards on assessment:

NCATE 2008 — Standard 2: Assessment System and Unit Evaluation
The unit has an assessment system that collects and analyzes data on applicant qualifications, candidate and graduate performance, and unit operations to evaluate and improve the performance of candidates, the unit, and its programs.

TEAC 2009 — 1.5 Evidence of valid assessment
The program must provide evidence regarding the trustworthiness, reliability, and validity of the evidence produced from the assessment method or methods that it has adopted.

CAEP 2013 — 2. Data drive decisions about candidates and programs
This standard addresses CAEP's expectations regarding data quality and use in program improvement. The education preparation provider (EPP) must provide evidence that it has a functioning quality control system that is effective in supporting program improvement. Its quality control system must draw on valid and reliable evidence from multiple sources.
2.1 Decisions are based on evidence from multiple measures of candidates' learning, completers' performance in the schools, and school and community conditions and needs.
2.2 The education preparation provider has a system for regular self-assessment based on a coherent logic that connects the program's aims, content, experiences, and assessments.
2.3 The reliability and validity of each assessment measure are known and adequate, and the unit reviews and revises assessments and data sources regularly and systematically.
2.4 The education preparation provider uses data for program improvement and disaggregates the evidence for discrete program options or certification areas.
A performance assessment might ask a fine arts student to
prepare an exhibition of paintings for display in a school caf-
eteria as a basis for a final mark; a physical education student
might be graded on a demonstration of different sports-
related skills in competitive situations combined with scores
on written tests; and part of a language arts student’s final
grade might be based on a portfolio that contains samples of
written work spanning the school year.
It is true that many of the instructional objectives related to
these three situations can be assessed with non-performance-
based, teacher-made instruments. However, a teacher-made
test that is not performance-based is unlikely to reveal very
clearly how well Lenore can select, organize, and present an
art exhibition or how Robert is likely to perform during the
pressure of athletic competition. Nor is a single, year-end cre-
ative writing test likely to say nearly as much about Elizabeth’s
writing skills as does her yearlong collection of representative
compositions. Also, because many performance assessments
do not require high levels of verbal skills, they are exception-
ally well suited for use in early grades or during the preschool
period, as well as for some children with special needs.
Improving Performance-Based Assessment
Performance-based assessments have a number of limita-
tions and drawbacks. First, they can be very time-consuming,
especially when they involve individual performances, each of
which must be evaluated.
Second, performance-based assessments are not always very
practical, particularly when they
require special equipment or locations—both of which might be
the case for assessments in
areas that require performances such as public speaking or
competitive sports activities.
Third, despite the argument that they are more authentic,
performance-based assessments
tend to have much lower reliability. And, because of that fact,
they may often be less valid
and less fair.
However, there are ways of improving performance-based
assessments. One way to improve
their reliability is to use carefully designed rubrics and
checklists. Wesolowski (2012) suggests
that these need to make the assessment process as objective as
possible. A rubric should be
designed so that different evaluators who base their assessments
on the same rubric will
arrive at very similar scores.
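The claim that a well-designed rubric should yield similar scores from different evaluators can be checked empirically. The sketch below is a hypothetical illustration in Python (the rater scores are invented) that computes simple percent agreement between two raters who scored the same ten performances with the same rubric:

```python
# Hypothetical check of inter-rater agreement for a rubric-scored task.
# Invented data: each rater assigned 0-4 points to ten performances.
rater_a = [4, 3, 3, 2, 4, 1, 3, 2, 0, 4]
rater_b = [4, 3, 2, 2, 4, 1, 3, 3, 0, 4]

def percent_agreement(a, b):
    """Proportion of performances on which both raters gave the same score."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

agreement = percent_agreement(rater_a, rater_b)
print(f"Exact agreement: {agreement:.0%}")  # 8 of 10 scores match -> 80%
```

An agreement figure near 100 percent suggests the rubric is doing its job of making scoring objective; a low figure suggests the rubric's criteria leave too much room for interpretation.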
There is evidence that the usefulness of performance-based
assessments can be greatly
improved through additional teacher training and development.
This can be accomplished
by means of workshops that emphasize the use of rubrics,
checklists, and rating scales. Koh
(2011) looked at the results of teacher participation in such
workshops. He reports that these
teacher development activities resulted in significant
improvements in teachers’ understand-
ing of performance assessments and in the usefulness of the
assessments and the rubrics they
designed.
Wavebreak Media/Thinkstock
▲ Because they are closer to real-life situ-
ations, performance-based assessments
are often described as more authentic
assessments. Some of the most impor-
tant learning targets associated with
the music class to which this student
belongs cannot easily be assessed with a
selected-response test. The test is in the
performance.
Performance assessments can also be improved by using a
variety of creative and highly
motivating approaches. Schurr (2012) provides numerous
suggestions in a book that lists
more than 60 different ways of using performances to assess
student learning. For example,
students might be asked to write blog or journal entries as
though they were actually part
of Lewis and Clark’s company of explorers, or as though they
were soldiers in Napoleon’s
army or members of Queen Isabella’s court. The book also
includes suggestions for designing
rubrics. It includes examples as well as a list of online
resources for performance assessments.
Figure 8.3 summarizes some of the guidelines that might be
used to ensure that performance-
based assessments are as reliable, valid, and fair as possible.
8.3 Constructed- and Selected-Response Assessments
Test items are the basic units that make up an assessment. These
are often referred to as test
questions, although many assessment items are not questions at
all; instead, they are direc-
tions, instructions, or requests.
Some teacher-made assessments include several different kinds
of items. Often, however,
they are made up of a single sort of item. Test items can
generally be divided into two broad
categories: those that ask students to select a correct answer,
termed selected-response
assessments, and those that require examinees to produce
(construct) the correct response,
usually in writing but also sometimes orally. These are referred
to as constructed-response
assessments.
Figure 8.3: Improving performance-based assessment
Of these suggestions, probably the most important for
increasing the reliability, validity, and
fairness of performance-based assessments is the use of
carefully designed scoring rubrics and
checklists.
Suggestions for Improving Performance-Based Assessments
• When possible, use a variety of different performance
assessments.
• Use carefully constructed rubrics and checklists.
• Assess performances that reflect clear learning
targets.
• Design performance tasks that closely approximate
real-life settings.
• Select tasks that are interesting, motivating, and
challenging.
• Assess behaviors that can be taught and learned and
where improvement can be demonstrated through
performance.
• Take steps to ensure that students understand what is
expected and the criteria upon which they will be
assessed.
• Develop performance assessments that are practical
within budget and time constraints.
• Direct assessments toward important rather than
trivial learning targets.
What Are Selected-Response Assessments?
Selected-response items are generally considered to be more
objective than constructed-
response items, simply because each item usually has a single
clearly correct answer. In most
cases, if more than one response is correct, that is taken into
account in scoring. As a result,
answer keys for assessments made up of selected-response items
tend to be simple and exact.
No matter which examiner scores a selected-response
assessment, results should be identical.
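Because each item has a single keyed answer, scoring a selected-response test reduces to a mechanical comparison against the key. A minimal sketch in Python (the key and responses are invented for illustration):

```python
# Hypothetical answer key and one student's responses for a
# ten-item selected-response test.
answer_key = ["b", "d", "a", "a", "c", "b", "e", "d", "c", "a"]
student    = ["b", "d", "a", "c", "c", "b", "e", "a", "c", "a"]

def score(key, responses):
    """Count items on which the response matches the keyed answer."""
    return sum(1 for k, r in zip(key, responses) if k == r)

print(score(answer_key, student))  # -> 8
```

Because the comparison is purely mechanical, any two examiners (or any two programs) applying the same key to the same responses necessarily produce the same score; the subjectivity enters only in writing the items and the key.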
There are four principal kinds of selected-response items:
1. Multiple-choice items ask students to select which of several
alternatives is the correct
response to a statement or question.
2. True-false items, also called binary-choice items, ask the
responder to make a choice
between two alternatives, such as true or false.
3. Matching-test items present two or more corresponding lists,
from which the examinee
must select those that match.
4. Interpretive items are often similar to multiple-choice items,
except that they provide
information that examinees need to interpret in order to select
the correct alternative.
Information may be in the form of a chart, a graph, a paragraph,
a video, or an audio
recording.
What Are Constructed-Response Assessments?
Constructed-response items are more subjective than selected-
response items, because they
ask learners to generate their own responses. As a result, they
often have more than one cor-
rect answer.
Test makers distinguish between two broad forms of
constructed-response items, based
largely on the length of the answer that is required. Thus there
are short-answer items
requiring brief responses—often no longer than a single
paragraph—and essay items that
ask the student to write a longer, essay-form response for the
item. Figure 8.4 summarizes
these distinctions.
Objective Versus Essay and Short-Answer Tests
The selected-response (objective) items and the more subjective essay and short-answer items shown in Figure 8.4 can both be used to measure almost
any significant aspect of stu-
dents’ behavior. It is true, however, that some instructional
objectives are more easily assessed
with one type of item than with the other. The most important
uses, strengths, and limita-
tions of these approaches are described here. While the
descriptions can serve as a guide in
deciding which to use in a given situation, most good
assessment programs use a variety of
approaches:
1. It is easier to tap higher level processes (analysis, synthesis,
and evaluation) with an essay
examination. These can more easily be constructed to allow
students to organize knowl-
edge, to make inferences from it, to illustrate it, to apply it, and
to extrapolate from it.
Still, good multiple-choice items can be designed to measure
much the same things as
constructed-response items. Consider, for example, the
following multiple-choice item:
Harvey is going on a solo fishing and camping trip in the far
north. What equipment
and supplies should he bring?
a. rainproof tent; rainproof gear; fishing equipment; food
b. an electric outboard motor; a dinner suit; a hunting rifle
c. some books; a smart phone; fishing equipment; money
*d. an ax; camping supplies; fishing equipment; warm,
waterproof clothing
Answering this item requires that the student analyze the
situation, imagine different sce-
narios, and apply previously acquired knowledge to a new
situation. In much the same
way, it is possible to design multiple-choice items that require
that students synthesize
ideas and perhaps even that they create new ones.
As we saw, however, the evidence indicates that most selected-
response assessments
tend to tap remembering—the lowest level in Bloom’s
Taxonomy. Most items simply ask
the student to name, recognize, relate, or recall. Few classroom
teachers can easily create
items that assess higher cognitive processes.
2. Because essay and short-answer exams usually consist of
only a few items, the range of
skills and of information sampled is often less than what can be
sampled with the more
Figure 8.4: Types of assessment items
As this chart indicates, some tests include more than one type of assessment item. The chart divides test items into selected-response items (more objective): multiple-choice, true-false, matching, and interpretive; and constructed-response items (less objective): short-answer and essay.
objective tests. Selected-response assessments permit coverage
of more content per unit
of testing time.
3. Essay examinations allow for more divergence. They make it
possible for students to produce unexpected and unscripted responses. Those who do not
like to be limited in their
answers often prefer essays over more objective assessments.
Conversely, those who
express themselves with difficulty when writing often prefer
selected-response assess-
ments. However, Bleske-Rechek, Zeug, and Webb (2007) found
that very few students
consistently do better on one type of assessment than another.
4. Constructing an essay examination is considerably easier and
less time-consuming than
making up an objective examination. In fact, an entire test with
an essay format can
often be written in the same time it would take to write no more
than two or three good
multiple-choice items.
5. Scoring essay examinations usually requires much more time
than scoring objective tests,
especially when classes are large. The difference is greatest when the objective tests can be scored electronically. When classes are very small, however, the time required
for making and scoring an
essay test might be less than that required for making and
scoring a selected-response
test. The hypothetical relationship between class size and total
time for constructing and
scoring constructed-response and selected-response tests is
shown in Figure 8.5.
6. As Brown (2010) reports, the reliability of essay
examinations is much lower than that of
objective tests, primarily because of the subjectivity involved in
scoring them. In addition,
suggests Brown, examiners often overemphasize the language
aspects of the essays they
are scoring. As a result, they pay less attention to the content,
and the validity of the
grades suffers.
Some researchers have begun to develop computer programs
designed to score con-
structed-response test items. Typically, however, use of these is
limited to questions where
acceptable responses are highly constrained and easily
recognizable (e.g., Johnson, Nadas,
& Bell, 2010; McCurry, 2010).
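The hypothetical relationship shown in Figure 8.5 can be expressed as a simple linear model: each format has a fixed construction cost plus a per-student scoring cost. In the sketch below the minute values are invented purely for illustration (an essay test is quick to write but slow to score; a multiple-choice test is slow to write but nearly free to machine-score), and the code finds the class size at which the essay format stops being cheaper overall:

```python
# Invented illustration of the trade-off sketched in Figure 8.5:
# total time = construction time + (scoring time per student * class size).

def total_time(construction_min, scoring_min_per_student, class_size):
    return construction_min + scoring_min_per_student * class_size

ESSAY = (30, 15)        # quick to write, 15 minutes to score each paper
OBJECTIVE = (240, 0.5)  # slow to write, machine-scored in seconds

for n in range(1, 101):
    if total_time(*ESSAY, n) > total_time(*OBJECTIVE, n):
        print(f"Objective format becomes cheaper at about {n} students")
        break
```

With these invented numbers the crossover falls at 15 students; with other numbers the crossover moves, but the shape of the trade-off is the same as in the figure.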
Figure 8.5: Construction and scoring time: Essays versus
objective assessments
A graph of the hypothetical relationship between class size and
total time required for construct-
ing and scoring selected-response tests (multiple-choice, for
example) and constructed-response
tests (essay tests). As shown, preparation and scoring time for
essay tests increases dramatically
with larger class size, but it does not change appreciably for
machine-scored objective tests.
[Graph: x-axis, number of students from low to high; y-axis, total time from low to high; two curves labeled Essay Assessment and Objective Assessment.]
Which Approach to Assessment Is Best?
The simple answer is, it depends. Very few teachers will ever
find themselves in situations where
they must always use either one form of assessment or the
other. Some class situations, particu-
larly those in which size is a factor, may lend themselves more
readily to objective formats; in
other situations, essay formats may be better; sometimes a
combination of both may be desir-
able. The important point is that each form of assessment has
advantages and disadvantages.
A good teacher should endeavor to develop the skills necessary
for constructing the best items
possible in a variety of formats without becoming a passionate
advocate of one over the other.
The good teacher also needs to keep in mind that there are many
alternatives to assess-
ment other than the usual teacher-made or commercially
prepared standardized tests. Among
these, as we saw earlier, are the great variety of approaches to
performance assessment. In
the final analysis, the assessment procedure chosen should be
determined by the goals of the
instructional process and the purposes for which the assessment
will be used.
Nor are teachers always entirely alone when faced with the task
of constructing (or select-
ing) assessment instruments and approaches—even as teachers
are not entirely on their own
when they face important decisions about curriculum,
objectives, or instructional approaches.
Help, support, and advice are available from many sources,
including other teachers, adminis-
trators, parents, and sometimes even students. In many schools,
formal associations, termed
professional learning communities (PLCs), are an extremely
valuable resource (see In the
Classroom: Professional Learning Communities [PLCs]).
Table 8.3 shows how different types of assessment might be
used to tap learning objectives
relating to Bloom’s revised taxonomy (discussed in Chapter 4).
Note that the most common
IN THE CLASSROOM:
Professional Learning Communities (PLCs)
A professional learning community (PLC) is a grouping of
educators, both new and experi-
enced—and sometimes of parents as well—who come together
to talk about, reflect on, and
share ideas and resources in an effort to improve curriculum,
learning, instruction, and assessment
(Dufour, 2012).
Professional learning communities are formal organizations
within schools or school systems. They
are typically established by principals or other school leaders
and are geared toward establishing
collaboration as a basis for promoting student learning. PLCs
are characterized by
• Supportive and collaborative educational
leadership
• Sharing of goals and values
• Collaborative creativity and innovation
• Sharing of personal experiences
• Sharing of instructional approaches and
resources
• Sharing of assessment strategies and applications
• A high degree of mutual support
Evidence suggests that professional learning communities are a
powerful means of professional
development and support (Brookhart, 2009; Strickland, 2009).
They are also a compelling strategy for educational change and
improvement.
assessments for higher mental processes
such as analyzing, evaluating, and creating
are either constructed-response or perfor-
mance-based assessments. However, as we
see in the next section, selected-response
assessments such as multiple-choice tests
can also be designed to tap these processes.
Table 8.3 What assessment approach to use (Bloom's revised taxonomy of educational objectives)

Remembering
Students are asked to: copy, duplicate, list, learn, replicate, imitate, memorize, name, order, relate, reproduce, repeat, recognize, . . .
Useful approaches: selected-response assessments, including multiple-choice, true-false, matching, and interpretive items.

Understanding
Students are asked to: indicate, know, identify, locate, recognize, report, explain, restate, review, describe, distinguish, . . .
Useful approaches: selected-response assessments that require the learner to locate, identify, recognize, . . . ; constructed-response assessments, including short-answer and longer essay items where students are asked to explain, describe, compare, . . .

Applying
Students are asked to: demonstrate, plan, draw, outline, dramatize, choose, sketch, solve, interpret, operate, do, . . .
Useful approaches: written constructed-response assessments where students are required to describe prototypes or simulations showing applications; performance assessments where learners demonstrate an application, perhaps by sketching or dramatizing it.

Analyzing
Students are asked to: calculate, check, categorize, balance, compare, contrast, test, differentiate, examine, try, . . .
Useful approaches: written assessments requiring comparisons, detailed analyses, advanced calculations; performance assessments involving activities such as debating or designing concept maps.

Evaluating
Students are asked to: assess, choose, appraise, price, defend, judge, rate, calculate, support, criticize, predict, . . .
Useful approaches: written assessments requiring judging, evaluating, critiquing; performance assessments using portfolio entries reflecting opinions, reflections, appraisals, reviews, etc.

Creating
Students are asked to: arrange, write, produce, make, design, formulate, compose, construct, build, generate, craft, . . .
Useful approaches: written assignments, perhaps summarizing original research projects; performance assessments involving original output such as musical compositions, written material, designs, computer programs, etc.
▶ Professional learning communities (PLCs) are
organized groups of educators who meet regu-
larly to reflect and collaborate on improving
curriculum, learning, instruction, and assess-
ment. Such groups are a powerful strategy for
educational improvement.
iStockphoto/Thinkstock
8.4 Developing Selected-Response Assessments
As we noted, selected-response assessments tend to be more
objective than constructed-
response assessments. After all, most of them have only one
correct answer.
Multiple-Choice Items
Among the most common of the highly objective selected-
response assessments is that con-
sisting of multiple-choice items. These are items that have a
stem—often a question or an
incomplete statement—followed by a series of possible
responses referred to as alterna-
tives. There are usually four or five alternatives, only one of
which is normally correct; the
others are termed distracters.
On occasion, some multiple-choice tests may contain more than
one correct alternative.
These, as Kubinger and Gottschall (2007) found, are usually
more difficult than items with
a single correct answer, providing responders are required to
select all correct alternatives
for the item to be marked correct. The researchers created a
multiple-choice test where any
number of the five alternatives might be correct. These test
items were more difficult than
identical items that had only one correct answer, because
guessing was now much less likely
to lead to a correct response. If responders did not recognize the
correct answers and tried to
guess which they might be, they would not know how many
alternatives to select.
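The effect Kubinger and Gottschall describe can be quantified with simple counting (a worked illustration, not a figure from the study itself). With one correct alternative out of five, a blind guesser succeeds one time in five; if any non-empty subset of the five alternatives may be correct and every correct one must be selected, there are 2^5 − 1 = 31 possible answers, so blind guessing succeeds only about 3 percent of the time:

```python
# Chance of a blind guess being marked correct on a five-alternative item.

# Conventional format: exactly one of the five alternatives is correct.
p_single = 1 / 5

# "Any number may be correct" format: the responder must pick the exact
# subset of correct alternatives, and any non-empty subset is possible.
n_alternatives = 5
n_possible_answers = 2 ** n_alternatives - 1  # 31 non-empty subsets
p_subset = 1 / n_possible_answers

print(f"Single-answer guessing: {p_single:.1%}")  # 20.0%
print(f"Any-subset guessing: {p_subset:.1%}")     # about 3.2%
```

The sharp drop in the guessing probability is one reason such items behave as though they were more difficult.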
Multiple-choice stems and alternatives can take a variety of
forms. Stems might consist of
questions, statements requiring completion, or negative
statements. Alternatives might be
best answer, combined answers, or single answers. Examples of
each of these items are
shown in Figure 8.6.
Guidelines for Constructing Multiple-Choice Items
Writing good multiple-choice items requires attention to a
number of important guidelines.
Many of them involve common sense (which makes them no
less valid):
1. Both stems and alternatives should be clearly worded,
unambiguous, grammatically cor-
rect, specific, and at the appropriate level of difficulty. In
addition, stems should be clearly
meaningful by themselves. Compare, for example, the following
two items:
A. In the story The Red Rat, how did Sally feel toward Angela
after her accident?
a. sad
b. angry
c. jealous
d. confused
B. In the story The Red Rat, how did Sally feel toward Angela
after Angela’s accident?
a. sad
b. angry
c. jealous
d. confused
The problem with the first stem is that the pronoun her has an
ambiguous referent. Does
the question refer to Sally’s accident or Angela’s? The second
stem corrects that error.
Similarly, stems that use the word they without a specific
context or reference are sometimes vague and misleading. For example, the true-false
question “Is it true that they say
Developing Selected-Response Assessments Chapter 8
you should avoid double negatives?” might be true or false,
depending who they is. If they
refers to most authors of assessment textbooks, the correct
answer is true. But if they
refers to the Mowats, who lived back in my isolated neck of the
woods, the correct answer
would be false: They didn’t never say don’t use no double
negatives!
2. But seriously, don’t use no double negatives when writing
multiple-choice items. They
are highly confusing and should be avoided at all costs. Are
single negatives highly rec-
ommended? Not. Common, easily found examples of double and
even triple negatives
include combinations like these:
It is not unnecessary to pay attention—meaning, simply, “It is
necessary to pay
attention.”
It is not impossible to pay attention—meaning, “It is possible to
pay attention.”
Figure 8.6: Examples of multiple-choice items
Stems and alternatives can take a variety of forms. In these
examples, the alternatives are always
ordered alphabetically or numerically. This is a precaution
against establishing a pattern that might
provide a clue for guessing the correct response. (Correct
responses are checked.)
f08.06_EDU645.ai
Incomplete Statement Stem
1. The extent to which a test appears to
measure what it is intended to measure
defines
___ a. construct validity
___ b. content validity
___ c. criterion-related validity
_✓_ d. face validity
___ e. test reliability
Question Stem
2. Who is the theorist most closely
associated with the development of
operant conditioning?
___ a. Bandura
___ b. Pavlov
_✓_ c. Skinner
___ d. Thorndike
___ e. Watson
Negative Statement Stem
3. Which of the following is NOT a
dinosaur?
___ a. allosaurus
___ b. brachiosaurus
_✓_ c. stenogrosaurus
___ d. triceratop
___ e. velociraptor
Best Answer Alternative
4. What was the main motive for Britain
entering WWII?
___ a. economics
___ b. fear
___ c. greed
___ d. hatred
___ e. loyalty
Combined Answer Alternative
5. Order the following from largest to
smallest in geographic area:
1. Brazil
2. Canada
3. China
4. Russia
5. United States
___ a. 1, 2, 3, 4, 5
___ b. 1, 3, 5, 4, 2
_✓_ c. 4, 2, 5, 3, 1
___ d. 4, 5, 2, 1, 3
___ e. 2, 4, 5, 3, 1

Single Answer Alternative
6. What is the area of a 20 foot by 36 inch
rectangle?
___ a. 16 square feet
___ b. 20 square feet
___ c. 56 square feet
_✓_ d. 60 square feet
___ e. 720 square feet
Developing Selected-Response Assessments Chapter 8
The switch is not disabled—meaning, “The switch is
functioning.”
It is impossible to not do something illegal—meaning,
strangely, “It is not possible to do
something legal.”
For lack of no other option—meaning very little. If we lack no
other option, there must
be another option. No?
Test makers need to be especially wary of negative prefixes
such as un–, im–, dis–, in–,
and so on.
3. Unless the intention is clearly to test memorization, test
items should not be taken word
for word from the text or other study materials. This is
especially the case when instructional objectives involve application, analysis, or other higher
mental processes.
4. Create distracters that seem equally
plausible to students who don’t know
the correct answer. Otherwise, answer-
ing correctly might simply be a matter
of eliminating highly implausible dis-
tracters. Consider the following exam-
ple of a poor item:
A. 10 + 12 + 18 =
a. 2
b. 2,146
c. 40
d. 1
For students who don’t know how to
calculate the correct answer, highly
implausible distracters that can eas-
ily be eliminated may dramatically
increase the score-inflating effects of
guessing.
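The score inflation is easy to quantify. With four equally plausible alternatives, blind guessing earns 25 percent of the items on average; if two distracters per item are implausible enough to eliminate on sight, guessing among the remaining two earns 50 percent. A quick arithmetic check in Python (the test length is invented for illustration):

```python
# Expected number correct from blind guessing on a hypothetical
# 40-item test with four alternatives per item.
items = 40

# All four alternatives plausible: guess among 4 each time.
expected_plain = items * (1 / 4)

# Two implausible distracters eliminated per item: guess among 2.
expected_after_elimination = items * (1 / 2)

print(expected_plain, expected_after_elimination)  # 10.0 20.0
```

Eliminating weak distracters thus doubles the expected chance score, which is why distracter plausibility matters so much to an item's quality.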
5. Unintentional cues should be avoided. For example, ending
the stem with a or an often
provides a cue as in, for example:
A. A pachyderm is an
a. cougar
b. dog
c. elephant
d. water buffalo
6. Avoid the use of strong qualifying words such as never,
always, none, impossible, and
absolutely in distracters. Distracters that contain them are most
often incorrect. On the
other hand, distracters that contain weaker qualifiers such as
sometimes, frequently, and
usually are often associated with correct alternatives. At other
times, they are simply vague
and confusing. Consider, for example:
A. Multiple-choice alternatives that contain strong qualifiers are
a. always incorrect
b. never incorrect
c. usually incorrect
d. always difficult
iStockphoto/Thinkstock
▲ This boy is completing an online, take-home, selected-response test. Perhaps he is using one computer to search for answers while he completes the timed test on the other. That is one of the factors that needs to be considered in online courses.
Developing Selected-Response Assessments Chapter 8
As expected, alternatives with strong qualifiers (always and
never) are incorrect, and the
alternative with a weak qualifier (usually) is correct.
Weak qualifiers often present an additional problem: They can
be highly ambiguous. For
example, interpreting the alternative usually incorrect is
difficult because the term is impre-
cise. Does usually in this context mean most of the time? Does
it mean more than half the
time? More than three quarters of the time?
In stems, both kinds of qualifiers also tend to be ambiguous,
and the weaker ones are
more ambiguous than those that are strong. Never usually means
“not ever”—although
it can sometimes be interpreted to mean “hardly ever.” But
frequently is one of those
alarmingly vague terms for whose meaning we have no
absolutes—only relatives. Just how
often is frequently? We don’t really know—which is why the
word fits so well in many of
our lies and exaggerations.
7. Multiple-choice assessments, like all forms of educational
assessment, also need to be
relevant to instructional objectives. That is, they need to
include items that sample course
objectives in proportion to their importance. This is one of the
reasons teachers should use
test blueprints and should make sure they are closely aligned
with instructional objectives.
8. Finally, as we saw in Chapter 2, assessments need to be as
fair, valid, and reliable as pos-
sible. Recall that fair tests are those that:
• Assess material that all learners have had an opportunity
to learn
• Allow sufficient time for all students to complete the test
• Guard against cheating
• Assess material that has been covered
• Make accommodations for learners' special needs
• Are free of the influence of biases and stereotypes
• Avoid misleading, trick questions
• Grade assessments consistently
Recall, too, that the most reliable and valid assessments tend to
be those based on longer
tests or on a variety of shorter tests where scoring criteria are
clear and consistent. These
guidelines are summarized in Table 8.4.
Table 8.4 Checklist for constructing good multiple-choice items
Yes No Are stems and alternatives clear and unambiguous?
Yes No Have I avoided negatives as much as possible?
Yes No Have I included items that measure more than simple
recall?
Yes No Are all distracters equally plausible?
Yes No Have I avoided unintentional cues that suggest correct
answers?
Yes No Have I avoided qualifiers such as never, always, and
usually?
Yes No Do my items assess my instructional objectives?
Yes No Are my assessments as fair, reliable, and valid as
possible?
Developing Selected-Response Assessments Chapter 8
Matching Items
The simplest and most common matching-test item is one that
presents two columns of
information, arranged so that each item in one column matches
a single item in the other.
Columns are also organized so that matching terms are
randomly juxtaposed, as shown in
Figure 8.7.
Matching items can be especially useful for assessing
understanding, in addition to remember-
ing. In particular, they assess the student’s knowledge of
associations and relationships. They
can easily be constructed by generating corresponding matching
lists for a wide variety of
items. For example, Figure 8.7 matches people with concepts.
Other possible matches include
historical events with dates; words with definitions; words in
one language with translations
in another; geometric shapes with their names; titles of literary works
with names of authors;
historical figures with historical positions or historical events;
names of different kinds of
implements with their uses; and on and on.
The most common matching items present what is termed the
premise column (or sometimes the stem column) on the left and possible matches in what
is called the response
is called the response
column on the right.
A matching item might have more entries in the response column or an equal number in each.
From a measurement point of view, one advantage of having
different numbers of entries in
each column is that this reduces the possibility of answering
correctly when the student does
not know the answer. In the example shown in Figure 8.7, students who know three of the
four correct responses will also get the fourth correct. By the same token, those who know
two of the four will have a 50-50 chance of getting the remaining two correct. Even if a
student knew only one response, there would still be a pretty good chance of guessing one
or more of the others correctly. But the more items there are in the response column, the
lower the odds of selecting the correct unknown response by chance.

Figure 8.7: Example of a matching-test item

Matching-test items should have very clear instructions. More complex matching items
sometimes allow some responses to be used more than once or not at all.

Instructions: Match the theorists in column A (described in Chapter 2) with the concept in
column B most closely associated with each theorist. Write the number in front of each entry
in column B in the appropriate space after each theorist named in column A. Each term should
be used only once.

A. Theorist          B. Associated Term
Thorndike _____3     1. Behaviorism
Watson _____1        2. Classical conditioning
Skinner _____4       3. Law of Effect
Pavlov _____2        4. Operant conditioning
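These guessing odds can be computed directly (a sketch, not part of the text; the function name is illustrative):

```python
from math import perm

def p_all_correct_by_chance(unknown_premises: int, remaining_responses: int) -> float:
    """Probability of matching every remaining premise correctly by blind
    guessing, when each remaining response may be used at most once."""
    # There are perm(m, k) equally likely ways to assign k distinct responses
    # to k premises; exactly one of those assignments is fully correct.
    return 1 / perm(remaining_responses, unknown_premises)

# Equal columns (as in Figure 8.7): a student who knows 2 of 4 matches has
# 2 unknown premises and 2 remaining responses, so a 50-50 chance.
print(p_all_correct_by_chance(2, 2))   # 0.5

# Uneven columns (as in Figure 8.8): with 2 unknown premises but 5 unused
# responses, the chance of guessing both correctly drops to 1/20.
print(p_all_correct_by_chance(2, 5))   # 0.05
```

This is why a longer response column increases the reliability of a matching item: the denominator grows rapidly with each extra unused response.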
Figure 8.8 presents an example of a matching item with more
items in the response than the
premise column.
Similarly, some matching tests are constructed in such a way
that each item in the response
list might be used once, more than once, or not at all. Not only does this approach
effectively eliminate the possibility of narrowing down options for guessing, but it might be
constructed to require that the student engage in behaviors that require calculating,
comparing, differentiating, predicting, appraising, and so on. All of these activities tap
higher level cognitive skills.
Figure 8.8: Example of a matching-test item with uneven columns

When matching-test items contain more items in the response list than in the premise list,
reliability of the measure increases because the probability of correctly guessing unknown
responses decreases.

Instructions: Match the 21st century world leaders in column A with the country each has led or
currently leads, listed in column B. Write the number in front of each entry in column B in the
appropriate space after each leader in column A. There is only one correct response for each
item in column A.

A. World Leader                     B. Country Led
Luiz Inácio Lula da Silva _____2    1. Argentina
Mohamed Morsy _____3                2. Brazil
Al-Assad _____9                     3. Egypt
Ali Abdullah Saleh _____10          4. Italy
Kim Jong-un _____5                  5. North Korea
Silvio Berlusconi _____4            6. Portugal
Mariano Rajoy _____8                7. Saudi Arabia
                                    8. Spain
                                    9. Syria
                                    10. Yemen
Not all matching-test items are equally good. Consider, for
example, the item shown in Figure
8.9. Note how the instructions are clear and precise: They state
exactly what the test taker
must do and how often each response can be used. But it really
is a very bad item. Entries in
each column are structured so differently that for those with adequate reading skills,
grammatical cues make the answers totally obvious. The person who built this item should have
paid attention to the following guidelines:
1. Items in each column should be parallel. For example, in
Figure 8.7, all items in the premise
column are names of theorists, and all items in the response
column are terms related in
some important way to one of the theories. Similarly, in Figure
8.8, premise entries are all
names of world leaders, and response entries are all countries.
The following is an example
of nonparallel premise items that are to be matched to a response list of different formulas
for calculating the area of different geometric figures:
triangle
square
rectangle
cardboard boxes
circle
The inclusion of cardboard boxes among these geometric
shapes is confusing and
unnecessary.
Test makers must also guard against items that are not
grammatically parallel, as is shown
in Figure 8.9.
Figure 8.9: Example of a poorly constructed matching-test item

To avoid many of the problems that are obvious in this example, simply use complete sentences or
parallel structures in the premise column.

Instructions: Match the statements in column A with the best answer from column B. Write the
number in front of each answer in column B in the appropriate space after each statement in
column A. No answer can be used more than once. One will not be used at all.

A. In the Story, Pablo’s Chicken                        B. Answers based on Pablo’s Chicken
At the beginning of the story, Pablo lives in _____5    1. angry
Pablo gets very upset when _____2                       2. his dog dies
When Pablo answers the door, his dog bites _____3       3. his mother
Pablo’s mother is very _____1                           4. kitchen scraps
                                                        5. Monterrey
2. All items in the response list should be plausible. This is especially true if the response
list contains more entries than the premise column. The test is less reliable when it
contains items that allow students to quickly discard implausible responses.
3. To increase the reliability of the test, the response column
should contain more items than
the premise column.
4. Limit the number of items to between six and 10 in each
column. Longer columns place
too much strain on memory. Recall from Chapter 3 that our
adult short-term memory is
thought to be limited to seven plus or minus two items. It is
difficult to keep more items
than this in our conscious awareness at any one time.
5. We saw that grammatical structure can sometimes provide
unwanted clues in multiple-
choice items. This can also be the case in matching-test items,
as is the case for the item in
Figure 8.9 where grammatical structure reveals almost all the
correct responses. Moreover,
the fourth item in the response column is an implausible
response.
6. Directions should be clear and specific. They should stipulate
how the match is to be made
and on what basis. For example, directions for online matching
tests might read: “Drag
each item in column B to the appropriate box in front of the
matching item in column
A.” Similar instructions for a written matching-item test might
specify: “Write the number in front of each answer in column B in the appropriate space
after each statement in column A.”
7. Response items should be listed in logical order. Note, for
example, that response columns
in Figures 8.7 through 8.9 are alphabetical. Where response
items are numerical, they
should be listed in ascending or descending order. Doing so
eliminates the possibility of
developing some detectable pattern. It also discourages students
from wasting their time
looking for a pattern.
8. For paper-and-pencil matching items, ensure that lists are
entirely on one page. Having to
flip from one page to another can be time-consuming and
confusing.
Table 8.5 summarizes these guidelines in the form of a
checklist.
Table 8.5 Checklist for constructing good matching items
Yes No Are items in the premise column parallel?
Yes No Are items in the response column parallel?
Yes No Are all response-column items plausible?
Yes No Have I included more response- than premise-column
items?
Yes No Are my lists limited to no more than seven or so
items?
Yes No Have I avoided unintentional cues that suggest correct matches?
Yes No Are my directions clear, specific, and complete?
Yes No Have I listed response-column items in logical order?
Yes No Are my columns entirely on one page?
Yes No Do my items assess my instructional objectives?
Yes No Are my assessments as fair, reliable, and valid as
possible?
True-False Items
A relatively common form of assessment, often used in the early
grades, is the true-false
item. True-false items typically take the form of a statement
that the examinee must judge as
true or false. But they can also consist of statements or
questions to which the correct answer
might involve choosing between responses such as yes and no or
right and wrong. As a result,
they are sometimes called binary-choice items rather than true-false items.
True-false test items tend to be popular in early grades because
they are easy to construct, can
be used to sample a wide range of knowledge, and provide a
quick and easy way to look at
the extent to which instructional objectives are being met.
Most true-false items are simple propositions that can be answered true or false (or yes or no).
For example:

Reliability is a measure of the consistency of an assessment. T F

Face validity reflects how closely scores from repeated administrations of the same test
resemble each other. T F
Predictive validity is a kind of criterion-related validity. T F
True-False Assessments to Tap Higher Mental Processes
Answering these true-false questions requires little more than
simple recall. However, it is
possible to construct true-false items that assess other cognitive
skills. Consider the following:
Is the following statement true or false?
9 ÷ 3 × 12 + 20 = 56 T F
Responding correctly to this item requires recalling and
applying mathematical operations
rather than simply remembering a correct answer.
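The key for this item can be checked mechanically; division and multiplication apply left to right before the addition (a quick sketch, not part of the text):

```python
# Keying the computational true-false item from the text:
#   9 ÷ 3 × 12 + 20 = 56   (keyed answer: T)
# Division and multiplication are applied left to right, then the addition,
# which is exactly how Python evaluates the expression below.
claimed = 56
computed = 9 / 3 * 12 + 20   # (9 ÷ 3) = 3.0, × 12 = 36.0, + 20 = 56.0
is_true = computed == claimed
print("T" if is_true else "F")  # T
```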
Binary-choice items can also be constructed so that the
responder is required to engage in a
variety of higher level cognitive activities such as comparing,
predicting, evaluating, and generalizing. Figure 8.10 shows examples of how this might be
accomplished in relation to the
Revised Bloom’s Taxonomy.
Note that Figure 8.10 does not include any examples that relate to creating. Objectives that
have to do with designing, writing, producing, and related
activities are far better tapped by
means of performance-based assessments or constructed-
response items than with the more
objective, selected-response assessments.
Note, too, that although it is possible to design true-false items
that seem to tap higher cogni-
tive processes, whether they do so depends on what the learner
already knows. As we saw
earlier, what a test measures is not defined entirely by the test
items themselves. Rather, it
depends on the relationship between the test item and the
individual learner. Consider, for
example, the Figure 8.10 item that illustrates judging—an
activity described as having to do
with evaluating:
• It is usually better to use a constructed-response test
rather than a true-false test for
objectives having to do with evaluating. T F
One learner might respond, after considering the characteristics
of constructed-response and
true-false tests, by analyzing the requirements of evaluative
cognitive activities and judging
which characteristics of these assessments would be best suited for the objective. That
learner’s cognitive activity would illustrate evaluating.
Another learner, however, might simply remember having read or heard that
constructed-response assessments are better suited for objectives relating to evaluating and
would quickly select the correct response. That learner’s cognitive activity would represent
the lowest level in Bloom’s Taxonomy: remembering.
Figure 8.10: True-false items tapping higher cognitive skills

Although it is possible to design true-false items that do more than measure simple recall, other
forms of assessment are often more appropriate for objectives relating to activities such as
analyzing, evaluating, and creating.

[The figure pairs each level of Bloom’s Revised Taxonomy of Educational Objectives
(Remembering, Understanding, Applying, Analyzing, Evaluating, Creating) with possible
activities (for example: calculate, support, criticize, predict; arrange, write, produce, make,
design, formulate, compose, construct, build, generate, craft) and a sample binary-choice item
for each level except creating:

• This is a spider: T F
• It is correct to say that affect has an affect on most of us. Yes No
• The total area of a rectangular house that is 60 feet in one dimension and that contains
nothing other than 3 equal-sized rectangular rooms that measure 30 feet in one dimension is
1800 square feet. T F
• For cutting through a trombone you could use either a hacksaw or a rip saw. T F
• It is usually better to use a constructed response rather than a true-false test for
objectives having to do with evaluating. T F

(Creating cannot normally be tested by means of a true-false item.)]
Limitations of True-False Assessments
True-false assessments are open to a number of serious criticisms. First, unless they are
carefully and deliberately constructed to go beyond black-and-white facts, they tend to measure
little more than simple recall. And second, because there is a 50% chance of answering any
one question correctly—all other things being equal—they tend to provide unreliable
assessments. If everyone in a class knew absolutely nothing about an area being tested with
true-false items, and they all simply guessed each answer randomly, the average performance of
the class would be around 50%.
Nevertheless, the chance of receiving a high mark as a result of what is termed blind
guessing is very low. And the chance of receiving a very poor mark is equal to that of
receiving a very high one.
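The point can be made precise with the binomial distribution (a sketch, not part of the text; the function name is illustrative):

```python
from math import comb

def p_score_at_least(n_items: int, k_correct: int, p: float = 0.5) -> float:
    """Probability of getting at least k_correct of n_items right by blind
    guessing, where each guess succeeds independently with probability p."""
    return sum(comb(n_items, k) * p**k * (1 - p)**(n_items - k)
               for k in range(k_correct, n_items + 1))

# On a 20-item true-false test, a blind guesser has only about a 6% chance
# of scoring 70% (14/20) or better...
print(round(p_score_at_least(20, 14), 4))

# ...and well under a 1% chance of scoring 80% (16/20) or better.
print(round(p_score_at_least(20, 16), 4))
```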
Most guessing tends to be relatively educated rather than completely random. Even if they
are uncertain about the correct answer, many students know
something about the item and
guess on the basis of the information they have and the logic
and good sense at their disposal.
Variations on Binary-Choice Items
In one study, Wakabayashi and Guskin (2010) used an
intriguing approach to reduce the
effect of guessing. Instead of simply giving respondents the
traditional choice of true or false,
they added a third option: unsure. When students were retested
on the same material later,
items initially marked unsure were more likely to have been
learned in the interim and to be
answered correctly on the second test than were incorrect
responses of which the responders
had been more certain.
Another interesting approach that uses true-false items to
understand more clearly the learner’s thinking processes asks responders to explain their choices.
For example, Stein, Larrabee,
and Barman (2008) developed an online test designed to
uncover false beliefs that people
have about science. The test consists of 47 true-false items,
each of which asks responders to
explain the reasons for their choices. As an example, one of the
items reads as follows:
An astronaut is standing on the moon with a baseball in her/his
hand. When the base-
ball is released, it will fall to the moon’s surface. (p. 5)
The correct answer, true, was selected by only 32.8% of 305
respondents, all of whom were
students enrolled in teacher education programs at two different universities. More
reassuringly, 94.4% chose the correct answer (true) for this next item:
A force is needed to change the motion of an object. (p. 5)
The usefulness of this approach lies in the explanations, which
often reveal serious misconceptions. And strikingly, this is frequently the case even when
answers are correct. For example,
more than 40% of respondents who answered this last item
correctly did so for the wrong
reasons. They failed to identify the normal forces (called
reaction forces) that counter the
effects of gravity.
Asking students to explain their choices on multiple-choice
tests might reveal significant gaps
in knowledge or logic. This approach could contribute in
important ways to the use of these
tests for formative purposes. Table 8.6 presents guidelines for
writing true-false items.
Table 8.6 Checklist for constructing good true-false items
Yes No Is the item useful for assessing important learning
objectives?
Yes No Have I avoided negatives as much as possible?
Yes No Have I included items that measure more than simple
recall?
Yes No Is the item absolutely clear and unambiguous?
Yes No Have I avoided qualifiers such as never, always, and
usually?
Yes No Have I avoided a pattern of correct responses?
Yes No Have I balanced correct response choices?
Yes No Have I made my statements as brief as possible?
Yes No Have I avoided trick questions?
Yes No Are my true and false statements of approximately
equal length?
Interpretive Items
Interpretive items present information that the responder needs to interpret when answering
test items. Although the test items themselves might take the form of any of the objective
test formats—matching, multiple-choice, or true-false—in most cases, the material to be
interpreted is followed by multiple-choice questions.
Interpretive material most often takes the form of one or two
written paragraphs. It may also
involve graphs, charts, maps, tables, and video or audio
recordings. Figure 8.11 is an example
of a true-false interpretive item based on a graph. Answering
the items correctly might require
analysis and inference in addition to basic skills in reading
graphs.
Figure 8.12 illustrates a more common form of interpretive test
item. It is based on written
material that is novel for the student. Responding correctly
requires a high level of reading skill
and might also reflect a number of higher mental processes such
as those involved in analyz-
ing, generalizing, applying, and evaluating.
Advantages of Interpretive Items
Interpretive items present several advantages over traditional
multiple-choice items. Most
important, they make it considerably easier to tap intellectual
processes other than simple
recall. Because the material to be interpreted is usually novel,
the student cannot rely on recall
to respond correctly.
A second advantage of interpretive items is that they can be
used to assess understanding
of material that is closer to real life. For example, they can
easily be adapted to assess how
clearly students understand the sorts of tables, charts, maps, and
graphs that are found in
newspapers, on television, and in online sources.
Finally, not only can interpretive test items be used to assess a
large range of intellectual skills,
but they can also be scored completely objectively. This is not
the case for performance-based
assessments or for most constructed-response assessments.
Figure 8.11: Interpretive true-false item based on a graph

Interpretive test items are most often based on written material but can also be based on a variety
of visual or auditory material, as shown here.

[Graph: percent of 25- to 34-year-olds currently married or never married, plotted by year
(2000, 2006, 2007, 2008, 2009); the vertical axis shows percent, ranging from 25 to 60.]

Instructions: Based on the graph, indicate whether each of the following statements is true or
false by putting a checkmark in front of the appropriate letter.

1. The vertical axis indicates the year in which the survey was conducted. T F
2. In 2008, there were more 25- to 34-year-olds who were married than who had never
married. T F
3. Every year between 2000 and 2009 there were more married than never married adults
between 25 and 34. T F
4. One hundred percent of people surveyed were either currently married or had never
married. T F
5. The number of never-married 25- to 34-year-olds increased between the year 2000 and
2009. T F

Source: U.S. Census Bureau (2011). Retrieved September 12, 2011, from http://factfinder.census.gov/servlet/STTable?_bm=y&-qr_name=ACS_2009_5YR_G00_S1201&-
Constructing Good Interpretive Items
As is true for all forms of assessment, the best interpretive test
items are those that are relevant to instructional objectives, that sample widely, that tap a
range of mental processes,
range of mental processes,
and that are as fair, valid, and reliable as possible. Table 8.7 is
a brief checklist of guidelines
for constructing good interpretive items.
Figure 8.12: Interpretive multiple-choice item based on written material

The most common interpretive items are based on written material. One of their disadvantages is
that they are highly dependent on reading skills.

Beavers are hardworking, busy little creatures. Like all mammals, when they are
young, the babies get milk from their mothers. They are warm-blooded, so it’s
important for them to keep from freezing. They do this in the winter by living in
houses they build of logs, sticks, and mud. Beaver houses are called lodges. The
entrance to the lodge is far enough under water that it doesn’t freeze even in
very cold weather. Because the walls of the lodge are very thick and the area in
which the family lives is very small, the heat of their bodies keeps it warm.

1. Which of the following best describes mammals?
   a. Mammals are warm blooded and live in lodges.
   b. Mammals need milk to survive.
   c. Some mammals hatch from eggs that the mother lays.
   d. Mammals produce milk for their newborns. *
   e. Mammals live in warm shelters or caves.

2. Beaver lodges stay warm because
   a. They have an underwater entrance that doesn’t freeze.
   b. The living area is small and the walls are thick. *
   c. Beavers are very hardworking and busy.
   d. Beavers are warm-blooded mammals.
   e. Beavers are mammals.

(* indicates the correct response)
Developing Constructed-Response Assessments Chapter 8
Table 8.7 Checklist for constructing good interpretive items
Yes No Is the item useful for assessing important learning
objectives?
Yes No Is the reading and difficulty level appropriate for my
students?
Yes No Does the item tap more than simple recall?
Yes No Have I avoided questions that are answered literally in
the interpretive material?
Yes No Have I avoided questions that can be answered without
the interpretive material?
Yes No Are multiple-choice items well constructed?
Yes No Is the item absolutely clear and unambiguous?
Yes No Have I included all the information required?
Yes No Are instructions clear?
Yes No Have I avoided unnecessary length?
Yes No Is the interpretive material novel for learners?
Yes No Have I avoided trick questions?
8.5 Developing Constructed-Response Assessments
As the name implies, constructed-response assessments require
test takers to generate their
own responses. Two main kinds of constructed-response
assessments are used in educational
measurement: short-answer items (also referred to as restricted-
response items) and essay
items (also called extended-response items).
The main advantage of constructed-response assessments is that
they lend themselves better
to evaluating higher thought processes and cognitive skills.
Also, they allow for more variation
and more creativity.
When compared with selected-response assessments,
constructed-response assessments
have two principal limitations: First, they usually consist of a
small number of items and therefore sample course content less widely; second, they tend to
be less objective than selected-response assessments, simply because they can seldom be
scored completely objectively.
Short-Answer Constructed-Response Items
Short-answer items sometimes require a response consisting of
only a single word or short
phrase to fill in a blank left in a sentence. These are normally
referred to as completion items
(or fill-in-the-blanks items). At other times, they require a brief
written response—perhaps
only one or two words—that does not fill in a blank space in a
sentence. In contrast, essay
items typically ask for a longer, more detailed written response, often consisting of a number
of paragraphs or even pages.
Completion items can easily be generated simply by taking any
clear, unambiguous, complete
sentence from a written source and reproducing it with a single
word or phrase left out.
One problem with this approach is that it encourages rote
memorization rather than a more
thoughtful approach to studying. In addition, out-of-context
sentences are often somewhat
ambiguous at best; at worst, they can be completely misleading.
Advantages of Short-Answer Items
Short-answer items have several important advantages:
1. Because they require the test taker to produce a response,
they effectively eliminate much
of the influence of guessing. This presents a distinct advantage
over selected-response
approaches such as multiple-choice and true-false assessments.
2. Because they ask for a single correct answer or a very brief response, they are highly
objective even though they ask for a constructed response. As a result, it is a simple matter
to generate a marking key very much as is done for multiple-choice, true-false, or matching
items.
3. They are easy to construct and can quickly sample a wide
range of knowledge.
Limitations of Short-Answer Items
Among their limitations is that because examiners need to read
every response, short-answer
items can take longer to score than selected-response measures.
Nor is constructing short-
answer items always easy: Sometimes it is difficult to phrase
the item so that only one answer
is correct.
Another limitation has to do with possible contamination of
scores due to bad spelling. If
marks are deducted for misspelled words, what is being
measured becomes a mixture of spell-
ing and content knowledge. But if spelling errors are ignored,
the marker may occasionally
have to guess at whether the misspelled word actually
represents the correct answer.
Finally, because correct responses are usually limited to a
single choice, they don’t allow for
creativity and are unlikely to tap processes such as synthesis
and evaluation.
Examples of Short-Answer Items
General guidelines for the construction of short-answer items
include many of those listed
earlier in Tables 8.4 through 8.7. In addition, test makers need
to ensure that only one answer
is correct and that what is required as a response is clear and
unambiguous. For this reason,
when preparing completion items it is often advisable to place blanks at the end of the
sentence rather than in the middle. For example, consider these two completion items:

1. In 1972, __________ produced a film entitled Une Belle Fille Comme Moi.

2. The name of the person who produced the 1972 film Une Belle Fille Comme Moi
is __________.
Although the correct answer in both cases is François Truffaut,
the first item could also be
answered correctly with the word France. But the structure of
the second item makes the
nature of the required response clear. It’s even clearer to
rephrase the sentence as a question
so that it is no longer a fill-in-the-blanks short-answer item. For
example:
What is the name of the director who produced the 1972 film
Une Belle Fille Comme
Moi?
Although short-answer items generally require only one or two
words as a correct response,
some ask for slightly longer responses. For example:
Define face validity.
What states are contiguous with Nevada? __________, __________, __________,
and __________.

As shown in the second example, four blanks are provided as a clue to the number of responses
required. Providing a single, longer blank would increase the item’s level of difficulty.
Essay Constructed-Response Items
Essay items also require test takers to generate their own
responses. But instead of being
asked to supply a single word or short phrase, they are asked to
write longer responses. The
instructions given to the test taker—sometimes referred to as
the stimulus—can vary in length
and form, but should always be worded so that the student has a
clear understanding of
what is required. Whenever possible, the stimulus should also
include an indication of scoring
guidelines. Without scoring guidelines, responses can vary
widely and increase the difficulty
of scoring an item.
Some essay assessments might ask a simple question that
requires a one- or two-sentence
response. For example:
What is the main effect of additional government spending on
unemployment?
Why is weather forecasting more accurate now than it was in
the middle of the
20th century?
Both of these questions can be answered correctly with a few
sentences. And both can be
keyed so that different examiners scoring them would arrive at
very similar results. Items of
this kind can have a high degree of reliability.
Many essay items ask for lengthier expositions that typically require organizing information,
marshaling arguments, defending opinions, appealing to authority, and so on. In responding
to these, students typically have wide latitude both in terms of
what they will say and how
they will say it. As a result, longer essay responses are
especially useful for tapping higher
mental processes such as applying, analyzing, evaluating, and
creating. This is their major
advantage over the more objective approaches.
The main limitation of longer essay assessments has to do with
their scoring. Not only is it
highly time-consuming, but it tends to be decidedly subjective.
As a result, both the reliability and validity of such assessments are lower than those of
more objective measures. For example, some
examiners consistently give higher marks than others (Brown,
2010). Also, because there is
sometimes a sequential effect in scoring essays, essay items that
follow each other are more
likely to receive similar marks than those that are farther apart
(Attali, 2011).
The reliability of essay tests can be increased significantly by
using detailed scoring guides
such as checklists and rubrics. These guides typically specify as
precisely as possible details of
the points, arguments, conclusions, and opinions that will be
considered in the scoring, and
the weightings assigned to each.
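A scoring guide with weighted criteria can be sketched in a few lines (the criteria, weights, and ratings are illustrative assumptions, not from the text):

```python
# Each rubric criterion carries a weight and a maximum rating; the weights
# should sum to 1 so the total falls on a 0-100 scale.
rubric = {
    "thesis clearly stated":      {"weight": 0.2, "max": 5},
    "arguments supported":        {"weight": 0.4, "max": 5},
    "organization and coherence": {"weight": 0.4, "max": 5},
}

def weighted_score(ratings: dict) -> float:
    """Combine per-criterion ratings into a single 0-100 score."""
    return 100 * sum(
        spec["weight"] * ratings[name] / spec["max"]
        for name, spec in rubric.items()
    )

# An essay rated 4/5, 3/5, and 5/5 on the three criteria scores about 80.
print(weighted_score({
    "thesis clearly stated": 4,
    "arguments supported": 3,
    "organization and coherence": 5,
}))
```

Because every scorer applies the same weights and maxima, two examiners using the key arrive at the same total from the same ratings, which is exactly the consistency gain the text attributes to detailed scoring guides.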
Essay questions can be developed to assess knowledge of
subject matter (remembering, in
Bloom’s revised taxonomy); or they can be designed to tap any
of the higher level intellectual
skills. Figure 8.13 gives examples of how this might be done.
Writing good essay questions is not as time-consuming as
writing objective items such as
multiple-choice questions, but it does require attention to
several guidelines. These are sum-
marized in Table 8.8.
Figure 8.13: Essay items tapping higher intellectual skills

Remembering: List the five most important events that led to the Second World War.

Understanding: Explain in no more than half a page how an internal combustion engine works.

Applying: Using the formula for compound interest, calculate the monthly payment for a $150,000 loan at 6.7% amortized over 22 years. Explain the implications of lengthening the amortization period.

Analyzing: Read the following two paragraphs. Compare the use of figures of speech in each. How are they similar? How are they different?

Evaluating: Write an essay appraising the American educational system. What are its strengths and weaknesses? How can it be improved? (Recommended length: 500–1000 words.)

Creating (formulate, compose, construct, build, generate, craft . . .): Devise a procedure with accompanying formulas that can be used to calculate the volume of a small rock of any shape. Describe each step in the procedure.

Note: What each item assesses depends on what test takers already know and the strategies they use to craft their responses. Even responding to the first item, remembering, might require a great deal of creating, analyzing, evaluating, and understanding if the learner has not already memorized a correct response.
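As an aside, the compound-interest item in Figure 8.13 can be checked with the standard amortization formula; the formula below is general financial knowledge, not quoted from the figure, and the helper name is my own:

```python
# Monthly payment on an amortized loan: M = P * r / (1 - (1 + r)**-n),
# where r is the monthly rate and n the number of monthly payments.

def monthly_payment(principal, annual_rate, years):
    r = annual_rate / 12
    n = years * 12
    return principal * r / (1 - (1 + r) ** -n)

# The figure's parameters: $150,000 at 6.7% over 22 years.
pay_22 = monthly_payment(150_000, 0.067, 22)

# A lengthened amortization period, as the item asks students to discuss.
pay_30 = monthly_payment(150_000, 0.067, 30)

print(round(pay_22, 2))
print(round(pay_30, 2))

# Lengthening the period lowers the monthly payment but increases the
# total amount repaid over the life of the loan.
total_22 = pay_22 * 22 * 12
total_30 = pay_30 * 30 * 12
```

A strong student response would note exactly this trade-off: a smaller payment each month in exchange for more interest paid overall.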
Table 8.8 Checklist for constructing good essay items

Yes / No   Do the essay questions assess intended instructional objectives?
Yes / No   Have I worded the stimulus so that the requirements are clear?
Yes / No   Am I assessing more than simple recall?
Yes / No   Have I indicated how much time should be spent on each item (usually by saying how many points each item is worth)?
Yes / No   Have I developed a scoring rubric or checklist?
Planning for Assessment
In Chapter 2, and again earlier in this chapter, we described the
various steps that make up
an intelligent and effective assessment plan:
1. Know clearly what your instructional objectives are, and
communicate them to your
students.
2. Match instruction to objectives; match instruction to
assessment; match assessment to
objectives.
3. Use formative assessment as an integral part of instruction.
4. Use a variety of different assessments, especially when
important decisions depend on
their outcomes.
5. Use blueprints to construct tests and develop keys, checklists,
and rubrics to score them.
The importance of these five steps can hardly be
overemphasized.
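The test blueprint mentioned in step 5 can be sketched as a simple grid allocating items across content areas and cognitive levels. A minimal illustration; the content areas and counts below are hypothetical:

```python
# A sketch of a test blueprint: a grid mapping content areas to the
# number of items planned at each cognitive level, with simple totals
# to check that the planned test matches its intended emphasis.
# All content areas and counts are hypothetical.

BLUEPRINT = {
    # content area: {cognitive level: number of items}
    "fractions":   {"remembering": 2, "applying": 4, "analyzing": 2},
    "decimals":    {"remembering": 2, "applying": 3, "analyzing": 1},
    "percentages": {"remembering": 1, "applying": 3, "analyzing": 2},
}

def items_per_level(blueprint):
    """Total items planned at each cognitive level, across content areas."""
    totals = {}
    for cells in blueprint.values():
        for level, count in cells.items():
            totals[level] = totals.get(level, 0) + count
    return totals

def total_items(blueprint):
    """Planned length of the whole test."""
    return sum(sum(cells.values()) for cells in blueprint.values())

print(items_per_level(BLUEPRINT))
print(total_items(BLUEPRINT))
```

Laying the plan out this way makes it easy to see, before any items are written, whether the test over-samples simple recall at the expense of higher-level skills.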
Chapter 8 Themes and Questions
Section Summaries
8.1 Planning for Teacher-Made Tests

Important steps in planning for assessment include clarifying instructional objectives, devising test blueprints, matching instruction and assessment to goals, developing rubrics and other scoring guides, and using a variety of approaches to assessment. Assessment should be used for placement decisions, for improving teaching and learning (diagnostic and formative functions), and for evaluating and grading achievement and progress (summative function).