Journal of Interactive Learning Research (2012) 23(1), ?-?
Automated Formative Assessment as a
Tool to Scaffold Student Documentary Writing
BILL FERSTER
University of Virginia
bferster@virginia.edu
THOMAS C. HAMMOND
Lehigh University
hammond@lehigh.edu
R. CURBY ALEXANDER
University of North Texas
curbyalexander@gmail.com
HUNT LYMAN
The Hill School
huntlyman@thehillschool.org
The hurried pace of the modern classroom does not permit
formative feedback on writing assignments at the frequency
or quality recommended by the research literature. One so-
lution for increasing individual feedback to students is to in-
corporate some form of computer-generated assessment. This
study explores the use of automated assessment of student
writing in a content-specific context (history) on both tradi-
tional and non-traditional tasks. Four classrooms of middle
school history students completed two projects, one cul-
minating in an essay and one culminating in a digital docu-
mentary. From the total set of completed projects, approxi-
mately 70 essays and 70 digital documentary scripts were
then scored by human raters and by an automated evaluation
system. The student essays were used to test the comparison
of human and computer-generated feedback in the context of
history education, and the digital documentary scripts were
used to test feedback given on a non-traditional task. The
results were encouraging, with very high correlation and reliability coefficients within and across both sets of documents, sug-
gesting the possibility of new forms of formative assessment
of student writing for content-area instruction in a variety of
emerging formats.
Keywords: Automated formative assessment, writing, history educa-
tion, digital documentaries
Among the many possible strategies for social studies instruction,
writing-intensive activities stand out as a promising but challenging teach-
ing tool. On the one hand, student writing is a powerful mechanism for im-
proving student learning outcomes in social studies (Greene, 1994; Nelms,
1987; Risinger, 1987, 1992; Smith & Niemi, 2001; Sundberg, 2006; Van
Nostrand, 1979). On the other hand, implementing effective student writing
tasks is difficult. Writing tasks are time-consuming, especially when mea-
sured against an already crowded social studies curriculum (Beyer & Brost-
off, 1979b; Nash, Crabtree, & Dunn, 2000). Social studies teachers typically
receive very little instruction in scaffolding students' writing and providing
effective feedback (Jolliffe, 1987). Furthermore, some students are reluctant
writers, approaching any act of writing, and particularly writing-for-assessment, with anxiety or even dread (Pajares, 2003). Organizing their ideas
or even the act of getting started can be overwhelming (Beyer & Brostoff,
1979a). The use of writing in social studies education deserves continued
scrutiny, and any new strategies must address these existing barriers.
A promising point of focus for exploring student writing in social stud-
ies is the use of formative feedback to the writer. Cognitive scientists and
educators have demonstrated that rapid and appropriate feedback on student
projects has a strong positive effect on the quality of student work (Mory,
2004). Formative feedback can encourage and guide struggling writers, refine
students' content mastery, and develop social studies skills (Beyer,
1979; Nelson, 1990; Olina & Sullivan, 2002). As an instructional best prac-
tice, therefore, social studies teachers should provide students with forma-
tive feedback at several stages in the composition process.
Unfortunately, the majority of our students live in a world where teach-
ers have up to 5 sections with an average of over 23 pupils in each (Gruber,
Broughman, Strizek, & Burian-Fitzgerald, 2002). Combining the realities
can integrate digital documentary projects into their instruction to develop
students' content knowledge, historical thinking skills, and expression skills
(Author, 2009).
BACKGROUND
To provide the context for this study, three areas will be examined: (1)
the tool that provides the framework and context for exploring automated
formative assessment, (2) the role that feedback has in effective student
learning, and (3) the nature and efficacy of automated essay scoring re-
search efforts.
Context: Online Digital Documentary Tool (PrimaryAccess)
This study explored the feasibility of integrating automated assessment
into PrimaryAccess (www.primaryaccess.org), a suite of free, web-based
applications that allows teachers to draw upon thousands of indexed histori-
cal images to create customized activities for their students (Author, 2006
& 2008). The most common use of PrimaryAccess is the creation of digi-
tal documentaries. The images used are typically online archival resources,
such as photographs, paintings, engravings, maps, and documents from
sites such as the Virginia Center for Digital History. (See Figure 1.) How-
ever, teachers and students can incorporate any online images, including
their own work. The narration that accompanies the image stream is based
on a student-authored script (Figure 1b). These scripts share many of the
same characteristics as traditional essays in terms of their expository nature,
length, and internal structure.
Figure 1. Steps involved in creating a digital documentary with PrimaryAccess: (a) select resources, (b) write script, (c) set motion, (d) show movie. (Image source: National Archives and Records Administration)
The script-composition process in PrimaryAccess takes place in a sim-
ple text editor. Students can save iterative versions of the script, often re-
vising and expanding them as prompted by teacher feedback, delivered either asynchronously, in the form of text or audio notes, or synchronously, as in-class
discussions (Author, 2007). The script becomes the basis for the visual production stages: students annotate the script by adding primary source images and then set these images in motion to create the documentary's visual sequence. A voice-over narration, recorded with a built-in audio editor,
completes the documentary-making process. This sequence of iterative refinement of text, visual arrangement, and narration reinforces the concept
of writing as process, not product, to improve student learning and performance outcomes, as suggested by the research on student writing (e.g.,
Faigley, Cherry, Jolliffe, & Skinner, 1985).
Writing these scripts is therefore a critical step in the process. However,
during our field testing, we have observed, and participating teachers have
confirmed, that the writing is typically the students' least favorite element
of digital documentary-making as compared to image selection and editing
(Author, 2009). Researchers across multiple institutions are exploring ways
to scaffold the writing process, but one possible support is to provide some
formative assessment in the form of automated feedback during the script
writing stage.
systems can even offer precise feedback about what to change to improve
the essay.
Automated essay scoring (AES) was pioneered by Ellis Page, who developed
the Project Essay Grader (PEG) in the mid-1960s. PEG applied statistical
techniques such as multiple linear regression to essays and considered such
factors as essay length, number of commas, prepositions, and uncommon
words in a weighted model of what he thought approximated the internal
structures used by human raters. Page found high (.78) correlations be-
tween the PEG system and human raters of the same essays, compared to a
.85 correlation between any two human scorers (Kukich, 2000).
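To make this concrete, the sketch below (ours, not Page's actual model) computes a few PEG-style surface features from an essay's text; the word lists and feature choices are illustrative placeholders only.

```python
import re

# Illustrative only: a few of the surface features PEG weighted (length,
# commas, prepositions, uncommon words). The word lists are placeholders.
PREPOSITIONS = {"in", "on", "at", "by", "of", "for", "with", "from", "to"}
COMMON_WORDS = {"the", "a", "an", "and", "or", "but", "is", "was", "were",
                "in", "on", "of", "to", "it", "that", "this"}

def surface_features(essay: str) -> dict:
    """Extract PEG-style surface features from an essay's text."""
    words = re.findall(r"[A-Za-z']+", essay.lower())
    return {
        "length": len(words),                      # raw word count
        "commas": essay.count(","),
        "prepositions": sum(w in PREPOSITIONS for w in words),
        "uncommon_words": sum(w not in COMMON_WORDS for w in words),
    }

print(surface_features("The Great Migration, which began around 1916, "
                       "reshaped cities in the North."))
```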
The next 30 years saw vigorous research into the automatic scoring
of essays using a wide range of mathematical techniques and essay features,
including Bayesian Inference, Latent Semantic Analysis, Neural Networks, and others. Although these systems use a variety of computa-
tional modeling approaches, the overall mechanisms are similar. Typically,
hundreds of exemplar essays are hand-scored by human raters. This scor-
ing is put through rigorous inter-rater reliability testing to ensure the accu-
racy of the human ratings. The essays, with scores reflecting the full range
of possible quality levels, are entered into the AES system to train it on the
essay topic. Once trained, the system applies its internal model to estimate,
in a matter of seconds, the score an arbitrary essay written on the same topic would receive.
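The train-then-score workflow described above can be sketched as an ordinary least-squares fit over surface features such as those in the previous example; the essays, scores, and feature set here are hypothetical, and production systems use far richer models.

```python
import numpy as np

def fit_weights(feature_rows, human_scores):
    """Least-squares fit of a weighted feature model to hand-scored exemplars."""
    names = sorted(feature_rows[0])
    X = np.array([[row[n] for n in names] for row in feature_rows], dtype=float)
    X = np.column_stack([np.ones(len(X)), X])        # intercept term
    y = np.array(human_scores, dtype=float)
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return names, weights

def predict_score(features, names, weights):
    """Apply the fitted model to a new essay's features."""
    x = np.array([1.0] + [float(features[n]) for n in names])
    return float(x @ weights)

# Hypothetical usage: `training_set` holds hundreds of (essay text, human score)
# pairs; `surface_features` is the feature extractor sketched earlier.
# names, weights = fit_weights([surface_features(e) for e, _ in training_set],
#                              [s for _, s in training_set])
# print(predict_score(surface_features(new_essay), names, weights))
```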
The Educational Testing Service (ETS) began experimenting with
natural-language-processing and information retrieval techniques in the
1990s to provide automated scoring of essays within the Analytical Writing Assessment portion of the Graduate Management Admission Test (GMAT). Their e-rater system used a step-wise linear regression of
over 100 essay features to provide a high degree of agreement with human
raters (Wang & Brown, 2007). Valenti, Neri, and Cucchiarelli (2003) compared the performance of ten AES systems in terms of (a)
accuracy of scoring, (b) multiple regression correlation, and (c) agreement
with human scoring. The systems performed at levels between .80 and .96
on these three measures, and the ETS e-rater system yielded 87-94% agreement with human-scored essays. These correlations are comparable to
those researchers would expect among essays scored by two or more human
scorers (Wang & Brown, 2007).
The effectiveness of AES systems in relation to human raters is well
documented in the literature. A number of studies have cited very high cor-
relations between AES and human scoring, typically with an 85-90% agree-
ment (Attali & Burstein, 2006; Burstein, 2003; Hearst, 2000). Most studies
were performed using essays from the GMAT exams, expository language
arts essays, or science assessments (Valenti, Neri, & Cucchiarelli, 2003).
There is little research on AES in the contexts of social studies instruction
and/or non-traditional writing formats.
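Agreement figures such as those above are typically reported as the percentage of documents on which two sets of scores match exactly or within one point; a minimal sketch of that calculation, using hypothetical scores, is shown below.

```python
def agreement(scores_a, scores_b, tolerance=0):
    """Percentage of documents on which two raters agree within `tolerance` points."""
    matches = sum(abs(a - b) <= tolerance for a, b in zip(scores_a, scores_b))
    return 100.0 * matches / len(scores_a)

# Hypothetical holistic scores on a 1-4 scale (not the study's data).
human = [3, 2, 4, 3, 1, 2, 3, 4]
aes   = [3, 3, 4, 2, 1, 2, 4, 4]
print(agreement(human, aes))               # exact agreement (%)
print(agreement(human, aes, tolerance=1))  # exact-plus-adjacent agreement (%)
```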
To be valuable, feedback must be contextualized: a score alone, without specific details of what produced it, can frustrate learners. Researchers who devel-
oped a set of Web-based case studies for preparing teachers to use technol-
ogy (ETIPS; see http://www.etips.info) added AES to provide formative
feedback on the decision essays composed by preservice teachers at the cul-
mination of their case studies. An initial study of 27 preservice teachers us-
ing the AES found that the nature of the feedback was not sophisticated or
detailed enough to guide students in improving their writing (Scharber &
Dexter, 2004). After revising the feedback, researchers studied 70 preservice
teachers and found a moderate impact on the quality of the essays. Sixty-three percent reported in a survey that the AES encouraged them to complete more drafts of their answers than they might have otherwise (Riedel,
Dexter, Scharber & Doering, 2006).
At the K-12 level, Vantage Learning (2007) has developed a web-based
instructional writing product (MyAccess!) designed for students in grades
4 and higher. Among other features, the software provides automated feed-
back to students during the essay writing process, as well as upon comple-
tion, via its IntelliMetric Essay Scoring System. The software provides both
a holistic score and analytical scores in the areas of "Focus and Meaning;
Content and Development; Organization; Language, Use and Style; and
Mechanics and Conventions" (p. 1). The developer has performed a number of studies indicating that the automated scoring of students' writing is
comparable to scoring provided by expert human raters, although not all in-
dependent studies have agreed with their results (e.g., see Brown & Wang,
in press). We could find no independent studies on the use of its AES with
K-12 students.
Our ultimate interest in AES is its use in a formative manner, to guide
the activity and encourage revision based on specific feedback, rather than
a summative manner. An AES system may be able to scaffold students'
script writing in our documentary-making tool. The literature includes few
studies by independent researchers examining the use of AES as formative
feedback with K-12 students, and none we could find in the context of so-
cial studies learning or digital documentary creation. Before testing AES
with students as they work on authoring digital documentaries, however, we
must answer two initial questions:
1. In the context of history education, does an automated essay
scoring system provide feedback on student essays that is
similar to the feedback provided by a human grader?
2. Does an automated essay scoring system provide feedback
on student digital documentary scripts that is similar to the
feedback provided by a human grader?
METHOD
To complete this initial test of the feasibility of using AES as a for-
mative feedback and writing scaffold for history documentary scripts, we
needed to see if an automated assessment system could perform as well as
human scoring of the student essays and digital documentary scripts. If the
assessments are close, it stands to reason that adding an automated capa-
bility that assesses students' scripts could be a powerful tool for formative
assessment to improve student engagement and learning outcomes. This is
not a criticism of the educational system or an attempt to "teacher-proof"
the classroom but an experiment to see if a technological intervention might
augment existing classroom relationships.
The data were collected as part of a larger study funded by the Jes-
sie Ball duPont Foundation to test the content-based learning outcomes of
students when using the digital documentary tool as compared to their re-
ceiving more traditional instruction. The research took place during a single unit of instruction on early 20th-century American history. Within this
unit, students spent three days exploring the Great Migration and three days
exploring the Harlem Renaissance. For each topic, the students spent one
day working through activity stations to learn about the topic (e.g., primary
source texts and photos of emigrants from the South, videos of Lindy Hop
dancers, audio clips of blues and jazz). The students then spent two days
making their own account about the time period: either an essay (a tradition-
al format for student writing) or a scripted digital documentary (an emerg-
ing medium for history education).
The essays and digital documentary scripts were comparable in terms
of word length, writing style, content, and factual exposition. Although a
professionally-produced documentary narration would look very different
from an essay, in our experience K-12 students tend to write their narrations
in an essay format because that is the writing form with which they are most
familiar. As such, the documentary scripts written by students exhibit many
of the same characteristics that mark a good five-paragraph essay: basic expository structure, persuasive ability, adherence to conventions of mechanics and grammar, and accurate and germane content.
Participants
Participants were 87 seventh-grade American History students at a pub-
lic middle school located in a small urban area of a mid-Atlantic state. This
student group was racially and ethnically diverse, with approximately equal
numbers of boys and girls. The majority of the students were from low- to
middle-income socio-economic status. The participating students repre-
sented a wide variety of ability and engagement levels. The students experi-
enced instruction and project work as members of four classes, all taught by
two teachers. One participating teacher was a 25-year veteran of the school
system, and the other was a novice in her first teaching assignment. Due to
the design of learning stations followed by project work, all students, re-
gardless of class or teacher, experienced the same instruction.
Procedure
Over the course of six days (three on the Great Migration and three
on the Harlem Renaissance), the students created a total of 144 student-
authored documents. Each student experienced both the experimental and
the control condition: on the Great Migration portion of the unit, the student
created either a digital documentary or a traditional essay. For the following
topic, the condition was reversed. (Due to absences and incomplete work,
not all 87 students produced both a digital documentary script and an essay.)
The final pool of documents for analysis contained 73 essays and 71 digital
documentary scripts.
Two former readers for the Advanced Placement (AP) language arts
exam scored the students' documents. The scoring was conducted blind:
the raters did not know whether an individual document was an essay or a
script. The raters used a standard 6+1 rubric designed for use with middle
school students. The 6+1 rubric first asks raters to score each essay in terms
of six characteristics, or traits: ideas and content, organization, voice, word
choice, sentence fluency, and conventions. Each of the six factors is rated
individually with a score from 1 to 5.
1. NOT YET: a bare beginning; writer not yet showing any control.
2. EMERGING: need for revision outweighs strengths.
3. DEVELOPING: strengths and need for revision are about equal.
4. COMPETENT: on balance, the strengths outweigh the weaknesses.
5. STRONG: shows control and skill in this trait; many strengths
present.
Following the scoring of the six components, the rubric calls for a ho-
listic score, ranging from 1 to 4, to assess overall quality. Indicators of qual-
ity are whether the student addressed the prompt, how sophisticated the
writing was, how precisely the facts and arguments were presented and the
relevance of those facts to the prompt, and the level of logical thinking in
the student's arguments.
The scorers followed the protocols used in AP exam scoring. First, they
worked independently to score the same 20 essays. Next, they compared
their results and discussed any divergence to encourage rating agreement.
The raters then worked alone, each scoring the entire remaining set of 124
documents, containing both standard essays and scripts. This double-scoring
of the entire set is a departure from AP practices, in which essays are typi-
cally read by a single reader with only 1 in 60 receiving a second reading
(Venkateswaran & Morgan, 2002).
We chose Educational Testing Service's (ETS) Criterion™ online essay
evaluation service as the comparison scorer for the two human evaluators.
The Criterion system, described above, has a long track record of successful use in multiple contexts: college or graduate school admissions, …. The
tool is strongly recommended by the literature on AES (cite). The researchers entered each of the 144 documents and then recorded and analyzed the auto-
mated response. The AES provides two substantive forms of feedback: a ho-
listic score, ranging from 1 to 4, and verbose responses over five domains:
grammar, usage, mechanics, style, and organization. Within each domain,
between 6 and 11 categories of potential problems are evaluated. The re-
sponses for grammar, usage, and mechanics identify errors such as pronoun
and possessive errors, nonstandard word forms, and missing articles, many
of which are similarly flagged by the grammar-checking tools in programs
such as Microsoft Word. The style and organization categories provide ad-
ditional feedback in areas not usually addressed through automated respons-
es. Students are alerted to stylistic problems such as repeated words, many
short sentences, and many long sentences. Students are also provided with
non-substantive descriptive statistics, such as the number of sentences and
average number of words per sentence.
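The shape of this feedback can be illustrated with a simple record combining the holistic score, the verbose domain comments, and the descriptive statistics; the field names below are ours, not Criterion's actual output schema.

```python
from dataclasses import dataclass, field

@dataclass
class AutomatedFeedback:
    """Hypothetical container mirroring the feedback categories described above."""
    holistic_score: int                                   # 1-4 overall rating
    domain_comments: dict = field(default_factory=dict)   # five verbose domains
    sentence_count: int = 0                               # descriptive statistics
    avg_words_per_sentence: float = 0.0

feedback = AutomatedFeedback(
    holistic_score=3,
    domain_comments={
        "grammar": ["possible subject-verb agreement error in sentence 4"],
        "usage": ["missing article in sentence 2"],
        "mechanics": [],
        "style": ["several consecutive short sentences"],
        "organization": ["provides a clear sequence of information"],
    },
    sentence_count=18,
    avg_words_per_sentence=14.2,
)
print(feedback.holistic_score, len(feedback.domain_comments))
```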
RESULTS
The first step was to examine the similarities in scoring of the student
history essays. These essays are a traditional format for evaluation by AES.
However, the essays used in this case were prepared for the purpose of mas-
tering historical content (i.e., the Great Migration and the Harlem Renais-
sance), not the demonstration of writing ability. The comparison between
human- and computer-generated ratings on students' essays was encouraging, yielding a .88 Cronbach's alpha reliability coefficient and a statistically significant .79 correlation coefficient (p < .01) between the human- and
machine-graded holistic scores on the essays. In the context of essays written for the purposes of history education, the AES provided scores very similar to the human evaluators' (see Table 1).
Table 1
Descriptive Statistics for Traditional Writing Context (Essay)

                      Mean    SD      N
Human-scored essays   2.67    0.987   73
AES-scored essays     3.202   1.1783  73
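For readers who wish to reproduce this kind of comparison on their own data, the sketch below computes a Pearson correlation and a two-rater Cronbach's alpha; the score vectors are hypothetical placeholders, not the study's data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two score vectors."""
    return float(np.corrcoef(x, y)[0, 1])

def cronbach_alpha(ratings):
    """Cronbach's alpha for a (documents x raters) score matrix."""
    k = ratings.shape[1]
    item_variances = ratings.var(axis=0, ddof=1).sum()
    total_variance = ratings.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Hypothetical placeholder scores (not the study's data).
human = np.array([2, 3, 4, 2, 3, 1, 4, 3], dtype=float)
aes   = np.array([3, 3, 4, 2, 4, 2, 4, 3], dtype=float)
print(pearson_r(human, aes))
print(cronbach_alpha(np.column_stack([human, aes])))
```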
The second step was to compare the humansâ scores against the AES
scores across the non-traditional set of documents, the 71 digital documen-
tary scripts. Again, the two sets of scores showed a tight correspondence
(see Table 2). The high Cronbach's alpha reliability coefficient (.84) and
correlation coefficient (.73, p < .01) again indicate that the computer-gener-
ated evaluation closely matches that of humans, even in a format other than
a traditional essay.
Table 2
Descriptive Statistics for Non-traditional Writing Context (Digital Documentary Script)

                       Mean    SD      N
Human-scored scripts   2.49    1.040   71
AES-scored scripts     3.169   0.9169  71
The next step was to more closely examine the relationship between
the traits scores (i.e., scores for ideas and content, organization, voice, word
choice, sentence fluency, and conventions) and the holistic scores from both
the human raters and the AES. For the human raters, their holistic score varied directly with their scoring of the 6 traits, F(6, 137) = 173.7, p < .001. The
same relationship existed between the human-generated traits scores and the
AES's holistic score, F(6, 137) = 35.71, p < .001. This correspondence sug-
gests that the holistic scores (whether human-generated or computer-gen-
erated) and the scores of the 6 individual traits were measuring analogous
internal constructs (see Tables 3-4).
Table 3
ANOVA of Human Scorers' 6 Traits and Holistic Scores

             Sum of Squares   df    Mean Square   F       Sig.
Regression   129.922          6     21.654        173.7   .000
Residual     17.078           137   .125
Total        147.000          143

Table 4
ANOVA of Human Scorers' 6 Traits and Automated Holistic Scores

             Sum of Squares   df    Mean Square   F       Sig.
Regression   96.889           6     16.148        35.71   .000
Residual     61.955           137   .452
Total        158.843          143
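The F statistics in Tables 3 and 4 come from regressing the holistic score on the six trait scores; the sketch below shows that computation on hypothetical data (it is not the study's dataset).

```python
import numpy as np

def regression_anova(traits, holistic):
    """Regress holistic scores on trait scores; return the ANOVA quantities."""
    n, p = traits.shape
    X = np.column_stack([np.ones(n), traits])
    beta, *_ = np.linalg.lstsq(X, holistic, rcond=None)
    fitted = X @ beta
    ss_reg = float(((fitted - holistic.mean()) ** 2).sum())
    ss_res = float(((holistic - fitted) ** 2).sum())
    df_reg, df_res = p, n - p - 1
    f_stat = (ss_reg / df_reg) / (ss_res / df_res)
    return {"ss_reg": ss_reg, "ss_res": ss_res,
            "df": (df_reg, df_res), "F": f_stat}

# Hypothetical data: 12 documents, six trait scores (1-5) and a holistic score (1-4).
rng = np.random.default_rng(0)
traits = rng.integers(1, 6, size=(12, 6)).astype(float)
holistic = np.clip(np.rint(traits.mean(axis=1) + rng.normal(0, 0.5, 12)), 1, 4)
print(regression_anova(traits, holistic))
```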
During our analysis, we observed a relationship between the length of
a document (i.e., a word count for the essay or digital documentary script)
and the holistic scoring. For both the human and the automated scoring,
there was a statistically significant correlation between the number of words
in a document and its holistic score: a .67 correlation with the human-gen-
erated holistic scores and a .81 correlation with the AES-generated holis-
tic score. While some correspondence between length and quality is math-
ematically probable (i.e., a more fully-developed essay will tend to have
more words than a less well-developed essay), the gap between the human
and computer-generated correlation coefficients raised a concern: a student
might be able to "game" the automated assessment by writing a longer essay or script and thus obtaining a higher score. This possibility directed our
attention to the verbose feedback provided by the AES along with its holis-
tic score.
To explore the quality of the verbose feedback provided by the AES,
we compared the system's comments to the students' scripts to see whether
these comments were meaningful to the reader. Most of these comments
were accurate but phrased in very generic terms. For example, the response
for a "good" essay (scoring 3 out of 4 possible points) included the statement that the essay "provides a clear sequence of information; provides
pieces of information that are generally related to each other." This state-
ment was correct but did not provide guidance for further revision by the
student. We then searched for specific instances in which the automated
feedback represented a misunderstanding of the writing, offering feedback
that no competent human reviewer would make. Across the 144 sets of responses, we identified fewer than 10 examples of these errors, all grammatical. For example, the following sentence was flagged as containing a subject-verb agreement error: "Throughout the 20th Century, the segregation of
blacks and whites was abolished." In this case, the AES read "whites" as
the subject; the subject is actually "segregation." Earlier in the same essay,
a sentence that begins, "In the early 1900's," was also flagged as having an
extraneous article ("the"), when it is in fact required.
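This false flag is consistent with a shallow heuristic that treats the noun nearest the verb as its subject; the toy check below, with hand-tagged tokens rather than a real parser, reproduces the misreading. It is only an illustration, not Criterion's actual method.

```python
# Toy illustration (not Criterion's actual method): a shallow checker that
# takes the noun nearest the verb as its subject misparses this sentence.
# Tokens are hand-tagged here; a real system would use a tagger or parser.
tokens = [("Throughout", "IN"), ("the", "DT"), ("20th", "JJ"), ("Century", "NN"),
          (",", ","), ("the", "DT"), ("segregation", "NN"), ("of", "IN"),
          ("blacks", "NNS"), ("and", "CC"), ("whites", "NNS"),
          ("was", "VBD"), ("abolished", "VBN"), (".", ".")]

verb_index = next(i for i, (word, tag) in enumerate(tokens) if tag == "VBD")
# Naive heuristic: the subject is the nearest preceding noun.
naive_subject = next(word for word, tag in reversed(tokens[:verb_index])
                     if tag.startswith("NN"))
print(naive_subject)  # 'whites' (plural) + 'was' (singular) -> spurious flag

# A parser that skips the prepositional phrase "of blacks and whites" would
# instead identify the singular head noun "segregation" as the subject.
```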
DISCUSSION
The high correlation between the automated and human scores in both
sets of documents (essays and scripts) and the overall high quality of the
feedback suggest that adding an option for students to submit their writing
for automated feedback could be a useful formative assessment tool, even
in the context of history education and in a non-traditional format such as a
digital documentary. Given an AES module integrated into PrimaryAccess,
students would be able to access an instant, consequence-free first round of
feedback on the style, mechanics, and structure of their scripts. This feed-
back can lead to improved student engagement and multiple revisions of
scripts, resulting in higher quality end products and increased student learn-
ing.
The results, however, underscored the significance of students receiv-
ing human feedback and not just computer-generated evaluations. As noted, some of the students' human touches in their writing eluded the AES
programmers' heuristics. In our raters' opinion, the false-flag "errors" were
departures from convention that improved the quality of the document. An
improved AES can reduce the number of instances of such errors, but they
cannot be wholly eliminated. Additionally, substantive feedback about the
content of the scripts (the accurate portrayal of historical facts and not
merely their expression) will still need to be provided by the teacher. For
example, the student statement that "Throughout the 20th Century, the segregation of blacks and whites was abolished" is grammatically correct, but
the historical understandings can be improved: 20th-century desegregation
was not a unified, completed process but rather an on-going mix of policy
decisions (Executive Order 9981, 1948), legal actions (e.g., Brown v. Board
of Education of Topeka, 1954), and personal choices (James Meredith's decision to apply to the University of Mississippi, 1961). An AES able to provide
this level of content-specific feedback in the social studies is both theoretically and practically impossible; a teacher will have to make the judgment
call as to which nuances to introduce to the student's thinking. However,
any automated assistance to the student regarding his or her writing should
give the teacher greater latitude to focus on students' content understandings
and thought processes.
This study faces several limitations. First, this was a relatively small-
scale study with only two human raters following an approved protocol. A
larger pool of documents and additional human raters would strengthen the
interpretability of the quantitative analysis. Second, the participants were
middle school history students; the results do not generalize to other groups
or other uses of the AES, especially not to high-stakes assessments such as
the SATs or end-of-year, summative assessments of student achievement.
Finally, the AES used was designed to grade essays, and the human graders
in this study were trained experts in grading essays written by high-school-level students taking Advanced Placement exams. If teachers had the
time and inclination to teach students the fine points of documentary making, the resulting scripts might bear little resemblance to essays.
FUTURE RESEARCH
The ETS Criterion system appeared capable of delivering high-quality contextual feedback on the essays, but more research needs to be done to
provide the content-area knowledge required for these digital documentaries
and other forms of writing in the social studies. What value does the scoring
provide to the teaching and learning of history? Could the automated scor-
ing process inhibit or standardize students' writing? What interaction effects
exist between automated scoring and different teachers' teaching styles or
levels of expertise?
While this study sought to confirm the reliability of automated scoring relative to human scoring, future studies will investigate the value
of automated formative assessment to students and their learning outcomes.
Future research opportunities include testing the use of automated assessment in classrooms to compare its efficacy against no feedback or limited teacher feedback, and examining differences in number of revisions,
time on task, and engagement. These differences can be correlated with the
quality of the students' final products and/or changes on pre/post assess-
ments of writing or content knowledge. Furthermore, the handling of false
flags, such as the example cited above, needs to be more fully explored.
For an AES to be effective, as Shute (1994) noted, "the system [must]
behave intelligently, not actually be intelligent, like a human being" (p. 50).
Most people have had the experience of mistyping a word while entering a
Google search (school deform, for example) and having Google's web application return the message, "Did you mean: school reform?" The original
search term is not immediately identifiable as erroneous: the words weren't
misspelled, and deform is a verb that can take school as its subject. However, the
Google database does know that most people who typed in school deform
ultimately searched for school reform. By drawing on the large number of people who use its search engine and some shrewd programming decisions,
Google has been able to make its system appear more intelligent.
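The general idea can be sketched with a made-up query log: suggest the query that most users who typed the same string ultimately searched for. Google's actual system is, of course, far more sophisticated.

```python
from collections import Counter

# Made-up query log: (what the user typed, what they ultimately searched for).
query_log = [
    ("school deform", "school reform"),
    ("school deform", "school reform"),
    ("school deform", "school deform"),
    ("school reform", "school reform"),
]

def did_you_mean(query, log, min_share=0.5):
    """Suggest the query most users ended up running after typing `query`."""
    outcomes = Counter(final for typed, final in log if typed == query)
    if not outcomes:
        return None
    best, count = outcomes.most_common(1)[0]
    share = count / sum(outcomes.values())
    return best if best != query and share >= min_share else None

print(did_you_mean("school deform", query_log))  # -> 'school reform'
```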
Because all of the PrimaryAccess web applications are instrumented to
create a time-stamped log of a student's activity while writing and creating a
project, it may be possible to assess some student projects without
needing to test the student separately to determine what they
know. If enough information about how these projects were created can be
captured and compared against a large enough pool of similarly instrumented, human-scored projects, this process data could serve as another assessment source alongside automated essay scoring feedback.
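The log format and process features below are purely illustrative (they are not PrimaryAccess's actual schema), but they suggest the kind of process data that could be compared against human-scored projects.

```python
from datetime import datetime

# Illustrative log records (not PrimaryAccess's actual schema): each entry is a
# time-stamped action a student took while building a documentary.
log = [
    {"time": "2011-03-01T10:02:00", "action": "save_script", "words": 120},
    {"time": "2011-03-01T10:21:00", "action": "save_script", "words": 180},
    {"time": "2011-03-01T10:40:00", "action": "add_image"},
    {"time": "2011-03-02T10:05:00", "action": "save_script", "words": 240},
    {"time": "2011-03-02T10:35:00", "action": "record_narration"},
]

def process_features(entries):
    """Summarize process data that could be compared against human-scored projects."""
    times = [datetime.fromisoformat(e["time"]) for e in entries]
    saves = [e for e in entries if e["action"] == "save_script"]
    return {
        "revisions": len(saves),
        "final_word_count": saves[-1]["words"] if saves else 0,
        "elapsed_minutes": (max(times) - min(times)).total_seconds() / 60,
        "distinct_actions": len({e["action"] for e in entries}),
    }

print(process_features(log))
```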
References
Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater v.2.
Journal of Technology, Learning, and Assessment, 4(3), n.p. Retrieved
November 13, 2009 from http://escholarship.bc.edu/cgi/viewcontent.
cgi?article=1049&context=jtla
Beyer, B.K. (1979). Pre-writing and re-writing to learn. Social Education, 43(3),
187-189, 197.
Beyer, B.K., & Brostoff, A. (1979a). Writing to learn in Social Studies: Intro-
duction. Social Education, 43(3), 176-177.
Kukich, K. (2000). Beyond automated essay scoring. IEEE Intelligent Systems,
15(5), 22-27.
Lenhart, A., Arafeh, S., Smith, A., & Macgill, A. (2008). Writing, Technology,
and Teens. Washington, DC: Pew Internet & American Life Project.
Mager, R. (1997). Making instruction work. Atlanta, GA: Center for Effective
Performance.
Mory, E. H. (2004). Feedback research revisited. In D. H. Jonassen (Ed.), Hand-
book of research on educational communications and technology (pp. 745-
783). Mahwah, NJ: Lawrence Erlbaum.
Nash, G. B., Crabtree, C., & Dunn, R. E. (2000). History on trial: Culture
wars and the teaching of the past. New York: Vintage Books.
Nelms, B.F. (1987). Response and responsibility: Reading, writing, and Social
Studies. The Elementary School Journal, 87(5), 571-589.
Nelson, J. (1990). This was an easy assignment: Examining how students inter-
pret academic writing tasks. Research in the Teaching of English, 24, 362-
396.
Olina, Z., & Sullivan, H.J. (2002). Effects of classroom evaluation on student
achievement and attitudes. Educational Technology Research & Develop-
ment, 50(3), 61-75.
Pajares, F. (2003). Self-efficacy beliefs, motivation, and achievement in writing:
A review of the literature. Reading and Writing Quarterly, 19, 139-158.
Risinger, C.F. (1987). Improving writing skills through social studies (ERIC
Digest No. 40). Bloomington, IN: ERIC Clearinghouse for Social Studies/
Social Science Education. (ERIC Document Reproduction Service No. ED
285829).
Risinger, C.F. (1992). Current directions in K-12 Social Studies. Boston,
MA: Houghton Mifflin Co. (ERIC Document Reproduction Service No.
ED359130)
Riedel, E., Dexter, S., Scharber, C., & Doering, A. (2006). Experimental evi-
dence on the effectiveness of automated essay scoring in teacher education
cases. Journal of Educational Computing Research, 35(3), 267-287.
Scharber, C., & Dexter, S. (2004, March). Automated essay score predictions as
a formative assessment tool. Paper presented at the 15th international con-
ference of the Society for Information Technology and Teacher Education,
Atlanta, GA.
Shute, V.J. (1994). Regarding the I in ITS: Student modeling. In T. Ottmann & I.
Tomek (Eds.), Proceedings of Educational Multimedia and Hypermedia 94
(pp. 50-57). Charlottesville, VA: Association for the Advancement of Com-
puting in Education.
Skinner, B.F. (1958). Teaching machines. Science, 128(3300), 969-977.
Smith, J., & Niemi, R. (2001). Learning history in school: The impact of course
work and instructional practices on achievement. Theory and Research in
Social Education, 29(1), 18-42.
Sundberg, S.B. (2006). An investigation of the effects of exam essay questions
on student learning in United States History survey classes. The History
Teacher, 40(1). Retrieved November 13, 2009 from http://www.historycooperative.org/cgi-bin/cite.cgi?=ht/40.1/sundberg.html
Valenti, S., Neri, F., & Cucchiarelli, A. (2003). An overview of current research
on automated essay grading. Journal of Information Technology Education,
2, 319-330.
Van Nostrand, A.D. (1979). Writing and the generation of knowledge. Social
Education, 43(3), 178-180.
Venkateswaran, U., & Morgan, R. (2002). Assessing historical thinking skills:
Scoring the AP U.S. History Document-Based Question. Organization of
American Historians Newsletter. Retrieved November 13, 2009 from http://
www.oah.org/pubs/nl/nov02/ets.html
Wang, J., & Brown, M. (2007). Automated essay scoring versus human scoring:
A comparative study. Journal of Technology, Learning, and Assessment,
6(2). Retrieved November 13, 2009 from http://escholarship.bc.edu/cgi/
viewcontent.cgi?article=1100&context=jtla