B. Comp. Dissertation
Automated Essay Scoring
By
Shubham Goyal
Department of Computer Science
School of Computing
National University of Singapore
2013/2014
Project No: H014380
Advisor: Professor NG Hwee Tou
Deliverables:
Report: 1 Volume
Contents
List of Figures
1. Abstract
2. Acknowledgement
3. Goal
4. Introduction
5. Related Work
5.1 Background
5.2 Comparison of the Current State of the Art Essay Systems
6. Implementation
6.1 Overview
6.2 Features utilized
6.2.1 Content Features
6.2.2 Syntactic Features
6.2.3 Surface Features
6.2.4 Error Identification
6.2.5 Structural Features
6.8 Statistical Parsing
6.9 Feature Weights
6.10 Ranking Algorithm
6.11 Evaluation Metrics
7. Dataset
8. Results
9. Future Work
10. Conclusion
References
Appendices
Appendix A: List of Part-of-Speech Tags Used
List of Figures
Figure 1 Line Chart for Vendor Performance on the Pearson Product Moment Correlation across the Eight Essay Data Sets
Figure 2 Implementation overview
Figure 3 Skeletons generated from the sentence 'They have many theoretical ideas'
Figure 4 Parse Tree of 'They have many theoretical ideas'
Figure 5 Annotated skeletons for the sentence 'They have many theoretical ideas'
1. Abstract
Automated Essay Scoring (AES) is becoming increasingly popular as human grading is not only expensive but also cumbersome as the number of test takers grows. Quick feedback is another characteristic drawing educators towards AES. However, most of the AES systems available today are commercial, closed-source software. Our work aims to design a good AES system that uses some of the most commonly used features to rank essays. We also evaluate our scoring engine on a publicly available dataset to establish benchmarks. We will also make all the source code available to the public so that future research can use this as a starting point.
Subject Descriptors:
I.2.7 Natural Language Processing
H.3 Information Storage and Retrieval
I.2.6 Learning
I.2.8 Problem Solving, Control Methods, and Search
Keywords:
Artificial Intelligence, Natural Language Processing
Implementation Software and Hardware:
Python, Java
2. Acknowledgement
I would like to thank my supervisor, Prof NG Hwee Tou, for giving me the
opportunity to work under him and on this project. I am really honored to have the
pleasure of working under one of the best minds in this field. I would like to thank
him for all his time that he has spent in helping me, motivating me, guiding me and
finally, helping me become a better researcher so that I can fulfill my lifelong
ambition of becoming a good researcher.
I would also like to thank Prof's graduate student, Raymond, for taking out the time
from his work to help provide me with APIs to get the trigram counts of words in the
English Gigaword corpora.
I also appreciate the help provided by another of Prof's students, Christian
Hadiwinoto, for creating my account on the NLP cluster and helping me install
packages and run my programs there.
3. Goal
This project focuses on building a system for scoring English essays. The system
assigns a score to an essay reflecting the quality of the essay (based on both content
and grammar). The system will be evaluated on a benchmark test data set. Besides
aiming to build a state-of-the-art essay scoring system, the project will also
investigate the robustness and portability of essay scoring systems.
4. Introduction
According to Wikipedia, "Automated essay scoring (AES) is the use of specialized computer programs to assign grades to essays written in an educational setting." Usually, the grades are not numeric scores but rather discrete categories. Therefore, this can also be considered a problem of statistical classification and, by its very nature, it falls into the domain of natural language processing.
Historically, the origins of this field can be traced to the work of Ellis Batten "Bo" Page, who is also widely regarded as the father of automated essay scoring. Page's development of and pioneering work with the Project Essay Grade (PEG™) software in the mid-1960s set the stage for the practical application of computer essay scoring technology following the microcomputer revolution of the 1990s.
The most obvious approach to automated essay scoring is to employ machine learning. This involves obtaining a set of essays that have been manually scored (the training set). The software then evaluates features of the text of each essay (surface features like the total number of words, word n-grams, part-of-speech n-grams, etc., mostly quantities that can be measured without any human insight) and constructs a mathematical model that relates these quantities to the scores that the essays received. Then, we can use the model to calculate scores for new sets of essays.
The next important question that arises is the determination of the criteria of success. It is insightful to look at essay scoring before the arrival of computers. Usually, high-stakes essays were, and still are, rated by a few different raters who would each give their own score. The different scores would then be compared to see if they agree; if they do not, either a more experienced rater would be called in to settle the dispute, or the majority opinion would be taken. We can apply the same approach to checking the success of any AES software. The grades given by the software can be compared with the grades given by human graders on the same scripts. The more matches there are, the better the accuracy of the AES software.
Thus, various statistics have been proposed to measure this "agreement" between the AES software and the human graders. It could be something as simple as percent agreement, or more complicated measures like Pearson's or Spearman's rank correlation coefficients.
The practice of AES has not been without its fair share of criticism. Yang et al. mention "the overreliance on surface features of responses, the insensitivity to the content of responses and to creativity, and the vulnerability to new types of cheating and test-taking strategies." Some critics also fear that students' motivation will be diminished if they know that a human grader will not be reading their writing. However, we feel that this criticism is directed not at AES itself but at the fear of being assigned false grades. It also shows that the current state-of-the-art systems can be better, and so this is an exciting time to be working in this field.
5. Related Work
5.1 Background
As already mentioned in the introduction, the late Ellis Page and his colleagues at the University of Connecticut programmed the first successful automated essay scoring engine, "Project Essay Grade (PEG)" (1973). PEG did produce good results, but one of the reasons why it did not become a practical application is probably the limited technology of the time.
Different AES systems evaluate different types and numbers of features which are extracted from the text of the essay. Page and Petersen (1995), in their Phi Delta Kappan article, "The computer moves into essay grading: Updating the ancient test," referred to these elements or features as "proxes", or approximations for underlying "trins" (i.e., intrinsic characteristics) of writing. In the original version of PEG, the text was parsed and classified into language elements such as parts of speech, word length, word functions and the like. PEG would count keywords and make its predictions based on the patterns of language that human raters valued or devalued in making their score assignments. Page classified these counts into three categories: simple, deceptively simple and sophisticated.
For example, a model in the PEG system might be formed by taking five intrinsic characteristics of writing (content, creativity, style, mechanics, and organization) and linking proxes to them. An example of a simple prox is essay length. Page found that the relationship between the number of words used and the score assignment was not linear, but rather logarithmic. In other words, essay length is factored in by human raters up to some threshold, and then becomes less important as they focus on other aspects of writing.
On the other hand, an example of a sophisticated prox would be a count of the number of times "because" is used in an essay. A count of the word "because" may not be important in and of itself, but as a discourse connector, it serves as a proxy for sentence complexity. Human raters tend to reward more complex sentences.
Some works emphasize the evaluation of content through the specification of
vocabulary (of course, the evaluation of other aspects of writing is performed as
described above). Latent Semantic Analysis and its variants are employed in some
works to provide estimates as to how close the vocabulary in an essay is to a targeted
vocabulary set (Landauer, Foltz & Laham, 1998). The Intelligent Essay Assessor
(Landauer, Foltz & Laham, 2003) is one of the most successful commercial
applications making heavy use of LSA.
If we look at the AES scene at present, there are three major AES developers:
1. e-rater (which is a component of Criterion - http://www.ets.org/criterion) by
Educational Testing Service (ETS)
2. Intellimetric by Vantage Learning (http://www.vantagelearning.com/)
3. Intelligent Essay Assessor by Pearson Knowledge Technologies
(http://kt.pearsonassessments.com/)
Fortunately for us, the construction of e-rater models is given in detail in a recent
work by Attali and Burstein (2006). The system takes in features from six main areas:
1. Grammar, usage, mechanics and style measures (4 features):
They count the errors in these four categories. Since the raw counts of errors are highly related to essay length, the rates of errors are used instead, obtained by dividing the counts in each category by the total number of words in the essay.
2. Organization and development (2 features):
The first feature in this category is the organization score which assumes a
writing strategy that includes an introductory paragraph, at least a three
paragraph body with each paragraph in the body consisting of a pair of main
point and supporting idea elements, and a concluding paragraph. The score
measures the difference between this minimum five paragraph essay and the
actual discourse elements found in the essay. The second feature is derived
from Criterion's organization and development module.
3. Lexical Complexity (2 features):
These are specifically related to word based characteristics. The first is a
measure of vocabulary level and the second is based on average word length
in characters across the words in the essay. The first feature is from Breland,
Jones, and Jenkins' (1994) work on the Standardized Frequency Index across the
words of the essay.
4. Prompt-specific vocabulary usage (2 features): e-rater evaluates the lexical
content of an essay by comparing the words it contains to the words found in a
sample of essays from each score category. This is accomplished by making
use of content vector analysis (Salton, Wong, & Yang, 1975). In short, the
vocabulary of each score category is converted to a vector whose elements are
based on the frequency of each word in a sample of essays.
Like most approaches, e-rater also uses a sample of human-scored essay data for model building purposes. e-rater models can be built at the topic level, in which case a model is built for a specific essay prompt. However, more often, e-rater models are built at the grade level. Preparing models for essays of similar topics or by students of similar grades is not difficult per se, but it requires significant data collection and human reader scoring; these are not only time-consuming but also costly.
A specification of the Intellimetric model is given in Elliot (2003). The model selects
from more than 300 semantic, syntactic and discourse level features. The features fall
into five major categories:
1. Focus and Unity: includes cohesiveness, consistency in purpose, and main idea
2. Development and Elaboration: includes metrics that look at the breadth of content and the support for concepts advanced
3. Organization and Structure: mainly targeted at the logic of discourse, such as transitional fluidity and relationships among parts of the response
4. Sentence Structure: includes sentence complexity and sentence variety
5. Mechanics and Conventions: features measuring conformance to conventions of edited American English
Intellimetric uses Latent Semantic Dimension which is similar in nature to LSA
described earlier. Latent Semantic Dimension also determines how close the
candidate response is, in terms of content, to a modeled set of vocabulary. The paper does not go into much more detail about how Intellimetric works, focusing instead on the validation aspect.
Technical details of the Intelligent Essay Assessor are highlighted in Landauer,
Laham, & Foltz (2003). The content of the essay is assessed by using a combination
of external databases and LSA. The authors do talk about examples of external
databases used for three of their experiments. This is interesting to note because this
can shed important light on what kind of data needs to be extracted for automated
essay scoring in different situations.
In a particular experiment, the essay question was on the anatomy and function of the
heart and circulatory system. This was administered to 94 undergraduates at the
University of Colorado before and after an instructional session (N = 188) and scored
by two professional readers from Educational Testing Service (ETS). In this case, the
LSA semantic space was constructed by analysis of all 95 paragraphs in a set of 26
articles on the heart taken from an electronic version of Grolier's Academic American Encyclopedia. Even though this corpus was smaller than the corpora traditionally used, it gave good results according to the authors. When the authors tried to expand
it by the addition of general text, the results did not improve.
We also draw immense inspiration from and analyze the work by (Yannakoudakis et
al., 2011) but the discussion on that is deferred to the subsequent sections in the
interests of brevity and to avoid repetition.
5.2 Comparison of the Current State of the Art Essay Systems
A recent study (Shermis and Hammer, 2012) compared the results from nine automated essay scoring engines on eight prompts drawn from six states in the United States that hold high-stakes writing exams. The essays encompassed writing assessments from three grade levels, namely 7, 8 and 10, and were evenly distributed among the different prompts. In total, there were 22,029 essays.
The following line chart shows the Pearson product-moment correlation across the eight essay data sets:
Figure 1. Line Chart for Vendor Performance on the Pearson Product Moment Correlation across
the Eight Essay Data Sets
The nine automated essay scoring engines participating in the study were:
1. Autoscore developed by the American Institutes for Research (AIR)
The main features of this scoring engine include creating a statistical proxy for
prompt-specific rubrics (single as well as multiple trait). The engine needs to
be trained on known and valid scores.
2. LightSIDE developed at Carnegie Mellon Universityâs TELEDIA Lab
This is a free and open-source package. It is very beginner-friendly and is meant to be a tool for non-professionals to make use of data mining technology for varied purposes, one of which is essay assessment.
3. Bookette developed by CTB McGraw-Hill Education
These scoring engines are able to model trait level and/or holistic level scores
for essays with a similar degree of reliability to an expert human rater. CTB
builds two types of engines: prompt-specific and generic. When applied in
the classroom, the engines can provide performance feedback through the use
of the information found in the scoring rubric and through feedback on
grammar, spelling, conventions, etc. at the sentence level. Bookette engines
utilize around 90 text-features classified as structural, syntactic, semantic and
mechanics-based.
4. e-rater, developed by Educational Testing Service
This scoring engine is focused on evaluating essay quality. There are dozens of features, each measuring a different, very specific aspect of essay quality. The
same features serve as the basis for performance feedback to students through
products like Criterion (http://www.ets.org/criterion).
5. Lexile Writing Analyzer developed by MetaMetrics
This is independent of grades, genres, prompts or punctuation and is an engine for establishing Lexile writer measures. The Lexile writer measure is said to be an inherent individual trait or power to compose written text, with writing ability embedded in a complex web of cognitive and sociocultural processes.
6. Project Essay Grade (PEG), Measurement, Inc.
This scoring engine has had more than 40 years of study and enhancement
devoted to it. Studies conducted at a number of state departments of education
indicate that PEG demonstrated accuracy similar to trained human scorers.
7. Intelligent Essay Assessor (IEA), Pearson Knowledge Technologies
Some of the features are derived through semantic models of English (or any
other language) from an analysis of large volumes of text equivalent to the
reading material of a high school student (around 12 million words). This
scoring engine combines background knowledge about English in general and
the subject area of the assessment in particular along with prompt-specific
algorithms to learn how to match student responses to human scores. IEA also provides feedback and can even be tuned to understand and examine text in any language (Spanish, Arabic, Hindi, etc.). It can identify off-topic responses, very unconventional essays and other unique circumstances that need human attention. It has also been used for grading millions of essays in high-stakes examinations.
8. CRASE™ by Pacific Metrics
This system is highly configurable, both in terms of the customizations used to build machine scoring models and in terms of how the system can blend human scoring and machine scoring (i.e., hybrid models). It is actually a Java application that runs as a web service.
9. IntelliMetric developed by Vantage Learning
This scoring system attempts to emulate what the human scorers do.
IntelliMetric is trained to score test-taker essays. Each prompt (essay) is first
scored by expert human scorers who develop anchor papers for each score
point. A number of papers for each score point are loaded into IntelliMetric,
which runs multiple algorithms to determine the specific writing features that
translate to various score points.
6. Implementation
6.1 Overview
This is what the entire process actually looks like in a nutshell:
Figure 2 Implementation overview
To score essays automatically, we need to train a machine-learning algorithm. After
the algorithm has been trained, it gives us a machine-learning model, which can be
used to score more essays. In order for a machine-learning model to be created,
features first need to be extracted from the text, as a computer cannot directly
understand English. We need to use the numbers or symbols as proxies for meaning.
6.2. Features utilized
6.2.1 Content Features
6.2.1.1 Word n-grams
An n-gram can simply be defined as a contiguous sequence of n items from a given sequence of text or speech. For the purpose of essay scoring, they can simply be understood as collections of one or more tokens. n can have any value, but usually only values of n = 1 (unigrams), 2 (bigrams) and 3 (trigrams) are considered. This is because higher-order n-grams suffer from the sparse data problem.
The tokens were converted to lower case before being used as n-grams. However, no
stemming was employed.
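For illustration, a minimal Python sketch of this feature extraction is given below; the whitespace tokenizer and the helper name word_ngrams are only illustrative stand-ins for whatever tokenization is actually used.

def word_ngrams(text, n_values=(1, 2)):
    # lower-case the tokens first, as described above; no stemming is applied
    tokens = text.lower().split()
    features = []
    for n in n_values:
        features.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return features

print(word_ngrams("They have many theoretical ideas"))
# [('they',), ('have',), ..., ('they', 'have'), ('have', 'many'), ...]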
6.2.2 Syntactic Features
6.2.2.1 POS n-grams
This feature is the same as word n-grams except that we replace each word with its part-of-speech (POS) tag, such as noun, verb, adjective, etc. Parts of speech are also known as word classes or lexical categories. In this work, we employ the Penn Treebank tagset. The reason for using this tagset is its wide use. Appendix A details the different tags in the Penn Treebank tagset.
The tokens are tagged in their original case because changing the case of the word
might change the tag (for example, proper nouns (NNPS) are usually identified
because of the capital initial letter). The methodology followed in Yannakoudakis et al. is a bit different because they make use of their own proprietary RASP tagger for this purpose.
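As an illustration, the sketch below extracts POS n-grams with NLTK's default tagger (which outputs Penn Treebank tags); this tagger is only a stand-in here and is not necessarily the one used to produce the reported results.

import nltk

def pos_ngrams(sentence, n_values=(1, 2)):
    # tag the tokens in their original case, as described above
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sentence))]
    features = []
    for n in n_values:
        features.extend(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))
    return features

print(pos_ngrams("They have many theoretical ideas"))
# e.g. [('PRP',), ('VBP',), ..., ('PRP', 'VBP'), ('VBP', 'JJ'), ...]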
6.2.3 Surface Features
6.2.3.1 Script Length
Logically, script length should not have any relation to the score, because a short, well-written piece of text should get the same score as a longer one. However, as mentioned in the related work section above, script length has been found to affect the score. Some works report that, empirically, the longer the essay, the better the score. Also, script length could have the effect of cancelling out any skewness in the final results due to features whose weights are influenced by script length.
This is a surface feature since it is completely language-blind. According to Cohen et al., these surface variables in themselves are extremely predictive of the essay score. However, the consequence of using such features alone can be that students will simply learn to write longer texts with no regard for rhetorical structure, the logic of argumentation, and so forth. This is why such surface variables need to be used alongside other features which relate to content, syntactic structure or rhetorical structure.
6.2.4 Error Identification
6.2.4.1 Error Rate
By error rate, we refer to the rate of occurrence of unknown (and hence, erring) n-grams. The simplest way of getting error rates is to use a language model built from a suitably large and, hopefully, in-context corpus, and then measure the rate of occurrence of n-grams in the document which do not occur in the corpus. Error rate can be an important feature for many reasons. Firstly, it can serve to identify improper uses of grammar and words. If the rates of occurrence of grammatical errors in two documents are the same, their scores are likely to be similar or to lie in the same range. For our research, we are using Prof NG Hwee Tou's corpora. These corpora are part of the English Gigaword (details here: http://catalog.ldc.upenn.edu/LDC2009T13). The first corpus consists of the first 4 million sentences and around 100 million words, while the second corpus consists of around 40 million sentences and more than a billion words.
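A minimal sketch of this feature is shown below; the small known_trigrams set stands in for the Gigaword trigram counts mentioned above and is purely illustrative.

def trigram_error_rate(tokens, known_trigrams):
    # fraction of trigrams in the essay that never occur in the reference corpus
    trigrams = [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
    if not trigrams:
        return 0.0
    unseen = sum(1 for t in trigrams if t not in known_trigrams)
    return unseen / len(trigrams)

known_trigrams = {("they", "have", "many"), ("have", "many", "theoretical")}
print(trigram_error_rate("they have many theoretical ideas".split(), known_trigrams))
# one unseen trigram out of three -> 0.3333...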
6.2.5 Structural Features
The inspiration for these features comes from Massung et al. (2013).
6.2.5.1 Skeletons
We aim to capture the flow or discourse structure of sentences without bothering about the actual labels. For example, if the input sentence is "They have many theoretical ideas", the following skeletons would be generated:
Figure 3. Skeletons generated from the sentence 'They have many theoretical ideas'
To understand why Figure 3 is as it is, let us look at the parse tree of "They have many theoretical ideas". The following parse tree has been drawn with the help of the nltk draw function (the tags used in this figure are documented in section A.1 of the Appendix).
Figure 4. Parse Tree of 'they have many theoretical ideas'
To represent these skeletons of parse trees (for example, see Figure 3), we store the trees as sets of square brackets. So, for the sentence "They have many theoretical ideas", the skeletons of parse trees will be:
a) []
b) [[]]
c) [[[]]]
d) [[[]], [[]], [[]]]
e) [[[]], [[[]], [[]], [[]]]]
f) [[[[]]], [[[]], [[[]], [[]], [[]]]]]
We can choose to ignore the tree represented by (a) since it is trivial and will be present in every document (in fact, each word or punctuation mark can be represented as []). But (b), (c), (d), (e) and (f) correspond to the graphical representations of the trees in Figure 3.
The procedure for identifying the skeletons is fairly simple. We start from the root of the parse tree and recursively descend into the sub-trees, recording the inherent structure. The following pseudocode attempts to demonstrate how this works:
procedure get_list_of_skeletons_of_sentence(sentence):
    tree = parse(sentence)
    if tree is NULL:
        return []
    subtrees_list = []
    convert_tree_to_list(tree, subtrees_list)
    return subtrees_list

procedure convert_tree_to_list(node, list):
    if node is not of type Tree:    # a leaf, i.e. a word or punctuation mark
        list.append([])
        return []
    else:
        subtree = []
        for subtree_node in node:
            subtree.append(convert_tree_to_list(subtree_node, list))
        subtree.sort()
        list.append(subtree)
        return subtree
This get_list_of_skeletons_of_sentence function returns the skeletal structure of the
sentence (or the list of skeletons that correspond to the sub-trees of the parse tree of
the sentence). It does this by first making the parse tree of the sentence and then
calling a recursive function convert_tree_to_list. convert_tree_to_list recursively goes
to each node in the tree, appends the skeleton of the subtree corresponding to that
node to a list and then returns the list of skeletons. The leaves of the parse tree are
represented as [] in our square bracket notation.
The section on "Statistical Parsing" later in this report details how the function parse works.
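For completeness, the same procedure can also be written concretely with NLTK's Tree class. In the sketch below the bracketed parse of the example sentence is supplied by hand purely for illustration; in our system it would come from the parser described in the later section.

from nltk import Tree

def collect_skeletons(node, collected):
    # recursively record the label-free shape of every subtree
    if not isinstance(node, Tree):          # a leaf, i.e. a word or punctuation mark
        collected.append([])
        return []
    shape = [collect_skeletons(child, collected) for child in node]
    shape.sort()
    collected.append(shape)
    return shape

tree = Tree.fromstring(
    "(S (NP (PRP They)) (VP (VBP have) (NP (JJ many) (JJ theoretical) (NNS ideas))))")
skeletons = []
collect_skeletons(tree, skeletons)
for s in skeletons:
    print(s)
# the non-trivial shapes printed correspond to (b)-(f) above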
6.2.5.2 Annotated Skeletons
Annotated skeletons are the same as skeletons with just one extra piece of information attached to them: the label of the topmost node of each parse sub-tree. An example of the annotated skeletons for the same sentence "They have many theoretical ideas" is
Figure 5 Annotated skeletons for the sentence "They have many theoretical ideas"
6.2.5.3 Rewrite Rules
This feature was used by Kim et al. (2011). It essentially tallies subtrees from each sentence's parse. Historically, it has mainly been used in text classification, where all the parse trees are put in different classes/categories so that each category has a "bag of trees". This feature is beneficial as certain trees can be abundant in particular categories. Kim et al. use it for authorship classification. Simpler applications would include age detection or language proficiency: less proficient writers would be unlikely to use complicated tree structures. This can be useful for essay scoring as well.
After conducting experiments, I decided to use only feature (i), Skeletons, from this section. The Results section tries to analyze why this might have been the case.
To prevent overfitting or bias issues, we only use the features which appear at least 4 times in the entire training set. The value 4 is chosen because it is also used in Yannakoudakis et al., so that it is easy to compare results.
6.8 Statistical Parsing
An important black box in the previous section, where structural features were discussed, was how the parse trees are formed. Sentences on average tend to be very syntactically ambiguous (coordination ambiguity, attachment ambiguity, etc.). That is why we need to use probabilistic parsing: we consider all possible interpretations and then choose the most likely one.
The CS4248 (Natural Language Processing) class at NUS, which I took, covered probabilistic context-free grammars (PCFGs), a probabilistic extension of context-free grammars (CFGs) in which each rule has a probability assigned to it. We use the probabilistic CKY algorithm to generate the most probable parses. The algorithm is trained on two treebank grammars:
a) Penn Treebank
b) QuestionBank
6.8.1 Penn Treebank
The material annotated for this project includes such wide-ranging genres as IBM computer manuals, nursing notes, Wall Street Journal articles, transcribed telephone conversations, etc. For our work, we use a sample (a 5% fragment) of this huge treebank which has been made available for non-commercial use. It contains parsed data from the Wall Street Journal for 1,650 sentences (99 treebank files, wsj_0001 to wsj_0099).
An example annotated sentence from the treebank:
( (S
(NP-SBJ
(NP (NNP Pierre) (NNP Vinken) )
(, ,)
(ADJP
(NP (CD 61) (NNS years) )
(JJ old) )
(, ,) )
(VP (MD will)
(VP (VB join)
(NP (DT the) (NN board) )
(PP-CLR (IN as)
(NP (DT a) (JJ nonexecutive) (NN director) ))
(NP-TMP (NNP Nov.) (CD 29) )))
(. .) ))
6.8.2 QuestionBank
This is a corpus of 4000 parse-annotated questions developed by the National Centre
for Language Technology and School of Computing. It is provided free for research
purposes. This is also one of the reasons why it has been employed in this work. The
annotated parse trees are very similar to the ones in the Penn Treebank so examples
have been omitted here in the interests of brevity.
After parsing the annotated data from the treebanks, we get a grammar (a list of
production rules). But we still have to convert the grammars to Chomsky Normal
Form. This is because the CKY algorithm works only on context free grammars given
in Chomsky Normal Form (CNF).
6.8.3 Chomsky Normal Form
A grammar is said to be in Chomsky Normal Form if all of its production rules are of the form:
a) A → BC, or
b) A → α, or
c) S → ε
where A, B and C are nonterminal symbols, α is a terminal symbol (or a constant), S is the start symbol and ε represents the empty string. Only the start symbol S is allowed to have such an ε-rule, and rule (c) is valid only if ε is part of the language generated by the grammar G.
It has been proven that every context-free grammar can be transformed into an equivalent grammar in Chomsky Normal Form.
6.8.4 Converting a CFG to CNF
1. Introduce a new start symbol S0. This also means that a new rule will have to be added with regard to the previous start symbol S:
S0 → S
2. Eliminate all ε-rules. ε-rules can only be of the form A → ε, where A is not the start symbol (the proof is trivial). This can be done by removing every rule with ε on its right-hand side (RHS). For each rule that has A on its RHS, add a set of new rules consisting of all the combinations of A being replaced or not replaced with ε. If A occurs as the only symbol on the right-hand side of some rule X → A, this yields a new ε-rule X → ε (let's call this new rule R); add R unless it has already been removed.
3. Eliminate all unit rules. Unit rules are those whose RHS contains exactly one variable and no terminals (such a rule is inconsistent with the conditions for a grammar in Chomsky Normal Form as described at the beginning of this section). If the unit rule to be removed is X → Y and there exist one or more rules of the form Y → Z (where Z is a string of variables and terminals), add a new rule X → Z (unless this is a unit rule which has already been removed).
4. Clean up the remaining rules that are not in Chomsky Normal Form. Replace each rule A → u1 u2 ... uk with k ≥ 3, where each ui ∈ V ∪ Σ, by the rules A → u1A1, A1 → u2A2, ..., Ak-2 → uk-1uk, where the Ai are new variables. If ui ∈ Σ, replace ui in the above rules with some new variable Vi and add the rule Vi → ui.
Once all the rules have been converted to Chomsky Normal Form, we can assign
probabilities to them. This completes the learning of a probabilistic context free
grammar in Chomsky Normal Form (CNF) from the treebanks.
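As an illustration of step 4 above, the binarization can be sketched as follows, assuming each production is stored as a (left-hand side, right-hand-side tuple) pair; the representation and the fresh-variable naming scheme are ours and purely illustrative.

def binarize(rules):
    # split every rule with more than two symbols on the RHS into a chain of binary rules
    new_rules, counter = [], 0
    for lhs, rhs in rules:
        while len(rhs) > 2:
            counter += 1
            aux = "%s_BIN%d" % (lhs, counter)   # a fresh non-terminal
            new_rules.append((lhs, (rhs[0], aux)))
            lhs, rhs = aux, rhs[1:]
        new_rules.append((lhs, rhs))
    return new_rules

print(binarize([("NP", ("DT", "JJ", "NN"))]))
# [('NP', ('DT', 'NP_BIN1')), ('NP_BIN1', ('JJ', 'NN'))]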
Now, given an input sentence, we need to use the probabilistic grammar to generate the most likely parse tree. We use the Cocke-Younger-Kasami (CKY) algorithm. We do have to modify the standard version of the algorithm, since the standard version checks only for membership. The pseudocode for the standard version is given below:
Let the grammar be represented by G.
Let S: a1...an be the input sentence or phrase.
Let R1...Rr be the non-terminal symbols present in the grammar.
Let RS contain the start symbols, RS ⊆ G.
Let P[n, n, r] be a three-dimensional array of Booleans.

for each i = 1 to n:
    for each j = 1 to n:
        for each k = 1 to r:
            P[i, j, k] = false

for each i = 1 to n:
    for each unit production Rj -> ai:
        P[i, i, j] = true

for each i = 2 to n:
    for each L = 1 to n - i + 1:
        R = L + i - 1
        for each M = L + 1 to R:
            for each production Rα -> Rβ Rγ:
                if P[L, M - 1, β] and P[M, R, γ]:
                    P[L, R, α] = true

for each i = 1 to r:
    if P[1, n, i]:
        return true
return false
The above algorithm checks for the membership of the sentence in the language. Our goal is to construct a parse tree, so we changed the array P to store parse tree nodes instead of Boolean values. These nodes are associated with the array elements that were used to produce them, so as to build the tree structure. This is a simple back-tracking procedure.
Thus, finally, the parse function in get_list_of_skeletons_of_sentence can return the tree structure generated by the CKY algorithm.
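A minimal sketch of this modification is given below, assuming the learned PCFG is stored as dictionaries of log-probabilities; the data structures and names are illustrative and not the exact ones used in our implementation.

from collections import defaultdict

def pcky(words, lexical, binary, start="S"):
    # lexical: {word: {tag: log-prob}}; binary: {(A, B, C): log-prob} for rules A -> B C
    n = len(words)
    best = defaultdict(dict)      # best[(i, j)][label] = log-prob of the best subtree
    back = defaultdict(dict)      # back-pointers used to rebuild the parse tree

    for i, w in enumerate(words):                       # spans of length 1
        for label, lp in lexical.get(w, {}).items():
            best[(i, i)][label] = lp
            back[(i, i)][label] = w

    for length in range(2, n + 1):                      # longer spans
        for i in range(0, n - length + 1):
            j = i + length - 1
            for m in range(i + 1, j + 1):               # split point
                for (a, b, c), lp in binary.items():
                    if b in best[(i, m - 1)] and c in best[(m, j)]:
                        p = lp + best[(i, m - 1)][b] + best[(m, j)][c]
                        if p > best[(i, j)].get(a, float("-inf")):
                            best[(i, j)][a] = p
                            back[(i, j)][a] = (m, b, c)

    def build(i, j, label):                             # simple back-tracking step
        entry = back[(i, j)][label]
        if isinstance(entry, str):
            return (label, entry)
        m, b, c = entry
        return (label, build(i, m - 1, b), build(m, j, c))

    return build(0, n - 1, start) if start in best[(0, n - 1)] else None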
6.9 Feature Weights
In the previous sections, we have discussed the methods employed to generate the features for a given essay. Each unique feature (for example, a particular token or unigram, or a particular parse tree) is given a unique number to represent it. However, we also need to decide what weights (or importance) to assign to those features. We experimented with several feature weights:
1. The simplest way would be to use a 0 or a 1 depending on whether a feature is present or absent.
2. Another technique that was tried was to use the number of times the feature occurs in a given essay as its weight.
3. tf-idf weighting was also tried for certain features, especially word n-grams and POS n-grams. The next section gives more details on the tf-idf statistic.
6.9.1 tf-idf scheme
tf-idf is short for term frequency-inverse document frequency. This is often used as a
weighting factor in information retrieval and data mining. It is a product of term
frequency and inverse document frequency.
Various ways of calculating term frequency exist but probably the easiest one and the
one we have used in our work is to simply take the number of times the feature occurs
in a particular essay.
The inverse document frequency is a measure of whether the term is unique or
common across essays. We can arrive at this statistic by dividing the total number of
essays by the number of essays containing the term, and then taking a logarithm of the
quotient.
The reason why such a statistic is needed is that if we simply take counts of features or their presence/absence, we miss how important they are to a document. In the context of essays, there might be certain phrases which a grader rewards when present but which do not occur commonly across all documents. Even though this statistic might make more sense for information retrieval tasks, it was still tried to see its impact.
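A minimal sketch of this weighting scheme is given below, assuming each essay has already been reduced to a list of feature strings (for example, word n-grams); the toy essays are illustrative only.

import math
from collections import Counter

def tf_idf_weights(essays):
    # returns one {feature: weight} dictionary per essay
    n_essays = len(essays)
    doc_freq = Counter()
    for essay in essays:
        doc_freq.update(set(essay))     # number of essays containing each feature
    weighted = []
    for essay in essays:
        tf = Counter(essay)             # raw count of the feature within this essay
        weighted.append({feat: count * math.log(n_essays / doc_freq[feat])
                         for feat, count in tf.items()})
    return weighted

essays = [["the", "heart", "pumps"], ["the", "heart", "beats"], ["essays", "are", "graded"]]
print(tf_idf_weights(essays)[0])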
6.10 Ranking Algorithm
Now that we have discussed the features we are using or plan to use, let us look at the machine learning aspect of the problem. We are modeling this as a ranking problem and not a classification one. One reason for this is that we have the absolute human grader scores; we can thus do better than classification into a few buckets, because converting each score to a grade would voluntarily discard some information. On the other hand, predicting the exact score also does not make sense, because a machine will surely not be able to predict the exact score accurately. Thus, ranking seems like a good, viable option.
We use a support vector machine for the above task. Our choice is motivated by the fact that other works, specifically Yannakoudakis et al., make use of the SVMlight (http://svmlight.joachims.org/) library, and so it is easy to compare results. To be precise, we use SVMrank (http://www.cs.cornell.edu/People/tj/svm_light/svm_rank.html), which employs new algorithms for training Ranking SVMs and is much faster than SVMlight. The decision to switch over to SVMrank is made easier by the fact that both libraries are by the same author, and (T. Joachims, 2006) states that both libraries solve the same optimization problem, with the only difference being that SVMrank is much faster.
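For reference, SVMrank reads its training data in the SVMlight format, one essay per line, with the human score as the target, a qid grouping the essays that should be ranked against each other, and sparse feature-id:weight pairs in increasing order of feature id. The small writer below is a minimal sketch with illustrative names and values, not our exact export code.

def write_svmrank_file(path, scored_essays, qid=1):
    # scored_essays: list of (score, {feature_id: weight}) pairs
    with open(path, "w") as f:
        for score, features in scored_essays:
            pairs = " ".join("%d:%.4f" % (fid, w) for fid, w in sorted(features.items()))
            f.write("%d qid:%d %s\n" % (score, qid, pairs))

write_svmrank_file("train.dat", [(23, {1: 1.0, 7: 0.5}), (31, {1: 2.0, 3: 1.2})])
# train.dat then contains lines such as: 23 qid:1 1:1.0000 7:0.5000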
6.11 Evaluation Metrics
The two evaluation metrics which have been employed in our work are:
6.11.1 Pearson's Product-Moment Correlation Coefficient
Pearson's correlation measures the degree of linear relationship between two variables. It gives a value in the range [-1, 1], where a value of -1 denotes total negative correlation, a value of 0 denotes no correlation and a value of 1 denotes total positive correlation. However, the value of this metric can be misleading in some rare cases due to outliers or due to its inherent sensitivity to the distribution of the data.
6.11.2 Spearman's Rank Correlation Coefficient
This is a non-parametric, robust measure of statistical dependence between two variables. It essentially assesses how well a relationship between the two variables can be described using a monotonic function. If there are no repeated data values, a perfect Spearman correlation of +1 or -1 occurs when each of the variables is a perfect monotone function of the other.
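Both metrics are straightforward to compute; a minimal sketch using SciPy is given below, with illustrative score lists.

from scipy.stats import pearsonr, spearmanr

human_scores = [23, 31, 18, 27, 35]
predicted = [25, 30, 17, 29, 33]

pearson_r, _ = pearsonr(human_scores, predicted)
spearman_rho, _ = spearmanr(human_scores, predicted)
print("Pearson: %.4f, Spearman: %.4f" % (pearson_r, spearman_rho))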
7. Dataset
As can be observed from the related work and introduction, automated essay scoring is a data-intensive task. To be able to predict scores, we not only need the dataset to contain as many essay scripts as possible, but the scripts also need to be properly annotated, or at the very least manually graded.
For our own experiments, we are currently making use of data drawn from the CLC
FCE dataset, a set of 1,244 exam scripts written by candidates sitting the Cambridge
ESOL First Certificate in English (FCE) examination in 2000 and 2001, and made
available by Cambridge University Press; see (Yannakoudakis et al., 2011).
The CLC dataset is divided into training and test sets. The training set consists of 1,141 scripts from the year 2000 written by 1,141 distinct learners, and the test set consists of 97 scripts from the year 2001 written by 97 distinct learners. The learners' ages follow a bimodal distribution with peaks at approximately 16-20 and 26-30 years of age. Yannakoudakis et al. claim that there is no overlap between the prompts used in 2000 and in 2001. The scripts also include some meta-data about candidates' grades, native language and age.
The First Certificate in English (FCE) exam's writing component consists of two
tasks asking learners to write either a letter, a report, an article, a composition or a
short story, between 200 and 400 words. Answers to each of these tasks are annotated
with marks (in the range 1-40). In addition, an overall mark is assigned to both tasks.
Actually, we do not make use of the task scores and just use the overall score. This is
because (Yannakoudakis et al., 2011) use just the overall score and so it gives us a
benchmark to compare our results against.
Each script is also tagged with information about the linguistic errors committed,
using a taxonomy of approximately 80 error types (Nicholls, 2003). An example of
this is the following:
Thanks for <NS type="DD"><i>you</i><c>your</c></NS> letter.
The part of the text between <i> and </i> denotes the incorrect text while the part
between <c> and </c> denotes the correction of that incorrect text.
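For illustration, a simple way to read such annotations is sketched below; the regular expression assumes non-nested <NS> spans that contain both an <i> and a <c> element, which is a simplification of the full CLC error coding.

import re

pattern = re.compile(r'<NS type="([^"]+)"><i>(.*?)</i><c>(.*?)</c></NS>')

line = 'Thanks for <NS type="DD"><i>you</i><c>your</c></NS> letter.'
for err_type, error, correction in pattern.findall(line):
    print(err_type, error, "->", correction)
# DD you -> your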
8. Results
The following table contains the correlation values after adding the different features:

Table 1. Spearman's and Pearson's Correlation Values

Features                            Pearson's Correlation    Spearman's Rank Order Correlation
Word ngrams                         0.6005                   0.5967
+ PoS ngrams (tf-idf weights)       0.6053                   0.5982
+ POS ngrams (counts as weights)    0.5679                   0.5612
+ Script length                     0.5685                   0.5622
+ Error Rate                        0.4247                   0.4247
+ Skeletons                         0.4904                   0.4904
Since we use the same dataset as Yannakoudakis et al. for benchmarking, we can compare our results with theirs. Our correlation coefficients have nearly the same value when we just use word n-grams as a feature. For the other features, some variation is to be expected, since they do not use the same tagger for POS tagging, their error rate is calculated in a different manner, and they do not use the same structural features.
From Table 1, we can see that our predictions vary more and more from the human scores as more features are added. The best results are obtained when using only the word and POS n-grams as features. This is unexpected, but it might be because the training dataset is too similar to the test dataset. That might also explain why using just lexical n-grams can give a correlation as high as 0.6.
Word ngrams only include unigrams and bigrams. Trigrams were tried but they
produced very bad results. This can be attributed to data sparseness. Yannakoudakis
et al. also do not use word trigrams or any higher order n-grams.
If we use POS n-gram counts instead of using their tf-idf weights, the correlation
decreases. This suggests that using the tf-idf weighting scheme is useful, especially
for ngram features.
Word n-grams weighted using the tf-idf scheme actually give better results when they are not normalized. Just using word n-grams weighted with the tf-idf scheme results in a Pearson's correlation of 0.6220 and a Spearman's correlation of 0.6251.
We have only used skeletons as structural features here. The results were much worse with annotated skeletons and rewrite rules, so we decided to omit those features: they represent all the information that skeletons do and more, so including them as well would have led to repetition. One possible reason why parse tree skeletons performed the best is that, since they do not even contain the label information at the root, they are the least likely to suffer from the data sparseness problem.
Table 2 presents Pearson's and Spearman's correlation between the CLC and our system, when removing one feature at a time.
Table 2. Ablation tests showing the correlation between the CLC and the AES system

Feature removed     Pearson's Correlation    Spearman's Rank Order Correlation
none                0.4904                   0.4904
Word n-grams        0.4956                   0.4924
Script length       0.4919                   0.4883
+ Error Rate        0.4959                   0.4928
+ Skeletons         0.4320                   0.4247
9. Future Work
For the future of this project, along with the addition of more features, an important
task is to get results for different datasets. This is to make sure that peculiarities in the
dataset do not influence the development of the AES system.
Prompt-specific features also need to be added so that essays can be graded without
the need for human annotated copies at all. Like some commercial essay scoring
engines, our software might also be able to mine for information depending on the
prompt and use that to grade essays on the topic.
This work can also be made into a free web application after some improvements. It
will be an interesting exercise from a research point of view too if we are able to
observe how the software works for real student essays.
10. Conclusion
Automated Essay Scoring is an interesting area to work on. There is definitely a lot of
scope for improvement and innovation. A lot still needs to be done to bring this into
the mainstream and gain widespread adoption. If this were done right and reliably, it would go a long way in not only reducing manual work but also improving teaching, and it could revolutionize education, since teachers would no longer be concerned about grading when deciding whether to give essay writing tasks to their students.
We have been able to make a proof-of-concept prototype essay scoring system. A lot
needs to be done to make it as reliable and functional as some of the commercially
available options which have been around for 40 years or so, but this is an
encouraging sign. Our results do not beat the best in this business but we can at least
provide an open source solution on which future research can be founded.
References
Automated Essay Scoring (http://en.wikipedia.org/wiki/Automated_essay_scoring)
Ellis Batten Page (http://en.wikipedia.org/wiki/Ellis_Batten_Page)
Yang Yongwei, Chad W. Buckendahl, Piotr J. Juskiewicz and Dennison S. Bhola (2002). "A Review of Strategies for Validating Computer-Automated Scoring". Applied Measurement in Education.
Ajay, H. B., Tillett, P. I., & Page, E. B. (1973). Analysis of essays by computer
(AEC-II) (No. 8-0102). Washington, DC: U.S. Department of Health, Education, and
Welfare, Office of Education, National Center for Educational Research and
Development.
Handbook of Automated Essay Evaluation: Current Applications and New Directions.
Edited by Mark D. Shermis and Jill Burstein
Page, E. B., & Petersen, N. S. (1995). The computer moves into essay grading:
Updating the ancient test. Phi Delta Kappan.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to latent semantic
analysis. Discourse Processes.
Attali, Y., & Burstein, J. (2006). Automated Essay Scoring With e-rater V.2. Journal
of Technology, Learning, and Assessment.
Breland, H. M., Jones, R. J., & Jenkins, L. (1994). The College Board vocabulary
study (College Board Report No. 94-4; Educational Testing Service Research Report No. 94-26). New York: College Entrance Examination Board.
Salton, G., Wong, A., & Yang, C.S. (1975). A vector space model for automatic
indexing. Communications of the ACM, 18, 613-620.
Landauer, T. K., Laham, D., & Foltz, P. W. (2003). Automated scoring and
annotation of essays with the Intelligent Essay Assessor. In M. D. Shermis & J.
Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 87-112). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Elliot, S. (2003). Intellimetric: From here to validity. In M. D. Shermis & J. Burstein
(Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 71-86).
Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, 19-24 June 2011.
D. Nicholls. 2003. The Cambridge Learner Corpus: Error coding and analysis for lexicography and ELT. In Proceedings of the Corpus Linguistics 2003 conference, pages 572-581.
Ziheng Lin, Hwee Tou Ng and Min-Yen Kan (2011). Automatically Evaluating Text
Coherence Using Discourse Relations. In Proceedings of the 49th Annual Meeting of
the Association for Computational Linguistics: Human Language Technologies
(ACL-HLT 2011), Portland, Oregon, USA, June.
Alphabetical list of part-of-speech tags used in the Penn Treebank Project
(http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)
Yoav Cohen, Anat Ben-Simon and Myra Hovav (2003). The Effect of Specific Language Features of the Complexity of Systems for Automated Essay Scoring. IAEA 29th Annual Conference, Manchester, UK.
Sean Massung, ChengXiang Zhai and Julia Hockenmaier (2013). Structural Parse
Tree Features for Text Representation. 2013 IEEE Seventh International Conference
on Semantic Computing.
S. Kim, H. Kim, T. Weninger, J. Han, and H. D. Kim, "Authorship classification: a discriminative syntactic tree mining approach," in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '11. New York, NY, USA: ACM, 2011, pp. 455-464. [Online]. Available: http://doi.acm.org/10.1145/2009916.2009979
Daniel Jurafsky & James H. Martin. Speech and Language Processing: An
introduction to natural language processing, computational linguistics, and speech
recognition.
John Judge, Aoife Cahil and Josef van Genabith. QuestionBank: Creating a Corpus of
Parse-Annotated Questions.
CYK Algorithm (http://en.wikipedia.org/wiki/CYK_algorithm)
Mark D. Shermis and Ben Hammer. Contrasting State-of-the-Art Automated Scoring
of Essays: Analysis.
Appendices
Appendix A: List of Part-of-Speech Tags Used
The following is an alphabetical list of the part-of-speech tags used in the Penn
Treebank Project:
Number Tag Description
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PRP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB Wh-adverb