B. Comp. Dissertation
Automated Essay Scoring
By
Shubham Goyal
Department of Computer Science
School of Computing
National University of Singapore
2013/2014
Project No: H014380
Advisor: Professor NG Hwee Tou
Deliverables:
Report: 1 Volume
Contents
List of Figures........................................................................................................................................ 2
1. Abstract.............................................................................................................................................. 3
2. Acknowledgement........................................................................................................................... 4
3. Goal...................................................................................................................................................... 5
4. Introduction ....................................................................................................................................... 6
5. Related Work.................................................................................................................................... 8
5.1 Background................................................................................................................................. 8
5.2 Comparison of the Current State of the Art Essay Systems.....................................12
6. Implementation ..............................................................................................................................15
6.1 Overview ...................................................................................................................................15
6.2. Features utilized.....................................................................................................................16
6.2.1 Content Features.............................................................................................................16
6.2.2 Syntactic Features ..........................................................................................................16
6.2.3 Surface Features..............................................................................................................17
6.2.4 Error Identification.........................................................................................................17
6.2.5 Structural Features .........................................................................................................18
6.8 Statistical Parsing...............................................................................................................22
6.9 Feature Weights..................................................................................................................26
6.10 Ranking Algorithm .........................................................................................................27
6.11 Evaluation Metrics ..........................................................................................................28
7. Dataset..............................................................................................................................................29
8. Results..............................................................................................................................................30
9. Future Work...................................................................................................................................33
10. Conclusion....................................................................................................................................34
References...........................................................................................................................................35
Appendices..........................................................................................................................................37
Appendix‐A: List of Part‐of‐Speech Tags Used................................................................37
List of Figures
Figure 1 Line Chart for Vendor Performance on the Pearson Product Moment Correlation across the Eight Essay Data Sets........................12
Figure 2 Implementation overview............................................................................15
Figure 3 Skeletons generated from the sentence 'They have many theoretical ideas' 18
Figure 4 Parse Tree of 'they have many theoretical ideas'..........................................19
Figure 5 Annotated skeletons in the sentence 'They have many theoretical ideas'..21
1. Abstract
Automated Essay Scoring (AES) is becoming increasingly popular as human grading
is not only expensive but also cumbersome as the number of test takers grows.
Quick feedback is another characteristic drawing educators towards AES.
However, most of the AES systems available today are commercial, closed-source
software. Our work aims to design a good AES system that uses some of the most
commonly used features to rank and score essays. We also evaluate our scoring engine on a
publicly available dataset to establish benchmarks. We will also make all the source
code available to the public so that future research can use this work as a starting point.
Subject Descriptors:
I.2.7 Natural Language Processing
H.3 Information Storage and Retrieval
I.2.6 Learning
I.2.8 Problem Solving, Control Methods, and Search
Keywords:
Artificial Intelligence, Natural Language Processing
Implementation Software and Hardware:
Python, Java
2. Acknowledgement
I would like to thank my supervisor, Prof NG Hwee Tou, for giving me the
opportunity to work under him and on this project. I am really honored to have the
pleasure of working under one of the best minds in this field. I would like to thank
him for all the time he has spent helping me, motivating me, guiding me and,
finally, shaping me into a better researcher so that I can fulfill my lifelong
ambition of becoming a good researcher.
I would also like to thank Prof’s graduate student, Raymond, for taking out the time
from his work to help provide me with APIs to get the trigram counts of words in the
English Gigaword corpora.
I also appreciate the help provided by another of Prof’s students, Christian
Hadiwinoto, for creating my account on the NLP cluster and helping me install
packages and run my programs there.
3. Goal
This project focuses on building a system for scoring English essays. The system
assigns a score to an essay reflecting the quality of the essay (based on both content
and grammar). The system will be evaluated on a benchmark test data set. Besides
aiming to build a state-of-the-art essay scoring system, the project will also
investigate the robustness and portability of essay scoring systems.
4. Introduction
According to Wikipedia, ‘Automated essay scoring (AES) is the use of specialized
computer programs to assign grades to essays written in an educational setting.’
Usually, the grades are not numeric scores but rather discrete categories. Therefore,
this can also be considered a problem of statistical classification, and by its very
nature it falls into the domain of natural language processing.
Historically, the origins of this field can be traced to the work of Ellis Batten “Bo”
Page, who is also widely regarded as the father of automated essay scoring. Page’s
development of and pioneering work with Project Essay Grade (PEGℱ) software in
the mid-1960s set the stage for the practical application of computer essay scoring
technology following the microcomputer revolution of the 1990s.
The most obvious approach to automated essay scoring is to employ machine
learning. This involves obtaining a set of essays that have been manually scored
(the training set). The software then evaluates features of the text of each
essay (surface features like the total number of words, word n-grams, part-of-speech
n-grams, etc., mostly quantities that can be measured without any human insight) and
constructs a mathematical model that relates these quantities to the scores that the
essays received. We can then use the model to calculate scores for new sets of
essays.
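As a toy illustration of this supervised setup (not the actual system described later, which uses a ranking SVM), the following sketch fits a simple linear model from two made-up surface features to human scores and then scores a new essay; all feature choices and numbers here are hypothetical.

import numpy as np

# Hypothetical training data: each row is [total words, words longer than 6 letters]
# and each target is a human-assigned score. Purely illustrative numbers.
X_train = np.array([[120.0, 14.0], [340.0, 52.0], [510.0, 90.0]])
y_train = np.array([18.0, 27.0, 33.0])

# Least-squares fit of score ~ w1*f1 + w2*f2 + b
A = np.hstack([X_train, np.ones((len(X_train), 1))])
w, *_ = np.linalg.lstsq(A, y_train, rcond=None)

new_essay = np.array([280.0, 40.0, 1.0])  # features of an unseen essay, plus bias term
print(new_essay @ w)                      # predicted score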
The next important question that arises is the determination of the criteria of success.
It might be insightful to look at essay scoring before the arrival of computers.
Usually, high-stakes essays were and still are rated by a few different raters who
each give their own score. The different scores are then compared to see if they
agree; if they do not, either a more experienced rater is called in to settle the
dispute, or the majority opinion is taken. We could apply the same
approach to checking the success of any AES software. The grades given by the
software could be matched with the grades given by human graders on the same
scripts. The higher the number of matches, the better the accuracy of the AES software.
Thus, various statistics have been proposed to measure this ‘agreement’ between the
AES software and the human graders. It could be something as simple as percent
agreement to more complicated measures like Pearson’s or Spearman’s rank
correlation coefficients.
The practice of AES has not been without its fair share of criticism. Yang et al.
mention "the overreliance on surface features of responses, the insensitivity to the
content of responses and to creativity, and the vulnerability to new types of cheating
and test-taking strategies." Some critics also fear that students’ motivation will be
diminished if they know that a human grader will not be reading their writings.
However, we feel that this criticism is directed not at AES itself but rather at the
fear of being assigned incorrect grades. It also shows that the current state-of-the-art
systems can still be improved, which makes this an exciting time to be working in this field.
5. Related Work
5.1 Background
As already mentioned in the introduction, the late Ellis Page and his colleagues at the
University of Connecticut programmed the first successful automated essay scoring
engine, “Project Essay Grade (PEG)” (1973). PEG did produce good results, but one
of the reasons it did not become a practical application was probably the limited
technology of the time.
Different AES systems evaluate different types and number of features which are
extracted from the text of the essay. Page and Petersen (1995), in their Phi Delta
Kappan article, “The computer moves into essay grading: Updating the ancient test,”
referred to these elements or features as “proxes” or approximations for underlying
“trins” (i.e., intrinsic characteristics) of writing. In the original version of PEG, the
text was parsed and classified into language elements such as parts of speech, word
length, word functions and the like. PEG would count keywords and make its
predictions based on the patterns of language that human raters valued or devalued in
making their score assignments. Page classified these counts into three categories:
simple, deceptively simple and sophisticated.
For example, a model in the PEG system might be formed by taking five intrinsic
characteristics of writing (content, creativity, style, mechanics, and organization) and
linking proxes to them. An example of a simple prox is essay length. Page found that the
relationship between the number of words used and the score assignment was not
linear, but rather logarithmic. In other words, essay length is factored in by human
raters up to some threshold, and then becomes less important as they focus on other
aspects of writing.
On the other hand, an example of a sophisticated prox would be a count of the number
of times “because” is used in an essay. A count of the word “because” may not be
important in and of itself, but as a discourse connector it serves as a proxy
for sentence complexity. Human raters tend to reward more complex sentences.
Some works emphasize the evaluation of content through the specification of
vocabulary (of course, the evaluation of other aspects of writing is performed as
described above). Latent Semantic Analysis and its variants are employed in some
works to provide estimates as to how close the vocabulary in an essay is to a targeted
vocabulary set (Landauer, Foltz & Laham, 1998). The Intelligent Essay Assessor
(Landauer, Foltz & Laham, 2003) is one of the most successful commercial
applications making heavy use of LSA.
If we look at the AES scene at present, there are three major AES developers –
1. e-rater (which is a component of Criterion - http://www.ets.org/criterion) by
Educational Testing Service (ETS)
2. Intellimetric by Vantage Learning (http://www.vantagelearning.com/)
3. Intelligent Essay Assessor by Pearson Knowledge Technologies
(http://kt.pearsonassessments.com/)
Fortunately for us, the construction of e-rater models is given in detail in a recent
work by Attali and Burstein (2006). The system takes in features from six main areas:
1. Grammar, usage, mechanics and style measures (4 features) –
The errors in these four categories are counted. Since the raw counts of
errors are highly related to essay length, rates of errors are used instead,
obtained by dividing the count in each category by the total number of words
in the essay.
2. Organization and development (2 features) –
The first feature in this category is the organization score which assumes a
writing strategy that includes an introductory paragraph, at least a three
paragraph body with each paragraph in the body consisting of a pair of main
point and supporting idea elements, and a concluding paragraph. The score
measures the difference between this minimum five paragraph essay and the
actual discourse elements found in the essay. The second feature is derived
from Criterion’s organization and development module.
3. Lexical Complexity (2 features) –
These are specifically related to word based characteristics. The first is a
measure of vocabulary level and the second is based on average word length
in characters across the words in the essay. The first feature is from Breland,
Jones, and Jenkins’ (1994) work on Standardized Frequency Index across the
words of the essay.
4. Prompt-specific vocabulary usage (2 features) – e-rater evaluates the lexical
content of an essay by comparing the words it contains to the words found in a
sample of essays from each score category. This is accomplished by making
use of content vector analysis (Salton, Wong, & Yang, 1975). In short, the
vocabulary of each score category is converted to a vector whose elements are
based on the frequency of each word in a sample of essays.
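The following toy sketch illustrates the content-vector idea in its simplest form: each score category's vocabulary and each essay are represented as word-count vectors and compared by cosine similarity. The actual weighting used in content vector analysis (Salton, Wong, & Yang, 1975) and in e-rater is more elaborate; the texts and vectors below are made up for illustration only.

import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical vocabulary vector for one score category, built from sample essays.
category_vector = Counter("the heart pumps blood through the circulatory system".split())
essay_vector = Counter("the heart moves blood around the body".split())
print(cosine(essay_vector, category_vector))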
Like most approaches, e-rater also uses a sample of human scored essay data for
model building purposes. e-rater models can be built at the topic level, in which case a
model is built for a specific essay prompt. More often, however, e-rater models are
built at the grade level. Preparing models for essays of similar topics or by students of
similar grades is not difficult per se, but it requires significant data collection and
human reader scoring, which are not only time consuming but also costly.
A specification of the Intellimetric model is given in Elliot (2003). The model selects
from more than 300 semantic, syntactic and discourse level features. The features fall
into five major categories:
1. Focus and Unity – includes cohesiveness, consistency in purpose and main idea
2. Development and Elaboration – includes metrics that look at the breadth of
content and the support for concepts advanced
3. Organization and Structure – mainly targeted at the logic of discourse, such as
transitional fluidity and relationships among parts of the response
4. Sentence Structure – includes sentence complexity and sentence variety
5. Mechanics and Conventions – features measuring conformance to
conventions of edited American English
Intellimetric uses Latent Semantic Dimension which is similar in nature to LSA
described earlier. Latent Semantic Dimension also determines how close the
candidate response is, in terms of content, to a modeled set of vocabulary. The paper
does not go into much more detail about how Intellimetric works, focusing instead
on the validation aspect.
Technical details of the Intelligent Essay Assessor are highlighted in Landauer,
Laham, & Foltz (2003). The content of the essay is assessed by using a combination
of external databases and LSA. The authors do talk about examples of external
databases used for three of their experiments. This is interesting to note because this
can shed important light on what kind of data needs to be extracted for automated
essay scoring in different situations.
In a particular experiment, the essay question was on the anatomy and function of the
heart and circulatory system. This was administered to 94 undergraduates at the
University of Colorado before and after an instructional session (N = 188) and scored
by two professional readers from Educational Testing Service (ETS). In this case, the
LSA semantic space was constructed by analysis of all 95 paragraphs in a set of 26
articles on the heart taken from an electronic version of Grolier’s Academic American
Encyclopedia. Even though this corpus was smaller than the corpora traditionally
used, it gave good results according to the authors. When the authors tried to expand
it by adding general text, the results did not improve.
We also draw immense inspiration from, and analyze, the work of Yannakoudakis et
al. (2011), but the discussion of it is deferred to subsequent sections in the
interests of brevity and to avoid repetition.
5.2 Comparison of the Current State of the Art Essay Systems
A recent study (Shermis and Hammer, 2012) compared the results from nine
automated essay scoring engines on eight prompts drawn from six states in the United
States that hold high-stakes writing exams. The essays encompassed writing
assessments from three grade levels (7, 8 and 10) and were evenly distributed
among the different prompts. In total, there were 22,029 essays.
The following line chart demonstrates the Pearson product-moment correlation across
the eight essay data sets –
Figure 1. Line Chart for Vendor Performance on the Pearson Product Moment Correlation across
the Eight Essay Data Sets
The nine automated essay scoring engines participating in the study were –
1. Autoscore developed by the American Institutes for Research (AIR)
The main features of this scoring engine include creating a statistical proxy for
prompt-specific rubrics (single as well as multiple trait). The engine needs to
be trained on known and valid scores.
2. LightSIDE developed at Carnegie Mellon University’s TELEDIA Lab
This is a free and open-source package and is very beginner friendly. It is
meant to be a tool for non-professionals to make use of data mining technology
for varied purposes, one of which is essay assessment.
3. Bookette developed by CTB McGraw-Hill Education
These scoring engines are able to model trait level and/or holistic level scores
for essays with a similar degree of reliability to an expert human rater. CTB
builds two types of engines – prompt specific and generic. When applied in
the classroom, the engines can provide performance feedback through the use
of the information found in the scoring rubric and through feedback on
grammar, spelling, conventions, etc. at the sentence level. Bookette engines
utilize around 90 text-features classified as structural, syntactic, semantic and
mechanics-based.
4. e-rater, developed by Educational Testing Service
This scoring engine is focused on evaluating essay quality. There are dozens of
features, each measuring a different, very specific aspect of essay quality. The
same features serve as the basis for performance feedback to students through
products like Criterion (http://www.ets.org/criterion).
5. Lexile Writing Analyzer developed by MetaMetrics
This is independent of grades, genres, prompts or punctuation and is an engine
for establishing Lexile writer measures. The Lexile writer measure is said to be an
inherent individual trait or power to compose written text, with writing ability
embedded in a complex web of cognitive and sociocultural processes.
6. Project Essay Grade (PEG), Measurement, Inc.
This scoring engine has had more than 40 years of study and enhancement
devoted to it. Studies conducted at a number of state departments of education
indicate that PEG demonstrated accuracy similar to trained human scorers.
7. Intelligent Essay Assessor (IEA), Pearson Knowledge Technologies
Some of the features are derived through semantic models of English (or any
other language) from an analysis of large volumes of text equivalent to the
reading material of a high school student (around 12 million words). This
scoring engine combines background knowledge about English in general and
the subject area of the assessment in particular along with prompt-specific
algorithms to learn how to match student responses to human scores. IEA
also provides feedback and can even be tuned to understand and examine text
in any language (Spanish, Arabic, Hindi, etc.). It can identify off-topic
responses, very unconventional essays and other unique circumstances that
need human attention. It has also been used for grading millions of essays in
high-stakes examinations.
8. CRASE℠ by Pacific Metrics
This system is highly configurable, both in terms of the customizations used to
build machine scoring models and in terms of how the system can blend
human scoring and machine scoring (i.e., hybrid models). It is actually a Java
application that runs as a web service.
9. IntelliMetric developed by Vantage Learning
This scoring system attempts to emulate what the human scorers do.
IntelliMetric is trained to score test-taker essays. Each prompt (essay) is first
scored by expert human scorers who develop anchor papers for each score
point. A number of papers for each score point are loaded into IntelliMetric,
which runs multiple algorithms to determine the specific writing features that
translate to various score points.
6. Implementation
6.1 Overview
This is what the entire process actually looks like in a nutshell –
Figure 2 Implementation overview
To score essays automatically, we need to train a machine-learning algorithm. After
the algorithm has been trained, it gives us a machine-learning model, which can be
used to score more essays. In order for a machine-learning model to be created,
features first need to be extracted from the text, as a computer cannot directly
understand English; we need to use numbers or symbols as proxies for meaning.
6.2. Features utilized
6.2.1 Content Features
6.2.1.1 Word n‐grams
An n-gram can simply be defined as a contiguous sequence of n items from a given
sequence of text or speech. For the purpose of essay scoring, an n-gram can be
understood as a sequence of one or more tokens. n can take any value, but usually only
n = 1 (unigrams), 2 (bigrams) and 3 (trigrams) are considered, because higher-order
n-grams suffer from the sparse data problem.
The tokens were converted to lower case before being used as n-grams. However, no
stemming was employed.
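A minimal sketch of this extraction step (lower-cased tokens, no stemming); the tokenisation here is a plain whitespace split, which is an assumption rather than the exact tokeniser used in our system.

def word_ngrams(text, n_values=(1, 2)):
    """Return lower-cased word n-grams as tuples, e.g. ('many', 'theoretical')."""
    tokens = text.lower().split()          # simplistic whitespace tokenisation
    ngrams = []
    for n in n_values:
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return ngrams

print(word_ngrams("They have many theoretical ideas"))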
6.2.2 Syntactic Features
6.2.2.1 POS n‐grams
This feature is the same as word n-grams except that we replace each word with its
part-of-speech (POS) tag, such as noun, verb, adjective, etc. Parts of speech are also known
as word classes or lexical categories. In this work, we employ the Penn Treebank tag
set, chosen because of its wide use. Appendix A details the different tags in this tag set.
The tokens are tagged in their original case because changing the case of a word
might change its tag (for example, proper nouns (NNP/NNPS) are usually identified
because of the capital initial letter). The methodology followed in Yannakoudakis et
al. is a bit different because they make use of their own RASP tagger for this purpose.
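The sketch below uses NLTK's default tagger, which outputs Penn Treebank tags; this particular tool is an assumption (the report only states that the Penn Treebank tag set is used), and the tokens keep their original case as discussed above.

import nltk
# may require nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

def pos_ngrams(sentence, n_values=(1, 2)):
    """Return n-grams of Penn Treebank POS tags for a sentence."""
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sentence))]
    ngrams = []
    for n in n_values:
        ngrams += [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]
    return ngrams

print(pos_ngrams("They have many theoretical ideas"))
# e.g. [('PRP',), ('VBP',), ..., ('PRP', 'VBP'), ('VBP', 'JJ'), ...]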
6.2.3 Surface Features
6.2.3.1 Script Length
Logically, script length should not have any relation to the score, because a short,
well-written piece of text should get the same score as a longer one. However, as
mentioned in the related work section above, script length has been found to affect
the score. Some works report that, empirically, the longer the essay, the better
the score. Script length could also cancel out any skew in the final results due to
features whose weights are influenced by script length.
This is a surface feature since it is completely language-blind. According to Cohen et
al., these surface variables are in themselves extremely predictive of the essay score.
However, the consequence of using such features alone can be that students
simply learn to write longer texts with no regard for rhetorical structure, the logic of
argumentation, and so forth. This is why such surface variables need to be used alongside
other features which relate to content, syntactic structure or rhetorical structure.
6.2.4 Error Identification
6.2.4.1 Error Rate
By error rate, we refer to the rate of occurrence of unknown (and hence, presumably
erroneous) n-grams. The simplest way of getting error rates is to use a language model
built from a suitably large and, ideally, in-context corpus, and then measure the rate of
occurrence of n-grams in the document which do not occur in the corpus. Error rate can be an
important feature for several reasons. Firstly, it can serve to identify improper
uses of grammar and words. If the rate of occurrence of grammatical errors in
two documents is the same, their scores are likely to be similar or to lie
in the same range. For the purpose of our research, we use Prof NG Hwee
Tou's corpora, which are parts of the English Gigaword (details here -
http://catalog.ldc.upenn.edu/LDC2009T13). The first corpus consists of the first 4 million
sentences and around 100 million words, while the second corpus consists of around
40 million sentences and more than a billion words.
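A simplified sketch of this computation, assuming the reference corpus has been reduced to a set of known trigrams; the actual setup queries trigram counts from the Gigaword corpora through an API, which is not reproduced here, and the example data is hypothetical.

def trigram_error_rate(tokens, known_trigrams):
    """Fraction of trigrams in the essay that never occur in the reference corpus."""
    trigrams = [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
    if not trigrams:
        return 0.0
    unknown = sum(1 for t in trigrams if t not in known_trigrams)
    return unknown / len(trigrams)

# Hypothetical reference set and essay.
known = {("they", "have", "many"), ("have", "many", "good")}
print(trigram_error_rate("they have many theoretical ideas".split(), known))  # 2/3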
6.2.5 Structural Features
The inspiration for these features comes from Massung et al. (2013).
6.2.5.1 Skeletons
We aim to capture the flow or discourse structure of sentences without
bothering about the actual labels. For example, if the input sentence is ‘They
have many theoretical ideas’, the following skeletons would be generated –
Figure 3. Skeletons generated from the sentence 'They have many theoretical ideas'
To understand why Figure 3 looks the way it does, let us look at the parse tree of 'They have
many theoretical ideas'. The following parse tree has been drawn with the help
of the nltk draw function (the tags used in this figure are documented
in Appendix A)
Figure 4. Parse Tree of 'they have many theoretical ideas'
To represent these skeletons of parse trees (for example, see Figure 3), we
store the trees as sets of square brackets. So, for the sentence 'They have
many theoretical ideas', the skeletons of the parse trees will be –
a) []
b) [[]]
c) [[[]]]
d) [[[]], [[]], [[]]]
e) [[[]], [[[]], [[]], [[]]]]
f) [[[[]]], [[[]], [[[]], [[]], [[]]]]]
We can choose to ignore the tree represented by (a) since it is trivial and will be
present in every document (in fact, each word or punctuation mark can be
represented as []). But (b), (c), (d), (e) and (f) correspond to the graphical
representations of the trees in Figure 3.
The procedure for identifying the skeletons is pretty simple. We start from
the root of the parse tree and recursively descend into sub-trees, recording the
inherent structure. The following pseudocode attempts to demonstrate how
this works –
procedure get_list_of_skeletons_of_sentence(sentence):
    tree = parse(sentence)
    if tree is NULL:
        return []
    subtrees_list = []
    convert_tree_to_list(tree, subtrees_list)
    return subtrees_list

procedure convert_tree_to_list(node, list):
    if node is not of type(Tree):       # a leaf (word or punctuation)
        list.append([])
        return []
    else:
        subtree = []
        for child in node:
            subtree.append(convert_tree_to_list(child, list))
        subtree.sort()                  # make the skeleton order-independent
        list.append(subtree)
        return subtree
This get_list_of_skeletons_of_sentence function returns the skeletal structure of the
sentence (or the list of skeletons that correspond to the sub-trees of the parse tree of
the sentence). It does this by first making the parse tree of the sentence and then
calling a recursive function convert_tree_to_list. convert_tree_to_list recursively goes
to each node in the tree, appends the skeleton of the subtree corresponding to that
node to a list and then returns the list of skeletons. The leaves of the parse tree are
represented as [] in our square bracket notation.
The section on ‘Statistical Parsing’ later attempts to detail how the function parse
works.
6.2.5.2 Annotated Skeletons
Annotated skeletons are the same as skeletons with just one extra piece of
information attached to them – the label of the topmost node of each parse
sub-tree. An example of the annotated skeletons for the same sentence 'They
have many theoretical ideas' is shown below –
Figure 5. Annotated skeletons in the sentence 'They have many theoretical ideas'
6.2.5.3 Rewrite Rules
This feature was used by Kim et al. (2011). It essentially tallies subtrees from
each sentence’s parse. It has historically mainly been used in text
classification where all the parse trees are put in different classes/categories so
each category has a ‘bag-of-trees’. This feature is beneficial as certain trees
can be abundant in particular categories. Kim et al. use it for authorship
classification. Simpler applications would include age detection or language
proficiency. Less proficient writers would be unlikely to use complicated tree
structures. This can be useful for essay scoring as well.
After conducting experiments, I decided to use only the Skeletons feature (Section 6.2.5.1)
from this section. The results section tries to analyze why this might have been the case.
To prevent overfitting or bias issues, we only use the features which appear at least 4
times in the entire training set. The value 4 is chosen because it is also used in
Yannakoudakis et al., so that it is easy to compare results.
6.8 Statistical Parsing
An important black box in the previous section on structural features was how
the parse trees are formed. Sentences on average tend to be syntactically very
ambiguous – coordination ambiguity, attachment ambiguity, etc. That is why we
need to use probabilistic parsing: we consider all possible interpretations and
then choose the most likely one.
The CS4248 (Natural Language Processing) class at NUS, which I took, covered
probabilistic context-free grammars (PCFGs), a probabilistic extension of context-free
grammars (CFGs) in which each rule has a probability assigned to it. We use the
probabilistic CKY algorithm to generate the most probable parses. The grammar is
learnt from two treebanks –
a) Penn Treebank
b) QuestionBank
6.8.1 Penn Treebank
The material annotated for this project includes such wide-ranging genres as IBM
computer manuals, nursing notes, Wall Street Journal articles, transcribed telephone
conversations, etc. For our work, we use a sample (a 5% fragment) of this huge
treebank which has been made available for non-commercial use. It contains parsed
data from the Wall Street Journal for 1,650 sentences (99 treebank files, wsj_0001 to
wsj_0099).
An example annotated sentence from the treebank –
( (S
(NP-SBJ
(NP (NNP Pierre) (NNP Vinken) )
(, ,)
(ADJP
(NP (CD 61) (NNS years) )
(JJ old) )
(, ,) )
(VP (MD will)
(VP (VB join)
(NP (DT the) (NN board) )
(PP-CLR (IN as)
(NP (DT a) (JJ nonexecutive) (NN director) ))
(NP-TMP (NNP Nov.) (CD 29) )))
(. .) ))
6.8.2 QuestionBank
This is a corpus of 4,000 parse-annotated questions developed by the National Centre
for Language Technology, School of Computing. It is provided free for research
purposes, which is also one of the reasons why it has been employed in this work. The
annotated parse trees are very similar to the ones in the Penn Treebank, so examples
have been omitted here in the interests of brevity.
After parsing the annotated data from the treebanks, we get a grammar (a list of
production rules). However, we still have to convert the grammar to Chomsky Normal
Form, because the CKY algorithm works only on context-free grammars given
in Chomsky Normal Form (CNF).
6.8.3 Chomsky Normal Form
A grammar is said to be in Chomsky Normal Form if all of its production rules are of
the form:
a) A → BC, or
b) A → α, or
c) S → Δ
where A, B and C are nonterminal symbols, α is a terminal symbol (or a constant), S
is the start symbol and Δ represents the empty string. Only the start symbol S may
appear on the left-hand side of rule (c), and rule (c) is valid only if Δ is part of the
language generated by the grammar G.
It has been proven that every context free grammar can be transformed into one in
Chomsky Normal Form.
6.8.4 Converting a CFG to CNF
1. Introduce a new start symbol S0. This means that a new rule has to be added
with regard to the previous start symbol S –
S0 → S
2. Eliminate all Δ rules. Δ rules can only be of the form A → Δ, where A is not
the start symbol (the proof is trivial). This can be done by removing every rule
with Δ on its right-hand side (RHS). For each rule that has A in its RHS, add a
set of new rules consisting of all the combinations of A replaced or not
replaced with Δ. If A occurs as a singleton on the right-hand side of any rule,
add a new rule A → Δ (let us call this new rule R), unless R has already been
removed.
3. Eliminate all unit rules. Unit rules are those whose RHS contains exactly one variable
and no terminals (such a rule is inconsistent with the conditions for a
grammar in Chomsky Normal Form described at the beginning of
this section). If the unit rule to be removed is X → Y and there exist one or
more rules of the form Y → Z (where Z is a string of variables and terminals),
add a new rule X → Z (unless this is a unit rule which has already been
removed).
4. Clean up the remaining rules that are not in Chomsky Normal Form. Replace
A → u1u2
uk, k ≄ 3, ui ∈ V âˆȘ ÎŁ, with A → u1A1, A1 → u2A2, 
, Ak−2 → uk−1uk,
where the Ai are new variables. If ui ∈ Σ, replace ui in the above rules with
some new variable Vi and add the rule Vi → ui.
Once all the rules have been converted to Chomsky Normal Form, we can assign
probabilities to them. This completes the learning of a probabilistic context free
grammar in Chomsky Normal Form (CNF) from the treebanks.
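As an illustration of this pipeline, the sketch below induces a PCFG from the Penn Treebank sample that ships with NLTK and parses a sentence with NLTK's Viterbi (probabilistic CKY-style) parser. This is not our own CKY implementation, and the training data is not identical to ours (which also includes QuestionBank); it is only a sketch of the same idea under those assumptions.

import nltk
from nltk.corpus import treebank   # may require nltk.download('treebank')

# Convert every treebank tree to Chomsky Normal Form and collect its productions.
productions = []
for tree in treebank.parsed_sents():
    tree.collapse_unary(collapsePOS=False)    # remove unit rules
    tree.chomsky_normal_form(horzMarkov=2)    # binarise long right-hand sides
    productions += tree.productions()

grammar = nltk.induce_pcfg(nltk.Nonterminal('S'), productions)
parser = nltk.ViterbiParser(grammar)

# Words absent from the training data will raise a coverage error in practice.
for parse in parser.parse("they have many ideas .".split()):
    print(parse)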
Now, given an input sentence, we need to use the probabilistic grammar to generate
the most likely parse tree. We use the Cocke-Younger-Kasami (CKY) algorithm. We
do have to modify the standard version of the algorithm, since the standard version
checks only for membership. The pseudocode for the standard version is as follows –
Let the grammar be represented by G.
Let S: a1 ... an be the input sentence or phrase.
Let R1 ... Rr be the non-terminal symbols present in the grammar.
Let RS contain the start symbols, RS ∈ G.
Let P[n, n, r] be a three-dimensional array of Booleans.

for each i = 1 to n:
    for each j = 1 to n:
        for each k = 1 to r:
            P[i, j, k] = false          # initialise the chart

for each i = 1 to n:
    for each unit production Rj -> ai:
        P[i, i, j] = true               # words derivable from a single non-terminal

for each i = 2 to n:                    # i = length of the span
    for each L = 1 to n - i + 1:        # L = start of the span
        R = L + i - 1                   # R = end of the span
        for each M = L + 1 to R:        # M = split point of the span
            for each production Rα -> RÎČ RÎł:
                if P[L, M - 1, ÎČ] and P[M, R, Îł]:
                    P[L, R, α] = true

for each i = 1 to r:
    if P[1, n, i]:
        return true                     # the sentence is in the language
return false
The above algorithm checks for the membership of the sentence in the language.
Our goal was to construct a parse tree, so we changed the array P to store parse tree
nodes instead of Boolean values. These nodes are associated with the array elements
that were used to produce them so as to build the tree structure; this is a simple
back-tracking procedure.
Thus, finally, the parse function in get_list_of_skeletons_of_sentence can return the
tree structure generated by the CKY algorithm.
6.9 Feature Weights
In the previous sections, we have discussed the methods employed to generate the
features for a given essay. Each unique feature (for example, a particular token or
unigram, or a particular parse tree) is given a unique number to represent it. However,
we also need to decide what weights (or importance) to assign to those features.
We experimented with several feature weighting schemes -
1. The simplest way is to use a 0 or a 1 depending on whether a feature is
present or absent.
2. Another technique that was tried was to use the number of times the feature
occurs in a given essay as its weight.
3. tf-idf weighting was also tried for certain features, especially word n-grams
and POS n-grams. The next section gives more details on the tf-idf statistic.
6.9.1 tf‐idf scheme
tf-idf is short for term frequency-inverse document frequency. This is often used as a
weighting factor in information retrieval and data mining. It is a product of term
frequency and inverse document frequency.
Various ways of calculating term frequency exist but probably the easiest one and the
one we have used in our work is to simply take the number of times the feature occurs
in a particular essay.
The inverse document frequency is a measure of whether the term is unique or
common across essays. We can arrive at this statistic by dividing the total number of
essays by the number of essays containing the term, and then taking a logarithm of the
quotient.
The reason why such a statistic is needed is that if we were to simply take counts
of features or their presence/absence, we would miss how important they are to
a document. In the context of essays, there might be certain phrases which, when
present, score highly in the eyes of a grader but do not occur commonly across all
documents. Even though this statistic might make more sense for information
retrieval tasks, it was still tried to see its impact.
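A minimal sketch of this weighting over tokenised essays, following the definitions above (raw counts for tf, log of the essay-count ratio for idf); smoothing and normalisation variants are omitted, and the example documents are made up.

import math
from collections import Counter

def tf_idf(essays):
    """Return a dict of feature -> tf-idf weight for each tokenised essay."""
    n = len(essays)
    df = Counter()                      # number of essays each feature appears in
    for tokens in essays:
        df.update(set(tokens))
    return [{t: count * math.log(n / df[t]) for t, count in Counter(tokens).items()}
            for tokens in essays]

docs = [["they", "have", "many", "ideas"], ["many", "people", "have", "ideas", "ideas"]]
print(tf_idf(docs))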
6.10 Ranking Algorithm
Now that we have discussed the features we are using or plan to use, let us look at the
machine learning aspect of the problem. We model this as a ranking problem
rather than a classification one. One reason for this is that we have the absolute human
grader scores, so we can do better than classification into a few buckets: if
we convert each score to a grade, we voluntarily lose some information. On the
other hand, predicting the exact score also does not make sense, because a
machine will not be able to predict the exact score accurately. Thus, ranking seems
like a good, viable option.
We use a support vector machine for the above task. Our choice is motivated by the fact
that other works, specifically Yannakoudakis et al., make use of the SVMlight library
(http://svmlight.joachims.org/), which makes it easy to compare results. To be precise,
we use SVMrank (http://www.cs.cornell.edu/People/tj/svm_light/svm_rank.html), which
employs new algorithms for training ranking SVMs and is much faster than SVMlight.
The decision to switch over to SVMrank is made easier by the fact that both libraries
are by the same author, and (T. Joachims, 2006) states that both libraries solve the same
optimization problem, with the only difference being that SVMrank is much faster.
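For reference, SVMrank reads training data in the SVMlight format, one example per line: a target value, a query id, then feature-id:value pairs with increasing integer ids. The sketch below writes such a file; treating all essays as a single query and the exact command-line options are assumptions, not necessarily our actual configuration.

def write_svmrank_file(path, scored_essays, qid=1):
    """Write (score, {feature_id: value}) pairs in SVMlight/SVMrank format."""
    with open(path, "w") as f:
        for score, features in scored_essays:
            pairs = " ".join(f"{fid}:{val}" for fid, val in sorted(features.items()))
            f.write(f"{score} qid:{qid} {pairs}\n")

# Hypothetical essays with sparse feature vectors keyed by feature id.
write_svmrank_file("train.dat", [(25, {1: 0.5, 7: 1.0}), (32, {1: 0.2, 3: 2.0})])
# Training would then look roughly like: svm_rank_learn -c 20 train.dat model.dat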
6.11 Evaluation Metrics
The two evaluation metrics which have been employed in our work are –
6.11.1 Pearson’s Product‐Moment Correlation Coefficient
Pearson's correlation measures the degree of linear relationship between two variables.
It gives a value in the range [-1, 1], where -1 denotes total negative correlation,
0 denotes no correlation and 1 denotes total positive correlation. However, the value
of this metric can be misleading in some rare cases due to outliers or due to its
inherent sensitivity to the distribution of the data.
6.11.2 Spearman’s Rank Correlation Coefficient
This is a non-parametric robust measure of statistical dependence between two
variables. It essentially assesses how well a relationship between the two variables
can be described using a monotonic function. If there are no repeated data values, a
perfect Spearman correlation of +1 or −1 occurs when each of the variables is a
perfect monotone function of the other.
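Both coefficients are available in SciPy, so a sketch of the evaluation step (with made-up score lists) looks like this:

from scipy.stats import pearsonr, spearmanr

human_scores = [24, 31, 18, 27, 35]      # hypothetical human-assigned overall scores
system_scores = [22, 33, 20, 25, 36]     # hypothetical system predictions

r, _ = pearsonr(human_scores, system_scores)
rho, _ = spearmanr(human_scores, system_scores)
print(f"Pearson r = {r:.4f}, Spearman rho = {rho:.4f}")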
7. Dataset
As can be observed from the related work and the introduction, automated essay scoring
is a data-intensive task. To be able to predict scores, we not only need the dataset to
contain as many essay scripts as possible, but the scripts also need to be properly
annotated or, at the least, manually graded.
For our own experiments, we are currently making use of data drawn from the CLC
FCE dataset, a set of 1,244 exam scripts written by candidates sitting the Cambridge
ESOL First Certificate in English (FCE) examination in 2000 and 2001, and made
available by Cambridge University Press; see (Yannakoudakis et al., 2011).
The CLC dataset is divided into training and test sets. The training set consists of
1141 scripts from the year 2000 written by 1141 distinct learners, and 97 scripts from
the year 2001, used for testing, written by 97 distinct learners. The learners' ages follow a
bimodal distribution with peaks at approximately 16-20 and 26-30 years of age.
Yannakoudakis et al. claim that there is no overlap between the prompts used in 2000
and in 2001. The scripts also have some meta-data about the candidates' grades, native
languages and ages.
The First Certificate in English (FCE) exam’s writing component consists of two
tasks asking learners to write either a letter, a report, an article, a composition or a
short story, between 200 and 400 words. Answers to each of these tasks are annotated
with marks (in the range 1-40). In addition, an overall mark is assigned to both tasks.
We do not make use of the individual task scores and just use the overall score, because
(Yannakoudakis et al., 2011) use just the overall score and so it gives us a
benchmark against which to compare our results.
Each script is also tagged with information about the linguistic errors committed,
using a taxonomy of approximately 80 error types (Nicholls, 2003). An example of
this is the following –
Thanks for <NS type="DD"><i>you</i><c>your</c></NS> letter.
The part of the text between <i> and </i> denotes the incorrect text while the part
between <c> and </c> denotes the correction of that incorrect text.
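A small sketch of how these annotations can be read off a script; the regular expression below handles the flat <NS> pattern shown above and ignores nested annotations, which is a simplifying assumption.

import re

NS_PATTERN = re.compile(
    r'<NS type="(?P<type>[^"]+)">(?:<i>(?P<wrong>.*?)</i>)?(?:<c>(?P<correct>.*?)</c>)?</NS>')

line = 'Thanks for <NS type="DD"><i>you</i><c>your</c></NS> letter.'
for m in NS_PATTERN.finditer(line):
    print(m.group("type"), m.group("wrong"), "->", m.group("correct"))   # DD you -> your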
8. Results
The following table contains the correlation values after adding the different features–
Table 1. Spearman's and Pearson's Correlation Values

Features                               Pearson's Correlation    Spearman's Rank Order Correlation
Word ngrams                            0.6005                   0.5967
+ PoS ngrams (tf-idf weights)          0.6053                   0.5982
+ POS ngrams (counts as weights)       0.5679                   0.5612
+ Script length                        0.5685                   0.5622
+ Error Rate                           0.4247                   0.4247
+ Skeletons                            0.4904                   0.4904
Since we use the same dataset as Yannakoudakis et al. for benchmarking, we can
compare our results with theirs. Our correlation coefficients have nearly the same
value when we just use word n-grams as a feature. For the other features, some variation
is to be expected, since they do not use the same tagger for PoS tagging, their error rate
is calculated in a different manner, and they do not use the same structural features.
From Table 1, we can see that our predictions vary more and more from the human
scores as more features are added. The best results are obtained when only using the
word and PoS ngrams as features. This is unexpected but it might be because the
training dataset is too similar to the test dataset. That might also explain why using
just lexical ngrams can give a correlation as high as 0.6.
Word ngrams only include unigrams and bigrams. Trigrams were tried but they
produced very bad results. This can be attributed to data sparseness. Yannakoudakis
et al. also do not use word trigrams or any higher order n-grams.
If we use POS n-gram counts instead of using their tf-idf weights, the correlation
decreases. This suggests that using the tf-idf weighting scheme is useful, especially
for ngram features.
Word n-grams weighted using tf-idf scheme actually give better results when they are
not normalized. Just using word n-grams weighted using the tf-idf scheme results in a
Pearson’s correlation of 0.6220 and Spearman’s correlation of 0.6251.
We have only used skeletons as structural features here; the results were
much worse with annotated skeletons and rewrite rules. We decided to omit those
features because they encode all the information that skeletons do and more, so
including them alongside skeletons would have been redundant. One possible reason
why parse tree skeletons performed best is that, since they do not even contain the
label information at the root, they are the least likely to suffer from the data
sparseness problem.
Table 2 presents Pearson’s and Spearman’s correlation between the CLC and our
system, when removing one feature at a time.
Table 2. Ablation tests showing the correlation between the CLC and the AES system

Features          Pearson's Correlation    Spearman's Rank Order Correlation
none              0.4904                   0.4904
Word n-grams      0.4956                   0.4924
Script length     0.4919                   0.4883
+ Error Rate      0.4959                   0.4928
+ Skeletons       0.4320                   0.4247
9. Future Work
For the future of this project, along with the addition of more features, an important
task is to get results for different datasets. This is to make sure that peculiarities in the
dataset do not influence the development of the AES system.
Prompt-specific features also need to be added so that essays can be graded without
the need for human annotated copies at all. Like some commercial essay scoring
engines, our software might also be able to mine for information depending on the
prompt and use that to grade essays on the topic.
This work can also be made into a free web application after some improvements. It
will be an interesting exercise from a research point of view too if we are able to
observe how the software works for real student essays.
10. Conclusion
Automated Essay Scoring is an interesting area to work on. There is definitely a lot of
scope for improvement and innovation. A lot still needs to be done to bring this into
the mainstream and gain widespread adoption. If this were done right and reliably, it
would go a long way not only in reducing manual work but also in improving teaching,
and it could revolutionize education, since teachers would no longer be concerned about
grading when deciding to give essay-writing tasks to their students.
We have been able to make a proof-of-concept prototype essay scoring system. A lot
needs to be done to make it as reliable and functional as some of the commercially
available options which have been around for 40 years or so, but this is an
encouraging sign. Our results do not beat the best in this business but we can at least
provide an open source solution on which future research can be founded.
References
Automated Essay Scoring (http://en.wikipedia.org/wiki/Automated_essay_scoring)
Ellis Batten Page (http://en.wikipedia.org/wiki/Ellis_Batten_Page)
Yang Yongwei, Chad W. Buckendahl, Piotr J. Juskiewicz and Dennison S. Bhola
(2002). “A review of Strategies for Validating Computer-Automated Scoring”.
Applied Measurement in Education.
Ajay, H. B., Tillett, P. I., & Page, E. B. (1973). Analysis of essays by computer
(AEC-II) (No. 8-0102). Washington, DC: U.S. Department of Health, Education, and
Welfare, Office of Education, National Center for Educational Research and
Development.
Handbook of Automated Essay Evaluation: Current Applications and New Directions.
Edited by Mark D. Shermis and Jill Burstein
Page, E. B., & Petersen, N. S. (1995). The computer moves into essay grading:
Updating the ancient test. Phi Delta Kappan.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to latent semantic
analysis. Discourse Processes.
Attali, Y., & Burstein, J. (2006). Automated Essay Scoring With e-rater V.2. Journal
of Technology, Learning, and Assessment.
Breland, H. M., Jones, R. J., & Jenkins, L. (1994). The College Board vocabulary
study (College Board Report No. 94–4; Educational Testing Service Research Report
No. 94–26). New York: College Entrance Examination Board.
Salton, G., Wong, A., & Yang, C.S. (1975). A vector space model for automatic
indexing. Communications of the ACM, 18, 613–620.
Landauer, T. K., Laham, D., & Foltz, P. W. (2003). Automated scoring and
annotation of essays with the Intelligent Essay Assessor. In M. D. Shermis & J.
Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 87–
112). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Elliot, S. (2003). Intellimetric: From here to validity. In M. D. Shermis & J. Burstein
(Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 71-86).
Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and
method for automatically grading ESOL texts. In Proceedings of the 49th Annual
Meeting of the Association for Computational Linguistics: Human Language
Technologies, Portland, Oregon, USA, 19-24 June 2011.
D. Nicholls. 2003. The Cambridge Learner Corpus: Error coding and analysis for
lexicography and ELT. In Proceedings of the Corpus Linguistics 2003 conference,
pages 572–581.
Ziheng Lin, Hwee Tou Ng and Min-Yen Kan (2011). Automatically Evaluating Text
Coherence Using Discourse Relations. In Proceedings of the 49th Annual Meeting of
the Association for Computational Linguistics: Human Language Technologies
(ACL-HLT 2011), Portland, Oregon, USA, June.
Alphabetical list of part-of-speech tags used in the Penn Treebank Project
(http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)
Yoav Cohen, Anat Ben-Simon and Myra Hovav (2003). The Effect of Specific
Language Features on the Complexity of Systems for Automated Essay Scoring.
IAEA 29th Annual Conference, Manchester, UK.
Sean Massung, ChengXiang Zhai and Julia Hockenmaier (2013). Structural Parse
Tree Features for Text Representation. 2013 IEEE Seventh International Conference
on Semantic Computing.
S. Kim, H. Kim, T. Weninger, J. Han, and H. D. Kim, “Authorship classification: a
discriminative syntactic tree mining approach,” in Proceedings of the 34th
International ACM SIGIR Conference on Research and Development in Information
Retrieval, ser. SIGIR ’11. New York, NY, USA: ACM, 2011, pp. 455–464. [Online].
Available: http://doi.acm.org/10.1145/2009916.2009979
Daniel Jurafsky & James H. Martin. Speech and Language Processing: An
introduction to natural language processing, computational linguistics, and speech
recognition.
John Judge, Aoife Cahil and Josef van Genabith. QuestionBank: Creating a Corpus of
Parse-Annotated Questions.
CYK Algorithm (http://en.wikipedia.org/wiki/CYK_algorithm)
Mark D. Shermis and Ben Hammer. Contrasting State-of-the-Art Automated Scoring
of Essays: Analysis.
Appendices
Appendix‐A: List of Part‐of‐Speech Tags Used
The following is an alphabetical list of the part‐of‐speech tags used in the Penn
Treebank Project:
Number Tag Description
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PRP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB Wh-adverb
Formal Letter In English For Your Needs - Letter TemplRichard Hogue
 
Get Essay Help You Can Get Essays Written For You By
Get Essay Help You Can Get Essays Written For You ByGet Essay Help You Can Get Essays Written For You By
Get Essay Help You Can Get Essays Written For You ByRichard Hogue
 
Sample Website Analysis Essay. Online assignment writing service.
Sample Website Analysis Essay. Online assignment writing service.Sample Website Analysis Essay. Online assignment writing service.
Sample Website Analysis Essay. Online assignment writing service.Richard Hogue
 
Pin By Cindy Campbell On GrammarEnglish Language E
Pin By Cindy Campbell On GrammarEnglish Language EPin By Cindy Campbell On GrammarEnglish Language E
Pin By Cindy Campbell On GrammarEnglish Language ERichard Hogue
 
How To Write Evaluation Paper. Self Evaluation Ess
How To Write Evaluation Paper. Self Evaluation EssHow To Write Evaluation Paper. Self Evaluation Ess
How To Write Evaluation Paper. Self Evaluation EssRichard Hogue
 
Pumpkin Writing Page (Print Practice) - Made By Teach
Pumpkin Writing Page (Print Practice) - Made By TeachPumpkin Writing Page (Print Practice) - Made By Teach
Pumpkin Writing Page (Print Practice) - Made By TeachRichard Hogue
 
What Is The Best Way To Write An Essay - HazelNe
What Is The Best Way To Write An Essay - HazelNeWhat Is The Best Way To Write An Essay - HazelNe
What Is The Best Way To Write An Essay - HazelNeRichard Hogue
 
The Importance Of Reading Books Free Essay Example
The Importance Of Reading Books Free Essay ExampleThe Importance Of Reading Books Free Essay Example
The Importance Of Reading Books Free Essay ExampleRichard Hogue
 
Narrative Essay Personal Leadership Style Essay
Narrative Essay Personal Leadership Style EssayNarrative Essay Personal Leadership Style Essay
Narrative Essay Personal Leadership Style EssayRichard Hogue
 
Thesis Introduction Examples Examples - How To Write A The
Thesis Introduction Examples Examples - How To Write A TheThesis Introduction Examples Examples - How To Write A The
Thesis Introduction Examples Examples - How To Write A TheRichard Hogue
 
Literature Review Thesis Statemen. Online assignment writing service.
Literature Review Thesis Statemen. Online assignment writing service.Literature Review Thesis Statemen. Online assignment writing service.
Literature Review Thesis Statemen. Online assignment writing service.Richard Hogue
 
008 Essay Writing Competitions In India Cust
008 Essay Writing Competitions In India Cust008 Essay Writing Competitions In India Cust
008 Essay Writing Competitions In India CustRichard Hogue
 
A LEVEL SOCIOLOGY 20 MARK GENDER SOCUS. Online assignment writing service.
A LEVEL SOCIOLOGY 20 MARK GENDER SOCUS. Online assignment writing service.A LEVEL SOCIOLOGY 20 MARK GENDER SOCUS. Online assignment writing service.
A LEVEL SOCIOLOGY 20 MARK GENDER SOCUS. Online assignment writing service.Richard Hogue
 
Composition Writing Meaning. How To Write A D
Composition Writing Meaning. How To Write A DComposition Writing Meaning. How To Write A D
Composition Writing Meaning. How To Write A DRichard Hogue
 
Get Essay Writing Help At My Assignment Services By Our Highly
Get Essay Writing Help At My Assignment Services By Our HighlyGet Essay Writing Help At My Assignment Services By Our Highly
Get Essay Writing Help At My Assignment Services By Our HighlyRichard Hogue
 

More from Richard Hogue (20)

Paper Mate Write Bros Ballpoint Pens, Medium P
Paper Mate Write Bros Ballpoint Pens, Medium PPaper Mate Write Bros Ballpoint Pens, Medium P
Paper Mate Write Bros Ballpoint Pens, Medium P
 
Writing Phrases Best Essay Writing Service, Essay Writ
Writing Phrases Best Essay Writing Service, Essay WritWriting Phrases Best Essay Writing Service, Essay Writ
Writing Phrases Best Essay Writing Service, Essay Writ
 
Examples How To Write A Persuasive Essay - Acker
Examples How To Write A Persuasive Essay - AckerExamples How To Write A Persuasive Essay - Acker
Examples How To Write A Persuasive Essay - Acker
 
Controversial Issue Essay. Controversial Issue Essay
Controversial Issue Essay. Controversial Issue EssayControversial Issue Essay. Controversial Issue Essay
Controversial Issue Essay. Controversial Issue Essay
 
Best Tips On How To Write A Term Paper Outline, Form
Best Tips On How To Write A Term Paper Outline, FormBest Tips On How To Write A Term Paper Outline, Form
Best Tips On How To Write A Term Paper Outline, Form
 
Formal Letter In English For Your Needs - Letter Templ
Formal Letter In English For Your Needs - Letter TemplFormal Letter In English For Your Needs - Letter Templ
Formal Letter In English For Your Needs - Letter Templ
 
Get Essay Help You Can Get Essays Written For You By
Get Essay Help You Can Get Essays Written For You ByGet Essay Help You Can Get Essays Written For You By
Get Essay Help You Can Get Essays Written For You By
 
Sample Website Analysis Essay. Online assignment writing service.
Sample Website Analysis Essay. Online assignment writing service.Sample Website Analysis Essay. Online assignment writing service.
Sample Website Analysis Essay. Online assignment writing service.
 
Pin By Cindy Campbell On GrammarEnglish Language E
Pin By Cindy Campbell On GrammarEnglish Language EPin By Cindy Campbell On GrammarEnglish Language E
Pin By Cindy Campbell On GrammarEnglish Language E
 
How To Write Evaluation Paper. Self Evaluation Ess
How To Write Evaluation Paper. Self Evaluation EssHow To Write Evaluation Paper. Self Evaluation Ess
How To Write Evaluation Paper. Self Evaluation Ess
 
Pumpkin Writing Page (Print Practice) - Made By Teach
Pumpkin Writing Page (Print Practice) - Made By TeachPumpkin Writing Page (Print Practice) - Made By Teach
Pumpkin Writing Page (Print Practice) - Made By Teach
 
What Is The Best Way To Write An Essay - HazelNe
What Is The Best Way To Write An Essay - HazelNeWhat Is The Best Way To Write An Essay - HazelNe
What Is The Best Way To Write An Essay - HazelNe
 
The Importance Of Reading Books Free Essay Example
The Importance Of Reading Books Free Essay ExampleThe Importance Of Reading Books Free Essay Example
The Importance Of Reading Books Free Essay Example
 
Narrative Essay Personal Leadership Style Essay
Narrative Essay Personal Leadership Style EssayNarrative Essay Personal Leadership Style Essay
Narrative Essay Personal Leadership Style Essay
 
Thesis Introduction Examples Examples - How To Write A The
Thesis Introduction Examples Examples - How To Write A TheThesis Introduction Examples Examples - How To Write A The
Thesis Introduction Examples Examples - How To Write A The
 
Literature Review Thesis Statemen. Online assignment writing service.
Literature Review Thesis Statemen. Online assignment writing service.Literature Review Thesis Statemen. Online assignment writing service.
Literature Review Thesis Statemen. Online assignment writing service.
 
008 Essay Writing Competitions In India Cust
008 Essay Writing Competitions In India Cust008 Essay Writing Competitions In India Cust
008 Essay Writing Competitions In India Cust
 
A LEVEL SOCIOLOGY 20 MARK GENDER SOCUS. Online assignment writing service.
A LEVEL SOCIOLOGY 20 MARK GENDER SOCUS. Online assignment writing service.A LEVEL SOCIOLOGY 20 MARK GENDER SOCUS. Online assignment writing service.
A LEVEL SOCIOLOGY 20 MARK GENDER SOCUS. Online assignment writing service.
 
Composition Writing Meaning. How To Write A D
Composition Writing Meaning. How To Write A DComposition Writing Meaning. How To Write A D
Composition Writing Meaning. How To Write A D
 
Get Essay Writing Help At My Assignment Services By Our Highly
Get Essay Writing Help At My Assignment Services By Our HighlyGet Essay Writing Help At My Assignment Services By Our Highly
Get Essay Writing Help At My Assignment Services By Our Highly
 

Recently uploaded

Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991RKavithamani
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxRoyAbrique
 

Recently uploaded (20)

Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
CĂłdigo Creativo y Arte de Software | Unidad 1
CĂłdigo Creativo y Arte de Software | Unidad 1CĂłdigo Creativo y Arte de Software | Unidad 1
CĂłdigo Creativo y Arte de Software | Unidad 1
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
 

Automated Essay Scoring

10. Conclusion....................................................................................................................................34
References...........................................................................................................................................35
Appendices..........................................................................................................................................37
Appendix-A: List of Part-of-Speech Tags Used................................................................37
List of Figures
Figure 1 Line Chart for Vendor Performance on the Pearson Product Moment Correlation across the Eight Essay Data Sets........12
Figure 2 Implementation overview............................................................................15
Figure 3 Skeletons generated from the sentence 'They have many theoretical ideas'......18
Figure 4 Parse tree of 'They have many theoretical ideas'..........................................19
Figure 5 Annotated skeletons for the sentence 'They have many theoretical ideas'......21
1. Abstract
Automated Essay Scoring (AES) is becoming increasingly popular as human grading is not only expensive but also cumbersome as the number of test takers grows. Quick feedback is another characteristic drawing educators towards AES. However, most of the AES systems available today are commercial, closed-source software. Our work aims to design a good AES system that uses some of the most commonly used features to rank and score essays. We also evaluate our scoring engine on a publicly available dataset to establish benchmarks. We will also make all the source code available to the public so that future research can use this work as a starting point.

Subject Descriptors:
I.2.7 Natural Language Processing
H.3 Information Storage and Retrieval
I.2.6 Learning
I.2.8 Problem Solving, Control Methods, and Search

Keywords: Artificial Intelligence, Natural Language Processing

Implementation Software and Hardware: Python, Java
2. Acknowledgement
I would like to thank my supervisor, Prof NG Hwee Tou, for giving me the opportunity to work under him on this project. I am honored to have had the pleasure of working under one of the best minds in this field. I would like to thank him for all the time he has spent helping me, motivating me, guiding me and, finally, helping me become a better researcher so that I can fulfill my lifelong ambition of becoming a good researcher.

I would also like to thank Prof's graduate student, Raymond, for taking time out from his work to provide me with APIs to get the trigram counts of words in the English Gigaword corpora. I also appreciate the help provided by another of Prof's students, Christian Hadiwinoto, for creating my account on the NLP cluster and helping me install packages and run my programs there.
3. Goal
This project focuses on building a system for scoring English essays. The system assigns a score to an essay reflecting the quality of the essay (based on both content and grammar). The system will be evaluated on a benchmark test data set. Besides aiming to build a state-of-the-art essay scoring system, the project will also investigate the robustness and portability of essay scoring systems.
4. Introduction
According to Wikipedia, 'Automated essay scoring (AES) is the use of specialized computer programs to assign grades to essays written in an educational setting.' Usually, the grades are not numeric scores but rather discrete categories. Therefore, this can also be considered a problem of statistical classification and, by its very nature, this problem falls into the domain of natural language processing.

Historically, the origins of this field can be traced to the work of Ellis Batten "Bo" Page, who is widely regarded as the father of automated essay scoring. Page's development of and pioneering work with Project Essay Grade (PEGℱ) software in the mid-1960s set the stage for the practical application of computer essay scoring technology following the microcomputer revolution of the 1990s.

The most obvious approach to automated essay scoring is to employ machine learning. This involves getting a set of essays that have been manually scored (the training set). The software then evaluates features of the text of each essay (surface features like the total number of words, word n-grams, part-of-speech n-grams, etc., mostly quantities that can be measured without any human insight) and constructs a mathematical model that relates these quantities to the scores that the essays received. We can then use the model to calculate scores for new sets of essays.

The next important question is how to determine the criteria of success. It is insightful to look at essay scoring before the arrival of computers. High-stakes essays were, and still are, rated by a few different raters who each give their own score. The scores are then compared to see whether they agree; if they do not, either a more experienced rater is called in to settle the dispute, or the majority opinion is taken. We can apply the same approach to checking the success of any AES software: the grades given by the software are matched with the grades given by human graders on the same scripts. The more matches there are, the more accurate the AES software is considered to be.
Various statistics have thus been proposed to measure this 'agreement' between the AES software and the human graders. It could be something as simple as percent agreement, or a more complicated measure like Pearson's or Spearman's rank correlation coefficients.

The practice of AES has not been without its fair share of criticism. Yang et al. mention "the overreliance on surface features of responses, the insensitivity to the content of responses and to creativity, and the vulnerability to new types of cheating and test-taking strategies." Some critics also fear that students' motivation will be diminished if they know that a human grader will not be reading their writing. However, we feel that this criticism is directed not at AES itself but at the fear of being assigned incorrect grades. This criticism also shows that the current state-of-the-art systems can still be improved, so this is an exciting time to be working in this field.
5. Related Work
5.1 Background
As already mentioned in the introduction, the late Ellis Page and his colleagues at the University of Connecticut programmed the first successful automated essay scoring engine, "Project Essay Grade (PEG)" (1973). PEG did produce good results, but one of the reasons why it did not become a practical application is probably the technology of the time.

Different AES systems evaluate different types and numbers of features, which are extracted from the text of the essay. Page and Peterson (1995), in their Phi Delta Kappan article, "The computer moves into essay grading: Updating the ancient test," referred to these elements or features as "proxes", or approximations for underlying "trins" (i.e., intrinsic characteristics) of writing. In the original version of PEG, the text was parsed and classified into language elements such as parts of speech, word length, word functions and the like. PEG would count keywords and make its predictions based on the patterns of language that human raters valued or devalued in making their score assignments. Page classified these counts into three categories: simple, deceptively simple and sophisticated. For example, a model in the PEG system might be formed by taking five intrinsic characteristics of writing (content, creativity, style, mechanics, and organization) and linking proxes to them.

An example of a simple prox is essay length. Page found that the relationship between the number of words used and the score assignment was not linear, but rather logarithmic. In other words, essay length is factored in by human raters up to some threshold, and then becomes less important as they focus on other aspects of writing. On the other hand, an example of a sophisticated prox is a count of the number of times "because" is used in an essay. A count of the word "because" may not be important in and of itself, but as a discourse connector it serves as a proxy for sentence complexity, and human raters tend to reward more complex sentences.
Some works emphasize the evaluation of content through the specification of vocabulary (the evaluation of other aspects of writing is, of course, performed as described above). Latent Semantic Analysis and its variants are employed in some works to provide estimates of how close the vocabulary in an essay is to a targeted vocabulary set (Landauer, Foltz & Laham, 1998). The Intelligent Essay Assessor (Landauer, Foltz & Laham, 2003) is one of the most successful commercial applications making heavy use of LSA.

If we look at the AES scene at present, there are three major AES developers:
1. e-rater (which is a component of Criterion - http://www.ets.org/criterion) by Educational Testing Service (ETS)
2. Intellimetric by Vantage Learning (http://www.vantagelearning.com/)
3. Intelligent Essay Assessor by Pearson Knowledge Technologies (http://kt.pearsonassessments.com/)

Fortunately for us, the construction of e-rater models is described in detail in a recent work by Attali and Burstein (2006). The system takes in features from six main areas:
1. Grammar, usage, mechanics and style measures (4 features) – They count the errors in these four categories. Since the raw counts of errors are highly related to essay length, error rates are used instead, obtained by dividing the counts in each category by the total number of words in the essay.
2. Organization and development (2 features) – The first feature in this category is the organization score, which assumes a writing strategy that includes an introductory paragraph, a body of at least three paragraphs with each paragraph consisting of a pair of main point and supporting idea elements, and a concluding paragraph. The score measures the difference between this minimum five-paragraph essay and the actual discourse elements found in the essay. The second feature is derived from Criterion's organization and development module.
3. Lexical complexity (2 features) – These are specifically related to word-based characteristics. The first is a measure of vocabulary level and the second is based on the average word length in characters across the words in the essay. The first feature is from Breland, Jones, and Jenkins' (1994) work on the Standardized Frequency Index across the words of the essay.
4. Prompt-specific vocabulary usage (2 features) – e-rater evaluates the lexical content of an essay by comparing the words it contains to the words found in a sample of essays from each score category. This is accomplished by making use of content vector analysis (Salton, Wong, & Yang, 1975). In short, the vocabulary of each score category is converted to a vector whose elements are based on the frequency of each word in a sample of essays.

Like most approaches, e-rater also uses a sample of human-scored essay data for model building. e-rater models can be built at the topic level, in which case a model is built for a specific essay prompt. More often, however, e-rater models are built at the grade level. Preparing models for essays of similar topics or by students of similar grades is not difficult per se, but it requires significant data collection and human reader scoring, which are not only time-consuming but also costly.

A specification of the Intellimetric model is given in Elliot (2003). The model selects from more than 300 semantic, syntactic and discourse-level features. The features fall into five major categories:
1. Focus and Unity – Includes cohesiveness, consistency in purpose and main idea
2. Development and Elaboration – Includes metrics that look at the breadth of content and the support for concepts advanced
3. Organization and Structure – Mainly targeted at the logic of discourse, such as transitional fluidity and relationships among parts of the response
4. Sentence Structure – Includes sentence complexity and sentence variety
5. Mechanics and Conventions – Features measuring conformance to the conventions of edited American English
Intellimetric uses a Latent Semantic Dimension, which is similar in nature to the LSA described earlier. The Latent Semantic Dimension also determines how close the candidate response is, in terms of content, to a modeled set of vocabulary. The paper does not go into much more detail about how Intellimetric works, and focuses instead on the validation aspect.

Technical details of the Intelligent Essay Assessor are highlighted in Landauer, Laham, & Foltz (2003). The content of the essay is assessed by using a combination of external databases and LSA. The authors do describe examples of external databases used for three of their experiments. This is interesting to note because it sheds important light on what kind of data needs to be collected for automated essay scoring in different situations. In a particular experiment, the essay question was on the anatomy and function of the heart and circulatory system. This was administered to 94 undergraduates at the University of Colorado before and after an instructional session (N = 188) and scored by two professional readers from Educational Testing Service (ETS). In this case, the LSA semantic space was constructed by analysis of all 95 paragraphs in a set of 26 articles on the heart taken from an electronic version of Grolier's Academic American Encyclopedia. Even though this corpus was smaller than the corpora traditionally used, it gave good results according to the authors. When the authors tried to expand it by adding general text, the results did not improve.

We also draw immense inspiration from and analyze the work of Yannakoudakis et al. (2011), but the discussion of that is deferred to the subsequent sections in the interests of brevity and to avoid repetition.
5.2 Comparison of the Current State of the Art Essay Systems
A recent study (Shermis and Hammer, 2012) compared the results from nine automated essay scoring engines on eight prompts drawn from six states in the United States that hold high-stakes writing exams. The essays encompassed writing assessments from three grade levels, namely 7, 8 and 10, and were evenly distributed among the different prompts. In total, there were 22,029 essays. The following line chart shows the Pearson product-moment correlation across the eight essay data sets –

Figure 1. Line Chart for Vendor Performance on the Pearson Product Moment Correlation across the Eight Essay Data Sets

The nine automated essay scoring engines participating in the study were:
1. Autoscore, developed by the American Institutes for Research (AIR)
The main features of this scoring engine include creating a statistical proxy for
prompt-specific rubrics (single as well as multiple trait). The engine needs to be trained on known and valid scores.
2. LightSIDE, developed at Carnegie Mellon University's TELEDIA Lab
This is a free and open-source package and is very beginner friendly. It is meant to be a tool for non-professionals to make use of data mining technology for varied purposes, one of which is essay assessment.
3. Bookette, developed by CTB McGraw-Hill Education
These scoring engines are able to model trait-level and/or holistic-level scores for essays with a degree of reliability similar to that of an expert human rater. CTB builds two types of engines: prompt-specific and generic. When applied in the classroom, the engines can provide performance feedback through the use of the information found in the scoring rubric and through feedback on grammar, spelling, conventions, etc. at the sentence level. Bookette engines utilize around 90 text features classified as structural, syntactic, semantic and mechanics-based.
4. e-rater, developed by Educational Testing Service
This scoring engine is focused on evaluating essay quality. There are dozens of features, each measuring a different, very specific aspect of essay quality. The same features serve as the basis for performance feedback to students through products like Criterion (http://www.ets.org/criterion).
5. Lexile Writing Analyzer, developed by MetaMetrics
This is independent of grades, genres, prompts or punctuation and is an engine for establishing Lexile writer measures. The Lexile writer measure is said to be an inherent individual trait, or power to compose written text, with writing ability embedded in a complex web of cognitive and sociocultural processes.
6. Project Essay Grade (PEG), Measurement, Inc.
This scoring engine has had more than 40 years of study and enhancement devoted to it. Studies conducted at a number of state departments of education indicate that PEG demonstrates accuracy similar to trained human scorers.
7. Intelligent Essay Assessor (IEA), Pearson Knowledge Technologies
Some of the features are derived through semantic models of English (or any other language) from an analysis of large volumes of text equivalent to the reading material of a high school student (around 12 million words). This scoring engine combines background knowledge about English in general and the subject area of the assessment in particular, along with prompt-specific algorithms, to learn how to match student responses to human scores. IEA also provides feedback and can even be tuned to understand and examine text in any language (Spanish, Arabic, Hindi, etc.). It can identify off-topic responses, very unconventional essays and other unique circumstances that need human attention. It has also been used for grading millions of essays in high-stakes examinations.
8. CRASEℱ, by Pacific Metrics
This system is highly configurable, both in terms of the customizations used to build machine scoring models and in terms of how the system can blend human scoring and machine scoring (i.e., hybrid models). It is actually a Java application that runs as a web service.
9. IntelliMetric, developed by Vantage Learning
This scoring system attempts to emulate what human scorers do. IntelliMetric is trained to score test-taker essays. Each prompt (essay) is first scored by expert human scorers who develop anchor papers for each score point. A number of papers for each score point are loaded into IntelliMetric, which runs multiple algorithms to determine the specific writing features that translate to various score points.
6. Implementation
6.1 Overview
This is what the entire process looks like in a nutshell –

Figure 2. Implementation overview

To score essays automatically, we need to train a machine-learning algorithm. After the algorithm has been trained, it gives us a machine-learning model, which can be used to score more essays. In order for a machine-learning model to be created,
features first need to be extracted from the text, as a computer cannot directly understand English. We need to use numbers or symbols as proxies for meaning.

6.2 Features utilized
6.2.1 Content Features
6.2.1.1 Word n-grams
An n-gram can simply be defined as a contiguous sequence of n items from a given sequence of text or speech. For the purpose of essay scoring, n-grams can simply be understood as collections of one or more tokens. n can have any value, but usually only n = 1 (unigrams), 2 (bigrams) and 3 (trigrams) are considered, because higher-order n-grams suffer from the sparse data problem. The tokens were converted to lower case before being used as n-grams; however, no stemming was employed.

6.2.2 Syntactic Features
6.2.2.1 POS n-grams
This feature is the same as word n-grams except that we replace each word with its part-of-speech (POS) tag, such as noun, verb, adjective, etc. Parts of speech are also known as word classes or lexical categories. In this work, we employ the Penn Treebank tag set, chosen because of its wide use. Appendix A.1 details the different tags in this tag set. The tokens are tagged in their original case because changing the case of a word might change its tag (for example, proper nouns (NNPS) are usually identified because of the capital initial letter). The methodology followed in Yannakoudakis et al. is a bit different because they make use of the RASP tagger for this purpose.
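To make the two features above concrete, the following is a minimal sketch of how word and POS n-gram counts could be extracted in Python using NLTK. The tokenizer, tagger and function names here are illustrative; they are not necessarily the exact tools used in this work –

# A minimal sketch of word and POS n-gram extraction using NLTK.
# Requires: nltk, with the 'punkt' and 'averaged_perceptron_tagger' data downloaded.
from collections import Counter
import nltk
from nltk.util import ngrams

def word_ngrams(text, n_values=(1, 2, 3)):
    # Lower-case the tokens before forming n-grams, as described above.
    tokens = [t.lower() for t in nltk.word_tokenize(text)]
    counts = Counter()
    for n in n_values:
        counts.update(ngrams(tokens, n))
    return counts

def pos_ngrams(text, n_values=(1, 2, 3)):
    # Keep the original case for tagging, then replace each word by its POS tag.
    tokens = nltk.word_tokenize(text)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    counts = Counter()
    for n in n_values:
        counts.update(ngrams(tags, n))
    return counts

essay = "They have many theoretical ideas. These ideas are rarely tested."
print(word_ngrams(essay).most_common(5))
print(pos_ngrams(essay).most_common(5))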
6.2.3 Surface Features
6.2.3.1 Script Length
Logically, script length should not have any relation to the score, because a shorter well-written piece of text should get the same score as a longer one. However, as mentioned in the related work section above, script length has been found to affect the score; some works report that, empirically, the longer the essay, the better the score. Also, script length can cancel out skew in the final results caused by features whose weights are influenced by script length. This is a surface feature since it is completely language-blind. According to Cohen et al., these surface variables are in themselves extremely predictive of the essay score. However, a consequence of using such features alone can be that students simply learn to write longer texts with no regard for rhetorical structure, the logic of argumentation, and so forth. This is why such surface variables need to be used alongside other features which relate to content, syntactic structure or rhetorical structure.

6.2.4 Error Identification
6.2.4.1 Error Rate
By error rate, we refer to the rate of occurrence of unknown (and hence, presumably erroneous) n-grams. The simplest way of getting error rates is to use a language model built from a suitably large and, ideally, in-context corpus, and then measure the rate of occurrence of n-grams in the document which do not occur in that corpus. Error rate can be an important feature for several reasons. Firstly, it can serve to identify improper use of grammar and words. If the rate of occurrence of grammatical errors in two documents is the same, the probability is that the scores would be similar or lie in the same range too. For the purpose of our research, we are using Prof NG Hwee Tou's corpora. These corpora are parts of the English Gigaword (details here - http://catalog.ldc.upenn.edu/LDC2009T13). The first corpus consists of the first 4 million sentences and around 100 million words, while the second corpus consists of around 40 million sentences and more than a billion words.
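As a rough illustration of this feature, the error rate can be computed by counting how many of an essay's trigrams are absent from a set of reference trigrams collected from the background corpus. The sketch below assumes such a reference set has already been built; the names reference_trigrams and trigram_error_rate are hypothetical –

# A minimal sketch of the unknown-n-gram error rate, assuming a set of
# reference trigrams has already been collected from a background corpus.
import nltk
from nltk.util import ngrams

def trigram_error_rate(text, reference_trigrams):
    tokens = [t.lower() for t in nltk.word_tokenize(text)]
    essay_trigrams = list(ngrams(tokens, 3))
    if not essay_trigrams:
        return 0.0
    unseen = sum(1 for tg in essay_trigrams if tg not in reference_trigrams)
    # Fraction of the essay's trigrams never seen in the background corpus.
    return unseen / len(essay_trigrams)

# Toy reference set; in practice this would come from the Gigaword counts.
reference_trigrams = {("they", "have", "many"), ("have", "many", "ideas")}
print(trigram_error_rate("They have many theoretical ideas.", reference_trigrams))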
6.2.5 Structural Features
The inspiration for these features comes from Massung et al. (2013).

6.2.5.1 Skeletons
We aim to capture the flow or discourse structure of sentences without bothering about the actual labels. For example, if the input sentence is 'They have many theoretical ideas', the following skeletons would be generated –

Figure 3. Skeletons generated from the sentence 'They have many theoretical ideas'

To understand why Figure 3 looks the way it does, let us look at the parse tree of 'They have many theoretical ideas'. The following parse tree has been drawn with the help of the nltk draw function (the tags used in this figure are documented in section A.1 of the Appendix).
Figure 4. Parse tree of 'They have many theoretical ideas'

To represent these skeletons of parse trees (for example, see Figure 3), we store the trees as sets of square brackets. So, for the sentence 'They have many theoretical ideas', the skeletons of the parse trees will be –
a) []
b) [[]]
c) [[[]]]
d) [[[]], [[]], [[]]]
e) [[[]], [[[]], [[]], [[]]]]
f) [[[[]]], [[[]], [[[]], [[]], [[]]]]]
We can choose to ignore the tree represented by (a), since it is trivial and will be present in every document (in fact, each word or punctuation mark can be represented as []). But (b), (c), (d), (e) and (f) correspond to the graphical representations of the trees in Figure 3.
The procedure for identifying the skeletons is pretty simple: we start from the root of the parse tree and recursively descend into the sub-trees, recording the inherent structure. The following code demonstrates how this works –

from nltk.tree import Tree   # assuming parse trees are represented as NLTK Tree objects

def get_list_of_skeletons_of_sentence(sentence):
    tree = parse(sentence)
    if tree is None:
        return []
    subtrees_list = []
    convert_tree_to_list(tree, subtrees_list)
    return subtrees_list

def convert_tree_to_list(node, skeletons):
    # Leaves (tokens) are represented as [] in the square bracket notation.
    if not isinstance(node, Tree):
        skeletons.append([])
        return []
    # Internal node: build its skeleton from the skeletons of its children.
    subtree = []
    for subtree_node in node:
        subtree.append(convert_tree_to_list(subtree_node, skeletons))
    subtree.sort()
    skeletons.append(subtree)
    return subtree

This get_list_of_skeletons_of_sentence function returns the skeletal structure of the sentence (the list of skeletons that correspond to the sub-trees of the parse tree of the sentence). It does this by first constructing the parse tree of the sentence and then calling the recursive function convert_tree_to_list. convert_tree_to_list visits each node in the tree, appends the skeleton of the subtree rooted at that node to the running list of skeletons, and returns the skeleton of that subtree. The leaves of the parse tree are represented as [] in our square bracket notation.
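As a quick usage sketch (bypassing the parse() step by building the tree directly from a bracketed parse string; the bracketing below is an illustrative rendering of the tree in Figure 4), the skeletons (a) to (f) of the example sentence can be reproduced as follows –

# Usage sketch: build the skeletons directly from a bracketed parse string.
from nltk.tree import Tree

tree = Tree.fromstring(
    "(S (NP (PRP They)) (VP (VBP have) (NP (JJ many) (JJ theoretical) (NNS ideas))))")
skeletons = []
convert_tree_to_list(tree, skeletons)
for s in skeletons:
    print(s)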
The later section on 'Statistical Parsing' details how the function parse works.

6.2.5.2 Annotated Skeletons
Annotated skeletons are the same as skeletons with just one extra piece of information attached to them: the label of the topmost node of each parse sub-tree. An example of the annotated skeletons for the same sentence 'They have many theoretical ideas' is shown in Figure 5 (a small code sketch is given after the Rewrite Rules subsection below).

Figure 5. Annotated skeletons for the sentence 'They have many theoretical ideas'

6.2.5.3 Rewrite Rules
This feature was used by Kim et al. (2011). It essentially tallies subtrees from each sentence's parse. It has historically been used mainly in text classification, where all the parse trees are put into different classes/categories so that each category has a 'bag-of-trees'. This feature is beneficial, as certain trees can be abundant in particular categories. Kim et al. use it for authorship classification. Simpler applications would include age detection or language-proficiency estimation: less proficient writers would be unlikely to use complicated tree structures. This can be useful for essay scoring as well.
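Returning to the annotated skeletons of Section 6.2.5.2, one possible way to attach the top-node label is sketched below. The string representation is only an illustration; the exact representation used in this work may differ –

# A possible extension of the skeleton extraction that keeps the label of the
# topmost node of each sub-tree. Leaves are rendered as '[]', internal nodes
# as 'LABEL[...children...]'. Illustrative only.
from nltk.tree import Tree

def annotated_skeleton(node, skeletons):
    if not isinstance(node, Tree):
        skeletons.append("[]")
        return "[]"
    children = sorted(annotated_skeleton(child, skeletons) for child in node)
    result = node.label() + "[" + ", ".join(children) + "]"
    skeletons.append(result)
    return result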
After conducting experiments, I decided to use only the Skeletons feature (6.2.5.1) from this section. The Results section tries to analyze why this might have been the case. To prevent overfitting or bias issues, we only use the features which appear at least 4 times in the entire training set. The value 4 is chosen because it is also used in Yannakoudakis et al., making it easy to compare results.

6.8 Statistical Parsing
An important black box in the previous section on structural features was how the parse trees are formed. Sentences, on average, tend to be very syntactically ambiguous (coordination ambiguity, attachment ambiguity, etc.), which is why we need to use probabilistic parsing: we consider all possible interpretations and then choose the most likely one. The CS4248 (Natural Language Processing) class at NUS, which I took, covered probabilistic context-free grammars (PCFGs), a probabilistic extension of context-free grammars (CFGs) in which each rule has a probability assigned to it. We use the probabilistic CKY algorithm to generate the most probable parses. The algorithm is trained on two treebank grammars:
a) Penn Treebank
b) QuestionBank

6.8.1 Penn Treebank
The material annotated for this project includes such wide-ranging genres as IBM computer manuals, nursing notes, Wall Street Journal articles, transcribed telephone conversations, etc. For our work, we use a sample (a 5% fragment) of this huge treebank which has been made available for non-commercial use. It contains parsed data from the Wall Street Journal for 1,650 sentences (99 treebank files, wsj_0001 to wsj_0099). An example annotated sentence from the treebank –
( (S
    (NP-SBJ
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,)
      (ADJP
        (NP (CD 61) (NNS years) )
        (JJ old) )
      (, ,) )
    (VP (MD will)
      (VP (VB join)
        (NP (DT the) (NN board) )
        (PP-CLR (IN as)
          (NP (DT a) (JJ nonexecutive) (NN director) ))
        (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) ))

6.8.2 QuestionBank
This is a corpus of 4,000 parse-annotated questions developed by the National Centre for Language Technology and School of Computing. It is provided free for research purposes, which is also one of the reasons why it has been employed in this work. The annotated parse trees are very similar to the ones in the Penn Treebank, so examples have been omitted here in the interests of brevity.

After parsing the annotated data from the treebanks, we get a grammar (a list of production rules). But we still have to convert the grammars to Chomsky Normal Form, because the CKY algorithm works only on context-free grammars given in Chomsky Normal Form (CNF).

6.8.3 Chomsky Normal Form
A grammar is said to be in Chomsky Normal Form if all of its production rules are of the form:
a) A → BC, or
b) A → α, or
c) S → Δ
where A, B and C are nonterminal symbols, α is a terminal symbol, S is the start symbol and Δ represents the empty string. Only S is allowed to be the start symbol. Moreover, rule (c) is allowed only if Δ is part of the language generated by the grammar G. It has been proven that every context-free grammar can be transformed into one in Chomsky Normal Form.

6.8.4 Converting a CFG to CNF
1. Introduce a new start symbol S0. This also means that a new rule has to be added with regard to the previous start variable S: S0 → S.
2. Eliminate all Δ rules. Δ rules can only be of the form A → Δ, where A is not the start symbol (the proof is trivial). This is done by removing every rule with Δ on its right-hand side (RHS). For each rule that has A in its RHS, add a set of new rules consisting of all the combinations of that occurrence of A being kept or omitted. If A occurs as the entire RHS of some rule B → A, add a new rule B → Δ, unless that rule has already been removed.
3. Eliminate all unit rules. Unit rules are those whose RHS contains exactly one variable and no terminals (such a rule is inconsistent with the conditions for a grammar in Chomsky Normal Form as described at the beginning of this section). If the unit rule to be removed is X → Y and there exist one or more rules of the form Y → Z (where Z is a string of variables and terminals), add a new rule X → Z (unless this is a unit rule which has already been removed).
4. Clean up the remaining rules that are not in Chomsky Normal Form. Replace A → u1u2
uk, k ≄ 3, ui ∈ V âˆȘ ÎŁ, with A → u1A1, A1 → u2A2, 
, Ak-2 → uk-1uk,
where the Ai are new variables. If ui ∈ ÎŁ, replace ui in the above rules with a new variable Vi and add the rule Vi → ui.

Once all the rules have been converted to Chomsky Normal Form, we can assign probabilities to them. This completes the learning of a probabilistic context-free grammar in Chomsky Normal Form (CNF) from the treebanks. Now, given an input sentence, we need to use the probabilistic grammar to generate the most likely parse tree. We use the Cocke-Younger-Kasami (CKY) algorithm. We do have to modify the standard version of the algorithm, since the standard version checks only for membership. The pseudocode for the standard version is as below –

Let the grammar be represented by G.
Let S: a1...an be the input sentence or phrase.
Let R1, ..., Rr be the non-terminal symbols present in the grammar.
Let RS be the start symbol of the grammar.
Let P[n, n, r] be a three-dimensional array of Booleans.

for each i = 1 to n:
    for each j = 1 to n:
        for each k = 1 to r:
            P[i, j, k] = false
for each i = 1 to n:
    for each terminal production Rj → ai:
        P[i, i, j] = true
for each i = 2 to n:                  (span length)
    for each L = 1 to n - i + 1:      (span start)
        R = L + i - 1                 (span end)
        for each M = L + 1 to R:      (split point)
            for each production Rα → RÎČ RÎł:
                if P[L, M - 1, ÎČ] and P[M, R, Îł]:
                    P[L, R, α] = true
if P[1, n, s], where Rs is the start symbol RS:
    return true
return false
The above algorithm checks for the membership of the sentence in the language. Our goal was to construct a parse tree, so we changed the array P to store parse tree nodes instead of Boolean values. These nodes are associated with the array elements that were used to produce them, so as to build the tree structure; this is a simple back-tracking procedure. Thus, finally, the parse function in get_list_of_skeletons_of_sentence can return the tree structure generated by the CKY algorithm.
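For concreteness, the following is a minimal, self-contained sketch of the membership-checking CKY recognizer in Python over a toy CNF grammar. The rule representation (lists of terminal and binary rules) is an illustrative choice, not the representation used in our system –

# A minimal CKY recognizer over a toy grammar in Chomsky Normal Form.
def cky_recognize(tokens, terminal_rules, binary_rules, start_symbol):
    n = len(tokens)
    # table[i][j] holds the set of non-terminals deriving tokens[i..j] (0-indexed spans).
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(tokens):
        for lhs, terminal in terminal_rules:
            if terminal == word:
                table[i][i].add(lhs)
    for length in range(2, n + 1):              # span length
        for i in range(0, n - length + 1):      # span start
            j = i + length - 1                  # span end
            for m in range(i, j):               # split point
                for lhs, b, c in binary_rules:
                    if b in table[i][m] and c in table[m + 1][j]:
                        table[i][j].add(lhs)
    return start_symbol in table[0][n - 1]

# Toy CNF grammar: S -> NP VP, VP -> V NP, NP -> 'they' | 'ideas', V -> 'like'
terminal_rules = [("NP", "they"), ("V", "like"), ("NP", "ideas")]
binary_rules = [("S", "NP", "VP"), ("VP", "V", "NP")]
print(cky_recognize(["they", "like", "ideas"], terminal_rules, binary_rules, "S"))  # True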
• 28. 27 | P a g e 3. tf-idf weighting was also tried for certain features, especially word n-grams and PoS n-grams. The next section gives more details on the tf-idf statistic.

6.9.1 tf-idf scheme

tf-idf is short for term frequency-inverse document frequency. It is often used as a weighting factor in information retrieval and data mining, and is the product of the term frequency and the inverse document frequency. Various ways of calculating the term frequency exist, but the simplest one, and the one we use in our work, is simply the number of times the feature occurs in a particular essay. The inverse document frequency measures whether the term is rare or common across essays: it is obtained by dividing the total number of essays by the number of essays containing the term, and then taking the logarithm of the quotient, that is, idf(t) = log(N / n_t), where N is the total number of essays and n_t is the number of essays containing the term t. Such a statistic is needed because raw counts of features, or their mere presence or absence, do not capture how important a feature is to a document. In the context of essays, there might be certain phrases which, when present, score highly in the eyes of a grader but do not occur commonly across all essays. Even though this statistic is more commonly associated with information retrieval tasks, we tried it to see its impact.

6.10 Ranking Algorithm

Now that we have discussed the features we use and plan to use, let us look at the machine learning aspect of the problem. We model this as a ranking problem and not a classification one. One reason is that we have the absolute human grader scores, so we can do better than classification into a few buckets; converting each score to a grade would voluntarily discard information. On the
• 29. 28 | P a g e other hand, trying to predict the exact score is also unrealistic, since a machine cannot be expected to reproduce the human grader's exact score. Ranking is therefore a viable middle ground. We use a support vector machine for this task. Our choice is motivated by the fact that other works, specifically Yannakoudakis et al., make use of the SVMlight (http://svmlight.joachims.org/) library, which makes it easy to compare results. To be precise, we use SVMrank (http://www.cs.cornell.edu/People/tj/svm_light/svm_rank.html), which employs newer algorithms for training ranking SVMs and is much faster than SVMlight. The decision to switch to SVMrank is made easier by the fact that both libraries are by the same author, and (T. Joachims, 2006) states that both solve the same optimization problem, the only difference being that SVMrank is much faster.

6.11 Evaluation Metrics

The two evaluation metrics employed in our work are the following.

6.11.1 Pearson's Product-Moment Correlation Coefficient

Pearson's correlation measures the strength of the linear relationship between two variables. It gives a value in the range [-1, 1], where -1 denotes total negative linear correlation, 0 denotes no linear correlation and 1 denotes total positive linear correlation. However, the value of this metric can be misleading in some cases due to outliers or due to its sensitivity to the distribution of the data.

6.11.2 Spearman's Rank Correlation Coefficient

This is a non-parametric, robust measure of statistical dependence between two variables. It assesses how well the relationship between the two variables can be described using a monotonic function. If there are no repeated data values, a perfect Spearman correlation of +1 or -1 occurs when each of the variables is a perfect monotone function of the other.
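To make the ranking and evaluation steps concrete, here is a minimal sketch showing (a) how feature vectors can be written in the SVMlight/SVMrank input format that both libraries read, and (b) how the two correlation coefficients can be computed with scipy once predictions are available. The file name, feature values and scores below are illustrative dummy data, not taken from our experiments.

from scipy.stats import pearsonr, spearmanr

def write_svmrank_file(path, scores, feature_vectors, qid=1):
    # One line per essay in the SVMlight / SVMrank format:
    #   <target> qid:<qid> <feature_id>:<value> ...   (feature ids in increasing order)
    with open(path, "w") as out:
        for score, features in zip(scores, feature_vectors):
            pairs = " ".join("%d:%g" % (fid, val) for fid, val in sorted(features.items()))
            out.write("%d qid:%d %s\n" % (score, qid, pairs))

# Dummy data: overall marks and sparse feature vectors (feature id -> weight)
gold_scores = [23, 31, 18]
feature_vectors = [{1: 1.0, 7: 0.5}, {1: 2.0, 3: 1.2}, {2: 0.3, 7: 1.1}]
write_svmrank_file("train.dat", gold_scores, feature_vectors)
# svm_rank_learn and svm_rank_classify are then run on such files from the command line.

# Given the system's predicted scores for the test essays, the two metrics are:
predicted = [25.1, 28.7, 19.4]
print("Pearson's r:    %.4f" % pearsonr(gold_scores, predicted)[0])
print("Spearman's rho: %.4f" % spearmanr(gold_scores, predicted)[0])

Note that SVMrank outputs ranking scores rather than absolute marks, so the correlations are computed between those predicted values and the gold overall marks.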
• 30. 29 | P a g e 7. Dataset

As can be observed from the related work and the introduction, automated essay scoring is a data-intensive task. To be able to predict scores, we not only need the dataset to contain as many essay scripts as possible, but the scripts also need to be properly annotated, or at the very least manually graded. For our own experiments, we currently make use of data drawn from the CLC FCE dataset, a set of 1,244 exam scripts written by candidates sitting the Cambridge ESOL First Certificate in English (FCE) examination in 2000 and 2001, made available by Cambridge University Press; see (Yannakoudakis et al., 2011). The CLC dataset is divided into training and test sets: the training set consists of 1,141 scripts from the year 2000 written by 1,141 distinct learners, and the test set consists of 97 scripts from the year 2001 written by 97 distinct learners. The learners' ages follow a bimodal distribution with peaks at approximately 16-20 and 26-30 years of age. Yannakoudakis et al. claim that there is no overlap between the prompts used in 2000 and in 2001. The scripts also carry some meta-data about the candidates' grades, native language and age. The First Certificate in English (FCE) exam's writing component consists of two tasks asking learners to write either a letter, a report, an article, a composition or a short story, between 200 and 400 words. Answers to each of these tasks are annotated with marks (in the range 1-40), and an overall mark is assigned to both tasks together. We do not make use of the individual task scores and use only the overall score, because (Yannakoudakis et al., 2011) also use just the overall score, which gives us a benchmark to compare our results against. Each script is additionally tagged with information about the linguistic errors committed, using a taxonomy of approximately 80 error types (Nicholls, 2003). An example is the following:

Thanks for <NS type="DD"><i>you</i><c>your</c></NS> letter.

The part of the text between <i> and </i> denotes the incorrect text, while the part between <c> and </c> denotes its correction.
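Since the error rate feature depends on these annotations, the following is a minimal sketch of how the number of annotated errors in a script could be extracted from this markup. The regular expression and function name are illustrative assumptions, not the exact code used in our system; nested or attribute-less <NS> annotations would need a proper XML parse rather than a regular expression.

import re

# Matches the opening tag of an error annotation; the type attribute identifies
# the error category in the taxonomy of roughly 80 error types (Nicholls, 2003).
NS_OPEN = re.compile(r'<NS type="([^"]+)">')

def count_errors(script_text):
    # Return the number of annotated errors and the list of their types.
    error_types = NS_OPEN.findall(script_text)
    return len(error_types), error_types

sentence = 'Thanks for <NS type="DD"><i>you</i><c>your</c></NS> letter.'
print(count_errors(sentence))   # (1, ['DD'])

An error rate can then be derived, for instance, by normalizing such counts by the length of the script.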
• 31. 30 | P a g e 8. Results

The following table contains the correlation values after adding the different features.

Table 1. Spearman's and Pearson's Correlation Values

Features                            Pearson's Correlation   Spearman's Rank Order Correlation
Word n-grams                        0.6005                  0.5967
+ PoS n-grams (tf-idf weights)      0.6053                  0.5982
+ PoS n-grams (counts as weights)   0.5679                  0.5612
+ Script length                     0.5685                  0.5622
+ Error Rate                        0.4247                  0.4247
+ Skeletons                         0.4904                  0.4904

Since we use the same dataset as Yannakoudakis et al. for benchmarking, we can compare our results with theirs. Our correlation coefficients have nearly the same values when we use only word n-grams as features. For the other features, some variation is to be expected, since they do not use the same PoS tagger, their error rate is calculated in a different manner, and they do not use the same structural features. From Table 1, we can see that our predictions deviate more and more from the human scores as more features are added. The best results are obtained when only using the
• 32. 31 | P a g e word and PoS n-grams as features. This is unexpected, but it might be because the training dataset is too similar to the test dataset; that might also explain why lexical n-grams alone can give a correlation as high as 0.6. Word n-grams include only unigrams and bigrams. Trigrams were tried but produced much worse results, which can be attributed to data sparseness; Yannakoudakis et al. also do not use word trigrams or any higher-order n-grams. If we use PoS n-gram counts instead of their tf-idf weights, the correlation decreases, which suggests that the tf-idf weighting scheme is useful, especially for n-gram features. Word n-grams weighted using the tf-idf scheme actually give better results when they are not normalized: using only word n-grams weighted with tf-idf results in a Pearson's correlation of 0.6220 and a Spearman's correlation of 0.6251. Among the structural features, we have used only skeletons here. The results were much worse with annotated skeletons and rewrite rules, so we decided to omit those features; since they encode all the information that skeletons do and more, keeping them alongside skeletons would also have been redundant. One possible reason why parse tree skeletons performed best is that, because they do not contain even the root label information, they are the least likely to suffer from the data sparseness problem. Table 2 presents Pearson's and Spearman's correlations between the CLC scores and our system when removing one feature at a time.
• 33. 32 | P a g e Table 2. Ablation tests showing the correlation between the CLC and the AES system

Features          Pearson's Correlation   Spearman's Rank Order Correlation
none              0.4904                  0.4904
Word n-grams      0.4956                  0.4924
Script length     0.4919                  0.4883
+ Error Rate      0.4959                  0.4928
+ Skeletons       0.4320                  0.4247
• 34. 33 | P a g e 9. Future Work

For the future of this project, along with the addition of more features, an important task is to obtain results on different datasets, to make sure that peculiarities of a single dataset do not influence the development of the AES system. Prompt-specific features also need to be added so that essays can eventually be graded without the need for human-annotated scripts at all. Like some commercial essay scoring engines, our software might then be able to mine information relevant to the prompt and use it to grade essays on that topic. This work could also be turned into a free web application after some improvements; it would be interesting from a research point of view to observe how the software performs on real student essays.
• 35. 34 | P a g e 10. Conclusion

Automated Essay Scoring is an interesting area to work on, with a lot of scope for improvement and innovation. Much still needs to be done to bring it into the mainstream and gain widespread adoption. Done reliably, it would go a long way towards not only reducing manual work but also improving teaching: teachers would no longer need to weigh the grading burden when deciding whether to give essay-writing tasks to their students. We have built a proof-of-concept prototype essay scoring system. A lot remains to be done to make it as reliable and functional as some of the commercially available options, which have been around for roughly 40 years, but this is an encouraging start. Our results do not beat the best systems in the field, but we provide an open-source solution on which future research can be founded.
• 36. 35 | P a g e References

Automated Essay Scoring (http://en.wikipedia.org/wiki/Automated_essay_scoring)

Ellis Batten Page (http://en.wikipedia.org/wiki/Ellis_Batten_Page)

Yang Yongwei, Chad W. Buckendahl, Piotr J. Juskiewicz and Dennison S. Bhola (2002). "A review of Strategies for Validating Computer-Automated Scoring". Applied Measurement in Education.

Ajay, H. B., Tillett, P. I., & Page, E. B. (1973). Analysis of essays by computer (AEC-II) (No. 8-0102). Washington, DC: U.S. Department of Health, Education, and Welfare, Office of Education, National Center for Educational Research and Development.

Handbook of Automated Essay Evaluation: Current Applications and New Directions. Edited by Mark D. Shermis and Jill Burstein.

Page, E. B., & Petersen, N. S. (1995). The computer moves into essay grading: Updating the ancient test. Phi Delta Kappan.

Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to latent semantic analysis. Discourse Processes.

Attali, Y., & Burstein, J. (2006). Automated Essay Scoring With e-rater V.2. Journal of Technology, Learning, and Assessment.

Breland, H. M., Jones, R. J., & Jenkins, L. (1994). The College Board vocabulary study (College Board Report No. 94-4; Educational Testing Service Research Report No. 94-26). New York: College Entrance Examination Board.

Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18, 613-620.

Landauer, T. K., Laham, D., & Foltz, P. W. (2003). Automated scoring and annotation of essays with the Intelligent Essay Assessor. In M. D. Shermis & J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 87-112). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.

Elliot, S. (2003). Intellimetric: From here to validity. In M. D. Shermis & J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 71-86). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.

Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, 19-24 June 2011.
• 37. 36 | P a g e D. Nicholls. 2003. The Cambridge Learner Corpus: Error coding and analysis for lexicography and ELT. In Proceedings of the Corpus Linguistics 2003 conference, pages 572-581.

Ziheng Lin, Hwee Tou Ng and Min-Yen Kan (2011). Automatically Evaluating Text Coherence Using Discourse Relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), Portland, Oregon, USA, June.

Alphabetical list of part-of-speech tags used in the Penn Treebank Project (http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)

Yoav Cohen, Anat Ben-Simon and Myra Hovav (2003). The Effect of Specific Language Features on the Complexity of Systems for Automated Essay Scoring. IAEA 29th Annual Conference, Manchester, UK.

Sean Massung, ChengXiang Zhai and Julia Hockenmaier (2013). Structural Parse Tree Features for Text Representation. 2013 IEEE Seventh International Conference on Semantic Computing.

S. Kim, H. Kim, T. Weninger, J. Han, and H. D. Kim (2011). Authorship classification: a discriminative syntactic tree mining approach. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '11), New York, NY, USA: ACM, pp. 455-464. Available: http://doi.acm.org/10.1145/2009916.2009979

Daniel Jurafsky & James H. Martin. Speech and Language Processing: An introduction to natural language processing, computational linguistics, and speech recognition.

John Judge, Aoife Cahill and Josef van Genabith. QuestionBank: Creating a Corpus of Parse-Annotated Questions.

CYK Algorithm (http://en.wikipedia.org/wiki/CYK_algorithm)

Mark D. Shermis and Ben Hammer. Contrasting State-of-the-Art Automated Scoring of Essays: Analysis.
• 38. 37 | P a g e Appendices

Appendix-A: List of Part-of-Speech Tags Used

The following is an alphabetical list of the part-of-speech tags used in the Penn Treebank Project:

Number  Tag    Description
1.      CC     Coordinating conjunction
2.      CD     Cardinal number
3.      DT     Determiner
4.      EX     Existential there
5.      FW     Foreign word
6.      IN     Preposition or subordinating conjunction
7.      JJ     Adjective
8.      JJR    Adjective, comparative
9.      JJS    Adjective, superlative
10.     LS     List item marker
11.     MD     Modal
12.     NN     Noun, singular or mass
13.     NNS    Noun, plural
14.     NNP    Proper noun, singular
15.     NNPS   Proper noun, plural
16.     PDT    Predeterminer
17.     POS    Possessive ending
18.     PRP    Personal pronoun
19.     PRP$   Possessive pronoun
20.     RB     Adverb
21.     RBR    Adverb, comparative
22.     RBS    Adverb, superlative
23.     RP     Particle
24.     SYM    Symbol
25.     TO     to
26.     UH     Interjection
27.     VB     Verb, base form
28.     VBD    Verb, past tense
29.     VBG    Verb, gerund or present participle
30.     VBN    Verb, past participle
• 39. 38 | P a g e
31.     VBP    Verb, non-3rd person singular present
32.     VBZ    Verb, 3rd person singular present
33.     WDT    Wh-determiner
34.     WP     Wh-pronoun
35.     WP$    Possessive wh-pronoun
36.     WRB    Wh-adverb
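For illustration, the snippet below shows one way of producing these Penn Treebank tags for a sentence using NLTK. This is purely illustrative; the tagger actually used in our system may differ, and the NLTK data packages named in the comment must be downloaded beforehand.

import nltk

# Requires: nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
sentence = "Thanks for your letter."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# e.g. [('Thanks', 'NNS'), ('for', 'IN'), ('your', 'PRP$'), ('letter', 'NN'), ('.', '.')]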