This document appears to be a dissertation proposal for a project on automated essay scoring. It discusses the goal of building a state-of-the-art essay scoring system and evaluating it on a benchmark dataset. The introduction provides background on automated essay scoring and discusses using machine learning approaches to analyze essay features and relate them to human scores. It also discusses evaluating an AES system by comparing its scores to those from human graders using agreement metrics. The proposal acknowledges some criticism of AES but argues the technology can still be improved.
IRJET- Automated Essay Evaluation using Natural Language Processing (IRJET Journal)
This document discusses research on automated essay evaluation using natural language processing. It provides background on previous systems for automated essay scoring, such as Project Essay Grader (PEG) from the 1960s and more recent systems like e-Rater, IntelliMetric, and Intelligent Essay Assessor. The researchers extracted features from essays, such as word count, sentence count, spelling, and part-of-speech counts, to train machine learning models. They achieved correlation scores between 0.86 and 0.87 when comparing predicted scores to human scores, showing the models can perform at reliability levels similar to human graders. The researchers conclude the models could be improved by incorporating features like parse trees and accounting for different essay prompts.
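Surface features of the kind listed above are straightforward to compute with basic tooling; a minimal sketch (the feature names and the example essay below are illustrative, not the paper's actual feature set):

```python
# Minimal sketch of surface-feature extraction for essay scoring.
# Feature names are illustrative; the paper's real feature set differs.
import re

def essay_features(text):
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "avg_word_len": sum(map(len, words)) / max(len(words), 1),
        "avg_sentence_len": len(words) / max(len(sentences), 1),
    }

feats = essay_features("This is an essay. It has two sentences.")
print(feats["word_count"], feats["sentence_count"])  # → 8 2
```

In a full system, feature vectors like this would be fed to a regression model fit against human scores.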
IRJET - Online Assignment Plagiarism Checking using Data Mining and NLP (IRJET Journal)
This document presents a proposed system for detecting plagiarism in student assignments submitted online. The system would use data mining algorithms and natural language processing to compare submitted assignments against each other and identify plagiarized content. It would analyze assignments at both the syntactic and semantic levels. The proposed system is intended to more efficiently and accurately detect plagiarism compared to teachers manually reviewing all submissions. The document describes the workflow of the system, including preprocessing of assignments, text analysis, similarity measurement, and algorithms that would be used like Rabin-Karp, KMP and SCAM.
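Of the matching algorithms named, Rabin-Karp is the simplest to sketch; a minimal rolling-hash substring search (illustrative only, not the proposed system's code):

```python
# Rabin-Karp substring search via a rolling hash -- one of the matching
# algorithms the proposed plagiarism checker mentions. Sketch only.
def rabin_karp(pattern, text, base=256, mod=10**9 + 7):
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)          # weight of the leading character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    hits = []
    for i in range(n - m + 1):
        # Verify on hash match to rule out collisions.
        if p_hash == t_hash and text[i:i + m] == pattern:
            hits.append(i)
        if i < n - m:                     # slide the window one character
            t_hash = ((t_hash - ord(text[i]) * high) * base
                      + ord(text[i + m])) % mod
    return hits

print(rabin_karp("copied", "this copied text was copied"))  # → [5, 21]
```

A plagiarism checker would run this over shared phrase windows between pairs of submissions rather than single patterns.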
IRJET- Modeling Student’s Vocabulary Knowledge with Natural (IRJET Journal)
This document discusses using natural language processing (NLP) tools to model students' vocabulary knowledge based on the lexical properties of their essays. The study analyzed essays from college sophomores and high school students. NLP tool TAALES was used to calculate 135 lexical indices from the essays. Correlations found that two indices accounted for 44% of the variance in sophomores' vocabulary scores, and were also predictive of high school students' scores. The results suggest NLP can inform "stealth assessments" to improve student models in computer-based learning environments.
This document discusses using automatic text analysis techniques to streamline the process of multi-dimensional analysis of collaborative learning discussions. It describes a tool called TagHelper that was evaluated against a hand-coded corpus with a 7-dimensional coding scheme. TagHelper achieved a Cohen's Kappa agreement of over 0.7 for 6 of the 7 dimensions when considering only the text segments it was most confident about, and was confident in its coding for at least 88% of the corpus for 5 of those dimensions. The document motivates the need for such automatic analysis to reduce the time and effort required for manual coding of collaborative learning data.
Deep learning based Arabic short answer grading in serious games (IJECEIAES)
Automatic short answer grading (ASAG) has become a recognized natural language processing problem. Modern ASAG systems start with natural language preprocessing and end with grading. Researchers have experimented with machine learning in the preprocessing stage and with deep learning techniques for automatic grading in English. However, little research is available on automatic grading for Arabic, and datasets, which are important to ASAG, are scarce in Arabic. In this research, we collected a set of questions, answers, and associated grades in Arabic and made this dataset publicly available. We extended to Arabic the solutions used for English ASAG and tested how automatic grading works on Arabic answers provided by schoolchildren in 6th grade in the context of serious games. We found that the schoolchildren provided answers that are 5.6 words long on average. On such answers, deep learning-based grading achieved high accuracy even with limited training data. We tested three different recurrent neural networks for grading, and with a transformer we achieved an accuracy of 95.67%. ASAG for schoolchildren will help detect children with learning problems early; when detected early, teachers can address learning problems more easily. This is the main purpose of this research.
The sarcasm detection with the method of logistic regression (EditorIJAERD)
The document discusses sarcasm detection using logistic regression. It compares the performance of logistic regression and SVM classification for sarcasm detection. Logistic regression achieved higher accuracy of 93.5% for sarcasm detection, with lower execution time compared to SVM classification. The proposed approach uses data preprocessing, feature extraction using N-grams, and trains a logistic regression classifier on a manually labeled dataset to classify text as sarcastic or non-sarcastic. Accuracy and execution time analysis shows logistic regression performs better than SVM for this task.
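The described pipeline (n-gram features feeding a logistic-regression classifier) can be sketched on toy data; the texts and labels below are invented, and a real system would use the paper's manually labeled dataset and a library such as scikit-learn:

```python
# Toy sketch: word-bigram features plus a logistic-regression classifier
# trained by stochastic gradient descent. Data and labels are invented.
import math

def ngrams(text, n=2):
    toks = text.lower().split()
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def featurize(texts):
    vocab = sorted({g for t in texts for g in ngrams(t)})
    index = {g: i for i, g in enumerate(vocab)}
    rows = []
    for t in texts:
        row = [0.0] * len(vocab)
        for g in ngrams(t):
            row[index[g]] += 1.0
        rows.append(row)
    return rows, index

def train(X, y, lr=0.5, epochs=200):
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            g = 1 / (1 + math.exp(-z)) - yi          # sigmoid error
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

texts = ["oh great another monday", "what a great day",
         "oh great more work", "a truly great day"]
labels = [1, 0, 1, 0]                                # 1 = sarcastic (toy)
X, index = featurize(texts)
w, b = train(X, labels)
p = 1 / (1 + math.exp(-(sum(wj * xj for wj, xj in zip(w, X[0])) + b)))
print(p > 0.5)  # → True
```

The same shape scales up directly with a larger vocabulary and corpus; scikit-learn's `CountVectorizer` plus `LogisticRegression` is the usual production route.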
Ace Maths Solutions Unit Five Reading: Exercises on Teaching Data Handling (w... (PiLNAfrica)
The solutions unit consists of the following:
General points for discussion relating to the teaching of the mathematical content in the activities.
Step-by-step mathematical solutions to the activities.
Annotations to the solutions to assist teachers in their understanding the maths as well as teaching issues relating to the mathematical content represented in the activities.
Suggestions of links to alternative activities for the teaching of the mathematical content represented in the activities.
Dynamic Question Answer Generator: An Enhanced Approach to Question Generation (ijtsrd)
Teachers and educational institutions seek new questions with different difficulty levels for setting up tests for their students. Students likewise want fresh, distinct questions to practice with, as redundant questions are found everywhere. However, setting new questions every time is a tedious task for teachers. To overcome this, we have built an artificially intelligent system that generates questions and answers for the mathematical topic of quadratic equations. The system uses (i) a randomization technique to generate unique questions each time and (ii) first-order logic and automated deduction to produce a solution for each generated question. The goal was achieved and the system works efficiently; it is robust, reliable, and helpful for teachers, students, and other organizations retrieving quadratic-equations questions hassle-free. Rahul Bhatia | Vishakha Gautam | Yash Kumar | Ankush Garg, "Dynamic Question Answer Generator: An Enhanced Approach to Question Generation", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-4, June 2019, URL: https://www.ijtsrd.com/papers/ijtsrd23730.pdf
Paper URL: https://www.ijtsrd.com/computer-science/artificial-intelligence/23730/dynamic-question-answer-generator-an-enhanced-approach-to-question-generation/rahul-bhatia
This document proposes a model to estimate overall sentiment score by applying rules of inference from discrete mathematics. It discusses sentiment analysis and related work using techniques like supervised/unsupervised learning. The problem is identifying sentiment components and restricting patterns for feature identification. Most approaches focus on nouns/adjectives but not verbs/adverbs. The model preprocesses product review datasets using NLTK for stemming, parsing and tokenizing. It builds a lexicon dictionary of positive and negative words. The Lexical Pattern Sentiment Analysis algorithm uses both lexicon and pattern mining - it selects sentence patterns, checks for positive/negative words in the lexicon, and calculates an overall sentiment score.
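The lexicon step can be illustrated with a toy dictionary; the word lists and normalization below are ours, not the paper's lexicon or its exact scoring rule:

```python
# Sketch of a lexicon-based sentiment score: count positive and negative
# lexicon hits among the tokens and normalize. Toy lexicon, for
# illustration only.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}

def sentiment_score(sentence):
    tokens = sentence.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = pos + neg
    # Score in [-1, 1]; 0 when no lexicon word appears.
    return 0.0 if total == 0 else (pos - neg) / total

print(sentiment_score("great phone but terrible battery"))  # → 0.0
print(sentiment_score("excellent build, love it"))          # → 1.0
```

The paper's pattern-mining step would additionally restrict which sentence patterns contribute, rather than counting every token uniformly.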
Analyzing Sentiment Of Movie Reviews In Bangla By Applying Machine Learning T... (Andrew Parish)
This document summarizes a research paper that analyzed sentiment of movie reviews written in Bangla using machine learning techniques. The researchers collected a dataset of over 4,000 Bangla movie reviews labeled as positive or negative. Using this dataset, they tested support vector machine and long short-term memory models, achieving 88.9% and 82.42% accuracy respectively. The paper also reviewed other prior work on Bangla sentiment analysis and compared different machine learning methods.
This document provides details about a student project to develop a fiction authoring tool. It outlines the problem statement of assisting fiction authors in planning and writing stories individually and collaboratively. The students conducted various empirical studies and analyses of existing tools to understand the fiction writing process. They arrived at a solution approach of first developing a single-user fiction authoring tool before adding collaborative features. The document describes the technology and platforms to be used, which is a web-based tool designed for both desktops and mobile devices.
Open domain Question Answering System - Research project in NLP (GVS Chaitanya)
Using a computer to answer questions has been a human dream since the beginning of the digital era. A first step toward such an ambitious goal is handling natural language so that the computer can understand what its user asks. The discipline that studies the connection between natural language and the representation of its meaning via computational models is computational linguistics. Within that discipline, question answering can be defined as the task of finding one or more concise answers to a question formulated in natural language. Improvements in technology and the explosive demand for better information access have reignited interest in Q&A systems. The wealth of information on the web makes it an attractive resource for seeking quick answers to factual questions such as “Who is the first American to land in space?” or “What is the second tallest mountain in the world?”, yet today’s most advanced web search systems (Bing, Google, Yahoo) make it surprisingly tedious to locate the answers. A Q&A system aims to develop techniques that go beyond retrieval of relevant documents in order to return exact answers to natural-language factoid questions.
The document presents a method for automatically evaluating handwritten student essays using longest common subsequence. It involves three phases: 1) constructing reference material from multiple study materials by comparing sentences semantically, 2) extracting text from scanned handwritten essays using image processing techniques, and 3) grading essays by comparing the extracted text to the reference material, using longest common subsequence to calculate a score and assign a grade. The method aims to evaluate essays automatically in a way similar to human evaluation and explores the accuracy and effectiveness of using longest common subsequence for semantic comparison between texts.
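The core comparison, longest common subsequence over tokens, can be sketched as follows; normalizing the LCS length into a grade is our assumption, not necessarily the paper's exact formula:

```python
# Longest common subsequence length via dynamic programming (one row of
# the DP table kept at a time), then a normalized grade in [0, 1].
def lcs_len(a, b):
    prev = [0] * (len(b) + 1)            # prev[j] = LCS of a[:i-1], b[:j]
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def lcs_grade(answer_tokens, reference_tokens):
    # Normalization against reference length is our choice here.
    return lcs_len(answer_tokens, reference_tokens) / max(len(reference_tokens), 1)

ref = "water boils at one hundred degrees".split()
ans = "water boils at hundred degrees".split()
print(round(lcs_grade(ans, ref), 2))  # → 0.83
```

A grading system would map ranges of this score onto letter grades against the constructed reference material.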
This document provides an overview and requirements for the Stat project, an open source machine learning framework for text analysis. It describes the background, motivation, scope, and stakeholders of the project. Key requirements for the framework include being simplified, reusable, and providing built-in capabilities to naturally support text representation and processing tasks.
Statistics 695A: Machine Learning, Fall 2004 (butest)
This document provides information about the Statistics 695A: Machine Learning course for Fall 2004 including:
- The course will cover machine learning theory, methods, algorithms and applications. Students will learn to develop machine learning tools and apply them to data sets.
- Student responsibilities include presenting a research paper, conducting a machine learning study on a data set, and writing an 8-page project paper reporting the results.
- Proposed instructor topics are perceptrons, local learning, Bayesian learning, Bayesian networks, visualization, and learning theory. Students will read about these topics on the course website.
- The instructor, William S. Cleveland, is a professor of statistics and computer science who researches machine learning,
This document summarizes a thesis submitted by Melaku Tilahun Asress for a Master's degree in computer science. The thesis describes the design and implementation of an automatic spelling checker for the Amharic language. The spelling checker uses a morphological analyzer to detect and correct spelling errors, including errors due to internal inflection and duplication of Amharic words. The system was evaluated using text from various reports and achieved an overall performance of 97.27% based on precision and recall metrics. Areas for further improvement include detecting real word errors and comparing the spelling correction algorithm to other techniques.
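The reported overall performance combines precision and recall; for reference, the standard definitions, here with invented error counts:

```python
# Precision, recall, and their harmonic mean (F1) -- the kind of metrics
# behind the thesis's reported overall performance. Counts are invented.
def prf(true_pos, false_pos, false_neg):
    precision = true_pos / (true_pos + false_pos)  # flagged errors that are real
    recall = true_pos / (true_pos + false_neg)     # real errors that got flagged
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = prf(true_pos=90, false_pos=5, false_neg=10)
print(round(p, 3), round(r, 3), round(f1, 3))  # → 0.947 0.9 0.923
```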
The document discusses the development of an online instrument called DIM (Digital Information Skills Measurement) to measure students' information skills. DIM aims to combine multiple measurement methods into one online tool to provide a clear picture of students' information skills during an entire search process using the actual Internet. An initial study was conducted using DIM with 84 students, and results were compared to think-aloud protocols. DIM showed potential but also room for improvement, such as providing more context for student evaluations during searches. Further validation steps are outlined.
Feature Analysis for Affect Recognition Supporting Task Sequencing in Adaptiv... (janningr)
Traditionally, task sequencing in adaptive intelligent tutoring systems requires information gained from expert and domain knowledge as well as information about former performances. In earlier work, a new efficient task sequencer based on a performance prediction system was presented, which needs only former performance information and not the expensive expert and domain knowledge. This task sequencer uses the output of the performance prediction to sequence tasks according to the theory of Vygotsky’s Zone of Proximal Development. In this presentation we aim to support this sequencer with a further, automatically obtainable information source: speech input from students interacting with the tutoring system. The proposed approach extracts features from students’ speech data and applies an automatic affect recognition method to those features. The output of the affect recognition method indicates whether the last task was too easy, too hard, or appropriate for the student. In this presentation we (1) propose a new approach for supporting task sequencing by affect recognition, (2) present an analysis of appropriate features for affect recognition extracted from students’ speech input, and (3) show the suitability of the proposed features for supporting task sequencing in adaptive intelligent tutoring systems.
This document presents a system for detecting semantically similar questions in online forums like Quora to reduce duplicate content. It proposes using natural language processing techniques like tagging questions with keywords, vectorizing text with Google News vectors, and calculating similarity with Word Mover's Distance. The system cleans and preprocesses questions before generating tags and calculating similarity between questions to identify duplicates. An evaluation of the system achieved accurate detection of matching and non-matching question pairs.
A scoring rubric for automatic short answer grading system (TELKOMNIKA JOURNAL)
During the past decades, research on automatic grading has become an interesting issue. These studies focus on how to enable machines to help humans assess students’ learning outcomes. Automatic grading enables teachers to assess students’ answers more objectively, consistently, and quickly. The essay model has two different types: long essay and short answer. Most previous research developed automatic essay grading (AEG) rather than automatic short answer grading (ASAG). This study aims to assess the similarity of short answers to questions and reference answers in Indonesian without any semantic language tool. The research uses pre-processing steps consisting of case folding, tokenization, stemming, and stopword removal. The proposed approach is a scoring rubric obtained by measuring sentence similarity using string-based similarity methods and a keyword matching process. The dataset used in this study consists of 7 questions, 34 alternative reference answers, and 224 student answers. The experimental results show that the proposed approach achieves a Pearson correlation between 0.65419 and 0.66383, with a Mean Absolute Error (MAE) between 0.94994 and 1.24295. The proposed approach also raises the correlation value and decreases the error value for each method.
Sentiment analysis is an important current research area, and the demand for sentiment analysis and classification is growing day by day. This paper presents a novel method to classify Urdu documents, as no prior work on sentiment classification for Urdu text had been recorded. We consider the problem of determining whether a review or sentence is positive, negative, or neutral. For this purpose we use two machine learning methods, Naïve Bayes and Support Vector Machines (SVM). First the documents are preprocessed and the sentiment features are extracted; then the polarity is calculated, judged, and classified using the machine learning methods.
Sentiment Analysis in Social Media and Its Operations (IRJET Journal)
This document summarizes a literature review on sentiment analysis in social media. It explores the styles, platforms, and applications of sentiment analysis. Most papers used either a dictionary-based approach or machine learning approach to analyze sentiment in social media text, with some combining both. Twitter was the most common social media platform used to collect data due to its large volume of public posts. Sentiment analysis has been applied in various domains including business, politics, health, and tracking world events. It can provide valuable insights for organizations and help improve products, services, and decision making.
This research proposal outlines an experimental study that will investigate the effects of reading electronic text on screen versus printed text, as well as the impact of different types of electronic text formatting, on undergraduate students' reading comprehension. Students will read an academic text either on screen or in print and complete a comprehension test. The on-screen text will be presented with variations in layout, font, section division, and consistency to determine if these factors influence comprehension. The results could help inform practices in online and traditional classes and support further research on digital literacy.
Identifying e learner’s opinion using automated sentiment analysis in e-learning (eSAT Publishing House)
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching, and research in the fields of Engineering and Technology. We bring together scientists, academicians, field engineers, scholars, and students of related fields of Engineering and Technology.
Paper Mate Write Bros Ballpoint Pens, Medium P (Richard Hogue)
The passage discusses the positive and negative impacts of social media and mass media on adolescents. It notes that teens spend up to 11 hours per day on social media and are exposed to media via electronics. Social media influences how teens dress, act, talk and what they discuss. While mass media can shape adolescent minds and ideas, it remains unclear whether the overall impact is positive or negative. The passage intends to explore this issue by analyzing social media's impacts through sociological lenses of identity and groupthink.
Writing Phrases Best Essay Writing Service, Essay Writ (Richard Hogue)
The document discusses the social and environmental impacts of vertical integration in the banana export trade in Honduras. It explains that in the late 1800s, Honduras was the largest banana exporter to the US. Over time, banana production gradually transitioned from small, individual farms to large monopolies controlled by three major companies. This led to significant social and environmental changes in Honduras as the companies consolidated land and power over banana production and export.
The document presents a method for automatically evaluating handwritten student essays using longest common subsequence. It involves 3 phases: 1) constructing a reference material from multiple study materials by comparing sentences semantically, 2) extracting text from scanned handwritten essays using image processing techniques, and 3) grading essays by comparing the extracted text to the reference material using longest common subsequence to calculate a score and assign a grade. The method aims to automatically evaluate essays similar to human evaluation and explores the accuracy and effectiveness of using longest common subsequence for semantic comparison between texts.
This document provides an overview and requirements for the Stat project, an open source machine learning framework for text analysis. It describes the background, motivation, scope, and stakeholders of the project. Key requirements for the framework include being simplified, reusable, and providing built-in capabilities to naturally support text representation and processing tasks.
Statistics 695A: Machine Learning, Fall 2004butest
This document provides information about the Statistics 695A: Machine Learning course for Fall 2004 including:
- The course will cover machine learning theory, methods, algorithms and applications. Students will learn to develop machine learning tools and apply them to data sets.
- Student responsibilities include presenting a research paper, conducting a machine learning study on a data set, and writing an 8-page project paper reporting the results.
- Proposed instructor topics are perceptrons, local learning, Bayesian learning, Bayesian networks, visualization, and learning theory. Students will read about these topics on the course website.
- The instructor, William S. Cleveland, is a professor of statistics and computer science who researches machine learning,
Statistics 695A: Machine Learning, Fall 2004butest
This document provides information about the Statistics 695A: Machine Learning course for Fall 2004 including:
- The course will cover machine learning theory, methods, algorithms and applications. Students will learn to develop machine learning tools and apply them to data.
- Student responsibilities include presenting a research paper, conducting a machine learning study on data and writing a project paper.
- Proposed instructor topics are perceptrons, local learning, Bayesian learning, Bayesian networks and visualization. Students will read about these topics on the course website.
This document summarizes a thesis submitted by Melaku Tilahun Asress for a Master's degree in computer science. The thesis describes the design and implementation of an automatic spelling checker for the Amharic language. The spelling checker uses a morphological analyzer to detect and correct spelling errors, including errors due to internal inflection and duplication of Amharic words. The system was evaluated using text from various reports and achieved an overall performance of 97.27% based on precision and recall metrics. Areas for further improvement include detecting real word errors and comparing the spelling correction algorithm to other techniques.
The document discusses the development of an online instrument called DIM (Digital Information Skills Measurement) to measure students' information skills. DIM aims to combine multiple measurement methods into one online tool to provide a clear picture of students' information skills during an entire search process using the actual Internet. An initial study was conducted using DIM with 84 students, and results were compared to think-aloud protocols. DIM showed potential but also room for improvement, such as providing more context for student evaluations during searches. Further validation steps are outlined.
The document discusses the development of an online instrument called DIM (Digital Information Skills Measurement) to measure students' digital information skills. DIM aims to combine different measurement methods into one online tool to provide a clear picture of students' information skills during the whole search process using the actual Internet. An initial study was conducted with 84 students to validate DIM by comparing its results to think-aloud protocols. The study found that DIM can provide insight into how students search for and evaluate information online, though further validation is still needed to ensure the instrument is sensitive enough.
Feature Analysis for Affect Recognition Supporting Task Sequencing in Adaptiv...janningr
Originally, the task sequencing in adaptive intelligent tutoring systems needs information gained from expert and domain knowledge as well as information about former performances. In a former work a new efficient task sequencer based on a performance prediction system was presented, which only needs former performance information but not the expensive expert and domain knowledge. This task sequencer uses the output of the performance prediction to sequence the tasks according to the theory of Vygotsky’s Zone of Proximal Development. In this presentation we aim to support this sequencer by a further automatically to gain information source, namely speech input from the students interacting with the tutoring system. The proposed approach extracts features from students speech data and applies to that features an automatic affect recognition method. The output of the affect recognition method indicates, if the last task was too easy, too hard or appropriate for the student. In this presentation we (1) propose a new approach for supporting task sequencing by affect recognition, (2) present an analysis of appropriate features for affect recognition extracted from students speech input and (3) show the suitability of the proposed features for affect recognition for supporting task sequencing in adaptive intelligent tutoring systems.
This document presents a system for detecting semantically similar questions in online forums like Quora to reduce duplicate content. It proposes using natural language processing techniques like tagging questions with keywords, vectorizing text with Google News vectors, and calculating similarity with Word Mover's Distance. The system cleans and preprocesses questions before generating tags and calculating similarity between questions to identify duplicates. An evaluation of the system achieved accurate detection of matching and non-matching question pairs.
A scoring rubric for automatic short answer grading systemTELKOMNIKA JOURNAL
During the past decades, researches about automatic grading have become an interesting issue. These studies focuses on how to make machines are able to help human on assessing students’ learning outcomes. Automatic grading enables teachers to assess student's answers with more objective, consistent, and faster. Especially for essay model, it has two different types, i.e. long essay and short answer. Almost of the previous researches merely developed automatic essay grading (AEG) instead of automatic short answer grading (ASAG). This study aims to assess the sentence similarity of short answer to the questions and answers in Indonesian without any language semantic's tool. This research uses pre-processing steps consisting of case folding, tokenization, stemming, and stopword removal. The proposed approach is a scoring rubric obtained by measuring the similarity of sentences using the string-based similarity methods and the keyword matching process. The dataset used in this study consists of 7 questions, 34 alternative reference answers and 224 student’s answers. The experiment results show that the proposed approach is able to achieve a correlation value between 0.65419 up to 0.66383 at Pearson's correlation, with Mean Absolute Error (푀퐴퐸) value about 0.94994 until 1.24295. The proposed approach also leverages the correlation value and decreases the error value in each method.
Sentiment analysis is an important current research area. The demand for sentiment analysis and classification is growing day by day; this paper presents a novel method to classify Urdu documents as previously no work recorded on sentiment classification for Urdu text. We consider the problem by determining whether the review or sentence is positive, negative or neutral. For the purpose we use two machine learning methods Naïve Bayes and Support Vector Machines (SVM) . Firstly the documents are preprocessed and the sentiments features are extracted, then the polarity has been calculated, judged and classify through Machine learning methods.
Sentiment Analysis in Social Media and Its OperationsIRJET Journal
This document summarizes a literature review on sentiment analysis in social media. It explores the styles, platforms, and applications of sentiment analysis. Most papers used either a dictionary-based approach or machine learning approach to analyze sentiment in social media text, with some combining both. Twitter was the most common social media platform used to collect data due to its large volume of public posts. Sentiment analysis has been applied in various domains including business, politics, health, and tracking world events. It can provide valuable insights for organizations and help improve products, services, and decision making.
This research proposal outlines an experimental study that will investigate the effects of reading electronic text on screen versus printed text, as well as the impact of different types of electronic text formatting, on undergraduate students' reading comprehension. Students will read an academic text either on screen or in print and complete a comprehension test. The on-screen text will be presented with variations in layout, font, section division, and consistency to determine if these factors influence comprehension. The results could help inform practices in online and traditional classes and support further research on digital literacy.
Identifying e learner’s opinion using automated sentiment analysis in e-learningeSAT Publishing House
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
Paper Mate Write Bros Ballpoint Pens, Medium PRichard Hogue
The passage discusses the positive and negative impacts of social media and mass media on adolescents. It notes that teens spend up to 11 hours per day on social media and are exposed to media via electronics. Social media influences how teens dress, act, talk and what they discuss. While mass media can shape adolescent minds and ideas, it remains unclear whether the overall impact is positive or negative. The passage intends to explore this issue by analyzing social media's impacts through sociological lenses of identity and groupthink.
Writing Phrases Best Essay Writing Service, Essay WritRichard Hogue
The document discusses the social and environmental impacts of vertical integration in the banana export trade in Honduras. It explains that in the late 1800s, Honduras was the largest banana exporter to the US. Over time, banana production gradually transitioned from small, individual farms to large monopolies controlled by three major companies. This led to significant social and environmental changes in Honduras as the companies consolidated land and power over banana production and export.
Examples How To Write A Persuasive Essay - AckerRichard Hogue
1. The study examined how monolingual and bilingual infants develop associative word learning abilities between 12-14 months of age.
2. Infants were tested using a looking-while-listening preferential looking paradigm to assess whether they could learn to associate novel words with novel objects.
3. The results showed that both monolingual and bilingual infants were able to learn novel word-object associations at 14 months of age, but bilingual infants did not demonstrate this ability at 12 months of age like monolingual infants did.
Netflix is expanding its global operations to over 200 countries in the next two years to increase its international revenue. Currently international markets make up 27% of Netflix's revenue but the company expects this to grow to 80% of total revenue. Netflix faces challenges expanding globally due to competition and regulatory issues. Strategic planning will be required to ensure Netflix's global expansion is successful and allows it to maintain its competitive edge in the streaming industry.
Best Tips On How To Write A Term Paper Outline, FormRichard Hogue
The document provides instructions for requesting writing assistance from HelpWriting.net. It outlines a 5-step process: 1) Create an account with a password and email; 2) Complete a 10-minute order form with instructions, sources, and deadline; 3) Review bids from writers and choose one based on qualifications; 4) Review the completed paper and authorize payment if satisfied; 5) Request revisions until fully satisfied, with a refund option for plagiarized work. The document promises original, high-quality content and full satisfaction of needs.
Formal Letter In English For Your Needs - Letter TemplRichard Hogue
The document provides instructions for requesting an assignment writing from HelpWriting.net. It outlines a 5 step process: 1) Create an account, 2) Complete an order form providing instructions and deadline, 3) Review bids from writers and select one, 4) Review the completed paper and authorize payment, 5) Request revisions until satisfied. It emphasizes the site's commitment to original, high-quality content and offering refunds for plagiarized work.
Get Essay Help You Can Get Essays Written For You ByRichard Hogue
The document outlines a 5-step process for getting essay help from HelpWriting.net, which includes creating an account, submitting a request with instructions and sources, choosing a bid from qualified writers, reviewing and authorizing payment for completed work, and requesting revisions if needed. It emphasizes that HelpWriting.net provides original, high-quality content and refunds plagiarized work to ensure customer satisfaction.
The passage describes the House and Ballroom scene, an intentional LGBTQ community founded by Latino and black people. Members formed houses that provided care and protection. Houses competed against each other at balls in categories like dance. Wealth, glamour, and status were important in ballroom culture. The scene offered acceptance and a comfort zone for LGBTQ people escaping unsupportive families.
Pin By Cindy Campbell On GrammarEnglish Language ERichard Hogue
1. The document outlines the steps to request a paper writing service from HelpWriting.net, including creating an account, completing an order form, reviewing writer bids, authorizing payment, and requesting revisions.
2. Writers utilize a bidding system, and customers can choose a writer based on qualifications, history, and feedback. Customers receive the paper, ensure it meets expectations, and authorize final payment.
3. HelpWriting.net promises original, high-quality content and offers refunds for plagiarism. Customers can request multiple revisions to ensure satisfaction.
How To Write Evaluation Paper. Self Evaluation EssRichard Hogue
This document provides steps for requesting and completing an assignment writing request through the HelpWriting.net platform:
1. Create an account with a valid email and password.
2. Complete a 10-minute order form providing instructions, sources, deadline, and attaching a sample of your writing if desired.
3. Review bids from writers and choose one based on qualifications, history, and feedback, then pay a deposit to start the assignment.
4. Review the completed paper and authorize final payment if pleased, or request revisions using the free revision policy.
Pumpkin Writing Page (Print Practice) - Made By TeachRichard Hogue
The document provides instructions for requesting writing assistance from HelpWriting.net. It outlines a 5-step process: 1) Create an account with a password and email. 2) Complete a 10-minute order form providing instructions, sources, and deadline. 3) Review bids from writers and select one based on qualifications. 4) Review the completed paper and authorize payment if satisfied. 5) Request revisions until fully satisfied, with a refund option for plagiarized work. The document promises original, high-quality content meeting all needs.
What Is The Best Way To Write An Essay - HazelNeRichard Hogue
The document discusses equal employment opportunity, affirmative action, and workforce diversity as key concepts in public human resources management in the United States. It notes that human resource managers must pay careful attention to these concepts and their underlying values. The abstract previews that the full document will cover cases, laws, philosophies, and values related to equal employment opportunity, affirmative action, and diversity, as well as their future implications.
The Importance Of Reading Books Free Essay ExampleRichard Hogue
The document provides instructions for requesting and completing an assignment writing request on the HelpWriting.net website. It outlines a 5-step process: 1) Create an account; 2) Complete an order form with instructions and deadline; 3) Review bids from writers and choose one; 4) Review the completed paper and authorize payment; 5) Request revisions until satisfied. The purpose is to help students obtain high-quality original content by writing their assignments.
Narrative Essay Personal Leadership Style EssayRichard Hogue
The document discusses working capital management at the Heavy Engineering Division of Larsen & Toubro Limited, including analyzing accounts receivable, accounts payable, and inventory management. It provides an overview of the business activities of Larsen & Toubro and its Heavy Engineering Division. The report compares the performance of Heavy Engineering Division to other Indian and foreign companies in key working capital elements.
Thesis Introduction Examples Examples - How To Write A TheRichard Hogue
The passage discusses customer privacy issues in the hospitality service industry. It notes that while technology helps businesses operate more efficiently, it also increases the risk of data breaches and privacy concerns. The hospitality industry in particular handles sensitive customer data like payment information. Several hotel chains have reported data breaches in recent years, highlighting the need for stronger privacy protections in this sector given the large amounts of financial data collected and stored.
Literature Review Thesis Statemen. Online assignment writing service.Richard Hogue
The document discusses the differences between bipolar type I and type II disorders. Bipolar type I is characterized by episodes of mania that last at least one week, along with depressive episodes. The research will focus specifically on bipolar type I in children and youth. Diagnostic criteria for bipolar disorders are outlined in the DSM-5.
008 Essay Writing Competitions In India CustRichard Hogue
Resistance art emerged in the mid-1970s in South Africa after the Soweto uprising. It focused on resisting apartheid and celebrating African strength and unity. Resistance art can represent different points of view and elicit emotional responses in viewers that can inspire social change or revolution. While art in the 1800s and 1900s also contained elements of resistance, it became more direct and overt in the mid-1970s, often led by activist artists focusing on political and social issues. Art has the power to engage people in different ways and reflect or drive social change.
A LEVEL SOCIOLOGY 20 MARK GENDER SOCUS. Online assignment writing service.Richard Hogue
Here are the key points required for a search warrant essay:
- A search warrant is required by the Fourth Amendment to the U.S. Constitution to conduct a search of a private property. This protects citizens from unreasonable searches and seizures by the government.
- To obtain a search warrant, the police must have probable cause that evidence of a crime will be found on the private property. They present sworn testimony and evidence to a judge who determines if probable cause exists.
- The search warrant must specify the location to be searched and the items that can be seized as evidence. This prevents general "fishing expeditions" by law enforcement.
- If the police conduct a search without a valid warrant, any evidence found may
Composition Writing Meaning. How To Write A DRichard Hogue
1. The document provides instructions for how to request and complete an assignment writing request on the HelpWriting.net website. It outlines a 5-step process: create an account, submit a request form with instructions and deadline, choose a bid from qualified writers, review and authorize payment for the completed assignment, and request revisions if needed.
2. The site uses a bidding system where writers submit bids to complete assignment requests. Customers can choose a writer based on qualifications, order history, and feedback to start working on the assignment.
3. HelpWriting.net promises original, high-quality content and offers refunds if work is plagiarized. Customers can request multiple revisions to ensure satisfaction.
Get Essay Writing Help At My Assignment Services By Our HighlyRichard Hogue
This document provides instructions for getting essay writing help from the website HelpWriting.net. It outlines a 5-step process: 1) Create an account with an email and password. 2) Complete an order form with instructions, sources, and deadline. 3) Review bids from writers and choose one. 4) Review the paper and authorize payment. 5) Request revisions until satisfied. It emphasizes that original, high-quality content is guaranteed or a full refund will be provided.
How to Setup Warehouse & Location in Odoo 17 InventoryCeline George
In this slide, we'll explore how to set up warehouses and locations in Odoo 17 Inventory. This will help us manage our stock effectively, track inventory levels, and streamline warehouse operations.
This presentation includes basic of PCOS their pathology and treatment and also Ayurveda correlation of PCOS and Ayurvedic line of treatment mentioned in classics.
How to Make a Field Mandatory in Odoo 17Celine George
In Odoo, making a field required can be done through both Python code and XML views. When you set the required attribute to True in Python code, it makes the field required across all views where it's used. Conversely, when you set the required attribute in XML views, it makes the field required only in the context of that particular view.
Executive Directors Chat Leveraging AI for Diversity, Equity, and InclusionTechSoup
Let’s explore the intersection of technology and equity in the final session of our DEI series. Discover how AI tools, like ChatGPT, can be used to support and enhance your nonprofit's DEI initiatives. Participants will gain insights into practical AI applications and get tips for leveraging technology to advance their DEI goals.
A workshop hosted by the South African Journal of Science aimed at postgraduate students and early career researchers with little or no experience in writing and publishing journal articles.
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...Dr. Vinod Kumar Kanvaria
Exploiting Artificial Intelligence for Empowering Researchers and Faculty,
International FDP on Fundamentals of Research in Social Sciences
at Integral University, Lucknow, 06.06.2024
By Dr. Vinod Kumar Kanvaria
Film vocab for eal 3 students: Australia the movie
Automated Essay Scoring
B. Comp. Dissertation
Automated Essay Scoring
By
Shubham Goyal
Department of Computer Science
School of Computing
National University of Singapore
2013/2014
Project No: H014380
Advisor: Professor NG Hwee Tou
Deliverables:
Report: 1 Volume
Contents
List of Figures
1. Abstract
2. Acknowledgement
3. Goal
4. Introduction
5. Related Work
5.1 Background
5.2 Comparison of the Current State of the Art Essay Systems
6. Implementation
6.1 Overview
6.2 Features Utilized
6.2.1 Content Features
6.2.2 Syntactic Features
6.2.3 Surface Features
6.2.4 Error Identification
6.2.5 Structural Features
6.8 Statistical Parsing
6.9 Feature Weights
6.10 Ranking Algorithm
6.11 Evaluation Metrics
7. Dataset
8. Results
9. Future Work
10. Conclusion
References
Appendices
Appendix‐A: List of Part‐of‐Speech Tags Used
List of Figures
Figure 1 Line Chart for Vendor Performance on the Pearson Product Moment Correlation across …
Figure 2 Implementation overview
Figure 3 Skeletons generated from the sentence 'They have many theoretical ideas'
Figure 4 Parse tree of 'They have many theoretical ideas'
Figure 5 Annotated skeletons in the sentence 'They have many theoretical ideas'
1. Abstract
Automated Essay Scoring (AES) is becoming increasingly popular as human grading
is not only expensive but also cumbersome as the number of test takers grows.
Quick feedback is another characteristic drawing educators towards AES.
However, most AES systems available today are commercial closed-source
software. Our work aims to design a good AES system that uses some of the most
commonly used features to rank essays. We also evaluate our scoring engine on a
publicly available dataset to establish benchmarks. We will also make all the source
code available to the public so that future research can use it as a starting point.
Subject Descriptors:
I.2.7 Natural Language Processing
H.3 Information Storage and Retrieval
I.2.6 Learning
I.2.8 Problem Solving, Control Methods, and Search
Keywords:
Artificial Intelligence, Natural Language Processing
Implementation Software and Hardware:
Python, Java
2. Acknowledgement
I would like to thank my supervisor, Prof NG Hwee Tou, for giving me the
opportunity to work under him on this project. I am honored to have the pleasure
of working with one of the best minds in this field. I thank him for all the time he
has spent helping me, motivating me, guiding me and, finally, helping me become a
better researcher so that I can fulfill my lifelong ambition.
I would also like to thank Prof’s graduate student, Raymond, for taking time out
from his work to provide me with APIs to get the trigram counts of words in the
English Gigaword corpus.
I also appreciate the help provided by another of Prof’s students, Christian
Hadiwinoto, for creating my account on the NLP cluster and helping me install
packages and run my programs there.
3. Goal
This project focuses on building a system for scoring English essays. The system
assigns a score to an essay reflecting the quality of the essay (based on both content
and grammar). The system will be evaluated on a benchmark test data set. Besides
aiming to build a state-of-the-art essay scoring system, the project will also
investigate the robustness and portability of essay scoring systems.
4. Introduction
According to Wikipedia, ‘Automated essay scoring (AES) is the use of specialized
computer programs to assign grades to essays written in an educational setting.’
Usually, the grades are not numeric scores but rather discrete categories. Therefore,
this can also be considered a problem of statistical classification, and by its very
nature, it falls into the domain of natural language processing.
Historically, the origins of this field can be traced to the work of Ellis Batten “Bo”
Page, who is also widely regarded as the father of automated essay scoring. Page’s
development of and pioneering work with Project Essay Grade (PEG™) software in
the mid-1960s set the stage for the practical application of computer essay scoring
technology following the microcomputer revolution of the 1990s.
The most obvious approach to automated essay scoring is to employ machine
learning. This involves obtaining a set of essays that have been manually scored
(the training set). The software then evaluates features of the text of each essay
(surface features like the total number of words, word n-grams, part-of-speech
n-grams, etc.; mostly quantities that can be measured without any human insight)
and constructs a mathematical model that relates these quantities to the scores the
essays received. The model can then be used to calculate scores for new sets of
essays.
The next important question that arises is how to determine the criteria of success.
It is insightful to look at essay scoring before the arrival of computers. High-stakes
essays were, and still are, rated by a few different raters who each give their own
score. The scores are then compared to see if they agree; if they do not, either a
more experienced rater is called in to settle the dispute, or the majority opinion is
taken. We can apply the same approach to checking the success of any AES
software: the grades given by the software can be matched against the grades given
by human graders on the same scripts. The greater the number of matches, the
better the accuracy of the AES software.
Thus, various statistics have been proposed to measure this ‘agreement’ between the
AES software and the human graders. These range from something as simple as
percent agreement to more complicated measures like Pearson’s correlation
coefficient or Spearman’s rank correlation coefficient.
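As a concrete illustration, both kinds of statistic can be computed in a few lines. The sketch below (the function names and sample scores are ours, not from any particular AES system) computes exact percent agreement and the Pearson product-moment correlation for two lists of scores:

```python
import math

def percent_agreement(scores_a, scores_b):
    # Fraction of scripts on which the two raters assign the same score.
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)

def pearson_r(xs, ys):
    # Pearson product-moment correlation between two score lists.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

human = [3, 4, 2, 5, 4]    # hypothetical human grader scores
machine = [3, 4, 3, 5, 5]  # hypothetical AES scores for the same scripts
```

Spearman's rank correlation can be obtained the same way by applying pearson_r to the ranks of the scores rather than to the scores themselves.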
The practice of AES has not been without its fair share of criticism. Yang et al.
mention "the overreliance on surface features of responses, the insensitivity to the
content of responses and to creativity, and the vulnerability to new types of cheating
and test-taking strategies." Some critics also fear that students’ motivation will be
diminished if they know that a human grader will not be reading their writings.
However, we feel that this criticism is directed not at AES itself but rather at the
fear of being assigned false grades. This criticism also shows that the current state-
of-the-art systems can be improved, and so this is an exciting time to be working in
this field.
5. Related Work
5.1 Background
As already mentioned in the introduction, the late Ellis Page and his colleagues at the
University of Connecticut programmed the first successful automated essay scoring
engine, “Project Essay Grade (PEG)” (1973). PEG did produce good results, but one
of the reasons it did not become a practical application is probably the limited
technology of the time.
Different AES systems evaluate different types and number of features which are
extracted from the text of the essay. Page and Peterson (1995), in their Phi Delta
Kappan article, “The computer moves into essay grading: Updating the ancient test,”
referred to these elements or features as “proxes” or approximations for underlying
“trins” (i.e., intrinsic characteristics) of writing. In the original version of PEG, the
text was parsed and classified into language elements such as parts of speech, word
length, word functions and the like. PEG would count keywords and make its
predictions based on the patterns of language that human raters valued or devalued in
making their score assignments. Page classified these counts into three categories:
simple, deceptively simple and sophisticated.
For example, a model in the PEG system might be formed by taking five intrinsic
characteristics of writing (content, creativity, style, mechanics, and organization) and
linking proxes to them. An example of a simple prox is essay length. Page found that
the relationship between the number of words used and the score assignment was not
linear, but rather logarithmic. In other words, essay length is factored in by human
raters up to some threshold, and then becomes less important as they focus on other
aspects of writing.
On the other hand, an example of a sophisticated prox would be a count of the number
of times “because” is used in an essay. A count of the word “because” may not be
important in and of itself, but as a discourse connector, it serves as a proxy for
sentence complexity, and human raters tend to reward more complex sentences.
Some works emphasize the evaluation of content through the specification of
vocabulary (of course, the evaluation of other aspects of writing is performed as
described above). Latent Semantic Analysis and its variants are employed in some
works to provide estimates as to how close the vocabulary in an essay is to a targeted
vocabulary set (Landauer, Foltz & Laham, 1998). The Intelligent Essay Assessor
(Landauer, Foltz & Laham, 2003) is one of the most successful commercial
applications making heavy use of LSA.
If we look at the AES scene at present, there are three major AES developers –
1. e-rater (which is a component of Criterion - http://www.ets.org/criterion) by
Educational Testing Service (ETS)
2. Intellimetric by Vantage Learning (http://www.vantagelearning.com/)
3. Intelligent Essay Assessor by Pearson Knowledge Technologies
(http://kt.pearsonassessments.com/)
Fortunately for us, the construction of e-rater models is given in detail in a recent
work by Attali and Burstein (2006). The system takes in features from six main areas:
1. Grammar, usage, mechanics and style measures (4 features) –
They count the errors in these four categories. Since the raw counts of
errors are highly related to essay length, error rates are used instead, obtained
by dividing the counts in each category by the total number of words in the
essay.
2. Organization and development (2 features) –
The first feature in this category is the organization score which assumes a
writing strategy that includes an introductory paragraph, at least a three
paragraph body with each paragraph in the body consisting of a pair of main
point and supporting idea elements, and a concluding paragraph. The score
measures the difference between this minimum five paragraph essay and the
actual discourse elements found in the essay. The second feature is derived
from Criterion’s organization and development module.
3. Lexical Complexity (2 features) –
These are specifically related to word based characteristics. The first is a
measure of vocabulary level and the second is based on average word length
in characters across the words in the essay. The first feature is from Breland,
Jones, and Jenkins’ (1994) work on Standardized Frequency Index across the
words of the essay.
4. Prompt-specific vocabulary usage (2 features) – e-rater evaluates the lexical
content of an essay by comparing the words it contains to the words found in a
sample of essays from each score category. This is accomplished by making
use of content vector analysis (Salton, Wong, & Yang, 1975). In short, the
vocabulary of each score category is converted to a vector whose elements are
based on the frequency of each word in a sample of essays.
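The essence of content vector analysis can be sketched as follows. This is our own minimal reading of the approach, with made-up function names: both the essay and each score category are represented as word-frequency vectors and compared by cosine similarity.

```python
from collections import Counter
import math

def frequency_vector(essays):
    # Word-frequency vector over all essays in one score category.
    vec = Counter()
    for essay in essays:
        vec.update(essay.lower().split())
    return vec

def cosine_similarity(vec_a, vec_b):
    # Cosine of the angle between two sparse frequency vectors.
    dot = sum(count * vec_b.get(word, 0) for word, count in vec_a.items())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b)

# Toy data: a "score 6" category built from two sample essays.
category_6 = frequency_vector(["the heart pumps blood through the body",
                               "blood flows through the heart"])
essay = frequency_vector(["the heart pumps blood"])
similarity = cosine_similarity(essay, category_6)  # higher = closer in content
```

An essay would then receive the prompt-specific vocabulary feature corresponding to whichever score category's vector it is closest to.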
Like most approaches, e-rater also uses a sample of human scored essay data for
model building purposes. e-rater models can be built at the topic level in which case, a
model is built for a specific essay prompt. However, more often, e-rater models are
built at the grade level. Preparing models for essays on similar topics or by students of
similar grades is not difficult per se, but it requires significant data collection and
human reader scoring, which are not only time-consuming but also costly.
A specification of the Intellimetric model is given in Elliot (2003). The model selects
from more than 300 semantic, syntactic and discourse level features. The features fall
into five major categories:
1. Focus and Unity – includes cohesiveness, consistency in purpose, and main idea
2. Development and Elaboration – includes metrics that look at the breadth of
content and the support for concepts advanced
3. Organization and Structure – mainly targeted at the logic of discourse, such as
transitional fluidity and relationships among parts of the response
4. Sentence Structure – includes sentence complexity and sentence variety
5. Mechanics and Conventions – features measuring conformance to the
conventions of edited American English
Intellimetric uses Latent Semantic Dimension, which is similar in nature to the LSA
described earlier. Latent Semantic Dimension also determines how close the
candidate response is, in terms of content, to a modeled set of vocabulary. The paper
does not go into much more detail about how Intellimetric works, and instead
focuses more on the validation aspect.
Technical details of the Intelligent Essay Assessor are highlighted in Landauer,
Laham, & Foltz (2003). The content of the essay is assessed by using a combination
of external databases and LSA. The authors do talk about examples of external
databases used for three of their experiments. This is interesting to note because this
can shed important light on what kind of data needs to be extracted for automated
essay scoring in different situations.
In a particular experiment, the essay question was on the anatomy and function of the
heart and circulatory system. This was administered to 94 undergraduates at the
University of Colorado before and after an instructional session (N = 188) and scored
by two professional readers from Educational Testing Service (ETS). In this case, the
LSA semantic space was constructed by analysis of all 95 paragraphs in a set of 26
articles on the heart taken from an electronic version of Grolier’s Academic American
Encyclopedia. Even though this corpus was smaller than the corpora traditionally
used, it gave good results according to the authors. When the authors tried to expand
it by adding general text, the results did not improve.
We also draw immense inspiration from and analyze the work by (Yannakoudakis et
al., 2011) but the discussion on that is deferred to the subsequent sections in the
interests of brevity and to avoid repetition.
5.2 Comparison of the Current State of the Art Essay Systems
A recent study (Shermis and Hammer, 2012) compared the results from nine
automated essay scoring engines on eight prompts drawn from six states in the United
States that hold high-stakes writing exams. The essays encompassed writing
assessments from three grade levels, namely 7, 8 and 10, and were evenly distributed
among the different prompts. In total, there were 22,029 essays.
The following line chart demonstrates the Pearson product moment correlation across
the eight essay data sets –
Figure 1. Line Chart for Vendor Performance on the Pearson Product Moment Correlation across
the Eight Essay Data Sets
The nine automated essay scoring engines participating in the study were –
1. Autoscore developed by the American Institutes for Research (AIR)
The main features of this scoring engine include creating a statistical proxy for
prompt-specific rubrics (single as well as multiple trait). The engine needs to
be trained on known and valid scores.
2. LightSIDE developed at Carnegie Mellon University’s TELEDIA Lab
This is a free and open-source package and is very beginner friendly. It is
meant to be a tool for non-professionals to make use of data mining technology
for varied purposes, one of which is essay assessment.
3. Bookette developed by CTB McGraw-Hill Education
These scoring engines are able to model trait level and/or holistic level scores
for essays with a similar degree of reliability to an expert human rater. CTB
builds two types of engines – prompt specific and generic. When applied in
the classroom, the engines can provide performance feedback through the use
of the information found in the scoring rubric and through feedback on
grammar, spelling, conventions, etc. at the sentence level. Bookette engines
utilize around 90 text-features classified as structural, syntactic, semantic and
mechanics-based.
4. e-rater, developed by Educational Testing Service
This scoring engine is focused on evaluating essay quality. There are dozens of
features, each measuring a different, very specific aspect of essay quality. The
same features serve as the basis for performance feedback to students through
products like Criterion (http://www.ets.org/criterion).
5. Lexile Writing Analyzer developed by MetaMetrics
This is independent of grades, genres, prompts or punctuation and is an engine
for establishing Lexile writer measures. The Lexile writer measure is said to be
an inherent individual trait or power to compose written text, with writing
ability embedded in a complex web of cognitive and sociocultural processes.
6. Project Essay Grade (PEG), Measurement, Inc.
This scoring engine has had more than 40 years of study and enhancement
devoted to it. Studies conducted at a number of state departments of education
indicate that PEG demonstrated accuracy similar to trained human scorers.
7. Intelligent Essay Assessor (IEA), Pearson Knowledge Technologies
Some of the features are derived through semantic models of English (or any
other language) from an analysis of large volumes of text equivalent to the
reading material of a high school student (around 12 million words). This
scoring engine combines background knowledge about English in general and
the subject area of the assessment in particular along with prompt-specific
algorithms to learn how to match student responses to human scores. IEA
also provides feedback and can even be tuned to understand and examine text
in other languages (Spanish, Arabic, Hindi, etc.). It can identify off-topic
responses, very unconventional essays and other unique circumstances that
need human attention. It has also been used for grading millions of essays in
high-stakes examinations.
8. CRASE™ by Pacific Metrics
This system is highly configurable, both in terms of the customizations used to
build machine scoring models and in terms of how the system can blend
human scoring and machine scoring (i.e., hybrid models). It is actually a Java
application that runs as a web service.
9. IntelliMetric developed by Vantage Learning
This scoring system attempts to emulate what the human scorers do.
IntelliMetric is trained to score test-taker essays. Each prompt (essay) is first
scored by expert human scorers who develop anchor papers for each score
point. A number of papers for each score point are loaded into IntelliMetric,
which runs multiple algorithms to determine the specific writing features that
translate to various score points.
6. Implementation
6.1 Overview
This is what the entire process actually looks like in a nutshell –
Figure 2 Implementation overview
To score essays automatically, we need to train a machine-learning algorithm. After
the algorithm has been trained, it gives us a machine-learning model, which can be
used to score more essays. In order for a machine-learning model to be created,
features first need to be extracted from the text, as a computer cannot directly
understand English. We need to use numbers or symbols as proxies for meaning.
6.2. Features utilized
6.2.1 Content Features
6.2.1.1 Word n‐grams
An n-gram can simply be defined as a contiguous sequence of n items from a given
sequence of text or speech. For the purpose of essay scoring, n-grams can simply be
understood as collections of one or more tokens. n can take any value, but usually
only n = 1 (unigrams), 2 (bigrams) and 3 (trigrams) are considered, because
higher-order n-grams suffer from the sparse data problem.
The tokens were converted to lower case before being used as n-grams. However, no
stemming was employed.
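A minimal sketch of the extraction step described above (the function name is ours): tokens are lowercased and obtained by simple whitespace splitting, with no stemming.

```python
def word_ngrams(text, n):
    # Contiguous word n-grams from a lowercased, whitespace-tokenized text.
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = word_ngrams("They have many theoretical ideas", 2)
# e.g. ('they', 'have'), ('have', 'many'), ...
```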
6.2.2 Syntactic Features
6.2.2.1 POS n‐grams
This feature is the same as word n-grams except that we replace each word with its
part-of-speech (POS) tag, such as noun, verb, adjective, etc. Parts of speech are also
known as word classes or lexical categories. In this work, we employ the Penn
Treebank (PTB) tagset, chosen for its wide use. Appendix A.1 details the
different tags in the PTB tagset.
The tokens are tagged in their original case because changing the case of a word
might change its tag (for example, proper nouns (NNP/NNPS) are usually identified
by the capital initial letter). The methodology followed in Yannakoudakis et
al. is a bit different because they make use of the RASP tagger for
this purpose.
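Assuming a POS tagger has already produced (word, tag) pairs (a real system would call a tagger such as NLTK's; here the tagged sentence is supplied by hand), POS n-grams can be extracted just like word n-grams:

```python
def pos_ngrams(tagged_sentence, n):
    # n-grams over the POS tags of a tagged sentence; the words themselves
    # are discarded once tagging (done on original-case text) is complete.
    tags = [tag for _, tag in tagged_sentence]
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

# Hand-tagged example using Penn Treebank tags.
tagged = [("They", "PRP"), ("have", "VBP"), ("many", "JJ"),
          ("theoretical", "JJ"), ("ideas", "NNS")]
pos_bigrams = pos_ngrams(tagged, 2)
```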
6.2.3 Surface Features
6.2.3.1 Script Length
Logically, script length should have no relation to the score, because a short,
well-written piece of text should get the same score as a longer one. However, as
mentioned in the related work section above, script length has been found to affect
the score: empirically, the longer the essay, the better the score tends to be. Script
length can also cancel out any skewness in the final results due to features whose
weights are influenced by script length.
This is a surface feature since it is completely language-blind. According to Cohen et
al., such surface variables are in themselves extremely predictive of the essay score.
However, the consequence of using such features alone can be that students will
simply learn to write longer texts with no regard for rhetorical structure, the logic of
argumentation, and so forth. This is why such surface variables need to be used
alongside other features which relate to content, syntactic structure or rhetorical
structure.
6.2.4 Error Identification
6.2.4.1 Error Rate
By error rate, we refer to the rate of occurrence of unknown (and hence, presumably
erroneous) n-grams. The simplest way of obtaining error rates is to build a language
model from a suitably large and, ideally, in-domain corpus and then measure the rate
of occurrence of n-grams in the document which do not occur in the corpus. Error
rate can be an important feature for several reasons. Most importantly, it can serve to
identify improper use of grammar and words: if the rates of occurrence of
grammatical errors in two documents are the same, their scores will probably be
similar or lie in the same range. For the purpose of our research, we are using Prof
NG Hwee Tou’s corpora. These corpora are parts of the English Gigaword (details
here - http://catalog.ldc.upenn.edu/LDC2009T13). The first corpus consists of the
first 4 million sentences and around 100 million words, while the second corpus
consists of around 40 million sentences and more than a billion words.
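A simplified sketch of this feature (our own formulation, which treats the language model as just the set of n-grams observed in the reference corpus):

```python
def error_rate(essay_tokens, known_ngrams, n=3):
    # Fraction of the essay's n-grams never seen in the reference corpus.
    ngrams = [tuple(essay_tokens[i:i + n])
              for i in range(len(essay_tokens) - n + 1)]
    if not ngrams:
        return 0.0
    unknown = sum(1 for g in ngrams if g not in known_ngrams)
    return unknown / len(ngrams)

# Toy "corpus": in practice these would be trigrams from Gigaword.
known = {("the", "cat", "sat"), ("cat", "sat", "down")}
rate = error_rate(["the", "cat", "sat", "down"], known)  # 0.0: all seen
```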
6.2.5 Structural Features
The inspiration for these features comes from Massung et al. (2013).
6.2.5.1 Skeletons
We aim to capture the flow or discourse structure of sentences without
bothering about the actual labels. For example, if the input sentence is ‘They
have many theoretical ideas’, the following skeletons would be generated –
Figure 3. Skeletons generated from the sentence 'They have many theoretical ideas'
To understand why Figure 3 is as it is, let us look at the parse tree of ‘They have
many theoretical ideas’. The following parse tree has been drawn with the help
of the nltk draw function (the tags used in this figure are documented
in section A.1 of the Appendix).
Figure 4. Parse Tree of 'they have many theoretical ideas'
To represent these skeletons of parse trees (for example, see Figure 3), we
store the trees as sets of square brackets. So, for the sentence ‘They have
many theoretical ideas’, the skeletons of the parse trees will be –
a) []
b) [[]]
c) [[[]]]
d) [[[]], [[]], [[]]]
e) [[[]], [[[]], [[]], [[]]]]
f) [[[[]]], [[[]], [[[]], [[]], [[]]]]]
We can choose to ignore the tree represented by (a) since it is trivial and will be
present in every document (in fact, each word or punctuation mark can be
represented as []). But (b), (c), (d), (e) and (f) correspond to the graphical
representations of the trees in Figure 3.
The procedure for identifying the skeletons is quite simple. We start from
the root of the parse tree and recursively descend into sub-trees, recording the
inherent structure. The following pseudocode demonstrates how
this works –
procedure get_list_of_skeletons_of_sentence(sentence):
    tree = parse(sentence)
    if tree is NULL:
        return []
    subtrees_list = []
    convert_tree_to_list(tree, subtrees_list)
    return subtrees_list

procedure convert_tree_to_list(node, list):
    if node is not of type Tree:    # a leaf: a word or punctuation mark
        list.append([])
        return []
    subtree = []
    for subtree_node in node:
        subtree.append(convert_tree_to_list(subtree_node, list))
    subtree.sort()
    list.append(subtree)
    return subtree
This get_list_of_skeletons_of_sentence function returns the skeletal structure of the
sentence (that is, the list of skeletons that correspond to the sub-trees of the parse tree
of the sentence). It does this by first building the parse tree of the sentence and then
calling a recursive function, convert_tree_to_list, which visits each node in the tree
and appends the skeleton of the subtree rooted at that node to the list. The leaves of
the parse tree are represented as [] in our square-bracket notation.
The section on ‘Statistical Parsing’ later attempts to detail how the function parse
works.
6.2.5.2 Annotated Skeletons
Annotated skeletons are the same as skeletons with just one extra piece of
information attached to them – the label of the topmost node of each parse
sub-tree. An example of the different parse-trees for the same sentence ‘They
have many theoretical ideas’ is
Figure 5 Annotated skeletons in the sentence 'They have many theoretical ideas'
6.2.5.3 Rewrite Rules
This feature was used by Kim et al. (2011). It essentially tallies subtrees from
each sentence’s parse. It has historically mainly been used in text
classification where all the parse trees are put in different classes/categories so
each category has a ‘bag-of-trees’. This feature is beneficial as certain trees
can be abundant in particular categories. Kim et al. use it for authorship
classification. Simpler applications would include age detection or language
proficiency. Less proficient writers would be unlikely to use complicated tree
structures. This can be useful for essay scoring as well.
After conducting experiments, I decided to use only the first feature (Skeletons) from
this section. The results section analyzes why this might be the case.
To prevent overfitting or bias issues, we only use features which appear at least 4
times in the entire training set. The value 4 is chosen because of its use in
Yannakoudakis et al., so that it is easy to compare results.
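The cutoff can be sketched as a simple counting pass over the training set (the function name is ours):

```python
from collections import Counter

def frequent_features(essay_feature_lists, min_count=4):
    # Keep only features occurring at least min_count times across
    # the entire training set (counting repeats within an essay).
    counts = Counter()
    for features in essay_feature_lists:
        counts.update(features)
    return {f for f, c in counts.items() if c >= min_count}

training = [["a", "a", "b"], ["a", "b"], ["a", "c"]]
kept = frequent_features(training)  # only 'a' occurs 4 or more times
```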
6.8 Statistical Parsing
An important black box in the previous section, where structural features were
discussed, is how the parse trees are formed. Sentences on average tend to be very
syntactically ambiguous (coordination ambiguity, attachment ambiguity, etc.), which
is why we need probabilistic parsing: we consider all possible interpretations and
then choose the most likely one.
The CS4248 (Natural Language Processing) class in NUS which I took talked about
probabilistic context free grammar (PCFG), a probabilistic addition to context free
grammars (CFGs) in which each rule has a probability assigned to it. We use the
probabilistic CKY algorithm to generate the most probable parses. The algorithm is
trained on two Treebank grammars –
a) Penn Treebank
b) QuestionBank
6.8.1 Penn Treebank
The material annotated for this project includes such wide-ranging genres as IBM
computer manuals, nursing notes, Wall Street Journal articles, transcribed telephone
conversations, etc. For our work, we use a sample (a 5% fragment) of this huge
treebank which has been made available for non-commercial use. It contains parsed
data from the Wall Street Journal for 1650 sentences (99 treebank files, wsj_0001 to
wsj_0099).
An example annotated sentence from the treebank –
( (S
(NP-SBJ
(NP (NNP Pierre) (NNP Vinken) )
(, ,)
(ADJP
(NP (CD 61) (NNS years) )
(JJ old) )
(, ,) )
(VP (MD will)
(VP (VB join)
(NP (DT the) (NN board) )
(PP-CLR (IN as)
(NP (DT a) (JJ nonexecutive) (NN director) ))
(NP-TMP (NNP Nov.) (CD 29) )))
(. .) ))
6.8.2 QuestionBank
This is a corpus of 4000 parse-annotated questions developed by the National Centre
for Language Technology and School of Computing. It is provided free for research
purposes. This is also one of the reasons why it has been employed in this work. The
annotated parse trees are very similar to the ones in the Penn Treebank so examples
have been omitted here in the interests of brevity.
After reading the annotated data from the treebanks, we obtain a grammar (a list of
production rules). But we still have to convert the grammar to Chomsky Normal
Form, because the CKY algorithm works only on context-free grammars given in
Chomsky Normal Form (CNF).
6.8.3 Chomsky Normal Form
A grammar is said to be in Chomsky Normal Form if all of its production rules are of
the form:
a) A → BC, or
b) A → α, or
c) S → ε
where A, B and C are nonterminal symbols, α is a terminal symbol (or a constant), S
is the start symbol and ε represents the empty string. Neither B nor C may be the
start symbol. Moreover, rule (c) is valid only if ε is part of the language generated
by the grammar G.
It has been proven that every context free grammar can be transformed into one in
Chomsky Normal Form.
6.8.4 Converting a CFG to CNF
1. Introduce a new start symbol S0. This also means that a new rule has to be
added with regard to the previous start symbol S –
S0 → S
2. Eliminate all ε-rules. ε-rules can only be of the form A → ε, where A is not
the start symbol (the proof is trivial). Remove every rule with ε on its right-hand
side (RHS). Then, for each rule that has A in its RHS, add a set of new rules
consisting of all the combinations of A replaced or not replaced by ε. If A occurs
as a singleton on the RHS of some rule B → A, add a new rule B → ε, unless that
rule has already been removed.
3. Eliminate all unit rules. Unit rules are those whose RHS contains exactly one
variable and no terminals (such a rule is inconsistent with the conditions for a
grammar in Chomsky Normal Form as described at the beginning of this section).
If the unit rule to be removed is X → Y and there exist one or more rules of the
form Y → Z (where Z is a string of variables and terminals), add a new rule
X → Z (unless it is a unit rule which has already been removed).
4. Clean up the remaining rules that are not in Chomsky Normal Form. Replace
A → u1u2…uk, k ≥ 3, ui ∈ V ∪ Σ, with A → u1A1, A1 → u2A2, …,
Ak−2 → uk−1uk, where the Ai are new variables. If ui ∈ Σ, replace ui in the
above rules with some new variable Vi and add the rule Vi → ui.
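Step 4 (binarization) can be sketched as follows; the naming scheme for the fresh variables (A_1, A_2, …) is our own choice:

```python
def binarize(lhs, rhs):
    # Split A -> u1 u2 ... uk (k >= 3) into a chain of binary rules:
    # A -> u1 A_1, A_1 -> u2 A_2, ..., A_{k-2} -> u_{k-1} u_k.
    rules = []
    current = lhs
    for i, symbol in enumerate(rhs[:-2], start=1):
        fresh = f"{lhs}_{i}"  # fresh nonterminal
        rules.append((current, (symbol, fresh)))
        current = fresh
    rules.append((current, (rhs[-2], rhs[-1])))
    return rules

rules = binarize("NP", ["DT", "JJ", "NN"])
# [('NP', ('DT', 'NP_1')), ('NP_1', ('JJ', 'NN'))]
```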
Once all the rules have been converted to Chomsky Normal Form, we can assign
probabilities to them. This completes the learning of a probabilistic context free
grammar in Chomsky Normal Form (CNF) from the treebanks.
Now, given an input sentence we need to use the probabilistic grammar to generate
the most likely parse tree. We use the Cocke-Younger-Kasami (CKY) algorithm. We
do have to modify the standard version of the algorithm since the standard version
checks only for membership. The pseudocode for the standard version is as below –
Let the grammar be represented by G.
Let S: a1...an be the input sentence or phrase.
Let R1…Rr be the non-terminal symbols present in the grammar,
with Rs ∈ {R1…Rr} the start symbol.
Let P[n, n, r] be a three-dimensional array of Booleans.

for each i = 1 to n:
    for each j = 1 to n:
        for each k = 1 to r:
            P[i, j, k] = false

for each i = 1 to n:
    for each unit production Rj → ai:
        P[i, i, j] = true

for each i = 2 to n:                 # length of the span
    for each L = 1 to n - i + 1:     # start of the span
        R = L + i - 1                # end of the span
        for each M = L + 1 to R:     # split point
            for each production Rα → Rβ Rγ:
                if P[L, M - 1, β] and P[M, R, γ]:
                    P[L, R, α] = true

if P[1, n, s]:
    return true
return false
The above algorithm checks for the membership of the sentence in the language.
Our goal is to construct a parse tree, so we change the array P to store parse-tree
nodes instead of Boolean values. These nodes are associated with the array elements
that were used to produce them so as to build the tree structure. This is a simple
back-tracking procedure.
Thus, finally, the parse function in get_list_of_skeletons_of_sentence can return the
tree structure generated by the CKY algorithm.
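For reference, a runnable rendering of the membership version of CKY follows; the grammar representation (a dict for lexical rules, a list of triples for binary rules) is our own choice, not that of any particular library:

```python
def cky_recognize(tokens, lexical, binary, start="S"):
    # CKY membership test for a grammar in Chomsky Normal Form.
    # lexical: word -> set of nonterminals A with rule A -> word
    # binary:  list of (A, B, C) triples for rules A -> B C
    n = len(tokens)
    # table[i][j]: nonterminals deriving tokens[i..j] (0-indexed, inclusive)
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(tokens):
        table[i][i] = set(lexical.get(word, ()))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for m in range(i, j):  # split between positions m and m + 1
                for a, b, c in binary:
                    if b in table[i][m] and c in table[m + 1][j]:
                        table[i][j].add(a)
    return start in table[0][n - 1]

# Tiny CNF grammar: S -> NP VP, NP -> 'they', VP -> 'sleep'.
lexical = {"they": {"NP"}, "sleep": {"VP"}}
binary = [("S", "NP", "VP")]
ok = cky_recognize(["they", "sleep"], lexical, binary)
```

The parsing version replaces the sets of nonterminals with back-pointers to the two sub-spans that produced each entry, as described above.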
6.9 Feature Weights
In the previous sections, we have discussed the methods employed to generate the
features of an essay. Each unique feature (for example, a particular token or
unigram, or a particular parse tree) is given a unique number to represent it. However,
we also need to decide what weights (or importance) to assign to those features.
1. The simplest way would be to use a 0 or a 1 depending on whether a feature is
present or absent.
2. Another technique that was tried was to use the number of times the feature
occurs in a given essay as its weight.
3. tf-idf weighting was also tried for certain features, especially word n-grams
and POS n-grams. The next section gives more details on the tf-idf statistic.
6.9.1 tf‐idf scheme
tf-idf is short for term frequency-inverse document frequency. This is often used as a
weighting factor in information retrieval and data mining. It is a product of term
frequency and inverse document frequency.
Various ways of calculating term frequency exist; the simplest, and the one we have
used in our work, is to take the raw count of the feature in a particular essay.
The inverse document frequency is a measure of whether the term is unique or
common across essays. We can arrive at this statistic by dividing the total number of
essays by the number of essays containing the term, and then taking a logarithm of the
quotient.
Such a statistic is needed because raw counts, or presence/absence values, ignore
how important a feature is to a particular document. In the context of essays, there
might be certain phrases that score highly in the eyes of a grader when present, yet
do not occur commonly across all documents. Even though this statistic might make
more sense for information retrieval tasks, we still tried it to see its impact.
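As a concrete sketch, the tf-idf computation described above (raw count as term frequency, and the logarithm of total essays over essays containing the term as inverse document frequency) can be written as follows. This is an illustrative version, not the project's actual code:

```python
import math

def tfidf_weights(essays):
    """essays: a list of token lists. Returns one {term: weight} dict per essay."""
    n = len(essays)
    df = {}                                    # essays containing each term
    for essay in essays:
        for term in set(essay):
            df[term] = df.get(term, 0) + 1
    weighted = []
    for essay in essays:
        tf = {}                                # raw count as term frequency
        for term in essay:
            tf[term] = tf.get(term, 0) + 1
        # weight = tf * log(total essays / essays containing the term)
        weighted.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return weighted
```

Note that a term appearing in every essay gets weight 0, matching the intuition that features common to all essays carry little discriminative information.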
6.10 Ranking Algorithm
Now that we have discussed the features we use, let us look at the machine learning
aspect of the problem. We model this as a ranking problem rather than a
classification one. One reason is that we have the absolute human grader scores;
converting each score into a grade bucket would voluntarily discard information, so
we can do better than classification. On the other hand, predicting the exact score
also does not make sense, because a machine is unlikely to reproduce it exactly.
Ranking thus seems like a viable middle ground.
We use a support vector machine for this task. Our choice is motivated by the fact
that other works, specifically Yannakoudakis et al., make use of the SVMlight
library (http://svmlight.joachims.org/), so it is easy to compare results. To be
precise, we use SVMrank
(http://www.cs.cornell.edu/People/tj/svm_light/svm_rank.html), which employs newer
algorithms for training ranking SVMs. The decision to switch to SVMrank is made
easier by the fact that both libraries are by the same author, and (T. Joachims,
2006) states that both solve the same optimization problem, the only difference
being that SVMrank is much faster.
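For reference, SVMrank reads plain-text training files in which each line carries a target value, a query id, and sparse feature pairs with feature ids in increasing order; within one qid, the targets define the desired ranking. A minimal writer, sketched here with an invented helper name, might look like:

```python
def write_svmrank_file(path, examples):
    """Write (score, qid, {feature_id: weight}) examples in SVMrank's format:

        <target> qid:<q> <id1>:<val1> <id2>:<val2> ...

    Feature ids must appear in strictly increasing order on each line.
    """
    with open(path, "w") as f:
        for score, qid, feats in examples:
            pairs = " ".join("%d:%g" % (i, v) for i, v in sorted(feats.items()))
            f.write("%d qid:%d %s\n" % (score, qid, pairs))
```

In our setting, essays answering the same prompt could share a qid, with the target being the human grader's overall score.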
6.11 Evaluation Metrics
The two evaluation metrics which have been employed in our work are –
6.11.1 Pearson’s Product‐Moment Correlation Coefficient
Pearson’s correlation measures the degree of linear dependence between two
variables. It gives a value in the range [-1, 1], where -1 denotes total negative
linear correlation, 0 denotes no linear correlation and 1 denotes total positive
linear correlation. However, the value of this metric can be misleading in some
cases, since it is sensitive to outliers and to the distribution of the data.
6.11.2 Spearman’s Rank Correlation Coefficient
This is a non-parametric robust measure of statistical dependence between two
variables. It essentially assesses how well a relationship between the two variables
can be described using a monotonic function. If there are no repeated data values, a
perfect Spearman correlation of +1 or −1 occurs when each of the variables is a
perfect monotone function of the other.
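Both metrics are straightforward to compute directly. The following is a minimal sketch (it assumes non-constant inputs and does no tie handling for Spearman, which a real evaluation would need):

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation (assumes non-constant inputs)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rank correlation: Pearson's correlation of the ranks.
    No tie handling in this sketch; ties are broken by input order."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    return pearson(ranks(x), ranks(y))
```

For example, a perfectly monotonic but non-linear relationship such as y = x² on positive values gives a Spearman correlation of exactly 1 while its Pearson correlation is slightly below 1.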
7. Dataset
As can be observed from the related work and introduction, automated essay scoring
is a data-intensive task. To be able to predict scores, we need the dataset to
contain as many essay scripts as possible, and the scripts must be properly
annotated, or at the very least manually graded.
For our own experiments, we are currently making use of data drawn from the CLC
FCE dataset, a set of 1,244 exam scripts written by candidates sitting the Cambridge
ESOL First Certificate in English (FCE) examination in 2000 and 2001, and made
available by Cambridge University Press; see (Yannakoudakis et al., 2011).
The CLC dataset is divided into training and test sets. The training set consists of
1141 scripts from the year 2000 written by 1141 distinct learners, and 97 scripts from
the year 2001 for testing, written by 97 distinct learners. The learners’ ages follow a
bimodal distribution with peaks at approximately 16-20 and 26-30 years of age.
Yannakoudakis et al. claim that there is no overlap between the prompts used in 2000
and in 2001. The scripts also have some meta-data about candidate’s grades, native
language and age.
The First Certificate in English (FCE) exam’s writing component consists of two
tasks asking learners to write either a letter, a report, an article, a composition or a
short story, between 200 and 400 words. Answers to each of these tasks are annotated
with marks (in the range 1-40). In addition, an overall mark is assigned to both tasks.
We do not make use of the individual task scores and use only the overall score,
because (Yannakoudakis et al., 2011) use just the overall score, which gives us a
benchmark to compare our results against.
Each script is also tagged with information about the linguistic errors committed,
using a taxonomy of approximately 80 error types (Nicholls, 2003). An example of
this is the following –
Thanks for <NS type="DD"><i>you</i><c>your</c></NS> letter.
The part of the text between <i> and </i> denotes the incorrect text while the part
between <c> and </c> denotes the correction of that incorrect text.
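As an illustration of how such annotations can be consumed, the following sketch strips non-nested <NS> edits from a script, recovering the corrected text together with the list of (error type, incorrect, correction) triples. Note that real CLC markup can nest annotations, which this simple regular expression does not handle:

```python
import re

# Matches <NS type="..."><i>wrong</i><c>right</c></NS>; the <i> and <c>
# parts are each optional (pure insertions or deletions).
NS = re.compile(r'<NS type="([^"]+)">(?:<i>(.*?)</i>)?(?:<c>(.*?)</c>)?</NS>')

def corrected_text_and_errors(annotated):
    """Return the corrected text plus a list of (type, wrong, right) errors."""
    errors = []
    def repl(match):
        errors.append((match.group(1), match.group(2), match.group(3)))
        return match.group(3) or ""        # substitute the correction, if any
    return NS.sub(repl, annotated), errors
```

A per-script error rate, as used in our feature set, could then be estimated by dividing the number of extracted errors by the script's word count.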
8. Results
The following table contains the correlation values after adding the different features:

Table 1. Spearman’s and Pearson’s Correlation Values

Features                            Pearson’s Correlation   Spearman’s Rank Order Correlation
Word n-grams                        0.6005                  0.5967
+ PoS n-grams (tf-idf weights)      0.6053                  0.5982
+ PoS n-grams (counts as weights)   0.5679                  0.5612
+ Script length                     0.5685                  0.5622
+ Error Rate                        0.4247                  0.4247
+ Skeletons                         0.4904                  0.4904
Since we use the same dataset as Yannakoudakis et al. for benchmarking, we can
compare our results with theirs. Our correlation coefficients have nearly the same
value when we use just word n-grams as a feature. For the other features, some
variation is to be expected, since they use a different PoS tagger, calculate the
error rate in a different manner, and do not use the same structural features.
From Table 1, we can see that our predictions deviate more and more from the human
scores as more features are added. The best results are obtained when using only the
word and PoS n-grams as features. This is unexpected, but it might be because the
training dataset is too similar to the test dataset. That might also explain why
just lexical n-grams can give a correlation as high as 0.6.
Word n-grams only include unigrams and bigrams. Trigrams were tried, but they
produced much worse results, which can be attributed to data sparseness.
Yannakoudakis et al. also do not use word trigrams or any higher-order n-grams.
If we use POS n-gram counts instead of using their tf-idf weights, the correlation
decreases. This suggests that using the tf-idf weighting scheme is useful, especially
for ngram features.
Word n-grams weighted using tf-idf scheme actually give better results when they are
not normalized. Just using word n-grams weighted using the tf-idf scheme results in a
Pearson’s correlation of 0.6220 and Spearman’s correlation of 0.6251.
We have used only skeletons as structural features here. The results were much
worse with annotated skeletons and rewrite rules, so we decided to omit those
features; they represent all the information that skeletons do and more, so
including both would have been redundant. One possible reason why plain parse-tree
skeletons performed best is that, since they do not even contain label information
at the root, they are the least likely to suffer from the data sparseness problem.
Table 2 presents Pearson’s and Spearman’s correlation between the CLC and our
system, when removing one feature at a time.
Table 2. Ablation tests showing the correlation between the CLC and the AES system

Feature removed      Pearson’s Correlation   Spearman’s Rank Order Correlation
none                 0.4904                  0.4904
Word n-grams         0.4956                  0.4924
Script length        0.4919                  0.4883
Error Rate           0.4959                  0.4928
Skeletons            0.4320                  0.4247
9. Future Work
For the future of this project, along with the addition of more features, an important
task is to get results for different datasets. This is to make sure that peculiarities in the
dataset do not influence the development of the AES system.
Prompt-specific features also need to be added so that essays can be graded without
the need for human annotated copies at all. Like some commercial essay scoring
engines, our software might also be able to mine for information depending on the
prompt and use that to grade essays on the topic.
This work can also be turned into a free web application after some improvements.
It would also be interesting from a research point of view to observe how the
software performs on real student essays.
10. Conclusion
Automated Essay Scoring is an interesting area to work on. There is definitely a lot of
scope for improvement and innovation. A lot still needs to be done to bring this into
the mainstream and gain widespread adoption. If this were done well and reliably,
it would go a long way towards not only reducing manual work but also improving
teaching; it could even transform education, since teachers would no longer need to
worry about the grading burden when deciding to assign essay-writing tasks to their
students.
We have been able to build a proof-of-concept essay scoring system. A lot needs to
be done to make it as reliable and functional as some of the commercially available
options, which have been around for roughly 40 years, but this is an encouraging
sign. Our results do not beat the best in the business, but we can at least provide
an open-source solution on which future research can build.
References
Automated Essay Scoring (http://en.wikipedia.org/wiki/Automated_essay_scoring)
Ellis Batten Page (http://en.wikipedia.org/wiki/Ellis_Batten_Page)
Yang Yongwei, Chad W. Buckendahl, Piotr J. Juskiewicz and Dennison S. Bhola
(2002). “A review of Strategies for Validating Computer-Automated Scoring”.
Applied Measurement in Education.
Ajay, H. B., Tillett, P. I., & Page, E. B. (1973). Analysis of essays by computer
(AEC-II) (No. 8-0102). Washington, DC: U.S. Department of Health, Education, and
Welfare, Office of Education, National Center for Educational Research and
Development.
Handbook of Automated Essay Evaluation: Current Applications and New Directions.
Edited by Mark D. Shermis and Jill Burstein
Page, E. B., & Petersen, N. S. (1995). The computer moves into essay grading:
Updating the ancient test. Phi Delta Kappan.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to latent semantic
analysis. Discourse Processes.
Attali, Y., & Burstein, J. (2006). Automated Essay Scoring With e-rater V.2. Journal
of Technology, Learning, and Assessment.
Breland, H. M., Jones, R. J., & Jenkins, L. (1994). The College Board vocabulary
study (College Board Report No. 94–4; Educational Testing Service Research Report
No. 94–26). New York: College Entrance Examination Board.
Salton, G., Wong, A., & Yang, C.S. (1975). A vector space model for automatic
indexing. Communications of the ACM, 18, 613–620.
Landauer, T. K., Laham, D., & Foltz, P. W. (2003). Automated scoring and
annotation of essays with the Intelligent Essay Assessor. In M. D. Shermis & J.
Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 87–
112). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Elliot, S. (2003). Intellimetric: From here to validity. In M. D. Shermis & J. Burstein
(Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 71-86).
Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method
for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of
the Association for Computational Linguistics: Human Language Technologies,
Portland, Oregon, USA, 19-24 June 2011.
D. Nicholls. 2003. The Cambridge Learner Corpus: Error coding and analysis for
lexicography and ELT. In Proceedings of the Corpus Linguistics 2003 conference,
pages 572–581.
Ziheng Lin, Hwee Tou Ng and Min-Yen Kan (2011). Automatically Evaluating Text
Coherence Using Discourse Relations. In Proceedings of the 49th Annual Meeting of
the Association for Computational Linguistics: Human Language Technologies
(ACL-HLT 2011), Portland, Oregon, USA, June.
Alphabetical list of part-of-speech tags used in the Penn Treebank Project
(http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)
Yoav Cohen, Anat Ben-Simon and Myra Hovav (2003). The Effect of Specific Language
Features on the Complexity of Systems for Automated Essay Scoring. IAEA 29th Annual
Conference, Manchester, UK.
Sean Massung, ChengXiang Zhai and Julia Hockenmaier (2013). Structural Parse
Tree Features for Text Representation. 2013 IEEE Seventh International Conference
on Semantic Computing.
S. Kim, H. Kim, T. Weninger, J. Han, and H. D. Kim, “Authorship classification: a
discriminative syntactic tree mining approach,” in Proceedings of the 34th
international ACM SIGIR conference on Research and development in Information
Retrieval, ser. SIGIR ’11. New York, NY, USA: ACM, 2011, pp. 455–464. [Online].
Available: http://doi.acm.org/10.1145/2009916.2009979
Daniel Jurafsky & James H. Martin. Speech and Language Processing: An
introduction to natural language processing, computational linguistics, and speech
recognition.
John Judge, Aoife Cahil and Josef van Genabith. QuestionBank: Creating a Corpus of
Parse-Annotated Questions.
CYK Algorithm (http://en.wikipedia.org/wiki/CYK_algorithm)
Mark D. Shermis and Ben Hamner. Contrasting State-of-the-Art Automated Scoring of
Essays: Analysis.
Appendices
Appendix‐A: List of Part‐of‐Speech Tags Used
The following is an alphabetical list of the part‐of‐speech tags used in the Penn
Treebank Project:
Number Tag Description
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PRP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB Wh-adverb