Emulating human essay scoring with machine learning methods
Darrell Laham, Tom Landauer, Peter Foltz
Cognitive Systems: Human Cognitive Models in System Design
June 30, 2003
Marcia Derr, Ph.D.
Scott Dooley
Terry Drissell
Dave Farnham
Peter Foltz, Ph.D.
Shawn Frederickson
Brent Halsey
Pat Hilton-Suiter
Darrell Laham, Ph.D.
Tom Landauer, Ph.D.
Karen Lochbaum, Ph.D.
Dian Martin
Jeff Nock
Jim Parker
Randy Sparks, Ph.D.
Lynn Streeter, Ph.D.
Taxonomy of essay assessment
Writing Assessment Types
- Composition (Language Arts): Does the writer write well?
- Exposition (Content Areas, e.g. history): Does the writer understand the topic?
Levels of Assessment
1. Holistic Scoring
2. Trait and Componential Scoring
3. Annotation
4. Situated Value Judgments
Which levels are open to automated scoring?
Taxonomy of essay assessment

Levels of Assessment    Language Arts (composition)    Content Areas (exposition)
Level 1                 Holistic Score                 Knowledge
Level 2                 Trait Scores                   Analytics
Level 3                 Annotations                    Local Errors
Level 4                 Situated Value Judgments       Truth Values
Architecture of scoring systems
Intelligent Essay Assessor™ technologies:
- Latent Semantic Analysis for scoring quality of content and providing tutorial feedback
- Style and mechanics measures for scoring, and for validating that an essay is appropriate to the task
Student essays written to directed prompts:
- A constructed-response alternative to multiple choice for domain-knowledge assessment
- Directed essay questions or summaries
- Reliable, objective, consistent, and immediate
- Used as a second reader, and for formative evaluations, diagnostic tutorials, and interactive textbooks
Architecture of scoring systems

[Diagram: expert-scored essays train a Customized Reader that combines CONTENT measures (variance, VL confidence), STYLE measures (coherence), and MECHANICS measures (character count, misspelled words), weighted as % Content / % Style / % Mechanics, into an Overall Score, with a Validation and/or Plagiarism check.]
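The diagram combines the component measures into a single overall score. The sketch below illustrates one simple way such a weighted blend could work; the weights, function name, and 0-1 score scale are illustrative assumptions, not the Intelligent Essay Assessor's actual parameters.

```python
# Minimal sketch of blending component measures into an overall score.
# Weight values are illustrative assumptions only.
def overall_score(content: float, style: float, mechanics: float,
                  w_content: float = 0.6, w_style: float = 0.25,
                  w_mechanics: float = 0.15) -> float:
    """Weighted blend of component scores, each on a common 0-1 scale."""
    assert abs(w_content + w_style + w_mechanics - 1.0) < 1e-9
    return w_content * content + w_style * style + w_mechanics * mechanics

print(overall_score(content=0.8, style=0.7, mechanics=0.9))  # -> 0.79
```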
Latent Semantic Analysis
- LSA is both a psychological theory of knowledge representation and a computational modeling and application tool
- LSA learns the relationships between text documents and their constituent words (terms) when trained on large numbers of background texts (thousands to millions)
- Each term, document, or new combination of terms (a new document) is represented as a point in a high-dimensional "semantic space" (300-500 dimensions, not 2 or 3)
- LSA effectively measures semantic content against prescribed standards of quality based on human judgments
- Extensive and varied research shows LSA judgments of similarity agree well with human judgments
Meaning-based representation
LSA is NOT simple co-occurrence
- Over 99% of word pairs whose similarities are induced never appear together in a single context (paragraph)
- Synonyms are rarely seen in the same context
LSA is NOT simple keyword matching
- LSA operates on the deep (latent) meaning of words rather than on surface characteristics (exact matches)
Word-word cosine similarities in the LSA semantic space:

            attorney   lawyer   surgeon   physician   doctor
attorney        1       0.73      0.09       0.05       0.03
lawyer                  1         0.13       0.06       0.06
surgeon                           1          0.65       0.64
physician                                    1          0.61
doctor                                                  1
Sentence-sentence cosine similarities:

                                          (1)      (2)      (3)
(1) He is the car doctor.                  1       0.35     0.49
(2) The physician is in surgery.                   1        0.86
(3) The doctor operates on the patient.                     1
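Matrices like the two above can be produced by training a semantic space and comparing vectors in it. Below is a minimal sketch using scikit-learn's TruncatedSVD as the SVD step; the toy corpus and 5 dimensions are illustrative assumptions (a real space is trained on thousands to millions of texts with 300-500 dimensions), so the numbers it prints will not match the slide's.

```python
# Minimal sketch of an LSA "semantic space" and cosine similarity matrices
# like those above. The toy corpus and dimension count are stand-ins only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.metrics.pairwise import cosine_similarity

background_texts = [
    "The doctor examined the patient at the clinic.",
    "The surgeon performed the operation on the patient.",
    "The physician prescribed medicine for the illness.",
    "The lawyer argued the case before the court.",
    "The attorney filed a motion for the client.",
    "The mechanic repaired the engine of the car.",
]

lsa = make_pipeline(
    TfidfVectorizer(),             # stand-in for log-entropy weighting
    TruncatedSVD(n_components=5),  # "semantic space" dimensions
)
lsa.fit(background_texts)

sentences = [
    "He is the car doctor.",
    "The physician is in surgery.",
    "The doctor operates on the patient.",
]
vecs = lsa.transform(sentences)   # fold new passages into the space
print(cosine_similarity(vecs))    # pairwise similarity matrix
```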
Latent Semantic Analysis
What features of LSA are most important?
- It is a fully automated model of memory, with training data of the same magnitude as human experience
- It begins with first-order local associations between a stimulus and other temporally contiguous stimuli
- It represents concepts and contexts (episodes) in the same way, conjointly learning about concepts from their natural contexts and about contexts from their constituent concepts
- No explicit hand-coding of rules or features
- An induction stage for generalization
- High-dimensional vector mathematics offers neurologically plausible computations
- Not claimed to be a comprehensive model
What features of LSA are ad hoc?
Chosen based on performance in applications, not on the requirements of cognitive models…
- Singular Value Decomposition (SVD) as the induction mechanism; many other candidate algorithms have emerged
- SVD is tractable at scale: a 750K × 10M term-by-document matrix can be reduced to 300 dimensions on an 8-node Beowulf cluster in 20-30 hours
- Emphasis on easily parsable symbol systems, e.g. text: text is relatively easy to work with compared to visual data, and LSA is now applied to other symbol systems, e.g. genetic codes
- Text pre-processing specifics: local log weighting, global entropy weighting
- Similarity metrics (cosine, Euclidean distance, etc.)
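The "local log, global entropy" weighting named above can be written down directly. A minimal sketch, assuming a dense term-by-document count matrix (real systems use sparse matrices at the 750K × 10M scale mentioned, and many more dimensions):

```python
import numpy as np

def log_entropy_weight(tf: np.ndarray) -> np.ndarray:
    """Classic LSA pre-processing: local log weighting times a global
    entropy weight, applied to a term-by-document count matrix."""
    n_docs = tf.shape[1]
    # p_ij = share of term i's total occurrences that fall in document j
    p = tf / np.maximum(tf.sum(axis=1, keepdims=True), 1)
    with np.errstate(divide="ignore", invalid="ignore"):
        ent = np.where(tf > 0, p * np.log(p), 0.0).sum(axis=1)
    g = 1.0 + ent / np.log(n_docs)        # global entropy weight per term
    return np.log(tf + 1.0) * g[:, None]  # local log weight * global weight

# Induction step: truncated SVD of the weighted matrix (toy sizes here).
tf = np.random.default_rng(0).poisson(0.3, size=(500, 200)).astype(float)
X = log_entropy_weight(tf)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
term_vectors = U[:, :50] * s[:50]         # 300-500 dims in practice
```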
Performance assessment of system
[Results charts comparing Intelligent Essay Assessor scores with expert human readers; see Editor's Notes]
- Focus is on quality of content as judged by people, rather than on measures of surface features and keywords
- Uses background knowledge of the domain in assessment, in addition to previously scored essays
- Measures what students are saying, rather than just how well they are saying it
- Does best when linked to course learning materials: provides formative assessment of domain knowledge with tutorial feedback, rather than just a simple overall score
- Requires fewer training essays (100 vs. 500)
- More difficult to 'coach' a student in ways to receive an artificially high score (e.g. "use semicolons" or say "Thus" and "Therefore")
- Models do NOT use any count variables (word count, etc.)
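One widely described LSA scoring scheme, consistent with the "expert scored essays" input in the architecture slide, predicts a new essay's score from the scores of its nearest neighbors in the semantic space. The sketch below illustrates that idea; the essays, scores, k, and dimension count are all placeholder assumptions, not the Intelligent Essay Assessor's actual training data or settings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical expert-scored training essays (placeholders).
train_essays = [
    "Mainframes are large computers with advanced processing capability.",
    "A mainframe serves many clients with fast processors and large storage.",
    "Computers are machines. They compute things.",
    "Mainframes run multiple large applications for big organizations.",
]
train_scores = np.array([4.0, 5.0, 1.0, 4.0])

lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=3))
train_vecs = lsa.fit_transform(train_essays)

def score_essay(text: str, k: int = 3) -> float:
    """Cosine-weighted average of the k most similar scored essays."""
    sims = cosine_similarity(lsa.transform([text]), train_vecs).ravel()
    top = np.argsort(sims)[::-1][:k]
    weights = np.clip(sims[top], 0.0, None)
    return float((weights * train_scores[top]).sum() / max(weights.sum(), 1e-9))

print(score_essay("Mainframes have fast processors and serve many users."))
```

Note that, in line with the last bullet above, this scheme uses no count variables: the prediction depends only on content similarity to previously scored essays.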
Two student essays on the same prompt (quoted verbatim):

MAINFRAMES
Mainframes are primarily referred to large computers with rapid, advanced processing capabilities that can execute and perform tasks equivalent to many Personal Computers (PCs) machines networked together. It is characterized with high quantity Random Access Memory (RAM), very large secondary storage devices, and high-speed processors to cater for the needs of the computers under its service. Consisting of advanced components, mainframes have the capability of running multiple large applications required by many and most enterprises and organizations. This is one of its advantages. Mainframes are also suitable to cater for those applications (programs) or files that are of very high demand by its users (clients). Examples of such organizations and enterprises using mainframes are online shopping websites such as Ebay, Amazon, and computing-giant Microsoft.

MAINFRAMES
Mainframes usually are referred those computers with fast, advanced processing capabilities that could perform by itself tasks that may require a lot of Personal Computers (PC) Machines. Usually mainframes would have lots of RAMs, very large secondary storage devices, and very fast processors to cater for the needs of those computers under its service. Due to the advanced components mainframes have, these computers have the capability of running multiple large applications required by most enterprises, which is one of its advantage. Mainframes are also suitable to cater for those applications or files that are of very large demand by its users (clients). Examples of these include the large online shopping websites -i.e.: Ebay, Amazon, Microsoft, etc.
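The two near-identical passages above are the kind of pair the architecture's plagiarism check is meant to catch. Below is a minimal sketch of one way to flag such pairs by cosine similarity; the threshold value and the use of plain TF-IDF (rather than the full LSA space) are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def flag_plagiarism(essay_a: str, essay_b: str, threshold: float = 0.85):
    """Return the cosine similarity of two essays and whether it crosses
    an (assumed) plagiarism threshold."""
    tfidf = TfidfVectorizer().fit_transform([essay_a, essay_b])
    sim = float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])
    return sim, sim >= threshold

# The two MAINFRAMES passages above would score near 1.0 and be flagged.
```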
 


Editor's Notes

  • #2 Emulating human essay scoring with machine learning methods. Darrell Laham, Thomas K. Landauer, and Peter W. Foltz, Knowledge Analysis Technologies.

Automated essay scoring technologies have now been used in education applications for several years. Commercial systems such as eRater by the Educational Testing Service (ETS) and the Intelligent Essay Assessor by Knowledge Analysis Technologies (K-A-T) are used daily by students and scoring services in both low- and high-stakes tests. For example, ETS uses eRater and a single human reader to score the high-stakes Graduate Management Admission Test (GMAT) essay items. In most high-stakes tests, two human readers are employed, and when their scores differ significantly a third expert reader is brought in to resolve the disagreement. The same holds for the GMAT case: when the computer's score significantly disagrees with the human's, a second human expert reader is engaged for resolution. The Intelligent Essay Assessor scores and provides feedback for thousands of essays daily as students practice writing for high-stakes exams. Automated systems from K-A-T have proven to be as reliable as humans in scoring applications as diverse as middle and high school narrative and expository essays, university-level subject matter tests (e.g. biology, psychology, and information technology), military memoranda, and assessment of medical students' diagnostic notes for simulated patients. In this paper we review the types of assessment open to automated scoring, the general architecture of scoring systems, the variety of language modeling methods employed, and implications for other language-centered human emulation applications.

Brief outline:
1. Types of assessment open to automated scoring: holistic scoring (overall quality), trait scoring (facets of writing), content analysis, grammar and mechanics error annotations
2. The general architecture of scoring systems: training for specific directed questions or prompts; combining independent measures to maximize prediction value
3. Language modeling methods: measurements of surface features of text; keyword and topic spotting; natural language processing; Latent Semantic Analysis
4. Implications for human emulation applications: accuracy of assessments; fooling the system; feedback mechanisms and prescriptive advice

Focus on Latent Semantic Analysis (LSA): Latent Semantic Analysis is the primary method employed by K-A-T in its scoring applications, and this paper reviews LSA in more detail than the other methods. LSA can be viewed as a method for unsupervised training of a network that associates two classes of events reciprocally by linear connections through a single hidden layer. LSA has been used to learn and represent relations among very large numbers of word types (100K-500K) and very large numbers of natural text passages (1M-10M) in which they occurred. The result is a 300-500 dimensional "semantic space" in which any trained or newly added word or passage can be represented as a vector, with similarities measured by the cosine of the contained angle between vectors or the Euclidean distance between them. In addition to reliable essay scoring, good accuracy in simulating human judgments and behaviors has been demonstrated by performance on multiple-choice vocabulary and domain knowledge tests, sorting and classification tasks, and in several other ways.
Traditionally, imbuing machines with human-like knowledge has relied primarily on explicit coding of symbolic facts into computer data structures and algorithms. A serious limitation of this approach is people's inability to access and express the vast reaches of unconscious knowledge on which they rely, knowledge based on masses of implicit inference and irreversibly melded data. A more important deficiency of this state of affairs is that by coding the knowledge ourselves, (as we also do when we assign subjectively hypothesized rather than objectively identified features to input or output nodes in a neural net) we beg important questions of how humans acquire and represent the coded knowledge in the first place. Thus, from both engineering and scientific perspectives, there are reasons to try to design learning machines that can acquire human-like quantities of human-like knowledge from the same sources as humans. The success of such techniques would not prove that the same mechanisms are used by humans, but because we presently do not know how the problem can be solved in principle, successful simulation may offer theoretical insights as well as practical applications. In the work reported in this paper we have found a way to induce significant amounts of knowledge about the meanings of passages and of their constituent vocabularies of words by training on large bodies of natural text. In general terms, the method simultaneously extracts the similarity between words (the likelihood of being used in passages that convey similar ideas) and the similarity between passages (the likelihood of containing words of similar meaning). The conjoint estimation of similarity is accomplished by a fundamentally simple representational technique that exploits mutual constraints implicit in the occurrences of very many words in very many contexts. We view the resultant system both as a means for automatically learning much of the semantic content of words and passages, and as a potential computational model for the process that underlies the corresponding human ability.
  • #15 On over 4,000 essays on 15 diverse subjects in real course settings, Intelligent Essay Assessor results were indistinguishable from those of expert human readers.
Exams: GMAT 1 (analysis of an issue), GMAT 2 (analysis of an argument), grade-school narrative writing
Total number of essays: 2,263 (1,286 used for training, 977 to test the system)
Average Reader 1 to Reader 2 agreement: 0.86
Average IEA to single-reader agreement: 0.85
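The note above does not state which agreement statistic was used; a Pearson correlation between the two readers' scores is one common choice in this literature. A minimal sketch under that assumption, with made-up scores:

```python
import numpy as np

# Minimal sketch: inter-rater agreement as the Pearson correlation between
# two readers' scores on the same essays. Whether the 0.86 / 0.85 figures
# above are Pearson correlations is an assumption; these scores are made up.
reader1 = np.array([4, 3, 5, 2, 4, 3, 5, 1])
reader2 = np.array([4, 3, 4, 2, 5, 3, 5, 2])
r = np.corrcoef(reader1, reader2)[0, 1]
print(f"Reader 1 vs Reader 2 agreement: {r:.2f}")
```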
  • #22 On over 4,000 essays on 15 diverse subjects in real course settings Intelligent Essay Assessor results were indistinguishable from those of expert human readers.