Artificial Unintelligence: Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation

Les Perelman
Comparative Media Studies / Writing
MIT
Definition of Terms
Automated Essay Scoring
(AES)
• Computer produces
summative assessment for
evaluation
Automated Essay Evaluation
(AEE)
• Computer produces
formative assessment and
responses for learning
Overview
1. Brief recounting of mass-market writing
assessment in the United States
2. AES: how it works and its major flaws
3. The Turing Test: evaluating AES
4. AEE: a brief overview
5. AEE: evaluating current implementations
6. AEE: what we can reasonably hope to achieve
– Writelab
7. Demonstration: Playing with the BABEL
Generator
The First College Board Entrance
Examination in English − June 1901
• The two sides of the character of Achilles as
shown in The Iliad. Illustrate each and tell
whether we find anything like this contrast in
the character of Hector.
• At least two pages
• Four hours to write -- two in the morning & two
in the afternoon
SAT Essay June 2005
• Think carefully about the issue presented in the
following excerpt and the assignment below.
– Most of our schools are not facing up to their
responsibilities. We must begin to ask ourselves whether
educators should help students address the critical moral
choices and social issues of our time. Schools have
responsibilities beyond training people for jobs and getting
students into college.
• Adapted from Svi Shapiro
• Assignment:
– Should schools help students understand moral choices
and social issues?
• 25 minutes
The timed impromptu is an unnatural act
• The timed
impromptu does
not occur in the
real world
• No one writes on
demand without
reflecting about
a topic they may
never have
thought about
Why the change?
• Reliability
– Godshalk, F. et al.
The Measurement of
Writing Ability ETS
1966
– A. Myers et al. (1966)
Simplex structure in
the grading of essay
tests. Educational and
Psychological
Measurement, Vol
26(1), 1966, 41-54.
Where the reliability comes from:
Correlation between length and score is a
negative function of time allotted
[Chart: shared variance between # of words and score (0% to 60%), grouped by time allotted: 25 min, 1 hr., 72 hrs. Sample sizes shown: N=247, N=6498, N=2820, N=115, N=106, N=106, N=660, N=1458, N=798.]
The greater the time allotted, the smaller the correlation
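Shared variance here is the square of the Pearson correlation (r²) between word count and score. A minimal sketch of that calculation follows; the function names and the sample numbers are illustrative assumptions, not data from any of the studies charted above.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


def shared_variance(word_counts, scores):
    """Proportion of score variance shared with essay length (r squared)."""
    r = pearson_r(word_counts, scores)
    return r * r


# Made-up numbers for illustration only.
word_counts = [120, 210, 260, 340, 400, 480]
scores = [1, 2, 3, 3, 5, 6]
print(f"shared variance: {shared_variance(word_counts, scores):.1%}")
```

A value near 100% would mean that essay length alone nearly determines the score; a value near 0% would mean length tells you almost nothing about it.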
College Board’s ScoreWrite
Colbert Report on SAT Word Length
Graders were trained to read for
length and pretentious diction
Ellis B. Page – Project Essay Grade
• Trin -- Intrinsic variable of interest (e.g. word choice, diction;
sentence complexity)
• Prox – “some variable which it is hoped will approximate the
variable of true interest”
e-Rater construct
Quinlan, T., Higgins, D., & Wolff, S. (2009) Evaluating the Construct-
Coverage of the e-rater® Scoring Engine. ETS Research Report 09-01. p. 15
E-rater 2.0 Proxies
• Organization = # of Discourse Elements (i.e.
paragraphs)
• Development = Length of Discourse Elements
(i.e. # of sentences & # of words in paragraphs)
• Lexical Complexity = average word length +
frequency of infrequently used words + absence
of frequently repeated words
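To make the proxy idea concrete, here is a rough sketch of how surface counts of this kind can be computed. It is not e-rater's implementation: the function name, the paragraph and word splitting rules, and the choice of three features are all assumptions made for illustration.

```python
import re

# Hypothetical tokenizer: runs of letters and apostrophes count as words.
WORD_PATTERN = re.compile(r"[A-Za-z']+")


def surface_proxies(essay: str) -> dict:
    """Compute simple surface counts that stand in for writing traits."""
    # Assumption: paragraphs are separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", essay) if p.strip()]
    words_per_paragraph = [len(WORD_PATTERN.findall(p)) for p in paragraphs]
    words = WORD_PATTERN.findall(essay)
    return {
        # Stand-in for "organization": how many paragraphs there are.
        "num_paragraphs": len(paragraphs),
        # Stand-in for "development": how long the paragraphs are.
        "mean_words_per_paragraph": (
            sum(words_per_paragraph) / len(paragraphs) if paragraphs else 0.0
        ),
        # Stand-in for "lexical complexity": average word length.
        "mean_word_length": (
            sum(len(w) for w in words) / len(words) if words else 0.0
        ),
    }


sample = "One two three.\n\nFour five."
print(surface_proxies(sample))
```

Nothing in these counts looks at meaning, which is why text with many paragraphs and long words can do well on such measures whether or not it makes sense.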
Machines Consistently Overvalue
Essay Length
[Chart: Hewlett ASAP Study (2012). Average shared variance (r²) between # of words and score for AES machines and human readers, 0% to 90%, by essay set (7 of 9 vendors; 2 vendors would not allow data to be released).
Essay sets: 1 Argument; 2A Argument, Holistic; 2B Argument, Grammar; 3 Literary Analysis; 4 Literary Analysis; 5 Reading Summary; 6 Reading Summary; 7 Narrative Composite; 8 Narrative Composite.
Series: average of vendors; average of human readers.]
AES Machines Maintain Artificial
Correlations Through the Steroids of
Word Counting
[Charts: share of score variance attributable to # of words versus other factors, shown for Reader 1, Reader 2, and the AES machine; shared variance with # of words compared for human and machine.]
But what about voice recognition?
• Relatively very small set + Moore’s Law
What kind of writing AES can’t grade
• Long essays
– ETS’s e-rater has a 1,000 word limit
• Broad and open Writing Tasks
– Two AES machines could not approximate scores
on the essay portion of the Australian Scholastic
Aptitude Test that called for a fixed length (600
word) essay on a fairly open topic McCurry (2010)
The Turing Test: The Imitation Game
CAPTCHA (Completely Automated Public Turing
test to tell Computers and Humans Apart)
The Reverse Turing Test
Coherent
prose
Gibberish
The Basic Automatic BS Essay
Language (BABEL) Generator
http://babel-generator.herokuapp.com/
What we can conclude
• The software does not do what it tells
students and teachers it is doing
• The metrics (proxies) used are irrelevant, at
best, and, probably, are largely antithetical to
good writing or communication.
• Students can probably be trained to memorize
language and strategies to obtain high scores
(construct-irrelevant-strategies)
Some Evidence that Students Are
Using BABEL to Game AEE Products
4. What is meant by a “good faith” essay?
It is important to note that although PEG software is extremely reliable in terms of
producing scores that are comparable to those awarded by human judges, it can be
fooled. Computers, like humans, are not perfect.
PEG presumes “good faith” essays authored by “motivated” writers. A “good faith”
essay is one that reflects the writer’s best efforts to respond to the assignment and
the prompt without trickery or deceit. A “motivated” writer is one who genuinely
wants to do well and for whom the assignment has some consequence (a grade, a
factor in admissions or hiring, etc.).
Efforts to “spoof” the system by typing in gibberish, repetitive phrases, or off-topic,
illogical prose will produce illogical and essentially meaningless results.
Most of the Studies Are Conducted
and / or Controlled by the Vendors
Important Unanswered Questions
1. How easy will it be for students to “game” these
machines?
2. When essays are read by a human reader and a
machine and there is a discrepancy between scores,
after the adjudication procedure, what percentage of
the machine’s scores are omitted or changed
compared to the scores of the human reader?
3. When gamed essays are read by a reader and a
machine, will the human reader’s score always catch
the gamed score?
4. Can human readers also be “gamed”?
Negative Consequences
• What is tested is what is taught
• Emphasis on short writing
• Emphasis on impromptu on-demand writing
When can AES be useful?
• Grading short content-based writing
– Already useful applications
• Use in MOOCs in conjunction with Peer
Review Applications such as Calibrated Peer
Review
Why are the testing companies so in
love with AES?
• ῥίζα γὰρ πάντων τῶν κακῶν ἐστιν ἡ
φιλαργυρία
• Radix omnium malorum est cupiditas
• The love of money is the root of all evil
• 1st Timothy 6:10
Chase the Moneychangers out of the
Temple
Three Proposals
• First, some sort of professional system of
disclosure for large sums of money, let’s say
more than $10,000, received from outside
professional organizations such as the College
Board.
– With textbooks, the disclosure is transparent.
Second, Grass roots development of
several different Honors English and
Writing curricula
• Teach skills students will need in college
• Developed jointly by high school and college
teachers
• Developed through organizations such as NCTE,
WPA, and NWP & College Admissions
• Accompanied by some sort of certification procedure
• Pacesetter Program as model
Create Tests of Our Own
• Design Criteria:
– Valid
– Fair
• Coaching has minimal effect
• Does not discriminate against bilingual or bidialectical
students
– Feasible
• Not College Board’s bad version of reverse engineering
– “The plane doesn’t fly, but we can make money on
it”
– Transparent design, development, and
administration
Show warts and all
Decentralized Development and Grading
Opportunities for Professional Development
Test to the Teaching
• Different tests and testing communities for
different approaches
• Technology enabled
• Diverse and linked group of readers
– Opportunity to address problems of low-
performing minorities
– Show students that their essay will be read by a
diverse group of readers
So let us
• Urge our schools to stop using the SAT, ACT, &
even AP
• Begin a conversation with professional
organizations to involve
– K-12 teachers
– College admission officers
– College teachers
To envision new and different kinds of writing
tests
Act
• Act to divert some of the billions spent on testing to
improve teaching
• Act to reclaim testing from business and bring it back
to education
• Act to make testing a form of learning not only for
students but also for us
• And act by doing sound research on writing and
testing; because if we don’t do it we are leaving it to
people like the gang at Pearson.
Automated Essay Scoring (AES)
becomes Automated Essay Evaluation
(AEE)
• Teaching writing in the classroom
First-generation retrofitted trait
numeric scores
My Access / IntelliMetric
• Holistic Writing Score: 5.3 / 6.0 88%
Some advice makes writing less effective
Grammar Checkers are Unreliable
Nick Carbone’s Comparison of Grammarly, MS Word,
and WriteCheck
Measure                                               Grammarly   MS Word   WriteCheck (e-rater)
# errors flagged                                      52          30        23
# misdiagnosed, false positives, or poorly explained  11          8         14
% misdiagnosed, false positives, or poorly explained  21%         27%       61%
http://nccei12carbone.blogspot.com/2012/10/an-experiment-with-grammar-checkers.html
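The percentages in the comparison above are the problem flags divided by all flags for each checker. A quick check of that arithmetic, using only the counts reported on the slide (the variable and function names are illustrative):

```python
# Counts as reported in the comparison above.
flagged = {"Grammarly": 52, "MS Word": 30, "WriteCheck (e-rater)": 23}
problematic = {"Grammarly": 11, "MS Word": 8, "WriteCheck (e-rater)": 14}


def problem_rate(tool: str) -> int:
    """Percent of a tool's flags that were misdiagnosed, false, or unclear."""
    return round(100 * problematic[tool] / flagged[tool])


for tool in flagged:
    print(f"{tool}: {problem_rate(tool)}% of flags were problematic")
```

The tool that flagged the fewest errors also had the highest share of problematic flags.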
MS Word Grammar & Style Checker is
also not infallible
Dean Mark D. Shermis on Microsoft
Word and AEE
The feedback provided by the Web-based software is both quantitative and
qualitative. That is, in addition to an overall rating, students may receive
scores on individual attributes of writing, and the software may summarize or
highlight a variety of errors, ranging from simple grammar to style or content.
Some of the software packages also provide a discourse analysis of the work
Shermis, M. (May 11, 2012). How automated grading can make good writers. Los Angeles Times
Category: Usage
• Missing or Extra Article
– Rather than rely on commercials or expert
opinions about a film, individuals often make their
viewing choices based on blogs and the [1] collected
reviews of peers on various sites on the Internet.
• [1] You may need to remove this article.
Category: Usage
• Type: Confused Words
– Because the consumption of entertainment is so
ephemeral, one could posit that advertising might
affect [1] a consumer's decision more than when by
buying a durable good, such as a toaster or a
blender.
• [1] You have used affect in this sentence. You may need to
use effect instead.
Category: Usage
• Type: Preposition Error
– Rather than rely on commercials or expert
opinions about [1] a film, individuals often make
their viewing choices based on blogs and the
collected reviews of peers on various sites on the
Internet.
• [1] You may be using the wrong preposition.
Category: Organization & Development
• Type: Thesis Statement
– The question of how important advertising is to the sale of any product
is an important one. This question is extremely important in the media
industries. [1] Because the consumption of entertainment is so
ephemeral, one could posit that advertising might affect a consumer's
decision more than when by buying a durable good, such as a toaster
or a blender. [2] The advertising department of the Silver Screen Movie
Production Company has recommended spending more on advertising
and less on movie production. The advertising director's arguments
are not only self-serving, but also logically flawed and, at the least,
inconclusive, resting on several very dubious assumptions.
• [1] Is this part of the essay your thesis? The purpose of a thesis is to organize,
predict, control, and define your essay. Look in the Writer's Handbook for ways
to improve your thesis.
• [2] Is this sentence really a part of your thesis? Remember that a thesis controls
the whole content of your essay. You need to strengthen this thesis so that you
clearly state the main point you will be making. Look in the Writer's Handbook
for tips on doing this.
Category: Organization &
Development
• Type: Supporting Ideas
– First, the motives behind this particular argument
need to be questioned. In essence, the advertising
director is arguing that resources should be taken
away from producing films and given to his
department. [1] Although people often make
reasonable requests in their own self-interest, that
this policy would greatly enhance the director's
fiefdom is a consequence that should elicit some
skepticism. [1]
• [1] Criterion has identified only two sentences to support
your topic sentence. Try to include one more sentence in
this paragraph.
The Missing Piece in Research on
Classroom Use
• Controlled
experiments to avoid
placebo effect
• Comparison with the
default writing tool
of the 21st century,
MS Word.
Both Pearson and Measurement Inc. Concede
that Grammar Checkers are Imperfect
http://doe.sd.gov/oats/documents/WToLrnFAQ.pdf
Q: Why does the grammar check not catch all of a student’s
errors?
A: The technology that supports grammar check features in
programs such as Microsoft Word often return false
positives. Since WriteToLearn is a educational product, the
creators of this program have decided, in an attempt to not
provide students with false positives, to err on the side of
caution. Consequently, there are times when the grammar
check will not catch all of a student’s errors.
Teachers can address these missed grammar errors by using the
post‐it note feature within the program to flag additional errors
students might have missed.
PEG Writer
8. Why does PEG seem to ignore some grammar “trouble spots” identified
by Microsoft Word (or other programs)?
PEG’s grammar checker can detect and provide feedback for a wide variety
of syntactic, semantic and punctuation errors. These errors include, but are
not limited to, run-on sentences, sentence fragments and comma splices;
homophone errors and other errors of word choice; and missing or
misused commas, apostrophes, quotation marks and end punctuation. In
addition, the grammar checker can locate and offer feedback on style
choices inappropriate for formal writing.
Unlike commercial grammar checkers, however, PEG only reports those
errors for which there is a high degree of confidence that the “error” is
indeed an error. Commercial grammar checkers generally implement a
lower threshold and as a result, may report more errors. The downside is
they also report higher number of “false positives” (errors that aren’t
errors). Because PEG factors these error conditions into scoring decisions,
we are careful not to let “false positives” prejudice an otherwise well
constructed essay.
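The trade-off described in this FAQ can be illustrated with a toy confidence threshold. This is a sketch only: the `Flag` class, the confidence values, and the two thresholds are hypothetical and say nothing about how PEG or any commercial checker actually works.

```python
from dataclasses import dataclass


@dataclass
class Flag:
    message: str
    confidence: float    # hypothetical score between 0 and 1
    is_real_error: bool  # ground truth, known only in this toy example


def summarize(flags, threshold):
    """Count what a checker reports and what it misses at a given threshold."""
    shown = [f for f in flags if f.confidence >= threshold]
    return {
        "reported": len(shown),
        "false_positives": sum(1 for f in shown if not f.is_real_error),
        "missed_errors": sum(
            1 for f in flags if f.is_real_error and f.confidence < threshold
        ),
    }


# Invented flags for illustration.
FLAGS = [
    Flag("comma splice", 0.95, True),
    Flag("sentence fragment", 0.80, True),
    Flag("wrong preposition", 0.55, False),
    Flag("affect/effect", 0.50, False),
    Flag("missing article", 0.45, True),
]

print("high threshold:", summarize(FLAGS, 0.9))
print("low threshold: ", summarize(FLAGS, 0.4))
```

Raising the threshold removes the false positives in this example but also hides two real errors, which is the same trade-off both vendor FAQs concede.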
What kind of AEE can be useful & effective?
Basic design principle
Primum non nocere!
Focus on style
• MS Word is flawed but it may be hard to build
something better that won’t confuse students
• What can be emphasized is style:
– Clarity
– Cohesion
– Emphasis
– Concision
– Elegance
• e. g. Parallel structures
What we need to do to build effective
AEE tools
• Start with General Principles
– Then use statistical modeling
– Follow the model of the development of voice
recognition apps
• Transparency
• Independent Research
The right way: by asking questions not
giving answers
Letting the student own the process
Products should be transparent in
displaying their limitations – again,
showing warts and all
The Real Danger of AES and bad AEE:
Widening the Educational Divide
• Private well-endowed institutions
do not use AES
• Flawed AEE will be used in large
classes to “give students more
opportunities to write” poorly
• But what flawed AEE teaches not
only dumbs down the ability to
communicate – it has the
potential to almost totally
eliminate it
But AEE also provides a real opportunity to
provide a cheap accessible tool to teach
and improve writing in multiple contexts
• In classrooms
• Writing at home
• In the workplace
Demo of BABEL Generator
• http://babel-generator.herokuapp.com/
• https://www.dxrgroup.com/cgi-bin/scoreitnow/password.pl
Break Up into Six Groups for
MyAccess Experiment
• Hypothesis: MyAccess will give high scores to computer-generated
gibberish
• Login to MyAccess
– http://www.vantagelearning.com/login/myaccess-home-edition/
Username Password
studentone one
studentwo two
studentthree three
studentfour four
studentfive five
studentsix six
• Open new window and open BABEL Generator
http://babel-generator.herokuapp.com/
Instructions
1. Click ASSIGNMENTS
2. Select Ages 15-18
3. Select one of the following topics (suggested keywords for
BABEL generator are in parentheses) & click START ESSAY
or START REVISION
a. Nature v. nurture (nature, nurture)
b. A sense of wonder (wonder)
c. Invasion of privacy (privacy)
d. Rating movies, music, & video games (violence, obscenity,
teenagers)
e. What makes a good coach (coach, encouragement)
4. Open another window, open BABEL Generator, generate
essay, & then copy and paste it into the MY ACCESS
window
5. Click SUBMIT ESSAY
References
• Attali, Yigal & Burstein, J. (2006). Automated essay scoring with e-rater V.2. Journal of Technology,
Learning, and Assessment. 4:3. pp. 1-29.
• Bejar, I. I., Flor, M., Futagi, Y., Ramineni, C. (2014). On the vulnerability of automated scoring to
construct-irrelevant response strategies (CIRS): An illustration. Assessing Writing 22 pp. 43-59.
• Condon, W. (2013). Large-scale assessment, locally-developed measures, and automated scoring of
essays: Fishing for red herrings? Assessing Writing 18, pp. 100-108.
• Dikli, S. & Bleyle, S. (October 2014). Automated Essay Scoring feedback for second language writers:
How does it compare to instructor feedback?, Assessing Writing, 22, pp. 1-17.
http://www.sciencedirect.com/science/article/pii/S1075293514000221.
• Elliot, N. & Klobucar, A. “Automated Essay Evaluation and the teaching of writing.” Handbook of
Automated Essay Evaluation: Current Applications and New Directions. Ed. Mark D. Shermis, Jill
Burstein, and Sharon Apel. London: Routledge. June 2013.
• Freitag Ericsson, P. & Haswell, R. Ed. (2006) Machine Scoring of Student Essays: Truth or
Consequences. Logan, UT: Utah State UP
• Godshalk, F. I, Swineford, F., Coffman, W. E. (1966). The Measurement of Writing Ability. New York,
NY: College Entrance Examination Board.
• Haudek, K. C. et al. (2012). What are they thinking? Automated analysis of student writing about
acid-based chemistry in introductory biology. CBE—Life Sciences Education 11, pp. 149-155.
• Haudek, K. C. et al. (2011). Harnessing technology to improve formative assessment of student
conceptions in STEM: Forging a national network. CBE—Life Sciences Education 10, pp. 283-293.
• Herrington, A. & Moran, C. (2012). When writing to a machine is not writing at all. Writing
Assessment in the 21st Century: Essays in Honor of Edward M. White. Ed. Norbert Elliot and Les
Perelman. New York, NY: Hampton Press, pp.219-232
• Higgins, D. & Heilman, M. Managing what we can measure: Quantifying the susceptibility of
Automated Scoring Systems to gaming behavior. Educational Measurement 33:3. pp. 36-46.
• Klobucar, Andrew, Paul Deane, Norbert Elliot, Chaitanya Ramineni, Perry Deess, & Alex
Rudniy. (2012).“Automated Essay Scoring and the Search for Valid Writing Assessment.”
International Advances in Writing Research: Cultures, Places, Measures. Ed. Charles.
Bazerman, Chris Dean, Jessica Early, Karen Lunsford, Suzie Null, Paul Rogers, and Amanda
Stansell. Fort Collins, Colorado: WAC . pp. 103-119
http://wac.colostate.edu/books/wrab2011/chapter6.pdf
• McCurry, D. (2010). Can machine scoring deal with broad and open writing tests as well as
human readers? Assessing Writing. Vol. 15 pp. 118–129
• Morgan, J., Shermis, M. D., Van Deventer, L., & Vander Ark, T. (2013). Automated Student
Assessment Prize: Phase 1 & Phase 2: A case study to promote focused innovation in student
writing assessment. Seattle, WA: Getting Smart.
http://gettingsmart.com/cms/wp-content/uploads/2013/02/ASAP-Case-Study-FINAL.pdf
• Nehm, R. H., Ha, M., Mayfield, E. (2011). Transforming biology with machine learning:
automated scoring of written evolutionary explanations. Journal of Science Education and
Technology. 21:1, pp. 183-196
• Page, E. B.(1966). The imminence of grading essays by computer. Phi Delta Kappan 76:7 pp.
561-565
• Perelman, L. (July, 2014). When ‘the state of the art’ is counting words. Assessing Writing
Vol. 21 pp. 104–111 http://dx.doi.org/10.1016/j.asw.2014.05.001
• Perelman, L. (August 2013). Critique of Mark D. Shermis & Ben Hamner,
“Contrasting State-of-the-Art Automated Scoring of Essays: Analysis.” Journal of
Writing Assessment 6:1
http://journalofwritingassessment.org/article.php?article=69
• Perelman, L. (2012). “Mass-Market Writing Assessments as Bullshit.” Writing
Assessment in the 21st Century: Essays in Honor of Edward M. White. Ed. Norbert
Elliot and Les Perelman. New York, NY: Hampton Press, 2012, pp. 425-438.
• Perelman, L. (2012). Length, Score, Time, & Construct Validity in Holistically Graded
Writing Assessments: The Case against Automated Essay Scoring (AES). In New
Directions in International Writing Research, Ed. C. Bazerman, C. Dean, K. Lunsford,
S. Null, P. Rogers, A. Stansell, and T. Zawacki. Anderson, SC: Parlor Press, pp. 121-132.
http://wac.colostate.edu/books/wrab2011/chapter7.pdf
• Powers, D. et al. (2002). Stumping e-rater: Challenging the validity of automated
essay scoring. Computers in Human Behavior. 18:2 pp. 103-134.
• Quinlan, T., Higgins, D., & Wolff, S. (2009) Evaluating the Construct-Coverage of the
e-rater® Scoring Engine. ETS Research Report 09-01.
• Sandene, B. et al. (2005) Online Assessment in Mathematics and Writing: Reports
From the NAEP Technology-Based Assessment Project, Research and Development
Series. National Center for Education Statistics Report 2005-457.
http://files.eric.ed.gov/fulltext/ED485780.pdf
• Shermis, M. D. (In press). The challenges of emulating human behavior in
writing assessment.
http://www.sciencedirect.com/science/article/pii/S1075293514000373#
• Shermis, M. D. (2014). State-of-the-art automated essay scoring:
Competition, results, and future directions from a United States
demonstration. Assessing Writing, 20, 53-76
• Shermis, M. D., & Hamner, B. (2013). Contrasting state-of-the-art
automated scoring of essays. In M. D. Shermis & J. Burstein (Eds.),
Handbook of automated essay evaluation: Current applications and new
directions . New York, NY: Routledge. pp. 313-346
• Shermis, M. (May 11, 2012). How automated grading can make good
writers. Los Angeles Times
http://articles.latimes.com/2012/may/11/news/la-ol-automated-scoring-blowback-20120510
• Stevenson, M. Phakiti, A. (2014). The effects of computer-generated
feedback on the quality of writing. Assessing Writing. 19, pp.51-65.
1 of 78

Recommended

Automated Language Assessment Scoring and impact on instruction by
Automated Language Assessment Scoring and impact on instructionAutomated Language Assessment Scoring and impact on instruction
Automated Language Assessment Scoring and impact on instructiontfarny
1.2K views28 slides
Guidelines for writing a lecture report by
Guidelines for writing a lecture reportGuidelines for writing a lecture report
Guidelines for writing a lecture reportIngrid WENDLING
654 views1 slide
Introduction to Nvivo by
Introduction to NvivoIntroduction to Nvivo
Introduction to NvivoSeth Porter, MA, MLIS
254 views17 slides
documentation-testing.ppt by
documentation-testing.pptdocumentation-testing.ppt
documentation-testing.pptRoopa slideshare
2.9K views11 slides
System and network administration network services by
System and network administration network servicesSystem and network administration network services
System and network administration network servicesUc Man
8.4K views28 slides
Como enviar una asignación y ve su calificación by
Como enviar una asignación y ve su calificaciónComo enviar una asignación y ve su calificación
Como enviar una asignación y ve su calificaciónUNIVERSIDAD DE PANAMA
14.6K views9 slides

More Related Content

What's hot

Fagan Inspection by
Fagan InspectionFagan Inspection
Fagan InspectionECC International
10.7K views14 slides
Systems Administration by
Systems AdministrationSystems Administration
Systems AdministrationMark John Lado, MIT
2.3K views53 slides
Ppt evaluation of information retrieval system by
Ppt evaluation of information retrieval systemPpt evaluation of information retrieval system
Ppt evaluation of information retrieval systemsilambu111
13K views22 slides
In-Memory Big Data Analytics by
In-Memory Big Data AnalyticsIn-Memory Big Data Analytics
In-Memory Big Data AnalyticsSupreeth M P
257 views13 slides
Nagios by
NagiosNagios
Nagiosguest7e7e305
5.4K views18 slides
Information retrieval 7 boolean model by
Information retrieval 7 boolean modelInformation retrieval 7 boolean model
Information retrieval 7 boolean modelVaibhav Khanna
295 views11 slides

What's hot(20)

Ppt evaluation of information retrieval system by silambu111
Ppt evaluation of information retrieval systemPpt evaluation of information retrieval system
Ppt evaluation of information retrieval system
silambu11113K views
In-Memory Big Data Analytics by Supreeth M P
In-Memory Big Data AnalyticsIn-Memory Big Data Analytics
In-Memory Big Data Analytics
Supreeth M P257 views
Information retrieval 7 boolean model by Vaibhav Khanna
Information retrieval 7 boolean modelInformation retrieval 7 boolean model
Information retrieval 7 boolean model
Vaibhav Khanna295 views
Hadoop Overview & Architecture by EMC
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
EMC64.6K views
Introduction to software testing by Hadi Fadlallah
Introduction to software testingIntroduction to software testing
Introduction to software testing
Hadi Fadlallah2.5K views
Introduction to online qualitative research methods by Robert Pinter
Introduction to online qualitative research methodsIntroduction to online qualitative research methods
Introduction to online qualitative research methods
Robert Pinter22.2K views
COCOMO Modal In Software Engineering By NADEEM AHMED by NA000000
COCOMO Modal In Software Engineering By NADEEM AHMED COCOMO Modal In Software Engineering By NADEEM AHMED
COCOMO Modal In Software Engineering By NADEEM AHMED
NA000000241 views
Introduction to NVivo by Marieke Guy
Introduction to NVivoIntroduction to NVivo
Introduction to NVivo
Marieke Guy3.8K views
How to Use Bibliometric Study for Writing a Paper: A Starter Guide by Nader Ale Ebrahim
How to Use Bibliometric Study for Writing a Paper: A Starter GuideHow to Use Bibliometric Study for Writing a Paper: A Starter Guide
How to Use Bibliometric Study for Writing a Paper: A Starter Guide
Nader Ale Ebrahim448 views
Software Measurement and Metrics.pptx by ubaidullah75790
Software Measurement and Metrics.pptxSoftware Measurement and Metrics.pptx
Software Measurement and Metrics.pptx
ubaidullah75790708 views
Content Management Systems in Libraries by Chris
Content Management Systems in LibrariesContent Management Systems in Libraries
Content Management Systems in Libraries
Chris2.3K views
verification and validation by Dinesh Pasi
verification and validationverification and validation
verification and validation
Dinesh Pasi29.3K views

Similar to Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation

Keynote Sally Jordan - Computer-based assessment friend or foe? - OWD14 by
Keynote Sally Jordan - Computer-based assessment friend or foe? - OWD14Keynote Sally Jordan - Computer-based assessment friend or foe? - OWD14
Keynote Sally Jordan - Computer-based assessment friend or foe? - OWD14SURF Events
705 views47 slides
Testrocker presentation by
Testrocker presentationTestrocker presentation
Testrocker presentationhsguidance
1.5K views48 slides
Embracing AI in new forms of assessment by
Embracing AI in new forms of assessmentEmbracing AI in new forms of assessment
Embracing AI in new forms of assessmentCharles Darwin University
20 views39 slides
Open Creativity Scoring Tutorial by
Open Creativity Scoring TutorialOpen Creativity Scoring Tutorial
Open Creativity Scoring TutorialDenisDumas2
30 views26 slides
Uconn Coiro Assessment 2008 by
Uconn Coiro Assessment 2008Uconn Coiro Assessment 2008
Uconn Coiro Assessment 2008Julie Coiro
603 views46 slides
What will they need? Pre-assessment techniques for instruction session. by
What will they need?  Pre-assessment techniques for instruction session.What will they need?  Pre-assessment techniques for instruction session.
What will they need? Pre-assessment techniques for instruction session.gwenexner
463 views15 slides

Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation

  • 1. Artificial Unintelligence: Why and How Automated Essay Scoring Doesn’t Work (most of the time) & the Perils and Promise of Automated Essay Evaluation Les Perelman Comparative Media Studies / Writing MIT
  • 2. Definition of Terms Automated Essay Scoring (AES) • Computer produces summative assessment for evaluation Automated Essay Evaluation (AEE) • Computer produces formative assessment and responses for learning
  • 3. Overview 1. Brief recounting of mass-market writing assessment in the United States 2. AES: how it works and its major flaws 3. The Turing Test: evaluate AES 4. AEE: a brief overview 5. AEE: evaluating current implementations 6. AEE: what we can reasonably hope to achieve – Writelab 7. Demonstration: Playing with the BABEL Generator
  • 4. The First College Board Entrance Examination in English − June 1901 • The two sides of the character of Achilles as shown in The Iliad. Illustrate each and tell whether we find anything like this contrast in the character of Hector. • At least two pages • Four hours to write -- two in the morning & two in the afternoon
  • 5. SAT Essay June 2005 • Think carefully about the issue presented in the following excerpt and the assignment below. – Most of our schools are not facing up to their responsibilities. We must begin to ask ourselves whether educators should help students address the critical moral choices and social issues of our time. Schools have responsibilities beyond training people for jobs and getting students into college. • Adapted from Svi Shapiro • Assignment: – Should schools help students understand moral choices and social issues? • 25 minutes
  • 6. The timed impromptu is an unnatural act • The timed impromptu does not occur in the real world • No one writes on demand without reflecting about a topic they may never have thought about
  • 7. Why the change? • Reliability – Godshalk, F. et al. The Measurement of Writing Ability ETS 1966 – A. Myers et al. (1966) Simplex structure in the grading of essay tests. Educational and Psychological Measurement, Vol 26(1), 1966, 41-54.
  • 8. Where the reliability comes from: Correlation between length and score is a negative function of time allotted [Chart: shared variance between # of words and score (vertical axis, 0% to 60%) for essays written in 25 min, 1 hr., and 72 hrs.; nine samples, N = 106 to 6498] The greater the time, the smaller the correlation
  • 10. Colbert Report on SAT Word Length
  • 11. Graders were trained to read for length and pretentious diction
  • 12. Ellis B. Page – Project Essay Grade • Trin -- Intrinsic variable of interest (e.g. word choice, diction; sentence complexity) • Prox – “some variable which it is hoped will approximate the variable of true interest”
  • 13. e-Rater construct Quinlan, T., Higgins, D., & Wolff, S. (2009) Evaluating the Construct- Coverage of the e-rater® Scoring Engine. ETS Research Report 09-01. p. 15
  • 14. E-rater 2.0 Proxies • Organization = # of Discourse Elements (i.e. paragraphs) • Development = Length of Discourse Elements (i.e. # of sentences & # of words in paragraphs) • Lexical Complexity = average word length + frequency of infrequently used words + absence of frequently repeated words
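The proxies on slide 14 are all surface counts, which a few lines of code can reproduce. The sketch below is only an illustration of that point, not e-rater's implementation: the function name `proxy_features`, the blank-line paragraph rule, and the regular expressions are all assumptions made for this example.

```python
import re

def proxy_features(essay: str) -> dict:
    """Compute crude surface proxies of the kind listed on slide 14.

    Assumptions (not taken from any vendor's code): paragraphs are separated
    by blank lines, sentences end in . ! or ?, and words are runs of letters
    or apostrophes.
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", essay) if p.strip()]
    words_per_par = [len(re.findall(r"[A-Za-z']+", p)) for p in paragraphs]
    sents_per_par = [
        len([s for s in re.split(r"[.!?]+", p) if s.strip()]) for p in paragraphs
    ]
    words = re.findall(r"[A-Za-z']+", essay)
    n_par = len(paragraphs)
    return {
        # "Organization" proxy: number of discourse elements (paragraphs)
        "organization": n_par,
        # "Development" proxies: average length of each discourse element
        "development_words": sum(words_per_par) / n_par if n_par else 0.0,
        "development_sentences": sum(sents_per_par) / n_par if n_par else 0.0,
        # One ingredient of the "lexical complexity" proxy
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
    }

sample = "Short one. Two sentences here.\n\nA second paragraph follows."
features = proxy_features(sample)
print(features)
```

Nothing in these counts depends on what the essay says, so meaningless text with long paragraphs and long words would score at least as well on them as coherent prose.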
  • 15. Machines Consistently Overvalue Essay Length [Chart: Hewlett ASAP Study (2012), average shared variance (r²) between # of words and score for AES machines and human readers, by essay set (7 of 9 vendors; 2 vendors would not allow data to be released). Essay sets: 1 Argument; 2A Argument Holistic; 2B Argument Grammar; 3 Literary Analysis; 4 Literary Analysis; 5 Reading Summary; 6 Reading Summary; 7 Narrative Composite; 8 Narrative Composite. Series: Average Vendors, Average Human Readers]
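Shared variance (r²) between word count and score is the squared Pearson correlation between the two. The sketch below shows the calculation on made-up numbers; the word counts and scores are hypothetical and are not data from the Hewlett ASAP study.

```python
def shared_variance(xs, ys):
    """Return r squared, the squared Pearson correlation between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

# Hypothetical essays: word counts and the scores they received
word_counts = [100, 200, 300, 400]
scores = [1, 3, 2, 4]

r2 = shared_variance(word_counts, scores)
print(f"{r2:.0%} of score variance is shared with word count")
```

With these illustrative numbers, r² is 0.64, meaning essay length alone would account for 64% of the variation in scores.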
  • 16. AES Machines Maintain Artificial Correlations Through the Steroids of Word Counting
  • 17. [Pie charts: percentage of score variance shared with # of words versus other factors, for Reader 1, Reader 2, and the AES machine; shared variance with # of words, human vs. machine]
  • 18. But what about voice recognition? • Relatively very small set + Moore’s Law
  • 19. What kind of writing AES can’t grade • Long essays – ETS’s e-rater has a 1,000 word limit • Broad and open Writing Tasks – Two AES machines could not approximate scores on the essay portion of the Australian Scholastic Aptitude Test that called for a fixed length (600 word) essay on a fairly open topic McCurry (2010)
  • 20. The Turing Test: The Imitation Game
  • 21. CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart)
  • 22. The Reverse Turing Test Coherent prose Gibberish
  • 23. The Basic Automatic BS Essay Language (BABEL) Generator http://babel-generator.herokuapp.com/
  • 33. What we can conclude • The software does not do what it tells students and teachers it is doing • The metrics (proxies) used are irrelevant, at best, and, probably, are largely antithetical to good writing or communication. • Students can probably be trained to memorize language and strategies to obtain high scores (construct-irrelevant-strategies)
  • 34. Some Evidence that Students Are Using BABEL to Game AEE Products 4. What is meant by a “good faith” essay? It is important to note that although PEG software is extremely reliable in terms of producing scores that are comparable to those awarded by human judges, it can be fooled. Computers, like humans, are not perfect. PEG presumes “good faith” essays authored by “motivated” writers. A “good faith” essay is one that reflects the writer’s best efforts to respond to the assignment and the prompt without trickery or deceit. A “motivated” writer is one who genuinely wants to do well and for whom the assignment has some consequence (a grade, a factor in admissions or hiring, etc.). Efforts to “spoof” the system by typing in gibberish, repetitive phrases, or off-topic, illogical prose will produce illogical and essentially meaningless results.
  • 35. Most of the Studies Are Conducted and / or Controlled by the Vendors
  • 36. Important Unanswered Questions 1. How easy will it be for students to “game” these machines? 2. When essays are read by a human reader and a machine and there is a discrepancy between scores, after the adjudication procedure, what percentage of the machine’s scores are omitted or changed, compared to the scores of the human reader? 3. When gamed essays are read by a reader and a machine, will the human reader’s score always catch the gamed score? 4. Can human readers also be “gamed”?
  • 37. Negative Consequences • What is tested is what is taught • Emphasis on short writing • Emphasis on impromptu on-demand writing
  • 38. When can AES be useful? • Grading short content-based writing – Already useful applications • Use in MOOCs in conjunction with Peer Review Applications such as Calibrated Peer Review
  • 39. Why are the testing companies so in love with AES? • ῥίζα γὰρ πάντων τῶν κακῶν ἐστιν ἡ φιλαργυρία • Radix omnium malorum est cupiditas • The love of money is the root of all evil • 1st Timothy 6:10
  • 40. Chase the Moneychangers out of the Temple
  • 41. Three Proposals • First, some sort of professional system of disclosure for large sums of money, let’s say more than $10,000, received from outside professional organizations such as the College Board. – With textbooks, the disclosure is transparent.
  • 42. Second, Grass roots development of several different Honors English and Writing curricula • Teach skills students will need in college • Developed jointly by high school and college teachers • Developed through organizations such as NCTE, WPA, and NWP & College Admissions • Accompanied by some sort of certification procedure • Pacesetter Program as model
  • 43. Create Tests of Our Own • Design Criteria: – Valid – Fair • Coaching has minimal effect • Does not discriminate against bilingual or bidialectical students – Feasible • Not College Board’s bad version of reverse engineering – “The plane doesn’t fly, but we can make money on it” – Transparent design, development, and administration
  • 45. Decentralized Development and Grading Opportunities for Professional Development
  • 46. Test to the Teaching • Different tests and testing communities for different approaches • Technology enabled • Diverse and linked group of readers – Opportunity to address problems of low- performing minorities – Show students that their essay will be read by a diverse group of readers
  • 47. So let us • Urge our schools to stop using the SAT, ACT, & even AP • Begin a conversation with professional organizations to involve – K-12 teachers – College admission officers – College teachers To envision new and different kinds of writing tests
  • 48. Act • Act to divert some of the billions spent on testing to improve teaching • Act to reclaim testing from business and bring it back to education • Act to make testing a form of learning not only for students but also for us • And act by doing sound research on writing and testing; because if we don’t do it we are leaving it to people like the gang at Pearson.
  • 49. Automated Essay Scoring (AES) becomes Automated Essay Evaluation (AEE) • Teaching writing in the classroom
  • 51. My Access / IntelliMetric • Holistic Writing Score: 5.3 / 6.0 88%
  • 52. Some advice makes writing less effective
  • 53. Grammar Checkers are Unreliable: Nick Carbone’s Comparison of Grammarly, MS Word, and WriteCheck http://nccei12carbone.blogspot.com/2012/10/an-experiment-with-grammar-checkers.html
      Measure                                                 Grammarly   MS Word   WriteCheck (e-rater)
      # errors flagged                                        52          30        23
      # misdiagnosed, false positives, or poorly explained    11          8         14
      % misdiagnosed, false positives, or poorly explained    21%         27%       61%
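The percentages on slide 53 follow directly from the two count rows. A minimal check, using only the figures reported on the slide (the variable names are chosen for this example):

```python
# Counts as reported on slide 53 (Nick Carbone's comparison)
flagged = {"Grammarly": 52, "MS Word": 30, "WriteCheck (e-rater)": 23}
problematic = {"Grammarly": 11, "MS Word": 8, "WriteCheck (e-rater)": 14}

# Share of flagged items that were misdiagnosed, false positives,
# or poorly explained, rounded to a whole percent
rates = {
    name: round(100 * problematic[name] / flagged[name]) for name in flagged
}

for name, rate in rates.items():
    print(f"{name}: {rate}%")
```

The checker that flagged the fewest errors, WriteCheck, also had the highest share of problematic flags.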
  • 54. MS Word Grammar & Style Checker is also not infallible
  • 55. Dean Mark D. Shermis on Microsoft Word and AEE The feedback provided by the Web-based software is both quantitative and qualitative. That is, in addition to an overall rating, students may receive scores on individual attributes of writing, and the software may summarize or highlight a variety of errors, ranging from simple grammar to style or content. Some of the software packages also provide a discourse analysis of the work. Shermis, M. (May 11, 2012). How automated grading can make good writers. Los Angeles Times
  • 56. Category: Usage • Missing or Extra Article – Rather than rely on commercials or expert opinions about a film, individuals often make their viewing choices based on blogs and the[1] collected reviews of peers on various sites on the Internet. • [1] You may need to remove this article.
  • 57. Category: Usage • Type: Confused Words – Because the consumption of entertainment is so ephemeral, one could posit that advertising might affect[1] a consumer's decision more than when by buying a durable good, such as a toaster or a blender. • [1] You have used affect in this sentence. You may need to use effect instead.
  • 58. Category: Usage • Type: Preposition Error – Rather than rely on commercials or expert opinions about[1] a film, individuals often make their viewing choices based on blogs and the collected reviews of peers on various sites on the Internet. • [1] You may be using the wrong preposition.
  • 59. Category: Organization & Development • Type: Thesis Statement – The question of how important advertising is to the sale of any product is an important one. This question is extremely important in the media industries.[1] Because the consumption of entertainment is so ephemeral, one could posit that advertising might affect a consumer's decision more than when by buying a durable good, such as a toaster or a blender.[2] The advertising department of the Silver Screen Movie Production Company has recommended spending more on advertising and less on movie production. The advertising director's arguments are not only self-serving, but also logically flawed and, at the least, inconclusive, resting on several very dubious assumptions. • [1] Is this part of the essay your thesis? The purpose of a thesis is to organize, predict, control, and define your essay. Look in the Writer's Handbook for ways to improve your thesis. • [2] Is this sentence really a part of your thesis? Remember that a thesis controls the whole content of your essay. You need to strengthen this thesis so that you clearly state the main point you will be making. Look in the Writer's Handbook for tips on doing this.
  • 60. Category: Organization & Development • Type: Supporting Ideas – First, the motives behind this particular argument need to be questioned. In essence, the advertising director is arguing that resources should be taken away from producing films and given to his department.[1] Although people often make reasonable requests in their own self-interest, that this policy would greatly enhance the director's fiefdom is a consequence that should elicit some skepticism.[1] • [1] Criterion has identified only two sentences to support your topic sentence. Try to include one more sentence in this paragraph.
  • 61. The Missing Piece in Research on Classroom Use • Controlled experiments to avoid placebo effect • Comparison with the default writing tool of the 21st century, MS Word.
  • 62. Both Pearson and Measurement Inc. Concede that Grammar Checkers are Imperfect http://doe.sd.gov/oats/documents/WToLrnFAQ.pdf Q: Why does the grammar check not catch all of a student’s errors? A: The technology that supports grammar check features in programs such as Microsoft Word often return false positives. Since WriteToLearn is a educational product, the creators of this program have decided, in an attempt to not provide students with false positives, to err on the side of caution. Consequently, there are times when the grammar check will not catch all of a student’s errors. Teachers can address these missed grammar errors by using the post‐it note feature within the program to flag additional errors students might have missed.
  • 63. PEG Writer 8. Why does PEG seem to ignore some grammar “trouble spots” identified by Microsoft Word (or other programs)? PEG’s grammar checker can detect and provide feedback for a wide variety of syntactic, semantic and punctuation errors. These errors include, but are not limited to, run-on sentences, sentence fragments and comma splices; homophone errors and other errors of word choice; and missing or misused commas, apostrophes, quotation marks and end punctuation. In addition, the grammar checker can locate and offer feedback on style choices inappropriate for formal writing. Unlike commercial grammar checkers, however, PEG only reports those errors for which there is a high degree of confidence that the “error” is indeed an error. Commercial grammar checkers generally implement a lower threshold and as a result, may report more errors. The downside is they also report higher number of “false positives” (errors that aren’t errors). Because PEG factors these error conditions into scoring decisions, we are careful not to let “false positives” prejudice an otherwise well constructed essay.
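The trade-off both vendors describe, reporting only high-confidence errors at the cost of missing real ones, can be illustrated with a simple threshold. This is a sketch on invented data: the confidence values, the `report` function, and the thresholds are hypothetical and do not describe how PEG or WriteToLearn work internally.

```python
# Hypothetical flags from a grammar checker:
# (confidence the checker assigns, whether the flag is a real error)
flags = [
    (0.95, True),
    (0.90, True),
    (0.70, False),
    (0.60, True),
    (0.40, False),
    (0.30, True),
]

def report(flags, threshold):
    """Count outcomes when only flags at or above the threshold are shown."""
    reported = [is_error for conf, is_error in flags if conf >= threshold]
    caught = reported.count(True)
    total_real = sum(1 for _, is_error in flags if is_error)
    return {
        "false_positives": reported.count(False),
        "caught": caught,
        "missed": total_real - caught,
    }

low = report(flags, 0.25)   # permissive threshold: everything is reported
high = report(flags, 0.80)  # strict threshold: only confident flags are reported
print(low)
print(high)
```

In this toy example the strict threshold removes both false positives but also hides two of the four real errors from the student.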
  • 64. What kind of AEE can be useful & effective?
  • 66. Focus on style • MS Word is flawed but it may be hard to build something better that won’t confuse students • What can be emphasized is style: – Clarity – Cohesion – Emphasis – Concision – Elegance • e. g. Parallel structures
  • 67. What we need to do to build effective AEE tools • Start with General Principles – Then use statistical modeling – Follow the model of the development of voice recognition apps • Transparency • Independent Research
  • 68. The right way: by asking questions not giving answers Letting the student own the process
  • 69. Products should be transparent in displaying their limitations – again, showing warts and all
  • 70. The Real Danger of AES and bad AEE: Widening the Educational Divide • Private well-endowed institutions do not use AES • Flawed AEE will be used in large classes to “give students more opportunities to write” poorly • But what flawed AEE teaches not only dumbs down the ability to communicate – it has the potential to almost totally eliminate it
  • 71. But AEE also provides a real opportunity to provide a cheap accessible tool to teach and improve writing in multiple contexts • In classrooms • Writing at home • In the workplace
  • 72. Demo of BABEL Generator • http://babel-generator.herokuapp.com/ • https://www.dxrgroup.com/cgi-bin/scoreitnow/password.pl
  • 73. Break Up into Six Groups for MyAccess Experiment • Hypothesis: MyAccess will give high scores to computer-generated gibberish • Login to MyAccess – http://www.vantagelearning.com/login/myaccess-home-edition/ – Username / Password: studentone / one; studentwo / two; studentthree / three; studentfour / four; studentfive / five; studentsix / six • Open new window and open BABEL Generator http://babel-generator.herokuapp.com/
  • 74. Instructions 1. Click ASSIGNMENTS 2. Select Ages 15-18 3. Select one of the following topics (suggested keywords for BABEL generator are in parentheses) & click START ESSAY or START REVISION a. Nature v. nurture (nature, nurture) b. A sense of wonder (wonder) c. Invasion of privacy (privacy) d. Rating movies, music, & video games (violence, obscenity, teenagers) e. What makes a good coach (coach, encouragement) 4. Open another window, open BABEL Generator, generate essay, & then copy and paste it into the MY ACCESS window 5. Click SUBMIT ESSAY
  • 75. References • Attali, Yigal & Burstein, J. (2006). Automated essay scoring with e-rater V.2. Journal of Technology, Learning, and Assessment. 4:3. pp. 1-29. • Bejar, I. I., Flor, M., Futagi, Y., Ramineni, C. (2014). On the vulnerability of automated scoring to construct-irrelevant response strategies (CIRS): An illustration. Assessing Writing 22 pp. 43-59. • Condon, W. (2013). Large-scale assessment, locally-developed measures, and automated scoring of essays: Fishing for red herrings? Assessing Writing 18, pp. 100-108. • Dikli, S. & Bleyle, S. (October 2014). Automated Essay Scoring feedback for second language writers: How does it compare to instructor feedback?, Assessing Writing, 22, pp. 1-17. http://www.sciencedirect.com/science/article/pii/S1075293514000221. • Elliot, N. & Klobucar, A. “Automated Essay Evaluation and the teaching of writing.” Handbook of Automated Essay Evaluation: Current Applications and New Directions. Ed. Mark D. Shermis, Jill Burstein, and Sharon Apel. London: Routledge. June 2013. • Freitag Ericsson, P. & Haswell, R. Ed. (2006) Machine Scoring of Student Essays: Truth or Consequences. Logan, UT: Utah State UP • Godshalk, F. I, Swineford, F., Coffman, W. E. (1966). The Measurement of Writing Ability. New York, NY: College Entrance Examination Board. • Haudek, K. C. et al. (2012). What are they thinking? Automated analysis of student writing about acid-base chemistry in introductory biology. CBE—Life Sciences Education 11, pp. 149-155. • Haudek, K. C. et al. (2011). Harnessing technology to improve formative assessment of student conceptions in STEM: Forging a national network. CBE—Life Sciences Education 10, pp. 283-293. • Herrington, A. & Moran, C. (2012). When writing to a machine is not writing at all. Writing Assessment in the 21st Century: Essays in Honor of Edward M. White. Ed. Norbert Elliot and Les Perelman. New York, NY: Hampton Press, pp. 219-232
  • 76. • Higgins, D. & Heilman, M. Managing what we can measure: Quantifying the susceptibility of Automated Scoring Systems to gaming behavior. Educational Measurement 33:3. pp. 36-46. • Klobucar, Andrew, Paul Deane, Norbert Elliot, Chaitanya Ramineni, Perry Deess, & Alex Rudniy. (2012). “Automated Essay Scoring and the Search for Valid Writing Assessment.” International Advances in Writing Research: Cultures, Places, Measures. Ed. Charles Bazerman, Chris Dean, Jessica Early, Karen Lunsford, Suzie Null, Paul Rogers, and Amanda Stansell. Fort Collins, Colorado: WAC. pp. 103-119 http://wac.colostate.edu/books/wrab2011/chapter6.pdf • McCurry, D. (2010). Can machine scoring deal with broad and open writing tests as well as human readers? Assessing Writing. Vol. 15 pp. 118–129 • Morgan, J., Shermis, M. D., Van Deventer, L., & Vander Ark, T. (2013). Automated Student Assessment Prize: Phase 1 & Phase 2: A case study to promote focused innovation in student writing assessment. Seattle, WA: Getting Smart. http://gettingsmart.com/cms/wp-content/uploads/2013/02/ASAP-Case-Study-FINAL.pdf • Nehm, R. H., Ha, M., Mayfield, E. (2011). Transforming biology with machine learning: automated scoring of written evolutionary explanations. Journal of Science Education and Technology. 21:1, pp. 183-196 • Page, E. B. (1966). The imminence of grading essays by computer. Phi Delta Kappan 76:7 pp. 561-565 • Perelman, L. (July, 2014). When ‘the state of the art’ is counting words. Assessing Writing Vol. 21 pp. 104–111 http://dx.doi.org/10.1016/j.asw.2014.05.001
  • 77. • Perelman, L. (August 2013). Critique of Mark D. Shermis & Ben Hammer, “Contrasting State-of-the-Art Automated Scoring of Essays: Analysis.” Journal of Writing Assessment 6:1 http://journalofwritingassessment.org/article.php?article=69 • Perelman, L. (2012). “Mass-Market Writing Assessments as Bullshit.” Writing Assessment in the 21st Century: Essays in Honor of Edward M. White. Ed. Norbert Elliot and Les Perelman. New York, NY: Hampton Press, 2012, pp. 425-438. • Perelman, L. (2012). Length, Score, Time, & Construct Validity in Holistically Graded Writing Assessments: The Case against Automated Essay Scoring (AES). In New Directions in International Writing Research, Ed. C. Bazerman, C. Dean, K. Lunsford, S. Null, P. Rogers, A. Stansell, and T. Zawacki. Anderson, SC: Parlor Press, pp. 121-132. http://wac.colostate.edu/books/wrab2011/chapter7.pdf • Powers, D. et al. (2002). Stumping e-rater: Challenging the validity of automated essay scoring. Computers in Human Behavior. 18:2 pp. 103-134. • Quinlan, T., Higgins, D., & Wolff, S. (2009) Evaluating the Construct-Coverage of the e-rater® Scoring Engine. ETS Research Report 09-01. • Sandene, B. et al. (2005) Online Assessment in Mathematics and Writing: Reports From the NAEP Technology-Based Assessment Project, Research and Development Series. National Center for Education Statistics Report 2005-457. http://files.eric.ed.gov/fulltext/ED485780.pdf
  • 78. • Shermis, M. D. (In press). The challenges of emulating human behavior in writing assessment. http://www.sciencedirect.com/science/article/pii/S1075293514000373# • Shermis, M. D. (2014). State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration. Assessing Writing, 20, 53-76 • Shermis, M. D., & Hamner, B. (2013). Contrasting state-of-the-art automated scoring of essays. In M. D. Shermis & J. Burstein (Eds.), Handbook of automated essay evaluation: Current applications and new directions. New York, NY: Routledge. pp. 313-346 • Shermis, M. (May 11, 2012). How automated grading can make good writers. Los Angeles Times http://articles.latimes.com/2012/may/11/news/la-ol-automated-scoring-blowback-20120510 • Stevenson, M. & Phakiti, A. (2014). The effects of computer-generated feedback on the quality of writing. Assessing Writing. 19, pp. 51-65.