NAACL15_sakaguchi

Effective Feature Integration for Automated Short Answer Scoring
Keisuke Sakaguchi (1), Michael Heilman (2), Nitin Madnani (2)
(1) Johns Hopkins University, CLSP; (2) Educational Testing Service
6/2/2015, NAACL 2015

What is Automated Short Answer Scoring?
Passage (~700 words)
A woman was --- A boy ran up behind her and tried to snatch her purse. ~
Student Responses
A boy tried to snatch ---

Build a regression model
0 (bad) to 4 (good) scale
Predict a score for a given student answer

Reference for scoring
1~2 exemplar answers
Brief key concepts (<10)

Two Basic Approaches
Response-based: features from the Student Responses themselves
Reference-based: similarity to the Reference for scoring (1~2 exemplar answers, brief key concepts (<10))

Response-based features
A boy tried to snatch a lady’s purse --- --- .

Response-based features
Length
Word n-gram (e.g. bigrams)
Character n-gram (e.g. 2-5 grams)
Syntactic dependency (e.g. PARENT-LABEL-CHILD)
Semantic Roles (e.g. TRY-A0-BOY)
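A minimal sketch of how the first three feature types in the list above could be extracted. The function name `response_features`, the whitespace tokenizer, and measuring length in tokens are all assumptions for illustration; the slides do not specify them. Syntactic dependency and semantic role features need a parser and are left out.

```python
def response_features(text, word_n=2, char_ns=(2, 3, 4, 5)):
    """Map a response to a sparse dict of (mostly binary) features."""
    tokens = text.lower().split()        # naive whitespace tokenizer (assumption)
    feats = {"length": len(tokens)}      # length counted in tokens (assumption)
    # Word n-grams (bigrams by default), recorded as binary indicators.
    for i in range(len(tokens) - word_n + 1):
        feats["w:" + " ".join(tokens[i:i + word_n])] = 1
    # Character n-grams (2-5 grams) over the normalized text.
    chars = " ".join(tokens)
    for n in char_ns:
        for i in range(len(chars) - n + 1):
            feats["c:" + chars[i:i + n]] = 1
    return feats


feats = response_features("A boy tried to snatch a lady's purse")
```

Only the n-grams that actually occur in a response get an entry, which is why such feature vectors are sparse once the vocabulary over all responses is taken into account.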

Reference-based features
Exemplar
(Score 4) A boy tried to steal ---
(Score 3) A lady’s purse ---
…
(Score 0) I don’t know.
Key concepts
(#1) A boy tried to steal a woman’s purse.
(#2) A lady caught him.
…
(#N) She lets him leave.

Similarity: each student response is compared against the exemplars and the key concepts

Similarity metrics
Overall similarity
1. BLEU score
2. word2vec cosine (sentence-level)
Alignment-based similarity
3. word2vec alignment
4. WordNet alignment
Features: 4 metrics × (exemplars + key concepts)
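The "metrics × references" layout can be sketched as follows. The two metrics here are simple stand-ins chosen so the example runs without external resources: `unigram_precision` is an unclipped unigram precision (a crude proxy for BLEU, not BLEU itself) and `jaccard` is token-set overlap. Both functions, the tokenizer, and `reference_features` are hypothetical names, not part of the original system.

```python
import re


def tokens(text):
    """Lowercase word tokenizer (assumption: letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def unigram_precision(response, reference):
    """Share of response tokens that also occur in the reference."""
    resp, ref = tokens(response), set(tokens(reference))
    return sum(t in ref for t in resp) / len(resp) if resp else 0.0


def jaccard(response, reference):
    """Token-set overlap between response and reference."""
    a, b = set(tokens(response)), set(tokens(reference))
    return len(a & b) / len(a | b) if a | b else 0.0


def reference_features(response, references, metrics):
    """One dense feature per (reference, metric) pair."""
    return [metric(response, ref) for ref in references for metric in metrics]


references = [
    "A boy tried to steal a woman's purse.",  # key concept #1 from the slide
    "A lady caught him.",                     # key concept #2 from the slide
    "I don't know.",                          # score-0 exemplar from the slide
]
ref_feats = reference_features(
    "A boy tried to snatch a lady's purse", references, [unigram_precision, jaccard]
)
```

With two metrics and three references this yields six continuous values; the slides' setup would give four values per exemplar or key concept.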

Alignment-based Semantic Similarity
Student Response: A boy trying to snatch a lady's purse …
Reference: A boy tried to steal a woman’s purse …

Filter by POS tags (N, V, ADJ, ADV)

For each word in the Student Response, find the most similar word in the Reference (via WordNet or word2vec)
$\frac{1}{\mathrm{len}(S)} \sum_{W_s \in S} \max_{W_r \in R} \mathrm{Sim}(W_s, W_r)$

Sentence-level similarity: taking the average of the best-match similarities over the words of the Student Response
Example: similarity to the Ref = 0.8
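The averaged best-match similarity can be sketched in a few lines. The word vectors below are made-up 2-dimensional toy values standing in for word2vec embeddings, and the word lists are assumed to be already filtered to content words; `cosine`, `alignment_similarity`, and `toy_vectors` are hypothetical names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def alignment_similarity(student, reference, sim):
    """Average, over student words, of the best similarity to any reference word."""
    if not student or not reference:
        return 0.0
    return sum(max(sim(ws, wr) for wr in reference) for ws in student) / len(student)


# Toy embeddings (invented for illustration, not real word2vec values).
toy_vectors = {
    "boy": (1.0, 0.0),
    "purse": (0.0, 1.0),
    "snatch": (0.6, 0.8),
    "steal": (0.8, 0.6),
}


def w2v_sim(ws, wr):
    return cosine(toy_vectors[ws], toy_vectors[wr])


score = alignment_similarity(
    ["boy", "snatch", "purse"], ["boy", "steal", "purse"], w2v_sim
)
```

Passing the word-level similarity as a function means the same routine covers both alignment variants: a WordNet-based similarity could be swapped in for `w2v_sim` without changing the averaging.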

Two Basic Approaches: Review
Response-based:
Length
Word n-gram
Character n-gram
Syntactic dependency
Semantic Roles
Reference-based (several similarity metrics):
BLEU score
word2vec cosine
word2vec alignment
WordNet alignment

Models
Build Support Vector Regression (SVR) models on:
Response-based
Reference-based
Response-based + Reference-based

Wait! Naïve feature combination does not work!

Look closely at the features
Response-based (binary & sparse, 90%):
Length
Word n-gram
Character n-gram
Semantic Roles
Reference-based (continuous & dense, 10%):
BLEU score
word2vec cosine
word2vec alignment
WordNet alignment
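One way to "look closely" is to compute, per feature column, how often it is non-zero and whether it only takes the values 0 and 1. This is a minimal sketch; `column_stats` and the small matrix are hypothetical and only illustrate the binary-sparse versus continuous-dense contrast.

```python
def column_stats(rows):
    """Per column: (share of non-zero entries, whether all values are 0 or 1)."""
    n = len(rows)
    stats = []
    for col in zip(*rows):
        density = sum(v != 0 for v in col) / n
        is_binary = all(v in (0, 1) for v in col)
        stats.append((density, is_binary))
    return stats


# Two n-gram indicator columns and one similarity column (invented values).
rows = [
    [1, 0, 0.82],
    [0, 0, 0.40],
    [0, 1, 0.65],
    [0, 0, 0.10],
]
stats = column_stats(rows)
```

Here the two indicator columns come out binary with density 0.25, while the similarity column is non-binary with density 1.0.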

Stacked Generalization (Wolpert, 1992)
Layer 1: three Classifiers, each trained on its own Training Set and each producing an output
Layer 2: one Classifier that takes the Layer 1 outputs and produces the final output

Stacking model for our task
Naïve feature combination:
Reference-based (continuous & dense) + Response-based (binary & sparse) → SVR → Predicted score

Stacking model for our task
Layer 1: Response-based (binary & sparse) → SVR#1 → Score as a dense feature
Layer 2: Reference-based (continuous & dense) + Layer 1 score → SVR#2 → Predicted score
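The two-layer setup can be sketched as below, under several loudly stated assumptions. To stay dependency-free, a least-squares linear model fitted by gradient descent is swapped in for the SVRs, so this is not the learner used in the talk. Training layer 2 on out-of-fold layer 1 scores follows the usual stacked-generalization recipe and is an assumption, since the slides do not say how the layer 1 scores were produced for the training data. All names and data values are hypothetical.

```python
def fit_linear(X, y, lr=0.01, epochs=3000, l2=1e-3):
    """Least-squares linear model via gradient descent (stand-in for a linear SVR)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) + b - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * (g / n + l2 * wj) for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def predict(model, X):
    w, b = model
    return [sum(wj * xj for wj, xj in zip(w, xi)) + b for xi in X]


def stack_train(X_sparse, X_dense, y, folds=2):
    n = len(y)
    # Layer 1 scores for the training data come from models that never saw
    # the example being scored (out-of-fold predictions).
    oof = [0.0] * n
    for k in range(folds):
        train = [i for i in range(n) if i % folds != k]
        held = [i for i in range(n) if i % folds == k]
        model = fit_linear([X_sparse[i] for i in train], [y[i] for i in train])
        for i, p in zip(held, predict(model, [X_sparse[i] for i in held])):
            oof[i] = p
    # Layer 1 model used at prediction time is trained on all training data.
    layer1 = fit_linear(X_sparse, y)
    # Layer 2 sees the dense features plus the layer 1 score as one more dense feature.
    layer2 = fit_linear([list(d) + [s] for d, s in zip(X_dense, oof)], y)
    return layer1, layer2


def stack_predict(models, X_sparse, X_dense):
    layer1, layer2 = models
    scores = predict(layer1, X_sparse)
    return predict(layer2, [list(d) + [s] for d, s in zip(X_dense, scores)])


# Invented toy data: 3 binary response-based features, 1 reference-based similarity.
X_sparse = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 0]]
X_dense = [[0.9], [0.7], [0.4], [0.3], [0.95], [0.1]]
y = [4, 3, 2, 1, 4, 0]

models = stack_train(X_sparse, X_dense, y)
preds = stack_predict(models, X_sparse, X_dense)
```

The design point is that layer 2 never has to weigh thousands of sparse indicators against a handful of dense similarities: the sparse side arrives already compressed into a single score.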

Experimental Setting
Dataset:
Reading for Understanding (RfU)
Designed for 6th – 9th grade students
4 short-answer questions on 2 different passages
5 training sizes (100, 200, 400, 800, All) * 20 runs
Learner:
Support Vector Regression (Linear)
Evaluation:
Quadratic Weighted Kappa with 10-fold CV
(on held-out test set)
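For reference, quadratic weighted kappa can be computed as in the sketch below. It assumes integer scores on the 0 to 4 scale, so continuous regression outputs would first need to be rounded and clipped (that step is an assumption and is not shown). The function name is hypothetical.

```python
def quadratic_weighted_kappa(human, predicted, min_score=0, max_score=4):
    """1 - (weighted observed disagreement / weighted expected disagreement)."""
    k = max_score - min_score + 1
    n = len(human)
    # Observed confusion matrix between human and predicted scores.
    observed = [[0] * k for _ in range(k)]
    for h, p in zip(human, predicted):
        observed[h - min_score][p - min_score] += 1
    hist_h = [sum(row) for row in observed]
    hist_p = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2      # quadratic penalty
            num += weight * observed[i][j]
            den += weight * hist_h[i] * hist_p[j] / n  # expected under independence
    # Note: den is 0 when both raters give one identical constant score.
    return 1.0 - num / den


perfect = quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4])
reversed_scores = quadratic_weighted_kappa([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])
```

Perfect agreement gives 1.0 and the fully reversed ranking in this example gives -1.0; the quadratic weights make a 0-versus-4 disagreement count far more than a 3-versus-4 one.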

Results (Q1)
[Figure: Q1 results plotted against Training Size (100, 200, 400, 800, ALL (1790)); series: Resp, Ref, w/o Stack, w/ Stack; y-axis ticks from 0.72 to 0.84]

All Results
[Figure: four panels (Q1, Q2, Q3, Q4); series: Resp, Ref, w/o Stack, w/ Stack]

Summary
Automated Short Answer Scoring
Response-based and Reference-based approaches
Response-based: binary sparse features
Reference-based: continuous dense features
Model combination (with stacking) improved performance
Look at the stats of your features and apply stacking :)
