TACL16_Sakaguchi

Reassessing the Goals of
Grammatical Error Correction:
Fluency instead of Grammaticality
Keisuke Sakaguchi¹, Courtney Napoles¹,
Matt Post¹, & Joel Tetreault²
¹Johns Hopkins University
²Yahoo! (now at Grammarly)

Grammaticality and Fluency
Original: From this scope, social media has shorten our distance.
Grammatical: From this scope, social media has shortened our distance.
Fluent: From this perspective, social media has shortened the distance between us.

Overview (High-level)
1. The GEC community has not clearly distinguished
Grammaticality and Fluency.
2. Fluency-oriented annotations and metric
- Native speakers' preference
- Higher correlation with human ranking
- Easier and cheaper to collect new datasets
New corpora should be produced regularly (e.g. as in SMT),
avoiding over-reliance on a single annotated corpus.

History of GEC:
Grammaticality or Fluency?
Trying to help learners correct small mistakes?
Or also trying to help them sound more fluent?
Shared Task   Target Errors                                                       Metric
HOO 11        All error types (e.g. prep., punctuation, word choice, …)           F-score
HOO 12        Limited error types: prepositions, determiners                      F-score
CoNLL 13      Limited error types: HOO 12 + noun number, verb form, SVA           M2 (≈ F0.5)
CoNLL 14      All error types (e.g. CoNLL 13 + redundancy, word choice, …)        M2 (≈ F0.5)
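The F0.5 score mentioned for the CoNLL shared tasks weights precision more heavily than recall. A minimal sketch of the general F-beta formula follows; the function name and the edit counts are illustrative assumptions and are not taken from the M2 scorer itself.

```python
def f_beta(true_pos, false_pos, false_neg, beta=0.5):
    """F-beta from edit counts; beta < 1 favors precision over recall."""
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical system: 6 correct edits, 2 spurious edits, 6 missed edits.
f1 = f_beta(6, 2, 6, beta=1.0)
f05 = f_beta(6, 2, 6, beta=0.5)
print(round(f1, 3), round(f05, 3))
```

With precision 0.75 and recall 0.5, F0.5 comes out higher than F1, which reflects the preference for systems that propose fewer but more reliable corrections.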


Existing Annotation Scheme
Fine-grained error type coding (CLC: 80 types, NUCLE: 27 types)
CLC example:
However, to have <NS type="RD">his<c>your</c></NS> <NS type="FN">photos<c>photo</c></NS> taken and <NS type="TV">showed<c>shown</c></NS> on television and <NS type="MT"><c>in</c></NS> <NS type="MD"><c>the</c></NS> newspapers increases your popularity.
NUCLE example:
<MISTAKE start_par="1" start_off="387" end_par="1" end_off="389"><TYPE>Prep</TYPE><CORRECTION>of</CORRECTION></MISTAKE>
<MISTAKE start_par="1" start_off="396" end_par="1" end_off="413"><TYPE>V0</TYPE><CORRECTION>that are inhospitable</CORRECTION></MISTAKE>
<MISTAKE start_par="1" start_off="422" end_par="1" end_off="430"><TYPE>Mec</TYPE><CORRECTION>deserts</CORRECTION></MISTAKE>
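To make the stand-off error coding above concrete, here is a minimal sketch that reads MISTAKE elements and applies their corrections to a paragraph. The wrapping ANNOTATION element, the helper names, and the toy paragraph are assumptions for illustration; the sketch also assumes offsets are character positions with an exclusive end, which is not stated on the slide.

```python
import xml.etree.ElementTree as ET

# The three MISTAKE elements from the slide, wrapped in a hypothetical root element.
SAMPLE = """<ANNOTATION>
<MISTAKE start_par="1" start_off="387" end_par="1" end_off="389"><TYPE>Prep</TYPE><CORRECTION>of</CORRECTION></MISTAKE>
<MISTAKE start_par="1" start_off="396" end_par="1" end_off="413"><TYPE>V0</TYPE><CORRECTION>that are inhospitable</CORRECTION></MISTAKE>
<MISTAKE start_par="1" start_off="422" end_par="1" end_off="430"><TYPE>Mec</TYPE><CORRECTION>deserts</CORRECTION></MISTAKE>
</ANNOTATION>"""


def read_mistakes(xml_text):
    """Return one dict per MISTAKE element: offsets, error type, correction."""
    root = ET.fromstring(xml_text)
    mistakes = []
    for node in root.iter("MISTAKE"):
        mistakes.append({
            "start": int(node.get("start_off")),
            "end": int(node.get("end_off")),
            "type": (node.findtext("TYPE") or "").strip(),
            "correction": node.findtext("CORRECTION") or "",
        })
    return mistakes


def apply_corrections(paragraph, mistakes):
    """Apply edits right to left so that earlier offsets stay valid."""
    for m in sorted(mistakes, key=lambda m: m["start"], reverse=True):
        paragraph = paragraph[:m["start"]] + m["correction"] + paragraph[m["end"]:]
    return paragraph


mistakes = read_mistakes(SAMPLE)
print([m["type"] for m in mistakes])

# Toy paragraph (not from NUCLE) with a single preposition edit.
toy = apply_corrections("a letter for hope", [{"start": 9, "end": 12, "correction": "of"}])
print(toy)
```

Even in this small form, every edit needs exact offsets and a type label, which hints at why training annotators for error coding is expensive.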

Existing Annotation Scheme
Fine-grained error type coding (CLC: 80 types, NUCLE: 27 types)
1. It costs a lot to train annotators for error coding.
2. Inter-annotator agreement (IAA) is very low.
3. Downward pressure on the annotators to make
sentences just Grammatical and not Fluent.


New annotation: Fluency edits
Simply ask native speakers to rewrite the sentence to
sound natural to them.
1. Low cost: no error tags, no training is required
2. Scalability: All native speakers can annotate.
3. Fluency is taken into account.

- NUCLE 3.2 dataset (test: 1,312 sentences)
- Two annotators per sentence.
- Fluency edits vs. Grammatically minimal edits.

Data examples
Original: Some family may feel hurt , with regards to their family pride or reputation , on having the knowledge of such genetic disorder running in their family .
Minimal: Some families may feel hurt [] with regards to their family pride or reputation , on having [] knowledge of such a
Fluency: On [] learning of such a genetic disorder running in their family , some family members may feel hurt [] regarding their family pride or reputation .
NUCLE: Some family members may feel hurt [] with regards to their family pride or reputation [] on having knowledge of a genetic disorder running in their family .

Preference by Native Speakers
Scored and ranked by TrueSkill (as used in WMT)
(Differences between rank groups are statistically significant.)
Rank   Score   Annotation scheme
1      1.16    Fluency edits
2      0.54    NUCLE annotation
3      0.26    Minimal edits
4      -2.9    Original sentence


Metrics
MaxMatch (M2) (Dahlmeier and Ng, 2012): Grammaticality
- Phrase-level F-measure
- Designed for error-coded GEC corpora.
I-measure (Felice and Briscoe, 2015): Grammaticality
- Token-level accuracy
- Designed for error-coded GEC corpora.
GLEU (Napoles et al., 2015): Fluency
- Similar to BLEU but considering source information.
- N-gram precision with penalty term
- Suitable for non-error-coded corpora.
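The idea of "n-gram precision with a penalty term" can be sketched as follows. This is a simplified, single-reference illustration in the spirit of GLEU and is not the official implementation, which differs in details such as multiple references and a brevity penalty; the function names are hypothetical. Candidate n-grams that match the reference are rewarded, and n-grams carried over from the source that the reference does not contain are penalized.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def penalized_precision(source, candidate, reference, n):
    """N-gram precision against the reference, minus uncorrected source n-grams."""
    cand, ref, src = ngrams(candidate, n), ngrams(reference, n), ngrams(source, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(c, ref[g]) for g, c in cand.items())
    # Penalty: candidate n-grams shared with the source beyond what the reference supports.
    penalty = sum(max(0, min(c, src[g]) - min(c, ref[g])) for g, c in cand.items())
    return max(0, matched - penalty) / total


def gleu_like(source, candidate, reference, max_n=4):
    """Geometric mean of penalized precisions for n = 1..max_n (no brevity penalty)."""
    precisions = [penalized_precision(source, candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / max_n)


src = "From this scope , social media has shorten our distance .".split()
fixed = "From this scope , social media has shortened our distance .".split()
ref = "From this perspective , social media has shortened the distance between us .".split()

print(gleu_like(src, src, ref, max_n=2))    # unchanged source as system output
print(gleu_like(src, fixed, ref, max_n=2))  # grammatical fix only
print(gleu_like(src, ref, ref, max_n=2))    # output identical to the fluent reference
```

Using the example sentences from the opening slide, the unchanged source scores lowest, the grammatical-only fix scores higher, and the fluent rewrite matches the reference exactly.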

Annotation Scheme & Metrics
Annotation Scheme (= Reference): Fluency edits, NUCLE (error coded; ≈ Minimal edits)
×
Automated Metrics: M2, I-measure, GLEU
Q: Which combination is best for evaluation?
Oracle ranking (by human): Grundkiewicz et al. (2015).

Correlation (Spearman's ρ)
Reference   GLEU    M2
Fluency     0.819   0.758
NUCLE       0.626   0.725
N.B. I-measure showed weakly
negative correlations (omitted).

Correlation (Pearson’s r)
Reference   GLEU    M2
Fluency     0.731   0.665
NUCLE       0.646   0.677
N.B. I-measure showed weakly
negative correlations (omitted).
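For readers unfamiliar with the two coefficients, the sketch below shows how a metric's system-level scores could be correlated with human scores. The numbers are made up for illustration and are not the systems or scores from the study; the helper names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson's r: linear correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores for five systems (human judgment vs. an automatic metric).
human = [1.2, 0.5, 0.1, -0.4, -1.0]
metric = [0.60, 0.58, 0.20, 0.30, 0.10]
print(round(spearman(human, metric), 3), round(pearson(human, metric), 3))
```

Spearman's coefficient only asks whether the metric orders systems like the human ranking does, while Pearson's also depends on how linearly the score values relate.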


- NUCLE 3.2 dataset (1,312 sentences)
- Two crowdsourced annotators per sentence.
- Qualified by expert editors
- Reward: $0.07 to $0.10 per sentence
- Total: Approx. $240 in 24 hours.
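As a quick sanity check on the figures above, the reported total fits within the range implied by the per-sentence reward. This sketch assumes the reward is paid once per annotator per sentence and ignores any platform fees, neither of which is stated on the slide.

```python
sentences = 1312
annotators_per_sentence = 2
total_edits = sentences * annotators_per_sentence

# Lower and upper bounds from the per-sentence reward range.
low_cost = 0.07 * total_edits
high_cost = 0.10 * total_edits
print(total_edits, round(low_cost, 2), round(high_cost, 2))
```

The reported total of about $240 lies between the two bounds.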
Q: How good are their edits?

Data examples
Original: Some family may feel hurt , with regards to their family pride or reputation , on having the knowledge of such genetic disorder running in their family .
Fluency: On [] learning of such a genetic disorder running in their family , some family members may feel hurt [] regarding their family pride or reputation .
Fluency crowd: Some relatives may [] be concerned about the family’s [] reputation – not to mention their own pride – in relation to this news of [] familial genetic defectiveness [] .
NUCLE: Some family members may feel hurt [] with regards to their family pride or reputation [] on having knowledge of a genetic disorder running in their family .

Preference by Native Speakers
Scored and ranked by TrueSkill (as used in WMT)
(Differences between rank groups are statistically significant.)
Rank   Score   Annotation scheme
1      1.16    Fluency edits
1      0.97    Fluency edits
2      0.26    NUCLE reference
3      -2.9    Original sentence

Correlation (GLEU metric)
Reference       Spearman   Pearson
Fluency         0.819      0.731
Fluency crowd   0.676      0.668
NUCLE           0.626      0.646

Summary

All the fluency references are available at
https://github.com/keisks/reassess-gec.git
For exhaustive experiments and analysis:
http://aclweb.org/anthology/Q/Q16/Q16-1013.pdf
Thank you!
