A Neural Grammatical Error Correction System
Built on Better Pre-training and
Sequential Transfer Learning
2019-07-24
Jiyeon Ham
Kakao Brain
1
Contents
1. What is GEC?
2. Previous Work
3. Our Approach
4. Results
2
1. What is GEC?
3
Grammatical Error Correction
Input
Travel by bus is exspensive, bored and annoying.
Output
Travelling by bus is expensive, boring and annoying.
4
ACL 2019 BEA Challenge
• Building Educational Applications 2019: Shared Task
• Restricted Track
• Public data only
• Low Resource Track
• WI+Locness dev (4K) only
5
Data
6
• Data sources for each track
  • Lang8: online English learning site, 570K sentences, relatively poor quality
  • NUCLE: college student essays, 21K sentences, good quality
  • FCE: ESL exam questions, 33K sentences, good quality
  • WI+Locness: English essays (native & non-native), 33K (train) / 4K (dev) / 4K (test), good quality
• Data used in each stage
  • Restricted Track: Train on Lang8, NUCLE, FCE, WI-train; Template from WI-train; Fine-tuning on WI-train; Validation on WI-dev
  • Low Resource Track: Train on WI-dev-3k; Template from WI-dev-3k; Fine-tuning on WI-dev-3k; Validation on WI-dev-1k
ERRANT
• ERRor ANnotation Toolkit (Bryant et al., 2017)*
• Automatically annotate parallel English sentences with
error type information
• Extract the edits, and then classify them according to a
rule-based error type framework
* Christopher Bryant, Mariano Felice, and Ted Briscoe. 2017. Automatic Annotation and Evaluation of Error Types for Grammatical Error Correction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada.
7
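A minimal usage sketch of this step, assuming the errant Python package (v2+) and its spaCy English model are installed; the attribute names follow ERRANT's documented Edit objects:

```python
# Minimal ERRANT sketch: align a source/correction pair, extract edits, classify them.
import errant

annotator = errant.load("en")  # loads spaCy plus ERRANT's rule-based classifier

orig = annotator.parse("Travel by bus is exspensive , bored and annoying .")
cor = annotator.parse("Travelling by bus is expensive , boring and annoying .")

# Each edit carries its original span, corrected span, and a rule-based error type.
for edit in annotator.annotate(orig, cor):
    print(edit.o_str, "->", edit.c_str, edit.type)  # e.g. "exspensive -> expensive R:SPELL"
```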
ERRANT
Input
Travel by bus is exspensive, bored and annoying.
Output
[Travel→Travelling] (R:VERB:FORM) by bus is [exspensive→expensive] (R:SPELL),
[bored→boring] (R:VERB:FORM) and annoying.
8
2. Previous Work
9
GEC as Low-resource Machine
Translation*
• Translating from erroneous
to correct text
• Techniques proposed for
low-resource MT are
applicable to improving
neural GEC
* M. Junczys-Dowmunt, R. Grundkiewicz, S. Guha, K. Heafield: Approaching Neural
Grammatical Error Correction as a Low-Resource Machine Translation Task, NAACL
2018.
10
Denoising Autoencoder
• Learns to reconstruct the original input given its
noisy version
• Minimize the reconstruction loss L(x, dec(enc(x̃)))
given an input x and a noising function f(x) = x̃
11
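A minimal sketch of this setup; the token drop/swap noise below is a generic stand-in for illustration, not the realistic noising described later in the talk:

```python
import random

def noise(tokens, p_drop=0.1, p_swap=0.1, seed=None):
    """Toy noising function f(x) = x̃: randomly drop tokens and swap adjacent ones."""
    rng = random.Random(seed)
    out = [t for t in tokens if rng.random() > p_drop]   # token dropout
    for i in range(len(out) - 1):                        # occasional local swaps
        if rng.random() < p_swap:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out

clean = "travelling by bus is expensive , boring and annoying .".split()
noisy = noise(clean, seed=0)

# Pre-training pair for the denoising autoencoder: the encoder reads `noisy`,
# the decoder is trained to reproduce `clean`, i.e. minimize L(x, dec(enc(x̃))).
print("source:", " ".join(noisy))
print("target:", " ".join(clean))
```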
Copy-augmented Transformer*
• Combines Transformer with copy scores
• Copy score: softmax outputs of the encoder-decoder
attention
• Pretrained on denoising
autoencoding task
• Auxiliary losses
• Token-level labeling
• Sentence-level
copying
* Zhao, Wei, et al. "Improving Grammatical Error Correction via Pre-Training a Copy-Augmented Architecture with Unlabeled Data." NAACL (2019).
12
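A rough sketch of the copy mechanism itself, with a fixed balancing factor alpha standing in for the learned gate and toy distributions for illustration:

```python
import numpy as np

def mix_copy_distribution(p_gen, attn, src_ids, alpha, vocab_size):
    """Combine the decoder's generation distribution with a copy distribution.

    p_gen   : (vocab_size,) softmax over the vocabulary from the decoder
    attn    : (src_len,) encoder-decoder attention weights at the current step
    src_ids : (src_len,) vocabulary ids of the source tokens
    alpha   : balancing factor (a learned gate in the paper; a constant here)
    """
    p_copy = np.zeros(vocab_size)
    np.add.at(p_copy, src_ids, attn)      # scatter attention mass onto source token ids
    return (1.0 - alpha) * p_gen + alpha * p_copy

# Tiny example: vocabulary of 6 types, source sentence of 3 tokens.
vocab_size = 6
p_gen = np.full(vocab_size, 1.0 / vocab_size)
attn = np.array([0.7, 0.2, 0.1])          # already softmax-normalised attention
src_ids = np.array([2, 4, 2])             # token 2 appears twice in the source
p_final = mix_copy_distribution(p_gen, attn, src_ids, alpha=0.2, vocab_size=vocab_size)
print(p_final, p_final.sum())             # still a valid distribution (sums to 1)
```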
3. Our approach
13
Pipeline (sequential transfer learning)
• Preprocessing: context-aware spell checker, BPE segmentation
• Pre-training: error extraction, perturbation
• Training
• Fine-tuning
• Postprocessing: <unk> edit removal, re-ranking, error type control
14
Preprocessing
• Context-aware spellchecker
• Example:
• This is an esay about my favorite sport. (→ essay)
• This is an esay question. (→ easy)
• Incorporates context using a pre-trained neural language
model (LM)
• Fix casing errors with a list of proper nouns
• Byte pair encoding (BPE) segmentation
15
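A minimal sketch of context-aware candidate selection, with a toy bigram model standing in for the pre-trained neural LM and a hand-picked candidate list standing in for the spellchecker's suggestions:

```python
import math

def lm_score(tokens):
    """Stand-in for the pre-trained neural LM: a toy bigram log-probability."""
    bigram = {("an", "essay"): 0.01, ("an", "easy"): 0.002,
              ("essay", "about"): 0.02, ("easy", "question"): 0.03,
              ("essay", "question"): 0.001, ("easy", "about"): 0.001}
    return sum(math.log(bigram.get(pair, 1e-6)) for pair in zip(tokens, tokens[1:]))

def correct(tokens, position, candidates):
    """Pick the spelling candidate whose full sentence scores highest under the LM."""
    def with_candidate(c):
        return tokens[:position] + [c] + tokens[position + 1:]
    return max(candidates, key=lambda c: lm_score(with_candidate(c)))

# "esay" is ambiguous without context: both "essay" and "easy" are plausible fixes.
sent1 = "this is an esay about my favorite sport .".split()
sent2 = "this is an esay question .".split()
print(correct(sent1, 3, ["essay", "easy"]))   # context favours "essay"
print(correct(sent2, 3, ["essay", "easy"]))   # context favours "easy"
```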
Pre-training
• Pre-training a seq2seq model on a denoising task
• Realistic noising scenarios
• Token-based approach
• Extract human edits from annotated GEC corpora
• Missing punctuation (adding a comma), preposition errors (of→at), verb tense errors (has→have)
• Type-based approach
• Use a priori knowledge
• Replace prepositions with other prepositions, nouns with their singular/plural forms, and verbs with one of their inflected forms
16
Pre-training
• Generating pre-training data
• Generate erroneous sentences from high-quality English
corpora
• If a token exists in the dictionary of token edits
• A token-based error is generated with probability 0.9
• If a token is not processed
• Apply a type-based error
• Pre-training corpora
  • Gutenberg: 11.6M sentences
  • Tatoeba: 1.17M sentences
  • WikiText-103: 3.93M sentences
17
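A minimal sketch of this generation loop, with tiny hand-written edit and preposition lists standing in for the real resources harvested from annotated corpora:

```python
import random

# Illustrative stand-ins: token_edits maps a correct token to errors observed for it
# in annotated GEC data; the preposition list drives a simple type-based perturbation.
token_edits = {"the": ["", "a"], "has": ["have"], "at": ["of", "in"]}
prepositions = ["at", "of", "in", "on", "for"]

def corrupt(tokens, p_token=0.9, seed=None):
    rng = random.Random(seed)
    noisy = []
    for tok in tokens:
        if tok in token_edits and rng.random() < p_token:
            # token-based error: replace with an error actually observed for this token
            err = rng.choice(token_edits[tok])
            if err:                        # "" means the token is simply dropped
                noisy.append(err)
        elif tok in prepositions:
            # token not handled above: fall back to a type-based error
            noisy.append(rng.choice([p for p in prepositions if p != tok]))
        else:
            noisy.append(tok)              # leave everything else untouched
    return noisy

clean = "she has been at the library".split()
print(" ".join(corrupt(clean, seed=1)), "->", " ".join(clean))
```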
Training and Fine-tuning
• Model
• Transformer*
• Copy-augmented Transformer
• Fine-tuning
• Both the development & test sets come from the same
source (WI+Locness)
• Use smaller learning rates
* Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information
processing systems. 2017.
18
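A minimal sketch of the two-stage idea in PyTorch, with a toy module standing in for the (copy-augmented) Transformer and purely illustrative learning rates, since the actual values are not given on the slide:

```python
import torch
from torch import nn

model = nn.Linear(8, 8)  # stand-in for the seq2seq model

# Training stage: ordinary learning rate.
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

# Fine-tuning stage: keep the trained weights (here kept in memory as a "checkpoint"),
# but restart the optimizer with a smaller learning rate so the in-domain WI+Locness
# data adjusts the model gently instead of overwriting what it has learned.
state = {k: v.clone() for k, v in model.state_dict().items()}
model.load_state_dict(state)
finetune_optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)
print(finetune_optimizer.param_groups[0]["lr"])
```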
Postprocessing
• <unk> recovery
• Infrequent tokens are replaced with <unk> during BPE tokenization
• LM re-ranking
• For each changed span, generate sentence variants with and without the correction, and compare their LM perplexities
• Error type control
• Randomly choose some error categories to drop and calculate the ERRANT F0.5 score on the validation set
19
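A minimal sketch of LM re-ranking, with a toy bigram scorer standing in for the pre-trained LM's perplexity:

```python
def lm_nll(tokens):
    """Stand-in for the pre-trained LM: a toy per-bigram cost, lower is better."""
    good = {("travelling", "by"), ("by", "bus"), ("is", "expensive"), ("expensive", ",")}
    return sum(0.5 if b in good else 2.0 for b in zip(tokens, tokens[1:])) / max(len(tokens) - 1, 1)

def rerank(source, edits):
    """Keep a proposed edit only if applying it does not worsen the LM score."""
    tokens = list(source)
    for pos, replacement in sorted(edits, reverse=True):
        candidate = tokens[:pos] + [replacement] + tokens[pos + 1:]
        if lm_nll(candidate) <= lm_nll(tokens):
            tokens = candidate
    return tokens

src = "travel by bus is exspensive".split()
edits = [(0, "travelling"), (4, "expensive")]   # (position, proposed correction)
print(" ".join(rerank(src, edits)))             # both edits lower the LM cost and are kept
```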
4. Results
20
Results
21
Context-Aware Spellchecking
• Our spellchecker adds context to hunspell using a pre-trained neural language model (LM)
22
[Results table: spellchecker performance after adding the LM-based approach and fixing casing issues]
Comparison of error generation
• Performance gap decreases on the Restricted Track
• Our pre-training functions as a proxy for training
23
Results on error types
• Token-based error generations
• Type-based error generations
• Context-aware spellchecker
• Challenging to match human
annotators’ “naturalness” edits
24
Questions
25
