What do Neural Machine
Translation Models Learn
about Morphology?
Yonatan Belinkov, Nadir Durrani, Fahim Dalvi,
Hassan Sajjad and James Glass
@ 8/11 ACL2017 Reading
M1 Hayahide Yamagishi
Introduction
● “Little is known about what and how much NMT models learn
about each language and its features.”
● They try to answer the following questions
1. Which parts of the NMT architecture capture word structure?
2. What is the division of labor between different components?
3. How do different word representations help learn better morphology and
modeling of infrequent words?
4. How does the target language affect the learning of word structure?
● Task: Part-of-Speech tagging and morphological tagging
2
Task
● Part-of-Speech (POS) tagging
○ computer → NN
○ computers → NNS
● Morphological tagging
○ he → 3, single, male, subject
○ him → 3, single, male, object
● Task: hidden states → tag
○ They test each hidden state by using it as the only input to a tagger.
○ If the tagging accuracy is high, the hidden states capture the word's structure.
3
Methodology
1. Training the NMT models (Bahdanau attention, LSTM)
2. Using the trained models as feature extractors.
3. Training a feedforward NN on the state-tag pairs (a minimal sketch follows below)
○ 1 layer: input layer, hidden layer, output layer
4. Testing the classifier on held-out data
● “Our goal is not to beat the state-of-the-art on a given task.”
● “We also experimented with a linear classifier and observed
similar trends to the non-linear case.”
4
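The probing setup above boils down to a small classifier trained on (hidden state, tag) pairs extracted from a frozen NMT model. Below is a minimal sketch in PyTorch; the dimensions, the placeholder `states` / `tags` tensors, and the training loop are illustrative assumptions, not the authors' exact code.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not the paper's exact settings).
STATE_DIM = 500    # dimensionality of an encoder hidden state
NUM_TAGS = 42      # size of the POS / morphological tag set
HIDDEN = 500       # single hidden layer of the probing classifier

# One hidden layer: input -> hidden -> output, as on the slide above.
classifier = nn.Sequential(
    nn.Linear(STATE_DIM, HIDDEN),
    nn.ReLU(),
    nn.Linear(HIDDEN, NUM_TAGS),
)

# `states` would be hidden states extracted from the frozen NMT model,
# `tags` the gold (or predicted) tag index of each corresponding word.
states = torch.randn(1024, STATE_DIM)        # placeholder features
tags = torch.randint(0, NUM_TAGS, (1024,))   # placeholder labels

optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(5):                           # a few epochs for the sketch
    optimizer.zero_grad()
    loss = loss_fn(classifier(states), tags)
    loss.backward()
    optimizer.step()

# Tagging accuracy on held-out states is the quantity the paper reports.
with torch.no_grad():
    accuracy = (classifier(states).argmax(dim=1) == tags).float().mean()
```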
Data
● Language Pair:
○ {Arabic, German, French, Czech} - English
○ Arabic - Hebrew (Both languages are morphologically-rich and similar.)
○ Arabic - German (Both languages are morphologically-rich but different.)
● Parallel corpus: TED
● POS annotated data
○ Gold: included in some datasets
○ Predicted: produced by freely available taggers
5
Char-based Encoder
● Character-aware Neural Language
Model [Kim+, AAAI2016]
● Character-based Neural Machine
Translation [Costa-jussa and Fonollosa,
ACL2016]
● Character embedding
→ word embedding
● The obtained word embeddings are fed into
the word-level RNN (an RNN-LM in Kim et al.); see the sketch below.
6
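The char-based encoder on this slide builds each word embedding from its characters before the word-level RNN sees it. The sketch below follows the CNN-over-characters idea of [Kim+, AAAI2016]; the character vocabulary size, filter width, and tensor names are assumptions for illustration.

```python
import torch
import torch.nn as nn

CHAR_VOCAB = 100   # number of distinct characters (assumption)
CHAR_DIM = 25      # character embedding size (assumption)
NUM_FILTERS = 200  # CNN filters -> dimensionality of the word embedding
KERNEL = 3         # character n-gram width

char_embed = nn.Embedding(CHAR_VOCAB, CHAR_DIM)
char_cnn = nn.Conv1d(CHAR_DIM, NUM_FILTERS, kernel_size=KERNEL, padding=1)

def word_embedding(char_ids: torch.Tensor) -> torch.Tensor:
    """Compose one word embedding from its character ids (shape: [num_chars])."""
    chars = char_embed(char_ids)            # [num_chars, CHAR_DIM]
    chars = chars.t().unsqueeze(0)          # [1, CHAR_DIM, num_chars]
    features = torch.relu(char_cnn(chars))  # [1, NUM_FILTERS, num_chars]
    return features.max(dim=2).values       # max-pool over character positions

# The resulting vectors replace lookup-table word embeddings and are fed
# into the word-level RNN, as described on the slide.
emb = word_embedding(torch.tensor([5, 12, 7, 3]))   # a 4-character word
```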
Effect of word representation (Encoder)
● Word-based vs. Char-based model
● Char-based models are stronger.
7
Impact of word frequency
● Frequent words don't need character information (see the frequency-bin sketch below).
● “The char-based model is able to learn character n-gram
patterns that are important for identifying word structure.”
8
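The frequency analysis on this slide amounts to grouping test words by their training-set frequency and measuring tagging accuracy per bin. A minimal sketch, assuming plain Python lists of words, gold tags, and classifier predictions (the bin boundaries are an assumption):

```python
from collections import Counter, defaultdict

def accuracy_by_frequency(train_words, test_words, gold_tags, pred_tags,
                          bins=(0, 1, 10, 100, 1000)):
    """Tagging accuracy of test words, grouped by training-set frequency."""
    freq = Counter(train_words)
    correct = defaultdict(int)
    total = defaultdict(int)
    for word, gold, pred in zip(test_words, gold_tags, pred_tags):
        # Assign the word to the highest bin threshold it reaches.
        bin_id = max(b for b in bins if freq[word] >= b)
        total[bin_id] += 1
        correct[bin_id] += int(gold == pred)
    return {b: correct[b] / total[b] for b in total}

# Rare-word bins are where the char-based model helps most.
print(accuracy_by_frequency(
    ["the", "the", "cat"], ["cat", "dog"], ["NN", "NN"], ["NN", "NNS"]))
```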
Confusion matrices
9
Analyzing specific tags
● In Arabic, the determiner "Al-" is attached to the noun as a prefix.
● The char-based model can distinguish "DT+NNS" from "NNS".
10
Effect of encoder depth
● The LSTM carries contextual information → layer 0 (the word embeddings) does worse.
● States from layer 1 are more effective than states from layer 2 (see the sketch below).
11
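Comparing encoder depths requires probing the states of each layer separately: layer 0 is the word embedding itself, layers 1 and 2 are LSTM outputs. A minimal sketch, with illustrative sizes and the two layers kept as separate modules so each layer's states can be read off:

```python
import torch
import torch.nn as nn

EMB_DIM, HID_DIM, VOCAB = 500, 500, 10000   # illustrative sizes

embed = nn.Embedding(VOCAB, EMB_DIM)
lstm1 = nn.LSTM(EMB_DIM, HID_DIM, batch_first=True)   # encoder layer 1
lstm2 = nn.LSTM(HID_DIM, HID_DIM, batch_first=True)   # encoder layer 2

tokens = torch.randint(0, VOCAB, (1, 7))   # one source sentence of 7 words

layer0 = embed(tokens)          # word embeddings ("layer 0" on the slide)
layer1, _ = lstm1(layer0)       # states used for the layer-1 probe
layer2, _ = lstm2(layer1)       # states used for the layer-2 probe

# Each of layer0 / layer1 / layer2 can be fed to the same probing
# classifier; the paper finds layer 1 gives the best tagging accuracy.
```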
Effect of encoder depth
● Char-based models show similar tendencies.
12
Effect of encoder depth
● BLEU: 2-layer NMT > 1-layer NMT
○ word / char: +1.11 / +0.56 BLEU
● Layer 1 learns the word representation.
● Layer 2 learns the word meaning.
● Word representation alone < word representation + word meaning
13
Effect of target language
● Translating into a morphologically-rich language is harder.
○ Arabic-English: 24.69 BLEU
○ English-Arabic: 13.37 BLEU
● “How does the target language affect the learned source
language representations?”
○ “Does translating into a morphologically-rich language require more
knowledge about source language morphology?”
● Experiment: Arabic - {Arabic, Hebrew, German, English}
○ Arabic-Arabic: Autoencoder
14
Result
15
Effect of target languages
● They expected translating into morph-rich languages would
make the model learn more about morphology. → No
● The accuracy doesn’t correlate with the BLEU score
○ Autoencoder couldn’t learn the morphological representation.
○ If the model only works as a recreator, it doesn’t have to learn it.
○ “A better translation model learns more informative representation.”
● Possible explanation
○ Arabic-English is simply better than -Hebrew and -German.
○ These models may not be able to afford to understand the representations of
word structure.
16
Decoder Analysis
● Similar probing experiments on the decoder (see the sketch below)
○ The decoder's input is the gold previous word (teacher forcing).
○ The char-based decoder's input is the char-based representation.
○ The char-based decoder's output is still predicted at the word level.
● Arabic-English or English-Arabic
17
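For the decoder analysis, states are collected while feeding the gold previous target word at each step, as described on the slide above. A minimal sketch of that extraction, assuming an LSTM decoder without attention (a simplification); names and sizes are illustrative:

```python
import torch
import torch.nn as nn

EMB_DIM, HID_DIM, TGT_VOCAB = 500, 500, 10000   # illustrative sizes

tgt_embed = nn.Embedding(TGT_VOCAB, EMB_DIM)
decoder = nn.LSTM(EMB_DIM, HID_DIM, batch_first=True)

# Gold target sentence (ids); at step t the decoder reads word t-1.
gold = torch.randint(0, TGT_VOCAB, (1, 6))
inputs = tgt_embed(gold[:, :-1])        # teacher forcing: shift by one

# The encoder's final state would normally initialize (h0, c0);
# a zero state stands in for it in this sketch.
states, _ = decoder(inputs)             # [1, 5, HID_DIM]

# Each row of `states` is paired with the tag of the word generated at
# that step and fed to the same probing classifier as on the encoder side.
```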
Effect of decoder states
● Decoder states don't carry much morphological information.
● BLEU doesn't correlate with the accuracy.
○ French-English: 37.8 BLEU / 54.26% accuracy
18
Effect of attention
● Encoder states
○ Task: creating a generic, close to language-independent representation of the
source sentence.
○ With attention, the encoder states serve as a memory that the decoder consults
(see the attention sketch below).
○ When the model translates a noun, the attention looks at the source noun words.
● Decoder states
○ Task: using the encoder's representation to generate the target sentence in a
specific language.
○ “Without the attention mechanism, the decoder is forced to learn more
informative representations of the target language.”
19
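The division of labor on this slide hinges on how attention lets the decoder consult the encoder states as a memory. A minimal sketch of additive (Bahdanau-style) attention scoring; the layer names, sizes, and single-query formulation are illustrative assumptions:

```python
import torch
import torch.nn as nn

HID_DIM, ATT_DIM = 500, 500   # illustrative sizes

W_enc = nn.Linear(HID_DIM, ATT_DIM, bias=False)
W_dec = nn.Linear(HID_DIM, ATT_DIM, bias=False)
v = nn.Linear(ATT_DIM, 1, bias=False)

def attention_context(enc_states, dec_state):
    """enc_states: [src_len, HID_DIM], dec_state: [HID_DIM]."""
    scores = v(torch.tanh(W_enc(enc_states) + W_dec(dec_state))).squeeze(-1)
    weights = torch.softmax(scores, dim=0)   # how much each source word is consulted
    return weights @ enc_states, weights     # context vector + alignment

# When generating a noun, the weights typically peak on the corresponding
# source noun; the encoder states act as a memory the decoder reads from.
enc_states = torch.randn(7, HID_DIM)
dec_state = torch.randn(HID_DIM)
context, weights = attention_context(enc_states, dec_state)
```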
Effect of word representation (Decoder)
● Char-based representations don't help the decoder
○ The decoder’s predictions are still done at word level.
○ “In Arabic-English the char-based model reduces the number of generated
unknown words in the MT test set by 25%.”
○ “In English-Arabic the number of unknown words remains roughly the same
between word-based and char-based models.”
20
Conclusion
● Their results lead to the following conclusions
○ Char-based representations are better than word-based ones
○ Lower layers capture morphology, while deeper layers improve translation
performance.
○ Translating into morphologically-poorer languages leads to better source
representations.
○ The attentional decoder learns impoverished representations that do not
carry much information about morphology.
● “Jointly learning translation and morphology can possibly
lead to better representations and improved translation.”
21
