On using monolingual corpora in neural machine translation


"On using monolingual corpora in neural machine translation" article presentation.

Published in: Engineering

  1. On Using Monolingual Corpora in Neural Machine Translation
     Presentation by: Ander Martinez Sanchez (D1), Matsumoto Lab (松本研)
  2. Abstract
     ● Recent NMT has shown promising results
       ○ Thanks to good parallel corpora
     ● Investigates how to leverage abundant monolingual corpora
     ● Up to +1.96 BLEU on a low-resource pair (Turkish-English)
       ○ +1.59 BLEU on a focused-domain pair (Chinese-English chat messages)
     ● Also benefits high-resource language pairs
       ○ +0.39 BLEU on Czech-English
       ○ +0.47 BLEU on German-English
  3. Introduction
     ● Goal: improve NMT by using monolingual data
     ● By: integrating a Language Model (LM) for the target language (English)
     ● For:
       a. Resource-limited pair: Turkish-English
       b. Domain-restricted translation: Chinese-English SMS chats
       c. High-resource pairs: German-English and Czech-English
     ● Article structure:
       a. Recent work
       b. Basic model architecture
       c. Shallow and deep fusion approaches
       d. Datasets
       e. Experiments and results
  4. Background: Neural Machine Translation
     SMT
     ● Theory: maximize $p(y \mid x)$. By Bayes' rule, $p(y \mid x) \propto p(x \mid y)\,p(y)$, where $p(y)$ is the language model
     ● Reality: systems tend to model $\log p(y \mid x) = \sum_j f_j(x, y) + C$
       ○ $f_j(x, y)$ is a feature, such as pair-wise statistics
       ○ $C$ is a normalization constant, often ignored
     NMT
     ● A single network optimizes $\log p(y \mid x)$, including feature extraction and $C$
     ● Typically an encoder-decoder framework
     ● Once the conditional distribution is learnt,
       ○ find a translation using, for instance, a beam search algorithm
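
The last bullet of the slide above mentions beam search as a way to find a translation once the conditional distribution is learnt. The following is a minimal sketch of that idea, not the presenter's or the paper's implementation. The names `beam_search`, `step_log_probs` and `toy_model`, and the toy probabilities, are made up for illustration.

```python
import math

def beam_search(step_log_probs, beam_size, max_len, eos="</s>"):
    """Return the best (tokens, score) pair found by beam search.

    step_log_probs(prefix) is a hypothetical callback that returns a dict
    mapping each next token to log p(token | prefix, x).
    """
    beams = [((), 0.0)]   # unfinished hypotheses: (token tuple, log-prob)
    finished = []         # hypotheses that ended with the EOS token
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in step_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        # Keep only the beam_size highest-scoring extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == eos:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])

def toy_model(prefix):
    """Toy next-token distribution (illustrative numbers only)."""
    if not prefix:
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix[-1] == "a":
        return {"</s>": math.log(0.5), "a": math.log(0.5)}
    return {"</s>": math.log(0.9), "a": math.log(0.1)}

best_tokens, best_score = beam_search(toy_model, beam_size=2, max_len=3)
```

In this toy case a greedy decoder would commit to "a" first, while a beam of size 2 keeps "b" alive and ends with the higher-probability sequence.
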
  5. Model Description
     Figure from [Ling et al. 2015]
     1. Word embeddings
     2. Annotation vectors (encoding)
     3. Target (y) word embeddings
     4. Decoder hidden state
     5. Alignment model
     6. Context vector
     7. Deep output layer and softmax
     Optimize: the conditional log-likelihood of the parallel training pairs, $\sum_n \log p_\theta(y^{(n)} \mid x^{(n)})$
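
Steps 5 and 6 of the slide above (alignment model, context vector) can be illustrated with a small sketch. The slide does not spell out the alignment model, so the alignment scores are taken as inputs here. The function name `context_vector` and the example vectors are assumptions made for illustration.

```python
import math

def context_vector(scores, annotations):
    """Softmax the alignment scores and average the annotation vectors.

    scores: one alignment score per source position (assumed given).
    annotations: one annotation vector (list of floats) per source position.
    Returns (context, weights).
    """
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(annotations[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, annotations))
        for d in range(dim)
    ]
    return context, weights

# Two source positions with equal scores: the context is their average.
ctx, attn = context_vector([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```
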
  6. Integrating Language Model into the Decoder
     ● Two methods for integrating an LM: “shallow fusion” and “deep fusion”
     ● Both the Language Model (LM) and the Translation Model (TM) are pre-trained
     ● The LM is based on Recurrent Neural Networks (RNNLM) [Mikolov et al. 2011]
       ○ Very similar to the TM, but without steps (5) and (6) of the previous slide
     Shallow Fusion
     ● The NMT model computes a score p for each candidate next word
     ● The new hypothesis score is the sum of the word score and the score of the hypothesis at t-1
     ● The top K hypotheses are selected as candidates
     ● Then the hypotheses are rescored using a weighted sum of the TM and LM scores
     Deep Fusion
     ● Concatenate the hidden states of the LM and TM before the deep output layer
     ● The model is fine-tuned
       ○ Only for the parameters involved
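
The shallow-fusion rescoring step can be sketched for a single decoding step as follows. This assumes the weighted sum takes the form log p_TM + beta * log p_LM, with beta the weight mentioned on the settings slide. The function name `shallow_fusion_rescore` and the toy probabilities are hypothetical.

```python
import math

def shallow_fusion_rescore(tm_log_probs, lm_log_probs, beta, top_k):
    """Rescore the top_k TM candidates with a weighted LM score.

    tm_log_probs / lm_log_probs: dicts mapping word -> log-probability.
    beta: LM weight (assumed form: log p_TM + beta * log p_LM).
    """
    # Step 1: the translation model proposes its K best next words.
    candidates = sorted(tm_log_probs, key=tm_log_probs.get, reverse=True)[:top_k]
    # Step 2: only those candidates are rescored with the language model.
    return {w: tm_log_probs[w] + beta * lm_log_probs[w] for w in candidates}

# Toy distributions (illustrative numbers only).
tm = {"the": math.log(0.40), "a": math.log(0.38), "cat": math.log(0.22)}
lm = {"the": math.log(0.05), "a": math.log(0.60), "cat": math.log(0.35)}

rescored = shallow_fusion_rescore(tm, lm, beta=0.1, top_k=2)
best_word = max(rescored, key=rescored.get)
```

In a full decoder these per-word scores would be added to the running score of each hypothesis at t-1, as the slide describes.
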
  7. Integrating Language Model into the Decoder
     Deep Fusion: Balancing the LM and the TM
     ● In some cases the LM is more informative than in others. Examples:
       ○ Articles: Chinese has none, so in Zh-En the LM is more informative than the TM
       ○ Nouns: the LM is less informative in this case
     ● A controller mechanism is added
       ○ The hidden state of the LM, $s_t^{LM}$, is multiplied by $g_t = \sigma(v_g^\top s_t^{LM} + b_g)$
       ○ $v_g$ and $b_g$ are learnt parameters
     ● Intuitively, this decides the importance of the LM for each word
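
A minimal sketch of the controller described above: a scalar gate computed from the LM hidden state scales that state before it is concatenated with the TM hidden state. The function names and the example vectors are assumptions for illustration, and the deep output layer that would consume the fused vector is omitted.

```python
import math

def controller_gate(s_lm, v_g, b_g):
    """Scalar gate g_t = sigmoid(v_g . s_lm + b_g)."""
    z = sum(v * s for v, s in zip(v_g, s_lm)) + b_g
    return 1.0 / (1.0 + math.exp(-z))

def fuse_hidden_states(s_tm, s_lm, v_g, b_g):
    """Concatenate the TM state with the gated LM state.

    Returns (fused_vector, g_t); the fused vector is what would be fed
    to the deep output layer.
    """
    g_t = controller_gate(s_lm, v_g, b_g)
    return list(s_tm) + [g_t * s for s in s_lm], g_t

# With zero gate parameters the gate sits at 0.5 (illustrative values).
fused, gate = fuse_hidden_states(
    s_tm=[1.0, 2.0], s_lm=[4.0, -2.0], v_g=[0.0, 0.0], b_g=0.0
)
```
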
  8. Datasets
     1. Chinese-English (Zh-En)
        a. From the NIST OpenMT’15 challenge
           i. SMS/Chat
           ii. Conversational Telephone Speech (CTS)
           iii. Newsgroups/weblogs from the DARPA BOLT Project
        b. Chinese side at character level
        c. Restricted to CTS (?)
     2. Turkish-English (Tr-En)
        a. WIT and SETimes parallel corpora (TEDx talks)
        b. Turkish tokenized into subword units (Zemberek)
     3. German-English (De-En)
     4. Czech-English (Cs-En)
        a. WMT’15. Weird sentences dropped.
     5. Monolingual corpora: English Gigaword (LDC)
  9. Datasets
  10. Settings NMT
      ● Vocabulary sizes for Zh-En and Tr-En: Zh (10k), Tr (30k), En (40k)
      ● Vocabulary sizes for Cs-En and De-En: 200k, using sampled softmax
        ○ [Jean et al. 2014]
      ● Size of recurrent units: Zh-En (1200), Tr-En (1000); others?
      ● Adadelta with minibatches of 80
      ● Gradient clipped when its L2 norm exceeds 5
      ● Non-recurrent layers use dropout [Hinton et al. 2012]
        ○ and Gaussian noise (mean: 0, std: 0.001) to prevent overfitting [Graves, 2011]
      ● Early stopping on development-set BLEU
      ● Weight matrices initialized as random orthonormal matrices
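
The gradient-clipping setting can be sketched as below. The slide only states the threshold of 5; rescaling the whole gradient so that its L2 norm equals the threshold is the common interpretation and is assumed here. The function name `clip_by_l2_norm` is hypothetical.

```python
import math

def clip_by_l2_norm(grad, threshold=5.0):
    """Rescale grad so that its L2 norm does not exceed threshold.

    Assumption: clipping is done by rescaling the whole vector, which
    preserves the gradient direction.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)
    scale = threshold / norm
    return [g * scale for g in grad]

clipped = clip_by_l2_norm([6.0, 8.0])    # norm 10, gets rescaled
unchanged = clip_by_l2_norm([3.0, 4.0])  # norm 5, left as is
```
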
  11. Settings LM
      ● An LM was built for each English vocabulary (3 variations) with
        ○ LSTM of 2,400 units (Zh-En and Tr-En)
        ○ LSTM of 2,000 units (Cs-En and De-En)
      ● Optimized with
        ○ RMSProp (Zh-En and Tr-En) [Tieleman and Hinton, 2012]
        ○ Adam (Cs-En and De-En) [Kingma and Ba, 2014]
      ● Sentences with more than 10% UNK tokens were discarded
      ● Early stopping on perplexity
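
The UNK filter for the monolingual data could look like the sketch below. The function name `keep_sentence`, the toy vocabulary, and the choice to drop empty sentences are assumptions made for illustration.

```python
def keep_sentence(tokens, vocab, max_unk_ratio=0.10):
    """Return True unless more than max_unk_ratio of tokens are unknown.

    tokens: list of word strings; vocab: set of in-vocabulary words.
    Empty sentences are dropped (an assumption, not from the slides).
    """
    if not tokens:
        return False
    unk_count = sum(1 for t in tokens if t not in vocab)
    return unk_count / len(tokens) <= max_unk_ratio

vocab = {"the", "cat", "sat", "on", "mat"}
one_unk = ["the", "cat", "sat", "on", "the", "mat", "the", "cat", "sat", "xyz"]
two_unk = ["the", "cat", "sat", "on", "the", "mat", "the", "cat", "foo", "xyz"]

kept = [s for s in (one_unk, two_unk) if keep_sentence(s, vocab)]
```
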
  12. Settings
      ● Shallow Fusion
        ○ Beta (Eq. 5) selected to maximize translation performance on the dev set
        ○ Searched in the range 0.001 to 0.1
        ○ LM softmax renormalized without the EOS and OOV symbols
          ■ Possibly needed because of domain differences between the LM and TM
      ● Deep Fusion
        ○ Fine-tuned the parameters of the deep output layer and the controller
          ■ RMSProp:
            ● Dropout prob: 0.56
            ● STD of weight noise: 0.005
            ● Level of regularization reduced after 10K updates
          ■ Adadelta: update steps scaled down by 0.01
      ● Handling rare words: for the De- and Cs- cases, copy UNK words from the source using the attention mechanism (improved by +1.0 BLEU)
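
Renormalizing the LM softmax without the EOS and OOV symbols amounts to dropping those entries and rescaling the rest to sum to one. A minimal sketch follows, where the symbol strings `"</s>"` and `"<unk>"`, the function name, and the toy distribution are all assumptions.

```python
def renormalize_without(probs, excluded=("</s>", "<unk>")):
    """Drop the excluded symbols and rescale the rest to sum to 1.

    probs: dict mapping word -> probability from the LM softmax.
    """
    kept = {w: p for w, p in probs.items() if w not in excluded}
    total = sum(kept.values())
    if total <= 0.0:
        raise ValueError("no probability mass left after exclusion")
    return {w: p / total for w, p in kept.items()}

# Toy LM distribution (illustrative numbers only).
lm_probs = {"the": 0.4, "cat": 0.2, "</s>": 0.3, "<unk>": 0.1}
renormalized = renormalize_without(lm_probs)
```
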
  13. Results Zh-En: OpenMT’15
      ● Phrase-Based (PB) SMT [Koehn et al. 2003]
        ○ Rescoring with an external neural LM (+CSLM)
          ■ [Schwenk]
      ● Hierarchical Phrase-Based SMT (HPB) [Chiang, 2005]
        ○ +CSLM
      ● NMT, NMT+Shallow, NMT+Deep
      ● Except on CTS, +Deep helps
      ● NMT outperformed Phrase-Based SMT
  14. Results Tr-En: IWSLT’14
      ● Using Deep Fusion
        ○ +1.19 BLEU
        ○ Outperformed the best previously reported result [Yilmaz et al. 2013]
  15. Results Cs-En and De-En: WMT’15
      ● Shallow: +0.09 and +0.29 BLEU
      ● Deep: +0.39 and +0.47 BLEU
  16. Analysis
      ● The improvement depends heavily on domain similarity
      ● In the case of Zh, the domain is very different (conversational vs. news)
        ○ This is supported by the high perplexity
      ● Perplexity for Tr is lower, which led to a larger improvement for both shallow and deep fusion
        ○ Perplexity is even lower for De- and Cs-, so the improvement is larger
      ● For deep fusion, the weight of the LM is regulated through the controller
        ○ For more similar domains it will be more active
        ○ In the case of De- and Cs-, the controller was more active
          ■ Correlates with the BLEU improvement
          ■ Deep fusion can adapt better to domain mismatch
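
The analysis above uses LM perplexity on the target side as a measure of domain similarity. As a reminder of what that number is, the sketch below computes perplexity from per-token natural-log probabilities. The function name `perplexity` and the toy inputs are assumptions for illustration.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-probability per token.

    token_log_probs: natural-log probabilities the LM assigned to each
    token of the evaluation text.
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# An LM that is uniformly unsure among 4 words has perplexity 4.
uniform_ppl = perplexity([math.log(0.25)] * 10)
# A more confident LM (a better domain match) scores lower.
confident_ppl = perplexity([math.log(0.5)] * 10)
```
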
  17. Conclusion and Future Work
      ● Two methods were presented and empirically evaluated
      ● For Chinese and Turkish, the deep fusion approach achieves better results than existing SMT systems
      ● An improvement was also observed for high-resource pairs
      ● The improvement depends heavily on the domain match between the LM and the TM
        ○ In the cases where the domains matched, there was an improvement for both the shallow and deep approaches
      ● This suggests that domain adaptation for the LM may improve translations