
Dual Learning for Machine Translation (NIPS 2016)


  1. “Dual Learning for Machine Translation”, Di He et al. November 2016. Toru Fujino (The University of Tokyo, Graduate School of Frontier Sciences, Department of Human and Engineered Environmental Studies, Chen Lab, first-year doctoral student)
  2. Paper information
     • Authors: Di He et al. (Microsoft Research Asia)
     • Conference: NIPS 2016
     • Date: 11/01/2016 (arXiv)
     • Times cited: 1
  3. Overview
     • What
       • Introduce an autoencoder-like mechanism, “dual learning”, to utilize monolingual datasets
     • Results
       • Dual learning with 10% data ≈ baseline model with 100% data 1)
     1) “Dual Learning: A New Learning Paradigm”, https://www.youtube.com/watch?v=HzokNo3g63E&feature=youtu.be
  4. Neural machine translation
     • Learn the conditional probability P(y | x; Θ) from an input x = (x_1, x_2, …, x_{T_x}) to an output y = (y_1, y_2, …, y_{T_y})
     • Maximize the log-probability:
       Θ* = argmax_Θ Σ_{(x,y) ∈ D} Σ_{t=1}^{T_y} log P(y_t | y_{<t}, x; Θ)
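As a concrete illustration of this objective, here is a minimal Python sketch; `next_token_probs` is a hypothetical stand-in for a real encoder-decoder and simply returns a uniform distribution so the example runs end to end:

```python
import math

VOCAB = ["<eos>", "je", "suis", "i", "am"]

def next_token_probs(x_tokens, y_prefix):
    """Placeholder for P(. | y_<t, x; Theta): a uniform distribution."""
    return {w: 1.0 / len(VOCAB) for w in VOCAB}

def sentence_log_prob(x_tokens, y_tokens):
    """log P(y | x; Theta) = sum over t of log P(y_t | y_<t, x; Theta)."""
    total = 0.0
    for t, y_t in enumerate(y_tokens):
        probs = next_token_probs(x_tokens, y_tokens[:t])
        total += math.log(probs[y_t])
    return total

# Training maximizes this quantity summed over all pairs (x, y) in D.
print(sentence_log_prob(["i", "am"], ["je", "suis", "<eos>"]))  # ~ -4.83
```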
  5. Difficulty in getting large bilingual data
     • Solution: utilization of monolingual data
       • Train a language model of the target language, and then integrate it with the MT model 1)2) <- does not fundamentally address the shortage of parallel data
       • Generate pseudo-bilingual data from monolingual data 3)4) <- no guarantee on the quality of the pseudo-bilingual data
     1) T. Brants et al., “Large language models in machine translation”, EMNLP 2007
     2) C. Gulcehre et al., “On using monolingual corpora in neural machine translation”, arXiv 2015
     3) R. Sennrich et al., “Improving neural machine translation models with monolingual data”, ACL 2016
     4) N. Ueffing et al., “Semi-supervised model adaptation for statistical machine translation”, Machine Translation Journal 2008
  6. Dual learning algorithm
     • Use monolingual datasets to train translation models through dual learning
     • Things required (see the sketch below):
       • D_A: corpus of language A
       • D_B: corpus of language B (not necessarily aligned with D_A)
       • P(. | s; Θ_AB): translation model from A to B
       • P(. | s; Θ_BA): translation model from B to A
       • LM_A(.): learned language model of A
       • LM_B(.): learned language model of B
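A minimal container for these six ingredients might look as follows; the field names are illustrative assumptions, not identifiers from the paper:

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical grouping of the components the algorithm assumes to exist.
@dataclass
class DualLearningSetup:
    D_A: List[str]                            # monolingual corpus of language A
    D_B: List[str]                            # monolingual corpus of language B
    translate_ab: Callable[[str, int], list]  # beam search over P(. | s; Theta_AB)
    translate_ba: Callable[[str, int], list]  # beam search over P(. | s; Theta_BA)
    lm_a: Callable[[str], float]              # LM_A(.), sentence log-probability
    lm_b: Callable[[str], float]              # LM_B(.), sentence log-probability
```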
  7. Dual learning algorithm
     1. Generate K translated sentences s_mid,1, s_mid,2, …, s_mid,K from P(. | s; Θ_AB) based on beam search
  8. Dual learning algorithm
     1. Generate K translated sentences s_mid,1, s_mid,2, …, s_mid,K from P(. | s; Θ_AB) based on beam search
     2. Compute intermediate rewards r_1,1, r_1,2, …, r_1,K for each sentence as r_1,k = LM_B(s_mid,k)
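A minimal sketch of steps 1-2 above, under stated assumptions: `beam_search_ab` stands in for beam search over P(. | s; Θ_AB), and `lm_b` for the pretrained language model LM_B; both are toy stubs so the snippet runs on its own:

```python
def beam_search_ab(s, K):
    """Placeholder for beam search: returns K candidate translations of s."""
    return [f"<translation {k} of: {s}>" for k in range(K)]

def lm_b(sentence):
    """Placeholder for LM_B(.): a length-penalized toy log-probability."""
    return -0.5 * len(sentence.split())

s = "monolingual sentence in language A"   # drawn from D_A
K = 2
s_mid = beam_search_ab(s, K)               # step 1: s_mid,1 ... s_mid,K
r1 = [lm_b(m) for m in s_mid]              # step 2: r_1,k = LM_B(s_mid,k)
print(r1)
```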
  9. Dual learning algorithm
     3. Get communication rewards r_2,1, r_2,2, …, r_2,K for each sentence as r_2,k = ln P(s | s_mid,k; Θ_BA)
  10. Dual learning algorithm
     3. Get communication rewards r_2,1, r_2,2, …, r_2,K for each sentence as r_2,k = ln P(s | s_mid,k; Θ_BA)
     4. Set the total reward of the k-th sentence as r_k = α r_1,k + (1 − α) r_2,k
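Continuing the sketch for steps 3-4; `log_p_ba` is a stand-in for ln P(s | s_mid,k; Θ_BA), the log-probability that the reverse model reconstructs the original sentence s, and all values are toy placeholders so the snippet runs on its own:

```python
def log_p_ba(s, s_mid_k):
    """Placeholder for the reverse model's reconstruction log-probability."""
    return -1.0

s = "monolingual sentence in language A"
s_mid = ["<translation 0>", "<translation 1>"]   # K = 2 candidates from step 1
r1 = [-2.0, -3.0]                                # LM_B rewards from step 2

alpha = 0.005  # reward-mixing hyperparameter (hypothetical small value)
r2 = [log_p_ba(s, m) for m in s_mid]                         # step 3
r = [alpha * a + (1 - alpha) * b for a, b in zip(r1, r2)]    # step 4
print(r)
```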
  11. Dual learning algorithm
     5. Compute the stochastic gradients with respect to Θ_AB and Θ_BA:
        ∇_{Θ_AB} E[r] = (1/K) Σ_{k=1}^{K} [r_k ∇_{Θ_AB} ln P(s_mid,k | s; Θ_AB)]
        ∇_{Θ_BA} E[r] = (1/K) Σ_{k=1}^{K} [(1 − α) ∇_{Θ_BA} ln P(s | s_mid,k; Θ_BA)]
  12. Dual learning algorithm
     5. Compute the stochastic gradients with respect to Θ_AB and Θ_BA:
        ∇_{Θ_AB} E[r] = (1/K) Σ_{k=1}^{K} [r_k ∇_{Θ_AB} ln P(s_mid,k | s; Θ_AB)]
        ∇_{Θ_BA} E[r] = (1/K) Σ_{k=1}^{K} [(1 − α) ∇_{Θ_BA} ln P(s | s_mid,k; Θ_BA)]
     6. Update the model parameters:
        Θ_AB ← Θ_AB + γ_1 ∇_{Θ_AB} E[r]
        Θ_BA ← Θ_BA + γ_2 ∇_{Θ_BA} E[r]
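A self-contained sketch of steps 5-6 (a policy-gradient-style update). `grad_ab` and `grad_ba` are hypothetical stand-ins for the back-propagated gradients of the two log-probabilities, which real models would supply; the hyperparameter values are likewise illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8                      # toy parameter dimension
theta_ab = np.zeros(DIM)     # Theta_AB
theta_ba = np.zeros(DIM)     # Theta_BA

def grad_ab(s, s_mid_k):
    """Placeholder for grad_{Theta_AB} ln P(s_mid,k | s; Theta_AB)."""
    return rng.normal(size=DIM)

def grad_ba(s, s_mid_k):
    """Placeholder for grad_{Theta_BA} ln P(s | s_mid,k; Theta_BA)."""
    return rng.normal(size=DIM)

s = "monolingual sentence in language A"
s_mid = ["<translation 0>", "<translation 1>"]   # from step 1
r = [-1.2, -0.8]                                 # total rewards from step 4
alpha, gamma1, gamma2 = 0.005, 0.01, 0.01        # hypothetical hyperparameters
K = len(s_mid)

# Step 5: Monte Carlo estimates of the gradients of the expected reward E[r].
g_ab = sum(r_k * grad_ab(s, m) for m, r_k in zip(s_mid, r)) / K
g_ba = sum((1 - alpha) * grad_ba(s, m) for m in s_mid) / K

# Step 6: gradient ascent, since the expected reward is being maximized.
theta_ab += gamma1 * g_ab
theta_ba += gamma2 * g_ba
```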
  13. Dual learning algorithm
  14. Experiment settings
     • Baseline models
       • Bahdanau et al., “Neural Machine Translation by Jointly Learning to Align and Translate”
       • Sennrich et al., “Improving Neural Machine Translation Models with Monolingual Data”
  15. Dataset
     • WMT’14
       • 12M sentence pairs
       • English -> French, French -> English
     • Data usage (for dual learning)
       • Small
         1. Train translation models with 10% of the bilingual data.
         2. Train translation models with 10% of the bilingual data and monolingual data through the dual learning algorithm.
         3. Train translation models only with monolingual data through the dual learning algorithm.
       • Large
         1. Train translation models with 100% of the bilingual data.
         2. Train translation models with 100% of the bilingual data and monolingual data through the dual learning algorithm.
         3. Train translation models only with monolingual data through the dual learning algorithm.
  16. Evaluation
     • BLEU: geometric mean of n-gram precisions
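For reference, a hedged sketch of sentence-level BLEU: the standard definition multiplies the geometric mean of clipped n-gram precisions by a brevity penalty (included below); smoothing for zero counts is omitted for clarity:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        clipped = sum(min(count, ref[g]) for g, count in cand.items())
        total = sum(cand.values())
        if clipped == 0 or total == 0:
            return 0.0  # no smoothing: any empty precision zeroes the score
        log_precision += math.log(clipped / total) / max_n
    brevity = min(1.0, math.exp(1.0 - len(reference) / len(candidate)))
    return brevity * math.exp(log_precision)

print(bleu("the cat sat on the mat".split(),
           "the cat sat on the mat today".split()))  # ~0.85 (brevity penalty)
```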
  17. Results
     • Outperforms the baseline models
     • In Fr->En, dual learning with 10% data ≈ baseline models with 100% data
     • Dual learning is especially effective when the bilingual dataset is small
  18. Results
     • For different source sentence lengths
     • The improvement is significant for long sentences
  19. Results
     • Reconstruction performance (BLEU)
     • Huge improvement over the baseline models, especially in En->Fr->En (S)
  20. Results
     • Reconstruction examples
  21. Future extensions & words
     • Application in other domains
     • Generalization of dual learning
       • Dual -> Triple -> … -> n-loop
     • Learn from scratch
       • Only with monolingual data
       • Maybe plus a lexical dictionary

     Application          | Primal task        | Dual task
     ---------------------+--------------------+--------------------------
     Speech processing    | Speech recognition | Text to speech
     Image understanding  | Image captioning   | Image generation
     Conversation engine  | Question           | Response
     Search engine        | Search             | Query/keyword suggestion
  22. Summary
     • What
       • Introduce the “dual learning algorithm” to utilize monolingual data
     • Results
       • With 100% of the data, the model outperforms the baseline models
       • With 10% of the data, the model achieves results comparable to the baseline models
     • Future
       • The dual learning mechanism can be applied to other domains
       • Learning from scratch
  23. Some notes
     • Does dual learning fail to learn word-to-word correspondences?
     • Is training from bilingual data a must?
       • Or would a lexical dictionary suffice?
  24. Appendix: Stochastic gradient of models
