
Neural Machine Translation (D3L4 Deep Learning for Speech and Language UPC 2017)

https://telecombcn-dl.github.io/2017-dlsl/

Winter School on Deep Learning for Speech and Language. UPC BarcelonaTech ETSETB TelecomBCN.

The aim of this course is to train students in methods of deep learning for speech and language. Recurrent Neural Networks (RNN) will be presented and analyzed in detail to understand the potential of these state of the art tools for time series processing. Engineering tips and scalability issues will be addressed to solve tasks such as machine translation, speech recognition, speech synthesis or question answering. Hands-on sessions will provide development skills so that attendees can become competent in contemporary data analytics tools.

  1. 1. [course site] Day 3 Lecture 4 Neural Machine Translation Marta R. Costa-jussà
  2. 2. 2 Acknowledgments Kyunghyun Cho, NVIDIA BLOGS: https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-with-gpus/
  3. 3. 3 Previous concepts from this course ● Recurrent neural network (LSTM and GRU) (handle variable-length sequences) ● Word embeddings ● Language Modeling (assign a probability to a sentence)
  4. 4. 4 Machine Translation background Machine Translation is the task of automatically translating from a source language (S) into a target language (T).
  5. 5. 5 Rule-based approach Main approaches have been either rule-based or statistical-based
  6. 6. 6 Statistical-based approach Main approaches have been either rule-based or statistical-based
  7. 7. 7 Why a new approach? It takes years to develop a good rule-based system. Regarding statistical systems: (1) word alignment and translation are optimized separately; (2) translation operates at the level of words, which causes difficulties with highly inflected morphology (e.g. English-to-Finnish translation); (3) systems are built per language pair: (a) it is difficult to devise an automatic interlingua, (b) performance is poor for low-resourced languages.
  8. 8. 8 Why Neural Machine Translation? ● Integrated MT paradigm ● Trainable at the subword/character level ● Multilingual advantages
  9. 9. 9 What do we need? ● Parallel Corpus: the same requirement as phrase-based systems
  10. 10. 10 Sources of parallel corpus ● European Parliament Plenary Sessions (EPPS) transcriptions ● Canadian Hansards ● United Nations ● CommonCrawl ● ... International evaluation campaigns: Conference on Machine Translation (WMT) International Workshop on Spoken Language Translation (IWSLT)
  11. 11. 11 What else do we need? The same requirements as phrase-based systems: an automatic evaluation measure
  12. 12. 12 Towards Neural Machine Translation
  13. 13. 13 Encoder-Decoder Front View Side View Representation of the sentence Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014).
  14. 14. Encoder
  15. 15. 15 Encoder in three steps Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) (1) One-hot encoding (2) Continuous space representation (3) Sequence summarization
  16. 16. 16 Step 1: One-hot encoding Natural language words can also be one-hot encoded as vectors of dimensionality equal to the size of the dictionary (K). Word One-hot encoding economic 000010... growth 001000... has 100000... slowed 000001... From previous lecture on language modeling
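A minimal sketch of this step, assuming a toy four-word vocabulary (illustrative only, not the course code):

    import numpy as np

    vocab = ["economic", "growth", "has", "slowed"]   # toy dictionary, size K = 4
    word_to_index = {w: i for i, w in enumerate(vocab)}

    def one_hot(word, K=len(vocab)):
        """Return a K-dimensional vector with a single 1 at the word's index."""
        v = np.zeros(K)
        v[word_to_index[word]] = 1.0
        return v

    print(one_hot("growth"))   # [0. 1. 0. 0.]
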
  17. 17. 17 Step 2: Projection to continuous space Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) The one-hot vector w_i (dimension K) is linearly projected to a continuous space of lower dimension M (typically 100-500) with a matrix E of learned weights: s_i = E w_i
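A self-contained sketch of the projection s_i = E w_i with toy dimensions; E is random here as a stand-in for learned weights:

    import numpy as np

    K, M = 4, 3                              # vocabulary size and embedding dimension (toy values)
    rng = np.random.default_rng(0)
    E = rng.standard_normal((M, K)) * 0.1    # learned in practice; random here for illustration

    w_i = np.zeros(K); w_i[1] = 1.0          # one-hot vector for the 2nd word
    s_i = E @ w_i                            # continuous representation, shape (M,)

    # Multiplying by a one-hot vector just selects a column of E,
    # so in practice this is implemented as a table lookup: E[:, 1]
    assert np.allclose(s_i, E[:, 1])
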
  18. 18. 18 Step 3: Recurrence Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
  19. 19. 19 Step 3: Recurrence Sequence Figure: Christopher Olah, “Understanding LSTM Networks” (2015) The recurrent unit should be either an LSTM or a GRU
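A sketch of the recurrence with a GRU cell written out in numpy; the weights are random stand-ins for learned parameters, and this follows the standard GRU equations rather than the exact course code:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    M, H = 3, 5                                   # embedding and hidden dimensions (toy values)
    rng = np.random.default_rng(0)
    Wz, Wr, W = (rng.standard_normal((H, M)) * 0.1 for _ in range(3))
    Uz, Ur, U = (rng.standard_normal((H, H)) * 0.1 for _ in range(3))

    def gru_step(s_t, h_prev):
        z = sigmoid(Wz @ s_t + Uz @ h_prev)       # update gate
        r = sigmoid(Wr @ s_t + Ur @ h_prev)       # reset gate
        h_tilde = np.tanh(W @ s_t + U @ (r * h_prev))   # candidate state
        return (1 - z) * h_prev + z * h_tilde     # interpolate old state and candidate

    # Summarize a sentence: feed the word embeddings one by one; the final state h_T
    # is the fixed-length representation of the whole source sentence.
    sentence = [rng.standard_normal(M) for _ in range(4)]   # stand-in embeddings s_1..s_4
    h = np.zeros(H)
    for s_t in sentence:
        h = gru_step(s_t, h)
    h_T = h
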
  20. 20. Decoder
  21. 21. 21 Decoder Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) The RNN’s new internal state z_i depends on: the summary vector h_T, the previous output word u_{i-1} and the previous internal state z_{i-1}.
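A sketch of one decoder state update. For brevity this uses a plain tanh RNN instead of the gated unit used in the lecture, and conditions on the summary by concatenating it with the previous word's embedding (one common choice); all names and dimensions are illustrative assumptions:

    import numpy as np

    H, M = 5, 3                                   # hidden and embedding sizes (toy values)
    rng = np.random.default_rng(1)
    Wd = rng.standard_normal((H, M + H)) * 0.1    # input weights (prev word embedding + summary)
    Ud = rng.standard_normal((H, H)) * 0.1        # recurrent weights

    def decoder_state(u_prev_emb, h_T, z_prev):
        """New internal state z_i from the previous output word's embedding,
        the source summary h_T and the previous state z_{i-1}."""
        x = np.concatenate([u_prev_emb, h_T])
        return np.tanh(Wd @ x + Ud @ z_prev)

    z_i = decoder_state(rng.standard_normal(M), rng.standard_normal(H), np.zeros(H))
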
  22. 22. 22 Decoder Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) With z_i ready, we can score each word k in the vocabulary with a dot product between the RNN internal state z_i and the neuron weights w_k for word k.
  23. 23. 23 Decoder Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) The score is higher if the word vector w_k and the decoder’s internal state z_i are similar to each other. Remember: a dot product gives the length of the projection of one vector onto the other. For similar (nearly parallel) vectors the projection is longer than for very different (nearly perpendicular) ones.
  24. 24. 24 Decoder Bridle, John S. "Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters." NIPS 1989. Given the score e(k) for word k, we can finally normalize to word probabilities with a softmax: p(u_i = k | u_{i-1}, ..., u_1, h_T) = exp(e(k)) / Σ_j exp(e(j)), the probability that the i-th word is word k, given the previous words and the hidden state.
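A sketch of the scoring and softmax steps together; the output weight matrix and bias are random stand-ins for learned parameters:

    import numpy as np

    K, H = 4, 5                                   # vocabulary size and decoder state size (toy)
    rng = np.random.default_rng(2)
    W_out = rng.standard_normal((K, H)) * 0.1     # one weight vector w_k per target word
    b = np.zeros(K)

    z_i = rng.standard_normal(H)                  # current decoder internal state

    scores = W_out @ z_i + b                      # e(k) = w_k . z_i + b_k for every word k
    scores -= scores.max()                        # subtract max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum() # softmax: p(u_i = k | previous words, h_T)

    next_word = int(np.argmax(probs))             # greedy choice of the next word
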
  25. 25. 25 Decoder Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) (1) compute the decoder’s internal state, (2) score and normalize the target words, (3) select the next word, then go back to the 1st step…
  26. 26. 26 Decoder Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) More words for the decoded sentence are generated until an <EOS> (End Of Sentence) “word” is predicted.
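Putting the three decoder steps together, a sketch of greedy decoding until <EOS>; the vocabulary, weights and dimensions are toy stand-ins defined here, not the course implementation, and a real system would typically use beam search rather than pure greedy selection:

    import numpy as np

    rng = np.random.default_rng(3)
    vocab = ["<EOS>", "la", "croissance", "économique", "a", "ralenti"]   # toy target vocabulary
    K, H, M = len(vocab), 5, 3
    E_out = rng.standard_normal((M, K)) * 0.1     # target word embeddings
    Wd = rng.standard_normal((H, M + H)) * 0.1    # decoder input weights
    Ud = rng.standard_normal((H, H)) * 0.1        # decoder recurrent weights
    W_out = rng.standard_normal((K, H)) * 0.1     # output (scoring) weights

    def decode_greedy(h_T, max_len=10):
        z = np.zeros(H)
        u_prev = np.zeros(M)                      # embedding of a dummy start token
        output = []
        for _ in range(max_len):
            z = np.tanh(Wd @ np.concatenate([u_prev, h_T]) + Ud @ z)   # (1) new internal state
            scores = W_out @ z                                         # (2) score target words
            k = int(np.argmax(scores))                                 # (3) select the next word
            if vocab[k] == "<EOS>":
                break
            output.append(vocab[k])
            u_prev = E_out[:, k]
        return output

    print(decode_greedy(rng.standard_normal(H)))  # arbitrary words, since weights are random
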
  27. 27. Training
  28. 28. 28 Training: Maximum Likelihood Estimation Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
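Training maximizes the (log-)probability of each reference target sentence given its source, i.e. it minimizes the per-word cross-entropy. A sketch of the loss for one sentence pair, assuming the model already provides a probability distribution over the target vocabulary at every position (function and argument names are illustrative):

    import numpy as np

    def sentence_nll(step_probs, target_indices):
        """Negative log-likelihood of a reference target sentence.

        step_probs:      list of length-K probability vectors, one per target position
                         (the decoder's softmax outputs)
        target_indices:  indices of the reference words at those positions
        """
        return -sum(np.log(p[k]) for p, k in zip(step_probs, target_indices))

    # Training minimizes the sum of sentence_nll over the parallel corpus
    # (equivalently, maximizes the log-likelihood) by gradient descent.
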
  29. 29. 29 Computational Complexity Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
  30. 30. Why might this not work?
  31. 31. Why might this not work? We are encoding the entire source sentence into a single context vector
  32. 32. How to solve this? With an attention-based mechanism… more details tomorrow
  33. 33. 33 Summary ● Machine Translation is framed as a sequence-to-sequence problem ● The source sentence is encoded into a fixed-length vector, and this vector is decoded into the most probable target sentence ● Only a parallel corpus and an automatic evaluation measure are required to train a neural machine translation system
  34. 34. 34 Learn more Natural Language Understanding with Distributed Representation, Kyunghyun Cho, Chapter 6, 2015 (available on GitHub)
  35. 35. Thanks ! Q&A ? https://www.costa-jussa.com marta.ruiz@upc.edu
  36. 36. 37 Another useful image for encoding-decoding Kyunghyun Cho, “Natural Language Understanding with Distributed Representations” (2015) ENCODER DECODER input words output words