This document provides instructions for building your own neural machine translation system in 15 minutes using open-source tools. It discusses the benefits of running your own translator: handling private or sensitive data, translating large volumes of text cheaply, and supporting domain-specific vocabulary. The outlined workflow trains a basic model on public parallel corpora, split into training and validation sets. Steps include tokenization, preprocessing, training a bidirectional LSTM model, and releasing the model for CPU-based translation. Public corpus sources and tools such as OpenNMT and Google's Seq2Seq library are referenced.
AIMeetup #4: Neural Machine Translation
1. How to build your own translator in 15 minutes
Neural Machine Translation in practice
Bartek Rozkrut
2040.io
2. Why is it so important?
A 40 billion USD / year industry
A huge barrier for many people
Provides unlimited access to knowledge
Scales NLP problems
11. Why your own translator?
1. Private / sensitive data
2. Huge amounts of data, e.g. e-mail translation (cost)
3. Off-line / off-cloud / on-premise
4. Custom domain-specific translation / vocabulary
12. Neural Machine Translation – example workflow
1. Download parallel corpus files
2. Concatenate all corpus files (source + target) in the same order
3. Split into TRAIN / VAL sets
4. Tokenize
5. Preprocess
6. Train
7. Release the model (CPU compatible)
8. Translate!
9. REPEAT!
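Steps 2–3 of the workflow above can be sketched in a few lines of Python. The function name, the toy data, and the validation-set size are illustrative assumptions, not from the deck; the one real constraint is that source and target must be shuffled as pairs, never separately, or the line-by-line alignment is destroyed:

```python
import random

def split_parallel(src_lines, tgt_lines, val_size=2000, seed=42):
    """Shuffle aligned sentence pairs and carve off a validation set.

    Shuffling (source, target) pairs together preserves the
    line-by-line alignment the whole workflow depends on.
    """
    assert len(src_lines) == len(tgt_lines), "corpus files are misaligned"
    pairs = list(zip(src_lines, tgt_lines))
    random.Random(seed).shuffle(pairs)
    return pairs[val_size:], pairs[:val_size]  # (train, val)

# Toy data standing in for real corpus files:
src = [f"zdanie {i}" for i in range(10)]
tgt = [f"sentence {i}" for i in range(10)]
train, val = split_parallel(src, tgt, val_size=2)
```

After the split, each side of `train` and `val` is written back out to four plain-text files (train/val × source/target) for the tokenization and preprocessing steps.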
14. Parallel Corpus (source file – PL, EUROPARL)
1. Tytuł: Admirał NATO potrzebuje przyjaciół.
2. Dziękuję.
3. Naprawdę potrzebuję...
4. Ten program stał się katalizatorem. Następnego dnia setki osób chciały mnie dodać do znajomych. Indonezyjczycy i Finowie pisali: "Admirale, słyszeliśmy, że potrzebuje pan znajomych, a tak przy okazji, co to jest NATO?"
15. Parallel Corpus (target file – EN, EUROPARL)
1. The headline was: NATO Admiral Needs Friends.
2. Thank you.
3. Which I do.
4. And the story was a catalyst, and the next morning I had hundreds of Facebook friend requests from Indonesians and Finns, mostly saying, "Admiral, we heard you need a friend, and oh, by the way, what is NATO?"
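Before preprocessing, each corpus line goes through tokenization (step 4 of the workflow), which separates punctuation from words. OpenNMT ships its own tokenizer for real use; the regex below is only a toy illustration of the idea, and handles the Polish diacritics in the sample above because Python's `\w` is Unicode-aware:

```python
import re

def toy_tokenize(line):
    # Match either a run of word characters or a single punctuation
    # mark, so "Friends." becomes ["Friends", "."]. A real tokenizer
    # covers many more cases (numbers, hyphens, casing markup, ...).
    return re.findall(r"\w+|[^\w\s]", line)

toy_tokenize("NATO Admiral Needs Friends.")
# → ['NATO', 'Admiral', 'Needs', 'Friends', '.']
```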
20. Our experience from PL => EN training
1. 100k vocabulary (word-level)
2. Bidirectional LSTM, 2 layers, RNN size 500
3. 5M sentences from public data sources
4. ~20 BLEU
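With the Lua version of OpenNMT, a configuration matching these numbers might look like the sketch below. All file paths are placeholders, and flag names should be checked against the documentation of your OpenNMT version:

```shell
# 100k word-level vocabulary (slide item 1), set at preprocessing time.
th preprocess.lua -train_src train.pl.tok -train_tgt train.en.tok \
  -valid_src val.pl.tok -valid_tgt val.en.tok \
  -src_vocab_size 100000 -tgt_vocab_size 100000 \
  -save_data data/pl-en

# Bidirectional LSTM, 2 layers, RNN size 500 (slide item 2), on GPU 1.
th train.lua -data data/pl-en-train.t7 -save_model models/pl-en \
  -brnn -layers 2 -rnn_size 500 -gpuid 1

# Step 7 of the workflow: strip GPU state so the model runs on CPU.
th tools/release_model.lua -model models/pl-en_checkpoint.t7 -gpuid 1

# Step 8: translate a tokenized test file with the released model.
th translate.lua -model models/pl-en_checkpoint_release.t7 \
  -src test.pl.tok -output pred.en.tok
```

The deck's ~20 BLEU on 5M sentence pairs is in line with what a vanilla word-level seq2seq setup of this size could reach on public data at the time.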
21. OpenNMT – run the Docker container
Run a CPU-based interactive session with:
sudo docker run -it 2040/opennmt bash
Run a GPU-based interactive session with:
sudo nvidia-docker run -it 2040/opennmt bash