How to build your own translator in 15 minutes
Neural Machine Translation in practice
Bartek Rozkrut
2040.io
Why so important?
40 billion USD / year industry
Huge barrier for many people
Provide unlimited access to knowledge
Scale NLP problems
RNN vs CNN in machine translation
Why own translator?
• Private / sensitive data
• Huge amount of data – e.g. e-mail translation (cost)
• Off-line / off-cloud / on-premise
• Custom domain-specific translation / vocabulary
Neural Machine Translation – example workflow
1. Download Parallel Corpus files
2. Concatenate all corpus files (source + target) in the same order (see the sketch after this list)
3. Split TRAIN / VAL set
4. Tokenization
5. Preprocess (build vocabulary, remove overly long sentences, …)
6. Train
7. Release model (CPU compatible)
8. Translate!
9. REPEAT! ☺
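A minimal sketch of step 2, assuming two already-extracted corpora with hypothetical file names (europarl.*, subtitles.*); the resulting src.txt and tgt.txt are the files used in the OpenNMT commands later on:

cat europarl.pl subtitles.pl > src.txt   # source side
cat europarl.en subtitles.en > tgt.txt   # target side, concatenated in the same corpus order
# Line N of src.txt must stay aligned with line N of tgt.txt.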
Parallel Corpus – public data
http://opus.lingfil.uu.se
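As an illustration of workflow step 1, a sketch of fetching one PL–EN corpus from OPUS. The download path below is hypothetical – browse the site for the current MOSES-format Europarl pl-en archive:

wget -O pl-en.txt.zip "http://opus.lingfil.uu.se/download.php?f=Europarl%2Fpl-en.txt.zip"   # hypothetical URL
unzip pl-en.txt.zip   # should yield line-aligned .pl and .en plain-text files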
Parallel Corpus (source file – PL, EUROPARL)
1. Tytuł: Admirał NATO potrzebuje przyjaciół.
2. Dziękuję.
3. Naprawdę potrzebuję...
4. Ten program stał się katalizatorem. Następnego dnia setki osób chciały mnie dodać do znajomych. Indonezyjczycy i Finowie pisali: "Admirale, słyszeliśmy, że potrzebuje pan znajomych, a tak przy okazji, co to jest NATO?"
Parallel Corpus (target file – EN, EUROPARL)
1. The headline was: NATO Admiral Needs Friends.
2. Thank you.
3. Which I do.
4. And the story was a catalyst, and the next morning I had hundreds of Facebook friend requests from Indonesians and Finns, mostly saying, "Admiral, we heard you need a friend, and oh, by the way, what is NATO?"
Vocabulary
1. Word level
2. Sub-word level (e.g. Byte Pair Encoding)
3. Character level
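As an illustration of the sub-word option, a minimal sketch using the separate subword-nmt package (Sennrich's BPE implementation; not part of OpenNMT, and 32000 merge operations is just an example value). File names follow the workflow above; the same two steps would be repeated for the target side and the validation files:

pip install subword-nmt
subword-nmt learn-bpe -s 32000 < train-src.txt.tok > bpe-codes.src   # learn BPE merge operations on the training data
subword-nmt apply-bpe -c bpe-codes.src < train-src.txt.tok > train-src.txt.bpe   # segment the corpus into sub-word units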
BLEU
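BLEU scores a system translation by its n-gram overlap with one or more reference translations, on a 0–100 scale where higher is better. One common way to compute it is the multi-bleu.perl script from the Moses toolkit (a sketch; the script location and file names are illustrative):

perl multi-bleu.perl reference.tok < hypothesis.tok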
http://opennmt.net/
OpenNMT (RNN) – December 2016
https://google.github.io/seq2seq/
Google's seq2seq (RNN) – March 2017
https://github.com/facebookresearch/fairseq/
Facebook fairseq (CNN) – May 2017
Convolutional Neural Network vs Recurrent Neural Network in machine translation
9x speedup
Our experience with PL=>EN training
• 100k vocabulary (word-level)
• Bidirectional LSTM, 2 layers, RNN size 500
• 5M sentences from public data sources
• 2 weeks of training on a single NVIDIA Tesla K80 GPU
• ~20 BLEU
Our experience with PL=>EN translation (word level)
• [PL] Kora mózgowa jest odpowiedzialna za wszystkie nasze racjonalne i analityczne myśli oraz język.
• [EN] The neocortex is responsible for all of our rational and analytical thought and language.
• [HYPOTHESIS] <unk> cortex is responsible for all our rational and analytical thoughts and language.
Our experience with PL=>EN translation (word level)
• [PL] Jesteśmy firmą zajmującą się automatyzacją, która ma na celu budowanie lekkich struktur bo są bardziej wydajne energetycznie. Chcemy się nauczyć więcej o pneumatyce i przepływie powietrza.
• [EN] We are a company in the field of automation, and we'd like to do very lightweight structures because that's energy efficient, and we'd like to learn more about pneumatics and air flow phenomena.
• [HYPOTHESIS] We're a <unk> company, which is designed to build light structures because they're more energy efficient, and we want to learn more about <unk> and air flow.
OpenNMT – run Docker container
Run a CPU-based interactive session with:
sudo docker run -it 2040/opennmt bash
Run a GPU-based interactive session with:
sudo nvidia-docker run -it 2040/opennmt bash
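To share corpus files and trained models between the host and the container, a host directory can be mounted into it (a sketch; the ./data path is only an example):

sudo docker run -it -v $PWD/data:/data 2040/opennmt bash   # /data inside the container maps to ./data on the host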
OpenNMT – split parallel corpus
split -l $[ $(wc -l src.txt|cut -d" " -f1) * 9/10 ] src.txt
mv xaa train-src.txt
mv xab val-src.txt
split -l $[ $(wc -l tgt.txt|cut -d" " -f1) * 9/10 ] tgt.txt
mv xaa train-tgt.txt
mv xab val-tgt.txt
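The split writes the first ~90% of lines to xaa (training) and the remainder to xab (validation). Because src.txt and tgt.txt must stay line-aligned, both are split with the same ratio; a quick sanity check is to compare the resulting line counts:

wc -l train-src.txt train-tgt.txt val-src.txt val-tgt.txt   # the train pair and the val pair should match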
OpenNMT – preprocess parallel corpus
th tools/tokenize.lua -joiner_annotate -mode aggressive < train-src.txt > train-src.txt.tok
th tools/tokenize.lua -joiner_annotate -mode aggressive < train-tgt.txt > train-tgt.txt.tok
th tools/tokenize.lua -joiner_annotate -mode aggressive < val-src.txt > val-src.txt.tok
th tools/tokenize.lua -joiner_annotate -mode aggressive < val-tgt.txt > val-tgt.txt.tok
th preprocess.lua -train_src train-src.txt.tok -train_tgt train-tgt.txt.tok -valid_src val-src.txt.tok -valid_tgt val-tgt.txt.tok -save_data _data
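The PL=>EN setup described later uses a 100k word-level vocabulary; by default preprocess.lua keeps a smaller vocabulary and drops very long sentences, and both can be adjusted with flags. A hedged sketch (flag names as in the OpenNMT Lua options; verify against your OpenNMT version):

th preprocess.lua -train_src train-src.txt.tok -train_tgt train-tgt.txt.tok -valid_src val-src.txt.tok -valid_tgt val-tgt.txt.tok -src_vocab_size 100000 -tgt_vocab_size 100000 -src_seq_length 50 -tgt_seq_length 50 -save_data _data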
OpenNMT – train && release && translate
th train.lua -data _data-train.t7 -layers 2 -rnn_size 500 -brnn -save_model model -gpuid 1
th tools/release_model.lua -model model.t7 -gpuid 1
th translate.lua -model model.t7 -src val-src.txt.tok -output file-tgt.tok -gpuid 1
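Two practical notes on the commands above: train.lua saves one checkpoint per epoch (the file name includes the epoch number and validation perplexity), so substitute the actual checkpoint file for model.t7 in the release and translate steps; and the translation output is still tokenized. A sketch of detokenizing it with OpenNMT's companion script (stdin/stdout, like tokenize.lua):

th tools/detokenize.lua < file-tgt.tok > file-tgt.txt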
Best hyperparameters from 250k GPU hours (thanks, Google)
https://arxiv.org/abs/1703.03906
Other applications
1. Image-to-text
2. OCR (e.g. Tesseract OCR v4.0 – LSTM)
3. Lip reading
4. Simple Q&A
5. Chatbots
http://web.stanford.edu/class/cs224n/
Thanks!
Bartek Rozkrut
bartek@2040.io
