How to build your own translator in 15 minutes
Neural Machine Translation in practice
Bartek Rozkrut
2040.io
Why so important?
40 billion USD / year industry
Huge barrier for many people
Provide unlimited access to knowledge
Scale NLP problems
RNN vs CNN in machine translation
Why own translator?
• Private / sensitive data
• Huge amount of data – e.g. e-mail translation (cost)
• Off-line / off-cloud / on-premise
• Custom domain-specific translation / vocabulary
Neural Machine Translation – example workflow
1. Download Parallel Corpus files
2. Concatenate all corpus files (source + target) in the same order (see the sketch after this list)
3. Split TRAIN / VAL set
4. Tokenization
5. Preprocess (build vocabulary, remove overly long sentences, …)
6. Train
7. Release model (CPU compatible)
8. Translate!
9. REPEAT! ☺
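A minimal sketch of step 2, assuming two already-extracted corpora with hypothetical file names (europarl.*, subtitles.*); the resulting src.txt and tgt.txt are the files used in the OpenNMT commands later on:

cat europarl.pl subtitles.pl > src.txt   # source side
cat europarl.en subtitles.en > tgt.txt   # target side, concatenated in the same corpus order
# Line N of src.txt must stay aligned with line N of tgt.txt.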
Parallel Corpus – public data
http://opus.lingfil.uu.se
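As an illustration of workflow step 1, a sketch of fetching one PL–EN corpus from OPUS. The download path below is hypothetical – browse the site for the current MOSES-format Europarl pl-en archive:

wget -O pl-en.txt.zip "http://opus.lingfil.uu.se/download.php?f=Europarl%2Fpl-en.txt.zip"   # hypothetical URL
unzip pl-en.txt.zip   # should yield line-aligned .pl and .en plain-text files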
Parallel Corpus (source file – PL, EUROPARL)
1. Tytuł: Admirał NATO potrzebuje przyjaciół.
2. Dziękuję.
3. Naprawdę potrzebuję...
4. Ten program stał się katalizatorem. Następnego dnia setki osób chciały mnie dodać do znajomych. Indonezyjczycy i Finowie pisali: "Admirale, słyszeliśmy, że potrzebuje pan znajomych, a tak przy okazji, co to jest NATO?"
Parallel Corpus (target file – EN, EUROPARL)
1. The headline was: NATO Admiral Needs Friends.
2. Thank you.
3. Which I do.
4. And the story was a catalyst, and the next morning I had hundreds of Facebook friend requests from Indonesians and Finns, mostly saying, "Admiral, we heard you need a friend, and oh, by the way, what is NATO?"
Vocabulary
1. Word level
2. Sub-word level (e.g. Byte Pair Encoding)
3. Character level
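As an illustration of the sub-word option, a minimal sketch using the separate subword-nmt package (Sennrich's BPE implementation; not part of OpenNMT, and 32000 merge operations is just an example value). File names follow the workflow above; the same two steps would be repeated for the target side and the validation files:

pip install subword-nmt
subword-nmt learn-bpe -s 32000 < train-src.txt.tok > bpe-codes.src   # learn BPE merge operations on the training data
subword-nmt apply-bpe -c bpe-codes.src < train-src.txt.tok > train-src.txt.bpe   # segment the corpus into sub-word units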
BLEU
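BLEU scores a system translation by its n-gram overlap with one or more reference translations, on a 0–100 scale where higher is better. One common way to compute it is the multi-bleu.perl script from the Moses toolkit (a sketch; the script location and file names are illustrative):

perl multi-bleu.perl reference.tok < hypothesis.tok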
http://opennmt.net/
OpenNMT (RNN) – December 2016
https://google.github.io/seq2seq/
Google's seq2seq (RNN) – March 2017
https://github.com/facebookresearch/fairseq/
Facebook fairseq (CNN) – May 2017
Convolutional Neural Network vs Recurrent Neural Network in machine translation
9x speedup
Our experience with PL=>EN training
• 100k vocabulary (word-level)
• Bidirectional LSTM, 2 layers, RNN size 500
• 5M sentences from public data sources
• 2 weeks of training on a single NVIDIA Tesla K80 GPU
• ~20 BLEU
Our experience with PL=>EN translation (word level)
• [PL] Kora mózgowa jest odpowiedzialna za wszystkie nasze racjonalne i analityczne myśli oraz język.
• [EN] The neocortex is responsible for all of our rational and analytical thought and language.
• [HYPOTHESIS] <unk> cortex is responsible for all our rational and analytical thoughts and language.
Our experience with PL=>EN translation (word level)
• [PL] Jesteśmy firmą zajmującą się automatyzacją, która ma na celu budowanie lekkich struktur bo są bardziej wydajne energetycznie. Chcemy się nauczyć więcej o pneumatyce i przepływie powietrza.
• [EN] We are a company in the field of automation, and we'd like to do very lightweight structures because that's energy efficient, and we'd like to learn more about pneumatics and air flow phenomena.
• [HYPOTHESIS] We're a <unk> company, which is designed to build light structures because they're more energy efficient, and we want to learn more about <unk> and air flow.
OpenNMT – run Docker container
Run a CPU-based interactive session with:
sudo docker run -it 2040/opennmt bash
Run a GPU-based interactive session with:
sudo nvidia-docker run -it 2040/opennmt bash
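To share corpus files and trained models between the host and the container, a host directory can be mounted into it (a sketch; the ./data path is only an example):

sudo docker run -it -v $PWD/data:/data 2040/opennmt bash   # /data inside the container maps to ./data on the host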
OpenNMT – split parallel corpus
split -l $[ $(wc -l src.txt|cut -d" " -f1) * 9/10 ] src.txt
mv xaa train-src.txt
mv xab val-src.txt
split -l $[ $(wc -l tgt.txt|cut -d" " -f1) * 9/10 ] tgt.txt
mv xaa train-tgt.txt
mv xab val-tgt.txt
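The split writes the first ~90% of lines to xaa (training) and the remainder to xab (validation). Because src.txt and tgt.txt must stay line-aligned, both are split with the same ratio; a quick sanity check is to compare the resulting line counts:

wc -l train-src.txt train-tgt.txt val-src.txt val-tgt.txt   # the train pair and the val pair should match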
OpenNMT – preprocess parallel corpus
th tools/tokenize.lua -joiner_annotate -mode aggressive < train-src.txt > train-src.txt.tok
th tools/tokenize.lua -joiner_annotate -mode aggressive < train-tgt.txt > train-tgt.txt.tok
th tools/tokenize.lua -joiner_annotate -mode aggressive < val-src.txt > val-src.txt.tok
th tools/tokenize.lua -joiner_annotate -mode aggressive < val-tgt.txt > val-tgt.txt.tok
th preprocess.lua -train_src train-src.txt.tok -train_tgt train-tgt.txt.tok -valid_src val-src.txt.tok -valid_tgt val-tgt.txt.tok -save_data _data
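The PL=>EN setup described later uses a 100k word-level vocabulary; by default preprocess.lua keeps a smaller vocabulary and drops very long sentences, and both can be adjusted with flags. A hedged sketch (flag names as in the OpenNMT Lua options; verify against your OpenNMT version):

th preprocess.lua -train_src train-src.txt.tok -train_tgt train-tgt.txt.tok -valid_src val-src.txt.tok -valid_tgt val-tgt.txt.tok -src_vocab_size 100000 -tgt_vocab_size 100000 -src_seq_length 50 -tgt_seq_length 50 -save_data _data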
OpenNMT – train && release && translate
th train.lua -data _data-train.t7 -layers 2 -rnn_size 500 -brnn -save_model model -gpuid 1
th tools/release_model.lua -model model.t7 -gpuid 1
th translate.lua -model model.t7 -src val-src.txt.tok -output file-tgt.tok -gpuid 1
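Two practical notes on the commands above: train.lua saves one checkpoint per epoch (the file name includes the epoch number and validation perplexity), so substitute the actual checkpoint file for model.t7 in the release and translate steps; and the translation output is still tokenized. A sketch of detokenizing it with OpenNMT's companion script (stdin/stdout, like tokenize.lua):

th tools/detokenize.lua < file-tgt.tok > file-tgt.txt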
Best hyperparameters from 250k GPU hours (thanks, Google)
https://arxiv.org/abs/1703.03906
Other applications
1. Image-to-text
2. OCR (e.g. Tesseract OCR v4.0 – LSTM)
3. Lip reading
4. Simple Q&A
5. Chatbots
http://web.stanford.edu/class/cs224n/
Thanks!
Bartek Rozkrut
bartek@2040.io
