SlideShare a Scribd company logo
1 of 28
Download to read offline
How to build own
translator in 15 minutes
Neural Machine Translation in practice
Bartek Rozkrut
2040.io
Why so
important?
40 billion USD /
year industry
Huge barrier for
many people
Provide unlimited
access to
knowledge
Scale NLP
problems
Why own translator?
1.Private / sensitive data
2.Huge amount of data – eg. e-mail translation (cost)
3.Off-line / off-cloud / on-premise
4.Custom domain-specific translation / vocabulary
Neural Machine Translation – example workflow
1. Download Parallel Corpus files
2. Append all corpus files (source + target) in same order
3. Split TRAIN / VAL set
4. Tokenization
5. Preprocess
6. Train
7. Release model (CPU compatible)
8. Translate!
9. REPEAT! 
Parallel Corpus – public data
HTTP://OPUS.LINGFIL.UU.SE
Parallel Corpus (source file – PL, EUROPARL)
1.Tytuł: Admirał NATO potrzebuje przyjaciół.
2.Dziękuję.
3.Naprawdę potrzebuję...
4.Ten program stał się katalizatorem. Następnego dnia setki
osób chciały mnie dodać do znajomych. Indonezyjczycy i
Finowie Pisali: "Admirale, słyszeliśmy, że potrzebuje pan
znajomych, a tak przy okazji, co to jest NATO?"
Parallel Corpus (target file - EN , EUROPARL)
1.The headline was: NATO Admiral Needs Friends.
2.Thank you.
3.Which I do.
4.And the story was a catalyst, and the next morning I had
hundreds of Facebook friend requests from Indonesians and
Finns, mostly saying, "Admiral, we heard you need a friend, and
oh, by the way, what is NATO?"
Vocabulary
1.Word level
2.Sub-word level (eg. Byte Pair Encoding)
3.Character level
BLEU
HTTP://OPENNMT.NET/
OPENNMT – DECEMBER 2016
HTTPS://GOOGLE.GITHUB.IO/SEQ2SEQ/
GOOGLE’S SEQ2SEQ – MARCH 2017
Our experience from PL=>EN training
1.100k vocabulary (word-level)
2.Bidirectional LSTM, 2 layers, RNN size 500
3.5M sentences from public data sources
4.~ 20 BLEU
OpenNMT – run Docker container
Run CPU-based interactive session with command:
sudo docker run -it 2040/opennmt bash
Run GPU-based interactive session with command:
sudo nvidia-docker run -it 2040/opennmt bash
OpenNMT – split paralell corpus
split -l $[ $(wc -l src.txt|cut -d" " -f1) * 9/10 ] src.txt
mv xaa train-src.txt
mv xab val-src.txt
split -l $[ $(wc -l tgt.txt|cut -d" " -f1) * 9/10 ] tgt.txt
mv xaa train-tgt.txt
mv xab val-tgt.txt
OpenNMT – preprocess paralell corpus
th tools/tokenize.lua -joiner_annotate -mode aggressive < train-src.txt >
train-src.txt.tok
th tools/tokenize.lua -joiner_annotate -mode aggressive < train-tgt.txt >
train-tgt.txt.tok
th tools/tokenize.lua -joiner_annotate -mode aggressive < val-src.txt > val-
src.txt.tok
th tools/tokenize.lua -joiner_annotate -mode aggressive < val-tgt.txt > val-
tgt.txt.tok
th preprocess.lua -train_src train-src.txt.tok -train_tgt train-tgt.txt.tok -
valid_src val-src.txt.tok -valid_tgt val-tgt.txt.tok -save_data _data
OpenNMT – train && release && translate
th train.lua -data _data-train.t7 -layers 2 -rnn_size 500 -brnn -save_model
model -gpuid 1
th tools/release_model.lua -model model.t7 -gpuid 1
th translate.lua -model model.t7 -src src-val.txt -output file-tgt.tok -gpuid
1
Best hyperparams from 250k GPU hours (thx Google)
HTTPS://ARXIV.ORG/ABS/1703.03906
Other applications
1.Image 2 Text
2.OCR (eg. Tesseract OCR v4.0 – LSTM)
3.Lip reading
4.Simple Q&A
5.Chatbots
HTTP://WEB.STANFORD.EDU/CLASS/CS224N/
SLIDES USED WITH PERMISSION FROM RICHARD SOCHER
Thanks!
Bartek Rozkrut
bartek@2040.io

More Related Content

What's hot

TLPI - Chapter 44 Pipe and Fifos
TLPI - Chapter 44 Pipe and FifosTLPI - Chapter 44 Pipe and Fifos
TLPI - Chapter 44 Pipe and Fifos
Shu-Yu Fu
 
The linux networking architecture
The linux networking architectureThe linux networking architecture
The linux networking architecture
hugo lu
 
FBTFTP: an opensource framework to build dynamic tftp servers
FBTFTP: an opensource framework to build dynamic tftp serversFBTFTP: an opensource framework to build dynamic tftp servers
FBTFTP: an opensource framework to build dynamic tftp servers
Angelo Failla
 

What's hot (20)

Compiling P4 to XDP, IOVISOR Summit 2017
Compiling P4 to XDP, IOVISOR Summit 2017Compiling P4 to XDP, IOVISOR Summit 2017
Compiling P4 to XDP, IOVISOR Summit 2017
 
Experimental dtrace
Experimental dtraceExperimental dtrace
Experimental dtrace
 
tokyotalk
tokyotalktokyotalk
tokyotalk
 
2014.10 - Towards Description Set Profiles for RDF Using SPARQL as Intermedia...
2014.10 - Towards Description Set Profiles for RDF Using SPARQL as Intermedia...2014.10 - Towards Description Set Profiles for RDF Using SPARQL as Intermedia...
2014.10 - Towards Description Set Profiles for RDF Using SPARQL as Intermedia...
 
Playing Nice with Others
Playing Nice with OthersPlaying Nice with Others
Playing Nice with Others
 
introduction to linux kernel tcp/ip ptocotol stack
introduction to linux kernel tcp/ip ptocotol stack introduction to linux kernel tcp/ip ptocotol stack
introduction to linux kernel tcp/ip ptocotol stack
 
Text tagging with finite state transducers
Text tagging with finite state transducersText tagging with finite state transducers
Text tagging with finite state transducers
 
Memory Barriers in the Linux Kernel
Memory Barriers in the Linux KernelMemory Barriers in the Linux Kernel
Memory Barriers in the Linux Kernel
 
Ns2pre
Ns2preNs2pre
Ns2pre
 
Automata Invasion
Automata InvasionAutomata Invasion
Automata Invasion
 
Learning RSocket Using RSC
Learning RSocket Using RSCLearning RSocket Using RSC
Learning RSocket Using RSC
 
Linux50commands
Linux50commandsLinux50commands
Linux50commands
 
TLPI - Chapter 44 Pipe and Fifos
TLPI - Chapter 44 Pipe and FifosTLPI - Chapter 44 Pipe and Fifos
TLPI - Chapter 44 Pipe and Fifos
 
Versioned Triple Pattern Fragments
Versioned Triple Pattern FragmentsVersioned Triple Pattern Fragments
Versioned Triple Pattern Fragments
 
The TCP/IP Stack in the Linux Kernel
The TCP/IP Stack in the Linux KernelThe TCP/IP Stack in the Linux Kernel
The TCP/IP Stack in the Linux Kernel
 
The linux networking architecture
The linux networking architectureThe linux networking architecture
The linux networking architecture
 
Serialization in Go
Serialization in GoSerialization in Go
Serialization in Go
 
Golang concurrency design
Golang concurrency designGolang concurrency design
Golang concurrency design
 
OpenZFS send and receive
OpenZFS send and receiveOpenZFS send and receive
OpenZFS send and receive
 
FBTFTP: an opensource framework to build dynamic tftp servers
FBTFTP: an opensource framework to build dynamic tftp serversFBTFTP: an opensource framework to build dynamic tftp servers
FBTFTP: an opensource framework to build dynamic tftp servers
 

Similar to AIMeetup #4: Neural-machine-translation

Using FLiP with InfluxDB for EdgeAI IoT at Scale 2022
Using FLiP with InfluxDB for EdgeAI IoT at Scale 2022Using FLiP with InfluxDB for EdgeAI IoT at Scale 2022
Using FLiP with InfluxDB for EdgeAI IoT at Scale 2022
Timothy Spann
 
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache PulsarApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
Timothy Spann
 
2023-02-22_Tiberti_CyberX.pdf
2023-02-22_Tiberti_CyberX.pdf2023-02-22_Tiberti_CyberX.pdf
2023-02-22_Tiberti_CyberX.pdf
cifoxo
 

Similar to AIMeetup #4: Neural-machine-translation (20)

MOSP Walkthrough 2009
MOSP Walkthrough 2009MOSP Walkthrough 2009
MOSP Walkthrough 2009
 
OSS EU: Deep Dive into Building Streaming Applications with Apache Pulsar
OSS EU:  Deep Dive into Building Streaming Applications with Apache PulsarOSS EU:  Deep Dive into Building Streaming Applications with Apache Pulsar
OSS EU: Deep Dive into Building Streaming Applications with Apache Pulsar
 
Scaleable PHP Applications in Kubernetes
Scaleable PHP Applications in KubernetesScaleable PHP Applications in Kubernetes
Scaleable PHP Applications in Kubernetes
 
FAIR Projector Builder
FAIR Projector BuilderFAIR Projector Builder
FAIR Projector Builder
 
REAL TIME OPERATING SYSTEM PART 2
REAL TIME OPERATING SYSTEM PART 2REAL TIME OPERATING SYSTEM PART 2
REAL TIME OPERATING SYSTEM PART 2
 
Basic Linux Internals
Basic Linux InternalsBasic Linux Internals
Basic Linux Internals
 
Cytoscape and External Data Analysis Tools
Cytoscape and External Data Analysis ToolsCytoscape and External Data Analysis Tools
Cytoscape and External Data Analysis Tools
 
UNIX Basics and Cluster Computing
UNIX Basics and Cluster ComputingUNIX Basics and Cluster Computing
UNIX Basics and Cluster Computing
 
Will iPython replace bash?
Will iPython replace bash?Will iPython replace bash?
Will iPython replace bash?
 
Will iPython replace Bash?
Will iPython replace Bash?Will iPython replace Bash?
Will iPython replace Bash?
 
Autoware vs. Computer Performance @ ROS Japan UG #43 組み込み勉強会
Autoware vs. Computer Performance @ ROS Japan UG #43 組み込み勉強会Autoware vs. Computer Performance @ ROS Japan UG #43 組み込み勉強会
Autoware vs. Computer Performance @ ROS Japan UG #43 組み込み勉強会
 
Using FLiP with InfluxDB for EdgeAI IoT at Scale 2022
Using FLiP with InfluxDB for EdgeAI IoT at Scale 2022Using FLiP with InfluxDB for EdgeAI IoT at Scale 2022
Using FLiP with InfluxDB for EdgeAI IoT at Scale 2022
 
Using FLiP with influxdb for edgeai iot at scale 2022
Using FLiP with influxdb for edgeai iot at scale 2022Using FLiP with influxdb for edgeai iot at scale 2022
Using FLiP with influxdb for edgeai iot at scale 2022
 
Persistent Memory Development Kit (PMDK): State of the Project
Persistent Memory Development Kit (PMDK): State of the ProjectPersistent Memory Development Kit (PMDK): State of the Project
Persistent Memory Development Kit (PMDK): State of the Project
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
 
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache PulsarApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
 
Architecting a 35 PB distributed parallel file system for science
Architecting a 35 PB distributed parallel file system for scienceArchitecting a 35 PB distributed parallel file system for science
Architecting a 35 PB distributed parallel file system for science
 
Apache Pulsar Development 101 with Python
Apache Pulsar Development 101 with PythonApache Pulsar Development 101 with Python
Apache Pulsar Development 101 with Python
 
2023-02-22_Tiberti_CyberX.pdf
2023-02-22_Tiberti_CyberX.pdf2023-02-22_Tiberti_CyberX.pdf
2023-02-22_Tiberti_CyberX.pdf
 
GC free coding in @Java presented @Geecon
GC free coding in @Java presented @GeeconGC free coding in @Java presented @Geecon
GC free coding in @Java presented @Geecon
 

More from 2040.io

More from 2040.io (16)

Jak budujemy inteligentnego asystenta biznesowego
Jak budujemy inteligentnego asystenta biznesowegoJak budujemy inteligentnego asystenta biznesowego
Jak budujemy inteligentnego asystenta biznesowego
 
Obsługa klienta z wykorzystaniem sztucznej inteligencji
Obsługa klienta z wykorzystaniem sztucznej inteligencjiObsługa klienta z wykorzystaniem sztucznej inteligencji
Obsługa klienta z wykorzystaniem sztucznej inteligencji
 
Jak AI pozwala nam usłyszeć głos klienta
Jak AI pozwala nam usłyszeć głos klientaJak AI pozwala nam usłyszeć głos klienta
Jak AI pozwala nam usłyszeć głos klienta
 
Wyzwania związane z modelowaniem mobilnych systemów świadomych kontekstu
Wyzwania związane z modelowaniem mobilnych systemów świadomych kontekstuWyzwania związane z modelowaniem mobilnych systemów świadomych kontekstu
Wyzwania związane z modelowaniem mobilnych systemów świadomych kontekstu
 
Rozpoznawanie mowy: problem rozwiązany?
Rozpoznawanie mowy: problem rozwiązany?Rozpoznawanie mowy: problem rozwiązany?
Rozpoznawanie mowy: problem rozwiązany?
 
Czy Deep Learning działa?
Czy Deep Learning działa?Czy Deep Learning działa?
Czy Deep Learning działa?
 
Analiza semantyczna zasosowana w środowisku Menerva
Analiza semantyczna zasosowana w środowisku MenervaAnaliza semantyczna zasosowana w środowisku Menerva
Analiza semantyczna zasosowana w środowisku Menerva
 
Time-series prediction with neural networks
Time-series prediction with neural networksTime-series prediction with neural networks
Time-series prediction with neural networks
 
AIMeetup #4: Artificial intelligence and economics
AIMeetup #4: Artificial intelligence and economicsAIMeetup #4: Artificial intelligence and economics
AIMeetup #4: Artificial intelligence and economics
 
AIMeetup #4: Let’s compete with machine! edrone crm
AIMeetup #4: Let’s compete with machine! edrone crmAIMeetup #4: Let’s compete with machine! edrone crm
AIMeetup #4: Let’s compete with machine! edrone crm
 
AIMeetup #3: Uczenie maszynowe - rocket science czy chleb powszedni?
AIMeetup #3: Uczenie maszynowe - rocket science czy chleb powszedni?AIMeetup #3: Uczenie maszynowe - rocket science czy chleb powszedni?
AIMeetup #3: Uczenie maszynowe - rocket science czy chleb powszedni?
 
AIMeetup #3: Cortana intelligence suite - tchnij życie w swoje dane
AIMeetup #3: Cortana intelligence suite - tchnij życie w swoje daneAIMeetup #3: Cortana intelligence suite - tchnij życie w swoje dane
AIMeetup #3: Cortana intelligence suite - tchnij życie w swoje dane
 
AIMeetup #2: A.I. - podstawowe pojęcia techniczne
AIMeetup #2: A.I. - podstawowe pojęcia techniczneAIMeetup #2: A.I. - podstawowe pojęcia techniczne
AIMeetup #2: A.I. - podstawowe pojęcia techniczne
 
AIMeetup #2: Jak dzięki Data Mining księgujemy automatycznie koszty w Infakt.pl?
AIMeetup #2: Jak dzięki Data Mining księgujemy automatycznie koszty w Infakt.pl?AIMeetup #2: Jak dzięki Data Mining księgujemy automatycznie koszty w Infakt.pl?
AIMeetup #2: Jak dzięki Data Mining księgujemy automatycznie koszty w Infakt.pl?
 
AIMeetup #2: Jak wykorzystaliśmy technologię rozpoznawania mowy i mówcy do au...
AIMeetup #2: Jak wykorzystaliśmy technologię rozpoznawania mowy i mówcy do au...AIMeetup #2: Jak wykorzystaliśmy technologię rozpoznawania mowy i mówcy do au...
AIMeetup #2: Jak wykorzystaliśmy technologię rozpoznawania mowy i mówcy do au...
 
AIMeetup #2: Gdzie można nakarmić sztuczną inteligencję?
AIMeetup #2: Gdzie można nakarmić sztuczną inteligencję? AIMeetup #2: Gdzie można nakarmić sztuczną inteligencję?
AIMeetup #2: Gdzie można nakarmić sztuczną inteligencję?
 

Recently uploaded

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 

Recently uploaded (20)

WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cf
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
API Governance and Monetization - The evolution of API governance
API Governance and Monetization -  The evolution of API governanceAPI Governance and Monetization -  The evolution of API governance
API Governance and Monetization - The evolution of API governance
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Decarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceDecarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational Performance
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 
Design and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data ScienceDesign and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data Science
 

AIMeetup #4: Neural-machine-translation

  • 1. How to build own translator in 15 minutes Neural Machine Translation in practice Bartek Rozkrut 2040.io
  • 2. Why so important? 40 billion USD / year industry Huge barrier for many people Provide unlimited access to knowledge Scale NLP problems
  • 3.
  • 4.
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
  • 11. Why own translator? 1.Private / sensitive data 2.Huge amount of data – eg. e-mail translation (cost) 3.Off-line / off-cloud / on-premise 4.Custom domain-specific translation / vocabulary
  • 12. Neural Machine Translation – example workflow 1. Download Parallel Corpus files 2. Append all corpus files (source + target) in same order 3. Split TRAIN / VAL set 4. Tokenization 5. Preprocess 6. Train 7. Release model (CPU compatible) 8. Translate! 9. REPEAT! 
  • 13. Parallel Corpus – public data HTTP://OPUS.LINGFIL.UU.SE
  • 14. Parallel Corpus (source file – PL, EUROPARL) 1.Tytuł: Admirał NATO potrzebuje przyjaciół. 2.Dziękuję. 3.Naprawdę potrzebuję... 4.Ten program stał się katalizatorem. Następnego dnia setki osób chciały mnie dodać do znajomych. Indonezyjczycy i Finowie Pisali: "Admirale, słyszeliśmy, że potrzebuje pan znajomych, a tak przy okazji, co to jest NATO?"
  • 15. Parallel Corpus (target file - EN , EUROPARL) 1.The headline was: NATO Admiral Needs Friends. 2.Thank you. 3.Which I do. 4.And the story was a catalyst, and the next morning I had hundreds of Facebook friend requests from Indonesians and Finns, mostly saying, "Admiral, we heard you need a friend, and oh, by the way, what is NATO?"
  • 16. Vocabulary 1.Word level 2.Sub-word level (eg. Byte Pair Encoding) 3.Character level
  • 17. BLEU
  • 20. Our experience from PL=>EN training 1.100k vocabulary (word-level) 2.Bidirectional LSTM, 2 layers, RNN size 500 3.5M sentences from public data sources 4.~ 20 BLEU
  • 21. OpenNMT – run Docker container Run CPU-based interactive session with command: sudo docker run -it 2040/opennmt bash Run GPU-based interactive session with command: sudo nvidia-docker run -it 2040/opennmt bash
  • 22. OpenNMT – split paralell corpus split -l $[ $(wc -l src.txt|cut -d" " -f1) * 9/10 ] src.txt mv xaa train-src.txt mv xab val-src.txt split -l $[ $(wc -l tgt.txt|cut -d" " -f1) * 9/10 ] tgt.txt mv xaa train-tgt.txt mv xab val-tgt.txt
  • 23. OpenNMT – preprocess paralell corpus th tools/tokenize.lua -joiner_annotate -mode aggressive < train-src.txt > train-src.txt.tok th tools/tokenize.lua -joiner_annotate -mode aggressive < train-tgt.txt > train-tgt.txt.tok th tools/tokenize.lua -joiner_annotate -mode aggressive < val-src.txt > val- src.txt.tok th tools/tokenize.lua -joiner_annotate -mode aggressive < val-tgt.txt > val- tgt.txt.tok th preprocess.lua -train_src train-src.txt.tok -train_tgt train-tgt.txt.tok - valid_src val-src.txt.tok -valid_tgt val-tgt.txt.tok -save_data _data
  • 24. OpenNMT – train && release && translate th train.lua -data _data-train.t7 -layers 2 -rnn_size 500 -brnn -save_model model -gpuid 1 th tools/release_model.lua -model model.t7 -gpuid 1 th translate.lua -model model.t7 -src src-val.txt -output file-tgt.tok -gpuid 1
  • 25. Best hyperparams from 250k GPU hours (thx Google) HTTPS://ARXIV.ORG/ABS/1703.03906
  • 26. Other applications 1.Image 2 Text 2.OCR (eg. Tesseract OCR v4.0 – LSTM) 3.Lip reading 4.Simple Q&A 5.Chatbots