The document discusses the history and evolution of machine translation techniques, including rule-based MT, example-based MT, phrase-based MT, and neural machine translation. It then focuses on deep learning approaches for various natural language processing tasks such as language identification, part-of-speech tagging, transliteration, and language modeling. The document advocates for a linguistically motivated neural network architecture that leverages existing linguistic rules and annotations to improve machine translation quality.
Deep Learning for Machine Translation, by Jean Senellart, SYSTRAN
1. Deep Learning for Machine Translation
Satoshi Enoue, Jungi Kim, Jean Senellart, SYSTRAN
2. SYSTRAN Through Machine Translation History
• Rule-Based Machine Translation
• Example-Based Machine Translation
• Phrase-Based Machine Translation
• Syntax-Based Machine Translation
• Neural Machine Translation
• Hybrid Machine Translation
SYSTRAN timeline
• 1968 – SYSTRAN (SYStem TRANslation) founded by Dr. Toma in La Jolla, California (USA)
• 1969 – Provided the first MT software for the US Air Force (Russian to English)
• 1975 – Used by NASA for the Apollo-Soyuz American-Soviet project
• 1975 – Translation systems for all European languages in the European Commission
• 1986 – SYSTRAN is acquired by France’s Gachot SA, thus becoming a French company with a U.S. subsidiary
• 1990s – Port of the technology from mainframes to desktop PCs and client-server environments for personal and corporate use
• 1995 – Pioneered the development of the first Windows-based MT software
• 1996 – SYSTRAN within SEIKO’s pocket translators
• 1997 – First free Web-based translation service: AltaVista Babelfish. SYSTRAN made the Internet community aware of the usefulness and capabilities of machine translation
• 2002 – SYSTRAN was used on most major Internet portals: Yahoo!, Google, AltaVista, Lycos
• 2005 – Launched embedded translation software for mobile devices
• 2009 – Developed the first hybrid translation software and solution: SES 7 Translation Server
• 2011 – Launch of SES 7 Training Server, the first solution for self-learning of MT engines
• 2014 – Following the acquisition by CSLI, SYSTRAN SA forms part of the SYSTRAN International Group
• 2015 – SES8 Translation and Training Server – Large Models
• 2016 – More than 140 language pairs. Launch of SYSTRAN.io, the Natural Language Processing API platform
3. The new game changer
• Deep Neural Network Technologies
• Image Analysis
• Voice Recognition
• Text
• Text Generation
• Word Embeddings
• Multitask NLP
• Neural Machine Translation
• … Games
• Super-Human Abilities
A sequence of fascinating results and technologies over the last 3 years – all based on Deep Neural Networks (DNN) – covering a large variety of domains…
5. The new game changer – Voice Recognition
• Google’s 2015 RNN voice-search recognition outperforms the 2012 DNN models
• Baidu Deep Speech announces a 16.5% improvement over the baseline, and higher-than-human performance in noisy environments
6. The new game changer – examples: Text Generation
Char-RNN samples in Shakespeare and Victor Hugo styles (Char-RNN, Andrej Karpathy, 2015). The Victor Hugo-style French output below looks fluent but is full of invented words:
“Les yeux prenaient des redoutables, des troncs de feu. Toutes les prétexticheurs par ces quatre repentilleuses avec du sergent de Digne, débragiffés nymoeurs sur les derniers instants à hardis, boucher, sans dénongée en plus ennérence, ils se refecturent encore. Ils auraient déjà mangé ses très interses.”
A Korean sample in MSDN style: “공급자는 AspNetXSprchyLibrary의 인스턴스를 만들어 다른 경고를 오버터 컴퓨터에 저장할 수 있습니다.” (roughly: “The provider can create an instance of AspNetXSprchyLibrary and store other warnings on the overter computer.”)
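The mechanism behind such samples is a character-level recurrent language model: predict the next character, sample it, feed it back in. Below is a minimal sketch, assuming PyTorch; it is illustrative only, not Karpathy’s Char-RNN code, and it would need real training before producing outputs like the ones above.

```python
# Minimal char-level RNN sampler in the spirit of Char-RNN; a sketch,
# not Karpathy's implementation. Assumes PyTorch is installed.
import torch
import torch.nn as nn

text = "toutes les pretexticheurs par ces quatre repentilleuses "
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}

class CharRNN(nn.Module):
    def __init__(self, vocab, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, x, h=None):
        y, h = self.rnn(self.emb(x), h)
        return self.out(y), h

model = CharRNN(len(chars))

# Sampling: feed each sampled character back as the next input.
def sample(model, start, length=80, temperature=1.0):
    x = torch.tensor([[stoi[start]]])
    h, out = None, start
    for _ in range(length):
        logits, h = model(x, h)
        probs = torch.softmax(logits[0, -1] / temperature, dim=-1)
        idx = torch.multinomial(probs, 1).item()
        out += chars[idx]
        x = torch.tensor([[idx]])
    return out

print(sample(model, "t"))  # untrained: gibberish; trained: Hugo-flavored prose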
9. The new game changer – examples: Word Embeddings
word2vec, Google, 2013
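The editor’s note hints at the famous analogy property (“X is to Y what Z is to …”). A minimal sketch of it, assuming gensim is installed and a pretrained word2vec binary is available (the GoogleNews file name below is the commonly distributed one, used here as an assumption):

```python
# Sketch of the word-analogy property popularized by word2vec.
# Assumes gensim and a local copy of the pretrained GoogleNews vectors.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# king - man + woman ~ queen: plain arithmetic in the embedding space
print(vectors.most_similar(positive=["king", "woman"],
                           negative=["man"], topn=3))
```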
10. The new game changer – examples: Multitask NLP
• Unified neural network architecture for several NLP tasks: POS tagging, chunking, NER, SRL
• Focus on avoiding task-specific and linguistics-specific engineering
• Joint decision on the different tasks
• Outperforms almost all of the state-of-the-art results for each individual task
Natural Language Processing (Almost) from Scratch, Collobert et al., 2011
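A minimal sketch of the shared-architecture idea, assuming PyTorch and toy dimensions (not Collobert et al.’s exact model, which uses convolutions): one encoder shared across all tasks, with one small output head per task, so the tasks are learned jointly from the same representation.

```python
# Sketch of a unified multitask architecture: a shared sentence encoder
# with one output head per NLP task. Toy sizes; PyTorch assumed.
import torch
import torch.nn as nn

class MultiTaskTagger(nn.Module):
    def __init__(self, vocab=1000, hidden=128,
                 n_pos=45, n_chunk=23, n_ner=9):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden)          # shared embeddings
        self.enc = nn.LSTM(hidden, hidden, batch_first=True,
                           bidirectional=True)          # shared encoder
        self.heads = nn.ModuleDict({                    # task-specific heads
            "pos": nn.Linear(2 * hidden, n_pos),
            "chunk": nn.Linear(2 * hidden, n_chunk),
            "ner": nn.Linear(2 * hidden, n_ner),
        })

    def forward(self, tokens):
        h, _ = self.enc(self.emb(tokens))
        return {task: head(h) for task, head in self.heads.items()}

model = MultiTaskTagger()
scores = model(torch.randint(0, 1000, (1, 7)))  # one 7-token sentence
print({t: s.shape for t, s in scores.items()})  # per-token scores per task
```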
11. The new game changer – examples: Neural Machine Translation (sentence encoding-decoding)
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, K. Cho et al., 2014
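The encoder-decoder idea in a minimal sketch, assuming PyTorch and toy sizes (illustrative only, not Cho et al.’s exact model): the encoder compresses the source sentence into a fixed-size vector, from which the decoder generates the target sentence.

```python
# Sketch of an RNN encoder-decoder: the encoder summarizes the source
# sentence into a fixed vector; the decoder generates the target from it.
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    def __init__(self, src_vocab=100, tgt_vocab=100, hidden=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, hidden)
        self.tgt_emb = nn.Embedding(tgt_vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt):
        _, state = self.encoder(self.src_emb(src))   # fixed-size summary
        dec, _ = self.decoder(self.tgt_emb(tgt), state)
        return self.out(dec)                         # next-token logits

model = EncoderDecoder()
src = torch.randint(0, 100, (1, 9))   # source token ids
tgt = torch.randint(0, 100, (1, 7))   # shifted target token ids
print(model(src, tgt).shape)          # (1, 7, tgt_vocab)
```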
12. The new game changer – examples: Games (DQN, AlphaGo)
Human-level control through deep reinforcement learning, Google DeepMind, 2015
14. The new game changer – examples
More and more evidence of “super-human abilities”. Could we also reach super-human machine translation?
15. The new game changer – ingredients
• MLP – multilayer perceptron: actually an “old concept”
• CNN – convolutional neural network
• Word embeddings – representing words as vectors
• RNN – GRU, LSTM: an MLP with memory
• Attention-based models – the ability to decide where to find information
All of these features are the ingredients to Neural Machine Translation.
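Of these ingredients, attention is the least self-explanatory. A minimal NumPy sketch of the idea (illustrative only; plain dot-product scoring is assumed here) shows how a decoder “decides where to find information” among the encoder states:

```python
# Sketch of the attention ingredient: at each decoder step, score every
# encoder state against the current decoder state and take a weighted
# average, i.e. "decide where to find information". NumPy assumed.
import numpy as np

def attend(decoder_state, encoder_states):
    """decoder_state: (d,), encoder_states: (n, d)."""
    scores = encoder_states @ decoder_state          # dot-product scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax over positions
    context = weights @ encoder_states               # weighted average
    return context, weights

enc = np.random.randn(9, 64)   # one state per source token
dec = np.random.randn(64)      # current decoder state
context, weights = attend(dec, enc)
print(weights.round(2))        # soft "alignment" over source positions
```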
20. About Neural Machine Translation (NMT)
• The goal is to perform end-to-end translation
• Like in speech recognition
• The spirit is to remove all the handcrafted features and have a single system
• For machine translation, the first NMT systems are encoder-decoders
• But it is not that magic:
• No systematic improvements over the SMT baseline
• Use of ensemble systems
• Issues with sentence lengths and vocabulary size
• The solutions come back with some interest in “linguistic” characteristics:
• Attention-based models (alignment information)
• Deep fusion with a language model (better modelling of the target language)
• Combination with the word level (~ morphology)
21. SYSTRAN approach to NMT
• Current real use-case requirements:
• Adaptation to a (small) domain
• Help for post-editing
• Preserved speed
• Consistent results among multiple target languages
• Possibility to let users control the translation through annotations and terminology
• …
• Toward a linguistically motivated NN architecture:
• SYSTRAN MT is composed of linguistic modules – let us start with them
• A lot of knowledge to leverage
22. SYSTRAN Deep Learning Story – Part I: Language Identification
SYSTRAN LDK
• Statistical classifier – 3-grams
• Heavily feature-engineered over the years, e.g. a diacritics model for Latin-script languages
• Includes a lexicon of frequent terms
• Quite good accuracy on news-type data – needs ~20 characters
Basic RNN
• “Out-of-the-box” character-level RNN
• No language-specific engineering
• 80K words of training data per language
Google CLD
• Naïve Bayesian classifier – 4-grams
• Trained on “big data”: carefully scraped from over 100M pages
• Specific tricks for closely related languages (Spanish/Portuguese)
• Geared for webpages – 200+ characters
Learnings: with the same data, the RNN approach easily outperforms the baseline, with no specific engineering needed… and big data alone does not compete.
Accuracy (%)   News sentences   One-word request   TED-Talk sentences   Tweets
LDK            97.0             55.2               87.4                 78.3
RNN            98.2             61.5               91.4                 77.9
CLD            96.1             15.3               86.0                 78.1
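A minimal sketch of such an “out-of-the-box” character-level identifier, assuming PyTorch (toy sizes and language list; not SYSTRAN’s code): embed the raw bytes, run a GRU, and classify from the final state.

```python
# Sketch of a character-level RNN language identifier: read the text as
# a byte sequence and classify from the final GRU state. Toy setup.
import torch
import torch.nn as nn

LANGS = ["en", "fr", "es", "pt"]  # illustrative language inventory

class LangID(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(256, hidden)           # one entry per byte
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, len(LANGS))

    def forward(self, text):
        x = torch.tensor([list(text.encode("utf-8"))])
        _, h = self.rnn(self.emb(x))
        return self.out(h[-1])                         # one score per language

model = LangID()
print(model("bonjour tout le monde"))  # untrained scores; train with
                                       # cross-entropy on labeled sentences
```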
23. SYSTRAN Deep Learning Story – Part II: Part-of-Speech Tagging
Phase 1 – 1968-2014 – Handcrafting
• Manual rule and lexicon coding of homography
• Closely related to the morphology description
• 27 languages covered
Phase 2 – 2008-2015 – Annotating
• Train a classifier to “relearn” the rules (fnTBL)
• Transfer knowledge through the system output
• Maintenance through annotation
Phase 3 – 2015- – Generalizing
• Relearn with an RNN
• Joint decision (so far tokenization/part-of-speech tagging) – working on morphology
• Better generalization from additional knowledge (word embeddings)
Learnings: it is possible to leverage the “handcrafting” and gain quality. But the learning is becoming too smart – it also learns the initial errors.
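One way to read the “joint decision” on tokenization and part-of-speech tagging is as a single per-character labeling task, so one network can learn both at once. A purely illustrative label scheme (an assumption for exposition, not SYSTRAN’s actual scheme):

```python
# Sketch of joint tokenization + POS tagging as one per-character
# labeling task. The B-/I- label scheme here is illustrative only.
sentence = "Le chat dort"
labels = ["B-DET", "I-DET", "O",                      # "Le"   + space
          "B-NOUN", "I-NOUN", "I-NOUN", "I-NOUN", "O",  # "chat" + space
          "B-VERB", "I-VERB", "I-VERB", "I-VERB"]       # "dort"

for ch, lab in zip(sentence, labels):
    print(repr(ch), lab)   # B-* starts a token, I-* continues it
```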
24. SYSTRAN Deep Learning Story – Part III: Transliteration
• The transliteration of person names depends on:
• The source language
• The target language
• But also the origin of the name
• 카스파로프 = Kasparov
• 필리프 = Philippe
• A good transliteration system needs:
• Detection of the origin
• A transliteration mechanism
Rule-Based
• Extremely complicated – since it requires phonetics modeling
PBMT
• Satisfactory, but origin detection and multiple domains remain problematic
• No generalization – an unseen sequence comes out wrong
RNN
• Encoding-decoding approach
• The long-distance “view” guarantees the consistency of the transliteration
Learnings:
- Losing the reliability/traceability of the process
+ More global consistency, compactness of the solution
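Concretely, the RNN approach reuses the sentence encoder-decoder sketched earlier, applied at the character level. A purely illustrative data-preparation sketch, using the name pairs from this slide and a hypothetical index scheme:

```python
# Sketch of transliteration as character-level encoding-decoding: the
# same encoder-decoder as for NMT, but over characters of name pairs.
# Illustrative data preparation only, not SYSTRAN's pipeline.
pairs = [("카스파로프", "kasparov"), ("필리프", "philippe")]

src_chars = sorted({c for ko, _ in pairs for c in ko})
tgt_chars = sorted({c for _, en in pairs for c in en}) + ["</s>"]
src_idx = {c: i for i, c in enumerate(src_chars)}
tgt_idx = {c: i for i, c in enumerate(tgt_chars)}

# Each name becomes a short index sequence, ready for a char-level
# encoder-decoder such as the one sketched earlier.
for ko, en in pairs:
    print([src_idx[c] for c in ko], "->",
          [tgt_idx[c] for c in en] + [tgt_idx["</s>"]])
```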
25. SYSTRAN Deep Learning Story – Part IV: Language Modeling
• RNN language models prove to outperform standard n-gram models:
• No limitation on the span
• They also seem to capture the language structure better
• Better generalization thanks to the word embeddings
• They can easily be introduced into a PBMT engine through rescoring
• They still challenge pure sequence-to-sequence NMT approaches
Learnings:
- Very long training process: several weeks of training for one language
+ Consistent quality gain, easy introduction into the existing framework
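The “introduction through rescoring” can be sketched as follows, in plain Python; `rnn_lm_logprob` is a hypothetical stand-in for a trained RNN LM, and the interpolation weight is illustrative:

```python
# Sketch of n-best rescoring: re-rank PBMT hypotheses by mixing the
# PBMT model score with an RNN LM log-probability.
def rnn_lm_logprob(sentence):
    # Hypothetical stand-in: a real system would run the RNN LM here.
    return -2.0 * len(sentence.split())

def rescore_nbest(nbest, lm_weight=0.3):
    """nbest: list of (translation, pbmt_score) pairs."""
    rescored = [(hyp, (1 - lm_weight) * s + lm_weight * rnn_lm_logprob(hyp))
                for hyp, s in nbest]
    return max(rescored, key=lambda x: x[1])[0]

nbest = [("the cat sleeps", -4.1), ("the cat is sleeping", -4.3)]
print(rescore_nbest(nbest))   # hypothesis picked after LM rescoring
```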
26. Learnings from Deep Learning
• Consistent quality improvement in all the experiments/modules we worked on
• Better leverage of the existing training material
• Better generalization
• Incrementality: by design, it is immediate to feed in more training data – i.e. to adapt dynamically to the usage
• Globally simpler than the alternative approaches, and cognitively interesting
• Fit to be combined into a global NN architecture
28. What about Statistical Post-Editing: learning to correct?
• SPE was introduced as a smart alternative to SMT
• It corresponds to a real MT use case for localization
• Very little data can produce adaptation
• Reduces the human post-editor’s work by iteratively learning the edits
• However, the implementation with PBMT is not satisfactory:
• PBMT does not learn to correct but to translate
• Not incremental
• Learning to correct:
• More control of the process – toward a “translation checker”
• Change the paradigm – today a human post-edits the MT output; tomorrow, an automatic post-editor for human output?
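The data framing is what distinguishes “learning to correct” from plain SMT: the model is trained on (MT output → human post-edit) pairs rather than (source → target) pairs. A purely illustrative sketch with hypothetical data:

```python
# Sketch of SPE data preparation: the training corpus pairs raw MT
# output with its human post-edit, so a second model learns to correct
# rather than to translate. Data below is hypothetical.
triples = [
    # (source, raw MT output, human post-edit)
    ("Le contrat est signé.", "The contract is sign.",
     "The contract is signed."),
]

# SPE corpus: MT output on the source side, the human correction on the
# target side; any seq2seq model can then be trained on these pairs.
spe_corpus = [(mt, pe) for _, mt, pe in triples]
print(spe_corpus)
```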
29. Deep Learning for Machine Translation
• No doubt – it is coming:
• We will probably reach “super-human” machine translation in the coming years
• And this could become a real translation assistant
• How is not yet completely clear:
• From our perspective, we are working on a hybrid approach, a linguistically motivated NN architecture
• More will also come from the research world
• Still some work ahead:
• The training of the models is still a technological challenge
• To become really useful – for instance for language learning – the models need to explain as much as they translate
• Multi-level analysis: document translation, not just sentences
• Multi-modal => could lead to full self-learning of language
Editor's Notes
The last 3 years…
In Image recognition
In Voice Recognition
Show X is to Y what Z is to …
Road Sign Recognition
For some tasks
Actually it is not one single technology but a mix of different technologies – what is very appealing is that this remains relatively simple
Convolutional neural networks are widely used in image processing – they can be seen as consecutive layers of processing that progressively extract more and more advanced features
End-to-end is also called “sequence-to-sequence”
The requirements from our customers are actually quite strong – our goal is not to produce a generic academic NMT engine, but actual solutions for our customers’ requirements
So we would like to share with you the findings of these moves to DNN, and for that we took several modules
Example on Chinese
So we are not yet there – but what we foresee and work on is to establish an NN architecture preserving the actual traditional linguistic workflow, with specialized NNs stacking up to produce machine translation
From this specialization we expect several things: first, we would be able to use the existing knowledge; second, we would still have “checkpoints” in the process allowing us to monitor the translation process
Alternatively, the other important research direction for us is to improve the modeling of Statistical Post-Editing, introduced in 2007 as an alternative to rising SMT. SPE corresponds to a real use case: very little data, and an existing system performing well but not really adapted to the task.