Applying Deep Learning Neural Network Machine
Translation to Language Services
Special Course
Yannis Flet-Berliac
MSc in Digital Media Engineering
DTU
[Supervisor] Michael Kai Petersen
Associate Professor
DTU
Abstract
Recurrent neural networks (RNNs) have performed well on learning tasks for several decades. The benefit most relevant to this paper is their ability to use contextual information when mapping between input and output sequences.
Nevertheless, standard recurrent neural network architectures challenged with translation tasks can only exploit a limited range of context, and context is of major importance for complex translations.
A deep neural network for machine translation implies the use of a sequence-to-sequence model consisting of two RNNs: an encoder that processes the input and a decoder that generates the output. To address the context limitation, we will use recurrent neural networks with a Long Short-Term Memory architecture, as described in the next section. This architecture, whose first versions date back to the 1990s, will be evaluated on highly specialized texts that currently challenge language services.
To meaningfully assess the model's performance, it will be tested on texts from a translation company and on the thoughts of skilled experts about specialized topics.
1. INTRODUCTION
To understand why a Long Short-Term Memory architecture gives better results for machine translation, we first need to recall the so-called vanishing gradient problem.
This problem arises in gradient-based learning methods: those methods learn a parameter's value in the neural network by measuring how a small change in the parameter's value affects the network's output.
If a change in the parameter's value causes only a very small change in the network's output, the network cannot learn that parameter effectively.
This is exactly what happens with the vanishing gradient problem: the gradients of the network's output with respect to the parameters in the early layers (the previous words in the input sequence; see later for a graphic illustration) become extremely small, i.e. even a large change in the value of the early-layer parameters does not have a big effect on the output.
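To make this concrete, here is a minimal sketch (plain NumPy, independent of the translation model itself; the weight scale and sequence lengths are arbitrary choices for illustration) that measures how the gradient of the last hidden state with respect to the first one shrinks as a simple recurrent chain gets longer.

import numpy as np

np.random.seed(0)
hidden = 16
# Recurrent weight matrix with a small spectral radius, a common situation
# in which gradients shrink at every backward step.
W = 0.1 * np.random.randn(hidden, hidden)

def gradient_norm(steps):
    # For h_t = tanh(W h_{t-1}), d h_T / d h_0 is a product of per-step
    # Jacobians diag(1 - tanh^2) W; we accumulate it and return its norm.
    h = np.tanh(np.random.randn(hidden))
    jacobian = np.eye(hidden)
    for _ in range(steps):
        h = np.tanh(W @ h)
        jacobian = np.diag(1.0 - h ** 2) @ W @ jacobian
    return np.linalg.norm(jacobian)

for steps in (1, 5, 10, 20):
    print(steps, gradient_norm(steps))   # the norm collapses as steps grow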
Basic RNNs avoid this reasonably well for short sentences, or when the distance between the relevant information and the place where the model needs to predict the next word is small, as depicted in figure 1. In that architecture, A is a chunk of the neural network (a hidden layer) looking at some input xi and outputting the value hi. The highlighted elements x0 and x1 can be understood as the words that carry the relevant information in the sentence, and h3 as the word to predict in that same sentence. The information flow is depicted by the horizontal arrows going from left to right.
Figure 1: The RNN succeeds in learning to use the
past information for small gaps in the sentence
But when this gap widens, as in figure 2, the accuracy of the basic RNN model can no longer be guaranteed. The information flow suffers from the vanishing gradient problem, and the more it passes through the chunks A, the weaker it becomes. Long-term dependencies are harder to learn, and this is even more noticeable for texts with complex context that require the expertise of language services for a proper translation.
Figure 2: The RNN fails to learn to use the past information for large gaps in the sentence
As shown in figure 3, the sensitivity decays over time as new inputs overwrite the activations of the hidden layer. The RNN "forgets" the first inputs (i.e. the impact of previous words is reduced and the context is forgotten).
Figure 3: The sensitivity of hidden layers decays over time
Fortunately, a Long Short-Term Memory (LSTM) architecture can avoid this and gives interesting results for sequence-to-sequence learning.
2. THE MODEL
2.1 LSTM
The LSTM model introduces a new structure called a memory cell (see figure 4 below). Compared to a simple RNN cell structure (see figure 5 below), LSTM memory cells are composed of four main elements: an input gate, a neuron with a self-recurrent connection (also called the cell state, represented as the long horizontal arrow at the top of the cell), a forget gate (which looks at ht−1 and xt and outputs a number between 0 and 1 for each value in the cell state, where 1 keeps the corresponding information entirely and 0 discards it) and an output gate. As shown, the gates serve to modulate the interactions between the memory cell itself and its environment.
The input gate can either allow the incoming signal to alter the state of the memory cell or completely block it. On the other hand, the output gate can either allow the state of the memory cell to have an effect on other neurons or prevent it. Finally, the forget gate can modulate the memory cell's self-recurrent connection, allowing the cell to remember or forget its previous state as needed, according to the cost function during the training of the model.
Figure 4: Repeating module of an LSTM memory
cell
Thinking in terms of the forward pass, the input gate
learns to decide when to let activation pass into the memory
cell and the output gate learns when to let activation pass
out of the memory cell.
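As a minimal sketch of these gate equations (plain NumPy with randomly initialized weights, purely illustrative rather than the TensorFlow cells actually used in the experiment), one forward step of an LSTM memory cell can be written as follows.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, weights, biases):
    # The gates look at h_{t-1} and x_t together and modulate what enters,
    # stays in and leaves the cell state.
    Wf, Wi, Wo, Wc = weights
    bf, bi, bo, bc = biases
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(Wf @ z + bf)        # forget gate: 1 keeps a value, 0 discards it
    i = sigmoid(Wi @ z + bi)        # input gate: lets incoming signal in
    o = sigmoid(Wo @ z + bo)        # output gate: lets the state act outside
    c_tilde = np.tanh(Wc @ z + bc)  # candidate values for the cell state
    c_t = f * c_prev + i * c_tilde  # self-recurrent cell state update
    h_t = o * np.tanh(c_t)          # hidden state exposed to the rest of the network
    return h_t, c_t

emb, hid = 8, 16                    # toy dimensions for illustration
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(hid, hid + emb)) for _ in range(4)]
biases = [np.zeros(hid) for _ in range(4)]
h, c = np.zeros(hid), np.zeros(hid)        # memory state initialized with zeros
for x_t in rng.normal(size=(5, emb)):      # the cell processes one word at a time
    h, c = lstm_step(x_t, h, c, weights, biases)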
The core of our model consists of an LSTM cell that processes one word at a time. Using all these flows of information, it computes the probabilities of the possible continuations of the sentence given an input sequence of words (we will address the sequence-to-sequence formulation in the next section). The memory state of the network is initialized at the very beginning with a vector of zeros and gets updated after reading each word.
Figure 5: Repeating module of a standard RNN cell
In our case, for computational reasons, we will process the data in mini-batches of size 64 (64 sentences per training step).
2.2 Sequence-to-Sequence
For our model we will use a neural network architecture that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence.
Figure 6: Sequence-to-Sequence RNN Encoder-Decoder model
The encoder is an RNN that reads each word of an input sequence x sequentially. As it reads each word, the hidden state of the RNN changes (the LSTM memory cell state changes, i.e. the information flow gets updated recursively). When the RNN reaches the end of the sequence, its hidden state is a summary of the whole input sequence: the context information has been preserved.
The decoder of the model is yet another RNN, trained to generate the output sequence by predicting the next word given the hidden state just described.
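The following self-contained sketch shows these mechanics with plain tanh RNN steps standing in for the LSTM cells, random untrained weights and a toy vocabulary invented for illustration: the encoder folds the source words into a single hidden state, and the decoder generates target words from that state one at a time (its output here is meaningless since nothing is trained).

import numpy as np

rng = np.random.default_rng(1)
vocab = ["<eos>", "le", "chat", "dort", "the", "cat", "sleeps"]   # toy vocabulary
emb, hid = 8, 16
E = rng.normal(scale=0.1, size=(len(vocab), emb))                 # word embeddings
W_enc = rng.normal(scale=0.1, size=(hid, hid + emb))
W_dec = rng.normal(scale=0.1, size=(hid, hid + emb))
W_out = rng.normal(scale=0.1, size=(len(vocab), hid))             # state -> vocab scores

def step(W, h, x):
    # One recurrent step (a plain tanh RNN standing in for the LSTM cell).
    return np.tanh(W @ np.concatenate([h, x]))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Encoder: read the source words sequentially; the final state summarizes them.
h = np.zeros(hid)
for w in ["the", "cat", "sleeps"]:
    h = step(W_enc, h, E[vocab.index(w)])

# Decoder: starting from that summary, greedily predict the next word until <eos>.
prev, output = "<eos>", []
for _ in range(10):
    h = step(W_dec, h, E[vocab.index(prev)])
    probs = softmax(W_out @ h)               # probabilities over the vocabulary
    prev = vocab[int(np.argmax(probs))]
    if prev == "<eos>":
        break
    output.append(prev)
print(output)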
Up to this point, the model we have described encodes each input into a fixed-size state vector. This may make it difficult for the neural network to cope with long sentences, since all the information has to be compressed into a fixed-size vector. That is why an attention mechanism is additionally used. A recent and interesting architectural innovation in neural networks, first introduced by A. Graves and then developed by D. Bahdanau et al., this mechanism encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation. The decoder decides which parts of the source sentence it needs to pay attention to. The network can focus on different parts of its input, which allows the decoder to appropriately attend to the relevant segments of the information flow spreading along the sequence.
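A minimal sketch of one attention step is shown below (additive scoring in the spirit of Bahdanau et al., plain NumPy with random weights, all sizes illustrative): each encoder state receives a score against the current decoder state, the scores are normalized into weights, and the weighted sum forms the context vector the decoder attends to.

import numpy as np

rng = np.random.default_rng(2)
hid = 16
encoder_states = rng.normal(size=(7, hid))   # one vector per source word
decoder_state = rng.normal(size=hid)         # current hidden state of the decoder

# Additive (Bahdanau-style) scoring parameters, randomly initialized here.
W_a = rng.normal(scale=0.1, size=(hid, hid))
U_a = rng.normal(scale=0.1, size=(hid, hid))
v_a = rng.normal(scale=0.1, size=hid)

scores = np.array([v_a @ np.tanh(W_a @ decoder_state + U_a @ s)
                   for s in encoder_states])  # relevance of each source position
weights = np.exp(scores - scores.max())
weights /= weights.sum()                      # softmax: attention weights sum to 1
context = weights @ encoder_states            # weighted summary passed to the decoder
print(weights.round(3), context.shape)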
Finally the two components of the RNN Encoder-Decoder
are jointly trained to maximize the conditional log-likelihood:
$$\max_{\theta} \; \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}(y_n \mid x_n)$$
with θ the model parameters, xn an input sequence, yn its associated output sequence and N the total number of sentence pairs in the data set (more details in the next section). To maximize this log-likelihood, a gradient-based algorithm is used to estimate the model parameters at each step of the training.
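As a small numerical illustration of this objective (the per-word probabilities below are invented, and the real TensorFlow loss is computed differently, with a sampled softmax over buckets), the log-likelihood of a target sentence factorizes over its words, and the optimizer maximizes its average over the batch, or equivalently minimizes its negative.

import numpy as np

def sequence_log_likelihood(word_probs):
    # log p(y | x) is the sum of the log-probabilities of the target words.
    return float(np.sum(np.log(word_probs)))

# Probabilities the model assigns to each correct target word,
# for two toy sentence pairs (values invented for illustration).
batch = [
    np.array([0.40, 0.25, 0.60, 0.90]),
    np.array([0.10, 0.55, 0.30]),
]
avg_log_likelihood = np.mean([sequence_log_likelihood(p) for p in batch])
loss = -avg_log_likelihood    # what the gradient-based algorithm actually minimizes
print(avg_log_likelihood, loss)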
3. THE TRAINING
We used the TensorFlow framework and its seq2seq library together with the WMT'14 English-to-French data set, composed of 12 million sentences. In our experiment, and for both languages, we used a vocabulary of the 40K most frequent words. This seemed to be a fairly good trade-off between computing time and model accuracy. Accordingly, every word outside the vocabulary was replaced with a special 'UNK' token. We also used an AWS EC2 instance with GPU support to have remote access to the large computing power and additional storage needed for the training. To set up the model, several tools had to be installed on the instance, including TensorFlow 0.6, Essentials, CUDA Toolkit 7.0, cuDNN Toolkit 6.5 and Bazel 0.1.4. An EC2 g2.8xlarge instance running the Ubuntu Server 14.04 LTS AMI was used for the experiment. A public AMI (ami-a5d1adb2) containing the resulting machine translation model has been created during this project.
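Coming back to the preprocessing step, the vocabulary truncation and 'UNK' replacement can be sketched as follows (independent of the seq2seq library's own data utilities; the toy corpus and the max_size value are made up for illustration).

from collections import Counter

def build_vocab(sentences, max_size=40000):
    # Keep only the most frequent words, as in the experiment.
    counts = Counter(word for sent in sentences for word in sent.split())
    return {w for w, _ in counts.most_common(max_size)}

def replace_oov(sentence, vocab):
    # Every word outside the vocabulary becomes the special 'UNK' token.
    return " ".join(w if w in vocab else "UNK" for w in sentence.split())

corpus = ["the cat sleeps on the mat", "the cat eats"]   # toy corpus
vocab = build_vocab(corpus, max_size=4)
print(replace_oov("the dog sleeps", vocab))              # -> "the UNK sleeps"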
Figure 7: The 4-layer neural network used in the experiment.
40K input words and 40K output words have been used, with 4 layers of 512 LSTM cells each, different LSTMs for the input and output languages, and 1000-dimensional word embeddings. The resulting network looks like the one in figure 7 above; note that the 512 LSTM cells per layer are represented by only 4 for the encoder and 4 for the decoder. To limit memory usage we used batches of size 64 during the training. Each time step took approximately 1.3 seconds; given that we trained the model for 160,000 steps, the resulting computing time is about 58 hours on a g2.8xlarge EC2 instance.
Once inside an instance launched from the provided AMI, type the following commands in your shell to continue the model's training and see how the perplexity evolves:
cd tensorflow/tensorflow/models/rnn/translate
python translate.py --data_dir /data --train_dir / \
  --size=512 --num_layers=4
You should see hundreds of lines of output demonstrating a working GPU setup (as a side note, you do not necessarily need a g2.8xlarge instance to make it work; a g2.2xlarge is also fine, but, as you can imagine, slower).
You can also run the following commands at any time after interrupting the training to try out the machine translation:
cd tensorflow/tensorflow/models/rnn/translate
python translate.py --decode --data_dir /data \
  --train_dir / --size=512 --num_layers=4
This will display a '>' prompt waiting for an English sentence to be entered. The model will then translate it into French.
4. SOME RESULTS
While the training runs, we get information (a few output examples below) on the global state of the model and on the specific state of each bucket every 200 time steps (a bucket contains sentences within a fixed range of lengths).
global step 119600 learning rate 0.2107 step-time 1.39
perplexity 5.55
eval: bucket 0 perplexity 5.97
eval: bucket 1 perplexity 4.07
eval: bucket 2 perplexity 5.78
eval: bucket 3 perplexity 7.02
global step 119800 learning rate 0.2107 step-time 1.30
perplexity 5.48
eval: bucket 0 perplexity 6.98
eval: bucket 1 perplexity 5.09
eval: bucket 2 perplexity 5.75
eval: bucket 3 perplexity 6.95
global step 120000 learning rate 0.2107 step-time 1.41
perplexity 5.58
eval: bucket 0 perplexity 5.01
eval: bucket 1 perplexity 4.77
eval: bucket 2 perplexity 5.73
eval: bucket 3 perplexity 7.14
This gives us useful insight into the perplexity of the model, i.e. a proxy for its accuracy. As a reminder, the perplexity is based on the cross entropy between the empirical distribution (the actual one) and the predicted distribution: the cross entropy is averaged over the number of words and exponentiated, after throwing out unseen words. It gives a good understanding of our model's performance.
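To make the link between perplexity and cross entropy explicit, here is a small sketch (the probabilities are toy values, not actual model output): perplexity is the exponential of the average per-word negative log-probability, so lower is better and a perfect model would reach 1.

import math

def perplexity(word_probs):
    # Average negative log-probability per word, then exponentiate.
    cross_entropy = -sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(cross_entropy)

# Probabilities assigned to the correct words of one evaluation sentence
# (made-up values for illustration).
print(perplexity([0.25, 0.10, 0.50, 0.30]))   # about 4.0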
Now that the model has run for almost 60 hours, we can test it by computing the BLEU (Bilingual Evaluation Understudy) score on some sentences. The BLEU score compares the hypothesized translation from the model with a human translation; we will use the formulation of the BLEU score by Papineni et al. Two identical sentences would have a score of 1.
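For readers who want to reproduce this kind of score, a quick sentence-level approximation can be obtained with NLTK as sketched below (this is not the exact script used in the experiment, the example sentences are toy ones, and NLTK's smoothing makes the numbers indicative only).

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "le chat dort sur le tapis".split()     # human reference translation (toy)
hypothesis = "le chat dort sur un tapis".split()    # candidate machine translation (toy)

# BLEU measures n-gram overlap with the reference; identical sentences score 1.
score = sentence_bleu([reference], hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))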
The SOURCE is the source sentence in English, the REFERENCE is the target sentence in French translated by a human, the HYPOTHESIS MT is the sentence in French generated by our machine translation, and the HYPOTHESIS GT is the sentence in French generated by Google Translate.
Figure 8: BLEU score : MT (0.288) and GT (0.217)
Figure 9: BLEU score : MT (0.653) and GT (0.296)
Figure 10: BLEU score : MT (0.340) and GT (0.375)
We should remain cautious about basing our evaluation exclusively on the BLEU score, since the model we have been training is not strong enough yet. Indeed, although the translations make sense to us most of the time, they do not give a faithful replica of the meaning: the model still generates a sentence by reshaping how the words are linked together (more about this later in this section). Below are two interesting examples with a sentence generated by the machine translation (HYPOTHESIS MT) and a human translation of that translation (RE-TRANSLATION). For the evaluation of the model, this helps convey how the machine translation interprets a given sentence. The examples were extracted from an article by The Atlantic interviewing the cognitive scientist Donald Hoffman about the concept of reality, obviously a very specific topic.
It is striking how the machine translation adapts the meaning of the sentence without altering its point, not by changing it but rather by slightly shifting it: it chooses its own trained perspective. The result seems not to be the translation of a sentence but rather the translation of its meaning, as perceived by the machine. Last but not least, no grammatical or orthographic mistakes have been observed.
Figure 11: Human re-translation of a sentence translated with a deep neural network
Figure 12: Another human re-translation of a sentence translated with a deep neural network
5. WHAT IS NEXT
From the translations outlined above and the many others that have been generated, we can acknowledge the strong and impressive capabilities of a deep neural network using Long Short-Term Memory cells on specialized texts that challenge language services. In the coming weeks this model will be improved with a final training session using the remaining training set.
Additionally, another model will be trained to outperform the current one by improving several major elements. First, a more powerful tokenizer can be used when pre-processing the data. We will also focus on tuning the model by adjusting the learning rate and the decay, and by initializing the weights using an adaptive pre-factor applied to the random number distribution: $\frac{4}{n_i + n_j}\,\mathcal{N}(0, 1)$, with i and j corresponding to the layers linked by the weights. The optimizer will be changed to AdagradOptimizer (the optimizer that implements the Adagrad algorithm) instead of the common gradient descent one. Finally, the vocabulary size will be increased to 100K words instead of 40K.
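The planned changes can be sketched as follows (plain NumPy rather than the TensorFlow graph, with placeholder layer sizes and learning rate): the weights between two layers of sizes n_i and n_j are drawn from the scaled normal distribution above, and Adagrad keeps a per-parameter accumulator of squared gradients to adapt the step size.

import numpy as np

rng = np.random.default_rng(3)

def init_weights(n_i, n_j):
    # Adaptive pre-factor 4 / (n_i + n_j) applied to standard normal samples,
    # with n_i and n_j the sizes of the two layers linked by the weights.
    return (4.0 / (n_i + n_j)) * rng.standard_normal((n_j, n_i))

W = init_weights(512, 512)           # e.g. between two layers of 512 LSTM cells

# Adagrad update rule: accumulate squared gradients per parameter.
accumulator = np.zeros_like(W)
learning_rate, eps = 0.1, 1e-8       # placeholder hyper-parameters

def adagrad_step(W, grad):
    global accumulator
    accumulator += grad ** 2
    return W - learning_rate * grad / (np.sqrt(accumulator) + eps)

grad = rng.standard_normal(W.shape)  # placeholder gradient for illustration
W = adagrad_step(W, grad)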
6. REFERENCES
[1] M. Nielsen. Neural Networks and Deep Learning. 2016.
[2] A. Graves. Supervised Sequence Labelling with Recurrent Neural Networks. 2012.
[3] A. Karpathy. The Unreasonable Effectiveness of Recurrent Neural Networks. 2015.
[4] I. Sutskever et al. Sequence to Sequence Learning with Neural Networks. 2014.
[5] C. Olah. Understanding LSTM Networks. 2015.
[6] K. Cho et al. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. 2014.
[7] K. Papineni et al. BLEU: a Method for Automatic Evaluation of Machine Translation. 2002.
[8] A. Graves. Generating Sequences with Recurrent Neural Networks. 2013.
[9] S. Hochreiter et al. Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies. 2001.
[10] T. Mikolov. Statistical Language Models Based on Neural Networks. 2012.
[11] D. Bahdanau et al. Neural Machine Translation by Jointly Learning to Align and Translate. 2015.
[12] J. Pouget-Abadie et al. Overcoming the Curse of Sentence Length for Neural Machine Translation Using Automatic Segmentation. 2014.
