Deep Neural Machine
Translation
with Linear Associative Unit
Mingxuan Wang, Zhengdong Lu, Jie Zhou, Qun Liu
(ACL 2017)
Satoru Katsumata (B4, Tokyo Metropolitan University)
Abstract
● NMT systems with deep RNN architectures often suffer from severe gradient
diffusion.
○ this is due to the non-linear recurrent activations, which often make the optimization much
more difficult.
● they propose a novel Linear Associative Unit (LAU).
○ it reduces the gradient path inside the recurrent units
● experiment
○ NIST task: Chinese-English
○ WMT14: English-German
English-French
● analysis
○ LAU vs. GRU
○ Depth vs. Width
○ about length
background: gate structure
● LSTM, GRU: capture long-term dependencies
● Residual Network (He et al., 2015)
● Highway Network (Srivastava et al., 2015)
● Fast-Forward Network (Zhou et al., 2016) (F-F connections)
Highway Network (H, T: non-linear functions); see the equation below
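For reference, the Highway Network combination from Srivastava et al. (2015) is:
y = H(x, W_H) ⊙ T(x, W_T) + x ⊙ (1 - T(x, W_T))
where T is the transform gate; when T → 0, the input x is carried through unchanged.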
background: Gated Recurrent Unit
GRU takes a linear sum between the existing state and the newly computed state; the standard equations are given below.
z_t: update gate
r_t: reset gate
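The slide shows the GRU equations as an image; the standard formulation (Cho et al., 2014) is:
z_t = σ(W_z x_t + U_z h_{t-1})
r_t = σ(W_r x_t + U_r h_{t-1})
h̃_t = tanh(W x_t + U (r_t ⊙ h_{t-1}))
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t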
Model: LAU (Linear Associative Unit)
LAU extends GRU by adding a linear transformation of the input.
f_t and r_t express how much of the non-linear abstraction is produced
by the input x_t and the previous hidden state h_{t-1}.
g_t decides how much of the linear transformation of the input is carried
to the hidden state (a sketch of the equations follows).
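The LAU equations are likewise images in the slide; a sketch consistent with the description above (the gate parameterization is assumed here to follow the GRU pattern; the authoritative form is in the paper):
f_t = σ(W_f x_t + U_f h_{t-1})
r_t = σ(W_r x_t + U_r h_{t-1})
g_t = σ(W_g x_t + U_g h_{t-1})
h̃_t = tanh(W x_t + U h_{t-1})    (non-linear abstraction)
h_t = f_t ⊙ h_{t-1} + r_t ⊙ h̃_t + g_t ⊙ (W_h x_t)    (gated linear path added)
Compared with GRU, the extra term g_t ⊙ (W_h x_t) carries a linear transformation of the input directly into the state.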
What is good about using LAU?
LAU offers a direct way for the input x_t to reach later hidden-state layers.
This mechanism is very useful for translation, where the input should sometimes
be carried directly to the next stage of processing without any substantial
composition or non-linear transformation.
ex. imagine we want to translate a rare entity name such as ‘Bahrain’ into Chinese.
→ LAU is able to retain the embedding of this word in its hidden state.
Otherwise, serious distortion occurs due to the lack of training instances.
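Concretely, under the LAU sketch above (W_h is the assumed linear input transformation), ∂h_t/∂x_t contains the term diag(g_t) W_h, which passes through no tanh; this is the shortened gradient path mentioned in the abstract.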
Model: encoder-decoder (DeepLAU)
● vertical stacking
○ only the output of the previous layer of RNN is
fed to the current layer as input.
● bidirectional encoder
○ φ is LAU.
○ the directions are marked by a direction term d,
d = -1 or +1.
when d = -1, the layer processes in the forward direction;
otherwise, in the backward direction (see the sketch below).
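A reconstruction of the stacked recurrence from this description (the paper's Equation (9); the indexing below is my own):
h_t^(ℓ) = φ(h_{t+d}^(ℓ), h_t^(ℓ-1)),  d ∈ {-1, +1}
With d = -1, each state depends on h_{t-1}^(ℓ) (forward); with d = +1, on h_{t+1}^(ℓ) (backward). h_t^(ℓ-1) is the input from the layer below.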
Model: Encoder side
● in order to learn more temporal dependencies, they choose an unusual
bidirectional approach (sketched in code below).
● encoding
○ an RNN layer processes the input sequence in the forward direction.
○ the output of this layer is taken by an upper RNN layer as input and processed in the reverse
direction.
○ formally, following Equation (9), they set d = (-1)^ℓ for layer ℓ
○ the final encoder consists of L_enc layers and produces the output
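A minimal sketch of this alternating stacking in Python (illustrative only; `cell` stands in for the LAU recurrence φ, and all names are my own):

from typing import Callable, List

Vec = List[float]
Cell = Callable[[Vec, Vec], Vec]  # one recurrent step: (h_prev, x_t) -> h_t

def alternating_encoder(xs: List[Vec], cells: List[Cell], h0: Vec) -> List[Vec]:
    """Layer ell (1-indexed) runs with direction d = (-1)**ell, so odd
    layers read the sequence forward and even layers read it backward."""
    seq = xs
    for ell, cell in enumerate(cells, start=1):
        d = (-1) ** ell
        order = range(len(seq)) if d == -1 else range(len(seq) - 1, -1, -1)
        h = h0
        out = [h0] * len(seq)
        for t in order:
            h = cell(h, seq[t])
            out[t] = h
        seq = out  # the next layer consumes this layer's outputs
    return seq  # per-position outputs of the top encoder layer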
model: Attention side
α_t,j is calculated from the first layer of the decoder at step t-1 (s_{t-1}),
the top-most layer of the encoder at step j (h_j),
and the previous target word y_{t-1}.
σ(·) is tanh(·); a reconstruction is given below.
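A plausible reconstruction in the style of Bahdanau et al. (the weight names v, W_a, U_a, C_a are assumptions; the slide's own formula is an image):
e_t,j = v^T tanh(W_a s_{t-1} + U_a h_j + C_a emb(y_{t-1}))
α_t,j = exp(e_t,j) / Σ_k exp(e_t,k)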
model: Decoder side
the decoder follows Equation (9) with a fixed direction term d = -1.
At the first layer, they use the following input:
At the inference stage, they only utilize the top-most hidden state s^(L_dec)
to make the final prediction with a softmax layer:
y_{t-1} is the target word embedding.
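A sketch of that prediction step (the output parameters W_o and b_o are assumed names):
p(y_t | y_<t, x) = softmax(W_o s_t^(L_dec) + b_o)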
Experiments: corpus
● NIST Chinese-English
○ training: LDC corpora 1.25M sents, 27.9M Chinese words and 34.5M English words
○ dev: NIST 2002 (MT02) dataset
○ test: NIST 2003 (MT03), NIST 2004 (MT04), NIST 2005 (MT05), NIST 2006 (MT06)
● WMT14 English-German
○ training: WMT14 training corpus 4.5M sents, 91M English words and 87M German words
○ dev: news-test 2012, 2013
○ test: news-test 2014
● WMT14 English-French
○ training: subset of WMT14 training corpus 12M sents, 304M English words and 348M French
words
○ dev: concatenation of news-test 2012 and news-test 2013
○ test: news-test 2014
Experiments: setup
● For all experiments (see the config sketch after this list)
○ dimensions: embeddings, hidden states, and c_t are all of size 512.
○ optimizer: Adadelta
○ batch size: 128
○ input length limit: 80 words
○ beam size: 10
○ dropout rate: 0.5
○ layers: both the encoder and the decoder have 4 layers
● settings for each experiment
○ for Chinese-English and English-French, use the 30k most frequent words as the source and target vocabularies
○ for English-German, use the 120k most frequent source words and the 80k most frequent target words
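For reference, the shared setup as a Python config sketch (the key names are mine; the values are taken from the slide):

# hedged summary of the training setup listed above
CONFIG = {
    "embedding_dim": 512,
    "hidden_dim": 512,
    "optimizer": "Adadelta",
    "batch_size": 128,
    "max_source_length": 80,  # words
    "beam_size": 10,
    "dropout": 0.5,
    "encoder_layers": 4,
    "decoder_layers": 4,
}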
Result: Chinese-English, English-French
LAU applies an adaptive gate function conditioned on the input, which lets it decide
how much linear information should be transferred to the next step.
Result: English-German
Analysis: LAU vs. GRU, Depth vs. Width
(all analyses use the NIST Chinese-English task)
LAU vs. GRU
● row 3 to row 7
→ LAU brings an improvement.
● rows 3,4 to rows 7,8
→ GRU decreases BLEU,
but LAU brings an improvement.
Depth vs. Width
● when increasing the model depth further,
they failed to see additional improvements.
● hidden size (width)
row 2 to row 3
→ the improvement is relatively small
→ depth plays a more important role in increasing
the complexity of neural networks than width.
Analysis: About Length
DeepLAU models yield higher BLEU scores
than the DeepGRU model.
→ a very deep RNN model is good at
modelling the nested latent structures
of relatively complicated sentences.
Conclusion
● they propose the Linear Associative Unit (LAU)
○ it fuses linear and non-linear transformations.
● LAU enables building deep neural networks for MT.
● My impressions
○ In the end, is this model deep in the non-recurrent (stacking) direction or the recurrent direction?
Maybe both are likely to be good…
○ I was also interested in the model's size (number of parameters),
so I wish they had mentioned it.
reference
● Srivastava et al. Training Very Deep Networks. NIPS 2015.
● Zhou et al. Deep Recurrent Models with Fast-Forward Connections for Neural
Machine Translation. arXiv, 2016.
● He et al. Deep Residual Learning for Image Recognition. arXiv, 2015.
● Wu et al. Google’s Neural Machine Translation System: Bridging the Gap
between Human and Machine Translation. arXiv, 2016.