01
Seq2Seq (Encoder-Decoder) Model
Olivia Ni
[1] Cho, Kyunghyun, et al. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014).
[2] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in Neural Information Processing Systems. 2014.
02
Outline
• Introduction
• Seq2Seq (Encoder-Decoder) Model
  • Preliminary: LSTM & GRU
  • Notation
  • Encoder
  • Decoder
  • Training Process
• Tips for Seq2Seq (Encoder-Decoder) Model
  • Attention
  • Scheduled Sampling
  • Beam Search
• Appendix (Word Embedding/ Representation)
• Reference
03
Introduction (1/2)
• One-to-One: standard NN model without a recurrent structure; fixed-size input to fixed-size output. Example: image classification.
• Many-to-One: RNN-based model with a recurrent structure; variable-length sequence input to fixed-size output. Example: sentiment analysis, where a given sentence is classified as expressing positive or negative sentiment.
• One-to-Many: RNN-based model with a recurrent structure; fixed-size input to variable-length sequence output. Example: image captioning, which takes an image and outputs a sentence of words.
• Many-to-Many: RNN-based model with a recurrent structure; variable-length sequence input to variable-length sequence output. Example: machine translation, where an RNN reads a sentence in English and then outputs a sentence in French.
• Many-to-Many (synced): RNN-based model with a recurrent structure; synced variable-length sequence input to sequence output. Example: video classification, where we wish to label each frame of the video.
# Each rectangle: a vector; each arrow: a function (matrix multiplication)
# Red: input vectors; Blue: output vectors; Green: RNN hidden states
04
Introduction (2/2)
Encoder (編碼器)
Uses one RNN to analyze the input sequence
• In paper [1], this encoder RNN is a GRU
• In paper [2], this encoder RNN is an LSTM
Decoder (解碼器)
Uses another RNN to generate the output sequence
• In paper [1], this decoder RNN is a GRU
• In paper [2], this decoder RNN is an LSTM
Context vector (上下文向量)
A fixed-length vector representation of the input sentence
05
Seq2Seq Model - Preliminary: LSTM (1/5)
• The cell state runs straight down
the entire chain, with only some
minor linear interactions.
 Easy for information to flow
along it unchanged
• The LSTM does have the ability
to remove or add information to
the cell state, carefully regulated
by structures called gates.
06
Seq2Seq Model - Preliminary: LSTM (2/5)
• Forget gate (sigmoid + pointwise
multiplication operation):
decides what information we’re
going to throw away from the
cell state
• 1: "completely keep this"
• 0: "completely get rid of this"
07
Seq2Seq Model - Preliminary: LSTM (3/5)
• Input gate (sigmoid + pointwise
multiplication operation):
decides what new information
we’re going to store in the cell
state
(Figure note: the new candidate values are produced by a tanh layer, the same form as a vanilla RNN update.)
08
Seq2Seq Model - Preliminary: LSTM (4/5)
• Cell state update: forget the things we decided to forget earlier and add the new candidate values, scaled by how much we decided to update each entry
• $f_t$: decides what to forget
• $i_t$: decides what to update
⟹ $C_t$ is updated at timestamp $t$ and changes slowly!
09
Seq2Seq Model - Preliminary: LSTM (5/5)
• Output gate (sigmoid + pointwise multiplication operation): decides what information from the cell state we are going to output
⟹ $h_t$ is updated at timestamp $t$ and changes faster!
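Putting the previous LSTM slides together, one common formulation of the full LSTM update (the slides show these as figures; exact parameterizations vary across references, with $\sigma$ the sigmoid and $\odot$ element-wise multiplication) is:
• Forget gate: $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$
• Input gate: $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$
• Candidate values: $\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$
• Cell state update: $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
• Output gate: $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$
• Hidden state: $h_t = o_t \odot \tanh(C_t)$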
10
Seq2Seq Model - Preliminary: GRU
• Gated Recurrent Unit (GRU)
• Idea:
  • combine the forget and input gates into a single "update gate"
  • merge the cell state and hidden state
• Update gate / reset gate / state candidate / current state: see the equations below.
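For reference, the GRU equations in one common parameterization (following Cho et al. [1]; the slide shows them as figures, weight names here are illustrative, and some references swap the roles of $z_t$ and $1 - z_t$):
• Update gate: $z_t = \sigma(W_z x_t + U_z h_{t-1})$
• Reset gate: $r_t = \sigma(W_r x_t + U_r h_{t-1})$
• State candidate: $\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}))$
• Current state: $h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$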
11
Seq2Seq Model - Notation
$V^{(s)}$     The vocabulary size of the input
$x_i$         The one-hot vector of the i-th word in the input sentence
$\bar{x}_i$   The embedding vector of the i-th word in the input sentence
$E^{(s)}$     The embedding matrix of the encoder
$h_i^{(s)}$   The i-th hidden vector of the encoder
$V^{(t)}$     The vocabulary size of the output
$y_j$         The one-hot vector of the j-th word in the output sentence
$\bar{y}_j$   The embedding vector of the j-th word in the output sentence
$E^{(t)}$     The embedding matrix of the decoder
$h_j^{(t)}$   The j-th hidden vector of the decoder
12
Seq2Seq Model - Encoder
Encoder Embedding Layer
The encoder embedding layer converts each word in the input sentence to its embedding vector. When processing the i-th word in the input sentence, the input and output of the layer are:
 The input is $x_i$: the one-hot vector representing the i-th word
 The output is $\bar{x}_i$: the embedding vector representing the i-th word
Each embedding vector is computed as $\bar{x}_i = E^{(s)} x_i$, where $E^{(s)} \in \mathbb{R}^{D \times V^{(s)}}$ is the embedding matrix of the encoder.
Encoder Recurrent Layer
The encoder recurrent layer generates the hidden vectors from the embedding vectors. When processing the i-th embedding vector, the input and output of the layer are:
 The input is $\bar{x}_i$: the embedding vector representing the i-th word
 The output is $h_i^{(s)}$: the hidden vector of the i-th position
Each hidden vector is computed as $h_i^{(s)} = \psi^{(s)}(\bar{x}_i, h_{i-1}^{(s)})$, where $\psi^{(s)}$ could be an LSTM, a GRU, etc.
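A minimal sketch of the encoder in PyTorch (not from the slides; module and variable names are illustrative, and a GRU stands in for $\psi^{(s)}$):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, src_vocab_size, emb_dim, hidden_dim):
        super().__init__()
        # E^(s): maps word indices (one-hot equivalents) to D-dimensional embeddings
        self.embedding = nn.Embedding(src_vocab_size, emb_dim)
        # psi^(s): here a GRU, could equally be an LSTM
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src_ids):                 # src_ids: (batch, src_len)
        emb = self.embedding(src_ids)           # (batch, src_len, emb_dim) = x-bar_i
        outputs, h_last = self.rnn(emb)         # outputs: all h_i^(s); h_last: (1, batch, hidden_dim)
        return outputs, h_last                  # h_last plays the role of z = h_I^(s)
```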
13
Seq2Seq Model - Decoder
Decoder Embedding Layer
The decoder embedding layer converts each word in the output sentence to its embedding vector. When processing the j-th word in the output sentence, the input and output of the layer are:
 The input is $y_{j-1}$: the one-hot vector representing the (j-1)-th word generated by the decoder output layer
 The output is $\bar{y}_j$: the embedding vector representing the (j-1)-th word
Each embedding vector is computed as $\bar{y}_j = E^{(t)} y_{j-1}$, where $E^{(t)} \in \mathbb{R}^{D \times V^{(t)}}$ is the embedding matrix of the decoder.
Decoder Recurrent Layer
The decoder recurrent layer generates the hidden vectors from the embedding vectors. When processing the j-th embedding vector, the input and output of the layer are:
 The input is $\bar{y}_j$: the embedding vector
 The output is $h_j^{(t)}$: the hidden vector of the j-th position
Each hidden vector is computed as $h_j^{(t)} = \psi^{(t)}(\bar{y}_j, h_{j-1}^{(t)})$, where $\psi^{(t)}$ could be an LSTM, a GRU, etc. The encoder's hidden vector of the last position is used as the decoder's hidden vector of the first position: $h_0^{(t)} = z = h_I^{(s)}$.
Decoder Output Layer
The decoder output layer generates the probability of the j-th word of the output sentence from the hidden vector. When processing the j-th embedding vector, the input and output of the layer are:
 The input is $h_j^{(t)}$: the hidden vector of the j-th position
 The output is $p_j$: the probability of generating the one-hot vector $y_j$ of the j-th word
Each probability is computed as $p_j = P_\theta(y_j \mid Y_{<j}) = \mathrm{softmax}(o_j) = \mathrm{softmax}(W^{(0)} h_j^{(t)} + b^{(0)})$.
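A matching decoder sketch for a single decoding step (again an illustrative PyTorch formulation, not from the slides; names are assumptions):

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, tgt_vocab_size, emb_dim, hidden_dim):
        super().__init__()
        # E^(t): embeds the previously generated word y_{j-1}
        self.embedding = nn.Embedding(tgt_vocab_size, emb_dim)
        # psi^(t): here a GRU, could equally be an LSTM
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        # Output layer: o_j = W^(0) h_j^(t) + b^(0)
        self.out = nn.Linear(hidden_dim, tgt_vocab_size)

    def forward(self, prev_ids, h_prev):
        # prev_ids: (batch, 1) indices of y_{j-1}; h_prev: (1, batch, hidden_dim)
        emb = self.embedding(prev_ids)           # (batch, 1, emb_dim) = y-bar_j
        output, h_j = self.rnn(emb, h_prev)      # one decoding step
        logits = self.out(output.squeeze(1))     # (batch, tgt_vocab_size); softmax(logits) gives p_j
        return logits, h_j

# The decoder is initialized with the encoder's last hidden state: h_0^(t) = z = h_I^(s).
```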
14
Seq2Seq Model – Training Process
• MLE (Maximum Likelihood Estimation) for the following conditional probability:
$$P_\theta(Y \mid X) = P_\theta(\{y_j\}_{j=0}^{J+1} \mid X) = \prod_{j=1}^{J+1} P_\theta(y_j \mid Y_{<j}, X) = \prod_{j=1}^{J+1} P_\theta(y_j \mid Y_{<j}, z)$$
• Minimize the following loss function:
$$-\log P_\theta(Y \mid X) = -\log P_\theta(\{y_j\}_{j=0}^{J+1} \mid X) = -\sum_{j=1}^{J+1} \log P_\theta(y_j \mid Y_{<j}, X) = -\sum_{j=1}^{J+1} \log P_\theta(y_j \mid Y_{<j}, z)$$
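A hedged sketch of one training step with teacher forcing, assuming the illustrative Encoder/Decoder modules sketched above; PyTorch's cross-entropy loss combines the log-softmax and the negative log-likelihood term above:

```python
import torch
import torch.nn as nn

def train_step(encoder, decoder, optimizer, src_ids, tgt_ids, bos_id):
    """One MLE / teacher-forcing update: minimize -sum_j log P(y_j | Y_<j, z)."""
    optimizer.zero_grad()
    _, h = encoder(src_ids)                              # h = z = h_I^(s)
    criterion = nn.CrossEntropyLoss()
    loss = 0.0
    batch_size, tgt_len = tgt_ids.shape
    prev = torch.full((batch_size, 1), bos_id, dtype=torch.long)  # <BOS> plays the role of y_0
    for j in range(tgt_len):
        logits, h = decoder(prev, h)
        loss = loss + criterion(logits, tgt_ids[:, j])   # -log P(y_j | Y_<j, z)
        prev = tgt_ids[:, j].unsqueeze(1)                # teacher forcing: feed the reference token
    loss.backward()
    optimizer.step()
    return loss.item()
```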
15
Tips for Seq2Seq Model - Attention
[Figure: attention weight matrix with weights $\alpha_t^i$ for input components $i = 1, \ldots, 4$ over generation steps $t = 1, \ldots, 4$]
• Good attention: each input component should receive approximately the same total attention weight over the whole generation.
• Bad attention: if the attention keeps concentrating on the same component (e.g., "woman"), the generated sentence will look like "w1 w2 (woman) w3 w4 (woman)" ⟹ no cooking information ⟹ bad attention!
• (e.g.) Regularization term on the attention weights: $\sum_i \left( \tau - \sum_t \alpha_t^i \right)^2$, where the inner sum runs over the generation steps $t$ for each input component $i$.
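A minimal sketch of dot-product attention over the encoder hidden states (an illustrative formulation; the slides do not fix a particular scoring function, and the function name is an assumption):

```python
import torch
import torch.nn.functional as F

def dot_product_attention(dec_hidden, enc_outputs):
    """dec_hidden: (batch, hidden); enc_outputs: (batch, src_len, hidden).
    Returns the context vector and the attention weights alpha_t^i for this step."""
    scores = torch.bmm(enc_outputs, dec_hidden.unsqueeze(2)).squeeze(2)  # (batch, src_len)
    alpha = F.softmax(scores, dim=1)                                     # attention weights
    context = torch.bmm(alpha.unsqueeze(1), enc_outputs).squeeze(1)      # (batch, hidden)
    return context, alpha

# Attention regularization from the slide, with alphas stacked over generation
# steps t as a tensor of shape (tgt_len, batch, src_len):
#   attn_reg = ((tau - alphas.sum(dim=0)) ** 2).sum(dim=1).mean()
```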
16
Tips for Seq2Seq Model - Scheduled Sampling
• Mismatch between training and testing (exposure bias)
  • Training  the decoder inputs are the reference tokens.
  • Generation  the decoder inputs are its own outputs from the last time step.
• Scheduled sampling: during training, feed the input from the reference with some probability, and from the model's own previous output otherwise (see the sketch below).
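A hedged sketch of scheduled sampling inside the decoding loop, following the idea in Bengio et al. (see the references); it assumes the illustrative decoder interface from the earlier sketch, and the decay schedule is not specified here:

```python
import random
import torch

def scheduled_sampling_decode(decoder, h, tgt_ids, bos_id, teacher_forcing_prob):
    """Decode the target sequence, feeding the reference token with probability
    `teacher_forcing_prob` and the model's own prediction otherwise."""
    batch_size, tgt_len = tgt_ids.shape
    prev = torch.full((batch_size, 1), bos_id, dtype=torch.long)
    all_logits = []
    for j in range(tgt_len):
        logits, h = decoder(prev, h)
        all_logits.append(logits)
        if random.random() < teacher_forcing_prob:
            prev = tgt_ids[:, j].unsqueeze(1)             # input from reference
        else:
            prev = logits.argmax(dim=1, keepdim=True)     # input from the model itself
    return torch.stack(all_logits, dim=1), h

# teacher_forcing_prob is typically decayed (e.g. linearly or exponentially) as training proceeds.
```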
17
Tips for Seq2Seq Model - Beam Search
• Beam search (vs. greedy search)
• Core idea: keep the several best paths at each step
• It is not possible to check all paths ⟹ pre-defined parameter: beam size = 2
[Figure: a search tree over tokens A/B with branch probabilities such as 0.4/0.6 and 0.9; greedy search follows only the locally best branch, while beam search keeps the 2 best partial paths at every step]
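A minimal beam-search sketch over a generic step function that returns next-token log-probabilities (the `step_log_probs` interface and names are assumptions for illustration, not the slides' implementation):

```python
def beam_search(step_log_probs, bos_id, eos_id, beam_size=2, max_len=20):
    """step_log_probs(prefix) -> dict {token_id: log_prob} for the next token.
    Keeps the `beam_size` best partial paths at each step instead of only the
    single greedy best one."""
    beams = [([bos_id], 0.0)]                       # (token prefix, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == eos_id:                # finished paths are kept as-is
                candidates.append((prefix, score))
                continue
            for token, logp in step_log_probs(prefix).items():
                candidates.append((prefix + [token], score + logp))
        # keep only the beam_size best partial paths
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(prefix[-1] == eos_id for prefix, _ in beams):
            break
    return beams[0][0]                              # best complete path
```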
18
Appendix – Word Embedding/ Representation
• Recall: there are two embedding layers in the seq2seq model.
• How to learn their embedding matrices?  Word embedding/ representation!
• History:
  • Knowledge-based representation vs. corpus-based representation
  • Corpus-based: atomic symbol (one-hot encoding) → (neighbors) → high-dimensional sparse word vector (window-based co-occurrence matrix) → (dimension reduction via SVD, or direct learning via word2vec/GloVe) → low-dimensional dense word vector
• Issues (knowledge-based representation): (1) newly-invented words, (2) subjective, (3) annotation effort, (4) difficult to compute word similarity
19
Appendix – Word Embedding/ Representation
• Recall: there are two embedding layers in the seq2seq model.
• How to learn their embedding matrices?  Word embedding/ representation!
• History: [same diagram as slide 18, now at the one-hot encoding stage]
• Issues (one-hot encoding): difficult to compute the similarity between words
• Idea: words with similar meanings often have similar neighbors
20
Appendix – Word Embedding/ Representation
• Recall: there are two embedding layers in the seq2seq model.
• How to learn their embedding matrices?  Word embedding/ representation!
• History: [same diagram as slide 18, now at the window-based co-occurrence matrix stage]
• Issues (co-occurrence matrix): (1) matrix size increases with the vocabulary, (2) high-dimensional, (3) sparsity → poor robustness
• Idea: low-dimensional word vectors, e.g. via dimension reduction (SVD); a small sketch of this route follows below
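A small NumPy sketch of the "window-based co-occurrence matrix + SVD" route (the toy corpus, window size, and dimensionality are illustrative assumptions):

```python
import numpy as np

corpus = [["I", "like", "deep", "learning"],
          ["I", "like", "NLP"],
          ["I", "enjoy", "flying"]]

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Window-based co-occurrence matrix (window size 1): high-dimensional and sparse
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sent):
                X[idx[w], idx[sent[j]]] += 1

# Dimension reduction with SVD: keep the top-k singular directions
U, S, Vt = np.linalg.svd(X)
k = 2
dense_vectors = U[:, :k] * S[:k]      # low-dimensional dense word vectors
print(dense_vectors[idx["deep"]])
```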
21
Appendix – Word Embedding/ Representation
• Recall: there are two embedding layers in the seq2seq model.
• How to learn their embedding matrices?  Word embedding/ representation!
• History: [same diagram as slide 18, now at the dimension-reduction (SVD) stage]
• Issues (SVD): (1) computationally expensive: O(mn^2) for an n x m matrix when n < m, (2) difficult to add new words
• Idea: directly learn low-dimensional word vectors (word2vec, GloVe)
• Cho, Kyunghyun, et al. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014).
• Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in Neural Information Processing Systems. 2014.
• Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual attention." International Conference on Machine Learning. 2015.
• Bengio, Samy, et al. "Scheduled sampling for sequence prediction with recurrent neural networks." Advances in Neural Information Processing Systems. 2015.
• Wiseman, Sam, and Alexander M. Rush. "Sequence-to-sequence learning as beam-search optimization." arXiv preprint arXiv:1606.02960 (2016).
• Prof. Hung-yi Lee, National Taiwan University [Machine Learning and having it deep and structured (2018, Spring)]
• Prof. Yun-Nung (Vivian) Chen, National Taiwan University [Applied Deep Learning (2019, Spring)]
22
Reference
23
Thanks for listening.