LSTM Encoder-Decoder Architecture with
Attention Mechanism for Machine Comprehension
Eugene Nho
MBA/MS, Intelligent Systems
Stanford University
enho@stanford.edu
Brian Higgins
BS, Computer Science
Stanford University
bhiggins@stanford.edu
Abstract
Machine comprehension remains a challenging open area of research. While many question answering
models have been explored for existing datasets, little work has been done with the newly released
MS MARCO dataset, which mirrors real-world question answering much more closely and poses many unique challenges.
We explore an end-to-end neural architecture with attention mechanisms for comprehending relevant
information and generating text answers for MS MARCO.
1 Introduction
Machine comprehension—building systems that comprehend natural language documents—is a challenging open area
of research within natural language processing (NLP). Since the widespread adoption of statistical machine learning and,
more recently, deep learning, progress in the field has been made largely in lockstep with the introduction of large-scale datasets.
The latest such dataset is MS MARCO [1]. Unlike previous machine comprehension and question answering datasets,
MS MARCO consists of real questions generated by anonymized queries from Bing, and has target answers in the
form of sequences of words. Compared to other answer formats such as multiple choice, single-token answers, or spans of
words, generating free-form text is significantly more challenging. To our knowledge, no literature exists on end-to-end systems
specifically addressing this dataset yet.
In this paper, we explore an attention-based encoder-decoder architecture that comprehends input text and generates
text answers.
2 Related Work
Machine comprehension and question answering tasks have gone through multiple stages of evolution over the past decade.
Traditionally, these tasks relied on complex NLP pipelines involving steps like syntactic parsing, semantic parsing, and
question classification. With the rise of neural networks, end-to-end neural architectures have increasingly been applied
to comprehension tasks [2][3][4][5]. The evolution of available datasets was a driving force behind the progress in
this domain—from models simply choosing between multiple choice answers [4][5] to those generating single-token
answers [2][3] or span-of-words answers [6].
Throughout this evolution, attention has emerged as a key concept, in particular as the tasks demanded by datasets became
more complex. Attention was first utilized in the context of non-NLP tasks, such as learning alignments between image
objects and agent actions [7], or between the visual features of an image and its text description in the caption generation task.
Bahdanau et al. (2015) applied the attention mechanism to Neural Machine Translation (NMT) [8], and Luong et al.
(2015) explored different architectures for attention-based NMT, including a global approach and a local approach [9].
Figure 1: Architecture of the encoder-decoder model with attention
Several approaches have been tried to incorporate attention into machine comprehension and question answering tasks.
Seo et al. (2017) applied bidirectional models with attention to achieve near state-of-the-art results for SQuAD [10].
Wang & Jiang (2016) combined match-LSTM, an attention model originally proposed for text entailment, with Pointer
Net, a sequence-to-sequence model proposed by Vinyals et al. (2015) that constrains output words to the tokens from
input sequences [6][11].
3 Dataset: MS MARCO
MS MARCO is a reading comprehension dataset with a number of characteristics that make it the most realistic
available dataset for question answering. All questions are real anonymized queries from Bing, and the context passages,
from which target answers are derived, are sampled from real web documents, closely mirroring the real-world scenario
of finding an answer to a question on a search engine. The target answers, as mentioned earlier, are sequences of words
and are human-generated.
The dataset has 100,000 queries. Each query consists of one question, approximately 10 context passages, and target
answers. While most queries have one answer, some have many and some have none. The average length of the
questions, passages, and target answers are approximately 15, 85, and 8 tokens, respectively. There is no particular
topical focus, but each query falls into one of the following five categories: description (52.6%), numeric (28.4%), entity
(10.5%), location (5.7%) and person (2.7%).
Given the nature of the text generation task required by the dataset, we use ROUGE-L and BLEU as evaluation metrics.
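As a rough illustration of how these metrics behave on short answers, the sketch below computes ROUGE-L (as an LCS-based F1) and BLEU-1 on whitespace-tokenized strings. It is a simplified stand-in, not the official MS MARCO evaluation script, and the exact weighting conventions (e.g., plain F1 for ROUGE-L) are assumptions made for illustration.

from collections import Counter
import math


def lcs_length(a, b):
    """Length of the longest common subsequence between token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l_f1(candidate, reference):
    """ROUGE-L as an F1 over the LCS (official variants may weight recall differently)."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(candidate), lcs / len(reference)
    return 2 * p * r / (p + r)


def bleu_1(candidate, reference):
    """BLEU-1: clipped unigram precision times the brevity penalty."""
    ref_counts = Counter(reference)
    overlap = sum(min(c, ref_counts[w]) for w, c in Counter(candidate).items())
    precision = overlap / max(len(candidate), 1)
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * precision


print(rouge_l_f1("a dog runs fast".split(), "the dog runs very fast".split()))
print(bleu_1("a dog runs fast".split(), "the dog runs very fast".split()))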
4 Methods
The problem we are trying to solve can be defined formally as follows. For each example in the dataset, we are given a
question and a set of potentially relevant context passages. The question is represented as $Q \in \mathbb{R}^{d \times n}$, where $d$ is the
embedding size and $n$ is the length of the question. The set of passages is represented as $S = \{P_1, P_2, \dots, P_m\}$, where $m$
is the number of passages. Our objective is to (1) choose the most relevant passage $P$ out of $S$, and (2) generate the
answer consisting of a sequence of words, represented as $R \in \mathbb{R}^{d \times k}$, where $k$ is the length of the answer. Because both
tasks require very similar architectures, this paper will only focus on the second objective, assuming the best passage $P$
has already been selected.
We use an encoder-decoder architecture with multiple attention mechanisms to achieve this, as shown in Figure 1. The
components of this model are:
1. Question Encoder is an LSTM layer mapping the question information to a vector space.
2. Passage Encoder with Attention is an LSTM layer with an attention mechanism connecting the passage
information with the encoded knowledge about the question.
3. Attention Encoder is an extra LSTM layer further distilling the output from the Passage Encoder with
Attention.
4. Decoder is an LSTM layer that takes in information from the three encoders and generates the answer. It has
an attention mechanism connecting to the Attention Encoder.
4.1 Question Encoder
The purpose of the question encoder is to incorporate contextual information from each token of the question into a
vector space. We use a standard unidirectional LSTM [12] layer to process the question, represented as follows:

$H^q = \overrightarrow{\text{LSTM}}(Q)$    (1)

The output matrix $H^q \in \mathbb{R}^{h \times n}$ is the hidden state representation of the question, where $h$ is the dimension of the
hidden state. $h^q_i$ represents the $i$th column of $H^q$, and encapsulates the contextual information up to the $i$th token of the
question ($q_i$).
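A minimal PyTorch sketch of this encoder follows; it is illustrative rather than the exact implementation used in our experiments, and the embedding size, hidden size, and question length shown are assumed values.

import torch
import torch.nn as nn

d, h, n = 300, 128, 15                       # assumed embedding size, hidden size, question length

question_encoder = nn.LSTM(input_size=d, hidden_size=h, batch_first=True)

Q = torch.randn(1, n, d)                     # embedded question, shape (batch, n, d)
H_q, _ = question_encoder(Q)                 # H_q: (batch, n, h), one hidden state per question token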
4.2 Passage Encoder with Attention
The objective of this layer is to capture the information from the passage that is relevant to the contents of the question. To that
end, it employs a standard LSTM layer with the global attention mechanism proposed by Luong et al. [9]. $P \in \mathbb{R}^{d \times l}$
is the matrix representing the most relevant context passage, where $d$ is the embedding size and $l$ is the length of the
passage. At each time step $t$, this encoder uses the embedding of the $t$th token of the passage and the entire hidden state
representation from the question encoder to capture the contextual information up to the $t$th token that is relevant to the question.
This is represented as follows:
$\tilde{h}^p_t = \overrightarrow{\text{AttentionLSTM}}(p_t, H^q)$    (2)

where $\tilde{h}^p_t \in \mathbb{R}^h$ is the hidden vector capturing this information, $p_t \in \mathbb{R}^d$ is the $t$th token of the passage (i.e., the $t$th column
of $P$), and $H^q$ is the matrix representing all the hidden states of the question encoder.
$\overrightarrow{\text{AttentionLSTM}}$ is an abstraction involving the following two steps. First, it captures information from the $t$th token
using a regular LSTM cell, represented as $h^p_t = \overrightarrow{\text{LSTM}}(p_t)$, where $h^p_t \in \mathbb{R}^h$. Second, using $H^q$ and $h^p_t$, it derives a
context vector $c_t \in \mathbb{R}^h$ that captures relevant information from the question. $c_t$ is concatenated with $h^p_t$ to generate $\tilde{h}^p_t$,
as shown below:

$\tilde{h}^p_t = \overrightarrow{\text{AttentionLSTM}}(p_t, H^q) = \tanh(W_c [h^p_t ; c_t] + b_c)$    (3)

where $W_c \in \mathbb{R}^{h \times 2h}$ is a weight matrix and $b_c \in \mathbb{R}^h$ is a bias.
The context vector $c_t$ captures all the hidden states from the question encoder, weighted by how relevant each question
word is to the $t$th passage word. To derive $c_t$, we first calculate the relevance score between the $t$th passage word,
represented by $h^p_t$, and the $j$th question word, represented by $h^q_j$:

$\text{score}(h^p_t, h^q_j) = (W_s h^p_t + b_s)^\top h^q_j$    (4)

where $W_s \in \mathbb{R}^{h \times h}$ is a weight matrix and $b_s \in \mathbb{R}^h$ is a bias. We then create the attention vector
$a_t = \{a^{(1)}_t, a^{(2)}_t, \dots, a^{(n)}_t\} \in \mathbb{R}^{1 \times n}$, a row vector whose $j$th element is the softmax value of
$\text{score}(h^p_t, h^q_j)$. $c_t$ is the weighted average of $H^q$ based on the attention vector $a_t$:

$a^{(j)}_t = \dfrac{\exp(\text{score}(h^p_t, h^q_j))}{\sum_{\rho=1}^{n} \exp(\text{score}(h^p_t, h^q_\rho))}$    (5)

$g_t = H^q \odot \bar{a}_t$    (6)

$c_t = \sum_{i=1}^{n} g^{(i)}_t$    (7)

where $\bar{a}_t \in \mathbb{R}^{h \times n}$ is the matrix obtained by vertically broadcasting $a_t$, and $g^{(i)}_t$ is the $i$th column of $g_t$.
Figure 2: Loss and evaluation metrics
All the formulas aside, the important intuition here is that the hidden state of this encoder at each time step $t$ incorporates
not only information about the passage tokens up to that step (as is the case with a normal LSTM hidden state), but also a
measure of how relevant each of the question tokens is to that particular $t$th passage token.
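The sketch below implements one step of Equations (3)-(7) in PyTorch; tensor shapes and parameter names are our own choices for illustration rather than the exact configuration used in our experiments.

import torch
import torch.nn as nn

h, n = 128, 15                               # assumed hidden size and question length
W_s = nn.Linear(h, h)                        # W_s and b_s of Equation (4)
W_c = nn.Linear(2 * h, h)                    # W_c and b_c of Equation (3)

H_q = torch.randn(h, n)                      # hidden states of the question encoder
h_p_t = torch.randn(h)                       # passage LSTM hidden state at step t

scores = W_s(h_p_t).unsqueeze(0) @ H_q       # (1, n), Equation (4)
a_t = torch.softmax(scores, dim=1)           # (1, n), Equation (5)
c_t = (H_q * a_t).sum(dim=1)                 # (h,), Equations (6)-(7): attention-weighted sum of H_q columns
h_tilde_p_t = torch.tanh(W_c(torch.cat([h_p_t, c_t])))   # (h,), Equation (3)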
4.3 Attention Encoder
The purpose of this LSTM layer is to further distill the contextual information captured by the passage encoder. It is
a variant of the Match LSTM layer, first proposed by Wang & Jiang (2016) for text entailment and later applied to
question answering by the same authors [6]. We utilize a standard unidirectional LSTM layer taking as input all the
hidden states of the Passage Encoder with Attention, as shown below:
$H^m = \overrightarrow{\text{LSTM}}(\tilde{H}^p)$    (8)

where $\tilde{H}^p = \{\tilde{h}^p_1, \tilde{h}^p_2, \dots, \tilde{h}^p_l\} \in \mathbb{R}^{h \times l}$ is the hidden state representation from the passage encoder.
4.4 Decoder
We use an LSTM layer with global attention between the decoder and the Attention Encoder to generate the answer.
The purpose of this attention mechanism is to let the decoder "peek" at the relevant information encapsulating the
passage and the question as it generates the answer. In addition, to pass along the contextual information captured by
the encoders, we concatenate the last hidden states from all three encoders and feed the combined matrix as the initial
hidden state to the decoder. At each time step t, the decoder takes as input the embedding of the generated token from
the previous step, and uses the same attention mechanism described in the earlier section to derive the hidden state
representation $\tilde{h}^r_t \in \mathbb{R}^{3h}$:

$\tilde{h}^r_t = \overrightarrow{\text{AttentionLSTM}}(r_{t-1}, H^m)$    (9)

where $r_{t-1} \in \mathbb{R}^d$ is the embedding of the word generated in the previous time step. Then we apply a matrix multiplication and
a softmax to generate the output vector $o_t \in \mathbb{R}^V$:

$o_t = \text{softmax}(U \tilde{h}^r_t + b)$    (10)

where $V$ is the vocabulary size. $r_t$ is the word embedding corresponding to $o_t$.
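A greedy-decoding sketch of this layer is shown below. The decoder state is $3h$-dimensional because it is initialized with the concatenated last states of the three encoders, so the attention projections differ in shape from Section 4.2; the token ids, sizes, and parameter names are illustrative rather than the exact configuration used in our experiments.

import torch
import torch.nn as nn

d, h, V, l, max_len = 300, 128, 20000, 85, 30
embedding = nn.Embedding(V, d)
decoder_cell = nn.LSTMCell(input_size=d, hidden_size=3 * h)
W_s = nn.Linear(3 * h, h)                    # projects the 3h decoder state for attention scoring
W_c = nn.Linear(3 * h + h, 3 * h)            # combines decoder state and context vector
U = nn.Linear(3 * h, V)                      # output projection, Equation (10)

H_m = torch.randn(1, l, h)                   # attention encoder states, shape (batch, l, h)
h_t = torch.randn(1, 3 * h)                  # init: concatenated last hidden states of the three encoders
c_t = torch.zeros(1, 3 * h)                  # initial LSTM cell state
token = torch.tensor([1])                    # assumed <start> token id

answer = []
for _ in range(max_len):
    h_t, c_t = decoder_cell(embedding(token), (h_t, c_t))
    scores = torch.bmm(W_s(h_t).unsqueeze(1), H_m.transpose(1, 2))   # (1, 1, l) attention scores
    attn = torch.softmax(scores, dim=2)
    context = torch.bmm(attn, H_m).squeeze(1)                        # (1, h) context vector
    h_tilde = torch.tanh(W_c(torch.cat([h_t, context], dim=1)))      # Equation (9) analogue, (1, 3h)
    o_t = torch.softmax(U(h_tilde), dim=1)                           # Equation (10), (1, V)
    token = o_t.argmax(dim=1)                                        # greedy choice of next token
    answer.append(token.item())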
          Description  Person  Numeric  Location  Entity  Full Data Set
BLEU-1            9.1     3.2      7.4       3.8     7.4            9.3
ROUGE-L          15.1     2.8     13.7       3.4     5.5           12.8

Table 1: Evaluation metrics by question type on the held-out validation set
5 Experiments and Discussion
5.1 Results
A variant of the model described earlier with a double-layer LSTM decoder achieved a ROUGE-L of 12.8 and a BLEU of
9.3 on the held-out validation set. While this result is not state-of-the-art, our model outperformed several benchmark
models, such as a Seq2Seq model with Memory Networks [1].
Figure 2 shows the training progression for our model. Loss declines steadily, but the model starts overfitting around
the 10th epoch, where both evaluation metrics on the development set peak. Our classifier for selecting the most
relevant context passage achieved an accuracy of 100% on both the development set and the held-out validation set.
We conducted over 40 runs for hyperparameter optimization, and found the following:
• Larger batch size generally performed better, with a batch size of 256 outperforming 64 (9.3 vs 6.4 BLEU)
• L2 loss outperformed cross entropy (9.3 vs 7.0 BLEU)
• Our model was sensitive to the learning rate, with evaluation metrics dropping precipitously when the rate
approached the 0.01 range (versus 0.001 or 0.0001)
5.2 Error Analysis
As illustrated by Table 1, our model was much more effective at generating description and numeric answers than
answers about people and locations. Its relative weakness on people and locations most likely stems from the rarity of
their names. The dataset contained a total of approximately 650,000 tokens. Because of memory restrictions, we used a
vocabulary size of 20,000 during training, and the remaining words were cast to the <unknown> token. This practice
led to lower performance on the people and location questions, which on average contain more uncommon tokens
falling outside of the vocabulary.
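The sketch below illustrates this vocabulary truncation: keep the most frequent tokens and map everything else to the <unknown> id. The special-token names and the exact counting scheme are assumptions made for illustration.

from collections import Counter

VOCAB_SIZE = 20000
SPECIALS = ["<pad>", "<unknown>", "<start>", "<end>"]      # assumed special tokens

def build_vocab(token_stream, vocab_size=VOCAB_SIZE):
    """Map the most frequent tokens to ids; everything else falls back to <unknown>."""
    counts = Counter(token_stream)
    most_common = [w for w, _ in counts.most_common(vocab_size - len(SPECIALS))]
    return {w: i for i, w in enumerate(SPECIALS + most_common)}

def encode(tokens, word_to_id):
    unk = word_to_id["<unknown>"]
    return [word_to_id.get(t, unk) for t in tokens]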
We also found that longer answers were much harder to predict. This is a well-known issue in text generation
tasks: the farther into the decoding process the model gets, the more diluted the original hidden state becomes, making
long sequences of words very challenging to predict.
Lastly, our design decision on how to treat the <end> token in the generated answer impacted our error rate. During
training, we masked the generated answers based on the lengths of the ground truth. Our hypothesis was that this would
encourage our model to generate predictions of the same length as the target answers. However, because of this
decision, the model was never penalized for the (often irrelevant) tokens it generated beyond the length of the target answer.
This lowered our evaluation metrics on the held-out validation set.
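This masking can be sketched as a per-step loss that is zeroed beyond the ground-truth answer length, so tokens generated past that point contribute nothing to the gradient. Cross-entropy is used here purely for illustration (our best runs actually used an L2 loss), and the shapes are assumed.

import torch
import torch.nn.functional as F

def masked_answer_loss(logits, targets, target_lengths):
    """logits: (batch, T, V); targets: (batch, T); target_lengths: (batch,)."""
    batch, T, V = logits.shape
    per_step = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1), reduction="none")
    per_step = per_step.view(batch, T)
    steps = torch.arange(T).unsqueeze(0)                      # (1, T) step indices
    mask = (steps < target_lengths.unsqueeze(1)).float()      # 1 inside the target length, 0 beyond
    return (per_step * mask).sum() / mask.sum()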
6 Conclusion
We see a few areas of improvement for future work. First, we would like to reduce the burden on the decoder softmax
by limiting the vocabulary to only the tokens used in that particular query’s question and passage. Predicting the right
token out of 20,000 options is challenging, especially when the model is required to get that right 20 to 30 times in a
row. Second, we would like to train a language model to provide a warm start for our decoder, so that the decoder does not
have to learn how to generate sensible sequences of English words while simultaneously learning to respond to the
questions and context passages.
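The first idea can be sketched as a masked softmax that zeroes out every vocabulary entry not appearing in the current question and passage; the function below illustrates that direction and is not a tested component.

import torch

def restricted_softmax(logits, allowed_ids):
    """logits: (V,); allowed_ids: token ids that appear in this query's question and passage."""
    mask = torch.full_like(logits, float("-inf"))
    mask[torch.tensor(sorted(set(allowed_ids)))] = 0.0        # keep only in-query tokens
    return torch.softmax(logits + mask, dim=0)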
References
[1] Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A
Human Generated MAchine Reading COmprehension Dataset. In NIPS 2016.
[2] Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom.
2015. Teaching Machines To Read And Comprehend. arXiv:1506.03340 [cs.CL].
[3] Rudolf Kadlec, Martin Schmid, Ondrej Bajgar, and Jan Kleindienst. 2016. Text Understanding With The Attention Sum Reader
Network. In ACL 2016.
[4] Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2015. The Goldilocks Principle: Reading Children’s Books with
Explicit Memory Representations. arXiv:1511.02301 [cs.CL].
[5] Wenpeng Yin, Sebastian Ebert, and Hinrich Schütze. 2016. Attention-Based Convolutional Neural Network For Machine
Comprehension. arXiv:1602.04341 [cs.CL].
[6] Shuohang Wang and Jing Jiang. 2016. Machine comprehension using Match-LSTM and answer pointer. Under review as a
conference paper at ICLR 2017.
[7] Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. 2014. Recurrent models of visual attention. In NIPS.
[8] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation By Jointly Learning To Align And
Translate. arXiv:1409.0473 [cs.CL].
[9] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine
Translation. arXiv:1508.04025 [cs.CL].
[10] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional Attention Flow for Machine
Comprehension. In Proceedings of ICLR 2017.
[11] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer Networks. arXiv:1506.03134 [stat.ML].
[12] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9.8 (1997): 1735-1780.
[13] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine
Comprehension of Text. In Proceedings of EMNLP 2016.
[14] Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. 2016. ReasoNet: Learning To Stop Reading In Machine Comprehension.
arXiv:1609.05284 [cs.LG].