Quora Question Pairs
By: Shradha Sunil
Peddamail Jayavardhan Reddy
Dataset
The datasets for this project were obtained from the Kaggle competition site. The files are:
1. Train set:
   a. Size: 21.17 MB
   b. Number of question pairs: 404,290
2. Test set:
   a. Size: 112.47 MB
   b. Number of question pairs: 2,345,796
Distribution of the Training Set (figure)
Word Embeddings:
1. Word2Vec:
○ Dimension: 300d
○ Vocabulary: 3 million words
○ Trained on: 100 billion words from the Google News dataset
2. GloVe vector representation:
○ Dimensions: 50d, 100d, 200d, 300d
○ Vocabulary: 400k words
○ Trained on: 6 billion tokens from Wikipedia 2014 + Gigaword 5
○ We used the 100d GloVe representation.
APIs used:
1. Keras
2. TensorFlow
3. Gensim
4. scikit-learn
5. Plotly, Seaborn (visualizations)
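As a hedged illustration of how these libraries fit together, the sketch below loads both pre-trained embedding sets with Gensim. The file names are assumptions about where the downloaded vectors live, and the API shown is the Gensim 3.x KeyedVectors interface.

```python
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# Word2Vec: 300d vectors trained on ~100B words of Google News (binary file).
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# GloVe: 100d vectors trained on Wikipedia 2014 + Gigaword 5.
# Gensim reads the word2vec text format, so convert the GloVe file first.
glove2word2vec("glove.6B.100d.txt", "glove.6B.100d.w2v.txt")
glove = KeyedVectors.load_word2vec_format("glove.6B.100d.w2v.txt", binary=False)

print(w2v["apple"].shape)    # (300,)
print(glove["apple"].shape)  # (100,)
```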
Implementation of Simpler Models
1. TF-IDF was used to turn the question text into features.
2. The following models were implemented with the scikit-learn library (a minimal sketch follows this list):
   1. Random forest
   2. Logistic regression
   3. Naive Bayes
   4. Decision tree
   5. SVM
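The exact feature construction for these baselines is not detailed in the slides; the sketch below is one minimal assumption, representing each question with TF-IDF and combining the pair by absolute difference before fitting the scikit-learn classifiers.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Toy question pairs; on the real data these come from the Kaggle train set.
q1 = ["How do I learn Python?", "What is the capital of France?"]
q2 = ["What is the best way to learn Python?", "How do I learn to cook?"]
y  = np.array([1, 0])   # 1 = duplicate pair, 0 = non-duplicate pair

vec = TfidfVectorizer().fit(q1 + q2)                            # shared vocabulary
X = np.abs((vec.transform(q1) - vec.transform(q2)).toarray())   # pairwise feature

for model in (RandomForestClassifier(n_estimators=100), LogisticRegression()):
    model.fit(X, y)                      # on the real data: fit on the train split only
    print(type(model).__name__, model.predict_proba(X)[:, 1])
```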
Results

Model               | Training accuracy | Validation accuracy | Training loss | Validation loss
Random forest       | 0.775             | 0.721               | 0.685315      | 0.697242
Logistic regression | 0.685             | 0.669               | 0.72543       | 0.741672
Decision tree       | 0.672             | 0.695               | 0.75232       | 0.74327
Naïve Bayes         | 0.574             | 0.633               | 0.8472        | 0.8253
SVM                 | 0.429             | 0.553               | 0.8723        | 0.8244
Neural Network Architectures
1. Convolutional Neural Network (CNN)
2. Long Short-Term Memory network (LSTM)
3. Bidirectional LSTM
Simple Siamese Architecture:
(Diagram: two copies of the same neural network with shared weights process Input 1 and Input 2; their outputs, Output 1 and Output 2, are compared by a similarity measure to produce the final output.)
Siamese Architecture Using CNN (figure):
Siamese Architecture Using CNN with GloVe Representation (figure):
Embedding layer (first layer): The first layer in deep NLP architectures. It maps each input token to its corresponding word embedding. The output of the embedding layer is fed into the convolution layer.
Conv1D layer: Performs 1D convolution over the word embeddings produced by the embedding layer. The parameters to specify are the number of filters, the kernel size, the stride length, and the input shape.
Max-pooling layer: Max pooling is a sample-based discretization process. The objective is to down-sample the input representation, reducing its dimensionality and allowing assumptions to be made about the features contained in the sub-regions.
Dropout: Dropout is applied to prevent overfitting.
Flatten: The flatten layer flattens the output of the CNN branch into a vector.
Merge layer: Combines the vector outputs of the two CNN branches. Each branch produces a sentence vector; the merge layer combines them into a single vector.
Dense layer: A regular densely connected layer. It converts the merged vector into a number between 0 and 1, which is the similarity predicted by the network.
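A minimal Keras sketch of this stack is shown below; the vocabulary size, embedding dimension, question length, filter count, and kernel size are assumptions. The functional-API `concatenate` plays the role of the merge layer described above, and reusing the same layer objects for both questions gives the shared weights of the Siamese setup.

```python
from keras.layers import (Input, Embedding, Conv1D, MaxPooling1D, Dropout,
                          Flatten, Dense, concatenate)
from keras.models import Model

MAX_LEN, VOCAB, EMB_DIM = 30, 20000, 100          # assumed sizes

# One set of layer objects, applied to both questions => shared weights.
embed = Embedding(VOCAB, EMB_DIM, input_length=MAX_LEN)
conv  = Conv1D(filters=64, kernel_size=3, strides=1, activation="relu")
pool  = MaxPooling1D(pool_size=2)
drop  = Dropout(0.5)
flat  = Flatten()

def branch(inp):
    return flat(drop(pool(conv(embed(inp)))))     # sentence vector for one question

q1_in, q2_in = Input(shape=(MAX_LEN,)), Input(shape=(MAX_LEN,))
merged = concatenate([branch(q1_in), branch(q2_in)])    # merge layer
out = Dense(1, activation="sigmoid")(merged)            # similarity score in [0, 1]

model = Model(inputs=[q1_in, q2_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```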
Parameters within the CNN:

Model | Number of layers | Dropout | Validation loss | Training loss
CNN   | 2                | 0.5     | 0.54606734      | 0.544829
CNN   | 3                | 0.4     | 0.6147397       | 0.61473978
Observations from several CNN models with varying dropout and number of layers, after training for 20 epochs:
Unlike images, where more layers lead to more abstraction, for text we found that two layers gave the best results.
This is because most of the features are learnt in the first layer itself, so activations from the first layer are simply passed through and nothing new is learnt by adding deeper layers.
Siamese Architecture Using LSTM:
(Diagram: Question 1 and Question 2 each pass through an embedding layer and an LSTM with shared weights; the two LSTM outputs are merged and fed to a dense layer.)
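A hedged Keras sketch of this architecture follows; the embedding matrix placeholder, layer sizes, and optimizer are assumptions, with the LSTM size and dropout matching the values reported later in the slides.

```python
import numpy as np
from keras.layers import Input, Embedding, LSTM, Dense, concatenate
from keras.models import Model

MAX_LEN, VOCAB, EMB_DIM, LSTM_UNITS = 30, 20000, 100, 100     # assumed sizes
embedding_matrix = np.random.rand(VOCAB, EMB_DIM)   # placeholder for GloVe/Word2Vec weights

embed = Embedding(VOCAB, EMB_DIM, weights=[embedding_matrix],
                  input_length=MAX_LEN, trainable=False)       # frozen pre-trained vectors
lstm = LSTM(LSTM_UNITS, dropout=0.2, recurrent_dropout=0.2)    # shared by both questions

q1_in, q2_in = Input(shape=(MAX_LEN,)), Input(shape=(MAX_LEN,))
merged = concatenate([lstm(embed(q1_in)), lstm(embed(q2_in))])   # merge layer
out = Dense(1, activation="sigmoid")(merged)                     # dense similarity output

model = Model(inputs=[q1_in, q2_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```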
Siamese Architecture Using LSTM with Word2Vec 300d Representation (figure):
Siamese Architecture Using LSTM with GloVe 100d Representation (figure):
Comparison: Word2Vec vs. GloVe

Word representation | Model        | Validation log-loss (20 epochs) | Time taken
Word2Vec            | Siamese LSTM | 0.423                           | 80 mins per epoch
GloVe               | Siamese LSTM | 0.434                           | 25 mins per epoch
Observations:
1. Comparable log-loss.
2. GloVe is considerably faster than Word2Vec.
● We used 100-dimensional GloVe vectors for the rest of the analysis (fewer dimensions, lower complexity, faster training).
Bidirectional LSTM
(Diagrams: the bidirectional LSTM runs a forward and a backward LSTM over the embeddings of Word 1 to Word 3, concatenates the two hidden states at each step, and aggregates them; the plain LSTM runs a single pass over the embeddings and aggregates its hidden states.)
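In Keras, the bidirectional variant is obtained by wrapping the shared LSTM in a `Bidirectional` layer; a two-line sketch (sizes are the same assumptions as before):

```python
from keras.layers import LSTM, Bidirectional

# Forward and backward passes are concatenated, so the output size doubles to 200.
bi_lstm = Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2),
                        merge_mode="concat")
```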
Comparison between Bi-LSTM and LSTM:

Model              | Dropout                            | Validation loss | Time taken
Bidirectional LSTM | recurrent_dropout=0.2, dropout=0.2 | 0.44385         | 90 mins
LSTM               | recurrent_dropout=0.2, dropout=0.2 | 0.43454         | 25 mins
Observations:
1. Bi-LSTM performs better than LSTM for most tasks.
2. But it overfits: the Bi-LSTM stopped after 7 epochs, and its validation loss started increasing after 5 epochs.
3. Training was controlled with the Keras ModelCheckpoint and EarlyStopping callbacks (a minimal sketch is shown below).
4. The LSTM model also suffered from overfitting (it stopped after 12 epochs).
Solution: tune recurrent_dropout and dropout for the Bi-LSTM (not implemented because of time constraints).
NOTE: We settled on LSTM + GloVe vectors for further analysis, due to comparable performance and faster training.
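A minimal sketch of those callbacks; the checkpoint file name, patience value, and the commented fit arguments are assumptions.

```python
from keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    EarlyStopping(monitor="val_loss", patience=2),              # stop once val loss stops improving
    ModelCheckpoint("siamese_lstm_best.h5", monitor="val_loss",
                    save_best_only=True),                        # keep only the best weights
]

# model.fit([q1_train, q2_train], y_train,
#           validation_split=0.1, epochs=20, callbacks=callbacks)
```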
LSTM with Varied Dropout:

Model | Dropout                            | Validation loss | Training loss
LSTM  | recurrent_dropout=0.2, dropout=0.2 | 0.4345          | 0.4301
LSTM  | recurrent_dropout=0.5, dropout=0.5 | 0.4532          | 0.4476
LSTM  | recurrent_dropout=0.0, dropout=0.0 | 0.4632          | 0.4375
Observations:
1. LSTM with dropout of 0.5/0.5 shows reduced performance on both the training and validation sets.
2. LSTM without dropout was also implemented; as expected, it suffers from overfitting.
3. We could also have tried randomized dropout (not implemented due to time constraints).
Further Steps to Improve Log-loss:
1) Preprocessing text
   a) Replace numbers with 'n'
   b) Handle special words and contractions (e.g. I'd)
   c) Stop-word removal
2) Handling unbalanced classes
3) Question symmetry
Preprocessing Text:
● Preprocessing the input text is an important step for any NLP task. Why?
● Different preprocessing steps analyzed:

Preprocessing step | Original text       | Modified text
Number replacement | I have 100 apples.  | I have n apples.
Special words      | I'd run a marathon. | I would run a marathon.
Stop word removal  | I have 100 apples.  | I 100 apples
Importance of Text Preprocessing:
Number replacement:
● Word2Vec and GloVe vectors do not recognize raw numbers.
● Numbers therefore have to be replaced by 'n'.
Special words:
● Contractions like I'd and I'm cause problems when tokenized:
○ I'd → I + d
○ I'm → I + m
● The meaning is lost.
● We need to handle these words separately:
○ I'd → I would
○ I'm → I am
Stop word removal:
● Stop words increase the complexity of the system, and they may or may not provide additional information.
A minimal preprocessing sketch is shown below.
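The sketch implements the three steps just described; the contraction map and stop-word list are illustrative, not the exact ones used in the project.

```python
import re

CONTRACTIONS = {"i'd": "i would", "i'm": "i am"}       # extend as needed
STOP_WORDS = {"i", "a", "an", "the", "have", "of"}     # illustrative subset

def preprocess(text, remove_stop_words=False):
    text = text.lower()
    for short, full in CONTRACTIONS.items():           # special words / contractions
        text = text.replace(short, full)
    text = re.sub(r"\d+", "n", text)                   # number replacement
    tokens = re.findall(r"[a-z']+", text)
    if remove_stop_words:                              # optional stop-word removal
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(tokens)

print(preprocess("I have 100 apples."))                          # i have n apples
print(preprocess("I'd run a marathon."))                         # i would run a marathon
print(preprocess("I have 100 apples.", remove_stop_words=True))  # n apples
```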
Handling Imbalanced Classes:
● The training data and test data may have different distributions of positive and negative examples.
● If the training data has considerably more positive examples, the model is biased towards predicting positive, and vice versa.
● In our data:
○ Train set: 36.92% positive examples
○ Test set: 17.46% positive examples
● We need to map the share of positive examples in the train set to that of the test set.
● One positive example in the train set counts for 0.1746 / 0.3692 ≈ 0.472 positive examples in the test set.
● Similarly, the weight of a negative example in the train set is (1 - 0.1746) / (1 - 0.3692) ≈ 1.309.
● Weighted log-loss = -( 0.472 * t * log(y) + 1.309 * (1 - t) * log(1 - y) ), where t is the target value and y the predicted probability.
● Class weights can be passed as a parameter to Keras (see the sketch after the note below).
Note:
● This is a workaround to achieve better performance on the test set, whose class distribution is heavily skewed relative to the train set.
● Ideally, the train, validation, and test sets should be split while maintaining class balance.
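The class weighting itself is one line in Keras; a sketch using the ratios derived above (the commented fit call and its argument names are assumptions).

```python
class_weight = {
    1: 0.1746 / 0.3692,              # positive (duplicate) pairs  ~ 0.472
    0: (1 - 0.1746) / (1 - 0.3692),  # negative pairs              ~ 1.309
}

# model.fit([q1_train, q2_train], y_train,
#           validation_split=0.1, epochs=20, class_weight=class_weight)
```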
Question Pair Symmetry:
● We interchanged Q1 and Q2 to check whether question order affects the model.
● A model can sometimes learn features that depend on the order of the questions.
● We swapped Q1 and Q2 for half of the pairs and retrained the model (a minimal sketch follows).
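A minimal sketch of the swap; the array names are assumptions, and the labels are unchanged because duplication is a symmetric relation.

```python
import numpy as np

def swap_half(q1, q2, seed=0):
    """Swap question 1 and question 2 for a random half of the pairs."""
    q1, q2 = np.asarray(q1).copy(), np.asarray(q2).copy()
    mask = np.random.RandomState(seed).rand(len(q1)) < 0.5   # roughly half of the pairs
    q1[mask], q2[mask] = q2[mask], q1[mask]   # fancy indexing returns copies, so this swaps
    return q1, q2
```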
Analysis: Different Preprocessing Steps and Their Effect on Log-loss
Model parameters (an input-preparation sketch follows this list):
● Length of each question: 30 tokens
● Recurrent dropout: 0.2
● Dropout: 0.2
● LSTM layer size: 100
● Train : validation split: 9:1
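A sketch of how these parameters translate into input preparation with the Keras tokenizer; the toy question lists stand in for the preprocessed Quora questions and labels.

```python
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

MAX_LEN = 30   # length of each question

# Toy data standing in for the preprocessed questions and duplicate labels.
q1_texts = ["how do i learn python", "what is the capital of france"]
q2_texts = ["best way to learn python", "how do i learn to cook"]
labels = [1, 0]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(list(q1_texts) + list(q2_texts))
q1_seq = pad_sequences(tokenizer.texts_to_sequences(q1_texts), maxlen=MAX_LEN)
q2_seq = pad_sequences(tokenizer.texts_to_sequences(q2_texts), maxlen=MAX_LEN)

# 9:1 train/validation split.
(q1_tr, q1_va, q2_tr, q2_va,
 y_tr, y_va) = train_test_split(q1_seq, q2_seq, labels, test_size=0.1, random_state=0)
```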
Model Comparison:

Steps                                                                                        | Train loss | Validation loss | Public test loss | Private test loss
Symmetry: No; simple text preprocessing: Yes; class weighting: No                            | 0.4034     | 0.4056          | 0.4041           | 0.4065
Symmetry: No; simple text preprocessing: Yes; class weighting: Yes                           | 0.3045     | 0.2931          | 0.3104           | 0.3144
Symmetry: Yes; simple text preprocessing: Yes; class weighting: Yes                          | 0.3033     | 0.2925          | 0.3079           | 0.3109
Symmetry: Yes; simple text preprocessing: Yes; class weighting: Yes; stop word removal: Yes  | 0.3512     | 0.3567          | 0.3626           | 0.3617
Observations:
● Some models underfit: training was restricted to 20 epochs due to time constraints (the log-loss was still decreasing).
● Stop-word removal decreased performance: the stop words carried contextual information that was important.
Note: "Simple text preprocessing" = number replacement and special-word handling.
Further Steps to Explore:
● Parameter tuning, especially of dropout, to attain a better log-loss.
● Analysis with a larger number of epochs.
● Ensembles of models:
○ Bagging - avoids overfitting
○ Boosting - helps improve weak (underfitting) models
● We observed many nouns in the data, so part-of-speech features:
○ Replacing proper nouns with a generic noun token
● More text processing: lemmatization, stemming
● Character-level LSTMs
● Word-attention models
References:
1. Contextual Bidirectional Long Short-Term Memory Recurrent Neural Network Language Models: A
Generative Approach to Sentiment Analysis - http://www.aclweb.org/anthology/E17-1096
2. A Hierarchical Neural Autoencoder for Paragraphs and Documents - https://arxiv.org/pdf/1506.01057v2.pdf
3. Siamese Recurrent Architectures for Learning Sentence Similarity -
http://www.mit.edu/~jonasm/info/MuellerThyagarajan_AAAI16.pdf
4. Efficient Estimation of Word Representations in Vector Space - https://arxiv.org/pdf/1301.3781.pdf
5. GloVe: Global Vectors for Word Representation - https://nlp.stanford.edu/pubs/glove.pdf
6. HDLTex: Hierarchical Deep Learning for Text Classification - https://arxiv.org/pdf/1709.08267.pdf
7. Signature Verification using a "Siamese" Time Delay Neural Network -
https://papers.nips.cc/paper/769-signature-verification-using-a-siamese-time-delay-neural-network.pdf
8. Dropout: A Simple Way to Prevent Neural Networks from Overfitting -
http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf
9. Understanding CNNs for NLP -
http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
10. Comparison study of CNNs and RNNs for NLP - https://arxiv.org/pdf/1702.01923.pdf
11. http://colah.github.io/posts/2015-08-Understanding-LSTMs/
12. Term Frequency Inverse Document Frequency - http://www.tfidf.com/
Thank you
Any Questions?
