Is That A Duplicate Quora Question? Abhishek Thakur @abhi1thakur (2017-07-01)
About Me ➢ I’m a data scientist ➢ I like: ○ scikit-learn ○ keras ○ xgboost ○ python ➢ I don’t like: ○ errrR ○ excel I like...
The Problem ➢ ~ 13 million questions (as of March, 2017) ➢ Many duplicate questions ➢ Cluster and join duplicates together...
Duplicate Questions ➢ How does Quora quickly mark questions as needing improvement? ➢ Why does Quora mark my questions as ...
Non-Duplicate Questions ➢ Who should I address my cover letter to if I'm applying for a big company like Mozilla? ➢ Which ...
The Data ➢ 400,000+ pairs of questions ➢ Initially data was very skewed ➢ Negative samples from related questions ➢ Not re...
The Data ➢ 255045 negative samples (non-duplicates) ➢ 149306 positive samples (duplicates) ➢ ~37% positive samples
The Data ➢ Average number of characters in question1: 59.57 ➢ Minimum number of characters in question1: 1 ➢ Maximum number o...
Basic Feature Engineering ➢ Length of question1 ➢ Length of question2 ➢ Difference in the two lengths ➢ Character length o...
Basic Feature Engineering ➢ Basic feature set: fs-1 data['len_q1'] = data.question1.apply(lambda x: len(str(x))) data['len...
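The length-based features listed above can be sketched for a single question pair without pandas. This is a minimal illustration, not the author's exact feature set: apart from `len_q1`, the key names and the whitespace tokenization are assumptions.

```python
def basic_features(q1, q2):
    """Return simple length and overlap features for one question pair."""
    q1, q2 = str(q1), str(q2)
    words1 = q1.lower().split()  # assumption: plain whitespace tokenization
    words2 = q2.lower().split()
    return {
        "len_q1": len(q1),
        "len_q2": len(q2),
        "diff_len": len(q1) - len(q2),
        "len_word_q1": len(words1),
        "len_word_q2": len(words2),
        # Hypothetical name: number of distinct words shared by both questions.
        "common_words": len(set(words1) & set(words2)),
    }


features = basic_features("How do I learn Python?",
                          "How can I learn Python quickly?")
```

Because punctuation is not stripped here, "python?" and "python" count as different words; a real pipeline would probably normalize tokens first.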
Fuzzy Features ➢ Also known as approximate string matching ➢ Number of “primitive” operations required to convert string t...
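The count of "primitive" operations is commonly measured as Levenshtein (edit) distance. A minimal dynamic-programming sketch, assuming the three primitive operations are single-character insertion, deletion, and substitution:

```python
def levenshtein(a, b):
    """Minimum number of single-character edits that turn string a into string b."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


distance = levenshtein("kitten", "sitting")
```

Only two rows of the table are kept at a time, so memory grows with the length of the second string rather than with the product of both lengths.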
Fuzzy Features ➢ pip install fuzzywuzzy ➢ Uses Levenshtein distance ➢ QRatio ➢ WRatio ➢ Token set ratio ➢ Token sort ratio...
Fuzzy Features ➢ Fuzzy feature set: fs-2 data['fuzz_qratio'] = data.apply(lambda x: fuzz.QRatio(str(x['question1']), str(x...
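To see why token-based ratios help with reordered questions, here is a rough standard-library approximation of a token sort ratio. It uses `difflib.SequenceMatcher` and is not fuzzywuzzy's implementation; the helper names and the normalization rule are assumptions.

```python
import re
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity of two strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort_ratio(a, b):
    """Compare the strings after lowercasing and sorting their tokens."""
    def normalize(s):
        tokens = re.findall(r"[a-z0-9]+", s.lower())
        return " ".join(sorted(tokens))
    return simple_ratio(normalize(a), normalize(b))


plain = simple_ratio("quora duplicate question", "question duplicate quora")
sorted_score = token_sort_ratio("quora duplicate question", "question duplicate quora")
```

The plain ratio penalizes the changed word order, while the token-sorted version treats both strings as identical.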
TF-IDF ➢ TF(t) = Number of times a term t appears in a document / Total number of terms in the document ➢ IDF(t) = log(Tot...
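A worked example of the two ratios on a toy corpus. The IDF here is assumed to be the textbook form, log(total documents / documents containing the term); library implementations often use smoothed variants, so their numbers will differ.

```python
import math


def tf(term, doc_tokens):
    """Share of the document's tokens that are equal to term."""
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term, corpus):
    """Log of (number of documents / number of documents containing term)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)  # fails if no document has the term


corpus = [
    "how do i learn python".split(),
    "how do i learn java".split(),
    "what is the capital of france".split(),
]

tfidf_python = tf("python", corpus[0]) * idf("python", corpus)
tfidf_how = tf("how", corpus[0]) * idf("how", corpus)
```

"python" occurs in one document out of three and "how" in two, so "python" gets the larger weight in the first question.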
SVD ➢ Latent semantic analysis ➢ scikit-learn version of SVD ➢ 120 components svd = decomposition.TruncatedSVD(n_component...
A Combination of TF-IDF & SVD ➢ TF-IDF features: fs3-1
A Combination of TF-IDF & SVD ➢ TF-IDF features: fs3-2
A Combination of TF-IDF & SVD ➢ TF-IDF + SVD features: fs3-3
A Combination of TF-IDF & SVD ➢ TF-IDF + SVD features: fs3-4
A Combination of TF-IDF & SVD ➢ TF-IDF + SVD features: fs3-5
Word2Vec Features ➢ Multi-dimensional vector for all the words in any dictionary ➢ Always great insights ➢ Very popular in...
Word2Vec Features ➢ Representing words ➢ Representing sentences def sent2vec(s): words = str(s).lower().decode('utf-8') wo...
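A Python 3 sketch of the sentence-vector idea, using a toy embedding table in place of a trained Word2Vec model. The vectors, the stop-word list, and the choice to sum word vectors and scale the result to unit length are all assumptions made for illustration.

```python
import math

# Toy 3-dimensional embeddings standing in for a real model (hypothetical values).
toy_vectors = {
    "learn": [1.0, 0.0, 0.0],
    "python": [0.0, 2.0, 0.0],
    "quickly": [0.0, 0.0, 2.0],
}
stop_words = {"how", "do", "i", "can"}  # assumption: a tiny stop-word list


def sent2vec(sentence, vectors=toy_vectors):
    """Sum the vectors of the known words and scale the sum to unit length."""
    words = [w for w in str(sentence).lower().split()
             if w.isalpha() and w not in stop_words]
    found = [vectors[w] for w in words if w in vectors]
    if not found:
        return None  # no known words; the caller decides how to handle this
    summed = [sum(component) for component in zip(*found)]
    norm = math.sqrt(sum(x * x for x in summed))
    if norm == 0:
        return None  # the word vectors cancelled out
    return [x / norm for x in summed]


vec = sent2vec("How do I learn Python")
```

Words missing from the table are skipped silently, so a sentence made up only of unknown words has no vector and needs explicit handling downstream.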
W2V Features: Cosine Distance
W2V Features: Manhattan Distance ➢ Also known as cityblock distance
W2V Features: Canberra Distance
W2V Features: Minkowski Distance
W2V Features: Braycurtis Distance
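The distances named on the preceding slides can be written out in plain Python to make their definitions concrete. These are sketches of the standard formulas, assumed to match what `scipy.spatial.distance` computes; the order `p` for Minkowski is left as an explicit argument because the value used in the talk is not shown here.

```python
import math


def cosine(u, v):
    """1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def cityblock(u, v):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def canberra(u, v):
    """Per-coordinate |a - b| / (|a| + |b|); coordinates that are both zero are skipped."""
    return sum(abs(a - b) / (abs(a) + abs(b))
               for a, b in zip(u, v) if a != 0 or b != 0)


def minkowski(u, v, p):
    """Generalizes Manhattan (p=1) and Euclidean (p=2) distance."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def braycurtis(u, v):
    """Sum of |a - b| divided by sum of |a + b|."""
    return (sum(abs(a - b) for a, b in zip(u, v))
            / sum(abs(a + b) for a, b in zip(u, v)))


u, v = [1.0, 0.0, 2.0], [2.0, 1.0, 0.0]
distances = {
    "cosine": cosine(u, v),
    "cityblock": cityblock(u, v),
    "canberra": canberra(u, v),
    "minkowski": minkowski(u, v, 3),
    "braycurtis": braycurtis(u, v),
}
```

Each distance between the two sentence vectors of a question pair becomes one numeric feature.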
W2V Features: WMD Kusner, M., Sun, Y., Kolkin, N. & Weinberger, K.. (2015). From Word Embeddings To Document Distances.
W2V Features: Skew ➢ Skew = 0 for normal distribution ➢ Skew > 0: more weight in right tail
W2V Features: Kurtosis ➢ 4th central moment over the square of variance ➢ Types: ○ Pearson ○ Fisher: subtract 3.0 from res...
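Both statistics follow directly from central moments. The sketch below uses population (biased) moments, which is assumed to correspond to the default behaviour of `scipy.stats.skew` and `scipy.stats.kurtosis`; the function names are illustrative.

```python
def central_moment(values, k):
    """k-th central moment, dividing by the number of values."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)


def skewness(values):
    """Third central moment over the variance raised to the power 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values, fisher=True):
    """Fourth central moment over the squared variance; Fisher subtracts 3.0."""
    pearson = central_moment(values, 4) / central_moment(values, 2) ** 2
    return pearson - 3.0 if fisher else pearson


symmetric = [1, 2, 3, 4, 5]
lopsided = [1, 1, 1, 10]  # one large value on the right gives positive skew
```

Applied to a sentence vector, each function collapses the 300 components into a single number that can be used as a feature.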
W2V Features ➢ Word2Vec feature set: fs-4 scipy.spatial.distance scipy.stats minkowski jaccard manhattan braycurtis euclide...
Raw Word2Vec Vectors https://www.kaggle.com/jeffd23/visualizing-word-vectors-with-t-sne ➢ Raw W2V feature set: fs-5
Features Snapshot
Feature Snapshot
Machine Learning Models ➢ Logistic regression ➢ XGBoost ➢ 5-fold cross-validation ➢ Accuracy as a comparison metric (also,...
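The evaluation protocol can be sketched without any ML library: split the sample indices into five folds, hold out one fold at a time, and score predictions by accuracy. The function names are illustrative, and the folds here are contiguous and unshuffled, which is a simplifying assumption; in practice the data would usually be shuffled or stratified first.

```python
def kfold_indices(n_samples, n_folds=5):
    """Yield (train_indices, test_indices) for contiguous, nearly equal folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # the first folds absorb the remainder
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


folds = list(kfold_indices(12, n_folds=5))
score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```

Every sample lands in exactly one test fold, so averaging the five fold accuracies uses each question pair once for evaluation.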
Results
Deep Learning
LSTM ➢ Long short term memory ➢ A type of RNN ➢ Learn long term dependencies ➢ Used two LSTM layers
1D CNN ➢ One dimensional convolutional layer ➢ Temporal convolution ➢ Simple to implement: for i in range(sample_length): ...
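A self-contained sketch of the loop behind a one-dimensional convolution with a single filter, no padding, and stride 1. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this is a cross-correlation. The function name is illustrative.

```python
def conv1d_valid(signal, kernel):
    """Slide the kernel over the signal and take a weighted sum at each position."""
    out_len = len(signal) - len(kernel) + 1
    output = []
    for i in range(out_len):
        total = 0.0
        for k in range(len(kernel)):
            total += kernel[k] * signal[i + k]
        output.append(total)
    return output


result = conv1d_valid([1, 2, 3, 4, 5], [1, 0, -1])
```

For text, each position holds a word vector rather than a scalar, and the weighted sum runs over the embedding dimension as well as the window.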
Embedding Layers ➢ Simple layer ➢ Converts indexes to vectors ➢ [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]
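An embedding layer is essentially a lookup table from integer indexes to vectors. A minimal sketch using the two indexes from the slide; the table is a hypothetical stand-in for learned weights.

```python
# Hypothetical learned weights: one 2-dimensional vector per word index.
embedding_table = {
    4: [0.25, 0.1],
    20: [0.6, -0.2],
}


def embed(batch, table=embedding_table):
    """Replace every index in every sequence with its vector."""
    return [[table[index] for index in sequence] for sequence in batch]


vectors = embed([[4], [20]])
```

The result keeps the batch and sequence dimensions, so two sequences of length one become a nested list of shape 2 x 1 x 2.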
Time Distributed Dense Layer ➢ TimeDistributed wrapper around dense layer ➢ TimeDistributed applies the layer to every tem...
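What the wrapper does can be sketched in a few lines: the same dense weights are applied independently at every time step. The weights and function names below are illustrative, and no activation function is applied.

```python
def dense(vector, weights, bias):
    """Linear layer; weights holds one row of input weights per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def time_distributed_dense(sequence, weights, bias):
    """Apply the same dense layer to each time step of the sequence."""
    return [dense(step, weights, bias) for step in sequence]


sequence = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 time steps, 2 features each
outputs = time_distributed_dense(sequence, weights=[[1.0, 1.0]], bias=[0.5])
```

The parameter count does not depend on the sequence length, because every time step shares one set of weights.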
GloVe Embeddings ➢ Count based model ➢ Dimensionality reduction on co-occurrence counts matrix ➢ word-context matrix -> wo...
Basis of Deep Learning Model ➢ Keras-snli model: https://github.com/Smerity/keras_snli
Before Training DeepNets ➢ Tokenize data ➢ Convert text data to sequences tk = text.Tokenizer(nb_words=200000) max_len = 4...
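The two preprocessing steps can be illustrated without Keras. This sketch builds a frequency-ranked word index and pads or truncates each sequence at the front to a fixed length, which is assumed to match the default behaviour of Keras `pad_sequences`. The function names, the whitespace tokenization, and the alphabetical tie-break are assumptions; the length 40 matches the `input_length=40` used by the models later in the deck.

```python
def fit_word_index(texts):
    """Map each word to an integer; frequent words get the smallest indexes."""
    counts = {}
    for text in texts:
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1
    # Index 0 is reserved for padding. Ties are broken alphabetically (assumption).
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ordered)}


def to_padded_sequence(text, word_index, max_len=40):
    """Convert text to indexes, keep the last max_len, and left-pad with zeros."""
    seq = [word_index[w] for w in text.lower().split() if w in word_index]
    seq = seq[-max_len:]
    return [0] * (max_len - len(seq)) + seq


texts = ["how do i learn python", "how do i learn java"]
word_index = fit_word_index(texts)
padded = to_padded_sequence(texts[0], word_index, max_len=8)
```

Unlike the Keras tokenizer, this sketch does not strip punctuation or cap the vocabulary size.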
Before Training DeepNets ➢ Initialize GloVe embeddings embeddings_index = {} f = open('data/glove.840B.300d.txt') for line...
Before Training DeepNets ➢ Create the embedding matrix embedding_matrix = np.zeros((len(word_index) + 1, 300)) for word, i...
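The loading and matrix-building steps can be tried end to end on a tiny in-memory stand-in for the GloVe file. The three-dimensional vectors and the vocabulary are made up for illustration, and plain lists replace the NumPy array.

```python
import io

# Stand-in for the GloVe text file: one word followed by its vector per line.
glove_text = io.StringIO(
    "how 0.1 0.2 0.3\n"
    "python 0.4 0.5 0.6\n"
)

embeddings_index = {}
for line in glove_text:
    values = line.split()
    embeddings_index[values[0]] = [float(x) for x in values[1:]]

word_index = {"how": 1, "python": 2, "quora": 3}  # hypothetical tokenizer output
dim = 3

# Row 0 is the padding row; words without a pretrained vector stay all-zero.
embedding_matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector
```

Row `i` of the matrix lines up with word index `i`, which is what lets the matrix be passed as the initial weights of an embedding layer.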
Final Deep Learning Model
Model 1 and Model 2 model1 = Sequential() model1.add(Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], input...
Final Deep Learning Model
Model 3 and Model 4
Model 3 and Model 4 model3 = Sequential() model3.add(Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], input...
Final Deep Learning Model
Model 5 and Model 6 model5 = Sequential() model5.add(Embedding(len(word_index) + 1, 300, input_length=40, dropout=0.2)) mo...
Final Deep Learning Model
Merged Model
Time to Train the DeepNet ➢ Total params: 174,913,917 ➢ Trainable params: 60,172,917 ➢ Non-trainable params: 114,741,000 ➢...
Combined Results The deep network was trained on an NVIDIA TitanX and took approximately 300 seconds for each epoch and to...
Improving Further ➢ Cleaning the text data, e.g. correcting misspellings ➢ POS tagging ➢ Entity recognition ➢ Combining de...
Timeline 24 Jan, 2017 27 Feb, 2017 16 Mar, 2017 7 Jun, 2017 Quora Dataset Release My Model + Writeup Release Kaggle Compet...
Conclusion & References ➢ The deepnet gives near state-of-the-art results ➢ BiMPM model accuracy: 88% Some references: ➢ Zhi...
Thank you! Questions / Comments? Code: bit.ly/quoraduplicates Get in touch: ➢ E-mail: abhishek4@gmail.com ➢ LinkedIn: bit....
