
Grammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy Gryshchuk

Speaker: Andriy Gryshchuk, Senior Research Engineer at Grammarly.
Summary: Paraphrase detection is a challenging NLP task, since it requires both thorough syntactic and semantic analysis to identify whether two phrases have the same intent. A few months ago, paraphrase identification became the objective of one of the most popular Kaggle competitions, Quora Question Pairs. In this talk, Yuriy Guts and Andriy Gryshchuk, silver medalists of the competition, will share their arsenal of statistical, linguistic, and deep learning approaches that helped them succeed in this challenge.

  1. Quora Question Pairs Competition, by Andriy Gryshchuk.
  2. Popularity: 3,300 teams (>4,000 participants). Why: NLP, feature engineering, and deep learning; an interesting and big-enough dataset; different from other recent competitions.
  3. Goal: find duplicate questions. Classification formulation: for each pair of questions, predict the probability that the two questions have the same meaning.
  4. Data. Train set: 400,000 pairs of questions as (question1, question2, is_duplicate), very large compared with previously available paraphrase-detection sets. Test set: 2,345,796 pairs (some artificially generated as an anti-cheating measure). Manually labeled (noisy).
  5. Examples, positive: 'Why have human beings evolved more than any other beings on Earth?' / 'What technicality results in humans being more intelligent than other animals?'; 'How Do You Protect Yourself from Dogs?' / 'What is the best way to save yourself from an attacking angry dog?'; 'Why are Quorians more liberal than conservative?' / 'Why does Quora tend to attract more leftists than conservatives?'
  6. Examples, negative: 'How to convert fractions to whole numbers?' / 'How do you convert whole numbers into fractions?'; 'What tips do you have for starting a school newspaper?' / 'What are some tips on starting a school newspaper?'; 'What Do I Do About My Boyfriend Ignoring Me?' / 'What should I do when my boyfriend is ignoring me?'; 'How dangerous is Mexico City?' / 'Why is Mexico City dangerous?'; 'What are some words that exist in English but do not exist in Japanese?' / 'What are some words that exist in Japanese but do not exist in English?'
  7. Negatives are not random: there are positive pairs with no common words, and negative pairs where all words are shared; a lot of ambiguous cases; noise.
  8. Metric: log loss, which is questionable; ROC would have been a much better choice. Very different distributions of the train and test sets: 36% positives in the train set vs. 17% positives in the test set (public part). Handled by upsampling (or by a correction formula).
  9. Lesson: when the distributions differ, choose a metric less sensitive to distribution changes. (A sketch of the prior-correction formula follows.)
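The slides mention upsampling "or a formula" without spelling it out. A minimal sketch of the standard prior-correction formula, assuming this is the kind of adjustment meant (the function name is mine, not from the talk):

import numpy as np

def correct_prior(p, train_pos=0.36, test_pos=0.17):
    # Rescale probabilities produced under one positive rate so they
    # are calibrated for a different positive rate (prior shift).
    a = test_pos / train_pos                # scaling for the positive class
    b = (1 - test_pos) / (1 - train_pos)    # scaling for the negative class
    return a * p / (a * p + b * (1 - p))

p = np.array([0.10, 0.36, 0.90])
print(correct_prior(p))  # a prediction at the train prior maps to the test prior
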
  10. Approaches: classical ML vs. deep learning.
  11. Classical ML: 90% of the effort goes into creating features, 10% into modelling. Deep learning: 5% of the effort into creating features, 95% into modelling.
  12. The Kaggle way: ensemble them all.
  13. Classical ML (90% features, 10% modelling): my team had about 300 features; one of the top teams claimed 4,000 features.
  14. Sentence as vector: the sentence vector is just the mean of the word vectors, or a weighted mean (how to find the right weights? unsupervised methods). Similarities: cosine similarity, cityblock distance, Euclidean distance. (A sketch of these features follows.)
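A minimal sketch of these distance features, assuming `word_vectors` is a dict mapping tokens to pretrained vectors (e.g., GloVe); the helper names are illustrative, not from the talk:

import numpy as np
from scipy.spatial.distance import cityblock, cosine, euclidean

def sentence_vector(tokens, word_vectors, dim=300):
    # Sentence vector as the plain (unweighted) mean of its word vectors.
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def similarity_features(q1_tokens, q2_tokens, word_vectors):
    v1 = sentence_vector(q1_tokens, word_vectors)
    v2 = sentence_vector(q2_tokens, word_vectors)
    return {
        'cosine': cosine(v1, v2),        # 1 - cosine similarity
        'cityblock': cityblock(v1, v2),  # L1 distance
        'euclidean': euclidean(v1, v2),  # L2 distance
    }
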
  15. Raw embeddings are surprisingly powerful features: turn the sentence into a vector and use the vector components directly as features.
  16. Which word vectors? GloVe or word2vec? 50d, 100d, 200d, or 300d? All of them? Ensembles improve when models are run on different embeddings.
  17. Deep learning: modelling is 95% of the effort, features 5%. Pretrained word embeddings are the features. Pad and cut sentences to the same length, then start modelling. (A preprocessing sketch follows.)
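A minimal preprocessing sketch with Keras; the example texts are taken from slide 6, and the cutoff length is an assumption, not a value from the talk:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

MAX_LEN = 30  # assumed cutoff; the talk does not state the value used

q1_texts = ["How dangerous is Mexico City?"]
q2_texts = ["Why is Mexico City dangerous?"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(q1_texts + q2_texts)
# Pad short questions with zeros and cut long ones to a common length.
q1_seq = pad_sequences(tokenizer.texts_to_sequences(q1_texts), maxlen=MAX_LEN)
q2_seq = pad_sequences(tokenizer.texts_to_sequences(q2_texts), maxlen=MAX_LEN)
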
  18. Ideas for NNs: sentence embeddings computed just as the mean of the word vectors are powerful.
  19. A weighted mean? A non-linearity?
  20. A weighted mean plus a non-linearity is a neural network.
  21. But it is still just a bag of words.
  22. N-grams?
  23. A network over n-grams is a convolutional neural network. (Sketch below.)
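The step from bag-of-words to n-grams can be made concrete in Keras: a Conv1D filter of width n looks at n consecutive word vectors, i.e., it is an n-gram detector. A minimal sketch; the sizes and `vocab_size` are illustrative assumptions:

from keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D

vocab_size = 100000                          # assumed vocabulary size

words = Input(shape=(30,))                   # padded word indices
emb = Embedding(vocab_size, 300)(words)      # sequence of word vectors
# Each of the 128 filters spans 3 consecutive words: a trigram detector.
trigrams = Conv1D(128, 3, activation='relu')(emb)
sentence = GlobalMaxPooling1D()(trigrams)    # strongest response per filter
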
  24. Symmetry asks for weight sharing. [Diagram: Question 1 and Question 2 pass through one neural network with all weights shared, producing the output.]
  25. [Diagram: Question 1 and Question 2 go through a common embedding layer, then Conv Block 1 … Conv Block N, then a fully connected layer.] (A Keras sketch of this wiring follows.)
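A minimal sketch of the shared-weight (Siamese) wiring in the Keras functional API; the layer sizes, `vocab_size`, and the classifier head are illustrative assumptions, not the competition model:

from keras.layers import (Input, Embedding, Conv1D, GlobalMaxPooling1D,
                          concatenate, Dense)
from keras.models import Model

vocab_size = 100000  # assumed vocabulary size

q1 = Input(shape=(30,))
q2 = Input(shape=(30,))

# Instantiate each layer once and apply it to both questions,
# so all weights are shared between the two branches.
embed = Embedding(vocab_size, 300)
conv = Conv1D(128, 3, activation='relu')
pool = GlobalMaxPooling1D()

def encode(q):
    return pool(conv(embed(q)))

merged = concatenate([encode(q1), encode(q2)])
output = Dense(1, activation='sigmoid')(Dense(128, activation='relu')(merged))
model = Model(inputs=[q1, q2], outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy')
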
  26. Paraphrase detection state of the art: Microsoft Research Paraphrase Corpus (~5,000 sentence pairs). [Results table omitted.] Methods: unsupervised (phrase vector as a weighted average), autoencoder (a better phrase vector), supervised (CNN + structured features).
  27. Previous work: Socher, R., Huang, E. H., Pennington, J., Ng, A. Y., and Manning, C. D. (2011). Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. Milajevs, D., Kartsaklis, D., Sadrzadeh, M., and Purver, M. Evaluating Neural Word Representations in Tensor-Based Compositional Settings. He, H., Gimpel, K., and Lin, J. (2015). Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks.
  28. [Figure from He, Gimpel, and Lin (2015), Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks.]
  29. Architecture recap. [Diagram repeated from slide 24: the two questions pass through one shared-weight neural network.]
  30. [Diagram repeated from slide 25: common embedding layer, Conv Block 1 … Conv Block N, fully connected layer.]
  31. The convolutional block is the main component. [Diagram: Input 1 and Input 2 each pass through a number of convolutional transformations and global pooling; the two results are combined into one number.]
  32. Parameters of the convolutional block: filter size; number of filters; global pooling; depth; kernel and activity regularizers; combine transformation (cosine, Euclidean, cityblock).
  33. Shallow convolutional block:

from keras.layers import Dense, GlobalAveragePooling1D, GlobalMaxPooling1D

def conv_lst4(layer_class, size, out_dim=300, activation='relu',
              kernel_regularizer=None, activity_regularizer=None):
    # One shared convolution feeding two stacks, max-pooled and
    # average-pooled, each followed by a linear projection.
    res = [layer_class(out_dim, size, activation=activation,
                       kernel_regularizer=kernel_regularizer,
                       activity_regularizer=activity_regularizer)]
    res_max = res.copy()
    res_max.append(GlobalMaxPooling1D())
    res_avg = res.copy()
    res_avg.append(GlobalAveragePooling1D())
    for r in [res_max, res_avg]:
        r.append(Dense(out_dim, activation='linear'))
    return [res_max, res_avg]
  34. Applying the blocks to both questions (excerpt):

…
deep_lst = [conv_deep_lst(Conv1D, size, emb_mx.shape[1],
                          kernel_regularizer=kernel_regularizer,
                          activity_regularizer=activity_regularizer)
            for size in [3, 4]]
# The same layer lists are applied to both questions: shared weights.
a_deep = [apply_layers(f, a) for f in deep_lst]
b_deep = [apply_layers(f, b) for f in deep_lst]
# A normalized dot product is the cosine similarity of the two branches.
dot_deep = [keras.layers.dot([a, b], normalize=True, axes=-1)
            for a, b in zip(a_deep, b_deep)]
…
  35. Embeddings: use pretrained ones, or train your own?
  36. It depends on how much data you have.
  37. Trainable embeddings: super powerful, but super easy to overfit. Regularize with an L2 penalty on the embedding weights, and average several runs.
  38. Two copies of the embeddings, with the same initial (pretrained) state: one trainable and one frozen. (Sketch below.)
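A sketch of one way to wire the two copies, reusing the `emb_mx` pretrained matrix from the slide 34 code and the L2 strength from slide 39; concatenating the two outputs is my assumption:

from keras.layers import Embedding, concatenate
from keras.regularizers import l2

# emb_mx: pretrained embedding matrix, shape (vocab_size, dim)
frozen = Embedding(emb_mx.shape[0], emb_mx.shape[1],
                   weights=[emb_mx], trainable=False)
tuned = Embedding(emb_mx.shape[0], emb_mx.shape[1],
                  weights=[emb_mx], trainable=True,
                  embeddings_regularizer=l2(1e-5))  # keep near the start

def embed(words):
    # Same pretrained initial state; one copy stays fixed, one adapts.
    return concatenate([frozen(words), tuned(words)])
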
  39. Example model configuration and learning-rate jitter:

{
    'name': 'nn_m8',
    'fit_fun': fit_nn,
    'fit_par': {
        'n_iter': 6,
        'build_fun': partial(build_m8,
                             train_emb=True,
                             max_pool=True,
                             embeddings_regularizer=keras.regularizers.l2(1e-5),
                             n_more=X_train_stored.shape[1]),
        'schedule': [(1e-3, 5), (1e-5, 2)],
        'jit_sch': partial(jit_schedule, vol=0.1),
    },
}

def jit_schedule(schedule, vol=0.1):
    # Jitter each learning rate by up to ±vol so that averaged runs differ.
    for lr, ep in schedule:
        lr = np.random.uniform(lr - vol * lr, lr + vol * lr)
        yield lr, ep
  40. RNNs vs. CNNs.
  41. Similar accuracy, but CNNs are two orders of magnitude faster, and fast CNNs make it feasible to average many runs.
  42. More features for the NN: the features created for the classical classifiers were added to the NN. The end-to-end promise is great, but if you already have features, use them.
  43. Final model. [Diagram: Question 1 and Question 2 pass through a shared-weight neural network; its output is combined with the "classical" features in a fully connected layer that produces the output.] (A sketch of the feature concatenation follows.)
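A sketch of the feature-concatenation step; `pair_repr` stands for the shared-weight network's output for the question pair and `n_features` for the number of precomputed "classical" features (both names are mine):

from keras.layers import Input, Dense, concatenate

n_features = 300                                # e.g., the ~300 features above

features = Input(shape=(n_features,))           # precomputed "classical" features
merged = concatenate([pair_repr, features])     # join NN output with features
hidden = Dense(128, activation='relu')(merged)  # the fully connected layer
output = Dense(1, activation='sigmoid')(hidden)
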
  44. Other NNs: RNNs (several orders of magnitude slower); character-level RNNs (very slow); RNNs with attention; NNs on the same features as the tree-based classifiers. A top team reported that NNs on word vectors plus classical features worked best. XGBoost and the like exploited the leak well.
  45. Analysis: shallow convolutions are just a bag of words or a bag of n-grams; there is no internal representation of "meaning" or "topic".
  46. How to improve? Deeper networks (would require dedicated embeddings); positional embeddings; transfer learning (apply a pretrained neural translation model and take the hidden state of the decoder as input).
  47. Ensemble: 5 folds on the first level, where the first level was itself an average of several runs; XGBoost on the second level; CV was unstable, so "upsample-bagging" on the second level; real bagging on the second level (800 rounds); a "third level": the team ensemble (just a weighted average). (One reading of "upsample-bagging" is sketched below.)
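"Upsample-bagging" is the speaker's own term; one plausible reading (an assumption on my part, not confirmed by the talk) is bagging in which each round resamples the training set toward the test-set positive rate before fitting XGBoost:

import numpy as np
import xgboost as xgb

def upsample_bag(X, y, X_test, rounds=20, test_pos=0.17, seed=0):
    rng = np.random.RandomState(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    # Oversample negatives so positives make up roughly test_pos of the data.
    n_neg = int(len(pos) * (1 - test_pos) / test_pos)
    preds = []
    for _ in range(rounds):
        idx = np.concatenate([pos, rng.choice(neg, n_neg, replace=True)])
        model = xgb.XGBClassifier()  # hyperparameters omitted in this sketch
        model.fit(X[idx], y[idx])
        preds.append(model.predict_proba(X_test)[:, 1])
    return np.mean(preds, axis=0)   # average over the bagging rounds
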
  48. Ensemble: first-level models and their log loss:
('/meta_84_glove_6b_50d/nn_m8/', 0.17182514423484868),
('/meta_84_glove_6b_300d/nn_m8/', 0.17308944420181949),
('/meta_84_glove_6b_100d/nn_m8/', 0.17327907625486416),
('/meta_84_glove_6b_100d/gbm_tuned_00025/', 0.17386390869911419),
('/meta_84_glove_6b_200d/nn_m8/', 0.17478704276847895),
('/meta_84_glove_6b_100d/gbm_tuned_001/', 0.17486090598394549),
('/meta_83_glove_6b_50d/gbm_tuned_00025/', 0.17514204487042342),
('/meta_84_glove_6b_50d/gbm_tuned_001/', 0.17626284406063045),
('/meta_84_glove_6b_100d/gbm_dart_01/', 0.17639511061431704),
('/meta_84_glove_6b_100d/xgb_02_d10/', 0.17660031146404326),
('/meta_83_glove_6b_50d/gbm_tuned_0025/', 0.17688759229979395),
('/meta_84_glove_6b_50d/gbm_dart_005/', 0.17713067893022988),
('/meta_84_glove_6b_50d/gbm_dart_01/', 0.17761469925842949),
('/meta_84_glove_6b_100d/xgb_05_d10/', 0.17832099461464535),
('/meta_82_glove_6b_50d/gbm_tuned_00025/', 0.17841488421938717),
('/meta_83_glove_6b_100d/nn_m61/', 0.18009823071205816),
('/meta_82_glove_6b_50d/gbm_tuned_0025/', 0.18026383839031426),
('/meta_84_glove_6b_50d/xgb_05_d10/', 0.18079926772563515),
('/meta_83_glove_6b_50d/xgb_05/', 0.18513503621897476),
('/meta_83_glove_6b_100d/nn_m51_cn3/', 0.18574331177990389),
('/meta_83_glove_6b_200d/nn_m62/', 0.18607323372840762),
('/meta_83_glove_6b_50d/nn_m6/', 0.18646785119161874),
('/meta_82_glove_6b_50d/xgb_05/', 0.1875326701626234),
  49. Final ensemble: 20 rounds of "upsample-bagging" of XGBoost over 44 first-level models. The team ensemble: 0.8 × Andriy's model + 0.2 × komaki's.
  50. Unfortunate event: a leak. Roughly 50% of Kaggle competitions have leaks, and 20% have "killer" leaks. What about real life? Be ready.
  51. The top team exploited the leak heavily, which makes it difficult to compare genuine results. The leak can poison genuine features as well: trainable embeddings might pick up information from it. The sampling process is a common source of Kaggle leaks, and I would suppose the same holds in real life. Be careful.
  52. Hyperparameter tuning: ensembles give more than extensive tuning. A simple average of two reasonable but different models is better than one overtuned model. K-fold ensembles of different models beat everything; use a k-fold ensemble even for a single model with one set of hyperparameters (sketch below). Overtuned models are fragile. If you love tuning, regularize.
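A minimal sketch of a k-fold ensemble for a single model with one set of hyperparameters; `build_model` is any function returning a fresh scikit-learn-style estimator (an illustrative name, not from the talk):

import numpy as np
from sklearn.model_selection import KFold

def kfold_ensemble(build_model, X, y, X_test, n_splits=5, seed=0):
    preds = []
    for tr_idx, _ in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        model = build_model()            # fresh model per fold
        model.fit(X[tr_idx], y[tr_idx])
        preds.append(model.predict_proba(X_test)[:, 1])
    return np.mean(preds, axis=0)        # average the per-fold predictions
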
  53. Questions?
