Approaching (almost) Any NLP Problem

This is from my recent talk about how to approach NLP problems in the real world.

  1. Approaching (almost) any NLP problem @abhi1thakur
  2. AI is like an imaginary friend most enterprises claim to have these days
  5. I like big data and I cannot lie
  6. Agenda
     ➢ Not so much intro
     ➢ Where is NLP used
     ➢ Pre-processing
     ➢ Machine learning models
     ➢ Solving a problem
     ➢ Traditional approaches
     ➢ Deep learning models
     ➢ Muppets
  7. Applications of natural language processing
     ➢ Translation
     ➢ Sentiment classification
     ➢ Chatbots / VAs
     ➢ Autocomplete
     ➢ Entity extraction
     ➢ Question answering
     ➢ Review rating prediction
     ➢ Search engine
     ➢ Speech to text
     ➢ Topic extraction
  8. Pre-processing the text data
     Example chat message: "can u he.lp me with loan? 😊"
     Problems: unintentional characters, abbreviations, symbols, emojis.
     After cleaning: "can you help me with loan ?"
  9. Pre-processing the text data
     ➢ Removing weird spaces
     ➢ Tokenization
     ➢ Spelling correction
     ➢ Contraction mapping
     ➢ Stemming
     ➢ Emoji handling
     ➢ Stopwords handling
     ➢ Cleaning HTML
  10. Pre-processing: removing weird spaces

      def remove_space(text):
          # collapse runs of whitespace into single spaces
          text = text.strip()
          text = text.split()
          return " ".join(text)
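      A quick usage check of the helper above (illustrative input, not from the slides):

      print(remove_space("  I   need a \t new car insurance "))
      # -> "I need a new car insurance"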
  11. Pre-processing: tokenization
      ➢ Very important step
      ➢ Is not always about spaces
      ➢ Converts words into tokens
      ➢ Might be different for different languages
      ➢ Simplest is to use `word_tokenize` from NLTK
      ➢ Write your own ;)
  12. Pre-processing: tokenization

      import nltk
      nltk.download('punkt')
      from nltk.tokenize import word_tokenize

      text = "hello, how are you?"
      tokens = word_tokenize(text)
      print(tokens)
      # ['hello', ',', 'how', 'are', 'you', '?']
  13. Pre-processing: spelling correction
      ➢ Very, very crucial step
      ➢ In chat: "can u tel me abot new sim card pland?"
      ➢ Most models without spelling correction will fail
      ➢ Peter Norvig's spelling correction
      ➢ Make your own ;)
  14. Pre-processing: spelling correction
      (Diagram: noisy variants of "I need a new car insurance", such as "I need aa new
      car insurance" and "I ned a new car insuraance", are fed through an embedding
      layer and a stacked bidirectional char-LSTM that outputs the corrected sentence.)
  15. Pre-processing: spelling correction

      def edits1(word):
          letters = 'abcdefghijklmnopqrstuvwxyz'
          splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
          deletes = [L + R[1:] for L, R in splits if R]
          transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
          replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
          inserts = [L + c + R for L, R in splits for c in letters]
          return set(deletes + transposes + replaces + inserts)

      def edits2(word):
          return (e2 for e1 in edits1(word) for e2 in edits1(e1))
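      These helpers come from Peter Norvig's classic spell corrector. For completeness, a
      minimal sketch of the remaining pieces (big.txt is Norvig's word-frequency corpus,
      an assumption here, not something from the slides):

      import re
      from collections import Counter

      WORDS = Counter(re.findall(r'\w+', open('big.txt').read().lower()))

      def known(words):
          return set(w for w in words if w in WORDS)

      def candidates(word):
          return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

      def correction(word):
          # the most frequent known candidate in the corpus wins
          return max(candidates(word), key=lambda w: WORDS[w])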
  16. Pre-processing: contraction mapping

      contraction = {
          "'cause": 'because', ',cause': 'because', ';cause': 'because',
          "ain't": 'am not', 'ain,t': 'am not', 'ain;t': 'am not',
          'ain´t': 'am not', 'ain’t': 'am not',
          "aren't": 'are not', 'aren,t': 'are not', 'aren;t': 'are not',
          'aren´t': 'are not', 'aren’t': 'are not'
      }
  17. Pre-processing: contraction mapping

      def mapping_replacer(x, dic):
          # replaces only space-delimited matches, so words at the very start
          # or end of the string need surrounding spaces to be caught
          for word in dic.keys():
              if " " + word + " " in x:
                  x = x.replace(" " + word + " ", " " + dic[word] + " ")
          return x
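      Example usage; since the helper only matches space-delimited words, the string is
      padded with spaces first:

      text = "ain't this what you meant?"
      print(mapping_replacer(" " + text + " ", contraction).strip())
      # -> "am not this what you meant?"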
  18. Pre-processing: stemming
      ➢ Reduces words to root form
      ➢ Why is stemming important?
      ➢ NLTK stemmers
  19. Pre-processing: stemming

      "fishing", "fished" and "fishes" all reduce to "fish":

      In [1]: from nltk.stem import SnowballStemmer
      In [2]: s = SnowballStemmer('english')
      In [3]: s.stem("fishing")
      Out[3]: 'fish'
  20. Pre-processing: emoji handling

      pip install emoji

      import emoji
      emojis = emoji.UNICODE_EMOJI
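      One simple way to handle emojis is to convert them to text aliases (a sketch; the
      emoji package's API has changed across versions, and UNICODE_EMOJI was removed in
      later releases):

      import emoji

      def handle_emojis(text):
          # "😊" becomes an alias such as ":blush:" (the exact alias depends on version)
          return emoji.demojize(text)

      print(handle_emojis("can you help me with loan? 😊"))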
  21. Pre-processing: stopwords handling
      Example: in "I need new car insurance", ranked by how much content they carry, the
      tokens come out roughly as "car insurance", "new", "need", "I"; stopword handling
      decides which of the low-content words to drop.
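      A minimal stopword-removal sketch with NLTK (illustrative, not from the slides):

      import nltk
      nltk.download('stopwords')
      from nltk.corpus import stopwords

      stop_words = set(stopwords.words('english'))

      def remove_stopwords(text):
          tokens = text.lower().split()
          return " ".join(t for t in tokens if t not in stop_words)

      print(remove_stopwords("I need new car insurance"))  # -> "need new car insurance"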
  22. Pre-processing: cleaning HTML
      (Slides 22-25 walk through examples with BeautifulSoup:
      https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
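      The slides point to the BeautifulSoup docs; a minimal tag-stripping sketch
      (illustrative input):

      from bs4 import BeautifulSoup

      def clean_html(raw_html):
          # drop all tags and keep only the visible text
          return BeautifulSoup(raw_html, "html.parser").get_text(separator=" ")

      print(clean_html("<p>I need a <b>new</b> car insurance</p>"))
      # -> "I need a  new  car insurance" (extra whitespace, which remove_space cleans up)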
  26. What kind of models to use?
      ➢ SVM
      ➢ Logistic Regression
      ➢ Gradient Boosting
      ➢ Neural Networks
  27. Let’s look at a problem
  28. Quora duplicate question identification
      ➢ ~13 million questions
      ➢ Many duplicate questions
      ➢ Cluster and join duplicates together
      ➢ Remove clutter
  29. Non-duplicate questions
      ➢ Who should I address my cover letter to if I'm applying for a big company like Mozilla?
      ➢ Which car is better from safety view? "swift or grand i10". My first priority is safety?
      ➢ How can I start an online shopping (e-commerce) website?
      ➢ Which web technology is best suitable for building a big E-Commerce website?
  30. Duplicate questions
      ➢ How does Quora quickly mark questions as needing improvement?
      ➢ Why does Quora mark my questions as needing improvement/clarification before I have time to give it details? Literally within seconds…
      ➢ What practical applications might evolve from the discovery of the Higgs Boson?
      ➢ What are some practical benefits of discovery of the Higgs Boson?
  31. Dataset
      ➢ 400,000+ pairs of questions
      ➢ Initially data was very skewed
      ➢ Negative sampling
      ➢ Noise exists (as usual)
  32. Dataset
      ➢ 255,045 negative samples (non-duplicates)
      ➢ 149,306 positive samples (duplicates)
      ➢ ~37% positive samples
  33. Dataset: basic exploration
      ➢ Average number of characters in question1: 59.57
      ➢ Minimum number of characters in question1: 1
      ➢ Maximum number of characters in question1: 623
      ➢ Average number of characters in question2: 60.14
      ➢ Minimum number of characters in question2: 1
      ➢ Maximum number of characters in question2: 1169
  34. Basic feature engineering
      ➢ Length of question1
      ➢ Length of question2
      ➢ Difference in the two lengths
      ➢ Character length of question1 without spaces
      ➢ Character length of question2 without spaces
      ➢ Number of words in question1
      ➢ Number of words in question2
      ➢ Number of common words in question1 and question2
  35. Basic feature engineering

      data['len_q1'] = data.question1.apply(lambda x: len(str(x)))
      data['len_q2'] = data.question2.apply(lambda x: len(str(x)))
      data['diff_len'] = data.len_q1 - data.len_q2
      # note: set() means these two count *unique* characters, ignoring spaces
      data['len_char_q1'] = data.question1.apply(
          lambda x: len(''.join(set(str(x).replace(' ', '')))))
      data['len_char_q2'] = data.question2.apply(
          lambda x: len(''.join(set(str(x).replace(' ', '')))))
      data['len_word_q1'] = data.question1.apply(lambda x: len(str(x).split()))
      data['len_word_q2'] = data.question2.apply(lambda x: len(str(x).split()))
  36. Basic feature engineering

      data['len_common_words'] = data.apply(
          lambda x: len(set(str(x['question1']).lower().split())
                        .intersection(set(str(x['question2']).lower().split()))),
          axis=1)
  37. Basic modelling
      Pipeline: tabular data (basic features) -> normalization -> training/validation
      split, then Logistic Regression (validation accuracy 0.658) and XGBoost (0.721).
  38. Fuzzy features
      ➢ Also known as approximate string matching
      ➢ Number of "primitive" operations required to convert a string to an exact match
      ➢ Primitive operations:
        ○ Insertion
        ○ Deletion
        ○ Substitution
      ➢ Typically used for:
        ○ Spell checking
        ○ Plagiarism detection
        ○ DNA sequence matching
        ○ Spam filtering
  39. Fuzzy features
      ➢ pip install fuzzywuzzy
      ➢ Uses Levenshtein distance
      ➢ QRatio
      ➢ WRatio
      ➢ Token set ratio
      ➢ Token sort ratio
      ➢ Partial token set ratio
      ➢ Partial token sort ratio
      https://github.com/seatgeek/fuzzywuzzy
  40. Fuzzy features

      from fuzzywuzzy import fuzz

      data['fuzz_qratio'] = data.apply(
          lambda x: fuzz.QRatio(str(x['question1']), str(x['question2'])), axis=1)
      data['fuzz_WRatio'] = data.apply(
          lambda x: fuzz.WRatio(str(x['question1']), str(x['question2'])), axis=1)
      data['fuzz_partial_ratio'] = data.apply(
          lambda x: fuzz.partial_ratio(str(x['question1']), str(x['question2'])), axis=1)
      data['fuzz_partial_token_set_ratio'] = data.apply(
          lambda x: fuzz.partial_token_set_ratio(str(x['question1']), str(x['question2'])), axis=1)
  41. Fuzzy features

      data['fuzz_partial_token_sort_ratio'] = data.apply(
          lambda x: fuzz.partial_token_sort_ratio(str(x['question1']), str(x['question2'])), axis=1)
      data['fuzz_token_set_ratio'] = data.apply(
          lambda x: fuzz.token_set_ratio(str(x['question1']), str(x['question2'])), axis=1)
      data['fuzz_token_sort_ratio'] = data.apply(
          lambda x: fuzz.token_sort_ratio(str(x['question1']), str(x['question2'])), axis=1)
  42. Improving models
      With basic + fuzzy features: Logistic Regression improves from 0.658 to 0.660,
      XGBoost from 0.721 to 0.738.
  43. Can we improve it further?
  44. Traditional handling of text data
      ➢ Hashing of words
      ➢ Count vectorization
      ➢ TF-IDF
      ➢ SVD
  45. TF-IDF

      TF(t) = (number of times term t appears in a document) /
              (total number of terms in the document)

      IDF(t) = log( (total number of documents) /
                    (number of documents containing term t) )

      TF-IDF(t) = TF(t) * IDF(t)
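      A worked example with made-up numbers (not from the slides), just to make the
      formulas concrete:

      import math

      tf = 3 / 100                      # term appears 3 times in a 100-word document
      idf = math.log(400_000 / 1_000)   # 400k documents, 1k of them contain the term
      print(tf * idf)                   # TF-IDF ~= 0.03 * 5.99 ~= 0.18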
  46. TF-IDF

      from sklearn.feature_extraction.text import TfidfVectorizer

      tfidf = TfidfVectorizer(
          min_df=3,
          max_features=None,
          strip_accents='unicode',
          analyzer='word',
          token_pattern=r'\w{1,}',
          ngram_range=(1, 2),
          use_idf=1,
          smooth_idf=1,
          sublinear_tf=1,
          stop_words='english'
      )
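      A hedged sketch of how this vectorizer might be applied to the question pairs
      (variable names are hypothetical); fitting a single vectorizer on both questions
      and stacking the two matrices corresponds to "method-2" on the slides below:

      from scipy import sparse

      questions = list(data.question1.astype(str)) + list(data.question2.astype(str))
      tfidf.fit(questions)                              # one shared vocabulary
      q1_tfidf = tfidf.transform(data.question1.astype(str))
      q2_tfidf = tfidf.transform(data.question2.astype(str))
      X = sparse.hstack((q1_tfidf, q2_tfidf)).tocsr()   # input for LR / XGBoost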
  47. SVD
      ➢ Latent semantic analysis
      ➢ scikit-learn version of SVD
      ➢ 120 components

      from sklearn import decomposition

      svd = decomposition.TruncatedSVD(n_components=120)
      xtrain_svd = svd.fit_transform(xtrain)
      xtest_svd = svd.transform(xtest)
  48. Simply using TF-IDF: method-1
      Separate TF-IDF vectorizers for question1 and question2. Validation accuracy:
      Logistic Regression 0.777 (up from 0.658 / 0.660), XGBoost 0.749 (up from
      0.721 / 0.738).
  49. Simply using TF-IDF: method-2
      One TF-IDF vectorizer fitted on both questions. Logistic Regression 0.804,
      XGBoost 0.748.
  50. Simply using TF-IDF + SVD: method-1
      Separate TF-IDF and separate SVD for each question. Logistic Regression 0.706,
      XGBoost 0.763.
  51. Simply using TF-IDF + SVD: method-2
      Separate TF-IDF for each question, one common SVD. Logistic Regression 0.700,
      XGBoost 0.753.
  52. Simply using TF-IDF + SVD: method-3
      One common TF-IDF and one common SVD. Logistic Regression 0.714, XGBoost 0.759.
  53. Word embeddings
      ➢ Multi-dimensional vector for every word in the dictionary
      ➢ Always great insights
      ➢ Very popular in natural language processing tasks
      ➢ Google News vectors, 300d
      ➢ GloVe
      ➢ FastText
  54. Word embeddings
      Every word gets a position in space, and arithmetic on the vectors captures
      analogies: Berlin - Germany + France ~ Paris.
  55. Word embeddings
      ➢ Embeddings for words
      ➢ Embeddings for the whole sentence
  56. Word embeddings

      import numpy as np

      def sent2vec(s, model, stop_words, tokenizer):
          # normalized sum of the word vectors in a sentence
          words = str(s).lower()
          words = tokenizer(words)
          words = [w for w in words if w not in stop_words]
          words = [w for w in words if w.isalpha()]
          M = []
          for w in words:
              M.append(model[w])
          M = np.array(M)
          v = M.sum(axis=0)
          return v / np.sqrt((v ** 2).sum())
  57. Word embeddings
  58. Word embeddings features
  59. Word embeddings features
      Spatial distances: Euclidean, Manhattan, Cosine, Canberra, Minkowski, Braycurtis
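      A sketch of turning these distances into features with scipy (q1_vecs / q2_vecs
      are hypothetical arrays of sent2vec outputs, one row per question):

      from scipy.spatial import distance

      data['cosine_dist'] = [distance.cosine(a, b) for a, b in zip(q1_vecs, q2_vecs)]
      data['cityblock_dist'] = [distance.cityblock(a, b) for a, b in zip(q1_vecs, q2_vecs)]  # Manhattan
      data['euclidean_dist'] = [distance.euclidean(a, b) for a, b in zip(q1_vecs, q2_vecs)]
      data['canberra_dist'] = [distance.canberra(a, b) for a, b in zip(q1_vecs, q2_vecs)]
      data['braycurtis_dist'] = [distance.braycurtis(a, b) for a, b in zip(q1_vecs, q2_vecs)]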
  60. Word embeddings features
      Statistical features: skew and kurtosis
      ➢ Skew = 0 for a normal distribution
      ➢ Skew > 0: more weight in the right tail
      ➢ Kurtosis: 4th central moment over the square of the variance
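      These statistics come straight from scipy.stats (again a sketch with the same
      hypothetical q1_vecs / q2_vecs):

      from scipy.stats import skew, kurtosis

      data['skew_q1'] = [skew(v) for v in q1_vecs]
      data['skew_q2'] = [skew(v) for v in q2_vecs]
      data['kurtosis_q1'] = [kurtosis(v) for v in q1_vecs]
      data['kurtosis_q2'] = [kurtosis(v) for v in q2_vecs]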
  61. Word mover’s distance: WMD
      Kusner, M., Sun, Y., Kolkin, N. & Weinberger, K. (2015). From Word Embeddings
      To Document Distances.
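      Gensim's KeyedVectors expose WMD directly (a minimal sketch; loading the Google
      News vectors this way is an assumption, and older gensim versions additionally
      need the pyemd package for this call):

      import gensim.downloader as api

      model = api.load("word2vec-google-news-300")
      d = model.wmdistance("how do i learn nlp".split(),
                           "what is the best way to study nlp".split())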
  62. Results comparison

      Features                                                  | LR accuracy | XGBoost accuracy
      Basic features                                            | 0.658       | 0.721
      Basic + fuzzy features                                    | 0.660       | 0.738
      Basic + fuzzy + word2vec features                         | 0.676       | 0.766
      Word2vec features                                         | X           | 0.78
      Basic + fuzzy + word2vec features + full word2vec vectors | X           | 0.814
      TF-IDF + SVD (best combination)                           | 0.804       | 0.763
  63. What can deep learning do?
      ➢ Natural language processing
      ➢ Speech processing
      ➢ Computer vision
      ➢ And more and more
  64. 1-D CNN
      ➢ One-dimensional convolutional layer
      ➢ Temporal convolution
      ➢ Simple to implement:

      for i in range(sample_length):
          y[i] = 0
          for j in range(kernel_length):
              if i - j >= 0:   # guard the boundary at the start of the signal
                  y[i] += x[i - j] * h[j]
  65. LSTM
      ➢ Long short-term memory
      ➢ A type of RNN
      ➢ Used two LSTM layers
  66. Embedding layers
      ➢ Simple layer
      ➢ Converts indexes to vectors
      ➢ [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]
  67. Time distributed dense layer
      ➢ TimeDistributed wrapper around a dense layer
      ➢ TimeDistributed applies the layer to every temporal slice of the input
      ➢ Followed by a Lambda layer
      ➢ Implements the "translation" layer used by Stephen Merity (keras_snli model)

      model1 = Sequential()
      model1.add(Embedding(len(word_index) + 1, 300,
                           weights=[embedding_matrix],
                           input_length=40,
                           trainable=False))
      model1.add(TimeDistributed(Dense(300, activation='relu')))
      model1.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))
  68. Handling text data before training

      tk = text.Tokenizer(nb_words=200000)
      max_len = 40
      tk.fit_on_texts(list(data.question1.values) +
                      list(data.question2.values.astype(str)))
      x1 = tk.texts_to_sequences(data.question1.values)
      x1 = sequence.pad_sequences(x1, maxlen=max_len)
      x2 = tk.texts_to_sequences(data.question2.values.astype(str))
      x2 = sequence.pad_sequences(x2, maxlen=max_len)
      word_index = tk.word_index
  69. Handling text data before training

      embeddings_index = {}
      f = open('glove.840B.300d.txt')
      for line in tqdm(f):
          values = line.split()
          word = values[0]
          coefs = np.asarray(values[1:], dtype='float32')
          embeddings_index[word] = coefs
      f.close()
  72. Handling text data before training

      embedding_matrix = np.zeros((len(word_index) + 1, 300))
      for word, i in tqdm(word_index.items()):
          embedding_vector = embeddings_index.get(word)
          if embedding_vector is not None:
              embedding_matrix[i] = embedding_vector
  73. Basis of the deep learning model
      ➢ keras_snli model: https://github.com/Smerity/keras_snli
  74. Creating the deep learning model
  75. Final Deep Learning Model
  76. Model 1 and Model 2

      model1 = Sequential()
      model1.add(Embedding(len(word_index) + 1, 300,
                           weights=[embedding_matrix],
                           input_length=40,
                           trainable=False))
      model1.add(TimeDistributed(Dense(300, activation='relu')))
      model1.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))

      model2 = Sequential()
      model2.add(Embedding(len(word_index) + 1, 300,
                           weights=[embedding_matrix],
                           input_length=40,
                           trainable=False))
      model2.add(TimeDistributed(Dense(300, activation='relu')))
      model2.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))
  77. Final Deep Learning Model
  78. Model 3 and Model 4
  79. Model 3 and Model 4

      model3 = Sequential()
      model3.add(Embedding(len(word_index) + 1, 300,
                           weights=[embedding_matrix],
                           input_length=40,
                           trainable=False))
      model3.add(Convolution1D(nb_filter=nb_filter,
                               filter_length=filter_length,
                               border_mode='valid',
                               activation='relu',
                               subsample_length=1))
      model3.add(Dropout(0.2))
      . . .
      model3.add(Dense(300))
      model3.add(Dropout(0.2))
      model3.add(BatchNormalization())
  80. Final Deep Learning Model
  81. Model 5 and Model 6

      model5 = Sequential()
      model5.add(Embedding(len(word_index) + 1, 300, input_length=40, dropout=0.2))
      model5.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))

      model6 = Sequential()
      model6.add(Embedding(len(word_index) + 1, 300, input_length=40, dropout=0.2))
      model6.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))
  82. Final Deep Learning Model
  83. Merged Model
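      The "Merged Model" slide is an architecture diagram. As a reconstruction in the
      Keras 1.x style of the era (a hedged sketch, not the author's verbatim code;
      layer sizes and the exact stack of blocks are assumptions), the six towers might
      be merged like this:

      merged_model = Sequential()
      merged_model.add(Merge([model1, model2, model3, model4, model5, model6],
                             mode='concat'))
      merged_model.add(BatchNormalization())
      merged_model.add(Dense(300, activation='relu'))
      merged_model.add(Dropout(0.2))
      merged_model.add(BatchNormalization())
      merged_model.add(Dense(1, activation='sigmoid'))
      merged_model.compile(loss='binary_crossentropy', optimizer='adam',
                           metrics=['accuracy'])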
  84. Time to Train the DeepNet
      ➢ Total params: 174,913,917
      ➢ Trainable params: 60,172,917
      ➢ Non-trainable params: 114,741,000
      ➢ NVIDIA Titan X
  85. Time to Train the DeepNet
      ➢ The network was trained on an NVIDIA Titan X; each epoch took roughly 300
        seconds, and full training took 10-15 hours. It reached an accuracy of
        0.848 (~0.85).
      ➢ The state of the art at the time was around 0.88 (the Bi-MPM model).
  86. Can we end without talking about the muppets?
  87. Of course!
  88. Just kidding, no!
  89. BERT
      ➢ Based on the transformer encoder
      ➢ Each encoder block has self-attention
      ➢ Encoder blocks: 12 or 24
      ➢ Hidden size (model dimension): 768 or 1024
      ➢ Attention heads: 12 or 16
  90. BERT encoder block
      (Diagram: an encoder block takes a sequence of up to 512 inputs and emits
      vectors of size 768 or 1024.)
  91. How does BERT learn?
      ➢ BERT has a fixed vocab
      ➢ BERT has encoder blocks (transformer blocks)
      ➢ A word is masked and BERT tries to predict that word
      ➢ BERT training also tries to predict the next sentence
      ➢ Combining the losses from the two approaches above, BERT learns
  92. BERT tokenization
      ➢ [CLS] TOKENS [SEP]
      ➢ [CLS] TOKENS_A [SEP] TOKENS_B [SEP]

      Example of tokenization:
      hi, everyone! this is tokenization example
      [CLS] hi , everyone ! this is token ##ization example [SEP]
  93. BERT tokenization: https://github.com/huggingface/tokenizers
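      For example, with the huggingface transformers tokenizer (a sketch reproducing
      the slide's example; bert-base-uncased is an assumed checkpoint):

      from transformers import BertTokenizer

      tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
      print(tokenizer.tokenize("hi, everyone! this is tokenization example"))
      # ['hi', ',', 'everyone', '!', 'this', 'is', 'token', '##ization', 'example']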
  94. Approaching duplicate questions using BERT
      (Slides 94-99 step through code on screen; the code did not survive the transcript.)
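      As a hedged sketch of the general approach with huggingface transformers (the
      architecture, names and hyperparameters here are hypothetical, not the author's
      code): encode both questions as one sentence pair and classify the pooled [CLS]
      representation.

      import torch.nn as nn
      from transformers import BertModel

      class BertForDuplicates(nn.Module):
          def __init__(self):
              super().__init__()
              self.bert = BertModel.from_pretrained("bert-base-uncased")
              self.dropout = nn.Dropout(0.3)
              self.out = nn.Linear(768, 1)   # one logit: duplicate or not

          def forward(self, input_ids, attention_mask, token_type_ids):
              _, pooled = self.bert(input_ids,
                                    attention_mask=attention_mask,
                                    token_type_ids=token_type_ids,
                                    return_dict=False)
              return self.out(self.dropout(pooled))   # train with BCEWithLogitsLoss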
  100. There is a lot more…
  101. Maybe next time!
  102. Few things to remember...
  103. Fine-tuning often gives good results
       ➢ It is faster
       ➢ It is better (not always)
       ➢ Why reinvent the wheel?
  104. Fine-tuning often gives good results
  105. Bigger isn’t always better
  106. A good model has some key ingredients...
  107. Sugar: understanding the data, exploring the data
  108. Spice: pre-processing, feature engineering, feature selection
  109. All the things that are nice: a good cross-validation, low error rate, a simple
       model or a combination of models, post-processing
  110. Chemical X
  111. A Good Machine Learning Model
  112. ➢ e-mail: abhishek4@gmail.com
       ➢ LinkedIn: linkedin.com/in/abhi1thakur
       ➢ Kaggle: kaggle.com/abhishek
       ➢ tweet me: @abhi1thakur
       ➢ YouTube: youtube.com/AbhishekThakurAbhi
       Approaching (almost) any machine learning problem: the book will release in
       Summer 2020. Fill out the form here to be the first one to know when it’s ready
       to buy: http://bit.ly/approachingalmost
