The document discusses various approaches to natural language processing (NLP) problems, including preprocessing text data, traditional machine learning models, deep learning models, and word embeddings. It covers preprocessing steps such as removing extra whitespace, tokenization, spelling correction, stemming, and stopword handling. It also discusses using TF-IDF features and latent semantic analysis with SVD for classification models. Finally, it discusses using word embeddings to represent text as vectors for sequence models.
6. Agenda
➢ Not so much intro
➢ Where is NLP used
➢ Pre-processing
➢ Machine Learning Models
➢ Solving a problem
➢ Traditional approaches
➢ Deep Learning Models
➢ Muppets
7. Applications of natural language processing
➢ Translation
➢ Sentiment Classification
➢ Chatbots / VAs
➢ Autocomplete
➢ Entity Extraction
➢ Question Answering
➢ Review Rating Prediction
➢ Search Engine
➢ Speech to Text
➢ Topic Extraction
8. Pre-processing the text data
Example of noisy chat input: "can u he.lp me with loan? 😊"
➢ Unintentional characters, abbreviations, symbols, emojis
Cleaned: "can you help me with loan ?"
9. Pre-processing the text data
➢ Removing weird spaces
➢ Tokenization
➢ Spelling correction
➢ Contraction mapping
➢ Stemming
➢ Emoji handling
➢ Stopwords handling
➢ Cleaning HTML
10. Pre-processing the text data
➢ Removing weird spaces
def remove_space(text):
    # collapse runs of whitespace into single spaces and trim the ends
    text = text.strip()
    text = text.split()
    return " ".join(text)
11. Pre-processing the text data
➢ Tokenization
➢ A very important step
➢ It is not always just about splitting on spaces
➢ Converts text into word tokens
➢ Might be different for different languages
➢ Simplest option is `word_tokenize` from NLTK
➢ Or write your own ;)
12. Pre-processing the text data
➢ Tokenization
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

text = "hello, how are you?"
tokens = word_tokenize(text)
print(tokens)
# ['hello', ',', 'how', 'are', 'you', '?']
13. Pre-processing the text data
➢ Spelling correction
➢ A very crucial step
➢ In chat: "can u tel me abot new sim card pland?"
➢ Most models will fail without spelling correction
➢ Peter Norvig’s spelling corrector
➢ Or make your own ;)
14. Pre-processing the text data
➢ Spelling correction
I need a new car insurance
I need aa new car insurance
I ned a new car insuraance
I needd a new carr insurance
I need a neew car insurance
I need a new car insurancee
(Model sketch: embedding layer → stacked bidirectional char-LSTM → output)
15. Pre-processing the text data
➢ Spelling correction
def edits1(word):
    # all strings that are one edit away from `word`
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    # all strings that are two edits away from `word`
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))
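edits1/edits2 only generate candidate strings; Norvig's corrector also needs a word-frequency model to pick the most probable candidate. A minimal sketch of the rest of his corrector, assuming a plain-text corpus file `big.txt` (the corpus Norvig uses) is available locally:

import re
from collections import Counter

# word frequencies estimated from a large corpus
WORDS = Counter(re.findall(r'\w+', open('big.txt').read().lower()))

def P(word, N=sum(WORDS.values())):
    # probability of `word` estimated from corpus counts
    return WORDS[word] / N

def known(words):
    # subset of `words` that actually appear in the dictionary
    return set(w for w in words if w in WORDS)

def candidates(word):
    # known words at edit distance 0, 1, then 2; fall back to the word itself
    return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

def correction(word):
    # most probable spelling correction for `word`
    return max(candidates(word), key=P)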
17. Pre-processing the text data
➢ Contraction mapping
def mapping_replacer(x, dic):
    # replace whole words found in `dic` with their mapped form
    for word in dic.keys():
        if " " + word + " " in x:
            x = x.replace(" " + word + " ", " " + dic[word] + " ")
    return x
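A small usage sketch; the dictionary entries below are illustrative, not the full mapping used in the talk. Note the padding spaces, since the replacer matches " word " patterns:

# illustrative contraction/abbreviation dictionary
contraction_dict = {"can't": "cannot", "won't": "will not", "u": "you", "abt": "about"}

text = " can't u tell me abt the new plan "
print(mapping_replacer(text, contraction_dict))
# " cannot you tell me about the new plan "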
18. Pre-processing the text data
➢ Stemming
➢ Reduces words to root form
➢ Why is stemming important?
➢ NLTK stemmers
19. Pre-processing the text data
➢ Stemming
fishing, fished, fishes → fish
In [1]: from nltk.stem import SnowballStemmer
In [2]: s = SnowballStemmer('english')
In [3]: s.stem("fishing")
Out[3]: 'fish'
20. Pre-processing the text data
➢ Emoji handling
# pip install emoji
import emoji
# dictionary of emoji characters; newer releases of the package expose emoji.EMOJI_DATA instead
emojis = emoji.UNICODE_EMOJI
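A small sketch of handling emojis with the same package; `demojize` converts emojis to textual aliases (the exact alias text and the `is_emoji` helper depend on the package version):

import emoji

text = "can u help me with loan? 😊"

# replace emojis with textual aliases, e.g. :smiling_face_with_smiling_eyes:
print(emoji.demojize(text))

# or drop them entirely (is_emoji is available in newer versions of the package)
no_emoji = "".join(ch for ch in text if not emoji.is_emoji(ch))
print(no_emoji)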
21. Pre-processing the text data
➢ Stopwords handling
Example: "I need new car insurance" → "car insurance" once "I", "need" and "new" are stripped as stopwords
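A minimal stopword-removal sketch with NLTK's English list; note that the default list only drops "I" here, so aggressive filtering like the example above usually needs a custom list:

import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

tokens = word_tokenize("I need new car insurance")
kept = [t for t in tokens if t.lower() not in stop_words]
print(kept)  # 'I' is dropped; 'need', 'new', 'car', 'insurance' remain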
22. Pre-processing the text data
➢ Cleaning HTML
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
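A minimal HTML-stripping sketch with BeautifulSoup (the library linked above); the sample markup is just for illustration:

from bs4 import BeautifulSoup

html = "<p>I need a <b>new</b> car insurance.<br/>Visit <a href='#'>this page</a>.</p>"

# parse the markup and keep only the visible text
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(separator=" ")
print(" ".join(text.split()))  # "I need a new car insurance. Visit this page ."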
26. What kind of models to use?
➢ SVM
➢ Logistic Regression
➢ Gradient Boosting
➢ Neural Networks
28. Quora duplicate question identification
➢ ~ 13 million questions
➢ Many duplicate questions
➢ Cluster and join duplicates together
➢ Remove clutter
29. Non-duplicate questions
➢ Who should I address my cover letter to if I'm applying for a big company like Mozilla?
➢ Which car is better from safety view? "swift or grand i10". My first priority is safety?
➢ How can I start an online shopping (e-commerce) website?
➢ Which web technology is best suitable for building a big E-Commerce website?
30. Duplicate questions
➢ How does Quora quickly mark questions as needing improvement?
➢ Why does Quora mark my questions as needing improvement/clarification before I have time to give it details? Literally within seconds…
➢ What practical applications might evolve from the discovery of the Higgs Boson?
➢ What are some practical benefits of discovery of the Higgs Boson?
31. Dataset
➢ 400,000+ pairs of questions
➢ Initially the data was very skewed
➢ Negative sampling
➢ Noise exists (as usual)
33. Dataset: basic exploration
➢ Average number of characters in question1: 59.57
➢ Minimum number of characters in question1: 1
➢ Maximum number of characters in question1: 623
➢ Average number of characters in question2: 60.14
➢ Minimum number of characters in question2: 1
➢ Maximum number of characters in question2: 1169
34. Basic feature engineering
➢ Length of question1
➢ Length of question2
➢ Difference in the two lengths
➢ Character length of question1 without spaces
➢ Character length of question2 without spaces
➢ Number of words in question1
➢ Number of words in question2
➢ Number of common words in question1 and question2
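A sketch of computing these basic features with pandas, assuming a dataframe `data` with `question1` and `question2` columns as in the Quora dataset; the column names added here are illustrative:

import pandas as pd

def basic_features(data):
    df = data.copy()
    q1 = df.question1.astype(str)
    q2 = df.question2.astype(str)

    df['len_q1'] = q1.str.len()                                    # character length of question1
    df['len_q2'] = q2.str.len()                                    # character length of question2
    df['diff_len'] = df.len_q1 - df.len_q2                         # difference of the two lengths
    df['len_char_q1'] = q1.str.replace(' ', '', regex=False).str.len()  # chars without spaces
    df['len_char_q2'] = q2.str.replace(' ', '', regex=False).str.len()
    df['len_word_q1'] = q1.str.split().str.len()                   # number of words
    df['len_word_q2'] = q2.str.split().str.len()
    df['common_words'] = [                                         # shared words between the pair
        len(set(a.lower().split()) & set(b.lower().split())) for a, b in zip(q1, q2)
    ]
    return df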
38. Fuzzy features
➢ Also known as approximate string matching
➢ Number of “primitive” operations required to transform one string into the other
➢ Primitive operations:
○ Insertion
○ Deletion
○ Substitution
➢ Typically used for:
○ Spell checking
○ Plagiarism detection
○ DNA sequence matching
○ Spam filtering
39. Fuzzy features
➢ pip install fuzzywuzzy
➢ Uses Levenshtein distance
➢ QRatio
➢ WRatio
➢ Token set ratio
➢ Token sort ratio
➢ Partial token set ratio
➢ Partial token sort ratio
https://github.com/seatgeek/fuzzywuzzy
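A sketch of the fuzzy features on a single question pair; all of these ratio functions live in `fuzzywuzzy.fuzz`:

from fuzzywuzzy import fuzz

q1 = "How can I start an online shopping (e-commerce) website?"
q2 = "Which web technology is best suitable for building a big E-Commerce website?"

# each ratio is an integer similarity score between 0 and 100
features = {
    'fuzz_qratio': fuzz.QRatio(q1, q2),
    'fuzz_wratio': fuzz.WRatio(q1, q2),
    'fuzz_partial_ratio': fuzz.partial_ratio(q1, q2),
    'fuzz_token_set_ratio': fuzz.token_set_ratio(q1, q2),
    'fuzz_token_sort_ratio': fuzz.token_sort_ratio(q1, q2),
    'fuzz_partial_token_set_ratio': fuzz.partial_token_set_ratio(q1, q2),
    'fuzz_partial_token_sort_ratio': fuzz.partial_token_sort_ratio(q1, q2),
}
print(features)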
46. Traditional handling of text data
➢ Hashing of words
➢ Count vectorization
➢ TF-IDF
➢ SVD
47. TF-IDF
TF(t) = (number of times term t appears in the document) / (total number of terms in the document)

IDF(t) = log( (total number of documents) / (number of documents containing term t) )

TF-IDF(t) = TF(t) * IDF(t)
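A sketch of the TF-IDF + SVD (latent semantic analysis) pipeline with scikit-learn; the corpus and parameter values are illustrative, not the ones tuned for the talk:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = [
    "how can i start an online shopping website",
    "which web technology is best for building an e-commerce website",
    "what are some practical benefits of the higgs boson",
    "what practical applications might evolve from the higgs boson",
]

# word-level TF-IDF features (unigrams and bigrams)
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X = tfidf.fit_transform(corpus)

# latent semantic analysis: compress the sparse TF-IDF matrix into dense components
svd = TruncatedSVD(n_components=2, random_state=42)
X_svd = svd.fit_transform(X)
print(X_svd.shape)  # (4, 2)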
55. Word embeddings
WORD → [ multi-dimensional dense vector ]
➢ A multi-dimensional vector for every word in the vocabulary
➢ Often provide great insights
➢ Very popular in natural language processing tasks
➢ Google news vectors 300d
➢ GloVe
➢ FastText
58. Word embeddings
import numpy as np

def sent2vec(s, model, stop_words, tokenizer):
    # average the word vectors of a sentence into a single normalized vector
    words = str(s).lower()
    words = tokenizer(words)
    words = [w for w in words if w not in stop_words]
    words = [w for w in words if w.isalpha()]
    M = []
    for w in words:
        if w in model:  # skip out-of-vocabulary words
            M.append(model[w])
    M = np.array(M)
    v = M.sum(axis=0)
    return v / np.sqrt((v ** 2).sum())
63. Kusner, M., Sun, Y., Kolkin, N. & Weinberger, K.. (2015). From Word Embeddings To Document Distances.
Word mover’s distance: WMD
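Word mover's distance is available in gensim on top of pretrained word vectors; a minimal sketch, assuming the Google News word2vec binary is available locally (depending on the gensim version this also needs the pyemd or POT package):

from gensim.models import KeyedVectors

# pretrained 300-d Google News vectors (path is an assumption; large download)
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

s1 = "what are some practical benefits of discovery of the higgs boson".split()
s2 = "what practical applications might evolve from the discovery of the higgs boson".split()

# smaller distance => more similar documents
print(model.wmdistance(s1, s2))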
64. Results comparison
Features                                                  | Logistic Regression Accuracy | XGBoost Accuracy
Basic Features                                            | 0.658                        | 0.721
Basic Features + Fuzzy Features                           | 0.660                        | 0.738
Basic + Fuzzy + Word2Vec Features                         | 0.676                        | 0.766
Word2Vec Features                                         | X                            | 0.78
Basic + Fuzzy + Word2Vec Features + Full Word2Vec Vectors | X                            | 0.814
TFIDF + SVD (Best Combination)                            | 0.804                        | 0.763
66. What can deep learning do?
➢ Natural language processing
➢ Speech processing
➢ Computer vision
➢ And much more
68. 1-D CNN
➢ One dimensional convolutional layer
➢ Temporal convolution
➢ Simple to implement:
# naive temporal convolution (assumes y is pre-allocated with sample_length zeros)
for i in range(sample_length):
    y[i] = 0
    for j in range(kernel_length):
        if i - j >= 0:  # guard against negative (wrap-around) indices
            y[i] += x[i - j] * h[j]
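In Keras this is a single layer; a minimal sketch, assuming 300-d word embeddings and sequences padded to length 40 as in the later slides:

from keras.models import Sequential
from keras.layers import Conv1D, GlobalMaxPooling1D, Dense

model = Sequential()
# 64 temporal filters of width 3 sliding over the 40 x 300 embedded sequence
model.add(Conv1D(64, kernel_size=3, activation='relu', input_shape=(40, 300)))
model.add(GlobalMaxPooling1D())
model.add(Dense(1, activation='sigmoid'))
model.summary()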
69. LSTM
➢ Long short-term memory
➢ A type of RNN
➢ We used two LSTM layers in this model
71. Time distributed dense layer
➢ TimeDistributed wrapper around dense layer
➢ TimeDistributed applies the layer to every temporal slice of input
➢ Followed by Lambda layer
➢ Implements the “translation” layer used by Stephen Merity (keras snli model)
from keras.models import Sequential
from keras.layers import Embedding, TimeDistributed, Dense, Lambda
from keras import backend as K

# word_index and embedding_matrix come from the pre-processing slides
model1 = Sequential()
model1.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=40,
                     trainable=False))
model1.add(TimeDistributed(Dense(300, activation='relu')))
model1.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))
72. Handling text data before training
from keras.preprocessing import text, sequence

# nb_words is the Keras 1.x argument name (num_words in Keras 2)
tk = text.Tokenizer(nb_words=200000)
max_len = 40

tk.fit_on_texts(list(data.question1.values) + list(data.question2.values.astype(str)))

x1 = tk.texts_to_sequences(data.question1.values)
x1 = sequence.pad_sequences(x1, maxlen=max_len)

x2 = tk.texts_to_sequences(data.question2.values.astype(str))
x2 = sequence.pad_sequences(x2, maxlen=max_len)

word_index = tk.word_index
73. Handling text data before training
import numpy as np
from tqdm import tqdm

# load the 300-d GloVe vectors into a word -> vector dictionary
embeddings_index = {}
f = open('glove.840B.300d.txt')
for line in tqdm(f):
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
76. Handling text data before training
# build the embedding matrix row by row; unknown words stay as all-zero rows
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
77. Basis of deep learning model
➢ Keras-snli model: https://github.com/Smerity/keras_snli
88. Time to Train the DeepNet
➢ Total params: 174,913,917
➢ Trainable params: 60,172,917
➢ Non-trainable params: 114,741,000
➢ NVIDIA Titan X
90. Time to Train the DeepNet
➢ The deep network was trained on an NVIDIA Titan X; each epoch took approximately
300 seconds, and training took 10-15 hours overall. The network achieved an
accuracy of 0.848 (~0.85).
➢ The SOTA at that time was around 0.88 (the BiMPM model).
94. BERT
➢ Based on transformer encoder
➢ Each encoder block has self-attention
➢ Encoder blocks: 12 or 24
➢ Hidden size: 768 or 1024
➢ Attention heads: 12 or 16
96. How does BERT learn?
➢ BERT has a fixed vocab
➢ BERT has encoder blocks (transformer blocks)
➢ A word is masked and BERT tries to predict that word
➢ BERT training also tries to predict next sentence
➢ BERT learns by combining the losses from these two objectives
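The masked-word objective is easy to poke at with the Hugging Face transformers library (not used in the talk, just for illustration):

from transformers import pipeline

# BERT fills in the [MASK] token with its most probable predictions
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("I need a new car [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))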
97. BERT tokenization
➢ [CLS] TOKENS [SEP]
➢ [CLS] TOKENS_A [SEP] TOKENS_B [SEP]
Example of tokenization:
hi, everyone! this is tokenization example
[CLS] hi , everyone ! this is token ##ization example [SEP]
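The same tokenization example with the Hugging Face BERT tokenizer (an assumption for illustration; the talk does not specify an implementation):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("hi, everyone! this is tokenization example")
print(tokens)
# ['hi', ',', 'everyone', '!', 'this', 'is', 'token', '##ization', 'example']

# encode() adds the special [CLS] and [SEP] tokens around the WordPiece ids
ids = tokenizer.encode("hi, everyone! this is tokenization example")
print(tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', 'hi', ',', 'everyone', '!', 'this', 'is', 'token', '##ization', 'example', '[SEP]']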
118. ➢ e-mail: abhishek4@gmail.com
➢ Linkedin: linkedin.com/in/abhi1thakur
➢ kaggle: kaggle.com/abhishek
➢ tweet me: @abhi1thakur
➢ YouTube: youtube.com/AbhishekThakurAbhi
Approaching (almost) any machine learning problem: the book will release in Summer 2020.
Fill out the form here to be the first one to know when it’s ready to buy:
http://bit.ly/approachingalmost