The document discusses various approaches to natural language processing (NLP) problems, including preprocessing text data, traditional machine learning models, deep learning models, and word embeddings. It covers preprocessing steps such as removing extra whitespace, tokenization, spelling correction, stemming, and stopword handling. It also discusses using TF-IDF features and latent semantic analysis with SVD for classification models. Finally, it discusses using word embeddings to represent text as vectors for sequence models.
6. Agenda
➢ Not so much intro
➢ Where is NLP used
➢ Pre-processing
➢ Machine Learning Models
➢ Solving a problem
➢ Traditional approaches
➢ Deep Learning Models
➢ Muppets
7. Applications of natural language processing
➢ Translation
➢ Sentiment Classification
➢ Chatbots / VAs
➢ Autocomplete
➢ Entity Extraction
➢ Question Answering
➢ Review Rating Prediction
➢ Search Engine
➢ Speech to Text
➢ Topic Extraction
8. Pre-processing the text data
Example of noisy chat input: "can u he.lp me with loan? 😊"
➢ Unintentional characters, abbreviations, symbols, emojis
Cleaned: "can you help me with loan ?"
9. Pre-processing the text data
➢ Removing weird spaces
➢ Tokenization
➢ Spelling correction
➢ Contraction mapping
➢ Stemming
➢ Emoji handling
➢ Stopwords handling
➢ Cleaning HTML
10. Pre-processing the text data
➢ Removing weird spaces
def remove_space(text):
    # collapse runs of whitespace into single spaces and trim the ends
    text = text.strip()
    text = text.split()
    return " ".join(text)
11. Pre-processing the text data
➢ Tokenization
➢ A very important step
➢ It is not always just about splitting on spaces
➢ Converts text into word tokens
➢ Might be different for different languages
➢ Simplest option is `word_tokenize` from NLTK
➢ Or write your own ;)
12. Pre-processing the text data
➢ Tokenization
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

text = "hello, how are you?"
tokens = word_tokenize(text)
print(tokens)
# ['hello', ',', 'how', 'are', 'you', '?']
13. Pre-processing the text data
➢ Spelling correction
➢ A very crucial step
➢ In chat: "can u tel me abot new sim card pland?"
➢ Most models will fail without spelling correction
➢ Peter Norvig’s spelling corrector
➢ Or make your own ;)
14. Pre-processing the text data
➢ Spelling correction
I need a new car insurance
I need aa new car insurance
I ned a new car insuraance
I needd a new carr insurance
I need a neew car insurance
I need a new car insurancee
(Model sketch: embedding layer → stacked bidirectional char-LSTM → output)
15. Pre-processing the text data
➢ Spelling correction
def edits1(word):
    # all strings that are one edit away from `word`
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    # all strings that are two edits away from `word`
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))
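edits1/edits2 only generate candidate strings; Norvig's corrector also needs a word-frequency model to pick the most probable candidate. A minimal sketch of the rest of his corrector, assuming a plain-text corpus file `big.txt` (the corpus Norvig uses) is available locally:

import re
from collections import Counter

# word frequencies estimated from a large corpus
WORDS = Counter(re.findall(r'\w+', open('big.txt').read().lower()))

def P(word, N=sum(WORDS.values())):
    # probability of `word` estimated from corpus counts
    return WORDS[word] / N

def known(words):
    # subset of `words` that actually appear in the dictionary
    return set(w for w in words if w in WORDS)

def candidates(word):
    # known words at edit distance 0, 1, then 2; fall back to the word itself
    return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

def correction(word):
    # most probable spelling correction for `word`
    return max(candidates(word), key=P)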
17. Pre-processing the text data
➢ Contraction mapping
def mapping_replacer(x, dic):
    # replace whole words found in `dic` with their mapped form
    for word in dic.keys():
        if " " + word + " " in x:
            x = x.replace(" " + word + " ", " " + dic[word] + " ")
    return x
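A small usage sketch; the dictionary entries below are illustrative, not the full mapping used in the talk. Note the padding spaces, since the replacer matches " word " patterns:

# illustrative contraction/abbreviation dictionary
contraction_dict = {"can't": "cannot", "won't": "will not", "u": "you", "abt": "about"}

text = " can't u tell me abt the new plan "
print(mapping_replacer(text, contraction_dict))
# " cannot you tell me about the new plan "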
18. Pre-processing the text data
➢ Stemming
➢ Reduces words to root form
➢ Why is stemming important?
➢ NLTK stemmers
19. Pre-processing the text data
➢ Stemming
fishing, fished, fishes → fish
In [1]: from nltk.stem import SnowballStemmer
In [2]: s = SnowballStemmer('english')
In [3]: s.stem("fishing")
Out[3]: 'fish'
20. Pre-processing the text data
➢ Emoji handling
# pip install emoji
import emoji
# dictionary of emoji characters; newer releases of the package expose emoji.EMOJI_DATA instead
emojis = emoji.UNICODE_EMOJI
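A small sketch of handling emojis with the same package; `demojize` converts emojis to textual aliases (the exact alias text and the `is_emoji` helper depend on the package version):

import emoji

text = "can u help me with loan? 😊"

# replace emojis with textual aliases, e.g. :smiling_face_with_smiling_eyes:
print(emoji.demojize(text))

# or drop them entirely (is_emoji is available in newer versions of the package)
no_emoji = "".join(ch for ch in text if not emoji.is_emoji(ch))
print(no_emoji)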
21. Pre-processing the text data
➢ Stopwords handling
Example: "I need new car insurance" → "car insurance" once "I", "need" and "new" are stripped as stopwords
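A minimal stopword-removal sketch with NLTK's English list; note that the default list only drops "I" here, so aggressive filtering like the example above usually needs a custom list:

import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

tokens = word_tokenize("I need new car insurance")
kept = [t for t in tokens if t.lower() not in stop_words]
print(kept)  # 'I' is dropped; 'need', 'new', 'car', 'insurance' remain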
22. Pre-processing the text data
➢ Cleaning HTML
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
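A minimal HTML-stripping sketch with BeautifulSoup (the library linked above); the sample markup is just for illustration:

from bs4 import BeautifulSoup

html = "<p>I need a <b>new</b> car insurance.<br/>Visit <a href='#'>this page</a>.</p>"

# parse the markup and keep only the visible text
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(separator=" ")
print(" ".join(text.split()))  # "I need a new car insurance. Visit this page ."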
26. What kind of models to use?
➢ SVM
➢ Logistic Regression
➢ Gradient Boosting
➢ Neural Networks
28. Quora duplicate question identification
➢ ~ 13 million questions
➢ Many duplicate questions
➢ Cluster and join duplicates together
➢ Remove clutter
29. Non-duplicate questions
➢ Who should I address my cover letter to if I'm applying for a big company like Mozilla?
➢ Which car is better from safety view? "swift or grand i10". My first priority is safety?
➢ How can I start an online shopping (e-commerce) website?
➢ Which web technology is best suitable for building a big E-Commerce website?
30. Duplicate questions
➢ How does Quora quickly mark questions as needing improvement?
➢ Why does Quora mark my questions as needing improvement/clarification before I have time to give it details? Literally within seconds…
➢ What practical applications might evolve from the discovery of the Higgs Boson?
➢ What are some practical benefits of discovery of the Higgs Boson?
31. Dataset
➢ 400,000+ pairs of questions
➢ Initially the data was very skewed
➢ Negative sampling
➢ Noise exists (as usual)
33. Dataset: basic exploration
➢ Average number of characters in question1: 59.57
➢ Minimum number of characters in question1: 1
➢ Maximum number of characters in question1: 623
➢ Average number of characters in question2: 60.14
➢ Minimum number of characters in question2: 1
➢ Maximum number of characters in question2: 1169
34. Basic feature engineering
➢ Length of question1
➢ Length of question2
➢ Difference in the two lengths
➢ Character length of question1 without spaces
➢ Character length of question2 without spaces
➢ Number of words in question1
➢ Number of words in question2
➢ Number of common words in question1 and question2
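A sketch of computing these basic features with pandas, assuming a dataframe `data` with `question1` and `question2` columns as in the Quora dataset; the column names added here are illustrative:

import pandas as pd

def basic_features(data):
    df = data.copy()
    q1 = df.question1.astype(str)
    q2 = df.question2.astype(str)

    df['len_q1'] = q1.str.len()                                    # character length of question1
    df['len_q2'] = q2.str.len()                                    # character length of question2
    df['diff_len'] = df.len_q1 - df.len_q2                         # difference of the two lengths
    df['len_char_q1'] = q1.str.replace(' ', '', regex=False).str.len()  # chars without spaces
    df['len_char_q2'] = q2.str.replace(' ', '', regex=False).str.len()
    df['len_word_q1'] = q1.str.split().str.len()                   # number of words
    df['len_word_q2'] = q2.str.split().str.len()
    df['common_words'] = [                                         # shared words between the pair
        len(set(a.lower().split()) & set(b.lower().split())) for a, b in zip(q1, q2)
    ]
    return df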
38. Fuzzy features
➢ Also known as approximate string matching
➢ Number of “primitive” operations required to transform one string into the other
➢ Primitive operations:
○ Insertion
○ Deletion
○ Substitution
➢ Typically used for:
○ Spell checking
○ Plagiarism detection
○ DNA sequence matching
○ Spam filtering
39. Fuzzy features
➢ pip install fuzzywuzzy
➢ Uses Levenshtein distance
➢ QRatio
➢ WRatio
➢ Token set ratio
➢ Token sort ratio
➢ Partial token set ratio
➢ Partial token sort ratio
https://github.com/seatgeek/fuzzywuzzy
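A sketch of the fuzzy features on a single question pair; all of these ratio functions live in `fuzzywuzzy.fuzz`:

from fuzzywuzzy import fuzz

q1 = "How can I start an online shopping (e-commerce) website?"
q2 = "Which web technology is best suitable for building a big E-Commerce website?"

# each ratio is an integer similarity score between 0 and 100
features = {
    'fuzz_qratio': fuzz.QRatio(q1, q2),
    'fuzz_wratio': fuzz.WRatio(q1, q2),
    'fuzz_partial_ratio': fuzz.partial_ratio(q1, q2),
    'fuzz_token_set_ratio': fuzz.token_set_ratio(q1, q2),
    'fuzz_token_sort_ratio': fuzz.token_sort_ratio(q1, q2),
    'fuzz_partial_token_set_ratio': fuzz.partial_token_set_ratio(q1, q2),
    'fuzz_partial_token_sort_ratio': fuzz.partial_token_sort_ratio(q1, q2),
}
print(features)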
46. Traditional handling of text data
➢ Hashing of words
➢ Count vectorization
➢ TF-IDF
➢ SVD
47. TF-IDF
TF(t) = (number of times term t appears in the document) / (total number of terms in the document)

IDF(t) = log( (total number of documents) / (number of documents containing term t) )

TF-IDF(t) = TF(t) * IDF(t)
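A sketch of the TF-IDF + SVD (latent semantic analysis) pipeline with scikit-learn; the corpus and parameter values are illustrative, not the ones tuned for the talk:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = [
    "how can i start an online shopping website",
    "which web technology is best for building an e-commerce website",
    "what are some practical benefits of the higgs boson",
    "what practical applications might evolve from the higgs boson",
]

# word-level TF-IDF features (unigrams and bigrams)
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X = tfidf.fit_transform(corpus)

# latent semantic analysis: compress the sparse TF-IDF matrix into dense components
svd = TruncatedSVD(n_components=2, random_state=42)
X_svd = svd.fit_transform(X)
print(X_svd.shape)  # (4, 2)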
55. Word embeddings
WORD → [ multi-dimensional dense vector ]
➢ A multi-dimensional vector for every word in the vocabulary
➢ Often provide great insights
➢ Very popular in natural language processing tasks
➢ Google news vectors 300d
➢ GloVe
➢ FastText
58. Word embeddings
import numpy as np

def sent2vec(s, model, stop_words, tokenizer):
    # average the word vectors of a sentence into a single normalized vector
    words = str(s).lower()
    words = tokenizer(words)
    words = [w for w in words if w not in stop_words]
    words = [w for w in words if w.isalpha()]
    M = []
    for w in words:
        if w in model:  # skip out-of-vocabulary words
            M.append(model[w])
    M = np.array(M)
    v = M.sum(axis=0)
    return v / np.sqrt((v ** 2).sum())
63. Kusner, M., Sun, Y., Kolkin, N. & Weinberger, K.. (2015). From Word Embeddings To Document Distances.
Word mover’s distance: WMD
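Word mover's distance is available in gensim on top of pretrained word vectors; a minimal sketch, assuming the Google News word2vec binary is available locally (depending on the gensim version this also needs the pyemd or POT package):

from gensim.models import KeyedVectors

# pretrained 300-d Google News vectors (path is an assumption; large download)
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

s1 = "what are some practical benefits of discovery of the higgs boson".split()
s2 = "what practical applications might evolve from the discovery of the higgs boson".split()

# smaller distance => more similar documents
print(model.wmdistance(s1, s2))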
64. Results comparison
Features                                                  | Logistic Regression Accuracy | XGBoost Accuracy
Basic Features                                            | 0.658                        | 0.721
Basic Features + Fuzzy Features                           | 0.660                        | 0.738
Basic + Fuzzy + Word2Vec Features                         | 0.676                        | 0.766
Word2Vec Features                                         | X                            | 0.78
Basic + Fuzzy + Word2Vec Features + Full Word2Vec Vectors | X                            | 0.814
TFIDF + SVD (Best Combination)                            | 0.804                        | 0.763
66. What can deep learning do?
➢ Natural language processing
➢ Speech processing
➢ Computer vision
➢ And much more
68. 1-D CNN
➢ One dimensional convolutional layer
➢ Temporal convolution
➢ Simple to implement:
# naive temporal convolution (assumes y is pre-allocated with sample_length zeros)
for i in range(sample_length):
    y[i] = 0
    for j in range(kernel_length):
        if i - j >= 0:  # guard against negative (wrap-around) indices
            y[i] += x[i - j] * h[j]
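In Keras this is a single layer; a minimal sketch, assuming 300-d word embeddings and sequences padded to length 40 as in the later slides:

from keras.models import Sequential
from keras.layers import Conv1D, GlobalMaxPooling1D, Dense

model = Sequential()
# 64 temporal filters of width 3 sliding over the 40 x 300 embedded sequence
model.add(Conv1D(64, kernel_size=3, activation='relu', input_shape=(40, 300)))
model.add(GlobalMaxPooling1D())
model.add(Dense(1, activation='sigmoid'))
model.summary()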
69. LSTM
➢ Long short-term memory
➢ A type of RNN
➢ We used two LSTM layers in this model
71. Time distributed dense layer
➢ TimeDistributed wrapper around dense layer
➢ TimeDistributed applies the layer to every temporal slice of input
➢ Followed by Lambda layer
➢ Implements the “translation” layer used by Stephen Merity (keras snli model)
from keras.models import Sequential
from keras.layers import Embedding, TimeDistributed, Dense, Lambda
from keras import backend as K

# word_index and embedding_matrix come from the pre-processing slides
model1 = Sequential()
model1.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=40,
                     trainable=False))
model1.add(TimeDistributed(Dense(300, activation='relu')))
model1.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))
72. Handling text data before training
from keras.preprocessing import text, sequence

# nb_words is the Keras 1.x argument name (num_words in Keras 2)
tk = text.Tokenizer(nb_words=200000)
max_len = 40

tk.fit_on_texts(list(data.question1.values) + list(data.question2.values.astype(str)))

x1 = tk.texts_to_sequences(data.question1.values)
x1 = sequence.pad_sequences(x1, maxlen=max_len)

x2 = tk.texts_to_sequences(data.question2.values.astype(str))
x2 = sequence.pad_sequences(x2, maxlen=max_len)

word_index = tk.word_index
73. Handling text data before training
import numpy as np
from tqdm import tqdm

# load the 300-d GloVe vectors into a word -> vector dictionary
embeddings_index = {}
f = open('glove.840B.300d.txt')
for line in tqdm(f):
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
76. Handling text data before training
# build the embedding matrix row by row; unknown words stay as all-zero rows
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
77. Basis of deep learning model
➢ Keras-snli model: https://github.com/Smerity/keras_snli
88. Time to Train the DeepNet
➢ Total params: 174,913,917
➢ Trainable params: 60,172,917
➢ Non-trainable params: 114,741,000
➢ NVIDIA Titan X
90. Time to Train the DeepNet
➢ The deep network was trained on an NVIDIA Titan X; each epoch took approximately
300 seconds, and training took 10-15 hours overall. The network achieved an
accuracy of 0.848 (~0.85).
➢ The SOTA at that time was around 0.88 (the BiMPM model).
94. BERT
➢ Based on transformer encoder
➢ Each encoder block has self-attention
➢ Encoder blocks: 12 or 24
➢ Hidden size: 768 or 1024
➢ Attention heads: 12 or 16
96. How does BERT learn?
➢ BERT has a fixed vocab
➢ BERT has encoder blocks (transformer blocks)
➢ A word is masked and BERT tries to predict that word
➢ BERT training also tries to predict next sentence
➢ BERT learns by combining the losses from these two objectives
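The masked-word objective is easy to poke at with the Hugging Face transformers library (not used in the talk, just for illustration):

from transformers import pipeline

# BERT fills in the [MASK] token with its most probable predictions
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("I need a new car [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))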
97. BERT tokenization
➢ [CLS] TOKENS [SEP]
➢ [CLS] TOKENS_A [SEP] TOKENS_B [SEP]
Example of tokenization:
hi, everyone! this is tokenization example
[CLS] hi , everyone ! this is token ##ization example [SEP]
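The same tokenization example with the Hugging Face BERT tokenizer (an assumption for illustration; the talk does not specify an implementation):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("hi, everyone! this is tokenization example")
print(tokens)
# ['hi', ',', 'everyone', '!', 'this', 'is', 'token', '##ization', 'example']

# encode() adds the special [CLS] and [SEP] tokens around the WordPiece ids
ids = tokenizer.encode("hi, everyone! this is tokenization example")
print(tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', 'hi', ',', 'everyone', '!', 'this', 'is', 'token', '##ization', 'example', '[SEP]']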
118. ➢ e-mail: abhishek4@gmail.com
➢ Linkedin: linkedin.com/in/abhi1thakur
➢ kaggle: kaggle.com/abhishek
➢ tweet me: @abhi1thakur
➢ YouTube: youtube.com/AbhishekThakurAbhi
Approaching (almost) any machine learning problem: the book will release in Summer 2020.
Fill out the form here to be the first one to know when it’s ready to buy:
http://bit.ly/approachingalmost