Approaching (almost) any NLP problem
@abhi1thakur
AI is like an imaginary friend most enterprises claim to have these days
3
4
5
I like big data
and
I cannot lie
Agenda
6
➢ Not so much intro
➢ Where is NLP used
➢ Pre-processing
➢ Machine Learning Models
➢ Solving a problem
➢ Traditional approaches
➢ Deep Learning Models
➢ Muppets
Applications of natural language processing
Translation
Sentiment Classification
Chatbots / VAs
Autocomplete
Entity Extraction
Question Answering
Review Rating Prediction
Search Engine
Speech to Text
Topic Extraction
Pre-processing the text data
8
Raw: "can u he.lp me with loan? 😊"
(abbreviations, unintentional characters, symbols, emojis)
Cleaned: "can you help me with loan ?"
Pre-processing the text data
9
➢ Removing weird spaces
➢ Tokenization
➢ Spelling correction
➢ Contraction mapping
➢ Stemming
➢ Emoji handling
➢ Stopwords handling
➢ Cleaning HTML
Pre-processing the text data
10
➢ Removing weird spaces
➢ Tokenization
➢ Spelling correction
➢ Contraction mapping
➢ Stemming
➢ Emoji handling
➢ Stopwords handling
➢ Cleaning HTML
def remove_space(text):
    text = text.strip()
    text = text.split()
    return " ".join(text)
Pre-processing the text data
11
➢ Removing weird spaces
➢ Tokenization
➢ Spelling correction
➢ Contraction mapping
➢ Stemming
➢ Emoji handling
➢ Stopwords handling
➢ Cleaning HTML
➢ Very important step
➢ Is not always about spaces
➢ Converts words into tokens
➢ Might be different for different languages
➢ Simplest is to use `word_tokenize` from NLTK
➢ Write your own ;)
Pre-processing the text data
12
➢ Removing weird spaces
➢ Tokenization
➢ Spelling correction
➢ Contraction mapping
➢ Stemming
➢ Emoji handling
➢ Stopwords handling
➢ Cleaning HTML
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
text = "hello, how are you?"
tokens = word_tokenize(text)
print(tokens)
Input:  hello, how are you?
Output: ['hello', ',', 'how', 'are', 'you', '?']
Pre-processing the text data
13
➢ Removing weird spaces
➢ Tokenization
➢ Spelling correction
➢ Contraction mapping
➢ Stemming
➢ Emoji handling
➢ Stopwords handling
➢ Cleaning HTML
➢ A very crucial step
➢ In chat: can u tel me abot new sim card pland?
➢ Most models without spelling correction will fail
➢ Peter Norvig’s spelling correction
➢ Make your own ;)
Pre-processing the text data
14
➢ Removing weird spaces
➢ Tokenization
➢ Spelling correction
➢ Contraction mapping
➢ Stemming
➢ Emoji handling
➢ Stopwords handling
➢ Cleaning HTML
I need a new car insurance
I need aa new car insurance
I ned a new car insuraance
I needd a new carr insurance
I need a neew car insurance
I need a new car insurancee
(Diagram: embeddings layer → bidirectional stacked char-LSTM → output)
Pre-processing the text data
15
➢ Removing weird spaces
➢ Tokenization
➢ Spelling correction
➢ Contraction mapping
➢ Stemming
➢ Emoji handling
➢ Stopwords handling
➢ Cleaning HTML
def edits1(word):
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))
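These edit sets only generate candidates; a minimal sketch of how Norvig's corrector picks among them, assuming a WORDS frequency counter built from a large plain-text corpus (the 'big.txt' file name is from Norvig's own example):

import re
from collections import Counter

# Assumption: any large plain-text corpus works here
WORDS = Counter(re.findall(r'\w+', open('big.txt').read().lower()))

def known(words):
    # Keep only candidates that occur in the corpus vocabulary
    return set(w for w in words if w in WORDS)

def candidates(word):
    # Prefer the word itself, then edit distance 1, then edit distance 2
    return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

def correction(word):
    # The most frequent known candidate wins
    return max(candidates(word), key=WORDS.get)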
Pre-processing the text data
16
➢ Removing weird spaces
➢ Tokenization
➢ Spelling correction
➢ Contraction mapping
➢ Stemming
➢ Emoji handling
➢ Stopwords handling
➢ Cleaning HTML
contraction = {
    "'cause": 'because',
    ',cause': 'because',
    ';cause': 'because',
    "ain't": 'am not',
    'ain,t': 'am not',
    'ain;t': 'am not',
    'ain´t': 'am not',
    'ain’t': 'am not',
    "aren't": 'are not',
    'aren,t': 'are not',
    'aren;t': 'are not',
    'aren´t': 'are not',
    'aren’t': 'are not'
}
Pre-processing the text data
17
➢ Removing weird spaces
➢ Tokenization
➢ Spelling correction
➢ Contraction mapping
➢ Stemming
➢ Emoji handling
➢ Stopwords handling
➢ Cleaning HTML
def mapping_replacer(x, dic):
    for word in dic.keys():
        if " " + word + " " in x:
            x = x.replace(" " + word + " ", " " + dic[word] + " ")
    return x
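A quick usage sketch (the sentence is illustrative; the surrounding spaces matter because the mapping only matches space-delimited words):

text = " i ain't sure about this "
print(mapping_replacer(text, contraction))
# " i am not sure about this "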
Pre-processing the text data
18
➢ Removing weird spaces
➢ Tokenization
➢ Spelling correction
➢ Contraction mapping
➢ Stemming
➢ Emoji handling
➢ Stopwords handling
➢ Cleaning HTML
➢ Reduces words to root form
➢ Why is stemming important?
➢ NLTK stemmers
Pre-processing the text data
19
➢ Removing weird spaces
➢ Tokenization
➢ Spelling correction
➢ Contraction mapping
➢ Stemming
➢ Emoji handling
➢ Stopwords handling
➢ Cleaning HTML
fishing, fished, fishes → fish
In [1]: from nltk.stem import SnowballStemmer
In [2]: s = SnowballStemmer('english')
In [3]: s.stem("fishing")
Out[3]: 'fish'
Pre-processing the text data
20
➢ Removing weird spaces
➢ Tokenization
➢ Spelling correction
➢ Contraction mapping
➢ Stemming
➢ Emoji handling
➢ Stopwords handling
➢ Cleaning HTML
pip install emoji

import emoji
# UNICODE_EMOJI is the emoji lookup dict in older versions of the
# emoji package (it was removed in emoji 2.0)
emojis = emoji.UNICODE_EMOJI
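A common way to handle them is to replace each emoji with its text description; a minimal sketch using emoji.demojize (available in current versions of the package):

import emoji

text = "can u help me with loan? 😊"
print(emoji.demojize(text))
# can u help me with loan? :smiling_face_with_smiling_eyes: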
Pre-processing the text data
21
➢ Removing weird spaces
➢ Tokenization
➢ Spelling correction
➢ Contraction mapping
➢ Stemming
➢ Emoji handling
➢ Stopwords handling
➢ Cleaning HTML
"I need new car insurance" → stopwords ("I", "need", "new") removed → "car insurance"
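A minimal stopword-removal sketch with NLTK's English list; note that NLTK's default list keeps "need" and "new", so the slide's example assumes a broader custom list:

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
tokens = "i need new car insurance".split()
print([t for t in tokens if t not in stop_words])
# ['need', 'new', 'car', 'insurance']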
Pre-processing the text data
22
➢ Removing weird spaces
➢ Tokenization
➢ Spelling correction
➢ Contraction mapping
➢ Stemming
➢ Emoji handling
➢ Stopwords handling
➢ Cleaning HTML
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
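The slide only shows the link above; a minimal HTML-cleaning sketch with BeautifulSoup:

from bs4 import BeautifulSoup

html = "<p>I need a <b>new</b> car insurance</p>"
print(BeautifulSoup(html, "html.parser").get_text())
# I need a new car insurance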
What kind of models to use?
26
➢ SVM
➢ Logistic Regression
➢ Gradient Boosting
➢ Neural Networks
Let’s look at a problem
27
Quora duplicate question identification
28
➢ ~ 13 million questions
➢ Many duplicate questions
➢ Cluster and join duplicates together
➢ Remove clutter
Non-duplicate questions
29
➢ Who should I address my cover letter to if I'm applying for a big company like Mozilla?
➢ Which car is better from safety view?""swift or grand i10"".My first priority is safety?
➢ How can I start an online shopping (e-commerce) website?
➢ Which web technology is best suitable for building a big E-Commerce website?
Duplicate questions
30
➢ How does Quora quickly mark questions as needing improvement?
➢ Why does Quora mark my questions as needing improvement/clarification before I have time to give it details? Literally within seconds…
➢ What practical applications might evolve from the discovery of the Higgs Boson?
➢ What are some practical benefits of discovery of the Higgs Boson?
Dataset
31
➢ 400,000+ pairs of questions
➢ Initially data was very skewed
➢ Negative sampling
➢ Noise exists (as usual)
Dataset
32
➢ 255045 negative samples (non-duplicates)
➢ 149306 positive samples (duplicates)
➢ ~37% positive samples
Dataset: basic exploration
33
➢ Average number of characters in question1: 59.57
➢ Minimum number of characters in question1: 1
➢ Maximum number of characters in question1: 623
➢ Average number of characters in question2: 60.14
➢ Minimum number of characters in question2: 1
➢ Maximum number of characters in question2: 1169
Basic feature engineering
34
➢ Length of question1
➢ Length of question2
➢ Difference in the two lengths
➢ Character length of question1 without spaces
➢ Character length of question2 without spaces
➢ Number of words in question1
➢ Number of words in question2
➢ Number of common words in question1 and question2
Basic feature engineering
35
data['len_q1'] = data.question1.apply(lambda x: len(str(x)))
data['len_q2'] = data.question2.apply(lambda x: len(str(x)))
data['diff_len'] = data.len_q1 - data.len_q2
# Note: set() means these two count *unique* characters (spaces excluded)
data['len_char_q1'] = data.question1.apply(lambda x: len(''.join(set(str(x).replace(' ', '')))))
data['len_char_q2'] = data.question2.apply(lambda x: len(''.join(set(str(x).replace(' ', '')))))
data['len_word_q1'] = data.question1.apply(lambda x: len(str(x).split()))
data['len_word_q2'] = data.question2.apply(lambda x: len(str(x).split()))
data['len_common_words'] = data.apply(
    lambda x: len(
        set(str(x['question1']).lower().split()).intersection(
            set(str(x['question2']).lower().split()))),
    axis=1)
Basic modelling
Tabular data (basic features) → normalization → training / validation split
Logistic Regression: 0.658
XGB: 0.721
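A minimal sketch of this baseline, assuming the feature columns built above and the dataset's is_duplicate label (the split and model parameters are illustrative):

import xgboost as xgb
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

features = ['len_q1', 'len_q2', 'diff_len', 'len_char_q1', 'len_char_q2',
            'len_word_q1', 'len_word_q2', 'len_common_words']
xtrain, xvalid, ytrain, yvalid = train_test_split(
    data[features], data.is_duplicate, test_size=0.2, random_state=42)

# Logistic regression benefits from normalized inputs
scaler = StandardScaler().fit(xtrain)
lr = LogisticRegression().fit(scaler.transform(xtrain), ytrain)
print(lr.score(scaler.transform(xvalid), yvalid))

# Gradient boosting handles raw tabular features directly
clf = xgb.XGBClassifier().fit(xtrain, ytrain)
print(clf.score(xvalid, yvalid))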
Fuzzy features
38
➢ Also known as approximate string matching
➢ Number of “primitive” operations required to convert one string into the other
➢ Primitive operations:
○ Insertion
○ Deletion
○ Substitution
➢ Typically used for:
○ Spell checking
○ Plagiarism detection
○ DNA sequence matching
○ Spam filtering
Fuzzy features
39
➢ pip install fuzzywuzzy
➢ Uses Levenshtein distance
➢ QRatio
➢ WRatio
➢ Token set ratio
➢ Token sort ratio
➢ Partial token set ratio
➢ Partial token sort ratio
https://github.com/seatgeek/fuzzywuzzy
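A couple of illustrative scores (these examples and values are from the fuzzywuzzy README):

from fuzzywuzzy import fuzz

fuzz.ratio("this is a test", "this is a test!")  # 97
fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")  # 100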
Fuzzy features
40
data['fuzz_qratio'] = data.apply(
    lambda x: fuzz.QRatio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_WRatio'] = data.apply(
    lambda x: fuzz.WRatio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_partial_ratio'] = data.apply(
    lambda x: fuzz.partial_ratio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_partial_token_set_ratio'] = data.apply(
    lambda x: fuzz.partial_token_set_ratio(str(x['question1']), str(x['question2'])), axis=1)
Fuzzy features
41
data['fuzz_partial_token_sort_ratio'] = data.apply(
    lambda x: fuzz.partial_token_sort_ratio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_token_set_ratio'] = data.apply(
    lambda x: fuzz.token_set_ratio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_token_sort_ratio'] = data.apply(
    lambda x: fuzz.token_sort_ratio(str(x['question1']), str(x['question2'])), axis=1)
Improving models
Tabular data (basic + fuzzy features) → normalization → training / validation split
Logistic Regression: 0.658 → 0.660
XGB: 0.721 → 0.738
Can we improve it further?
43
Traditional handling of text data
46
➢ Hashing of words
➢ Count vectorization
➢ TF-IDF
➢ SVD
TF-IDF
47
TF(t) = (number of times term t appears in a document) / (total number of terms in the document)
IDF(t) = log((total number of documents) / (number of documents with term t in it))
TF-IDF(t) = TF(t) * IDF(t)
TF-IDF
48
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    min_df=3,
    max_features=None,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    ngram_range=(1, 2),
    use_idf=1,
    smooth_idf=1,
    sublinear_tf=1,
    stop_words='english'
)
SVD
49
➢ Latent semantic analysis
➢ scikit-learn version of SVD
➢ 120 components
from sklearn import decomposition

svd = decomposition.TruncatedSVD(n_components=120)
xtrain_svd = svd.fit_transform(xtrain)
xtest_svd = svd.transform(xtest)
Simply using TF-IDF: method-1
Question-1 → TF-IDF; Question-2 → TF-IDF (separate vectorizers, features stacked)
Logistic Regression: 0.658 → 0.660 → 0.777
XGB: 0.721 → 0.738 → 0.749
Simply using TF-IDF: method-2
Question-1 + Question-2 → a single TF-IDF
Logistic Regression: 0.658 → 0.660 → 0.804
XGB: 0.721 → 0.738 → 0.748
Simply using TF-IDF + SVD: method-1
Question-1 → TF-IDF → SVD; Question-2 → TF-IDF → SVD (separate for each question)
Logistic Regression: 0.658 → 0.660 → 0.706
XGB: 0.721 → 0.738 → 0.763
Simply using TF-IDF + SVD: method-2
Question-1 → TF-IDF; Question-2 → TF-IDF; a single SVD over the stacked features
Logistic Regression: 0.658 → 0.660 → 0.700
XGB: 0.721 → 0.738 → 0.753
Simply using TF-IDF + SVD: method-3
Question-1 + Question-2 → a single TF-IDF → SVD
Logistic Regression: 0.658 → 0.660 → 0.714
XGB: 0.721 → 0.738 → 0.759
Word embeddings
WORD → [ v1 | v2 | v3 | ... | vn ]
➢ Multi-dimensional vector for all the words in any dictionary
➢ Often yield great insights
➢ Very popular in natural language processing tasks
➢ Google news vectors 300d
➢ GloVe
➢ FastText
Word embeddings
Every word gets a position in space:
Berlin - Germany + France ≈ Paris
Word embeddings
➢ Embeddings for words
➢ Embeddings for whole sentence
Word embeddings
def sent2vec(s, model, stop_words, tokenizer):
    words = str(s).lower()
    words = tokenizer(words)
    words = [w for w in words if w not in stop_words]
    words = [w for w in words if w.isalpha()]
    M = []
    for w in words:
        # Note: words missing from the embedding model will raise a KeyError
        M.append(model[w])
    M = np.array(M)
    v = M.sum(axis=0)
    # L2-normalize the summed vector
    return v / np.sqrt((v ** 2).sum())
Word embeddings features
Spatial distances:
➢ Euclidean
➢ Manhattan
➢ Cosine
➢ Canberra
➢ Minkowski
➢ Bray-Curtis
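A minimal sketch of computing these as features from two sent2vec vectors with scipy, assuming model, stop_words, and word_tokenize are the objects loaded earlier (the sentences are illustrative):

from scipy.spatial import distance

v1 = sent2vec("I need a new car insurance", model, stop_words, word_tokenize)
v2 = sent2vec("I need new insurance for my car", model, stop_words, word_tokenize)

dist_features = {
    'euclidean': distance.euclidean(v1, v2),
    'manhattan': distance.cityblock(v1, v2),
    'cosine': distance.cosine(v1, v2),
    'canberra': distance.canberra(v1, v2),
    'minkowski': distance.minkowski(v1, v2, 3),
    'braycurtis': distance.braycurtis(v1, v2),
}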
Word embeddings features
Statistical features: skew and kurtosis
➢ Skew = 0 for a normal distribution
➢ Skew > 0: longer tail on the right
➢ Kurtosis: 4th central moment over the square of the variance
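These are computed over the components of each sentence vector; a minimal sketch with scipy.stats, reusing v1 and v2 from the distance sketch above:

from scipy.stats import skew, kurtosis

stat_features = {
    'skew_q1': skew(v1),
    'skew_q2': skew(v2),
    'kurtosis_q1': kurtosis(v1),
    'kurtosis_q2': kurtosis(v2),
}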
Kusner, M., Sun, Y., Kolkin, N. & Weinberger, K.. (2015). From Word Embeddings To Document Distances.
Word mover’s distance: WMD
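gensim exposes WMD directly on word-embedding models; a minimal sketch (wmdistance is gensim's API and additionally requires the pyemd or POT package; the vector file is the Google News one mentioned earlier):

import gensim

w2v = gensim.models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz', binary=True)

d = w2v.wmdistance("i need a new car insurance".split(),
                   "i need new insurance for my car".split())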
Results comparison

Features                                                    Logistic Regression   XGBoost
Basic Features                                              0.658                 0.721
Basic Features + Fuzzy Features                             0.660                 0.738
Basic + Fuzzy + Word2Vec Features                           0.676                 0.766
Word2Vec Features                                           X                     0.78
Basic + Fuzzy + Word2Vec Features + Full Word2Vec Vectors   X                     0.814
TFIDF + SVD (Best Combination)                              0.804                 0.763
What can deep learning do?
➢ Natural language processing
➢ Speech processing
➢ Computer vision
➢ And more and more
1-D CNN
➢ One dimensional convolutional layer
➢ Temporal convolution
➢ Simple to implement:
for i in range(sample_length):
    y[i] = 0
    for j in range(kernel_length):
        # Convolve input x with kernel h (boundary handling omitted)
        y[i] += x[i - j] * h[j]
LSTM
➢ Long short-term memory
➢ A type of RNN
➢ Used two LSTM layers
Embedding layers
➢ Simple layer
➢ Converts indexes to vectors
➢ [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]
Time distributed dense layer
➢ TimeDistributed wrapper around dense layer
➢ TimeDistributed applies the layer to every temporal slice of input
➢ Followed by Lambda layer
➢ Implements “translation” layer used by Stephen Merity (keras snli model)
model1 = Sequential()
model1.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=40,
                     trainable=False))
model1.add(TimeDistributed(Dense(300, activation='relu')))
model1.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))
Handling text data before training
tk = text.Tokenizer(nb_words=200000)  # nb_words was renamed num_words in Keras 2
max_len = 40
tk.fit_on_texts(list(data.question1.values) + list(data.question2.values.astype(str)))
x1 = tk.texts_to_sequences(data.question1.values)
x1 = sequence.pad_sequences(x1, maxlen=max_len)
x2 = tk.texts_to_sequences(data.question2.values.astype(str))
x2 = sequence.pad_sequences(x2, maxlen=max_len)
word_index = tk.word_index
Handling text data before training
embeddings_index = {}
f = open('glove.840B.300d.txt')
for line in tqdm(f):
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
Handling text data before training
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
Basis of deep learning model
➢ Keras-snli model: https://github.com/Smerity/keras_snli
Creating the deep learning model
Final Deep Learning Model
Model 1 and Model 2
model1 = Sequential()
model1.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=40,
                     trainable=False))
model1.add(TimeDistributed(Dense(300, activation='relu')))
model1.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))

model2 = Sequential()
model2.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=40,
                     trainable=False))
model2.add(TimeDistributed(Dense(300, activation='relu')))
model2.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))
Final Deep Learning Model
Model 3 and Model 4
model3 = Sequential()
model3.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=40,
                     trainable=False))
# nb_filter and filter_length are hyperparameters set elsewhere in the script
model3.add(Convolution1D(nb_filter=nb_filter,
                         filter_length=filter_length,
                         border_mode='valid',
                         activation='relu',
                         subsample_length=1))
model3.add(Dropout(0.2))
.
.
.
model3.add(Dense(300))
model3.add(Dropout(0.2))
model3.add(BatchNormalization())
Final Deep Learning Model
Model 5 and Model 6
model5 = Sequential()
model5.add(Embedding(len(word_index) + 1, 300, input_length=40, dropout=0.2))
model5.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))

model6 = Sequential()
model6.add(Embedding(len(word_index) + 1, 300, input_length=40, dropout=0.2))
model6.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))
Final Deep Learning Model
Merged Model
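The merged-model slide is an image; a rough sketch of the idea under stated assumptions: the outputs of the six towers above are concatenated and passed through dense blocks to a sigmoid (out1..out6 and q1_in/q2_in are hypothetical names; the exact depths and sizes are not shown in the deck). In the Keras functional API:

from keras.layers import concatenate, Dense, Dropout, BatchNormalization
from keras.models import Model

# Assumption: out1..out6 are the tower outputs, q1_in/q2_in the question inputs
merged = concatenate([out1, out2, out3, out4, out5, out6])
x = Dense(300, activation='relu')(merged)
x = Dropout(0.2)(x)
x = BatchNormalization()(x)
preds = Dense(1, activation='sigmoid')(x)

model = Model(inputs=[q1_in, q2_in], outputs=preds)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])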
Time to Train the DeepNet
➢ Total params: 174,913,917
➢ Trainable params: 60,172,917
➢ Non-trainable params: 114,741,000
➢ NVIDIA Titan X
Time to Train the DeepNet
➢ The deep network was trained on an NVIDIA Titan X; each epoch took approximately 300 seconds, and training took 10-15 hours in total. The network achieved an accuracy of 0.848 (~0.85).
➢ The SOTA at that time was around 0.88 (the Bi-MPM model).
Can we end without talking about the muppets?
Of course!
Just kidding, no!
BERT
➢ Based on transformer encoder
➢ Each encoder block has self-attention
➢ Encoder blocks: 12 or 24
➢ Feed forward hidden units: 768 or 1024
➢ Attention heads: 12 or 16
BERT encoder block
(Diagram: an encoder block takes a sequence of up to 512 inputs and outputs one vector of size 768 or 1024 per input.)
How does BERT learn?
➢ BERT has a fixed vocab
➢ BERT has encoder blocks (transformer blocks)
➢ A word is masked and BERT tries to predict that word
➢ BERT training also tries to predict the next sentence
➢ Combining the losses from the two approaches above, BERT learns
BERT tokenization
➢ [CLS] TOKENS [SEP]
➢ [CLS] TOKENS_A [SEP] TOKENS_B [SEP]
Example of tokenization:
hi, everyone! this is tokenization example
[CLS] hi , everyone ! this is token ##ization example [SEP]
BERT tokenization
https://github.com/huggingface/tokenizers
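A minimal sketch of the tokenization above with the Hugging Face transformers library (bert-base-uncased is the standard public checkpoint):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer.tokenize("hi, everyone! this is tokenization example"))
# ['hi', ',', 'everyone', '!', 'this', 'is', 'token', '##ization', 'example']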
Approaching duplicate questions using BERT
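The slides for this part are screenshots; a minimal fine-tuning sketch under stated assumptions, feeding question pairs through the transformers sequence-pair API (the hyperparameters and example pair are illustrative):

import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=2)

# Question pairs are encoded as [CLS] question1 [SEP] question2 [SEP]
enc = tokenizer("How do I read a book?", "How can I read a book?",
                truncation=True, padding='max_length', max_length=128,
                return_tensors='pt')
labels = torch.tensor([1])  # 1 = duplicate

out = model(**enc, labels=labels)
out.loss.backward()  # plug this into an optimizer / training loop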
There is a lot more….
Maybe next time!
Few things to remember...
Fine-tuning often gives good results
➢ It is faster
➢ It is better (not always)
➢ Why reinvent the wheel?
Bigger isn’t always better
A good model has some key ingredients...
Sugar: understanding the data, exploring the data
Spice: pre-processing, feature engineering, feature selection
All the things that are nice: a good cross validation, a low error rate, a simple model or a combination of models, post-processing
Chemical X
= A Good Machine Learning Model
➢ e-mail: abhishek4@gmail.com
➢ Linkedin: linkedin.com/in/abhi1thakur
➢ kaggle: kaggle.com/abhishek
➢ tweet me: @abhi1thakur
➢ YouTube: youtube.com/AbhishekThakurAbhi
Approaching (almost) any
machine learning problem:
the book will release in
Summer 2020.
Fill out the form here to be the
first one to know when it’s
ready to buy:
http://bit.ly/approachingalmost