50 Shades of Text - Leveraging Natural
Language Processing
Alessandro Panebianco
Agenda
• About Me

• Natural Language Processing

• Vectorization Techniques

• Word Embeddings

• Sentence Embeddings

• Demo

• Lessons Learned
2
• Computer Engineering 

• Data Science Consultancy

• E-commerce 

• Energy&Utilities
About me
3
email: ale.panebianco@me.com
Natural Language Processing
Language is the method of human communication, either
spoken or written, consisting of the use of words in a
structured and conventional way
4
Natural Language Processing
The goal of Natural Language Processing is for
computers to achieve human-like comprehension of
texts/languages
5
Natural Language Processing
Why?
https://youtu.be/lXUQ-DdSDoE?t=81
6
Natural Language Processing
Applications
• Machine translation (Google Translate)

• Natural language generation (Reddit bot)

• Sentiment analysis (Cambridge Analytica)

• Lexical semantics (Thesaurus)

• Web and application search (Amazon)

• Question answering (chatbots)

…. and many others
7
How do we enable machines to
interpret language?
Transforming raw text into numerical features
8
[Diagram: raw text → Vector, via Hashing Trick, Bag of Words, TF-IDF, Word2Vec, GloVe, FastText]
Vectorization Techniques
Bag of words
How to go from words to vectors?

S1: Without music life would be a mistake
S2: Radiohead are a great music band

     without  music  life  would  be  a  mistake  Radiohead  are  great  band
S1      1       1      1     1    1   1     1         0       0     0     0
S2      0       1      0     0    0   1     0         1       1     1     1
9
๏ Dictionary size
๏ Sparsity
๏ Word order absence
✓ Easy to implement
✓ Fast
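As a concrete illustration, here is a minimal bag-of-words sketch in plain Python that reproduces the table above (the vocabulary is built in order of first appearance):

```python
# Minimal bag-of-words sketch: build a vocabulary from the corpus,
# then count word occurrences per sentence.
s1 = "Without music life would be a mistake"
s2 = "Radiohead are a great music band"
corpus = [s1.lower().split(), s2.lower().split()]

# Vocabulary in order of first appearance (matches the table above)
vocab = []
for tokens in corpus:
    for tok in tokens:
        if tok not in vocab:
            vocab.append(tok)

def bag_of_words(tokens, vocab):
    # One count per vocabulary entry; word order is lost
    return [tokens.count(word) for word in vocab]

for name, tokens in zip(["S1", "S2"], corpus):
    print(name, bag_of_words(tokens, vocab))
# S1 [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
# S2 [0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1]
```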
Vectorization Techniques (II)
Hashing Trick
๏ Hash is one-way

๏ Different inputs can collide into the same output
10
✓ Same input -> Same output

✓ Range is always fixed (vector size)
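A minimal sketch of the hashing trick, using Python's built-in hash as a stand-in for a stable hash function (real implementations such as scikit-learn's HashingVectorizer use MurmurHash; Python's string hash changes between processes):

```python
def hashing_vectorize(tokens, n_features=8):
    # Fixed-size vector regardless of vocabulary size: no dictionary needed
    vec = [0] * n_features
    for tok in tokens:
        # hash() is one-way; different words may collide on the same index
        idx = hash(tok) % n_features
        vec[idx] += 1
    return vec

print(hashing_vectorize("Without music life would be a mistake".lower().split()))
```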
Vectorization Techniques (III)
TF-IDF
Term Frequency - Inverse Document Frequency: weight rare words higher than common words

     without  music  life  would  be    a   mistake  Radiohead  are  great  band
S1     0.3      0     0.3   0.3   0.3   0     0.3        0       0     0      0
S2      0       0      0     0     0    0      0        0.3     0.3   0.3    0.3
11
๏ Dictionary size
๏ Sparsity
๏ Word order absence
✓ Easy to implement
✓ Fast
✓ Weight words
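A minimal TF-IDF sketch using the classic tf x log(N/df) weighting; the exact numbers in the table depend on the variant and normalization used, but the key effect is visible either way: words that occur in every document ("music", "a") get zero weight.

```python
import math

corpus = ["Without music life would be a mistake".lower().split(),
          "Radiohead are a great music band".lower().split()]
vocab = sorted({tok for doc in corpus for tok in doc})

def tf_idf(doc, corpus, vocab):
    n_docs = len(corpus)
    weights = []
    for word in vocab:
        tf = doc.count(word) / len(doc)            # term frequency
        df = sum(1 for d in corpus if word in d)   # document frequency
        idf = math.log(n_docs / df)                # rare words -> higher weight
        weights.append(round(tf * idf, 3))
    return weights

for name, doc in zip(["S1", "S2"], corpus):
    print(name, dict(zip(vocab, tf_idf(doc, corpus, vocab))))
```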
Word Embeddings
Word2Vec
• The goal of word embeddings is to generate vectors
encoding semantics
12
• Word2Vec does this by starting from randomly initialized vectors
and training them so that words sharing a context window end up
with high cosine similarity
Context window sliding over: "Without music life would be a mistake"
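A minimal training sketch with gensim (assuming gensim 4, where the vector size argument is vector_size); in practice Word2Vec is trained on a far larger corpus than two sentences.

```python
from gensim.models import Word2Vec

sentences = ["Without music life would be a mistake".lower().split(),
             "Radiohead are a great music band".lower().split()]

# sg=1 -> skip-gram; window is the context window shown above
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

print(model.wv["music"][:5])            # the learned vector (first 5 dimensions)
print(model.wv.most_similar("music"))   # nearest neighbours by cosine similarity
```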
Word Embeddings
Word2Vec (II)
[Diagram: King → Queen and Man → Woman vector offsets]
13
• Analogies

• Synonyms

• Syntactic-Semantic vectors 

• Part-of-speech tagging

• Named entity recognition
King - Man + Woman = Queen
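The analogy can be reproduced with pretrained vectors; a sketch assuming gensim's downloader API and the Google News Word2Vec model (any sufficiently large pretrained model would do):

```python
import gensim.downloader as api

# Pretrained vectors (large download); any big pretrained model works
wv = api.load("word2vec-google-news-300")

# king - man + woman ≈ queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```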
Word Embeddings
GloVe
• It differs from word2vec in being a count-based model
rather than a predictive one

• It performs dimensionality reduction on the word
co-occurrence counts matrix

• Similarity between vectors is still measured with cosine similarity
14
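A sketch of loading pretrained GloVe vectors (the glove.6B files linked in the Demo section; the 100-dimensional file name is an assumption) and comparing words by cosine similarity:

```python
import numpy as np

def load_glove(path):
    # Each line: the word followed by its vector components
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

glove = load_glove("glove.6B.100d.txt")   # assumed file from glove.6B.zip

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(glove["music"], glove["band"]))
print(cosine(glove["music"], glove["mistake"]))
```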
Word Embeddings
FastText
FastText : Word Embeddings = XGBoost : Random Forest

FastText extends word-level embeddings with character n-grams (subword
information), which helps with rare words and misspellings
15
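A gensim FastText sketch illustrating the subword idea; corpus and parameters are placeholders, and the point is that an unseen or misspelled word still gets a vector built from its character n-grams:

```python
from gensim.models import FastText

sentences = ["Without music life would be a mistake".lower().split(),
             "Radiohead are a great music band".lower().split()]

model = FastText(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# Out-of-vocabulary / misspelled word still gets a vector from its n-grams
print(model.wv["musik"][:5])
print(model.wv.similarity("music", "musik"))
```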
Sentence Embeddings
What if we want to represent more than a single word?

Many techniques have been utilized:

• Common aggregation operations (avg, sum,
concatenation, etc.; a minimal averaging sketch follows this list)

• Doc2Vec

• Neural Networks (CNN,LSTM,etc.)
16
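A minimal sketch of the simplest aggregation, averaging word vectors; the word_vectors lookup here is a random stand-in so the snippet runs on its own, but the same function works with the GloVe dictionary loaded in the earlier sketch.

```python
import numpy as np

# Stand-in word vectors; in practice use GloVe/Word2Vec lookups
rng = np.random.default_rng(0)
vocab = "without music life would be a mistake radiohead are great band".split()
word_vectors = {w: rng.normal(size=100).astype(np.float32) for w in vocab}

def sentence_vector(sentence, word_vectors, dim=100):
    # Average the vectors of known words; unknown words are skipped
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    if not vecs:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vecs, axis=0)

s1 = sentence_vector("Without music life would be a mistake", word_vectors)
s2 = sentence_vector("Radiohead are a great music band", word_vectors)
print(s1.shape)   # (100,) -> one fixed-size vector per sentence
```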
Sentence Embeddings (II)
Doc2Vec
Every paragraph is mapped to a unique vector

The paragraph token can be thought of as another word: it
acts as a memory that remembers what is missing from the
current context, i.e. the topic of the paragraph
17
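A minimal gensim Doc2Vec sketch (parameters are placeholders): each tagged document gets its own learned vector, the paragraph token described above.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

raw_docs = ["Without music life would be a mistake",
            "Radiohead are a great music band"]

# Each document is tagged; the tag's vector is the paragraph vector
documents = [TaggedDocument(words=doc.lower().split(), tags=[i])
             for i, doc in enumerate(raw_docs)]

model = Doc2Vec(documents, vector_size=50, min_count=1, epochs=50)

print(model.dv[0][:5])                                              # learned paragraph vector
print(model.infer_vector("music would be great".lower().split()))   # vector for unseen text
```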
Sentence Embeddings (III)
CNN
• Stacking word vectors together creates a matrix (like an image)

• Filters act like word scans (e.g. catching misspellings)

• Max pooling highlights the most important words
(e.g. which item a query is about)

• An LSTM layer can be added to keep the word order
18
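A Keras sketch of this CNN idea (vocabulary size, sequence length and the single relevance output are assumptions): stacked word embeddings form a matrix, 1-D convolutions scan word windows, and global max pooling keeps the strongest feature per filter.

```python
import tensorflow as tf

VOCAB_SIZE, SEQ_LEN, EMB_DIM = 10_000, 50, 100   # assumed sizes

model = tf.keras.Sequential([
    tf.keras.Input(shape=(SEQ_LEN,)),                               # token ids
    tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM),                 # words -> matrix of word vectors
    tf.keras.layers.Conv1D(128, kernel_size=3, activation="relu"),  # filters scan 3-word windows
    tf.keras.layers.GlobalMaxPooling1D(),                           # keep strongest feature per filter
    tf.keras.layers.Dense(1, activation="sigmoid"),                 # e.g. search-relevance score
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```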
Sentence Embeddings (IV)
LSTM
• RNNs resemble how we process language, word by word (e.g. reading
a Google search query)

• The LSTM layer generates a new encoding of the original
input that preserves the word order
(return_sequences=True)

• The convolution layer then filters the most important local
features (e.g. which item a query is about)
19
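A Keras sketch of this LSTM variant (sizes and the single output are assumptions): the LSTM re-encodes the sequence position by position (return_sequences=True), then a convolution plus max pooling picks out the most important local features.

```python
import tensorflow as tf

VOCAB_SIZE, SEQ_LEN, EMB_DIM = 10_000, 50, 100   # assumed sizes

model = tf.keras.Sequential([
    tf.keras.Input(shape=(SEQ_LEN,)),
    tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM),
    # return_sequences=True keeps one encoding per position, preserving word order
    tf.keras.layers.LSTM(128, return_sequences=True),
    tf.keras.layers.Conv1D(64, kernel_size=3, activation="relu"),   # most important local features
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```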
Demo
Training Data:

https://www.kaggle.com/c/home-depot-product-search-
relevance/data

GloVe vectors:

http://nlp.stanford.edu/data/glove.6B.zip
20
Lessons Learned
• NLP is one of the most mature research fields in the AI
space 

• Train your own word embeddings when your domain has an
ad-hoc vocabulary

• With a large corpus, try FastText

• With short texts (e.g. user queries), experiment with higher text
granularity (n-grams, characters)

• Explore sentence embeddings through Neural Networks
21
Questions?
