2. What is NLP?
Natural Language Processing
NLP is all about creating systems that process or “understand”
human language in order to perform certain tasks
3. NLP USE IN BUSINESS
Creditworthiness assessment
Neural machine translation
Hiring and recruitment
Chatbots
Sentiment analysis
Advertising
Market intelligence
Healthcare
4. LITTLE BIT OF HISTORY
1950: A. Turing publishes "Computing Machinery and Intelligence"
1954: Georgetown-IBM experiment, automatic translation of Russian sentences into English
1960s: SHRDLU, operating in restricted "blocks worlds"
1964-66: ELIZA, simulated conversation
1970s: "conceptual ontologies" as computational models
Late 80s-90s: the statistical revolution
Mid 2000s: Machine Learning
2010s: Deep Learning
Today: the "ImageNet moment" of NLP
5. A changing field
Rules-based vs statistical approach
✔ Until the end of the 80s NLP systems were designed by hand-coding a set of rules: this is rarely robust to natural language variation
✔ Machine-learning paradigm: using statistical inference to automatically learn such rules through the analysis of large corpora of typical real-world examples
PROs:
• focus on the most common cases
• robust to unfamiliar or erroneous input
• more accurate simply by supplying more input data
CONs:
• data availability
• precision and accuracy
6. NLP Tools
● Context-free grammars
● Regular expressions
● Tokenization
● Parse trees
● N-grams
● Linear algebra
● Statistical inference
● Neural nets
● Word embeddings
● Machine and deep learning
7. Commonly used Python libraries
nltk: very broad NLP library
spaCy: parse trees, tokenizer, opinionated
gensim: topic modeling and similarity
fasttext: text classification and representation learning
sklearn: general purpose Python ML library
fastai: built on top of PyTorch
TF.Text: a collection of text-related classes and ops ready to use with TensorFlow 2.0
9. The Corpora
A corpus usually contains raw text (in ASCII or
UTF-8) and any metadata associated with the text.
> import nltk              # corpora must first be downloaded, e.g. nltk.download('brown')
> from nltk.corpus import words   # words.words() returns a plain word list
> from nltk.corpus import reuters
> from nltk.corpus import brown
> brown.categories()
['adventure', 'belles_lettres', 'editorial',
'fiction', 'government', 'hobbies',
'humor', 'learned', 'lore', 'mystery', 'news',
'religion', 'reviews', 'romance',
'science_fiction']
10. Tokenization
The process of breaking a text down into tokens.
Types are the unique tokens present in a corpus; the set of all types in a corpus is its vocabulary.
Words can be distinguished as content words and stopwords.
Tokenization is the first step of the pipeline, just after cleaning.
11. Tokenizing text in Python
Pure Python, spaCy or NLTK can be used
In spaCy, tokenization is done by applying rules specific to each language [≠ text.split()]
NLTK features a tweet tokenizer which preserves #hashtags, @handles and emoticons (see the sketch below)
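A rough sketch of both approaches, assuming the en_core_web_sm model is installed:

import spacy
from nltk.tokenize import TweetTokenizer

nlp = spacy.load("en_core_web_sm")
doc = nlp("Let's go to N.Y.!")
print([token.text for token in doc])
# expected: ['Let', "'s", 'go', 'to', 'N.Y.', '!']

tweet_tokenizer = TweetTokenizer()
print(tweet_tokenizer.tokenize("@nlp_fan loving #NLProc :-)"))
# expected: ['@nlp_fan', 'loving', '#NLProc', ':-)']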
12. Lemmatization
Lemmas are the root forms of words.
spaCy uses a predefined dictionary, based on WordNet, for extracting lemmas.
Stemming is the poor man's lemmatization: it truncates the word to its stem (arguing → argu).
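A rough comparison of spaCy lemmatization and NLTK's Porter stemmer (assuming en_core_web_sm is installed; outputs may vary by model version):

import spacy
from nltk.stem import PorterStemmer

nlp = spacy.load("en_core_web_sm")
print([token.lemma_ for token in nlp("the geese were arguing")])
# expected: ['the', 'goose', 'be', 'argue']

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["arguing", "argued", "argues"]])
# expected: ['argu', 'argu', 'argu']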
13. WordNet
A large lexical database of English.
Nouns, verbs, adjectives and adverbs are grouped into sets of
cognitive synonyms (synsets), each expressing a distinct concept.
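A minimal sketch of querying WordNet through NLTK (assumes nltk.download('wordnet') has been run):

from nltk.corpus import wordnet as wn

# each synset groups synonymous lemmas expressing one concept
for synset in wn.synsets("car")[:3]:
    print(synset.name(), "-", synset.definition())
print(wn.synsets("car")[0].lemma_names())   # synonyms in the first synset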
14. Grammatical analysis
spaCy provides a variety of linguistic annotations to give you insights into a text's grammatical structure.
The loaded statistical models enable spaCy to predict linguistic annotations: for example, whether a word is a verb or a noun (part-of-speech or POS tagging), or whether a noun is the subject of a sentence or the object.
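A minimal sketch of reading these annotations (assumes en_core_web_sm is installed):

import spacy

nlp = spacy.load("en_core_web_sm")
for token in nlp("The cat chased the mouse"):
    print(token.text, token.pos_, token.dep_)
# e.g. "cat" should come out as NOUN with dependency nsubj (subject of "chased")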
16. Named Entity Recognition (NER)
Labelling named "real-world" objects, like persons, companies or locations.
The spaCy pretrained model performs pretty well (at least in English).
Again, you can use displaCy to get a beautiful visualization of the NE-annotated sentence.
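A minimal sketch with the pretrained English model:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)
# expected: Apple ORG, U.K. GPE, $1 billion MONEY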
19. WHY REPRESENTATION IS IMPORTANT
The text representation scheme must facilitate the extraction of features.
The semantics (meaning) of a sentence comes from these 4 steps:
● Break the sentence into lexical units
● Derive the meaning of each unit
● Understand the syntactic (grammatical) structure of the sentence
● Understand the context in which the sentence appears
20. TEXT REPRESENTATION IS NOT EASY
Images and sounds have a natural digital representation scheme; for text, there is no obvious one.
21. HOW TO FEED A STATISTICAL MODEL?
Machines do not understand text: they are good at crunching numbers
Statistics and linear algebra work with numbers
Machine learning algorithms assume that all features used to
represent an observation are numeric
Text representation is the conversion from raw text to a
suitable numerical form
23. One-hot encoding
Every element is zero except the one corresponding
to a specific word
import numpy as np

# word_dict maps each vocabulary word to an integer index
def one_hot(word, word_dict):
    vector = np.zeros(len(word_dict))
    vector[word_dict[word]] = 1
    return vector

print(one_hot("paris", word_dict))
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
No information about relations between words
Must pre-determine vocabulary size
Size of input vector scales with size of vocabulary
“Out-of-Vocabulary” (OOV) problem
24. Bag of words
A vector representation of a text produced by
simply adding up all the one-hot encoded vectors:
bow = np.zeros(vocabulary_size)   # vocabulary_size = len(word_dict)
for word in text_words:
    hot_word = one_hot(word, word_dict)
    bow += hot_word

print(bow)
[6. 2. 2. 4. 5. 1. 1. 2. 1. 1. 1. 1. 2. 2. 4. 1. 1. 1. 1.]

bow[word_dict["paris"]]
1.0
The vector simply contains the number of times each word appears in our document.
Orderless
No notion of similarity
25. N-gram model
A contiguous sequence of n items from a given
sample of text
Vocabulary = set of all n-grams in corpus
No notion of similarity
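A minimal sketch of extracting bigrams (n = 2) with NLTK:

from nltk import ngrams

text = "the car is driven on the road".split()
print(list(ngrams(text, 2)))
# [('the', 'car'), ('car', 'is'), ('is', 'driven'), ('driven', 'on'), ('on', 'the'), ('the', 'road')]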
26. Collocations
A sequence of words that occur together unusually often
nltk.collocations can help identify phrases that act like single words.
In the example below, bigrams are paired with a "more likely to occur" score.
27. Term Frequency
Intuitively, we expect that the frequency with which a given word is mentioned corresponds to the relevance of that word for the piece of text we are considering.
Are very frequent words really meaningful!?
→ stopwords
28. Stopwords
Words that are very frequent but not meaningful
Remove the most common 100 words
Use spaCy predefined stopwords
Use nltk predefined stopwords
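A minimal sketch using NLTK's predefined stopword list (assumes nltk.download('stopwords') has been run):

from nltk.corpus import stopwords

stops = set(stopwords.words("english"))
words = "the car is driven on the road".split()
print([w for w in words if w not in stops])
# ['car', 'driven', 'road']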
29. TF–IDF
Reflects how important a word is to a document in a corpus
Idea: importance increases proportionally to the frequency of a word in the document; but is
inversely proportional to the frequency of the word in the corpus
The tf–idf is the product of two statistics, term frequency and inverse document frequency.
30. TF–IDF
A toy example may help to clearly understand.
D1: "The car is driven on the road."
D2: "The truck is driven on the highway."
len(D1) = len(D2) = 7
TF-IDF(t, d) = TF(t, d) * log(N / DF(t)), with N = 2 documents
For "the" (DF = 2): log(2/2) = 0, so its TF-IDF is zero in both documents.
For "car" (DF = 1, appearing once in D1): TF-IDF = (1/7) * log(2/1) > 0.
The TF-IDF of common words is zero: they are not significant.
31. TF–IDF
Let's now code TF-IDF in Python from scratch (a sketch follows):
Compute the TF score for each word in the corpus, by document
Compute the IDF score of every word in the corpus
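A minimal from-scratch sketch following the two steps above, reusing the toy documents from the previous slide:

import math
from collections import Counter

docs = ["the car is driven on the road".split(),
        "the truck is driven on the highway".split()]
N = len(docs)

# TF score for each word, by document
tf = [{word: count / len(doc) for word, count in Counter(doc).items()} for doc in docs]

# IDF score of every word in the corpus
vocabulary = {word for doc in docs for word in doc}
idf = {word: math.log(N / sum(1 for doc in docs if word in doc)) for word in vocabulary}

tf_idf = [{word: tf_doc[word] * idf[word] for word in tf_doc} for tf_doc in tf]
print(tf_idf[0]["car"], tf_idf[0]["the"])   # non-zero vs. 0.0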
32. TF–IDF
TF-IDF implementation using sklearn
Under the hood two functions are executed:
fit: learn vocabulary and idf from the training set
transform: transform documents to a document-term matrix
Terms with (theoretically) zero idf don't get suppressed entirely: their score is 0.6!? (sklearn smooths the idf, so no term gets an idf of exactly zero)
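A minimal sketch with scikit-learn's TfidfVectorizer (get_feature_names_out assumes a recent sklearn version):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["The car is driven on the road.",
        "The truck is driven on the highway."]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)        # fit + transform in one call
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))               # one row per document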
35. VECTOR SPACE MODELS
Represent text units as vectors of numbers
Each dimension corresponds to a separate term (words, keywords, phrases).
D = (t1, t2, …, tn)
If a term occurs in the document, its value in the vector is non-zero (TF-IDF weighting).
We can choose other features to fill the vector.
Vector operations can be used to compare documents with queries.
36. COSINE SIMILARITY
Measure of similarity between two non-zero vectors
Relevance rankings of documents in a keyword search can be calculated by comparing the cosine of the angle between each document vector and the original query vector.
The resulting similarity ranges from −1, meaning exactly opposite, to 1, meaning exactly the same.
For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents.
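A minimal sketch of cosine similarity (cos θ = A·B / (‖A‖ ‖B‖)) between two term-frequency vectors:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([2.0, 1.0, 0.0, 1.0])   # term frequencies of document A
b = np.array([1.0, 1.0, 1.0, 0.0])   # term frequencies of document B
print(cosine_similarity(a, b))        # 1.0 = same direction, 0.0 = orthogonal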
37. WORD EMBEDDINGS
The state-of-the-art text representation
An approach that provides a dense vector representation of words, capturing something about their meaning.
The central idea of word embedding training is that similar words are typically surrounded by the same "context" words in normal use.
“You shall know a word by the company it keeps” – Firth, J.R. (1957)
38. WORD EMBEDDINGS
Popular Word Embedding Algorithms
Skip-Gram
Continuous Bag of Words (CBOW)
Word2Vec (2013), by Google: trained using either the CBOW or the Skip-Gram technique
GloVe ("Global Vectors for Word Representation"), from a team at Stanford University
fastText, created by the Facebook AI Research group
39. WORD EMBEDDINGS IN ACTION
Word vectors let you import knowledge from raw text into your model
● We can represent words as vectors of numbers
● We can easily calculate how similar vectors are to each other
● We can add and subtract word embeddings and arrive at interesting results
● The most famous example is the formula: "king" - "man" + "woman"
Coloring the cells based on their values, we can easily compare two or more vectors.
40. WORD EMBEDDINGS WITH SPACY
spaCy comes shipped with pre-trained models!
spaCy's small models (all packages that end in _sm) don't ship with word vectors, and only include context-sensitive tensors (POS, NER, etc).
You can still use the similarity() methods to compare documents and tokens, but the result won't be as good.
The en_vectors_web_lg model provides 300-dimensional GloVe vectors for over 1 million terms of English.
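A minimal sketch of comparing tokens, assuming a model that ships with vectors (e.g. en_core_web_md or en_vectors_web_lg):

import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("dog cat banana")
for t1 in doc:
    for t2 in doc:
        print(t1.text, t2.text, round(t1.similarity(t2), 2))
# "dog" / "cat" should score much higher than "dog" / "banana"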
41. WORD EMBEDDINGS WITH SPACY
(continued)
We now need to find the closest vector in the vocabulary to the result
of “king - man + woman”
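A rough sketch of that search, assuming a model with vectors and using Vectors.most_similar (available in recent spaCy versions):

import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")
result = (nlp.vocab["king"].vector
          - nlp.vocab["man"].vector
          + nlp.vocab["woman"].vector)

# look up the vocabulary entries whose vectors are closest to the result
keys, _, scores = nlp.vocab.vectors.most_similar(np.asarray([result]), n=5)
print([nlp.vocab.strings[int(k)] for k in keys[0]])   # typically includes "queen"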
44. FASTTEXT
An open source NLP library developed by Facebook AI Research (2016).
Its goal is to provide word embeddings and text classification efficiently through its shallow neural network implementation.
Its accuracy is on par with deep neural networks, while requiring far less time to train.
Models can be saved and later reduced in size to fit even on mobile devices.
45. FASTTEXT
Word representation
Training: word vectors are learned from a plain-text corpus (a sketch follows).
Word vector: once trained, the model can return the vector for any word.
Parameters:
The dimension (dim) controls the size of the vectors (default 100).
The subwords are all the substrings contained in a word between the minimum size (minn) and the maximum size (maxn), 3 to 6 characters by default.
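A minimal sketch with the official fastText Python bindings (data.txt is a placeholder plain-text corpus file):

import fasttext

# unsupervised training of word representations (skipgram or cbow)
model = fasttext.train_unsupervised("data.txt", model="skipgram",
                                    dim=100, minn=3, maxn=6)

print(model.get_word_vector("king").shape)   # (100,)
model.save_model("vectors.bin")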
46. FASTTEXT
Testing your model
A simple way to check the quality of a word vector is to look at its nearest neighbors.
This gives an intuition of the type of semantic information the vectors are able to capture.
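For example, reusing the vectors.bin file saved above:

import fasttext

model = fasttext.load_model("vectors.bin")
print(model.get_nearest_neighbors("king"))   # list of (similarity, word) pairs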
48. Sentiment analysis from scratch
Identify and extract opinions within a given text
import re

# pos / neg are lists of positive and negative opinion words (an opinion lexicon)
valence = {}
for word in pos:
    valence[word.lower()] = 1
for word in neg:
    valence[word.lower()] = -1

# assumed helper: pull the words out of a lowercased text
def extract_words(text):
    return re.findall(r"[a-z']+", text)

def sentiment(text, valence):
    words = extract_words(text.lower())
    word_count = 0
    score = 0
    for word in words:
        if word in valence:
            score += valence[word]
            word_count += 1
    return score / word_count

texts = ["I'm very happy",
         "The product is pretty annoying, and I hate it",
         "I'm sad"]
for text in texts:
    print(sentiment(text, valence))
1.0
-0.3333333333333333
-1.0
The simplest approach is counting positive and negative words using opinion lists, but modifier words ("very", "much", "not", "pretty", "somewhat") change the meaning
→ We need real-valued weights
49. NLTK’s VADER
A lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.
Positive + Neutral + Negative = 1
-1 < Compound < +1
# requires: nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
score = analyzer.polarity_scores(sentence)   # sentence: any input string
VADER analyses sentiments primarily based on certain key points:
● Punctuation (cool!)
● Capitalization (GREAT)
● Degree modifiers (very good)
● Conjunctions (good but I hate it)
51. USING FASTTEXT
We use the sentiment140 dataset, which contains 1,600,000 tweets annotated with 2 keys: 0 = negative, 4 = positive.
fastText uses the prefix __label__ to distinguish labels from words.
Data preprocessing and cleaning (a sketch follows):
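A rough preprocessing sketch, assuming the standard sentiment140 CSV column layout (polarity, id, date, query, user, text); the file names are placeholders:

import csv

with open("sentiment140.csv", encoding="latin-1") as src, \
     open("tweets.train", "w", encoding="utf-8") as dst:
    for polarity, _, _, _, _, text in csv.reader(src):
        label = "__label__positive" if polarity == "4" else "__label__negative"
        text = text.lower().replace("\n", " ")      # very light cleaning
        dst.write(f"{label} {text}\n")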
52. USING FASTTEXT
Training the classifier (a sketch follows):
Using bigrams
Also increasing the learning rate
Increasing the number of epochs
Default parameters: epoch=5, lr=0.1, wordNgrams=1
The classifier is evaluated on precision and recall.
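A minimal training and evaluation sketch with those parameter tweaks (tweets.train / tweets.valid are placeholder files in fastText's __label__ format; the exact epoch and lr values are illustrative):

import fasttext

model = fasttext.train_supervised(
    "tweets.train",
    epoch=25,        # default: 5
    lr=1.0,          # default: 0.1
    wordNgrams=2,    # default: 1 (unigrams only)
)
n, precision, recall = model.test("tweets.valid")   # returns (N, precision, recall)
print(n, precision, recall)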
55. CONCLUSIONS
Large volumes of data are crucial to the success of a machine learning project, but having clean, high-quality data is just as important (ImageNet moment of NLP)
Sometimes rules-based approaches (still) work better, e.g. VADER
Don't fall in love with tools or algorithms: feel free to build the best possible environment to process your data and get the job done!
57. Some online stuff
● Introduction to Information Retrieval
https://nlp.stanford.edu/IR-book/html/htmledition/irbook.html
● Speech and Language Processing
https://web.stanford.edu/~jurafsky/slp3/
● Natural Language Processing From Scratch
https://2017.pygotham.org/talks/natural-language-processing-from-scratch/
● Advanced NLP with spaCy
https://course.spacy.io
● Sentiment analysis
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
● The Illustrated Word2vec
https://jalammar.github.io/illustrated-word2vec/