Natural Language
Processing:
Part 1: Feature Extraction
Gabe Hamilton
Why NLP?
● Search
● Translation
● Question answering
● Conversational response
● Content generation
So we have a goal that requires NLP.
Maybe analyzing writing to see who is likely to have written an email.
And we have some data to analyze:
The Enron emails, for instance: https://www.cs.cmu.edu/~enron/
Let’s extract some Features
We’ll analyze data and use some Machine Learning.
First the high level
Doing something useful with our documents requires some organization.
Will we operate on documents?
Organize by author or subject?
Maybe it’s question and answer based?
Is there a relationship between documents? A To/From graph for emails, for instance.
This can lead to a number of features
Given some text
Tech Confluence is a lunchtime meet-up of
developers, designers and generally amazing
people in downtown Denver to present and
discuss software related topics.
What features can we get out of it?
Language Features
● Paragraphs
● Sentences
● Words
● Parts of speech
● Entities (Jane, Amazon, Eiffel Tower)
● Sentiment
● Topics
● Assertions (Arthur was a King)
A step further
Frequency
Relationships
Clustering
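A minimal sketch of pulling several of the features above out of the sample text, using spaCy (one of the tools listed later). The en_core_web_sm model is an assumption; any English spaCy model works:

    import spacy

    # Small English model; install with: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Tech Confluence is a lunchtime meet-up of developers, "
              "designers and generally amazing people in downtown Denver "
              "to present and discuss software related topics.")

    print([sent.text for sent in doc.sents])             # sentences
    print([(tok.text, tok.pos_) for tok in doc[:6]])     # parts of speech
    print([(ent.text, ent.label_) for ent in doc.ents])  # entities, e.g. Denver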
Voice
Vowels and Consonants
Phonemes
Tone (Relative Pitch contrasts)
Signal vs Noise
Not what this talk is about.
Frequency
Two example documents:
Document 1: "Russia, officially the Russian Federation, is a country in Eurasia. Russia is the largest country..."
Document 2: "England is a country that is part of the United Kingdom. England shares land borders with the country of Wales to the west..."
Frequency
Term Frequency
How often does a word appear in a document?
Document Frequency
How often does a word appear across our documents? This gives us stop words (common terms).
Inverse Document Frequency
How much information does a word provide? Uncommon words are more useful.
TF-IDF = Term Frequency × Inverse Document Frequency
Russia: TF 2 × IDF ½ = 1.0
Country: TF 2 × IDF ¼ = 0.5
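A minimal sketch of the calculation above in plain Python. Note the slide's IDF is a simplification, one over the word's total count across the corpus, rather than the more common log(N / document frequency):

    from collections import Counter

    docs = [
        "russia officially the russian federation is a country in "
        "eurasia russia is the largest country",
        "england is a country that is part of the united kingdom "
        "england shares land borders with the country of wales to the west",
    ]
    tokenized = [doc.split() for doc in docs]

    # Total occurrences of each word across the whole corpus
    corpus_counts = Counter(word for doc in tokenized for word in doc)

    def tf_idf(word, doc_tokens):
        tf = doc_tokens.count(word)     # term frequency in one document
        idf = 1 / corpus_counts[word]   # the slide's simplified IDF
        return tf * idf

    print(tf_idf("russia", tokenized[0]))   # 2 * 1/2 = 1.0
    print(tf_idf("country", tokenized[0]))  # 2 * 1/4 = 0.5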
Supervised Machine Learning
Mapping x to y
Build a model.
What are the common patterns?
And the uncommon patterns.
If we have many examples of nouns we can
train a program to classify nouns.
Same with:
● other parts of speech
● entity recognition
● sentiment
aka Statistical Pattern Matching
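A minimal sketch of mapping x to y: a toy sentiment classifier built with scikit-learn (the library choice, example sentences, and labels here are invented for illustration):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    # x: example texts, y: labels (1 = positive, 0 = negative)
    texts = ["great talk, really useful", "loved the examples",
             "boring and confusing", "a waste of an evening"]
    labels = [1, 1, 0, 0]

    vectorizer = CountVectorizer()
    x = vectorizer.fit_transform(texts)          # extract word-count features
    model = LogisticRegression().fit(x, labels)  # learn the common patterns

    test = vectorizer.transform(["really great examples"])
    print(model.predict(test))                   # expect [1]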
Classifiers
Part-of-speech tagging, entity recognition
Some Tools
GCP Natural Language API https://cloud.google.com/natural-language/
(powered by SyntaxNet: https://opensource.google.com/projects/syntaxnet)
spaCy (Python) https://spacy.io/
OpenNLP (Java) https://opennlp.apache.org/
Stanford CoreNLP https://nlp.stanford.edu/software/
Encoding
How could we encode the words in our documents?
We could have an array with one spot for every word.
Say there are 50,000 unique words in our pile of documents… 50,000 spots in our array.
But this encoding isn't very useful: the vectors are huge, almost entirely zeros, and they treat every pair of words as equally unrelated.
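A sketch of that naive one-spot-per-word ("one-hot") encoding, with a toy vocabulary standing in for the 50,000 words:

    import numpy as np

    vocabulary = sorted({"a", "country", "eurasia", "in", "is", "russia"})
    index = {word: i for i, word in enumerate(vocabulary)}

    def one_hot(word):
        vec = np.zeros(len(vocabulary))  # imagine 50,000 spots instead of 6
        vec[index[word]] = 1.0
        return vec

    print(one_hot("country"))
    # Almost all zeros, and every pair of words is equally far apart,
    # so the vector says nothing about what the word means.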
Word Vectors
So we look at what words tend to be around a given word.
And we train an algorithm to encode that word in terms of the words around it.
And we continue encoding, squishing it down into maybe 1000 spots.
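A minimal sketch of that training step using gensim's word2vec, one common implementation of encoding a word from its context. The two-sentence corpus is only illustrative; real training needs far more text:

    from gensim.models import Word2Vec

    sentences = [
        ["russia", "is", "a", "country", "in", "eurasia"],
        ["england", "is", "a", "country", "in", "europe"],
    ]
    # vector_size is the number of "spots" each word gets squished into
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
    print(model.wv["russia"].shape)  # (100,)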
Word Vectors
Each spot now encodes one of the 1000 dimensions of the words in our documents.
We could look at the dimensions and give them names. This one seems to be age.
Word Vectors
Easy to calculate the similarity of words
And to add and subtract words
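For example, with vectors pre-trained on a large corpus (here gensim's downloadable GloVe vectors, one readily available option), similarity is a single call, and the classic arithmetic is king - man + woman ≈ queen:

    import gensim.downloader

    # Downloads pre-trained 100-dimensional GloVe vectors on first use
    vectors = gensim.downloader.load("glove-wiki-gigaword-100")

    print(vectors.similarity("russia", "england"))  # cosine similarity
    print(vectors.most_similar(positive=["king", "woman"],
                               negative=["man"], topn=1))
    # -> [('queen', ...)]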
Parse Trees
SyntaxNet
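The slide shows a SyntaxNet parse tree. As a runnable stand-in, spaCy produces the same kind of dependency tree, where each word points at its head:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Arthur was a king.")
    for token in doc:
        # each token points at its head, forming the tree
        print(f"{token.text:8} {token.dep_:8} <- {token.head.text}")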
Questions
Image Credits
Books http://mrg.bz/zClfuK
Bands http://mrg.bz/834def
