Natural Language
Processing:
Part 1: Feature Extraction
Gabe Hamilton
Why NLP?
● Search
● Translation
● Question answering
● Conversational response
● Content generation
So we have a goal that requires NLP.
Maybe analyzing writing to see who is likely to have written an email.
And we have some data to analyze:
The Enron emails, for instance: https://www.cs.cmu.edu/~enron/
Let’s extract some Features
We’ll analyze data and use some Machine Learning.
First the high level
Doing something useful with our documents requires some organization.
Will we operate on documents?
Organize by author or subject?
Maybe it’s question and answer based?
Is there a relationship between documents? A To/From graph for emails, for instance.
This can lead to a number of features
Given some text
Tech Confluence is a lunchtime meet-up of
developers, designers and generally amazing
people in downtown Denver to present and
discuss software related topics.
What features can we get out of it?
Language Features
● Paragraphs
● Sentences
● Words
● Parts of speech
● Entities (Jane, Amazon, Eiffel Tower)
● Sentiment
● Topics
● Assertions (Arthur was a King)
A step further
Frequency
Relationships
Clustering
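A minimal sketch of pulling several of the features above out of the sample text, using spaCy (one of the tools listed later). The en_core_web_sm model is an assumption; any English spaCy model works:

    import spacy

    # Small English model; install with: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Tech Confluence is a lunchtime meet-up of developers, "
              "designers and generally amazing people in downtown Denver "
              "to present and discuss software related topics.")

    print([sent.text for sent in doc.sents])             # sentences
    print([(tok.text, tok.pos_) for tok in doc[:6]])     # parts of speech
    print([(ent.text, ent.label_) for ent in doc.ents])  # entities, e.g. Denver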
Voice
Vowels and Consonants
Phonemes
Tone (Relative Pitch contrasts)
Signal vs Noise
Not what this talk is about.
Frequency
Two example documents:
Document 1: "Russia, officially the Russian Federation, is a country in Eurasia. Russia is the largest country..."
Document 2: "England is a country that is part of the United Kingdom. England shares land borders with the country of Wales to the west..."
Frequency
Term Frequency
How often does a word appear in a document?
Document Frequency
How often does a word appear across our documents? This gives us stop words (common terms).
Inverse Document Frequency
How much information does a word provide? Uncommon words are more useful.
TF-IDF = Term Frequency × Inverse Document Frequency
Russia: TF 2 × IDF ½ = 1.0
Country: TF 2 × IDF ¼ = 0.5
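A minimal sketch of the calculation above in plain Python. Note the slide's IDF is a simplification, one over the word's total count across the corpus, rather than the more common log(N / document frequency):

    from collections import Counter

    docs = [
        "russia officially the russian federation is a country in "
        "eurasia russia is the largest country",
        "england is a country that is part of the united kingdom "
        "england shares land borders with the country of wales to the west",
    ]
    tokenized = [doc.split() for doc in docs]

    # Total occurrences of each word across the whole corpus
    corpus_counts = Counter(word for doc in tokenized for word in doc)

    def tf_idf(word, doc_tokens):
        tf = doc_tokens.count(word)     # term frequency in one document
        idf = 1 / corpus_counts[word]   # the slide's simplified IDF
        return tf * idf

    print(tf_idf("russia", tokenized[0]))   # 2 * 1/2 = 1.0
    print(tf_idf("country", tokenized[0]))  # 2 * 1/4 = 0.5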
Supervised Machine Learning
Mapping x to y
Build a model.
What are the common patterns?
And the uncommon patterns.
If we have many examples of nouns we can
train a program to classify nouns.
Same with:
● other parts of speech
● entity recognition
● sentiment
aka Statistical Pattern Matching
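A minimal sketch of mapping x to y: a toy sentiment classifier built with scikit-learn (the library choice, example sentences, and labels here are invented for illustration):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    # x: example texts, y: labels (1 = positive, 0 = negative)
    texts = ["great talk, really useful", "loved the examples",
             "boring and confusing", "a waste of an evening"]
    labels = [1, 1, 0, 0]

    vectorizer = CountVectorizer()
    x = vectorizer.fit_transform(texts)          # extract word-count features
    model = LogisticRegression().fit(x, labels)  # learn the common patterns

    test = vectorizer.transform(["really great examples"])
    print(model.predict(test))                   # expect [1]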
Classifiers
Part-of-speech tagging, entity recognition
Some Tools
GCP Natural Language API https://cloud.google.com/natural-language/
(powered by SyntaxNet: https://opensource.google.com/projects/syntaxnet)
spaCy (Python) https://spacy.io/
OpenNLP (Java) https://opennlp.apache.org/
Stanford CoreNLP https://nlp.stanford.edu/software/
Encoding
How could we encode the words in our documents?
We could have an array with one spot for every word.
Say there are 50,000 unique words in our pile of documents… 50,000 spots in our array.
But this encoding isn't very useful: the vectors are huge, almost entirely zeros, and they treat every pair of words as equally unrelated.
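A sketch of that naive one-spot-per-word ("one-hot") encoding, with a toy vocabulary standing in for the 50,000 words:

    import numpy as np

    vocabulary = sorted({"a", "country", "eurasia", "in", "is", "russia"})
    index = {word: i for i, word in enumerate(vocabulary)}

    def one_hot(word):
        vec = np.zeros(len(vocabulary))  # imagine 50,000 spots instead of 6
        vec[index[word]] = 1.0
        return vec

    print(one_hot("country"))
    # Almost all zeros, and every pair of words is equally far apart,
    # so the vector says nothing about what the word means.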
Word Vectors
So we look at what words tend to be around a given word.
And we train an algorithm to encode that word in terms of the words around it.
And we continue encoding, squishing it down into maybe 1000 spots.
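A minimal sketch of that training step using gensim's word2vec, one common implementation of encoding a word from its context. The two-sentence corpus is only illustrative; real training needs far more text:

    from gensim.models import Word2Vec

    sentences = [
        ["russia", "is", "a", "country", "in", "eurasia"],
        ["england", "is", "a", "country", "in", "europe"],
    ]
    # vector_size is the number of "spots" each word gets squished into
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
    print(model.wv["russia"].shape)  # (100,)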
Word Vectors
Each spot now encodes one of the 1000 dimensions of the words in our documents.
We could look at the dimensions and give them names. This one seems to be age.
Word Vectors
Easy to calculate the similarity of words
And to add and subtract words
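For example, with vectors pre-trained on a large corpus (here gensim's downloadable GloVe vectors, one readily available option), similarity is a single call, and the classic arithmetic is king - man + woman ≈ queen:

    import gensim.downloader

    # Downloads pre-trained 100-dimensional GloVe vectors on first use
    vectors = gensim.downloader.load("glove-wiki-gigaword-100")

    print(vectors.similarity("russia", "england"))  # cosine similarity
    print(vectors.most_similar(positive=["king", "woman"],
                               negative=["man"], topn=1))
    # -> [('queen', ...)]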
Parse Trees
SyntaxNet
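The slide shows a SyntaxNet parse tree. As a runnable stand-in, spaCy produces the same kind of dependency tree, where each word points at its head:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Arthur was a king.")
    for token in doc:
        # each token points at its head, forming the tree
        print(f"{token.text:8} {token.dep_:8} <- {token.head.text}")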
Questions
Image Credits
Books http://mrg.bz/zClfuK
Bands http://mrg.bz/834def
