Natural Language Processing
with Python
April 9, 2016
Intro to Natural Language Processing
What is Language?
Or, why should we analyze it?
What does this mean?
Crab
Is the symbol representative of its meaning?
Crab /kræb/
Or is it just a mapping in a mental lexicon?
Crab /kræb/ (symbol, sound, sight)
These symbols are arbitrary:
κάβουρας /kávouras/, Greek for “crab” (symbol, sound, sight)
Linguistic symbols:
- Not acoustic “things”
- Not mental processes
Critical implications for:
- Literature
- Linguistics
- Computer Science
- Artificial Intelligence
What is Language?
The science that has been developed around the facts of language
passed through three stages before finding its true and unique
object. First something called "grammar" was studied. This study,
initiated by the Greeks and continued mainly by the French, was
based on logic. It lacked a scientific approach and was detached
from language itself. Its only aim was to give rules for
distinguishing between correct and incorrect forms; it was a
normative discipline, far removed from actual observation, and its
scope was limited.
-- Ferdinand de Saussure
What is Natural Language Processing?
Formal vs. Natural Languages
Formal Languages
- Strict, unchanging rules defined by grammars and parsed by regular expressions
- Generally application specific (chemistry, math)
- Literal: exactly what is said is meant
- No ambiguity
- Parsable by regular expressions
- Inflexible: no new terms or meanings
Natural Languages
- Flexible, evolving language that occurs naturally in human communication
- Unspecific and used in many domains and applications
- Redundant and verbose in order to make up for ambiguity
- Expressive
- Difficult to parse
- Very flexible even in narrow contexts
Computer science has traditionally focused on formal languages.
Who has written a compiler or interpreter?
[Diagram: string → Lexical Analysis (“lexing”, regular grammar / word rules) → tokens → Syntactic Analysis (“parsing”, context-free grammar / sentence rules) → tree]
Syntax Errors are your friend! http://bbc.in/23pSPRq
However, ambiguity is required for understanding when communicating between people with diverse experience.
Most of the time. http://bit.ly/1XlOtHv
Natural Language Processing requires flexibility, which generally comes from machine learning.
Language models are used for flexibility
“There was a ton of traffic on the beltway so I was ________.”
“At the beach we watched the ________.”
“Watch out for that ________!”
Intuition: Language is Predictable (if flexible)
These models predict relationships between tokens:
crab, beach, steamed, boat, ocean, sand, seafood
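A minimal sketch of this intuition, assuming NLTK is installed and its Brown corpus data has been downloaded: a conditional frequency distribution over bigrams acts as a crude next-token predictor.

# Bigram-based "prediction": count which word follows which in a corpus.
# Assumes nltk is installed and nltk.download('brown') has been run.
import nltk
from nltk.corpus import brown

pairs = nltk.bigrams(w.lower() for w in brown.words())
cfd = nltk.ConditionalFreqDist(pairs)

# The most frequent successors of "the" are the model's best guesses.
print(cfd["the"].most_common(5))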
But they can’t understand meaning (yet)
So please keep in mind: Tokens != Words
Tokens:
- Substrings
- Only structural
- Data
- e.g. “bearing”, “shouldn’t”
Words:
- Objects
- Contain a “sense”
- Meaning
- e.g. to bear.verb-1, should.auxverb-3, not.adverb-1
The State of the Art
- Academic design for use alongside intelligent agents (AI discipline)
- Relies on formal models or representations of knowledge & language
- Models are adapted and augmented through probabilistic methods and machine learning.
- A small number of algorithms comprise the standard framework.
Traditional NLP Applications
- Summarization
- Reference Resolution
- Machine Translation
- Language Generation
- Language Understanding
- Document Classification
- Author Identification
- Part of Speech Tagging
- Question Answering
- Information Extraction
- Information Retrieval
- Speech Recognition
- Sense Disambiguation
- Topic Recognition
- Relationship Detection
- Named Entity Recognition
What is Required?
Domain Knowledge
A Corpus in the Domain
The NLP Pipeline
Morphology
The study of the forms of things, words in particular.
Consider pluralization for English:
● Orthographic Rules: puppy → puppies
● Morphological Rules: goose → geese or fish
Major parsing tasks: stemming, lemmatization, and tokenization.
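A minimal NLTK sketch of these tasks, assuming NLTK is installed and the punkt and wordnet data have been downloaded:

# Tokenization, stemming, and lemmatization with NLTK.
# Assumes nltk.download('punkt') and nltk.download('wordnet') have been run.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

tokens = nltk.word_tokenize("The puppies chased the geese.")

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print([stemmer.stem(t) for t in tokens])                   # crude suffix stripping: puppies -> puppi
print([lemmatizer.lemmatize(t, pos="n") for t in tokens])  # dictionary lookup: geese -> goose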
Syntax
The study of the rules for the formation of sentences.
Major tasks: chunking, parsing, feature parsing, grammars
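A minimal chunking sketch with NLTK, using a toy regular-expression grammar over part-of-speech tags (assumes the punkt and tagger data have been downloaded):

# POS tagging and noun-phrase chunking with NLTK.
# Assumes nltk.download('punkt') and nltk.download('averaged_perceptron_tagger') have been run.
import nltk

sentence = "The man hit the building with the baseball bat."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# Toy grammar: a noun phrase is an optional determiner, any adjectives, then one or more nouns.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
chunker.parse(tagged).pprint()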
Semantics
The study of meaning.
- I see what I eat.
- I eat what I see.
- He poached salmon.
Major tasks: frame extraction, creation of TMRs
Thematic Meaning Representation
“The man hit the building with the baseball bat”
{
"subject": {"text": "the man", "sense": human-agent},
"predicate": {"text": "hit", "sense": strike-physical-force},
"object": {"text": "the building", "sense": habitable-structure},
"instrument": {"text": "with the baseball bat" "sense": sports-equipment}
}
Recent NLP Applications
- Yelp Insights
- Winning Jeopardy! IBM Watson
- Computer assisted medical coding (3M Health Information Systems)
- Geoparsing -- CALVIN (built by Charlie Greenbacker)
- Author Identification (classification/clustering)
- Sentiment Analysis (recursive neural tensor networks, classification)
- Language Detection
- Event Detection
- Google Knowledge Graph
- Named Entity Recognition and Classification
- Machine Translation
Applications are BIG data
- Examples are easier to create than rules.
- Rules and logic miss frequency and language dynamics.
- More data is better for machine learning; relevance is in the long tail.
- Knowledge engineering is not scalable.
- Computational linguistics methodologies are stochastic.
The Natural Language Toolkit (NLTK)
What is NLTK?
- Python interface to over 50 corpora and lexical resources
- Focus on machine learning with specific domain knowledge
- Free and open source
- NumPy and SciPy under the hood
- Fast and formal
What is NLTK?
Suite of libraries for a variety of academic text processing tasks:
- tokenization, stemming, tagging,
- chunking, parsing, classification,
- language modeling, logical semantics
Pedagogical resources for teaching NLP theory in Python ...
Who Wrote NLTK?
Steven Bird
Associate Professor
University of Melbourne
Senior Research Associate, LDC
Ewan Klein
Professor of Language Technology
University of Edinburgh.
Batteries Included
NLTK = Corpora + Algorithms
Ready for Research!
What is NLTK not?
- Production ready out of the box*
- Lightweight
- Generally applicable
- Magic
*There are actually a few things that are production ready right out of the box.
The Good
- Preprocessing
- segmentation, tokenization, PoS tagging
- Word level processing
- WordNet, Lemmatization, Stemming, NGram
- Utilities
- Tree, FreqDist, ConditionalFreqDist
- Streaming CorpusReader objects
- Classification
- Maximum Entropy, Naive Bayes, Decision Tree
- Chunking, Named Entity Recognition
- Parsers Galore!
- Languages Galore!
The Bad
- Syntactic Parsing
- No included grammar (not a black box)
- Feature/Dependency Parsing
- No included feature grammar
- The sem package
- Toy only (lambda-calculus & first order logic)
- Lots of extra stuff
- papers, chat programs, alignments, etc.
Other Python NLP Libraries
- TextBlob
- SpaCy
- Scikit-Learn
- Pattern
- gensim
- MITIE
- guess_language
- Python wrapper for Stanford CoreNLP
- Python wrapper for Berkeley Parser
- readability-lxml
- BeautifulSoup
NLTK Demo
- Working with Included Corpora
- Segmentation
- Tokenization
- Tagging
- A Parsing Exercise
- Named Entity Recognition
Working with Text
Machine Learning on Text
Machine learning uses instances (examples) of data to fit a parameterized model, which is then used to make predictions about new instances.
In text analysis, what are the instances?
Instances = Documents (no matter their size)
A corpus is a collection of documents to learn about (labeled or unlabeled).
Features describe instances in a way that machines can learn from, by putting them into feature space.
Document Features
Document level features
- Metadata: title, author
- Paragraphs
- Sentence construction
Word level features
- Vocabulary
- Form (capitalization)
- Frequency
Document → Paragraphs → Sentences → Tokens
Vector Encoding
- Basic representation of documents: a vector whose length is equal to the vocabulary of the entire corpus.
- Word positions in the vector are based on lexicographic order.
Example documents:
“The elephant sneezed at the sight of potatoes.”
“Bats can see via echolocation. See the bat sight sneeze!”
“Wondering, she opened the door to the studio.”
Corpus vocabulary: at, bat, can, door, echolocation, elephant, of, open, potato, see, she, sight, sneeze, studio, the, to, via, wonder
Bag of Words: Token Frequency
- One of the simplest models: compute the frequency of words in the document and use those numbers as the vector encoding.
[Figure: token-frequency vector for “Bats can see via echolocation. See the bat sight sneeze!”: bat 2, can 1, echolocation 1, see 2, sight 1, sneeze 1, the 1, via 1; all other vocabulary terms 0]
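One way to compute such frequency vectors, assuming scikit-learn is available (an NLTK FreqDist or a plain Counter would work as well):

# Bag-of-words vectors with scikit-learn's CountVectorizer.
# Note: CountVectorizer does its own tokenization and lowercasing, and no stemming,
# so the vocabulary will differ slightly from the figure (e.g. "potatoes" vs "potato").
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The elephant sneezed at the sight of potatoes.",
    "Bats can see via echolocation. See the bat sight sneeze!",
    "Wondering, she opened the door to the studio.",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)   # sparse document-term matrix
print(sorted(vectorizer.vocabulary_))       # corpus vocabulary, in lexicographic order
print(counts.toarray()[1])                  # token-frequency vector for the second document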
One Hot Encoding
- The feature vector encodes the vocabulary of the document.
- All words are equally distant, so must reduce word forms.
- Usually used for artificial neural network models.
[Figure: one-hot vector for “The elephant sneezed at the sight of potatoes.” over the corpus vocabulary: 1 for each term that appears in the document, 0 otherwise]
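A minimal sketch, again assuming scikit-learn: binary=True turns counts into presence/absence indicators.

# One-hot (presence/absence) encoding: 1 if the term occurs in the document, 0 otherwise.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The elephant sneezed at the sight of potatoes.",
    "Bats can see via echolocation. See the bat sight sneeze!",
    "Wondering, she opened the door to the studio.",
]
onehot = CountVectorizer(binary=True)
print(onehot.fit_transform(corpus).toarray()[0])   # 1/0 vector for the first document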
TF-IDF Encoding
- Highlight terms that are very relevant to a document relative to the rest of the corpus by computing the term frequency times the inverse document frequency of the term.
[Figure: TF-IDF vector for “Wondering, she opened the door to the studio.”: door .3, open .3, she .3, studio .4, wonder .3; all other vocabulary terms 0]
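A sketch with scikit-learn's TfidfVectorizer; the exact weights depend on its smoothing and normalization settings, so they will not match the figure above exactly.

# TF-IDF weighting: term frequency times inverse document frequency.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The elephant sneezed at the sight of potatoes.",
    "Bats can see via echolocation. See the bat sight sneeze!",
    "Wondering, she opened the door to the studio.",
]
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(corpus)

# Non-zero weights for the third document: terms unique to it score highest.
row = weights.toarray()[2]
print({term: round(row[idx], 2) for term, idx in tfidf.vocabulary_.items() if row[idx] > 0})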
Pros and Cons of Vector Encoding
Pros
- Machine learning requires a vector anyway.
- Can embed complex representations like TF-IDF into the vector form.
- Drives towards token-concept mapping without rules.
Cons
- The vectors have lots of columns (high dimension).
- Word order, grammar, and other structural features are natively lost.
- Difficult to add knowledge to the learning process.
In the end, much of the work for language-aware applications comes from domain-specific feature analysis, not just simple vectorization.
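As an illustration, an NLTK-style feature extractor is just a function from a document to a dictionary of hand-designed, domain-specific features; the feature names below are hypothetical.

# Hypothetical domain-specific feature extractor (the dict-of-features convention
# used by NLTK classifiers). Assumes nltk.download('punkt') has been run.
import nltk

def document_features(text):
    tokens = nltk.word_tokenize(text)
    return {
        "num_tokens": len(tokens),
        "num_sentences": len(nltk.sent_tokenize(text)),
        "has_exclamation": "!" in text,
        "pct_capitalized": sum(t.istitle() for t in tokens) / float(max(len(tokens), 1)),
    }

print(document_features("Watch out for that crab! It pinches."))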
Classification and Clustering
Classification
- Supervised ML
- Requires a pre-labeled corpus of documents
- Sentiment Analysis
- Models: Naive Bayes, Maximum Entropy
Clustering
- Unsupervised ML
- Groups similar documents together
- Topic Modeling
- Models: LDA, NNMF
Classification Pipeline
[Diagram: training text/documents + labels → feature vectors → classifier algorithm → predictive model; new document → feature vector → predictive model → classification]
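A sketch of that pipeline with scikit-learn; the tiny labeled training set is made up purely for illustration.

# Classification pipeline: vectorize labeled documents, fit Naive Bayes, classify new text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

train_docs = ["I loved the steamed crab", "Great beach, great seafood",
              "Terrible traffic and rude staff", "The food was awful"]
train_labels = ["pos", "pos", "neg", "neg"]

model = Pipeline([("vect", TfidfVectorizer()), ("clf", MultinomialNB())])
model.fit(train_docs, train_labels)

print(model.predict(["The seafood was great"]))   # predicted label for a new document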
Topic Modeling Pipeline
[Diagram: training text/documents → feature vectors → clustering algorithm → topics; new document → feature vector → similarity to Topic A / Topic B / Topic C]
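A similar sketch for topic modeling with scikit-learn's LDA implementation (gensim is another common choice); the documents are made up for illustration.

# Topic modeling pipeline: vectorize unlabeled documents, fit LDA, inspect topic mixtures.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["crabs and seafood at the beach", "steamed crab and ocean views",
        "traffic on the beltway was heavy", "the beltway traffic never moves"]

vect = CountVectorizer()
counts = vect.fit_transform(docs)

# n_components is the number of topics (called n_topics in older scikit-learn releases).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # per-document topic distributions
print(doc_topics)
print(lda.components_.shape)             # topic-term weight matrix: (2 topics, vocabulary size)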
These two fundamental techniques are the basis of most NLP.
It’s all about designing the problem correctly.
Consider an automatic question answering system: how might you architect the application?
Production Grade NLP
NLTK / TextBlob:
- Text processing
- Lexical resources
- Access to WordNet
Beyond NLTK / TextBlob:
- Access to LDA models
- Good TF-IDF modeling
- Use of word2vec
- More model families
- Faster implementations
- Pipelines for NLP
Our Task
Build a system that ingests raw language data and transforms it into a suitable representation for creating revolutionary applications.
Workshop
Building language-aware data products
- Reading corpora for analysis
- Feature Extraction
- Document Classification
- Topic Modeling (Clustering)

Natural Language Processing with Python

  • 1.
  • 2.
    Intro to NaturalLanguage Processing
  • 3.
    What is Language? Or,why should we analyze it?
  • 4.
    What does thismean? Crab
  • 5.
    Is the symbolrepresentative of its meaning? Crab/kræb/
  • 6.
    Or is itjust a mapping in a mental lexicon? Crab/kræb/ symbolsound sight
  • 7.
    These symbols arearbitrary κάβουρας/kávouras/ symbolsound sight
  • 8.
    Linguistic symbols: - Notacoustic “things” - Not mental processes Critical implications for: - Literature - Linguistics - Computer Science - Artificial Intelligence What is Language?
  • 9.
    The science thathas been developed around the facts of language passed through three stages before finding its true and unique object. First something called "grammar" was studied. This study, initiated by the Greeks and continued mainly by the French, was based on logic. It lacked a scientific approach and was detached from language itself. Its only aim was to give rules for distinguishing between correct and incorrect forms; it was a normative discipline, far removed from actual observation, and its scope was limited. -- Ferdinand de Saussure What is Natural Language Processing?
  • 10.
    The science thathas been developed around the facts of language passed through three stages before finding its true and unique object. First something called "grammar" was studied. This study, initiated by the Greeks and continued mainly by the French, was based on logic. It lacked a scientific approach and was detached from language itself. Its only aim was to give rules for distinguishing between correct and incorrect forms; it was a normative discipline, far removed from actual observation, and its scope was limited. -- Ferdinand de Saussure What is Natural Language Processing?
  • 11.
    Formal vs. NaturalLanguages Formal Languages - Strict, unchanging rules defined by grammars and parsed by regular expressions - Generally application specific (chemistry, math) - Literal: exactly what is said is meant. - No ambiguity - Parsable by regular expressions - Inflexible: no new terms or meaning. Natural Languages - Flexible, evolving language that occurs naturally in human communication - Unspecific and used in many domains and applications - Redundant and verbose in order to make up for ambiguity - Expressive - Difficult to parse - Very flexible even in narrow contexts
  • 12.
    Computer science has traditionallyfocused on formal languages.
  • 13.
    tokens Who has writtena compiler or interpreter? Lexical Analysis Syntactic Analysis “Lexing” “Parsing” Regular Grammar (Word Rules) Context Free Grammar (Sentence Rules) string tree
  • 14.
    Syntax Errors areyour friend! http://bbc.in/23pSPRq
  • 15.
    However, ambiguity isrequired for understanding when communicating between people with diverse experience.
  • 16.
    Most of thetime. http://bit.ly/1XlOtHv
  • 17.
    Natural Language Processing requiresflexibility, which generally comes from machine learning.
  • 18.
    Language models areused for flexibility
  • 19.
    “There was aton of traffic on the beltway so I was ________.” “At beach we watched the ________.” “Watch out for that ________!” Intuition: Language is Predictable (if flexible)
  • 20.
    These models predictrelationships between tokens. crab beach steamed boat ocean sand seafood
  • 21.
    But they can’tunderstand meaning (yet)
  • 22.
    So please keepin mind: Tokens != Words - Substrings - Only structural - Data “bearing” “shouldn’t” - Objects - Contains a “sense” - Meaning to bear.verb-1 should.auxverb-3 not.adverb-1
  • 23.
    The State ofthe Art - Academic design for use alongside intelligent agents (AI discipline) - Relies on formal models or representations of knowledge & language - Models are adapted and augmented through probabilistic methods and machine learning. - A small number of algorithms comprise the standard framework.
  • 24.
    Traditional NLP Applications -Summarization - Reference Resolution - Machine Translation - Language Generation - Language Understanding - Document Classification - Author Identification - Part of Speech Tagging - Question Answering - Information Extraction - Information Retrieval - Speech Recognition - Sense Disambiguation - Topic Recognition - Relationship Detection - Named Entity Recognition
  • 25.
    What is Required? DomainKnowledge A Corpus in the Domain
  • 26.
  • 27.
    The study ofthe forms of things, words in particular. Consider pluralization for English: ● Orthographic Rules: puppy → puppies ● Morphological Rules: goose → geese or fish Major parsing tasks: stemming, lemmatization and tokenization. Morphology
  • 28.
    The study ofthe rules for the formation of sentences. Major tasks: chunking, parsing, feature parsing, grammars Syntax
  • 29.
    Semantics The study ofmeaning. - I see what I eat. - I eat what I see. - He poached salmon. Major Tasks Frame extraction, creation of TMRs
  • 30.
    Thematic Meaning Representation “Theman hit the building with the baseball bat” { "subject": {"text": "the man", "sense": human-agent}, "predicate": {"text": "hit", "sense": strike-physical-force}, "object": {"text": "the building", "sense": habitable-structure}, "instrument": {"text": "with the baseball bat" "sense": sports-equipment} }
  • 31.
    Recent NLP Applications -Yelp Insights - Winning Jeopardy! IBM Watson - Computer assisted medical coding (3M Health Information Systems) - Geoparsing -- CALVIN (built by Charlie Greenbacker) - Author Identification (classification/clustering) - Sentiment Analysis (RTNNs, classification) - Language Detection - Event Detection - Google Knowledge Graph - Named Entity Recognition and Classification - Machine Translation
  • 32.
    Applications are BIGdata - Examples are easier to create than rules. - Rules and logic miss frequency and language dynamics - More data is better for machine learning, relevance is in the long tail - Knowledge engineering is not scalable - Computational linguistics methodologies are stochastic
  • 33.
  • 34.
    What is NLTK? -Python interface to over 50 corpora and lexical resources - Focus on Machine Learning with specific domain knowledge - Free and Open Source - Numpy and Scipy under the hood - Fast and Formal
  • 35.
    What is NLTK? Suiteof libraries for a variety of academic text processing tasks: - tokenization, stemming, tagging, - chunking, parsing, classification, - language modeling, logical semantics Pedagogical resources for teaching NLP theory in Python ...
  • 36.
    Who Wrote NLTK? StevenBird Associate Professor University of Melbourne Senior Research Associate, LDC Ewan Klein Professor of Language Technology University of Edinburgh.
  • 37.
    Batteries Included NLTK =Corpora + Algorithms Ready for Research!
  • 38.
    What is NLTKnot? - Production ready out of the box* - Lightweight - Generally applicable - Magic *There are actually a few things that are production ready right out of the box.
  • 39.
    The Good - Preprocessing -segmentation, tokenization, PoS tagging - Word level processing - WordNet, Lemmatization, Stemming, NGram - Utilities - Tree, FreqDist, ConditionalFreqDist - Streaming CorpusReader objects - Classification - Maximum Entropy, Naive Bayes, Decision Tree - Chunking, Named Entity Recognition - Parsers Galore! - Languages Galore!
  • 40.
    The Bad - SyntacticParsing - No included grammar (not a black box) - Feature/Dependency Parsing - No included feature grammar - The sem package - Toy only (lambda-calculus & first order logic) - Lots of extra stuff - papers, chat programs, alignments, etc.
  • 41.
    Other Python NLPLibraries - TextBlob - SpaCy - Scikit-Learn - Pattern - gensim - MITIE - guess_language - Python wrapper for Stanford CoreNLP - Python wrapper for Berkeley Parser - readability-lxml - BeautifulSoup
  • 42.
    NLTK Demo - Workingwith Included Corpora - Segmentation - Tokenization - Tagging - A Parsing Exercise - Named Entity RecognitionWorking with Text
  • 43.
  • 44.
    Machine learning usesinstances (examples) of data to fit a parameterized model which is used to make predictions concerning new instances.
  • 45.
    In text analysis,what are the instances?
  • 46.
    Instances = Documents (nomatter their size)
  • 47.
    A corpus isa collection of documents to learn about. (labeled or unlabeled)
  • 48.
    Features describe instancesin a way that machines can learn on by putting them into feature space.
  • 49.
    Document Features Document levelfeatures - Metadata: title, author - Paragraphs - Sentence construction Word level features - Vocabulary - Form (capitalization) - Frequency Document Paragraphs Sentences Tokens
  • 50.
    Vector Encoding - Basicrepresentation of documents: a vector whose length is equal to the vocabulary of the entire corpus. - Word positions in the vector are based on lexicographic order. The elephant sneezed at the sight of potatoes. Bats can see via echolocation. See the bat sight sneeze! Wondering, she opened the door to the studio. at bat can door echolocationelephant of open potato see she sightsneeze studio the to viaw onder
  • 51.
    - One ofthe simplest models: compute the frequency of words in the document and use those numbers as the vector encoding. 0 at 2 bat 1 can 0 door 1 echolocation 0 elephant 0 of 0 open 0 potato 2 see 0 she 1sight 1 sneeze 0 studio 1 the 0 to 1 via 0 w onder Bag of Words: Token Frequency The elephant sneezed at the sight of potatoes. Bats can see via echolocation. See the bat sight sneeze! Wondering, she opened the door to the studio.
  • 52.
    One Hot Encoding -The feature vector encodes the vocabulary of the document. - All words are equally distant, so must reduce word forms. - Usually used for artificial neural network models. The elephant sneezed at the sight of potatoes. Bats can see via echolocation. See the bat sight sneeze! Wondering, she opened the door to the studio. 1 at 0 bat 0 can 0 door 0 echolocation 1 elephant 0 of 0 open 1 potato 0 see 0 she 1sight 1 sneeze 0 studio 1 the 0 to 0 via 0 w onder
  • 53.
    TF-IDF Encoding - Highlightterms that are very relevant to a document relative to the rest of the corpus by computing the term frequency times the inverse document frequency of the term. The elephant sneezed at the sight of potatoes. Bats can see via echolocation. See the bat sight sneeze! Wondering, she opened the door to the studio. 0 at 0 bat 0 can .3 door 0 echolocation 0 elephant 0 of .3 open 0 potato 0 see .3 she 0sight 0 sneeze .4 studio 0 the 0 to 0 via .3 w onder
  • 54.
    Pros and Consof Vector Encoding Pros - Machine learning requires a vector anyway. - Can embed complex representations like TF-IDF into the vector form. - Drives towards token- concept mapping without rules. Cons - The vectors have lots of columns (high dimension) - Word order, grammar, and other structural features are natively lost. - Difficult to add knowledge to learning process.
  • 55.
    In the end,much of the work for language aware applications comes from domain specific feature analysis; not just simple vectorization.
  • 56.
    Classification and Clustering Classification -Supervised ML - Requires pre-labeled corpus of documents - Sentiment Analysis - Models: - Naive Bayes - Maximum Entropy Clustering - Unsupervised ML - Groups similar documents together. - Topic Modeling - Models: - LDA - NNMF
  • 57.
  • 58.
    Topic Modeling Pipeline TrainingText, Documents Feature Vectors Clustering Algorithm New Document Feature Vector Topic A Topic B Topic C Similarity
  • 59.
    These two fundamentaltechniques are the basis of most NLP. It’s all about designing the problem correctly.
  • 60.
    Consider an automaticquestion answering system: how might you architect the application?
  • 61.
    Production Grade NLP NLTK TextBlob -Access to LDA model - Good TF-IDF Modeling - Use word2vec - Text processing - Lexical resources - Access to WordNet - More model families - Faster implementations - Pipelines for NLP
  • 62.
    Our Task Build asystem that ingests raw language data and transforms it into a suitable representation for creating revolutionary applications.
  • 63.
    Workshop Building language awaredata products - Reading corpora for analysis - Feature Extraction - Document Classification - Topic Modeling (Clustering)