Lightweight Natural Language Processing (NLP)

Lightweight NLP for Social Media Applications

Bruce Smith
Lithium Technologies, Inc.
SXSW 2012
March 13, 2012

@btsmith
#nlp #sxsw

Lightweight NLP

Are You in the Right Session?
What Can You Learn in this Session?

NLP = Natural Language Processing
▪ This session is not about
• Natural Law Party
• Neuro-linguistic Programming
• No Light Perception (total blindness)
• Nonlinear Programming


N-Grams ≠ Engrams, Enneagrams, etc
▪ I will talk about “n-grams” several times
▪ Wikipedia has pages for 3 different kinds of “engram”
• Neuropsychology
• Scientology
• 2009 album by Finnish black metal band Beherit

▪ Wikipedia has pages for 3 different kinds of “enneagram”
• Nine-sided star polygon
• Enneagram of Personality
• Fourth Way Enneagram


Are you…
▪ developing a social media application?

▪ looking for ways to make your application better?

▪ interested in a quick introduction to NLP or text analytics?


Do you want to know…
▪ how you can use NLP tools in your social media app?

▪ if you need a Ph.D. to use NLP tools?

▪ where to find free NLP tools?

▪ where to learn more?


Do you want to understand…
▪ the role of machine learning in NLP?

▪ the difference between training and production?

▪ what a training corpus is and where to find one?


This is a Great Time to Start Using NLP!
▪ Computers are powerful and cheap!
▪ There’s a lot of very good, free software!
▪ There’s an enormous amount of very good, free text data!
▪ Don’t be afraid of non-English content!
• Unicode is your friend
• just remember ‘utf-8’


Lightweight NLP

Very Simple NLP with Very Little Math

Document, Corpus, Treebank
▪ document
• newspaper article, novel, patent, scientific paper
• blog post, comment, status update, tweet
▪ corpus
• collection of documents
• plural is “corpora”
▪ treebank
• annotated corpus
• words are annotated with parts of speech
• sentences are annotated with parse trees


Penn Treebank’s Parts of Speech
Tag   Description
CC    Coordinating conjunction
CD    Cardinal number
DT    Determiner
IN    Preposition or subordinating conjunction
…     …
JJ    Adjective
JJR   Adjective, comparative
JJS   Adjective, superlative
…     …
NN    Noun, singular or mass
NNS   Noun, plural
NNP   Proper noun, singular
…     …
POS   Possessive ending
PRP   Personal pronoun
PRP$  Possessive pronoun
…     …
VB    Verb, base form
VBD   Verb, past tense
VBG   Verb, gerund or present participle
…     …
WP    Wh-pronoun
WP$   Possessive wh-pronoun
…     …


Phrase Structure Grammars & Parse Trees

Grammar
S → NP VP
…
NP → NN
NP → JJ NN
…
VP → V NP
…

Phrases (non-terminals)
S     Sentence
NP    Noun Phrase
VP    Verb Phrase
PP    Prepositional Phrase
…     …

POS (terminals)
NNP   Proper noun, singular
NNS   Noun, plural
VBZ   Verb, 3rd person singular present
…     …

Parse Tree
(S (NP (NNP Bruce)) (VP (VBZ likes) (NP (NNS dogs))))


N-Grams
▪ contiguous subsequence of n items
• in order and with no gaps
• words
• characters

▪ n-grams have special names when n is small
• unigram n=1
• bigram n=2
• trigram n=3
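
The definition above translates almost directly into code. The following is a minimal sketch in Python, assuming the input is any sequence (a list of words or a string of characters); the function name `ngrams` is chosen here for illustration and is not taken from the slides.

```python
def ngrams(items, n):
    """Return every contiguous run of n items, in order and with no gaps."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


words = "lightweight nlp for social media applications".split()
word_bigrams = ngrams(words, 2)       # n-grams over words
char_trigrams = ngrams("nlp for", 3)  # n-grams over characters (a string is a sequence too)
```

A sequence shorter than n simply yields an empty list, so callers do not need a special case for very short documents.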


Character N-Grams
▪ Unigrams for this session’s title
Lightweight NLP for Social Media Applications

l i g h t w e i g h t n l p f o r s o c i a l m e d i a a p p l i c a t i o n s


Character N-Grams
▪ Bigrams for this session’s title

li ig gh ht tw we ei ig gh ht tn nl lp pf fo or rs so oc ci
ia al lm me ed di ia aa ap pp pl li ic ca at ti io on ns


Character N-Grams
▪ Trigrams for this session’s title

lig igh ght htw twe wei eig igh ght htn tnl nlp lpf pfo for ors rso soc oci
cia ial alm lme med edi dia iaa aap app ppl pli lic ica cat ati tio ion ons


Character N-Gram Frequencies
▪ N-grams are interesting when we look at frequencies

i – 6    gh – 2    ght – 2
a – 4    ht – 2    igh – 2
l – 4    ia – 2    aap – 1
o – 3    ig – 2    alm – 1
p – 3    li – 2    app – 1
… … …
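
Counts like these can be reproduced with a few lines of Python. This is a minimal sketch, assuming the title is lowercased and its spaces are dropped before counting (which is what the n-gram lists on the preceding slides appear to do); `char_ngram_counts` is a name made up for this example.

```python
from collections import Counter

title = "Lightweight NLP for Social Media Applications"
# Assumption: lowercase the title and drop spaces before counting.
text = title.lower().replace(" ", "")


def char_ngram_counts(text, n):
    """Count every contiguous run of n characters in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


unigrams = char_ngram_counts(text, 1)
bigrams = char_ngram_counts(text, 2)
trigrams = char_ngram_counts(text, 3)

# Most frequent entries first, e.g. unigrams.most_common(5)
```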


Word N-Gram Frequencies
▪ Word n-grams from Pride and Prejudice (using NLTK)

to – 4116 to be – 436 i am sure – 72
the – 4105 of the – 430 as soon as – 59
of – 3572 in the – 359 in the world – 57
and – 3491 it was – 280 i do not – 46
her – 2551 of her – 276 could not be – 42
a – 2092 to the – 242 she could not – 39
… … …


N-Gram Frequencies
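▪ Word n-grams from Pride and Prejudice
with no stopword unigrams
▪ Note: elinor and marianne are characters in Sense and Sensibility, so the unigram column appears to come from that novel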

elinor – 685 to be – 436 i am sure – 72
could – 578 of the – 430 as soon as – 59
marianne – 566 in the – 359 in the world – 57
mrs – 530 it was – 280 i do not – 46
would – 515 of her – 276 could not be – 42
said – 397 to the – 242 she could not – 39
… … …
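
Filtering stopwords out of a unigram count takes only a set lookup. The sketch below is illustrative: the stopword set is a tiny hypothetical list (real lists, such as the one shipped with NLTK, are much longer), and the sample sentence is made up rather than taken from any novel.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration.
STOPWORDS = {"the", "to", "of", "and", "a", "her", "in", "was", "it", "she"}


def content_word_counts(text):
    """Count lowercase word unigrams, skipping stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


sample = "Elinor said the letter was to Marianne, and Marianne said nothing."
counts = content_word_counts(sample)
```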

Cosine Similarity
▪ Make a vector from a document’s n-gram frequencies
▪ If A and B are frequency vectors for two documents

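\[
\mathrm{similarity}(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
= \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}
\]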


Cosine Similarity
▪ Create word N-gram frequency vectors
• with unigrams, bigrams, trigrams
• Moby Dick
• Pride and Prejudice

▪ Compute their cosine similarity
0.534
▪ More interesting with a larger set of documents…
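
The computation can be sketched in Python as follows. This is an illustration under stated assumptions, not the code behind the 0.534 figure: the vectors are sparse `Counter` objects keyed by word n-gram, tokenization is a plain whitespace split, and the two input strings are toy examples rather than the full novels.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(count * b.get(key, 0) for key, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)


def word_ngram_vector(text, max_n=3):
    """Frequency vector over word unigrams, bigrams and trigrams."""
    words = text.lower().split()
    vec = Counter()
    for n in range(1, max_n + 1):
        vec.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return vec


doc_a = word_ngram_vector("call me ishmael")
doc_b = word_ngram_vector("call me maybe")
score = cosine_similarity(doc_a, doc_b)
```

Each toy document has six n-grams and they share three of them, so the score is 3 / (√6 · √6) = 0.5. Because frequencies are never negative, the result always lies between 0 and 1.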


NLP and Machine Learning
▪ In the past, NLP was more about
grammars and logic and parsing
▪ Today, NLP is more about
statistics and machine learning
▪ Why?
• computers are much more powerful
• there are enormous amounts of very good, free data


▪ Think of machine learning as

programming by analyzing sample data

▪ Example
• Use the Penn Treebank as sample data
• Build a program that labels words with parts-of-speech


▪ Training
• depends on sample data, your training corpus
• there are very good, free machine learning tools
• sometimes training is slow
• experiment with different techniques (perceptron, SVM, etc)
• test, test, test…

▪ Production
• uses models generated during training
• typically very fast


Lightweight NLP

Lightweight NLP Techniques

Lightweight NLP Techniques
▪ Language Identification

▪ Sentence Breaking

▪ Stemming

▪ Part-of-Speech Tagging


Language Identification
You might try looking at
▪ character sets (e.g. Unicode character blocks)
▪ words in language-specific dictionaries
▪ character n-gram frequencies and cosine similarity
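
The first idea, looking at character sets, can be approximated with Python's standard library. This sketch uses the first word of each letter's Unicode character name (for example CYRILLIC or LATIN) as a stand-in for its script; that is a heuristic assumed for this example, since the standard library does not expose Unicode blocks directly.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Guess the script of text from the first word of each letter's Unicode name."""
    scripts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts[name.split()[0]] += 1
    return scripts.most_common(1)[0][0] if scripts else None
```

A script narrows the candidates but rarely settles the question: English, French and German all come back as LATIN, which is where dictionaries and n-gram frequencies come in.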


▪ Character n-gram frequencies for English
e 12.6% th 3.9% the 3.5%
t 9.1% he 3.7% and 1.6%
a 8.0% in 2.3% ing 1.1%
o 7.6% er 2.2% her 0.8%
i 6.9% an 2.1% hat 0.7%
n 6.9% re 1.7% his 0.6%
s 6.3% nd 1.6% tha 0.6%
h 6.2% on 1.4% ere 0.6%
… … …

From Cryptograms.org, derived from English documents at Project Gutenberg
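
Frequency profiles like this one can drive a simple language identifier: build a character trigram profile per language, then pick the language whose profile has the highest cosine similarity to the document. The sketch below assumes two hypothetical one-sentence training samples, which are far too small for real use; it is not how Tika or any other tool is implemented.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Character trigram counts after lowercasing and collapsing whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a, b):
    dot = sum(count * b.get(gram, 0) for gram, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


# Hypothetical, far-too-small training samples; real profiles need much more text.
PROFILES = {
    "en": trigram_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "de": trigram_profile("der schnelle braune fuchs springt über den faulen hund und die katze"),
}


def identify(text):
    """Return the language code whose profile is most similar to text."""
    doc = trigram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(doc, PROFILES[lang]))


guess = identify("the dog and the fox")
```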


Language Identification with Tika
▪ tika.apache.org
▪ models for
da Danish fr French ro Romanian
de German is Icelandic ru Russian
et Estonian it Italian sv Swedish
el Greek nl Dutch th Thai
en English no Norwegian uk Ukrainian
es Spanish pl Polish …
fi Finnish pt Portuguese

▪ trainable with sample data

Where can you find samples of…
▪ French?
▪ German?
▪ Russian?
▪ Japanese?
▪ Arabic?
▪ Cherokee?


Sentence Breaking
▪ Also known as
• sentence boundary disambiguation
• sentence detection

▪ You could just look for punctuation, but…
• what about abbreviations?
• what about numbers?
• what about domain names like lithium.com, etc?
• what about names like Yahoo!, etc?
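
A rule-based splitter shows both how far punctuation rules go and where they stop. This is a minimal sketch with a hypothetical abbreviation list; it handles abbreviations, numbers and domain names in the sample text, but a name like Yahoo! followed by a capitalized word would still fool it, which is one reason trained models are attractive.

```python
import re

# Hypothetical abbreviation list; a trained model would learn these cases from data.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "inc.", "etc.", "e.g.", "i.e."}


def split_sentences(text):
    """Break after ., ! or ? only when whitespace and an uppercase letter follow."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]+(?=\s+[A-Z])", text):
        candidate = text[start:match.end()]
        last_token = candidate.split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # e.g. "Mrs." is not the end of a sentence
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = "Mrs. Bennet paid $3.50 at lithium.com. She was happy! Was he?"
result = split_sentences(text)
```

The periods inside `$3.50` and `lithium.com` are never candidates because no whitespace follows them.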


Sentence Breaking with OpenNLP
▪ opennlp.apache.org
▪ models for

da Danish nl Dutch
de German pt Portuguese
en English se Swedish

▪ trainable with new sample data


Stemming
▪ Reducing a word to a stem or base form
▪ Porter Stemmer is a popular stemmer for English
▪ Examples

lightweight → lightweight
natural → natur
language → languag
processing → process


Stemming
▪ A few examples from Pride and Prejudice (using NLTK)

stem: words
affect: affect, affectation, affected, affecting, affection, affections, affects
amus: amuse, amused, amusement, amusements, amusing
close: close, closed, closely, closing
grate: grate, grateful, gratefully


Stemming with Snowball
▪ tartarus.org
▪ stemmers for

de German nl Dutch
en English no Norwegian
es Spanish pt Portuguese
fi Finnish ru Russian
fr French sv Swedish
it Italian …


Part-of-Speech Tagging
▪ “Part of speech” is frequently abbreviated POS
▪ Not every language has the same parts of speech
▪ Even for one language,
not everyone agrees on the parts of speech
▪ Example: Penn Treebank POS tags for English


Part-of-Speech Tagging
lightweight nlp for social media applications
lightweight   NN
nlp           NN
for           IN
social        JJ
media         NNS
applications  NNS

nlp is easier than you thought
nlp           NN
is            VBZ
easier        JJR
than          IN
you           PRP
thought       VBD


Part-of-Speech Tagging with OpenNLP
▪ opennlp.apache.org
▪ two kinds of models for each of

de German pt Portuguese
en English se Swedish
nl Dutch
▪ trainable with new sample data


Lightweight NLP

Lightweight NLP in Applications

Lightweight NLP in Applications
▪ Language Identification
▪ Sentence Breaking for Summaries
▪ Stemming for Word Counts
▪ POS Tagging for Document Categorization
▪ Lithium SMM Quotes


Lithium SMM (Social Media Monitoring)


▪ Language ID is never perfect,
especially with social media!

• short documents
• ambiguity
• mixed languages
• nonsense
• and… lots of very strange stuff


What language is this?

______________$$$$______________
____________$$$$$$$$____________
___________$$$$$$$$$$___________
___________$$$$$$$$$$___________
_____________$$$$$$_____________
_____________$$$$$$_____________
_____________$$$$$$_____________
____$$$$_____$$$$$$_____$$$$____
___$$$$$_____$$$$$$_____$$$$$___
_$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$_
_$$$$$$$$$$$.СРБИЈА.$$$$$$$$$$$_
___$$$$$$$$$$$$$$$$$$$$$$$$$$___
____$$$$_____$$$$$$_____$$$$____
_____________$$$$$$_____________
_____________$$$$$$_____________
_____________$$$$$$_____________
_____________$$$$$$_____________
_____________$$$$$$_____________
_____________$$$$$$_____________
_____________$$$$$$_____________
_____________$$$$$$_____________
___________$$$$$$$$$$___________
___________$$$$$$$$$$___________
____________$$$$$$$$____________
______________$$$$______________


What language is this?

ღೋ ´¯`•.¸,¤°`°¤,¸.•´¯`•.¸,¤Ƹ ̵̡Ӝ ´¯`•.
̵̨̄Ʒ´¯`•.ღೋ
╱▔▌
╔═╗╔═╗╔═╗╔═╗╔═╗░╔═╗╔╗╔═╗╔═╗╔═╗╔╗╔╗─╔═╗─ ╱▔▔▔▔╲ ╱▌
║█║║═╣║═╣║╠╝║═╣░║▌║╚╝║█║║█║║█║║║║║─║═╣ ╱◑▓░▓░░ ▌
║╔╝║═╣╠═║║╠╗║═╣░║▌║──║╦║║╔╝║═╣║║║╚╗║═╣ ╲░░░░░╱╲▌
╚╝─╚═╝╚═╝╚═╝╚═╝░╚═╝──╚╩╝╚╝─╚╩╝╚╝╚═╝╚═╝─ ▔▔╲▌▔


Lithium SMM


Sentence Breaking for Summaries
▪ Summary does not replace the document
▪ Summary lets you decide if the document is interesting
▪ Summaries are sentences selected from the document
• contain the search terms
• not too short, not too long, etc
• truncated only if necessary
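
The selection rules above can be sketched in a few lines. This is an illustration only, not Lithium's implementation: the length limits, the two-sentence cap and the plain substring match (which would also match "phone" inside "telephone") are all assumptions made for this example, and the input is assumed to be already split into sentences.

```python
def summarize(sentences, search_terms, min_len=10, max_len=120, limit=2):
    """Pick sentences that mention a search term and have a reasonable length."""
    terms = [t.lower() for t in search_terms]
    matches = [s for s in sentences if any(t in s.lower() for t in terms)]
    well_sized = [s for s in matches if min_len <= len(s) <= max_len]
    chosen = (well_sized or matches)[:limit]
    # Truncate only if necessary, i.e. when nothing of suitable length matched.
    return [s if len(s) <= max_len else s[:max_len - 1].rstrip() + "…" for s in chosen]


sentences = [
    "I love my new phone.",
    "Ok.",
    "The battery on this phone lasts two days.",
    "Shipping was slow.",
]
summary = summarize(sentences, ["phone"])
```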


Lithium SMM


Frequent Words and Stemming
▪ Most common words in the results for your query
• excludes stopwords

▪ Trending words were previously not common
▪ Click on a frequent word to search within results
▪ Should we count…
• words?
• stems?
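
The words-versus-stems choice can be made a parameter. In this sketch, `normalize` is where a stemmer would plug in; the stopword list is a tiny hypothetical one, and treating "previously not common" as "absent from the earlier results" is an assumption made to keep the example short.

```python
from collections import Counter

STOPWORDS = {"the", "a", "is", "my", "and", "i"}  # hypothetical tiny list


def frequent_words(documents, normalize=lambda w: w):
    """Count non-stopword tokens; pass a stemmer as normalize to count stems."""
    counts = Counter()
    for doc in documents:
        for word in doc.lower().split():
            if word not in STOPWORDS:
                counts[normalize(word)] += 1
    return counts


def trending(current, previous, min_count=2):
    """Words that are common now but were absent from the earlier results."""
    return sorted(w for w, c in current.items() if c >= min_count and previous[w] == 0)


previous = frequent_words(["i love the phone"])
docs = ["the phone update is broken", "update broke my phone", "phone update"]
current = frequent_words(docs)
trends = trending(current, previous)
```

Counting stems merges variants such as "broke" and "broken" into one entry, at the cost of showing users a stem that may not be a real word.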


POS Tagging
▪ We use POS Tagging in Lithium SMM Quotes
• along with other things
• not such a “lightweight” application

▪ POS also useful for document categorization
• POS-based features
• machine learning


POS Tags and Document Categorization
▪ Author Gender
Automatic Categorization of Author Gender via N-Gram Analysis,
Jonathan Doyle and Vlado Keselj. In The 6th Symposium on Natural
Language Processing, SNLP'2005, Chiang Rai, Thailand, December 2005.

▪ Opinion Spam
Finding Deceptive Opinion Spam by Any Stretch of the Imagination,
Myle Ott, Yejin Choi, Claire Cardie and Jeffrey T. Hancock, The 49th Annual
Meeting of the Association for Computational Linguistics: Human Language
Technologies, Portland, Oregon, USA, June 19-24, 2011.


Lithium SMM Quotes
▪ Quotes
• Select interesting sentences from social media documents
• Classify them as love, hate, comparison, warning, etc.

▪ Quotes depends on
• language identification
• sentence breaking
• POS tagging
• parsing
• specialized dictionaries


Lithium SMM Quotes


Lightweight NLP

Resources

Wikipedia
▪ Corpus linguistics
▪ Cosine similarity
▪ Function word
▪ Language identification
▪ Machine learning
▪ N-gram
▪ Natural language processing
▪ Part-of-speech tagging
▪ Sentence boundary disambiguation
▪ Stemming
▪ Stop words
▪ Text mining
▪ Treebank


Software
▪ NLTK
• Natural Language Toolkit
• Python library for NLP
• nltk.org

▪ OpenNLP
• machine-learning based NLP tools
• Java library for NLP
• opennlp.apache.org

▪ Snowball
• ANSI C and Java stemmers
• snowball.tartarus.org

▪ Tika
• Java toolkit for extracting metadata and text from documents
• includes language identification
• tika.apache.org


Books
▪ Natural Language Processing with Python
Steven Bird, Ewan Klein & Edward Loper
O’Reilly, 2009

▪ Foundations of Statistical Natural Language Processing
Chris Manning & Hinrich Schütze
MIT Press, 1999


Organization
▪ Association for Computational Linguistics

http://www.aclweb.org

▪ Remember that’s aclweb.org

acl.org is the Association of Christian Librarians


Contact Info
▪ Bruce Smith
@btsmith
bruce.smith@lithium.com

▪ People at SXSW wearing Lithium’s Nation Builder T-shirts

