Jazz up your social media apps with Natural Language Processing (NLP). Find out how you can use NLP in your social media apps, where to find free NLP tools, and where to learn more about NLP, so you can put your social media investments to work with the right technology.
Lightweight Natural Language Processing (NLP)
1. Lightweight NLP
for Social Media Applications
Bruce Smith
Lithium Technologies, Inc.
SXSW 2012
March 13, 2012
@btsmith
#nlp #sxsw
2. Lightweight NLP
for Social Media Applications
Are You in the Right Session?
What Can You Learn in this Session?
3. NLP = Natural Language Processing
▪ This session is not about
Natural Law Party
Neuro-linguistic Programming
No Light Perception (total blindness)
Nonlinear Programming
@btsmith #nlp 3
4. N-Grams ≠ Engrams, Enneagrams, etc
▪ I will talk about “n-grams” several times
▪ Wikipedia has pages for 3 different kinds of “engram”
• Neuropsychology
• Scientology
• 2009 album by Finnish black metal band Beherit
▪ Wikipedia has pages for 3 different kinds of “enneagram”
• Nine-sided star polygon
• Enneagram of Personality
• Fourth Way Enneagram
5. Are you…
▪ developing a social media application?
▪ looking for ways to make your application better?
▪ interested in a quick introduction to NLP or text analytics?
6. Do you want to know…
▪ how you can use NLP tools in your social media app?
▪ if you need a Ph.D. to use NLP tools?
▪ where to find free NLP tools?
▪ where to learn more?
7. Do you want to understand…
▪ the role of machine learning in NLP?
▪ the difference between training and production?
▪ what a training corpus is and where to find one?
8. This is a Great Time to Start Using NLP!
▪ Computers are powerful and cheap!
▪ There’s a lot of very good, free software!
▪ There’s an enormous amount of very good, free text data!
▪ Don’t be afraid of non-English content!
• Unicode is your friend
• just remember “utf-8”
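A minimal sketch of what "just remember utf-8" can look like in Python: decode bytes explicitly on the way in, and work with characters after that. The helper name `to_text` and the sample string are illustrative assumptions, not part of the talk.

```python
def to_text(data, encoding="utf-8"):
    """Decode bytes to str; undecodable bytes become U+FFFD instead of raising."""
    return data.decode(encoding, errors="replace")

# Stand-in for bytes read from a file, socket, or API response.
raw = "Schöne Grüße, 日本語, русский".encode("utf-8")
text = to_text(raw)

# After decoding, lengths and slices are in characters, not bytes.
print(len(raw), "bytes ->", len(text), "characters")
```

Decoding once at the boundary keeps every later step (n-grams, tokenizing, counting) independent of the language of the content.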
10. Document, Corpus, Treebank
▪ document
• newspaper article, novel, patent, scientific paper
• blog post, comment, status update, tweet
▪ corpus
• collection of documents
• plural is “corpora”
▪ treebank
• annotated corpus
• words are annotated with parts of speech
• sentences are annotated with parse trees
11. Penn Treebank’s Parts of Speech
CC    Coordinating conjunction
CD    Cardinal number
DT    Determiner
IN    Preposition or subordinating conjunction
…     …
JJ    Adjective
JJR   Adjective, comparative
JJS   Adjective, superlative
…     …
NN    Noun, singular or mass
NNS   Noun, plural
NNP   Proper noun, singular
…     …
POS   Possessive ending
PRP   Personal pronoun
PRP$  Possessive pronoun
…     …
VB    Verb, base form
VBD   Verb, past tense
VBG   Verb, gerund or present participle
…     …
WP    Wh-pronoun
WP$   Possessive wh-pronoun
…     …
12. Phrase Structure Grammars & Parse Trees
Grammar
S → NP VP
…
NP → NN
NP → JJ NN
…
VP → V NP
…
Parse Tree
(S (NP (NNP Bruce)) (VP (VBZ likes) (NP (NNS dogs))))
Phrases (non-terminals)
S     Sentence
NP    Noun Phrase
VP    Verb Phrase
PP    Prepositional Phrase
…     …
POS (terminals)
NNP   Proper noun, singular
NNS   Noun, plural
VBZ   Verb, 3rd person singular present
…     …
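One way to hold the parse tree for "Bruce likes dogs" in code is a nested tuple of (label, children...). This is only a sketch: the tuple layout and the helper names `tagged_leaves` and `bracketed` are assumptions made for illustration, not a treebank file format.

```python
# Parse tree for "Bruce likes dogs": each node is (label, child, child, ...).
tree = ("S",
        ("NP", ("NNP", "Bruce")),
        ("VP", ("VBZ", "likes"), ("NP", ("NNS", "dogs"))))

def tagged_leaves(node):
    """Return (word, POS tag) pairs in sentence order."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]          # a terminal: POS tag over a word
    return [leaf for child in children for leaf in tagged_leaves(child)]

def bracketed(node):
    """Render the tree in bracketed notation."""
    if isinstance(node, str):
        return node
    return "(" + " ".join([node[0]] + [bracketed(c) for c in node[1:]]) + ")"

print(tagged_leaves(tree))   # [('Bruce', 'NNP'), ('likes', 'VBZ'), ('dogs', 'NNS')]
print(bracketed(tree))
```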
13. N-Grams
▪ contiguous subsequence of n items
• in order and with no gaps
• words
• characters
▪ n-grams have special names when n is small
• unigram n=1
• bigram n=2
• trigram n=3
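The definition above fits in a few lines of Python. This is a minimal sketch; the function name `ngrams` is an assumption, and the same function works for words or characters because it only needs a sequence.

```python
def ngrams(items, n):
    """Return every contiguous run of n items, in order and with no gaps."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

words = "lightweight nlp for social media applications".split()
print(ngrams(words, 2))        # word bigrams
print(ngrams(list("nlp"), 2))  # character bigrams: [('n', 'l'), ('l', 'p')]
```

A sequence shorter than n simply yields an empty list.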
14. Character N-Grams
▪ Unigrams for this session’s title
Lightweight NLP for Social Media Applications
l i g h t w e i g h t n l p f o r s o c i a l m e d i a a p p l i c a t i o n s
15. Character N-Grams
▪ Bigrams for this session’s title
Lightweight NLP for Social Media Applications
li ig gh ht tw we ei ig gh ht tn nl lp pf fo or rs so oc ci
ia al lm me ed di ia aa ap pp pl li ic ca at ti io on ns
16. Character N-Grams
▪ Trigrams for this session’s title
Lightweight NLP for Social Media Applications
lig igh ght htw twe wei eig igh ght htn tnl nlp lpf pfo for ors rso soc oci
cia ial alm lme med edi dia iaa aap app ppl pli lic ica cat ati tio ion ons
17. Character N-Gram Frequencies
▪ N-grams are interesting when we look at frequencies
Lightweight NLP for Social Media Applications
Unigrams   Bigrams   Trigrams
i – 6      gh – 2    ght – 2
a – 4      ht – 2    igh – 2
l – 4      ia – 2    aap – 1
o – 3      ig – 2    alm – 1
p – 3      li – 2    app – 1
…          …         …
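The counts for the session title can be reproduced with `collections.Counter`. This sketch assumes the same preprocessing the slides appear to use (lowercase, letters only, spaces dropped); the function name is illustrative.

```python
from collections import Counter

def char_ngram_counts(text, n):
    """Count character n-grams after lowercasing and dropping non-letters."""
    letters = [c for c in text.lower() if c.isalpha()]
    return Counter("".join(letters[i:i + n]) for i in range(len(letters) - n + 1))

title = "Lightweight NLP for Social Media Applications"
for n in (1, 2, 3):
    print(n, char_ngram_counts(title, n).most_common(5))
```

Ties in `most_common` come back in first-seen order, so the printed order of equal counts may differ from the alphabetical order on the slide.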
18. Word N-Gram Frequencies
▪ Word n-grams from Pride and Prejudice (using NLTK)
Unigrams      Bigrams         Trigrams
to – 4116     to be – 436     i am sure – 72
the – 4105    of the – 430    as soon as – 59
of – 3572     in the – 359    in the world – 57
and – 3491    it was – 280    i do not – 46
her – 2551    of her – 276    could not be – 42
a – 2092      to the – 242    she could not – 39
…             …               …
19. N-Gram Frequencies
▪ Word n-grams from Pride and Prejudice,
with stopwords removed from the unigrams
Unigrams          Bigrams         Trigrams
elinor – 685      to be – 436     i am sure – 72
could – 578       of the – 430    as soon as – 59
marianne – 566    in the – 359    in the world – 57
mrs – 530         it was – 280    i do not – 46
would – 515       of her – 276    could not be – 42
said – 397        to the – 242    she could not – 39
…                 …               …
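Word n-gram counting, with and without stopword unigrams, can be sketched with the standard library alone. The slides used NLTK on the full novel; here the regex tokenizer, the tiny `STOPWORDS` set, and the one-sentence sample (the novel's opening line) are stand-ins chosen for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"a", "and", "be", "her", "i", "in", "is", "it", "of", "that", "the", "to", "was"}

def word_ngram_counts(text, n, drop_stopwords=False):
    """Count word n-grams; optionally drop stopwords from unigram counts."""
    words = re.findall(r"[a-z']+", text.lower())
    grams = zip(*(words[i:] for i in range(n)))
    if drop_stopwords and n == 1:
        grams = (g for g in grams if g[0] not in STOPWORDS)
    return Counter(" ".join(g) for g in grams)

sample = ("It is a truth universally acknowledged, that a single man in possession "
          "of a good fortune, must be in want of a wife.")

print(word_ngram_counts(sample, 1).most_common(3))
print(word_ngram_counts(sample, 1, drop_stopwords=True).most_common(3))
print(word_ngram_counts(sample, 2).most_common(3))
```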
20. Cosine Similarity
▪ Make a vector from a document’s n-gram frequencies
▪ If A and B are frequency vectors for two documents
$$\mathrm{similarity}(A, B) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$
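The cosine similarity of two frequency vectors can be sketched directly over `Counter` objects, treating n-grams missing from a document as zeros. The function name and the two toy documents are assumptions for illustration.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())   # only shared n-grams contribute
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0                                        # an empty document matches nothing
    return dot / (norm_a * norm_b)

doc1 = Counter("the dog likes the cat".split())
doc2 = Counter("the cat likes the bird".split())
print(round(cosine_similarity(doc1, doc2), 3))   # 6/7, about 0.857
```

With non-negative counts the result falls between 0 (no n-grams in common) and 1 (identical proportions).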
21. Cosine Similarity
▪ Create word N-gram frequency vectors
• with unigrams, bigrams, trigrams
• Moby Dick
• Pride and Prejudice
▪ Compute their cosine similarity
0.534
▪ More interesting with a larger set of documents…
22. NLP and Machine Learning
▪ In the past, NLP was more about
grammars and logic and parsing
▪ Today, NLP is more about
statistics and machine learning
▪ Why?
• computers are much more powerful
• there are enormous amounts of very good, free data
23. NLP and Machine Learning
▪ Think of machine learning as
programming by analyzing sample data
▪ Example
• Use the Penn Treebank as sample data
• Build a program that labels words with parts-of-speech
24. NLP and Machine Learning
▪ Training
• depends on sample data, your training corpus
• there are very good, free machine learning tools
• sometimes training is slow
• experiment with different techniques (perceptron, SVM, etc)
• test, test, test…
▪ Production
• uses models generated during training
• typically very fast
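The split between training and production can be illustrated with about the simplest tagger there is: remember each word's most frequent tag. This is a baseline swapped in for illustration, not a perceptron or SVM, and the two-sentence corpus is a toy stand-in for a real treebank such as the Penn Treebank.

```python
from collections import Counter, defaultdict

def train(tagged_sentences):
    """Training: learn each word's most frequent tag from annotated data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(model, words, default="NN"):
    """Production: one dictionary lookup per word, using the trained model."""
    return [(w, model.get(w.lower(), default)) for w in words]

# Toy training corpus (hypothetical sample data).
corpus = [
    [("Bruce", "NNP"), ("likes", "VBZ"), ("dogs", "NNS")],
    [("dogs", "NNS"), ("like", "VBP"), ("Bruce", "NNP")],
]

model = train(corpus)                           # slow part, done once, offline
print(tag(model, ["Bruce", "likes", "cats"]))   # [('Bruce', 'NNP'), ('likes', 'VBZ'), ('cats', 'NN')]
```

The unseen word "cats" falls back to the default tag, which is the kind of error that more training data or a better model is meant to reduce.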
27. Language Identification
You might try looking at
▪ character sets (e.g. Unicode character blocks)
▪ words in language-specific dictionaries
▪ character n-gram frequencies and cosine similarity
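The character-set idea can be sketched with Python's `unicodedata` module. The standard library has no direct block or script lookup, so this sketch takes the first word of each character's Unicode name (LATIN, CYRILLIC, GREEK, ...) as a rough proxy; that shortcut and the function name `dominant_script` are assumptions.

```python
import unicodedata
from collections import Counter

def dominant_script(text):
    """Guess the main script from the first word of each letter's Unicode name."""
    scripts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts[name.split()[0]] += 1
    return scripts.most_common(1)[0][0] if scripts else None

for sample in ("hello world", "привет мир", "γειά σου", "12345"):
    print(repr(sample), dominant_script(sample))
```

A script narrows the candidates (Greek script almost always means Greek) but cannot separate languages that share one, such as English and French, which is where dictionaries and n-gram frequencies come in.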
28. Language Identification
▪ Character n-gram frequencies for English
Unigrams    Bigrams     Trigrams
e 12.6%     th 3.9%     the 3.5%
t 9.1%      he 3.7%     and 1.6%
a 8.0%      in 2.3%     ing 1.1%
o 7.6%      er 2.2%     her 0.8%
i 6.9%      an 2.1%     hat 0.7%
n 6.9%      re 1.7%     his 0.6%
s 6.3%      nd 1.6%     tha 0.6%
h 6.2%      on 1.4%     ere 0.6%
…           …           …
From Cryptograms.org, derived from English documents at Project Gutenberg
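Character n-gram frequencies and cosine similarity combine into a small language identifier. The sketch below is a toy under stated assumptions: each "model" is built from a single made-up sentence in `SAMPLES`, whereas a usable model needs far more text per language, and the names `profile`, `cosine`, and `identify` are illustrative.

```python
import math
from collections import Counter

def profile(text, sizes=(2, 3)):
    """Character bigram and trigram counts, keeping spaces as word boundaries."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for n in sizes for i in range(len(text) - n + 1))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical one-sentence training samples; real profiles need much more data.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat with the other animals",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze sitzt auf der matte",
}
MODELS = {lang: profile(text) for lang, text in SAMPLES.items()}   # training

def identify(text):
    """Production: pick the language whose profile is most similar."""
    doc = profile(text)
    return max(MODELS, key=lambda lang: cosine(doc, MODELS[lang]))

print(identify("the dog and the cat"))
print(identify("der hund und die katze"))
```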
29. Language Identification with Tika
▪ tika.apache.org
▪ models for
da Danish fr French ro Romanian
de German is Icelandic ru Russian
et Estonian it Italian sv Swedish
el Greek nl Dutch th Thai
en English no Norwegian uk Ukrainian
es Spanish pl Polish …
fi Finnish pt Portuguese
▪ trainable with sample data
30. Where can you find samples of…
▪ French?
▪ German?
▪ Russian?
▪ Japanese?
▪ Arabic?
▪ Cherokee?
31. Sentence Breaking
▪ Also known as
• sentence boundary disambiguation
• sentence detection
▪ You could just look for punctuation, but…
• what about abbreviations?
• what about numbers?
• what about domain names like lithium.com, etc?
• what about names like Yahoo!, etc?
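A rule-based splitter shows why punctuation alone is not enough. This sketch treats punctuation followed by whitespace and an uppercase letter as a boundary, unless the preceding word is a known abbreviation. The `ABBREVIATIONS` set is a short illustrative list, and the approach is a hand-written heuristic, not how a trained sentence detector works.

```python
import re

ABBREVIATIONS = {"mr", "mrs", "dr", "inc", "etc", "e.g", "i.e", "vs"}  # illustrative, far from complete

def split_sentences(text):
    """Split on ., ! or ? only when followed by whitespace and an uppercase letter."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]+\s+", text):
        before = text[start:match.start()].split()
        last_word = before[-1].lower() if before else ""
        next_char = text[match.end():match.end() + 1]
        if last_word in ABBREVIATIONS or not next_char.isupper():
            continue                      # abbreviation, or "Yahoo! is ..." style name
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

text = ("Dr. Smith works at Lithium Technologies, Inc. in California. "
        "Visit lithium.com for version 2.5 today. Yahoo! is also a name. Done.")
for sentence in split_sentences(text):
    print(sentence)
```

Domain names and decimals survive because their periods are not followed by whitespace. The heuristic still fails when a sentence really does end with an abbreviation, which is one reason to prefer a model trained on sample data.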
32. Sentence Breaking with OpenNLP
▪ opennlp.apache.org
▪ models for
da Danish nl Dutch
de German pt Portuguese
en English se Swedish
▪ trainable with new sample data
33. Stemming
▪ Reducing a word to a stem or base form
▪ Porter Stemmer is a popular stemmer for English
▪ Examples
lightweight → lightweight
natural → natur
language → languag
processing → process
34. Stemming
▪ A few examples from Pride and Prejudice (using NLTK)
affect: affect, affectation, affected, affecting, affection, affections, affects
amus: amuse, amused, amusement, amusements, amusing
close: close, closed, closely, closing
grate: grate, grateful, gratefully
35. Stemming with Snowball
▪ tartarus.org
▪ stemmers for
de German nl Dutch
en English no Norwegian
es Spanish pt Portuguese
fi Finnish ru Russian
fr French se Swedish
it Italian …
36. Part-of-Speech Tagging
▪ Part of Speech frequently abbreviated POS
▪ Not every language has the same parts of speech
▪ Even for one language,
not everyone agrees on the parts of speech
▪ Example: Penn Treebank POS tags for English
37. Part-of-Speech Tagging
lightweight nlp for social media applications
lightweight   NN
nlp           NN
for           IN
social        JJ
media         NNS
applications  NNS
nlp is easier than you thought
nlp           NN
is            VBZ
easier        JJR
than          IN
you           PRP
thought       VBD
38. Part-of-Speech Tagging with OpenNLP
▪ opennlp.apache.org
▪ two kinds of models for each of
de German pt Portuguese
en English se Swedish
nl Dutch
▪ trainable with new sample data
42. Language Identification
▪ Language ID is never perfect,
especially with social media!
• short documents
• ambiguity
• mixed languages
• nonsense
• and… lots of very strange stuff
46. Sentence Breaking for Summaries
▪ Summary does not replace the document
▪ Summary lets you decide if the document is interesting
▪ Summaries are sentences selected from the document
• contain the search terms
• not too short, not too long, etc
• truncated only if necessary
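The selection rules above can be sketched as a filter over sentences that have already been split. The thresholds (`min_words`, `max_words`, `max_chars`) and the function name are assumptions chosen for illustration, not values from the talk.

```python
import re

def summarize(sentences, query, min_words=4, max_words=30, max_chars=200):
    """Keep sentences that contain a search term and have a reasonable length."""
    terms = {t.lower() for t in query.split()}
    picked = []
    for sentence in sentences:
        words = re.findall(r"\w+", sentence.lower())
        if terms & set(words) and min_words <= len(words) <= max_words:
            picked.append(sentence)
    summary = " ".join(picked)
    if len(summary) > max_chars:          # truncate only if necessary, at a word boundary
        summary = summary[:max_chars].rsplit(" ", 1)[0] + "..."
    return summary

sentences = [
    "NLP is easier than you thought.",
    "Buy now!",
    "Free NLP tools are available for many languages.",
    "The weather was nice.",
]
print(summarize(sentences, "nlp"))
```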
48. Frequent Words and Stemming
▪ Most common words in the results for your query
• excludes stopwords
▪ Trending words were previously not common
▪ Click on a frequent word to search within results
▪ Should we count…
• words?
• stems?
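The words-or-stems question is easy to explore side by side. In this sketch `crude_stem` is a deliberately naive suffix stripper standing in for a real stemmer such as Porter or Snowball, and the stopword set and word list are made up for illustration.

```python
from collections import Counter

STOPWORDS = {"the", "and", "of", "to", "a"}          # tiny illustrative list
SUFFIXES = ("ing", "ed", "ly", "es", "s", "e")       # checked in this order

def crude_stem(word):
    """Naive stand-in for a real stemmer: strip one common suffix."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def frequent(words, key=lambda w: w):
    """Count words (or whatever key maps them to), excluding stopwords."""
    return Counter(key(w) for w in words if w not in STOPWORDS)

results = "amused amusing amuses closed closing closes the and of".split()
print(frequent(results).most_common(3))                  # counting words
print(frequent(results, key=crude_stem).most_common(3))  # counting stems
```

Counting stems merges inflected forms into one frequent item, at the cost of showing users a label like "amus" that is not a word, so a display form still has to be chosen.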
49. POS Tagging
▪ We use POS Tagging in Lithium SMM Quotes
• along with other things
• not such a “lightweight” application
▪ POS also useful for document categorization
• POS-based features
• machine learning
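POS-based features for a document classifier can be as simple as relative frequencies of tag n-grams. This is a sketch of one possible feature extractor, assuming the document has already been tagged; the function name and the choice of bigrams are illustrative, and the classifier itself is left out.

```python
from collections import Counter

def pos_ngram_features(tagged_tokens, n=2):
    """Relative frequency of each POS-tag n-gram in a tagged document."""
    tags = [tag for _, tag in tagged_tokens]
    counts = Counter(" ".join(tags[i:i + n]) for i in range(len(tags) - n + 1))
    total = sum(counts.values()) or 1
    return {gram: count / total for gram, count in counts.items()}

tagged = [("nlp", "NN"), ("is", "VBZ"), ("easier", "JJR"),
          ("than", "IN"), ("you", "PRP"), ("thought", "VBD")]
print(pos_ngram_features(tagged))
```

Because the features ignore the words themselves, they capture writing style more than topic.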
50. POS Tags and Document Categorization
▪ Author Gender
Automatic Categorization of Author Gender via N-Gram Analysis,
Jonathan Doyle and Vlado Keselj. In The 6th Symposium on Natural
Language Processing, SNLP'2005, Chiang Rai, Thailand, December 2005.
▪ Opinion Spam
Finding Deceptive Opinion Spam by Any Stretch of the Imagination,
Myle Ott, Yejin Choi, Claire Cardie and Jeffrey T. Hancock, The 49th Annual
Meeting of the Association for Computational Linguistics: Human Language
Technologies, Portland, Oregon, USA, June 19-24, 2011.
51. Lithium SMM Quotes
▪ Quotes
• Select interesting sentences from social media documents
• Classify them as love, hate, comparison, warning, etc.
▪ Quotes depends on
• language identification
• sentence breaking
• POS tagging
• parsing
• specialized dictionaries
54. Wikipedia
▪ Corpus linguistics
▪ Cosine similarity
▪ Function word
▪ Language identification
▪ Machine learning
▪ N-gram
▪ Natural language processing
▪ Part-of-speech tagging
▪ Sentence boundary disambiguation
▪ Stemming
▪ Stop words
▪ Text mining
▪ Treebank
55. Software
▪ NLTK
• Natural Language Toolkit
• Python library for NLP
• nltk.org
▪ OpenNLP
• machine-learning based NLP tools
• Java library for NLP
• opennlp.apache.org
▪ Snowball
• ANSI C and Java stemmers
• snowball.tartarus.org
▪ Tika
• Java toolkit for extracting metadata and text from documents
• includes language identification
• tika.apache.org
56. Books
▪ Natural Language Processing with Python
Steven Bird, Ewan Klein & Edward Loper
O’Reilly, 2009
▪ Foundations of Statistical Natural Language Processing
Chris Manning & Hinrich Schütze
MIT Press, 1999
57. Organization
▪ Association for Computational Linguistics
http://www.aclweb.org
▪ Remember that’s aclweb.org
acl.org is the Association of Christian Librarians
58. Contact Info
▪ Bruce Smith
@btsmith
bruce.smith@lithium.com
▪ People at SXSW wearing Lithium’s Nation Builder T-shirts