Lightweight Natural Language Processing (NLP)



Jazz up your social media apps with Natural Language Processing (NLP). Find out how you can use NLP in your social media apps, where to find free NLP apps, and where to learn more about NLP, so you can put your social media investments to work with the right technology.

Published in: Technology, Business


  1. Lightweight NLP for Social Media Applications
     Bruce Smith, Lithium Technologies, Inc.
     SXSW 2012, March 13, 2012
     @btsmith #nlp #sxsw
  2. Lightweight NLP for Social Media Applications: Are You in the Right Session? What Can You Learn in this Session?
  3. NLP = Natural Language Processing
     ▪ This session is not about
       • Natural Law Party
       • Neuro-linguistic Programming
       • No Light Perception (total blindness)
       • Nonlinear Programming
  4. N-Grams ≠ Engrams, Enneagrams, etc.
     ▪ I will talk about “n-grams” several times
     ▪ Wikipedia has pages for 3 different kinds of “engram”
       • Neuropsychology
       • Scientology
       • 2009 album by Finnish black metal band Beherit
     ▪ Wikipedia has pages for 3 different kinds of “enneagram”
       • Nine-sided star polygon
       • Enneagram of Personality
       • Fourth Way Enneagram
  5. Are you…
     ▪ developing a social media application?
     ▪ looking for ways to make your application better?
     ▪ interested in a quick introduction to NLP or text analytics?
  6. Do you want to know…
     ▪ how you can use NLP tools in your social media app?
     ▪ if you need a Ph.D. to use NLP tools?
     ▪ where to find free NLP tools?
     ▪ where to learn more?
  7. Do you want to understand…
     ▪ the role of machine learning in NLP?
     ▪ the difference between training and production?
     ▪ what a training corpus is and where to find one?
  8. This is a Great Time to Start Using NLP!
     ▪ Computers are powerful and cheap!
     ▪ There's a lot of very good, free software!
     ▪ There's an enormous amount of very good, free text data!
     ▪ Don’t be afraid of non-English content!
       • Unicode is your friend
       • just remember 'utf-8'
  9. Lightweight NLP for Social Media Applications: Very Simple NLP with Very Little Math
  10. Document, Corpus, Treebank
      ▪ document
        • newspaper article, novel, patent, scientific paper
        • blog post, comment, status update, tweet
      ▪ corpus
        • collection of documents
        • plural is “corpora”
      ▪ treebank
        • annotated corpus
        • words are annotated with parts of speech
        • sentences are annotated with parse trees
  11. Penn Treebank's Parts of Speech
      CC    Coordinating conjunction
      CD    Cardinal number
      DT    Determiner
      IN    Preposition or subordinating conjunction
      …     …
      JJ    Adjective
      JJR   Adjective, comparative
      JJS   Adjective, superlative
      …     …
      NN    Noun, singular or mass
      NNS   Noun, plural
      NNP   Proper noun, singular
      …     …
      POS   Possessive ending
      PRP   Personal pronoun
      PRP$  Possessive pronoun
      …     …
      VB    Verb, base form
      VBD   Verb, past tense
      VBG   Verb, gerund or present participle
      …     …
      WP    Wh-pronoun
      WP$   Possessive wh-pronoun
      …     …
  12. Phrase Structure Grammars & Parse Trees
      ▪ Phrases (non-terminals)
        • S Sentence
        • NP Noun Phrase
        • VP Verb Phrase
        • PP Prepositional Phrase
        • …
      ▪ POS (terminals)
        • NNP Proper noun, singular
        • NNS Noun, plural
        • VBZ Verb, 3rd person singular present
        • …
      ▪ Grammar
        • S → NP VP
        • NP → NN
        • NP → JJ NN
        • VP → V NP
        • …
      ▪ Parse Tree
        • (S (NP (NNP Bruce)) (VP (VBZ likes) (NP (NNS dogs))))
  13. N-Grams
      ▪ contiguous subsequence of n items
        • in order and with no gaps
        • words
        • characters
      ▪ n-grams have special names when n is small
        • unigram n=1
        • bigram n=2
        • trigram n=3
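The definition on this slide can be sketched in a few lines of Python. The function names `ngrams` and `char_ngrams` are illustrative choices, not part of the talk or of any library, and the sketch assumes whitespace is dropped before character n-grams are taken.

```python
def ngrams(items, n):
    """Return every contiguous length-n subsequence of items, in order."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def char_ngrams(text, n):
    """Character n-grams of text, lowercased, with whitespace removed."""
    chars = "".join(text.lower().split())
    return ["".join(gram) for gram in ngrams(chars, n)]


title = "Lightweight NLP for Social Media Applications"

# Character bigrams (first five only) and word bigrams for the session title.
print(char_ngrams(title, 2)[:5])
print(ngrams(title.lower().split(), 2))
```

The same `ngrams` helper works for both words and characters because it only relies on slicing.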
  14. Character N-Grams
      ▪ Unigrams for this session's title, Lightweight NLP for Social Media Applications
        l i g h t w e i g h t n l p f o r s o c i a l m e d i a a p p l i c a t i o n s
  15. Character N-Grams
      ▪ Bigrams for this session's title, Lightweight NLP for Social Media Applications
        li ig gh ht tw we ei ig gh ht tn nl lp pf fo or rs so oc ci ia al lm me ed di ia aa ap pp pl li ic ca at ti io on ns
  16. Character N-Grams
      ▪ Trigrams for this session's title, Lightweight NLP for Social Media Applications
        lig igh ght htw twe wei eig igh ght htn tnl nlp lpf pfo for ors rso soc oci cia ial alm lme med edi dia iaa aap app ppl pli lic ica cat ati tio ion ons
  17. Character N-Gram Frequencies
      ▪ N-grams are interesting when we look at frequencies
      ▪ For the title Lightweight NLP for Social Media Applications

        unigram   bigram    trigram
        i – 6     gh – 2    ght – 2
        a – 4     ht – 2    igh – 2
        l – 4     ia – 2    aap – 1
        o – 3     ig – 2    alm – 1
        p – 3     li – 2    aap – 1
        …         …         …
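A minimal sketch of how such frequency counts could be produced with the standard library's `collections.Counter`. The helper name `char_ngram_counts` is an assumption for illustration, and whitespace is removed before counting, as the n-gram slides suggest.

```python
from collections import Counter


def char_ngram_counts(text, n):
    """Count character n-grams in text (lowercased, whitespace removed)."""
    chars = "".join(text.lower().split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


title = "Lightweight NLP for Social Media Applications"
unigrams = char_ngram_counts(title, 1)
bigrams = char_ngram_counts(title, 2)
trigrams = char_ngram_counts(title, 3)

# The most frequent n-grams of each size.
for counts in (unigrams, bigrams, trigrams):
    print(counts.most_common(5))
```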
  18. Word N-Gram Frequencies
      ▪ Word n-grams from Pride and Prejudice (using NLTK)

        unigram       bigram          trigram
        to – 4116     to be – 436     i am sure – 72
        the – 4105    of the – 430    as soon as – 59
        of – 3572     in the – 359    in the world – 57
        and – 3491    it was – 280    i do not – 46
        her – 2551    of her – 276    could not be – 42
        a – 2092      to the – 242    she could not – 39
        …             …               …
  19. N-Gram Frequencies
      ▪ Word n-grams from Pride and Prejudice with no stopword unigrams

        unigram           bigram          trigram
        elinor – 685      to be – 436     i am sure – 72
        could – 578       of the – 430    as soon as – 59
        marianne – 566    in the – 359    in the world – 57
        mrs – 530         it was – 280    i do not – 46
        would – 515       of her – 276    could not be – 42
        said – 397        to the – 242    she could not – 39
        …                 …               …
  20. Cosine Similarity
      ▪ Make a vector from a document's n-gram frequencies
      ▪ If A and B are frequency vectors for two documents
        $$\mathrm{similarity}(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$
  21. Cosine Similarity
      ▪ Create word n-gram frequency vectors
        • with unigrams, bigrams, trigrams
        • Moby Dick
        • Pride and Prejudice
      ▪ Compute their cosine similarity: 0.534
      ▪ More interesting with a larger set of documents…
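A rough sketch of the computation, assuming each document's frequency vector is stored as a `Counter` keyed by n-gram. The two short strings stand in for full novels, and the helper names are illustrative rather than taken from the talk.

```python
import math
from collections import Counter


def word_ngram_counts(text, max_n=3):
    """Count word unigrams up to max_n-grams in lowercased text."""
    words = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(count * b.get(gram, 0) for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


doc_a = word_ngram_counts("call me ishmael some years ago")
doc_b = word_ngram_counts("it is a truth universally acknowledged some years ago")
print(cosine_similarity(doc_a, doc_b))
```

Only n-grams present in both documents contribute to the dot product, so the sparse representation avoids building a vector over the whole vocabulary.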
  22. NLP and Machine Learning
      ▪ In the past, NLP was more about grammars and logic and parsing
      ▪ Today, NLP is more about statistics and machine learning
      ▪ Why?
        • computers are much more powerful
        • there are enormous amounts of very good, free data
  23. NLP and Machine Learning
      ▪ Think of machine learning as programming by analyzing sample data
      ▪ Example
        • Use the Penn Treebank as sample data
        • Build a program that labels words with parts of speech
  24. NLP and Machine Learning
      ▪ Training
        • depends on sample data, your training corpus
        • there are very good, free machine learning tools
        • sometimes training is slow
        • experiment with different techniques (perceptron, SVM, etc.)
        • test, test, test…
      ▪ Production
        • uses models generated during training
        • typically very fast
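To make the training/production split concrete, here is a toy sketch of a most-frequent-tag baseline. It is not the perceptron or SVM approach the slide mentions, and the three-sentence training corpus is invented for illustration; a real system would train on a treebank.

```python
from collections import Counter, defaultdict

# Hypothetical, hand-tagged training corpus (Penn Treebank style tags).
TRAINING_CORPUS = [
    [("bruce", "NNP"), ("likes", "VBZ"), ("dogs", "NNS")],
    [("dogs", "NNS"), ("like", "VBP"), ("bruce", "NNP")],
    [("nlp", "NN"), ("is", "VBZ"), ("easier", "JJR"),
     ("than", "IN"), ("you", "PRP"), ("thought", "VBD")],
]


def train(tagged_sentences):
    """Training: count tags per word, keep each word's most frequent tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(model, words, default="NN"):
    """Production: one dictionary lookup per word; unknown words get a default."""
    return [(word, model.get(word.lower(), default)) for word in words]


model = train(TRAINING_CORPUS)             # slow part, done once
print(tag(model, "Bruce likes NLP".split()))  # fast part, done per document
```

The model here is just a dictionary, which is why production is fast: all the work of looking at sample data happened in `train`.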
  25. Lightweight NLP for Social Media Applications: Lightweight NLP Techniques
  26. Lightweight NLP Techniques
      ▪ Language Identification
      ▪ Sentence Breaking
      ▪ Stemming
      ▪ Part-of-Speech Tagging
  27. Language Identification
      You might try looking at
      ▪ character sets (e.g. Unicode character blocks)
      ▪ words in language-specific dictionaries
      ▪ character n-gram frequencies and cosine similarity
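The first idea, looking at character sets, can be approximated with the standard `unicodedata` module. This sketch uses the first word of each character's Unicode name as a stand-in for its script, which is a simplification rather than a proper Unicode block or script lookup.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count letters by the first word of their Unicode name (LATIN, CYRILLIC, ...)."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts


print(script_counts("NLP is fun"))
print(script_counts("\u0421\u0420\u0411\u0418\u0408\u0410"))  # the Cyrillic word СРБИЈА
```

A script narrows the candidates but does not settle the language: Cyrillic text could be Serbian, Russian, Ukrainian, and so on, which is where dictionaries and n-gram frequencies come in.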
  28. Language Identification
      ▪ Character n-gram frequencies for English

        unigram      bigram       trigram
        e  12.6%     th  3.9%     the  3.5%
        t   9.1%     he  3.7%     and  1.6%
        a   8.0%     in  2.3%     ing  1.1%
        o   7.6%     er  2.2%     her  0.8%
        i   6.9%     an  2.1%     hat  0.7%
        n   6.9%     re  1.7%     his  0.6%
        s   6.3%     nd  1.6%     tha  0.6%
        h   6.2%     on  1.4%     ere  0.6%
        …            …            …

      Derived from English documents at Project Gutenberg
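Combining character n-gram frequencies with cosine similarity gives a very small language identifier. The sketch below assumes one short, made-up sample sentence per language; real profiles would be built from much larger corpora, and the names `SAMPLES`, `profile` and `identify` are illustrative.

```python
import math
from collections import Counter

# Hypothetical training samples; far too small for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then runs through the forest",
    "fr": "le renard brun saute par dessus le chien paresseux et court dans la foret",
    "de": "der schnelle braune fuchs springt ueber den faulen hund und rennt durch den wald",
}


def profile(text, max_n=3):
    """Character 1- to max_n-gram counts, keeping single spaces between words."""
    text = " ".join(text.lower().split())
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return counts


def cosine(a, b):
    dot = sum(count * b.get(gram, 0) for gram, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


PROFILES = {lang: profile(text) for lang, text in SAMPLES.items()}


def identify(text):
    """Return the language whose profile is most similar to the text."""
    doc = profile(text)
    return max(PROFILES, key=lambda lang: cosine(doc, PROFILES[lang]))


print(identify("the dog runs through the forest"))
```

Short or mixed-language inputs will often be misclassified with profiles this small; the structure, not the accuracy, is the point of the sketch.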
  29. Language Identification with Tika
      ▪ models for
        da Danish      fr French        ro Romanian
        de German      is Icelandic     ru Russian
        et Estonian    it Italian       sv Swedish
        el Greek       nl Dutch         th Thai
        en English     no Norwegian     uk Ukrainian
        es Spanish     pl Polish        …
        fi Finnish     pt Portuguese
      ▪ trainable with sample data
  30. Where can you find samples of…
      ▪ French?
      ▪ German?
      ▪ Russian?
      ▪ Japanese?
      ▪ Arabic?
      ▪ Cherokee?
  31. Sentence Breaking
      ▪ Also known as
        • sentence boundary disambiguation
        • sentence detection
      ▪ You could just look for punctuation, but…
        • what about abbreviations?
        • what about numbers?
        • what about domain names, etc.?
        • what about names like Yahoo!, etc.?
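The pitfalls are easy to reproduce. Below, a deliberately naive splitter breaks after every `.`, `!` or `?` followed by whitespace, and a slightly better one rejoins pieces using a tiny abbreviation list and a lowercase check. Both functions and the `ABBREVIATIONS` set are illustrative assumptions, not how OpenNLP or any other library does it.

```python
import re

ABBREVIATIONS = {"dr.", "mr.", "mrs.", "etc."}  # hypothetical and far from complete


def naive_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def better_sentences(text):
    """Rejoin naive pieces that follow an abbreviation or start in lowercase."""
    sentences = []
    for piece in naive_sentences(text):
        prev = sentences[-1] if sentences else ""
        last_word = prev.split()[-1].lower() if prev else ""
        if prev and (last_word in ABBREVIATIONS or piece[0].islower()):
            sentences[-1] = prev + " " + piece
        else:
            sentences.append(piece)
    return sentences


text = "Dr. Smith likes Yahoo! a lot. He said so."
print(naive_sentences(text))   # four pieces for two sentences
print(better_sentences(text))  # two sentences
```

Hand-written rules like these grow quickly, which is one reason trainable sentence detectors are attractive.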
  32. Sentence Breaking with OpenNLP
      ▪ models for
        da Danish     nl Dutch
        de German     pt Portuguese
        en English    se Swedish
      ▪ trainable with new sample data
  33. Stemming
      ▪ Reducing a word to a stem or base form
      ▪ Porter Stemmer is a popular stemmer for English
      ▪ Examples
        • lightweight → lightweight
        • natural → natur
        • language → languag
        • processing → process
  34. Stemming
      ▪ A few examples from Pride and Prejudice (using NLTK), listed as stem: words
        • affect: affect, affectation, affected, affecting, affection, affections, affects
        • amus: amuse, amused, amusement, amusements, amusing
        • close: close, closed, closely, closing
        • grate: grate, grateful, gratefully
  35. Stemming with Snowball
      ▪ stemmers for
        de German     nl Dutch
        en English    no Norwegian
        es Spanish    pt Portuguese
        fi Finnish    ru Russian
        fr French     se Swedish
        it Italian    …
  36. Part-of-Speech Tagging
      ▪ Part of Speech is frequently abbreviated POS
      ▪ Not every language has the same parts of speech
      ▪ Even for one language, not everyone agrees on the parts of speech
      ▪ Example: Penn Treebank POS tags for English
  37. Part-of-Speech Tagging
      ▪ lightweight nlp for social media applications
        lightweight NN, nlp NN, for IN, social JJ, media NNS, applications NNS
      ▪ nlp is easier than you thought
        nlp NN, is VBZ, easier JJR, than IN, you PRP, thought VBD
  38. Part-of-Speech Tagging with OpenNLP
      ▪ two kinds of models for each of
        de German     pt Portuguese
        en English    se Swedish
        nl Dutch
      ▪ trainable with new sample data
  39. Lightweight NLP for Social Media Applications: Lightweight NLP in Applications
  40. Lightweight NLP in Applications
      ▪ Language Identification
      ▪ Sentence Breaking for Summaries
      ▪ Stemming for Word Counts
      ▪ POS Tagging for Document Categorization
      ▪ Lithium SMM Quotes
  41. Lithium SMM (Social Media Monitoring)
  42. Language Identification
      ▪ Language ID is never perfect, especially with social media!
        • short documents
        • ambiguity
        • mixed languages
        • nonsense
        • and… lots of very strange stuff
  43. What language is this? ______________$$$$______________ ____________$$$$$$$$____________ ___________$$$$$$$$$$___________ ___________$$$$$$$$$$___________ _____________$$$$$$_____________ _____________$$$$$$_____________ _____________$$$$$$_____________ ____$$$$_____$$$$$$_____$$$$____ ___$$$$$_____$$$$$$_____$$$$$___ _$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$_ _$$$$$$$$$$$.СРБИЈА.$$$$$$$$$$$_ ___$$$$$$$$$$$$$$$$$$$$$$$$$$___ ____$$$$_____$$$$$$_____$$$$____ _____________$$$$$$_____________ _____________$$$$$$_____________ _____________$$$$$$_____________ _____________$$$$$$_____________ _____________$$$$$$_____________ _____________$$$$$$_____________ _____________$$$$$$_____________ _____________$$$$$$_____________ ___________$$$$$$$$$$___________ ___________$$$$$$$$$$___________ ____________$$$$$$$$____________ ______________$$$$______________ @btsmith #nlp 43
  44. What language is this?ღೋ ´¯`•.¸,¤°`°¤,¸.•´¯`•.¸,¤Ƹ ̵̡Ӝ ´¯`•. ̵̨̄Ʒ´¯`•.ღೋ ╱▔▌╔═╗╔═╗╔═╗╔═╗╔═╗░╔═╗╔╗╔═╗╔═╗╔═╗╔╗╔╗─╔═╗─ ╱▔▔▔▔╲ ╱▌║█║║═╣║═╣║╠╝║═╣░║▌║╚╝║█║║█║║█║║║║║─║═╣ ╱◑▓░▓░░ ▌║╔╝║═╣╠═║║╠╗║═╣░║▌║──║╦║║╔╝║═╣║║║╚╗║═╣ ╲░░░░░╱╲▌╚╝─╚═╝╚═╝╚═╝╚═╝░╚═╝──╚╩╝╚╝─╚╩╝╚╝╚═╝╚═╝─ ▔▔╲▌▔ @btsmith #nlp 44
  45. Lithium SMM
  46. Sentence Breaking for Summaries
      ▪ Summary does not replace the document
      ▪ Summary lets you decide if the document is interesting
      ▪ Summaries are sentences selected from the document
        • contain the search terms
        • not too short, not too long, etc.
        • truncated only if necessary
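The selection rules on this slide can be sketched as a simple filter. This is not Lithium's implementation; the thresholds, the naive punctuation-based sentence split and the function name `summarize` are all assumptions made for illustration.

```python
import re


def summarize(document, search_terms, min_words=4, max_words=30, max_sentences=2):
    """Pick sentences that mention a search term and have a reasonable length."""
    terms = {term.lower() for term in search_terms}
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())  # naive sentence breaking
    summary = []
    for sentence in sentences:
        words = re.findall(r"[\w']+", sentence.lower())
        if not terms & set(words):
            continue  # does not contain a search term
        if len(words) < min_words:
            continue  # too short to be informative
        if len(words) > max_words:
            # Truncate only if necessary, on whitespace-separated tokens.
            sentence = " ".join(sentence.split()[:max_words]) + " ..."
        summary.append(sentence)
        if len(summary) == max_sentences:
            break
    return summary


doc = ("Wow. I bought a new camera last week. The weather was nice. "
       "The camera takes great photos in low light. Camera!")
print(summarize(doc, ["camera"]))
```

A production version would swap the regular expression for a trained sentence detector, since the quality of the summary depends directly on the sentence boundaries.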
  47. Lithium SMM
  48. Frequent Words and Stemming
      ▪ Most common words in the results for your query
        • excludes stopwords
      ▪ Trending words were previously not common
      ▪ Click on a frequent word to search within results
      ▪ Should we count…
        • words?
        • stems?
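The words-versus-stems question is easier to judge with both counts side by side. The sketch below uses a deliberately crude suffix stripper, which is not the Porter or Snowball algorithm, plus a tiny invented stopword list and sample documents.

```python
import re
from collections import Counter

STOPWORDS = {"i", "this", "she", "her", "we", "the", "and"}  # hypothetical, tiny list


def crude_stem(word):
    """Strip one common suffix if at least three letters remain. Not Porter."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def frequent(documents, stem=False):
    """Count non-stopword words (or their stems) across documents."""
    counts = Counter()
    for document in documents:
        for word in re.findall(r"[a-z']+", document.lower()):
            if word not in STOPWORDS:
                counts[crude_stem(word) if stem else word] += 1
    return counts


docs = [
    "I love this phone",
    "She loves her phones",
    "We loved the camera",
    "Loving the new phone and camera",
]
word_counts = frequent(docs)
stem_counts = frequent(docs, stem=True)
print(word_counts.most_common(3))
print(stem_counts.most_common(3))
```

Counting stems merges love, loves, loved and loving into one entry, which gives a truer picture of frequency, but the stem itself ("lov") is not something you would want to show to a user without mapping it back to a real word.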
  49. POS Tagging
      ▪ We use POS Tagging in Lithium SMM Quotes
        • along with other things
        • not such a “lightweight” application
      ▪ POS also useful for document categorization
        • POS-based features
        • machine learning
  50. POS Tags and Document Categorization
      ▪ Author Gender
        Automatic Categorization of Author Gender via N-Gram Analysis, Jonathan Doyle and Vlado Keselj. In The 6th Symposium on Natural Language Processing, SNLP2005, Chiang Rai, Thailand, December 2005.
      ▪ Opinion Spam
        Finding Deceptive Opinion Spam by Any Stretch of the Imagination, Myle Ott, Yejin Choi, Claire Cardie and Jeffrey T. Hancock. In The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, June 19-24, 2011.
  51. Lithium SMM Quotes
      ▪ Quotes
        • Select interesting sentences from social media documents
        • Classify them as love, hate, comparison, warning, etc.
      ▪ Quotes depends on
        • language identification
        • sentence breaking
        • POS tagging
        • parsing
        • specialized dictionaries
  52. Lithium SMM Quotes
  53. Lightweight NLP for Social Media Applications: Resources
  54. Wikipedia
      ▪ Corpus linguistics
      ▪ Cosine similarity
      ▪ Function word
      ▪ Language identification
      ▪ Machine learning
      ▪ N-gram
      ▪ Natural language processing
      ▪ Part-of-speech tagging
      ▪ Sentence boundary disambiguation
      ▪ Stemming
      ▪ Stop words
      ▪ Text mining
      ▪ Treebank
  55. Software
      ▪ NLTK
        • Natural Language Toolkit
        • Python library for NLP
      ▪ OpenNLP
        • machine-learning based NLP tools
        • Java library for NLP
      ▪ Snowball
        • ANSI C and Java stemmers
      ▪ Tika
        • Java toolkit for extracting metadata and text from documents
        • includes language identification
  56. Books
      ▪ Natural Language Processing with Python
        Steven Bird, Ewan Klein & Edward Loper, O'Reilly, 2009
      ▪ Foundations of Statistical Natural Language Processing
        Chris Manning & Hinrich Schütze, MIT Press, 1999
  57. Organization
      ▪ Association for Computational Linguistics
      ▪ Remember that's …; … is the Association of Christian Librarians
  58. Contact Info
      ▪ Bruce Smith @btsmith
      ▪ People at SXSW wearing Lithium's Nation Builder T-shirts