Lightweight Natural Language Processing (NLP)
 


Jazz up your social media apps with Natural Language Processing (NLP). Find out how you can use NLP in your social media apps, where to find free NLP apps, and where to learn more about NLP in order to put your social media investments to work with the right technology.


    Presentation Transcript

    • Lightweight NLP for Social Media Applications
      ▪ Bruce Smith, Lithium Technologies, Inc.
      ▪ SXSW 2012, March 13, 2012
      ▪ @btsmith #nlp #sxsw
    • Lightweight NLP for Social Media Applications
      ▪ Are You in the Right Session?
      ▪ What Can You Learn in this Session?
    • NLP = Natural Language Processing
      ▪ This session is not about
        • Natural Law Party
        • Neuro-linguistic Programming
        • No Light Perception (total blindness)
        • Nonlinear Programming
    • N-Grams ≠ Engrams, Enneagrams, etc.
      ▪ I will talk about “n-grams” several times
      ▪ Wikipedia has pages for 3 different kinds of “engram”
        • Neuropsychology
        • Scientology
        • 2009 album by Finnish black metal band Beherit
      ▪ Wikipedia has pages for 3 different kinds of “enneagram”
        • Nine-sided star polygon
        • Enneagram of Personality
        • Fourth Way Enneagram
    • Are you…
      ▪ developing a social media application?
      ▪ looking for ways to make your application better?
      ▪ interested in a quick introduction to NLP or text analytics?
    • Do you want to know…
      ▪ how you can use NLP tools in your social media app?
      ▪ if you need a Ph.D. to use NLP tools?
      ▪ where to find free NLP tools?
      ▪ where to learn more?
    • Do you want to understand…
      ▪ the role of machine learning in NLP?
      ▪ the difference between training and production?
      ▪ what a training corpus is and where to find one?
    • This is a Great Time to Start Using NLP!
      ▪ Computers are powerful and cheap!
      ▪ There’s a lot of very good, free software!
      ▪ There’s an enormous amount of very good, free text data!
      ▪ Don’t be afraid of non-English content!
        • Unicode is your friend
        • just remember ‘utf-8’
    • Lightweight NLP for Social Media Applications
      ▪ Very Simple NLP with Very Little Math
    • Document, Corpus, Treebank
      ▪ document
        • newspaper article, novel, patent, scientific paper
        • blog post, comment, status update, tweet
      ▪ corpus
        • collection of documents
        • plural is “corpora”
      ▪ treebank
        • annotated corpus
        • words are annotated with parts of speech
        • sentences are annotated with parse trees
    • Penn Treebank’s Parts of Speech
        CC    Coordinating conjunction
        CD    Cardinal number
        DT    Determiner
        IN    Preposition or subordinating conjunction
        …
        JJ    Adjective
        JJR   Adjective, comparative
        JJS   Adjective, superlative
        …
        NN    Noun, singular or mass
        NNS   Noun, plural
        NNP   Proper noun, singular
        …
        POS   Possessive ending
        PRP   Personal pronoun
        PRP$  Possessive pronoun
        …
        VB    Verb, base form
        VBD   Verb, past tense
        VBG   Verb, gerund or present participle
        …
        WP    Wh-pronoun
        WP$   Possessive wh-pronoun
        …
    • Phrase Structure Grammars & Parse Trees
      ▪ Phrases (non-terminals)
        S     Sentence
        NP    Noun Phrase
        VP    Verb Phrase
        PP    Prepositional Phrase
        …
      ▪ POS (terminals)
        NNP   Proper noun, singular
        NNS   Noun, plural
        VBZ   Verb, 3rd person singular present
        …
      ▪ Grammar
        S → NP VP
        …
        NP → NN
        NP → JJ NN
        …
        VP → V NP
        …
      ▪ Parse Tree
        (S (NP (NNP Bruce)) (VP (VBZ likes) (NP (NNS dogs))))
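The parse tree for "Bruce likes dogs" can be sketched in Python as nested tuples. The tuple layout and the helper names `leaves` and `productions` are illustrative assumptions, not something taken from the slides or from any NLP library.

```python
# Parse tree for "Bruce likes dogs" as nested tuples: (label, child, ...).
# A node that sits directly over a word is written (POS tag, word).
TREE = ("S",
        ("NP", ("NNP", "Bruce")),
        ("VP", ("VBZ", "likes"),
               ("NP", ("NNS", "dogs"))))


def leaves(tree):
    """Return (word, POS tag) pairs from left to right."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        pairs.extend(leaves(child))
    return pairs


def productions(tree):
    """Return the phrase-level grammar rules used in the tree, top down."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return []  # a POS tag over a word is a terminal, not a phrase rule
    rules = [label + " -> " + " ".join(child[0] for child in children)]
    for child in children:
        rules.extend(productions(child))
    return rules


print(leaves(TREE))       # [('Bruce', 'NNP'), ('likes', 'VBZ'), ('dogs', 'NNS')]
print(productions(TREE))  # ['S -> NP VP', 'NP -> NNP', 'VP -> VBZ NP', 'NP -> NNS']
```

Reading the rules back out of a tree is roughly how a grammar can be collected from a treebank: every annotated sentence contributes the rules it uses.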
    • N-Grams
      ▪ contiguous subsequence of n items
        • in order and with no gaps
        • words
        • characters
      ▪ n-grams have special names when n is small
        • unigram n=1
        • bigram n=2
        • trigram n=3
    • Character N-Grams
      ▪ Unigrams for this session’s title, Lightweight NLP for Social Media Applications
        l i g h t w e i g h t n l p f o r s o c i a l m e d i a a p p l i c a t i o n s
    • Character N-Grams
      ▪ Bigrams for this session’s title, Lightweight NLP for Social Media Applications
        li ig gh ht tw we ei ig gh ht tn nl lp pf fo or rs so oc ci ia al lm me ed di ia aa ap pp pl li ic ca at ti io on ns
    • Character N-Grams
      ▪ Trigrams for this session’s title, Lightweight NLP for Social Media Applications
        lig igh ght htw twe wei eig igh ght htn tnl nlp lpf pfo for ors rso soc oci cia ial alm lme med edi dia iaa aap app ppl pli lic ica cat ati tio ion ons
    • Character N-Gram Frequencies
      ▪ N-grams are interesting when we look at frequencies
      ▪ Lightweight NLP for Social Media Applications
        unigram   bigram    trigram
        i – 6     gh – 2    ght – 2
        a – 4     ht – 2    igh – 2
        l – 4     ia – 2    aap – 1
        o – 3     ig – 2    alm – 1
        p – 3     li – 2    app – 1
        …         …         …
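A minimal Python sketch of how these character n-gram counts could be produced with only the standard library. It assumes, as the slide's lists suggest, that case is ignored and that spaces are dropped so n-grams run across word boundaries; the function name `char_ngrams` is illustrative.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return the character n-grams of text, ignoring case and non-letters."""
    letters = [c for c in text.lower() if c.isalpha()]
    return ["".join(letters[i:i + n]) for i in range(len(letters) - n + 1)]


title = "Lightweight NLP for Social Media Applications"
unigrams = Counter(char_ngrams(title, 1))
bigrams = Counter(char_ngrams(title, 2))
trigrams = Counter(char_ngrams(title, 3))

print(unigrams["i"], unigrams["a"], unigrams["l"])        # 6 4 4
print(bigrams["gh"], bigrams["ia"], bigrams["li"])        # 2 2 2
print(trigrams["ght"], trigrams["igh"], trigrams["aap"])  # 2 2 1
```

A `Counter` works as a sparse frequency vector: any n-gram that never occurs simply counts as zero.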
    • Word N-Gram Frequencies
      ▪ Word n-grams from Pride and Prejudice (using NLTK)
        unigram       bigram          trigram
        to – 4116     to be – 436     i am sure – 72
        the – 4105    of the – 430    as soon as – 59
        of – 3572     in the – 359    in the world – 57
        and – 3491    it was – 280    i do not – 46
        her – 2551    of her – 276    could not be – 42
        a – 2092      to the – 242    she could not – 39
        …             …               …
    • N-Gram Frequencies
      ▪ Word n-grams from Pride and Prejudice with no stopword unigrams (the top names, Elinor and Marianne, are characters in Sense and Sensibility, so these counts appear to come from that novel)
        unigram           bigram          trigram
        elinor – 685      to be – 436     i am sure – 72
        could – 578       of the – 430    as soon as – 59
        marianne – 566    in the – 359    in the world – 57
        mrs – 530         it was – 280    i do not – 46
        would – 515       of her – 276    could not be – 42
        said – 397        to the – 242    she could not – 39
        …                 …               …
    • Cosine Similarity
      ▪ Make a vector from a document’s n-gram frequencies
      ▪ If A and B are frequency vectors for two documents
        $$\mathrm{similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$
    • Cosine Similarity
      ▪ Create word n-gram frequency vectors
        • with unigrams, bigrams, trigrams
        • Moby Dick
        • Pride and Prejudice
      ▪ Compute their cosine similarity: 0.534
      ▪ More interesting with a larger set of documents…
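A minimal sketch of the cosine similarity computation over word n-gram frequency vectors, using only the Python standard library. The whitespace tokenization and the function names are simplifying assumptions; the slides do not say how the 0.534 figure was tokenized, so this sketch is not expected to reproduce it.

```python
import math
from collections import Counter


def word_ngrams(text, max_n=3):
    """Count word n-grams for n = 1..max_n in lowercased, whitespace-split text."""
    words = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse frequency vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


doc_a = word_ngrams("to be or not to be")
doc_b = word_ngrams("not to be outdone")
print(round(cosine_similarity(doc_a, doc_b), 3))
```

Only n-grams that occur in both documents contribute to the dot product, so the loop never has to build the full vocabulary.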
    • NLP and Machine Learning
      ▪ In the past, NLP was more about grammars and logic and parsing
      ▪ Today, NLP is more about statistics and machine learning
      ▪ Why?
        • computers are much more powerful
        • there are enormous amounts of very good, free data
    • NLP and Machine Learning
      ▪ Think of machine learning as programming by analyzing sample data
      ▪ Example
        • Use the Penn Treebank as sample data
        • Build a program that labels words with parts of speech
    • NLP and Machine Learning
      ▪ Training
        • depends on sample data, your training corpus
        • there are very good, free machine learning tools
        • sometimes training is slow
        • experiment with different techniques (perceptron, SVM, etc.)
        • test, test, test…
      ▪ Production
        • uses models generated during training
        • typically very fast
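The split between training and production can be sketched with a deliberately simple stand-in technique: a tagger that remembers each word's most frequent tag. This is not a perceptron or an SVM, and the three hand-tagged sentences below are a hypothetical substitute for a real training corpus such as the Penn Treebank.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged sample standing in for a real training corpus.
TRAINING_CORPUS = [
    [("bruce", "NNP"), ("likes", "VBZ"), ("dogs", "NNS")],
    [("dogs", "NNS"), ("like", "VBP"), ("bruce", "NNP")],
    [("nlp", "NN"), ("is", "VBZ"), ("easier", "JJR"), ("than", "IN"),
     ("you", "PRP"), ("thought", "VBD")],
]


def train(tagged_sentences):
    """Training: learn each word's most frequent tag from the sample data."""
    tag_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_counts[word.lower()][tag] += 1
    # The model is just a dictionary from word to its most common tag.
    return {word: counts.most_common(1)[0][0] for word, counts in tag_counts.items()}


def tag(model, words, default="NN"):
    """Production: one dictionary lookup per word; unknown words get `default`."""
    return [(word, model.get(word.lower(), default)) for word in words]


model = train(TRAINING_CORPUS)                 # done once, offline
print(tag(model, ["Bruce", "likes", "NLP"]))   # done for every incoming document
# [('Bruce', 'NNP'), ('likes', 'VBZ'), ('NLP', 'NN')]
```

All the work of reading the corpus happens in `train`; `tag` only reads the resulting model, which is why production is typically fast.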
    • Lightweight NLP for Social Media Applications
      ▪ Lightweight NLP Techniques
    • Lightweight NLP Techniques
      ▪ Language Identification
      ▪ Sentence Breaking
      ▪ Stemming
      ▪ Part-of-Speech Tagging
    • Language Identification
      ▪ You might try looking at
        • character sets (e.g. Unicode character blocks)
        • words in language-specific dictionaries
        • character n-gram frequencies and cosine similarity
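The first idea, looking at character sets, can be sketched with the standard `unicodedata` module. As an approximation, the sketch uses the first word of each character's Unicode name (for example `LATIN` or `CYRILLIC`) as a stand-in for its block; the function name `script_counts` is illustrative.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count letters by the first word of their Unicode character name."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")  # "" if the character has no name
            if name:
                counts[name.split()[0]] += 1
    return counts


print(script_counts("$$$.СРБИЈА.$$$"))  # Counter({'CYRILLIC': 6})
print(script_counts("Bruce likes dogs").most_common(1))  # [('LATIN', 14)]
```

A script narrows the candidates but rarely settles the question: Cyrillic is shared by Russian, Ukrainian, Serbian and others, so this check is usually combined with the dictionary or n-gram approaches.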
    • Language Identification
      ▪ Character n-gram frequencies for English
        unigram      bigram       trigram
        e  12.6%     th  3.9%     the  3.5%
        t   9.1%     he  3.7%     and  1.6%
        a   8.0%     in  2.3%     ing  1.1%
        o   7.6%     er  2.2%     her  0.8%
        i   6.9%     an  2.1%     hat  0.7%
        n   6.9%     re  1.7%     his  0.6%
        s   6.3%     nd  1.6%     tha  0.6%
        h   6.2%     on  1.4%     ere  0.6%
        …            …            …
      ▪ From Cryptograms.org, derived from English documents at Project Gutenberg
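A minimal sketch of language identification by comparing character trigram profiles with cosine similarity. The two one-sentence samples are made-up placeholders, far too small for real use, and unlike the title example earlier this sketch keeps spaces inside the trigrams so that word beginnings and endings count as evidence.

```python
import math
from collections import Counter

# Tiny illustrative samples; a real profile needs far more text per language.
SAMPLES = {
    "en": "the cat sat on the mat and the dog ate the bone in the garden with the children",
    "fr": "le chat est sur le tapis et le chien mange dans le jardin avec les enfants",
}


def trigram_profile(text):
    """Character trigram counts over lowercased text with normalized spaces."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a, b):
    dot = sum(c * b[g] for g, c in a.items() if g in b)
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


# "Training": build one profile per language from its sample text.
PROFILES = {lang: trigram_profile(text) for lang, text in SAMPLES.items()}


def identify(text):
    """Return the language whose profile is most similar to the text."""
    profile = trigram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))


print(identify("the dog is in the garden"))     # en
print(identify("le chien est dans le jardin"))  # fr
```

Adding a language only means adding a sample text, which is the sense in which such a model is trainable with sample data.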
    • Language Identification with Tika
      ▪ tika.apache.org
      ▪ models for
        da  Danish      fr  French       ro  Romanian
        de  German      is  Icelandic    ru  Russian
        et  Estonian    it  Italian      sv  Swedish
        el  Greek       nl  Dutch        th  Thai
        en  English     no  Norwegian    uk  Ukrainian
        es  Spanish     pl  Polish       …
        fi  Finnish     pt  Portuguese
      ▪ trainable with sample data
    • Where can you find samples of…
      ▪ French?
      ▪ German?
      ▪ Russian?
      ▪ Japanese?
      ▪ Arabic?
      ▪ Cherokee?
    • Sentence Breaking
      ▪ Also known as
        • sentence boundary disambiguation
        • sentence detection
      ▪ You could just look for punctuation, but…
        • what about abbreviations?
        • what about numbers?
        • what about domain names like lithium.com, etc.?
        • what about names like Yahoo!, etc.?
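A rule-based sketch that handles the cases listed above, assuming a boundary is a `.`, `!` or `?` followed by whitespace and an uppercase letter, and that a short hand-written abbreviation list is available. Both the rule and the list are illustrative assumptions, not how any particular library does it.

```python
import re

# Hypothetical abbreviation list; a real system would need a much larger one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "inc.", "etc.", "e.g.", "i.e."}

# Candidate boundary: ., ! or ? followed by whitespace and an uppercase letter.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Split text at candidate boundaries that do not follow an abbreviation."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        candidate = text[start:match.start()]
        last_token = candidate.split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, keep scanning
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = "I read it on lithium.com today. Dr. Smith paid $3.50 for it. Yahoo! is a company."
for sentence in split_sentences(text):
    print(sentence)
# I read it on lithium.com today.
# Dr. Smith paid $3.50 for it.
# Yahoo! is a company.
```

The rules still fail in easy-to-find ways, for example when a sentence really does end with "Inc." or starts with a lowercase word, which is one reason trained sentence detectors are attractive.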
    • Sentence Breaking with OpenNLP
      ▪ opennlp.apache.org
      ▪ models for
        da  Danish     nl  Dutch
        de  German     pt  Portuguese
        en  English    se  Swedish
      ▪ trainable with new sample data
    • Stemming
      ▪ Reducing a word to a stem or base form
      ▪ Porter Stemmer is a popular stemmer for English
      ▪ Examples
        lightweight → lightweight
        natural → natur
        language → languag
        processing → process
    • Stemming
      ▪ A few examples from Pride and Prejudice (using NLTK)
        affect: affect, affectation, affected, affecting, affection, affections, affects
        amus: amuse, amused, amusement, amusements, amusing
        close: close, closed, closely, closing
        grate: grate, grateful, gratefully
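To show the effect of stemming on word counts without pulling in a library, here is a toy suffix stripper. It is emphatically not the Porter algorithm: the suffix list and the minimum stem length are assumptions chosen so that the "affect" family above collapses to one stem, and it gets other words wrong (for example it leaves "amuse" alone but turns "amused" into "amus").

```python
# Crude illustrative suffix list, longest first. This is NOT the Porter algorithm.
SUFFIXES = ("ations", "ation", "ions", "ion", "ing", "ed", "s")


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


words = ["affect", "affectation", "affected", "affecting",
         "affection", "affections", "affects"]

# Counting words sees seven different items; counting stems sees one.
print(len(set(words)), len({toy_stem(w) for w in words}))  # 7 1
```

For real work, a tested stemmer such as the Porter or Snowball implementations is the safer choice.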
    • Stemming with Snowball
      ▪ tartarus.org
      ▪ stemmers for
        de  German     nl  Dutch
        en  English    no  Norwegian
        es  Spanish    pt  Portuguese
        fi  Finnish    ru  Russian
        fr  French     se  Swedish
        it  Italian    …
    • Part-of-Speech Tagging
      ▪ Part of Speech is frequently abbreviated POS
      ▪ Not every language has the same parts of speech
      ▪ Even for one language, not everyone agrees on the parts of speech
      ▪ Example: Penn Treebank POS tags for English
    • Part-of-Speech Tagging
      ▪ lightweight nlp for social media applications
        lightweight   NN
        nlp           NN
        for           IN
        social        JJ
        media         NNS
        applications  NNS
      ▪ nlp is easier than you thought
        nlp           NN
        is            VBZ
        easier        JJR
        than          IN
        you           PRP
        thought       VBD
    • Part-of-Speech Tagging with OpenNLP
      ▪ opennlp.apache.org
      ▪ two kinds of models for each of
        de  German     pt  Portuguese
        en  English    se  Swedish
        nl  Dutch
      ▪ trainable with new sample data
    • Lightweight NLP for Social Media Applications
      ▪ Lightweight NLP in Applications
    • Lightweight NLP in Applications
      ▪ Language Identification
      ▪ Sentence Breaking for Summaries
      ▪ Stemming for Word Counts
      ▪ POS Tagging for Document Categorization
      ▪ Lithium SMM Quotes
    • Lithium SMM (Social Media Monitoring)
    • Language Identification
      ▪ Language ID is never perfect, especially with social media!
        • short documents
        • ambiguity
        • mixed languages
        • nonsense
        • and… lots of very strange stuff
    • What language is this?
        ______________$$$$______________
        ____________$$$$$$$$____________
        ___________$$$$$$$$$$___________
        ___________$$$$$$$$$$___________
        _____________$$$$$$_____________
        _____________$$$$$$_____________
        _____________$$$$$$_____________
        ____$$$$_____$$$$$$_____$$$$____
        ___$$$$$_____$$$$$$_____$$$$$___
        _$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$_
        _$$$$$$$$$$$.СРБИЈА.$$$$$$$$$$$_
        ___$$$$$$$$$$$$$$$$$$$$$$$$$$___
        ____$$$$_____$$$$$$_____$$$$____
        _____________$$$$$$_____________
        _____________$$$$$$_____________
        _____________$$$$$$_____________
        _____________$$$$$$_____________
        _____________$$$$$$_____________
        _____________$$$$$$_____________
        _____________$$$$$$_____________
        _____________$$$$$$_____________
        ___________$$$$$$$$$$___________
        ___________$$$$$$$$$$___________
        ____________$$$$$$$$____________
        ______________$$$$______________
    • What language is this?
        ღೋ ´¯`•.¸,¤°`°¤,¸.•´¯`•.¸,¤Ƹ ̵̡Ӝ ´¯`•. ̵̨̄Ʒ´¯`•.ღೋ ╱▔▌╔═╗╔═╗╔═╗╔═╗╔═╗░╔═╗╔╗╔═╗╔═╗╔═╗╔╗╔╗─╔═╗─ ╱▔▔▔▔╲ ╱▌║█║║═╣║═╣║╠╝║═╣░║▌║╚╝║█║║█║║█║║║║║─║═╣ ╱◑▓░▓░░ ▌║╔╝║═╣╠═║║╠╗║═╣░║▌║──║╦║║╔╝║═╣║║║╚╗║═╣ ╲░░░░░╱╲▌╚╝─╚═╝╚═╝╚═╝╚═╝░╚═╝──╚╩╝╚╝─╚╩╝╚╝╚═╝╚═╝─ ▔▔╲▌▔
    • Lithium SMM
    • Sentence Breaking for Summaries
      ▪ Summary does not replace the document
      ▪ Summary lets you decide if the document is interesting
      ▪ Summaries are sentences selected from the document
        • contain the search terms
        • not too short, not too long, etc.
        • truncated only if necessary
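The selection rules above can be sketched as a small filter over sentences that have already been split. The word-count thresholds, the `limit` parameter and the function name are hypothetical choices for illustration; the slides do not describe how Lithium SMM actually scores or truncates sentences.

```python
def select_summary(sentences, search_terms, min_words=4, max_words=30, limit=2):
    """Pick up to `limit` sentences that mention a search term and are a sensible length."""
    terms = {term.lower() for term in search_terms}
    picked = []
    for sentence in sentences:
        # Normalize words so "Phone," and "phone" both match the term "phone".
        words = [w.strip(".,!?;:\"'()").lower() for w in sentence.split()]
        if not min_words <= len(words) <= max_words:
            continue  # too short or too long to be a useful summary line
        if terms.intersection(words):
            picked.append(sentence)
        if len(picked) == limit:
            break
    return picked


sentences = [
    "Great.",
    "I love my new phone because the battery lasts for days.",
    "The weather was nice.",
    "This phone is the best phone I have owned.",
]
print(select_summary(sentences, ["phone"]))
# ['I love my new phone because the battery lasts for days.',
#  'This phone is the best phone I have owned.']
```

Because whole sentences are selected rather than character windows, the summary stays readable, and truncation is only needed when a matching sentence exceeds the length limit.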
    • Lithium SMM
    • Frequent Words and Stemming
      ▪ Most common words in the results for your query
        • excludes stopwords
      ▪ Trending words were previously not common
      ▪ Click on a frequent word to search within results
      ▪ Should we count…
        • words?
        • stems?
    • POS Tagging
      ▪ We use POS Tagging in Lithium SMM Quotes
        • along with other things
        • not such a “lightweight” application
      ▪ POS also useful for document categorization
        • POS-based features
        • machine learning
    • POS Tags and Document Categorization
      ▪ Author Gender
        • Automatic Categorization of Author Gender via N-Gram Analysis, Jonathan Doyle and Vlado Keselj. In The 6th Symposium on Natural Language Processing, SNLP2005, Chiang Rai, Thailand, December 2005.
      ▪ Opinion Spam
        • Finding Deceptive Opinion Spam by Any Stretch of the Imagination, Myle Ott, Yejin Choi, Claire Cardie and Jeffrey T. Hancock. In The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, June 19-24, 2011.
    • Lithium SMM Quotes
      ▪ Quotes
        • Select interesting sentences from social media documents
        • Classify them as love, hate, comparison, warning, etc.
      ▪ Quotes depends on
        • language identification
        • sentence breaking
        • POS tagging
        • parsing
        • specialized dictionaries
    • Lithium SMM Quotes
    • Lightweight NLP for Social Media Applications
      ▪ Resources
    • Wikipedia
      ▪ Corpus linguistics
      ▪ Cosine similarity
      ▪ Function word
      ▪ Language identification
      ▪ Machine learning
      ▪ N-gram
      ▪ Natural language processing
      ▪ Part-of-speech tagging
      ▪ Sentence boundary disambiguation
      ▪ Stemming
      ▪ Stop words
      ▪ Text mining
      ▪ Treebank
    • Software
      ▪ NLTK
        • Natural Language Toolkit
        • Python library for NLP
        • nltk.org
      ▪ OpenNLP
        • machine-learning based NLP tools
        • Java library for NLP
        • opennlp.apache.org
      ▪ Snowball
        • ANSI C and Java stemmers
        • snowball.tartarus.org
      ▪ Tika
        • Java toolkit for extracting metadata and text from documents
        • includes language identification
        • tika.apache.org
    • Books
      ▪ Natural Language Processing with Python
        • Steven Bird, Ewan Klein & Edward Loper
        • O’Reilly, 2009
      ▪ Foundations of Statistical Natural Language Processing
        • Chris Manning & Hinrich Schütze
        • MIT Press, 1999
    • Organization
      ▪ Association for Computational Linguistics
        • http://www.aclweb.org
      ▪ Remember, that’s aclweb.org
        • acl.org is the Association of Christian Librarians
    • Contact Info
      ▪ Bruce Smith
        • @btsmith
        • bruce.smith@lithium.com
      ▪ People at SXSW wearing Lithium’s Nation Builder T-shirts