Lightweight NLP
for Social Media Applications


Bruce Smith
Lithium Technologies, Inc.
SXSW 2012
March 13, 2012



@btsmith
#nlp #sxsw



Are You in the Right Session?
What Can You Learn in this Session?
NLP = Natural Language Processing
▪   This session is not about

        Natural Law Party

        Neuro-linguistic Programming

        No Light Perception (total blindness)

        Nonlinear Programming


N-Grams ≠ Engrams, Enneagrams, etc
▪   I will talk about “n-grams” several times
▪   Wikipedia has pages for 3 different kinds of “engram”
    • Neuropsychology
    • Scientology
    • 2009 album by Finnish black metal band Beherit

▪   Wikipedia has pages for 3 different kinds of “enneagram”
    • Nine-sided star polygon
    • Enneagram of Personality
    • Fourth Way Enneagram

Are you…
▪   developing a social media application?


▪   looking for ways to make your application better?


▪   interested in a quick introduction to NLP or text analytics?



Do you want to know…
▪   how you can use NLP tools in your social media app?


▪   if you need a Ph.D. to use NLP tools?


▪   where to find free NLP tools?


▪   where to learn more?


Do you want to understand…
▪   the role of machine learning in NLP?


▪   the difference between training and production?


▪   what a training corpus is and where to find one?




This is a Great Time to Start Using NLP!
▪   Computers are powerful and cheap!
▪   There’s a lot of very good, free software!
▪   There’s an enormous amount of very good, free text data!
▪   Don’t be afraid of non-English content!
    • Unicode is your friend
    • just remember ‘utf-8’
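A minimal sketch of what "just remember utf-8" can look like in Python: decode bytes to text when reading, work with text inside the program, and encode again when writing. The sample string is only an illustration.

```python
# Text (str) holds Unicode characters; files and sockets carry bytes.
text = "СРБИЈА"                      # six Cyrillic letters
raw = text.encode("utf-8")           # str -> bytes, as you would write to disk
restored = raw.decode("utf-8")       # bytes -> str, as you would read back

char_count = len(text)   # number of characters (code points)
byte_count = len(raw)    # number of bytes; each of these letters needs two
```

Counting characters and counting bytes give different answers for non-ASCII text, which matters for length limits such as those on status updates.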







Very Simple NLP with Very Little Math
Document, Corpus, Treebank
▪   document
    • newspaper article, novel, patent, scientific paper
    • blog post, comment, status update, tweet
▪   corpus
    • collection of documents
    • plural is “corpora”
▪   treebank
    • annotated corpus
    • words are annotated with parts of speech
    • sentences are annotated with parse trees

Penn Treebank’s Parts of Speech
CC    Coordinating conjunction         …      …
CD    Cardinal number                  POS    Possessive ending
DT    Determiner                       PRP    Personal pronoun
IN    Preposition or                   PRP$   Possessive pronoun
      subordinating conjunction        …      …
…     …                                VB     Verb, base form
JJ    Adjective                        VBD    Verb, past tense
JJR   Adjective, comparative           VBG    Verb, gerund
JJS   Adjective, superlative                  or present participle
…     …                                …      …
NN    Noun, singular or mass           WP     Wh-pronoun
NNS   Noun, plural                     WP$    Possessive wh-pronoun
NNP   Proper noun, singular            …      …

Phrase Structure Grammars & Parse Trees
▪   Parse tree for “Bruce likes dogs”
        (S
          (NP (NNP Bruce))
          (VP (VBZ likes)
              (NP (NNS dogs))))
▪   Grammar
    • S → NP VP
    • …
    • NP → NN
    • NP → JJ NN
    • …
    • VP → V NP
    • …
▪   Phrases (non-terminals)
    • S     Sentence
    • NP    Noun Phrase
    • VP    Verb Phrase
    • PP    Prepositional Phrase
    • …
▪   POS (terminals)
    • NNP   Proper noun, singular
    • NNS   Noun, plural
    • VBZ   Verb, 3rd person singular present
    • …
N-Grams
▪   contiguous subsequence of n items
    • in order and with no gaps
    • words
    • characters


▪   n-grams have special names when n is small
    • unigram n=1
    • bigram n=2
    • trigram n=3
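The definition above can be sketched in a few lines of Python. The helper names (`ngrams`, `char_ngrams`, `word_ngrams`) are made up for this example, and dropping spaces and punctuation before taking character n-grams is an assumption chosen to match the character n-gram slides.

```python
def ngrams(items, n):
    """Return every contiguous run of n items, in order and with no gaps."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

def char_ngrams(text, n):
    """Character n-grams over lowercase letters (spaces and punctuation dropped)."""
    letters = [c for c in text.lower() if c.isalpha()]
    return ["".join(g) for g in ngrams(letters, n)]

def word_ngrams(text, n):
    """Word n-grams over lowercased, whitespace-separated tokens."""
    return [" ".join(g) for g in ngrams(text.lower().split(), n)]

title = "Lightweight NLP for Social Media Applications"
bigrams = char_ngrams(title, 2)        # starts with 'li', 'ig', 'gh', 'ht', 'tw'
word_bigrams = word_ngrams(title, 2)   # starts with 'lightweight nlp'
```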


Character N-Grams
▪   Unigrams for this session’s title
      Lightweight NLP for Social Media Applications

l        w      t       o       i        d   p   t
i        e      n       r       a        i   l   i
g        i      l       s       l        a   i   o
h        g      p       o       m        a   c   n
t        h      f       c       e        p   a   s

Character N-Grams
▪   Bigrams for this session’s title
       Lightweight NLP for Social Media Applications

li        we      tn      or      ia      di   pl   ti
ig        ei      nl      rs      al      ia   li   io
gh        ig      lp      so      lm      aa   ic   on
ht        gh      pf      oc      me      ap   ca   ns
tw        ht      fo      ci      ed      pp   at

Character N-Grams
▪   Trigrams for this session’s title
        Lightweight NLP for Social Media Applications

lig        wei     tnl     ors     ial      dia   pli   tio
igh        eig     nlp     rso     alm      iaa   lic   ion
ght        igh     lpf     soc     lme      aap   ica   ons
htw        ght     pfo     oci     med      app   cat
twe        htn     for     cia     edi      ppl   ati

Character N-Gram Frequencies
▪   N-grams are interesting when we look at frequencies
      Lightweight NLP for Social Media Applications

i – 6               gh – 2              ght – 2
a – 4               ht – 2              igh – 2
l – 4               ia – 2              aap – 1
o – 3               ig – 2              alm – 1
p – 3               li – 2              app – 1
…                   …                   …
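Counting n-grams is a one-liner with `collections.Counter`. This is a sketch that, as an assumption, keeps only lowercase letters so the counts line up with the title example; `char_ngram_counts` is a name invented here.

```python
from collections import Counter

def char_ngram_counts(text, n):
    """Count character n-grams over the lowercase letters of text."""
    letters = "".join(c for c in text.lower() if c.isalpha())
    return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))

title = "Lightweight NLP for Social Media Applications"
unigram_counts = char_ngram_counts(title, 1)
bigram_counts = char_ngram_counts(title, 2)
trigram_counts = char_ngram_counts(title, 3)

top_unigram = unigram_counts.most_common(1)[0]   # the most frequent letter and its count
```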


Word N-Gram Frequencies
▪   Word n-grams from Pride and Prejudice (using NLTK)

to – 4116           to be – 436       i am sure – 72
the – 4105          of the – 430      as soon as – 59
of – 3572           in the – 359      in the world – 57
and – 3491          it was – 280      i do not – 46
her – 2551          of her – 276      could not be – 42
a – 2092            to the – 242      she could not – 39
…                   …                 …
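The same counting works for words. The sketch below uses only the standard library rather than NLTK, and a single sentence (the novel's opening line) instead of the whole book; the tokenizing regular expression is a simplifying assumption.

```python
import re
from collections import Counter

def word_ngram_counts(text, n):
    """Count word n-grams over lowercased alphabetic tokens."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))

sample = ("It is a truth universally acknowledged, that a single man in "
          "possession of a good fortune, must be in want of a wife.")

unigrams = word_ngram_counts(sample, 1)
bigrams = word_ngram_counts(sample, 2)
```

Even in one sentence the most frequent unigram is a function word ("a"), which is why the stopword-free view is often more informative.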

N-Gram Frequencies
▪   Word n-grams from Pride and Prejudice
    with no stopword unigrams

elinor – 685        to be – 436       i am sure – 72
could – 578         of the – 430      as soon as – 59
marianne – 566      in the – 359      in the world – 57
mrs – 530           it was – 280      i do not – 46
would – 515         of her – 276      could not be – 42
said – 397          to the – 242      she could not – 39
…                   …                 …
Cosine Similarity
▪   Make a vector from a document’s n-gram frequencies
▪   If A and B are frequency vectors for two documents

    \[
    \mathrm{similarity}(A, B)
      = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
      = \frac{\sum_{i=1}^{n} A_i B_i}
             {\sqrt{\sum_{i=1}^{n} A_i^2} \; \sqrt{\sum_{i=1}^{n} B_i^2}}
    \]
Cosine Similarity
▪   Create word N-gram frequency vectors
    • with unigrams, bigrams, trigrams
    • Moby Dick
    • Pride and Prejudice

▪   Compute their cosine similarity
                                  0.534
▪   More interesting with a larger set of documents…
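A minimal sketch of cosine similarity over sparse frequency vectors, using two toy sentences rather than the novels. Representing each vector as a `Counter` keyed by n-gram is an assumption of this example; n-grams missing from a document simply count as zero.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0   # an empty document is similar to nothing
    return dot / (norm_a * norm_b)

doc_a = Counter("the cat sat on the mat".split())
doc_b = Counter("the dog sat on the log".split())
score = cosine_similarity(doc_a, doc_b)   # about 0.75: dot product 6, both norms sqrt(8)
```

Because frequencies are never negative, the score always falls between 0 (no shared n-grams) and 1 (same direction).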


NLP and Machine Learning
▪   In the past, NLP was more about
    grammars and logic and parsing
▪   Today, NLP is more about
    statistics and machine learning
▪   Why?
    • computers are much more powerful
    • there are enormous amounts of very good, free data



NLP and Machine Learning
▪   Think of machine learning as

        programming by analyzing sample data


▪   Example
    • Use the Penn Treebank as sample data
    • Build a program that labels words with parts-of-speech



NLP and Machine Learning
▪   Training
    • depends on sample data, your training corpus
    • there are very good, free machine learning tools
    • sometimes training is slow
    • experiment with different techniques (perceptron, SVM, etc.)
    • test, test, test…

▪   Production
    • uses models generated during training
    • typically very fast
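To make the training/production split concrete, here is a deliberately simple sketch: a part-of-speech tagger that learns each word's most frequent tag from annotated sample data. It is a baseline for illustration, not a perceptron or SVM, and the two-sentence training corpus is hypothetical.

```python
from collections import Counter, defaultdict

def train(tagged_sentences):
    """Training: learn each word's most frequent tag from annotated sample data."""
    tag_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_counts[word.lower()][tag] += 1
    # The model is just a word -> tag dictionary.
    return {word: counts.most_common(1)[0][0] for word, counts in tag_counts.items()}

def tag(model, words, default="NN"):
    """Production: one dictionary lookup per word, with a default for unknown words."""
    return [(w, model.get(w.lower(), default)) for w in words]

# Hypothetical hand-made training corpus with Penn Treebank tags.
training_corpus = [
    [("Bruce", "NNP"), ("likes", "VBZ"), ("dogs", "NNS")],
    [("dogs", "NNS"), ("like", "VBP"), ("Bruce", "NNP")],
]
model = train(training_corpus)                   # done once, offline
tagged = tag(model, ["Bruce", "likes", "cats"])  # done per request, fast
```

The unknown word "cats" falls back to the default tag, which shows why the size and coverage of the training corpus matter.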


Lightweight NLP Techniques
▪   Language Identification


▪   Sentence Breaking


▪   Stemming


▪   Part-of-Speech Tagging


Language Identification
You might try looking at
▪   character sets (e.g. Unicode character blocks)
▪   words in language-specific dictionaries
▪   character n-gram frequencies and cosine similarity
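The first idea, looking at character sets, can be approximated with Python's `unicodedata` module. The standard library has no Unicode block lookup, so this sketch uses the first word of each character's Unicode name (LATIN, CYRILLIC, GREEK, ...) as a stand-in for its script; that shortcut and the function names are assumptions of the example.

```python
import unicodedata
from collections import Counter

def script_counts(text):
    """Count letters by the first word of their Unicode character name."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts

def dominant_script(text):
    """Return the most common script label, or None if the text has no letters."""
    counts = script_counts(text)
    return counts.most_common(1)[0][0] if counts else None
```

A script only narrows the choice: CYRILLIC rules out English but cannot separate Russian from Ukrainian, so this is usually a first filter before dictionary or n-gram methods.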




Language Identification
▪   Character n-gram frequencies for English
e       12.6%           th       3.9%            the      3.5%
t       9.1%            he       3.7%            and      1.6%
a       8.0%            in       2.3%            ing      1.1%
o       7.6%            er       2.2%            her      0.8%
i       6.9%            an       2.1%            hat      0.7%
n       6.9%            re       1.7%            his      0.6%
s       6.3%            nd       1.6%            tha      0.6%
h       6.2%            on       1.4%            ere      0.6%
…                       …                        …

From Cryptograms.org, derived from English documents at Project Gutenberg
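Frequency tables like the one above can serve as language profiles: build one per language from sample text, then pick the profile with the highest cosine similarity to the unknown text. The sketch below uses character bigrams and three tiny hypothetical samples; real profiles need far more text, and results on very short input are unreliable.

```python
import math
from collections import Counter

def profile(text, n=2):
    """Character n-gram counts over lowercase letters and single spaces."""
    cleaned = " ".join("".join(c if c.isalpha() else " " for c in text.lower()).split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))

def cosine(a, b):
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Tiny hypothetical training samples, one per language.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then runs into the forest",
    "de": "der schnelle braune fuchs springt über den faulen hund und läuft dann in den wald",
    "fr": "le rapide renard brun saute par dessus le chien paresseux puis court dans la forêt",
}
PROFILES = {lang: profile(text) for lang, text in SAMPLES.items()}   # "training"

def identify(text):
    """Return the language whose profile is most similar to the text ("production")."""
    target = profile(text)
    return max(PROFILES, key=lambda lang: cosine(target, PROFILES[lang]))
```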

Language Identification with Tika
▪    tika.apache.org
▪    models for
da       Danish        fr   French       ro   Romanian
de       German        is   Icelandic    ru   Russian
et       Estonian      it   Italian      sv   Swedish
el       Greek         nl   Dutch        th   Thai
en       English       no   Norwegian    uk   Ukrainian
es       Spanish       pl   Polish       …
fi       Finnish       pt   Portuguese

▪    trainable with sample data
Where can you find samples of…
▪   French?
▪   German?
▪   Russian?
▪   Japanese?
▪   Arabic?
▪   Cherokee?

Sentence Breaking
▪   Also known as
    • sentence boundary disambiguation
    • sentence detection

▪   You could just look for punctuation, but…
    • what about abbreviations?
    • what about numbers?
    • what about domain names like lithium.com, etc.?
    • what about names like Yahoo!, etc.?
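A short sketch of why punctuation alone is not enough. `naive_split` breaks after every ".", "!" or "?" followed by whitespace; `split_sentences` adds two hand-written heuristics (a small abbreviation list, and "the next piece starts in lowercase"). Both heuristics and the abbreviation list are assumptions of this example; a trained sentence detector handles far more cases.

```python
import re

ABBREVIATIONS = {"dr.", "mr.", "mrs.", "inc.", "etc.", "e.g.", "i.e."}  # illustrative, far from complete

def naive_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

def split_sentences(text):
    """Re-join naive pieces that follow an abbreviation or start in lowercase."""
    sentences = []
    for piece in naive_split(text):
        if sentences:
            previous = sentences[-1]
            last_word = previous.split()[-1].lower()
            if last_word in ABBREVIATIONS or piece[:1].islower():
                sentences[-1] = previous + " " + piece
                continue
        sentences.append(piece)
    return sentences

text = "Dr. Smith works at lithium.com. He joined Yahoo! in 2005. Pi is about 3.14."
naive = naive_split(text)        # 5 pieces: breaks after "Dr." and "Yahoo!"
better = split_sentences(text)   # 3 sentences
```

The domain name and the decimal number survive even the naive version, because no whitespace follows their periods; the abbreviation and "Yahoo!" do not.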




Sentence Breaking with OpenNLP
▪   opennlp.apache.org
▪   models for

    da Danish              nl        Dutch
    de German              pt        Portuguese
    en English             se        Swedish


▪   trainable with new sample data

Stemming
▪   Reducing a word to a stem or base form
▪   Porter Stemmer is a popular stemmer for English
▪   Examples

        lightweight   → lightweight
        natural       → natur
        language      → languag
        processing    → process
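To show the idea without any dependencies, here is a crude suffix-stripping stemmer. It is not the Porter algorithm, which applies ordered rule phases with conditions on the remaining stem, so its output differs from the examples above (it leaves "natural" and "language" unchanged). The suffix list and `min_stem` threshold are assumptions; in practice use a tested implementation such as NLTK's `PorterStemmer` or Snowball.

```python
SUFFIXES = ["ations", "ation", "ments", "ment", "ings", "ing", "ed", "ly", "es", "s"]

def crude_stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

stems = {w: crude_stem(w) for w in ["processing", "affected", "affecting", "affects", "lightweight"]}
```

Several surface forms ("affected", "affecting", "affects") collapse to one stem, which is the property that makes stemming useful for counting and search.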

Stemming
▪   A few examples from Pride and Prejudice (using NLTK)

    Stem      Words
    affect    affect, affectation, affected, affecting, affection, affections, affects
    amus      amuse, amused, amusement, amusements, amusing
    close     close, closed, closely, closing
    grate     grate, grateful, gratefully
Stemming with Snowball
▪   tartarus.org
▪   stemmers for

    de   German    nl    Dutch
    en   English   no    Norwegian
    es   Spanish   pt    Portuguese
    fi   Finnish   ru    Russian
    fr   French    se    Swedish
    it   Italian   …

Part-of-Speech Tagging
▪   “Part of speech” is frequently abbreviated POS
▪   Not every language has the same parts of speech
▪   Even for one language,
    not everyone agrees on the parts of speech
▪   Example: Penn Treebank POS tags for English




Part-of-Speech Tagging
lightweight nlp for social media applications
       lightweight    NN
       nlp            NN
       for            IN
       social         JJ
       media          NNS
       applications   NNS

nlp is easier than you thought
       nlp            NN
       is             VBZ
       easier         JJR
       than           IN
       you            PRP
       thought        VBD
Part-of-Speech Tagging with OpenNLP
▪   opennlp.apache.org
▪   two kinds of models for each of

    de German                pt       Portuguese
    en English               se       Swedish
    nl Dutch
▪   trainable with new sample data


Lightweight NLP in Applications
▪   Language Identification
▪   Sentence Breaking for Summaries
▪   Stemming for Word Counts
▪   POS Tagging for Document Categorization
▪   Lithium SMM Quotes



Lithium SMM (Social Media Monitoring)




Language Identification
▪   Language ID is never perfect,
    especially with social media!

    •   short documents
    •   ambiguity
    •   mixed languages
    •   nonsense
    •   and… lots of very strange stuff




What language is this?      ______________$$$$______________
                            ____________$$$$$$$$____________
                            ___________$$$$$$$$$$___________
                            ___________$$$$$$$$$$___________
                            _____________$$$$$$_____________
                            _____________$$$$$$_____________
                            _____________$$$$$$_____________
                            ____$$$$_____$$$$$$_____$$$$____
                            ___$$$$$_____$$$$$$_____$$$$$___
                            _$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$_
                            _$$$$$$$$$$$.СРБИЈА.$$$$$$$$$$$_
                            ___$$$$$$$$$$$$$$$$$$$$$$$$$$___
                            ____$$$$_____$$$$$$_____$$$$____
                            _____________$$$$$$_____________
                            _____________$$$$$$_____________
                            _____________$$$$$$_____________
                            _____________$$$$$$_____________
                            _____________$$$$$$_____________
                            _____________$$$$$$_____________
                            _____________$$$$$$_____________
                            _____________$$$$$$_____________
                            ___________$$$$$$$$$$___________
                            ___________$$$$$$$$$$___________
                            ____________$$$$$$$$____________
                            ______________$$$$______________

What language is this?

ღೋ   ´¯`•.¸,¤°`°¤,¸.•´¯`•.¸,¤Ƹ ̵̡Ӝ          ´¯`•.
                                      ̵̨̄Ʒ´¯`•.ღೋ
                                          ╱▔▌
╔═╗╔═╗╔═╗╔═╗╔═╗░╔═╗╔╗╔═╗╔═╗╔═╗╔╗╔╗─╔═╗─ ╱▔▔▔▔╲ ╱▌
║█║║═╣║═╣║╠╝║═╣░║▌║╚╝║█║║█║║█║║║║║─║═╣ ╱◑▓░▓░░ ▌
║╔╝║═╣╠═║║╠╗║═╣░║▌║──║╦║║╔╝║═╣║║║╚╗║═╣ ╲░░░░░╱╲▌
╚╝─╚═╝╚═╝╚═╝╚═╝░╚═╝──╚╩╝╚╝─╚╩╝╚╝╚═╝╚═╝─ ▔▔╲▌▔




Lithium SMM




Sentence Breaking for Summaries
▪   Summary does not replace the document
▪   Summary lets you decide if the document is interesting
▪   Summaries are sentences selected from the document
    • contain the search terms
    • not too short, not too long, etc
    • truncated only if necessary
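The selection rules above can be sketched as follows. This is not Lithium's implementation; the function names and every threshold (`min_words`, `max_words`, `max_sentences`, `max_chars`) are assumptions chosen for illustration, and the input is assumed to be already broken into sentences.

```python
import re

def summarize(sentences, search_terms, min_words=4, max_words=30, max_sentences=2):
    """Pick the first sentences that contain a search term and have a reasonable length."""
    terms = {t.lower() for t in search_terms}
    chosen = []
    for sentence in sentences:
        words = re.findall(r"\w+", sentence.lower())
        if terms & set(words) and min_words <= len(words) <= max_words:
            chosen.append(sentence)
        if len(chosen) == max_sentences:
            break
    return chosen

def truncate(sentence, max_chars=80):
    """Cut at a word boundary and add an ellipsis, only when the sentence is too long."""
    if len(sentence) <= max_chars:
        return sentence
    return sentence[:max_chars].rsplit(" ", 1)[0] + "..."

sentences = [
    "Great!",
    "I love my new phone because the battery lasts all day.",
    "The weather is nice today.",
    "This phone also takes good photos.",
]
summary = [truncate(s) for s in summarize(sentences, ["phone"])]
```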




Lithium SMM




Frequent Words and Stemming
▪   Most common words in the results for your query
    • excludes stopwords

▪   Trending words were previously not common
▪   Click on a frequent word to search within results
▪   Should we count…
    • words?
    • stems?
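The words-versus-stems question is easy to try out. In this sketch `frequent_terms` takes a pluggable `normalize` function; `strip_plural` is a hypothetical stand-in for a real stemmer, and the stopword list and sample documents are invented for the example.

```python
from collections import Counter

STOPWORDS = {"a", "and", "i", "is", "it", "my", "the", "this", "to"}  # tiny illustrative list

def frequent_terms(documents, normalize=lambda w: w, top=3):
    """Count non-stopword terms across documents after applying `normalize`."""
    counts = Counter()
    for doc in documents:
        for word in doc.lower().split():
            word = word.strip(".,!?")
            if word and word not in STOPWORDS:
                counts[normalize(word)] += 1
    return counts.most_common(top)

def strip_plural(word):
    """Hypothetical stand-in for a real stemmer: drop a trailing 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

docs = ["I love this phone.", "My phones love to crash!", "The phone is great."]
by_word = dict(frequent_terms(docs, top=10))                          # 'phone' and 'phones' counted apart
by_stem = dict(frequent_terms(docs, normalize=strip_plural, top=10))  # merged under 'phone'
```

Counting stems merges variants and gives cleaner totals, but the displayed term may be a stem rather than a word people recognize, which matters if users click on it to search.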


POS Tagging
▪   We use POS Tagging in Lithium SMM Quotes
    • along with other things
    • not such a “lightweight” application

▪   POS also useful for document categorization
    • POS-based features
    • machine learning




POS Tags and Document Categorization
▪   Author Gender
    Automatic Categorization of Author Gender via N-Gram Analysis,
    Jonathan Doyle and Vlado Keselj. In The 6th Symposium on Natural
    Language Processing, SNLP'2005, Chiang Rai, Thailand, December 2005.


▪   Opinion Spam
    Finding Deceptive Opinion Spam by Any Stretch of the Imagination,
    Myle Ott, Yejin Choi, Claire Cardie and Jeffrey T. Hancock, The 49th Annual
    Meeting of the Association for Computational Linguistics: Human Language
    Technologies, Portland, Oregon, USA, June 19-24, 2011.

Lithium SMM Quotes
▪   Quotes
    • Select interesting sentences from social media documents
    • Classify them as love, hate, comparison, warning, etc.

▪   Quotes depends on
    •   language identification
    •   sentence breaking
    •   POS tagging
    •   parsing
    •   specialized dictionaries



Lithium SMM Quotes







Resources




Wikipedia
▪   Corpus linguistics
▪   Cosine similarity
▪   Function word
▪   Language identification
▪   Machine learning
▪   N-gram
▪   Natural language processing
▪   Part-of-speech tagging
▪   Sentence boundary disambiguation
▪   Stemming
▪   Stop words
▪   Text mining
▪   Treebank
Software
▪   NLTK
    • Natural Language Toolkit
    • Python library for NLP
    • nltk.org
▪   OpenNLP
    • machine-learning based NLP tools
    • Java library for NLP
    • opennlp.apache.org
▪   Snowball
    • ANSI C and Java stemmers
    • snowball.tartarus.org
▪   Tika
    • Java toolkit for extracting metadata and text from documents
    • includes language identification
    • tika.apache.org
Books
▪   Natural Language Processing with Python
    Steven Bird, Ewan Klein & Edward Loper
    O’Reilly, 2009


▪   Foundations of Statistical Natural Language Processing
    Chris Manning & Hinrich Schütze
    MIT Press, 1999


Organization
▪   Association for Computational Linguistics


        http://www.aclweb.org


▪   Remember that’s aclweb.org

    acl.org is the Association of Christian Librarians

Contact Info
▪   Bruce Smith
    @btsmith
    bruce.smith@lithium.com


▪   People at SXSW wearing Lithium’s Nation Builder T-shirts





Get Serious About Social Customer EngagementGet Serious About Social Customer Engagement
Get Serious About Social Customer Engagement
 
Get Serious About Social Customer Care
Get Serious About Social Customer Care Get Serious About Social Customer Care
Get Serious About Social Customer Care
 
Superfans
Superfans Superfans
Superfans
 
Best Practices Ideation
Best Practices IdeationBest Practices Ideation
Best Practices Ideation
 
MarketingProfs 2012 State of Social Media
MarketingProfs 2012 State of Social Media MarketingProfs 2012 State of Social Media
MarketingProfs 2012 State of Social Media
 
Gold in Them Hills: Computing ROI for Support Communities
Gold in Them Hills: Computing ROI for Support CommunitiesGold in Them Hills: Computing ROI for Support Communities
Gold in Them Hills: Computing ROI for Support Communities
 
Building Customer Networks for Successful Word of Mouth Marketing
Building Customer Networks for Successful Word of Mouth MarketingBuilding Customer Networks for Successful Word of Mouth Marketing
Building Customer Networks for Successful Word of Mouth Marketing
 
Telesperience customer-experience-benchmark-2013 asi8-d3sd
Telesperience customer-experience-benchmark-2013 asi8-d3sdTelesperience customer-experience-benchmark-2013 asi8-d3sd
Telesperience customer-experience-benchmark-2013 asi8-d3sd
 

Recently uploaded

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 

Recently uploaded (20)

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 

Lightweight Natural Language Processing (NLP)

  • 1. Lightweight NLP for Social Media Applications Bruce Smith Lithium Technologies, Inc. SXSW 2012 March 13, 2012 @btsmith #nlp #sxsw
  • 2. Lightweight NLP for Social Media Applications: Are You in the Right Session? What Can You Learn in this Session?
  • 3. NLP = Natural Language Processing ▪ This session is not about • Natural Law Party • Neuro-linguistic Programming • No Light Perception (total blindness) • Nonlinear Programming
  • 4. N-Grams ≠ Engrams, Enneagrams, etc. ▪ I will talk about “n-grams” several times ▪ Wikipedia has pages for 3 different kinds of “engram” • Neuropsychology • Scientology • 2009 album by Finnish black metal band Beherit ▪ Wikipedia has pages for 3 different kinds of “enneagram” • Nine-sided star polygon • Enneagram of Personality • Fourth Way Enneagram
  • 5. Are you… ▪ developing a social media application? ▪ looking for ways to make your application better? ▪ interested in a quick introduction to NLP or text analytics?
  • 6. Do you want to know… ▪ how you can use NLP tools in your social media app? ▪ if you need a Ph.D. to use NLP tools? ▪ where to find free NLP tools? ▪ where to learn more?
  • 7. Do you want to understand… ▪ the role of machine learning in NLP? ▪ the difference between training and production? ▪ what a training corpus is and where to find one?
  • 8. This is a Great Time to Start Using NLP! ▪ Computers are powerful and cheap! ▪ There’s a lot of very good, free software! ▪ There’s an enormous amount of very good, free text data! ▪ Don’t be afraid of non-English content! • Unicode is your friend • just remember ‘utf-8’
  • 9. Lightweight NLP for Social Media Applications: Very Simple NLP with Very Little Math
  • 10. Document, Corpus, Treebank ▪ document • newspaper article, novel, patent, scientific paper • blog post, comment, status update, tweet ▪ corpus • collection of documents • plural is “corpora” ▪ treebank • annotated corpus • words are annotated with parts of speech • sentences are annotated with parse trees
  • 11. Penn Treebank’s Parts of Speech • CC: Coordinating conjunction • CD: Cardinal number • DT: Determiner • IN: Preposition or subordinating conjunction • … • JJ: Adjective • JJR: Adjective, comparative • JJS: Adjective, superlative • … • NN: Noun, singular or mass • NNS: Noun, plural • NNP: Proper noun, singular • … • POS: Possessive ending • PRP: Personal pronoun • PRP$: Possessive pronoun • … • VB: Verb, base form • VBD: Verb, past tense • VBG: Verb, gerund or present participle • … • WP: Wh-pronoun • WP$: Possessive wh-pronoun • …
  • 12. Phrase Structure Grammars & Parse Trees ▪ Phrases (non-terminals) • S: Sentence • NP: Noun Phrase • VP: Verb Phrase • PP: Prepositional Phrase • … ▪ POS (terminals) • NNP: Proper noun, singular • NNS: Noun, plural • VBZ: Verb, 3rd person singular present • … ▪ Grammar • S → NP VP • NP → NN • NP → JJ NN • … • VP → V NP • … ▪ Parse Tree • (S (NP (NNP Bruce)) (VP (VBZ likes) (NP (NNS dogs))))
  • 13. N-Grams ▪ contiguous subsequence of n items • in order and with no gaps • words • characters ▪ n-grams have special names when n is small • unigram n=1 • bigram n=2 • trigram n=3
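The definition on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk; the function name `ngrams` and the sample sentence (borrowed from the parse-tree slide) are assumptions made for the example.

```python
def ngrams(items, n):
    """Return every contiguous run of n items, in order and with no gaps."""
    # Works on any sequence: a list of words or a string of characters.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


words = "bruce likes dogs".split()
unigrams = ngrams(words, 1)       # n = 1
bigrams = ngrams(words, 2)        # n = 2
trigrams = ngrams(words, 3)       # n = 3
char_bigrams = ngrams("nlp", 2)   # character n-grams use the same function
```

The same function covers word and character n-grams because it only relies on slicing; asking for an n larger than the sequence simply returns an empty list.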
  • 14. Character N-Grams ▪ Unigrams for this session’s title, Lightweight NLP for Social Media Applications: l i g h t w e i g h t n l p f o r s o c i a l m e d i a a p p l i c a t i o n s
  • 15. Character N-Grams ▪ Bigrams for this session’s title, Lightweight NLP for Social Media Applications: li ig gh ht tw we ei ig gh ht tn nl lp pf fo or rs so oc ci ia al lm me ed di ia aa ap pp pl li ic ca at ti io on ns
  • 16. Character N-Grams ▪ Trigrams for this session’s title, Lightweight NLP for Social Media Applications: lig igh ght htw twe wei eig igh ght htn tnl nlp lpf pfo for ors rso soc oci cia ial alm lme med edi dia iaa aap app ppl pli lic ica cat ati tio ion ons
  • 17. Character N-Gram Frequencies ▪ N-grams are interesting when we look at frequencies ▪ For the title, Lightweight NLP for Social Media Applications • unigrams: i: 6, a: 4, l: 4, o: 3, p: 3, … • bigrams: gh: 2, ht: 2, ia: 2, ig: 2, li: 2, … • trigrams: ght: 2, igh: 2, aap: 1, alm: 1, app: 1, …
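The counts on the character n-gram slides can be reproduced with a short sketch. It assumes, as the slides appear to, that the title is lowercased and that spaces are dropped before the n-grams are taken; the function name `char_ngram_counts` is made up for this example.

```python
from collections import Counter


def char_ngram_counts(text, n):
    """Count character n-grams after lowercasing and keeping letters only."""
    letters = "".join(ch for ch in text.lower() if ch.isalpha())
    return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))


TITLE = "Lightweight NLP for Social Media Applications"
unigram_counts = char_ngram_counts(TITLE, 1)
bigram_counts = char_ngram_counts(TITLE, 2)
trigram_counts = char_ngram_counts(TITLE, 3)

# Show the five most frequent n-grams of each size.
for counts in (unigram_counts, bigram_counts, trigram_counts):
    print(counts.most_common(5))
```

Ties between equally frequent n-grams may print in a different order than on the slide, since `most_common` breaks ties by first occurrence.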
  • 18. Word N-Gram Frequencies ▪ Word n-grams from Pride and Prejudice (using NLTK) • unigrams: to: 4116, the: 4105, of: 3572, and: 3491, her: 2551, a: 2092, … • bigrams: to be: 436, of the: 430, in the: 359, it was: 280, of her: 276, to the: 242, … • trigrams: i am sure: 72, as soon as: 59, in the world: 57, i do not: 46, could not be: 42, she could not: 39, …
  • 19. N-Gram Frequencies ▪ Word n-grams from Pride and Prejudice with no stopword unigrams • unigrams: elinor: 685, could: 578, marianne: 566, mrs: 530, would: 515, said: 397, … • bigrams: to be: 436, of the: 430, in the: 359, it was: 280, of her: 276, to the: 242, … • trigrams: i am sure: 72, as soon as: 59, in the world: 57, i do not: 46, could not be: 42, she could not: 39, …
  • 20. Cosine Similarity ▪ Make a vector from a document’s n-gram frequencies ▪ If A and B are frequency vectors for two documents, then $\mathrm{similarity}(A, B) = \dfrac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \dfrac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$
  • 21. Cosine Similarity ▪ Create word n-gram frequency vectors • with unigrams, bigrams, trigrams • Moby Dick • Pride and Prejudice ▪ Compute their cosine similarity: 0.534 ▪ More interesting with a larger set of documents…
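A minimal sketch of cosine similarity over n-gram frequency vectors follows. It stores each vector as a `Counter` keyed by n-gram, which is one convenient sparse representation and not necessarily what was used for the talk; the helper names and the two toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def word_ngram_vector(text, max_n=3):
    """Frequency vector of word unigrams, bigrams and trigrams."""
    words = text.lower().split()
    vector = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            vector[" ".join(words[i:i + n])] += 1
    return vector


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


doc_a = word_ngram_vector("the whale and the sea")
doc_b = word_ngram_vector("the sea and the ship")
score = cosine_similarity(doc_a, doc_b)
```

Only n-grams present in both documents contribute to the dot product, so the score is 0 for documents with nothing in common and 1 for documents with identical frequency profiles.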
  • 22. NLP and Machine Learning ▪ In the past, NLP was more about grammars and logic and parsing ▪ Today, NLP is more about statistics and machine learning ▪ Why? • computers are much more powerful • there are enormous amounts of very good, free data
  • 23. NLP and Machine Learning ▪ Think of machine learning as programming by analyzing sample data ▪ Example • Use the Penn Treebank as sample data • Build a program that labels words with parts-of-speech
  • 24. NLP and Machine Learning ▪ Training • depends on sample data, your training corpus • there are very good, free machine learning tools • sometimes training is slow • experiment with different techniques (perceptron, SVM, etc.) • test, test, test… ▪ Production • uses models generated during training • typically very fast
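The split between training and production can be sketched with a deliberately simple learner: a tagger that remembers the most frequent tag seen for each word. This is an illustrative baseline, not the perceptron or SVM techniques named above, and the three-sentence training corpus (built from examples elsewhere in this deck) is a hypothetical stand-in for a real treebank.

```python
from collections import Counter, defaultdict

# Hypothetical toy training corpus of (word, tag) pairs with Penn Treebank tags.
TRAINING_CORPUS = [
    [("bruce", "NNP"), ("likes", "VBZ"), ("dogs", "NNS")],
    [("lightweight", "NN"), ("nlp", "NN"), ("for", "IN"),
     ("social", "JJ"), ("media", "NNS"), ("applications", "NNS")],
    [("nlp", "NN"), ("is", "VBZ"), ("easier", "JJR"),
     ("than", "IN"), ("you", "PRP"), ("thought", "VBD")],
]


def train(tagged_sentences):
    """Training: analyze sample data and produce a model (word -> best tag)."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(model, words, default_tag="NN"):
    """Production: a fast dictionary lookup; unseen words get default_tag."""
    return [(word, model.get(word.lower(), default_tag)) for word in words]


model = train(TRAINING_CORPUS)
tagged = tag(model, ["Bruce", "likes", "nlp", "apps"])
```

The expensive step (`train`) runs once over the corpus, while `tag` only consults the resulting model, which mirrors the slow-training, fast-production pattern described on the slide.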
  • 25. Lightweight NLP for Social Media Applications: Lightweight NLP Techniques
  • 26. Lightweight NLP Techniques ▪ Language Identification ▪ Sentence Breaking ▪ Stemming ▪ Part-of-Speech Tagging
  • 27. Language Identification You might try looking at ▪ character sets (e.g. Unicode character blocks) ▪ words in language-specific dictionaries ▪ character n-gram frequencies and cosine similarity
  • 28. Language Identification ▪ Character n-gram frequencies for English • unigrams: e 12.6%, t 9.1%, a 8.0%, o 7.6%, i 6.9%, n 6.9%, s 6.3%, h 6.2%, … • bigrams: th 3.9%, he 3.7%, in 2.3%, er 2.2%, an 2.1%, re 1.7%, nd 1.6%, on 1.4%, … • trigrams: the 3.5%, and 1.6%, ing 1.1%, her 0.8%, hat 0.7%, his 0.6%, tha 0.6%, ere 0.6%, … ▪ From Cryptograms.org, derived from English documents at Project Gutenberg
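The third approach, character n-gram frequencies compared by cosine similarity, might look like the sketch below. The two one-sentence training samples are hypothetical and far too small for real use; a practical profile would be built from a large corpus per language, and tools such as Tika ship trained models.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Character trigram counts, lowercased, with whitespace normalized."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a, b):
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm = math.sqrt(sum(c * c for c in a.values())) * \
        math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical miniature training samples, one per language.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the cat sleeps in the sun",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et puis le chat dort au soleil",
}
PROFILES = {lang: trigram_profile(sample) for lang, sample in SAMPLES.items()}


def identify_language(text, profiles=PROFILES):
    """Return the language whose profile is most similar to the text."""
    query = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


guess_en = identify_language("the dog and the cat")
guess_fr = identify_language("le chien et le chat")
```

Spaces are kept inside the trigrams here so that word boundaries such as "he " and " th" contribute to the profile; that is a design choice for this sketch, not something the slides specify.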
  • 29. Language Identification with Tika ▪ tika.apache.org ▪ models for • da Danish • de German • et Estonian • el Greek • en English • es Spanish • fi Finnish • fr French • is Icelandic • it Italian • nl Dutch • no Norwegian • pl Polish • pt Portuguese • ro Romanian • ru Russian • sv Swedish • th Thai • uk Ukrainian • … ▪ trainable with sample data
  • 30. Where can you find samples of… ▪ French? ▪ German? ▪ Russian? ▪ Japanese? ▪ Arabic? ▪ Cherokee?
  • 31. Sentence Breaking ▪ Also known as • sentence boundary disambiguation • sentence detection ▪ You could just look for punctuation, but… • what about abbreviations? • what about numbers? • what about domain names like lithium.com, etc? • what about names like Yahoo!, etc?
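To see why plain punctuation matching is not enough, here is a rule-based sketch that handles a few of the cases listed above. The abbreviation list is a tiny hypothetical sample, and the approach is a hand-written heuristic, not the trained models that OpenNLP provides.

```python
import re

# Tiny illustrative abbreviation list; a real one would be much longer.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "inc", "etc"}


def split_sentences(text):
    """Break after ., ! or ? followed by whitespace, unless the preceding
    word is a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]+\s+", text):
        words = text[start:match.start()].split()
        if words and words[-1].lower() in ABBREVIATIONS:
            continue  # "Dr. Smith": the period belongs to the abbreviation
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


TEXT = ("Dr. Smith works at Lithium Technologies, Inc. in California. "
        "Visit lithium.com for details. It costs 3.5 dollars!")
sentences = split_sentences(TEXT)
```

Requiring whitespace after the punctuation keeps `lithium.com` and `3.5` intact. The sketch still fails on names like Yahoo! in mid-sentence and on abbreviations that really do end a sentence, which is the kind of ambiguity that motivates a trainable sentence detector.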
  • 32. Sentence Breaking with OpenNLP ▪ opennlp.apache.org ▪ models for • da Danish • de German • en English • nl Dutch • pt Portuguese • se Swedish ▪ trainable with new sample data
  • 33. Stemming ▪ Reducing a word to a stem or base form ▪ Porter Stemmer is a popular stemmer for English ▪ Examples • lightweight → lightweight • natural → natur • language → languag • processing → process
  • 34. Stemming ▪ A few examples from Pride and Prejudice (using NLTK), listed as stem: words • affect: affect, affectation, affected, affecting, affection, affections, affects • amus: amuse, amused, amusement, amusements, amusing • close: close, closed, closely, closing • grate: grate, grateful, gratefully
  • 35. Stemming with Snowball ▪ tartarus.org ▪ stemmers for • de German • en English • es Spanish • fi Finnish • fr French • it Italian • nl Dutch • no Norwegian • pt Portuguese • ru Russian • se Swedish • …
  • 36. Part-of-Speech Tagging ▪ “Part of speech” is frequently abbreviated POS ▪ Not every language has the same parts of speech ▪ Even for one language, not everyone agrees on the parts of speech ▪ Example: Penn Treebank POS tags for English
  • 37. Part-of-Speech Tagging ▪ lightweight nlp for social media applications • lightweight NN • nlp NN • for IN • social JJ • media NNS • applications NNS ▪ nlp is easier than you thought • nlp NN • is VBZ • easier JJR • than IN • you PRP • thought VBD
  • 38. Part-of-Speech Tagging with OpenNLP ▪ opennlp.apache.org ▪ two kinds of models for each of • de German • en English • nl Dutch • pt Portuguese • se Swedish ▪ trainable with new sample data
  • 39. Lightweight NLP for Social Media Applications: Lightweight NLP in Applications
  • 40. Lightweight NLP in Applications ▪ Language Identification ▪ Sentence Breaking for Summaries ▪ Stemming for Word Counts ▪ POS Tagging for Document Categorization ▪ Lithium SMM Quotes
  • 41. Lithium SMM (Social Media Monitoring)
  • 42. Language Identification ▪ Language ID is never perfect, especially with social media! • short documents • ambiguity • mixed languages • nonsense • and… lots of very strange stuff
  • 43. What language is this? ______________$$$$______________ ____________$$$$$$$$____________ ___________$$$$$$$$$$___________ ___________$$$$$$$$$$___________ _____________$$$$$$_____________ _____________$$$$$$_____________ _____________$$$$$$_____________ ____$$$$_____$$$$$$_____$$$$____ ___$$$$$_____$$$$$$_____$$$$$___ _$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$_ _$$$$$$$$$$$.СРБИЈА.$$$$$$$$$$$_ ___$$$$$$$$$$$$$$$$$$$$$$$$$$___ ____$$$$_____$$$$$$_____$$$$____ _____________$$$$$$_____________ _____________$$$$$$_____________ _____________$$$$$$_____________ _____________$$$$$$_____________ _____________$$$$$$_____________ _____________$$$$$$_____________ _____________$$$$$$_____________ _____________$$$$$$_____________ ___________$$$$$$$$$$___________ ___________$$$$$$$$$$___________ ____________$$$$$$$$____________ ______________$$$$______________ @btsmith #nlp 43
  • 44. What language is this? ღೋ ´¯`•.¸,¤°`°¤,¸.•´¯`•.¸,¤Ƹ ̵̡Ӝ ´¯`•. ̵̨̄Ʒ´¯`•.ღೋ ╱▔▌ ╔═╗╔═╗╔═╗╔═╗╔═╗░╔═╗╔╗╔═╗╔═╗╔═╗╔╗╔╗─╔═╗─ ╱▔▔▔▔╲ ╱▌ ║█║║═╣║═╣║╠╝║═╣░║▌║╚╝║█║║█║║█║║║║║─║═╣ ╱◑▓░▓░░ ▌ ║╔╝║═╣╠═║║╠╗║═╣░║▌║──║╦║║╔╝║═╣║║║╚╗║═╣ ╲░░░░░╱╲▌ ╚╝─╚═╝╚═╝╚═╝╚═╝░╚═╝──╚╩╝╚╝─╚╩╝╚╝╚═╝╚═╝─ ▔▔╲▌▔ @btsmith #nlp 44
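For posts like the two above, one rough first step is to ignore everything that is not a letter and check which script the remaining characters belong to. The sketch below takes the first word of each character's Unicode name as a stand-in for its script; that is a heuristic assumed for this example, not an official Unicode script lookup, and the function name is hypothetical.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most common script among the letters in text, or None
    when the text contains no letters at all."""
    scripts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # e.g. "CYRILLIC CAPITAL LETTER ES" -> "CYRILLIC"
                scripts[name.split()[0]] += 1
    if not scripts:
        return None
    return scripts.most_common(1)[0][0]


art_script = dominant_script("_$$$$$$$$$$$.СРБИЈА.$$$$$$$$$$$_")
no_script = dominant_script("____$$$$$$____")
latin_script = dominant_script("Lightweight NLP")
```

A script only narrows the candidates (Cyrillic is shared by Serbian, Russian, Ukrainian and others), so a check like this would normally come before n-gram based identification, not replace it.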
  • 45. Lithium SMM
  • 46. Sentence Breaking for Summaries ▪ Summary does not replace the document ▪ Summary lets you decide if the document is interesting ▪ Summaries are sentences selected from the document • contain the search terms • not too short, not too long, etc. • truncated only if necessary
  • 47. Lithium SMM
  • 48. Frequent Words and Stemming ▪ Most common words in the results for your query • excludes stopwords ▪ Trending words were previously not common ▪ Click on a frequent word to search within results ▪ Should we count… • words? • stems?
  • 49. POS Tagging ▪ We use POS Tagging in Lithium SMM Quotes • along with other things • not such a “lightweight” application ▪ POS also useful for document categorization • POS-based features • machine learning
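The selection rules for summaries can be sketched as a small filter over sentences that have already been broken out of the document. The thresholds (`min_words`, `max_words`, `max_sentences`, `max_chars`) are hypothetical values picked for the example, and this is not Lithium's actual implementation.

```python
def summarize(sentences, search_terms, min_words=4, max_words=30,
              max_sentences=2, max_chars=200):
    """Pick sentences that contain a search term and have a reasonable
    length, then truncate the result only if it is still too long."""
    terms = {term.lower() for term in search_terms}
    selected = []
    for sentence in sentences:
        words = sentence.lower().split()
        if not (min_words <= len(words) <= max_words):
            continue  # too short or too long to be a useful summary line
        cleaned = {word.strip(".,!?;:\"'()") for word in words}
        if terms & cleaned:
            selected.append(sentence)
        if len(selected) == max_sentences:
            break
    summary = " ".join(selected)
    if len(summary) > max_chars:
        summary = summary[:max_chars].rstrip() + "..."
    return summary


SENTENCES = [
    "I bought a new phone yesterday.",
    "Wow!",
    "The phone battery lasts two full days.",
    "My cat ignored it completely.",
]
summary = summarize(SENTENCES, ["phone"])
```

Because whole sentences are selected first, truncation is a last resort, which matches the "truncated only if necessary" rule on the slide.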
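The words-versus-stems question is easier to judge with both counts side by side. The sketch below uses a deliberately crude suffix stripper as a stand-in for a real stemmer such as Porter or Snowball; `crude_stem`, the suffix list and the stopword list are all hypothetical and only meant to show the effect on the counts.

```python
import re
from collections import Counter

STOPWORDS = {"the", "it", "she", "a", "to", "of", "and"}  # tiny sample list
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def crude_stem(word):
    """Strip one common suffix and a trailing 'e'. Not the Porter algorithm."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    if word.endswith("e") and len(word) > 3:
        word = word[:-1]
    return word


def frequent_terms(text, stem=False):
    """Count non-stopword tokens, optionally reduced to stems first."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS]
    if stem:
        words = [crude_stem(w) for w in words]
    return Counter(words)


TEXT = "She closed the door, closing it closely. Close the door!"
word_counts = frequent_terms(TEXT)
stem_counts = frequent_terms(TEXT, stem=True)
```

Counting words spreads "closed", "closing", "closely" and "close" over four entries, while counting stems merges them into one frequent term. The trade-off is that stems such as "clos" are less readable if they are shown to users directly.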
  • 50. POS Tags and Document Categorization ▪ Author Gender • Jonathan Doyle and Vlado Keselj, “Automatic Categorization of Author Gender via N-Gram Analysis,” in The 6th Symposium on Natural Language Processing, SNLP'2005, Chiang Rai, Thailand, December 2005. ▪ Opinion Spam • Myle Ott, Yejin Choi, Claire Cardie and Jeffrey T. Hancock, “Finding Deceptive Opinion Spam by Any Stretch of the Imagination,” in The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, June 19-24, 2011.
  • 51. Lithium SMM Quotes ▪ Quotes • Select interesting sentences from social media documents • Classify them as love, hate, comparison, warning, etc. ▪ Quotes depends on • language identification • sentence breaking • POS tagging • parsing • specialized dictionaries
  • 52. Lithium SMM Quotes
  • 53. Lightweight NLP for Social Media Applications: Resources
  • 54. Wikipedia ▪ Corpus linguistics ▪ Cosine similarity ▪ Function word ▪ Language identification ▪ Machine learning ▪ N-gram ▪ Natural language processing ▪ Part-of-speech tagging ▪ Sentence boundary disambiguation ▪ Stemming ▪ Stop words ▪ Text mining ▪ Treebank
  • 55. Software ▪ NLTK • Natural Language Toolkit • Python library for NLP • nltk.org ▪ OpenNLP • machine-learning based NLP tools • Java library for NLP • opennlp.apache.org ▪ Snowball • ANSI C and Java stemmers • snowball.tartarus.org ▪ Tika • Java toolkit for extracting metadata and text from documents • includes language identification • tika.apache.org
  • 56. Books ▪ Natural Language Processing with Python, Steven Bird, Ewan Klein & Edward Loper, O’Reilly, 2009 ▪ Foundations of Statistical Natural Language Processing, Chris Manning & Hinrich Schütze, MIT Press, 1999
  • 57. Organization ▪ Association for Computational Linguistics, http://www.aclweb.org ▪ Remember that’s aclweb.org; acl.org is the Association of Christian Librarians
  • 58. Contact Info ▪ Bruce Smith, @btsmith, bruce.smith@lithium.com ▪ People at SXSW wearing Lithium’s Nation Builder T-shirts