Lightweight NLP
for Social Media Applications


Bruce Smith
Lithium Technologies, Inc.
SXSW 2012
March 13, 2012



@btsmith
#nlp #sxsw
Are You in the Right Session?
What Can You Learn in this Session?

NLP = Natural Language Processing
▪   This session is not about

        Natural Law Party

        Neuro-linguistic Programming

        No Light Perception (total blindness)

        Nonlinear Programming


N-Grams ≠ Engrams, Enneagrams, etc
▪   I will talk about “n-grams” several times
▪   Wikipedia has pages for 3 different kinds of “engram”
    • Neuropsychology
    • Scientology
    • 2009 album by Finnish black metal band Beherit

▪   Wikipedia has pages for 3 different kinds of “enneagram”
    • Nine-sided star polygon
    • Enneagram of Personality
    • Fourth Way Enneagram

Are you…
▪   developing a social media application?


▪   looking for ways to make your application better?


▪   interested in a quick introduction to NLP or text analytics?



Do you want to know…
▪   how you can use NLP tools in your social media app?


▪   if you need a Ph.D. to use NLP tools?


▪   where to find free NLP tools?


▪   where to learn more?


Do you want to understand…
▪   the role of machine learning in NLP?


▪   the difference between training and production?


▪   what a training corpus is and where to find one?




This is a Great Time to Start Using NLP!
▪   Computers are powerful and cheap!
▪   There’s a lot of very good, free software!
▪   There’s an enormous amount of very good, free text data!
▪   Don’t be afraid of non-English content!
    • Unicode is your friend
    • just remember ‘utf-8’




Very Simple NLP with Very Little Math

Document, Corpus, Treebank
▪   document
    • newspaper article, novel, patent, scientific paper
    • blog post, comment, status update, tweet
▪   corpus
    • collection of documents
    • plural is “corpora”
▪   treebank
    • annotated corpus
    • words are annotated with parts of speech
    • sentences are annotated with parse trees

Penn Treebank’s Parts of Speech
CC    Coordinating conjunction                   …      …
CD    Cardinal number                            POS    Possessive ending
DT    Determiner                                 PRP    Personal pronoun
IN    Preposition or subordinating conjunction   PRP$   Possessive pronoun
…     …                                          …      …
JJ    Adjective                                  VB     Verb, base form
JJR   Adjective, comparative                     VBD    Verb, past tense
JJS   Adjective, superlative                     VBG    Verb, gerund or present participle
…     …                                          …      …
NN    Noun, singular or mass                     WP     Wh-pronoun
NNS   Noun, plural                               WP$    Possessive wh-pronoun
NNP   Proper noun, singular                      …      …

Phrase Structure Grammars & Parse Trees
Parse Tree
    S
        NP
            NNP   Bruce
        VP
            VBZ   likes
            NP
                NNS   dogs

Grammar
    S → NP VP
    …
    NP → NN
    NP → JJ NN
    …
    VP → V NP
    …

Phrases (non-terminals)
    S     Sentence
    NP    Noun Phrase
    VP    Verb Phrase
    PP    Prepositional Phrase
    …     …

POS (terminals)
    NNP   Proper noun, singular
    NNS   Noun, plural
    VBZ   Verb, 3rd person singular present
    …     …

N-Grams
▪   contiguous subsequence of n items
    • in order and with no gaps
    • words
    • characters


▪   n-grams have special names when n is small
    • unigram n=1
    • bigram n=2
    • trigram n=3
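
The definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming the text is lowercased and everything except letters is dropped before the n-grams are taken; the function names `normalize` and `char_ngrams` are made up for this example.

```python
def normalize(text):
    """Lowercase the text and keep only letters (an assumed preprocessing step)."""
    return "".join(ch for ch in text.lower() if ch.isalpha())


def char_ngrams(text, n):
    """Return every contiguous run of n characters, in order and with no gaps."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


title = normalize("Lightweight NLP for Social Media Applications")
print(char_ngrams(title, 1)[:5])  # first five unigrams
print(char_ngrams(title, 2)[:5])  # first five bigrams
print(char_ngrams(title, 3)[:5])  # first five trigrams
```

A text of length m yields m - n + 1 n-grams, so longer n-grams produce slightly shorter lists.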


Character N-Grams
▪   Unigrams for this session’s title
      Lightweight NLP for Social Media Applications

l        w      t       o       i        d   p   t
i        e      n       r       a        i   l   i
g        i      l       s       l        a   i   o
h        g      p       o       m        a   c   n
t        h      f       c       e        p   a   s

Character N-Grams
▪    Bigrams for this session’s title
       Lightweight NLP for Social Media Applications

li        we      tn      or      ia      di   pl   ti
ig        ei      nl      rs      al      ia   li   io
gh        ig      lp      so      lm      aa   ic   on
ht        gh      pf      oc      me      ap   ca   ns
tw        ht      fo      ci      ed      pp   at

Character N-Grams
▪     Trigrams for this session’s title
        Lightweight NLP for Social Media Applications

lig        wei     tnl     ors     ial      dia   pli   tio
igh        eig     nlp     rso     alm      iaa   lic   ion
ght        igh     lpf     soc     lme      aap   ica   ons
htw        ght     pfo     oci     med      app   cat
twe        htn     for     cia     edi      ppl   ati

Character N-Gram Frequencies
▪   N-grams are interesting when we look at frequencies
      Lightweight NLP for Social Media Applications

i – 6               gh – 2              ght – 2
a – 4               ht – 2              igh – 2
l – 4               ia – 2              aap – 1
o – 3               ig – 2              alm – 1
p – 3               li – 2              app – 1
…                   …                   …
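
Counting n-grams is a natural fit for Python's `collections.Counter`. The sketch below is one possible way to reproduce counts like those above, assuming the same preprocessing (lowercase, letters only); `ngram_counts` is a name invented for this example.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count character n-grams after lowercasing and dropping non-letters."""
    letters = "".join(ch for ch in text.lower() if ch.isalpha())
    return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))


title = "Lightweight NLP for Social Media Applications"
unigrams = ngram_counts(title, 1)
bigrams = ngram_counts(title, 2)
trigrams = ngram_counts(title, 3)

print(unigrams.most_common(5))
print(bigrams.most_common(5))
print(trigrams.most_common(5))
```

`most_common` orders ties by first appearance, so the order among equally frequent n-grams may differ from an alphabetical listing.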


Word N-Gram Frequencies
▪   Word n-grams from Pride and Prejudice (using NLTK)

to – 4116           to be – 436       i am sure – 72
the – 4105          of the – 430      as soon as – 59
of – 3572           in the – 359      in the world – 57
and – 3491          it was – 280      i do not – 46
her – 2551          of her – 276      could not be – 42
a – 2092            to the – 242      she could not – 39
…                   …                 …

N-Gram Frequencies
▪   Word n-grams from Pride and Prejudice
    with no stopword unigrams

elinor – 685        to be – 436       i am sure – 72
could – 578         of the – 430      as soon as – 59
marianne – 566      in the – 359      in the world – 57
mrs – 530           it was – 280      i do not – 46
would – 515         of her – 276      could not be – 42
said – 397          to the – 242      she could not – 39
…                   …                 …
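
The counts on these slides were produced with NLTK on a whole novel. As a rough sketch of the same idea using only the standard library, the example below counts word n-grams in a single sentence (the opening line of Pride and Prejudice) and then drops stopwords from the unigrams. The tokenizer regex and the tiny `STOPWORDS` set are assumptions made for illustration; real stopword lists are much longer.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the list used in the talk).
STOPWORDS = {"a", "be", "in", "is", "it", "must", "of", "that"}


def words(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_ngrams(tokens, n):
    """Join each run of n consecutive tokens into one n-gram string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sample = ("It is a truth universally acknowledged, that a single man in "
          "possession of a good fortune, must be in want of a wife.")
tokens = words(sample)

unigrams = Counter(word_ngrams(tokens, 1))
bigrams = Counter(word_ngrams(tokens, 2))
content_words = Counter(t for t in tokens if t not in STOPWORDS)

print(unigrams.most_common(3))
print(bigrams.most_common(1))
print(sorted(content_words))
```

Only the unigrams are filtered here, which matches the slide: stopwords are kept inside bigrams and trigrams because phrases like "of a" lose their meaning without them.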
Cosine Similarity
▪   Make a vector from a document’s n-gram frequencies
▪   If A and B are frequency vectors for two documents

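    \[
    \mathrm{similarity}(A, B)
      = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
      = \frac{\sum_{i=1}^{n} A_i B_i}
             {\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}
    \]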
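
As a minimal sketch of the formula, the function below treats each document's n-gram counts as a sparse vector stored in a `Counter`. The function name and the choice to return 0.0 for an empty document are assumptions made for this example.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse frequency vectors."""
    # Only keys present in both vectors contribute to the dot product.
    dot = sum(count * b.get(key, 0) for key, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)


doc_a = Counter("to be or not to be".split())
doc_b = Counter("to be is to do".split())
print(round(cosine_similarity(doc_a, doc_b), 3))
```

Because frequencies are never negative, the result always falls between 0 (no n-grams in common) and 1 (identical proportions).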




Cosine Similarity
▪   Create word N-gram frequency vectors
    • with unigrams, bigrams, trigrams
    • Moby Dick
    • Pride and Prejudice

▪   Compute their cosine similarity
                                  0.534
▪   More interesting with a larger set of documents…


NLP and Machine Learning
▪   In the past, NLP was more about
    grammars and logic and parsing
▪   Today, NLP is more about
    statistics and machine learning
▪   Why?
    • computers are much more powerful
    • there are enormous amounts of very good, free data



NLP and Machine Learning
▪   Think of machine learning as

        programming by analyzing sample data


▪   Example
    • Use the Penn Treebank as sample data
    • Build a program that labels words with parts-of-speech



NLP and Machine Learning
▪   Training
    •   depends on sample data, your training corpus
    •   there are very good, free machine learning tools
    •   sometimes training is slow
    •   experiment with different techniques (perceptron, SVM, etc)
    •   test, test, test…

▪   Production
    • uses models generated during training
    • typically very fast


Lightweight NLP Techniques
▪   Language Identification


▪   Sentence Breaking


▪   Stemming


▪   Part-of-Speech Tagging


Language Identification
You might try looking at
▪   character sets (e.g. Unicode character blocks)
▪   words in language-specific dictionaries
▪   character n-gram frequencies and cosine similarity
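
The third approach can be sketched as follows: build a character trigram profile per language from sample text, then pick the language whose profile has the highest cosine similarity to the input. This is a toy illustration only. The three one-sentence samples in `SAMPLES` are invented for the example and far too small for real use, and the function names are hypothetical; production tools ship models trained on large corpora.

```python
import math
from collections import Counter

# Toy training text per language (assumed samples, far too small for real use).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y luego el perro duerme al sol",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann schläft der hund in der sonne",
}


def profile(text, n=3):
    """Character n-gram counts, keeping single spaces as word boundaries."""
    cleaned = "".join(ch if ch.isalpha() else " " for ch in text.lower())
    cleaned = " ".join(cleaned.split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b.get(key, 0) for key, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


PROFILES = {lang: profile(text) for lang, text in SAMPLES.items()}


def identify(text):
    """Return the language code whose profile is most similar to the text."""
    target = profile(text)
    return max(PROFILES, key=lambda lang: cosine(target, PROFILES[lang]))


print(identify("the dog is in the sun"))
print(identify("el perro duerme"))
```

Short inputs give the classifier very few trigrams to work with, which is one reason language identification is harder on status updates and tweets than on long documents.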




Language Identification
▪   Character n-gram frequencies for English
e       12.6%           th       3.9%            the      3.5%
t       9.1%            he       3.7%            and      1.6%
a       8.0%            in       2.3%            ing      1.1%
o       7.6%            er       2.2%            her      0.8%
i       6.9%            an       2.1%            hat      0.7%
n       6.9%            re       1.7%            his      0.6%
s       6.3%            nd       1.6%            tha      0.6%
h       6.2%            on       1.4%            ere      0.6%
…                       …                        …

From Cryptograms.org, derived from English documents at Project Gutenberg

Language Identification with Tika
▪    tika.apache.org
▪    models for
da       Danish        fr   French       ro   Romanian
de       German        is   Icelandic    ru   Russian
et       Estonian      it   Italian      sv   Swedish
el       Greek         nl   Dutch        th   Thai
en       English       no   Norwegian    uk   Ukrainian
es       Spanish       pl   Polish       …
fi       Finnish       pt   Portuguese

▪    trainable with sample data
Where can you find samples of…
▪   French?
▪   German?
▪   Russian?
▪   Japanese?
▪   Arabic?
▪   Cherokee?

Sentence Breaking
▪   Also known as
    • sentence boundary disambiguation
    • sentence detection

▪   You could just look for punctuation, but…
    • what about abbreviations?
    • what about numbers?
    • what about domain names like lithium.com, etc.?
    • what about names like Yahoo!, etc.?
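
To see why punctuation alone is not enough, here is a deliberately naive splitter that breaks after '.', '!' or '?' whenever whitespace follows. The regular expression and the example text are assumptions made for illustration, not a recommended approach.

```python
import re


def naive_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (deliberately naive)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


text = "Dr. Smith joined Yahoo! last year. He visits lithium.com daily."
for sentence in naive_sentences(text):
    print(repr(sentence))
```

The text contains two sentences, but this rule returns four pieces: it breaks after the abbreviation "Dr." and after the name "Yahoo!". The domain name survives only because no space follows its dot. A trained sentence detector learns these cases from annotated sample data instead of relying on a fixed rule.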




Sentence Breaking with OpenNLP
▪   opennlp.apache.org
▪   models for

    da Danish              nl        Dutch
    de German              pt        Portuguese
    en English             se        Swedish


▪   trainable with new sample data

Stemming
▪   Reducing a word to a stem or base form
▪   Porter Stemmer is a popular stemmer for English
▪   Examples

        lightweight   → lightweight
        natural       → natur
        language      → languag
        processing    → process
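
These examples can be reproduced with the Porter stemmer that ships with NLTK. The snippet is a small sketch that assumes the `nltk` package is installed; the stemmer itself needs no extra data downloads.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stems = {}
for word in ["lightweight", "natural", "language", "processing"]:
    stems[word] = stemmer.stem(word)
    print(word, "->", stems[word])
```

Note that a stem such as "natur" or "languag" is not necessarily a dictionary word. It only needs to be the same for related word forms.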

Stemming
▪   A few examples from Pride and Prejudice (using NLTK)

stem      words reduced to that stem
affect    affect, affectation, affected, affecting, affection, affections, affects
amus      amuse, amused, amusement, amusements, amusing
close     close, closed, closely, closing
grate     grate, grateful, gratefully

Stemming with Snowball
▪   tartarus.org
▪   stemmers for

    de   German    nl    Dutch
    en   English   no    Norwegian
    es   Spanish   pt    Portuguese
    fi   Finnish   ru    Russian
    fr   French    se    Swedish
    it   Italian   …

Part-of-Speech Tagging
▪   “Part of speech” is frequently abbreviated POS
▪   Not every language has the same parts of speech
▪   Even for one language,
    not everyone agrees on the parts of speech
▪   Example: Penn Treebank POS tags for English




Part-of-Speech Tagging
lightweight nlp for social media applications
       lightweight    NN
       nlp            NN
       for            IN
       social         JJ
       media          NNS
       applications   NNS

nlp is easier than you thought
       nlp            NN
       is             VBZ
       easier         JJR
       than           IN
       you            PRP
       thought        VBD


Part-of-Speech Tagging with OpenNLP
▪   opennlp.apache.org
▪   two kinds of models for each of

    de German                pt       Portuguese
    en English               se       Swedish
    nl Dutch
▪   trainable with new sample data


Lightweight NLP in Applications
▪   Language Identification
▪   Sentence Breaking for Summaries
▪   Stemming for Word Counts
▪   POS Tagging for Document Categorization
▪   Lithium SMM Quotes



Lithium SMM (Social Media Monitoring)




Language Identification
▪   Language ID is never perfect,
    especially with social media!

    •   short documents
    •   ambiguity
    •   mixed languages
    •   nonsense
    •   and… lots of very strange stuff




What language is this?
                            ______________$$$$______________
                            ____________$$$$$$$$____________
                            ___________$$$$$$$$$$___________
                            ___________$$$$$$$$$$___________
                            _____________$$$$$$_____________
                            _____________$$$$$$_____________
                            _____________$$$$$$_____________
                            ____$$$$_____$$$$$$_____$$$$____
                            ___$$$$$_____$$$$$$_____$$$$$___
                            _$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$_
                            _$$$$$$$$$$$.СРБИЈА.$$$$$$$$$$$_
                            ___$$$$$$$$$$$$$$$$$$$$$$$$$$___
                            ____$$$$_____$$$$$$_____$$$$____
                            _____________$$$$$$_____________
                            _____________$$$$$$_____________
                            _____________$$$$$$_____________
                            _____________$$$$$$_____________
                            _____________$$$$$$_____________
                            _____________$$$$$$_____________
                            _____________$$$$$$_____________
                            _____________$$$$$$_____________
                            ___________$$$$$$$$$$___________
                            ___________$$$$$$$$$$___________
                            ____________$$$$$$$$____________
                            ______________$$$$______________

What language is this?

ღೋ   ´¯`•.¸,¤°`°¤,¸.•´¯`•.¸,¤Ƹ ̵̡Ӝ          ´¯`•.
                                      ̵̨̄Ʒ´¯`•.ღೋ
                                          ╱▔▌
╔═╗╔═╗╔═╗╔═╗╔═╗░╔═╗╔╗╔═╗╔═╗╔═╗╔╗╔╗─╔═╗─ ╱▔▔▔▔╲ ╱▌
║█║║═╣║═╣║╠╝║═╣░║▌║╚╝║█║║█║║█║║║║║─║═╣ ╱◑▓░▓░░ ▌
║╔╝║═╣╠═║║╠╗║═╣░║▌║──║╦║║╔╝║═╣║║║╚╗║═╣ ╲░░░░░╱╲▌
╚╝─╚═╝╚═╝╚═╝╚═╝░╚═╝──╚╩╝╚╝─╚╩╝╚╝╚═╝╚═╝─ ▔▔╲▌▔




Lithium SMM




Sentence Breaking for Summaries
▪   Summary does not replace the document
▪   Summary lets you decide if the document is interesting
▪   Summaries are sentences selected from the document
    • contain the search terms
    • not too short, not too long, etc
    • truncated only if necessary
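
The selection rules above might be sketched like this. It is an illustration under stated assumptions, not a description of how Lithium SMM works: the function name, the word-count thresholds, and the limit of two sentences are all hypothetical, and the naive regex splitter stands in for a trained sentence detector.

```python
import re


def summarize(document, search_terms, min_words=4, max_words=30, max_sentences=2):
    """Pick sentences that mention a search term and have a reasonable length."""
    terms = {term.lower() for term in search_terms}
    # Naive splitter used only to keep the sketch short.
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    picked = []
    for sentence in sentences:
        words = re.findall(r"\w+", sentence.lower())
        # Skip sentences that are too short or lack every search term.
        if len(words) < min_words or not terms.intersection(words):
            continue
        tokens = sentence.split()
        if len(tokens) > max_words:
            # Truncate only if necessary.
            sentence = " ".join(tokens[:max_words]) + " ..."
        picked.append(sentence)
        if len(picked) == max_sentences:
            break
    return picked


doc = ("I bought a new phone last week. Wow! "
       "The battery on this phone lasts two full days. Shipping was slow.")
for line in summarize(doc, ["phone"]):
    print(line)
```

Matching is done on whole lowercase words here, so a search for "phone" would not match "phones". Matching on stems instead is one way to relax that.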




Lithium SMM




Frequent Words and Stemming
▪   Most common words in the results for your query
    • excludes stopwords

▪   Trending words were previously not common
▪   Click on a frequent word to search within results
▪   Should we count…
    • words?
    • stems?


POS Tagging
▪   We use POS Tagging in Lithium SMM Quotes
    • along with other things
    • not such a “lightweight” application

▪   POS also useful for document categorization
    • POS-based features
    • machine learning




POS Tags and Document Categorization
▪   Author Gender
    Automatic Categorization of Author Gender via N-Gram Analysis,
    Jonathan Doyle and Vlado Keselj. In The 6th Symposium on Natural
    Language Processing, SNLP'2005, Chiang Rai, Thailand, December 2005.


▪   Opinion Spam
    Finding Deceptive Opinion Spam by Any Stretch of the Imagination,
    Myle Ott, Yejin Choi, Claire Cardie and Jeffrey T. Hancock, The 49th Annual
    Meeting of the Association for Computational Linguistics: Human Language
    Technologies, Portland, Oregon, USA, June 19-24, 2011.

Lithium SMM Quotes
▪   Quotes
    • Select interesting sentences from social media documents
    • Classify them as love, hate, comparison, warning, etc.

▪   Quotes depends on
    •   language identification
    •   sentence breaking
    •   POS tagging
    •   parsing
    •   specialized dictionaries



Lithium SMM Quotes




Resources

Wikipedia
▪   Corpus linguistics
▪   Cosine similarity
▪   Function word
▪   Language identification
▪   Machine learning
▪   N-gram
▪   Natural language processing
▪   Part-of-speech tagging
▪   Sentence boundary disambiguation
▪   Stemming
▪   Stop words
▪   Text mining
▪   Treebank

Software
▪   NLTK
    • Natural Language Toolkit
    • Python library for NLP
    • nltk.org
▪   OpenNLP
    • machine-learning based NLP tools
    • Java library for NLP
    • opennlp.apache.org
▪   Snowball
    • ANSI C and Java stemmers
    • snowball.tartarus.org
▪   Tika
    • Java toolkit for extracting metadata and text from documents
    • includes language identification
    • tika.apache.org

Books
▪   Natural Language Processing with Python
    Steven Bird, Ewan Klein & Edward Loper
    O’Reilly, 2009


▪   Foundations of Statistical Natural Language Processing
    Chris Manning & Hinrich Schütze
    MIT Press, 1999


Organization
▪   Association for Computational Linguistics


        http://www.aclweb.org


▪   Remember that’s aclweb.org

    acl.org is the Association of Christian Librarians

Contact Info
▪   Bruce Smith
    @btsmith
    bruce.smith@lithium.com


▪   People at SXSW wearing Lithium’s Nation Builder T-shirts




Gold in Them Hills: Computing ROI for Support Communities
 
Building Customer Networks for Successful Word of Mouth Marketing
Building Customer Networks for Successful Word of Mouth MarketingBuilding Customer Networks for Successful Word of Mouth Marketing
Building Customer Networks for Successful Word of Mouth Marketing
 
Telesperience customer-experience-benchmark-2013 asi8-d3sd
Telesperience customer-experience-benchmark-2013 asi8-d3sdTelesperience customer-experience-benchmark-2013 asi8-d3sd
Telesperience customer-experience-benchmark-2013 asi8-d3sd
 

Recently uploaded

From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 

Recently uploaded (20)

From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 

Lightweight Natural Language Processing (NLP)

  • 1. Lightweight NLP for Social Media Applications Bruce Smith Lithium Technologies, Inc. SXSW 2012 March 13, 2012 @btsmith #nlp #sxsw
  • 2. Lightweight NLP for Social Media Applications: Are You in the Right Session? What Can You Learn in this Session?
  • 3. NLP = Natural Language Processing ▪ This session is not about • Natural Law Party • Neuro-linguistic Programming • No Light Perception (total blindness) • Nonlinear Programming
  • 4. N-Grams ≠ Engrams, Enneagrams, etc. ▪ I will talk about “n-grams” several times ▪ Wikipedia has pages for 3 different kinds of “engram” • Neuropsychology • Scientology • 2009 album by Finnish black metal band Beherit ▪ Wikipedia has pages for 3 different kinds of “enneagram” • Nine-sided star polygon • Enneagram of Personality • Fourth Way Enneagram
  • 5. Are you… ▪ developing a social media application? ▪ looking for ways to make your application better? ▪ interested in a quick introduction to NLP or text analytics?
  • 6. Do you want to know… ▪ how you can use NLP tools in your social media app? ▪ if you need a Ph.D. to use NLP tools? ▪ where to find free NLP tools? ▪ where to learn more?
  • 7. Do you want to understand… ▪ the role of machine learning in NLP? ▪ the difference between training and production? ▪ what a training corpus is and where to find one?
  • 8. This is a Great Time to Start Using NLP! ▪ Computers are powerful and cheap! ▪ There’s a lot of very good, free software! ▪ There’s an enormous amount of very good, free text data! ▪ Don’t be afraid of non-English content! • Unicode is your friend • just remember ‘utf-8’
  • 9. Lightweight NLP for Social Media Applications: Very Simple NLP with Very Little Math
  • 10. Document, Corpus, Treebank ▪ document • newspaper article, novel, patent, scientific paper • blog post, comment, status update, tweet ▪ corpus • collection of documents • plural is “corpora” ▪ treebank • annotated corpus • words are annotated with parts of speech • sentences are annotated with parse trees
  • 11. Penn Treebank’s Parts of Speech • CC Coordinating conjunction • CD Cardinal number • DT Determiner • IN Preposition or subordinating conjunction • … • JJ Adjective • JJR Adjective, comparative • JJS Adjective, superlative • … • NN Noun, singular or mass • NNS Noun, plural • NNP Proper noun, singular • … • POS Possessive ending • PRP Personal pronoun • PRP$ Possessive pronoun • … • VB Verb, base form • VBD Verb, past tense • VBG Verb, gerund or present participle • … • WP Wh-pronoun • WP$ Possessive wh-pronoun • …
  • 12. Phrase Structure Grammars & Parse Trees ▪ Phrases (non-terminals) • S Sentence • NP Noun Phrase • VP Verb Phrase • PP Prepositional Phrase • … ▪ POS (terminals) • NNP Proper noun, singular • NNS Noun, plural • VBZ Verb, 3rd person singular present • … ▪ Grammar • S → NP VP • … • NP → NN • NP → JJ NN • … • VP → V NP • … ▪ Parse Tree • (S (NP (NNP Bruce)) (VP (VBZ likes) (NP (NNS dogs))))
  • 13. N-Grams ▪ contiguous subsequence of n items • in order and with no gaps • words • characters ▪ n-grams have special names when n is small • unigram n=1 • bigram n=2 • trigram n=3
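The definition on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk; the function name `ngrams` is an assumption chosen for the example.

```python
def ngrams(items, n):
    """Return every contiguous run of n items, in order and with no gaps."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Word n-grams: the items are words.
words = "lightweight nlp for social media applications".split()
print(ngrams(words, 2))

# Character n-grams: the items are characters.
print(["".join(gram) for gram in ngrams("nlp", 2)])
```

The same function serves for words and characters because it only relies on slicing, which works on both lists and strings.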
  • 14. Character N-Grams ▪ Unigrams for this session’s title, Lightweight NLP for Social Media Applications • l i g h t w e i g h t n l p f o r s o c i a l m e d i a a p p l i c a t i o n s
  • 15. Character N-Grams ▪ Bigrams for this session’s title, Lightweight NLP for Social Media Applications • li ig gh ht tw we ei ig gh ht tn nl lp pf fo or rs so oc ci ia al lm me ed di ia aa ap pp pl li ic ca at ti io on ns
  • 16. Character N-Grams ▪ Trigrams for this session’s title, Lightweight NLP for Social Media Applications • lig igh ght htw twe wei eig igh ght htn tnl nlp lpf pfo for ors rso soc oci cia ial alm lme med edi dia iaa aap app ppl pli lic ica cat ati tio ion ons
  • 17. Character N-Gram Frequencies ▪ N-grams are interesting when we look at frequencies ▪ Frequencies for Lightweight NLP for Social Media Applications • unigrams: i – 6, a – 4, l – 4, o – 3, p – 3, … • bigrams: gh – 2, ht – 2, ia – 2, ig – 2, li – 2, … • trigrams: ght – 2, igh – 2, aap – 1, alm – 1, app – 1, …
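The frequency counts for the session title can be reproduced with the standard library. This is a sketch under one assumption drawn from the slides: spaces are dropped and letters are lowercased before the n-grams are taken. The function name `char_ngram_counts` is made up for the example.

```python
from collections import Counter


def char_ngram_counts(text, n):
    """Count character n-grams over the letters of text, ignoring case and spaces."""
    letters = [c for c in text.lower() if c.isalpha()]
    return Counter("".join(letters[i:i + n]) for i in range(len(letters) - n + 1))


title = "Lightweight NLP for Social Media Applications"
for n in (1, 2, 3):
    # most_common sorts by count; ties keep first-seen order.
    print(n, char_ngram_counts(title, n).most_common(5))
```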
  • 18. Word N-Gram Frequencies ▪ Word n-grams from Pride and Prejudice (using NLTK) • unigrams: to – 4116, the – 4105, of – 3572, and – 3491, her – 2551, a – 2092, … • bigrams: to be – 436, of the – 430, in the – 359, it was – 280, of her – 276, to the – 242, … • trigrams: i am sure – 72, as soon as – 59, in the world – 57, i do not – 46, could not be – 42, she could not – 39, …
  • 19. N-Gram Frequencies ▪ Word n-grams from Pride and Prejudice with no stopword unigrams • unigrams: elinor – 685, could – 578, marianne – 566, mrs – 530, would – 515, said – 397, … • bigrams: to be – 436, of the – 430, in the – 359, it was – 280, of her – 276, to the – 242, … • trigrams: i am sure – 72, as soon as – 59, in the world – 57, i do not – 46, could not be – 42, she could not – 39, …
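The slides used NLTK on the full novel. As a dependency-free sketch of the same idea, the snippet below counts word n-grams in a single sentence and drops stopwords from the unigrams only. The tokenizer, the tiny `STOPWORDS` set, and the function names are assumptions made for illustration; NLTK ships a much fuller stopword list.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not NLTK's list).
STOPWORDS = {"a", "and", "be", "in", "is", "it", "of", "that", "the", "to"}


def tokenize(text):
    """Lowercase the text and keep runs of letters as words."""
    return re.findall(r"[a-z]+", text.lower())


def word_ngram_counts(tokens, n):
    """Count every contiguous run of n words."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def unigram_counts_without_stopwords(tokens):
    """Count single words, skipping anything in STOPWORDS."""
    return Counter(t for t in tokens if t not in STOPWORDS)


sample = ("It is a truth universally acknowledged, that a single man in "
          "possession of a good fortune, must be in want of a wife.")
tokens = tokenize(sample)
print(word_ngram_counts(tokens, 1).most_common(3))
print(word_ngram_counts(tokens, 2).most_common(3))
print(unigram_counts_without_stopwords(tokens).most_common(3))
```

Filtering only the unigrams mirrors the slide: bigrams and trigrams such as “to be” stay informative even though each word on its own is a stopword.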
  • 20. Cosine Similarity ▪ Make a vector from a document’s n-gram frequencies ▪ If A and B are frequency vectors for two documents, then $\mathrm{similarity}(A, B) = \dfrac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \dfrac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$
  • 21. Cosine Similarity ▪ Create word n-gram frequency vectors • with unigrams, bigrams, trigrams • Moby Dick • Pride and Prejudice ▪ Compute their cosine similarity: 0.534 ▪ More interesting with a larger set of documents…
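The cosine similarity formula translates directly into code when each frequency vector is stored as a mapping from n-gram to count. The sketch below is an illustration rather than the code behind the 0.534 figure; `ngram_vector` and `cosine_similarity` are hypothetical names, and the two short strings stand in for whole novels.

```python
import math
from collections import Counter


def ngram_vector(text, max_n=3):
    """Frequency vector of word unigrams up to max_n-grams for one document."""
    tokens = text.lower().split()
    vector = Counter()
    for n in range(1, max_n + 1):
        vector.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return vector


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(count * b.get(gram, 0) for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


doc_a = ngram_vector("call me ishmael")
doc_b = ngram_vector("you may call me ishmael if you like")
print(round(cosine_similarity(doc_a, doc_b), 3))
```

Only n-grams present in both documents contribute to the dot product, so sparse dictionaries are enough and no fixed vocabulary is needed.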
  • 22. NLP and Machine Learning ▪ In the past, NLP was more about grammars and logic and parsing ▪ Today, NLP is more about statistics and machine learning ▪ Why? • computers are much more powerful • there are enormous amounts of very good, free data
  • 23. NLP and Machine Learning ▪ Think of machine learning as programming by analyzing sample data ▪ Example • Use the Penn Treebank as sample data • Build a program that labels words with parts-of-speech
  • 24. NLP and Machine Learning ▪ Training • depends on sample data, your training corpus • there are very good, free machine learning tools • sometimes training is slow • experiment with different techniques (perceptron, SVM, etc.) • test, test, test… ▪ Production • uses models generated during training • typically very fast
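The split between training and production can be shown with the simplest possible learner: a tagger that remembers the most frequent tag for each word. This is a baseline sketch, not a perceptron or SVM and not the method used by any tool named in this talk. The three hand-tagged sentences are a made-up stand-in for a real training corpus such as the Penn Treebank, and `train` and `tag` are hypothetical names.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Training: analyze sample data and produce a model (word -> most frequent tag)."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, pos in sentence:
            counts[word.lower()][pos] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(model, words, default="NN"):
    """Production: a fast dictionary lookup; unseen words get the default tag."""
    return [(word, model.get(word.lower(), default)) for word in words]


# Hypothetical training corpus, tagged by hand with Penn Treebank tags.
TRAINING = [
    [("Bruce", "NNP"), ("likes", "VBZ"), ("dogs", "NNS")],
    [("dogs", "NNS"), ("like", "VBP"), ("social", "JJ"), ("media", "NNS")],
    [("NLP", "NN"), ("is", "VBZ"), ("easier", "JJR"), ("than", "IN"),
     ("you", "PRP"), ("thought", "VBD")],
]

model = train(TRAINING)
print(tag(model, "Bruce likes social media".split()))
```

Training reads the whole corpus once, while production only consults the finished model, which is why production is typically much faster.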
  • 25. Lightweight NLP for Social Media Applications: Lightweight NLP Techniques
  • 26. Lightweight NLP Techniques ▪ Language Identification ▪ Sentence Breaking ▪ Stemming ▪ Part-of-Speech Tagging
  • 27. Language Identification You might try looking at ▪ character sets (e.g. Unicode character blocks) ▪ words in language-specific dictionaries ▪ character n-gram frequencies and cosine similarity
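The third approach, character n-gram frequencies compared by cosine similarity, can be sketched as follows. Everything here is a toy under stated assumptions: each language profile is built from a single made-up sentence, only trigrams are used, and the names `trigram_profile`, `SAMPLES`, and `identify_language` are invented for the example. A real identifier would be trained on far more text per language.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Character trigram counts over lowercased letters and single spaces."""
    cleaned = " ".join("".join(c if c.isalpha() else " " for c in text.lower()).split())
    padded = f" {cleaned} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity of two count dictionaries."""
    dot = sum(count * b.get(gram, 0) for gram, count in a.items())
    norms = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norms if norms else 0.0


# Hypothetical one-sentence training samples; far too small for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat is in the house with the other animals",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze ist im haus mit den anderen tieren",
}
PROFILES = {lang: trigram_profile(text) for lang, text in SAMPLES.items()}


def identify_language(text):
    """Return the language whose profile is most similar to the text."""
    profile = trigram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))


print(identify_language("the dog is in the house"))
print(identify_language("der hund ist im haus"))
```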
  • 28. Language Identification ▪ Character n-gram frequencies for English • unigrams: e 12.6%, t 9.1%, a 8.0%, o 7.6%, i 6.9%, n 6.9%, s 6.3%, h 6.2%, … • bigrams: th 3.9%, he 3.7%, in 2.3%, er 2.2%, an 2.1%, re 1.7%, nd 1.6%, on 1.4%, … • trigrams: the 3.5%, and 1.6%, ing 1.1%, her 0.8%, hat 0.7%, his 0.6%, tha 0.6%, ere 0.6%, … ▪ From Cryptograms.org, derived from English documents at Project Gutenberg
  • 29. Language Identification with Tika ▪ tika.apache.org ▪ models for • da Danish • de German • et Estonian • el Greek • en English • es Spanish • fi Finnish • fr French • is Icelandic • it Italian • nl Dutch • no Norwegian • pl Polish • pt Portuguese • ro Romanian • ru Russian • sv Swedish • th Thai • uk Ukrainian • … ▪ trainable with sample data
  • 30. Where can you find samples of… ▪ French? ▪ German? ▪ Russian? ▪ Japanese? ▪ Arabic? ▪ Cherokee?
  • 31. Sentence Breaking ▪ Also known as • sentence boundary disambiguation • sentence detection ▪ You could just look for punctuation, but… • what about abbreviations? • what about numbers? • what about domain names like lithium.com, etc? • what about names like Yahoo!, etc?
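A rule-based splitter shows why plain punctuation matching is not enough. The sketch below is a hand-written heuristic, not the trained approach OpenNLP uses: it treats `.`, `!` or `?` as a boundary only when followed by whitespace and a capital letter, and skips a short, made-up list of abbreviations. The names `ABBREVIATIONS` and `split_sentences` are assumptions for the example.

```python
import re

# Illustrative abbreviation list (an assumption; real lists are much longer).
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "inc.", "etc.", "e.g.", "i.e."}


def split_sentences(text):
    """Split on ., ! or ? when followed by whitespace and an uppercase letter."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]+(?=\s+[A-Z])", text):
        end = match.end()
        last_token = text[start:end].split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # "Dr. Smith" is not a sentence boundary
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = ("Dr. Smith works at Lithium Technologies, Inc. in California. "
        "Visit lithium.com for details! It costs 3.5 dollars.")
for sentence in split_sentences(text):
    print(sentence)
```

Numbers and domain names survive because their periods are not followed by whitespace, but a name like Yahoo! followed by a capitalized word would still be split incorrectly, which is the kind of case a trained model handles better.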
  • 32. Sentence Breaking with OpenNLP ▪ opennlp.apache.org ▪ models for • da Danish • de German • en English • nl Dutch • pt Portuguese • se Swedish ▪ trainable with new sample data
  • 33. Stemming ▪ Reducing a word to a stem or base form ▪ Porter Stemmer is a popular stemmer for English ▪ Examples • lightweight → lightweight • natural → natur • language → languag • processing → process
  • 34. Stemming ▪ A few examples from Pride and Prejudice (using NLTK) • affect: affect, affectation, affected, affecting, affection, affections, affects • amus: amuse, amused, amusement, amusements, amusing • close: close, closed, closely, closing • grate: grate, grateful, gratefully
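To make the idea of grouping words by stem concrete without any dependency, here is a deliberately crude suffix stripper. It is not the Porter algorithm and will not reproduce the stems on the slides; the suffix list and the names `crude_stem` and `group_by_stem` are assumptions made for illustration. For real work, the Porter and Snowball stemmers apply carefully ordered rules instead.

```python
from collections import defaultdict

# Suffixes to strip, longest first (a toy list, not Porter's rules).
SUFFIXES = ("ations", "ation", "ments", "ment", "ing", "ed", "ly", "s")


def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


def group_by_stem(words):
    """Map each stem to the sorted list of distinct words that reduce to it."""
    groups = defaultdict(set)
    for word in words:
        groups[crude_stem(word)].add(word.lower())
    return {stem: sorted(members) for stem, members in groups.items()}


print(group_by_stem(["affect", "affected", "affecting", "affects", "closed", "closing"]))
```

Even this toy version shows the trade-off: “closed” and “closing” meet at “clos”, while “close” itself is left alone, so the groups are only as good as the rules.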
  • 35. Stemming with Snowball ▪ tartarus.org ▪ stemmers for • de German • en English • es Spanish • fi Finnish • fr French • it Italian • nl Dutch • no Norwegian • pt Portuguese • ru Russian • se Swedish • …
  • 36. Part-of-Speech Tagging ▪ Part of Speech frequently abbreviated POS ▪ Not every language has the same parts of speech ▪ Even for one language, not everyone agrees on the parts of speech ▪ Example: Penn Treebank POS tags for English
  • 37. Part-of-Speech Tagging ▪ lightweight nlp for social media applications • lightweight NN • nlp NN • for IN • social JJ • media NNS • applications NNS ▪ nlp is easier than you thought • nlp NN • is VBZ • easier JJR • than IN • you PRP • thought VBD
  • 38. Part-of-Speech Tagging with OpenNLP ▪ opennlp.apache.org ▪ two kinds of models for each of • de German • en English • nl Dutch • pt Portuguese • se Swedish ▪ trainable with new sample data
  • 39. Lightweight NLP for Social Media Applications: Lightweight NLP in Applications
  • 40. Lightweight NLP in Applications ▪ Language Identification ▪ Sentence Breaking for Summaries ▪ Stemming for Word Counts ▪ POS Tagging for Document Categorization ▪ Lithium SMM Quotes
  • 41. Lithium SMM (Social Media Monitoring)
  • 42. Language Identification ▪ Language ID is never perfect, especially with social media! • short documents • ambiguity • mixed languages • nonsense • and… lots of very strange stuff
  • 43. What language is this? ______________$$$$______________ ____________$$$$$$$$____________ ___________$$$$$$$$$$___________ ___________$$$$$$$$$$___________ _____________$$$$$$_____________ _____________$$$$$$_____________ _____________$$$$$$_____________ ____$$$$_____$$$$$$_____$$$$____ ___$$$$$_____$$$$$$_____$$$$$___ _$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$_ _$$$$$$$$$$$.СРБИЈА.$$$$$$$$$$$_ ___$$$$$$$$$$$$$$$$$$$$$$$$$$___ ____$$$$_____$$$$$$_____$$$$____ _____________$$$$$$_____________ _____________$$$$$$_____________ _____________$$$$$$_____________ _____________$$$$$$_____________ _____________$$$$$$_____________ _____________$$$$$$_____________ _____________$$$$$$_____________ _____________$$$$$$_____________ ___________$$$$$$$$$$___________ ___________$$$$$$$$$$___________ ____________$$$$$$$$____________ ______________$$$$______________ @btsmith #nlp 43
  • 44. What language is this? ღೋ ´¯`•.¸,¤°`°¤,¸.•´¯`•.¸,¤Ƹ ̵̡Ӝ ´¯`•. ̵̨̄Ʒ´¯`•.ღೋ ╱▔▌ ╔═╗╔═╗╔═╗╔═╗╔═╗░╔═╗╔╗╔═╗╔═╗╔═╗╔╗╔╗─╔═╗─ ╱▔▔▔▔╲ ╱▌ ║█║║═╣║═╣║╠╝║═╣░║▌║╚╝║█║║█║║█║║║║║─║═╣ ╱◑▓░▓░░ ▌ ║╔╝║═╣╠═║║╠╗║═╣░║▌║──║╦║║╔╝║═╣║║║╚╗║═╣ ╲░░░░░╱╲▌ ╚╝─╚═╝╚═╝╚═╝╚═╝░╚═╝──╚╩╝╚╝─╚╩╝╚╝╚═╝╚═╝─ ▔▔╲▌▔ @btsmith #nlp 44
  • 45. Lithium SMM
  • 46. Sentence Breaking for Summaries ▪ Summary does not replace the document ▪ Summary lets you decide if the document is interesting ▪ Summaries are sentences selected from the document • contain the search terms • not too short, not too long, etc. • truncated only if necessary
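The selection rules on this slide can be sketched as a small filter over already-split sentences. This is one possible reading of the rules, not Lithium's implementation: the word-count thresholds, the punctuation set, and the name `summarize` are all assumptions. Sentences of acceptable length that contain a search term are preferred, and truncation happens only when no such sentence exists.

```python
PUNCTUATION = ".,!?;:\"'()"


def summarize(sentences, search_terms, min_words=4, max_words=12, max_sentences=2):
    """Pick sentences that mention a search term and have a reasonable length."""
    terms = {term.lower() for term in search_terms}
    matching = []
    for sentence in sentences:
        words = sentence.split()
        normalized = {word.strip(PUNCTUATION).lower() for word in words}
        if terms & normalized:
            matching.append(words)
    well_sized = [w for w in matching if min_words <= len(w) <= max_words]
    if well_sized:
        return [" ".join(w) for w in well_sized[:max_sentences]]
    # Truncate only if necessary: no matching sentence had an acceptable length.
    too_long = [w for w in matching if len(w) > max_words]
    return [" ".join(w[:max_words]) + " ..." for w in too_long[:max_sentences]]


document = [
    "I love my new phone.",
    "Wow!",
    "The battery on this phone lasts all day.",
    "Shipping was slow.",
]
print(summarize(document, ["phone"]))
```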
  • 47. Lithium SMM
  • 48. Frequent Words and Stemming ▪ Most common words in the results for your query • excludes stopwords ▪ Trending words were previously not common ▪ Click on a frequent word to search within results ▪ Should we count… • words? • stems?
  • 49. POS Tagging ▪ We use POS Tagging in Lithium SMM Quotes • along with other things • not such a “lightweight” application ▪ POS also useful for document categorization • POS-based features • machine learning
  • 50. POS Tags and Document Categorization ▪ Author Gender: Automatic Categorization of Author Gender via N-Gram Analysis, Jonathan Doyle and Vlado Keselj. In The 6th Symposium on Natural Language Processing, SNLP'2005, Chiang Rai, Thailand, December 2005. ▪ Opinion Spam: Finding Deceptive Opinion Spam by Any Stretch of the Imagination, Myle Ott, Yejin Choi, Claire Cardie and Jeffrey T. Hancock, The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, June 19-24, 2011.
  • 51. Lithium SMM Quotes ▪ Quotes • Select interesting sentences from social media documents • Classify them as love, hate, comparison, warning, etc. ▪ Quotes depends on • language identification • sentence breaking • POS tagging • parsing • specialized dictionaries
  • 52. Lithium SMM Quotes
  • 53. Lightweight NLP for Social Media Applications: Resources
  • 54. Wikipedia ▪ Corpus linguistics ▪ Cosine similarity ▪ Function word ▪ Language identification ▪ Machine learning ▪ N-gram ▪ Natural language processing ▪ Part-of-speech tagging ▪ Sentence boundary disambiguation ▪ Stemming ▪ Stop words ▪ Text mining ▪ Treebank
  • 55. Software ▪ NLTK • Natural Language Toolkit • Python library for NLP • nltk.org ▪ OpenNLP • machine-learning based NLP tools • Java library for NLP • opennlp.apache.org ▪ Snowball • ANSI C and Java stemmers • snowball.tartarus.org ▪ Tika • Java toolkit for extracting metadata and text from documents • includes language identification • tika.apache.org
  • 56. Books ▪ Natural Language Processing with Python, Steven Bird, Ewan Klein & Edward Loper, O’Reilly, 2009 ▪ Foundations of Statistical Natural Language Processing, Chris Manning & Hinrich Schütze, MIT Press, 1999
  • 57. Organization ▪ Association for Computational Linguistics http://www.aclweb.org ▪ Remember, that’s aclweb.org; acl.org is the Association of Christian Librarians
  • 58. Contact Info ▪ Bruce Smith @btsmith bruce.smith@lithium.com ▪ People at SXSW wearing Lithium’s Nation Builder T-shirts