LIWC Dictionary Expansion
Luiz Gustavo Ferraz Aoqui
Social Computing Lab – GSCT – KAIST
Motivation
• Dictionary-based classifiers have high precision
  • But usually low recall

• Natural language is very dynamic
  • New words appear
  • Words change their meaning and sentiment
  • Heaps’ law

• Hard to update the dictionary at the same speed
LIWC Dictionary
• Fairly large dictionary
  • Almost 4,500 words and stems
     • 406 positive
     • 499 negative
• Developing and updating it is a long process
  • Almost exclusively done manually
  • Requires a lot of human resources
• Last update was in 2007
  • Twitter was launched in July, 2006
System overview
 19027743 1985381275 NULL NULL <d>2009-06-01 00:00:00</d> <s>web</s>
 <t>I think i'm gonna go with the magic in 6.... just cause now that
 bron bron's out i wanna see kobe lose too.</t> SeanBennettt 98 434 159
 -18000 0 0 <n>Sean Bennett</n> <ud>2009-01-15 16:36:04</ud>
 <t>Eastern Time (US &amp; Canada)</t> <l>Long Island, NY</l>
                            .
                            .
                            .




 Positive:
 .. :) :- ...... live tweet ;) .& -- =) everytime rain tweets (:
 mj xd michael !!!!!! lil ." dog sun jus fan wit =] :] aww
 album via luv photo ;- john pic different kno wearing
 la ).

 Negative:
 !! :( ?? getting twitter omg ?! ppl :/ dude idk da
 weather bout wtf iphone smh wat internet =( heat dnt
 =/ facebook :| gosh kate :[ fml ima jon swear punch
 text =[ cringe ): nd ** imma
System overview
System overview/Parser
 19027743 1985381275 NULL NULL <d>2009-06-01
 00:00:00</d> <s>web</s> <t>I think i'm gonna go
 with the magic in 6.... just cause now that bron bron's
 out i wanna see kobe lose too.</t> SeanBennettt 98
 434 159 -18000 0 0 <n>Sean Bennett</n> <ud>2009-
 01-15 16:36:04</ud> <t>Eastern Time (US &amp;
 Canada)</t> <l>Long Island, NY</l>
                               .
                               .
                               .




  haha nooo! i just wanna kill mee!!!! i didn`t do my
 homework...and i feel sick =(
 I can see the bus again. that makes me happy.
 $$ Black Swan Fund Makes a Big Bet on Inflation
 wonder how Roubini feels about this...?
 blahh, i feel boredd and tiredd as hell haha
 jay to conan... upgrade. lc to kristin... downgrade.
 rushing home for lauren's final episode. my life
 makes me sad.
Parser

 [Diagram: Structured text → Extract tweet (RegEx) → Tweets → Filter → Clean (remove user name, remove URL, remove hash tag, each with a RegEx) → Clean tweets]
Parser
    • Regular Expressions
         • Very powerful tool for text processing…
         • …but very complex
         • Ex.:

      Input:   <d>2009-06-01 00:00:00</d> <s>web</s>
               <t>I just reached level 2. #spymaster http://bit.ly/playspy</t>
               asmith393 1522 1498 207 -18000 0 0 <n>Adam Smith</n>
               <ud>2007-03-07 18:17:20</ud> <t>Eastern Time (US &amp; Canada)</t>
      Pattern: <t>(.*?)</t>
      Output:  I just reached level 2. #spymaster http://bit.ly/playspy
Parser
   • Regular Expressions
        • Very powerful tool for text processing…
        • …but very complex
        • Ex.:

      Input:   I just reached level 2. #spymaster http://bit.ly/playspy
      Pattern: #[0-9a-zA-Z+_]*
      Output:  I just reached level 2. http://bit.ly/playspy
Parser
   • Regular Expressions
        • Very powerful tool for text processing…
        • …but very complex
        • Ex.:

      Input:   I just reached level 2. #spymaster http://bit.ly/playspy
      Pattern: ((http://|www.)([a-zA-Z0-9/.~])*)
      Output:  I just reached level 2. #spymaster
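
The three patterns on the Parser slides can be chained into one cleaning routine. What follows is a minimal Python sketch under stated assumptions, not the author's code: the function names are made up for illustration, and the user-name pattern `@\w+` is an assumption because the slides do not show it.

```python
import re

# Patterns copied from the slides.
TWEET_RE = re.compile(r"<t>(.*?)</t>")
HASHTAG_RE = re.compile(r"#[0-9a-zA-Z+_]*")
URL_RE = re.compile(r"((http://|www.)([a-zA-Z0-9/.~])*)")
# Assumption: the slides do not show the user-name pattern.
USER_RE = re.compile(r"@\w+")


def extract_tweet(record):
    """Return the text of the first <t>...</t> field, or None."""
    match = TWEET_RE.search(record)
    return match.group(1) if match else None


def clean_tweet(text):
    """Remove URLs, hash tags and user names, then tidy whitespace."""
    for pattern in (URL_RE, HASHTAG_RE, USER_RE):
        text = pattern.sub("", text)
    return " ".join(text.split())
```

Only the first `<t>` field is used because, in the sample record, the second `<t>` field holds the time zone rather than the tweet.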
System overview/Master
  haha nooo! i just wanna kill mee!!!! i didn`t do my
 homework...and i feel sick =(
 I can see the bus again. that makes me happy.
 $$ Black Swan Fund Makes a Big Bet on Inflation
 wonder how Roubini feels about this...?
 blahh, i feel boredd and tiredd as hell haha
 jay to conan... upgrade. lc to kristin... downgrade.
 rushing home for lauren's final episode. my life
 makes me sad.




 [Diagram: the Master takes the clean tweets and produces four outputs: Index, Frequency, Chunks and Co-frequency]
Master
[Diagram: Tweets → Splitter → Chunks → Mapper workers (M) → Reducer workers (R) → Co-frequency files and an unsorted Frequency file → Sort → Frequency. Separately, Tweets → Indexer → Index.]
Master/Splitter
• Count the lines in the input file
• Select only tweets that contain words in the LIWC
  dictionary
• Split the input file into smaller chunks
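
A minimal sketch of the splitter steps, assuming the tweets are already in memory and the dictionary is a plain set of lowercase words. LIWC stems and wildcards are ignored here, and `split_tweets` is a hypothetical name.

```python
def split_tweets(tweets, liwc_words, n_chunks):
    """Keep tweets with at least one dictionary word, then chunk them."""
    selected = [
        tweet for tweet in tweets
        if any(word in liwc_words for word in tweet.lower().split())
    ]
    # Ceiling division so that at most n_chunks chunks are produced.
    size = max(1, -(-len(selected) // n_chunks))
    return [selected[i:i + size] for i in range(0, len(selected), size)]
```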
Master/Indexer
• Simply save the vocabulary in a file, sorted
  alphabetically
• Becomes important later
Master/Mapper
• Spawn processes in parallel and divide the
  chunks among them
• Each worker does two jobs:
  • First: create (word, frequency) pairs


  Chunk → Worker → Frequency.tmp

    Frequency.tmp
    someone   6
    down      8
    ever      10
    kinda     2
    crazy     14
    …
Master/Mapper
• Spawn processes in parallel and divide the
  chunks among them
• Each worker does two jobs:
  • First: create (word, frequency) pairs
  • Second: save the co-words for each word
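
The two worker jobs can be sketched as one pass over a chunk. This is an illustration under assumptions, not the author's implementation: words are split on whitespace only, each word is counted once per tweet (following the "Remove duplicates" step), and co-words are kept in memory instead of being written to one file per word.

```python
from collections import Counter, defaultdict


def map_chunk(tweets):
    """Return (word frequencies, co-word counts per word) for one chunk."""
    frequency = Counter()
    co_words = defaultdict(Counter)
    for tweet in tweets:
        words = set(tweet.lower().split())  # duplicates removed per tweet
        frequency.update(words)
        for word in words:
            for other in words:
                if other != word:
                    co_words[word][other] += 1
    return frequency, co_words
```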
Master/Mapper
  Worker steps:
  • Split words
  • Remove duplicates
  • Generate files
  • Save co-words

  Example tweet:
    haha nooo! i just wanna kill mee!!!! i didn`t do my homework...and i feel sick =(
  Tokens after splitting:
    haha, nooo, !, i, just, wanna, kill, mee, !!!!, i, didn`t, do, my, homework, ..., and, i, feel, sick, =(
Master/Mapper/Issues
• Splitting is not trivial
  • Splitting on whitespace
     • homework… ≠ homework
  • Removing punctuation
     • :) would be lost
  • Solution: RegEx again
     • ([\w-'`]*)(\W*)

• File names:
  • Unique, easy to find and respect OS rules
     • Hash
       • This is why the index file is important
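
A sketch of the regex-based splitting, assuming the slide's pattern is applied to each whitespace-separated piece. The hyphen is moved to the end of the character class because Python's `re` module rejects `\w-'` as a range. `tokenize` is a hypothetical name.

```python
import re

# Word characters plus ' ` and -, followed by any run of other symbols.
TOKEN_RE = re.compile(r"([\w'`-]*)(\W*)")


def tokenize(text):
    """Split a tweet into word tokens and symbol tokens."""
    tokens = []
    for piece in text.split():
        for word, symbols in TOKEN_RE.findall(piece):
            if word:
                tokens.append(word)
            if symbols:
                tokens.append(symbols)
    return tokens
```

With this split, `homework...and` yields `homework`, `...` and `and`, and an emoticon such as `=(` survives as its own token.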
Master/Mapper/Issues
• Parallel programming in Python
 • The reference interpreter (CPython) cannot run threads in parallel because of its global interpreter lock (GIL)…
    • Alternatives, such as Jython and IronPython, can
 • …but it is still possible to work in parallel
 • Multi-thread vs. multi-process
 • Multi-process in Python
    • multiprocessing module
    • http://docs.python.org/library/multiprocessing.html#module-multiprocessing.pool
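
A minimal sketch of the multi-process approach with `multiprocessing.Pool`, assuming each chunk is a list of tweet strings. The helper names are hypothetical and the word counting is simplified to whitespace splitting.

```python
from collections import Counter
from multiprocessing import Pool


def count_words(chunk):
    """Job run inside one worker process: count words in one chunk."""
    counts = Counter()
    for tweet in chunk:
        counts.update(tweet.lower().split())
    return counts


def merge_counts(partials):
    """Combine the per-chunk counters into one."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def count_in_parallel(chunks, processes=2):
    """Distribute the chunks over a pool of worker processes."""
    with Pool(processes=processes) as pool:
        partials = pool.map(count_words, chunks)
    return merge_counts(partials)
```

On platforms that start workers with "spawn", `count_in_parallel` should be called from under an `if __name__ == "__main__":` guard.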
Master/Reducer
• Spawn processes in parallel and split the words
  among them
• Basically counts the mapper results
• Also, each worker does two jobs:
  • First: sum all the (word, frequency) pairs and save the totals

  frequency.tmp → Reducer → frequency.txt

  frequency.tmp        frequency.txt
  car     4            car      5
  house   2            house    3
  ball    5            ball     5
  car     1
  house   1
Master/Reducer
• Spawn processes in parallel and split the words
  among them
• Basically counts the mapper results
• Also, each worker does two jobs:
  • First: sum all the (word, frequency) pairs and save the totals
  • Second: sum the co-occurrence frequencies

  Co-word file for "trip" → Worker → summed co-word file for "trip"

  Before               After
  car     1            car     3
  ball    3            ball    3
  car     2            house   1
  house   1
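
Both reducer jobs come down to summing counts that share a key. A minimal sketch, assuming the mapper output has already been read into (word, count) pairs; `reduce_pairs` is a hypothetical name.

```python
from collections import Counter


def reduce_pairs(pairs):
    """Sum the counts of all pairs that share the same word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals
```

The same function covers the word frequencies and the co-word counts of a single word such as "trip".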
Master/Reducer/Issues
• Index file
  • Useful to access the files
     • Each word has a file with a list of co-words
     • But the file name is hashed
       • Non-invertible function
     • Look the word up in the index, hash it and open the resulting file
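
A sketch of the look-up, under assumptions the slides do not confirm: the index is a sorted list of words held in memory, MD5 is the hash, and files end in `.txt`. Both function names are hypothetical.

```python
import bisect
import hashlib


def cowords_filename(word):
    """Hash a word into a file name that is safe on any OS (assumed MD5)."""
    return hashlib.md5(word.encode("utf-8")).hexdigest() + ".txt"


def find_cowords_file(word, index):
    """Binary-search the sorted index; return the file name or None."""
    position = bisect.bisect_left(index, word)
    if position < len(index) and index[position] == word:
        return cowords_filename(word)
    return None
```

Because the hash cannot be inverted, the index is the only place where the original words can be enumerated.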
Master/Sort
• Simply sort the frequencies file
  • Most frequent first
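
A short sketch of the sort step, assuming each line of the frequencies file holds a word and a count separated by whitespace; `sort_frequencies` is a hypothetical name.

```python
def sort_frequencies(lines):
    """Parse 'word count' lines and sort them, most frequent first."""
    pairs = []
    for line in lines:
        word, count = line.rsplit(None, 1)  # split on the last whitespace run
        pairs.append((word, int(count)))
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs
```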
Classifier

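[Diagram: the classifier takes the Frequency and Co-frequency files together with the parameters α, β, γ, δ and Max results, and produces Scores and New words]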
Classifier/Sentiment words


   Frequency list, most frequent first; the top α% is selected
            Car        232
            Ball       143
            Street     125
            House      121
            Boat       114
            Pencil     105
            Pen        98
            Computer   81
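
Selecting the top α% of a ranked list can be sketched as below. Rounding up is an assumption, since the slide does not say how fractions are handled, and `top_percent` is a hypothetical name. The same helper would apply to the top β% of co-words.

```python
import math


def top_percent(ranked, percent):
    """Return the first `percent`% of an already sorted list (rounded up)."""
    keep = math.ceil(len(ranked) * percent / 100)
    return ranked[:keep]
```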
Classifier/Co-words


   Top β% of the co-words of each selected word:
      Car      engine, tire, door
      Ball     court, game, play
      Street   name, size
Classifier/Score

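 Co-word lists                     Scores
  engine, tire, door               engine   1 0
  court, game, play                tire     1 0
  door, size                       door     2 1
  size, room, type, home           size     1 2
  price, size, door

 The scores are consistent with counting, for each co-word, how many of the
 first three lists and how many of the last two lists contain it (presumably
 the lists of positive and negative words, respectively).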
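
The scores on this slide can be reproduced by counting list memberships. This sketch rests on an inference rather than on anything the slide states: the first three co-word lists are treated as belonging to positive words and the last two to negative words. `score_co_words` is a hypothetical name.

```python
from collections import Counter


def score_co_words(positive_lists, negative_lists):
    """Map each co-word to (positive count, negative count)."""
    positive = Counter(w for words in positive_lists for w in set(words))
    negative = Counter(w for words in negative_lists for w in set(words))
    return {
        word: (positive[word], negative[word])
        for word in set(positive) | set(negative)
    }
```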
Classifier/Collapse
• Created to deal with problems like:
  • :) :)) :), :).
  • They should all be treated as the same token
  • Harder for words
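
One possible way to collapse emoticon variants, offered as an assumption rather than the author's rule: repeated non-word characters are reduced to one, and a trailing comma or period is dropped. Word characters are left alone, because collapsing letters would damage ordinary words.

```python
import re

# A non-word character followed by one or more copies of itself.
REPEAT_RE = re.compile(r"(\W)\1+")


def collapse(token):
    """Normalise variants such as ':))' and ':),' to ':)'."""
    collapsed = REPEAT_RE.sub(r"\1", token)
    stripped = collapsed.rstrip(",.")
    # Keep pure punctuation tokens such as '...' from vanishing.
    return stripped or collapsed
```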
Classifier/New words
• Rules to compare the scores
  • So far the rules are
    • If the positive score is bigger than the negative
      score plus delta, tag the word as positive
    • Same idea for negative
• Returns the new words up to a maximum value
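
The rule above can be sketched as follows. Ordering the results by score margin, then alphabetically, is an assumption, since the slides only say that the output is capped. `tag_new_words` is a hypothetical name.

```python
def tag_new_words(scores, delta, max_results):
    """Tag words whose score gap exceeds delta; return at most max_results."""
    tagged = []
    for word, (positive, negative) in scores.items():
        if positive > negative + delta:
            tagged.append((positive - negative, word, "positive"))
        elif negative > positive + delta:
            tagged.append((negative - positive, word, "negative"))
    # Largest margin first; ties broken alphabetically (assumed ordering).
    tagged.sort(key=lambda item: (-item[0], item[1]))
    return [(word, label) for _, word, label in tagged[:max_results]]
```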
Other ideas
• WordNet based
• PMI similarity score
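
For reference, pointwise mutual information compares how often two words co-occur with how often they would co-occur if they were independent: PMI(x, y) = log2(p(x, y) / (p(x) p(y))). A minimal sketch from raw counts; the argument names are illustrative and the slides do not say how PMI would be applied.

```python
import math


def pmi(co_count, count_x, count_y, total):
    """PMI in bits from counts over `total` tweets (co_count must be > 0)."""
    # p(x, y) / (p(x) * p(y)) simplifies to co_count * total / (count_x * count_y).
    return math.log2(co_count * total / (count_x * count_y))
```

A value of 0 means the two words co-occur exactly as often as chance predicts, and positive values mean they co-occur more often.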
Evaluation
• Two evaluation methods:
 • First method
    • Find tweets that could not be categorized before
      but now can be
    • Manually check the precision of the result
 • Second method
    • Manually select positive and negative tweets
    • Compare the precision of the old dictionary with
      the new dictionary
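
Both methods end in a precision figure. A minimal sketch of that calculation, assuming each tweet has a manual label and a predicted label, with `None` for tweets the dictionary could not categorize; the function name and label values are illustrative.

```python
def precision(predicted, gold):
    """Fraction of categorized tweets whose label matches the manual one."""
    labelled = [(p, g) for p, g in zip(predicted, gold) if p is not None]
    if not labelled:
        return 0.0
    correct = sum(1 for p, g in labelled if p == g)
    return correct / len(labelled)
```

For the second method, the function would be called once with the old dictionary's predictions and once with the new dictionary's, on the same manually selected tweets.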
Sub-product
• LIWC Dictionary Library for Python
  • Provides easy access to the dictionary information
     • Easy search
     • Reverse index
     • Match wildcard
  • Ex.: