LIWC Dictionary Expansion
Luiz Gustavo Ferraz Aoqui
Social Computing Lab – GSCT – KAIST
Motivation
• Dictionary-based classifiers have high precision
  • But usually low recall

• Natural language is very dynamic
  • New words appear
  • Words change their meaning and sentiment
  • Heaps' law: the vocabulary keeps growing as more text is collected

• Hard to update the dictionary at the same speed
LIWC Dictionary
• Fairly large dictionary
  • Almost 4,500 words and stems
     • 406 positive
     • 499 negative
• Developing and updating it is a long process
  • Almost exclusively done manually
  • Requires a lot of human resources
• Last update was in 2007
  • Twitter was launched in July 2006
System overview
 19027743 1985381275 NULL NULL <d>2009-06-01 00:00:00</d>
 <s>web</s> <t>I think i'm gonna go with the magic in 6....
 just cause now that bron bron's out i wanna see kobe lose
 too.</t> SeanBennettt 98 434 159 -18000 0 0
 <n>Sean Bennett</n> <ud>2009-01-15 16:36:04</ud>
 <t>Eastern Time (US &amp; Canada)</t> <l>Long Island, NY</l>
                            .
                            .
                            .




 Positive:
 .. :) :- ...... live tweet ;) .& -- =) everytime rain tweets (:
 mj xd michael !!!!!! lil ." dog sun jus fan wit =] :] aww
 album via luv photo ;- john pic different kno wearing
 la ).

 Negative:
 !! :( ?? getting twitter omg ?! ppl :/ dude idk da
 weather bout wtf iphone smh wat internet =( heat dnt
 =/ facebook :| gosh kate :[ fml ima jon swear punch
 text =[ cringe ): nd ** imma
System overview
System overview/Parser
 19027743 1985381275 NULL NULL <d>2009-06-01
 00:00:00</d> <s>web</s> <t>I think i'm gonna go
 with the magic in 6.... just cause now that bron bron's
 out i wanna see kobe lose too.</t> SeanBennettt 98
 434 159 -18000 0 0 <n>Sean Bennett</n>
 <ud>2009-01-15 16:36:04</ud> <t>Eastern Time (US &amp;
 Canada)</t> <l>Long Island, NY</l>
                               .
                               .
                               .




  haha nooo! i just wanna kill mee!!!! i didn`t do my
 homework...and i feel sick =(
 I can see the bus again. that makes me happy.
 $$ Black Swan Fund Makes a Big Bet on Inflation
 wonder how Roubini feels about this...?
 blahh, i feel boredd and tiredd as hell haha
 jay to conan... upgrade. lc to kristin... downgrade.
 rushing home for lauren's final episode. my life
 makes me sad.
Parser

  Structured text → Extract tweet (RegEx) → Tweets → Filter → Clean → Clean tweets

  Clean:
    • Remove user name (RegEx)
    • Remove URL (RegEx)
    • Remove hash tag (RegEx)
Parser
    • Regular Expressions
         • Very powerful tool for text processing…
         • …but very complex
         • Ex.: extract the tweet text

  Input:
    <d>2009-06-01 00:00:00</d> <s>web</s> <t>I just reached level 2.
    #spymaster http://bit.ly/playspy</t> asmith393 1522 1498 207 -18000 0 0
    <n>Adam Smith</n> <ud>2007-03-07 18:17:20</ud> <t>Eastern Time (US
    &amp; Canada)</t>
  Pattern:
    <t>(.*?)</t>
  Output:
    I just reached level 2. #spymaster http://bit.ly/playspy
Parser
   • Regular Expressions
        • Very powerful tool for text processing…
        • …but very complex
        • Ex.: remove the hash tag

  Input:
    I just reached level 2. #spymaster http://bit.ly/playspy
  Pattern:
    #[0-9a-zA-Z+_]*
  Output:
    I just reached level 2. http://bit.ly/playspy
Parser
   • Regular Expressions
        • Very powerful tool for text processing…
        • …but very complex
        • Ex.: remove the URL

  Input:
    I just reached level 2. #spymaster http://bit.ly/playspy
  Pattern:
    ((http://|www.)([a-zA-Z0-9/.~])*)
  Output:
    I just reached level 2. #spymaster
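
The three parser examples can be combined into a small Python sketch. The tweet, hash tag and URL patterns are the ones shown on the slides (with the dot in `www.` escaped); the user-name pattern `@\w+` and the function names `extract_tweet` and `clean_tweet` are assumptions, since the slides do not show them.

```python
import re

# Patterns taken from the slides.
TWEET_RE = re.compile(r"<t>(.*?)</t>")
HASHTAG_RE = re.compile(r"#[0-9a-zA-Z+_]*")
URL_RE = re.compile(r"((http://|www\.)([a-zA-Z0-9/.~])*)")
# Assumption: the slides do not show the user-name pattern.
USER_RE = re.compile(r"@\w+")


def extract_tweet(record):
    """Return the text of the first <t>...</t> field, or None if absent."""
    match = TWEET_RE.search(record)
    return match.group(1) if match else None


def clean_tweet(text):
    """Strip URLs, hash tags and user names, then collapse whitespace."""
    for pattern in (URL_RE, HASHTAG_RE, USER_RE):
        text = pattern.sub("", text)
    return " ".join(text.split())


record = (
    "<d>2009-06-01 00:00:00</d> <s>web</s> "
    "<t>I just reached level 2. #spymaster http://bit.ly/playspy</t> "
    "asmith393 1522 1498 207 -18000 0 0 <n>Adam Smith</n> "
    "<ud>2007-03-07 18:17:20</ud> <t>Eastern Time (US &amp; Canada)</t>"
)
tweet = extract_tweet(record)
print(tweet)
print(clean_tweet(tweet))
```

The record contains two `<t>` fields (tweet text and time zone); the non-greedy pattern with `search` picks up the first one, which is the tweet.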
System overview/Master
  haha nooo! i just wanna kill mee!!!! i didn`t do my
 homework...and i feel sick =(
 I can see the bus again. that makes me happy.
 $$ Black Swan Fund Makes a Big Bet on Inflation
 wonder how Roubini feels about this...?
 blahh, i feel boredd and tiredd as hell haha
 jay to conan... upgrade. lc to kristin... downgrade.
 rushing home for lauren's final episode. my life
 makes me sad.




  Outputs: Index, Frequency, Chunks, Co-frequency
Master
  Tweets → Splitter → Chunks → Mapper (workers M, M, M) → Reducer (workers R, R, R)
  Tweets → Indexer → Index
  Reducer → Co-frequency
  Reducer → Unsorted frequency → Sort → Frequency
Master/Splitter
• Count the lines in the input file
• Select only tweets that contain words in the LIWC
  dictionary
• Split the input file into smaller chunks
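
A minimal sketch of the splitter, working on an in-memory list instead of files. The three-word `dictionary` is a hypothetical stand-in for the LIWC word list (exact words only, no stems), and the function names and `chunk_size` parameter are assumptions.

```python
import re


def has_dictionary_word(tweet, dictionary):
    """True if any lower-cased word of the tweet is in the dictionary."""
    words = re.findall(r"[a-z']+", tweet.lower())
    return any(word in dictionary for word in words)


def split_into_chunks(tweets, chunk_size):
    """Group the tweets into consecutive lists of at most chunk_size items."""
    return [tweets[i:i + chunk_size] for i in range(0, len(tweets), chunk_size)]


# Hypothetical stand-in for the LIWC dictionary.
dictionary = {"happy", "sad", "sick"}
tweets = [
    "I can see the bus again. that makes me happy.",
    "$$ Black Swan Fund Makes a Big Bet on Inflation",
    "my life makes me sad.",
]
selected = [t for t in tweets if has_dictionary_word(t, dictionary)]
chunks = split_into_chunks(selected, chunk_size=1)
print(len(tweets), len(selected), len(chunks))
```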
Master/Indexer
• Simply save the vocabulary to a file, sorted
  alphabetically
• Becomes important later, when looking up the hashed file names
Master/Mapper
• Spawn processes in parallel and divide the
  chunks among them
• Each worker does two jobs:
  • First: create (word, frequency) pairs


  Chunk → Worker → Frequency.tmp

  Frequency.tmp
    someone   6
    down      8
    ever      10
    kinda     2
    crazy     14
    …
Master/Mapper
• Spawn processes in parallel and divide the
  chunks among them
• Each worker does two jobs:
  • First: create (word, frequency) pairs
  • Second: save the co-words for each word
Master/Mapper
  Worker steps:
    • Split words
    • Remove duplicates
    • Generate files
    • Save co-words

  Example tweet:
    haha nooo! i just wanna kill mee!!!! i didn`t do my homework...and i feel sick =(
  Tokens after splitting:
    haha  nooo  !  i  just  wanna  kill  mee  !!!!  i  didn`t  do  my
    homework  ...  and  i  feel  sick  =(
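
The two jobs of a mapper worker can be sketched in memory as follows. This is an illustration under assumptions, not the original code: words are split on whitespace only, frequencies are counted once per tweet (after removing duplicates), and co-words are kept in a dictionary rather than in one file per word. The name `map_chunk` is hypothetical.

```python
from collections import Counter, defaultdict


def map_chunk(chunk):
    """Return (word frequencies, co-word lists) for one chunk of tweets."""
    frequency = Counter()
    co_words = defaultdict(list)
    for tweet in chunk:
        # Simplified whitespace split; duplicates within a tweet are dropped.
        unique = sorted(set(tweet.lower().split()))
        frequency.update(unique)
        for word in unique:
            # Every other word of the same tweet is a co-word.
            co_words[word].extend(w for w in unique if w != word)
    return frequency, dict(co_words)


frequency, co_words = map_chunk(["i feel sick", "i feel happy happy"])
print(frequency)
print(co_words["i"])
```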
Master/Mapper/Issues
• Splitting is not trivial
  • Splitting on whitespace
     • homework… ≠ homework
  • Removing punctuation
     • :) would be lost
  • Solution: RegEx again
     • ([\w-'`]*)(\W*)

• File names:
  • Must be unique, easy to find, and respect OS naming rules
     • Solution: hash the word
       • This is why the index file is important
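
A sketch of the splitting rule in Python. The pattern is a variant of the one on the slide: the hyphen is escaped inside the character class so that Python's `re` module accepts it, which is an assumption about the intended pattern. The helper name `tokenize` is hypothetical.

```python
import re

# Group 1: a run of word characters, apostrophes, backticks and hyphens.
# Group 2: the run of non-word characters that follows (punctuation, spaces).
TOKEN_RE = re.compile(r"([\w'`\-]*)(\W*)")


def tokenize(tweet):
    """Split a tweet into words and punctuation/emoticon tokens."""
    tokens = []
    for word, rest in TOKEN_RE.findall(tweet):
        if word:
            tokens.append(word)
        # The non-word run may hold several tokens separated by spaces.
        tokens.extend(rest.split())
    return tokens


print(tokenize("haha nooo! i just wanna kill mee!!!! i didn`t do my "
               "homework...and i feel sick =("))
```

With this rule `homework...` yields `homework` and `...` as separate tokens, and the emoticon `=(` survives as a token of its own.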
Master/Mapper/Issues
• Parallel programming in Python
 • The original interpreter (CPython) does not run threads in parallel…
    • Alternatives, such as Jython and IronPython, do
 • …but it is still possible to work in parallel
 • Multi-thread vs. multi-process
 • Multi-process in Python
    • multiprocessing module
    • http://docs.python.org/library/multiprocessing.html#module-multiprocessing.pool
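
A minimal sketch of the multi-process approach with `multiprocessing.Pool`. The worker function must be defined at module level so it can be sent to the worker processes. The function names and the `processes` default are assumptions for illustration.

```python
from collections import Counter
from multiprocessing import Pool


def count_words(chunk):
    """Worker job: word frequencies for one chunk (a list of tweets)."""
    counts = Counter()
    for tweet in chunk:
        counts.update(tweet.lower().split())
    return counts


def merge_counts(partials):
    """Add up the per-chunk counters into one counter."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def parallel_word_counts(chunks, processes=2):
    """Run count_words on every chunk in a pool of worker processes."""
    with Pool(processes=processes) as pool:
        return merge_counts(pool.map(count_words, chunks))
```

On platforms that start workers by spawning a fresh interpreter, `parallel_word_counts` should be called from under an `if __name__ == "__main__":` guard in a script.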
Master/Reducer
• Spawn processes in parallel and split the words
  among them
• Basically adds up the mapper results
• Also, each worker does two jobs:
  • First: sum all the (word, frequency) pairs and save the result

  frequency.tmp → Reducer → frequency.txt

  frequency.tmp        frequency.txt
  car     4            car      5
  house   2            house    3
  ball    5            ball     5
  car     1
  house   1
Master/Reducer
• Spawn processes in parallel and split the words
  among them
• Basically adds up the mapper results
• Also, each worker does two jobs:
  • First: sum all the (word, frequency) pairs and save the result
  • Second: sum the co-occurrence frequencies

  Co-words of "trip" → Worker → summed co-words of "trip"

  Before          After
  car     1       car     3
  ball    3       ball    3
  car     2       house   1
  house   1
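
Both reducer jobs come down to summing counts that share a key. A short sketch using the numbers from the two slides above (the function name `reduce_pairs` is hypothetical, and the data is held in lists rather than files):

```python
from collections import Counter


def reduce_pairs(pairs):
    """Sum the counts of all (word, count) pairs that share the same word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals


# First job: the frequency.tmp rows from the slide.
frequency_tmp = [("car", 4), ("house", 2), ("ball", 5), ("car", 1), ("house", 1)]
frequency = reduce_pairs(frequency_tmp)

# Second job: the co-word rows of "trip" from the slide.
trip_co_words = reduce_pairs([("car", 1), ("ball", 3), ("car", 2), ("house", 1)])

print(dict(frequency))
print(dict(trip_co_words))
```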
Master/Reducer/Issues
• Index file
  • Used to access the co-word files
     • Each word has a file with a list of co-words
     • But the file name is hashed
       • The hash is a non-invertible function
     • Look the word up in the index, hash it, and open the file
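
A sketch of the look-up path: find the word in the sorted index, then hash it to get the file name. The slides do not say which hash function was used; MD5 and the `.txt` suffix are assumptions, as are the function names.

```python
import hashlib
from bisect import bisect_left


def file_name_for(word):
    """Hashed, OS-safe file name for a word. Assumption: MD5 hex digest."""
    return hashlib.md5(word.encode("utf-8")).hexdigest() + ".txt"


def lookup(index, word):
    """Return the co-word file name if the word is in the sorted index."""
    pos = bisect_left(index, word)
    if pos < len(index) and index[pos] == word:
        return file_name_for(word)
    return None


# The index is the alphabetically sorted vocabulary.
index = sorted(["car", "ball", ":)", "house"])
print(lookup(index, ":)"))
print(lookup(index, "boat"))
```

Hashing gives tokens such as `:)` a valid file name, but the name cannot be turned back into the word, so the index is the only place where the vocabulary itself is listed.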
Master/Sort
• Simply sort the frequencies file
  • Most frequent first
Classifier

  Inputs: Frequency, Co-frequency
  Parameters: α, β, γ, δ, Max results
  Intermediate result: Scores
  Output: New words
Classifier/Sentiment words


  Frequency (the top α% is selected):
    Car        232
    Ball       143
    Street     125
    House      121
    Boat       114
    Pencil     105
    Pen        98
    Computer   81
Classifier/Co-words


  Top β% of the co-words of each selected word:
    Car      engine   tire   door
    Ball     court    game   play
    Street   name     size
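
Selecting the "top α%" (and likewise the "top β%" of co-words) can be sketched as a cut on a ranked list. This assumes the percentage refers to the number of entries, not to the share of total frequency; the function name and the value 25 are arbitrary choices for illustration.

```python
def top_fraction(counts, percent):
    """Return the most frequent keys, keeping the top `percent` of entries."""
    ranked = sorted(counts, key=counts.get, reverse=True)
    keep = max(1, round(len(ranked) * percent / 100))
    return ranked[:keep]


# Frequencies from the slide.
frequency = {"car": 232, "ball": 143, "street": 125, "house": 121,
             "boat": 114, "pencil": 105, "pen": 98, "computer": 81}
seeds = top_fraction(frequency, 25)  # alpha = 25 is an example value
print(seeds)
```

The same function can be applied to the co-word counts of each selected word with β as the percentage.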
Classifier/Score

  Co-word lists:
    engine   tire   door
    court    game   play
    door     size
    size     room   type   home
    price    size   door

  Scores:
    engine   1 0
    tire     1 0
    door     2 1
    size     1 2
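
One reading that reproduces the numbers on this slide: the first three co-word lists belong to positive words, the last two to negative words, and each co-word is scored by the number of positive and negative lists it appears in. That grouping is inferred from the counts, not stated on the slide, so the sketch below is an assumption; `score_co_words` is a hypothetical name.

```python
def score_co_words(positive_lists, negative_lists):
    """Map each co-word to [count in positive lists, count in negative lists]."""
    scores = {}
    for column, lists in enumerate((positive_lists, negative_lists)):
        for co_words in lists:
            # A word counts once per list, even if it is repeated inside it.
            for word in set(co_words):
                scores.setdefault(word, [0, 0])[column] += 1
    return scores


# Assumed grouping of the five lists from the slide.
positive_lists = [["engine", "tire", "door"],
                  ["court", "game", "play"],
                  ["door", "size"]]
negative_lists = [["size", "room", "type", "home"],
                  ["price", "size", "door"]]

scores = score_co_words(positive_lists, negative_lists)
for word in ("engine", "tire", "door", "size"):
    print(word, scores[word])
```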
Classifier/Collapse
• Created to deal with problems like:
  • :) :)) :), :).
  • They should all be treated as the same token
  • Harder for words
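
The slides do not show how the collapse step works. One possible rule, given here purely as an assumption: for tokens without letters or digits, squeeze repeated characters and drop a trailing comma or period. Words are left untouched, in line with "harder for words".

```python
import re


def collapse(token):
    """Normalize emoticon/punctuation variants to a single token."""
    if any(ch.isalnum() for ch in token):
        return token  # words are left alone
    # Squeeze runs of the same character: ":))" becomes ":)".
    collapsed = re.sub(r"(.)\1+", r"\1", token)
    # Drop sentence punctuation glued to the end: ":)," becomes ":)".
    stripped = collapsed.rstrip(".,")
    return stripped or collapsed


print([collapse(t) for t in [":)", ":))", ":),", ":)."]])
```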
Classifier/New words
• Rules to compare the scores
  • So far the rules are
    • If the positive score is greater than the negative
      score plus δ (delta), tag the word as positive
    • Same idea for negative
• Returns the new words, up to the maximum number of results
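
The rule above can be sketched as follows. The scores are assumed to be (positive, negative) pairs, and ranking the candidates by score margin before truncating is an assumption; the slides only say that results are capped at a maximum. The function name is hypothetical.

```python
def classify_new_words(scores, delta, max_results):
    """Tag words by comparing their scores; return at most max_results."""
    tagged = []
    for word, (positive, negative) in scores.items():
        if positive > negative + delta:
            tagged.append((word, "positive", positive - negative))
        elif negative > positive + delta:
            tagged.append((word, "negative", negative - positive))
    # Assumption: largest margin first (ties keep their input order).
    tagged.sort(key=lambda item: item[2], reverse=True)
    return [(word, label) for word, label, _ in tagged[:max_results]]


# Scores from the Classifier/Score slide, read as (positive, negative).
scores = {"engine": (1, 0), "tire": (1, 0), "door": (2, 1), "size": (1, 2)}
print(classify_new_words(scores, delta=0, max_results=10))
print(classify_new_words(scores, delta=1, max_results=10))
```

A larger delta demands a wider gap between the two scores, so fewer but more clear-cut words are returned.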
Other ideas
• WordNet based
• PMI similarity score
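
For reference, pointwise mutual information compares how often two words occur together with how often they would if they were independent: PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ). A small sketch that estimates it from counts; the argument names are assumptions, and the deck does not describe how PMI would be plugged into the system.

```python
import math


def pmi(co_count, count_x, count_y, total):
    """PMI(x, y) in bits, with probabilities estimated as count / total."""
    # (co/total) / ((x/total) * (y/total)) simplifies to co * total / (x * y).
    return math.log2(co_count * total / (count_x * count_y))


# Independent words give 0; words that co-occur more than chance give > 0.
print(pmi(co_count=10, count_x=20, count_y=50, total=100))
print(pmi(co_count=20, count_x=20, count_y=50, total=100))
```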
Evaluation
• Two evaluation methods:
 • First method
    • Find tweets that could not be categorized before
      but can be now
    • Manually check the precision of the result
 • Second method
    • Manually select positive and negative tweets
    • Compare the precision of the old dictionary with
      the new dictionary
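
The second method amounts to computing precision twice over the same hand-labelled tweets. A sketch with made-up labels (the three lists below are illustrative only, not results from the project); `None` marks a tweet the dictionary could not categorize.

```python
def precision(predicted, gold):
    """Fraction of categorized tweets whose label matches the manual label."""
    judged = [(p, g) for p, g in zip(predicted, gold) if p is not None]
    if not judged:
        return 0.0
    return sum(p == g for p, g in judged) / len(judged)


# Made-up example data: manual labels and the output of each dictionary.
gold = ["positive", "negative", "negative", "negative"]
old_dictionary = ["positive", None, None, "negative"]
new_dictionary = ["positive", "negative", "positive", "negative"]

print(precision(old_dictionary, gold))
print(precision(new_dictionary, gold))
```

In this toy case the expanded dictionary categorizes more tweets at the cost of one wrong label, which is the trade-off the evaluation is meant to measure.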
Sub-product
• LIWC Dictionary Library for Python
  • Provides easy access to the dictionary information
     • Easy search
     • Reverse index
     • Match wildcard
  • Ex.: