LIWC Dictionary Expansion

Luiz Gustavo Ferraz Aoqui
Social Computing Lab – GSCT – KAIST

Motivation
• Dictionary-based classifiers have high precision
• But usually low recall

• Natural language is very dynamic
• New words appear
• Words change their meaning and sentiment
• Heaps' law (the vocabulary keeps growing with the size of the corpus)

• Hard to update the dictionary at the same speed

LIWC Dictionary
• Fairly large dictionary
• Almost 4,500 words and stems
• 406 positive
• 499 negative
• Development and updating are a long process
• Almost exclusively done manually
• Requires a lot of human resources
• Last update was in 2007
• Twitter was launched in July, 2006

System overview
19027743 1985381275 NULL NULL <d>2009-06-01 00:00:00</d> <s>web</s>
<t>I think i'm gonna go with the magic in 6.... just cause now that bron bron's
out i wanna see kobe lose too.</t> SeanBennettt 98 434 159 -18000 0 0
<n>Sean Bennett</n> <ud>2009-01-15 16:36:04</ud>
<t>Eastern Time (US & Canada)</t> <l>Long Island, NY</l>
.
.
.

Positive:
.. :) :- ...... live tweet ;) .& -- =) everytime rain tweets (:
mj xd michael !!!!!! lil ." dog sun jus fan wit =] :] aww
album via luv photo ;- john pic different kno wearing
la ).

Negative:
!! :( ?? getting twitter omg ?! ppl :/ dude idk da
weather bout wtf iphone smh wat internet =( heat dnt
=/ facebook :| gosh kate :[ fml ima jon swear punch
text =[ cringe ): nd ** imma

System overview/Parser
19027743 1985381275 NULL NULL <d>2009-06-01
00:00:00</d> <s>web</s> <t>I think i'm gonna go
with the magic in 6.... just cause now that bron bron's
out i wanna see kobe lose too.</t> SeanBennettt 98
434 159 -18000 0 0 <n>Sean Bennett</n> <ud>2009-01-15
16:36:04</ud> <t>Eastern Time (US &
Canada)</t> <l>Long Island, NY</l>
.
.
.

haha nooo! i just wanna kill mee!!!! i didn`t do my
homework...and i feel sick =(
I can see the bus again. that makes me happy.
$$ Black Swan Fund Makes a Big Bet on Inflation
wonder how Roubini feels about this...?
blahh, i feel boredd and tiredd as hell haha
jay to conan... upgrade. lc to kristin... downgrade.
rushing home for lauren's final episode. my life
makes me sad.

Parser

Structured Tweets → Extract tweet text (RegEx) → Filter → Clean → Clean Tweets

Clean:
• Remove user name (RegEx)
• Remove URL (RegEx)
• Remove hash tag (RegEx)

Parser
• Regular Expressions
• Very powerful tool for text processing…
• …but very complex
• Ex.:

Input record:
<d>2009-06-01 00:00:00</d> <s>web</s> <t>I just reached level 2. #spymaster
http://bit.ly/playspy</t> asmith393 1522 1498 207 -18000 0 0 <n>Adam Smith</n>
<ud>2007-03-07 18:17:20</ud> <t>Eastern Time (US & Canada)</t>

RegEx: <t>(.*?)</t>

Extracted text:
I just reached level 2. #spymaster http://bit.ly/playspy
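The extraction step can be sketched in Python as follows. This is a minimal illustration, not the original parser: the function name `extract_text` is an assumption. Note that a record holds two <t> elements (tweet text and time zone), so the sketch takes only the first match.

```python
import re

# Non-greedy pattern from the slide: stops at the first closing </t>.
TEXT_RE = re.compile(r"<t>(.*?)</t>", re.DOTALL)

def extract_text(record):
    """Return the tweet text of one raw record, or None if no <t> tag is found.

    Hypothetical helper: only the first <t>...</t> is used, because the
    second one in a record holds the user's time zone.
    """
    match = TEXT_RE.search(record)
    return match.group(1) if match else None

record = (
    "<d>2009-06-01 00:00:00</d> <s>web</s> "
    "<t>I just reached level 2. #spymaster http://bit.ly/playspy</t> "
    "asmith393 1522 1498 207 -18000 0 0 <n>Adam Smith</n> "
    "<ud>2007-03-07 18:17:20</ud> <t>Eastern Time (US & Canada)</t>"
)
print(extract_text(record))
```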

Parser
• Ex.:

Input: I just reached level 2. #spymaster http://bit.ly/playspy
RegEx: #[0-9a-zA-Z+_]*
Output: I just reached level 2. http://bit.ly/playspy

Parser
• Ex.:

Input: I just reached level 2. #spymaster http://bit.ly/playspy
RegEx: ((http://|www.)([a-zA-Z0-9/.~])*)
Output: I just reached level 2. #spymaster
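The cleaning step could be chained roughly like this. The hashtag and URL patterns are the ones shown on the slides. The user-name pattern and the function name `clean` are assumptions, since the slides do not show them. The URL pattern is kept as written, so it does not cover https links or query strings.

```python
import re

HASHTAG_RE = re.compile(r"#[0-9a-zA-Z+_]*")                # from the slides
URL_RE = re.compile(r"((http://|www.)([a-zA-Z0-9/.~])*)")  # from the slides
USER_RE = re.compile(r"@[0-9a-zA-Z_]+")                    # assumed, not in the slides

def clean(text):
    """Remove URLs, user names and hashtags, then normalise whitespace."""
    for pattern in (URL_RE, USER_RE, HASHTAG_RE):
        text = pattern.sub("", text)
    return " ".join(text.split())

print(clean("I just reached level 2. #spymaster http://bit.ly/playspy"))
```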

System overview/Master
haha nooo! i just wanna kill mee!!!! i didn`t do my
homework...and i feel sick =(
I can see the bus again. that makes me happy.
$$ Black Swan Fund Makes a Big Bet on Inflation
wonder how Roubini feels about this...?
blahh, i feel boredd and tiredd as hell haha
jay to conan... upgrade. lc to kristin... downgrade.
rushing home for lauren's final episode. my life
makes me sad.

Outputs: Index, Chunks, Frequency, Co-frequency

Master
Tweets → Splitter → Chunks
Tweets → Indexer → Index
Chunks → Mapper (workers M, M, M) → Reducer (workers R, R, R) → Unsorted Frequency, Co-frequency
Unsorted Frequency → Sort → Frequency

Master/Splitter
• Count the lines in the input file
• Select only tweets that contain words in the LIWC dictionary
• Split the input file into smaller chunks
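A minimal in-memory sketch of the splitter, under simplifying assumptions: tweets are already a list of cleaned strings, dictionary matching is an exact match on whitespace-separated words (stems are ignored), and the name `split_tweets` is hypothetical.

```python
def split_tweets(tweets, liwc_words, chunk_size):
    """Keep tweets with at least one dictionary word and group them into chunks.

    liwc_words must be a set of lower-case words (assumed: exact matches only).
    """
    kept = [t for t in tweets if liwc_words & set(t.lower().split())]
    return [kept[i:i + chunk_size] for i in range(0, len(kept), chunk_size)]

tweets = [
    "that makes me happy",
    "wonder how Roubini feels",
    "my life makes me sad",
    "i feel sick",
]
liwc_words = {"happy", "sad", "sick"}   # illustrative subset, not the real dictionary
chunks = split_tweets(tweets, liwc_words, chunk_size=2)
print(chunks)
```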

Master/Indexer
• Simply save the vocabulary to a file, sorted alphabetically
• Needed later to locate the co-word file of each word

Master/Mapper
• Spawn processes in parallel and divide the
chunks among them
• Each worker does two jobs:
• First: create (word, frequency) pairs

Chunk → Worker → Frequency.tmp:
someone 6
down 8
ever 10
kinda 2
crazy 14
…

Master/Mapper
• Second job of each worker: save the co-words for each word

Master/Mapper
Save co-words (worker): Split Words → Remove Duplicates → Generate files

• Ex.:
Tweet: haha nooo! i just wanna kill mee!!!! i didn`t do my homework...and i feel sick =(
Words: haha nooo ! i just wanna kill mee !!!! didn`t do my homework ... and feel sick =(
Co-words saved for "haha": the other words of the tweet
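The co-word step can be illustrated with a small in-memory sketch. It assumes whitespace splitting and a nested dictionary instead of one file per word, and the name `co_words` is hypothetical.

```python
from collections import defaultdict

def co_words(tweets):
    """For each word, count the other distinct words that share a tweet with it."""
    table = defaultdict(lambda: defaultdict(int))
    for tweet in tweets:
        unique = set(tweet.lower().split())     # remove duplicates within a tweet
        for word in unique:
            for other in unique - {word}:
                table[word][other] += 1
    return table

table = co_words(["i feel sick", "i feel happy"])
print(dict(table["feel"]))
```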

Master/Mapper/Issues
• Splitting is not trivial
• Splitting on whitespace
• homework… ≠ homework
• Remove punctuation
• :) is lost
• Solution: RegEx again
• ([\w\-'`]*)(\W*)

• File names:
• Must be unique, easy to find, and respect OS rules
• Solution: hash the word
• This is why the index file is important
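A sketch of regex-based splitting that keeps punctuation runs and emoticons as separate tokens. The pattern is an assumption: a Python rendering of the idea on the slide (word characters plus `-`, `'` and the backtick, followed by a run of non-word characters). The function name `tokenize` is hypothetical.

```python
import re

# Group 1: a word (letters, digits, _, -, ', `); group 2: the non-word run after it.
TOKEN_RE = re.compile(r"([\w\-'`]*)(\W*)")

def tokenize(text):
    """Split text into words and punctuation/emoticon tokens."""
    tokens = []
    for word, other in TOKEN_RE.findall(text):
        if word:
            tokens.append(word)
        tokens.extend(other.split())   # non-word runs, separated on whitespace
    return tokens

print(tokenize("homework...and i feel sick =("))
```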

Master/Mapper/Issues
• Parallel programming in Python
• The original interpreter (CPython) does not run Python threads in parallel (global interpreter lock)…
• Alternatives, such as Jython and IronPython, do
• …but it is still possible to work in parallel
• Multi-thread vs. multi-process
• Multi-process in Python
• multiprocessing module
• http://docs.python.org/library/multiprocessing.html#module-multiprocessing.pool
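A minimal sketch of how chunks might be distributed with `multiprocessing.Pool`. The function names `count_words` and `map_chunks` and the default of two processes are assumptions, not the original code.

```python
from collections import Counter
from multiprocessing import Pool

def count_words(chunk):
    """Work done by one worker process: count the words of one chunk of tweets."""
    return Counter(word for tweet in chunk for word in tweet.lower().split())

def map_chunks(chunks, processes=2):
    """Distribute the chunks over a pool of worker processes.

    Returns one Counter per chunk, in the same order as `chunks`.
    """
    with Pool(processes=processes) as pool:
        return pool.map(count_words, chunks)
```

Call `map_chunks(chunks)` from under an `if __name__ == "__main__":` guard. The multiprocessing documentation requires this on platforms that start workers by spawning a new interpreter.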

Master/Reducer
• Spawn processes in parallel and split the words
among them
• Basically counts the mapper results
• Also, each worker does two jobs:
• First: sums all the (word, frequency) pairs and saves the result

frequency.tmp → Reducer → frequency.txt

frequency.tmp:
car 4
house 2
ball 5
car 1
house 1

frequency.txt:
car 5
house 3
ball 5
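The summing step can be sketched as follows, using the pairs from the example. Reading and writing the files is left out, and the name `reduce_frequencies` is hypothetical.

```python
from collections import Counter

def reduce_frequencies(pairs):
    """Sum the frequencies of all (word, frequency) pairs that share a word."""
    totals = Counter()
    for word, freq in pairs:
        totals[word] += freq
    return totals

pairs = [("car", 4), ("house", 2), ("ball", 5), ("car", 1), ("house", 1)]
totals = reduce_frequencies(pairs)
for word, freq in totals.items():
    print(word, freq)
```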

Master/Reducer
• Second job of each worker: sums the co-occurrence frequency

Co-word file for "trip": before → Worker → after

Before:
car 1
ball 3
car 2
house 1

After:
car 3
ball 3
house 1

Master/Reducer/Issues
• Index file
• Useful to access the files
• Each word has a file with a list of co-words
• But file name is hashed
• Non-invertible function
• Look-up on index, hash the word and get the file
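A sketch of the lookup. The slides do not name the hash function, so SHA-256 is an assumption here, as are the directory name and the function names `cowords_path` and `lookup`.

```python
import hashlib
import os

def cowords_path(word, directory="cowords"):
    """File path for a word's co-word list.

    The hex digest contains only 0-9 and a-f, so tokens such as "=(" or "?!"
    still produce file names that every OS accepts.
    """
    digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
    return os.path.join(directory, digest + ".txt")

def lookup(word, index, directory="cowords"):
    """Return the co-word file path if the word is in the index, else None."""
    return cowords_path(word, directory) if word in index else None

index = ["=(", "haha", "homework"]   # vocabulary as stored in the index file
print(lookup("haha", index))
```

Because the hash cannot be inverted, the index is the only place where the original words can be listed. The file of any listed word is found by hashing it again.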

Master/Sort
• Simply sort the frequencies file
• Most frequent first

Classifier

Inputs: Frequency, Co-frequency
Parameters: α, β, γ, δ, Max results
Intermediate result: Scores
Output: New words

Classifier/Sentiment words

Frequency (most frequent first) → keep the top α%

Car 232
Ball 143
Street 125
House 121
Boat 114
Pencil 105
Pen 98
Computer 81

Classifier/Co-words

For each selected word, keep the top β% of its co-words:

Car: engine, tire, door
Ball: court, game, play
Street: name, size
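Both selections (top α% of the words, top β% of each word's co-words) can use the same helper. This is a sketch: the name `top_percent`, rounding up, and keeping at least one item are assumptions, and the co-word counts below are invented for illustration.

```python
import math

def top_percent(counts, percent):
    """Return the most frequent keys, keeping the top `percent` % of them."""
    ranked = sorted(counts, key=counts.get, reverse=True)
    keep = max(1, math.ceil(len(ranked) * percent / 100))   # assumed rounding rule
    return ranked[:keep]

frequency = {"Car": 232, "Ball": 143, "Street": 125, "House": 121,
             "Boat": 114, "Pencil": 105, "Pen": 98, "Computer": 81}
print(top_percent(frequency, 25))            # alpha = 25

car_cowords = {"engine": 5, "tire": 3, "door": 2, "seat": 1}   # invented counts
print(top_percent(car_cowords, 50))          # beta = 50
```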

Classifier/Score

Co-word lists:
engine tire door
court game play
door size
size room type home
price size door

Scores (two counts per co-word):
engine 1 0
tire 1 0
door 2 1
size 1 2
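One way to read the score table: each co-word gets one count for the number of positive words it appears with and one for the number of negative words. The sketch below assumes that reading. The grouping of the lists into positive and negative is also an assumption, chosen because it reproduces the counts shown above. The name `score_cowords` is hypothetical.

```python
from collections import Counter

def score_cowords(positive_lists, negative_lists):
    """Return {co-word: (positive count, negative count)}."""
    pos, neg = Counter(), Counter()
    for words in positive_lists:
        pos.update(words)
    for words in negative_lists:
        neg.update(words)
    return {word: (pos[word], neg[word]) for word in set(pos) | set(neg)}

positive_lists = [["engine", "tire", "door"], ["door", "size"]]                  # assumed grouping
negative_lists = [["size", "room", "type", "home"], ["price", "size", "door"]]   # assumed grouping
scores = score_cowords(positive_lists, negative_lists)
for word in ("engine", "tire", "door", "size"):
    print(word, *scores[word])
```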

Classifier/Collapse
• Created to deal with problems like:
• :) :)) :), :).
• They should all be treated as the same token
• Harder for words
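The slides do not give the collapse rule, so the following is only one possible sketch. It assumes that tokens without letters or digits are emoticons or punctuation, that trailing commas and periods are noise, and that repeated characters can be squeezed. The name `collapse` is hypothetical.

```python
import re

def collapse(token):
    """Map emoticon variants such as ':))' or ':),' to one canonical token."""
    if re.search(r"\w", token):
        return token                          # words are left untouched
    stripped = token.rstrip(",.") or token    # ':),' -> ':)'; '...' is kept for the next step
    return re.sub(r"(.)\1+", r"\1", stripped) # squeeze repeated characters: ':))' -> ':)'

for variant in (":)", ":))", ":),", ":)."):
    print(variant, "->", collapse(variant))
```

A side effect of this rule is that runs such as `!!!!` collapse to `!`, which may or may not be wanted.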

Classifier/New words
• Rules to compare the scores
• So far the rules are:
• If the positive score is greater than the negative score plus δ, tag the word as positive
• Same idea for negative
• Returns the new words, up to a maximum number of results
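The rules above can be sketched as follows. The function name and the ordering of results by score margin are assumptions, and the scores in the example are invented.

```python
def classify_new_words(scores, delta, max_results):
    """Tag words from {word: (positive, negative)} scores using the delta rule.

    Returns at most max_results (word, label) pairs, largest margin first
    (the ordering is an assumption; the slides only state the maximum).
    """
    tagged = []
    for word, (pos, neg) in scores.items():
        if pos > neg + delta:
            tagged.append((abs(pos - neg), word, "positive"))
        elif neg > pos + delta:
            tagged.append((abs(pos - neg), word, "negative"))
    tagged.sort(key=lambda item: item[0], reverse=True)
    return [(word, label) for _, word, label in tagged[:max_results]]

scores = {"luv": (9, 1), "fml": (0, 7), "via": (3, 3)}   # invented scores
print(classify_new_words(scores, delta=2, max_results=10))
```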

Other ideas
• WordNet based
• PMI similarity score
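For reference, pointwise mutual information (PMI) compares how often two words co-occur with how often they would co-occur if they were independent. A minimal sketch from raw counts follows. The function name, its arguments, and the use of base-2 logarithms are assumptions.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), estimated from counts.

    count_xy: tweets containing both words; count_x, count_y: tweets
    containing each word; total: number of tweets.
    """
    if count_xy == 0:
        return float("-inf")
    return math.log2((count_xy * total) / (count_x * count_y))

print(pmi(10, 20, 50, 1000))   # the words co-occur more often than chance
```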

Evaluation
• Two evaluation methods:
• First method
• Find tweets that could not be categorized before but can be now
• Manually check the precision of the result
• Second method
• Manually select positive and negative tweets
• Compare the precision of the old dictionary with
the new dictionary

Sub-product
• LIWC Dictionary Library for Python
• Provides easy access to the dictionary information
• Easy search
• Reverse index
• Match wildcard
• Ex.:
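A hypothetical sketch of what such a library could offer. The class name, method names and dictionary entries below are assumptions for illustration, not the actual library's API or the real LIWC entries. It assumes a trailing `*` marks a stem that matches any word starting with it.

```python
class LiwcDictionary:
    """Toy dictionary with search, reverse index and wildcard matching."""

    def __init__(self, entries):
        # entries: {pattern: [categories]}; a pattern may end in '*'
        self.entries = dict(entries)
        self.reverse = {}                      # category -> list of patterns
        for pattern, categories in self.entries.items():
            for category in categories:
                self.reverse.setdefault(category, []).append(pattern)

    @staticmethod
    def _matches(pattern, word):
        if pattern.endswith("*"):
            return word.startswith(pattern[:-1])
        return word == pattern

    def categories(self, word):
        """Categories of every entry (exact or stem) that matches the word."""
        found = []
        for pattern, categories in self.entries.items():
            if self._matches(pattern, word):
                for category in categories:
                    if category not in found:
                        found.append(category)
        return found

    def patterns(self, category):
        """Reverse index: all entries listed under a category."""
        return self.reverse.get(category, [])

liwc = LiwcDictionary({"happ*": ["posemo"], "sad*": ["negemo"], "love": ["posemo"]})
print(liwc.categories("happiness"), liwc.patterns("posemo"))
```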
