This presentation explains the research I did while working at the Social Computing Lab at KAIST.
The main goal was to expand the LIWC vocabulary and adapt it for Twitter sentiment analysis.
Download it to see the animations :)
2. Motivation
• Dictionary-based classifiers have high precision
• But usually low recall
• Natural language is very dynamic
• New words appear
• Words change their meaning and sentiment
• Heaps' Law
• Hard to update the dictionary at the same speed
3. LIWC Dictionary
• Fairly large dictionary
• Almost 4,500 words and stems
• 406 positive
• 499 negative
• Developing and updating it is a long process
• Almost exclusively done manually
• Requires a lot of human resources
• Last update was in 2007
• Twitter was launched in July, 2006
4. System overview
19027743 1985381275 NULL NULL <d>2009-06-01 00:00:00</d> <s>web</s>
<t>I think i'm gonna go with the magic in 6.... just cause now that bron bron's out i wanna see kobe lose too.</t>
SeanBennettt 98 434 159 -18000 0 0 <n>Sean Bennett</n> <ud>2009-01-15 16:36:04</ud>
<t>Eastern Time (US & Canada)</t> <l>Long Island, NY</l>
.
.
.
Positive:
.. :) :- ...... live tweet ;) .& -- =) everytime rain tweets (:
mj xd michael !!!!!! lil ." dog sun jus fan wit =] :] aww
album via luv photo ;- john pic different kno wearing
la ).
Negative:
!! :( ?? getting twitter omg ?! ppl :/ dude idk da
weather bout wtf iphone smh wat internet =( heat dnt
=/ facebook :| gosh kate :[ fml ima jon swear punch
text =[ cringe ): nd ** imma
6. System overview/Parser
19027743 1985381275 NULL NULL <d>2009-06-01 00:00:00</d> <s>web</s>
<t>I think i'm gonna go with the magic in 6.... just cause now that bron bron's out i wanna see kobe lose too.</t>
SeanBennettt 98 434 159 -18000 0 0 <n>Sean Bennett</n> <ud>2009-01-15 16:36:04</ud>
<t>Eastern Time (US & Canada)</t> <l>Long Island, NY</l>
.
.
.
haha nooo! i just wanna kill mee!!!! i didn`t do my
homework...and i feel sick =(
I can see the bus again. that makes me happy.
$$ Black Swan Fund Makes a Big Bet on Inflation
wonder how Roubini feels about this...?
blahh, i feel boredd and tiredd as hell haha
jay to conan... upgrade. lc to kristin... downgrade.
rushing home for lauren's final episode. my life
makes me sad.
7. Parser
• Structured Tweets → Extract tweet text (RegEx) → Filter → Clean → Clean Tweets
• The Clean step:
• Remove user name (RegEx)
• Remove URL (RegEx)
• Remove hash tag (RegEx)
8. Parser
• Regular Expressions
• Very powerful tool for text processing…
• …but very complex
• Ex.:
Input:
<d>2009-06-01 00:00:00</d> <s>web</s> <t>I just reached level 2. #spymaster http://bit.ly/playspy</t> asmith393 1522 1498 207 -18000 0 0 <n>Adam Smith</n> <ud>2007-03-07 18:17:20</ud> <t>Eastern Time (US & Canada)</t>
RegEx: <t>(.*?)</t>
Output: I just reached level 2. #spymaster http://bit.ly/playspy
9. Parser
• Regular Expressions
• Very powerful tool for text processing…
• …but very complex
• Ex.:
Input: I just reached level 2. #spymaster http://bit.ly/playspy
RegEx: #[0-9a-zA-Z+_]*
Output: I just reached level 2. http://bit.ly/playspy
10. Parser
• Regular Expressions
• Very powerful tool for text processing…
• …but very complex
• Ex.:
Input: I just reached level 2. #spymaster http://bit.ly/playspy
RegEx: ((http://|www.)([a-zA-Z0-9/.~])*)
Output: I just reached level 2. #spymaster
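The three parser examples can be chained into one small cleaning routine. This is a minimal sketch, not the original parser: the function names are hypothetical, the user-name pattern `@\w+` is an assumption (the slides do not show it), and the dot in `www.` is escaped here so that it only matches a literal dot.

```python
import re

# Patterns taken from the slides (the dot after "www" is escaped here).
TEXT_RE = re.compile(r"<t>(.*?)</t>")
HASHTAG_RE = re.compile(r"#[0-9a-zA-Z+_]*")
URL_RE = re.compile(r"((http://|www\.)([a-zA-Z0-9/.~])*)")
# Assumption: the slides do not show the user-name pattern.
USER_RE = re.compile(r"@\w+")


def extract_text(record):
    """Return the content of the first <t>...</t> field, or None."""
    match = TEXT_RE.search(record)
    return match.group(1) if match else None


def clean_tweet(text):
    """Remove URLs, hash tags and user names, then normalize spaces."""
    for pattern in (URL_RE, HASHTAG_RE, USER_RE):
        text = pattern.sub("", text)
    return " ".join(text.split())


record = (
    "<d>2009-06-01 00:00:00</d> <s>web</s> "
    "<t>I just reached level 2. #spymaster http://bit.ly/playspy</t> "
    "asmith393 1522 1498 207 -18000 0 0 <n>Adam Smith</n> "
    "<ud>2007-03-07 18:17:20</ud> <t>Eastern Time (US & Canada)</t>"
)
tweet_text = extract_text(record)
cleaned = clean_tweet(tweet_text)
```

The non-greedy `(.*?)` matters: the record has two `<t>` fields (tweet text and time zone), and a greedy match would swallow everything between the first `<t>` and the last `</t>`.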
11. System overview/Master
haha nooo! i just wanna kill mee!!!! i didn`t do my
homework...and i feel sick =(
I can see the bus again. that makes me happy.
$$ Black Swan Fund Makes a Big Bet on Inflation
wonder how Roubini feels about this...?
blahh, i feel boredd and tiredd as hell haha
jay to conan... upgrade. lc to kristin... downgrade.
rushing home for lauren's final episode. my life
makes me sad.
Index Frequency Chunks Co-frequency
12. Master
• Tweets → Splitter → Tweet chunks
• Tweets → Indexer → Index
• Tweet chunks → Mapper workers (M, M, M) → Reducer workers (R, R, R) → Unsorted frequency and co-frequency
• Unsorted frequency and co-frequency → Sort → Frequency and co-frequency
13. Master/Splitter
• Count the lines in the input file
• Select only tweets that contain words from the LIWC dictionary
• Split the input file into smaller chunks
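The splitter steps could look roughly like the sketch below. It works on an in-memory list instead of a file, the function names and the `chunk_size` parameter are hypothetical, and it matches dictionary entries as exact lowercase words (LIWC stems with wildcards are not handled here).

```python
import re


def has_dictionary_word(tweet, dictionary):
    """True if at least one word of the tweet is in the dictionary."""
    words = re.findall(r"[\w']+", tweet.lower())
    return any(word in dictionary for word in words)


def split_into_chunks(tweets, dictionary, chunk_size):
    """Keep tweets with a dictionary word and group them into chunks."""
    selected = [t for t in tweets if has_dictionary_word(t, dictionary)]
    return [
        selected[start:start + chunk_size]
        for start in range(0, len(selected), chunk_size)
    ]


# Tiny stand-in for the LIWC dictionary (assumption for the example).
dictionary = {"happy", "sad"}
tweets = [
    "I can see the bus again. that makes me happy.",
    "my life makes me sad.",
    "wonder how Roubini feels about this...?",
    "so happy today",
    "nothing here",
]
chunks = split_into_chunks(tweets, dictionary, chunk_size=2)
```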
15. Master/Mapper
• Spawn processes in parallel and divide the
chunks among them
• Each worker does two jobs:
• First: create (word, frequency) pairs
Chunk → Worker → Frequency.tmp:
someone 6
down 8
ever 10
kinda 2
crazy 14
…
16. Master/Mapper
• Spawn processes in parallel and divide the
chunks among them
• Each worker does two jobs:
• First: create (word, frequency) pairs
• Second: save the co-words for each word
17. Master/Mapper
• Worker steps: Split words → Remove duplicates → Generate files → Save co-words
• Ex.:
Tweet: haha nooo! i just wanna kill mee!!!! i didn`t do my homework...and i feel sick =(
Split words: haha, nooo, !, i, just, wanna, kill, mee, !!!!, i, didn`t, do, my, homework, ..., and, i, feel, sick, =(
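The two jobs of a mapper worker can be sketched as follows. This is an illustration under assumptions, not the original code: the function name is hypothetical, the tweets are assumed to be already split into words, results stay in memory instead of being written to files, and each word is counted once per tweet (following the "remove duplicates" step).

```python
from collections import Counter, defaultdict


def map_chunk(tokenized_tweets):
    """Return (frequency, co_words) for one chunk of tokenized tweets.

    frequency[word] is the number of tweets containing the word;
    co_words[word][other] is the number of tweets containing both.
    """
    frequency = Counter()
    co_words = defaultdict(Counter)
    for tokens in tokenized_tweets:
        unique = set(tokens)  # remove duplicates within the tweet
        frequency.update(unique)
        for word in unique:
            for other in unique:
                if other != word:
                    co_words[word][other] += 1
    return frequency, co_words


chunk = [
    ["i", "feel", "sick", "i"],
    ["i", "feel", "happy"],
]
frequency, co_words = map_chunk(chunk)
```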
18. Master/Mapper/Issues
• Splitting is not trivial
• Splitting on whitespace
• homework… ≠ homework
• Remove punctuation
• :) would be lost
• Solution: RegEx again
• ([\w-'`]*)(\W*)
• File names:
• Must be unique, easy to find, and respect the OS file-naming rules
• Hash
• This is why the index file is important
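A sketch of the splitting idea, using a variant of the pattern above: a run of word characters (plus apostrophe, backtick and hyphen) followed by a run of everything else. The hyphen is moved to the end of the character class because Python 3's `re` module rejects a range that starts with `\w`. The function name is hypothetical.

```python
import re

# Word part (letters, digits, _, ', `, -) followed by a non-word part.
TOKEN_RE = re.compile(r"([\w'`-]*)(\W*)")


def split_words(tweet):
    """Split a tweet into words and punctuation/emoticon tokens."""
    tokens = []
    for word, rest in TOKEN_RE.findall(tweet.lower()):
        if word:
            tokens.append(word)
        # The non-word part may hold spaces around punctuation or
        # emoticons; splitting on whitespace keeps ":)" and "..." whole.
        tokens.extend(rest.split())
    return tokens


tokens = split_words(
    "haha nooo! i just wanna kill mee!!!! i didn`t do my "
    "homework...and i feel sick =("
)
```

With this split, "homework..." yields the word "homework" and the token "...", and the emoticon "=(" survives as its own token instead of being stripped as punctuation.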
19. Master/Mapper/Issues
• Parallel programming in Python
• The original interpreter (CPython) cannot run threads in parallel because of the Global Interpreter Lock…
• Alternatives, such as Jython and IronPython, can
• …but it is still possible to work in parallel
• Multi-thread vs. Multi-process
• Multi-process in Python
• multiprocessing module
• http://docs.python.org/library/multiprocessing.html#module-multiprocessing.pool
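A minimal sketch of the multi-process approach with `multiprocessing.Pool`. The worker and helper names are hypothetical and the worker only counts whitespace-separated words; the point is the pattern of mapping chunks to worker processes and merging the partial results.

```python
from collections import Counter
from multiprocessing import Pool


def count_words(chunk):
    """Worker job: count the words in one chunk (a list of tweets)."""
    counts = Counter()
    for tweet in chunk:
        counts.update(tweet.split())
    return counts


def count_in_parallel(chunks, processes=2):
    """Distribute the chunks over worker processes and merge the results."""
    with Pool(processes=processes) as pool:
        partial_counts = pool.map(count_words, chunks)
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total
```

When run as a script, `count_in_parallel` should be called from inside an `if __name__ == "__main__":` block, because on some platforms the worker processes re-import the main module. The worker function also has to be defined at module level so that it can be sent to the workers.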
20. Master/Reducer
• Spawn processes in parallel and split the words
among them
• Basically counts the mapper results
• Also, each worker does two jobs:
• First: sum all the (word, frequency) pairs and save the result
frequency.tmp → Reducer → frequency.txt
frequency.tmp: car 4, house 2, ball 5, car 1, house 1
frequency.txt: car 5, house 3, ball 5
21. Master/Reducer
• Spawn processes in parallel and split the words
among them
• Basically counts the mapper results
• Also, each worker does two jobs:
• First: sum all the (word, frequency) pairs and save the result
• Second: sum the co-occurrence frequencies
Co-word file for "trip" → Worker → summed co-word file for "trip"
Before: car 1, ball 3, car 2, house 1
After: car 3, ball 3, house 1
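Both reducer jobs come down to summing counts per key. A minimal sketch with a hypothetical function name, working on in-memory pairs rather than on the temporary files, and using the numbers from the two reducer slides:

```python
from collections import Counter


def reduce_pairs(pairs):
    """Sum the counts of all (word, count) pairs that share a word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals


# First job: partial word frequencies produced by the mappers.
frequency = reduce_pairs(
    [("car", 4), ("house", 2), ("ball", 5), ("car", 1), ("house", 1)]
)

# Second job: partial co-word counts collected for the word "trip".
trip_co_words = reduce_pairs(
    [("car", 1), ("ball", 3), ("car", 2), ("house", 1)]
)
```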
22. Master/Reducer/Issues
• Index file
• Useful to access the files
• Each word has a file with a list of co-words
• But the file name is hashed
• Hashing is a non-invertible function
• Look up the word in the index, hash it, and get the file
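The role of the index can be illustrated with the sketch below. The choice of MD5, the ".txt" extension and the function names are assumptions; the slides only say that the file name is a hash of the word. Because the hash cannot be inverted, the index is the only place where the original words are listed.

```python
import hashlib


def file_name_for(word):
    """Hash a word into a name that is safe on any file system."""
    return hashlib.md5(word.encode("utf-8")).hexdigest() + ".txt"


def co_word_file(index, word):
    """Return the co-word file name of a word, or None if not indexed."""
    if word not in index:
        return None
    return file_name_for(word)


# The index is simply the set of words that have a co-word file.
index = {"car", "house", ":/", "?!"}
name = co_word_file(index, ":/")
```

Tokens such as ":/" or "?!" could not be used directly as file names, which is one reason for hashing them.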
24. Classifier
• Parameters: α, β, γ, δ, Max results
• Frequency and Co-frequency → Scores → New words
25. Classifier/Sentiment words
Words sorted by frequency; the top α% are kept:
Car 232
Ball 143
Street 125
House 121
Boat 114
Pencil 105
Pen 98
Computer 81
26. Classifier/Co-words
For each selected word, keep the top β% of its co-words:
Car: engine, tire, door
Ball: court, game, play
Street: name, size
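Both selection steps (top α% of the words, top β% of each word's co-words) can use the same helper. This is a sketch with a hypothetical function name; rounding the cut-off up is an assumption, and the co-word counts in the example are made up for illustration.

```python
import math
from collections import Counter


def top_percent(counts, percent):
    """Return the keys with the highest counts, keeping `percent` % of them."""
    keep = math.ceil(len(counts) * percent / 100)
    return [word for word, _ in Counter(counts).most_common(keep)]


frequency = {
    "Car": 232, "Ball": 143, "Street": 125, "House": 121,
    "Boat": 114, "Pencil": 105, "Pen": 98, "Computer": 81,
}
top_words = top_percent(frequency, 25)  # alpha = 25

# Hypothetical co-word counts for "Car" (not from the slides).
car_co_words = {"engine": 5, "tire": 3, "door": 2, "road": 1}
top_co_words = top_percent(car_co_words, 75)  # beta = 75
```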
27. Classifier/Score
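Co-word lists of positive words: (engine, tire, door), (court, game, play), (door, size)
Co-word lists of negative words: (size, room, type, home), (price, size, door)
Scores (word: positive, negative):
engine: 1, 0
tire: 1, 0
door: 2, 1
size: 1, 2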
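The scores on this slide are consistent with a simple counting scheme: a co-word gets one positive point for every positive word whose co-word list contains it, and one negative point for every negative word. The sketch below assumes that scheme and assumes that the first three lists belong to positive words and the last two to negative words; the function name is hypothetical.

```python
from collections import Counter


def score_co_words(positive_lists, negative_lists):
    """Return {co_word: (positive_score, negative_score)}."""
    positive = Counter(w for words in positive_lists for w in set(words))
    negative = Counter(w for words in negative_lists for w in set(words))
    return {
        word: (positive[word], negative[word])
        for word in set(positive) | set(negative)
    }


positive_lists = [
    ["engine", "tire", "door"],
    ["court", "game", "play"],
    ["door", "size"],
]
negative_lists = [
    ["size", "room", "type", "home"],
    ["price", "size", "door"],
]
scores = score_co_words(positive_lists, negative_lists)
```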
28. Classifier/Collapse
• Created to deal with problems like:
• :) :)) :), :).
• They should all be treated as the same token
• Harder for words
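One possible way to collapse emoticon variants is sketched below. It is an assumption, not the method used in the project: the function name is hypothetical, repeated characters are reduced to one, and trailing commas and periods are dropped. Tokens containing letters or digits are left untouched, since collapsing words (for example "boredd") is the harder case.

```python
import re


def collapse(token):
    """Map variants such as ':))' or ':),' to a single form ':)'."""
    if any(ch.isalnum() for ch in token):
        return token  # words are not collapsed in this sketch
    collapsed = re.sub(r"(.)\1+", r"\1", token)  # ':))' -> ':)'
    trimmed = collapsed.rstrip(",.")             # ':),' -> ':)'
    return trimmed or collapsed                  # keep '.' for '...'


variants = [":)", ":))", ":),", ":)."]
collapsed_variants = {collapse(v) for v in variants}
```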
29. Classifier/New words
• Rules to compare the scores
• So far the rules are
• If the positive score is bigger than the negative
score plus delta, tag the word as positive
• Same idea for negative
• Returns the new words up to a maximum value
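The rules above can be sketched as a small function. The function name is hypothetical, and ranking the candidates by score margin before applying the maximum is an assumption; the slides only state that at most a given number of new words is returned.

```python
def classify_new_words(scores, delta, max_results):
    """Tag words as positive/negative when one score beats the other by more than delta.

    scores maps each word to a (positive_score, negative_score) pair.
    """
    tagged = []
    for word, (positive, negative) in scores.items():
        if positive > negative + delta:
            tagged.append((word, "positive", positive - negative))
        elif negative > positive + delta:
            tagged.append((word, "negative", negative - positive))
    # Largest margin first, ties broken alphabetically (assumption).
    tagged.sort(key=lambda item: (-item[2], item[0]))
    return [(word, label) for word, label, _ in tagged[:max_results]]


scores = {"engine": (1, 0), "tire": (1, 0), "door": (2, 1), "size": (1, 2)}
new_words = classify_new_words(scores, delta=0, max_results=10)
```

A larger delta makes the classifier more conservative: with `delta=1` none of the four example words is tagged, because no score exceeds the other by more than one.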
31. Evaluation
• Two evaluation methods:
• First method
• Find tweets that could not be categorized before but now can be
• Manually check the precision of the result
• Second method
• Manually select positive and negative tweets
• Compare the precision of the old dictionary with
the new dictionary
32. Sub-product
• LIWC Dictionary Library for Python
• Provides easy access to the dictionary information
• Easy search
• Reverse index
• Match wildcard
• Ex.: