Sentiments Improvement

Sentiment improvements
Proposed ideas:
Part I. Data preprocessing
Part II. PMI-IR approach
Team members:
Denys Astanin
Mykhailo Kozik

Data preprocessing
Raw data
Preprocessed data
Narrowing
Long words
Emoticons
Decoding
Spell
Correction
Abbreviations
Decoding
Tags
Detection
:'( → cry
@Alex nice photo
#photoworld
goooood → good
lol → laughing out loud
I am shure that is realy exsellent plece
|
I am sure that is really excellent place

Narrowing long words
Using regexp narrow more than 2 duplicate letters in word to just 2
goooooood → good (correct narrowing)
baaaaaad → baad (incorrect narrowing, but will be corrected with spell-checker)
This hotel so goooooooood! This hotel so good!
NEUTRAL POSITIVE
This place not coooooool! This place not cool!
NEUTRAL NEGATIVE
Try this regexp: http://regexr.com?30abm

Narrowing long words. Examples
dancing with the stars and two and a half men toniiiight
@BrunoMars you were AMAZINGGGGGG at the vma's need to see you!
RT @BriannaStull13: I hateeeeeee pandora ads....
It was sooo badddd
Woooooooooooow I Like that, very nice and big like
Thts cooool
i hack any thing but for moneyyyy
who know how hacked one add fb??? pleaseeee

Narrowing long words. Performance
10K 100K 1M
Long words 83.13 msec 828.30 msec 8370.97 msec
~8 sec.
Normal words 31.92 msec 275.34 msec 2763.77 msec
~3 sec.
Mixed words* 35.23 msec 339.23 msec 3370.31 msec
~3 sec.
* assume that 1% of words are long words

Emoticons decoding
Using map of smile meanings convert smile to word that it means
<3 → love
:( → sad
Look at her http://t.co/12345 <3 Look at her http://t.co/12345 love
NEUTRAL POSITIVE
I will be out of work tomorrow :( I will be out of work tomorrow sad
NEUTRAL NEGATIVE
List of emoticons: http://en.wikipedia.org/wiki/List_of_emoticons

Emoticons decoding. Examples
Awww He is Too cute :) Thanks bae next weekend..
@LenovoDoTour I have missed these two days in Belgrade :(
Katie Holmes <3 #VMA
ahaha just to warn you!! ;)
it's amazing how Oracle can do so much! I'm loving it <3
please someone help me i need to finish this im out of time!! thank!! :D
Boa noite, viajantes! Menos um diazinho nessa semana =)
:-( don't have my Mcard number required to fill out form

Emoticons decoding. Performance
10K 100K 1M
1 smile list 45.03 msec 444.62 msec 4426.74 msec
~4 sec.
5 smile list 189.87 msec 1304.10 msec
~1 sec.
12355.37 msec
~12 sec.
10 smile list 227.26 msec 2325.23 msec
~2 sec.
26954.26 msec
~27 sec.
We have so poor performance when smile list grow up due to method that perform
replacements. Better results can achieved with using state machines or regexps

Abbreviations decoding
Using map of abbreviations convert abbr to word that it means
lol → laughing out loud
thx → thanks
Got it! lol Got it! laughing out loud
NEUTRAL POSITIVE
I was DWI, haha I was driving while intoxicated, haha
NEUTRAL NEGATIVE
List of abbreviations: http://www.smartdefine.org/internet_slang/abbreviations/r

Abbreviation decoding. Examples
No offense though.. Lol
O lmao!
http://t.co/Evvh4hj ROFL
JFYI #blackcarpet
Nice code LOL
TNX you Rose! We appreciate it!
OMG, FML!
Wait me, i will be AFK

Emoticons and Abbreviations
 Alternative approach
 Abbreviations, acronyms, slang words are already parsed as tokens
 Parse smiles as tokens also in FX
 Now we can use ”Tune sentiments” on these tokens

Spell correction
Perform spell correction on data before sentiment calculation
I lov this hotel! I love this hotel!
NEUTRAL POSITIVE
They have terryble servic They have terrible service
NEUTRAL NEGATIVE

Spell corection. Examples
i hope @ladygaga will take some rest now becauce of...
But its still also hilarioouss
Shoukd i wast my money?
Business eviroment
It's impossibru!
I like dansing! <3
You can dowload the data from http://to.download/file
Coleguaues, lets keep it clean.

Spell correction. Edit distance
 Edit types:
 Deletion beauetiful → beautiful
 Insertion speling → spelling
 Substitution performanse → performance
 Swaping yaer → year
 Examples
unsucesful → unsuccesful → unsuccessful (2 edits)
wardoub → wardroub → wardrobu → wardrobe (3 edits)

Spell correction. Algorithm
 Peter Norvig's spelling corrector
 Bayes rule approach
 Train data
 Simple implementation
 High performance
 Low accuracy
More theory: http://norvig.com/spell-correct.html
Train data: http://norvig.com/big.txt

Spell correction. Coverage
Edit1 + Edit2 covers 98%!!!

Spell correction. Accuracy
Test data 1 Test data 2
1 edit 61.8% 67.2%
2 edits 71.2% 74.1%
Test data 1: Wikipedia – Common misspelled words (~4k)
http://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines
Test data 2: Birkbeck spelling error corpus (270)
http://www.ota.ox.ac.uk/headers/0643.xml

Spell correction. Performance
10K 100K 1M
1 edit 11350.52 msec
~11 sec.
117261.12 msec
~2 min.
1252882.23 msec
~20 min.
2 edits 4300631.29 msec
~70 min.
Due to quadratic complexity these tests
make no sense
Spell-check complexity for word:
Edit distance 1: O(C·n)
Edit distance 2: O(C²·n²)
* n – length of word
** C ~= 50

Spell correction. Improvements
 Performance
 Memoize correction (Best → O(1))
 Give ability to user to perform spell-correction
 Improve train data
 Coverage & Accuracy
 Use more edits candidates
 Use common mispelling rules
 Use weights for edit operations
 Hit part of speech
 Hit context

Tags detection
Process differently source-specific information (twitter)
● Hashtag (#music) use word splitter
● Username (@LadyGaga) just ignore it
I say to @love hello! I say to - hello!
POSITIVE NEUTRAL
I mean that i #hatetwitter I mean that i hate twitter
NEUTRAL NEGATIVE

Tags detection. Examples
@INevaTrustEm ok :) we need to make a date for this
Watching @danieltosh #toofunny
#lovetolaugh
#sick
Avatar, #wasteofmoney
#soft #thissucks
#happytweet
RT @BriannaStull13: what do you mean?

Tags detection. Words splitting
 Dynamic programming
 Statistical approach due to ambiguity
#orcore → [orc_ore], [or_core]
#expertsexchange → [expert_sex_change], [experts_exchange]
 Train data
 Dictionary (default linux ~100K words)

Tags detection. Twitter hashtags
Twitter hashtags crawled from (~800):
http://hashtags.org/
http://kingnetforums.weebly.com/twitter-hashtags-lists.html
http://edudemic.com/2011/10/twitter-hashtag-dictionary/
http://nicolehumphrey.net/60-favorite-twitter-hashtags-for-writers-clickable-list/
http://www.dailywritingtips.com/40-twitter-hashtags-for-writers/
http://greeneconomypost.com/green-twitter-hashtag-17290.htm

Tags detection. Performance
100 400 800
Time 4019.73 msec
~4 sec.
6429.19
~6 sec.
7897.23
~8 sec.
Accuracy 83.00% 86.25% 84.88%
Main problems:
● Train set not often solves ambiguity problem
● Dictionary hits filter lot of right candidates
#rapnotamusic → [ra_p_not_a_music]

Words splitting. Improvements
 Performance
 Memoize splitting
 Prefix tree approach
 Viterbi algorithm (http://en.wikipedia.org/wiki/Viterbi_algorithm)
 Accuracy
 Use famous names, geographic locations, slang, abbreviations,
acronyms,...
 Big dictionary
 Improve train data (twitter-specific)

Preprocessing performance
Input conditions:
Data: 2.4K (incorrect) of 15.8K (total) from Omniture15K.xls file (15%)
Emoticons size: 14 most common smiles
Abbreviations size: 8 most common abbrs
Spell-correction distance: 1
Train data: big.txt
Dictionary: linux-words.txt
Results:
Sentence count: 2412
Preprocessing time: 29214.88 msec (~29 sec.)
Number of corrected sentences: 368
Percent of corrected to incorrect data: 15.28%
Percent of corrected to total data: 2.33%

Data preprocessing. Future.
 Sentence breaker

Environment
 Hardware
 CPU: 2 x Intel Pentium Dual T2370 @ 1.73GHz
 RAM: 2.0 GB
 Software
 OS: Ubuntu 11.04
 Kernel: Linux 2.6.38-13-generic
 IDE: Emacs 23.2.1
 Programming: Clojure 1.3

Sentiments Improvement

More Related Content

Viewers also liked

Similar to Sentiments Improvement

More from Misha Kozik

Recently uploaded

Sentiments Improvement