Sentiment improvements 
Proposed ideas: 
Part I. Data preprocessing 
Part II. PMI-IR approach 
Team members: 
Denys Astanin 
Mykhailo Kozik
Data preprocessing 
Raw data → preprocessing steps → preprocessed data
Preprocessing steps:
● Narrowing long words: goooood → good
● Emoticons decoding: :'( → cry
● Spell correction: I am shure that is realy exsellent plece → I am sure that is really excellent place
● Abbreviations decoding: lol → laughing out loud
● Tags detection: @Alex nice photo #photoworld
Narrowing long words 
Use a regexp to narrow every run of more than 2 repeated letters in a word down to 2
goooooood → good (correct narrowing)
baaaaaad → baad (incorrect narrowing, but the spell-checker will correct it)
Sentiment before and after narrowing:
This hotel so goooooooood! (NEUTRAL) → This hotel so good! (POSITIVE)
This place not coooooool! (NEUTRAL) → This place not cool! (NEGATIVE)
Try this regexp: http://regexr.com?30abm
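A minimal Python sketch of the narrowing step described above. The pattern is an assumption written for illustration; the regexp behind the link may differ. It only touches ASCII letters, so punctuation runs such as "...." or "???" are left alone.

```python
import re

# Illustrative pattern (not necessarily the one linked above):
# one letter followed by 2 or more copies of itself.
_LONG_RUN = re.compile(r"([A-Za-z])\1{2,}")

def narrow(text: str) -> str:
    """Cut every run of 3+ identical letters down to exactly 2."""
    return _LONG_RUN.sub(r"\1\1", text)

if __name__ == "__main__":
    print(narrow("This hotel so goooooooood!"))  # This hotel so good!
    print(narrow("baaaaaad"))                    # baad
```

Words such as "baad" are left for the spell-correction step, as noted above.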
Narrowing long words. Examples 
dancing with the stars and two and a half men toniiiight 
@BrunoMars you were AMAZINGGGGGG at the vma's need to see you! 
RT @BriannaStull13: I hateeeeeee pandora ads.... 
It was sooo badddd 
Woooooooooooow I Like that, very nice and big like 
Thts cooool 
i hack any thing but for moneyyyy 
who know how hacked one add fb??? pleaseeee
Narrowing long words. Performance 
Input size      10K          100K          1M
Long words      83.13 msec   828.30 msec   8370.97 msec (~8 sec.)
Normal words    31.92 msec   275.34 msec   2763.77 msec (~3 sec.)
Mixed words*    35.23 msec   339.23 msec   3370.31 msec (~3 sec.)
* assuming that 1% of words are long words
Emoticons decoding 
Use a map of smile meanings to replace each smile with the word it stands for
<3 → love
:( → sad
Sentiment before and after decoding:
Look at her http://t.co/12345 <3 (NEUTRAL) → Look at her http://t.co/12345 love (POSITIVE)
I will be out of work tomorrow :( (NEUTRAL) → I will be out of work tomorrow sad (NEGATIVE)
List of emoticons: http://en.wikipedia.org/wiki/List_of_emoticons
Emoticons decoding. Examples 
Awww He is Too cute :) Thanks bae next weekend.. 
@LenovoDoTour I have missed these two days in Belgrade :( 
Katie Holmes <3 #VMA 
ahaha just to warn you!! ;) 
it's amazing how Oracle can do so much! I'm loving it <3 
please someone help me i need to finish this im out of time!! thank!! :D 
Boa noite, viajantes! Menos um diazinho nessa semana =) 
:-( don't have my Mcard number required to fill out form
Emoticons decoding. Performance 
Smile list size   10K           100K                      1M
1                 45.03 msec    444.62 msec               4426.74 msec (~4 sec.)
5                 189.87 msec   1304.10 msec (~1 sec.)    12355.37 msec (~12 sec.)
10                227.26 msec   2325.23 msec (~2 sec.)    26954.26 msec (~27 sec.)
Performance degrades as the smile list grows because of the method used to perform
the replacements. Better results can be achieved with state machines or regexps
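One way to try the regexp idea is to compile the whole smile list into a single alternation, so the text is scanned once instead of once per smile. This is a sketch, not the measured implementation: the map below is a small hypothetical sample (the ":-(" entry is an assumption), and whether it is actually faster would have to be benchmarked.

```python
import re

# Hypothetical sample map; a real list would be much longer.
EMOTICONS = {
    "<3": "love",
    ":(": "sad",
    ":-(": "sad",   # assumed meaning, not taken from the slides
    ":'(": "cry",
}

# Longest smiles first, so a longer smile is preferred over a shorter
# one that happens to be its prefix.
_PATTERN = re.compile(
    "|".join(re.escape(s) for s in sorted(EMOTICONS, key=len, reverse=True))
)

def decode_emoticons(text: str) -> str:
    """Replace every known smile with its meaning in one scan of the text."""
    return _PATTERN.sub(lambda m: EMOTICONS[m.group(0)], text)

if __name__ == "__main__":
    print(decode_emoticons("I will be out of work tomorrow :("))
```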
Abbreviations decoding 
Use a map of abbreviations to replace each abbreviation with the phrase it stands for
lol → laughing out loud
thx → thanks
Sentiment before and after decoding:
Got it! lol (NEUTRAL) → Got it! laughing out loud (POSITIVE)
I was DWI, haha (NEUTRAL) → I was driving while intoxicated, haha (NEGATIVE)
List of abbreviations: http://www.smartdefine.org/internet_slang/abbreviations/r
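A short sketch of abbreviation decoding under two assumptions that the slides do not state: matching is case-insensitive (the examples contain "Lol", "LOL" and "lmao"), and an abbreviation is only replaced when it stands as a whole word. The map holds just the three entries that appear above.

```python
import re

# Entries taken from the examples above; a real map would be much larger.
ABBREVIATIONS = {
    "lol": "laughing out loud",
    "thx": "thanks",
    "dwi": "driving while intoxicated",
}

# \b keeps "lol" from matching inside words such as "lollipop".
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(a) for a in ABBREVIATIONS) + r")\b",
    re.IGNORECASE,
)

def decode_abbreviations(text: str) -> str:
    """Expand known abbreviations, ignoring case."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(0).lower()], text)

if __name__ == "__main__":
    print(decode_abbreviations("I was DWI, haha"))
```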
Abbreviation decoding. Examples 
No offense though.. Lol 
O lmao! 
http://t.co/Evvh4hj ROFL 
JFYI #blackcarpet 
Nice code LOL 
TNX you Rose! We appreciate it! 
OMG, FML! 
Wait me, i will be AFK
Emoticons and Abbreviations
● Alternative approach
● Abbreviations, acronyms and slang words are already parsed as tokens
● Parse smiles as tokens in FX as well
● Now we can use "Tune sentiments" on these tokens
Spell correction 
Perform spell correction on data before sentiment calculation 
Sentiment before and after correction:
I lov this hotel! (NEUTRAL) → I love this hotel! (POSITIVE)
They have terryble servic (NEUTRAL) → They have terrible service (NEGATIVE)
Spell correction. Examples
i hope @ladygaga will take some rest now becauce of... 
But its still also hilarioouss 
Shoukd i wast my money? 
Business eviroment 
It's impossibru! 
I like dansing! <3 
You can dowload the data from http://to.download/file 
Coleguaues, lets keep it clean.
Spell correction. Edit distance
● Edit types:
    Deletion: beauetiful → beautiful
    Insertion: speling → spelling
    Substitution: performanse → performance
    Swapping (transposition): yaer → year
● Examples:
    unsucesful → unsuccesful → unsuccessful (2 edits)
    wardoub → wardroub → wardrobu → wardrobe (3 edits)
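The four edit types above match the restricted Damerau-Levenshtein distance (also called optimal string alignment). The sketch below is a standard dynamic-programming formulation written for illustration, not the code used for the measurements in this deck.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of deletions, insertions, substitutions and
    adjacent swaps needed to turn a into b (optimal string alignment)."""
    # d[i][j] is the distance between a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Swap of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]

if __name__ == "__main__":
    print(edit_distance("yaer", "year"))                # 1
    print(edit_distance("unsucesful", "unsuccessful"))  # 2
```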
Spell correction. Algorithm 
 Peter Norvig's spelling corrector 
 Bayes rule approach 
 Train data 
 Simple implementation 
 High performance 
 Low accuracy 
More theory: http://norvig.com/spell-correct.html 
Train data: http://norvig.com/big.txt
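A condensed sketch in the spirit of the corrector described at the link above, limited to edit distance 1. The word counts are a toy stand-in (an assumption) for counts trained on big.txt, so the output only illustrates the mechanism: generate candidates one edit away, keep the known ones, pick the most frequent.

```python
from collections import Counter

# Toy counts standing in for a model trained on big.txt (assumption).
WORDS = Counter({"spelling": 10, "spewing": 2, "hotel": 5, "love": 8, "terrible": 3})
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one deletion, swap, substitution or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    swaps = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + swaps + replaces + inserts)

def known(words):
    """Keep only the words seen in the training counts."""
    return {w for w in words if w in WORDS}

def correct(word):
    """Most frequent known candidate; the word itself if none is known."""
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=lambda w: WORDS[w])

if __name__ == "__main__":
    print(correct("speling"))  # spelling
```

With these toy counts both "spelling" and "spewing" are one edit away from "speling", and the higher count decides.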
Spell correction. Coverage 
Corrections within edit distance 1 and 2 together cover 98% of misspellings
Spell correction. Accuracy 
Test data 1 Test data 2 
1 edit 61.8% 67.2% 
2 edits 71.2% 74.1% 
Test data 1: Wikipedia – Common misspelled words (~4k) 
http://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines 
Test data 2: Birkbeck spelling error corpus (270) 
http://www.ota.ox.ac.uk/headers/0643.xml
Spell correction. Performance 
Input size   10K                          100K                       1M
1 edit       11350.52 msec (~11 sec.)     117261.12 msec (~2 min.)   1252882.23 msec (~20 min.)
2 edits      4300631.29 msec (~70 min.)   not run*                   not run*
* With quadratic complexity these tests make no sense
Spell-check complexity per word:
Edit distance 1: O(C·n)
Edit distance 2: O(C²·n²)
where n is the length of the word and C ≈ 50
Spell correction. Improvements
● Performance
    Memoize corrections (best case → O(1))
    Let the user decide whether to perform spell correction
    Improve training data
● Coverage & Accuracy
    Use more edit candidates
    Use common misspelling rules
    Use weights for edit operations
    Take part of speech into account
    Take context into account
    Improve training data
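Memoizing corrections can be sketched with functools.lru_cache: a word that has already been corrected is answered from the cache instead of being recomputed. slow_correct below is a hypothetical placeholder for the real corrector, and the call counter exists only to make the effect visible.

```python
from functools import lru_cache

CALLS = {"count": 0}

def slow_correct(word: str) -> str:
    """Hypothetical stand-in for an expensive spell corrector."""
    CALLS["count"] += 1
    return {"lov": "love", "servic": "service"}.get(word, word)

@lru_cache(maxsize=100_000)
def cached_correct(word: str) -> str:
    """Same result as slow_correct, computed once per distinct word."""
    return slow_correct(word)

def correct_sentence(sentence: str) -> str:
    """Correct a sentence word by word through the cache."""
    return " ".join(cached_correct(w) for w in sentence.split())
```

For example, correct_sentence("I lov lov lov this") calls slow_correct only three times, once per distinct word.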
Tags detection 
Process source-specific information (Twitter) differently
● Hashtag (#music): use the word splitter
● Username (@LadyGaga): just ignore it
Sentiment before and after processing:
I say to @love hello! (POSITIVE) → I say to - hello! (NEUTRAL)
I mean that i #hatetwitter (NEUTRAL) → I mean that i hate twitter (NEGATIVE)
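A small sketch of the detection part, assuming that usernames and hashtags consist of word characters after "@" and "#". Usernames are replaced with "-" as in the example above; hashtags are only extracted here, since splitting them into words is a separate step. A pattern this simple would also hit things like e-mail addresses, which a real implementation would need to exclude.

```python
import re

_USERNAME = re.compile(r"@\w+")
_HASHTAG = re.compile(r"#(\w+)")

def strip_usernames(text: str) -> str:
    """Replace every @username with a neutral placeholder."""
    return _USERNAME.sub("-", text)

def find_hashtags(text: str) -> list:
    """Return hashtag bodies without the leading '#'."""
    return _HASHTAG.findall(text)

if __name__ == "__main__":
    print(strip_usernames("I say to @love hello!"))     # I say to - hello!
    print(find_hashtags("I mean that i #hatetwitter"))  # ['hatetwitter']
```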
Tags detection. Examples 
@INevaTrustEm ok :) we need to make a date for this 
Watching @danieltosh #toofunny 
#lovetolaugh 
#sick 
Avatar, #wasteofmoney 
#soft #thissucks 
#happytweet 
RT @BriannaStull13: what do you mean?
Tags detection. Words splitting 
 Dynamic programming 
 Statistical approach due to ambiguity 
#orcore → [orc_ore], [or_core] 
#expertsexchange → [expert_sex_change], [experts_exchange] 
 Train data 
 Dictionary (default linux ~100K words)
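The combination of dynamic programming and statistics can be sketched as follows: every split point is tried, each piece is scored by its unigram probability, and the best-scoring segmentation of each suffix is cached. The counts are toy values invented for illustration, and the length-based penalty for unknown pieces is an assumption, not the scoring used for the results in this deck.

```python
import math
from functools import lru_cache

# Toy unigram counts (assumption); real counts would come from training data.
COUNTS = {
    "hate": 50, "twitter": 40, "experts": 20, "exchange": 30,
    "expert": 25, "sex": 15, "change": 35, "waste": 10, "of": 500, "money": 60,
}
TOTAL = sum(COUNTS.values())

def word_logprob(word: str) -> float:
    """Log-probability of a piece; unknown pieces get a length-based penalty."""
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    return math.log(1.0 / (TOTAL * 10 ** len(word)))

@lru_cache(maxsize=None)
def best_split(text: str):
    """Return (score, words) for the most probable segmentation of text."""
    if not text:
        return (0.0, ())
    options = []
    for i in range(1, len(text) + 1):
        head, tail = text[:i], text[i:]
        tail_score, tail_words = best_split(tail)
        options.append((word_logprob(head) + tail_score, (head,) + tail_words))
    return max(options)

def split_hashtag(tag: str) -> list:
    """Split a hashtag such as '#hatetwitter' into its most likely words."""
    return list(best_split(tag.lstrip("#").lower())[1])

if __name__ == "__main__":
    print(split_hashtag("#expertsexchange"))
```

With these toy counts "#expertsexchange" resolves to experts + exchange, because two fairly common words outscore three; different counts could flip the result, which is the ambiguity noted above.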
Tags detection. Twitter hashtags 
Twitter hashtags crawled from (~800): 
http://hashtags.org/ 
http://kingnetforums.weebly.com/twitter-hashtags-lists.html 
http://edudemic.com/2011/10/twitter-hashtag-dictionary/ 
http://nicolehumphrey.net/60-favorite-twitter-hashtags-for-writers-clickable-list/ 
http://www.dailywritingtips.com/40-twitter-hashtags-for-writers/ 
http://greeneconomypost.com/green-twitter-hashtag-17290.htm
Tags detection. Performance 
Hashtags   100                      400                      800
Time       4019.73 msec (~4 sec.)   6429.19 msec (~6 sec.)   7897.23 msec (~8 sec.)
Accuracy   83.00%                   86.25%                   84.88%
Main problems:
● The training set often fails to resolve ambiguity
● Dictionary hits filter out many correct candidates
#rapnotamusic → [ra_p_not_a_music]
Words splitting. Improvements
● Performance
    Memoize splitting
    Prefix tree approach
    Viterbi algorithm (http://en.wikipedia.org/wiki/Viterbi_algorithm)
    Improve training data
● Accuracy
    Use famous names, geographic locations, slang, abbreviations, acronyms, ...
    Bigger dictionary
    Improve training data (Twitter-specific)
Preprocessing performance 
Input conditions: 
Data: 2.4K (incorrect) of 15.8K (total) from Omniture15K.xls file (15%) 
Emoticons size: 14 most common smiles 
Abbreviations size: 8 most common abbrs 
Spell-correction distance: 1 
Train data: big.txt 
Dictionary: linux-words.txt 
Results: 
Sentence count: 2412 
Preprocessing time: 29214.88 msec (~29 sec.) 
Number of corrected sentences: 368 
Percent of corrected to incorrect data: 15.28% 
Percent of corrected to total data: 2.33%
Data preprocessing. Future. 
 Sentence breaker
Environment 
 Hardware 
 CPU: 2 x Intel Pentium Dual T2370 @ 1.73GHz 
 RAM: 2.0 GB 
 Software 
 OS: Ubuntu 11.04 
 Kernel: Linux 2.6.38-13-generic 
 IDE: Emacs 23.2.1 
 Programming: Clojure 1.3
