BBC SoundIndex

Slides accompanying the VLDB Journal 2010 paper: Daniel Gruhl, Meenakshi Nagarajan, Jan Pieper, Christine Robson, Amit Sheth, "Multimodal Social Intelligence in a Real-Time Dashboard System", to appear in a special issue of the VLDB Journal on "Data Management and Mining for Social Networks and Social Media".

BBC SoundIndex

  1. BBC SoundIndex: Pulse of the Online Music Populace. http://www.almaden.ibm.com/cs/projects/iis/sound/ Daniel Gruhl, Meenakshi Nagarajan, Jan Pieper, Christine Robson, Amit Sheth, "Multimodal Social Intelligence in a Real-Time Dashboard System", to appear in a special issue of the VLDB Journal on "Data Management and Mining for Social Networks and Social Media", 2010.
  2. The Vision: what is 'really' hot? Netizens do not always buy their music, let alone buy it in a CD store. Traditional sales figures are a poor indicator of music popularity. http://www.almaden.ibm.com/cs/projects/iis/sound/
  3. The Vision (build): BBC: Are online music communities good proxies for popular music listings?
  4. The Vision (build): IBM: Well, let's build and find out!
  5. The Vision (build): BBC SoundIndex, "a pioneering project to tap into the online buzz surrounding artists and songs, by leveraging several popular online sources."
  6. The Vision (build): same content as slide 5.
  7. The Vision (build): "one chart for everyone" is so old!
  8. "Multimodal Social Intelligence in a Real-Time Dashboard System", VLDB Journal 2010 Special Issue: Data Management and Mining for Social Networks and Social Media. [System overview figure: user metadata, unstructured attention metadata, structured artist/track metadata.] Right data source, right crowd, timeliness of data..?
  9. (System overview, build:) Album/track identification, sentiment identification, spam and off-topic comments; UIMA Analytics Environment.
  10. (System overview, build:) Extracted concepts go into explorable data structures.
  11. (System overview, build:) What are 18-year-olds in London listening to?
  12. (System overview, build:) What are 18-year-olds in London listening to? Validating crowd-sourced preferences.
  13. Imagine doing this for a local business! Pulse of the 'foodie' populace! Where are 20-somethings going? Why?
  14. SoundIndex Architecture. Fig. 2 (from the paper): data sources are ingested and, if necessary, transformed into structured data using the MusicBrainz RDF and data miners. The resulting structured data is stored in a database and periodically extracted to update the front end. Is the radio still the most popular medium for music? In 2007 less than half of all teenagers purchased a CD, and with the advent of portable MP3 players, fewer people are listening to the radio. With the rise of new ways in which communities ... counts of "plays" such as YouTube videos and LastFM tracks, and purely structured data such as the number of sales from iTunes. We then clean this data, using spam removal and off-topic detection, and perform analytics to spot songs, artists and sentiment expressions. The cleaned and annotated data is combined, adjudicating for differences ...
  15. UIMA Annotators.
  16. UIMA Annotators: MySpace user comments (similar to Twitter). Named entity recognition, sentiment recognition, spam elimination. Example comments: "I heart your new song Umbrella..", "madge..ur pics on celebration concert with jesus r awesome!" Challenges, intuitions, findings, results..
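The pipeline on this slide (NER, sentiment recognition, spam elimination over MySpace comments) is built on UIMA in the real system. Below is a minimal Python sketch of the same annotator-chain idea, not the actual UIMA Java components; the `Comment` structure and annotator names are assumptions for illustration.

```python
# Minimal sketch of the annotator-chain idea (hypothetical; the real system uses
# UIMA analysis engines). Each annotator reads the comment and adds tags to it,
# and later annotators may consult tags produced by earlier ones.

from dataclasses import dataclass, field

@dataclass
class Comment:
    text: str
    artist_page: str                  # artist whose MySpace page hosts the comment
    tags: dict = field(default_factory=dict)

def ner_annotator(c: Comment) -> None:
    c.tags["entities"] = []           # placeholder: album/track spotting (sketched later)

def sentiment_annotator(c: Comment) -> None:
    c.tags["sentiment"] = []          # placeholder: slang-dictionary lookup (sketched later)

def spam_annotator(c: Comment) -> None:
    c.tags["spam"] = False            # placeholder: phrase rules over earlier tags

PIPELINE = [ner_annotator, sentiment_annotator, spam_annotator]

def annotate(c: Comment) -> Comment:
    for annotator in PIPELINE:        # order matters: the spam rules use earlier results
        annotator(c)
    return c

print(annotate(Comment("I heart your new song Umbrella..", "Rihanna")).tags)
```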
  17. Recognizing Named Entities: cultural entities, informal text, context-poor utterances, restricted to the music sense.. "Ohh these sour times... rock!"
  18. Recognizing Named Entities. Problem definition: semantic annotation of album/track names (using MusicBrainz) [at ISWC '09]. "Ohh these sour times... rock!"
  19. NER against a catalogue of Michael Jackson releases: Got to Be There, Ben, Music and Me, Forever Michael, Off the Wall, Thriller, Bad, Dangerous.
  20. Spot and Disambiguate paradigm: generate candidate entities, then disambiguate spots/mentions in context. "Thriller was my most fav MJ album"; "this is bad news, ill miss you MJ".
  21. Spot and Disambiguate paradigm: a disambiguation-intensive approach.
  22. Challenge 1: Multiple Senses, Same Domain. 60 songs titled "Merry Christmas"; 3600 songs titled "Yesterday"; 195 releases of "American Pie"; 31 artists covering "American Pie". "Happy 25th! Loved your song Smile .."
  23. (Same content as slide 22.)
  24. Intuition: Scoped Graphs. "This new Merry Christmas tune.. SO GOOD!" Which 'Merry Christmas'? 'So Good' is also a song! Scoped relationship graphs built using cues from the content, webpage title, URL..
  25. Intuition: Scoped Graphs (build): reduce the potential entity spot size, generate candidate entities, disambiguate in context.
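To make the scoping idea concrete, here is a small hypothetical spotter in Python: the entity list stands in for a MusicBrainz-derived set restricted to one artist's tracks (the "specific artist" restriction), and the comment is invented. It only generates candidate spots; disambiguation happens in a later step.

```python
import re

# Hypothetical scoped entity list: tracks of the artist whose page was crawled,
# standing in for a MusicBrainz-derived list under the "specific artist" restriction.
SCOPED_TRACKS = {
    "madonna": ["4 Minutes", "Hung Up", "Like a Prayer", "Celebration"],
}

def naive_spot(comment: str, artist: str):
    """Return (track, span) pairs for case-insensitive mentions of scoped tracks."""
    spots = []
    for track in SCOPED_TRACKS.get(artist.lower(), []):
        for m in re.finditer(re.escape(track), comment, flags=re.IGNORECASE):
            spots.append((track, m.span()))
    return spots

# Candidate generation only; a later step still has to reject non-music uses,
# e.g. "keep your smile on" for a track actually titled "Smile".
print(naive_spot("loved 4 minutes and hung up!! you rock", "Madonna"))
```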
  26. What Content Cues to Exploit? "I heart your new album Habits"; "Happy 25th lilly, love ur song smile"; "Congrats on your rising star award".. Experimenting with restrictions: career length, album release, number of albums, ..., specific artist (e.g. releases that are not new; artists who are at least 25; new careers..).
  27. Gold Truth Dataset: 1800+ spots in MySpace user comments from artist pages. "Keep your SMILE on!" - good spot, bad spot, inconclusive spot? 4-way annotator agreement across spots: Madonna 90% agreement, Rihanna 84% agreement, Lily Allen 53% agreement.
  28. Sample Restrictions, Spot Precision: 3 artists, 1800+ spots. Experimenting with restrictions: career length, album release, number of albums, ..., specific artist.
  29. Sample Restrictions, Spot Precision. [Chart: spot precision as the knowledge base is restricted from all of MusicBrainz (281,890 artists, 6,220,519 tracks) down to the tracks of one artist, via intermediate restrictions such as artists whose first album was released in the past few years.]
  30. Sample Restrictions, Spot Precision: precision closely follows the distribution of random restrictions, conforming loosely to a Zipf distribution. [Chart: percent of the taxonomy used by the spotter vs. precision; restrictions based on artist characteristics vs. the average expected improvement under a Zipf distribution.]
  31. Choosing which constraints to implement is simple - pick the easiest first.
  32. Madonna's Scoped Entity List (tracks). User comments are on MySpace artist pages. Restriction: artist name. Assumption: no other artist/work is mentioned.
  33. A naive spotter has the advantage of spotting all possible mentions (modulo spelling errors), but generates several false positives: "this is bad news, ill miss you MJ".
  34. Challenge 2: Disambiguating Spots. "Got your new album Smile. Loved it!" vs. "Keep your SMILE on!" Separating valid and invalid mentions of music named entities.
  35. (Same content as slide 34.)
  36. Intuition: Using Natural Language Cues. "Got your new album Smile. Loved it!" Table 6 (features used by the SVM learner; generic syntactic, spot-level and domain features):
     Syntactic features (Notation-S): + POS tag of s (s.POS); POS tag of one token before s (s.POSb); POS tag of one token after s (s.POSa); typed dependency between s and a sentiment word (s.POS-TDsent*); typed dependency between s and a domain-specific term (s.POS-TDdom*); boolean typed dependency between s and a sentiment (s.B-TDsent*); boolean typed dependency between s and a domain-specific term (s.B-TDdom*).
     Word-level features (Notation-W): + capitalization of spot s (s.allCaps); + capitalization of first letter of s (s.firstCaps); + s in quotes (s.inQuotes).
     Domain-specific features (Notation-D): sentiment expression in the same sentence as s (s.Ssent); sentiment expression elsewhere in the comment (s.Csent); domain-related term in the same sentence as s (s.Sdom); domain-related term elsewhere in the comment (s.Cdom).
     (+ marks basic features, the rest are advanced; * features apply only to one-word-long spots.)
  37. (Build on slide 36.) 1. Sentiment expressions: a slang sentiment gazetteer mined using Urban Dictionary. 2. Domain-specific terms: music, album, concert..
  38. (Build on slide 36.) Typed dependencies: we also captured typed dependency paths (grammatical relations) via the s.POS-TDsent and s.POS-TDdom features, obtained between a spot and co-occurring sentiment and domain-specific words by the Stanford parser [12] (see the example in Table 7). We also encode a boolean value indicating whether a relation was found at all, using the s.B-TDsent and s.B-TDdom features; this allows us to accommodate parse errors given the informal and often non-grammatical English in this corpus. Table 7 (typed dependencies example): valid spot - "Got your new album Smile. Simply loved it!", encoding nsubj(loved-8, Smile-5), implying that Smile is the nominal subject of the expression loved; invalid spot - "Keep your smile on. You'll do great!", encoding: no typed dependency between smile and great.
  39. Binary Classification: valid music mention or not? "Keep ur SMILE on!" vs. "Got your new album Smile. Loved it!" SVM binary classifiers. Training set: 550 valid spots (positive examples, +1) and 550 invalid spots (negative examples, -1). Test set: 120 valid spots, 458 invalid spots.
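A hedged sketch of this classification step, using scikit-learn's LinearSVC as a stand-in for whichever SVM implementation the authors used. Only the word-level (W) and domain (D) feature groups from Table 6 are approximated here (the POS and typed-dependency features would require a parser), and the sentiment/domain word lists and training examples are invented.

```python
from sklearn.svm import LinearSVC

SENTIMENT_WORDS = {"love", "loved", "awesome", "heart"}   # stand-in slang gazetteer
DOMAIN_WORDS = {"album", "song", "track", "concert"}      # stand-in domain terms

def features(spot: str, comment: str):
    """Rough word-level (W) and domain (D) features for one spotted track name."""
    sentence = next((s for s in comment.split(".") if spot.lower() in s.lower()), comment)
    return [
        float(spot.isupper()),                                        # s.allCaps
        float(spot[:1].isupper()),                                    # s.firstCaps
        float(f'"{spot}"' in comment),                                # s.inQuotes
        float(any(w in sentence.lower() for w in SENTIMENT_WORDS)),   # s.Ssent
        float(any(w in comment.lower() for w in SENTIMENT_WORDS)),    # s.Csent
        float(any(w in sentence.lower() for w in DOMAIN_WORDS)),      # s.Sdom
        float(any(w in comment.lower() for w in DOMAIN_WORDS)),       # s.Cdom
    ]

# Toy training data: +1 = valid music mention, -1 = invalid mention.
train = [("Smile", "Got your new album Smile. Loved it!", 1),
         ("SMILE", "Keep your SMILE on!", -1)]
X = [features(spot, comment) for spot, comment, _ in train]
y = [label for _, _, label in train]
clf = LinearSVC().fit(X, y)

print(clf.predict([features("Smile", "your song Smile is awesome")]))  # likely +1 (valid)
```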
  40. Efficacy of Features: precision-intensive vs. recall-intensive feature settings.
  41. Efficacy of Features: precision/recall operating points of 42-91, 78-50 and 90-35; the 90-35 setting identified 90% of valid spots and eliminated 35% of invalid spots.
  42. Feature combinations were the most stable and best performing; gazetteer-matched domain words and sentiment expressions proved to be useful.
  43. Precision-recall trade-offs: choose feature combinations depending on the end application.
  44. How did we do overall? Step 1: spot with the naive spotter over a restricted knowledge base. [Chart: precision of track spots across classifier settings.]
  45. How did we do overall? Step 1: spot with the naive spotter over a restricted knowledge base. Step 2: disambiguate using NL features (SVM classifier). The 42-91 point is the "all features" setting.
  46. How did we do overall? Madonna's track spots reach roughly 60% precision.
  47. Lessons Learned.. Using domain knowledge, especially for cultural entities, is non-trivial. Computationally intensive NLP is prohibitive. Two-stage approach: NL learners over dictionary-based naive spotters. This allows more time-intensive NL analytics to run on less than the full set of input data, and gives control over the precision and recall of the final result.
  48. (Same content as slide 47.)
  49. Sentiment Expressions. NER allows us to track and trend online mentions. Popularity assessment: sentiment associated with the target entity. Exploratory project: what kinds of sentiments, and what distribution of positive and negative expressions?
  50. Observations, Challenges. Negations, sarcasm and refutations are rare. Short sentences: OK to overlook target attachment. Target demographic: teenagers! Slang expressions: wicked, tight, "the shit", tropical, bad.. What are the common positive and negative expressions this demographic uses? Urbandictionary.com to the rescue!
  51. Urbandictionary.com: a slang dictionary with a glossary and related tags ("bad" appears with both "terrible" and "good"!).
  52. Mining a Slang Sentiment Dictionary: map frequently used sentiment expressions to their orientations.
  53. Seed the dictionary with positive and negative oriented entries: good, terrible..
  54. Step 1: for each seed, query Urban Dictionary and get its related tags. Good -> awesome, sweet, fun, bad, rad..; Terrible -> bad, horrible, awful, shit..
  55. Step 2: calculate the semantic orientation of each related tag [Turney]: SO(rad) = PMIud(rad, "good") - PMIud(rad, "terrible"). (Good -> awesome, sweet, fun, bad, rad.)
  56. PMI over the Web vs. over Urban Dictionary: SOweb(rock) = -0.112, SOud(rock) = 0.513.
  57. Step 3: record the orientation and add the tag to the dictionary: {good, rad}, {terrible}. Continue for the other related tags and all new entries. Mined dictionary: 300+ entries (manually verified for noise).
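A worked sketch of the semantic-orientation step. With co-occurrence counts, the PMI difference reduces to a single log ratio because the marginal count of the word cancels; the counts below are invented rather than real Urban Dictionary statistics.

```python
import math

# Invented counts: how often a related tag co-occurs with the positive seed "good"
# and the negative seed "terrible", plus overall seed frequencies.
cooc = {"rad": {"good": 120, "terrible": 7},
        "wack": {"good": 9, "terrible": 65}}
seed_count = {"good": 5000, "terrible": 800}

def semantic_orientation(word: str) -> float:
    """SO(w) = PMI(w, 'good') - PMI(w, 'terrible'); count(w) and the corpus size cancel."""
    c = cooc[word]
    return math.log2((c["good"] * seed_count["terrible"]) /
                     (c["terrible"] * seed_count["good"]))

for w in cooc:
    print(w, round(semantic_orientation(w), 2))   # rad comes out positive, wack negative
```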
  58. Sentiment Expressions in Text. "Your new album is wicked." Shallow NL parse: Your/PRP$ new/JJ album/NN is/VBZ wicked/JJ; keep verbs and adjectives [Hatzivassiloglou 97].
  59. Look the word up in the mined dictionary and record its orientation.
  60. Why don't we just spot using the dictionary alone? Coverage issues ("fuuunn!") and false hits ("super man is in town"; "i heard amy at tropical cafe yday..").
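A sketch of this lookup in Python. NLTK's POS tagger is used here purely as a stand-in for the shallow parse mentioned on the slide (and assumes its tagger models are installed); the tiny dictionary stands in for the 300+ entry mined slang dictionary.

```python
import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' data are available

# Tiny stand-in for the mined slang sentiment dictionary.
MINED_DICT = {"wicked": "positive", "tight": "positive", "awesome": "positive",
              "terrible": "negative", "wack": "negative"}

def spot_sentiment(comment: str):
    """Return (word, orientation) for adjectives/verbs found in the mined dictionary."""
    tagged = nltk.pos_tag(nltk.word_tokenize(comment.lower()))
    return [(word, MINED_DICT[word])
            for word, tag in tagged
            if tag.startswith(("JJ", "VB")) and word in MINED_DICT]

print(spot_sentiment("Your new album is wicked"))   # [('wicked', 'positive')]
```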
  61. Dictionary Coverage Issues: resort to the corpus (transliterations). Co-occurrence in the corpus with top-scoring dictionary entries, e.g. for "tight": "tight-awesome": 456, "tight-sweet": 136, "tight-hot": 429.
  62. Miscellaneous. Presence of an entity improves confidence in the identified sentiment. Short sentences have high coherence, so associate the sentiment with all spotted entities. If no entity is spotted, associate it with the artist whose page we are on ("you are awesome!" on Madonna's page).
  63. Evaluations: mined dictionary + identifying orientation. Table 3 (from the paper) shows the accuracy of the annotator, based on experiments, and illustrates the importance of using transliterations in such corpora:
     Annotator Type / Precision / Recall
     Positive Sentiment / 0.81 / 0.9
     Negative Sentiment / 0.5 / 1.0
     Positive Sentiment excluding transliterations / 0.84 / 0.67
     These results indicate that the syntax and semantics of sentiment expression in informal text is difficult to determine.
  64. Negative sentiments: slang orientations out of context! "you were terrible last night at SNL but I still <3 you!"
  65. Precision excluding corpus transliterations: incorrect NL parses, selectivity.
  66. Crucial Finding: fewer than 4% of sentiment expressions are negative. [Chart: positive vs. negative sentiment expression frequencies in 60k comments over 26 weeks.]
  67. Crucial finding, so... forget about the sentiment annotator! Just use entity mention volumes!
  68. Spam, Off-topic, Promotions. A special type of spam, unrelated to the artists' work: Paul McCartney's divorce, Rihanna's abuse, Madge and Jesus. Self-promotions ("check out my new cool sounding tracks..") are in the music domain with similar keywords, and harder to tell apart. Standard spam: "Buy cheap cellphones here.."
  69. Observations (60k comments, 26 weeks). SPAM: 80% have 0 sentiments ("CHECK US OUT!!! ADD US!!! PLZ ADD ME! IF YOU LIKE THESE GUYS ADD US!!!"). NON-SPAM: 50% have at least 3 sentiments ("Your music is really bangin!", "You're a genius! Keep droppin bombs!", "u doin it up 4 real. i really love the album. keep doin wat u do best. u r so bad!", "hey just hittin you up showin love to one of chi-town's own. MADD LOVE."). Common 4-grams pulled out several 'spammy' phrases; phrases go into spam / non-spam buckets. The spam annotator should be aware of other annotator results! (Fig. 8: examples of spam and non-spam comments.)
  70. Spam Elimination: an aggregate function over phrases indicative of spam (4-way annotator agreements) plus rules over previous annotator results, e.g. if a spam phrase, an artist/track name and a positive sentiment were all spotted, the comment is probably not spam.
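A minimal sketch of such a rule in Python; the phrase list is invented (the real phrases were mined from frequent 4-grams with 4-way annotator agreement), and the entity/sentiment inputs are whatever the earlier annotators produced.

```python
# Invented 'spammy' phrase list; the system mined these from common 4-grams.
SPAM_PHRASES = ["add us", "add me", "check us out", "check out my"]

def is_spam(comment: str, entities: list, sentiments: list) -> bool:
    """Phrase rule softened by earlier annotator results: a spam phrase together with
    a spotted artist/track and a positive sentiment is probably not spam."""
    text = comment.lower()
    if not any(phrase in text for phrase in SPAM_PHRASES):
        return False
    if entities and any(orientation == "positive" for _, orientation in sentiments):
        return False              # talks favourably about the music: keep the comment
    return True

print(is_spam("CHECK US OUT!!! ADD US!!! PLZ ADD ME!", [], []))                         # True
print(is_spam("check out Umbrella, loved it!", ["Umbrella"], [("loved", "positive")]))  # False
```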
  71. Performance. Table 4 (spam annotator performance): Spam - precision 0.76, recall 0.8; Non-Spam - precision 0.83, recall 0.88. Errors are directly proportional to previous annotator results (incorrect entity and sentiment spots): when a comment did not have a spam pattern and the first annotator spotted incorrect tracks, the spam annotator interpreted the comment as related to music and classified it as non-spam. Some self-promotions are just clever: "like umbrella, ull love this song. . ."
  72. Do not confuse activity with popularity! Artist popularity rankings changed dramatically after excluding spam; this level of noise could significantly impact the data analysis and ordering of artists if it is not accounted for. Percent spam in 60k comments over 26 weeks: Gorillaz 54%, Coldplay 42%, Lily Allen 40%, Keane 40%, Placebo 39%, Amy Winehouse 38%, Lady Sovereign 37%, Joss Stone 36%.
  73. Workflow: from unstructured chatter to an explorable hypercube.
  74. Hypercube, List Projection: exploring dimensions of popularity. A data hypercube (DB2) built from the structured data in posts (user demographics: location, age, gender; timestamp) and the unstructured data (entity, sentiment, spam flag). The intersection of dimensions gives non-spam comment counts.
  75. Hypercube to One-Dimensional List. Each comment is annotated with a series of tags from unstructured and structured data; the resulting tuple is placed into a star schema in which the primary measure is a count of comments for a given artist. We can aggregate and analyze the hypercube using a variety of multi-dimensional data operations to derive what are essentially custom popular lists for particular musical topics, in addition to the traditional billboard "Top Artist" lists. The measure is
     M : (Age, Gender, Location, Time, Artist, ...) -> count    (1)
     Questions such as "What is hot in New York for 19 year old males?" and "Who are the most popular artists in San Francisco?" translate into roll-up operations that slice the cube and project (marginalize) dimensions:
     L1(X) = Σ_{T,...} M(A = 19, G = M, L = "NewYorkCity", T, X, ...)    (2)
     L2(X) = Σ_{T,A,G,...} M(A, G, L = "SanFrancisco", T, X, ...)    (3)
     where X = name of the artist, T = timestamp, A = age of the commenter, G = gender, L = location. Storing the data this way makes it easy to examine rankings over various time intervals, weight dimensions differently, etc. Once (and if) a total ordering on the list is fixed, this intermediate data staging step might be eliminated. Ultimately we are seeking to generate a "one-dimensional" list which is then used to sort the artists, tracks, etc.
  76. (Same content as slide 75.)
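The roll-ups above boil down to filtering on some dimensions and summing the comment-count measure over the rest. A small pandas sketch of L1 and L2 over an invented fact table (the column names and rows are assumptions, not the actual DB2 schema):

```python
import pandas as pd

# Invented star-schema fact table: one row per non-spam comment mentioning an artist.
facts = pd.DataFrame([
    {"age": 19, "gender": "M", "location": "NewYorkCity", "artist": "Rihanna"},
    {"age": 19, "gender": "M", "location": "NewYorkCity", "artist": "T.I."},
    {"age": 19, "gender": "M", "location": "NewYorkCity", "artist": "Rihanna"},
    {"age": 34, "gender": "F", "location": "SanFrancisco", "artist": "Keane"},
    {"age": 22, "gender": "M", "location": "SanFrancisco", "artist": "Rihanna"},
])

# L1(X): what is hot in New York for 19 year old males
l1 = (facts.query("age == 19 and gender == 'M' and location == 'NewYorkCity'")
           .groupby("artist").size().sort_values(ascending=False))

# L2(X): most popular artists in San Francisco (age, gender and time marginalized)
l2 = (facts.query("location == 'SanFrancisco'")
           .groupby("artist").size().sort_values(ascending=False))

print(l1)
print(l2)
```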
  77. The Word on the Street: crawl, annotate, build the hypercube, project lists. Comparison against Billboard's Top 50 Singles chart during the week of Sept 22-28, 2007, using the (unique) top 45 MySpace artist pages. The structured metadata (artist name, timestamp, etc.) and annotation results (spam/non-spam, sentiment, etc.) were loaded into the hypercube; the data represented by each cell of the cube is the count of comments for a given artist. Table 8 - Billboard's top artists vs. our generated list (top 10):
     Billboard.com: Soulja Boy, Kanye West, Timbaland, Fergie, J. Holiday, 50 Cent, Keyshia Cole, Nickelback, Pink, Colbie Caillat
     MySpace analysis: T.I., Soulja Boy, Fall Out Boy, Rihanna, Keyshia Cole, Avril Lavigne, Timbaland, Pink, 50 Cent, Alicia Keys
  78. Top artists appear in both lists, with several overlaps. Differences reflect artists with a long history/body of work vs. 'up and coming' artists. Predictive power of MySpace: Billboard next week looked a lot like MySpace this week. Teenagers are big music influencers [MediaMark 2004].
  79. Casual Preference Poll Results: "Which list more accurately reflects the artists that were more popular last week?" 75 participants; overall a 2:1 preference for the MySpace list, and 6:1 among younger age groups (8-15 yrs). Annotation statistics (Table 7): 38% of total comments were spam, 61% had positive sentiments, 4% had negative sentiments, 35% had no identifiable sentiments. Challenging traditional polling methods!
  80. Lessons Learned, Miles to Go. Informal (teen-authored) text is not your average blog content. The quality check is not day-to-day spot/comment-level precision but system-level: are we missing a source or artist, is the crawl behaving, how do adjudication techniques handle multiple data sources.. Leveraging SI (social intelligence) for BI (business intelligence) is difficult: validation is key, along with continuous fine-tuning with experts.
