Entity Spotting in Informal Text


Slides for: Context and Domain Knowledge Enhanced Entity Spotting in Informal Text, ISWC 2009



  1. Entity Spotting in Informal Text. Meena Nagarajan with Daniel Gruhl*, Jan Pieper*, Christine Robson*, Amit P. Sheth. Kno.e.sis, Wright State University; *IBM Research - Almaden, San Jose, CA. Thursday, October 29, 2009
  2. Tracking Online Popularity. http://www.almaden.ibm.com/cs/projects/iis/sound/
  3. Tracking Online Popularity. http://www.almaden.ibm.com/cs/projects/iis/sound/ • What is the buzz in the online music community? • Ranking and displaying top X music artists, songs, tracks, albums... • Spotting entities, despamming, sentiment identification, aggregation, top X lists...
  4. Spotting music entities in user-generated content in online music forums (MySpace)
  5. Chatter in Online Music Communities. http://knoesis.wright.edu/research/semweb/projects/music/
  6. Goal: semantic annotation of artists, tracks, songs, albums... using the MusicBrainz RDF. Example: "Ohh these sour times... rock!" becomes "Ohh these <track id=574623> sour times </track>... rock!"
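To make the annotation goal above concrete, here is a minimal sketch (not the authors' implementation) of rewriting a comment with a <track id=...> tag. The title-to-id dictionary is an assumption; the id is simply the one shown on the slide.

```python
import re

# Hypothetical title -> MusicBrainz track id mapping; the id below is taken
# from the slide's example, not resolved against the real MusicBrainz dump.
TRACK_IDS = {"sour times": 574623}

def annotate(comment: str) -> str:
    """Wrap known track titles in <track id=...> tags (case-insensitive)."""
    out = comment
    for title, track_id in TRACK_IDS.items():
        pattern = re.compile(re.escape(title), re.IGNORECASE)
        out = pattern.sub(lambda m: f"<track id={track_id}> {m.group(0)} </track>", out)
    return out

print(annotate("Ohh these sour times... rock!"))
# Ohh these <track id=574623> sour times </track>... rock!
```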
  7. Multiple Senses in the Same Domain • 60 songs with "Merry Christmas" • 3,600 songs with "Yesterday" • 195 releases of "American Pie" • 31 artists covering "American Pie" • Example comment: "Caught AMERICAN PIE on cable so much fun!"
  8. Annotating UGC, Other Challenges • Several cultural named entities • artifacts of culture, common words in everyday language • Examples: "LOVED UR MUSIC YESTERDAY!" "Just showing some Love to you" "Madonna you are The Queen to me" "Lily your face lights up when you smile!"
  9. Annotating UGC, Other Challenges • Informal text • slang, abbreviations, misspellings... • indifferent approach to grammar... • Context-dependent terms • Unknown distributions
  10. Our Approach: spotting and subsequent sense disambiguation of spots. "Ohh these sour times... rock!" becomes "Ohh these <track id=574623> sour times </track>... rock!"
  11. Ground Truth Data Set
     • 3 artists: Madonna, Rihanna and Lily Allen (Table 2), selected to be popular enough to draw comment but different enough to provide variety
       • Madonna: an artist with an extensive discography as well as a current album and concert tour
       • Rihanna: a pop singer with recent accolades including a Grammy Award and a very active MySpace presence
       • Lily Allen: an independent artist with song titles that include "Smile," "Alright, Still," "Naive," and "Friday Night," who also generates a fair amount of buzz around her personal life not related to music
     • Entity definitions taken from the MusicBrainz RDF (Figure 1), which includes some but not all common aliases and misspellings
     • 1,858 spots (best case for the naive spotter) over MySpace UGC, obtained by crawling each artist's MySpace page comments and identifying all exact string matches of the artist's song titles; only comments with at least one spot were retained
     • Each spot hand tagged by the 4 authors: adjudicate whether a spot is an entity or not (or inconclusive)
     • Table 3. Manual scoring agreements on naive entity spotter results:
       Artist (spots scored) | Good spots: 100% / 75% agreement | Bad spots: 100% / 75% agreement | Precision (best case for naive spotter)
       Rihanna (615) | 165 / 18 | 351 / 8 | 33%
       Lily (523) | 268 / 42 | 100 / 10 | 73%
       Madonna (720) | 138 / 24 | 503 / 20 | 23%
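A rough sketch of how the ground-truth spots described above could have been collected, assuming exact substring matching of song titles against comment text; the title list, comments, and case handling here are placeholders, not the authors' code.

```python
def exact_spots(comment, song_titles):
    """Return (title, offset) pairs for every exact substring match of a
    song title inside a comment."""
    spots = []
    for title in song_titles:
        start = comment.find(title)
        while start != -1:
            spots.append((title, start))
            start = comment.find(title, start + 1)
    return spots

# Only comments containing at least one spot are retained for hand tagging.
comments = ["Keep your SMILE on!", "Loved the show last night!"]
titles = ["SMILE", "Naive", "Friday Night"]
retained = [c for c in comments if exact_spots(c, titles)]
print(retained)  # the second comment has no spot and is dropped
```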
  12. 12. Experiments and Results Thursday, October 29, 2009 12
  13. Experiments. 1. Lightweight, edit-distance-based entity spotter using all entities from MusicBrainz
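A minimal sketch of such a lightweight, edit-distance-style spotter, using Python's difflib as a stand-in for whatever string-similarity measure and threshold the authors actually used.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.85):
    """Cheap edit-distance-style similarity; the 0.85 threshold is an assumption."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def naive_spot(comment, entity_names, max_window=4):
    """Slide every 1..max_window token window over the comment and keep
    windows that nearly match an entity name from the dictionary."""
    tokens = comment.split()
    spots = []
    for n in range(1, max_window + 1):
        for i in range(len(tokens) - n + 1):
            window = " ".join(tokens[i:i + n])
            for name in entity_names:
                if similar(window, name):
                    spots.append((window, name))
    return spots

# Overlapping candidate spots are expected; disambiguation comes later.
print(naive_spot("This new Merry Christmas tune is so good!",
                 ["Merry Christmas", "Yesterday"]))
```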
  14. Experiments. 1. Naive spotter using all entities from all of MusicBrainz. Example: "This new Merry Christmas tune is so good!" ... but which one? Disambiguate between the 60+ Merry Christmas entries in MusicBrainz
  15. Experiments. 2. Constrain the set of possible entities from MusicBrainz to increase spotting accuracy: constrain using cues from the comment to eliminate alternatives. Example: "This new Merry Christmas tune is so good!"
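One way to read this step: cues in the comment (here just the word "new") prune candidates whose metadata is inconsistent with the cue. The candidate records and the single recency rule below are illustrative assumptions, not the real MusicBrainz entries or the authors' rules.

```python
from datetime import date

# Hypothetical candidate entries; the real system would pull these from MusicBrainz.
candidates = [
    {"track_id": 1, "title": "Merry Christmas", "released": date(1994, 10, 29)},
    {"track_id": 2, "title": "Merry Christmas", "released": date(2008, 11, 14)},
]

def constrain(candidates, comment, today=date(2009, 10, 29)):
    """Drop candidates that conflict with cues in the comment; 'new' is taken
    to mean released within roughly the last year."""
    if "new" in comment.lower().split():
        return [c for c in candidates if (today - c["released"]).days <= 365]
    return candidates

print(constrain(candidates, "This new Merry Christmas tune is so good!"))
```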
  16. Experiments. 3. Eliminate non-music mentions using natural language and domain-specific cues. Example: "Your SMILE rocks!"
  17. Restricted Entity Spotting
  18. Restricted Entity Spotting • Investigating the relationship between the number of entities used and spotting accuracy • Understanding systematic ways of scoping domain models for use in semantic annotation • Experiments to gauge the benefits of implementing particular constraints in annotator systems • e.g., a harder artist-age detector vs. an easier gender detector?
  19. 2a. Random Restrictions
     • From all of MusicBrainz (281,890 artists, 6,220,519 tracks) down to the songs of one artist, for all three artists; the most restricted entity set (one artist's songs) is roughly 0.0001% of the MusicBrainz taxonomy
     • Subsets of artists that are factors of 10 smaller (10%, 1%, etc.); each subset always contains the three actual artists (Madonna, Rihanna and Lily Allen), because we are interested in simulating restrictions that remove invalid artists
     • To rule out selection bias, 200 random draws of artist sets for each set size, a total of 1,200 experiments
     • Figure 2: precision of a naive spotter using differently sized portions of the MusicBrainz taxonomy to spot song titles on the artists' MySpace pages. For each set size all 200 results are plotted and a best-fit line indicates the average precision; the figure is in log-log scale. Precision increases as the set of possible entities shrinks
     • The curves conform to a power-law formula, specifically a Zipf distribution. Zipf's law was originally used to describe the frequency of words in natural language corpora and has since been demonstrated in other corpora, including web searches; song titles in informal English exhibit the same frequency characteristics as plain English
     • In the average case, a domain restriction to 10% of the MusicBrainz RDF results in approximately a 9.8x improvement in the precision of the naive spotter
     • The result is remarkably consistent across all three artists: the R^2 values of the power-law fits are 0.9776, 0.979 and 0.9836, a variation of 0.61%
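The power-law observation can be checked with a straight-line fit in log-log space. The data points below are made up for illustration only (they are not the paper's measurements); the fitting procedure is the point.

```python
import numpy as np

# Illustrative (fraction of taxonomy, precision in %) pairs, not the paper's data.
fraction = np.array([1.0, 0.1, 0.01, 0.001, 0.0001])
precision = np.array([0.003, 0.03, 0.25, 2.4, 22.0])

# A power law p = a * f^b is linear in log-log space: log p = b*log f + log a.
b, log_a = np.polyfit(np.log10(fraction), np.log10(precision), 1)
pred = b * np.log10(fraction) + log_a
residual = np.log10(precision) - pred
r2 = 1 - np.sum(residual**2) / np.sum((np.log10(precision) - np.log10(precision).mean())**2)
print(f"exponent b = {b:.2f}, fit R^2 = {r2:.4f}")
```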
  20. 2b. Real-world Constraints for Restrictions • "Happy 25th Rhi!" (eliminate using artist date of birth, metadata in MusicBrainz) • "ur new album dummy is awesome" (eliminate using album release dates, metadata in MusicBrainz) • Systematic scoping of the RDF • Question: do real-world constraints from metadata reduce the size of the entity spot set in a meaningful way? • Experiments: constraints derived manually and tested for usefulness
  21. Real-world Restrictions over MusicBrainz. Three classes of restrictions are considered, based on career, age and albums. Table 4. The efficacy of various sample restrictions (Key | Count | Restriction):
     Career Length Restrictions (applied to Madonna):
       B | 22 | 80's artists with a recent (within 1 year) album
       C | 154 | First album in 1983
       D | 1,193 | 20-30 year career
     Recent Album Restrictions (applied to Madonna):
       E | 6,491 | Artists who released an album in the past year
       F | 10,501 | Artists who released an album in the past 5 years
     Artist Age Restrictions (applied to Lily Allen):
       H | 112 | Artists born in 1985 with an album in the past 2 years
       J | 284 | Artists born in 1985 (or bands founded in 1985)
       L | 4,780 | Artists or bands under 25 with an album in the past 2 years
       M | 10,187 | Artists or bands under 25 years old
     Number of Albums Restrictions (applied to Lily Allen):
       K | 1,530 | Only one album, released in the past 2 years
       N | 19,809 | Artists with only one album
     Recent Album Restrictions (applied to Rihanna):
       Q | 83 | 3 albums exactly, first album last year
       R | 196 | 3+ albums, first album last year
       S | 1,398 | First album last year
       T | 2,653 | Artists with 3+ albums, one in the past year
       U | 6,491 | Artists who released an album in the past year
     Artist-specific Restrictions (applied to each artist):
       A | 1 | Madonna only
       G | 1 | Lily Allen only
       P | 1 | Rihanna only
       Z | 281,890 | All artists in MusicBrainz
     Example comments mapped to restrictions: D: "I've been your fan for 25 years!"; M: "Happy 25th"
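A sketch of how one of the Table 4 restrictions could be expressed over artist metadata. The record layout is an assumption (MusicBrainz does not store artists as Python dicts), and restriction H (born in 1985, album in the past 2 years) is used only as an example.

```python
from datetime import date

# Toy artist records standing in for MusicBrainz metadata.
artists = [
    {"name": "Lily Allen", "born": 1985,
     "album_dates": [date(2006, 7, 13), date(2009, 2, 4)]},
    {"name": "Some 80s Band", "born": 1958,
     "album_dates": [date(1983, 5, 1)]},
]

def restriction_h(artist, today=date(2009, 10, 29)):
    """Table 4, restriction H: born in 1985 and an album released in the past 2 years."""
    recent_album = any((today - d).days <= 2 * 365 for d in artist["album_dates"])
    return artist["born"] == 1985 and recent_album

restricted = [a for a in artists if restriction_h(a)]
# The naive spotter then only uses song titles of artists in `restricted`.
print([a["name"] for a in restricted])
```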
  22. Real-world Constraints • Applied different constraints to different artists • Reduce the potential entity spot set size • Run the naive spotter • Measure precision
  23. Real-world Constraints: Rihanna (short career, recent album releases, 3 album releases, etc.). Example comments: "I heart your new album", "I love all your 3 albums", "You are most favorite new pop artist". [Figure: precision of the naive spotter (log-log scale) under the Rihanna-specific restrictions of Table 4, from single-restriction artist sets up to the entire MusicBrainz taxonomy]
  24. Real-world Constraints: Madonna and Lily Allen (age restrictions, only one album, last-year releases, extensive career, etc.). [Figures: precision of the naive spotter (log-log scale) under the Madonna-specific and Lily Allen-specific restrictions of Table 4, up to the entire MusicBrainz taxonomy]
  25. Takeaways • Real-world restrictions closely follow the distribution of random restrictions, conforming loosely to a Zipf distribution • This confirms the general effectiveness of limiting domain size, regardless of the restriction • Choosing which constraints to implement is simple: pick whatever is easiest first • use metadata from the model to guide you
  26. Non-music Mentions
  27. Disambiguating Non-music References • UGC on Lily Allen's page about her new track Smile: "Got your new album Smile. Loved it!" vs. "Keep your SMILE on!"
  28. Binary Classification, SVM. Example: "Got your new album Smile. Loved it!" vs. "Keep your SMILE on!"
     Training data: 550 good spots, 550 bad spots. Test data: 120 good spots, 229 bad spots.
     Table 6. Features used by the SVM learner:
     Syntactic features (Notation-S):
       s.POS + : POS tag of s
       s.POSb : POS tag of one token before s
       s.POSa : POS tag of one token after s
       s.POS-TDsent * : Typed dependency between s and a sentiment word
       s.POS-TDdom * : Typed dependency between s and a domain-specific term
       s.B-TDsent * : Boolean typed dependency between s and a sentiment word
       s.B-TDdom * : Boolean typed dependency between s and a domain-specific term
     Word-level features (Notation-W):
       s.allCaps + : Capitalization of spot s
       s.firstCaps + : Capitalization of first letter of s
       s.inQuotes + : s in quotes
     Domain-specific features (Notation-D):
       s.Ssent : Sentiment expression in the same sentence as s
       s.Csent : Sentiment expression elsewhere in the comment
       s.Sdom : Domain-related term in the same sentence as s
       s.Cdom : Domain-related term elsewhere in the comment
     + refers to basic features, others are advanced features; * these features apply only to one-word-long spots
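A minimal sketch of the word-level and domain-specific features feeding a binary SVM, assuming scikit-learn. The sentiment and domain word lists, the naive sentence split, and the tiny training set are placeholders, and the POS and typed-dependency features from Table 6 are omitted.

```python
from sklearn.svm import SVC

SENTIMENT = {"love", "loved", "like", "awesome", "rocks"}   # assumed lexicon
DOMAIN = {"album", "track", "song", "tune", "cd"}           # assumed lexicon

def features(spot, comment):
    """Word-level and domain-specific features from Table 6 (syntactic
    features such as POS tags and typed dependencies are left out)."""
    sentences = [s for s in comment.split(".") if spot.lower() in s.lower()]
    same_sentence = " ".join(sentences).lower().split()
    whole = comment.lower().split()
    return [
        int(spot.isupper()),                              # s.allCaps
        int(spot[:1].isupper() and not spot.isupper()),   # s.firstCaps
        int(f'"{spot}"' in comment),                      # s.inQuotes
        int(any(w in SENTIMENT for w in same_sentence)),  # s.Ssent
        int(any(w in SENTIMENT for w in whole)),          # s.Csent
        int(any(w in DOMAIN for w in same_sentence)),     # s.Sdom
        int(any(w in DOMAIN for w in whole)),             # s.Cdom
    ]

# Tiny stand-in for the 550 good / 550 bad training spots.
X = [features("Smile", "Got your new album Smile. Loved it!"),
     features("SMILE", "Keep your SMILE on!")]
y = [1, 0]  # 1 = music entity, 0 = non-music mention
clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([features("Smile", "Your new track Smile is great")]))
```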
  29. Most Useful Feature Combinations • FP best (precision-intensive): all features, other combinations (42-91) • TP next best: word, domain, contextual (POS) (78-50) • TP best (recall-intensive): word, domain, contextual (90-35) • Not all syntactic features are useless for informal text, contrary to general belief
  30. Naive MB Spotter + NLP • Annotate using the naive spotter: best-case baseline (artist is known) • Follow with NLP analytics to weed out false positives; run them on less than the entire input data • Precision/recall tradeoffs: choose feature combinations depending on end-application requirements. [Figure: classifier accuracy across feature combinations, showing precision for Lily Allen, Rihanna and Madonna and overall recall]
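Putting the pipeline from this slide together in skeleton form: the cheap spotter runs over every comment, the classifier only over comments that actually contain spots. The stub spotter and classifier below are placeholders for the MusicBrainz-backed spotter and the SVM sketched earlier.

```python
def annotate_comments(comments, spotter, classifier):
    """Two-stage pipeline: spot first, then classify only where spots exist,
    so the heavier NLP step runs on less than the entire input."""
    annotated = []
    for comment in comments:
        spots = spotter(comment)
        if not spots:
            continue
        kept = [s for s in spots if classifier(s, comment)]
        annotated.append((comment, kept))
    return annotated

# Stub components so the skeleton runs standalone.
def toy_spotter(comment):
    return [t for t in ("Smile", "SMILE") if t in comment]

def toy_classifier(spot, comment):
    return "album" in comment.lower() or "track" in comment.lower()

print(annotate_comments(["Got your new album Smile. Loved it!",
                         "Keep your SMILE on!"],
                        toy_spotter, toy_classifier))
```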
  31. Summary • Real-time, large-scale data processing prohibits computationally intensive NLP techniques • Simple, inexpensive NL learners over a dictionary-based naive spotter can yield reasonable performance • Restricting the taxonomy results in proportionally higher precision • Spot + disambiguate is a feasible approach for (especially cultural) NER in informal text
  32. Thank You! • Bing, Yahoo, Google: Meena Nagarajan • Contact us: {dgruhl, jhpieper, crobson}@us.ibm.com, {meena, amit}@knoesis.org • More about this work: http://www.almaden.ibm.com/cs/projects/iis/sound/ • http://knoesis.wright.edu/researchers/meena
