Surfacing Real-World Event Content on Twitter

Talk given at Google NYC on October 15th, 2010.

Published in: Technology, News & Politics

Transcript

  • 1. SURFACING REAL-WORLD EVENT CONTENT ON TWITTER. Hila Becker, Luis Gravano (Columbia University); Mor Naaman (Rutgers University)
  • 2. Event Content in Social Media
  • 3. Event Content in Social Media
      • Smaller events, without traditional news coverage
      • Popular, widely known events
  • 4. Event Content in Social Media
      • Discovery
          • Detect events using features of social media content (e.g., term statistics)
          • Mining content from known event sources (e.g., user-contributed event databases)
      • Organization
          • Associating social media content with events
          • Identifying similar content within and across sites
      • Presentation
          • Selecting what content to display to a user
          • Providing interfaces that summarize and aggregate the content along different dimensions
  • 7. Identifying Events in Social Media
      • Timeliness
          • Real-time
          • Retrospective
          • (Prospective)
      • Content discovery
          • Known properties
              • Event databases (e.g., Upcoming, Eventful)
              • Keyword triggers (e.g., “earthquake”)
              • Shared calendars
          • Unknown properties
  • 9-13. Identifying Events in Social Media [diagram built up over five slides: related work placed on two axes, Timeliness (Real-time vs. Retrospective) and Content Discovery (Known vs. Unknown)]
      • Twitter new event detection [Petrović et al. NAACL’10]
      • Event detection on Flickr [Chen and Roy CIKM’09]
      • Earthquake prediction using Twitter [Sakaki et al. WWW’10]
      • Organization of YouTube concert videos [Kennedy and Naaman WWW’09]
  • 14-18. Identifying Events in Social Media [diagram built up over five slides: our work placed on two axes, Timeliness (Real-time vs. Retrospective) and Content Discovery (Known vs. Unknown)]
      • Surfacing events on Twitter
      • Learning similarity metrics for event identification on Flickr [Becker et al. WSDM’10]
      • Identifying Twitter content for planned events
      • Connecting events across sites (e.g., YouTube, Picasa)
  • 19. Twitter Content
      • Streams of textual messages
      • Brief content (140 characters)
      • Communicated to network of followers
  • 20. Twitter Trending Topics [screenshot: Twitter trending topics, September 24, 2010, 7:00am]
  • 21. Twitter Trending Topics [screenshot: Twitter trending topics, September 24, 2010, 7:00am]
      • Recurring
      • Twitter-centric
      • Confusing
      • Real-World Events?
  • 26. Identifying Events on Twitter
      • Challenges:
          • Wide variety of topics, not all related to events (e.g., morning greetings, “thank you” messages)
          • Low quality text: abbreviations, unconventional language, riddled with typos, grammatically incorrect
      • Opportunities:
          • Content generated in real-time as events happen
          • Time and location information
  • 28. Events on Twitter
      • Types of events on Twitter
          • Exogenous: Real-world occurrences (e.g., Super Bowl, “Lost” finale)
          • Endogenous: Specific to the Twitter-verse (e.g., #thingsyoushouldntsay meme, RT statement by Lady Gaga)
      • Event:
          • One or more terms and a time period
          • Volume of messages posted for the terms in the time period exceeds some expected level of activity
  • 30. Real-Time Unsupervised Event Identification on Twitter
      • Organization
          • Content representation: text, time, location
          • Group similar content via clustering
      • Discovery
          • Extract discriminating features of clusters
          • Build an event classifier
      • Presentation
          • Select content for each event
          • Evaluate the quality, relevance, and usefulness
  • 34-41. Surfacing Event Content on Twitter [pipeline diagram built up over eight slides: Tweets → Tweet Clusters → Event Clusters → Selected Tweets]
  • 42. Organizing Tweets in Real-Time
      • Order tweets by post time
      • Use TF-IDF vector representation of textual content
          • Stop word elimination
          • Stemming
          • Enhanced weight for hashtags (#tag)
          • IDF computed over past data
      • Separate tweets by location
          • Focus on tweets from NYC
          • Different locations can be processed in parallel
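The tweet representation described on the "Organizing Tweets in Real-Time" slide could be sketched roughly as follows. This is only an illustration under stated assumptions: the stop-word list is a toy one, the hashtag boost of 2.0 is a hypothetical value (the talk does not give the actual weight), stemming is left out to keep the example dependency-free, and the function names are invented for this sketch.

```python
import math
import re
from collections import Counter

# Assumptions for illustration only: a tiny stop-word list and a made-up boost.
STOP_WORDS = {"a", "an", "the", "to", "of", "in", "on", "at", "is", "and", "for", "my", "i"}
HASHTAG_BOOST = 2.0  # hypothetical; the slides only say "enhanced weight"

TOKEN_RE = re.compile(r"#?\w+")


def tokenize(text):
    """Lowercase, split into words and hashtags, and drop stop words."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t.lstrip("#") not in STOP_WORDS]


def idf_from_history(past_tweets):
    """Compute a smoothed IDF table over past data; returns (idf, n_docs)."""
    n_docs = len(past_tweets)
    doc_freq = Counter()
    for text in past_tweets:
        doc_freq.update(set(tokenize(text)))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return idf, n_docs


def tfidf_vector(text, idf, n_docs):
    """Return an L2-normalized sparse TF-IDF vector with boosted hashtags."""
    default_idf = math.log(1 + n_docs) + 1.0  # term never seen in past data
    vec = {}
    for term, freq in Counter(tokenize(text)).items():
        weight = freq * idf.get(term, default_idf)
        if term.startswith("#"):
            weight *= HASHTAG_BOOST
        vec[term] = weight
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}
```

Because the IDF table comes only from past data, a new tweet can be vectorized as soon as it arrives, which is what makes the representation usable in a real-time setting.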
  • 45. Clustering Algorithm
      • Many alternatives possible! [Berkhin 2002]
      • Single-pass incremental clustering algorithm
          • Scalable, online solution
          • Used effectively for
              • Event identification in textual news [Allan et al. 1998]
              • News event detection on Twitter [Sankaranarayanan et al. 2009]
          • Does not require a priori knowledge of number of clusters
          • Known fragmentation issue, often solved with a periodic second pass
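A single-pass incremental clustering loop of the kind named on the "Clustering Algorithm" slide might look like the sketch below. It assumes sparse vectors stored as dicts, cosine similarity against each cluster's summed vector, and a hypothetical similarity threshold of 0.5; none of these specifics are given in the talk, and the periodic second pass for fragmentation is not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors (dicts mapping term -> weight)."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class Cluster:
    """Keeps the sum of member vectors; cosine is scale-invariant, so the
    sum can stand in for the centroid (mean) when comparing."""

    def __init__(self, vec):
        self.total = dict(vec)
        self.size = 1

    def add(self, vec):
        for term, weight in vec.items():
            self.total[term] = self.total.get(term, 0.0) + weight
        self.size += 1


def single_pass_cluster(vectors, threshold=0.5):
    """Assign each vector, in arrival order, to its most similar cluster,
    or start a new cluster if no similarity reaches the threshold."""
    clusters, assignments = [], []
    for vec in vectors:
        best_idx, best_sim = -1, 0.0
        for idx, cluster in enumerate(clusters):
            sim = cosine(vec, cluster.total)
            if sim > best_sim:
                best_idx, best_sim = idx, sim
        if best_idx < 0 or best_sim < threshold:
            clusters.append(Cluster(vec))
            best_idx = len(clusters) - 1
        else:
            clusters[best_idx].add(vec)
        assignments.append(best_idx)
    return assignments, clusters
```

Each tweet is compared once against the existing clusters and never revisited, so the number of clusters does not need to be known in advance.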
  • 47. Overview of Cluster-based Approach
      • Group similar tweets via online clustering
      • Compute statistics of cluster content
          • Top terms (e.g., [earthquake, haiti])
          • Number of documents per hour
          • …
      • Use cluster-level features to identify event clusters
          • Single feature with threshold (e.g., increase in volume over time-window)
          • Trained classification model
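The cluster statistics listed on the "Overview of Cluster-based Approach" slide can be illustrated with a short sketch. The three-hour baseline window and the exact ratio used for "increase in volume" are assumptions made for this example; the talk does not specify them.

```python
from collections import Counter


def top_terms(cluster_tokens, k=5):
    """Most frequent terms across all tokenized tweets in a cluster."""
    counts = Counter(term for tokens in cluster_tokens for term in tokens)
    return [term for term, _ in counts.most_common(k)]


def docs_per_hour(timestamps_sec):
    """Number of documents per hour bucket, from timestamps in seconds."""
    return Counter(int(ts // 3600) for ts in timestamps_sec)


def volume_increase(hourly, current_hour, window=3):
    """Ratio of the current hour's volume to the mean of the preceding
    `window` hours (hypothetical single-feature event signal)."""
    previous = [hourly.get(current_hour - i, 0) for i in range(1, window + 1)]
    baseline = sum(previous) / window
    return hourly.get(current_hour, 0) / max(baseline, 1.0)
```

A single-feature detector would flag a cluster when `volume_increase` exceeds a chosen threshold; the trained classifier instead combines several such features.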
  • 51. Social Interaction Features
      • Retweets
          • RT @username
          • Often characterize Twitter-specific events
      • Replies
          • Tweet starts with @username
          • Possible indication of non-event content
      • Mentions
          • @username anywhere in the tweet
          • Reference to Twitter users that might be part of an event
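The three social interaction signals can be extracted with simple pattern matching. The sketch below assumes plain tweet text as input and reports each signal as a fraction of the cluster; the regular expressions and feature names are illustrative choices, not the ones used in the talk.

```python
import re

RETWEET_RE = re.compile(r"\bRT @\w+")  # "RT @username" anywhere in the tweet
REPLY_RE = re.compile(r"^@\w+")        # tweet starts with "@username"
MENTION_RE = re.compile(r"@\w+")       # "@username" anywhere in the tweet


def interaction_features(tweets):
    """Fraction of tweets in a cluster that are retweets, replies, or
    contain a mention. Mentions follow the slide's definition, so they
    also count tweets already counted as retweets or replies."""
    n = len(tweets)
    if n == 0:
        return {"pct_retweets": 0.0, "pct_replies": 0.0, "pct_mentions": 0.0}
    retweets = sum(1 for t in tweets if RETWEET_RE.search(t))
    replies = sum(1 for t in tweets if REPLY_RE.match(t.lstrip()))
    mentions = sum(1 for t in tweets if MENTION_RE.search(t))
    return {
        "pct_retweets": retweets / n,
        "pct_replies": replies / n,
        "pct_mentions": mentions / n,
    }
```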
  • 54. Topic Coherence
      • Intuition: clusters with strong inter-document similarity may contain event information
      • Example cluster (terms: Class, Today, Early, Work, Sleep, Start)
          • I’m gonna do my best to go sleep during all my classes today =)
          • Starting work early today. Looking fwd to cooking class tonight!
          • Today starts the rest of my life…
      • Example cluster (terms: Katie, Couric, President, Obama, Interview, CBS)
          • Katie Couric Interview With President Obama http://bit.ly/bRsGPo
          • The Katie Couric-President Obama interview has now begun on CBS
          • Katie Couric interviews President Obama during CBS' Super Bowl pregame coverage
  • 55. Trending Behavior
      • Trending characteristics of top terms in cluster:
          • Exponential fit
          • Deviation from expected volume
      • [plot: volume over time for the term “valentine”; x-axis: time (hours), y-axis: documents]
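The two trending characteristics could be computed along the following lines. This is a sketch under assumptions: the exponential fit is done as a least-squares line through log(count + 1), and the deviation is expressed as a z-score against past hourly counts. The talk does not state which fitting procedure or deviation measure was actually used.

```python
import math
from statistics import mean, pstdev


def exponential_growth_rate(counts):
    """Slope b of the least-squares fit log(count + 1) ~ a + b * t,
    where t is the hour index. Larger b means faster exponential growth."""
    n = len(counts)
    if n < 2:
        return 0.0
    ys = [math.log(c + 1) for c in counts]
    t_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    numerator = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys))
    denominator = sum((t - t_mean) ** 2 for t in range(n))
    return numerator / denominator


def deviation_from_expected(history, current):
    """Z-score of the current hourly volume against past hourly volumes
    (an assumed definition of 'deviation from expected volume')."""
    mu = mean(history)
    sigma = pstdev(history)
    return (current - mu) / (sigma if sigma > 0 else 1.0)
```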
  • 56. Twitter-Centric Event Features
      • Tagging behavior
          • Multi-word tags (e.g., #myhomelesssignwouldsay)
          • Percentage of tagged tweets
          • Top term is a tag
          • …
      • Retweeting
          • Percentage of messages with RT @
          • Percentage of messages from top RTed tweet
          • …
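Some of the Twitter-centric features above lend themselves to a short sketch. The version below assumes raw tweet text plus the cluster's top term as input, and treats two retweets as copies of the same original when their text after "RT @username:" matches; that matching rule is an assumption. Multi-word tag detection is omitted because the talk does not say how it was done.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#\w+")
RT_RE = re.compile(r"\bRT @\w+:?\s*(.*)")  # captures the retweeted text


def twitter_centric_features(tweets, top_term):
    """Tagging and retweeting features for one cluster of tweets."""
    n = len(tweets)
    tagged = sum(1 for t in tweets if HASHTAG_RE.search(t))
    # Normalized body of each retweet, used to find the most retweeted message.
    rt_bodies = [m.group(1).strip().lower() for m in map(RT_RE.search, tweets) if m]
    top_rt_count = Counter(rt_bodies).most_common(1)[0][1] if rt_bodies else 0
    return {
        "pct_tagged": tagged / n if n else 0.0,
        "top_term_is_tag": top_term.startswith("#"),
        "pct_retweets": len(rt_bodies) / n if n else 0.0,
        "pct_top_retweet": top_rt_count / n if n else 0.0,
    }
```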
  • 58. Event Classifier
      • Use features to build a classifier
          • Human-annotated training data
          • SVM model (selected during training phase)
      • Alternative classification modes:
          • RW-Event: real-world event vs. rest
          • TC-Event: event (real-world or Twitter-centric) vs. non-event
  • 62. Event Content Selection Tiger Woods Apology
  • 63. Event Content Selection
      • Event: Tiger Woods Apology
      • Tweets in the cluster:
          • Tiger Woods to make a public apology Friday and talk about his future in golf.
          • Tiger Woods Returns To Golf - Public Apology http://bit.ly/9Ui5jx
          • Tiger woods y'all,tiger woods y'all,ah tiger woods y'all
          • Tiger Woods Hugs: http://tinyurl.com/yhf4uzw
          • Wedge wars upstage Watson v Woods: BBC Sport (blog)
  • 65. Event Content Selection
      • Challenges:
          • Clusters contain noise
          • Relevant tweets might have poor quality text
          • Relevant, high quality tweets might not be interesting
      • For each tweet and a given event evaluate
          • Quality
          • Relevance
          • Usefulness
  • 67. Centrality Based Tweet Selection
      • Centroid
          • Cosine similarity of each tweet to cluster centroid
      • Degree
          • Tweets are nodes
          • Tweets are connected if their similarity is above a threshold
          • Compute degree centrality of each node
      • LexRank [Erkan and Radev 2004]
          • Same graph structure as Degree method
          • Central tweets are similar to other central tweets
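The three centrality methods can be compared side by side in a compact sketch. It assumes sparse dict vectors and a hypothetical similarity threshold of 0.3. The LexRank function is a simplified PageRank-style power iteration over the thresholded graph without self-loops, with an assumed damping factor of 0.85; it is not a faithful reimplementation of Erkan and Radev's method.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors (dicts mapping term -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid_scores(vectors):
    """Centroid method: similarity of each tweet to the mean vector."""
    centroid = {}
    for vec in vectors:
        for term, weight in vec.items():
            centroid[term] = centroid.get(term, 0.0) + weight / len(vectors)
    return [cosine(vec, centroid) for vec in vectors]


def similarity_graph(vectors, threshold):
    """Adjacency matrix: tweets are connected if similarity >= threshold."""
    n = len(vectors)
    adj = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vectors[i], vectors[j]) >= threshold:
                adj[i][j] = adj[j][i] = True
    return adj


def degree_scores(vectors, threshold=0.3):
    """Degree method: number of neighbors of each tweet in the graph."""
    return [sum(row) for row in similarity_graph(vectors, threshold)]


def lexrank_scores(vectors, threshold=0.3, damping=0.85, iterations=50):
    """Simplified LexRank: a tweet scores highly when its neighbors do."""
    n = len(vectors)
    if n == 0:
        return []
    adj = similarity_graph(vectors, threshold)
    degree = [sum(row) for row in adj]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[j] / degree[j] for j in range(n) if adj[j][i])
            for i in range(n)
        ]
    return scores
```

For each method, the top-scoring tweets in a cluster would be the ones selected for display.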
  • 70. Experimental Setup: Data
      • >2,600,000 tweets, collected via Twitter API
      • Location: New York City area
          • Indicated on user profile
      • Time: February 2010
          • First week used to calibrate statistics
          • Second week used for training/validation
          • Third and fourth weeks used for testing
  • 73. Experimental Setup: Training
      • Data:
          • 504 clusters
          • Fastest growing clusters/hour in second week of February 2010
      • Labels:
          • Real-world event (e.g., [superbowl,colts,saints,sb44])
          • Twitter-specific event (e.g., [uknowubrokewhen,money,job])
          • Non-event (e.g., [happy,love,lol])
          • Ambiguous cluster (e.g., [south,park,west,sxsw,cartman])
  • 75. Experimental Setup: Testing
      • Baselines:
          • Naïve Bayes text classification (NB-Text)
          • Fastest-growing clusters per hour
      • Classifiers:
          • RW-Event
          • TC-Event
      • 400 clusters
          • 5 hours
          • Top 20 clusters per hour according to RW-Event, TC-Event, Fastest-growing, random
  • 78. Experimental Methodology: Event Classification
      • Classification accuracy
          • 10-fold cross validation
          • Separate test set of randomly chosen tweets
      • Event surfacing
          • Top events per hour for each technique
          • Evaluation:
              • Precision@K
              • NDCG@K
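For reference, the two ranking metrics named above can be computed as in the sketch below. It uses the common log2 discount for DCG and normalizes against the ideal ordering of the same judged list; the talk does not spell out which DCG variant or normalization was used, so those are assumptions.

```python
import math


def precision_at_k(relevance, k):
    """Fraction of the top-k ranked items that are relevant.
    `relevance` holds one judgment per ranked item (0 = not relevant)."""
    return sum(1 for r in relevance[:k] if r > 0) / k


def dcg_at_k(gains, k):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(gains, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```

Precision@K ignores where in the top K the relevant clusters appear, while NDCG@K rewards rankings that place them earlier.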
  • 80. Identified Events (a sample of events identified by our classifiers on the test set)

      Description                          Keywords
      Senator Evan Bayh's Retirement       bayh, evan, senate, congress, retire
      Westminster Dog Show                 westminster, dog, show, club, kennel
      Obama’s Meeting with the Dalai Lama  lama, dalai, meet, obama, china
      NYC Toy Fair                         toyfairny, starwars, hasbro, lego, toy
      Marc Jacobs Fashion Show             jacobs, marc, nyfw, show, fashion

  • 81. Classification Performance (F-measure)
      • RW-Event classifier is more effective at discriminating between real-world events and rest of Twitter data

      Classifier  Validation  Test
      NB-Text     0.785       0.702
      RW-Event    0.849       0.837
      TC-Event    0.875       0.789

  • 82. Precision@K Evaluation [plot: Precision (0 to 1) vs. Number of Clusters (K = 5, 10, 15, 20); series: RW-Event, TC-Event, Fastest, Random]
  • 83. NDCG@K Evaluation [plot: NDCG (0 to 1) vs. Number of Clusters (K = 5, 10, 15, 20); series: RW-Event, TC-Event, Fastest, Random]
  • 84. Experimental Methodology: Content Selection
      • 50 event clusters
          • Randomly selected from test set
      • 5 top tweets per event for each: Centroid, Degree, LexRank
      • Labeled on a 1-4 scale
          • Quality: excellent (4) to poor (1)
          • Relevance: clearly relevant (4) to not relevant (1)
          • Usefulness: clearly useful (4) to not useful (1)
  • 85. Selected Tweets: Example (a sample of tweets selected by different centrality methods)
      • Centroid: Video: Tiger regretful; unsure about return to golf - Main Line ...: (AP) Tiger Woods publicly apologized Friday... http://bit.ly/dAO41N
      • Degree: Watson: Woods needs to show humility upon return (AP): Tom Watson says Tiger Woods needs to "show some humility to... http://bit.ly/cHVH7x
      • LexRank: RT @EricStangel: Tiger Woods statement: And now for Elin's repsonse....
  • 86. Content Selection Results
      • Average scores over all events
      • High quality and relevance (>3) for both Degree and Centroid
      • Centroid is the only method with high usefulness

      Method    Quality  Relevance  Usefulness
      LexRank   3.444    2.984      2.608
      Degree    3.536    3.156      2.802
      Centroid  3.636    3.694      3.474

  • 87. Preferred Method per Event
      • Centroid is the preferred method across all metrics
      • For usefulness, Centroid tweets preferred more than 2:1 compared to Degree, 4:1 compared to LexRank

      Method    Quality  Relevance  Usefulness
      LexRank   22.66%   16.33%     12%
      Degree    31.66%   25.33%     28%
      Centroid  45.66%   58.33%     60%

  • 88. Conclusions
      • Techniques for discovering, organizing, and presenting social media from real-world events
      • Event classifiers
          • Important to capture features of Twitter-specific events in order to reveal the real-world events
          • Effectively surfaced real-world events in an unsupervised setting
      • Content selection
          • Similarity to centroid technique better at selecting event content
      • There is relevant and useful event content on Twitter!
  • 89. Identifying Events in Social Media [diagram: our work placed on two axes, Timeliness (Real-time vs. Retrospective) and Content Discovery (Known vs. Unknown)]
      • Surfacing events on Twitter
      • Learning similarity metrics for event identification on Flickr [Becker et al. WSDM’10]
      • Identifying Twitter content for planned events
      • Connecting events across sites (e.g., YouTube, Picasa)
  • 90-92. Learning Similarity Metrics for Event Identification in Social Media (WSDM ’10) [diagram built up over three slides: Ctitle, Ctags, Ctime with weights Wtitle, Wtags, Wtime (learned in a training step); similarities are combined via f(C,W) to produce the final clustering solution]
  • 93. Identifying Tweets for Known Events
  • 95. Identifying Events in Social Media [diagram: our work placed on two axes, Timeliness (Real-time vs. Retrospective) and Content Discovery (Known vs. Unknown)]
      • Surfacing events on Twitter
      • Learning similarity metrics for event identification on Flickr [Becker et al. WSDM’10]
      • Identifying Twitter content for planned events
      • Connecting events across sites (e.g., YouTube, Picasa)
  • 96. Thank you! Pablo Barrio, David Elson, Dan Iter, Yves Petinot, Sara Rosenthal, Gonçalo Simões, Matt Solomon, Kapil Thadani
