Transcript

  • 1. SURFACING REAL-WORLD EVENT CONTENT ON TWITTER. Hila Becker, Luis Gravano (Columbia University); Mor Naaman (Rutgers University)
  • 3. Event Content in Social Media: popular, widely known events as well as smaller events without traditional news coverage
  • 4. Event Content in Social Media
     - Discovery: detect events using features of social media content (e.g., term statistics); mine content from known event sources (e.g., user-contributed event databases)
     - Organization: associate social media content with events; identify similar content within and across sites
     - Presentation: select what content to display to a user; provide interfaces that summarize and aggregate the content along different dimensions
  • 7. Identifying Events in Social Media
     - Timeliness: real-time, retrospective, (prospective)
     - Content discovery:
        - Known properties: event databases (e.g., Upcoming, Eventful), keyword triggers (e.g., "earthquake"), shared calendars
        - Unknown properties
  • 13. Identifying Events in Social Media: related work along two dimensions, Timeliness (real-time vs. retrospective) and Content Discovery (unknown vs. known properties)
     - Real-time, unknown: Twitter new event detection [Petrović et al. NAACL'10]
     - Retrospective, unknown: Event detection on Flickr [Chen and Roy CIKM'09]
     - Real-time, known: Earthquake prediction using Twitter [Sakaki et al. WWW'10]
     - Retrospective, known: Organization of YouTube concert videos [Kennedy and Naaman WWW'09]
  • 18. Identifying Events in Social Media: this work in the same Timeliness × Content Discovery matrix
     - Real-time, unknown: Surfacing events on Twitter (this talk)
     - Retrospective, unknown: Learning similarity metrics for event identification on Flickr [Becker et al. WSDM'10]
     - Real-time, known: Identifying Twitter content for planned events
     - Retrospective, known: Connecting events across sites (e.g., YouTube, Picasa)
  • 19. Twitter Content
     - Streams of textual messages
     - Brief content (140 characters)
     - Communicated to a network of followers
  • 25. Twitter Trending Topics (Twitter trending topics, September 24, 2010, 7:00am)
     - Recurring
     - Twitter-centric
     - Confusing
     - Real-world events?
  • 27. Identifying Events on Twitter
     - Challenges: a wide variety of topics, not all related to events (e.g., morning greetings, "thank you" messages); low-quality text with abbreviations, unconventional language, typos, and incorrect grammar
     - Opportunities: content generated in real time as events happen; time and location information
  • 29. Events on Twitter
     - Types of events on Twitter:
        - Exogenous: real-world occurrences (e.g., the Super Bowl, the "Lost" finale)
        - Endogenous: specific to the Twitter-verse (e.g., the #thingsyoushouldntsay meme, an RT of a statement by Lady Gaga)
     - Event definition: one or more terms and a time period, where the volume of messages posted for the terms in the time period exceeds some expected level of activity
  • 33. Real-Time Unsupervised Event Identification on Twitter
     - Organization: content representation (text, time, location); group similar content via clustering
     - Discovery: extract discriminating features of clusters; build an event classifier
     - Presentation: select content for each event; evaluate its quality, relevance, and usefulness
  • 41. Surfacing Event Content on Twitter: pipeline from Tweets to Tweet Clusters to Event Clusters to Selected Tweets
  • 44. Organizing Tweets in Real-Time
     - Order tweets by post time
     - Use a TF-IDF vector representation of the textual content: stop-word elimination, stemming, enhanced weight for hashtags (#tag), IDF computed over past data
     - Separate tweets by location: focus on tweets from NYC; different locations can be processed in parallel
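The representation above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stop-word list, the hashtag boost factor, and the IDF smoothing are assumptions, and stemming is omitted for brevity.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "to", "in", "is", "at"}  # tiny illustrative list
HASHTAG_BOOST = 2.0  # assumed multiplier for the "enhanced weight for hashtags"

def tokenize(text):
    # keep #hashtags as single tokens; drop stop words (stemming omitted here)
    return [t for t in re.findall(r"#?\w+", text.lower()) if t not in STOP_WORDS]

def build_idf(past_tweets):
    # IDF computed over past data, as on the slide
    n = len(past_tweets)
    df = Counter()
    for t in past_tweets:
        df.update(set(tokenize(t)))
    return {term: math.log(n / d) + 1.0 for term, d in df.items()}

def tfidf_vector(tweet, idf):
    counts = Counter(tokenize(tweet))
    vec = {}
    for term, tf in counts.items():
        weight = tf * idf.get(term, 1.0)
        if term.startswith("#"):  # hashtags get extra weight
            weight *= HASHTAG_BOOST
        vec[term] = weight
    return vec
```

For example, with past tweets ["snow storm in nyc", "#snow day in nyc today"], the term "#snow" in a new tweet receives twice the weight of the plain term "snow".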
  • 46. Clustering Algorithm
     - Many alternatives possible! [Berkhin 2002]
     - Single-pass incremental clustering algorithm: a scalable, online solution
     - Used effectively for event identification in textual news [Allan et al. 1998] and news event detection on Twitter [Sankaranarayanan et al. 2009]
     - Does not require a priori knowledge of the number of clusters
     - Known fragmentation issue, often solved with a periodic second pass
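The single-pass scheme can be sketched as below: each tweet vector joins its most similar existing cluster if the similarity clears a threshold, and otherwise starts a new cluster. The threshold value and the incremental centroid update are illustrative assumptions, not the paper's tuned settings.

```python
import math

THRESHOLD = 0.3  # assumed similarity threshold

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class Cluster:
    def __init__(self, vec):
        self.centroid = dict(vec)
        self.size = 1

    def add(self, vec):
        # running average of term weights keeps the centroid incremental
        self.size += 1
        for t in set(self.centroid) | set(vec):
            self.centroid[t] = (self.centroid.get(t, 0.0) * (self.size - 1)
                                + vec.get(t, 0.0)) / self.size

def single_pass(vectors):
    clusters = []
    for vec in vectors:  # tweets arrive ordered by post time
        best, best_sim = None, 0.0
        for c in clusters:
            sim = cosine(vec, c.centroid)
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= THRESHOLD:
            best.add(vec)
        else:
            clusters.append(Cluster(vec))
    return clusters
```

Because each tweet is compared only against current centroids, the pass is online and scalable; the fragmentation issue the slide mentions arises when near-duplicate clusters form before their centroids drift together.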
  • 49. Overview of Cluster-based Approach
     - Group similar tweets via online clustering
     - Compute statistics of cluster content: top terms (e.g., [earthquake, haiti]), number of documents per hour, …
     - Use cluster-level features to identify event clusters: a single feature with a threshold (e.g., increase in volume over a time window), or a trained classification model
  • 50. Real-Time Unsupervised Event Identification on Twitter (outline recap)
  • 53. Social Interaction Features
     - Retweets (RT @username): often characterize Twitter-specific events
     - Replies (tweet starts with @username): a possible indication of non-event content
     - Mentions (@username anywhere in the tweet): references to Twitter users that might be part of an event
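These three signals can be extracted with simple text patterns, sketched below. Real Twitter API data exposes them as metadata fields; the regex-on-text approach here is a plain-text fallback for illustration.

```python
import re

MENTION = re.compile(r"@\w+")

def social_features(tweet):
    # flags for the three interaction types described on the slide
    return {
        "is_retweet": tweet.startswith("RT @") or " RT @" in tweet,
        "is_reply": tweet.startswith("@"),
        "n_mentions": len(MENTION.findall(tweet)),
    }
```

Aggregated over a cluster (e.g., the fraction of retweets or replies), these per-tweet flags become the cluster-level features fed to the classifier.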
  • 54. Topic Coherence
     - Intuition: clusters with strong inter-document similarity may contain event information
     - Example: tweets about the Katie Couric interview with President Obama on CBS share terms (Katie, Couric, President, Obama, Interview, CBS, Today), while unrelated tweets ("I'm gonna do my best to go sleep during all my classes =)", "Starting work early today. Looking fwd to cooking class tonight!", "Today starts the rest of my life…") share only generic terms (class, today, early, work, sleep, start)
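One simple way to operationalize this intuition, shown here as an assumed sketch rather than the paper's exact feature, is the average pairwise cosine similarity of the cluster's tweet vectors: the interview tweets score high because they share vocabulary, the unrelated tweets score near zero.

```python
import math

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def coherence(vectors):
    # average cosine similarity over all distinct pairs of tweet vectors
    pairs = [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    if not pairs:
        return 0.0
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)
```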
  • 55. Trending Behavior
     - Trending characteristics of the top terms in a cluster: exponential fit, deviation from expected volume
     - [Figure: documents over time (hours); volume over time for the term "valentine"]
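The "deviation from expected volume" signal can be sketched as follows: an exponential moving average over past hourly counts serves as the expected volume, and a large positive deviation in the latest hour marks trending behavior. The smoothing factor and deviation ratio are illustrative assumptions, not the paper's fitted parameters.

```python
ALPHA = 0.5       # smoothing factor for the moving average (assumed)
TREND_RATIO = 3   # latest volume must exceed 3x the expectation (assumed)

def is_trending(hourly_counts):
    # hourly_counts: message counts for a term, oldest first
    if len(hourly_counts) < 2:
        return False
    expected = hourly_counts[0]
    for count in hourly_counts[1:-1]:
        expected = ALPHA * count + (1 - ALPHA) * expected
    return hourly_counts[-1] > TREND_RATIO * expected
```

A term like "valentine" would trip this check in the hours around February 14 and fall back below the threshold afterwards.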
  • 57. Twitter-Centric Event Features
     - Tagging behavior: multi-word tags (e.g., #myhomelesssignwouldsay), percentage of tagged tweets, whether the top term is a tag, …
     - Retweeting: percentage of messages with RT @, percentage of messages from the top retweeted tweet, …
  • 58. Event Classifier
     - Use the features to build a classifier: human-annotated training data; SVM model (selected during the training phase)
     - Alternative classification modes: Exo-Event (real-world event vs. rest) and Endo+Exo-Event (event, real-world or Twitter-centric, vs. non-event)
  • 61. Real-Time Unsupervised Event Identification on Twitter (outline recap)
  • 64. Event Content Selection: the Tiger Woods Apology cluster mixes informative and noisy tweets
     - "Tiger Woods to make a public apology Friday and talk about his future in golf."
     - "Tiger woods y'all, tiger woods y'all, ah tiger woods y'all"
     - "Tiger Woods Returns To Golf - Public Apology http://tinyurl.com/yhf4uzw"
     - "Tiger Woods Hugs: Apology http://bit.ly/9Ui5jx"
     - "Wedge wars upstage Watson v Woods: BBC Sport (blog)"
  • 66. Event Content Selection
     - Challenges: clusters contain noise; relevant tweets might have poor-quality text; relevant, high-quality tweets might not be interesting
     - For each tweet and a given event, evaluate quality, relevance, and usefulness
  • 69. Centrality-Based Tweet Selection
     - Centroid: cosine similarity of each tweet to the cluster centroid
     - Degree: tweets are nodes, connected if their similarity is above a threshold; compute the degree centrality of each node
     - LexRank [Erkan and Radev 2004]: same graph structure as the Degree method; central tweets are similar to other central tweets
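The first two methods can be sketched directly; LexRank additionally requires a power-iteration over the similarity graph and is omitted here. Tweet vectors are sparse dicts, and the Degree threshold is an illustrative assumption.

```python
import math

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid_ranking(vectors):
    # rank tweets by cosine similarity to the cluster centroid
    centroid = {}
    for v in vectors:
        for t, w in v.items():
            centroid[t] = centroid.get(t, 0.0) + w / len(vectors)
    return sorted(range(len(vectors)),
                  key=lambda i: cosine(vectors[i], centroid), reverse=True)

def degree_ranking(vectors, threshold=0.2):
    # rank tweets by the number of sufficiently similar neighbors
    deg = [sum(1 for j, u in enumerate(vectors)
               if j != i and cosine(v, u) >= threshold)
           for i, v in enumerate(vectors)]
    return sorted(range(len(vectors)), key=lambda i: deg[i], reverse=True)
```

Both rankings push an off-topic tweet (one sharing no terms with the rest of the cluster) to the bottom, which matches the intuition behind using centrality for content selection.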
  • 72. Experimental Setup: Data
     - >2,600,000 tweets, collected via the Twitter API
     - Location: New York City area, as indicated on the user profile
     - Time: February 2010; first week used to calibrate statistics, second week for training/validation, third and fourth weeks for testing
  • 74. Experimental Setup: Training
     - Data: 504 clusters, the fastest-growing clusters per hour in the second week of February 2010
     - Labels: real-world event (e.g., [superbowl, colts, saints, sb44]), Twitter-specific event (e.g., [uknowubrokewhen, money, job]), non-event (e.g., [happy, love, lol]), ambiguous cluster (e.g., [south, park, west, sxsw, cartman])
  • 77. Experimental Setup: Testing
     - Baselines: Naïve Bayes text classification (NB-Text); fastest-growing clusters per hour
     - Classifiers: RW-Event, TC-Event
     - 400 clusters over 5 hours
     - Top 20 clusters per hour according to RW-Event, TC-Event, Fastest-growing, and random selection
  • 79. Experimental Methodology: Event Classification
     - Classification accuracy: 10-fold cross validation; separate test set of randomly chosen tweets
     - Event surfacing: top events per hour for each technique
     - Evaluation: Precision@K, NDCG@K
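The two evaluation metrics can be sketched as follows: Precision@K over binary event labels, and NDCG@K over graded relevance with the standard log2 position discount, normalized by the ideal ordering.

```python
import math

def precision_at_k(relevances, k):
    # fraction of the top-k results that are relevant (relevance > 0)
    return sum(1 for r in relevances[:k] if r > 0) / k

def dcg_at_k(relevances, k):
    # discounted cumulative gain with log2(position + 1) discount
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal else 0.0
```

A perfectly ordered ranking yields NDCG@K of 1.0; rankings that bury highly relevant clusters lower in the list are penalized by the discount.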
  • 80. Identified Events: a sample of events identified by our classifiers on the test set
     - Senator Evan Bayh's retirement: bayh, evan, senate, congress, retire
     - Westminster Dog Show: westminster, dog, show, club, kennel
     - Obama's meeting with the Dalai Lama: lama, dalai, meet, obama, china
     - NYC Toy Fair: toyfairny, starwars, hasbro, lego, toy
     - Marc Jacobs fashion show: jacobs, marc, nyfw, show, fashion
  • 81. Classification Performance (F-measure)
     - NB-Text: validation 0.785, test 0.702
     - RW-Event: validation 0.849, test 0.837
     - TC-Event: validation 0.875, test 0.789
     - The RW-Event classifier is more effective at discriminating between real-world events and the rest of the Twitter data
  • 82. Precision@K Evaluation [Figure: precision vs. number of clusters K (5 to 20) for RW-Event, TC-Event, Fastest, and Random]
  • 83. NDCG@K Evaluation [Figure: NDCG vs. number of clusters K (5 to 20) for RW-Event, TC-Event, Fastest, and Random]
  • 84. Experimental Methodology: Content Selection
     - 50 event clusters, randomly selected from the test set
     - 5 top tweets per event for each of Centroid, Degree, and LexRank
     - Labeled on a 1-4 scale: quality (excellent 4 to poor 1), relevance (clearly relevant 4 to not relevant 1), usefulness (clearly useful 4 to not useful 1)
  • 85. Selected Tweets: a sample of tweets selected by the different centrality methods
     - Centroid: "Video: Tiger regretful; unsure about return to golf - Main Line ...: (AP) Tiger Woods publicly apologized Friday... http://bit.ly/dAO41N"
     - Degree: "Watson: Woods needs to show humility upon return (AP): Tom Watson says Tiger Woods needs to 'show some humility to... http://bit.ly/cHVH7x"
     - LexRank: "RT @EricStangel: Tiger Woods statement: And now for Elin's repsonse...."
  • 86. Content Selection Results: average scores over all events
     - LexRank: quality 3.444, relevance 2.984, usefulness 2.608
     - Degree: quality 3.536, relevance 3.156, usefulness 2.802
     - Centroid: quality 3.636, relevance 3.694, usefulness 3.474
     - High quality and relevance (>3) for both Degree and Centroid; Centroid is the only method with high usefulness
  • 87. Preferred Method per Event
     - LexRank: quality 22.66%, relevance 16.33%, usefulness 12%
     - Degree: quality 31.66%, relevance 25.33%, usefulness 28%
     - Centroid: quality 45.66%, relevance 58.33%, usefulness 60%
     - Centroid is the preferred method across all metrics; for usefulness, Centroid tweets are preferred more than 2:1 over Degree and 4:1 over LexRank
  • 88. Conclusions: techniques for discovering, organizing, and presenting social media content from real-world events
     - Event classifiers: capturing features of Twitter-specific events is important for revealing real-world events; effectively surfaced real-world events in an unsupervised setting
     - Content selection: the similarity-to-centroid technique is better at selecting event content
     - There is relevant and useful event content on Twitter!
  • 89. Identifying Events in Social Media (recap of the Timeliness × Content Discovery matrix)
     - Real-time, unknown: Surfacing events on Twitter (this talk)
     - Retrospective, unknown: Learning similarity metrics for event identification on Flickr [Becker et al. WSDM'10]
     - Real-time, known: Identifying Twitter content for planned events
     - Retrospective, known: Connecting events across sites (e.g., YouTube, Picasa)
  • 92. Learning Similarity Metrics for Event Identification in Social Media (WSDM '10)
     - Per-feature similarities (Ctitle, Ctags, Ctime) are combined with weights (Wtitle, Wtags, Wtime) via f(C, W), learned in a training step, to produce the final clustering solution
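The combination step above can be sketched as a weighted sum of per-feature similarities. In the WSDM '10 work the weights are learned from labeled training pairs; the weight values below are purely hypothetical placeholders for illustration.

```python
def combined_similarity(sims, weights):
    # sims and weights keyed by feature name, e.g. "title", "tags", "time"
    return sum(weights[f] * sims[f] for f in sims)

# hypothetical learned weights (illustrative values, not from the paper)
W = {"title": 0.5, "tags": 0.3, "time": 0.2}
```

For a pair of items with title similarity 1.0, tag similarity 0.0, and time similarity 0.5, these weights yield a combined score of 0.6, which the clustering step then compares against its threshold.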
  • 94. Identifying Tweets for Known Events
  • 96. Thank you! Acknowledgments: Pablo Barrio, David Elson, Dan Iter, Yves Petinot, Sara Rosenthal, Gonçalo Simões, Matt Solomon, Kapil Thadani