SURFACING REAL-WORLD EVENT CONTENT ON TWITTER
Hila Becker, Luis Gravano (Columbia University)
Mor Naaman (Rutgers University)
Content Selection Results
  • Average scores over all events
  • High quality and relevance (>3) for both Degree and Centroid
  ...
Preferred Method per Event
  • Centroid is the preferred method across all metrics
  For usefulness, Centroid tweets preferred...
Conclusions
Techniques for discovering, organizing, and presenting social media from real-world events
  • Event classifiers...
Learning Similarity Metrics for Event Identification in Social Media (WSDM ’10)
(figure labels: Ctitle, Ctags, Ctime; Wtitle, Wtags, Wtime; f(...); combine similarities)
Identifying Tweets for Known Events
Thank you!
  • Pablo Barrio
  • David Elson
  • Dan Iter
  • Yves Petinot
  • Sara Rosenthal
  • Gonçalo Simões
  • Matt Solomon
  • Kapil...
Talk given at Google NYC on October 15th, 2010.

Surfacing Real-World Event Content on Twitter

  1. SURFACING REAL-WORLD EVENT CONTENT ON TWITTER
      Hila Becker, Luis Gravano (Columbia University); Mor Naaman (Rutgers University)
  2. Event Content in Social Media
      • Popular, widely known events
      • Smaller events, without traditional news coverage
  4. Event Content in Social Media
      • Discovery
          • Detect events using features of social media content (e.g., term statistics)
          • Mining content from known event sources (e.g., user-contributed event databases)
      • Organization
          • Associating social media content with events
          • Identifying similar content within and across sites
      • Presentation
          • Selecting what content to display to a user
          • Providing interfaces that summarize and aggregate the content along different dimensions
  7. Identifying Events in Social Media
      • Timeliness
          • Real-time
          • Retrospective
          • (Prospective)
      • Content discovery
          • Known properties
              • Event databases (e.g., Upcoming, Eventful)
              • Keyword triggers (e.g., “earthquake”)
              • Shared calendars
          • Unknown properties
  9. Identifying Events in Social Media
      (matrix figure; axes: Timeliness, real-time vs. retrospective; Content Discovery, known vs. unknown)
      Related work placed on the matrix:
      • Earthquake prediction using Twitter [Sakaki et al. WWW’10]
      • Twitter new event detection [Petrović et al. NAACL’10]
      • Event detection on Flickr [Chen and Roy CIKM’09]
      • Organization of YouTube concert videos [Kennedy and Naaman WWW’09]
  14. Identifying Events in Social Media
      (same matrix; axes: Timeliness, real-time vs. retrospective; Content Discovery, known vs. unknown)
      Our work placed on the matrix:
      • Learning similarity metrics for event identification on Flickr [Becker et al. WSDM’10]
      • Surfacing events on Twitter
      • Identifying Twitter content for planned events
      • Connecting events across sites (e.g., YouTube, Picasa)
  19. Twitter Content
      • Streams of textual messages
      • Brief content (140 characters)
      • Communicated to network of followers
  20. Twitter Trending Topics
      (screenshot: Twitter trending topics, September 24, 2010, 7:00am)
      • Recurring
      • Twitter-centric
      • Confusing
      • Real-World Events?
  26. Identifying Events on Twitter
      • Challenges:
          • Wide variety of topics, not all related to events (e.g., morning greetings, “thank you” messages)
          • Low quality text: abbreviations, unconventional language, riddled with typos, grammatically incorrect
      • Opportunities:
          • Content generated in real-time as events happen
          • Time and location information
  28. Events on Twitter
      • Types of events on Twitter
          • Exogenous: Real-world occurrences (e.g., Superbowl, “Lost” finale)
          • Endogenous: Specific to the Twitter-verse (e.g., #thingsyoushouldntsay meme, RT statement by Lady Gaga)
      • Event:
          • One or more terms and a time period
          • Volume of messages posted for the terms in the time period exceeds some expected level of activity
  30. Real-Time Unsupervised Event Identification on Twitter
      • Organization
          • Content representation: text, time, location
          • Group similar content via clustering
      • Discovery
          • Extract discriminating features of clusters
          • Build an event classifier
      • Presentation
          • Select content for each event
          • Evaluate the quality, relevance, and usefulness
  34. Surfacing Event Content on Twitter
      (pipeline figure: Tweets → Tweet Clusters → Event Clusters → Selected Tweets)
  42. Organizing Tweets in Real-Time
      • Order tweets by post time
      • Use TF-IDF vector representation of textual content
          • Stop word elimination
          • Stemming
          • Enhanced weight for hashtags (#tag)
          • IDF computed over past data
      • Separate tweets by location
          • Focus on tweets from NYC
          • Different locations can be processed in parallel
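The TF-IDF representation above could be sketched roughly as follows. This is an illustration only, not the authors' implementation: the stop-word list, the `HASHTAG_BOOST` value, the smoothed IDF formula, and all function names are assumptions, and stemming is omitted to keep the example dependency-free.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; a real system would use a fuller one).
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "on", "is", "for"}
# Assumed boost factor; the slides do not state the actual hashtag weight.
HASHTAG_BOOST = 2.0


def tokenize(text):
    """Lowercase the text, keep hashtags as single tokens, and drop stop words."""
    tokens = re.findall(r"#?\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def compute_idf(past_tweets):
    """Compute smoothed IDF values over a collection of past tweets."""
    n_docs = len(past_tweets)
    doc_freq = Counter()
    for tweet in past_tweets:
        doc_freq.update(set(tokenize(tweet)))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return idf, n_docs


def tfidf_vector(text, idf, n_docs):
    """Build a sparse TF-IDF vector (dict) with extra weight on hashtags."""
    default_idf = math.log(1 + n_docs) + 1.0  # terms never seen in past data
    vector = {}
    for term, freq in Counter(tokenize(text)).items():
        weight = freq * idf.get(term, default_idf)
        if term.startswith("#"):
            weight *= HASHTAG_BOOST
        vector[term] = weight
    return vector


past = ["snow in nyc", "nyc traffic", "#nyfw show tonight"]
idf, n_docs = compute_idf(past)
vec = tfidf_vector("#nyfw show in nyc", idf, n_docs)
```

Because the IDF table is computed over past data only, each incoming tweet can be vectorized without revisiting earlier tweets.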
  45. Clustering Algorithm
      • Many alternatives possible! [Berkhin 2002]
      • Single-pass incremental clustering algorithm
          • Scalable, online solution
          • Used effectively for
              • Event identification in textual news [Allan et al. 1998]
              • News event detection on Twitter [Sankaranarayanan et al. 2009]
          • Does not require a priori knowledge of number of clusters
          • Known fragmentation issue, often solved with a periodic second pass
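A minimal sketch of single-pass incremental clustering, assuming sparse dict vectors, cosine similarity to the cluster centroid, and a similarity threshold of 0.5. The threshold, class, and function names are hypothetical; the slides do not give the actual parameters, and the periodic second pass is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class Cluster:
    """Keeps a running sum of member vectors so the centroid is cheap to update."""

    def __init__(self, vec):
        self.total = dict(vec)
        self.size = 1

    def centroid(self):
        return {t: w / self.size for t, w in self.total.items()}

    def add(self, vec):
        for t, w in vec.items():
            self.total[t] = self.total.get(t, 0.0) + w
        self.size += 1


def single_pass_cluster(vectors, threshold=0.5):
    """Assign each vector, in arrival order, to its most similar cluster.

    A new cluster is opened when no existing centroid reaches the threshold,
    so the number of clusters does not need to be known in advance.
    """
    clusters, assignments = [], []
    for vec in vectors:
        best_idx, best_sim = None, 0.0
        for i, cluster in enumerate(clusters):
            sim = cosine(vec, cluster.centroid())
            if sim > best_sim:
                best_idx, best_sim = i, sim
        if best_idx is not None and best_sim >= threshold:
            clusters[best_idx].add(vec)
        else:
            clusters.append(Cluster(vec))
            best_idx = len(clusters) - 1
        assignments.append(best_idx)
    return clusters, assignments


stream = [
    {"earthquake": 1.0, "haiti": 1.0},
    {"earthquake": 1.0, "haiti": 1.0, "relief": 1.0},
    {"tiger": 1.0, "woods": 1.0},
]
clusters, assignments = single_pass_cluster(stream)
```

Each tweet is examined once and never reassigned, which is what makes the approach usable online and also what causes the fragmentation issue noted on the slide.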
  47. Overview of Cluster-based Approach
      • Group similar tweets via online clustering
      • Compute statistics of cluster content
          • Top terms (e.g., [earthquake, haiti])
          • Number of documents per hour
          • …
      • Use cluster-level features to identify event clusters
          • Single feature with threshold (e.g., increase in volume over time-window)
          • Trained classification model
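The cluster statistics and the single-feature threshold mentioned above might look like the sketch below. The window length, the `min_ratio` cutoff, the additive smoothing, and the function names are all assumptions made for illustration.

```python
from collections import Counter


def top_terms(cluster_tweets, k=5):
    """Most frequent terms across the tokenized tweets of one cluster."""
    counts = Counter(term for tokens in cluster_tweets for term in tokens)
    return [term for term, _ in counts.most_common(k)]


def docs_per_hour(timestamps_hours):
    """Count tweets per hour bucket; timestamps are hours since some start time."""
    return Counter(int(t) for t in timestamps_hours)


def volume_increase(hourly, hour, window=3):
    """Ratio of the current hour's volume to the mean of the preceding window.

    One is added to the baseline so that a quiet history does not divide by zero.
    """
    previous = [hourly.get(h, 0) for h in range(hour - window, hour)]
    baseline = sum(previous) / window
    return hourly.get(hour, 0) / (baseline + 1.0)


def is_candidate_event(hourly, hour, window=3, min_ratio=2.0):
    """Single-feature rule: flag the cluster when volume grows enough."""
    return volume_increase(hourly, hour, window) >= min_ratio


tweets = [["earthquake", "haiti"], ["earthquake", "haiti", "relief"], ["earthquake"]]
hourly = docs_per_hour([0.5, 1.2, 2.1, 3.0, 3.1, 3.5, 3.7, 3.9, 3.95])
```

A trained classification model, the second option on the slide, would consume the same statistics as input features instead of applying one fixed cutoff.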
  51. Social Interaction Features
      • Retweets
          • RT @username
          • Often characterize Twitter-specific events
      • Replies
          • Tweet starts with @username
          • Possible indication of non-event content
      • Mentions
          • @username anywhere in the tweet
          • Reference to Twitter users that might be part of an event
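As a rough illustration, the three interaction patterns can be detected with regular expressions and turned into per-cluster percentages. The patterns and feature names are assumptions; the slides describe the conventions but not the exact matching rules.

```python
import re

RETWEET_RE = re.compile(r"\bRT @\w+")  # "RT @username" anywhere in the tweet
REPLY_RE = re.compile(r"^@\w+")        # tweet starts with "@username"
MENTION_RE = re.compile(r"@\w+")       # "@username" anywhere in the tweet


def interaction_features(tweets):
    """Fraction of tweets in a cluster that are retweets, replies, or contain mentions.

    The mention count includes retweets and replies, since both contain "@username".
    """
    n = len(tweets)
    if n == 0:
        return {"pct_retweets": 0.0, "pct_replies": 0.0, "pct_mentions": 0.0}
    retweets = sum(1 for t in tweets if RETWEET_RE.search(t))
    replies = sum(1 for t in tweets if REPLY_RE.match(t.strip()))
    mentions = sum(1 for t in tweets if MENTION_RE.search(t))
    return {
        "pct_retweets": retweets / n,
        "pct_replies": replies / n,
        "pct_mentions": mentions / n,
    }


cluster = [
    "RT @EricStangel: Tiger Woods statement",
    "@bob thanks for the follow",
    "watching the game with @alice",
    "good morning",
]
features = interaction_features(cluster)
```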
  54. Topic Coherence
      Intuition: clusters with strong inter-document similarity may contain event information
      • Cluster with top terms: Class, Today, Early, Work, Sleep, Start
          • I’m gonna do my best to go sleep during all my classes today =)
          • Starting work early today. Looking fwd to cooking class tonight!
          • Today starts the rest of my life…
      • Cluster with top terms: Katie Couric, President Obama, Interview, CBS
          • Katie Couric Interview With President Obama http://bit.ly/bRsGPo
          • The Katie Couric-President Obama interview has now begun on CBS
          • Katie Couric interviews President Obama during CBS' Super Bowl pregame coverage
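One plausible way to quantify this intuition is the average pairwise cosine similarity of the tweets in a cluster, computed here on raw term counts. This is an assumed stand-in for whatever coherence features the authors actually used; the example tweets are taken from the slide, with the URL dropped and ASCII apostrophes.

```python
import math
import re
from collections import Counter
from itertools import combinations


def term_counts(text):
    """Bag-of-words term counts for one tweet."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def coherence(tweets):
    """Average cosine similarity over all pairs of tweets in the cluster."""
    vectors = [term_counts(t) for t in tweets]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


non_event = [
    "I'm gonna do my best to go sleep during all my classes today =)",
    "Starting work early today. Looking fwd to cooking class tonight!",
    "Today starts the rest of my life",
]
event = [
    "Katie Couric Interview With President Obama",
    "The Katie Couric-President Obama interview has now begun on CBS",
    "Katie Couric interviews President Obama during CBS' Super Bowl pregame coverage",
]
```

On these two examples the interview cluster scores noticeably higher than the "today" cluster, which matches the intuition stated on the slide.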
  55. Trending Behavior
      • Trending characteristics of top terms in cluster:
          • Exponential fit
          • Deviation from expected volume
      (figure: volume over time for the term “valentine”; axes: time (hours) vs. documents)
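Both trending characteristics can be approximated with a few lines of standard-library code. The sketch below fits an exponential by least squares on log-transformed hourly counts and measures deviation as a z-score against the term's history; the add-one smoothing, the z-score formulation, and the function names are assumptions, not details taken from the talk.

```python
import math
from statistics import mean, pstdev


def exponential_growth_rate(counts):
    """Slope b of the least-squares line through (t, ln(count + 1)).

    A positive b suggests volume growing roughly like a * exp(b * t).
    """
    n = len(counts)
    if n < 2:
        return 0.0
    ts = list(range(n))
    ys = [math.log(c + 1) for c in counts]
    t_bar, y_bar = mean(ts), mean(ys)
    numerator = sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, ys))
    denominator = sum((t - t_bar) ** 2 for t in ts)
    return numerator / denominator


def deviation_from_expected(history, current):
    """How many standard deviations the current volume sits above the historical mean."""
    mu = mean(history)
    sigma = pstdev(history)
    return (current - mu) / (sigma if sigma > 0 else 1.0)


doubling = [1, 3, 7, 15, 31]  # count + 1 doubles every hour
flat = [5, 5, 5, 5]
```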
  56. Twitter-Centric Event Features
      • Tagging behavior
          • Multi-word tags (e.g., #myhomelesssignwouldsay)
          • Percentage of tagged tweets
          • Top term is a tag
          • …
      • Retweeting
          • Percentage of messages with RT @
          • Percentage of messages from top RTed tweet
          • …
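Several of the listed features reduce to simple counts over a cluster. The sketch below covers the percentage of tagged tweets, whether the top term is a tag, and the two retweet percentages; multi-word tag detection is left out because it would need a word list. Feature names and regular expressions are assumptions.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#\w+")
RETWEET_RE = re.compile(r"\bRT @\w+:?\s*(.*)")  # captures the retweeted text


def twitter_centric_features(tweets):
    """Tagging and retweeting statistics for one cluster of raw tweet strings."""
    n = len(tweets)
    if n == 0:
        return {"pct_tagged": 0.0, "top_term_is_tag": False,
                "pct_retweets": 0.0, "pct_top_retweet": 0.0}
    tagged = sum(1 for t in tweets if HASHTAG_RE.search(t))
    terms = Counter(tok for t in tweets for tok in re.findall(r"#?\w+", t.lower()))
    top_term = terms.most_common(1)[0][0] if terms else ""
    retweeted = Counter()
    for t in tweets:
        match = RETWEET_RE.search(t)
        if match:
            retweeted[match.group(1).strip().lower()] += 1
    top_retweet = retweeted.most_common(1)[0][1] if retweeted else 0
    return {
        "pct_tagged": tagged / n,
        "top_term_is_tag": top_term.startswith("#"),
        "pct_retweets": sum(retweeted.values()) / n,
        "pct_top_retweet": top_retweet / n,
    }


meme = [
    "#uknowubrokewhen you split a metrocard",
    "RT @a: #uknowubrokewhen ramen is dinner",
    "RT @b: #uknowubrokewhen ramen is dinner",
    "lol #uknowubrokewhen",
]
meme_features = twitter_centric_features(meme)
```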
  58. Event Classifier
      • Use features to build a classifier
          • Human-annotated training data
          • SVM model (selected during training phase)
      • Alternative classification modes:
          • RW-Event: real-world event vs. rest
          • TC-Event: event (real-world or Twitter-centric) vs. non-event
  62. Event Content Selection
      Example event: Tiger Woods Apology
      • Tiger Woods to make a public apology Friday and talk about his future in golf.
      • Tiger Woods Returns To Golf - Public Apology http://bit.ly/9Ui5jx
      • Tiger woods y'all,tiger woods y'all,ah tiger woods y'all
      • Tiger Woods Hugs: http://tinyurl.com/yhf4 uzw
      • Wedge wars upstage Watson v Woods: BBC Sport (blog)
  65. Event Content Selection
      • Challenges:
          • Clusters contain noise
          • Relevant tweets might have poor quality text
          • Relevant, high quality tweets might not be interesting
      • For each tweet and a given event evaluate
          • Quality
          • Relevance
          • Usefulness
  67. Centrality Based Tweet Selection
      • Centroid
          • Cosine similarity of each tweet to cluster centroid
      • Degree
          • Tweets are nodes
          • Tweets are connected if their similarity is above a threshold
          • Compute degree centrality of each node
      • LexRank [Erkan and Radev 2004]
          • Same graph structure as Degree method
          • Central tweets are similar to other central tweets
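The three centrality measures can be sketched on sparse dict vectors as below. This is a simplified illustration: the similarity threshold (0.3), the damping factor (0.85), the iteration count, and the omission of self-loops in the graph are assumptions, and the LexRank part is a plain power iteration over the thresholded graph rather than a faithful reimplementation of Erkan and Radev's method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid_scores(vecs):
    """Centroid method: similarity of each tweet to the mean vector of the cluster."""
    centroid = {}
    for vec in vecs:
        for term, weight in vec.items():
            centroid[term] = centroid.get(term, 0.0) + weight / len(vecs)
    return [cosine(vec, centroid) for vec in vecs]


def similarity_graph(vecs, threshold):
    """Adjacency lists: tweets are linked when their similarity reaches the threshold."""
    n = len(vecs)
    return [[j for j in range(n) if j != i and cosine(vecs[i], vecs[j]) >= threshold]
            for i in range(n)]


def degree_scores(vecs, threshold=0.3):
    """Degree method: number of neighbours of each tweet in the similarity graph."""
    return [len(neighbours) for neighbours in similarity_graph(vecs, threshold)]


def lexrank_scores(vecs, threshold=0.3, damping=0.85, iterations=50):
    """LexRank-style scores: a tweet is central if its neighbours are central.

    Isolated tweets keep only the baseline (1 - damping) / n share.
    """
    n = len(vecs)
    graph = similarity_graph(vecs, threshold)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[j] / len(graph[j]) for j in graph[i])
            for i in range(n)
        ]
    return scores


tweets = [
    {"tiger": 1.0, "woods": 1.0, "apology": 1.0},
    {"tiger": 1.0, "woods": 1.0, "apology": 1.0, "golf": 1.0},
    {"tiger": 1.0, "woods": 1.0, "hugs": 1.0},
    {"wedge": 1.0, "wars": 1.0, "watson": 1.0},
]
by_centroid = centroid_scores(tweets)
by_degree = degree_scores(tweets)
by_lexrank = lexrank_scores(tweets)
```

Selecting the top tweets for an event then amounts to sorting the cluster's tweets by one of these score lists and keeping the first few.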
  70. Experimental Setup: Data
      • >2,600,000 tweets, collected via Twitter API
      • Location: New York City area
          • Indicated on user profile
      • Time: February 2010
          • First week used to calibrate statistics
          • Second week used for training/validation
          • Third and fourth weeks used for testing
  73. Experimental Setup: Training
      • Data:
          • 504 clusters
          • Fastest growing clusters/hour in second week of February 2010
      • Labels:
          • Real-world event (e.g., [superbowl,colts,saints,sb44])
          • Twitter-specific event (e.g., [uknowubrokewhen,money,job])
          • Non-event (e.g., [happy,love,lol])
          • Ambiguous cluster (e.g., [south,park,west,sxsw,cartman])
  75. Experimental Setup: Testing
      • Baselines:
          • Naïve Bayes text classification (NB-Text)
          • Fastest-growing clusters per hour
      • Classifiers:
          • RW-Event
          • TC-Event
      • 400 clusters
          • 5 hours
          • Top 20 clusters per hour according to RW-Event, TC-Event, Fastest-growing, random
  78. 78. Experimental Methodology: Event Classification  Classification accuracy  10-fold cross validation  Separate test set of randomly chosen tweets  Event surfacing  Top events per hour for each technique  Evaluation:  Precision@K  NDCG@K
  80. 80. Identified Events
        Description                           Keywords
        Senator Evan Bayh’s Retirement        bayh, evan, senate, congress, retire
        Westminster Dog Show                  westminster, dog, show, club, kennel
        Obama’s Meeting with the Dalai Lama   lama, dalai, meet, obama, china
        NYC Toy Fair                          toyfairny, starwars, hasbro, lego, toy
        Marc Jacobs Fashion Show              jacobs, marc, nyfw, show, fashion
        A sample of events identified by our classifiers on the test set
  81. 81. Classification Performance (F-measure)
        • RW-Event classifier is more effective at discriminating between real-world events and the rest of Twitter data
        Classifier   Validation   Test
        NB-Text      0.785        0.702
        RW-Event     0.849        0.837
        TC-Event     0.875        0.789
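The slide reports a single F-measure per classifier without stating its weighting. The sketch below assumes the common balanced form (F1, the harmonic mean of precision and recall); the function name `f_measure`, its `beta` parameter, and the counts are illustrative assumptions, not values from the experiments.

```python
def f_measure(true_pos, false_pos, false_neg, beta=1.0):
    """F-measure from confusion counts; beta=1 gives the balanced F1 score."""
    if true_pos == 0:
        # No correct positive predictions: precision and recall are both 0.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts, chosen only to illustrate the calculation.
score = f_measure(true_pos=8, false_pos=2, false_neg=2)
```

With these made-up counts, precision and recall are both 0.8, so the balanced F-measure is also 0.8.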
  82. 82. Precision@K Evaluation [Plot: Precision (y-axis, 0 to 1) vs. Number of Clusters K (x-axis: 5, 10, 15, 20) for RW-Event, TC-Event, Fastest, Random]
  83. 83. NDCG@K Evaluation [Plot: NDCG (y-axis, 0 to 1) vs. Number of Clusters K (x-axis: 5, 10, 15, 20) for RW-Event, TC-Event, Fastest, Random]
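For readers unfamiliar with the two ranking metrics plotted above, here is a minimal sketch of how Precision@K and NDCG@K are commonly computed over a ranked list of clusters. The talk does not specify its NDCG variant, so the log2 discount and the normalization against the ideal ordering of the same list are assumptions, as are the function names and the sample labels.

```python
import math


def precision_at_k(labels, k):
    """Fraction of the top-k ranked clusters labeled as events (1) vs. not (0).

    Divides by k even if fewer than k items are ranked.
    """
    return sum(labels[:k]) / k


def dcg_at_k(gains, k):
    """Discounted cumulative gain with a log2 position discount (assumed form)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(gains, k):
    """DCG normalized by the DCG of the ideal (descending) ordering of the list."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical labels for one hour's ranked clusters (1 = real-world event).
ranked = [1, 0, 1, 1, 0]
p5 = precision_at_k(ranked, 5)
n5 = ndcg_at_k(ranked, 5)
```

Precision@K ignores the order within the top K, while NDCG@K rewards placing event clusters nearer the top of the list.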
  84. 84. Experimental Methodology: Content Selection
        • 50 event clusters
          ◦ Randomly selected from test set
        • 5 top tweets per event for each: Centroid, Degree, LexRank
        • Labeled on a 1-4 scale
          ◦ Quality: excellent (4) to poor (1)
          ◦ Relevance: clearly relevant (4) to not relevant (1)
          ◦ Usefulness: clearly useful (4) to not useful (1)
  85. 85. Selected Tweets: Example
        Method     Tweet
        Centroid   Video: Tiger regretful; unsure about return to golf - Main Line ...: (AP) Tiger Woods publicly apologized Friday... http://bit.ly/dAO41N
        Degree     Watson: Woods needs to show humility upon return (AP): Tom Watson says Tiger Woods needs to "show some humility to... http://bit.ly/cHVH7x
        LexRank    RT @EricStangel: Tiger Woods statement: And now for Elin's repsonse....
        A sample of tweets selected by different centrality methods
  86. 86. Content Selection Results
        • Average scores over all events
        • High quality and relevance (>3) for both Degree and Centroid
        • Centroid is the only method with high usefulness
        Method     Quality   Relevance   Usefulness
        LexRank    3.444     2.984       2.608
        Degree     3.536     3.156       2.802
        Centroid   3.636     3.694       3.474
  87. 87. Preferred Method per Event
        • Centroid is the preferred method across all metrics
        • For usefulness, Centroid tweets are preferred more than 2:1 compared to Degree and 4:1 compared to LexRank
        Method     Quality   Relevance   Usefulness
        LexRank    22.66%    16.33%      12%
        Degree     31.66%    25.33%      28%
        Centroid   45.66%    58.33%      60%
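The Centroid and Degree selection methods compared above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: tweets are represented as raw term-frequency vectors (the talk's actual term weighting is not shown in this section), similarity is cosine, and the degree `threshold` of 0.3 is a hypothetical value. LexRank is omitted for brevity.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term-frequency vector (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_by_centroid(tweets):
    """Rank tweet indices by similarity to the cluster centroid."""
    vecs = [vectorize(t) for t in tweets]
    centroid = Counter()
    for v in vecs:
        centroid.update(v)  # summed vector; scaling does not change cosine
    scores = [cosine(v, centroid) for v in vecs]
    return sorted(range(len(tweets)), key=lambda i: scores[i], reverse=True)


def rank_by_degree(tweets, threshold=0.3):
    """Rank tweet indices by how many other tweets are similar to them."""
    vecs = [vectorize(t) for t in tweets]
    degrees = [
        sum(1 for j, w in enumerate(vecs) if j != i and cosine(v, w) >= threshold)
        for i, v in enumerate(vecs)
    ]
    return sorted(range(len(tweets)), key=lambda i: degrees[i], reverse=True)


# Hypothetical cluster: two on-topic tweets and one off-topic tweet.
tweets = [
    "tiger woods apologizes in public statement",
    "tiger woods statement on return to golf",
    "lol so bored today",
]
centroid_order = rank_by_centroid(tweets)
degree_order = rank_by_degree(tweets)
```

In this toy cluster both methods rank the off-topic tweet last; the two differ in that Centroid scores each tweet against the cluster as a whole, while Degree counts pairwise neighbors.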
  88. 88. Conclusions
        Techniques for discovering, organizing, and presenting social media from real-world events
        • Event classifiers
          ◦ Important to capture features of Twitter-specific events in order to reveal the real-world events
          ◦ Effectively surfaced real-world events in an unsupervised setting
        • Content selection
          ◦ Similarity to centroid technique better at selecting event content
        • There is relevant and useful event content on Twitter!
  89. 89. Identifying Events in Social Media [Diagram: grid of Timeliness (Real-time, Retrospective) by Content Discovery (Known, Unknown)]
        • Learning similarity metrics for event identification on Flickr [Becker et al. WSDM’10]
        • Surfacing events on Twitter
        • Identifying Twitter content for planned events
        • Connecting events across sites (e.g., YouTube, Picasa)
  90. 90. Learning Similarity Metrics for Event Identification in Social Media (WSDM ’10) [Diagram: Ctitle, Ctags, Ctime]
  91. 91. Learning Similarity Metrics for Event Identification in Social Media (WSDM ’10) [Diagram: Ctitle, Ctags, Ctime and Wtitle, Wtags, Wtime (learned in a training step); similarities combined via f(C,W)]
  92. 92. Learning Similarity Metrics for Event Identification in Social Media (WSDM ’10) [Diagram: Ctitle, Ctags, Ctime and Wtitle, Wtags, Wtime (learned in a training step); similarities combined via f(C,W) into the final clustering solution]
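The slide names a combination function f(C,W) without giving its form. One plausible reading, assumed here purely for illustration, is a weighted average of per-feature similarity scores (title, tags, time) using the learned weights. The function name `combine_similarities` and all numeric values are hypothetical.

```python
def combine_similarities(sims, weights):
    """Weighted average of per-feature similarity scores (assumed form of f)."""
    total = sum(weights[name] for name in sims)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(weights[name] * sims[name] for name in sims) / total


# Hypothetical weights standing in for Wtitle, Wtags, Wtime from training.
weights = {"title": 0.5, "tags": 0.3, "time": 0.2}
# Hypothetical per-feature similarities between a document and a cluster.
sims = {"title": 0.8, "tags": 0.5, "time": 1.0}
score = combine_similarities(sims, weights)
```

A clustering step could then assign a document to the cluster with the highest combined score, or start a new cluster when no score clears a chosen threshold.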
  93. 93. Identifying Tweets for Known Events
  96. 96. Thank you! Pablo Barrio, David Elson, Dan Iter, Yves Petinot, Sara Rosenthal, Gonçalo Simões, Matt Solomon, Kapil Thadani