EventSense: Capturing the Pulse of Large-scale Events by Mining Social Media Streams

936 views
825 views

Published on

Paper presentation in PCI 2013 conference.
Abstract: Social media platforms such as Twitter and Facebook have seen increasing adoption by people worldwide. Coupled with the habit of people to use social media for sharing their daily activities and experiences, it is not surprising that a substantial part of real-world events are well described by the online streams of status updates, posts and media content. In fact, in the case of large events, such as festivals, the number of online messages and shared content can be so high that it is very hard to get an objective view of the event. To this end, this paper presents EventSense, a social media sensing framework that can help event organizers and enthusiasts capture the pulse of large events and gain valuable insights into their impact on visitors. More specifically, EventSense enables the automatic association of online messages to entities
of interest (e.g. films in the case of a film festival), the automatic discovery of topics discussed online, and the detection of sentiment (positive/negative/neutral) both at an entity level (e.g. per film) and on aggregate. In addition, the framework produces an informative social media summary of the event of interest by automatically selecting and putting together its highlights, e.g. the most discussed entities and topics, the most influential users, the evolution of the discussions’ sentiment, and the most shared media and news content. A real-world case study is presented by applying EventSense on a rich dataset collected around the 53rd Thessaloniki International Film Festival.

Published in: Technology, Business
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
936
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
8
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

EventSense: Capturing the Pulse of Large-scale Events by Mining Social Media Streams

  1. 1. #1 EventSense: Capturing the Pulse of Large-scale Events by Mining Social Media Streams Case study: Thessaloniki International Film Festival E. Schinas, S. Papadopoulos, S. Diplaris, Y. Kompatsiaris, Y. Mass, J. Herzig, L. Boudakidis
  2. 2. #2 Capturing & mining large-scale events • Large-scale events  attended by thousands of people  captured by mobile devices in the form of status updates, photos, ratings, etc. • SXSW Music, Film and Interactive Conferences and Festivals – 30000+ attendees – ~300,000 tweets between Mar 3 and 7 – 40,247 tweets even the last month • Sundance film festival – 200 films, 10 days, 50,000+ attendees – 200,000+ tweets during the festival – 20,438 tweets even the last month A search for #tiff53 in twitter returns an unstructured list of tweets
  3. 3. #3 Capturing & mining large-scale events • The online representation of an event as a sequential list of posts and status updates is ineffective • A more effective means of event representation would employ facets, such as entities, sub-events and sentiment. • Challenge: – Organize information around – entities of interest – Extract meaningful insights, obtain informative summaries • EventSense framework
  4. 4. #4 Entity Detection (1/3) • Entities are defined as lists of properties: – a film consists of a title, description, names of director(s)/actors • Matching status updates (tweets) to entities relies on representing both as tf * idf vectors m: message (tweet), f: feature (term), M: set of all event messages boost(f): boosting factor when f is a named entity
  5. 5. #5 Entity Detection (2/3) Unigrams Bigrams αργυρ : 0.348 αργυρ αλεξανδρ : 0.348 αλεξανδρ : 0.289 αλεξανδρ τουρκ : 0.233 τουρκ : 0.231 τουρκ ταιν : 0.201 ταιν : 0.191 ταιν μουχλ : 0.418 μουχλ : 0.616 - 1. Language Detection 2. Tokenization (using the appropriate tokenizer) 3. Stemming 4. tf * idf weighting 5. Boost film’s name
  6. 6. #6 Entity Detection (3/3) Entity of interest 1. Select a combination of properties e.g.title, director and actors 2. Aggregate selected properties to a single string  «Μούχλα Αλί Αιντίν» 3. Calculate tf * idf vector of n-grams using the same vocabulary with tweets 4. Calculate cosine similarity between an incoming message and the set of all entities of interest. 5. Assign message to the entities that similarity exceeds a predefined threshold.
  7. 7. #7 Topic analysis • 1 NN clustering algorithm to create clusters/topics Assign an incoming message to the nearest topic, if cosine similarity exceeds a predefined threshold. Else create a new topic. • Similarity threshold sensitivity analysis similar to entity extraction • LSH approximation to scale up (Petrovich at al., NAACL 2010) hash the input items so that similar items are mapped to the same buckets with high probability. Reduce search only to this bucket. • Title Extraction per Topic For the set of the items of a topic we find the largest sequence of words with the highest frequency.
  8. 8. #8 Sentiment Analysis • Training using tweets with emoticons. E.g.   positive,   negative (A. Go, R. Bhayani, and L. Huang) • For each message we extract two types of features. The first is n-grams. The second includes the existence of user mentions and URLs, punctuation, repeated letters • Naive Bayes (NB) classifier for positive and negative data. Assuming a uniform prior for all classes, independence between features, and using the Bayes rule we get:
  9. 9. #9 Aggregation & summarization • For each entity we retrieve the set of associated messages and calculate the mean value of sentiment, Polarity and Subjectivity • Calculate the same sentiment measures per topic and per user • Several other statistics: top shared messages, URLs and images, top active & influential users
  10. 10. #10 Dataset: 53rd Thessaloniki International Film Festival Three sources of data 1. A detailed set of the 168 films included in the official festival program of tiff53 2. 3,974 tweets that contain the official hashtag of the festival (#tiff53) for the period between November 1st and 13th 3. Film rating and bookmarking data created by the ThessFest mobile app (available both for iPhone* and Android**). * https://itunes.apple.com/gr/app/thessfest/id504913309?mt=8 ** https://play.google.com/store/apps/details?id=com.mk4droid.FF_pack&hl=el 10 days long event 2-11 November 2012
  11. 11. #11 Tweet-film matching results • film = <title, description, directors, actors> • Multiple entity representations using Greek/English/both, uni-/bi-grams • Similarity threshold sensitivity analysis Pooling multiple representationsthreshold ∈ (0.1, 0.3)
  12. 12. #12 Topic analysis results • 834 topics (clusters) • Manual inspection of topics: – 53.8% of topic titles considered informative – 98.5% of topics were found to be “clean” Topics in time Top-10
  13. 13. #13 Sentiment analysis results • Training – 800K positive & negative tweets for English – 12K positive & negative tweets for Greek • Tuning (for threshold) – Manually annotated dataset from Thessaloniki Documentary Festival (similar event) – 325/73/553 in English and 781/216/781 in Greek • Testing – 324/33/724 in English and 901/315/1667 in Greek – Best accuracy (English) ~ 0.75 – Performance in Greek much poorer compared to English  need for richer training corpus pos neg neut
  14. 14. #14 Aggregation & summarization results (1/2) #T: number of tweets Pol: polarity of film tweets Subj: subjectivity of film tweets R: average rating #R: number of ratings #F: number of times the film was bookmarked • Films with positive polarity are rated higher. • Films that are tweeted a lot are also more likely to be rated. • Films that are tweet a lot are also more likely to be added to the users’ bookmarks. Pearson correlation across film statistics
  15. 15. #15 Aggregation & summarization results (2/2) Most active & influential Twitter accounts (+sentiment per user) Most shared photos (+number of retweets)
  16. 16. #16 Summary • Extract entities of interest from messages F1 = 0.737 (precision = 0.774, recall = 0.697) • Detect topics in event related messages 834 topics, 98.5% considered “clean” • Sentiment analysis per messages, entities & topics Accuracy: 0.75 for English, 0.62 for Greek • Aggregation & statistics Valuable insights and overview information
  17. 17. #17 Future Work • Apply the proposed framework to larger-scale events of different nature (e.g. music festivals, sports events). • Monitoring and processing more OSN sources (e.g. Facebook, Instagram). • Refine the proposed methods with the goal of improving accuracy and robustness over different datasets. • Experiment with techniques for automatically creating visual informative summaries based on the results of the automatic analysis.
  18. 18. #18 Thank You Questions?
  19. 19. #19 References 1. Petrovic S., Osborne M., Lavrenko V. (2010) Streaming first story detection with application to Twitter. Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL (NAACL) 2. A. Go, R. Bhayani, and L. Huang. Twitter sentiment classification using distant supervision. 2009.

×