StreamGrid: Summarization of large-scale Events using Topic Modeling and Temporal Analysis


Paper presentation at #somus2014 workshop, part of ICMR 2014, Glasgow, Scotland.


  1. SoMuS 2014 Workshop, ICMR, Glasgow, Scotland, 1 April 2014
     StreamGrid: Summarization of Large Scale Events using Topic Modelling and Temporal Analysis
     Manos Schinas, Symeon Papadopoulos, Yiannis Kompatsiaris, Pericles A. Mitkas
     Information Technologies Institute (ITI), Center for Research & Technology Hellas (CERTH)
     Department of Electrical & Computer Engineering, Aristotle University of Thessaloniki (AUTh)
  2. Overview
     • The Problem
     • Existing Approaches
     • StreamGrid
     • Experimental Study
     • Summary and Future Work
  3. Event Summarization: motivation & definition
  4. Large-scale Public Events
     • Many attendees use social media
     • Huge amount of event-related social content
     • #oscars → 4.5M tweets
     • #sxsw → 1.35M tweets
     • #SB48 (Super Bowl) → 24.9M tweets in 4 hours!
  5. Large-scale Public Events
     • Long-running events consist of several sub-events, e.g. the 10 days of the Sundance Film Festival include opening and awards ceremonies, screenings, etc.
     • An event has many aspects and entities of interest, e.g. films in film festivals, teams in sports events, etc.
     • Many messages are spam or non-informative.
     • Near-duplicate messages introduce redundancy.
  6. Event-based Summarization
     • Produce concise multi-document summaries for a given event, covering its main aspects.
     • Diagram: list of all messages → event-based summarizer → set of selected messages
  7. Related Work
  8. Existing Approaches
     • Radev et al. 2004 (baseline): the summary consists of the messages closest to the tf∙idf centroid of all messages.
     • Shen et al. 2013: a mixture model detects sub-events at participant level; the tf∙idf centroid yields a summary of each sub-event.
     • Chakrabarti & Punera 2011: a Hidden Markov Model produces a time-based segmentation; the tf∙idf centroid yields a summary of each time segment.
  9. Existing Approaches
     • Erkan & Radev 2004 (LexRank): a graph-based approach to find salient sentences, using the centrality of each sentence in a similarity graph. It is adapted to multi-document summarization by treating each message as a sentence, and it outperforms the naïve centroid-based approach.
     • Shou et al. 2013 (Sumblr): an online clustering algorithm finds sub-events; a greedy algorithm summarizes them using the LexRank score of each message.
  10. StreamGrid: approach description
  11. StreamGrid Overview
      • Find topics using Latent Dirichlet Allocation (LDA)
      • Create a timeline for each topic
      • Create the StreamGrid structure
      • Summarize using StreamGrid
  12. Topic Modeling using LDA
      • To work with very short documents (tweets), LDA needs some kind of message pooling: microblog messages are merged into longer documents before the topics are estimated.
      • Pooling schemes: time proximity, same author, same hashtag, textual similarity.
      • Estimating the number of topics: minimize (a) the total perplexity for a set of test documents and (b) the average textual similarity across topics.
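The same-hashtag pooling scheme can be illustrated with a minimal sketch. The function name `pool_by_hashtag`, the hashtag regex, and the choice to return untagged messages separately are assumptions made for illustration; the slides do not specify how pooling was implemented.

```python
import re
from collections import defaultdict

# Assumed hashtag pattern: '#' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")


def pool_by_hashtag(messages):
    """Merge messages that share a hashtag into one pseudo-document per hashtag.

    Returns (merged, unpooled): a dict mapping each lowercase hashtag to the
    concatenation of its messages, and the list of messages with no hashtag.
    """
    pools = defaultdict(list)
    unpooled = []
    for text in messages:
        tags = {tag.lower() for tag in HASHTAG.findall(text)}
        if not tags:
            unpooled.append(text)
        for tag in tags:
            # A message with several hashtags contributes to every matching pool.
            pools[tag].append(text)
    merged = {tag: " ".join(texts) for tag, texts in pools.items()}
    return merged, unpooled
```

The merged pseudo-documents, rather than the individual tweets, would then be passed to the topic model.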
  13. Topic Modeling using LDA
      • Split the documents D into Dtrain and Dtest
      • Estimate K topics over Dtrain
      • Calculate the total perplexity of Dtest
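The slide does not give the perplexity formula. A minimal sketch, assuming the standard held-out definition (the exponential of the negative total log-likelihood divided by the total token count), could look as follows. The function name and its inputs are hypothetical; the per-document log-likelihoods would come from whichever LDA implementation is used.

```python
import math


def total_perplexity(doc_log_likelihoods, doc_lengths):
    """Held-out perplexity: exp(-sum_d log p(w_d) / sum_d N_d).

    doc_log_likelihoods: natural-log likelihood of each test document.
    doc_lengths: number of tokens in each test document.
    """
    total_tokens = sum(doc_lengths)
    if total_tokens == 0:
        raise ValueError("test set contains no tokens")
    return math.exp(-sum(doc_log_likelihoods) / total_tokens)
```

Lower values indicate that the K-topic model fitted on Dtrain explains Dtest better, which is why the value is compared across candidate values of K.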
  14. StreamGrid Creation
      • Assign each message to the topic with the highest probability p, provided that p > pth; messages that fail this condition are treated as spam and discarded.
      • Create the StreamGrid: a grid indexed by topic i and time interval j, where cell c(i,j) = {set of messages associated with topic i, posted during time interval j}.
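The grid construction can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: `build_grid`, the `(text, timestamp, topic_probs)` input layout, and fixed-width intervals measured in seconds are all hypothetical choices.

```python
from collections import defaultdict


def build_grid(messages, p_th, interval_seconds):
    """Build cells c(i, j) from (text, timestamp_seconds, topic_probs) tuples.

    Each message goes to its most probable topic i, but only if that
    probability exceeds p_th; otherwise it is discarded. The time interval j
    is the timestamp divided into fixed-width bins (an assumption).
    """
    grid = defaultdict(list)
    for text, timestamp, topic_probs in messages:
        topic = max(range(len(topic_probs)), key=topic_probs.__getitem__)
        if topic_probs[topic] <= p_th:
            continue  # below threshold: treated as spam / uninformative
        interval = int(timestamp // interval_seconds)
        grid[(topic, interval)].append(text)
    return dict(grid)
```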
  15. StreamGrid Creation
      • For each cell c(i,j), calculate a merged tf∙idf vector uij.
      • For each term t, calculate a weight based on tfij(t), the frequency of t in cell c(i,j).
      • For each message m of c(i,j), calculate a weight.
  16. StreamGrid Creation
      • Detect the active cells of each topic by applying peak detection to the associated topic timeline.
      • Given a topic i and a detected peak in time window [a,b], all cells c(i,j) with a < j < b are defined as active.
      • For the set of active topics A during a time interval j, calculate a significance score.
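The step from detected peaks to active cells can be sketched as below. The peak detector itself is not described in the slides, so this sketch takes the peak windows as given input; `active_cells` and the `peaks_by_topic` layout are hypothetical names. The strict inequality a < j < b follows the slide's wording.

```python
def active_cells(peaks_by_topic):
    """Return the set of active cells (topic, j).

    peaks_by_topic maps a topic index to a list of (a, b) peak windows,
    expressed as time-interval indices. Following the slide, a cell is
    active when a < j < b for some window of its topic.
    """
    active = set()
    for topic, windows in peaks_by_topic.items():
        for a, b in windows:
            for j in range(a + 1, b):
                active.add((topic, j))
    return active
```

Note that with strict bounds a window such as (0, 1) activates no cell, so the window edges are assumed to lie just outside the peak.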
  17. StreamGrid Creation
      • To get an overall estimate of the importance of each topic throughout the event, we calculate two measures.
  18. Topic-time Summarization
      • Our goal is to generate a summary of an event for an arbitrary time frame F=[x1,x2].
      • The summary has to meet the following criteria: as many aspects of the event as possible are covered, and redundancy due to near-duplicate messages is minimized.
      • We use a greedy algorithm that selects important messages from each active topic in F while minimizing redundancy.
  19. Topic-time Summarization
      • A topic i is active in F if any of the cells contained in F is active.
      • The significance score of an active topic i in F is the maximum significance score across all time intervals in F.
      • The weight W(m,F) of a message m in F is the sum of its weights in each time interval.
      • Figure: active topics in two example time frames, F and F’.
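The aggregation over a time frame can be illustrated with a short sketch. The function name, the dictionary layout, the inclusive interval bounds, and the choice to take the maximum only over intervals where the topic is active are assumptions; the slide states only that the frame-level score is the maximum across intervals in F.

```python
def topic_significance_in_frame(significance, active, topic, frame):
    """Frame-level significance of a topic, or None if it is inactive in F.

    significance: {(topic, j): score} per-interval significance scores.
    active: set of active cells (topic, j).
    frame: (x1, x2) inclusive bounds of F, as time-interval indices.
    """
    x1, x2 = frame
    scores = [
        significance.get((topic, j), 0.0)
        for j in range(x1, x2 + 1)
        if (topic, j) in active
    ]
    # A topic with no active cell inside F is not active in F.
    return max(scores) if scores else None
```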
  20. Topic-time Summarization: Algorithm
      Input: StreamGrid, time frame F, summary length L
      Output: summary set S
      1. Get active topics in F
      2. for each active topic, select the message with the highest weight (candidate set Mc)
      3. while |S| < L do
      4.   for each message m in Mc do
      5.     calculate score(m)
      6.   end for
      7.   Add the message with the highest score to S and remove it from Mc
      8. end while
  21. Topic-time Summarization
      • The score of a message m combines its importance with the redundancy its selection would introduce.
      • Redundancy is the average textual similarity between m and the messages already selected in S.
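The greedy selection can be sketched as follows. The exact scoring formula is not preserved in the transcript, so this sketch assumes score(m) = weight(m) - penalty × redundancy(m), with redundancy measured as the average cosine similarity over raw term counts. The names `greedy_summary`, `redundancy`, `cosine`, and the `penalty` parameter are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two texts over whitespace-token counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def redundancy(message, selected):
    """Average similarity between a candidate and the selected messages."""
    if not selected:
        return 0.0
    return sum(cosine(message, s) for s in selected) / len(selected)


def greedy_summary(candidates, length, penalty=1.0):
    """Pick up to `length` messages from {message: weight} candidates.

    Assumed score: weight minus penalty times redundancy. The loop also
    stops when the candidate set runs out, so it always terminates.
    """
    pool = dict(candidates)
    summary = []
    while pool and len(summary) < length:
        best = max(pool, key=lambda m: pool[m] - penalty * redundancy(m, summary))
        summary.append(best)
        del pool[best]
    return summary
```

With this scoring, a near-duplicate of an already selected message is heavily penalized, so a lower-weight message about a different topic can be chosen ahead of it.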
  22. Experimental Study: Sundance Film Festival 2013
  23. Dataset & Event: Sundance Film Festival
      • Two-week festival: Jan 15-30, 2013
      • Data collected through the Streaming API with the following parameters:
        hashtags: #sundance, #sundance2013, #sundancefest
        account: @sundancefest
      • Total number of tweets: 201,752
      • Total number of original tweets: 100,046
  24. Topic Modeling
      • Merging messages with the same hashtag gave the best results with respect to perplexity.
      • Perplexity tends to decrease as K increases.
      • The average similarity between clusters stabilized for K>200 → K = 200
  25. Peaky & Persistent Topics
  26. Event Timeline
      • Figure: event timeline with two annotated peaks: the “Stoker” film by Chan-wook Park together with the “Use Orally as Indicated” film, and the awards ceremony.
  27. Selected Timeslots
      • Evaluate using two timeslots with high activity.
      • The first time frame has a small number of very popular tweets, mainly about two films.
      • The second is a more diverse set of tweets.
      • A good measure of the quality of a summary is the number of films covered.

      From                         | To                           | Tweets | Description
      Mon Jan 21 05:00:00 EET 2013 | Mon Jan 21 06:00:00 EET 2013 | 5755   | “Stoker” film by Chan-wook Park & “Use Orally as Indicated” film
      Sun Jan 27 03:00:00 EET 2013 | Sun Jan 27 09:00:00 EET 2013 | 9009   | Awards ceremony
  28. Baselines
      • Random Summarizer: selects L random tweets.
      • Popularity Summarizer: selects the top L tweets by retweet count.
      • tf∙idf Summarizer: uses the tf∙idf weight of each tweet to select the top L.
      • Cluster-based Summarizer: creates L clusters using k-means clustering and selects the highest-weighted message of each cluster.
      • LexRank Summarizer: graph-based method that assigns a weight to each tweet based on its adjacent edges.
  29. Timeslot #1 (Stoker & Use Orally as Indicated): Popularity-based Summarizer
      • 5/10 tweets of the summary are related to the Stoker film → tends to cover only a few popular aspects of the event.
      • Minimizes near-duplicate redundancy, as it uses only the original tweets.
      • “Use Orally as Indicated” is the second film covered in the summary (130 RTs).
  30. Timeslot #1 (Stoker & Use Orally as Indicated): LexRank and tf∙idf Summarizers
      • LexRank Summarizer: 9/10 tweets of the summary are retweets of a tweet related to the “Use Orally as Indicated” film → a lot of redundancy. These tweets have high degree centrality, as there are many connections between them.
      • tf∙idf Summarizer: covers two different films (Stoker, Stuart Hall), with many tweets about each of them.
  31. Timeslot #1 (Stoker & Use Orally as Indicated): StreamGrid Summarizer
      • Covers five different films (The Look of Love, Dirty Wars, Before Midnight, Kill Your Darlings, Life According to Sam).
      • There are no duplicates or near-duplicates.
      • “Stoker” and “Use Orally as Indicated” are not covered!
      • A combination of StreamGrid Summarization and Popularity Summarization could solve this.
  32. Timeslot #2 (Awards Ceremony)
      • KPI: number of winning films covered by the summary.
      • The popularity-based summarizer outperforms all other approaches: it covers 8 films that won an award that night (Afternoon Delight, Fruitvale, The Spectacular Now, Blood Brothers, Metro Manila, Dirty Wars, Crystal Fairy, Pussy Riot).
      • StreamGrid covers 6 films (Computer Chess, Inequality for All, Fruitvale, Afternoon Delight, In a World, American Promise).
      • Only two films in common → integrate popularity into StreamGrid to obtain better results.
      • LexRank does not cover any of the winning films, but includes this: 'The Canyons' Snubbed By Sundance Film Festival -- Lindsay Lohan to Blame?
      • The tf∙idf summarizer includes three films, but none of the winners!
  33. Multimedia Summaries
      • Figure: popularity-based summary next to the StreamGrid summary.
      • Is there a systematic, objective way to evaluate these?
  34. Conclusions & Future Work
  35. Summary
      • Topic modeling approach to automatically capture the main aspects of the event from a large set of event-related microblogging messages.
      • Peak detection on each topic-related timeline to find the active moments of each topic.
      • Use of active topics to select a set of representative messages for an arbitrary time frame.
      • Greedy algorithm for selecting messages with respect to content coverage and redundancy reduction.
  36. Future Work
      • Real-time version of the StreamGrid framework to summarize evolving, continuous social streams.
      • Investigate how different topic modeling techniques affect the produced summary.
      • Find a more systematic way to evaluate summaries (especially multimedia!).
  37. Thank you! Questions?
  38. Key References
      • Shou, Lidan, et al. "Sumblr: Continuous summarization of evolving tweet streams." Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2013.
      • Marcus, Adam, et al. "TwitInfo: Aggregating and visualizing microblogs for event exploration." Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2011.
      • Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet allocation." Journal of Machine Learning Research 3 (2003): 993-1022.
      • Erkan, Günes, and Dragomir R. Radev. "LexRank: Graph-based lexical centrality as salience in text summarization." Journal of Artificial Intelligence Research (JAIR) 22.1 (2004): 457-479.