StreamGrid: Summarization of large-scale Events using Topic Modeling and Temporal Analysis


Paper presentation at #somus2014 workshop, part of ICMR 2014, Glasgow, Scotland.


  • 1. SoMuS 2014 Workshop, ICMR, Glasgow, Scotland, 1 April 2014. StreamGrid: Summarization of Large Scale Events using Topic Modelling and Temporal Analysis. Manos Schinas, Symeon Papadopoulos, Yiannis Kompatsiaris, Pericles A. Mitkas. Information Technologies Institute (ITI), Center for Research & Technology Hellas (CERTH); Department of Electrical & Computer Engineering, Aristotle University of Thessaloniki (AUTh).
  • 2. Overview • The Problem • Existing Approaches • StreamGrid • Experimental Study • Summary – Future Work
  • 3. Event Summarization: motivation & definition
  • 4. Large-scale Public Events • Many attendees use social media • Huge amount of event-related social content: #oscars → 4.5M tweets; #sxsw → 1.35M tweets; #SB48 (Super Bowl) → 24.9M tweets in 4 hours!
  • 5. Large-scale Public Events • Long-running events consist of several sub-events, e.g. the 10 days of the Sundance Film Festival include opening and awards ceremonies, screenings, etc. • Many aspects and entities are of interest in the context of an event, e.g. films in film festivals, teams in sports events, etc. • Many messages can be considered spam or non-informative. • Redundancy due to near-duplicate messages.
  • 6. Event-based Summarization • Produce concise multi-document summaries for a given event, covering its main aspects. • Diagram: List of all messages → Event-based Summarizer → Set of Selected Messages
  • 7. Related Work
  • 8. Existing Approaches • Radev et al. 2004 (baseline): the summary consists of the messages closest to the tf∙idf centroid of all messages. • Shen et al. 2013: a mixture model detects sub-events at participant level; a tf∙idf centroid gives the summary of each sub-event. • Chakrabarti & Punera 2011: a Hidden Markov Model yields a time-based segmentation; a tf∙idf centroid gives the summary of each time segment.
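The centroid baseline described on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the whitespace tokenizer, the smoothed idf formula, and the function names (`tfidf_vectors`, `centroid_summary`) are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenization.
    return text.lower().split()


def tfidf_vectors(messages):
    """Return one sparse tf-idf vector (dict term -> weight) per message."""
    docs = [Counter(tokenize(m)) for m in messages]
    n = len(docs)
    df = Counter(t for d in docs for t in d)
    # Assumption: idf = log(N / df) + 1, one of several common variants.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    return [{t: tf * idf[t] for t, tf in d.items()} for d in docs]


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def centroid_summary(messages, length):
    """Pick the `length` messages closest to the tf-idf centroid."""
    vectors = tfidf_vectors(messages)
    centroid = Counter()
    for v in vectors:
        for t, w in v.items():
            centroid[t] += w / len(vectors)
    ranked = sorted(range(len(messages)),
                    key=lambda i: cosine(vectors[i], centroid),
                    reverse=True)
    return [messages[i] for i in ranked[:length]]


msgs = ["stoker premiere tonight", "stoker premiere was great",
        "stoker premiere tonight great", "lunch"]
summary = centroid_summary(msgs, 2)
```

Because every selected message is close to the same centroid, this baseline tends to return near-duplicates, which is the weakness the later slides address.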
  • 9. Existing Approaches • Erkan & Radev 2004 (LexRank): graph-based approach to find salient sentences; uses the centrality of each sentence in a similarity graph; adapted to multi-document summarization by treating each message as a sentence; outperforms the naïve centroid-based approach. • Shou et al. 2013: online clustering algorithm to find sub-events; greedy summarization algorithm using the LexRank score of each message.
  • 10. StreamGrid: approach description
  • 11. StreamGrid Overview • Find topics using Latent Dirichlet Allocation (LDA) • Create a timeline for each topic • Create the StreamGrid structure • Summarize using StreamGrid
  • 12. Topic Modeling using LDA • To work with very short documents (tweets), LDA needs some kind of message pooling. • Diagram: Microblog messages → merge → Merged messages. • Pooling schemes: time proximity, same author, same hashtag, textual similarity. • Estimating the number of topics – minimize (a) the total perplexity for a set of test documents and (b) the average textual similarity across topics.
  • 13. Topic Modeling using LDA • Split the documents D into D_train and D_test • Estimate K topics over D_train • Calculate the total perplexity of D_test
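As a rough sketch of the last step, perplexity on held-out documents is usually defined (following Blei et al. 2003) as the exponential of the negative log-likelihood per token. The helper names below, the `theta`/`phi` input layout, and the `1e-12` floor for unseen words are assumptions for illustration; the slides do not say how the test-document topic proportions were inferred.

```python
import math


def doc_log_likelihood(tokens, theta, phi):
    """Approximate log p(w_d) for one document.

    theta: list of K topic proportions for the document (assumed given).
    phi:   list of K dicts mapping word -> probability under that topic.
    """
    total = 0.0
    for w in tokens:
        # Assumption: unseen words get a tiny floor probability.
        p = sum(theta[k] * phi[k].get(w, 1e-12) for k in range(len(theta)))
        total += math.log(p)
    return total


def perplexity(doc_log_likelihoods, doc_lengths):
    """exp(- sum_d log p(w_d) / sum_d N_d) over the test documents."""
    total_tokens = sum(doc_lengths)
    if total_tokens == 0:
        raise ValueError("test set contains no tokens")
    return math.exp(-sum(doc_log_likelihoods) / total_tokens)


# Sanity check: a model that is uniform over 50 words has perplexity 50.
lengths = [3, 7]
lls = [n * math.log(1 / 50) for n in lengths]
uniform_pp = perplexity(lls, lengths)
```

Repeating this for several values of K and comparing the resulting perplexities is one way to carry out the model selection the slide describes.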
  • 14. StreamGrid Creation • Assign each message to the topic with the highest probability p, under the condition p > p_th (messages that fail it are treated as spam and discarded). • Create the StreamGrid: a grid indexed by topic i and time interval j, where cell c(i,j) = {set of messages associated with topic i, posted during time interval j}.
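The grid construction can be sketched as a dictionary keyed by (topic, interval). The function name `build_streamgrid`, the input layout, and the default threshold value are hypothetical; the slides give no concrete value for p_th or for the interval length.

```python
from collections import defaultdict


def build_streamgrid(messages, topic_probs, start, interval, p_th=0.5):
    """Group message ids into cells c(i, j).

    messages:    list of (msg_id, timestamp_in_seconds)
    topic_probs: dict msg_id -> list of per-topic probabilities
    start:       timestamp of the beginning of the event
    interval:    length of one time interval in seconds
    p_th:        assignment threshold (assumed value)
    """
    grid = defaultdict(list)
    for msg_id, ts in messages:
        probs = topic_probs[msg_id]
        topic = max(range(len(probs)), key=probs.__getitem__)
        if probs[topic] <= p_th:
            continue  # below threshold: treated as spam and dropped
        j = int((ts - start) // interval)
        grid[(topic, j)].append(msg_id)
    return dict(grid)


grid = build_streamgrid(
    messages=[("a", 0), ("b", 3700), ("c", 100)],
    topic_probs={"a": [0.9, 0.1], "b": [0.2, 0.8], "c": [0.5, 0.5]},
    start=0, interval=3600, p_th=0.6,
)
```

Here message `c` is dropped because its best topic probability does not exceed the threshold, while `a` and `b` land in cells (0, 0) and (1, 1).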
  • 15. StreamGrid Creation • For each cell c(i,j), calculate a merged tf∙idf vector u_ij. • For each term t, calculate the weight [equation not captured in transcript], where tf_ij(t) is the frequency of t in cell c(i,j). • For each message m of c(i,j), calculate the weight [equation not captured in transcript].
  • 16. StreamGrid Creation • Detect the active cells of each topic by applying peak detection to the associated topic timeline. • Given a topic i and a detected peak in time window [a,b], all cells c(i,j) with a < j < b are defined as active. • For the set of active topics A during a time interval j, calculate a significance score [equation not captured in transcript].
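The slides do not say which peak detector is used, so the following is only a sketch under a stated assumption: an interval belongs to a peak when its message count exceeds the timeline mean plus k standard deviations. The window [a, b] is taken to be the interval just before and just after each run, so that the strict condition a < j < b selects exactly the run. Function names are hypothetical.

```python
import statistics


def detect_peaks(counts, k=1.0):
    """Return peak windows (a, b) over a topic timeline.

    counts: number of messages of one topic per time interval.
    Assumption: a peak is a maximal run of intervals whose count is
    above mean + k * (population) standard deviation.
    """
    threshold = statistics.mean(counts) + k * statistics.pstdev(counts)
    windows, run_start = [], None
    for j, c in enumerate(counts):
        if c > threshold and run_start is None:
            run_start = j
        elif c <= threshold and run_start is not None:
            windows.append((run_start - 1, j))
            run_start = None
    if run_start is not None:
        windows.append((run_start - 1, len(counts)))
    return windows


def active_intervals(windows):
    """All j with a < j < b for some detected window [a, b]."""
    return {j for a, b in windows for j in range(a + 1, b)}


timeline = [1, 1, 1, 10, 12, 1, 1, 1]
windows = detect_peaks(timeline)
active = active_intervals(windows)
```

For the toy timeline above, the only window is (2, 5) and the active intervals are 3 and 4.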
  • 17. StreamGrid Creation • To get an overall estimate of the importance of each topic throughout the event, we calculate two measures [equations not captured in transcript].
  • 18. Topic-time Summarization • Our goal is to generate a summary of an event for an arbitrary time frame F = [x1, x2]. • The summary has to meet the following criteria – as many aspects of the event as possible are covered – redundancy due to near-duplicate messages is minimized • We use a greedy algorithm that selects important messages from each active topic in F while minimizing redundancy.
  • 19. Topic-time Summarization • A topic i is active in F if any of its cells contained in F is active. • The significance score of an active topic i in F is the maximum significance score across all time intervals in F. • The weight W(m,F) of a message m in F is the sum of its weights over the time intervals in F. • Diagram: time frames F and F′, each with its own set of active topics.
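The three aggregation rules on this slide translate almost directly into code. In the sketch below, the data layout is an assumption: active cells are a set of (topic, interval) pairs, significance scores are keyed the same way, and a message carries a dict of per-interval weights. The frame F is treated as an inclusive range of interval indices.

```python
def active_topics_in_frame(active_cells, frame):
    """Topics with at least one active cell inside the frame."""
    x1, x2 = frame
    return {i for (i, j) in active_cells if x1 <= j <= x2}


def topic_significance(significance, topic, frame):
    """Maximum significance of `topic` over the intervals in the frame."""
    x1, x2 = frame
    return max((s for (i, j), s in significance.items()
                if i == topic and x1 <= j <= x2), default=0.0)


def message_weight(interval_weights, frame):
    """W(m, F): sum of the message's per-interval weights inside the frame."""
    x1, x2 = frame
    return sum(w for j, w in interval_weights.items() if x1 <= j <= x2)


frame = (0, 3)
topics = active_topics_in_frame({(0, 1), (1, 5)}, frame)
sig = topic_significance({(0, 1): 0.4, (0, 2): 0.9, (0, 7): 2.0}, 0, frame)
weight = message_weight({1: 0.5, 2: 0.25, 9: 1.0}, frame)
```

Keeping these as small pure functions makes it cheap to recompute the inputs for any other frame F′ without rebuilding the grid.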
  • 20. Topic-time Summarization: Algorithm
      Input: StreamGrid, time frame F, summary length L
      Output: summary set S
      1. Get active topics in F
      2. For each active topic, select the message with the highest weight into the candidate set Mc
      3. while |S| < L do
      4.   for each message m in Mc do
      5.     calculate score(m)
      6.   end for
      7.   Add the message with the highest score to S and remove it from Mc
      8. end while
  • 21. Topic-time Summarization • The score of a message m combines its importance with the redundancy that its selection would introduce. • Redundancy is the average textual similarity between m and the set of already selected messages S.
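A minimal sketch of the greedy selection follows. The transcript does not preserve the exact scoring formula, so two things here are assumptions: the score is taken as importance minus `lam` times redundancy, and textual similarity is approximated by Jaccard overlap of tokens. The loop also stops when the candidate set is empty, a guard that the slide's pseudocode leaves implicit.

```python
def jaccard(a, b):
    # Assumption: token-set Jaccard overlap as the textual similarity.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def redundancy(message, selected):
    """Average similarity between `message` and the selected messages."""
    if not selected:
        return 0.0
    return sum(jaccard(message, s) for s in selected) / len(selected)


def greedy_summary(topic_messages, length, lam=1.0):
    """topic_messages: dict topic -> list of (text, weight) for the
    topics that are active in the time frame F."""
    # Step 2: the top-weighted message of each active topic is a candidate.
    candidates = [max(msgs, key=lambda mw: mw[1])
                  for msgs in topic_messages.values() if msgs]
    summary = []
    # Steps 3-8: repeatedly take the best-scoring candidate.
    while len(summary) < length and candidates:
        best = max(candidates,
                   key=lambda mw: mw[1] - lam * redundancy(mw[0], summary))
        summary.append(best[0])
        candidates.remove(best)
    return summary


example = {
    0: [("stoker premiere tonight", 0.9), ("x", 0.1)],
    1: [("stoker premiere tonight wow", 0.8)],
    2: [("awards ceremony starts", 0.5)],
}
picked = greedy_summary(example, 2)
```

In the example, the second pick is the lower-weighted awards message rather than the near-duplicate Stoker message, because the redundancy penalty outweighs the weight difference.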
  • 22. Experimental Study: Sundance Film Festival 2013
  • 23. Dataset & Event: Sundance Film Festival • Two-week festival: Jan 15-30, 2013 • Data collected through the Streaming API with the following parameters – hashtags: #sundance, #sundance2013, #sundancefest – account: @sundancefest • Total number of tweets: 201,752 • Total number of original tweets: 100,046
  • 24. Topic Modeling • Merging messages with the same hashtag gave the best results with respect to perplexity. • Perplexity tends to decrease as K increases. • The average similarity between clusters stabilized for K > 200 → K = 200
  • 25. Peaky & Persistent Topics
  • 26. Event Timeline • Annotated peaks: awards ceremony; “Stoker” film by Chan-wook Park & “Use Orally as Indicated” film
  • 27. Selected Timeslots • Evaluate using two timeslots with high activity. • The first time frame has a small number of very popular tweets, mainly about two films. • The second is a more diverse set of tweets. • A good measure of the quality of a summary is the number of films covered.
      From                         | To                           | Tweets | Description
      Mon Jan 21 05:00:00 EET 2013 | Mon Jan 21 06:00:00 EET 2013 | 5755   | “Stoker” film by Chan-wook Park & “Use Orally as Indicated” film
      Sun Jan 27 03:00:00 EET 2013 | Sun Jan 27 09:00:00 EET 2013 | 9009   | Awards ceremony
  • 28. Baselines • Random Summarizer: selects L random tweets. • Popularity Summarizer: selects the top L tweets by retweet count. • tf∙idf Summarizer: uses the tf∙idf weight of each tweet to select the top L. • Cluster-based Summarizer: creates L clusters using k-means clustering and selects the highest-weighted message of each cluster. • LexRank Summarizer: graph-based method that assigns a weight to each tweet based on its adjacent edges.
  • 29. Timeslot #1 (Stoker & Use Orally as Indicated): Popularity-based Summarizer • 5/10 tweets of the summary are related to the film “Stoker” → tends to cover only a few popular aspects of the event. • Minimizes near-duplicate redundancy, as it uses only the original tweets. • “Use Orally as Indicated” is the second film covered in the summary (130 RTs).
  • 30. Timeslot #1 (Stoker & Use Orally as Indicated) • LexRank Summarizer: 9/10 tweets of the summary are retweets of a tweet related to the “Use Orally as Indicated” film → a lot of redundancy. These tweets have high degree centrality, as there are many connections between them. • tf∙idf Summarizer: covers two different films (Stoker, Stuart Hall); there are many tweets about these films.
  • 31. Timeslot #1 (Stoker & Use Orally as Indicated): StreamGrid Summarizer • Covers five different films (The Look of Love, Dirty Wars, Before Midnight, Kill Your Darlings, Life According to Sam). • There are no duplicates or near-duplicates. • “Stoker” and “Use Orally as Indicated” are not covered! • A combination of StreamGrid Summarization and Popularity Summarization could solve this.
  • 32. Timeslot #2 (Awards Ceremony) • KPI: number of winning films covered by the summary. • The popularity-based summarizer outperforms all other approaches: it covers 8 films that won an award that night (Afternoon Delight, Fruitvale, The Spectacular Now, Blood Brothers, Metro Manila, Dirty Wars, Crystal Fairy, Pussy Riot). • StreamGrid covers 6 films (Computer Chess, Inequality for All, Fruitvale, Afternoon Delight, In a World, American Promise). • Only two films in common → integrate popularity into StreamGrid to obtain better results. • LexRank does not cover any of the winning films, but includes this: 'The Canyons' Snubbed By Sundance Film Festival -- Lindsay Lohan to Blame? • The tf∙idf Summarizer includes three films, but none of the winning ones!
  • 33. Multimedia Summaries • Figures: popularity-based summary vs. StreamGrid summary. • Is there any systematic, objective way to evaluate these?
  • 34. Conclusions & Future Work
  • 35. Summary • Topic modeling approach to automatically capture the main aspects of the event from a large set of event-related microblogging messages. • Peak detection on each topic-related timeline to find the active moments of each topic. • Use of active topics to select a set of representative messages for an arbitrary time frame. • Greedy algorithm for the selection of messages with respect to content coverage and redundancy reduction.
  • 36. Future Work • Real-time version of the StreamGrid framework to get summaries of evolving and continuous social streams. • Investigate how different topic modeling techniques affect the produced summary. • Find a more systematic way to evaluate summaries (especially multimedia!).
  • 37. Thank you! Questions?
  • 38. Key References • Shou, Lidan, et al. "Sumblr: Continuous summarization of evolving tweet streams." Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2013. • Marcus, Adam, et al. "TwitInfo: Aggregating and visualizing microblogs for event exploration." Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2011. • Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet allocation." Journal of Machine Learning Research 3 (2003): 993-1022. • Erkan, Günes, and Dragomir R. Radev. "LexRank: Graph-based lexical centrality as salience in text summarization." Journal of Artificial Intelligence Research 22.1 (2004): 457-479.