MULTISCALE EVENT DETECTION IN SOCIAL MEDIA
MS candidate: Denis Antyukhov
Thesis advisor: Panagiotis Karras
INTRODUCTION TO EVENT DETECTION
In general, events can be defined as real-world occurrences that unfold over space
and time (Allan et al. 1998; Xie et al. 2008).
Event detection from conventional media sources has long been addressed in the
Topic Detection and Tracking (TDT) program, an initiative sponsored by the Defense
Advanced Research Projects Agency (DARPA).
Event detection consists of three major phases: data preprocessing, data
representation, and data organisation or clustering.
Detection methods can be broadly classified into document-pivot and
feature-pivot, depending on whether they rely on clustering documents or certain
document features (e.g. keywords).
Objective: Consider a data stream that contains temporal, spatial and text
information. Design event detection approaches that (i) are able to identify events
that appear at multiple spatiotemporal scales, and (ii) are robust against the
ambiguous and noisy information present in the data.
TWITTER AS A SOURCE OF EVENT DATA
Currently the most popular microblogging service: 150 million users, 500 million posts per day.
Big data: high Volume, Velocity and Variety.
Pros:
• Diverse: a rich and continuous source of user-generated content that can yield valuable
information unavailable from traditional media outlets
• Dynamic: short messages are easier to consume and faster to spread, hence the rapid propagation
of news through the network (Phuvipadawat and Murata 2010)
• Social: tweets contain geo-tags, social data, hashtags, hyperlinks and other entities that can be
meaningfully utilised to enhance event detection
Cons:
• Noisy: irregular or misspelled words, abbreviations, internet slang, improper sentence structure
• Meaningless: spam, rumours, babble and other messages not related to any actual
event are overwhelmingly common
EVENT DETECTION SYSTEM ARCHITECTURE
PHASE 1: RETRIEVAL AND PREPROCESSING
• Begin by retrieving a collection of documents from the database;
~1K new documents are written to the DB every minute.
• Typically we retrieve documents that were created within some time
window (6-24 hrs) and originate from a certain area
• Pre-processing: normalisation (digits, punctuation and URLs removed),
tokenisation, stop-word filtering, stemming, etc.
• Aggressive filtering: obviously meaningless messages (no clean text, all
emoji, more than 4 hashtags) are discarded. This removes ~30% of tweets.
• Extraction of Twitter-specific entities (hashtags, check-ins, emoji, URLs).
These require explicit processing and are handled separately from the clean
textual data.
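The Phase 1 steps above could be sketched roughly as follows. This is a minimal illustration and not the pipeline used in the thesis: the stop-word list, the regular expressions and the function name `preprocess` are assumptions, stemming is omitted, and only the "more than 4 hashtags" and "no clean text" filters come from the slide.

```python
import re

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"a", "an", "the", "is", "at", "in", "on", "of", "to", "and"}

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
NON_LETTER_RE = re.compile(r"[^a-z\s]")

MAX_HASHTAGS = 4  # from the slide: messages with more than 4 hashtags are discarded


def preprocess(tweet):
    """Return (tokens, hashtags), or None if the tweet is filtered out."""
    # Entities are extracted first and kept apart from the clean text.
    hashtags = [h.lower() for h in HASHTAG_RE.findall(tweet)]
    if len(hashtags) > MAX_HASHTAGS:
        return None
    # Normalisation: drop URLs and hashtags, then digits, punctuation and emoji.
    text = URL_RE.sub(" ", tweet)
    text = HASHTAG_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text.lower())
    # Tokenisation and stop-word filtering.
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    if not tokens:  # no clean text left (e.g. an all-emoji message)
        return None
    return tokens, hashtags
```

Returning `None` for discarded messages keeps the aggressive filter and the normalisation in one pass, so later phases only ever see tweets with usable text.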
PHASE 2: DATA REPRESENTATION
In order for documents to be successfully clustered, their contents need to be
represented in a more convenient and meaningful way.
• textual content is represented as feature vectors in a vector space. We use a
TF-IDF weighting scheme (baseline) and a word2vec model
• emoji are represented as a binary variable: positive or negative
• proper nouns and hashtags are extracted and their weight is boosted
• usage statistics of each word in the vocabulary are organised as time series for
anomaly extraction
• in progress: classify posted photos using a ConvNet and use the predicted
classes to improve detection precision
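To make the TF-IDF baseline concrete, here is a small self-contained sketch. It assumes one common variant of the weighting (relative term frequency times the natural log of inverse document frequency); the slides do not say which variant is used, and the function names are illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        vectors.append(vec)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Under this variant a term that occurs in every document gets weight zero, which is why TF-IDF suppresses ubiquitous words without needing an exhaustive stop-word list.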
PHASE 3: DOCUMENT-PIVOT CLUSTERING
Using the representations obtained in phase 2, we now use clustering models to
obtain event candidates. An ensemble of models is used:
• a hierarchical clustering model based on word-vector representations
• a geo-model, which uses DBSCAN to cluster geo-tagged tweets and then filters
the clusters based on textual content
• an entity model, which relies on hashtags and proper nouns as input to hierarchical
clustering
DBSCAN is well suited to geo-data:
it is robust to outliers and detects arbitrarily
shaped clusters.
Hierarchical clustering is useful for clustering
vector representations of text. The resulting
dendrogram is cut at a threshold distance to
produce clusters.
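The geo-model's clustering step can be illustrated with a compact, pure-Python DBSCAN over (latitude, longitude) pairs. This is a sketch for exposition only: a production system would more likely use a library implementation with a spatial index, and the haversine distance, the parameter names `eps_km` and `min_pts` and any values passed to them are assumptions not taken from the slides.

```python
import math

NOISE = -1


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def dbscan(points, eps_km, min_pts):
    """Return one cluster label per point; NOISE marks outliers."""
    labels = [None] * len(points)

    def neighbours(i):
        # Brute-force neighbourhood query, O(n) per call (includes i itself).
        return [j for j, q in enumerate(points)
                if haversine_km(points[i], q) <= eps_km]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # j is a core point: keep expanding
                queue.extend(nbrs)
        cluster += 1
    return labels
```

Points labelled `NOISE` are simply dropped, which is what makes the method robust to isolated geo-tagged tweets; each surviving cluster would then be passed to the text-based filter.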
PHASE 3 (CONTD): FEATURE-PIVOT CLUSTERING
Instead of clustering documents, feature-pivot models detect events by finding
anomalies in the distributions of features.
Example: time-series analysis of term usage frequency
Useful for detecting trending, ‘bursty’ topics in a continuous stream of
messages
Terms are represented as signals. Well-developed, efficient signal-processing
techniques exist for their analysis (correlation, convolution)
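As one simple illustration of anomaly detection on a term's usage signal, the sketch below flags time bins whose count is far above the recent average. The sliding-window z-score, the window length and the threshold are assumptions chosen for the example; the slides do not specify which burst detector is used.

```python
from statistics import mean, pstdev


def detect_bursts(counts, window=6, threshold=3.0):
    """Return indices of time bins whose count is anomalously high.

    counts: usage counts of one term per time bin, oldest first.
    """
    bursts = []
    for t in range(window, len(counts)):
        history = counts[t - window:t]
        mu = mean(history)
        sigma = pstdev(history)
        # Floor sigma at 1 so a nearly flat history does not turn every blip into a burst.
        z = (counts[t] - mu) / max(sigma, 1.0)
        if z > threshold:
            bursts.append(t)
    return bursts
```

Terms that burst in the same bins are natural candidates for grouping into a single event.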
FROM TERM TIME-SERIES TO WAVELET ANALYSIS
Time series of term usage frequencies are similar for related terms, particularly for
documents that are also close in space. By clustering related terms in the time-space
domain we may obtain event-related clusters.
Idea: When two tweets share common terms and are close in space, we could
tolerate a coarser temporal resolution in computing similarity. Vice versa, when
they are close in time, we could tolerate a coarser spatial resolution.
Solution: measure similarity between time series by the correlation of their
coefficients under the wavelet transform (Daubechies 1992).
The discrete wavelet transform (DWT) using the Haar wavelet provides a natural way
to handle different temporal scales.
Approximation coefficients of the DWT at different levels correspond to aggregating
the time series from fine scales into coarse scales, each time by a factor of two.
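The multiscale idea can be shown with a few lines of Python. The sketch below computes Haar approximation coefficients level by level and correlates two term time series at a chosen level. It is a simplified reading of the approach: it uses only approximation coefficients and Pearson correlation, assumes the series length is a power of two, and all function names are illustrative.

```python
import math


def haar_approximations(series):
    """Return Haar approximation coefficients for levels 0, 1, 2, ...

    Level 0 is the input itself; each further level merges adjacent pairs,
    halving the length. The input length is assumed to be a power of two.
    """
    levels = [list(series)]
    current = levels[0]
    while len(current) > 1:
        current = [(current[i] + current[i + 1]) / math.sqrt(2)
                   for i in range(0, len(current), 2)]
        levels.append(current)
    return levels


def pearson(x, y):
    """Pearson correlation; defined as 0.0 when either series is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def similarity_at_level(x, y, level):
    """Correlation of two series after aggregating each by a factor of 2**level."""
    return pearson(haar_approximations(x)[level], haar_approximations(y)[level])
```

For example, two series whose bursts fall in adjacent bins of the same pair, such as `[0, 0, 0, 8, 0, 0, 0, 0]` and `[0, 0, 8, 0, 0, 0, 0, 0]`, are negatively correlated at level 0 but become identical at level 1. This is the sense in which a coarser temporal resolution tolerates small timing offsets.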
CLASSIFICATION
• Clustering models detect event-related clusters, as well
as heterogeneous collections, rumours and spam.
• To classify clusters we extract ~30 cluster-level features:
  • Textual: number of unique unigrams, proper nouns,
  hashtags, mean tf-idf/word2vec similarity …
  • Social: number of friends, followers, retweets
  • Meta: hyperlinks, check-ins, hashtags, etc.
• We use an SVM and a perceptron to classify clusters as
useless or event-related. The training set is obtained from
historical data. Current classification report:
        precision   recall   f1 score
event   0.69        0.79     0.74
spam    0.73        0.72     0.72
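As a quick consistency check on the report, the F1 score is the harmonic mean of precision and recall. The snippet below recomputes it from the precision and recall values in the table; the helper name `f1_score` is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the classification report.
report = {"event": (0.69, 0.79), "spam": (0.73, 0.72)}
f1 = {label: round(f1_score(p, r), 2) for label, (p, r) in report.items()}
```

Rounded to two decimals, the recomputed values agree with the f1 column of the report.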
SAMPLE OUTPUT
REFERENCES
[1] C. C. Aggarwal and C. Zhai. A survey of text clustering algorithms. In Mining Text Data. Springer, New York, 2012.
[2] J. Allan. Topic detection and tracking: event-based information organization. Springer, volume 12, 2002.
[3] Chenliang Li and Aixin Sun. Twevent: segment-based event detection from tweets. CIKM ’12: Proceedings of the 21st
ACM International Conference on Information and Knowledge Management, 2012.
[4] Hila Becker, Mor Naaman and Luis Gravano. Beyond trending topics: Real-world event identification on Twitter. Fifth
International AAAI Conference on Weblogs and Social Media, 2011.
[5] Jagan Sankaranarayanan, Hanan Samet et al. TwitterStand: News in tweets. University of Maryland, 2009.
[6] Luca Maria Aiello, Georgios Petkos, Carlos Martin Dancausa et al. Sensing trending topics in Twitter. IEEE
Transactions on Multimedia, 2013.
[7] Sasa Petrovic. Real-time event detection in massive streams. University of Edinburgh, 2012.
[8] Swit Phuvipadawat and Tsuyoshi Murata. Breaking news detection and tracking in Twitter. 2010 IEEE/WIC/ACM
International Conference on Web Intelligence and Intelligent Agent Technology, 2010.
[9] Y. Yang, T. Pierce and J. Carbonell. A study of retrospective and on-line event detection. SIGIR 98, ACM, New
York, NY, 1998.
[10] Y. Yang, J. Zhang and J. Carbonell. Topic-conditioned novelty detection. KDD 02, ACM, New York, NY, 2002.
