This thesis proposes approaches for multiscale event detection from social media data streams. The system architecture involves three phases: (1) data retrieval and preprocessing to filter noisy tweets, (2) data representation using text vectorization, entity extraction and time series analysis, and (3) document-pivot and feature-pivot clustering to detect event candidates. Clustering results are classified using features like topic distribution and social engagement to identify meaningful events. Wavelet analysis is also used to measure term similarity across different temporal and spatial scales.
Pre-defense talk
1. MULTISCALE EVENT DETECTION IN SOCIAL MEDIA
MS candidate: Denis Antyukhov Thesis advisor: Panagiotis Karras
2. INTRODUCTION TO EVENT DETECTION
In general, events can be defined as real-world occurrences that unfold over space
and time (Allan et al. 1998; Xie et al. 2008).
Event detection from conventional media sources has long been addressed in the
Topic Detection and Tracking program, an initiative sponsored by the Defense
Advanced Research Projects Agency.
Event detection consists of three major phases: data preprocessing, data
representation, and data organisation or clustering.
Detection methods can be broadly classified into document-pivot and feature-pivot,
depending on whether they rely on clustering documents or certain document
features (e.g. keywords).
Objective: Consider a data stream that contains temporal, spatial and text
information. Design event detection approaches that (i) are able to identify events
that appear at multiple spatiotemporal scales, and (ii) are robust against the
ambiguous and noisy information present in the data.
3. TWITTER AS A SOURCE OF EVENT DATA
Currently the most popular microblogging service: ~150 million users, ~500 million posts per day.
Big data: has the attributes of high Volume, Velocity and Variety.
Pros:
Diverse: a rich and continuous source of user-generated content that can yield valuable
information unavailable from traditional media outlets
Dynamic: short messages are easier to consume and faster to spread, hence the rapid propagation
of news through the network (Phuvipadawat and Murata 2010)
Social: Tweets contain geo-tags, social data, hashtags, hyperlinks and other entities that can be
meaningfully utilised to enhance event detection.
Cons:
Noisy: irregular/misspelled words, abbreviations, internet slang, improper sentence structure
Meaningless: spam, rumours, babble, and other messages unrelated to any actual event
make up the overwhelming majority
5. PHASE 1: RETRIEVAL AND PREPROCESSING
• Begin by retrieving a collection of documents from the database;
~1K new documents are written to the DB every minute.
• Typically we retrieve documents that were created within some time
window (6-24 hrs) and originate from a certain area
• Pre-processing: normalisation (digits, punctuation, URLs removed),
tokenisation, stop-word filtering, stemming, etc.
• Aggressive filtering: obviously meaningless messages (no clean text, all
emoji, more than 4 hashtags) are discarded. This removes ~30% of tweets.
• Twitter-specific entities (hashtags, check-ins, emoji, URLs) are extracted.
These require explicit processing and are considered separately from the
clean textual data.
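A minimal sketch of this retrieval-and-filtering step. The stop-word list, thresholds and helper names are illustrative, not the thesis implementation; stemming is omitted for brevity:

```python
import re

STOPWORDS = {"the", "a", "an", "is", "at", "of", "on", "in", "rt"}

def preprocess(tweet):
    """Normalise one raw tweet and split off Twitter-specific entities.
    Returns (tokens, entities), or None if the tweet is filtered out."""
    lowered = tweet.lower()
    hashtags = re.findall(r"#\w+", lowered)
    urls = re.findall(r"https?://\S+", lowered)
    if len(hashtags) > 4:            # aggressive filtering: too many hashtags
        return None
    text = re.sub(r"https?://\S+", " ", lowered)   # normalisation: drop URLs,
    text = re.sub(r"[^a-z\s]", " ", text)          # digits and punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS]
    if not tokens:                   # no clean text left (e.g. all emoji)
        return None
    return tokens, {"hashtags": hashtags, "urls": urls}
```

Entities are returned alongside the clean tokens so they can be processed separately, as the slide describes.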
6. PHASE 2: DATA REPRESENTATION
In order for documents to be successfully clustered, their contents need to be
represented in a more convenient and meaningful way.
• textual content is represented as feature vectors in a vector space; we use the
TF-IDF weighting scheme (baseline) and a word2vec model
• emoji are reduced to a binary sentiment variable: positive or negative
• proper nouns and hashtags are extracted and their weight is boosted
• usage statistics of each word in vocabulary are organised as time-series for
anomaly extraction
• in progress: classify posted photos with a ConvNet and use the predicted
classes to improve detection precision
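The baseline TF-IDF representation can be sketched as follows. This is an illustrative stand-in, not the thesis code; the word2vec model and the weight boosting of proper nouns and hashtags are omitted:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Represent tokenised documents as sparse TF-IDF dicts
    (term -> weight), using raw term frequency and log inverse
    document frequency."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Documents sharing terms end up closer in this space, which is what the clustering in Phase 3 relies on.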
7. PHASE 3: DOCUMENT-PIVOT CLUSTERING
Using the representations obtained in phase 2, we now use clustering models to
obtain event candidates. An ensemble of models is used:
• hierarchical clustering model based on word-vector representations
• geo-model, which uses DBSCAN to cluster geo-tagged tweets and then filters
based on textual content
• entity model, which relies on hashtags and proper nouns as input to hierarchical
clustering
DBSCAN is well suited to geo-data: it is
robust to outliers and detects arbitrarily
shaped clusters.
Hierarchical clustering is useful for clustering
vector representations of text; the resulting
dendrogram is cut at a threshold distance to
produce clusters.
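The DBSCAN step of the geo-model can be sketched as below. This is a minimal illustrative implementation (the subsequent text-based filtering is omitted, and `eps_km`/`min_pts` values are placeholders, not the thesis parameters):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(p, q):
    """Great-circle distance between two (lat, lon) points, in km."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))

def dbscan_geo(points, eps_km=0.5, min_pts=3):
    """Minimal DBSCAN over geo-tagged points: returns one label per
    point, with -1 marking noise (outliers)."""
    def neighbours(i):
        return [j for j in range(len(points))
                if haversine_km(points[i], points[j]) <= eps_km]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1               # noise, unless claimed later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster      # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs_j = neighbours(j)
            if len(nbrs_j) >= min_pts:   # j is a core point: keep expanding
                queue.extend(nbrs_j)
    return labels
```

Because cluster shape is never assumed, spatially elongated events (e.g. a parade route) are recovered as readily as compact ones, and isolated tweets fall out as noise.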
8. PHASE 3 (CONTD): FEATURE PIVOT CLUSTERING
Instead of clustering documents, feature-pivot models detect anomalies in
distributions of features to achieve event detection.
Example: time-series analysis of term usage frequency
Useful for detecting trendy, ‘bursting’ topics in a continuous stream of
messages
Terms are represented as signals; well-developed, efficient signal-processing
techniques exist for their analysis (correlation, convolution)
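A toy illustration of the time-series idea: flag a term as "bursting" when its latest usage count deviates sharply from its own history. The z-score rule and threshold here are illustrative stand-ins, not the thesis method:

```python
from statistics import mean, stdev

def bursting_terms(series, z_thresh=3.0):
    """Return terms whose latest per-interval count is a large
    z-score above their historical mean.
    `series` maps term -> list of per-interval usage counts."""
    bursts = []
    for term, counts in series.items():
        history, latest = counts[:-1], counts[-1]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            sigma = 1.0              # flat history: avoid division by zero
        if (latest - mu) / sigma >= z_thresh:
            bursts.append(term)
    return bursts
```

A term with a steady baseline (e.g. "weather") stays quiet, while a sudden spike (e.g. "fire" during an incident) is flagged.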
9. FROM TERM-TIME SERIES TO WAVELET ANALYSIS
Time-series of term usage frequencies are similar for related terms, particularly for
documents also close in space. By clustering related terms in time-space domain
we may obtain event-related clusters.
Idea: when two tweets share common terms and are close in space, we can
tolerate a coarser temporal resolution when computing similarity. Conversely, when
they are close in time, we can tolerate a coarser spatial resolution.
Solution: measure similarity between time series by the correlation of their
coefficients under the wavelet transform (Daubechies 1992)
Discrete wavelet transform (DWT) using the Haar wavelet provides a natural way
to handle different temporal scales
Approximation coefficients of DWT at different levels correspond to aggregating
the time series from fine scales into coarse scales, each time by a factor of two.
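The multiscale aggregation can be sketched as below: each DWT level averages adjacent pairs, halving temporal resolution. This is an illustrative simplification (the sqrt(2) normalisation of the Haar transform is dropped, and plain Pearson correlation stands in for the thesis's similarity measure):

```python
def haar_approx(signal, levels):
    """Approximation coefficients of the (unnormalised) Haar DWT:
    each level averages adjacent pairs, coarsening the time series
    by a factor of two. Odd trailing samples are dropped."""
    out = list(signal)
    for _ in range(levels):
        out = [(out[i] + out[i + 1]) / 2 for i in range(0, len(out) - 1, 2)]
    return out

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0
```

Two term signals that peak one interval apart correlate weakly at the finest scale, but strongly once aggregated one level, which is exactly the tolerance to coarser temporal resolution described above.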
10. CLASSIFICATION
• Clustering models detect event-related clusters, but also
heterogeneous collections, rumours and spam.
To classify clusters we extract ~30 cluster-level features:
• Textual: number of unique unigrams, proper nouns,
hashtags, mean tf-idf/word2vec similarity …
• Social: number of friends, followers, retweets
• Meta: hyperlinks, check-ins, hashtags, etc.
• We use an SVM and a perceptron to classify clusters as
useless or event-related. The training set is obtained from
historical data. Current classification report:
         precision   recall   f1-score
event       0.69      0.79      0.74
spam        0.73      0.72      0.72
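A toy sketch of the perceptron half of this step. The feature vectors below (e.g. mean text similarity, normalised retweet count, hashtag ratio) and all values are fabricated for illustration; the thesis trains on ~30 features extracted from historical data:

```python
def train_perceptron(X, y, epochs=20, lr=0.1):
    """Train a single-layer perceptron on cluster-level feature
    vectors. y: 1 = event-related, 0 = useless."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0
            err = yi - pred                      # perceptron update rule
            w = [wj + lr * err * xj for wj, xj in zip(w, xi)]
            b += lr * err
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0
```

In practice an SVM with a proper margin objective is the stronger of the two classifiers; the perceptron is shown only because it fits in a few lines.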
12. REFERENCES
[1] C. C. Aggarwal and C. Zhai. A survey of text clustering algorithms. In Mining Text Data. Springer, New York, 2012.
[2] J. Allan. Topic Detection and Tracking: Event-Based Information Organization. Springer, 2002.
[3] C. Li and A. Sun. Twevent: segment-based event detection from tweets. In CIKM '12: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, 2012.
[4] H. Becker, M. Naaman, and L. Gravano. Beyond trending topics: real-world event identification on Twitter. In Fifth International AAAI Conference on Weblogs and Social Media, 2011.
[5] J. Sankaranarayanan, H. Samet, et al. TwitterStand: news in tweets. University of Maryland, 2009.
[6] L. M. Aiello, G. Petkos, C. Martin Dancausa, et al. Sensing trending topics in Twitter. IEEE Transactions on Multimedia, 2013.
[7] S. Petrovic. Real-time event detection in massive streams. University of Edinburgh, 2012.
[8] S. Phuvipadawat and T. Murata. Breaking news detection and tracking in Twitter. In 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 2010.
[9] Y. Yang, T. Pierce, and J. Carbonell. A study of retrospective and on-line event detection. In SIGIR '98, ACM, New York, NY, 1998.
[10] Y. Yang, J. Zhang, and J. Carbonell. Topic-conditioned novelty detection. In KDD '02, ACM, New York, NY, 2002.