MULTISCALE EVENT DETECTION IN SOCIAL MEDIA
MS candidate: Denis Antyukhov
Thesis advisor: Panagiotis Karras
INTRODUCTION TO EVENT DETECTION
In general, events can be defined as real-world occurrences that unfold over space
and time (Allan et al. 1998; Xie et al. 2008).
Event detection from conventional media sources has long been addressed in the
Topic Detection and Tracking program, an initiative sponsored by the Defense
Advanced Research Projects Agency (DARPA).
Event detection consists of three major phases: data preprocessing, data
representation, and data organisation or clustering.
Detection methods can be broadly classified into document-pivot and feature-
pivot, depending on whether they rely on clustering documents or certain
document features (e.g. keywords).
Objective: Consider a data stream that contains temporal, spatial and text
information. Design event detection approaches that (i) are able to identify events
that appear at multiple spatiotemporal scales, and (ii) are robust against the
ambiguous and noisy information present in the data.
TWITTER AS A SOURCE OF EVENT DATA
Currently the most popular microblogging service: 150 million users, 500 million posts per day.
Big data: has the attributes of high Volume, Velocity and Variety.
Pros:
Diverse: a rich, continuous source of user-generated content that can yield valuable information
unavailable from traditional media outlets
Dynamic: short messages are easier to consume and faster to spread, hence the rapid propagation
of news through the network (Phuvipadawat and Murata 2010)
Social: tweets contain geo-tags, social data, hashtags, hyperlinks and other entities that can be
meaningfully utilised to enhance event detection
Cons:
Noisy: irregular or misspelled words, abbreviations, internet slang, improper sentence structure
Meaningless: spam, rumours, babble and other messages unrelated to any actual event are
overwhelming in volume
EVENT DETECTION SYSTEM ARCHITECTURE
PHASE 1: RETRIEVAL AND PREPROCESSING
• Begin by retrieving a collection of documents from the database.
~1K new documents are written to the DB every minute.
• Typically we retrieve documents that were created within some time
window (6-24 hrs) and originate from a certain area.
• Pre-processing: normalisation (digits, punctuation and URLs removed),
tokenisation, stop-word filtering, stemming, etc.
• Aggressive filtering: obviously meaningless messages (no clean text, all
emoji, more than 4 hashtags) are discarded. This removes ~30% of tweets.
• Extraction of Twitter-specific entities (hashtags, check-ins, emoji, URLs).
These require explicit processing and are handled separately from the clean
textual data.
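The normalisation and filtering steps above can be sketched as follows. This is a minimal illustration, not the system's actual code: the stop-word list, the regular expressions and the function name `preprocess` are assumptions, stemming is omitted, and only ASCII letters are kept as "clean text".

```python
import re

# Hypothetical, tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "at", "in", "on", "of", "and", "to"}

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
MAX_HASHTAGS = 4  # from the slide: messages with more than 4 hashtags are dropped


def preprocess(tweet):
    """Return (tokens, hashtags), or None if the tweet should be discarded."""
    # Remove URLs first so that URL fragments are not mistaken for hashtags.
    text = URL_RE.sub(" ", tweet)
    # Hashtags are extracted and kept apart from the clean text.
    hashtags = [h.lower() for h in HASHTAG_RE.findall(text)]
    if len(hashtags) > MAX_HASHTAGS:
        return None
    text = HASHTAG_RE.sub(" ", text)
    # Keep only ASCII letters and whitespace: drops digits, punctuation, emoji.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    if not tokens:
        return None  # no clean text left (e.g. an all-emoji message)
    return tokens, hashtags
```

Calling `preprocess` on a tweet returns the cleaned tokens together with the hashtags, so the two can be processed separately later on; discarded messages come back as `None`.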
PHASE 2: DATA REPRESENTATION
In order for documents to be successfully clustered, their contents need to be
represented in a more convenient and meaningful way.
• textual content is represented as feature vectors in a vector space. We use a
TF-IDF weighting scheme (baseline) and a word2vec model
• emoji are represented as a binary variable: positive or negative
• proper nouns and hashtags are extracted and their weight is boosted
• usage statistics of each word in vocabulary are organised as time-series for
anomaly extraction
• in progress: classify posted photos using a ConvNet, use the predicted
classes to enhance detection precision
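As an illustration of the TF-IDF baseline mentioned above, here is a minimal sketch using sparse dictionaries. The exact weighting variant used in the system is not stated in the slides; the formula below (term frequency times log(N/df)) and the function names `tfidf_vectors` and `cosine` are assumptions chosen for brevity.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenised, non-empty document to a sparse {term: weight} dict."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Documents that share informative terms get a positive cosine similarity, while documents with no terms in common score zero; this similarity is what a clustering model would consume.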
PHASE 3: DOCUMENT-PIVOT CLUSTERING
Using the representations obtained in phase 2, we now use clustering models to
obtain event candidates. An ensemble of models is used:
• hierarchical clustering model based on word-vector representations
• geo-model, which uses DBSCAN to cluster geo-tagged tweets and then filters the
clusters by textual content
• entity model, which relies on hashtags and proper nouns as input to hierarchical
clustering
DBSCAN is well suited to geo-data: it is robust to outliers and detects
arbitrarily shaped clusters.
Hierarchical clustering is useful for clustering vector representations of
text. The resulting dendrogram is cut at a threshold distance to produce
clusters.
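To make the geo-model concrete, the sketch below runs a plain DBSCAN over (latitude, longitude) pairs with great-circle distance. It is a small, quadratic-time illustration rather than the system's implementation; the parameter names `eps_km` and `min_pts` and the choice of the haversine distance are assumptions, and the subsequent filtering by textual content is not shown.

```python
import math


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def dbscan(points, eps_km, min_pts):
    """Return one label per point: a cluster id >= 0, or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # Brute-force neighbourhood query; includes the point itself.
        return [j for j in range(len(points))
                if haversine_km(points[i], points[j]) <= eps_km]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # core point: expand the cluster
                queue.extend(nbrs)
    return labels
```

Isolated geo-tagged tweets end up with the label -1, which is the outlier robustness noted above; dense groups of any shape receive a shared cluster id.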
PHASE 3 (CONT'D): FEATURE-PIVOT CLUSTERING
Instead of clustering documents, feature-pivot models detect anomalies in
distributions of features to achieve event detection.
Example: time-series analysis of term usage frequency
Useful for detecting trending, 'bursty' topics in a continuous stream of
messages.
Terms are represented as signals. Well-developed, efficient signal-processing
techniques exist for their analysis (correlation, convolution).
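One simple way to flag a 'bursty' term in such a time series is a rolling z-score: compare each count with the mean and standard deviation of the preceding window. This is an illustrative stand-in, not necessarily the anomaly detector used in the system; the window length, the threshold and the floor on the standard deviation are all assumed values.

```python
import statistics


def bursts(counts, window=6, z_threshold=3.0):
    """Return the indices whose count is anomalously high versus recent history."""
    hits = []
    for t in range(window, len(counts)):
        history = counts[t - window:t]
        mean = statistics.mean(history)
        std = statistics.pstdev(history)
        # Floor the deviation at 1.0 so a flat history does not divide by zero
        # or turn tiny fluctuations into bursts (an assumed safeguard).
        z = (counts[t] - mean) / max(std, 1.0)
        if z >= z_threshold:
            hits.append(t)
    return hits
```

For a series of hourly counts for one term, `bursts` returns the positions where usage jumps sharply above its recent level. Note that a burst inflates the history window that follows it, so a sustained high level is reported only at its onset.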
FROM TERM-TIME SERIES TO WAVELET ANALYSIS
Time series of term usage frequencies are similar for related terms, particularly for
documents that are also close in space. By clustering related terms in the time-space
domain we may obtain event-related clusters.
Idea: When two tweets share common terms and are close in space, we could
tolerate a coarser temporal resolution in computing similarity. Vice versa, when
they are close in time, we could tolerate a coarser spatial resolution.
Solution: measure similarity between time series by the correlation of their
coefficients under the wavelet transform (Daubechies 1992)
Discrete wavelet transform (DWT) using the Haar wavelet provides a natural way
to handle different temporal scales
Approximation coefficients of DWT at different levels correspond to aggregating
the time series from fine scales into coarse scales, each time by a factor of two.
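The effect of moving to a coarser temporal scale can be sketched with the Haar approximation coefficients alone. The code below is a minimal illustration of the temporal part only: how the scale is chosen from spatial distance is not specified in the slides and is not modelled here, and the function names `haar_approx` and `pearson` are assumptions.

```python
import math


def haar_approx(signal, level):
    """Haar DWT approximation coefficients after `level` decomposition steps.

    Each step merges adjacent pairs, halving the temporal resolution.
    """
    coeffs = list(signal)
    for _ in range(level):
        if len(coeffs) % 2:
            raise ValueError("length must be divisible by 2**level")
        coeffs = [(coeffs[i] + coeffs[i + 1]) / math.sqrt(2)
                  for i in range(0, len(coeffs), 2)]
    return coeffs


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Two terms that burst one time step apart.
term_a = [0, 0, 5, 0, 0, 0, 0, 0]
term_b = [0, 0, 0, 5, 0, 0, 0, 0]
fine = pearson(term_a, term_b)
coarse = pearson(haar_approx(term_a, 1), haar_approx(term_b, 1))
```

At the original resolution the two series are weakly negatively correlated (`fine`), but after one Haar level both bursts fall into the same coarse bin and the correlation (`coarse`) becomes essentially 1. This is the sense in which a coarser temporal resolution can be tolerated for terms that are close in space.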
CLASSIFICATION
• Clustering models detect event-related clusters, as well
as heterogeneous collections, rumours and spam.
• To classify clusters we extract ~30 cluster-level features:
Textual: number of unique unigrams, proper nouns,
hashtags, mean tf-idf/word2vec similarity …
Social: number of friends, followers, retweets
Meta: hyperlinks, check-ins, hashtags, etc.
• We use an SVM and a perceptron to classify clusters as
useless or event-related. The training set is obtained from
historical data. Current classification report:

         precision   recall   f1 score
event    0.69        0.79     0.74
spam     0.73        0.72     0.72
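The f1 score in the report is the harmonic mean of precision and recall. The short check below recomputes it from the reported (already rounded) precision and recall values; the function name `f1_score` is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the classification report above.
event_f1 = round(f1_score(0.69, 0.79), 2)
spam_f1 = round(f1_score(0.73, 0.72), 2)
```

Both recomputed values agree with the f1 column of the report to two decimal places.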
SAMPLE OUTPUT
REFERENCES
[1] C. C. Aggarwal and C. Zhai. A survey of text clustering algorithms. In Mining Text Data. Springer, New York, 2012.
[2] J. Allan. Topic Detection and Tracking: Event-based Information Organization. Springer, volume 12, 2002.
[3] Chenliang Li and Aixin Sun. Twevent: Segment-based event detection from tweets. CIKM '12: Proceedings of the 21st
ACM International Conference on Information and Knowledge Management, 2012.
[4] Hila Becker, Mor Naaman, and Luis Gravano. Beyond trending topics: Real-world event identification on Twitter. Fifth
International AAAI Conference on Weblogs and Social Media, 2011.
[5] Jagan Sankaranarayanan, Hanan Samet, et al. TwitterStand: News in tweets. University of Maryland, 2009.
[6] Luca Maria Aiello, Georgios Petkos, Carlos Martin Dancausa, et al. Sensing trending topics in Twitter. IEEE
Transactions on Multimedia, 2013.
[7] Sasa Petrovic. Real-time event detection in massive streams. University of Edinburgh, 2012.
[8] Swit Phuvipadawat and Tsuyoshi Murata. Breaking news detection and tracking in Twitter. 2010 IEEE/WIC/ACM
International Conference on Web Intelligence and Intelligent Agent Technology, 2010.
[9] Y. Yang, T. Pierce, and J. Carbonell. A study of retrospective and on-line event detection. SIGIR '98, ACM, New
York, NY, 1998.
[10] Y. Yang, J. Zhang, and J. Carbonell. Topic-conditioned novelty detection. KDD '02, ACM, New York, NY, 2002.

