Feature-rich event detection in
social media streams
by Denis Antyukhov
Motivation
• Social networking platforms such as Twitter have emerged in
recent years, creating a radically new mode of communication
between people.
• Monitoring and analysing the rich, continuous flow of
user-generated content can yield valuable information that
is not available from traditional media outlets.
• However, learning from microblog streams poses new challenges
because user-generated content is noisy.
• Goal: develop a system for detecting new and ongoing events in
streams of Twitter messages.
System Architecture
Data
• Sources: Boston, New York, Miami, Chicago and
Vancouver tweets
• Mining: a self-developed tool built on the public
Twitter API
• Storage: MongoDB (supports temporal and spatial
indexing)
• Total: 8 million tweets in JSON format (30 GB)
Language processing
Tweets are tokenised using regular expressions and stop
words are removed; hyperlinks, hashtags, user mentions,
check-ins and emoji are extracted.
To facilitate vectorisation and classification, two models
were trained on our tweet corpus:
• TF-IDF: weighting statistic over a 75 000-term vocabulary
• word2vec: 512-dimensional word embeddings
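The preprocessing step above can be sketched as follows. This is a minimal illustration, not the author's tool: the regular expressions, the stop-word list and the function name `preprocess` are all assumptions, and check-in and emoji extraction are left out because the slides do not say how they were done.

```python
import re

# Hypothetical patterns and stop-word list; the slides do not give the real ones.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
WORD_RE = re.compile(r"[a-z0-9']+")
STOP_WORDS = {"a", "an", "the", "at", "by", "is", "to", "of", "and", "in", "on"}


def preprocess(tweet):
    """Split a tweet into extracted entities and stop-word-filtered tokens."""
    # Pull out hyperlinks first so their fragments are not tokenised as words.
    urls = URL_RE.findall(tweet)
    text = URL_RE.sub(" ", tweet)
    # Hashtags and user mentions are extracted, lower-cased and removed from the text.
    hashtags = [h.lower() for h in HASHTAG_RE.findall(text)]
    text = HASHTAG_RE.sub(" ", text)
    mentions = [m.lower() for m in MENTION_RE.findall(text)]
    text = MENTION_RE.sub(" ", text)
    # What remains is tokenised with a regular expression and filtered.
    tokens = [t for t in WORD_RE.findall(text.lower()) if t not in STOP_WORDS]
    return {"tokens": tokens, "hashtags": hashtags, "mentions": mentions, "urls": urls}


result = preprocess(
    "Great show by @the1975 at the Terminal 5 #concert #NYC http://t.co/abc"
)
print(result)
```

Keeping the entities separate from the plain tokens means the same pass can feed both the text models and the entity-based clustering.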
Clustering: GEO
• 10% of tweets carry a precise geo position
• Very useful for event detection: people tweeting
about a certain event are often close to each other
• The DBSCAN algorithm is used to cluster data points in
2-dimensional space
• This model works well for detecting sports and
music events, and other major happenings
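To make the geo clustering step concrete, here is a small pure-Python sketch of DBSCAN on 2-dimensional points. The coordinates, `eps` and `min_pts` are made-up illustration values, and plain Euclidean distance on (latitude, longitude) pairs is an assumption that is only reasonable at city scale; the slides do not state the distance metric or parameters that were used.

```python
import math


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later become a border point of a cluster
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is also a core point: keep growing
    return labels


# Hypothetical geotagged tweets as (latitude, longitude) pairs.
points = [
    (40.7500, -73.9900), (40.7502, -73.9901), (40.7501, -73.9898),  # dense group
    (40.6000, -74.1000),                                            # isolated tweet
    (40.8296, -73.9262), (40.8297, -73.9263), (40.8295, -73.9261),  # dense group
]
labels = dbscan(points, eps=0.001, min_pts=3)
print(labels)
```

DBSCAN suits this setting because the number of events is not known in advance and isolated tweets are simply marked as noise.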
Clustering: entity
• ~50% of tweets contain #hashtags and @ check-ins; we refer
to these objects as entities
• Entities are useful because people tend to use the same
hashtags when referring to a certain event, and check-ins
correspond to a location in space. They are also prone
to misspelling.
• We extract all entities found within a time window, vectorise
them, and apply hierarchical clustering based on cosine similarity
• This model efficiently detects popular events described by
hashtags, or happening at a certain venue
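One way to picture the entity clustering is the sketch below. It assumes each tweet is vectorised as a bag of its entities and that clusters are merged with average linkage until similarity drops below a threshold; the slides only specify hierarchical clustering on cosine similarity, so the vectorisation, the linkage, the threshold and the function names are all illustrative assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster_by_entities(entity_lists, threshold):
    """Average-linkage agglomerative clustering; merge while similarity >= threshold."""
    vectors = [Counter(entities) for entities in entity_lists]
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Mean pairwise cosine similarity between the members of two clusters.
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        best, (a, b) = max(
            (linkage(clusters[a], clusters[b]), (a, b))
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical entity lists, one per tweet in the time window.
tweets = [
    ["#jets", "#giants", "#nygiants"],
    ["#jets", "#giants"],
    ["#concert", "#the1975", "#nyc"],
    ["#the1975", "#concert"],
    ["#giants", "#nygiants"],
]
clusters = cluster_by_entities(tweets, threshold=0.3)
print(clusters)
```

The quadratic search for the best pair is fine for a sketch but would need a proper linkage implementation for a full time window of tweets.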
Clustering: bursty n-gram
• This approach is based on detecting n-grams that
demonstrate bursty behaviour (i.e. start appearing
unusually often compared to historic data)
• We start by extracting 2- and 3-grams that appear
frequently within a time window, vectorise the documents
that contain them, and apply hierarchical clustering.
• We rank the clusters using the formula from Aiello et al. (2013)
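The idea of bursty behaviour can be illustrated with a simple frequency ratio. The scoring rule below (current document frequency divided by the smoothed mean over past windows) is an assumption made for illustration and is not the ranking formula from Aiello et al. (2013); the thresholds, function names and sample data are hypothetical too.

```python
from collections import Counter


def ngrams(tokens, sizes=(2, 3)):
    """All bigrams and trigrams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for n in sizes for i in range(len(tokens) - n + 1)]


def doc_frequencies(docs):
    """Number of documents in which each n-gram occurs at least once."""
    return Counter(gram for doc in docs for gram in set(ngrams(doc)))


def bursty_ngrams(current_docs, past_windows, min_count=2, min_ratio=2.0):
    """N-grams that are frequent now and unusually frequent compared to history."""
    current = doc_frequencies(current_docs)
    history = [doc_frequencies(window) for window in past_windows]
    bursty = {}
    for gram, count in current.items():
        if count < min_count:
            continue  # not frequent enough in the current window
        mean_past = sum(h[gram] for h in history) / len(history) if history else 0.0
        # +1 smoothing is an assumption; it avoids division by zero for unseen n-grams.
        ratio = count / (mean_past + 1.0)
        if ratio >= min_ratio:
            bursty[gram] = ratio
    return bursty


# Hypothetical tokenised tweets: two past time windows and the current one.
past = [
    [["good", "morning", "new", "york"], ["new", "york", "pizza"]],
    [["new", "york", "traffic"], ["good", "morning"]],
]
current = [
    ["giants", "vs", "jets", "tonight"],
    ["new", "york", "giants", "vs", "jets"],
    ["giants", "vs", "jets", "in", "new", "york"],
]
bursts = bursty_ngrams(current, past)
print(sorted(bursts))
```

Here "new york" is frequent in the current window but just as common in the past, so it is not flagged, while the game-related n-grams are.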
Classification
• Our models detect event-related clusters, as well as
heterogeneous collections, rumours and spam.
To classify clusters we extract 20 rich features:
• Textual: number of unique unigrams, proper nouns,
hashtags, mean tf-idf/word2vec similarity …
• Social: number of friends, followers, retweets
• Meta: hyperlinks, Instagram links, entities, etc.
• We use a multilayer perceptron (2 hidden layers) to classify
clusters based on this feature representation
        precision   recall   f1 score
event   0.69        0.79     0.74
spam    0.73        0.72     0.72
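As a quick consistency check on the reported scores, F1 is the harmonic mean of precision and recall. The snippet below recomputes it from the precision and recall values in the table; the variable and function names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported on the slide.
reported = {"event": (0.69, 0.79), "spam": (0.73, 0.72)}
f1 = {label: f1_score(p, r) for label, (p, r) in reported.items()}

for label, score in f1.items():
    print(f"{label}: {score:.2f}")
```

Both recomputed values agree with the reported f1 scores after rounding to two decimals.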
Events of NY
[Chart: events per day of the week (Monday to Sunday) for the GEO, Entity and n-gram models; vertical axis 0 to 40]
Sample output
#SantaCon #fun #santaconnewport #santacon @ Newport, Rhode Island
#concert #The1975 #NYC #terminal5 @ TERMINAL 5
#jets #giants #NYGiants @ New York Giants vs. New York Jets
