final_nlp

•

0 likes•182 views

This document describes a system for detecting new and ongoing events from Twitter data streams. It collects tweets from several cities and uses techniques like geo-clustering, entity extraction, and bursty n-gram analysis to cluster related tweets. It then classifies the clusters as events or other categories like spam. The system was able to detect various local events with over 70% precision and recall based on textual, social, and metadata features of the tweets.

Feature-rich event detection in
social media streams
by Denis Antyukhov

Motivation
• Social networking platforms such as Twitter have emerged in
recent years, creating a radically new mode of communication
between people.
• Monitoring and analysing rich and continuous ﬂow of user-
generated content can yield unprecedentedly valuable
information, which would not have been available from traditional
media outlets.
• However, learning from microblog streams poses new challenges,
due to the noisy nature of user-generated content
• Goal: develop a system for detecting new and ongoing events from
streams of Twitter messages

Data
• Sources: Boston, New York, Miami, Chicago and
Vancouver tweets
• Mining: self-developed tool built against Twitter
public API
• Storage: MongoDB (supports temporal and spatial
indexing)
• Total: 8 million tweets in json format, 30 GB size

Language processing
Tweets are tokenised using regular expressions, stop
words are removed, hyperlinks, hashtags, user
mentions, check-ins and emoji are extracted.
To facilitate vectorising and classiﬁcation operations,
two models were learned using our tweet corpus:
• TF-IDF: 75 000 - vocabulary weighting statistic
• word2vec: 512-dimensional word embeddings

Clustering: GEO
• 10% of tweets have precise geo position
• Very useful for event detection: people tweeting
about a certain event are often close to each other
• DBSCAN algorithm is used to cluster data points in
2-dimensional space
• This model is really good at detecting sports and
music events, and other major happenings

Clustering: entity
• ~50% tweets contain #hashtags and @ check-ins, we refer
to these objects as entities
• Entities are useful because people tend to use same
hashtags when relating to a certain event, and check-ins
correspond to a location in space. They are also prone
to misspelling.
• We extract all entities found within a time window, vectorise,
and apply hierarchical clustering based on cosine similarity
• This model efﬁciently detects popular events described by
hashtags, or happening at a certain venue

Clustering: bursty n-gram
• This approach is based on detecting n-grams that
demonstrate bursty behaviour (i.e. start appearing
unusually often compared to historic data)
• We start by extracting 2,3-grams that are frequently
appearing within a time window, vectorise documents
that contain them and apply hierarchical clustering.
• We rank the clusters using formula (Aiello et al. 2013)

Classiﬁcation
• Our models detect event-related clusters, as well as
heterogeneous collections, rumours and spam.
To classify clusters we extract 20 rich features:
• Textual: number of unique unigrams, proper nouns,
hashtags, mean tf-idf/word2vec similarity …
Social: number of friends, followers, retweets
Meta: hyperlinks, instagram links, entities, etc
• We use a perceptron model to classify clusters based
on this feature representation (2 hidden layers)
precision recall f1 score
event 0.69 0.79 0.74
spam 0.73 0.72 0.72

Events of NY
0
10
20
30
40
Monday Tuesday Wednesday Thursday Friday Saturday Sunday
GEO Entity n-gram

Sample output
#SantaCon #fun #santaconnewport #santacon @ Newport, Rhode Island
#concert #The1975 #NYC #terminal5 @ TERMINAL 5
#jets #giants #NYGiants @ New York Giants vs. New York Jets

Misinformation detection is challenging due to factors like source reputation, author credibility, and sentence structure. The document discusses experiments using datasets from Rubin, Buzzfeed, Snopes, and Rashkin to test algorithms like CNNs, TF-IDF, and SVM for misinformation detection. The results showed that while reputation-based classification correlated somewhat with true labels, it was not sufficient for predicting the veracity of many news articles, and the models performed poorly on reputation-based labeled data.

Cytora: Real-Time Political Risk Analysis

huguk

Data Mining: Graph mining and social network analysis

DataminingTools Inc

Graph mining analyzes structured data like social networks and the web through graph search algorithms. It aims to find frequent subgraphs using Apriori-based or pattern growth approaches. Social networks exhibit characteristics like densification and heavy-tailed degree distributions. Link mining analyzes heterogeneous, multi-relational social network data through tasks like link prediction and group detection, facing challenges of logical vs statistical dependencies and collective classification. Multi-relational data mining searches for patterns across multiple database tables, including multi-relational clustering that utilizes information across relations.

Earthquake shakes twitter users real-time event detection by social sensors

Mike Mayer

Here are some responses to your questions: 1. A document generally contains longer form content, a blog is a series of documents published online in chronological order, and a microblog like Twitter restricts content to short messages (tweets) of 140 characters or less. 2. Yes, tweets are considered real-time information since they are posted as events are occurring. This allows news to spread much faster on social media than traditional outlets. 3. A target event is an event the system aims to detect, like natural disasters or traffic jams. Tweets are related because users will post about experiencing or witnessing such events. 4. The goal of the system is to detect target events using tweets as sensors.

Twitter Sub-event Detection Project Presentation

Pallav Shah

The document describes a project to detect sub-events within a corpus of tweets commenting on a single major event. It uses a two-step approach: (1) sub-event detection identifies when a sub-event occurs and which tweets comprise it by comparing tweet vectors, and (2) tweet selection chooses a representative tweet for each sub-event. Tweets are clustered into groups representing different sub-events. The summarization then uses TF-IDF to assign weights to terms and select tweets for each sub-event summary. The system was tested on 2012 US election tweets with around 60% accuracy compared to manual evaluation.

Social Media Analytics for Graph-Based Event Detection

Yiannis Kompatsiaris

The pervasive use of Online Social Networks (OSN) for networking, communication and search in tandem with the ubiquitous availability of smartphones, which enables real-time multimedia capturing and sharing, have led to massive amounts of user-generated content and activities being amassed online, and made publicly available for analysis and mining. Each content item is associated with an abundance of metadata and related information such as location, tags, comments, favorites and mood indicators, access logs, and so on. At the same time, all this information is implicitly or explicitly interconnected based on various properties such as social links among users, groups, communities, and sharing patterns. These properties transform social media into data sources of an extremely dynamic nature that reflect topics of interests, events, and the evolution of community opinion and focus. Social media processing offers a unique opportunity to structure and extract information and to benefit multiple areas ranging from new media experiences to psychology and marketing. The objective of this talk is to provide an overview of the current research in emerging topics related to applications where social media can act as sensors of real-life phenomena and case studies that reveal valuable insights. After discussing challenges and presenting a generic conceptual architecture, there will be a focus on efficient processing and indexing algorithms that can handle massive amounts of content with application to graph-based event detection and summarization in social media streams.

Social unrest: a systemic risk perspective

Global Risk Forum GRFDavos

This document summarizes a presentation on analyzing social unrest from a systemic risk perspective. It provides a framework for describing and analyzing social unrest by outlining how unrest can cause, be caused by, manifest, and promote other risks. It defines social unrest and proposes a "ladder of social unrest" to classify the intensity. Key drivers that can escalate or deescalate unrest are identified. The framework is intended as a diagnostic and heuristic tool to facilitate comparative study and identify triggers. Future work aims to apply the framework to specific cases and substantiate claims with more empirical evidence.

This thesis proposes approaches for multiscale event detection from social media data streams. The system architecture involves three phases: (1) data retrieval and preprocessing to filter noisy tweets, (2) data representation using text vectorization, entity extraction and time series analysis, and (3) document-pivot and feature-pivot clustering to detect event candidates. Clustering results are classified using features like topic distribution and social engagement to identify meaningful events. Wavelet analysis is also used to measure term similarity across different temporal and spatial scales.

New Methodologies for Capturing and Working with Publicly Available Twitter Data

Axel Bruns

This document proposes new methodologies for capturing and analyzing publicly available Twitter data. It discusses using tools like yourTwapperkeeper and Gawk to gather and process Twitter data at scale. Potential research questions are exploring how online publics form through hashtags and what structures emerge. Metrics for analyzing hashtags, timeframes, and users are presented. Challenges in working with big Twitter data at scale are also discussed.

Twitter as a personalizable information service ii

Kan-Han (John) Lu

Twitter is a free social networking microblogging service that allows registered members to broadcast, in real-time, short posts called tweets. Twitter members can broadcast tweets and follow other users’ tweets by using multiple devices, making this information system one of the fastest in the world. In this chapter, we leverage this characteristic to introduce a novel topic-detection method aimed at informing, in real-time, a specific user about the most emerging arguments expressed by the network around his/her domain interests. With this goal, we aim at formalizing the information spread over the network by studying the topology of the network and by modeling the implicit and explicit connections among the users. Then, we propose an innovative term aging model, based on a biological metaphor, to retrieve the freshest arguments of discussion, represented through a minimal set of terms, expressed by the community within the foci of interest of a specific user. We finally test the proposed model through various experiments and user studies.

Groundhog day: near duplicate detection on twitter

Dan Nguyen

This document presents a framework for detecting near-duplicate tweets on Twitter. The framework analyzes tweet pairs using three approaches: (1) comparing syntactic characteristics like word overlap, (2) measuring semantic similarity, and (3) analyzing contextual information. Machine learning is used to learn patterns that help identify duplicate tweets. The framework is integrated into a Twitter search engine called Twinder to diversify search results and improve search quality. Extensive experiments evaluate strategies for detecting duplicate tweets and analyzing features that impact detection. The results show semantic features can boost duplicate detection performance.

Hashtag Conversations,Eventgraphs, and User Ego Neighborhoods: Extracting...

learjk

This document discusses methods for extracting and analyzing social network data from Twitter, including hashtag conversations, event graphs, search networks, and user ego neighborhoods. It provides an overview of electronic social network analysis on Twitter using tools like NodeXL. Specific topics covered include analyzing hashtag conversations, mapping networks of interactions around events, identifying discussants and topics in search networks, capturing a user's direct connections on Twitter, and performing motif censuses to understand global network structures.

Hashtag Conversations,Eventgraphs, and User Ego Neighborhoods: Extracting So...

Shalin Hai-Jew

Leveraging social media for training object detectors

Manish Kumar

The document discusses using collective intelligence from social media to enhance object detectors. It describes how social media has generated tremendous user-contributed data with tags indicating meaning. This motivates investigating whether collective intelligence from social media can remove the need for dedicated human supervision during machine learning. Specifically, the author examines whether tags from social media can guide a learning process to teach machines object recognition from visual content like humans.

Inferring social media user attributes using language and network information

Nikolaos Aletras

Inferring attributes of social media users is an important problem in computational social science. Automated inference of user characteristics has applications in personalised recommender systems, targeted computational advertising and online political campaigning. In this talk, I’ll present how we can combine language and network information to predict users’ (1) socioeconomic characteristics (e.g., income and occupational class); and (2) voting behaviour. Lancaster University - Jan 2019

Network Awareness Tool - Learning Analytics in the workplace:  Detecting and ...

Maarten de Laat, LOOK, Open Universiteit, The Netherlands

This document describes a Network Awareness Tool being developed to detect and analyze informal workplace learning. The tool aims to capture traces of social informal learning through everyday work interactions. It will use social network analysis to map learning networks and communities of practice within organizations. The tool collects data on learning topics, individual networks, and the strength and content of interactions to provide feedback on informal learning structures and themes. This will help understand informal learning and identify support needs to foster professional development. Future plans include combining online and offline data, integrating it with learning analytics dashboards, and improving social browsing and network analysis through semantic techniques.

REVEAL - Social Media Verification - brochure

REVEAL - Social Media Verification

Content that is posted on and shared via Social Media networks has had a profound effect on how we communicate. It changed the way information is spread and consumed dramatically. This also had significant impacts on traditional ways of information gathering and distribution. Social Media can have both positive and negative effects. It can be used for various purposes, ranging from effective disaster management to news reporting and the intentional spreading of false information such as propaganda or marketing. In order to better evaluate and classify content residing in Social Networks, it is necessary to filter and assess its information value in terms of credibility, trustworthiness, reputation, popularity, influence, authenticity and proximity. What is the REVEAL project about? REVEAL focuses on verification technologies, tools and strategies. It aims to develop tools, components and strategies that aid journalists and enterprise community managers in identifying, assessing and verifying User Generated Content (UGC) on Social Networks. In REVEAL, we aim to reveal and analyse much more than bare content. The analysis framework for assessing the credibility of information found in Social Networks is based on three main pillars: Contributor, Content and Context. Joint analysis of the validity of Contributor, Content and Context provides a more thorough approach for revealing truthfulness.

Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012

lljohnston

Big Data challenges in developing repositories include: - Collections like web archives and historic newspapers contain billions of files and grow quickly, requiring constant processing and large-scale infrastructure. - Researchers want to analyze entire collections using algorithms and computational methods rather than accessing individual items. - Repository services need to support self-serve access, full-text search of entire collections, and APIs to enable computational research methods. - Ingesting and providing access to collections measured in petabytes and containing highly diverse content and metadata requires normalization and standardization.

Graph-based Analysis and Opinion Mining in Social Network

Khan Mostafa

This is the final report for Networks & Data Mining Techniques project focusing on mining social network to estimate public opinion about entities and associated keywords. This project mines Twitter for recent feeds and analyzes them to estimate sentiment score, discussed entity and describing keywords in each tweet. This data is then exploited to elicit overall sentiment associated with each entity. Entities and keywords extracted is also used to form an entity-keyword bigraph. This graph is further used to detect entity communities and keywords found within those communities. Presented implementation works in linear time.

Harvesting Intelligence from User Interactions

R A Akerkar

This document discusses how to harvest intelligence from user interactions on websites. It explains that as users share opinions, content, and participate in online communities, data is generated that can be converted into intelligence to personalize websites. It describes how collecting diverse opinions from many users can lead to "wise crowds" and collective intelligence. The key is to allow user interactions, learn about users in aggregate, and personalize content using this data. Content and collaborative filtering are approaches to build user and item profiles to detect meaningful relationships and make recommendations. The goal is to transform applications from being content-centric to being user-centric using collective intelligence.

Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013

Digital Methods Initiative

This document discusses cross-platform profiling as a method for studying issues across multiple online spaces. It provides examples of profiling controversies and issues like Fukushima, the economic crisis, hashtags on climate change, and the WCIT conference. Profiling involves analyzing actor composition, key platforms, framing, and variation over time. It demonstrates profiling using tools like the Google scraper and TCAT associational profiler to map word frequencies, co-occurrence networks, and changing associations for issues on Google, Twitter and other platforms. The document raises questions about how media liveliness relates to issue liveliness and how profiling can capture social dynamics and platform specificities.

Adressing Volume and Velocity Challenge on the Social Web using Crowd Sourced...

Pavan Kapanipathi

This document discusses addressing the challenges of volume and velocity on the social web, specifically Twitter. It utilizes Wikipedia as a crowdsourced knowledge base. To address volume, it generates hierarchical interest graphs from user tweets to filter and make recommendations. To address velocity, it tracks dynamic events on Twitter by leveraging the evolving structure of Wikipedia to tag related tweets. It was evaluated on 3 events with over 15,000 tweets, achieving a NDCG of 92% for the top 5 hashtags.

Leslie Johnston Keynote, Best Practices Exchange 2011

lljohnston

This document discusses how digital collections are now considered data rather than just records or content. It notes that researchers want to analyze entire collections as data sets rather than individual records. Large digital collections like web archives, historic newspapers, and Twitter archives contain billions of records that researchers want to query, analyze, and visualize as data. Institutions are collaborating through groups like the National Digital Stewardship Alliance and developing open source tools like ViewShare to support access to and preservation of these "big data" collections.

Social Media Crawling & Mining Seminar

Symeon Papadopoulos

Opinion mining for social media

Diana Maynard

This document discusses opinion mining for social media. It provides an introduction to opinion mining and sentiment analysis, and discusses some of the challenges involved in performing opinion mining on social media data, including short sentences, incorrect language, and topic divergence. The document then outlines the Arcomem research project, which aims to perform opinion mining on social media to analyze opinions about events over time. It describes the project's entity, topic and opinion extraction workflow and some of the main research directions.

Twitris - Web Information System 2011 Course

Ashutosh Jadhav

Twitris is a system for understanding perceptions from social media data like Twitter. It collects tweets on topics of interest using keywords and clusters them by location and time into events. It then analyzes the tweets to extract key phrases, check sentiment, identify entities and relationships, and look at traffic over time. It also integrates contextual data from other sources. The goal is to provide an overview of what is being said about an event spatially, temporally, and thematically while accounting for diversity and changes in perceptions over time.

IEEE 2014 JAVA DATA MINING PROJECTS Discovering emerging topics in social str...

IEEEFINALYEARSTUDENTPROJECTS

Similar to final_nlp

Pre-defense_talk

aphex34

New Methodologies for Capturing and Working with Publicly Available Twitter Data

Axel Bruns

Twitter as a personalizable information service ii

Kan-Han (John) Lu

Groundhog day: near duplicate detection on twitter

Dan Nguyen

Hashtag Conversations,Eventgraphs, and User Ego Neighborhoods: Extracting...

learjk

Hashtag Conversations,Eventgraphs, and User Ego Neighborhoods: Extracting So...

Shalin Hai-Jew

Leveraging social media for training object detectors

Manish Kumar

Inferring social media user attributes using language and network information

Nikolaos Aletras

Network Awareness Tool - Learning Analytics in the workplace:  Detecting and ...

Maarten de Laat, LOOK, Open Universiteit, The Netherlands

REVEAL - Social Media Verification - brochure

REVEAL - Social Media Verification

Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012

lljohnston

Graph-based Analysis and Opinion Mining in Social Network

Khan Mostafa

Harvesting Intelligence from User Interactions

R A Akerkar

Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013

Digital Methods Initiative

Adressing Volume and Velocity Challenge on the Social Web using Crowd Sourced...

Pavan Kapanipathi

Leslie Johnston Keynote, Best Practices Exchange 2011

lljohnston

Social Media Crawling & Mining Seminar

Symeon Papadopoulos

Opinion mining for social media

Diana Maynard

Twitris - Web Information System 2011 Course

Ashutosh Jadhav

IEEE 2014 JAVA DATA MINING PROJECTS Discovering emerging topics in social str...

IEEEFINALYEARSTUDENTPROJECTS

Similar to final_nlp (20)

Pre-defense_talk

New Methodologies for Capturing and Working with Publicly Available Twitter Data

Twitter as a personalizable information service ii

Groundhog day: near duplicate detection on twitter

Hashtag Conversations,Eventgraphs, and User Ego Neighborhoods: Extracting...

Hashtag Conversations,Eventgraphs, and User Ego Neighborhoods: Extracting So...

Leveraging social media for training object detectors

Inferring social media user attributes using language and network information

Network Awareness Tool - Learning Analytics in the workplace:  Detecting and ...

REVEAL - Social Media Verification - brochure

Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012

Graph-based Analysis and Opinion Mining in Social Network

Harvesting Intelligence from User Interactions

Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013

Adressing Volume and Velocity Challenge on the Social Web using Crowd Sourced...

Leslie Johnston Keynote, Best Practices Exchange 2011

Social Media Crawling & Mining Seminar

Opinion mining for social media

Twitris - Web Information System 2011 Course

IEEE 2014 JAVA DATA MINING PROJECTS Discovering emerging topics in social str...

final_nlp

1. Feature-rich event detection in social media streams by Denis Antyukhov

2. Motivation • Social networking platforms such as Twitter have emerged in recent years, creating a radically new mode of communication between people. • Monitoring and analysing rich and continuous ﬂow of user- generated content can yield unprecedentedly valuable information, which would not have been available from traditional media outlets. • However, learning from microblog streams poses new challenges, due to the noisy nature of user-generated content • Goal: develop a system for detecting new and ongoing events from streams of Twitter messages

3. System Architecture

4. Data • Sources: Boston, New York, Miami, Chicago and Vancouver tweets • Mining: self-developed tool built against Twitter public API • Storage: MongoDB (supports temporal and spatial indexing) • Total: 8 million tweets in json format, 30 GB size

5. Language processing Tweets are tokenised using regular expressions, stop words are removed, hyperlinks, hashtags, user mentions, check-ins and emoji are extracted. To facilitate vectorising and classiﬁcation operations, two models were learned using our tweet corpus: • TF-IDF: 75 000 - vocabulary weighting statistic • word2vec: 512-dimensional word embeddings

6. Clustering: GEO • 10% of tweets have precise geo position • Very useful for event detection: people tweeting about a certain event are often close to each other • DBSCAN algorithm is used to cluster data points in 2-dimensional space • This model is really good at detecting sports and music events, and other major happenings

7. Clustering: entity • ~50% tweets contain #hashtags and @ check-ins, we refer to these objects as entities • Entities are useful because people tend to use same hashtags when relating to a certain event, and check-ins correspond to a location in space. They are also prone to misspelling. • We extract all entities found within a time window, vectorise, and apply hierarchical clustering based on cosine similarity • This model efﬁciently detects popular events described by hashtags, or happening at a certain venue

8. Clustering: bursty n-gram • This approach is based on detecting n-grams that demonstrate bursty behaviour (i.e. start appearing unusually often compared to historic data) • We start by extracting 2,3-grams that are frequently appearing within a time window, vectorise documents that contain them and apply hierarchical clustering. • We rank the clusters using formula (Aiello et al. 2013)

9. Classiﬁcation • Our models detect event-related clusters, as well as heterogeneous collections, rumours and spam. To classify clusters we extract 20 rich features: • Textual: number of unique unigrams, proper nouns, hashtags, mean tf-idf/word2vec similarity … Social: number of friends, followers, retweets Meta: hyperlinks, instagram links, entities, etc • We use a perceptron model to classify clusters based on this feature representation (2 hidden layers) precision recall f1 score event 0.69 0.79 0.74 spam 0.73 0.72 0.72

10. Events of NY 0 10 20 30 40 Monday Tuesday Wednesday Thursday Friday Saturday Sunday GEO Entity n-gram

11. Sample output #SantaCon #fun #santaconnewport #santacon @ Newport, Rhode Island #concert #The1975 #NYC #terminal5 @ TERMINAL 5 #jets #giants #NYGiants @ New York Giants vs. New York Jets

final_nlp

Recommended

Recommended

More Related Content

Similar to final_nlp

Similar to final_nlp (20)

final_nlp