Twitter Sub-event Detection Project Presentation
1. Project: Sub-event detection on Social Media
Codebase:
https://github.com/pallavshah/TwitterSubeventDetector
Pallav Shah Akshay Joshi
Rajat Bhardwaj Ravneet Singh Kathuria
2. The Project
• Build a timeline/summary of an event from a corpus of tweets
commenting on that event.
• The corpus consists of tweets from a specific domain talking about a
single major event.
• The objective of the project is to extract sub-events within the event.
• The summary is a short description of each sub-event.
3. Our Approach
We followed a two-step approach:
• Sub-event Detection: The first step is to identify if and when a
sub-event has occurred and, if it has, which tweets comprise it.
• Tweet Selection: The second step is to choose a representative tweet
that describes the sub-event appropriately.
Combining these two steps yields a set of tweets that serves as a
summary of the event.
4. Part 1: Detecting the sub-event
Sub-event detection is done by measuring the distance between
different tweets about the same event.
• Dictionary of words: The parsed data is used to create a dictionary
which stores each relevant word and its count in the corpus.
• Vector for each tweet: The generated dictionary and a second pass
over the parsed data are used to get a single sparse vector
corresponding to each tweet. This vector contains the id and count of
each word present in the tweet.
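The two passes above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the stop-word list, the whitespace tokenizer, and the names `tokenize`, `build_dictionary`, and `to_sparse_vector` are all assumptions made for the example, since the slides do not say how "relevant" words were chosen.

```python
from collections import Counter

# Hypothetical stop-word list; the slides do not specify how relevance was decided.
STOP_WORDS = {"the", "a", "is", "to", "in", "of", "and"}

def tokenize(tweet):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in tweet.lower().split() if w not in STOP_WORDS]

def build_dictionary(tweets):
    """First pass: give each word an integer id and count it across the corpus."""
    word_ids, counts = {}, Counter()
    for tweet in tweets:
        for word in tokenize(tweet):
            word_ids.setdefault(word, len(word_ids))
            counts[word] += 1
    return word_ids, counts

def to_sparse_vector(tweet, word_ids):
    """Second pass: represent a tweet as sorted (word id, count) pairs."""
    bag = Counter(word_ids[w] for w in tokenize(tweet) if w in word_ids)
    return sorted(bag.items())

tweets = [
    "Obama wins the election",
    "Romney concedes the election",
    "Obama Obama speech",
]
word_ids, counts = build_dictionary(tweets)
vectors = [to_sparse_vector(t, word_ids) for t in tweets]
print(vectors)
```

Gensim, which the project lists among its libraries, offers a comparable mapping through `corpora.Dictionary` and its `doc2bow` method; whether the project used those calls is not stated in the slides.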
5. Part 1: Detecting the sub-event (continued)
• The sub-event detector module:
The module uses the LSHash library for Python to find the similarity
distance between tweets. Each tweet is analyzed and compared with the
existing groups of similar tweets.
If the tweet matches any of the groups above a high similarity threshold,
it is assumed to belong to that group and is added to it.
Otherwise, a new group is created with that tweet as the representative
tweet of the group. In the end, all the tweets are thus partitioned into
groups (or clusters) representing different sub-events.
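The grouping rule described above can be illustrated with a small sketch. Note the substitution: instead of an LSHash lookup, this example compares each tweet against every group representative using exact cosine similarity, which is easier to follow but slower on a large corpus. The threshold value of 0.5 and the names `cosine` and `cluster_tweets` are assumptions for illustration; the slides only say the threshold is "high".

```python
import math

def cosine(u, v):
    """Cosine similarity between sparse vectors stored as {word_id: count}."""
    dot = sum(c * v.get(i, 0) for i, c in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def cluster_tweets(vectors, threshold=0.5):
    """Add each tweet to its best-matching group, or start a new group."""
    groups = []  # each group: {"rep": representative vector, "members": [indices]}
    for idx, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for group in groups:
            sim = cosine(vec, group["rep"])
            if sim >= best_sim:
                best, best_sim = group, sim
        if best is None:
            # No group is similar enough: this tweet becomes a new representative.
            groups.append({"rep": vec, "members": [idx]})
        else:
            best["members"].append(idx)
    return groups

vectors = [
    {0: 1, 1: 1, 2: 1},  # e.g. "obama wins election"
    {0: 1, 1: 1},        # e.g. "obama wins"
    {3: 1, 4: 1},        # e.g. "romney concedes"
    {3: 1, 4: 1, 2: 1},  # e.g. "romney concedes election"
]
groups = cluster_tweets(vectors)
members = [g["members"] for g in groups]
print(members)
```

With these toy vectors the first two tweets fall into one group and the last two into another, since the only word shared across the two pairs is too weak a match to clear the threshold.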
6. Part 2: Summarization of Sub-event
• Term Frequency-Inverse Document Frequency (TF-IDF): A statistical
weighting technique that assigns each term within a document a weight
that reflects the term's saliency within the document. The TF-IDF value
is composed of two primary parts.
The term frequency component (TF) assigns more weight to words that occur
frequently within a document because important words are often repeated.
The inverse document frequency component (IDF) compensates for the fact
that some words, such as common stop words, are frequent in every document
and therefore carry little information.
Normalization of tweets: The tweet scores are normalized by tweet length to
prevent bias towards longer tweets.
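As a rough illustration of the weighting and normalization described above, the sketch below scores each tweet in a group by summing TF x IDF over its words and then dividing by the tweet's length. The exact formula the project used is not given in the slides, so the plain `log(N / df)` IDF, the divide-by-length normalization, the pick-the-highest-score rule, and the name `tfidf_scores` are all assumptions.

```python
import math
from collections import Counter

def tfidf_scores(tweets, normalize=True):
    """Score tokenized tweets by summed TF-IDF, optionally divided by length."""
    n_docs = len(tweets)
    # Document frequency: number of tweets in which each word appears.
    df = Counter(word for tweet in tweets for word in set(tweet))
    scores = []
    for tweet in tweets:
        tf = Counter(tweet)
        score = sum(count * math.log(n_docs / df[word]) for word, count in tf.items())
        if normalize and tweet:
            # Divide by the token count so longer tweets are not favoured.
            score /= len(tweet)
        scores.append(score)
    return scores

tweets = [
    ["obama", "wins", "election"],
    ["obama", "wins", "ohio", "wins"],
    ["obama", "speech"],
]
raw = tfidf_scores(tweets, normalize=False)
normalized = tfidf_scores(tweets)
best = max(range(len(tweets)), key=normalized.__getitem__)
print(raw, normalized, best)
```

In this toy data the word "obama" appears in every tweet, so its IDF is zero and it contributes nothing. Without normalization the longest tweet scores highest; after dividing by length a shorter tweet wins, which is the bias the normalization step is meant to remove.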
8. Technologies Used
We used the following Python libraries:
• LSHash: https://pypi.python.org/pypi/lshash/0.0.3dev
• Gensim: http://radimrehurek.com/gensim/
Dataset
We used the SNOW dataset containing tweets about the 2012 US General Elections.
9. Experiments and Results
• Tested on the 2012 US General Elections tweet data set from SNOW
2014.
• The results showed around 60% accuracy when compared against a manual
evaluation of the tweet data.