SubTopic Detection of Tweets Related to an Entity


Published in: Education, Technology

  1. Sub-Topic Detection of Tweets Related to an Entity. International Institute of Information Technology, Hyderabad. Mentor: Sandeep Pannem. By P Yashaswi (201102111), Aayush Asawa (201305617), Kumari Ankita (201101161), Diksha J. Yadav (201125130).
  2. Introduction ➢ Tweets are classified according to the “Topic” and then the “Subtopic” they refer to. ○ A “Topic” is any major event in the real world. ○ “Subtopics” are fine-grained aspects of such events. ➢ Mining the subtopics of entities/topics from tweets helps in trend analysis, social monitoring, topic tracking and reputation mining. ➢ Tweets related to a particular entity generally share similar keywords, so keywords alone do not separate them well; detecting subtopics therefore has to draw on additional features.
  3. Work Flow ➢ Training Data → Store features in Lucene ➢ Input Tweet → Extract Tweet features → Classifier (Phase 1, 2, 3) → Detected Subtopic
  4. Approach Input: a training set of tweets with subtopic names as class labels, and test tweets to be classified into subtopics. Output: a subtopic assigned to each test tweet. The entire workflow can be broken into three phases: 1. Pre-processing 2. Feature Extraction and Representation 3. Classification.
  5. Feature Extraction The following features are extracted from each tweet: ➢ Tweet concepts (using the TagMe API) ➢ Named entities and event phrases (using TwiCal) ➢ URL concepts (using the TagMe API on the content of external links) ➢ Key phrases (noun phrases extracted after POS tagging) ➢ Hashtags ➢ Categories (extracted for the titles obtained through TagMe) Similarity measures used: ➢ Wikipedia Miner (for comparing Wikipedia titles) ➢ WordNet similarity measure (for comparing key phrases)
  6. Classification ➢ Subtopic detection is treated as a classification problem in which the subtopics are the class labels and the tweets are the data points. ➢ The classifier learns which features the majority of the tweets (data points) of a particular subtopic (class label) have. ➢ Based on the features, initial seed clusters are created for each topic, and each cluster is represented as crisp information and an index. ➢ The features of each test tweet are extracted and compared with the clusters, and the best-matching cluster is assigned to the test tweet. ➢ This is done using a machine learning technique.
  7. Pre-Processing Pre-processing involves the following steps: ➢ Removing stopwords from the tweets and stemming the training data points. ➢ Extracting URLs from the tweets. This is done for both training and test tweets.
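
The pre-processing steps on this slide can be sketched in Python as follows. This is only an illustration: the slides do not say which stopword list or stemmer was used, so the tiny `STOPWORDS` set and the crude `naive_stem` function are assumptions standing in for a real stopword list and a real stemmer (for example Porter's).

```python
import re

# Assumption: a tiny illustrative stopword list; the slides do not name one.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "on", "and", "for", "at"}

# Matches http:// and https:// links up to the next whitespace.
URL_RE = re.compile(r"https?://\S+")


def naive_stem(word):
    """Crude suffix stripping, a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tweet):
    """Return (stemmed tokens without stopwords, list of URLs found in the tweet)."""
    urls = URL_RE.findall(tweet)
    text = URL_RE.sub(" ", tweet).lower()
    tokens = re.findall(r"[a-z0-9#@']+", text)
    kept = [naive_stem(t) for t in tokens if t not in STOPWORDS]
    return kept, urls


tokens, urls = preprocess("Volvo is launching the new trucks http://example.com/x")
print(tokens, urls)
```

The extracted URLs are kept separately because the later feature-extraction step uses the content behind the links.
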
  8. Algorithm: Offline Process 1. All the tweets in the training data are grouped according to their subtopic. 2. For every tweet in a subtopic, the features are extracted and pooled to form the subtopic features. 3. The subtopic features of all the subtopics are stored in the Lucene index under different fields. 4. Features that are common to two or more subtopics are removed, as are features that are directly related to the entity name.
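
A minimal sketch of the offline process, using a plain Python dictionary as a stand-in for the Lucene index. The function name `build_subtopic_index` and the `entity_terms` parameter are hypothetical, the sketch flattens all feature types into one set per subtopic instead of separate fields, and "directly related to the entity name" is approximated by an exact, case-insensitive match against a list of entity terms.

```python
from collections import defaultdict


def build_subtopic_index(training, entity_terms):
    """training: iterable of (subtopic, features) pairs, features being strings.

    Returns a dict mapping each subtopic to its distinguishing features.
    """
    # Steps 1-2: group tweets by subtopic and pool their features.
    pooled = defaultdict(set)
    for subtopic, features in training:
        pooled[subtopic] |= set(features)

    # Count in how many subtopics each feature occurs.
    counts = defaultdict(int)
    for features in pooled.values():
        for feature in features:
            counts[feature] += 1

    # Step 4: drop features shared by two or more subtopics, and features
    # that match an entity term (assumption: exact case-insensitive match).
    entity_terms = {t.lower() for t in entity_terms}
    return {
        subtopic: {
            f for f in features
            if counts[f] == 1 and f.lower() not in entity_terms
        }
        for subtopic, features in pooled.items()
    }


training = [
    ("recall", {"volvo", "brake", "safety"}),
    ("recall", {"brake", "defect"}),
    ("racing", {"volvo", "ocean", "race", "safety"}),
]
print(build_subtopic_index(training, ["Volvo"]))
```

In this toy data "volvo" and "safety" appear in both subtopics, so only "brake" and "defect" remain for the first subtopic and "ocean" and "race" for the second.
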
  9. Algorithm: Online Procedure 1. Phase 1: The category features of the test tweet are searched in the Lucene index and the top 10 subtopics are listed. 2. Phase 2: The tweet concepts and URL concepts of the test tweet are compared with those of the top 10 subtopics from Phase 1, and the top 5 subtopics are listed based on the Wikipedia Miner similarity measure. 3. Phase 3: Named entities, key phrases and event phrases are compared with those of the top 5 subtopics from Phase 2 using WordNet similarity measures. For hashtags, a direct intersection is computed. After this, the best of the 5 subtopics is chosen. All these phases can also be combined into a single step to get the best subtopic.
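
The three-phase cascade can be sketched as below. This is an illustration under loud assumptions: Jaccard overlap stands in for the Lucene search score, the Wikipedia Miner measure and the WordNet measure alike; the feature-type keys (`categories`, `concepts`, `phrases`, `hashtags`) are hypothetical names; and adding the phrase similarity to the hashtag overlap count in Phase 3 is one possible way to combine them, not something the slides specify.

```python
def jaccard(a, b):
    """Set overlap, used here as a placeholder for every similarity measure."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_k(candidates, score, k):
    """Return the k candidates with the highest score."""
    return sorted(candidates, key=score, reverse=True)[:k]


def classify(tweet, index, sim=jaccard):
    """tweet and index[subtopic] map a feature type to a set of strings."""
    # Phase 1: rank all subtopics by category match, keep the top 10.
    phase1 = top_k(index, lambda s: sim(tweet["categories"], index[s]["categories"]), 10)
    # Phase 2: re-rank by tweet/URL concept similarity, keep the top 5.
    phase2 = top_k(phase1, lambda s: sim(tweet["concepts"], index[s]["concepts"]), 5)

    # Phase 3: phrase similarity plus direct hashtag intersection.
    def final_score(s):
        shared_tags = len(set(tweet["hashtags"]) & set(index[s]["hashtags"]))
        return sim(tweet["phrases"], index[s]["phrases"]) + shared_tags

    return max(phase2, key=final_score)


index = {
    "recall": {"categories": {"vehicle safety"}, "concepts": {"brake", "product recall"},
               "phrases": {"brake defect"}, "hashtags": {"#recall"}},
    "racing": {"categories": {"sailing"}, "concepts": {"volvo ocean race"},
               "phrases": {"ocean race"}, "hashtags": {"#vor"}},
}
tweet = {"categories": {"vehicle safety"}, "concepts": {"brake"},
         "phrases": {"brake defect"}, "hashtags": {"#recall"}}
print(classify(tweet, index))
```

Narrowing the candidate list from 10 to 5 to 1 means the more expensive similarity measures only run on a handful of subtopics. The sketch assumes a non-empty index.
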
  10. Experiments ➢ The RepLab 2013 data set was used. The data set contains tweets for 61 entities. Each entity has about 700 tweets for training and 1500 tweets for testing. ➢ For evaluation we use Reliability, Sensitivity and F measure. The results obtained for the entity “Volvo” are: Sensitivity: 0.37, Reliability: 0.39, F measure: 0.38.
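
As a quick consistency check on the reported numbers, assuming the F measure is the harmonic mean of Reliability and Sensitivity (as in the RepLab evaluation), the value can be recomputed as follows. The function name `f_measure` is illustrative.

```python
def f_measure(reliability, sensitivity):
    """Harmonic mean of reliability and sensitivity (0.0 if both are zero)."""
    total = reliability + sensitivity
    if total == 0:
        return 0.0
    return 2 * reliability * sensitivity / total


# Values reported on the slide for the entity "Volvo".
print(round(f_measure(0.39, 0.37), 2))  # 0.38
```

The result agrees with the F measure of 0.38 given on the slide.
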
  11. Future Work ➢ We can build an SVM classifier that determines which features should be given preference while classifying the tweets. ➢ The input vectors would have one dimension for each feature of each subtopic, with the corresponding similarity measures as the coefficients and the labelled subtopic as the class label. ➢ In the testing phase, similar vectors can be created for the test tweets to obtain their corresponding subtopics.
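
One way to picture the proposed input vectors is sketched below: one dimension per (subtopic, feature type) pair, filled with a similarity score. Everything here is hypothetical: the `FEATURE_TYPES` names, the `to_vector` helper, and the use of Jaccard overlap in place of the Wikipedia Miner and WordNet measures. Training the SVM itself is left out; vectors like these could be passed to any SVM implementation, for example scikit-learn's `LinearSVC`.

```python
def jaccard(a, b):
    """Set overlap, a placeholder for the real similarity measures."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical feature-type names; the slides list the features but not field names.
FEATURE_TYPES = ("categories", "concepts", "phrases", "hashtags")


def to_vector(tweet, index, sim=jaccard):
    """Build one coefficient per (subtopic, feature type), in a fixed order."""
    vector = []
    for subtopic in sorted(index):
        for ftype in FEATURE_TYPES:
            vector.append(sim(tweet.get(ftype, ()), index[subtopic].get(ftype, ())))
    return vector


index = {
    "recall": {"categories": {"safety"}, "hashtags": {"#recall"}},
    "racing": {"categories": {"sailing"}},
}
tweet = {"categories": {"safety"}, "hashtags": {"#recall"}}
print(to_vector(tweet, index))
```

Sorting the subtopics fixes the dimension order, so training and test vectors line up.
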
  12. References 1. REINA at RepLab 2013 Topic Detection Task: Community Detection. 2. Entity Tracking in Real-Time using Sub-Topic Detection on Twitter.