This document describes a project to filter tweets related to entities. The team used supervised machine learning, with features extracted from tweets, entity home pages, and Wikipedia pages, to train an SVM model that classifies tweets as related or unrelated to a given entity. Pre-processing removed user mentions, URLs, punctuation, and stop words before feature extraction. When tested on 61 entities, the model achieved an overall accuracy of 80%, with per-entity accuracy ranging from 96% down to 40%.
Thesis Defense: Building a Semantic Web of Comic Book Metadata - Sean Petiya
Building a Semantic Web of Comic Book Metadata: User Application Profiles for Publishing Linked Data in HTML/RDFa
Kent State University - November 11, 2014
The objective of this research was to present a case study for developing a domain ontology, and explore methodologies for improving the usability and potential usage of that vocabulary through the development of interoperable metadata application profiles designed for specific groups of users within a community. This objective was realized by the development of a metadata vocabulary for comic books and comic book collections, and a series of metadata application profiles designed for publishing Linked Data in the content of existing information systems using HTML/RDFa. Semantic Web standards and technologies represent an opportunity for connecting data about comic books and graphic novels in LOD datasets with detailed, community-created data on the open Web. Recognizing the potential for an open exchange of data about comic books and graphic novels, a case study was designed to gain a comprehensive understanding of the domain and develop an effective data model. The initial phase of the study involved a review of information and reference resources, acquisition of example materials, and practical experience gained indexing comics in a collaborative Web database. A metamodel for comics was then developed and realized as an XML schema, with those elements mapped as properties to classes in an OWL ontology. In order to align the ontology with the wider Web environment and validate the model, the final phase of the case study explored external sources through a review of existing information systems and an analysis of their content. Results were then summarized as skeleton, data-driven user persona documents, which were used to guide the design of a series of metadata application profiles representing the functional requirements identified. The profiles build upon a core schema and incorporate elements from other Web vocabularies as necessary, focusing on publishing Linked Data in existing information systems using HTML/RDFa. 
Examples were explored and validated for their ability to link to LOD resources and produce meaningful, valid RDF data consistent with the ontology. The final result is a flexible, extensible semantic model for comics. The Comic Book Ontology (CBO), as an RDFS/OWL vocabulary, is compatible with a variety of other systems, including next-generation library catalogs, where it can potentially be used in a collaborative exchange of data to describe relationships between comics material and content not previously available. This study demonstrates how an ontology can be applied to existing collaborative projects, databases, content, or research to enhance the visibility, reference, and utilization of those endeavors through their publication as Linked Data.
Paper presented at the Alignment track at ISWC 2017.
This work was supported by grants from the EU H2020 Framework Programme provided for the project HOBBIT (GA no. 688227).
Implementing an Open Source IT Ticketing System at Queen's University Library - Hong (Jenny) Jing
For many years, Queen’s University Library has used an internally designed ticketing system for handling all technical requests sent by library staff. In the summer of 2014, we started moving to a more formal system for tracking, delegating, and resolving reported issues. This presentation will walk through the group’s evaluation process, the lessons we learned, and the customizations and modifications made to our open-source choice, which will serve as an IT ticketing system, an inventory list, and an internal knowledge base.
Cross JVM Scenario Testing Using Ant API and JUnit - Crossant is an open-source Java library published under the LGPL v3.0 license. The library leverages the power of well-written JUnit test cases and the Apache Ant runtime API to perform sequential scenario testing across multiple Java Virtual Machines (JVMs) and consolidates the results in a single JUnit test result file.
MR201402 Effectiveness of Unknown Malware Classification by Logistic Regression - FFRI, Inc.
• Apply logistic regression analysis to static information extracted from executables and measure the resulting detection rate and false-positive rate.
• Investigate how these rates differ when the model is applied to another file set.
• For the detection rate in particular, it is important to see how features collected from malware in one time span differ from those collected in the following span.
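The workflow above can be sketched as follows. The static features here (file size, section count, entropy) and the synthetic file sets are illustrative assumptions, not the features FFRI actually used, with scikit-learn standing in for the real tooling:

```python
# Sketch: logistic-regression malware classifier on static file features.
# The three features are hypothetical stand-ins for the study's real ones.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic static features per file: [file_size_kb, n_sections, entropy].
benign = np.column_stack([
    rng.normal(300, 80, 200),   # benign files: larger on average
    rng.integers(3, 6, 200),    # fewer sections
    rng.normal(5.5, 0.4, 200),  # lower entropy (not packed)
])
malware = np.column_stack([
    rng.normal(150, 60, 200),
    rng.integers(5, 10, 200),
    rng.normal(7.2, 0.4, 200),  # higher entropy (packed/encrypted)
])

X = np.vstack([benign, malware])
y = np.array([0] * 200 + [1] * 200)  # 0 = benign, 1 = malware

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Detection rate: fraction of malware correctly flagged.
# False-positive rate: fraction of benign files wrongly flagged.
# (A real study would measure both on a held-out file set, as the slides note.)
detection_rate = clf.predict(malware).mean()
false_positive_rate = clf.predict(benign).mean()
print(f"detection rate: {detection_rate:.2f}, "
      f"false-positive rate: {false_positive_rate:.2f}")
```

Re-running the same evaluation against a second, later-collected file set would show how the rates shift over time, as the third bullet suggests.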
166 - ISBSG variables most frequently used for software effort estimation: A systematic mapping review - ESEM 2014
Background: The International Software Benchmarking Standards Group (ISBSG) dataset makes it possible to estimate a project’s size, effort, duration, and cost.
Aim: The aim was to analyze the ISBSG variables that have been used by researchers for software effort estimation from 2000, when the first papers were published, until the end of 2013.
Method: A systematic mapping review was applied to the 167 papers obtained after the filtering process. Of these, 133 papers perform effort estimation, and only 107 list the independent variables used in their effort estimation models.
Results: Seventy-one of the 118 ISBSG variables have been used at least once. A group of 20 variables appears in more than 50% of the papers, including Functional Size (62%), Development Type (58%), Language Type (53%), and Development Platform (52%), following ISBSG recommendations. The Sizing and Size attributes together represent the most relevant group, along with the Project attributes, which include 24 technical features of the project and the development platform. Overall, variables with more missing values are used less frequently.
Conclusions: This work presents a snapshot of the existing usage of ISBSG variables in software development estimation. Moreover, some insights are provided to guide future studies.
Description:
ETL stands for Extract, Transform, Load: the process of extracting data from source tables, transforming it into the desired format based on defined rules, and finally loading it into target tables. Numerous tools help with the ETL process, Informatica and Control-M being a few notable ones.
ETL testing therefore means testing this entire process, either with a tool or at the table level, with the help of test cases and a rules-mapping document.
In ETL testing, the following are validated:
1) Data file loads from the source system into the source tables.
2) The ETL job that extracts data from the source tables and moves it to the staging tables (the transform process).
3) Data validation within the staging tables, to check that all mapping/transformation rules are followed.
4) Data validation within the target tables, to ensure data is present in the required format and there is no data loss from source to target.
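A minimal sketch of checks 1 and 4, assuming hypothetical `src_customers`/`tgt_customers` tables and an invented upper-casing mapping rule, with Python's built-in sqlite3 standing in for a real warehouse:

```python
# Sketch: validate that no rows were lost (and the mapping rule held)
# between source and target tables. Table names, columns, and the
# upper-casing rule are all hypothetical examples.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE src_customers (id INTEGER, name TEXT)")
cur.execute("CREATE TABLE tgt_customers (id INTEGER, name TEXT)")
cur.executemany("INSERT INTO src_customers VALUES (?, ?)",
                [(1, "alice"), (2, "bob"), (3, "carol")])

# Pretend the ETL job upper-cases names on the way to the target.
cur.execute("INSERT INTO tgt_customers SELECT id, UPPER(name) FROM src_customers")

# Check 1: row counts match (no data loss source -> target).
src_count = cur.execute("SELECT COUNT(*) FROM src_customers").fetchone()[0]
tgt_count = cur.execute("SELECT COUNT(*) FROM tgt_customers").fetchone()[0]
assert src_count == tgt_count, "row count mismatch between source and target"

# Check 2: every target row obeys the mapping/transformation rule.
mismatches = cur.execute("""
    SELECT COUNT(*) FROM src_customers s
    JOIN tgt_customers t ON s.id = t.id
    WHERE UPPER(s.name) != t.name
""").fetchone()[0]
assert mismatches == 0, "mapping rule violated"
print("source/target validation passed")
```

In practice these checks are generated from the rules-mapping document, one comparison query per documented transformation.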
Job scope: 100% job guarantee; as this is a rare skill, many companies face a crunch for candidates.
Duration: Normal Track - 4 weekends
Fast Track – 2 weekends/2days
Fee: 8K
New Batch: Every weekend
Please find attached a document on writing an effective 483 response to the regulatory authority. Please feel free to request a copy if interested.
Esoft Metro Campus - Diploma in Information Technology - (Module VII) Software Engineering
(Template - Virtusa Corporate)
Contents:
What is software?
Software classification
Attributes of Software
What is Software Engineering?
Software Process Model
Waterfall Model
Prototype Model
Throw away prototype model
Evolutionary prototype model
Rapid application development
Programming styles
Unstructured programming
Structured programming
Object oriented programming
Flow charts
Questions
Pseudo codes
Object oriented programming
OOP Concepts
Inheritance
Polymorphism
Encapsulation
Generalization/specialization
Unified Modeling Language
Class Diagrams
Use case diagrams
Software testing
Black box testing
White box testing
Software documentation
Webinar - Harness the Power of Data with Tableau - 2016-02-18 - TechSoup
Learn how to harness the power of data to tell your organization’s story with Tableau! Join Tech Impact's Jordan McCarthy and learn how to use Tableau to collect data in more meaningful ways and understand the science behind data analysis. We show you easy tips to maneuver through this data analytics tool to gain a better understanding of your nonprofit or library’s data.
Scalable Automatic Machine Learning in H2O - Sri Ambati
Abstract:
In recent years, the demand for machine learning experts has outpaced the supply, despite the surge of people entering the field. To address this gap, there have been big strides in the development of user-friendly machine learning software that can be used by non-experts. Although H2O and other tools have made it easier for practitioners to train and deploy machine learning models at scale, there is still a fair bit of knowledge and background in data science that is required to produce high-performing machine learning models. Deep Neural Networks in particular, are notoriously difficult for a non-expert to tune properly.
In this presentation, we provide an overview of the field of "Automatic Machine Learning" and introduce the new AutoML functionality in H2O. H2O's AutoML provides an easy-to-use interface which automates the process of training a large, comprehensive selection of candidate models and a stacked ensemble model which, in most cases, will be the top-performing model on the AutoML Leaderboard.
H2O AutoML is available in all the H2O interfaces including the h2o R package, Python module and the Flow web GUI. We will also provide simple code examples to get you started using AutoML.
Erin’s Bio:
Erin is a Statistician and Machine Learning Scientist at H2O.ai. She is the main author of H2O Ensemble. Before joining H2O, she was the Principal Data Scientist at Wise.io and Marvin Mobile Security (acquired by Veracode in 2012) and the founder of DataScientific, Inc. Erin received her Ph.D. in Biostatistics with a Designated Emphasis in Computational Science and Engineering from University of California, Berkeley. Her research focuses on ensemble machine learning, learning from imbalanced binary-outcome data, influence curve based variance estimation and statistical computing. She also holds a B.S. and M.A. in Mathematics.
Creating a Project Plan for a Data Warehouse Testing Assignment - RTTS
Learn how to create a project plan for a Data Warehouse testing assignment. Chris Thompson and Mike Calabrese, Senior Solution Architects and QuerySurge experts, provide great information, a demo and lots of humor.
This webinar was performed in conjunction with Test Guild.
To watch the video, go to:
https://youtu.be/_sNYZgL3rZY
A machine learning and data science pipeline for real companies - DataWorks Summit
Comcast is one of the largest cable and telecommunications providers in the country built on decades of mergers, acquisitions, and subscriber growth. The success of our company depends on keeping our customers happy and how quickly we can pivot with changing trends and new technologies. Data abounds within our internal data centers and edge networks as well as both the private and public cloud across multiple vendors.
Within such an environment and given such challenges, how do we get AI, machine learning, and data science platforms built so our company can respond to the market, predict our customers’ needs and create new revenue generating products that delight our customers? If you don’t happen to be our friends and colleagues at Google, Facebook, and Amazon, what are technologies, strategies, and toolkits you can employ to bring together disparate data sets and quickly get them into the hands of your data scientists and then into your own production systems for use by your customers and business partners?
We’ll explore our journey and evolution and look at specific technologies and decisions that have gotten us to where we are today and demo how our platform works.
Speaker
Ray Harrison, Comcast, Enterprise Architect
Prashant Khanolkar, Comcast, Principal Architect Big Data
This presentation includes a step-by-step tutorial, with screen recordings, for learning RapidMiner. It also covers the procedure for using its most interesting features: Turbo Prep and Auto Model.
Making Data Science Scalable - 5 Lessons Learned - Laurenz Wuttke
Making Data Science Scalable - 5 Lessons Learned
Making Data Science and Machine Learning scalable is not easy:
#1 Data Science in silos is bad
#2 ML-Feature stores should be at the heart of every ML-Platform
#3 Auto ML works great if you have a Feature store
#4 Treat Data Science projects more like software development
#5 Cloud-based infrastructure makes it easy to get started
Data Science MeetUp Cologne, Germany 16. May 2019
datasolut GmbH - https://datasolut.com
Test automation framework designs by Martin Lienhard. In these slides Martin describes the phases of designing a test automation framework, and why we should move far, far away from record-and-playback test scripts: data-driven and parameterized tests drawn from external files, databases, etc.; external UI maps of locators; using multiple test tools (Selenium/WebDriver being the favorite, of course); and testing across multiple environments on parallel deployment paths with different application versions.
Online courses offered by Martin:
https://www.udemy.com/beginning-webdriver-and-java/
Consolidating MLOps at One of Europe’s Biggest AirportsDatabricks
At Schiphol airport we run a lot of mission critical machine learning models in production, ranging from models that predict passenger flow to computer vision models that analyze what is happening around the aircraft. Especially now in times of Covid it is paramount for us to be able to quickly iterate on these models by implementing new features, retraining them to match the new dynamics and above all to monitor them actively to see if they still fit the current state of affairs.
To meet those needs we rely on MLflow, but we have also integrated it with many of our other systems: we have written Airflow operators for MLflow to ease the retraining of our models, integrated MLflow deeply with our CI pipelines, and connected it with our model monitoring tooling.
In this talk we will take you through the way we rely on MLFlow and how that enables us to release (sometimes) multiple versions of a model per week in a controlled fashion. With this set-up we are achieving the same benefits and speed as you have with a traditional software CI pipeline.
Similar to IRE2014 Filtering Tweets Related to an entity (20)
2024.06.01 Introducing a competency framework for language learning materials ... - Sandy Millin
http://sandymillin.wordpress.com/iateflwebinar2024
Published classroom materials form the basis of syllabuses, drive teacher professional development, and have a potentially huge influence on learners, teachers and education systems. All teachers also create their own materials, whether a few sentences on a blackboard, a highly-structured fully-realised online course, or anything in between. Despite this, the knowledge and skills needed to create effective language learning materials are rarely part of teacher training, and are mostly learnt by trial and error.
Knowledge and skills frameworks, generally called competency frameworks, for ELT teachers, trainers and managers have existed for a few years now. However, until I created one for my MA dissertation, there wasn’t one drawing together what we need to know and do to be able to effectively produce language learning materials.
This webinar will introduce you to my framework, highlighting the key competencies I identified from my research. It will also show how anybody involved in language teaching (any language, not just English!), teacher training, managing schools or developing language learning materials can benefit from using the framework.
How to Make a Field Invisible in Odoo 17 - Celine George
It is possible to hide or make fields invisible in Odoo, commonly by using the "invisible" attribute in the field definition. This slide will show how to make a field invisible in Odoo 17.
Operation “Blue Star” is the only event in the history of independent India where the state went to war with its own people. Even after about 40 years it is not clear whether it was the culmination of the state's anger toward the people of the region, a political game of power, or the start of a dictatorial chapter in the democratic setup.
The people of Punjab felt alienated from the mainstream due to the denial of their just demands during a long democratic struggle since independence. As happens all over the world, this led to a militant struggle with great loss of life among military, police, and civilian personnel. The killing of Indira Gandhi and the massacre of innocent Sikhs in Delhi and other Indian cities were also associated with this movement.
Instructions for Submissions through Google Classroom - Jheel Barad
This presentation provides a briefing on how to upload submissions and documents in Google Classroom. It was prepared as part of an orientation for new Sainik School in-service teacher trainees. As a training officer, my goal is to ensure that you are comfortable and proficient with this essential tool for managing assignments and fostering student engagement.
Model Attribute Check Company Auto Property - Celine George
In Odoo, the multi-company feature allows you to manage multiple companies within a single Odoo database instance. Each company can have its own configurations while still sharing common resources such as products, customers, and suppliers.
Acetabularia Information for Class 9 - vaibhavrinwa19
Acetabularia acetabulum is a single-celled green alga that in its vegetative state is morphologically differentiated into a basal rhizoid and an axially elongated stalk, which bears whorls of branching hairs. The single diploid nucleus resides in the rhizoid.
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Corporation - Levi Shapiro
Letter from the Congress of the United States regarding Anti-Semitism sent June 3rd to MIT President Sally Kornbluth, MIT Corp Chair, Mark Gorenberg
Dear Dr. Kornbluth and Mr. Gorenberg,
The US House of Representatives is deeply concerned by ongoing and pervasive acts of antisemitic harassment and intimidation at the Massachusetts Institute of Technology (MIT). Failing to act decisively to ensure a safe learning environment for all students would be a grave dereliction of your responsibilities as President of MIT and Chair of the MIT Corporation.
This Congress will not stand idly by and allow an environment hostile to Jewish students to persist. The House believes that your institution is in violation of Title VI of the Civil Rights Act, and the inability or unwillingness to rectify this violation through action requires accountability.
Postsecondary education is a unique opportunity for students to learn and have their ideas and beliefs challenged. However, universities receiving hundreds of millions of federal funds annually have denied students that opportunity and have been hijacked to become venues for the promotion of terrorism, antisemitic harassment and intimidation, unlawful encampments, and in some cases, assaults and riots.
The House of Representatives will not countenance the use of federal funds to indoctrinate students into hateful, antisemitic, anti-American supporters of terrorism. Investigations into campus antisemitism by the Committee on Education and the Workforce and the Committee on Ways and Means have been expanded into a Congress-wide probe across all relevant jurisdictions to address this national crisis. The undersigned Committees will conduct oversight into the use of federal funds at MIT and its learning environment under authorities granted to each Committee.
• The Committee on Education and the Workforce has been investigating your institution since December 7, 2023. The Committee has broad jurisdiction over postsecondary education, including its compliance with Title VI of the Civil Rights Act, campus safety concerns over disruptions to the learning environment, and the awarding of federal student aid under the Higher Education Act.
• The Committee on Oversight and Accountability is investigating the sources of funding and other support flowing to groups espousing pro-Hamas propaganda and engaged in antisemitic harassment and intimidation of students. The Committee on Oversight and Accountability is the principal oversight committee of the US House of Representatives and has broad authority to investigate “any matter” at “any time” under House Rule X.
• The Committee on Ways and Means has been investigating several universities since November 15, 2023, when the Committee held a hearing entitled From Ivory Towers to Dark Corners: Investigating the Nexus Between Antisemitism, Tax-Exempt Universities, and Terror Financing. The Committee followed the hearing with letters to those institutions on January 10, 202
Welcome to TechSoup New Member Orientation and Q&A (May 2024) - TechSoup
In this webinar you will learn how your organization can access TechSoup's wide variety of product discount and donation programs. From hardware to software, we'll give you a tour of the tools available to help your nonprofit with productivity, collaboration, financial management, donor tracking, security, and more.
Biological screening of herbal drugs: introduction and need for phyto-pharmacological screening; new strategies for evaluating natural products; in vitro evaluation techniques for antioxidant, antimicrobial, and anticancer drugs; in vivo evaluation techniques for anti-inflammatory, antiulcer, anticancer, wound-healing, antidiabetic, hepatoprotective, cardioprotective, diuretic, and antifertility activity; toxicity studies as per OECD guidelines.
1. FILTERING TWEETS RELATED TO AN ENTITY
TEAM: GROUP 8, PROJECT 19
• MALLIKARJUN B R(201307681)
• APRATIM UTKARSH(201305516)
• RISHABH LADHA(201101014)
• KARTIK DUBEY(201001117)
2. Introduction
• One of the major problems in monitoring the online reputation of companies is deciding whether a piece of content actually refers to the entity in question.
• Given a tweet, we need to decide whether it is related to a particular entity or not.
• The problem is particularly hard in microblogging services such as Twitter.
3. APPROACH
• Supervised machine learning is used to decide whether a tweet is related to an entity or not.
• The RepLab dataset, together with the entity's home page and Wikipedia page, is used.
• This involves pre-processing the above data and extracting features from it to train an SVM.
• Test data goes through the same procedure; the output is predicted using the weight vector obtained from the trained model.
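In outline, the train-and-predict procedure might look like this; the tweets and labels below are invented stand-ins for the RepLab data, and scikit-learn's LinearSVC stands in for the libsvm setup the team used:

```python
# Sketch: train an SVM to label tweets as related/unrelated to an entity.
# The tweets are invented; the real project trained on the RepLab corpus
# plus features from the entity's home page and Wikipedia page.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_tweets = [
    "apple releases new iphone update",       # related to Apple (the company)
    "apple stock rises after earnings call",  # related
    "baked an apple pie this weekend",        # unrelated
    "apple orchard season is here",           # unrelated
]
train_labels = [1, 1, 0, 0]  # 1 = related to the entity, 0 = unrelated

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_tweets)
clf = LinearSVC().fit(X_train, train_labels)

# Test data goes through the same vectorizer before prediction.
test_tweets = ["iphone sales boost apple earnings",
               "picking apples at the orchard"]
pred = clf.predict(vec.transform(test_tweets))
print(pred)
```

The slides describe training one such model per entity, 61 models in total.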
6. Pre-Processing
• Remove user mentions and URLs
• Convert hashtags to words by removing the hash symbol
• Remove all punctuation
• Convert text to lower case
• Remove accents and convert non-ASCII characters to their ASCII equivalents
• Remove stop words based on a list of English stop words.
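A sketch of these pre-processing steps in Python; the stop-word list here is a tiny illustrative subset rather than the full English list the project used:

```python
# Sketch of the tweet pre-processing pipeline described above.
import re
import string
import unicodedata

STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of"}  # tiny illustrative subset

def preprocess(tweet: str) -> list[str]:
    # Remove user mentions and URLs.
    tweet = re.sub(r"@\w+|https?://\S+", "", tweet)
    # Convert hashtags to words by dropping the hash symbol.
    tweet = tweet.replace("#", "")
    # Remove accents / map non-ASCII characters to ASCII equivalents.
    tweet = unicodedata.normalize("NFKD", tweet).encode("ascii", "ignore").decode("ascii")
    # Remove punctuation and convert to lower case.
    tweet = tweet.translate(str.maketrans("", "", string.punctuation)).lower()
    # Drop stop words.
    return [w for w in tweet.split() if w not in STOP_WORDS]

print(preprocess("@user Café #review is the best! https://t.co/xyz"))
# → ['cafe', 'review', 'best']
```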
7. Features
• Similarity w.r.t related tweets
• Similarity w.r.t unrelated tweets
• Keyword similarity using the WordNet database
• Web similarity
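The first two features are similarity scores between a tweet and the sets of known related/unrelated tweets. A bag-of-words cosine similarity is one way to sketch this; the project's exact similarity measure is not specified in these slides:

```python
# Sketch: cosine similarity between a tweet and a reference set of tweets,
# used as a feature value. Raw word counts stand in for whatever term
# weighting the project actually used.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def similarity_feature(tweet: str, reference_tweets: list[str]) -> float:
    # Similarity of the tweet to the whole reference set, pooled together.
    pooled = Counter(w for t in reference_tweets for w in t.lower().split())
    return cosine(Counter(tweet.lower().split()), pooled)

related = ["apple iphone launch", "apple quarterly earnings"]
unrelated = ["apple pie recipe", "green apple juice"]
tweet = "new iphone from apple"

print(similarity_feature(tweet, related), similarity_feature(tweet, unrelated))
```

A related tweet should score higher against the related set than against the unrelated one, giving the SVM a usable signal.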
9. Evaluation and Results
• Corpus consists of tweets and a list of 61 entities.
• Trained over each entity separately using libsvm.
• Using the test data for each entity, we calculated the accuracy over the entire dataset.
• Accuracy for individual entities varies from 96% to 40%; overall accuracy is 80%.
10. Conclusion
• In this paper we tackled the problem of company name disambiguation in Twitter.
• The main goal of this task was to classify tweets as relevant or not to a given target entity.
• We explored several types of features, namely similarity between keywords and TF-IDF of n-grams, and we also explored external resources such as Freebase and Wikipedia.
• Results show that it is possible to achieve an accuracy of over 0.90.