Tweet Classification
Mentor: Romil Bansal
GROUP NO-37
Manish Jindal(201305578)
Trilok Sharma(201206527)
Yash Shah (201101127)
Guided by : Dr. Vasudeva Varma
Problem Statement : To automatically classify Tweets
from Twitter into various genres based on predefined
Wikipedia Categories.
Motivation:
o Twitter is a major social networking service with over 200
million tweets made every day.
o Twitter provides a list of Trending Topics in real time, but it
is often hard to understand what these trending topics are
about.
o It is important and necessary to classify these topics into
general categories with high accuracy for better
information retrieval.
Data
Dataset :
o Input data is static or real-time data consisting of user tweets.
o Training dataset:
Fetched from Twitter with the twitter4j API.
Final Deliverable:
o It returns the list of all categories to which the input tweet
belongs.
o It also reports the accuracy of each algorithm used for
classifying tweets.
Categories
We took the following categories into consideration for
classifying Twitter data.
1) Business
2) Education
3) Entertainment
4) Health
5) Law
6) Lifestyle
7) Nature
8) Places
9) Politics
10) Sports
11) Technology
Concepts used for better performance
 Outlier removal
 Remove very low-frequency and very high-frequency words using
a bag-of-words approach.
 Stop-word removal
 Remove the most common words, such as "the", "is", "at",
"which", and "on".
 Keyword stemming
 Reduce inflected words to their stem, base, or root
form using Porter stemming.
 Cleaning crawled data (a preprocessing sketch follows below).
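A minimal sketch of this preprocessing (Python; it assumes NLTK for the Porter stemmer, and the stop-word list and frequency cut-offs are illustrative, not the project's actual values):

from collections import Counter
from nltk.stem import PorterStemmer  # assumes NLTK is installed

STOP_WORDS = {"the", "is", "at", "which", "on", "a", "and", "who", "for"}  # illustrative list
stemmer = PorterStemmer()

def preprocess(tweets, min_freq=2, max_freq=1000):
    """Clean raw tweets: stop-word removal, Porter stemming, frequency-based outlier removal."""
    # Tokenize, lowercase, drop stop words, and stem each remaining token.
    tokenized = [
        [stemmer.stem(w) for w in tweet.lower().split() if w not in STOP_WORDS]
        for tweet in tweets
    ]
    # Bag-of-words term frequencies over the whole corpus.
    freq = Counter(w for tokens in tokenized for w in tokens)
    # Remove outliers: words too rare or too frequent to discriminate between categories.
    return [
        [w for w in tokens if min_freq <= freq[w] <= max_freq]
        for tokens in tokenized
    ]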
Other Concepts used
 Spelling correction
 Correct spellings using the edit-distance method (a sketch
follows below).
 Named Entity Recognition
 Used for ranking the result categories and finding the most
appropriate one.
 Synonym expansion
 If a feature (word) of the test query is not found as one of the
dimensions of the feature space, replace that word with
its synonym. Done using WordNet.
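A minimal sketch of the edit-distance (Levenshtein) computation behind the spelling correction (Python; the candidate vocabulary and distance cut-off are illustrative assumptions):

def edit_distance(a, b):
    """Levenshtein distance between strings a and b (insert/delete/substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def correct(word, vocabulary, max_dist=2):
    """Return the closest in-vocabulary word, or the word itself if nothing is close enough."""
    best = min(vocabulary, key=lambda v: edit_distance(word, v))
    return best if edit_distance(word, best) <= max_dist else word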
Tweets Classification Algorithms
We used three algorithms for classification:
1) Naïve Bayes
2) SVM-based supervised
3) Rule-based
System pipeline (flow diagram)
Training: Crawl Twitter data → Remove outliers → Tweet cleaning and stop-word removal → Extract features (unique word list) → Create a feature vector for each tweet → Create an index file of feature vectors → Create model files.
Testing: Test query/tweet → Tweet cleaning and stop-word removal → Edit distance and WordNet synonyms for unseen features → Create a feature vector for the test tweet → Create an index file of feature vectors → Apply model files → Apply Named Entity Recognition → Rank result categories → Output category.
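A minimal sketch of the feature-extraction and feature-vector steps (Python; the exact encoding of the project's vectors is not given, so plain term counts over the unique word list are an assumption):

def build_vocabulary(tokenized_tweets):
    """Unique word list extracted from the training tweets; maps word -> dimension index."""
    vocab = sorted({w for tokens in tokenized_tweets for w in tokens})
    return {w: i for i, w in enumerate(vocab)}

def feature_vector(tokens, vocab):
    """Term-count vector of one tweet over the vocabulary dimensions."""
    vec = [0] * len(vocab)
    for w in tokens:
        if w in vocab:
            vec[vocab[w]] += 1
    return vec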
Main idea for Supervised Learning
 Assumption: the training set consists of instances of
different classes cj, each described as a conjunction of
attribute values.
 Task: classify a new instance d, given as a tuple of
attribute values, into one of the classes cj ∈ C.
 Key idea: assign the most probable class using a
supervised learning algorithm.
Method 1 : Bayes Classifier
 Bayes' rule states:
P(c | d) = P(d | c) · P(c) / P(d)
where P(d | c) is the likelihood, P(c) is the prior, and P(d) is the
normalization constant.
 We used the "WEKA" library for machine learning in the Bayes
classifier for our project.
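The project used WEKA's Bayes classifier. Purely as an illustration of the idea (not the WEKA implementation), a minimal multinomial Naïve Bayes with add-one smoothing might look like this:

import math
from collections import Counter, defaultdict

def train_nb(tweets, labels):
    """Estimate log-priors and smoothed log-likelihoods from tokenized, labeled tweets."""
    vocab = {w for tokens in tweets for w in tokens}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, c in zip(tweets, labels):
        word_counts[c].update(tokens)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_lik = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + len(vocab)   # add-one smoothing denominator
        log_lik[c] = {w: math.log((word_counts[c][w] + 1) / total) for w in vocab}
    return log_prior, log_lik, vocab

def predict_nb(tokens, log_prior, log_lik, vocab):
    """Most probable class: argmax_c [ log P(c) + sum_w log P(w | c) ]."""
    scores = {
        c: log_prior[c] + sum(log_lik[c][w] for w in tokens if w in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)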
Method 2 : SVM Classifier
(Support Vector Machine)
 Given a new point x, we can score its projection
onto the hyperplane normal:
score(x) = wᵀx + b = Σi αi yi xiᵀx + b
 Decide the class based on whether the score is < or > 0.
 Can set a confidence threshold t:
Score > t: yes
Score < -t: no
Otherwise: don't know
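A minimal sketch of this scoring-and-threshold decision (Python; the dual coefficients αi, labels yi, support vectors xi, and bias b are assumed to come from an already-trained binary SVM):

def svm_score(x, alphas, ys, support_vectors, b):
    """score(x) = sum_i alpha_i * y_i * <x_i, x> + b."""
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    return sum(a * y * dot(sv, x) for a, y, sv in zip(alphas, ys, support_vectors)) + b

def decide(score, t):
    """Thresholded decision: 'yes' above t, 'no' below -t, otherwise abstain."""
    if score > t:
        return "yes"
    if score < -t:
        return "no"
    return "don't know"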
Multi-class SVM Approaches
 1-against-all
Each of the k SVMs separates a single class from all
remaining classes (Cortes and Vapnik, 1995).
 1-against-1
Pair-wise: k(k-1)/2 SVMs are trained, where k = |Y| is the number
of classes; each SVM separates a pair of classes (Friedman, 1996).
For our 11 categories, 1-against-all trains 11 SVMs, while
1-against-1 trains 11·10/2 = 55.
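A minimal sketch of the 1-against-all decision (Python; it reuses the hypothetical svm_score from the previous sketch and assumes one trained binary SVM per category):

def one_vs_all_predict(x, models):
    """models: {category: (alphas, ys, support_vectors, b)}.
    Score x against every per-category SVM and pick the highest-scoring class."""
    return max(models, key=lambda c: svm_score(x, *models[c]))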
Advantages of SVM
 High-dimensional input space
 Few irrelevant features (dense concept)
 Sparse document vectors (sparse instances)
 Most text categorization problems are linearly separable
 For linearly inseparable data we can use kernels to map the
data into a higher-dimensional space, where it becomes
linearly separable by a hyperplane.
Method 3 : Rule Based
 We defined a set of rules to classify a tweet based on term
frequency (a code sketch follows the example below).
 a. Extract the features of the tweet.
 b. Count the term frequency of each feature; the feature
with the maximum term frequency across all the categories
mentioned above gives our first classification.
 c. Since this cannot be right all the time, we also count
how many of the tweet's features fall in each category; the
category closest to the tweet (containing the most features)
becomes our next classification.
Example
 Tweet: "sachin is a good player, who eats apple and
banana which is good for health."
 Features: sachin, player, eats, apple, health, banana
 Stop words: is, a, good, he, was, for, which, and, who
 Classification (feature - category : term frequency):
sachin - sports : 2000
player - sports : 900
eats - health : 500
apple - technology : 1000
health - health : 800
banana - health : 700
 Max term frequency: sachin, so our first category is Sports.
 2nd approximation: most features (3) lie in Health,
so our second approximation is Health.
 If both of these give the same category, we have only
one category; i.e., if the majority of features had also lain in
Sports, the only result would be Sports.
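A minimal sketch of this rule-based scoring (Python; the per-feature term-frequency table is assumed to be precomputed from the training data, as in the example above):

from collections import Counter

def rule_based_classify(features, term_freq):
    """term_freq: {feature: (category, frequency)}.
    Returns (first, second): the category of the single highest-frequency feature,
    and the category in which the largest number of the tweet's features fall."""
    known = [f for f in features if f in term_freq]
    if not known:
        return None, None
    top_feature = max(known, key=lambda f: term_freq[f][1])   # e.g. 'sachin' (2000)
    first = term_freq[top_feature][0]                         # -> 'sports'
    counts = Counter(term_freq[f][0] for f in known)          # e.g. health: 3, sports: 2, ...
    second = counts.most_common(1)[0][0]                      # -> 'health'
    return first, second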
Cross-validation (Accuracy)
 Steps for k-fold cross-validation:
Step 1: split the data into k subsets of equal size.
Step 2: use each subset in turn for testing and the
remainder for training.
 The subsets are often stratified before the cross-
validation is performed.
 The k error estimates are averaged to yield an
overall error estimate.
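A minimal sketch of k-fold cross-validation (Python; train and evaluate are hypothetical stand-ins for fitting and scoring any of the three classifiers above):

def k_fold_accuracy(data, labels, k, train, evaluate):
    """Average accuracy over k folds; each fold is held out once for testing."""
    n = len(data)
    fold = n // k
    scores = []
    for i in range(k):
        lo, hi = i * fold, (i + 1) * fold if i < k - 1 else n
        test_x, test_y = data[lo:hi], labels[lo:hi]
        train_x, train_y = data[:lo] + data[hi:], labels[:lo] + labels[hi:]
        model = train(train_x, train_y)
        scores.append(evaluate(model, test_x, test_y))  # fraction of correct predictions
    return sum(scores) / k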
Accuracy Results (10 folds)
Accuracy of each algorithm in %

Category        SVM     Naïve Bayes   Rule-based
Business        86.6    81.44         98.30
Education       85.71   76.07         81.8
Entertainment   86.8    79.1          87.49
Health          95.67   84.62         90.93
Law             81.17   73.38         75.25
Lifestyle       93.27   89.71         82.42
Nature          87.0    78.64         84.24
Places          81.01   75.35         80.73
Politics        81.91   81.88         76.31
Sports          87.11   83.57         81.87
Technology      83.64   82.44         77.05
Unique features
 Worked on freshly crawled Twitter data using the twitter4j API.
 Worked on eleven different categories.
 Applied three different methods of supervised learning to
classify tweets into the different categories.
 Achieved high speed, with accuracy in the range of 85 to 95%.
 Performed tweet cleaning, stemming, and stop-word removal.
 Used edit distance for spelling correction.
 Used Named Entity Recognition for ranking.
 Used WordNet for query expansion and synonym finding.
 Validated using 10-fold cross-validation.
Snapshot: result and accuracy (screenshots).
Thank You!
