2. Problem Statement
Input – textual content of a tweet
Output – label signifying the sentiment of the tweet (Positive, Neutral, or
Negative)
3. Motivation
Tweets express opinions about different topics, and these opinions are
valuable to several audiences:
Consumers can use sentiment analysis to research products or services before
making a purchase, e.g. reviews of the Kindle.
Marketers can use it to gauge public opinion of their company and products,
or to analyze customer satisfaction, e.g. election polls.
Organizations can use it to gather critical feedback about problems in newly
released products, e.g. brand management (Nike, Adidas).
4. Challenges
Noisy text
Lack of context - 140 characters only
Acronyms - lol, brb, gr8
Emoticons - :) , :( , :|
Negation - e.g. 'not good' carries negative, not positive, sentiment
6. Approach
Tweet Downloader
Download the tweets using the Twitter API and the twitter_download script
(https://github.com/aritter/twitter_download).
9,684 training and 8,987 testing tweets were downloaded.
Parser
The parser removes all unavailable tweets from the downloaded data.
After removing these, 7,612 tweets remain for training and 7,868 tweets
for testing.
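The parsing step can be sketched as a simple filter. The placeholder string used for unavailable tweets ("Not Available") is an assumption about what the download script writes out:

```python
def filter_available(rows):
    """Drop tweets whose text could not be fetched.

    `rows` is a list of (label, text) pairs; the placeholder text
    "Not Available" for deleted or protected tweets is an assumption
    about the downloader's output format.
    """
    return [(label, text) for label, text in rows if text != "Not Available"]
```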
7. Approach
Pre-processing
Replace Emoticons by their polarity.
Remove URLs and Targets.
Expand acronyms, e.g. 'brb' to 'be right back'.
Remove stop words.
Tokenization
Stemming
Case-folding
Remove punctuation marks
Replace sequences of repeating characters, e.g. 'hellooooo' with 'helloo'.
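The pre-processing steps above can be sketched as a single pipeline. The lookup tables are tiny illustrative stand-ins for the real emoticon, acronym, and stop-word lists, and the stemming step is omitted for brevity:

```python
import re

# Small illustrative lookup tables; the real lists would be much larger.
EMOTICONS = {":)": "pos_emoticon", ":(": "neg_emoticon", ":|": "neu_emoticon"}
ACRONYMS = {"brb": "be right back", "lol": "laughing out loud", "gr8": "great"}
STOP_WORDS = {"a", "an", "the", "is", "to", "of"}

def preprocess(tweet):
    # Replace emoticons by polarity tokens before punctuation is stripped.
    for emo, tag in EMOTICONS.items():
        tweet = tweet.replace(emo, " " + tag + " ")
    tweet = re.sub(r"https?://\S+|www\.\S+", " ", tweet)  # remove URLs
    tweet = re.sub(r"@\w+", " ", tweet)                   # remove targets
    tokens = tweet.lower().split()                        # case-fold + tokenize
    # Squash runs of 3+ repeated characters to 2: 'hellooooo' -> 'helloo'.
    tokens = [re.sub(r"(.)\1{2,}", r"\1\1", t) for t in tokens]
    tokens = [re.sub(r"[^\w]", "", t) for t in tokens]    # strip punctuation
    # Expand acronyms, then drop stop words and empty tokens
    # (stemming is omitted from this sketch).
    tokens = [w for t in tokens for w in ACRONYMS.get(t, t).split()]
    return [t for t in tokens if t and t not in STOP_WORDS]
```

The emoticon pass runs first on purpose: stripping punctuation earlier would destroy `:)` before its polarity could be recorded.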
8. Approach
Feature Extractor
The pre-processed data file is fed to the feature extractor, which creates the
feature vector.
The basic (baseline) feature considered was unigrams.
A list of all unique unigrams across the training set was constructed; it forms
the basic feature vector for each tweet.
Synsets are used for words that are not found in the list of unique unigrams.
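The baseline unigram feature vector can be sketched as a binary presence vector over the training-set vocabulary. The synset fallback for out-of-vocabulary words is only noted in a comment here, since it needs an external WordNet resource:

```python
def build_vocab(tweets):
    # Unique unigrams across the training set define the vector dimensions.
    vocab = sorted({tok for tweet in tweets for tok in tweet})
    return {tok: i for i, tok in enumerate(vocab)}

def unigram_vector(tokens, vocab):
    # Binary presence vector; out-of-vocabulary tokens are simply skipped
    # in this sketch (the project instead falls back to synsets for them).
    vec = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:
            vec[vocab[tok]] = 1
    return vec
```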
9. Approach
Add Additional Features
Polarity scores of the tweets
Negation
Hashtags
Special characters (?,!,*)
Capitalized words
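The additional features above can be sketched as simple counts and flags computed on the raw tweet. The exact feature definitions and the lexicon used for the polarity score are not given in the slides, so the choices below (including the negation word list) are illustrative assumptions; the lexicon-based polarity score is omitted since it needs an external word list:

```python
import re

def additional_features(raw_tweet):
    # Hypothetical hand-crafted features appended to the n-gram vector.
    negation = re.search(r"\b(not|no|never)\b|n't", raw_tweet.lower())
    return [
        raw_tweet.count("#"),                          # hashtag count
        sum(raw_tweet.count(c) for c in "?!*"),        # special characters
        len(re.findall(r"\b[A-Z]{2,}\b", raw_tweet)),  # fully capitalized words
        1 if negation else 0,                          # negation present?
    ]
```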
SVM Classification and Prediction
The extracted features are passed to the SVM classifier to build a model.
The model is then used to predict the sentiment of new tweets.
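The classification step can be sketched with scikit-learn's `LinearSVC`; the slides do not name a library or kernel, so this choice, and the toy feature vectors, are assumptions:

```python
from sklearn.svm import LinearSVC

# Toy feature vectors; in the project these come from the feature
# extractor (unigrams plus the additional features).
X_train = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
y_train = ["positive", "negative", "positive", "neutral"]

clf = LinearSVC()          # linear-kernel SVM, one-vs-rest over the 3 classes
clf.fit(X_train, y_train)
prediction = clf.predict([[1, 0, 1, 0]])[0]
```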
10. Results
Features                        Accuracy   Precision   Recall   F1 score
Unigram                         54.855%    0.5264      0.5061   0.5126
Unigram + additional features   57.079%    0.5525      0.5308   0.5386
Bigrams                         58.579%    0.5713      0.5173   0.5269
Bigrams + additional features   60.739%    0.5930      0.5525   0.5637
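The precision, recall, and F1 columns appear to be averaged over the three classes. A sketch of how such scores are computed, assuming macro-averaging (the averaging scheme is not stated in the slides):

```python
def macro_scores(y_true, y_pred, labels):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    precs, recs, f1s = [], [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        precs.append(prec)
        recs.append(rec)
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    n = len(labels)
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return acc, sum(precs) / n, sum(recs) / n, sum(f1s) / n
```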