On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of Twitter

Hassan Saif, Miriam Fernandez, Yulan He and Harith Alani
Knowledge Media Institute, The Open University,
Milton Keynes, United Kingdom
The 9th edition of the Language Resources and Evaluation
Conference, Reykjavik, Iceland

Outline
• Sentiment Analysis
• Twitter
• Stopwords Removal Methods
• Comparative Study
• Conclusion

Sentiment Analysis
“Sentiment analysis is the task of identifying positive and negative opinions, emotions and evaluations in text”
• The main dish was delicious (Opinion)
• It is a Syrian dish (Fact)
• The main dish was salty and horrible (Opinion)

Stopwords Removal in Twitter Sentiment Analysis
- Kouloumpis et al., 2011
- Pak & Paroubek, 2010
- Asiaee et al., 2012
- Bollen et al., 2011
- Bifet and Frank, 2010
- Speriosu et al., 2011
- Zhang & Yuan, 2013
- Gokulakrishnan et al., 2012
- Saif et al., 2012
- Hu et al., 2013
- Camara et al., 2013
Removing stopwords is USEFUL? NO / YES

Classic Stopword Lists
• Precompiled
• Very popular
• Outdated
• Domain-independent

Automatic Stopwords Generation Methods
• Unsupervised Methods
– Term Frequency
– Term-based Random Sampling
• Supervised Methods
– Term Entropy Measures
– Maximum Likelihood Estimation

Stopwords Removal for Twitter Sentiment Analysis

Stopword Analysis Set-Up (1)
Datasets
[Figure: bar chart on a 0% to 100% axis, one bar per dataset (OMD, HCR, STS, SemEval, WAB, GASP)]
Number of tweets per dataset:
Dataset    OMD   HCR   STS    SemEval   WAB    GASP
Negative   688   957   1402   1590      2580   5235
Positive   393   397   632    3781      2915   1050

Stopwords Removal Methods
1. The Baseline Method
– (no stopwords are removed)
2. The Classic Method
– This method removes stopwords obtained from pre-compiled lists
– Van Stoplist

3. Methods based on Zipf’s Law
- TF-High Method: removing the most frequent words
- TF1 Method: removing singleton words (i.e., words that occur once in tweets)
- IDF Method: removing words with low inverse document frequency (IDF)
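
The three Zipf-based methods above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `zipf_stoplists`, the cut-off `top_n` for TF-High and the `idf_threshold` for the IDF method are assumptions, since the slides do not state the thresholds used. Tweets are assumed to be already tokenised.

```python
import math
from collections import Counter

def zipf_stoplists(tweets, top_n=10, idf_threshold=1.0):
    """Build three candidate stoplists from a list of tokenised tweets.

    top_n and idf_threshold are illustrative cut-offs (assumptions).
    """
    # Term frequency: total occurrences of each word in the corpus.
    tf = Counter(tok for tweet in tweets for tok in tweet)
    # Document frequency: number of tweets that contain each word.
    df = Counter(tok for tweet in tweets for tok in set(tweet))
    n_docs = len(tweets)

    # TF-High: the top_n most frequent words.
    tf_high = {w for w, _ in tf.most_common(top_n)}
    # TF1: singleton words, i.e. words that occur exactly once.
    tf1 = {w for w, count in tf.items() if count == 1}
    # IDF: words whose inverse document frequency falls below the threshold.
    low_idf = {w for w in df if math.log(n_docs / df[w]) < idf_threshold}
    return tf_high, tf1, low_idf

tweets = [
    ["the", "food", "was", "great"],
    ["the", "service", "was", "slow"],
    ["the", "view", "is", "great"],
]
tf_high, tf1, low_idf = zipf_stoplists(tweets, top_n=1, idf_threshold=0.5)
```

On this toy corpus, TF-High picks the single most frequent word, TF1 picks the five words that occur once, and the IDF list contains the words that appear in most tweets.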

4. Term-based Random Sampling (TBRS)
5. The Mutual Information Method (MI)
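
As an illustration of the MI method, the sketch below scores each word by the mutual information between its presence in a tweet and the tweet's sentiment label, on the assumption that low-scoring words carry little class information and are therefore stopword candidates. The slides only name the method, so the exact formulation, the function names `mutual_information` and `mi_stoplist`, and the threshold are assumptions.

```python
import math
from collections import Counter

def mutual_information(tweets, labels):
    """Return {word: MI in bits} between word presence and class label."""
    n = len(tweets)
    class_counts = Counter(labels)
    joint = Counter()  # (word, label) -> number of tweets with both
    for tokens, label in zip(tweets, labels):
        for word in set(tokens):
            joint[(word, label)] += 1
    term_counts = Counter()  # word -> number of tweets containing it
    for (word, _), count in joint.items():
        term_counts[word] += count

    scores = {}
    for word, n_t in term_counts.items():
        mi = 0.0
        for label, n_y in class_counts.items():
            n_ty = joint[(word, label)]
            # Two cells per class: word present, word absent.
            for n_joint, n_term in ((n_ty, n_t), (n_y - n_ty, n - n_t)):
                if n_joint > 0:
                    mi += (n_joint / n) * math.log2(n * n_joint / (n_term * n_y))
        scores[word] = mi
    return scores

def mi_stoplist(scores, threshold):
    """Words whose MI falls below the (assumed) threshold."""
    return {word for word, mi in scores.items() if mi < threshold}

tweets = [["good", "the"], ["good", "the"], ["bad", "the"], ["bad", "the"]]
labels = ["pos", "pos", "neg", "neg"]
scores = mutual_information(tweets, labels)
stoplist = mi_stoplist(scores, threshold=0.1)
```

Here "the" appears equally in both classes and scores zero, while "good" and "bad" perfectly separate the classes and score one bit each, so only "the" ends up in the stoplist.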

Twitter Sentiment Classifiers
– Two Supervised Classifiers:
• Maximum Entropy (MaxEnt)
• Naïve Bayes (NB)
– Performance measured in accuracy and F1-measure
– 10-fold cross-validation
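
The evaluation protocol can be sketched without any particular classifier library. The helpers below are hypothetical (`k_fold_indices`, `accuracy`, `f1`) and make simplifying assumptions: folds are assigned round-robin with no shuffling or stratification, and F1 is computed for one target class at a time. The slides do not say how folds were drawn or how F1 was averaged.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train_idx, test_idx) pairs; each item is tested exactly once."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def accuracy(gold, pred):
    """Fraction of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f1(gold, pred, target):
    """F1 for one target class: harmonic mean of precision and recall."""
    tp = sum(g == target and p == target for g, p in zip(gold, pred))
    fp = sum(g != target and p == target for g, p in zip(gold, pred))
    fn = sum(g == target and p != target for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = ["pos", "pos", "neg", "neg"]
pred = ["pos", "neg", "neg", "neg"]
print(accuracy(gold, pred))   # 0.75
print(f1(gold, pred, "neg"))  # 0.8
```

In a full experiment, each `(train_idx, test_idx)` pair would be used to train a MaxEnt or NB classifier on the training tweets and score it on the held-out fold, with the per-fold scores averaged.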

Experimental Results
Assess the impact of removing
stopwords by observing changes in:
- Classification Performance
- Feature space
- Data Sparsity

Experimental Results (1)
1. Classification Performance
[Figure: two bar charts, Accuracy (%) and F-Measure (F1, %), comparing MaxEnt and NB on OMD, HCR, STS-Gold, SemEval, WAB and GASP]
The baseline classification performance in accuracy and F-measure
of the MaxEnt and NB classifiers across all datasets

1. Classification Performance
[Figure: two bar charts, Accuracy (%) and F-Measure (F1, %), comparing MaxEnt and NB for the Baseline, Classic, TF1, TF-High, IDF, TBRS and MI stoplists]
Average accuracy and F-measure of the MaxEnt and NB classifiers using
different stoplists

2. Feature Space
Reduction rate on the feature space of the various stoplists (%)
Stoplist    Baseline   Classic   TF1     TF-High   IDF     TBRS   MI
Reduction   0.00       5.50      65.24   0.82      11.22   6.06   19.34
[Figure: stacked bars on a 0% to 100% axis, TF=1 vs. TF>1]
The number of singleton words relative to the number of
non-singleton words in all datasets
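
The reduction rate can be read as the percentage of distinct words (features) that a stoplist removes from the vocabulary. The sketch below assumes that definition and a unigram feature space; the helper names `vocabulary` and `reduction_rate` are hypothetical.

```python
def vocabulary(tweets):
    """Set of distinct words across all tokenised tweets."""
    return {tok for tweet in tweets for tok in tweet}

def reduction_rate(tweets, stoplist):
    """Percentage of the vocabulary removed by the stoplist."""
    before = vocabulary(tweets)
    after = before - set(stoplist)
    return 100.0 * (len(before) - len(after)) / len(before)

tweets = [
    ["the", "food", "was", "great"],
    ["the", "service", "was", "slow"],
    ["the", "view", "is", "great"],
]
singletons = {"food", "service", "slow", "view", "is"}
print(reduction_rate(tweets, singletons))  # 62.5
```

Because a large share of the words in a tweet corpus occur only once, removing singletons shrinks the vocabulary far more than removing a handful of very frequent words.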

3. Data Sparsity
[Figure: bar chart of sparsity degree per stoplist, axis from 0.98800 to 1.00000]
Stoplist impact on the sparsity degree of all datasets
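
One common way to quantify sparsity is the fraction of zero cells in the tweet-by-term matrix. The sketch below assumes that definition, since the slides do not give the formula, and `sparsity_degree` is a hypothetical helper. Tweets left empty after filtering are kept as all-zero rows.

```python
def sparsity_degree(tweets, stoplist=()):
    """Fraction of zero cells in the tweet-by-term presence matrix."""
    stop = set(stoplist)
    # Each tweet becomes the set of its remaining distinct words.
    docs = [{tok for tok in tweet if tok not in stop} for tweet in tweets]
    vocab = set().union(*docs)
    total_cells = len(docs) * len(vocab)
    if total_cells == 0:
        return 0.0
    nonzero_cells = sum(len(doc) for doc in docs)
    return (total_cells - nonzero_cells) / total_cells

tweets = [
    ["the", "food", "was", "great"],
    ["the", "service", "was", "slow"],
    ["the", "view", "is", "great"],
]
singletons = {"food", "service", "slow", "view", "is"}
print(sparsity_degree(tweets))              # 0.5
print(sparsity_degree(tweets, singletons))  # 2/9, about 0.22
```

On real tweet corpora the matrix is almost entirely zeros, which is consistent with the axis range of 0.988 to 1.000 in the figure.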

The Ideal Stoplist (1)
• The ideal stopword removal method is the
one that:
– Helps maintain high classification
performance
– Shrinks the classifier’s feature space
– Reduces data sparseness
– Has low runtime and storage complexity
– Requires minimal human supervision

The Ideal Stoplist (2)
Overall Analysis Results
Average accuracy, F1, reduction rate on feature space and data sparsity of the six stoplist
methods. Positive sparsity values indicate an increase in the sparsity degree, while negative
values indicate a decrease.

Conclusion
• We studied how six different stopword removal methods
affect sentiment polarity classification on Twitter.
• Using a pre-compiled (classic) stoplist has a negative
impact on classification performance.
• The TF1 stopword removal method obtains the
best trade-off:
– Reducing the feature space by nearly 65%,
– Decreasing the data sparsity degree by up to 0.37%, and
– Maintaining high classification performance.
