On Stopwords, Filtering and Data Sparsity for
Sentiment Analysis of Twitter
Hassan Saif, Miriam Fernandez, Yulan He and Harith Alani
Knowledge Media Institute, The Open University,
Milton Keynes, United Kingdom
The 9th edition of the Language Resources and Evaluation
Conference, Reykjavik, Iceland
Outline
• Sentiment Analysis
• Twitter
• Stopwords Removal Methods
• Comparative Study
• Conclusion
“Sentiment analysis is the task of identifying
positive and negative opinions, emotions and
evaluations in text”
Sentiment Analysis
Opinion: The main dish was delicious
Fact: It is a Syrian dish
Opinion: The main dish was salty and horrible
Stopwords Removal
Stopwords Removal in Twitter Sentiment Analysis
- Kouloumpis et al., 2011
- Pak & Paroubek, 2010
- Asiaee et al., 2012
- Bollen et al., 2011
- Bifet and Frank, 2010
- Speriosu et al., 2011
- Zhang & Yuan, 2013
- Gokulakrishnan et al., 2012
- Saif et al., 2012
- Hu et al., 2013
- Camara et al., 2013
Removing Stopwords is USEFUL
NO / YES
Classic Stopword Lists
• Precompiled
• Very popular
• Outdated
• Domain-independent
Automatic Stopwords Generation Methods
• Unsupervised Methods
– Term Frequency
– Term-based Random Sampling
• Supervised Methods
– Term Entropy Measures
– Maximum Likelihood Estimation
Stopwords Removal for Twitter Sentiment Analysis
Stopword Analysis Set-Up (1)
Datasets (number of tweets per class)
           OMD   HCR   STS    SemEval   WAB    GASP
Negative   688   957   1402   1590      2580   5235
Positive   393   397   632    3781      2915   1050
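The class balance that the dataset chart illustrated can be recomputed from the counts above. The following is a minimal Python sketch; the names DATASETS and positive_share are illustrative and do not come from the slides.

```python
# Tweet counts per dataset, copied from the slide: (negative, positive).
DATASETS = {
    "OMD": (688, 393),
    "HCR": (957, 397),
    "STS": (1402, 632),
    "SemEval": (1590, 3781),
    "WAB": (2580, 2915),
    "GASP": (5235, 1050),
}


def positive_share(negative, positive):
    """Return the fraction of tweets in a dataset that are labelled positive."""
    return positive / (negative + positive)


# Print the size and the positive share of every dataset.
for name, (neg, pos) in DATASETS.items():
    print(f"{name:8s} total={neg + pos:5d} positive={positive_share(neg, pos):.1%}")
```

Read this way, SemEval and WAB are the only datasets in which positive tweets outnumber negative ones.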
Stopword Analysis Set-Up (2)
Stopwords Removal Methods
1. The Baseline Method
– No stopwords are removed
2. The Classic Method
– Removes stopwords obtained from pre-compiled lists
– Van stoplist
Stopword Analysis Set-Up (3)
Stopwords Removal Methods
3. Methods based on Zipf’s Law
- TF-High Method: removing the most frequent words
- TF1 Method: removing singleton words (i.e., words that occur once in tweets)
- IDF Method: removing words with low inverse document frequency (IDF)
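The three frequency-based stoplists can be sketched in a few lines of Python. This is only an illustration under stated assumptions: tweets are assumed to be pre-tokenised lists of words, and the cut-offs top_k and threshold are hypothetical parameters, since the slides do not give the values used in the study.

```python
import math
from collections import Counter


def tf_high_stoplist(tweets, top_k):
    """TF-High: the top_k most frequent words in the corpus (top_k is assumed)."""
    tf = Counter(word for tweet in tweets for word in tweet)
    return {word for word, _ in tf.most_common(top_k)}


def tf1_stoplist(tweets):
    """TF1: singleton words, i.e. words whose corpus frequency is exactly 1."""
    tf = Counter(word for tweet in tweets for word in tweet)
    return {word for word, count in tf.items() if count == 1}


def idf_stoplist(tweets, threshold):
    """IDF: words whose inverse document frequency is below threshold (assumed)."""
    n = len(tweets)
    # Document frequency: number of tweets that contain the word at least once.
    df = Counter(word for tweet in tweets for word in set(tweet))
    return {word for word, d in df.items() if math.log(n / d) < threshold}


# Toy corpus built from the example sentences, for illustration only.
tweets = [
    ["the", "dish", "was", "delicious"],
    ["the", "dish", "was", "salty"],
    ["it", "is", "the", "best"],
]
print(sorted(tf_high_stoplist(tweets, top_k=1)))
print(sorted(tf1_stoplist(tweets)))
print(sorted(idf_stoplist(tweets, threshold=0.5)))
```

Note that on this toy corpus the TF1 stoplist contains the sentiment-bearing words "delicious" and "salty"; on a real corpus singletons are dominated by rare tokens, which is what the method relies on.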
Stopword Analysis Set-Up (4)
Stopwords Removal Methods
4. Term-based Random Sampling (TBRS)
5. The Mutual Information Method (MI)
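As a rough illustration of the MI method, the sketch below scores a term by the mutual information between its presence in a tweet and the tweet's class, and puts low-scoring terms on the stoplist. It uses the standard feature-selection formulation of mutual information, which is an assumption here: the slides do not state the exact formula or the cut-off, and the names mutual_information, mi_stoplist, target and threshold are hypothetical.

```python
import math


def mutual_information(tweets, labels, term, target="positive"):
    """MI (in bits) between presence of `term` and membership in class `target`."""
    n = len(tweets)
    # Joint counts over (term present?, tweet in target class?).
    counts = {(t, c): 0 for t in (True, False) for c in (True, False)}
    for tokens, label in zip(tweets, labels):
        counts[(term in tokens, label == target)] += 1
    mi = 0.0
    for (t, c), n_tc in counts.items():
        if n_tc == 0:
            continue  # an empty cell contributes nothing (0 * log 0 is taken as 0)
        n_t = counts[(t, True)] + counts[(t, False)]  # tweets with this term value
        n_c = counts[(True, c)] + counts[(False, c)]  # tweets with this class value
        mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi


def mi_stoplist(tweets, labels, threshold, target="positive"):
    """Words whose MI with the class is below threshold (threshold is assumed)."""
    vocab = {word for tokens in tweets for word in tokens}
    return {w for w in vocab
            if mutual_information(tweets, labels, w, target) < threshold}


tweets = [["the", "good"], ["the", "good"], ["the", "bad"], ["the", "bad"]]
labels = ["positive", "positive", "negative", "negative"]
print(sorted(mi_stoplist(tweets, labels, threshold=0.5)))
```

In this toy example "the" occurs in every tweet, carries no information about the class, and is the only word placed on the stoplist.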
Stopword Analysis Set-Up (5)
Twitter Sentiment Classifiers
– Two Supervised Classifiers:
• Maximum Entropy (MaxEnt)
• Naïve Bayes (NB)
– Performance measured in accuracy and F1
– 10-fold cross-validation
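The evaluation protocol can be sketched without any library. The helpers below are illustrative: the slides do not say how folds were drawn or whether F1 was averaged over classes, so the round-robin fold assignment and the single-class F1 are assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of tweets whose predicted label equals the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive="positive"):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def k_fold_indices(n, k=10):
    """Yield (train, test) index lists for k folds; item i goes to fold i % k."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test


gold = ["positive", "positive", "negative", "negative"]
pred = ["positive", "negative", "negative", "negative"]
print(accuracy(gold, pred), f1_score(gold, pred))
```

In a full run, a classifier would be trained on each train split, scored on the matching test split, and the ten accuracy and F1 values averaged.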
Experimental Results
Assess the impact of removing stopwords by observing fluctuations on:
- Classification Performance
- Feature space
- Data Sparsity
Experimental Results (1)
1. Classification Performance
[Figure: two bar charts, Accuracy (%) and F-Measure (F1, %), comparing MaxEnt and NB on OMD, HCR, STS-Gold, SemEval, WAB and GASP]
The baseline classification performance in Accuracy and F-measure of MaxEnt and NB classifiers across all datasets
Experimental Results (2)
1. Classification Performance
[Figure: two bar charts, Accuracy (%) and F-Measure (F1, %), comparing MaxEnt and NB for each stoplist: Baseline, Classic, TF1, TF-High, IDF, TBRS, MI]
Average Accuracy and F-measure of MaxEnt and NB classifiers using different stoplists
Experimental Results (3)
2. Feature Space
Reduction rate on the feature space of the various stoplists
Stoplist   Reduction rate (%)
Baseline   0.00
Classic    5.50
TF1        65.24
TF-High    0.82
IDF        11.22
TBRS       6.06
MI         19.34
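A reduction rate of this kind can be computed as the relative drop in vocabulary size after filtering. The slides do not define the measure, so the definition below is an assumption, and the helper names vocabulary and reduction_rate are hypothetical.

```python
def vocabulary(tweets, stoplist=frozenset()):
    """Distinct words left in the corpus after removing stoplist entries."""
    return {word for tweet in tweets for word in tweet if word not in stoplist}


def reduction_rate(tweets, stoplist):
    """Percentage of the baseline feature space removed by the stoplist."""
    before = len(vocabulary(tweets))
    after = len(vocabulary(tweets, stoplist))
    return 100.0 * (before - after) / before


# Toy corpus: 8 distinct words, 5 of which occur only once.
tweets = [
    ["the", "dish", "was", "delicious"],
    ["the", "dish", "was", "salty"],
    ["it", "is", "the", "best"],
]
singletons = {"delicious", "salty", "it", "is", "best"}
print(reduction_rate(tweets, singletons))
```

Here removing the five singletons cuts the eight-word vocabulary by 62.5%, the same kind of large reduction the TF1 method shows on the real datasets.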
[Figure: chart (0% to 100%) showing the share of words with TF=1 versus TF>1 in OMD, HCR, STS-Gold, SemEval, WAB and GASP]
The ratio of singleton words to non-singleton words in all datasets
Experimental Results (4)
3. Data Sparsity
[Figure: sparsity degree (axis range 0.98800 to 1.00000) of OMD, HCR, STS-Gold, SemEval, WAB and GASP under each stoplist: Baseline, Classic, TF1, TF-High, IDF, TBRS, MI]
Stoplist impact on the sparsity degree of all datasets
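One way to make the sparsity degree concrete is to treat it as the fraction of zero cells in the tweet-by-term matrix. That reading is an assumption, since the slides do not spell out the formula, and the function name sparsity_degree is hypothetical.

```python
def sparsity_degree(tweets, stoplist=frozenset()):
    """Fraction of zero cells in the binary tweet-by-term matrix after filtering."""
    # Each tweet becomes the set of distinct words that survive the stoplist.
    docs = [{word for word in tweet if word not in stoplist} for tweet in tweets]
    vocab = set().union(*docs)
    cells = len(docs) * len(vocab)
    if cells == 0:
        return 0.0  # empty corpus or empty vocabulary: nothing to measure
    nonzero = sum(len(doc) for doc in docs)
    return 1.0 - nonzero / cells


tweets = [
    ["the", "dish", "was", "delicious"],
    ["the", "dish", "was", "salty"],
    ["it", "is", "the", "best"],
]
singletons = {"delicious", "salty", "it", "is", "best"}
print(sparsity_degree(tweets), sparsity_degree(tweets, singletons))
```

Each singleton adds a matrix column with a single non-zero cell, so removing singletons lowers the sparsity degree in this toy case.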
The Ideal Stoplist (1)
• The ideal stopword removal method is the one that:
– Helps maintain a high classification performance
– Shrinks the classifier’s feature space
– Reduces data sparseness
– Has low runtime and storage complexity
– Requires minimal human supervision
The Ideal Stoplist (2)
Average accuracy, F1, reduction rate on feature space and data sparsity of the six stoplist
methods. Positive sparsity values refer to an increase in the sparsity degree while negative
values refer to a decrease in the sparsity degree.
Overall Analysis Results
Conclusion
• We studied how six different stopword removal methods affect sentiment polarity classification on Twitter.
• Using a pre-compiled (classic) stoplist has a negative impact on classification performance.
• The TF1 stopword removal method obtains the best trade-off:
– Reducing the feature space by nearly 65%,
– Decreasing the data sparsity degree by up to 0.37%, and
– Maintaining a high classification performance.

 
S.1 chemistry scheme term 2 for ordinary level
S.1 chemistry scheme term 2 for ordinary levelS.1 chemistry scheme term 2 for ordinary level
S.1 chemistry scheme term 2 for ordinary level
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
 
general properties of oerganologametal.ppt
general properties of oerganologametal.pptgeneral properties of oerganologametal.ppt
general properties of oerganologametal.ppt
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
 
Nutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technologyNutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technology
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
 
BLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiologyBLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiology
 
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdfMudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
 
In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
 
GBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram StainingGBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram Staining
 

On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of Twitter

  • 1. On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of Twitter. Hassan Saif, Miriam Fernandez, Yulan He and Harith Alani. Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom. The 9th edition of the Language Resources and Evaluation Conference, Reykjavik, Iceland.
  • 2. Outline: Sentiment Analysis; Twitter; Stopwords Removal Methods; Comparative Study; Conclusion.
  • 3. Sentiment Analysis: “Sentiment analysis is the task of identifying positive and negative opinions, emotions and evaluations in text”. Examples: “The main dish was delicious” (opinion); “It is a Syrian dish” (fact); “The main dish was salty and horrible” (opinion).
  • 4.
  • 5.
  • 7. Stopwords Removal in Twitter Sentiment Analysis. Is removing stopwords useful? (works grouped on the slide under YES and NO): Kouloumpis et al., 2011; Pak & Paroubek, 2010; Asiaee et al., 2012; Bollen et al., 2011; Bifet and Frank, 2010; Speriosu et al., 2011; Zhang & Yuan, 2013; Gokulakrishnan et al., 2012; Saif et al., 2012; Hu et al., 2013; Camara et al., 2013.
  • 8. Classic Stopword Lists: Precompiled; Very popular; Outdated; Domain-independent.
  • 9. Automatic Stopwords Generation Methods. Unsupervised methods: Term Frequency; Term-based Random Sampling. Supervised methods: Term Entropy Measures; Maximum Likelihood Estimation.
  • 10. Stopwords Removal for Twitter Sentiment Analysis
  • 11. Stopword Analysis Set-Up (1): Datasets. Number of negative and positive tweets in each dataset (the slide's bar chart shows the same class split as percentages):
        Dataset   Negative  Positive
        OMD       688       393
        HCR       957       397
        STS       1402      632
        SemEval   1590      3781
        WAB       2580      2915
        GASP      5235      1050
  • 12. Stopword Analysis Set-Up (2) Stopwords Removal Methods 1. The Baseline Method – (no removal of stopwords) 2. The Classic Method – This method is based on removing stopwords obtained from pre-compiled lists – Van Stoplist
  • 13. Stopword Analysis Set-Up (3) Stopwords Removal Methods 3. Methods based on Zipf’s Law - TF-High Method: removing the most frequent words - TF1 Method: removing singleton words (i.e., words that occur once in tweets) - IDF Method: removing words with low inverse document frequency (IDF)
  • 14. Stopword Analysis Set-Up (4) Stopwords Removal Methods 4. Term-based Random Sampling (TBRS) 5. The Mutual Information Method (MI)
  • 15. Stopword Analysis Set-Up (5) Twitter Sentiment Classifiers – Two Supervised Classifiers: • Maximum Entropy (MaxEnt) • Naïve Bayes (NB) – Measure the performance in Accuracy and F1 measure – 10 fold cross validation
  • 16. Experimental Results Assess the impact of removing stopwords by observing fluctuations on: - Classification Performance - Feature space - Data Sparsity
  • 17. Experimental Results (1) 1. Classification Performance. [Figure: two bar charts, Accuracy (%) and F1 (%), comparing MaxEnt and NB on OMD, HCR, STS-Gold, SemEval, WAB and GASP.] The baseline classification performance in Accuracy and F-measure of MaxEnt and NB classifiers across all datasets.
  • 18. Experimental Results (2) 1. Classification Performance. [Figure: two bar charts, Accuracy (%) and F1 (%), comparing MaxEnt and NB under the Baseline, Classic, TF1, TF-High, IDF, TBRS and MI stoplists.] Average Accuracy and F-measure of MaxEnt and NB classifiers using different stoplists.
  • 19. Experimental Results (3) 2. Feature Space. Reduction rate (%) on the feature space of the various stoplists: Baseline 0.00; Classic 5.50; TF1 65.24; TF-High 0.82; IDF 11.22; TBRS 6.06; MI 19.34. [Figure: stacked bars per dataset (OMD, HCR, STS-Gold, SemEval, WAB, GASP) showing the share of words with TF=1 versus TF>1.] The number of singleton words relative to the number of non-singleton words in all datasets.
  • 20. Experimental Results (4) 3. Data Sparsity. [Figure: sparsity degree (axis from 0.988 to 1.000) of OMD, HCR, STS-Gold, SemEval, WAB and GASP under the Baseline, Classic, TF1, TF-High, IDF, TBRS and MI stoplists.] Stoplist impact on the sparsity degree of all datasets.
  • 21. The Ideal Stoplist (1) • The ideal stopword removal method is the one which: – Helps maintain a high classification performance, – Leads to shrinking the classifier’s feature space, – Reduces the data sparseness, – Has low runtime and storage complexity, – Requires minimal human supervision
  • 22. The Ideal Stoplist (2): Overall Analysis Results. [Table] Average accuracy, F1, reduction rate on feature space and data sparsity of the six stoplist methods. Positive sparsity values refer to an increase in the sparsity degree while negative values refer to a decrease in the sparsity degree.
  • 23. Conclusion • We studied how six different stopword removal methods affect the sentiment polarity classification on Twitter. • The use of pre-compiled (classic) Stoplist has a negative impact on the classification performance. • TF1 stopword removal method is the one that obtains the best trade-off: – Reducing the feature space by nearly 65%, – Decreasing the data sparsity degree up to 0.37%, and – Maintaining a high classification performance.
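
The frequency-based stoplists from slide 13, together with the two measures behind the conclusion (feature-space reduction and sparsity degree), can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the toy tweets, the whitespace tokenization, the function names, the cutoff `k`, and the definition of sparsity degree as the share of zero cells in the tweet-term matrix are all assumptions made for the example.

```python
import math
from collections import Counter

def tokenize(tweet):
    # Assumption: lowercase whitespace tokenization (the slides do not specify one).
    return tweet.lower().split()

def term_frequencies(tweets):
    # Corpus-level term frequency: total occurrences of each term.
    return Counter(tok for t in tweets for tok in tokenize(t))

def tf1_stoplist(tweets):
    # TF1: singleton words, i.e. terms that occur exactly once in the corpus.
    return {w for w, c in term_frequencies(tweets).items() if c == 1}

def tf_high_stoplist(tweets, k):
    # TF-High: the k most frequent terms (k is a hypothetical cutoff).
    return {w for w, _ in term_frequencies(tweets).most_common(k)}

def idf_stoplist(tweets, k):
    # IDF: the k terms with the lowest inverse document frequency.
    n = len(tweets)
    df = Counter(tok for t in tweets for tok in set(tokenize(t)))
    idf = {w: math.log(n / d) for w, d in df.items()}
    return set(sorted(idf, key=lambda w: (idf[w], w))[:k])

def reduction_rate(tweets, stoplist):
    # Percentage of the vocabulary (feature space) removed by the stoplist.
    vocab = set(term_frequencies(tweets))
    return 100.0 * len(vocab & stoplist) / len(vocab)

def sparsity_degree(tweets, stoplist=frozenset()):
    # Assumed definition: fraction of zero cells in the binary tweet-term matrix.
    docs = [set(tokenize(t)) - stoplist for t in tweets]
    vocab = set().union(*docs)
    if not vocab:
        return 0.0
    nonzero = sum(len(d) for d in docs)
    return 1.0 - nonzero / (len(docs) * len(vocab))

if __name__ == "__main__":
    toy = [
        "the main dish was delicious",
        "the main dish was salty and horrible",
        "it is a syrian dish",
    ]
    stop = tf1_stoplist(toy)
    print(sorted(stop))
    print(reduction_rate(toy, stop))
    print(sparsity_degree(toy), sparsity_degree(toy, stop))
```

On a corpus this small the numbers are only illustrative, but the sketch shows why removing singletons shrinks both the vocabulary and the share of empty cells at once: each singleton contributes a full column with a single non-zero entry.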

Editor's Notes

  1. Hi everyone, my name is Hassan Saif, a PhD student at KMI in the UK. Today I’m gonna present our work on evaluation datasets for Twitter sentiment analysis: surveying the pre-existing datasets and proposing a new dataset, the STS-Gold.
  2. I’m gonna start with some basic definitions about the sentiment analysis task on Twitter, then talk about the motivation behind our study. Next I will give a quick overview of the existing evaluation datasets and present our new dataset, the STS-Gold. Afterwards I’m gonna talk about the results obtained from a comparative study we conducted on these datasets. Our study has three main parts: in the first part I’m gonna give an overview of some of the most widely used evaluation datasets for Twitter sentiment analysis, pointing out their limitations. In the second part I’m gonna present our new gold standard dataset, which overcomes some of the limitations of the pre-existing datasets. The third part is about a comparative study we performed on all the datasets in terms of 4 different aspects.
  3. Early work on sentiment analysis focused mainly on extracting sentiment from conventional text such as movie reviews, blogs, news articles and open forums. Textual content in these types of media sources is linguistically rich, consists of well-structured and formal sentences, and discusses a specific topic or domain (e.g., movie reviews).
  4. However, with the emergence of social media networks and microblogging platforms, especially Twitter, research interests shifted to analyzing and extracting sentiment from these new sources. Nevertheless, one of the key challenges that Twitter sentiment analysis methods have to confront is the noisy nature of Twitter-generated data. Twitter allows only 140 characters in each post, which encourages the use of abbreviations, irregular expressions and infrequent words. This phenomenon increases the level of data sparsity, affecting the performance of Twitter sentiment classifiers.
  5. A well known method to reduce the noise of textual data is the removal of stopwords. This method is based on the idea that discarding non-discriminative words reduces the feature space of the classifiers and helps them to produce more accurate results
  6. This pre-processing method, widely used in the literature of document classification and retrieval, has been applied to Twitter in the context of sentiment analysis, obtaining contradictory results. While some works support stopword removal (red box), others claim that stopwords indeed carry sentiment information and that removing them harms the performance of Twitter sentiment classifiers.
  7. In addition, most of the works that have applied stopword removal for Twitter sentiment classification use pre-compiled stopword lists, such as the Van stoplist. However, these stoplists have been criticized for (i) being outdated (a phenomenon that may especially affect Twitter data, where new information and terms are continuously emerging) and (ii) not accounting for the specificities of the domain under analysis, since non-discriminative words in one domain or corpus may have discriminative power in a different domain.
  8. Aiming to solve these limitations, several approaches have emerged in the areas of document retrieval and classification that dynamically build stopword lists from the corpus under analysis. These approaches measure the discriminative power of terms by using different methods, including unsupervised methods such as those based on the terms’ frequencies, and supervised methods such as term entropy measures and Maximum Likelihood Estimation.
  9. In our work, we studied the effect of different stopword removal methods for polarity classification of tweets and whether removing stopwords affects the performance of Twitter sentiment classifiers.
  10. To this end, we use six Twitter datasets obtained from the literature of Twitter sentiment classification. As can be noted, these datasets have different sizes and different numbers of positive and negative tweets.
  11. The baseline method for this analysis is the non-removal of stopwords. We also assess the influence of six different stopword removal methods, starting with the Classic Method, which is based on removing stopwords obtained from pre-compiled lists. In our analysis we use the classic Van stoplist.
  12. In addition to the classic stoplist, we use three stopword generation methods inspired by Zipf’s law: removing the most frequent words (TF-High), removing words that occur once, i.e. singleton words (TF1), and removing words with low inverse document frequency (IDF).
  13. 4. Term-based Random Sampling (TBRS): this method works by iterating over separate, randomly selected chunks of data. It then ranks the terms in each chunk based on their informativeness values using the Kullback-Leibler divergence measure. 5. The Mutual Information Method (MI): this is a supervised method that works by computing the mutual information between a given term and a document class (e.g., positive, negative), providing an indication of how much information the term can tell about a given class. Low mutual information suggests that the term has low discrimination power and hence it can easily be removed.
  14. To assess the effect of stopwords in sentiment classification we use two of the most popular supervised classifiers used in the literature of sentiment analysis, Maximum Entropy (MaxEnt) and Naive Bayes (NB) from Mallet. We report the performance of both classifiers in accuracy and average F-measure using 10-fold cross validation. Also, note that we use unigram features to train both classifiers in our experiments.
  15. We assess the impact of removing stopwords by observing fluctuations (increases and decreases) on three different aspects of the sentiment classification task: the classification performance, measured in terms of accuracy and F-measure, the size of the classifier’s feature space, and the level of data sparsity. Our baseline for comparison is not removing stopwords.
  16. The first aspect that we study is how removing stopwords affects the classification performance. This figure shows the baseline classification performance in accuracy (a) and F-measure (b) for the MaxEnt and NB classifiers across all the datasets. As we can see, when no stopwords are removed, the MaxEnt classifier always outperforms the NB classifier in accuracy and F1 measure on all datasets.
  17. This figure shows the average performance in accuracy and F-measure obtained from the MaxEnt and NB classifiers by using the six stopword removal methods on all datasets. Here we notice a significant loss in accuracy and F-measure when using the IDF stoplist, while the highest performance is always obtained when using the MI stoplist. Also, using the classic stoplist gives lower performance than the baseline, with an average loss of 1.04% and 1.24% in accuracy and F-measure respectively. On the contrary, removing singleton words (the TF1 stoplist) improves the accuracy by 1.15% and F-measure by 2.65% compared to the classic stoplist. We also notice that the TF1 stoplist gives slightly lower accuracy and F-measure than the MI stoplist. Nonetheless, generating TF1 stoplists is much simpler than generating the MI ones in the sense that the former, as opposed to the latter, does not require any labelled data. Finally, it seems that NB is more sensitive to removing stopwords than MaxEnt: NB faces more dramatic changes in accuracy than MaxEnt across the different stoplists.
  18. The second aspect we study is the average reduction rate on the classifier’s feature space caused by each of the studied stopword removal methods. As we can see, removing singleton words reduces the feature space substantially, by 65.24%. MI comes next with a reduction rate of 19.34%. On the other hand, removing the most frequent words (TF-High) has no actual effect on the feature space. All other stoplists reduce the number of features by less than 12%. From the figure on the right, we can observe that singleton words constitute two-thirds of the vocabulary size of all datasets. In other words, the ratio of singleton words to non-singleton words is two to one for all datasets. This two-to-one ratio explains the large reduction rate in the feature space when removing singleton words.
  19. The third aspect we study is the reduction in data sparseness caused by our six stopword removal methods on all datasets. Previous work on sentiment analysis showed that Twitter data are sparser than other types of data (e.g., movie review data) due to the large number of infrequent words present within tweets. Therefore, an important effect of a stoplist for Twitter sentiment analysis is to help in reducing the sparsity degree of the data. Our analysis showed that our Twitter datasets are very sparse indeed, where the average sparsity degree of the baseline is 0.997. Compared to the baseline, using the TF1 method lowers the sparsity degree on all datasets by 0.37% on average. On the other hand, the effect of the TBRS stoplists is barely noticeable. All other stopword removal methods, including the classic, TF-High, IDF and MI, increase the sparsity to different degrees.
  20. After our evaluation, the question remains: what is the best or the ideal stopword method for sentiment analysis on Twitter? Broadly speaking, the ideal stopword removal method is the one which helps maintain a high classification performance, leads to shrinking the classifier’s feature space and effectively reduces the data sparseness. Moreover, since Twitter operates in streaming fashion (i.e., millions of tweets are generated, sent and discarded instantly), the ideal stoplist method is required to have low runtime and storage complexity and to cope with the continuous shift in the sentiment class distribution in tweets. Lastly and most importantly, the human supervision factor (e.g., threshold setup, data annotation, manual validation, etc.) in the method’s workflow should be minimal.
  21. This table shows the average performance of the evaluated stoplist methods in terms of the sentiment classification accuracy and F-measure, reduction on the feature space and the data sparseness, and the type of human supervision required. According to these results, the MI and the TF1 methods show very competitive performance compared to the other methods; the MI method comes first in accuracy and F1 measure, while the TF1 method outperforms all other methods in the amount of reduction on feature space and data sparseness. Looking at the human supervision factor, the TF1 method seems a simpler and more effective choice than the MI method. Firstly, because the notion behind TF1 is rather simple, “stopwords are those which occur once in tweets”, and hence the computational complexity of generating TF1 stoplists is generally low. Secondly, the TF1 method is fully unsupervised while the MI method needs two major human supervisions: (i) deciding on the size of the generated stoplists, which is usually done empirically, and (ii) manually annotating tweet messages with their sentiment class label in order to calculate the informativeness values of terms as described in Equation 2.
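
The mutual information scoring mentioned in notes 13 and 21 could look roughly like the sketch below. Equation 2 is not reproduced in these notes, so the code uses the standard textbook definition of mutual information between a term's presence and a class label; that choice is an assumption and may differ in detail from the authors' formula. The function names, the toy documents and the stoplist size `k` are hypothetical.

```python
import math
from collections import Counter

def mutual_information(docs, labels, term, cls="positive"):
    # docs: list of token sets; labels: one class label per document.
    # Counts the four (term present?, document in cls?) combinations.
    n = len(docs)
    counts = Counter()
    for tokens, label in zip(docs, labels):
        counts[(term in tokens, label == cls)] += 1
    mi = 0.0
    for (t, c), n_tc in list(counts.items()):
        # Marginal counts for this term value and this class value.
        n_t = counts.get((t, True), 0) + counts.get((t, False), 0)
        n_c = counts.get((True, c), 0) + counts.get((False, c), 0)
        # Only observed combinations are in `counts`, so the log is defined.
        mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi

def mi_stoplist(docs, labels, k, cls="positive"):
    # The k terms with the lowest mutual information become stopwords.
    # k must be chosen by hand, which is one of the two supervision costs
    # named in note 21; the labels themselves are the other.
    vocab = set().union(*docs)
    scores = {w: mutual_information(docs, labels, w, cls) for w in vocab}
    return set(sorted(scores, key=lambda w: (scores[w], w))[:k])

if __name__ == "__main__":
    toy_docs = [
        {"the", "dish", "was", "delicious"},
        {"the", "dish", "was", "horrible"},
        {"the", "food", "was", "delicious"},
        {"the", "food", "was", "horrible"},
    ]
    toy_labels = ["positive", "negative", "positive", "negative"]
    print(sorted(mi_stoplist(toy_docs, toy_labels, k=4)))
```

In this toy corpus, words that appear equally often in both classes score zero and end up in the stoplist, while the polarity words carry a full bit of information about the class. Unlike TF1, nothing here can be computed without the sentiment labels.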