This was a research project for an undergraduate academic seminar. It analyzed the impact of various text preprocessing techniques, feature weighting (FF, FP, TF-IDF), feature selection (filter, wrapper, and embedded methods), lemmatization, and tokenization (unigram, bigram, and 1-to-3-gram) on three open Twitter datasets.
Analyzing Text Preprocessing and Feature Selection Methods for Sentiment Analysis
1. TE Project Based Seminar
On
Analyzing Text Preprocessing and Feature
Selection Methods for Sentiment Analysis
Student’s Name: Nirav Raje
Guide’s Name: Dr. Debajyoti Mukhopadhyay
2. Definition: The task of automatically classifying a text written in a
natural language as expressing a positive or negative feeling, opinion,
or subjectivity.
The subjective analysis of a text is the main task of Sentiment
Analysis (SA).
Other tasks:
▪ Predicting the polarity of a given sentence
▪ Identifying the emotional status of a sentence
Sentiment Analysis - Introduction
3. Process of Sentiment Analysis
Data Gathering → Text Pre-processing → Feature Extraction → Feature Vector → Classifier → Evaluation
4. Personal interpretation varies between individuals
Noise and uninformative parts in the text
Words with no impact on the sentiment of the text
Sarcasm
Named Entity Recognition
Anaphora Resolution (Pronoun/noun phrase resolution)
Challenges in SA
5. Sentiment analysis is mainly a classification task.
Pre-processing : The process of cleaning and preparing the text for
classification.
Pre-processing operations can be broadly divided into 2 categories:
Transformations:
Online text cleaning, whitespace removal, abbreviation expansion,
stemming, stop-word removal, negation handling
Filtering:
Involves the most challenging part: feature selection.
Text Pre-processing
6. An extended comparison of sentiment polarity
classification methods for Twitter text has not been
done.
The effect on different datasets has not been analyzed.
Hence, we present the role of text pre-processing in
sentiment analysis, and a report on experimental results
demonstrating that feature selection and representation
can affect the classification performance positively.
Three different datasets have been used to examine classifier
accuracies.
Conclusion from Literature Review
7. To carry out an extended comparison of sentiment polarity
classification methods for Twitter text and to study the role of
text pre-processing in sentiment analysis.
To provide a report on experimental results demonstrating
that, with the use of appropriate feature selection and
representation procedures, the performance of SA classifiers
is positively affected.
Problem Statement
8. Reducing the noise in the text should help improve the
performance of the classifier and speed up the
classification process, thus aiding real-time sentiment
analysis.
Hypothesis of Pre-processing
9. Basic Operation and Cleaning
Removing unimportant or disturbing elements.
Normalization of some misspelled words.
Text should not contain URLs, hashtags (e.g., #happy) or
mentions (e.g., @BarackObama).
Tabs and line breaks should be replaced with a blank, and
quotation marks with apostrophes.
Remove vowels repeated in sequence three or more times.
Laughs, which are normally sequences of “a" and “h", are
replaced with a “laugh" tag.
Convert text to lowercase.
Data Transformations
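The cleaning steps above can be sketched with regular expressions. This is a minimal illustration, not the project's actual code; the helper name and the exact patterns (e.g., collapsing repeated vowels to a single one, the laugh pattern) are assumptions:

```python
import re

def clean_tweet(text):
    """Minimal sketch of the basic cleaning transformations (hypothetical helper)."""
    text = re.sub(r"https?://\S+", "", text)             # strip URLs
    text = re.sub(r"[#@]\w+", "", text)                  # strip hashtags and mentions
    text = text.replace("\t", " ").replace("\n", " ")    # tabs/line breaks -> blank
    text = re.sub(r"([aeiouAEIOU])\1{2,}", r"\1", text)  # collapse vowels repeated 3+ times
    text = re.sub(r"\b(?:ha|ah){2,}h?\b", "laugh", text, flags=re.I)  # laughs -> tag
    text = re.sub(r"\s+", " ", text)                     # normalize whitespace
    return text.lower().strip()

print(clean_tweet("Soooo cool hahaha http://t.co/x #happy @user"))
# → so cool laugh
```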
10. Emoticon Handling:
This module reduces the number of emoticons to only two
categories, smile positive and smile negative, as shown in the table below.
Smile Positive Smile Negative
0:-) >:(
:) ;(
:D >:)
:* D:<
:o :(
:P :|
;) >:/
Data Transformations
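The emoticon mapping from the table can be sketched as a simple lookup over the token stream; the tag strings and function name are illustrative, and the category sets follow the table as given:

```python
# Emoticon categories taken from the table above.
POSITIVE = {"0:-)", ":)", ":D", ":*", ":o", ":P", ";)"}
NEGATIVE = {">:(", ";(", ">:)", "D:<", ":(", ":|", ">:/"}

def tag_emoticons(tokens):
    """Replace each emoticon token with its category tag."""
    out = []
    for tok in tokens:
        if tok in POSITIVE:
            out.append("smile_positive")
        elif tok in NEGATIVE:
            out.append("smile_negative")
        else:
            out.append(tok)
    return out

print(tag_emoticons(["great", "day", ":)"]))
# → ['great', 'day', 'smile_positive']
```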
11. Negation Handling:
Dealing with negations (like “not good")
All negative constructs (can't, don't, isn't, never, etc.) are
replaced with “not".
Dictionary:
Detection and correction of misspelled words using a dictionary.
Substitute slang with its formal meaning (e.g., l8 → late), using a
list.
Replace insults with the tag “bad word".
Data Transformations
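The negation, slang, and insult substitutions above can be sketched as dictionary lookups over tokens. The word lists here are tiny illustrative samples, not the lists used in the project:

```python
# Illustrative word lists; the real lists would be far larger.
NEGATIONS = {"can't", "cannot", "don't", "isn't", "never", "won't"}
SLANG = {"l8": "late", "gr8": "great", "u": "you"}
INSULTS = {"idiot"}

def normalize(tokens):
    """Apply negation handling, slang substitution, and insult tagging."""
    out = []
    for tok in tokens:
        low = tok.lower()
        if low in NEGATIONS:
            out.append("not")          # all negative constructs -> "not"
        elif low in SLANG:
            out.append(SLANG[low])     # slang -> formal meaning
        elif low in INSULTS:
            out.append("bad_word")     # insults -> "bad word" tag
        else:
            out.append(low)
    return out

print(normalize(["Don't", "be", "l8"]))
# → ['not', 'be', 'late']
```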
12. Stemming:
Reduces words to their root form and groups them.
Puts word variations like “great", “greatly", “greatest", and
“greater" all into one bucket.
Effectively decreases entropy and increases the relevance of the
concept of “great”.
Stop-word Removal:
Stop words are, for example, pronouns, articles, etc.
These could be words like: a, and, is, on, of, or, the, was, with.
Leaving them in can lead to a less accurate classification.
Data Transformations
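Stemming and stop-word removal can be sketched as below. This is a naive suffix stripper for illustration only, not a real Porter stemmer, and the stop-word list is just the sample from the slide:

```python
# Sample stop-word list from the slide above.
STOP_WORDS = {"a", "and", "is", "on", "of", "or", "the", "was", "with"}
# Naive suffix list; a real stemmer (e.g., Porter) is far more careful.
SUFFIXES = ("est", "er", "ly", "ing", "ed", "s")

def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

def preprocess(tokens):
    """Drop stop words, then stem what remains."""
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess(["the", "greatest", "movie", "was", "greatly", "loved"]))
# → ['great', 'movie', 'great', 'lov']
```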
13. Feature Selection
Features: words, terms, or phrases that strongly express the opinion
as positive or negative.
Feature selection is the process of selecting those attributes in your
dataset that are most relevant to the predictive modeling problem
you are working on.
Drawbacks of the extra features:
They make document classification slower.
They reduce accuracy.
Benefits of feature selection:
Allows the classifier to fit a model to the problem set more quickly.
Allows it to classify items faster.
Filtering
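One filter-style selection criterion used later in the experiments is information gain (IG), keeping only attributes with IG > 0. A minimal sketch for a binary word-presence feature, assuming documents are represented as sets of words (this is illustrative, not the project's implementation):

```python
import math

def entropy(probs):
    """Shannon entropy in bits, ignoring zero probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(docs, labels, feature):
    """IG = H(C) - H(C | feature present/absent) for a word-presence feature."""
    n = len(docs)
    classes = set(labels)
    h_c = entropy([labels.count(c) / n for c in classes])
    h_cond = 0.0
    for present in (True, False):
        idx = [i for i, d in enumerate(docs) if (feature in d) == present]
        if not idx:
            continue
        sub = [labels[i] for i in idx]
        h_cond += (len(idx) / n) * entropy([sub.count(c) / len(sub) for c in classes])
    return h_c - h_cond

docs = [{"good"}, {"good"}, {"bad"}, {"bad"}]
labels = ["pos", "pos", "neg", "neg"]
print(information_gain(docs, labels, "good"))  # perfectly informative feature
# → 1.0
```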
15. Feature Weighting Methods:
1. Feature Frequency (FF):
The method uses the term frequency, i.e. the frequency that each
unigram occurs within a document, as the feature values for that
document.
2. Feature Presence (FP):
Very similar to feature frequency.
Difference: rather than using the frequency of a unigram, we simply
use a one (1) to indicate its presence.
Filtering
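The difference between FF and FP can be shown with two small vectorizers over a fixed vocabulary (illustrative helper names, not the project's code):

```python
from collections import Counter

def feature_frequency(tokens, vocab):
    """FF: raw count of each vocabulary word in the document."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def feature_presence(tokens, vocab):
    """FP: 1 if the word occurs at all, else 0."""
    present = set(tokens)
    return [1 if w in present else 0 for w in vocab]

vocab = ["good", "bad", "movie"]
doc = ["good", "good", "movie"]
print(feature_frequency(doc, vocab))  # → [2, 0, 1]
print(feature_presence(doc, vocab))   # → [1, 0, 1]
```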
16. 3. Term Frequency Inverse Document Frequency (TF-IDF):
A numerical statistic that is intended to reflect how important a
word is to a document in a collection or corpus.
Often used as a weighting factor in information retrieval, text
mining and user modeling.
The TF-IDF value increases proportionally with the number of
times a word appears in the document, and is offset by the number
of documents in the corpus that contain the word.
TF-IDF = FF × log(N/DF)
where,
N indicates the number of documents
DF is the number of documents that contains this feature
FF is the number of occurrences in the document.
Filtering
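The formula above translates directly into code. The slide does not specify the logarithm's base; base 10 is assumed here:

```python
import math

def tf_idf(ff, n_docs, df):
    """TF-IDF = FF * log(N / DF); log base 10 is an assumption."""
    return ff * math.log10(n_docs / df)

# A word occurring 3 times in a document and appearing in 10 of 100 documents:
print(tf_idf(ff=3, n_docs=100, df=10))
# → 3.0
```

Note that a word appearing in every document (DF = N) gets weight 0, which is exactly the filtering effect TF-IDF is meant to have on uninformative, ubiquitous terms.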
17. To evaluate the role of pre-processing techniques in
classification problems.
Hence, we examine the performance of several well-known
learning-based classification algorithms using various
pre-processing options on three datasets covering different
subjects.
Goal of Current Experiment
21. Our evaluation results indicated:
On selecting attributes with IG > 0, their resultant number
decreased appreciably.
Overall, algorithms trained faster due to attribute selection.
1-to-3-grams performed better than the other representations,
competing closely with unigrams.
In the case of the NB classifier, the percentage of correctly
classified instances increased by over 7 points.
The effect of pre-processing techniques on classifier accuracy was
the same regardless of the dataset.
Results of the Proposed Work
22. Feature selection improves the classification accuracy
in comparison with using all created attributes.
Significant accuracy rates are obtained when applying
the attribute selection based on information gain.
Unigram and 1-to-3-grams perform better than the other
representations of n-grams.
Thus, our experimental results illustrate that with
appropriate feature selection and representation,
sentiment analysis accuracies can be improved.
Conclusion
23. Investigate further the available pre-processing
options in order to find the optimal settings.
Focus on the choice of the best algorithm for attribute
selection strategies.
Evaluate ranking methods such as InfoGain, Chi-square, etc.
Involve embedded methods, which carry out feature
selection and model tuning at the same time.
Future Work
24. References
1. E. Haddi, X. Liu, Y. Shi, “The role of text pre-processing in sentiment
analysis”, Procedia Computer Science 17, pp. 26–32, 2013.
2. Giulio Angiani, Laura Ferrari, Tomaso Fontanini, Paolo Fornacciari, Eleonora
Iotti, Federico Magliani, and Stefano Manicardi, “A Comparison between
Preprocessing Techniques for Sentiment Analysis in Twitter”, Dipartimento di
Ingegneria dell'Informazione Universita degli Studi di Parma Parco Area delle
Scienze 181/A, 43124 Parma, Italy, 2016.
3. P. Gonçalves, M. Araújo, F. Benevenuto, M. Cha, “Comparing and Combining
Sentiment Analysis Methods”, Proceedings of the First ACM Conference on
Online Social Networks, COSN ’13, ACM, New York, NY, USA, pp. 27–38,
2013.
4. Akrivi Krouska, Christos Troussas, Maria Virvou Software Engineering
Laboratory, “The Effect Of Preprocessing Techniques On Twitter Sentiment
Analysis”, Department of Informatics University of Piraeus Greece, 2016.
25. References
5. Tim O’Keefe, Irena Koprinska, “Feature Selection and Weighting Methods in
Sentiment Analysis”, School of Information Technologies, University of
Sydney, NSW, Australia, 2006.
6. Yan Xu, Lin Chen, “Term-frequency Based Feature Selection Methods for
Text Categorization”, Beijing Language and Culture University, Beijing,
China; Institute of Computing Technology, Chinese Academy of Sciences,
2010.
7. Fernando Leandro dos Santos, Marcelo Ladeira, “The Role of Text
Pre-Processing in Opinion Mining on a Social Media Language Dataset”,
CIC-UnB, University of Brasilia, Brasilia, Brazil.