NCCCI 2014 Conference Paper on Best Practices for Sentiment Analysis

5th NATIONAL CONFERENCE ON COMPUTER COMMUNICATION AND INFORMATICS, NCCCI 2014
A STUDY ON FACTORS INFLUENCING AS A BEST PRACTICE FOR SENTIMENT ANALYSIS
Mrs. R. Nithya,
Assistant Professor & Ph.D Scholar,
School of Computer Studies (UG),
RVS College of Arts and Science,
Sulur, Coimbatore, India.
nithya.r@rvsgroup.com
Dr. D. Maheswari,
Assistant Professor,
School of Computer Studies (PG),
RVS College of Arts and Science,
Sulur, Coimbatore, India.
maheswari@rvsgroup.com
Abstract- Online web communities such as web portals,
microblogs, discussion forums and shopping sites, together with
comments such as tweets, have produced a huge volume of
opinion-rich data, which draws our attention to the area of
opinion mining. Opinion mining identifies the sentiment in such
data and follows this with classification and detailed
summarization. Still, the research community has not been able
to settle on the best techniques and approaches for performing
sentiment analysis. This paper aims to help researchers by
providing some useful tips for handling this kind of work.
Keywords- Opinion mining; Natural Language Processing;
Levels of analysis; Useful tips
I. INTRODUCTION
Businesses hope that data mining will allow them to boost sales
and profits by better understanding their customers and by
improving the performance of the products and services they
offer. For example, coaches in the National Basketball
Association (NBA) have used data mining to identify productive
combinations of players and to measure the effectiveness of
individual players. Social media likewise acts as democracy’s
pipeline, an amplifier of unfiltered emotion. It plays a vital
role in sharing opinions on diverse topics such as finance,
politics, travel, education, sports, entertainment, news,
history, environment and so forth. Opinion mining, or sentiment
analysis, is an important sub-discipline of data mining and
Natural Language Processing which deals with building systems
that explore the opinions users express in blog posts, comments,
reviews, discussions, news, feedback or tweets about a product,
policy, person or topic. To be specific, opinion mining can be
defined as a sub-discipline of computational linguistics that
focuses on extracting people’s opinions from the web. Given a
piece of text, it analyses which part expresses an opinion, who
wrote the opinion, and what is being commented on. Sentiment
analysis, on the other hand, is about determining subjectivity,
polarity (positive, negative or neutral) and polarity strength.
In either case, careful pre-processing is needed to remove noisy
data before focusing on text analysis.
II. LEVELS OF ANALYSIS
In general, sentiment analysis has been investigated mainly at
three levels:
A. Document level: The task at this level is to classify whether
a whole opinion document expresses a positive or negative
sentiment. For example, given a product review, the system
determines whether the review expresses an overall positive or
negative opinion about the product. This task is commonly
known as document-level sentiment classification. This level
of analysis assumes that each document expresses opinions on
a single entity (e.g., a single product).
B. Sentence level: The task at this level goes to the sentences
and determines whether each sentence expresses a positive,
negative, or neutral opinion. Neutral usually means no
opinion. This level of analysis is closely related to subjectivity
classification, which distinguishes sentences (called objective
sentences) that express factual information from sentences
(called subjective sentences) that express subjective views and
opinions. However, we should note that subjectivity is not
equivalent to sentiment as many objective sentences can imply
opinions, e.g., “We bought the car last month and the
windshield wiper has fallen off.”
C. Entity and Aspect level: Aspect level performs finer-
grained analysis. Instead of looking at language constructs
(documents, paragraphs, sentences, clauses or phrases), aspect
level directly looks at the opinion itself. It is based on the idea
that an opinion consists of a sentiment (positive or negative)
and a target (of opinion). Realizing the importance of opinion
targets also helps us understand the sentiment analysis
problem better. For example, the sentence “The iPhone’s call
quality is good, but its battery life is short” evaluates two
aspects, call quality and battery life, of iPhone (entity). The
sentiment on iPhone’s call quality is positive, but the
sentiment on its battery life is negative. The call quality and
battery life of iPhone are the opinion targets.
III. OPINION – A MASTERPIECE
Polarity is mostly indicated by a subjective element,
either a single word or a group of words. Opinions can be
collected in two different ways. One is the questionnaire,
where the questions and their answers are closely tied to the
product and its features, so it is easy to score the responses
and reach an outcome. The other is the unstructured review,
which usually consists of feedback in the form of text and
images gathered from various social monitoring tools and online
shopping sites. In the market, each product may be introduced
on the basis of its latest features, and these can either raise
or lower the demand for that product. Forrester estimates that
Indians spent around $1.6 billion online on retail e-commerce
sites in 2012. By 2016 this is expected to reach up to $8.8
billion. Online shopping sites are therefore engaging with
their consumers on the emotional front as well as fulfilling
their need for information, to show that they do more than
satisfy functional needs. Generally there are two types of
reviews on the web. One is found on company sites such as
Epinions.com, Zdnet.com, Dpreview.com, Bizarte.com and
Consumerreview.com. The reviews from these sites give the big
picture about the merchant’s shipping details, checkout
process, return policy etc. The other is the product review,
which includes information about quality, price and product
details that is essential for increasing customers’ confidence.
Both kinds of review build the customer trust that is nowadays
lacking in most e-commerce markets.
Thus these opinions, when analysed, help to increase sales,
identify customers’ likes and dislikes, and maintain brand
perception and online reputation. The reviews are collected
from questionnaires, blogs and online forums, extending to
Facebook, Twitter etc. Questionnaires are usually called
structured reviews because their questions are closely tied to
the product and its services, whereas unstructured reviews may
include feedback in the form of text and images from various
social monitoring tools and online shopping sites like
shopclues, fabfurnish, pepperfry etc. The rapid growth of
e-commerce thus produces large volumes of comments on products
from online customers. Before purchasing a product or service,
buyers therefore browse various websites to learn about its
features and only then make a decision. Some companies are
trying to influence GenY in particular, since they are the
future citizens who will contribute to the growth of the Indian
economy, by allowing users to post their own reviews, which are
then summarized by experts. Analysing the opinions given by
customers is not an easy task, because customers may not state
their opinion of the product directly, may compare products
with one another, and may make spelling mistakes, use
punctuation improperly, or use code words, unfamiliar
abbreviations, slang and non-dictionary words.
IV. USEFUL TIPS FOR SENTIMENT ANALYSIS
A. Lexicon based and Learning based techniques
Lexicon based techniques use a dictionary to perform
entity-level sentiment analysis. The dictionary contains words
annotated with their semantic orientation, usually polarity and
its strength, and these annotations are used to calculate a
polarity score for the document. This method usually gives high
precision but low recall. Learning based techniques require
building a model by training a classifier with labeled
examples. This means that one must first gather a dataset with
examples of the positive, negative and neutral classes, extract
the features/words from the examples and then train the
algorithm on them. The choice between the two methods depends
greatly on the application, domain and language. Lexicon based
techniques with large dictionaries can achieve very good
results, but they require a lexicon, which is not available in
every language. Learning based techniques also deliver good
results, but they require obtaining datasets and training a
model.
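As a rough illustration of the lexicon based approach, the
following Python sketch sums the polarity strengths of the
words found in a text. The lexicon entries, their numeric
strengths and the function names are assumptions made for this
example only; a real system would load a published sentiment
lexicon and handle negation and intensifiers.

```python
# Toy lexicon: word -> signed polarity strength (illustrative values only).
LEXICON = {"good": 1.0, "great": 2.0, "short": -1.0, "bad": -1.0, "terrible": -2.0}

def lexicon_score(text, lexicon=LEXICON):
    """Sum the polarity strengths of all lexicon words found in the text."""
    tokens = text.lower().split()
    # Trim surrounding punctuation so that "good," still matches "good".
    return sum(lexicon.get(tok.strip(".,!?;:\"'"), 0.0) for tok in tokens)

def lexicon_label(text, lexicon=LEXICON):
    """Map the summed score to positive, negative or neutral."""
    score = lexicon_score(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Words missing from the lexicon contribute nothing to the score,
which is one way to see why such methods tend to have low
recall.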
B. Statistical and Syntactic techniques
Syntactic techniques can deliver better accuracy because they
make use of the syntactic rules of the language in order to
detect the verbs, adjectives and nouns. Unfortunately such
techniques depend heavily on the language of the document, and
as a result the classifiers cannot be ported to other
languages. Statistical techniques, on the other hand, have a
probabilistic background and focus on the relations between
words and categories. They have two significant benefits over
syntactic ones: they can be used in other languages with minor
or no adaptation, and they can be applied to a machine
translation of the original dataset and still give quite good
results. This is not possible with syntactic techniques.
C. Importance of Neutral Class
While performing sentiment analysis, most researchers tend to
ignore the neutral class and focus only on the positive and
negative classes. Nevertheless it is important to understand
that not all sentences carry a sentiment. Training the
classifier to detect only the positive and negative classes
forces many neutral words to be classified as either positive
or negative, which leads to overfitting.
D. Tokenization algorithm
Before starting the analysis it is necessary to decide how the
document is to be represented for processing. Tokenization, POS
tagging, stemming, chunking and parsing are the steps that help
to represent the data in the document. Stemming refers to the
reduction of words to their roots, e.g., plays, playing,
played -> play. Porter’s stemming algorithm can be used for
this purpose; stop words are removed in a separate step. Brill
Tagger, Tree Tagger and CST Tagger are tools used for
annotating text with part-of-speech (POS) tags. POS tagging,
also called grammatical tagging, is the process of marking up a
word in a corpus as corresponding to a particular part of
speech, based on both its definition and its adjacent and
related words in a phrase, sentence or paragraph. A parser
processes input sentences according to the productions of a
grammar, and builds one or more constituent structures that
conform to the grammar. It is used to identify the grammatical
structures in a sentence. All of this depends on the topic,
application and language under analysis, so several preliminary
tests need to be carried out to find the best algorithmic
configuration. Semantic analysis is the process of relating
syntactic structures, from the levels of phrases, clauses,
sentences and paragraphs, to their meanings. Semantic
orientation has applications in tracking opinions in online
discussions, analysis of news responses etc. Word frequency
deals with the words that occur frequently in the comments.
Collocation denotes words that commonly appear near each other.
Collocations can be found by running an N-gram test with text
analysis tools; the test lists the common two-, three-,
etc.-word phrases that occur together. If an n-gram framework
is used, it is necessary to decide on the number of keyword
combinations to use. Just remember that n should not be too
big. In sentiment analysis in particular it is enough to use
uni-grams or bi-grams, since increasing the number of keyword
combinations can hurt the results. Moreover, keep in mind that
in sentiment analysis the number of occurrences of a word in
the text does not make much of a difference.
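The two closing recommendations (stay with uni-grams and
bi-grams, and record presence rather than counts) can be
sketched in a few lines of Python. The function names and the
simple whitespace tokenizer are assumptions for illustration; a
toolkit tokenizer could be substituted.

```python
def tokenize(text):
    """Lowercase the text and split on whitespace, trimming punctuation."""
    tokens = (tok.strip(".,!?;:\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok]

def ngrams(tokens, n):
    """Return the n-grams (as tuples) in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def presence_features(text, max_n=2):
    """Binary features: the set of distinct 1-grams up to max_n-grams."""
    tokens = tokenize(text)
    feats = set()
    for n in range(1, max_n + 1):
        feats.update(ngrams(tokens, n))  # a set keeps each n-gram only once
    return feats
```

Because the features are collected in a set, a word repeated
many times in one comment is represented exactly once.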
E. Feature Selection algorithm
Feature selection is significant for sentiment analysis because
opinionated text may have high dimensionality, which can
severely affect the performance of a sentiment classifier. In
learning based techniques in particular, the words/features to
be used in the model must be selected before training the
classifier. It is not practical to use all the words that the
tokenization algorithm returns, simply because many of them are
irrelevant. Feature selection methods reduce the original
feature set by removing features that are irrelevant for text
sentiment classification, which improves classification
accuracy and decreases the running time of learning algorithms.
Five feature selection methods are commonly used in data mining
research to improve system performance: DF, IG, CHI, GR and
Relief-F. The two most common are Mutual Information
(Information Gain) and the Chi-square test. All of these
methods compute a score for each individual feature and then
select the top ranked features according to that score.
a. Document Frequency (DF)
Document Frequency measures the number of documents in a
dataset in which the feature appears. This method removes those
features whose document frequency falls below a lower threshold
or above an upper threshold, both predefined. Selecting
frequent features improves the likelihood that the features
will also appear in future test cases. The basic assumption is
that both rare and very common features are either
non-informative for sentiment category prediction or have
little impact on classification accuracy. The research
literature shows that this method is simple, scalable and
effective for text classification.
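A minimal Python sketch of document frequency filtering is
given below. The threshold values and function names are
assumptions chosen for illustration, not values recommended by
the literature; each document is assumed to be a list of
tokens.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each feature, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # set(): count each feature once per document
    return df

def select_by_df(docs, min_df=2, max_df_ratio=0.9):
    """Keep features whose document frequency lies within both thresholds."""
    df = document_frequency(docs)
    max_df = max_df_ratio * len(docs)
    return {feat for feat, count in df.items() if min_df <= count <= max_df}
```

With these settings a word that occurs in every document, and a
word that occurs in only one, are both dropped.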
b. Information Gain (IG)
Information gain is used as a feature (term) goodness
criterion in machine learning based classification. It measures
the information obtained (in bits) for class prediction of an
arbitrary text document by evaluating the presence or absence
of a feature in that document. Information Gain is calculated
from the feature’s contribution to decreasing overall entropy.
The expected information needed to classify an instance (tuple)
in partition D, i.e. to identify the class label of an instance
in D, is known as entropy and is given by:
\[ \mathrm{Info}(D) = -\sum_{i=1}^{m} P_i \log_2 P_i \]
Where m represents the number of classes (m=2 for binary
classification) and Pi denotes the probability that a random
instance in partition D belongs to class Ci, estimated as
|Ci,D| / |D| (i.e. the proportion of instances of each class or
category). The logarithm to the base 2 reflects the fact that
information is encoded in bits. If we partition (classify) the
instances in D on some feature attribute A with values
{a1,…, av}, D is split into v partitions {D1, D2,…, Dv}.
The amount of information, in bits, still required for an exact
classification is measured by:
\[ \mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \, \mathrm{Info}(D_j) \]
Where |Dj|/|D| is the weight of the jth partition and Info(Dj)
is the entropy of partition Dj. Finally, the information gain
obtained by partitioning on A is
\[ \mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D) \]
We rank the features by information gain and select those with
the highest scores. Classifying the instances using these
top-ranked features reduces the information still needed, i.e.
decreases the overall entropy.
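The computation described above can be sketched in Python as
follows, taking information gain to be the entropy of the class
labels minus the weighted entropy of the partitions induced by
a feature. The function names are hypothetical, and each
feature is assumed to be given as a list of values (for example
1/0 for present/absent), one per document.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Info(D): entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Gain(A) = Info(D) - Info_A(D) for one feature over all documents."""
    total = len(labels)
    partitions = {}
    for value, label in zip(feature_values, labels):
        partitions.setdefault(value, []).append(label)
    # Info_A(D): entropy of each partition, weighted by its relative size.
    info_a = sum(len(p) / total * entropy(p) for p in partitions.values())
    return entropy(labels) - info_a
```

A feature that separates the classes perfectly recovers the
full entropy of the labels, while a feature that is independent
of the class has a gain of zero.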
c. Gain Ratio (GR)
Gain Ratio enhances Information Gain by offering a normalized
score of a feature’s contribution to an optimal information
gain based classification decision. Gain Ratio is applied as an
iterative process in which smaller sets of features are
selected incrementally. The iterations terminate when only a
predefined number of features remains. Gain ratio is used as a
disparity measure, and a high gain ratio for a selected feature
implies that the feature will be useful for classification.
Gain Ratio was first used in the decision tree algorithm C4.5,
and normalizes the information gain score by a split
information value [30]. The split information value corresponds
to the potential information obtained by partitioning the
training data set D into v partitions, corresponding to the v
outcomes on attribute A:
\[ \mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|} \]
Where high SplitInfo means the partitions have roughly equal
size (uniform) and low SplitInfo means a few partitions contain
most of the tuples (peaks). Finally the gain ratio is defined as:
\[ \mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)} \]
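As a hedged illustration, the Python sketch below computes the
gain ratio as information gain divided by split information,
where split information is the entropy of the partition sizes.
The function names are hypothetical, and returning 0.0 for a
feature that never splits the data is a choice made for this
example.

```python
from collections import Counter
from math import log2

def entropy(items):
    """Entropy in bits of the empirical distribution of the items."""
    total = len(items)
    return -sum((c / total) * log2(c / total) for c in Counter(items).values())

def gain_ratio(feature_values, labels):
    """Information gain normalized by split information (0.0 if no split)."""
    total = len(labels)
    partitions = {}
    for value, label in zip(feature_values, labels):
        partitions.setdefault(value, []).append(label)
    info_a = sum(len(p) / total * entropy(p) for p in partitions.values())
    gain = entropy(labels) - info_a
    split_info = entropy(feature_values)  # entropy of the partition sizes
    return gain / split_info if split_info > 0 else 0.0
```

The normalization penalizes features that split the data into
many small partitions: a feature with a distinct value for each
of four documents has the same information gain as a perfect
binary feature, but only half its gain ratio.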
d. CHI statistic (CHI)
The Chi Squared statistic (CHI) measures the association
between a word feature and its associated class or category.
As a common statistical test, CHI represents the divergence
from the distribution expected (i.e. the resultant partition)
under the assumption that the feature occurrence is perfectly
independent of the class value [20, 29]. For a term t and a
class Ci it is defined as,
\[ \chi^2(t, C_i) = \frac{N \, (AD - BE)^2}{(A+E)(B+D)(A+B)(E+D)} \]
Where A is the number of documents in which t and Ci co-occur;
B is the number in which t occurs without Ci; E is the number
in which Ci occurs without t; D is the number in which neither
Ci nor t occurs; and N is the total number of documents in the
corpus. The CHI statistic is zero if t and Ci are independent.
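A small Python sketch of this statistic follows, assuming the
usual 2x2 contingency form built from the four counts A, B, E
and D defined above. The function name is hypothetical, and
returning 0.0 for a degenerate table is a choice made for this
example.

```python
def chi_square(a, b, e, d):
    """CHI statistic from the four document counts.

    a: t and Ci co-occur      b: t occurs without Ci
    e: Ci occurs without t    d: neither t nor Ci occurs
    """
    n = a + b + e + d
    denominator = (a + e) * (b + d) * (a + b) * (e + d)
    if denominator == 0:
        return 0.0  # a degenerate table carries no evidence of association
    return n * (a * d - b * e) ** 2 / denominator
```

When the term is spread evenly over both classes (all four
counts equal) the numerator vanishes, matching the statement
that the statistic is zero under independence.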
e. Relief-F Algorithm
The basic principle of Relief-F is to select instances at
random, compute their nearest neighbors, and adjust a feature
weighting vector to give more importance (weight) to features
that discriminate the instance from neighbors of different
classes. Specifically, Relief-F attempts to estimate the weight
Wf, used for weighting and ranking feature f, from the
following probabilities:
\[ W_f = P(\text{different value of } f \mid \text{nearest instance from a different class}) - P(\text{different value of } f \mid \text{nearest instance from the same class}) \]
Each algorithm evaluates the keywords in a different way and
thus leads to different selections. Each algorithm also
requires different configuration, such as the level of
statistical significance, the number of selected features etc.
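To make the Relief-F idea more concrete, the Python sketch
below implements a simplified variant, basic Relief with one
nearest same-class neighbor (hit) and one nearest
different-class neighbor (miss), rather than the full Relief-F
algorithm with k neighbors. It assumes binary feature vectors
and Hamming distance; the function name and parameters are
hypothetical.

```python
import random

def relief_weights(X, y, n_samples=20, seed=0):
    """Simplified Relief for binary features: one nearest hit, one nearest miss."""
    rng = random.Random(seed)
    n_features = len(X[0])
    weights = [0.0] * n_features

    def distance(u, v):
        return sum(a != b for a, b in zip(u, v))  # Hamming distance

    for _ in range(n_samples):
        i = rng.randrange(len(X))
        hits = [j for j in range(len(X)) if j != i and y[j] == y[i]]
        misses = [j for j in range(len(X)) if y[j] != y[i]]
        if not hits or not misses:
            continue
        hit = min(hits, key=lambda j: distance(X[i], X[j]))
        miss = min(misses, key=lambda j: distance(X[i], X[j]))
        for f in range(n_features):
            # Reward differing from the miss, penalize differing from the hit.
            weights[f] += (X[i][f] != X[miss][f]) - (X[i][f] != X[hit][f])
    return [w / n_samples for w in weights]
```

A feature that tends to differ between an instance and its
nearest miss, but agrees with its nearest hit, accumulates a
high weight and would be ranked first.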
F. Classification method
Many classification methods are available, such as Max Entropy,
Naïve Bayes and Support Vector Machine (SVM), of which the best
known are Naïve Bayes and SVM. Naïve Bayes takes much less
training time and needs much less training data than SVM. It is
also worth trying different classification methods, as they
deliver different results, and each classifier might work
better with a specific feature selection configuration. It is
generally expected that state of the art classification
techniques such as SVM will outperform simpler techniques such
as Naïve Bayes. Sometimes, however, Naïve Bayes is able to
provide the same or even better results than more advanced
methods. It is therefore advised not to eliminate a
classification model only because of its reputation.
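To show how little machinery Naïve Bayes needs, here is a
minimal Python sketch of a binarized multinomial Naïve Bayes
classifier, in which each word is counted at most once per
document. The class name, the add-one smoothing and the choice
to ignore words unseen in training are assumptions made for
this example.

```python
from collections import Counter, defaultdict
from math import log

class BinarizedNaiveBayes:
    """Multinomial Naive Bayes over word presence, with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            present = set(tokens)  # binarize: each word counts once per document
            self.word_counts[label].update(present)
            self.vocab.update(present)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = log(n_docs / total_docs)  # log prior of the class
            for word in set(tokens) & self.vocab:  # ignore unseen words
                score += log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Training is a single counting pass over the data, which is
consistent with the short training time noted above.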
G. Selection of Domain
No single algorithm performs well in all
topics/domains/applications. One should be prepared for the
accuracy of a selected classifier to be as high as 90% in one
domain/topic and as low as 60% in another. Max Entropy with
Chi-square works best for restaurant reviews. Binarized Naïve
Bayes with Mutual Information works better than SVM for
Twitter. For Twitter in particular, avoid lexicon based
techniques, because users are known to use idioms, jargon and
Twitter slang, which heavily affect the polarity of the tweet.
H. Towards Optimization
The best source of information for sentiment analysis is the
academic literature. Still, a suggested technique may not work
well in every case. While papers usually point in the right
direction, some techniques work only in specific domains, and
each paper may approach the problem from a different
perspective. It is advised not to adopt a technique just
because of its optimized results or just because it appears in
a research paper, particularly if it makes the algorithm
unnecessarily complicated and its results difficult to explain.
I. Dataset
Many datasets are available online, some even with POS tags,
such as the movie review dataset, restaurant dataset etc. For
example, consider a movie review corpus with 1000 positive
files and 1000 negative files. Three-fourths of them can be
used as the training set, and the rest can be used as the test
set. Some examples are too ambiguous, contain mixed sentiments
or make comparisons, and are thus not ideal for training.
It is advisable to use human annotated datasets as much as
possible rather than automatically extracted examples. Scraping
structured reviews from various websites is also a problematic
approach, so be extra careful in selecting them. Finally,
remember that a document is assumed to be equally likely to be
positive, negative or neutral. The dataset should therefore
contain an equal number of examples in each category.
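The split and the balancing advice can be combined in a short
Python sketch. It assumes the examples are (text, label) pairs;
the function name, the fixed random seed and the decision to
downsample larger classes to the size of the smallest are
choices made for this illustration.

```python
import random
from collections import defaultdict

def balanced_split(examples, train_fraction=0.75, seed=0):
    """Split (text, label) pairs into train/test sets with equal class sizes."""
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    per_class = min(len(items) for items in by_label.values())
    n_train = int(per_class * train_fraction)
    rng = random.Random(seed)
    train, test = [], []
    for items in by_label.values():
        rng.shuffle(items)
        kept = items[:per_class]  # downsample larger classes to the smallest one
        train.extend(kept[:n_train])
        test.extend(kept[n_train:])
    return train, test
```

For a corpus of 1000 positive and 1000 negative files this
yields 1500 training and 500 test examples, with both classes
equally represented in each set.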
J. Visualization of result
One of the most powerful techniques for building highly
accurate classifiers is ensemble learning, which combines the
results of different classifiers. Ensemble learning has great
applications in fields such as computer vision, where the same
object can be presented in 3D, 2D, infrared etc. Using several
different weak classifiers that focus on different areas can
thus help to build strong, high-accuracy classifiers.
Unfortunately this is not as effective in text analysis. The
options for looking at the problem from a different angle are
limited, and the results of the classifiers are usually highly
correlated. This makes ensemble learning less practical and
less useful.
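For reference, the simplest way of combining classifiers is a
majority vote, sketched below in Python. The function names are
hypothetical, and each classifier is assumed to be a callable
that maps a text to a label.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted most often."""
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(classifiers, text):
    """Combine classifiers (callables text -> label) by majority vote."""
    return majority_vote([clf(text) for clf in classifiers])
```

If the classifiers are highly correlated they tend to cast the
same vote, so the combined prediction rarely differs from that
of any single member.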
V. CONCLUSION
Sentiment detection has a wide variety of
applications in information systems, including classifying
reviews, government policy making, election judgment and
other real time applications. It is also found that different
types of features and classification algorithms need to be
combined in order to overcome their individual drawbacks. In
future, a proposal will be made to incorporate these useful
tips into sentiment analysis using Python, an interactive
programming language with numerous libraries, including NLTK.
ACKNOWLEDGMENT
I would like to extend my thanks to all the internal and
external reviewers of conferences for their valuable feedback
on assessing my earlier research papers on sentiment analysis.
REFERENCES
[1] Ayesha Rashid, Naveed Anwer, Dr. Muddaser Iqbal, Dr. Muhammed
Sher, “A Survey Paper: Areas, Techniques and Challenges of Opinion
Mining”, IJCSI, Vol. 10, Issue 6, No. 2, November 2013. ISSN: 1694-0784.
[2] Nitish Gupta, Shashwat Chandra, “Product Feature Discovery and
Ranking for Sentiment Analysis from Online Reviews”, University of
Illinois, November 2013.
[3] Anuj Sharma, Shubhamoy Dey, “Performance Investigation of Feature
Selection Methods and Sentiment Lexicons for Sentiment Analysis”,
Special Issue of International Journal of Computer Applications
(0975-8887), ACCTHPCA, June 2012.
[4] Kunpeng Zhang, Ramanathan Narayanan, “Voice of the Customers:
Mining Online Customer Reviews for Product”, 2010.
[5] G. Diana Maynard, Kalina Bontcheva, Dominic Rout, “Challenges in
developing opinion mining tools for social media”, funded by
Engineering and Physical Sciences Research Council.
[6] Hu and Liu, “Opinion extraction and summarization on the web”,
AAAI, 2006, pp. 1621-1624.
[7] www.scoop.it
[8] www.streamhackers.com
