Major presentation

 Sentiment analysis aims to determine the attitude of a
speaker or a writer with respect to some
topic or the overall contextual polarity of a
document.
 The attitude may be his or her judgment or
evaluation, affective state (that is, the
emotional state of the author when writing),
or the intended emotional communication
(that is, the emotional effect the author
wishes to have on the reader).

 Determining document subjectivity:
Often called subjectivity classification, this subtask
determines whether a given text is objective (expressing a
fact) or subjective (expressing an opinion or emotion).
 Determining document orientation:
Often called sentiment classification or document-level
sentiment classification, this subtask determines the
polarity of a given subjective text. In other words,
determines whether this text expresses a positive or a
negative sentiment on its subject matter.
 Determining the strength of document orientation:
This subtask decides whether the positive sentiment
expressed by a text on its subject matter is weakly positive,
mildly positive or strongly positive.

 Consumer information
› Product reviews
 Marketing
› Consumer attitudes
› Trends
 Politics
› Politicians want to know voters’ views
› Voters want to know politicians’ stances and who else
supports them
 Social
› Find like-minded individuals or communities

 Machine learning
› Naïve Bayes
› Maximum Entropy Classifier
› SVM
 Unsupervised methods
› K-means
› Otsu’s Threshold
› Fuzzy c-means

 Data to annotate given.
 But no training data or additional
resources provided.
 Aim:
 To create a lexical resource in an
automated way without any human
intervention for annotating data.
 Affective lexicon to be used for polarity
classification.

 To obtain training material, use
emoticons as indicators of a mood within
a message.
 Split the tweets into 2 sets:
 positive -  ;) :} :] ...
 negative -  :{ :’( ...
 We get a positive word list and a
negative word list.
 If a word is present more frequently in
the positive set, then it is positive, and vice
versa.
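
The emoticon-based approach above can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the emoticon sets are extended with `:)` and `:(` as an assumption, tokenisation is plain whitespace splitting, and `build_lexicon` is a hypothetical helper name.

```python
from collections import Counter

# Illustrative emoticon sets (assumption: the slide lists only a few examples).
POSITIVE_EMOTICONS = {":)", ";)", ":}", ":]"}
NEGATIVE_EMOTICONS = {":(", ":{", ":'("}
ALL_EMOTICONS = POSITIVE_EMOTICONS | NEGATIVE_EMOTICONS


def build_lexicon(tweets):
    """Return (positive_words, negative_words) learned from emoticon-labelled tweets."""
    pos_counts, neg_counts = Counter(), Counter()
    for tweet in tweets:
        tokens = tweet.lower().split()
        has_pos = any(t in POSITIVE_EMOTICONS for t in tokens)
        has_neg = any(t in NEGATIVE_EMOTICONS for t in tokens)
        if has_pos == has_neg:
            continue  # no emoticon at all, or mixed signals: skip the tweet
        words = [t for t in tokens if t not in ALL_EMOTICONS]
        (pos_counts if has_pos else neg_counts).update(words)
    # A word is positive if it is more frequent in the positive set, and vice versa.
    positive = {w for w in pos_counts if pos_counts[w] > neg_counts[w]}
    negative = {w for w in neg_counts if neg_counts[w] > pos_counts[w]}
    return positive, negative
```

Words that occur equally often in both sets end up in neither list, which is one possible way to handle ties.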

 Aim:
 To analyse the effectiveness of various
popular classifiers and identify the most
suitable classifier for Twitter that could
ease the process of classifying
sentiments in tweets.

 Strategy:
To use two or more classifiers chained
one after the other. This resulted in a
high yield and better accuracy of the
mined data.

 First stage: the incoming preprocessed
data is classified into three categories –
polar, neutral and irrelevant.
 Second stage: the data classified under
polar is fed to a second classifier for
further segregation into positive and
negative.
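
A minimal sketch of such a two-stage cascade is shown below. The keyword-based stage functions and word lists are hypothetical stand-ins for trained classifiers; only the chaining logic reflects the description above.

```python
# Hypothetical word lists standing in for trained models (assumptions).
TOPIC_WORDS = {"phone", "battery", "screen"}
POSITIVE_WORDS = {"love", "great", "good"}
NEGATIVE_WORDS = {"hate", "awful", "bad"}


def toy_stage_one(text):
    """First stage: label text as 'polar', 'neutral' or 'irrelevant'."""
    words = set(text.lower().split())
    if not words & TOPIC_WORDS:
        return "irrelevant"
    if words & (POSITIVE_WORDS | NEGATIVE_WORDS):
        return "polar"
    return "neutral"


def toy_stage_two(text):
    """Second stage: split polar text into 'positive' or 'negative'."""
    words = text.lower().split()
    score = sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)
    return "positive" if score >= 0 else "negative"


def classify_cascade(text, stage_one=toy_stage_one, stage_two=toy_stage_two):
    """Run stage one; only text labelled 'polar' is passed on to stage two."""
    label = stage_one(text)
    if label != "polar":
        return label
    return stage_two(text)
```

Any pair of classifiers exposing the same call shape could be plugged in as `stage_one` and `stage_two`.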

The classification algorithms used in the
research are:
 Naive Bayes
 Random Forest
 Support Vector Machines (SVM)
 SMO

 The research has been performed on
Tunisian users’ statuses on Facebook during
the “Arab Spring” era.
 The aim is to extract useful information
about users’ sentiments and behaviours
during this sensitive and significant period.
 For this purpose, a method based on
Support Vector Machine (SVM) and Naive
Bayes has been proposed.

 The methodology used is collection of raw
data, followed by lexicon development.
 Three types of lexicons were created:
lexicon for social acronyms, lexicon for
emoticons and lexicon for interjections.
 Then data preprocessing is done – stop
words removal and stemming, followed by
feature extraction.
 Finally, the machine learning algorithms are
applied.
 The performance of different feature sets
using Naive Bayes (NB) and SVM classifiers
was then compared.

 This paper is concerned with the
problem of mining social emotions from
text. The aim of this research is to
discover the connection between
different social emotions and affective
terms and, based on it, automatically
predict the social emotion of a text.
 An official Chinese news portal has been
used for the dataset collection.

 The proposed solution is to construct a joint
emotion-topic model. Latent Dirichlet
Allocation (LDA) has been used with an
additional layer for emotion modelling.
 A three step process has been used for
generation of affective terms:
 The first step is to generate an emotion from a
document specific emotional distribution.
 The second step is to generate a latent topic
from a Multinomial distribution.
 The final step is to develop an approximate
inference method based on Gibbs sampling.

 As a complete generative model, the
proposed model makes it possible to infer a
number of conditional probabilities for
unseen documents, for example the
probabilities of latent topics given an
emotion and of terms given a topic.
This method was found to be better than
the emotion-term model and multiclass SVM,
as the emotion assignments at the term
level could be visualised.

 This paper proposes an aspect-based
sentiment classification approach to
analyze sentiments for tweets.
 In previous studies, the overall sentiment of
a tweet was determined. However, this is of
limited use to companies that need to
monitor consumer opinion of their
products/services. For them it would be
more useful to know which aspects of the
product/service the users are
happy or unhappy about.

 The aspect-based sentiment classifier makes use of a POS
tagger, a sentiment lexicon and a few gazetteer lists to produce
results of the form [aspect, sentiment words, polarity]. This
process consists of three main steps:
 1. Aspect-sentiment extraction: Given a tweet, this step
determines a list of possible aspect candidates along with their
associated sentiments and polarity.
 2. Aspect ranking and selection: A tweet can express many
different opinions. Only important aspects should be selected.
For example, when classifying tweets on a telecommunication
company, some of the aspects of interest include customer
service, 3G connectivity, speed, etc. In this step the aspect
candidates are ranked and the set of most significant
aspects is selected as the expected aspects.
 3. Aspect classification: Using the set of expected aspects and
results from the aspect-sentiment extraction step, we obtain the
final list of aspects along with their polarity for each tweet.

 The experimental results suggested that
a layered classification approach which
uses the aspect-based classifier as the
first layer classification and the tweet-
level classifier as the second layer
classification is more effective than a
classifier trained using target-dependent
features. This approach is able to
consistently improve the performance of
existing sentiment classifiers.

 The aim of this research was to
automatically extract the set of
messages which contain opinions, filter
out non-opinion messages and
determine their sentiment directions, that
is positive or negative.
 Manually labelled data has been used
as training data to build the model.

 The initial step is to preprocess the crawled
tweets by removing usernames, hashtags,
retweet tags and non-English words.
 Three resources were constructed for the
further preprocessing which included a stop
word dictionary, an emotion dictionary and
an acronym dictionary.
 After preprocessing all words are
transformed into the form (word, POS tag,
English-word, Stop-word).
 Thereafter, tweets containing opinions are
extracted, filtering out the non-opinion
tweets. A Naive Bayes classifier is then used
to classify the tweets based on sentiment.

 Since a word may have different meanings
in different domains, short text classification
is done.
 Two feature selection algorithms have been
used for this purpose: Mutual Information
(MI) and χ² (chi-square) feature selection.
The short texts are classified into different
domains, so that the classifier can classify
the tweets as positive or negative with
greater performance.
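
As a rough illustration of term scoring for feature selection, the sketch below computes the chi-square statistic of a term with respect to one class from a 2×2 contingency table. It assumes the second selection criterion is the standard chi-square statistic; `chi_square` and the toy documents are hypothetical, not taken from the paper.

```python
def chi_square(term, docs, labels, target):
    """Chi-square score of `term` for class `target`.

    docs   : list of token sets, one per short text
    labels : list of class labels, aligned with docs
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        present = term in tokens
        in_class = label == target
        if present and in_class:
            a += 1  # term present, document in class
        elif present:
            b += 1  # term present, document in another class
        elif in_class:
            c += 1  # term absent, document in class
        else:
            d += 1  # term absent, document in another class
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # term or class never (or always) occurs: no evidence
    return n * (a * d - b * c) ** 2 / denominator
```

Terms with the highest scores for a class are the ones most strongly associated with it, so keeping only the top-scoring terms is one way to shrink the feature set.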

 The main objective of this research was
to compare state-of-the-art Sentiment
Analysis methods against a novel hybrid
method.
 The Hybrid method adopts a
combination of both the supervised
methods and unsupervised methods.
 It utilizes a Sentiment lexicon to generate
a new set of features to train a linear
SVM classifier.

 In this paper, domain based Twitter Sentiment
Analysis is done. The domain considered is
smartphones.
 The Hybrid Polarity Detection System has three
modules:
 The first module is the Preprocessing Module in which
cleaning of data is done. The preprocessing steps
include removal of usernames, URL tags etc.
 The second module is the Sentiment Feature Generator
Module. In this module, slang terms are replaced with their
proper language equivalents. The SentiStrength lexicon is
then used to tag the words with their sentiment scores.
Fourteen features are extracted from the text.
 The third module is Machine Learning Classifier, in
which a linear SVM takes the input feature set and
classifies the tweets as positive or negative.

 In this paper, the authors summarise the
differential evolution algorithm and its
improvements to help researchers studying
the topic. First, the basics of the algorithm and
its operations (mutation, crossover and
selection) are explained.
Thereafter, the different improvements
aimed at increasing optimization
performance and efficiency are compared.
The improvements mainly concern the
evolution operations, parameter settings
and other refinements, focussing
mainly on the mutation operation.
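
The three basic operations can be illustrated with a minimal sketch of the classic DE/rand/1/bin variant. The parameter names and defaults (`f`, `cr`, `pop_size`, `generations`) are illustrative assumptions, and none of the improvements surveyed in the paper are included.

```python
import random


def differential_evolution(objective, bounds, pop_size=20, f=0.8, cr=0.9,
                           generations=100, seed=0):
    """Minimise `objective` over the box `bounds` with basic DE/rand/1/bin."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [objective(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Mutation: combine three distinct individuals other than i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            mutant = [pop[a][k] + f * (pop[b][k] - pop[c][k]) for k in range(dim)]
            mutant = [min(max(v, lo), hi) for v, (lo, hi) in zip(mutant, bounds)]
            # Crossover: take each gene from the mutant with probability cr;
            # one randomly chosen gene is always taken from the mutant.
            forced = rng.randrange(dim)
            trial = [mutant[k] if (rng.random() < cr or k == forced) else pop[i][k]
                     for k in range(dim)]
            # Selection: keep the trial only if it is at least as good.
            trial_score = objective(trial)
            if trial_score <= scores[i]:
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]
```

For example, `differential_evolution(lambda x: sum(v * v for v in x), [(-5, 5), (-5, 5)])` searches for the minimum of a simple sphere function.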

 Most traditional clustering algorithms
simply assume that the number of
clusters is given and focus on the quality
of clustering results. This paper presents a
clustering algorithm that also
automatically determines the number
of clusters. The proposed
algorithm has two steps. First, a
region splitting and merging (RSM)
mechanism splits and then merges similar
groups until a self-adaptive threshold is
reached. Second, the number of
clusters is fine-tuned using automatic
clustering differential evolution (ACDE).

 Data Collection: Retrieval of Twitter
status updates
 Lexicon development
 Data Pre-Processing
 Feature Extraction, Normalisation and
Reduction
 K-means Clustering
 Differential Evolution

 Casefolding.
 Removal of:
 unnecessary punctuations
 extra blank spaces
 retweet tag
 usertags
 URLs
 Hashtags
 Removal of stopwords
 Replacement of emoticons
 Positive emoticons – EPOS
 Negative emoticons – ENEG
 Neutral emoticons – ENEUT
 Replacement of sentiment words
 Positive words – POS
 Negative words – NEG
 Replacement of negation and intensity words
 Negation words – NEGATION
 Intensity words – INTENSITY
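
The preprocessing steps listed above could be sketched as below. All word lists and the emoticon table are small illustrative assumptions (the actual lexicons are not given in the slides), and `preprocess` is a hypothetical helper. Exclamation marks are kept as separate tokens here because they are counted later as a feature.

```python
import re

# Small illustrative lexicons (assumptions; the real ones are not shown in the slides).
EMOTICONS = {":)": "EPOS", ";)": "EPOS", ":(": "ENEG", ":'(": "ENEG", ":|": "ENEUT"}
POSITIVE_WORDS = {"good", "great", "love"}
NEGATIVE_WORDS = {"bad", "awful", "hate"}
NEGATION_WORDS = {"not", "never", "no"}
INTENSITY_WORDS = {"very", "really", "so"}
STOPWORDS = {"the", "a", "is", "this", "i", "it"}

# Match emoticons first (longest first), then words, then single '!' marks.
_EMOTICON_PATTERN = "|".join(re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True))
TOKEN_RE = re.compile(rf"{_EMOTICON_PATTERN}|\w+|!")


def preprocess(tweet):
    """Return the cleaned token list with emoticons and lexicon words replaced by tags."""
    text = tweet.casefold()                      # case folding
    text = re.sub(r"https?://\S+", " ", text)    # remove URLs
    text = re.sub(r"\brt\b", " ", text)          # remove retweet tag
    text = re.sub(r"[@#]\w+", " ", text)         # remove user tags and hashtags
    tokens = []
    for token in TOKEN_RE.findall(text):         # other punctuation and spaces are dropped
        if token in EMOTICONS:
            tokens.append(EMOTICONS[token])
        elif token in STOPWORDS:
            continue
        elif token in POSITIVE_WORDS:
            tokens.append("POS")
        elif token in NEGATIVE_WORDS:
            tokens.append("NEG")
        elif token in NEGATION_WORDS:
            tokens.append("NEGATION")
        elif token in INTENSITY_WORDS:
            tokens.append("INTENSITY")
        else:
            tokens.append(token)
    return tokens
```

Because the text is case-folded first, the upper-case tags cannot be confused with ordinary words in the tweet.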

 Feature Extraction
 Feature extraction is the process of extracting the main
characteristics of the text. For a machine learning algorithm to
perform well, it is essential to have features that are descriptive
of the text. The number of occurrences of each of the following
features has been counted for each tweet:
 Words
 Exclamation marks (!)
 EPOS keyword
 ENEG keyword
 ENEUT keyword
 POS keyword
 NEG keyword
 NEGATION keyword
 INTENSITY keyword
 Random words (words left, which do not fall into any category)
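
A possible way to count these features from a preprocessed token list is sketched below. It assumes tokens have already been replaced by the tags listed above, that `!` appears as its own token, and that the word count includes tagged tokens; `extract_features` is a hypothetical name.

```python
TAGS = ("EPOS", "ENEG", "ENEUT", "POS", "NEG", "NEGATION", "INTENSITY")


def extract_features(tokens):
    """Count per-tweet occurrences of each feature in a preprocessed token list."""
    features = {"words": 0, "exclamations": 0, "random_words": 0}
    features.update({tag: 0 for tag in TAGS})
    for token in tokens:
        if token == "!":
            features["exclamations"] += 1
            continue
        features["words"] += 1  # assumption: tagged tokens count as words too
        if token in TAGS:
            features[token] += 1
        else:
            features["random_words"] += 1  # words that fall into no category
    return features
```

Each tweet then becomes a fixed-length vector of ten counts, which is the shape the later normalisation and clustering steps expect.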

 The values of all the features are normalised
to the range of 0 to 1. The normalised value
is given by
$$\mathrm{Normalised}(e) = \frac{e - E_{\min}}{E_{\max} - E_{\min}}$$
where
$e$ is the original value,
$E_{\max}$ is the maximum value of the feature,
$E_{\min}$ is the minimum value of the feature.
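
The min-max normalisation defined above can be written as a short helper applied to one feature column at a time. Returning zeros for a constant column is an assumption made here to avoid division by zero; the slides do not say how that case was handled.

```python
def min_max_normalise(values):
    """Scale one feature column to the range 0 to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Constant feature: no spread to scale (assumed convention).
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]
```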

 Feature reduction is done by computing
the cross-correlation between features.
Where two features are closely related,
one of them is removed from the table.
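
One way to carry out this kind of reduction is sketched below, using the Pearson correlation coefficient between feature columns. The 0.9 threshold and the rule of keeping the earlier feature of a correlated pair are assumptions, not values taken from the project.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Return the feature names to keep, dropping one of each closely related pair.

    columns : dict mapping feature name -> list of values (one per tweet)
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept
```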

ALGORITHM                 ACCURACY
K-Means                   51%
Differential Evolution    59%
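
For context, a bare-bones K-means with two clusters and one possible way of scoring it against labelled tweets are sketched below. How accuracy was actually computed in this project is not stated; the sketch assumes cluster ids are mapped to classes in whichever way agrees best with the labels. Function names and defaults are hypothetical.

```python
import random


def k_means(points, k=2, iterations=50, seed=0):
    """Cluster `points` (a list of equal-length tuples) and return a cluster id per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def two_cluster_accuracy(assignment, labels):
    """Accuracy of a two-cluster result against 0/1 labels, under the better id mapping."""
    agreement = sum(a == l for a, l in zip(assignment, labels)) / len(labels)
    return max(agreement, 1 - agreement)
```

K-means can settle in a poor local optimum depending on the initial centroids, which is one reason an optimiser such as Differential Evolution may find a better partition.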

Findings
 Through this project I have investigated the utility
of sentiment classification on a collected
dataset.
 While exploring the topic, I observed that only
a limited number of algorithms are useful for
Twitter sentiment analysis.
 Twitter statuses have unique characteristics
compared to other corpora. Since statuses are
limited to 140 characters, the usual data mining
techniques used for movie reviews, etc. can’t be
used.
 Also, not many research papers were available
on feature reduction.

Conclusion
 This project has been a great learning experience in the field
of information retrieval and data mining. In this project, a Twitter
dataset was collected for the purpose of sentiment analysis.
Various data preprocessing techniques were applied to the
dataset. Thereafter, features were extracted from each tweet
and normalised. Feature reduction was then applied to
remove one of each pair of closely related features. The quality
of the features/attributes extracted from the training
dataset affects the performance of the technique. The K-means
clustering algorithm and Differential Evolution, an optimization
algorithm, were then applied to cluster the data into two classes,
positive and negative. Finally, the accuracies of these two
algorithms were compared. On the basis of these accuracies,
Differential Evolution performs better than K-Means on the
Twitter dataset.

Future Work
 As future work, three more unsupervised
clustering techniques will be applied:
Otsu’s Threshold, Fuzzy c-means and the
EM algorithm. The next step would be to
compare them with supervised learning
methods including SVM, Naive Bayes and
LDA. The accuracies of the different
algorithms will be calculated and compared.
 A Twitter dataset for a particular product will be
collected and opinion mining will be
applied.
