 Sentiment analysis aims to determine the attitude of a
speaker or a writer with respect to some
topic or the overall contextual polarity of a
document.
 The attitude may be his or her judgment or
evaluation, affective state (that is, the
emotional state of the author when writing),
or the intended emotional communication
(that is, the emotional effect the author
wishes to have on the reader).
 Determining document subjectivity:
Often called subjectivity classification, this subtask
determines whether a given text is objective (expressing a
fact) or subjective (expressing an opinion or emotion).
 Determining document orientation:
Often called sentiment classification or document-level
sentiment classification, this subtask determines the
polarity of a given subjective text. In other words, it
determines whether the text expresses a positive or a
negative sentiment on its subject matter.
 Determining the strength of document orientation:
This subtask decides whether the positive sentiment
expressed by a text on its subject matter is weakly positive,
mildly positive or strongly positive.
 Consumer information
› Product reviews
 Marketing
› Consumer attitudes
› Trends
 Politics
› Politicians want to know voters’ views
› Voters want to know politicians’ stances and who else
supports them
 Social
› Find like-minded individuals or communities
 Machine learning
› Naïve Bayes
› Maximum Entropy Classifier
› SVM
 Unsupervised methods
› K-means
› Otsu’s Threshold
› Fuzzy c-means
 Data to annotate is given.
 But no training data or additional
resources are provided.
 Aim:
 To create a lexical resource in an
automated way without any human
intervention for annotating data.
 Affective lexicon to be used for polarity
classification.
 To obtain training material, use
emoticons as indicators of a mood within
a message.
 Split the tweets into 2 sets:
 positive -  ;) :} :] ...
 negative -  :{ :’( ...
 We get a positive word list and a
negative word list.
 If a word occurs more frequently in the
positive set, it is labelled positive, and vice
versa.
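The emoticon-based labelling above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the emoticon sets are hypothetical (the slide lists only a few examples), tokenisation is a plain whitespace split, and raw counts are compared rather than normalised frequencies.

```python
from collections import Counter

# Hypothetical emoticon sets; the slide lists only a few examples of each.
POSITIVE_EMOTICONS = {":)", ";)", ":}", ":]"}
NEGATIVE_EMOTICONS = {":(", ":{", ":'("}
ALL_EMOTICONS = POSITIVE_EMOTICONS | NEGATIVE_EMOTICONS


def build_lexicon(tweets):
    """Return (positive_words, negative_words) learned from emoticon-labelled tweets."""
    pos_counts, neg_counts = Counter(), Counter()
    for tweet in tweets:
        tokens = tweet.lower().split()
        has_pos = any(t in POSITIVE_EMOTICONS for t in tokens)
        has_neg = any(t in NEGATIVE_EMOTICONS for t in tokens)
        if has_pos == has_neg:
            continue  # no emoticon, or mixed signals: skip the tweet
        words = [t for t in tokens if t not in ALL_EMOTICONS]
        (pos_counts if has_pos else neg_counts).update(words)

    # A word is positive if it is seen more often in the positive set, and vice versa.
    positive = {w for w in pos_counts if pos_counts[w] > neg_counts[w]}
    negative = {w for w in neg_counts if neg_counts[w] > pos_counts[w]}
    return positive, negative
```

Words that occur equally often in both sets end up in neither list, which is one possible way to handle ties.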
 Aim:
 To analyse the effectiveness of various
popular classifiers and identify the most
suitable classifier for Twitter, one that could
ease the process of classifying
sentiments in tweets.
 Strategy:
Use two or more classifiers chained
one after the other. This resulted in a
higher yield and better accuracy on the mined
data.
 First stage: the incoming preprocessed
data is classified into three categories –
polar, neutral and irrelevant.
 Second stage: the data classified under
polar is fed to a second classifier for
further segregation into positive and
negative.
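A minimal sketch of the two-stage chain described above. The chaining function takes any two callables; the toy stage functions and their keyword sets are hypothetical stand-ins for trained classifiers and are not taken from the research.

```python
def classify_two_stage(text, stage_one, stage_two):
    """Run stage one; only texts it labels 'polar' are passed to stage two."""
    label = stage_one(text)
    if label != "polar":
        return label  # "neutral" or "irrelevant" stops here
    return stage_two(text)  # "positive" or "negative"


# Hypothetical keyword-based stand-ins for the trained classifiers.
TOPIC_WORDS = {"phone", "battery"}
OPINION_WORDS = {"love", "hate"}


def toy_stage_one(text):
    words = set(text.lower().split())
    if not words & TOPIC_WORDS:
        return "irrelevant"
    return "polar" if words & OPINION_WORDS else "neutral"


def toy_stage_two(text):
    return "positive" if "love" in text.lower().split() else "negative"
```

In practice each stage would be a trained model; the chain itself only needs each stage to expose a function from text to label.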
The classification algorithms used in the
research are:
 Naive Bayes
 Random Forest
 Support Vector Machines (SVM)
 SMO (Sequential Minimal Optimization)
 The research has been performed on
Tunisian users’ statuses on Facebook during
the “Arab Spring” era.
 The aim is to extract useful information
about users’ sentiments and behaviours
during this sensitive and significant period.
 For this purpose, a method based on
Support Vector Machine (SVM) and Naive
Bayes has been proposed.
 The methodology used is collection of raw
data, followed by lexicon development.
 Three types of lexicons were created:
a lexicon for social acronyms, a lexicon for
emoticons and a lexicon for interjections.
 Data preprocessing is then done (stop
word removal and stemming), followed by
feature extraction.
 Finally, the machine learning algorithms are
applied.
 The performance of different feature sets
using Naive Bayes (NB) and SVM classifiers
was then compared.
 This paper is concerned with the
problem of mining social emotions from
text. The aim of this research is to
discover the connection between
different social emotions and affective
terms and, based on it, to automatically
predict the social emotion of a text.
 An official Chinese news portal has been
used for the dataset collection.
 The proposed solution is to construct a joint
emotion-topic model. Latent Dirichlet
Allocation (LDA) has been used with an
additional layer for emotion modelling.
 A three-step process has been used for
generation of affective terms:
 The first step is to generate an emotion from a
document-specific emotion distribution.
 The second step is to generate a latent topic
from a multinomial distribution.
 The final step is to develop an approximate
inference method based on Gibbs sampling.
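The generative side of the steps above can be illustrated with a short sampling sketch. The first two draws follow the slide; conditioning the topic on the sampled emotion and the final draw of a term from the chosen topic are assumptions about the model that the slide does not state. All distribution names are hypothetical, and Gibbs-sampling inference is not shown.

```python
import random


def draw(dist, rng):
    """Sample one key from a {outcome: probability} dictionary."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]


def sample_term(emotion_dist, topic_given_emotion, term_given_topic, rng):
    """Generate one (emotion, topic, term) triple for a document."""
    emotion = draw(emotion_dist, rng)               # step 1: document-specific emotion
    topic = draw(topic_given_emotion[emotion], rng)  # step 2: latent topic (assumed conditioned on emotion)
    term = draw(term_given_topic[topic], rng)        # assumed: term drawn from the topic
    return emotion, topic, term
```

Inference runs in the opposite direction: given observed terms and emotion labels, the distributions are estimated, for which the paper uses Gibbs sampling.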
 As a complete generative model, the
proposed model can infer a
number of conditional probabilities for
unseen documents, for example the
probabilities of latent topics given an
emotion and of terms given a topic.
This method was found to be better than
the emotion-term model and multiclass SVM,
as the emotion assignments at the term
level could be visualised.
 This paper proposes an aspect-based
sentiment classification approach to
analyze sentiments for tweets.
 In previous studies, the overall sentiment of
a tweet was determined. This is of limited
use to companies that need to
monitor consumer opinion of their
products and services. For them it would be
more useful to know which
aspects of the product or service users are
happy or unhappy about.
 The aspect-based sentiment classifier makes use of a POS
tagger, a sentiment lexicon and a few gazetteer lists to produce
results of the form [aspect, sentiment words, polarity]. This
process consists of three main steps:
 1. Aspect-sentiment extraction: Given a tweet, this step
determines a list of possible aspect candidates along with their
associated sentiments and polarity.
 2. Aspect ranking and selection: A tweet can express many
different opinions. Only important aspects should be selected.
For example, when classifying tweets on a telecommunication
company, some of the aspects of interest include customer
service, 3G connectivity, speed, etc. In this step the aspect
candidates are ranked and the set of most significant
aspects is selected as the expected aspects.
 3. Aspect classification: Using the set of expected aspects and
results from the aspect-sentiment extraction step, we obtain the
final list of aspects along with their polarity for each tweet.
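A simplified sketch of step 1 (aspect-sentiment extraction), producing results of the form [aspect, sentiment words, polarity]. It swaps the POS tagger for a fixed word window around each aspect, so it is an illustration rather than the paper's method; the gazetteer, the lexicon and the window size are all hypothetical.

```python
# Hypothetical gazetteer of aspects and a tiny sentiment lexicon (word -> score).
ASPECTS = {"service", "speed", "connectivity"}
LEXICON = {"great": 1, "fast": 1, "slow": -1, "terrible": -1}


def extract_aspect_sentiments(tweet, window=2):
    """Return a list of (aspect, sentiment_words, polarity) tuples."""
    tokens = [t.strip(".,!?").lower() for t in tweet.split()]
    results = []
    for i, tok in enumerate(tokens):
        if tok not in ASPECTS:
            continue
        # Look at up to `window` tokens on each side of the aspect term.
        nearby = tokens[max(0, i - window): i + window + 1]
        words = [w for w in nearby if w in LEXICON]
        if not words:
            continue  # aspect mentioned without any sentiment word nearby
        score = sum(LEXICON[w] for w in words)
        polarity = "positive" if score > 0 else "negative" if score < 0 else "neutral"
        results.append((tok, words, polarity))
    return results
```

Ranking and selection (step 2) could then operate on how often each aspect appears across many tweets, which is not shown here.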
 The experimental results suggested that
a layered classification approach which
uses the aspect-based classifier as the
first layer classification and the tweet-
level classifier as the second layer
classification is more effective than a
classifier trained using target-dependent
features. This approach is able to
consistently improve the performance of
existing sentiment classifiers.
 The aim of this research was to
automatically extract the set of
messages which contain opinions, filter
out non-opinion messages and
determine their sentiment directions, that
is positive or negative.
 Manually labelled data has been used
as training data to build the model.
 The initial step is to preprocess the crawled
tweets by removing usernames, hashtags,
retweet tags and non-English words.
 Three resources were constructed for the
further preprocessing which included a stop
word dictionary, an emotion dictionary and
an acronym dictionary.
 After preprocessing all words are
transformed into the form (word, POS tag,
English-word, Stop-word).
 Thereafter, tweets containing opinions are
extracted, filtering out the non-opinion
tweets. A Naive Bayes classifier is then used
to classify the tweets based on sentiment.
 Since a word may have different meanings
in different domains, the short texts are first
classified by domain.
 Two feature selection algorithms have been
used for this purpose – Mutual Information
(MI) and χ² (chi-square) feature selection. The short texts
are classified into different domains, so that
the classifier can automatically label
the tweets as positive or negative
with better performance.
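As an illustration of Mutual Information feature selection, the sketch below scores one term against one class from a 2x2 table of document counts, using the standard definition of mutual information. The function name and argument names are hypothetical; the paper's exact scoring procedure is not given on the slide.

```python
import math


def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between a term and a class.

    n11: documents containing the term and in the class
    n10: documents containing the term, not in the class
    n01: documents without the term, in the class
    n00: documents without the term, not in the class
    """
    n = n11 + n10 + n01 + n00
    cells = (
        (n11, n11 + n10, n11 + n01),  # term present, class present
        (n10, n11 + n10, n10 + n00),  # term present, class absent
        (n01, n01 + n00, n11 + n01),  # term absent, class present
        (n00, n01 + n00, n10 + n00),  # term absent, class absent
    )
    total = 0.0
    for joint, term_margin, class_margin in cells:
        if joint > 0:  # empty cells contribute nothing
            total += (joint / n) * math.log2(n * joint / (term_margin * class_margin))
    return total
```

Terms would then be ranked by this score per domain and the top-scoring ones kept as features.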
 The main objective of this research was
to compare state-of-the-art Sentiment
Analysis methods against a novel hybrid
method.
 The Hybrid method adopts a
combination of both the supervised
methods and unsupervised methods.
 It utilizes a Sentiment lexicon to generate
a new set of features to train a linear
SVM classifier.
 In this paper, domain based Twitter Sentiment
Analysis is done. The domain considered is
smartphones.
 The Hybrid Polarity Detection System has three
modules:
 The first module is the Preprocessing Module in which
cleaning of data is done. The preprocessing steps
include removal of usernames, URL tags etc.
 The second module is Sentiment Feature Generator
Module. In this module slang terms are replaced with their
standard-language equivalents. The SentiStrength lexicon is
then used to tag the words with their sentiment scores.
Fourteen features are extracted from the text.
 The third module is Machine Learning Classifier, in
which a linear SVM takes the input feature set and
classifies the tweets as positive or negative.
 In this paper, the authors have provided a
summary of the differential evolution algorithm
and its improvements, as an aid to
researchers studying the topic. First, the
basics of the differential evolution algorithm and its
operations (mutation, crossover
and selection) are explained.
Thereafter, the different improvements
aimed at increasing optimization
performance are compared. These improvements
make differential evolution more efficient
in application. The improvement measures mainly
cover the evolution operations, parameter
settings and other refinements, focusing
mainly on the mutation operation.
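The three basic operations named above (mutation, crossover, selection) can be sketched as a minimal DE/rand/1/bin loop. This is the textbook basic variant, not any of the improved variants the paper surveys; the parameter defaults (`pop_size`, `f`, `cr`, `generations`) are illustrative choices.

```python
import random


def differential_evolution(objective, bounds, pop_size=20, f=0.8, cr=0.9,
                           generations=200, seed=0):
    """Minimise `objective` over the box `bounds` = [(low, high), ...]."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [objective(ind) for ind in pop]

    for _ in range(generations):
        for i in range(pop_size):
            # Mutation: combine three distinct individuals other than i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            mutant = [pop[a][d] + f * (pop[b][d] - pop[c][d]) for d in range(dim)]
            mutant = [min(max(v, lo), hi) for v, (lo, hi) in zip(mutant, bounds)]

            # Crossover: take each coordinate from the mutant with probability cr;
            # j_rand guarantees at least one coordinate comes from the mutant.
            j_rand = rng.randrange(dim)
            trial = [mutant[d] if (rng.random() < cr or d == j_rand) else pop[i][d]
                     for d in range(dim)]

            # Selection: keep the trial only if it is at least as good.
            trial_score = objective(trial)
            if trial_score <= scores[i]:
                pop[i], scores[i] = trial, trial_score

    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]
```

For example, `differential_evolution(lambda x: sum(v * v for v in x), [(-5, 5), (-5, 5)])` searches for the minimum of a two-dimensional sphere function.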
 Most traditional clustering algorithms
simply assume that the number of
clusters is given and focus on the quality
of clustering results. This paper presents an
algorithm that clusters the data and
automatically determines the number
of clusters as well. The proposed
algorithm has two steps. First, a
region splitting and merging (RSM)
mechanism splits and then merges similar
groups until a self-adaptive threshold is
reached. Second, the number of
clusters is fine-tuned using automatic
clustering differential evolution (ACDE).
 Data Collection: Retrieval of twitter
status updates
 Lexicon development
 Data Pre-Processing
 Feature Extraction, Normalisation and
Reduction
 K-means Clustering
 Differential Evolution
 Casefolding.
 Removal of:
 unnecessary punctuation
 extra blank spaces
 retweet tags
 user tags
 URLs
 Hashtags
 Removal of stopwords
 Replacement of emoticons
 Positive emoticons – EPOS
 Negative emoticons – ENEG
 Neutral emoticons – ENEUT
 Replacement of sentiment words
 Positive words – POS
 Negative words – NEG
 Replacement of negation and intensity words
 Negation words – NEGATION
 Intensity words – INTENSITY
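The preprocessing steps above can be sketched as a single function. All word lists here are small hypothetical stand-ins for the project's lexicons, and two details are assumptions: exclamation marks are kept as separate tokens (they are counted later as a feature), and emoticons are matched as whole whitespace-separated tokens.

```python
import re

# Hypothetical stand-ins for the project's lexicons.
STOPWORDS = {"i", "the", "a", "is", "to", "and"}
EMOTICONS = {":)": "EPOS", ":(": "ENEG", ":|": "ENEUT"}
SENTIMENT = {"good": "POS", "love": "POS", "bad": "NEG", "hate": "NEG"}
NEGATIONS = {"not", "never", "no"}
INTENSIFIERS = {"very", "really", "so"}
PUNCTUATION = ".,;:?\"'()[]"


def preprocess(tweet):
    """Return the list of cleaned tokens for one tweet."""
    text = tweet.lower()                       # casefolding
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"\brt\b", " ", text)        # retweet tag
    text = re.sub(r"[@#]\w+", " ", text)       # user tags and hashtags
    text = text.replace("!", " ! ")            # keep "!" as its own token

    tokens = []
    for tok in text.split():                   # split() also drops extra blanks
        if tok in EMOTICONS:
            tokens.append(EMOTICONS[tok])
            continue
        if tok == "!":
            tokens.append(tok)
            continue
        word = tok.strip(PUNCTUATION)          # unnecessary punctuation
        if not word or word in STOPWORDS:
            continue
        if word in SENTIMENT:
            tokens.append(SENTIMENT[word])
        elif word in NEGATIONS:
            tokens.append("NEGATION")
        elif word in INTENSIFIERS:
            tokens.append("INTENSITY")
        else:
            tokens.append(word)
    return tokens
```

URLs are removed before punctuation handling so that characters inside a link are not mistaken for emoticons or sentence punctuation.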
 Feature Extraction
 Feature extraction is the process of extracting the main
characteristics of the text. For a machine learning algorithm to
perform well, it is essential to have features that are descriptive
of the text. The total number of occurrences of each of the following
features has been taken into account for each tweet:
 Words
 Exclamation marks (!)
 EPOS keyword
 ENEG keyword
 ENEUT keyword
 POS keyword
 NEG keyword
 NEGATION keyword
 INTENSITY keyword
 Random words (words left, which do not fall into any category)
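A minimal sketch of counting the features listed above for one tweet, assuming the tweet has already been turned into a token list in which emoticons and lexicon words are replaced by their keywords. The dictionary keys are hypothetical names, and counting keyword tokens towards the word total is an assumption.

```python
KEYWORDS = ("EPOS", "ENEG", "ENEUT", "POS", "NEG", "NEGATION", "INTENSITY")


def extract_features(tokens):
    """Count the per-tweet features from a list of preprocessed tokens."""
    features = {"words": 0, "exclamations": 0, "random_words": 0}
    features.update({keyword: 0 for keyword in KEYWORDS})
    for tok in tokens:
        if tok == "!":
            features["exclamations"] += 1
            continue
        features["words"] += 1           # assumption: keyword tokens count as words
        if tok in KEYWORDS:
            features[tok] += 1
        else:
            features["random_words"] += 1  # words that fall into no category
    return features
```

Each tweet then becomes one row of ten counts, which is the table that normalisation and feature reduction operate on.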
 The values of all the features are normalised
to the range of 0 to 1. The normalised value
is given by
$$\mathrm{Normalised}(e) = \frac{e - E_{\min}}{E_{\max} - E_{\min}}$$
where,
$e$ - the original value
$E_{\max}$ - the maximum value of the feature
$E_{\min}$ - the minimum value of the feature
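The normalisation formula above, applied to one feature column, can be written as a short function. Returning zeros for a constant column (where the maximum equals the minimum and the formula would divide by zero) is an assumption, since the slides do not say how that case is handled.

```python
def min_max_normalise(values):
    """Scale a list of numbers to the range 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: the formula is undefined, so map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

For the column `[2, 4, 6]` the minimum maps to 0, the maximum to 1 and the middle value to 0.5.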
 Feature reduction is done by computing
the cross-correlation between features. Where two
features are closely related, one of them
is removed from the feature table.
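One way to sketch this reduction step is with the Pearson correlation between feature columns, dropping any feature that is strongly correlated with one already kept. The use of Pearson correlation, the 0.9 threshold and the keep-the-first rule are assumptions; the slides do not specify them.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Return the feature names to keep, given {name: column_values}."""
    kept = []
    for name in columns:
        # Keep a feature only if it is not closely related to any kept feature.
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept
```

Because features are visited in insertion order, the first feature of each closely related pair is the one that survives.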
ALGORITHM    K-MEANS    DIFFERENTIAL EVOLUTION
ACCURACY     51%        59%
Findings
 Through this project I have investigated the utility
of sentiment classification on a collected Twitter
dataset.
 While exploring the topic, I observed that only
a limited number of algorithms are useful for
Twitter sentiment analysis.
 Twitter statuses have unique characteristics
compared to other corpora. Since there is a
limit of 140 characters, the usual data mining
techniques used for movie reviews, etc. cannot be
applied directly.
 Also, not many research papers were available
on feature reduction.
Conclusion
 This project has been a great learning experience in the field
of information retrieval and data mining. In this project, a Twitter
dataset was collected for the purpose of sentiment analysis.
Various data preprocessing techniques were applied to the
dataset. Thereafter, features were extracted from each tweet
and normalised. Feature reduction was then applied to
remove one of each pair of closely related features. The quality
of the features/attributes that are extracted from the training
dataset affects the performance of the technique. The K-means
clustering algorithm and Differential Evolution, an optimization
algorithm, were then applied to cluster the data into two classes,
positive and negative. Finally, the accuracies of these two
algorithms were compared. On the basis of these accuracies, it can
be said that Differential Evolution performs better than the
K-means algorithm on this Twitter dataset.
Future Work
 As future work, three more clustering
techniques will be applied as part of
unsupervised learning: Otsu’s
Threshold, Fuzzy c-means and the EM algorithm.
The next step would be to compare them with
supervised learning methods including SVM,
Naive Bayes and LDA. The accuracies of the different
algorithms will be calculated and compared.
 A Twitter dataset for a particular product will be
collected and opinion mining will be
applied.

Major presentation
