ISI DENTSU SOUTHEAST ASIA SENTIMENT ENGINE
INTRODUCTION
ARCHITECTURE
DATA PREPARATION
RESULTS
Data extraction and data cleaning: In this stage the mention text is extracted from the database and cleaned using various string operations and regex commands.
POS Tagging (parts-of-speech tagging): This stage is mainly where feature extraction occurs. The identified content is pushed through text libraries such as NLTK and TextBlob. These libraries tag the data as different parts of speech, which can then be easily filtered.
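As a minimal sketch of the filtering step, the `(word, tag)` pairs produced by a tagger such as NLTK's `nltk.pos_tag` (Penn Treebank tags) can be filtered by part of speech. The tagged input below is hardcoded illustrative output, not actual mention data, and the set of kept tags is an assumption:

```python
def filter_by_pos(tagged_tokens, keep_tags=("JJ", "NN", "VB")):
    """Keep tokens whose POS tag starts with one of keep_tags
    (e.g. JJ = adjective, NN/NNS/NNP = nouns, VB* = verbs)."""
    return [word for word, tag in tagged_tokens
            if any(tag.startswith(k) for k in keep_tags)]

# Example output of the kind nltk.pos_tag would produce:
tagged = [("Honda", "NNP"), ("issued", "VBD"), ("a", "DT"),
          ("massive", "JJ"), ("recall", "NN"), ("today", "NN")]

print(filter_by_pos(tagged))  # determiners dropped; nouns/verbs/adjectives kept
```

In the real pipeline the tagged tokens would come from NLTK or TextBlob rather than a hardcoded list.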
Vectorization: In this stage the text is converted into numbers using two techniques: count vectorization and TF-IDF.
Centering and feature scaling: All the features are centered and scaled to prevent uneven weighting of the features.
Logistic regression classifier: Uses around 6783 features to perform classification.
Random forest classifier: Uses 16725
features to perform classification.
Extra-Tree classifier model: The probability scores from both classifiers are fed into this model. This is done to reduce the bias of the two models.
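The training stages above can be sketched end to end with scikit-learn. This is a toy sketch under stated assumptions: the tiny corpus and labels are illustrative, and the hyperparameters and exact feature counts (6783 and 16725 in the real engine) are not reproduced here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

texts = ["great car love it", "terrible recall again", "love the new model",
         "worst service ever", "amazing drive quality", "recall ruined my trust"]
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = positive, 0 = negative

# Vectorization: text -> TF-IDF features (count vectorization is analogous).
X = TfidfVectorizer().fit_transform(texts)

# Centering/scaling: with_mean=False keeps the sparse matrix sparse.
X = StandardScaler(with_mean=False).fit_transform(X)

# Two base classifiers produce class-probability scores.
lr = LogisticRegression().fit(X, y)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Their probability scores are stacked and fed to an Extra-Trees model,
# reducing the bias of the two base models.
stacked = np.hstack([lr.predict_proba(X), rf.predict_proba(X)])
meta = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(stacked, y)
pred = meta.predict(stacked)
```

Feeding the base models' probability scores (rather than their hard labels) into the Extra-Trees meta-model preserves each model's confidence, which is what lets the ensemble correct for individual bias.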
Data Extraction: In this stage mentions are extracted from the MongoDB database with their Sentiment, mention text, and object ID. The flags used to identify which text data has to be extracted from the database are Sentiment_Processed:1, Object ID, and Sentiment.
Data Preprocessing: In this stage the extracted mentions are tokenized and pre-processed. During the pre-processing stage, string operations such as the removal of hashtags, commas, semicolons, etc. are performed.
Vectorization: In this stage the pre-processed mentions are converted into numerical feature vectors using count vectorization and TF-IDF.
Classifiers: We run the logistic regression classifier, the random forest, and the Extra-Tree classifier to predict the sentiment.
Updating the Database: The sentiment which is produced is appended to the respective object ID. Since the entire sentiment tagging process is a batch process, the object IDs can be easily tagged with the mentions. This function updates the database fields Sentiment, Sentiment_Ravi, and Sentiment_Processed. The Sentiment flag is updated with the sentiment, the Sentiment_Ravi flag is also updated with the sentiment, and Sentiment_Processed is updated to 2.
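The batch update step can be sketched as a pure-Python function over mention documents. The field names follow the flags described above; the document shapes are assumptions, and the real system would write each document back to MongoDB (e.g. via a PyMongo update call), which is not shown here:

```python
def update_sentiments(mentions, predictions):
    """Batch-tag mention documents with their predicted sentiment.

    mentions: list of dicts, each carrying an '_id' (object ID)
    predictions: dict mapping object ID -> predicted sentiment label
    """
    for doc in mentions:
        sentiment = predictions[doc["_id"]]
        doc["Sentiment"] = sentiment        # Sentiment flag
        doc["Sentiment_Ravi"] = sentiment   # Sentiment_Ravi flag
        doc["Sentiment_Processed"] = 2      # mark the mention as processed
    return mentions

docs = [{"_id": "a1", "text": "love it"}, {"_id": "a2", "text": "bad recall"}]
update_sentiments(docs, {"a1": "positive", "a2": "negative"})
```

Because the tagging runs as a batch, the object IDs collected at extraction time are enough to match each prediction back to its document.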
In the data preparation stage, an initial review of the text documents was conducted using word cloud analysis.
The dataset was split into test and train sets. Using each of these datasets, a word cloud analysis was performed.
In the train dataset we notice that Honda and Toyota were the most popular words, closely followed by recalls. From this we can conclude that these words should form the features for the dataset.
In the test dataset we notice that the words Honda and Toyota appear, as in the train dataset. We also notice that words like replace appear in the test dataset. This shows that these words are also very important to the analysis.
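The word-cloud review amounts to word-frequency counting; a minimal stdlib sketch, where the sample mentions are illustrative stand-ins rather than the actual dataset:

```python
from collections import Counter

train_mentions = [
    "honda recall announced", "toyota recall expands",
    "honda dealers respond", "toyota tops sales",
]

# Tokenize and count; the most common words (what a word cloud draws
# largest) suggest candidate features for the model.
counts = Counter(word for m in train_mentions for word in m.split())
print(counts.most_common(3))  # honda / toyota / recall dominate here
```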
To create feature vectors for the model, both the test and train data are combined and passed through a POS tagging script, which identifies the various parts of speech and helps us identify the feature vectors for the model.
The above visualization shows a distribution of the various types of mention texts. The train and test datasets are in a 3:1 ratio; we always use a larger train dataset than test dataset.
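A 3:1 train/test split as described can be produced with scikit-learn's `train_test_split`; the mention list below is a placeholder for the real data:

```python
from sklearn.model_selection import train_test_split

mentions = [f"mention {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]  # alternating positive/negative

# test_size=0.25 yields the 3:1 train/test ratio described above;
# stratify keeps the class balance equal across the two splits.
X_train, X_test, y_train, y_test = train_test_split(
    mentions, labels, test_size=0.25, random_state=0, stratify=labels)
print(len(X_train), len(X_test))  # 75 25
```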
We notice that in the train dataset the number of negative mentions is half the number of positive mentions. Hence this engine was more tuned towards positive mentions, because a large number of positive feature vectors were extracted from the database.
In the test dataset we notice that the numbers of positive and negative mentions are equally split. The accuracy on the test dataset was around 75%.
These are the final results obtained from the sentiment engine. All the results have been processed in the production system. A comparison is also performed with the existing lexicon-based classifier.
All measurements were performed on samples from July and August, over equal time intervals. The reason for this is that we need to evaluate the accuracy over a range of sample sizes and estimate its performance.
We notice that the accuracy on positive mentions is consistently very high while the accuracy on negative mentions is consistently low. The reason for this is the imbalanced dataset and the imbalanced feature set. The balance of each of these could not be adjusted, as the data flowing in consists mainly of positive mention texts.
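The per-class accuracy figures discussed here amount to a recall-per-class calculation; a sketch with illustrative labels (not the production data), showing how an imbalanced set yields high positive and low negative accuracy:

```python
def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly predicted samples within each true class."""
    acc = {}
    for c in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == c]
        acc[c] = sum(y_pred[i] == c for i in idx) / len(idx)
    return acc

# Imbalanced example: many positives, few negatives.
y_true = ["pos"] * 8 + ["neg"] * 2
y_pred = ["pos"] * 8 + ["pos", "neg"]  # one negative misclassified

print(per_class_accuracy(y_true, y_pred))  # pos accuracy 1.0, neg accuracy 0.5
```

A single negative misclassification costs 50% of the negative-class accuracy here, while the overall accuracy stays at 90%, which is exactly the effect the imbalanced mention stream has on the engine.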
From the results graph we notice that the accuracy of the machine learning classification engine is consistently higher than that of the lexicon-based engine. The reason for this is mainly the way the two engines are trained: the classification-based sentiment engine is better oriented to the needs of the business than a general lexicon-based sentiment engine.
We notice a sudden drop in the accuracy of the classification-based sentiment engine and a sudden spike in the lexicon-based sentiment engine on August 20th, 2016. The reason for this is the disproportionately high number of negative mentions flowing in on that date. The classification-based sentiment engine's accuracy will improve with retraining, as the amount of negative data increases with time, thus increasing the number of samples the algorithm is exposed to.
Mukund Krishna Ravi
Sentiment analysis (opinion mining) is a technique by which we extract context polarity to determine the sentiment of mention texts. In this particular use case, the firm ISI Dentsu wanted to improve the accuracy of its existing sentiment engine by using classification models. This was done by constructing a classification algorithm only for the positive and negative samples, thereby increasing the overall accuracy of the sentiment engine. We also focused on a specific portion of the dataset, as per the requirements.
TRAINING ARCHITECTURE
PRODUCTION ARCHITECTURE