Multi-Tier Sentiment Analysis System in Big Data Environment
Wint Nyein Chan*, Thandar Thein**
* University of Computer Studies, Yangon, **University of Computer Studies, Maubin
Email: wintnyeinchan2012@gmail.com, thandartheinn@gmail.com
Abstract- Over recent years, big data, a huge amount of structured and unstructured data, has been generated from social
networks. Valuable information needs to be extracted from this social big data, and the traditional analytics platform needs to be
scaled up to analyze social big data in an efficient and timely manner. Sentiment analysis of social big data helps
organizations by providing business insights into public opinion. Sentiment analysis based on a multi-class classification scheme is
oriented towards classifying text into more detailed sentiment labels. Multi-class classification with a single-tier architecture,
where a single model is developed and the entire labeled data set is trained, may increase the classification complexity. In this paper, a
multi-tier sentiment analysis system on a big data analytics platform (MSABDP) is proposed to reduce the multi-class classification
complexity and efficiently analyze large-scale data sets. Hadoop is built for big data analytics; it is a good platform for
managing large data at scale, and it can improve scalability and efficiency by adopting a distributed processing
environment, since it is implemented using the MapReduce framework and the Hadoop Distributed File System (HDFS). The
MSABDP is implemented by combining the SentiStrength lexicon and a learning-based classification scheme in a multi-tier
architecture, and it runs on a big data analytics platform in order to manage large data at scale. The proposed system collects a
large amount of real Twitter data using Apache Flume, and this data was used for evaluation. The evaluation results show
that the proposed multi-class classification system with a multi-tier architecture significantly improves
classification accuracy over multi-class classification based on a single-tier architecture, by 7%.
Keywords: big data analytics, hadoop, multi class classification, sentiment analysis, social big data
1. INTRODUCTION
With the rapid growth of the Internet and online activity, many services such as blogging, podcasting, social
networking and bookmarking have become popular. These services allow users to create and share information within open
and closed communities and contribute to the volume of data. Moreover, in social networks, data is created and
delivered from various systems operating in real time, aggregating constant information about user activities and
interactions. Twitter has 320 million monthly active users who post 500 million tweets every day; Facebook
has 936 million daily and 1,440 million monthly active users as of December 2015. These factors are reasons for the
rise of big data [5]. Big data is characterized by the volume, velocity, veracity, variety, value and volatility of data.
Nevertheless, appropriate tools are needed to acquire and organize these data and to derive value from them.
With the advent of social media, data is captured from different sources, such as mobile devices and web
browsers, and it is stored in various data formats. The traditional storage and analytics platform can’t handle the
different sources and different formats of the structured and unstructured data. Big Data Analytics has become
popular for analysing and managing large volumes of structured and unstructured data [12]. Hadoop is a good
platform for Big Data Analytics as it provides scalability, cost-effectiveness, flexibility, speed, security and authentication,
parallel processing, availability and resilience. It is an open-source software framework that comprises two
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 15, No. 9, September 2017
204 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
parts: a storage part and a processing part. The storage part is called the Hadoop Distributed File System (HDFS) and
the processing part is called MapReduce.
Analysing social data through sentiment analysis helps the organization to
determine marketing strategy and improve customer services. Sentiment analysis is one of the main agendas in big
data that focuses on various ways to analyse big data to identify patterns and relationships, make informed predictions,
deliver actionable intelligence and gain business insight from this steady influx of information [13]. Twitter is one of
the popular social media platforms, combining features of blogs and social network services. Twitter was
established in 2006 and experienced rapid user growth in its first years [15]. Twitter is a good source of
information in the sense of snapshots of moods and feelings as well as commentary on up-to-date events and current
situations. In practice, achieving a high level of performance for sentiment analysis of Twitter data is very difficult.
First, the large volume and high velocity of the Twitter stream data must be stored and
processed efficiently. Second, Twitter allows users to post no more than 140 characters per post, which can make Twitter
sentiment classification perform worse than mining longer texts. In addition, classification into
multiple classes can increase the classification complexity, which tends to decrease classification accuracy.
In this paper, a sentiment analysis system is proposed which is implemented by combining lexicon-based and
learning-based approaches, and it is shown that the proposed system achieves good performance in terms of classification
accuracy. The main contributions of this paper are as follows:
1. A Big Data Analytics Platform is developed to scale up the traditional analytics platform for analyzing social
big data.
2. Multi-tier sentiment analysis on the Big Data Analytics platform (MSABDP) is proposed to reduce the multi-class
classification complexity.
The remainder of this paper is arranged as follows: In section 2, the related work presents large scale sentiment
analysis in a distributed environment and multi class classification. Section 3 illustrates the system design to analyze
the social big data sets and presents the implementation of the proposed system. Section 4 shows the experimental
setup and results. Section 5 concludes the paper and describes the possible directions for future works.
2. RELATED WORK
In past years, many works have been published on sentiment analysis, but there is little work on sentiment
analysis for big data. Many possible variants exist; some of the papers related to the proposed system are
reviewed in this section.
The paper [11] presented multi-tier classification architecture for sentiment analysis. The architecture consists of
three models. The first model classifies three classes and the remaining two models are performed as the binary
classifier. To achieve high performance, features were selected by applying various feature selection techniques.
150,000 movie reviews posted on social media were used to train and test the performance of the system. Four
classifiers (Naïve Bayes, SVM, Random Forest, and SGD) were used in the experiments. The results showed that the
proposed architecture significantly improved prediction accuracy over the simple single-tier
model by more than 10%. That system runs on a traditional analytics platform and classifies the movie
reviews into multiple classes by applying only supervised machine learning classifiers.
The authors of [10] presented an approach that relies on writing patterns and special unigrams for multiclass
sentiment analysis. They extracted patterns that rely on the PoS tags of words and then calculated the resemblance
degree res(p, t) of each pattern p in the training set to the tweet t. For each tweet, they extracted a set of 4 features,
referred to the training set, and used the Random Forest machine learning algorithm to perform the classification.
The training data set contains 21,000 tweets whose classes had been manually labeled. The system classified texts
collected from Twitter into seven classes: happiness, sadness, anger, love, hate, sarcasm and neutral. In the
experiment, the accuracy of the multiclass sentiment analysis was 56.9%. Since the training data were manually
labeled, no additional methods were required to develop the training data sets.
In [17], as the total volume of tweets is extremely high and takes a long time to process, the authors proposed a
distributed system for Twitter sentiment analysis with two components: a lexicon builder and a sentiment classifier.
The lexicon builder was implemented on Hadoop and HBase, and they demonstrated that it scales with an
increasing number of machines and amount of training data. In particular, for 100,000 tweets, the training time
decreased by 35% when moving from 2 to 3 machines, 40% from 2 to 4 machines, and 47% from 2 to 5 machines.
For 300,000 tweets, the running time decreased by 23% when moving from a cluster with 4 machines to 5
machines. To achieve higher accuracy than the lexicon-based approach alone, the lexicon-based and machine-learning-based
approaches were combined; Online Logistic Regression from the Mahout machine learning library was used for the
learning-based approach. The experimental results show that their system obtains higher accuracy by combining
lexicon-based and learning-based approaches. Their system used an existing class-labeled dataset and
performed binary classification.
B. Liu et al. [4] presented a simple and complete system for sentiment mining on large data sets using a Naive
Bayes classifier (NBC). To achieve fine-grained control of the analysis procedure, they implemented the NBC on top of the
Hadoop framework with four additional modules: the work flow controller (WFC), the data parser, the user terminal
and the result collector. They evaluated the scalability of the NBC on large data sets, and the result is encouraging in that
the accuracy of the NBC improves and approaches 82% as the dataset size increases. They demonstrated that the
NBC is able to scale up to analyze the sentiment of millions of movie reviews with increasing throughput. The two
datasets in the experiments had already been class-labeled. Instead of applying the Mahout machine learning library,
the NBC was implemented on top of the Hadoop framework with additional modules.
In the proposed system, a multi-tier sentiment analysis system is developed on a big data analytics platform to reduce
the multi-class classification complexity. The multi-tier sentiment classification is implemented by combining lexicon-based
and supervised learning-based approaches. In this system, SentiStrength is used to label the classes in order to develop
the training data sets, and Mahout Naïve Bayes with a multi-tier architecture is used for multi-class sentiment
classification. The proposed system can improve scalability and efficiency by adopting a distributed processing
environment, since it is implemented using HDFS and the MapReduce framework.
3. BIG DATA ANALYTICS PLATFORM DEVELOPMENT
The multi-tier sentiment analysis system is implemented on the Big Data Analytics Platform with Apache Flume
[14], HDFS, MapReduce [6] and the Mahout machine learning library [3]. The high-level architecture of the MSABDP is
illustrated in Figure 1. It consists of four layers: the Data Ingestion Layer, Storage Layer, Processing Layer, and
Analytics Layer.
In the Data Ingestion Layer, data are collected from the Twitter streaming API with Apache Flume. The collected data is
stored in the Hadoop Distributed File System (HDFS), the distributed storage located in the Storage Layer. One name node and
three data nodes are deployed for distributed storage. Data cleaning and preprocessing, class labeling and multi-tier
classification are developed in the Analytics Layer. All of the processes from the Analytics Layer are implemented with the
MapReduce paradigm, which is used for distributed processing and is located in the Processing Layer. The detailed process
of each layer is described in the following subsections. The system provides capabilities such as
scalability, cost-effectiveness, and parallel processing by using the Hadoop platform.
Figure 1. High Level Architecture of MSABDP
(The figure depicts the four layers: Apache Flume data collection in the Data Ingestion Layer; the HDFS master node with its NameNode and secondary NameNode, plus the DataNodes on the slave nodes, in the Storage Layer; the YARN ResourceManager with its Scheduler and Application Manager, the per-node NodeManagers, and containers running Map/Reduce tasks and the Application Master in the Processing Layer, coordinated through heartbeats, container launch requests, node status and resource requests; and data cleaning and preprocessing, class labeling and Mahout multi-tier classification in the Analytics Layer.)
3.1 Data Ingestion Layer
In this layer, Apache Flume is used to collect the tweet stream data, and the data is ingested into HDFS through a
memory channel. Apache Flume deploys a Twitter agent which has three main components: a TwitterSource, a
Memory Channel and an HDFS Sink. The incoming tweets are limited by keywords and language in the
TwitterSource. The source is an event-driven source and receives events through mechanisms like callbacks
or RPC calls. In the TwitterSource, the twitter4j library is used to access the Twitter Streaming API. The
application secrets, namely the Consumer Key, Consumer Secret, Access Token and Access Secret, are used to
connect to the Twitter APIs. The memory channel acts as a pathway between the TwitterSource and the HDFS Sink.
The channel is used as an in-memory queue to store events until they're ready to be written to a sink. As the channel
holds all events in memory, the channel's capacity and transaction capacity are limited by the capacity and
transactionCapacity parameters in the configuration file. Events are added to the channel by the TwitterSource, and later
removed from the channel by the HDFS Sink. The HDFS Sink writes events to a configured location in HDFS. In the
HDFS Sink configuration, the size of the files is defined with the rollCount parameter, which is set to 10,000, so each
file ends up containing 10,000 tweets. The original data format is retained by setting the fileType to DataStream.
The file path is defined so that the files end up in a series of directories for the year, month, day and
hour during which the event occurs. The timestamp setting is set to true in the configuration; Flume uses it to
determine the timestamp of the event and to resolve the full path where the event should end up.
3.2 Storage Layer
In the Storage Layer, HDFS is used to provide scalable and reliable data storage. HDFS uses a master/slave
architecture, and a single NameNode serves as the master server. The NameNode executes file system namespace operations
such as opening, closing, and renaming files and directories, and also determines the mapping of blocks to DataNodes.
DataNodes store the actual data in HDFS. The input file is split into one or more blocks and these blocks
are stored on a set of DataNodes. Each block size is 64 MB. DataNodes are responsible for serving read and write
requests from the clients; they also perform block creation, deletion, and replication upon instruction
from the NameNode.
3.3 Processing Layer
YARN and MapReduce 2 are located in the Processing Layer to process vast amounts of data in parallel on clusters
of commodity hardware in a reliable, fault-tolerant manner. A MapReduce job splits the input data set into
independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the
outputs of the maps, which are then input to the reduce tasks. Both the input and the output of the job are stored in
HDFS. The framework takes care of scheduling tasks, monitoring them and re-executing failed tasks. The
MapReduce framework consists of a single master ResourceManager, one slave NodeManager per cluster node, and an
MRAppMaster per application. The application specifies the input/output locations and supplies map and reduce
functions via implementations of interfaces and abstract classes. These, and other job parameters, comprise the job
configuration. The Hadoop job client then submits the job and configuration to the ResourceManager, which
assumes the responsibility of distributing the software and configuration to the slaves, scheduling and
monitoring tasks, and providing status and diagnostic information to the job client.
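The split-map-sort-reduce flow described above can be sketched in Python as a minimal word-count-style job (an illustration, not the paper's code: the function names and the local driver are assumptions, and on a real cluster, e.g. via Hadoop Streaming, the framework itself performs the shuffle/sort between the two phases):

```python
import itertools

def map_tweets(line):
    """Map phase: emit a (word, 1) pair for every token in one input chunk."""
    for word in line.lower().split():
        yield (word, 1)

def reduce_counts(pairs):
    """Reduce phase: sum the counts for each key. sorted() simulates the
    shuffle/sort step the framework runs between map and reduce."""
    for key, group in itertools.groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (key, sum(count for _, count in group))

# Local driver standing in for the MapReduce framework
pairs = [kv for line in ["good phone good", "bad phone"] for kv in map_tweets(line)]
print(dict(reduce_counts(pairs)))  # {'bad': 1, 'good': 2, 'phone': 2}
```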
3.4 Analytics Layer
In this layer, sentiment analysis is performed by combining lexicon-based and machine-learning-based approaches.
The SentiStrength lexicon-based scheme is applied for class labeling, and the class-labeled data is used as the training
data to build the multi-tier classification model. Data cleaning and preprocessing are performed to develop an
effective classification model. The Mahout scalable machine-learning library, specifically designed to use Hadoop to
enable scalable processing of huge data sets, is used to develop the classification model.
4. SENTIMENT ANALYSIS SYSTEM FOR SOCIAL BIG DATA
4.1 Sentiment Analysis
Sentiment analysis [7] is an increasingly popular text mining method to determine the opinion of a text; it is
often referred to as opinion mining. Sentiment analysis uses individual elements of a text at different text levels,
such as the whole document, a paragraph, a sentence or only a text window, and tries to correctly determine and assign
particular moods and emotions to the respective entities, also called objects. The challenge of this analysis [9] is to
replicate the human process of understanding text information: assigning the individual polarized information
and interpreting it as the human mind does. The simplest way to do sentiment classification is using the lexicon-based
approach, also known as the bag-of-words or dictionary-based approach. This allows a more accurate assessment of
the individual components of the text. Machine learning techniques have been widely applied in the text
classification area, and most of them are supervised learning classification methods. Acquiring effective training
data is a challenge, although learning-based approaches outperform lexicon-based ones (Pang et al.). Manual
labeling of training data is time- and labor-consuming.
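The bag-of-words idea can be made concrete with a minimal sketch (the tiny lexicon and the function name below are illustrative assumptions; real dictionaries such as SentiStrength's are far larger and include strength rules):

```python
# Toy sentiment lexicon mapping words to polarity strengths
# (illustrative values only; not the paper's actual dictionary).
LEXICON = {"love": 3, "great": 2, "good": 1, "bad": -1, "awful": -3}

def lexicon_score(text):
    """Sum the polarity strength of every known word in the text
    (a pure bag-of-words scorer: word order is ignored)."""
    return sum(LEXICON.get(word, 0) for word in text.lower().split())

print(lexicon_score("good camera but awful battery"))  # -2
```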
4.2 Sentiment Analysis of Social Big Data
The data from social media [13] is multi-structured (Variety), very large scale and distributed (Volume) and
sometimes needs to be analyzed in a real-time manner (Velocity). In addition, the fourth V, Value, is needed for
sentiment analysis on social data. This specification of the data source is known as the 4 Vs aspect of the big data
domain. Sentiment analysis of social media data helps the organization to determine marketing strategy by
providing public opinion. Efficient techniques to collect social data and extract valuable information from the
collected data are an essential demand. Traditional sentiment classification techniques do not perform well on social data. In
this work, a Multi-tier Sentiment Analysis System on a Big Data Analytics Platform is proposed for analyzing social big
data.
4.3 Multi-tier Sentiment Analysis System on Big Data Analytics Platform
The MSABDP is developed with four processes: data collection, data cleaning and preprocessing, class labeling
and multi-tier sentiment classification. Apache Flume is used to collect the real Twitter data. As the collected tweets
include noise and irrelevant data, data cleaning and preprocessing are performed in order to produce normalized data
for labeling the classes. The sentiment classification is developed by combining a lexicon-based classifier and a
supervised machine learning approach. The classification process is implemented on the MapReduce software
framework to support distributed computing, which allows parallel programming using the functional concepts of
Map and Reduce. Instead of labeling the classes manually, the SentiStrength lexicon-based classifier is used for the task of
annotating the training data for the learning-based classifier. Concurrently, the tweet text data is preprocessed to
improve the accuracy of classification, and the preprocessed data with class labels is combined as the input of the
learning-based classification. In the reduce stage, the result data are combined to produce a new set of output, and
the classification model is developed using the Mahout machine learning library. Newly arriving test data is
classified using the model in order to produce the classified tweets. The process flow of the proposed system is
shown in Figure 2.
Figure 2. Process Flow Diagram of MSABDP
(The figure depicts: data collection with Flume, where the Twitter Stream API feeds a TwitterSource, Memory Channel and HDFS Sink; data cleaning and preprocessing, covering selecting the tweet text feature, removing duplicate tweets, removing noisy data, removing character repetitions, removing stopwords, negation handling, replacing emoticons with their sentiment values and replacing abbreviations with their expansions; class labeling, consisting of sentiment score calculation against the lexicon; and multi-tier classification, where Model I separates Pos (SP/P), Neu and Neg (SN/N), Model II separates SP from P, and Model III separates SN from N. SP: strongly_positive; P: positive; Neu: neutral; SN: strongly_negative; N: negative.)
5. Data Collection
The real tweet data are collected using Apache Flume [9], and the collected real data is used to evaluate the
system. Apache Flume deploys the Twitter agent in order to collect the data generated from the TwitterSource
(Twitter Stream API) and transfer it to HDFS through the MemoryChannel.
5.1. Twitter Source
The Twitter source processes events and moves them along by sending the stream data into a memory channel
[10]. Sources operate by gathering discrete pieces of data, translating the data into individual events, and then
using the channel to process events one at a time, or as a batch. In this system, a custom TwitterSource is used as the
data source. Data are limited by keywords and language. In the TwitterSource, the twitter4j library is used to keep
access to the Twitter Streaming API. In order to connect to the Twitter APIs, application-specific secrets
such as the Consumer Key, Consumer Secret, Access Token and Access Secret need to be used. To obtain these
secrets, a Twitter application must be created, via the link https://apps.twitter.com/. It generates the
Consumer key, Consumer secret, Access token, and Access token secret.
5.2. Memory Channel
The channel acts as a pathway between the TwitterSource and the HDFS Sink. Events are added to the channel by the
TwitterSource and later removed from the channel by the HDFS Sink. It is used as an in-memory queue to store events
until they're ready to be written to a sink. As the channel holds all events in memory, the channel's capacity and
transaction capacity are limited by the "capacity" and "transactionCapacity" parameters in the configuration file. In
this work, the channel's capacity is set to 10,000 and the transaction capacity is set to 100.
5.3. HDFS Sink
The HDFS Sink writes events to a configured location in HDFS. The HDFS Sink configuration defines the
size of the files with the rollCount parameter, which is set to 10,000, so each file ends up containing 10,000
tweets. It also retains the original data format by setting the fileType to DataStream and the writeFormat to
Text.
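Taken together, the Twitter agent described in Sections 5.1-5.3 corresponds to a Flume configuration along the following lines (a sketch: the agent name, the custom source class name, the keyword list and the HDFS path are assumptions, and the credential values are placeholders; the capacity, transactionCapacity, rollCount, fileType and writeFormat values are those stated in the text):

```properties
# Twitter agent: custom TwitterSource -> memory channel -> HDFS sink
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Custom event-driven source using twitter4j (class name is illustrative)
TwitterAgent.sources.Twitter.type = com.example.flume.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumer-key>
TwitterAgent.sources.Twitter.consumerSecret = <consumer-secret>
TwitterAgent.sources.Twitter.accessToken = <access-token>
TwitterAgent.sources.Twitter.accessTokenSecret = <access-token-secret>
TwitterAgent.sources.Twitter.keywords = iphone, apple, mobile

# In-memory queue between source and sink (values from the paper)
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

# HDFS sink: 10,000 events per file, raw format retained, time-based path
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://namenode:8020/user/flume/tweets/%Y/%m/%d/%H
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.sinks.HDFS.hdfs.useLocalTimeStamp = true
```

The year/month/day/hour escape sequences in hdfs.path realize the directory layout described in Section 3.1, and useLocalTimeStamp stands in for the "timestamp is set to true" setting mentioned there.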
6. Data Cleaning and Preprocessing
As Flume ingests the data in a nested JSON format which may contain irrelevant and duplicated data, this data has
to be cleaned and preprocessed for effective analysis. Selecting the tweet text feature, removing duplicate tweets,
removing noisy data, removing character repetitions, removing stopwords and negation handling are performed during
this process. The process not only simplifies the classification task, but also serves to greatly decrease the processing
cost in the training phase.
6.1. Selecting Tweet Text Feature
Many tweet data features are included in one record of tweet stream data, as shown in Figure 3. The sentiment
classification system focuses on the tweet text feature in the English language. The tweet text feature is selected
among the other features because it expresses Twitter users' feelings and opinions. The tweet id feature is also selected and
assigned as the key, while the tweet text feature is assigned as the value in the MapReduce process. To select the tweet id and
tweet text features, the collected tweet data in JSON format is parsed into a JSON object using a JSON parser, and
"tweets_id" and "tweets_text" are extracted as strings from the JSON object.
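A minimal sketch of this key/value extraction in Python (the flat field names "id" and "text" are assumptions for illustration; the real Flume-ingested record is nested JSON with many more attributes):

```python
import json

def extract_id_and_text(raw_record):
    """Parse one raw tweet record and return the (id, text) pair used as
    the key and value in the MapReduce process. Field names are assumed."""
    obj = json.loads(raw_record)
    return str(obj["id"]), obj["text"]

record = '{"id": 123, "text": "Loving the new phone!", "lang": "en"}'
print(extract_id_and_text(record))  # ('123', 'Loving the new phone!')
```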
6.2. Removing Duplicate Tweets
Twitter stream data may include many duplicate retweets. To remove them, the extracted
"tweets_id" and "tweets_text" are checked against the list of tweet data to see whether they have already been stored.
If the extracted features have not been stored yet, they are added to the list of tweet data; otherwise, they are
discarded. Table 1 shows the selected sample tweets_text and tweets_id features.
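The duplicate check can be sketched as follows (using a set of seen ids for the membership test is an implementation choice for illustration, not the paper's exact data structure):

```python
def drop_duplicates(tweets):
    """Keep only the first occurrence of each tweet id,
    discarding duplicate retweets of the same record."""
    seen_ids = set()
    unique = []
    for tweet_id, text in tweets:
        if tweet_id not in seen_ids:
            seen_ids.add(tweet_id)
            unique.append((tweet_id, text))
    return unique

tweets = [("1", "great phone"), ("2", "bad battery"), ("1", "great phone")]
print(drop_duplicates(tweets))  # [('1', 'great phone'), ('2', 'bad battery')]
```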
6.3. Removing Noisy Data
The term noisy data describes any piece of information within the tweet that will not be useful for the
machine learning algorithm to assign a class to that tweet. Noisy data includes character repetitions, website links,
@username mentions, punctuation and additional white space. Hashtags are replaced with the
same word without the hashtag; for example, #fun is replaced with fun. Links to websites that start with www.* or
http are replaced with "URL". @username instances found in tweets are converted to "usermentionsymbol" so that
the classifier can easily identify that a user is being referenced. Non-alphabetic characters are replaced with spaces.
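These rules can be sketched with regular expressions in Python (the exact patterns and their ordering are assumptions consistent with the description above, not the paper's code):

```python
import re

def remove_noise(text):
    """Apply the noise-removal rules: hashtags, links, mentions,
    non-alphabetic characters and extra whitespace."""
    text = re.sub(r"#(\w+)", r"\1", text)                  # #fun -> fun
    text = re.sub(r"(?:https?://|www\.)\S+", "URL", text)  # links -> URL
    text = re.sub(r"@\w+", "usermentionsymbol", text)      # mentions
    text = re.sub(r"[^A-Za-z ]", " ", text)                # non-alphabets
    return re.sub(r"\s+", " ", text).strip()               # extra spaces

print(remove_noise("@bob loving #fun day!! see http://t.co/x"))
# usermentionsymbol loving fun day see URL
```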
6.4. Removing Character Repetition
To remove character repetitions, the input data is checked for the presence of repetitive characters. The
previous and current characters are compared by indexing through each word in one record of a tweet. If the previous
and current characters are the same, the repetition is counted. If a character is repeated more than
twice, the procedure returns true and the extra repeated characters are deleted with a substring function; otherwise,
it returns false.
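The same effect can be sketched with a regular expression instead of a character-by-character scan (collapsing runs of three or more identical letters down to two is an assumption, consistent with the spelling-correction rule described later in Section 7):

```python
import re

def collapse_repeats(word):
    """Collapse runs of 3+ identical characters down to two,
    e.g. 'haaaappy' -> 'haappy'."""
    return re.sub(r"(.)\1{2,}", r"\1\1", word)

print(collapse_repeats("haaaappy"))  # haappy
```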
6.5. Removing Stopwords
When working with text classification methods, removal of stopwords is a common approach to reduce noise in
the data. In this work, not only common stopwords but also stopwords based on the classification domain, identified
by manually examining the data, are considered. For example, the domain stopwords include iphone, apple and mobile.
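A sketch of combined common and domain stopword removal (the common stopword set here is a small illustrative sample, not a full list):

```python
COMMON_STOPWORDS = {"the", "a", "is", "to", "and"}
# Domain-specific stopwords found by manually examining the data (from the paper)
DOMAIN_STOPWORDS = {"iphone", "apple", "mobile"}

def remove_stopwords(text):
    """Drop both common and domain-specific stopwords from the text."""
    stopwords = COMMON_STOPWORDS | DOMAIN_STOPWORDS
    return " ".join(w for w in text.split() if w not in stopwords)

print(remove_stopwords("the iphone camera is great"))  # camera great
```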
6.6. Negation Handling
Negation handling is one of the factors that significantly affect the accuracy of a learning-based classifier. For
example, the word "good" in the phrase "not good" will contribute to positive rather than negative
sentiment if the presence of "not" before it is not taken into account. Negation handling transforms a word preceded
by "not" or "n't" into "not_" + word.
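A minimal sketch of this transformation (the regular expression is an assumption; it prefixes the word that follows "not" or an "n't" contraction):

```python
import re

def handle_negation(text):
    """Prefix the word after 'not' or an "n't" contraction with 'not_'
    so that e.g. 'not good' keeps its negative sense as 'not_good'."""
    return re.sub(r"\b(?:not|\w+n't)\s+(\w+)", r"not_\1", text)

print(handle_negation("this is not good"))  # this is not_good
```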
6.7. Replacing Emoticons with Sentiment Values
As the maximum length of a tweet is 140 characters, an emoticon can represent the sentiment of the whole
tweet [7]. Therefore, emoticons can be replaced with their sentiment values (such as "feeling sad" and "feeling
happy") by using happy and sad emoticon dictionaries. The happy emoticon dictionary is used to label 90
emoticons and the sad emoticon dictionary 107 emoticons. For example, ":)" is labeled as "feeling
happy" whereas ":=(" is "feeling sad". Table 3 shows the sample emoticons and their sentiment values. The
preprocessed tweets with replaced emoticons can be seen in Table 1.
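The replacement can be sketched with small dictionaries (illustrative subsets only; the paper's dictionaries hold 90 happy and 107 sad emoticons):

```python
# Tiny illustrative emoticon sets; the real dictionaries are much larger.
HAPPY = {":)", ":-)", ":o)", ":]"}
SAD = {":(", ":-(", ":c", ":["}

def replace_emoticons(text):
    """Replace emoticon tokens with their sentiment values."""
    words = []
    for token in text.split():
        if token in HAPPY:
            words.append("feeling happy")
        elif token in SAD:
            words.append("feeling sad")
        else:
            words.append(token)
    return " ".join(words)

print(replace_emoticons("great game :)"))  # great game feeling happy
```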
10. Table 1: Sample Emoticons and their Sentiment Values
6.8. Replace Abbreviation with their expansions
As the abbreviation in Tweets can boost the sentiment values, it is replaced with their expansions by using
abbreviation dictionary. The dictionary has translations for 330 abbreviations [2]. The most frequent used
abbreviations can be seen in Table 2.
Table 2: Sample abbreviations and their expansion
7. Class Labeling
Instead of manual labeling the class, SentiStrength lexicon based approach is used for annotating the training data
of the learning-based classifier. SentiStrength [16], a lexicon-based classifier, uses additional (non-lexical) linguistic
information and rules to detect sentiment strength in short informal English text. The contextual valence shifter:
negation and intensifier are used to evaluate the context sentiment and to solve the context dependent problem by
applying NegationWordList and BoosterWordList. There are consists of eight dictionaries: BoosterWordList,
EmoticonLookupTable, EmotionLookupTable, EnglishWordList, IdionLookupTable, NegationWordList,
QuestionWords and slangLookupTable.
The EmotionLookupTable consists of 2546 sentiment words with their strengths. Some words include Kleene star
stemming (e.g., ador*). The word “miss” is a special case with both a positive and a negative strength of 2, as it is
frequently used to express sadness and love simultaneously. A spelling correction algorithm deletes repeated letters in a
word when the letters are repeated more often than is normal for English or, if a word is not found in an English
dictionary, when deleting repeated letters creates a dictionary word. The booster word list is used to strengthen or
weaken the emotion of sentiment words; it consists of 28 words with their sentiment strengths, for example “totally”,
“completely” and “might”. An idiom list is used to identify the sentiment of a few common phrases, and it overrides
individual sentiment word strengths. Negations reverse the semantic polarity of a particular term and skip any
intervening booster words. The default multiplier for negated words is 0.5; in this work, the multiplier is instead set to 1.
Two or more repeated letters added to a word boost the strength of the sentiment word by 1; for instance, “haaaappy” is
more positive than “happy”. An emoticon list with polarities is used to identify additional sentiment; it consists of 116
common emoticons with their polarity strengths (-1 or 1). Sentences with exclamation marks have a minimum positive
strength of 2, unless negative. Repeated punctuation including one or more exclamation marks boosts the strength of the
Emoticons               Sentiment Value
:-) :) :o) :] :3 :c)    Feeling happy
:-( :( :c :[            Feeling sad

Abbreviation            English Expansion
gr8, gr8t               great
lol                     laughing out loud
rotf                    rolling on the floor
bff                     best friend forever
immediately preceding sentiment word by 1. Negative sentiment is ignored in questions. For each text,
SentiStrength outputs a positive sentiment score from 1 to 5 and a negative score from -1 to -5.
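The interplay of these rules can be sketched as follows. This is an illustrative simplification, not SentiStrength's actual implementation: the small dictionaries below contain invented word strengths, and only the booster and negation rules (with the negation multiplier set to 1, i.e. full polarity reversal) are modeled.

```python
# Simplified, illustrative sketch of SentiStrength-style scoring; the word
# strengths below are invented and the real tool applies many more rules.
EMOTION = {"love": 3, "happy": 2, "sad": -2, "hate": -4}
BOOSTER = {"totally": 1, "completely": 1, "might": -1}
NEGATIONS = {"not", "never", "no"}

def score_sentence(words):
    """Return (positive, negative) strengths in SentiStrength's 1..5 / -1..-5 ranges."""
    pos, neg = 1, -1
    boost = 0
    negated = False
    for w in (t.lower() for t in words):
        if w in NEGATIONS:
            negated = True
            continue
        if w in BOOSTER:        # negation skips intervening booster words
            boost += BOOSTER[w]
            continue
        if w in EMOTION:
            s = EMOTION[w] + (boost if EMOTION[w] > 0 else -boost)
            if negated:         # multiplier of 1 reverses the polarity outright
                s = -s
            if s > 0:
                pos = max(pos, min(s, 5))
            elif s < 0:
                neg = min(neg, max(s, -5))
        boost, negated = 0, False
    return pos, neg
```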
Figure 3 shows the procedure for calculating the sentiment score by applying SentiStrength. The procedure is
performed with a MapReduce function. At the Mapper stage, the collected raw data is parsed with the JsonParser in
order to select the tweet and tweet_id. After data cleaning and preprocessing, the sentiment score is calculated with
SentiStrength. If the sentiment score is greater than 1, the output label is “strongly_positive”. If the sentiment score
is greater than 0 and less than or equal to 1, the output is “positive”. If the score is less than -1, the output is
“strongly_negative”. If the score is less than 0 and greater than or equal to -1, the output is “negative”. Otherwise,
the output is “neutral”. The Reduce stage is not strictly necessary because no result combination is needed;
currently, the Reduce stage simply outputs the results obtained by the Mappers.
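The score-to-label mapping used by the Mapper can be sketched directly (a minimal sketch; the combined score is assumed here to be a single number derived from the positive and negative strengths, which the text does not spell out):

```python
def label_from_score(score: float) -> str:
    """Map a combined sentiment score onto the five class labels
    used to annotate the training data."""
    if score > 1:
        return "strongly_positive"
    if score > 0:          # 0 < score <= 1
        return "positive"
    if score < -1:
        return "strongly_negative"
    if score < 0:          # -1 <= score < 0
        return "negative"
    return "neutral"
```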
Figure 3. Class Labeling Procedure
8. Multi-tier Classification
In order to implement the multi-tier classification, there are two main parts: multi-tier classification model
development and classification by the developed models. The classification models are developed with the multi-tier
architecture, and new (unknown) data is classified by the developed models.
8.1. Multi-tier Classification Model Development
Developing the classification model is a vital part of the proposed sentiment analysis system. The Mahout naive Bayes
classifier, a scalable machine learning implementation, is used to develop the classification model. Naive Bayes is a
learning algorithm frequently employed to tackle text classification problems; it is computationally very
efficient and easy to implement. The preprocessed class-labeled data set is split into training and testing datasets in order
Procedure: Class Labeling Job
1. Input: raw tweets (Twitter text data) // k: key, v: value
2. Output: k: preprocessed tweet & tweet_id, v: class label
3. Function MAPPER(key, values)
4.   while (value ∈ values):
5.     Parse the data (tweet, tweet_id) with JsonParser
6.     Perform data cleaning and preprocessing
7.     Calculate the total polarity strength for each sentence by applying SentiStrength_Data
8.     If (score > 1) then print “strongly_positive”
9.     Else if (score > 0 && score <= 1) then print “positive”
10.    Else if (score < 0 && score >= -1) then print “negative”
11.    Else if (score < -1) then print “strongly_negative”
12.    Else print “neutral”
13.    emit(tweet & tweet_id, class label)
14.  end while
15. End function
16. Function REDUCER(key, values)
17.   emit(tweet & tweet_id, class label)
18. End function
to build the classification model. The input data is transformed into a sequence file. As this sequence file consists
of key-value pairs, the class category and tweet_id are set as the key and the tweet text is set as the value. Feature
generation is performed by creating sparse vectors; TF-IDF feature vectors are used to improve the performance of
the classification models. For the multi-tier architecture, three classification models are developed, and each
model inherits the same configuration as the first model. Figure 4 illustrates the procedure for developing
classification model I. To develop the first model (Model I), class-labeled data whose class category is “P” or
“SP” is relabeled as “Pos”, and data whose class category is “N” or “SN” is relabeled as “Neg”. In
Model I, all of the labeled data (class categories Pos, Neu and Neg) is used to train the classifier, and a new test
instance (unlabeled data) is classified into one of three classes (Pos, Neu, Neg). For the first model, positive and
strongly_positive sentiment data are classified as “Pos” and negative and strongly_negative sentiment data are
classified as “Neg”. Neutral sentiment data is still labeled as “Neu” in Model I, and the neutral class is directly
appended to the final classification result. New instances classified as “Pos” proceed to Model II and the other
instances proceed to Model III.
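The tier-1 relabeling and routing described above can be sketched as follows (a minimal illustration; the class codes are taken from the text, while the helper names are hypothetical):

```python
# Collapse the five fine-grained class categories into Model I's coarse
# classes, and route each coarse prediction to the stage that refines it.
TIER1 = {"P": "Pos", "SP": "Pos", "N": "Neg", "SN": "Neg", "Neu": "Neu"}
NEXT_STAGE = {"Pos": "Model II", "Neg": "Model III", "Neu": "final result"}

def collapse_labels(labels):
    """Map fine-grained class categories onto Model I's coarse classes."""
    return [TIER1[y] for y in labels]

def route(coarse_label):
    """Decide where a Model I prediction goes next."""
    return NEXT_STAGE[coarse_label]
```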
Figure 4. Procedure for Developing Classification Model I
In Model II, the labeled data whose class category is “P” or “SP” is used to train the classifier, and test
instances whose class category is “Pos” are classified into two classes: “positive” and “strongly_positive”. The
procedure for developing Model II is illustrated in Figure 5.
Figure 5. Procedure for Developing Classification Model II
Figure 6 shows the procedure for developing classification model III. In Model III, the class-labeled data
whose class category is “N” or “SN” is used to train the classifier, and new test instances whose class category is
“Neg” are classified into two classes: “negative” and “strongly_negative”.
Procedure : Developing Classification Model I
Input: Class Labeled data (Pos, Neu, Neg)
Output: Classification model I
1. Begin
2. Transform input data to Sequence File
3. Generate TFIDF feature vector to create the model
4. Develop classification model I
5. End
Procedure : Developing Classification Model II
Input: Class Labeled data (P,SP)
Output: Classification model II
1. Begin
2. Transform input data to Sequence File
3. Generate TFIDF feature vector to create the model
4. Develop classification model II
5. End
Figure 6. Procedure for Developing Classification Model III
8.2. Classification by the Developed Models
Figure 7. Distributed Classification Procedure
The three trained models are used to classify new instances. For classification, the naive Bayes classifier uses
probabilities to decide which class best matches a given input text. Word ids and TF-IDF weights are used to create
a vector for the new tweet, and the naive Bayes classifier classifies this vector. For Model I, the scores of the three
class labels are calculated. The “bestscore” is initialized to “-Double.MAX_VALUE”, the “bestcategoryId” is set to
“-1”, and “categoryId” is set to the index of the classification result. If the indexed score is greater than “bestscore”,
“bestcategoryId” is replaced with “categoryId”. If the “bestcategoryId” equals “1”, the classifier outputs “Pos”; if it
equals “0”, the classifier outputs “Neu”; otherwise, it outputs “Neg”. For Model II,
Procedure: Developing Classification Model III
Input: Class Labeled data (N, SN)
Output: Classification model III
1. Begin
2. Transform input data to Sequence File
3. Generate TFIDF feature vector to create the model
4. Develop classification model III
5. End
Procedure: Classification Job
Input: New Collected Tweet Stream Data
Output: k: tweet & tweet_id, v: class category
1. Function MAPPER(key, values)
2.   while (value ∈ values):
3.     Perform data cleaning and preprocessing
4.     Create feature vector by using word id and TF-IDF value
5.     Classify feature vector of the new instance by applying Model I
6.     Print classified result (“Pos”, “Neu”, “Neg”)
7.     If (result == “Pos”)
8.       Classify feature vector of the tweet (class category “Pos”) by applying Model II
9.       Print classified result (“positive”, “strongly_positive”)
10.    Else if (result == “Neg”)
11.      Classify feature vector of the tweet (class category “Neg”) by applying Model III
12.      Print classified result (“negative”, “strongly_negative”)
13.    Else
14.      Print classified result (“neutral”)
15.    End if
16.    Emit(tweet & tweet_id, class category)
17.  End while
18. End function
19. Function REDUCER(key, values)
20.   Emit(tweet & tweet_id, class category)
21. End function
the scores of the two class labels are calculated. If the “bestcategoryId” equals “1”, the classifier outputs
“strongly_positive”; otherwise, it outputs “positive”. For Model III, the scores of the two class labels are likewise
calculated: if the “bestcategoryId” equals “1”, the classifier outputs “strongly_negative”; otherwise, it outputs
“negative”. The algorithm is considered naive because it assumes that the value of a particular feature is independent
of the value of any other feature, given the class variable. Laplace smoothing is performed with the value of α set to 1.
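The best-category selection described above can be sketched in Python (Java's -Double.MAX_VALUE becomes -math.inf here; the id-to-label mapping for Model I follows the text, and the helper names are illustrative):

```python
import math

def best_category(scores):
    """Return the index of the highest-scoring class, mirroring the
    bestscore / bestcategoryId loop described in the text."""
    best_score = -math.inf
    best_category_id = -1
    for category_id, score in enumerate(scores):
        if score > best_score:
            best_score = score
            best_category_id = category_id
    return best_category_id

# Model I's id-to-label mapping as described in the text.
MODEL_I_LABELS = {0: "Neu", 1: "Pos"}  # any other id maps to "Neg"

def model_one_label(scores):
    return MODEL_I_LABELS.get(best_category(scores), "Neg")
```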
9. Result Analysis
In this section, the experimental parameters for implementing the proposed system, the dataset, and the
evaluation results are described.
9.1 Experimental Environment
The specifications of the devices and the necessary software components of the proposed system are described in Table 3.
Table 3. Experiment Parameters

Parameters            Specification
Server/Client OS      Ubuntu 14.04 LTS
Host Specification    Intel® Core i7-3770 CPU @ 3.40GHz, 8GB Memory, 1TB Hard Disk
VMs Specification     4GB RAM, 100GB Hard Disk
Software Components   Hadoop 2.7.1, Flume 1.6, SentiStrength 2, Mahout 0.10.0
9.2 Datasets
In order to test the functionality of the proposed system and validate the achieved results, tweet stream data
related to the iPhone and Samsung mobile products is examined. 200,000 tweets are utilized as the training dataset
and a new batch of 1,000 tweets is applied as the test set.
9.3 Evaluation Results
In the experiment, the cluster is composed of 4 computing nodes (VMs), with one name node and three data
nodes. The pipeline from data collection to classification model development is run as an offline process.
Separately, new batches of tweets are collected and classified by applying the multi-tier classification model.
The evaluation of the system begins with the performance of the lexicon-based classifier. To establish the ground
truth, the results of the SentiStrength lexicon-based classifier are compared with manual classification. In this work,
5,000 tweets are randomly selected from the training data of the two mobile phone brands, and the results of
SentiStrength lexicon-based classification are then evaluated.
Figure 8 illustrates the comparative results of lexicon-based classification and manual classification for the
iPhone. The percentages of tweets classified with the SentiStrength-based approach into the positive,
strongly_positive, negative, strongly_negative and neutral class labels are 17, 7, 14, 3 and 59, respectively, while
the manually classified percentages are 22, 10, 12, 4 and 52. The error rate for lexicon-based classification is
therefore 5% for positive, 3% for strongly_positive, 2% for negative,
1% for strongly_negative and 7% for neutral. The overall accuracy rate is thus 82% and the error rate is 18% for
the iPhone data.
Figure 8. Comparison of SentiStrength Lexicon-based Classification and Manual Classification for iPhone
Figure 9. Comparison of SentiStrength Lexicon-based Classification and Manual Classification for Samsung
The comparative results of SentiStrength lexicon-based classification and manual classification for Samsung are
illustrated in Figure 9. The percentages of tweets classified with the SentiStrength-based approach into the positive,
strongly_positive, negative, strongly_negative and neutral class labels are 20, 5, 16, 6 and 53, respectively, while
the manually classified percentages are 24, 8, 11, 8 and 49. The error rate for lexicon-based classification is
therefore 5% for positive, 3% for strongly_positive, 2% for
negative, 1% for strongly_negative and 7% for neutral. The overall accuracy rate is thus 84% and the error rate is
16% for the Samsung data.
To evaluate the performance of the multi-tier classification, two sets of experiments are implemented. In the first
case, reference data, i.e., the instances correctly classified by the preceding models, is used; only local mistakes are
identified, because misclassified instances from the previous model are not considered. In the second case, accuracy
is computed on global data, which is based on the actual outputs of the preceding models; the global data may
therefore contain misclassified instances from all of the hierarchical classification models.
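The difference between the two evaluation modes can be made concrete with a small sketch (the helper names are hypothetical, not the paper's code):

```python
def accuracy(predicted, truth):
    """Fraction of correctly classified instances."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)

def tier2_inputs(tier1_pred, tier1_truth, mode):
    """Indices of instances fed to Model II under each evaluation mode.

    'reference': only instances Model I routed to Model II correctly,
    so only Model II's own (local) mistakes are measured.
    'global': everything Model I actually labeled "Pos", so Model I's
    routing errors propagate into Model II's evaluation.
    """
    if mode == "reference":
        return [i for i, (p, t) in enumerate(zip(tier1_pred, tier1_truth))
                if p == "Pos" and t == "Pos"]
    return [i for i, p in enumerate(tier1_pred) if p == "Pos"]  # "global"
```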
Table 4 shows the comparative results of single-tier and multi-tier classification for the iPhone. The results show a
higher accuracy on reference data, while accuracy is slightly lower on global data, for multi-tier classification. The
multi-tier classification (with reference data) is 7% higher than single-tier, and the multi-tier classification (with
global data) is 5% higher than single-tier. The comparative results of single-tier and multi-tier classification for
Samsung are illustrated in Table 5. The results show that the multi-tier classification (with reference data) achieves
higher accuracy than the multi-tier classification (with global data) and single-tier classification, by 4% and 6%
respectively.

Table 4. Comparison of single-tier and multi-tier results for iPhone

Architecture                  Model       Classified as       Accuracy (%)   Overall Accuracy (%)
Multi-tier classification     Model I     Pos                 83.4
(with reference data)                     Neg                 59.7
                                          Neu                 69.4
                              Model II    positive            71.7
                                          strongly_positive   77.4
                              Model III   negative            68.2
                                          strongly_negative   44.8           82.37
Multi-tier classification     Model I     Pos                 83.4
(with global data)                        Neg                 59.7
                                          Neu                 69.4
                              Model II    positive            42.8
                                          strongly_positive   59.6
                              Model III   negative            66.7
                                          strongly_negative   61.4           80.03
Single-tier classification                neutral             82.1
                                          positive            41.3
                                          strongly_positive   67.9
                                          negative            46.2
                                          strongly_negative   48.4           75.24

Table 5. Comparison of single-tier and multi-tier results for Samsung

Architecture                  Model       Classified as       Accuracy (%)   Overall Accuracy (%)
Multi-tier classification     Model I     Pos                 81.4
(with reference data)                     Neg                 71.7
                                          Neu                 59.4
                              Model II    positive            71.7
                                          strongly_positive   77.4
                              Model III   negative            48.2
                                          strongly_negative   59.8           80.14
Multi-tier classification     Model I     Pos                 80.4
(with global data)                        Neg                 49.3
                                          Neu                 53.4
                              Model II    positive            32.9
                                          strongly_positive   54.9
                              Model III   negative            67.1
                                          strongly_negative   68.2           78.13
Single-tier classification                neutral             82.7
                                          positive            42.1
                                          strongly_positive   66.5
                                          negative            39.9
                                          strongly_negative   43.7           74.53
Figure 10. Comparison of Performance (Running Time)
In order to test the scalability of the system, MSABDP runs the job with different numbers of tweets and
different numbers of nodes, where each node is deployed on a separate machine. Figure 10 shows the scalability of
MSABDP from a single-node cluster to a four-node cluster. According to the results, the running time of MSABDP
for a given volume of data decreases when more nodes are added to the cluster.
10. CONCLUSION
In this paper, a Big Data analytics platform (Hadoop) is developed to scale up the traditional analytics platform for
analyzing social big data. A multi-tier sentiment analysis system on a Big Data analytics platform (MSABDP) is
proposed to reduce multi-class classification complexity. Apache Flume is used to collect a huge amount of real
Twitter data, and MSABDP classifies the real Twitter data into five classes: strongly_positive, positive, neutral,
negative and strongly_negative. The sentiment classification is implemented by combining lexicon-based and
supervised machine learning-based approaches. The system enables the high-level performance of learning-based
classification while taking advantage of the lexicon-based classifier's effortless setup process. To reduce multi-class
classification complexity, a learning-based approach with a multi-tier architecture is applied. The proposed
system is evaluated on real Twitter data for the iPhone and Samsung mobile phone products. The evaluation shows
the reliability of the SentiStrength lexicon-based classification by comparison with manual classification results,
achieving an accuracy rate of 82% for the iPhone and 84% for Samsung. To evaluate the multi-class classification
scheme with the multi-tier architecture, two sets of experiments are implemented, with reference data and with
global data. For the iPhone product, the multi-tier classification (with reference data) is 7% higher than single-tier
and the multi-tier classification (with global data) is 5% higher than single-tier. For Samsung, the multi-tier
classification (with reference data) achieves higher accuracy than the multi-tier classification (with global data) and
single-tier classification, by 4% and 6% respectively. The running time of MSABDP for a given volume of data
decreases when more nodes are added to the cluster. In future work, the proposed sentiment analysis system will be
implemented with Spark and Mahout Samsara, and further experiments with more classifiers and data from other
domains will be carried out.
REFERENCES
[1] A. Tsakalidis, S. Papadopoulos, I. Kompatsiaris, “An Ensemble Model for Cross-Domain Polarity Classification on Twitter”, 15th Web
Information System Engineering, Thessaloniki, Greece, 14 October 2014.
[2] Abbreviation List. Available at: http://www.englishclub.com/eslchat/abbreviations.html [Online; accessed 10-June-2016].
[3] Andrew C. Oliver, “Machine learning with Mahout”. Available at: http://www.infoworld.com/article/2608418/application-
development/enjoy-machine-learning-with-mahout-on-hadoop.html, May 29, 2014 [Online; accessed 3-December-2016].
[4] B. Liu, E. Blasch, Y. Chen, D. Shen, G. Chen, “Scalable Sentiment Classification for Big Data Analysis Using Naive Bayes Classifier”,
IEEE International Conference on Big Data, IEEE, 2013, pp. 99-104.
[5] G. Vaitheeswaran, L. Arockiam, “Combining Lexicon and Machine Learning Method to Enhance the Accuracy of Sentiment
Analysis on Big Data”, International Journal of Computer Science and Information Technologies, Vol. 7(1), 2016, pp. 306-311.
[6] Hadoop YARN. Available at: https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/YARN.html [Online; accessed 20-
August-2016].
[7] Harsh Thakkar and Dhiren Patel, “Approaches for Sentiment Analysis on Twitter: A State of Art study”, CoRR abs/1512.01043, 2015.
[8] Johan Bollen, Huina Mao, and Alberto Pepe, “Modeling public mood and emotion: Twitter sentiment and socio-economic phenomena”,
in ICWSM, 2011.
[9] L. Sigler, “Text Analytics and Sentiment Analysis”. Available at: http://www.clarabridge.com/text-analytics-vs-sentiment-analysis/,
July 27, 2015 [Online; accessed 12-October-2016].
[10] M. Bouazizi, T. Ohtsuki, “Sentiment Analysis: from Binary to Multi-Class Classification”, IEEE ICC 2016 SAC Social Networking,
2016.
[11] M. Moh, A. Gajjala, S. C. R. Gangireddy, T. S. Moh, “On Multi-Tier Sentiment Analysis using Supervised Machine Learning”,
IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 2015.
[12] M. Skuza, A. Romanowski, “Sentiment Analysis of Twitter Data within Big Data Distributed Environment for Stock Prediction”,
Computer Science and Information Systems (FedCSIS), 13-16 Sept. 2015.
[13] N. M. Sharef, H. M. Zin and S. Nadali, “Overview and Future Opportunities of Sentiment Analysis”, Journal of Computer Sciences,
January 2016.
[14] S. Shaikh, “Flume Installation and Streaming twitter data using flume”. Available at: https://www.eduonix.com/blog/bigdata-and-
hadoop/flume-installation-and-streaming-twitter-data-using-flume/, June 30, 2015 [Online; accessed 20-Oct-2016].
[15] “Twitter Statistics”. Available at: http://www.statisticbrain.con/twitter-statistics/, 2013 [Online; accessed 2-January-2016].
[16] V. K. Singh, R. Piryani, A. Uddin, P. Waila, “Sentiment Analysis of Movie Reviews: A new Feature-based Heuristic for Aspect-level
Sentiment Classification”, International Multi-Conference on Automation, Computing, Communication, Control and Compressed
Sensing (iMac4s), 22-23 March 2013.
[17] V. N. Khuc, C. Shivade, R. Ramnath, J. Ramanathan, “Towards Building Large-Scale Distributed Systems for Twitter Sentiment
Analysis”, Proceedings of the 27th Annual ACM Symposium on Applied Computing, pp. 459-464, Trento, Italy, March 26-30, 2012.
Wint Nyein Chan is a tutor at Computer University (Hpa-an). She received her M.C.Sc. degree in computer
science from Computer University (Dawei), Myanmar, in 2011. She is a Ph.D. candidate at UCSY. Her
research interests include big data, cloud computing, and machine learning.
Dr. Thandar Thein received her M.Sc. (computer science) and Ph.D. degrees in 1996 and 2004, respectively,
from the University of Computer Studies, Yangon (UCSY), Myanmar. She did postdoctoral research at Korea
Aerospace University. She is currently a Pro-Rector of the University of Computer Studies, Maubin. Her
research interests include cloud computing, mobile cloud computing, big data, digital forensics, security
engineering, and network security.