SlideShare a Scribd company logo
1 of 18
Download to read offline
Multi-Tier Sentiment Analysis System in Big
Data Environment
Wint Nyein Chan*, Thandar Thein**
* University of Computer Studies, Yangon, **University of Computer Studies, Maubin
Email: wintnyeinchan2012@gmail.com, thandartheinn@gmail.com
Abstract- Over recent years, big data, a huge amount of structured and unstructured data is generated from social
Network. There needs to extract the valulable information from the social big data. The traditional analytic platform needs to be
scaled up for analyzing social big data in an efficient and timely manner. Sentiment Analysis of social big data helps the
organizations by providing business insights with public opinion. Sentiment analysis based on multi-class classification scheme is
oriented towards classification of text into more detailed sentiment labels. Multi-class classification with single tier architecture
where single model is developed and entire labeled data is trained may increase the classification complexity. In this paper,
multi-tier sentiment analysis system on big data analytics platform (MSABDP) is proposed to reduce the multi class classification
complexity and efficiently analyze large scale data set. Hadoop is built for big data analytics and it is a good platform for being
able to manage large data at scale and which can improve scalability and efficiency by adopting distributed processing
environment since they have been implemented using a MapReduce framework and a Hadoop distributed storage (HDFS). The
MSABDP is implemented by combining SentiStrength lexicon and learning based classification scheme with multi-tier
architecture and run on big data analytics platform for being able to manage large data at scale. The proposed system collects a
large amount of real Twitter data by using Apache Flume and the data was used for evaluation. The evaluation results have
shown that the proposed multi class classification system with multi-tier architecture is able to significantly improve the
classification accuracy over multi class classification based on single-tier architecture by 7%.
Keywords: big data analytics, hadoop, multi class classification, sentiment analysis, social big data
1. INTRODUCTION
As the rapid growth of the Internet and online activity, many services such as blogging, podcasting, social
networking and bookmarking are popular. These services allow users to create and share information within open
and closed communities and contribute to the volumes of the data. Moreover, in social network, data is created and
delivered from various systems operating in real-time by aggregating constant information about user activities and
interactions. Twitter has 320 million monthly active users and they posts 500 million tweets every day; Facebook
has 936 million daily and 1,440 million monthly active users as of December, 2015. These factors are reasons of a
rise of Big Data [5]. Big data is characterized by the volume, velocity, veracity, variety, value and volatility of data.
Nevertheless, the appropriate tools are needed to acquire, organize and derive value and volatility of data.
With the advent of social media, data is captured from different sources, such as mobile devices and web
browsers, and it is stored in various data formats. The traditional storage and analytics platform can’t handle the
different sources and different formats of the structured and unstructured data. Big Data Analytics has become
popular for analysing and managing large volume of the structure and unstructured data [12]. Hadoop is a good
platform for Big Data Analytics as it provides scalability, cost-effective, flexible, fast, secure and authentication,
parallel processing, Availability and resilient nature. It is an open-source software framework comprises of two
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 15, No. 9, September 2017
204 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
parts: storage part and processing part. The storage part is called the Hadoop Distributed File System (HDFS) and
the processing part is called MapReduce.
Analysing social data by providing customer certification through sentiment analysis helps the organization to
determine marketing strategy and improve customer services. Sentiment analysis is one of the main agenda in big
data that focus on various ways to analyse big data to identify patterns and relationships, make informed predictions,
deliver actionable intelligence and gain business insight from this steady influx of information [13]. Twitter is one of
the popular social media data platforms, which combines features of blogs and social network services. Twitter was
established in 2006 and experienced rapid growth of users in the first years [15]. Twitter is a good source of
information in the sense of snapshots of moods and feelings as well as up-to-date events and current situation
commenting. Actually, achieving a high level of performance for sentiment analysis for twitter data is very difficult.
At first, the large amount of volume and high velocity of twitter stream data is needed to be efficiently stored and
processed. Second, Twitter allows users to post no more than 140 characters for each post, which makes Twitter
sentiment classification can be lower performance than that when mining longer texts. In addition, classification into
multiple classes can increase the classification complexities, which tend to decrease the classification accuracy.
In this paper, Sentiment analysis system is proposed which is implemented by combing lexicon based and
learning based approaches, and shows that the proposed system is good performances in terms of classification
accuracy. The main contributions of this paper are as follows:
1. Big Data Analytics Platform is developed to scale up the traditional analytics platform for analyzing social
big data
2. Multi-tier sentiment analysis on Big Data Analytics platform (MSABDP) is proposed To reduce the multi
class classification complexity.
The remainder of this paper is arranged as follows: In section 2, the related work presents large scale sentiment
analysis in a distributed environment and multi class classification. Section 3 illustrates the system design to analyze
the social big data sets and presents the implementation of the proposed system. Section 4 shows the experimental
setup and results. Section 5 concludes the paper and describes the possible directions for future works.
2. RELATED WORK
In the past years, many works has been released in sentiment analysis, but there is a little work for sentiment
analysis on big data. There exist many possible variants; some of the papers related with the proposed system are
reviewed in this section.
The paper [11] presented multi-tier classification architecture for sentiment analysis. The architecture consists of
three models. The first model classifies three classes and the remaining two models are performed as the binary
classifier. To achieve high performance, features were selected by applying various feature selection techniques.
150,000 movie reviews posted on social media were used to train and test the performance of the system. Four
classifiers (Naïve Bayes, SVM, Random Forest, and SGD) were used in the experiments. The result showed that the
performance of proposed architecture was significantly improved prediction accuracy over the simple single tier
model by more than 10%. The proposed system runs on traditional analytics platform and classifies the movie
reviews into multi class by applying only supervised machine learning classifiers.
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 15, No. 9, September 2017
205 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
The authors [10] presented an approach that relies on writing patterns and special unigrams for multiclass
sentiment analysis. They extracted patterns that rely on PoS -Tag of words and then calculate the resemblance
degree res (p, t) of each pattern in the training set p to the tweet t. For each tweet, they extracted a 4 set of features;
referred to the training set and use Random Forest machine learning algorithms to perform the classification.
Training data set contains 21000 tweets which had been manually classified the class. The system classified the texts
which collected from Twitter into seven classes: happiness, sadness, anger, love, hate, sarcasm and neutral. In the
experiment, the accuracy of the multiclass sentiment analysis is 56.9%. Since the training data are manually
classified the class, no more methods were required to develop the training data sets.
In [17], As the total volume of tweets is extremely high and it taks a long time to process, the authors proposed a
distributed system for Twitter sentiment analysis with two components: lexicon builder and a sentiment classifier.
The lexicon builder was implemented based on Hadoop and HBase and they demonstrated that it scales with
increasing number of machines and the amount of training data. In particular, for 100000 tweets, the training time
was decreased by 35% when moving from 2 to 3 machines, and 40% for 2 to 4 machines, 47% for 2 to 5 machines.
For 300000 tweets, the running time was decreased by 23% when moving from a cluster with 4 machines to 5
machines. To achieve higher accuracy than the lexicon-based approach, the lexicon-based and machine-learning
based approaches are combined. Online Logistic Regression from the mahout machine learning library was used for
learning based approach. The experiment results show that their system can obtain higher accuracy by combining
lexicon-based and learning based approaches. The proposed system used the existing class labeled dataset and
classified into binary class.
B.Liu et al. [4] presented a simple and complete system for sentiment mining on large data sets using a Naive
Bayes classifier. To achieve fine-grain control of the analysis procedure, they implemented the NBC on top of
Hadoop framework with additional four modules: the work flow controller (WFC), the data parser, the user terminal
and the result collector. They evaluated the scalability of NBC in large data set and the result is encouraging in that
the accuracy of NBC is improved and approaches 82% when the dataset size increases. They have demonstrated that
NBC is able to scale up to analyze the sentiment of millions movie reviews with increasing throughput. The two
datasets in the experiments had already been labeled the class. Instead of applying Mahout Machine learning library,
NBC is implemented on top of Hadoop framework with additional modules.
In the proposed system, multi-tier sentiment analysis system is developed on big data analytics platform to reduce
the multi class classification complexity. Multi-tier sentiment classification is implemented by combining lexicon
and Supervised learning-based approaches. In this system, SentiStrength is used to label the class in order to develop
the training data sets. Mahout Naïve bayes with multi-tier architecture is used for multi class sentiment
classification. The proposed system can improve the scalability and efficiency by adopting distributed processing
environment since they are implemented using HDFS and MapReduce framework.
3. BIG DATA ANALYTICS PLATFORM DEVELOPMENT
Multi-tier Sentiment Analysis system is implemented on Big Data Analytics Platform with Apache Flume
[14], HDFS, MapReduce [6] and Mahout Machine learning library [3]. High level architecture of the MSABDP is
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 15, No. 9, September 2017
206 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
illustrated in Figure 1. It consists of four layers: Data Ingestion Layer, Storage Layer, Processing Layer, and
Analytics Layer.
In Data Ingestion Layer, data are collected from Twitter steaming API with Apache Flume. The collected data is
stored in Hadoop File System (HDFS), distributed storage, which is located in Storage Layer. One name node and
three data nodes are developed for distributed storage. Data cleaning and preprocessing, class labeling and multi-tier
classification are developed in Analytics Layer. All of the processes from Analytics Layer are implemented with
Map Reduce paradigm which is used for distributed processing and located in Processing Layer. The detail process
description of each layer is described the following subsection. The system can provide the capabilities such as
scalability, cost effective, and parallel processing by using Hadoop platform.
Processing Layer
Data
Ingestion
Layer
….
ApacheFlume(DataCollection)
Analytics Layer
Data Cleaning and Preprocessing Class Labeling Mahout Machine Learning(Multi-tier Classification)
Slave Node 1
Data Node
Node Manager
Map Task 2
Reduce Task 1
Containers
Slave Node 2
Data Node
Node Manager
Application Master
Map Task 1
Containers
Slave Node n
Data Node
Node Manager
Map Task N
Reduce Task M
Containers
Scheduler
Application Manager
Resource ManagerYarn
Master
Node
HDFS
Master
Node
Name Node
S Name Node
Processing Layer
Container Launch RequestNode Status Resource Request
Map Reduce Coordination Heart beats, balancing, replication
Figure1. High Level Architecture of MSABDP
3.1 Data Ingestion Layer
In this layer, Apache Flume is used to collect the tweet stream data and the data is ingested to HDFS through the
memory channel. Apache Flume is deployed by Twitter Agent which has 3 main components: a TwitterSource, a
Memory Channel and a HDFS Sink. The incoming tweets are limited by keywords and language in the
TwitterSource. The source comes as an event-driven source and receives events through mechanisms like callbacks
or RPC calls. In this TwitterSource, the twitter4j library is used to keep access to the Twitter Streaming API. The
application secrets like Consumer Key and Consumer Secret, Access Token, Access Secret have been used to
connect to the Twitter APIs. The memory channel acts as a pathway between the TwitterSource and HDFS Sink.
The channel uses as an in memory queue to store events until they’re ready to be written to a sink. As the channel
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 15, No. 9, September 2017
207 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
holds all events in memory, the channel’s capacity and transaction capacity are limited by the capacity and
transaction capacity parameters in the configuration file. Events are added to the channel by TwitterSource, and later
removed from the Channel by HDS Sink. HDFS Sink, which writes events to a configured location in HDFS. In the
HDFS Sink configuration, the size of files can be defined with the roll count parameter and set up to 10,000, so each
file will end up containing 10,000 tweets. The original data format is retained by setting the file Type to Data
Stream. The file path is defined as that the files will end up in a series of directories for the year, month, day and
hour during which the event occurs. The timestamp is set to true in the configuration and which is used by Flume to
determine the timestamp of the event, and is used to resolve the full path where the event should end up.
3.2 Storage Layer
In Storage Layer, HDFS is used to provide scalable and reliable data storage. HDFS serves master/slave
architecture and single NameNode serve as a master server. Name Node executes file system namespace operations
like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.
DataNodes is used to store the actual data in HDFS. The input file is split into one or more blocks and these blocks
are stored in a set of DataNodes. Each block size is 64 MB. DataNodes are responsible for serving read and write
requests from the clients. The DataNodes also perform block creation, deletion, and replication upon instruction
from the NameNode.
3.3 Processing Layer
Yarn and MapReduce-2 is located in the Processing Layer to process vast amounts of data in-parallel on clusters
of commodity hardware in a reliable, fault-tolerant manner. The MapReduce job splits the input data set into
independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the
outputs of the maps, which are then input to the reduce tasks. Both the input and the output of the job are stored in
HDFS. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks. The
MapReduce framework consists of a single master ResourceManager, one slave NodeManager per cluster-node, and
MRAppMaster per application. The application specifies the input/output locations and supply map and reduce
functions via implementations of interfaces and abstract-classes. These, and other job parameters, comprise the job
configuration. The Hadoop job client then submits the job and configuration to the ResourceManager which then
assumes the responsibility of distributing the software and configuration to the slaves, scheduling tasks and
monitoring them, providing status and diagnostic information to the job-client.
3.4 Analytics Layer
In this layer, sentiment analysis is performed by combing lexicon based and machine learning based approaches.
SentiStrength lexicon based scheme is applied for class labeling and the class labeled data is used as the training
data to build the multitier classification model. Data cleaning and preprocessing have been performed to develop the
effective classification model. Mahout scalable machine-learning library, specifically designed to use Hadoop for
enabling scalable processing of huge data sets, is used to develop the classification model.
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 15, No. 9, September 2017
208 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
4. SENTIMENT ANALYSIS SYSTEM FOR SOCIAL BIG DATA
4.1 Sentiment Analysis
Sentiment analysis [7] is an increasingly popular text mining method to determine the opinion of a text. It is
often referred to as opinion mining. Sentiment analysis uses individual elements of a text on different text levels,
like whole document, paragraph, sentence or only a text window, and tries to correctly determine and assign
particular moods and emotions to the respective entities, also called objects. The challenge of this analysis [9] is to
replicate a human process understanding of the text information assignment of the individual polarized information
and interpretation of the human mind. The simplest way to do sentiment classification is using the lexicon-based
approach, also known as a bag of words, or dictionary-based approach. This allows more accurate assessment of
the individual components of the text. The machine learning techniques have widely applied in the text
classification area and most of them are supervised learning classification methods. Acquiring effective training
data is a challenge, although learning based approaches outperforms the lexicon-based (Pang et al.,). Manual
Labeling for training data is time and labor consuming.
4.2 Sentiment Analysis of Social Big Data
The data from social media [13] is multistructured (Variety), very large scale and distributed (Volume) and
sometimes needs to be analyzed in a real time manner (Velocity). In addition, the fourth V, Value, is needed for
sentiment analysis on social data. The specification of such data source is known as the 4 Vs aspect in big data
domain. Sentiment analysis on the social media data helps the organization to determine marketing strategy by
providing public opinion. Efficient techniques to collect social data and extracting valuable information from
collected data are essential demand. Traditional sentiment classification techniques do not perform in social data. In
this work, Mult-tier Sentiment Analysis System on Big Data Analytics Platform is proposed for analyzing social big
data.
4.3 Mult-tier Sentiment Analysis System on Big Data Analytics Platform
The MSABDP is developed with four processes: data collection, data cleaning and preprocessing, class labeling
and multi-titer sentiment classification. Apache Flume is used to collect the real twitter data. As the collected tweets
include noise and irrelevant data, data cleaning and preprocessing are performed in order to produce normalized data
for labeling the class. The sentiment classification is developed by combining lexicon-based classifier and
supervised machine learning approach. The classification process is implemented on Map Reduce software
framework to support distributed computing, and it allows parallel programming using the function concept called
Map functions. Instead of labeling the class manually, SentiStrength lexicon based classifier is used for the task of
annotating the training data for the learning-based classifier. Concurrently, the tweets text data is preprocessed to
improve the accuracy of classification. And then preprocessed data with class label are combined as the input of
learning based classification. In reduce stage, the result data are combined and it produces a new set of output. And
the classification model is developed by using the mahout machine learning library. The newly coming test data is
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 15, No. 9, September 2017
209 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
classified using the model in order to produce the classified tweets. The process flow of the proposed system is
described in Figure 2.
Twitter Service(Twitter
Stream API )
Twitter
Source
HDFS
Sink
Memory Channel
Data Collection
FLume
Data Cleaning and Preprocessing
New Instances
SP : strongly_positive; P: positive; Neu: neutral; SN: strongly_negative; N: negative
Multi-tier Classification
Removing Character Repetitions
Removing StopWords
Negation Handling
Removing Noisy Data
Selecting Tweet Text feature
Removing Duplicate Tweets
Replacing Emoticons with their Sentiment
Values
Replacing Abbreviation with their
expression
Training
Data Set
Testing
Data Set
Class
Labeled
Data Set
Machine Leaning Classifier
NegPos
Model I
(SP/P (Pos), Neu, SN/N(Neg) )
Classification Model
SP P SN NNeu
Model II
(SP, P)
Model III
(SN, N)
Class Labeling
Sentiment Score Calculation
Class Labeling
Lexicon
Figure2. Process Flow Diagram of MSABDP
5. Data Collection
The real tweet data are collected by using Apache Flume [9] and the collected real data is used to evaluate the
system. Apache Flume deploys the Twitter Agent in order to collect data that is generated from TwitterSource
(Twitter Stream API) and transferred over to HDFS through MemoryChannel.
5.1. Twitter Source
Twitter source processes events and moves them along by sending the stream data into a Memory channel
[10]. The sources operate by gathering discrete pieces of data, translating the data into individual events, and then
using the channel to process events one at a time, or as a batch. . In the system, custom Twitter Source is used as a
data source. Data are limited by key words and language. In the TwitterSource, the twitter4j library is used to keep
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 15, No. 9, September 2017
210 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
access to the Twitter Streaming API. In order to connect to the Twitter APIs, some of the application specific secrets
like Consumer Key and Consumer Secret, Access Token, Access Secret are needed to be used. To get the
application specific secrets, it is needed to create a Twitter application. The Twitter application is created by the
following link https://apps.twitter.com/. It generates Consumer key, Consumer secret, Access token, and Access
token secret.
5.2. Memory Channel
The channel acts as a pathway between the TwitterSource and HDFS Sink. Events are added to the channel by
TwitterSource, and later removed from the Channel by HDFS Sink. It is uses as an in-memory queue to store events
until they’re ready to be written to a sink. As the channel holds all events in memory, the channel’s capacity and
transaction capacity is limited by the “capacity” and “transaction Capacity” parameters in the configuration file. In
the work, the channel’s capacity is set up to 10000 and the transaction capacity is setup to 100.
5.3. HDFS Sink
HDFS Sink, which writes events to a configured location in HDFS. In the HDFS Sink configuration, defines the
size of the files with the roll Count parameter and set up to 10,000, so each file will end up containing 10,000
tweets. It also retains the original data format, by setting the file Type to DataStream and setting writing Format to
Text.
6. Data Cleaning and Preprocessing
As flume ingests the data as the nested json format and it may contain irrelevant and duplicated data, this data has
to be cleaned and preprocessed for effective analysis. Selecting tweet text features, removing duplicate tweets,
removing noisy data, removing character repetition, removing stopwords and negation handling is performed during
this process. The process not only simplifies the classification task, but also serves to greatly decrease the processing
cost in the training phase.
6.1. Selecting Tweet Text Feature
Many tweet data features are included in one record of tweet stream data as shown in Figure 3. The sentiment
classification system is focused on tweet text feature with English language. The tweet text feature is selected
among other feature because it expresses twitter users’ feeling and opinion. The tweet id feature is also selected to
assign as a key and tweet text feature is assigned as a value in Map Reduce process. To select tweet id and tweet text
feature, the collected tweet data with json format is fetched as a json object using json parser. And “tweets_id” and
“tweets_text” are extracted as a string from json object.
6.2. Removing Duplicate Tweets
Twitter Stream data may include many duplicate retweets. To remove duplicate retweets, the extracted
“tweets_id” and “tweets_text” is checked whether already stored or not in the list of tweet data. If the extracted
features have not been already stored, they are added as the list of tweet data. Otherwise, they are removed. Table 1
shows the selected sample tweet_text and tweets_id features.
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 15, No. 9, September 2017
211 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
6.3. Removing Noisy Data
The term noisy data is used to describe any piece of information within the tweet that will not be useful for the
machine learning algorithm to assign a class to that tweet. There are included noisy data such as character
repetitions, website links with URL, @username, punctuation additional white space Replace hash tags with the
same word without the hash tags. For example, #fun is replaced with fun. Replaced website links with URL so links
to websites that start with www.* or http. Converted @username to “usermentionsymbol by replacing @username
instances found in tweets with “usermentionsymbol” for the classifier to easily identify that a user is being
referenced. Non-Alphabets are replaced with spaces.
6.4. Removing Character Repetition
To remove the character repetition, there is needed to be checked whether the repetitive characters contain or not
in the input data. To check whether the presence of repetitive character, the former and pervious characters are
compared by indexing word in one record of tweet. If the previous and current characters are the same, the character
is counted. If the quantity of character is two or more, the procedure returns as true. Otherwise, it returns false. If the
repetitive characters are found and the character count is more than 2, replace the character itself by deleting the
repetitive characters with substring function.
6.5. Removing Stopwords
When working with text classification methods, removal of stopwords is a common approach to reduce noise in
the data. In this work, not only common stopwords but also stopwords based on classification domain are considered
by manually examining the data. For example, domain stopwords contain iphone, apple, mobile, etc.
6.6. Negation Handling
Negation handling is one of the factors that significantly affect the accuracy of learning based classifier. For
example: the word “good” in the phrase “not good” will be contributing to positive sentiment rather that negative
sentiment as the presence of “not” before it is not taken into account. It transforms a word followed by a not or n’t
into “not_” + word.
6.7. Replacing Emoticons with Sentiment Values
As the maximum length of a tweet is 140 characters, the case of emoticon will represent the sentiment of that
tweet. [7] Therefore, emoticons can be replaced with their sentimental values (such as “feeling sad” and “feeling
happy”) by using happy and sad emotion dictionary. The happy emoticon dictionary is used to label with 90
emotions and sad emoticon dictionary is used to label with 107 emoticons. For example, “:)” is labeled as “feeling
happy” whereas “:=(” is “feeling sad”. Table 3 shows the sample emoticons and their sentiment values. The
preprocessed tweets that replaced emoticons can be seen in Table 1.
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 15, No. 9, September 2017
212 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
Table 1: Sample Emoticons and their Sentiment Values
6.8. Replace Abbreviation with their expansions
As the abbreviation in Tweets can boost the sentiment values, it is replaced with their expansions by using
abbreviation dictionary. The dictionary has translations for 330 abbreviations [2]. The most frequent used
abbreviations can be seen in Table 2.
Table 2: Sample abbreviations and their expansion
7. Class Labeling
Instead of manual labeling the class, SentiStrength lexicon based approach is used for annotating the training data
of the learning-based classifier. SentiStrength [16], a lexicon-based classifier, uses additional (non-lexical) linguistic
information and rules to detect sentiment strength in short informal English text. The contextual valence shifter:
negation and intensifier are used to evaluate the context sentiment and to solve the context dependent problem by
applying NegationWordList and BoosterWordList. There are consists of eight dictionaries: BoosterWordList,
EmoticonLookupTable, EmotionLookupTable, EnglishWordList, IdionLookupTable, NegationWordList,
QuestionWords and slangLookupTable.
EmotionLookupTable consists of 2546 sentiment words with their strength. Some words include Kleene star
stemming (e.g., ador*). The word “miss” is a special case with a positive and negative strength of 2. It is frequently
used to express sadness and loves simultaneously. A spelling correction algorithm deletes repeated letters in a word
when the letters are more frequently repeated than normal for English or, if a word is not found in an English
dictionary, when deleting repeated letters creates a dictionary word. A booster word list is used to strengthen or
weaken the emotion of sentiment words. A booster word list consists of 28 words and their strength of sentiment.
Some booster words are “totally, completely and might”. An idiom list is used to identify the sentiment of a few
common phrases. This overrides individual sentiment word strengths. Negations are used to reverse the semantic
polarity of a particular term, and skip any intervening booster words. Default multiplier for negated words is 0.5. In
this case, instead use of default multiplier for negated words, the multiplier is set to 1. At least two repeated letters
added to words give a strength boost sentiment words by 1. For instance, haaaappy is more positive than happy. An
emoticon list with polarities is used to identify additional sentiment. The emoticon list consists of 116 common
emoticon words and their polarity strength (-1 or 1). Sentences with exclamation marks have a minimum positive
strength of 2, unless negative. Repeated punctuation with one or more exclamation marks boost the strength of the
Emoticons Sentiment Values
:-) :) :o) :] :3 :c) Feeling happy
:-( :( :c :[ Feeling sad
Abbreviations English expansion
gr8,gr8t great
lol Laughing out loud
rotf Rolling on the floor
bff Best friend forever
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 15, No. 9, September 2017
213 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
immediately preceding sentiment word by 1. Negative sentiment is ignored in questions. For each text,
SentiStrength outputs a positive sentiment score from 1 to 5 and a negative score from -1 to -5.
Figure3 shows the procedure for calculating the sentiment score by applying SentiStrength. The procedure is
performed with Map Reduce function. At the Mapper stage, the collected raw data is parsed with the JsonParser in
order to select the tweets and tweets_id. After performing data cleaning and preprocessing, the sentiment score is
calculated with SentiStrength. If the sentiment score is greater than 1, the output label is “strongly_positive. If the
sentiment score is greater than 0 and equal with or greater than 1, the output is “positive”. For strongly_negative, the
score is less than -1. If the score is less than 0 and equal to or greater than -1 is negative. Otherwise, the output is
neutral. The Reduce stage is not necessary due to the fact that the result combination is not needed. Currently, the
Reduce stage outputs the results obtained by the Mappers.
Figure3. Class Labeling Procedure
8. Multi-tier Classification
In order to implement the multi-tier classification, there are two main parts: multi-tier classification model
development and classification by developed model. The classification model is developed with multi-tier
architecture and the new data (unknown data) are classified by the developed models.
8.1. Multi-tier Classification Model Development
Developing the classification model is a vital part of the proposed sentiment analysis system. Mahout naive Bayes
classifier, scalable machine learning algorithm, is conducted to develop the classification model. Naïve Bayes is a
learning algorithm that is frequently employed to tackle text classification problems. It is computationally very
efficient and easy to implement. Preprocessed class labeled data set is split into training and testing datasets in order
Procedure: Class Labeling Job
1. Input: raw tweets(twitter text data) //k:key, v:value
2. Output: k: preprocessed tweet & tweets_id, v: class label
3. Function MAPPER(key, values)
4. while(value € values):
5. Parse the data (tweet, tweet_id) with JsonParser
6. Perform data cleaning and preprocessing
7. Calculate the total polarity strength for each sentence by applying SentiStrength_Data
8. If (score > 1) then print “strongly_positive”
9. Else if (score > 0 && score <= 1) then print “positive”
10. Else if ( score <0 && score >- 1) then print “negative”
11. Else if (score <- 1) then print “strongly_negative”
12. Else if (score==0) then print “neutral”
13. emit(tweet & tweet_id, class label)
14. End
15. end while
16. Function Reducer(key, values)
17. emit(tweet & tweet_id, class label)
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 15, No. 9, September 2017
214 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
to build the classification model. The input data is transformed into the sequence file. As this sequence file consists
of key value pairs, class category and tweets_id are set to key and tweet texts are set to value. Feature generation is
performed by creating sparse vector. In feature generation, TFIDF feature vectors are used for improving the
performance of classification models. For multi-tier architecture, three classification models are developed and each
model inherits the same configuration as the first model. Figure 4 illustrates the procedure for developing
classification model I. To develop the first model (Model I), class labeled datasets which class category is “P” and
“SP” is identified as “Pos” and the class category which class category is “N” and “SN” is identified as “Neg”. In
model I, all of the labeled datasets (class category is Pos, Neu, Neg) are used to train the classifier and new test
instance (unlabeled data) is classified into three classes (Pos, Neu, Neg). For the first model, positive and
strongly_positive sentiment data are classified as “Pos” and negative and strongly_negative sentiment data are
classified as “Neg”. Neutral sentiment data is still labeled as “Neu” in the Model I and the neutral class are directly
appended to the final classification result. The new instances which are classified as “Pos” are going to Model II and
other instances are going to Model
III.
Figure4. Procedure for Developing Classification Model I
In model II, the labeled datasets which class category is “P” and “SP” are used to train the classifier and the test
instances which class category is “Pos” are classified into two classes: “positive” and “strongly_positive”. The
procedure for developing Model II is illustrated in Figure5.
Figure5. Procedure for Developing Classification Model II
Figure 6 shows the procedure for developing classification model III. In model III, the class labeled datasets
which class category is “N” and “SN” are used to train the classifier and new test instances which class category is
“Neg” are classified into two classes: “negative” and “strongly negative”.
Procedure : Developing Classification Model I
Input: Class Labeled data (Pos, Neu, Neg)
Output: Classification model I
1. Begin
2. Transform input data to Sequence File
3. Generate TFIDF feature vector to create the model
4. Develop classification model I
5. End
Procedure : Developing Classification Model II
Input: Class Labeled data (P,SP)
Output: Classification model II
1. Begin
2. Transform input data to Sequence File
3. Generate TFIDF feature vector to create the model
4. Develop classification model II
5. End
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 15, No. 9, September 2017
215 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
Figure6. Procedure for Developing Classification Model III
8.2. Classification by Developed model
Figure7. Distributed Classification Procedure
The three trained models are used to classify the new instances. For classification, naïve bayes classifier uses
probabilities to decide which class best matches for a given input text. Word id and tfidf weight are used to create
vector for the new tweet. The naïve bayes classifier is classified by using the vector. For model I, the score of three
class label is calculated. The “bestscore” is set to “-Double.MAX_VALUE”. The “bestcategoryId” is set to “-1” and
“categoryId” is set to index of classification result. If the indexed score is greater than “bestscore”, bestcategoryId is
replaced with categoryId. If the “bestcategoryId” is equal with “1”, the classifier classify as “Pos”. If the
“bestcategoryId” is equal with “0”, the classifier classify as “Neu”. Otherwise, it classify as “Neg”. For model II,
Procedure : Developing Classification Model II
Input: Class Labeled data (N,SN)
Output: Classification model II
1. Begin
2. Transform input data to Sequence File
3. Generate TFIDF feature vector to create the model
4. Develop classification model III
5. End
Procedure : Classification Job
Input: New Collected Tweet Stream Data
Output: k : tweet, tweet_id, v: class category
1. Function Mapper(k1, v1)
2. while(value € values):
3. Performed Data cleaning and Preprocessing
4. Create feature vector by using wordid, tfidf value
5. Classify feature vector of the new instances by applying Model I
6. Print classified results(“Pos”, “Neu”, “Neg”)
7. If (results= = “Pos”)
8. Classify feature vector of tweet(class category is “Pos”) by applying Model II
9. Print classified results(“positive”, “strongly_positive”)
10. End if
11. Else If (results= = “Neg”)
12. Classify feature vector of tweet(class category is “Neg”) by applying Model III
13. Print classified results(“negative”, “strongly_negative”)
14. End if
15. Else Print classified results(“neutral”)
16. Emit(tweet & tweet_id, class category)
17. End while
18. End function
19. Function Reduce(k2, v2)
20. Function Reducer(key, values)
21. emit(tweet & tweet_id, class category)
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 15, No. 9, September 2017
216 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
the score of two class label is calculated. If the “bestcategoryId” is equal with “1”, the classifier classify as
“strongly_positive”. Otherwise, the classifier classify as “positive”. For model III, the score of two class label is
calculated. If the “bestcategoryId” is equal with “1”, the classifier classify as “strongly_negative”. Otherwise, the
classifier classify as “negative”. The algorithm is considered naive because it assumes that the value of a particular
feature is independent of the value of any other feature, given the class variable. Laplace smoothing is performed
with value of α set to 1.
9. Result Analysis
In this work, experimental parameters for implementing the proposed system, the dataset and explanations about
evaluation result are described.
9.1 Experimental Environment
The specifications of devices and necessary software component of proposed system are described in table 3.
Table3. Experiment Parameters
Parameters Specification
Server/Client OS Ubuntu 14.04 LTS
Host Specification Intel ® Core i7-3770 CPU @ 3.40GHz,
8GB Memory, 1TB Hard Disk
VMs Specification 4GB RAM, 100 GB Hard Disk
Software Component - Hadoop 2.7.1
- Flume 1.6
- SentiStrength 2
- Mahout 0.10.0
9.2 Datasets
In order to test the functionality of the proposed system and prove the achieved results with promising accuracy,
tweets stream data related with iphone and samsung mobile product is examined. 200,000 tweet data are utilized as
the training datasets and 1000 new batch of tweets are applied as the test set.
9.3 Evaluation Results
In the experiment, the cluster is composed of 4 computing nodes (VMs) with one name node and three data
nodes. Data collection to classification model development module is developed as the offline process. At another
side, new batch of tweets are collected, and the new tweets are classified by applying the multitier classification
model.
To present the evaluation results of the system, beginning from evaluating the performance of lexicon based
classifier. To establish the ground truth, the evaluation result of SentiStrength lexicon based classifier is compared
with manual classification. In this work, 5,000 tweets are randomly selected from training data of two mobile phone
brand and then evaluate the result of SentiStrength lexicon based classification.
Figure 8 illustrates the comparative results of lexicon based classification and manual classification for iphone.
The results show the tweets Percentage of classification with SentiStrength based approach for positive,
strongly_positive, negative, strongly_negative, neutral class labels are 17, 7, 14, 3 and 59. And manual classified
Tweets percentages for positive, strongly_positive, negative, strongly_negative, neutral are 22, 10, 12, 4, and 52.
Therefore, error rate for lexicon based classification is 5% in positive, 3% in strongly_positive and 2% in negative,
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 15, No. 9, September 2017
217 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
1% in strongly_negative, and 7 % in neutral. Therefore, the overall accuracy rate is 82%and error rate is 18% for
iphone data.
Figure8. Comparison of SentiStrength lexicon based classification and Manual Classification for iphone
Figure9. Comparison of SentiStrength Lexicon based Classification and Manual Classification for samsung
The comparative results of SentiStrength lexicon based classification and manual classification for samsung is
illustrated in Figure 9. The results show the tweets Percentage of classification with SentiStrength based approach
for positive, strongly_positive, negative, strongly_negative, neutral class labels are 20, 5, 16, 6 and 53. And manual
classified Tweets percentages for positive, strongly_positive, negative, strongly_negative, neutral are 24, 8, 11, 8,
and 49. Therefore, error rate for lexicon based classification is 5% in positive, 3% in strongly_positive and 2% in
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 15, No. 9, September 2017
218 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
negative, 1% in strongly_negative, and 7 % in neutral. Therefore, the overall accuracy rate is 84%and error rate is
16% for samsung.
To evaluate the performance of multi-tier classification, two set of experiments are implemented. Reference data,
which is correctly classified instances by using the preceding models, are used in the first case. In this case, only
local mistakes are identified as the misclassified instances from the previous model have not been considered. For
second case, the accuracy is computed by using global data which is based on the subsequent results of preceding
models. The global data may contain the misclassified instances from all of the hierarchical classification models.
Table3. Comparison of single-tier and multi-tier result for iphone
Table4. Comparison of single-tier and multi-tier result for samsung
Table 3 shows the comparative results of single-tier and multi-tier classification for iphone. The results show a
higher accuracy in reference data while it is slightly lower for global data in multi-tier classification. The multi-tier
Architecture Classified as Accuracy(%) Overall
Accuracy(%)
Multi-tier classification
(with reference data)
Model I Pos
Neg
Neu
83.4
59.7
69.4
82.37Model II positive
strongly_positive
71.7
77.4
Model III Negative
Strongly_negative
68.2
44.8
Multi-tier classification
(with global data)
Model I Pos
Neg
Neu
83.4
59.7
69.4
80.03Model II positive
strongly_positive
42.8
59.6
Model III negative
strongly_negative
66.7
61.4
Single-tier classification
neutral
positive
strongly_positive
negative
strongly_negative
82.1
41.3
67.9
46.2
48.4
75.24
Architecture Classified as Accuracy(%) Overall
Accuracy(%)
Multi-tier classification
(with reference results)
Model I Pos
Neg
Neu
81.4
71.7
59.4
80.14Model II positive
strongly_positive
71.7
77.4
Model III negative
strongly_negative
48.2
59.8
Multi-tier classification
(with global results)
Model I Pos
Neg
Neu
80.4
49.3
53.4
78.13Model II positive
strongly_positive
32.9
54.9
Model III negative
strongly_negative
67.1
68.2
Single-tier classification
neutral
positive
strongly_positive
negative
strongly_negative
82.7
42.1
66.5
39.9
43.7
74.53
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 15, No. 9, September 2017
219 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
classification (with reference data) is higher than single-teir with 7% and the multi-tier classification (with global
data) is higher than single-tier with 5%. The comparative results of single-tier and multi-tier classifications for
Samsung are illustrated in Table 4. The results show the multi-tier classification (with reference data) is achieved
higher accuracy than multi-tier classification (with global data) and single-teir with 4% and 6% respectively.
Figure 10. Comparison Result for Performance (with Running Time)
In order to test the scalability of the system, the MSABDP run the job with different number of tweets and with
different number of nodes. Each node is developed on each machine. Figures 10 show the scalability of MSABDP
and single node cluster to four node cluster. According to the result, the running time of MSABDP with different
volumes of data decreases when adding more nodes into the cluster.
10. CONCLUSION
In this paper, Big Data Analytics Platform (Hadoop) is developed to scale up the traditional analytics platform for
analyzing social big data . Multi-tier sentiment analysis on Big Data Analytics platform (MSABDP) is proposed to
reduce the multi class classification complexity. Apache Flume is used to collect a huge amount of real twitter data
and MSABDP classifies the real twitter data into five classes: strongly_positive, positive, neutral, negative, and
strongly_negative. The sentiment classification is implemented by combining lexicon and supervised machine
learning-based approach. The system enables high-level performance of learning based classification while taking
advantage of the lexicon-based classifier’s effortless setup process. To reduce the multi class classification
complexity, learning based approach with multi-tier architecture is applied to classify the multi class. The proposed
system evaluates real twitter data of iphone and samsung mobile phone product. The Evaluation result shows that
the reliability of the performance of SentiStrength lexicon based classification by comparing manual classification
results and achieved the accuracy rate is 84% for iphone and 82% for samsung. To evaluate the multi class
classification scheme with mult-tier architecture, two set of experiments are implemented with reference data and
global data. For iphone product, the multi-tier classification (with reference data) is higher than single-teir with 7%
and the multi-tier classification (with global data) is higher than single-tier with 5%. For samsung, the multi-tier
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 15, No. 9, September 2017
220 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
classification (with reference data) is achieved higher accuracy than multi-tier classification (with global data) and
single-teir with 4% and 6% respectively. The running time of MSABDP with different volumes of data decreases
when adding more nodes into the cluster. In future works, the proposed sentiment analysis system will be
implemented with Spark and Mahout Samsara. The regarding future work, more experiments by more classifier and
data from another domain will be experimented.
REFERENCES
[1] A.Tsakalidis, S.Papadopoulos,I.Kompatsiairs, “ An Ensemble Model for Cross-Domain Polarity Classification on Twitter”, 15 th Web
Information System Engineering, Thessaloniki, Greece, 14 October 2014
[2] Abbreviation List. Available at “http://www.englishclub.com/eslchat/abbreviations.html.” [Online; accessed 10-June-2016].
[3] Andrew C. Oliver. “Machine-learning-with-mahout”. Available: http://www.infoworld.com/article/2608418/application-
development/enjoy-machine-learning-with-mahout-on-hadoop.html” , May 29, 2014, [Online; accessed 3-December-2016].
[4] B.Liu, E.Blasch, Y.Chen, D.Shen, G.Chen, “Scalable Sentiment Classification for Big Data Analysis Using Naive Bayes Classifier”,
IEEE International Conference on Big Data, IEEE, (2013), pp. 99-104.
[5] G. Vaitheeswaran, Dr. L. Arockiam,”Combining Lexicon and Machine Learning Method to Enhance the Accuracy of Sentiment
Analysis on Big Data”, International Journal of Computer Science and Information Technologies, Vol.7(1) , 2016, 306-311.
[6] Hadoop Yarn. Available at “https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/YARN.html.” [Online; accessed 20-
August-2016]
[7] Harsh Thakkar and Dhiren Patel, “Approaches for Sentiment Analysis on Twitter: A State of Art study”, CoRR abs/1512.01043
(2015). 2014.
[8] Johan Bollen, Huina Mao, and Alberto Pepe. Modeling public mood and emotion:Twitter sentiment and socio-economic phenomena.
In ICWSM, 2011.
[9] L.Sigler. “Text Analytics and Sentiment Analysis. Available: http://www.clarabridge.com/text-analytics-vs-sentiment-analysis/ ,July
27, 2015, [Online; accessed 12-October-2016].
[10] M.Bouazizi, T.Ohtsuki, “Sentiment Analysis: from Binary to Multi-Class Classfication”, IEEE ICC 2016 SAC Social Networking,
2016.
[11] M.Moh, A.Gajjala, S.C.R.Gangireddy, T.S.Moh, “On Multi-Tier Sentiment Analysis using Supervised Machine Learning”,
IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 2015.
[12] M.Skuza, A.Romanowski, “ Sentiment Analysis of Twitter Data within Big Data Distributed Environment for Stock Prediction”,
Computer Science and Information Systems (FedCSIS), 13-16 Sept. 2015.
[13] N.M.Sharef, H.M.Zin and S.Nadali, “Overview and Future Opportunities of Sentiment Analysis”, Journal of Computer Sciences ,
January 2016.
[14] S.Shaikh. “Flume Installation and Streaming twitter data using flume”, Available: https://www.eduonix.com/blog/bigdata-and-
hadoop/flume-installation-and-streaming-twitter-data-using-flume/, June 30, 2015, [Online; accessed 20-Oct-2016]
[15] “ Twitter Statistics “, Available at: http://www.statisticbrain.con/twitter-statistics/, 2013, [Online; accessed 2-January-2016]
[16] V.K. Singh, R. Piryani, A. Uddin, P.Waila, “Sentiment Analysis of Movie Reviews. A new Feature-based Heuristic for Aspect-level
Sentiment Classification.”, International Multi-Conference on Automation, Computing, Communication, Control and Compressed
Sensing (iMac4s), 22-23 March 2013.
[17] V.N.Khuc, C.Shivade, R.Ramnath, J.Ramanathan, “Towards Building Large-Scale Distributed Systems for Twitter Sentiment
Analysis”, 12 Proceedings of the 27th Annual ACM Symposium on Applied Computing, Pages 459-464, Trento, Italy, March 26-30,
2012
Wint Nyein Chan is a tutor of Computer University (Hpa-an). She received her M.C.Sc degree in computer
science from Computer University (Dawei), Myanmar, in 2011. She is a Ph.D candidate of UCSY. Her
research interests include big data, cloud computing, and machine learning.
Dr. Thandar Thein received her M.Sc. (computer science) and Ph.D. degrees in 1996 and 2004, respectively
from University of Computer Studies, Yangon (UCSY), Myanmar. She did post doctorate research in Korea
Aerospace University. She is currently a Pro-Rector of University of Computer Studies, Maubin. Her
research interests include cloud computing, mobile cloud computing, big data, digital forensic, security
engineering, and network security.
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 15, No. 9, September 2017
221 https://sites.google.com/site/ijcsis/
ISSN 1947-5500

More Related Content

What's hot

Scalable recommendation with social contextual information
Scalable recommendation with social contextual informationScalable recommendation with social contextual information
Scalable recommendation with social contextual informationeSAT Journals
 
An Improvised Fuzzy Preference Tree Of CRS For E-Services Using Incremental A...
An Improvised Fuzzy Preference Tree Of CRS For E-Services Using Incremental A...An Improvised Fuzzy Preference Tree Of CRS For E-Services Using Incremental A...
An Improvised Fuzzy Preference Tree Of CRS For E-Services Using Incremental A...IJTET Journal
 
Detailed Investigation of Text Classification and Clustering of Twitter Data ...
Detailed Investigation of Text Classification and Clustering of Twitter Data ...Detailed Investigation of Text Classification and Clustering of Twitter Data ...
Detailed Investigation of Text Classification and Clustering of Twitter Data ...ijtsrd
 
Time-Ordered Collaborative Filtering for News Recommendation
Time-Ordered Collaborative Filtering for News RecommendationTime-Ordered Collaborative Filtering for News Recommendation
Time-Ordered Collaborative Filtering for News RecommendationIRJET Journal
 
Structural Balance Theory Based Recommendation for Social Service Portal
Structural Balance Theory Based Recommendation for Social Service PortalStructural Balance Theory Based Recommendation for Social Service Portal
Structural Balance Theory Based Recommendation for Social Service PortalYogeshIJTSRD
 
A real-time big data sentiment analysis for iraqi tweets using spark streaming
A real-time big data sentiment analysis for iraqi tweets using spark streamingA real-time big data sentiment analysis for iraqi tweets using spark streaming
A real-time big data sentiment analysis for iraqi tweets using spark streamingjournalBEEI
 
Distributed Link Prediction in Large Scale Graphs using Apache Spark
Distributed Link Prediction in Large Scale Graphs using Apache SparkDistributed Link Prediction in Large Scale Graphs using Apache Spark
Distributed Link Prediction in Large Scale Graphs using Apache SparkAnastasios Theodosiou
 
Temporal Exploration in 2D Visualization of Emotions on Twitter Stream
Temporal Exploration in 2D Visualization of Emotions on Twitter StreamTemporal Exploration in 2D Visualization of Emotions on Twitter Stream
Temporal Exploration in 2D Visualization of Emotions on Twitter StreamTELKOMNIKA JOURNAL
 
Predicting the Brand Popularity from the Brand Metadata
Predicting the Brand Popularity from the Brand MetadataPredicting the Brand Popularity from the Brand Metadata
Predicting the Brand Popularity from the Brand MetadataIJECEIAES
 
Social Media Mining: An Introduction
Social Media Mining: An IntroductionSocial Media Mining: An Introduction
Social Media Mining: An IntroductionAli Abbasi
 
IRJET- Big Data Driven Information Diffusion Analytics and Control on Social ...
IRJET- Big Data Driven Information Diffusion Analytics and Control on Social ...IRJET- Big Data Driven Information Diffusion Analytics and Control on Social ...
IRJET- Big Data Driven Information Diffusion Analytics and Control on Social ...IRJET Journal
 
Frequent Item set Mining of Big Data for Social Media
Frequent Item set Mining of Big Data for Social MediaFrequent Item set Mining of Big Data for Social Media
Frequent Item set Mining of Big Data for Social MediaIJERA Editor
 
Final Poster for Engineering Showcase
Final Poster for Engineering ShowcaseFinal Poster for Engineering Showcase
Final Poster for Engineering ShowcaseTucker Truesdale
 
Poster Abstracts
Poster AbstractsPoster Abstracts
Poster Abstractsbutest
 
A PROCESS OF LINK MINING
A PROCESS OF LINK MININGA PROCESS OF LINK MINING
A PROCESS OF LINK MININGcsandit
 
Analysis on Recommended System for Web Information Retrieval Using HMM
Analysis on Recommended System for Web Information Retrieval Using HMMAnalysis on Recommended System for Web Information Retrieval Using HMM
Analysis on Recommended System for Web Information Retrieval Using HMMIJERA Editor
 

What's hot (18)

Ijariie1184
Ijariie1184Ijariie1184
Ijariie1184
 
Scalable recommendation with social contextual information
Scalable recommendation with social contextual informationScalable recommendation with social contextual information
Scalable recommendation with social contextual information
 
An Improvised Fuzzy Preference Tree Of CRS For E-Services Using Incremental A...
An Improvised Fuzzy Preference Tree Of CRS For E-Services Using Incremental A...An Improvised Fuzzy Preference Tree Of CRS For E-Services Using Incremental A...
An Improvised Fuzzy Preference Tree Of CRS For E-Services Using Incremental A...
 
Detailed Investigation of Text Classification and Clustering of Twitter Data ...
Detailed Investigation of Text Classification and Clustering of Twitter Data ...Detailed Investigation of Text Classification and Clustering of Twitter Data ...
Detailed Investigation of Text Classification and Clustering of Twitter Data ...
 
Time-Ordered Collaborative Filtering for News Recommendation
Time-Ordered Collaborative Filtering for News RecommendationTime-Ordered Collaborative Filtering for News Recommendation
Time-Ordered Collaborative Filtering for News Recommendation
 
Structural Balance Theory Based Recommendation for Social Service Portal
Structural Balance Theory Based Recommendation for Social Service PortalStructural Balance Theory Based Recommendation for Social Service Portal
Structural Balance Theory Based Recommendation for Social Service Portal
 
A real-time big data sentiment analysis for iraqi tweets using spark streaming
A real-time big data sentiment analysis for iraqi tweets using spark streamingA real-time big data sentiment analysis for iraqi tweets using spark streaming
A real-time big data sentiment analysis for iraqi tweets using spark streaming
 
Distributed Link Prediction in Large Scale Graphs using Apache Spark
Distributed Link Prediction in Large Scale Graphs using Apache SparkDistributed Link Prediction in Large Scale Graphs using Apache Spark
Distributed Link Prediction in Large Scale Graphs using Apache Spark
 
Temporal Exploration in 2D Visualization of Emotions on Twitter Stream
Temporal Exploration in 2D Visualization of Emotions on Twitter StreamTemporal Exploration in 2D Visualization of Emotions on Twitter Stream
Temporal Exploration in 2D Visualization of Emotions on Twitter Stream
 
Predicting the Brand Popularity from the Brand Metadata
Predicting the Brand Popularity from the Brand MetadataPredicting the Brand Popularity from the Brand Metadata
Predicting the Brand Popularity from the Brand Metadata
 
Social Media Mining: An Introduction
Social Media Mining: An IntroductionSocial Media Mining: An Introduction
Social Media Mining: An Introduction
 
IRJET- Big Data Driven Information Diffusion Analytics and Control on Social ...
IRJET- Big Data Driven Information Diffusion Analytics and Control on Social ...IRJET- Big Data Driven Information Diffusion Analytics and Control on Social ...
IRJET- Big Data Driven Information Diffusion Analytics and Control on Social ...
 
Frequent Item set Mining of Big Data for Social Media
Frequent Item set Mining of Big Data for Social MediaFrequent Item set Mining of Big Data for Social Media
Frequent Item set Mining of Big Data for Social Media
 
Final Poster for Engineering Showcase
Final Poster for Engineering ShowcaseFinal Poster for Engineering Showcase
Final Poster for Engineering Showcase
 
Poster Abstracts
Poster AbstractsPoster Abstracts
Poster Abstracts
 
A PROCESS OF LINK MINING
A PROCESS OF LINK MININGA PROCESS OF LINK MINING
A PROCESS OF LINK MINING
 
Analysis on Recommended System for Web Information Retrieval Using HMM
Analysis on Recommended System for Web Information Retrieval Using HMMAnalysis on Recommended System for Web Information Retrieval Using HMM
Analysis on Recommended System for Web Information Retrieval Using HMM
 
G1803054653
G1803054653G1803054653
G1803054653
 

Similar to Multi-Tier Sentiment Analysis System in Big Data Environment

Political prediction analysis using text mining and deep learning
Political prediction analysis using text mining and deep learningPolitical prediction analysis using text mining and deep learning
Political prediction analysis using text mining and deep learningVishwambhar Deshpande
 
Using R for Classification of Large Social Network Data
Using R for Classification of Large Social Network DataUsing R for Classification of Large Social Network Data
Using R for Classification of Large Social Network DataIJCSIS Research Publications
 
Methods for Sentiment Analysis: A Literature Study
Methods for Sentiment Analysis: A Literature StudyMethods for Sentiment Analysis: A Literature Study
Methods for Sentiment Analysis: A Literature Studyvivatechijri
 
SampleLiteratureReviewTemplate_IVBTechIISEM_MajorProject.pptx
SampleLiteratureReviewTemplate_IVBTechIISEM_MajorProject.pptxSampleLiteratureReviewTemplate_IVBTechIISEM_MajorProject.pptx
SampleLiteratureReviewTemplate_IVBTechIISEM_MajorProject.pptx20211a05p7
 
Political Prediction Analysis using text mining and deep learning.pptx
Political Prediction Analysis using text mining and deep learning.pptxPolitical Prediction Analysis using text mining and deep learning.pptx
Political Prediction Analysis using text mining and deep learning.pptxDineshGaikwad36
 
Sentiment Analysis in Social Media and Its Operations
Sentiment Analysis in Social Media and Its OperationsSentiment Analysis in Social Media and Its Operations
Sentiment Analysis in Social Media and Its OperationsIRJET Journal
 
Recommender System in light of Big Data
Recommender System in light of Big DataRecommender System in light of Big Data
Recommender System in light of Big DataKhadija Atiya
 
IRJET- A Review on: Sentiment Polarity Analysis on Twitter Data from Diff...
IRJET-  	  A Review on: Sentiment Polarity Analysis on Twitter Data from Diff...IRJET-  	  A Review on: Sentiment Polarity Analysis on Twitter Data from Diff...
IRJET- A Review on: Sentiment Polarity Analysis on Twitter Data from Diff...IRJET Journal
 
IRJET- Strength and Workability of High Volume Fly Ash Self-Compacting Concre...
IRJET- Strength and Workability of High Volume Fly Ash Self-Compacting Concre...IRJET- Strength and Workability of High Volume Fly Ash Self-Compacting Concre...
IRJET- Strength and Workability of High Volume Fly Ash Self-Compacting Concre...IRJET Journal
 
IRJET- Implementing Social CRM System for an Online Grocery Shopping Platform...
IRJET- Implementing Social CRM System for an Online Grocery Shopping Platform...IRJET- Implementing Social CRM System for an Online Grocery Shopping Platform...
IRJET- Implementing Social CRM System for an Online Grocery Shopping Platform...IRJET Journal
 
Vol 7 No 1 - November 2013
Vol 7 No 1 - November 2013Vol 7 No 1 - November 2013
Vol 7 No 1 - November 2013ijcsbi
 
Avoiding Anonymous Users in Multiple Social Media Networks (SMN)
Avoiding Anonymous Users in Multiple Social Media Networks (SMN)Avoiding Anonymous Users in Multiple Social Media Networks (SMN)
Avoiding Anonymous Users in Multiple Social Media Networks (SMN)paperpublications3
 
RUNNING HEADER Analytics Ecosystem1Analytics Ecosystem4.docx
RUNNING HEADER Analytics Ecosystem1Analytics Ecosystem4.docxRUNNING HEADER Analytics Ecosystem1Analytics Ecosystem4.docx
RUNNING HEADER Analytics Ecosystem1Analytics Ecosystem4.docxanhlodge
 
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKINGINTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKINGdannyijwest
 
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED  ON SEMANTIC TAG RANKINGINTELLIGENT SOCIAL NETWORKS MODEL BASED  ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKINGdannyijwest
 
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKINGINTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKINGIJwest
 
IRJET- Twitter Sentimental Analysis for Predicting Election Result using ...
IRJET-  	  Twitter Sentimental Analysis for Predicting Election Result using ...IRJET-  	  Twitter Sentimental Analysis for Predicting Election Result using ...
IRJET- Twitter Sentimental Analysis for Predicting Election Result using ...IRJET Journal
 
Framework for Product Recommandation for Review Dataset
Framework for Product Recommandation for Review DatasetFramework for Product Recommandation for Review Dataset
Framework for Product Recommandation for Review Datasetrahulmonikasharma
 
A large-scale sentiment analysis using political tweets
A large-scale sentiment analysis using political tweetsA large-scale sentiment analysis using political tweets
A large-scale sentiment analysis using political tweetsIJECEIAES
 

Similar to Multi-Tier Sentiment Analysis System in Big Data Environment (20)

Political prediction analysis using text mining and deep learning
Political prediction analysis using text mining and deep learningPolitical prediction analysis using text mining and deep learning
Political prediction analysis using text mining and deep learning
 
Using R for Classification of Large Social Network Data
Using R for Classification of Large Social Network DataUsing R for Classification of Large Social Network Data
Using R for Classification of Large Social Network Data
 
Methods for Sentiment Analysis: A Literature Study
Methods for Sentiment Analysis: A Literature StudyMethods for Sentiment Analysis: A Literature Study
Methods for Sentiment Analysis: A Literature Study
 
SampleLiteratureReviewTemplate_IVBTechIISEM_MajorProject.pptx
SampleLiteratureReviewTemplate_IVBTechIISEM_MajorProject.pptxSampleLiteratureReviewTemplate_IVBTechIISEM_MajorProject.pptx
SampleLiteratureReviewTemplate_IVBTechIISEM_MajorProject.pptx
 
Political Prediction Analysis using text mining and deep learning.pptx
Political Prediction Analysis using text mining and deep learning.pptxPolitical Prediction Analysis using text mining and deep learning.pptx
Political Prediction Analysis using text mining and deep learning.pptx
 
Sentiment Analysis in Social Media and Its Operations
Sentiment Analysis in Social Media and Its OperationsSentiment Analysis in Social Media and Its Operations
Sentiment Analysis in Social Media and Its Operations
 
Recommender System in light of Big Data
Recommender System in light of Big DataRecommender System in light of Big Data
Recommender System in light of Big Data
 
IRJET- A Review on: Sentiment Polarity Analysis on Twitter Data from Diff...
IRJET-  	  A Review on: Sentiment Polarity Analysis on Twitter Data from Diff...IRJET-  	  A Review on: Sentiment Polarity Analysis on Twitter Data from Diff...
IRJET- A Review on: Sentiment Polarity Analysis on Twitter Data from Diff...
 
IRJET- Strength and Workability of High Volume Fly Ash Self-Compacting Concre...
IRJET- Strength and Workability of High Volume Fly Ash Self-Compacting Concre...IRJET- Strength and Workability of High Volume Fly Ash Self-Compacting Concre...
IRJET- Strength and Workability of High Volume Fly Ash Self-Compacting Concre...
 
IRJET- Implementing Social CRM System for an Online Grocery Shopping Platform...
IRJET- Implementing Social CRM System for an Online Grocery Shopping Platform...IRJET- Implementing Social CRM System for an Online Grocery Shopping Platform...
IRJET- Implementing Social CRM System for an Online Grocery Shopping Platform...
 
Vol 7 No 1 - November 2013
Vol 7 No 1 - November 2013Vol 7 No 1 - November 2013
Vol 7 No 1 - November 2013
 
Avoiding Anonymous Users in Multiple Social Media Networks (SMN)
Avoiding Anonymous Users in Multiple Social Media Networks (SMN)Avoiding Anonymous Users in Multiple Social Media Networks (SMN)
Avoiding Anonymous Users in Multiple Social Media Networks (SMN)
 
RUNNING HEADER Analytics Ecosystem1Analytics Ecosystem4.docx
RUNNING HEADER Analytics Ecosystem1Analytics Ecosystem4.docxRUNNING HEADER Analytics Ecosystem1Analytics Ecosystem4.docx
RUNNING HEADER Analytics Ecosystem1Analytics Ecosystem4.docx
 
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKINGINTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
 
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED  ON SEMANTIC TAG RANKINGINTELLIGENT SOCIAL NETWORKS MODEL BASED  ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
 
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKINGINTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
 
[IJCT-V3I2P30] Authors: Sunny Sharma
[IJCT-V3I2P30] Authors: Sunny Sharma[IJCT-V3I2P30] Authors: Sunny Sharma
[IJCT-V3I2P30] Authors: Sunny Sharma
 
IRJET- Twitter Sentimental Analysis for Predicting Election Result using ...
IRJET-  	  Twitter Sentimental Analysis for Predicting Election Result using ...IRJET-  	  Twitter Sentimental Analysis for Predicting Election Result using ...
IRJET- Twitter Sentimental Analysis for Predicting Election Result using ...
 
Framework for Product Recommandation for Review Dataset
Framework for Product Recommandation for Review DatasetFramework for Product Recommandation for Review Dataset
Framework for Product Recommandation for Review Dataset
 
A large-scale sentiment analysis using political tweets
A large-scale sentiment analysis using political tweetsA large-scale sentiment analysis using political tweets
A large-scale sentiment analysis using political tweets
 

Recently uploaded

Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxFurkanTasci3
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...ThinkInnovation
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 

Recently uploaded (20)

Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 

Multi-Tier Sentiment Analysis System in Big Data Environment

  • 1. Multi-Tier Sentiment Analysis System in Big Data Environment Wint Nyein Chan*, Thandar Thein** * University of Computer Studies, Yangon, **University of Computer Studies, Maubin Email: wintnyeinchan2012@gmail.com, thandartheinn@gmail.com Abstract- Over recent years, big data, a huge amount of structured and unstructured data is generated from social Network. There needs to extract the valulable information from the social big data. The traditional analytic platform needs to be scaled up for analyzing social big data in an efficient and timely manner. Sentiment Analysis of social big data helps the organizations by providing business insights with public opinion. Sentiment analysis based on multi-class classification scheme is oriented towards classification of text into more detailed sentiment labels. Multi-class classification with single tier architecture where single model is developed and entire labeled data is trained may increase the classification complexity. In this paper, multi-tier sentiment analysis system on big data analytics platform (MSABDP) is proposed to reduce the multi class classification complexity and efficiently analyze large scale data set. Hadoop is built for big data analytics and it is a good platform for being able to manage large data at scale and which can improve scalability and efficiency by adopting distributed processing environment since they have been implemented using a MapReduce framework and a Hadoop distributed storage (HDFS). The MSABDP is implemented by combining SentiStrength lexicon and learning based classification scheme with multi-tier architecture and run on big data analytics platform for being able to manage large data at scale. The proposed system collects a large amount of real Twitter data by using Apache Flume and the data was used for evaluation. The evaluation results have shown that the proposed multi class classification system with multi-tier architecture is able to significantly improve the classification accuracy over multi class classification based on single-tier architecture by 7%. Keywords: big data analytics, hadoop, multi class classification, sentiment analysis, social big data 1. INTRODUCTION As the rapid growth of the Internet and online activity, many services such as blogging, podcasting, social networking and bookmarking are popular. These services allow users to create and share information within open and closed communities and contribute to the volumes of the data. Moreover, in social network, data is created and delivered from various systems operating in real-time by aggregating constant information about user activities and interactions. Twitter has 320 million monthly active users and they posts 500 million tweets every day; Facebook has 936 million daily and 1,440 million monthly active users as of December, 2015. These factors are reasons of a rise of Big Data [5]. Big data is characterized by the volume, velocity, veracity, variety, value and volatility of data. Nevertheless, the appropriate tools are needed to acquire, organize and derive value and volatility of data. With the advent of social media, data is captured from different sources, such as mobile devices and web browsers, and it is stored in various data formats. The traditional storage and analytics platform can’t handle the different sources and different formats of the structured and unstructured data. Big Data Analytics has become popular for analysing and managing large volume of the structure and unstructured data [12]. Hadoop is a good platform for Big Data Analytics as it provides scalability, cost-effective, flexible, fast, secure and authentication, parallel processing, Availability and resilient nature. It is an open-source software framework comprises of two International Journal of Computer Science and Information Security (IJCSIS), Vol. 15, No. 9, September 2017 204 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
  • 2. parts: storage part and processing part. The storage part is called the Hadoop Distributed File System (HDFS) and the processing part is called MapReduce. Analysing social data by providing customer certification through sentiment analysis helps the organization to determine marketing strategy and improve customer services. Sentiment analysis is one of the main agenda in big data that focus on various ways to analyse big data to identify patterns and relationships, make informed predictions, deliver actionable intelligence and gain business insight from this steady influx of information [13]. Twitter is one of the popular social media data platforms, which combines features of blogs and social network services. Twitter was established in 2006 and experienced rapid growth of users in the first years [15]. Twitter is a good source of information in the sense of snapshots of moods and feelings as well as up-to-date events and current situation commenting. Actually, achieving a high level of performance for sentiment analysis for twitter data is very difficult. At first, the large amount of volume and high velocity of twitter stream data is needed to be efficiently stored and processed. Second, Twitter allows users to post no more than 140 characters for each post, which makes Twitter sentiment classification can be lower performance than that when mining longer texts. In addition, classification into multiple classes can increase the classification complexities, which tend to decrease the classification accuracy. In this paper, Sentiment analysis system is proposed which is implemented by combing lexicon based and learning based approaches, and shows that the proposed system is good performances in terms of classification accuracy. The main contributions of this paper are as follows: 1. Big Data Analytics Platform is developed to scale up the traditional analytics platform for analyzing social big data 2. Multi-tier sentiment analysis on Big Data Analytics platform (MSABDP) is proposed To reduce the multi class classification complexity. The remainder of this paper is arranged as follows: In section 2, the related work presents large scale sentiment analysis in a distributed environment and multi class classification. Section 3 illustrates the system design to analyze the social big data sets and presents the implementation of the proposed system. Section 4 shows the experimental setup and results. Section 5 concludes the paper and describes the possible directions for future works. 2. RELATED WORK In the past years, many works has been released in sentiment analysis, but there is a little work for sentiment analysis on big data. There exist many possible variants; some of the papers related with the proposed system are reviewed in this section. The paper [11] presented multi-tier classification architecture for sentiment analysis. The architecture consists of three models. The first model classifies three classes and the remaining two models are performed as the binary classifier. To achieve high performance, features were selected by applying various feature selection techniques. 150,000 movie reviews posted on social media were used to train and test the performance of the system. Four classifiers (Naïve Bayes, SVM, Random Forest, and SGD) were used in the experiments. The result showed that the performance of proposed architecture was significantly improved prediction accuracy over the simple single tier model by more than 10%. The proposed system runs on traditional analytics platform and classifies the movie reviews into multi class by applying only supervised machine learning classifiers. International Journal of Computer Science and Information Security (IJCSIS), Vol. 15, No. 9, September 2017 205 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
  • 3. The authors [10] presented an approach that relies on writing patterns and special unigrams for multiclass sentiment analysis. They extracted patterns that rely on PoS -Tag of words and then calculate the resemblance degree res (p, t) of each pattern in the training set p to the tweet t. For each tweet, they extracted a 4 set of features; referred to the training set and use Random Forest machine learning algorithms to perform the classification. Training data set contains 21000 tweets which had been manually classified the class. The system classified the texts which collected from Twitter into seven classes: happiness, sadness, anger, love, hate, sarcasm and neutral. In the experiment, the accuracy of the multiclass sentiment analysis is 56.9%. Since the training data are manually classified the class, no more methods were required to develop the training data sets. In [17], As the total volume of tweets is extremely high and it taks a long time to process, the authors proposed a distributed system for Twitter sentiment analysis with two components: lexicon builder and a sentiment classifier. The lexicon builder was implemented based on Hadoop and HBase and they demonstrated that it scales with increasing number of machines and the amount of training data. In particular, for 100000 tweets, the training time was decreased by 35% when moving from 2 to 3 machines, and 40% for 2 to 4 machines, 47% for 2 to 5 machines. For 300000 tweets, the running time was decreased by 23% when moving from a cluster with 4 machines to 5 machines. To achieve higher accuracy than the lexicon-based approach, the lexicon-based and machine-learning based approaches are combined. Online Logistic Regression from the mahout machine learning library was used for learning based approach. The experiment results show that their system can obtain higher accuracy by combining lexicon-based and learning based approaches. The proposed system used the existing class labeled dataset and classified into binary class. B.Liu et al. [4] presented a simple and complete system for sentiment mining on large data sets using a Naive Bayes classifier. To achieve fine-grain control of the analysis procedure, they implemented the NBC on top of Hadoop framework with additional four modules: the work flow controller (WFC), the data parser, the user terminal and the result collector. They evaluated the scalability of NBC in large data set and the result is encouraging in that the accuracy of NBC is improved and approaches 82% when the dataset size increases. They have demonstrated that NBC is able to scale up to analyze the sentiment of millions movie reviews with increasing throughput. The two datasets in the experiments had already been labeled the class. Instead of applying Mahout Machine learning library, NBC is implemented on top of Hadoop framework with additional modules. In the proposed system, multi-tier sentiment analysis system is developed on big data analytics platform to reduce the multi class classification complexity. Multi-tier sentiment classification is implemented by combining lexicon and Supervised learning-based approaches. In this system, SentiStrength is used to label the class in order to develop the training data sets. Mahout Naïve bayes with multi-tier architecture is used for multi class sentiment classification. The proposed system can improve the scalability and efficiency by adopting distributed processing environment since they are implemented using HDFS and MapReduce framework. 3. BIG DATA ANALYTICS PLATFORM DEVELOPMENT Multi-tier Sentiment Analysis system is implemented on Big Data Analytics Platform with Apache Flume [14], HDFS, MapReduce [6] and Mahout Machine learning library [3]. High level architecture of the MSABDP is International Journal of Computer Science and Information Security (IJCSIS), Vol. 15, No. 9, September 2017 206 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
  • 4. illustrated in Figure 1. It consists of four layers: Data Ingestion Layer, Storage Layer, Processing Layer, and Analytics Layer. In Data Ingestion Layer, data are collected from Twitter steaming API with Apache Flume. The collected data is stored in Hadoop File System (HDFS), distributed storage, which is located in Storage Layer. One name node and three data nodes are developed for distributed storage. Data cleaning and preprocessing, class labeling and multi-tier classification are developed in Analytics Layer. All of the processes from Analytics Layer are implemented with Map Reduce paradigm which is used for distributed processing and located in Processing Layer. The detail process description of each layer is described the following subsection. The system can provide the capabilities such as scalability, cost effective, and parallel processing by using Hadoop platform. Processing Layer Data Ingestion Layer …. ApacheFlume(DataCollection) Analytics Layer Data Cleaning and Preprocessing Class Labeling Mahout Machine Learning(Multi-tier Classification) Slave Node 1 Data Node Node Manager Map Task 2 Reduce Task 1 Containers Slave Node 2 Data Node Node Manager Application Master Map Task 1 Containers Slave Node n Data Node Node Manager Map Task N Reduce Task M Containers Scheduler Application Manager Resource ManagerYarn Master Node HDFS Master Node Name Node S Name Node Processing Layer Container Launch RequestNode Status Resource Request Map Reduce Coordination Heart beats, balancing, replication Figure1. High Level Architecture of MSABDP 3.1 Data Ingestion Layer In this layer, Apache Flume is used to collect the tweet stream data and the data is ingested to HDFS through the memory channel. Apache Flume is deployed by Twitter Agent which has 3 main components: a TwitterSource, a Memory Channel and a HDFS Sink. The incoming tweets are limited by keywords and language in the TwitterSource. The source comes as an event-driven source and receives events through mechanisms like callbacks or RPC calls. In this TwitterSource, the twitter4j library is used to keep access to the Twitter Streaming API. The application secrets like Consumer Key and Consumer Secret, Access Token, Access Secret have been used to connect to the Twitter APIs. The memory channel acts as a pathway between the TwitterSource and HDFS Sink. The channel uses as an in memory queue to store events until they’re ready to be written to a sink. As the channel International Journal of Computer Science and Information Security (IJCSIS), Vol. 15, No. 9, September 2017 207 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
  • 5. holds all events in memory, the channel’s capacity and transaction capacity are limited by the capacity and transaction capacity parameters in the configuration file. Events are added to the channel by TwitterSource, and later removed from the Channel by HDS Sink. HDFS Sink, which writes events to a configured location in HDFS. In the HDFS Sink configuration, the size of files can be defined with the roll count parameter and set up to 10,000, so each file will end up containing 10,000 tweets. The original data format is retained by setting the file Type to Data Stream. The file path is defined as that the files will end up in a series of directories for the year, month, day and hour during which the event occurs. The timestamp is set to true in the configuration and which is used by Flume to determine the timestamp of the event, and is used to resolve the full path where the event should end up. 3.2 Storage Layer In Storage Layer, HDFS is used to provide scalable and reliable data storage. HDFS serves master/slave architecture and single NameNode serve as a master server. Name Node executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. DataNodes is used to store the actual data in HDFS. The input file is split into one or more blocks and these blocks are stored in a set of DataNodes. Each block size is 64 MB. DataNodes are responsible for serving read and write requests from the clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode. 3.3 Processing Layer Yarn and MapReduce-2 is located in the Processing Layer to process vast amounts of data in-parallel on clusters of commodity hardware in a reliable, fault-tolerant manner. The MapReduce job splits the input data set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Both the input and the output of the job are stored in HDFS. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks. The MapReduce framework consists of a single master ResourceManager, one slave NodeManager per cluster-node, and MRAppMaster per application. The application specifies the input/output locations and supply map and reduce functions via implementations of interfaces and abstract-classes. These, and other job parameters, comprise the job configuration. The Hadoop job client then submits the job and configuration to the ResourceManager which then assumes the responsibility of distributing the software and configuration to the slaves, scheduling tasks and monitoring them, providing status and diagnostic information to the job-client. 3.4 Analytics Layer In this layer, sentiment analysis is performed by combing lexicon based and machine learning based approaches. SentiStrength lexicon based scheme is applied for class labeling and the class labeled data is used as the training data to build the multitier classification model. Data cleaning and preprocessing have been performed to develop the effective classification model. Mahout scalable machine-learning library, specifically designed to use Hadoop for enabling scalable processing of huge data sets, is used to develop the classification model. International Journal of Computer Science and Information Security (IJCSIS), Vol. 15, No. 9, September 2017 208 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
  • 6. 4. SENTIMENT ANALYSIS SYSTEM FOR SOCIAL BIG DATA 4.1 Sentiment Analysis Sentiment analysis [7] is an increasingly popular text mining method to determine the opinion of a text. It is often referred to as opinion mining. Sentiment analysis uses individual elements of a text on different text levels, like whole document, paragraph, sentence or only a text window, and tries to correctly determine and assign particular moods and emotions to the respective entities, also called objects. The challenge of this analysis [9] is to replicate a human process understanding of the text information assignment of the individual polarized information and interpretation of the human mind. The simplest way to do sentiment classification is using the lexicon-based approach, also known as a bag of words, or dictionary-based approach. This allows more accurate assessment of the individual components of the text. The machine learning techniques have widely applied in the text classification area and most of them are supervised learning classification methods. Acquiring effective training data is a challenge, although learning based approaches outperforms the lexicon-based (Pang et al.,). Manual Labeling for training data is time and labor consuming. 4.2 Sentiment Analysis of Social Big Data The data from social media [13] is multistructured (Variety), very large scale and distributed (Volume) and sometimes needs to be analyzed in a real time manner (Velocity). In addition, the fourth V, Value, is needed for sentiment analysis on social data. The specification of such data source is known as the 4 Vs aspect in big data domain. Sentiment analysis on the social media data helps the organization to determine marketing strategy by providing public opinion. Efficient techniques to collect social data and extracting valuable information from collected data are essential demand. Traditional sentiment classification techniques do not perform in social data. In this work, Mult-tier Sentiment Analysis System on Big Data Analytics Platform is proposed for analyzing social big data. 4.3 Mult-tier Sentiment Analysis System on Big Data Analytics Platform The MSABDP is developed with four processes: data collection, data cleaning and preprocessing, class labeling and multi-titer sentiment classification. Apache Flume is used to collect the real twitter data. As the collected tweets include noise and irrelevant data, data cleaning and preprocessing are performed in order to produce normalized data for labeling the class. The sentiment classification is developed by combining lexicon-based classifier and supervised machine learning approach. The classification process is implemented on Map Reduce software framework to support distributed computing, and it allows parallel programming using the function concept called Map functions. Instead of labeling the class manually, SentiStrength lexicon based classifier is used for the task of annotating the training data for the learning-based classifier. Concurrently, the tweets text data is preprocessed to improve the accuracy of classification. And then preprocessed data with class label are combined as the input of learning based classification. In reduce stage, the result data are combined and it produces a new set of output. And the classification model is developed by using the mahout machine learning library. The newly coming test data is International Journal of Computer Science and Information Security (IJCSIS), Vol. 15, No. 9, September 2017 209 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
  • 7. classified using the model in order to produce the classified tweets. The process flow of the proposed system is described in Figure 2. Twitter Service(Twitter Stream API ) Twitter Source HDFS Sink Memory Channel Data Collection FLume Data Cleaning and Preprocessing New Instances SP : strongly_positive; P: positive; Neu: neutral; SN: strongly_negative; N: negative Multi-tier Classification Removing Character Repetitions Removing StopWords Negation Handling Removing Noisy Data Selecting Tweet Text feature Removing Duplicate Tweets Replacing Emoticons with their Sentiment Values Replacing Abbreviation with their expression Training Data Set Testing Data Set Class Labeled Data Set Machine Leaning Classifier NegPos Model I (SP/P (Pos), Neu, SN/N(Neg) ) Classification Model SP P SN NNeu Model II (SP, P) Model III (SN, N) Class Labeling Sentiment Score Calculation Class Labeling Lexicon Figure2. Process Flow Diagram of MSABDP 5. Data Collection The real tweet data are collected by using Apache Flume [9] and the collected real data is used to evaluate the system. Apache Flume deploys the Twitter Agent in order to collect data that is generated from TwitterSource (Twitter Stream API) and transferred over to HDFS through MemoryChannel. 5.1. Twitter Source Twitter source processes events and moves them along by sending the stream data into a Memory channel [10]. The sources operate by gathering discrete pieces of data, translating the data into individual events, and then using the channel to process events one at a time, or as a batch. . In the system, custom Twitter Source is used as a data source. Data are limited by key words and language. In the TwitterSource, the twitter4j library is used to keep International Journal of Computer Science and Information Security (IJCSIS), Vol. 15, No. 9, September 2017 210 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
  • 8. access to the Twitter Streaming API. In order to connect to the Twitter APIs, some of the application specific secrets like Consumer Key and Consumer Secret, Access Token, Access Secret are needed to be used. To get the application specific secrets, it is needed to create a Twitter application. The Twitter application is created by the following link https://apps.twitter.com/. It generates Consumer key, Consumer secret, Access token, and Access token secret. 5.2. Memory Channel The channel acts as a pathway between the TwitterSource and HDFS Sink. Events are added to the channel by TwitterSource, and later removed from the Channel by HDFS Sink. It is uses as an in-memory queue to store events until they’re ready to be written to a sink. As the channel holds all events in memory, the channel’s capacity and transaction capacity is limited by the “capacity” and “transaction Capacity” parameters in the configuration file. In the work, the channel’s capacity is set up to 10000 and the transaction capacity is setup to 100. 5.3. HDFS Sink HDFS Sink, which writes events to a configured location in HDFS. In the HDFS Sink configuration, defines the size of the files with the roll Count parameter and set up to 10,000, so each file will end up containing 10,000 tweets. It also retains the original data format, by setting the file Type to DataStream and setting writing Format to Text. 6. Data Cleaning and Preprocessing As flume ingests the data as the nested json format and it may contain irrelevant and duplicated data, this data has to be cleaned and preprocessed for effective analysis. Selecting tweet text features, removing duplicate tweets, removing noisy data, removing character repetition, removing stopwords and negation handling is performed during this process. The process not only simplifies the classification task, but also serves to greatly decrease the processing cost in the training phase. 6.1. Selecting Tweet Text Feature Many tweet data features are included in one record of tweet stream data as shown in Figure 3. The sentiment classification system is focused on tweet text feature with English language. The tweet text feature is selected among other feature because it expresses twitter users’ feeling and opinion. The tweet id feature is also selected to assign as a key and tweet text feature is assigned as a value in Map Reduce process. To select tweet id and tweet text feature, the collected tweet data with json format is fetched as a json object using json parser. And “tweets_id” and “tweets_text” are extracted as a string from json object. 6.2. Removing Duplicate Tweets Twitter Stream data may include many duplicate retweets. To remove duplicate retweets, the extracted “tweets_id” and “tweets_text” is checked whether already stored or not in the list of tweet data. If the extracted features have not been already stored, they are added as the list of tweet data. Otherwise, they are removed. Table 1 shows the selected sample tweet_text and tweets_id features. International Journal of Computer Science and Information Security (IJCSIS), Vol. 15, No. 9, September 2017 211 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
  • 9. 6.3. Removing Noisy Data The term noisy data is used to describe any piece of information within the tweet that will not be useful for the machine learning algorithm to assign a class to that tweet. There are included noisy data such as character repetitions, website links with URL, @username, punctuation additional white space Replace hash tags with the same word without the hash tags. For example, #fun is replaced with fun. Replaced website links with URL so links to websites that start with www.* or http. Converted @username to “usermentionsymbol by replacing @username instances found in tweets with “usermentionsymbol” for the classifier to easily identify that a user is being referenced. Non-Alphabets are replaced with spaces. 6.4. Removing Character Repetition To remove the character repetition, there is needed to be checked whether the repetitive characters contain or not in the input data. To check whether the presence of repetitive character, the former and pervious characters are compared by indexing word in one record of tweet. If the previous and current characters are the same, the character is counted. If the quantity of character is two or more, the procedure returns as true. Otherwise, it returns false. If the repetitive characters are found and the character count is more than 2, replace the character itself by deleting the repetitive characters with substring function. 6.5. Removing Stopwords When working with text classification methods, removal of stopwords is a common approach to reduce noise in the data. In this work, not only common stopwords but also stopwords based on classification domain are considered by manually examining the data. For example, domain stopwords contain iphone, apple, mobile, etc. 6.6. Negation Handling Negation handling is one of the factors that significantly affect the accuracy of learning based classifier. For example: the word “good” in the phrase “not good” will be contributing to positive sentiment rather that negative sentiment as the presence of “not” before it is not taken into account. It transforms a word followed by a not or n’t into “not_” + word. 6.7. Replacing Emoticons with Sentiment Values As the maximum length of a tweet is 140 characters, the case of emoticon will represent the sentiment of that tweet. [7] Therefore, emoticons can be replaced with their sentimental values (such as “feeling sad” and “feeling happy”) by using happy and sad emotion dictionary. The happy emoticon dictionary is used to label with 90 emotions and sad emoticon dictionary is used to label with 107 emoticons. For example, “:)” is labeled as “feeling happy” whereas “:=(” is “feeling sad”. Table 3 shows the sample emoticons and their sentiment values. The preprocessed tweets that replaced emoticons can be seen in Table 1. International Journal of Computer Science and Information Security (IJCSIS), Vol. 15, No. 9, September 2017 212 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
  • 10. Table 1: Sample Emoticons and their Sentiment Values 6.8. Replace Abbreviation with their expansions As the abbreviation in Tweets can boost the sentiment values, it is replaced with their expansions by using abbreviation dictionary. The dictionary has translations for 330 abbreviations [2]. The most frequent used abbreviations can be seen in Table 2. Table 2: Sample abbreviations and their expansion 7. Class Labeling Instead of manual labeling the class, SentiStrength lexicon based approach is used for annotating the training data of the learning-based classifier. SentiStrength [16], a lexicon-based classifier, uses additional (non-lexical) linguistic information and rules to detect sentiment strength in short informal English text. The contextual valence shifter: negation and intensifier are used to evaluate the context sentiment and to solve the context dependent problem by applying NegationWordList and BoosterWordList. There are consists of eight dictionaries: BoosterWordList, EmoticonLookupTable, EmotionLookupTable, EnglishWordList, IdionLookupTable, NegationWordList, QuestionWords and slangLookupTable. EmotionLookupTable consists of 2546 sentiment words with their strength. Some words include Kleene star stemming (e.g., ador*). The word “miss” is a special case with a positive and negative strength of 2. It is frequently used to express sadness and loves simultaneously. A spelling correction algorithm deletes repeated letters in a word when the letters are more frequently repeated than normal for English or, if a word is not found in an English dictionary, when deleting repeated letters creates a dictionary word. A booster word list is used to strengthen or weaken the emotion of sentiment words. A booster word list consists of 28 words and their strength of sentiment. Some booster words are “totally, completely and might”. An idiom list is used to identify the sentiment of a few common phrases. This overrides individual sentiment word strengths. Negations are used to reverse the semantic polarity of a particular term, and skip any intervening booster words. Default multiplier for negated words is 0.5. In this case, instead use of default multiplier for negated words, the multiplier is set to 1. At least two repeated letters added to words give a strength boost sentiment words by 1. For instance, haaaappy is more positive than happy. An emoticon list with polarities is used to identify additional sentiment. The emoticon list consists of 116 common emoticon words and their polarity strength (-1 or 1). Sentences with exclamation marks have a minimum positive strength of 2, unless negative. Repeated punctuation with one or more exclamation marks boost the strength of the Emoticons Sentiment Values :-) :) :o) :] :3 :c) Feeling happy :-( :( :c :[ Feeling sad Abbreviations English expansion gr8,gr8t great lol Laughing out loud rotf Rolling on the floor bff Best friend forever International Journal of Computer Science and Information Security (IJCSIS), Vol. 15, No. 9, September 2017 213 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
  • 11. immediately preceding sentiment word by 1. Negative sentiment is ignored in questions. For each text, SentiStrength outputs a positive sentiment score from 1 to 5 and a negative score from -1 to -5. Figure3 shows the procedure for calculating the sentiment score by applying SentiStrength. The procedure is performed with Map Reduce function. At the Mapper stage, the collected raw data is parsed with the JsonParser in order to select the tweets and tweets_id. After performing data cleaning and preprocessing, the sentiment score is calculated with SentiStrength. If the sentiment score is greater than 1, the output label is “strongly_positive. If the sentiment score is greater than 0 and equal with or greater than 1, the output is “positive”. For strongly_negative, the score is less than -1. If the score is less than 0 and equal to or greater than -1 is negative. Otherwise, the output is neutral. The Reduce stage is not necessary due to the fact that the result combination is not needed. Currently, the Reduce stage outputs the results obtained by the Mappers. Figure3. Class Labeling Procedure 8. Multi-tier Classification In order to implement the multi-tier classification, there are two main parts: multi-tier classification model development and classification by developed model. The classification model is developed with multi-tier architecture and the new data (unknown data) are classified by the developed models. 8.1. Multi-tier Classification Model Development Developing the classification model is a vital part of the proposed sentiment analysis system. Mahout naive Bayes classifier, scalable machine learning algorithm, is conducted to develop the classification model. Naïve Bayes is a learning algorithm that is frequently employed to tackle text classification problems. It is computationally very efficient and easy to implement. Preprocessed class labeled data set is split into training and testing datasets in order Procedure: Class Labeling Job 1. Input: raw tweets(twitter text data) //k:key, v:value 2. Output: k: preprocessed tweet & tweets_id, v: class label 3. Function MAPPER(key, values) 4. while(value € values): 5. Parse the data (tweet, tweet_id) with JsonParser 6. Perform data cleaning and preprocessing 7. Calculate the total polarity strength for each sentence by applying SentiStrength_Data 8. If (score > 1) then print “strongly_positive” 9. Else if (score > 0 && score <= 1) then print “positive” 10. Else if ( score <0 && score >- 1) then print “negative” 11. Else if (score <- 1) then print “strongly_negative” 12. Else if (score==0) then print “neutral” 13. emit(tweet & tweet_id, class label) 14. End 15. end while 16. Function Reducer(key, values) 17. emit(tweet & tweet_id, class label) International Journal of Computer Science and Information Security (IJCSIS), Vol. 15, No. 9, September 2017 214 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
  • 12. to build the classification model. The input data is transformed into the sequence file. As this sequence file consists of key value pairs, class category and tweets_id are set to key and tweet texts are set to value. Feature generation is performed by creating sparse vector. In feature generation, TFIDF feature vectors are used for improving the performance of classification models. For multi-tier architecture, three classification models are developed and each model inherits the same configuration as the first model. Figure 4 illustrates the procedure for developing classification model I. To develop the first model (Model I), class labeled datasets which class category is “P” and “SP” is identified as “Pos” and the class category which class category is “N” and “SN” is identified as “Neg”. In model I, all of the labeled datasets (class category is Pos, Neu, Neg) are used to train the classifier and new test instance (unlabeled data) is classified into three classes (Pos, Neu, Neg). For the first model, positive and strongly_positive sentiment data are classified as “Pos” and negative and strongly_negative sentiment data are classified as “Neg”. Neutral sentiment data is still labeled as “Neu” in the Model I and the neutral class are directly appended to the final classification result. The new instances which are classified as “Pos” are going to Model II and other instances are going to Model III. Figure4. Procedure for Developing Classification Model I In model II, the labeled datasets which class category is “P” and “SP” are used to train the classifier and the test instances which class category is “Pos” are classified into two classes: “positive” and “strongly_positive”. The procedure for developing Model II is illustrated in Figure5. Figure5. Procedure for Developing Classification Model II Figure 6 shows the procedure for developing classification model III. In model III, the class labeled datasets which class category is “N” and “SN” are used to train the classifier and new test instances which class category is “Neg” are classified into two classes: “negative” and “strongly negative”. Procedure : Developing Classification Model I Input: Class Labeled data (Pos, Neu, Neg) Output: Classification model I 1. Begin 2. Transform input data to Sequence File 3. Generate TFIDF feature vector to create the model 4. Develop classification model I 5. End Procedure : Developing Classification Model II Input: Class Labeled data (P,SP) Output: Classification model II 1. Begin 2. Transform input data to Sequence File 3. Generate TFIDF feature vector to create the model 4. Develop classification model II 5. End International Journal of Computer Science and Information Security (IJCSIS), Vol. 15, No. 9, September 2017 215 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
  • 13. Figure6. Procedure for Developing Classification Model III 8.2. Classification by Developed model Figure7. Distributed Classification Procedure The three trained models are used to classify the new instances. For classification, naïve bayes classifier uses probabilities to decide which class best matches for a given input text. Word id and tfidf weight are used to create vector for the new tweet. The naïve bayes classifier is classified by using the vector. For model I, the score of three class label is calculated. The “bestscore” is set to “-Double.MAX_VALUE”. The “bestcategoryId” is set to “-1” and “categoryId” is set to index of classification result. If the indexed score is greater than “bestscore”, bestcategoryId is replaced with categoryId. If the “bestcategoryId” is equal with “1”, the classifier classify as “Pos”. If the “bestcategoryId” is equal with “0”, the classifier classify as “Neu”. Otherwise, it classify as “Neg”. For model II, Procedure : Developing Classification Model II Input: Class Labeled data (N,SN) Output: Classification model II 1. Begin 2. Transform input data to Sequence File 3. Generate TFIDF feature vector to create the model 4. Develop classification model III 5. End Procedure : Classification Job Input: New Collected Tweet Stream Data Output: k : tweet, tweet_id, v: class category 1. Function Mapper(k1, v1) 2. while(value € values): 3. Performed Data cleaning and Preprocessing 4. Create feature vector by using wordid, tfidf value 5. Classify feature vector of the new instances by applying Model I 6. Print classified results(“Pos”, “Neu”, “Neg”) 7. If (results= = “Pos”) 8. Classify feature vector of tweet(class category is “Pos”) by applying Model II 9. Print classified results(“positive”, “strongly_positive”) 10. End if 11. Else If (results= = “Neg”) 12. Classify feature vector of tweet(class category is “Neg”) by applying Model III 13. Print classified results(“negative”, “strongly_negative”) 14. End if 15. Else Print classified results(“neutral”) 16. Emit(tweet & tweet_id, class category) 17. End while 18. End function 19. Function Reduce(k2, v2) 20. Function Reducer(key, values) 21. emit(tweet & tweet_id, class category) International Journal of Computer Science and Information Security (IJCSIS), Vol. 15, No. 9, September 2017 216 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
  • 14. the score of two class label is calculated. If the “bestcategoryId” is equal with “1”, the classifier classify as “strongly_positive”. Otherwise, the classifier classify as “positive”. For model III, the score of two class label is calculated. If the “bestcategoryId” is equal with “1”, the classifier classify as “strongly_negative”. Otherwise, the classifier classify as “negative”. The algorithm is considered naive because it assumes that the value of a particular feature is independent of the value of any other feature, given the class variable. Laplace smoothing is performed with value of α set to 1. 9. Result Analysis In this work, experimental parameters for implementing the proposed system, the dataset and explanations about evaluation result are described. 9.1 Experimental Environment The specifications of devices and necessary software component of proposed system are described in table 3. Table3. Experiment Parameters Parameters Specification Server/Client OS Ubuntu 14.04 LTS Host Specification Intel ® Core i7-3770 CPU @ 3.40GHz, 8GB Memory, 1TB Hard Disk VMs Specification 4GB RAM, 100 GB Hard Disk Software Component - Hadoop 2.7.1 - Flume 1.6 - SentiStrength 2 - Mahout 0.10.0 9.2 Datasets In order to test the functionality of the proposed system and prove the achieved results with promising accuracy, tweets stream data related with iphone and samsung mobile product is examined. 200,000 tweet data are utilized as the training datasets and 1000 new batch of tweets are applied as the test set. 9.3 Evaluation Results In the experiment, the cluster is composed of 4 computing nodes (VMs) with one name node and three data nodes. Data collection to classification model development module is developed as the offline process. At another side, new batch of tweets are collected, and the new tweets are classified by applying the multitier classification model. To present the evaluation results of the system, beginning from evaluating the performance of lexicon based classifier. To establish the ground truth, the evaluation result of SentiStrength lexicon based classifier is compared with manual classification. In this work, 5,000 tweets are randomly selected from training data of two mobile phone brand and then evaluate the result of SentiStrength lexicon based classification. Figure 8 illustrates the comparative results of lexicon based classification and manual classification for iphone. The results show the tweets Percentage of classification with SentiStrength based approach for positive, strongly_positive, negative, strongly_negative, neutral class labels are 17, 7, 14, 3 and 59. And manual classified Tweets percentages for positive, strongly_positive, negative, strongly_negative, neutral are 22, 10, 12, 4, and 52. Therefore, error rate for lexicon based classification is 5% in positive, 3% in strongly_positive and 2% in negative, International Journal of Computer Science and Information Security (IJCSIS), Vol. 15, No. 9, September 2017 217 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
  • 15. 1% in strongly_negative, and 7 % in neutral. Therefore, the overall accuracy rate is 82%and error rate is 18% for iphone data. Figure8. Comparison of SentiStrength lexicon based classification and Manual Classification for iphone Figure9. Comparison of SentiStrength Lexicon based Classification and Manual Classification for samsung The comparative results of SentiStrength lexicon based classification and manual classification for samsung is illustrated in Figure 9. The results show the tweets Percentage of classification with SentiStrength based approach for positive, strongly_positive, negative, strongly_negative, neutral class labels are 20, 5, 16, 6 and 53. And manual classified Tweets percentages for positive, strongly_positive, negative, strongly_negative, neutral are 24, 8, 11, 8, and 49. Therefore, error rate for lexicon based classification is 5% in positive, 3% in strongly_positive and 2% in International Journal of Computer Science and Information Security (IJCSIS), Vol. 15, No. 9, September 2017 218 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
  • 16. negative, 1% in strongly_negative, and 7 % in neutral. Therefore, the overall accuracy rate is 84%and error rate is 16% for samsung. To evaluate the performance of multi-tier classification, two set of experiments are implemented. Reference data, which is correctly classified instances by using the preceding models, are used in the first case. In this case, only local mistakes are identified as the misclassified instances from the previous model have not been considered. For second case, the accuracy is computed by using global data which is based on the subsequent results of preceding models. The global data may contain the misclassified instances from all of the hierarchical classification models. Table3. Comparison of single-tier and multi-tier result for iphone Table4. Comparison of single-tier and multi-tier result for samsung Table 3 shows the comparative results of single-tier and multi-tier classification for iphone. The results show a higher accuracy in reference data while it is slightly lower for global data in multi-tier classification. The multi-tier Architecture Classified as Accuracy(%) Overall Accuracy(%) Multi-tier classification (with reference data) Model I Pos Neg Neu 83.4 59.7 69.4 82.37Model II positive strongly_positive 71.7 77.4 Model III Negative Strongly_negative 68.2 44.8 Multi-tier classification (with global data) Model I Pos Neg Neu 83.4 59.7 69.4 80.03Model II positive strongly_positive 42.8 59.6 Model III negative strongly_negative 66.7 61.4 Single-tier classification neutral positive strongly_positive negative strongly_negative 82.1 41.3 67.9 46.2 48.4 75.24 Architecture Classified as Accuracy(%) Overall Accuracy(%) Multi-tier classification (with reference results) Model I Pos Neg Neu 81.4 71.7 59.4 80.14Model II positive strongly_positive 71.7 77.4 Model III negative strongly_negative 48.2 59.8 Multi-tier classification (with global results) Model I Pos Neg Neu 80.4 49.3 53.4 78.13Model II positive strongly_positive 32.9 54.9 Model III negative strongly_negative 67.1 68.2 Single-tier classification neutral positive strongly_positive negative strongly_negative 82.7 42.1 66.5 39.9 43.7 74.53 International Journal of Computer Science and Information Security (IJCSIS), Vol. 15, No. 9, September 2017 219 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
  • 17. classification (with reference data) is higher than single-teir with 7% and the multi-tier classification (with global data) is higher than single-tier with 5%. The comparative results of single-tier and multi-tier classifications for Samsung are illustrated in Table 4. The results show the multi-tier classification (with reference data) is achieved higher accuracy than multi-tier classification (with global data) and single-teir with 4% and 6% respectively. Figure 10. Comparison Result for Performance (with Running Time) In order to test the scalability of the system, the MSABDP run the job with different number of tweets and with different number of nodes. Each node is developed on each machine. Figures 10 show the scalability of MSABDP and single node cluster to four node cluster. According to the result, the running time of MSABDP with different volumes of data decreases when adding more nodes into the cluster. 10. CONCLUSION In this paper, Big Data Analytics Platform (Hadoop) is developed to scale up the traditional analytics platform for analyzing social big data . Multi-tier sentiment analysis on Big Data Analytics platform (MSABDP) is proposed to reduce the multi class classification complexity. Apache Flume is used to collect a huge amount of real twitter data and MSABDP classifies the real twitter data into five classes: strongly_positive, positive, neutral, negative, and strongly_negative. The sentiment classification is implemented by combining lexicon and supervised machine learning-based approach. The system enables high-level performance of learning based classification while taking advantage of the lexicon-based classifier’s effortless setup process. To reduce the multi class classification complexity, learning based approach with multi-tier architecture is applied to classify the multi class. The proposed system evaluates real twitter data of iphone and samsung mobile phone product. The Evaluation result shows that the reliability of the performance of SentiStrength lexicon based classification by comparing manual classification results and achieved the accuracy rate is 84% for iphone and 82% for samsung. To evaluate the multi class classification scheme with mult-tier architecture, two set of experiments are implemented with reference data and global data. For iphone product, the multi-tier classification (with reference data) is higher than single-teir with 7% and the multi-tier classification (with global data) is higher than single-tier with 5%. For samsung, the multi-tier International Journal of Computer Science and Information Security (IJCSIS), Vol. 15, No. 9, September 2017 220 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
  • 18. classification (with reference data) is achieved higher accuracy than multi-tier classification (with global data) and single-teir with 4% and 6% respectively. The running time of MSABDP with different volumes of data decreases when adding more nodes into the cluster. In future works, the proposed sentiment analysis system will be implemented with Spark and Mahout Samsara. The regarding future work, more experiments by more classifier and data from another domain will be experimented. REFERENCES [1] A.Tsakalidis, S.Papadopoulos,I.Kompatsiairs, “ An Ensemble Model for Cross-Domain Polarity Classification on Twitter”, 15 th Web Information System Engineering, Thessaloniki, Greece, 14 October 2014 [2] Abbreviation List. Available at “http://www.englishclub.com/eslchat/abbreviations.html.” [Online; accessed 10-June-2016]. [3] Andrew C. Oliver. “Machine-learning-with-mahout”. Available: http://www.infoworld.com/article/2608418/application- development/enjoy-machine-learning-with-mahout-on-hadoop.html” , May 29, 2014, [Online; accessed 3-December-2016]. [4] B.Liu, E.Blasch, Y.Chen, D.Shen, G.Chen, “Scalable Sentiment Classification for Big Data Analysis Using Naive Bayes Classifier”, IEEE International Conference on Big Data, IEEE, (2013), pp. 99-104. [5] G. Vaitheeswaran, Dr. L. Arockiam,”Combining Lexicon and Machine Learning Method to Enhance the Accuracy of Sentiment Analysis on Big Data”, International Journal of Computer Science and Information Technologies, Vol.7(1) , 2016, 306-311. [6] Hadoop Yarn. Available at “https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/YARN.html.” [Online; accessed 20- August-2016] [7] Harsh Thakkar and Dhiren Patel, “Approaches for Sentiment Analysis on Twitter: A State of Art study”, CoRR abs/1512.01043 (2015). 2014. [8] Johan Bollen, Huina Mao, and Alberto Pepe. Modeling public mood and emotion:Twitter sentiment and socio-economic phenomena. In ICWSM, 2011. [9] L.Sigler. “Text Analytics and Sentiment Analysis. Available: http://www.clarabridge.com/text-analytics-vs-sentiment-analysis/ ,July 27, 2015, [Online; accessed 12-October-2016]. [10] M.Bouazizi, T.Ohtsuki, “Sentiment Analysis: from Binary to Multi-Class Classfication”, IEEE ICC 2016 SAC Social Networking, 2016. [11] M.Moh, A.Gajjala, S.C.R.Gangireddy, T.S.Moh, “On Multi-Tier Sentiment Analysis using Supervised Machine Learning”, IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 2015. [12] M.Skuza, A.Romanowski, “ Sentiment Analysis of Twitter Data within Big Data Distributed Environment for Stock Prediction”, Computer Science and Information Systems (FedCSIS), 13-16 Sept. 2015. [13] N.M.Sharef, H.M.Zin and S.Nadali, “Overview and Future Opportunities of Sentiment Analysis”, Journal of Computer Sciences , January 2016. [14] S.Shaikh. “Flume Installation and Streaming twitter data using flume”, Available: https://www.eduonix.com/blog/bigdata-and- hadoop/flume-installation-and-streaming-twitter-data-using-flume/, June 30, 2015, [Online; accessed 20-Oct-2016] [15] “ Twitter Statistics “, Available at: http://www.statisticbrain.con/twitter-statistics/, 2013, [Online; accessed 2-January-2016] [16] V.K. Singh, R. Piryani, A. Uddin, P.Waila, “Sentiment Analysis of Movie Reviews. A new Feature-based Heuristic for Aspect-level Sentiment Classification.”, International Multi-Conference on Automation, Computing, Communication, Control and Compressed Sensing (iMac4s), 22-23 March 2013. [17] V.N.Khuc, C.Shivade, R.Ramnath, J.Ramanathan, “Towards Building Large-Scale Distributed Systems for Twitter Sentiment Analysis”, 12 Proceedings of the 27th Annual ACM Symposium on Applied Computing, Pages 459-464, Trento, Italy, March 26-30, 2012 Wint Nyein Chan is a tutor of Computer University (Hpa-an). She received her M.C.Sc degree in computer science from Computer University (Dawei), Myanmar, in 2011. She is a Ph.D candidate of UCSY. Her research interests include big data, cloud computing, and machine learning. Dr. Thandar Thein received her M.Sc. (computer science) and Ph.D. degrees in 1996 and 2004, respectively from University of Computer Studies, Yangon (UCSY), Myanmar. She did post doctorate research in Korea Aerospace University. She is currently a Pro-Rector of University of Computer Studies, Maubin. Her research interests include cloud computing, mobile cloud computing, big data, digital forensic, security engineering, and network security. International Journal of Computer Science and Information Security (IJCSIS), Vol. 15, No. 9, September 2017 221 https://sites.google.com/site/ijcsis/ ISSN 1947-5500