1. An Approach for Sentiment Analysis on
Big Social Data Using Spark
Presented by
Name of Student
(Roll No)
Under the Guidance of
Dr. Chiranjeevi Manike, Associate Professor
Department of Computer Science & Engineering
B V Raju Institute of Technology, Narsapur
2. • Collecting public opinion by analyzing big social data has
attracted considerable attention because of its interactive,
real-time nature.
• To this end, recent studies have combined Social Media and
Sentiment Analysis to follow major events by tracking people's
behavior.
• The proposed system provides an adaptable Sentiment Analysis
approach that analyzes social media posts and extracts users'
opinions in real time.
• The approach consists of two steps.
Abstract
Introduction
Literature Review
Problem Statement
Proposed System
References
Abstract
3. • The first step is to build a dynamic dictionary of word
polarity based on a chosen set of hashtags related to a given
subject.
• The second step is to classify posts across several subjects by
introducing new features that refine the polarity level of a
post.
• Twitter, Facebook and other social media conversations can be
mined for sentiment data to learn about the competition.
• Social media blogs help reveal what the public is currently
discussing.
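The two steps described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the system's actual implementation: the seed hashtags, the function names and the co-occurrence scoring rule are all assumptions made for the example, and step 2 is reduced to a simple average.

```python
from collections import defaultdict
import re

# Hypothetical seed hashtags; a real system would choose these per subject.
POSITIVE_TAGS = {"#love", "#great"}
NEGATIVE_TAGS = {"#fail", "#worst"}

def tokenize(post):
    """Lowercase a post and split it into words and hashtags."""
    return re.findall(r"#?\w+", post.lower())

def build_polarity_dictionary(posts):
    """Step 1: score each word by the seed hashtags it co-occurs with."""
    pos = defaultdict(int)
    neg = defaultdict(int)
    for post in posts:
        tokens = tokenize(post)
        tags = {t for t in tokens if t.startswith("#")}
        words = {t for t in tokens if not t.startswith("#")}
        is_pos = bool(tags & POSITIVE_TAGS)
        is_neg = bool(tags & NEGATIVE_TAGS)
        if is_pos == is_neg:
            continue  # skip posts with no seed tag or with conflicting tags
        for w in words:
            if is_pos:
                pos[w] += 1
            else:
                neg[w] += 1
    vocab = set(pos) | set(neg)
    # Polarity in [-1, 1]: +1 if seen only with positive tags, -1 if only negative.
    return {w: (pos[w] - neg[w]) / (pos[w] + neg[w]) for w in vocab}

def score_post(post, dictionary):
    """Step 2 (simplified): average the polarity of the known words in a post."""
    known = [dictionary[t] for t in tokenize(post) if t in dictionary]
    return sum(known) / len(known) if known else 0.0

posts = [
    "The new phone is amazing #love",
    "Battery life is amazing #great",
    "The update is terrible #fail",
    "Support was terrible and slow #worst",
]
polarity = build_polarity_dictionary(posts)
print(polarity["amazing"], polarity["terrible"])
print(score_post("amazing phone", polarity))
```

Because the dictionary is rebuilt from whatever posts carry the chosen hashtags, it can follow a subject as the vocabulary around it changes, which is the sense in which it is dynamic.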
4. • The information obtained can be used to make focused,
real-time decisions that boost market share.
• Spark is used because it supports streaming real-time data from
various sources such as Twitter, stock exchanges, and
Geographical Information Systems.
5. Introduction
• Millions of people around the world can now express their
viewpoints and spread them widely.
• Social Media has served that purpose for many years.
• Social Networking websites and applications let users show
their opinions by reacting (liking or disliking) to posted content.
• Users may also post content of their own to express their
intentions or feelings about one or more subjects.
6. Introduction
• The data accumulated and the activities performed on social
media exhibit Volume, Variety, Value, Variability and Veracity on
a large scale, and can therefore be called Big Social Data.
• Such data usually contains numerous sets of opinions that can
be processed to gauge public inclination on digital platforms.
• Several research methods, such as Text Analysis, are used to
process this kind of data.
• Most internet data, almost 80 percent of it, is text.
• That is why Text Analysis has become an important factor in
Public Sentiment and Opinion Extraction.
7. Introduction
• Sentiment Analysis is also known as Opinion Mining.
• It targets people's sentiment about a subject by analyzing their
posts and related actions on social media.
• It then classifies the posts to determine polarity, giving
results such as positive, negative and so on.
• The 'Sentiment' in each statement/tweet can be extracted using
two popular approaches:
– Lexicon Analysis
– Machine Learning
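As an illustration of the first approach, a lexicon-based classifier can be sketched in a few lines. The lexicon entries, their scores and the `classify` function are assumptions made for this example only; a Machine Learning approach would instead train a classifier on labeled posts.

```python
import re

# Tiny illustrative lexicon; real lexicons hold thousands of scored words.
LEXICON = {"good": 1, "great": 2, "happy": 1, "bad": -1, "terrible": -2, "sad": -1}

def classify(text, lexicon=LEXICON):
    """Sum the scores of known words and map the total to a polarity label."""
    words = re.findall(r"[a-z']+", text.lower())
    total = sum(lexicon.get(w, 0) for w in words)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"

for tweet in ["Great service, very happy", "Terrible delay", "The flight lands at noon"]:
    print(tweet, "->", classify(tweet))
```

A text with no lexicon words, or with scores that cancel out, falls into the neutral class in this sketch.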
8. Introduction
• Objective texts are also a part of Sentiment Analysis, as they
represent the 'Neutral' category of polarity.
• Eastern emoticons are no longer widely used, as they are
combinations of special characters that some people do not
understand.
• So, the emojis in use today play a crucial role in the sentiment
of text, especially in tweets.
10. Introduction
• Both methods of analysis have often been applied to big social
data to gather public opinion and assess users' satisfaction with
a subject (services, products, events, topics or persons) in several
domains, including Politics, Marketing, Health and Travel.
• However, the results vary and do not always reach a reasonable
degree of accuracy.
• Such failures are generally caused by the challenges of opinion
mining, such as the semantic orientation of a word, which can
change with context.
11. Literature Review
[01] Garg, K. and Kaur, D., 2019. Sentiment Analysis on Twitter Data using
Apache Hadoop and Performance Evaluation on Hadoop MapReduce and
Apache Spark. In Proceedings on the International Conference on Artificial
Intelligence (ICAI) (pp. 233-238). The Steering Committee of The World
Congress in Computer Science, Computer Engineering and Applied
Computing (WorldComp).
Objective of Paper: Analyze real-time streaming of Twitter data to identify
the sentiment expressed in each tweet using Cloudera.
Approach/Algorithm/Framework: Hadoop MapReduce framework
Pros: Apache Spark turned out to perform considerably better, almost 2x
faster on a single node.
Cons: The correlation between user influence and the author's sentiment is
not computed effectively using Hadoop.
12. Literature Review
[03] Al-Saqqa, S., Al-Naymat, G. and Awajan, A., 2018. A Large-Scale
Sentiment Data Classification for Online Reviews Under Apache Spark.
Procedia Computer Science, 141, pp.183-189.
Objective of Paper: To present new evaluation experiments of sentiment
analysis for a large-scale dataset of online customer reviews under the
Apache Spark data processing system.
Approach/Algorithm/Framework: Spark MLlib classifiers: Naive Bayes,
Support Vector Machine and Logistic Regression
Pros: According to the experimental results, the Support Vector Machine
classifier performs better than the Naive Bayes and Logistic Regression
classifiers.
Cons: Experiments with different feature sets and n-gram models (bi-gram
and tri-gram), which might improve classification performance, are not
conducted.
13. Literature Review
[04] Ranganathan, J., Irudayaraj, A.S. and Tzacheva, A.A., 2017, November.
Action rules for sentiment analysis on twitter data using spark. In 2017 IEEE
International Conference on Data Mining Workshops (ICDMW) (pp. 51-60).
IEEE.
Objective of Paper: To implement a system that is better optimized, in
terms of speed and efficiency, for generating meta-actions, using the
Specific Action Rule discovery based on Grabbing Strategy (SARGS)
algorithm.
Approach/Algorithm/Framework: Action Rule mining algorithm
Pros: According to the results, the Spark system shows faster computation
time than Hadoop MapReduce when implementing the meta-action
generation methods.
Cons: The system is not tested with larger real-time datasets, such as the
NPS dataset, to assess and improve its scalability and feasibility.
14. Literature Review
[07] Adib, P., Alirezazadeh, S. and Nezarat, A., 2017, October. Enhancing trust
accuracy among online social network users utilizing data text mining
techniques in apache spark. In 2017 7th International Conference on
Computer and Knowledge Engineering (ICCKE) (pp. 283-288). IEEE.
Objective of Paper: To find malicious users and analyze their behavior in
order to estimate trust more accurately, using distributed execution in the
Spark environment for a quicker response.
Approach/Algorithm/Framework: Stochastic gradient descent (SGD)
Pros: The proposed model achieves high diagnostic accuracy and
outperforms SGD by 38%.
Cons: The dictionary does not include reverse malicious words, which could
make the detection of malicious users through their tweets more accurate.
15. Literature Review
[08] Podhoranyi, M. and Vojacek, L., 2019, September. Social Media Data
Processing Infrastructure by Using Apache Spark Big Data Platform: Twitter
Data Analysis. In Proceedings of the 2019 4th International Conference on
Cloud Computing and Internet of Things (pp. 1-6).
Objective of Paper: To develop an architecture and a workflow that can
process Twitter social network data in near real-time so that tweets on a
defined topic (floods) are analyzed.
Approach/Algorithm/Framework: Apache Flume, Hadoop Distributed File
System (HDFS), Hive with HiveQL, YARN, Spark.
Pros: The word frequency method (n-grams) is effective in revealing the
content of tweets and demonstrates their high informative potential in terms
of data quality and quantity.
Cons: Text analysis methods focused on geo-name extraction are not
applied.
16. Literature Review
[09] Yeruva, V.K., Junaid, S. and Lee, Y., 2017, November. Exploring social
contextual influences on healthy eating using big data analytics. In 2017
IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
(pp. 1507-1514). IEEE.
Objective of Paper: To implement a Big Data Analytics framework that
aims to explore social contextual influences on healthy eating.
Approach/Algorithm/Framework: BiDAF (Big Data Analytics
Framework for Smart Society)
Pros: The results indicate that the BiDAF framework is effective in
classification and sentiment analysis of food-related tweets and show its
potential with respect to healthy eating.
Cons: BiDAF might not be very suitable for building a highly customized
model required by a client, as it is a front-end framework.
17. Literature Review
[10] Moise, I., Gaere, E., Merz, R., Koch, S. and Pournaras, E., 2016, December.
Tracking language mobility in the twitter landscape. In 2016 IEEE 16th
International Conference on Data Mining Workshops (ICDMW)
(pp. 663-670). IEEE.
Objective of Paper: To examine the mobility of languages as captured by the
Twitter signal and to extract value from Twitter data.
Approach/Algorithm/Framework: Density-based Clustering and
Self-Organizing Maps techniques
Pros: The analysis enabled the detection of tourism trends and real-world
events, as discovered through the Twitter lens based on country-language
coupling.
Cons: Methods that identify location from the text of tweets are not
explored, and the same analytical steps are not applied to other countries
and languages.
18. Literature Review
[11] Bharill, N., Tiwari, A. and Malviya, A., 2016. Fuzzy based scalable
clustering algorithms for handling big data using apache spark. IEEE
Transactions on Big Data, 2(4), pp.339-352.
Objective of Paper: To implement partitional clustering algorithms on
Apache Spark that are suited to clustering large datasets because of their
low computational requirements.
Approach/Algorithm/Framework: Scalable Random Sampling with Iterative
Optimization Fuzzy c-Means algorithm (SRSIO-FCM)
Pros: The comparative results on time and space complexity, run time and
clustering quality show that SRSIO-FCM runs in much less time without
compromising clustering quality.
Cons: Extensions of well-known cluster validity measures for use on Big
Data are not presented.
19. Literature Review
[12] Rodrigues, A.P. and Chiplunkar, N.N., 2018. Real-time Twitter data analysis
using Hadoop ecosystem. Cogent Engineering, 5(1), p.1534519.
Objective of Paper: To compare executed tweets with real-time tweets, and
to compare the performance of Pig and Hive in terms of execution time when
analyzing real-time tweets.
Approach/Algorithm/Framework: Hadoop Ecosystem
Pros: The experimental results show that Pig is more efficient than Hive, as
it takes less time to execute.
Cons: Only large-scale business organizations that generate big data can
make use of Hadoop, as it does not perform efficiently in small-scale data
environments.
20. Literature Review
[13] Swe, T.T., Phyu, P. and Thein, S.P.P., 2019. Weather Prediction Model using
Random Forest Algorithm and Apache Spark. Weather, 3(6).
Objective of Paper: To analyze big data algorithms that are suitable for
weather prediction, focusing on performance analysis of the Random Forest
algorithm.
Approach/Algorithm/Framework: Apache Spark
Pros: Experimental results indicate the notable advantages of Random
Forest over the other algorithms in terms of classification accuracy,
performance, and scalability.
Cons: An incremental parallel random forest algorithm for data streams in a
cloud environment and a task scheduling mechanism for the algorithm are
not implemented.
21. Literature Review
[14] Wang, Y., Wang, M. and Xu, W., 2018. A sentiment-enhanced hybrid
recommender system for movie recommendation: a big data analytics
framework. Wireless Communications and Mobile Computing, 2018.
Objective of Paper: To build a hybrid recommendation model that improves
the accuracy and timeliness of a mobile movie recommender system based
on sentiment analysis.
Approach/Algorithm/Framework: Apache Spark, Content-based
recommender system, Collaborative filtering recommender system, Hybrid
recommender system
Pros: The implemented method makes it convenient and fast for users to
receive useful movie suggestions.
Cons: The individual characteristics hidden in users' text descriptions are
not eliminated.
22. Literature Review
[16] Omar, H.K. and Jumaa, A.K., 2019. Big Data Analysis Using Apache Spark
MLlib and Hadoop HDFS with Scala and Java. Kurdistan Journal of Applied
Research, 4(1), pp.7-14.
Objective of Paper: To analyze big data with more suitable programming
languages and consequently gain better performance.
Approach/Algorithm/Framework: Decision Tree Regression algorithm,
Clustering algorithm
Pros: It is observed that Scala on Spark speeds up the algorithms and
completes them in less time than Java.
Cons: The environment is not changed from a single-node cluster to a
multi-node cluster, which would give better performance and the ability to
process larger datasets.
23. Problem Statement
To design a Sentiment Analysis System (framework) where
real-time (or near real-time) sentiments are gathered for
Catastrophe management, Utility modification and Core
marketing.
• Data preprocessing (fetching raw tweets and cleaning).
• Classification of posts/tweets.
• Displaying the top trending topics on the dashboard.
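The cleaning part of the preprocessing step might look something like the sketch below. The function name `clean_tweet` and the specific rules (dropping links, mentions and the retweet marker, keeping hashtag words) are assumptions for illustration. Note that this simple version also discards emojis, which a real pipeline would likely keep, since they carry sentiment.

```python
import re

def clean_tweet(raw):
    """Strip URLs, mentions and the retweet marker, keep hashtag words."""
    text = re.sub(r"https?://\S+", " ", raw)      # remove links
    text = re.sub(r"@\w+", " ", text)             # remove user mentions
    text = re.sub(r"^\s*RT\b:?", " ", text)       # remove leading retweet marker
    text = text.replace("#", "")                  # keep the hashtag word itself
    text = re.sub(r"[^A-Za-z0-9\s']", " ", text)  # drop remaining punctuation (and emojis)
    return " ".join(text.lower().split())

print(clean_tweet("RT @user: Loving the #Spark demo!!! https://example.com/x"))
# prints "loving the spark demo"
```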
26. Proposed System
A Dashboard showing the Trend of a Twitter topic
27. Proposed System
• First, Apache Flume is used to connect to Twitter and fetch tweets.
• Then, Apache Spark Streaming is used to obtain a stream of real-time
tweets.
• Hive is used to query the tweets stored in HDFS.
• Tweets are enriched with information on Sentiment and related
entities derived from the post.
• Various statistics about the data are shown in continuously updated
Live Charts to display the Trending topics on the Dashboard.
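The counting logic behind the trending-topics display can be sketched as below. This is a plain Python stand-in that treats each list of tweets as one micro-batch; in the proposed system the same idea would run on Spark Streaming. The function names and sample tweets are hypothetical.

```python
from collections import Counter
import re

def hashtags(tweet):
    """Return the lowercase hashtags found in a tweet."""
    return [tag.lower() for tag in re.findall(r"#\w+", tweet)]

def top_trending(batches, n=3):
    """Accumulate hashtag counts across batches and yield the top n after each."""
    counts = Counter()
    for batch in batches:
        for tweet in batch:
            counts.update(hashtags(tweet))
        yield counts.most_common(n)

batches = [
    ["Heavy rain today #flood", "Stay safe #flood #weather"],
    ["New release #spark", "Roads closed #flood"],
]
for snapshot in top_trending(batches, n=2):
    print(snapshot)
```

Each snapshot is what a dashboard would redraw after a batch arrives. A production version would probably count over a sliding time window rather than over all tweets seen so far, so that old topics fade out.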
28. References
[01] Garg, K. and Kaur, D., 2019. Sentiment Analysis on Twitter Data using Apache Hadoop and
Performance Evaluation on Hadoop MapReduce and Apache Spark. In Proceedings on the
International Conference on Artificial Intelligence (ICAI) (pp. 233-238). The Steering
Committee of The World Congress in Computer Science, Computer Engineering and Applied
Computing (WorldComp).
[02] Svyatkovskiy, A., Imai, K., Kroeger, M. and Shiraito, Y., 2016, December. Large-scale text
processing pipeline with Apache Spark. In 2016 IEEE International Conference on Big Data
(Big Data) (pp. 3928-3935). IEEE.
[03] Al-Saqqa, S., Al-Naymat, G. and Awajan, A., 2018. A Large-Scale Sentiment Data Classification
for Online Reviews Under Apache Spark. Procedia Computer Science, 141, pp.183-189.
[04] Ranganathan, J., Irudayaraj, A.S. and Tzacheva, A.A., 2017, November. Action rules for
sentiment analysis on twitter data using spark. In 2017 IEEE International Conference on Data
Mining Workshops (ICDMW) (pp. 51-60). IEEE.
[05] Elzayady, H., Badran, K.M. and Salama, G.I., 2018, December. Sentiment Analysis on Twitter
Data using Apache Spark Framework. In 2018 13th International Conference on Computer
Engineering and Systems (ICCES) (pp. 171-176). IEEE.
29. References
[06] Das, S., Behera, R.K. and Rath, S.K., 2018. Real-time sentiment analysis of Twitter streaming
data for stock prediction. Procedia Computer Science, 132, pp.956-964.
[07] Adib, P., Alirezazadeh, S. and Nezarat, A., 2017, October. Enhancing trust accuracy among
online social network users utilizing data text mining techniques in apache spark. In 2017 7th
International Conference on Computer and Knowledge Engineering (ICCKE) (pp. 283-288).
IEEE.
[08] Podhoranyi, M. and Vojacek, L., 2019, September. Social Media Data Processing Infrastructure
by Using Apache Spark Big Data Platform: Twitter Data Analysis. In Proceedings of the 2019
4th International Conference on Cloud Computing and Internet of Things (pp. 1-6).
[09] Yeruva, V.K., Junaid, S. and Lee, Y., 2017, November. Exploring social contextual influences on
healthy eating using big data analytics. In 2017 IEEE International Conference on
Bioinformatics and Biomedicine (BIBM) (pp. 1507-1514). IEEE.
[10] Moise, I., Gaere, E., Merz, R., Koch, S. and Pournaras, E., 2016, December. Tracking language
mobility in the twitter landscape. In 2016 IEEE 16th International Conference on Data Mining
Workshops (ICDMW) (pp. 663-670). IEEE.
30. References
[11] Bharill, N., Tiwari, A. and Malviya, A., 2016. Fuzzy based scalable clustering algorithms for
handling big data using apache spark. IEEE Transactions on Big Data, 2(4), pp.339-352.
[12] Rodrigues, A.P. and Chiplunkar, N.N., 2018. Real-time Twitter data analysis using Hadoop
ecosystem. Cogent Engineering, 5(1), p.1534519.
[13] Swe, T.T., Phyu, P. and Thein, S.P.P., 2019. Weather Prediction Model using Random Forest
Algorithm and Apache Spark. Weather, 3(6).
[14] Wang, Y., Wang, M. and Xu, W., 2018. A sentiment-enhanced hybrid recommender system for
movie recommendation: a big data analytics framework. Wireless Communications and Mobile
Computing, 2018.
[15] Alparslan, E. and Karahoca, A., 2016. Detecting similar opinion holders for massive sentiment
analysis. Global Journal of Information Technology: Emerging Technologies, 6(1), pp.65-71.
[16] Omar, H.K. and Jumaa, A.K., 2019. Big Data Analysis Using Apache Spark MLlib and Hadoop
HDFS with Scala and Java. Kurdistan Journal of Applied Research, 4(1), pp.7-14.