Sentiment Analysis (SA) refers to the use of Natural Language Processing (NLP), text analysis, and computational linguistics to extract and identify subjective information in source material. A fundamental task of SA is to classify the polarity of a given text at the document, sentence, or feature/aspect level: whether the opinion expressed is positive, negative, or neutral. Usually, this analysis is performed "offline" using Machine Learning (ML) techniques. In this project, two online tweet classification methods are proposed, which exploit the well-known framework Apache Spark for processing the data and the tool Apache Zeppelin for data visualization.
3. Sentiment Analysis (SA) refers to the use of Natural
Language Processing (NLP) and Text Analysis to
extract, identify or otherwise characterize the sentiment
content of a text unit.
Introduction
Example polarities:
"The main dish was delicious" → Positive
"It is a dish" → Neutral
"The main dish was salty and horrible" → Negative
4. Existing SA approaches can be grouped into three main
categories:
1. Knowledge-based techniques;
2. Statistical methods;
3. Hybrid approaches.
Statistical methods take advantage of elements of
Machine Learning (ML) such as Latent Semantic
Analysis (LSA), Multinomial Naïve Bayes (MNB),
Support Vector Machines (SVM), etc.
Introduction (cont.)
5. The bag-of-words model is a simplifying representation
used in NLP and Information Retrieval (IR).
In this model, a text is represented as the bag of its
words, ignoring grammar and even word order but
keeping multiplicity.
The bag-of-words model is commonly used in methods
of document classification, where the occurrence of
each word (TF) is used as a feature for training a
classifier.
Bag-of-words
6. 1. Tokenization;
2. Stop-word removal;
3. Stemming;
4. Computation of tf-idf (term frequency, inverse
document frequency);
5. Use of a machine learning classifier for tweet
classification (e.g., Naïve Bayes, Support Vector
Machines, etc.).
Bag-of-words (cont.)
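Steps 1-4 of the pipeline above can be sketched in plain Python, without Spark. This is a minimal illustration, not the project's actual code: the toy corpus, stopword list, and crude suffix-stripping stemmer are all hypothetical stand-ins for real tokenizers and stemmers.

```python
import math
import re

# Toy labeled corpus (hypothetical, for illustration only).
corpus = [
    ("the main dish was delicious", "positive"),
    ("the main dish was salty and horrible", "negative"),
]
stopwords = {"the", "was", "and", "is", "a", "an"}

def tokenize(text):
    # 1. Tokenization: lowercase and split on non-letter characters.
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def stem(token):
    # 3. Stemming: a crude suffix stripper standing in for a real stemmer.
    for suffix in ("ing", "ed"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    # Steps 1-3 combined: tokenize, remove stopwords, stem.
    return [stem(t) for t in tokenize(text) if t not in stopwords]

docs = [preprocess(text) for text, _ in corpus]

def tf_idf(term, doc, docs):
    # 4. tf = raw count of the term in the document;
    #    idf = log(N / number of documents containing the term).
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df)

# "delicious" appears in one of the two documents, so its idf is log(2);
# "main" appears in both, so its idf (hence its weight) is 0.
print(tf_idf("delicious", docs[0], docs))
```

Step 5 (training a classifier on these features) is omitted here.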
7. Spark Streaming is an extension of the core Spark API.
Data can be ingested from many sources like Kafka,
etc.
Processed data can be pushed out to filesystems,
databases, etc.
Furthermore, it’s possible to apply Spark’s machine
learning algorithms on data streams.
Spark Streaming
8. Spark Streaming receives live input data streams and
divides the data into batches.
Spark Streaming provides a high-level abstraction
called Discretized Stream (DStream), a continuous
stream of data.
A DStream can be created either from input data streams
from sources such as Kafka, Flume, etc., or by applying
operations on other DStreams.
Spark Streaming (cont.)
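The micro-batch idea behind DStreams can be illustrated without Spark at all: a minimal plain-Python sketch (not Spark's API) that chops an incoming stream into fixed-size batches and applies a map-style operation to each batch, the way Spark Streaming applies transformations per batch interval.

```python
from itertools import islice

def dstream(source, batch_size):
    """Chop an iterator of records into fixed-size batches, mimicking
    how Spark Streaming turns a live stream into a DStream of small
    per-interval batches (in Spark, each batch is an RDD)."""
    it = iter(source)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical incoming "tweets"; a real source would be Kafka, Flume, etc.
stream = iter(["good food", "bad service", "ok place", "great staff"])

# A map-style transformation applied batch by batch, as on a DStream.
for batch in dstream(stream, batch_size=2):
    print([len(tweet.split()) for tweet in batch])
```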
9. Kafka is a Distributed Streaming Platform and it
behaves like a partitioned, replicated commit log
service.
It provides the functionality of a messaging system.
Kafka is run as a cluster on one or more servers.
The Kafka cluster stores streams of records in
categories called topics.
Apache Kafka
10. This project uses two of Kafka's four main core APIs:
1. The Producer API allows an application to publish a
stream of records to one or more Kafka topics;
2. The Consumer API allows an application to subscribe to
one or more topics and process the stream of records
produced to them.
Apache Kafka (cont.)
So, at high level, producers
send messages over the
network to the Kafka cluster
which in turn serves them up
to consumers.
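The partitioned commit-log model described above can be sketched in memory, without a broker. This is an illustrative stand-in, not a Kafka client: real code would use a library such as kafka-python, and the class and method names here are hypothetical.

```python
class Topic:
    """In-memory stand-in for a Kafka topic: one append-only log per
    partition; consumers track their own read offsets."""

    def __init__(self, partitions=2):
        self.logs = [[] for _ in range(partitions)]

    def produce(self, key, value):
        # Producer side: records with the same key land in the same
        # partition, so per-key ordering is preserved.
        partition = hash(key) % len(self.logs)
        self.logs[partition].append(value)

    def consume(self, partition, offset):
        # Consumer side: read every record in a partition from a given
        # offset onward; the log itself is never mutated by reads.
        return self.logs[partition][offset:]

topic = Topic(partitions=1)          # one partition keeps ordering global
topic.produce("user1", "tweet A")
topic.produce("user2", "tweet B")

print(topic.consume(partition=0, offset=0))   # both records
print(topic.consume(partition=0, offset=1))   # only the second one
```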
11. Spark SQL is a component on top of Spark Core that
introduces a new data abstraction called SchemaRDD, which
provides support for structured and semi-structured data.
Spark SQL also provides JDBC connectivity and can
access several databases using both the Hadoop connector
and the Spark connector.
In order to store data in, or retrieve data from, a
database, it is necessary to:
Define an SQLContext (the entry point) for using all of
Spark SQL's functionality;
Create a table schema by means of a StructType, to which a
specific method is applied to create a DataFrame.
By using JDBC drivers, the resulting DataFrame is written
to a database.
Output operations for DStreams
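The flow above (entry point, schema, write, query back) can be sketched with the standard-library sqlite3 module standing in for a JDBC-accessible database. This is only an analogy, not Spark SQL's API: the table name and columns are hypothetical.

```python
import sqlite3

# Entry point (plays the role of Spark's SQLContext).
conn = sqlite3.connect(":memory:")

# Declare the table schema (plays the role of a StructType).
conn.execute("CREATE TABLE tweets (text TEXT, sentiment TEXT)")

# Write rows (plays the role of a DataFrame pushed out over JDBC).
rows = [("great food", "positive"), ("awful service", "negative")]
conn.executemany("INSERT INTO tweets VALUES (?, ?)", rows)

# Query back, as a dashboard (e.g. Zeppelin over JDBC) would.
counts = conn.execute(
    "SELECT sentiment, COUNT(*) FROM tweets "
    "GROUP BY sentiment ORDER BY sentiment"
).fetchall()
print(counts)
```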
12. MLlib is Spark’s library of machine learning functions.
MLlib contains a variety of learning algorithms and is
accessible from all Spark’s programming languages.
It consists of common learning algorithms and utilities,
including classification, regression, clustering, etc.
Machine Learning with MLlib
13. The mllib.feature package contains several classes for
common feature transformations. These include
algorithms to construct feature vectors from text and ways
to normalize and scale features.
Term frequency-inverse document frequency (TF-IDF) is a
feature vectorization method widely used in text mining to
reflect the importance of a term to a document in the
corpus.
Feature extraction
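In its standard form, the weight of a term t in a document d over a corpus D is the product of the two factors named above (MLlib's implementation smooths the denominator, but the idea is the same):

```latex
\mathrm{tfidf}(t, d, D) \;=\; \mathrm{tf}(t, d) \cdot \log\frac{|D|}{|\{d' \in D : t \in d'\}|}
```

where tf(t, d) is the count of t in d and |D| is the number of documents in the corpus: a term common to every document gets weight 0, while a term concentrated in few documents is weighted up.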
14. Classification and regression are two common forms of
supervised learning, where algorithms attempt to predict a
variable from features of objects using labeled training
data.
Both classification and regression use the LabeledPoint
class in MLlib.
MLlib includes a variety of methods for classification and
regression, including simple linear methods and decision
trees and forests.
Classification
15. Naïve Bayes is a multiclass classification algorithm that
scores how well each point belongs in each class based on
a linear function of the features.
It’s commonly used in text classification with TF-IDF
features, among other applications such as Tweet
Sentiment Analysis.
In MLlib, it’s possible to use Naïve Bayes through the
mllib.classification.NaiveBayes class.
Naïve Bayes
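A minimal multinomial Naïve Bayes with Laplace smoothing can be written in plain Python to show what the MLlib class does under the hood. This is a sketch, not MLlib's API; the tiny training set of tokenized "tweets" is hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Multinomial Naive Bayes with Laplace (add-one) smoothing.
    docs: list of (token_list, label) pairs."""
    class_docs = defaultdict(int)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    n_docs = sum(class_docs.values())
    model = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        model[label] = {
            "prior": math.log(class_docs[label] / n_docs),
            "likelihood": {
                w: math.log((word_counts[label][w] + 1) / (total + len(vocab)))
                for w in vocab
            },
        }
    return model

def predict(model, tokens):
    # Score each class by log-prior plus summed log-likelihoods (a
    # linear function of the word-count features); pick the best.
    def score(label):
        m = model[label]
        return m["prior"] + sum(
            m["likelihood"][w] for w in tokens if w in m["likelihood"])
    return max(model, key=score)

train = [
    (["great", "food"], "positive"),
    (["love", "it"], "positive"),
    (["horrible", "service"], "negative"),
    (["bad", "food"], "negative"),
]
model = train_nb(train)
print(predict(model, ["great", "food"]))   # → positive
```

In practice the token lists would be replaced by the TF-IDF features described earlier.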
16. Clustering is the unsupervised learning task that involves
grouping objects into clusters of high similarity.
Unlike the supervised tasks, where the data is labeled,
clustering can be used to make sense of unlabeled data.
It is commonly used in data exploration and in anomaly
detection.
Clustering
17. In addition to the popular "offline" K-means algorithm,
MLlib also provides an "online" version for clustering
data streams.
When data arrive in a stream, the algorithm dynamically:
1. Estimates the cluster membership of the new data;
2. Updates the centroids of the clusters.
Streaming K-means
18. In MLlib, it’s possible to use Streaming K-means through
the mllib.clustering.StreamingKMeans class.
Streaming K-means (cont.)
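The two dynamic steps above (assign membership, then update centroids) can be sketched in plain Python with a running-mean update rule. This illustrates the core idea only; MLlib's StreamingKMeans additionally supports a decay factor for forgetting old data, which is omitted here.

```python
def assign(point, centroids):
    # 1. Membership: index of the nearest centroid (squared distance).
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centroids[i])))

def update(centroids, counts, batch):
    """One streaming K-means step: each point in the incoming batch
    pulls its nearest centroid toward it by the running-mean rule
    c <- c + (x - c) / n, where n counts points seen by that centroid."""
    for point in batch:
        i = assign(point, centroids)
        counts[i] += 1
        centroids[i] = [c + (x - c) / counts[i]
                        for c, x in zip(centroids[i], point)]
    return centroids

# Two 1-D centroids; batches arrive over time (hypothetical stream).
centroids, counts = [[0.0], [10.0]], [1, 1]
update(centroids, counts, [[1.0], [9.0], [2.0], [11.0]])
print(centroids)   # each centroid has drifted toward its batch points
```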
19. Given a dataset of points in high-dimension space, we are
often interested in reducing the dimensionality of the points
so that they can be analyzed with simpler tools.
For example, we might want to plot the points in two
dimensions, or just reduce the number of features to train
models more efficiently.
In MLlib, it’s possible to use PCA through
the mllib.feature.PCA class.
Principal Component Analysis (PCA)
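For the two-dimensional case, PCA can be worked out in plain Python with the closed-form eigendecomposition of a symmetric 2x2 matrix. This is an illustrative sketch (the four points are hypothetical and lie exactly along the direction (1, 1)), not MLlib's implementation.

```python
import math

# Hypothetical 2-D points lying exactly along the direction (1, 1).
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
n = len(points)

# Center the data.
mx = sum(x for x, _ in points) / n
my = sum(y for _, y in points) / n
centered = [(x - mx, y - my) for x, y in points]

# Sample covariance matrix [[a, b], [b, c]].
a = sum(x * x for x, _ in centered) / (n - 1)
b = sum(x * y for x, y in centered) / (n - 1)
c = sum(y * y for _, y in centered) / (n - 1)

# Leading eigenvalue/eigenvector of a symmetric 2x2 matrix (closed form).
lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
vx, vy = (b, lam - a) if b else (1.0, 0.0)
norm = math.hypot(vx, vy)
vx, vy = vx / norm, vy / norm

# Project each point onto the first principal component (2-D -> 1-D).
reduced = [round(x * vx + y * vy, 4) for x, y in centered]
print(reduced)
```

The four points collapse to evenly spaced 1-D coordinates with no information lost, because the data had only one direction of variance to begin with.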
20. Apache Zeppelin is a web-based notebook that enables
interactive data visualization.
Apache Zeppelin’s interpreter concept allows any
language/data-processing-backend, such as JDBC, to be
plugged into Zeppelin.
Apache Zeppelin
21. Because of the lack of a Spark Streaming API (Python) for
accessing a Twitter account, the tweet streams have
been simulated using Apache Kafka.
In particular, the entity responsible for this task is a
Producer, which publishes a stream of data on a specific
topic.
The training and testing data streams have been retrieved
from [1].
On the other side, each received DStream is processed by
a Consumer, using stateless Spark functions such as map,
transform, etc.
Implementation and results
24. Future work
Integrate the Twitter API’s methods to retrieve tweets
from accounts.
Use an alternative feature extraction method for the
Streaming K-means task.
25. [1] http://help.sentiment140.com/for-students/
[2] Karau, Holden, et al. Learning Spark: Lightning-Fast Big Data
Analysis. O'Reilly Media, Inc., 2015.
[3] Bogomolny, A. Benford's Law and Zipf's Law.
http://www.cut-the-knot.org/doyouknow/zipfLaw.shtml
References