
Online Tweet Sentiment Analysis with Apache Spark


Sentiment Analysis (SA) refers to the use of Natural Language Processing (NLP), text analysis and computational linguistics to extract and identify subjective information from source material. A fundamental task of SA is to "classify" the polarity of a given text at the document, sentence or feature/aspect level, that is, whether the opinion expressed in a document or in a sentence is positive, negative or neutral. Usually, this analysis is performed "offline" using Machine Learning (ML) techniques. In this project, two online tweet classification methods are proposed, which exploit the well-known framework Apache Spark for processing the data and the tool Apache Zeppelin for data visualization.

Published in: Data & Analytics


  1. Online Tweet Sentiment Analysis with Apache Spark. Davide Nardone, 0120/131, PARTHENOPE UNIVERSITY
  2. Summary: 1. Introduction 2. Bag-of-words 3. Spark Streaming 4. Apache Kafka 5. DataFrame and SQL operations 6. Machine Learning library (MLlib) 7. Apache Zeppelin 8. Implementation and results
  3. Introduction:  Sentiment Analysis (SA) refers to the use of Natural Language Processing (NLP) and Text Analysis to extract, identify or otherwise characterize the sentiment content of a text unit. Examples: "The main dish was delicious" (positive); "It is a dish" (neutral); "The main dish was salty and horrible" (negative).
  4. Introduction (cont.):  Existing SA approaches can be grouped into three main categories: 1. Knowledge-based techniques; 2. Statistical methods; 3. Hybrid approaches.  Statistical methods take advantage of elements of Machine Learning (ML) such as Latent Semantic Analysis (LSA), Multinomial Naïve Bayes (MNB), Support Vector Machines (SVM), etc.
  5. Bag-of-words:  The bag-of-words model is a simplifying representation used in NLP and Information Retrieval (IR).  In this model, a text is represented as the bag of its words, ignoring grammar and even word order but keeping multiplicity.  The bag-of-words model is commonly used in document classification, where the occurrence of each word (TF) is used as a feature for training a classifier.
  6. Bag-of-words (cont.): 1. Tokenizing; 2. Stopping (stop-word removal); 3. Stemming; 4. Computing tf (term frequency) and idf (inverse document frequency); 5. Using a machine learning classifier (e.g., Naïve Bayes, Support Vector Machine, etc.) to classify the tweets.
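The first three steps of this pipeline can be sketched in plain Python. This is an illustrative toy (the stop-word list and suffix-stripping "stemmer" are simplifications; a real system would use a full stop-word list and Porter's stemmer):

```python
import re
from collections import Counter

STOPWORDS = {"the", "is", "a", "an", "and", "was"}  # toy stop-word list

def tokenize(text):
    # 1. Tokenizing: lowercase and split on non-letter characters
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def remove_stopwords(tokens):
    # 2. Stopping: drop very common, low-information words
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    # 3. Stemming: crude suffix stripping (stand-in for a real stemmer)
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def term_frequencies(text):
    # 4. Term frequency: count occurrences of each processed token
    return Counter(stem(t) for t in remove_stopwords(tokenize(text)))

tf = term_frequencies("The main dish was salty and the sauce was horrible")
# tf counts "main", "dish", "salty", "sauce", "horrible" once each
```

Each resulting count vector becomes one training example for the classifier in step 5.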
  7. Spark Streaming:  Spark Streaming is an extension of the core Spark API.  Data can be ingested from many sources such as Kafka, etc.  Processed data can be pushed out to filesystems, databases, etc.  Furthermore, it's possible to apply Spark's machine learning algorithms to data streams.
  8. Spark Streaming (cont.):  Spark Streaming receives live input data streams and divides the data into batches.  It provides a high-level abstraction called Discretized Stream, or DStream (a continuous stream of data).  A DStream can be created from input data streams from sources such as Kafka, Flume, etc.
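The micro-batching idea behind a DStream can be shown with a small pure-Python sketch (not the Spark API itself, where the batch boundary is a time interval rather than a fixed count):

```python
def micro_batches(stream, batch_size):
    """Group an unbounded record stream into fixed-size batches,
    mimicking how Spark Streaming discretizes a live stream into a
    DStream, i.e. a sequence of small batches (RDDs)."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:          # flush the final partial batch
        yield batch

# Each yielded batch would be handed to ordinary (batch) Spark code.
tweets = ["t1", "t2", "t3", "t4", "t5"]
batches = list(micro_batches(tweets, 2))
# batches == [["t1", "t2"], ["t3", "t4"], ["t5"]]
```

This is why the same MLlib algorithms can run on streams: each micro-batch is just a small ordinary dataset.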
  9. Apache Kafka:  Kafka is a distributed streaming platform that behaves like a partitioned, replicated commit-log service.  It provides the functionality of a messaging system.  Kafka runs as a cluster on one or more servers.  The Kafka cluster stores streams of records in categories called topics.
  10. Apache Kafka (cont.):  Of Kafka's four main core APIs, this project uses two: 1. The Producer API allows an application to publish a stream of records to one or more Kafka topics; 2. The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.  So, at a high level, producers send messages over the network to the Kafka cluster, which in turn serves them up to consumers.
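The commit-log model behind these two APIs can be illustrated with a toy in-memory topic (purely conceptual; real Kafka adds partitioning, replication and persistence):

```python
class Topic:
    """Toy in-memory analogue of a Kafka topic: an append-only log that
    producers write to and consumers read from by offset, so each
    consumer tracks its own position in the log independently."""
    def __init__(self):
        self.log = []            # append-only record log
        self.offsets = {}        # consumer id -> next offset to read

    def produce(self, record):
        # Producer API: publish a record to the topic
        self.log.append(record)

    def consume(self, consumer_id):
        # Consumer API: read all records published since the last read
        start = self.offsets.get(consumer_id, 0)
        records = self.log[start:]
        self.offsets[consumer_id] = len(self.log)
        return records

topic = Topic()
topic.produce("tweet-1")
topic.produce("tweet-2")
first = topic.consume("c1")    # ["tweet-1", "tweet-2"]
topic.produce("tweet-3")
second = topic.consume("c1")   # ["tweet-3"]
```

Because consumption only advances an offset, many independent consumers can replay the same topic, which is what makes the log a good stream-simulation tool for this project.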
  11. Output operations for DStreams:  Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data.  Spark SQL also provides JDBC connectivity and can access several databases using both Hadoop and Spark connectors.  In order to store or retrieve data, it is necessary to:  define an SQLContext (the entry point for all of Spark SQL's functionality);  create a table schema by means of a StructType, to which a specific method is applied to create a DataFrame.  By using JDBC drivers, this schema is then written to a database.
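The "define a schema, then write each processed batch to a database" pattern can be sketched with the standard-library sqlite3 module standing in for the JDBC sink (the table name and columns here are made up for illustration):

```python
import sqlite3

# Define a table schema (the analogue of the StructType in the slide).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (text TEXT, sentiment TEXT)")

# One processed micro-batch of (tweet, predicted label) rows,
# written inside a single transaction.
batch = [("great movie", "positive"), ("awful service", "negative")]
with conn:
    conn.executemany("INSERT INTO tweets VALUES (?, ?)", batch)

# Downstream tools (e.g. a Zeppelin notebook) can then query the table.
rows = conn.execute(
    "SELECT sentiment, COUNT(*) FROM tweets "
    "GROUP BY sentiment ORDER BY sentiment"
).fetchall()
# rows == [("negative", 1), ("positive", 1)]
```

Persisting each batch this way is what lets the visualization layer read results independently of the streaming job.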
  12. Machine Learning with MLlib:  MLlib is Spark's library of machine learning functions.  It contains a variety of learning algorithms and is accessible from all of Spark's programming languages.  It consists of common learning algorithms and utilities, including classification, regression, clustering, etc.
  13. Feature extraction:  The mllib.feature package contains several classes for common feature transformations. These include algorithms to construct feature vectors from text and ways to normalize and scale features.  Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus.
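The TF-IDF weighting can be computed by hand; this pure-Python sketch uses the smoothed IDF that MLlib documents, idf(t) = log((m + 1) / (df(t) + 1)) with m the number of documents:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF vectors (as dicts) for a list of tokenized docs,
    using the smoothed IDF: idf(t) = log((m + 1) / (df(t) + 1))."""
    m = len(docs)
    df = Counter()                       # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    return [
        {t: tf * math.log((m + 1) / (df[t] + 1))
         for t, tf in Counter(doc).items()}
        for doc in docs
    ]

docs = [["spark", "streaming"], ["spark", "kafka"]]
vectors = tf_idf(docs)
# "spark" appears in every document, so its weight is log(3/3) = 0:
# terms common to the whole corpus carry no discriminative information.
```

A term frequent in one document but rare across the corpus gets a high weight, which is exactly what a sentiment classifier needs from its features.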
  14. Classification:  Classification and regression are two common forms of supervised learning, where an algorithm attempts to predict a variable from features of objects using labeled training data.  Both classification and regression use the LabeledPoint class in MLlib.  MLlib includes a variety of methods for classification and regression, including simple linear methods and decision trees and forests.
  15. Naïve Bayes:  Naïve Bayes is a multiclass classification algorithm that scores how well each point belongs to each class based on a linear function of the features.  It's commonly used in text classification with TF-IDF features, among other applications such as tweet sentiment analysis.  In MLlib, it's available through the mllib.classification.NaiveBayes class.
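The "linear function of the features" scoring can be seen in a from-scratch multinomial Naïve Bayes (an illustration of the algorithm, not the MLlib implementation): the score of a class is its log prior plus a sum of per-term log likelihoods, here with Laplace smoothing.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """Train a multinomial Naive Bayes on (label, tokens) pairs.
    Returns a predict(tokens) function."""
    class_docs = defaultdict(int)          # documents per class
    word_counts = defaultdict(Counter)     # term counts per class
    vocab = set()
    for label, tokens in samples:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    n = len(samples)

    def predict(tokens):
        best, best_score = None, -math.inf
        for label in class_docs:
            total = sum(word_counts[label].values())
            score = math.log(class_docs[label] / n)        # log prior
            for t in tokens:                               # log likelihoods
                score += math.log((word_counts[label][t] + 1)
                                  / (total + len(vocab)))  # Laplace smoothing
            if score > best_score:
                best, best_score = label, score
        return best

    return predict

predict = train_nb([
    ("pos", ["good", "great", "love"]),
    ("neg", ["bad", "awful", "hate"]),
])
label = predict(["love", "great"])  # -> "pos"
```

Because the score is a sum over term counts, training and prediction both scale linearly with the data, which is why Naïve Bayes works well on large tweet corpora.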
  16. Clustering:  Clustering is the unsupervised learning task of grouping objects into clusters of high similarity.  Unlike supervised tasks, where the data is labeled, clustering can be used to make sense of unlabeled data.  It is commonly used in data exploration and in anomaly detection.
  17. Streaming K-means:  In addition to the popular K-means "offline" algorithm, MLlib also provides an "online" version for clustering "online" data streams.  As data arrive in a stream, the algorithm dynamically: 1. estimates the cluster membership of the data; 2. updates the centroids of the clusters.
  18. Streaming K-means (cont.):  In MLlib, Streaming K-means is available through the mllib.clustering.StreamingKMeans class.
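The per-batch centroid update that Streaming K-means performs can be sketched for a single cluster: the old center, whose accumulated weight n is discounted by a decay factor a, is merged with the new points assigned to it via c' = (c·n·a + Σx) / (n·a + m). This is an illustration of the update rule, not the MLlib code:

```python
def update_center(center, n, decay, batch_points):
    """One Streaming K-means update for a single cluster.
    center: current centroid (tuple), n: its accumulated weight,
    decay: forgetfulness factor in [0, 1], batch_points: new points
    assigned to this cluster in the current micro-batch."""
    m = len(batch_points)
    if m == 0:
        return center, n * decay       # nothing assigned: just decay
    dims = len(center)
    sums = [sum(p[d] for p in batch_points) for d in range(dims)]
    new_n = n * decay + m
    new_center = tuple((center[d] * n * decay + sums[d]) / new_n
                       for d in range(dims))
    return new_center, new_n

center, weight = update_center((0.0, 0.0), n=2, decay=1.0,
                               batch_points=[(2.0, 2.0), (4.0, 4.0)])
# center == (1.5, 1.5): the old weight-2 center at the origin is
# averaged with the two new points (mean (3, 3)).
```

With decay < 1 old batches are progressively forgotten, letting the clusters track drift in the tweet stream.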
  19. Principal Component Analysis (PCA):  Given a dataset of points in a high-dimensional space, we are often interested in reducing the dimensionality of the points so that they can be analyzed with simpler tools.  For example, we might want to plot the points in two dimensions, or just reduce the number of features to train models more efficiently.  In MLlib, PCA is available through the mllib.feature.PCA class.
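What PCA computes can be illustrated from scratch: center the data, form the covariance matrix, and extract its dominant eigenvector (the direction of maximum variance). This sketch uses power iteration for the eigenvector; it illustrates the idea only, as MLlib's implementation works differently.

```python
import math

def first_principal_component(points, iters=100):
    """Return the first principal component of a list of points
    (tuples), found by power iteration on the covariance matrix."""
    n, dims = len(points), len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    centered = [[p[d] - means[d] for d in range(dims)] for p in points]
    # Covariance matrix of the centered data
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(dims)]
           for i in range(dims)]
    # Power iteration: repeatedly apply cov and renormalize
    v = [1.0] * dims
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

pc = first_principal_component([(1, 1), (2, 2), (3, 3), (4, 4.1)])
# The data lie almost on the line y = x, so pc is close to
# (1/sqrt(2), 1/sqrt(2)) ~ (0.707, 0.707).
```

Projecting each point onto the top k such directions gives the reduced-dimension representation used, e.g., to plot clusters in two dimensions.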
  20. Apache Zeppelin:  Apache Zeppelin is a web-based notebook that enables interactive data visualization.  Zeppelin's interpreter concept allows any language/data-processing backend, such as JDBC, to be plugged into Zeppelin.
  21. Implementation and results:  Because of the lack of a Spark Streaming (Python) API for accessing a Twitter account, the tweet streams have been simulated using Apache Kafka.  In particular, the entity accounting for this task is a Producer, which publishes streams of data on a specific topic.  The training and testing data streams have been retrieved from [1].  On the other side, each received DStream is processed by a Consumer, using stateless Spark functions such as map, transform, etc.
  22. Naïve Bayes classification results
  23. Clustering results
  24. Future work:  Integrate the Twitter API method to retrieve tweets from accounts.  Use an alternative feature extraction method for the Streaming K-means task.
  25. References: [1] [2] Karau, Holden, et al. Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media, Inc., 2015. [3] Bogomolny, A. Benford's Law and Zipf's Law.
  26. For any questions, contact me at: