Online Tweet Sentiment Analysis
with Apache Spark
Davide Nardone
0120/131
PARTHENOPE
UNIVERSITY
1. Introduction
2. Bag-of-words
3. Spark Streaming
4. Apache Kafka
5. DataFrame and SQL operation
6. Machine Learning library (MLlib)
7. Apache Zeppelin
8. Implementation and results
Summary
• Sentiment Analysis (SA) refers to the use of Natural
Language Processing (NLP) and Text Analysis to
extract, identify or otherwise characterize the sentiment
content of a text unit.
• Examples:
"The main dish was delicious" (Positive);
"It is a dish" (Neutral);
"The main dish was salty and horrible" (Negative).
Introduction
• Existing SA approaches can be grouped into three main
categories:
1. Knowledge-based techniques;
2. Statistical methods;
3. Hybrid approaches.
• Statistical methods rely on Machine Learning (ML)
techniques such as Latent Semantic Analysis (LSA),
Multinomial Naïve Bayes (MNB), Support Vector Machines
(SVM), etc.
Introduction
(cont.)
• The bag-of-words model is a simplifying representation
used in NLP and Information Retrieval (IR).
• In this model, a text is represented as the bag of its
words, ignoring grammar and even word order but
keeping multiplicity.
• The bag-of-words model is commonly used in document
classification, where the number of occurrences of each
word (term frequency, TF) is used as a feature for training
a classifier.
Bag-of-words
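As a minimal sketch of the idea (plain Python, not the Spark API; the function name `bag_of_words` and the sample sentence are illustrative assumptions), a bag of words can be built by counting whitespace-separated tokens:

```python
from collections import Counter

def bag_of_words(text):
    """Map each word to its number of occurrences.

    Word order is discarded, multiplicity is kept.
    """
    return Counter(text.lower().split())

bow = bag_of_words("the main dish was delicious and the dessert was delicious")
print(bow["delicious"])  # 2

# Two texts with the same words in a different order yield the same bag.
print(bag_of_words("good not bad") == bag_of_words("bad not good"))  # True
```

The second call shows the main limitation of the model: sentences that differ only in word order become indistinguishable.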
1. Tokenization;
2. Stop-word removal;
3. Stemming;
4. Computation of TF-IDF (term frequency-inverse
document frequency);
5. Classification of the tweets with a machine learning
classifier (e.g., Naïve Bayes, Support Vector Machine).
Bag-of-words
(cont.)
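The first three steps of the list above can be sketched in plain Python. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical one, and `stem` is a crude suffix stripper standing in for a real stemmer (for example Porter's algorithm).

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "was", "were", "and", "it", "to", "of"}

def tokenize(text):
    """Step 1: lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens):
    """Step 2: drop very common words that carry little sentiment."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Step 3: crude suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Apply steps 1 to 3 in order and return the resulting stems."""
    return [stem(t) for t in remove_stop_words(tokenize(text))]

tokens = preprocess("The dishes were amazing and the waiters served quickly")
print(tokens)  # ['dish', 'amaz', 'waiter', 'serv', 'quickly']
```

The stems are not dictionary words; that is acceptable as long as the same rule is applied to both training and test tweets.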
• Spark Streaming is an extension of the core Spark API.
• Data can be ingested from many sources, such as Kafka.
• Processed data can be pushed out to filesystems,
databases, etc.
• Furthermore, Spark’s machine learning algorithms can be
applied to data streams.
Spark
Streaming
• Spark Streaming receives live input data streams and
divides the data into batches.
• Spark Streaming provides a high-level abstraction
called a Discretized Stream (DStream), which represents a
continuous stream of data.
• A DStream can be created either from input data streams
from sources such as Kafka and Flume, or by applying
high-level operations to other DStreams.
Spark Streaming
(cont.)
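To make the batching idea concrete, here is a small plain-Python simulation. It is not the Spark Streaming API: `discretize`, the timestamps, and the one-second interval are illustrative assumptions.

```python
from collections import defaultdict

def discretize(records, batch_interval):
    """Group (timestamp, value) records into consecutive batches.

    Each batch covers `batch_interval` seconds, mimicking how a live
    stream is cut into small batches. Intervals with no records are skipped.
    """
    batches = defaultdict(list)
    for timestamp, value in records:
        batches[int(timestamp // batch_interval)].append(value)
    return [batches[i] for i in sorted(batches)]

stream = [(0.2, "tweet a"), (0.9, "tweet b"), (1.1, "tweet c"), (2.5, "tweet d")]
batches = discretize(stream, batch_interval=1.0)
print(batches)  # [['tweet a', 'tweet b'], ['tweet c'], ['tweet d']]

# Each batch can then be processed independently, e.g. with a per-batch map.
lengths = [[len(t) for t in batch] for batch in batches]
```

In Spark Streaming the same role is played by the batch interval passed when the streaming context is created, and each batch is processed as a regular Spark dataset.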
• Kafka is a distributed streaming platform that behaves
like a partitioned, replicated commit log service.
• It provides the functionality of a messaging system.
• Kafka runs as a cluster on one or more servers.
• The Kafka cluster stores streams of records in
categories called topics.
Apache Kafka
• Two of Kafka’s four core APIs are of interest here:
1. The Producer API allows an application to publish a
stream of records to one or more Kafka topics;
2. The Consumer API allows an application to subscribe to
one or more topics and process the stream of records
produced to them.
Apache Kafka
(cont.)
• At a high level, producers send messages over the
network to the Kafka cluster, which in turn serves them
to consumers.
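The producer/cluster/consumer flow can be illustrated with an in-memory toy. This is not the Kafka client API: `ToyBroker`, `publish`, and `poll` are hypothetical names, and a real cluster adds partitioning, replication, and network transport.

```python
from collections import defaultdict

class ToyBroker:
    """In-memory stand-in for a Kafka cluster: one append-only log per topic."""

    def __init__(self):
        self.logs = defaultdict(list)

    def publish(self, topic, record):
        """Producer side: append a record to the end of the topic's log."""
        self.logs[topic].append(record)

    def poll(self, topic, offset):
        """Consumer side: return the records from `offset` on, plus the new offset."""
        records = self.logs[topic][offset:]
        return records, offset + len(records)

broker = ToyBroker()
broker.publish("tweets", "I love this phone")
broker.publish("tweets", "Worst service ever")

offset = 0
records, offset = broker.poll("tweets", offset)  # reads the first two records
broker.publish("tweets", "It is okay")
more, offset = broker.poll("tweets", offset)     # reads only the new record
```

Because the consumer keeps its own offset into the log, it can resume where it stopped without the broker tracking what has already been delivered in this toy model.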
• Spark SQL is a component on top of Spark Core that
introduces a data abstraction called SchemaRDD (renamed
DataFrame as of Spark 1.3), which provides support for
structured and semi-structured data.
• Spark SQL also provides JDBC connectivity and can
access several databases through Hadoop and Spark
connectors.
• To store data in a database or read data from it, it is
necessary to:
  • Define an SQLContext (the entry point) for using
  Spark SQL functionality;
  • Define a table schema with a StructType and use it
  to create a DataFrame;
  • Write the resulting DataFrame to the database through
  a JDBC driver.
Output operations for
DStream
• MLlib is Spark’s library of machine learning functions.
• MLlib contains a variety of learning algorithms and is
accessible from all of Spark’s programming languages.
• It consists of common learning algorithms and utilities,
including classification, regression, clustering, etc.
Machine Learning with MLlib
• The mllib.feature package contains several classes for
common feature transformations. These include
algorithms to construct feature vectors from text and ways
to normalize and scale features.
• Term frequency-inverse document frequency (TF-IDF) is a
feature vectorization method widely used in text mining to
reflect the importance of a term to a document in the
corpus.
Feature extraction
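A minimal plain-Python sketch of TF-IDF follows. It assumes the smoothed inverse document frequency log((N + 1) / (df + 1)), where N is the number of documents and df the number of documents containing the term; the function name `tf_idf` and the sample documents are illustrative, and terms are kept in dictionaries rather than hashed into fixed-size vectors.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    # Smoothed IDF (assumed variant): log((N + 1) / (df + 1)).
    idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    # Weight = raw term frequency in the document times the term's IDF.
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in documents]

docs = [
    ["dish", "delicious"],
    ["dish", "salty", "horrible"],
    ["dish", "dish", "cold"],
]
weights = tf_idf(docs)
print(weights[0]["dish"])       # 0.0: the term occurs in every document
print(weights[0]["delicious"])  # log(2), about 0.693
```

A term that appears in every document gets weight zero, which is how TF-IDF discounts words that do not help tell documents apart.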
• Classification and regression are two common forms of
supervised learning, where algorithms attempt to predict a
variable from features of objects using labeled training
data.
• Both classification and regression use the LabeledPoint
class in MLlib.
• MLlib includes a variety of methods for classification and
regression, including simple linear methods and decision
trees and forests.
Classification
• Naïve Bayes is a multiclass classification algorithm that
scores how well each point belongs in each class based on
a linear function of the features.
• It is commonly used for text classification with TF-IDF
features, in applications such as tweet sentiment
analysis.
• In MLlib, Naïve Bayes is available through the
mllib.classification.NaiveBayes class.
Naïve Bayes
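The following is a small plain-Python sketch of multinomial Naïve Bayes with additive (Laplace) smoothing, working on raw word counts. It is not the MLlib implementation; the function names and the four training samples are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(samples, smoothing=1.0):
    """samples: list of (tokens, label). Returns (log priors, log likelihoods)."""
    class_counts = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in samples:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    log_prior = {c: math.log(n / len(samples)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + smoothing * len(vocab)
        log_like[c] = {
            w: math.log((word_counts[c][w] + smoothing) / total) for w in vocab
        }
    return log_prior, log_like

def predict(model, tokens):
    """Pick the class with the highest score; unseen words are ignored."""
    log_prior, log_like = model
    # The score is linear in the word counts: prior plus a sum of per-word weights.
    scores = {
        c: log_prior[c] + sum(log_like[c][w] for w in tokens if w in log_like[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

training = [
    (["delicious", "dish"], "positive"),
    (["great", "service"], "positive"),
    (["salty", "horrible", "dish"], "negative"),
    (["horrible", "service"], "negative"),
]
model = train_naive_bayes(training)
print(predict(model, ["delicious", "service"]))  # positive
print(predict(model, ["horrible", "dish"]))      # negative
```

Working in log space turns the product of word probabilities into a sum, which is the linear function of the features mentioned on the slide.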
• Clustering is the unsupervised learning task that involves
grouping objects into clusters of high similarity.
• Unlike the supervised tasks, where the data is labeled,
clustering can be used to make sense of unlabeled data.
• It is commonly used in data exploration and in anomaly
detection.
Clustering
• In addition to the popular “offline” K-means algorithm,
MLlib provides an “online” version for clustering data
streams.
• When data arrive in a stream, the algorithm dynamically:
1. Estimates the cluster membership of the incoming data
points;
2. Updates the centroids of the clusters.
Streaming K-means
• In MLlib, Streaming K-means is available through the
mllib.clustering.StreamingKMeans class.
Streaming K-means (cont.)
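The two steps (assign, then update) can be sketched in plain Python. This is one common formulation of the online update, written under assumptions: each centroid moves to the weighted average of its old position and the mean of the newly assigned points, and a `decay` factor (hypothetical parameter name) down-weights old data. It is not the MLlib code.

```python
def nearest(centroids, point):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((c - p) ** 2 for c, p in zip(centroids[i], point)),
    )

def update(centroids, counts, batch, decay=1.0):
    """Process one batch: assign every point first, then move the centroids."""
    groups = {}
    for point in batch:
        groups.setdefault(nearest(centroids, point), []).append(point)
    for i, points in groups.items():
        m = len(points)
        batch_mean = [sum(axis) / m for axis in zip(*points)]
        old_weight = counts[i] * decay  # decay < 1 forgets old data faster
        centroids[i] = [
            (c * old_weight + x * m) / (old_weight + m)
            for c, x in zip(centroids[i], batch_mean)
        ]
        counts[i] = old_weight + m
    return centroids, counts

centroids = [[0.0, 0.0], [10.0, 10.0]]
counts = [1, 1]
batch = [[1.0, 1.0], [9.0, 9.0], [11.0, 11.0]]
centroids, counts = update(centroids, counts, batch)
print(centroids)  # [[0.5, 0.5], [10.0, 10.0]]
```

With `decay=1.0` every point seen so far counts equally; smaller values let the centroids follow drifting data more quickly.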
• Given a dataset of points in a high-dimensional space, we
are often interested in reducing the dimensionality of the
points so that they can be analyzed with simpler tools.
• For example, we might want to plot the points in two
dimensions, or just reduce the number of features to train
models more efficiently.
• In MLlib, PCA is available through the
mllib.feature.PCA class.
Principal Component
Analysis (PCA)
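As a rough illustration of what PCA computes, the sketch below finds only the first principal component of a small dataset using power iteration on the covariance matrix, in plain Python. This is an assumption-laden toy (hypothetical function name, three collinear sample points), not the method MLlib uses.

```python
import math

def first_principal_component(points, iterations=100):
    """Return (unit direction of largest variance, 1-D projection of the points)."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - mean[j] for j in range(dim)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(dim)]
        for a in range(dim)
    ]
    # Power iteration: repeatedly apply the covariance matrix and renormalize.
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    projected = [sum(r[j] * v[j] for j in range(dim)) for r in centered]
    return v, projected

# Points lying on the line y = 2x: one dimension describes them completely.
component, projected = first_principal_component([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
```

Here `component` is proportional to (1, 2), and `projected` holds the coordinates of the three points along that single direction.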
• Apache Zeppelin is a web-based notebook that enables
interactive data visualization.
• Zeppelin’s interpreter concept allows any language or
data-processing backend, such as JDBC, to be plugged
into Zeppelin.
Apache Zeppelin
• Because the Spark Streaming Python API lacks a connector
for accessing a Twitter account, the tweet streams were
simulated using Apache Kafka.
• In particular, the component responsible for this task is a
Producer, which publishes a stream of data to a specific
topic.
• The training and testing data streams were retrieved
from [1].
• On the other side, each received DStream is processed by
a Consumer, using stateless Spark functions such as map,
transform, etc.
Implementation and results
Naïve Bayes classification
results
Clustering results
Future work
• Integrate the Twitter API to retrieve tweets from
accounts.
• Use an alternative feature extraction method for the
Streaming K-means task.
[1] http://help.sentiment140.com/for-students/
[2] Karau, H., et al. Learning Spark: Lightning-Fast Big Data
Analysis. O'Reilly Media, Inc., 2015.
[3] Bogomolny, A. Benford’s Law and Zipf’s Law.
http://www.cut-the-knot.org/doyouknow/zipfLaw.shtml.
References
For any questions, contact me at:
davide.nardone@live.it
