©2015 IBM Corporation
Spark + Watson +
Twitter
DataPalooza SF 2015
David Taieb
STSM - IBM Cloud Data Services
Agenda
• Introduction
• Quick Introduction to Spark
• Set up development environment and create the hello world application
• Notebook Walk-through
• Spark Streaming
• Deep dive: Sentiment analysis with Twitter and Watson Tone Analyzer
• Architectural Overview
• Set up the Bluemix services: Watson Tone Analyzer, Message Hub and Event Hub
• Create the Streaming Receiver to connect to Kafka (Scala)
• Create analytics using Jupyter Notebook (Python)
• Create Real-time Web Dashboard (Node.js)
Introduction
Introduction
Our mission:
We are here to help developers realize their most ambitious projects.
Goals for today's session:
• Introduction to real-time analytics using Spark Streaming
• Technical deep dive on the Spark + Watson + Twitter sample application
• By the end of this session, you should be able to download the source code and run the application on IBM Analytics for Apache Spark
What is Spark?
Spark is an open source, in-memory computing framework for distributed data processing and iterative analysis on massive data volumes.
Spark Core Libraries
• Spark Core: general compute engine; handles distributed task dispatching, scheduling and basic I/O functions
• Spark SQL: executes SQL statements
• Spark Streaming: performs streaming analytics using micro-batches
• MLlib (machine learning): common machine learning and statistical algorithms
• GraphX (graph): distributed graph processing framework
Key reasons for interest in Spark
• Fast
  - In-memory storage greatly reduces disk I/O
  - Up to 100x faster in memory, 10x faster on disk
• Open Source
  - Largest and one of the most active projects on Apache
  - Vibrant, growing community of developers continuously improves the code base and extends capabilities
  - Fast adoption in the enterprise (IBM, Databricks, etc.)
• Web Scale
  - Fault tolerant: seamlessly recomputes data lost to hardware failure
  - Scalable: easily increase the number of worker nodes
  - Flexible job execution: batch, streaming, interactive
  - Easily handles petabytes of data without special code handling
  - Compatible with the existing Hadoop ecosystem
• Productive
  - Unified programming model across a range of use cases
  - Rich, expressive APIs hide the complexities of parallel computing and worker node management
  - Support for Java, Scala, Python and R: less code written
  - Includes a set of core libraries that enable various analytic methods: Spark SQL, MLlib, GraphX
High level architecture
• Batch job (spark-submit): the Spark application (driver) talks to the Master (cluster manager), which schedules work across the Worker Nodes of the Spark cluster.
• Interactive notebook: the browser talks to the notebook server over HTTP/WebSockets; the notebook server talks to a kernel over a kernel protocol (e.g. ZeroMQ); the kernel drives the Master and the Worker Nodes.
• The driver and cluster manager handle:
  - RDD partitioning
  - Task packaging and dispatching
  - Worker node scheduling
Spark programming model lifecycle
• Load data into RDDs
  - In-memory collection: sc.parallelize
  - Unstructured data: text (sc.textFile), HDFS (sc.hadoopFile)
  - Structured data: JSON (sqlCtxt.jsonFile), Parquet (sqlCtxt.parquetFile), JDBC (sqlCtxt.load), custom data sources (Spark 1.4+)
  - Streaming data: TwitterUtils.createStream, KafkaUtils.createStream, FlumeUtils.createStream, MQTTUtils.createStream, custom DStreams
  - sc: the SparkContext entry point, created by the application or provided automatically by the notebook shell
  - sqlCtxt: the SQLContext entry point for working with DataFrames and executing SQL queries
• Apply transformations to create new RDDs from existing ones
  - map(fn): apply fn to all elements in the RDD
  - flatMap(fn): same as map, but fn can return 0 or more elements
  - filter(fn): select only the elements for which fn returns true
  - reduceByKey, sortByKey
  - sample: sample a fraction of the data
  - union: combine the elements of 2 RDDs
  - intersection: intersect 2 RDDs
  - distinct: remove duplicate elements
  - …
• Apply actions (analytics) to produce results
  - reduce(fn): perform a summary operation on the elements
  - collect(): return all elements in an Array
  - count(): count the number of elements in the RDD
  - take(n): return the first n elements in an Array
  - foreach(fn): execute fn on all the elements in the RDD
  - saveAsTextFile: persist the elements in a text file
  - …
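The load / transform / act lifecycle can be illustrated with plain Python collections standing in for RDDs (no Spark cluster required); the equivalent PySpark calls are noted in the comments.

```python
# Load: create the "RDD" (a plain list standing in for sc.parallelize(range(1, 11)))
data = list(range(1, 11))

# Transform: build new collections from existing ones
squares = [x * x for x in data]             # rdd.map(lambda x: x * x)
evens = [x for x in squares if x % 2 == 0]  # .filter(lambda x: x % 2 == 0)

# Act: produce results
total = sum(evens)   # rdd.reduce(lambda a, b: a + b)
count = len(evens)   # rdd.count()
first3 = evens[:3]   # rdd.take(3)

print(total, count, first3)  # → 220 5 [4, 16, 36]
```

In Spark the transformations are lazy and only run when an action such as reduce or take is invoked; the plain-list version above evaluates eagerly but produces the same results.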
Job Scheduling
Ecosystem of IBM Analytics for Apache Spark as a Service
Setup local development environment
• Prerequisites
  - Scala runtime 2.10.4: http://www.scala-lang.org/download/2.10.4.html
  - Homebrew: http://brew.sh/
  - Scala sbt: http://www.scala-sbt.org/download.html
  - Spark 1.3.1: http://www.apache.org/dyn/closer.lua/spark/spark-1.3.1/spark-1.3.1.tgz
• Detailed instructions here: https://developer.ibm.com/clouddataservices/start-developing-with-spark-and-notebooks/
Setup local development environment contd.
• Create a Scala project using sbt
• Create the directories to start from scratch:
  mkdir helloSpark && cd helloSpark
  mkdir -p src/main/scala
  mkdir -p src/main/java
  mkdir -p src/main/resources
• Under the src/main/scala directory, create the package subdirectory:
  mkdir -p com/ibm/cds/spark/samples
• GitHub URL for the same project: https://github.com/ibm-cds-labs/spark.samples
Setup local development environment contd.
• Create HelloSpark.scala using an IDE or a text editor
• Copy-paste this code snippet:

package com.ibm.cds.spark.samples

import org.apache.spark._

object HelloSpark {
    //main method invoked when running as a standalone Spark application
    def main(args: Array[String]) {
        val conf = new SparkConf().setAppName("Hello Spark")
        val spark = new SparkContext(conf)

        println("Hello Spark Demo. Compute the mean and variance of a collection")
        val stats = computeStatsForCollection(spark)
        println(">>> Results: ")
        println(">>>>>>>Mean: " + stats._1 )
        println(">>>>>>>Variance: " + stats._2)
        spark.stop()
    }

    //Library method that can be invoked from a Jupyter Notebook
    def computeStatsForCollection( spark: SparkContext,
        countPerPartitions: Int = 100000, partitions: Int = 5): (Double, Double) = {
        val totalNumber = math.min( countPerPartitions * partitions,
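The slide truncates the body of computeStatsForCollection. The statistics the demo reports, the mean and variance of a generated collection, can be sketched in plain Python; the uniform value generation and the sizes used here are illustrative assumptions, not the original implementation.

```python
import random

def compute_stats(count_per_partition=100000, partitions=5):
    """Compute (mean, variance) of a generated collection.
    In the Spark version, the data would be distributed with sc.parallelize."""
    random.seed(42)
    data = [random.uniform(0, 100) for _ in range(count_per_partition * partitions)]
    n = len(data)
    mean = sum(data) / n
    # Population variance, computed in a second pass over the data
    variance = sum((x - mean) ** 2 for x in data) / n
    return mean, variance

mean, variance = compute_stats(1000, 5)
print("Mean:", mean)       # ≈ 50 for uniform values in [0, 100]
print("Variance:", variance)  # ≈ 833 (100² / 12) for the same distribution
```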
Setup local development environment contd.
• Create a file build.sbt under the project root directory:

name := "helloSpark"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= {
    val sparkVersion = "1.3.1"
    Seq(
        "org.apache.spark" %% "spark-core" % sparkVersion,
        "org.apache.spark" %% "spark-sql" % sparkVersion,
        "org.apache.spark" %% "spark-repl" % sparkVersion
    )
}

• Under the project root directory, run:
  $ sbt update    (download all dependencies)
  $ sbt compile   (compile the sources)
  $ sbt package   (package an application jar file)
Hello World application on Bluemix Apache Starter
Introduction to Notebooks
‣ Notebooks allow the creation of interactive, executable documents that combine rich text (Markdown), executable code (Scala, Python or R) and graphics (matplotlib)
‣ Apache Spark provides APIs in multiple languages that can be executed through a REPL shell: Scala, Python (PySpark), R
‣ Multiple open-source implementations are available:
  - Jupyter: https://jupyter.org
  - Apache Zeppelin: http://zeppelin-project.org
Notebook walkthrough
‣ Sign up on Bluemix: https://console.ng.bluemix.net/registration/
‣ Getting started with Analytics for Apache Spark: https://www.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index.html
‣ You can also follow the tutorial here: https://developer.ibm.com/clouddataservices/start-developing-with-spark-and-notebooks/
Spark Streaming
‣ “Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams” (http://spark.apache.org/docs/latest/streaming-programming-guide.html)
‣ Breaks the streaming data down into smaller pieces (micro-batches), which are then sent to the Spark engine
Spark Streaming
‣ Provides connectors for multiple data sources:
  - Kafka
  - Flume
  - Twitter
  - MQTT
  - ZeroMQ
‣ Provides an API to create custom connectors; lots of examples are available on GitHub and spark-packages.org
Spark + Twitter + Watson application
‣ Use Spark Streaming in combination with IBM Watson to perform sentiment analysis and track how a conversation is trending on Twitter.
‣ Use Spark Streaming to create a feed that captures live tweets from Twitter. You can optionally filter the tweets that contain the hashtag(s) of your choice.
‣ The tweet data is then enriched in real time with various sentiment scores provided by the Watson Tone Analyzer service (available on Bluemix). This service provides insight into sentiment, or how the author feels.
‣ The data is then loaded and analyzed by the data scientist within a notebook.
‣ We can also use the streaming analytics to feed a real-time web dashboard.
About this sample application
• GitHub: https://github.com/ibm-cds-labs/spark.samples/tree/master/streaming-twitter
• Tutorial: https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags
• A word about Scala
  - Scala is object-oriented but also supports a functional programming style
  - Bi-directional interoperability with Java
• Resources:
  - Official web site: http://scala-lang.org
  - Excellent first-steps site: http://www.artima.com/scalazine/articles/steps.html
  - Free e-books: http://readwrite.com/2011/04/30/5-free-b-books-and-tutorials-o
Spark Streaming with “IBM Insight for Twitter” and “Watson Tone Analyzer”
Architecture (flow shown in the original diagram):
• A producer stream captures tweets (live feed, or the Full Archive Search API) and enriches the data with emotion tone scores from the Watson Tone Analyzer service on Bluemix.
• The processed data is published to the Message Hub service (Kafka) on Bluemix.
• A consumer stream feeds the data to a Scala notebook and an IPython notebook for Spark analytics.
• Topics published from the Spark analytics results flow through the Event Hub service on Bluemix to a real-time dashboard.
• Personas involved: Data Engineer, Data Scientist, Business Analyst, C-Suite.
Building a Spark Streaming application
Sentiment analysis with Twitter and Watson Tone Analyzer
‣Configure Twitter and Watson Tone Analyzer
1. Configure OAuth credentials for Twitter
2. Create a Watson Tone Analyzer Service on Bluemix
3. Configure MessageHub Service on Bluemix (Kafka)
4. Configure EventHub Service on Bluemix
Configure OAuth credentials for Twitter
‣ You can follow along with the steps in https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags/#twitter
Create a Watson Tone Analyzer Service on Bluemix
‣ You can follow along with the steps in https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags/#bluemix
Building a Spark Streaming application
Sentiment analysis with Twitter and Watson Tone Analyzer
‣ Work with Twitter data
  1. Create a Twitter stream
  2. Enrich the data with sentiment analysis from Watson Tone Analyzer
  3. Aggregate the data into an RDD with the enriched data model
  4. Create a SparkSQL DataFrame and register a table
Create a Twitter Stream
Create a map that stores the credentials for the Twitter and Watson services:

//Hold configuration key/value pairs
val config = Map[String, String](
    ("twitter4j.oauth.consumerKey", Option(System.getProperty("twitter4j.oauth.consumerKey")).orNull ),
    ("twitter4j.oauth.consumerSecret", Option(System.getProperty("twitter4j.oauth.consumerSecret")).orNull ),
    ("twitter4j.oauth.accessToken", Option(System.getProperty("twitter4j.oauth.accessToken")).orNull ),
    ("twitter4j.oauth.accessTokenSecret", Option(System.getProperty("twitter4j.oauth.accessTokenSecret")).orNull ),
    ("tweets.key", Option(System.getProperty("tweets.key")).getOrElse("")),
    ("watson.tone.url", Option(System.getProperty("watson.tone.url")).orNull ),
    ("watson.tone.username", Option(System.getProperty("watson.tone.username")).orNull ),
    ("watson.tone.password", Option(System.getProperty("watson.tone.password")).orNull )
)

Twitter4J requires the credentials to be stored in System properties:

config.foreach( (t:(String,String)) =>
    if ( t._1.startsWith( "twitter4j") ) System.setProperty( t._1, t._2 )
)
Create a Twitter Stream
Create the initial DStream of twitter4j Status objects, filtered to keep only the tweets we want:

//Filter the tweets to only keep the ones with English as the language
//twitterStream is a discretized stream of twitter4j Status objects
var twitterStream = org.apache.spark.streaming.twitter.TwitterUtils.createStream( ssc, None )
    .filter { status =>
        Option(status.getUser).flatMap[String] {
            u => Option(u.getLang)
        }.getOrElse("").startsWith("en")                          //Allow only tweets that use "en" as the language
        && CharMatcher.ASCII.matchesAllOf(status.getText)         //Only pick text that is ASCII
        && ( keys.isEmpty || keys.exists{status.getText.contains(_)}) //If the user specified #hashtags to monitor
    }
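The same three-part predicate (English user language, ASCII-only text, optional hashtag match) can be expressed in plain Python. The tweet dicts and their field names below are hypothetical stand-ins for the twitter4j Status objects.

```python
def keep_tweet(status, keys=()):
    """Mirror the Scala filter: English language, ASCII text, optional hashtag match."""
    lang = status.get("user_lang") or ""
    text = status.get("text") or ""
    return (lang.startswith("en")
            and all(ord(c) < 128 for c in text)            # CharMatcher.ASCII equivalent
            and (not keys or any(k in text for k in keys)))  # empty keys == no hashtag filter

tweets = [
    {"user_lang": "en", "text": "Loving #spark tonight"},
    {"user_lang": "fr", "text": "Bonjour #spark"},        # dropped: not English
    {"user_lang": "en", "text": "caf\u00e9 #spark"},      # dropped: non-ASCII text
]
print([t["text"] for t in tweets if keep_tweet(t, keys=["#spark"])])
# → ['Loving #spark tonight']
```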
Enrich the data with sentiment analysis from Watson Tone Analyzer
Broadcast the configuration to each worker node:

//Broadcast the config to each worker node
val broadcastVar = sc.broadcast(config)
Enrich the data with sentiment analysis from Watson Tone Analyzer
The initial DStream of Status objects becomes a DStream of key/value pairs with the following data model:

|-- author: string (nullable = true)
|-- date: string (nullable = true)
|-- lang: string (nullable = true)
|-- text: string (nullable = true)
|-- lat: integer (nullable = true)
|-- long: integer (nullable = true)
|-- Cheerfulness: double (nullable = true)
|-- Negative: double (nullable = true)
|-- Anger: double (nullable = true)
|-- Analytical: double (nullable = true)
|-- Confident: double (nullable = true)
|-- Tentative: double (nullable = true)
|-- Openness: double (nullable = true)
|-- Agreeableness: double (nullable = true)
|-- Conscientiousness: double (nullable = true)
Aggregate data into RDD with enriched Data model
Each micro-batch DStream of RowTweets (following the data model above) is merged into the aggregate workingRDD:

//Aggregate the data from each DStream into the working RDD
rowTweets.foreachRDD( rdd => {
    if ( rdd.count() > 0 ){
        workingRDD = sc.parallelize( rdd.map( t => t._1 ).collect()).union( workingRDD )
    }
})
Create SparkSQL DataFrame and register Table

//Create a SparkSQL DataFrame from the aggregate workingRDD
val df = sqlContext.createDataFrame( workingRDD, schemaTweets )
//Register a temporary table using the name "tweets"
df.registerTempTable("tweets")
println("A new table named tweets with " + df.count() + " records has been correctly created and can be accessed through the SQLContext variable")
println("Here's the schema for tweets")
df.printSchema()
(sqlContext, df)

The workingRDD of rows becomes a relational SparkSQL table, e.g.:

author     | date               | lang | … | Cheerfulness | Negative | … | Conscientiousness
John Smith | 10/11/2015 – 20:18 | en   | … | 0.0          | 65.8     | … | 25.5
Alfred     | …                  | en   | … | 34.5         | 0.0      | … | 100.0
Chris      | …                  | en   | … | 85.3         | 22.9     | … | 0.0
Building a Spark Streaming application:
Sentiment analysis with Twitter and Watson Tone Analyzer
‣ IPython Notebook analysis
  1. Load the data into an IPython Notebook
  2. Analytic 1: Compute the distribution of tweets by sentiment scores greater than 60%
  3. Analytic 2: Compute the top 10 hashtags contained in the tweets
  4. Analytic 3: Visualize aggregated sentiment scores for the top 5 hashtags
Load the data into an IPython Notebook
‣ You can follow along with the steps here: https://github.com/ibm-cds-labs/spark.samples/blob/master/streaming-twitter/notebook/Twitter%20%2B%20Watson%20Tone%20Analyzer%20Part%202.ipynb
‣ Create a SQLContext from a SparkContext
‣ Load from a parquet file and create a DataFrame
‣ Create a SQL table and start executing SQL queries
Analytic 1 - Compute the distribution of tweets by sentiment scores greater than 60%

#create an array that will hold the count for each sentiment
sentimentDistribution=[0] * 9
#For each sentiment, run a sql query that counts the number of tweets for which the sentiment score is greater than 60%
#Store the data in the array
for i, sentiment in enumerate(tweets.columns[-9:]):
    sentimentDistribution[i]=sqlContext.sql("SELECT count(*) as sentCount FROM tweets where " + sentiment + " > 60")\
        .collect()[0].sentCount
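The counting logic itself, stripped of the SQLContext, is a per-column threshold count; here it is over plain Python dicts with made-up sample rows and a shortened sentiment list.

```python
sentiments = ["Cheerfulness", "Negative", "Anger"]  # the notebook uses the last 9 columns

rows = [
    {"Cheerfulness": 85.0, "Negative": 10.0, "Anger": 0.0},
    {"Cheerfulness": 70.0, "Negative": 65.8, "Anger": 0.0},
    {"Cheerfulness": 0.0,  "Negative": 90.0, "Anger": 61.0},
]

# Equivalent of: SELECT count(*) FROM tweets WHERE <sentiment> > 60, once per sentiment
distribution = [sum(1 for r in rows if r[s] > 60) for s in sentiments]
print(distribution)  # → [2, 2, 1]
```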
Analytic 1 - Compute the distribution of tweets by sentiment scores greater than 60%
Use matplotlib to create a bar chart (bar chart visualization shown on the slides).
Analytic 2: Compute the top 10 hashtags contained in the tweets
Pipeline over the initial tweets RDD:
flatMap (split into words) → filter (keep hashtags) → map (key/value pairs) → reduceByKey (counts per hashtag) → sortByKey (sorted map)
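The flatMap → filter → map → reduceByKey → sortByKey pipeline can be mimicked on plain Python lists (the tweet texts below are made up):

```python
from collections import defaultdict

texts = [
    "watching the sky #SuperBloodMoon #LunarEclipse",
    "wow #SuperBloodMoon",
    "new single out #NewMusic",
]

# flatMap: split every tweet into words
words = [w for t in texts for w in t.split()]
# filter: keep only the hashtags
tags = [w for w in words if w.startswith("#")]
# map + reduceByKey: (tag, 1) pairs summed per tag
counts = defaultdict(int)
for tag in tags:
    counts[tag] += 1
# sort by count, descending, and take the top 10
top10tags = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top10tags)  # → [('#SuperBloodMoon', 2), ('#LunarEclipse', 1), ('#NewMusic', 1)]
```

In Spark each stage is a distributed RDD transformation; the list comprehensions above run the same logic on one machine.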
Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags
‣ Problem:
  - Compute the average of each emotion score across all the top 10 hashtags
  - Format the data in a way that can be consumed by the plot script
Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

#Step 1: Create RDD from tweets dataframe
tagsRDD = tweets.map(lambda t: t )

The tweets DataFrame (columns author, …, Cheerfulness, …) becomes tagsRDD, an RDD of Row objects, e.g.:
Row(author=u'Jake', …, text=u'@sarahwag…', Cheerfulness=0.0, …)
Row(author=u'Scrad', …, text=u' #SuperBloodMoon https://t…', Cheerfulness=23.5, …)
Row(author=u'Nittya Indika', …, text=u' Good mornin! http://t.…', Cheerfulness=84.0, …)
…
Row(author=u'Madison', …, text=u' how many nights…', Cheerfulness=93.0, …)
Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

#Step 2: Filter to only keep the entries that are in top10tags
tagsRDD = tagsRDD.filter( lambda t: any(s in t.text for s in [i[0] for i in top10tags] ) )

Only the rows whose text contains one of the top 10 hashtags remain, e.g.:
Row(author=u'Mike McGuire', text=u'Explains my disappointment #SuperBloodMoon https://t.co/Gfg7vWws5W', …, Conscientiousness=0.0)
Row(author=u'Meng_tisoy', text=u'…hihi #ALDUBThisMustBeLove https://t….', …, Conscientiousness=68.0)
Row(author=u'Kevin Contreras', text=u'…SILA! #ALDUBThisMustBeLove', …, Conscientiousness=68.0)
…
Row(author=u'abbi', text=u'…excited #ALDUBThisMustBeLove https://t…', …, Conscientiousness=100.0)
Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

#Step 3: Create a flatMap using the expand function defined below; this will be used to collect all the scores
#for a particular tag with the following format: Tag-Tone:ToneScore
cols = tweets.columns[-9:]
def expand( t ):
    ret = [ ]
    for s in [i[0] for i in top10tags]:
        if ( s in t.text ):
            for tone in cols:
                ret += [s + u"-" + unicode(tone) + ":" + unicode(getattr(t, tone))]
    return ret
tagsRDD = tagsRDD.flatMap( expand )

The result is a flatMap of encoded values, e.g.:
u'#SuperBloodMoon-Cheerfulness:0.0'
u'#SuperBloodMoon-Negative:100.0'
u'#SuperBloodMoon-Negative:23.5'
…
u'#ALDUBThisMustBeLove-Analytical:85.0'
Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

#Step 4: Create a map indexed by Tag-Tone keys
tagsRDD = tagsRDD.map( lambda fullTag : (fullTag.split(":")[0], float( fullTag.split(":")[1]) ))

Each encoded string becomes a (Tag-Tone, score) pair, e.g.:
(u'#SuperBloodMoon-Cheerfulness', 0.0)
(u'#SuperBloodMoon-Negative', 100.0)
(u'#SuperBloodMoon-Negative', 23.5)
…
(u'#ALDUBThisMustBeLove-Analytical', 85.0)
Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

#Step 5: Call combineByKey to format the data as follows
#Key=Tag-Tone, Value=(sum_of_all_scores_for_this_tone, count)
tagsRDD = tagsRDD.combineByKey((lambda x: (x,1)),
    (lambda x, y: (x[0] + y, x[1] + 1)),
    (lambda x, y: (x[0] + y[0], x[1] + y[1])))

createCombiner: creates the (sum, count) tuple for the first value of a key
mergeValue: called for each new value, accumulates (sum, count)
mergeCombiners: the reduce part, merges 2 combiners

Result, e.g.:
(u'#Supermoon-Confident', (0.0, 3))
(u'#HajjStampede-Tentative', (0.0, 3))
(u'#KiligKapamilya-Conscientiousness', (290.0, 6))
…
(u'#LunarEclipse-Tentative', (92.0, 4))
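The combineByKey logic reduces (Tag-Tone, score) pairs to (sum, count); a single-machine dict-based simulation (with made-up score pairs) makes the three roles concrete:

```python
pairs = [
    ("#KiligKapamilya-Conscientiousness", 48.0),
    ("#KiligKapamilya-Conscientiousness", 52.0),
    ("#LunarEclipse-Tentative", 92.0),
]

combiners = {}
for key, score in pairs:
    if key not in combiners:
        combiners[key] = (score, 1)            # createCombiner: lambda x: (x, 1)
    else:
        s, c = combiners[key]
        combiners[key] = (s + score, c + 1)    # mergeValue: lambda x, y: (x[0] + y, x[1] + 1)
# mergeCombiners (lambda x, y: (x[0] + y[0], x[1] + y[1])) would merge the partial
# (sum, count) pairs produced independently on each partition.

averages = {k: round(s / c, 2) for k, (s, c) in combiners.items()}
print(averages)  # → {'#KiligKapamilya-Conscientiousness': 50.0, '#LunarEclipse-Tentative': 92.0}
```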
Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

#Step 6 : Re-index the map so the key is the Tag and the value is a (Tone, average_score) tuple
#Key=Tag
#Value=(Tone, average_score)
tagsRDD = tagsRDD.map(lambda (key, ab): (key.split("-")[0], (key.split("-")[1], round(ab[0]/ab[1],2))))

Result, e.g.:
(u'#Supermoon', (u'Confident', 0.0))
(u'#HajjStampede', (u'Tentative', 0.0))
(u'#KiligKapamilya', (u'Conscientiousness', 48.33))
…
(u'#LunarEclipse', (u'Tentative', 23.0))
Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

#Step 7: Reduce the map on the Tag key; the value becomes a list of (Tone, average_score) tuples
#makeList (a helper defined elsewhere in the notebook) ensures its argument is a list
tagsRDD = tagsRDD.reduceByKey( lambda x, y : makeList(x) + makeList(y) )

Result, e.g.:
(u'#HajjStampede', [(u'Tentative', 0.0), (u'Agreeableness', 3.67), …, (u'Cheerfulness', 100.0)])
(u'#Supermoon', [(u'Confident', 0.0), (u'Openness', 91.0), …, (u'Agreeableness', 20.33)])
(u'#bloodmoon', [(u'Anger', 0.0), (u'Negative', 0.0), …, (u'Openness', 38.0)])
…
(u'#KiligKapamilya', [(u'Conscientiousness', 48.33), (u'Anger', 0.0), …, (u'Agreeableness', 10.83)])
Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

#Step 8 : Sort the (Tone, average_score) tuples alphabetically by Tone
tagsRDD = tagsRDD.mapValues( lambda x : sorted(x) )

Result, e.g.:
(u'#HajjStampede', [(u'Agreeableness', 3.67), (u'Cheerfulness', 100.0), …, (u'Tentative', 0.0)])
(u'#Supermoon', [(u'Agreeableness', 20.33), (u'Confident', 0.0), …, (u'Openness', 91.0)])
(u'#bloodmoon', [(u'Anger', 0.0), (u'Negative', 0.0), …, (u'Openness', 38.0)])
…
(u'#KiligKapamilya', [(u'Agreeableness', 10.83), (u'Anger', 0.0), (u'Conscientiousness', 48.33), …])
Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

#Step 9 : Format the data as expected by the plotting code in the next cell.
#Map the values to a tuple as follows: ([list of tones], [list of average scores])
tagsRDD = tagsRDD.mapValues( lambda x : ([elt[0] for elt in x],[elt[1] for elt in x]) )

Each value is now a tuple of 2 arrays (tones, scores), e.g.:
(u'#HajjStampede', ([u'Agreeableness', u'Cheerfulness', …, u'Tentative'], [3.67, 100.0, …, 0.0]))
(u'#Supermoon', ([u'Agreeableness', u'Confident', …, u'Openness'], [20.33, 0.0, …, 91.0]))
(u'#bloodmoon', ([u'Anger', u'Negative', …, u'Openness'], [0.0, 0.0, …, 38.0]))
…
(u'#KiligKapamilya', ([u'Agreeableness', u'Anger', u'Conscientiousness', …], [10.83, 0.0, 48.33, …]))
Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

#Step 10 : Use a custom sort function to sort the entries by order of appearance in top10tags
def customCompare( key ):
    for (k,v) in top10tags:
        if k == key:
            return v
    return 0
tagsRDD = tagsRDD.sortByKey(ascending=False, numPartitions=None, keyfunc = customCompare)

The entries are now ordered by hashtag popularity, e.g.:
(u'#SuperBloodMoon', ([u'Agreeableness', u'Cheerfulness', …, u'Tentative'], [33.97, 19.38, …, 12.85]))
(u'#BBWLA', ([u'Agreeableness', u'Confident', …, u'Openness'], [38.33, 12.34, …, 21.43]))
(u'#ALDUBThisMustBeLove', ([u'Anger', u'Negative', …, u'Openness'], [0.0, 0.0, …, 62.0]))
…
(u'#Newmusic', ([u'Agreeableness', u'Anger', u'Conscientiousness', …], [0.0, 0.0, 68.33, …]))
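The keyfunc maps each tag to its tweet count in top10tags, so sorting with ascending=False orders tags by popularity. The same idea in plain Python (with made-up counts):

```python
# Hypothetical (tag, count) pairs standing in for the notebook's top10tags
top10tags = [("#Superbloodmoon", 120), ("#BBWLA", 85), ("#Newmusic", 10)]

def custom_compare(key):
    # Map a tag to its tweet count; unknown tags sort last with 0
    for k, v in top10tags:
        if k == key:
            return v
    return 0

tags = ["#Newmusic", "#Superbloodmoon", "#BBWLA"]
print(sorted(tags, key=custom_compare, reverse=True))
# → ['#Superbloodmoon', '#BBWLA', '#Newmusic']
```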
Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags
(Resulting visualization shown on the slides.)
Real-Time Web app Dashboard
‣ Pie chart showing the top hashtags distribution
‣ Bar chart showing the distribution of tone scores for each of the top hashtags
Create a Receiver that subscribes to Kafka topics
• Get a batch of new records
• Store the new records into the DStream
• Note: Message Hub on Bluemix requires Kafka 0.9
Create Kafka DStream
An implicit conversion is used to synthetically add a method to StreamingContext
Enrich Tweets with Watson Scores
Get Tone scores
Map to new EnrichedTweet Object
Streaming analytics
• Prepare for map/reduce: map each tag-tone pair to its corresponding score
• Reduce: compute the count and average for each score
• Map each tag to its count plus the list of score averages
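The map/reduce shape of this computation can be shown without Spark; in the real job the same steps run as map and reduceByKey over the DStream. The tags and scores below are made up:

```python
from collections import defaultdict

# map step: one ((tag, tone), score) pair per tweet/tone combination
pairs = [
    (("#spark", "Cheerfulness"), 80.0),
    (("#spark", "Cheerfulness"), 60.0),
    (("#spark", "Anger"), 0.0),
]

# reduce step: accumulate (count, sum) per key, then derive the average
acc = defaultdict(lambda: (0, 0.0))
for key, score in pairs:
    count, total = acc[key]
    acc[key] = (count + 1, total + score)

averages = {key: total / count for key, (count, total) in acc.items()}

# final map: tag -> list of (tone, average) entries
by_tag = defaultdict(list)
for (tag, tone), avg in averages.items():
    by_tag[tag].append((tone, avg))
```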
Maintain state between micro-batch RDDs
• Maintain state between micro-batches by recomputing the count and the list of averages
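In Spark Streaming this is typically done with updateStateByKey, whose update function receives a key's new values plus its previous state and returns the recomputed state. A plain-Python sketch of such an update function — the (count, sums) state layout here is an assumption for illustration, not the sample's exact types:

```python
def update_state(new_values, state):
    """Merge a micro-batch's (count, sums) values into the running state.

    Mirrors the function passed to updateStateByKey: it gets the key's
    new values plus the previous state, and returns the new state.
    """
    count, sums = state if state is not None else (0, [0.0, 0.0])
    for batch_count, batch_sums in new_values:
        count += batch_count
        sums = [s + b for s, b in zip(sums, batch_sums)]
    return count, sums

state = None
state = update_state([(2, [10.0, 140.0])], state)   # first micro-batch
state = update_state([(1, [5.0, 60.0])], state)     # second micro-batch
count, sums = state
averages = [s / count for s in sums]                # recomputed list of averages
```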
Produce streaming analytics topic data
• The Kafka producer can't be called from the streaming analytics code because it is not serializable
• Instead, post each message to a queue and process the message queue from a separate thread
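The workaround described above — enqueue from the analytics code, publish from a dedicated thread — looks roughly like this in Python; kafka_send and the topic payloads are stand-ins for the real (non-serializable) Kafka producer and the sample's messages:

```python
import threading
import queue

sent = []  # records what the producer would publish, for illustration

def kafka_send(topic, message):
    """Stand-in for the Kafka producer's send(); only ever runs on one thread."""
    sent.append((topic, message))

messages = queue.Queue()

def drain():
    # runs on a dedicated thread, so the producer object never has to
    # cross into the serialized streaming closures
    while True:
        item = messages.get()
        if item is None:          # sentinel: shut down the drain loop
            break
        kafka_send(*item)

t = threading.Thread(target=drain)
t.start()

# the streaming analytics code only enqueues plain data
messages.put(("topHashTags", "#spark,42"))
messages.put(("topHashTags.toneScores", "#spark,Cheerfulness,70.0"))
messages.put(None)
t.join()
```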
Real-time web app dashboard
‣ Technologies used:
- Mozaik (https://github.com/plouc/mozaik)
- ReactJS
- WebSocket
- D3JS/C3JS
‣ Consumes the topics generated by the Spark Streaming analytics:
• topHashTags
• topHashTags.toneScores
Access the Message Hub API through the message-hub-rest Node.js module
React components for the Mozaik framework
Demo!
Thank You
Spark Core Libraries
• Spark Core: general compute engine; handles distributed task dispatching, scheduling and basic I/O functions
• Spark SQL: executes SQL statements
• Spark Streaming: performs streaming analytics using micro-batches
• MLlib: common machine learning and statistical algorithms
• GraphX: distributed graph processing framework
Key reasons for interest in Spark
• Open Source: largest project and one of the most active on Apache; a vibrant, growing community of developers continuously improves the code base and extends capabilities; fast adoption in the enterprise (IBM, Databricks, etc.)
• Fast distributed data processing: in-memory storage greatly reduces disk I/O; up to 100x faster in memory, 10x faster on disk
• Web Scale: fault tolerant — seamlessly recomputes data lost to hardware failure; scalable — easily increase the number of worker nodes; flexible job execution (batch, streaming, interactive); easily handles petabytes of data without special code handling; compatible with the existing Hadoop ecosystem
• Productive: unified programming model across a range of use cases; rich and expressive APIs hide the complexities of parallel computing and worker node management; support for Java, Scala, Python and R means less code written; includes a set of core libraries that enable various analytic methods: Spark SQL, MLlib, GraphX
High level architecture
• Batch job (spark-submit): the Spark application (driver) talks to the master (cluster manager), which schedules work across the worker nodes of the Spark cluster
• Interactive notebook: the browser connects to a notebook server over HTTP/WebSockets; the server talks to a kernel via a kernel protocol (e.g. ZeroMQ), and the kernel drives the master and worker nodes
• The driver handles RDD partitioning, task packaging and dispatching, and worker node scheduling
Spark programming model lifecycle
Load data into RDDs
• sc (SparkContext) is the entry point, created by the application or provided automatically by the notebook shell; sqlCtxt (SQLContext) is the entry point for working with DataFrames and executing SQL queries
• In-memory collection: sc.parallelize
• Unstructured data: text (sc.textFile), HDFS (sc.hadoopFile)
• Structured data: JSON (sqlCtxt.jsonFile), Parquet (sqlCtxt.parquetFile), JDBC (sqlCtxt.load), custom data sources (1.4+)
• Streaming data: TwitterUtils.createStream, KafkaUtils.createStream, FlumeUtils.createStream, MQTTUtils.createStream, custom DStream
Apply transformations to create new RDDs
• map(fn): apply fn to all elements in the RDD
• flatMap(fn): same as map, but fn can return 0 or more elements
• filter(fn): select only the elements for which fn returns true
• reduceByKey, sortByKey
• sample: sample a fraction of the data
• union: combine the elements of 2 RDDs
• intersection: intersect 2 RDDs
• distinct: remove duplicate elements
• …
Apply actions (analytics) to produce results
• reduce(fn): perform a summary operation on the elements
• collect(): return all elements in an array
• count(): count the number of elements in the RDD
• take(n): return the first n elements in an array
• foreach(fn): execute fn on all the elements in the RDD
• saveAsTextFile: persist the elements to a text file
• …
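The load → transform → action lifecycle above can be mimicked with plain Python collections to see its shape; the comments note the corresponding PySpark calls (lists stand in for RDDs here, so no Spark installation is assumed):

```python
# "load": in PySpark this would be sc.parallelize(range(1, 11))
data = list(range(1, 11))

# "transformations": rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
squares = [x * x for x in data]
evens = [x for x in squares if x % 2 == 0]

# "actions": .reduce(lambda a, b: a + b) and .take(3)
total = sum(evens)
first3 = evens[:3]
```

Transformations are lazy in Spark; nothing runs until an action such as reduce or take is applied.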
  • 12. ©2015 IBM Corporation Ecosystem of the IBM Analytics for Apache Spark as a Service
  • 13. ©2015 IBM Corporation Agenda • Introduction • Quick Introduction to Spark • Set up development environment and create the hello world application • Notebook Walk-through • Spark Streaming • Deep dive: Sentiment analysis with Twitter and Watson Tone Analyzer • Architectural Overview • Set up the Bluemix services: Watson Tone Analyzer, Message Hub and Event Hub • Create the Streaming Receiver to connect to Kafka (Scala) • Create analytics using Jupyter Notebook (Python) • Create Real-time Web Dashboard (Nodejs)
  • 14. ©2015 IBM Corporation Setup local development Environment • Pre-requisites - Scala runtime 2.10.4 http://www.scala-lang.org/download/2.10.4.html - Homebrew http://brew.sh/ - Scala sbt http://www.scala-sbt.org/download.html - Spark 1.3.1 http://www.apache.org/dyn/closer.lua/spark/spark-1.3.1/spark-1.3.1.tgz • Detailed instructions here: https://developer.ibm.com/clouddataservices/start-developing-with-spark-and-notebooks/
  • 15. ©2015 IBM Corporation Setup local development Environment contd.. • Create a scala project using sbt • Create directories to start from scratch: mkdir helloSpark && cd helloSpark mkdir -p src/main/scala mkdir -p src/main/java mkdir -p src/main/resources • Create a subdirectory under the src/main/scala directory: mkdir -p com/ibm/cds/spark/sample • Github URL for the same project: https://github.com/ibm-cds-labs/spark.samples
  • 16. ©2015 IBM Corporation Setup local development Environment contd.. • Create HelloSpark.scala using an IDE or a text editor • Copy and paste this code snippet:

package com.ibm.cds.spark.samples

import org.apache.spark._

object HelloSpark {
    //main method invoked when running as a standalone Spark Application
    def main(args: Array[String]) {
        val conf = new SparkConf().setAppName("Hello Spark")
        val spark = new SparkContext(conf)

        println("Hello Spark Demo. Compute the mean and variance of a collection")
        val stats = computeStatsForCollection(spark);
        println(">>> Results: ")
        println(">>>>>>>Mean: " + stats._1 );
        println(">>>>>>>Variance: " + stats._2);
        spark.stop()
    }

    //Library method that can be invoked from a Jupyter Notebook
    def computeStatsForCollection( spark: SparkContext, countPerPartitions: Int = 100000, partitions: Int = 5): (Double, Double) = {
        val totalNumber = math.min( countPerPartitions * partitions,
  • 17. ©2015 IBM Corporation Setup local development Environment contd.. • Create a file build.sbt under the project root directory:

name := "helloSpark"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= {
    val sparkVersion = "1.3.1"
    Seq(
        "org.apache.spark" %% "spark-core" % sparkVersion,
        "org.apache.spark" %% "spark-sql" % sparkVersion,
        "org.apache.spark" %% "spark-repl" % sparkVersion
    )
}

• Under the project root directory run: Download all dependencies: $ sbt update Compile: $ sbt compile Package an application jar file: $ sbt package
  • 18. ©2015 IBM Corporation Hello World application on Bluemix Apache Starter
  • 19. ©2015 IBM Corporation Agenda • Introduction • Quick Introduction to Spark • Set up development environment and create the hello world application • Notebook Walk-through • Spark Streaming • Deep dive: Sentiment analysis with Twitter and Watson Tone Analyzer • Architectural Overview • Set up the Bluemix services: Watson Tone Analyzer, Message Hub and Event Hub • Create the Streaming Receiver to connect to Kafka (Scala) • Create analytics using Jupyter Notebook (Python) • Create Real-time Web Dashboard (Nodejs)
  • 20. ©2015 IBM Corporation Introduction to Notebooks ‣ Notebooks allow the creation of interactive, executable documents that combine rich text (Markdown), executable code (Scala, Python or R) and graphics (matplotlib) ‣ Apache Spark provides APIs in multiple languages that can be executed from a REPL shell: Scala, Python (PySpark), R ‣ Multiple open-source implementations are available: - Jupyter: https://jupyter.org - Apache Zeppelin: http://zeppelin-project.org
  • 21. ©2015 IBM Corporation Notebook walkthrough ‣ Sign up on Bluemix https://console.ng.bluemix.net/registration/ ‣ Getting started with Analytics for Apache Spark: https://www.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index.html ‣ You can also follow the tutorial here: https://developer.ibm.com/clouddataservices/start-developing-with-spark-and-notebooks/
  • 23. ©2015 IBM Corporation Agenda • Introduction • Quick Introduction to Spark • Set up development environment and create the hello world application • Notebook Walk-through • Spark Streaming • Deep dive: Sentiment analysis with Twitter and Watson Tone Analyzer • Architectural Overview • Set up the Bluemix services: Watson Tone Analyzer, Message Hub and Event Hub • Create the Streaming Receiver to connect to Kafka (Scala) • Create analytics using Jupyter Notebook (Python) • Create Real-time Web Dashboard (Nodejs)
  • 24. ©2015 IBM Corporation Spark Streaming ‣ “Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams” (http://spark.apache.org/docs/latest/streaming-programming-guide.html) ‣ Breaks the streaming data down into smaller micro-batches, which are then sent to the Spark engine
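The micro-batch idea behind Spark Streaming can be sketched in a few lines of plain Python: a continuous stream is cut into small fixed-size batches, and each batch is then processed as an ordinary collection (an RDD in real Spark). The `micro_batches` helper below is a hypothetical illustration, not a Spark API.

```python
# Sketch of micro-batching: cut a stream into small batches so each
# batch can be handed to a batch engine as a plain collection.
def micro_batches(stream, batch_size):
    """Yield successive fixed-size batches from an iterable stream."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly partial, batch
        yield batch

batches = list(micro_batches(range(7), batch_size=3))
```

In real Spark Streaming the batch boundary is a time interval (the batch duration passed to `StreamingContext`) rather than a record count, but the processing model is the same: each batch becomes one RDD.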
  • 25. ©2015 IBM Corporation Spark Streaming ‣ Provides connectors for multiple data sources: - Kafka - Flume - Twitter - MQTT - ZeroMQ ‣ Provides an API to create custom connectors. Lots of examples are available on GitHub and spark-packages.org
  • 26. ©2015 IBM Corporation Agenda • Introduction • Quick Introduction to Spark • Set up development environment and create the hello world application • Notebook Walk-through • Spark Streaming • Deep dive: Sentiment analysis with Twitter and Watson Tone Analyzer • Architectural Overview • Set up the Bluemix services: Watson Tone Analyzer, Message Hub and Event Hub • Create the Streaming Receiver to connect to Kafka (Scala) • Create analytics using Jupyter Notebook (Python) • Create Real-time Web Dashboard (Nodejs)
  • 27. ©2015 IBM Corporation Spark + Twitter + Watson application ‣ Use Spark Streaming in combination with IBM Watson to perform sentiment analysis and track how a conversation is trending on Twitter. ‣ Use Spark Streaming to create a feed that captures live tweets from Twitter. You can optionally filter the tweets to those that contain the hashtag(s) of your choice. ‣ The tweet data is then enriched in real time with various sentiment scores provided by the Watson Tone Analyzer service (available on Bluemix). This service provides insight into sentiment, or how the author feels. ‣ The data is then loaded and analyzed by the data scientist within a notebook. ‣ We can also use the streaming analytics to feed a real-time web app dashboard
  • 28. ©2015 IBM Corporation About this sample application • Github: https://github.com/ibm-cds-labs/spark.samples/tree/master/streaming-twitter • Tutorial: https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags • A word about Scala • Scala is object-oriented but also supports a functional programming style • Bi-directional interoperability with Java • Resources: • Official web site: http://scala-lang.org • Excellent first-steps site: http://www.artima.com/scalazine/articles/steps.html • Free e-books: http://readwrite.com/2011/04/30/5-free-b-books-and-tutorials-o
  • 29. ©2015 IBM Corporation Agenda • Introduction • Quick Introduction to Spark • Set up development environment and create the hello world application • Notebook Walk-through • Spark Streaming • Deep dive: Sentiment analysis with Twitter and Watson Tone Analyzer • Architectural Overview • Set up the Bluemix services: Watson Tone Analyzer, Message Hub and Event Hub • Create the Streaming Receiver to connect to Kafka (Scala) • Create analytics using Jupyter Notebook (Python) • Create Real-time Web Dashboard (Nodejs)
  • 30. ©2015 IBM Corporation Spark Streaming with “IBM Insight for Twitter” and “Watson Tone Analyzer” Watson Tone Analyzer Service Bluemix Producer Stream Enrich data with Emotion Tone Scores Processed data Scala Notebook IPython Notebook Consumer Stream Message Hub Service Bluemix Full Archive Search API Consumer Spark Topics Publish topics from Spark analytics results Event Hub Service Bluemix Real-Time Dashboard Data Engineer Business Analyst C(Suite) Data Scientist
  • 31. ©2015 IBM Corporation Building a Spark Streaming application Sentiment analysis with Twitter and Watson Tone Analyzer ‣Configure Twitter and Watson Tone Analyzer 1. Configure OAuth credentials for Twitter 2. Create a Watson Tone Analyzer Service on Bluemix 3. Configure MessageHub Service on Bluemix (Kafka) 4. Configure EventHub Service on Bluemix
  • 32. ©2015 IBM Corporation Configure OAuth credentials for Twitter ‣ You can follow along with the steps in https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags/#twitter
  • 33. ©2015 IBM Corporation Create a Watson Tone Analyzer Service on Bluemix ‣ You can follow along with the steps in https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags/#bluemix
  • 34. ©2015 IBM Corporation Building a Spark Streaming application Sentiment analysis with Twitter and Watson Tone Analyzer ‣Work with Twitter data 1. Create a Twitter Stream 2. Enrich the data with sentiment analysis from Watson Tone Analyzer 3. Aggregate data into RDD with enriched Data model 4. Create SparkSQL DataFrame and register Table
  • 35. ©2015 IBM Corporation Create a Twitter Stream Create a map that stores the credentials for the Twitter and Watson services:

//Hold configuration key/value pairs
val config = Map[String, String](
    ("twitter4j.oauth.consumerKey", Option(System.getProperty("twitter4j.oauth.consumerKey")).orNull ),
    ("twitter4j.oauth.consumerSecret", Option(System.getProperty("twitter4j.oauth.consumerSecret")).orNull ),
    ("twitter4j.oauth.accessToken", Option(System.getProperty("twitter4j.oauth.accessToken")).orNull ),
    ("twitter4j.oauth.accessTokenSecret", Option(System.getProperty("twitter4j.oauth.accessTokenSecret")).orNull ),
    ("tweets.key", Option(System.getProperty("tweets.key")).getOrElse("")),
    ("watson.tone.url", Option(System.getProperty("watson.tone.url")).orNull ),
    ("watson.tone.username", Option(System.getProperty("watson.tone.username")).orNull ),
    ("watson.tone.password", Option(System.getProperty("watson.tone.password")).orNull )
)

Twitter4J requires the credentials to be stored in System properties:

config.foreach( (t:(String,String)) => if ( t._1.startsWith( "twitter4j") ) System.setProperty( t._1, t._2 ) )
  • 36. ©2015 IBM Corporation Create a Twitter Stream

//Filter the tweets to only keep the ones with English as the language
//twitterStream is a discretized stream of twitter4j Status objects
var twitterStream = org.apache.spark.streaming.twitter.TwitterUtils.createStream( ssc, None )
    .filter { status =>
        Option(status.getUser).flatMap[String] { u => Option(u.getLang) }.getOrElse("").startsWith("en") && //Allow only tweets that use "en" as the language
        CharMatcher.ASCII.matchesAllOf(status.getText) && //Only pick text that is ASCII
        ( keys.isEmpty || keys.exists{ status.getText.contains(_) } ) //If the user specified #hashtags to monitor
    }

Initial DStream of Status Objects
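The filter predicate on this slide combines three checks: the user's language starts with "en", the text is pure ASCII, and (when keywords were supplied) at least one keyword appears in the text. The same logic can be sketched in plain Python; here `tweet` is an invented dict standing in for the twitter4j `Status` object, and the field names are assumptions for illustration.

```python
# Plain-Python version of the tweet filter: language, ASCII and keyword checks.
def keep_tweet(tweet, keys=()):
    lang = tweet.get("user_lang") or ""
    text = tweet.get("text", "")
    return (lang.startswith("en")                      # only English-language users
            and all(ord(c) < 128 for c in text)        # ASCII-only text
            and (not keys or any(k in text for k in keys)))  # optional hashtag match

kept = keep_tweet({"user_lang": "en", "text": "Watching #SuperBloodMoon"},
                  keys=("#SuperBloodMoon",))
dropped = keep_tweet({"user_lang": "fr", "text": "bonjour"})
```

Note the defensive `or ""`: like the `Option(...).getOrElse("")` chain in the Scala code, it keeps the predicate from failing when the user or language field is missing.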
  • 37. ©2015 IBM Corporation Enrich the data with sentiment analysis from Watson Tone Analyzer //Broadcast the config to each worker node val broadcastVar = sc.broadcast(config) Initial DStream of Status Objects
  • 38. ©2015 IBM Corporation Enrich the data with sentiment analysis from Watson Tone Analyzer Initial DStream of Status Objects Data Model |-- author: string (nullable = true) |-- date: string (nullable = true) |-- lang: string (nullable = true) |-- text: string (nullable = true) |-- lat: integer (nullable = true) |-- long: integer (nullable = true) |-- Cheerfulness: double (nullable = true) |-- Negative: double (nullable = true) |-- Anger: double (nullable = true) |-- Analytical: double (nullable = true) |-- Confident: double (nullable = true) |-- Tentative: double (nullable = true) |-- Openness: double (nullable = true) |-- Agreeableness: double (nullable = true) |-- Conscientiousness: double (nullable = true) DStream of key, value pairs
  • 39. ©2015 IBM Corporation Aggregate data into RDD with enriched Data model

//Aggregate the data from each DStream into the working RDD
rowTweets.foreachRDD( rdd => {
    if ( rdd.count() > 0 ){
        workingRDD = sc.parallelize( rdd.map( t => t._1 ).collect()).union( workingRDD )
    }
})

Initial DStream RowTweets Initial DStream RowTweets Initial DStream RowTweets … Micro-batches Row 1 Row 2 Row 3 Row 4 … … Row n workingRDD Data Model |-- author: string (nullable = true) |-- date: string (nullable = true) |-- lang: string (nullable = true) |-- text: string (nullable = true) |-- lat: integer (nullable = true) |-- long: integer (nullable = true) |-- Cheerfulness: double (nullable = true) |-- Negative: double (nullable = true) |-- Anger: double (nullable = true) |-- Analytical: double (nullable = true) |-- Confident: double (nullable = true) |-- Tentative: double (nullable = true) |-- Openness: double (nullable = true) |-- Agreeableness: double (nullable = true) |-- Conscientiousness: double (nullable = true)
  • 40. ©2015 IBM Corporation Create SparkSQL DataFrame and register Table

//Create a SparkSQL DataFrame from the aggregate workingRDD
val df = sqlContext.createDataFrame( workingRDD, schemaTweets )

//Register a temporary table using the name "tweets"
df.registerTempTable("tweets")

println("A new table named tweets with " + df.count() + " records has been correctly created and can be accessed through the SQLContext variable")
println("Here's the schema for tweets")
df.printSchema()

(sqlContext, df)

Row 1 Row 2 Row 3 Row 4 … … Row n workingRDD author date lang … Cheerfulness Negative … Conscientiousness John Smith 10/11/2015 – 20:18 en 0.0 65.8 … 25.5 Alfred … en 34.5 0.0 … 100.0 … … … … … … … … Chris … en 85.3 22.9 … 0.0 Relational SparkSQL Table
  • 41. ©2015 IBM Corporation Building a Spark Streaming application: Sentiment analysis with Twitter and Watson Tone Analyzer ‣IPython Notebook analysis 1. Load the data into an IPython Notebook 2. Analytic 1: Compute the distribution of tweets by sentiment scores greater than 60% 3. Analytic 2: Compute the top 10 hashtags contained in the tweets 4. Analytic 3: Visualize aggregated sentiment scores for the top 5 hashtags
  • 42. ©2015 IBM Corporation Load the data into an IPython Notebook ‣ You can follow along with the steps here: https://github.com/ibm-cds-labs/spark.samples/blob/master/streaming-twitter/notebook/Twitter%20%2B%20Watson%20Tone%20Analyzer%20Part%202.ipynb Create a SQLContext from a SparkContext Load from a parquet file and create a DataFrame Create a SQL table and start executing SQL queries
  • 43. ©2015 IBM Corporation Analytic 1 - Compute the distribution of tweets by sentiment scores greater than 60%

#create an array that will hold the count for each sentiment
sentimentDistribution = [0] * 9

#For each sentiment, run a SQL query that counts the number of tweets for which
#the sentiment score is greater than 60%, and store the result in the array
for i, sentiment in enumerate(tweets.columns[-9:]):
    sentimentDistribution[i] = sqlContext.sql(
        "SELECT count(*) as sentCount FROM tweets where " + sentiment + " > 60"
    ).collect()[0].sentCount
  • 44. ©2015 IBM Corporation Analytic 1 - Compute the distribution of tweets by sentiment scores greater than 60% Use matplotlib to create a bar chart
  • 45. ©2015 IBM Corporation Analytic 1 - Compute the distribution of tweets by sentiment scores greater than 60% Bar Chart Visualization
  • 46. ©2015 IBM Corporation Analytic 2: Compute the top 10 hashtags contained in the tweets Initial Tweets RDD Filter hashtags Key, value pair RDD Reduced map with counts Sorted Map by key flatMap filter map reduceByKey sortByKey
  • 47. ©2015 IBM Corporation Analytic 2: Compute the top 10 hashtags contained in the tweets
  • 48. ©2015 IBM Corporation Analytic 2: Compute the top 10 hashtags contained in the tweets
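The top-hashtags pipeline sketched on the slides above (flatMap → filter → map → reduceByKey → sortByKey) can be reproduced without Spark using plain Python; the tweets below are invented sample data, and `collections.Counter` stands in for the reduceByKey/sort steps.

```python
# Spark-free sketch of Analytic 2: split tweets into words (flatMap),
# keep hashtags (filter), count them (map + reduceByKey), sort by count.
from collections import Counter

tweets = [
    "Watching #SuperBloodMoon tonight",
    "#SuperBloodMoon was amazing #LunarEclipse",
    "no tags here",
]

words = [w for t in tweets for w in t.split()]      # flatMap: tweets -> words
hashtags = [w for w in words if w.startswith("#")]  # filter: keep hashtags
counts = Counter(hashtags)                          # map to (tag, 1) + reduceByKey
top10 = counts.most_common(10)                      # sort by count, take 10
```

In the PySpark version each arrow is a separate lazy RDD transformation, so the whole pipeline executes only when the final take/collect action runs.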
  • 49. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags ‣ Problem: - Compute the mean of all the emotion scores for all the top 10 hashtags - Format the data in a way that can be consumed by the plot script
  • 50. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags #Step 1: Create RDD from tweets dataframe tagsRDD = tweets.map(lambda t: t ) author … Cheerfulness Jake … 0.0 Scrad … 23.5 Nittya Indika … 84.0 … … … … … … Madison … 93.0 tweets (Type: DataFrame) Row(author=u'Jake', …, text=u’@sarahwag…’, Cheerfulness=0.0, …) Row(author=u’Scrad', …, text=u’ #SuperBloodMoon https://t…’, Cheerfulness=23.5, …) Row(author=u’ Nittya Indika', …, text=u’ Good mornin! http://t.…’, Cheerfulness=84.0, …) … … Row(author=u’ Madison', …, text=u’ how many nights…’, Cheerfulness=93.0, …) tagsRDD (Type: RDD)
  • 51. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags #Step 2: Filter to only keep the entries that are in top10tags tagsRDD = tagsRDD.filter( lambda t: any(s in t.text for s in [i[0] for i in top10tags] ) ) Row(author=u'Jake', …, text=u’@sarahwag…’, Cheerfulness=0.0, …) Row(author=u’Scrad', …, text=u’ #SuperBloodMoon https://t…’, Cheerfulness=23.5, …) Row(author=u’ Nittya Indika', …, text=u’ Good mornin! http://t.…’, Cheerfulness=84.0, …) … … Row(author=u’ Madison', …, text=u’ how many nights…’, Cheerfulness=93.0, …) Row(author=u'Mike McGuire', text=u'Explains my disappointment #SuperBloodMoon https://t.co/Gfg7vWws5W', …, Conscientiousness=0.0) Row(author=u'Meng_tisoy', text=u’…hihi #ALDUBThisMustBeLove https://t….’, …,Conscientiousness=68.0) Row(author=u'Kevin Contreras', text=u’…SILA! #ALDUBThisMustBeLove', …Conscientiousness=68.0) … … Row(author=u'abbi', text=u’…excited #ALDUBThisMustBeLove https://t…’,…, Conscientiousness=100.0)
  • 52. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags #Step 3: Create a flatMap using the expand function defined above, this will be used to collect all the scores #for a particular tag with the following format: Tag-Tone-ToneScore cols = tweets.columns[-9:] def expand( t ): ret = [ ] for s in [i[0] for i in top10tags]: if ( s in t.text ): for tone in cols: ret += [s + u"-" + unicode(tone) + ":" + unicode(getattr(t, tone))] return ret tagsRDD = tagsRDD.flatMap( expand ) Row(author=u'Mike McGuire', text=u'Explains my disappointment #SuperBloodMoon https://t.co/Gfg7vWws5W', …, Conscientiousness=0.0) Row(author=u'Meng_tisoy', text=u’…hihi #ALDUBThisMustBeLove https://t….’, …,Conscientiousness=68.0) Row(author=u'Kevin Contreras', text=u’…SILA! #ALDUBThisMustBeLove', …Conscientiousness=68.0) … Row(author=u'abbi', text=u’…excited #ALDUBThisMustBeLove https://t…’,…, Conscientiousness=100.0) u'#SuperBloodMoon-Cheerfulness:0.0' u'#SuperBloodMoon-Negative:100.0’ u'#SuperBloodMoon-Negative:23.5' … u'#ALDUBThisMustBeLove-Analytical:85.0’ FlatMap of encoded values
  • 53. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags #Step 4: Create a map indexed by Tag-Tone keys tagsRDD = tagsRDD.map( lambda fullTag : (fullTag.split(":")[0], float( fullTag.split(":")[1]) )) u'#SuperBloodMoon-Cheerfulness:0.0' u'#SuperBloodMoon-Negative:100.0’ u'#SuperBloodMoon-Negative:23.5' … u'#ALDUBThisMustBeLove-Analytical:85.0’ u'#SuperBloodMoon-Cheerfulness' 0.0 u'#SuperBloodMoon-Negative’ 100.0 u'#SuperBloodMoon-Negative' 23.5 … u'#ALDUBThisMustBeLove-Analytical’ 85.0 map
  • 54. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags #Step 5: Call combineByKey to format the data as follow #Key=Tag-Tone, Value=(count, sum_of_all_score_for_this_tone) tagsRDD = tagsRDD.combineByKey((lambda x: (x,1)), (lambda x, y: (x[0] + y, x[1] + 1)), (lambda x, y: (x[0] + y[0], x[1] + y[1]))) u'#SuperBloodMoon- Cheerfulness' 0.0 u'#SuperBloodMoon-Negative’ 100.0 u'#SuperBloodMoon-Negative' 23.5 … u'#ALDUBThisMustBeLove’ 85.0 u'#Supermoon-Confident’ (0.0, 3) u'#HajjStampede-Tentative’ (0.0, 3) u'#KiligKapamilya- Conscientiousness’ (290.0, 6) … u'#LunarEclipse-Tentative’ (92.0, 4) CreateCombiner: Create list of tuples (sum,count) mergeValue: called for each new value (sum, count) MergeCombiner: reduce part, merge 2 combiners
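The `combineByKey` call in Step 5 accumulates, for each Tag-Tone key, the running sum of scores and the count of occurrences. A single-machine equivalent can be sketched in plain Python; the `combine_by_key` helper and the sample pairs below are illustrations, not the deck's actual data.

```python
# Plain-Python equivalent of Step 5: for each Tag-Tone key, accumulate
# (sum_of_scores, count) so the average can be derived in a later step.
def combine_by_key(pairs):
    acc = {}
    for key, score in pairs:
        if key not in acc:
            acc[key] = (score, 1)          # createCombiner: first value seen
        else:
            s, c = acc[key]
            acc[key] = (s + score, c + 1)  # mergeValue: fold in a new value
    return acc

pairs = [("#SuperBloodMoon-Negative", 100.0),
         ("#SuperBloodMoon-Negative", 23.5),
         ("#SuperBloodMoon-Cheerfulness", 0.0)]
combined = combine_by_key(pairs)
```

In distributed Spark a third function, mergeCombiners, additionally merges two partial `(sum, count)` accumulators produced on different partitions, which is why `combineByKey` takes three lambdas on the slide.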
  • 55. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags #Step 6 : Re-index the map so the key is the Tag and the value is a (Tone, average_score) tuple #Key=Tag #Value=(Tone, average_score) tagsRDD = tagsRDD.map(lambda (key, ab): (key.split("-")[0], (key.split("-")[1], round(ab[0]/ab[1],2)))) u'#Supermoon-Confident’ (0.0, 3) u'#HajjStampede-Tentative’ (0.0, 3) u'#KiligKapamilya-Conscientiousness’ (290.0, 6) … u'#LunarEclipse-Tentative’ (92.0, 4) u'#Supermoon' (u'Confident', 0.0) u'#HajjStampede' (u'Tentative', 0.0) u'#KiligKapamilya' (u'Conscientiousness', 48.33) … u'#LunarEclipse' (u'Tentative', 23.0)
  • 56. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags #Step 7: Reduce the map on the Tag key, value becomes a list of (Tone,average_score) tuples tagsRDD = tagsRDD.reduceByKey( lambda x, y : makeList(x) + makeList(y) ) u'#Supermoon-Confident’ (u'Confident', 0.0) u'#HajjStampede-Tentative’ (u'Tentative', 0.0) u'#KiligKapamilya- Conscientiousness’ (u'Conscientiousness', 48.33) … u'#LunarEclipse-Tentative’ (u'Tentative', 23.0) u'#HajjStampede' [(u'Tentative', 0.0), (u'Agreeableness', 3.67), …, (u'Cheerfulness', 100.0)] u'#Supermoon' [(u'Confident', 0.0), (u'Openness', 91.0), …, (u'Agreeableness', 20.33)] u'#bloodmoon' [(u'Anger', 0.0), (u'Negative', 0.0), …, (u'Openness', 38.0)] … u'#KiligKapamilya ' [(u'Conscientiousness', 48.33), (u'Anger', 0.0),... (u'Agreeableness', 10.83)]
  • 57. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags #Step 8 : Sort the (Tone,average_score) tuples alphabetically by Tone tagsRDD = tagsRDD.mapValues( lambda x : sorted(x) ) u'#HajjStampede' [(u'Tentative', 0.0), (u'Agreeableness', 3.67), …, (u'Cheerfulness', 100.0)] u'#Supermoon' [(u'Confident', 0.0), (u'Openness', 91.0), …, (u'Agreeableness', 20.33)] u'#bloodmoon' [(u'Anger', 0.0), (u'Negative', 0.0), …, (u'Openness', 38.0)] … u'#KiligKapamilya ' [(u'Conscientiousness', 48.33), (u'Anger', 0.0),... (u'Agreeableness', 10.83)] u'#HajjStampede' [(u'Agreeableness', 3.67), (u'Cheerfulness', 100.0), …, (u'Tentative', 0.0)] u'#Supermoon' [(u'Agreeableness', 20.33), (u'Confident', 0.0), ..., (u'Openness', 91.0)] u'#bloodmoon' [(u'Anger', 0.0), (u'Negative', 0.0), …, (u'Openness', 38.0)] … u'#KiligKapamilya' [(u'Agreeableness', 10.83), (u'Anger', 0.0), (u'Conscientiousness', 48.33), ...]
  • 58. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags #Step 9 : Format the data as expected by the plotting code in the next cell. #map the Values to a tuple as follow: ([list of tone], [list of average score]) tagsRDD = tagsRDD.mapValues( lambda x : ([elt[0] for elt in x],[elt[1] for elt in x]) ) u'#HajjStampede' [(u'Agreeableness', 3.67), (u'Cheerfulness', 100.0), …, (u'Tentative', 0.0)] u'#Supermoon' [(u'Agreeableness', 20.33), (u'Confident', 0.0), ..., (u'Openness', 91.0)] u'#bloodmoon' [(u'Anger', 0.0), (u'Negative', 0.0), …, (u'Openness', 38.0)] … u'#KiligKapamilya' [(u'Agreeableness', 10.83), (u'Anger', 0.0), (u'Conscientiousness', 48.33), ...] u'#HajjStampede' ([u'Agreeableness’, u'Cheerfulness’, …, u'Tentative’], [3.67, 100.0, …, 0.0]) u'#Supermoon' ([u'Agreeableness’, u'Confident', ..., u'Openness’], [20.33, 0.0, …, 91.0]) u'#bloodmoon' ([u'Anger’, u'Negative', …, u'Openness’], [0.0, 0.0, …, 38.0]) … u'#KiligKapamilya' ([u'Agreeableness’, u'Anger’, u'Conscientiousness', ...], [10.83, 0.0, 48.33, ...]) Value is a tuple of 2 arrays: tones-scores
  • 59. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags #Step 10 : Use custom sort function to sort the entries by order of appearance in top10tags def customCompare( key ): for (k,v) in top10tags: if k == key: return v return 0 tagsRDD = tagsRDD.sortByKey(ascending=False, numPartitions=None, keyfunc = customCompare) u'#HajjStampede' ([u'Agreeableness’,u'Cheerfulness’,… u'Tentative’], [3.67, 100.0,…0.0]) u'#Supermoon' ([u'Agreeableness’,u'Confident',..., u'Openness’],[20.33, 0.0,… 91.0]) u'#bloodmoon' ([u'Anger’,u'Negative', …, u'Openness’), [0.0, 0.0,…38.0]) … u'#KiligKapamilya' ([u'Agreeableness’,u'Anger’, u'Conscientiousness',...],[10.83, 0.0,48.33,...]) u'#Superbloodmon' ([u'Agreeableness’,u'Cheerfulness’,… u'Tentative’], [33.97, 19.38,…12.85]) u'#BBWLA' ([u'Agreeableness’,u'Confident',..., u'Openness’],[38.33, 12.34,… 21.43]) u'#ALDUBThisMust BeLove' ([u'Anger’,u'Negative', …, u'Openness’), [0.0, 0.0,…62.0]) … u'#Newmusic' ([u'Agreeableness’,u'Anger’, u'Conscientiousness',...],[0.0, 0.0,68.33,...])
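Step 10's `sortByKey` with a custom `keyfunc` orders the tags by their tweet count in `top10tags` rather than alphabetically. The same idea in plain Python is a `sorted` call with a rank lookup; the tag counts below are invented sample data.

```python
# Plain-Python version of the Step 10 sort: order tags by their count in
# top10tags (highest first), mirroring sortByKey with keyfunc=customCompare.
top10tags = [("#SuperBloodMoon", 120), ("#BBWLA", 80), ("#Newmusic", 15)]
rank = dict(top10tags)  # tag -> count, the value customCompare returns

entries = {"#Newmusic": None, "#SuperBloodMoon": None, "#BBWLA": None}
ordered = sorted(entries, key=lambda tag: rank.get(tag, 0), reverse=True)
```

Unknown tags fall back to rank 0 (the `return 0` branch of `customCompare`), so anything outside the top-10 list sorts to the end.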
  • 60. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags
  • 61. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags
  • 62. ©2015 IBM Corporation Agenda • Introduction • Quick Introduction to Spark • Set up development environment and create the hello world application • Notebook Walk-through • Spark Streaming • Deep dive: Sentiment analysis with Twitter and Watson Tone Analyzer • Architectural Overview • Set up the Bluemix services: Watson Tone Analyzer, Message Hub and Event Hub • Create the Streaming Receiver to connect to Kafka (Scala) • Create analytics using Jupyter Notebook (Python) • Create Real-time Web Dashboard (Nodejs)
  • 63. ©2015 IBM Corporation Spark Streaming with “IBM Insight for Twitter” and “Watson Tone Analyzer” Watson Tone Analyzer Service Bluemix Producer Stream Enrich data with Emotion Tone Scores Processed data Scala Notebook IPython Notebook Consumer Stream Message Hub Service Bluemix Full Archive Search API Consumer Spark Topics Publish topics from Spark analytics results Event Hub Service Bluemix Real-Time Dashboard Data Engineer Business Analyst C(Suite) Data Scientist
  • 64. ©2015 IBM Corporation Real-Time Web app Dashboard ‣ Pie chart showing top Hashtags distribution ‣ Bar chart showing distribution of tone scores for each of top HashTags
  • 65. ©2015 IBM Corporation Create a Receiver that subscribes to Kafka topics Store new record into DStream Get batch of new records MessageHub on Bluemix requires Kafka 0.9
  • 66. ©2015 IBM Corporation Create Kafka DStream Implicit conversion used to synthetically add a method to StreamingContext
  • 67. ©2015 IBM Corporation Enrich Tweets with Watson Scores Get Tone scores Map to new EnrichedTweet Object
  • 68. ©2015 IBM Corporation Streaming analytics Prepare for Map/Reduce Map tag-tone to corresponding score Compute Count + Average for each score Map each tag to count + List of scores averages Reduce
  • 69. ©2015 IBM Corporation Maintain State between micro-batch RDDs Maintain State between micro-batches by recomputing count and List of averages
  • 70. ©2015 IBM Corporation Produce Streaming analytics topic data Can’t call the Kafka Producer from the streaming analytics because it is not serializable Post messages to a queue Process the message queue from a separate Thread
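The pattern on this slide, enqueue from the streaming code and let a separate thread own the non-serializable producer, can be sketched in plain Python with `queue.Queue` and `threading`. This is an illustration of the pattern only: a `sent` list stands in for the real Kafka producer, and the topic strings are invented.

```python
# Sketch: streaming code only enqueues; a dedicated thread drains the
# queue and performs the producer work that can't be serialized to workers.
import queue
import threading

msg_queue = queue.Queue()
sent = []  # stand-in for the Kafka producer

def drain():
    while True:
        msg = msg_queue.get()
        if msg is None:       # sentinel: shut the worker down
            break
        sent.append(msg)      # real code would call producer.send(topic, msg)

worker = threading.Thread(target=drain)
worker.start()

# Called from the streaming analytics: cheap, thread-safe, serializable-free
for m in ["topHashTags:...", "topHashTags.toneScores:..."]:
    msg_queue.put(m)
msg_queue.put(None)
worker.join()
```

Because only the queue handle is touched from the analytics path, the producer object never needs to be shipped to Spark workers, which is exactly the serialization problem the slide describes.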
  • 71. ©2015 IBM Corporation Real-time web app dashboard ‣ Technology used: - Mozaik (https://github.com/plouc/mozaik) - ReactJS - WebSocket - D3JS/C3JS ‣ Consume Topics generated by the Spark Streaming analytics Consumer Spark Topics Real-Time Dashboard Topics: •topHashTags •topHashTags.toneScores
  • 72. ©2015 IBM Corporation Access MessageHub API through message-hub-rest node module
  • 73. ©2015 IBM Corporation React Components for Mozaik framework