©2015 IBM Corporation
Spark + Watson +
Twitter
DataPalooza SF 2015
David Taieb
STSM - IBM Cloud Data Services
Agenda
• Introduction
• Quick Introduction to Spark
• Set up development environment and create the hello world application
• Notebook Walk-through
• Spark Streaming
• Deep dive: Sentiment analysis with Twitter and Watson Tone Analyzer
• Architectural Overview
• Set up the Bluemix services: Watson Tone Analyzer, Message Hub and Event Hub
• Create the Streaming Receiver to connect to Kafka (Scala)
• Create analytics using Jupyter Notebook (Python)
• Create Real-time Web Dashboard (Node.js)
Introduction
Introduction
Our mission:
We are here to help developers realize their most ambitious projects.
Goals for today's session:
• Introduction to real-time analytics using Spark Streaming
• Technical deep dive on the Spark + Watson + Twitter sample application
• By the end of this session, you should be able to download the source code and run the application on IBM Analytics for Apache Spark
What is Spark?
Spark is an open source, in-memory computing framework for distributed data processing and iterative analysis on massive data volumes.
Spark Core Libraries
• Spark Core: general compute engine; handles distributed task dispatching, scheduling and basic I/O functions
• Spark SQL: executes SQL statements
• Spark Streaming: performs streaming analytics using micro-batches
• MLlib (machine learning): common machine learning and statistical algorithms
• GraphX (graph): distributed graph processing framework
Key reasons for interest in Spark
• Fast
  - In-memory storage greatly reduces disk I/O
  - Up to 100x faster in memory, 10x faster on disk
• Open Source
  - Largest and one of the most active projects on Apache
  - Vibrant, growing community of developers continuously improves the code base and extends capabilities
  - Fast adoption in the enterprise (IBM, Databricks, etc.)
• Web Scale
  - Fault tolerant: seamlessly recomputes data lost to hardware failure
  - Scalable: easily increase the number of worker nodes
  - Flexible job execution: batch, streaming, interactive
  - Easily handles petabytes of data without special code handling
  - Compatible with the existing Hadoop ecosystem
• Productive
  - Unified programming model across a range of use cases
  - Rich, expressive APIs hide the complexities of parallel computing and worker node management
  - Support for Java, Scala, Python and R: less code written
  - Includes a set of core libraries that enable various analytic methods: Spark SQL, MLlib, GraphX
High level architecture
• Batch job (spark-submit): the Spark application (driver) talks to the Master (cluster manager), which schedules work across the Worker Nodes of the Spark cluster.
• Interactive notebook: the browser talks to the notebook server over HTTP/WebSockets; the notebook server talks to a kernel over a kernel protocol (e.g. ZeroMQ); the kernel drives the Master and the Worker Nodes.
• The driver and cluster manager handle:
  - RDD partitioning
  - Task packaging and dispatching
  - Worker node scheduling
Spark programming model lifecycle
• Load data into RDDs
  - In-memory collection: sc.parallelize
  - Unstructured data: text (sc.textFile), HDFS (sc.hadoopFile)
  - Structured data: JSON (sqlCtxt.jsonFile), Parquet (sqlCtxt.parquetFile), JDBC (sqlCtxt.load), custom data sources (Spark 1.4+)
  - Streaming data: TwitterUtils.createStream, KafkaUtils.createStream, FlumeUtils.createStream, MQTTUtils.createStream, custom DStreams
  - sc: the SparkContext entry point, created by the application or provided automatically by the notebook shell
  - sqlCtxt: the SQLContext entry point for working with DataFrames and executing SQL queries
• Apply transformations to create new RDDs from existing ones
  - map(fn): apply fn to all elements in the RDD
  - flatMap(fn): same as map, but fn can return 0 or more elements
  - filter(fn): select only the elements for which fn returns true
  - reduceByKey, sortByKey
  - sample: sample a fraction of the data
  - union: combine the elements of 2 RDDs
  - intersection: intersect 2 RDDs
  - distinct: remove duplicate elements
  - …
• Apply actions (analytics) to produce results
  - reduce(fn): perform a summary operation on the elements
  - collect(): return all elements in an Array
  - count(): count the number of elements in the RDD
  - take(n): return the first n elements in an Array
  - foreach(fn): execute fn on all the elements in the RDD
  - saveAsTextFile: persist the elements in a text file
  - …
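The load / transform / act lifecycle can be illustrated with plain Python collections standing in for RDDs (no Spark cluster required); the equivalent PySpark calls are noted in the comments.

```python
# Load: create the "RDD" (a plain list standing in for sc.parallelize(range(1, 11)))
data = list(range(1, 11))

# Transform: build new collections from existing ones
squares = [x * x for x in data]             # rdd.map(lambda x: x * x)
evens = [x for x in squares if x % 2 == 0]  # .filter(lambda x: x % 2 == 0)

# Act: produce results
total = sum(evens)   # rdd.reduce(lambda a, b: a + b)
count = len(evens)   # rdd.count()
first3 = evens[:3]   # rdd.take(3)

print(total, count, first3)  # → 220 5 [4, 16, 36]
```

In Spark the transformations are lazy and only run when an action such as reduce or take is invoked; the plain-list version above evaluates eagerly but produces the same results.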
Job Scheduling
Ecosystem of IBM Analytics for Apache Spark as a Service
Setup local development environment
• Prerequisites
  - Scala runtime 2.10.4: http://www.scala-lang.org/download/2.10.4.html
  - Homebrew: http://brew.sh/
  - Scala sbt: http://www.scala-sbt.org/download.html
  - Spark 1.3.1: http://www.apache.org/dyn/closer.lua/spark/spark-1.3.1/spark-1.3.1.tgz
• Detailed instructions here: https://developer.ibm.com/clouddataservices/start-developing-with-spark-and-notebooks/
Setup local development environment contd.
• Create a Scala project using sbt
• Create the directories to start from scratch:
  mkdir helloSpark && cd helloSpark
  mkdir -p src/main/scala
  mkdir -p src/main/java
  mkdir -p src/main/resources
• Under the src/main/scala directory, create the package subdirectory:
  mkdir -p com/ibm/cds/spark/samples
• GitHub URL for the same project: https://github.com/ibm-cds-labs/spark.samples
Setup local development environment contd.
• Create HelloSpark.scala using an IDE or a text editor
• Copy-paste this code snippet:

package com.ibm.cds.spark.samples

import org.apache.spark._

object HelloSpark {
    //main method invoked when running as a standalone Spark application
    def main(args: Array[String]) {
        val conf = new SparkConf().setAppName("Hello Spark")
        val spark = new SparkContext(conf)

        println("Hello Spark Demo. Compute the mean and variance of a collection")
        val stats = computeStatsForCollection(spark)
        println(">>> Results: ")
        println(">>>>>>>Mean: " + stats._1 )
        println(">>>>>>>Variance: " + stats._2)
        spark.stop()
    }

    //Library method that can be invoked from a Jupyter Notebook
    def computeStatsForCollection( spark: SparkContext,
        countPerPartitions: Int = 100000, partitions: Int = 5): (Double, Double) = {
        val totalNumber = math.min( countPerPartitions * partitions,
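The slide truncates the body of computeStatsForCollection. The statistics the demo reports, the mean and variance of a generated collection, can be sketched in plain Python; the uniform value generation and the sizes used here are illustrative assumptions, not the original implementation.

```python
import random

def compute_stats(count_per_partition=100000, partitions=5):
    """Compute (mean, variance) of a generated collection.
    In the Spark version, the data would be distributed with sc.parallelize."""
    random.seed(42)
    data = [random.uniform(0, 100) for _ in range(count_per_partition * partitions)]
    n = len(data)
    mean = sum(data) / n
    # Population variance, computed in a second pass over the data
    variance = sum((x - mean) ** 2 for x in data) / n
    return mean, variance

mean, variance = compute_stats(1000, 5)
print("Mean:", mean)       # ≈ 50 for uniform values in [0, 100]
print("Variance:", variance)  # ≈ 833 (100² / 12) for the same distribution
```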
Setup local development environment contd.
• Create a file build.sbt under the project root directory:

name := "helloSpark"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= {
    val sparkVersion = "1.3.1"
    Seq(
        "org.apache.spark" %% "spark-core" % sparkVersion,
        "org.apache.spark" %% "spark-sql" % sparkVersion,
        "org.apache.spark" %% "spark-repl" % sparkVersion
    )
}

• Under the project root directory, run:
  $ sbt update    (download all dependencies)
  $ sbt compile   (compile the sources)
  $ sbt package   (package an application jar file)
Hello World application on Bluemix Apache Starter
Introduction to Notebooks
‣ Notebooks allow the creation of interactive, executable documents that combine rich text (Markdown), executable code (Scala, Python or R) and graphics (matplotlib)
‣ Apache Spark provides APIs in multiple languages that can be executed through a REPL shell: Scala, Python (PySpark), R
‣ Multiple open-source implementations are available:
  - Jupyter: https://jupyter.org
  - Apache Zeppelin: http://zeppelin-project.org
Notebook walkthrough
‣ Sign up on Bluemix: https://console.ng.bluemix.net/registration/
‣ Getting started with Analytics for Apache Spark: https://www.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index.html
‣ You can also follow the tutorial here: https://developer.ibm.com/clouddataservices/start-developing-with-spark-and-notebooks/
Spark Streaming
‣ “Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams” (http://spark.apache.org/docs/latest/streaming-programming-guide.html)
‣ Breaks the streaming data down into smaller pieces (micro-batches), which are then sent to the Spark engine
Spark Streaming
‣ Provides connectors for multiple data sources:
  - Kafka
  - Flume
  - Twitter
  - MQTT
  - ZeroMQ
‣ Provides an API to create custom connectors; lots of examples are available on GitHub and spark-packages.org
Spark + Twitter + Watson application
‣ Use Spark Streaming in combination with IBM Watson to perform sentiment analysis and track how a conversation is trending on Twitter.
‣ Use Spark Streaming to create a feed that captures live tweets from Twitter. You can optionally filter the tweets that contain the hashtag(s) of your choice.
‣ The tweet data is then enriched in real time with various sentiment scores provided by the Watson Tone Analyzer service (available on Bluemix). This service provides insight into sentiment, or how the author feels.
‣ The data is then loaded and analyzed by the data scientist within a notebook.
‣ We can also use the streaming analytics to feed a real-time web dashboard.
About this sample application
• GitHub: https://github.com/ibm-cds-labs/spark.samples/tree/master/streaming-twitter
• Tutorial: https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags
• A word about Scala
  - Scala is object-oriented but also supports a functional programming style
  - Bi-directional interoperability with Java
• Resources:
  - Official web site: http://scala-lang.org
  - Excellent first-steps site: http://www.artima.com/scalazine/articles/steps.html
  - Free e-books: http://readwrite.com/2011/04/30/5-free-b-books-and-tutorials-o
Spark Streaming with “IBM Insight for Twitter” and “Watson Tone Analyzer”
Architecture (flow shown in the original diagram):
• A producer stream captures tweets (live feed, or the Full Archive Search API) and enriches the data with emotion tone scores from the Watson Tone Analyzer service on Bluemix.
• The processed data is published to the Message Hub service (Kafka) on Bluemix.
• A consumer stream feeds the data to a Scala notebook and an IPython notebook for Spark analytics.
• Topics published from the Spark analytics results flow through the Event Hub service on Bluemix to a real-time dashboard.
• Personas involved: Data Engineer, Data Scientist, Business Analyst, C-Suite.
Building a Spark Streaming application
Sentiment analysis with Twitter and Watson Tone Analyzer
‣Configure Twitter and Watson Tone Analyzer
1. Configure OAuth credentials for Twitter
2. Create a Watson Tone Analyzer Service on Bluemix
3. Configure MessageHub Service on Bluemix (Kafka)
4. Configure EventHub Service on Bluemix
Configure OAuth credentials for Twitter
‣ You can follow along with the steps in https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags/#twitter
Create a Watson Tone Analyzer Service on Bluemix
‣ You can follow along with the steps in https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags/#bluemix
Building a Spark Streaming application
Sentiment analysis with Twitter and Watson Tone Analyzer
‣ Work with Twitter data
  1. Create a Twitter stream
  2. Enrich the data with sentiment analysis from Watson Tone Analyzer
  3. Aggregate the data into an RDD with the enriched data model
  4. Create a SparkSQL DataFrame and register a table
Create a Twitter Stream
Create a map that stores the credentials for the Twitter and Watson services:

//Hold configuration key/value pairs
val config = Map[String, String](
    ("twitter4j.oauth.consumerKey", Option(System.getProperty("twitter4j.oauth.consumerKey")).orNull ),
    ("twitter4j.oauth.consumerSecret", Option(System.getProperty("twitter4j.oauth.consumerSecret")).orNull ),
    ("twitter4j.oauth.accessToken", Option(System.getProperty("twitter4j.oauth.accessToken")).orNull ),
    ("twitter4j.oauth.accessTokenSecret", Option(System.getProperty("twitter4j.oauth.accessTokenSecret")).orNull ),
    ("tweets.key", Option(System.getProperty("tweets.key")).getOrElse("")),
    ("watson.tone.url", Option(System.getProperty("watson.tone.url")).orNull ),
    ("watson.tone.username", Option(System.getProperty("watson.tone.username")).orNull ),
    ("watson.tone.password", Option(System.getProperty("watson.tone.password")).orNull )
)

Twitter4J requires the credentials to be stored in System properties:

config.foreach( (t:(String,String)) =>
    if ( t._1.startsWith( "twitter4j") ) System.setProperty( t._1, t._2 )
)
Create a Twitter Stream
Create the initial DStream of twitter4j Status objects, filtered to keep only the tweets we want:

//Filter the tweets to only keep the ones with English as the language
//twitterStream is a discretized stream of twitter4j Status objects
var twitterStream = org.apache.spark.streaming.twitter.TwitterUtils.createStream( ssc, None )
    .filter { status =>
        Option(status.getUser).flatMap[String] {
            u => Option(u.getLang)
        }.getOrElse("").startsWith("en")                          //Allow only tweets that use "en" as the language
        && CharMatcher.ASCII.matchesAllOf(status.getText)         //Only pick text that is ASCII
        && ( keys.isEmpty || keys.exists{status.getText.contains(_)}) //If the user specified #hashtags to monitor
    }
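The same three-part predicate (English user language, ASCII-only text, optional hashtag match) can be expressed in plain Python. The tweet dicts and their field names below are hypothetical stand-ins for the twitter4j Status objects.

```python
def keep_tweet(status, keys=()):
    """Mirror the Scala filter: English language, ASCII text, optional hashtag match."""
    lang = status.get("user_lang") or ""
    text = status.get("text") or ""
    return (lang.startswith("en")
            and all(ord(c) < 128 for c in text)            # CharMatcher.ASCII equivalent
            and (not keys or any(k in text for k in keys)))  # empty keys == no hashtag filter

tweets = [
    {"user_lang": "en", "text": "Loving #spark tonight"},
    {"user_lang": "fr", "text": "Bonjour #spark"},        # dropped: not English
    {"user_lang": "en", "text": "caf\u00e9 #spark"},      # dropped: non-ASCII text
]
print([t["text"] for t in tweets if keep_tweet(t, keys=["#spark"])])
# → ['Loving #spark tonight']
```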
Enrich the data with sentiment analysis from Watson Tone Analyzer
Broadcast the configuration to each worker node:

//Broadcast the config to each worker node
val broadcastVar = sc.broadcast(config)
Enrich the data with sentiment analysis from Watson Tone Analyzer
The initial DStream of Status objects becomes a DStream of key/value pairs with the following data model:

|-- author: string (nullable = true)
|-- date: string (nullable = true)
|-- lang: string (nullable = true)
|-- text: string (nullable = true)
|-- lat: integer (nullable = true)
|-- long: integer (nullable = true)
|-- Cheerfulness: double (nullable = true)
|-- Negative: double (nullable = true)
|-- Anger: double (nullable = true)
|-- Analytical: double (nullable = true)
|-- Confident: double (nullable = true)
|-- Tentative: double (nullable = true)
|-- Openness: double (nullable = true)
|-- Agreeableness: double (nullable = true)
|-- Conscientiousness: double (nullable = true)
Aggregate data into RDD with enriched Data model
Each micro-batch DStream of RowTweets (following the data model above) is merged into the aggregate workingRDD:

//Aggregate the data from each DStream into the working RDD
rowTweets.foreachRDD( rdd => {
    if ( rdd.count() > 0 ){
        workingRDD = sc.parallelize( rdd.map( t => t._1 ).collect()).union( workingRDD )
    }
})
Create SparkSQL DataFrame and register Table

//Create a SparkSQL DataFrame from the aggregate workingRDD
val df = sqlContext.createDataFrame( workingRDD, schemaTweets )
//Register a temporary table using the name "tweets"
df.registerTempTable("tweets")
println("A new table named tweets with " + df.count() + " records has been correctly created and can be accessed through the SQLContext variable")
println("Here's the schema for tweets")
df.printSchema()
(sqlContext, df)

The workingRDD of rows becomes a relational SparkSQL table, e.g.:

author     | date               | lang | … | Cheerfulness | Negative | … | Conscientiousness
John Smith | 10/11/2015 – 20:18 | en   | … | 0.0          | 65.8     | … | 25.5
Alfred     | …                  | en   | … | 34.5         | 0.0      | … | 100.0
Chris      | …                  | en   | … | 85.3         | 22.9     | … | 0.0
Building a Spark Streaming application:
Sentiment analysis with Twitter and Watson Tone Analyzer
‣ IPython Notebook analysis
  1. Load the data into an IPython Notebook
  2. Analytic 1: Compute the distribution of tweets by sentiment scores greater than 60%
  3. Analytic 2: Compute the top 10 hashtags contained in the tweets
  4. Analytic 3: Visualize aggregated sentiment scores for the top 5 hashtags
Load the data into an IPython Notebook
‣ You can follow along with the steps here: https://github.com/ibm-cds-labs/spark.samples/blob/master/streaming-twitter/notebook/Twitter%20%2B%20Watson%20Tone%20Analyzer%20Part%202.ipynb
‣ Create a SQLContext from a SparkContext
‣ Load from a parquet file and create a DataFrame
‣ Create a SQL table and start executing SQL queries
Analytic 1 - Compute the distribution of tweets by sentiment scores greater than 60%

#create an array that will hold the count for each sentiment
sentimentDistribution=[0] * 9
#For each sentiment, run a sql query that counts the number of tweets for which the sentiment score is greater than 60%
#Store the data in the array
for i, sentiment in enumerate(tweets.columns[-9:]):
    sentimentDistribution[i]=sqlContext.sql("SELECT count(*) as sentCount FROM tweets where " + sentiment + " > 60")\
        .collect()[0].sentCount
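The counting logic itself, stripped of the SQLContext, is a per-column threshold count; here it is over plain Python dicts with made-up sample rows and a shortened sentiment list.

```python
sentiments = ["Cheerfulness", "Negative", "Anger"]  # the notebook uses the last 9 columns

rows = [
    {"Cheerfulness": 85.0, "Negative": 10.0, "Anger": 0.0},
    {"Cheerfulness": 70.0, "Negative": 65.8, "Anger": 0.0},
    {"Cheerfulness": 0.0,  "Negative": 90.0, "Anger": 61.0},
]

# Equivalent of: SELECT count(*) FROM tweets WHERE <sentiment> > 60, once per sentiment
distribution = [sum(1 for r in rows if r[s] > 60) for s in sentiments]
print(distribution)  # → [2, 2, 1]
```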
Analytic 1 - Compute the distribution of tweets by sentiment scores greater than 60%
Use matplotlib to create a bar chart (bar chart visualization shown on the slides).
Analytic 2: Compute the top 10 hashtags contained in the tweets
Pipeline over the initial tweets RDD:
flatMap (split into words) → filter (keep hashtags) → map (key/value pairs) → reduceByKey (counts per hashtag) → sortByKey (sorted map)
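The flatMap → filter → map → reduceByKey → sortByKey pipeline can be mimicked on plain Python lists (the tweet texts below are made up):

```python
from collections import defaultdict

texts = [
    "watching the sky #SuperBloodMoon #LunarEclipse",
    "wow #SuperBloodMoon",
    "new single out #NewMusic",
]

# flatMap: split every tweet into words
words = [w for t in texts for w in t.split()]
# filter: keep only the hashtags
tags = [w for w in words if w.startswith("#")]
# map + reduceByKey: (tag, 1) pairs summed per tag
counts = defaultdict(int)
for tag in tags:
    counts[tag] += 1
# sort by count, descending, and take the top 10
top10tags = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top10tags)  # → [('#SuperBloodMoon', 2), ('#LunarEclipse', 1), ('#NewMusic', 1)]
```

In Spark each stage is a distributed RDD transformation; the list comprehensions above run the same logic on one machine.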
Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags
‣ Problem:
  - Compute the average of each emotion score across all the top 10 hashtags
  - Format the data in a way that can be consumed by the plot script
Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

#Step 1: Create RDD from tweets dataframe
tagsRDD = tweets.map(lambda t: t )

The tweets DataFrame (columns author, …, Cheerfulness, …) becomes tagsRDD, an RDD of Row objects, e.g.:
Row(author=u'Jake', …, text=u'@sarahwag…', Cheerfulness=0.0, …)
Row(author=u'Scrad', …, text=u' #SuperBloodMoon https://t…', Cheerfulness=23.5, …)
Row(author=u'Nittya Indika', …, text=u' Good mornin! http://t.…', Cheerfulness=84.0, …)
…
Row(author=u'Madison', …, text=u' how many nights…', Cheerfulness=93.0, …)
Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

#Step 2: Filter to only keep the entries that are in top10tags
tagsRDD = tagsRDD.filter( lambda t: any(s in t.text for s in [i[0] for i in top10tags] ) )

Only the rows whose text contains one of the top 10 hashtags remain, e.g.:
Row(author=u'Mike McGuire', text=u'Explains my disappointment #SuperBloodMoon https://t.co/Gfg7vWws5W', …, Conscientiousness=0.0)
Row(author=u'Meng_tisoy', text=u'…hihi #ALDUBThisMustBeLove https://t….', …, Conscientiousness=68.0)
Row(author=u'Kevin Contreras', text=u'…SILA! #ALDUBThisMustBeLove', …, Conscientiousness=68.0)
…
Row(author=u'abbi', text=u'…excited #ALDUBThisMustBeLove https://t…', …, Conscientiousness=100.0)
Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

#Step 3: Create a flatMap using the expand function defined below; this will be used to collect all the scores
#for a particular tag with the following format: Tag-Tone:ToneScore
cols = tweets.columns[-9:]
def expand( t ):
    ret = [ ]
    for s in [i[0] for i in top10tags]:
        if ( s in t.text ):
            for tone in cols:
                ret += [s + u"-" + unicode(tone) + ":" + unicode(getattr(t, tone))]
    return ret
tagsRDD = tagsRDD.flatMap( expand )

The result is a flatMap of encoded values, e.g.:
u'#SuperBloodMoon-Cheerfulness:0.0'
u'#SuperBloodMoon-Negative:100.0'
u'#SuperBloodMoon-Negative:23.5'
…
u'#ALDUBThisMustBeLove-Analytical:85.0'
Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

#Step 4: Create a map indexed by Tag-Tone keys
tagsRDD = tagsRDD.map( lambda fullTag : (fullTag.split(":")[0], float( fullTag.split(":")[1]) ))

Each encoded string becomes a (Tag-Tone, score) pair, e.g.:
(u'#SuperBloodMoon-Cheerfulness', 0.0)
(u'#SuperBloodMoon-Negative', 100.0)
(u'#SuperBloodMoon-Negative', 23.5)
…
(u'#ALDUBThisMustBeLove-Analytical', 85.0)
Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

#Step 5: Call combineByKey to format the data as follows
#Key=Tag-Tone, Value=(sum_of_all_scores_for_this_tone, count)
tagsRDD = tagsRDD.combineByKey((lambda x: (x,1)),
    (lambda x, y: (x[0] + y, x[1] + 1)),
    (lambda x, y: (x[0] + y[0], x[1] + y[1])))

createCombiner: creates the (sum, count) tuple for the first value of a key
mergeValue: called for each new value, accumulates (sum, count)
mergeCombiners: the reduce part, merges 2 combiners

Result, e.g.:
(u'#Supermoon-Confident', (0.0, 3))
(u'#HajjStampede-Tentative', (0.0, 3))
(u'#KiligKapamilya-Conscientiousness', (290.0, 6))
…
(u'#LunarEclipse-Tentative', (92.0, 4))
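The combineByKey logic reduces (Tag-Tone, score) pairs to (sum, count); a single-machine dict-based simulation (with made-up score pairs) makes the three roles concrete:

```python
pairs = [
    ("#KiligKapamilya-Conscientiousness", 48.0),
    ("#KiligKapamilya-Conscientiousness", 52.0),
    ("#LunarEclipse-Tentative", 92.0),
]

combiners = {}
for key, score in pairs:
    if key not in combiners:
        combiners[key] = (score, 1)            # createCombiner: lambda x: (x, 1)
    else:
        s, c = combiners[key]
        combiners[key] = (s + score, c + 1)    # mergeValue: lambda x, y: (x[0] + y, x[1] + 1)
# mergeCombiners (lambda x, y: (x[0] + y[0], x[1] + y[1])) would merge the partial
# (sum, count) pairs produced independently on each partition.

averages = {k: round(s / c, 2) for k, (s, c) in combiners.items()}
print(averages)  # → {'#KiligKapamilya-Conscientiousness': 50.0, '#LunarEclipse-Tentative': 92.0}
```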
Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

#Step 6 : Re-index the map so the key is the Tag and the value is a (Tone, average_score) tuple
#Key=Tag
#Value=(Tone, average_score)
tagsRDD = tagsRDD.map(lambda (key, ab): (key.split("-")[0], (key.split("-")[1], round(ab[0]/ab[1],2))))

Result, e.g.:
(u'#Supermoon', (u'Confident', 0.0))
(u'#HajjStampede', (u'Tentative', 0.0))
(u'#KiligKapamilya', (u'Conscientiousness', 48.33))
…
(u'#LunarEclipse', (u'Tentative', 23.0))
Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

#Step 7: Reduce the map on the Tag key; the value becomes a list of (Tone, average_score) tuples
#makeList (a helper defined elsewhere in the notebook) ensures its argument is a list
tagsRDD = tagsRDD.reduceByKey( lambda x, y : makeList(x) + makeList(y) )

Result, e.g.:
(u'#HajjStampede', [(u'Tentative', 0.0), (u'Agreeableness', 3.67), …, (u'Cheerfulness', 100.0)])
(u'#Supermoon', [(u'Confident', 0.0), (u'Openness', 91.0), …, (u'Agreeableness', 20.33)])
(u'#bloodmoon', [(u'Anger', 0.0), (u'Negative', 0.0), …, (u'Openness', 38.0)])
…
(u'#KiligKapamilya', [(u'Conscientiousness', 48.33), (u'Anger', 0.0), …, (u'Agreeableness', 10.83)])
Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

#Step 8 : Sort the (Tone, average_score) tuples alphabetically by Tone
tagsRDD = tagsRDD.mapValues( lambda x : sorted(x) )

Result, e.g.:
(u'#HajjStampede', [(u'Agreeableness', 3.67), (u'Cheerfulness', 100.0), …, (u'Tentative', 0.0)])
(u'#Supermoon', [(u'Agreeableness', 20.33), (u'Confident', 0.0), …, (u'Openness', 91.0)])
(u'#bloodmoon', [(u'Anger', 0.0), (u'Negative', 0.0), …, (u'Openness', 38.0)])
…
(u'#KiligKapamilya', [(u'Agreeableness', 10.83), (u'Anger', 0.0), (u'Conscientiousness', 48.33), …])
Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

#Step 9 : Format the data as expected by the plotting code in the next cell.
#Map the values to a tuple as follows: ([list of tones], [list of average scores])
tagsRDD = tagsRDD.mapValues( lambda x : ([elt[0] for elt in x],[elt[1] for elt in x]) )

Each value is now a tuple of 2 arrays (tones, scores), e.g.:
(u'#HajjStampede', ([u'Agreeableness', u'Cheerfulness', …, u'Tentative'], [3.67, 100.0, …, 0.0]))
(u'#Supermoon', ([u'Agreeableness', u'Confident', …, u'Openness'], [20.33, 0.0, …, 91.0]))
(u'#bloodmoon', ([u'Anger', u'Negative', …, u'Openness'], [0.0, 0.0, …, 38.0]))
…
(u'#KiligKapamilya', ([u'Agreeableness', u'Anger', u'Conscientiousness', …], [10.83, 0.0, 48.33, …]))
Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

#Step 10 : Use a custom sort function to sort the entries by order of appearance in top10tags
def customCompare( key ):
    for (k,v) in top10tags:
        if k == key:
            return v
    return 0
tagsRDD = tagsRDD.sortByKey(ascending=False, numPartitions=None, keyfunc = customCompare)

The entries are now ordered by hashtag popularity, e.g.:
(u'#SuperBloodMoon', ([u'Agreeableness', u'Cheerfulness', …, u'Tentative'], [33.97, 19.38, …, 12.85]))
(u'#BBWLA', ([u'Agreeableness', u'Confident', …, u'Openness'], [38.33, 12.34, …, 21.43]))
(u'#ALDUBThisMustBeLove', ([u'Anger', u'Negative', …, u'Openness'], [0.0, 0.0, …, 62.0]))
…
(u'#Newmusic', ([u'Agreeableness', u'Anger', u'Conscientiousness', …], [0.0, 0.0, 68.33, …]))
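The keyfunc maps each tag to its tweet count in top10tags, so sorting with ascending=False orders tags by popularity. The same idea in plain Python (with made-up counts):

```python
# Hypothetical (tag, count) pairs standing in for the notebook's top10tags
top10tags = [("#Superbloodmoon", 120), ("#BBWLA", 85), ("#Newmusic", 10)]

def custom_compare(key):
    # Map a tag to its tweet count; unknown tags sort last with 0
    for k, v in top10tags:
        if k == key:
            return v
    return 0

tags = ["#Newmusic", "#Superbloodmoon", "#BBWLA"]
print(sorted(tags, key=custom_compare, reverse=True))
# → ['#Superbloodmoon', '#BBWLA', '#Newmusic']
```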
Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags
(Resulting visualization shown on the slides.)
Real-Time Web app Dashboard
‣ Pie chart showing the top hashtags distribution
‣ Bar chart showing the distribution of tone scores for each of the top hashtags
Create a Receiver that subscribes to Kafka topics
• Get a batch of new records
• Store the new records into the DStream
• Note: Message Hub on Bluemix requires Kafka 0.9
Create Kafka DStream
An implicit conversion is used to synthetically add a method to StreamingContext
Enrich Tweets with Watson Scores
Get Tone scores
Map to new EnrichedTweet Object
Streaming analytics
• Prepare for map/reduce: map each tag-tone pair to its corresponding score
• Reduce: compute the count and average for each score
• Map each tag to its count plus the list of score averages
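The map/reduce shape of this computation can be shown without Spark; in the real job the same steps run as map and reduceByKey over the DStream. The tags and scores below are made up:

```python
from collections import defaultdict

# map step: one ((tag, tone), score) pair per tweet/tone combination
pairs = [
    (("#spark", "Cheerfulness"), 80.0),
    (("#spark", "Cheerfulness"), 60.0),
    (("#spark", "Anger"), 0.0),
]

# reduce step: accumulate (count, sum) per key, then derive the average
acc = defaultdict(lambda: (0, 0.0))
for key, score in pairs:
    count, total = acc[key]
    acc[key] = (count + 1, total + score)

averages = {key: total / count for key, (count, total) in acc.items()}

# final map: tag -> list of (tone, average) entries
by_tag = defaultdict(list)
for (tag, tone), avg in averages.items():
    by_tag[tag].append((tone, avg))
```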
Maintain state between micro-batch RDDs
• Maintain state between micro-batches by recomputing the count and the list of averages
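In Spark Streaming this is typically done with updateStateByKey, whose update function receives a key's new values plus its previous state and returns the recomputed state. A plain-Python sketch of such an update function — the (count, sums) state layout here is an assumption for illustration, not the sample's exact types:

```python
def update_state(new_values, state):
    """Merge a micro-batch's (count, sums) values into the running state.

    Mirrors the function passed to updateStateByKey: it gets the key's
    new values plus the previous state, and returns the new state.
    """
    count, sums = state if state is not None else (0, [0.0, 0.0])
    for batch_count, batch_sums in new_values:
        count += batch_count
        sums = [s + b for s, b in zip(sums, batch_sums)]
    return count, sums

state = None
state = update_state([(2, [10.0, 140.0])], state)   # first micro-batch
state = update_state([(1, [5.0, 60.0])], state)     # second micro-batch
count, sums = state
averages = [s / count for s in sums]                # recomputed list of averages
```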
Produce streaming analytics topic data
• The Kafka producer can't be called from the streaming analytics code because it is not serializable
• Instead, post each message to a queue and process the message queue from a separate thread
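The workaround described above — enqueue from the analytics code, publish from a dedicated thread — looks roughly like this in Python; kafka_send and the topic payloads are stand-ins for the real (non-serializable) Kafka producer and the sample's messages:

```python
import threading
import queue

sent = []  # records what the producer would publish, for illustration

def kafka_send(topic, message):
    """Stand-in for the Kafka producer's send(); only ever runs on one thread."""
    sent.append((topic, message))

messages = queue.Queue()

def drain():
    # runs on a dedicated thread, so the producer object never has to
    # cross into the serialized streaming closures
    while True:
        item = messages.get()
        if item is None:          # sentinel: shut down the drain loop
            break
        kafka_send(*item)

t = threading.Thread(target=drain)
t.start()

# the streaming analytics code only enqueues plain data
messages.put(("topHashTags", "#spark,42"))
messages.put(("topHashTags.toneScores", "#spark,Cheerfulness,70.0"))
messages.put(None)
t.join()
```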
Real-time web app dashboard
‣ Technologies used:
- Mozaik (https://github.com/plouc/mozaik)
- ReactJS
- WebSocket
- D3JS/C3JS
‣ Consumes the topics generated by the Spark Streaming analytics:
• topHashTags
• topHashTags.toneScores
Access the Message Hub API through the message-hub-rest Node.js module
React components for the Mozaik framework
Demo!
Thank You
Spark Core Libraries
• Spark Core: general compute engine; handles distributed task dispatching, scheduling and basic I/O functions
• Spark SQL: executes SQL statements
• Spark Streaming: performs streaming analytics using micro-batches
• MLlib: common machine learning and statistical algorithms
• GraphX: distributed graph processing framework
Key reasons for interest in Spark
• Open Source: largest project and one of the most active on Apache; a vibrant, growing community of developers continuously improves the code base and extends capabilities; fast adoption in the enterprise (IBM, Databricks, etc.)
• Fast distributed data processing: in-memory storage greatly reduces disk I/O; up to 100x faster in memory, 10x faster on disk
• Web Scale: fault tolerant — seamlessly recomputes data lost to hardware failure; scalable — easily increase the number of worker nodes; flexible job execution (batch, streaming, interactive); easily handles petabytes of data without special code handling; compatible with the existing Hadoop ecosystem
• Productive: unified programming model across a range of use cases; rich and expressive APIs hide the complexities of parallel computing and worker node management; support for Java, Scala, Python and R means less code written; includes a set of core libraries that enable various analytic methods: Spark SQL, MLlib, GraphX
High level architecture
• Batch job (spark-submit): the Spark application (driver) talks to the master (cluster manager), which schedules work across the worker nodes of the Spark cluster
• Interactive notebook: the browser connects to a notebook server over HTTP/WebSockets; the server talks to a kernel via a kernel protocol (e.g. ZeroMQ), and the kernel drives the master and worker nodes
• The driver handles RDD partitioning, task packaging and dispatching, and worker node scheduling
Spark programming model lifecycle
Load data into RDDs
• sc (SparkContext) is the entry point, created by the application or provided automatically by the notebook shell; sqlCtxt (SQLContext) is the entry point for working with DataFrames and executing SQL queries
• In-memory collection: sc.parallelize
• Unstructured data: text (sc.textFile), HDFS (sc.hadoopFile)
• Structured data: JSON (sqlCtxt.jsonFile), Parquet (sqlCtxt.parquetFile), JDBC (sqlCtxt.load), custom data sources (1.4+)
• Streaming data: TwitterUtils.createStream, KafkaUtils.createStream, FlumeUtils.createStream, MQTTUtils.createStream, custom DStream
Apply transformations to create new RDDs
• map(fn): apply fn to all elements in the RDD
• flatMap(fn): same as map, but fn can return 0 or more elements
• filter(fn): select only the elements for which fn returns true
• reduceByKey, sortByKey
• sample: sample a fraction of the data
• union: combine the elements of 2 RDDs
• intersection: intersect 2 RDDs
• distinct: remove duplicate elements
• …
Apply actions (analytics) to produce results
• reduce(fn): perform a summary operation on the elements
• collect(): return all elements in an array
• count(): count the number of elements in the RDD
• take(n): return the first n elements in an array
• foreach(fn): execute fn on all the elements in the RDD
• saveAsTextFile: persist the elements to a text file
• …
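The load → transform → action lifecycle above can be mimicked with plain Python collections to see its shape; the comments note the corresponding PySpark calls (lists stand in for RDDs here, so no Spark installation is assumed):

```python
# "load": in PySpark this would be sc.parallelize(range(1, 11))
data = list(range(1, 11))

# "transformations": rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
squares = [x * x for x in data]
evens = [x for x in squares if x % 2 == 0]

# "actions": .reduce(lambda a, b: a + b) and .take(3)
total = sum(evens)
first3 = evens[:3]
```

Transformations are lazy in Spark; nothing runs until an action such as reduce or take is applied.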
  • 12. ©2015 IBM Corporation Ecosystem of the IBM Analytics for Apache Spark as a Service
  • 13. ©2015 IBM Corporation Agenda • Introduction • Quick Introduction to Spark • Set up development environment and create the hello world application • Notebook Walk-through • Spark Streaming • Deep dive: Sentiment analysis with Twitter and Watson Tone Analyzer • Architectural Overview • Set up the Bluemix services: Watson Tone Analyzer, Message Hub and Event Hub • Create the Streaming Receiver to connect to Kafka (Scala) • Create analytics using Jupyter Notebook (Python) • Create Real-time Web Dashboard (Nodejs)
  • 14. ©2015 IBM Corporation Setup local development Environment • Pre-requisites - Scala runtime 2.10.4 http://www.scala-lang.org/download/2.10.4.html - Homebrew http://brew.sh/ - Scala sbt http://www.scala-sbt.org/download.html - Spark 1.3.1 http://www.apache.org/dyn/closer.lua/spark/spark-1.3.1/spark-1.3.1.tgz • Detailed instructions here: https://developer.ibm.com/clouddataservices/start-developing-with-spark-and-notebooks/
  • 15. ©2015 IBM Corporation Setup local development Environment contd.. • Create a scala project using sbt • Create directories to start from scratch: mkdir helloSpark && cd helloSpark mkdir -p src/main/scala mkdir -p src/main/java mkdir -p src/main/resources • Create a subdirectory under the src/main/scala directory: mkdir -p com/ibm/cds/spark/sample • Github URL for the same project: https://github.com/ibm-cds-labs/spark.samples
  • 16. ©2015 IBM Corporation Setup local development Environment contd.. • Create HelloSpark.scala using an IDE or a text editor • Copy and paste this code snippet:

package com.ibm.cds.spark.samples

import org.apache.spark._

object HelloSpark {
    //main method invoked when running as a standalone Spark Application
    def main(args: Array[String]) {
        val conf = new SparkConf().setAppName("Hello Spark")
        val spark = new SparkContext(conf)

        println("Hello Spark Demo. Compute the mean and variance of a collection")
        val stats = computeStatsForCollection(spark);
        println(">>> Results: ")
        println(">>>>>>>Mean: " + stats._1 );
        println(">>>>>>>Variance: " + stats._2);
        spark.stop()
    }

    //Library method that can be invoked from a Jupyter Notebook
    def computeStatsForCollection( spark: SparkContext, countPerPartitions: Int = 100000, partitions: Int = 5): (Double, Double) = {
        val totalNumber = math.min( countPerPartitions * partitions,
  • 17. ©2015 IBM Corporation Setup local development Environment contd.. • Create a file build.sbt under the project root directory:

name := "helloSpark"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= {
    val sparkVersion = "1.3.1"
    Seq(
        "org.apache.spark" %% "spark-core" % sparkVersion,
        "org.apache.spark" %% "spark-sql" % sparkVersion,
        "org.apache.spark" %% "spark-repl" % sparkVersion
    )
}

• Under the project root directory run: Download all dependencies: $ sbt update Compile: $ sbt compile Package an application jar file: $ sbt package
  • 18. ©2015 IBM Corporation Hello World application on Bluemix Apache Starter
  • 19. ©2015 IBM Corporation Agenda • Introduction • Quick Introduction to Spark • Set up development environment and create the hello world application • Notebook Walk-through • Spark Streaming • Deep dive: Sentiment analysis with Twitter and Watson Tone Analyzer • Architectural Overview • Set up the Bluemix services: Watson Tone Analyzer, Message Hub and Event Hub • Create the Streaming Receiver to connect to Kafka (Scala) • Create analytics using Jupyter Notebook (Python) • Create Real-time Web Dashboard (Nodejs)
  • 20. ©2015 IBM Corporation Introduction to Notebooks ‣ Notebooks allow the creation of interactive, executable documents that combine rich text (Markdown), executable code (Scala, Python or R) and graphics (matplotlib) ‣ Apache Spark provides APIs in multiple languages that can be executed from a REPL shell: Scala, Python (PySpark), R ‣ Multiple open-source implementations are available: - Jupyter: https://jupyter.org - Apache Zeppelin: http://zeppelin-project.org
  • 21. ©2015 IBM Corporation Notebook walkthrough ‣ Sign up on Bluemix https://console.ng.bluemix.net/registration/ ‣ Getting started with Analytics for Apache Spark: https://www.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index.html ‣ You can also follow the tutorial here: https://developer.ibm.com/clouddataservices/start-developing-with-spark-and-notebooks/
  • 23. ©2015 IBM Corporation Agenda • Introduction • Quick Introduction to Spark • Set up development environment and create the hello world application • Notebook Walk-through • Spark Streaming • Deep dive: Sentiment analysis with Twitter and Watson Tone Analyzer • Architectural Overview • Set up the Bluemix services: Watson Tone Analyzer, Message Hub and Event Hub • Create the Streaming Receiver to connect to Kafka (Scala) • Create analytics using Jupyter Notebook (Python) • Create Real-time Web Dashboard (Nodejs)
  • 24. ©2015 IBM Corporation Spark Streaming ‣ “Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams” (http://spark.apache.org/docs/latest/streaming-programming-guide.html) ‣ Breaks the streaming data down into smaller micro-batches, which are then sent to the Spark engine
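The micro-batch idea behind Spark Streaming can be sketched in a few lines of plain Python: a continuous stream is cut into small fixed-size batches, and each batch is then processed as an ordinary collection (an RDD in real Spark). The `micro_batches` helper below is a hypothetical illustration, not a Spark API.

```python
# Sketch of micro-batching: cut a stream into small batches so each
# batch can be handed to a batch engine as a plain collection.
def micro_batches(stream, batch_size):
    """Yield successive fixed-size batches from an iterable stream."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly partial, batch
        yield batch

batches = list(micro_batches(range(7), batch_size=3))
```

In real Spark Streaming the batch boundary is a time interval (the batch duration passed to `StreamingContext`) rather than a record count, but the processing model is the same: each batch becomes one RDD.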
  • 25. ©2015 IBM Corporation Spark Streaming ‣ Provides connectors for multiple data sources: - Kafka - Flume - Twitter - MQTT - ZeroMQ ‣ Provides an API to create custom connectors. Lots of examples are available on GitHub and spark-packages.org
  • 26. ©2015 IBM Corporation Agenda • Introduction • Quick Introduction to Spark • Set up development environment and create the hello world application • Notebook Walk-through • Spark Streaming • Deep dive: Sentiment analysis with Twitter and Watson Tone Analyzer • Architectural Overview • Set up the Bluemix services: Watson Tone Analyzer, Message Hub and Event Hub • Create the Streaming Receiver to connect to Kafka (Scala) • Create analytics using Jupyter Notebook (Python) • Create Real-time Web Dashboard (Nodejs)
  • 27. ©2015 IBM Corporation Spark + Twitter + Watson application ‣ Use Spark Streaming in combination with IBM Watson to perform sentiment analysis and track how a conversation is trending on Twitter. ‣ Use Spark Streaming to create a feed that captures live tweets from Twitter. You can optionally filter the tweets to those that contain the hashtag(s) of your choice. ‣ The tweet data is then enriched in real time with various sentiment scores provided by the Watson Tone Analyzer service (available on Bluemix). This service provides insight into sentiment, or how the author feels. ‣ The data is then loaded and analyzed by the data scientist within a notebook. ‣ We can also use the streaming analytics to feed a real-time web app dashboard
  • 28. ©2015 IBM Corporation About this sample application • Github: https://github.com/ibm-cds-labs/spark.samples/tree/master/streaming-twitter • Tutorial: https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags • A word about Scala • Scala is object-oriented but also supports a functional programming style • Bi-directional interoperability with Java • Resources: • Official web site: http://scala-lang.org • Excellent first-steps site: http://www.artima.com/scalazine/articles/steps.html • Free e-books: http://readwrite.com/2011/04/30/5-free-b-books-and-tutorials-o
  • 29. ©2015 IBM Corporation Agenda • Introduction • Quick Introduction to Spark • Set up development environment and create the hello world application • Notebook Walk-through • Spark Streaming • Deep dive: Sentiment analysis with Twitter and Watson Tone Analyzer • Architectural Overview • Set up the Bluemix services: Watson Tone Analyzer, Message Hub and Event Hub • Create the Streaming Receiver to connect to Kafka (Scala) • Create analytics using Jupyter Notebook (Python) • Create Real-time Web Dashboard (Nodejs)
  • 30. ©2015 IBM Corporation Spark Streaming with “IBM Insight for Twitter” and “Watson Tone Analyzer” Watson Tone Analyzer Service Bluemix Producer Stream Enrich data with Emotion Tone Scores Processed data Scala Notebook IPython Notebook Consumer Stream Message Hub Service Bluemix Full Archive Search API Consumer Spark Topics Publish topics from Spark analytics results Event Hub Service Bluemix Real-Time Dashboard Data Engineer Business Analyst C(Suite) Data Scientist
  • 31. ©2015 IBM Corporation Building a Spark Streaming application Sentiment analysis with Twitter and Watson Tone Analyzer ‣Configure Twitter and Watson Tone Analyzer 1. Configure OAuth credentials for Twitter 2. Create a Watson Tone Analyzer Service on Bluemix 3. Configure MessageHub Service on Bluemix (Kafka) 4. Configure EventHub Service on Bluemix
  • 32. ©2015 IBM Corporation Configure OAuth credentials for Twitter ‣ You can follow along with the steps in https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags/#twitter
  • 33. ©2015 IBM Corporation Create a Watson Tone Analyzer Service on Bluemix ‣ You can follow along with the steps in https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags/#bluemix
  • 34. ©2015 IBM Corporation Building a Spark Streaming application Sentiment analysis with Twitter and Watson Tone Analyzer ‣Work with Twitter data 1. Create a Twitter Stream 2. Enrich the data with sentiment analysis from Watson Tone Analyzer 3. Aggregate data into RDD with enriched Data model 4. Create SparkSQL DataFrame and register Table
  • 35. ©2015 IBM Corporation Create a Twitter Stream Create a map that stores the credentials for the Twitter and Watson services:

//Hold configuration key/value pairs
val config = Map[String, String](
    ("twitter4j.oauth.consumerKey", Option(System.getProperty("twitter4j.oauth.consumerKey")).orNull ),
    ("twitter4j.oauth.consumerSecret", Option(System.getProperty("twitter4j.oauth.consumerSecret")).orNull ),
    ("twitter4j.oauth.accessToken", Option(System.getProperty("twitter4j.oauth.accessToken")).orNull ),
    ("twitter4j.oauth.accessTokenSecret", Option(System.getProperty("twitter4j.oauth.accessTokenSecret")).orNull ),
    ("tweets.key", Option(System.getProperty("tweets.key")).getOrElse("")),
    ("watson.tone.url", Option(System.getProperty("watson.tone.url")).orNull ),
    ("watson.tone.username", Option(System.getProperty("watson.tone.username")).orNull ),
    ("watson.tone.password", Option(System.getProperty("watson.tone.password")).orNull )
)

Twitter4J requires the credentials to be stored in System properties:

config.foreach( (t:(String,String)) => if ( t._1.startsWith( "twitter4j") ) System.setProperty( t._1, t._2 ) )
  • 36. ©2015 IBM Corporation Create a Twitter Stream

//Filter the tweets to only keep the ones with English as the language
//twitterStream is a discretized stream of twitter4j Status objects
var twitterStream = org.apache.spark.streaming.twitter.TwitterUtils.createStream( ssc, None )
    .filter { status =>
        Option(status.getUser).flatMap[String] { u => Option(u.getLang) }.getOrElse("").startsWith("en") && //Allow only tweets that use "en" as the language
        CharMatcher.ASCII.matchesAllOf(status.getText) && //Only pick text that is ASCII
        ( keys.isEmpty || keys.exists{ status.getText.contains(_) } ) //If the user specified #hashtags to monitor
    }

Initial DStream of Status Objects
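The filter predicate on this slide combines three checks: the user's language starts with "en", the text is pure ASCII, and (when keywords were supplied) at least one keyword appears in the text. The same logic can be sketched in plain Python; here `tweet` is an invented dict standing in for the twitter4j `Status` object, and the field names are assumptions for illustration.

```python
# Plain-Python version of the tweet filter: language, ASCII and keyword checks.
def keep_tweet(tweet, keys=()):
    lang = tweet.get("user_lang") or ""
    text = tweet.get("text", "")
    return (lang.startswith("en")                      # only English-language users
            and all(ord(c) < 128 for c in text)        # ASCII-only text
            and (not keys or any(k in text for k in keys)))  # optional hashtag match

kept = keep_tweet({"user_lang": "en", "text": "Watching #SuperBloodMoon"},
                  keys=("#SuperBloodMoon",))
dropped = keep_tweet({"user_lang": "fr", "text": "bonjour"})
```

Note the defensive `or ""`: like the `Option(...).getOrElse("")` chain in the Scala code, it keeps the predicate from failing when the user or language field is missing.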
  • 37. ©2015 IBM Corporation Enrich the data with sentiment analysis from Watson Tone Analyzer //Broadcast the config to each worker node val broadcastVar = sc.broadcast(config) Initial DStream of Status Objects
  • 38. ©2015 IBM Corporation Enrich the data with sentiment analysis from Watson Tone Analyzer Initial DStream of Status Objects Data Model |-- author: string (nullable = true) |-- date: string (nullable = true) |-- lang: string (nullable = true) |-- text: string (nullable = true) |-- lat: integer (nullable = true) |-- long: integer (nullable = true) |-- Cheerfulness: double (nullable = true) |-- Negative: double (nullable = true) |-- Anger: double (nullable = true) |-- Analytical: double (nullable = true) |-- Confident: double (nullable = true) |-- Tentative: double (nullable = true) |-- Openness: double (nullable = true) |-- Agreeableness: double (nullable = true) |-- Conscientiousness: double (nullable = true) DStream of key, value pairs
  • 39. ©2015 IBM Corporation Aggregate data into RDD with enriched Data model

//Aggregate the data from each DStream into the working RDD
rowTweets.foreachRDD( rdd => {
    if ( rdd.count() > 0 ){
        workingRDD = sc.parallelize( rdd.map( t => t._1 ).collect()).union( workingRDD )
    }
})

Initial DStream RowTweets Initial DStream RowTweets Initial DStream RowTweets … Micro-batches Row 1 Row 2 Row 3 Row 4 … … Row n workingRDD Data Model |-- author: string (nullable = true) |-- date: string (nullable = true) |-- lang: string (nullable = true) |-- text: string (nullable = true) |-- lat: integer (nullable = true) |-- long: integer (nullable = true) |-- Cheerfulness: double (nullable = true) |-- Negative: double (nullable = true) |-- Anger: double (nullable = true) |-- Analytical: double (nullable = true) |-- Confident: double (nullable = true) |-- Tentative: double (nullable = true) |-- Openness: double (nullable = true) |-- Agreeableness: double (nullable = true) |-- Conscientiousness: double (nullable = true)
  • 40. ©2015 IBM Corporation Create SparkSQL DataFrame and register Table

//Create a SparkSQL DataFrame from the aggregate workingRDD
val df = sqlContext.createDataFrame( workingRDD, schemaTweets )

//Register a temporary table using the name "tweets"
df.registerTempTable("tweets")

println("A new table named tweets with " + df.count() + " records has been correctly created and can be accessed through the SQLContext variable")
println("Here's the schema for tweets")
df.printSchema()

(sqlContext, df)

Row 1 Row 2 Row 3 Row 4 … … Row n workingRDD author date lang … Cheerfulness Negative … Conscientiousness John Smith 10/11/2015 – 20:18 en 0.0 65.8 … 25.5 Alfred … en 34.5 0.0 … 100.0 … … … … … … … … Chris … en 85.3 22.9 … 0.0 Relational SparkSQL Table
  • 41. ©2015 IBM Corporation Building a Spark Streaming application: Sentiment analysis with Twitter and Watson Tone Analyzer ‣IPython Notebook analysis 1. Load the data into an IPython Notebook 2. Analytic 1: Compute the distribution of tweets by sentiment scores greater than 60% 3. Analytic 2: Compute the top 10 hashtags contained in the tweets 4. Analytic 3: Visualize aggregated sentiment scores for the top 5 hashtags
  • 42. ©2015 IBM Corporation Load the data into an IPython Notebook ‣ You can follow along with the steps here: https://github.com/ibm-cds-labs/spark.samples/blob/master/streaming-twitter/notebook/Twitter%20%2B%20Watson%20Tone%20Analyzer%20Part%202.ipynb Create a SQLContext from a SparkContext Load from a parquet file and create a DataFrame Create a SQL table and start executing SQL queries
  • 43. ©2015 IBM Corporation Analytic 1 - Compute the distribution of tweets by sentiment scores greater than 60%

#create an array that will hold the count for each sentiment
sentimentDistribution = [0] * 9

#For each sentiment, run a SQL query that counts the number of tweets for which
#the sentiment score is greater than 60%, and store the result in the array
for i, sentiment in enumerate(tweets.columns[-9:]):
    sentimentDistribution[i] = sqlContext.sql(
        "SELECT count(*) as sentCount FROM tweets where " + sentiment + " > 60"
    ).collect()[0].sentCount
  • 44. ©2015 IBM Corporation Analytic 1 - Compute the distribution of tweets by sentiment scores greater than 60% Use matplotlib to create a bar chart
  • 45. ©2015 IBM Corporation Analytic 1 - Compute the distribution of tweets by sentiment scores greater than 60% Bar Chart Visualization
  • 46. ©2015 IBM Corporation Analytic 2: Compute the top 10 hashtags contained in the tweets Initial Tweets RDD Filter hashtags Key, value pair RDD Reduced map with counts Sorted Map by key flatMap filter map reduceByKey sortByKey
  • 47. ©2015 IBM Corporation Analytic 2: Compute the top 10 hashtags contained in the tweets
  • 48. ©2015 IBM Corporation Analytic 2: Compute the top 10 hashtags contained in the tweets
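The top-hashtags pipeline sketched on the slides above (flatMap → filter → map → reduceByKey → sortByKey) can be reproduced without Spark using plain Python; the tweets below are invented sample data, and `collections.Counter` stands in for the reduceByKey/sort steps.

```python
# Spark-free sketch of Analytic 2: split tweets into words (flatMap),
# keep hashtags (filter), count them (map + reduceByKey), sort by count.
from collections import Counter

tweets = [
    "Watching #SuperBloodMoon tonight",
    "#SuperBloodMoon was amazing #LunarEclipse",
    "no tags here",
]

words = [w for t in tweets for w in t.split()]      # flatMap: tweets -> words
hashtags = [w for w in words if w.startswith("#")]  # filter: keep hashtags
counts = Counter(hashtags)                          # map to (tag, 1) + reduceByKey
top10 = counts.most_common(10)                      # sort by count, take 10
```

In the PySpark version each arrow is a separate lazy RDD transformation, so the whole pipeline executes only when the final take/collect action runs.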
  • 49. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags ‣ Problem: - Compute the mean of all the emotion scores for all the top 10 hashtags - Format the data in a way that can be consumed by the plot script
  • 50. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags #Step 1: Create RDD from tweets dataframe tagsRDD = tweets.map(lambda t: t ) author … Cheerfulness Jake … 0.0 Scrad … 23.5 Nittya Indika … 84.0 … … … … … … Madison … 93.0 tweets (Type: DataFrame) Row(author=u'Jake', …, text=u’@sarahwag…’, Cheerfulness=0.0, …) Row(author=u’Scrad', …, text=u’ #SuperBloodMoon https://t…’, Cheerfulness=23.5, …) Row(author=u’ Nittya Indika', …, text=u’ Good mornin! http://t.…’, Cheerfulness=84.0, …) … … Row(author=u’ Madison', …, text=u’ how many nights…’, Cheerfulness=93.0, …) tagsRDD (Type: RDD)
  • 51. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags #Step 2: Filter to only keep the entries that are in top10tags tagsRDD = tagsRDD.filter( lambda t: any(s in t.text for s in [i[0] for i in top10tags] ) ) Row(author=u'Jake', …, text=u’@sarahwag…’, Cheerfulness=0.0, …) Row(author=u’Scrad', …, text=u’ #SuperBloodMoon https://t…’, Cheerfulness=23.5, …) Row(author=u’ Nittya Indika', …, text=u’ Good mornin! http://t.…’, Cheerfulness=84.0, …) … … Row(author=u’ Madison', …, text=u’ how many nights…’, Cheerfulness=93.0, …) Row(author=u'Mike McGuire', text=u'Explains my disappointment #SuperBloodMoon https://t.co/Gfg7vWws5W', …, Conscientiousness=0.0) Row(author=u'Meng_tisoy', text=u’…hihi #ALDUBThisMustBeLove https://t….’, …,Conscientiousness=68.0) Row(author=u'Kevin Contreras', text=u’…SILA! #ALDUBThisMustBeLove', …Conscientiousness=68.0) … … Row(author=u'abbi', text=u’…excited #ALDUBThisMustBeLove https://t…’,…, Conscientiousness=100.0)
  • 52. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags #Step 3: Create a flatMap using the expand function defined above, this will be used to collect all the scores #for a particular tag with the following format: Tag-Tone-ToneScore cols = tweets.columns[-9:] def expand( t ): ret = [ ] for s in [i[0] for i in top10tags]: if ( s in t.text ): for tone in cols: ret += [s + u"-" + unicode(tone) + ":" + unicode(getattr(t, tone))] return ret tagsRDD = tagsRDD.flatMap( expand ) Row(author=u'Mike McGuire', text=u'Explains my disappointment #SuperBloodMoon https://t.co/Gfg7vWws5W', …, Conscientiousness=0.0) Row(author=u'Meng_tisoy', text=u’…hihi #ALDUBThisMustBeLove https://t….’, …,Conscientiousness=68.0) Row(author=u'Kevin Contreras', text=u’…SILA! #ALDUBThisMustBeLove', …Conscientiousness=68.0) … Row(author=u'abbi', text=u’…excited #ALDUBThisMustBeLove https://t…’,…, Conscientiousness=100.0) u'#SuperBloodMoon-Cheerfulness:0.0' u'#SuperBloodMoon-Negative:100.0’ u'#SuperBloodMoon-Negative:23.5' … u'#ALDUBThisMustBeLove-Analytical:85.0’ FlatMap of encoded values
  • 53. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags #Step 4: Create a map indexed by Tag-Tone keys tagsRDD = tagsRDD.map( lambda fullTag : (fullTag.split(":")[0], float( fullTag.split(":")[1]) )) u'#SuperBloodMoon-Cheerfulness:0.0' u'#SuperBloodMoon-Negative:100.0’ u'#SuperBloodMoon-Negative:23.5' … u'#ALDUBThisMustBeLove-Analytical:85.0’ u'#SuperBloodMoon-Cheerfulness' 0.0 u'#SuperBloodMoon-Negative’ 100.0 u'#SuperBloodMoon-Negative' 23.5 … u'#ALDUBThisMustBeLove-Analytical’ 85.0 map
  • 54. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags #Step 5: Call combineByKey to format the data as follow #Key=Tag-Tone, Value=(count, sum_of_all_score_for_this_tone) tagsRDD = tagsRDD.combineByKey((lambda x: (x,1)), (lambda x, y: (x[0] + y, x[1] + 1)), (lambda x, y: (x[0] + y[0], x[1] + y[1]))) u'#SuperBloodMoon- Cheerfulness' 0.0 u'#SuperBloodMoon-Negative’ 100.0 u'#SuperBloodMoon-Negative' 23.5 … u'#ALDUBThisMustBeLove’ 85.0 u'#Supermoon-Confident’ (0.0, 3) u'#HajjStampede-Tentative’ (0.0, 3) u'#KiligKapamilya- Conscientiousness’ (290.0, 6) … u'#LunarEclipse-Tentative’ (92.0, 4) CreateCombiner: Create list of tuples (sum,count) mergeValue: called for each new value (sum, count) MergeCombiner: reduce part, merge 2 combiners
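The `combineByKey` call in Step 5 accumulates, for each Tag-Tone key, the running sum of scores and the count of occurrences. A single-machine equivalent can be sketched in plain Python; the `combine_by_key` helper and the sample pairs below are illustrations, not the deck's actual data.

```python
# Plain-Python equivalent of Step 5: for each Tag-Tone key, accumulate
# (sum_of_scores, count) so the average can be derived in a later step.
def combine_by_key(pairs):
    acc = {}
    for key, score in pairs:
        if key not in acc:
            acc[key] = (score, 1)          # createCombiner: first value seen
        else:
            s, c = acc[key]
            acc[key] = (s + score, c + 1)  # mergeValue: fold in a new value
    return acc

pairs = [("#SuperBloodMoon-Negative", 100.0),
         ("#SuperBloodMoon-Negative", 23.5),
         ("#SuperBloodMoon-Cheerfulness", 0.0)]
combined = combine_by_key(pairs)
```

In distributed Spark a third function, mergeCombiners, additionally merges two partial `(sum, count)` accumulators produced on different partitions, which is why `combineByKey` takes three lambdas on the slide.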
  • 55. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags #Step 6 : Re-index the map so the key is the Tag and the value is a (Tone, average_score) tuple #Key=Tag #Value=(Tone, average_score) tagsRDD = tagsRDD.map(lambda (key, ab): (key.split("-")[0], (key.split("-")[1], round(ab[0]/ab[1],2)))) u'#Supermoon-Confident’ (0.0, 3) u'#HajjStampede-Tentative’ (0.0, 3) u'#KiligKapamilya-Conscientiousness’ (290.0, 6) … u'#LunarEclipse-Tentative’ (92.0, 4) u'#Supermoon' (u'Confident', 0.0) u'#HajjStampede' (u'Tentative', 0.0) u'#KiligKapamilya' (u'Conscientiousness', 48.33) … u'#LunarEclipse' (u'Tentative', 23.0)
  • 56. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags #Step 7: Reduce the map on the Tag key, value becomes a list of (Tone,average_score) tuples tagsRDD = tagsRDD.reduceByKey( lambda x, y : makeList(x) + makeList(y) ) u'#Supermoon-Confident’ (u'Confident', 0.0) u'#HajjStampede-Tentative’ (u'Tentative', 0.0) u'#KiligKapamilya- Conscientiousness’ (u'Conscientiousness', 48.33) … u'#LunarEclipse-Tentative’ (u'Tentative', 23.0) u'#HajjStampede' [(u'Tentative', 0.0), (u'Agreeableness', 3.67), …, (u'Cheerfulness', 100.0)] u'#Supermoon' [(u'Confident', 0.0), (u'Openness', 91.0), …, (u'Agreeableness', 20.33)] u'#bloodmoon' [(u'Anger', 0.0), (u'Negative', 0.0), …, (u'Openness', 38.0)] … u'#KiligKapamilya ' [(u'Conscientiousness', 48.33), (u'Anger', 0.0),... (u'Agreeableness', 10.83)]
  • 57. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags #Step 8 : Sort the (Tone,average_score) tuples alphabetically by Tone tagsRDD = tagsRDD.mapValues( lambda x : sorted(x) ) u'#HajjStampede' [(u'Tentative', 0.0), (u'Agreeableness', 3.67), …, (u'Cheerfulness', 100.0)] u'#Supermoon' [(u'Confident', 0.0), (u'Openness', 91.0), …, (u'Agreeableness', 20.33)] u'#bloodmoon' [(u'Anger', 0.0), (u'Negative', 0.0), …, (u'Openness', 38.0)] … u'#KiligKapamilya ' [(u'Conscientiousness', 48.33), (u'Anger', 0.0),... (u'Agreeableness', 10.83)] u'#HajjStampede' [(u'Agreeableness', 3.67), (u'Cheerfulness', 100.0), …, (u'Tentative', 0.0)] u'#Supermoon' [(u'Agreeableness', 20.33), (u'Confident', 0.0), ..., (u'Openness', 91.0)] u'#bloodmoon' [(u'Anger', 0.0), (u'Negative', 0.0), …, (u'Openness', 38.0)] … u'#KiligKapamilya' [(u'Agreeableness', 10.83), (u'Anger', 0.0), (u'Conscientiousness', 48.33), ...]
  • 58. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags #Step 9 : Format the data as expected by the plotting code in the next cell. #map the Values to a tuple as follow: ([list of tone], [list of average score]) tagsRDD = tagsRDD.mapValues( lambda x : ([elt[0] for elt in x],[elt[1] for elt in x]) ) u'#HajjStampede' [(u'Agreeableness', 3.67), (u'Cheerfulness', 100.0), …, (u'Tentative', 0.0)] u'#Supermoon' [(u'Agreeableness', 20.33), (u'Confident', 0.0), ..., (u'Openness', 91.0)] u'#bloodmoon' [(u'Anger', 0.0), (u'Negative', 0.0), …, (u'Openness', 38.0)] … u'#KiligKapamilya' [(u'Agreeableness', 10.83), (u'Anger', 0.0), (u'Conscientiousness', 48.33), ...] u'#HajjStampede' ([u'Agreeableness’, u'Cheerfulness’, …, u'Tentative’], [3.67, 100.0, …, 0.0]) u'#Supermoon' ([u'Agreeableness’, u'Confident', ..., u'Openness’], [20.33, 0.0, …, 91.0]) u'#bloodmoon' ([u'Anger’, u'Negative', …, u'Openness’], [0.0, 0.0, …, 38.0]) … u'#KiligKapamilya' ([u'Agreeableness’, u'Anger’, u'Conscientiousness', ...], [10.83, 0.0, 48.33, ...]) Value is a tuple of 2 arrays: tones-scores
  • 59. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags #Step 10 : Use custom sort function to sort the entries by order of appearance in top10tags def customCompare( key ): for (k,v) in top10tags: if k == key: return v return 0 tagsRDD = tagsRDD.sortByKey(ascending=False, numPartitions=None, keyfunc = customCompare) u'#HajjStampede' ([u'Agreeableness’,u'Cheerfulness’,… u'Tentative’], [3.67, 100.0,…0.0]) u'#Supermoon' ([u'Agreeableness’,u'Confident',..., u'Openness’],[20.33, 0.0,… 91.0]) u'#bloodmoon' ([u'Anger’,u'Negative', …, u'Openness’), [0.0, 0.0,…38.0]) … u'#KiligKapamilya' ([u'Agreeableness’,u'Anger’, u'Conscientiousness',...],[10.83, 0.0,48.33,...]) u'#Superbloodmon' ([u'Agreeableness’,u'Cheerfulness’,… u'Tentative’], [33.97, 19.38,…12.85]) u'#BBWLA' ([u'Agreeableness’,u'Confident',..., u'Openness’],[38.33, 12.34,… 21.43]) u'#ALDUBThisMust BeLove' ([u'Anger’,u'Negative', …, u'Openness’), [0.0, 0.0,…62.0]) … u'#Newmusic' ([u'Agreeableness’,u'Anger’, u'Conscientiousness',...],[0.0, 0.0,68.33,...])
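Step 10's `sortByKey` with a custom `keyfunc` orders the tags by their tweet count in `top10tags` rather than alphabetically. The same idea in plain Python is a `sorted` call with a rank lookup; the tag counts below are invented sample data.

```python
# Plain-Python version of the Step 10 sort: order tags by their count in
# top10tags (highest first), mirroring sortByKey with keyfunc=customCompare.
top10tags = [("#SuperBloodMoon", 120), ("#BBWLA", 80), ("#Newmusic", 15)]
rank = dict(top10tags)  # tag -> count, the value customCompare returns

entries = {"#Newmusic": None, "#SuperBloodMoon": None, "#BBWLA": None}
ordered = sorted(entries, key=lambda tag: rank.get(tag, 0), reverse=True)
```

Unknown tags fall back to rank 0 (the `return 0` branch of `customCompare`), so anything outside the top-10 list sorts to the end.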
  • 60. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags
  • 61. ©2015 IBM Corporation Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags
  • 62. ©2015 IBM Corporation Agenda • Introduction • Quick Introduction to Spark • Set up development environment and create the hello world application • Notebook Walk-through • Spark Streaming • Deep dive: Sentiment analysis with Twitter and Watson Tone Analyzer • Architectural Overview • Set up the Bluemix services: Watson Tone Analyzer, Message Hub and Event Hub • Create the Streaming Receiver to connect to Kafka (Scala) • Create analytics using Jupyter Notebook (Python) • Create Real-time Web Dashboard (Nodejs)
  • 63. ©2015 IBM Corporation Spark Streaming with “IBM Insight for Twitter” and “Watson Tone Analyzer” Watson Tone Analyzer Service Bluemix Producer Stream Enrich data with Emotion Tone Scores Processed data Scala Notebook IPython Notebook Consumer Stream Message Hub Service Bluemix Full Archive Search API Consumer Spark Topics Publish topics from Spark analytics results Event Hub Service Bluemix Real-Time Dashboard Data Engineer Business Analyst C(Suite) Data Scientist
  • 64. ©2015 IBM Corporation Real-Time Web app Dashboard ‣ Pie chart showing top Hashtags distribution ‣ Bar chart showing distribution of tone scores for each of top HashTags
  • 65. ©2015 IBM Corporation Create a Receiver that subscribes to Kafka topics Store new record into DStream Get batch of new records MessageHub on Bluemix requires Kafka 0.9
  • 66. ©2015 IBM Corporation Create Kafka DStream Implicit conversion used to synthetically add a method to StreamingContext
  • 67. ©2015 IBM Corporation Enrich Tweets with Watson Scores Get Tone scores Map to new EnrichedTweet Object
  • 68. ©2015 IBM Corporation Streaming analytics Prepare for Map/Reduce Map tag-tone to corresponding score Compute Count + Average for each score Map each tag to count + List of scores averages Reduce
  • 69. ©2015 IBM Corporation Maintain State between micro-batch RDDs Maintain State between micro-batches by recomputing count and List of averages
  • 70. ©2015 IBM Corporation Produce Streaming analytics topic data Can’t call the Kafka Producer from the streaming analytics because it is not serializable Post messages to a queue Process the message queue from a separate Thread
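The pattern on this slide, enqueue from the streaming code and let a separate thread own the non-serializable producer, can be sketched in plain Python with `queue.Queue` and `threading`. This is an illustration of the pattern only: a `sent` list stands in for the real Kafka producer, and the topic strings are invented.

```python
# Sketch: streaming code only enqueues; a dedicated thread drains the
# queue and performs the producer work that can't be serialized to workers.
import queue
import threading

msg_queue = queue.Queue()
sent = []  # stand-in for the Kafka producer

def drain():
    while True:
        msg = msg_queue.get()
        if msg is None:       # sentinel: shut the worker down
            break
        sent.append(msg)      # real code would call producer.send(topic, msg)

worker = threading.Thread(target=drain)
worker.start()

# Called from the streaming analytics: cheap, thread-safe, serializable-free
for m in ["topHashTags:...", "topHashTags.toneScores:..."]:
    msg_queue.put(m)
msg_queue.put(None)
worker.join()
```

Because only the queue handle is touched from the analytics path, the producer object never needs to be shipped to Spark workers, which is exactly the serialization problem the slide describes.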
  • 71. ©2015 IBM Corporation Real-time web app dashboard ‣ Technology used: - Mozaik (https://github.com/plouc/mozaik) - ReactJS - WebSocket - D3JS/C3JS ‣ Consume Topics generated by the Spark Streaming analytics Consumer Spark Topics Real-Time Dashboard Topics: •topHashTags •topHashTags.toneScores
  • 72. ©2015 IBM Corporation Access MessageHub API through message-hub-rest node module
  • 73. ©2015 IBM Corporation React Components for Mozaik framework