Big Data Analytics with Storm, Spark and GraphLab
Dr. Vijay Srinivas Agneeswaran
Director and Head, Big-data R&D
Impetus Technologies Inc.
Contents
Big Data Computations
• Introduction to ML
• Characterization
Berkeley data analytics stack
• Spark
Real-time Analytics with Storm
PMML Scoring for Naïve Bayes
• PMML Primer
• Naïve Bayes Primer
GraphLab
Hadoop 2.0 (Hadoop YARN)
Programming Abstractions
Introduction to Machine Learning
• What is it?
  • Learn patterns in data
  • Improve accuracy by learning
• Examples
  • Speech recognition systems
  • Recommender systems
  • Medical decision aids
  • Robot navigation systems
Introduction to Machine Learning
• Attributes and their values:
  • Outlook: Sunny, Overcast, Rain
  • Humidity: High, Normal
  • Wind: Strong, Weak
  • Temperature: Hot, Mild, Cool
• Target prediction - Play Tennis: Yes, No
Day   Outlook    Temp.  Humidity  Wind    Play Tennis
D1    Sunny      Hot    High      Weak    No
D2    Sunny      Hot    High      Strong  No
D3    Overcast   Hot    High      Weak    Yes
D4    Rain       Mild   High      Weak    Yes
D5    Rain       Cool   Normal    Weak    Yes
D6    Rain       Cool   Normal    Strong  No
D7    Overcast   Cool   Normal    Weak    Yes
D8    Sunny      Mild   High      Weak    No
D9    Sunny      Cool   Normal    Weak    Yes
D10   Rain       Mild   Normal    Strong  Yes
D11   Sunny      Mild   Normal    Strong  Yes
D12   Overcast   Mild   High      Strong  Yes
D13   Overcast   Hot    Normal    Weak    Yes
D14   Rain       Mild   High      Strong  No
Tom Mitchell, Machine Learning, Tata McGraw Hill Publications.
Introduction to Machine Learning: Decision Trees
Outlook
  Sunny: Humidity
    High: No
    Normal: Yes
  Overcast: Yes
  Rain: Wind
    Strong: No
    Weak: Yes
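The tree on this slide can be read as nested conditionals. The following is a minimal sketch, assuming attribute values are passed as the plain strings used in the table; the names `PlayTennisTree`, `Day` and `playTennis` are illustrative and not part of the original deck.

```scala
object PlayTennisTree {
  // One observation; only the attributes the tree actually tests are needed.
  final case class Day(outlook: String, humidity: String, wind: String)

  // Walk the tree from the root (Outlook) down to a Yes/No leaf.
  def playTennis(d: Day): Boolean = d.outlook match {
    case "Sunny"    => d.humidity == "Normal" // Sunny branch tests Humidity
    case "Overcast" => true                   // Overcast is always a Yes leaf
    case "Rain"     => d.wind == "Weak"       // Rain branch tests Wind
    case other      => throw new IllegalArgumentException(s"Unknown outlook: $other")
  }

  def main(args: Array[String]): Unit = {
    println(playTennis(Day("Sunny", "High", "Weak")))    // attributes of day D1
    println(playTennis(Day("Overcast", "High", "Weak"))) // attributes of day D3
  }
}
```

Temperature does not appear because the tree never tests it.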
Decision Trees to Random Forests
Can we have an ensemble of trees? Random forests
• Final prediction is the mean (regression) or the class with max votes (categorization)
• Does not need tree pruning for generalization
• Greater accuracy across domains.
Decision trees
Pros
• Handling of mixed data, robustness to outliers, computational scalability
Cons
• Low prediction accuracy, high variance, size VS goodness of fit
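The way a forest combines its trees can be sketched in a few lines. This is only an illustration of the aggregation step (not of tree training), assuming each tree has already produced its prediction and that the input is non-empty; `ForestVote`, `majorityVote` and `mean` are hypothetical names.

```scala
object ForestVote {
  // Classification: the class predicted by the most trees wins
  // (ties are broken arbitrarily in this sketch).
  def majorityVote(votes: Seq[String]): String =
    votes.groupBy(identity).maxBy { case (_, vs) => vs.size }._1

  // Regression: the forest prediction is the mean of the tree outputs.
  def mean(predictions: Seq[Double]): Double =
    predictions.sum / predictions.size

  def main(args: Array[String]): Unit = {
    println(majorityVote(Seq("Yes", "No", "Yes")))
    println(mean(Seq(1.0, 2.0, 3.0)))
  }
}
```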
K-means Clustering
Support Vector Machines
Introduction to Machine Learning
Machine learning tasks
• Learning associations: market basket analysis
• Supervised learning (classification/regression): random forests, support vector machines (SVMs), logistic regression (LR), Naïve Bayes
• Unsupervised learning (clustering): k-means, sentiment analysis
• Prediction: random forests, SVMs, LR
Data Mining
• Application of machine learning to large data
• Knowledge Discovery in Databases (KDD)
• Credit scoring, fraud detection, market basket analysis, medical diagnosis, manufacturing optimization
Big Data Computations
Computations/Operations
• Giant 1 (simple statistics) [1] is well suited to Hadoop 1.0.
• Giants 2 (linear algebra), 3 (N-body) and 4 (optimization): Spark from UC Berkeley is efficient.
  • Logistic regression, kernel SVMs, conjugate gradient descent, collaborative filtering, Gibbs sampling, alternating least squares.
  • An example is the social group-first approach for consumer churn analysis [2].
• Interactive/on-the-fly data processing: Storm.
• OLAP (data cube operations): Dremel/Drill.
• Data sets that are not embarrassingly parallel?
  • Deep learning with artificial neural networks: machine vision from Google [3], speech analysis from Microsoft.
• Giant 5 (graph processing): GraphLab, Pregel, Giraph.
[1] National Research Council. Frontiers in Massive Data Analysis . Washington, DC: The National Academies Press, 2013.
[2] Richter, Yossi ; Yom-Tov, Elad ; Slonim, Noam: Predicting Customer Churn in Mobile Networks through Analysis of Social Groups.
In: Proceedings of SIAM International Conference on Data Mining, 2010, S. 732-741
[3] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew W.
Senior, Paul A. Tucker, Ke Yang, Andrew Y. Ng: Large Scale Distributed Deep Networks. NIPS 2012: 1232-1240
Iterative ML Algorithms
[CB09] C. Bunch, B. Drawert, M. Norman, Mapscale: a cloud environment for scientific computing, Technical Report, University of California, Computer Science Department, 2009.
What are iterative algorithms?
• Those that need communication among the computing entities
• Examples: neural networks, PageRank algorithms, network traffic analysis
Conjugate gradient descent
• Commonly used to solve systems of linear equations
• [CB09] tried implementing CG on dense matrices, using three primitives:
  • DAXPY: multiplies vector x by constant a and adds vector y.
  • DDOT: dot product of 2 vectors.
  • MatVec: multiplies a matrix by a vector, producing a vector.
Communication overhead
• 1 MapReduce (MR) job per primitive means 6 MR jobs per CG iteration and hundreds of MR jobs per CG computation, leading to tens of GBs of communication even for small matrices.
Other iterative algorithms
• Fast Fourier transform, block tridiagonal
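To make the three primitives concrete, here is a small single-machine sketch of conjugate gradient built only from DAXPY, DDOT and MatVec. It assumes a symmetric positive-definite matrix stored as dense arrays; the names `ConjugateGradient`, `solve`, `maxIter` and `tol` are illustrative and do not come from [CB09].

```scala
object ConjugateGradient {
  type Vec = Array[Double]
  type Mat = Array[Array[Double]]

  // DAXPY: a * x + y
  def daxpy(a: Double, x: Vec, y: Vec): Vec =
    x.zip(y).map { case (xi, yi) => a * xi + yi }

  // DDOT: dot product of two vectors
  def ddot(x: Vec, y: Vec): Double =
    x.zip(y).map { case (xi, yi) => xi * yi }.sum

  // MatVec: matrix times vector
  def matVec(m: Mat, x: Vec): Vec = m.map(row => ddot(row, x))

  // Solve A x = b, assuming A is symmetric positive-definite.
  def solve(a: Mat, b: Vec, maxIter: Int = 100, tol: Double = 1e-10): Vec = {
    var x: Vec = Array.fill(b.length)(0.0)
    var r: Vec = b.clone() // residual b - A x for the initial guess x = 0
    var p: Vec = r.clone() // search direction
    var rsOld = ddot(r, r)
    var i = 0
    while (i < maxIter && math.sqrt(rsOld) > tol) {
      val ap = matVec(a, p)                // MatVec
      val alpha = rsOld / ddot(p, ap)      // DDOT
      x = daxpy(alpha, p, x)               // DAXPY
      r = daxpy(-alpha, ap, r)             // DAXPY
      val rsNew = ddot(r, r)               // DDOT
      p = daxpy(rsNew / rsOld, p, r)       // DAXPY
      rsOld = rsNew
      i += 1
    }
    x
  }

  def main(args: Array[String]): Unit = {
    val a: Mat = Array(Array(4.0, 1.0), Array(1.0, 3.0))
    val b: Vec = Array(1.0, 2.0)
    println(solve(a, b).mkString(", "))
  }
}
```

Each pass through the loop makes one MatVec, two DDOT and three DAXPY calls. That is six primitive calls, which would be consistent with the figure of 6 MR jobs per CG iteration if every primitive becomes its own job.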
ML realizations: 3 Generational view
First Generation
• Examples: SAS, R, Weka, SPSS in native form
• Scalability: Vertical
• Algorithms available: Huge collection of algorithms
• Algorithms not available: Practically nothing
• Fault tolerance: Single point of failure
• Giants: All 7 giants, for small data sets
Second Generation
• Examples: Mahout, Pentaho, Revolution R, SAS In-memory Analytics (Hadoop)
• Scalability: Horizontal (over Hadoop)
• Algorithms available: Small subset, including sequential logistic regression, linear SVMs, Stochastic Gradient Descent, k-means clustering, Random Forests etc.
• Algorithms not available: Vast number, including kernel SVMs, Multivariate Logistic Regression, Conjugate Gradient Descent, ALS etc.
• Fault tolerance: Most tools are fault tolerant (FT), as they are built on top of Hadoop
• Giants: Giants 1 and 2
Third Generation
• Examples: Spark, HaLoop, GraphLab, Pregel, SAS In-memory Analytics (Greenplum/Teradata), Giraph, Golden ORB, Stanford GPS, ML over Storm
• Scalability: Horizontal (Beyond Hadoop)
• Algorithms available: Much wider, including Conjugate Gradient Descent (CGD), Alternating Least Squares (ALS), collaborative filtering, kernel SVM, belief propagation, matrix factorization, Gibbs sampling etc.
• Algorithms not available: Multivariate logistic regression in general form, K-means clustering etc.; work is in progress to expand the set of algorithms available.
• Fault tolerance: FT are HaLoop and Spark; not FT are Pregel, GraphLab and Giraph
• Giants: Spark covers giants 2, 3 and 4; GraphLab covers giant 5
Vijay Srinivas Agneeswaran, Pranay Tonpay and Jayati Tiwari, “Paradigms for Realizing Machine Learning Algorithms”,
Big Data Journal (Libertpub), 1(4), 207-214.
Data Flow in Spark and Hadoop
Berkeley Big-data Analytics Stack (BDAS)
BDAS: Use Cases
Ooyala
• Uses Cassandra for video data personalization.
• Pre-compute aggregates VS on-the-fly queries.
• Moved to Spark for ML and computing views.
• Moved to Shark for on-the-fly queries: C* OLAP aggregate queries took 130 secs on Cassandra and 60 ms in Spark.
Conviva
• Uses Hive for repeatedly running ad-hoc queries on video data.
• Optimized ad-hoc queries using Spark RDDs: found Spark to be 30 times faster than Hive.
• ML for connection analysis and video streaming optimization.
Yahoo
• Advertisement targeting: 30K nodes on Hadoop YARN
  • Hadoop: batch processing
  • Spark: iterative processing
  • Storm: on-the-fly processing
• Content recommendation: collaborative filtering
BDAS: Spark
[MZ12] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J.
Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: a fault-tolerant abstraction for in-memory
cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and
Implementation (NSDI'12). USENIX Association, Berkeley, CA, USA, 2-2.
Transformations/Actions: Description
map(function f1): Passes each element of the RDD through f1 in parallel and returns the resulting RDD.
filter(function f2): Selects the elements of the RDD for which f2 returns true.
flatMap(function f3): Similar to map, but f3 returns a sequence, so a single input can map to multiple outputs.
union(RDD r1): Returns the union of the RDD r1 with the self RDD.
sample(flag, p, seed): Returns a random sample (using the given seed) of fraction p of the RDD; flag selects sampling with or without replacement.
groupByKey(noTasks): Can only be invoked on key-value paired data; returns the data grouped by key. The number of parallel tasks is given as an argument (default is 8).
reduceByKey(function f4, noTasks): Aggregates the result of applying f4 on elements with the same key. The number of parallel tasks is the second argument.
join(RDD r2, noTasks): Joins RDD r2 with self; computes all possible pairs for a given key.
groupWith(RDD r3, noTasks): Joins RDD r3 with self and groups by key.
sortByKey(flag): Sorts the self RDD in ascending or descending order based on flag.
reduce(function f5): Aggregates the result of applying function f5 on all elements of the self RDD.
collect(): Returns all elements of the RDD as an array.
count(): Counts the number of elements in the RDD.
take(n): Gets the first n elements of the RDD.
first(): Equivalent to take(1).
saveAsTextFile(path): Persists the RDD in a file in HDFS or another Hadoop-supported file system at the given path.
saveAsSequenceFile(path): Persists the RDD as a Hadoop sequence file. Can be invoked only on key-value paired RDDs that implement the Hadoop Writable interface or equivalent.
foreach(function f6): Runs f6 in parallel on the elements of the self RDD.
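Several of these operators share their names with methods on ordinary Scala collections, so their behaviour can be tried locally. The sketch below is a word count on a plain `Seq`, not on an RDD; `groupBy` followed by a per-key sum stands in for `reduceByKey`, and the names `RddStyleOps` and `wordCount` are illustrative. On a cluster, the corresponding Spark pipeline would roughly be `flatMap`, `map` and `reduceByKey(_ + _)` applied to an RDD of lines.

```scala
object RddStyleOps {
  // Word count written with collection operators that mirror the RDD ones.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))   // flatMap: one line becomes many words
      .filter(_.nonEmpty)         // filter: drop empty tokens
      .map(word => (word, 1))     // map: word becomes a (key, value) pair
      .groupBy(_._1)              // local stand-in for grouping by key
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // sum per key

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("storm spark", "spark graphlab spark"))
    println(counts)
  }
}
```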
Representation of an RDD
Set of partitions
• HadoopRDD: 1 per HDFS block
• FilteredRDD: Same as parent
• JoinedRDD: 1 per reduce task
Set of dependencies
• HadoopRDD: None
• FilteredRDD: 1-to-1 on parent
• JoinedRDD: Shuffle on each parent
Function to compute data set based on parents
• HadoopRDD: Read corresponding block
• FilteredRDD: Compute parent and filter it
• JoinedRDD: Read and join shuffled data
Meta-data on location (preferredLocations)
• HadoopRDD: HDFS block location from namenode
• FilteredRDD: None (parent)
• JoinedRDD: None
Meta-data on partitioning (partitioningScheme)
• HadoopRDD: None
• FilteredRDD: None
• JoinedRDD: HashPartitioner
Some Spark(ling) examples
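Scala code (serial)
// Monte Carlo estimate of Pi: sample points in the square [-1, 1] x [-1, 1]
var count = 0
for (i <- 1 to 100000) {
  val x = Math.random * 2 - 1
  val y = Math.random * 2 - 1
  if (x*x + y*y < 1) count += 1   // the point lies inside the unit circle
}
println("Pi is roughly " + 4 * count / 100000.0)
Sample random points in the square that encloses the unit circle and count how many fall inside the circle; that fraction is roughly PI/4, which gives an approximate value for PI.
With PS and PC the number of points in the square and in the circle, and AS and AC their areas, PS/PC = AS/AC = 4/PI, so PI = 4 * (PC/PS).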
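The ratio can be written out as follows, under the assumption (implied by the code) that the square has side 2 and the circle has radius 1; the symbols P_S, P_C, A_S and A_C correspond to PS, PC, AS and AC in the text.

```latex
\[
\frac{P_C}{P_S} \approx \frac{A_C}{A_S}
  = \frac{\pi \cdot 1^2}{2 \cdot 2}
  = \frac{\pi}{4}
\quad\Longrightarrow\quad
\pi \approx 4 \cdot \frac{P_C}{P_S}
\]
```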
Some Spark(ling) examples
Spark code (parallel)
val spark = new SparkContext(<Mesos master>)
var count = spark.accumulator(0)
for (i <- spark.parallelize(1 to 100000, 12)) {
  val x = Math.random * 2 - 1
  val y = Math.random * 2 - 1
  if (x*x + y*y < 1) count += 1
}
println("Pi is roughly " + 4 * count.value / 100000.0)
Notable points:
1. A Spark context is created; it talks to the Mesos master (see the note below).
2. count becomes a shared variable, an accumulator; the driver reads the total through count.value.
3. parallelize turns the Scala range object (1 to 100000) into an RDD split into 12 slices.
4. The for loop over the RDD is translated by Scala into a call to the RDD's foreach method, which runs the loop body in parallel.
Note: Mesos is an Apache incubated clustering system, http://mesosproject.org
Logistic Regression in Spark: Serial Code
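// Read data file and convert it into Point objects.
// parsePoint, Vector, D and ITERATIONS are assumed to be defined elsewhere.
val lines = scala.io.Source.fromFile("data.txt").getLines()
// Materialize the points: getLines() returns an Iterator, which can be
// traversed only once, but the loop below reads the points every iteration.
val points = lines.map(x => parsePoint(x)).toList
// Run logistic regression
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = Vector.zeros(D)
  for (p <- points) {
    val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y
    gradient += scale * p.x
  }
  w -= gradient
}
println("Result: " + w)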
Logistic Regression in Spark
// Read data file and transform it into Point objects
val spark = new SparkContext(<Mesos master>)
val lines = spark.hdfsTextFile("hdfs://.../data.txt")
// cache() keeps the parsed points in memory across iterations
val points = lines.map(x => parsePoint(x)).cache()
// Run logistic regression
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  // The gradient is an accumulator: workers add to it, the driver reads it
  val gradient = spark.accumulator(Vector.zeros(D))
  for (p <- points) {
    val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y
    gradient += scale * p.x
  }
  w -= gradient.value
}
println("Result: " + w)
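The slide code depends on helpers that are not shown (parsePoint, Vector, D, ITERATIONS). For experimenting without a cluster, the same update rule can be written against plain arrays. This is a minimal sketch under stated assumptions: labels y are +1.0 or -1.0, the weights start at zero rather than at a random vector (to keep the run deterministic), and the names `LocalLogisticRegression`, `Point`, `gradient` and `train` are illustrative.

```scala
object LocalLogisticRegression {
  // A labelled example; y is assumed to be +1.0 or -1.0.
  final case class Point(x: Array[Double], y: Double)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (ai, bi) => ai * bi }.sum

  // One pass over the data: sum of the per-point gradient terms,
  // using the same "scale" expression as the slide code.
  def gradient(points: Seq[Point], w: Array[Double]): Array[Double] = {
    val g = Array.fill(w.length)(0.0)
    for (p <- points) {
      val scale = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
      for (j <- g.indices) g(j) += scale * p.x(j)
    }
    g
  }

  // Repeat w -= gradient for a fixed number of iterations.
  def train(points: Seq[Point], dims: Int, iterations: Int): Array[Double] = {
    var w = Array.fill(dims)(0.0)
    for (_ <- 1 to iterations) {
      val g = gradient(points, w)
      w = w.zip(g).map { case (wi, gi) => wi - gi }
    }
    w
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(
      Point(Array(1.0, 2.0), 1.0),
      Point(Array(2.0, 1.5), 1.0),
      Point(Array(-1.0, -2.0), -1.0),
      Point(Array(-2.0, -1.0), -1.0))
    val w = train(data, dims = 2, iterations = 10)
    println(w.mkString(", "))
  }
}
```

In the Spark version the inner loop over points is the part that runs in parallel; everything else stays on the driver.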
Logistic Regression: Spark VS Hadoop
Source: http://spark-project.org
Real-time Analytics with Storm
Solution to Internet Traffic Analysis Use Case
PMML Primer
• Predictive Model Markup Language
• Developed by DMG (Data Mining Group)
• XML representation of a model.
• PMML offers a standard to define a model, so that a model generated in tool A can be directly used in tool B.
• May contain a myriad of data transformations (pre- and post-processing) as well as one or more predictive models.
Naïve Bayes Primer
• A simple probabilistic classifier based on Bayes' theorem
• Given features X1, X2, …, Xn, predict a label Y by calculating the probability of every possible value of Y
• Terms labelled in the Bayes' theorem formula: Likelihood, Prior, Normalization Constant
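The formula behind those labels is not preserved in the slide text. The standard statement of Bayes' theorem with the naive independence assumption is given here as a reference sketch, using the X_i and Y of the bullets above.

```latex
\[
P(Y = y \mid X_1, \dots, X_n)
  = \frac{\overbrace{P(X_1, \dots, X_n \mid Y = y)}^{\text{likelihood}} \;
          \overbrace{P(Y = y)}^{\text{prior}}}
         {\underbrace{P(X_1, \dots, X_n)}_{\text{normalization constant}}}
\]
\[
P(X_1, \dots, X_n \mid Y = y) = \prod_{i=1}^{n} P(X_i \mid Y = y)
\qquad \text{(naive conditional-independence assumption)}
\]
\[
\hat{y} = \arg\max_{y} \; P(Y = y) \prod_{i=1}^{n} P(X_i \mid Y = y)
\]
```

The normalization constant is the same for every y, so it can be dropped when only the most probable label is needed.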
PMML Scoring for Naïve Bayes
• Wrote a PMML-based scoring engine for the Naïve Bayes algorithm.
• It can in principle be used in any data processing framework by invoking the API.
• Deployed a Naïve Bayes PMML model generated from R into the Storm, Spark and Samza frameworks.
• Real-time predictions with the above APIs.
Header
• Version and timestamp
• Model development environment information
Data Dictionary
• Variable types; missing, valid and invalid values
Data Munging/Transformation
• Normalization, mapping, discretization
Model
• Model-specific attributes
• Mining Schema
  • Treatment for missing and outlier values
• Targets
  • Prior probability and default
• Outputs
  • List of computed output fields
  • Post-processing
• Definition of model architecture/parameters.
<DataDictionary numberOfFields="4">
<DataField name="Class" optype="categorical" dataType="string">
<Value value="democrat"/>
<Value value="republican"/>
</DataField>
<DataField name="V1" optype="categorical" dataType="string">
<Value value="n"/>
<Value value="y"/>
</DataField>
<DataField name="V2" optype="categorical" dataType="string">
<Value value="n"/>
<Value value="y"/>
</DataField>
<DataField name="V3" optype="categorical" dataType="string">
<Value value="n"/>
<Value value="y"/>
</DataField>
</DataDictionary>
<NaiveBayesModel modelName="naiveBayes_Model" functionName="classification"
threshold="0.003">
<MiningSchema>
<MiningField name="Class" usageType="predicted"/>
<MiningField name="V1" usageType="active"/>
<MiningField name="V2" usageType="active"/>
<MiningField name="V3" usageType="active"/>
</MiningSchema>
<Output>
<OutputField name="Predicted_Class" feature="predictedValue"/>
<OutputField name="Probability_democrat" optype="continuous" dataType="double"
feature="probability" value="democrat"/>
<OutputField name="Probability_republican" optype="continuous" dataType="double"
feature="probability" value="republican"/>
</Output>
<BayesInputs>
<BayesInput fieldName="V1">
<PairCounts value="n">
<TargetValueCounts>
<TargetValueCount value="democrat" count="51"/>
<TargetValueCount value="republican" count="85"/>
</TargetValueCounts>
</PairCounts>
<PairCounts value="y">
<TargetValueCounts>
<TargetValueCount value="democrat" count="73"/>
<TargetValueCount value="republican" count="23"/>
</TargetValueCounts>
</PairCounts>
</BayesInput>
<BayesInput fieldName="V2">
*
<BayesInput fieldName="V3">
*
</BayesInputs>
<BayesOutput fieldName="Class">
<TargetValueCounts>
<TargetValueCount value="democrat" count="124"/>
<TargetValueCount value="republican" count="108"/>
</TargetValueCounts>
</BayesOutput>
PMML Scoring for Naïve Bayes
Definition of elements:
DataDictionary: definitions for the fields used in mining models (Class, V1, V2, V3)
NaiveBayesModel: indicates that this is a Naïve Bayes PMML model
MiningSchema: lists the fields used in that model. Class is the "predicted" field; V1, V2, V3 are "active" predictor fields
Output: describes a set of result values that can be returned from a model
BayesInputs: for each value of each input field, contains the counts of the target values
BayesOutput: contains the counts associated with the values of the target field
Sample Input
Eg1 - n y y n y y n n n n n n y y y y
Eg2 - n y n y y y n n n n n y y y n y
• 1st, 2nd and 3rd columns: predictor variables (attribute "name" in element MiningField)
• Using these we predict whether the output is democrat or republican (PMML element BayesOutput)
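To show how the counts in BayesInputs and BayesOutput turn into a prediction, the sketch below scores a record using V1 only, since the counts for V2 and V3 are elided in the fragment. It is not the scoring engine described in this deck: the names `NaiveBayesFromCounts` and `posterior` are hypothetical, and the model's threshold attribute (used in place of zero probabilities) is not modelled because no V1 count is zero.

```scala
object NaiveBayesFromCounts {
  // Class totals taken from the BayesOutput element.
  val classCounts: Map[String, Double] =
    Map("democrat" -> 124.0, "republican" -> 108.0)

  // PairCounts for V1 taken from the BayesInputs element:
  // input value -> (class -> count).
  val v1Counts: Map[String, Map[String, Double]] = Map(
    "n" -> Map("democrat" -> 51.0, "republican" -> 85.0),
    "y" -> Map("democrat" -> 73.0, "republican" -> 23.0))

  // Posterior over the classes given only the value of V1.
  def posterior(v1: String): Map[String, Double] = {
    val total = classCounts.values.sum
    val unnormalized = classCounts.map { case (cls, n) =>
      val prior = n / total                  // P(class)
      val likelihood = v1Counts(v1)(cls) / n // P(V1 = v1 | class)
      cls -> prior * likelihood
    }
    val z = unnormalized.values.sum          // normalization constant
    unnormalized.map { case (cls, p) => cls -> p / z }
  }

  def main(args: Array[String]): Unit = {
    println(posterior("y"))
    println(posterior("n"))
  }
}
```

With all three predictors available, the likelihood would be the product of one such ratio per field.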
PMML Scoring for Naïve Bayes
• 3-node Storm cluster of Xeon machines (8 quad-core CPUs, 32 GB RAM, 32 GB swap space; 1 Nimbus, 2 Supervisors)
Number of records (in millions)   Time taken (seconds)
0.1                               4
0.4                               7
1.0                               12
2.0                               21
10                                129
25                                310
PMML Scoring for Naïve Bayes
• 3-node Spark cluster of Xeon machines (8 quad-core CPUs, 32 GB RAM and 32 GB swap space)
Number of records (in millions)   Time taken
0.1                               1 min 47 sec
0.2                               3 min 35 sec
0.4                               6 min 40 sec
1.0                               35 min 17 sec
10                                More than 3 hrs
GraphLab: Ideal Engine for Processing Natural Graphs [YL12]
[YL12] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed GraphLab: a
framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment 5, 8 (April 2012), 716-727.
Goals: targeted at machine learning.
• Model graph dependencies; be asynchronous, iterative, dynamic.
• Data is associated with edges (weights, for instance) and vertices (user profile data, current interests etc.).
Update functions: live on each vertex
• Transform data in the scope of the vertex.
• Can choose to trigger neighbours (for example, only if the rank changes drastically).
• Run asynchronously till convergence: no global barrier.
Consistency is important in ML algorithms (some, such as collaborative filtering, do not even converge when there are inconsistent updates).
• GraphLab provides varying levels of consistency: parallelism VS consistency.
Implemented several algorithms, including ALS, K-means, SVM, belief propagation, matrix factorization, Gibbs sampling, SVD, CoEM etc.
• Co-EM (Expectation Maximization) algorithm 15x faster than Hadoop MR; on distributed GraphLab, only 0.3% of Hadoop execution time.
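The update-function idea (recompute one vertex from its scope, and trigger neighbours only when the value moved noticeably) can be imitated on a single machine with a work queue. This is a sequential sketch, not GraphLab's API: the names `UpdateFunctionSketch`, `pageRank`, `damping` and `tolerance` are illustrative, ranks are left unnormalized, and every vertex is assumed to appear as a key of `outLinks`.

```scala
import scala.collection.mutable

object UpdateFunctionSketch {
  // outLinks(v) lists the vertices that v points to.
  def pageRank(outLinks: Map[Int, Seq[Int]],
               damping: Double = 0.85,
               tolerance: Double = 1e-6): Map[Int, Double] = {
    val vertices = outLinks.keys.toSeq
    val inLinks: Map[Int, Seq[Int]] =
      vertices.map(v => v -> vertices.filter(u => outLinks(u).contains(v))).toMap

    val rank = mutable.Map.empty[Int, Double]
    rank ++= vertices.map(v => v -> 1.0)
    val queue = mutable.Queue.empty[Int] // vertices waiting for an update
    queue ++= vertices
    val queued = mutable.Set.empty[Int]  // avoids scheduling a vertex twice
    queued ++= vertices

    while (queue.nonEmpty) {
      val v = queue.dequeue()
      queued -= v
      // Update function: read the scope of v (its in-neighbours), rewrite v.
      val incoming = inLinks(v).map(u => rank(u) / outLinks(u).size).sum
      val updated = (1 - damping) + damping * incoming
      val change = math.abs(updated - rank(v))
      rank(v) = updated
      // Trigger the out-neighbours only if this rank moved noticeably.
      if (change > tolerance)
        for (n <- outLinks(v) if !queued(n)) { queue.enqueue(n); queued += n }
    }
    rank.toMap
  }

  def main(args: Array[String]): Unit = {
    val graph = Map(1 -> Seq(2), 2 -> Seq(1), 3 -> Seq(1))
    println(pageRank(graph))
  }
}
```

There is no global iteration counter or barrier here: the computation ends when the queue drains, which is the single-machine analogue of running till convergence.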
GraphLab 2: PowerGraph – Modeling Natural Graphs [1]
[1] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin (2012). "PowerGraph: Distributed
Graph-Parallel Computation on Natural Graphs." Proceedings of the 10th USENIX Symposium on Operating Systems
Design and Implementation (OSDI '12).
GraphLab could not scale to the Altavista web graph of 2002 (1.4B vertices, 6.7B edges).
• Most graph-parallel abstractions assume small neighbourhoods, i.e. low-degree vertices.
• But natural graphs (LinkedIn, Facebook, Twitter) are power-law graphs.
• Power-law graphs are hard to partition, and high-degree vertices limit parallelism.
PowerGraph provides a new way of partitioning power-law graphs
• Edges are tied to machines; vertices (esp. high-degree ones) span machines.
• Execution is split into 3 phases: gather, apply and scatter.
Triangle counting on the Twitter graph
• Hadoop MR took 423 minutes on 1536 machines
• GraphLab 2 took 1.5 minutes on 1024 cores (64 machines)
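The gather, apply and scatter phases can be illustrated with a small synchronous PageRank over an edge list. This is a single-machine sketch of the pattern only, not PowerGraph's API: the names `GatherApplyScatter`, `Edge` and `pageRank` are hypothetical, ranks are unnormalized, and a fixed number of sweeps replaces the scatter-driven scheduling.

```scala
object GatherApplyScatter {
  final case class Edge(src: Int, dst: Int)

  def pageRank(vertices: Seq[Int], edges: Seq[Edge],
               iterations: Int = 20, damping: Double = 0.85): Map[Int, Double] = {
    val outDegree: Map[Int, Int] =
      edges.groupBy(_.src).map { case (v, es) => (v, es.size) }
    var rank: Map[Int, Double] = vertices.map(v => (v, 1.0)).toMap

    for (_ <- 1 to iterations) {
      // Gather: every edge contributes rank(src) / outDegree(src) to its
      // target; contributions are summed per target vertex. Working per edge
      // is what lets a high-degree vertex be spread over several machines.
      val gathered: Map[Int, Double] = edges
        .map(e => (e.dst, rank(e.src) / outDegree(e.src)))
        .groupBy(_._1)
        .map { case (v, cs) => (v, cs.map(_._2).sum) }
      // Apply: each vertex turns its gathered sum into a new value.
      rank = vertices.map { v =>
        (v, (1 - damping) + damping * gathered.getOrElse(v, 0.0))
      }.toMap
      // Scatter: in PowerGraph this phase would update edge data or signal
      // neighbours; this sketch simply proceeds to the next sweep.
    }
    rank
  }

  def main(args: Array[String]): Unit = {
    val edges = Seq(Edge(1, 2), Edge(2, 1), Edge(3, 1))
    println(pageRank(Seq(1, 2, 3), edges))
  }
}
```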
Hadoop YARN Requirements or 1.0 shortcomings
R1: Scalability
• Single cluster limitation
R2: Multi-tenancy
• Addressed by Hadoop-on-Demand
• Security, quotas
R3: Locality awareness
• Shuffle of records
R4: Shared cluster utilization
• Hogging by users
• Typed slots
R5: Reliability/Availability
• Job Tracker bugs
R6: Iterative Machine Learning
Vinod Kumar Vavilapalli, Arun C Murthy , Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason
Lowe , Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin Reed, and Eric
Baldeschwieler, “Apache Hadoop YARN: Yet Another Resource Negotiator”, ACM Symposium on Cloud Computing, Oct 2013, ACM
Press.
Hadoop YARN Architecture
YARN Internals
Application Master
• Sends ResourceRequests to the YARN Resource Manager (RM)
• A request captures the number of containers, the resources per container and locality preferences.
YARN RM
• Generates tokens and containers
• Global view of the cluster: monolithic scheduling.
Node Manager
• Node health monitoring; advertises available resources to the RM through heartbeats.
Programming Abstractions
PMML
• XML-based representation of the analytical model
Spark
• Scala collection over a distributed shared memory system
GraphLab
• Gather-Apply-Scatter
Forge
• Domain Specific Language
Forge: Approach to build high performance Domain Specific Languages
• Domain specific language approach from Stanford.
• Forge [AKS13]: a meta DSL for high performance DSLs.
• 40X faster than Spark!
• OptiML: DSL for machine learning
[AKS13] Arvind K. Sujeeth, Austin Gibbons, Kevin J. Brown, HyoukJoong Lee, Tiark Rompf, Martin Odersky, and Kunle Olukotun. 2013. Forge: generating a high performance DSL implementation from a declarative specification. In Proceedings of the 12th international conference on Generative programming: concepts & experiences (GPCE '13). ACM, New York, NY, USA, 145-154.
Conclusions
• Beyond Hadoop Map-Reduce philosophy
• Optimization and other problems.
• Real-time computation
• Processing specialized data structures
• PMML scoring
• Spark for batch computations
• Spark streaming and Storm for real-time.
• Allows traditional analytical tools/algorithms to be re-used.
Thank You!
Mail •bigdata@impetus.com
LinkedIn •www.linkedin.com/company/impetus
Blogs •http://blogs.impetus.com/
Twitter •@impetustech

Big Data Analytics with Storm, Spark and GraphLab

  • 1.
    1 Big Data Analyticswith Storm, Spark and GraphLab Dr. Vijay Srinivas Agneeswaran Director and Head, Big-data R&D Impetus Technologies Inc.
  • 2.
    2 Contents Big Data Computations •Introduction toML •Characterization Berkeley data analytics stack •Spark Real-time Analytics with Storm PMML Scoring for Naïve Bayes •PMML Primer •Naïve Bayes Primer GraphLab Hadoop 2.0 (Hadoop YARN) Programming Abstractions
  • 3.
    • What isit? • learn patterns in data • improve accuracy by learning • Examples • Speech recognition systems • Recommender systems • Medical decision aids • Robot navigation systems Introduction to Machine Learning 3
  • 4.
    • Attributes andtheir values: • Outlook: Sunny, Overcast, Rain • Humidity: High, Normal • Wind: Strong, Weak • Temperature: Hot, Mild, Cool • Target prediction - Play Tennis: Yes, No Introduction to Machine Learning 4
  • 5.
    5 Introduction to MachineLearning NoStrongHighMildRainD14 YesWeakNormalHotOvercastD13 YesStrongHighMildOvercastD12 YesStrongNormalMildSunnyD11 YesStrongNormalMildRainD10 YesWeakNormalCoolSunnyD9 NoWeakHighMildSunnyD8 YesWeakNormalCoolOvercastD7 NoStrongNormalCoolRainD6 YesWeakNormalCoolRainD5 YesWeakHighMildRainD4 YesWeakHighHotOvercastD3 NoStrongHighHotSunnyD2 NoWeakHighHotSunnyD1 Play TennisWindHumidityTemp.OutlookDay Tom Mitchell, Machine Learning, Tata McGraw Hill Publications.
  • 6.
    6 Introduction to MachineLearning: Decision Trees Outlook Sunny Overcast Rain Humidity High Normal Wind Strong Weak No Yes Yes YesNo
  • 7.
    7 Decision Trees toRandom Forests Can we have an ensemble of trees? – random forests Final prediction is the mean (regression) or class with max votes (categorization) Does not need tree pruning for generalization Greater accuracy across domains. Decision trees Pros •Handling of mixed data, Robustness to outliers, Computational scalability cons •Low prediction accuracy, High variance, Size VS Goodness of fit
  • 8.
  • 9.
  • 10.
    10 Introduction to MachineLearning Machine learning tasks Learning associations – market basket analysis Supervised learning (Classification/regression) – random forests, support vector machines (SVMs), logistic regression (LR), Naïve Bayes Unsupervised learning (clustering) - k-means, sentiment analysis Prediction – random forests, SVMs, LR Data Mining Application of machine learning to large data Knowledge Discovery in Databases (KDD) Credit scoring, fraud detection, market basket analysis, medical diagnosis, manufacturing optimization
  • 11.
    11 Big Data ComputationsComputations/Operations Giant1 (simple stats) is perfect for Hadoop 1.0. Giants 2 (linear algebra), 3 (N- body), 4 (optimization) Spark from UC Berkeley is efficient. Logistic regression, kernel SVMs, conjugate gradient descent, collaborative filtering, Gibbs sampling, alternating least squares. Example is social group-first approach for consumer churn analysis [2] Interactive/On-the-fly data processing – Storm. OLAP – data cube operations. Dremel/Drill Data sets – not embarrassingly parallel? Deep Learning Artificial Neural Networks Machine vision from Google [3] Speech analysis from Microsoft Giant 5 – Graph processing – GraphLab, Pregel, Giraph [1] National Research Council. Frontiers in Massive Data Analysis . Washington, DC: The National Academies Press, 2013. [2] Richter, Yossi ; Yom-Tov, Elad ; Slonim, Noam: Predicting Customer Churn in Mobile Networks through Analysis of Social Groups. In: Proceedings of SIAM International Conference on Data Mining, 2010, S. 732-741 [3] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew W. Senior, Paul A. Tucker, Ke Yang, Andrew Y. Ng: Large Scale Distributed Deep Networks. NIPS 2012: 1232-1240
  • 12.
    Iterative ML Algorithms [CB09]C. Bunch, B. Drawert, M. Norman, Mapscale: a cloud environment for scientific computing, Technical Report, University of California, Computer Science Department, 2009. What are iterative algorithms? • Those that need communication among the computing entities • Examples – neural networks, PageRank algorithms, network traffic analysis Conjugate gradient descent • Commonly used to solve systems of linear equations • [CB09] tried implementing CG on dense matrices • DAXPY – Multiplies vector x by constant a and adds y. • DDOT – Dot product of 2 vectors • MatVec – Multiply matrix by vector, produce a vector. Communication Overhead • 1 MR per primitive – 6 MRs per CG iteration, hundreds of MRs per CG computation, leading to 10 of GBs of communication even for small matrices. Other iterative algorithms • fast fourier transform, block tridiagonal
  • 13.
    13 ML realizations: 3Generational view Generation First Generation Second Generation Third Generation Examples SAS, R, Weka, SPSS in native form Mahout, Pentaho, Revolution R, SAS In- memory Analytics (Hadoop) Spark, HaLoop, GraphLab, Pregel, SAS In-memory Analytics (Greenplum/Teradata), Giraph, Golden ORB, Stanford GPS, ML over Storm Scalability Vertical Horizontal (over Hadoop) Horizontal (Beyond Hadoop) Algorithms Available Huge collection of algorithms Small subset – sequential logistic regression, linear SVMs, Stochastic Gradient Descent, k-means clustering, Random Forests etc. Much wider – including Conjugate Gradient Descent (CGD), Alternating Least Squares (ALS), collaborative filtering, kernel SVM, belief propagation, matrix factorization, Gibbs sampling etc. Algorithms Not Available Practically Nothing Vast no. – Kernel SVMs, Multivariate Logistic Regression, Conjugate Gradient Descent, ALS etc. Multivariate logistic regression in general form, K-means clustering etc. – work in progress to expand the set of algorithms available. Fault- Tolerance Single point of failure Most tools are FT, as they are built on top of Hadoop FT – HaLoop, Spark Not FT – Pregel, GraphLab, Giraph Giants All 7 giants – for small data sets Giants 1, and 2. Spark – giant 2, 3 and 4. GraphLab – giant 5. Vijay Srinivas Agneeswaran, Pranay Tonpay and Jayati Tiwari, “Paradigms for Realizing Machine Learning Algorithms”, Big Data Journal (Libertpub), 1(4), 207-214.
  • 14.
    14 Contents Big Data Computations •Introduction toML •Characterization Berkeley data analytics stack •Spark Real-time Analytics with Storm Hadoop 2.0 (Hadoop YARN) PMML Scoring for Naïve Bayes •PMML Primer •Naïve Bayes Primer GraphLab Programming Abstractions
  • 15.
    15 Data Flow inSpark and Hadoop
  • 16.
  • 17.
    BDAS: Use Cases 17 Ooyala UsesCassandra for video data personalization. Pre-compute aggregates VS on-the- fly queries. Moved to Spark for ML and computing views. Moved to Shark for on-the-fly queries – C* OLAP aggregate queries on Cassandra 130 secs, 60 ms in Spark Conviva Uses Hive for repeatedly running ad-hoc queries on video data. Optimized ad-hoc queries using Spark RDDs – found Spark is 30 times faster than Hive ML for connection analysis and video streaming optimization. Yahoo Advertisement targeting: 30K nodes on Hadoop Yarn Hadoop – batch processing Spark – iterative processing Storm – on-the-fly processing Content recommendation – collaborative filtering
  • 18.
    BDAS: Spark [MZ12] MateiZaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (NSDI'12). USENIX Association, Berkeley, CA, USA, 2-2. Transformations/Actions Description Map(function f1) Pass each element of the RDD through f1 in parallel and return the resulting RDD. Filter(function f2) Select elements of RDD that return true when passed through f2. flatMap(function f3) Similar to Map, but f3 returns a sequence to facilitate mapping single input to multiple outputs. Union(RDD r1) Returns result of union of the RDD r1 with the self. Sample(flag, p, seed) Returns a randomly sampled (with seed) p percentage of the RDD. groupByKey(noTasks) Can only be invoked on key-value paired data – returns data grouped by value. No. of parallel tasks is given as an argument (default is 8). reduceByKey(function f4, noTasks) Aggregates result of applying f4 on elements with same key. No. of parallel tasks is the second argument. Join(RDD r2, noTasks) Joins RDD r2 with self – computes all possible pairs for given key. groupWith(RDD r3, noTasks) Joins RDD r3 with self and groups by key. sortByKey(flag) Sorts the self RDD in ascending or descending based on flag. Reduce(function f5) Aggregates result of applying function f5 on all elements of self RDD Collect() Return all elements of the RDD as an array. Count() Count no. of elements in RDD take(n) Get first n elements of RDD. First() Equivalent to take(1) saveAsTextFile(path) Persists RDD in a file in HDFS or other Hadoop supported file system at given path. saveAsSequenceFile(path) Persist RDD as a Hadoop sequence file. Can be invoked only on key-value paired RDDs that implement Hadoop writable interface or equivalent. 
foreach(function f6) Run f6 in parallel on elements of self RDD.
  • 19.
    Representation of anRDD 19 Information HadoopRDD FilteredRDD JoinedRDD Set of partitions 1 per HDFS block Same as parent 1 per reduce task Set of dependencies None 1-to-1 on parent Shuffle on each parent Function to compute data set based on parents Read corresponding block Compute parent and filter it Read and join shuffled data Meta-data on location (preferredLocaations) HDFS block location from namenode None (parent) None Meta-data on partitioning (partitioningScheme) None None HashPartitioner
  • 20.
    Some Spark(ling) examples
    Scala code (serial)
        var count = 0
        for (i <- 1 to 100000) {
          val x = Math.random * 2 - 1
          val y = Math.random * 2 - 1
          if (x*x + y*y < 1) count += 1
        }
        println("Pi is roughly " + 4 * count / 100000.0)
    Sample random points in the square [-1, 1] x [-1, 1] and count how many fall inside the unit circle (roughly a fraction PI/4 of them). This yields an approximate value for PI: with PS points in the square, PC points in the circle, and AS and AC the corresponding areas, PS/PC ≈ AS/AC = 4/PI, so PI ≈ 4 * (PC/PS).
  • 21.
    Some Spark(ling) examples
    Spark code (parallel)
        val spark = new SparkContext(<Mesos master>)
        var count = spark.accumulator(0)
        for (i <- spark.parallelize(1 to 100000, 12)) {
          val x = Math.random * 2 - 1
          val y = Math.random * 2 - 1
          if (x*x + y*y < 1) count += 1
        }
        // Read the accumulated total on the driver through .value
        println("Pi is roughly " + 4 * count.value / 100000.0)
    Notable points:
    1. A Spark context is created; it talks to the Mesos [1] master.
    2. count becomes a shared variable: an accumulator.
    3. parallelize turns the Scala range object (1 to 100000) into an RDD broken into 12 slices.
    4. The for loop over the RDD invokes the RDD's foreach method.
    [1] Mesos is an Apache incubated clustering system – http://mesosproject.org
  • 22.
    Logistic Regression in Spark: Serial Code
        // parsePoint, D and ITERATIONS are assumed to be defined elsewhere.
        // Read data file and convert it into Point objects
        val lines = scala.io.Source.fromFile("data.txt").getLines()
        // Materialize the points: getLines() returns a one-shot iterator,
        // which would be exhausted after the first pass of the loop below.
        val points = lines.map(x => parsePoint(x)).toList
        // Run logistic regression
        var w = Vector.random(D)
        for (i <- 1 to ITERATIONS) {
          val gradient = Vector.zeros(D)
          for (p <- points) {
            val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y
            gradient += scale * p.x
          }
          w -= gradient
        }
        println("Result: " + w)
  • 23.
    Logistic Regression in Spark
        // Read data file and transform it into Point objects
        val spark = new SparkContext(<Mesos master>)
        val lines = spark.hdfsTextFile("hdfs://.../data.txt")
        val points = lines.map(x => parsePoint(x)).cache()
        // Run logistic regression
        var w = Vector.random(D)
        for (i <- 1 to ITERATIONS) {
          val gradient = spark.accumulator(Vector.zeros(D))
          for (p <- points) {
            val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y
            gradient += scale * p.x
          }
          w -= gradient.value
        }
        println("Result: " + w)
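The slide code depends on `parsePoint`, `D`, `ITERATIONS` and a `Vector` helper that are not shown. As a rough, self-contained illustration of the same gradient step, the sketch below uses plain arrays instead. The names `Point`, `dot`, `step` and `train` are hypothetical, labels are assumed to be -1.0 or +1.0, and the weights start at zero rather than at random values so that the result is deterministic.

```scala
object LogisticGradientSketch {
  // A labelled point: feature vector x and label y, assumed to be -1.0 or +1.0.
  final case class Point(x: Array[Double], y: Double)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  // One full-batch step, using the same per-point scale factor as the slide code.
  def step(w: Array[Double], points: Seq[Point]): Array[Double] = {
    val gradient = Array.fill(w.length)(0.0)
    for (p <- points) {
      val scale = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
      for (j <- w.indices) gradient(j) += scale * p.x(j)
    }
    w.indices.map(j => w(j) - gradient(j)).toArray
  }

  // Repeat the step a fixed number of times, starting from the zero vector.
  def train(points: Seq[Point], dims: Int, iterations: Int): Array[Double] =
    (1 to iterations).foldLeft(Array.fill(dims)(0.0))((w, _) => step(w, points))

  def main(args: Array[String]): Unit = {
    val points = Seq(Point(Array(1.0, 2.0), 1.0), Point(Array(-1.0, -1.5), -1.0))
    println(train(points, dims = 2, iterations = 10).mkString(", "))
  }
}
```

In the Spark version the inner loop over points is what runs in parallel, and the accumulator plays the role of the local `gradient` array.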
  • 24.
    Logistic Regression: Spark vs. Hadoop
    Source: http://spark-project.org
  • 25.
  • 26.
    Contents
    Big Data Computations
    • Introduction to ML
    • Characterization
    Berkeley data analytics stack
    • Spark
    Real-time Analytics with Storm
    PMML Scoring for Naïve Bayes
    • PMML Primer
    • Naïve Bayes Primer
    GraphLab
    Hadoop 2.0 (Hadoop YARN)
    Programming Abstractions
  • 27.
  • 28.
    Solution to Internet Traffic Analysis Use Case
  • 29.
    Contents
    Big Data Computations
    • Introduction to ML
    • Characterization
    Berkeley data analytics stack
    • Spark
    Real-time Analytics with Storm
    PMML Scoring for Naïve Bayes
    • PMML Primer
    • Naïve Bayes Primer
    GraphLab
    Hadoop 2.0 (Hadoop YARN)
    Programming Abstractions
  • 30.
    PMML Primer
    • Predictive Model Markup Language
    • Developed by DMG (Data Mining Group)
    • XML representation of a model.
    • PMML offers a standard way to define a model, so that a model generated in tool A can be used directly in tool B.
    • May contain a myriad of data transformations (pre- and post-processing) as well as one or more predictive models.
  • 31.
    Naïve Bayes Primer
    • A simple probabilistic classifier based on Bayes' Theorem
    • Given features X1, X2, …, Xn, predict a label Y by calculating the probability of every possible value of Y
    • P(Y | X1, …, Xn) = P(X1, …, Xn | Y) * P(Y) / P(X1, …, Xn), that is, Likelihood * Prior / Normalization Constant
  • 32.
    PMML Scoring for Naïve Bayes
    • Wrote a PMML-based scoring engine for the Naïve Bayes algorithm.
    • In principle it can be used in any data processing framework by invoking its API.
    • Deployed a Naïve Bayes PMML model generated from R into the Storm, Spark and Samza frameworks.
    • Real-time predictions with the above APIs.
  • 33.
    Header
    • Version and timestamp
    • Model development environment information
    Data Dictionary
    • Variable types; missing, valid and invalid values
    Data Munging/Transformation
    • Normalization, mapping, discretization
    Model
    • Model specific attributes
    • Mining Schema
    • Treatment for missing and outlier values
    • Targets
    • Prior probability and default
    • Outputs
    • List of computed output fields
    • Post-processing
    • Definition of model architecture/parameters.
  • 34.
    PMML Scoring for Naïve Bayes
        <DataDictionary numberOfFields="4">
          <DataField name="Class" optype="categorical" dataType="string">
            <Value value="democrat"/>
            <Value value="republican"/>
          </DataField>
          <DataField name="V1" optype="categorical" dataType="string">
            <Value value="n"/>
            <Value value="y"/>
          </DataField>
          <DataField name="V2" optype="categorical" dataType="string">
            <Value value="n"/>
            <Value value="y"/>
          </DataField>
          <DataField name="V3" optype="categorical" dataType="string">
            <Value value="n"/>
            <Value value="y"/>
          </DataField>
        </DataDictionary>
    (ctd on the next slide)
  • 35.
    PMML Scoring for Naïve Bayes
        <NaiveBayesModel modelName="naiveBayes_Model" functionName="classification" threshold="0.003">
          <MiningSchema>
            <MiningField name="Class" usageType="predicted"/>
            <MiningField name="V1" usageType="active"/>
            <MiningField name="V2" usageType="active"/>
            <MiningField name="V3" usageType="active"/>
          </MiningSchema>
          <Output>
            <OutputField name="Predicted_Class" feature="predictedValue"/>
            <OutputField name="Probability_democrat" optype="continuous" dataType="double" feature="probability" value="democrat"/>
            <OutputField name="Probability_republican" optype="continuous" dataType="double" feature="probability" value="republican"/>
          </Output>
          <BayesInputs>
    (ctd on the next page)
  • 36.
    PMML Scoring for Naïve Bayes
        <BayesInputs>
          <BayesInput fieldName="V1">
            <PairCounts value="n">
              <TargetValueCounts>
                <TargetValueCount value="democrat" count="51"/>
                <TargetValueCount value="republican" count="85"/>
              </TargetValueCounts>
            </PairCounts>
            <PairCounts value="y">
              <TargetValueCounts>
                <TargetValueCount value="democrat" count="73"/>
                <TargetValueCount value="republican" count="23"/>
              </TargetValueCounts>
            </PairCounts>
          </BayesInput>
          <BayesInput fieldName="V2"> *
          <BayesInput fieldName="V3"> *
        </BayesInputs>
        <BayesOutput fieldName="Class">
          <TargetValueCounts>
            <TargetValueCount value="democrat" count="124"/>
            <TargetValueCount value="republican" count="108"/>
          </TargetValueCounts>
        </BayesOutput>
  • 37.
    PMML Scoring for Naïve Bayes
    Definition of elements:
    • DataDictionary: definitions of the fields used in mining models (Class, V1, V2, V3)
    • NaiveBayesModel: indicates that this is a Naïve Bayes PMML model
    • MiningSchema: lists the fields used in that model. Class is the "predicted" field; V1, V2 and V3 are "active" predictor fields
    • Output: describes the set of result values that can be returned from a model
  • 38.
    PMML Scoring for Naïve Bayes
    Definition of elements (ctd.):
    • BayesInputs: for each input field and each of its values, contains the counts of each target value
    • BayesOutput: contains the counts associated with the values of the target field
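To make the role of the counts concrete, here is a minimal scoring sketch that uses only the numbers visible on the slides (the V1 pair counts and the class totals; the V2 and V3 counts are elided in the source, so they are left out). It is not the scoring engine described in the deck: `NaiveBayesCountsSketch` and `score` are hypothetical names, and treating `threshold` as the substitute for a zero conditional probability is an assumption based on the PMML specification.

```scala
object NaiveBayesCountsSketch {
  // Class totals from the BayesOutput element.
  val classCounts: Map[String, Double] = Map("democrat" -> 124.0, "republican" -> 108.0)

  // Pair counts from the BayesInput element for V1: field -> value -> class -> count.
  val pairCounts: Map[String, Map[String, Map[String, Double]]] = Map(
    "V1" -> Map(
      "n" -> Map("democrat" -> 51.0, "republican" -> 85.0),
      "y" -> Map("democrat" -> 73.0, "republican" -> 23.0)))

  // threshold attribute of NaiveBayesModel (assumed to replace zero probabilities).
  val threshold = 0.003

  // P(class | inputs) is proportional to P(class) * product of P(value | class).
  def score(inputs: Map[String, String]): Map[String, Double] = {
    val total = classCounts.values.sum
    val unnormalized = classCounts.map { case (cls, n) =>
      val likelihood = inputs.map { case (field, value) =>
        val c = pairCounts(field)(value)(cls)
        if (c == 0.0) threshold else c / n
      }.product
      (cls, (n / total) * likelihood)
    }
    val z = unnormalized.values.sum // normalization constant
    unnormalized.map { case (cls, p) => (cls, p / z) }
  }

  def main(args: Array[String]): Unit =
    println(score(Map("V1" -> "y")))
}
```

With V1 = y as the only evidence, the class totals cancel and the democrat probability reduces to 73 / (73 + 23), which is about 0.76.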
  • 39.
    PMML Scoring for Naïve Bayes
    Sample Input
    Eg1 - n y y n y y n n n n n n y y y y
    Eg2 - n y n y y y n n n n n y y y n y
    • 1st, 2nd and 3rd columns: predictor variables (attribute "name" in element MiningField)
    • Using these we predict whether the output is Democrat or Republican (PMML element BayesOutput)
  • 40.
    PMML Scoring for Naïve Bayes
    • 3-node Storm cluster of Xeon machines (8 quad-core CPUs, 32 GB RAM, 32 GB swap space; 1 Nimbus, 2 Supervisors)
    Number of records (in millions) | Time taken (seconds)
    0.1 | 4
    0.4 | 7
    1.0 | 12
    2.0 | 21
    10 | 129
    25 | 310
  • 41.
    PMML Scoring for Naïve Bayes
    • 3-node Spark cluster of Xeon machines (8 quad-core CPUs, 32 GB RAM and 32 GB swap space)
    Number of records (in millions) | Time taken
    0.1 | 1 min 47 sec
    0.2 | 3 min 35 sec
    0.4 | 6 min 40 sec
    1.0 | 35 min 17 sec
    10 | More than 3 hrs
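The two benchmark tables are easier to compare as throughput. The sketch below only redoes arithmetic on the reported numbers; the Spark times are converted to seconds by hand, the "more than 3 hrs" row is left out because it has no exact value, and the object and function names are hypothetical.

```scala
object ScoringThroughputSketch {
  // (records in millions, seconds) as reported for the Storm cluster.
  val storm: Seq[(Double, Double)] =
    Seq((0.1, 4.0), (0.4, 7.0), (1.0, 12.0), (2.0, 21.0), (10.0, 129.0), (25.0, 310.0))

  // (records in millions, seconds) for the Spark rows that have exact times.
  val spark: Seq[(Double, Double)] =
    Seq((0.1, 107.0), (0.2, 215.0), (0.4, 400.0), (1.0, 2117.0))

  def recordsPerSecond(millions: Double, seconds: Double): Double =
    millions * 1e6 / seconds

  // Ratio of Spark time to Storm time for a record count present in both tables.
  def slowdown(millions: Double): Option[Double] =
    for {
      (_, stormSec) <- storm.find(_._1 == millions)
      (_, sparkSec) <- spark.find(_._1 == millions)
    } yield sparkSec / stormSec

  def main(args: Array[String]): Unit = {
    storm.foreach { case (m, s) => println(f"Storm $m%5.1f M: ${recordsPerSecond(m, s)}%.0f records/s") }
    spark.foreach { case (m, s) => println(f"Spark $m%5.1f M: ${recordsPerSecond(m, s)}%.0f records/s") }
    println(slowdown(1.0))
  }
}
```

On these figures Storm sustains roughly 80,000 records per second at 25 million records, while the Spark deployment at 1 million records is about 176 times slower than Storm on the same input size.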
  • 42.
    Contents
    Big Data Computations
    • Introduction to ML
    • Characterization
    Berkeley data analytics stack
    • Spark
    Real-time Analytics with Storm
    PMML Scoring for Naïve Bayes
    • PMML Primer
    • Naïve Bayes Primer
    GraphLab
    Hadoop 2.0 (Hadoop YARN)
    Programming Abstractions
  • 43.
    GraphLab: Ideal Engine for Processing Natural Graphs [YL12]
    [YL12] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment 5, 8 (April 2012), 716-727.
    Goals: targeted at machine learning.
    • Model graph dependencies; be asynchronous, iterative and dynamic.
    Data is associated with edges (weights, for instance) and vertices (user profile data, current interests, etc.).
    Update functions live on each vertex.
    • Transform data in the scope of the vertex.
    • Can choose to trigger neighbours (for example, only if the rank changes drastically).
    • Run asynchronously until convergence, with no global barrier.
    Consistency is important in ML algorithms (some, such as collaborative filtering, do not even converge when there are inconsistent updates).
    • GraphLab provides varying levels of consistency: parallelism vs. consistency.
    Several algorithms are implemented, including ALS, K-means, SVM, belief propagation, matrix factorization, Gibbs sampling, SVD and CoEM.
    • Co-EM (Expectation Maximization) algorithm 15x faster than Hadoop MR; on distributed GraphLab, only 0.3% of Hadoop execution time.
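The idea of an update function that triggers its neighbours only when its own value changes noticeably can be sketched on a single machine with a work queue. This is not the GraphLab API: `UpdateFunctionSketch`, `pageRank` and the parameters `damping` and `tolerance` are hypothetical, the graph is a plain adjacency map in which every vertex must appear as a key, and the unnormalized PageRank formula is just one convenient example of a vertex computation.

```scala
import scala.collection.mutable

object UpdateFunctionSketch {
  // Directed graph as adjacency lists: vertex -> out-neighbours.
  type Graph = Map[Int, Seq[Int]]

  def pageRank(out: Graph, damping: Double = 0.85, tolerance: Double = 1e-4): Map[Int, Double] = {
    // In-neighbours of every vertex, derived from the out-edges.
    val inLinks: Map[Int, Seq[Int]] =
      out.keys.map(v => (v, out.keys.filter(u => out(u).contains(v)).toSeq)).toMap
    val rank = mutable.Map(out.keys.map(v => (v, 1.0)).toSeq: _*)
    val queue = mutable.Queue(out.keys.toSeq.sorted: _*)
    val queued = mutable.Set(out.keys.toSeq: _*)

    // No global barrier: vertices are updated one at a time until the queue drains.
    while (queue.nonEmpty) {
      val v = queue.dequeue()
      queued -= v
      // Update function: reads only the data in the scope of v (its in-neighbours).
      val newRank = (1 - damping) + damping * inLinks(v).map(u => rank(u) / out(u).size).sum
      val change = math.abs(newRank - rank(v))
      rank(v) = newRank
      // Dynamic scheduling: trigger the neighbours only if the rank changed noticeably.
      if (change > tolerance)
        for (n <- out(v) if !queued(n)) { queue.enqueue(n); queued += n }
    }
    rank.toMap
  }

  def main(args: Array[String]): Unit =
    println(pageRank(Map(0 -> Seq(1), 1 -> Seq(0), 2 -> Seq(0))))
}
```

A smaller `tolerance` gives a more accurate result at the cost of more updates; in a distributed setting the consistency model would decide which neighbouring updates may run concurrently.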
  • 44.
    GraphLab 2: PowerGraph – Modeling Natural Graphs [1]
    [1] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin (2012). "PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs." Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI '12).
    GraphLab could not scale to the Altavista web graph (2002): 1.4B vertices, 6.7B edges.
    • Most graph-parallel abstractions assume small neighbourhoods, i.e. low-degree vertices.
    • But natural graphs (LinkedIn, Facebook, Twitter) are power-law graphs.
    • Power-law graphs are hard to partition; high-degree vertices limit parallelism.
    PowerGraph provides a new way of partitioning power-law graphs.
    • Edges are tied to machines; vertices (especially high-degree ones) span machines.
    • Execution is split into 3 phases: gather, apply and scatter.
    Triangle counting on the Twitter graph
    • Hadoop MR took 423 minutes on 1536 machines.
    • GraphLab 2 took 1.5 minutes on 1024 cores (64 machines).
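The gather, apply and scatter phases can be illustrated with a small vertex program. The sketch below computes connected components by label propagation on one machine; it is a synchronous toy version and not PowerGraph's implementation, and `GatherApplyScatterSketch` and `components` are hypothetical names. The graph is assumed to be undirected, with every vertex present as a key.

```scala
object GatherApplyScatterSketch {
  // Undirected graph as adjacency sets; every vertex must appear as a key.
  type Graph = Map[Int, Set[Int]]

  // Connected components by label propagation: each vertex ends up with the
  // smallest vertex id in its component.
  def components(graph: Graph): Map[Int, Int] = {
    var label: Map[Int, Int] = graph.keys.map(v => (v, v)).toMap
    var active: Set[Int] = graph.keySet
    while (active.nonEmpty) {
      // Gather: combine neighbour labels with a commutative, associative operator (min).
      val gathered: Map[Int, Int] =
        active.map(v => (v, graph(v).map(u => label(u)).foldLeft(label(v))(_ min _))).toMap
      // Apply: write the new value of each vertex whose label improved.
      val changed = gathered.filter { case (v, l) => l < label(v) }
      label = label ++ changed
      // Scatter: activate the neighbours of the vertices that changed.
      active = changed.keySet.flatMap(v => graph(v))
    }
    label
  }

  def main(args: Array[String]): Unit = {
    val graph: Graph = Map(
      1 -> Set(2), 2 -> Set(1, 3), 3 -> Set(2),
      4 -> Set(5), 5 -> Set(4), 6 -> Set.empty[Int])
    println(components(graph))
  }
}
```

Because the gather operator is commutative and associative, the partial results for a high-degree vertex can be computed on several machines and merged, which is what lets such a vertex span machines.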
  • 45.
    Contents
    Big Data Computations
    • Introduction to ML
    • Characterization
    Berkeley data analytics stack
    • Spark
    Real-time Analytics with Storm
    PMML Scoring for Naïve Bayes
    • PMML Primer
    • Naïve Bayes Primer
    GraphLab
    Hadoop 2.0 (Hadoop YARN)
    Programming Abstractions
  • 46.
    Hadoop YARN Requirements or 1.0 shortcomings
    R1: Scalability
    • Single cluster limitation
    R2: Multi-tenancy
    • Addressed by Hadoop-on-Demand
    • Security, quotas
    R3: Locality awareness
    • Shuffle of records
    R4: Shared cluster utilization
    • Hogging by users
    • Typed slots
    R5: Reliability/Availability
    • Job Tracker bugs
    R6: Iterative Machine Learning
    Vinod Kumar Vavilapalli, Arun C Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin Reed, and Eric Baldeschwieler, "Apache Hadoop YARN: Yet Another Resource Negotiator", ACM Symposium on Cloud Computing, Oct 2013, ACM Press.
  • 47.
  • 48.
    YARN Internals
    Application Master
    • Sends ResourceRequests to the YARN RM
    • Each request captures the containers needed, the resources per container and locality preferences.
    YARN RM
    • Generates tokens and containers
    • Global view of the cluster: monolithic scheduling.
    Node Manager
    • Monitors node health and advertises available resources to the RM through heartbeats.
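The interaction between the three roles can be mimicked with a toy model. The sketch below is purely illustrative and does not use the YARN API: the types `ResourceRequest`, `Container` and `ResourceManager`, their fields, and the first-fit policy with a preferred node are all assumptions, and tokens and security are omitted.

```scala
object YarnAllocationSketch {
  // What an Application Master asks for (hypothetical, simplified fields).
  final case class ResourceRequest(containers: Int, memoryMb: Int, preferredNode: Option[String])

  // What the Resource Manager hands back.
  final case class Container(node: String, memoryMb: Int)

  // Resource Manager with a global view: free memory per node, as last reported.
  final class ResourceManager {
    private var freeMb: Map[String, Int] = Map.empty

    // Node Manager heartbeat: advertise the memory currently available on a node.
    def heartbeat(node: String, availableMb: Int): Unit =
      freeMb = freeMb.updated(node, availableMb)

    // Grant up to request.containers containers, trying the preferred node first.
    def allocate(request: ResourceRequest): Seq[Container] = {
      val granted = Seq.newBuilder[Container]
      for (_ <- 1 to request.containers) {
        val candidates = request.preferredNode.toSeq ++ freeMb.keys.toSeq.sorted
        candidates.find(n => freeMb.getOrElse(n, 0) >= request.memoryMb).foreach { n =>
          freeMb = freeMb.updated(n, freeMb(n) - request.memoryMb)
          granted += Container(n, request.memoryMb)
        }
      }
      granted.result()
    }
  }

  def main(args: Array[String]): Unit = {
    val rm = new ResourceManager
    rm.heartbeat("node-a", 4096)
    rm.heartbeat("node-b", 2048)
    println(rm.allocate(ResourceRequest(3, 2048, Some("node-b"))))
  }
}
```

Because a single scheduler sees all free resources, it can honour the locality preference when possible and fall back to other nodes otherwise, which is the monolithic scheduling the slide refers to.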
  • 49.
    Contents
    Big Data Computations
    • Introduction to ML
    • Characterization
    Berkeley data analytics stack
    • Spark
    Real-time Analytics with Storm
    PMML Scoring for Naïve Bayes
    • PMML Primer
    • Naïve Bayes Primer
    GraphLab
    Hadoop 2.0 (Hadoop YARN)
    Programming Abstractions
  • 50.
    Programming Abstractions
    PMML
    • XML-based representation of the analytical model
    Spark
    • Scala collection over a distributed shared memory system
    GraphLab
    • Gather-Apply-Scatter
    Forge
    • Domain Specific Language
  • 51.
    Forge: Approach to build high performance Domain Specific Languages
    • Domain specific language approach from Stanford.
    • Forge [AKS13]: a meta DSL for high performance DSLs.
    • 40X faster than Spark!
    • OptiML: a DSL for machine learning
    [AKS13] Arvind K. Sujeeth, Austin Gibbons, Kevin J. Brown, HyoukJoong Lee, Tiark Rompf, Martin Odersky, and Kunle Olukotun. 2013. Forge: generating a high performance DSL implementation from a declarative specification. In Proceedings of the 12th international conference on Generative programming: concepts & experiences (GPCE '13). ACM, New York, NY, USA, 145-154.
  • 52.
    Conclusions
    • Beyond the Hadoop Map-Reduce philosophy
    • Optimization and other problems.
    • Real-time computation
    • Processing specialized data structures
    • PMML scoring
    • Spark for batch computations
    • Spark Streaming and Storm for real-time.
    • Allows traditional analytical tools/algorithms to be re-used.
  • 53.
    Thank You!
    Mail • bigdata@impetus.com
    LinkedIn • www.linkedin.com/company/impetus
    Blogs • http://blogs.impetus.com/
    Twitter • @impetustech