Spark &
Zeppelin
Introduction
#NightClazz Spark & ML
10/03/16
Fabrice Sznajderman
Agenda
● Apache Spark
● Apache Zeppelin
Introduction
Who am I?
● Fabrice Sznajderman
○ Java/Scala/Web developer
■ Java/Scala trainer
● BrownBagLunch.fr
Spark
Introduction
Big picture
Spark introduction
What is it about?
● A cluster computing framework
● Open source
● Written in Scala
History
2009 : Project started at UC Berkeley's AMPLab
2010 : Project open-sourced
2013 : Became an Apache project; the Databricks company was founded
2014 : Became a top-level Apache project and the most active
project in the Apache Foundation (500+ contributors)
2014 : Releases of Spark 1.0, 1.1 and 1.2
2015 : Releases of Spark 1.3, 1.4, 1.5 and 1.6
2015 : IBM, SAP and others invest in Spark
2015 : 2,000 registrations at Spark Summit SF, 1,000 at Spark
Summit Amsterdam
2016 : New Spark Summit in San Francisco in June 2016
Multi-language
● Scala
● Java
● Python
● R
Spark Shell
● REPL
● Learn API
● Interactive Analysis
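A quick interactive session, as a minimal sketch (inside the shell, sc is already provided):
// launched with ./bin/spark-shell
val numbers = sc.parallelize(1 to 100)  // distribute a local range
numbers.filter(_ % 2 == 0).count()      // res0: Long = 50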
RDD
Core concept
Definition
● Resilient
● Distributed
● Datasets
Properties
● Immutable
● Serializable
● Can be persisted in RAM and/or on disk
● Simple or complex types
Used like a collection
● DSL
● Monadic type
● Several operators
○ map, filter, count, distinct, flatMap, ...
○ join, groupBy, union, ...
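A small sketch chaining a few of these operators (hypothetical data, not from the deck):
val capitals = sc.parallelize(List(("fr", "Paris"), ("uk", "London")))
val populations = sc.parallelize(List(("fr", 66), ("uk", 65)))
capitals.join(populations).collect()         // e.g. Array((fr,(Paris,66)), (uk,(London,65)))
capitals.union(capitals).distinct().count()  // 2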
Created from
● A collection (List, Set)
● Various file formats
○ json, text, Hadoop SequenceFile, ...
● Various databases
○ JDBC, Cassandra, ...
● Other RDDs
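For instance (a minimal sketch, the file name is hypothetical):
val fromCollection = sc.parallelize(List(1, 2, 3, 4))  // from a Scala collection
val fromFile = sc.textFile("data.txt")                 // from a text file
val fromOtherRdd = fromFile.flatMap(_.split(" "))      // from another RDD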
Sample
val conf = new SparkConf()
.setAppName("sample")
.setMaster("local")
val sc = new SparkContext(conf)
val rdd = sc.textFile("data.csv")
val nb = rdd.map(s => s.length).filter(i => i > 10).count()
Lazy-evaluation
● Intermediate operators (transformations)
○ map, filter, distinct, flatMap, …
● Final operators (actions)
○ count, mean, fold, first, ...
val nb = rdd.map(s => s.length).filter(i => i > 10).count()
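Nothing runs until a final operator is called; a sketch reusing the rdd from the sample above:
val lengths = rdd.map(s => s.length)        // intermediate: only builds the lineage
val longOnes = lengths.filter(i => i > 10)  // still nothing computed
val nb = longOnes.count()                   // final operator: triggers the whole job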
Caching
● Reuse an intermediate result
● The cache operator
● Avoids re-computing
val r = rdd.map(s => s.length).cache()
val nb = r.filter(i => i > 10).count()
val sum = r.filter(i => i > 10).sum()
Distributed
Architecture
Core concept
Run locally
val master = "local"
val master = "local[*]"
val master = "local[4]"
val conf = new SparkConf().setAppName("sample")
.setMaster(master)
val sc = new SparkContext(conf)
Run on cluster
val master = "spark://..."
val conf = new SparkConf().setAppName("sample")
.setMaster(master)
val sc = new SparkContext(conf)
Standalone Cluster
[Diagram: one Spark Master coordinating several Spark Slaves, each slave running executors (E); multiple Spark clients submit applications to the master]
Modules
Core concept
Composed of
● Spark Core
● Spark SQL (DataFrames)
● Spark Streaming
● MLlib (ML Pipelines)
● GraphX
Several data sources
http://prog3.com/article/2015-06-18/2824958
Spark SQL
● Structured data processing
● SQL Language
● DataFrame
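A minimal sketch with the Spark 1.x SQLContext (file name and columns are hypothetical):
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val people = sqlContext.read.json("people.json")  // DataFrame with an inferred schema
people.printSchema()
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 20").show()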
DataFrame 1/4
● A distributed collection of rows
organized into named columns
● An abstraction for selecting,
filtering, aggregating and
plotting structured data
● Provides a schema
● Not an RDD replacement
What?
DataFrame 2/4
● RDDs are more efficient than Hadoop MapReduce
● But RDDs are still too complicated for common tasks
● DataFrames are simpler and faster
Why?
DataFrame 3/4
Optimized
DataFrame 4/4
● Available since Spark 1.3
● The DataFrame API is just an interface
○ The implementation is done once, in the Spark engine
○ All languages benefit from the optimizations without rewriting anything
How?
Spark Streaming
● A framework on top of the RDD and DataFrame APIs
● Real-time data processing
● The RDD becomes a DStream here
● Same as before, but the dataset is not static
Spark Streaming
Internal flow
http://spark.apache.org/docs/latest/img/streaming-flow.png
Spark Streaming
Inputs / Outputs
http://spark.apache.org/docs/latest/img/streaming-arch.png
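A classic word-count sketch over a socket stream (host and port are hypothetical):
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))       // 5-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)  // DStream[String]
lines.flatMap(_.split(" "))
     .map(word => (word, 1))
     .reduceByKey(_ + _)
     .print()                                        // same operators, but on a live stream
ssc.start()
ssc.awaitTermination()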
Spark MLlib
● Makes practical machine learning scalable and easy
● Provides common learning algorithms & utilities
Spark MLlib
● Divided into 2 packages
○ spark.mllib
○ spark.ml
Spark MLlib
● The original API, based on RDDs
● Each model has its own
interface
spark.mllib
Spark MLlib
spark.mllib
val sc: SparkContext = ??? // init the SparkContext
val Array(trainingData, checkData) = sc.textFile("train.csv")
  /* transform each line into a LabeledPoint */
  .randomSplit(Array(0.98, 0.02))
val model = RandomForest.trainClassifier(
  trainingData,
  10,              // numClasses
  Map[Int, Int](), // categoricalFeaturesInfo
  30,              // numTrees
  "auto",          // featureSubsetStrategy
  "gini",          // impurity
  7,               // maxDepth
  100,             // maxBins
  0)               // seed
val prediction = model.predict(...)
val sc: SparkContext = ??? // init the SparkContext
val Array(trainingData, checkData) = sc.textFile("train.csv")
  /* transform each line into a LabeledPoint */
  .randomSplit(Array(0.98, 0.02))
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .run(trainingData)
val prediction = model.predict(...)
Each model exposes its own
interface
Spark MLlib
● Provides a uniform set of high-level APIs
● Built on top of DataFrames
● Pipeline concepts
○ Transformer
○ Estimator
○ Pipeline
spark.ml
Spark MLlib
spark.ml
● Transformer : transform(DF)
○ maps a DataFrame by adding a new column
○ e.g. predicts labels and adds the result as a new column
● Estimator : fit(DF)
○ a learning algorithm
○ produces a model from a DataFrame
Spark MLlib
spark.ml
● Pipeline
○ a sequence of stages (Transformers or Estimators)
○ run in a specific order
Spark MLlib
spark.ml
val training:DataFrame = ???
val test:DataFrame = ???
val lr = new LogisticRegression()
lr.setMaxIter(10).setRegParam(0.01)
//training model
val model1 = lr.fit(training)
//prediction on data test
model1.transform(test)
Spark MLlib
spark.ml
val training:DataFrame = ???
val test:DataFrame = ???
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
val hashingTF = new HashingTF()
.setNumFeatures(1000)
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("features")
val lr = new RandomForestClassifier()
/*.add parameter*/
val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF, lr))
// Fit the pipeline to training documents.
val model = pipeline.fit(training)
model.transform(test)
val training:DataFrame = ???
val test:DataFrame = ???
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
val hashingTF = new HashingTF()
.setNumFeatures(1000)
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("features")
val lr = new LogisticRegression()
/*.add parameter*/
val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF, lr))
// Fit the pipeline to training documents.
val model = pipeline.fit(training)
model.transform(test)
Different models
Same way to create the pipeline
Zeppelin
Introduction
Big picture
Zeppelin introduction
What is it about?
● "A web-based notebook that enables interactive data analytics"
● 100% open source
● Undergoing Apache incubation
Multi-purpose
● Data Ingestion
● Data Discovery
● Data Analytics
● Data Visualization &
Collaboration
Multiple language backends
● Scala
● Shell
● Python
● Markdown
● your own language, by creating your own interpreter
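Each paragraph starts with an interpreter directive; a minimal sketch of a note (the data.txt path is hypothetical):
%md ## Quick word count

%spark
val words = sc.textFile("data.txt").flatMap(_.split(" "))
words.map((_, 1)).reduceByKey(_ + _).take(10)

%sh
ls -l /tmp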
Data visualization
An easy way to build graphs from data
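For example, a %sql paragraph returns a result set that Zeppelin can render as a table, bar, pie or line chart with one click (people is a hypothetical registered temp table):
%sql
SELECT age, count(1) AS nb
FROM people
GROUP BY age
ORDER BY age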
Demo
Thank you