Spark &
Zeppelin
Introduction
#NightClazz Spark & ML
10/03/16
Fabrice Sznajderman
Agenda
● Apache Spark
● Apache Zeppelin
Introduction
Who am I?
● Fabrice Sznajderman
○ Java/Scala/Web developer
■ Java/Scala trainer
● BrownBagLunch.fr
Spark
Introduction
Big picture
Spark introduction
What is it about?
● A cluster computing framework
● Open source
● Written in Scala
History
2009 : Project started at UC Berkeley's AMPLab
2010 : Project open-sourced
2013 : Became an Apache project; the Databricks company was founded
2014 : Became a top-level Apache project and the most active
project in the Apache Foundation (500+ contributors)
2014 : Releases of Spark 1.0, 1.1 and 1.2
2015 : Releases of Spark 1.3, 1.4, 1.5 and 1.6
2015 : IBM, SAP and others invest in Spark
2015 : 2,000 registrations at Spark Summit SF, 1,000 at Spark
Summit Amsterdam
2016 : New Spark Summit in San Francisco in June 2016
Multi-language
● Scala
● Java
● Python
● R
Spark Shell
● REPL
● Learn API
● Interactive Analysis
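A quick interactive session, as a minimal sketch (inside the shell, sc is already provided):
// launched with ./bin/spark-shell
val numbers = sc.parallelize(1 to 100)  // distribute a local range
numbers.filter(_ % 2 == 0).count()      // res0: Long = 50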
RDD
Core concept
Definition
● Resilient
● Distributed
● Datasets
Properties
● Immutable
● Serializable
● Can be persisted in RAM and/or on disk
● Simple or complex types
Used like a collection
● DSL
● Monadic type
● Several operators
○ map, filter, count, distinct, flatMap, ...
○ join, groupBy, union, ...
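A small sketch chaining a few of these operators (hypothetical data, not from the deck):
val capitals = sc.parallelize(List(("fr", "Paris"), ("uk", "London")))
val populations = sc.parallelize(List(("fr", 66), ("uk", 65)))
capitals.join(populations).collect()         // e.g. Array((fr,(Paris,66)), (uk,(London,65)))
capitals.union(capitals).distinct().count()  // 2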
Created from
● A collection (List, Set)
● Various file formats
○ json, text, Hadoop SequenceFile, ...
● Various databases
○ JDBC, Cassandra, ...
● Other RDDs
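For instance (a minimal sketch, the file name is hypothetical):
val fromCollection = sc.parallelize(List(1, 2, 3, 4))  // from a Scala collection
val fromFile = sc.textFile("data.txt")                 // from a text file
val fromOtherRdd = fromFile.flatMap(_.split(" "))      // from another RDD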
Sample
val conf = new SparkConf()
.setAppName("sample")
.setMaster("local")
val sc = new SparkContext(conf)
val rdd = sc.textFile("data.csv")
val nb = rdd.map(s => s.length).filter(i => i > 10).count()
Lazy-evaluation
● Intermediate operators (transformations)
○ map, filter, distinct, flatMap, …
● Final operators (actions)
○ count, mean, fold, first, ...
val nb = rdd.map(s => s.length).filter(i => i > 10).count()
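Nothing runs until a final operator is called; a sketch reusing the rdd from the sample above:
val lengths = rdd.map(s => s.length)        // intermediate: only builds the lineage
val longOnes = lengths.filter(i => i > 10)  // still nothing computed
val nb = longOnes.count()                   // final operator: triggers the whole job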
Caching
● Reuse an intermediate result
● The cache operator
● Avoids re-computing
val r = rdd.map(s => s.length).cache()
val nb = r.filter(i => i > 10).count()
val sum = r.filter(i => i > 10).sum()
Distributed
Architecture
Core concept
Run locally
val master = "local"
val master = "local[*]"
val master = "local[4]"
val conf = new SparkConf().setAppName("sample")
.setMaster(master)
val sc = new SparkContext(conf)
Run on cluster
val master = "spark://..."
val conf = new SparkConf().setAppName("sample")
.setMaster(master)
val sc = new SparkContext(conf)
Standalone Cluster
[Diagram: one Spark Master coordinating several Spark Slaves, each slave running executors (E); multiple Spark clients submit applications to the master]
Modules
Core concept
Composed of
● Spark Core
● Spark SQL (DataFrames)
● Spark Streaming
● MLlib (ML Pipelines)
● GraphX
Several data sources
http://prog3.com/article/2015-06-18/2824958
Spark SQL
● Structured data processing
● SQL Language
● DataFrame
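A minimal sketch with the Spark 1.x SQLContext (file name and columns are hypothetical):
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val people = sqlContext.read.json("people.json")  // DataFrame with an inferred schema
people.printSchema()
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 20").show()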
DataFrame 1/4
● A distributed collection of rows
organized into named columns
● An abstraction for selecting,
filtering, aggregating and
plotting structured data
● Provides a schema
● Not an RDD replacement
What?
DataFrame 2/4
● RDDs are more efficient than Hadoop MapReduce
● But RDDs are still too complicated for common tasks
● DataFrames are simpler and faster
Why?
DataFrame 3/4
Optimized
DataFrame 4/4
● Available since Spark 1.3
● The DataFrame API is just an interface
○ The implementation is done once, in the Spark engine
○ All languages benefit from the optimizations without rewriting anything
How?
Spark Streaming
● A framework on top of the RDD and DataFrame APIs
● Real-time data processing
● The RDD becomes a DStream here
● Same as before, but the dataset is not static
Spark Streaming
Internal flow
http://spark.apache.org/docs/latest/img/streaming-flow.png
Spark Streaming
Inputs / Outputs
http://spark.apache.org/docs/latest/img/streaming-arch.png
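A classic word-count sketch over a socket stream (host and port are hypothetical):
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))       // 5-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)  // DStream[String]
lines.flatMap(_.split(" "))
     .map(word => (word, 1))
     .reduceByKey(_ + _)
     .print()                                        // same operators, but on a live stream
ssc.start()
ssc.awaitTermination()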
Spark MLlib
● Makes practical machine learning scalable and easy
● Provides common learning algorithms & utilities
Spark MLlib
● Divided into 2 packages
○ spark.mllib
○ spark.ml
Spark MLlib
● The original API, based on RDDs
● Each model has its own
interface
spark.mllib
Spark MLlib
spark.mllib
val sc: SparkContext = ??? // init the SparkContext
val Array(trainingData, checkData) = sc.textFile("train.csv")
  /* transform each line into a LabeledPoint */
  .randomSplit(Array(0.98, 0.02))
val model = RandomForest.trainClassifier(
  trainingData,
  10,              // numClasses
  Map[Int, Int](), // categoricalFeaturesInfo
  30,              // numTrees
  "auto",          // featureSubsetStrategy
  "gini",          // impurity
  7,               // maxDepth
  100,             // maxBins
  0)               // seed
val prediction = model.predict(...)
val sc: SparkContext = ??? // init the SparkContext
val Array(trainingData, checkData) = sc.textFile("train.csv")
  /* transform each line into a LabeledPoint */
  .randomSplit(Array(0.98, 0.02))
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .run(trainingData)
val prediction = model.predict(...)
Each model exposes its own
interface
Spark MLlib
● Provides a uniform set of high-level APIs
● Built on top of DataFrames
● Pipeline concepts
○ Transformer
○ Estimator
○ Pipeline
spark.ml
Spark MLlib
spark.ml
● Transformer : transform(DF)
○ maps a DataFrame by adding a new column
○ e.g. predicts labels and adds the result as a new column
● Estimator : fit(DF)
○ a learning algorithm
○ produces a model from a DataFrame
Spark MLlib
spark.ml
● Pipeline
○ a sequence of stages (Transformers or Estimators)
○ run in a specific order
Spark MLlib
spark.ml
val training:DataFrame = ???
val test:DataFrame = ???
val lr = new LogisticRegression()
lr.setMaxIter(10).setRegParam(0.01)
//training model
val model1 = lr.fit(training)
//prediction on data test
model1.transform(test)
Spark MLlib
spark.ml
val training:DataFrame = ???
val test:DataFrame = ???
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
val hashingTF = new HashingTF()
.setNumFeatures(1000)
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("features")
val lr = new RandomForestClassifier()
/*.add parameter*/
val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF, lr))
// Fit the pipeline to training documents.
val model = pipeline.fit(training)
model.transform(test)
val training:DataFrame = ???
val test:DataFrame = ???
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
val hashingTF = new HashingTF()
.setNumFeatures(1000)
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("features")
val lr = new LogisticRegression()
/*.add parameter*/
val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF, lr))
// Fit the pipeline to training documents.
val model = pipeline.fit(training)
model.transform(test)
Different models
Same way to create the pipeline
Zeppelin
Introduction
Big picture
Zeppelin introduction
What is it about?
● "A web-based notebook that enables interactive data analytics"
● 100% open source
● Undergoing Apache incubation
Multi-purpose
● Data Ingestion
● Data Discovery
● Data Analytics
● Data Visualization &
Collaboration
Multiple language backends
● Scala
● Shell
● Python
● Markdown
● your own language, by creating your own interpreter
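Each paragraph starts with an interpreter directive; a minimal sketch of a note (the data.txt path is hypothetical):
%md ## Quick word count

%spark
val words = sc.textFile("data.txt").flatMap(_.split(" "))
words.map((_, 1)).reduceByKey(_ + _).take(10)

%sh
ls -l /tmp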
Data visualization
An easy way to build graphs from data
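For example, a %sql paragraph returns a result set that Zeppelin can render as a table, bar, pie or line chart with one click (people is a hypothetical registered temp table):
%sql
SELECT age, count(1) AS nb
FROM people
GROUP BY age
ORDER BY age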
Demo
Thank you