Spark Serving
by Stepan Pushkarev
CTO of Hydrosphere.io
Spark Users here?
Data Scientists and Spark Users here?
Why do companies hire data scientists?
To make products smarter.
What is the deliverable of a data scientist and a data engineer?
An academic paper?
An ML model?
An R/Python script?
A Jupyter notebook?
A BI dashboard?
[Diagram: data → Spark cluster → model → ? → web app, with the data scientist in the middle]
val wordCounts = textFile
.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey((a, b) => a + b)
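The same word count can be sketched with plain Scala collections, no cluster required; the Spark version above simply distributes these operations across executors:

```scala
// Local-collections analogue of the RDD word count (a sketch for intuition;
// Spark runs the same flatMap/map/reduce steps in parallel on executors).
val lines = Seq("apache spark", "spark serving")
val wordCounts = lines
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .groupBy { case (word, _) => word }
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
// wordCounts("spark") == 2
```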
[Diagram: the job distributed across Spark executors]
Machine Learning: training + serving pipeline
Training (Estimation) pipeline
[Diagram: preprocess → preprocess → train]
tokenizer
  "apache spark"               → 1
  "hadoop mapreduce"           → 0
  "spark machine learning"     → 1
becomes:
  [apache, spark]              → 1
  [hadoop, mapreduce]          → 0
  [spark, machine, learning]   → 1
hashing tf
  [apache, spark]                    → 1
  [hadoop, mapreduce]                → 0
  [spark, machine, learning]         → 1
becomes:
  [105, 495], [1.0, 1.0]             → 1
  [6, 638, 655], [1.0, 1.0, 1.0]     → 0
  [105, 72, 852], [1.0, 1.0, 1.0]    → 1
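The feature indices come from the hashing trick: each term is hashed modulo the feature count, so no vocabulary needs to be stored. A minimal plain-Scala sketch of the idea (Spark's `HashingTF` uses its own hash function, so these indices will not match the ones on the slide):

```scala
// Hashing-trick sketch: map a term to a column index by hashing.
// Normalize the modulo so the index is always non-negative.
def hashIndex(term: String, numFeatures: Int = 1000): Int =
  ((term.hashCode % numFeatures) + numFeatures) % numFeatures

// Term frequencies for one tokenized document, as a sparse map.
def termFrequencies(tokens: Seq[String], numFeatures: Int = 1000): Map[Int, Double] =
  tokens.groupBy(hashIndex(_, numFeatures))
        .map { case (i, ts) => (i, ts.size.toDouble) }
```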
logistic regression
  [105, 495], [1.0, 1.0]             → 1
  [6, 638, 655], [1.0, 1.0, 1.0]     → 0
  [105, 72, 852], [1.0, 1.0, 1.0]    → 1
learned weights (feature index → coefficient):
  72   → -2.7138781446090308
  94   →  0.9042505436914775
  105  →  3.0835670890496645
  495  →  3.2071722417080766
  722  →  0.9042505436914775
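Once the weights are learned, scoring a single row is just a sparse dot product plus a sigmoid, so it needs no SparkContext. A sketch using the coefficients from the slide (intercept omitted for simplicity):

```scala
// Learned coefficients from the slide: feature index -> weight.
val weights = Map(
  72  -> -2.7138781446090308,
  94  ->  0.9042505436914775,
  105 ->  3.0835670890496645,
  495 ->  3.2071722417080766,
  722 ->  0.9042505436914775
)

// Score one sparse row: dot product with the weights, then a sigmoid.
def score(indices: Seq[Int], values: Seq[Double]): Double = {
  val margin = indices.zip(values)
    .map { case (i, v) => weights.getOrElse(i, 0.0) * v }
    .sum
  1.0 / (1.0 + math.exp(-margin)) // sigmoid
}

// The row [105, 495] with values [1.0, 1.0] scores close to 1,
// matching its label 1 on the slide.
```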
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
val hashingTF = new HashingTF()
.setNumFeatures(1000)
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("features")
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.001)
val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)
model.write.save("/tmp/spark-model")
Prediction pipeline
[Diagram: preprocess → preprocess → predict]
val test = spark.createDataFrame(Seq(
  Tuple1("spark hadoop"),
  Tuple1("hadoop learning")
)).toDF("text")
val model = PipelineModel.load("/tmp/spark-model")
model.transform(test).collect()
./bin/spark-submit …
cluster
data
model
data	
scientist
? web	
app
Pipeline Serving - NOT Model Serving
A model-level API leads to code duplication & inconsistency at the pre-processing stages!
[Diagram: Fraud Detection Model. The Ruby/PHP web app (preprocess, check current user, user logs) and the ML pipeline (preprocess, train, save, score/serve model) each implement their own preprocessing]
https://issues.apache.org/jira/browse/SPARK-16365
https://issues.apache.org/jira/browse/SPARK-13944
[Diagram: data → Spark cluster → model exported as PMML / PFA / MLeap → web app]
- Yet another format lock-in
- Code & state duplication
- Limited extensibility
- Inconsistency
- Extra moving parts
[Diagram: data → Spark cluster → model packaged into a Docker image (model + libs + deps) → web app]
- A fat, all-inclusive Docker image is bad practice
- Every model requires the Docker image to be rebuilt
[Diagram: data → Spark cluster → model → web app calling the Spark API directly]
- Needs Spark running
- High latency, low throughput
[Diagram: data → Spark cluster → model → dedicated serving layer with its own API → web app]
+ Serving skips Spark
+ But re-uses ML algorithms
+ No new formats and APIs
+ Low latency, though not heavily tuned
+ Scalable
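"Serving skips Spark but re-uses the algorithms" can be sketched end to end in plain Scala: tokenize, hash, and score one text with the learned weights, with no SparkContext involved. The hash function here is a stand-in (Spark's `HashingTF` hashes differently, so the indices and scores will not match the trained model); the weights are abbreviated from the slide:

```scala
// Hypothetical single-row serving path: preprocess + score, no Spark.
val weights = Map(105 -> 3.08, 495 -> 3.21, 72 -> -2.71) // abbreviated slide weights

// Stand-in hash; a real serving layer would reuse Spark's own hashing.
def hashIndex(term: String, numFeatures: Int = 1000): Int =
  ((term.hashCode % numFeatures) + numFeatures) % numFeatures

// Tokenize -> hash -> sparse dot product -> sigmoid, for one request.
def serve(text: String): Double = {
  val indices = text.toLowerCase.split(" ").toSeq.map(hashIndex(_))
  val margin  = indices.map(i => weights.getOrElse(i, 0.0)).sum
  1.0 / (1.0 + math.exp(-margin))
}
```

This is the shape of the serving layer in the diagram above: a small, stateless function per request, scaled independently of the training cluster.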
Low level API Challenge
MS Azure
A deliverable for an ML model:
- Single-row serving/scoring layer
- Model formats: XML, JSON, Parquet, POJO, other
- Monitoring, testing, integration
- Large-scale batch processing engine
Zooming out
Unified Serving/Scoring API
[Diagram: repository of models behind the unified API: MLlib model | TensorFlow model | other model]
Real-time Prediction Pipelines
Starting from scratch - SystemML
Multiple execution modes, including the Spark MLContext API, Spark Batch, Hadoop Batch, Standalone, and JMLC.
Demo Time
Thank you
Looking for
- Feedback
- Advisors, mentors & partners
- Pilots and early adopters
Stay in touch
- @hydrospheredata
- https://github.com/Hydrospheredata
- http://hydrosphere.io/
- spushkarev@hydrosphere.io

DataScienceLab 2017: Serving models built on big data with Apache Spark, Stepan Pushkarev