Spark Serving
by Stepan Pushkarev
CTO of Hydrosphere.io
Spark Users here?
Data Scientists and Spark Users here?
Why do companies hire data scientists?
To make products smarter.
What is the deliverable of a data scientist and a data engineer?
An academic paper?
An ML model?
An R/Python script?
A Jupyter notebook?
A BI dashboard?
[Diagram: data → Spark cluster → model → ? → web app, with the data scientist in the middle]
val wordCounts = textFile
.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey((a, b) => a + b)
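The same word count can be sketched with plain Scala collections, no cluster required; the Spark version above simply distributes these operations across executors:

```scala
// Local-collections analogue of the RDD word count (a sketch for intuition;
// Spark runs the same flatMap/map/reduce steps in parallel on executors).
val lines = Seq("apache spark", "spark serving")
val wordCounts = lines
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .groupBy { case (word, _) => word }
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
// wordCounts("spark") == 2
```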
[Diagram: the job distributed across Spark executors]
Machine Learning: training + serving pipeline
Training (Estimation) pipeline
[Diagram: preprocess → preprocess → train]
tokenizer
  "apache spark"               → 1
  "hadoop mapreduce"           → 0
  "spark machine learning"     → 1
becomes:
  [apache, spark]              → 1
  [hadoop, mapreduce]          → 0
  [spark, machine, learning]   → 1
hashing tf
  [apache, spark]                    → 1
  [hadoop, mapreduce]                → 0
  [spark, machine, learning]         → 1
becomes:
  [105, 495], [1.0, 1.0]             → 1
  [6, 638, 655], [1.0, 1.0, 1.0]     → 0
  [105, 72, 852], [1.0, 1.0, 1.0]    → 1
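The feature indices come from the hashing trick: each term is hashed modulo the feature count, so no vocabulary needs to be stored. A minimal plain-Scala sketch of the idea (Spark's `HashingTF` uses its own hash function, so these indices will not match the ones on the slide):

```scala
// Hashing-trick sketch: map a term to a column index by hashing.
// Normalize the modulo so the index is always non-negative.
def hashIndex(term: String, numFeatures: Int = 1000): Int =
  ((term.hashCode % numFeatures) + numFeatures) % numFeatures

// Term frequencies for one tokenized document, as a sparse map.
def termFrequencies(tokens: Seq[String], numFeatures: Int = 1000): Map[Int, Double] =
  tokens.groupBy(hashIndex(_, numFeatures))
        .map { case (i, ts) => (i, ts.size.toDouble) }
```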
logistic regression
  [105, 495], [1.0, 1.0]             → 1
  [6, 638, 655], [1.0, 1.0, 1.0]     → 0
  [105, 72, 852], [1.0, 1.0, 1.0]    → 1
learned weights (feature index → coefficient):
  72   → -2.7138781446090308
  94   →  0.9042505436914775
  105  →  3.0835670890496645
  495  →  3.2071722417080766
  722  →  0.9042505436914775
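Once the weights are learned, scoring a single row is just a sparse dot product plus a sigmoid, so it needs no SparkContext. A sketch using the coefficients from the slide (intercept omitted for simplicity):

```scala
// Learned coefficients from the slide: feature index -> weight.
val weights = Map(
  72  -> -2.7138781446090308,
  94  ->  0.9042505436914775,
  105 ->  3.0835670890496645,
  495 ->  3.2071722417080766,
  722 ->  0.9042505436914775
)

// Score one sparse row: dot product with the weights, then a sigmoid.
def score(indices: Seq[Int], values: Seq[Double]): Double = {
  val margin = indices.zip(values)
    .map { case (i, v) => weights.getOrElse(i, 0.0) * v }
    .sum
  1.0 / (1.0 + math.exp(-margin)) // sigmoid
}

// The row [105, 495] with values [1.0, 1.0] scores close to 1,
// matching its label 1 on the slide.
```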
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
val hashingTF = new HashingTF()
.setNumFeatures(1000)
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("features")
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.001)
val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)
model.write.save("/tmp/spark-model")
Prediction pipeline
[Diagram: preprocess → preprocess → predict]
val test = spark.createDataFrame(Seq(
  Tuple1("spark hadoop"),
  Tuple1("hadoop learning")
)).toDF("text")
val model = PipelineModel.load("/tmp/spark-model")
model.transform(test).collect()
./bin/spark-submit …
cluster
data
model
data	
scientist
? web	
app
Pipeline Serving - NOT Model Serving
A model-level API leads to code duplication & inconsistency at the pre-processing stages!
[Diagram: Fraud Detection Model. The Ruby/PHP web app (preprocess, check current user, user logs) and the ML pipeline (preprocess, train, save, score/serve model) each implement their own preprocessing]
https://issues.apache.org/jira/browse/SPARK-16365
https://issues.apache.org/jira/browse/SPARK-13944
[Diagram: data → Spark cluster → model exported as PMML / PFA / MLeap → web app]
- Yet another format lock-in
- Code & state duplication
- Limited extensibility
- Inconsistency
- Extra moving parts
[Diagram: data → Spark cluster → model packaged into a Docker image (model + libs + deps) → web app]
- A fat, all-inclusive Docker image is bad practice
- Every model requires the Docker image to be rebuilt
[Diagram: data → Spark cluster → model → web app calling the Spark API directly]
- Needs Spark running
- High latency, low throughput
[Diagram: data → Spark cluster → model → dedicated serving layer with its own API → web app]
+ Serving skips Spark
+ But re-uses ML algorithms
+ No new formats and APIs
+ Low latency, though not heavily tuned
+ Scalable
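"Serving skips Spark but re-uses the algorithms" can be sketched end to end in plain Scala: tokenize, hash, and score one text with the learned weights, with no SparkContext involved. The hash function here is a stand-in (Spark's `HashingTF` hashes differently, so the indices and scores will not match the trained model); the weights are abbreviated from the slide:

```scala
// Hypothetical single-row serving path: preprocess + score, no Spark.
val weights = Map(105 -> 3.08, 495 -> 3.21, 72 -> -2.71) // abbreviated slide weights

// Stand-in hash; a real serving layer would reuse Spark's own hashing.
def hashIndex(term: String, numFeatures: Int = 1000): Int =
  ((term.hashCode % numFeatures) + numFeatures) % numFeatures

// Tokenize -> hash -> sparse dot product -> sigmoid, for one request.
def serve(text: String): Double = {
  val indices = text.toLowerCase.split(" ").toSeq.map(hashIndex(_))
  val margin  = indices.map(i => weights.getOrElse(i, 0.0)).sum
  1.0 / (1.0 + math.exp(-margin))
}
```

This is the shape of the serving layer in the diagram above: a small, stateless function per request, scaled independently of the training cluster.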
Low level API Challenge
MS Azure
A deliverable for an ML model:
- Single-row serving/scoring layer
- Model formats: XML, JSON, Parquet, POJO, other
- Monitoring, testing, integration
- Large-scale batch processing engine
Zooming out
Unified Serving/Scoring API
[Diagram: repository of models behind the unified API: MLlib model | TensorFlow model | other model]
Real-time Prediction Pipelines
Starting from scratch - SystemML
Multiple execution modes, including the Spark MLContext API, Spark Batch, Hadoop Batch, Standalone, and JMLC.
Demo Time
Thank you
Looking for
- Feedback
- Advisors, mentors & partners
- Pilots and early adopters
Stay in touch
- @hydrospheredata
- https://github.com/Hydrospheredata
- http://hydrosphere.io/
- spushkarev@hydrosphere.io

DataScienceLab 2017: Serving models built on big data with Apache Spark, Stepan Pushkarev