DataScienceLab2017: Serving Models Built on Big Data with Apache Spark (Stepan Pushkarev)
Published on: DataScience Lab, May 13, 2017

Serving models built on big data with Apache Spark
Stepan Pushkarev (GM (Kazan) at Provectus / CTO at Hydrosphere.io)
Once data has been prepared and models have been trained on big data with Apache Spark, the question becomes how to use the trained models in real applications. Beyond the model itself, it is important not to forget the entire data pre-processing pipeline, which must reach production exactly as the data scientist designed and implemented it. Solutions such as PMML/PFA, based on exporting and importing the model and algorithm, have obvious drawbacks and limitations. In this talk we propose an alternative approach that simplifies putting models and pipelines into real production applications.
All materials are available at: http://datascience.in.ua/report2017

Published in: Technology
  1. Spark Serving, by Stepan Pushkarev, CTO of Hydrosphere.io
  2. Spark users here?
  3. Data scientists and Spark users here?
  4. Why do companies hire data scientists?
  5. Why do companies hire data scientists? To make products smarter.
  6. What is the deliverable of a data scientist and a data engineer?
  7. What is the deliverable of a data scientist? An academic paper? An ML model? An R/Python script? A Jupyter notebook? A BI dashboard?
  8. [Diagram: a data scientist uses data and a cluster to build a model; a question mark stands between the model and the web app.]
  9. val wordCounts = textFile
       .flatMap(line => line.split(" "))  // split each line into words
       .map(word => (word, 1))            // pair every word with a count of 1
       .reduceByKey((a, b) => a + b)      // sum the counts per word
     [Diagram: the computation fans out across five executors.]
  10. Machine Learning: training + serving
  11. Training (estimation) pipeline: preprocess → preprocess → train
  12. tokenizer: labeled raw text becomes labeled token arrays.
        "apache spark"            1  →  [apache, spark]             1
        "hadoop mapreduce"        0  →  [hadoop, mapreduce]         0
        "spark machine learning"  1  →  [spark, machine, learning]  1
  13. hashing tf: labeled token arrays become labeled sparse term-frequency vectors (indices, values).
        [apache, spark]             1  →  [105, 495], [1.0, 1.0]           1
        [hadoop, mapreduce]         0  →  [6, 638, 655], [1.0, 1.0, 1.0]   0
        [spark, machine, learning]  1  →  [105, 72, 852], [1.0, 1.0, 1.0]  1
  14. logistic regression: labeled sparse vectors become per-feature coefficients.
        input rows:
          [105, 495], [1.0, 1.0]           1
          [6, 638, 655], [1.0, 1.0, 1.0]   0
          [105, 72, 852], [1.0, 1.0, 1.0]  1
        learned coefficients (feature index → weight):
          72   →  -2.7138781446090308
          94   →   0.9042505436914775
          105  →   3.0835670890496645
          495  →   3.2071722417080766
          722  →   0.9042505436914775
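      At serving time, scoring one row with these coefficients is just a sparse dot product followed by a sigmoid. A minimal plain-Scala sketch, assuming the index/weight pairs from this slide and a zero intercept (the slide shows none):

        // Sparse coefficients from the slide: feature index -> learned weight.
        val weights = Map(
          72  -> -2.7138781446090308,
          94  ->  0.9042505436914775,
          105 ->  3.0835670890496645,
          495 ->  3.2071722417080766,
          722 ->  0.9042505436914775)

        // P(label = 1) for a row given as parallel index/value arrays.
        def score(indices: Seq[Int], values: Seq[Double]): Double = {
          val margin = indices.zip(values)
            .map { case (i, v) => weights.getOrElse(i, 0.0) * v }
            .sum
          1.0 / (1.0 + math.exp(-margin)) // sigmoid
        }

        score(Seq(105, 495), Seq(1.0, 1.0)) // the [apache, spark] row: ~0.998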
  15. val tokenizer = new Tokenizer()
        .setInputCol("text")
        .setOutputCol("words")
      val hashingTF = new HashingTF()
        .setNumFeatures(1000)
        .setInputCol(tokenizer.getOutputCol)
        .setOutputCol("features")
      val lr = new LogisticRegression()
        .setMaxIter(10)
        .setRegParam(0.001)
      val pipeline = new Pipeline()
        .setStages(Array(tokenizer, hashingTF, lr))
      val model = pipeline.fit(training)
      model.write.save("/tmp/spark-model")
  16. Prediction pipeline: preprocess → preprocess → predict
  17. val test = spark.createDataFrame(Seq(
        Tuple1("spark hadoop"),
        Tuple1("hadoop learning")
      )).toDF("text")
      val model = PipelineModel.load("/tmp/spark-model")
      model.transform(test).collect()
  18. ./bin/spark-submit …
  19. [Diagram again: the trained model lives on the cluster, and the question mark between it and the web app remains.]
  20. Pipeline serving, NOT model serving. A model-level API leads to code duplication and inconsistency at the pre-processing stages! [Diagram: a Ruby/PHP web app re-implements preprocessing to check the current user against a fraud detection model, while the ML pipeline runs its own preprocess and train steps over the user logs, then saves and scores/serves the model.]
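      A minimal sketch of the pipeline-level alternative, assuming a live SparkSession `spark` and the pipeline saved on slide 15: the caller sends raw text, and the loaded PipelineModel applies its own tokenizer and hashingTF stages, so nothing is re-implemented in Ruby/PHP.

        import org.apache.spark.ml.PipelineModel

        val model = PipelineModel.load("/tmp/spark-model")

        // The whole pipeline, preprocessing included, runs behind one call.
        def classify(text: String): Double = {
          val df = spark.createDataFrame(Seq(Tuple1(text))).toDF("text")
          model.transform(df).select("prediction").head.getDouble(0)
        }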
  21. https://issues.apache.org/jira/browse/SPARK-16365
      https://issues.apache.org/jira/browse/SPARK-13944
  22. Export formats: PMML, PFA, MLeap. [Diagram: the model is exported from the cluster in an interchange format that the web app consumes.]
      - Yet another format lock
      - Code & state duplication
      - Limited extensibility
      - Inconsistency
      - Extra moving parts
  23. Docker packaging. [Diagram: a Docker image bundling the model, libs, and deps sits between the cluster and the web app.]
      - A fat, all-inclusive Docker image is bad practice
      - Every model requires a new image to be rebuilt
  24. An API on the Spark cluster itself. [Diagram: the web app calls an API exposed by the running cluster.]
      - Needs Spark running
      - High latency, low throughput
  25. A dedicated serving layer. [Diagram: the web app calls a serving API that loads the model directly, alongside the cluster's own API.]
      + Serving skips Spark
      + But re-uses the ML algorithms
      + No new formats and APIs
      + Low latency, though not super-tuned
      + Scalable
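      A minimal sketch of a Spark-free scorer, to make "serving skips Spark" concrete. Everything here is illustrative: `hashTerm` is a stand-in that would have to match Spark's murmur3-based HashingTF exactly for real parity, and `weights` would be loaded from the saved pipeline instead of being hard-coded.

        import scala.util.hashing.MurmurHash3

        object LocalScorer {
          val numFeatures = 1000
          // Placeholder weights; a real scorer reads these from the saved model.
          val weights: Map[Int, Double] = Map(105 -> 3.08, 495 -> 3.21)

          // Stand-in for HashingTF's term hashing (Spark uses murmur3 with a fixed seed).
          def hashTerm(term: String): Int =
            math.abs(MurmurHash3.stringHash(term)) % numFeatures

          def predict(text: String): Double = {
            // Tokenizer stage: lowercase and split on whitespace, like ml.Tokenizer.
            val tf = text.toLowerCase.split("\\s+").groupBy(hashTerm)
              .map { case (i, terms) => i -> terms.length.toDouble }
            // Logistic regression stage: sparse dot product plus sigmoid.
            val margin = tf.map { case (i, v) => weights.getOrElse(i, 0.0) * v }.sum
            1.0 / (1.0 + math.exp(-margin))
          }
        }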
  26. Low-level API challenge: MS Azure
  27. A deliverable for an ML model. [Diagram: a single-row serving/scoring layer (xml, json, parquet, pojo, other) on top of a large-scale batch processing engine, with monitoring and testing integration around it.]
  28. Zooming out. [Diagram: a unified serving/scoring API backed by a repository holding MLlib, TensorFlow, and other models.]
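      One way to picture that unified API as code. The trait and repository below are illustrative sketches, not Hydrosphere's actual interfaces; the point is that MLlib, TensorFlow, and other runtimes all hide behind one single-row contract.

        // Illustrative unified contract; each framework gets its own adapter.
        trait ServableModel {
          def predict(input: Map[String, Any]): Map[String, Any]
        }

        // Repository keyed by model name; the serving API only ever sees the trait.
        class ModelRepository {
          private var models = Map.empty[String, ServableModel]
          def register(name: String, model: ServableModel): Unit =
            models += name -> model
          def serve(name: String, input: Map[String, Any]): Map[String, Any] =
            models(name).predict(input)
        }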
  29. Real-time prediction pipelines
  30. Starting from scratch: Apache SystemML. Multiple execution modes, including the Spark MLContext API, Spark Batch, Hadoop Batch, Standalone, and JMLC.
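      For context, running a script through SystemML's Spark MLContext looks roughly like this (a sketch following the SystemML MLContext guide; assumes a SparkSession `spark` and SystemML on the classpath):

        import org.apache.sysml.api.mlcontext._
        import org.apache.sysml.api.mlcontext.ScriptFactory._

        val ml = new MLContext(spark)               // bind SystemML to the Spark session
        val script = dml("print('hello SystemML')") // DML runs unchanged across modes
        ml.execute(script)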
  31. Demo Time
  32. Thank you!
      Looking for:
      - Feedback
      - Advisors, mentors & partners
      - Pilots and early adopters
      Stay in touch:
      - @hydrospheredata
      - https://github.com/Hydrospheredata
      - http://hydrosphere.io/
      - spushkarev@hydrosphere.io
