Spark and machine learning in microservices architecture

Spark and Machine Learning in
Microservices Architecture
by Stepan Pushkarev
CTO of Hydrosphere.io

Mission: Accelerate Machine Learning to Production
Opensource Products:
- Mist: Spark Compute as a Service
- ML Lambda: ML Function as a Service
- Sonar: Data and ML Monitoring
Business Model: Subscription services and hands-on
consulting
About

Spark/Kafka Users here?
Machine Learning nerds here?

Stateless Microservices are well studied

Data, reporting and machine learning
architectures are different
● Raw SQL / HiveQL / SQL on Hadoop
● Datawarehouse / Data Lake centric
● Scripts driven ./bin/spark-submit
● Automated with Cron and/or Workflow Managers
● Hosted Notebooks culture
● Traditionally offline / for internal users
● File system aware (HDFS, S3)
● Defined by all-inclusive Hadoop distributions

- Data Pipelines on Microservices
- ML Functions as low latency prediction
services
Agenda

Part 1: Data Pipeline Intuition
Need to transform Source Data into desired shape

Evolution: Scripts driven development
Python
Script
SQL
Script

Evolution: More Scripts
Python
Script
SQL
Script
Python
Script
SQL
Script
Python
Script
SQL
Script

Evolution: Move to Hive/Spark
Spark
Script
Hive
Script
Spark
Script
Python
Script
Hive
Script
SQL
Script

Typical Python/SQL/Hive/Spark Task
1) Read data
2) Transform
3) Save to temp table for the
next task

Problem: Unmanageable State in Shared folder
- Data Flow is not managed. DAG scheduling is a different
objection.
- Who is responsible for schema migration? Task 1, Task 2 or
Manager?
- What folder Task 1 should write to and Task 2 should read from?
- How to manage folders/resources between parallel sessions?
- When / how to clean up shared folder? Another cleanup pipeline?
- How to check that data batch is arrived and valid?
- How to unit test it?
- How to handle errors?

Statesafe Pipelines
1. Get rid of Workflow Manager!
2. Turn black box tasks and scripts into microservices.
3. Use Avro data contracts between stages. Data is also
an API to be standardized, versioned and validated.
4. Segregate black box tasks into (read) (process) and
(write) services.
5. Keep the state in shared folder/topic/session
manageable by framework rather than data engineers.
6. Abstract engineer from data transport and provide a
pure function to deal with.

Zoom in: Data Transport between stages

Zoom in: Statesafe API for Business Logic

On-demand batch pipelines
- Could not be
pre-calculated
- On-demand
parametrized jobs
- Interactive
- Require large
scale processing
- Reporting
- Simulation (pricing,
bank stress testing,
taxi rides)
- Forecasting (ad
campaign, energy
savings, others)
- Ad-hoc analytics tools
for business users

Bad Practice: Database as API
Execute reporting job
Mark Job as complete &
save result
Poll for new tasks
Poll for resultSet a flag and parameters
to build a report

Batch prediction / reporting services

From Vanila Spark to Spark Compute as a
Service
./bin/spark-submit
- Spark Sessions Pool
- REST API Framework
- Data API Framework
- Infrastructure
Integration (EMR,
Hortonworks, etc)

Spark Compute as a Service
DEMO

cluster
data
model
data
scientist
? web
app

Machine Learning: training
+ serving (deployment)

pipeline
Training (Estimation) pipeline
trainpreprocess preprocess

pipeline
Prediction Pipeline
preprocess preprocess

val test = spark.createDataFrame(Seq(
("spark hadoop"),
("hadoop learning")
)).toDF("text")
val model =
PipelineModel.load("/tmp/spark-model")
model.transform(test).collect()

https://issues.apache.org/jira/browse/SPARK-16365
https://issues.apache.org/jira/browse/SPARK-13944

cluster
data
model
data
scientist
web
app
PMML
PFA
MLEAP
- Yet another Format Lock
- Code & state duplication
- Limited extensibility
- Inconsistency
- Extra moving parts

cluster
data
model
data
scientist
web
app
API
API
- Needs Spark Running
- High latency, low
throughput

cluster
data
model
data
scientist
web
app
docker
API
libs
model
Spark ML Local Serving Library:
https://github.com/Hydrospheredata
/spark-ml-serving

Zoo: Models - Runtimes - Standards

API & Logistics
- HTTP/1.1, HTTP/2, gRPC
- Kafka, Flink, Kinesis
- Protobuf, Avro
- Service Discovery
- Pipelining
- Tracing
- Monitoring
- Autoscaling
- Versioning
- A/B, Canary
- Testing
- CPU, GPU

UX: Train anywhere and deploy as a Function

UX: Models and Applications
Applications provide public endpoints for the models
and compositions of the models.

UX: Streaming Applications + Batching

UX: Pipelines, Assembles and BestSLA
Applications

Thank you
Looking for
- Feedback
- Advisors, mentors &
partners
- Pilots and early adopters
Stay in touch
- @hydrospheredata
- https://github.com/Hydrospheredata
- https://hydrosphere.io/
- spushkarev@hydrosphere.io

Spark and machine learning in microservices architecture

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Spark and machine learning in microservices architecture

Similar to Spark and machine learning in microservices architecture (20)

Recently uploaded

Recently uploaded (20)

Spark and machine learning in microservices architecture