Serverless Machine Learning on Modern Hardware Using Apache Spark with Patrick Stuedi

Patrick Stuedi, Michael Kaufmann, Adrian Schuepbach
IBM Research
Serverless Machine
Learning on Modern
Hardware
#Res6SAIS

Serverless Computing
●
No need to setup/manage a
cluster
●
Automatic, dynamic and fine-
grained scaling
●
Sub-second billing
●
Many frameworks: AWS
Lambda, Google Cloud
Functions, Azure Functions,
Databricks Serverless, etc.

Challenge: Performance
AWS Lambda Spark Serverless Spark Cloud On-Premise On-Premise
0
100
200
300
400
500
Increasing flexibility
Increasing performance
Example:
Sorting 100GB
AWS λ Databricks Databricks Spark Spark
Serverless Standard On-premise On-premise++
Spark/On-Premise++: Running Apache Spark on a High-Performance Cluster using RDMA and NVMe Flash, Spark Summit’17
Runtime[seconds]

0
100
200
300
400
500
Example:
Sorting 100GB
64 lambda
workers
Runtime[seconds]

0
100
200
300
400
500
Example:
Sorting 100GBserverless cluster
with autoscaling
min machines: 1
max machines: 864 lambda
workers
Runtime[seconds]

0
100
200
300
400
500
Example:
with autoscaling
min machines: 1
workers standard cluster
no autoscaling
8 machines
Runtime[seconds]

0
100
200
300
400
500
Example:
with autoscaling
min machines: 1
no autoscaling
8 machines
100Gb/s
Ethernet
Runtime[seconds]

0
100
200
300
400
500
Example:
with autoscaling
min machines: 1
no autoscaling
8 machines
100Gb/s
Ethernet
RDMA,
NVMe flash,
NVMeF
Runtime[seconds]

put your #assignedhashtag here by setting the footer in view-header/footer●
Scheduler: when to best add/remove resources?
●
Container startup: may have to dynamically spin up containers
●
Storage: input data needs to be fetched from remote storage
(e.g., S3)
– As opposed to compute-local storage such as HDFS
●
Data sharing: intermediate needs to be temporarily stored on
remote storage (S3, Redis)
– Affects operations like shuffle, broadcast, etc.,
Why is it so hard?

Example: MapReduce (Cluster)
Map
Stage
Reduce
Stage
Compute &
Store Nodes
Compute &
Store Nodes
Compute &
Store Nodes
data is mostly
written and
read locally

Map
Stage
Reduce
Stage
Dynamically
growing/shrinking
compute cloud
Storage Service
(e.g, S3, Redis)
Shuffle
data is
exclusively
written and
read remotely
Example: MapReduce (Serverless)

I/O Overhead: Sorting 100GB
AWS Lambda Spark Serverless Spark Cloud Spark Cluster Spark HPC
0
100
200
300
400
500 Shuffle I/O
Compute
Input/Output
60%
22%
18%
Spark Cluster Spark HPC
0
10
20
30
40
50
60
Shuffle overheads are significantly higher when intermediate data is stored remotely
19%
49%
32%
38%
3%
59%AWS λ Databricks Databricks Spark Spark
Spark Spark
On-premise On-premise++
Runtime[seconds]

What about other workloads?
Example: SQL, Query 77 / TPC-DS benchmark

Example: SQL, Query 77 / TPC-DS benchmark
Shuffle/Broadcast
(needs to be stored remotely)

W W PS W W
could be co-located
with worker nodes
*) read training data
*) fetch model params
*) compute
*) update model
*) use cached data
*) compute
*) update model
Example: Iterative ML (e.g., linear regression)

W W PS W W
could be co-located
with worker nodes
*) compute
*) update model
*) use cached data
*) compute
*) update model

W W PS W W
could be co-located
with worker nodes
Needs to be
remote
W
*) compute
*) update model
*) use cached data
*) compute
*) update model
read remote data
scale
out

W W PS W W
could be co-located
with worker nodes
Needs to be
remote
W
*) compute
*) update model
*) use cached data
*) compute
*) update model
read remote data
scale
out
barrier, need to wait

Can we..
●
..use Spark to run such workloads in a serverless
fashion?
– Dynamic scaling of compute nodes while jobs are running
– No cluster configuration
– No startup time overhead
●
..eliminate the performance overheads?
– Workloads should run as fast as on a dedicated cluster

put your #assignedhashtag here by setting the footer in view-header/footer
●
Scheduling:
– Use serverless framework to schedule executors
– Use serverless framework to schedule tasks
– Enable sharing of executors among different applications
●
Intermediate data:
– Executors cooperate with scheduler to flush data remotely
– Consequently store all intermediate state remotely
Design Options
1
2
3
1
2

●
Scheduling:
●
Intermediate data:
Design Options
High startup
Latency!
1
2
3
1
2

●
Scheduling:
●
Intermediate data:
Design Options
Slow!
High startup
Latency!
1
2
3
1
2

●
Scheduling:
●
Intermediate data:
Design Options
Slow!
High startup
Latency!
Complex!
1
2
3
1
2

Architecture Overview
Driver HCS/
Scheduler
ExecutorStorage
server
Intermediate data
Intermediate
data
send schedule
register
Metadata
server
send job DAG
register
assign
application
register
launch
task
crail.apache.org

Driver HCS/
Scheduler
ExecutorStorage
server
send schedule
register
Metadata
server
send job DAG
register
assign
application
register
launch
task
crail.apache.org
Intermediate
data
Intermediate data

Driver HCS/
Scheduler
ExecutorStorage
server
send schedule
register
Metadata
server
send job DAG
register
assign
application
register
launch
task
Executors get
dynamically
re-assigned to
apps/drivers
crail.apache.org
Intermediate
data
Intermediate data

HCS Scheduler
Executor
Resource
Manager
“Oracle”
Scheduler
Driver
REST API REST API
request resource
assign resource

HCS Scheduler
Executor
Resource
Manager
“Oracle”
Scheduler
Driver
REST API REST API
request resource
assign resource
computes task
schedule

HCS Scheduler
Executor
Resource
Manager
“Oracle”
Scheduler
Driver
REST API REST API
request resource
assign resource
computes task
schedule
Uses ML to
predict task
runtimes
based on
previous runs

HCS Scheduler
Executor
Resource
Manager
“Oracle”
Scheduler
Driver
REST API REST API
request resource
assign resource
determines
resource set
(executors)
for a given
schedule
computes task
schedule
Uses ML to
predict task
runtimes
based on
previous runs

HCS Scheduler
Executor
Resource
Manager
“Oracle”
Scheduler
Driver
REST API REST API
request resource
assign resource
determines
resource set
(executors)
for a given
schedule
computes task
schedule
Uses ML to
predict task
runtimes
based on
previous runs
“The HCl Scheduler:
Going all-in on Heterogeneity”,
Michael Kaufmann et al., HotCloud’17

Video: Putting things together
Application 1:
GridSearch
Application 2:
SQL TPC-DS
Application 3:
SQL TPC-DS
HCL
Scheduler
Resource view:
fraction of resources
each app consumes
Executor view:
Which app an executor
currently runs

Scheduling Snapshot
Executors
Time
Grid Search
SQL

Compute cluster size: 8 nodes: IBM Power8 Minsky
●
Storage cluster size: 8 nodes, IBM Power8 Minsky
●
Cluster hardware:
– DRAM: 512 GB
– Storage: 4x 1.2 TB NVMe SSD
– Network: 10Gb/s Ethernert, 100Gb/s RoCE
– GPU: NVIDIA P100, NVLink
●
Workloads
– ML: Logistic Regression using the CoCoa framework
– SQL: TCP-DS
Let’s look at performance...

ML: Logistic Regression
Vanilla Cluster HCS/CRAIL/TCP HCS/CRAIL/100Gb/s HCS/CRAIL/RDMA
0
0.5
1
1.5
2
2.5
Input
Reduce
Broadcast
Compute
KDDA data set
6.5 GB
Runtime[seconds]
Vanilla
Cluster,100Gb/s
Vanilla
Serverless
HCS/Crail
TCP,100Gb/s
HCS/Crail
RDMA,100Gb/s
Using HCS/Crail we can store intermediate data remotely and reduce the runtime for ML

TPC-DS: Query #87
Spark Cluster Simple Serverless HCS/CRAIL/TCP HCS/CRAIL/RDMA
0
2
4
6
8
10
12
14
Runtime[seconds]
Vanilla
Cluster
HCS/Crail
TCP,10Gb/s
HCS/Crail
TCP,100Gb/s
HCS/Crail
RDMA,100Gb/s
We need a fast network and an optimized software stack to eliminate storage overheads

TPC-DS: Query #3
Warm serverless
Crail Serverless
Spark Cluster
0 5 10 15 20 25 30 35
Runtime [seconds]
Executing short running queries on a warm serverless cluster improves performance

Efficient serverless computing is challenging
– Local state (e.g. shuffle, cached input, network state) is lost as compute cloud scales up/down
●
This talk: turning Spark into a serverless framework by
– Implementing HCS, a new serverless scheduler
– Consequently storing compute state remotely using Apache Crail
●
Supports arbitrary Spark workloads with almost no performance overhead
– MapReduce, SQL, Iterative Machine Learning
●
Implicit support for fast network and storage hardware
– e.g, RDMA, NVMe, NVMf
Conclusion

●
Containerize the platform
●
Add support for dynamic re-partitioning on scale
events
●
Add support for automatic caching
●
Add more sophisticated scheduling policies
Future Work

●
Running Apache Spark on a High-Performance
Cluster Using RDMA and NVMe Flash, Spark
Summit’17, https://tinyurl.com/yd453uzq
●
Apache Crail, http://crail.apache.org
●
HCS Scheduler, github.com/zrlio/hcs
●
Spark-HCS, github.com/zrlio/spark-hcs
●
Spark-IO, github.com/zrlio/spark-io
Links

Thanks to
Michael Kaufmann, Adrian Schuepbach, Jonas Pfefferle,
Animesh Trivedi, Bernard Metzler, Ana Klimovic, Yawen
Wang

Serverless Machine Learning on Modern Hardware Using Apache Spark with Patrick Stuedi

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Serverless Machine Learning on Modern Hardware Using Apache Spark with Patrick Stuedi

Similar to Serverless Machine Learning on Modern Hardware Using Apache Spark with Patrick Stuedi (20)

More from Databricks

More from Databricks (20)

Recently uploaded

Recently uploaded (20)

Serverless Machine Learning on Modern Hardware Using Apache Spark with Patrick Stuedi