SlideShare a Scribd company logo
Patrick Stuedi, Michael Kaufmann, Adrian Schuepbach
IBM Research
Serverless Machine
Learning on Modern
Hardware
#Res6SAIS
Serverless Computing
●
No need to setup/manage a
cluster
●
Automatic, dynamic and fine-
grained scaling
●
Sub-second billing
●
Many frameworks: AWS
Lambda, Google Cloud
Functions, Azure Functions,
Databricks Serverless, etc.
Challenge: Performance
AWS Lambda Spark Serverless Spark Cloud On-Premise On-Premise
0
100
200
300
400
500
Increasing flexibility
Increasing performance
Example:
Sorting 100GB
AWS λ Databricks Databricks Spark Spark
Serverless Standard On-premise On-premise++
Spark/On-Premise++: Running Apache Spark on a High-Performance Cluster using RDMA and NVMe Flash, Spark Summit’17
Runtime[seconds]
Challenge: Performance
AWS Lambda Spark Serverless Spark Cloud On-Premise On-Premise
0
100
200
300
400
500
Increasing flexibility
Increasing performance
Example:
Sorting 100GB
64 lambda
workers
AWS λ Databricks Databricks Spark Spark
Serverless Standard On-premise On-premise++
Spark/On-Premise++: Running Apache Spark on a High-Performance Cluster using RDMA and NVMe Flash, Spark Summit’17
Runtime[seconds]
Challenge: Performance
AWS Lambda Spark Serverless Spark Cloud On-Premise On-Premise
0
100
200
300
400
500
Increasing flexibility
Increasing performance
Example:
Sorting 100GBserverless cluster
with autoscaling
min machines: 1
max machines: 864 lambda
workers
AWS λ Databricks Databricks Spark Spark
Serverless Standard On-premise On-premise++
Spark/On-Premise++: Running Apache Spark on a High-Performance Cluster using RDMA and NVMe Flash, Spark Summit’17
Runtime[seconds]
Challenge: Performance
AWS Lambda Spark Serverless Spark Cloud On-Premise On-Premise
0
100
200
300
400
500
Increasing flexibility
Increasing performance
Example:
Sorting 100GBserverless cluster
with autoscaling
min machines: 1
max machines: 864 lambda
workers standard cluster
no autoscaling
8 machines
AWS λ Databricks Databricks Spark Spark
Serverless Standard On-premise On-premise++
Spark/On-Premise++: Running Apache Spark on a High-Performance Cluster using RDMA and NVMe Flash, Spark Summit’17
Runtime[seconds]
Challenge: Performance
AWS Lambda Spark Serverless Spark Cloud On-Premise On-Premise
0
100
200
300
400
500
Increasing flexibility
Increasing performance
Example:
Sorting 100GBserverless cluster
with autoscaling
min machines: 1
max machines: 864 lambda
workers standard cluster
no autoscaling
8 machines
100Gb/s
Ethernet
AWS λ Databricks Databricks Spark Spark
Serverless Standard On-premise On-premise++
Spark/On-Premise++: Running Apache Spark on a High-Performance Cluster using RDMA and NVMe Flash, Spark Summit’17
Runtime[seconds]
Challenge: Performance
AWS Lambda Spark Serverless Spark Cloud On-Premise On-Premise
0
100
200
300
400
500
Increasing flexibility
Increasing performance
Example:
Sorting 100GBserverless cluster
with autoscaling
min machines: 1
max machines: 864 lambda
workers standard cluster
no autoscaling
8 machines
100Gb/s
Ethernet
AWS λ Databricks Databricks Spark Spark
Serverless Standard On-premise On-premise++
RDMA,
NVMe flash,
NVMeF
Spark/On-Premise++: Running Apache Spark on a High-Performance Cluster using RDMA and NVMe Flash, Spark Summit’17
Runtime[seconds]
put your #assignedhashtag here by setting the footer in view-header/footer●
Scheduler: when to best add/remove resources?
●
Container startup: may have to dynamically spin up containers
●
Storage: input data needs to be fetched from remote storage
(e.g., S3)
– As opposed to compute-local storage such as HDFS
●
Data sharing: intermediate needs to be temporarily stored on
remote storage (S3, Redis)
– Affects operations like shuffle, broadcast, etc.,
Why is it so hard?
put your #assignedhashtag here by setting the footer in view-header/footer●
Scheduler: when to best add/remove resources?
●
Container startup: may have to dynamically spin up containers
●
Storage: input data needs to be fetched from remote storage
(e.g., S3)
– As opposed to compute-local storage such as HDFS
●
Data sharing: intermediate needs to be temporarily stored on
remote storage (S3, Redis)
– Affects operations like shuffle, broadcast, etc.,
Why is it so hard?
Example: MapReduce (Cluster)
Map
Stage
Reduce
Stage
Compute &
Store Nodes
Compute &
Store Nodes
Compute &
Store Nodes
data is mostly
written and
read locally
Map
Stage
Reduce
Stage
Dynamically
growing/shrinking
compute cloud
Storage Service
(e.g, S3, Redis)
Shuffle
data is
exclusively
written and
read remotely
Example: MapReduce (Serverless)
I/O Overhead: Sorting 100GB
AWS Lambda Spark Serverless Spark Cloud Spark Cluster Spark HPC
0
100
200
300
400
500 Shuffle I/O
Compute
Input/Output
60%
22%
18%
Spark Cluster Spark HPC
0
10
20
30
40
50
60
Shuffle overheads are significantly higher when intermediate data is stored remotely
19%
49%
32%
38%
3%
59%AWS λ Databricks Databricks Spark Spark
Serverless Standard On-premise On-premise++
Spark Spark
On-premise On-premise++
Runtime[seconds]
What about other workloads?
Example: SQL, Query 77 / TPC-DS benchmark
What about other workloads?
Example: SQL, Query 77 / TPC-DS benchmark
Shuffle/Broadcast
(needs to be stored remotely)
What about other workloads?
W W PS W W
could be co-located
with worker nodes
*) read training data
*) fetch model params
*) compute
*) update model
*) use cached data
*) fetch model params
*) compute
*) update model
Example: Iterative ML (e.g., linear regression)
What about other workloads?
Example: Iterative ML (e.g., linear regression)
W W PS W W
could be co-located
with worker nodes
*) read training data
*) fetch model params
*) compute
*) update model
*) use cached data
*) fetch model params
*) compute
*) update model
What about other workloads?
Example: Iterative ML (e.g., linear regression)
W W PS W W
could be co-located
with worker nodes
Needs to be
remote
W
*) read training data
*) fetch model params
*) compute
*) update model
*) use cached data
*) fetch model params
*) compute
*) update model
read remote data
scale
out
What about other workloads?
Example: Iterative ML (e.g., linear regression)
W W PS W W
could be co-located
with worker nodes
Needs to be
remote
W
*) read training data
*) fetch model params
*) compute
*) update model
*) use cached data
*) fetch model params
*) compute
*) update model
read remote data
scale
out
barrier, need to wait
Can we..
●
..use Spark to run such workloads in a serverless
fashion?
– Dynamic scaling of compute nodes while jobs are running
– No cluster configuration
– No startup time overhead
●
..eliminate the performance overheads?
– Workloads should run as fast as on a dedicated cluster
put your #assignedhashtag here by setting the footer in view-header/footer
●
Scheduling:
– Use serverless framework to schedule executors
– Use serverless framework to schedule tasks
– Enable sharing of executors among different applications
●
Intermediate data:
– Executors cooperate with scheduler to flush data remotely
– Consequently store all intermediate state remotely
Design Options
1
2
3
1
2
put your #assignedhashtag here by setting the footer in view-header/footer
●
Scheduling:
– Use serverless framework to schedule executors
– Use serverless framework to schedule tasks
– Enable sharing of executors among different applications
●
Intermediate data:
– Executors cooperate with scheduler to flush data remotely
– Consequently store all intermediate state remotely
Design Options
High startup
Latency!
1
2
3
1
2
put your #assignedhashtag here by setting the footer in view-header/footer
●
Scheduling:
– Use serverless framework to schedule executors
– Use serverless framework to schedule tasks
– Enable sharing of executors among different applications
●
Intermediate data:
– Executors cooperate with scheduler to flush data remotely
– Consequently store all intermediate state remotely
Design Options
Slow!
High startup
Latency!
1
2
3
1
2
put your #assignedhashtag here by setting the footer in view-header/footer
●
Scheduling:
– Use serverless framework to schedule executors
– Use serverless framework to schedule tasks
– Enable sharing of executors among different applications
●
Intermediate data:
– Executors cooperate with scheduler to flush data remotely
– Consequently store all intermediate state remotely
Design Options
Slow!
High startup
Latency!
1
2
3
1
2
put your #assignedhashtag here by setting the footer in view-header/footer
●
Scheduling:
– Use serverless framework to schedule executors
– Use serverless framework to schedule tasks
– Enable sharing of executors among different applications
●
Intermediate data:
– Executors cooperate with scheduler to flush data remotely
– Consequently store all intermediate state remotely
Design Options
Slow!
High startup
Latency!
Complex!
1
2
3
1
2
put your #assignedhashtag here by setting the footer in view-header/footer
●
Scheduling:
– Use serverless framework to schedule executors
– Use serverless framework to schedule tasks
– Enable sharing of executors among different applications
●
Intermediate data:
– Executors cooperate with scheduler to flush data remotely
– Consequently store all intermediate state remotely
Design Options
Slow!
High startup
Latency!
Complex!
1
2
3
1
2
Architecture Overview
Driver HCS/
Scheduler
ExecutorStorage
server
Intermediate data
Intermediate
data
send schedule
register
Metadata
server
send job DAG
register
assign
application
register
launch
task
crail.apache.org
Architecture Overview
Driver HCS/
Scheduler
ExecutorStorage
server
send schedule
register
Metadata
server
send job DAG
register
assign
application
register
launch
task
crail.apache.org
Intermediate
data
Intermediate data
Architecture Overview
Driver HCS/
Scheduler
ExecutorStorage
server
send schedule
register
Metadata
server
send job DAG
register
assign
application
register
launch
task
crail.apache.org
Intermediate
data
Intermediate data
Architecture Overview
Driver HCS/
Scheduler
ExecutorStorage
server
send schedule
register
Metadata
server
send job DAG
register
assign
application
register
launch
task
Executors get
dynamically
re-assigned to
apps/drivers
crail.apache.org
Intermediate
data
Intermediate data
HCS Scheduler
Executor
Resource
Manager
“Oracle”
Scheduler
Driver
REST API REST API
request resource
assign resource
HCS Scheduler
Executor
Resource
Manager
“Oracle”
Scheduler
Driver
REST API REST API
request resource
assign resource
computes task
schedule
HCS Scheduler
Executor
Resource
Manager
“Oracle”
Scheduler
Driver
REST API REST API
request resource
assign resource
computes task
schedule
Uses ML to
predict task
runtimes
based on
previous runs
HCS Scheduler
Executor
Resource
Manager
“Oracle”
Scheduler
Driver
REST API REST API
request resource
assign resource
determines
resource set
(executors)
for a given
schedule
computes task
schedule
Uses ML to
predict task
runtimes
based on
previous runs
HCS Scheduler
Executor
Resource
Manager
“Oracle”
Scheduler
Driver
REST API REST API
request resource
assign resource
determines
resource set
(executors)
for a given
schedule
computes task
schedule
Uses ML to
predict task
runtimes
based on
previous runs
“The HCl Scheduler:
Going all-in on Heterogeneity”,
Michael Kaufmann et al., HotCloud’17
Video: Putting things together
Application 1:
GridSearch
Application 2:
SQL TPC-DS
Application 3:
SQL TPC-DS
HCL
Scheduler
Resource view:
fraction of resources
each app consumes
Executor view:
Which app an executor
currently runs
Scheduling Snapshot
Executors
Time
Grid Search
SQL
put your #assignedhashtag here by setting the footer in view-header/footer●
Compute cluster size: 8 nodes: IBM Power8 Minsky
●
Storage cluster size: 8 nodes, IBM Power8 Minsky
●
Cluster hardware:
– DRAM: 512 GB
– Storage: 4x 1.2 TB NVMe SSD
– Network: 10Gb/s Ethernert, 100Gb/s RoCE
– GPU: NVIDIA P100, NVLink
●
Workloads
– ML: Logistic Regression using the CoCoa framework
– SQL: TCP-DS
Let’s look at performance...
ML: Logistic Regression
Vanilla Cluster HCS/CRAIL/TCP HCS/CRAIL/100Gb/s HCS/CRAIL/RDMA
0
0.5
1
1.5
2
2.5
Input
Reduce
Broadcast
Compute
KDDA data set
6.5 GB
Runtime[seconds]
Vanilla
Cluster,100Gb/s
Vanilla
Serverless
HCS/Crail
TCP,100Gb/s
HCS/Crail
RDMA,100Gb/s
Using HCS/Crail we can store intermediate data remotely and reduce the runtime for ML
TPC-DS: Query #87
Spark Cluster Simple Serverless HCS/CRAIL/TCP HCS/CRAIL/RDMA
0
2
4
6
8
10
12
14
Runtime[seconds]
Vanilla
Cluster
HCS/Crail
TCP,10Gb/s
HCS/Crail
TCP,100Gb/s
HCS/Crail
RDMA,100Gb/s
We need a fast network and an optimized software stack to eliminate storage overheads
TPC-DS: Query #3
Warm serverless
Crail Serverless
Spark Cluster
0 5 10 15 20 25 30 35
Runtime [seconds]
Executing short running queries on a warm serverless cluster improves performance
put your #assignedhashtag here by setting the footer in view-header/footer●
Efficient serverless computing is challenging
– Local state (e.g. shuffle, cached input, network state) is lost as compute cloud scales up/down
●
This talk: turning Spark into a serverless framework by
– Implementing HCS, a new serverless scheduler
– Consequently storing compute state remotely using Apache Crail
●
Supports arbitrary Spark workloads with almost no performance overhead
– MapReduce, SQL, Iterative Machine Learning
●
Implicit support for fast network and storage hardware
– e.g, RDMA, NVMe, NVMf
Conclusion
put your #assignedhashtag here by setting the footer in view-header/footer
●
Containerize the platform
●
Add support for dynamic re-partitioning on scale
events
●
Add support for automatic caching
●
Add more sophisticated scheduling policies
Future Work
put your #assignedhashtag here by setting the footer in view-header/footer
●
Running Apache Spark on a High-Performance
Cluster Using RDMA and NVMe Flash, Spark
Summit’17, https://tinyurl.com/yd453uzq
●
Apache Crail, http://crail.apache.org
●
HCS Scheduler, github.com/zrlio/hcs
●
Spark-HCS, github.com/zrlio/spark-hcs
●
Spark-IO, github.com/zrlio/spark-io
Links
Thanks to
Michael Kaufmann, Adrian Schuepbach, Jonas Pfefferle,
Animesh Trivedi, Bernard Metzler, Ana Klimovic, Yawen
Wang

More Related Content

What's hot

700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
Evan Chan
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Jen Aman
 
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Databricks
 
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production ScaleGPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
Spark Summit
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
Spark Summit
 
Deep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDeep Learning with Spark and GPUs
Deep Learning with Spark and GPUs
DataWorks Summit
 
Cosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopCosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics Workshop
Databricks
 
Spark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark EcosystemSpark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark Ecosystem
Daniel Rodriguez
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Josef A. Habdank
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
Databricks
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Databricks
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
Databricks
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
Wes McKinney
 
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Databricks
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark
Hubert Fan Chiang
 
Elastify Cloud-Native Spark Application with Persistent Memory
Elastify Cloud-Native Spark Application with Persistent MemoryElastify Cloud-Native Spark Application with Persistent Memory
Elastify Cloud-Native Spark Application with Persistent Memory
Databricks
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
Alpine Data
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
Jen Aman
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Databricks
 
Apache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and PresentApache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and Present
Databricks
 

What's hot (20)

700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
 
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
 
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production ScaleGPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Deep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDeep Learning with Spark and GPUs
Deep Learning with Spark and GPUs
 
Cosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopCosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics Workshop
 
Spark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark EcosystemSpark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark Ecosystem
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
 
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark
 
Elastify Cloud-Native Spark Application with Persistent Memory
Elastify Cloud-Native Spark Application with Persistent MemoryElastify Cloud-Native Spark Application with Persistent Memory
Elastify Cloud-Native Spark Application with Persistent Memory
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
 
Apache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and PresentApache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and Present
 

Similar to Serverless Machine Learning on Modern Hardware Using Apache Spark with Patrick Stuedi

Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
Databricks
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Landon Robinson
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
hadooparchbook
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
Spark Summit
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
Amazon Web Services
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
Eva Tse
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
Chris Purrington
 
Yaetos_Meetup_SparkBCN_v1.pdf
Yaetos_Meetup_SparkBCN_v1.pdfYaetos_Meetup_SparkBCN_v1.pdf
Yaetos_Meetup_SparkBCN_v1.pdf
prevota
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
Rahul Jain
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Databricks
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
Databricks
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Jen Aman
 
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
Amazon Web Services
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
Omid Vahdaty
 
Architectures, Frameworks and Infrastructure
Architectures, Frameworks and InfrastructureArchitectures, Frameworks and Infrastructure
Architectures, Frameworks and Infrastructureharendra_pathak
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and Flink
Timothy Spann
 
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward
 
Integrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache AirflowIntegrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache Airflow
Tatiana Al-Chueyr
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Anant Corporation
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with Spark
Roger Rafanell Mas
 

Similar to Serverless Machine Learning on Modern Hardware Using Apache Spark with Patrick Stuedi (20)

Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Yaetos_Meetup_SparkBCN_v1.pdf
Yaetos_Meetup_SparkBCN_v1.pdfYaetos_Meetup_SparkBCN_v1.pdf
Yaetos_Meetup_SparkBCN_v1.pdf
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
 
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
 
Architectures, Frameworks and Infrastructure
Architectures, Frameworks and InfrastructureArchitectures, Frameworks and Infrastructure
Architectures, Frameworks and Infrastructure
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and Flink
 
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
 
Integrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache AirflowIntegrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache Airflow
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with Spark
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
2023240532
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
pchutichetpong
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTESAdjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Subhajit Sahu
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 

Recently uploaded (20)

一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTESAdjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 

Serverless Machine Learning on Modern Hardware Using Apache Spark with Patrick Stuedi

  • 1. Patrick Stuedi, Michael Kaufmann, Adrian Schuepbach IBM Research Serverless Machine Learning on Modern Hardware #Res6SAIS
  • 2. Serverless Computing ● No need to setup/manage a cluster ● Automatic, dynamic and fine- grained scaling ● Sub-second billing ● Many frameworks: AWS Lambda, Google Cloud Functions, Azure Functions, Databricks Serverless, etc.
  • 3. Challenge: Performance AWS Lambda Spark Serverless Spark Cloud On-Premise On-Premise 0 100 200 300 400 500 Increasing flexibility Increasing performance Example: Sorting 100GB AWS λ Databricks Databricks Spark Spark Serverless Standard On-premise On-premise++ Spark/On-Premise++: Running Apache Spark on a High-Performance Cluster using RDMA and NVMe Flash, Spark Summit’17 Runtime[seconds]
  • 4. Challenge: Performance AWS Lambda Spark Serverless Spark Cloud On-Premise On-Premise 0 100 200 300 400 500 Increasing flexibility Increasing performance Example: Sorting 100GB 64 lambda workers AWS λ Databricks Databricks Spark Spark Serverless Standard On-premise On-premise++ Spark/On-Premise++: Running Apache Spark on a High-Performance Cluster using RDMA and NVMe Flash, Spark Summit’17 Runtime[seconds]
  • 5. Challenge: Performance AWS Lambda Spark Serverless Spark Cloud On-Premise On-Premise 0 100 200 300 400 500 Increasing flexibility Increasing performance Example: Sorting 100GBserverless cluster with autoscaling min machines: 1 max machines: 864 lambda workers AWS λ Databricks Databricks Spark Spark Serverless Standard On-premise On-premise++ Spark/On-Premise++: Running Apache Spark on a High-Performance Cluster using RDMA and NVMe Flash, Spark Summit’17 Runtime[seconds]
  • 6. Challenge: Performance AWS Lambda Spark Serverless Spark Cloud On-Premise On-Premise 0 100 200 300 400 500 Increasing flexibility Increasing performance Example: Sorting 100GBserverless cluster with autoscaling min machines: 1 max machines: 864 lambda workers standard cluster no autoscaling 8 machines AWS λ Databricks Databricks Spark Spark Serverless Standard On-premise On-premise++ Spark/On-Premise++: Running Apache Spark on a High-Performance Cluster using RDMA and NVMe Flash, Spark Summit’17 Runtime[seconds]
  • 7. Challenge: Performance AWS Lambda Spark Serverless Spark Cloud On-Premise On-Premise 0 100 200 300 400 500 Increasing flexibility Increasing performance Example: Sorting 100GBserverless cluster with autoscaling min machines: 1 max machines: 864 lambda workers standard cluster no autoscaling 8 machines 100Gb/s Ethernet AWS λ Databricks Databricks Spark Spark Serverless Standard On-premise On-premise++ Spark/On-Premise++: Running Apache Spark on a High-Performance Cluster using RDMA and NVMe Flash, Spark Summit’17 Runtime[seconds]
  • 8. Challenge: Performance AWS Lambda Spark Serverless Spark Cloud On-Premise On-Premise 0 100 200 300 400 500 Increasing flexibility Increasing performance Example: Sorting 100GBserverless cluster with autoscaling min machines: 1 max machines: 864 lambda workers standard cluster no autoscaling 8 machines 100Gb/s Ethernet AWS λ Databricks Databricks Spark Spark Serverless Standard On-premise On-premise++ RDMA, NVMe flash, NVMeF Spark/On-Premise++: Running Apache Spark on a High-Performance Cluster using RDMA and NVMe Flash, Spark Summit’17 Runtime[seconds]
  • 9. put your #assignedhashtag here by setting the footer in view-header/footer● Scheduler: when to best add/remove resources? ● Container startup: may have to dynamically spin up containers ● Storage: input data needs to be fetched from remote storage (e.g., S3) – As opposed to compute-local storage such as HDFS ● Data sharing: intermediate needs to be temporarily stored on remote storage (S3, Redis) – Affects operations like shuffle, broadcast, etc., Why is it so hard?
  • 10. put your #assignedhashtag here by setting the footer in view-header/footer● Scheduler: when to best add/remove resources? ● Container startup: may have to dynamically spin up containers ● Storage: input data needs to be fetched from remote storage (e.g., S3) – As opposed to compute-local storage such as HDFS ● Data sharing: intermediate needs to be temporarily stored on remote storage (S3, Redis) – Affects operations like shuffle, broadcast, etc., Why is it so hard?
  • 11. Example: MapReduce (Cluster) Map Stage Reduce Stage Compute & Store Nodes Compute & Store Nodes Compute & Store Nodes data is mostly written and read locally
  • 12. Map Stage Reduce Stage Dynamically growing/shrinking compute cloud Storage Service (e.g, S3, Redis) Shuffle data is exclusively written and read remotely Example: MapReduce (Serverless)
  • 13. I/O Overhead: Sorting 100GB AWS Lambda Spark Serverless Spark Cloud Spark Cluster Spark HPC 0 100 200 300 400 500 Shuffle I/O Compute Input/Output 60% 22% 18% Spark Cluster Spark HPC 0 10 20 30 40 50 60 Shuffle overheads are significantly higher when intermediate data is stored remotely 19% 49% 32% 38% 3% 59%AWS λ Databricks Databricks Spark Spark Serverless Standard On-premise On-premise++ Spark Spark On-premise On-premise++ Runtime[seconds]
  • 14. What about other workloads? Example: SQL, Query 77 / TPC-DS benchmark
  • 15. What about other workloads? Example: SQL, Query 77 / TPC-DS benchmark Shuffle/Broadcast (needs to be stored remotely)
  • 16. What about other workloads? W W PS W W could be co-located with worker nodes *) read training data *) fetch model params *) compute *) update model *) use cached data *) fetch model params *) compute *) update model Example: Iterative ML (e.g., linear regression)
  • 17. What about other workloads? Example: Iterative ML (e.g., linear regression) W W PS W W could be co-located with worker nodes *) read training data *) fetch model params *) compute *) update model *) use cached data *) fetch model params *) compute *) update model
  • 18. What about other workloads? Example: Iterative ML (e.g., linear regression) W W PS W W could be co-located with worker nodes Needs to be remote W *) read training data *) fetch model params *) compute *) update model *) use cached data *) fetch model params *) compute *) update model read remote data scale out
  • 19. What about other workloads? Example: Iterative ML (e.g., linear regression) W W PS W W could be co-located with worker nodes Needs to be remote W *) read training data *) fetch model params *) compute *) update model *) use cached data *) fetch model params *) compute *) update model read remote data scale out barrier, need to wait
  • 20. Can we.. ● ..use Spark to run such workloads in a serverless fashion? – Dynamic scaling of compute nodes while jobs are running – No cluster configuration – No startup time overhead ● ..eliminate the performance overheads? – Workloads should run as fast as on a dedicated cluster
  • 21. put your #assignedhashtag here by setting the footer in view-header/footer ● Scheduling: – Use serverless framework to schedule executors – Use serverless framework to schedule tasks – Enable sharing of executors among different applications ● Intermediate data: – Executors cooperate with scheduler to flush data remotely – Consequently store all intermediate state remotely Design Options 1 2 3 1 2
  • 22. put your #assignedhashtag here by setting the footer in view-header/footer ● Scheduling: – Use serverless framework to schedule executors – Use serverless framework to schedule tasks – Enable sharing of executors among different applications ● Intermediate data: – Executors cooperate with scheduler to flush data remotely – Consequently store all intermediate state remotely Design Options High startup Latency! 1 2 3 1 2
  • 23. put your #assignedhashtag here by setting the footer in view-header/footer ● Scheduling: – Use serverless framework to schedule executors – Use serverless framework to schedule tasks – Enable sharing of executors among different applications ● Intermediate data: – Executors cooperate with scheduler to flush data remotely – Consequently store all intermediate state remotely Design Options Slow! High startup Latency! 1 2 3 1 2
  • 24. put your #assignedhashtag here by setting the footer in view-header/footer ● Scheduling: – Use serverless framework to schedule executors – Use serverless framework to schedule tasks – Enable sharing of executors among different applications ● Intermediate data: – Executors cooperate with scheduler to flush data remotely – Consequently store all intermediate state remotely Design Options Slow! High startup Latency! 1 2 3 1 2
  • 25. put your #assignedhashtag here by setting the footer in view-header/footer ● Scheduling: – Use serverless framework to schedule executors – Use serverless framework to schedule tasks – Enable sharing of executors among different applications ● Intermediate data: – Executors cooperate with scheduler to flush data remotely – Consequently store all intermediate state remotely Design Options Slow! High startup Latency! Complex! 1 2 3 1 2
  • 26. put your #assignedhashtag here by setting the footer in view-header/footer ● Scheduling: – Use serverless framework to schedule executors – Use serverless framework to schedule tasks – Enable sharing of executors among different applications ● Intermediate data: – Executors cooperate with scheduler to flush data remotely – Consequently store all intermediate state remotely Design Options Slow! High startup Latency! Complex! 1 2 3 1 2
  • 27. Architecture Overview Driver HCS/ Scheduler ExecutorStorage server Intermediate data Intermediate data send schedule register Metadata server send job DAG register assign application register launch task crail.apache.org
  • 28. Architecture Overview Driver HCS/ Scheduler ExecutorStorage server send schedule register Metadata server send job DAG register assign application register launch task crail.apache.org Intermediate data Intermediate data
  • 29. Architecture Overview Driver HCS/ Scheduler ExecutorStorage server send schedule register Metadata server send job DAG register assign application register launch task crail.apache.org Intermediate data Intermediate data
  • 30. Architecture Overview Driver HCS/ Scheduler ExecutorStorage server send schedule register Metadata server send job DAG register assign application register launch task Executors get dynamically re-assigned to apps/drivers crail.apache.org Intermediate data Intermediate data
  • 32. HCS Scheduler Executor Resource Manager “Oracle” Scheduler Driver REST API REST API request resource assign resource computes task schedule
  • 33. HCS Scheduler Executor Resource Manager “Oracle” Scheduler Driver REST API REST API request resource assign resource computes task schedule Uses ML to predict task runtimes based on previous runs
  • 34. HCS Scheduler Executor Resource Manager “Oracle” Scheduler Driver REST API REST API request resource assign resource determines resource set (executors) for a given schedule computes task schedule Uses ML to predict task runtimes based on previous runs
  • 35. HCS Scheduler Executor Resource Manager “Oracle” Scheduler Driver REST API REST API request resource assign resource determines resource set (executors) for a given schedule computes task schedule Uses ML to predict task runtimes based on previous runs “The HCl Scheduler: Going all-in on Heterogeneity”, Michael Kaufmann et al., HotCloud’17
  • 36. Video: Putting things together Application 1: GridSearch Application 2: SQL TPC-DS Application 3: SQL TPC-DS HCL Scheduler Resource view: fraction of resources each app consumes Executor view: Which app an executor currently runs
  • 38. put your #assignedhashtag here by setting the footer in view-header/footer● Compute cluster size: 8 nodes: IBM Power8 Minsky ● Storage cluster size: 8 nodes, IBM Power8 Minsky ● Cluster hardware: – DRAM: 512 GB – Storage: 4x 1.2 TB NVMe SSD – Network: 10Gb/s Ethernert, 100Gb/s RoCE – GPU: NVIDIA P100, NVLink ● Workloads – ML: Logistic Regression using the CoCoa framework – SQL: TCP-DS Let’s look at performance...
  • 39. ML: Logistic Regression Vanilla Cluster HCS/CRAIL/TCP HCS/CRAIL/100Gb/s HCS/CRAIL/RDMA 0 0.5 1 1.5 2 2.5 Input Reduce Broadcast Compute KDDA data set 6.5 GB Runtime[seconds] Vanilla Cluster,100Gb/s Vanilla Serverless HCS/Crail TCP,100Gb/s HCS/Crail RDMA,100Gb/s Using HCS/Crail we can store intermediate data remotely and reduce the runtime for ML
  • 40. TPC-DS: Query #87 Spark Cluster Simple Serverless HCS/CRAIL/TCP HCS/CRAIL/RDMA 0 2 4 6 8 10 12 14 Runtime[seconds] Vanilla Cluster HCS/Crail TCP,10Gb/s HCS/Crail TCP,100Gb/s HCS/Crail RDMA,100Gb/s We need a fast network and an optimized software stack to eliminate storage overheads
  • 41. TPC-DS: Query #3 Warm serverless Crail Serverless Spark Cluster 0 5 10 15 20 25 30 35 Runtime [seconds] Executing short running queries on a warm serverless cluster improves performance
  • 42. put your #assignedhashtag here by setting the footer in view-header/footer● Efficient serverless computing is challenging – Local state (e.g. shuffle, cached input, network state) is lost as compute cloud scales up/down ● This talk: turning Spark into a serverless framework by – Implementing HCS, a new serverless scheduler – Consequently storing compute state remotely using Apache Crail ● Supports arbitrary Spark workloads with almost no performance overhead – MapReduce, SQL, Iterative Machine Learning ● Implicit support for fast network and storage hardware – e.g, RDMA, NVMe, NVMf Conclusion
  • 43. put your #assignedhashtag here by setting the footer in view-header/footer ● Containerize the platform ● Add support for dynamic re-partitioning on scale events ● Add support for automatic caching ● Add more sophisticated scheduling policies Future Work
  • 44. put your #assignedhashtag here by setting the footer in view-header/footer ● Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash, Spark Summit’17, https://tinyurl.com/yd453uzq ● Apache Crail, http://crail.apache.org ● HCS Scheduler, github.com/zrlio/hcs ● Spark-HCS, github.com/zrlio/spark-hcs ● Spark-IO, github.com/zrlio/spark-io Links
  • 45. Thanks to Michael Kaufmann, Adrian Schuepbach, Jonas Pfefferle, Animesh Trivedi, Bernard Metzler, Ana Klimovic, Yawen Wang