Serverless Machine Learning on Modern Hardware Using Apache Spark with Patrick Stuedi

Recently, there has been increased interest in running analytics and machine learning workloads on top of serverless frameworks in the cloud. The serverless execution model provides fine-grained scaling and unburdens users from having to manage servers, but it also adds substantial performance overhead because all data and all intermediate task state are stored on remote shared storage.

In this talk I first provide a detailed performance breakdown of a machine learning workload using Spark on AWS Lambda. I show how the intermediate state of tasks, such as model updates or broadcast messages, is exchanged using remote storage, and what the resulting performance overheads are. I then illustrate how the same workload performs on-premises using Apache Spark and Apache Crail deployed on a high-performance cluster (100 Gb/s network, NVMe flash, etc.). Serverless computing simplifies the deployment of machine learning applications; the talk shows that performance does not need to be sacrificed.

  1. 1. Patrick Stuedi, Michael Kaufmann, Adrian Schuepbach IBM Research Serverless Machine Learning on Modern Hardware #Res6SAIS
  2. 2. Serverless Computing ● No need to set up or manage a cluster ● Automatic, dynamic, and fine-grained scaling ● Sub-second billing ● Many frameworks: AWS Lambda, Google Cloud Functions, Azure Functions, Databricks Serverless, etc.
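To make the execution model concrete, here is a minimal sketch of a serverless function; it is not from the talk. With AWS Lambda's Java runtime (the aws-lambda-java-core RequestHandler interface, usable from Scala), you deploy just this handler and the platform provisions, scales, and bills it per invocation.

```scala
import com.amazonaws.services.lambda.runtime.{Context, RequestHandler}

// A self-contained function: no cluster to set up, no servers to manage.
// The platform invokes handleRequest once per event and scales out by
// running many instances in parallel.
class WordCount extends RequestHandler[String, Integer] {
  override def handleRequest(input: String, context: Context): Integer =
    input.trim.split("\\s+").count(_.nonEmpty)
}
```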
  8. 8. Challenge: Performance. Example: sorting 100 GB. [Bar chart: runtime in seconds (0-500), ordered from increasing flexibility to increasing performance: AWS λ (64 lambda workers); Databricks Serverless (serverless cluster with autoscaling, min 1 / max 8 machines); Databricks Standard (standard cluster, no autoscaling, 8 machines); Spark On-premise (100 Gb/s Ethernet); Spark On-premise++ (RDMA, NVMe flash, NVMf).] On-premise++: "Running Apache Spark on a High-Performance Cluster using RDMA and NVMe Flash", Spark Summit'17.
  9. 9. Why is it so hard? ● Scheduler: when is it best to add or remove resources? ● Container startup: containers may have to be spun up dynamically ● Storage: input data needs to be fetched from remote storage (e.g., S3), as opposed to compute-local storage such as HDFS ● Data sharing: intermediate data needs to be temporarily stored on remote storage (S3, Redis), which affects operations like shuffle, broadcast, etc.
  11. 11. Example: MapReduce (Cluster). [Diagram: the map stage and the reduce stage both run on the same compute & store nodes; data is mostly written and read locally.]
  12. 12. Example: MapReduce (Serverless). [Diagram: the map and reduce stages run on a dynamically growing/shrinking compute cloud; shuffle data is exclusively written and read remotely through a storage service (e.g., S3, Redis).]
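The remote shuffle path on this slide can be sketched in a few lines. The following is an illustration only, not the talk's implementation: map tasks upload one object per reduce partition to a shared store and reduce tasks fetch them back, so every shuffled byte crosses the network twice. The bucket and key names are made up; the calls are from the AWS SDK for Java v1.

```scala
import com.amazonaws.services.s3.AmazonS3ClientBuilder

object RemoteShuffleSketch {
  private val s3 = AmazonS3ClientBuilder.defaultClient()
  private val bucket = "shuffle-bucket-example" // hypothetical bucket

  // Map side: each map task writes one object per reduce partition,
  // since serverless workers share no local disks.
  def writePartition(mapId: Int, reduceId: Int, data: String): Unit =
    s3.putObject(bucket, s"shuffle/map-$mapId/part-$reduceId", data)

  // Reduce side: a reduce task fetches its partition from every map task.
  def readPartition(numMaps: Int, reduceId: Int): Seq[String] =
    (0 until numMaps).map { m =>
      s3.getObjectAsString(bucket, s"shuffle/map-$m/part-$reduceId")
    }
}
```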
  13. 13. I/O Overhead: Sorting 100GB. [Left chart: runtime in seconds (0-500) for AWS λ, Databricks Serverless, Databricks Standard, Spark On-premise, and Spark On-premise++, broken down into shuffle I/O, compute, and input/output; the AWS λ breakdown is 60% shuffle I/O, 22% compute, 18% input/output. Right chart: zoom on Spark On-premise and Spark On-premise++ (0-60 seconds), with breakdowns of 19%/49%/32% and 38%/3%/59%.] Shuffle overheads are significantly higher when intermediate data is stored remotely.
  15. 15. What about other workloads? Example: SQL, query 77 of the TPC-DS benchmark. [Query plan diagram; the shuffle and broadcast data needs to be stored remotely.]
  19. 19. What about other workloads? Example: iterative ML (e.g., linear regression). [Diagram: workers (W) and a parameter server (PS). In a cluster the parameter server could be co-located with the worker nodes; in the serverless case it needs to be remote. First iteration per worker: read training data, fetch model params, compute, update model. Subsequent iterations: use cached data, fetch model params, compute, update model. On scale-out, a newly added worker must first read remote data, and each model update is a barrier where workers need to wait.]
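The per-iteration pattern on this slide (fetch model parameters, compute, update the model at a barrier) is what a Spark driver program does with broadcast and aggregation. Below is a minimal, self-contained sketch of iterative logistic regression along those lines; it is illustrative, not the CoCoA-based code evaluated later in the talk, and the input path and dimensions are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object LogRegSketch {
  case class Point(x: Array[Double], y: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("logreg-sketch").getOrCreate()
    val sc = spark.sparkContext

    val dim = 10 // placeholder feature dimension
    val points = sc.textFile("s3a://example-bucket/training.csv") // remote input, as on the slide
      .map { line =>
        val cols = line.split(',').map(_.toDouble)
        Point(cols.init, cols.last) // features, then a 0/1 label
      }
      .cache() // serverless caveat: cached partitions vanish when executors scale down

    var w = Array.fill(dim)(0.0)
    val lr = 0.1
    for (_ <- 1 to 20) {
      // "fetch model params": the current model is shipped to every worker
      val bw = sc.broadcast(w)
      // "compute": each worker sums logistic-loss gradients over its partitions
      val grad = points.treeAggregate(Array.fill(dim)(0.0))(
        seqOp = (acc, p) => {
          val margin = p.x.indices.map(i => bw.value(i) * p.x(i)).sum
          val scale = 1.0 / (1.0 + math.exp(-margin)) - p.y
          for (i <- acc.indices) acc(i) += scale * p.x(i)
          acc
        },
        combOp = (a, b) => { for (i <- a.indices) a(i) += b(i); a }
      )
      // "update model": a barrier, since every worker's contribution is needed
      w = Array.tabulate(dim)(i => w(i) - lr * grad(i))
      bw.destroy()
    }
    spark.stop()
  }
}
```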
  20. 20. Can we... ● ...use Spark to run such workloads in a serverless fashion? – Dynamic scaling of compute nodes while jobs are running – No cluster configuration – No startup time overhead ● ...eliminate the performance overheads? – Workloads should run as fast as on a dedicated cluster
  26. 26. Design Options ● Scheduling: – (1) Use the serverless framework to schedule executors (high startup latency!) – (2) Use the serverless framework to schedule tasks (slow!) – (3) Enable sharing of executors among different applications ● Intermediate data: – (1) Executors cooperate with the scheduler to flush data remotely (complex!) – (2) Consistently store all intermediate state remotely
  30. 30. Architecture Overview. [Diagram: the driver sends the job DAG to the HCS scheduler, which sends schedules to executors and launches tasks; drivers and executors register with the scheduler, which assigns applications, and executors get dynamically re-assigned to apps/drivers. Intermediate data is stored on storage servers and tracked by a metadata server (crail.apache.org).]
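Since intermediate state lives in Crail rather than on executor-local disks, executors write and read it through the Crail file API. The sketch below follows the programming example on the Crail website, but method signatures have changed across Crail releases (for example the configuration constructor and the trailing enumerable flag of create), so treat it as illustrative; the path name is made up.

```scala
import java.nio.ByteBuffer
import org.apache.crail.{CrailLocationClass, CrailNodeType, CrailStorageClass, CrailStore}
import org.apache.crail.conf.CrailConfiguration

object CrailSketch {
  def main(args: Array[String]): Unit = {
    // Reads crail-site.conf (metadata server address, storage tiers, ...);
    // older releases construct the configuration directly instead.
    val conf = CrailConfiguration.createConfigurationFromFile()
    val store = CrailStore.newInstance(conf)

    // Write a piece of intermediate state as a Crail data file; the storage
    // class can select DRAM or NVMe tiers depending on the deployment.
    val file = store
      .create("/shuffle/job0/part-0", CrailNodeType.DATAFILE,
              CrailStorageClass.DEFAULT, CrailLocationClass.DEFAULT, true)
      .get().asFile()
    val out = file.getBufferedOutputStream(64)
    out.write(ByteBuffer.wrap("intermediate bytes".getBytes))
    out.close()

    // Another executor looks the file up via the metadata server and reads it.
    val in = store.lookup("/shuffle/job0/part-0").get().asFile()
      .getBufferedInputStream(file.getCapacity)
    val buf = ByteBuffer.allocate(64)
    in.read(buf)
    in.close()
    store.close()
  }
}
```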
  35. 35. HCS Scheduler. [Diagram: the driver requests resources from the HCS scheduler through a REST API; inside HCS, an "oracle" scheduler computes the task schedule, using ML to predict task runtimes based on previous runs, and a resource manager determines the resource set (executors) for a given schedule and assigns the resources; executors attach through a REST API.] "The HCl Scheduler: Going all-in on Heterogeneity", Michael Kaufmann et al., HotCloud'17.
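As one way to picture the oracle's job, here is a deliberately simple, hypothetical sketch, not the actual HCS code (see github.com/zrlio/hcs): it records observed task runtimes per stage and predicts new ones with a per-stage least-squares fit of runtime against input size.

```scala
import scala.collection.mutable

object OracleSketch {
  // Observed (inputBytes, runtimeMs) pairs, keyed by stage name.
  private val history = mutable.Map.empty[String, Vector[(Double, Double)]]

  def record(stage: String, inputBytes: Double, runtimeMs: Double): Unit =
    history(stage) = history.getOrElse(stage, Vector.empty) :+ ((inputBytes, runtimeMs))

  // Fit runtime = a + b * inputBytes over previous runs, then evaluate.
  // Returns None until a stage has at least two observations.
  def predict(stage: String, inputBytes: Double): Option[Double] =
    history.get(stage).filter(_.size >= 2).map { obs =>
      val n = obs.size.toDouble
      val sx = obs.map(_._1).sum
      val sy = obs.map(_._2).sum
      val sxx = obs.map(o => o._1 * o._1).sum
      val sxy = obs.map(o => o._1 * o._2).sum
      val b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
      val a = (sy - b * sx) / n
      a + b * inputBytes
    }
}
```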
  36. 36. Video: Putting things together. Application 1: GridSearch; applications 2 and 3: SQL TPC-DS, all scheduled by the HCS scheduler. Resource view: the fraction of resources each app consumes. Executor view: which app an executor currently runs.
  37. 37. Scheduling Snapshot. [Chart: executors over time, showing which executors run the grid-search and SQL applications.]
  38. 38. Let's look at performance... ● Compute cluster: 8 nodes, IBM Power8 Minsky ● Storage cluster: 8 nodes, IBM Power8 Minsky ● Cluster hardware: – DRAM: 512 GB – Storage: 4x 1.2 TB NVMe SSD – Network: 10 Gb/s Ethernet, 100 Gb/s RoCE – GPU: NVIDIA P100, NVLink ● Workloads: – ML: logistic regression using the CoCoA framework – SQL: TPC-DS
  39. 39. ML: Logistic Regression (KDDA data set, 6.5 GB). [Chart: runtime in seconds (0-2.5), broken down into input, reduce, broadcast, and compute, for Vanilla Cluster (100 Gb/s), Vanilla Serverless, HCS/Crail TCP (100 Gb/s), and HCS/Crail RDMA (100 Gb/s).] Using HCS/Crail we can store intermediate data remotely and reduce the runtime for ML.
  40. 40. TPC-DS: Query 87. [Chart: runtime in seconds (0-14) for Vanilla Cluster, HCS/Crail TCP (10 Gb/s), HCS/Crail TCP (100 Gb/s), and HCS/Crail RDMA (100 Gb/s).] We need a fast network and an optimized software stack to eliminate the storage overheads.
  41. 41. TPC-DS: Query 3. [Chart: runtime in seconds (0-35) for warm serverless, Crail serverless, and Spark cluster.] Executing short-running queries on a warm serverless cluster improves performance.
  42. 42. Conclusion ● Efficient serverless computing is challenging – Local state (e.g., shuffle data, cached input, network state) is lost as the compute cloud scales up/down ● This talk: turning Spark into a serverless framework by – Implementing HCS, a new serverless scheduler – Consistently storing compute state remotely using Apache Crail ● Supports arbitrary Spark workloads with almost no performance overhead – MapReduce, SQL, iterative machine learning ● Implicit support for fast network and storage hardware – e.g., RDMA, NVMe, NVMf
  43. 43. Future Work ● Containerize the platform ● Add support for dynamic re-partitioning on scale events ● Add support for automatic caching ● Add more sophisticated scheduling policies
  44. 44. Links ● Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash, Spark Summit'17, https://tinyurl.com/yd453uzq ● Apache Crail: http://crail.apache.org ● HCS scheduler: github.com/zrlio/hcs ● Spark-HCS: github.com/zrlio/spark-hcs ● Spark-IO: github.com/zrlio/spark-io
  45. 45. Thanks to Michael Kaufmann, Adrian Schuepbach, Jonas Pfefferle, Animesh Trivedi, Bernard Metzler, Ana Klimovic, Yawen Wang
