
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark


Project Hydrogen is a major Apache Spark initiative to bring state-of-the-art AI and Big Data solutions together. It contains three major projects: 1) barrier execution mode, 2) optimized data exchange, and 3) accelerator-aware scheduling. A basic implementation of barrier execution mode was merged into Apache Spark 2.4.0, and the community is working on the latter two. In this talk, we will present progress updates to Project Hydrogen and discuss the next steps.

First, we will review the barrier execution mode implementation from Spark 2.4.0. It enables developers to embed distributed training jobs properly on a Spark cluster. We will demonstrate distributed AI integrations built on top of it, e.g., Horovod and Distributed TensorFlow. We will also discuss the technical challenges in implementing those integrations and future work. Second, we will outline ongoing work on optimized data exchange. Its target scenario is distributed model inference. We will present how we do performance testing/profiling, where the bottlenecks are, and how to improve the overall throughput on Spark. If time allows, we might also give updates on accelerator-aware scheduling.

Published in: Data & Analytics


  1. Xiangrui Meng, Databricks. Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark. #UnifiedAnalytics #SparkAISummit
  2. About me ● Software Engineer at Databricks ○ machine learning and data science/engineering ● Committer and PMC member of Apache Spark ○ MLlib, SparkR, PySpark, Spark Packages, etc.
  3. About Project Hydrogen: announced last June, Project Hydrogen is a major Spark initiative to unify state-of-the-art AI and big data workloads. Its three pillars: Barrier Execution Mode, Optimized Data Exchange, Accelerator Aware Scheduling.
  4. Why Spark + AI?
  5. Apache Spark: The First Unified Analytics Engine. The Spark Core Engine unifies big data processing (ETL + SQL + Streaming) and machine learning (MLlib + SparkR), with Runtime and Delta layered on top.
  6. AI is re-shaping the world: huge disruptive innovations are affecting most enterprises on the planet, including digital personalization, Internet of Things, healthcare and genomics, fraud prevention, and many more.
  7. Better AI needs more data
  8. When AI goes distributed... When datasets get bigger and bigger, we see more and more distributed training scenarios and open-source offerings, e.g., distributed TensorFlow, Horovod, and distributed MXNet. This is where Spark and AI cross.
  9. Why Project Hydrogen?
  10. Two simple stories. As a data scientist, I can: ● build a pipeline that fetches training events from a production data warehouse and trains a DL model in parallel; ● apply a trained DL model to a distributed stream of events and enrich it with predicted labels.
  11. Distributed training: data warehouse → load → fit → model. Required: be able to read from Databricks Delta, Parquet, MySQL, Hive, etc. Answer: Apache Spark. Required: a distributed GPU cluster for fast training. Answer: Horovod, Distributed TensorFlow, etc.
  12. Two separate data and AI clusters? Load using a Spark cluster, save the data, then fit on a GPU cluster to produce a model. Glue code required.
  13. Streaming model inference: Kafka → load → predict with a model. Required: ● save to a stream sink ● GPU for fast inference
  14. A hybrid Spark and AI cluster? Load using a Spark cluster w/ GPUs and fit a model in a distributed fashion on the same cluster; likewise, load using a Spark cluster w/ GPUs and predict w/ GPUs as a Spark task.
  15. Unfortunately, it doesn’t work out of the box. See a previous demo.
  16. Project Hydrogen to fill the major gaps: Barrier Execution Mode, Optimized Data Exchange, Accelerator Aware Scheduling
  17. Updates from Project Hydrogen. As a Spark contributor, I want to present: ● what features from Project Hydrogen are available, ● what features are in development. As a Databricks engineer, I want to share: ● how we utilized features from Project Hydrogen, ● lessons learned and best practices.
  18. Story #1: Distributed training. Load using a Spark cluster w/ GPUs and fit a model in a distributed fashion on the same cluster.
  19. Project Hydrogen: barrier execution mode
  20. Different execution models. Spark (MapReduce): tasks are independent of each other; embarrassingly parallel & massively scalable. Distributed training: complete coordination among tasks; optimized for communication.
  21. Barrier execution mode. We introduced gang scheduling to Spark on top of the MapReduce execution model, so a distributed DL job can run as a Spark job. ● It starts all tasks together. ● It provides sufficient info and tooling to run a hybrid distributed job. ● It cancels and restarts all tasks in case of failures. JIRA: SPARK-24374 (Spark 2.4)
  22. API: RDD.barrier() tells Spark to launch the tasks together. rdd.barrier().mapPartitions { iter => val context = BarrierTaskContext.get() ... }
  23. API: context.barrier() places a global barrier and waits until all tasks in this stage hit this barrier. val context = BarrierTaskContext.get() ... // preparation context.barrier()
  24. API: context.getTaskInfos() returns info about all tasks in this stage. if (context.partitionId == 0) { val addrs = context.getTaskInfos().map(_.address) ... // start a hybrid training job, e.g., via MPI } context.barrier() // wait until training finishes
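The slide above launches an MPI job from the first task using the addresses from getTaskInfos(). As a rough illustration only (not the actual integration code), here is how such addresses might be turned into an mpirun -H host list; mpi_host_list, the host:port address format, and the slot count are all assumptions:

```python
def mpi_host_list(addresses, slots_per_host=4):
    """Group task addresses by host and format them as an
    mpirun -H host list, e.g. "server1:4,server2:4"."""
    hosts = []
    for addr in addresses:
        host = addr.split(":")[0]   # drop the port, keep the hostname
        if host not in hosts:       # preserve first-seen order, dedup hosts
            hosts.append(host)
    return ",".join(f"{h}:{slots_per_host}" for h in hosts)

# Two tasks on server1 and one on server2 collapse to two host entries.
print(mpi_host_list(["server1:40001", "server1:40002", "server2:40001"]))
# → server1:4,server2:4
```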
  25. Barrier mode integration
  26. Horovod (an LF AI hosted project). Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and MXNet. It was originally developed at Uber and is now an LF AI hosted project at the Linux Foundation. ● Little modification to single-node code. ● High-performance I/O via MPI and NCCL. ● Same convergence theory. Some limitations: before v0.16, the user still needs to use mpirun to launch a job with a Python training script, e.g.: mpirun -np 16 -H server1:4,server2:4,server3:4,server4:4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python
  27. Hydrogen integration with Horovod. Databricks released HorovodRunner w/ Runtime 5.0 ML, built on top of Horovod and Project Hydrogen. ● Runs Horovod under barrier execution mode. ● Hides cluster setup, scripts, and the MPI command line from users. def train_hvd(): hvd.init() … # train using Horovod HorovodRunner(np=2).run(train_hvd)
  28. Implementation of HorovodRunner. Integrating Horovod with barrier mode is straightforward: ● Pickle and broadcast the train function. ○ Inspect code and warn users about potential issues. ● Launch a Spark job in barrier execution mode. ● In the first executor, use worker addresses to launch the Horovod MPI job. ● Terminate Horovod if the Spark job gets cancelled. ○ Hint: PR_SET_PDEATHSIG. Limitation: tailored for Databricks Runtime ML ○ Horovod built with TensorFlow/PyTorch, SSH, OpenMPI, NCCL, etc. ○ Spark 2.4, GPU cluster configuration, etc.
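The first step above, "pickle and broadcast the train function," can be sketched in plain Python. This is a toy stand-in, not HorovodRunner's code: Spark typically serializes Python closures with cloudpickle, while the stdlib pickle used here only handles module-level functions; train_hvd below is a dummy.

```python
import pickle

# Hypothetical stand-in for the user's training function; in the real
# flow this would call hvd.init() and run the DL framework.
def train_hvd():
    return "trained"

# Driver side: serialize the function so it can be broadcast to executors.
payload = pickle.dumps(train_hvd)

# Executor side: deserialize and invoke inside a barrier task.
fn = pickle.loads(payload)
print(fn())  # → trained
```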
  29. horovod.spark. horovod.spark is a new feature in the Horovod 0.16 release. Similar to HorovodRunner, it runs Horovod as a Spark job and takes Python training functions. Its assumptions are more general: ● no dependency on SSH, ● system-independent process termination, ● multiple Spark versions, ● and more. Also check out horovodrun :)
  30. Collaboration on Horovod + Spark. Engineers at Uber and Databricks are collaborating on improving the integration between Horovod and Spark. Goals: ● Merge design and code development into horovod.spark. ● HorovodRunner uses the horovod.spark implementation with extra Databricks-specific features. ● Support barrier execution mode and GPU-aware scheduling. Stay tuned for the announcement from LF/Uber/Databricks!
  31. Project Hydrogen: GPU-aware scheduling
  32. Accelerator-aware scheduling. Accelerators (GPUs, FPGAs) are widely used for accelerating specialized workloads like deep learning and signal processing. To utilize accelerators in a Spark cluster, Spark needs to be aware of the accelerators assigned to the driver and executors and schedule them according to user requests. JIRA: SPARK-24615 (ETA: Spark 3.0)
  33. Why does Spark need GPU awareness? ● Mesos, YARN, and Kubernetes already support GPUs. ● However, even if GPUs are allocated by a cluster manager for a Spark application, Spark itself is not aware of the GPUs. ● Consider a simple case where one task needs one GPU: Executor 0 (GPU:0, GPU:1) runs Task 0 and Task 1; Executor 1 (GPU:0, GPU:1) runs Task 2 and Task 3; which GPU should Task 4 use?
  34. Workarounds (a.k.a. hacks) ● Limit Spark task slots per node to 1. ○ The running task can safely claim all GPUs on the node. ○ It might lead to resource waste if the workload doesn’t need all GPUs. ○ The user also needs to write multithreading code to maximize data I/O. ● Let running tasks collaboratively decide which GPUs to use, e.g., via shared locks.
  35. Proposed workflow (User / Spark / Cluster Manager): 0. Auto-discover resources. 1. Submit an application with resource requests. 2. Pass resource requests to the cluster manager. 3. Allocate executors with resource isolation. 4. Register executors. 5. Submit a Spark job. 6. Schedule tasks on available executors. 7. Dynamic allocation. 8. Retrieve assigned resources and use them in tasks. 9. Monitor and recover failed executors.
  36. Discover and request GPUs. The admin can specify a script to auto-discover GPUs (#24406): ● spark.driver.resource.gpu.discoveryScript ● spark.executor.resource.gpu.discoveryScript ● e.g., `nvidia-smi --query-gpu=index ...` The user can request GPUs at the application level (#24374): ● spark.executor.resource.gpu.count ● spark.driver.resource.gpu.count ● spark.task.resource.gpu.count
  37. Retrieve assigned GPUs. The user can retrieve assigned GPUs from the task context (#24374): context = TaskContext.get() assigned_gpu = context.getResources()["gpu"][0] with tf.device(assigned_gpu): # training code ...
  38. Cluster manager support: YARN (SPARK-27361), Kubernetes (SPARK-27362), Mesos (SPARK-27363), Standalone (SPARK-27361)
  39. Jenkins support (SPARK-27365). To support end-to-end integration tests, we are adding GPU cards to the Spark Jenkins machines hosted by Berkeley RISELab. Thanks to NVIDIA for donating the latest Tesla T4 cards!
  40. Support other accelerators. We focus on GPU support but keep the interfaces general to support other types of accelerators in the future, e.g., FPGAs. ● "GPU" is not a hard-coded resource type. ● spark.executor.resource.{resourceType}.discoveryScript ● context.getResources() returns a map from resourceType to assigned addresses.
  41. Features beyond the current SPIP: ● Resource requests at the task level. ● Fine-grained scheduling within one GPU. ● Affinity and anti-affinity. ● ...
  42. More on distributed training: data flow. We recommend the following data flow for training: ● Load and preprocess training data using Spark. ● Save preprocessed training data to shared storage. ○ What format? TFRecords, Parquet + Petastorm. ○ Which shared storage? S3, Azure Blob Storage, HDFS, NFS, etc. ● Load training data in DL frameworks. ○ But DL frameworks do not work well with remote storage.
  43. Connect DL frameworks to remote storage. We recommend high-performance FUSE clients (e.g., Goofys for s3://bucket, blobfuse for wasb://container) to mount remote storage as local files (file:/mnt/...) so DL frameworks such as TensorFlow + Horovod can load/save data easily.
  44. Story #2: Streaming model inference. Load using a Spark cluster w/ GPUs and predict w/ GPUs as a Spark task.
  45. Project Hydrogen: optimized data exchange
  46. Optimized data exchange. None of the integrations are possible without exchanging data between Spark and AI frameworks, and performance matters. JIRA: SPARK-24579
  47. Pandas UDF. Pandas UDFs were introduced in Spark 2.3; they use Arrow for data exchange and Pandas for vectorized computation.
  48. Pandas UDF for distributed inference. Pandas UDF makes it simple to apply a model to a data stream. @pandas_udf(...) def predict(features): ... spark.readStream(...) .withColumn('prediction', predict(col('features')))
  49. Return StructType from Pandas UDF. We improved scalar Pandas UDF to support complex return types, so users can return predicted labels and raw scores together. JIRA: SPARK-23836 (Spark 3.0) @pandas_udf(...) def predict(features): # ... return pd.DataFrame({'labels': labels, 'scores': scores})
  50. Data pipelining. Without pipelining, CPU and GPU alternate: t1 CPU fetches batch #1; t2 GPU processes batch #1; t3 CPU fetches batch #2; t4 GPU processes batch #2; t5 CPU fetches batch #3; t6 GPU processes batch #3. With pipelining, fetch and compute overlap: t1 fetch batch #1; t2 fetch batch #2 while processing batch #1; t3 fetch batch #3 while processing batch #2; t4 process batch #3.
  51. Pandas UDF prefetch. To improve throughput, we prefetch Arrow record batches into a queue while executing the Pandas UDF on the current batch. ● Enabled by default on Databricks Runtime 5.2. ● Up to 2x for I/O- and compute-balanced workloads. ● Observed 1.5x in a real workload. JIRA: SPARK-27569 (ETA: Spark 3.0)
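The prefetch idea above can be sketched in plain Python: a background thread fetches the next batch into a bounded queue while the consumer processes the current one. This is a toy illustration of the pipelining pattern, not Spark's Arrow-batch prefetch implementation; prefetch and capacity are made-up names.

```python
import queue
import threading

def prefetch(batches, capacity=1):
    """Yield items from `batches`, fetching ahead in a background thread."""
    q = queue.Queue(maxsize=capacity)
    done = object()                       # sentinel marking end of stream

    def producer():
        for batch in batches:
            q.put(batch)                  # blocks when the queue is full
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = q.get()                   # next batch is (often) already fetched
        if batch is done:
            return
        yield batch

# Compute on each batch overlaps with fetching the next one.
results = [b * 10 for b in prefetch(iter([1, 2, 3]))]
print(results)  # → [10, 20, 30]
```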
  52. Per-batch initialization overhead. Loading the model per batch introduces a constant overhead. We propose a new Pandas UDF interface that takes an iterator of batches, so we only need to load the model once. JIRA: SPARK-26412 (WIP) @pandas_udf(...) def predict(batches): model = … # load model once for batch in batches: yield model.predict(batch)
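A plain-Python sketch of why the iterator interface helps: the (dummy) model below is loaded once per iterator of batches rather than once per batch. load_model and the list-based "model" are stand-ins for illustration only.

```python
load_count = 0  # track how many times the "model" is loaded

def load_model():
    global load_count
    load_count += 1
    return lambda batch: [x + 1 for x in batch]   # stand-in "model"

def predict(batches):
    model = load_model()          # runs once for the whole iterator
    for batch in batches:
        yield model(batch)        # per-batch work amortizes the load cost

out = list(predict(iter([[1, 2], [3, 4], [5, 6]])))
print(out, load_count)  # → [[2, 3], [4, 5], [6, 7]] 1
```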
  53. Standardize on the Arrow format. Many accelerated computing libraries now support Arrow. The community is discussing whether we should expose the Arrow format in a public interface. ● Simplify data exchange. ● Reduce data copy/conversion overhead. ● Allow pluggable vectorization code. JIRA: SPARK-27396 (pending vote)
  54. Acknowledgement ● Many ideas in Project Hydrogen are based on previous community work: TensorFrames, BigDL, Apache Arrow, Pandas UDF, Spark GPU support, MPI, etc. ● We would like to thank the many Spark committers and contributors who helped with the project proposal, design, and implementation.
  55. Acknowledgement ● Xingbo Jiang ● Thomas Graves ● Andy Feng ● Alex Sergeev ● Shane Knapp ● Xiao Li ● Li Jin ● Bryan Cutler ● Takuya Ueshin ● Wenchen Fan ● Jason Lowe ● Hyukjin Kwon ● Madhukar Korupolu ● Robert Evans ● Yinan Li ● Felix Cheung ● Imran Rashid ● Saisai Shao ● Mark Hamstra ● Sean Owen ● Yu Jiang ● … and many more!
  56. Thank you!