Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark


Project Hydrogen is a major Apache Spark initiative to bring state-of-the-art AI and Big Data solutions together.

It contains three major projects:
1) barrier execution mode
2) optimized data exchange and
3) accelerator-aware scheduling.

A basic implementation of barrier execution mode was merged into Apache Spark 2.4.0, and the community is working on the latter two. In this talk, we will present progress updates to Project Hydrogen and discuss the next steps.

First, we will review the barrier execution mode implementation from Spark 2.4.0. It enables developers to embed distributed training jobs properly in a Spark cluster. We will demonstrate distributed AI integrations built on top of it, e.g., Horovod and Distributed TensorFlow. We will also discuss the technical challenges in implementing those integrations, and future work.

Second, we will give updates on accelerator-aware scheduling and how it will help accelerate your Spark training jobs. We will also outline ongoing work on optimized data exchange.


  1. 1. WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
  2. 2. Xingbo Jiang, Databricks Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark #UnifiedDataAnalytics #SparkAISummit
  3. 3. About Me • Software Engineer at Databricks • Committer of Apache Spark Xingbo Jiang (GitHub: jiangxb1987)
  4. 4. 4 Announced last June, Project Hydrogen is a major Spark initiative to unify state-of-the-art AI and big data workloads. About Project Hydrogen Barrier Execution Mode Optimized Data Exchange Accelerator Aware Scheduling
  5. 5. 5 Why Spark + AI?
  6. 6. Apache Spark: The First Unified Analytics Engine. (Diagram: Spark Core Engine; Big Data Processing: ETL + SQL + Streaming; Machine Learning: MLlib + SparkR; Runtime; Delta.)
  7. 7. AI is re-shaping the world: huge disruptive innovations are affecting most enterprises on the planet. Examples: Internet of Things, Digital Personalization, Healthcare and Genomics, Fraud Prevention, and many more...
  8. 8. Better AI needs more data 8
  9. 9. The cross... (Diagram of the big data and AI/ML ecosystems. Spark side: Map/Reduce, RDD, DataFrame-based APIs, 50+ Data Sources, Python/Java/R interfaces, Structured Streaming, Continuous Processing, ML Pipelines API, Project Tungsten. Bridges: CaffeOnSpark, TensorFlowOnSpark, TensorFrames, Pandas UDF. AI/ML side: scikit-learn, pandas/numpy/scipy, LIBLINEAR, R glmnet, xgboost, GraphLab, Caffe/PyTorch/MXNet, TensorFlow, Keras, Distributed TensorFlow, Horovod, tf.data, tf.transform, TF XLA.)
  10. 10. 10 Why Project Hydrogen?
  11. 11. Two simple stories: (1) data warehouse -> load -> fit -> model; (2) data stream -> load -> predict -> model.
  12. 12. Distributed training (data warehouse -> load -> fit -> model). Required: be able to read from Delta Lake, Parquet, MySQL, Hive, etc. Answer: Apache Spark. Required: distributed GPU cluster for fast training. Answer: Horovod, Distributed TensorFlow, etc.
  13. 13. Two separate data and AI clusters? Load using a Spark cluster, save the data, then fit a model on a GPU cluster. Required: glue code.
  14. 14. Streaming model inference Kafka load predict model required: ● save to stream sink ● GPU for fast inference 14
  15. 15. A hybrid Spark and AI cluster? load using a Spark cluster w/ GPUs fit a model distributedly on the same cluster model load using a Spark cluster w/ GPUs predict w/ GPUs as a Spark task model 15
  16. 16. Unfortunately, it doesn’t work out of the box. See a previous demo.
  17. 17. 17 Project Hydrogen to fill the major gaps Barrier Execution Mode Optimized Data Exchange Accelerator Aware Scheduling
  18. 18. Updates from Project Hydrogen ● Available features ● Future improvements ● How to utilize them
  19. 19. 19 Story #1: Distributed training load using a Spark cluster w/ GPUs fit a model distributedly on the same cluster model
  20. 20. 20 Project Hydrogen: barrier execution mode Barrier Execution Mode Optimized Data Exchange Accelerator Aware Scheduling
  21. 21. Different execution models. Spark (MapReduce): tasks are independent of each other; embarrassingly parallel and massively scalable. Distributed training: complete coordination among tasks; optimized for communication.
  22. 22. 22 Barrier execution mode • All tasks start together • Sufficient info to run a hybrid distributed job • Cancel and restart all tasks on failure JIRA: SPARK-24374 (Spark 2.4)
  23. 23. API: RDD.barrier(). RDD.barrier() tells Spark to launch the tasks together.
      rdd.barrier().mapPartitions { iter =>
        val context = BarrierTaskContext.get()
        ...
      }
  24. 24. API: context.barrier(). context.barrier() places a global barrier and waits until all tasks in this stage hit this barrier.
      val context = BarrierTaskContext.get()
      ... // preparation
      context.barrier()
  25. 25. API: context.getTaskInfos(). context.getTaskInfos() returns info about all tasks in this stage.
      if (context.partitionId == 0) {
        val addrs = context.getTaskInfos().map(_.address)
        ... // start a hybrid training job, e.g., via MPI
      }
      context.barrier() // wait until training finishes
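For reference, the same barrier APIs are available in PySpark (Spark 2.4+). A minimal sketch, assuming the cluster has enough slots to run all barrier tasks at once; the MPI launch itself is only a placeholder:

    from pyspark import SparkContext
    from pyspark.taskcontext import BarrierTaskContext

    def train_partition(iterator):
        context = BarrierTaskContext.get()
        # addresses of every task in this barrier stage
        addrs = [info.address for info in context.getTaskInfos()]
        context.barrier()                  # wait until all tasks reach this point
        if context.partitionId() == 0:
            pass                           # e.g., launch an MPI job across addrs here
        context.barrier()                  # wait until training finishes on all workers
        yield addrs

    sc = SparkContext.getOrCreate()
    result = sc.parallelize(range(4), 4).barrier().mapPartitions(train_partition).collect()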
  26. 26. 26 Barrier mode integration
  27. 27. 27 Horovod (an LF AI hosted project) ● Little modification to single-node code ● High-performance I/O via MPI and NCCL ● Same convergence theory ● Limitations
  28. 28. Hydrogen integration with Horovod ● HorovodRunner was released with Databricks Runtime 5.0 ML ● Runs Horovod under barrier execution mode ● Hides details from users
      def train_hvd():
          hvd.init()
          ...  # train using Horovod
      HorovodRunner(np=2).run(train_hvd)
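A slightly fuller sketch of the same pattern, assuming Databricks Runtime ML (where HorovodRunner is importable from sparkdl); build_model() and train_dataset are hypothetical stand-ins for the user's own model and data:

    from sparkdl import HorovodRunner            # ships with Databricks Runtime ML
    import horovod.tensorflow.keras as hvd
    import tensorflow as tf

    def train_hvd():
        hvd.init()                               # one Horovod process per barrier task
        model = build_model()                    # hypothetical user-defined Keras model
        opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
        model.compile(optimizer=opt, loss="mse")
        model.fit(train_dataset, epochs=5)       # train_dataset assumed to exist on workers

    HorovodRunner(np=2).run(train_hvd)           # np = number of parallel Horovod tasks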
  29. 29. Implementation of HorovodRunner. Integrating Horovod with barrier mode is straightforward:
      ● Pickle and broadcast the train function. ○ Inspect code and warn users about potential issues.
      ● Launch a Spark job in barrier execution mode.
      ● In the first executor, use worker addresses to launch the Horovod MPI job.
      ● Terminate Horovod if the Spark job gets cancelled. ○ Hint: PR_SET_PDEATHSIG
      Limitation: tailored for Databricks Runtime ML ○ Horovod built with TensorFlow/PyTorch, SSH, OpenMPI, NCCL, etc. ○ Spark 2.4, GPU cluster configuration, etc.
  30. 30. 30 Project Hydrogen: Accelerator-aware scheduling Barrier Execution Mode Optimized Data Exchange Accelerator Aware Scheduling
  31. 31. Accelerator-aware scheduling. JIRA: SPARK-24615 (ETA: Spark 3.0). (Diagram: the Driver and Cluster Manager schedule Tasks 0-4 onto Executor 0 and Executor 1, each exposing GPU:0 and GPU:1.)
  32. 32. Why does Spark need accelerator awareness? ● Some cluster managers already support accelerators (GPU/FPGA/etc.) ● Spark itself still needs to be aware of accelerators. Example: (diagram: Tasks 0-3 scheduled onto Executor 0 and Executor 1, each with GPU:0 and GPU:1, and Task 4 marked "?")
  33. 33. Workarounds (a.k.a. hacks) ● Only allow one Spark task on each node ○ Pros: avoids accelerator resource contention ○ Cons: wastes resources, poor performance ● Running tasks choose resources collaboratively (e.g. shared locks)
  34. 34. Proposed workflow (User / Spark / Cluster Manager):
      0. Auto-discover resources.
      1. Submit an application with resource requests.
      2. Pass resource requests to the cluster manager.
      3. Allocate executors with resource isolation.
      4. Register executors.
      5. Submit a Spark job.
      6. Schedule tasks on available executors.
      7. Dynamic allocation.
      8. Retrieve assigned resources and use them in tasks.
      9. Monitor and recover failed executors.
  35. 35. Discover and request accelerators.
      Admin can specify a script to auto-discover accelerators (SPARK-27024):
      ● spark.driver.resource.${resourceName}.discoveryScript
      ● spark.executor.resource.${resourceName}.discoveryScript
      ● e.g., `nvidia-smi --query-gpu=index ...`
      User can request accelerators at the application level (SPARK-27366):
      ● spark.executor.resource.${resourceName}.amount
      ● spark.driver.resource.${resourceName}.amount
      ● spark.task.resource.${resourceName}.amount
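A hedged sketch of wiring these configs up when creating a PySpark session (Spark 3.0+); the discovery-script path is a placeholder for an admin-provided script that prints the available GPU addresses:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("gpu-aware-app")
             .config("spark.executor.resource.gpu.amount", "2")   # GPUs per executor
             .config("spark.task.resource.gpu.amount", "1")       # GPUs per task
             .config("spark.executor.resource.gpu.discoveryScript",
                     "/path/to/getGpusResources.sh")              # placeholder path
             .getOrCreate())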
  36. 36. Retrieve assigned accelerators. User can retrieve assigned accelerators from the task context (SPARK-27366):
      context = TaskContext.get()
      assigned_gpu = context.resources()["gpu"].addresses[0]
      with tf.device(assigned_gpu):
          # training code
          ...
  37. 37. Cluster manager support: YARN (SPARK-27361), Kubernetes (SPARK-27362), Mesos (SPARK-27363, not started), Standalone (SPARK-27360).
  38. 38. 38 Web UI for accelerators
  39. 39. Support general accelerator types. We keep the interfaces general to support accelerator types other than GPU in the future, e.g. FPGA: ● "GPU" is not a hard-coded resource type. ● spark.executor.resource.${resourceName}.discoveryScript ● context.resources() returns a map from resourceName to ResourceInformation (resource name and addresses).
  40. 40. 40 Features beyond Project Hydrogen ● Resource request at task level. ● Fine-grained scheduling within one GPU. ● Affinity and anti-affinity. ● ...
  41. 41. 41 Story #2: Streaming model inference load using a Spark cluster w/ GPUs predict w/ GPUs as a Spark task model
  42. 42. 42 Project Hydrogen: Optimized data exchange Barrier Execution Mode Optimized Data Exchange Accelerator Aware Scheduling
  43. 43. 43 Optimized data exchange None of the integrations are possible without exchanging data between Spark and AI frameworks. And performance matters. JIRA: SPARK-24579
  44. 44. Pandas UDF. Pandas UDF was introduced in Spark 2.3; it uses Arrow for data exchange and pandas for vectorized computation.
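A minimal scalar Pandas UDF for illustration, assuming Spark 2.3+ with PyArrow installed; the column and function names are made up:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, col

    spark = SparkSession.builder.getOrCreate()

    @pandas_udf("double")          # operates on whole pandas Series, not one row at a time
    def plus_one(v):
        return v + 1.0

    df = spark.range(5).select(col("id").cast("double").alias("x"))
    df.withColumn("y", plus_one(col("x"))).show()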
  45. 45. Pandas UDF for distributed inference. Pandas UDF makes it simple to apply a model to a data stream.
      @pandas_udf(...)
      def predict(features):
          ...
      spark.readStream(...)
          .withColumn('prediction', predict(col('features')))
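A fuller, hedged version of this pattern with a concrete source and sink, assuming the SparkSession `spark` from the earlier sketch; the Kafka topic, the stand-in "model", and the output/checkpoint paths are all placeholders:

    from pyspark.sql.functions import pandas_udf, col

    @pandas_udf("double")
    def predict(features):
        # stand-in for real model inference on a pandas Series of strings
        return features.str.len().astype("float64")

    events = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "host:9092")
              .option("subscribe", "events")
              .load()
              .selectExpr("CAST(value AS STRING) AS features"))

    query = (events
             .withColumn("prediction", predict(col("features")))
             .writeStream.format("parquet")                 # save to a stream sink
             .option("path", "/tmp/predictions")
             .option("checkpointLocation", "/tmp/checkpoints")
             .start())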
  46. 46. Return StructType from Pandas UDF. We improved scalar Pandas UDF to support complex return types, so users can return predicted labels and raw scores together. JIRA: SPARK-23836 (Spark 3.0)
      @pandas_udf(...)
      def predict(features):
          # ...
          return pd.DataFrame({'labels': labels, 'scores': scores})
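A self-contained, hedged version of the snippet above showing how the struct schema is declared (Spark 3.0, SPARK-23836); the thresholding "model" is a stand-in:

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("labels long, scores double")   # struct return type, one field per column
    def predict(features):
        scores = features.astype("float64")     # stand-in for real model scores
        labels = (scores > 0.5).astype("int64")
        return pd.DataFrame({"labels": labels, "scores": scores})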
  47. 47. Data pipelining
      With pipelining (CPU fetch overlaps GPU processing):
        t1: CPU fetch batch #1
        t2: CPU fetch batch #2, GPU process batch #1
        t3: CPU fetch batch #3, GPU process batch #2
        t4: GPU process batch #3
      Without pipelining:
        t1: CPU fetch batch #1
        t2: GPU process batch #1
        t3: CPU fetch batch #2
        t4: GPU process batch #2
        t5: CPU fetch batch #3
        t6: GPU process batch #3
  48. 48. Pandas UDF prefetch. To improve throughput, we prefetch Arrow record batches into a queue while the Pandas UDF executes on the current batch. ● Enabled by default since Databricks Runtime 5.2. ● Up to 2x speedup for workloads balanced between I/O and compute. ● Observed 1.5x on real workloads. JIRA: SPARK-27569 (ETA: Spark 3.0)
  49. 49. Per-batch initialization overhead. A new Pandas UDF interface that loads the model only once and reuses it on an iterator of batches. JIRA: SPARK-26412 (Spark 3.0)
      @pandas_udf(...)
      def predict(batches):
          model = ...  # load model once
          for batch in batches:
              yield model.predict(batch)
  50. 50. 50 Acknowledgement ● Many ideas in Project Hydrogen are based on previous community work: TensorFrames, BigDL, Apache Arrow, Pandas UDF, Spark GPU support, MPI, etc. ● We would like to thank many Spark committers and contributors who helped the project proposal, design, and implementation.
  51. 51. 51 Acknowledgement ● Alex Sergeev ● Andy Feng ● Bryan Cutler ● Felix Cheung ● Hyukjin Kwon ● Imran Rashid ● Jason Lowe ● Jerry Shao ● Li Jin ● Madhukar Korupolu ● Mark Hamstra ● Robert Evans ● Sean Owen ● Shane Knapp ● Takuya Ueshin ● Thomas Graves ● Wenchen Fan ● Xiangrui Meng ● Xiao Li ● Yi Wu ● Yinan Li ● Yu Jiang ● … and many more!
  52. 52. DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT