Big data and AI are joined at the hip: the best AI applications require massive amounts of constantly updated training data to build state-of-the-art models, and AI has always been one of the most exciting applications of big data. Project Hydrogen is a major Apache Spark initiative to bring the best AI and big data solutions together. It introduced barrier execution mode in the Spark 2.4.0 release to support distributed model training, and it is exploring optimized data exchange to accelerate distributed model inference.
In this talk, we explain why barrier execution mode is needed, how it works, and how to use it to run distributed deep learning (DL) training on Spark. We demonstrate HorovodRunner, the first Spark+AI integration powered by Project Hydrogen. It is built on the Horovod framework developed by Uber and ships with Databricks Runtime 5.0 for Machine Learning.
We also share our experience and performance tips on combining Spark's Pandas UDFs with AI frameworks to scale complex model inference workloads.
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Training and Inference on Apache Spark
1. Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Training and Inference on Apache Spark
Lu WANG
2019-01-17 BASM Meetup @ Unravel Data
5. Big data for AI
There are many efforts from the Spark community to integrate Spark with AI/ML frameworks:
● (Yahoo) CaffeOnSpark, TensorFlowOnSpark
● (John Snow Labs) Spark-NLP
● (Databricks) spark-sklearn, tensorframes, spark-deep-learning
● … 80+ ML/AI packages on spark-packages.org
6. AI needs big data
We have seen efforts from the DL libraries to handle different data scenarios:
● tf.data, tf.estimator
● spark-tensorflow-connector
● torch.utils.data
● … ...
6
7. The status quo: two simple stories
As a data scientist, I can:
● build a pipeline that fetches training events from a production data warehouse and trains a DL model in parallel;
● apply a trained DL model to a distributed stream of events and enrich it with predicted labels.
8. Distributed DL training
Pipeline: data warehouse (e.g., Databricks Delta) → load → fit → model
● Load: read from Databricks Delta, Parquet, MySQL, Hive, etc.
● Fit: train on distributed GPU clusters for speed, using Horovod, Distributed TensorFlow, etc.
10. Two Challenges in Supporting AI Frameworks in Spark
● Data exchange: need to push data at high throughput between Spark and accelerated frameworks
● Execution mode: fundamental incompatibility between Spark (embarrassingly parallel) and AI frameworks (gang scheduled)
11. Project Hydrogen: Spark + AI
● Execution mode: Barrier Execution Mode
● Data exchange: Vectorized Data Exchange, Accelerator-aware Scheduling
15. Different execution modes
Spark:
● Tasks are independent of each other
● Embarrassingly parallel & massively scalable
● If one task crashes, rerun just that one
Distributed training:
● Complete coordination among tasks
● Optimized for communication
● If one task crashes, all tasks must be rerun
16. Barrier execution mode
We introduce gang scheduling to Spark on top of the MapReduce execution model, so a distributed DL job can run as a Spark job:
● It starts all tasks together.
● It provides sufficient information and tooling to run a hybrid distributed job.
● It cancels and restarts all tasks in case of failures.
A minimal usage sketch follows this list.
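Below is a minimal sketch of barrier mode from PySpark (available since Spark 2.4); the partition count and the training placeholder are illustrative, not the talk's actual code:

from pyspark import BarrierTaskContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def train_fn(iterator):
    context = BarrierTaskContext.get()
    context.barrier()  # all tasks start together and wait for each other here
    # ... run one worker of a distributed training job ...
    yield context.partitionId()

# .barrier() marks the stage for gang scheduling; the cluster must have
# at least 4 free slots, since all 4 tasks are launched at once.
rdd = spark.sparkContext.parallelize(range(4), 4)
result = rdd.barrier().mapPartitions(train_fn).collect()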
18. context.barrier()
context.barrier() places a global barrier and waits until all tasks in this stage hit this barrier.
val context = BarrierTaskContext.get()
… // write partition data out
context.barrier()
19. context.getTaskInfos()
context.getTaskInfos() returns info about all tasks in this stage.
if (context.partitionId == 0) {
  val addrs = context.getTaskInfos().map(_.address)
  ... // start a hybrid training job, e.g., via MPI
}
context.barrier() // wait until training finishes
20. Distributed DL training with barrier
● Stage 1: data prep (embarrassingly parallel)
● Stage 2: distributed ML training (gang scheduled)
● Stage 3: data sink (embarrassingly parallel)
HorovodRunner: a general API to run distributed deep learning workloads on Databricks using Uber's Horovod framework
21. Why start with Horovod?
Horovod is a distributed training framework developed at Uber.
● Supports TensorFlow, Keras, and PyTorch
● Easy to use
■ Users only need to slightly modify single-node training code to use Horovod
● Offers good scaling efficiency
22. Why HorovodRunner?
HorovodRunner makes it easy to run Horovod on Databricks.
● Horovod runs an MPI job for distributed training, which is hard to set up
● It is hard to schedule an MPI job on a Spark cluster
23. HorovodRunner
The HorovodRunner API supports the following methods:
● init(self, np)
○ Create an instance of HorovodRunner.
● run(self, main, **kwargs)
○ Run a Horovod training job invoking main(**kwargs).

import horovod.tensorflow as hvd
from sparkdl import HorovodRunner

def train():
  hvd.init()
  # ... single-node training code with Horovod hooks ...

hr = HorovodRunner(np=2)
hr.run(train)
25. Single-node to distributed
The development workflow for distributed DL training is as follows (see the sketch after this list):
● Data preparation
● Prepare single-node DL code
● Add Horovod hooks
● Run distributed training with HorovodRunner
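Below is a minimal sketch of the "add Horovod hooks" step using Horovod's Keras API; the model, data, and hyperparameters are illustrative placeholders rather than the talk's actual example:

import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd
from sparkdl import HorovodRunner

def train():
    hvd.init()  # hook 1: initialize Horovod on each worker

    # Toy data; replace with your real training set.
    x = np.random.rand(512, 784).astype("float32")
    y = np.random.randint(0, 10, size=512)

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

    # hook 2: scale the learning rate by the number of workers
    opt = tf.keras.optimizers.SGD(0.01 * hvd.size())
    # hook 3: wrap the optimizer so gradients are averaged across workers
    opt = hvd.DistributedOptimizer(opt)
    model.compile(optimizer=opt, loss="sparse_categorical_crossentropy")

    # hook 4: broadcast initial weights from rank 0 so workers start in sync
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
    model.fit(x, y, batch_size=64, epochs=1, callbacks=callbacks)

# np=2 launches two Horovod worker processes on the cluster
HorovodRunner(np=2).run(train)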
35. Accelerator-aware scheduling (SPIP)
To utilize accelerators (GPUs, FPGAs) in a heterogeneous cluster or to utilize multiple accelerators in a multi-task node, Spark needs to understand the accelerators installed on each node.
SPIP JIRA: SPARK-24615 (pending vote, ETA: 3.0)
36. Request accelerators
With accelerator awareness, users can specify accelerator constraints or hints (API pending discussion):
rdd.accelerated
  .by("/gpu/p100")
  .numPerTask(2)
  .required // or .optional
37. Multiple tasks on the same node
When multiple tasks are scheduled on the same node with multiple GPUs, each task knows which GPUs are assigned to it, so tasks do not collide with each other (API pending discussion):
// inside a task closure
val gpus = context.getAcceleratorInfos()
38. Conclusion
● Barrier Execution Mode
○ HorovodRunner
● Optimized Data Exchange: Pandas UDF
○ Model inference with Pandas UDF
● Accelerator-aware Scheduling
41. Main content to deliver
• Why care about distributed training
• HorovodRunner
• Why the barrier execution mode is needed for DL training
• First application: HorovodRunner
• DL model inference
• Why Pandas UDF and prefetching are needed
• How to optimize model inference on Databricks
• Optimization advice
• Demo
42. Main takeaways for the audience
• Users can easily run distributed DL training on Databricks
• How to migrate a single-node DL workflow to distributed training with HorovodRunner
• What support HorovodRunner provides for tuning distributed code: TensorBoard, Horovod timeline, MLflow
• Users can do model inference efficiently on Databricks
• How to do model inference with Pandas UDF
• Some hints for optimizing the code
43. Table of Contents
• Introduction (4 min)
• Why we need big data / distributed training (2 min / 3 slides)
• Workflow of distributed DL training and model inference (2 min / 2 slides)
• Project Hydrogen: Spark + AI (15 min)
• Barrier Execution Mode: HorovodRunner (8 min / 12 slides)
• Optimized Data Exchange: model inference with Pandas UDF (6 min / 8 slides)
• Accelerator-aware Scheduling (1 min / 3 slides)
• Demo (9 min)
• Conclusion (4 min)
48. Pandas UDF: tips to optimize predict
Pipeline: data prep, then repeated preprocessing → predict steps per batch
● Increase the batch size to increase GPU utilization
49. Pandas UDF: tips to optimize preprocessing
Pipeline: data prep, then repeated preprocessing → predict steps per batch
● Prefetch data
● Load and preprocess data in parallel
● Run part of the preprocessing on the GPU
A minimal inference sketch follows this list.
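Below is a minimal sketch of distributed model inference with a scalar Pandas UDF (Spark 2.4+); the DataFrame, the doubled-value "model", and the batch-size value are illustrative stand-ins for a real model's predict call:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf

spark = SparkSession.builder.getOrCreate()

# Larger Arrow batches feed the GPU bigger chunks during predict.
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "10000")

df = spark.range(0, 100000).select(col("id").cast("double").alias("feature"))

@pandas_udf("double")
def predict(features):
    # Each call receives one Arrow batch as a pandas Series; load the model
    # once per executor and call model.predict here. Doubling the input
    # stands in for a real prediction.
    return features * 2.0

df.withColumn("prediction", predict("feature")).show(5)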
52. When AI goes distributed ...
When datasets get bigger and bigger, we see more and more distributed training scenarios and open-source offerings, e.g., distributed TensorFlow, Horovod, and distributed MXNet.
This is where we see Spark and AI efforts overlap more.