Xiangrui Meng, Databricks
Updates from Project Hydrogen: Unifying
State-of-the-Art AI and Big Data in Apache Spark
#UnifiedAnalytics #SparkAISummit
2
About me
● Software Engineer at Databricks
○ machine learning and data science/engineering
● Committer and PMC member of Apache Spark
○ MLlib, SparkR, PySpark, Spark Packages, etc
3
Announced last June, Project Hydrogen is a major Spark initiative
to unify state-of-the-art AI and big data workloads.
About Project Hydrogen
Barrier Execution Mode | Optimized Data Exchange | Accelerator-Aware Scheduling
4
Why Spark + AI?
Apache Spark: The First Unified Analytics Engine
● Big Data Processing: ETL + SQL + Streaming
● Machine Learning: MLlib + SparkR
● Spark Core Engine (Runtime, Delta)
5
AI is re-shaping the world
Huge disruptive innovations are affecting most enterprises on the planet:
Digital Personalization, Internet of Things, Healthcare and Genomics,
Fraud Prevention, and many more...
6
Better AI needs more data
7
When AI goes distributed ...
When datasets get bigger and bigger, we see more and more
distributed training scenarios and open-source offerings, e.g.,
distributed TensorFlow, Horovod, and distributed MXNet.
This is where Spark and AI cross.
8
9
Why Project Hydrogen?
Two simple stories
As a data scientist, I can:
● build a pipeline that fetches training events from a production
data warehouse and trains a DL model in parallel;
● apply a trained DL model to a distributed stream of events and
enrich it with predicted labels.
10
Distributed training
data warehouse → load → fit → model
Required: be able to read from Databricks Delta, Parquet, MySQL, Hive, etc.
Answer: Apache Spark
Required: a distributed GPU cluster for fast training
Answer: Horovod, Distributed TensorFlow, etc.
11
Two separate data and AI clusters?
load using a Spark cluster → save data → fit on a GPU cluster → model
required: glue code
12
Streaming model inference
Kafka → load → predict → model
required:
● save to stream sink
● GPU for fast inference
13
A hybrid Spark and AI cluster?
load using a Spark cluster w/ GPUs → fit a model distributedly on the same cluster → model
load using a Spark cluster w/ GPUs → predict w/ GPUs as a Spark task → model
14
Unfortunately, it doesn’t work out of the box.
See a previous demo.
16
Project Hydrogen to fill the major gaps
Barrier Execution Mode | Optimized Data Exchange | Accelerator-Aware Scheduling
17
Updates from Project Hydrogen
As a Spark contributor, I want to present:
● what features from Project Hydrogen are available,
● what features are in development.
As a Databricks engineer, I want to share:
● how we utilized features from Project Hydrogen,
● lessons learned and best practices.
18
Story #1:
Distributed training
load using a Spark cluster w/ GPUs → fit a model distributedly on the same cluster → model
19
Project Hydrogen: barrier execution mode
Barrier Execution Mode | Optimized Data Exchange | Accelerator-Aware Scheduling
20
Different execution models
Spark (MapReduce): tasks are independent of each other.
Embarrassingly parallel & massively scalable.
Distributed training: complete coordination among tasks.
Optimized for communication.
21
Barrier execution mode
We introduced gang scheduling to Spark on top of the MapReduce
execution model, so a distributed DL job can run as a Spark job.
● It starts all tasks together.
● It provides sufficient info and tooling to run a hybrid distributed job.
● It cancels and restarts all tasks in case of failures.
JIRA: SPARK-24374 (Spark 2.4)
22
API: RDD.barrier()
RDD.barrier() tells Spark to launch the tasks together.
rdd.barrier().mapPartitions { iter =>
val context = BarrierTaskContext.get()
...
}
23
API: context.barrier()
context.barrier() places a global barrier and waits until all tasks in
this stage hit this barrier.
val context = BarrierTaskContext.get()
… // preparation
context.barrier()
24
API: context.getTaskInfos()
context.getTaskInfos() returns info about all tasks in this stage.
if (context.partitionId == 0) {
val addrs = context.getTaskInfos().map(_.address)
... // start a hybrid training job, e.g., via MPI
}
context.barrier() // wait until training finishes
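
Putting the three calls together: a minimal PySpark sketch of the same pattern
(Spark 2.4+). The training launch itself is elided, and the function and
variable names here are illustrative, not from the slides.

from pyspark import BarrierTaskContext

def run_training(iterator):
    context = BarrierTaskContext.get()
    context.barrier()                          # wait until every task has started
    if context.partitionId() == 0:
        addrs = [info.address for info in context.getTaskInfos()]
        print("launch the hybrid job across:", addrs)   # e.g., via mpirun -H ...
    context.barrier()                          # wait until training finishes
    return []

rdd = spark.sparkContext.parallelize(range(4), numSlices=4)
rdd.barrier().mapPartitions(run_training).collect()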
25
Barrier mode integration
26
Horovod (an LF AI hosted project)
Horovod is a distributed training framework for TensorFlow,
Keras, PyTorch, and MXNet. It was originally developed at Uber and is now
an LF AI hosted project at the Linux Foundation.
● Little modification to single-node code.
● High-performance I/O via MPI and NCCL.
● Same convergence theory.
Some limitations:
● Before v0.16, users still need to use mpirun to launch a job,
● … with a python training script: mpirun -np 16 -H server1:4,server2:4,server3:4,server4:4 -bind-to none
-map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python train.py
27
Hydrogen integration with Horovod
Databricks released HorovodRunner with Databricks Runtime 5.0 ML, built on
top of Horovod and Project Hydrogen.
● Runs Horovod under barrier execution mode.
● Hides cluster setup, scripts, MPI command line from users.
def train_hvd():
hvd.init()
… # train using Horovod
HorovodRunner(np=2).run(train_hvd)
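
A fuller, hedged sketch of such a train function, following the standard
Horovod Keras recipe (it assumes TensorFlow 2.x Keras and a GPU-enabled
Horovod build; the toy data and tiny model only keep the example self-contained):

import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

def train_hvd():
    hvd.init()                                          # one Horovod process per barrier task
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:                                            # pin each process to its local GPU
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')
    x = np.random.rand(1024, 10).astype('float32')      # toy data for illustration
    y = np.random.randint(0, 2, size=(1024, 1))
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation='sigmoid')])
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
    model.compile(loss='binary_crossentropy', optimizer=opt)
    model.fit(x, y, batch_size=64, epochs=2,
              callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
              verbose=1 if hvd.rank() == 0 else 0)

# On Databricks Runtime ML:
# from sparkdl import HorovodRunner
# HorovodRunner(np=2).run(train_hvd)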
28
Implementation of HorovodRunner
Integrating Horovod with barrier mode is straightforward:
● Pickle and broadcast the train function.
○ Inspect code and warn users about potential issues.
● Launch a Spark job in barrier execution mode.
● In the first executor, use worker addresses to launch the Horovod MPI job.
● Terminate Horovod if the Spark job got cancelled.
○ Hint: PR_SET_PDEATHSIG
Limitation:
● Tailored for Databricks Runtime ML
○ Horovod built with TensorFlow/PyTorch, SSH, OpenMPI, NCCL, etc.
○ Spark 2.4, GPU cluster configuration, etc.
29
horovod.spark
horovod.spark is a new feature in the Horovod 0.16 release. Similar to
HorovodRunner, it runs Horovod as a Spark job and takes Python training
functions. Its assumptions are more general:
● no dependency on SSH,
● system-independent process termination,
● multiple Spark versions,
● and more … also check out horovodrun:)
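
A hedged sketch of the horovod.spark entry point (assuming Horovod >= 0.16);
train_hvd is the same kind of training function shown with HorovodRunner above:

import horovod.spark

# run train_hvd as a Spark job with 2 Horovod processes;
# returns one result per worker
results = horovod.spark.run(train_hvd, num_proc=2)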
30
Collaboration on Horovod + Spark
Engineers at Uber and Databricks are collaborating on improving
the integration between Horovod and Spark. Goals:
● Merge design and code development into horovod.spark.
● HorovodRunner uses horovod.spark implementation with
extra Databricks-specific features.
● Support barrier execution mode and GPU-aware scheduling.
Stay tuned for the announcement from LF/Uber/Databricks!
31
Project Hydrogen: GPU-aware scheduling
Barrier Execution Mode | Optimized Data Exchange | Accelerator-Aware Scheduling
32
Accelerator-aware scheduling
Accelerators (GPUs, FPGAs) are widely used for accelerating
specialized workloads like deep learning and signal processing.
To utilize accelerators in a Spark cluster, Spark needs to be aware
of the accelerators assigned to the driver and executors and
schedule them according to user requests.
JIRA: SPARK-24615 (ETA: Spark 3.0)
33
Why does Spark need GPU awareness?
● Mesos, YARN, and Kubernetes already support GPUs.
● However, even when GPUs are allocated by a cluster manager for a Spark application,
Spark itself is not aware of them.
● Consider a simple case where one task needs one GPU:
[Diagram: Executor 0 and Executor 1 each expose GPU:0 and GPU:1; Tasks 0–3 are
running, and it is unclear which GPU Task 4 should use.]
34
Workarounds (a.k.a. hacks)
● Limit Spark task slots per node to 1.
○ The running task can safely claim all GPUs on the node.
○ It might lead to resource waste if the workload doesn’t need all GPUs.
○ User also needs to write multithreading code to maximize data I/O.
● Let running tasks collaboratively decide which GPUs to use, e.g., via
shared locks.
35
Proposed workflow (involving the user, Spark, and the cluster manager):
0. Auto-discover resources.
1. Submit an application with resource requests.
2. Pass resource requests to the cluster manager.
3. Allocate executors with resource isolation.
4. Register executors.
5. Submit a Spark job.
6. Schedule tasks on available executors.
7. Dynamic allocation.
8. Retrieve assigned resources and use them in tasks.
9. Monitor and recover failed executors.
36
Discover and request GPUs
Admin can specify a script to auto-discover GPUs (#24406)
● spark.driver.resource.gpu.discoveryScript
● spark.executor.resource.gpu.discoveryScript
● e.g., `nvidia-smi --query-gpu=index ...`
User can request GPUs at application level (#24374)
● spark.executor.resource.gpu.count
● spark.driver.resource.gpu.count
● spark.task.resource.gpu.count
37
Retrieve assigned GPUs
User can retrieve assigned GPUs from task context (#24374)
context = TaskContext.get()
assigned_gpu = context.getResources()["gpu"][0]
with tf.device(assigned_gpu):
    # training code ...
38
Cluster manager support
YARN: SPARK-27361
Kubernetes: SPARK-27362
Mesos: SPARK-27363
Standalone: SPARK-27360
39
Jenkins support (SPARK-27365)
To support end-to-end integration tests, we are adding GPU cards
to the Spark Jenkins machines hosted by the Berkeley RISELab.
Thanks to NVIDIA for donating the latest Tesla T4 cards!
40
Support other accelerators
We focus on GPU support but keep the interfaces general to
support other types of accelerators in the future, e.g., FPGA.
● “GPU” is not a hard-coded resource type.
● spark.executor.resource.{resourceType}.discoveryScript
● context.getResources() returns a map from resourceType to assigned addresses.
41
Features beyond the current SPIP
● Resource request at task level.
● Fine-grained scheduling within one GPU.
● Affinity and anti-affinity.
● ...
More on distributed training: data flow
We recommend the following data flow for training:
● Load and preprocess training data using Spark.
● Save preprocessed training data to a shared storage.
○ What format? TFRecords, Parquet + Petastorm.
○ Which shared storage? S3, Azure Blob Storage, HDFS, NFS, etc.
● Load training data in DL frameworks.
○ But DL frameworks do not work well with remote storage.
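
A hedged sketch of this flow (the table name, columns, and paths are
illustrative): preprocess with Spark, write Parquet to shared storage, then
read it back inside the training function, e.g., via Petastorm.

df = spark.read.table("training_events")                  # load from the warehouse
prepared = df.selectExpr("features", "label")             # preprocessing elided
prepared.write.mode("overwrite").parquet("dbfs:/mnt/training/prepared.parquet")

# inside the DL training function, read the Parquet files with Petastorm:
# from petastorm import make_batch_reader
# from petastorm.tf_utils import make_petastorm_dataset
# with make_batch_reader("file:///dbfs/mnt/training/prepared.parquet") as reader:
#     dataset = make_petastorm_dataset(reader)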
42
Connect DL frameworks to remote storage
We recommend high-performance FUSE clients to mount remote
storage as local files so DL frameworks can load/save data easily.
43
[Diagram: s3://bucket and wasb://container are mounted at file:/mnt/... on each
worker via FUSE clients (Goofys, blobfuse) and read directly by TensorFlow + Horovod.]
44
Story #2:
Streaming model inference
load using a Spark cluster w/ GPUs → predict w/ GPUs as a Spark task → model
45
Project Hydrogen: Optimized data exchange
Barrier Execution Mode | Optimized Data Exchange | Accelerator-Aware Scheduling
46
Optimized data exchange
None of the integrations are possible without exchanging data
between Spark and AI frameworks. And performance matters.
JIRA: SPARK-24579
47
Pandas UDF
Pandas UDF, introduced in Spark 2.3, uses Arrow for data exchange and
Pandas for vectorized computation.
48
Pandas UDF for distributed inference
Pandas UDF makes it simple to apply a model to a data stream.
@pandas_udf(...)
def predict(features):
    ...

spark.readStream.load(...) \
    .withColumn('prediction', predict(col('features')))
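
A minimal runnable version of the pattern above (PySpark 2.3+ with PyArrow
installed); the "model" is a stand-in that just sums the feature values so the
example is self-contained, and a tiny batch DataFrame replaces the stream:

import pandas as pd
from pyspark.sql.functions import pandas_udf, col, PandasUDFType

@pandas_udf('double', PandasUDFType.SCALAR)
def predict(features):
    # features is a pandas.Series of feature arrays; apply the "model" per batch
    return features.apply(lambda v: float(sum(v)))

df = spark.createDataFrame([([1.0, 2.0],), ([3.0, 4.0],)], ['features'])
df.withColumn('prediction', predict(col('features'))).show()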
49
Return StructType from Pandas UDF
We improved the scalar Pandas UDF to support complex return types, so users
can return predicted labels and raw scores together.
JIRA: SPARK-23836 (Spark 3.0)
@pandas_udf(...)
def predict(features):
# ...
return pd.DataFrame({'labels': labels, 'scores': scores})
50
Data pipelining
Without pipelining (CPU and GPU alternate):
t1: CPU fetches batch #1
t2: GPU processes batch #1
t3: CPU fetches batch #2
t4: GPU processes batch #2
t5: CPU fetches batch #3
t6: GPU processes batch #3
With pipelining (fetching overlaps with processing):
t1: CPU fetches batch #1
t2: CPU fetches batch #2, GPU processes batch #1
t3: CPU fetches batch #3, GPU processes batch #2
t4: GPU processes batch #3
51
Pandas UDF prefetch
To improve throughput, we prefetch Arrow record batches into a queue while
executing the Pandas UDF on the current batch.
● Enabled by default on Databricks Runtime 5.2.
● Up to 2x speedup for workloads with balanced I/O and compute.
● Observed 1.5x speedup on real workloads.
JIRA: SPARK-27569 (ETA: Spark 3.0)
52
Per-batch initialization overhead
Loading the model per batch introduces a constant overhead. We propose a
new Pandas UDF interface that takes an iterator of batches, so the model
only needs to be loaded once.
JIRA: SPARK-26412 (WIP)
@pandas_udf(...)
def predict(batches):
model = … # load model once
for batch in batches:
yield model.predict(batch)
53
Standardize on the Arrow format
Many accelerated computing libraries now support Arrow. The
community is discussing whether we should expose the Arrow
format in a public interface.
● Simplify data exchange.
● Reduce data copy/conversion overhead.
● Allow pluggable vectorization code.
JIRA: SPARK-27396 (pending vote)
54
Acknowledgement
● Many ideas in Project Hydrogen are based on previous
community work: TensorFrames, BigDL, Apache Arrow, Pandas
UDF, Spark GPU support, MPI, etc.
● We would like to thank many Spark committers and
contributors who helped the project proposal, design, and
implementation.
55
Acknowledgement
● Xingbo Jiang
● Thomas Graves
● Andy Feng
● Alex Sergeev
● Shane Knapp
● Xiao Li
● Li Jin
● Bryan Cutler
● Takuya Ueshin
● Wenchen Fan
● Jason Lowe
● Hyukjin Kwon
● Madhukar Korupolu
● Robert Evans
● Yinan Li
● Felix Cheung
● Imran Rashid
● Saisai Shao
● Mark Hamstra
● Sean Owen
● Yu Jiang
● … and many more!
Thank you!
