Xiangrui Meng, Databricks
Updates from Project Hydrogen: Unifying
State-of-the-Art AI and Big Data in Apache Spark
#UnifiedAnalytics #SparkAISummit
2
About me
● Software Engineer at Databricks
○ machine learning and data science/engineering
● Committer and PMC member of Apache Spark
○ MLlib, SparkR, PySpark, Spark Packages, etc
3
Announced last June, Project Hydrogen is a major Spark initiative
to unify state-of-the-art AI and big data workloads.
About Project Hydrogen
Barrier Execution Mode | Optimized Data Exchange | Accelerator-Aware Scheduling
4
Why Spark + AI?
Apache Spark: The First Unified Analytics Engine
● Big Data Processing: ETL + SQL + Streaming
● Machine Learning: MLlib + SparkR
● Spark Core Engine (Runtime, Delta)
5
AI is re-shaping the world
Huge disruptive innovations are affecting most enterprises on the planet:
Digital Personalization, Internet of Things, Healthcare and Genomics,
Fraud Prevention, and many more...
6
Better AI needs more data
7
When AI goes distributed ...
When datasets get bigger and bigger, we see more and more
distributed training scenarios and open-source offerings, e.g.,
distributed TensorFlow, Horovod, and distributed MXNet.
This is where Spark and AI cross.
8
9
Why Project Hydrogen?
Two simple stories
As a data scientist, I can:
● build a pipeline that fetches training events from a production
data warehouse and trains a DL model in parallel;
● apply a trained DL model to a distributed stream of events and
enrich it with predicted labels.
10
Distributed training
data warehouse → load → fit → model
Required: be able to read from Databricks Delta, Parquet, MySQL, Hive, etc.
Answer: Apache Spark
Required: a distributed GPU cluster for fast training
Answer: Horovod, Distributed TensorFlow, etc.
11
Two separate data and AI clusters?
load using a Spark cluster → save data → fit on a GPU cluster → model
required: glue code
12
Streaming model inference
Kafka → load → predict → model
required:
● save to stream sink
● GPU for fast inference
13
A hybrid Spark and AI cluster?
load using a Spark cluster w/ GPUs → fit a model distributedly on the same cluster → model
load using a Spark cluster w/ GPUs → predict w/ GPUs as a Spark task → model
14
Unfortunately, it doesn’t work out of the box.
See a previous demo.
16
Project Hydrogen to fill the major gaps
Barrier Execution Mode | Optimized Data Exchange | Accelerator-Aware Scheduling
17
Updates from Project Hydrogen
As a Spark contributor, I want to present:
● what features from Project Hydrogen are available,
● what features are in development.
As a Databricks engineer, I want to share:
● how we utilized features from Project Hydrogen,
● lessons learned and best practices.
18
Story #1:
Distributed training
load using a Spark cluster w/ GPUs → fit a model distributedly on the same cluster → model
19
Project Hydrogen: barrier execution mode
Barrier Execution Mode | Optimized Data Exchange | Accelerator-Aware Scheduling
20
Different execution models
Spark (MapReduce): tasks are independent of each other.
Embarrassingly parallel & massively scalable.
Distributed training: complete coordination among tasks.
Optimized for communication.
21
Barrier execution mode
We introduced gang scheduling to Spark on top of the MapReduce
execution model, so a distributed DL job can run as a Spark job.
● It starts all tasks together.
● It provides sufficient info and tooling to run a hybrid distributed job.
● It cancels and restarts all tasks in case of failures.
JIRA: SPARK-24374 (Spark 2.4)
22
API: RDD.barrier()
RDD.barrier() tells Spark to launch the tasks together.
rdd.barrier().mapPartitions { iter =>
val context = BarrierTaskContext.get()
...
}
23
API: context.barrier()
context.barrier() places a global barrier and waits until all tasks in
this stage hit this barrier.
val context = BarrierTaskContext.get()
… // preparation
context.barrier()
24
API: context.getTaskInfos()
context.getTaskInfos() returns info about all tasks in this stage.
if (context.partitionId == 0) {
val addrs = context.getTaskInfos().map(_.address)
... // start a hybrid training job, e.g., via MPI
}
context.barrier() // wait until training finishes
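
Putting the three calls together: a minimal PySpark sketch of the same pattern
(Spark 2.4+). The training launch itself is elided, and the function and
variable names here are illustrative, not from the slides.

from pyspark import BarrierTaskContext

def run_training(iterator):
    context = BarrierTaskContext.get()
    context.barrier()                          # wait until every task has started
    if context.partitionId() == 0:
        addrs = [info.address for info in context.getTaskInfos()]
        print("launch the hybrid job across:", addrs)   # e.g., via mpirun -H ...
    context.barrier()                          # wait until training finishes
    return []

rdd = spark.sparkContext.parallelize(range(4), numSlices=4)
rdd.barrier().mapPartitions(run_training).collect()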
25
Barrier mode integration
26
Horovod (an LF AI hosted project)
Horovod is a distributed training framework for TensorFlow,
Keras, PyTorch, and MXNet. It was originally developed at Uber and is now
an LF AI hosted project at the Linux Foundation.
● Little modification to single-node code.
● High-performance I/O via MPI and NCCL.
● Same convergence theory.
Some limitations:
● Before v0.16, users still need to use mpirun to launch a job,
● … with a python training script: mpirun -np 16 -H server1:4,server2:4,server3:4,server4:4 -bind-to none
-map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python train.py
27
Hydrogen integration with Horovod
Databricks released HorovodRunner with Databricks Runtime 5.0 ML, built on
top of Horovod and Project Hydrogen.
● Runs Horovod under barrier execution mode.
● Hides cluster setup, scripts, MPI command line from users.
def train_hvd():
hvd.init()
… # train using Horovod
HorovodRunner(np=2).run(train_hvd)
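
A fuller, hedged sketch of such a train function, following the standard
Horovod Keras recipe (it assumes TensorFlow 2.x Keras and a GPU-enabled
Horovod build; the toy data and tiny model only keep the example self-contained):

import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

def train_hvd():
    hvd.init()                                          # one Horovod process per barrier task
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:                                            # pin each process to its local GPU
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')
    x = np.random.rand(1024, 10).astype('float32')      # toy data for illustration
    y = np.random.randint(0, 2, size=(1024, 1))
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation='sigmoid')])
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
    model.compile(loss='binary_crossentropy', optimizer=opt)
    model.fit(x, y, batch_size=64, epochs=2,
              callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
              verbose=1 if hvd.rank() == 0 else 0)

# On Databricks Runtime ML:
# from sparkdl import HorovodRunner
# HorovodRunner(np=2).run(train_hvd)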
28
Implementation of HorovodRunner
Integrating Horovod with barrier mode is straightforward:
● Pickle and broadcast the train function.
○ Inspect code and warn users about potential issues.
● Launch a Spark job in barrier execution mode.
● In the first executor, use worker addresses to launch the Horovod MPI job.
● Terminate Horovod if the Spark job got cancelled.
○ Hint: PR_SET_PDEATHSIG
Limitation:
● Tailored for Databricks Runtime ML
○ Horovod built with TensorFlow/PyTorch, SSH, OpenMPI, NCCL, etc.
○ Spark 2.4, GPU cluster configuration, etc.
29
horovod.spark
horovod.spark is a new feature in the Horovod 0.16 release. Similar to
HorovodRunner, it runs Horovod as a Spark job and takes Python training
functions. Its assumptions are more general:
● no dependency on SSH,
● system-independent process termination,
● multiple Spark versions,
● and more … also check out horovodrun:)
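
A hedged sketch of the horovod.spark entry point (assuming Horovod >= 0.16);
train_hvd is the same kind of training function shown with HorovodRunner above:

import horovod.spark

# run train_hvd as a Spark job with 2 Horovod processes;
# returns one result per worker
results = horovod.spark.run(train_hvd, num_proc=2)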
30
Collaboration on Horovod + Spark
Engineers at Uber and Databricks are collaborating on improving
the integration between Horovod and Spark. Goals:
● Merge design and code development into horovod.spark.
● HorovodRunner uses horovod.spark implementation with
extra Databricks-specific features.
● Support barrier execution mode and GPU-aware scheduling.
Stay tuned for the announcement from LF/Uber/Databricks!
31
Project Hydrogen: GPU-aware scheduling
Barrier Execution Mode | Optimized Data Exchange | Accelerator-Aware Scheduling
32
Accelerator-aware scheduling
Accelerators (GPUs, FPGAs) are widely used for accelerating
specialized workloads like deep learning and signal processing.
To utilize accelerators in a Spark cluster, Spark needs to be aware
of the accelerators assigned to the driver and executors and
schedule them according to user requests.
JIRA: SPARK-24615 (ETA: Spark 3.0)
33
Why does Spark need GPU awareness?
● Mesos, YARN, and Kubernetes already support GPUs.
● However, even when GPUs are allocated by a cluster manager for a Spark application,
Spark itself is not aware of them.
● Consider a simple case where one task needs one GPU:
[Diagram: Executor 0 and Executor 1 each expose GPU:0 and GPU:1; Tasks 0–3 are
running, and it is unclear which GPU Task 4 should use.]
34
Workarounds (a.k.a. hacks)
● Limit Spark task slots per node to 1.
○ The running task can safely claim all GPUs on the node.
○ It might lead to resource waste if the workload doesn’t need all GPUs.
○ User also needs to write multithreading code to maximize data I/O.
● Let running tasks collaboratively decide which GPUs to use, e.g., via
shared locks.
35
Proposed workflow (involving the user, Spark, and the cluster manager):
0. Auto-discover resources.
1. Submit an application with resource requests.
2. Pass resource requests to the cluster manager.
3. Allocate executors with resource isolation.
4. Register executors.
5. Submit a Spark job.
6. Schedule tasks on available executors.
7. Dynamic allocation.
8. Retrieve assigned resources and use them in tasks.
9. Monitor and recover failed executors.
36
Discover and request GPUs
Admin can specify a script to auto-discover GPUs (#24406)
● spark.driver.resource.gpu.discoveryScript
● spark.executor.resource.gpu.discoveryScript
● e.g., `nvidia-smi --query-gpu=index ...`
User can request GPUs at application level (#24374)
● spark.executor.resource.gpu.count
● spark.driver.resource.gpu.count
● spark.task.resource.gpu.count
37
Retrieve assigned GPUs
User can retrieve assigned GPUs from task context (#24374)
context = TaskContext.get()
assigned_gpu = context.getResources()["gpu"][0]
with tf.device(assigned_gpu):
    # training code ...
38
Cluster manager support
YARN: SPARK-27361
Kubernetes: SPARK-27362
Mesos: SPARK-27363
Standalone: SPARK-27360
39
Jenkins support (SPARK-27365)
To support end-to-end integration tests, we are adding GPU cards
to the Spark Jenkins machines hosted by the Berkeley RISELab.
Thanks to NVIDIA for donating the latest Tesla T4 cards!
40
Support other accelerators
We focus on GPU support but keep the interfaces general to
support other types of accelerators in the future, e.g., FPGA.
● “GPU” is not a hard-coded resource type.
● spark.executor.resource.{resourceType}.discoveryScript
● context.getResources() returns a map from resourceType to assigned addresses.
41
Features beyond the current SPIP
● Resource request at task level.
● Fine-grained scheduling within one GPU.
● Affinity and anti-affinity.
● ...
More on distributed training: data flow
We recommend the following data flow for training:
● Load and preprocess training data using Spark.
● Save preprocessed training data to a shared storage.
○ What format? TFRecords, Parquet + Petastorm.
○ Which shared storage? S3, Azure Blob Storage, HDFS, NFS, etc.
● Load training data in DL frameworks.
○ But DL frameworks do not work well with remote storage.
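
A hedged sketch of this flow (the table name, columns, and paths are
illustrative): preprocess with Spark, write Parquet to shared storage, then
read it back inside the training function, e.g., via Petastorm.

df = spark.read.table("training_events")                  # load from the warehouse
prepared = df.selectExpr("features", "label")             # preprocessing elided
prepared.write.mode("overwrite").parquet("dbfs:/mnt/training/prepared.parquet")

# inside the DL training function, read the Parquet files with Petastorm:
# from petastorm import make_batch_reader
# from petastorm.tf_utils import make_petastorm_dataset
# with make_batch_reader("file:///dbfs/mnt/training/prepared.parquet") as reader:
#     dataset = make_petastorm_dataset(reader)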
42
Connect DL frameworks to remote storage
We recommend high-performance FUSE clients to mount remote
storage as local files so DL frameworks can load/save data easily.
43
[Diagram: s3://bucket and wasb://container are mounted at file:/mnt/... on each
worker via FUSE clients (Goofys, blobfuse) and read directly by TensorFlow + Horovod.]
44
Story #2:
Streaming model inference
load using a Spark cluster w/ GPUs → predict w/ GPUs as a Spark task → model
45
Project Hydrogen: Optimized data exchange
Barrier Execution Mode | Optimized Data Exchange | Accelerator-Aware Scheduling
46
Optimized data exchange
None of the integrations are possible without exchanging data
between Spark and AI frameworks. And performance matters.
JIRA: SPARK-24579
47
Pandas UDF
Pandas UDF, introduced in Spark 2.3, uses Arrow for data exchange and
Pandas for vectorized computation.
48
Pandas UDF for distributed inference
Pandas UDF makes it simple to apply a model to a data stream.
@pandas_udf(...)
def predict(features):
    ...

spark.readStream.load(...) \
    .withColumn('prediction', predict(col('features')))
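
A minimal runnable version of the pattern above (PySpark 2.3+ with PyArrow
installed); the "model" is a stand-in that just sums the feature values so the
example is self-contained, and a tiny batch DataFrame replaces the stream:

import pandas as pd
from pyspark.sql.functions import pandas_udf, col, PandasUDFType

@pandas_udf('double', PandasUDFType.SCALAR)
def predict(features):
    # features is a pandas.Series of feature arrays; apply the "model" per batch
    return features.apply(lambda v: float(sum(v)))

df = spark.createDataFrame([([1.0, 2.0],), ([3.0, 4.0],)], ['features'])
df.withColumn('prediction', predict(col('features'))).show()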
49
Return StructType from Pandas UDF
We improved the scalar Pandas UDF to support complex return types, so users
can return predicted labels and raw scores together.
JIRA: SPARK-23836 (Spark 3.0)
@pandas_udf(...)
def predict(features):
# ...
return pd.DataFrame({'labels': labels, 'scores': scores})
50
Data pipelining
Without pipelining (CPU and GPU alternate):
t1: CPU fetches batch #1
t2: GPU processes batch #1
t3: CPU fetches batch #2
t4: GPU processes batch #2
t5: CPU fetches batch #3
t6: GPU processes batch #3
With pipelining (fetching overlaps with processing):
t1: CPU fetches batch #1
t2: CPU fetches batch #2, GPU processes batch #1
t3: CPU fetches batch #3, GPU processes batch #2
t4: GPU processes batch #3
51
Pandas UDF prefetch
To improve throughput, we prefetch Arrow record batches into a queue while
executing the Pandas UDF on the current batch.
● Enabled by default on Databricks Runtime 5.2.
● Up to 2x speedup for workloads with balanced I/O and compute.
● Observed 1.5x speedup on real workloads.
JIRA: SPARK-27569 (ETA: Spark 3.0)
52
Per-batch initialization overhead
Loading the model per batch introduces a constant overhead. We propose a
new Pandas UDF interface that takes an iterator of batches, so the model
only needs to be loaded once.
JIRA: SPARK-26412 (WIP)
@pandas_udf(...)
def predict(batches):
model = … # load model once
for batch in batches:
yield model.predict(batch)
53
Standardize on the Arrow format
Many accelerated computing libraries now support Arrow. The
community is discussing whether we should expose the Arrow
format in a public interface.
● Simplify data exchange.
● Reduce data copy/conversion overhead.
● Allow pluggable vectorization code.
JIRA: SPARK-27396 (pending vote)
54
Acknowledgement
● Many ideas in Project Hydrogen are based on previous
community work: TensorFrames, BigDL, Apache Arrow, Pandas
UDF, Spark GPU support, MPI, etc.
● We would like to thank many Spark committers and
contributors who helped the project proposal, design, and
implementation.
55
Acknowledgement
● Xingbo Jiang
● Thomas Graves
● Andy Feng
● Alex Sergeev
● Shane Knapp
● Xiao Li
● Li Jin
● Bryan Cutler
● Takuya Ueshin
● Wenchen Fan
● Jason Lowe
● Hyukjin Kwon
● Madhukar Korupolu
● Robert Evans
● Yinan Li
● Felix Cheung
● Imran Rashid
● Saisai Shao
● Mark Hamstra
● Sean Owen
● Yu Jiang
● … and many more!
Thank you!
