SlideShare a Scribd company logo
Project Hydrogen, HorovodRunner, and
Pandas UDF: Distributed Deep Learning
Training and Inference on Apache Spark
Lu WANG
2019-01-17 BASM Meetup @ Unravel Data
1
About
•Lu Wang
•Software engineer @ Databricks
•Ph.D. from Penn State in Mathematics
•Contributor to Deep Learning Pipelines
2
Table of contents
• Introduction
• Project Hydrogen: Spark + AI
• Barrier Execution Mode: HorovodRunner
• Optimized Data Exchange: Model inference with PandasUDF
• Accelerator Aware Scheduling
• Conclusion
3
Big data v.s. AI Technologies
4
X
Big data for AI
There are many efforts from the Spark community to integrate
Spark with AI/ML frameworks:
● (Yahoo) CaffeOnSpark, TensorFlowOnSpark
● (John Snow Labs) Spark-NLP
● (Databricks) spark-sklearn, tensorframes, spark-deep-learning
● … 80+ ML/AI packages on spark-packages.org
5
AI needs big data
We have seen efforts from the DL libraries to handle different data
scenarios:
● tf.data, tf.estimator
● spark-tensorflow-connector
● torch.utils.data
● … ...
6
The status quo: two simple stories
As a data scientist, I can:
● build a pipeline that fetches training events from a production
data warehouse and trains a DL model in parallel;
● apply a trained DL model to a distributed stream of events and
enrich it with predicted labels.
7
Distributed DL training
data warehouse load fit model
Read from Databricks Delta,
Parquet,
MySQL, Hive, etc.
Distributed GPU clusters for fast
training
Horovod, Distributed Tensorflow,
etc
Databricks
Delta
8
Distributed model inference
data warehouse prep predict model
● GPU for fast inference
9
Two Challenges in Supporting AI
Frameworks in Spark
Data exchange:
need to push data in high
throughput between Spark and
Accelerated frameworks
Execution mode:
fundamental incompatibility between
Spark (embarrassingly parallel) vs AI
frameworks (gang scheduled)
1 2
10
Data exchange:
Vectorized Data Exchange
Accelerator-aware
scheduling
Execution mode:
Barrier Execution Mode
1 2
Project Hydrogen: Spark + AI
11
1 2
Project Hydrogen: Spark + AI
12
Execution mode:
Barrier Execution Mode
Data exchange:
Vectorized Data Exchange
Accelerator-aware
scheduling
Different execution mode
Task 1
Task 2
Task 3
Spark
Tasks are independent of each other
Embarrassingly parallel & massively scalable
Distributed Training
Complete coordination among tasks
Optimized for communication
13
Different execution mode
Task 1
Task 2
Task 3
Spark
Tasks are independent of each other
Embarrassingly parallel & massively scalable
If one crashes…
Distributed Training
Complete coordination among tasks
Optimized for communication
14
Different execution mode
Task 1
Task 2
Task 3
Spark
Tasks are independent of each other
Embarrassingly parallel & massively scalable
If one crashes, rerun that one
Distributed Training
Complete coordination among tasks
Optimized for communication
If one crashes, must rerun all tasks
15
16
Barrier execution mode
We introduce gang scheduling to Spark on top of MapReduce
execution model. So a distributed DL job can run as a Spark job.
● It starts all tasks together.
● It provides sufficient info and tooling to run a hybrid distributed job.
● It cancels and restarts all tasks in case of failures.
17
RDD.barrier()
RDD.barrier() tells Spark to launch the tasks together.
rdd.barrier().mapPartitions { iter =>
val context = BarrierTaskContext.get()
...
}
18
context.barrier()
context.barrier() places a global barrier and waits until all tasks in
this stage hit this barrier.
val context = BarrierTaskContext.get()
… // write partition data out
context.barrier()
19
context.getTaskInfos()
context.getTaskInfos() returns info about all tasks in this stage.
if (context.partitionId == 0) {
val addrs = context.getTaskInfos().map(_.address)
... // start a hybrid training job, e.g., via MPI
}
context.barrier() // wait until training finishes
Distributed DL training with barrier
Stage 1
data prep
embarrassingly parallel
Stage 2
distributed ML training
gang scheduled
Stage 3
data sink
embarrassingly parallel
HorovodRunner: a general API to run distributed deep learning
workloads on Databricks using Uber's Horovod framework
20
Why start with Horovod?
Horovod is a distributed training framework developed at Uber
● Supports TensorFlow, Keras, and PyTorch
● Easy to use
■ Users only need to slightly modify single-node training code to use Horovod
● Horovod offers good scaling efficiency
21
Why HorovodRunner?
HorovodRunner makes it easy to run Horovod on Databricks.
● Horovod runs an MPI job for distributed training, which is
hard to set up
● It is hard to schedule an MPI job on a Spark cluster
22
HorovodRunner
The HorovodRunner API supports the following methods:
● init(self, np)
○ Create an instance of HorovodRunner.
● run(self, main, **kwargs)
○ Run a Horovod training job invoking main(**kwargs).
def train():
hvd.init()
hr = HorovodRunner(np=2)
hr.run(train)
23
Workflow with HorovodRunner
data
prep
Barrier Execution Mode
Model
Spark Driver
Spark Executor 0
Spark Executor 1 Spark Executor 2 … ...
24
Single-node to distributed
Development workflow for distributed DL training is as following
● Data preparation
● Prepare single node DL code
● Add Horovod hooks
● Run distributively with HorovodRunner
25
Demo
26
1 2
Project Hydrogen: Spark + AI
27
Execution mode:
Barrier Execution Mode
Data exchange:
Vectorized Data Exchange
Accelerator-aware
scheduling
Row-at-a-time Data Exchange
Spark
Python UDF
john 4.1
mike 3.5
sally 6.4
john 4.1
2john 4.1
2john 4.1
28
Row-at-a-time Data Exchange
Spark
Python UDF
john 4.1
mike 3.5
sally 6.4
3
mike 3.5
3mike 3.5
mike 3.5
29
Vectorized Data Exchange: PandasUDF
Spark
Pandas UDF
john 4.1
mike 3.5
sally 6.4
2john 4.1
3mike 3.5
4sally 6.4
john 4.1
mike 3.5
sally 6.4
2
3
4
john 4.1
mike 3.5
sally 6.4
30
Performance - 3 to 240X faster
31
0.9
1.1
7.2
Plus One
CDF
Subtract
Mean
3.15
242
117
Runtime (seconds - shorter is better)
Row-at-a-time
Vectorized
PandasUDF
Distributed model inference
data
prep
Pre-
processing
PredictPredict
Pre-
processing
32
PandasUDF
Distributed model inference
data
prep
Pre-
processing
Predict
Pre-
processing
Pre-
processing
Predict
33
Demo
34
Accelerator-aware scheduling (SPIP)
To utilize accelerators (GPUs, FPGAs) in a heterogeneous cluster
or to utilize multiple accelerators in a multi-task node, Spark
needs to understand the accelerators installed on each node.
SPIP JIRA: SPARK-24615 (pending vote, ETA: 3.0)
35
Request accelerators
With accelerator awareness, users can specify accelerator
constraints or hints (API pending discussion):
rdd.accelerated
.by(“/gpu/p100”)
.numPerTask(2)
.required // or .optional
36
Multiple tasks on the same node
When multiple tasks are scheduled on the same node with
multiple GPUs, each task knows which GPUs are assigned to avoid
crashing into each other (API pending discussion):
// inside a task closure
val gpus = context.getAcceleratorInfos()
37
Conclusion
● Barrier Execution Mode
○ HorovodRunner
● Optimized Data Exchange: PandasUDF
○ Model inference with PandasUDF
● Accelerator Aware Scheduling
38
databricks.com/sparkaisummit
39
Thank you!
40
Main Contents to deliver
• Why care about distributed training
• HorovodRunner
• why need Barrier execution model for DL training
• First application: HorovodRunner
• DL Model Inference
• Why need PandasUDF and prefetching
• how to optimize Model Inference on databricks
• optimization advice
• Demo
41
Main things the audience may take
• Users can easily run distributed DL training on Databricks
• How to migrate single node DL workflow to distributed with
HorovodRunner
• What kind of support with HorovodRunner to tune the distributed
code: Tensorboard, timeline, mlflow
• Users can do model inference efficiently on Databricks
• How to do model inference with PandasUDF
• Some hints to optimize the code
42
Table Contents
• Introduction (4 min)
• Why we need big data/ distributed training (2 min / 3 slides)
• Workflow of Distributed DL training and model inference (2 min / 2
slides)
• Project Hydrogen: Spark + AI (15 min)
• Barrier Execution Mode: HorovodRunner (8 min / 12 slides)
• Optimized Data Exchange: Model inference with PandasUDF (6 min / 8
slides)
• Accelerator Aware Scheduling (1 min / 3 slides)
• Demo (9 min)
• Conclusion (4 min) 43
def train(epochs=12, lr=1.0):
model = get_model()
dataset = get_dataset(train_dir)
opt = keras.optimizers.Adadelta(lr=lr)
model.compile(loss=keras.losses.categorical_crossentropy,
optimizer=opt,
metrics=['accuracy'])
model.fit(dataset, epochs=epochs, steps_per_epoch=30,
verbose=2, validation_data=val_dataset,
callbacks=callbacks,
validation_steps=3)
hr = HorovodRunner(np=4)
hr.run(train, epochs=20)
opt = hvd.DistributedOptimizer(opt)
hvd.init() Initialize Horovod
Wrap the optimizer
hvd.broadcast_global_variables(0) Initialize the vars
44
Other features of HorovodRunner
● Timeline
● Tensorboard
● MLflow
● Multi-GPU support (available in Databricks Runtime 5.2 ML)
45
PandasUDF
Distributed model inference
data
prep
Pre-
processing
Predict
Pre-
processing
Predict
GPU for fast inference
46
PandasUDF
Performance Tuning Guide
data
prep
Pre-
processing
Predict
Pre-
processing
Predict
● Reduce the model to a trivial model and measure the running time.
● Check the GPU utilization metrics.
47
PandasUDF
Tips to optimize predict
data
prep
Pre-
processing
Predict
Pre-
processing
Predict
● Increase the batch size to increase the GPU utilization
48
PandasUDF
Tips to optimize preprocessing
data
prep
Pre-
processing
Predict
Pre-
processing
Predict
● data prefetching
● parallel data loading and preprocessing
● Run part of preprocessing on GPU
49
PandasUDF
Performance Tuning Guide
data
prep
Pre-
processing
Predict
Pre-
processing
Predict
● Set the max records per batch and prefetching for Pandas UDF
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "5000")
spark.conf.set("spark.databricks.sql.execution.pandasUDF.maxPrefetch", 2)
50
Supplemental material
51
When AI goes distributed ...
When datasets get bigger and bigger, we see more and more
distributed training scenarios and open-source offerings, e.g.,
distributed TensorFlow, Horovod, and distributed MXNet.
This is where we see Spark and AI efforts overlap more.
52

More Related Content

What's hot

Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkTaras Matyashovsky
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingCloudera, Inc.
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Xuan-Chao Huang
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17spark-project
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingDatabricks
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introductionsudhakara st
 
Spark and Spark Streaming
Spark and Spark StreamingSpark and Spark Streaming
Spark and Spark Streaming宇 傅
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Databricks
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveSachin Aggarwal
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkSamy Dindane
 
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)tsliwowicz
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache SparkBTI360
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark TutorialAhmet Bulut
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDsDean Chen
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark Aakashdata
 

What's hot (20)

Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark Streaming
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Spark and Spark Streaming
Spark and Spark StreamingSpark and Spark Streaming
Spark and Spark Streaming
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
 

Similar to Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Training and Inference on Apache Spark

Big Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARK
Big Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARKBig Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARK
Big Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARKMatt Stubbs
 
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache SparkProject Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache SparkDatabricks
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Databricks
 
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...Databricks
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Databricks
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptbhargavi804095
 
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton UniversitySpark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton UniversityAlex Zeltov
 
End-to-End Deep Learning with Horovod on Apache Spark
End-to-End Deep Learning with Horovod on Apache SparkEnd-to-End Deep Learning with Horovod on Apache Spark
End-to-End Deep Learning with Horovod on Apache SparkDatabricks
 
Apache Spark 2.4 Bridges the Gap Between Big Data and Deep Learning
Apache Spark 2.4 Bridges the Gap Between Big Data and Deep LearningApache Spark 2.4 Bridges the Gap Between Big Data and Deep Learning
Apache Spark 2.4 Bridges the Gap Between Big Data and Deep LearningDataWorks Summit
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Olalekan Fuad Elesin
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Holden Karau
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Holden Karau
 
Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Steve Min
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkManish Gupta
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 
Scala Meetup Hamburg - Spark
Scala Meetup Hamburg - SparkScala Meetup Hamburg - Spark
Scala Meetup Hamburg - SparkIvan Morozov
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analyticsinoshg
 
Scaling Deep Learning on Hadoop at LinkedIn
Scaling Deep Learning on Hadoop at LinkedInScaling Deep Learning on Hadoop at LinkedIn
Scaling Deep Learning on Hadoop at LinkedInAnthony Hsu
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark Mostafa
 

Similar to Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Training and Inference on Apache Spark (20)

Big Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARK
Big Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARKBig Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARK
Big Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARK
 
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache SparkProject Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
 
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.ppt
 
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton UniversitySpark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
 
End-to-End Deep Learning with Horovod on Apache Spark
End-to-End Deep Learning with Horovod on Apache SparkEnd-to-End Deep Learning with Horovod on Apache Spark
End-to-End Deep Learning with Horovod on Apache Spark
 
Apache Spark 2.4 Bridges the Gap Between Big Data and Deep Learning
Apache Spark 2.4 Bridges the Gap Between Big Data and Deep LearningApache Spark 2.4 Bridges the Gap Between Big Data and Deep Learning
Apache Spark 2.4 Bridges the Gap Between Big Data and Deep Learning
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018
 
Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Scala Meetup Hamburg - Spark
Scala Meetup Hamburg - SparkScala Meetup Hamburg - Spark
Scala Meetup Hamburg - Spark
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
 
Scaling Deep Learning on Hadoop at LinkedIn
Scaling Deep Learning on Hadoop at LinkedInScaling Deep Learning on Hadoop at LinkedIn
Scaling Deep Learning on Hadoop at LinkedIn
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 

More from Anyscale

ACM Sunnyvale Meetup.pdf
ACM Sunnyvale Meetup.pdfACM Sunnyvale Meetup.pdf
ACM Sunnyvale Meetup.pdfAnyscale
 
What's Next for MLflow in 2019
What's Next for MLflow in 2019What's Next for MLflow in 2019
What's Next for MLflow in 2019Anyscale
 
Putting AI to Work on Apache Spark
Putting AI to Work on Apache SparkPutting AI to Work on Apache Spark
Putting AI to Work on Apache SparkAnyscale
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksAnyscale
 
Monitoring Error Logs at Databricks
Monitoring Error Logs at DatabricksMonitoring Error Logs at Databricks
Monitoring Error Logs at DatabricksAnyscale
 
Monitoring Large-Scale Apache Spark Clusters at Databricks
Monitoring Large-Scale Apache Spark Clusters at DatabricksMonitoring Large-Scale Apache Spark Clusters at Databricks
Monitoring Large-Scale Apache Spark Clusters at DatabricksAnyscale
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Anyscale
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsAnyscale
 
A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark Anyscale
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0Anyscale
 

More from Anyscale (11)

ACM Sunnyvale Meetup.pdf
ACM Sunnyvale Meetup.pdfACM Sunnyvale Meetup.pdf
ACM Sunnyvale Meetup.pdf
 
What's Next for MLflow in 2019
What's Next for MLflow in 2019What's Next for MLflow in 2019
What's Next for MLflow in 2019
 
Putting AI to Work on Apache Spark
Putting AI to Work on Apache SparkPutting AI to Work on Apache Spark
Putting AI to Work on Apache Spark
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
 
Monitoring Error Logs at Databricks
Monitoring Error Logs at DatabricksMonitoring Error Logs at Databricks
Monitoring Error Logs at Databricks
 
Monitoring Large-Scale Apache Spark Clusters at Databricks
Monitoring Large-Scale Apache Spark Clusters at DatabricksMonitoring Large-Scale Apache Spark Clusters at Databricks
Monitoring Large-Scale Apache Spark Clusters at Databricks
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
 
A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0
 

Recently uploaded

Crafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM IntegrationCrafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM IntegrationWave PLM
 
AI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAlluxio, Inc.
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILNatan Silnitsky
 
Abortion ^Clinic ^%[+971588192166''] Abortion Pill Al Ain (?@?) Abortion Pill...
Abortion ^Clinic ^%[+971588192166''] Abortion Pill Al Ain (?@?) Abortion Pill...Abortion ^Clinic ^%[+971588192166''] Abortion Pill Al Ain (?@?) Abortion Pill...
Abortion ^Clinic ^%[+971588192166''] Abortion Pill Al Ain (?@?) Abortion Pill...Abortion Clinic
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessWSO2
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamtakuyayamamoto1800
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTier1 app
 
GraphAware - Transforming policing with graph-based intelligence analysis
GraphAware - Transforming policing with graph-based intelligence analysisGraphAware - Transforming policing with graph-based intelligence analysis
GraphAware - Transforming policing with graph-based intelligence analysisNeo4j
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2
 
top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownloadvrstrong314
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Shahin Sheidaei
 
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdfA Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdfkalichargn70th171
 
Advanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should KnowAdvanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should KnowPeter Caitens
 
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1KnowledgeSeed
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar
 
Designing for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web ServicesDesigning for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web ServicesKrzysztofKkol1
 
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAlluxio, Inc.
 
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...Alluxio, Inc.
 
iGaming Platform & Lottery Solutions by Skilrock
iGaming Platform & Lottery Solutions by SkilrockiGaming Platform & Lottery Solutions by Skilrock
iGaming Platform & Lottery Solutions by SkilrockSkilrock Technologies
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Anthony Dahanne
 

Recently uploaded (20)

Crafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM IntegrationCrafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM Integration
 
AI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in Michelangelo
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
 
Abortion ^Clinic ^%[+971588192166''] Abortion Pill Al Ain (?@?) Abortion Pill...
Abortion ^Clinic ^%[+971588192166''] Abortion Pill Al Ain (?@?) Abortion Pill...Abortion ^Clinic ^%[+971588192166''] Abortion Pill Al Ain (?@?) Abortion Pill...
Abortion ^Clinic ^%[+971588192166''] Abortion Pill Al Ain (?@?) Abortion Pill...
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
 
GraphAware - Transforming policing with graph-based intelligence analysis
GraphAware - Transforming policing with graph-based intelligence analysisGraphAware - Transforming policing with graph-based intelligence analysis
GraphAware - Transforming policing with graph-based intelligence analysis
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
 
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdfA Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
 
Advanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should KnowAdvanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should Know
 
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
 
Designing for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web ServicesDesigning for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web Services
 
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
 
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
 
iGaming Platform & Lottery Solutions by Skilrock
iGaming Platform & Lottery Solutions by SkilrockiGaming Platform & Lottery Solutions by Skilrock
iGaming Platform & Lottery Solutions by Skilrock
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
 

Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Training and Inference on Apache Spark

  • 1. Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Training and Inference on Apache Spark Lu WANG 2019-01-17 BASM Meetup @ Unravel Data 1
  • 2. About •Lu Wang •Software engineer @ Databricks •Ph.D. from Penn State in Mathematics •Contributor to Deep Learning Pipelines 2
  • 3. Table of contents • Introduction • Project Hydrogen: Spark + AI • Barrier Execution Mode: HorovodRunner • Optimized Data Exchange: Model inference with PandasUDF • Accelerator Aware Scheduling • Conclusion 3
  • 4. Big data v.s. AI Technologies 4 X
  • 5. Big data for AI There are many efforts from the Spark community to integrate Spark with AI/ML frameworks: ● (Yahoo) CaffeOnSpark, TensorFlowOnSpark ● (John Snow Labs) Spark-NLP ● (Databricks) spark-sklearn, tensorframes, spark-deep-learning ● … 80+ ML/AI packages on spark-packages.org 5
  • 6. AI needs big data We have seen efforts from the DL libraries to handle different data scenarios: ● tf.data, tf.estimator ● spark-tensorflow-connector ● torch.utils.data ● … ... 6
  • 7. The status quo: two simple stories As a data scientist, I can: ● build a pipeline that fetches training events from a production data warehouse and trains a DL model in parallel; ● apply a trained DL model to a distributed stream of events and enrich it with predicted labels. 7
  • 8. Distributed DL training data warehouse load fit model Read from Databricks Delta, Parquet, MySQL, Hive, etc. Distributed GPU clusters for fast training Horovod, Distributed Tensorflow, etc Databricks Delta 8
  • 9. Distributed model inference data warehouse prep predict model ● GPU for fast inference 9
  • 10. Two Challenges in Supporting AI Frameworks in Spark Data exchange: need to push data in high throughput between Spark and Accelerated frameworks Execution mode: fundamental incompatibility between Spark (embarrassingly parallel) vs AI frameworks (gang scheduled) 1 2 10
  • 11. Data exchange: Vectorized Data Exchange Accelerator-aware scheduling Execution mode: Barrier Execution Mode 1 2 Project Hydrogen: Spark + AI 11
  • 12. 1 2 Project Hydrogen: Spark + AI 12 Execution mode: Barrier Execution Mode Data exchange: Vectorized Data Exchange Accelerator-aware scheduling
  • 13. Different execution mode Task 1 Task 2 Task 3 Spark Tasks are independent of each other Embarrassingly parallel & massively scalable Distributed Training Complete coordination among tasks Optimized for communication 13
  • 14. Different execution mode Task 1 Task 2 Task 3 Spark Tasks are independent of each other Embarrassingly parallel & massively scalable If one crashes… Distributed Training Complete coordination among tasks Optimized for communication 14
  • 15. Different execution mode Task 1 Task 2 Task 3 Spark Tasks are independent of each other Embarrassingly parallel & massively scalable If one crashes, rerun that one Distributed Training Complete coordination among tasks Optimized for communication If one crashes, must rerun all tasks 15
  • 16. 16 Barrier execution mode We introduce gang scheduling to Spark on top of MapReduce execution model. So a distributed DL job can run as a Spark job. ● It starts all tasks together. ● It provides sufficient info and tooling to run a hybrid distributed job. ● It cancels and restarts all tasks in case of failures.
  • 17. 17 RDD.barrier() RDD.barrier() tells Spark to launch the tasks together. rdd.barrier().mapPartitions { iter => val context = BarrierTaskContext.get() ... }
  • 18. 18 context.barrier() context.barrier() places a global barrier and waits until all tasks in this stage hit this barrier. val context = BarrierTaskContext.get() … // write partition data out context.barrier()
  • 19. 19 context.getTaskInfos() context.getTaskInfos() returns info about all tasks in this stage. if (context.partitionId == 0) { val addrs = context.getTaskInfos().map(_.address) ... // start a hybrid training job, e.g., via MPI } context.barrier() // wait until training finishes
  • 20. Distributed DL training with barrier Stage 1 data prep embarrassingly parallel Stage 2 distributed ML training gang scheduled Stage 3 data sink embarrassingly parallel HorovodRunner: a general API to run distributed deep learning workloads on Databricks using Uber's Horovod framework 20
  • 21. Why start with Horovod? Horovod is a distributed training framework developed at Uber ● Supports TensorFlow, Keras, and PyTorch ● Easy to use ■ Users only need to slightly modify single-node training code to use Horovod ● Horovod offers good scaling efficiency 21
  • 22. Why HorovodRunner? HorovodRunner makes it easy to run Horovod on Databricks. ● Horovod runs an MPI job for distributed training, which is hard to set up ● It is hard to schedule an MPI job on a Spark cluster 22
  • 23. HorovodRunner The HorovodRunner API supports the following methods: ● init(self, np) ○ Create an instance of HorovodRunner. ● run(self, main, **kwargs) ○ Run a Horovod training job invoking main(**kwargs). def train(): hvd.init() hr = HorovodRunner(np=2) hr.run(train) 23
  • 24. Workflow with HorovodRunner data prep Barrier Execution Mode Model Spark Driver Spark Executor 0 Spark Executor 1 Spark Executor 2 … ... 24
  • 25. Single-node to distributed Development workflow for distributed DL training is as following ● Data preparation ● Prepare single node DL code ● Add Horovod hooks ● Run distributively with HorovodRunner 25
  • 27. 1 2 Project Hydrogen: Spark + AI 27 Execution mode: Barrier Execution Mode Data exchange: Vectorized Data Exchange Accelerator-aware scheduling
  • 28. Row-at-a-time Data Exchange Spark Python UDF john 4.1 mike 3.5 sally 6.4 john 4.1 2john 4.1 2john 4.1 28
  • 29. Row-at-a-time Data Exchange Spark Python UDF john 4.1 mike 3.5 sally 6.4 3 mike 3.5 3mike 3.5 mike 3.5 29
  • 30. Vectorized Data Exchange: PandasUDF Spark Pandas UDF john 4.1 mike 3.5 sally 6.4 2john 4.1 3mike 3.5 4sally 6.4 john 4.1 mike 3.5 sally 6.4 2 3 4 john 4.1 mike 3.5 sally 6.4 30
  • 31. Performance - 3 to 240X faster 31 0.9 1.1 7.2 Plus One CDF Subtract Mean 3.15 242 117 Runtime (seconds - shorter is better) Row-at-a-time Vectorized
  • 35. Accelerator-aware scheduling (SPIP) To utilize accelerators (GPUs, FPGAs) in a heterogeneous cluster or to utilize multiple accelerators in a multi-task node, Spark needs to understand the accelerators installed on each node. SPIP JIRA: SPARK-24615 (pending vote, ETA: 3.0) 35
  • 36. Request accelerators With accelerator awareness, users can specify accelerator constraints or hints (API pending discussion): rdd.accelerated .by(“/gpu/p100”) .numPerTask(2) .required // or .optional 36
  • 37. Multiple tasks on the same node When multiple tasks are scheduled on the same node with multiple GPUs, each task knows which GPUs are assigned to avoid crashing into each other (API pending discussion): // inside a task closure val gpus = context.getAcceleratorInfos() 37
  • 38. Conclusion ● Barrier Execution Mode ○ HorovodRunner ● Optimized Data Exchange: PandasUDF ○ Model inference with PandasUDF ● Accelerator Aware Scheduling 38
  • 41. Main Contents to deliver • Why care about distributed training • HorovodRunner • why need Barrier execution model for DL training • First application: HorovodRunner • DL Model Inference • Why need PandasUDF and prefetching • how to optimize Model Inference on databricks • optimization advice • Demo 41
  • 42. Main things the audience may take • Users can easily run distributed DL training on Databricks • How to migrate single node DL workflow to distributed with HorovodRunner • What kind of support with HorovodRunner to tune the distributed code: Tensorboard, timeline, mlflow • Users can do model inference efficiently on Databricks • How to do model inference with PandasUDF • Some hints to optimize the code 42
  • 43. Table Contents • Introduction (4 min) • Why we need big data/ distributed training (2 min / 3 slides) • Workflow of Distributed DL training and model inference (2 min / 2 slides) • Project Hydrogen: Spark + AI (15 min) • Barrier Execution Mode: HorovodRunner (8 min / 12 slides) • Optimized Data Exchange: Model inference with PandasUDF (6 min / 8 slides) • Accelerator Aware Scheduling (1 min / 3 slides) • Demo (9 min) • Conclusion (4 min) 43
  • 44. def train(epochs=12, lr=1.0): model = get_model() dataset = get_dataset(train_dir) opt = keras.optimizers.Adadelta(lr=lr) model.compile(loss=keras.losses.categorical_crossentropy, optimizer=opt, metrics=['accuracy']) model.fit(dataset, epochs=epochs, steps_per_epoch=30, verbose=2, validation_data=val_dataset, callbacks=callbacks, validation_steps=3) hr = HorovodRunner(np=4) hr.run(train, epochs=20) opt = hvd.DistributedOptimizer(opt) hvd.init() Initialize Horovod Wrap the optimizer hvd.broadcast_global_variables(0) Initialize the vars 44
  • 45. Other features of HorovodRunner ● Timeline ● Tensorboard ● MLflow ● Multi-GPU support (available in Databricks Runtime 5.2 ML) 45
  • 47. PandasUDF Performance Tuning Guide data prep Pre- processing Predict Pre- processing Predict ● Reduce the model to a trivial model and measure the running time. ● Check the GPU utilization metrics. 47
  • 48. PandasUDF Tips to optimize predict data prep Pre- processing Predict Pre- processing Predict ● Increase the batch size to increase the GPU utilization 48
  • 49. PandasUDF Tips to optimize preprocessing data prep Pre- processing Predict Pre- processing Predict ● data prefetching ● parallel data loading and preprocessing ● Run part of preprocessing on GPU 49
  • 50. PandasUDF Performance Tuning Guide data prep Pre- processing Predict Pre- processing Predict ● Set the max records per batch and prefetching for Pandas UDF spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "5000") spark.conf.set("spark.databricks.sql.execution.pandasUDF.maxPrefetch", 2) 50
  • 52. When AI goes distributed ... When datasets get bigger and bigger, we see more and more distributed training scenarios and open-source offerings, e.g., distributed TensorFlow, Horovod, and distributed MXNet. This is where we see Spark and AI efforts overlap more. 52