TensorFlow on Spark (TFoS) provides a framework for distributed deep learning on Spark clusters. It allows existing TensorFlow programs to be easily migrated by handling the distributed training logic. TFoS supports two input modes - InputMode.SPARK which pushes data through Spark RDDs, and InputMode.TENSORFLOW which pulls data directly. Lightweight distributed training is also possible using PySpark by preprocessing data on the Spark cluster and collecting it to the driver for multi-GPU training.
Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited.
TensorFlow on Spark:
A Deep Dive into
Distributed Deep Learning
DataCon.TW 2020
Evans Ye, Verizon Media
Evans Ye
Engineering Manager @ Verizon Media
● Use data to power advertising/eCommerce experience.
● Build next-gen Big Data & ML/AI solutions.
ASF Member @ Apache Software Foundation
● Spread the Apache way.
Apache Bigtop former VP, PMC member, Committer
● Drive project direction, build community, mentor new committers.
Director of Taiwan Data Engineering Association (TDEA)
● Promote OSS & data engineering technologies.
Agenda
1. Why Distributed Deep Learning?
2. Solution at Verizon Media
3. Distributed Deep Learning
4. Lightweight Distributed Deep Learning on PySpark
5. Recap & Future Work
Why Distributed
Deep Learning?
Industry Trend
OpenAI blog post: AI and Compute
● Computation needs are increasing drastically!
Before 2012:
● uncommon to use GPUs for ML
2012 to 2014:
● 1-8 GPUs rated at 1-2 TFLOPS
2014 to 2016:
● 10-100 GPUs rated at 5-10 TFLOPS
Applying DL w/ GPUs in Enterprise
Deep Learning requires both big data & computing power (GPUs).
Data has gravity
● A dedicated GPU cluster poses a problem for data migration.
[Diagram: 1. Prepare data → 2. DL training → 3. Inferencing, with data and model flowing between the steps]
Solution at
Verizon Media
Yahoo! TensorFlowOnSpark
(Open-sourced Feb. 2017)
A framework that creates a TensorFlow cluster on Spark and feeds it data for training.
Yahoo! Developer blog post for TFoS
Input Modes
InputMode.SPARK
● HDFS → RDD.mapPartitions → TF worker (push mode)
InputMode.TENSORFLOW
● TF worker ← tf.data ← HDFS (pull mode)
What’s the Difference?
InputMode.SPARK
● Data are proxied through Spark RDDs, hence slower.
○ Supports whatever data can be loaded as an RDD.
● The TF worker runs in the background; failures happen behind the scenes...
InputMode.TENSORFLOW
● Data are fetched from HDFS directly, hence faster.
○ Supports TFRecords.
● The TF worker runs in the foreground; failures are raised and retried by Spark.
TFoS API Example
cluster = TFCluster.run(sc, main_fun, args,
                        args.cluster_size, args.num_ps,
                        tensorboard=args.tensorboard,
                        input_mode=TFCluster.InputMode.SPARK,
                        master_node='chief')
# InputMode.SPARK only
cluster.train(images_labels, args.epochs)
cluster.inference(images_labels)
cluster.shutdown()
Get Data w/ InputMode.SPARK
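The code screenshot for this slide did not survive extraction. Below is a minimal sketch of how a TFoS map function typically consumes data pushed by Spark in InputMode.SPARK, following the patterns in the TFoS examples; `args.batch_size` and the (image, label) record layout are assumptions, and `to_arrays` is an illustrative helper.

```python
import numpy as np

def to_arrays(batch):
    """Split a batch of (image, label) records into two numpy arrays."""
    images, labels = zip(*batch)
    return np.array(images), np.array(labels)

def main_fun(args, ctx):
    # On each executor, TFoS exposes the RDD partitions that Spark
    # pushes (InputMode.SPARK) through a DataFeed queue.
    from tensorflowonspark import TFNode
    tf_feed = TFNode.DataFeed(ctx.mgr)
    while not tf_feed.should_stop():
        batch = tf_feed.next_batch(args.batch_size)
        if not batch:
            break
        images, labels = to_arrays(batch)
        # ... run one training step on (images, labels) ...
```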
Get Data w/ InputMode.TENSORFLOW
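The code screenshot here is also missing. A sketch of pulling TFRecords from HDFS with tf.data, where each worker reads only its own shard of files; the file pattern, feature schema (a flat 784-float image plus int64 label, as in the MNIST-style TFoS examples), and `shard_for_worker` helper are assumptions for illustration.

```python
def shard_for_worker(files, num_workers, worker_index):
    """Deterministically assign each file to exactly one worker."""
    return [f for i, f in enumerate(sorted(files))
            if i % num_workers == worker_index]

def make_dataset(data_dir, batch_size, num_workers, worker_index):
    # Each TF worker pulls its shard of TFRecords straight from HDFS,
    # bypassing Spark entirely (pull mode).
    import tensorflow as tf

    def parse_example(serialized):
        # Assumed schema: a flat float image plus an int64 label.
        feats = tf.io.parse_single_example(serialized, {
            "image": tf.io.FixedLenFeature([784], tf.float32),
            "label": tf.io.FixedLenFeature([], tf.int64),
        })
        return feats["image"], feats["label"]

    files = tf.io.gfile.glob(data_dir + "/part-*")
    shard = shard_for_worker(files, num_workers, worker_index)
    return tf.data.TFRecordDataset(shard).map(parse_example) \
             .shuffle(10_000).batch(batch_size)
```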
TFoS Advantages
● Easily migrate existing TensorFlow programs with <10 lines of code
change.
● Support all TensorFlow functionalities: synchronous/asynchronous
training, model/data parallelism, inferencing and TensorBoard.
● Allow datasets on HDFS and other sources pushed by Spark or pulled by
TensorFlow.
● Easily integrate with your existing Spark data processing pipelines.
● Easily deployed on cloud or on-premise and on CPUs or GPUs.
* Ref: https://github.com/yahoo/TensorFlowOnSpark
More About TFoS
Spark Summit Talks:
● TensorFlow On Spark: Scalable TensorFlow Learning on Spark Clusters
● TensorFlowOnSpark Enhanced: Scala, Pipelines, and Beyond
Github:
● https://github.com/yahoo/TensorFlowOnSpark
● TFoS 1.X Keras Example
● TFoS 2.X InputMode.TENSORFLOW Keras Example
● TFoS 2.X InputMode.SPARK Keras Example
Distributed Deep Learning
Types of Parallelism
Data Parallelism
● Each worker trains on a different piece of the data.
● Sync/async approaches to update the parameters w/ gradients.
● The entire model must fit into a GPU’s memory.
Model Parallelism
● EX: to train a 6-layer model, assign the first 3 layers to worker0 and the last 3 layers to worker1.
● Used when the model can’t fit into a single GPU’s memory.
Hybrid approach
● Data parallelism between nodes, model parallelism between GPUs.
Asynchronous Parameter Server
● Each worker computes the gradient and sends the delta to the PS for updates.
● The updated parameters are then pulled back to the worker for the next training step.
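To make the push/pull flow concrete, here is a toy, single-process stand-in for one parameter-server shard and one worker step (pure Python, no TensorFlow). Real parameter servers shard the variables across machines and serve pushes and pulls over RPC; this is only an illustrative model of the update rule.

```python
class ParameterServer:
    """Toy, in-process stand-in for one parameter-server shard."""

    def __init__(self, weights, lr=0.1):
        self.weights = list(weights)
        self.lr = lr

    def push_gradients(self, grads):
        # Apply a worker's gradients immediately -- no locking or
        # ordering across workers, which is where the asynchrony
        # (and the weight staleness) comes from.
        for i, g in enumerate(grads):
            self.weights[i] -= self.lr * g

    def pull_weights(self):
        return list(self.weights)

# One simulated worker step:
ps = ParameterServer([1.0, 2.0])
w = ps.pull_weights()     # worker fetches the current weights
grads = [0.5, -0.5]       # pretend backprop produced these
ps.push_gradients(grads)  # push the delta to the server
w = ps.pull_weights()     # w is now approximately [0.95, 2.05]
```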
Asynchronous Parameter Server
Asynchronous → Inconsistency
● Different workers may update the parameters at the same time. Since
they act asynchronously, there’s no guaranteed order.
Large Scale Distributed Deep Networks, Google, 2012
● “In practice we found relaxing consistency requirements to be remarkably
effective.”
● Additional stochasticity.
Asynchronous Parameter Server
Pros:
● Scalable, each worker works independently.
● Robust to machine failures.
Cons:
● Workers may compute gradients based on stale weights, hence delaying convergence.
Suitable for a large number of not-so-powerful devices and dynamic environments where preemption can happen.
Synchronous AllReduce
● Each worker has its own model parameters
and computes gradients separately.
● All the workers sync with each other to
exchange gradients.
● The next training step begins only after all the
workers have the model updated.
Ring AllReduce
Baidu, 2017
● Bringing HPC Techniques to Deep Learning
Uber, 2018
● Horovod: fast and easy distributed deep
learning in TensorFlow
Each worker sends a gradient chunk to its successor
and receives one from its predecessor.
● Both upload & download bandwidth are used at the
same time, hence the communication time is optimized.
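The ring algorithm can be simulated in a few lines of pure Python: a reduce-scatter pass followed by an allgather pass, where each worker only ever exchanges one chunk with its ring neighbours per step. This is an illustrative model, not Horovod's or NCCL's implementation.

```python
def ring_allreduce(worker_grads):
    """Pure-Python simulation: reduce-scatter, then allgather."""
    n = len(worker_grads)
    size = len(worker_grads[0])
    assert size % n == 0, "pad gradients so they split into n equal chunks"
    chunk = size // n
    bufs = [list(g) for g in worker_grads]

    def indices(c):                      # index range of chunk c
        start = (c % n) * chunk
        return range(start, start + chunk)

    # Reduce-scatter: after n-1 steps, worker w owns the full sum of
    # chunk (w+1) % n, having only ever talked to its ring neighbours.
    for step in range(n - 1):
        for w in range(n):
            dst = (w + 1) % n
            for i in indices(w - step):
                bufs[dst][i] += bufs[w][i]

    # Allgather: circulate each completed chunk once around the ring.
    for step in range(n - 1):
        for w in range(n):
            dst = (w + 1) % n
            for i in indices(w + 1 - step):
                bufs[dst][i] = bufs[w][i]
    return bufs
```

After the call, every worker holds the element-wise sum of all the gradients.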
Synchronous AllReduce
Pros:
● Faster convergence w/ powerful devices & strong communication links.
Cons:
● Synchronous by design, hence may suffer from failures.
● Not suitable for devices with differing computing power or bandwidth.
Suitable for multi-GPU on a single machine, or a small number of machines.
Lightweight
Distributed Deep Learning on
PySpark
Rethink Distributed Training
Engineering side:
● Single-node, multi-GPU training w/ synchronous allreduce can lead to faster convergence.
● A TensorFlow cluster is required for multi-node training only.
Science side:
● A huge amount of labeled training data is not easy to get.
● Leveraging a well-trained model and fine-tuning from there is common in practice, instead of training from scratch.
What if Only Single-Node Distributed Training is Supported?
● No need to spin up a TensorFlow cluster.
● The code can be simplified w/o coupling to a clustering framework, hence easier to deploy and test.
● Failure discovery and handling can be simplified.
● Single-node, multi-GPU training is supported by MirroredStrategy (NcclAllReduce) w/ the Keras API.
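A minimal sketch of what that looks like with Keras, assuming TF 2.x; the model architecture and hyperparameters are placeholders. MirroredStrategy splits each global batch across the replicas, so the helper scales the per-GPU batch size by the number of GPUs in sync.

```python
def global_batch_size(per_replica_batch, num_replicas):
    # MirroredStrategy splits each global batch across replicas, so
    # scale the per-GPU batch by the number of GPUs in sync.
    return per_replica_batch * num_replicas

def train(train_x, train_y, per_replica_batch=64):
    import tensorflow as tf
    # NCCL-based allreduce across all GPUs on this single machine.
    strategy = tf.distribute.MirroredStrategy(
        cross_device_ops=tf.distribute.NcclAllReduce())
    with strategy.scope():               # variables mirrored per GPU
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(10, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy")
    model.fit(x=train_x, y=train_y,
              batch_size=global_batch_size(per_replica_batch,
                                           strategy.num_replicas_in_sync))
    return model
```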
Introducing a Simple, yet
Powerful Solution that Leverages
Several PySpark tricks.
Lightweight Distributed Training Architecture
PySpark Preprocessing
● Leverage Spark for distributed preprocessing.
○ spark.sql or spark.read to load data.
● Collect data back to driver for training.
○ df.toPandas()
Multi-GPU Training (Driver)
● Small data
model.fit(x=train_x, y=train_y, ...)
● Data can’t fit into GPU memory
model.fit(x=generator, ...)
[Diagram: data flows from HDFS to PySpark preprocessing on the cluster, then to multi-GPU training on the driver]
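A sketch of the small-data path above: distributed preprocessing on the cluster, then collect to the driver for multi-GPU training. The `training_data` table, column names, and `split_features_label` helper are hypothetical; `model` is assumed to be an already-compiled Keras model.

```python
def split_features_label(pdf, label_col):
    """Split a pandas DataFrame into feature and label numpy arrays."""
    y = pdf[label_col].to_numpy()
    x = pdf.drop(columns=[label_col]).to_numpy()
    return x, y

def collect_and_train(spark, model):
    # Distributed preprocessing stays on the cluster...
    df = spark.sql("SELECT feature1, feature2, label FROM training_data")
    # ...then the already-reduced training set is collected to the
    # driver, where the GPUs are.
    train_x, train_y = split_features_label(df.toPandas(), "label")
    model.fit(x=train_x, y=train_y)
    return model
```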
Huge-Data Training is Supported (up to 1B records)
Use a generator + Spark df.toLocalIterator() to collect the data sequentially.
● it = df.toLocalIterator()
● record = next(it)
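A sketch of such a generator, assuming each row is a flat tuple with the label in the last position (an illustrative layout). Only the final commented line involves Spark, so the batching logic itself runs and tests locally.

```python
import numpy as np

def _to_arrays(rows):
    feats = np.array([r[:-1] for r in rows])   # all but the last column
    labels = np.array([r[-1] for r in rows])   # last column is the label
    return feats, labels

def batches(rows, batch_size):
    """Group a row iterator into (features, labels) numpy batches.

    Works with any iterator, including df.toLocalIterator(), which
    streams one partition at a time instead of collecting everything.
    """
    buf = []
    for row in rows:
        buf.append(row)
        if len(buf) == batch_size:
            yield _to_arrays(buf)
            buf = []
    if buf:                    # final, possibly short batch
        yield _to_arrays(buf)

# With Spark (sketch): model.fit(x=batches(df.toLocalIterator(), 256), ...)
```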
Performance
Take one of our production models for example:
[Chart: training-time comparison, multi-GPU AllReduce vs. multi-node parameter server]
Lightweight Parallel Predicting Architecture
Preprocessing
● Leverage Spark for distributed preprocessing.
○ spark.sql or spark.read to load data.
Pandas UDF Predict (Executors)
● Data are handed over to Python in pandas DataFrame format.
● In the Python UDF, predict with the plain Keras API.
● The resulting DataFrames are then passed back as a Spark DataFrame.
● df.write or other post-processing if needed.
[Diagram: data flows from HDFS to Spark preprocessing, then to Pandas UDF prediction on the executors]
PySpark Pandas UDF
Pandas Grouped Map UDF (Spark 2.4)
● The UDF leverages Apache Arrow for efficient JVM <-> Python SerDe.
● GroupBy a uniformly distributed random ID, making data evenly grouped across partitions.
● Make predictions in the UDF.
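The code screenshots for these two steps are missing; a sketch of both, assuming Spark 2.4's grouped-map Pandas UDF. `load_model`, `result_schema`, `FEATURE_COLS`, and `NUM_GROUPS` are hypothetical names. The Spark wiring is shown as comments since it needs a live session; the UDF body itself is plain pandas.

```python
def predict_pdf(pdf, model, feature_cols):
    """Body of the grouped-map UDF: one pandas DataFrame in, one out."""
    out = pdf.copy()
    out["prediction"] = model.predict(pdf[feature_cols].to_numpy())
    return out

# Sketch of the Spark 2.4 wiring (needs a live session to run):
#
# from pyspark.sql.functions import pandas_udf, PandasUDFType, rand, floor
#
# @pandas_udf(result_schema, PandasUDFType.GROUPED_MAP)
# def predict_udf(pdf):
#     model = load_model()              # load once per group
#     return predict_pdf(pdf, model, FEATURE_COLS)
#
# # Uniformly distributed random ID -> evenly sized groups:
# df = df.withColumn("gid", floor(rand() * NUM_GROUPS))
# predictions = df.groupBy("gid").apply(predict_udf)
```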
Predicting Performance
● Looking at CPU predicting, it scales well as more resources are added.
● Though GPU predicting is slightly faster, CPU predicting is more scalable due to the # of CPUs available.
● The solution is capable of predicting up to 1B records.
Summary
Productivity
● The PySpark-based solution is lightweight and easy to test in a local IDE. No need to spin up a cluster.
Flexibility
● The trainer/predictor implementations are decoupled from the framework and can run independently.
○ EX: run on a local machine w/o Spark.
● The solution can support other frameworks such as PyTorch, XGBoost, etc.
Efficiency
● Cross-language SerDe (toPandas, Pandas UDF) is optimized by PySpark’s Arrow integration.
Recap & Future Work
Recap
● TFoS is a comprehensive solution for distributed deep learning.
● Distributed deep learning can be achieved in a more lightweight manner with “TensorFlow on Spark” only, thereby increasing productivity.
● Know the details underneath; maybe it’s an architecture-level problem for your training:
Types of Parallelism: Data Parallelism / Model Parallelism
Training Approaches: Async Parameter Server / Sync AllReduce
Future Work
● Petastorm for efficient DL with Parquet file format.
● Horovod to support multi-framework distributed deep learning.
● More efficient AllReduce:
○ NCCL topology optimizations.
○ Blink: Fast and Generic Collectives for Distributed ML.
● Other related areas that are also rapidly developing:
○ ML Lifecycle: MLFlow, KubeFlow.
○ Feature Store: Michelangelo.
○ GPU Acceleration: RAPIDS for end-to-end ETL acceleration.
Reference
● https://openai.com/blog/ai-and-compute/
● Open Sourcing TensorFlowOnSpark: Distributed Deep Learning on Big-Data Clusters
● Introduction to Distributed Deep Learning
● Large Scale Distributed Deep Networks
● Bringing HPC Techniques to Deep Learning
● Horovod: fast and easy distributed deep learning in TensorFlow
● NCCL 2.0
● Distributed Deep Neural Network Training: NCCL on Summit
● Blink: Fast and Generic Collectives for Distributed ML
Q&A
Appendix
TFoS 1.X (TensorFlow 1.X)
The “Parameter Server” approach is adopted for distributed training.
Distributed training is supported via the tf.estimator.train_and_evaluate API.
● A Keras model can be converted to an estimator via
tf.keras.estimator.model_to_estimator, but the train/predict APIs are still
required to be TF Estimator APIs.
TFoS 2.X (TensorFlow 2.X)
In 2.X, Keras APIs are supported for distributed training via the
tf.distribute.Strategy API (experimental).
Distributed training with TensorFlow (as of 2.3)
● Note that the Keras API’s support is better than the other APIs’.