TensorFlow on Spark (TFoS) provides a framework for distributed deep learning on Spark clusters. It allows existing TensorFlow programs to be easily migrated by handling the distributed training logic. TFoS supports two input modes - InputMode.SPARK which pushes data through Spark RDDs, and InputMode.TENSORFLOW which pulls data directly. Lightweight distributed training is also possible using PySpark by preprocessing data on the Spark cluster and collecting it to the driver for multi-GPU training.
Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited.
TensorFlow on Spark:
A Deep Dive into
Distributed Deep Learning
DataCon.TW 2020
Evans Ye, Verizon Media
Evans Ye
Engineering Manager @ Verizon Media
● Use data to power advertising/eCommerce experience.
● Build next-gen Big Data & ML/AI solutions.
ASF Member @ Apache Software Foundation
● Spread the Apache way.
Apache Bigtop former VP, PMC member, Committer
● Drive project direction, build community, mentor new committers.
Director of Taiwan Data Engineering Association (TDEA)
● Promote OSS & data engineering technologies.
Agenda
1. Why Distributed Deep Learning?
2. Solution at Verizon Media
3. Distributed Deep Learning
4. Lightweight Distributed Deep Learning on PySpark
5. Recap & Future Work
Why Distributed
Deep Learning?
Industry Trend
OpenAI blog post: AI and Compute
● Computation needs are increasing drastically!
Before 2012:
● uncommon to use GPUs for ML
2012 to 2014:
● 1-8 GPUs rated at 1-2 TFLOPS
2014 to 2016:
● 10-100 GPUs rated at 5-10 TFLOPS
Applying DL w/ GPUs in Enterprise
Deep Learning requires both big data & computing power (GPUs).
Data has gravity
● A dedicated GPU cluster poses a problem for data migration.
[Diagram: 1. Prepare data → 2. DL training → 3. Inferencing, with data and model flowing between the steps]
Solution at
Verizon Media
Yahoo! TensorFlowOnSpark
(Open-sourced Feb. 2017)
A framework that creates a TensorFlow cluster on Spark and feeds it data for training.
Yahoo! Developer blog post for TFoS
Input Modes
InputMode.SPARK
● HDFS → RDD.mapPartitions → TF worker (push mode)
InputMode.TENSORFLOW
● TF worker ← tf.data ← HDFS (pull mode)
What’s the Difference?
InputMode.SPARK
● Data are proxied through Spark RDDs, hence slower.
○ Supports whatever data can be loaded as an RDD.
● The TF worker runs in the background; failures happen behind the scenes...
InputMode.TENSORFLOW
● Data are fetched from HDFS directly, hence faster.
○ Supports TFRecords.
● The TF worker runs in the foreground; failures are raised and retried by Spark.
TFoS API Example
cluster = TFCluster.run(sc, main_fun, args,
                        args.cluster_size, args.num_ps,
                        tensorboard=args.tensorboard,
                        input_mode=TFCluster.InputMode.SPARK,
                        master_node='chief')
# InputMode.SPARK only
cluster.train(images_labels, args.epochs)
cluster.inference(images_labels)
cluster.shutdown()
Get Data w/ InputMode.SPARK
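The code screenshot for this slide did not survive extraction. Below is a minimal sketch of how a TFoS map function typically consumes data pushed by Spark in InputMode.SPARK, following the patterns in the TFoS examples; `args.batch_size` and the (image, label) record layout are assumptions, and `to_arrays` is an illustrative helper.

```python
import numpy as np

def to_arrays(batch):
    """Split a batch of (image, label) records into two numpy arrays."""
    images, labels = zip(*batch)
    return np.array(images), np.array(labels)

def main_fun(args, ctx):
    # On each executor, TFoS exposes the RDD partitions that Spark
    # pushes (InputMode.SPARK) through a DataFeed queue.
    from tensorflowonspark import TFNode
    tf_feed = TFNode.DataFeed(ctx.mgr)
    while not tf_feed.should_stop():
        batch = tf_feed.next_batch(args.batch_size)
        if not batch:
            break
        images, labels = to_arrays(batch)
        # ... run one training step on (images, labels) ...
```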
Get Data w/ InputMode.TENSORFLOW
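The code screenshot here is also missing. A sketch of pulling TFRecords from HDFS with tf.data, where each worker reads only its own shard of files; the file pattern, feature schema (a flat 784-float image plus int64 label, as in the MNIST-style TFoS examples), and `shard_for_worker` helper are assumptions for illustration.

```python
def shard_for_worker(files, num_workers, worker_index):
    """Deterministically assign each file to exactly one worker."""
    return [f for i, f in enumerate(sorted(files))
            if i % num_workers == worker_index]

def make_dataset(data_dir, batch_size, num_workers, worker_index):
    # Each TF worker pulls its shard of TFRecords straight from HDFS,
    # bypassing Spark entirely (pull mode).
    import tensorflow as tf

    def parse_example(serialized):
        # Assumed schema: a flat float image plus an int64 label.
        feats = tf.io.parse_single_example(serialized, {
            "image": tf.io.FixedLenFeature([784], tf.float32),
            "label": tf.io.FixedLenFeature([], tf.int64),
        })
        return feats["image"], feats["label"]

    files = tf.io.gfile.glob(data_dir + "/part-*")
    shard = shard_for_worker(files, num_workers, worker_index)
    return tf.data.TFRecordDataset(shard).map(parse_example) \
             .shuffle(10_000).batch(batch_size)
```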
TFoS Advantages
● Easily migrate existing TensorFlow programs with <10 lines of code
change.
● Support all TensorFlow functionalities: synchronous/asynchronous
training, model/data parallelism, inferencing and TensorBoard.
● Allow datasets on HDFS and other sources pushed by Spark or pulled by
TensorFlow.
● Easily integrate with your existing Spark data processing pipelines.
● Easily deployed on cloud or on-premise and on CPUs or GPUs.
* Ref: https://github.com/yahoo/TensorFlowOnSpark
More About TFoS
Spark Summit Talks:
● TensorFlow On Spark: Scalable TensorFlow Learning on Spark Clusters
● TensorFlowOnSpark Enhanced: Scala, Pipelines, and Beyond
Github:
● https://github.com/yahoo/TensorFlowOnSpark
● TFoS 1.X Keras Example
● TFoS 2.X InputMode.TENSORFLOW Keras Example
● TFoS 2.X InputMode.SPARK Keras Example
Distributed Deep Learning
Types of Parallelism
Data Parallelism
● Each worker trains on a different piece of the data.
● Sync/async approaches to update the parameters w/ gradients.
● The entire model must fit into a GPU’s memory.
Model Parallelism
● EX: to train a 6-layer model, assign the first 3 layers to worker0 and the last 3 layers to worker1.
● Used when the model can’t fit into a single GPU’s memory.
Hybrid approach
● Data parallelism between nodes, model parallelism between GPUs.
Asynchronous Parameter Server
● Each worker computes the gradient and sends the delta to the PS for updates.
● The updated parameters are then pulled back to the worker for the next training step.
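To make the push/pull flow concrete, here is a toy, single-process stand-in for one parameter-server shard and one worker step (pure Python, no TensorFlow). Real parameter servers shard the variables across machines and serve pushes and pulls over RPC; this is only an illustrative model of the update rule.

```python
class ParameterServer:
    """Toy, in-process stand-in for one parameter-server shard."""

    def __init__(self, weights, lr=0.1):
        self.weights = list(weights)
        self.lr = lr

    def push_gradients(self, grads):
        # Apply a worker's gradients immediately -- no locking or
        # ordering across workers, which is where the asynchrony
        # (and the weight staleness) comes from.
        for i, g in enumerate(grads):
            self.weights[i] -= self.lr * g

    def pull_weights(self):
        return list(self.weights)

# One simulated worker step:
ps = ParameterServer([1.0, 2.0])
w = ps.pull_weights()     # worker fetches the current weights
grads = [0.5, -0.5]       # pretend backprop produced these
ps.push_gradients(grads)  # push the delta to the server
w = ps.pull_weights()     # w is now approximately [0.95, 2.05]
```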
Asynchronous Parameter Server
Asynchronous → Inconsistency
● Different workers may update the parameters at the same time. Since
they act asynchronously, there’s no guaranteed order.
Large Scale Distributed Deep Networks, Google, 2012
● “In practice we found relaxing consistency requirements to be remarkably
effective.”
● Additional stochasticity.
Asynchronous Parameter Server
Pros:
● Scalable, each worker works independently.
● Robust to machine failures.
Cons:
● Workers may compute gradients based on stale weights, hence delaying convergence.
Suitable for a large number of not-so-powerful devices and dynamic environments where preemption can happen.
Synchronous AllReduce
● Each worker has its own model parameters
and computes gradients separately.
● All the workers sync with each other to
exchange gradients.
● The next training step begins only after all the
workers have the model updated.
Ring AllReduce
Baidu, 2017
● Bringing HPC Techniques to Deep Learning
Uber, 2018
● Horovod: fast and easy distributed deep
learning in TensorFlow
Each worker sends a gradient chunk to its successor
and receives one from its predecessor.
● Both upload & download bandwidth are used at the
same time, hence the communication time is optimized.
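The ring algorithm can be simulated in a few lines of pure Python: a reduce-scatter pass followed by an allgather pass, where each worker only ever exchanges one chunk with its ring neighbours per step. This is an illustrative model, not Horovod's or NCCL's implementation.

```python
def ring_allreduce(worker_grads):
    """Pure-Python simulation: reduce-scatter, then allgather."""
    n = len(worker_grads)
    size = len(worker_grads[0])
    assert size % n == 0, "pad gradients so they split into n equal chunks"
    chunk = size // n
    bufs = [list(g) for g in worker_grads]

    def indices(c):                      # index range of chunk c
        start = (c % n) * chunk
        return range(start, start + chunk)

    # Reduce-scatter: after n-1 steps, worker w owns the full sum of
    # chunk (w+1) % n, having only ever talked to its ring neighbours.
    for step in range(n - 1):
        for w in range(n):
            dst = (w + 1) % n
            for i in indices(w - step):
                bufs[dst][i] += bufs[w][i]

    # Allgather: circulate each completed chunk once around the ring.
    for step in range(n - 1):
        for w in range(n):
            dst = (w + 1) % n
            for i in indices(w + 1 - step):
                bufs[dst][i] = bufs[w][i]
    return bufs
```

After the call, every worker holds the element-wise sum of all the gradients.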
Synchronous AllReduce
Pros:
● Faster convergence w/ powerful devices & strong communication links.
Cons:
● Synchronous by design, hence may suffer from failures.
● Not suitable for devices with differing computing power or bandwidth.
Suitable for multi-GPU on a single machine, or a small number of machines.
Lightweight
Distributed Deep Learning on
PySpark
Rethink Distributed Training
Engineering side:
● Single-node, multi-GPU training w/ synchronous allreduce can lead to faster convergence.
● A TensorFlow cluster is required for multi-node training only.
Science side:
● A huge amount of labeled training data is not easy to get.
● Leveraging a well-trained model and fine-tuning from there is common in practice, instead of training from scratch.
What if Only Single-Node Distributed Training is Supported?
● No need to spin up a TensorFlow cluster.
● The code can be simplified w/o coupling to a clustering framework, hence easier to deploy and test.
● Failure discovery and handling can be simplified.
● Single-node, multi-GPU training is supported by MirroredStrategy (NcclAllReduce) w/ the Keras API.
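A minimal sketch of what that looks like with Keras, assuming TF 2.x; the model architecture and hyperparameters are placeholders. MirroredStrategy splits each global batch across the replicas, so the helper scales the per-GPU batch size by the number of GPUs in sync.

```python
def global_batch_size(per_replica_batch, num_replicas):
    # MirroredStrategy splits each global batch across replicas, so
    # scale the per-GPU batch by the number of GPUs in sync.
    return per_replica_batch * num_replicas

def train(train_x, train_y, per_replica_batch=64):
    import tensorflow as tf
    # NCCL-based allreduce across all GPUs on this single machine.
    strategy = tf.distribute.MirroredStrategy(
        cross_device_ops=tf.distribute.NcclAllReduce())
    with strategy.scope():               # variables mirrored per GPU
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(10, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy")
    model.fit(x=train_x, y=train_y,
              batch_size=global_batch_size(per_replica_batch,
                                           strategy.num_replicas_in_sync))
    return model
```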
Introducing a Simple, yet
Powerful Solution that Leverages
Several PySpark tricks.
Lightweight Distributed Training Architecture
PySpark Preprocessing
● Leverage Spark for distributed preprocessing.
○ spark.sql or spark.read to load data.
● Collect data back to driver for training.
○ df.toPandas()
Multi-GPU Training (Driver)
● Small data
model.fit(x=train_x, y=train_y, ...)
● Data can’t fit into GPU memory
model.fit(x=generator, ...)
[Diagram: data flows from HDFS to PySpark preprocessing on the cluster, then to multi-GPU training on the driver]
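A sketch of the small-data path above: distributed preprocessing on the cluster, then collect to the driver for multi-GPU training. The `training_data` table, column names, and `split_features_label` helper are hypothetical; `model` is assumed to be an already-compiled Keras model.

```python
def split_features_label(pdf, label_col):
    """Split a pandas DataFrame into feature and label numpy arrays."""
    y = pdf[label_col].to_numpy()
    x = pdf.drop(columns=[label_col]).to_numpy()
    return x, y

def collect_and_train(spark, model):
    # Distributed preprocessing stays on the cluster...
    df = spark.sql("SELECT feature1, feature2, label FROM training_data")
    # ...then the already-reduced training set is collected to the
    # driver, where the GPUs are.
    train_x, train_y = split_features_label(df.toPandas(), "label")
    model.fit(x=train_x, y=train_y)
    return model
```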
Huge-Data Training is Supported (up to 1B records)
Use a generator + Spark df.toLocalIterator() to collect the data sequentially.
● it = df.toLocalIterator()
● record = next(it)
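A sketch of such a generator, assuming each row is a flat tuple with the label in the last position (an illustrative layout). Only the final commented line involves Spark, so the batching logic itself runs and tests locally.

```python
import numpy as np

def _to_arrays(rows):
    feats = np.array([r[:-1] for r in rows])   # all but the last column
    labels = np.array([r[-1] for r in rows])   # last column is the label
    return feats, labels

def batches(rows, batch_size):
    """Group a row iterator into (features, labels) numpy batches.

    Works with any iterator, including df.toLocalIterator(), which
    streams one partition at a time instead of collecting everything.
    """
    buf = []
    for row in rows:
        buf.append(row)
        if len(buf) == batch_size:
            yield _to_arrays(buf)
            buf = []
    if buf:                    # final, possibly short batch
        yield _to_arrays(buf)

# With Spark (sketch): model.fit(x=batches(df.toLocalIterator(), 256), ...)
```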
Performance
Take one of our production models for example:
[Chart: training-time comparison, multi-GPU AllReduce vs. multi-node parameter server]
Lightweight Parallel Predicting Architecture
Preprocessing
● Leverage Spark for distributed preprocessing.
○ spark.sql or spark.read to load data.
Pandas UDF Predict (Executors)
● Data are handed over to Python in pandas DataFrame format.
● In the Python UDF, predict with the plain Keras API.
● The resulting DataFrames are then passed back as a Spark DataFrame.
● df.write or other post-processing if needed.
[Diagram: data flows from HDFS to Spark preprocessing, then to Pandas UDF prediction on the executors]
PySpark Pandas UDF
Pandas Grouped Map UDF (Spark 2.4)
● The UDF leverages Apache Arrow for efficient JVM <-> Python SerDe.
● GroupBy a uniformly distributed random ID, making data evenly grouped across partitions.
● Make predictions in the UDF.
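The code screenshots for these two steps are missing; a sketch of both, assuming Spark 2.4's grouped-map Pandas UDF. `load_model`, `result_schema`, `FEATURE_COLS`, and `NUM_GROUPS` are hypothetical names. The Spark wiring is shown as comments since it needs a live session; the UDF body itself is plain pandas.

```python
def predict_pdf(pdf, model, feature_cols):
    """Body of the grouped-map UDF: one pandas DataFrame in, one out."""
    out = pdf.copy()
    out["prediction"] = model.predict(pdf[feature_cols].to_numpy())
    return out

# Sketch of the Spark 2.4 wiring (needs a live session to run):
#
# from pyspark.sql.functions import pandas_udf, PandasUDFType, rand, floor
#
# @pandas_udf(result_schema, PandasUDFType.GROUPED_MAP)
# def predict_udf(pdf):
#     model = load_model()              # load once per group
#     return predict_pdf(pdf, model, FEATURE_COLS)
#
# # Uniformly distributed random ID -> evenly sized groups:
# df = df.withColumn("gid", floor(rand() * NUM_GROUPS))
# predictions = df.groupBy("gid").apply(predict_udf)
```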
Predicting Performance
● Looking at CPU predicting, it scales well as more resources are added.
● Though GPU predicting is slightly faster, CPU predicting is more scalable due to the # of CPUs available.
● The solution is capable of predicting up to 1B records.
Summary
Productivity
● The PySpark-based solution is lightweight and easy to test in a local IDE. No need to spin up a cluster.
Flexibility
● The trainer/predictor implementations are decoupled from the framework and can run independently.
○ EX: run on a local machine w/o Spark.
● The solution can support other frameworks such as PyTorch, XGBoost, etc.
Efficiency
● Cross-language SerDe (toPandas, Pandas UDF) is optimized by PySpark’s Arrow integration.
Recap & Future Work
Recap
● TFoS is a comprehensive solution for distributed deep learning.
● Distributed deep learning can be achieved in a more lightweight manner with “TensorFlow on Spark” only, thereby increasing productivity.
● Know the details underneath; maybe it’s an architecture-level problem for your training:
Types of Parallelism: Data Parallelism / Model Parallelism
Training Approaches: Async Parameter Server / Sync AllReduce
Future Work
● Petastorm for efficient DL with Parquet file format.
● Horovod to support multi-framework distributed deep learning.
● More efficient AllReduce:
○ NCCL topology optimizations.
○ Blink: Fast and Generic Collectives for Distributed ML.
● Other related areas that are also rapidly developing:
○ ML Lifecycle: MLFlow, KubeFlow.
○ Feature Store: Michelangelo.
○ GPU Acceleration: RAPIDS for end-to-end ETL acceleration.
Reference
● https://openai.com/blog/ai-and-compute/
● Open Sourcing TensorFlowOnSpark: Distributed Deep Learning on Big-Data Clusters
● Introduction to Distributed Deep Learning
● Large Scale Distributed Deep Networks
● Bringing HPC Techniques to Deep Learning
● Horovod: fast and easy distributed deep learning in TensorFlow
● NCCL 2.0
● Distributed Deep Neural Network Training: NCCL on Summit
● Blink: Fast and Generic Collectives for Distributed ML
Q&A
Appendix
TFoS 1.X (TensorFlow 1.X)
The “Parameter Server” approach is adopted for distributed training.
Distributed training is supported via the tf.estimator.train_and_evaluate API.
● A Keras model can be converted to an estimator via
tf.keras.estimator.model_to_estimator, but the train/predict APIs are still
required to be TF Estimator APIs.
TFoS 2.X (TensorFlow 2.X)
In 2.X, Keras APIs are supported for distributed training via the
tf.distribute.Strategy API (experimental).
Distributed training with TensorFlow (as of 2.3)
● Note that the Keras API’s support is better than the other APIs’.