Methods that scale with available computation are the future of AI. Distributed deep learning is one such method: it lets data scientists massively increase their productivity by (1) running parallel experiments over many devices (GPUs/TPUs/servers) and (2) dramatically reducing training time by distributing the training of a single network over many devices. Apache Spark is a key enabling platform for distributed deep learning, as it allows different deep learning frameworks to be embedded in Spark workflows in a secure end-to-end pipeline. In this talk, we examine the different ways in which TensorFlow can be included in Spark workflows to build distributed deep learning applications.
We will analyse the different frameworks for integrating Spark with TensorFlow, from Horovod to TensorFlowOnSpark to Databricks' Deep Learning Pipelines. We will also look at where the bottlenecks arise when training models (in your frameworks, the network, GPUs, and with your data scientists) and how to get around them. We will look at how to use the Spark Estimator model to perform hyperparameter optimization with Spark/TensorFlow, and at model-architecture search, where Spark executors run experiments in parallel to automatically find good model architectures.
The talk will include a live demonstration of training and inference for a TensorFlow application embedded in a Spark pipeline written in a Jupyter notebook on the Hops platform. We will show how to debug the application using both the Spark UI and TensorBoard, and how to examine logs and monitor training. The demo will be run on the Hops platform, currently used by over 450 researchers and students in Sweden, as well as at companies such as Scania and Ericsson.
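The "parallel experiments" pattern above can be sketched without a Spark cluster: each worker evaluates one hyperparameter combination independently and the best result is collected at the end. This minimal stand-in uses Python's process pool in place of Spark executors; the training function and its loss surface are made up so the example is self-contained.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product

def run_experiment(params):
    """Hypothetical stand-in for one training run; returns (params, loss)."""
    lr, dropout = params
    # A made-up loss surface, minimized at lr=0.01, dropout=0.4.
    loss = (lr - 0.01) ** 2 + (dropout - 0.4) ** 2
    return params, loss

if __name__ == "__main__":
    # Grid of hyperparameter combinations, one experiment per grid point.
    grid = list(product([0.001, 0.01, 0.1], [0.2, 0.4, 0.6]))
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(run_experiment, grid))
    best_params, best_loss = min(results, key=lambda r: r[1])
    print(best_params)  # the grid point closest to the made-up optimum
```

In the real setting each `run_experiment` call would train a TensorFlow model on its own GPU, with Spark handling scheduling and result collection.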
8. HopsML
8/48
• Experiments
– Dist. Hyperparameter Optimization
– Versioning of Models/Code/Resources
– Visualization with TensorBoard
– Distributed Training with checkpointing
• [Feature Store]
• Model Serving and Monitoring
Pipeline stages (diagram): Data Acquisition → Clean/Transform Data → Feature Extraction → Experimentation → Training → Test + Serve
10. 10/48
Prof Nando de Freitas (@NandoDF):
"ICLR 2019 lessons thus far: The deep neural nets have to be BIGGER and they're hungry for data, memory and compute."
https://twitter.com/NandoDF/status/1046371764211716096
11. All Roads Lead to Distribution
11/48
Everything points to Distributed Deep Learning (diagram):
• Hyperparameter Optimization
• Distributed Training
• Larger Training Datasets
• Elastic Model Serving
• Parallel Experiments
• (Commodity) GPU Clusters
• AutoML
19. Distributed TensorFlow / TfOnSpark
TF_CONFIG: Bring your own Distribution!
1. Start all processes for P1, P2, G1-G4 yourself.
2. Enter all IP addresses in TF_CONFIG along with GPU device IDs.
19/48
Diagram: Parameter Servers (P1, P2) and GPU Servers (G1-G4), each process configured via TF_CONFIG.
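TF_CONFIG is a JSON document in an environment variable: every process receives the same cluster spec plus its own task identity. A sketch for the topology on this slide (two parameter servers P1/P2, four GPU workers G1-G4); the hostnames and port are made up:

```python
import json
import os

# Illustrative TF_CONFIG for this slide's topology. Each process gets
# the same "cluster" section but its own "task" entry; this one
# identifies itself as worker 0 (G1).
tf_config = {
    "cluster": {
        "ps":     ["p1:2222", "p2:2222"],
        "worker": ["g1:2222", "g2:2222", "g3:2222", "g4:2222"],
    },
    "task": {"type": "worker", "index": 0},
}
os.environ["TF_CONFIG"] = json.dumps(tf_config)
print(os.environ["TF_CONFIG"])
```

"Bring your own distribution" means exactly this: you must generate a distinct TF_CONFIG for all six processes and start each one yourself.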
20. Horovod
• Bandwidth-optimal
• Builds the ring, runs AllReduce using MPI and NCCL2
• Available in:
– Hopsworks
– Databricks (Spark 2.4)
20/48
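The ring-allreduce that Horovod builds can be illustrated with a toy, single-process simulation: all "communication" below is just in-memory list access (real Horovod uses MPI and NCCL2), but the two phases, reduce-scatter then allgather, and the bandwidth-optimal property (each worker moves only 1/N of the data per step) are the same.

```python
def ring_allreduce(grads):
    """Toy single-process simulation of ring-allreduce (summation).

    grads: one gradient vector per simulated worker; vector length
    must be divisible by the number of workers. Returns the per-worker
    buffers, which all end up holding the elementwise sum.
    """
    n = len(grads)
    chunk = len(grads[0]) // n          # each worker "owns" one chunk
    bufs = [list(g) for g in grads]

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. In n-1 steps each worker passes one
    # partially summed chunk to its right-hand neighbour, which adds
    # it in. Afterwards worker w holds fully reduced chunk (w+1) % n.
    for step in range(n - 1):
        for w in range(n):
            c = (w - step) % n          # chunk worker w forwards now
            dst = (w + 1) % n
            for i in span(c):
                bufs[dst][i] += bufs[w][i]

    # Phase 2: allgather. Each fully reduced chunk circulates around
    # the ring; receivers overwrite rather than add.
    for step in range(n - 1):
        for w in range(n):
            c = (w + 1 - step) % n
            dst = (w + 1) % n
            for i in span(c):
                bufs[dst][i] = bufs[w][i]
    return bufs

workers = ring_allreduce([[1, 2, 3, 4], [10, 20, 30, 40],
                          [100, 200, 300, 400], [1000, 2000, 3000, 4000]])
print(workers[0])  # every worker ends with [1111, 2222, 3333, 4444]
```

"Bandwidth-optimal" on the slide refers to this structure: total data sent per worker is about 2(N-1)/N times the gradient size, independent of the number of workers.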
21. Tf CollectiveAllReduceStrategy
TF_CONFIG, again. Bring your own Distribution!
1. Start all processes for G1-G4 yourself.
2. Enter all IP addresses in TF_CONFIG along with GPU device IDs.
21/48
Diagram: workers G1-G4 in a ring, each process configured via TF_CONFIG.
Available from TensorFlow 1.11.
22. Tf CollectiveAllReduceStrategy Gotchas
• Specify the GPU order in the ring statically
– gpu_indices
• Configure the batch size for merging tensors
– allreduce_merge_scope
• Set to '1' for no merging
• Set to '32' for higher throughput*
22/48
* https://groups.google.com/a/tensorflow.org/forum/#!topic/discuss/7T05tNV08Us (2018-10-06)
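Why merging helps: each collective call carries a fixed latency cost, so batching many small gradient tensors into fewer allreduce operations amortizes that overhead. A back-of-the-envelope sketch (the per-call overhead figure is made up for illustration):

```python
def allreduce_calls(num_tensors, merge_scope):
    """Collective calls needed if tensors are merged in groups of
    `merge_scope` before each allreduce (ceiling division)."""
    return -(-num_tensors // merge_scope)

# A model with 100 gradient tensors and an assumed fixed overhead of
# 0.5 ms per collective call (latency dominates for small tensors):
overhead_ms = 0.5
for scope in (1, 32):
    calls = allreduce_calls(100, scope)
    print(scope, calls, calls * overhead_ms)
# merge_scope=1  -> 100 calls -> 50.0 ms of per-call overhead
# merge_scope=32 ->   4 calls ->  2.0 ms of per-call overhead
```

The trade-off is that merging delays the first allreduce until enough tensors have accumulated, which is why '1' (no merging) remains an option.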
23. HopsML CollectiveAllReduceStrategy
• Uses Spark/YARN to add distribution to TensorFlow's CollectiveAllReduceStrategy
– Automatically builds the ring (Spark/YARN)
– Allocates GPUs to Spark Executors
23/48
https://github.com/logicalclocks/hops-examples/tree/master/tensorflow/notebooks/Distributed_Training
24. CollectiveAllReduce vs Horovod Benchmark
Small Model: Inception v1, ImageNet (synthetic), batch size 256 global / 32 per device, 100 batches, Momentum optimizer, 8 GPUs

CollectiveAllReduce (TensorFlow 1.11):
Step 1: 2972.4 +/- 0.0 images/sec
Step 10: 3008.9 +/- 8.9 images/sec
Step 100: 2998.6 +/- 4.3 images/sec
Total: 2993.52 images/sec

Horovod (TensorFlow 1.7):
Step 1: 2816.6 +/- 0.0 images/sec
Step 10: 2808.0 +/- 10.8 images/sec
Step 100: 2806.9 +/- 3.9 images/sec
Total: 2803.69 images/sec

https://groups.google.com/a/tensorflow.org/forum/#!topic/discuss/7T05tNV08Us
24/48
25. CollectiveAllReduce vs Horovod Benchmark
Big Model: VGG19, ImageNet (synthetic), batch size 256 global / 32 per device, 100 batches, Momentum optimizer, 8 GPUs

CollectiveAllReduce (TensorFlow 1.11):
Step 1: 634.4 +/- 0.0 images/sec
Step 10: 635.2 +/- 0.8 images/sec
Step 100: 635.0 +/- 0.5 images/sec
Total: 634.80 images/sec

Horovod (TensorFlow 1.7):
Step 1: 583.01 +/- 0.0 images/sec
Step 10: 582.22 +/- 0.1 images/sec
Step 100: 583.61 +/- 0.2 images/sec
Total: 583.61 images/sec

https://groups.google.com/a/tensorflow.org/forum/#!topic/discuss/7T05tNV08Us
25/48
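From the totals on these two benchmark slides, the relative advantage of CollectiveAllReduce (TF 1.11) over Horovod (TF 1.7) works out to roughly 6.8% for Inception v1 and 8.8% for VGG19; a quick check of the arithmetic:

```python
# Totals taken from the two benchmark slides (images/sec, 8 GPUs).
benchmarks = {
    "Inception v1 (small model)": (2993.52, 2803.69),
    "VGG19 (big model)": (634.80, 583.61),
}
for model, (collective, horovod) in benchmarks.items():
    speedup = (collective / horovod - 1) * 100
    print(f"{model}: {speedup:.1f}% faster with CollectiveAllReduce")
```

Note the comparison is across TensorFlow versions (1.11 vs 1.7), so some of the gap may come from the framework upgrade rather than the allreduce implementation itself.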
26. Reduction in LoC for Dist Training
26/48
Released | Framework | Lines of Code in Hops
March 2016 | DistributedTensorFlow | ~1000
Feb 2017 | TensorFlowOnSpark* | ~900
Jan 2018 | Horovod (Keras)* | ~130
June 2018 | Databricks' HorovodEstimator** | ~100
Sep 2018 | HopsML (Keras/CollectiveAllReduce)* | ~100
* https://github.com/logicalclocks/hops-examples
** https://docs.azuredatabricks.net/_static/notebooks/horovod-estimator.html
38. Feeding Data to TensorFlow
38/48
Pipeline (diagram): Filesystem (.tfrecords, .csv, .parquet) → Wrangling/Cleaning DataFrame (CPUs) → Model Training (GPUs)
Project Hydrogen: Barrier Execution mode in Spark. JIRAs: SPARK-24374, SPARK-24723, SPARK-24579
39. Filesystems are not good enough
Uber on Petastorm: "[Using files] is hard to implement at large scale, especially using modern distributed file systems such as HDFS and S3 (these systems are typically optimized for fast reads of large chunks of data)."
https://eng.uber.com/petastorm/
39/48
40. PetaStorm: Read Parquet directly into TensorFlow

# Imports assumed from the Petastorm API of this era
from petastorm.reader import Reader
from petastorm.tf_utils import make_petastorm_dataset
import tensorflow as tf

with Reader('hdfs://myhadoop/dataset.parquet') as reader:
    dataset = make_petastorm_dataset(reader)
    iterator = dataset.make_one_shot_iterator()
    tensor = iterator.get_next()
    with tf.Session() as sess:
        sample = sess.run(tensor)
        print(sample.id)

40/48
41. NVMe Disks – Game Changer
• HDFS (and S3) are designed around large blocks (optimized to overcome slow random I/O on disks), while new NVMe hardware supports orders-of-magnitude faster random disk I/O.
• Can we support faster random disk I/O with HDFS?
– Yes, with HopsFS.
41/48
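The two access patterns at stake can be made concrete: one large sequential read (the HDFS-style big-block pattern) versus many small reads at random offsets (the pattern NVMe handles well and spinning disks do not). A toy illustration with made-up sizes, against a temporary local file:

```python
import os
import random
import tempfile

block = 4096
path = os.path.join(tempfile.mkdtemp(), "data.bin")
payload = os.urandom(block * 256)            # a 1 MiB file
with open(path, "wb") as f:
    f.write(payload)

# Big-block pattern: one large sequential read.
with open(path, "rb") as f:
    sequential = f.read()

# Small-file pattern: many 4 KiB reads at random offsets.
random.seed(0)
offsets = [random.randrange(256) * block for _ in range(32)]
chunks = []
with open(path, "rb") as f:
    for off in offsets:
        f.seek(off)
        chunks.append(f.read(block))

print(sequential == payload, len(chunks))    # both patterns read correctly
```

On a spinning disk each `seek` in the second loop costs a mechanical head movement; on NVMe the random reads are orders of magnitude cheaper, which is what makes serving many small files from fast storage viable.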
42. Small files on NVMe
• At Spotify's HDFS:
– 33% of files are < 64KB in size
– 42% of operations are on files < 16KB in size
42/48
*Size Matters: Improving the Performance of Small Files in Hadoop, Middleware 2018. Niazi et al
43. HopsFS – NVMe Performance
• HDFS with Distributed Metadata
– Winner of the IEEE Scale Prize 2017
• Small files stored replicated in the metadata layer on NVMe disks*
– Read tens of thousands of images/second from HopsFS
43/48
*Size Matters: Improving the Performance of Small Files in Hadoop, Middleware 2018. Niazi et al
47. Orchestrating ML Pipelines with Airflow
47/48
Airflow orchestrates the pipeline (diagram): Spark (ETL) → TensorFlow (Dist Training) → Test & Optimize → Kubernetes (Model Serving) → Spark Streaming (Monitoring)
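The dependency structure Airflow manages here can be sketched without Airflow itself: the slide's stages form a DAG, and the scheduler's job is to run them in topological order. A stand-in using the standard library (stage names adapted from the diagram; in a real deployment each stage would be an Airflow operator):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Stage dependencies from the slide's pipeline, expressed as
# {stage: set of upstream stages}, as an Airflow DAG would declare them.
pipeline = {
    "spark_etl": set(),
    "tf_dist_training": {"spark_etl"},
    "test_and_optimize": {"tf_dist_training"},
    "k8s_model_serving": {"test_and_optimize"},
    "spark_streaming_monitoring": {"k8s_model_serving"},
}
order = list(TopologicalSorter(pipeline).static_order())
print(order)
# ['spark_etl', 'tf_dist_training', 'test_and_optimize',
#  'k8s_model_serving', 'spark_streaming_monitoring']
```

This slide's pipeline happens to be a simple chain, but the same declaration style handles fan-out (e.g. several training jobs downstream of one ETL stage), which is where a real orchestrator earns its keep.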
48. Summary
48/48
• The future of Deep Learning is Distributed
https://www.oreilly.com/ideas/distributed-tensorflow
• Hops is a new Data Platform with first-class support for Python / Deep Learning / ML / Data Governance / GPUs
hopshadoop logicalclocks
www.hops.io
www.logicalclocks.com