Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling

Methods that scale with available computation are the future of AI. Distributed deep learning is one such method that enables data scientists to massively increase their productivity by (1) running parallel experiments over many devices (GPUs/TPUs/servers) and (2) dramatically reducing training time by distributing the training of a single network over many devices. Apache Spark is a key enabling platform for distributed deep learning, as it enables different deep learning frameworks to be embedded in Spark workflows in a secure end-to-end pipeline. In this talk, we examine the different ways in which TensorFlow can be included in Spark workflows to build distributed deep learning applications.

We will analyse the different frameworks for integrating Spark with TensorFlow, from Horovod to TensorFlowOnSpark to Databricks' Deep Learning Pipelines. We will also look at where the bottlenecks arise when training models (in your frameworks, the network, GPUs, and with your data scientists) and how to get around them. We will look at how to use the Spark Estimator model to perform hyperparameter optimization with Spark/TensorFlow, and at model-architecture search, where Spark executors run experiments in parallel to automatically find good model architectures.

The talk will include a live demonstration of training and inference for a TensorFlow application embedded in a Spark pipeline, written in a Jupyter notebook on the Hops platform. We will show how to debug the application using both the Spark UI and TensorBoard, and how to examine logs and monitor training. The demo will be run on the Hops platform, currently used by over 450 researchers and students in Sweden, as well as at companies such as Scania and Ericsson.

Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling

  1. 1. Jim Dowling, Logical Clocks AB Distributed Deep Learning with Apache Spark and TensorFlow #SAISDL2 jim_dowling
  2. 2. The Cargobike Riddle 2/48 ?
  3. 3. Spark & TensorFlow: Dynamic Executors (release GPUs when training finishes), Containers (GPUs), Blacklisting Executors (for Fault-Tolerant Hyperparameter Optimization), Optimizing GPU Resource Utilization 3/48
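     A sketch of the Spark settings this slide alludes to, using standard Spark 2.x properties for dynamic executor allocation and executor blacklisting; the GPU-per-executor property shown last is a placeholder for a platform-specific setting (e.g. on Hopsworks/YARN), not a standard Spark 2.x option.
     from pyspark.sql import SparkSession

     spark = (SparkSession.builder
              .appName("distributed-dl")
              # Release executors (and their GPUs) when training finishes.
              .config("spark.dynamicAllocation.enabled", "true")
              .config("spark.dynamicAllocation.minExecutors", "0")
              .config("spark.dynamicAllocation.maxExecutors", "6")
              # Blacklist misbehaving executors so hyperparameter trials are retried elsewhere.
              .config("spark.blacklist.enabled", "true")
              # Placeholder name: GPU requests per executor are platform-specific.
              .config("spark.executor.gpus", "1")
              .getOrCreate())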
  4. 4. Scalable ML Pipeline 4/48 – Stages: Data Collection, Feature Extraction, Data Transformation & Verification, Experimentation, Training, Test, Serving. Potential bottlenecks: Distributed Storage (Object Stores (S3, GCS), HDFS, Ceph), Standalone Single-Host TensorFlow, Single GPU
  5. 5. Scalable ML Pipeline 5/48 – the same stages (Data Collection, Feature Extraction, Data Transformation & Verification, Experimentation, Training, Test, Serving), built on HopsFS (Hopsworks), PySpark, TensorFlow, and Kubernetes
  6. 6. 6/48 Hopsworks REST API (JWT / TLS), Airflow
  7. 7. Spark/TensorFlow in Hopsworks 7/48 – Driver and Executors (each with Conda Envs), HopsFS, TensorBoard/Logs, Model Serving
  8. 8. HopsML 8/48 • Experiments – Dist. Hyperparameter Optimization – Versioning of Models/Code/Resources – Visualization with TensorBoard – Distributed Training with checkpointing • [Feature Store] • Model Serving and Monitoring. Pipeline: Data Acquisition, Clean/Transform Data, Feature Extraction, Experimentation, Training, Test + Serve
  9. 9. Why Distributed Deep Learning? 9/48
  10. 10. 10/48 Prof Nando de Freitas @NandoDF “ICLR 2019 lessons thus far: The deep neural nets have to be BIGGER and they’re hungry for data, memory and compute.” https://twitter.com/NandoDF/status/1046371764211716096
  11. 11. All Roads Lead to Distribution 11/48 – Distributed Deep Learning: Hyperparameter Optimization, Distributed Training, Larger Training Datasets, Elastic Model Serving, Parallel Experiments, (Commodity) GPU Clusters, AutoML
  12. 12. Hyperparameter Optimization 12/48 (Because DL Theory Sucks!)
  13. 13. Faster Experimentation 13/48 – Hyperparameter optimization of a TensorFlow program over GPU servers. Search space: LearningRate (LR): [0.0001-0.001], NumOfLayers (NL): [5-10], …… Example trials: LR 0.0001/NL 5 → Error 0.35; LR 0.0001/NL 7 → Error 0.34; LR 0.0005/NL 5 → Error 0.38; LR 0.0005/NL 10 → Error 0.37; LR 0.001/NL 6 → Error 0.31; LR 0.001/NL 9 → Error 0.36 (Blacklist Executor). Best found: LR 0.001, NL 6, Error 0.31
  14. 14. Declarative or API Approach? • Declarative – Hyperparameters in external files – Vizier/CloudML (yaml) – SageMaker (json)* • API-Driven (Notebook-Friendly) – Databricks' MLflow – HopsML 14/48 *https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-define-ranges.html
  15. 15. GridSearch for Hyperparameters on HopsML (launches 6 Spark Executors; Dynamic Executors, Blacklisting):
     def train(learning_rate, dropout):
         [TensorFlow Code here]
     args_dict = {'learning_rate': [0.001, 0.005, 0.01], 'dropout': [0.5, 0.6]}
     experiment.launch(train, args_dict)
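     A minimal, runnable-style sketch of the grid search above, assuming the hops experiment module from the slide is available in the notebook's Python environment; the small Keras model, the MNIST data, and the returned loss metric are illustrative placeholders rather than slide content.
     from hops import experiment  # HopsML helper shown on the slide

     def train(learning_rate, dropout):
         # One trial; each hyperparameter combination runs in its own Spark executor.
         import tensorflow as tf
         (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
         x_train = x_train.reshape(-1, 784).astype('float32') / 255.0
         model = tf.keras.Sequential([
             tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
             tf.keras.layers.Dropout(dropout),
             tf.keras.layers.Dense(10, activation='softmax'),
         ])
         model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                       loss='sparse_categorical_crossentropy',
                       metrics=['accuracy'])
         history = model.fit(x_train, y_train, epochs=1, batch_size=128, verbose=0)
         return history.history['loss'][-1]  # metric to record per trial (assumed)

     # 3 learning rates x 2 dropout values = the 6 parallel trials / executors on the slide.
     args_dict = {'learning_rate': [0.001, 0.005, 0.01], 'dropout': [0.5, 0.6]}
     experiment.launch(train, args_dict)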
  16. 16. Distributed Training 16/48 Image from @hardmaru on Twitter.
  17. 17. Data Parallel Distributed Training 17/48 (chart: training time vs. generalization error, Synchronous Stochastic Gradient Descent (SGD))
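     For reference, the synchronous data-parallel SGD update the slide refers to, in generic notation (not taken from the slides): each of N workers computes gradients on its own mini-batch shard, the gradients are averaged (e.g. with AllReduce), and every replica applies the same step:
     w_{t+1} = w_t - \eta \, \frac{1}{N} \sum_{i=1}^{N} \nabla_{w} L\!\left(w_t;\, \mathcal{B}_i\right)
     where \mathcal{B}_i is the mini-batch on worker i and \eta the learning rate, so the effective batch size grows with N.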
  18. 18. Frameworks for Distributed Training 18/48
  19. 19. Distributed TensorFlow / TensorFlowOnSpark – TF_CONFIG: Bring your own Distribution! 1. Start all processes for P1, P2, G1-G4 yourself. 2. Enter all IP addresses in TF_CONFIG along with GPU device IDs. 19/48 (diagram: Parameter Servers P1, P2; GPU Servers G1-G4; TF_CONFIG)
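     To make "bring your own distribution" concrete, a sketch of a hand-written TF_CONFIG for the cluster in the diagram (two parameter servers P1-P2, four GPU workers G1-G4); host names and ports are placeholders, and each process must be started with its own task type and index.
     import json, os

     cluster = {
         "ps":     ["ps1.example.com:2222", "ps2.example.com:2222"],    # P1, P2
         "worker": ["gpu1.example.com:2222", "gpu2.example.com:2222",
                    "gpu3.example.com:2222", "gpu4.example.com:2222"],  # G1-G4
     }

     # Every process gets the same cluster spec but its own task entry;
     # this snippet configures worker 0 (G1).
     os.environ["TF_CONFIG"] = json.dumps({
         "cluster": cluster,
         "task": {"type": "worker", "index": 0},
     })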
  20. 20. Horovod • Bandwidth optimal • Builds the Ring, runs AllReduce using MPI and NCCL2 • Available in – Hopsworks – Databricks (Spark 2.4) 20/48
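     A minimal Horovod-with-Keras sketch using the standard Horovod API (hvd.init, DistributedOptimizer, the broadcast callback); the model and data are illustrative placeholders, and on Hopsworks or Databricks the MPI launch is handled by the platform rather than by hand.
     import tensorflow as tf
     import horovod.tensorflow.keras as hvd

     hvd.init()  # one process per GPU

     # Pin each process to a single GPU (TF 1.x style session config).
     config = tf.ConfigProto()
     config.gpu_options.visible_device_list = str(hvd.local_rank())
     tf.keras.backend.set_session(tf.Session(config=config))

     (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
     x_train = x_train.reshape(-1, 784).astype('float32') / 255.0

     model = tf.keras.Sequential([
         tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
         tf.keras.layers.Dense(10, activation='softmax'),
     ])

     # Scale the learning rate by the number of workers and wrap the optimizer
     # so gradients are averaged with ring-AllReduce.
     opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
     model.compile(optimizer=opt, loss='sparse_categorical_crossentropy',
                   metrics=['accuracy'])

     callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
     model.fit(x_train, y_train, batch_size=128, epochs=1,
               callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)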
  21. 21. TF CollectiveAllReduceStrategy – TF_CONFIG, again. Bring your own Distribution! 1. Start all processes for G1-G4 yourself. 2. Enter all IP addresses in TF_CONFIG along with GPU device IDs. Available from TensorFlow 1.11. 21/48 (diagram: G1-G4, TF_CONFIG)
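     A sketch of the worker-only TF_CONFIG that CollectiveAllReduceStrategy expects, plus creating the strategy as it shipped in tf.contrib in TensorFlow 1.11; host names are placeholders, and exactly how the strategy is wired into an Estimator RunConfig changed across 1.x releases, so treat the last line as indicative.
     import json, os
     import tensorflow as tf

     # No parameter servers this time: only the four GPU workers G1-G4.
     os.environ["TF_CONFIG"] = json.dumps({
         "cluster": {"worker": ["gpu1.example.com:2222", "gpu2.example.com:2222",
                                "gpu3.example.com:2222", "gpu4.example.com:2222"]},
         "task": {"type": "worker", "index": 0},  # different index per process
     })

     # TensorFlow 1.11: the strategy lived in tf.contrib.distribute.
     strategy = tf.contrib.distribute.CollectiveAllReduceStrategy(num_gpus_per_worker=1)

     # Indicative only: hand the strategy to the Estimator via RunConfig.
     run_config = tf.estimator.RunConfig(train_distribute=strategy)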
  22. 22. TF CollectiveAllReduceStrategy Gotchas • Specify GPU order in the ring statically – gpu_indices • Configure the batch size for merging tensors – allreduce_merge_scope • Set to '1' for no merging • Set to '32' for higher throughput.* 22/48 *https://groups.google.com/a/tensorflow.org/forum/#!topic/discuss/7T05tNV08Us
  23. 23. HopsML CollectiveAllReduceStrategy • Uses Spark/YARN to add distribution to TensorFlow’s CollectiveAllReduceStrategy – Automatically builds the ring (Spark/YARN) – Allocates GPUs to Spark Executors 23/48 https://github.com/logicalclocks/hops-examples/tree/master/tensorflow/notebooks/Distributed_Training
  24. 24. CollectiveAllReduce vs Horovod Benchmark (Small Model) 24/48
     Common setup – Model: Inception v1, Dataset: imagenet (synthetic), Batch size: 256 global / 32.0 per device, Num batches: 100, Optimizer: Momentum, Num GPUs: 8
     CollectiveAllReduce (TensorFlow 1.11): step 1: 2972.4 +/- 0.0 images/sec; step 10: 3008.9 +/- 8.9; step 100: 2998.6 +/- 4.3; total: 2993.52 images/sec
     Horovod (TensorFlow 1.7): step 1: 2816.6 +/- 0.0 images/sec; step 10: 2808.0 +/- 10.8; step 100: 2806.9 +/- 3.9; total: 2803.69 images/sec
     https://groups.google.com/a/tensorflow.org/forum/#!topic/discuss/7T05tNV08Us
  25. 25. CollectiveAllReduce vs Horovod Benchmark (Big Model) 25/48
     Common setup – Model: VGG19, Dataset: imagenet (synthetic), Batch size: 256 global / 32.0 per device, Num batches: 100, Optimizer: Momentum, Num GPUs: 8
     CollectiveAllReduce (TensorFlow 1.11): step 1: 634.4 +/- 0.0 images/sec; step 10: 635.2 +/- 0.8; step 100: 635.0 +/- 0.5; total: 634.80 images/sec
     Horovod (TensorFlow 1.7): step 1: 583.01 +/- 0.0 images/sec; step 10: 582.22 +/- 0.1; step 100: 583.61 +/- 0.2; total: 583.61 images/sec
     https://groups.google.com/a/tensorflow.org/forum/#!topic/discuss/7T05tNV08Us
  26. 26. Reduction in LoC for Dist Training 26/48
     Released   | Framework                            | Lines of Code in Hops
     March 2016 | DistributedTensorFlow                | ~1000
     Feb 2017   | TensorFlowOnSpark*                   | ~900
     Jan 2018   | Horovod (Keras)*                     | ~130
     June 2018  | Databricks' HorovodEstimator**       | ~100
     Sep 2018   | HopsML (Keras/CollectiveAllReduce)*  | ~100
     *https://github.com/logicalclocks/hops-examples
     **https://docs.azuredatabricks.net/_static/notebooks/horovod-estimator.html
  27. 27. Estimator APIs in TensorFlow 27/48
  28. 28. Estimators log to the Distributed Filesystem 28/48
     tf.estimator.RunConfig('CollectiveAllReduceStrategy', model_dir, tensorboard_logs, checkpoints)
     experiment.launch(…)
     Paths on HopsFS (HDFS):
     /Experiments/appId/run.ID/<name>
     /Experiments/appId/run.ID/<name>/eval
     /Experiments/appId/run.ID/<name>/checkpoint
     /Experiments/appId/run.ID/<name>/*.ipynb
     /Experiments/appId/run.ID/<name>/conda.yml
  29. 29. HopsML CollectiveAllReduceStrategy with Keras:
     def distributed_training():
         def input_fn():
             # return dataset
         model = …
         optimizer = …
         model.compile(…)
         rc = tf.estimator.RunConfig('CollectiveAllReduceStrategy')
         keras_estimator = tf.keras.estimator.model_to_estimator(…)
         tf.estimator.train_and_evaluate(keras_estimator, input_fn)
     experiment.allreduce(distributed_training)
  30. 30. Add TensorBoard Support:
     def distributed_training():
         from hops import tensorboard
         model_dir = tensorboard.logdir()
         def input_fn():
             # return dataset
         model = …
         optimizer = …
         model.compile(…)
         rc = tf.estimator.RunConfig('CollectiveAllReduceStrategy')
         keras_estimator = keras.model_to_estimator(model_dir)
         tf.estimator.train_and_evaluate(keras_estimator, input_fn)
     experiment.allreduce(distributed_training)
  31. 31. GPU Device Awareness:
     def distributed_training():
         from hops import devices
         def input_fn():
             # return dataset
         model = …
         optimizer = …
         model.compile(…)
         est.RunConfig(num_gpus_per_worker=devices.get_num_gpus())
         keras_estimator = keras.model_to_estimator(…)
         tf.estimator.train_and_evaluate(keras_estimator, input_fn)
     experiment.allreduce(distributed_training)
  32. 32. Experiment Versioning (.ipynb, conda, results):
     def distributed_training():
         def input_fn():
             # return dataset
         model = …
         optimizer = …
         model.compile(…)
         rc = tf.estimator.RunConfig('CollectiveAllReduceStrategy')
         keras_estimator = keras.model_to_estimator(…)
         tf.estimator.train_and_evaluate(keras_estimator, input_fn)
     notebook = hdfs.project_path()+'/Jupyter/Experiment/inc.ipynb'
     experiment.allreduce(distributed_training, name='inception',
                          description='An inception example with hidden layers',
                          versioned_resources=[notebook])
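     Putting slides 29-32 together, a consolidated sketch of a distributed Keras/Estimator training function launched with HopsML. It mirrors the calls shown on the slides (hops experiment, tensorboard.logdir, devices.get_num_gpus, hdfs.project_path, and the 'CollectiveAllReduceStrategy' RunConfig shorthand, which is Hops-specific rather than plain TensorFlow); the tiny model, the input_fn contents, and the TrainSpec/EvalSpec wiring are illustrative additions, not slide content.
     import tensorflow as tf
     from hops import experiment, tensorboard, devices, hdfs

     def distributed_training():
         # Per-experiment log/checkpoint dir on HopsFS (slide 30).
         model_dir = tensorboard.logdir()

         def input_fn():
             # Illustrative input pipeline; returns a tf.data.Dataset of (features, labels).
             ds = tf.data.Dataset.from_tensor_slices(
                 ({'features': tf.random_uniform([1024, 784])},
                  tf.random_uniform([1024], maxval=10, dtype=tf.int32)))
             return ds.repeat().batch(128)

         inputs = tf.keras.Input(shape=(784,), name='features')
         hidden = tf.keras.layers.Dense(128, activation='relu')(inputs)
         outputs = tf.keras.layers.Dense(10, activation='softmax')(hidden)
         model = tf.keras.Model(inputs, outputs)
         model.compile(optimizer=tf.keras.optimizers.Adam(0.001),
                       loss='sparse_categorical_crossentropy',
                       metrics=['accuracy'])

         # Hops shorthand from the slides; plain TensorFlow's RunConfig differs (slide 31).
         rc = tf.estimator.RunConfig('CollectiveAllReduceStrategy',
                                     num_gpus_per_worker=devices.get_num_gpus())
         keras_estimator = tf.keras.estimator.model_to_estimator(
             keras_model=model, model_dir=model_dir, config=rc)

         tf.estimator.train_and_evaluate(
             keras_estimator,
             tf.estimator.TrainSpec(input_fn=input_fn, max_steps=1000),
             tf.estimator.EvalSpec(input_fn=input_fn, steps=10))

     # Versioned experiment launch (slide 32).
     notebook = hdfs.project_path() + '/Jupyter/Experiment/inc.ipynb'
     experiment.allreduce(distributed_training, name='inception',
                          description='An inception example with hidden layers',
                          versioned_resources=[notebook])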
  33. 33. Experiments/Versioning in Hopsworks
  34. 34. 34/48
  35. 35. 35/48
  36. 36. 36/48
  37. 37. The Data Layer (Foundations)
  38. 38. Feeding Data to TensorFlow 38/48 – DataFrame wrangling/cleaning (CPUs) → Filesystem (.tfrecords, .csv, .parquet) → Model Training (GPUs). Project Hydrogen: Barrier Execution Mode in Spark (JIRA: SPARK-24374, SPARK-24723, SPARK-24579)
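     Since the slide points at .tfrecords files as the hand-off format between Spark and TensorFlow, a generic TF 1.x tf.data sketch for reading them; the file path and the 'image'/'label' feature schema are hypothetical.
     import tensorflow as tf

     def parse_example(serialized):
         # Hypothetical schema: a raw image bytestring and an integer label.
         features = tf.parse_single_example(serialized, {
             'image': tf.FixedLenFeature([], tf.string),
             'label': tf.FixedLenFeature([], tf.int64),
         })
         image = tf.cast(tf.decode_raw(features['image'], tf.uint8), tf.float32) / 255.0
         return image, features['label']

     dataset = (tf.data.TFRecordDataset(['hdfs:///Projects/demo/train.tfrecords'])
                .map(parse_example, num_parallel_calls=4)
                .shuffle(10000)
                .batch(128)
                .prefetch(1))

     iterator = dataset.make_one_shot_iterator()
     images, labels = iterator.get_next()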
  39. 39. Filesystems are not good enough Uber on Petastorm: “[Using files] is hard to implement at large scale, especially using modern distributed file systems such as HDFS and S3 (these systems are typically optimized for fast reads of large chunks of data).” https://eng.uber.com/petastorm/ 39/48
  40. 40. Petastorm: Read Parquet directly into TensorFlow 40/48
     # Imports added for context; module paths follow early Petastorm releases.
     import tensorflow as tf
     from petastorm.reader import Reader
     from petastorm.tf_utils import make_petastorm_dataset

     with Reader('hdfs://myhadoop/dataset.parquet') as reader:
         dataset = make_petastorm_dataset(reader)
         iterator = dataset.make_one_shot_iterator()
         tensor = iterator.get_next()
         with tf.Session() as sess:
             sample = sess.run(tensor)
             print(sample.id)
  41. 41. NVMe Disks – Game Changer • HDFS (and S3) are designed around large blocks (optimized to overcome slow random I/O on disks), while new NVMe hardware supports orders of magnitude faster random disk I/O. • Can we support faster random disk I/O with HDFS? – Yes with HopsFS. 41/48
  42. 42. Small files on NVMe • In Spotify's HDFS: – 33% of files < 64KB in size – 42% of operations are on files < 16KB in size 42/48 *Size Matters: Improving the Performance of Small Files in Hadoop, Middleware 2018. Niazi et al
  43. 43. HopsFS – NVMe Performance • HDFS with Distributed Metadata – Winner IEEE Scale Prize 2017 • Small files stored replicated in the metadata layer on NVMe disks* – Read 10s of 1000s of images/second from HopsFS 43/48 *Size Matters: Improving the Performance of Small Files in Hadoop, Middleware 2018. Niazi et al
  44. 44. Model Serving
  45. 45. Model Serving on Kubernetes 45/48
  46. 46. Model Serving 46/48 – Hopsworks REST API, with External Apps and Internal Apps going through a Load Balancer to Model Serving Containers; Logging, Streaming, Notifications, Retrain. Features: • Canary • Multiple Models • Scale-Out/In. Frameworks: TensorFlow Serving, MLeap for Spark, scikit-learn
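     As an illustration of what a client call to a served model can look like, a sketch against the standard TensorFlow Serving REST API (the /v1/models/<name>:predict endpoint with an "instances" payload); the host, port, model name, and input shape are placeholders, and the Hopsworks REST API adds its own routing and JWT authentication on top, which is not shown here.
     import json
     import requests  # third-party HTTP client

     # Placeholder endpoint; behind Hopsworks the request goes through its
     # REST API / load balancer and carries a JWT instead of hitting the container directly.
     url = "http://serving.example.com:8501/v1/models/inception:predict"

     payload = {"instances": [[0.0] * 784]}  # one dummy 784-feature input
     resp = requests.post(url, data=json.dumps(payload))
     resp.raise_for_status()
     print(resp.json()["predictions"])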
  47. 47. Orchestrating ML Pipelines with Airflow 47/48 – Airflow triggers: Spark ETL → Dist Training (TensorFlow) → Test & Optimize → Model Serving (Kubernetes) → Monitoring (Spark Streaming)
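     A minimal Airflow DAG sketch for a pipeline of that shape, using the stock Airflow 1.x BashOperator; the DAG id, schedule, and the commands each step runs are hypothetical stand-ins for the platform-specific Spark/HopsML/serving jobs.
     from datetime import datetime
     from airflow import DAG
     from airflow.operators.bash_operator import BashOperator

     dag = DAG('ml_pipeline',
               schedule_interval='@daily',
               start_date=datetime(2018, 1, 1),
               catchup=False)

     # Each step would normally submit a Spark / training / serving job;
     # the bash commands here are placeholders.
     etl = BashOperator(task_id='spark_etl',
                        bash_command='echo "run Spark ETL job"', dag=dag)
     train = BashOperator(task_id='distributed_training',
                          bash_command='echo "run distributed TensorFlow training"', dag=dag)
     test = BashOperator(task_id='test_and_optimize',
                         bash_command='echo "evaluate and optimize the model"', dag=dag)
     serve = BashOperator(task_id='deploy_model_serving',
                          bash_command='echo "deploy model to Kubernetes serving"', dag=dag)

     etl >> train >> test >> serve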
  48. 48. Summary 48/48 • The future of Deep Learning is Distributed https://www.oreilly.com/ideas/distributed-tensorflow • Hops is a new Data Platform with first-class support for Python / Deep Learning / ML / Data Governance / GPUs hopshadoop logicalclocks www.hops.io www.logicalclocks.com
