
ROCm and Distributed Deep Learning on Spark and TensorFlow


ROCm, the Radeon Open Ecosystem, is an open-source software foundation for GPU computing on Linux. ROCm supports TensorFlow and PyTorch using MIOpen, a library of highly optimized GPU routines for deep learning. In this talk, we describe how Apache Spark is a key enabling platform for distributed deep learning on ROCm, as it allows different deep learning frameworks to be embedded in Spark workflows in a secure end-to-end machine learning pipeline. We will analyse the different frameworks for integrating Spark with TensorFlow on ROCm, from Horovod to HopsML to Databricks' Project Hydrogen. We will also examine the surprising places where bottlenecks can surface when training models (everything from object stores to the data scientists themselves), and investigate ways to get around them. The talk will include a live demonstration of training and inference for a TensorFlow application embedded in a Spark pipeline, written in a Jupyter notebook on Hopsworks with ROCm.



  1. WiFi SSID: SparkAISummit | Password: UnifiedAnalytics
  2. Jim Dowling, CEO @ Logical Clocks (@jim_dowling), and Ajit Mathews, VP Machine Learning @ AMD: ROCm and Distributed Deep Learning on Spark and TensorFlow. #UnifiedAnalytics #SparkAISummit
  3. Great Hedge of India. The East India Company was one of the industrial world's first monopolies. They assembled a thorny hedge (not a wall!) spanning India, and you paid customs duty to bring salt over the wall (sorry, hedge). In 2019, not all graphics cards are allowed to be used in a data center. Monopolies are not good for deep learning! [Image from Wikipedia]
  4. Nvidia™ 2080Ti vs AMD Radeon™ VII, ResNet-50 benchmark (ImageNet, synthetic data):
     Nvidia™ 2080Ti: 11 GB memory, TensorFlow 1.12, CUDA 10.0.130, cuDNN 7.4.1. FP32: ~322 images/sec; FP16: ~560 images/sec.
     AMD Radeon™ VII: 16 GB memory, TensorFlow 1.13.1, ROCm 2.3. FP32: ~302 images/sec; FP16: ~415 images/sec.
     Sources: https://lambdalabs.com/blog/2080-ti-deep-learning-benchmarks/
     https://www.phoronix.com/scan.php?page=article&item=nvidia-rtx2080ti-tensorflow&num=2
     https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/173
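     For context, images/sec here is training throughput on synthetic data. A minimal sketch of how such a number can be measured with tf.keras (an illustrative stand-in, not the benchmark script behind the slide's figures):

         # Hypothetical ResNet-50 throughput measurement on synthetic data.
         import time
         import numpy as np
         import tensorflow as tf

         batch_size = 64
         model = tf.keras.applications.ResNet50(weights=None)  # random init
         model.compile(optimizer="sgd", loss="categorical_crossentropy")

         # Synthetic "ImageNet" batch: random images and one-hot labels.
         x = np.random.rand(batch_size, 224, 224, 3).astype("float32")
         y = tf.keras.utils.to_categorical(
             np.random.randint(0, 1000, batch_size), num_classes=1000)

         model.train_on_batch(x, y)  # warm-up (graph build, memory alloc)
         start, steps = time.time(), 10
         for _ in range(steps):
             model.train_on_batch(x, y)
         print("~%.0f images/sec" % (steps * batch_size / (time.time() - start)))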
  5. AMD ML Software Strategy: an open-source foundation for machine learning. The fully open-source ROCm platform stack: Spark / machine-learning apps and data-platform tools on top; the latest machine-learning frameworks with Docker and Kubernetes support; middleware and libraries (optimized math and communication libraries: MIOpen, BLAS/FFT/RNG, RCCL, Eigen); programming models (OpenMP, HIP, OpenCL™, Python); and devices (GPU, CPU, APU, DLA), with drivers up-streamed into Linux kernel distributions.
  6. Distro: upstream Linux kernel support. Linux kernel 4.17 includes 700+ upstream ROCm driver commits since the 4.12 kernel. https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver
  7. Languages: multiple programming options. OpenMP, HIP, OpenCL, and Python all compile through the LLVM-based AMDGCN backend down to AMD GPU code. LLVM: https://llvm.org/docs/AMDGPUUsage.html CLANG HIP: https://clang.llvm.org/doxygen/HIP_8h_source.html
  8. Libraries: optimized math libraries (rocBLAS, rocSparse, rocALUTION, rocPrim, rocFFT, rocSolver, rocRAND) with up to 98% software efficiency, supporting FP64, FP32, mixed precision, INT8 (and INT4). https://github.com/ROCm-Developer-Tools
  9. Machine learning frameworks. TensorFlow: high-performance FP16/FP32 training with up to 8 GPUs per node; v1.13 is available today as a Docker container or as a Python pip wheel. PyTorch: AMD support in the mainline repository, including initial multi-GPU support for Caffe2; available today as a Docker container (or build from source).
  10. ROCm distributed training. RCCL is an optimized collective-communication operations library with easy MPI integration and support for InfiniBand and RoCE high-speed network fabrics, via ROCm-enabled UCX and ROCm with ROCnRDMA. ResNet-50 multi-GPU scaling (PCIe, CPU parameter server): 1 GPU = 1.00x, 2 GPUs = 1.99x, 4 GPUs = 3.98x, 8 GPUs = 7.64x, i.e. about 95% scaling efficiency (7.64/8) at 8 GPUs.
  11. GPU acceleration and distributed training. ROCm supports AllReduce distributed training using Horovod and HopsML (a sketch follows this slide).
     o Docker and Kubernetes support: https://hub.docker.com/r/rocm/rocm-terminal/ and a ROCm-supported Kubernetes device plugin (https://hub.docker.com/r/rocm/k8s-device-plugin/) that enables registration of GPUs in a container cluster for compute workloads.
     o Extensive language support, including Python: Anaconda Numba (https://github.com/numba/roctools; with support for AMD's ROCm drivers, Numba lets you write parallel GPU algorithms entirely from Python), Conda (https://anaconda.org/rocm), and CLANG HIP (https://clang.llvm.org/doxygen/HIP_8h_source.html).
     o Horovod support with RCCL: ring-allreduce (https://github.com/ROCmSoftwarePlatform/horovod), TensorFlow CollectiveAllReduceStrategy, and HopsML CollectiveAllReduceStrategy.
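     As an illustration of the ring-allreduce pattern named above, a minimal hedged sketch with Horovod's Keras API (assumes a Horovod build with RCCL support on ROCm, launched with something like `horovodrun -np 8 python train.py`; the synthetic data is a placeholder):

         import numpy as np
         import tensorflow as tf
         import horovod.tensorflow.keras as hvd

         hvd.init()  # one process per GPU

         # Scale the learning rate by the number of workers and wrap the
         # optimizer so gradients are averaged with ring-allreduce.
         opt = hvd.DistributedOptimizer(
             tf.keras.optimizers.SGD(0.01 * hvd.size()))

         model = tf.keras.applications.ResNet50(weights=None)
         model.compile(optimizer=opt, loss="categorical_crossentropy")

         # Synthetic batch, for illustration only.
         x = np.random.rand(64, 224, 224, 3).astype("float32")
         y = tf.keras.utils.to_categorical(
             np.random.randint(0, 1000, 64), num_classes=1000)

         # Broadcast rank 0's initial weights so all workers start in sync.
         model.fit(x, y, batch_size=8, epochs=1,
                   callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)])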
  12. Hopsworks: open-source data and AI, on-premise or on AWS, Azure, and GCE. Data sources feed Kafka and HopsFS; Airflow orchestrates Spark/Flink streaming, the Spark Feature Store, Hive, and deep learning; notebooks, BI tools and reporting, Elastic, and model serving with Kubernetes complete the platform (Hopsworks services plus external services).
  13. The same Hopsworks architecture, highlighting the workloads it unifies: batch analytics, streaming, and ML & deep learning.
  14. Hopsworks hides ML complexity: model training is just one box, sitting between the Feature Store and the HopsML API. [Diagram adapted from "Hidden Technical Debt in Machine Learning Systems"] https://www.logicalclocks.com/feature-store/ https://hopsworks.readthedocs.io/en/latest/hopsml/hopsML.html
  15. ROCm -> Spark / TensorFlow. Spark/TensorFlow applications run unchanged on ROCm; Hopsworks runs Spark/TensorFlow on YARN and Conda.
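     One concrete way to see the "unchanged" property (an illustrative snippet, not from the slides): the same device query returns the machine's GPUs whether they are CUDA or ROCm devices, so application code does not branch on the GPU vendor.

         import tensorflow as tf
         from tensorflow.python.client import device_lib

         # Prints CPU and GPU devices; on a ROCm build of TensorFlow the
         # AMD GPUs show up here just like CUDA GPUs do on an Nvidia build.
         for d in device_lib.list_local_devices():
             print(d.name, d.device_type)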
  16. YARN support for ROCm in Hops. A container is a cgroup that isolates CPU, memory, and GPU resources, and has a conda environment and TLS certs. [Diagram: the Resource Manager schedules containers onto Node Managers; the Spark Driver and each Executor run in their own container.]
  17. Distributed deep learning with Spark/TF: hyperparameter search. The driver distributes a training function to the executors (each with its own conda environment); HopsFS stores the training data, logs, models, checkpoints, and TensorBoard events. (A fleshed-out sketch follows this slide.)

         # RUNS ON THE EXECUTORS
         def train(lr, dropout):
             def input_fn():  # return dataset
                 ...
             optimizer = ...
             model = ...
             model.add(Conv2D(...))
             model.compile(...)
             model.fit(...)
             model.evaluate(...)

         # RUNS ON THE DRIVER
         hparams = {'lr': [0.001, 0.0001],
                    'dropout': [0.25, 0.5, 0.75]}
         experiment.grid_search(train, hparams)

     More details: Spark Summit Europe 2018 talk, https://www.youtube.com/watch?v=tx6HyoUYGL0 and https://github.com/logicalclocks/hops-examples
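     A hedged, runnable version of the pattern above, assuming the hops-util-py library (`from hops import experiment`) is installed on the cluster; the model and dataset are illustrative stand-ins:

         # Illustrative sketch; the hops experiment API is assumed from
         # hops-util-py, and the MNIST model is a placeholder.
         from hops import experiment

         def train(lr, dropout):
             import tensorflow as tf
             model = tf.keras.Sequential([
                 tf.keras.layers.Conv2D(32, 3, activation="relu",
                                        input_shape=(28, 28, 1)),
                 tf.keras.layers.Flatten(),
                 tf.keras.layers.Dropout(dropout),
                 tf.keras.layers.Dense(10, activation="softmax"),
             ])
             model.compile(optimizer=tf.keras.optimizers.Adam(lr),
                           loss="sparse_categorical_crossentropy",
                           metrics=["accuracy"])
             (x, y), _ = tf.keras.datasets.mnist.load_data()
             x = x[..., None].astype("float32") / 255.0
             model.fit(x, y, epochs=1, verbose=0)
             _, acc = model.evaluate(x, y, verbose=0)
             return acc  # the metric the search optimizes

         # Each (lr, dropout) combination runs as a task on a Spark executor.
         hparams = {'lr': [0.001, 0.0001], 'dropout': [0.25, 0.5, 0.75]}
         experiment.grid_search(train, hparams, direction='max')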
  18. Distributed deep learning with Spark/TF: collective all-reduce training over the same driver/executor/HopsFS architecture. (A fuller sketch follows this slide.)

         # RUNS ON THE EXECUTORS
         def train():
             def input_fn():  # return dataset
                 ...
             model = ...
             optimizer = ...
             model.compile(...)
             rc = tf.estimator.RunConfig('CollectiveAllReduceStrategy')
             keras_estimator = tf.keras.estimator.model_to_estimator(...)
             tf.estimator.train_and_evaluate(keras_estimator, input_fn)

         # RUNS ON THE DRIVER
         experiment.collective_all_reduce(train)

     More details: Spark Summit Europe 2018 talk, https://www.youtube.com/watch?v=tx6HyoUYGL0 and https://github.com/logicalclocks/hops-examples
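     A hedged expansion of the slide's pseudocode using TF 1.13-era estimator APIs (on Hopsworks, HopsML sets TF_CONFIG on each executor; the model and input pipeline here are placeholders):

         # Sketch under TF 1.x assumptions; CollectiveAllReduceStrategy
         # lived in tf.contrib.distribute in TF 1.13.
         import tensorflow as tf
         from hops import experiment

         def train():
             def input_fn():
                 (x, y), _ = tf.keras.datasets.mnist.load_data()
                 x = x[..., None].astype("float32") / 255.0
                 return (tf.data.Dataset.from_tensor_slices((x, y))
                         .batch(64).repeat())

             model = tf.keras.Sequential([
                 tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
                 tf.keras.layers.Dense(10, activation="softmax"),
             ])
             model.compile(optimizer="adam",
                           loss="sparse_categorical_crossentropy")

             # Gradients are averaged across workers by collective allreduce.
             strategy = tf.contrib.distribute.CollectiveAllReduceStrategy()
             rc = tf.estimator.RunConfig(train_distribute=strategy)
             est = tf.keras.estimator.model_to_estimator(model, config=rc)
             tf.estimator.train_and_evaluate(
                 est,
                 tf.estimator.TrainSpec(input_fn, max_steps=1000),
                 tf.estimator.EvalSpec(input_fn, steps=100))

         # HopsML launches one worker per executor and sets up TF_CONFIG.
         experiment.collective_all_reduce(train)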
  19. Horizontally scalable ML pipelines with Hopsworks, orchestrated by Airflow: ingest (raw data, event data) -> data prep -> feature store -> experiment/train -> deploy -> serving -> monitor, with logs and the metadata store alongside HopsFS.
  20. Pipelines from Jupyter notebooks. Interactive work goes through the PySpark kernel and Livy server against HopsYARN and HopsFS (certs and environment variables are materialized into the containers). The Jobs service converts .ipynb to .py, runs .py or .jar files, and schedules them via the REST API or UI. Old notebooks, experiments, and visualizations (including TensorBoard) remain viewable, with logs and results stored alongside the .ipynb contents in HDFS.
  21. DEMO: distributed TensorFlow with Spark/TensorFlow and ROCm.
  22. @logicalclocks www.logicalclocks.com Open Source Foundation for Machine Learning
  23. Don't forget to rate and review the sessions. Search: Spark + AI Summit.
  24. Disclaimer & Attribution. The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
     ATTRIBUTION: © 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, Radeon and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners. Use of third party marks / names is for informational purposes only and no endorsement of or by AMD is intended or implied.
