Infrastructure for Deep Learning in Apache Spark

In machine learning projects, preparing large datasets is a key phase that can be complex and expensive. Traditionally, data engineers did this work before handing over to data scientists or ML engineers, and the two groups operated in different environments because each phase required different tools, frameworks and runtimes. Spark's support for different types of workloads brought data engineering closer to downstream activities, such as machine learning, that depend on the data. Unifying data acquisition, preprocessing, model training and batch inference on a single Spark-based platform not only provided a seamless experience across phases and helped accelerate the end-to-end ML lifecycle, but also lowered the TCO of building and managing the infrastructure for those phases. With that, the requirements for shared infrastructure expanded to include specialized hardware such as GPUs and support for deep learning workloads. Spark can make effective use of such infrastructure because it integrates with popular deep learning frameworks and supports GPU acceleration of deep learning jobs. In this talk, we share learnings and experiences from supporting different types of workloads in shared clusters equipped for deep learning as well as data engineering. We will cover the following topics:
* Considerations for sharing the infrastructure for big data and deep learning in Spark
* Deep learning in Spark in clusters with and without GPUs
* Differences between distributed data processing and distributed machine learning
* Multitenancy and isolation in shared infrastructure

Published in: Data & Analytics

  2. Kaarthik Sivashanmugam, Wee Hyong Tok (Microsoft): Infrastructure for Deep Learning in Apache Spark #UnifiedAnalytics #SparkAISummit
  3. Agenda • Evolution of data infrastructure • ML workflow: Data prep & DNN training • Intro to deep learning and computing needs • Distributed deep learning and challenges • Unified platform using Spark – Infra considerations, challenges • ML Pipelines
  4. Organization’s data • Video feeds • Call logs • Web logs • Products • Images • …… • Database / Data Warehouse
  5. Machine Learning: Typical E2E Process … • Prepare • Experiment • Deploy • Orchestrate
  6. + Machine Learning and Deep Learning workloads
  7. How long does it take to train ResNet-50 on ImageNet? • Before 2017: 14 days on an NVIDIA M40 GPU
  8. Training ResNet-50 on ImageNet • Apr 2017: 1 hour, Tesla P100 x 256 (Facebook, Caffe2) • Sept 2017: 31 mins, 1,600 CPUs (UC Berkeley, TACC, UC Davis; TensorFlow) • Nov 2017: 15 mins, Tesla P100 x 1,024 (Preferred Network, ChainerMN) • July 2018: 6.6 mins, Tesla P40 x 2,048 (Tencent, TensorFlow) • Nov 2018: 2.0 mins, Tesla V100 x 3,456 (Sony, Neural Network Library (NNL)) • Apr 2019: 1.2 mins, Tesla V100 x 2,048 (Fujitsu, MXNet)
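
To put the figures from slides 7 and 8 in proportion, the following Python sketch computes how each reported run compares with the 14-day single-GPU baseline. The pairing of dates, times and hardware is read off the slide-8 timeline, and the names `reported_runs` and `speedup_over_baseline` are illustrative. The ratios are only a rough guide, since the runs used different hardware, frameworks and training settings.

```python
# Baseline from slide 7: 14 days on a single NVIDIA M40 GPU, in minutes.
BASELINE_MINUTES = 14 * 24 * 60

# (date, hardware, reported training time in minutes) as read from slide 8.
reported_runs = [
    ("Apr 2017", "Tesla P100 x 256", 60.0),
    ("Sept 2017", "1,600 CPUs", 31.0),
    ("Nov 2017", "Tesla P100 x 1,024", 15.0),
    ("July 2018", "Tesla P40 x 2,048", 6.6),
    ("Nov 2018", "Tesla V100 x 3,456", 2.0),
    ("Apr 2019", "Tesla V100 x 2,048", 1.2),
]


def speedup_over_baseline(minutes):
    """Return how many times faster a run is than the 14-day baseline."""
    return BASELINE_MINUTES / minutes


if __name__ == "__main__":
    for when, hardware, minutes in reported_runs:
        ratio = speedup_over_baseline(minutes)
        print(f"{when:10s} {hardware:20s} {minutes:5.1f} min  ~{ratio:,.0f}x")
```

The speedup grows much more slowly than the device count, which is one reason the later slides focus on communication and interconnects.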
  9. Considerations for Deep Learning @ Scale • CPU vs. GPU • Single vs. multi-GPU • MPI vs. non-MPI • InfiniBand vs. Ethernet. Credits: Mathew Salvaris https://azure.microsoft.com/en-us/blog/gpus-vs-cpus-for-deployment-of-deep-learning-models/
  10. “Things” you need to deal with when training machine learning/deep learning models • Gather results • Secure access • Scale resources • Schedule jobs • Dependencies and containers • Provision VM clusters • Distribute data • Handling failures
  11. Machine Learning: Typical E2E Process … • Prepare • Experiment • Deploy • Orchestrate
  12. Machine Learning (ML) and Deep Learning (DL). Top figure source; bottom figure from NVIDIA
  13. Lots of ML Frameworks …. • TensorFlow • PyTorch • Scikit-Learn • MXNet • Chainer • Keras
  14. Design Choices for Big Data and Machine Learning/Deep Learning • Laptop • Spark + Separate infrastructure for ML/DL training/inference • Cloud • Spark
  15. Execution Models for Spark and Deep Learning • Spark (Task 1, Task 2, Task 3): Independent tasks; embarrassingly parallel and massively scalable • Distributed Learning (data parallelism, model parallelism): Non-independent tasks; some parallel processing; optimizing communication between nodes. Credits: Reynold Xin, Project Hydrogen – State of Art Deep Learning on Apache Spark
  16. Execution Models for Spark and Deep Learning • Spark (Task 1, Task 2, Task 3): Independent tasks; embarrassingly parallel and massively scalable • Distributed Learning (Task 1, Task 2, Task 3): Non-independent tasks; some parallel processing; optimizing communication between nodes. Credits: Reynold Xin, Project Hydrogen – State of Art Deep Learning on Apache Spark
  17. Execution Models for Spark and Deep Learning • Spark (Task 1, Task 2, Task 3): Independent tasks; embarrassingly parallel and massively scalable; re-run crashed task • Distributed Learning (Task 1, Task 2, Task 3): Non-independent tasks; some parallel processing; optimizing communication between nodes; re-run all tasks. Credits: Reynold Xin, Project Hydrogen – State of Art Deep Learning on Apache Spark
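
The difference in failure handling on slide 17 can be sketched in a few lines of Python. This is a toy model, not Spark's scheduler: the function name `tasks_to_rerun` and its parameters are hypothetical and only encode the two rules stated on the slide.

```python
def tasks_to_rerun(num_tasks, failed, independent):
    """Return the sorted task ids that must run again after failures.

    independent=True  models a regular Spark stage (re-run crashed task).
    independent=False models distributed training (re-run all tasks).
    """
    failed = set(failed)
    if any(t < 0 or t >= num_tasks for t in failed):
        raise ValueError("failed task id out of range")
    if not failed:
        return []  # nothing crashed, nothing to repeat
    if independent:
        # Tasks do not talk to each other, so only the crashed ones repeat.
        return sorted(failed)
    # Tasks exchange state during training, so one crash restarts the gang.
    return list(range(num_tasks))


if __name__ == "__main__":
    print(tasks_to_rerun(3, [1], independent=True))
    print(tasks_to_rerun(3, [1], independent=False))
```

In this model the cost of a single failure in distributed training grows with the number of tasks, which is why checkpointing and failure handling appear among the infrastructure concerns.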
  18. Spark + ML/DL • Sparkflow • TensorFlowOnSpark • Project Hydrogen • HorovodRunner. www.aka.ms/spark
  19. Microsoft Machine Learning for Apache Spark v0.16: Microsoft’s Open Source Contributions to Apache Spark • Cognitive Services • Spark Serving • Model Interpretability • LightGBM Gradient Boosting • Deep Networks with CNTK • HTTP on Spark. www.aka.ms/spark, Azure/mmlspark
  20. Demo – Azure Databricks and Deep Learning
  21. Demo – Distributed Deep Learning using TensorFlow with HorovodRunner
  22. What do you need for training / distributed training? • CPU • GPU • Network • Storage • Deep Learning Framework • Memory. Physics of Machine Learning and Deep Learning
  23. GPU Device Interconnect • NVLink • GPUDirect P2P • GPUDirect RDMA • Standard network stack. Interconnect topology sample. Credits: CUDA-MPI Blog (https://bit.ly/2KnmN58)
  24. From CUDA to NCCL1 to NCCL2 • Multi-core CPU • GPU: CUDA • Multi-GPU: NCCL 1 • Multi-GPU, multi-node: NCCL 2. Multi-GPU Communication Library. Credits: NCCL Tutorial (https://bit.ly/2KpPP44)
  25. NCCL 2.x (multi-node). Credits: NCCL Tutorial (https://bit.ly/2KpPP44)
  26. NCCL 2.x (multi-node). Credits: NCCL Tutorial (https://bit.ly/2KpPP44)
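
NCCL provides collective operations such as all-reduce, which data-parallel training uses to sum gradients across workers. Ring all-reduce is one common algorithm for that collective. The sketch below is a single-process Python simulation under simplifying assumptions (integer "gradients", no GPUs, no network); it is not NCCL's implementation, and the function name `ring_allreduce` is illustrative.

```python
def ring_allreduce(worker_values):
    """Simulate ring all-reduce: every worker ends with the element-wise sum."""
    n = len(worker_values)
    if n == 0:
        raise ValueError("need at least one worker")
    length = len(worker_values[0])
    if any(len(v) != length for v in worker_values):
        raise ValueError("all workers must hold vectors of the same length")

    data = [list(v) for v in worker_values]            # per-worker buffers
    bounds = [c * length // n for c in range(n + 1)]   # n chunk boundaries

    def chunk(c):
        return range(bounds[c], bounds[c + 1])

    # Phase 1, reduce-scatter: after n-1 steps worker r holds the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot all outgoing messages first so sends are simultaneous.
        outgoing = []
        for rank in range(n):
            c = (rank - step) % n
            outgoing.append((c, [data[rank][i] for i in chunk(c)]))
        for rank, (c, payload) in enumerate(outgoing):
            dest = (rank + 1) % n
            for i, value in zip(chunk(c), payload):
                data[dest][i] += value

    # Phase 2, all-gather: pass the completed chunks around the ring.
    for step in range(n - 1):
        outgoing = []
        for rank in range(n):
            c = (rank + 1 - step) % n
            outgoing.append((c, [data[rank][i] for i in chunk(c)]))
        for rank, (c, payload) in enumerate(outgoing):
            dest = (rank + 1) % n
            for i, value in zip(chunk(c), payload):
                data[dest][i] = value

    return data


if __name__ == "__main__":
    grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    for row in ring_allreduce(grads):
        print(row)
```

Each worker only ever talks to its ring neighbour and sends about 2(n-1)/n of its buffer in total, so the speed of the slowest link in the ring bounds the whole operation.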
  27. Spark & GPU • Options for using GPUs with Spark: 1. Native support (cluster manager, GPU tasks): SPARK-24615 2. Use cores/memory as a proxy for GPU resources and allow GPU-enabled code execution 3. Code implementation/generation for GPU offload • Considerations – Flexibility – Data management – Multi-GPU execution
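
Option 2 on slide 27 can be illustrated with a small Python helper. It is a sketch under the assumption that each task should have at most one GPU to itself: it picks `spark.task.cpus` so that the number of concurrent tasks per executor (`spark.executor.cores` divided by `spark.task.cpus`) never exceeds the GPU count. The helper names are hypothetical, and with this approach Spark itself remains unaware of GPUs, so assigning a device to each task is left to the user code.

```python
def proxy_gpu_conf(executor_cores, gpus_per_executor):
    """Build Spark settings that use CPU cores as a stand-in for GPUs."""
    if gpus_per_executor < 1:
        raise ValueError("need at least one GPU per executor")
    if executor_cores < gpus_per_executor:
        raise ValueError("need at least one core per GPU")
    # Round up so that cores // task_cpus can never exceed the GPU count.
    task_cpus = -(-executor_cores // gpus_per_executor)
    return {
        "spark.executor.cores": str(executor_cores),
        "spark.task.cpus": str(task_cpus),
    }


def concurrent_tasks(conf):
    """Task slots per executor implied by the two settings."""
    return int(conf["spark.executor.cores"]) // int(conf["spark.task.cpus"])


if __name__ == "__main__":
    conf = proxy_gpu_conf(executor_cores=12, gpus_per_executor=4)
    print(conf, "->", concurrent_tasks(conf), "concurrent tasks")
```

The trade-off is flexibility: CPU-only stages on the same executors are throttled to the same number of slots, which is one motivation for the native support tracked in SPARK-24615.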
  28. Infrastructure Considerations • Data format, storage and reuse – Co-locate Data Engineering storage infrastructure (cluster-local) – DL framework support for HDFS (reading from HDFS does not mean data-locality-aware computation) – Sharing data between Spark and Deep Learning (HDFS, Spark-TF connector, Parquet/Petastorm) • Job execution – Gang scheduling: refer to SPARK-24374 – Support for GPU (and other accelerators): refer to SPARK-24615 – Cluster sharing with other types of jobs (CPU-only cluster vs. CPU+GPU cluster) – Quota management – Support for Docker containers – MPI vs. non-MPI – Different GPU generations • Node, GPU connectivity – InfiniBand, RDMA – GPU interconnect options – Interconnect-aware scheduling, minimize distribution, repacking
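
Two of the job-execution points above, gang scheduling and minimizing distribution, can be sketched together in Python. This is an illustrative model, not the barrier execution mode proposed in SPARK-24374: the function `gang_place` and its inputs are hypothetical. A job is placed only if all of its GPUs are available at once, and the placement spans as few nodes as possible.

```python
def gang_place(free_gpus, gpus_needed):
    """All-or-nothing GPU placement.

    free_gpus maps node name -> number of free GPUs on that node.
    Returns {node: gpus_taken}, or None if the whole job cannot start now.
    """
    if gpus_needed <= 0:
        raise ValueError("gpus_needed must be positive")
    if sum(free_gpus.values()) < gpus_needed:
        return None  # gang scheduling: never start a partial job

    placement = {}
    remaining = gpus_needed
    # Fill the nodes with the most free GPUs first, so the job spans as few
    # nodes as possible and less traffic crosses the network.
    for node, free in sorted(free_gpus.items(), key=lambda kv: (-kv[1], kv[0])):
        if remaining == 0:
            break
        if free == 0:
            continue
        take = min(free, remaining)
        placement[node] = take
        remaining -= take
    return placement


if __name__ == "__main__":
    cluster = {"node-a": 2, "node-b": 4, "node-c": 1}
    print(gang_place(cluster, 5))
    print(gang_place(cluster, 8))
```

A real scheduler would also weigh quotas, GPU generations and the interconnect topology within each node, which this sketch ignores.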
  29. ML Pipelines • Using machine learning pipelines, data scientists, data engineers, and IT professionals can collaborate on different steps/phases • Enable use of the best tech for different phases in the ML/DL workflow
  30. Demo – Azure ML Pipelines & Databricks
  31. What do you need for training / distributed training? • CPU • GPU • Network • Storage • Deep Learning Framework • Memory. Physics of Machine Learning and Deep Learning
