Kaarthik Sivashanmugam, Wee Hyong Tok
Microsoft
Infrastructure for Deep Learning
in Apache Spark
#UnifiedAnalytics #SparkAISummit
Agenda
• Evolution of data infrastructure
• ML workflow: Data prep & DNN training
• Intro to deep learning and computing needs
• Distributed deep learning and challenges
• Unified platform using Spark
– Infra considerations, challenges
• ML Pipelines
Organization’s Data
(Figure: video feeds, call logs, web logs, products, images, … flowing into a database / data warehouse.)
Machine Learning: Typical E2E Process
Prepare → Experiment → Deploy → …
Orchestrate (spans all phases)
+ Machine Learning and
Deep Learning workloads
How long does it take to train ResNet-50 on ImageNet?
Training ResNet-50 on ImageNet:
• Before 2017: 14 days (NVIDIA M40 GPU)
• Apr 2017: 1 hour (Tesla P100 x 256; Facebook; Caffe2)
• Sept 2017: 31 mins (1,600 CPUs; UC Berkeley, TACC, UC Davis; TensorFlow)
• Nov 2017: 15 mins (Tesla P100 x 1,024; Preferred Networks; ChainerMN)
• July 2018: 6.6 mins (Tesla P40 x 2,048; Tencent; TensorFlow)
• Nov 2018: 2.0 mins (Tesla V100 x 3,456; Sony; Neural Network Library (NNL))
• Apr 2019: 1.2 mins (Tesla V100 x 2,048; Fujitsu; MXNet)
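The speedup in the timeline above is easier to appreciate as throughput. A back-of-envelope sketch, assuming the common 90-epoch recipe over the ~1.28M-image ImageNet training set (an assumption; the exact recipes varied by entry):

```python
# Back-of-envelope: aggregate throughput (images/sec) implied by each
# ResNet-50/ImageNet wall-clock training time, assuming 90 epochs over
# the 1.28M-image training set.
EPOCHS = 90                # assumed standard recipe
IMAGES = 1_281_167         # ImageNet-1k training set size

def implied_throughput(wall_clock_seconds):
    """Images/sec the whole cluster must sustain to finish in time."""
    return EPOCHS * IMAGES / wall_clock_seconds

before_2017 = implied_throughput(14 * 24 * 3600)   # 14 days, one M40
fujitsu_2019 = implied_throughput(1.2 * 60)        # 1.2 minutes, 2,048 V100s

print(round(before_2017))    # ~95 images/sec
print(round(fujitsu_2019))   # ~1.6 million images/sec
```

A four-orders-of-magnitude jump in sustained throughput, which is why interconnects and communication libraries dominate the rest of this deck.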
Considerations for Deep Learning @ Scale
• CPU vs. GPU
• Single vs. multi-GPU
• MPI vs. non-MPI
• InfiniBand vs. Ethernet
Credits: Mathew Salvaris
https://azure.microsoft.com/en-us/blog/gpus-vs-cpus-for-deployment-of-deep-learning-models/
“Things” you need to deal with when training machine learning/deep learning models:
• Gather results
• Secure access
• Scale resources
• Schedule jobs
• Dependencies and containers
• Provision VM clusters
• Distribute data
• Handle failures
Machine Learning: Typical E2E Process
Prepare → Experiment → Deploy → …
Orchestrate (spans all phases)
Machine Learning and Deep Learning
(Figures contrasting ML and DL; bottom figure from NVIDIA.)
Lots of ML frameworks: TensorFlow, PyTorch, Scikit-Learn, MXNet, Chainer, Keras, …
Design Choices for Big Data and Machine Learning/Deep Learning
• Laptop
• Spark
• Spark + separate infrastructure for ML/DL training/inference
• Cloud
Execution Models for Spark and Deep Learning

Spark (Task 1, Task 2, Task 3):
• Independent tasks
• Embarrassingly parallel and massively scalable

Distributed Learning (data parallelism / model parallelism):
• Non-independent tasks
• Some parallel processing
• Optimizing communication between nodes

Credits – Reynold Xin, Project Hydrogen – State of Art Deep Learning on Apache Spark
Execution Models for Spark and Deep Learning

Spark:
• Independent tasks
• Embarrassingly parallel and massively scalable
• On failure: re-run the crashed task

Distributed Learning:
• Non-independent tasks
• Some parallel processing
• Optimizing communication between nodes
• On failure: re-run all tasks

Credits – Reynold Xin, Project Hydrogen – State of Art Deep Learning on Apache Spark
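Why distributed learning tasks are non-independent can be sketched in a few lines of plain Python (illustrative only, not any framework's API): each "task" computes a gradient on its own data shard, then all tasks synchronize by averaging gradients before the next step, so one crashed task stalls every other task.

```python
# Data-parallel training sketch: fit y = w*x by gradient descent on
# squared error, with the data split across two "workers".

def local_gradient(w, shard):
    """Gradient of mean squared error over one worker's data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    """Stand-in for the synchronization step (e.g. an MPI/NCCL allreduce)."""
    return sum(values) / len(values)

data = [(x, 3.0 * x) for x in range(1, 9)]      # ground truth: w = 3
shards = [data[0:4], data[4:8]]                 # 2 workers, 4 examples each

w, lr = 0.0, 0.01
for step in range(200):
    grads = [local_gradient(w, s) for s in shards]  # parallel, per task
    w -= lr * allreduce_mean(grads)                 # synchronized update

print(round(w, 3))  # converges near 3.0
```

The `allreduce_mean` call is the barrier: no worker can start step N+1 until every worker has contributed its step-N gradient, which is exactly why Spark's independent-task recovery model does not fit without gang scheduling.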
Spark + ML/DL
• Sparkflow
• TensorFlowOnSpark
• Project Hydrogen
• HorovodRunner
www.aka.ms/spark
Microsoft Machine Learning for Apache Spark v0.16
Microsoft’s Open Source Contributions to Apache Spark (www.aka.ms/spark, Azure/mmlspark)
• Cognitive Services
• Spark Serving
• Model Interpretability
• LightGBM gradient boosting
• Deep networks with CNTK
• HTTP on Spark
Demo - Azure Databricks
and Deep Learning
Demo – Distributed Deep Learning using TensorFlow with HorovodRunner
What do you need for training / distributed training?
• CPU
• GPU
• Network
• Storage
• Deep learning framework
• Memory
Physics of Machine Learning and Deep Learning
GPU Device Interconnect
• NVLink
• GPUDirect P2P
• GPUDirect RDMA
Interconnect topology sample
Credits: CUDA-MPI Blog (https://bit.ly/2KnmN58)
From CUDA to NCCL 1 to NCCL 2 (multi-GPU communication library)
• CUDA: multi-core CPU + GPU
• NCCL 1: multi-GPU (single node)
• NCCL 2: multi-GPU, multi-node
Credits: NCCL Tutorial (https://bit.ly/2KpPP44)
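The core primitive NCCL provides is allreduce, typically implemented as a ring. A simplified single-process Python model of the ring pattern (one chunk per worker, i.e. vector length equals worker count; real NCCL pipelines many chunks and overlaps communication with compute):

```python
# Simplified model of ring allreduce, the pattern NCCL uses for
# multi-GPU / multi-node gradient exchange.

def ring_allreduce(worker_values):
    """All workers end up with the elementwise sum of all input vectors."""
    n = len(worker_values)
    data = [list(v) for v in worker_values]  # data[worker][chunk]

    # Phase 1: reduce-scatter. At each step, worker i passes chunk
    # (i - step) % n to its right neighbor, which accumulates it. After
    # n-1 steps, worker i owns the fully reduced chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for i, c, val in sends:                 # all sends happen "at once"
            data[(i + 1) % n][c] += val

    # Phase 2: allgather. The reduced chunks travel around the ring once
    # more, so every worker ends up with every fully reduced chunk.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for i, c, val in sends:
            data[(i + 1) % n][c] = val
    return data

print(ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
# every worker holds [12, 15, 18]
```

Each worker sends and receives only 2·(n-1)/n of the data regardless of worker count, which is what makes the ring bandwidth-optimal; the interconnect topics above (NVLink, GPUDirect, InfiniBand) determine how fast each hop of the ring runs.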
NCCL 2.x (multi-node)
Credits: NCCL Tutorial (https://bit.ly/2KpPP44)
Spark & GPU
• Using GPU with Spark options:
1. Native support (cluster manager, GPU tasks): SPARK-24615
2. Use cores/memory as proxy for GPU resources and
allow GPU-enabled code execution
3. Code implementation/generation for GPU offload
• Considerations
– Flexibility
– Data management
– Multi-GPU execution
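For option 2, a common workaround before native accelerator support (SPARK-24615) was to size tasks so each task owns a whole executor, and with it that executor's GPUs. A sketch of the relevant Spark settings, with illustrative values:

```
# spark-defaults.conf (sketch; values are illustrative)
spark.executor.instances   4
spark.executor.cores       4
spark.task.cpus            4   # == executor.cores -> exactly 1 task per
                               # executor, so that task gets exclusive use
                               # of the executor's attached GPU(s)
```

The cost of this proxy approach is flexibility: CPU-only stages scheduled on the same cluster waste the reserved cores, one reason native GPU-aware scheduling was proposed.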
Infrastructure Considerations
• Data format, storage and reuse
– Co-locate Data Engineering storage infrastructure (cluster-local)
– DL Framework support for HDFS (reading from HDFS does not mean data-locality-aware computation)
– Sharing data between Spark and Deep Learning (HDFS, Spark-TF connector, Parquet/Petastorm)
• Job execution
– Gang scheduling – Refer to SPARK-24374
– Support for GPU (and other accelerators) – Refer to SPARK-24615
– Cluster sharing with other types of jobs (CPU-only cluster vs. CPU+GPU cluster)
– Quota management
– Support for Docker containers
– MPI vs. non-MPI
– Different GPU generations
• Node, GPU connectivity
– InfiniBand, RDMA
– GPU Interconnect options
– Interconnect-aware scheduling, minimize distribution, repacking
ML Pipelines
• Machine learning pipelines let data scientists, data engineers, and IT professionals collaborate on different steps/phases
• They enable use of the best technology for each phase of the ML/DL workflow
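The pipeline idea can be sketched in plain Python (illustrative only; real orchestrators such as Azure ML Pipelines add per-step compute targets, caching, and scheduling, and the step functions here are hypothetical):

```python
# Minimal sketch of an ML pipeline: each phase is an independent step,
# potentially owned by a different role and run on different infrastructure,
# chained so each step's output feeds the next.

def prepare(raw):                      # data engineer: Spark-style data prep
    return [x for x in raw if x is not None]

def experiment(clean):                 # data scientist: train a toy "model"
    mean = sum(clean) / len(clean)
    return {"threshold": mean}

def deploy(model):                     # IT pro: package the model for serving
    return lambda x: x > model["threshold"]

steps = [prepare, experiment, deploy]  # Prepare -> Experiment -> Deploy
artifact = [3, None, 5, 7, None, 1]    # raw input
for step in steps:                     # the "Orchestrate" role
    artifact = step(artifact)

print(artifact(6))  # True: 6 > mean(3, 5, 7, 1) = 4
```

Because each step only sees its predecessor's artifact, the prep step can run on a Spark cluster while the training step runs on a GPU cluster, which is the "best tech per phase" point above.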
Demo – Azure ML
Pipelines & Databricks
What do you need for training / distributed training?
• CPU
• GPU
• Network
• Storage
• Deep learning framework
• Memory
Physics of Machine Learning and Deep Learning
Kaarthik Sivashanmugam, Wee Hyong Tok
Microsoft
Infrastructure for Deep Learning
in Apache Spark
#UnifiedAnalytics #SparkAISummit
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT

Spark summit 2019 infrastructure for deep learning in apache spark 0425
