Kaarthik Sivashanmugam, Wee Hyong Tok
Microsoft
Infrastructure for Deep Learning
in Apache Spark
#UnifiedAnalytics #SparkAISummit
Agenda
• Evolution of data infrastructure
• ML workflow: Data prep & DNN training
• Intro to deep learning and computing needs
• Distributed deep learning and challenges
• Unified platform using Spark
– Infra considerations, challenges
• ML Pipelines
Organization’s Data
(Figure: video feeds, call logs, web logs, products, images, … flowing into a database / data warehouse.)
Machine Learning: Typical E2E Process
Prepare → Experiment → Deploy → …
Orchestrate (spans all phases)
+ Machine Learning and
Deep Learning workloads
How long does it take to train ResNet-50 on ImageNet?
Training ResNet-50 on ImageNet:
• Before 2017: 14 days (NVIDIA M40 GPU)
• Apr 2017: 1 hour (Tesla P100 x 256; Facebook; Caffe2)
• Sept 2017: 31 mins (1,600 CPUs; UC Berkeley, TACC, UC Davis; TensorFlow)
• Nov 2017: 15 mins (Tesla P100 x 1,024; Preferred Networks; ChainerMN)
• July 2018: 6.6 mins (Tesla P40 x 2,048; Tencent; TensorFlow)
• Nov 2018: 2.0 mins (Tesla V100 x 3,456; Sony; Neural Network Library (NNL))
• Apr 2019: 1.2 mins (Tesla V100 x 2,048; Fujitsu; MXNet)
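The speedup in the timeline above is easier to appreciate as throughput. A back-of-envelope sketch, assuming the common 90-epoch recipe over the ~1.28M-image ImageNet training set (an assumption; the exact recipes varied by entry):

```python
# Back-of-envelope: aggregate throughput (images/sec) implied by each
# ResNet-50/ImageNet wall-clock training time, assuming 90 epochs over
# the 1.28M-image training set.
EPOCHS = 90                # assumed standard recipe
IMAGES = 1_281_167         # ImageNet-1k training set size

def implied_throughput(wall_clock_seconds):
    """Images/sec the whole cluster must sustain to finish in time."""
    return EPOCHS * IMAGES / wall_clock_seconds

before_2017 = implied_throughput(14 * 24 * 3600)   # 14 days, one M40
fujitsu_2019 = implied_throughput(1.2 * 60)        # 1.2 minutes, 2,048 V100s

print(round(before_2017))    # ~95 images/sec
print(round(fujitsu_2019))   # ~1.6 million images/sec
```

A four-orders-of-magnitude jump in sustained throughput, which is why interconnects and communication libraries dominate the rest of this deck.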
Considerations for Deep Learning @ Scale
• CPU vs. GPU
• Single vs. multi-GPU
• MPI vs. non-MPI
• InfiniBand vs. Ethernet
Credits: Mathew Salvaris
https://azure.microsoft.com/en-us/blog/gpus-vs-cpus-for-deployment-of-deep-learning-models/
“Things” you need to deal with when training machine learning/deep learning models:
• Gather results
• Secure access
• Scale resources
• Schedule jobs
• Dependencies and containers
• Provision VM clusters
• Distribute data
• Handle failures
Machine Learning: Typical E2E Process
Prepare → Experiment → Deploy → …
Orchestrate (spans all phases)
Machine Learning and Deep Learning
(Figures contrasting ML and DL; bottom figure from NVIDIA.)
Lots of ML frameworks: TensorFlow, PyTorch, Scikit-Learn, MXNet, Chainer, Keras, …
Design Choices for Big Data and Machine Learning/Deep Learning
• Laptop
• Spark
• Spark + separate infrastructure for ML/DL training/inference
• Cloud
Execution Models for Spark and Deep Learning

Spark (Task 1, Task 2, Task 3):
• Independent tasks
• Embarrassingly parallel and massively scalable

Distributed Learning (data parallelism / model parallelism):
• Non-independent tasks
• Some parallel processing
• Optimizing communication between nodes

Credits – Reynold Xin, Project Hydrogen – State of Art Deep Learning on Apache Spark
Execution Models for Spark and Deep Learning

Spark:
• Independent tasks
• Embarrassingly parallel and massively scalable
• On failure: re-run the crashed task

Distributed Learning:
• Non-independent tasks
• Some parallel processing
• Optimizing communication between nodes
• On failure: re-run all tasks

Credits – Reynold Xin, Project Hydrogen – State of Art Deep Learning on Apache Spark
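Why distributed learning tasks are non-independent can be sketched in a few lines of plain Python (illustrative only, not any framework's API): each "task" computes a gradient on its own data shard, then all tasks synchronize by averaging gradients before the next step, so one crashed task stalls every other task.

```python
# Data-parallel training sketch: fit y = w*x by gradient descent on
# squared error, with the data split across two "workers".

def local_gradient(w, shard):
    """Gradient of mean squared error over one worker's data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    """Stand-in for the synchronization step (e.g. an MPI/NCCL allreduce)."""
    return sum(values) / len(values)

data = [(x, 3.0 * x) for x in range(1, 9)]      # ground truth: w = 3
shards = [data[0:4], data[4:8]]                 # 2 workers, 4 examples each

w, lr = 0.0, 0.01
for step in range(200):
    grads = [local_gradient(w, s) for s in shards]  # parallel, per task
    w -= lr * allreduce_mean(grads)                 # synchronized update

print(round(w, 3))  # converges near 3.0
```

The `allreduce_mean` call is the barrier: no worker can start step N+1 until every worker has contributed its step-N gradient, which is exactly why Spark's independent-task recovery model does not fit without gang scheduling.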
Spark + ML/DL
• Sparkflow
• TensorFlowOnSpark
• Project Hydrogen
• HorovodRunner
www.aka.ms/spark
Microsoft Machine Learning for Apache Spark v0.16
Microsoft’s Open Source Contributions to Apache Spark (www.aka.ms/spark, Azure/mmlspark)
• Cognitive Services
• Spark Serving
• Model Interpretability
• LightGBM gradient boosting
• Deep networks with CNTK
• HTTP on Spark
Demo - Azure Databricks
and Deep Learning
Demo – Distributed Deep Learning using TensorFlow with HorovodRunner
What do you need for training / distributed training?
• CPU
• GPU
• Network
• Storage
• Deep learning framework
• Memory
Physics of Machine Learning and Deep Learning
GPU Device Interconnect
• NVLink
• GPUDirect P2P
• GPUDirect RDMA
Interconnect topology sample
Credits: CUDA-MPI Blog (https://bit.ly/2KnmN58)
From CUDA to NCCL 1 to NCCL 2 (multi-GPU communication library)
• CUDA: multi-core CPU + GPU
• NCCL 1: multi-GPU (single node)
• NCCL 2: multi-GPU, multi-node
Credits: NCCL Tutorial (https://bit.ly/2KpPP44)
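The core primitive NCCL provides is allreduce, typically implemented as a ring. A simplified single-process Python model of the ring pattern (one chunk per worker, i.e. vector length equals worker count; real NCCL pipelines many chunks and overlaps communication with compute):

```python
# Simplified model of ring allreduce, the pattern NCCL uses for
# multi-GPU / multi-node gradient exchange.

def ring_allreduce(worker_values):
    """All workers end up with the elementwise sum of all input vectors."""
    n = len(worker_values)
    data = [list(v) for v in worker_values]  # data[worker][chunk]

    # Phase 1: reduce-scatter. At each step, worker i passes chunk
    # (i - step) % n to its right neighbor, which accumulates it. After
    # n-1 steps, worker i owns the fully reduced chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for i, c, val in sends:                 # all sends happen "at once"
            data[(i + 1) % n][c] += val

    # Phase 2: allgather. The reduced chunks travel around the ring once
    # more, so every worker ends up with every fully reduced chunk.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for i, c, val in sends:
            data[(i + 1) % n][c] = val
    return data

print(ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
# every worker holds [12, 15, 18]
```

Each worker sends and receives only 2·(n-1)/n of the data regardless of worker count, which is what makes the ring bandwidth-optimal; the interconnect topics above (NVLink, GPUDirect, InfiniBand) determine how fast each hop of the ring runs.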
NCCL 2.x (multi-node)
Credits: NCCL Tutorial (https://bit.ly/2KpPP44)
Spark & GPU
• Using GPU with Spark options:
1. Native support (cluster manager, GPU tasks): SPARK-24615
2. Use cores/memory as proxy for GPU resources and
allow GPU-enabled code execution
3. Code implementation/generation for GPU offload
• Considerations
– Flexibility
– Data management
– Multi-GPU execution
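For option 2, a common workaround before native accelerator support (SPARK-24615) was to size tasks so each task owns a whole executor, and with it that executor's GPUs. A sketch of the relevant Spark settings, with illustrative values:

```
# spark-defaults.conf (sketch; values are illustrative)
spark.executor.instances   4
spark.executor.cores       4
spark.task.cpus            4   # == executor.cores -> exactly 1 task per
                               # executor, so that task gets exclusive use
                               # of the executor's attached GPU(s)
```

The cost of this proxy approach is flexibility: CPU-only stages scheduled on the same cluster waste the reserved cores, one reason native GPU-aware scheduling was proposed.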
Infrastructure Considerations
• Data format, storage and reuse
– Co-locate Data Engineering storage infrastructure (cluster-local)
– DL Framework support for HDFS (reading from HDFS does not mean data-locality-aware computation)
– Sharing data between Spark and Deep Learning (HDFS, Spark-TF connector, Parquet/Petastorm)
• Job execution
– Gang scheduling – Refer to SPARK-24374
– Support for GPU (and other accelerators) – Refer to SPARK-24615
– Cluster sharing with other types of jobs (CPU-only cluster vs. CPU+GPU cluster)
– Quota management
– Support for Docker containers
– MPI vs. non-MPI
– Different GPU generations
• Node, GPU connectivity
– InfiniBand, RDMA
– GPU Interconnect options
– Interconnect-aware scheduling, minimize distribution, repacking
ML Pipelines
• Machine learning pipelines let data scientists, data engineers, and IT professionals collaborate on different steps/phases
• They enable use of the best technology for each phase of the ML/DL workflow
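The pipeline idea can be sketched in plain Python (illustrative only; real orchestrators such as Azure ML Pipelines add per-step compute targets, caching, and scheduling, and the step functions here are hypothetical):

```python
# Minimal sketch of an ML pipeline: each phase is an independent step,
# potentially owned by a different role and run on different infrastructure,
# chained so each step's output feeds the next.

def prepare(raw):                      # data engineer: Spark-style data prep
    return [x for x in raw if x is not None]

def experiment(clean):                 # data scientist: train a toy "model"
    mean = sum(clean) / len(clean)
    return {"threshold": mean}

def deploy(model):                     # IT pro: package the model for serving
    return lambda x: x > model["threshold"]

steps = [prepare, experiment, deploy]  # Prepare -> Experiment -> Deploy
artifact = [3, None, 5, 7, None, 1]    # raw input
for step in steps:                     # the "Orchestrate" role
    artifact = step(artifact)

print(artifact(6))  # True: 6 > mean(3, 5, 7, 1) = 4
```

Because each step only sees its predecessor's artifact, the prep step can run on a Spark cluster while the training step runs on a GPU cluster, which is the "best tech per phase" point above.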
Demo – Azure ML
Pipelines & Databricks
What do you need for training / distributed training?
• CPU
• GPU
• Network
• Storage
• Deep learning framework
• Memory
Physics of Machine Learning and Deep Learning
Kaarthik Sivashanmugam, Wee Hyong Tok
Microsoft
Infrastructure for Deep Learning
in Apache Spark
#UnifiedAnalytics #SparkAISummit
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT

Spark summit 2019 infrastructure for deep learning in apache spark 0425
