How to Achieve High-Performance, Scalable and Distributed Training on Modern HPC Systems?
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
Talk at HPCAC-AI Stanford Conference (April ’20)
Follow us on https://twitter.com/mvapich
Understanding the Deep Learning Resurgence
Adopted from: http://www.deeplearningbook.org/contents/intro.html
• Deep Learning (DL) is a sub-set of Machine Learning (ML)
– Perhaps, the most revolutionary subset!
– Feature extraction vs. hand-crafted features
• Deep Learning
– A renewed interest and a lot of hype!
– Key success: Deep Neural Networks (DNNs)
– Everything was there since the late 80s except the “computability of DNNs”
[Figure: nested circles – AI ⊃ Machine Learning (e.g., Logistic Regression) ⊃ Deep Learning (e.g., MLPs, DNNs)]
• Modern and efficient hardware enabled
– Computability of DNNs – impossible in the past!
– GPUs – at the core of DNN training
– CPUs – catching up fast
• Availability of Datasets
– MNIST, CIFAR10, ImageNet, and more…
• Excellent Accuracy for many application areas
– Vision, Machine Translation, and several others...
Deep Learning in the Many-core Era
Courtesy: A. Canziani et al., “An Analysis of Deep Neural Network Models for Practical Applications”, CoRR, 2016.
[Chart: minutes to train AlexNet – 2x GTX 580 vs. DGX-2 – ~500X faster in 5 years]
[Figure: convergence of Big Data (Hadoop, Spark, HBase, Memcached, etc.), Deep Learning (Caffe, TensorFlow, BigDL, etc.), and HPC (MPI, RDMA, Lustre, etc.)]
Increasing Usage of HPC, Big Data and Deep Learning
Convergence of HPC, Big Data, and Deep Learning!
Increasing Need to Run these applications on the Cloud!!
Drivers of Modern HPC Cluster Architectures
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
• Available on HPC Clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc.
[Figure: building blocks of modern HPC clusters]
– Multi-core Processors
– Accelerators / Coprocessors: high compute density, high performance/watt, >1 TFlop DP on a chip
– High Performance Interconnects (InfiniBand): <1 usec latency, 200 Gbps bandwidth
– SSD, NVMe-SSD, NVRAM
– Example systems: K Computer, Sunway TaihuLight, Summit, Sierra
• Deep Learning has two major tasks
1. Training of the Deep Neural Network
2. Inference (or deployment) that uses a trained DNN
• DNN Training
– Training is a compute/communication intensive process – it can take days to weeks
– Faster training is necessary!
• Faster training can be achieved by
– Using newer and faster hardware – but there is a limit!
– Can we use more GPUs or nodes?
• The need for Parallel and Distributed Training
Key Phases of Deep Learning
• Scale-up: Intra-node Communication
– Many improvements like:
• NVIDIA cuDNN, cuBLAS, NCCL, etc.
• CUDA 9 Co-operative Groups
• Scale-out: Inter-node Communication
– DL Frameworks – most are optimized for single-node only
– Distributed (Parallel) Training is an emerging trend
• OSU-Caffe – MPI-based
• Microsoft CNTK – MPI/NCCL2
• Google TensorFlow – gRPC-based/MPI/NCCL2
• Facebook Caffe2 – Hybrid (NCCL2/Gloo/MPI)
Scale-up and Scale-out
[Figure: scale-up performance (y-axis) vs. scale-out performance (x-axis) for cuDNN, MKL-DNN, NCCL2, MPI, gRPC, and Hadoop – the desired solution is strong on both axes]
Holistic Evaluation is Important!!
• My framework is faster than your framework!
• This needs to be understood in a holistic way.
• Performance depends on the entire execution environment (the full stack)
• Isolated view of performance is not helpful
A. A. Awan, H. Subramoni, and Dhabaleswar K. Panda. “An In-depth Performance Characterization of CPU- and GPU-based DNN Training on
Modern Architectures”, In Proceedings of the Machine Learning on HPC Environments (MLHPC'17). ACM, New York, NY, USA, Article 8.
How to efficiently scale-out a
Deep Learning (DL) framework and take
advantage of heterogeneous
High Performance Computing (HPC)
resources?
Broad Challenge: Exploiting HPC for Deep Learning
1. What are the fundamental
issues in designing DL
frameworks?
– Memory Requirements
– Computation
Requirements
– Communication Overhead
2. Why do we need to support
distributed training?
– To overcome the limits of
single-node training
– To better utilize hundreds
of existing HPC Clusters
Research Challenges to Exploit HPC Technologies
[Figure: Deep Learning and Machine Learning frameworks (Caffe/OSU-Caffe, Caffe2, TensorFlow, MXNet, CNTK) with their major computation and communication phases (forward, backward, gradient aggregation, model propagation), layered over communication runtimes to support distributed training and HPC platforms (CPU, GPU, InfiniBand); markers (1) and (2) correspond to the challenges above]
3. What are the new design challenges
brought forward by DL frameworks for
Communication runtimes?
– Large Message Collective
Communication and Reductions
– GPU Buffers (CUDA-Awareness)
4. Can a Co-design approach help in
achieving Scale-up and Scale-out efficiently?
– Co-Design the support at Runtime
level and Exploit it at the DL
Framework level
– What performance benefits can
be observed?
– What needs to be fixed at the
communication runtime layer?
Research Challenges to Exploit HPC Technologies (Cont’d)
[Figure: the same stack – DL/ML frameworks (Caffe/OSU-Caffe, Caffe2, TensorFlow, MXNet, CNTK) with forward, backward, gradient aggregation, and model propagation phases over communication runtimes (MPI/NCCL/Gloo/MLSL) and HPC platforms (CPU, GPU, InfiniBand) – annotated with (3) new runtime requirements such as large-message collectives, point-to-point operations, and CUDA-awareness for GPU buffers, and (4) co-design opportunities]
• MPI-driven Deep Learning
– CPU-based Deep Learning
– GPU-based Deep Learning
• Co-designing Deep Learning Stacks with High-Performance MPI
• Out-of-core DNN training
• Exploiting Hybrid (Data and Model) Parallelism
• Use-Case: AI-Driven Digital Pathology
Multiple Approaches taken up by OSU
Data Parallel Deep Learning and MPI Collectives
[Figure: data-parallel training loop across GPUs 0–3 – parameters are broadcast from GPU 0 with MPI_Bcast (data propagation), each GPU runs forward/backward passes over layers L1..Ln, gradients are packed into reduce buffers and combined on GPU 0 with MPI_Reduce (gradient aggregation), and updates are applied before the next iteration]
• Major MPI collectives involved in designing distributed frameworks:
• MPI_Bcast – required for DNN parameter exchange
• MPI_Reduce – needed for gradient accumulation from multiple solvers
• MPI_Allreduce – use just one Allreduce instead of Reduce and Broadcast (a minimal sketch of this pattern follows the citation below)
A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU
Clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)
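To make the communication pattern concrete, here is a minimal, hedged sketch of a data-parallel training step using mpi4py. It is not the OSU-Caffe/S-Caffe implementation; the flattened parameter array and the stand-in gradient routine are placeholders, and a production design would overlap these collectives with computation.

```python
# Minimal data-parallel training sketch with mpi4py (illustration only, not
# the OSU-Caffe/S-Caffe code). Run with e.g.: mpirun -np 4 python train_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

num_steps, learning_rate = 10, 0.01
params = np.zeros(1_000_000, dtype=np.float32)      # flattened model parameters
if rank == 0:
    params[:] = np.random.rand(params.size)          # root initializes the model

# Parameter exchange: every rank starts from the same model (MPI_Bcast).
comm.Bcast(params, root=0)

def compute_gradients(p):
    # Stand-in for a real forward/backward pass on this rank's mini-batch.
    return np.random.rand(p.size).astype(np.float32)

for step in range(num_steps):
    local_grads = compute_gradients(params)

    # Gradient aggregation: a single Allreduce replaces Reduce + Bcast.
    global_grads = np.empty_like(local_grads)
    comm.Allreduce(local_grads, global_grads, op=MPI.SUM)

    params -= learning_rate * (global_grads / size)  # apply the averaged update
```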
Overview of the MVAPICH2 Project
• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002
– MVAPICH2-X (MPI + PGAS), Available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Support for Virtualization (MVAPICH2-Virt), Available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 3,075 organizations in 89 countries
– More than 732,000 (> 0.7 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (November ‘19 ranking)
• 3rd ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
• 5th, 448,448 cores (Frontera) at TACC
• 8th, 391,680 cores (ABCI) in Japan
• 14th, 570,020 cores (Nurion) in South Korea and many others
– Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, OpenHPC, and Spack)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
• Partner in the 5th ranked TACC Frontera system
[Chart: cumulative number of MVAPICH2 downloads from Sep 2004 to Sep 2019, growing to more than 700,000, annotated with release milestones from MV 0.9.4 and MV2 0.9.0 through MV2 2.3.3, MV2-X 2.3rc3, MV2-GDR 2.3.2, MV2-Virt 2.2, MV2-Azure 2.3.2, MV2-AWS 2.3, and OSU INAM 0.9.5]
MVAPICH2 Release Timeline and Downloads
Architecture of MVAPICH2 Software Family (HPC and DL)
High Performance Parallel Programming Models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)
High Performance and Scalable Communication Runtime with diverse APIs and mechanisms: point-to-point primitives, collectives algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis
Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter) and Modern Multi-/Many-core Architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)
– Transport Protocols: RC, SRD, UD, DC
– Modern Features: UMR, ODP, SR-IOV, Multi-Rail, Optane*, NVLink, CAPI* (* upcoming)
– Transport Mechanisms: Shared Memory, CMA, IVSHMEM, XPMEM
• gRPC
– Officially available and supported
– Open-source – can be enhanced by others
– Accelerated gRPC (add RDMA to gRPC)
• gRPC+X
– Use gRPC for bootstrap and rendezvous
– Actual communication is in “X”
– X = MPI, Verbs, GPUDirect RDMA (GDR), etc.
• No-gRPC
– Baidu – the first one to use MPI Collectives for TF
– Horovod – use NCCL, or MPI, or any other future library (e.g., IBM DDL support recently added); a hedged Horovod sketch follows the citation below
Data Parallel Training with TensorFlow
*Awan et al., “Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation ”, CCGrid ’19.
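Below is a minimal, hedged sketch of the “No-gRPC” Horovod path with tf.keras. It is not the exact benchmark code behind these results; the synthetic dataset, batch size, and learning rate are placeholders chosen only to keep the sketch runnable.

```python
# Hedged sketch of data-parallel TensorFlow training with Horovod over MPI/NCCL.
# Launch with, e.g.: horovodrun -np 8 python tf_hvd_sketch.py
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                                    # one Horovod/MPI rank per GPU
gpus = tf.config.list_physical_devices('GPU')
if gpus:                                      # pin each rank to its local GPU
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.applications.ResNet50(weights=None, classes=1000)
opt = tf.keras.optimizers.SGD(learning_rate=0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)           # Allreduce of gradients each step

model.compile(loss='sparse_categorical_crossentropy', optimizer=opt)

# Synthetic data stands in for ImageNet here (assumption for a runnable sketch).
ds = tf.data.Dataset.from_tensor_slices(
        (tf.random.uniform([64, 224, 224, 3]),
         tf.random.uniform([64], maxval=1000, dtype=tf.int64))
     ).batch(8).repeat()

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]  # rank-0 weights
model.fit(ds, steps_per_epoch=10, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```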
MVAPICH2 (MPI)-driven Infrastructure for ML/DL Training
[Figure: ML/DL applications (TensorFlow, PyTorch, MXNet) run over Horovod, which uses MVAPICH2-X for CPU-based training and MVAPICH2-GDR for GPU-based training]
MVAPICH2 Software Family (CPU-Based Deep Learning)
High-Performance Parallel Programming Libraries
– MVAPICH2: support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
– MVAPICH2-X: advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with unified communication runtime
– MVAPICH2-GDR: optimized MPI for clusters with NVIDIA GPUs and for GPU-enabled Deep Learning applications
– MVAPICH2-Virt: high-performance and scalable MPI for hypervisor- and container-based HPC cloud
– MVAPICH2-EA: energy-aware and high-performance MPI
– MVAPICH2-MIC: optimized MPI for clusters with Intel KNC
Microbenchmarks
– OMB: microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs
Tools
– OSU INAM: network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
– OEMT: utility to measure the energy consumption of MPI applications
Performance of CNTK with MVAPICH2-X on CPU-based Deep Learning
[Chart: CNTK AlexNet training execution time (s) vs. number of processes (28, 56, 112, 224; batch size = default, iterations = 50, ppn = 28) for Intel MPI, MVAPICH2, and MVAPICH2-XPMEM – up to 20% and 9% improvements]
• CPU-based training of AlexNet neural network using ImageNet ILSVRC2012 dataset
• Advanced XPMEM-based designs show up to 20% benefits over Intel MPI (IMPI) for CNTK DNN training using All_Reduce
• The proposed designs show good scalability with increasing system size
Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, J. Hashmi, S. Chakraborty, M. Bayatpour, H.
Subramoni, and DK Panda, 32nd IEEE International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018
Available since MVAPICH2-X 2.3rc1 release
Distributed TensorFlow on TACC Frontera (2,048 CPU nodes)
• Scaled TensorFlow to 2,048 nodes on Frontera using MVAPICH2 and Intel MPI
• MVAPICH2 and Intel MPI give similar performance for DNN training
• Report a peak of 260,000 images/sec on 2,048 nodes
• On 2,048 nodes, ResNet-50 can be trained in 7 minutes! (a back-of-the-envelope check follows the citation below)
A. Jain, A. A. Awan, H. Subramoni, DK Panda, “Scaling TensorFlow, PyTorch, and MXNet using
MVAPICH2 for High-Performance Deep Learning on Frontera”, DLS ’19 (SC ’19 Workshop).
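As a sanity check (our arithmetic, not from the slides): at the reported peak throughput, one pass over ImageNet-1k and a standard 90-epoch run work out to roughly the quoted 7 minutes.

```python
# Back-of-the-envelope check of the "7 minutes" claim (assumes the standard
# 90-epoch ResNet-50 recipe and the 1,281,167-image ImageNet-1k training set).
images_per_epoch = 1_281_167
throughput = 260_000            # images/sec reported on 2,048 Frontera nodes
epochs = 90

seconds_per_epoch = images_per_epoch / throughput      # ~4.9 s
total_minutes = epochs * seconds_per_epoch / 60        # ~7.4 minutes
print(f"{seconds_per_epoch:.1f} s/epoch, {total_minutes:.1f} min for {epochs} epochs")
```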
Scaling DL Frameworks using MVAPICH2-X on TACC Frontera
• On a single node, TensorFlow (TF) is 8% better than MXNet
• TF (tf_cnn_benchmark) is 2.13x better than PyTorch
• TensorFlow is 1.7x better than MXNet
• TF (Keras) gives better performance compared to PyTorch and MXNet.
A. Jain, A. A. Awan, H. Subramoni, DK Panda, “Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera”, DLS ’19 (SC ’19
Workshop).
MVAPICH2 Software Family (GPU-Based Deep Learning)
High-Performance Parallel Programming Libraries
– MVAPICH2: support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
– MVAPICH2-X: advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with unified communication runtime
– MVAPICH2-GDR: optimized MPI for clusters with NVIDIA GPUs and for GPU-enabled Deep Learning applications
– MVAPICH2-Virt: high-performance and scalable MPI for hypervisor- and container-based HPC cloud
– MVAPICH2-EA: energy-aware and high-performance MPI
– MVAPICH2-MIC: optimized MPI for clusters with Intel KNC
Microbenchmarks
– OMB: microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs
Tools
– OSU INAM: network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
– OEMT: utility to measure the energy consumption of MPI applications
At Sender: MPI_Send(s_devbuf, size, …);
At Receiver: MPI_Recv(r_devbuf, size, …);
(GPU-buffer handling happens inside MVAPICH2)
• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers
High Performance and High Productivity
GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
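For illustration, here is a hedged sketch of the same idea from Python using mpi4py with CuPy arrays; it assumes a CUDA-aware MPI build (such as MVAPICH2-GDR) underneath, and the buffer size is arbitrary.

```python
# Hedged sketch: passing GPU device buffers directly to MPI (CUDA-aware MPI).
# Assumes mpi4py is built against a CUDA-aware MPI such as MVAPICH2-GDR.
# Run with, e.g.: mpirun -np 2 python cuda_aware_sketch.py
import cupy as cp
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n = 4 * 1024 * 1024                             # 4M floats in GPU memory
if rank == 0:
    s_devbuf = cp.arange(n, dtype=cp.float32)   # device buffer on GPU 0
    comm.Send(s_devbuf, dest=1, tag=77)         # MPI reads directly from GPU memory
elif rank == 1:
    r_devbuf = cp.empty(n, dtype=cp.float32)    # device buffer on GPU 1
    comm.Recv(r_devbuf, source=0, tag=77)       # MPI writes directly into GPU memory
    print("received checksum:", float(r_devbuf.sum()))
```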
CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3.3 Releases
• Support for MPI communication from NVIDIA GPU device memory
• High performance RDMA-based inter-node point-to-point communication
(GPU-GPU, GPU-Host and Host-GPU)
• High performance intra-node point-to-point communication for multi-GPU
adapters/node (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node
communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from
GPU device buffers
• Unified memory
[Charts: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth vs. message size – MVAPICH2 (no GDR) vs. MVAPICH2-GDR 2.3: latency down to 1.85 us (11X better), up to 9X higher bandwidth, and up to 10X higher bi-bandwidth]
Optimized MVAPICH2-GDR Design (MVAPICH2-GDR 2.3)
Platform: Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPUDirect RDMA
MVAPICH2-GDR vs. NCCL2 – Allreduce Operation (DGX-2)
• Optimized designs in MVAPICH2-GDR offer better/comparable performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 1 DGX-2 node (16 Volta GPUs)
[Charts: MPI_Allreduce latency (us) vs. message size, MVAPICH2-GDR 2.3.3 vs. NCCL 2.5, on one DGX-2 node – up to ~2.5X better for large messages and ~4.7X better for small/medium messages (8 B – 128 KB)]
Platform: NVIDIA DGX-2 system (16 NVIDIA Volta GPUs connected with NVSwitch), CUDA 9.2
C.-H. Chu, P. Kousha, A. Awan, K. S. Khorassani, H. Subramoni and D. K. Panda, "NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU
Systems, " ICS-2020, Accepted to be Presented.
MVAPICH2-GDR: Enhanced MPI_Allreduce at Scale
• Optimized designs in MVAPICH2-GDR offer better performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) up to 1,536 GPUs
[Charts: MPI_Allreduce on up to 1,536 GPUs – MVAPICH2-GDR 2.3.3 vs. NCCL 2.5 latency (4 B – 16 KB, up to 1.6X better) and bandwidth (32–256 MB, up to 1.7X better), plus 128 MB-message bandwidth scaling from 24 to 1,536 GPUs against SpectrumMPI 10.3, OpenMPI 4.0.1, and NCCL 2.5 (up to 1.7X better)]
Platform: dual-socket IBM POWER9 CPUs, 6 NVIDIA Volta V100 GPUs per node, and 2-port InfiniBand EDR interconnect
C.-H. Chu, P. Kousha, A. Awan, K. S. Khorassani, H. Subramoni and D. K. Panda, "NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU
Systems, " ICS-2020. Accepted to be Presented.
Scalable TensorFlow using Horovod and MVAPICH2-GDR
• ResNet-50 Training using TensorFlow benchmark on 1 DGX-2 node (16 Volta GPUs)
[Charts: ResNet-50 training throughput (images/sec) and scaling efficiency (%) on 1, 2, 4, 8, and 16 GPUs of a DGX-2 node – NCCL 2.5 vs. MVAPICH2-GDR 2.3.3, with up to 9% higher throughput for MVAPICH2-GDR]
Platform: NVIDIA DGX-2 system, CUDA 9.2
Scaling Efficiency = (Actual throughput / Ideal throughput at scale) × 100%
C.-H. Chu, P. Kousha, A. Awan, K. S. Khorassani, H. Subramoni and D. K. Panda, "NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU
Systems, " ICS-2020. Accepted to be Presented.
Distributed TensorFlow on ORNL Summit (1,536 GPUs)
• ResNet-50 training using the TensorFlow benchmark on SUMMIT – 1,536 Volta GPUs!
• 1,281,167 (1.2 million) images
• Time/epoch = 3.6 seconds
• Total time (90 epochs) ≈ 3.6 s/epoch x 90 ≈ 5.5 minutes!
[Chart: ResNet-50 throughput (thousands of images/sec) vs. number of GPUs (1–1,536) on Summit – NCCL 2.5 vs. MVAPICH2-GDR 2.3.3]
Platform: The Summit Supercomputer (#1 on Top500.org) – 6 NVIDIA Volta GPUs per node connected with NVLink, CUDA 9.2
*We observed errors for NCCL2 beyond 96 GPUs
MVAPICH2-GDR reaching ~0.35 million
images per second for ImageNet-1k!
ImageNet-1k has 1.2 million images
• LLNL/Lassen: POWER9 CPUs with 4 V100 GPUs/node
• PyTorch is becoming a very important DL framework
• Scaling PyTorch models with Horovod is simple (a hedged sketch follows the platform note below)
• MVAPICH2-GDR provides better performance and scalability compared to NCCL2
Scaling PyTorch on LLNL/Lassen using MVAPICH2-GDR
[Chart: PyTorch training throughput (images/sec) vs. number of GPUs (1–128) on Lassen – MVAPICH2-GDR vs. NCCL2]
Platform: The Lassen Supercomputer (#10 on Top500.org) – 4 NVIDIA Volta GPUs per node connected with NVLink, CUDA 10.1
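For reference, a minimal, hedged PyTorch + Horovod sketch; the synthetic data and hyperparameters are placeholders, and this is not the benchmark code used for the Lassen runs.

```python
# Hedged sketch: data-parallel PyTorch training with Horovod (MPI or NCCL backend).
# Launch with, e.g.: horovodrun -np 4 python pytorch_hvd_sketch.py
import torch
import torch.nn.functional as F
import torchvision.models as models
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())          # one GPU per Horovod rank

model = models.resnet50(num_classes=1000).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# Start all ranks from rank 0's weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(10):                           # synthetic data as a placeholder
    images = torch.randn(8, 3, 224, 224, device='cuda')
    labels = torch.randint(0, 1000, (8,), device='cuda')
    optimizer.zero_grad()
    loss = F.cross_entropy(model(images), labels)
    loss.backward()                              # gradients are Allreduce'd here
    optimizer.step()
    if hvd.rank() == 0:
        print(f"step {step}: loss {loss.item():.3f}")
```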
• RTX 5000 are NVIDIA’s GPUs targeted for data centers
• Different from the GTX series
– Supports GPUDirect RDMA (GDR)
– Supports GDRCOPY
• MVAPICH2-GDR offers good performance and reasonable scaling
• Scaling is not as good as on Lassen because
– Nodes are connected with IB FDR
– No NVLink between GPUs (only PCIe)
Early Exploration of MXNet using MVAPICH2-GDR on TACC Frontera RTX GPU Nodes
[Chart: MXNet training throughput (images/second) vs. number of GPUs (1–64) on Frontera RTX nodes with MVAPICH2-GDR]
Platform: TACC Frontera-RTX – 4 NVIDIA RTX 5000 GPUs per node, CUDA 10.1
• TACC Longhorn is like the LLNL/Lassen system (POWER9 with 4 V100 GPUs/node)
• Some restrictions to keep in mind
– TensorFlow 2.1 on POWER9 is only built with CUDA 10.2
• Early results are promising
– Good scaling up to 256 GPUs
Early Exploration of TensorFlow 2.1 on TACC/Longhorn using MVAPICH2-GDR
[Chart: TensorFlow 2.1 training throughput (images/sec) vs. number of GPUs (1–256) on TACC Longhorn with MVAPICH2-GDR]
Platform: The Longhorn System at TACC – 4 NVIDIA Volta GPUs per node connected with NVLink, CUDA 10.2
• MPI-driven Deep Learning
– CPU-based Deep Learning
– GPU-based Deep Learning
• Co-designing Deep Learning Stacks with High-Performance MPI
• Out-of-core DNN training
• Exploiting Hybrid (Data and Model) Parallelism
• Use-Case: AI-Driven Digital Pathology
Multiple Approaches taken up by OSU
• What are the fundamental bottlenecks in the existing DL frameworks?
• How to design a Scalable and Distributed Framework?
– Achieve both Scale-up and Scale-out efficiently
• What are the new requirements and expectations for Communication Runtimes?
• Can a Co-design approach help in achieving better training performance?
– Can a DL framework like Caffe be co-designed with an MPI runtime?
• What is the impact of the co-design?
– What performance benefits can be observed? At what levels?
S-Caffe: Co-designing HPC and DL
A. A. Awan, K. Hamidouche, J.M. Hashmi, DK Panda, “S-Caffe: Co-designing MPI Runtimes
and Caffe for Scalable Deep Learning on Modern GPU Clusters”, ACM PPoPP ’17
• Caffe: a flexible and layered Deep Learning framework
• Benefits and Weaknesses
– Multi-GPU training within a single node
– Performance degradation for GPUs across different sockets
– Limited scale-out
• OSU-Caffe: MPI-based Parallel Training
– Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
– Scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset
– Scale-out on 128 GPUs for training the GoogLeNet network on the ImageNet dataset
OSU-Caffe: Scalable Deep Learning
[Chart: GoogLeNet (ImageNet) training time (seconds) on 8–128 GPUs for Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); one region is marked “invalid use case”]
OSU-Caffe publicly available from
http://hidl.cse.ohio-state.edu/
Out-of-core Training via CUDA Unified Memory
• Large DNNs cannot be trained on GPUs due to memory limitations!
– ResNet-50 – current frameworks only allow small batch sizes
– Next-generation models like Neural Machine Translation (NMT), AmoebaNet, etc.
• Ridiculously large – they consist of billions of parameters
– Can we design out-of-core DNN training support using new software features in CUDA 8/9 and hardware mechanisms in Pascal/Volta GPUs?
• The general intuition is that managed allocations “will be” slow!
– The proposed OC-Caffe (Out-of-Core Caffe) exploits the potential of CUDA Unified Memory (a hedged managed-memory sketch follows the citation below).
• OC-Caffe-Opt: up to 80% better than Intel-optimized CPU Caffe for ResNet-50 training on the Volta V100 GPU with CUDA 9 and cuDNN 7
A. Awan et al., OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training, HiPC ’18
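As a small, hedged illustration of managed (unified) memory from Python: CuPy can be switched to cudaMallocManaged-backed allocations, so arrays can be oversubscribed and paged on demand on Pascal/Volta-class GPUs. This sketches only the underlying mechanism, not the OC-Caffe/OC-DNN design.

```python
# Hedged sketch: allocate CuPy arrays from CUDA managed (unified) memory, so the
# driver can page data between host and GPU on demand (Pascal/Volta and later).
# This illustrates the mechanism only; it is not the OC-Caffe/OC-DNN implementation.
import cupy as cp

# Route all CuPy allocations through cudaMallocManaged instead of cudaMalloc.
cp.cuda.set_allocator(cp.cuda.MemoryPool(cp.cuda.malloc_managed).malloc)

# A managed array may exceed free GPU memory; it is paged in as kernels touch it
# (at a performance cost) instead of failing outright at allocation time.
x = cp.random.rand(8192, 8192, dtype=cp.float32)
y = x @ x.T                      # computation runs on the GPU as usual
print(float(y.trace()))
```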
• Data-parallelism – only for models that fit in memory
• Out-of-core models
– Deeper model → better accuracy, but more memory required!
• Model parallelism can work for out-of-core models!
• Key Challenges
– Model partitioning is difficult for application programmers
– Finding the right partition (grain) size is hard – cut at which layer, and why?
– Developing a practical system for model-parallelism
• Redesign the DL framework or create additional layers?
• Existing communication middleware, or are extensions needed?
(A minimal model-partitioning sketch follows the citation below.)
HyPar-Flow: Hybrid and Model Parallelism for CPUs
A. A. Awan, A. Jain, Q. Anthony, H. Subramoni, and DK Panda,
“HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel
Training of TensorFlow models”, ISC ‘20 (Accepted to be
presented), https://arxiv.org/pdf/1911.05146.pdf
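To make the model-partitioning idea concrete, here is a hedged, forward-pass-only toy: a small Keras model is split into two partitions held by two MPI ranks, and the boundary activations are shipped with MPI. This is our own illustration of the concept, not the HyPar-Flow implementation; layer sizes and the batch size are arbitrary.

```python
# Hedged sketch: model parallelism by partitioning layers across MPI ranks and
# exchanging activations (forward pass only, 2 ranks). Not the HyPar-Flow code.
# Run with: mpirun -np 2 python model_parallel_sketch.py
import numpy as np
import tensorflow as tf
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
batch, hidden = 32, 1024

if rank == 0:
    # Partition 0 holds the first half of the layers.
    part = tf.keras.Sequential([tf.keras.layers.Dense(hidden, activation='relu'),
                                tf.keras.layers.Dense(hidden, activation='relu')])
    x = np.random.rand(batch, 512).astype(np.float32)
    act = part(x).numpy()                     # local forward pass
    comm.Send(act, dest=1, tag=0)             # ship activations to partition 1
else:
    # Partition 1 holds the remaining layers and produces the output.
    part = tf.keras.Sequential([tf.keras.layers.Dense(hidden, activation='relu'),
                                tf.keras.layers.Dense(10)])
    act = np.empty((batch, hidden), dtype=np.float32)
    comm.Recv(act, source=0, tag=0)           # receive boundary activations
    logits = part(act)
    print("output shape:", logits.shape)      # (32, 10)
```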
• ResNet-1001 with variable batch size
• Approach:
– 48 model-partitions for 56 cores
– 512 model-replicas for 512 nodes
– Total cores: 48 x 512 = 24,576
• Speedup
– 253X on 256 nodes
– 481X on 512 nodes
• Scaling Efficiency
– 98% up to 256 nodes
– 93.9% for 512 nodes
Out-of-core Training with HyPar-Flow (512 nodes on TACC Frontera)
481x speedup on 512 Intel Xeon Skylake nodes (TACC Frontera)
A. A. Awan, A. Jain, Q. Anthony, H. Subramoni, and DK Panda, “HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow
models”, ISC ‘20 (Accepted to be presented), https://arxiv.org/pdf/1911.05146.pdf
• The field of Pathology (like many other medical disciplines) is moving digital
• Traditionally, a pathologist reviews a slide and carries out a diagnosis based on prior knowledge and training
• Experience matters
• Can Deep Learning be used to train on thousands of slides and
– Let the computer carry out the diagnosis
– Narrow down the diagnosis and help a pathologist make a final decision
• Significant benefits in
– Reducing diagnosis time
– Making pathologists productive
– Reducing health care cost
AI-Driven Digital Pathology
• Pathology whole slide image (WSI)
– Each WSI = 100,000 x 100,000 pixels
– Cannot fit in a single GPU's memory
– Tiles are extracted to make training possible (a hedged tiling sketch follows the figure credit below)
• Two main problems with tiles
– Restricted tile size because of GPU memory limitation
– Smaller tiles lose structural information
• Can we use Model Parallelism to train on larger tiles to get better accuracy and diagnosis?
• Reduced training time significantly
– 7.25 hours (1 node, 4 GPUs) -> 27 mins (32 nodes, 128 GPUs)
Exploiting Model Parallelism in AI-Driven Digital Pathology
Courtesy: https://blog.kitware.com/digital-slide-archive-large-image-and-histomicstk-open-source-informatics-tools-for-management-visualization-and-analysis-of-digital-histopathology-data/
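For illustration, a hedged sketch of WSI tiling with the OpenSlide library so that each tile fits in GPU memory; the file path and tile size are placeholders, and this is not the pipeline used for the results above.

```python
# Hedged sketch: extract fixed-size tiles from a whole slide image (WSI) so each
# tile fits in GPU memory. The file path and tile size are placeholders.
import numpy as np
import openslide

slide = openslide.OpenSlide("example_wsi.svs")       # placeholder path
width, height = slide.dimensions                     # e.g., ~100,000 x 100,000
tile = 4096                                          # larger tiles keep more context

tiles = []
for y in range(0, height - tile + 1, tile):
    for x in range(0, width - tile + 1, tile):
        region = slide.read_region((x, y), 0, (tile, tile))   # PIL RGBA image
        tiles.append(np.asarray(region)[:, :, :3])            # drop alpha channel
        if len(tiles) >= 8:                                   # a few tiles for the demo
            break
    if len(tiles) >= 8:
        break
print(len(tiles), "tiles of shape", tiles[0].shape)
```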
• Scalable distributed training is getting important
• Requires high-performance middleware designs while exploiting modern
interconnects
• Provided a set of solutions to achieve scalable distributed training
– MPI (MVAPICH2)-driven solution with Horovod for TensorFlow, PyTorch and
MXNet
– Optimized collectives for CPU-based training
– CUDA-aware MPI with optimized collectives for GPU-based training
– Out-of-core training and Hybrid Parallelism
• Will continue to enable the DL community to achieve scalability and
high-performance for their distributed training
Conclusions
• Supported through X-ScaleSolutions (http://x-scalesolutions.com)
• Benefits:
– Help and guidance with installation of the library
– Platform-specific optimizations and tuning
– Timely support for operational issues encountered with the library
– Web portal interface to submit issues and track their progress
– Advanced debugging techniques
– Application-specific optimizations and tuning
– Obtaining guidelines on best practices
– Periodic information on major fixes and updates
– Information on major releases
– Help with upgrading to the latest release
– Flexible Service Level Agreements
• Support being provided to Lawrence Livermore National Laboratory (LLNL) and KISTI, Korea
Commercial Support for MVAPICH2, HiBD, and HiDL Libraries
• High-performance solution for distributed training for your complex
AI problems
• Features:
– Integrated package with TensorFlow, PyTorch, MXNet, Horovod, and
MVAPICH2 MPI libraries
– Targeted for both CPU-based and GPU-based Deep Learning Training
– Integrated profiling support across the stacks
– Support for x86 and OpenPOWER platforms
– Support for InfiniBand, RoCE and NVLink Interconnects
– Out-of-the-box optimal performance
– One-click deployment and execution
• Send an e-mail to contactus@x-scalesolutions.com for free trial!!
X-ScaleAI Product
Funding Acknowledgments
Funding Support by
Equipment Support by
Personnel Acknowledgments
Current Students (Graduate)
– A. Awan (Ph.D.)
– M. Bayatpour (Ph.D.)
– C.-H. Chu (Ph.D.)
– J. Hashmi (Ph.D.)
– A. Jain (Ph.D.)
– K. S. Kandadi (Ph.D.)
Past Students
– A. Augustine (M.S.)
– P. Balaji (Ph.D.)
– R. Biswas (M.S.)
– S. Bhagvat (M.S.)
– A. Bhat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– S. Chakraborthy (Ph.D.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– R. Rajachandrasekar (Ph.D.)
– D. Shankar (Ph.D.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
– J. Zhang (Ph.D.)
Past Research Scientist
– K. Hamidouche
– S. Sur
– X. Lu
Past Post-Docs
– D. Banerjee
– X. Besseron
– H.-W. Jin
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– K. Kulkarni (M.S.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– M. Li (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– Kamal Raj (M.S.)
– K. S. Khorassani (Ph.D.)
– P. Kousha (Ph.D.)
– A. Quentin (Ph.D.)
– B. Ramesh (Ph.D.)
– S. Xu (Ph.D.)
– J. Lin
– M. Luo
– E. Mancini
Past Programmers
– D. Bureddy
– J. Perkins
Current Research Specialist
– J. Smith
– S. Marcarelli
– A. Ruhela
– J. Vienne
Current Post-doc
– M. S. Ghazimeersaeed
– K. Manian
Current Students (Undergraduate)
– V. Gangal (B.S.)
– N. Sarkauskas (B.S.)
Past Research Specialist
– M. Arnold
Current Research Scientist
– A. Shafi
– H. Subramoni
– Q. Zhou (Ph.D.)
– H. Wang
Thank You!
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
panda@cse.ohio-state.edu
The High-Performance MPI/PGAS Project
http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project
http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 

Recently uploaded (20)

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 

How to Achieve High-Performance, Scalable and Distributed DNN Training on Modern HPC Systems?

  • 7. Scale-up and Scale-out
    • Scale-up: Intra-node Communication
      – Many improvements like:
        • NVIDIA cuDNN, cuBLAS, NCCL, etc.
        • CUDA 9 Co-operative Groups
    • Scale-out: Inter-node Communication
      – DL Frameworks – most are optimized for single-node only
      – Distributed (Parallel) Training is an emerging trend
        • OSU-Caffe – MPI-based
        • Microsoft CNTK – MPI/NCCL2
        • Google TensorFlow – gRPC-based/MPI/NCCL2
        • Facebook Caffe2 – Hybrid (NCCL2/Gloo/MPI)
    (Figure: scale-up performance vs. scale-out performance for cuDNN, MKL-DNN, NCCL2, MPI, gRPC, and Hadoop; the desired design delivers both.)
  • 8. Holistic Evaluation is Important!!
    • “My framework is faster than your framework!”
    • This needs to be understood in a holistic way.
    • Performance depends on the entire execution environment (the full stack)
    • An isolated view of performance is not helpful
    A. A. Awan, H. Subramoni, and Dhabaleswar K. Panda, “An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures”, Proceedings of Machine Learning on HPC Environments (MLHPC ’17), ACM, Article 8.
  • 9. Broad Challenge: Exploiting HPC for Deep Learning
    How to efficiently scale out a Deep Learning (DL) framework and take advantage of heterogeneous High Performance Computing (HPC) resources?
  • 10. Research Challenges to Exploit HPC Technologies
    1. What are the fundamental issues in designing DL frameworks?
       – Memory Requirements
       – Computation Requirements
       – Communication Overhead
    2. Why do we need to support distributed training?
       – To overcome the limits of single-node training
       – To better utilize hundreds of existing HPC clusters
    (Figure: Deep Learning and Machine Learning frameworks – Caffe/OSU-Caffe, Caffe2, TensorFlow, MXNet, CNTK – layered over communication runtimes and HPC platforms (CPU, GPU, InfiniBand); major computation and communication phases: forward, backward, gradient aggregation, model propagation.)
  • 11. Research Challenges to Exploit HPC Technologies (Cont’d)
    3. What are the new design challenges brought forward by DL frameworks for communication runtimes?
       – Large Message Collective Communication and Reductions
       – GPU Buffers (CUDA-Awareness)
    4. Can a Co-design approach help in achieving Scale-up and Scale-out efficiently?
       – Co-design the support at the runtime level and exploit it at the DL framework level
       – What performance benefits can be observed?
       – What needs to be fixed at the communication runtime layer?
    (Figure: same framework/runtime/platform stack as slide 10, highlighting CUDA-awareness, large-message collectives, point-to-point operations, and co-design opportunities across communication runtimes (MPI/NCCL/Gloo/MLSL).)
  • 12. Multiple Approaches taken up by OSU
    • MPI-driven Deep Learning
      – CPU-based Deep Learning
      – GPU-based Deep Learning
    • Co-designing Deep Learning Stacks with High-Performance MPI
    • Out-of-core DNN training
    • Exploiting Hybrid (Data and Model) Parallelism
    • Use-Case: AI-Driven Digital Pathology
  • 13. Data Parallel Deep Learning and MPI Collectives
    • Major MPI collectives involved in designing distributed frameworks:
      – MPI_Bcast – required for DNN parameter exchange
      – MPI_Reduce – needed for gradient accumulation from multiple solvers
      – MPI_Allreduce – use just one Allreduce instead of Reduce and Broadcast
    (Figure: training loop across four GPUs – 1. data/parameter propagation via MPI_Bcast of a packed communication buffer from GPU 0; 2. forward/backward pass on each replica; 3. gradient aggregation via MPI_Reduce into packed reduce buffers on GPU 0, followed by ApplyUpdates.)
    A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, “S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters”, 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’17).
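To make the collective pattern above concrete, here is a minimal data-parallel training skeleton using mpi4py. This is an illustrative sketch, not OSU-Caffe or S-Caffe code: the model size, learning rate, and random "gradients" are placeholders, and a real framework would compute gradients from an actual forward/backward pass.

```python
# Minimal sketch (not OSU-Caffe): parameter broadcast with MPI_Bcast and
# gradient aggregation with a single MPI_Allreduce instead of Reduce + Bcast.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

num_params = 1 << 20                        # illustrative model size
params = np.empty(num_params, dtype=np.float32)
if rank == 0:
    params[:] = np.random.rand(num_params)  # rank 0 holds initial parameters

# 1. Data (parameter) propagation: broadcast packed parameters to all ranks
comm.Bcast(params, root=0)

for step in range(10):                      # training loop (placeholder)
    # 2. Forward/backward pass on the local mini-batch (placeholder gradients)
    local_grads = np.random.rand(num_params).astype(np.float32)

    # 3. Gradient aggregation: one Allreduce replaces Reduce + Broadcast
    global_grads = np.empty_like(local_grads)
    comm.Allreduce(local_grads, global_grads, op=MPI.SUM)
    global_grads /= size                    # average across model replicas

    params -= 0.01 * global_grads           # apply updates (SGD, lr = 0.01)
```

Launched with the MPI library's mpiexec/mpirun (typically one process per GPU or per socket), this is the same bcast/allreduce pattern that underlies the MPI-driven designs discussed in the following slides.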
  • 14. Overview of the MVAPICH2 Project
    • High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
      – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1); started in 2001, first version available in 2002
      – MVAPICH2-X (MPI + PGAS), available since 2011
      – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
      – Support for Virtualization (MVAPICH2-Virt), available since 2015
      – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
      – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
      – Used by more than 3,075 organizations in 89 countries
      – More than 732,000 (> 0.7 million) downloads from the OSU site directly
      – Empowering many TOP500 clusters (November ’19 ranking):
        • 3rd ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
        • 5th, 448,448 cores (Frontera) at TACC
        • 8th, 391,680 cores (ABCI) in Japan
        • 14th, 570,020 cores (Nurion) in South Korea, and many others
      – Available with software stacks of many vendors and Linux distros (RedHat, SuSE, OpenHPC, and Spack)
      – http://mvapich.cse.ohio-state.edu
    • Empowering Top500 systems for over a decade
    • Partner in the 5th ranked TACC Frontera system
  • 15. MVAPICH2 Release Timeline and Downloads
    (Figure: cumulative number of downloads from Sep 2004 to Sep 2019, approaching 800,000, annotated with major releases from MV 0.9.4 and MV2 0.9.0 through MV2 2.3.3, MV2-GDR 2.3.2, MV2-X 2.3rc3, MV2-Virt 2.2, MV2-Azure 2.3.2, MV2-AWS 2.3, and OSU INAM 0.9.5.)
  • 16. Architecture of MVAPICH2 Software Family (HPC and DL)
    • High Performance Parallel Programming Models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)
    • High Performance and Scalable Communication Runtime with diverse APIs and mechanisms: point-to-point primitives, collectives algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis
    • Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)
    • Transport protocols (RC, SRD, UD, DC), modern features (UMR, ODP, SR-IOV, multi-rail, XPMEM, NVLink, Optane*, CAPI*; * upcoming), and transport mechanisms (shared memory, CMA, IVSHMEM)
  • 17. Data Parallel Training with TensorFlow
    • gRPC
      – Officially available and supported
      – Open-source – can be enhanced by others
      – Accelerated gRPC (add RDMA to gRPC)
    • gRPC+X
      – Use gRPC for bootstrap and rendezvous
      – Actual communication is in “X”
      – X = MPI, Verbs, GPUDirect RDMA (GDR), etc.
    • No-gRPC
      – Baidu – the first one to use MPI collectives for TF
      – Horovod – use NCCL, MPI, or any other future library (e.g., IBM DDL support recently added)
    *Awan et al., “Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation”, CCGrid ’19.
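As a concrete illustration of the "no-gRPC" Horovod path, the sketch below shows the standard Horovod + TensorFlow 2 recipe, assuming Horovod built against an MPI library such as MVAPICH2 (or NCCL). The ResNet-50 model, learning-rate scaling, and data pipeline are placeholders, not the exact benchmark setup reported later in the talk.

```python
# Minimal Horovod/TensorFlow 2 data-parallel sketch: one process per GPU,
# gradient allreduce via Horovod, initial state broadcast from rank 0.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
gpus = tf.config.list_physical_devices('GPU')
if gpus:                                                 # pin this rank to its local GPU
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.applications.ResNet50(weights=None)
opt = tf.keras.optimizers.SGD(0.01 * hvd.size())         # scale LR with worker count
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

@tf.function
def train_step(images, labels, first_batch):
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(images, training=True))
    # Allreduce of gradients across workers (MPI or NCCL underneath Horovod)
    tape = hvd.DistributedGradientTape(tape)
    grads = tape.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))
    if first_batch:                                      # sync initial state from rank 0
        hvd.broadcast_variables(model.variables, root_rank=0)
        hvd.broadcast_variables(opt.variables(), root_rank=0)
    return loss
```

Each rank feeds `train_step` its own shard of the (sharded) input pipeline; the only framework-visible change relative to single-GPU training is the tape wrapper and the one-time broadcast.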
  • 18. MVAPICH2 (MPI)-driven Infrastructure for ML/DL Training
    (Figure: ML/DL applications – TensorFlow, PyTorch, MXNet – run over Horovod, which in turn uses MVAPICH2-X for CPU-based training and MVAPICH2-GDR for GPU-based training.)
  • 19. MVAPICH2 Software Family (CPU-Based Deep Learning)
    High-Performance Parallel Programming Libraries:
      – MVAPICH2 – support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
      – MVAPICH2-X – advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with a unified communication runtime
      – MVAPICH2-GDR – optimized MPI for clusters with NVIDIA GPUs and for GPU-enabled Deep Learning applications
      – MVAPICH2-Virt – high-performance and scalable MPI for hypervisor- and container-based HPC cloud
      – MVAPICH2-EA – energy-aware and high-performance MPI
      – MVAPICH2-MIC – optimized MPI for clusters with Intel KNC
    Microbenchmarks:
      – OMB – microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs
    Tools:
      – OSU INAM – network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
      – OEMT – utility to measure the energy consumption of MPI applications
  • 20. Performance of CNTK with MVAPICH2-X on CPU-based Deep Learning
    • CPU-based training of the AlexNet neural network using the ImageNet ILSVRC2012 dataset
    • Advanced XPMEM-based designs show up to 20% benefit over Intel MPI (IMPI) for CNTK DNN training using Allreduce
    • The proposed designs show good scalability with increasing system size
    • Available since the MVAPICH2-X 2.3rc1 release
    (Chart: CNTK AlexNet training, execution time (s) vs. number of processes (28–224; batch size = default, 50 iterations, ppn = 28) for Intel MPI, MVAPICH2, and MVAPICH2-XPMEM; 20% and 9% improvements highlighted.)
    J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. K. Panda, “Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores”, 32nd IEEE International Parallel & Distributed Processing Symposium (IPDPS ’18), May 2018.
  • 21. Distributed TensorFlow on TACC Frontera (2,048 CPU nodes)
    • Scaled TensorFlow to 2,048 nodes on Frontera using MVAPICH2 and Intel MPI
    • MVAPICH2 and Intel MPI give similar performance for DNN training
    • Report a peak of 260,000 images/sec on 2,048 nodes
    • On 2,048 nodes, ResNet-50 can be trained in 7 minutes!
    A. Jain, A. A. Awan, H. Subramoni, D. K. Panda, “Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera”, DLS ’19 (SC ’19 Workshop).
  • 22. Scaling DL Frameworks using MVAPICH2-X on TACC Frontera
    • On a single node, TensorFlow (TF) is 8% better than MXNet
    • TF (tf_cnn_benchmark) is 2.13x better than PyTorch
    • TensorFlow is 1.7x better than MXNet
    • TF (Keras) gives better performance compared to PyTorch and MXNet
    A. Jain, A. A. Awan, H. Subramoni, D. K. Panda, “Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera”, DLS ’19 (SC ’19 Workshop).
  • 23. MVAPICH2 Software Family (GPU-Based Deep Learning)
    Same library and tool overview as slide 19; for GPU-based training the key component is MVAPICH2-GDR – optimized MPI for clusters with NVIDIA GPUs and for GPU-enabled Deep Learning applications.
  • 24. GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
    At Sender: MPI_Send(s_devbuf, size, …);
    At Receiver: MPI_Recv(r_devbuf, size, …);
    (handled inside MVAPICH2)
    • Standard MPI interfaces used for unified data movement
    • Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
    • Overlaps data movement from GPU with RDMA transfers
    • High Performance and High Productivity
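The same send/receive pattern can be exercised from Python. The hedged sketch below assumes mpi4py 3.1 or newer (which understands the CUDA array interface) running over a CUDA-aware MPI such as MVAPICH2-GDR, with CuPy providing the GPU buffers; the message size and tag are arbitrary.

```python
# Sketch of CUDA-aware point-to-point: device buffers (CuPy arrays) are
# passed straight to MPI Send/Recv; a CUDA-aware MPI such as MVAPICH2-GDR
# moves the data without explicit cudaMemcpy staging by the application.
import cupy as cp
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nelems = (4 << 20) // 4                     # 4 MB of float32, illustrative

if rank == 0:
    s_devbuf = cp.arange(nelems, dtype=cp.float32)      # buffer in GPU memory
    cp.cuda.get_current_stream().synchronize()          # make sure data is ready
    comm.Send(s_devbuf, dest=1, tag=77)                 # MPI_Send on device memory
elif rank == 1:
    r_devbuf = cp.empty(nelems, dtype=cp.float32)       # buffer in GPU memory
    comm.Recv(r_devbuf, source=0, tag=77)               # MPI_Recv into device memory
    print("rank 1 received last element:", float(r_devbuf[-1]))
```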
  • 25. CUDA-Aware MPI: MVAPICH2-GDR 1.8–2.3.3 Releases
    • Support for MPI communication from NVIDIA GPU device memory
    • High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU)
    • High-performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host, and Host-GPU)
    • Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
    • Optimized and tuned collectives for GPU device buffers
    • MPI datatype support for point-to-point and collective communication from GPU device buffers
    • Unified memory
  • 26. Optimized MVAPICH2-GDR Design
    (Charts: GPU-GPU inter-node latency, bandwidth, and bi-bandwidth vs. message size (1 byte to a few KB) for MV2 without GDR vs. MV2-GDR 2.3 – 1.85 us small-message latency (up to 11X better), with roughly 9x–10x higher bandwidth and bi-bandwidth.)
    Platform: MVAPICH2-GDR 2.3, Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPUDirect RDMA
  • 27. MVAPICH2-GDR vs. NCCL2 – Allreduce Operation (DGX-2)
    • Optimized designs in MVAPICH2-GDR offer better or comparable performance for most cases
    • MPI_Allreduce (MVAPICH2-GDR 2.3.3) vs. ncclAllreduce (NCCL 2.5) on one DGX-2 node (16 Volta GPUs)
    (Charts: allreduce latency vs. message size, showing up to ~4.7X and ~2.5X lower latency than NCCL 2.5 across the measured message-size ranges.)
    Platform: NVIDIA DGX-2 system (16 NVIDIA Volta GPUs connected with NVSwitch), CUDA 9.2
    C.-H. Chu, P. Kousha, A. A. Awan, K. S. Khorassani, H. Subramoni, and D. K. Panda, “NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems”, ICS 2020 (accepted to be presented).
  • 28. MVAPICH2-GDR: Enhanced MPI_Allreduce at Scale
    • Optimized designs in MVAPICH2-GDR offer better performance for most cases
    • MPI_Allreduce (MVAPICH2-GDR 2.3.3) vs. ncclAllreduce (NCCL 2.5) up to 1,536 GPUs
    (Charts: latency on 1,536 GPUs (4 B–16 KB, up to 1.6X better), bandwidth on 1,536 GPUs (32–256 MB, up to 1.7X better), and 128 MB-message bandwidth vs. GPU count (24–1,536) against Spectrum MPI 10.3, OpenMPI 4.0.1, and NCCL 2.5 (1.7X better).)
    Platform: dual-socket IBM POWER9 CPUs, 6 NVIDIA Volta V100 GPUs per node, and 2-port InfiniBand EDR interconnect
    C.-H. Chu, P. Kousha, A. A. Awan, K. S. Khorassani, H. Subramoni, and D. K. Panda, “NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems”, ICS 2020 (accepted to be presented).
  • 29. Scalable TensorFlow using Horovod and MVAPICH2-GDR
    • ResNet-50 training using the TensorFlow benchmark on one DGX-2 node (16 Volta GPUs)
    • Scaling Efficiency = (Actual throughput / Ideal throughput at scale) × 100%
    (Charts: images/second and scaling efficiency (%) vs. number of GPUs (1–16) for NCCL 2.5 and MVAPICH2-GDR 2.3.3; up to 9% higher throughput with MVAPICH2-GDR.)
    Platform: NVIDIA DGX-2 system, CUDA 9.2
    C.-H. Chu, P. Kousha, A. A. Awan, K. S. Khorassani, H. Subramoni, and D. K. Panda, “NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems”, ICS 2020 (accepted to be presented).
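For reference, the scaling-efficiency metric quoted on this slide can be computed as below; the throughput numbers in the example are made up purely for illustration.

```python
# Worked example of the scaling-efficiency formula on the slide.
def scaling_efficiency(actual_imgs_per_sec, single_gpu_imgs_per_sec, num_gpus):
    ideal = single_gpu_imgs_per_sec * num_gpus      # perfect linear scaling
    return 100.0 * actual_imgs_per_sec / ideal

# e.g. 400 img/s on 1 GPU and 5,800 img/s measured on 16 GPUs -> ~90.6 %
print(scaling_efficiency(5800, 400, 16))
```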
  • 30. Distributed TensorFlow on ORNL Summit (1,536 GPUs)
    • ResNet-50 training using the TensorFlow benchmark on Summit – 1,536 Volta GPUs!
    • ImageNet-1k has 1,281,167 (~1.2 million) images
    • Time/epoch = 3.6 seconds
    • Total time (90 epochs) = 3.6 x 90 = 332 seconds = 5.5 minutes!
    • MVAPICH2-GDR reaching ~0.35 million images per second for ImageNet-1k
    • *Errors were observed for NCCL2 beyond 96 GPUs
    Platform: the Summit supercomputer (#1 on Top500.org) – 6 NVIDIA Volta GPUs per node connected with NVLink, CUDA 9.2
  • 31. Scaling PyTorch on LLNL/Lassen using MVAPICH2-GDR
    • LLNL/Lassen: POWER9 CPUs with 4 V100 GPUs per node
    • PyTorch is becoming a very important DL framework
    • Scaling PyTorch models with Horovod is simple
    • MVAPICH2-GDR provides better performance and scalability compared to NCCL2
    (Chart: images/sec (higher is better) vs. number of GPUs (1–128) for MVAPICH2-GDR and NCCL2.)
    Platform: the Lassen supercomputer (#10 on Top500.org) – 4 NVIDIA Volta GPUs per node connected with NVLink, CUDA 10.1
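A minimal PyTorch + Horovod sketch of the setup described above is shown below. It assumes Horovod built with a GPU-capable MPI (such as MVAPICH2-GDR) or NCCL; the model, learning rate, and data loader are placeholders rather than the actual benchmark code.

```python
# Minimal PyTorch + Horovod data-parallel sketch: one process per GPU.
import torch
import torchvision.models as models
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())            # pin this rank to its local GPU

model = models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.0125 * hvd.size())

# Wrap the optimizer so gradient averaging runs as an allreduce
# (MVAPICH2-GDR or NCCL underneath, depending on the Horovod build).
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

criterion = torch.nn.CrossEntropyLoss().cuda()
for images, labels in []:                          # placeholder for a sharded DataLoader
    images, labels = images.cuda(), labels.cuda()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()                                # gradients are allreduced here
    optimizer.step()
```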
  • 32. Early Exploration of MXNet using MVAPICH2-GDR on TACC Frontera RTX GPU Nodes
    • RTX 5000 GPUs are NVIDIA GPUs targeted for data centers
    • Different from the GTX series:
      – Supports GPUDirect RDMA (GDR)
      – Supports GDRCOPY
    • MVAPICH2-GDR offers good performance and reasonable scaling
    • Scaling is not as good as Lassen because:
      – Nodes are connected with IB FDR
      – No NVLink between GPUs (only PCIe)
    (Chart: images/second (higher is better) vs. number of GPUs (1–64) with MVAPICH2-GDR.)
    Platform: TACC Frontera-RTX – 4 NVIDIA RTX 5000 GPUs per node, CUDA 10.1
  • 33. Early Exploration of TensorFlow 2.1 on TACC/Longhorn using MVAPICH2-GDR
    • TACC Longhorn is like the LLNL/Lassen system (POWER9 with 4 V100 GPUs per node)
    • Some restrictions to keep in mind:
      – TensorFlow 2.1 on POWER9 is only built with CUDA 10.2
    • Early results are promising:
      – Good scaling up to 256 GPUs
    (Chart: images/sec (higher is better) vs. number of GPUs (1–256) with MVAPICH2-GDR.)
    Platform: the Longhorn system at TACC – 4 NVIDIA Volta GPUs per node connected with NVLink, CUDA 10.2
  • 34. Multiple Approaches taken up by OSU
    • MPI-driven Deep Learning
      – CPU-based Deep Learning
      – GPU-based Deep Learning
    • Co-designing Deep Learning Stacks with High-Performance MPI
    • Out-of-core DNN training
    • Exploiting Hybrid (Data and Model) Parallelism
    • Use-Case: AI-Driven Digital Pathology
  • 35. S-Caffe: Co-designing HPC and DL
    • What are the fundamental bottlenecks in the existing DL frameworks?
    • How to design a scalable and distributed framework?
      – Achieve both scale-up and scale-out efficiently
    • What are the new requirements and expectations for communication runtimes?
    • Can a co-design approach help in achieving better training performance?
      – Can a DL framework like Caffe be co-designed with an MPI runtime?
    • What is the impact of the co-design?
      – What performance benefits can be observed? At what levels?
    A. A. Awan, K. Hamidouche, J. M. Hashmi, D. K. Panda, “S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters”, ACM PPoPP ’17.
  • 36. OSU-Caffe: Scalable Deep Learning
    • Caffe: a flexible and layered Deep Learning framework
    • Benefits and weaknesses:
      – Multi-GPU training within a single node
      – Performance degradation for GPUs across different sockets
      – Limited scale-out
    • OSU-Caffe: MPI-based parallel training
      – Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
      – Scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset
      – Scale-out on 128 GPUs for training GoogLeNet on the ImageNet dataset
    (Chart: GoogLeNet (ImageNet) training time (s) on 8–128 GPUs; series: Caffe, OSU-Caffe (1024), OSU-Caffe (2048), invalid use case.)
    OSU-Caffe is publicly available from http://hidl.cse.ohio-state.edu/
  • 37. Out-of-core Training via CUDA Unified Memory
    • Large DNNs cannot be trained on GPUs due to memory limitations!
      – ResNet-50: current frameworks only allow small batch sizes
      – Next-generation models like Neural Machine Translation (NMT), AmoebaNet, etc.
        • Ridiculously large – consist of billions of parameters
      – Can we design out-of-core DNN training support using new software features in CUDA 8/9 and hardware mechanisms in Pascal/Volta GPUs?
    • The general intuition is that managed allocations “will be” slow!
      – The proposed OC-Caffe (Out-of-Core Caffe) exploits the potential of CUDA Unified Memory
    • OC-Caffe-Opt: up to 80% better than Intel-optimized CPU Caffe for ResNet-50 training on the Volta V100 GPU with CUDA 9 and cuDNN 7
    A. A. Awan et al., “OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training”, HiPC ’18.
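The following is a small, hedged illustration of the unified-memory idea behind OC-Caffe (it is not OC-Caffe itself): routing allocations through cudaMallocManaged lets a buffer larger than the physical GPU memory be created and touched from the device, with pages migrated on demand on Pascal/Volta GPUs. The sketch uses CuPy's managed-memory allocator; the 8 GB size is arbitrary.

```python
# Illustrative only -- not OC-Caffe. All CuPy allocations are routed
# through cudaMallocManaged (CUDA Unified Memory), so buffers can exceed
# the physical GPU memory and are paged in on demand by the hardware.
import cupy as cp

cp.cuda.set_allocator(cp.cuda.MemoryPool(cp.cuda.malloc_managed).malloc)

n = 8 * 1024**3 // 4                      # 8 GB of float32, illustrative size
x = cp.zeros(n, dtype=cp.float32)         # backed by managed (unified) memory
x += 1.0                                  # GPU kernel pages the data in on demand
print(float(x.sum()))
```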
  • 38. HyPar-Flow: Hybrid and Model Parallelism for CPUs
    • Data parallelism – only for models that fit in memory
    • Out-of-core models:
      – Deeper model → better accuracy, but more memory required!
    • Model parallelism can work for out-of-core models!
    • Key challenges:
      – Model partitioning is difficult for application programmers
      – Finding the right partition (grain) size is hard – cut at which layer and why?
      – Developing a practical system for model parallelism:
        • Redesign the DL framework or create additional layers?
        • Is existing communication middleware enough, or are extensions needed?
    A. A. Awan, A. Jain, Q. Anthony, H. Subramoni, and D. K. Panda, “HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow models”, ISC ’20 (accepted to be presented), https://arxiv.org/pdf/1911.05146.pdf
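To show what "cutting the model at a layer" means in practice, here is a toy two-rank partitioning sketch using Keras and mpi4py. It is not HyPar-Flow: only the forward pass is shown, the layer sizes are invented, and a real system would also pipeline the backward pass and exchange gradients across partitions.

```python
# Toy layer-wise model partitioning across two MPI ranks (forward pass only).
import numpy as np
import tensorflow as tf
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Partition 0: input layers up to the cut point
    part0 = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation='relu', input_shape=(784,)),
        tf.keras.layers.Dense(128, activation='relu'),
    ])
    x = np.random.rand(32, 784).astype(np.float32)      # one mini-batch
    acts = part0(x).numpy()                              # activations at the cut
    comm.Send(acts, dest=1, tag=0)                       # ship activations to partition 1
elif rank == 1:
    # Partition 1: remaining layers through the output
    part1 = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(128,)),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    acts = np.empty((32, 128), dtype=np.float32)
    comm.Recv(acts, source=0, tag=0)
    probs = part1(acts)
    print("rank 1 output shape:", probs.shape)
```

Even this toy version makes the key design questions visible: where to cut, how large the boundary activations are, and how the activation/gradient exchange maps onto the communication runtime.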
  • 39. Out-of-core Training with HyPar-Flow (512 nodes on TACC Frontera)
    • ResNet-1001 with variable batch size
    • Approach:
      – 48 model partitions for 56 cores
      – 512 model replicas for 512 nodes
      – Total cores: 48 x 512 = 24,576
    • Speedup:
      – 253X on 256 nodes
      – 481X on 512 nodes
    • Scaling efficiency:
      – 98% up to 256 nodes
      – 93.9% for 512 nodes
    481x speedup on 512 Intel Xeon Skylake nodes (TACC Frontera)
    A. A. Awan, A. Jain, Q. Anthony, H. Subramoni, and D. K. Panda, “HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow models”, ISC ’20 (accepted to be presented), https://arxiv.org/pdf/1911.05146.pdf
  • 40. AI-Driven Digital Pathology
    • The field of Pathology (like many other medical disciplines) is moving digital
    • Traditionally, a pathologist reviews a slide and carries out diagnosis based on prior knowledge and training
      – Experience matters
    • Can Deep Learning be used to train on thousands of slides and
      – Let the computer carry out the diagnosis
      – Narrow down the diagnosis and help a pathologist make a final decision
    • Significant benefits in:
      – Reducing diagnosis time
      – Making pathologists productive
      – Reducing health care cost
  • 41. Exploiting Model Parallelism in AI-Driven Digital Pathology
    • Pathology whole slide image (WSI):
      – Each WSI = 100,000 x 100,000 pixels
      – Cannot fit in a single GPU's memory
      – Tiles are extracted to make training possible
    • Two main problems with tiles:
      – Restricted tile size because of the GPU memory limitation
      – Smaller tiles lose structural information
    • Can we use model parallelism to train on larger tiles to get better accuracy and diagnosis?
    • Reduced training time significantly:
      – 7.25 hours (1 node, 4 GPUs) → 27 minutes (32 nodes, 128 GPUs)
    Image courtesy: https://blog.kitware.com/digital-slide-archive-large-image-and-histomicstk-open-source-informatics-tools-for-management-visualization-and-analysis-of-digital-histopathology-data/
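A hypothetical tiling helper along these lines is sketched below; the tile size, array shapes, and the extract_tiles function are illustrative inventions, and production pipelines typically stream tiles with a WSI reader such as OpenSlide rather than loading the whole slide into memory at once.

```python
# Hypothetical tiling helper: cut a huge whole-slide image (a NumPy array
# stands in for a 100,000 x 100,000-pixel WSI) into fixed-size tiles that
# individually fit in GPU memory.
import numpy as np

def extract_tiles(slide, tile=4096):
    """Yield (row, col, patch) tuples of size tile x tile."""
    h, w = slide.shape[:2]
    for r in range(0, h - tile + 1, tile):
        for c in range(0, w - tile + 1, tile):
            yield r, c, slide[r:r + tile, c:c + tile]

# Small stand-in slide so the example actually runs
slide = np.zeros((8192, 8192, 3), dtype=np.uint8)
tiles = list(extract_tiles(slide, tile=4096))
print(len(tiles), "tiles of shape", tiles[0][2].shape)
```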
  • 42. Conclusions
    • Scalable distributed training is getting important
    • Requires high-performance middleware designs while exploiting modern interconnects
    • Provided a set of solutions to achieve scalable distributed training:
      – MPI (MVAPICH2)-driven solution with Horovod for TensorFlow, PyTorch, and MXNet
      – Optimized collectives for CPU-based training
      – CUDA-aware MPI with optimized collectives for GPU-based training
      – Out-of-core training and hybrid parallelism
    • Will continue to enable the DL community to achieve scalability and high performance for their distributed training
  • 43. Commercial Support for MVAPICH2, HiBD, and HiDL Libraries
    • Supported through X-ScaleSolutions (http://x-scalesolutions.com)
    • Benefits:
      – Help and guidance with installation of the library
      – Platform-specific optimizations and tuning
      – Timely support for operational issues encountered with the library
      – Web portal interface to submit issues and track their progress
      – Advanced debugging techniques
      – Application-specific optimizations and tuning
      – Obtaining guidelines on best practices
      – Periodic information on major fixes and updates
      – Information on major releases
      – Help with upgrading to the latest release
      – Flexible Service Level Agreements
    • Support is being provided to Lawrence Livermore National Laboratory (LLNL) and KISTI, Korea
  • 44. X-ScaleAI Product
    • High-performance solution for distributed training of your complex AI problems
    • Features:
      – Integrated package with TensorFlow, PyTorch, MXNet, Horovod, and MVAPICH2 MPI libraries
      – Targeted for both CPU-based and GPU-based Deep Learning training
      – Integrated profiling support across the stacks
      – Support for x86 and OpenPOWER platforms
      – Support for InfiniBand, RoCE, and NVLink interconnects
      – Out-of-the-box optimal performance
      – One-click deployment and execution
    • Send an e-mail to contactus@x-scalesolutions.com for a free trial!
  • 45. Funding Acknowledgments
    (Logo slide: “Funding Support by” and “Equipment Support by”.)
  • 46. HPCAC-AI-Stanford (April ‘20) 46Network Based Computing Laboratory Personnel Acknowledgments Current Students (Graduate) – A. Awan (Ph.D.) – M. Bayatpour (Ph.D.) – C.-H. Chu (Ph.D.) – J. Hashmi (Ph.D.) – A. Jain (Ph.D.) – K. S. Kandadi (Ph.D.) Past Students – A. Augustine (M.S.) – P. Balaji (Ph.D.) – R. Biswas (M.S.) – S. Bhagvat (M.S.) – A. Bhat (M.S.) – D. Buntinas (Ph.D.) – L. Chai (Ph.D.) – B. Chandrasekharan (M.S.) – S. Chakraborthy (Ph.D.) – N. Dandapanthula (M.S.) – V. Dhanraj (M.S.) – R. Rajachandrasekar (Ph.D.) – D. Shankar (Ph.D.) – G. Santhanaraman (Ph.D.) – A. Singh (Ph.D.) – J. Sridhar (M.S.) – S. Sur (Ph.D.) – H. Subramoni (Ph.D.) – K. Vaidyanathan (Ph.D.) – A. Vishnu (Ph.D.) – J. Wu (Ph.D.) – W. Yu (Ph.D.) – J. Zhang (Ph.D.) Past Research Scientist – K. Hamidouche – S. Sur – X. Lu Past Post-Docs – D. Banerjee – X. Besseron – H.-W. Jin – T. Gangadharappa (M.S.) – K. Gopalakrishnan (M.S.) – W. Huang (Ph.D.) – W. Jiang (M.S.) – J. Jose (Ph.D.) – S. Kini (M.S.) – M. Koop (Ph.D.) – K. Kulkarni (M.S.) – R. Kumar (M.S.) – S. Krishnamoorthy (M.S.) – K. Kandalla (Ph.D.) – M. Li (Ph.D.) – P. Lai (M.S.) – J. Liu (Ph.D.) – M. Luo (Ph.D.) – A. Mamidala (Ph.D.) – G. Marsh (M.S.) – V. Meshram (M.S.) – A. Moody (M.S.) – S. Naravula (Ph.D.) – R. Noronha (Ph.D.) – X. Ouyang (Ph.D.) – S. Pai (M.S.) – S. Potluri (Ph.D.) – Kamal Raj (M.S.) – K. S. Khorassani (Ph.D.) – P. Kousha (Ph.D.) – A. Quentin (Ph.D.) – B. Ramesh (Ph.D.) – S. Xu (Ph.D.) – J. Lin – M. Luo – E. Mancini Past Programmers – D. Bureddy – J. Perkins Current Research Specialist – J. Smith – S. Marcarelli – A. Ruhela – J. Vienne Current Post-doc – M. S. Ghazimeersaeed – K. Manian Current Students (Undergraduate) – V. Gangal (B.S.) – N. Sarkauskas (B.S.) Past Research Specialist – M. Arnold Current Research Scientist – A. Shafi – H. Subramoni – Q. Zhou (Ph.D.) – H. Wang
  • 47. Thank You!
    panda@cse.ohio-state.edu
    Network-Based Computing Laboratory – http://nowlab.cse.ohio-state.edu/
    The High-Performance MPI/PGAS Project – http://mvapich.cse.ohio-state.edu/
    The High-Performance Deep Learning Project – http://hidl.cse.ohio-state.edu/
    The High-Performance Big Data Project – http://hibd.cse.ohio-state.edu/