Scalable and Distributed DNN Training on
Modern HPC Systems
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
Talk at HPC-Advisory Council Switzerland Conference (April ’19)
Understanding the Deep Learning Resurgence
• Deep Learning is a subset of Machine Learning
  – But it is perhaps the most radical and revolutionary subset
  – Automatic feature extraction vs. hand-crafted features
• Deep Learning
  – A renewed interest and a lot of hype!
  – Key to its success: Deep Neural Networks (DNNs)
  – Everything was there since the late 80s except the “computability of DNNs”
Courtesy: http://www.deeplearningbook.org/contents/intro.html
Deep Learning Use Cases and Growth Trends
Courtesy: https://www.top500.org/news/market-for-artificial-intelligence-projected-to-hit-36-billion-by-2025/
Increasing Usage of HPC, Big Data and Deep Learning
• Big Data (Hadoop, Spark, HBase, Memcached, etc.)
• Deep Learning (Caffe, TensorFlow, BigDL, etc.)
• HPC (MPI, RDMA, Lustre, etc.)
• Convergence of HPC, Big Data, and Deep Learning!
• Increasing need to run these applications on the Cloud!!
Newer Workflows - Deep Learning over Big Data (DLoBD)
Pipeline stages: (1) Prepare datasets @Scale → (2) Deep Learning @Scale → (3) Non-deep-learning analytics @Scale → (4) Apply ML model @Scale
• Deep Learning over Big Data (DLoBD) is one of the most efficient paradigms for large-scale data analytics
• More and more deep learning tools and libraries (e.g., Caffe, TensorFlow) are starting to run over big data stacks, such as Apache Hadoop and Spark
• Benefits of the DLoBD approach
  – Easily build a powerful data analytics pipeline
    • E.g., Flickr DL/ML Pipeline, “How Deep Learning Powers Flickr”, http://bit.ly/1KIDfof
  – Better data locality
  – Efficient resource sharing and cost effectiveness
(A minimal PySpark sketch of this four-stage workflow follows below.)
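To make the four-stage workflow concrete, here is a minimal, hypothetical PySpark sketch (not from the talk): the manifest path, the label-count analytics, and the predict() stand-in are placeholders, and the distributed DNN training stage is indicated only by a comment, since it would be handled by a DL-on-Spark engine such as BigDL, CaffeOnSpark, or TensorFlowOnSpark.

# Hypothetical DLoBD pipeline sketch with PySpark (illustration only)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dlobd-pipeline-sketch").getOrCreate()
sc = spark.sparkContext

# (1) Prepare datasets @Scale: read and parse a (placeholder) manifest in parallel
raw = sc.textFile("hdfs:///data/image_manifest.csv")            # placeholder path
samples = raw.map(lambda line: tuple(line.strip().split(",")))  # e.g., (image_path, label)

# (2) Deep Learning @Scale: train a DNN over `samples` with a DL-on-Spark engine
#     (e.g., BigDL / CaffeOnSpark / TensorFlowOnSpark) -- omitted in this sketch

# (3) Non-deep-learning analytics @Scale: e.g., label distribution via reduceByKey
label_counts = samples.map(lambda s: (s[-1], 1)).reduceByKey(lambda a, b: a + b)

# (4) Apply ML model @Scale: score every record with a (placeholder) trained model
def predict(record):
    return (record[0], "predicted_label")   # stand-in for real model inference

predictions = samples.map(predict)

print(label_counts.take(5))
print(predictions.take(5))
spark.stop()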
Drivers of Modern HPC Cluster Architectures
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
• Available on HPC Clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc.
[Figure: building blocks of modern HPC clusters – multi-core processors; accelerators/coprocessors with high compute density and high performance/watt (>1 TFlop DP on a chip); high-performance interconnects such as InfiniBand (<1 usec latency, 200 Gbps bandwidth); SSD, NVMe-SSD, NVRAM. Example systems: Summit, Sierra, Sunway TaihuLight, K Computer]
Key Phases of Deep Learning
• Deep Learning has two major tasks
  1. Training of the Deep Neural Network
  2. Inference (or deployment) that uses a trained DNN
• DNN Training
  – Training is a compute/communication intensive process – it can take days to weeks
  – Faster training is necessary!
• Faster training can be achieved by
  – Using newer and faster hardware – but there is a limit!
  – Can we use more GPUs or nodes?
    • The need for Parallel and Distributed Training
Scale-up and Scale-out
• Scale-up: Intra-node Communication
  – Many improvements like:
    • NVIDIA cuDNN, cuBLAS, NCCL, etc.
    • CUDA 9 Co-operative Groups
• Scale-out: Inter-node Communication
  – DL Frameworks – most are optimized for single-node only
  – Distributed (Parallel) Training is an emerging trend
    • OSU-Caffe – MPI-based
    • Microsoft CNTK – MPI/NCCL2
    • Google TensorFlow – gRPC-based/MPI/NCCL2
    • Facebook Caffe2 – Hybrid (NCCL2/Gloo/MPI)
[Figure: qualitative placement of cuDNN, MKL-DNN, NCCL2, MPI, gRPC, and Hadoop along scale-up performance vs. scale-out performance axes, with the desired solution delivering both]
Holistic Evaluation is Important!!
[Figure: the DNN training stack – DL Applications (Image Recognition, Speech Processing, etc.); DL Frameworks (Caffe, TensorFlow, etc.); Generic, MKL-optimized, and cuDNN-optimized ConvolutionLayers; BLAS Libraries (MKL 2017, cuDNN/cuBLAS, OpenBLAS, ATLAS, others); Hardware (multi-/many-core Xeon and Xeon Phi, many-core GPUs such as Pascal P100, other processors)]
• My framework is faster than your framework!
• This needs to be understood in a holistic way.
• Performance depends on the entire execution environment (the full stack)
• An isolated view of performance is not helpful
A. A. Awan, H. Subramoni, and Dhabaleswar K. Panda. “An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures”, In Proceedings of the Machine Learning on HPC Environments (MLHPC'17). ACM, New York, NY, USA, Article 8.
Broad Challenge: Exploiting HPC for Deep Learning
How to efficiently scale out a Deep Learning (DL) framework and take advantage of heterogeneous High Performance Computing (HPC) resources?
Research Challenges to Exploit HPC Technologies
1. What are the fundamental issues in designing DL frameworks?
   – Memory Requirements
   – Computation Requirements
   – Communication Overhead
2. Why do we need to support distributed training?
   – To overcome the limits of single-node training
   – To better utilize hundreds of existing HPC Clusters
[Figure: Deep Learning and Machine Learning frameworks (Caffe/OSU-Caffe, Caffe2, TensorFlow, MXNet, CNTK) with their major computation and communication phases (Forward, Backward, Gradient Aggregation, Model Propagation), layered over communication runtimes for distributed training and HPC platforms (CPU, GPU, InfiniBand)]
Research Challenges to Exploit HPC Technologies (Cont’d)
3. What are the new design challenges brought forward by DL frameworks for communication runtimes?
   – Large-message collective communication and reductions
   – GPU buffers (CUDA-Awareness)
4. Can a co-design approach help in achieving scale-up and scale-out efficiently?
   – Co-design the support at the runtime level and exploit it at the DL framework level
   – What performance benefits can be observed?
   – What needs to be fixed at the communication runtime layer?
[Figure: the same framework/runtime/platform stack as above, highlighting point-to-point operations, large-message collectives, and CUDA-awareness in the communication runtimes (MPI/NCCL/Gloo/MLSL), along with co-design opportunities]
• MPI-driven Deep Learning
– CPU-based Deep Learning
– GPU-based Deep Learning
• Co-designing Deep Learning Stacks with High-Performance MPI
• Out-of-core DNN training
• Accelerating TensorFlow on HPC Systems
• Accelerating Big Data Stacks
• Efficient Deep Learning over Big Data
Multiple Approaches taken up by OSU
Data Parallel Deep Learning and MPI Collectives
[Figure: data-parallel training loop across four GPUs – (1) Data Propagation: DNN parameters (layers L1..Ln) are packed into packed_comm_buff and distributed with MPI_Bcast from GPU 0; (2) Forward/Backward pass on each GPU; (3) Gradient Aggregation: per-GPU gradients are packed into packed_reduce_buff and combined with MPI_Reduce on GPU 0, followed by ApplyUpdates; the loop then repeats]
• Major MPI collectives involved in designing distributed frameworks
  – MPI_Bcast – required for DNN parameter exchange
  – MPI_Reduce – needed for gradient accumulation from multiple solvers
  – MPI_Allreduce – use just one Allreduce instead of Reduce and Broadcast
(See the mpi4py sketch after the reference below.)
A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU
Clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)
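As a concrete illustration of these collectives, here is a minimal data-parallel training loop written with mpi4py; it is only a sketch of the pattern above (placeholder parameter/gradient buffers and a plain SGD update), not the S-Caffe/OSU-Caffe implementation.

# Minimal data-parallel training sketch with mpi4py (illustration only).
# Run with, e.g.: mpirun -np 4 python data_parallel_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

num_params = 1 << 20                              # placeholder model size
params = np.zeros(num_params, dtype=np.float32)
if rank == 0:
    params[:] = np.random.rand(num_params)        # "initial model" on the root

# Parameter exchange: root broadcasts the packed parameter buffer (MPI_Bcast)
comm.Bcast(params, root=0)

for step in range(10):
    # Forward/backward pass: placeholder per-rank gradient computation
    grads = np.random.rand(num_params).astype(np.float32)

    # Gradient aggregation: one Allreduce instead of Reduce + Bcast (MPI_Allreduce)
    global_grads = np.empty_like(grads)
    comm.Allreduce(grads, global_grads, op=MPI.SUM)

    # Every rank applies the same update, so parameters stay consistent
    params -= 0.01 * (global_grads / size)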
Overview of the MVAPICH2 Project
• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002
– MVAPICH2-X (MPI + PGAS), Available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Support for Virtualization (MVAPICH2-Virt), Available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 3,000 organizations in 88 countries
– More than 531,000 (> 0.5 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov ‘18 ranking)
• 3rd ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
• 14th, 556,104 cores (Oakforest-PACS) in Japan
• 17th, 367,024 cores (Stampede2) at TACC
• 27th, 241,108-core (Pleiades) at NASA and many others
– Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, and OpenHPC)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
Partner in the upcoming TACC Frontera System
Architecture of MVAPICH2 Software Family
• High Performance Parallel Programming Models: Message Passing Interface (MPI); PGAS (UPC, OpenSHMEM, CAF, UPC++); Hybrid – MPI + X (MPI + PGAS + OpenMP/Cilk)
• High Performance and Scalable Communication Runtime – diverse APIs and mechanisms: point-to-point primitives, collectives algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis
• Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path): transport protocols (RC, XRC, UD, DC) and modern features (UMR, ODP, SR-IOV, Multi-Rail)
• Support for Modern Multi-/Many-core Architectures (Intel Xeon, OpenPower, Xeon Phi, ARM, NVIDIA GPGPU): transport mechanisms (shared memory, CMA, IVSHMEM, XPMEM*) and modern features (MCDRAM*, NVLink*, CAPI*)
* Upcoming
MVAPICH2 Software Family (CPU-Based Deep Learning)
High-Performance Parallel Programming Libraries
• MVAPICH2 – Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
• MVAPICH2-X – Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with unified communication runtime
• MVAPICH2-GDR – Optimized MPI for clusters with NVIDIA GPUs and for GPU-enabled Deep Learning applications
• MVAPICH2-Virt – High-performance and scalable MPI for hypervisor- and container-based HPC cloud
• MVAPICH2-EA – Energy-aware and high-performance MPI
• MVAPICH2-MIC – Optimized MPI for clusters with Intel KNC
Microbenchmarks
• OMB – Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs
Tools
• OSU INAM – Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
• OEMT – Utility to measure the energy consumption of MPI applications
Performance of CNTK with MVAPICH2-X on CPU-based Deep Learning
[Figure: CNTK AlexNet training execution time (s) vs. number of processes (28–224), B.S=default, iteration=50, ppn=28, comparing Intel MPI, MVAPICH2, and MVAPICH2-XPMEM; 20% and 9% improvements annotated]
• CPU-based training of the AlexNet neural network using the ImageNet ILSVRC2012 dataset
• Advanced XPMEM-based designs show up to 20% benefit over Intel MPI (IMPI) for CNTK DNN training using Allreduce
• The proposed designs show good scalability with increasing system size
Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and DK Panda, 32nd IEEE International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018
Available since MVAPICH2-X 2.3rc1 release
Performance of TensorFlow with MVAPICH2-X on CPU
• CPU-based distributed TensorFlow Benchmarks (TF) benchmark – tf_cnn_benchmarks tests
• AlexNet model training with the ImageNet ILSVRC2012 dataset
• Advanced SALaR- and XPMEM-based designs in MVAPICH2-X showed good scalability
• Up to 15% and 35% improvements in the number of images per second at 448 and 896 processes, respectively
[Figure: TensorFlow images per second (higher is better) vs. number of processes (112–896), comparing MVAPICH2 2.3rc2, SALaR-SHMEM, and SALaR-XPMEM; up to 35% improvement]
SALaR: Scalable and Adaptive Designs for Large Message Reduction Collectives, M. Bayatpour, J. Hashmi, S. Chakraborty, H. Subramoni, P. Kousha, and DK Panda, IEEE Cluster 2018, Sep 2018 [Best Paper in Architecture Track]
Will be available in future MVAPICH2-X releases
MVAPICH2 Software Family (GPU-Based Deep Learning)
High-Performance Parallel Programming Libraries
• MVAPICH2 – Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
• MVAPICH2-X – Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with unified communication runtime
• MVAPICH2-GDR – Optimized MPI for clusters with NVIDIA GPUs and for GPU-enabled Deep Learning applications
• MVAPICH2-Virt – High-performance and scalable MPI for hypervisor- and container-based HPC cloud
• MVAPICH2-EA – Energy-aware and high-performance MPI
• MVAPICH2-MIC – Optimized MPI for clusters with Intel KNC
Microbenchmarks
• OMB – Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs
Tools
• OSU INAM – Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
• OEMT – Utility to measure the energy consumption of MPI applications
GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
At Sender: MPI_Send(s_devbuf, size, …);
At Receiver: MPI_Recv(r_devbuf, size, …);
(All GPU-to-GPU data movement is handled inside MVAPICH2)
• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers
High Performance and High Productivity
(A minimal CUDA-aware mpi4py sketch follows below.)
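The same device-buffer communication can be illustrated from Python: mpi4py accepts GPU arrays that implement the CUDA array interface (e.g., CuPy arrays), so they can be passed directly to Send/Recv when the underlying MPI library is CUDA-aware. This is a minimal sketch assuming MVAPICH2-GDR (or another CUDA-aware MPI) plus the cupy package; the buffer size and tag are arbitrary.

# Minimal CUDA-aware point-to-point sketch with mpi4py + CuPy
# (assumes a CUDA-aware MPI such as MVAPICH2-GDR underneath).
# Run with, e.g.: mpirun -np 2 python cuda_aware_pt2pt.py
import cupy as cp
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n = 1 << 20
if rank == 0:
    s_devbuf = cp.arange(n, dtype=cp.float32)   # send buffer in GPU device memory
    comm.Send(s_devbuf, dest=1, tag=77)         # device pointer passed straight to MPI
elif rank == 1:
    r_devbuf = cp.empty(n, dtype=cp.float32)    # receive buffer in GPU device memory
    comm.Recv(r_devbuf, source=0, tag=77)
    print("received sum on GPU:", float(r_devbuf.sum()))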
CUDA-Aware MPI: MVAPICH2-GDR 1.8–2.3.1 Releases
• Support for MPI communication from NVIDIA GPU device memory
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU)
• High-performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host, and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) for intra-node communication with multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
• Unified memory
• MVAPICH2-GDR 2.3.1 requires the following software to be installed on your system:
1. Mellanox OFED 3.2 and later
2. NVIDIA Driver 367.48 or later
3. NVIDIA CUDA Toolkit 7.5 and later
4. NVIDIA Peer Memory (nv_peer_mem) module to enable GPUDirect RDMA (GDR) support
• Strongly Recommended for Best Performance
5. GDRCOPY Library by NVIDIA: https://github.com/NVIDIA/gdrcopy
• Comprehensive Instructions can be seen from the MVAPICH2-GDR User Guide:
– http://mvapich.cse.ohio-state.edu/userguide/gdr/
MVAPICH2-GDR: Pre-requisites for OpenPOWER & x86 Systems
MVAPICH2-GDR: Download and Setup on OpenPOWER & x86 Systems
• Simple installation steps for both systems
• Pick the right MVAPICH2-GDR RPM from the Downloads page:
  – http://mvapich.cse.ohio-state.edu/downloads/
  – e.g. http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3/mofed4.5/mvapich2-gdr-mcast.cuda10.0.mofed4.5.gnu4.8.5-2.3-1.el7.x86_64.rpm (== <mv2-gdr-rpm-name>.rpm)
$ wget http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3/<mv2-gdr-rpm-name>.rpm
Root Users:
$ rpm -Uvh --nodeps <mv2-gdr-rpm-name>.rpm
Non-Root Users:
$ rpm2cpio <mv2-gdr-rpm-name>.rpm | cpio -id
• Contact the MVAPICH help list with any questions related to the package: mvapich-help@cse.ohio-state.edu
• Released on 03/16/2018
• Major Features and Enhancements
– Based on MVAPICH2 2.3.1
– Enhanced intra-node and inter-node point-to-point performance for DGX-2 and IBM POWER8 and IBM POWER9 systems
– Enhanced Allreduce performance for DGX-2 and IBM POWER8/POWER9 systems
– Enhanced small message performance for CUDA-Aware MPI_Put and MPI_Get
– Support for PGI 18.10
– Flexible support for running TensorFlow (Horovod) jobs
– Add support for Volta (V100) GPU
– Support for OpenPOWER with NVLink
– Efficient Multiple CUDA stream-based IPC communication for multi-GPU systems with and without NVLink
– Leverage Linux Cross Memory Attach (CMA) feature for enhanced host-based communication
– InfiniBand Multicast (IB-MCAST) based designs for GPU-based broadcast and streaming applications
– Efficient broadcast designs for Deep Learning applications
MVAPICH2-GDR 2.3.1
Optimized MVAPICH2-GDR Design
[Figure: GPU-GPU inter-node latency, bandwidth, and bi-bandwidth vs. message size, comparing MV2 (NO-GDR) and MV2-GDR 2.3 – about 1.85 us small-message latency (up to 11X better) and roughly 9x–10x higher bandwidth/bi-bandwidth]
Platform: Intel Haswell (E5-2687W @ 3.10 GHz) node – 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPU-Direct RDMA; MVAPICH2-GDR 2.3
Distributed Training using TensorFlow (TF)
• TensorFlow is the most popular DL framework
• gRPC is the official distributed training runtime
  – Many problems for HPC use cases
• Community efforts – Baidu and Uber’s Horovod have added MPI support to TF across nodes
• Need to understand the several options currently available (a minimal Horovod sketch follows the reference below)
[Figure: taxonomy of distributed TensorFlow designs – gRPC, Accelerated gRPC, gRPC+X (gRPC+MPI, gRPC+Verbs, gRPC+GDR), and No-gRPC (Baidu-MPI, Horovod over MPI or NCCL)]
Awan et al., “Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation”, (To be presented) CCGrid ‘19. https://arxiv.org/abs/1810.11112
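For reference, a Horovod-based data-parallel TensorFlow job follows the pattern below. This is a minimal, present-day tf.keras-style sketch (toy MNIST model, arbitrary hyper-parameters), not the benchmark code evaluated in the talk; it assumes the horovod[tensorflow] package and a recent TensorFlow 2.x.

# Minimal Horovod + TensorFlow data-parallel sketch (tf.keras style).
# Launch with horovodrun or mpirun, e.g.: mpirun -np 4 python horovod_sketch.py
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                                            # one process per GPU
gpus = tf.config.list_physical_devices("GPU")
if gpus:                                              # pin each rank to one local GPU
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Placeholder dataset/model; any Keras model and tf.data pipeline fits here
(x, y), _ = tf.keras.datasets.mnist.load_data()
dataset = tf.data.Dataset.from_tensor_slices((x[..., None] / 255.0, y)).shuffle(10000).batch(64)
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Wrap the optimizer so gradients are averaged with Allreduce (MPI or NCCL underneath)
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]  # sync initial weights
model.fit(dataset, epochs=1, callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)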
Scalable TensorFlow using Horovod, MPI, and NCCL
• Efficient Allreduce is crucial for Horovod’s overall training performance
  – Both MPI and NCCL designs are available
• We have evaluated Horovod extensively and compared it across a wide range of designs using gRPC and gRPC extensions
• MVAPICH2-GDR achieved up to 90% scaling efficiency for ResNet-50 training on 64 Pascal GPUs
[Figure: ResNet-50 images/second (higher is better) – left: 1–16 GPUs comparing Horovod-MPI, Horovod-NCCL2, Horovod-MPI-Opt (proposed), and Ideal; right: up to 64 nodes (GPUs) comparing Horovod-NCCL2, Horovod-MPI-Opt, and Ideal]
Awan et al., “Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation”, (To be presented) CCGrid ‘19. https://arxiv.org/abs/1810.11112
MVAPICH2-GDR: Allreduce Comparison with Baidu and OpenMPI
• 16 GPUs (4 nodes): MVAPICH2-GDR vs. Baidu-Allreduce and OpenMPI 3.0
[Figure: Allreduce latency (us) vs. message size for MVAPICH2, BAIDU, and OPENMPI across small, medium, and large message ranges; annotations: ~30X better, ~10X better, ~4X better, MV2 is ~2X better than Baidu, OpenMPI is ~5X slower than Baidu]
*Available since MVAPICH2-GDR 2.3a
(A minimal mpi4py timing sketch for such an Allreduce comparison follows below.)
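The kind of latency comparison shown above can be sketched with mpi4py and CuPy as follows; this is only an illustration of the measurement idea (message sizes, iteration count, and timing are placeholders), not the OSU benchmark used for these results, and it assumes a CUDA-aware MPI underneath.

# Minimal GPU-buffer Allreduce latency sketch with mpi4py + CuPy (illustration only).
# Assumes a CUDA-aware MPI; run with, e.g.: mpirun -np 16 python allreduce_latency.py
import time
import cupy as cp
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

for msg_bytes in (4 * 1024, 512 * 1024, 8 * 1024 * 1024):
    n = msg_bytes // 4                              # float32 elements
    sendbuf = cp.ones(n, dtype=cp.float32)
    recvbuf = cp.empty_like(sendbuf)

    comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM)    # warm-up
    cp.cuda.Device().synchronize()

    iters = 100
    comm.Barrier()
    start = time.perf_counter()
    for _ in range(iters):
        comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM)
    cp.cuda.Device().synchronize()
    elapsed_us = (time.perf_counter() - start) / iters * 1e6

    if rank == 0:
        print(f"{msg_bytes} bytes: {elapsed_us:.1f} us per Allreduce")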
MVAPICH2-GDR vs. NCCL2 – Allreduce Operation
• Optimized designs in MVAPICH2-GDR 2.3 offer better/comparable performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 16 GPUs
[Figure: Allreduce latency (us) vs. message size for MVAPICH2-GDR and NCCL2 – ~1.2X better for large messages and ~3X better for small/medium messages (4 B–64 KB)]
Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 1 K-80 GPU, and EDR InfiniBand interconnect
MVAPICH2-GDR vs. NCCL2 – Allreduce Operation (DGX-2)
• Optimized designs in the upcoming MVAPICH2-GDR offer better/comparable performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 1 DGX-2 node (16 Volta GPUs)
[Figure: Allreduce latency (us) vs. message size for MVAPICH2-GDR-Next and NCCL-2.3 – ~1.7X better for large messages and ~2.5X better for small/medium messages (8 B–128 KB)]
Platform: NVIDIA DGX-2 system (16 NVIDIA Volta GPUs connected with NVSwitch), CUDA 9.2
Distributed Training with TensorFlow and MVAPICH2-GDR
• ResNet-50 training using the TensorFlow benchmark on 1 DGX-2 node (8 Volta GPUs)
[Figure: left – images per second for 1–8 GPUs, NCCL-2.3 vs. MVAPICH2-GDR-Next (up to 7.5% higher); right – scaling efficiency (%) for 1–8 GPUs]
Platform: NVIDIA DGX-2 system (16 NVIDIA Volta GPUs connected with NVSwitch), CUDA 9.2
Scaling Efficiency = (Actual throughput / Ideal throughput at scale) × 100%
(A small helper that computes this metric is sketched below.)
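The scaling-efficiency metric above can be computed with a few lines of Python; the throughput numbers in the example are made-up placeholders, not the measured DGX-2 results.

# Small helper illustrating the scaling-efficiency metric above
# (the throughput numbers below are made-up placeholders, not measured values).
def scaling_efficiency(actual_images_per_sec, single_gpu_images_per_sec, num_gpus):
    """Actual throughput divided by ideal throughput at scale, as a percentage."""
    ideal = single_gpu_images_per_sec * num_gpus
    return 100.0 * actual_images_per_sec / ideal

if __name__ == "__main__":
    # e.g., a hypothetical run: 1 GPU delivers 350 images/s, 8 GPUs deliver 2600 images/s
    print(f"{scaling_efficiency(2600, 350, 8):.1f}% scaling efficiency on 8 GPUs")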
OSU-Caffe: Scalable Deep Learning
• Caffe: a flexible and layered Deep Learning framework
• Benefits and weaknesses
  – Multi-GPU training within a single node
  – Performance degradation for GPUs across different sockets
  – Limited scale-out
• OSU-Caffe: MPI-based parallel training
  – Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
  – Scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset
  – Scale-out on 128 GPUs for training the GoogLeNet network on the ImageNet dataset
[Figure: GoogLeNet (ImageNet) training time (seconds) on 8–128 GPUs for Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); Caffe beyond a single node is an invalid use case]
OSU-Caffe is publicly available from http://hidl.cse.ohio-state.edu/
Training Large (Out-of-core) Models
• Large DNNs cannot be trained on GPUs due to memory limitations!
  – For example, with ResNet-50 for image recognition, current frameworks can only go up to a small batch size of 45
  – Next-generation models like Neural Machine Translation (NMT) are extremely large, consist of billions of parameters, and require even more memory
  – Can we design out-of-core DNN training support using new software features in CUDA 8/9 and hardware mechanisms in Pascal/Volta GPUs?
• The general intuition is that managed allocations “will be” slow!
  – The proposed framework, OC-Caffe (Out-of-Core Caffe), shows that managed-memory designs can provide performance with negligible/no overhead (a small CuPy managed-memory sketch appears at the end of this slide)
• OC-Caffe-Opt: up to 80% better than Intel-optimized CPU Caffe for ResNet-50 training on the Volta V100 GPU with CUDA 9 and cuDNN 7
[Figure: memory requirements (trainability) vs. batch size for AlexNet, GoogLeNet, and VGG relative to the P100 GPU memory limit (16 GB); larger batch sizes require out-of-core training]
A. Awan et al., OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training, HiPC ’18
[Figure: images/sec (higher is better) for caffe-gpu, oc-caffe-naïve, oc-caffe-opt, caffe-cpu, intel-caffe, and intel-caffe-opt – oc-caffe-opt is 80% better than intel-caffe; caffe-gpu cannot run; intel-caffe-opt is N/A]
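The managed-memory mechanism behind such out-of-core designs can be illustrated from Python with CuPy, which can route its allocations through CUDA managed (unified) memory so that arrays may exceed physical GPU memory and migrate on demand on Pascal/Volta GPUs. This is only a sketch of the mechanism (the array size is a placeholder), not the OC-Caffe/OC-DNN design itself.

# Minimal sketch of CUDA managed (unified) memory from Python with CuPy.
# On Pascal/Volta GPUs, managed allocations can be oversubscribed beyond physical
# GPU memory and are migrated on demand -- the mechanism behind out-of-core designs.
import cupy as cp

# Route all CuPy allocations through cudaMallocManaged instead of cudaMalloc
cp.cuda.set_allocator(cp.cuda.MemoryPool(cp.cuda.malloc_managed).malloc)

gib = 1024 ** 3
x = cp.zeros(4 * gib // 4, dtype=cp.float32)   # a 4 GiB array; could exceed GPU memory
x += 1.0                                        # touched on the GPU; pages migrate as needed
print(float(x.sum()))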
• MPI-driven Deep Learning
• Co-designing Deep Learning Stacks with High-Performance MPI
• Out-of-core DNN training
• Accelerating TensorFlow on HPC Systems
• Accelerating Big Data Stacks
• Efficient Deep Learning over Big Data
Multiple Approaches taken up by OSU
Architecture Overview of gRPC
Key Features:
• Simple service definition
• Works across languages and platforms
  – C++, Java, Python, Android Java, etc.
  – Linux, Mac, Windows
• Start quickly and scale
• Bi-directional streaming and integrated authentication
• Used by Google (several of Google’s cloud products and Google’s externally facing APIs, TensorFlow), Netflix, Docker, Cisco, Juniper Networks, etc.
• Uses sockets for communication!
Large-scale distributed systems composed of micro-services
Source: http://www.grpc.io/
Performance Benefits for RDMA-gRPC with Micro-Benchmark
• gRPC-RDMA RPC latency on SDSC-Comet-FDR
  – Up to 2.7x speedup over IPoIB for small messages
  – Up to 2.8x speedup over IPoIB for medium messages
  – Up to 2.5x speedup over IPoIB for large messages
[Figure: RPC latency (us) vs. payload size for Default gRPC and OSU RDMA gRPC across small (2 B–8 KB), medium (16 KB–512 KB), and large (1 MB–8 MB) payloads]
R. Biswas, X. Lu, and D. K. Panda, Accelerating gRPC and TensorFlow with RDMA for High-Performance Deep Learning over InfiniBand, HiPC ‘18.
Performance Benefit for RDMA-TensorFlow (Inception3)
• TensorFlow Inception3 performance evaluation on an IB EDR cluster
  – Up to 20% speedup over default gRPC (IPoIB) for 8 GPUs
  – Up to 34% speedup over default gRPC (IPoIB) for 16 GPUs
  – Up to 37% speedup over default gRPC (IPoIB) for 24 GPUs
[Figure: images/second vs. batch size (16, 32, 64) on 4 nodes (8 GPUs), 8 nodes (16 GPUs), and 12 nodes (24 GPUs), comparing gRPC (IPoIB-100Gbps), Verbs (RDMA-100Gbps), MPI (RDMA-100Gbps), and AR-gRPC (RDMA-100Gbps)]
R. Biswas, X. Lu, and D. K. Panda, Accelerating TensorFlow with Adaptive RDMA-based gRPC, HiPC ‘18
• High-Performance Design of TensorFlow over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand support at the verbs-level for gRPC and TensorFlow
– RDMA-based data communication
– Adaptive communication protocols
– Dynamic message chunking and accumulation
– Support for RDMA device selection
– Easily configurable for different protocols (native InfiniBand and IPoIB)
• Current release: 0.9.1
– Based on Google TensorFlow 1.3.0
– Tested with
• Mellanox InfiniBand adapters (e.g., EDR)
• NVIDIA GPGPU K80
• Tested with CUDA 8.0 and CUDNN 5.0
– http://hidl.cse.ohio-state.edu
RDMA-TensorFlow Distribution
• MPI-driven Deep Learning
• Co-designing Deep Learning Stacks with High-Performance MPI
• Out-of-core DNN Training
• Accelerating TensorFlow on HPC Systems
• Accelerating Big Data Stacks
• Efficient Deep Learning over Big Data
Multiple Approaches taken up by OSU
• RDMA for Apache Spark
• RDMA for Apache Hadoop 3.x (RDMA-Hadoop-3.x)
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
– Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions
• RDMA for Apache Kafka
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
– HDFS, Memcached, HBase, and Spark Micro-benchmarks
• http://hibd.cse.ohio-state.edu
• Users Base: 305 organizations from 35 countries
• More than 29,500 downloads from the project site
The High-Performance Big Data (HiBD) Project
Available for InfiniBand and RoCE
Also run on Ethernet
Available for x86 and OpenPOWER
Support for Singularity and Docker
Performance Numbers of RDMA for Apache Hadoop 2.x – RandomWriter & TeraGen in OSU-RI2 (EDR)
Cluster with 8 nodes and a total of 64 maps
• RandomWriter: 3x improvement over IPoIB for 80–160 GB file size (execution time reduced by 3x)
• TeraGen: 4x improvement over IPoIB for 80–240 GB file size (execution time reduced by 4x)
[Figure: execution time (s) vs. data size (GB) for IPoIB (EDR) and OSU-IB (EDR) – RandomWriter (80–160 GB) and TeraGen (80–240 GB)]
Performance Evaluation on SDSC Comet – HiBench PageRank
• InfiniBand FDR, SSD, 32/64 worker nodes, 768/1536 cores, (768/1536M 768/1536R)
• RDMA-based design for Spark 1.5.1
• RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node
  – 32 nodes/768 cores: total time reduced by 37% over IPoIB (56Gbps)
  – 64 nodes/1536 cores: total time reduced by 43% over IPoIB (56Gbps)
[Figure: PageRank total time (sec) for IPoIB vs. RDMA on Huge, BigData, and Gigantic data sizes – 32 worker nodes/768 cores (37% reduction) and 64 worker nodes/1536 cores (43% reduction)]
High-Performance Deep Learning over Big Data (DLoBD) Stacks
• Challenges of Deep Learning over Big Data (DLoBD)
  – Can RDMA-based designs in DLoBD stacks improve performance, scalability, and resource utilization on high-performance interconnects, GPUs, and multi-core CPUs?
  – What are the performance characteristics of representative DLoBD stacks on RDMA networks?
• Characterization of DLoBD stacks
  – CaffeOnSpark, TensorFlowOnSpark, and BigDL
  – IPoIB vs. RDMA; in-band vs. out-of-band communication; CPU vs. GPU; etc.
  – Performance, accuracy, scalability, and resource utilization
  – RDMA-based DLoBD stacks (e.g., BigDL over RDMA-Spark) can achieve a 2.6x speedup compared to the IPoIB-based scheme, while maintaining similar accuracy
[Figure: per-epoch time (secs) and accuracy (%) over 18 epochs for IPoIB vs. RDMA, with a 2.6X gap annotated]
X. Lu, H. Shi, M. H. Javed, R. Biswas, and D. K. Panda, Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-capable Networks, HotI 2017.
• Supported through X-ScaleSolutions (http://x-scalesolutions.com)
• Benefits:
– Help and guidance with installation of the library
– Platform-specific optimizations and tuning
– Timely support for operational issues encountered with the library
– Web portal interface to submit issues and track their progress
– Advanced debugging techniques
– Application-specific optimizations and tuning
– Obtaining guidelines on best practices
– Periodic information on major fixes and updates
– Information on major releases
– Help with upgrading to the latest release
– Flexible Service Level Agreements
• Support provided to Lawrence Livermore National Laboratory (LLNL) for the last two years
Commercial Support for MVAPICH2, HiBD, and HiDL Libraries
• Recently joined the OpenPOWER Consortium as a silver ISV member
• Provides flexibility:
– To have the MVAPICH2, HiDL, and HiBD libraries integrated into the OpenPOWER software stack
– A part of the OpenPOWER ecosystem
– Can participate with different vendors for bidding, installation and deployment process
Silver ISV Member for the OpenPOWER Consortium
• Scalable distributed training is becoming increasingly important
• Requires high-performance middleware designs while exploiting modern
interconnects
• Provided a set of different solutions to achieve scalable distributed
training
– Optimized collectives for CPU-based training
– CUDA-aware MPI with optimized collectives for GPU-based training
– TensorFlow-gRPC with RDMA support
– Efficient DL support over Big Data
• Will continue to enable the DL community to achieve scalability and high performance for their distributed training
Conclusions
Funding Acknowledgments
Funding Support by
Equipment Support by
Personnel Acknowledgments
Current Students (Graduate)
– A. Awan (Ph.D.)
– M. Bayatpour (Ph.D.)
– S. Chakraborthy (Ph.D.)
– C.-H. Chu (Ph.D.)
– S. Guganani (Ph.D.)
Past Students
– A. Augustine (M.S.)
– P. Balaji (Ph.D.)
– R. Biswas (M.S.)
– S. Bhagvat (M.S.)
– A. Bhat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– R. Rajachandrasekar (Ph.D.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
– J. Zhang (Ph.D.)
Past Research Scientist
– K. Hamidouche
– S. Sur
Past Post-Docs
– D. Banerjee
– X. Besseron
– H.-W. Jin
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– K. Kulkarni (M.S.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– M. Li (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– J. Hashmi (Ph.D.)
– A. Jain (Ph.D.)
– K. S. Khorassani (Ph.D.)
– P. Kousha (Ph.D.)
– D. Shankar (Ph.D.)
– J. Lin
– M. Luo
– E. Mancini
Current Research Asst. Professor
– X. Lu
Past Programmers
– D. Bureddy
– J. Perkins
Current Research Specialist
– J. Smith
– S. Marcarelli
– J. Vienne
– H. Wang
Current Post-doc
– A. Ruhela
– K. Manian
Current Students (Undergraduate)
– V. Gangal (B.S.)
– M. Haupt (B.S.)
– N. Sarkauskas (B.S.)
– A. Yeretzian (B.S.)
Past Research Specialist
– M. Arnold
Current Research Scientist
– H. Subramoni
Thank You!
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
panda@cse.ohio-state.edu
The High-Performance MPI/PGAS Project
http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project
http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
 
Data Parallel Deep Learning
Data Parallel Deep LearningData Parallel Deep Learning
Data Parallel Deep Learning
 
Making Supernovae with Jets
Making Supernovae with JetsMaking Supernovae with Jets
Making Supernovae with Jets
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolvers
 
Scientific Applications and Heterogeneous Architectures
Scientific Applications and Heterogeneous ArchitecturesScientific Applications and Heterogeneous Architectures
Scientific Applications and Heterogeneous Architectures
 

Recently uploaded

Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 

Recently uploaded (20)

Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 

Scalable and Distributed DNN Training on Modern HPC Systems

• 8. Scale-up and Scale-out
– Scale-up (intra-node communication): many improvements, e.g., NVIDIA cuDNN, cuBLAS, NCCL, and CUDA 9 Co-operative Groups
– Scale-out (inter-node communication): most DL frameworks are optimized for single-node use only; distributed (parallel) training is an emerging trend
• OSU-Caffe – MPI-based
• Microsoft CNTK – MPI/NCCL2
• Google TensorFlow – gRPC-based/MPI/NCCL2
• Facebook Caffe2 – hybrid (NCCL2/Gloo/MPI)
[Figure: runtimes and libraries (cuDNN, MKL-DNN, NCCL2, MPI, gRPC, Hadoop) positioned by scale-up vs. scale-out performance; the desired design delivers both]
• 9. Holistic Evaluation is Important!!
– "My framework is faster than your framework!" – such claims need to be understood in a holistic way
– Performance depends on the entire execution environment (the full stack); an isolated view of performance is not helpful
[Figure: the full DL stack – DL applications (image recognition, speech processing, etc.), DL frameworks (Caffe, TensorFlow, etc.), BLAS libraries (MKL 2017, cuDNN/cuBLAS, OpenBLAS, ATLAS, and others), and hardware (multi-/many-core Xeon/Xeon Phi, many-core GPUs such as Pascal P100, other processors), with generic, MKL-optimized, and cuDNN-optimized convolution layers]
A. A. Awan, H. Subramoni, and D. K. Panda, "An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures," MLHPC'17, ACM, Article 8.
• 10. Broad Challenge: Exploiting HPC for Deep Learning
– How to efficiently scale out a Deep Learning (DL) framework and take advantage of heterogeneous High Performance Computing (HPC) resources?
• 11. Research Challenges to Exploit HPC Technologies
1. What are the fundamental issues in designing DL frameworks? Memory requirements, computation requirements, and communication overhead
2. Why do we need to support distributed training? To overcome the limits of single-node training and to better utilize hundreds of existing HPC clusters
[Figure: major computation and communication phases (forward, backward, gradient aggregation, model propagation) in DL/ML frameworks (Caffe/OSU-Caffe, Caffe2, TensorFlow, MXNet, CNTK), layered over communication runtimes that support distributed training and HPC platforms (CPU, GPU, InfiniBand)]
• 12. Research Challenges to Exploit HPC Technologies (Cont'd)
3. What are the new design challenges brought forward by DL frameworks for communication runtimes? Large-message collective communication and reductions; GPU buffers (CUDA-awareness)
4. Can a co-design approach help in achieving scale-up and scale-out efficiently? Co-design the support at the runtime level and exploit it at the DL framework level; what performance benefits can be observed, and what needs to be fixed at the communication runtime layer?
[Figure: co-design opportunities spanning DL/ML frameworks (Caffe/OSU-Caffe, Caffe2, TensorFlow, MXNet, CNTK), communication runtimes (MPI/NCCL/Gloo/MLSL) with CUDA-awareness, large-message collectives, and point-to-point operations, and HPC platforms (CPU, GPU, InfiniBand)]
• 13. Multiple Approaches Taken Up by OSU
– MPI-driven Deep Learning (CPU-based and GPU-based)
– Co-designing Deep Learning stacks with high-performance MPI
– Out-of-core DNN training
– Accelerating TensorFlow on HPC systems
– Accelerating Big Data stacks
– Efficient Deep Learning over Big Data
• 14. Data Parallel Deep Learning and MPI Collectives
– Major MPI collectives involved in designing distributed frameworks (a minimal sketch follows below):
• MPI_Bcast – required for DNN parameter exchange
• MPI_Reduce – needed for gradient accumulation from multiple solvers
• MPI_Allreduce – use just one Allreduce instead of Reduce and Broadcast
[Figure: training loop across four GPUs – (1) data propagation via MPI_Bcast of the packed parameter buffer from GPU 0, (2) forward/backward pass on each GPU, (3) gradient aggregation via MPI_Reduce of the packed reduce buffers to GPU 0, followed by applying updates]
A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, "S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters," PPoPP '17.
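To make the Bcast/Allreduce pattern above concrete, here is a minimal data-parallel training step sketched with mpi4py; this is an illustration, not OSU-Caffe's implementation, and the array size, learning rate, and the random "gradients" standing in for a backward pass are placeholders.

```python
# Minimal data-parallel step: broadcast parameters once, Allreduce gradients per iteration.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Rank 0 initializes the model; everyone else receives it via MPI_Bcast.
params = np.random.rand(1024).astype(np.float32) if rank == 0 else np.empty(1024, dtype=np.float32)
comm.Bcast(params, root=0)

# Stand-in for a local forward/backward pass producing gradients.
local_grads = np.random.rand(1024).astype(np.float32)

# One Allreduce replaces the Reduce + Broadcast pair, as noted on the slide.
global_grads = np.empty_like(local_grads)
comm.Allreduce(local_grads, global_grads, op=MPI.SUM)
params -= 0.01 * (global_grads / size)   # apply the averaged gradient
```

Launched with a standard MPI launcher (e.g., mpirun -np 4 python the_script.py); with a CUDA-aware MPI the same pattern applies to GPU-resident buffers.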
• 15. Overview of the MVAPICH2 Project
– High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
• MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.1): started in 2001, first version available in 2002
• MVAPICH2-X (MPI + PGAS), available since 2011
• Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
• Support for virtualization (MVAPICH2-Virt), available since 2015
• Support for energy awareness (MVAPICH2-EA), available since 2015
• Support for InfiniBand network analysis and monitoring (OSU INAM), available since 2015
• Used by more than 3,000 organizations in 88 countries; more than 531,000 (>0.5 million) downloads directly from the OSU site
• Empowering many TOP500 clusters (Nov '18 ranking): 3rd-ranked 10,649,640-core Sunway TaihuLight at NSC Wuxi, China; 14th, 556,104 cores (Oakforest-PACS) in Japan; 17th, 367,024 cores (Stampede2) at TACC; 27th, 241,108 cores (Pleiades) at NASA; and many others
• Available with the software stacks of many vendors and Linux distros (RedHat, SuSE, and OpenHPC); http://mvapich.cse.ohio-state.edu
– Empowering Top500 systems for over a decade; partner in the upcoming TACC Frontera system
• 16. Architecture of MVAPICH2 Software Family
[Figure: high-performance parallel programming models (MPI, PGAS – UPC, OpenSHMEM, CAF, UPC++ – and hybrid MPI+X with PGAS/OpenMP/Cilk) built on a high-performance, scalable communication runtime with diverse APIs and mechanisms: point-to-point primitives, collective algorithms, energy awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, and introspection/analysis. Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path), modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU), transport protocols (RC, XRC, UD, DC), modern features (UMR, ODP, SR-IOV, multi-rail, MCDRAM*, NVLink*, CAPI*, XPMEM*; * upcoming), and transport mechanisms (shared memory, CMA, IVSHMEM)]
• 17. MVAPICH2 Software Family (CPU-Based Deep Learning)
– MVAPICH2: support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
– MVAPICH2-X: advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with a unified communication runtime
– MVAPICH2-GDR: optimized MPI for clusters with NVIDIA GPUs and for GPU-enabled Deep Learning applications
– MVAPICH2-Virt: high-performance and scalable MPI for hypervisor- and container-based HPC clouds
– MVAPICH2-EA: energy-aware and high-performance MPI
– MVAPICH2-MIC: optimized MPI for clusters with Intel KNC
– Microbenchmarks (OMB): microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs
– Tools: OSU INAM (network monitoring, profiling, and analysis for clusters, with MPI and scheduler integration) and OEMT (utility to measure the energy consumption of MPI applications)
• 18. Performance of CNTK with MVAPICH2-X on CPU-Based Deep Learning
– CPU-based training of the AlexNet neural network using the ImageNet ILSVRC2012 dataset
– Advanced XPMEM-based designs show up to 20% benefit over Intel MPI (IMPI) for CNTK DNN training using Allreduce
– The proposed designs show good scalability with increasing system size; available since the MVAPICH2-X 2.3rc1 release
[Chart: CNTK AlexNet training execution time (batch size = default, iterations = 50, ppn = 28) on 28–224 processes for Intel MPI, MVAPICH2, and MVAPICH2-XPMEM; annotations show 20% and 9% improvements]
J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. K. Panda, "Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores," IPDPS '18, May 2018.
• 19. Performance of TensorFlow with MVAPICH2-X on CPU
– CPU-based distributed TensorFlow benchmarks (tf_cnn_benchmarks); AlexNet model training on the ImageNet ILSVRC2012 dataset
– Advanced SALaR- and XPMEM-based designs in MVAPICH2-X showed good scalability
– Up to 15% and 35% improvements in images per second at 448 and 896 processes, respectively
[Chart: TensorFlow images per second (higher is better) on 112–896 processes for MVAPICH2 2.3rc2, SALaR-SHMEM, and SALaR-XPMEM]
M. Bayatpour, J. Hashmi, S. Chakraborty, H. Subramoni, P. Kousha, and D. K. Panda, "SALaR: Scalable and Adaptive Designs for Large Message Reduction Collectives," IEEE Cluster 2018, Sep 2018 [Best Paper in Architecture Track]; will be available in future MVAPICH2-X releases.
• 20. MVAPICH2 Software Family (GPU-Based Deep Learning)
– The same MVAPICH2 software family overview as slide 17; for GPU-based Deep Learning the key component is MVAPICH2-GDR, the optimized MPI for clusters with NVIDIA GPUs and for GPU-enabled Deep Learning applications
• 21. GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
– At the sender: MPI_Send(s_devbuf, size, …); at the receiver: MPI_Recv(r_devbuf, size, …) – device buffers are handled inside MVAPICH2 (a sketch of this usage follows below)
– Standard MPI interfaces used for unified data movement
– Takes advantage of Unified Virtual Addressing (CUDA 4.0 and later)
– Overlaps data movement from the GPU with RDMA transfers
– High performance and high productivity
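As a hedged illustration of this programming model (not of MVAPICH2-GDR internals), the sketch below passes GPU buffers directly to Send/Recv from Python. It assumes mpi4py built against a CUDA-aware MPI such as MVAPICH2-GDR and uses CuPy arrays for the device buffers that the slide calls s_devbuf and r_devbuf.

```python
# Point-to-point communication straight from GPU memory with a CUDA-aware MPI.
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    s_devbuf = cp.arange(1 << 20, dtype=cp.float32)   # buffer resides in GPU memory
    comm.Send(s_devbuf, dest=1, tag=77)                # no explicit staging copy to the host
elif rank == 1:
    r_devbuf = cp.empty(1 << 20, dtype=cp.float32)
    comm.Recv(r_devbuf, source=0, tag=77)              # received directly into GPU memory
    print("rank 1 received", float(r_devbuf[-1]))
```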
• 22. CUDA-Aware MPI: MVAPICH2-GDR 1.8–2.3.1 Releases
– Support for MPI communication from NVIDIA GPU device memory
– High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU)
– High-performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host, and Host-GPU)
– Taking advantage of CUDA IPC (available since CUDA 4.1) for intra-node communication with multiple GPU adapters per node
– Optimized and tuned collectives for GPU device buffers
– MPI datatype support for point-to-point and collective communication from GPU device buffers
– Unified memory support
• 23. MVAPICH2-GDR: Prerequisites for OpenPOWER and x86 Systems
– MVAPICH2-GDR 2.3.1 requires the following software to be installed on your system:
1. Mellanox OFED 3.2 or later
2. NVIDIA Driver 367.48 or later
3. NVIDIA CUDA Toolkit 7.5 or later
4. NVIDIA Peer Memory (nv_peer_mem) module, to enable GPUDirect RDMA (GDR) support
– Strongly recommended for best performance:
5. GDRCOPY library by NVIDIA: https://github.com/NVIDIA/gdrcopy
– Comprehensive instructions are in the MVAPICH2-GDR user guide: http://mvapich.cse.ohio-state.edu/userguide/gdr/
• 24. MVAPICH2-GDR: Download and Setup on OpenPOWER and x86 Systems
– Simple installation steps for both systems
– Pick the right MVAPICH2-GDR RPM from the Downloads page: http://mvapich.cse.ohio-state.edu/downloads/
• e.g., http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3/mofed4.5/mvapich2-gdr-mcast.cuda10.0.mofed4.5.gnu4.8.5-2.3-1.el7.x86_64.rpm (== <mv2-gdr-rpm-name>.rpm)
– Download: $ wget http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3/<mv2-gdr-rpm-name>.rpm
– Root users: $ rpm -Uvh --nodeps <mv2-gdr-rpm-name>.rpm
– Non-root users: $ rpm2cpio <mv2-gdr-rpm-name>.rpm | cpio -id
– Contact the MVAPICH help list with any questions related to the package: mvapich-help@cse.ohio-state.edu
• 25. MVAPICH2-GDR 2.3.1
– Released on 03/16/2018
– Major features and enhancements:
• Based on MVAPICH2 2.3.1
• Enhanced intra-node and inter-node point-to-point performance for DGX-2 and IBM POWER8/POWER9 systems
• Enhanced Allreduce performance for DGX-2 and IBM POWER8/POWER9 systems
• Enhanced small-message performance for CUDA-aware MPI_Put and MPI_Get
• Support for PGI 18.10
• Flexible support for running TensorFlow (Horovod) jobs
• Support for Volta (V100) GPUs
• Support for OpenPOWER with NVLink
• Efficient multiple-CUDA-stream-based IPC communication for multi-GPU systems with and without NVLink
• Leverages the Linux Cross Memory Attach (CMA) feature for enhanced host-based communication
• InfiniBand Multicast (IB-MCAST)-based designs for GPU-based broadcast and streaming applications
• Efficient broadcast designs for Deep Learning applications
• 26. Optimized MVAPICH2-GDR Design
[Charts: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth vs. message size for MV2 (no GDR) vs. MV2-GDR 2.3; ~1.85 us small-message latency (up to 11x better), ~9x higher bandwidth, and ~10x higher bi-bandwidth]
– Platform: MVAPICH2-GDR 2.3, Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPUDirect RDMA
• 27. Distributed Training using TensorFlow (TF)
– TensorFlow is the most popular DL framework
– gRPC is the official distributed training runtime – many problems for HPC use cases
– Community efforts: Baidu and Uber's Horovod have added MPI support to TF across nodes (a minimal Horovod sketch follows below)
– Need to understand the several options currently available: accelerated gRPC, gRPC+X (gRPC+MPI, gRPC+Verbs, gRPC+GDR), and no-gRPC designs (Baidu-MPI and Horovod over MPI or NCCL)
Awan et al., "Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation," CCGrid '19. https://arxiv.org/abs/1810.11112
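For context, here is a minimal Horovod-over-TensorFlow 1.x sketch of the MPI-driven pattern mentioned above (the API generation current at the time of this talk). It is an illustration only: build_model_loss() is a hypothetical placeholder for the user's model, and the job would be launched with an mpirun-style command on top of an MPI library such as MVAPICH2-GDR.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                                       # one process per GPU
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())   # pin each rank to a GPU

loss = build_model_loss()                                        # hypothetical user model/loss
opt = tf.train.MomentumOptimizer(0.01 * hvd.size(), 0.9)         # scale LR with the number of ranks
opt = hvd.DistributedOptimizer(opt)                              # Allreduce-averaged gradients
train_op = opt.minimize(loss)

hooks = [hvd.BroadcastGlobalVariablesHook(0)]                    # rank 0 broadcasts initial parameters
with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```

Whether the underlying Allreduce is executed by MPI or NCCL is a Horovod build/runtime choice, which is exactly the design space compared on the next slide.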
• 28. Scalable TensorFlow using Horovod, MPI, and NCCL
– Efficient Allreduce is crucial for Horovod's overall training performance; both MPI and NCCL designs are available
– We have evaluated Horovod extensively and compared it against a wide range of designs using gRPC and gRPC extensions
– MVAPICH2-GDR achieved up to 90% scaling efficiency for ResNet-50 training on 64 Pascal GPUs
[Charts: images per second (higher is better) for Horovod-MPI, Horovod-NCCL2, the proposed Horovod-MPI-Opt, and ideal scaling, on 1–16 GPUs and on 1–64 nodes]
Awan et al., "Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation," CCGrid '19. https://arxiv.org/abs/1810.11112
• 29. MVAPICH2-GDR: Allreduce Comparison with Baidu-allreduce and OpenMPI
– 16 GPUs (4 nodes): MVAPICH2-GDR vs. Baidu-allreduce and OpenMPI 3.0
– MVAPICH2-GDR is ~2x better than Baidu-allreduce, and up to ~4x, ~10x, and ~30x better across the larger message-size ranges shown; OpenMPI is ~5x slower than Baidu-allreduce
– Available since MVAPICH2-GDR 2.3a
[Charts: Allreduce latency vs. message size over small (4 B–256 KB), medium (512 KB–4 MB), and large (8 MB–512 MB) ranges for MVAPICH2, Baidu-allreduce, and OpenMPI]
• 30. MVAPICH2-GDR vs. NCCL2 – Allreduce Operation
– Optimized designs in MVAPICH2-GDR 2.3 offer better or comparable performance for most cases
– MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 16 GPUs: up to ~3x better at small message sizes and ~1.2x better at large message sizes
– Platform: Intel Xeon (Broadwell) nodes with dual-socket CPUs, one K-80 GPU each, and an EDR InfiniBand interconnect
[Charts: Allreduce latency vs. message size for MVAPICH2-GDR and NCCL2]
• 31. MVAPICH2-GDR vs. NCCL2 – Allreduce Operation (DGX-2)
– Optimized designs in the upcoming MVAPICH2-GDR offer better or comparable performance for most cases
– MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on one DGX-2 node (16 Volta GPUs): up to ~2.5x better at small message sizes and ~1.7x better at large message sizes
– Platform: NVIDIA DGX-2 system (16 NVIDIA Volta GPUs connected with NVSwitch), CUDA 9.2
[Charts: Allreduce latency vs. message size for MVAPICH2-GDR-Next and NCCL 2.3]
• 32. Distributed Training with TensorFlow and MVAPICH2-GDR
– ResNet-50 training using the TensorFlow benchmark on one DGX-2 node (8 Volta GPUs): up to 7.5% higher throughput than NCCL 2.3
– Scaling Efficiency = (Actual throughput / Ideal throughput at scale) × 100% (see the small helper below)
– Platform: NVIDIA DGX-2 system (16 NVIDIA Volta GPUs connected with NVSwitch), CUDA 9.2
[Charts: images per second and scaling efficiency (%) on 1–8 GPUs for NCCL 2.3 vs. MVAPICH2-GDR-Next]
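The scaling-efficiency metric defined above can be computed with a few lines of Python; the throughput numbers in this example are hypothetical and are not measurements from the talk.

```python
def scaling_efficiency(actual_imgs_per_sec, single_gpu_imgs_per_sec, num_gpus):
    """Scaling efficiency = actual throughput / ideal throughput at scale * 100%."""
    ideal = single_gpu_imgs_per_sec * num_gpus
    return 100.0 * actual_imgs_per_sec / ideal

# Hypothetical example: 8 GPUs delivering 2600 images/s vs. 350 images/s on one GPU.
print(round(scaling_efficiency(2600.0, 350.0, 8), 1))   # ~92.9 (% efficiency)
```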
• 33. OSU-Caffe: Scalable Deep Learning
– Caffe: a flexible and layered Deep Learning framework
– Benefits and weaknesses: multi-GPU training within a single node, but performance degradation for GPUs across different sockets and limited scale-out
– OSU-Caffe: MPI-based parallel training
• Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
• Scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset
• Scale-out on 128 GPUs for training the GoogLeNet network on the ImageNet dataset
[Chart: GoogLeNet (ImageNet) training time on 8–128 GPUs for Caffe vs. OSU-Caffe with batch sizes 1024 and 2048; Caffe at 128 GPUs is an invalid use case]
– OSU-Caffe is publicly available from http://hidl.cse.ohio-state.edu/
• 34. Training Large (Out-of-Core) Models
– Large DNNs cannot be trained on GPUs due to memory limitations!
• ResNet-50 for image recognition: current frameworks can only go up to a small batch size of 45
• Next-generation models such as Neural Machine Translation (NMT) are extremely large, consist of billions of parameters, and require even more memory
• Can we design out-of-core DNN training support using new software features in CUDA 8/9 and hardware mechanisms in Pascal/Volta GPUs? (A minimal managed-memory sketch follows below.)
– The general intuition is that managed allocations "will be" slow!
• The proposed framework, OC-Caffe (Out-of-Core Caffe), shows the potential of managed-memory designs that provide performance with negligible or no overhead
– OC-Caffe-Opt is up to 80% better than Intel-optimized CPU Caffe for ResNet-50 training on the Volta V100 GPU with CUDA 9 and cuDNN 7; caffe-gpu cannot run this case at all
[Charts: trainability (memory requirements) vs. batch size for AlexNet, GoogLeNet, and VGG relative to the 16 GB P100 memory limit; images/sec for caffe-gpu (cannot run), oc-caffe-naïve, oc-caffe-opt, caffe-cpu, intel-caffe, and intel-caffe-opt]
A. Awan et al., "OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training," HiPC '18.
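The managed-memory idea behind this work can be illustrated, very loosely and not as OC-Caffe's actual design, with CUDA Unified Memory from Python via CuPy. The ~24 GB allocation below is an assumed oversubscription of a 16 GB GPU; whether and how well this works depends on the GPU generation (Pascal/Volta) and driver, and the sizes are illustrative only.

```python
# Route CuPy allocations through cudaMallocManaged so they can exceed GPU memory.
import cupy as cp

cp.cuda.set_allocator(cp.cuda.MemoryPool(cp.cuda.malloc_managed).malloc)

# ~24 GB of float32: larger than a 16 GB P100/V100; the driver pages data
# between host and device on demand as kernels touch it (with some overhead).
big = cp.zeros((6 * 1024, 1024, 1024), dtype=cp.float32)
big += 1.0                       # triggers on-demand page migration to the GPU
print(float(big[0, 0, 0]))       # 1.0
```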
• 35. Multiple Approaches Taken Up by OSU
– MPI-driven Deep Learning
– Co-designing Deep Learning stacks with high-performance MPI
– Out-of-core DNN training
– Accelerating TensorFlow on HPC systems
– Accelerating Big Data stacks
– Efficient Deep Learning over Big Data
• 36. Architecture Overview of gRPC
– Key features:
• Simple service definition
• Works across languages and platforms (C++, Java, Python, Android Java, etc.; Linux, Mac, Windows)
• Start quickly and scale
• Bi-directional streaming and integrated authentication
• Used by Google (several of Google's cloud products and externally facing APIs, TensorFlow), Netflix, Docker, Cisco, Juniper Networks, etc.
• Uses sockets for communication!
– Typical setting: large-scale distributed systems composed of microservices
Source: http://www.grpc.io/
• 37. Performance Benefits for RDMA-gRPC with Micro-Benchmarks
– gRPC-RDMA RPC latency on SDSC Comet (FDR):
• Up to 2.7x speedup over IPoIB for small messages
• Up to 2.8x speedup over IPoIB for medium messages
• Up to 2.5x speedup over IPoIB for large messages
[Charts: RPC latency vs. payload size (2 B–8 KB, 16 KB–512 KB, and 1 MB–8 MB) for default gRPC vs. OSU RDMA gRPC]
R. Biswas, X. Lu, and D. K. Panda, "Accelerating gRPC and TensorFlow with RDMA for High-Performance Deep Learning over InfiniBand," HiPC '18.
• 38. Performance Benefit for RDMA-TensorFlow (Inception3)
– TensorFlow Inception3 performance evaluation on an IB EDR cluster:
• Up to 20% speedup over default gRPC (IPoIB) for 8 GPUs (4 nodes)
• Up to 34% speedup over default gRPC (IPoIB) for 16 GPUs (8 nodes)
• Up to 37% speedup over default gRPC (IPoIB) for 24 GPUs (12 nodes)
[Charts: images per second vs. batch size (16, 32, 64) for gRPC (IPoIB-100Gbps), Verbs (RDMA-100Gbps), MPI (RDMA-100Gbps), and AR-gRPC (RDMA-100Gbps)]
R. Biswas, X. Lu, and D. K. Panda, "Accelerating TensorFlow with Adaptive RDMA-based gRPC," HiPC '18.
• 39. RDMA-TensorFlow Distribution
– High-performance design of TensorFlow over RDMA-enabled interconnects:
• High-performance RDMA-enhanced design with native InfiniBand support at the verbs level for gRPC and TensorFlow
• RDMA-based data communication
• Adaptive communication protocols
• Dynamic message chunking and accumulation
• Support for RDMA device selection
• Easily configurable for different protocols (native InfiniBand and IPoIB)
– Current release: 0.9.1
• Based on Google TensorFlow 1.3.0
• Tested with Mellanox InfiniBand adapters (e.g., EDR), NVIDIA GPGPU K80, CUDA 8.0, and cuDNN 5.0
• http://hidl.cse.ohio-state.edu
• 40. Multiple Approaches Taken Up by OSU
– MPI-driven Deep Learning
– Co-designing Deep Learning stacks with high-performance MPI
– Out-of-core DNN training
– Accelerating TensorFlow on HPC systems
– Accelerating Big Data stacks
– Efficient Deep Learning over Big Data
• 41. The High-Performance Big Data (HiBD) Project
– RDMA for Apache Spark
– RDMA for Apache Hadoop 3.x (RDMA-Hadoop-3.x)
– RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x), with plugins for Apache, Hortonworks (HDP), and Cloudera (CDH) Hadoop distributions
– RDMA for Apache Kafka
– RDMA for Apache HBase
– RDMA for Memcached (RDMA-Memcached)
– RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
– OSU HiBD-Benchmarks (OHB): HDFS, Memcached, HBase, and Spark micro-benchmarks
– http://hibd.cse.ohio-state.edu
– User base: 305 organizations from 35 countries; more than 29,500 downloads from the project site
– Available for InfiniBand and RoCE (also runs on Ethernet), for x86 and OpenPOWER, with support for Singularity and Docker
• 42. Performance Numbers of RDMA for Apache Hadoop 2.x – RandomWriter and TeraGen on the OSU-RI2 (EDR) Cluster
– 8 nodes with a total of 64 maps
– RandomWriter: 3x improvement over IPoIB for 80–160 GB file sizes
– TeraGen: 4x improvement over IPoIB for 80–240 GB file sizes
[Charts: execution time vs. data size (GB) for IPoIB (EDR) vs. OSU-IB (EDR); reduced by 3x (RandomWriter) and 4x (TeraGen)]
• 43. Performance Evaluation on SDSC Comet – HiBench PageRank
– InfiniBand FDR, SSD, 32/64 worker nodes, 768/1536 cores (768/1536 maps, 768/1536 reduces)
– RDMA-based design for Spark 1.5.1; RDMA vs. IPoIB with 768/1536 concurrent tasks, a single SSD per node (a minimal PageRank sketch follows below)
– 32 nodes / 768 cores: total time reduced by 37% over IPoIB (56 Gbps)
– 64 nodes / 1536 cores: total time reduced by 43% over IPoIB (56 Gbps)
[Charts: PageRank total time for the Huge, BigData, and Gigantic data sizes, IPoIB vs. RDMA]
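For readers unfamiliar with the workload, the sketch below is a minimal PageRank in PySpark in the style of the standard Spark example; it is not HiBench's implementation, and the toy edge list stands in for the HiBench-generated input that real runs would read from HDFS.

```python
from pyspark import SparkContext

sc = SparkContext(appName="PageRankSketch")

# Toy (page, neighbor) edge list; HiBench generates a much larger graph.
links = sc.parallelize([("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]) \
          .groupByKey().cache()
ranks = links.mapValues(lambda _: 1.0)

for _ in range(10):  # fixed number of iterations
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]])
    ranks = contribs.reduceByKey(lambda a, b: a + b) \
                    .mapValues(lambda r: 0.15 + 0.85 * r)

print(sorted(ranks.collect()))
sc.stop()
```

The heavy shuffle traffic in join/reduceByKey is the part of this workload that the RDMA-based Spark design accelerates.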
• 44. High-Performance Deep Learning over Big Data (DLoBD) Stacks
– Challenges of Deep Learning over Big Data (DLoBD):
• Can RDMA-based designs in DLoBD stacks improve performance, scalability, and resource utilization on high-performance interconnects, GPUs, and multi-core CPUs?
• What are the performance characteristics of representative DLoBD stacks on RDMA networks?
– Characterization of DLoBD stacks:
• CaffeOnSpark, TensorFlowOnSpark, and BigDL
• IPoIB vs. RDMA; in-band vs. out-of-band communication; CPU vs. GPU; etc.
• Performance, accuracy, scalability, and resource utilization
• RDMA-based DLoBD stacks (e.g., BigDL over RDMA-Spark) can achieve a 2.6x speedup compared to the IPoIB-based scheme while maintaining similar accuracy
[Chart: epoch time and accuracy vs. epoch number for IPoIB and RDMA; 2.6x faster]
X. Lu, H. Shi, M. H. Javed, R. Biswas, and D. K. Panda, "Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-capable Networks," HotI 2017.
• 45. Commercial Support for MVAPICH2, HiBD, and HiDL Libraries
– Supported through X-ScaleSolutions (http://x-scalesolutions.com)
– Benefits:
• Help and guidance with installation of the library
• Platform-specific optimizations and tuning
• Timely support for operational issues encountered with the library
• Web portal interface to submit issues and track their progress
• Advanced debugging techniques
• Application-specific optimizations and tuning
• Guidelines on best practices
• Periodic information on major fixes, updates, and releases
• Help with upgrading to the latest release
• Flexible service-level agreements
– Support provided to Lawrence Livermore National Laboratory (LLNL) for the last two years
• 46. Silver ISV Member of the OpenPOWER Consortium
– Recently joined the OpenPOWER Consortium as a silver ISV member
– Provides the flexibility to have the MVAPICH2, HiDL, and HiBD libraries integrated into the OpenPOWER software stack, to be part of the OpenPOWER ecosystem, and to participate with different vendors in the bidding, installation, and deployment process
• 47. Conclusions
– Scalable distributed training is becoming increasingly important
– It requires high-performance middleware designs that exploit modern interconnects
– Provided a set of different solutions to achieve scalable distributed training:
• Optimized collectives for CPU-based training
• CUDA-aware MPI with optimized collectives for GPU-based training
• TensorFlow-gRPC with RDMA support
• Efficient DL support over Big Data
– Will continue to enable the DL community to achieve scalability and high performance for their distributed training
• 48. Funding Acknowledgments
[Slide of logos acknowledging funding support and equipment support]
  • 49. HPCAC-Swiss (April ‘19) 49Network Based Computing Laboratory Personnel Acknowledgments Current Students (Graduate) – A. Awan (Ph.D.) – M. Bayatpour (Ph.D.) – S. Chakraborthy (Ph.D.) – C.-H. Chu (Ph.D.) – S. Guganani (Ph.D.) Past Students – A. Augustine (M.S.) – P. Balaji (Ph.D.) – R. Biswas (M.S.) – S. Bhagvat (M.S.) – A. Bhat (M.S.) – D. Buntinas (Ph.D.) – L. Chai (Ph.D.) – B. Chandrasekharan (M.S.) – N. Dandapanthula (M.S.) – V. Dhanraj (M.S.) – T. Gangadharappa (M.S.) – K. Gopalakrishnan (M.S.) – R. Rajachandrasekar (Ph.D.) – G. Santhanaraman (Ph.D.) – A. Singh (Ph.D.) – J. Sridhar (M.S.) – S. Sur (Ph.D.) – H. Subramoni (Ph.D.) – K. Vaidyanathan (Ph.D.) – A. Vishnu (Ph.D.) – J. Wu (Ph.D.) – W. Yu (Ph.D.) – J. Zhang (Ph.D.) Past Research Scientist – K. Hamidouche – S. Sur Past Post-Docs – D. Banerjee – X. Besseron – H.-W. Jin – W. Huang (Ph.D.) – W. Jiang (M.S.) – J. Jose (Ph.D.) – S. Kini (M.S.) – M. Koop (Ph.D.) – K. Kulkarni (M.S.) – R. Kumar (M.S.) – S. Krishnamoorthy (M.S.) – K. Kandalla (Ph.D.) – M. Li (Ph.D.) – P. Lai (M.S.) – J. Liu (Ph.D.) – M. Luo (Ph.D.) – A. Mamidala (Ph.D.) – G. Marsh (M.S.) – V. Meshram (M.S.) – A. Moody (M.S.) – S. Naravula (Ph.D.) – R. Noronha (Ph.D.) – X. Ouyang (Ph.D.) – S. Pai (M.S.) – S. Potluri (Ph.D.) – J. Hashmi (Ph.D.) – A. Jain (Ph.D.) – K. S. Khorassani (Ph.D.) – P. Kousha (Ph.D.) – D. Shankar (Ph.D.) – J. Lin – M. Luo – E. Mancini Current Research Asst. Professor – X. Lu Past Programmers – D. Bureddy – J. Perkins Current Research Specialist – J. Smith – S. Marcarelli – J. Vienne – H. Wang Current Post-doc – A. Ruhela – K. Manian Current Students (Undergraduate) – V. Gangal (B.S.) – M. Haupt (B.S.) – N. Sarkauskas (B.S.) – A. Yeretzian (B.S.) Past Research Specialist – M. Arnold Current Research Scientist – H. Subramoni
• 50. Thank You!
– panda@cse.ohio-state.edu
– Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
– The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
– The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
– The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/