HW & SW PLATFORMS FOR HPC, AI AND ML
Gunter Roth, Senior Solution Architect (gunterr@nvidia.com)
ACCELERATED COMPUTING
Performance & Energy Efficiency
VDI | Data Analytics | AI / Deep Learning | High Performance Computing
NVIDIA TESLA PLATFORM
World’s Leading Data Center Platform for Accelerating HPC and AI
Customer use cases: supercomputing (molecular simulations, weather forecasting, seismic mapping), enterprise applications (manufacturing, healthcare, engineering), consumer internet (speech, translation, recommenders)
Industry frameworks & applications: 550+ accelerated applications, including Amber, NAMD, LAMMPS, CHROMA
NVIDIA SDK & libraries: CUDA, cuDNN, NCCL, TensorRT, cuBLAS, DeepStream, cuSPARSE, cuFFT, cuRAND
Tesla GPUs & systems: Tesla GPU, NVIDIA DGX family, NVIDIA HGX, system OEMs, cloud
NVIDIA POWERS WORLD'S FASTEST SUPERCOMPUTER
Summit Becomes First System To Scale The 100 Petaflops Milestone
27,648 Volta Tensor Core GPUs | 122 PF HPC | 3 EF AI
NVIDIA POWERS TODAY’S FASTEST SUPERCOMPUTERS
22 of Top 25 Greenest
ORNL Summit, World’s Fastest: 27,648 GPUs | 149 PF
LLNL Sierra, World’s 2nd Fastest: 17,280 GPUs | 95 PF
ABCI, Japan’s Fastest: 4,352 GPUs | 20 PF
Piz Daint, Europe’s Fastest: 5,704 GPUs | 21 PF
Total Pangea 3, Fastest Industrial: 3,348 GPUs | 18 PF
GPUS FOR HPC AND DEEP LEARNING
NVIDIA Tesla V100
Huge demand on compute power (FLOPS):
5,120 energy-efficient cores + Tensor Cores
7.8 TF double precision (FP64), 15.7 TF single precision (FP32), 125 Tensor TFLOP/s mixed precision
Huge demand on communication and memory bandwidth:
NVLink: 6 links per GPU at 50 GB/s bidirectional, for maximum scalability between GPUs
CoWoS with HBM2: 900 GB/s memory bandwidth, unifying compute & memory in a single package
NCCL: high-performance multi-GPU and multi-node collective communication primitives optimized for NVIDIA GPUs
GPUDirect / GPUDirect RDMA: direct communication between GPUs, eliminating the CPU from the critical path
ANNOUNCING TESLA T4
WORLD’S MOST ADVANCED INFERENCE GPU
Universal Inference Acceleration
320 Turing Tensor Cores
2,560 CUDA cores
65 FP16 TFLOPS | 130 INT8 TOPS | 260 INT4 TOPS
16 GB | 320 GB/s
DGX-STATION / DGX-1 / DGX-2 / HGX-2 / SUPERPOD
NVIDIA DGX STATION
AI supercomputer for the desk
4x Tesla V100 connected via NVLink (60 TFLOPS FP32, 0.5 PFLOPS Tensor performance)
Xeon CPU, 256 GB memory
Storage: 3x 1.92 TB SSD RAID 0 (data), 1x 1.92 TB SSD (OS)
Dual 10 GbE
1500 W, water-cooled → quiet
Optimized deep learning software across the entire stack
Containerized frameworks
Always up to date via the cloud
NVIDIA DGX-1
AI supercomputer-appliance-in-a-box
8x Tesla V100 connected via NVLink (125 TFLOPS FP32, 1 PFLOPS Tensor Core performance)
Dual Xeon CPU, 512 GB memory
7 TB SSD deep learning cache
Dual 10 GbE, quad IB 100 Gb
3RU, 3200 W
Optimized deep learning software across the entire stack
Containerized frameworks
Always up to date via the cloud
NVIDIA DGX-2
Two GPU boards: 8 NVIDIA Tesla V100 32GB GPUs per board, 6 NVSwitches per board, 512 GB total HBM2 memory, interconnected by plane card
Twelve NVSwitches: 2.4 TB/s bisection bandwidth
Eight EDR InfiniBand / 100 GigE: 1,600 Gb/s total bidirectional bandwidth
PCIe switch complex
Two Intel Xeon Platinum CPUs
1.5 TB system memory
30 TB NVMe SSD internal storage
Dual 10/25 Gb/s Ethernet
ANNOUNCING NVIDIA DGX SUPERPOD
AI LEADERSHIP REQUIRES AI INFRASTRUCTURE LEADERSHIP
Test bed for highest-performance scale-up systems
• 9.4 PF on HPL | ~200 AI PF | #22 on Top500 list
• <2 minutes to train ResNet-50
Modular & scalable GPU SuperPOD architecture
• Built in 3 weeks
• Optimized for compute, networking, storage & software
Integrates fully optimized software stacks
• Freely available through NGC
System: 96 DGX-2H | 1,536 V100 Tensor Core GPUs | 10 Mellanox EDR IB per node | 1 megawatt of power
Autonomous Vehicles | Speech AI | Healthcare | Graphics | HPC
PROJECT: FEEDING DATA-HUNGRY GPUS
GPU-OPTIMIZED DATA CENTERS
Clusters with GPUDirect Storage
CUFILE AND GPUDIRECT STORAGE
cuFile API: for applications
NVFS driver API: for filesystem, block, and storage drivers
[Diagram: architecture of the stack. The application calls the cuFile API through CUDA; in the OS kernel, the NVFS driver connects the filesystem driver, block IO driver, and storage driver.]
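The cuFile calls below postdate this deck's announcement, but as a rough sketch of what a GPUDirect Storage read looks like through this stack (path, sizes, and error handling are illustrative assumptions):

#include <cufile.h>
#include <cuda_runtime.h>
#include <fcntl.h>
#include <unistd.h>

// Read nbytes from the start of 'path' directly into GPU memory at devPtr.
void gds_read(const char *path, void *devPtr, size_t nbytes)
{
    cuFileDriverOpen();                          // bring up the GDS driver

    int fd = open(path, O_RDONLY | O_DIRECT);    // O_DIRECT: no page-cache bounce
    CUfileDescr_t descr = {};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t handle;
    cuFileHandleRegister(&handle, &descr);
    cuFileBufRegister(devPtr, nbytes, 0);        // register the GPU target buffer

    cuFileRead(handle, devPtr, nbytes, 0, 0);    // file offset 0 -> device memory

    cuFileBufDeregister(devPtr);
    cuFileHandleDeregister(handle);
    close(fd);
    cuFileDriverClose();
}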
FOR MORE INFORMATION
Join the GPUDirect Storage interest list to provide feedback and to extend support to other filesystems.
Technical blog and sign-up link: https://devblogs.nvidia.com/gpudirect-storage/
VOLTA TENSOR CORE
TENSOR CORES FOR SCIENCE
Mixed-Precision Computing
[Chart: V100 TFLOPS, FP64 + multi-precision: 7.8 FP64, 15.7 FP32, 125 mixed-precision Tensor]
PLASMA FUSION APPLICATION: FP16 solver, 3.5x faster
EARTHQUAKE SIMULATION: FP16-FP21-FP32-FP64, 25x faster
MIXED-PRECISION WEATHER PREDICTION: FP16/FP32/FP64, 4x faster
TENSOR CORE
Mixed-precision matrix math on 4x4 matrices
New CUDA TensorOp instructions & data formats
4x4x4 matrix processing array
D[FP32] = A[FP16] * B[FP16] + C[FP32]
Using Tensor Cores via:
• Volta-optimized frameworks and libraries (cuDNN, cuBLAS, TensorRT, ...); a cuBLAS sketch follows below
• CUDA C++ warp-level matrix operations
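Before the warp-level code on the next slide, a hedged sketch of the library route: the same D = A*B + C pattern through cublasGemmEx, with FP16 inputs and FP32 accumulation (buffer names and the column-major leading dimensions are illustrative):

#include <cublas_v2.h>
#include <cuda_fp16.h>

// C (FP32, m x n) += A (FP16, m x k) * B (FP16, k x n) on Tensor Cores.
void gemm_tensor_cores(cublasHandle_t handle, int m, int n, int k,
                       const half *dA, const half *dB, float *dC)
{
    const float alpha = 1.0f, beta = 1.0f;
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);   // opt in to Tensor Cores
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &alpha,
                 dA, CUDA_R_16F, m,             // FP16 inputs
                 dB, CUDA_R_16F, k,
                 &beta,
                 dC, CUDA_R_32F, m,             // FP32 output
                 CUDA_R_32F,                    // FP32 accumulation
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}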
TURING TENSOR CORE
USING TENSOR CORES
Volta-optimized frameworks and libraries: NVIDIA cuDNN, cuBLAS, TensorRT
CUDA C++ warp-level matrix operations (fragment template arguments filled in so the slide’s sketch compiles):

#include <mma.h>
using namespace nvcuda;

__device__ void tensor_op_16_16_16(
    float *d, half *a, half *b, float *c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> Bmat;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;

    wmma::load_matrix_sync(Amat, a, 16);   // load 16x16 FP16 tiles
    wmma::load_matrix_sync(Bmat, b, 16);
    wmma::fill_fragment(Cmat, 0.0f);       // zero the FP32 accumulator (c unused here)
    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);
    wmma::store_matrix_sync(d, Cmat, 16, wmma::mem_row_major);
}
LINEAR ALGEBRA + TENSOR CORES
Double-precision LU decomposition:
• Compute initial solution in FP16
• Iteratively refine to FP64 (a CPU sketch of the loop follows below)
Achieved FP64 Tflop/s: 26 (device FP64 peak: 7.8)
[Chart: Tflop/s vs. matrix size (2k to 34k) for FP16-TC (Tensor Cores) hgetrf LU, FP16 hgetrf LU, FP32 sgetrf LU, and FP64 dgetrf LU]
Data courtesy of Azzam Haidar, Stanimire Tomov & Jack Dongarra, Innovative Computing Laboratory, University of Tennessee
“Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers”, A. Haidar, P. Wu, S. Tomov, J. Dongarra, SC’17
GTC 2018 Poster P8237: Harnessing GPU’s Tensor Cores Fast FP16 Arithmetic to Speedup Mixed-Precision Iterative Refinement Solves
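To make the refinement loop concrete, here is a small self-contained CPU sketch of the same idea, with float standing in for the FP16 factorization and double for the FP64 refinement (the 2x2 system is an arbitrary example):

#include <cstdio>
#include <vector>

// Solve A*x = b by Gaussian elimination in precision T (no pivoting, for brevity).
template <typename T>
std::vector<T> solve(std::vector<T> A, std::vector<T> b, int n)
{
    for (int k = 0; k < n; ++k)
        for (int i = k + 1; i < n; ++i) {
            T f = A[i*n + k] / A[k*n + k];
            for (int j = k; j < n; ++j) A[i*n + j] -= f * A[k*n + j];
            b[i] -= f * b[k];
        }
    std::vector<T> x(n);
    for (int i = n - 1; i >= 0; --i) {
        T s = b[i];
        for (int j = i + 1; j < n; ++j) s -= A[i*n + j] * x[j];
        x[i] = s / A[i*n + i];
    }
    return x;
}

int main()
{
    const int n = 2;
    std::vector<double> A = {4.0, 1.0, 1.0, 3.0}, b = {1.0, 2.0};
    // Low-precision copy of the system (stand-in for the FP16 LU on Tensor Cores).
    std::vector<float> Af(A.begin(), A.end()), bf(b.begin(), b.end());
    std::vector<float> xf = solve(Af, bf, n);
    std::vector<double> x(xf.begin(), xf.end());

    for (int it = 0; it < 5; ++it) {
        std::vector<double> r(n);              // residual r = b - A*x in FP64
        for (int i = 0; i < n; ++i) {
            r[i] = b[i];
            for (int j = 0; j < n; ++j) r[i] -= A[i*n + j] * x[j];
        }
        std::vector<float> rf(r.begin(), r.end());
        std::vector<float> d = solve(Af, rf, n); // correction in low precision
        for (int i = 0; i < n; ++i) x[i] += d[i]; // applied in FP64
    }
    printf("x = %.15f %.15f\n", x[0], x[1]);
}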
OPENACC
OPENACC IS FOR MULTICORE, MANYCORE & GPUS

CPU:
% pgfortran -ta=multicore -fast -Minfo=acc -c update_tile_halo_kernel.f90
. . .
100, Loop is parallelizable
     Generating Multicore code
100, !$acc loop gang
102, Loop is parallelizable

GPU:
% pgfortran -ta=tesla,cc35,cc60 -fast -Minfo=acc -c update_tile_halo_kernel.f90
. . .
100, Loop is parallelizable
102, Loop is parallelizable
     Accelerator kernel generated
     Generating Tesla code
100, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
102, !$acc loop gang, vector(32) ! blockidx%x threadidx%x

Source:
 98 !$ACC KERNELS
 99 !$ACC LOOP INDEPENDENT
100 DO k=y_min-depth,y_max+depth
101   !$ACC LOOP INDEPENDENT
102   DO j=1,depth
103     density0(x_min-j,k)=left_density0(left_xmax+1-j,k)
104   ENDDO
105 ENDDO
106 !$ACC END KERNELS
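The same directives work in C and C++; a hypothetical flattened-array version of the loop nest above (indexing is illustrative, not the original code):

void update_left_halo(double *density0, const double *left_density0,
                      int stride, int x_min, int left_xmax,
                      int y_min, int y_max, int depth)
{
    #pragma acc kernels
    {
        #pragma acc loop independent
        for (int k = y_min - depth; k <= y_max + depth; ++k) {
            #pragma acc loop independent
            for (int j = 1; j <= depth; ++j)
                density0[k * stride + (x_min - j)] =
                    left_density0[k * stride + (left_xmax + 1 - j)];
        }
    }
}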
OPENACC.ORG RESOURCES
Guides ● Talks ● Tutorials ● Videos ● Books ● Spec ● Code Samples ● Teaching Materials ● Events ● Success Stories ● Courses ● Slack ● Stack Overflow
Resources: https://www.openacc.org/resources
Success stories: https://www.openacc.org/success-stories
Events: https://www.openacc.org/events
Compilers and tools (OpenACC is now in GCC): https://www.openacc.org/tools
Community Slack: https://www.openacc.org/community#slack
OPENACC AUTO-COMPARE
Find where CPU and GPU numerical results diverge
[Diagram: compute regions run on both CPU and GPU between data copyin and copyout; results are compared and differences reported]
• -ta=tesla:autocompare
• Compute regions run redundantly on CPU and GPU
• Results compared when data is copied from GPU to CPU
• pgicompilers.com/pcast
PARALLEL FEATURES IN FORTRAN AND C++
Fortran 2018:
• Array syntax (F90)
• FORALL (F95)
• Co-arrays (F08, F18)
• DO CONCURRENT (F08, F18)
C++:
• Threads (C++11)
• Parallel STL algorithms (C++17); see the sketch below
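As a concrete instance of the C++17 parallel algorithms item, a saxpy-like std::transform with the parallel execution policy:

#include <algorithm>
#include <execution>
#include <vector>

int main()
{
    std::vector<float> x(1 << 20, 1.0f), y(1 << 20, 2.0f);
    const float a = 2.0f;
    // The library may run this across all cores; the loop body stays serial code.
    std::transform(std::execution::par, x.begin(), x.end(), y.begin(), y.begin(),
                   [a](float xi, float yi) { return a * xi + yi; });
}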
CUDA
INTRODUCING CUDA 10.0
TURING AND NEW SYSTEMS: new GPU architecture, Tensor Cores, NVSwitch fabric
CUDA PLATFORM: CUDA Graphs, Vulkan & DX12 interop, warp matrix operations
LIBRARIES: GPU-accelerated hybrid JPEG decoding, symmetric eigenvalue solvers, FFT scaling
DEVELOPER TOOLS: new Nsight products, Nsight Systems and Nsight Compute
Scientific Computing
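Of the platform features above, CUDA Graphs is the most code-visible; a minimal stream-capture sketch (capture API as of CUDA 10.1; the kernel and launch count are illustrative):

#include <cuda_runtime.h>

__global__ void scale(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

void run_graph(float *dX, int n, cudaStream_t stream)
{
    cudaGraph_t graph;
    cudaGraphExec_t exec;

    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(dX, n);   // recorded, not run
    cudaStreamEndCapture(stream, &graph);

    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
    for (int i = 0; i < 100; ++i)          // replay with minimal launch overhead
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
}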
cuFFT 10.0
https://developer.nvidia.com/cufft
Multi-GPU scaling across DGX-2 and HGX-2: up to 17 TF on 16 GPUs for a 3D 1K FFT
• Strong scaling across 16-GPU systems (DGX-2 and HGX-2)
• Multi-GPU R2C and C2R support
• Large FFT models across 16 GPUs: effective 512 GB vs. 32 GB capacity
[Chart: GFLOPS vs. number of GPUs (2, 4, 8, 16) for cuFFT 9.2 and cuFFT 10.0, with a linear-scaling reference for cuFFT 10.0; 3D C2C FFT of size 1024 on DGX-2 with CUDA 10 (10.0.130)]
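A hedged sketch of how such a multi-GPU 3D FFT is set up with the cufftXt API (device count and IDs are assumptions; error checks omitted):

#include <cufft.h>
#include <cufftXt.h>

void fft3d_multi_gpu(cufftComplex *host_data)      // 1024^3 complex values
{
    const int n = 1024, nGPUs = 4;
    int gpus[nGPUs] = {0, 1, 2, 3};
    size_t workSizes[nGPUs];

    cufftHandle plan;
    cufftCreate(&plan);
    cufftXtSetGPUs(plan, nGPUs, gpus);             // spread the plan over GPUs
    cufftMakePlan3d(plan, n, n, n, CUFFT_C2C, workSizes);

    cudaLibXtDesc *desc;                           // multi-GPU data descriptor
    cufftXtMalloc(plan, &desc, CUFFT_XT_FORMAT_INPLACE);
    cufftXtMemcpy(plan, desc, host_data, CUFFT_COPY_HOST_TO_DEVICE);
    cufftXtExecDescriptorC2C(plan, desc, desc, CUFFT_FORWARD);
    cufftXtMemcpy(plan, host_data, desc, CUFFT_COPY_DEVICE_TO_HOST);

    cufftXtFree(desc);
    cufftDestroy(plan);
}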
cuSOLVER 10.0
https://developer.nvidia.com/cusolver
Dense linear algebra: up to 44x faster on the symmetric eigensolver (DSYEVD)
Improved performance with new implementations for:
• Cholesky factorization
• Symmetric & generalized symmetric eigensolver
• QR factorization
[Chart: DSYEVD time in seconds for matrix sizes 4096 and 8192, comparing MKL 2018, CUDA 9.2, and CUDA 10.0; MKL takes up to 157.8 s where CUDA 10.0 takes a few seconds]
Benchmarks use 2x Intel Gold 6140 (Skylake) processors with Intel MKL 2018 and NVIDIA Tesla V100 (Volta) GPUs
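For reference, a minimal sketch of the benchmarked DSYEVD path through cuSOLVER (device buffers dA, dW assumed allocated; error checks omitted):

#include <cusolverDn.h>
#include <cuda_runtime.h>

// Eigendecomposition of a dense symmetric n x n matrix dA (device memory).
// On exit dA holds the eigenvectors and dW the ascending eigenvalues.
void dsyevd(double *dA, double *dW, int n)
{
    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);

    int lwork = 0, *devInfo;
    cudaMalloc(&devInfo, sizeof(int));
    cusolverDnDsyevd_bufferSize(handle, CUSOLVER_EIG_MODE_VECTOR,
                                CUBLAS_FILL_MODE_LOWER, n, dA, n, dW, &lwork);

    double *work;
    cudaMalloc(&work, lwork * sizeof(double));
    cusolverDnDsyevd(handle, CUSOLVER_EIG_MODE_VECTOR, CUBLAS_FILL_MODE_LOWER,
                     n, dA, n, dW, work, lwork, devInfo);

    cudaFree(work);
    cudaFree(devInfo);
    cusolverDnDestroy(handle);
}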
CUTLASS 1.1
https://github.com/NVIDIA/cutlass
High-performance matrix multiplication in open-source CUDA C++
• Turing-optimized GEMMs
• Integer (8-bit, 4-bit and 1-bit) using WMMA
• Batched strided GEMM
• Support for CUDA 10.0
• Updates to documentation and more examples
[Chart: CUTLASS 1.1 on Volta (GV100), % relative to peak for DGEMM, HGEMM, IGEMM, SGEMM, and WMMA (F16/F32) kernels in all NN/NT/TN/TT layouts; CUTLASS operations reach 90% of cuBLAS performance]
cuSPARSE
New improved sparse BLAS APIs
Introduced generic APIs with improved performance:
• SpVV: sparse vector x dense vector multiplication
• SpMV: sparse matrix x dense vector multiplication
• SpMM: sparse matrix x dense matrix multiplication
Coming soon:
• SpGEMM: sparse matrix x sparse matrix multiplication

cusparseStatus_t
cusparseSpMM(cusparseHandle_t      handle,
             cusparseOperation_t  transA,
             cusparseOperation_t  transB,
             const void*          alpha,
             cusparseSpMatDescr_t matA,
             cusparseDnMatDescr_t matB,
             const void*          beta,
             cusparseDnMatDescr_t matC,
             cudaDataType         computeType,
             cusparseSpMMAlg_t    alg,
             void*                externalBuffer)

[Chart: cuSPARSE SpMM speedup over MKL 2019.1, roughly 29x to 115x across test matrices]
cuSPARSE 10.1 Update 1 performance collected on GV100; MKL 2019.1 performance collected on 2-socket Xeon Gold 6140
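A hedged usage sketch for the generic SpMM API above, including the buffer-size query (enum spellings follow later CUDA releases; all device arrays are assumed pre-allocated):

#include <cusparse.h>
#include <cuda_runtime.h>

// C (m x n, dense) = A (m x k, CSR) * B (k x n, dense), single precision.
void spmm_csr(cusparseHandle_t handle, int m, int k, int n, int nnz,
              int *dRowPtr, int *dColInd, float *dVal, float *dB, float *dC)
{
    const float alpha = 1.0f, beta = 0.0f;
    cusparseSpMatDescr_t matA;
    cusparseDnMatDescr_t matB, matC;

    cusparseCreateCsr(&matA, m, k, nnz, dRowPtr, dColInd, dVal,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseCreateDnMat(&matB, k, n, k, dB, CUDA_R_32F, CUSPARSE_ORDER_COL);
    cusparseCreateDnMat(&matC, m, n, m, dC, CUDA_R_32F, CUSPARSE_ORDER_COL);

    size_t bufSize = 0;
    void *dBuf = nullptr;
    cusparseSpMM_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                            CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA, matB,
                            &beta, matC, CUDA_R_32F, CUSPARSE_SPMM_ALG_DEFAULT,
                            &bufSize);
    cudaMalloc(&dBuf, bufSize);
    cusparseSpMM(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                 CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA, matB,
                 &beta, matC, CUDA_R_32F, CUSPARSE_SPMM_ALG_DEFAULT, dBuf);

    cudaFree(dBuf);
    cusparseDestroySpMat(matA);
    cusparseDestroyDnMat(matB);
    cusparseDestroyDnMat(matC);
}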
NSIGHT PRODUCT FAMILY
Nsight Systems: system-wide application algorithm tuning
Nsight Compute: CUDA kernel profiling and debugging
Nsight Graphics: graphics shader profiling and debugging
IDE plugins: Nsight Eclipse Edition / Visual Studio (editor, debugger)
DEEP LEARNING SDK
NVIDIA DEEP LEARNING SOFTWARE PLATFORM
NVIDIA DEEP LEARNING SDK
Training workflow:
• Gather and label: gather data, curate data sets, rapidly label data, guide training, get insights
• Data management
• Training (CNN, RNN, FC) and model assessment → trained network
Deploy with TensorRT:
• Embedded: Jetson TX
• Automotive: Drive PX (Xavier)
• Data center: Tesla (Pascal, Volta)
NVIDIA COLLECTIVE COMMUNICATIONS LIBRARY (NCCL) 2
Multi-GPU and multi-node collective communication primitives
developer.nvidia.com/nccl
High-performance multi-GPU and multi-node collective communication primitives optimized for NVIDIA GPUs
Fast routines for multi-GPU, multi-node acceleration that maximize inter-GPU bandwidth utilization
Easy to integrate and MPI-compatible; uses automatic topology detection to scale HPC and deep learning applications over PCIe and NVLink
Accelerates leading deep learning frameworks such as Caffe2, Microsoft Cognitive Toolkit, MXNet, PyTorch and more
Multi-GPU: NVLink, PCIe
Multi-node: InfiniBand verbs, IP sockets
Automatic topology detection
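A minimal single-process sketch of NCCL usage, one all-reduce across all visible GPUs (buffers are assumed pre-allocated device pointers; error checks omitted):

#include <nccl.h>
#include <cuda_runtime.h>

void allreduce_all_gpus(float **sendbuf, float **recvbuf, size_t count, int nDev)
{
    ncclComm_t comms[8];                    // assumes nDev <= 8
    ncclCommInitAll(comms, nDev, nullptr);  // nullptr: use devices 0..nDev-1

    // Group the calls so NCCL can launch them together without deadlock.
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                      comms[i], /*stream=*/0);
    }
    ncclGroupEnd();

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(0);
        ncclCommDestroy(comms[i]);
    }
}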
TENSORRT 5 & TENSORRT INFERENCE SERVER
World’s Most Advanced Inference Accelerator
Turing Support ● Optimizations & APIs ● Inference Server
Turing support: up to 40x faster inference for apps such as translation, using mixed precision on Turing Tensor Cores
New optimizations & flexible INT8 APIs: achieve highest throughput at low latency with newly optimized operations, INT8 workflows, and support for Windows and CentOS
TensorRT Inference Server: maximize GPU utilization by executing multiple models from different frameworks on a node via API
Free download for members of the NVIDIA Developer Program soon at developer.nvidia.com/tensorrt
RAPIDS
GPU-Accelerated End-to-End Data Science
RAPIDS is a set of open-source libraries for GPU-accelerated data preparation and machine learning. rapids.ai
Pipeline, all in GPU memory:
• Data preparation: cuDF (analytics)
• Model training: cuML (machine learning), cuGraph (graph analytics), deep learning
• Visualization: cuXfilter
cuDF
• GPU-accelerated data preparation and feature engineering
• Python drop-in Pandas replacement
cuML
• GPU-accelerated traditional machine learning libraries
• XGBoost, PCA, Kalman, K-means, k-NN, DBSCAN, tSVD, ...
cuGraph
• GPU-accelerated graph analytics libraries
cuXfilter
• Web data visualization library
• DataFrame kept in GPU memory throughout the session
cuML ROADMAP
Algorithms (available → coming soon); SG = single GPU, MG = multi-GPU, MGMN = multi-GPU multi-node:
• XGBoost GBDT: MGMN
• XGBoost Random Forest: MGMN
• K-Means Clustering: SG → MGMN
• K-Nearest Neighbors (KNN): MG → MGMN
• Principal Component Analysis (PCA): SG
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN): SG
• Truncated Singular Value Decomposition (tSVD): SG
• Uniform Manifold Approximation and Projection (UMAP): SG → MG
• Kalman Filters (KF): SG
• Ordinary Least Squares Linear Regression (OLS): SG
• Stochastic Gradient Descent (SGD): SG
• Generalized Linear Model, including Logistic (GLM): SG
• Time Series (Holt-Winters): SG
• Autoregressive Integrated Moving Average (ARIMA): SG
• t-SNE Dimensionality Reduction: SG
• Support Vector Machines (SVM): SG
Last updated 2019-05-16
NGC
NGC: GPU-OPTIMIZED SOFTWARE HUB
Simplifying DL, ML and HPC Workflows
• 50+ containers: DL, ML, HPC
• 60 pre-trained models: NLP, image classification, object detection & more
• 15+ model training scripts: NLP, image classification, object detection & more
• Industry workflows: medical imaging, intelligent video analytics
Deep learning: TensorFlow | PyTorch | more
Machine learning: RAPIDS | H2O | more
HPC: NAMD | GROMACS | more
Visualization: ParaView | IndeX | more
NVIDIA GPU CLOUD REGISTRY
NVIDIA GPU Cloud containerizes GPU-optimized frameworks, applications, runtimes, libraries, and operating system, available at no charge
Deep learning: all major frameworks with multi-GPU optimizations; uses NCCL for NVLink data exchange and multi-threaded I/O to feed the GPUs
Caffe, Caffe2, CNTK, MXNet, PyTorch, TensorFlow, Theano, Torch
HPC: NAMD, GROMACS, LAMMPS, GAMESS, RELION, Chroma, MILC
HPC visualization: ParaView with OptiX, IndeX and Holodeck with OpenGL visualization based on NVIDIA Docker 2.0, IndeX, VMD
Single NGC account for use on GPUs everywhere: https://ngc.nvidia.com
Common software stack across NVIDIA GPUs