HW & SW PLATFORMS FOR HPC, AI AND ML
Gunter Roth, Senior Solution Architect (gunterr@nvidia.com)
ACCELERATED COMPUTING
Performance & Energy Efficiency
VDI | Data Analytics | AI / Deep Learning | High Performance Computing
NVIDIA TESLA PLATFORM
World’s Leading Data Center Platform for Accelerating HPC and AI
Customer use cases: supercomputing (molecular simulations, weather forecasting, seismic mapping), enterprise applications (manufacturing, healthcare, engineering), consumer internet (speech, translation, recommenders)
Industry frameworks & applications: 550+ accelerated applications, including Amber, NAMD, LAMMPS, CHROMA
NVIDIA SDK & libraries: CUDA, cuDNN, NCCL, TensorRT, cuBLAS, DeepStream, cuSPARSE, cuFFT, cuRAND
Tesla GPUs & systems: Tesla GPU, NVIDIA DGX family, NVIDIA HGX, system OEMs, cloud
NVIDIA POWERS WORLD'S FASTEST SUPERCOMPUTER
Summit Becomes First System To Scale The 100 Petaflops Milestone
27,648 Volta Tensor Core GPUs | 122 PF HPC | 3 EF AI
NVIDIA POWERS TODAY’S FASTEST SUPERCOMPUTERS
22 of Top 25 Greenest
ORNL Summit, World’s Fastest: 27,648 GPUs | 149 PF
LLNL Sierra, World’s 2nd Fastest: 17,280 GPUs | 95 PF
ABCI, Japan’s Fastest: 4,352 GPUs | 20 PF
Piz Daint, Europe’s Fastest: 5,704 GPUs | 21 PF
Total Pangea 3, Fastest Industrial: 3,348 GPUs | 18 PF
GPUS FOR HPC AND DEEP LEARNING
NVIDIA Tesla V100
Huge demand on compute power (FLOPS):
5,120 energy-efficient cores + Tensor Cores
7.8 TF double precision (FP64), 15.7 TF single precision (FP32), 125 Tensor TFLOP/s mixed precision
Huge demand on communication and memory bandwidth:
NVLink: 6 links per GPU at 50 GB/s bidirectional, for maximum scalability between GPUs
CoWoS with HBM2: 900 GB/s memory bandwidth, unifying compute & memory in a single package
NCCL: high-performance multi-GPU and multi-node collective communication primitives optimized for NVIDIA GPUs
GPUDirect / GPUDirect RDMA: direct communication between GPUs, eliminating the CPU from the critical path
ANNOUNCING TESLA T4
WORLD’S MOST ADVANCED INFERENCE GPU
Universal Inference Acceleration
320 Turing Tensor Cores
2,560 CUDA cores
65 FP16 TFLOPS | 130 INT8 TOPS | 260 INT4 TOPS
16 GB | 320 GB/s
DGX-STATION / DGX-1 / DGX-2 / HGX-2 / SUPERPOD
NVIDIA DGX STATION
AI supercomputer for the desk
4x Tesla V100 connected via NVLink (60 TFLOPS FP32, 0.5 PFLOPS Tensor performance)
Xeon CPU, 256 GB memory
Storage: 3x 1.92 TB SSD RAID 0 (data), 1x 1.92 TB SSD (OS)
Dual 10 GbE
1500 W, water-cooled → quiet
Optimized deep learning software across the entire stack
Containerized frameworks
Always up to date via the cloud
NVIDIA DGX-1
AI supercomputer-appliance-in-a-box
8x Tesla V100 connected via NVLink (125 TFLOPS FP32, 1 PFLOPS Tensor Core performance)
Dual Xeon CPU, 512 GB memory
7 TB SSD deep learning cache
Dual 10 GbE, quad IB 100 Gb
3RU, 3200 W
Optimized deep learning software across the entire stack
Containerized frameworks
Always up to date via the cloud
NVIDIA DGX-2
Two GPU boards: 8 NVIDIA Tesla V100 32GB GPUs per board, 6 NVSwitches per board, 512 GB total HBM2 memory, interconnected by plane card
Twelve NVSwitches: 2.4 TB/s bisection bandwidth
Eight EDR InfiniBand / 100 GigE: 1,600 Gb/s total bidirectional bandwidth
PCIe switch complex
Two Intel Xeon Platinum CPUs
1.5 TB system memory
30 TB NVMe SSD internal storage
Dual 10/25 Gb/s Ethernet
ANNOUNCING NVIDIA DGX SUPERPOD
AI LEADERSHIP REQUIRES AI INFRASTRUCTURE LEADERSHIP
Test bed for highest-performance scale-up systems
• 9.4 PF on HPL | ~200 AI PF | #22 on Top500 list
• <2 minutes to train ResNet-50
Modular & scalable GPU SuperPOD architecture
• Built in 3 weeks
• Optimized for compute, networking, storage & software
Integrates fully optimized software stacks
• Freely available through NGC
System: 96 DGX-2H | 1,536 V100 Tensor Core GPUs | 10 Mellanox EDR IB per node | 1 megawatt of power
Autonomous Vehicles | Speech AI | Healthcare | Graphics | HPC
PROJECT: FEEDING DATA-HUNGRY GPUS
GPU-OPTIMIZED DATA CENTERS
Clusters with GPUDirect Storage
CUFILE AND GPUDIRECT STORAGE
cuFile API: for applications
NVFS driver API: for filesystem, block, and storage drivers
[Diagram: architecture of the stack. The application calls the cuFile API through CUDA; in the OS kernel, the NVFS driver connects the filesystem driver, block IO driver, and storage driver.]
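The cuFile calls below postdate this deck's announcement, but as a rough sketch of what a GPUDirect Storage read looks like through this stack (path, sizes, and error handling are illustrative assumptions):

#include <cufile.h>
#include <cuda_runtime.h>
#include <fcntl.h>
#include <unistd.h>

// Read nbytes from the start of 'path' directly into GPU memory at devPtr.
void gds_read(const char *path, void *devPtr, size_t nbytes)
{
    cuFileDriverOpen();                          // bring up the GDS driver

    int fd = open(path, O_RDONLY | O_DIRECT);    // O_DIRECT: no page-cache bounce
    CUfileDescr_t descr = {};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t handle;
    cuFileHandleRegister(&handle, &descr);
    cuFileBufRegister(devPtr, nbytes, 0);        // register the GPU target buffer

    cuFileRead(handle, devPtr, nbytes, 0, 0);    // file offset 0 -> device memory

    cuFileBufDeregister(devPtr);
    cuFileHandleDeregister(handle);
    close(fd);
    cuFileDriverClose();
}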
FOR MORE INFORMATION
Join the GPUDirect Storage interest list to provide feedback and to extend support to other filesystems.
Technical blog and sign-up link: https://devblogs.nvidia.com/gpudirect-storage/
VOLTA TENSOR CORE
TENSOR CORES FOR SCIENCE
Mixed-Precision Computing
[Chart: V100 TFLOPS, FP64 + multi-precision: 7.8 FP64, 15.7 FP32, 125 mixed-precision Tensor]
PLASMA FUSION APPLICATION: FP16 solver, 3.5x faster
EARTHQUAKE SIMULATION: FP16-FP21-FP32-FP64, 25x faster
MIXED-PRECISION WEATHER PREDICTION: FP16/FP32/FP64, 4x faster
TENSOR CORE
Mixed-precision matrix math on 4x4 matrices
New CUDA TensorOp instructions & data formats
4x4x4 matrix processing array
D[FP32] = A[FP16] * B[FP16] + C[FP32]
Using Tensor Cores via:
• Volta-optimized frameworks and libraries (cuDNN, cuBLAS, TensorRT, ...); a cuBLAS sketch follows below
• CUDA C++ warp-level matrix operations
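Before the warp-level code on the next slide, a hedged sketch of the library route: the same D = A*B + C pattern through cublasGemmEx, with FP16 inputs and FP32 accumulation (buffer names and the column-major leading dimensions are illustrative):

#include <cublas_v2.h>
#include <cuda_fp16.h>

// C (FP32, m x n) += A (FP16, m x k) * B (FP16, k x n) on Tensor Cores.
void gemm_tensor_cores(cublasHandle_t handle, int m, int n, int k,
                       const half *dA, const half *dB, float *dC)
{
    const float alpha = 1.0f, beta = 1.0f;
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);   // opt in to Tensor Cores
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &alpha,
                 dA, CUDA_R_16F, m,             // FP16 inputs
                 dB, CUDA_R_16F, k,
                 &beta,
                 dC, CUDA_R_32F, m,             // FP32 output
                 CUDA_R_32F,                    // FP32 accumulation
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}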
TURING TENSOR CORE
USING TENSOR CORES
Volta-optimized frameworks and libraries: NVIDIA cuDNN, cuBLAS, TensorRT
CUDA C++ warp-level matrix operations (fragment template arguments filled in so the slide’s sketch compiles):

#include <mma.h>
using namespace nvcuda;

__device__ void tensor_op_16_16_16(
    float *d, half *a, half *b, float *c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> Bmat;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;

    wmma::load_matrix_sync(Amat, a, 16);   // load 16x16 FP16 tiles
    wmma::load_matrix_sync(Bmat, b, 16);
    wmma::fill_fragment(Cmat, 0.0f);       // zero the FP32 accumulator (c unused here)
    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);
    wmma::store_matrix_sync(d, Cmat, 16, wmma::mem_row_major);
}
LINEAR ALGEBRA + TENSOR CORES
Double-precision LU decomposition:
• Compute initial solution in FP16
• Iteratively refine to FP64 (a CPU sketch of the loop follows below)
Achieved FP64 Tflop/s: 26 (device FP64 peak: 7.8)
[Chart: Tflop/s vs. matrix size (2k to 34k) for FP16-TC (Tensor Cores) hgetrf LU, FP16 hgetrf LU, FP32 sgetrf LU, and FP64 dgetrf LU]
Data courtesy of Azzam Haidar, Stanimire Tomov & Jack Dongarra, Innovative Computing Laboratory, University of Tennessee
“Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers”, A. Haidar, P. Wu, S. Tomov, J. Dongarra, SC’17
GTC 2018 Poster P8237: Harnessing GPU’s Tensor Cores Fast FP16 Arithmetic to Speedup Mixed-Precision Iterative Refinement Solves
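To make the refinement loop concrete, here is a small self-contained CPU sketch of the same idea, with float standing in for the FP16 factorization and double for the FP64 refinement (the 2x2 system is an arbitrary example):

#include <cstdio>
#include <vector>

// Solve A*x = b by Gaussian elimination in precision T (no pivoting, for brevity).
template <typename T>
std::vector<T> solve(std::vector<T> A, std::vector<T> b, int n)
{
    for (int k = 0; k < n; ++k)
        for (int i = k + 1; i < n; ++i) {
            T f = A[i*n + k] / A[k*n + k];
            for (int j = k; j < n; ++j) A[i*n + j] -= f * A[k*n + j];
            b[i] -= f * b[k];
        }
    std::vector<T> x(n);
    for (int i = n - 1; i >= 0; --i) {
        T s = b[i];
        for (int j = i + 1; j < n; ++j) s -= A[i*n + j] * x[j];
        x[i] = s / A[i*n + i];
    }
    return x;
}

int main()
{
    const int n = 2;
    std::vector<double> A = {4.0, 1.0, 1.0, 3.0}, b = {1.0, 2.0};
    // Low-precision copy of the system (stand-in for the FP16 LU on Tensor Cores).
    std::vector<float> Af(A.begin(), A.end()), bf(b.begin(), b.end());
    std::vector<float> xf = solve(Af, bf, n);
    std::vector<double> x(xf.begin(), xf.end());

    for (int it = 0; it < 5; ++it) {
        std::vector<double> r(n);              // residual r = b - A*x in FP64
        for (int i = 0; i < n; ++i) {
            r[i] = b[i];
            for (int j = 0; j < n; ++j) r[i] -= A[i*n + j] * x[j];
        }
        std::vector<float> rf(r.begin(), r.end());
        std::vector<float> d = solve(Af, rf, n); // correction in low precision
        for (int i = 0; i < n; ++i) x[i] += d[i]; // applied in FP64
    }
    printf("x = %.15f %.15f\n", x[0], x[1]);
}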
OPENACC
OPENACC IS FOR MULTICORE, MANYCORE & GPUS

CPU:
% pgfortran -ta=multicore -fast -Minfo=acc -c update_tile_halo_kernel.f90
. . .
100, Loop is parallelizable
     Generating Multicore code
100, !$acc loop gang
102, Loop is parallelizable

GPU:
% pgfortran -ta=tesla,cc35,cc60 -fast -Minfo=acc -c update_tile_halo_kernel.f90
. . .
100, Loop is parallelizable
102, Loop is parallelizable
     Accelerator kernel generated
     Generating Tesla code
100, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
102, !$acc loop gang, vector(32) ! blockidx%x threadidx%x

Source:
 98 !$ACC KERNELS
 99 !$ACC LOOP INDEPENDENT
100 DO k=y_min-depth,y_max+depth
101   !$ACC LOOP INDEPENDENT
102   DO j=1,depth
103     density0(x_min-j,k)=left_density0(left_xmax+1-j,k)
104   ENDDO
105 ENDDO
106 !$ACC END KERNELS
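The same directives work in C and C++; a hypothetical flattened-array version of the loop nest above (indexing is illustrative, not the original code):

void update_left_halo(double *density0, const double *left_density0,
                      int stride, int x_min, int left_xmax,
                      int y_min, int y_max, int depth)
{
    #pragma acc kernels
    {
        #pragma acc loop independent
        for (int k = y_min - depth; k <= y_max + depth; ++k) {
            #pragma acc loop independent
            for (int j = 1; j <= depth; ++j)
                density0[k * stride + (x_min - j)] =
                    left_density0[k * stride + (left_xmax + 1 - j)];
        }
    }
}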
OPENACC.ORG RESOURCES
Guides ● Talks ● Tutorials ● Videos ● Books ● Spec ● Code Samples ● Teaching Materials ● Events ● Success Stories ● Courses ● Slack ● Stack Overflow
Resources: https://www.openacc.org/resources
Success stories: https://www.openacc.org/success-stories
Events: https://www.openacc.org/events
Compilers and tools (OpenACC is now in GCC): https://www.openacc.org/tools
Community Slack: https://www.openacc.org/community#slack
OPENACC AUTO-COMPARE
Find where CPU and GPU numerical results diverge
[Diagram: compute regions run on both CPU and GPU between data copyin and copyout; results are compared and differences reported]
• -ta=tesla:autocompare
• Compute regions run redundantly on CPU and GPU
• Results compared when data is copied from GPU to CPU
• pgicompilers.com/pcast
PARALLEL FEATURES IN FORTRAN AND C++
Fortran 2018:
• Array syntax (F90)
• FORALL (F95)
• Co-arrays (F08, F18)
• DO CONCURRENT (F08, F18)
C++:
• Threads (C++11)
• Parallel STL algorithms (C++17); see the sketch below
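As a concrete instance of the C++17 parallel algorithms item, a saxpy-like std::transform with the parallel execution policy:

#include <algorithm>
#include <execution>
#include <vector>

int main()
{
    std::vector<float> x(1 << 20, 1.0f), y(1 << 20, 2.0f);
    const float a = 2.0f;
    // The library may run this across all cores; the loop body stays serial code.
    std::transform(std::execution::par, x.begin(), x.end(), y.begin(), y.begin(),
                   [a](float xi, float yi) { return a * xi + yi; });
}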
CUDA
INTRODUCING CUDA 10.0
TURING AND NEW SYSTEMS: new GPU architecture, Tensor Cores, NVSwitch fabric
CUDA PLATFORM: CUDA Graphs, Vulkan & DX12 interop, warp matrix operations
LIBRARIES: GPU-accelerated hybrid JPEG decoding, symmetric eigenvalue solvers, FFT scaling
DEVELOPER TOOLS: new Nsight products, Nsight Systems and Nsight Compute
Scientific Computing
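Of the platform features above, CUDA Graphs is the most code-visible; a minimal stream-capture sketch (capture API as of CUDA 10.1; the kernel and launch count are illustrative):

#include <cuda_runtime.h>

__global__ void scale(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

void run_graph(float *dX, int n, cudaStream_t stream)
{
    cudaGraph_t graph;
    cudaGraphExec_t exec;

    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(dX, n);   // recorded, not run
    cudaStreamEndCapture(stream, &graph);

    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
    for (int i = 0; i < 100; ++i)          // replay with minimal launch overhead
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
}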
cuFFT 10.0
https://developer.nvidia.com/cufft
Multi-GPU scaling across DGX-2 and HGX-2: up to 17 TF on 16 GPUs for a 3D 1K FFT
• Strong scaling across 16-GPU systems (DGX-2 and HGX-2)
• Multi-GPU R2C and C2R support
• Large FFT models across 16 GPUs: effective 512 GB vs. 32 GB capacity
[Chart: GFLOPS vs. number of GPUs (2, 4, 8, 16) for cuFFT 9.2 and cuFFT 10.0, with a linear-scaling reference for cuFFT 10.0; 3D C2C FFT of size 1024 on DGX-2 with CUDA 10 (10.0.130)]
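A hedged sketch of how such a multi-GPU 3D FFT is set up with the cufftXt API (device count and IDs are assumptions; error checks omitted):

#include <cufft.h>
#include <cufftXt.h>

void fft3d_multi_gpu(cufftComplex *host_data)      // 1024^3 complex values
{
    const int n = 1024, nGPUs = 4;
    int gpus[nGPUs] = {0, 1, 2, 3};
    size_t workSizes[nGPUs];

    cufftHandle plan;
    cufftCreate(&plan);
    cufftXtSetGPUs(plan, nGPUs, gpus);             // spread the plan over GPUs
    cufftMakePlan3d(plan, n, n, n, CUFFT_C2C, workSizes);

    cudaLibXtDesc *desc;                           // multi-GPU data descriptor
    cufftXtMalloc(plan, &desc, CUFFT_XT_FORMAT_INPLACE);
    cufftXtMemcpy(plan, desc, host_data, CUFFT_COPY_HOST_TO_DEVICE);
    cufftXtExecDescriptorC2C(plan, desc, desc, CUFFT_FORWARD);
    cufftXtMemcpy(plan, host_data, desc, CUFFT_COPY_DEVICE_TO_HOST);

    cufftXtFree(desc);
    cufftDestroy(plan);
}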
cuSOLVER 10.0
https://developer.nvidia.com/cusolver
Dense linear algebra: up to 44x faster on the symmetric eigensolver (DSYEVD)
Improved performance with new implementations for:
• Cholesky factorization
• Symmetric & generalized symmetric eigensolver
• QR factorization
[Chart: DSYEVD time in seconds for matrix sizes 4096 and 8192, comparing MKL 2018, CUDA 9.2, and CUDA 10.0; MKL takes up to 157.8 s where CUDA 10.0 takes a few seconds]
Benchmarks use 2x Intel Gold 6140 (Skylake) processors with Intel MKL 2018 and NVIDIA Tesla V100 (Volta) GPUs
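For reference, a minimal sketch of the benchmarked DSYEVD path through cuSOLVER (device buffers dA, dW assumed allocated; error checks omitted):

#include <cusolverDn.h>
#include <cuda_runtime.h>

// Eigendecomposition of a dense symmetric n x n matrix dA (device memory).
// On exit dA holds the eigenvectors and dW the ascending eigenvalues.
void dsyevd(double *dA, double *dW, int n)
{
    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);

    int lwork = 0, *devInfo;
    cudaMalloc(&devInfo, sizeof(int));
    cusolverDnDsyevd_bufferSize(handle, CUSOLVER_EIG_MODE_VECTOR,
                                CUBLAS_FILL_MODE_LOWER, n, dA, n, dW, &lwork);

    double *work;
    cudaMalloc(&work, lwork * sizeof(double));
    cusolverDnDsyevd(handle, CUSOLVER_EIG_MODE_VECTOR, CUBLAS_FILL_MODE_LOWER,
                     n, dA, n, dW, work, lwork, devInfo);

    cudaFree(work);
    cudaFree(devInfo);
    cusolverDnDestroy(handle);
}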
CUTLASS 1.1
https://github.com/NVIDIA/cutlass
High-performance matrix multiplication in open-source CUDA C++
• Turing-optimized GEMMs
• Integer (8-bit, 4-bit and 1-bit) using WMMA
• Batched strided GEMM
• Support for CUDA 10.0
• Updates to documentation and more examples
[Chart: CUTLASS 1.1 on Volta (GV100), % relative to peak for DGEMM, HGEMM, IGEMM, SGEMM, and WMMA (F16/F32) kernels in all NN/NT/TN/TT layouts; CUTLASS operations reach 90% of cuBLAS performance]
cuSPARSE
New improved sparse BLAS APIs
Introduced generic APIs with improved performance:
• SpVV: sparse vector x dense vector multiplication
• SpMV: sparse matrix x dense vector multiplication
• SpMM: sparse matrix x dense matrix multiplication
Coming soon:
• SpGEMM: sparse matrix x sparse matrix multiplication

cusparseStatus_t
cusparseSpMM(cusparseHandle_t      handle,
             cusparseOperation_t  transA,
             cusparseOperation_t  transB,
             const void*          alpha,
             cusparseSpMatDescr_t matA,
             cusparseDnMatDescr_t matB,
             const void*          beta,
             cusparseDnMatDescr_t matC,
             cudaDataType         computeType,
             cusparseSpMMAlg_t    alg,
             void*                externalBuffer)

[Chart: cuSPARSE SpMM speedup over MKL 2019.1, roughly 29x to 115x across test matrices]
cuSPARSE 10.1 Update 1 performance collected on GV100; MKL 2019.1 performance collected on 2-socket Xeon Gold 6140
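A hedged usage sketch for the generic SpMM API above, including the buffer-size query (enum spellings follow later CUDA releases; all device arrays are assumed pre-allocated):

#include <cusparse.h>
#include <cuda_runtime.h>

// C (m x n, dense) = A (m x k, CSR) * B (k x n, dense), single precision.
void spmm_csr(cusparseHandle_t handle, int m, int k, int n, int nnz,
              int *dRowPtr, int *dColInd, float *dVal, float *dB, float *dC)
{
    const float alpha = 1.0f, beta = 0.0f;
    cusparseSpMatDescr_t matA;
    cusparseDnMatDescr_t matB, matC;

    cusparseCreateCsr(&matA, m, k, nnz, dRowPtr, dColInd, dVal,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseCreateDnMat(&matB, k, n, k, dB, CUDA_R_32F, CUSPARSE_ORDER_COL);
    cusparseCreateDnMat(&matC, m, n, m, dC, CUDA_R_32F, CUSPARSE_ORDER_COL);

    size_t bufSize = 0;
    void *dBuf = nullptr;
    cusparseSpMM_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                            CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA, matB,
                            &beta, matC, CUDA_R_32F, CUSPARSE_SPMM_ALG_DEFAULT,
                            &bufSize);
    cudaMalloc(&dBuf, bufSize);
    cusparseSpMM(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                 CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA, matB,
                 &beta, matC, CUDA_R_32F, CUSPARSE_SPMM_ALG_DEFAULT, dBuf);

    cudaFree(dBuf);
    cusparseDestroySpMat(matA);
    cusparseDestroyDnMat(matB);
    cusparseDestroyDnMat(matC);
}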
NSIGHT PRODUCT FAMILY
Nsight Systems: system-wide application algorithm tuning
Nsight Compute: CUDA kernel profiling and debugging
Nsight Graphics: graphics shader profiling and debugging
IDE plugins: Nsight Eclipse Edition / Visual Studio (editor, debugger)
DEEP LEARNING SDK
NVIDIA DEEP LEARNING SOFTWARE PLATFORM
NVIDIA DEEP LEARNING SDK
Training workflow:
• Gather and label: gather data, curate data sets, rapidly label data, guide training, get insights
• Data management
• Training (CNN, RNN, FC) and model assessment → trained network
Deploy with TensorRT:
• Embedded: Jetson TX
• Automotive: Drive PX (Xavier)
• Data center: Tesla (Pascal, Volta)
NVIDIA COLLECTIVE COMMUNICATIONS LIBRARY (NCCL) 2
Multi-GPU and multi-node collective communication primitives
developer.nvidia.com/nccl
High-performance multi-GPU and multi-node collective communication primitives optimized for NVIDIA GPUs
Fast routines for multi-GPU, multi-node acceleration that maximize inter-GPU bandwidth utilization
Easy to integrate and MPI-compatible; uses automatic topology detection to scale HPC and deep learning applications over PCIe and NVLink
Accelerates leading deep learning frameworks such as Caffe2, Microsoft Cognitive Toolkit, MXNet, PyTorch and more
Multi-GPU: NVLink, PCIe
Multi-node: InfiniBand verbs, IP sockets
Automatic topology detection
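A minimal single-process sketch of NCCL usage, one all-reduce across all visible GPUs (buffers are assumed pre-allocated device pointers; error checks omitted):

#include <nccl.h>
#include <cuda_runtime.h>

void allreduce_all_gpus(float **sendbuf, float **recvbuf, size_t count, int nDev)
{
    ncclComm_t comms[8];                    // assumes nDev <= 8
    ncclCommInitAll(comms, nDev, nullptr);  // nullptr: use devices 0..nDev-1

    // Group the calls so NCCL can launch them together without deadlock.
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                      comms[i], /*stream=*/0);
    }
    ncclGroupEnd();

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(0);
        ncclCommDestroy(comms[i]);
    }
}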
TENSORRT 5 & TENSORRT INFERENCE SERVER
World’s Most Advanced Inference Accelerator
Turing Support ● Optimizations & APIs ● Inference Server
Turing support: up to 40x faster inference for apps such as translation, using mixed precision on Turing Tensor Cores
New optimizations & flexible INT8 APIs: achieve highest throughput at low latency with newly optimized operations, INT8 workflows, and support for Windows and CentOS
TensorRT Inference Server: maximize GPU utilization by executing multiple models from different frameworks on a node via API
Free download for members of the NVIDIA Developer Program soon at developer.nvidia.com/tensorrt
RAPIDS
GPU-Accelerated End-to-End Data Science
RAPIDS is a set of open-source libraries for GPU-accelerated data preparation and machine learning. rapids.ai
Pipeline, all in GPU memory:
• Data preparation: cuDF (analytics)
• Model training: cuML (machine learning), cuGraph (graph analytics), deep learning
• Visualization: cuXfilter
cuDF
• GPU-accelerated data preparation and feature engineering
• Python drop-in Pandas replacement
cuML
• GPU-accelerated traditional machine learning libraries
• XGBoost, PCA, Kalman, K-means, k-NN, DBSCAN, tSVD, ...
cuGraph
• GPU-accelerated graph analytics libraries
cuXfilter
• Web data visualization library
• DataFrame kept in GPU memory throughout the session
cuML ROADMAP
Algorithms (available → coming soon); SG = single GPU, MG = multi-GPU, MGMN = multi-GPU multi-node:
• XGBoost GBDT: MGMN
• XGBoost Random Forest: MGMN
• K-Means Clustering: SG → MGMN
• K-Nearest Neighbors (KNN): MG → MGMN
• Principal Component Analysis (PCA): SG
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN): SG
• Truncated Singular Value Decomposition (tSVD): SG
• Uniform Manifold Approximation and Projection (UMAP): SG → MG
• Kalman Filters (KF): SG
• Ordinary Least Squares Linear Regression (OLS): SG
• Stochastic Gradient Descent (SGD): SG
• Generalized Linear Model, including Logistic (GLM): SG
• Time Series (Holt-Winters): SG
• Autoregressive Integrated Moving Average (ARIMA): SG
• t-SNE Dimensionality Reduction: SG
• Support Vector Machines (SVM): SG
Last updated 2019-05-16
NGC
NGC: GPU-OPTIMIZED SOFTWARE HUB
Simplifying DL, ML and HPC Workflows
• 50+ containers: DL, ML, HPC
• 60 pre-trained models: NLP, image classification, object detection & more
• 15+ model training scripts: NLP, image classification, object detection & more
• Industry workflows: medical imaging, intelligent video analytics
Deep learning: TensorFlow | PyTorch | more
Machine learning: RAPIDS | H2O | more
HPC: NAMD | GROMACS | more
Visualization: ParaView | IndeX | more
NVIDIA GPU CLOUD REGISTRY
NVIDIA GPU Cloud containerizes GPU-optimized frameworks, applications, runtimes, libraries, and operating system, available at no charge
Deep learning: all major frameworks with multi-GPU optimizations; uses NCCL for NVLink data exchange and multi-threaded I/O to feed the GPUs
Caffe, Caffe2, CNTK, MXNet, PyTorch, TensorFlow, Theano, Torch
HPC: NAMD, GROMACS, LAMMPS, GAMESS, RELION, Chroma, MILC
HPC visualization: ParaView with OptiX, IndeX and Holodeck with OpenGL visualization based on NVIDIA Docker 2.0, IndeX, VMD
Single NGC account for use on GPUs everywhere: https://ngc.nvidia.com
Common software stack across NVIDIA GPUs