6.
AMAZING INNOVATION AND EXPANSION OF THE NVIDIA ECOSYSTEM
NVIDIA IS A FULL-STACK COMPUTING PLATFORM
• 24M CUDA downloads
• 2.5M developers
• 1B CUDA GPUs
• 150 SDKs
• 2,000 GPU-accelerated applications
• 7,500 AI startups
[Diagram: full-stack platform layers — CHIPS, SYSTEMS, SDKs & ENGINES (CUDA, CUDA-X AI, RTX, HPC, Magnum IO, RAPIDS, Aerial 5G, Clara, Isaac, Metropolis, Drive), APPLICATIONS, ECOSYSTEM]
COMPLETE SOFTWARE STACK | GROWING ECOSYSTEM | FULL-STACK INNOVATION
7.
NVIDIA DATACENTER PLATFORM
• BUSINESS APPLICATIONS: customer engagement, patient diagnostics, fraud detection, quality assurance, industrial automation, precision marketing, molecular simulations, and more
• APPLICATION FRAMEWORKS: Healthcare (Clara), Smart City (Metropolis), Conversational AI (Jarvis), Autonomous Vehicles (Drive), Recommendation Systems (Merlin), and more
• SOFTWARE HUB (NGC): pre-trained models, SDKs, certified containers
• ACCELERATION LIBRARIES and DEVELOPER TOOLKITS: compute (CUDA-X); ML & data analytics, AI training & inference (TensorRT, Triton Inference Server); high performance computing (NVIDIA HPC SDK); rendering & visualization (IndeX, OptiX, CloudXR, MDL); networking, storage & security (Magnum IO, DOCA)
• MANAGEMENT: monitoring (DCGM, UFM), operations (Fleet Command, NVIDIA GPU Operator), virtual GPU software
• HARDWARE TECHNOLOGIES: GPU, BlueField DPU, SmartNIC, NVSwitch — NVIDIA-Certified validated solutions
• SERVERS & CLOUD: CSP instances, purpose-built (DGX, HGX), mainstream & edge (EGX)
9.
CONVERSATIONAL AI — JARVIS
CLIENT APPLICATIONS LEVERAGE JARVIS SKILLS TO BUILD REAL-TIME CONVERSATIONAL AI USER EXPERIENCES
JARVIS DEPLOYED IN PRODUCTION IN CALL CENTERS
• Real-time transcription for 40,000 call center agents
• Identifies keywords and recommends solutions
11.
VIRTUAL COLLABORATION — MAXINE
Maxine AI Face Encode vs. standard video: super-resolution scaling from 180p to 720p delivers better quality of service in low-bandwidth environments.
SoftBank (Telco) Zoom client:
• Used by more than 50% of enterprises in Japan
• Premium, smooth video conferencing experience (no compression artifacts) over 5G networks
• Audio/video noise removal enhances the virtual communication experience
13.
DEPLOYING AI IN PRODUCTION APPLICATIONS IS HARD
Millions of Apps | Billions of Users | Trillions of Queries
• Multiple models and different query types: ASR, NLP, TTS, recommenders, image classification, image segmentation — each app mixing real-time, batch, stateful-stream, and ensemble queries
• Many serving backends: TensorFlow, PyTorch, ONNX, TRT, OpenVINO, and custom serving stacks
• Many target processors: x86 CPU, Arm CPU, V100 GPU, T4 GPU, A100 GPU, A100 MIG
• Siloed, monolithic apps, each handling its own query-and-response path
14.
TRITON: THE COMPUTE ENGINE OF THE MODERN DATA CENTER
Millions of Apps | Billions of Users | Trillions of Queries
• Batching & scheduling: real time | batch | stateful stream | ensemble
• Multiple framework backends, plus custom C++ and Python backends
• Microservices (ASR, NLP, TTS, recommender, image classify, image segment) running on Triton Inference Server
• Microservices-based apps issue queries and receive responses through Triton
• Optimized for all processors: x86 CPU, Arm CPU, V100 GPU, T4 GPU, A100 GPU, A100 MIG
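The batching-and-scheduling idea above can be sketched in a few lines. This is a purely conceptual illustration: `DynamicBatcher` and its methods are invented names, not the real Triton API, and Triton's actual dynamic batcher also weighs queue delay, preferred batch sizes, and per-model configuration.

```python
# Conceptual sketch of dynamic batching -- the idea behind Triton's
# "Batching & Scheduling" layer. Independent inference requests are
# queued, then coalesced into larger batches so the processor runs
# at high occupancy instead of executing one query at a time.

class DynamicBatcher:
    def __init__(self, max_batch_size):
        self.max_batch_size = max_batch_size
        self.queue = []

    def submit(self, request):
        # In a real server, requests arrive concurrently from many clients.
        self.queue.append(request)

    def next_batch(self):
        # Drain up to max_batch_size queued requests into one batch.
        batch = self.queue[:self.max_batch_size]
        self.queue = self.queue[self.max_batch_size:]
        return batch

batcher = DynamicBatcher(max_batch_size=8)
for i in range(20):
    batcher.submit(f"query-{i}")

batches = []
while batcher.queue:
    batches.append(batcher.next_batch())

print([len(b) for b in batches])  # 20 queries coalesced into batches of 8, 8, 4
```

The win is that one batch of 8 costs far less than 8 separate model executions, which is why a shared serving layer beats siloed per-app serving stacks.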
15.
MULTI-TRILLION-PARAMETER NLP MODELS — TRAINING AND REAL-TIME INFERENCE
GIANT MODEL — TRAIN AND INFERENCE
Train multi-trillion-parameter models with the NVIDIA Megatron framework and perform real-time inference with Triton Inference Server, using the Megatron | FasterTransformer execution backends between apps' queries and responses.
REAL-TIME PERFORMANCE
• DGX A100: 16 queries in 1 second
• Dual-socket CPU server: 1 query in >1 minute
Input sequence length = 128 tokens (average of 102 words); output sequence length = 8 tokens (average of 6 words).
GPU: Megatron GPT-3 on DGX A100 80GB, batch size 16, FP16, FasterTransformer 4.0, Triton 2.6.
CPU: OpenAI GPT-3 on Xeon Platinum 8280 2S, 755GB system memory, batch size 1, FP32, TensorFlow 2.3.
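Taken at face value, the two throughput figures above imply the following gap (a back-of-envelope calculation using only the numbers quoted on this slide; ">1 minute" is taken as exactly 60 seconds, so the result is a lower bound):

```python
# Figures quoted on the slide.
dgx_queries, dgx_seconds = 16, 1.0   # DGX A100: 16 queries in 1 second
cpu_queries, cpu_seconds = 1, 60.0   # dual-socket CPU server: 1 query in >1 minute

# Throughput ratio (lower bound, since the CPU time is ">1 minute").
speedup = (dgx_queries / dgx_seconds) / (cpu_queries / cpu_seconds)
print(f"DGX A100 delivers at least {speedup:.0f}x the CPU throughput")
```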
LINEAR SCALING TO 1 TRILLION PARAMETERS
Over 50% of sustained peak with 3,000+ GPUs.
[Chart: sustained petaflops (2–512) vs. number of GPUs (8–4,096), with data labels showing model size in parameters: 1.7B, 3.6B, 7.5B, 18B, 39B, 76B, 145B, 310B, 530B, 1T. Reference models for scale: BERT (340M), GPT-2 (1.5B), Turing-NLG (17.2B), and GPT-3 (175B), the basis of GPT-3-based chatbots. A few hundred iterations of each model were run on the given GPU counts to estimate petaflops (not trained to convergence).]
22.
A GPU FOR EVERY VIRTUAL WORKLOAD
Expanding Workloads Drive the Need for Specialized Accelerators

A100 — Highest-performance compute: AI, HPC, data processing
• Fastest compute, FP64; up to 7 MIG instances
• DL training, scientific research, data analytics
• 250W & 300W | 40GB & 80GB | 2-slot FHFL | NVLINK

A30 — AI inference & mainstream compute
• Versatile mainstream compute; FP64; up to 4 MIG instances
• Language processing, conversational AI, recommender systems
• 165W | 24GB | 2-slot FHFL | NVLINK

A40 — Highest-performance graphics: visual computing
• Fastest RT graphics, largest render models
• Cloud rendering, Cloud XR, Omniverse
• 300W | 48GB | 2-slot FHFL | NVLINK

A10 — Mainstream graphics & video with AI
• High-density video & graphics
• 4K cloud gaming, virtual workstations, video conferencing, graphics and video with AI
• 150W | 24GB | 1-slot FHFL

T4 — Small-footprint datacenter and edge inference
• Compact & versatile
• Edge AI, edge video, mobile cloud games
• 70W | 16GB | 1-slot low profile

A16 — Highest-density virtual desktop
• High-res, multi-monitor; max number of encode/decode streams
• Virtual desktop, transcoding, compute graphics
• 250W | 4x 16GB | 2-slot FHFL
23.
INTRODUCING A30
Versatile Compute Acceleration for Mainstream Enterprise Servers
Purpose-built for inference and flexible enterprise compute — 20x T4 AI performance.
• Multi-Instance GPU: up to 4 concurrent instances per GPU, with QoS
• Compute: 3rd-gen Tensor Cores, fast FP64
• High-bandwidth memory: ultra-low latency
• Power efficient: excellent perf/W
• Sparsity acceleration: a further 2x speedup
A30 specifications:
• GPU architecture: NVIDIA Ampere
• Multi-Instance GPU: 4 instances @ 6GB each, or 2 instances @ 12GB each
• GPU memory: 24GB HBM2
• Memory bandwidth: 933 GB/s
• Interconnect: PCIe Gen4 (x16); 1x NVLINK bridge
• Form factor: 2-slot FHFL
• Max power: 165W
• Schedule: production
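Both MIG layouts listed above partition the A30's 24GB evenly across instances. A toy sanity check of that arithmetic (the profile names follow MIG's `<slices>g.<memory>gb` naming convention for illustration; the exact profiles an A30 exposes should be confirmed against NVIDIA's MIG documentation):

```python
# A30: 24GB HBM2 split evenly across concurrent MIG instances.
total_memory_gb = 24

# The two configurations listed on the slide.
configs = {
    "4x 1g.6gb":  (4, 6),   # 4 instances @ 6GB each
    "2x 2g.12gb": (2, 12),  # 2 instances @ 12GB each
}

for name, (instances, mem_gb) in configs.items():
    # Every layout must account for the full 24GB.
    assert instances * mem_gb == total_memory_gb
    print(f"{name}: {instances} x {mem_gb}GB = {instances * mem_gb}GB")
```

Because each instance gets its own memory slice and compute slice, QoS holds: one tenant's workload cannot starve another's, which is the point of the "up to 4 concurrent instances per GPU (QoS)" claim.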
24.
NVIDIA A10
Mainstream Graphics and Video with AI
Enrich graphics and video applications with powerful AI.
• New SM, 2nd-gen RT Cores, 3rd-gen Tensor Cores: 2.5x graphics performance, >3x inference performance*
• Media acceleration: AV1 decode, multiple 4K streams, 8K HDR
• Cloud gaming: increased concurrent users (CCU)
• Power efficient: excellent perf/W
• High-density servers: single-wide form factor
A10 specifications:
• NVIDIA architecture: Ampere
• GPU memory: 24GB GDDR6
• Memory bandwidth: 600 GB/s
• Interconnect: PCIe Gen4 (x16)
• Form factor: FHFL (1-slot)
• Max power: 150W
• Availability: production
Workloads: 4K cloud gaming, virtual workstations, video conferencing, media effects, super resolution.
* Performance projections using 20 AAA games and INT8 (with sparsity) relative to T4.
25.
NVIDIA A16
Unprecedented User Experience and Density for Graphics-Rich VDI
High user density for modern virtual desktops:
• 64GB (4x 16GB per GPU) GDDR6 memory; up to 64 multimedia-rich virtual desktops per board
• Flexibility for heterogeneous users: simultaneously host different user profiles on one board
• Purpose-built for high user density with NVIDIA vPC: 2x density versus previous generation*
Superior-quality video:
• Supports H.265 encode/decode, plus VP9 and AV1 decode
• Higher-resolution monitors for streaming video & multimedia
• More than 2x encoder throughput*
• Latest NVIDIA Ampere architecture: 2nd-gen RT Cores, 3rd-gen Tensor Cores
A16 specifications:
• NVIDIA architecture: Ampere
• GPU memory: 4x 16GB GDDR6
• Memory bandwidth: 4x 232 GB/s
• Interconnect: PCIe Gen4 (x16)
• Form factor: FHFL (2-slot)
• Max power: 250W
• Availability: Q2 2021
27.
INTRODUCING THE DATA PROCESSING UNIT (DPU)
Software-Defined Data Center Infrastructure-on-a-Chip
TRADITIONAL SERVER (CPU + NVIDIA NIC):
• Infrastructure management, software-defined security, software-defined storage, software-defined networking, and acceleration engines all run on the CPU, alongside the VMs and containers
• Manual infrastructure management | security appliances | storage systems | static networks
DPU-ACCELERATED SERVER (CPU + NVIDIA DPU with Arm cores & accelerators):
• Infrastructure management, software-defined security, software-defined storage, software-defined networking, and acceleration engines are offloaded to the DPU, freeing the CPU for VMs and containers
• Microservices | east-west traffic | storage access | zero-trust security
28.
ANNOUNCING NVIDIA BLUEFIELD-2
Data Center Infrastructure on a Chip
• Up to 200Gb/s Ethernet and InfiniBand, PAM4/NRZ
• ConnectX-6 Dx inside
• 8-core Arm A72 CPU subsystem at over 2.5GHz
• 8MB L2 cache; 6MB L3 cache in 4 tiles
• Fully coherent low-latency interconnect
• Integrated PCIe switch, 16x Gen4
• PCIe root complex or endpoint modes
• Single DDR4 channel
• 1GbE out-of-band management port
• Accelerated security, storage, and networking
29.
NVIDIA DOCA
Enabling a Broad DPU Partner Ecosystem
• Software application framework for BlueField DPUs
• DOCA is to DPUs what CUDA is to GPUs
• Protects developer investment across future DPUs
• Certified reference applications, APIs & partner solutions
• Rich partner ecosystem across industries and workloads
[Diagram: DOCA stack — acceleration libraries for security, networking, and storage, with orchestration, management, and telemetry services, supporting cybersecurity, edge, storage platform, and infrastructure partners]
30.
NVIDIA BLUEFIELD-2 DPU TRANSFORMS DATA CENTERS
Securing and Accelerating Modern Application Workloads
• Cloud computing: bare-metal as a service
• Enterprise AI
• 5G networks
• HPC / AI
• Cloud gaming
• AI-powered cybersecurity
31.
NVIDIA & VMWARE ENABLE HYBRID CLOUD ARCHITECTURE
Project Monterey
Run Modern Workloads Efficiently Over New Composable, Disaggregated Infrastructure
• Peak performance: maximize network bandwidth while saving x86 cycles for top application performance
• Unified, consistent operations: support bare-metal servers, simplify lifecycle management, and reduce TCO
• Zero-trust security model: ensure comprehensive application security without compromising performance
• Application-driven infrastructure: dynamically assemble right-sized infrastructure at runtime based on unique application needs
[Diagram: apps (bare metal, Windows + Linux) run on compute under the ESXi hypervisor, while the NVIDIA BlueField-2 DPU provides functional isolation and hosts NSX services — networking, distributed firewall, storage (vSAN data), and host management]
32.
SECURING AND ACCELERATING CLOUD GAMING PLATFORMS
The BlueField-2 DPU runs gaming infrastructure services — game shader cache, video streamer/accelerator, SDDC security, and streaming (NAT | DDoS | reverse proxy) — alongside the gaming seats.
• Enhanced gaming experience: ensure a delightful user experience while delivering consistent, predictable application performance
• Secure infrastructure: protect data and assets in the cloud without compromising application performance
• More concurrent users: scale the number of concurrent users per server by freeing compute resources from infrastructure tasks
• Unified, consistent operations: run bare-metal servers, simplify lifecycle management, and reduce TCO
33.
ANNOUNCING NVIDIA BLUEFIELD-3 DPU
First 400Gb/s Data Processing Unit
• 22B transistors
• 400Gb/s Ethernet & InfiniBand connectivity
• 400Gb/s crypto acceleration
• 16 Arm CPU cores — equivalent to 300 x86 cores
• 18M IOPS elastic block storage
36.
ANNOUNCING THE WORLD'S FIRST CLOUD-NATIVE SUPERCOMPUTER
NVIDIA DGX SuperPOD with NVIDIA BlueField
DGX SUPERPOD WHITE-GLOVE SERVICES
Benefit from our experience — DGX SuperPOD now delivers:
• Secure multi-tenancy with bare-metal performance: secure per-user data partitioning and isolation via the NVIDIA BlueField DPU
• Best-of-breed infrastructure management with NVIDIA Base Command: resource allocation, infrastructure monitoring, dashboards/analytics
• Unmatched expertise in platform operations
• Continuous innovation delivered automatically from NVIDIA
38.
GIANT MODELS PUSHING LIMITS OF EXISTING ARCHITECTURE
Requires a New Architecture
System bandwidth bottleneck (current x86 + 4-GPU architecture, DDR4/HBM2e):
• GPU memory (HBM2e): 8,000 GB/s
• CPU memory (DDR4): 200 GB/s
• PCIe Gen4 (effective per GPU): 16 GB/s
• Memory-to-GPU: 64 GB/s
[Chart: NLP model size in trillions of parameters, log scale, vs. year (2018–2023): ELMo (94M), BERT-Large (340M), GPT-2 (1.5B), Megatron-LM (8.3B), T5 (11B), Turing-NLG (17.2B), GPT-3 (175B), extrapolating onward]
100 TRILLION PARAMETER MODELS BY 2023
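The 100-trillion-parameter extrapolation can be sanity-checked from just two points on the chart. This is a rough estimate, not the slide's actual methodology — real model-size growth is not perfectly exponential, and the choice of endpoints changes the answer:

```python
import math

# Two data points from the chart: BERT-Large (2018) and GPT-3 (2020).
params_2018 = 340e6   # BERT-Large, 340M parameters
params_2020 = 175e9   # GPT-3, 175B parameters

# Doubling time implied by that two-year window.
doublings = math.log2(params_2020 / params_2018)
doubling_time_months = 2 * 12 / doublings

# Years past 2020 until 100T parameters at the same rate.
doublings_per_year = 12 / doubling_time_months
years_to_100t = math.log2(100e12 / params_2020) / doublings_per_year

print(f"doubling every ~{doubling_time_months:.1f} months; "
      f"100T reached around {2020 + years_to_100t:.0f}")
```

The implied doubling time is under three months, so continuing the trend reaches 100T parameters in the early 2020s — broadly consistent with the slide's "by 2023" claim.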
39.
ANNOUNCING NVIDIA GRACE
Breakthrough CPU Designed for Giant-Scale AI and HPC Applications
• Fastest interconnects: >900 GB/s cache-coherent NVLink CPU-to-GPU (14x); >600 GB/s CPU-to-CPU (2x)
• Next-generation Arm Neoverse cores: >300 SPECrate2017_int_base (estimated)
• Highest memory bandwidth: >500 GB/s LPDDR5x with ECC — >2x higher bandwidth, 10x higher energy efficiency
• Availability: 2023
40.
TURBOCHARGED TERABYTE-SCALE ACCELERATED COMPUTING
Evolving Architecture for New Workloads
CURRENT x86 ARCHITECTURE (x86 + 4 GPUs, DDR4/HBM2e):
• GPU: 8,000 GB/s | CPU: 200 GB/s | PCIe Gen4 (effective per GPU): 16 GB/s | Mem-to-GPU: 64 GB/s
• Transfer 2TB in 30 seconds
INTEGRATED CPU-GPU ARCHITECTURE (4x Grace + 4 GPUs, LPDDR5x/HBM2e):
• GPU: 8,000 GB/s | CPU: 500 GB/s | NVLink: 500 GB/s | Mem-to-GPU: 2,000 GB/s
• Transfer 2TB in 1 second
• 3 days instead of 1 month: fine-tune training of a 1T-parameter model
• Real-time inference on a 0.5T-parameter model: interactive single-node NLP inference
Bandwidth claims rounded to the nearest hundred for illustration.
Performance results based on projections on these configurations — Grace: 8x Grace and 8x A100 with 4th-gen NVIDIA NVLink between CPU and GPU; x86: DGX A100.
Training: "1 month" is fine-tuning a 1T-parameter model on a large custom dataset on 64x Grace + 64x A100 compared to 8x DGX A100 (16x x86 + 64x A100).
Inference: 530B-parameter model on 8x Grace + 8x A100 compared to DGX A100.
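The "2TB in 30 seconds vs. 1 second" claim follows directly from the memory-to-GPU bandwidth figures above (a back-of-envelope check using decimal units and the rounded numbers from the slide):

```python
# Move 2TB of model data from system memory to the GPUs.
data_gb = 2 * 1000  # 2TB in decimal GB, as on the slide

x86_mem_to_gpu_gbps = 64      # GB/s, current x86 architecture
grace_mem_to_gpu_gbps = 2000  # GB/s, integrated CPU-GPU architecture

x86_seconds = data_gb / x86_mem_to_gpu_gbps      # ~30 s on the slide
grace_seconds = data_gb / grace_mem_to_gpu_gbps  # 1 s on the slide
print(f"x86: {x86_seconds} s, Grace: {grace_seconds} s")
```

At giant-model scale, weights no longer fit in GPU memory, so this system-memory-to-GPU path becomes the bottleneck — which is why raising it from 64 to 2,000 GB/s, rather than raw GPU FLOPS, drives the training and inference speedups quoted above.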
41.
ANNOUNCING THE WORLD'S FASTEST SUPERCOMPUTER FOR AI
• 20 exaflops of AI
• Accelerated with the NVIDIA Grace CPU and NVIDIA GPUs
• HPC and AI for scientific and commercial apps
• Advancing weather, climate, and materials science