This is a presentation I gave at the NVIDIA AI Conference in Korea. It's about building the world's largest GPU: the DGX-2, the most powerful supercomputer in a single node.
Building the World's Largest GPU
1. BUILDING THE WORLD'S LARGEST GPU
Renee Yao, Senior Product Marketing Manager, AI Systems, NVIDIA
Twitter: @ReneeYao1
2. THE DGX FAMILY OF AI SUPERCOMPUTERS
• CLOUD-SCALE AI: NVIDIA GPU Cloud, the cloud platform with the highest deep learning efficiency
• AI DATA CENTER: DGX-1 with Tesla V100 32GB, the essential instrument for AI research
• AI WORKSTATION: DGX Station with Tesla V100 32GB, the personal AI supercomputer
• AI DATA CENTER: DGX-2 with Tesla V100 32GB, the world's most powerful AI system for the most complex AI challenges
3. 10X PERFORMANCE GAIN IN LESS THAN A YEAR
Time to train (days); workload: FairSeq, 55 epochs to solution, PyTorch training performance:
• DGX-1 with V100 (Sep '17): 15 days
• DGX-2 (Q3 '18): 1.5 days (10 times faster)
Includes software improvements across the stack (NCCL, cuDNN, etc.).
4. DGX-2 NOW SHIPPING
1. NVIDIA Tesla V100 32GB
2. Two GPU boards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512 GB total HBM2 memory, interconnected by plane card
3. Twelve NVSwitches, 2.4 TB/sec bisection bandwidth
4. Eight EDR InfiniBand/100 GigE, 1600 Gb/sec total bidirectional bandwidth
5. PCIe switch complex
6. Two Intel Xeon Platinum CPUs
7. 1.5 TB system memory
8. 30 TB NVMe SSD internal storage
9. Dual 10/25 Gb/sec Ethernet
5. MULTI-CORE AND CUDA WITH ONE GPU
[Diagram: one GPU with GPCs and an XBAR connected to HBM2 memory controllers; a high-speed hub links NVLinks, copy engines, and PCIe I/O. The CPU sends work (data and CUDA kernels) over PCIe and receives results (data).]
• Users explicitly express parallel work in CUDA
• The GPU driver distributes work to available GPC/SM cores
• GPC/SM cores use shared HBM2 to exchange data
6. TWO GPUS WITH PCIE
[Diagram: GPU0 and GPU1, each with GPCs, XBAR, HBM2 and memory controllers, high-speed hub, NVLinks, copy engines, and PCIe I/O; the CPU distributes work (data and CUDA kernels) and collects results over PCIe only.]
• Access to the HBM2 of the other GPU is at PCIe bandwidth (16 GB/s)
• PCIe is the “Wild West” (lots of performance bandits)
• Interactions with the CPU compete with GPU-to-GPU traffic
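To make the PCIe-versus-NVLink gap concrete, here is a minimal sketch (my own illustrative helper, not NVIDIA code) of ideal, overhead-free transfer times at the bandwidths quoted on these two slides:

```python
def transfer_time_seconds(size_gb: float, bw_gb_per_s: float) -> float:
    """Ideal (no-overhead) time to move size_gb across a link of bw_gb_per_s."""
    return size_gb / bw_gb_per_s

# Figures from the slides: PCIe ~16 GB/s, multi-NVLink ~150 GB/s on V100.
PCIE_BW, NVLINK_BW = 16.0, 150.0

# Copying a 4 GB activation tensor between two GPUs:
t_pcie = transfer_time_seconds(4.0, PCIE_BW)      # 0.25 s
t_nvlink = transfer_time_seconds(4.0, NVLINK_BW)  # ~0.027 s
print(f"PCIe: {t_pcie:.3f}s  NVLink: {t_nvlink:.3f}s  ({t_pcie / t_nvlink:.1f}x)")
```

In practice, contention with CPU traffic makes the PCIe case even worse than this ideal ratio suggests.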
7. TWO GPUS WITH NVLINK
• Access to the HBM2 of the other GPU is at multi-NVLink bandwidth (150 GB/s in V100 GPUs)
• All GPCs can access all HBM2 memories
• NVLinks are effectively a “bridge” between XBARs
[Diagram: the same two GPUs, now with their XBARs bridged directly by NVLinks; PCIe carries only CPU work and results.]
8. THE “ONE GIGANTIC GPU” IDEAL
• Number of GPUs is as high as possible
• A single GPU driver process controls all work across all GPUs
• From the perspective of the GPCs, all HBM2s can be accessed without intervention by other processes (LD/ST instructions, copy-engine RDMA, everything “just works”)
• Access to all HBM2s is independent of PCIe
• Bandwidth across bridged XBARs is as high as possible (some NUMA is unavoidable)
[Diagram: four groups of four GPUs joined by a hypothetical NVLink XBAR, with two CPUs attached; the “?” marks the open question of how to build such a crossbar.]
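As a toy illustration of what “all HBM2s accessed without intervention” means, this sketch (the names and layout are hypothetical, not the real driver's) models a flat 512 GB address space in which any load or store routes to the GPU that owns that slice of HBM2:

```python
# Toy model of a unified address space spanning all GPUs' HBM2: each V100
# contributes 32 GB, and an access to any global address is routed to the
# owning GPU's local memory with no CPU involvement.
HBM_PER_GPU_GB = 32
NUM_GPUS = 16

def route(global_addr_gb: float) -> tuple:
    """Map a global HBM2 address (in GB) to (gpu_index, local_offset_gb)."""
    gpu = int(global_addr_gb // HBM_PER_GPU_GB)
    if gpu >= NUM_GPUS:
        raise ValueError("address beyond the 512 GB unified HBM2 space")
    return gpu, global_addr_gb % HBM_PER_GPU_GB

# A GPC issuing a load at global address 100 GB lands in GPU 3's HBM2:
print(route(100.0))  # (3, 4.0)
```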
10. DGX-2 NOW SHIPPING
11. EXPANDABLE SYSTEM
• Taking this to the limit: connect one NVLink from each GPU to each of 6 switches
• No routing between different switch planes required
• 8 of the 18 NVLinks available per switch are used to connect to GPUs
• 10 NVLinks available per switch for communication outside the local group (only 8 are required to support full bandwidth)
• This is the GPU baseboard configuration for DGX-2
[Diagram: eight V100 GPUs, each with one NVLink into a single NVSwitch plane.]
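The port arithmetic on this slide can be checked directly, using only the counts stated above:

```python
# Port accounting for one NVSwitch plane: 18 NVLink ports per NVSwitch,
# 8 GPUs per baseboard, one link from each GPU into this plane.
PORTS_PER_NVSWITCH = 18
GPUS_PER_BASEBOARD = 8

ports_to_gpus = GPUS_PER_BASEBOARD * 1           # one link per GPU
ports_free = PORTS_PER_NVSWITCH - ports_to_gpus  # available to leave the board
ports_needed_for_full_bw = ports_to_gpus         # must match the GPU-side links

print(ports_free, ports_needed_for_full_bw)  # 10 8
assert ports_free >= ports_needed_for_full_bw  # full bandwidth is sustainable
```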
12. DGX-2 NVLINK FABRIC
[Diagram: two eight-GPU baseboards, each with its own NVSwitches, connected switch-to-switch to form the full sixteen-GPU fabric.]
• Two of these building blocks together form a fully connected 16-GPU cluster
• Non-blocking, non-interfering (unless the same destination is involved)
• Regular loads, stores, and atomics just work
• Presenter's note: the astute among you will note that there is a redundant level of switches here, but this configuration simplifies system-level design and manufacturing
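A back-of-envelope check of the 2.4 TB/s bisection figure from the earlier spec slide, assuming the commonly quoted V100 NVLink rate of roughly 25 GB/s per direction (50 GB/s bidirectional) per link — an assumption, not a number stated on this slide:

```python
# Bisection bandwidth of the 16-GPU fabric: cutting between the two
# baseboards, each of the 8 GPUs on one side crosses with all 6 of its links.
LINKS_PER_GPU = 6
BIDIR_GB_S_PER_LINK = 50   # assumed V100 NVLink bidirectional rate
GPUS_PER_SIDE = 8

bisection_gb_s = GPUS_PER_SIDE * LINKS_PER_GPU * BIDIR_GB_S_PER_LINK
print(bisection_gb_s / 1000, "TB/s")  # 2.4 TB/s
```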
13. Data Science HW Architecture
DGX-2 vs. CPU-based systems: 128x memory I/O, 300x core-to-core I/O, 100x processing cores.
• Single CPU node (20 cores, 512 GB memory, 128 GB/s bandwidth): typically very slow with 20 GB+ datasets
• CPU cluster (many such nodes): handles larger datasets but is still slow, bounded by CPU/memory bandwidth, the number of processing cores, and network I/O
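The 128x memory I/O claim is roughly reproducible from public part specs, assuming about 900 GB/s of HBM2 bandwidth per V100 (my assumption, not stated on the slide):

```python
# Aggregate DGX-2 HBM2 bandwidth vs. the single CPU node on this slide.
HBM2_GB_S_PER_GPU = 900   # assumed V100 HBM2 bandwidth
NUM_GPUS = 16
CPU_NODE_GB_S = 128       # from the slide

ratio = (HBM2_GB_S_PER_GPU * NUM_GPUS) / CPU_NODE_GB_S
print(f"~{ratio:.0f}x")  # roughly 112x, the same order as the slide's 128x
```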
14. DGX-2 PCIE NETWORK
[Diagram: two x86 sockets connected by QPI; each socket roots a tree of PCIe switches (x6 per socket) fanning out to the sixteen V100 GPUs and the NVSwitch planes, with eight 200G NICs attached to PCIe switches shared by pairs of GPUs.]
• Xeon sockets are QPI-connected, but affinity binding keeps GPU-related traffic off QPI
• The PCIe tree has NICs connected to pairs of GPUs to facilitate GPUDirect RDMA over the IB network
• Configuration and control of the NVSwitches is via a driver process running on the CPUs
16. NVIDIA DGX-2: SYSTEM COOLING
• Forced-air cooling of baseboards, I/O expander, and CPUs provided by 10 92 mm fans
• 4 supplemental 60 mm internal fans cool the NVMe drives and PSUs
• Air reaching the NVSwitches is pre-heated by the GPUs, so they use “full height” heatsinks
17. DGX-2: cuFFT
• Results are “iso-problem instance” (more GFLOPS means shorter running time)
• As the problem is split over more GPUs, it takes longer to transfer data than to calculate locally
[Chart: cuFFT GFLOPS, DGX-1V vs. ½ DGX-2.]
19. DGX-2: UP TO 2.7X ON TARGET APPS
Baseline: 2x DGX-1 (Volta). Comparison: one DGX-2 with NVSwitch.
• Physics (MILC benchmark, 4D grid): 13K → 26K GFLOPS, 2X faster
• Weather (IFS benchmark, FFT, all-to-all): 11 → 26 steps/sec, 2.4X faster
• Recommender (sparse embedding, reduce & broadcast): 11B → 22B lookups/sec, 2X faster
• Language model (Transformer with MoE, all-to-all): 9.3 hr → 3.4 hr time to train, 2.7X faster
Hardware: the 2 DGX-1V servers each have dual-socket Xeon E5-2698 v4 processors and 8x V100 32GB GPUs, connected via 4 EDR IB ports; the DGX-2 server has dual-socket Xeon Platinum 8168 processors and 16 V100 32GB GPUs.
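The headline speedups can be re-derived from the per-workload numbers above:

```python
# (baseline 2x DGX-1, DGX-2) pairs from the slide; higher is better.
throughput = {
    "Physics (MILC)": (13_000, 26_000),  # GFLOPS
    "Weather (IFS)":  (11, 26),          # steps/sec
    "Recommender":    (11e9, 22e9),      # lookups/sec
}
for name, (baseline, dgx2) in throughput.items():
    print(f"{name}: {dgx2 / baseline:.1f}x")

# The language model is measured as time to solution, so the ratio flips:
print(f"Transformer with MoE: {9.3 / 3.4:.1f}x")  # 2.7x
```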
20. FLEXIBILITY WITH VIRTUALIZATION
Enable your own private DL training cloud for your enterprise:
• KVM hypervisor for Ubuntu Linux
• Enables teams of developers to access a DGX-2 simultaneously
• Flexibly allocate GPU resources to each user and their experiments
• Full GPU and NVSwitch access within VMs, from all GPUs down to as few as 1
21. CRISIS MANAGEMENT SOLUTION
Natural disasters are increasingly causing major destruction to life, property, and economies. DFKI is using the NVIDIA DGX-2 to evolve DeepEye, which uses satellite images enriched with social media content to identify natural disasters, into a crisis management solution. With the increased GPU memory and fully connected GPUs based on the NVSwitch architecture, DFKI can build bigger models and process more data to aid rescuers in their decision-making for faster, more efficient dispatching of resources.
22.
“Fujifilm applies AI in a wide range of fields. In healthcare, multiple NVIDIA GPUs will deliver high-speed computation to develop AI supporting image diagnostics. The introduction of this supercomputer will massively increase our processing power. We expect that AI learning that once took days to complete can now be completed within hours.”
Akira Yoda, Chief Digital Officer, FUJIFILM Corporation
- Pharmaceuticals
- Bio CDMO
- Regenerative medicine
- Analyzing and recognizing medical images
- Simulations (display materials and fine chemicals)
23. AI ADOPTERS IMPEDED BY INFRASTRUCTURE
• AI boosts profit margins by up to 15%
• 40% see infrastructure as impeding AI
Source: 2018 CTA Market Research
24. THE CHALLENGE OF AI INFRASTRUCTURE
Short-term thinking leads to longer-term problems:
• DESIGN GUESSWORK: ensuring the architecture delivers predictable performance that scales
• DEPLOYMENT COMPLEXITY: procuring, installing, and troubleshooting compute, storage, networking, and software
• MULTIPLE POINTS OF SUPPORT: contending with multiple vendors across multiple layers in the stack
25. DESIGNING INFRASTRUCTURE THAT SCALES
Insights gained from deep learning data centers:
• Rack design: DL drives hardware close to operational limits; similarities to HPC best practices
• Networking: IB- or Ethernet-based fabric; 100 Gbps interconnect; high bandwidth, ultra-low latency
• Storage: datasets range from 10k's to millions of objects; terabyte levels of storage and up; high IOPS, low latency
• Facilities: assume higher watts per rack; higher FLOPS/watt means less data center floorspace required
• Software: scale requires “cluster-aware” software
Example:
• Autonomous vehicle = 1 TB/hr of data
• Training sets up to 500 PB
• RN50 (ResNet-50): 113 days to train
• Objective: 7 days
• 6 simultaneous developers = 97-node cluster
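The cluster-sizing example works out as follows, assuming linear scaling across nodes (a simplification):

```python
import math

# ResNet-50 takes 113 days on one node; the target is 7 days, with
# 6 developers training concurrently.
single_node_days = 113
target_days = 7
developers = 6

nodes_per_job = single_node_days / target_days         # ~16.1 nodes, if scaling is linear
cluster_nodes = math.ceil(nodes_per_job * developers)  # round up to whole nodes
print(cluster_nodes)  # 97
```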
26. NVIDIA DGX POD™
A reference architecture for GPU data centers:
• Initial reference architecture based on the NVIDIA® DGX-1™ server
• Designed for the deep learning training workflow
• Baseline for other reference architectures:
  • Easily upgraded to NVIDIA DGX-2™ and NVIDIA HGX-2™ servers
  • Industry-specific PODs
  • Storage and network partners
  • Server OEM solutions
27. DGX DATA CENTER REFERENCE DESIGN
Easy Deployment of DGX Servers for Deep Learning
Content:
• AI Workflow and Sizing
• NVIDIA AI Software
• DGX POD Design
• DGX POD Installation and
Management
28. NVIDIA AUTOMOTIVE WORKFLOW ON SATURNV
Research workflow
Training:
• Many-node: user submits 1 job with many single-node training sessions (hyperparameter sweep)
• Multi-node: user submits 1 job with a single multi-node training session
Inference:
• Many-GPU: user submits many jobs, each with single-GPU inference
[Chart: storage performance vs. interconnect performance requirements for many-node training, multi-node training, and inference.]
30. NVIDIA DGX POD — DGX-1
Reference architecture in a single 35 kW high-density rack
In real-life DL application development, one to two DGX-1 servers per developer are often required:
• One DGX POD supports five developers (AV workload)
• Each developer works on two experiments per day
• One DGX-1 per developer per experiment per day*
Fits within a standard-height 42 RU data center rack:
• Nine DGX-1 servers (9 x 3 RU = 27 RU)
• Twelve storage servers (12 x 1 RU = 12 RU)
• 10 GbE (min) storage and management switch (1 RU)
• Mellanox 100 Gbps intra-rack high-speed network switches (1 or 2 RU)
*300,000–0.5M images x 120 epochs @ 480 images/sec, ResNet-18 backbone detection network per experiment
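A quick check that the bill of materials above fills the 42 RU rack exactly:

```python
# DGX-1 POD rack budget, using the RU counts from this slide.
RACK_RU = 42
contents = {
    "DGX-1 servers":       9 * 3,   # 27 RU
    "storage servers":     12 * 1,  # 12 RU
    "management switch":   1,
    "high-speed switches": 2,       # worst case: 2 RU
}
used = sum(contents.values())
print(used, "of", RACK_RU, "RU")  # 42 of 42 RU
assert used <= RACK_RU
```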
31. NVIDIA DGX POD — DGX-2
Reference architecture in a single 35 kW high-density rack
In real-life DL application development, one DGX-2 per developer minimizes model training time:
• One DGX POD supports at least three developers (AV workload)
• Each developer works on two experiments per day
• One DGX-2 per developer per 2 experiments per day*
Fits within a standard-height 48 RU data center rack:
• Three DGX-2 servers (3 x 10 RU = 30 RU)
• Twelve storage servers (12 x 1 RU = 12 RU)
• 10 GbE (min) storage and management switch (1 RU)
• Mellanox 100 Gbps intra-rack high-speed network switches (1 or 2 RU)
*300,000–0.5M images x 120 epochs @ 480 images/sec, ResNet-18 backbone detection network per experiment
32. NEW DGX PODS
DELIVERY, DEPLOYMENT, DEEP LEARNING IN A DAY
95% Reduction in Deployment Time
5X Increase in Data Scientist Productivity
$0 Integration Cost
Adopted by Leading Auto, Healthcare & Telco Companies
33. NVIDIA DGX SYSTEMS
Faster AI Innovation and Insight
The World’s First Portfolio of
Purpose-Built AI Supercomputers
• Powered by NVIDIA GPU Cloud
• Get Started in AI – Faster
• Effortless Productivity
• Performance Without Compromise
For More Information
DGX Systems: nvidia.com/dgx
DGX Pod: https://www.nvidia.com/en-us/data-center/resources/nvidia-dgx-pod-reference-architecture/
DGX Reference Architecture: https://www.nvidia.com/en-us/data-center/dgx-reference-architecture/