This is a presentation I gave at the NVIDIA AI Conference in Korea. It's about building the largest GPU, the DGX-2, the most powerful supercomputer in a single node.
Modular by Design: Supermicro’s New Standards-Based Universal GPU Server - Rebekah Rodriguez
In this webinar, members of the Server Solution Team, along with a member of Supermicro’s Product Office, will discuss Supermicro’s Universal GPU Server: the server’s modular, standards-based design, the important roles of the OCP Accelerator Module (OAM) form factor and the Universal Baseboard (UBB) in the system, as well as AMD's next-generation HPC accelerator. In addition, we will get insights into trends in the HPC and AI/machine learning space, including the software platforms and best practices that are driving innovation in our industry and daily lives. In particular:
• Tools that enable use of high-performance hardware for HPC and deep learning applications
• Tools that enable use of multiple GPUs, including RDMA, to solve highly demanding HPC and deep learning models such as BERT
• Running applications in containers with AMD’s next-generation GPU system
Supermicro’s Universal GPU: Modular, Standards Based and Built for the Future - Rebekah Rodriguez
The Universal GPU system architecture combines the latest technologies that support multiple GPU form factors, CPU choices, storage, and networking options. Together, these components are optimized to deliver high performance in a balanced architecture in a highly scalable system. Systems can be optimized for each customer’s specific Artificial Intelligence (AI), Machine Learning (ML), or High Performance Computing (HPC) applications. Organizations worldwide are demanding new options for their future computing environments, ones with the thermal headroom for the next generation of CPUs and GPUs.
Join this webinar to learn how to leverage Supermicro's Universal GPU system to simplify customer deployments and deliver ultimate modularity and customization options, from AI to Omniverse environments.
Join us for an exciting and informative preview of the broadest range of next-generation systems optimized for tomorrow’s data center workloads, powered by 4th Gen Intel® Xeon® Scalable Processors (formerly codenamed Sapphire Rapids).
Experts from Supermicro and Intel will discuss how the upcoming Supermicro X13 systems will enable new performance levels utilizing state-of-the-art technology, including DDR5, PCIe 5.0, Compute Express Link™ 1.1, and Intel® Advanced Matrix Extensions (Intel AMX).
Supermicro AI Pod that’s Super Simple, Super Scalable, and Super Affordable - Rebekah Rodriguez
The worlds of HPC and AI are evolving at a tremendous rate. The demands of modern-day applications put immense pressure on local IT teams and resources. More often than not, this pressure comes from needing an AI strategy to speed up mission-critical applications - but that can come at a cost that hinders adoption. In this webinar, Supermicro, together with International Computer Concepts (ICC) and Define Tech, will demonstrate their AI Super Pod that delivers on AI strategy needs without breaking the bank.
The Power of HPC with Next Generation Supermicro Systems - Rebekah Rodriguez
Witness the astonishing improvement in performance and security with the new generation of Supermicro platforms. New Supermicro systems deliver unprecedented levels of compute power for the most challenging high-performance workloads. In this Supercomputing roundtable, learn how the new Supermicro products provide a differentiated advantage for early adopters of the most advanced accelerated computing infrastructure in the world.
Supermicro Servers with Micron DDR5 & SSDs: Accelerating Real World Workloads - Rebekah Rodriguez
With the recent announcements from Intel and Supermicro, we are seeing a number of new systems that support exciting new technologies and provide a scalable foundation for the data center of the future. These systems also deliver significant benefits for today's real-world problems and help optimize existing HPC and business applications.
Realizing these benefits demands innovations in performance, cost management, and integration. It is not one company's problem to solve but an ecosystem that can collaborate and configure the right solutions to meet today's needs.
Please join Supermicro and Micron to learn about their close collaboration, the roles of Supermicro X13 systems and DDR5 memory in meeting these requirements, and the results we are seeing today.
New Accelerated Compute Infrastructure Solutions from Supermicro - Rebekah Rodriguez
Join us for a special edition of Supermicro’s TECHTalk as we introduce Supermicro’s new accelerated compute infrastructure solutions. A number of Supermicro experts will share insights and updates on one of the industry’s broadest portfolios of NVIDIA-Certified GPU systems, which deliver new levels of performance for AI infrastructure with the new H100 Tensor Core GPUs.
In this deck from the UK HPC Conference, Gunter Roeth from NVIDIA presents: Hardware & Software Platforms for HPC, AI and ML.
"Data is driving the transformation of industries around the world and a new generation of AI applications are effectively becoming programs that write software, powered by data, vs by computer programmers. Today, NVIDIA’s tensor core GPU sits at the core of most AI, ML and HPC applications, and NVIDIA software surrounds every level of such a modern application, from CUDA and libraries like cuDNN and NCCL embedded in every deep learning framework and optimized and delivered via the NVIDIA GPU Cloud to reference architectures designed to streamline the deployment of large scale infrastructures."
Watch the video: https://wp.me/p3RLHQ-l2Y
Learn more: http://nvidia.com
and
http://hpcadvisorycouncil.com/events/2019/uk-conference/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS... - Databricks
GPU acceleration has been at the heart of scientific computing and artificial intelligence for many years now. GPUs provide the computational power needed for the most demanding applications, such as deep neural networks and nuclear or weather simulation. Since the launch of RAPIDS in mid-2018, this vast computational resource has become available for data science workloads too. The RAPIDS toolkit, which is now available on the Databricks Unified Analytics Platform, is a GPU-accelerated drop-in replacement for utilities such as pandas, NumPy, scikit-learn, and XGBoost. Through its use of Dask wrappers, the platform allows for true large-scale computation with minimal, if any, code changes.
The goal of this talk is to discuss RAPIDS, its functionality and architecture, as well as the way it integrates with Spark, in many cases providing several orders of magnitude of acceleration over its CPU-only counterparts.
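Because the RAPIDS cuDF DataFrame mirrors the pandas API, the "drop-in" claim above can be sketched with a hypothetical snippet in which only the import differs between the GPU and CPU paths (the fallback logic is an illustration, not the talk's code):

```python
# RAPIDS cuDF mirrors the pandas API, so the same workload can fall back
# to pandas when no GPU stack is installed -- illustrating "drop-in".
try:
    import cudf as xdf  # GPU-accelerated DataFrame (requires an NVIDIA GPU)
except ImportError:
    import pandas as xdf  # CPU fallback with the same API

df = xdf.DataFrame({"key": ["a", "b", "a", "b"], "val": [1, 2, 3, 4]})
totals = df.groupby("key").sum()  # identical call on either backend
print(int(totals.loc["a", "val"]))  # 4
```

The rest of the pipeline (groupby, join, aggregation code) stays unchanged; only the import line selects the backend.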
Outlining a sweeping vision for the “age of AI,” NVIDIA CEO Jensen Huang Monday kicked off the GPU Technology Conference.
Huang made major announcements in data centers, edge AI, collaboration tools and healthcare in a talk simultaneously released in nine episodes, each under 10 minutes.
“AI requires a whole reinvention of computing – full-stack rethinking – from chips, to systems, algorithms, tools, the ecosystem,” Huang said, standing in front of the stove of his Silicon Valley home.
Behind a series of announcements touching on everything from healthcare to robotics to videoconferencing, Huang’s underlying story was simple: AI is changing everything, which has put NVIDIA at the intersection of changes that touch every facet of modern life.
More and more of those changes can be seen, first, in Huang’s kitchen, with its playful bouquet of colorful spatulas, which has served as the increasingly familiar backdrop for announcements throughout the COVID-19 pandemic.
“NVIDIA is a full stack computing company – we love working on extremely hard computing problems that have great impact on the world – this is right in our wheelhouse,” Huang said. “We are all-in, to advance and democratize this new form of computing – for the age of AI.”
This GTC is one of the biggest yet. It features more than 1,000 sessions—400 more than the last GTC—in 40 topic areas. And it’s the first to run across the world’s time zones, with sessions in English, Chinese, Korean, Japanese, and Hebrew.
AMD has been away from the HPC space for a while, but now the company is coming back in a big way with an open software approach to GPU computing. The Radeon Open Compute Platform (ROCm) was born from the Boltzmann Initiative announced last year at SC15. Now available on GitHub, the ROCm platform brings a rich foundation to advanced computing by better integrating the CPU and GPU to solve real-world problems.
"We are excited to present ROCm, the first open-source HPC/ultrascale-class platform for GPU computing that’s also programming-language independent. We are bringing the UNIX philosophy of choice, minimalism and modular software development to GPU computing. The new ROCm foundation lets you choose or even develop tools and a language run time for your application."
Watch the video presentation: http://wp.me/p3RLHQ-fJT
Learn more: https://radeonopencompute.github.io/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
NVIDIA GPUs Power HPC & AI Workloads in Cloud with Univa - inside-BigData.com
In this deck from the Univa Breakfast Briefing at ISC 2018, Duncan Poole from NVIDIA describes how the company is accelerating HPC in the Cloud.
Learn more: https://www.nvidia.com/en-us/data-center/dgx-systems/
and
http://univa.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Today’s groundbreaking scientific discoveries are taking place in HPC data centers. Using containers, researchers and scientists gain the flexibility to run HPC application containers on NVIDIA Volta-powered systems including Quadro-powered workstations, NVIDIA DGX Systems, and HPC clusters.
Best practices for optimizing Red Hat platforms for large scale datacenter de... - Jeremy Eder
This presentation is from NVIDIA GTC DC on Oct 23, 2018:
https://youtu.be/z5gEUL6dJRI
Corresponding Press Release: https://www.redhat.com/en/about/press-releases/red-hat-nvidia-align-open-source-solutions-fuel-emerging-workloads
Blog: https://www.redhat.com/en/blog/red-hat-and-nvidia-positioning-red-hat-enterprise-linux-and-openshift-primary-platforms-artificial-intelligence-and-other-gpu-accelerated-workloads
Demo Video:
https://www.youtube.com/watch?v=9iVYjA_WJgU
GPU Accelerated Virtual Desktop Infrastructure (VDI) on OpenStack - Brian Schott
This is a talk presented at OpenStack DC Meetup #9 on GPU pass-through of an NVIDIA GRID K2 card with XenServer, Microsoft Hyper-V, and open-source Xen hypervisors.
Benefits of Multi-rail Cluster Architectures for GPU-based Nodes - inside-BigData.com
Craig Tierney from NVIDIA presented this deck at the MVAPICH User Group meeting.
"As high performance computing moves toward GPU-accelerated architectures, single node application performance can be between 3x and 75x faster than the CPUs alone. Performance increases of this size will require increases in network bandwidth and message rate to prevent the network from becoming the bottleneck in scalability. In this talk, we will present results from NVLink enabled systems connected via quad-rail EDR Infiniband."
Watch the video: https://wp.me/p3RLHQ-hkr
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
1) NVIDIA-Iguazio Accelerated Solutions for Deep Learning and Machine Learning (30 mins):
About the speaker:
Dr. Gabriel Noaje, Senior Solutions Architect, NVIDIA
http://bit.ly/GabrielNoaje
2) GPUs in Data Science Pipelines (30 mins)
- GPU as a Service for enterprise AI
- A short demo on the usage of GPUs for model training and model inferencing within a data science workflow
About the speaker:
Anant Gandhi, Solutions Engineer, Iguazio Singapore. https://www.linkedin.com/in/anant-gandhi-b5447614/
Webinar: NVIDIA JETSON – Artificial Intelligence in the Palm of Your Hand - Embarcados
Webinar objective: Learn how the NVIDIA Jetson platform and its tools enable you to develop and deploy robots, drones, IVA applications, and other AI-powered autonomous machines that think for themselves.
Supported by: Arrow and NVIDIA.
Guest: Marcel Saraiva
Enterprise Account Manager at NVIDIA, an executive with 20 years of experience in the IT market whose career includes stints at SGI (Silicon Graphics), Intel, and ScanSource. An electrical engineer trained at FEI, with a postgraduate degree in Marketing from FAAP and an MBA in Business Management from FGV.
Webinar link: https://www.embarcados.com.br/webinars/nvidia-jetson-a-inteligencia-artificial-na-palma-de-sua-mao/
Nvidia Deep Learning Solutions - Alex Sabatier - Sri Ambati
Alex Sabatier from Nvidia talks about the future of deep learning from a chipmaker's perspective.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Harnessing the virtual realm for successful real world artificial intelligence - Alison B. Lowndes
Artificial intelligence is impacting all areas of society, from healthcare and transportation to smart cities and energy. This talk covers how NVIDIA invests in both internal pure research and accelerated computation to enable its diverse customer base across gaming & extended reality, graphics, AI, robotics, simulation, high-performance scientific computing, healthcare, and more. You will be introduced to the GPU computing platform and shown successfully deployed real-world applications, as well as a glimpse of the current state of the art across academia, enterprise, and startups.
Axel Koehler from Nvidia presented this deck at the 2016 HPC Advisory Council Switzerland Conference.
“Accelerated computing is transforming the data center that delivers unprecedented throughput, enabling new discoveries and services for end users. This talk will give an overview about the NVIDIA Tesla accelerated computing platform including the latest developments in hardware and software. In addition it will be shown how deep learning on GPUs is changing how we use computers to understand data.”
In related news, the GPU Technology Conference takes place April 4-7 in Silicon Valley.
Watch the video presentation: http://insidehpc.com/2016/03/tesla-accelerated-computing/
See more talks in the Swiss Conference Video Gallery:
http://insidehpc.com/2016-swiss-hpc-conference/
Sign up for our insideHPC Newsletter:
http://insidehpc.com/newsletter
E-Commerce Brasil Forum | NVIDIA technologies applied to e-commerce. Far beyond... - E-Commerce Brasil
NVIDIA technologies applied to e-commerce. Far beyond the hardware.
Jomar Silva
Developer Relations Manager for Latin America - NVIDIA
https://eventos.ecommercebrasil.com.br/forum/
Medical imaging refers to several different technologies used to view the human body in order to diagnose, monitor, or treat medical conditions. Today, GPUs are found in almost all imaging modalities, including CT, MRI, X-ray, and ultrasound, bringing compute capabilities to edge devices. With the boom of deep learning research in medical imaging, more efficient and improved approaches are being developed to enable AI-assisted workflows.
Women L.E.A.D. Toastmasters Appreciation Event - Renee Yao
This slideshare is used to facilitate the Women L.E.A.D. toastmasters public speaking appreciation event: https://womenleadtm.com/meetings/happy-hour-in-person-optional/
This slide deck is put together to support Women L.E.A.D. Toastmasters workshop, How to be An Effective Mentor. YouTube: https://www.youtube.com/watch?v=RHH6-cE2zKM. Meeting: https://womenleadtm.com/meetings/workshop-how-to-be-an-effective-mentor/
Why Toastmasters and How it Helps Your Daily Job - Renee Yao
This slide deck is created for Women L.E.A.D. Toastmasters workshop on May 7th 2021. Recording: https://www.youtube.com/watch?v=3vZqVKWmrCw
Meeting Notes:
https://womenleadtm.com/meetings/workshop-why-toastmasters/
AI in Healthcare | Future of Smart Hospitals - Renee Yao
In this talk, I specifically talk about how NVIDIA healthcare AI software and hardware were used to support healthcare AI startups' innovation. Three startups featured: Caption Health, Artisight, and Hyperfine. Audience: healthcare systems CXOs.
This deck helps public speakers give good and effective evaluations to others, provides a step-by-step guide on how to win an evaluation contest in a Toastmasters competition, and explains why evaluation matters in our daily lives.
Startups Step Up - how healthcare ai startups are taking action during covid-... - Renee Yao
All around the world, people are facing unprecedented challenges and uncertainties as a result of COVID-19. At NVIDIA Inception program, a virtual incubation startup program, which hosts 5000+ AI startups, we see an army of healthcare AI startups that have mobilized to address this global health crisis. This webinar will share real world examples on how each offering plays a critical role during this pandemic.
Live event: https://www.meetup.com/Women-in-Big-Data-Meetup/events/270191555/?action=rsvp&response=3.
YouTube Link: https://www.youtube.com/watch?v=QWkKINi8u4o&feature=youtu.be
Simplifying AI Infrastructure: Lessons in Scaling on DGX Systems - Renee Yao
Simplifying AI Infrastructure: Lessons in Scaling on DGX Systems, the world's most powerful AI systems. This is a presentation I gave at GTC Israel in 2018.
This deck summarizes NetApp Insights 2018 joint ONTAP AI activities with NVIDIA and NetApp. List of activities includes Women In Tech Panel, Fireside chat, Spotlight sessions, the Cube live interview, and Partner Success video.
Accelerate AI w/ Synthetic Data using GANs - Renee Yao
Strata Data Conference in Sep 2018 Presentation
Description:
Synthetic data will drive the next wave of deployment and application of deep learning in the real world across a variety of problems involving speech recognition, image classification, object recognition, and language. All industries and companies will benefit: synthetic data can create conditions through simulation instead of authentic situations (virtual worlds let you avoid the cost of damage, spare human injury, and sidestep other real-world risks) and offers an unparalleled ability to test products, and interactions with them, in any environment.
Join us for this introductory session to learn more about how Generative Adversarial Networks (GAN) are successfully used to improve data generation. We will cover specific real-world examples where customers have deployed GAN to solve challenges in healthcare, space, transportation, and retail industries.
Renee Yao explains how generative adversarial networks (GAN) are successfully used to improve data generation and explores specific real-world examples where customers have deployed GANs to solve challenges in healthcare, space, transportation, and retail industries.
HPE and NVIDIA are delivering a leading portfolio of optimized AI solutions that transform business and industry, enabling deeper insights and helping solve the world’s greatest challenges. Join this session to learn how the NVIDIA V100, the world’s most powerful GPU, powers HPE 6500 Systems, HPE's AI systems, to provide new business insights and outcomes.
Dell and NVIDIA for Your AI workloads in the Data Center - Renee Yao
Join us and learn more about how Dell PowerEdge C4140 Rack Server, powered by four of NVIDIA V100s, the world’s most powerful GPU, address training and inference for the most demanding HPC, data visualization and AI workloads. This enables organizations to take advantage of the convergence of HPC and data analytics and realize advancements in areas including fraud detection, image processing, financial investment analysis and personalized medicine.
Orchestrate Your AI Workload with Cisco HyperFlex, Powered by NVIDIA GPUs - Renee Yao
Deep learning, a collection of statistical machine learning techniques, is transforming every digital business. As data grows, businesses need new ways to capitalize on the volume of information to drive their competitive advantage. GPUs are becoming mainstream in the data center for accelerating containerized AI workloads, and Kubernetes is a popular framework for orchestrating containers at scale. However, managing GPUs in Kubernetes is still nascent, and setting up a Kubernetes cluster with GPUs can be challenging for customers. Join this session to learn how to use Kubernetes to orchestrate your AI workloads on Cisco HyperFlex, powered by the NVIDIA V100, the world’s most powerful GPU.
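As a rough sketch of what GPU scheduling in Kubernetes looks like in practice, a pod can request GPUs through the `nvidia.com/gpu` resource exposed by the NVIDIA device plugin; the pod name and container image below are placeholders, not artifacts from this session:

```yaml
# Minimal pod spec requesting one GPU via the NVIDIA device plugin.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-job            # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: nvcr.io/nvidia/tensorflow:24.01-tf2-py3   # example NGC image
    resources:
      limits:
        nvidia.com/gpu: 1           # schedule onto a node with a free GPU
```

The scheduler only places the pod on a node where the device plugin has advertised a free GPU, which is the core of the orchestration story above.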
This is a supporting deck for my personal blog, "A Toast to My Public Speaking Journey". Link can be found here: https://wordpress.com/post/reneeyao.wordpress.com/27
GraphRAG is All You Need? LLM & Knowledge Graph - Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™UiPathCommunity
In questo evento online gratuito, organizzato dalla Community Italiana di UiPath, potrai esplorare le nuove funzionalità di Autopilot, il tool che integra l'Intelligenza Artificiale nei processi di sviluppo e utilizzo delle Automazioni.
📕 Vedremo insieme alcuni esempi dell'utilizzo di Autopilot in diversi tool della Suite UiPath:
Autopilot per Studio Web
Autopilot per Studio
Autopilot per Apps
Clipboard AI
GenAI applicata alla Document Understanding
👨🏫👨💻 Speakers:
Stefano Negro, UiPath MVPx3, RPA Tech Lead @ BSP Consultant
Flavio Martinelli, UiPath MVP 2023, Technical Account Manager @UiPath
Andrei Tasca, RPA Solutions Team Lead @NTT Data
The Metaverse and AI: how can decision-makers harness the Metaverse for their...Jen Stirrup
The Metaverse is popularized in science fiction, and now it is becoming closer to being a part of our daily lives through the use of social media and shopping companies. How can businesses survive in a world where Artificial Intelligence is becoming the present as well as the future of technology, and how does the Metaverse fit into business strategy when futurist ideas are developing into reality at accelerated rates? How do we do this when our data isn't up to scratch? How can we move towards success with our data so we are set up for the Metaverse when it arrives?
How can you help your company evolve, adapt, and succeed using Artificial Intelligence and the Metaverse to stay ahead of the competition? What are the potential issues, complications, and benefits that these technologies could bring to us and our organizations? In this session, Jen Stirrup will explain how to start thinking about these technologies as an organisation.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
PHP Frameworks: I want to break free (IPC Berlin 2024)
Building the World's Largest GPU
1. BUILDING THE WORLD'S LARGEST GPU
Renee Yao, NVIDIA Senior Product Marketing Manager, AI Systems
Twitter: @ReneeYao1
2. THE DGX FAMILY OF AI SUPERCOMPUTERS
Cloud-Scale AI: NVIDIA GPU Cloud, the cloud platform with the highest deep learning efficiency
AI Workstation: DGX Station with Tesla V100 32GB, the personal AI supercomputer
AI Data Center: DGX-1 with Tesla V100 32GB, the essential instrument for AI research; and DGX-2 with Tesla V100 32GB, the world's most powerful AI system for the most complex AI challenges
3. 10X PERFORMANCE GAIN IN LESS THAN A YEAR
Time to train (days), FairSeq at 55 epochs to solution, PyTorch training performance:
• DGX-1 with V100 (Sep '17): 15 days
• DGX-2 (Q3 '18): 1.5 days, 10 times faster
Gains include software improvements across the stack, including NCCL, cuDNN, etc.
4. DGX-2 NOW SHIPPING
1. NVIDIA Tesla V100 32GB
2. Two GPU boards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512 GB total HBM2 memory, interconnected by plane card
3. Twelve NVSwitches, 2.4 TB/sec bisection bandwidth
4. Eight EDR InfiniBand/100 GigE, 1600 Gb/sec total bi-directional bandwidth
5. PCIe switch complex
6. Two Intel Xeon Platinum CPUs
7. 1.5 TB system memory
8. 30 TB NVMe SSD internal storage
9. Dual 10/25 Gb/sec Ethernet
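The headline numbers on this slide are internally consistent and can be checked with simple arithmetic. The per-link NVLink rate used below (~50 GB/s bidirectional for NVLink 2.0 on V100) is an assumption drawn from published V100 specs, not something stated on the slide:

```python
# Back-of-the-envelope check of the DGX-2 headline figures.
GPUS = 16
HBM2_PER_GPU_GB = 32
NVLINKS_PER_GPU = 6
GB_S_PER_LINK = 50  # bidirectional, assumed V100 NVLink 2.0 rate

total_hbm2 = GPUS * HBM2_PER_GPU_GB             # 512 GB total HBM2
per_gpu_bw = NVLINKS_PER_GPU * GB_S_PER_LINK    # 300 GB/s per GPU
bisection = (GPUS // 2) * per_gpu_bw            # 8 GPUs cross the plane card

print(total_hbm2, per_gpu_bw, bisection)  # 512 300 2400
```

The 8 GPUs on one board each reaching across to the other board at full NVLink rate is exactly what produces the quoted 2.4 TB/sec bisection bandwidth.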
5. MULTI-CORE AND CUDA WITH ONE GPU
[Block diagram: the CPU sends work (data and CUDA kernels) over PCIe I/O and receives results (data); inside the GPU, GPCs, copy engines, NVLinks, and a high-speed hub connect through the XBAR to the HBM2 memory controllers.]
• Users explicitly express parallel work in CUDA
• The GPU driver distributes work to available GPC/SM cores
• GPC/SM cores use shared HBM2 to exchange data
6. TWO GPUS WITH PCIE
[Block diagram: two GPUs (GPU0, GPU1), each with GPCs, XBAR, HBM2 with memory controllers, high-speed hub, NVLinks, and copy engines, attached to the CPU only through PCIe I/O; the CPU sends work (data and CUDA kernels) and receives results (data).]
• Access to the HBM2 of the other GPU runs at PCIe bandwidth (16 GB/s)
• PCIe is the "Wild West" (lots of performance bandits)
• Interactions with the CPU compete with GPU-to-GPU traffic
7. TWO GPUS WITH NVLINK
[Block diagram: the same two-GPU layout as the PCIe case, but with the GPUs' NVLinks connected directly to each other.]
• Access to the HBM2 of the other GPU runs at multi-NVLink bandwidth (150 GB/s in V100 GPUs)
• All GPCs can access all HBM2 memories
• NVLinks are effectively a "bridge" between XBARs
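To make the gap between the two interconnects concrete, here is a back-of-the-envelope comparison of the time to move data between two GPUs at the bandwidths quoted on these slides. This is an idealized model that ignores latency and protocol overhead:

```python
def transfer_seconds(size_gb, bw_gbps):
    """Idealized transfer time: size / bandwidth, no latency or overhead."""
    return size_gb / bw_gbps

SIZE_GB = 16     # e.g. half of one V100's 32 GB HBM2
PCIE_BW = 16     # GB/s, from the PCIe slide
NVLINK_BW = 150  # GB/s, multi-NVLink between two V100s

print(transfer_seconds(SIZE_GB, PCIE_BW))    # 1.0 s over PCIe
print(transfer_seconds(SIZE_GB, NVLINK_BW))  # ~0.11 s over NVLink
```

Roughly an order of magnitude, matching the bandwidth ratio, and on PCIe that time is an optimistic floor because GPU-to-GPU traffic also competes with the CPU.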
8. THE "ONE GIGANTIC GPU" IDEAL
• Number of GPUs is as high as possible
• A single GPU driver process controls all work across all GPUs
• From the perspective of the GPCs, all HBM2s can be accessed without intervention by other processes (LD/ST instructions, copy engine RDMA, everything "just works")
• Access to all HBM2s is independent of PCIe
• Bandwidth across bridged XBARs is as high as possible (some NUMA is unavoidable)
[Diagram: four groups of four GPUs joined by a hypothetical NVLink XBAR, with two CPUs attached; how to build that XBAR is the open question.]
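The ideal amounts to a topology in which every GPU can reach every other GPU's HBM2 over NVLink alone, never touching PCIe. A toy graph model can check that property; the wiring below mirrors the DGX-2-style configuration (each GPU linked to six switch planes on its board, planes trunked between boards) and is purely illustrative, not a wiring diagram:

```python
from itertools import product

# Toy model of the "one gigantic GPU" goal with 16 GPUs on 2 boards.
gpus = [f"gpu{i}" for i in range(16)]
links = set()
for i, g in enumerate(gpus):
    board = i // 8
    for s in range(6):  # one NVLink from each GPU to each switch plane
        links.add(frozenset((g, f"sw{board}_{s}")))
for s in range(6):      # trunk links between the two boards' planes
    links.add(frozenset((f"sw0_{s}", f"sw1_{s}")))

def reachable(a, b):
    """Simple graph search over the undirected NVLink topology."""
    seen, frontier = {a}, [a]
    while frontier:
        n = frontier.pop()
        for link in links:
            if n in link:
                (other,) = link - {n}
                if other not in seen:
                    seen.add(other)
                    frontier.append(other)
    return b in seen

# Every GPU can reach every other GPU's memory over NVLink alone.
print(all(reachable(a, b) for a, b in product(gpus, gpus)))  # True
```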
10. DGX-2 NOW SHIPPING
11. EXPANDABLE SYSTEM
• Taking this to the limit: connect one NVLink from each GPU to each of 6 switches
• No routing between different switch planes required
• 8 of the 18 NVLinks available per switch are used to connect to GPUs
• 10 NVLinks available per switch for communication outside the local group (only 8 are required to support full bandwidth)
• This is the GPU baseboard configuration for DGX-2
[Diagram: eight V100 GPUs fanned into the NVSwitch plane.]
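The port arithmetic on this slide can be checked directly; 18 NVLink ports per NVSwitch is the stated capability, and everything else follows:

```python
# Port budget per NVSwitch in the DGX-2 baseboard.
PORTS_PER_SWITCH = 18
GPUS_PER_BOARD = 8

gpu_facing = GPUS_PER_BOARD            # one link from each GPU on the board
spare = PORTS_PER_SWITCH - gpu_facing  # ports left for off-board trunking
needed_full_bw = GPUS_PER_BOARD        # 8 trunk links carry full GPU bandwidth

print(gpu_facing, spare, spare >= needed_full_bw)  # 8 10 True
```

The two spare ports beyond the eight needed for full bandwidth are what makes the building block "expandable".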
12. DGX-2 NVLINK FABRIC
[Diagram: two eight-GPU baseboards, each with its NVSwitch plane, trunked together.]
• Two of these building blocks together form a fully connected 16-GPU cluster
• Non-blocking, non-interfering (unless the same destination is involved)
• Regular loads, stores, and atomics just work
• Presenter's note: the astute among you will note that there is a redundant level of switches here, but this configuration simplifies system-level design and manufacturing
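Cross-board capacity follows from the per-switch port budget: 8 trunk links on each of the 6 switch planes. The per-link rate below (~50 GB/s bidirectional, the usual NVLink 2.0 figure for V100) is an assumption, not stated on the slide, but it reproduces the bisection bandwidth quoted on the spec slide:

```python
# Cross-board (bisection) capacity of the DGX-2 NVLink fabric.
SWITCH_PLANES = 6
TRUNK_LINKS_PER_SWITCH = 8
GB_S_PER_LINK = 50  # bidirectional, assumed V100 NVLink 2.0 rate

cross_board = SWITCH_PLANES * TRUNK_LINKS_PER_SWITCH * GB_S_PER_LINK
print(cross_board)  # 2400 GB/s, i.e. the 2.4 TB/s bisection bandwidth
```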
13. Data Science HW Architecture
• Single CPU node (128 GB/s, 20 cores, 512 GB): typically very slow with 20 GB+ datasets
• CPU cluster: handles larger datasets but stays slow, limited by CPU/memory bandwidth, number of processing cores, and network I/O
• DGX-2 versus a CPU node: 128x memory I/O, 300x core-to-core I/O, 100x processing cores
14. DGX-2 PCIE NETWORK
[Diagram: two QPI-connected x86 Xeon sockets feed a PCIe switch tree; at the leaves, each 200G NIC shares a PCIe switch with a pair of V100 GPUs, and the sixteen GPUs connect through the two NVSwitch planes.]
• Xeon sockets are QPI-connected, but affinity binding keeps GPU-related traffic off QPI
• The PCIe tree has NICs connected to pairs of GPUs to facilitate GPUDirect RDMA over the IB network
• Configuration and control of the NVSwitches is via a driver process running on the CPUs
16. NVIDIA DGX-2: SYSTEM COOLING
• Forced-air cooling of baseboards, I/O expander, and CPUs provided by ten 92 mm fans
• Four supplemental 60 mm internal fans cool the NVMe drives and PSUs
• Air reaching the NVSwitches is pre-heated by the GPUs, so they use "full height" heatsinks
17. DGX-2: cuFFT
[Chart: cuFFT performance, ½ DGX-1V versus DGX-2.]
• Results are "iso-problem instance" (more GFLOPS means shorter running time)
• As the problem is split over more GPUs, it takes longer to transfer data than to calculate locally
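The scaling behaviour described here is a classic compute/communication trade-off. A rough model shows why per-GPU efficiency drops as a fixed-size FFT is split further; the constants below are made up for illustration, not measured:

```python
def step_time(n_gpus, compute_total=8.0, comm_total=4.0, bw=1.0):
    """Toy iso-problem-size model: local compute divides evenly across
    GPUs, but the all-to-all exchange volume per step approaches a fixed
    total as the slice each GPU keeps locally shrinks."""
    compute = compute_total / n_gpus
    comm = comm_total * (n_gpus - 1) / (n_gpus * bw)
    return compute + comm

for n in (1, 2, 4, 8, 16):
    print(n, round(step_time(n), 3))
```

Total time still falls with more GPUs, but the gains flatten as the transfer term dominates the local calculation, which is exactly the curve the chart shows.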
19. DGX-2: UP TO 2.7X ON TARGET APPS
2x DGX-1 (Volta) versus DGX-2 with NVSwitch:
• Physics (MILC benchmark, 4D grid): 13K vs. 26K GFLOPS, 2x faster
• Weather (IFS benchmark, FFT and all-to-all): 11 vs. 26 steps/sec, 2.4x faster
• Recommender (sparse embedding, reduce & broadcast): 11B vs. 22B lookups/sec, 2x faster
• Language model (Transformer with MoE, all-to-all): 9.3 hr vs. 3.4 hr, 2.7x faster
2 DGX-1V servers have dual-socket Xeon E5 2698v4 processors and 8 x V100 32GB GPUs each, connected via 4 EDR IB ports. The DGX-2 server has dual-socket Xeon Platinum 8168 processors and 16 V100 32GB GPUs.
20. FLEXIBILITY WITH VIRTUALIZATION
Enable your own private DL training cloud for your enterprise:
• KVM hypervisor for Ubuntu Linux
• Enables teams of developers to simultaneously access DGX-2
• Flexibly allocate GPU resources to each user and their experiments
• Full GPU and NVSwitch access within VMs, either all GPUs or as few as 1
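As a sketch of the "flexibly allocate GPU resources" idea, partitioning the 16 GPUs among concurrent users might look like the helper below. This is a hypothetical illustration, not part of any NVIDIA tooling:

```python
def allocate(requests, total_gpus=16):
    """Hypothetical helper: greedily assign contiguous GPU index ranges.
    requests: dict of user -> number of GPUs wanted (1 up to all 16)."""
    assignments, next_free = {}, 0
    for user, count in requests.items():
        if next_free + count > total_gpus:
            raise ValueError(f"not enough GPUs for {user}")
        assignments[user] = list(range(next_free, next_free + count))
        next_free += count
    return assignments

print(allocate({"alice": 8, "bob": 4, "carol": 1}))
# alice gets GPUs 0-7, bob 8-11, carol GPU 12
```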
21. CRISIS MANAGEMENT SOLUTION
Natural disasters are increasingly causing major destruction to life, property, and economies. DFKI is using the NVIDIA DGX-2 to evolve DeepEye, which uses satellite images enriched with social media content to identify natural disasters, into a crisis management solution. With the increased GPU memory and fully connected GPUs based on the NVSwitch architecture, DFKI can build bigger models and process more data to aid rescuers in their decision-making for faster, more efficient dispatching of resources.
22. "Fujifilm applies AI in a wide range of fields. In healthcare, multiple NVIDIA GPUs will deliver high-speed computation to develop AI supporting image diagnostics. The introduction of this supercomputer will massively increase our processing power. We expect that AI learning that once took days to complete can now be completed within hours."
Akira Yoda, Chief Digital Officer of FUJIFILM Corporation
Fields: pharmaceuticals; Bio CDMO; regenerative medicine; analyzing and recognizing medical images; simulations of display materials and fine chemicals
23. AI ADOPTERS IMPEDED BY INFRASTRUCTURE
AI boosts profit margins by up to 15%, yet 40% of adopters see infrastructure as impeding AI.
Source: 2018 CTA Market Research
24. THE CHALLENGE OF AI INFRASTRUCTURE
Short-term thinking leads to longer-term problems:
• Design guesswork: ensuring the architecture delivers predictable performance that scales
• Deployment complexity: procuring, installing, and troubleshooting compute, storage, networking, and software
• Multiple points of support: contending with multiple vendors across multiple layers in the stack
25. DESIGNING INFRASTRUCTURE THAT SCALES
Insights gained from deep learning data centers:
• Rack design: DL drives racks close to operational limits; similarities to HPC best practices
• Networking: IB- or Ethernet-based fabric; 100 Gbps interconnect; high bandwidth, ultra-low latency
• Storage: datasets range from 10k's to millions of objects; terabyte levels of storage and up; high IOPS, low latency
• Facilities: assume higher watts per rack; higher FLOPS/watt means less data center floorspace required
• Software: scale requires "cluster-aware" software
Example (autonomous vehicles):
• Autonomous vehicle = 1 TB/hr of data
• Training sets up to 500 PB
• ResNet-50: 113 days to train; objective: 7 days
• 6 simultaneous developers = 97-node cluster
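The cluster size in the example follows from simple throughput scaling, assuming near-linear speedup from one node, which the slide's example implicitly does:

```python
import math

single_node_days = 113   # ResNet-50 time to train on one node
target_days = 7          # objective
developers = 6           # simultaneous users

# Each developer needs ~113/7 ≈ 16.1 nodes to hit the target.
cluster_nodes = math.ceil(developers * single_node_days / target_days)
print(cluster_nodes)  # 97
```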
26. NVIDIA DGX POD™
A Reference Architecture for GPU Data Centers
• Initial reference architecture based on the NVIDIA® DGX-1™ server
• Designed for the deep learning training workflow
• Baseline for other reference architectures:
  • Easily upgraded to NVIDIA DGX-2™ and NVIDIA HGX-2™ servers
  • Industry-specific PODs
  • Storage and network partners
  • Server OEM solutions
27. DGX DATA CENTER REFERENCE DESIGN
Easy deployment of DGX servers for deep learning. Contents:
• AI workflow and sizing
• NVIDIA AI software
• DGX POD design
• DGX POD installation and management
28. NVIDIA AUTOMOTIVE WORKFLOW ON SATURNV
Research workflow:
Training
• Many-node: user submits 1 job with many single-node training sessions (hyperparameter sweep)
• Multi-node: user submits 1 job with a single multi-node training session
Inference
• Many-GPU: user submits many jobs, each with single-GPU inference
[Chart: storage performance versus interconnect performance for many-node training, multi-node training, and inference.]
30. NVIDIA DGX POD — DGX-1
Reference architecture in a single 35 kW high-density rack.
In real-life DL application development, one to two DGX-1 servers per developer are often required. One DGX POD supports five developers (AV workload), each working on two experiments per day: one DGX-1/developer/experiment/day*.
Fits within a standard-height 42 RU data center rack:
• Nine DGX-1 servers (9 x 3 RU = 27 RU)
• Twelve storage servers (12 x 1 RU = 12 RU)
• 10 GbE (min) storage and management switch (1 RU)
• Mellanox 100 Gbps intra-rack high-speed network switches (1 or 2 RU)
*0.5M images x 120 epochs @ 480 images/sec, ResNet-18 backbone detection network per experiment
31. NVIDIA DGX POD — DGX-2
Reference architecture in a single 35 kW high-density rack.
In real-life DL application development, one DGX-2 per developer minimizes model training time. One DGX POD supports at least three developers (AV workload), each working on two experiments per day: one DGX-2/developer/2 experiments/day*.
Fits within a standard-height 48 RU data center rack:
• Three DGX-2 servers (3 x 10 RU = 30 RU)
• Twelve storage servers (12 x 1 RU = 12 RU)
• 10 GbE (min) storage and management switch (1 RU)
• Mellanox 100 Gbps intra-rack high-speed network switches (1 or 2 RU)
*0.5M images x 120 epochs @ 480 images/sec, ResNet-18 backbone detection network per experiment
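The rack-unit budgets for both POD variants can be totalled from the line items above; the sketch assumes the 2 RU worst case for the Mellanox switches and 1 RU per storage server, as the slides state:

```python
def pod_rack_units(servers, ru_per_server,
                   storage_servers=12, mgmt_switch_ru=1, network_switch_ru=2):
    """Total RU for a DGX POD rack (worst-case 2 RU network switches)."""
    return (servers * ru_per_server + storage_servers
            + mgmt_switch_ru + network_switch_ru)

print(pod_rack_units(9, 3))   # DGX-1 POD: 42 RU, exactly fills a 42 RU rack
print(pod_rack_units(3, 10))  # DGX-2 POD: 45 RU, fits a 48 RU rack
```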
32. NEW DGX PODS
DELIVERY, DEPLOYMENT, DEEP LEARNING IN A DAY
• 95% reduction in deployment time
• 5x increase in data scientist productivity
• $0 integration cost
• Adopted by leading auto, healthcare & telco companies
33. NVIDIA DGX SYSTEMS
Faster AI Innovation and Insight
The world's first portfolio of purpose-built AI supercomputers:
• Powered by NVIDIA GPU Cloud
• Get started in AI, faster
• Effortless productivity
• Performance without compromise
For more information:
DGX Systems: nvidia.com/dgx
DGX POD: https://www.nvidia.com/en-us/data-center/resources/nvidia-dgx-pod-reference-architecture/
DGX Reference Architecture: https://www.nvidia.com/en-us/data-center/dgx-reference-architecture/