BUILDING THE WORLD'S LARGEST GPU
Renee Yao, NVIDIA Senior Product Marketing Manager, AI Systems
Twitter: @ReneeYao1
THE DGX FAMILY OF AI SUPERCOMPUTERS
• Cloud-scale AI: NVIDIA GPU Cloud, the cloud platform with the highest deep learning efficiency
• AI workstation: DGX Station with Tesla V100 32GB, the personal AI supercomputer
• AI data center: DGX-1 with Tesla V100 32GB, the essential instrument for AI research
• AI data center: DGX-2 with Tesla V100 32GB, the world's most powerful AI system for the most complex AI challenges
10X PERFORMANCE GAIN IN LESS THAN A YEAR
DGX-1 (Sep '17) vs. DGX-2 (Q3 '18), including software improvements across the stack (NCCL, cuDNN, etc.)
[Chart: time to train (days). DGX-1 with V100: 15 days; DGX-2: 1.5 days. 10 times faster.]
Workload: FairSeq, 55 epochs to solution; PyTorch training performance.
DGX-2 NOW SHIPPING
1. NVIDIA Tesla V100 32GB
2. Two GPU boards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512 GB total HBM2 memory, interconnected by plane card
3. Twelve NVSwitches: 2.4 TB/sec bisection bandwidth
4. Eight EDR InfiniBand/100 GigE: 1,600 Gb/sec total bi-directional bandwidth
5. PCIe switch complex
6. Two Intel Xeon Platinum CPUs
7. 1.5 TB system memory
8. 30 TB NVMe SSD internal storage
9. Dual 10/25 Gb/sec Ethernet
MULTI-CORE AND CUDA WITH ONE GPU
[Diagram: one GPU with two GPCs, an XBAR, HBM2 stacks with memory controllers, a high-speed hub with NVLinks and copy engines, and PCIe I/O; the CPU sends work (data and CUDA kernels) and receives results (data).]
• Users explicitly express parallel work in CUDA
• The GPU driver distributes work to available GPC/SM cores
• GPC/SM cores use shared HBM2 to exchange data (sketched below)
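A minimal sketch of that flow (hypothetical kernel and sizes, not from the deck): the host enqueues data and a kernel, the driver spreads the blocks across whichever GPC/SM cores are free, and every block reads and writes the same HBM2.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread scales one element in place.
// The GPU driver schedules the grid's blocks onto available GPC/SM cores.
__global__ void scale(float *data, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= alpha;   // all blocks share the same HBM2
}

int main()
{
    const int n = 1 << 20;
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));            // buffer lives in HBM2
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);  // work: data + CUDA kernel
    cudaDeviceSynchronize();                      // wait before reading results
    cudaFree(d);
    return 0;
}
```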
TWO GPUS WITH PCIE
[Diagram: two GPUs, each with its own GPCs, XBAR, HBM2 and memory controllers, high-speed hub, NVLinks, copy engines, and PCIe I/O; the CPU sends work (data and CUDA kernels) and receives results (data) over the shared PCIe tree.]
• Access to the other GPU's HBM2 runs at PCIe bandwidth (16 GBps)
• PCIe is the "Wild West" (lots of perf bandits)
• Interactions with the CPU compete with GPU-to-GPU traffic (see the sketch below)
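What that means in practice (a sketch with hypothetical buffer sizes): with no direct GPU-to-GPU link, a peer copy is staged over the shared PCIe tree at roughly 16 GBps and contends with every CPU transfer.

```cuda
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 256 << 20;   // 256 MB test buffer, hypothetical
    float *buf0 = nullptr, *buf1 = nullptr;

    cudaSetDevice(0); cudaMalloc(&buf0, bytes);   // HBM2 on GPU0
    cudaSetDevice(1); cudaMalloc(&buf1, bytes);   // HBM2 on GPU1

    // GPU0 -> GPU1: without a direct link this transfer rides PCIe
    // (~16 GBps) and competes with all CPU<->GPU traffic.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaDeviceSynchronize();

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}
```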
TWO GPUS WITH NVLINK
• Access to the other GPU's HBM2 runs at multi-NVLink bandwidth (150 GBps in V100 GPUs)
• All GPCs can access all HBM2 memories
• NVLinks are effectively a "bridge" between XBARs (see the sketch below)
[Diagram: the same two GPUs, now joined GPU-to-GPU through their NVLinks; the CPU still sends work (data and CUDA kernels) and receives results (data) over PCIe.]
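A sketch of what the bridged XBARs allow (hypothetical kernel, standard CUDA peer-access calls): once peer access is enabled, a kernel on GPU0 dereferences GPU1's HBM2 with ordinary load instructions, and that traffic rides NVLink instead of PCIe.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel on GPU0 that reads directly from a pointer
// into GPU1's HBM2 -- plain LD instructions crossing the NVLink bridge.
__global__ void sum_remote(const float *remote, float *out, int n)
{
    float s = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        s += remote[i];
    atomicAdd(out, s);
}

int main()
{
    const int n = 1 << 20;
    float *src = nullptr, *out = nullptr;

    cudaSetDevice(1);
    cudaMalloc(&src, n * sizeof(float));   // lives in GPU1's HBM2

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);      // map GPU1's memory into GPU0
    cudaMalloc(&out, sizeof(float));
    cudaMemset(out, 0, sizeof(float));

    sum_remote<<<1, 256>>>(src, out, n);   // LD traffic goes over NVLink
    cudaDeviceSynchronize();
    return 0;
}
```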
THE “ONE GIGANTIC GPU” IDEAL
• The number of GPUs is as high as possible
• A single GPU driver process controls all work across all GPUs
• From the perspective of the GPCs, all HBM2s can be accessed without intervention by other processes (LD/ST instructions, copy engine RDMA, everything "just works")
• Access to all HBM2s is independent of PCIe
• Bandwidth across the bridged XBARs is as high as possible (some NUMA is unavoidable)
[Diagram: four groups of four GPUs hanging off one large "NVLink XBAR", with two CPUs attached; the question mark asks what could implement such a crossbar.]
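One driver process can probe how close a machine comes to this ideal (a sketch using standard CUDA peer-access queries; the printout format is mine): if every GPU pair reports peer capability, every HBM2 can be mapped into every GPU's address space.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);   // 16 on a DGX-2

    // The "one gigantic GPU" ideal requires every GPU to LD/ST into
    // every other GPU's HBM2 without any other process in the path.
    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        for (int j = 0; j < ndev; ++j) {
            if (i == j) continue;
            int ok = 0;
            cudaDeviceCanAccessPeer(&ok, i, j);
            if (ok) cudaDeviceEnablePeerAccess(j, 0);
            printf("GPU%d -> GPU%d peer access: %s\n",
                   i, j, ok ? "yes" : "no");
        }
    }
    return 0;
}
```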
INTRODUCING NVSWITCH
Die parameters:
• Bi-directional BW per NVLink: 51.5 GBps
• NRZ lane rate (x8 per NVLink): 25.78125 Gbps
• Transistors: 2 billion
• Process: TSMC 12FFN
• Die size: 106 mm²

Switch capabilities:
• Bi-directional aggregate BW: 928 GBps
• NVLink ports: 18
• Management port (config, maintenance, errors): PCIe
• LD/ST BW efficiency (128 B packets): 80.0%
• Copy engine BW efficiency (256 B packets): 88.9%
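The per-link and aggregate figures are mutually consistent; as a quick check (my arithmetic, treating the raw lane rate as the payload rate):

\[
8 \times 25.78125\ \text{Gb/s} \approx 206\ \text{Gb/s} \approx 25.8\ \text{GB/s per direction},
\]
\[
2 \times 25.8\ \text{GB/s} \approx 51.5\ \text{GB/s per NVLink},
\qquad
18 \times 51.5\ \text{GB/s} \approx 928\ \text{GB/s aggregate}.
\]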
EXPANDABLE SYSTEM
• Taking this to the limit: connect one NVLink from each GPU to each of 6 switches
• No routing between different switch planes is required
• 8 of the 18 NVLinks available per switch are used to connect to GPUs
• 10 NVLinks per switch remain for communication outside the local group (only 8 are required to support full bandwidth; the accounting is worked below)
• This is the GPU baseboard configuration for DGX-2
[Diagram: eight V100 GPUs, each with one NVLink into each of six NVSwitches on the baseboard.]
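The port accounting (my arithmetic from the slide's figures):

\[
\underbrace{8\ \text{GPU-facing links}}_{\text{one per V100}}
+ \underbrace{10\ \text{spare links}}_{\text{off-board}}
= 18\ \text{ports per NVSwitch},
\]

and since each switch plane carries one link from each of the 8 GPUs, only 8 off-board links per switch are needed to pass that traffic on at full bandwidth.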
DGX-2 NVLINK FABRIC
[Diagram: two eight-GPU baseboards, each with six NVSwitches, joined switch-to-switch across the plane cards.]
• Two of these building blocks together form a fully connected 16-GPU cluster
• Non-blocking and non-interfering (unless the same destination is involved)
• Regular loads, stores, and atomics just work
• Presenter's note: the astute among you will note that there is a redundant level of switches here, but this configuration simplifies system-level design and manufacturing
DATA SCIENCE HW ARCHITECTURE
• Single CPU node (128 GB/s memory bandwidth, 20 cores, 512 GB): typically very slow with 20 GB+ datasets
• CPU cluster (many such nodes): handles larger datasets but remains slow, limited by CPU/memory bandwidth, the number of processing cores, and network I/O
• DGX-2: roughly 128x the memory I/O, 300x the core-to-core I/O, and 100x the processing cores of a single CPU node
DGX-2 PCIE NETWORK
[Diagram: two x86 sockets joined by QPI; a PCIe switch tree fans out to eight 200G NICs, each shared by a pair of V100 GPUs; the sixteen V100s and two banks of NVSwitches form the separate NVLink fabric.]
• Xeon sockets are QPI-connected, but affinity binding keeps GPU-related traffic off QPI
• The PCIe tree attaches NICs to pairs of GPUs to facilitate GPUDirect RDMA over the IB network
• Configuration and control of the NVSwitches is via a driver process running on the CPUs
NVIDIA DGX-2: GPUS + NVSWITCH COMPLEX
• Two GPU baseboards, each with 8 V100 GPUs and 6 NVSwitches
• Two plane cards, each carrying 24 NVLinks
NVIDIA DGX-2: SYSTEM COOLING
• Forced-air cooling of the baseboards, I/O expander, and CPUs is provided by ten 92 mm fans
• Four supplemental 60 mm internal fans cool the NVMe drives and PSUs
• Air reaching the NVSwitches is pre-heated by the GPUs, so they use "full height" heatsinks
DGX-2: cuFFT
• Results are "iso-problem instance" (more GFLOPS means shorter running time)
• As the problem is split over more GPUs, it takes longer to transfer data than to calculate locally
[Chart: cuFFT GFLOPS, DGX-1V vs. ½ DGX-2.]
DGX-2: ALL-REDUCE BENCHMARK
[Chart: all-reduce bandwidth (MB/s) vs. message size (4 KB to 512 MB) for two DGX-1Vs over one 100 Gb IB link, two DGX-1Vs over four 100 Gb IB links, DGX-2 with ring-topology communication, and DGX-2 with all-to-all communication; DGX-2 all-to-all reaches roughly 8x the bandwidth of the two-server configurations at large messages.]
• All-reduce is an important communication primitive in machine-learning apps
• Bandwidth is increased compared to two 8-GPU servers
• The all-to-all NVSwitch network reduces latency overheads vs. simpler topologies (e.g., "rings"); a minimal benchmark sketch follows
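A minimal single-process version of such a benchmark (a sketch using the standard NCCL API; buffer sizes are hypothetical, and NCCL, not this code, decides whether to use rings or the NVSwitch all-to-all path):

```cuda
#include <nccl.h>
#include <cuda_runtime.h>

int main()
{
    const int ndev = 16;               // all GPUs in a DGX-2
    const size_t count = 1 << 24;      // elements per GPU, hypothetical
    int devs[16];
    float *sendbuf[16], *recvbuf[16];
    cudaStream_t streams[16];
    ncclComm_t comms[16];

    for (int i = 0; i < ndev; ++i) {
        devs[i] = i;
        cudaSetDevice(i);
        cudaMalloc(&sendbuf[i], count * sizeof(float));
        cudaMalloc(&recvbuf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }
    ncclCommInitAll(comms, ndev, devs);   // one communicator per GPU

    // Sum-reduce `count` floats across all 16 GPUs.
    ncclGroupStart();
    for (int i = 0; i < ndev; ++i)
        ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
    }
    for (int i = 0; i < ndev; ++i)
        ncclCommDestroy(comms[i]);
    return 0;
}
```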
DGX-2: UP TO 2.7X ON TARGET APPS
Versus a pair of DGX-1 (Volta) servers:
• Physics (MILC benchmark; 4D grid): 13K to 26K GFLOPS, 2X faster
• Weather (IFS benchmark; FFT, all-to-all): 11 to 26 steps/sec, 2.4X faster
• Recommender (sparse embedding; reduce & broadcast): 11B to 22B lookups/sec, 2X faster
• Language model (Transformer with MoE; all-to-all): 9.3 hr to 3.4 hr, 2.7X faster
The 2 DGX-1V servers have dual-socket Xeon E5-2698 v4 processors and 8 V100 32GB GPUs each, connected via 4 EDR IB ports; the DGX-2 server has dual-socket Xeon Platinum 8168 processors and 16 V100 32GB GPUs.
FLEXIBILITY WITH VIRTUALIZATION
Enable your own private DL training cloud for your enterprise
• KVM hypervisor for Ubuntu Linux
• Enables teams of developers to access the DGX-2 simultaneously
• Flexibly allocate GPU resources to each user and their experiments
• Full GPU and NVSwitch access within VMs, with anywhere from all of the GPUs down to a single GPU per VM
CRISIS MANAGEMENT SOLUTION
Natural disasters are increasingly causing major destruction to life, property, and economies. DFKI is using the NVIDIA DGX-2 to evolve DeepEye, which uses satellite images enriched with social media content to identify natural disasters, into a crisis management solution. With the increased GPU memory and fully connected GPUs based on the NVSwitch architecture, DFKI can build bigger models and process more data to aid rescuers in their decision-making for faster, more efficient dispatching of resources.
“Fujifilm applies AI in a wide range of fields. In healthcare, multiple NVIDIA GPUs will deliver high-speed computation to develop AI supporting image diagnostics. The introduction of this supercomputer will massively increase our processing power. We expect that AI learning that once took days to complete can now be completed within hours.”
Akira Yoda, Chief Digital Officer, FUJIFILM Corporation
Focus areas:
• Pharmaceuticals
• Bio CDMO
• Regenerative medicine
• Analyzing and recognizing medical images
• Simulations of display materials and fine chemicals
AI ADOPTERS IMPEDED BY INFRASTRUCTURE
• AI boosts profit margins by up to 15%
• 40% see infrastructure as impeding AI
Source: 2018 CTA Market Research
THE CHALLENGE OF AI INFRASTRUCTURE
Short-term thinking leads to longer-term problems
• DESIGN GUESSWORK: ensuring the architecture delivers predictable performance that scales
• DEPLOYMENT COMPLEXITY: procuring, installing, and troubleshooting compute, storage, networking, and software
• MULTIPLE POINTS OF SUPPORT: contending with multiple vendors across multiple layers in the stack
DESIGNING INFRASTRUCTURE THAT SCALES
Insights gained from deep learning data centers
• Rack design: DL drives hardware close to operational limits; similarities to HPC best practices
• Networking: IB- or Ethernet-based fabric; 100 Gbps interconnect; high bandwidth, ultra-low latency
• Storage: datasets range from tens of thousands to millions of objects; terabyte levels of storage and up; high IOPS, low latency
• Facilities: assume higher watts per rack; higher FLOPS/watt means less data center floor space required
• Software: scale requires "cluster-aware" software
Example (cluster sizing, worked below):
• Autonomous vehicle = 1 TB/hr of data
• Training sets up to 500 PB
• ResNet-50: 113 days to train; objective: 7 days
• 6 simultaneous developers = 97-node cluster
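The 97-node figure follows directly (my arithmetic, assuming near-linear scaling):

\[
\frac{113\ \text{days}}{7\ \text{days}} \approx 16.1\ \text{nodes per developer},
\qquad
16.1 \times 6\ \text{developers} \approx 97\ \text{nodes}.
\]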
NVIDIA DGX POD™
A Reference Architecture for GPU Data Centers
• Initial reference architecture based on the NVIDIA® DGX-1™ server
• Designed for the deep learning training workflow
• Baseline for other reference architectures:
  • Easily upgraded to NVIDIA DGX-2™ and NVIDIA HGX-2™ servers
  • Industry-specific PODs
  • Storage and network partners
  • Server OEM solutions
DGX DATA CENTER REFERENCE DESIGN
Easy Deployment of DGX Servers for Deep Learning
Content:
• AI Workflow and Sizing
• NVIDIA AI Software
• DGX POD Design
• DGX POD Installation and
Management
NVIDIA AUTOMOTIVE WORKFLOW ON SATURNV
Research workflow
Training:
• Many-node: user submits one job containing many single-node training sessions (a hyperparameter sweep)
• Multi-node: user submits one job containing a single multi-node training session
Inference:
• Many-GPU: user submits many jobs, each running single-GPU inference
[Diagram: storage performance vs. interconnect performance demands for inference, many-node training, and multi-node training.]
EXAMPLE DL WORKFLOW: AUTOMOTIVE
Driving DL platform: training, simulation, testing
[Diagram: loop from raw data, through indexing, selection, and labeling, to training and OTA updates.]
NVIDIA DGX POD — DGX-1
Reference Architecture in a Single 35 kW High-Density Rack
In real-life DL application development, one to two DGX-1 servers per developer are often required. One DGX POD supports five developers (AV workload), with each developer running two experiments per day: one DGX-1 per developer per experiment per day.*
*0.5M images x 120 epochs @ 480 images/sec; ResNet-18 backbone detection network per experiment (see the arithmetic below).
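Under the footnote's assumptions, the per-experiment wall-clock time works out to (my arithmetic; the slide does not say whether 480 images/sec is per GPU or per server):

\[
\frac{0.5 \times 10^{6}\ \text{images} \times 120\ \text{epochs}}{480\ \text{images/s}}
= 125{,}000\ \text{s} \approx 35\ \text{h},
\]

which is consistent with budgeting one to two DGX-1 servers per developer when two experiments per day are in flight.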
Fits within a standard-height 42 RU data center rack:
• Nine DGX-1 servers (9 x 3 RU = 27 RU)
• Twelve storage servers (12 x 1 RU = 12 RU)
• 10 GbE (min) storage and management switch (1 RU)
• Mellanox 100 Gbps intra-rack high-speed network switches (1 or 2 RU)
NVIDIA DGX POD — DGX-2
Reference Architecture in a Single 35 kW High-Density Rack
In real-life DL application development, one DGX-2 per developer minimizes model training time. One DGX POD supports at least three developers (AV workload), with each developer running two experiments per day: one DGX-2 per developer per two experiments per day.*
*0.5M images x 120 epochs @ 480 images/sec; ResNet-18 backbone detection network per experiment.
Fits within a standard-height 48 RU data center rack:
• Three DGX-2 servers (3 x 10 RU = 30 RU)
• Twelve storage servers (12 x 1 RU = 12 RU)
• 10 GbE (min) storage and management switch (1 RU)
• Mellanox 100 Gbps intra-rack high-speed network switches (1 or 2 RU)
NEW DGX PODS
DELIVERY, DEPLOYMENT, DEEP LEARNING IN A DAY
• 95% reduction in deployment time
• 5X increase in data scientist productivity
• $0 integration cost
• Adopted by leading auto, healthcare, and telco companies
NVIDIA DGX SYSTEMS
Faster AI Innovation and Insight
The World's First Portfolio of Purpose-Built AI Supercomputers
• Powered by NVIDIA GPU Cloud
• Get started in AI faster
• Effortless productivity
• Performance without compromise

For more information:
• DGX Systems: nvidia.com/dgx
• DGX POD: https://www.nvidia.com/en-us/data-center/resources/nvidia-dgx-pod-reference-architecture/
• DGX Reference Architecture: https://www.nvidia.com/en-us/data-center/dgx-reference-architecture/