NVSWITCH AND DGX-2
NVLINK-SWITCHING CHIP AND SCALE-UP COMPUTE SERVER
Alex Ishii and Denis Foley with Eric Anderson, Bill Dally, Glenn Dearth, Larry Dennison, Mark Hummel and John Schafer
©2018 NVIDIA CORPORATION
NVIDIA DGX-2 SERVER AND NVSWITCH
16 Tesla™ V100 32 GB GPUs
FP64: 125 TFLOPS
FP32: 250 TFLOPS
Tensor: 2000 TFLOPS
512 GB of GPU HBM2
Single-Server Chassis
10U/19-Inch Rack Mount
10 kW Peak TDP
Dual 24-core Xeon CPUs
1.5 TB DDR4 DRAM
30 TB NVMe Storage
New NVSwitch Chip
18 2nd-Generation NVLink™ Ports
25 GBps per Port (each direction)
900 GBps Total Bidirectional Bandwidth
450 GBps Total Throughput
12-NVSwitch Network
Full-Bandwidth Fat-Tree Topology
2.4 TBps Bisection Bandwidth
Global Shared Memory
Repeater-less
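A quick consistency check of the aggregate numbers (the per-GPU figures used below are the published Tesla V100 32 GB specs, stated here as an assumption rather than taken from this slide):

$16 \times 7.8\,\mathrm{TFLOPS} \approx 125\,\mathrm{TFLOPS}$ (FP64)
$16 \times 15.7\,\mathrm{TFLOPS} \approx 250\,\mathrm{TFLOPS}$ (FP32)
$16 \times 125\,\mathrm{TFLOPS} = 2000\,\mathrm{TFLOPS}$ (Tensor)
$16 \times 32\,\mathrm{GB} = 512\,\mathrm{GB}$ (HBM2 capacity)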
OUTLINE
The Case for “One Gigantic GPU” and NVLink Review
NVSwitch — Speeds and Feeds, Architecture and System Applications
DGX-2 Server — Speeds and Feeds, Architecture, and Packaging
Signal-Integrity Design Highlights
Achieved Performance
MULTI-CORE AND CUDA WITH ONE GPU
Users explicitly express parallel work in NVIDIA CUDA®
GPU Driver distributes work to available Graphics Processing Cluster (GPC)/Streaming Multiprocessor (SM) cores
GPC/SM cores can compute on data in any of the second-generation High Bandwidth Memories (HBM2s)
GPC/SM cores use shared HBM2s to exchange data
[Block diagram: a single GPU with GPC cores, L2 caches, two HBM2 stacks, and an internal XBAR; a Hub with Copy Engines, NVLink, and PCIe I/O connects to the CPU, which sends work (data and CUDA kernels) and receives results]
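To make "users explicitly express parallel work in CUDA" concrete, here is a minimal, hypothetical kernel launch; the kernel name and sizes are illustrative, and the resulting thread blocks are what the driver and hardware spread across whichever GPC/SM cores are free:

#include <cuda_runtime.h>

// Each thread scales one element; thread blocks are scheduled onto free SMs.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));     // unified memory; resident in HBM2 once the GPU touches it
    for (int i = 0; i < n; ++i) x[i] = 1.0f;
    scale<<<(n + 255) / 256, 256>>>(x, 2.0f, n);  // driver distributes blocks to GPC/SM cores
    cudaDeviceSynchronize();                      // wait for the work to finish
    cudaFree(x);
    return 0;
}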
TWO GPUS WITH PCIE
Access to the HBM2 of the other GPU is at PCIe bandwidth (32 GBps bidirectional)
Interactions with the CPU compete with GPU-to-GPU traffic
PCIe is the “Wild West” (lots of performance bandits)
[Block diagram: GPU0 and GPU1 each attach to the CPU over PCIe; GPU-to-GPU accesses share the PCIe path with work and results traffic]
TWO GPUS WITH NVLINK
All GPCs can access all HBM2 memories
Access to the HBM2 of the other GPU is at multi-NVLink bandwidth (300 GBps bidirectional in V100 GPUs)
NVLinks are effectively a “bridge” between XBARs
No collisions with PCIe traffic
[Block diagram: GPU0 and GPU1 with their XBARs bridged by NVLinks; PCIe to the CPU carries only work and results]
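A hedged sketch of how software uses this bridge: once peer access is enabled, a copy-engine transfer between the two GPUs' HBM2 runs over NVLink rather than PCIe. Device numbers, buffer size, and the lack of error checking are simplifications for illustration:

#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256 << 20;         // 256 MB test buffer
    float *buf0, *buf1;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);       // let GPU0 reach GPU1's HBM2
    cudaMalloc(&buf0, bytes);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);       // and vice versa
    cudaMalloc(&buf1, bytes);

    // The copy engines move data GPU1 -> GPU0 across the NVLink "bridge"
    // between the two XBARs, leaving PCIe free for CPU traffic.
    cudaMemcpyPeer(buf0, 0, buf1, 1, bytes);
    cudaDeviceSynchronize();

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}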
THE “ONE GIGANTIC GPU” IDEAL
Highest number of GPUs possible
A single GPU Driver process controls all work across all GPUs
From the perspective of the GPCs, all HBM2s can be accessed without intervention by other processes (LD/ST instructions, Copy Engine RDMA, everything “just works”)
Access to all HBM2s is independent of PCIe
Bandwidth across the bridged XBARs is as high as possible (some NUMA is unavoidable)
[Conceptual diagram: sixteen GPUs (GPU0–GPU15) and two CPUs all attached to one NVLink XBAR]
“ONE GIGANTIC GPU” BENEFITS
Problem Size Capacity
Problem size is limited by the aggregate HBM2 capacity of the entire set of GPUs, rather than the capacity of a single GPU
Strong Scaling
NUMA effects are greatly reduced compared to existing solutions
Aggregate bandwidth to HBM2 grows with the number of GPUs
Ease of Use
Apps written for a small number of GPUs port more easily
Abundant resources enable rapid experimentation
[Same sixteen-GPU NVLink XBAR diagram, with the XBAR marked with a question mark]
INTRODUCING NVSWITCH
Per-NVLink and die parameters:
Bidirectional Bandwidth per NVLink: 51.5 GBps
NRZ Lane Rate (x8 per NVLink): 25.78125 Gbps
Transistors: 2 Billion
Process: TSMC 12FFN
Die Size: 106 mm²
Chip-level parameters:
Bidirectional Aggregate Bandwidth: 928 GBps
NVLink Ports: 18
Mgmt Port (config, maintenance, errors): PCIe
LD/ST BW Efficiency (128 B pkts): 80.0%
Copy Engine BW Efficiency (256 B pkts): 88.9%
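The per-link and aggregate numbers are mutually consistent; as a quick check:

$8\ \mathrm{lanes} \times 25.78125\,\mathrm{Gbps} = 206.25\,\mathrm{Gbps} \approx 25.8\,\mathrm{GBps}$ per direction per NVLink
$2 \times 25.8\,\mathrm{GBps} \approx 51.5\,\mathrm{GBps}$ bidirectional per NVLink
$18\ \mathrm{ports} \times 51.5\,\mathrm{GBps} \approx 928\,\mathrm{GBps}$ aggregate bidirectional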
NVSWITCH BLOCK DIAGRAM
GPU-XBAR-bridging device; not a general networking device
Packet transforms make traffic to/from multiple GPUs look like it is to/from a single GPU
XBAR is non-blocking
SRAM-based buffering
NVLink IP blocks and XBAR design/verification infrastructure reused from V100
[Block diagram: 18 NVLink interfaces (NVLink 0–17), each with PHY, Data Link (DL), and Transaction Layer (TL) blocks feeding a Port Logic block (Routing; Error Check & Statistics Collection; Classification & Packet Transforms; Transaction Tracking & Packet Transforms), all joined by a non-blocking 18 x 18 XBAR; a Management block attaches via PCIe I/O]
NVLINK: PHYSICAL SHARED MEMORY
Physical Addressing
Virtual-to-physical address translation is done in the GPCs
NVLink packets carry physical addresses
NVSwitch and DGX-2 follow the same model
[Block diagram: the same two-GPU NVLink-bridged configuration shown earlier]
NVSWITCH: RELIABILITY
Hop-by-hop error checking deemed sufficient in the intra-chassis operating environment
Low BER (10⁻¹² or better) on intra-chassis transmission media removes the need for high-overhead protections like FEC
HW-based consistency checks guard against “escapes” from hop to hop
SW is responsible for additional consistency checks and clean-up
[Annotated block diagram; protection per component:]
XBAR: ECC on in-flight pkts
Port Logic: ECC on SRAMs and in-flight pkts; Access-Control checks for mis-configured pkts (either config/app errors or corruption); Configurable buffer allocations
NVLink (PHY/DL/TL): Data Link (DL) pkt tracking and retry for external links; CRC on pkts (external)
Management: Hooks for SW-based scrubbing for table corruptions and error conditions (orphaned packets); Hooks for partial-reset and queue draining
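To put the BER figure in perspective (a rough, back-of-the-envelope estimate, not from the slide), at full rate the chip carries

$18\ \mathrm{ports} \times 8\ \mathrm{lanes} \times 25.78125\,\mathrm{Gbps} \approx 3.7 \times 10^{12}\,\mathrm{bits/s}$
$3.7 \times 10^{12}\,\mathrm{bits/s} \times 10^{-12}\,\mathrm{errors/bit} \approx 4\ \mathrm{bit\ errors/s}$

so only a handful of bit errors per second chip-wide, each caught by the external-link CRC and repaired by the DL-layer retry, which is why heavier protection such as FEC is not needed.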
NVSWITCH: DIE ASPECT RATIO
Packet-processing and switching (XBAR) logic is extremely compact
With more than 50% of the area taken up by I/O blocks, maximizing the “perimeter”-to-area ratio was key to an efficient design
Having all NVLink ports exit the die on parallel paths simplified package substrate routing
[Die plot: the 18 NVLink PHY blocks line the two long edges of the elongated die, with the non-NVLink logic in between]
SIMPLE SWITCH SYSTEM
No NVSwitch
Connect GPUs directly
Aggregate NVLinks into gangs for higher bandwidth
Traffic is interleaved over the links to prevent camping
Max bandwidth between two GPUs is limited to the bandwidth of the gang
With NVSwitch
Interleave traffic across all the links to support full bandwidth between any pair of GPUs (see the sketch after this slide)
Traffic to a single GPU is non-blocking, so long as the aggregate bandwidth of six NVLinks is not exceeded
[Diagram: three V100s linked directly by an NVLink gang vs. three V100s each linked to a single NVSwitch]
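A minimal sketch of the interleaving idea; the real spreading function in the GPU and NVSwitch is a hardware mechanism that is not described here, so a plain address interleave is assumed purely for illustration:

// Spread consecutive 256 B regions of the address space across the links of a
// gang so that no single link is "camped on" by one hot stream of traffic.
// Illustrative only: the actual mapping is a hardware hash, not a modulo.
static inline int pick_link(unsigned long long addr, int num_links) {
    return (int)((addr >> 8) % (unsigned long long)num_links);
}

With six links, successive 256 B chunks of a large transfer rotate through links 0 to 5, so each link carries roughly one sixth of the traffic.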
SWITCH-CONNECTED SYSTEMS: 8 GPUs, 3 NVSwitches
Add NVSwitches in parallel to support more GPUs
An eight-GPU closed system can be built with three NVSwitches, with two NVLinks from each GPU to each switch
Gang width is still six — traffic is interleaved across all the GPU links
GPUs can now communicate pairwise using the full 300 GBps bidirectional bandwidth between any pair
The NVSwitch XBAR provides unique paths from any source to any destination — non-blocking, non-interfering
[Diagram: eight V100s, each connected by two NVLinks to each of three NVSwitches]
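Checking the pairwise figure: each GPU still drives six NVLinks (two into each of the three switches) at roughly 50 GBps bidirectional each (25 GBps per direction, as quoted earlier), so

$6 \times 50\,\mathrm{GBps} = 300\,\mathrm{GBps}$ bidirectional between any pair of GPUs.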
NON-BLOCKING BUILDING BLOCK
Taking this to the limit — connect one NVLink from each GPU to each of six switches
No routing between different switch planes is required
Eight of the 18 NVLinks available per switch are used to connect to GPUs
Ten NVLinks are available per switch for communication outside the local group (only eight are required to support full bandwidth)
This is the GPU baseboard configuration for DGX-2
[Diagram: eight V100s, each with one NVLink to each of six NVSwitches]
DGX-2 NVLINK FABRIC
Two of these building blocks together form a fully connected 16-GPU cluster
Non-blocking, non-interfering (unless the same destination is involved)
Regular loads, stores, and atomics just work (see the sketch after this slide)
Left-right symmetry simplifies physical packaging and manufacturability
[Diagram: two eight-GPU, six-NVSwitch baseboards joined switch-to-switch to form the 16-GPU fabric]
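Because the fabric exposes one flat physical address space, a kernel running on one GPU can load, store, or atomically update memory that physically lives in another GPU's HBM2. A hedged sketch (device numbering and launch sizes are arbitrary; error checking omitted):

#include <cuda_runtime.h>

// Runs on GPU0 but increments a counter that resides in GPU1's HBM2.
// The atomic traverses GPU0's XBAR, the NVSwitch fabric, and GPU1's XBAR.
__global__ void bump_remote(unsigned int *remote_counter) {
    atomicAdd(remote_counter, 1u);
}

int main() {
    unsigned int *counter;

    cudaSetDevice(1);
    cudaMalloc(&counter, sizeof(unsigned int));
    cudaMemset(counter, 0, sizeof(unsigned int));

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);   // map GPU1's memory into GPU0's address space
    bump_remote<<<1, 256>>>(counter);   // plain atomics "just work" across the fabric
    cudaDeviceSynchronize();

    cudaSetDevice(1);
    cudaFree(counter);
    return 0;
}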
NVIDIA DGX-2: SPEEDS AND FEEDS
Chassis parameters:
CPUs: Dual Xeon Platinum 8168
CPU Memory: 1.5 TB DDR4
Aggregate Storage: 30 TB (8 NVMe drives)
Peak Max TDP: 10 kW
Dimensions (H/W/D): 17.3” (10U) / 19.0” / 32.8” (440.0 mm / 482.3 mm / 834.0 mm)
Weight: 340 lbs (154.2 kg)
Cooling (forced air): 1,000 CFM
Compute and fabric parameters:
Number of Tesla V100 GPUs: 16
Aggregate FP64/FP32: 125/250 TFLOPS
Aggregate Tensor (FP16): 2000 TFLOPS
Aggregate Shared HBM2: 512 GB
Aggregate HBM2 Bandwidth: 14.4 TBps
Per-GPU NVLink Bandwidth: 300 GBps bidirectional
Chassis Bisection Bandwidth: 2.4 TBps
InfiniBand NICs: 8 Mellanox EDR
DGX-2 PCIE NETWORK
Xeon sockets are QPI-connected, but affinity binding keeps GPU-related traffic off QPI
The PCIe tree has NICs connected to pairs of GPUs to facilitate GPUDirect™ RDMAs over the IB network
Configuration and control of the NVSwitches is via a driver process running on the CPUs
[Topology diagram: two x86 sockets connected by QPI; each socket roots a PCIe switch tree that fans out to eight 100G NICs and the 16 V100 GPUs (NICs paired with GPUs), while the GPUs' NVLinks connect to the twelve NVSwitches on the two baseboards]
NVIDIA DGX-2: GPUS + NVSWITCH COMPLEX
Two GPU Baseboards, each with eight V100 GPUs and six NVSwitches
Two Plane Cards carry 24 NVLinks each
Eliminating repeaters and redrivers on all of the NVLinks conserves board space and power
NVIDIA DGX-2: SYSTEM PACKAGING
GPU Baseboards get power and PCIe from the front midplane
The front midplane connects to the I/O-Expander PCB (with PCIe switches and NICs) and the CPU Motherboard
48V PSUs reduce the current load on distribution paths
Internal “bus bars” bring current from each PSU to the front midplane
NVIDIA DGX-2: SYSTEM COOLING
Forced-air cooling of the Baseboards, I/O Expander, and CPUs is provided by ten 92 mm fans
Four supplemental 60 mm internal fans cool the NVMe drives and PSUs
Air reaching the NVSwitches is pre-heated by the GPUs, requiring “full height” heatsinks
DGX-2: SIGNAL INTEGRITY OPTIMIZATIONS
The repeaterless NVLink topology trades off longer traces to the GPUs for shorter traces to the Plane Cards
Custom SXM and Plane Card connectors reduce loss and crosstalk
TX and RX are grouped in separate pin-fields to reduce crosstalk
[Connector pin-field diagram: all TX pins grouped together, all RX pins grouped together]
DGX-2: ACHIEVED BISECTION BW
The test has each GPU reading data from another GPU across the bisection (from a GPU on the other Baseboard)
Raw bisection bandwidth is 2.472 TBps
The achieved Read bisection bandwidth of 1.98 TBps matches the theoretical 80% bidirectional NVLink efficiency
“All-to-all” results (each GPU reads from all eight GPUs on the other PCB) are similar
[Chart: aggregate read bandwidth (GB/sec, 0–2500) vs. number of GPUs participating (1–16)]
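The raw number follows from the link count crossing the bisection: the two Plane-Card sets carry 24 NVLinks each, so 48 NVLinks cross between the baseboards:

$48 \times 51.5\,\mathrm{GBps} \approx 2.47\,\mathrm{TBps}$ raw bisection bandwidth
$0.80 \times 2.47\,\mathrm{TBps} \approx 1.98\,\mathrm{TBps}$ at the 80% LD/ST packet efficiency quoted for NVSwitch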
DGX-2: CUFFT (16K X 16K)
Results are “iso-problem instance” (more GFLOPS means shorter running time)
As the problem is split over more GPUs, it takes longer to transfer data than to calculate locally
The aggregate nature of HBM2 capacity enables larger problem sizes
[Chart: cuFFT throughput (GFLOPS, 0–9000) vs. number of GPUs (1–8), comparing ½ DGX-2 (All-to-All NVLink Fabric) against DGX-1™ (V100 and Hybrid Cube Mesh)]
DGX-2: ALL-REDUCE BENCHMARK
All-reduce is an important communication primitive in machine-learning apps
DGX-2 provides increased bandwidth and lower latency compared to two 8-GPU servers connected with InfiniBand (see the sketch after this slide)
The NVLink network has efficiency superior to InfiniBand at smaller message sizes
[Chart: all-reduce bandwidth (MB/s, up to ~120,000) vs. message size (128 KB–512 MB), DGX-2 (NVLink Fabric) vs. 2 DGX-1 (4x 100 Gb IB); annotations mark 3X and 8X advantages for DGX-2]
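For reference, a minimal single-process sketch of the kind of operation being measured, written against NCCL (NCCL is assumed here as the collective library; the slide does not say which software produced the curves). The 16-device count and buffer size are illustrative:

#include <nccl.h>
#include <cuda_runtime.h>

int main() {
    const int ndev = 16;              // all GPUs in a DGX-2
    const size_t count = 32 << 20;    // 32M floats (128 MB) per GPU
    int devs[ndev];
    ncclComm_t comms[ndev];
    float *buf[ndev];
    cudaStream_t streams[ndev];

    for (int i = 0; i < ndev; ++i) {
        devs[i] = i;
        cudaSetDevice(i);
        cudaMalloc(&buf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }
    ncclCommInitAll(comms, ndev, devs);   // one communicator per GPU

    // Sum-reduce the 16 buffers in place; over the NVSwitch fabric this runs
    // at NVLink bandwidth instead of funneling through the InfiniBand NICs.
    ncclGroupStart();
    for (int i = 0; i < ndev; ++i)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}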
DGX-2: >2X SPEED-UP ON TARGET APPS
Two DGX-1 Servers (V100) Compared to DGX-2 - Same Total GPU Count
HPC:
Physics (MILC benchmark), 4D Grid: 2X faster
Weather (ECMWF benchmark), All-to-all: 2.4X faster
AI Training:
Language Model (Transformer with MoE), All-to-all: 2X faster
Recommender (Sparse Embedding), Reduce & Broadcast: 2.7X faster
Note: The two DGX-1 servers have dual-socket Xeon E5-2698 v4 processors and eight Tesla V100 32 GB GPUs each, connected via four EDR IB/GbE ports. The DGX-2 server has dual-socket Xeon Platinum 8168 processors and 16 Tesla V100 32 GB GPUs.
NVSWITCH AVAILABILITY: NVIDIA HGX-2™
Eight-GPU baseboard with six NVSwitches
Two HGX-2 boards can be passively connected to realize 16-GPU systems
ODM/OEM partners build servers utilizing NVIDIA HGX-2 GPU baseboards
A design guide provides recommendations
SUMMARY
New NVSwitch
18-NVLink-Port, Non-Blocking NVLink Switch
51.5 GBps per Port
928 GBps Aggregate Bidirectional Bandwidth
DGX-2
“One Gigantic GPU” Scale-Up Server with 16 Tesla V100 GPUs
125 TF (FP64) to 2000 TF (Tensor)
512 GB Aggregate HBM2 Capacity
NVSwitch Fabric Enables >2X Speed-Up on Target Apps