2.01_Nvidia_NVswitch_HotChips2018_DGX2NVS_Final.pdf
1.
NVSWITCH AND DGX-2: NVLINK-SWITCHING CHIP AND SCALE-UP COMPUTE SERVER
Alex Ishii and Denis Foley, with Eric Anderson, Bill Dally, Glenn Dearth, Larry Dennison, Mark Hummel, and John Schafer
2.
NVIDIA DGX-2 SERVER AND NVSWITCH
- 16 Tesla™ V100 32 GB GPUs: FP64 125 TFLOPS, FP32 250 TFLOPS, Tensor 2,000 TFLOPS
- 512 GB of GPU HBM2
- Single-server chassis: 10U/19-inch rack mount, 10 kW peak TDP
- Dual 24-core Xeon CPUs, 1.5 TB DDR4 DRAM, 30 TB NVMe storage
- New NVSwitch chip: 18 second-generation NVLink™ ports, 25 GBps per port, 900 GBps total bidirectional bandwidth, 450 GBps total throughput
- 12-NVSwitch network: full-bandwidth fat-tree topology, 2.4 TBps bisection bandwidth, global shared memory, repeater-less
3.
OUTLINE
- The Case for "One Gigantic GPU" and NVLink Review
- NVSwitch: Speeds and Feeds, Architecture, and System Applications
- DGX-2 Server: Speeds and Feeds, Architecture, and Packaging
- Signal-Integrity Design Highlights
- Achieved Performance
4.
MULTI-CORE AND CUDA WITH ONE GPU
- Users explicitly express parallel work in NVIDIA CUDA®
- The GPU driver distributes work to available Graphics Processing Cluster (GPC)/Streaming Multiprocessor (SM) cores
- GPC/SM cores can compute on data in any of the second-generation High Bandwidth Memories (HBM2s)
- GPC/SM cores use the shared HBM2s to exchange data
[Block diagram: one GPU with GPCs, Hub, NVLink ports, Copy Engines, and PCIe I/O; HBM2 stacks and L2 caches attached via the XBAR; the CPU sends work (data and CUDA kernels) and receives results over PCIe]
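As a concrete illustration of this model (a minimal sketch, not taken from the slides), the host enqueues a kernel and the hardware schedules the resulting thread blocks onto whichever GPC/SM cores are free:

```cuda
#include <cuda_runtime.h>

// Each thread scales one element; thread blocks are distributed across free SMs.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMalloc(&x, n * sizeof(float));            // buffer lives in the GPU's HBM2
    scale<<<(n + 255) / 256, 256>>>(x, 2.0f, n);  // parallel work expressed explicitly
    cudaDeviceSynchronize();
    cudaFree(x);
    return 0;
}
```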
5.
TWO GPUS WITH PCIE
- Access to the HBM2 of the other GPU is at PCIe bandwidth (32 GBps bidirectional)
- Interactions with the CPU compete with GPU-to-GPU traffic
- PCIe is the "Wild West" (lots of performance bandits)
[Block diagram: two GPUs, each with GPCs, Hub, NVLink ports, Copy Engines, XBAR, and HBM2 + L2 caches, communicating with each other and the CPU over shared PCIe I/O]
6.
TWO GPUS WITH NVLINK
- All GPCs can access all HBM2 memories
- Access to the HBM2 of the other GPU is at multi-NVLink bandwidth (300 GBps bidirectional on V100 GPUs)
- NVLinks are effectively a "bridge" between the XBARs
- No collisions with PCIe traffic
[Block diagram: the same two GPUs, now with their XBARs bridged by NVLinks; PCIe carries only CPU work and results]
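From CUDA, this bridging is exposed through peer access; a minimal sketch, assuming two NVLink-connected V100s enumerated as devices 0 and 1:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel launched on GPU0 storing directly into GPU1's HBM2.
__global__ void fill_remote(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = 1.0f;                 // plain ST instruction; no copy call
}

int main() {
    int ok = 0;
    cudaDeviceCanAccessPeer(&ok, 0, 1);     // can device 0 reach device 1?
    if (!ok) { printf("no peer path\n"); return 1; }

    const int n = 1 << 20;
    cudaSetDevice(1);
    float *remote;
    cudaMalloc(&remote, n * sizeof(float)); // allocated in GPU1's HBM2

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);       // map GPU1's memory into GPU0's space
    fill_remote<<<(n + 255) / 256, 256>>>(remote, n);
    cudaDeviceSynchronize();
    return 0;
}
```

On NVLink-connected GPUs these stores ride the NVLink bridge rather than PCIe.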
7.
THE "ONE GIGANTIC GPU" IDEAL
- The highest number of GPUs possible
- A single GPU driver process controls all work across all GPUs
- From the perspective of the GPCs, all HBM2s can be accessed without intervention by other processes (LD/ST instructions, Copy Engine RDMA: everything "just works")
- Access to all HBM2s is independent of PCIe
- Bandwidth across the bridged XBARs is as high as possible (some NUMA is unavoidable)
[Diagram: GPU0 through GPU15 all bridged by one NVLink XBAR, with two CPUs attached]
8.
"ONE GIGANTIC GPU" BENEFITS
- Problem-size capacity: problem size is limited by the aggregate HBM2 capacity of the entire set of GPUs, rather than the capacity of a single GPU
- Strong scaling: NUMA effects are greatly reduced compared to existing solutions, and aggregate bandwidth to HBM2 grows with the number of GPUs
- Ease of use: apps written for a small number of GPUs port more easily, and abundant resources enable rapid experimentation
[Diagram: the same 16-GPU NVLink XBAR, marked with a "?": how to build it is addressed next]
9.
INTRODUCING NVSWITCH

Link and die parameters:
- Bidirectional bandwidth per NVLink: 51.5 GBps
- NRZ lane rate (x8 lanes per NVLink): 25.78125 Gbps
- Transistors: 2 billion
- Process: TSMC 12FFN
- Die size: 106 mm^2

Switch parameters:
- Bidirectional aggregate bandwidth: 928 GBps
- NVLink ports: 18
- Management port (config, maintenance, errors): PCIe
- LD/ST bandwidth efficiency (128 B packets): 80.0%
- Copy Engine bandwidth efficiency (256 B packets): 88.9%
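As a consistency check on these figures (the arithmetic is implied by the slide, not stated on it):

\[
8~\text{lanes} \times 25.78125~\text{Gbps} = 206.25~\text{Gbps} \approx 25.78~\text{GBps per direction}
\]
\[
2 \times 25.78~\text{GBps} \approx 51.5~\text{GBps per NVLink}, \qquad 18 \times 51.56~\text{GBps} \approx 928~\text{GBps aggregate}
\]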
10.
NVSWITCH BLOCK DIAGRAM
- A GPU-XBAR-bridging device, not a general networking device
- Packet transforms make traffic to/from multiple GPUs look like it is to/from a single GPU
- The XBAR is non-blocking
- SRAM-based buffering
- The NVLink IP blocks and the XBAR design/verification infrastructure are reused from V100
[Block diagram: 18 NVLink ports (PHY, DL, TL), each with port logic for routing, error checking and statistics collection, classification and packet transforms, and transaction tracking and packet transforms, all feeding an 18 x 18 XBAR; management attaches via PCIe I/O]
11.
NVLINK: PHYSICAL SHARED MEMORY
- Virtual-to-physical address translation is done in the GPCs
- NVLink packets carry physical addresses
- NVSwitch and DGX-2 follow the same model
[Block diagram: two NVLink-bridged GPUs, as on the earlier slide, annotated to show physical addressing]
12.
NVSWITCH: RELIABILITY
- Hop-by-hop error checking was deemed sufficient in the intra-chassis operating environment
- Low BER (10^-12 or better) on the intra-chassis transmission media removes the need for high-overhead protections like FEC
- HW-based consistency checks guard against "escapes" from hop to hop
- SW is responsible for additional consistency checks and clean-up
Protections by block:
- XBAR: ECC on in-flight packets
- Port logic: ECC on SRAMs and in-flight packets; access-control checks for mis-configured packets (either config/app errors or corruption); configurable buffer allocations
- NVLink Data Link (DL): packet tracking and retry for external links; CRC on packets (external)
- Management: hooks for SW-based scrubbing of table corruptions and error conditions (orphaned packets); hooks for partial reset and queue draining
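To put the BER figure in scale (arithmetic only; not on the slide), at the 25.78125 Gbps lane rate a BER of 10^-12 corresponds to

\[
25.78125 \times 10^{9}~\text{b/s} \times 10^{-12} \approx 0.026~\text{bit errors per second per lane},
\]

roughly one error every 40 seconds per lane, which the DL-level CRC-and-retry absorbs without FEC overhead.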
13.
NVSWITCH: DIE ASPECT RATIO
- The packet-processing and switching (XBAR) logic is extremely compact
- With more than 50% of the area taken up by I/O blocks, maximizing the perimeter-to-area ratio was key to an efficient design
- Having all NVLink ports exit the die on parallel paths simplified package substrate routing
[Die plot: 18 NVLink PHY blocks lining the die edges around the non-NVLink logic]
14.
SIMPLE SWITCH SYSTEM
No NVSwitch:
- Connect GPUs directly
- Aggregate NVLinks into gangs for higher bandwidth
- Traffic is interleaved over the links to prevent camping
- Max bandwidth between two GPUs is limited to the bandwidth of the gang
With NVSwitch:
- Traffic is interleaved across all the links to support full bandwidth between any pair of GPUs
- Traffic to a single GPU is non-blocking, so long as the aggregate bandwidth of six NVLinks is not exceeded
[Diagram: three V100s directly connected by NVLink gangs vs. three V100s each attached to one NVSwitch]
15.
SWITCH-CONNECTED SYSTEMS
- Add NVSwitches in parallel to support more GPUs
- An eight-GPU closed system can be built with three NVSwitches and two NVLinks from each GPU to each switch
- Gang width is still six; traffic is interleaved across all the GPU links
- GPUs can now communicate pairwise using the full 300 GBps bidirectional bandwidth between any pair
- The NVSwitch XBAR provides unique paths from any source to any destination: non-blocking, non-interfering
[Diagram: 8 GPUs, 3 NVSwitches]
16.
NON-BLOCKING BUILDING BLOCK
- Taking this to the limit: connect one NVLink from each GPU to each of six switches
- No routing between different switch planes is required
- Eight of the 18 NVLinks available per switch are used to connect to GPUs
- Ten NVLinks per switch remain available for communication outside the local group (only eight are required to support full bandwidth)
- This is the GPU baseboard configuration for DGX-2
[Diagram: eight V100s fanned out across six NVSwitches]
17.
DGX-2 NVLINK FABRIC
- Two of these building blocks together form a fully connected 16-GPU cluster
- Non-blocking, non-interfering (unless the same destination is involved)
- Regular loads, stores, and atomics just work
- Left-right symmetry simplifies physical packaging and manufacturability
[Diagram: two baseboards, each with eight V100s and six NVSwitches, with the two switch banks cross-connected]
18.
NVIDIA DGX-2: SPEEDS AND FEEDS

GPU complex:
- Number of Tesla V100 GPUs: 16
- Aggregate FP64/FP32: 125/250 TFLOPS
- Aggregate Tensor (FP16): 2,000 TFLOPS
- Aggregate shared HBM2: 512 GB
- Aggregate HBM2 bandwidth: 14.4 TBps
- Per-GPU NVLink bandwidth: 300 GBps bidirectional
- Chassis bisection bandwidth: 2.4 TBps
- InfiniBand NICs: 8 Mellanox EDR

System:
- CPUs: dual Xeon Platinum 8168
- CPU memory: 1.5 TB DDR4
- Aggregate storage: 30 TB (8 NVMe drives)
- Peak max TDP: 10 kW
- Dimensions (H/W/D): 17.3" (10U) / 19.0" / 32.8" (440.0 mm / 482.3 mm / 834.0 mm)
- Weight: 340 lbs (154.2 kg)
- Cooling (forced air): 1,000 CFM
19.
DGX-2 PCIE NETWORK
- The Xeon sockets are QPI-connected, but affinity binding keeps GPU-related traffic off QPI
- The PCIe tree has NICs connected to pairs of GPUs to facilitate GPUDirect™ RDMA over the IB network
- Configuration and control of the NVSwitches is via a driver process running on the CPUs
[Diagram: two x86 sockets linked by QPI; a PCIe switch tree fans out to eight 100G NICs and the 16 V100s, which are interconnected by the twelve NVSwitches]
20.
NVIDIA DGX-2: GPUS + NVSWITCH COMPLEX
- Two GPU baseboards, each with eight V100 GPUs and six NVSwitches
- Two plane cards carry 24 NVLinks each
- No repeaters or redrivers on any of the NVLinks, which conserves board space and power
21.
NVIDIA DGX-2: SYSTEM PACKAGING
- The GPU baseboards get power and PCIe from the front midplane
- The front midplane connects to the I/O-expander PCB (with the PCIe switches and NICs) and the CPU motherboard
- 48 V PSUs reduce the current load on distribution paths
- Internal "bus bars" bring current from each PSU to the front midplane
22.
NVIDIA DGX-2: SYSTEM COOLING
- Forced-air cooling of the baseboards, I/O expander, and CPUs is provided by ten 92 mm fans
- Four supplemental 60 mm internal fans cool the NVMe drives and PSUs
- Air reaching the NVSwitches is pre-heated by the GPUs, requiring "full height" heatsinks
23.
DGX-2: SIGNAL-INTEGRITY OPTIMIZATIONS
- The repeater-less NVLink topology trades off longer traces to the GPUs for shorter traces to the plane cards
- Custom SXM and plane-card connectors reduce loss and crosstalk
- TX and RX are grouped in separate pin-fields to reduce crosstalk
[Connector pin-field diagram: all TX pins grouped together, all RX pins grouped together]
24.
DGX-2: ACHIEVED BISECTION BANDWIDTH
- The test has each GPU reading data from a GPU across the bisection (a GPU on the other baseboard)
- Raw bisection bandwidth is 2.472 TBps; 1.98 TBps is achieved
- The achieved read bisection bandwidth matches the theoretical 80% bidirectional NVLink efficiency
- "All-to-all" results (each GPU reads from all eight GPUs on the other PCB) are similar
[Chart: achieved read bandwidth (GB/s) vs. number of participating GPUs, 1 through 16, rising toward roughly 2,000 GB/s at 16 GPUs]
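The slides do not include the benchmark source; the sketch below shows one leg of such a test under stated assumptions (peer access works across baseboards, and devices 0 and 8 are a hypothetical cross-bisection pair). In the actual measurement, all 16 GPUs issue such reads concurrently and the aggregate is reported.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256ull << 20;           // 256 MiB payload
    const int dst = 0, src = 8;                  // placeholder cross-baseboard pair

    cudaSetDevice(src);
    float *s; cudaMalloc(&s, bytes);
    cudaSetDevice(dst);
    float *d; cudaMalloc(&d, bytes);
    cudaDeviceEnablePeerAccess(src, 0);          // allow dst to read src's HBM2

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    cudaMemcpyPeerAsync(d, dst, s, src, bytes);  // one read across the bisection
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("%.1f GB/s\n", bytes / ms / 1e6);     // bytes / ms / 1e6 == GB/s
    return 0;
}
```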
25.
DGX-2: CUFFT (16K X 16K)
- Results are "iso-problem instance" (more GFLOPS means shorter running time)
- As the problem is split over more GPUs, it takes longer to transfer data than to calculate locally
- The aggregate nature of the HBM2 capacity enables larger problem sizes
[Chart: GFLOPS vs. number of GPUs (1 through 8) for half of a DGX-2 (all-to-all NVLink fabric) vs. a DGX-1™ (V100, hybrid cube mesh)]
26.
DGX-2: ALL-REDUCE BENCHMARK
- All-reduce is an important communication primitive in machine-learning apps
- DGX-2 provides increased bandwidth and lower latency compared to two 8-GPU servers connected with InfiniBand
- The NVLink network's efficiency is superior to InfiniBand's at smaller message sizes
[Chart: all-reduce bandwidth (MB/s) vs. message size (128K through 512M) for DGX-2 (NVLink fabric) vs. two DGX-1 (4x 100 Gb IB), annotated with 3X and 8X advantages]
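The slide does not name the benchmark harness; on machines like this, all-reduce is commonly driven through NCCL, which uses the NVLink fabric when it is available. A minimal single-process sketch (assumes NCCL is installed; the element count is a placeholder):

```cuda
#include <nccl.h>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);                      // e.g., 16 on a DGX-2

    int devs[16];
    for (int i = 0; i < n; ++i) devs[i] = i;
    ncclComm_t comms[16];
    ncclCommInitAll(comms, n, devs);             // one communicator per GPU

    const size_t count = 1 << 24;                // 16M floats (~64 MB) per GPU
    float *buf[16];
    cudaStream_t st[16];
    for (int i = 0; i < n; ++i) {
        cudaSetDevice(i);
        cudaMalloc(&buf[i], count * sizeof(float));
        cudaStreamCreate(&st[i]);
    }

    ncclGroupStart();                            // post the collective on every GPU
    for (int i = 0; i < n; ++i)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum, comms[i], st[i]);
    ncclGroupEnd();

    for (int i = 0; i < n; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(st[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```

Timing the collective and dividing bytes moved by elapsed time gives per-message-size bandwidth numbers of the kind plotted above.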
27.
DGX-2: >2X SPEED-UP ON TARGET APPS
One DGX-2 with NVSwitch compared against two DGX-1 servers (V100) with the same total GPU count.
[Bar chart: speed-ups of 2X to 2.7X across four workloads: HPC physics (MILC benchmark; 4D-grid communication), weather (ECMWF benchmark; all-to-all), AI-training language model (Transformer with MoE; all-to-all), and recommender (sparse embedding; reduce and broadcast)]
Footnote: The two DGX-1 servers each have dual-socket Xeon E5-2698 v4 processors and eight Tesla V100 32 GB GPUs, and are connected via four EDR IB/GbE ports. The DGX-2 server has dual-socket Xeon Platinum 8168 processors and 16 Tesla V100 32 GB GPUs.
28.
NVSWITCH AVAILABILITY: NVIDIA HGX-2™
- Eight-GPU baseboard with six NVSwitches
- Two HGX-2 boards can be passively connected to realize 16-GPU systems
- ODM/OEM partners build servers utilizing NVIDIA HGX-2 GPU baseboards
- A design guide provides recommendations
29.
SUMMARY
New NVSwitch:
- 18-NVLink-port, non-blocking NVLink switch
- 51.5 GBps per port; 928 GBps aggregate bidirectional bandwidth
DGX-2:
- "One Gigantic GPU" scale-up server with 16 Tesla V100 GPUs
- 125 TF (FP64) to 2,000 TF (Tensor)
- 512 GB aggregate HBM2 capacity
- The NVSwitch fabric enables >2X speed-up on target apps