BUILDING THE WORLD'S LARGEST GPU
Renee Yao, NVIDIA Senior Product Marketing Manager, AI Systems
Twitter: @ReneeYao1
THE DGX FAMILY OF AI SUPERCOMPUTERS

• Cloud-scale AI: NVIDIA GPU Cloud, the cloud platform with the highest deep learning efficiency
• AI workstation: DGX Station with Tesla V100 32GB, the personal AI supercomputer
• AI data center: DGX-1 with Tesla V100 32GB, the essential instrument for AI research
• AI data center: DGX-2 with Tesla V100 32GB, the world's most powerful AI system for the most complex AI challenges
10X PERFORMANCE GAIN IN LESS THAN A YEAR

Time to train (days), FairSeq, 55 epochs to solution, PyTorch training performance:
• DGX-1 with V100 (Sep '17): 15 days
• DGX-2 (Q3 '18): 1.5 days, 10 times faster

Includes software improvements across the stack (NCCL, cuDNN, etc.).
DGX-2 NOW SHIPPING

• NVIDIA Tesla V100 32GB
• Two GPU boards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512 GB total HBM2 memory, interconnected by plane card
• Twelve NVSwitches: 2.4 TB/sec bi-section bandwidth
• Eight EDR InfiniBand/100 GigE: 1600 Gb/sec total bi-directional bandwidth
• PCIe switch complex
• Two Intel Xeon Platinum CPUs
• 1.5 TB system memory
• 30 TB NVMe SSD internal storage
• Dual 10/25 Gb/sec Ethernet
MULTI-CORE AND CUDA WITH ONE GPU

[Block diagram: one GPU with GPCs, HBM2 memory controllers, XBAR, High-Speed Hub, NVLinks, Copy Engines, and PCIe I/O; the CPU sends work (data and CUDA kernels) and receives results (data).]

• Users explicitly express parallel work in CUDA
• The GPU driver distributes work to the available GPC/SM cores
• GPC/SM cores use shared HBM2 to exchange data
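To make the "work (data and CUDA kernels)" arrow concrete, here is a minimal single-GPU sketch (mine, not from the deck); the kernel, sizes, and use of managed memory are illustrative choices:

```
// Minimal sketch: a SAXPY kernel launched on one GPU. The runtime/driver
// schedules the resulting thread blocks across the available GPCs/SMs, and
// all blocks share the GPU's HBM2 through the XBAR.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));        // resident in HBM2 when touched on-GPU
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);  // work distributed over SMs
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);                     // expect 4.0
    cudaFree(x); cudaFree(y);
    return 0;
}
```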
TWO GPUS WITH PCIE

[Block diagram: two GPUs (GPU0, GPU1), each with GPCs, XBAR, HBM2 + memory controllers, High-Speed Hub, NVLinks, Copy Engines, and PCIe I/O; both attach to the CPU over PCIe for work (data and CUDA kernels) and results (data).]

• Access to the other GPU's HBM2 runs at PCIe bandwidth (16 GB/s)
• PCIe is the "Wild West" (lots of performance bandits)
• Interactions with the CPU compete with GPU-to-GPU traffic
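A minimal sketch (not from the deck) of GPU-to-GPU traffic in this configuration, using the standard cudaMemcpyPeer call; buffer size and device indices are arbitrary:

```
// Sketch: explicit peer copy between two GPUs. Without a direct NVLink/peer
// mapping the transfer moves over PCIe (staged through system memory), so it
// is limited to roughly PCIe Gen3 x16 bandwidth (~16 GB/s).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 256ull << 20;               // 256 MiB payload
    float *buf0, *buf1;

    cudaSetDevice(0);
    cudaMalloc(&buf0, bytes);
    cudaMemset(buf0, 0x5a, bytes);

    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);

    // Explicit GPU0 -> GPU1 copy, routed over the PCIe path here.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaDeviceSynchronize();

    printf("copied %zu bytes GPU0 -> GPU1\n", bytes);
    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}
```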
TWO GPUS WITH NVLINK

• Access to the other GPU's HBM2 runs at multi-NVLink bandwidth (150 GB/s in V100 GPUs)
• All GPCs can access all HBM2 memories
• NVLinks are effectively a "bridge" between XBARs

[Block diagram: the same two-GPU layout as the previous slide, but with the GPUs' NVLinks connected directly to each other in addition to the PCIe path through the CPU.]
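A hedged sketch (not from the deck) of the direct load/store path the slide describes: once peer access is enabled, a kernel on GPU0 can write GPU1's HBM2 with ordinary store instructions; the kernel and buffer names are illustrative:

```
// Sketch: direct peer load/store. After cudaDeviceEnablePeerAccess, a kernel
// on GPU0 can dereference a pointer allocated on GPU1; on NVLink-connected
// V100s those accesses travel over the NVLink "bridge" between the XBARs.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void fill_remote(float *remote, int n, float v) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) remote[i] = v;                        // plain store into peer HBM2
}

int main() {
    const int n = 1 << 20;
    float *buf1;

    cudaSetDevice(1);
    cudaMalloc(&buf1, n * sizeof(float));            // lives in GPU1's HBM2

    cudaSetDevice(0);
    int can = 0;
    cudaDeviceCanAccessPeer(&can, 0, 1);
    if (can) {
        cudaDeviceEnablePeerAccess(1, 0);            // map GPU1 memory into GPU0
        fill_remote<<<(n + 255) / 256, 256>>>(buf1, n, 3.0f);
        cudaDeviceSynchronize();
        printf("GPU0 wrote GPU1 memory directly\n");
    } else {
        printf("no peer access between GPU0 and GPU1\n");
    }

    cudaSetDevice(1);
    cudaFree(buf1);
    return 0;
}
```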
THE "ONE GIGANTIC GPU" IDEAL

• Number of GPUs is as high as possible
• A single GPU driver process controls all work across all GPUs
• From the perspective of the GPCs, all HBM2s can be accessed without intervention by other processes (LD/ST instructions, Copy Engine RDMA, everything "just works")
• Access to all HBM2s is independent of PCIe
• Bandwidth across bridged XBARs is as high as possible (some NUMA is unavoidable)

[Diagram: 16 GPUs attached to a single hypothetical "NVLink XBAR", with two CPUs on the side and a question mark over the fabric.]
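One way to see how close a given machine comes to this ideal is to query the peer-access matrix with the standard CUDA runtime API; this small sketch (mine, not from the deck) prints which GPU pairs can map each other's HBM2 for direct load/store:

```
// Sketch: print the peer-access matrix. On a fully connected NVSwitch system
// such as DGX-2, every off-diagonal entry is expected to report 1.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    printf("%d visible GPUs\n", n);

    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            int can = (i == j);                      // a GPU trivially "accesses" itself
            if (i != j) cudaDeviceCanAccessPeer(&can, i, j);
            printf("%d ", can);
        }
        printf("\n");
    }
    return 0;
}
```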
INTRODUCING NVSWITCH

Per-NVLink and die parameters:
• Bi-directional BW per NVLink: 51.5 GB/s
• NRZ lane rate (x8 per NVLink): 25.78125 Gb/s
• Transistors: 2 billion
• Process: TSMC 12FFN
• Die size: 106 mm^2

Switch-level parameters:
• Bi-directional aggregate BW: 928 GB/s
• NVLink ports: 18
• Management port (config, maintenance, errors): PCIe
• LD/ST BW efficiency (128B packets): 80.0%
• Copy Engine BW efficiency (256B packets): 88.9%
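As a quick sanity check (my arithmetic, not on the slide), the per-link and aggregate figures are consistent:

```latex
\frac{25.78125\ \text{Gb/s} \times 8\ \text{lanes} \times 2\ \text{directions}}{8\ \text{bits/byte}}
  \approx 51.56\ \text{GB/s per NVLink},
\qquad
18 \times 51.56\ \text{GB/s} \approx 928\ \text{GB/s aggregate}
```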
EXPANDABLE SYSTEM

• Taking this to the limit: connect one NVLink from each GPU to each of 6 switches
• No routing between different switch planes required
• 8 of the 18 NVLinks available per switch are used to connect to GPUs
• 10 NVLinks remain available per switch for communication outside the local group (only 8 are required to support full bandwidth)
• This is the GPU baseboard configuration for DGX-2

[Diagram: eight V100 GPUs attached to a bank of NVSwitches.]
DGX-2 NVLINK FABRIC

• Two of these building blocks together form a fully connected 16-GPU cluster
• Non-blocking, non-interfering (unless the same destination is involved)
• Regular loads, stores, and atomics just work
• Presenter's note: the astute among you will note that there is a redundant level of switches here, but this configuration simplifies system-level design and manufacturing

[Diagram: two 8x V100 baseboards, each with its own NVSwitch bank, joined through the NVSwitch planes.]
Data Science HW Architecture

• Single CPU node (128 GB/s memory bandwidth, 20 cores, 512 GB memory): typically very slow with 20 GB+ datasets
• CPU cluster (multiple such nodes): handles larger datasets but is still slow, limited by CPU/memory bandwidth, the number of processing cores, and network I/O
• DGX-2: 128x memory I/O, 300x core-to-core I/O, 100x processing cores
DGX-2 PCIE NETWORK

[Diagram: two x86 sockets linked by QPI, each rooting a tree of PCIe switches that fan out to eight 200G NICs and to the 16 V100 GPUs sitting on the NVSwitch fabric.]

• The Xeon sockets are QPI-connected, but affinity binding keeps GPU-related traffic off QPI
• The PCIe tree has NICs connected to pairs of GPUs to facilitate GPUDirect RDMA over the IB network
• Configuration and control of the NVSwitches is via a driver process running on the CPUs
NVIDIA DGX-2: GPUS + NVSWITCH COMPLEX

• Two GPU baseboards, each with 8 V100 GPUs and 6 NVSwitches
• Two plane cards carry 24 NVLinks each
NVIDIA DGX-2: SYSTEM COOLING

• Forced-air cooling of the baseboards, I/O expander, and CPUs provided by ten 92 mm fans
• Four supplemental 60 mm internal fans cool the NVMe drives and PSUs
• Air reaching the NVSwitches is pre-heated by the GPUs, so they use "full height" heatsinks
DGX-2: cuFFT

• Results are "iso-problem instance" (more GFLOPS means shorter running time)
• As the problem is split over more GPUs, it takes longer to transfer data than to calculate locally

[Chart: cuFFT performance comparing DGX-1V and DGX-2 configurations.]
DGX-2: ALL-REDUCE BENCHMARK

[Chart: all-reduce bandwidth (MB/s) vs. message size from 4 KB to 512 MB, comparing 2x DGX-1V with 1x 100 Gb IB, 2x DGX-1V with 4x 100 Gb IB, DGX-2 with ring-topology communication, and DGX-2 with all-to-all communication; an "8x" callout marks the gap to the DGX-2 curves.]

• All-reduce is an important communication primitive in machine-learning apps
• Increased bandwidth compared to two 8-GPU servers
• The all-to-all NVSwitch network reduces latency overheads vs. simpler topologies (e.g., "rings")
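The benchmark above measures NCCL's all-reduce; below is a minimal single-process sketch (mine, not from the deck) of the same primitive using the public NCCL C API. The buffer size, data type, and one-process-for-all-GPUs setup are illustrative assumptions:

```
// Sketch: sum all-reduce across every visible GPU from one process. NCCL
// picks the transport (NVLink/NVSwitch, PCIe, or IB) and topology itself.
#include <nccl.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);

    std::vector<int> devs(n);
    for (int i = 0; i < n; ++i) devs[i] = i;

    std::vector<ncclComm_t> comms(n);
    ncclCommInitAll(comms.data(), n, devs.data());   // one communicator per GPU

    const size_t count = 64ull << 20;                // 64M floats (~256 MB)
    std::vector<float*> buf(n);
    std::vector<cudaStream_t> stream(n);
    for (int i = 0; i < n; ++i) {
        cudaSetDevice(i);
        cudaMalloc(&buf[i], count * sizeof(float));
        cudaMemset(buf[i], 0, count * sizeof(float));
        cudaStreamCreate(&stream[i]);
    }

    // Group the per-GPU calls so NCCL can launch them as one collective.
    ncclGroupStart();
    for (int i = 0; i < n; ++i)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum, comms[i], stream[i]);
    ncclGroupEnd();

    for (int i = 0; i < n; ++i) { cudaSetDevice(i); cudaStreamSynchronize(stream[i]); }
    printf("all-reduce of %zu floats across %d GPUs done\n", count, n);

    for (int i = 0; i < n; ++i) {
        cudaSetDevice(i);
        cudaFree(buf[i]);
        cudaStreamDestroy(stream[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```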
DGX-2: UP TO 2.7X ON TARGET APPS

2x DGX-1 (Volta) vs. DGX-2 with NVSwitch:
• Physics (MILC benchmark, 4D grid): 13K vs. 26K GFLOPS, 2x faster
• Weather (IFS benchmark, FFT/all-to-all): 11 vs. 26 steps/sec, 2.4x faster
• Recommender (sparse embedding, reduce & broadcast): 11B vs. 22B lookups/sec, 2x faster
• Language model (Transformer with MoE, all-to-all): 9.3 hr vs. 3.4 hr, 2.7x faster

The 2 DGX-1V servers have dual-socket Xeon E5-2698 v4 processors and 8x V100 32GB GPUs each, connected via 4 EDR IB ports. The DGX-2 server has dual-socket Xeon Platinum 8168 processors and 16 V100 32GB GPUs.
FLEXIBILITY WITH VIRTUALIZATION

Enable your own private DL training cloud for your enterprise:
• KVM hypervisor for Ubuntu Linux
• Enables teams of developers to simultaneously access a DGX-2
• Flexibly allocate GPU resources to each user and their experiments
• Full GPU and NVSwitch access within VMs, with anywhere from all GPUs down to as few as 1
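As a small illustration (not from the deck), a tenant can see which slice of the machine a VM was given by enumerating devices with the CUDA runtime; only the GPUs passed through to the guest appear:

```
// Sketch: enumerate the GPUs visible inside a guest VM (or container).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    printf("GPUs visible to this VM/process: %d\n", n);
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, i);
        printf("  %d: %s, %.0f GB memory\n", i, p.name,
               p.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}
```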
CRISIS MANAGEMENT SOLUTION

Natural disasters are increasingly causing major destruction to life, property, and economies. DFKI is using the NVIDIA DGX-2 to evolve DeepEye, which uses satellite images enriched with social media content to identify natural disasters, into a crisis management solution. With the increased GPU memory and fully connected GPUs based on the NVSwitch architecture, DFKI can build bigger models and process more data to aid rescuers in their decision-making for faster, more efficient dispatching of resources.
"Fujifilm applies AI in a wide range of fields. In healthcare, multiple NVIDIA GPUs will deliver high-speed computation to develop AI supporting image diagnostics. The introduction of this supercomputer will massively increase our processing power. We expect that AI learning that once took days to complete can now be completed within hours."
Akira Yoda, Chief Digital Officer, FUJIFILM Corporation

Application areas:
• Pharmaceuticals
• Bio CDMO
• Regenerative medicine
• Analyzing and recognizing medical images
• Simulations of display materials and fine chemicals
AI ADOPTERS IMPEDED BY INFRASTRUCTURE

• AI boosts profit margins up to 15%
• 40% see infrastructure as impeding AI

Source: 2018 CTA Market Research
THE CHALLENGE OF AI INFRASTRUCTURE
Short-term thinking leads to longer-term problems

• Design guesswork: ensuring the architecture delivers predictable performance that scales
• Deployment complexity: procuring, installing, and troubleshooting compute, storage, networking, and software
• Multiple points of support: contending with multiple vendors across multiple layers in the stack
DESIGNING INFRASTRUCTURE THAT SCALES
Insights gained from deep learning data centers

• Rack design: DL drives close to operational limits; similarities to HPC best practices
• Networking: IB- or Ethernet-based fabric; 100 Gbps interconnect; high bandwidth, ultra-low latency
• Storage: datasets range from 10k's to millions of objects; terabyte levels of storage and up; high IOPS, low latency
• Facilities: assume higher watts per rack; higher FLOPS/watt means less data center floorspace required
• Software: scale requires "cluster-aware" software

Example (see the sizing arithmetic below):
• Autonomous vehicle = 1 TB/hr
• Training sets up to 500 PB
• RN50 (ResNet-50): 113 days to train
• Objective: 7 days
• 6 simultaneous developers = 97-node cluster
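The 97-node figure follows from the example numbers above; a rough check (my arithmetic, assuming one concurrent experiment per developer):

```latex
\frac{113\ \text{days on one node}}{7\ \text{days target}} \approx 16.1\ \text{nodes per developer},
\qquad
16.1 \times 6\ \text{developers} \approx 97\ \text{nodes}
```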
NVIDIA DGX POD™
A Reference Architecture for GPU Data Centers

• Initial reference architecture based on the NVIDIA® DGX-1™ server
• Designed for the deep learning training workflow
• Baseline for other reference architectures:
  • Easily upgraded to NVIDIA DGX-2™ and NVIDIA HGX-2™ servers
  • Industry-specific PODs
  • Storage and network partners
  • Server OEM solutions
DGX DATA CENTER REFERENCE DESIGN
Easy Deployment of DGX Servers for Deep Learning

Content:
• AI workflow and sizing
• NVIDIA AI software
• DGX POD design
• DGX POD installation and management
NVIDIA AUTOMOTIVE WORKFLOW ON SATURNV
Research workflow

Training:
• Many-node: user submits 1 job with many single-node training sessions (hyperparameter sweep)
• Multi-node: user submits 1 job with a single multi-node training session

Inference:
• Many-GPU: user submits many jobs, each with single-GPU inference

[Chart: the three workloads (inference, many-node training, multi-node training) plotted by storage performance vs. interconnect performance.]
EXAMPLE DL WORKFLOW: AUTOMOTIVE
Driving DL Platform: Training, Simulation, Testing

[Workflow diagram spanning raw data; indexing, selection, and labeling; and OTA updates.]
NVIDIA DGX POD: DGX-1
Reference Architecture in a Single 35 kW High-Density Rack

Fits within a standard-height 42 RU data center rack:
• Nine DGX-1 servers (9 x 3 RU = 27 RU)
• Twelve storage servers (12 x 1 RU = 12 RU)
• 10 GbE (min) storage and management switch (1 RU)
• Mellanox 100 Gbps intra-rack high-speed network switches (1 or 2 RU)

In real-life DL application development, one to two DGX-1 servers per developer are often required. One DGX POD supports five developers (AV workload), with each developer working on two experiments per day: one DGX-1 per developer per experiment per day.*

*300,000 to 0.5M images x 120 epochs @ 480 images/sec, ResNet-18 backbone detection network per experiment (see the arithmetic below)
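Reading the footnote's lower 300,000-image figure and the 480 images/sec rate at face value (my arithmetic, not from the deck), the one-experiment-per-day sizing is roughly consistent:

```latex
\frac{300{,}000\ \text{images} \times 120\ \text{epochs}}{480\ \text{images/s}}
  = 75{,}000\ \text{s} \approx 20.8\ \text{h} \approx 1\ \text{day per experiment}
```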
NVIDIA DGX POD: DGX-2
Reference Architecture in a Single 35 kW High-Density Rack

Fits within a standard-height 48 RU data center rack:
• Three DGX-2 servers (3 x 10 RU = 30 RU)
• Twelve storage servers (12 x 1 RU = 12 RU)
• 10 GbE (min) storage and management switch (1 RU)
• Mellanox 100 Gbps intra-rack high-speed network switches (1 or 2 RU)

In real-life DL application development, one DGX-2 per developer minimizes model training time. One DGX POD supports at least three developers (AV workload), with each developer working on two experiments per day: one DGX-2 per developer per 2 experiments per day.*

*Same workload as above: 300,000 to 0.5M images x 120 epochs @ 480 images/sec, ResNet-18 backbone detection network per experiment
NEW DGX PODS
DELIVERY, DEPLOYMENT, DEEP LEARNING IN A DAY

• 95% reduction in deployment time
• 5x increase in data scientist productivity
• $0 integration cost
• Adopted by leading auto, healthcare & telco companies
NVIDIA DGX SYSTEMS
Faster AI Innovation and Insight

The World's First Portfolio of Purpose-Built AI Supercomputers
• Powered by NVIDIA GPU Cloud
• Get started in AI, faster
• Effortless productivity
• Performance without compromise

For more information:
• DGX Systems: nvidia.com/dgx
• DGX POD: https://www.nvidia.com/en-us/data-center/resources/nvidia-dgx-pod-reference-architecture/
• DGX Reference Architecture: https://www.nvidia.com/en-us/data-center/dgx-reference-architecture/
  • 33. 35Twitter: @ReneeYao1 NVIDIA DGX SYSTEMS Faster AI Innovation and Insight The World’s First Portfolio of Purpose-Built AI Supercomputers • Powered by NVIDIA GPU Cloud • Get Started in AI – Faster • Effortless Productivity • Performance Without Compromise For More Information DGX Systems: nvidia.com/dgx DGX Pod: https://www.nvidia.com/en-us/data- center/resources/nvidia-dgx-pod-reference- architecture/ DGX Reference Architecture: https://www.nvidia.com/en-us/data-center/dgx- reference-architecture/ 35