Shinnosuke Furuya, Ph.D., HPC Developer Relations, NVIDIA
2021/09/21
Are GPUs Useful for Computational Mechanics Simulations?
2
Founded in 1993 | Jensen Huang, Founder & CEO | 19,000 employees | $16.7B revenue in FY21
Headquarters in Santa Clara; offices in Tokyo and 50+ locations worldwide
3
NVIDIA GPUS
4
NVIDIA GPUS AT A GLANCE
Architecture generations: Fermi (2010), Kepler (2012), Maxwell (2014), Pascal (2016), Volta (2017), Turing (2018), Ampere (2020)
Data Center GPU: M2090 (Fermi), K80 / K1 (Kepler), M40 / M10 (Maxwell), P100 / P4 (Pascal), V100 (Volta), T4 (Turing), A100 / A30 / A40 / A10 / A16 (Ampere)
RTX / Quadro: 6000 (Fermi), K6000 (Kepler), M6000 (Maxwell), P5000 / GP100 (Pascal), GV100 (Volta), RTX 8000 (Turing), RTX A6000 (Ampere)
GeForce: GTX 580 (Fermi), GTX 780 (Kepler), GTX 980 (Maxwell), GTX 1080 / TITAN Xp (Pascal), TITAN V (Volta), RTX 2080 Ti (Turing), RTX 3090 (Ampere)
5
AMPERE GPU ARCHITECTURE
A100 Tensor Core GPU
7 GPCs
7 or 8
TPCs/GPC
2 SMs/TPC
(108 SMs/GPU)
5 HBM2 stacks
GPC: GPU Processing Cluster, TPC: Texture Processing Cluster, SM: Streaming Multiprocessor
12 NVLink links
6
AMPERE GPU ARCHITECTURE
Streaming Multiprocessor (SM)
GA100 SM (A100, A30): 32 FP64 CUDA Cores, 64 FP32 CUDA Cores, 4 Tensor Cores
GA102 SM (A40, A10): up to 128 FP32 CUDA Cores, 1 RT Core, 4 Tensor Cores
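As a rough cross-check of the peak numbers on the next slide, the per-SM core counts above can be combined with the A100 boost clock (about 1.41 GHz, a figure not stated on this slide) in a short calculation; one FMA counts as two floating-point operations. A minimal sketch:

```python
# Rough peak-throughput estimate for A100 (GA100) from the per-SM core counts above.
# Assumptions: 108 SMs, ~1.41 GHz boost clock, 1 FMA = 2 FLOPs per core per cycle.
sms = 108
boost_clock_hz = 1.41e9          # assumed A100 boost clock

fp64_cores_per_sm = 32
fp32_cores_per_sm = 64

fp64_peak = sms * fp64_cores_per_sm * 2 * boost_clock_hz    # ~9.7e12 FLOP/s
fp32_peak = sms * fp32_cores_per_sm * 2 * boost_clock_hz    # ~19.5e12 FLOP/s

print(f"FP64 peak ~ {fp64_peak / 1e12:.1f} TFLOPS")   # matches the 9.7 TFLOPS quoted for A100
print(f"FP32 peak ~ {fp32_peak / 1e12:.1f} TFLOPS")   # matches the 19.5 TFLOPS quoted for A100
```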
7
DATA CENTER PRODUCT COMPARISON (SEPT 2021)
Columns: A100, A30, A40, A10, T4 (values marked * are peak with structured sparsity)

Performance
FP64 (no Tensor Core): 9.7 TFLOPS / 5.2 TFLOPS / - / - / -
FP64 Tensor Core: 19.5 TFLOPS / 10.3 TFLOPS / N/A / N/A / N/A
FP32 (no Tensor Core): 19.5 TFLOPS / 10.3 TFLOPS / 37.4 TFLOPS / 31.2 TFLOPS / 8.1 TFLOPS
TF32 Tensor Core: 156 | 312* TFLOPS / 82 | 165* TFLOPS / 74.8 | 149.6* TFLOPS / 62.5 | 125* TFLOPS / N/A
FP16 Tensor Core: 312 | 624* TFLOPS / 165 | 330* TFLOPS / 149.7 | 299.4* TFLOPS / 125 | 250* TFLOPS / 65 TFLOPS
BFloat16 Tensor Core: 312 | 624* TFLOPS / 165 | 330* TFLOPS / 149.7 | 299.4* TFLOPS / 125 | 250* TFLOPS / N/A
INT8 Tensor Core: 624 | 1248* TOPS / 330 | 661* TOPS / 299.3 | 598.6* TOPS / 250 | 500* TOPS / 130 TOPS
INT4 Tensor Core: 1248 | 2496* TOPS / 661 | 1321* TOPS / 598.7 | 1197.4* TOPS / 500 | 1000* TOPS / 260 TOPS

Form Factor
A100: SXM4 module on baseboard, or x16 PCIe Gen4, 2-slot FHFL, 3 NVLink bridges
A30: x16 PCIe Gen4, 2-slot FHFL, 1 NVLink bridge
A40: x16 PCIe Gen4, 2-slot FHFL, 1 NVLink bridge
A10: x16 PCIe Gen4, 1-slot FHFL
T4: x16 PCIe Gen3, 1-slot LP

GPU Memory: 40 | 80 GB HBM2e (A100) / 24 GB HBM2 (A30) / 48 GB GDDR6 (A40) / 24 GB GDDR6 (A10) / 16 GB GDDR6 (T4)
GPU Memory Bandwidth: 1555 | 1935 GB/s (A100 PCIe), 1555 | 2039 GB/s (A100 SXM4) / 933 GB/s (A30) / 696 GB/s (A40) / 600 GB/s (A10) / 300 GB/s (T4)
Multi-Instance GPU: up to 7 (A100) / up to 4 (A30) / N/A (A40, A10, T4)
Media Acceleration: 1 JPEG decoder, 5 video decoders (A100) / 1 JPEG decoder, 4 video decoders (A30) / 1 video encoder, 2 video decoders + AV1 decode (A40) / 1 video encoder, 2 video decoders (A10, T4)
Ray Tracing: No (A100, A30) / Yes (A40, A10, T4)
Graphics: A100 and A30 for in-situ visualization only (no vPC/vQuadro); A40 Best / A10 Better / T4 Good
Max Power: 400 W (A100 SXM4), 250 | 300 W (A100 PCIe) / 165 W (A30) / 300 W (A40) / 150 W (A10) / 70 W (T4)
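To see where a given machine sits in this table, the device name, memory size, and power limit can be queried at runtime through NVML. A minimal sketch using the pynvml Python bindings (assumed to be installed, e.g. via the nvidia-ml-py package, together with the NVIDIA driver):

```python
# List installed NVIDIA GPUs with memory size and power limit via NVML.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):          # older pynvml versions return bytes
            name = name.decode()
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)               # bytes
        power = pynvml.nvmlDeviceGetPowerManagementLimit(handle)   # milliwatts
        print(f"GPU {i}: {name}, {mem.total / 1e9:.0f} GB, power limit {power / 1000:.0f} W")
finally:
    pynvml.nvmlShutdown()
```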
8
TF32 TENSOR CORE TO SPEED UP FP32
Range of FP32 and Precision of FP16
Input in FP32 and Accumulation in FP32
Dataflow: the two FP32 input matrices are converted to TF32 for the Tensor Core multiply, products are accumulated in FP32, and the result is stored as an FP32 matrix.
Bit layouts (sign / range (exponent) / precision (mantissa)):
FP32: 1 / 8 / 23 bits
TF32: 1 / 8 / 10 bits (range of FP32, precision of FP16)
FP16: 1 / 5 / 10 bits
BFloat16: 1 / 8 / 7 bits
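The bit layouts above can be checked numerically. Below is a small NumPy sketch that emulates TF32 by keeping the sign, the 8 exponent bits, and only the top 10 mantissa bits of an FP32 value; it is an illustration of the format only, since the hardware performs the rounding inside the Tensor Core multiply and frameworks usually expose this as a switch (e.g. torch.backends.cuda.matmul.allow_tf32 in PyTorch).

```python
# Emulate TF32 rounding of FP32 values: keep sign + 8 exponent bits + top 10 mantissa bits.
import numpy as np

def to_tf32(x: np.ndarray) -> np.ndarray:
    bits = x.astype(np.float32).view(np.uint32)
    bits &= np.uint32(0xFFFFE000)        # clear the 13 low mantissa bits (23 - 10)
    return bits.view(np.float32)

x = np.float32(1.0) + np.float32(2.0) ** -12          # needs more than 10 mantissa bits
print(x, to_tf32(np.array([x]))[0])                   # the 2^-12 part is lost in TF32
print(to_tf32(np.array([1e30], dtype=np.float32)))    # large magnitudes keep the FP32 range
```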
9
GPU ACCELERATED
APPLICATION PERFORMANCE
10
GPU ACCELERATED APPS
GPU Applications
https://www.nvidia.com/en-us/gpu-accelerated-applications/
Search for applications and their supported GPU features on this page.
11
GPU ACCELERATED APPS
GPU Applications
Application / GPU scaling / Supported features
LS-DYNA: - / -
ABAQUS/Standard: Multi-GPU, Multi-Node / direct sparse solver, AMS solver, steady-state dynamics
STAR-CCM+: Single GPU, Single Node / rendering
Fluent: Multi-GPU, Multi-Node / linear equation solver, radiation heat transfer model, Discrete Ordinates radiation model
Nastran: - / -
Particleworks: Multi-GPU, Multi-Node / explicit and implicit methods
12
GPU ACCELERATED APPS
GPU Applications Catalog
https://images.nvidia.com/aem-dam/Solutions/Data-Center/tesla-product-literature/gpu-applications-catalog.pdf
13
GPU ACCELERATED APPS
HPC Application Performance
https://developer.nvidia.com/hpc-application-performance/
14
DS SIMULIA CST STUDIO
Time Domain Solver
[Chart] FIT solver throughput (Mcells/sec, higher is better) vs. simulation size (100^3, 150^3, 200^3, 300^3) for one and two A6000 and A100 GPUs (A6000_1GPU, A6000_2GPU, A100_1GPU, A100_2GPU); annotated speedups of 2.5x, 3.2x, 3.5x, and 3.8x.
A100 system: dual Xeon Gold 6148 (2x 20 cores), 384 GB DDR4-2666 RAM, 2x A100-PCIE-40GB GPUs, RHEL 7.7
A6000 system: dual AMD EPYC 7F72 (24 cores), 1 TB RAM, A6000 GPUs, NV driver 455.45.01, Windows 10
15
ALTAIR CFD (SPH SOLVER)
Ampere PCIe scaling performance
Aerospace Gearbox
Size: ~21M Fluid particles (~26.7M total)
1000 timesteps
Higher is Better
[Chart] Relative performance (higher is better), Aerospace Gearbox 26M, Altair CFD SPH Solver (Altair® nanoFluidX®) on an NVIDIA EGX server. Series: A100 80GB PCIe / A100 40GB PCIe / A30 PCIe / A40 PCIe.
1 GPU: 1.0 / 1.0 / 0.5 / 1.0
2 GPUs: 1.8 / 1.8 / 1.0 / 1.7
4 GPUs: 3.5 / 3.5 / 1.9 / 3.4
8 GPUs: 6.2 / 6.3 / 3.6 / 6.1
Tests run on a server with 2x AMD EPYC 7742@2.25GHz 3.4GHz Turbo (Rome), 64-core CPU, Driver 465.19.01, 512GB RAM, 8x NVIDIA GPUs,
Ubuntu 20.04, ECC off, HT Off
Relative performance calculated based on the average model performance (nanoseconds/particles/timesteps) on Altair nanoFluidX 2021.0
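One way to read the scaling bars above is as parallel efficiency: relative performance divided by GPU count. A short sketch using the A100 80GB PCIe figures from this slide (an illustrative calculation, not additional measured data):

```python
# Parallel efficiency from the Altair CFD SPH relative-performance numbers on this slide
# (A100 80GB PCIe series, single GPU = 1.0).
relative_perf = {1: 1.0, 2: 1.8, 4: 3.5, 8: 6.2}

for gpus, speedup in relative_perf.items():
    efficiency = speedup / gpus
    print(f"{gpus} GPUs: {speedup:.1f}x, {efficiency:.0%} parallel efficiency")
# e.g. 8 GPUs: 6.2x, ~78% parallel efficiency
```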
16
ROCKY DEM 4.4
Rotating drum benchmark with polyhedral and spherical particles
[Charts] Speed-up (higher is better) relative to 8 CPU cores (Intel Xeon Gold 6230 @ 2.10 GHz) vs. number of particles per GPU (31,250 to 2,000,000), y-axis 0-100x. Left panel: polyhedral-particle performance on 1x, 2x, and 4x V100. Right panel: polyhedral-particle performance on 1x, 2x, and 4x A100.
38x speedup for 1xV100 when compared with an 8-core CPU Intel Xeon Gold 6230 @ 2.10GHz
47x speedup for 1xA100 when compared with an 8-core CPU Intel Xeon Gold 6230 @ 2.10GHz
Case Description: Particles in a drum rotating at 1 rev/sec, simulated for 20,000 solver iterations
Hardware on Oracle Cloud: CPU: Intel Xeon Gold 6230 @ 2.1 GHz (8 cores); GPU: NVIDIA A100 and V100
17
PARAVIEW
WITH GPU ACCELERATION
18
SCIENTIFIC VISUALIZATION WITH PARAVIEW AND NVIDIA OPTIX
Ray tracing is one of the NVIDIA RTX technologies, and optimized ray-tracing APIs such as NVIDIA OptiX are available.
NVIDIA OptiX is integrated into ParaView.
On GPUs equipped with RT Cores, ray tracing is accelerated further.
Rendering takes time: once you start caring about image quality, you render repeatedly while changing light sources, textures, and other settings.
Ray-traced visualization of the large data sets produced by scientific computing can therefore be stressful.
Being able to use NVIDIA OptiX through ParaView, with ray tracing further accelerated by RT Cores, is very welcome.
From the #NVIDIA Tech Blog article:
https://medium.com/nvidiajapan/62b7c70e732a
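As a sketch of what enabling this looks like in practice, ParaView's Python interface (pvpython) exposes ray tracing as render-view properties. The property names below (EnableRayTracing, BackEnd, SamplesPerPixel) follow the ParaView 5.9-era OSPRay/OptiX integration and are an assumption to verify against your ParaView build:

```python
# Minimal pvpython sketch: render a data source with the OptiX path tracer.
# Assumes a ParaView build with ray-tracing support; property names may differ by version.
from paraview.simple import GetActiveViewOrCreate, Render, Show, Sphere

Sphere()                                  # stand-in for a loaded simulation dataset
view = GetActiveViewOrCreate('RenderView')
Show()                                    # show the active source in the view

view.EnableRayTracing = 1                 # switch from rasterization to ray tracing
view.BackEnd = 'OptiX pathtracer'         # uses RT Cores when the GPU has them
view.SamplesPerPixel = 16                 # more samples: less noise, longer render
Render()
```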
19
SCIENTIFIC VISUALIZATION WITH PARAVIEW AND NVIDIA OPTIX
From the #NVIDIA Tech Blog article:
https://medium.com/nvidiajapan/62b7c70e732a
20
GPU COMPUTING
IN THE FUTURE
21
GIANT MODELS PUSHING LIMITS OF EXISTING ARCHITECTURE
Requires a New Architecture
System bandwidth bottleneck in today's x86 node (x86 CPU with DDR4 memory feeding 4 GPUs with HBM2e):
GPU memory: 8,000 GB/s
CPU memory: 200 GB/s
PCIe Gen4 (effective per GPU): 16 GB/s
Memory-to-GPU: 64 GB/s (4 GPUs x 16 GB/s)
[Chart] Model size (trillions of parameters, log scale) vs. year (2018-2023): ELMo (94M), BERT-Large (340M), GPT-2 (1.5B), Megatron-LM (8.3B), T5 (11B), Turing-NLG (17.2B), GPT-3 (175B); the trend points to 100-trillion-parameter models by 2023.
22
NVIDIA GRACE
Breakthrough CPU Designed for Giant-Scale AI and HPC Applications
FASTEST INTERCONNECTS: >900 GB/s cache-coherent NVLink CPU-to-GPU (14x); >600 GB/s CPU-to-CPU (2x)
NEXT-GENERATION ARM NEOVERSE CORES: >300 SPECrate2017_int_base (est.)
HIGHEST MEMORY BANDWIDTH: >500 GB/s LPDDR5x with ECC; >2x higher bandwidth, 10x higher energy efficiency
Availability: 2023
23
TURBOCHARGED TERABYTE SCALE ACCELERATED COMPUTING
Evolving Architecture For New Workloads
CURRENT x86 ARCHITECTURE (x86 CPU with DDR4 memory + 4 GPUs with HBM2e):
GPU memory 8,000 GB/s; CPU memory 200 GB/s; PCIe Gen4 16 GB/s effective per GPU; memory-to-GPU 64 GB/s; transfers 2 TB in about 30 seconds
INTEGRATED CPU-GPU ARCHITECTURE (4x Grace with LPDDR5x + 4 GPUs with HBM2e):
GPU memory 8,000 GB/s; CPU memory 500 GB/s; NVLink 500 GB/s; memory-to-GPU 2,000 GB/s; transfers 2 TB in about 1 second
FINE-TUNE TRAINING OF A 1T MODEL: 3 days instead of 1 month
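The 30-second and 1-second figures follow directly from the memory-to-GPU bandwidths quoted above; a quick check:

```python
# Time to move 2 TB from CPU memory to the GPUs at the aggregate bandwidths quoted above.
data_gb = 2000.0                                   # 2 TB
for label, bandwidth_gbps in [("x86 + PCIe Gen4", 64), ("Grace + NVLink", 2000)]:
    seconds = data_gb / bandwidth_gbps
    print(f"{label}: {seconds:.0f} s")             # ~31 s vs 1 s
```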
REAL-TIME INFERENCE
ON 0.5T MODEL
Interactive Single Node NLP Inference
Bandwidth claims rounded to the nearest hundred for illustration.
Performance results are based on projections for these configurations. Grace: 8x Grace + 8x A100 with 4th-generation NVIDIA NVLink connecting CPU and GPU; x86: DGX A100.
Training: the 1-month figure is fine-tuning a 1T-parameter model on a large custom data set on 64x Grace + 64x A100, compared with 8x DGX A100 (16x x86 + 64x A100).
Inference: a 530B-parameter model on 8x Grace + 8x A100, compared with DGX A100.
24
NVIDIA Fall HPC Weeks
Week 1: October 11, 2021 (Mon) GPU Computing & Network Deep Dive
Week 2: October 18, 2021 (Mon) HPC + Machine Learning
Week 3: October 25, 2021 (Mon) GPU Applications
https://events.nvidia.com/hpcweek/
Speakers:
Stephen W. Keckler, NVIDIA
Torsten Hoefler, ETH Zürich
Takayuki Aoki, Tokyo Institute of Technology
Tobias Weinzierl, Durham University
James Legg, University College London
Mark Turner, Durham University
Daisuke Okanohara, Preferred Networks
Rio Yokota, Tokyo Institute of Technology
Kazuki Yoshizoe, Kyushu University
Yutaka Akiyama, Tokyo Institute of Technology
Tsuyoshi Ichimura, The University of Tokyo
Tomohiro Takaki, Kyoto Institute of Technology
25
SUMMARY
Current NVIDIA data center GPUs
A100 & A30 for FP64, A40 & A10 for FP32
GPU-accelerated application performance
SIMULIA CST Studio, Altair CFD, and Rocky DEM achieve excellent performance on GPUs
ParaView with GPU acceleration
Ray tracing is accelerated by RT Cores
In the future
The Grace CPU will improve memory bandwidth between CPU and GPU