計算力学シミュレーションに GPU は役立つのか？

Shinnosuke Furuya, Ph.D., HPC Developer Relations, NVIDIA
2021/09/21
計算⼒学シミュレーションに
GPU は役⽴つのか？

2
Founded in 1993 Jensen Huang, Founder & CEO 19,000 Employees $16.7B in FY21
Santa Clara Tokyo
50+ Offices

4
NVIDIA GPUS AT A GLANCE
Fermi
(2010)
Kepler
(2012)
M2090
Maxwell
(2014)
Pascal
(2016)
Volta
(2017)
Turing
(2018)
Ampere
(2020)
K80
M40
M10
K1
P100
P4
T4
V100
Data Center
GPU
RTX / Quadro
GeForce
A100
A30
6000 K6000 M6000 P5000 GP100 GV100 RTX 8000
GTX 580 GTX 780 GTX 980 GTX 1080 TITAN Xp TITAN V RTX 2080 Ti
RTX A6000
RTX 3090
A40
A10
A16

5
AMPERE GPU ARCHITECTURE
A100 Tensor Core GPU
7 GPCs
7 or 8
TPCs/GPC
2 SMs/TPC
(108 SMs/GPU)
5 HBM2 stacks
GPC: GPU Processing Cluster, TPC: Texture Processing Cluster, SM: Streaming Multiprocessor
12 NVLink links

6
AMPERE GPU ARCHITECTURE
Streaming Multiprocessor (SM)
GA100 (A100, A30) GA102 (A40, A10)
32 FP64 CUDA Cores
64 FP32 CUDA Cores
4 Tensor Cores
Up to 128 FP32
CUDA Cores
1 RT Core
4 Tensor Cores

7
DATA CENTER PRODUCT COMPARISON (SEPT 2021)
* Performance with structured sparse matrix
A100* A30* A40 A10 T4
Performance
FP64 (no Tensor Core) 9.7 TFlops 5.2 TFlops - - -
FP64 Tensor Core 19.5 TFlops 10.3 TFlops N/A N/A N/A
FP32 (no Tensor Core) 19.5 TFlops 10.3 TFlops 37.4 Tflops 31.2 TFlops 8.1 TFlops
TF32 Tensor Core 156 | 312 TFlops* 82 | 165 Tflops* 74.8 | 149.6 TFlops* 62.5 | 125 TFlops* N/A
FP16 Tensor Core 312 | 624 Tflops* 165 | 330 TFlops* 149.7 | 299.4 TFlops* 125 | 250 TFlops* 65 TFlops
BFloat16 Tensor Core 312 | 624 Tflops* 165 | 330 TFlops* 149.7 | 299.4 TFlops* 125 | 250 TFlops* N/A
Int8 Tensor Core 624 | 1248 TOPS* 330 | 661 TOPS* 299.3 | 598.6 TOPS* 250 | 500 TOPS* 130 TOPS
Int4 Tensor Core 1248 | 2496 TOPS* 661 | 1321 TOPS* 598.7 | 1197.4 TOPS* 500 | 1000 TOPS* 260 TOPS
Form Factor
SXM4 module
on baseboard
x16 PCIe Gen4
2 Slot FHFL
3 NVLINK bridges
x16 PCIe Gen 4
2 Slot FHFL
1 NVLINK bridge
x16 PCIe Gen4
2 Slot FHFL
1 NVLINK bridge
x16 PCIe Gen 4
1 Slot FHFL
x16 PCIe Gen 3
1 Slot LP
GPU Memory 40 | 80 GB HBM2e 40 | 80 GB HBM2e 24 GB HBM2 48 GB GDDR6 24 GB GDDR6 16 GB GDDR6
GPU Memory Bandwidth 1555 | 1935 GB/s 1555 | 2039 GB/s 933 GB/s 696 GB/s 600 GB/s 300 GB/s
Multi-Instance GPU Up to 7 Up to 7 Up to 4 N/A N/A N/A
Media Acceleration
1 JPEG Decoder,
5 Video Decoder
1 JPEG Decoder
4 Video Decoder
1 Video Encoder,
2 Video Decoder
(+AV1 decode)
1 Video Encoder,
2 Video Decoder
Ray Tracing No No No Yes Yes Yes
Graphics
For in-situ visualization
(no vPC/vQuadro)
Best Better Good
Max Power 400 W 250 | 300 W 165 W 300 W 150 W 70 W

8
TF32 TENSOR CORE TO SPEEDUP FP32
Range of FP32 and Precision of FP16
Input in FP32 and Accumulation in FP32
FP32
Matrix
FP32
Matrix
FP32
Matrix
Format to TF32
and Multiply
FP32 Accumulate
23 bits
8 bits
10 bits
10 bits
7 bits
8 bits
5 bits
8 bits
FP32
TF32
FP16
BFloat16
Sign Range Precision
TF32 Range
TF32 Precision

9
GPU ACCELERATED
APPLICATION PERFORMANCE

10
GPU ACCELERATED APPS
GPU Applications
https://www.nvidia.com/en-us/gpu-accelerated-applications/
Search from here Supported features

11
GPU Applications
GPU scaling Supported features
LS-DYNA - -
ABAQUS/STANDARD Multi-GPU / Multi-Node
* Direct sparse solver
* AMS Solver
* Steady State Dynamics
STAR-CCM+ Single GPU / Single Node * Rendering
Fluent Multi-GPU / Multi-Node
* Linear equation solver
* Radiation heat transfer model
* Discrete Ordinate Radiation model
Nastran - -
Particleworks Multi-GPU / Multi-Node * Explicit and Implicit methods

12
GPU Applications Catalog
https://images.nvidia.com/aem-dam/Solutions/Data-Center/tesla-product-literature/gpu-applications-catalog.pdf

13
HPC Application Performance
https://developer.nvidia.com/hpc-application-performance/

14
DS SIMULIA CST STUDIO
Time Domain Solver
0
5000
10000
15000
20000
25000
100^3 150^3 200^3 300^3
Mcells/sec
Throughput
Simulation Size
FIT Performance
A6000_1GPU A6000_2GPU A100_1GPU A100_2GPU
3.5x
3.8x
2.5x
3.2x
Higher is Better
A100 System :- Dual Xeon Gold 6148 2x20 cores, 384GB RAM, DDR4-2666M, 2 A100-PCIE-40GB GPUs, RHEL7.7
A6000 System:- Dual AMD EPYC 7F72 24 cores, 1TB RAM, A6000 GPUs, NV Driver 455.45.01, Win 10

15
ALTAIR CFD (SPH SOLVER)
Ampere PCIe scaling performance
Aerospace Gearbox
Size: ~21M Fluid particles (~26.7M total)
1000 timesteps
Higher is Better
1.0 1.0
0.5
1.0
1.8 1.8
1.0
1.7
3.5 3.5
1.9
3.4
6.2 6.3
3.6
6.1
0X
1X
2X
3X
4X
5X
6X
7X
A100 80GB PCIe A100 40GB PCIe A30 PCIe A40 PCIe
Relative
Performance
Aerospace Gearbox 26M
Altair CFD SPH Solver (Altair® nanoFluidX®) on NVIDIA EGX Server
1 GPU 2 GPUs 4 GPUs 8 GPUs
Tests run on a server with 2x AMD EPYC 7742@2.25GHz 3.4GHz Turbo (Rome), 64-core CPU, Driver 465.19.01, 512GB RAM, 8x NVIDIA GPUs,
Ubuntu 20.04, ECC off, HT Off
Relative performance calculated based on the average model performance (nanoseconds/particles/timesteps) on Altair nanoFluidX 2021.0

16
ROCKY DEM 4.4
ROTATING DRUM Benchmark with polyhedron and spherical shaped particles
0
10
20
30
40
50
60
70
80
90
100
Polyhedron Cells Performance on V100
1 x V100
2 x V100
4 x V100
Speed-up
(relative
to
8xCPU
cores
Intel
Xeon
Gold
6230
CPU
at
2.10
GHz
31.25 62.5 125 250 500 1000 2000
Number of particles per GPU (x1000)
0
10
20
30
40
50
60
70
80
90
100
Polyhedron Cells Performance on A100
1 x A100
2 x A100
4 x A100
Speed-up
(relative
to
8xCPU
cores
Intel
Xeon
Gold
6230
CPU
at
2.10
GHz
31.25 62.5 125 250 500 1000 2000
Number of particles per GPU (x1000)
Higher is Better Higher is Better
38x speedup for 1xV100 when compared with an 8-core CPU Intel Xeon Gold 6230 @ 2.10GHz
47x speedup for 1xA100 when compared with an 8-core CPU Intel Xeon Gold 6230 @ 2.10GHz
Case Description: Particles in a drum rotating at 1 rev/sec, simulated for 20,000 solver iterations
Hardware on Oracle cloud: CPU: Intel Xeon Gold 6230 @2.1 GHz (8 cores)GPU: NVIDIA A100 and V100

17
PARAVIEW
WITH GPU ACCELERATION

18
PARAVIEW と NVIDIA OPTIX による SCIENTIFIC VISUALIZATION
NVIDIA RTX テクノロジの⼀つがレイトレーシング、NVIDIA OptiX など最適化されたレイトレーシング API がある
NVIDIA OptiX は Paraviwe にインテグレートされている
RT コアが搭載された GPU では、さらに加速される
レンダリングには時間がかかる
画像の品質にこだわりはじめると、光源やテクスチャなど様々な設定を変えてレンダリングを繰り返す
科学技術計算で得られるような⼤規模データを使ったレイトレーシングによる可視化はストレスフル
ParaView 経由で NVIDIA OptiX を使い、さらに RT コアでレイトレーシングが加速されると嬉しいのでは
#NVIDIA技術ブログの記事から
https://medium.com/nvidiajapan/62b7c70e732a

19
PARAVIEW と NVIDIA OPTIX による SCIENTIFIC VISUALIZATION
#NVIDIA技術ブログの記事から
https://medium.com/nvidiajapan/62b7c70e732a

21
GIANT MODELS PUSHING LIMITS OF EXISTING ARCHITECTURE
Requires a New Architecture
GPU 8,000 GB/sec
CPU 200 GB/sec
PCIE Gen4 (Effective Per GPU) 16 GB/sec
Mem-to-GPU 64 GB/sec
System Bandwidth Bottleneck
DDR4 HBM2e
GPU
GPU
GPU
GPU
x86
ELMo (94M)
BERT-Large (340M)
GPT-2
(1.5B)
Megatron-LM
(8.3B)
T5 (11B)
Turing-NLG
(17.2B)
GPT-3 (175B)
0.00001
0.0001
0.001
0.01
0.1
1
10
100
1000
2018 2019 2020 2021 2022 2023
Model
Size
(Trillions
of
Parameters)
100 TRILLION PARAMETER MODELS BY 2023

22
NVIDIA GRACE
Breakthrough CPU Designed for Giant-Scale AI and HPC Applications
FASTEST INTERCONNECTS
>900 GB/s Cache Coherent NVLink CPU To GPU (14x)
>600GB/s CPU To CPU (2x)
NEXT GENERATION ARM NEOVERSE CORES
>300 SPECrate2017_int_base est.
Availability 2023
HIGHEST MEMORY BANDWIDTH
>500GB/s LPDDR5x w/ ECC
>2x Higher B/W
10x Higher Energy Efficiency

23
TURBOCHARGED TERABYTE SCALE ACCELERATED COMPUTING
Evolving Architecture For New Workloads
CURRENT x86 ARCHITECTURE
DDR4 HBM2e
INTEGRATED CPU-GPU ARCHITECTURE
LPDDR5x HBM2e
3 DAYS FROM 1 MONTH
Fine-Tune Training of 1T Model
GPU
GPU
GPU
GPU
GRACE
GRACE
GRACE
GRACE
GPU
GPU
GPU
GPU
x86
Transfer 2TB in 30 secs Transfer 2TB in 1 secs
GPU 8,000 GB/sec
CPU 200 GB/sec
PCIE Gen4
(Effective Per GPU)
16 GB/sec
Mem-to-GPU 64 GB/sec
GPU 8,000 GB/sec
CPU 500 GB/sec
NVLink 500 GB/sec
Mem-to-GPU 2,000 GB/sec
REAL-TIME INFERENCE
ON 0.5T MODEL
Interactive Single Node NLP Inference
Bandwidth claims rounded to nearest hundred for illustration.
Performance results based on projections on these configurations Grace : 8xGrace and 8xA100 with 4th Gen NVIDIA NVLink Connection between CPU and GPU and x86: DGX A100.
Training: 1 Month of training is Fine-Tuning a 1T parameter model on a large custom data set on 64xGrace+64xA100 compared to 8xDGXA100 (16xX86+64xA100)
Inference: 530B Parameter model on 8xGrace+8xA100 compared to DGXA100.

24
NVIDIA 秋の HPC Weeks
Week 1 : 2021 年 10 ⽉ 11 ⽇ (⽉) GPU Computing & Network Deep Dive
Week 2 : 2021 年 10 ⽉ 18 ⽇ (⽉) HPC + Machine Learning
Week 3 : 2021 年 10 ⽉ 25 ⽇ (⽉) GPU Applications
https://events.nvidia.com/hpcweek/
Stephen W. Keckler
NVIDIA
Torsten Hoefler
ETH Zürich
⻘⽊尊之
東京⼯業⼤学
Tobias Weinzierl
Durham University
James Legg
University College London
Mark Turner
Durham University
岡野原⼤輔
Preferred Networks
横⽥理央
東京⼯業⼤学
美添⼀樹
九州⼤学
秋⼭泰
東京⼯業⼤学
市村強
東京⼤学
⾼⽊知弘
京都⼯芸繊維⼤学

25
SUMMARY
Current NVIDIA data center GPU
A100 & A30 for FP64, A40 & A10 for FP32
GPU accelerated application performance
Simulia CST Studio, Altair CFD and Rocky DEM got excellent performance on GPU
Paraview with GPU acceleration
Ray tracing accelerated with RT core
In future
Grace CPU improve memory bandwidth between CPU and GPU

計算力学シミュレーションに GPU は役立つのか？

計算力学シミュレーションに GPU は役立つのか？

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to 計算力学シミュレーションに GPU は役立つのか？

Similar to 計算力学シミュレーションに GPU は役立つのか？ (20)

Recently uploaded

Recently uploaded (20)

計算力学シミュレーションに GPU は役立つのか？