Shinnosuke Furuya, Ph.D., HPC Developer Relations, NVIDIA
2021/09/21
Are GPUs Useful for Computational Mechanics Simulations?
2
Founded in 1993. Jensen Huang, Founder & CEO. 19,000 employees. $16.7B revenue in FY21.
Santa Clara (HQ), Tokyo, and 50+ offices worldwide.
3
NVIDIA GPUS
4
NVIDIA GPUS AT A GLANCE
Architectures by year: Fermi (2010), Kepler (2012), Maxwell (2014), Pascal (2016), Volta (2017), Turing (2018), Ampere (2020)
Data Center GPU: M2090 (Fermi); K80, K1 (Kepler); M40, M10 (Maxwell); P100, P4 (Pascal); V100 (Volta); T4 (Turing); A100, A30, A40, A10, A16 (Ampere)
RTX / Quadro: 6000 (Fermi); K6000 (Kepler); M6000 (Maxwell); P5000, GP100 (Pascal); GV100 (Volta); RTX 8000 (Turing); RTX A6000 (Ampere)
GeForce: GTX 580 (Fermi); GTX 780 (Kepler); GTX 980 (Maxwell); GTX 1080, TITAN Xp (Pascal); TITAN V (Volta); RTX 2080 Ti (Turing); RTX 3090 (Ampere)
5
AMPERE GPU ARCHITECTURE
A100 Tensor Core GPU
7 GPCs, 7 or 8 TPCs per GPC, 2 SMs per TPC (108 SMs per GPU)
5 HBM2 stacks
12 NVLink links
GPC: GPU Processing Cluster, TPC: Texture Processing Cluster, SM: Streaming Multiprocessor
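As a quick sanity check, the SM count and memory size of the installed GPU can be queried at runtime. A minimal sketch using PyTorch's CUDA bindings (assuming PyTorch with CUDA support is installed; nvidia-smi or the CUDA deviceQuery sample report the same information):

```python
# Minimal sketch: query the GPU's SM count and memory size at runtime.
import torch

props = torch.cuda.get_device_properties(0)
print(f"Name:               {props.name}")
print(f"SM count:           {props.multi_processor_count}")    # 108 on A100
print(f"Memory:             {props.total_memory / 1e9:.1f} GB")
print(f"Compute capability: {props.major}.{props.minor}")      # 8.0 on A100 (Ampere)
```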
6
AMPERE GPU ARCHITECTURE
Streaming Multiprocessor (SM)
GA100 (A100, A30): 32 FP64 CUDA Cores, 64 FP32 CUDA Cores, 4 Tensor Cores per SM
GA102 (A40, A10): up to 128 FP32 CUDA Cores, 1 RT Core, 4 Tensor Cores per SM
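These per-SM core counts line up with the peak figures on the next slide. A worked example for the A100, assuming its published boost clock of roughly 1.41 GHz and counting each fused multiply-add (FMA) as two floating-point operations:

$$\text{FP64 peak} \approx 108\ \text{SMs} \times 32\ \tfrac{\text{cores}}{\text{SM}} \times 2\ \tfrac{\text{FLOP}}{\text{cycle}} \times 1.41\ \text{GHz} \approx 9.7\ \text{TFLOPS}$$
$$\text{FP32 peak} \approx 108 \times 64 \times 2 \times 1.41\ \text{GHz} \approx 19.5\ \text{TFLOPS}$$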
7
DATA CENTER PRODUCT COMPARISON (SEPT 2021)
* The second number in each "dense | sparse" pair is peak performance with structured sparsity. For the A100, memory and power figures are listed as 40 GB model | 80 GB model.

Performance (A100 / A30 / A40 / A10 / T4):
- FP64 (no Tensor Core): 9.7 / 5.2 / - / - / - TFLOPS
- FP64 Tensor Core: 19.5 / 10.3 / N/A / N/A / N/A TFLOPS
- FP32 (no Tensor Core): 19.5 / 10.3 / 37.4 / 31.2 / 8.1 TFLOPS
- TF32 Tensor Core: 156 | 312* / 82 | 165* / 74.8 | 149.6* / 62.5 | 125* / N/A TFLOPS
- FP16 Tensor Core: 312 | 624* / 165 | 330* / 149.7 | 299.4* / 125 | 250* / 65 TFLOPS
- BFloat16 Tensor Core: 312 | 624* / 165 | 330* / 149.7 | 299.4* / 125 | 250* / N/A TFLOPS
- INT8 Tensor Core: 624 | 1248* / 330 | 661* / 299.3 | 598.6* / 250 | 500* / 130 TOPS
- INT4 Tensor Core: 1248 | 2496* / 661 | 1321* / 598.7 | 1197.4* / 500 | 1000* / 260 TOPS

Form factor:
- A100 (SXM): SXM4 module on baseboard
- A100 (PCIe): x16 PCIe Gen4, 2-slot FHFL, 3 NVLink bridges
- A30: x16 PCIe Gen4, 2-slot FHFL, 1 NVLink bridge
- A40: x16 PCIe Gen4, 2-slot FHFL, 1 NVLink bridge
- A10: x16 PCIe Gen4, 1-slot FHFL
- T4: x16 PCIe Gen3, 1-slot LP

GPU memory: A100 (SXM) 40 | 80 GB HBM2e, A100 (PCIe) 40 | 80 GB HBM2e, A30 24 GB HBM2, A40 48 GB GDDR6, A10 24 GB GDDR6, T4 16 GB GDDR6
GPU memory bandwidth: A100 (SXM) 1555 | 2039 GB/s, A100 (PCIe) 1555 | 1935 GB/s, A30 933 GB/s, A40 696 GB/s, A10 600 GB/s, T4 300 GB/s
Multi-Instance GPU: A100 up to 7, A30 up to 4, A40 / A10 / T4 N/A
Media acceleration: A100 1 JPEG decoder + 5 video decoders, A30 1 JPEG decoder + 4 video decoders, A40 / A10 1 video encoder + 2 video decoders (+AV1 decode), T4 1 video encoder + 2 video decoders
Ray tracing: A100 No, A30 No, A40 Yes, A10 Yes, T4 Yes
Graphics: A100 / A30 for in-situ visualization only (no vPC/vQuadro); A40 Best, A10 Better, T4 Good
Max power: A100 (SXM) 400 W, A100 (PCIe) 250 | 300 W, A30 165 W, A40 300 W, A10 150 W, T4 70 W
8
TF32 TENSOR CORE TO SPEED UP FP32
Range of FP32 with the precision of FP16; input in FP32 and accumulation in FP32
The Tensor Core takes FP32 matrices, rounds the multiply operands to TF32, performs the multiply, and accumulates in FP32, so the result is an ordinary FP32 matrix.
Bit layouts (sign + range/exponent + precision/mantissa):
- FP32: 1 + 8 + 23 bits
- TF32: 1 + 8 + 10 bits (FP32 range, FP16 precision)
- FP16: 1 + 5 + 10 bits
- BFloat16: 1 + 8 + 7 bits
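Frameworks typically expose TF32 as a switch on FP32 matrix math rather than as a separate data type. A hedged sketch of what this looks like in PyTorch (these flags exist in PyTorch 1.7 and later; cuBLAS exposes an equivalent math-mode setting):

```python
# Sketch: let FP32 matrix multiplies run on TF32 Tensor Cores (Ampere GPUs).
# Inputs and the accumulator stay FP32; only the multiply operands are
# rounded to TF32 internally.
import torch

torch.backends.cuda.matmul.allow_tf32 = True   # matrix multiplies
torch.backends.cudnn.allow_tf32 = True         # convolutions via cuDNN

a = torch.randn(4096, 4096, device="cuda")     # ordinary FP32 tensors
b = torch.randn(4096, 4096, device="cuda")
c = a @ b                                      # executed with TF32 Tensor Cores
```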
9
GPU ACCELERATED APPLICATION PERFORMANCE
10
GPU ACCELERATED APPS
GPU Applications
https://www.nvidia.com/en-us/gpu-accelerated-applications/
The page lets you search for an application and check its supported features.
11
GPU ACCELERATED APPS
GPU Applications
Application (GPU scaling; supported features):
- LS-DYNA: - / -
- ABAQUS/STANDARD: Multi-GPU / Multi-Node; direct sparse solver, AMS solver, steady-state dynamics
- STAR-CCM+: Single GPU / Single Node; rendering
- Fluent: Multi-GPU / Multi-Node; linear equation solver, radiation heat transfer model, Discrete Ordinates radiation model
- Nastran: - / -
- Particleworks: Multi-GPU / Multi-Node; explicit and implicit methods
12
GPU ACCELERATED APPS
GPU Applications Catalog
https://images.nvidia.com/aem-dam/Solutions/Data-Center/tesla-product-literature/gpu-applications-catalog.pdf
13
GPU ACCELERATED APPS
HPC Application Performance
https://developer.nvidia.com/hpc-application-performance/
14
DS SIMULIA CST STUDIO
Time Domain Solver
[Chart: FIT solver throughput in Mcells/sec (0 to 25,000; higher is better) versus simulation size (100^3, 150^3, 200^3, 300^3 cells) for A6000 1 GPU, A6000 2 GPUs, A100 1 GPU, and A100 2 GPUs; callouts mark speedups of 2.5x, 3.2x, 3.5x, and 3.8x]
A100 system: dual Xeon Gold 6148 (2x 20 cores), 384 GB DDR4-2666 RAM, 2x A100-PCIE-40GB GPUs, RHEL 7.7
A6000 system: dual AMD EPYC 7F72 (24 cores), 1 TB RAM, A6000 GPUs, NVIDIA driver 455.45.01, Windows 10
15
ALTAIR CFD (SPH SOLVER)
Ampere PCIe scaling performance
Aerospace Gearbox: ~21M fluid particles (~26.7M total), 1000 timesteps; higher is better
Relative performance of the Altair CFD SPH solver (Altair® nanoFluidX®) on an NVIDIA EGX server, scaling from 1 to 8 GPUs:
- A100 80GB PCIe: 1.0 (1 GPU), 1.8 (2 GPUs), 3.5 (4 GPUs), 6.2 (8 GPUs)
- A100 40GB PCIe: 1.0, 1.8, 3.5, 6.3
- A30 PCIe: 0.5, 1.0, 1.9, 3.6
- A40 PCIe: 1.0, 1.7, 3.4, 6.1
Tests run on a server with 2x AMD EPYC 7742 @ 2.25 GHz (3.4 GHz turbo, Rome, 64-core CPU), driver 465.19.01, 512 GB RAM, 8x NVIDIA GPUs, Ubuntu 20.04, ECC off, HT off
Relative performance calculated from the average model performance (nanoseconds/particle/timestep) on Altair nanoFluidX 2021.0
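For context, the 8-GPU point for the A100 80GB PCIe implies a parallel efficiency of roughly

$$E_8 = \frac{S_8}{8} = \frac{6.2}{8} \approx 0.78,$$

i.e. about 78% of ideal linear scaling relative to a single GPU of the same type.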
16
ROCKY DEM 4.4
Rotating drum benchmark with polyhedron- and sphere-shaped particles
[Charts: speed-up relative to 8 CPU cores (Intel Xeon Gold 6230 @ 2.10 GHz) versus number of particles per GPU (31.25K, 62.5K, 125K, 250K, 500K, 1000K, 2000K) for 1x, 2x, and 4x V100 (left panel) and 1x, 2x, and 4x A100 (right panel); y-axis 0 to 100; higher is better]
38x speedup for 1x V100 compared with an 8-core Intel Xeon Gold 6230 @ 2.10 GHz
47x speedup for 1x A100 compared with an 8-core Intel Xeon Gold 6230 @ 2.10 GHz
Case description: particles in a drum rotating at 1 rev/sec, simulated for 20,000 solver iterations
Hardware on Oracle Cloud: CPU: Intel Xeon Gold 6230 @ 2.1 GHz (8 cores); GPU: NVIDIA A100 and V100
17
PARAVIEW WITH GPU ACCELERATION
18
SCIENTIFIC VISUALIZATION WITH PARAVIEW AND NVIDIA OPTIX
Ray tracing is one of the NVIDIA RTX technologies, and optimized ray tracing APIs such as NVIDIA OptiX are available.
NVIDIA OptiX is integrated into ParaView.
On GPUs equipped with RT Cores, it is accelerated even further.
Rendering takes time: once you start caring about image quality, you render repeatedly while adjusting light sources, textures, and other settings.
Ray-traced visualization of the large-scale data produced by scientific computing is therefore stressful.
Being able to use NVIDIA OptiX through ParaView, with ray tracing further accelerated by RT Cores, is very welcome.
From an article on the #NVIDIA technical blog:
https://medium.com/nvidiajapan/62b7c70e732a
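A minimal pvpython sketch of switching a ParaView render view to the OptiX back end. This assumes a ParaView 5.9-era build with the OptiX path tracer compiled in; the property names and the file path below are assumptions and may differ in your version:

```python
# Sketch: enable OptiX ray tracing for a ParaView render view (pvpython).
from paraview.simple import *

reader = OpenDataFile('example.vtu')        # placeholder dataset path
display = Show(reader)

view = GetActiveViewOrCreate('RenderView')
view.EnableRayTracing = 1                   # switch from rasterization to ray tracing
view.BackEnd = 'OptiX pathtracer'           # RT Core accelerated back end (if built in)
view.SamplesPerPixel = 16                   # more samples: less noise, longer render

Render()
SaveScreenshot('raytraced.png', view)
```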
19
SCIENTIFIC VISUALIZATION WITH PARAVIEW AND NVIDIA OPTIX
From an article on the #NVIDIA technical blog:
https://medium.com/nvidiajapan/62b7c70e732a
20
GPU COMPUTING IN THE FUTURE
21
GIANT MODELS PUSHING LIMITS OF EXISTING ARCHITECTURE
Requires a New Architecture
System bandwidth bottleneck in the current x86 architecture (x86 CPU with DDR4 feeding four GPUs with HBM2e):
- GPU memory: 8,000 GB/sec
- CPU memory: 200 GB/sec
- PCIe Gen4 (effective, per GPU): 16 GB/sec
- Memory to GPU: 64 GB/sec
[Chart: model size in trillions of parameters (log scale, 0.00001 to 1000) versus year (2018 to 2023): ELMo (94M), BERT-Large (340M), GPT-2 (1.5B), Megatron-LM (8.3B), T5 (11B), Turing-NLG (17.2B), GPT-3 (175B); the trend points to 100-trillion-parameter models by 2023]
22
NVIDIA GRACE
Breakthrough CPU Designed for Giant-Scale AI and HPC Applications
FASTEST INTERCONNECTS: >900 GB/s cache-coherent NVLink CPU to GPU (14x); >600 GB/s CPU to CPU (2x)
NEXT GENERATION ARM NEOVERSE CORES: >300 SPECrate2017_int_base (est.)
HIGHEST MEMORY BANDWIDTH: >500 GB/s LPDDR5x with ECC; >2x higher bandwidth; 10x higher energy efficiency
Availability: 2023
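The "14x" figure is consistent with the 64 GB/s memory-to-GPU number quoted on the neighboring slides, assuming that is the baseline the multiplier refers to:

$$\frac{900\ \text{GB/s}}{64\ \text{GB/s}} \approx 14\times$$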
23
TURBOCHARGED TERABYTE SCALE ACCELERATED COMPUTING
Evolving Architecture For New Workloads

CURRENT x86 ARCHITECTURE (x86 CPU with DDR4 feeding four GPUs with HBM2e):
- GPU memory: 8,000 GB/sec
- CPU memory: 200 GB/sec
- PCIe Gen4 (effective, per GPU): 16 GB/sec
- Memory to GPU: 64 GB/sec
- Transfer 2 TB in 30 secs

INTEGRATED CPU-GPU ARCHITECTURE (four Grace CPUs with LPDDR5x, each paired with a GPU with HBM2e):
- GPU memory: 8,000 GB/sec
- CPU memory: 500 GB/sec
- NVLink: 500 GB/sec
- Memory to GPU: 2,000 GB/sec
- Transfer 2 TB in 1 sec

3 DAYS FROM 1 MONTH: fine-tune training of a 1T-parameter model
REAL-TIME INFERENCE ON A 0.5T MODEL: interactive single-node NLP inference

Bandwidth claims rounded to the nearest hundred for illustration.
Performance results are based on projections for these configurations. Grace: 8x Grace and 8x A100 with 4th-gen NVIDIA NVLink connections between CPU and GPU; x86: DGX A100.
Training: "1 month of training" is fine-tuning a 1T-parameter model on a large custom data set on 64x Grace + 64x A100 compared to 8x DGX A100 (16x x86 + 64x A100).
Inference: 530B-parameter model on 8x Grace + 8x A100 compared to DGX A100.
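The 2 TB transfer claims follow directly from the memory-to-GPU bandwidths quoted above:

$$t_{\text{x86}} = \frac{2000\ \text{GB}}{64\ \text{GB/s}} \approx 31\ \text{s}, \qquad t_{\text{Grace}} = \frac{2000\ \text{GB}}{2000\ \text{GB/s}} = 1\ \text{s}$$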
24
NVIDIA Autumn HPC Weeks
Week 1: Monday, October 11, 2021: GPU Computing & Network Deep Dive
Week 2: Monday, October 18, 2021: HPC + Machine Learning
Week 3: Monday, October 25, 2021: GPU Applications
https://events.nvidia.com/hpcweek/
Stephen W. Keckler, NVIDIA
Torsten Hoefler, ETH Zürich
Takayuki Aoki, Tokyo Institute of Technology
Tobias Weinzierl, Durham University
James Legg, University College London
Mark Turner, Durham University
Daisuke Okanohara, Preferred Networks
Rio Yokota, Tokyo Institute of Technology
Kazuki Yoshizoe, Kyushu University
Yutaka Akiyama, Tokyo Institute of Technology
Tsuyoshi Ichimura, The University of Tokyo
Tomohiro Takaki, Kyoto Institute of Technology
25
SUMMARY
Current NVIDIA data center GPUs
A100 & A30 for FP64, A40 & A10 for FP32
GPU-accelerated application performance
Simulia CST Studio, Altair CFD, and Rocky DEM all show excellent performance on GPUs
ParaView with GPU acceleration
Ray tracing accelerated by RT Cores
In the future
The Grace CPU will improve memory bandwidth between CPU and GPU