Are GPUs Useful for Computational Mechanics Simulations?
Shinnosuke Furuya, Ph.D., HPC Developer Relations, NVIDIA
2021/09/21
Founded in 1993. Jensen Huang, Founder & CEO. 19,000 employees. $16.7B revenue in FY21. 50+ offices, including Santa Clara and Tokyo.
NVIDIA GPUS
NVIDIA GPUS AT A GLANCE
Architecture (year): Fermi (2010), Kepler (2012), Maxwell (2014), Pascal (2016), Volta (2017), Turing (2018), Ampere (2020)
Data Center GPU: M2090, K80, K1, M40, M10, P100, P4, V100, T4, A100, A30, A40, A10, A16
RTX / Quadro: 6000, K6000, M6000, P5000, GP100, GV100, RTX 8000, RTX A6000
GeForce: GTX 580, GTX 780, GTX 980, GTX 1080, TITAN Xp, TITAN V, RTX 2080 Ti, RTX 3090
AMPERE GPU ARCHITECTURE
A100 Tensor Core GPU
7 GPCs, 7 or 8 TPCs per GPC, 2 SMs per TPC (108 SMs per GPU)
5 HBM2 stacks, 12 NVLink links
GPC: GPU Processing Cluster, TPC: Texture Processing Cluster, SM: Streaming Multiprocessor
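As a quick sanity check on these figures, the SM count and memory size of whatever GPU is installed can be queried from Python. Below is a minimal sketch, assuming CuPy and a visible CUDA GPU; the dictionary keys are taken to mirror the CUDA runtime's cudaDeviceProp fields and may differ slightly between CuPy versions.

```python
# Minimal sketch: query basic device properties with CuPy (assumes CuPy + a CUDA GPU).
import cupy as cp

props = cp.cuda.runtime.getDeviceProperties(0)  # dict mirroring cudaDeviceProp
print("Name:               ", props["name"])
print("SM count:           ", props["multiProcessorCount"])   # 108 on A100
print("Global memory (GB): ", props["totalGlobalMem"] / 1e9)
print("Memory bus (bits):  ", props["memoryBusWidth"])
```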
AMPERE GPU ARCHITECTURE
Streaming Multiprocessor (SM)
GA100 (A100, A30): 32 FP64 CUDA Cores, 64 FP32 CUDA Cores, 4 Tensor Cores per SM
GA102 (A40, A10): up to 128 FP32 CUDA Cores, 1 RT Core, 4 Tensor Cores per SM
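These per-SM core counts are where the peak numbers on the next slide come from: peak FLOPS = SMs x cores per SM x 2 (one fused multiply-add per cycle) x clock. A back-of-the-envelope check in Python, assuming the A100's published boost clock of roughly 1.41 GHz:

```python
# Rough peak-FLOPS check for A100 (GA100): 108 SMs, ~1.41 GHz boost clock (assumed).
sms, boost_clock_hz = 108, 1.41e9
fp64_cores_per_sm, fp32_cores_per_sm = 32, 64

fp64_peak = sms * fp64_cores_per_sm * 2 * boost_clock_hz  # x2 for fused multiply-add
fp32_peak = sms * fp32_cores_per_sm * 2 * boost_clock_hz

print(f"FP64 peak: {fp64_peak / 1e12:.1f} TFLOPS")  # ~9.7 TFLOPS, matching the comparison table
print(f"FP32 peak: {fp32_peak / 1e12:.1f} TFLOPS")  # ~19.5 TFLOPS
```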
DATA CENTER PRODUCT COMPARISON (SEPT 2021)
* Performance with structured sparsity
Performance (dense | sparse*):
FP64 (no Tensor Core): A100 9.7 TFLOPS; A30 5.2 TFLOPS; A40, A10, T4 -
FP64 Tensor Core: A100 19.5 TFLOPS; A30 10.3 TFLOPS; A40, A10, T4 N/A
FP32 (no Tensor Core): A100 19.5 TFLOPS; A30 10.3 TFLOPS; A40 37.4 TFLOPS; A10 31.2 TFLOPS; T4 8.1 TFLOPS
TF32 Tensor Core: A100 156 | 312* TFLOPS; A30 82 | 165* TFLOPS; A40 74.8 | 149.6* TFLOPS; A10 62.5 | 125* TFLOPS; T4 N/A
FP16 Tensor Core: A100 312 | 624* TFLOPS; A30 165 | 330* TFLOPS; A40 149.7 | 299.4* TFLOPS; A10 125 | 250* TFLOPS; T4 65 TFLOPS
BFloat16 Tensor Core: A100 312 | 624* TFLOPS; A30 165 | 330* TFLOPS; A40 149.7 | 299.4* TFLOPS; A10 125 | 250* TFLOPS; T4 N/A
INT8 Tensor Core: A100 624 | 1248* TOPS; A30 330 | 661* TOPS; A40 299.3 | 598.6* TOPS; A10 250 | 500* TOPS; T4 130 TOPS
INT4 Tensor Core: A100 1248 | 2496* TOPS; A30 661 | 1321* TOPS; A40 598.7 | 1197.4* TOPS; A10 500 | 1000* TOPS; T4 260 TOPS
Form Factor: A100 SXM4 module on baseboard or x16 PCIe Gen4 (2-slot FHFL, 3 NVLink bridges); A30 x16 PCIe Gen4 (2-slot FHFL, 1 NVLink bridge); A40 x16 PCIe Gen4 (2-slot FHFL, 1 NVLink bridge); A10 x16 PCIe Gen4 (1-slot FHFL); T4 x16 PCIe Gen3 (1-slot LP)
GPU Memory: A100 40 | 80 GB HBM2e; A30 24 GB HBM2; A40 48 GB GDDR6; A10 24 GB GDDR6; T4 16 GB GDDR6
GPU Memory Bandwidth: A100 1555 GB/s (40 GB), 1935 GB/s (80 GB PCIe), 2039 GB/s (80 GB SXM4); A30 933 GB/s; A40 696 GB/s; A10 600 GB/s; T4 300 GB/s
Multi-Instance GPU: A100 up to 7; A30 up to 4; A40, A10, T4 N/A
Media Acceleration: A100 1 JPEG decoder, 5 video decoders; A30 1 JPEG decoder, 4 video decoders; A40 1 video encoder, 2 video decoders (+AV1 decode); A10 1 video encoder, 2 video decoders
Ray Tracing: A100 No; A30 No; A40 Yes; A10 Yes; T4 Yes
Graphics: A100, A30 for in-situ visualization only (no vPC/vQuadro); A40 Best; A10 Better; T4 Good
Max Power: A100 400 W (SXM4), 250 | 300 W (PCIe); A30 165 W; A40 300 W; A10 150 W; T4 70 W
TF32 TENSOR CORE TO SPEED UP FP32
The range of FP32 with the precision of FP16: input and accumulation stay in FP32.
FP32 matrices are formatted to TF32, multiplied on the Tensor Cores, and accumulated in FP32 (FP32 matrix x FP32 matrix -> FP32 matrix).
Floating-point formats (sign + range/exponent bits + precision/mantissa bits):
FP32: 1 + 8 + 23
TF32: 1 + 8 + 10 (the range of FP32, the precision of FP16)
FP16: 1 + 5 + 10
BFloat16: 1 + 8 + 7
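To see what the 10-bit mantissa means in practice, the sketch below rounds an FP32 value to TF32 precision by zeroing its low 13 mantissa bits. This is only an illustration (the hardware rounds to nearest rather than truncating), and it runs on plain NumPy with no GPU required:

```python
# Simplified model of TF32 precision: keep 10 of FP32's 23 mantissa bits.
# (Real Tensor Cores round to nearest; truncation is used here only to show the magnitude.)
import numpy as np

def to_tf32(x: np.float32) -> np.float32:
    bits = np.float32(x).view(np.uint32)
    bits = bits & np.uint32(0xFFFFE000)     # clear the low 13 mantissa bits
    return bits.view(np.float32)

x = np.float32(1.2345678)
print(f"FP32 value : {x:.9f}")
print(f"TF32 value : {to_tf32(x):.9f}")     # agrees to roughly 3 significant digits
print(f"rel. error : {abs(x - to_tf32(x)) / x:.2e}")  # ~1e-3, i.e. FP16-like precision
```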
GPU ACCELERATED APPLICATION PERFORMANCE
GPU ACCELERATED APPS
GPU Applications: https://www.nvidia.com/en-us/gpu-accelerated-applications/
Searchable catalog of GPU-accelerated applications and their supported features.
GPU ACCELERATED APPS
Application: GPU scaling (supported features)
LS-DYNA: -
ABAQUS/STANDARD: Multi-GPU / Multi-Node (direct sparse solver, AMS solver, steady-state dynamics)
STAR-CCM+: Single GPU / Single Node (rendering)
Fluent: Multi-GPU / Multi-Node (linear equation solver, radiation heat transfer model, Discrete Ordinates radiation model)
Nastran: -
Particleworks: Multi-GPU / Multi-Node (explicit and implicit methods)
GPU ACCELERATED APPS
GPU Applications Catalog: https://images.nvidia.com/aem-dam/Solutions/Data-Center/tesla-product-literature/gpu-applications-catalog.pdf
GPU ACCELERATED APPS
HPC Application Performance: https://developer.nvidia.com/hpc-application-performance/
DS SIMULIA CST STUDIO
Time Domain Solver: FIT throughput (Mcells/sec) vs. simulation size (100^3, 150^3, 200^3, 300^3 cells) for 1 and 2 A6000 and A100 GPUs; higher is better. Chart callouts show speedups of 2.5x, 3.2x, 3.5x, and 3.8x.
A100 system: dual Xeon Gold 6148 (2x 20 cores), 384 GB DDR4-2666 RAM, 2x A100-PCIE-40GB, RHEL 7.7
A6000 system: dual AMD EPYC 7F72 (24 cores), 1 TB RAM, A6000 GPUs, NVIDIA driver 455.45.01, Windows 10
ALTAIR CFD (SPH SOLVER)
Ampere PCIe scaling performance: Aerospace Gearbox, ~21M fluid particles (~26.7M total), 1000 timesteps; higher is better.
Relative performance of the Altair CFD SPH solver (Altair nanoFluidX) on an NVIDIA EGX server, for 1 / 2 / 4 / 8 GPUs:
A100 80GB PCIe: 1.0 / 1.8 / 3.5 / 6.2
A100 40GB PCIe: 1.0 / 1.8 / 3.5 / 6.3
A30 PCIe: 0.5 / 1.0 / 1.9 / 3.6
A40 PCIe: 1.0 / 1.7 / 3.4 / 6.1
Tests run on a server with 2x AMD EPYC 7742 @ 2.25 GHz (3.4 GHz turbo, Rome, 64 cores), driver 465.19.01, 512 GB RAM, 8x NVIDIA GPUs, Ubuntu 20.04, ECC off, HT off.
Relative performance calculated from the average model performance (nanoseconds/particle/timestep) on Altair nanoFluidX 2021.0.
ROCKY DEM 4.4
Rotating drum benchmark with polyhedral and spherical particles.
Charts ("Polyhedron Cells Performance" on V100 and on A100): speed-up relative to 8 CPU cores (Intel Xeon Gold 6230 @ 2.10 GHz) vs. number of particles per GPU (31,250 to 2,000,000) for 1x, 2x, and 4x GPUs; higher is better.
38x speedup for 1x V100 compared with the 8-core Intel Xeon Gold 6230 @ 2.10 GHz
47x speedup for 1x A100 compared with the 8-core Intel Xeon Gold 6230 @ 2.10 GHz
Case description: particles in a drum rotating at 1 rev/sec, simulated for 20,000 solver iterations
Hardware on Oracle Cloud: CPU Intel Xeon Gold 6230 @ 2.1 GHz (8 cores); GPU NVIDIA A100 and V100
PARAVIEW WITH GPU ACCELERATION
SCIENTIFIC VISUALIZATION WITH PARAVIEW AND NVIDIA OPTIX
Ray tracing is one of the NVIDIA RTX technologies, and optimized ray-tracing APIs such as NVIDIA OptiX are available.
NVIDIA OptiX is integrated into ParaView.
On GPUs equipped with RT Cores, ray tracing is accelerated even further.
Rendering takes time: once you start caring about image quality, you re-render repeatedly while adjusting light sources, textures, and other settings.
Ray-traced visualization of the large data sets typical of scientific computing is therefore stressful.
Being able to use NVIDIA OptiX through ParaView, with RT Cores accelerating the ray tracing on top of that, is very welcome.
From an article on the #NVIDIA Tech Blog: https://medium.com/nvidiajapan/62b7c70e732a
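As an illustration of how this is switched on from ParaView's Python scripting interface (pvpython), here is a minimal sketch. It assumes a ParaView build (5.9 or later) with ray tracing enabled and an NVIDIA driver that exposes OptiX; the back-end name string and property names can vary between ParaView versions.

```python
# Minimal sketch (pvpython): render a sample dataset with the OptiX path tracer.
# Assumes a ray-tracing-enabled ParaView build and an NVIDIA GPU with recent drivers.
from paraview.simple import Sphere, Show, Render, GetActiveViewOrCreate, SaveScreenshot

source = Sphere(ThetaResolution=64, PhiResolution=64)   # stand-in for real simulation data
display = Show(source)

view = GetActiveViewOrCreate("RenderView")
view.EnableRayTracing = 1                 # turn on the ray-traced rendering path
view.BackEnd = "OptiX pathtracer"         # use OptiX (RT Core accelerated where available)
view.SamplesPerPixel = 16                 # more samples: less noise, longer render
view.Shadows = 1

Render()
SaveScreenshot("sphere_optix.png", view)
```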
SCIENTIFIC VISUALIZATION WITH PARAVIEW AND NVIDIA OPTIX
From an article on the #NVIDIA Tech Blog: https://medium.com/nvidiajapan/62b7c70e732a
GPU COMPUTING IN THE FUTURE
GIANT MODELS PUSHING LIMITS OF EXISTING ARCHITECTURE
Requires a New Architecture
System bandwidth bottleneck in the current x86 node (DDR4 on the CPU, HBM2e on 4 GPUs):
GPU memory: 8,000 GB/sec
CPU memory: 200 GB/sec
PCIe Gen4 (effective per GPU): 16 GB/sec
CPU memory to GPU: 64 GB/sec
Model-size growth (trillions of parameters, 2018-2023): ELMo (94M), BERT-Large (340M), GPT-2 (1.5B), Megatron-LM (8.3B), T5 (11B), Turing-NLG (17.2B), GPT-3 (175B), trending toward 100-trillion-parameter models by 2023.
NVIDIA GRACE
Breakthrough CPU Designed for Giant-Scale AI and HPC Applications
Fastest interconnects: >900 GB/s cache-coherent NVLink CPU to GPU (14x), >600 GB/s CPU to CPU (2x)
Next-generation Arm Neoverse cores: >300 SPECrate2017_int_base (estimated)
Highest memory bandwidth: >500 GB/s LPDDR5x with ECC, >2x higher bandwidth, 10x higher energy efficiency
Availability: 2023
TURBOCHARGED TERABYTE-SCALE ACCELERATED COMPUTING
Evolving Architecture for New Workloads
Current x86 architecture (one x86 CPU with DDR4 feeding 4 GPUs with HBM2e): GPU memory 8,000 GB/sec; CPU memory 200 GB/sec; PCIe Gen4 (effective per GPU) 16 GB/sec; memory to GPU 64 GB/sec. Transfer 2 TB in 30 seconds.
Integrated CPU-GPU architecture (Grace CPUs with LPDDR5x paired with GPUs with HBM2e): GPU memory 8,000 GB/sec; CPU memory 500 GB/sec; NVLink 500 GB/sec; memory to GPU 2,000 GB/sec. Transfer 2 TB in 1 second.
Fine-tune training of a 1T-parameter model in 3 days instead of 1 month.
Real-time inference on a 0.5T-parameter model: interactive single-node NLP inference.
Bandwidth claims rounded to the nearest hundred for illustration.
Performance results are based on projections for these configurations. Grace: 8x Grace and 8x A100 with 4th-generation NVIDIA NVLink between CPU and GPU; x86: DGX A100.
Training: "1 month" is fine-tuning a 1T-parameter model on a large custom data set on 64x Grace + 64x A100, compared with 8x DGX A100 (16x x86 + 64x A100).
Inference: 530B-parameter model on 8x Grace + 8x A100, compared with DGX A100.
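The "2 TB in 30 seconds vs. 1 second" figures follow directly from the memory-to-GPU bandwidths quoted above; a quick check of that arithmetic (using the slide's rounded numbers, with the interpretation that 2 TB of CPU memory is streamed to the GPU at the quoted rate):

```python
# Where the "2 TB in 30 s vs. 1 s" figures come from: 2 TB moved at the
# memory-to-GPU bandwidths quoted on this slide (rounded figures).
data_tb = 2.0
x86_mem_to_gpu_gbps   = 64      # GB/s, x86 host memory -> GPU over PCIe Gen4
grace_mem_to_gpu_gbps = 2000    # GB/s, Grace LPDDR5x -> GPU over NVLink

for name, bw in [("x86 + PCIe Gen4", x86_mem_to_gpu_gbps),
                 ("Grace + NVLink", grace_mem_to_gpu_gbps)]:
    seconds = data_tb * 1000 / bw
    print(f"{name:16s}: {seconds:5.1f} s to move {data_tb:.0f} TB")
# x86 + PCIe Gen4 :  31.2 s  (~30 s on the slide)
# Grace + NVLink  :   1.0 s
```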
NVIDIA Autumn HPC Weeks
Week 1: Monday, October 11, 2021: GPU Computing & Network Deep Dive
Week 2: Monday, October 18, 2021: HPC + Machine Learning
Week 3: Monday, October 25, 2021: GPU Applications
https://events.nvidia.com/hpcweek/
Speakers: Stephen W. Keckler (NVIDIA), Torsten Hoefler (ETH Zürich), Takayuki Aoki (Tokyo Institute of Technology), Tobias Weinzierl (Durham University), James Legg (University College London), Mark Turner (Durham University), Daisuke Okanohara (Preferred Networks), Rio Yokota (Tokyo Institute of Technology), Kazuki Yoshizoe (Kyushu University), Yutaka Akiyama (Tokyo Institute of Technology), Tsuyoshi Ichimura (The University of Tokyo), Tomohiro Takaki (Kyoto Institute of Technology)
SUMMARY
Current NVIDIA data center GPUs: A100 and A30 for FP64, A40 and A10 for FP32
GPU-accelerated application performance: Simulia CST Studio, Altair CFD, and Rocky DEM show excellent performance on GPUs
ParaView with GPU acceleration: ray tracing accelerated by RT Cores
In the future: the Grace CPU improves memory bandwidth between CPU and GPU