Akira Naruse, 9th Nov. 2017
VOLTA (TESLA V100)
2
VOLTA
3
GPU ROADMAPS
[Chart: SGEMM/W (axis 0 to 72) by year (2008 to 2018) for the Tesla, Fermi, Kepler, Maxwell, Pascal, and Volta generations]
4
VOLTA: TESLA V100
The fastest GPU for both HPC and deep learning
Volta Architecture: Most Productive GPU
Tensor Core: 120 Programmable TFLOPS Deep Learning
Improved SIMT Model: New Algorithms
Volta MPS: Inference Utilization
Improved NVLink & HBM2: Efficient Bandwidth
5
21B transistors
815 mm²
80 SM
5120 CUDA Cores
640 Tensor Cores
16 GB HBM2
900 GB/s HBM2
300 GB/s NVLink
VOLTA: TESLA V100
*full GV100 chip contains 84 SMs
6
GPU performance comparison
                          P100        V100        Ratio
Compute
  DL ops (FP16 or Mixed)  21 TOPS     120 TOPS    6x
  FP32                    10 TFLOPS   15 TFLOPS   1.5x
  FP64                    5 TFLOPS    7.5 TFLOPS  1.5x
Capacity
  L1 Caches               1.3 MB      10 MB       7.7x
  L2 Cache                4 MB        6 MB        1.5x
Bandwidth
  Memory                  720 GB/s    900 GB/s    1.2x
  NVLink                  160 GB/s    300 GB/s    1.9x
7
NEW HBM2 MEMORY ARCHITECTURE
[Chart: STREAM Triad, delivered GB/s. P100: 76% DRAM utilization. V100: 95% DRAM utilization. Inset: HBM2 stack.]
Effective bandwidth is 1.5x
V100 measured on pre-production hardware.
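The utilization figures above combine with the peak bandwidths from the comparison table as 0.76 x 720 GB/s ≈ 547 GB/s for P100 and 0.95 x 900 GB/s = 855 GB/s for V100, a ratio of about 1.56, consistent with the stated 1.5x. The following is a minimal sketch of how a STREAM-style Triad measurement might look in CUDA. It is not the benchmark used for the slide; the helper name `triad_bandwidth_gbs`, the launch configuration, and the fill values are assumptions made for illustration.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// STREAM-style Triad: a[i] = b[i] + scalar * c[i] (grid-stride loop).
__global__ void triad(double *a, const double *b, const double *c,
                      double scalar, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n; i += stride)
        a[i] = b[i] + scalar * c[i];
}

__global__ void fill(double *p, double value, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n; i += stride)
        p[i] = value;
}

// Hypothetical helper: returns delivered GB/s, or -1.0 on any failure.
double triad_bandwidth_gbs(size_t n, int repeats) {
    const size_t bytes = n * sizeof(double);
    double *a = nullptr, *b = nullptr, *c = nullptr;
    if (cudaMalloc(&a, bytes) != cudaSuccess ||
        cudaMalloc(&b, bytes) != cudaSuccess ||
        cudaMalloc(&c, bytes) != cudaSuccess) {
        cudaFree(a); cudaFree(b); cudaFree(c);
        return -1.0;
    }
    const int threads = 256, blocks = 1024;  // assumed launch configuration
    fill<<<blocks, threads>>>(b, 1.0, n);
    fill<<<blocks, threads>>>(c, 2.0, n);
    triad<<<blocks, threads>>>(a, b, c, 3.0, n);  // warm-up run

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int r = 0; r < repeats; ++r)
        triad<<<blocks, threads>>>(a, b, c, 3.0, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Spot-check the last element: 1.0 + 3.0 * 2.0 == 7.0.
    double last = 0.0;
    cudaMemcpy(&last, a + (n - 1), sizeof(double), cudaMemcpyDeviceToHost);
    bool ok = (cudaGetLastError() == cudaSuccess) && (last == 7.0) && (ms > 0.0f);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(a); cudaFree(b); cudaFree(c);
    if (!ok) return -1.0;
    // Triad moves three arrays per pass: two reads and one write.
    return 3.0 * bytes * repeats / (ms * 1.0e-3) / 1.0e9;
}
```

Dividing the returned value by the peak memory bandwidth of the device gives a utilization figure comparable in spirit to the percentages in the chart.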
8
ROAD TO EXASCALE
Volta to Fuel Most Powerful US Supercomputers
Summit Supercomputer: 200+ PetaFlops, ~3,400 Nodes, 10 Megawatts
[Chart: Volta HPC Application Performance, relative to Tesla P100; labeled 1.5x]
System Config Info: 2X Xeon E5-2690 v4, 2.6GHz, w/ 1X Tesla P100 or V100. V100 measured on pre-production hardware.
10
VOLTA GV100 SM
                           GV100
FP32 units                 64
FP64 units                 32
INT units                  64
Tensor Cores               8
Register File              256 KB
Unified L1/Shared memory   128 KB
Active Threads             2048
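The per-SM figures above can be cross-checked against the chip-level numbers given elsewhere in the deck (80 SMs, 5120 CUDA cores, 640 Tensor Cores, 10 MB of L1). The sketch below does that arithmetic at compile time; the constant and function names are made up for this illustration, and it assumes "10 MB" means 10 x 1024 KB.

```cuda
// Per-SM figures from the GV100 SM table (names are illustrative only).
constexpr int kSMs              = 80;    // enabled SMs on Tesla V100
constexpr int kFp32PerSM        = 64;
constexpr int kFp64PerSM        = 32;
constexpr int kTensorCoresPerSM = 8;
constexpr int kRegFileKBPerSM   = 256;
constexpr int kL1SharedKBPerSM  = 128;
constexpr int kThreadsPerSM     = 2048;

constexpr int total_cuda_cores()     { return kSMs * kFp32PerSM; }
constexpr int total_fp64_units()     { return kSMs * kFp64PerSM; }
constexpr int total_tensor_cores()   { return kSMs * kTensorCoresPerSM; }
constexpr int total_l1_shared_kb()   { return kSMs * kL1SharedKBPerSM; }
constexpr int total_reg_file_kb()    { return kSMs * kRegFileKBPerSM; }
constexpr int total_active_threads() { return kSMs * kThreadsPerSM; }

// These match the chip-level numbers quoted on the other slides.
static_assert(total_cuda_cores() == 5120, "80 SMs x 64 FP32 units");
static_assert(total_tensor_cores() == 640, "80 SMs x 8 Tensor Cores");
static_assert(total_l1_shared_kb() == 10 * 1024, "80 SMs x 128 KB = 10 MB");
```

The same arithmetic gives 2560 FP64 units, 20 MB of register file, and 163,840 resident threads across the chip; those totals are derived here rather than quoted from the slides.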
11
VOLTA GV100 SM
Completely new ISA
Twice the schedulers
Simplified Issue Logic
Large, fast L1 cache
Improved SIMT model
Tensor acceleration
= The easiest SM in GPU history to get performance out of
An easy-to-use architecture
12
MPS: MULTI-PROCESS SERVICE
Share the GPU safely and efficiently across multiple processes
[Diagram, Pascal GP100: CPU processes A, B, C submit work through the CUDA Multi-Process Service; GPU execution of A, B, C with limited isolation]
[Diagram, Volta GV100 (Volta Multi-Process Service): CPU processes A, B, C, with CUDA Multi-Process Service Control; GPU execution of A, B, C with hardware isolation]
13
VOLTA: INDEPENDENT THREAD SCHEDULING
Communicating Algorithms
Pascal: Lock-Free Algorithms. Threads cannot wait for messages.
Volta: Starvation-Free Algorithms. Threads may wait for messages.
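To make "threads may wait" concrete, here is a small sketch of a spin lock contended by every thread of a launch, including threads of the same warp. It assumes compilation for Volta or later (for example `-arch=sm_70`); on earlier architectures the same code can hang, because a lane that holds the lock may never be scheduled while its warp-mates spin. The names `lock`, `unlock`, `locked_increment`, and `run_locked_increment` are illustrative, not from the deck.

```cuda
#include <cuda_runtime.h>

// Spin until this thread owns the lock (0 = free, 1 = held).
__device__ void lock(int *mutex) {
    while (atomicCAS(mutex, 0, 1) != 0) { /* wait for the holder to release */ }
    __threadfence();  // order the critical section after the acquire
}

__device__ void unlock(int *mutex) {
    __threadfence();  // make the critical section visible before the release
    atomicExch(mutex, 0);
}

// Every thread takes the same lock, then does a plain (non-atomic) increment.
// The counter is volatile so each access goes to memory rather than a register.
__global__ void locked_increment(int *mutex, volatile int *counter) {
    lock(mutex);
    *counter = *counter + 1;
    unlock(mutex);
}

// Hypothetical host driver: returns the final counter value, or -1 on error.
int run_locked_increment(int blocks, int threads) {
    int *mutex = nullptr, *counter = nullptr;
    if (cudaMalloc(&mutex, sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&counter, sizeof(int)) != cudaSuccess) {
        cudaFree(mutex);
        return -1;
    }
    cudaMemset(mutex, 0, sizeof(int));
    cudaMemset(counter, 0, sizeof(int));
    locked_increment<<<blocks, threads>>>(mutex, counter);
    int result = -1;
    if (cudaDeviceSynchronize() == cudaSuccess)
        cudaMemcpy(&result, counter, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(mutex);
    cudaFree(counter);
    return result;
}
```

With 2 blocks of 64 threads, the counter should end at 128 if every thread eventually acquires the lock, which is the forward-progress property the slide attributes to Volta.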
14
PASCAL SIMT EXECUTION MODEL
Diverged threads within a warp cannot exchange data
[Diagram, over time: diverge, then A; B; and X; Y; as separate branches, then reconverge]
if (threadIdx.x < 4) {
    A;
    __syncwarp();
    B;
} else {
    X;
    __syncwarp();
    Y;
}
15
VOLTA SIMT EXECUTION MODEL
Even diverged threads within a warp can exchange data
[Diagram, over time: diverge, then A; B; and X; Y; as separate branches, then synchronize]
if (threadIdx.x < 4) {
    A;
    __syncwarp();
    B;
} else {
    X;
    __syncwarp();
    Y;
}
__syncwarp();
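The pattern on this slide can be filled in with concrete statements. In the sketch below, each half of a warp writes to shared memory in its own branch (the roles of A and X), calls `__syncwarp()`, and then reads a value written by a lane in the other branch (the roles of B and Y). This assumes a Volta-or-later target (`-arch=sm_70`), where `__syncwarp()` calls in divergent branches can match each other; the kernel and helper names and the values written are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void exchange_across_branches(int *out) {
    __shared__ int buf[32];
    const int lane = threadIdx.x;    // assumes a launch of exactly one warp
    const int partner = lane ^ 16;   // a lane that takes the other branch
    if (lane < 16) {
        buf[lane] = 100 + lane;      // "A": publish a value
        __syncwarp();                // wait for the other branch as well
        out[lane] = buf[partner];    // "B": read what the other branch wrote
    } else {
        buf[lane] = 200 + lane;      // "X"
        __syncwarp();
        out[lane] = buf[partner];    // "Y"
    }
    __syncwarp();
}

// Hypothetical host driver: returns the 32 values read by the lanes
// (all -1 if the device allocation fails).
std::vector<int> run_exchange() {
    std::vector<int> host(32, -1);
    int *dev = nullptr;
    if (cudaMalloc(&dev, 32 * sizeof(int)) != cudaSuccess) return host;
    exchange_across_branches<<<1, 32>>>(dev);
    cudaMemcpy(host.data(), dev, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

Lane 0 should read 216 (written by lane 16 in the else branch) and lane 16 should read 100 (written by lane 0 in the if branch).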
16
VOLTA TENSOR CORE
17
TENSOR CORE
Mixed Precision, 128 ops/cycle
D = AB + C (4x4 matrices)
D: FP16 or FP32
A: FP16
B: FP16
C: FP16 or FP32
A =
A0,0 A0,1 A0,2 A0,3
A1,0 A1,1 A1,2 A1,3
A2,0 A2,1 A2,2 A2,3
A3,0 A3,1 A3,2 A3,3
B =
B0,0 B0,1 B0,2 B0,3
B1,0 B1,1 B1,2 B1,3
B2,0 B2,1 B2,2 B2,3
B3,0 B3,1 B3,2 B3,3
C =
C0,0 C0,1 C0,2 C0,3
C1,0 C1,1 C1,2 C1,3
C2,0 C2,1 C2,2 C2,3
C3,0 C3,1 C3,2 C3,3
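The 128 ops/cycle figure follows from the shape of the operation: a 4x4 by 4x4 product with accumulation takes 4 x 4 x 4 = 64 fused multiply-adds, and each one counts as two operations. The host-side sketch below counts them with a plain-float reference (it does not model FP16 rounding), then scales up to the 640 Tensor Cores of V100. The names are illustrative, and the implied clock is derived from the deck's 120 TFLOPS figure rather than stated anywhere in it.

```cuda
struct Mat4 { float v[4][4]; };

// Reference D = A*B + C in plain float; also counts the fused multiply-adds.
Mat4 mma_4x4(const Mat4 &a, const Mat4 &b, const Mat4 &c, int *fma_count) {
    Mat4 d = c;
    int fmas = 0;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k) {
                d.v[i][j] += a.v[i][k] * b.v[k][j];  // one FMA = 2 ops
                ++fmas;
            }
    *fma_count = fmas;
    return d;
}

constexpr long long kOpsPerTensorCorePerCycle = 2 * 4 * 4 * 4;  // 128
constexpr long long kTensorCores = 640;                         // Tesla V100

constexpr long long ops_per_cycle_v100() {
    return kTensorCores * kOpsPerTensorCorePerCycle;            // 81,920
}

// Clock (GHz) that a given peak TFLOPS figure would imply at this rate.
double implied_clock_ghz(double tflops) {
    return tflops * 1.0e12 / (double)ops_per_cycle_v100() / 1.0e9;
}
```

For the 120 TFLOPS quoted in the deck, `implied_clock_ghz(120.0)` comes out at roughly 1.46 GHz.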
18
TENSOR SYNCHRONIZATION
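Synchronization among the threads of a warp
Full Warp 16x16 Matrix Math
A 16x16 matrix multiply is executed as a combination of 4x4 matrix multiplies
The results are distributed across the threads
Warp (32 threads)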
19
VOLTA TENSOR OPERATION
FP16 storage/input -> Full precision product -> Sum with FP32 accumulator -> Convert to FP32 result
[Diagram: FP16 × FP16 products, together with more products, are added into an FP32 accumulator to give an FP32 result]
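Why the FP32 accumulator matters can be seen with a small experiment that uses ordinary CUDA cores, not Tensor Cores. The sketch below adds the product 1.0 x 1.0 repeatedly, once into an FP16 accumulator and once into an FP32 accumulator. FP16 has an 11-bit significand, so once the running sum reaches 2048 adding 1 no longer changes it. The struct and function names are assumptions for illustration, and the code assumes a target with FP16 arithmetic (for example `-arch=sm_70`).

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>

struct AccResult { float fp16_acc; float fp32_acc; };

// Sum n products of FP16 inputs, once in FP16 and once in FP32.
__global__ void accumulate_products(AccResult *out, int n) {
    const half a = __float2half(1.0f);
    const half b = __float2half(1.0f);
    half acc16 = __float2half(0.0f);
    float acc32 = 0.0f;
    for (int i = 0; i < n; ++i) {
        acc16 = __hadd(acc16, __hmul(a, b));         // FP16 product, FP16 sum
        acc32 += __half2float(a) * __half2float(b);  // full-precision product, FP32 sum
    }
    out->fp16_acc = __half2float(acc16);
    out->fp32_acc = acc32;
}

// Hypothetical host driver; returns {-1, -1} if the device call fails.
AccResult run_accumulate(int n) {
    AccResult host = {-1.0f, -1.0f};
    AccResult *dev = nullptr;
    if (cudaMalloc(&dev, sizeof(AccResult)) != cudaSuccess) return host;
    accumulate_products<<<1, 1>>>(dev, n);
    if (cudaDeviceSynchronize() == cudaSuccess)
        cudaMemcpy(&host, dev, sizeof(AccResult), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

With n = 4096, the FP16 accumulator stalls at 2048 while the FP32 accumulator reaches 4096.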
20
USING TENSOR CORES
Volta Optimized Libraries: NVIDIA cuDNN, cuBLAS, TensorRT
CUDA C++: Warp-Level Matrix Operations
__device__ void tensor_op_16_16_16(
    float *d, half *a, half *b, float *c)
{
    // Per-warp fragments; the remaining template arguments are elided on the slide.
    wmma::fragment<wmma::matrix_a, …> Amat;
    wmma::fragment<wmma::matrix_b, …> Bmat;
    wmma::fragment<wmma::accumulator, …> Cmat;  // accumulator (C/D) fragment
    wmma::load_matrix_sync(Amat, a, 16);        // leading dimension 16
    wmma::load_matrix_sync(Bmat, b, 16);
    wmma::fill_fragment(Cmat, 0.0f);            // starts from zero; c is not read here
    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);     // Cmat = Amat * Bmat + Cmat
    wmma::store_matrix_sync(d, Cmat, 16,
                            wmma::mem_row_major);
}
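A complete version of the warp-level snippet might look like the sketch below. The fragment shapes (16x16x16), the row-major A and column-major B layouts, the test data, and the host helper `run_wmma_check` are choices made for this illustration; the slide leaves those details elided. It assumes compilation with `-arch=sm_70` or later and a toolkit in which `__float2half` is callable from host code.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>
#include <vector>

using namespace nvcuda;

// One warp computes D = A * B for 16x16 tiles: FP16 inputs, FP32 accumulation.
__global__ void wmma_16x16x16(float *d, const half *a, const half *b) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}

// Hypothetical check: true if the Tensor Core result equals a host reference.
bool run_wmma_check() {
    const int N = 16;
    std::vector<half> a(N * N), b(N * N);
    std::vector<float> ref(N * N, 0.0f), out(N * N, -1.0f);
    // Small integers are exact in FP16, so the comparison can be exact.
    for (int i = 0; i < N; ++i)
        for (int k = 0; k < N; ++k)
            a[i * N + k] = __float2half(float((i + k) % 3));      // A(i,k), row-major
    for (int k = 0; k < N; ++k)
        for (int j = 0; j < N; ++j)
            b[j * N + k] = __float2half(float((k + 2 * j) % 5));  // B(k,j), column-major
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k)
                ref[i * N + j] += float((i + k) % 3) * float((k + 2 * j) % 5);

    half *da = nullptr, *db = nullptr;
    float *dd = nullptr;
    bool ok = cudaMalloc(&da, N * N * sizeof(half)) == cudaSuccess &&
              cudaMalloc(&db, N * N * sizeof(half)) == cudaSuccess &&
              cudaMalloc(&dd, N * N * sizeof(float)) == cudaSuccess;
    if (ok) {
        cudaMemcpy(da, a.data(), N * N * sizeof(half), cudaMemcpyHostToDevice);
        cudaMemcpy(db, b.data(), N * N * sizeof(half), cudaMemcpyHostToDevice);
        wmma_16x16x16<<<1, 32>>>(dd, da, db);  // exactly one warp
        ok = cudaMemcpy(out.data(), dd, N * N * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int i = 0; ok && i < N * N; ++i) ok = (out[i] == ref[i]);
    }
    cudaFree(da); cudaFree(db); cudaFree(dd);
    return ok;
}
```

All 32 threads of the warp must reach each `wmma` call together, which is why the kernel is launched with a single full warp and has no divergent control flow around those calls.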
21
VOLTA: A GIANT LEAP FOR DEEP LEARNING
ResNet-50 Training, images per second: P100 (FP32) vs. V100 (Tensor Cores): 2.4x faster
ResNet-50 Inference (TensorRT - 7ms Latency), images per second: P100 (FP16) vs. V100 (Tensor Cores): 3.7x faster
V100 measured on pre-production hardware.
22
Is accuracy OK when training in FP16?
23
Yes, if you use Tensor Cores
http://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html
Training with Mixed-Precision User Guide
24
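Yes, if you use Tensor Cores
• Mixed Precision Training
• During the forward and backward passes, almost everything can run in fp16 without problems (if Tensor Cores are used)
• The update (weight update) is better done in fp32 (the update takes little time)
• Depending on the model, a technique called loss scaling is needed (small overhead)
• Available in the major DL frameworks
• TensorFlow, MxNet, PyTorch, Caffe2, Theano, MS Cognitive Toolkit, Chainer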
http://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html
Training with Mixed-Precision User Guide
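The two recommendations in the list above, an fp32 weight update and loss scaling, can be illustrated with a single weight. In the sketch below the gradient arrives in FP16 already multiplied by a loss scale, the kernel unscales it in FP32, updates an FP32 master weight, and writes back an FP16 copy. The kernel, the helper `update_with_scale`, and the chosen values (gradient 1e-8, scale 1024, learning rate 1) are assumptions for illustration and are not taken from any framework. Host-side `__float2half` is assumed to be available in the toolkit in use.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// One SGD step: unscale the FP16 gradient in FP32, update the FP32 master
// weight, then refresh the FP16 copy used by the forward/backward passes.
__global__ void sgd_step(float *master_w, half *w16, const half *scaled_grad16,
                         float lr, float loss_scale, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float g = __half2float(scaled_grad16[i]) / loss_scale;
        master_w[i] -= lr * g;
        w16[i] = __float2half(master_w[i]);
    }
}

// Hypothetical demo: start from weight 0, apply one step with lr = 1, and
// return the FP32 master weight. Returns 1.0f (an impossible result) on error.
float update_with_scale(float true_grad, float loss_scale) {
    // Stand-in for a backward pass that produces a scaled FP16 gradient.
    half g16 = __float2half(true_grad * loss_scale);
    float w = 0.0f;
    float *d_master = nullptr;
    half *d_w16 = nullptr, *d_g16 = nullptr;
    bool ok = cudaMalloc(&d_master, sizeof(float)) == cudaSuccess &&
              cudaMalloc(&d_w16, sizeof(half)) == cudaSuccess &&
              cudaMalloc(&d_g16, sizeof(half)) == cudaSuccess;
    if (ok) {
        cudaMemcpy(d_master, &w, sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_g16, &g16, sizeof(half), cudaMemcpyHostToDevice);
        sgd_step<<<1, 32>>>(d_master, d_w16, d_g16, 1.0f, loss_scale, 1);
        ok = cudaMemcpy(&w, d_master, sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_master); cudaFree(d_w16); cudaFree(d_g16);
    return ok ? w : 1.0f;
}
```

A gradient of 1e-8 is below the smallest positive FP16 value (about 6e-8), so without scaling it becomes zero and the weight does not move. Scaled by 1024 it survives the FP16 round trip, and the master weight changes by roughly -1e-8.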
25
How much faster is Volta?
P100 FP32, V100 FP32 vs. V100 Tensor Core
Resnet50
[Diagram, residual block: x -> Conv,1x1,64 -> BN -> ReLU -> Conv,3x3,64 -> BN -> ReLU -> Conv,1x1,256 -> BN -> + (added to x) -> ReLU]
(*) Using Chainer 3.0.0rc1+ and CuPy 2.0.0rc1+
28
How much faster is Volta?
P100 FP32, V100 FP32 vs. V100 Tensor Core
ImageNet, Resnet50, Batch:128, time per iteration [ms]
P100 FP32          570 ms
V100 FP32          360 ms
V100 Tensor Core   197 ms
About 3x faster (570 ms -> 197 ms)
[Chart: each bar is broken down into Conv, BN, Relu, Cupy_*, and Misc.; axis 0 to 600 ms]
(*) Using Chainer 3.0.0rc1+ and CuPy 2.0.0rc1+
29
Multi-GPU performance
ImageNet, Resnet50, Batch/GPU:128
Images per second
                   1 GPU    2 GPUs   4 GPUs   8 GPUs
P100 FP32          224      430      857      1,657
V100 FP32          355      675      1,331    2,530
V100 Tensor Core   649      1,199    2,359    4,064
(*) Using CUDA 9, cuDNN 7, NCCL 2, Chainer 3.0.0rc1+, CuPy 2.0.0rc1+; the machine is a DGX1(V)
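One way to read these numbers is as scaling efficiency: throughput on N GPUs divided by N times the single-GPU throughput. The host-side sketch below computes it from the values on the slide; the struct and function names are made up for this example.

```cuda
#include <cstdio>

// Images per second from the slide, for 1, 2, 4, and 8 GPUs.
struct Series { const char *name; double images_per_sec[4]; };

constexpr int kGpuCounts[4] = {1, 2, 4, 8};

const Series kSeries[3] = {
    {"P100 FP32",        {224.0, 430.0,  857.0,  1657.0}},
    {"V100 FP32",        {355.0, 675.0,  1331.0, 2530.0}},
    {"V100 Tensor Core", {649.0, 1199.0, 2359.0, 4064.0}},
};

// Efficiency = throughput on N GPUs / (N * throughput on 1 GPU).
double scaling_efficiency(const Series &s, int idx) {
    return s.images_per_sec[idx] / (s.images_per_sec[0] * kGpuCounts[idx]);
}

// Print one row per series with the efficiency at each GPU count.
void print_scaling_table() {
    for (const Series &s : kSeries) {
        std::printf("%-18s", s.name);
        for (int i = 0; i < 4; ++i)
            std::printf(" %d GPU: %5.1f%%", kGpuCounts[i],
                        100.0 * scaling_efficiency(s, i));
        std::printf("\n");
    }
}
```

At 8 GPUs this gives roughly 92% for P100 FP32, 89% for V100 FP32, and 78% for V100 Tensor Core, so efficiency is lower for the faster configurations.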
Introduction to Volta (Tesla V100)

De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 

Recently uploaded (20)

PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 

Introduction to Volta (Tesla V100)

  • 1. Akira Naruse, 9th Nov. 2017 VOLTA (TESLA V100)
  • 3. GPU ROADMAPS. [Chart: SGEMM/W by year, 2008 to 2018, y-axis 0 to 72; generations Tesla, Fermi, Kepler, Maxwell, Pascal, Volta]
  • 4. VOLTA: TESLA V100. The fastest GPU, suited to both HPC and Deep Learning. Volta Architecture: Most Productive GPU. Tensor Core: 120 Programmable TFLOPS Deep Learning. Improved SIMT Model: New Algorithms. Volta MPS: Inference Utilization. Improved NVLink & HBM2: Efficient Bandwidth.
  • 5. VOLTA: TESLA V100. 21B transistors; 815 mm²; 80 SMs; 5120 CUDA Cores; 640 Tensor Cores; 16 GB HBM2; 900 GB/s HBM2; 300 GB/s NVLink. (*full GV100 chip contains 84 SMs)
  • 6. GPU performance comparison (P100 vs. V100)
        Category   Metric                   P100       V100        Ratio
        Compute    DL ops (FP16 or Mixed)   21 TOPS    120 TOPS    6x
        Compute    FP32                     10 TFLOPS  15 TFLOPS   1.5x
        Compute    FP64                     5 TFLOPS   7.5 TFLOPS  1.5x
        Capacity   L1 Caches                1.3 MB     10 MB       7.7x
        Capacity   L2 Cache                 4 MB       6 MB        1.5x
        Bandwidth  Memory                   720 GB/s   900 GB/s    1.2x
        Bandwidth  NVLink                   160 GB/s   300 GB/s    1.9x
  • 7. NEW HBM2 MEMORY ARCHITECTURE. STREAM Triad, delivered GB/s: P100 at 76% DRAM utilization, V100 at 95% DRAM utilization. Effective bandwidth is 1.5x. V100 measured on pre-production hardware. [Figure: HBM2 stack]
  • 8. ROAD TO EXASCALE. Volta to Fuel Most Powerful US Supercomputers. Summit Supercomputer: 200+ PetaFlops, ~3,400 Nodes, 10 Megawatts. Volta HPC Application Performance, relative to Tesla P100: 1.5x. System Config Info: 2X Xeon E5-2690 v4, 2.6GHz, w/ 1X Tesla P100 or V100. V100 measured on pre-production hardware.
  • 9. VOLTA: TESLA V100. 21B transistors; 815 mm²; 80 SMs; 5120 CUDA Cores; 640 Tensor Cores; 16 GB HBM2; 900 GB/s HBM2; 300 GB/s NVLink. (*full GV100 chip contains 84 SMs)
  • 10. VOLTA GV100 SM. Per-SM resources on GV100: FP32 units: 64; FP64 units: 32; INT units: 64; Tensor Cores: 8; Register File: 256 KB; Unified L1/Shared memory: 128 KB; Active Threads: 2048.
  • 11. VOLTA GV100 SM. Completely new ISA; Twice the schedulers; Simplified Issue Logic; Large, fast L1 cache; Improved SIMT model; Tensor acceleration. Together these make the SM from which performance is easiest to extract in GPU history, and an easy-to-use architecture.
  • 12. MPS: MULTI-PROCESS SERVICE. Share a GPU safely and efficiently among multiple processes. Pascal GP100: CUDA MULTI-PROCESS SERVICE, limited isolation. Volta GV100: VOLTA MULTI-PROCESS SERVICE with CUDA MULTI-PROCESS SERVICE CONTROL, hardware isolation. [Diagram: CPU processes A, B, C mapped to GPU execution, on Pascal and on Volta]
  • 13. VOLTA: INDEPENDENT THREAD SCHEDULING. Communicating Algorithms. Pascal: Lock-Free Algorithms; threads cannot wait for messages. Volta: Starvation-Free Algorithms; threads may wait for messages.
  • 14. PASCAL SIMT EXECUTION MODEL. Diverged threads within a warp cannot exchange data. Code: if (threadIdx.x < 4) { A; __syncwarp(); B; } else { X; __syncwarp(); Y; } [Diagram over time: diverge, A; B;, X; Y;, reconverge]
  • 15. VOLTA SIMT EXECUTION MODEL. Even diverged threads within a warp can exchange data. Code: if (threadIdx.x < 4) { A; __syncwarp(); B; } else { X; __syncwarp(); Y; } __syncwarp(); [Diagram over time: diverge, A; B; X; Y;, synchronize]
  • 17. TENSOR CORE. Mixed Precision, 128 ops/cycle. D = AB + C on 4x4 matrices (elements A0,0 to A3,3, B0,0 to B3,3, C0,0 to C3,3): A and B are FP16; C and D are FP16 or FP32.
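  • 18. TENSOR SYNCHRONIZATION. Threads within a warp synchronize. Full Warp 16x16 Matrix Math: a 16x16 matrix multiplication is executed as a combination of 4x4 matrix multiplications, and the results are distributed to each thread. Warp (32 threads).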
  • 19. VOLTA TENSOR OPERATION. FP16 storage/input → full precision product → sum with FP32 accumulator (together with more products) → convert to FP32 result. [Diagram: FP16 × FP16 + FP32 → FP32]
  • 20. USING TENSOR CORES. Volta Optimized Libraries: NVIDIA cuDNN, cuBLAS, TensorRT. CUDA C++ Warp-Level Matrix Operations:
        __device__ void tensor_op_16_16_16(float *d, half *a, half *b, float *c)
        {
            // Per-warp fragments; the template arguments are elided on the slide.
            wmma::fragment<matrix_a, …> Amat;
            wmma::fragment<matrix_b, …> Bmat;
            wmma::fragment<accumulator, …> Cmat;

            wmma::load_matrix_sync(Amat, a, 16);     // leading dimension 16
            wmma::load_matrix_sync(Bmat, b, 16);
            wmma::fill_fragment(Cmat, 0.0f);         // accumulator starts at zero (c is not loaded)

            wmma::mma_sync(Cmat, Amat, Bmat, Cmat);  // Cmat = Amat * Bmat + Cmat

            wmma::store_matrix_sync(d, Cmat, 16, wmma::mem_row_major);
        }
  • 21. VOLTA: A GIANT LEAP FOR DEEP LEARNING. ResNet-50 Training, images per second: V100 Tensor Cores 2.4x faster than P100 FP32. ResNet-50 Inference with TensorRT at 7 ms latency, images per second: V100 Tensor Cores 3.7x faster than P100 FP16. V100 measured on pre-production hardware.
  • 24. It's fine, if you use Tensor Cores.
      • Mixed Precision Training
      • During the forward and backward passes, almost everything can run in fp16 without problems (when using Tensor Cores)
      • The update (weight update) is better run in fp32 (the update time is short)
      • Depending on the model, a technique called loss scaling is needed (small overhead)
      • Available in the major DL frameworks
      • TensorFlow, MxNet, PyTorch, Caffe2, Theano, MS Cognitive Toolkit, Chainer
      Training with Mixed-Precision User Guide: http://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html
  • 25. How much faster is Volta? P100 FP32, V100 FP32 vs. V100 Tensor Core. ResNet-50. [Diagram: bottleneck block: Conv 1x1, 64 → BN → ReLU → Conv 3x3, 64 → BN → ReLU → Conv 1x1, 256 → BN → add input x → ReLU] (*) Using Chainer 3.0.0rc1+ and CuPy 2.0.0rc1+.
  • 26. How much faster is Volta? P100 FP32, V100 FP32 vs. V100 Tensor Core. ImageNet, ResNet-50, Batch: 128. Time per iteration: P100 FP32 570 ms; V100 FP32 360 ms; V100 Tensor Core 197 ms. [Chart: bars broken down into Conv, BN, Relu, Cupy_*, Misc.; axis 0 to 600 ms] (*) Using Chainer 3.0.0rc1+ and CuPy 2.0.0rc1+.
  • 28. How much faster is Volta? P100 FP32, V100 FP32 vs. V100 Tensor Core. ImageNet, ResNet-50, Batch: 128. Time per iteration: P100 FP32 570 ms; V100 FP32 360 ms; V100 Tensor Core 197 ms, about 3x faster than P100 FP32. [Chart: bars broken down into Conv, BN, Relu, Cupy_*, Misc.; axis 0 to 600 ms] (*) Using Chainer 3.0.0rc1+ and CuPy 2.0.0rc1+.
  • 29. Multi-GPU performance. ImageNet, ResNet-50, Batch/GPU: 128. Images per second:
        Configuration      1 GPU   2 GPUs   4 GPUs   8 GPUs
        P100 FP32            224      430      857    1,657
        V100 FP32            355      675    1,331    2,530
        V100 Tensor Core     649    1,199    2,359    4,064
      (*) Using CUDA 9, cuDNN 7, NCCL 2, Chainer 3.0.0rc1+, CuPy 2.0.0rc1+; the machine is a DGX1(V).
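
The benchmark figures on slides 26 to 29 can be cross-checked with a little arithmetic. The sketch below is not from the deck: the function names are hypothetical, and the pairing of each number with a configuration follows the legend order on the slides, which is an assumption. It converts time per iteration into images per second, then derives the single-GPU speedup and the multi-GPU scaling efficiency (measured throughput divided by N times the single-GPU throughput).

```python
# Figures transcribed from slides 26-29 (ImageNet, ResNet-50, batch 128 per GPU).
# ASSUMPTION: numbers are paired with configurations in legend order.
BATCH_PER_GPU = 128

ITERATION_MS = {
    "P100 FP32": 570,
    "V100 FP32": 360,
    "V100 Tensor Core": 197,
}

IMAGES_PER_SECOND = {
    "P100 FP32": {1: 224, 2: 430, 4: 857, 8: 1657},
    "V100 FP32": {1: 355, 2: 675, 4: 1331, 8: 2530},
    "V100 Tensor Core": {1: 649, 2: 1199, 4: 2359, 8: 4064},
}


def throughput_from_iteration(ms, batch=BATCH_PER_GPU):
    """Images per second implied by one iteration time in milliseconds."""
    return batch / (ms / 1000.0)


def speedup(baseline, config, gpus=1):
    """Throughput of `config` relative to `baseline` at the same GPU count."""
    return IMAGES_PER_SECOND[config][gpus] / IMAGES_PER_SECOND[baseline][gpus]


def scaling_efficiency(config, gpus):
    """Measured throughput divided by ideal linear scaling from one GPU."""
    series = IMAGES_PER_SECOND[config]
    return series[gpus] / (gpus * series[1])


if __name__ == "__main__":
    for name, ms in ITERATION_MS.items():
        implied = throughput_from_iteration(ms)
        print(f"{name}: {ms} ms/iter implies {implied:.0f} images/s")
    tc_speedup = speedup("P100 FP32", "V100 Tensor Core")
    print(f"V100 Tensor Core vs. P100 FP32, 1 GPU: {tc_speedup:.2f}x")
    for name in IMAGES_PER_SECOND:
        print(f"{name}: 8-GPU scaling efficiency {scaling_efficiency(name, 8):.0%}")
```

Under these assumptions the per-iteration times agree with the single-GPU throughputs to within one image per second, and the single-GPU speedup of V100 Tensor Core over P100 FP32 comes out at roughly 2.9x, in line with the "about 3x" on slide 28. At 8 GPUs the scaling efficiency is about 92% for P100 FP32, 89% for V100 FP32, and 78% for V100 Tensor Core.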