Introduction to Volta (Tesla V100)

Akira Naruse, 9th Nov. 2017
VOLTA (TESLA V100)

GPU ROADMAPS
[Chart: SGEMM/W (axis 0 to 72) by year, 2008 to 2018, for the Tesla, Fermi, Kepler, Maxwell, Pascal and Volta generations]

VOLTA: TESLA V100
The fastest GPU for both HPC and deep learning
• Volta Architecture: Most Productive GPU
• Tensor Core: 120 Programmable TFLOPS Deep Learning
• Improved SIMT Model: New Algorithms
• Volta MPS: Inference Utilization
• Improved NVLink & HBM2: Efficient Bandwidth

VOLTA: TESLA V100
• 21B transistors
• 815 mm²
• 80 SM
• 5120 CUDA Cores
• 640 Tensor Cores
• 16 GB HBM2
• 900 GB/s HBM2
• 300 GB/s NVLink
*full GV100 chip contains 84 SMs

GPU PERFORMANCE COMPARISON

Category   Metric                   P100        V100         Ratio
Compute    DL ops (FP16 or Mixed)   21 TOPS     120 TOPS     6x
Compute    FP32                     10 TFLOPS   15 TFLOPS    1.5x
Compute    FP64                     5 TFLOPS    7.5 TFLOPS   1.5x
Capacity   L1 Caches                1.3 MB      10 MB        7.7x
Capacity   L2 Cache                 4 MB        6 MB         1.5x
Bandwidth  Memory                   720 GB/s    900 GB/s     1.2x
Bandwidth  NVLink                   160 GB/s    300 GB/s     1.9x

NEW HBM2 MEMORY ARCHITECTURE
Effective bandwidth is 1.5x that of P100
[Chart: STREAM Triad, delivered GB/s. P100: 76% DRAM utilization. V100: 95% DRAM utilization.]
[Image: HBM2 stack]
V100 measured on pre-production hardware.
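The 1.5x figure can be sanity-checked against the peak memory bandwidths in the comparison table (720 GB/s for P100, 900 GB/s for V100). The worked estimate below assumes the DRAM utilization percentages are measured relative to those peaks, which the slide does not state explicitly.

```latex
\begin{align*}
\text{P100 delivered} &\approx 0.76 \times 720\ \text{GB/s} \approx 547\ \text{GB/s} \\
\text{V100 delivered} &\approx 0.95 \times 900\ \text{GB/s} = 855\ \text{GB/s} \\
\frac{\text{V100 delivered}}{\text{P100 delivered}} &\approx \frac{855}{547} \approx 1.56
\end{align*}
```

Under that assumption the gain comes from two factors: a 1.25x higher peak and a 1.25x higher utilization.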

ROAD TO EXASCALE
Volta to Fuel Most Powerful US Supercomputers
[Chart: Volta HPC Application Performance, relative to Tesla P100; labeled 1.5x]
System Config Info: 2X Xeon E5-2690 v4, 2.6GHz, w/ 1X Tesla P100 or V100. V100 measured on pre-production hardware.
Summit Supercomputer
• 200+ PetaFlops
• ~3,400 Nodes
• 10 Megawatts

VOLTA GV100 SM

                           GV100
FP32 units                 64
FP64 units                 32
INT units                  64
Tensor Cores               8
Register File              256 KB
Unified L1/Shared memory   128 KB
Active Threads             2048

VOLTA GV100 SM
• Completely new ISA
• Twice the schedulers
• Simplified Issue Logic
• Large, fast L1 cache
• Improved SIMT model
• Tensor acceleration
= The easiest SM to get high performance from in GPU history
An easy-to-use architecture

MPS: MULTI-PROCESS SERVICE
Share a GPU among multiple processes, safely and efficiently
[Diagram, Pascal GP100: CPU processes A, B and C go through the CUDA MULTI-PROCESS SERVICE; GPU execution of A, B and C has limited isolation]
[Diagram, Volta GV100: CPU processes A, B and C go through CUDA MULTI-PROCESS SERVICE CONTROL (VOLTA MULTI-PROCESS SERVICE); GPU execution of A, B and C has hardware isolation]

VOLTA: INDEPENDENT THREAD SCHEDULING
Communicating Algorithms
• Pascal: Lock-Free Algorithms. Threads cannot wait for messages.
• Volta: Starvation-Free Algorithms. Threads may wait for messages.

PASCAL SIMT EXECUTION MODEL
Diverged threads within a warp cannot exchange data
[Diagram, time axis: diverge, the A; B; path and the X; Y; path, reconverge]

if (threadIdx.x < 4) {
    A;
    __syncwarp();
    B;
} else {
    X;
    __syncwarp();
    Y;
}

VOLTA SIMT EXECUTION MODEL
Diverged threads within a warp can still exchange data
[Diagram, time axis: diverge, the A; B; path and the X; Y; path, synchronize]

if (threadIdx.x < 4) {
    A;
    __syncwarp();
    B;
} else {
    X;
    __syncwarp();
    Y;
}
__syncwarp();

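TENSOR CORE
Mixed Precision, 128 ops/cycle

\[
\underbrace{D}_{\text{FP16 or FP32}} =
\underbrace{\begin{pmatrix}
A_{0,0} & A_{0,1} & A_{0,2} & A_{0,3} \\
A_{1,0} & A_{1,1} & A_{1,2} & A_{1,3} \\
A_{2,0} & A_{2,1} & A_{2,2} & A_{2,3} \\
A_{3,0} & A_{3,1} & A_{3,2} & A_{3,3}
\end{pmatrix}}_{\text{FP16}}
\underbrace{\begin{pmatrix}
B_{0,0} & B_{0,1} & B_{0,2} & B_{0,3} \\
B_{1,0} & B_{1,1} & B_{1,2} & B_{1,3} \\
B_{2,0} & B_{2,1} & B_{2,2} & B_{2,3} \\
B_{3,0} & B_{3,1} & B_{3,2} & B_{3,3}
\end{pmatrix}}_{\text{FP16}}
+
\underbrace{\begin{pmatrix}
C_{0,0} & C_{0,1} & C_{0,2} & C_{0,3} \\
C_{1,0} & C_{1,1} & C_{1,2} & C_{1,3} \\
C_{2,0} & C_{2,1} & C_{2,2} & C_{2,3} \\
C_{3,0} & C_{3,1} & C_{3,2} & C_{3,3}
\end{pmatrix}}_{\text{FP16 or FP32}}
\]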
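As a rough cross-check, the headline peak rates can be recomputed from the unit counts given on the slides (80 SMs, 64 FP32 units, 32 FP64 units and 8 Tensor Cores per SM, 128 ops per cycle per Tensor Core). The sketch below assumes a boost clock of about 1.455 GHz and counts a fused multiply-add as 2 operations; neither value appears on the slides, and all identifiers are illustrative.

```cpp
#include <cstdio>

// Unit counts taken from the slides.
constexpr int kSMs = 80;                 // SMs enabled on Tesla V100
constexpr int kFp32UnitsPerSM = 64;
constexpr int kFp64UnitsPerSM = 32;
constexpr int kTensorCoresPerSM = 8;
constexpr int kTensorOpsPerCycle = 128;  // per Tensor Core

// ASSUMPTION: boost clock in GHz; this value is not stated on the slides.
constexpr double kAssumedClockGHz = 1.455;

// ASSUMPTION: one fused multiply-add per unit per cycle, counted as 2 ops.
constexpr int kOpsPerFma = 2;

// Peak rate in TFLOPS = units * ops per unit per cycle * clock [GHz] / 1000.
constexpr double peak_tflops(int units, int ops_per_cycle, double clock_ghz) {
    return units * ops_per_cycle * clock_ghz / 1000.0;
}

constexpr double kFp32Tflops =
    peak_tflops(kSMs * kFp32UnitsPerSM, kOpsPerFma, kAssumedClockGHz);
constexpr double kFp64Tflops =
    peak_tflops(kSMs * kFp64UnitsPerSM, kOpsPerFma, kAssumedClockGHz);
constexpr double kTensorTflops =
    peak_tflops(kSMs * kTensorCoresPerSM, kTensorOpsPerCycle, kAssumedClockGHz);

// Print the three estimates next to each other.
void print_peak_rates() {
    std::printf("FP32   : %.1f TFLOPS\n", kFp32Tflops);
    std::printf("FP64   : %.1f TFLOPS\n", kFp64Tflops);
    std::printf("Tensor : %.1f TFLOPS\n", kTensorTflops);
}
```

With the assumed clock, the three estimates land close to the 15, 7.5 and 120 TFLOPS quoted in the deck. A different clock scales all three linearly.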

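TENSOR SYNCHRONIZATION
Synchronize across the threads of a warp
Full Warp 16x16 Matrix Math
• A 16x16 matrix multiplication is executed as a combination of 4x4 matrix multiplications
• The result is distributed across the threads
Warp (32 threads)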
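The decomposition of a 16x16 product into 4x4 products can be emulated on the CPU. The sketch below only illustrates the arithmetic of the tiling. It does not model how the hardware schedules tile operations or how fragments are distributed across the 32 threads, and the function names are hypothetical.

```cpp
#include <array>

constexpr int kN = 16;    // matrix dimension handled by one warp
constexpr int kTile = 4;  // dimension of one Tensor Core operation

using Matrix = std::array<float, kN * kN>;  // row-major 16x16 matrix

// One emulated 4x4 step: D_tile += A_tile * B_tile.
// (ti, tj) selects the output tile, tk the tile along the shared dimension.
void mma_4x4(const Matrix& a, const Matrix& b, Matrix& d,
             int ti, int tj, int tk) {
    for (int i = 0; i < kTile; ++i) {
        for (int j = 0; j < kTile; ++j) {
            const int row = ti * kTile + i;
            const int col = tj * kTile + j;
            float acc = d[row * kN + col];
            for (int k = 0; k < kTile; ++k) {
                const int inner = tk * kTile + k;
                acc += a[row * kN + inner] * b[inner * kN + col];
            }
            d[row * kN + col] = acc;
        }
    }
}

// D = A * B + C for 16x16 matrices, built from 4 * 4 * 4 = 64 tile steps.
Matrix matmul_tiled(const Matrix& a, const Matrix& b, const Matrix& c) {
    Matrix d = c;
    for (int ti = 0; ti < kN / kTile; ++ti)
        for (int tj = 0; tj < kN / kTile; ++tj)
            for (int tk = 0; tk < kN / kTile; ++tk)
                mma_4x4(a, b, d, ti, tj, tk);
    return d;
}

// Straightforward reference: D = A * B + C without tiling.
Matrix matmul_reference(const Matrix& a, const Matrix& b, const Matrix& c) {
    Matrix d = c;
    for (int i = 0; i < kN; ++i)
        for (int j = 0; j < kN; ++j)
            for (int k = 0; k < kN; ++k)
                d[i * kN + j] += a[i * kN + k] * b[k * kN + j];
    return d;
}
```

For small integer-valued inputs both functions give bit-identical results, which makes the equivalence easy to check.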

VOLTA TENSOR OPERATION
FP16 storage/input → full precision product → sum with FP32 accumulator (together with more products) → convert to FP32 result
[Diagram: FP16 × FP16 + FP32 → FP32]
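Why the accumulator is kept in FP32 can be seen in a small CPU emulation. The sketch below rounds values to FP16 with a hand-written helper (`round_to_fp16`, a hypothetical name, assuming the default round-to-nearest floating-point mode) and compares a dot product accumulated in FP32 with one whose running sum is rounded to FP16 after every addition. It illustrates the numerics only and is not the hardware data path.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Round a float to the nearest value representable in IEEE FP16.
float round_to_fp16(float x) {
    if (x == 0.0f || !std::isfinite(x)) return x;
    int exp = 0;
    std::frexp(x, &exp);  // |x| = m * 2^exp with 0.5 <= m < 1
    // FP16 keeps 11 significant bits; below 2^-14 the spacing stays at 2^-24.
    const int ulp_exp = std::max(exp - 11, -24);
    const float ulp = std::ldexp(1.0f, ulp_exp);
    const float r = std::nearbyint(x / ulp) * ulp;
    if (std::fabs(r) > 65504.0f)  // larger than the largest finite FP16 value
        return std::copysign(std::numeric_limits<float>::infinity(), x);
    return r;
}

// FP16 inputs, products kept in FP32, FP32 accumulator (the slide's scheme).
float dot_fp32_accumulate(const std::vector<float>& a,
                          const std::vector<float>& b) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i)
        acc += round_to_fp16(a[i]) * round_to_fp16(b[i]);
    return acc;
}

// For contrast: the running sum is rounded back to FP16 after every add.
float dot_fp16_accumulate(const std::vector<float>& a,
                          const std::vector<float>& b) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float product = round_to_fp16(round_to_fp16(a[i]) * round_to_fp16(b[i]));
        acc = round_to_fp16(acc + product);
    }
    return acc;
}
```

Summing 4096 products of 1.0 x 1.0 shows the difference. The FP32 accumulator reaches 4096, while the FP16 running sum stalls at 2048, because 2049 is not representable in FP16 and rounds back to 2048.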

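USING TENSOR CORES

Volta Optimized Libraries
NVIDIA cuDNN, cuBLAS, TensorRT

CUDA C++
Warp-Level Matrix Operations

__device__ void tensor_op_16_16_16(
    float *d, half *a, half *b, float *c)
{
    // Per-warp fragments; the template arguments are elided ("…") on the slide.
    // Note: in the released CUDA 9 WMMA API the accumulator fragment kind is
    // wmma::accumulator and the store layout constant is wmma::mem_row_major.
    wmma::fragment<matrix_a, …> Amat;
    wmma::fragment<matrix_b, …> Bmat;
    wmma::fragment<matrix_c, …> Cmat;

    // Load the 16x16 FP16 inputs (leading dimension 16).
    wmma::load_matrix_sync(Amat, a, 16);
    wmma::load_matrix_sync(Bmat, b, 16);

    // Start from a zero accumulator; the c argument is unused in this snippet.
    wmma::fill_fragment(Cmat, 0.0f);

    // Cmat = Amat * Bmat + Cmat, executed cooperatively by the whole warp.
    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

    // Write the FP32 result to d in row-major order.
    wmma::store_matrix_sync(d, Cmat, 16,
                            wmma::row_major);
}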

VOLTA: A GIANT LEAP FOR DEEP LEARNING
• ResNet-50 Training, images per second: V100 (Tensor Cores) is 2.4x faster than P100 (FP32)
• ResNet-50 Inference, images per second, TensorRT at 7 ms latency: V100 (Tensor Cores) is 3.7x faster than P100 (FP16)
V100 measured on pre-production hardware.

Is accuracy preserved when training in FP16?

Yes, it is, if you use Tensor Cores
• Mixed Precision Training
  • During the forward and backward passes, nearly everything can run in FP16 without problems (if Tensor Cores are used)
  • The update (weight update) is better done in FP32 (the update takes little time)
  • Some models need a technique called loss scaling (small overhead)
• Available in the major DL frameworks
  • TensorFlow, MXNet, PyTorch, Caffe2, Theano, MS Cognitive Toolkit, Chainer
Training with Mixed-Precision User Guide
http://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html
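Two of the points above, the FP32 weight update and loss scaling, can be illustrated with a small CPU emulation of FP16 rounding. This is a sketch under stated assumptions, not any framework's implementation. `round_to_fp16` and the other function names are hypothetical, the scale factor of 1024 is an arbitrary illustrative choice, and the default round-to-nearest floating-point mode is assumed.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Round a float to the nearest value representable in IEEE FP16.
float round_to_fp16(float x) {
    if (x == 0.0f || !std::isfinite(x)) return x;
    int exp = 0;
    std::frexp(x, &exp);  // |x| = m * 2^exp with 0.5 <= m < 1
    // FP16 keeps 11 significant bits; below 2^-14 the spacing stays at 2^-24.
    const int ulp_exp = std::max(exp - 11, -24);
    const float ulp = std::ldexp(1.0f, ulp_exp);
    const float r = std::nearbyint(x / ulp) * ulp;
    if (std::fabs(r) > 65504.0f)  // larger than the largest finite FP16 value
        return std::copysign(std::numeric_limits<float>::infinity(), x);
    return r;
}

// A gradient stored directly in FP16: very small values flush to zero.
float fp16_gradient(float true_grad) {
    return round_to_fp16(true_grad);
}

// Loss scaling: the gradient is scaled up before it is stored in FP16
// (as if the loss had been multiplied by `scale`), then scaled back in FP32.
float fp16_gradient_with_loss_scaling(float true_grad, float scale) {
    const float stored = round_to_fp16(true_grad * scale);
    return stored / scale;  // unscale in FP32 before the weight update
}

// Weight update carried out entirely in FP16: small steps can be lost.
float update_weight_fp16(float w, float lr, float grad) {
    return round_to_fp16(round_to_fp16(w) - round_to_fp16(lr * grad));
}

// Weight update on an FP32 master copy of the weight.
float update_weight_fp32(float w, float lr, float grad) {
    return w - lr * grad;
}
```

For example, a gradient of 1e-8 becomes 0 when stored in FP16, but survives with about 0.1% error when scaled by 1024 first. Likewise, subtracting a step of 1e-4 from a weight of 1.0 leaves the weight unchanged in FP16, because the FP16 spacing just below 1.0 is about 4.9e-4, while the FP32 master weight moves as expected.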

How much faster is Volta?
P100 FP32, V100 FP32 vs. V100 Tensor Core
ResNet-50
[Diagram, ResNet-50 bottleneck block: Conv 1x1, 64 → BN → ReLU → Conv 3x3, 64 → BN → ReLU → Conv 1x1, 256 → BN → add input x → ReLU]
(*) Using Chainer 3.0.0rc1+ and CuPy 2.0.0rc1+

ImageNet, ResNet-50, Batch: 128. Time per iteration [ms]
[Stacked bar chart, axis 0 to 600 ms, broken down into Conv, BN, Relu, Cupy_*, Misc.]
• P100 FP32: 570 ms
• V100 FP32: 360 ms
• V100 Tensor Core: 197 ms

V100 with Tensor Cores is about 3x faster than P100 FP32: 570 ms per iteration on P100 FP32, 360 ms on V100 FP32, 197 ms on V100 Tensor Core.

MULTI-GPU PERFORMANCE
ImageNet, ResNet-50, Batch/GPU: 128. Images per second:

                    1 GPU   2 GPUs   4 GPUs   8 GPUs
P100 FP32             224      430      857    1,657
V100 FP32             355      675    1,331    2,530
V100 Tensor Core      649    1,199    2,359    4,064

(*) Using CUDA 9, cuDNN 7, NCCL 2, Chainer 3.0.0rc1+, CuPy 2.0.0rc1+; the machine is a DGX1(V)
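The multi-GPU figures can be turned into scaling efficiencies and speedups with a few lines of arithmetic. The sketch below assumes the twelve chart values are listed series by series (P100 FP32, V100 FP32, V100 Tensor Core), each for 1, 2, 4 and 8 GPUs. That reading is consistent with the single-GPU iteration times (128 images / 0.570 s is about 224 images per second). The identifiers are illustrative.

```cpp
#include <array>
#include <cstddef>

using Series = std::array<double, 4>;  // images/s for 1, 2, 4 and 8 GPUs

// Values read from the chart (assumed series-by-series ordering).
const Series kP100Fp32   = {224.0, 430.0, 857.0, 1657.0};
const Series kV100Fp32   = {355.0, 675.0, 1331.0, 2530.0};
const Series kV100Tensor = {649.0, 1199.0, 2359.0, 4064.0};

const std::array<int, 4> kGpuCounts = {1, 2, 4, 8};

// Parallel efficiency: measured throughput divided by ideal linear scaling
// of the single-GPU throughput. idx selects the GPU count (0 = 1 GPU).
double scaling_efficiency(const Series& s, std::size_t idx) {
    return s[idx] / (s[0] * kGpuCounts[idx]);
}

// Throughput ratio between two configurations at the same GPU count.
double speedup(const Series& fast, const Series& slow, std::size_t idx) {
    return fast[idx] / slow[idx];
}
```

Read this way, at 8 GPUs the efficiency is roughly 92% for P100 FP32, 89% for V100 FP32 and 78% for V100 Tensor Core. The Tensor Core speedup over P100 FP32 is about 2.9x on 1 GPU and about 2.5x on 8 GPUs, so the faster each GPU computes, the more the fixed communication and host-side costs weigh on scaling.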
