Machine Learning and the Parallel Computing That Supports It: Applications of Deep Learning Supercomputers

Shinya Morino (森野慎也), Senior CUDA Engineer, NVIDIA Japan, 2017/2/10
Applications of Deep Learning Supercomputers

3
https://www.youtube.com/watch?v=B0pt6gpgCXQ

4
DGX-1 Diagram
CPU: 2x Intel® Xeon® E5-2698 v4, 20-core, 2.2 GHz
GPU: 8x Tesla P100 SXM2 16GB
DRAM: 512 GB, 2133 MHz 32 GB DDR4 LRDIMM
Storage (OS): 1x 480 GB, 6 Gb/s, SATA 3.0 SSD
Storage (Data): 4x 1.92 TB, 6 Gb/s, SATA 3.0 SSD

5
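TESLA P100
A new GPU architecture that enables the world's fastest compute node
Pascal architecture: the highest compute performance
NVLink: a GPU-to-GPU interconnect that delivers maximum scalability
HBM2 Stacked Memory: compute and memory in a single package
Page Migration Engine: simple parallel programming with 512 TB of virtual memory (Unified Memory)
[Figure: two CPUs connected through PCIe switches to the GPUs; a CPU and a Tesla P100 sharing Unified Memory]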

6
A Giant Leap in Every Respect
[Figure: three bar charts comparing K40, M40, and P100]
- 3x GPU Mem BW: relative memory bandwidth, axis 1x to 3x
- 5x GPU-GPU BW: bandwidth (GB/sec), axis 40 to 160
- 3x Compute: teraflops (FP32/FP16), axis 5 to 20, with bars for K40, M40, P100 (FP32), and P100 (FP16)

7
Tesla P100 GPU : GP100
56 SM
3584 CUDA cores
Double precision: 5.3 TFLOPS
Single precision: 10.6 TFLOPS
Half precision: 21.2 TFLOPS
16 GB HBM2
Bandwidth: 720 GB/s
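The three peak figures on this slide follow from the core count by simple arithmetic. The sketch below reproduces them under two assumptions that are not on the slide: a boost clock of 1.48 GHz, and one fused multiply-add (2 FLOP) per CUDA core per clock for single precision. The half and double precision rates are taken as 2x and 1/2x of single precision, which is what the slide's own numbers imply.

```python
# Peak throughput estimate for GP100 from the core count on the slide.
CUDA_CORES = 3584        # from the slide
BOOST_CLOCK_GHZ = 1.48   # assumption: the clock is not stated on the slide
FLOP_PER_FMA = 2         # one fused multiply-add counts as 2 FLOP

def peak_tflops(cores, clock_ghz, rate_vs_fp32):
    """Peak TFLOPS, given the issue rate relative to single precision."""
    return cores * FLOP_PER_FMA * clock_ghz * rate_vs_fp32 / 1000.0

fp64 = peak_tflops(CUDA_CORES, BOOST_CLOCK_GHZ, 0.5)
fp32 = peak_tflops(CUDA_CORES, BOOST_CLOCK_GHZ, 1.0)
fp16 = peak_tflops(CUDA_CORES, BOOST_CLOCK_GHZ, 2.0)
print(f"FP64 {fp64:.1f}  FP32 {fp32:.1f}  FP16 {fp16:.1f} TFLOPS")
```

With the assumed clock, the results round to the 5.3, 10.6, and 21.2 TFLOPS listed above.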

8
IEEE 754 Floating Point on GP100
Three sizes, three speeds, all fast
Feature            Half precision   Single precision   Double precision
Layout             s5.10            s8.23              s11.52
Issue rate         2 ops / clock    1 op / clock       1 op / 2 clocks
Subnormal support  Yes              Yes                Yes
Atomic addition    Yes              Yes                Yes
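The layout row reads as sign, exponent bits, mantissa bits (for example s5.10 is 1 sign bit, 5 exponent bits, and 10 mantissa bits). As a host-side illustration only, the sketch below splits a value into those fields with Python's standard `struct` module. The helper name `split_fields` is hypothetical, and nothing here runs on the GPU.

```python
import struct

# precision -> (float format, same-width unsigned format, exponent bits, mantissa bits)
FORMATS = {
    "half":   ("<e", "<H", 5, 10),   # s5.10
    "single": ("<f", "<I", 8, 23),   # s8.23
    "double": ("<d", "<Q", 11, 52),  # s11.52
}

def split_fields(value, precision):
    """Return the (sign, biased exponent, mantissa) bit fields of value."""
    float_fmt, int_fmt, exp_bits, man_bits = FORMATS[precision]
    (bits,) = struct.unpack(int_fmt, struct.pack(float_fmt, value))
    mantissa = bits & ((1 << man_bits) - 1)
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    sign = bits >> (man_bits + exp_bits)
    return sign, exponent, mantissa

for name in FORMATS:
    print(name, split_fields(-1.5, name))

# The smallest half-precision subnormal has a zero exponent field.
print("half subnormal", split_fields(2.0 ** -24, "half"))
```

A zero exponent field with a nonzero mantissa marks a subnormal value, which is the case the "Subnormal support" row refers to.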

9
HBM2: 720 GB/s bandwidth
ECC support
[Figure: package cross-section showing the GPU, a 4-layer HBM2 stack, spacer, bumps, silicon carrier, and substrate]

10
DGX-1 DEMO 1
Tesla P100 performance

11
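Matrix computation example
- The matrix dimensions are (9600, 9600).
- Build an orthonormal matrix with the Gram-Schmidt process.
- Multiplying the matrix by its transpose gives the identity matrix, so the product is used to measure GEMM performance.
$I = A A^{T}$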
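The demo's source code is not part of the slides, so the following is only a small pure-Python sketch of the same idea: orthonormalize the rows of a random matrix with (modified) Gram-Schmidt, then check how far $A A^{T}$ is from the identity. The function names, the tiny matrix size, and the choice of maximum absolute deviation as the error measure are all assumptions made for illustration.

```python
import random

def gram_schmidt(rows):
    """Orthonormalize the rows with the modified Gram-Schmidt process."""
    basis = []
    for row in rows:
        v = list(row)
        for q in basis:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]  # remove the component along q
        norm = sum(a * a for a in v) ** 0.5
        basis.append([a / norm for a in v])
    return basis

def identity_error(a):
    """Max deviation of A A^T from I, on and off the diagonal."""
    n = len(a)
    err_diag = err_off = 0.0
    for i in range(n):
        for j in range(n):
            dot = sum(x * y for x, y in zip(a[i], a[j]))
            if i == j:
                err_diag = max(err_diag, abs(dot - 1.0))
            else:
                err_off = max(err_off, abs(dot))
    return err_diag, err_off

random.seed(0)
n = 16  # the demo uses 9600; kept tiny so pure Python finishes quickly
a = gram_schmidt([[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)])
err_diag, err_off = identity_error(a)
print(f"err(diag)={err_diag:.3e}, err(off-diag)={err_off:.3e}")
```

Because the expected product is known exactly, the same multiplication can serve both as a timing workload and as a correctness check.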

12
Compute performance (FP32)
GPU: Tesla P100-SXM2-16GB, CC=6.0, 3584 CUDA cores.
(omitted)
dim=(9600, 9600), precision: fp32
Generating orthogonal matrix.
init: err(diag)=3.304e+03, err(off-diag)=2.512e+03
0: err(diag)=1.192e-06, err(off-diag)=1.780e-01
done.
prepare ... 0 1 2 3 4 5 6 7
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
Perf [GFLOPS] 9956.4 9928.5 9949.3 9948.4 9815.4 9963.8 9952.4 9974.8
Time [ms] 177.72 178.22 177.85 177.86 180.28 177.59 177.79 177.39
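The Perf row appears to follow from the Time row using the usual GEMM operation count of 2n^3 floating-point operations for an n x n product. The sketch below checks this for GPU0; the function name is hypothetical, and the 2n^3 convention is an assumption about how the demo counts work.

```python
def gemm_gflops(n, time_ms):
    """GFLOPS of an (n x n) by (n x n) GEMM, counting 2 * n^3 FLOP."""
    flop = 2.0 * n ** 3
    return flop / (time_ms * 1e-3) / 1e9

# GPU0 timing copied from the log above.
print(f"{gemm_gflops(9600, 177.72):.1f} GFLOPS")
```

The result agrees with the logged 9956.4 GFLOPS to within the rounding of the printed time.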

13
Compute performance (FP64)
(omitted)
dim=(9600, 9600), precision: fp64
Generating orthogonal matrix.
done.
prepare ... 0 1 2 3 4 5 6 7
Perf [GFLOPS] 4853.2 4851.0 4961.6 4890.0 4842.0 4973.6 4880.2 4882.9
Time [ms] 364.60 364.77 356.63 361.86 365.44 355.77 362.58 362.38
freeing resources.

14
Compute performance (FP16)
(omitted)
(omitted)
done.
Hgemm : (math on FP16)
Perf [GFLOPS] 19192.8 19132.2 19225.2 19214.1 19227.7 19220.5 19240.0 19235.7
Time [ms] 92.19 92.49 92.04 92.09 92.03 92.06 91.97 91.99
Err(diag) 2.8e-01 2.8e-01 2.8e-01 2.8e-01 2.8e-01 2.8e-01 2.8e-01 2.8e-01
Err(off-diag) 1.1e-03 1.1e-03 1.1e-03 1.1e-03 1.1e-03 1.1e-03 1.1e-03 1.1e-03
SgemmEx : R16 x R16 = R16 (math on FP32)
Perf [GFLOPS] 9900.2 9855.1 9901.4 9897.4 9901.4 9897.3 9893.2 9899.8
Time [ms] 178.73 179.55 178.71 178.78 178.71 178.78 178.86 178.74
SgemmEx : R16 x R16 = R32 (math on FP32)
Perf [GFLOPS] 9901.9 9860.0 9898.7 9896.9 9902.8 9895.6 9886.2 9901.5
Time [ms] 178.70 179.46 178.76 178.79 178.68 178.81 178.98 178.71
freeing resources.

19
Tesla P100 physical connectors
Including the NVLink connections

20
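NVLink
4 links per P100.
94% bandwidth efficiency
Supports reads, writes, and atomics to peer GPUs
Read/write access from CPUs that support NVLink
Multiple links can be ganged together for higher bandwidth
Reduced communication latency
NVLink on Tesla P100
[Figure: four NVLink links, 40 GB/s each]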
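As a quick arithmetic check, four links at 40 GB/s each add up to the 160 GB/s shown for P100 in the GPU-GPU bandwidth chart earlier in the deck. The sketch below also applies the 94% efficiency figure uniformly to all links; that reading of the efficiency number is an assumption, since the slide does not say exactly what it is measured against.

```python
LINKS_PER_GPU = 4      # from the slide
GB_PER_S_PER_LINK = 40.0  # from the slide
EFFICIENCY = 0.94      # from the slide; applying it per link is an assumption

def aggregate_bandwidth(links, per_link, efficiency=1.0):
    """Total bandwidth in GB/s when all links are used together."""
    return links * per_link * efficiency

raw = aggregate_bandwidth(LINKS_PER_GPU, GB_PER_S_PER_LINK)
effective = aggregate_bandwidth(LINKS_PER_GPU, GB_PER_S_PER_LINK, EFFICIENCY)
print(f"raw {raw:.1f} GB/s, effective {effective:.1f} GB/s")
```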

21
DGX-1 Diagram
CPU: 2x Intel® Xeon® E5-2698 v4, 20-core, 2.2 GHz
GPU: 8x Tesla P100 SXM2 16GB
DRAM: 512 GB, 2133 MHz 32 GB DDR4 LRDIMM
Storage (OS): 1x 480 GB, 6 Gb/s, SATA 3.0 SSD
Storage (Data): 4x 1.92 TB, 6 Gb/s, SATA 3.0 SSD

23
NCCL implementation
• Example: 1 CPU and 4 GPUs (PCIe)
Ring algorithm
Makes effective use of bandwidth on a ring and on many other topologies
A collective can be implemented with one or more rings. [P. Patarasuk and X. Yuan]
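To make "a collective expressed on a ring" concrete, here is a pure-Python simulation of a single-ring all-reduce (a reduce-scatter phase followed by an all-gather phase). It is an illustrative sketch only: it is not NCCL's code or API, the function name is hypothetical, and the buffers are plain Python lists standing in for GPU memory.

```python
def ring_allreduce(buffers):
    """Sum equal-length buffers across K ranks using one ring.

    Each buffer is split into K chunks (its length must be a multiple
    of K). In every step, rank r sends one chunk to rank (r + 1) % K.
    """
    k = len(buffers)
    n = len(buffers[0])
    assert n % k == 0
    size = n // k
    data = [list(b) for b in buffers]

    def chunk(c):
        return range(c * size, (c + 1) * size)

    # Reduce-scatter: after K-1 steps, rank r holds the full sum of chunk (r+1) % K.
    for step in range(k - 1):
        sends = [(r, (r - step) % k) for r in range(k)]
        payloads = [[data[r][i] for i in chunk(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % k
            for i, v in zip(chunk(c), payload):
                data[dst][i] += v

    # All-gather: circulate the finished chunks for another K-1 steps.
    for step in range(k - 1):
        sends = [(r, (r + 1 - step) % k) for r in range(k)]
        payloads = [[data[r][i] for i in chunk(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % k
            for i, v in zip(chunk(c), payload):
                data[dst][i] = v
    return data

ranks = [[10 * r + i for i in range(8)] for r in range(4)]
result = ring_allreduce(ranks)
print(result[0])
```

Every rank only ever talks to its ring neighbor, and each link carries one chunk per step, which is why the ring layout keeps all links busy at once.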

24
Broadcast
Send the data to all GPUs

25
Broadcast
Using a unidirectional ring
GPU0 GPU1 GPU2 GPU3

26
Broadcast
Step 1: Δt = N/B
N: bytes to broadcast
B: bandwidth of each link
GPU0 GPU1 GPU2 GPU3

27
Broadcast
Step 1: Δt = N/B
Step 2: Δt = N/B
GPU0 GPU1 GPU2 GPU3

28
Broadcast
Step 1: Δt = N/B
Step 2: Δt = N/B
Step 3: Δt = N/B
GPU0 GPU1 GPU2 GPU3

29
Broadcast
Step 1: Δt = N/B
Step 2: Δt = N/B
Step 3: Δt = N/B
Total time: (K-1)N/B
K: number of GPUs
GPU0 GPU1 GPU2 GPU3
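The build-up on the last few slides can be sketched as a small simulation: the whole buffer hops one GPU per step, so K GPUs need K-1 steps of N/B each. The function names and the example link bandwidth of 20 GB/s are hypothetical, chosen only for illustration.

```python
def ring_broadcast_time(n_bytes, link_bw, num_gpus):
    """Unpipelined broadcast on a one-way ring: (K - 1) * N / B seconds."""
    return (num_gpus - 1) * n_bytes / link_bw

def simulate_ring_broadcast(num_gpus, root=0):
    """Return, step by step, the list of GPUs that hold the data."""
    holders = [root]
    history = []
    for _ in range(num_gpus - 1):
        # The most recent receiver forwards the whole buffer to its neighbor.
        holders.append((holders[-1] + 1) % num_gpus)
        history.append(list(holders))
    return history

history = simulate_ring_broadcast(4)
total = ring_broadcast_time(10_000_000, 20e9, 4)  # 20e9 bytes/s is an assumed value
print(history, f"{total * 1e3:.2f} ms")
```

Only one link is active at any moment in this scheme, which is the inefficiency the next slides address.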

30
Broadcast
Split the data into S messages
GPU0 GPU1 GPU2 GPU3

31
Broadcast
Step 1: Δt = N/(BS)
GPU0 GPU1 GPU2 GPU3

32
Broadcast
Step 1: Δt = N/(BS)
Step 2: Δt = N/(BS)
GPU0 GPU1 GPU2 GPU3

33
Broadcast
Step 1: Δt = N/(BS)
Step 2: Δt = N/(BS)
Step 3: Δt = N/(BS)
GPU0 GPU1 GPU2 GPU3

34
Broadcast
Step 1: Δt = N/(BS)
Step 2: Δt = N/(BS)
Step 3: Δt = N/(BS)
Step 4: Δt = N/(BS)
GPU0 GPU1 GPU2 GPU3

35
Broadcast
Step 1: Δt = N/(BS)
Step 2: Δt = N/(BS)
Step 3: Δt = N/(BS)
Step 4: Δt = N/(BS)
Step 5: Δt = N/(BS)
GPU0 GPU1 GPU2 GPU3

36
Broadcast
Step 1: Δt = N/(BS)
Step 2: Δt = N/(BS)
Step 3: Δt = N/(BS)
Step 4: Δt = N/(BS)
Step 5: Δt = N/(BS)
...
Total time:
(S+K-2)N/(BS) → N/B as S grows
GPU0 GPU1 GPU2 GPU3
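The step count in the total-time formula can be checked with a short simulation: each GPU forwards one message per step to its ring neighbor, so the last GPU receives its first message after K-1 steps and the remaining S-1 messages one step apart. The function names and the example bandwidth are hypothetical.

```python
def pipelined_broadcast_steps(num_gpus, num_messages):
    """Simulate a pipelined ring broadcast and return the number of steps."""
    # received[g] = how many messages GPU g holds; the root holds all S.
    received = [num_messages] + [0] * (num_gpus - 1)
    steps = 0
    while received[-1] < num_messages:
        snapshot = list(received)  # all transfers in a step happen at once
        for g in range(1, num_gpus):
            if snapshot[g - 1] > snapshot[g]:
                received[g] += 1   # neighbor forwards its next message
        steps += 1
    return steps

def pipelined_broadcast_time(n_bytes, link_bw, num_gpus, num_messages):
    """Total time (S + K - 2) * N / (B * S) in seconds."""
    return (num_messages + num_gpus - 2) * n_bytes / (link_bw * num_messages)

steps = pipelined_broadcast_steps(4, 5)
t = pipelined_broadcast_time(10_000_000, 20e9, 4, 1000)  # 20e9 bytes/s is assumed
print(steps, f"{t * 1e3:.3f} ms")
```

For 4 GPUs and 5 messages the simulation takes S+K-2 steps, and with a large S the total time approaches N/B regardless of the number of GPUs.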

37
Broadcast performance (time in ms; algbw and busbw in GB/s)
# bytes N type root time algbw busbw delta
10000000 10000000 char 0 0.317 31.52 31.52 0e+00
10000000 10000000 char 1 0.316 31.61 31.61 0e+00
10000000 10000000 char 2 0.300 33.28 33.28 0e+00
10000000 10000000 char 3 0.310 32.22 32.22 0e+00
10000000 10000000 char 4 0.318 31.49 31.49 0e+00
10000000 10000000 char 5 0.325 30.73 30.73 0e+00
10000000 10000000 char 6 0.312 32.04 32.04 0e+00
10000000 10000000 char 7 0.318 31.42 31.42 0e+00
10000000 2500000 int 0 0.309 32.32 32.32 0e+00
10000000 2500000 int 1 0.317 31.54 31.54 0e+00
10000000 2500000 int 2 0.306 32.71 32.71 0e+00
10000000 2500000 int 3 0.320 31.21 31.21 0e+00
10000000 2500000 int 4 0.322 31.05 31.05 0e+00
10000000 2500000 int 5 0.321 31.19 31.19 0e+00
10000000 2500000 int 6 0.311 32.15 32.15 0e+00
10000000 2500000 int 7 0.317 31.59 31.59 0e+00
10000000 5000000 half 0 0.313 31.95 31.95 0e+00
10000000 5000000 half 1 0.312 32.01 32.01 0e+00
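The algbw column appears to be the message size divided by the elapsed time. The sketch below checks this for the first row, assuming the time column is in milliseconds and the bandwidth columns are in GB/s (10^9 bytes per second); the function name is hypothetical.

```python
def algbw_gb_per_s(n_bytes, time_ms):
    """Algorithm bandwidth: bytes broadcast divided by elapsed time."""
    return n_bytes / (time_ms * 1e-3) / 1e9

# First row of the table above: 10,000,000 bytes, root 0, 0.317 ms.
bw = algbw_gb_per_s(10_000_000, 0.317)
print(f"{bw:.2f} GB/s")
```

The result matches the logged 31.52 GB/s to within the rounding of the printed time, and for broadcast the busbw column is identical to algbw in every row shown.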

38
DGX-1 software stack

39
Linear multi-GPU scaling with NVLink
[Figure: three line charts of speedup (1.0x to 8.0x) versus GPU count (1, 2, 4, 8 GPUs), comparing DGX-1 with P100 PCIE, for AlexnetOWT, Incep-v3, and ResNet-50. Speedup callouts, in extraction order: 2.3x, 1.3x, 1.5x]
Deepmark test with NVCaffe. AlexnetOWT uses batch 128, Incep-v3/ResNet-50 use batch 32, weak scaling,
P100 and DGX-1 are measured, FP32 training, software optimization in progress, CUDA8/cuDNN5.1, Ubuntu 14.04

40
Multi-GPU performance with NCCL
NVIDIA DGX-1, Chainer with NCCL patch
[Figure: three charts of scalability (0 to 8) versus number of GPUs (0 to 8) for ResNet (152 layers), VGG-D (16 layers), and AlexNet (7 layers). Series: NCCL (DGX-1), NCCL (1-ring), Gather & Bcast]
[Batch size per GPU] AlexNet:768, VGG-D:32, ResNet:12

41
Multi-GPU performance with NCCL
NVIDIA DGX-1, Chainer 1.17.0 with NCCL patch
[Figure: stacked bar chart, "Time per one batch (VGG-D)". Vertical axis: relative time to 1 GPU (0 to 2.5). Horizontal axis: 1 GPU, 2 GPUs, 4 GPUs, 8 GPUs, with bars for G&B, NCCL (1-ring), and NCCL (DGX-1) in each multi-GPU group. Stacked components: Forward, Backward, Allreduce, Update]

42
Containers already available
As of 2017/2/8

43
Microsoft Cognitive Toolkit: 170x Faster on DGX-1
[Figure: bar chart of throughput in images/sec (axis 0 to 14,000). CPU Server: 78. Server with 8x Tesla M40: 5,300 (60x faster vs CPU server). DGX-1: 13,000 (170x faster vs CPU server)]
Latest Framework Fully Optimized for NVIDIA DGX-1
AlexNet training batch size 128,
CNTK 2.0b2 for CPU. CNTK 2.0b3 (to be released) includes cuDNN 5.1.8, NCCL 1.6.1, NVLink enabled

46
Using multiple frameworks at the same time
Tuned Docker containers
[Figure: a DGX-1 running four containerized applications (TF, CNTK, Caffe, Torch). Each container holds the framework, its tuned SW, and a CUDA RT on NVIDIA Docker; all containers share the Linux Kernel + CUDA Driver]

47
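NVIDIA DGX-1 SOFTWARE STACK
A fully integrated DL platform
• Ready to use right away: plug-and-play, with AI framework support
• Optimization across the entire software stack
• Mixed-framework environments through containerization
• Direct access to NVIDIA experts
Shorten your time to value by building on the results of NVIDIA's R&D.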
