
Fusion of High-Performance Computing and AI/Big Data Processing on the AI Bridging Cloud Infrastructure (ABCI)


Slides for the 2nd HPC OPS Meeting
https://bit.riken.jp/2018/06/2nd-hpc-ops-mtg/



  1. Title slide: Fusion of High-Performance Computing and AI/Big Data Processing on the AI Bridging Cloud Infrastructure (ABCI)
  2. (slide text not recoverable)
  3. (slide text not recoverable)
  4. AI Bridging Cloud Infrastructure as World's First Large-scale Open AI Infrastructure
     • Open, public, and dedicated infrastructure for AI/Big Data
     • Platform to accelerate joint academic-industry R&D for AI in Japan
     • Top-level compute capability w/ 0.550 EFlops(AI), 37.2 PFlops(DP)
     • Univ. Tokyo Kashiwa II Campus; operation scheduled in 2018
  5. • 1088 compute nodes w/ 4352 NVIDIA Tesla V100 GPUs, 43,520 CPU cores, 476 TiB of memory, 1.6 PB of NVMe SSDs, 22 PB of HDD-based storage, and an InfiniBand EDR network
     • Ultra-dense IDC design from the ground up w/ 20x the thermal density of a standard IDC
     • Extreme green w/ ambient warm liquid cooling, high-efficiency power supplies, etc., commoditizing supercomputer cooling technologies to clouds (2.3 MW, 70 kW/rack)
     • Computing nodes: 0.550 EFlops(HP), 37 PFlops(DP), 476 TiB memory, 1.6 PB NVMe SSD; storage: 22 PB GPFS; gateway and firewall in front
     • High-performance computing nodes (w/ GPU) x1088: Intel Xeon Gold 6148 (2.4 GHz / 20 cores) x2, NVIDIA Tesla V100 (SXM2) x4, 384 GiB memory, 1.6 TB NVMe SSD
     • Multi-platform nodes (w/o GPU) x10: Intel Xeon Gold 6132 (2.6 GHz / 14 cores) x2, 768 GiB memory, 3.8 TB NVMe SSD
     • Storage: DDN SFA14K (w/ SS8462 enclosure x10) x3, 12 TB 7.2 Krpm NL-SAS HDD x2400, 3.84 TB SAS SSD x216, NSD servers x12; object storage / protocol nodes
     • Interactive nodes; service network (10 GbE / 100 GbE); external networks via SINET5; interconnect: InfiniBand EDR
  6. ABCI: AI Bridging Cloud Infrastructure — 0.550 EFlops(AI), 37.2 PFlops(DP); 19.88 PFlops (Peak), ranked #5 on the Top500, June 2018 (see the roll-up arithmetic below)
     • Chips: NVIDIA Tesla V100 (16GB SXM2) GPU at 7.8 TFlops(DP) / 125 TFlops(AI); Intel Xeon Gold 6148 (27.5M cache, 2.40 GHz, 20 cores) CPU at 1.53 TFlops(DP) / 3.07 TFlops(AI)
     • Compute node (4 GPUs, 2 CPUs): 34.2 TFlops(DP), 506 TFlops(AI), 384 GiB memory, 3.72 TB/s memory BW, 1.6 TB NVMe SSD, 200 Gbps network BW
     • Node chassis (2 compute nodes): 68.5 TFlops(DP), 1.01 PFlops(AI)
     • Rack (17 chassis): 1.16 PFlops(DP), 17.2 PFlops(AI), 131 TB/s memory BW, full bisection BW within a rack, 70 kW max
     • System (32 racks): 1088 compute nodes, 4352 GPUs, 37.2 PFlops(DP), 0.550 EFlops(AI), 4.19 PB/s memory BW, 1/3 oversubscription BW between racks, 2.3 MW
  7. GPU compute nodes (see the topology-inspection sketch below)
     • NVIDIA Tesla V100 (16GB, SXM2) x4
     • Intel Xeon Gold 6148 x2 sockets, 20 cores per socket
     • 384 GiB of DDR4 memory (DDR4-2666 32GB x6 per socket, 128 GB/s per socket)
     • 1.6 TB NVMe SSD x1 (Intel DC P4600, U.2)
     • EDR InfiniBand HCA (100 Gbps) x2, connected to the other compute nodes and the filesystems
     • Block diagram: two Xeon Gold 6148 linked by UPI (10.4 GT/s x3); PCIe gen3 x16 links through x48/x64 PCIe switches to the IB HCAs, the NVMe SSD, and the four Tesla V100 SXM2 GPUs; NVLink2 x2 between GPUs
  8. Rack as a dense-packaged "pod" (x32 pods)
     • Pod #1: CX400 chassis #1-#17, each holding two CX2570 compute nodes (CX2570 #1-#34)
     • LEAF#1-#4 (SB7890) and FBB#1-#3 (SB7890) switches inside the pod; SPINE#1-#2 (CS7500) above the pods
     • Full bisection BW within the pod: IB-EDR x72; 1/3 oversubscription BW between pods: IB-EDR x24
     • InfiniBand EDR link multiplicities in the figure: x1, x4, x6
  9. Hierarchical storage tiers (see the BeeOND sketch below)
     • Local storage: 1.6 TB NVMe SSD (Intel DC P4600, U.2) per node, used as burst buffers; local storage aggregation w/ BeeOND
     • Parallel filesystem: 22 PB of GPFS on DDN SFA14K (w/ SS8462 enclosure x10) x3 sets; bare-metal NSD servers and flash-based metadata volumes to accelerate metadata operations; home and shared use
     • Object storage: part of GPFS exposed through OpenStack Swift as campaign storage; S3-like API access, global shared use; additional secure volumes w/ encryption (planned)
  10. Performance reference for distributed deep learning (see the launch sketch below)
     • Environment: ABCI, 64 nodes (256 GPUs); framework: ChainerMN v1.3.0 (Chainer 4.2.0, CuPy 4.2.3, mpi4py 3.0.0, Python 3.6.5); bare metal: CentOS 7.4, gcc 4.8.5, CUDA 9.2, cuDNN 7.1.4, NCCL 2.2, OpenMPI 2.1.3
     • Dataset: ImageNet-1K; model: ResNet-50
     • Training: batch size 32 per GPU (32 x 256 in total); learning rate starting at 0.1 and multiplied by 0.1 at epochs 30, 60, and 80, w/ warm-up scheduling; optimizer: momentum SGD (momentum=0.9); weight decay: 0.0001; 100 training epochs
     • (Chart: higher is better)
  11. Job execution workflow (partially garbled in the original text): users log in over SSH, place a script file under /home (GPFS), and submit it with "$ qsub <option> script_filename"; the batch scheduler (NQS in the figure) dispatches jobs to compute nodes over the interconnect, targeting high-throughput computing (see the job-script sketch below)
  12. (slide text not recoverable)
  13. Pre-built software versions (cont'd) (module loading is shown in the job-script sketch below)
     • CUDA: 8.0 (8.0.44, 8.0.61.2), 9.0 (9.0.176), 9.1 (9.1.85, 9.1.85.1, 9.1.85.3), 9.2 (9.2.88.1)
     • cuDNN: 5.1 (5.1.5, 5.1.10), 6.0 (6.0.21), 7.0 (7.0.5), 7.1 (7.1.1, 7.1.2, 7.1.3)
     • NCCL: 1.3 (1.3.4, 1.3.40-1), 2.0 (2.0.5-3), 2.1 (2.1.4-1, 2.1.15-1), 2.2 (2.2.12)
     • MPI: OpenMPI 2.1.3, 3.0.1, 3.1.0; MVAPICH2-GDR 2.3a
     • Python: 2.7, 3.5, 3.6; Python modules: mpi4py, matplotlib, Cython, Pillow, Jupyter
     • DL frameworks: Caffe2, CNTK, ChainerMN, TensorFlow, MXNet, NNabla
  14. Software stack for ABCI
     • Batch job scheduler: high-throughput computing (Univa Grid Engine)
     • Minimum pre-installed software: users can deploy their own environments using anaconda, pip, python venv, etc., reducing operational cost (see the venv sketch below)
     • Container support: Singularity for multi-node jobs w/ user-customized images; Docker for single-node jobs w/ site-certified images
     • Stack diagram: CentOS/RHEL; Univa Grid Engine, Singularity, Docker; GPFS, BeeOND, OpenStack Swift; CUDA/cuDNN/NCCL; GCC, PGI, OpenMPI, MVAPICH2, Intel Parallel Studio XE Cluster Edition; Python, Ruby, R, Java, Scala, Perl, Lua, etc.; Hadoop/Spark, DL frameworks, OSS, ISV apps, user applications
  15. ABCI: dynamic container deployment with HPC Linux containers (Singularity, Docker) (see the container-workflow sketch below)
     • Users register/copy container images from a container image repository (Docker Hub, private registry) to GPFS/object storage
     • Jobs are submitted to the job scheduler together with container images
     • Compute nodes import/copy the container images and run the jobs inside them
  16. (comparison of container runtimes for HPC vs. enterprise use, incl. CharlieCloud; remaining slide text not recoverable)
  17. (slide text not recoverable)
  18. Building and using Singularity images (see the definition-file sketch below)
     • sudo singularity build --sandbox tmpdir/ Singularity
     • sudo singularity build --writable container.img Singularity
     • sudo singularity build container.img Singularity
     • sudo singularity build container.img docker://ubuntu
     • sudo singularity build container.img shub://ubuntu
     • sudo singularity shell --writable container.img (to modify a writable image)
     • singularity run container.img / singularity exec container.img ... / singularity shell container.img (to use an image)
  19. (slide text not recoverable)
  20. (slide text not recoverable)
  21. Running GPU applications in Singularity containers with the --nv option (remaining slide text not recoverable; see the sketch below)
  22. (slide text not recoverable)
  23. (benchmark settings and results; chart where higher is better; remaining slide text not recoverable)
  24. Distributed deep learning frameworks in containers (see the bind-mount sketch below)
     • Base drivers/libraries on the host: CUDA drivers, InfiniBand drivers, filesystem libraries (GPFS, Lustre)
     • Userland libraries in the container: CUDA, cuDNN, NCCL2, MPI (mpi4py)
     • Host and container are connected via mounts and ibverbs
     • Distributed deep learning frameworks on top: Caffe2, ChainerMN, distributed TensorFlow, MXNet
  25. (image-only slide; no recoverable text)
  26. (slide text not recoverable)
  27. (slide text not recoverable)
  28. (slide text not recoverable)
  29. (image-only slide; no recoverable text)
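
A few sketches follow, expanding on the slides above. First, the per-level peak numbers on slide 6 compose consistently from the per-chip figures; as a quick check using the slide's own values:

    \begin{aligned}
    \text{node (DP)}   &= 4 \times 7.8 + 2 \times 1.53 \approx 34.3\ \text{TFlops}\\
    \text{node (AI)}   &= 4 \times 125 + 2 \times 3.07 \approx 506\ \text{TFlops}\\
    \text{system (DP)} &= 1088 \times 34.2\ \text{TFlops} \approx 37.2\ \text{PFlops}\\
    \text{system (AI)} &= 1088 \times 506\ \text{TFlops} \approx 0.550\ \text{EFlops}
    \end{aligned}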
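
On a node laid out as on slide 7, the NVLink/PCIe layout can be checked with the standard NVIDIA driver tool (available wherever the driver is installed):

    # Connection matrix between GPUs and HCAs: NVLink vs. PCIe hops, CPU affinity
    nvidia-smi topo -m

    # Per-GPU model, memory, utilization, and driver version
    nvidia-smi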
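
Slide 9's local storage aggregation with BeeOND pools the per-node 1.6 TB NVMe SSDs of a job into a temporary BeeGFS instance used as a burst buffer. A minimal sketch, assuming a hostfile listing the job's nodes and illustrative paths (the actual ABCI integration may wrap these steps):

    # Create an on-demand BeeGFS across the job's nodes, backed by each node's NVMe SSD
    beeond start -n hostfile.txt -d /local/beeond-data -c /mnt/beeond

    # ... stage data into /mnt/beeond and run the job against it ...

    # Tear the instance down when the job ends (cleanup flags vary by BeeOND version)
    beeond stop -n hostfile.txt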
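
The slide-10 measurement (ChainerMN v1.3.0 on 64 nodes, 256 GPUs) maps one MPI rank to each GPU. A minimal launch sketch; train_imagenet.py and its options follow the ChainerMN ImageNet example and are assumptions about the actual benchmark script:

    # 64 nodes x 4 GPUs = 256 ranks, one per GPU
    # Per-GPU batch size 32, i.e. 32 x 256 = 8192 images per iteration in total
    mpirun -np 256 -npernode 4 \
        python train_imagenet.py --arch resnet50 --batchsize 32 --epoch 100 \
            --communicator pure_nccl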
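
Slides 11, 13, and 14 together describe the qsub-based workflow: write a script, load one of the pre-built library combinations, and submit. A minimal sketch in Univa Grid Engine style; the resource request and the module names/versions are illustrative assumptions, not ABCI's actual option names:

    #!/bin/bash
    # Run from the submission directory; request a wall-clock limit.
    # (Resource names and values are site-specific assumptions.)
    #$ -cwd
    #$ -l h_rt=1:00:00

    # Pick one of the pre-built CUDA/cuDNN/NCCL/MPI combinations from slide 13
    # (module names below are illustrative)
    source /etc/profile.d/modules.sh
    module load cuda/9.2 cudnn/7.1 nccl/2.2 openmpi/2.1.3

    python train.py

The script is then submitted as shown on slide 11: $ qsub <option> script_filename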
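
Since only a minimum of software is pre-installed (slide 14), users assemble their own Python stacks, e.g. with python venv and pip; the package list below is illustrative:

    # One-time setup under the home directory (GPFS)
    python3 -m venv ~/venv/dl
    source ~/venv/dl/bin/activate
    pip install --upgrade pip
    pip install cupy chainer chainermn mpi4py

    # In a job script, re-activate the same environment before running
    source ~/venv/dl/bin/activate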
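
The slide-15 workflow in commands, as a sketch: the user builds or imports an image where they have root (cf. slide 18), copies it to ABCI's shared storage, and submits a job that runs inside it. Host name, paths, and the job-script name are assumptions:

    # On the user's own machine (root required for build, as on slide 18)
    sudo singularity build container.img docker://ubuntu

    # Register/copy the image to the shared filesystem on ABCI
    scp container.img abci:/home/username/

    # Submit a job whose script runs the workload inside the image
    qsub run_in_container.sh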
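
The "Singularity" argument in the slide-18 build commands is a definition (recipe) file. A minimal sketch of one, written from the shell for self-containment; the base image matches the slide's docker://ubuntu example and the installed packages are illustrative:

    # Write a minimal definition file named "Singularity"
    cat > Singularity <<'EOF'
    Bootstrap: docker
    From: ubuntu:16.04

    %post
        apt-get update && apt-get install -y python3 python3-pip

    %runscript
        exec python3 "$@"
    EOF

    # Build it into an image, as on slide 18
    sudo singularity build container.img Singularity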
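
Slide 21's --nv option exposes the host's NVIDIA driver and GPU devices inside the container, so the image only needs the CUDA userland. A minimal sketch (train.py is illustrative):

    # Check that the GPUs are visible from inside the container
    singularity exec --nv container.img nvidia-smi

    # Run a GPU workload inside the container
    singularity exec --nv container.img python train.py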
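
Slide 24's split keeps the CUDA and InfiniBand drivers on the host while the container carries the userland stack (CUDA, cuDNN, NCCL2, MPI). A common way to realize this, sketched below, is to launch MPI on the host and start one container per rank, with --nv for the GPU driver and a bind mount for the InfiniBand provider configuration; the bind path, rank counts, and script name are assumptions:

    # Host-side MPI launches one container instance per rank
    mpirun -np 8 -npernode 4 \
        singularity exec --nv -B /etc/libibverbs.d container.img \
            python train_imagenet.py --communicator pure_nccl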
