AI Bridging Cloud Infrastructure (ABCI)
and its Communication Performance
Shinichiro Takizawa
The National Institute of Advanced Industrial Science and Technology
(AIST), Japan
Outline
• Introduction of AIST and ABCI
• ABCI Detail
– Architecture, software stack, network topology
• MPI performance on ABCI
• A recent application on ABCI
Introduction of AIST
• A research institute under the Ministry of Economy, Trade and Industry
(METI) of Japan
• Our mission
– Integrate scientific and engineering
knowledge to address the needs of
industry and society
– Bridge the gap between innovative
technological seeds and
commercialization
We built ABCI to popularize AI technologies in Japan
What is ABCI?
• World top-level computing and data processing capability
• Open, public, and dedicated infrastructure for AI and Big Data algorithms, software and applications
• Open innovation platform to accelerate joint academic-industry R&D for AI
AI Bridging Cloud Infrastructure
• 550 PFlops (FP16), 37.2 PFlops (FP64)
• Effective performance (as of June 2019)
– 19.88 PFlops (#8 in TOP500)
– 14.423 GFlops/W (#3 in Green500)
– 508.85 TFlops (HPCG)
The world's first large-scale Open AI Infrastructure
Univ. Tokyo / Kashiwa II Campus
OPERATED SINCE AUGUST 1ST, 2018
Achievements from the First Year
• 100+ projects, 1000+ users
– Academia, research institutes and companies
• Large Japanese companies have started using ABCI as their R&D platform
• ABCI supports NGC containers
• Sony and Fujitsu Laboratories achieved
world-leading ImageNet-1k training
performance on ABCI
• Two research papers that use ABCI
were accepted at SC19
https://blogs.nvidia.com/blog/2019/06/17/abci-adopts-ngc/
[Chart: ImageNet/ResNet-50 relative training speedup and accuracy for MSRA (2015), Facebook (2017), Google Brain (2017), Preferred Networks (2017), Tencent (2018), Sony + ABCI (2018, 2019), Google (2018, 2019), Fujitsu Lab + ABCI (2019), NVIDIA (2019)]
World’s Highest Speed in ImageNet-1k Training
MLPerf Training v0.6
SONY’s work
https://arxiv.org/abs/1811.05233
Fujitsu lab’s work
https://arxiv.org/abs/1903.12650
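The speedups behind these records come down to weak-scaling arithmetic: training time is roughly the total image count divided by aggregate throughput. A rough model, with per-GPU throughput and scaling efficiency as assumed illustrative numbers (not Sony's or Fujitsu's measured figures):

```python
# Rough scaling model for data-parallel ImageNet/ResNet-50 training time.
# Throughput and efficiency below are assumptions for illustration only.
IMAGES = 1_281_167          # ImageNet-1k training set size
EPOCHS = 90                 # typical ResNet-50 schedule

def training_time_s(num_gpus, images_per_sec_per_gpu, scaling_efficiency):
    throughput = num_gpus * images_per_sec_per_gpu * scaling_efficiency
    return IMAGES * EPOCHS / throughput

# e.g. 2048 GPUs at an assumed 900 img/s each with 80% scaling efficiency:
print(round(training_time_s(2048, 900, 0.8)))  # ~78 seconds
```

This is why sub-two-minute ImageNet training requires thousands of GPUs and high collective-communication efficiency.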
Gateway and Firewall
Interactive Nodes x 4
Interconnect (InfiniBand EDR)
Service Network (10GbE)
Management and Gateway Nodes x 15
High-Performance Computing System
550 PFlops(FP16), 37.2 PFlops(FP64)
476 TiB Memory, 1.74 PB NVMe SSD
Computing Nodes (w/ GPU) x 1088
Multi-platform Nodes (w/o GPU) x 10
• Intel Xeon Gold 6132 (2.6GHz/14cores) x 2
• 768GiB Memory, 3.8TB NVMe SSD, 1.5TB Intel Optane x2
22 PB GPFS (Group Shared Directory, etc.)
DDN SFA14K (w/ SS8462 Enclosure x 10) x3
• 12TB 7.2Krpm NL-SAS HDD x 2400
• 3.84TB SAS SSD x 216
• Mellanox CS7500 x 2
• Mellanox SB7890 x 229
• Nexus 3232C x2
• FortiGate 1500D x2
• FortiAnalyzer 400E x1
100Gbps
SINET5
GPU NVIDIA Tesla V100 SXM2 x 4
CPU Intel Xeon Gold 6148 (2.4GHz/20cores) x 2
Memory 384GiB
Local Storage Intel SSD DC P4600 (NVMe) 1.6TB x 1
Interconnect InfiniBand EDR x 2
ABCI Hardware Overview
Large-scale Storage System
1 PB Lustre (Home Directory)
DDN SFA14KX (w/ SS9012 Enclosure x 10) x1
• 7.68TB SAS SSD x 185 for data
• 960GB SAS SSD x 13 for metadata
17 PB Object Storage (in preparation)
HPE Apollo 4510 Gen10 x 24
• 12TB SATA HDD x 1440
• 3.2TB SSD x 24
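The headline peak figures follow from the node counts above. A back-of-the-envelope check, where the per-device peaks (V100 SXM2 FP16 tensor and FP64, Xeon Gold 6148 FP64) are assumptions taken from public spec sheets:

```python
# Back-of-the-envelope check of ABCI's aggregate peak performance.
NODES = 1088
GPUS_PER_NODE = 4
# Assumed device peaks: V100 SXM2 = 125 TFlops FP16 (tensor), 7.8 TFlops FP64;
# Xeon Gold 6148 = 20 cores x 2.4 GHz x 32 FP64 flops/cycle.
gpu_fp16_pflops = NODES * GPUS_PER_NODE * 125.0 / 1000
gpu_fp64_pflops = NODES * GPUS_PER_NODE * 7.8 / 1000
cpu_fp64_pflops = NODES * 2 * 20 * 2.4e9 * 32 / 1e15

print(gpu_fp16_pflops)                    # 544.0 -> quoted as 550 PFlops (FP16)
print(gpu_fp64_pflops + cpu_fp64_pflops)  # ~37.3 -> quoted as 37.2 PFlops (FP64)
```

The FP64 figure matches only when the 2176 CPUs are counted alongside the 4352 GPUs.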
ABCI Software Stack
Stack
Operating System RHEL / CentOS 7.6
Job Scheduler Univa Grid Engine 8.6.3
Container Engine
Docker 17.12.0 (Users can use only supported container images)
Singularity 2.6.1 (Users can use any container images)
MPI
Intel MPI 2018.2.199
MVAPICH2 2.3rc2, 2.3 / MVAPICH2-GDR 2.3a, 2.3rc1, 2.3, 2.3.1
OpenMPI 1.10.7, 2.1.3, 2.1.5, 2.1.6, 3.0.3, 3.1.0, 3.1.2, 3.1.3
Development tools
Intel Parallel Studio XE Cluster Edition 2017.8, 2018.2, 2018.3, 2019.3
PGI Professional Edition 17.10, 18.5, 18.10, 19.3
NVIDIA CUDA SDK 8.0, 9.0, 9.1, 9.2, 10.0
cuDNN 5.1, 6.0, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
NCCL 1.3.5, 2.1, 2.2, 2.3, 2.4
Intel MKL 2017.8, 2018.2, 2018.3, 2019.3
GCC, Python, Ruby, R, OpenJDK, Go, Perl
Deep Learning
Caffe, Caffe2, TensorFlow, Theano, Torch, PyTorch, CNTK, MXNet, Chainer, Keras, etc.
(Frameworks provided by NVIDIA GPU Cloud can also be deployed)
Big Data Processing Hadoop, Spark
ABCI Compute Node
[Node block diagram: two Xeon Gold 6148 (Skylake) CPUs linked by UPI x3 (10.4 GT/s), each with 32GB x 6 DDR4-2666 (128 GB/s per socket); per CPU, a PCIe gen3 x16 switch (x48 / x64 lanes) connects one InfiniBand EDR HCA (100 Gbps) and two Tesla V100 SXM2 GPUs; the four GPUs are interconnected with NVLink2 x2; one NVMe SSD attached]
FUJITSU PRIMERGY Server (2 servers in 2U)
CPU Xeon Gold 6148 (27.5M Cache, 2.40 GHz, 20 Core) x2
GPU NVIDIA Tesla V100 (SXM2) x4
Memory 384GiB DDR4 2666MHz RDIMM
Local Storage 1.6TB NVMe SSD (Intel SSD DC P4600 u.2) x1
Interconnect InfiniBand EDR x2
ABCI Node/Rack Interconnect
[Topology diagram: in each rack, 17 CX400 chassis (2 CX2570 nodes each; 34 nodes per rack) connect to 4 leaf switches (Mellanox SB7890) with InfiniBand EDR x1 links per HCA; leaves connect to 3 full-bisection-bandwidth (FBB) switches (SB7890) with EDR x6; FBBs connect to 2 spine switches (Mellanox CS7500) with EDR x4. Intra-rack: full bisection BW (IB-EDR x 72). Inter-rack: 1/3 oversubscription BW (IB-EDR x 24).]
• Densely packaged: 34 nodes per rack, with high cooling capability
• Interconnect
– Fat tree topology
– Intra-rack: full bisection BW
– Inter-rack: 1/3 oversubscription BW
– Without adaptive routing
– Without Mellanox SHARP
Grounds for the Interconnect Design
• 1/3 over-subscription network
– A cost effective solution
• Many DL training apps do not use large number of nodes
• High bandwidth network is required by only a small set of nodes
• Without adaptive routing
– InfiniBand is shared by compute and IO communication, and the delivery
vendor advised against using adaptive routing in such a configuration
• Without Mellanox SHARP
– We use EDR switches, which do not support large messages in
SHARP
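The "1/3 over-subscription" figure follows directly from the link counts on the topology slide. A minimal sketch of that arithmetic (100 Gbps per EDR link is the line rate, not payload bandwidth):

```python
# Why the inter-rack network is "1/3 over-subscription": each rack has
# 72 EDR links of intra-rack full-bisection bandwidth but only 24 EDR
# uplinks toward the spine switches (link counts from the topology slide).
EDR_GBPS = 100               # EDR line rate per link
intra_rack_links = 72
uplinks_to_spine = 24

oversubscription = uplinks_to_spine / intra_rack_links
print(oversubscription)               # 0.333... -> "1/3"
print(uplinks_to_spine * EDR_GBPS)    # 2400 Gbps of inter-rack capacity per rack
```

The cost saving comes from the 48 spine-facing links (and spine ports) each rack does not need.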
Distribution of Number of Used GPUs/Nodes in DL Jobs
• Workload is collected from pre-
ABCI system, AAIC
– A system dedicated for AI research
• Single-GPU jobs are dominant and
the degree of parallelism is low
• ABCI is designed as a capacity
computing system for AI
[Charts: share of single-GPU, single-node and multi-node jobs; distribution of node counts in multi-node jobs. AAIC: 32-38 nodes, 8 GPUs/node, 256-304 GPUs]
Workload is published under: https://github.com/aistairc/aaic-workload
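Tallying the degree of parallelism from such a trace is a one-liner with a counter. A sketch with a hypothetical record layout and toy data (see the published AAIC workload for the actual format):

```python
# Sketch: degree-of-parallelism distribution from a job trace.
# The records below are toy data, not real AAIC jobs.
from collections import Counter

jobs = [  # (job_id, num_gpus) -- hypothetical field layout
    ("j1", 1), ("j2", 1), ("j3", 1), ("j4", 4), ("j5", 8), ("j6", 16),
]

dist = Counter(gpus for _, gpus in jobs)
single_gpu_share = dist[1] / len(jobs)
print(dict(dist))        # {1: 3, 4: 1, 8: 1, 16: 1}
print(single_gpu_share)  # 0.5 here; single-GPU jobs dominate the real trace too
```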
Performance Impact on MPI
under 1/3 over-subscription without adaptive routing
• Measured P2P host-mem to host-mem transfer
performance using OpenMPI 2.1.6
• Intra-rack: 17 nodes to 17 nodes
• Inter-rack: 34 nodes to 34 nodes
• Increased the number of node pairs
[Diagram: node pairs communicating between Group0 (Rack0) and Group1 (Rack1)]
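The "Ideal" curves in the following charts can be reproduced with simple link arithmetic: each node injects over two EDR HCAs, and all inter-rack pairs share the 24 uplink EDR links. A minimal model, where ~12.5 GB/s of payload bandwidth per EDR link is an assumption:

```python
# Simple model of ideal aggregate throughput under the ABCI topology.
# Per-link payload bandwidth is an assumption (EDR is 100 Gbps line rate).
PER_LINK_GBS = 12.5
PER_NODE_GBS = 2 * PER_LINK_GBS          # two HCAs per node
INTER_RACK_CAP_GBS = 24 * PER_LINK_GBS   # shared spine uplinks per rack

def ideal_total(n_pairs, inter_rack):
    total = n_pairs * PER_NODE_GBS       # linear scaling until a cap is hit
    return min(total, INTER_RACK_CAP_GBS) if inter_rack else total

print(ideal_total(17, inter_rack=False))  # 425.0 GB/s intra-rack
print(ideal_total(34, inter_rack=True))   # 300.0 GB/s, capped by the uplinks
```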
Intra-Rack P2P Performance
[Charts: per-node average throughput and aggregated total throughput (GB/s) vs. number of intra-group connections, ideal vs. measured]
More than 80% of the theoretical performance is achieved.
Inter-Rack P2P Performance
[Charts: per-node average throughput and total rack throughput (GB/s) vs. number of inter-rack connections, ideal vs. measured]
Throughput degrades sharply at 6 and 20 connections and falls far short of the ideal performance
Collective Comm. Performance on ABCI
• Measured the performance of two collective communication patterns
– Allreduce: essential in distributed training of DL
– Reduce: the heaviest communication in our application
• Used OSU Micro-Benchmarks 5.6.1
– NCCL Bench 2.0.0 was used for NCCL
– Default parameters
Host-to-host transfer:
• MVAPICH2 2.3.1
• IntelMPI 2018 Update 2
• OpenMPI 3.1.3
GPU-to-GPU transfer:
• MVAPICH2 2.3.1
• MVAPICH2-GDR 2.3.1 (gdrcopy 1.2)
• OpenMPI 3.1.3
• NCCL 2.4.7
Compiler: GCC 4.8.5, CUDA 10.0.130
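Allreduce dominates distributed DL training, and the libraries compared below implement it with ring- or tree-based schedules. A minimal pure-Python ring allreduce over simulated ranks, for illustration only (MVAPICH2, OpenMPI and NCCL implement this natively over the network):

```python
# Minimal ring allreduce (sum) over simulated ranks: each of the p ranks
# owns a vector; after 2*(p-1) steps every rank holds the elementwise sum.
def ring_allreduce(bufs):
    p = len(bufs)
    data = [list(b) for b in bufs]
    n = len(data[0])
    assert n % p == 0, "vector length must divide evenly into p segments"
    s = n // p

    def seg(r, i):
        return data[r][i * s:(i + 1) * s]   # copy of segment i on rank r

    # Phase 1: reduce-scatter -- after p-1 steps, rank r owns the complete
    # sum for segment (r+1) % p.
    for step in range(p - 1):
        sends = [((r + 1) % p, (r - step) % p, seg(r, (r - step) % p))
                 for r in range(p)]
        for dst, i, payload in sends:
            for k in range(s):
                data[dst][i * s + k] += payload[k]

    # Phase 2: allgather -- circulate the completed segments around the ring.
    for step in range(p - 1):
        sends = [((r + 1) % p, (r + 1 - step) % p, seg(r, (r + 1 - step) % p))
                 for r in range(p)]
        for dst, i, payload in sends:
            data[dst][i * s:(i + 1) * s] = payload
    return data

print(ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
# every rank ends with [12, 15, 18]
```

The ring schedule is bandwidth-optimal for large messages, which is one reason small- and large-message rankings differ across the libraries in the next slides.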
H2H Allreduce
Intra-Rack (34 Nodes) / Inter-Rack (68 Nodes)
• MVAPICH2 is comparable to IntelMPI
• Performance does not suffer from the 1/3 over-subscribed inter-rack bandwidth
H2H Reduce
Intra-Rack (34 Nodes) / Inter-Rack (68 Nodes)
MVAPICH2 is better than the others for small messages, but IntelMPI outperforms MVAPICH2 for large messages
D2D Allreduce
Intra-Rack (34 Nodes) / Inter-Rack (68 Nodes)
MVAPICH2-GDR and NCCL outperform the others for large messages
D2D Reduce
Intra-Rack (34 Nodes) / Inter-Rack (68 Nodes)
MVAPICH2-GDR achieves the best performance
Application Performance: iFDK
• A high-speed and high-resolution CT image reconstruction framework
running on GPU clusters
– Create 3D images from multiple 2D x-ray images
– Demands for high-resolution CT image:
non-invasive inspection, reverse engineering, etc.
• What we did
– GPU kernel whose compute cost is 1/6 of the standard
algorithm
– Efficient FDK computation by overlapping CPU comp.,
GPU comp. and comm.
– Distributed framework for high-resolution
image reconstruction
Peng Chen, Mohamed Wahib, Shinichiro Takizawa, Ryousei Takano, Satoshi Matsuoka.
iFDK: A Scalable Framework for Instant High-resolution Image Reconstruction
[Figure: CBCT geometry and trajectory with a micro-focus X-ray source]
Will be presented at SC19
Distributed Framework for High-resolution
Image Reconstruction
[Pipeline diagram: Load (on CPUs) → Filtering (on CPUs) → AllGather → Back-projection (on GPUs) → Reduce → Store (on CPUs); input: 2D projections, output: 3D volume]
Example Case
[Diagram: 4096 input 2D projections (16 MB/img) are filtered and allgathered across processes; back-projection produces partial volumes that are reduced into 64 output sub-volumes (8 GB/vol)]
Input: 4096 2K images, Output: 4K^3 volume, 32 nodes
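The Reduce stage is the data movement that later dominates the MPI performance gap: every node holds a partial back-projection of each sub-volume, and sub-volume v is summed onto its owner. A toy sketch of that reduction with scaled-down sizes (the real job reduces 64 sub-volumes of 8 GB across 32 nodes):

```python
# Toy model of iFDK's final Reduce: each node holds a partial result for
# every sub-volume; each sub-volume is summed across all nodes.
NUM_NODES = 4      # scaled down from 32
NUM_VOLS = 8       # scaled down from 64
VOL_LEN = 3        # scaled down from 8 GB of voxels

partials = [[[1.0] * VOL_LEN for _ in range(NUM_VOLS)]
            for _ in range(NUM_NODES)]

def reduce_volumes(partials):
    num_vols = len(partials[0])
    out = []
    for v in range(num_vols):
        acc = [0.0] * len(partials[0][v])
        for node in partials:      # MPI_Reduce performs this sum per owner rank
            acc = [a + x for a, x in zip(acc, node[v])]
        out.append(acc)
    return out

vols = reduce_volumes(partials)
print(vols[0])   # [4.0, 4.0, 4.0] -- contributions from all 4 nodes
```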
Performance Result
• Large performance gap between
MPI implementations
– Mainly due to Reduce
– IntelMPI was used in the paper
• Next steps
– Tune parameters for collectives
– Use CUDA-aware MPI collectives
[Chart: execution time (seconds) broken down into Allgather, Reduce, IO and Compute for IntelMPI, MVAPICH2 and OpenMPI]
Input: 4096 2K images, Output: 4K^3 volume, 32 nodes
Summary
• Introduced ABCI, an open AI platform
– Architecture, software stack and latest achievements
– Detailed network architecture
• MPI communication performance on ABCI
– Effect of the 1/3 inter-rack over-subscription bandwidth on P2P and
collectives
– An application performance example
• Future work
– Evaluate the latest MVAPICH2 and MVAPICH2-GDR
– Provide detailed communication performance results on ABCI to users
https://abci.ai/
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
 
Data Parallel Deep Learning
Data Parallel Deep LearningData Parallel Deep Learning
Data Parallel Deep Learning
 
Making Supernovae with Jets
Making Supernovae with JetsMaking Supernovae with Jets
Making Supernovae with Jets
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolvers
 
Scientific Applications and Heterogeneous Architectures
Scientific Applications and Heterogeneous ArchitecturesScientific Applications and Heterogeneous Architectures
Scientific Applications and Heterogeneous Architectures
 
SW/HW co-design for near-term quantum computing
SW/HW co-design for near-term quantum computingSW/HW co-design for near-term quantum computing
SW/HW co-design for near-term quantum computing
 

Recently uploaded

What is an RPA CoE? Session 2 – CoE Roles
What is an RPA CoE?  Session 2 – CoE RolesWhat is an RPA CoE?  Session 2 – CoE Roles
What is an RPA CoE? Session 2 – CoE Roles
DianaGray10
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
Javier Junquera
 
Christine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptxChristine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptx
christinelarrosa
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Neo4j
 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Neo4j
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
Neo4j
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
Fwdays
 
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
zjhamm304
 
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
Edge AI and Vision Alliance
 
ScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking ReplicationScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking Replication
ScyllaDB
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
Ivo Velitchkov
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
Alex Pruden
 
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
DanBrown980551
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
Fwdays
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
LizaNolte
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
christinelarrosa
 
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeckPoznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
FilipTomaszewski5
 

Recently uploaded (20)

What is an RPA CoE? Session 2 – CoE Roles
What is an RPA CoE?  Session 2 – CoE RolesWhat is an RPA CoE?  Session 2 – CoE Roles
What is an RPA CoE? Session 2 – CoE Roles
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
 
Christine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptxChristine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptx
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
 
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
 
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
 
ScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking ReplicationScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking Replication
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
 
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
 
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeckPoznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
 

AI Bridging Cloud Infrastructure (ABCI) and its communication performance

  • 1. AI Bridging Cloud Infrastructure (ABCI) and its Communication Performance Shinichiro Takizawa The National Institute of Advanced Industrial Science and Technology (AIST), Japan
  • 2. Outline • Introduction of AIST and ABCI • ABCI Detail – Architecture, software stack, network topology • MPI performance on ABCI • A latest application on ABCI
  • 3. Introduction of AIST • A research institute as a part of the Ministry of Economy, Trade and Industry (METI) of Japan • Our mission – Integrate scientific and engineering knowledge to address industry and society needs – Bridge the gap between innovative technological seeds and commercialization We built ABCI to popularize AI technologies in Japan
  • 4. 4 ABCI: AI Bridging Cloud Infrastructure. A large-scale open computing infrastructure for AI (Big Data, Algorithms, Software and Applications), built to accelerate joint academic-industry R&D for AI. Located at the Univ. Tokyo / Kashiwa II Campus. OPERATED SINCE AUGUST 1ST, 2018
  • 5. Some of 1 Year Achievements • 100+ projects, 1000+ users – Academia, research institutes and companies • Large Japanese companies have started to use ABCI as their R&D platform • ABCI supports NGC containers • Sony and Fujitsu Lab achieved record performance on ImageNet-1k classification on ABCI • Two research papers that use ABCI were accepted by SC19 https://blogs.nvidia.com/blog/2019/06/17/abci-adopts-ngc/
  • 6. 6 [chart: ImageNet / ResNet-50 training, relative speedup and accuracy, for MSRA (2015), Facebook (2017), Google Brain (2017), Preferred Networks (2017), Tencent (2018), Sony + ABCI (2018), Google (2018), Google (2018), Sony + ABCI (2019), Fujitsu Lab + ABCI (2019), NVIDIA (2019), Google (2019), Fujitsu Lab + ABCI (2019)] World’s Highest Speed in ImageNet-1k Training MLPerf Training v0.6 SONY’s work https://arxiv.org/abs/1811.05233 Fujitsu Lab’s work https://arxiv.org/abs/1903.12650
  • 7. 7 Gateway and Firewall Interactive Nodes x 4 Interconnect (InfiniBand EDR) Service Network (10GbE) Management and Gateway Nodes x 15 High-Performance Computing System 550 PFlops(FP16), 37.2 PFlops(FP64) 476 TiB Memory, 1.74 PB NVMe SSD Computing Nodes (w/ GPU) x 1088 Multi-platform Nodes (w/o GPU) x 10 • Intel Xeon Gold 6132 (2.6GHz/14cores) x 2 • 768GiB Memory, 3.8TB NVMe SSD, 1.5TB Intel Optane x2 22 PB GPFS (Group Shared Directory, etc.) DDN SFA14K (w/ SS8462 Enclosure x 10) x3 • 12TB 7.2Krpm NL-SAS HDD x 2400 • 3.84TB SAS SSD x 216 • Mellanox CS7500 x 2 • Mellanox SB7890 x 229 • Nexus 3232C x2 • FortiGate 1500D x2 • FortiAnalyzer 400E x1 100Gbps SINET5 GPU NVIDIA Tesla V100 SXM2 x 4 CPU Intel Xeon Gold 6148 (2.4GHz/20cores) x 2 Memory 384GiB Local Storage Intel SSD DC P4600 (NVMe) 1.6TB x 1 Interconnect InfiniBand EDR x 2 ABCI Hardware Overview Large-scale Storage System 1 PB Lustre (Home Directory) DDN SFA14KX (w/ SS9012 Enclosure x 10) x1 • 7.68TB SAS SSD x 185 for data • 960GB SAS SSD x 13 for metadata 17 PB Object Storage (Preparing) HPE Apollo 4510 Gen10 x 24 • 12TB SATA HDD x 1440 • 3.2TB SSD x 24
  • 8. 8 ABCI Software Stack Stack Operating System RHEL / CentOS 7.6 Job Scheduler Univa Grid Engine 8.6.3 Container Engine Docker 17.12.0 (Users can use only supported container images) Singularity 2.6.1 (Users can use any container images) MPI Intel MPI 2018.2.199 MVAPICH2 2.3rc2, 2.3 / MVAPICH2-GDR 2.3a, 2.3rc1, 2.3, 2.3.1 OpenMPI 1.10.7, 2.1.3, 2.1.5, 2.1.6, 3.0.3, 3.1.0, 3.1.2, 3.1.3 Development tools Intel Parallel Studio XE Cluster Edition 2017.8, 2018.2, 2018.3, 2019.3 PGI Professional Edition 17.10, 18.5, 18.10, 19.3 NVIDIA CUDA SDK 8.0, 9.0, 9.1, 9.2, 10.0 cuDNN 5.1, 6.0, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5 NCCL 1.3.5, 2.1, 2.2, 2.3, 2.4 Intel MKL 2017.8, 2018.2, 2018.3, 2019.3 GCC, Python, Ruby, R, OpenJDK, Go, Perl Deep Learning Caffe, Caffe2, TensorFlow, Theano, Torch, PyTorch, CNTK, MXnet, Chainer, Keras, etc. (Frameworks provided by NVIDIA GPU Cloud can also be deployed) Big Data Processing Hadoop, Spark
  • 9. 9 ABCI Compute Node. [diagram: Xeon Gold 6148 x2 linked by UPI 10.4GT/s x3, DDR4-2666 32GB x 6 per socket (128GB/s each), IB HCA (100Gbps) x2, NVMe SSD; PCIe gen3 x16 via x48 and x64 PCIe switches to Tesla V100 SXM2 x4, GPUs interconnected with NVLink2 x2] FUJITSU PRIMERGY Server (2 servers in 2U) CPU Xeon Gold 6148 (27.5M Cache, 2.40 GHz, 20 Core) x2 GPU NVIDIA Tesla V100 (SXM2) x4 Memory 384GiB DDR4 2666MHz RDIMM Local Storage 1.6TB NVMe SSD (Intel SSD DC P4600 u.2) x1 Interconnect InfiniBand EDR x2
  • 10. 10 ABCI Rack and Interconnect. [diagram: each rack holds 17 CX400 chassis (34 CX2570 nodes) connected through LEAF#1-4 (SB7890) and FBB#1-3 (SB7890) with full bisection BW (IB-EDR x 72); racks connect to SPINE#1-2 (CS7500) with 1/3 oversubscription BW (IB-EDR x 24); links are InfiniBand EDR x1, x4 and x6] Interconnect: fat-tree topology; intra-rack: full bisection bandwidth; inter-rack: 1/3 oversubscription; without adaptive routing; without Mellanox SHARP
  • 11. Grounds for the Interconnect Design • 1/3 over-subscription network – A cost-effective solution • Many DL training apps do not use a large number of nodes • High-bandwidth network is required by only a small set of nodes • Without adaptive routing – InfiniBand is shared by compute and IO communication, and the delivery vendor suggested not using adaptive routing in such a configuration • Without Mellanox SHARP – We use EDR and switches that do not support large messages in SHARP
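As a quick sanity check of the 1/3 figure, the link counts on the rack diagram (72 full-bisection EDR links inside a rack vs. 24 EDR uplinks per rack) give the ratio directly. This is a minimal sketch using those numbers; the per-link rate (EDR ≈ 12.5 GB/s per direction) is an assumption.

```python
# Sketch: derive the inter-rack over-subscription ratio from the link
# counts on the rack diagram (72 full-bisection EDR links inside a rack,
# 24 EDR uplinks from each rack to the spine switches).
EDR_GBS = 12.5  # assumed: InfiniBand EDR, ~100 Gbps ~= 12.5 GB/s per link

intra_rack_links = 72
uplinks_per_rack = 24

ratio = intra_rack_links / uplinks_per_rack
print(f"over-subscription: {ratio:.0f}:1 -> inter-rack bandwidth is 1/{ratio:.0f}")
print(f"rack uplink capacity: {uplinks_per_rack * EDR_GBS:.0f} GB/s")
```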
  • 12. Distribution of Number of Used GPUs/Nodes in DL Jobs • Workload is collected from the pre-ABCI system, AAIC – A system dedicated to AI research • Single-GPU jobs are dominant and the degree of parallelism is low • ABCI is designed as a capacity computing system for AI #Nodes: 32-38 #GPUs/Node: 8 #GPUs: 256-304 Single GPU, Single Node and Multi Node jobs # Used Nodes in Multi Node Jobs Workload is published under: https://github.com/aistairc/aaic-workload
  • 13. Performance Impact on MPI under 1/3 over-subscription without adaptive routing • Measured P2P host-mem to host-mem transfer performance using OpenMPI 2.1.6 • Intra-rack: 17 nodes - 17 nodes • Inter-rack: 34 nodes - 34 nodes • Increase # of node pairs [diagram: node pairs communicating across Group0 (Rack0) and Group1 (Rack1)]
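To set expectations for the inter-rack results, here is a toy model (my assumption, not from the slides): each node pair can drive its two EDR rails until the aggregate traffic hits the rack's 24-uplink cap. Link rates (12.5 GB/s per EDR link) are likewise assumed.

```python
# Toy model of inter-rack P2P throughput under 1/3 over-subscription:
# n concurrent node pairs share the 24 EDR uplinks between two racks.
EDR_GBS = 12.5             # assumed per-link rate
PER_NODE = 2 * EDR_GBS     # two EDR rails per node (ideal injection rate)
UPLINK_CAP = 24 * EDR_GBS  # rack-to-spine capacity

def modeled(n_pairs):
    """Return (per-node GB/s, aggregate GB/s) for n concurrent pairs."""
    aggregate = min(n_pairs * PER_NODE, UPLINK_CAP)
    return aggregate / n_pairs, aggregate

for n in (1, 6, 12, 24, 34):
    per_node, total = modeled(n)
    print(f"{n:2d} pairs: {per_node:5.1f} GB/s per node, {total:6.1f} GB/s total")
```

The measured dips at 6 and 20 connections on slide 15 fall below even this cap, consistent with path conflicts under static routing, which this capacity-only model ignores.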
  • 14. Intra-Rack P2P Performance 14 [charts: Per-Node Average Throughput (GB/s) and Aggregated Throughput (GB/s) vs. # Intra-Group Connections, Ideal vs. Measured] More than 80% of the theoretical performance is achieved.
  • 15. Inter-Rack P2P Performance 15 [charts: Per-Node Average Throughput (GB/s) and Total Rack Throughput (GB/s) vs. # Inter-Rack Connections, Ideal vs. Measured] Large performance degradation at 6 and 20 connections, and far from ideal performance
  • 16. Collective Comm. Performance on ABCI • Measured performance of two collective comm. patterns – Allreduce: essential in distributed training of DL – Reduce: the heaviest comm. in our application • Used OSU Micro-Benchmarks 5.6.1 – NCCL Bench 2.0.0 is used for NCCL – Default parameters MPI Version MVAPICH2 MVAPICH2 2.3.1 IntelMPI IntelMPI 2018 Update 2 OpenMPI OpenMPI 3.1.3 MPI, NCCL Version MVAPICH2 MVAPICH2 2.3.1 MVAPICH2-GDR MVAPICH2-GDR 2.3.1, gdrcopy 1.2 OpenMPI OpenMPI 3.1.3 NCCL NCCL 2.4.7 Compiler GCC 4.8.5 CUDA 10.0.130 Host to Host Transfer GPU to GPU Transfer
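For intuition on why Allreduce can scale well even over the over-subscribed links: in the classic ring algorithm (the textbook scheme NCCL is known for; whether these particular runs used it is not stated on the slides), the bytes each rank sends are nearly independent of the rank count. A minimal sketch:

```python
# Ring allreduce traffic model: for an n-byte buffer across p ranks, each
# rank sends 2*(p-1)/p * n bytes (reduce-scatter phase + allgather phase),
# which approaches 2n regardless of p.
def ring_allreduce_bytes_per_rank(n_bytes: int, p: int) -> float:
    return 2 * (p - 1) / p * n_bytes

msg = 256 * 2**20  # illustrative 256 MiB gradient buffer
for p in (4, 34, 68):
    sent = ring_allreduce_bytes_per_rank(msg, p)
    print(f"p={p:2d}: {sent / 2**20:6.1f} MiB sent per rank")
```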
  • 17. H2H Allreduce 17 Inter-Rack (68 Nodes) Intra-Rack (34 Nodes) • MVAPICH2 is comparable to IntelMPI • Performance does not suffer from the 1/3 over-subscribed inter-rack bandwidth
  • 18. H2H Reduce 18 Inter-Rack (68 Nodes) Intra-Rack (34 Nodes) MVAPICH2 is better than the others for small messages, but IntelMPI outperforms MVAPICH2 for large messages
  • 19. D2D Allreduce 19 Inter-Rack (68 Nodes) Intra-Rack (34 Nodes) MVAPICH2-GDR and NCCL outperform the others for large messages
  • 20. D2D Reduce 20 Inter-Rack (68 Nodes) Intra-Rack (34 Nodes) MVAPICH2-GDR achieves the best performance
  • 21. Application Performance: iFDK • A high-speed and high-resolution CT image reconstruction framework running on GPU clusters – Creates 3D images from multiple 2D X-ray images – Demand for high-resolution CT images: non-invasive inspection, reverse engineering, etc. • What we did – A GPU kernel whose compute cost is 1/6 of the standard algorithm – Efficient FDK computation by overlapping CPU computation, GPU computation and communication – A distributed framework for high-resolution image reconstruction Peng Chen, Mohamed Wahib, Shinichiro Takizawa, Ryousei Takano, Satoshi Matsuoka. iFDK: A Scalable Framework for Instant High-resolution Image Reconstruction CBCT geometry and trajectory Micro-focus X-ray source Will be presented at SC19
  • 22. Distributed Framework for High-resolution Image Reconstruction 22 [pipeline: Load (on CPUs) → Filtering (on CPUs) → Back-projection (on GPUs) → AllGather → Reduce → Store (on CPUs); Input: 2D Projections, Output: 3D Volume]
  • 23. Example Case 23 [diagram: 4096 input 2D projections (16 MB/img x 4096) distributed over 32 nodes; output 4k^3 3D volume as 64 subvolumes, vol 0 … vol 63 (8 GB/vol x 64); Allgather then Reduce] Input: 4k count of 2k images, Output: 4k^3, 32 Nodes
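The slide's data sizes check out under plausible element-size assumptions (4-byte pixels for the projections and 8-byte voxels for the output volume; these are my assumptions, chosen to match the stated 16 MB/img and 8 GB/vol):

```python
# Verify the slide's figures: 4096 projections of 2048x2048 pixels at
# 16 MB each, and a 4096^3 output volume split into 64 subvolumes of
# 8 GB each. Element sizes (4 B/pixel, 8 B/voxel) are assumptions that
# make the arithmetic match.
proj_bytes = 2048 * 2048 * 4         # one 2k x 2k projection, 4 B/pixel
subvol_bytes = (4096**3 // 64) * 8   # one of 64 subvolumes of the 4k^3 volume

print(f"projection: {proj_bytes // 2**20} MiB each, x 4096 images")
print(f"subvolume:  {subvol_bytes // 2**30} GiB each, x 64 volumes")
print(f"total output: {64 * subvol_bytes // 2**30} GiB")
```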
  • 24. Performance Result • Large performance gap between MPI implementations – Mainly due to Reduce – Used IntelMPI in the paper • Next steps – Tune parameters for collectives – Use CUDA-aware MPI collectives [chart: Time (seconds) for Allgather, Reduce, IO and Compute with IntelMPI, MVAPICH2 and OpenMPI] Input: 4k count of 2k images, Output: 4k^3, 32 Nodes
  • 25. Summary • Introduced ABCI, an open AI platform – Architecture, software stack and latest achievements – Detailed network architecture • MPI communication performance on ABCI – Effect of the 1/3 inter-rack over-subscribed bandwidth on P2P and collectives – An application performance example • Future work – Evaluate the latest MVAPICH2 and MVAPICH2-GDR – Provide detailed comm. performance results on ABCI to users https://abci.ai/