AI Bridging Cloud Infrastructure (ABCI)
and its Communication Performance
Shinichiro Takizawa
The National Institute of Advanced Industrial Science and Technology
(AIST), Japan
Outline
• Introduction of AIST and ABCI
• ABCI Detail
– Architecture, software stack, network topology
• MPI performance on ABCI
• A recent application on ABCI
Introduction of AIST
• A research institute under the Ministry of Economy, Trade and Industry
(METI) of Japan
• Our mission
– Integrate scientific and engineering
knowledge to address the needs of
industry and society
– Bridge the gap between innovative
technological seeds and
commercialization
We built ABCI to popularize AI technologies in Japan
What is ABCI?
• World top-level computing and data processing capability
• Open, public, and dedicated infrastructure for AI and Big Data algorithms, software and applications
• Open innovation platform to accelerate joint academic-industry R&D for AI
AI Bridging Cloud Infrastructure
• 550 PFlops (FP16), 37.2 PFlops (FP64)
• Effective performance (as of June 2019)
– 19.88 PFlops (#8 in TOP500)
– 14.423 GFlops/W (#3 in Green500)
– 508.85 TFlops (HPCG)
The world's first large-scale Open AI Infrastructure
Univ. Tokyo / Kashiwa II Campus
OPERATED SINCE AUGUST 1ST, 2018
Achievements from the First Year
• 100+ projects, 1000+ users
– Academia, research institutes and companies
• Large Japanese companies have started using ABCI as their R&D platform
• ABCI supports NGC containers
• Sony and Fujitsu Laboratories achieved
world-leading ImageNet-1k training
performance on ABCI
• Two research papers that use ABCI
were accepted at SC19
https://blogs.nvidia.com/blog/2019/06/17/abci-adopts-ngc/
[Chart: ImageNet/ResNet-50 relative training speedup and accuracy for MSRA (2015), Facebook (2017), Google Brain (2017), Preferred Networks (2017), Tencent (2018), Sony + ABCI (2018, 2019), Google (2018, 2019), Fujitsu Lab + ABCI (2019), NVIDIA (2019)]
World’s Highest Speed in ImageNet-1k Training
MLPerf Training v0.6
SONY’s work
https://arxiv.org/abs/1811.05233
Fujitsu lab’s work
https://arxiv.org/abs/1903.12650
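The speedups behind these records come down to weak-scaling arithmetic: training time is roughly the total image count divided by aggregate throughput. A rough model, with per-GPU throughput and scaling efficiency as assumed illustrative numbers (not Sony's or Fujitsu's measured figures):

```python
# Rough scaling model for data-parallel ImageNet/ResNet-50 training time.
# Throughput and efficiency below are assumptions for illustration only.
IMAGES = 1_281_167          # ImageNet-1k training set size
EPOCHS = 90                 # typical ResNet-50 schedule

def training_time_s(num_gpus, images_per_sec_per_gpu, scaling_efficiency):
    throughput = num_gpus * images_per_sec_per_gpu * scaling_efficiency
    return IMAGES * EPOCHS / throughput

# e.g. 2048 GPUs at an assumed 900 img/s each with 80% scaling efficiency:
print(round(training_time_s(2048, 900, 0.8)))  # ~78 seconds
```

This is why sub-two-minute ImageNet training requires thousands of GPUs and high collective-communication efficiency.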
Gateway and Firewall
Interactive Nodes x 4
Interconnect (InfiniBand EDR)
Service Network (10GbE)
Management and Gateway Nodes x 15
High-Performance Computing System
550 PFlops(FP16), 37.2 PFlops(FP64)
476 TiB Memory, 1.74 PB NVMe SSD
Computing Nodes (w/ GPU) x 1088
Multi-platform Nodes (w/o GPU) x 10
• Intel Xeon Gold 6132 (2.6GHz/14cores) x 2
• 768GiB Memory, 3.8TB NVMe SSD, 1.5TB Intel Optane x2
22 PB GPFS (Group Shared Directory, etc.)
DDN SFA14K (w/ SS8462 Enclosure x 10) x3
• 12TB 7.2Krpm NL-SAS HDD x 2400
• 3.84TB SAS SSD x 216
• Mellanox CS7500 x 2
• Mellanox SB7890 x 229
• Nexus 3232C x2
• FortiGate 1500D x2
• FortiAnalyzer 400E x1
100Gbps
SINET5
GPU NVIDIA Tesla V100 SXM2 x 4
CPU Intel Xeon Gold 6148 (2.4GHz/20cores) x 2
Memory 384GiB
Local Storage Intel SSD DC P4600 (NVMe) 1.6TB x 1
Interconnect InfiniBand EDR x 2
ABCI Hardware Overview
Large-scale Storage System
1 PB Lustre (Home Directory)
DDN SFA14KX (w/ SS9012 Enclosure x 10) x1
• 7.68TB SAS SSD x 185 for data
• 960GB SAS SSD x 13 for metadata
17 PB Object Storage (in preparation)
HPE Apollo 4510 Gen10 x 24
• 12TB SATA HDD x 1440
• 3.2TB SSD x 24
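The headline peak figures follow from the node counts above. A back-of-the-envelope check, where the per-device peaks (V100 SXM2 FP16 tensor and FP64, Xeon Gold 6148 FP64) are assumptions taken from public spec sheets:

```python
# Back-of-the-envelope check of ABCI's aggregate peak performance.
NODES = 1088
GPUS_PER_NODE = 4
# Assumed device peaks: V100 SXM2 = 125 TFlops FP16 (tensor), 7.8 TFlops FP64;
# Xeon Gold 6148 = 20 cores x 2.4 GHz x 32 FP64 flops/cycle.
gpu_fp16_pflops = NODES * GPUS_PER_NODE * 125.0 / 1000
gpu_fp64_pflops = NODES * GPUS_PER_NODE * 7.8 / 1000
cpu_fp64_pflops = NODES * 2 * 20 * 2.4e9 * 32 / 1e15

print(gpu_fp16_pflops)                    # 544.0 -> quoted as 550 PFlops (FP16)
print(gpu_fp64_pflops + cpu_fp64_pflops)  # ~37.3 -> quoted as 37.2 PFlops (FP64)
```

The FP64 figure matches only when the 2176 CPUs are counted alongside the 4352 GPUs.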
ABCI Software Stack
Stack
Operating System RHEL / CentOS 7.6
Job Scheduler Univa Grid Engine 8.6.3
Container Engine
Docker 17.12.0 (Users can use only supported container images)
Singularity 2.6.1 (Users can use any container images)
MPI
Intel MPI 2018.2.199
MVAPICH2 2.3rc2, 2.3 / MVAPICH2-GDR 2.3a, 2.3rc1, 2.3, 2.3.1
OpenMPI 1.10.7, 2.1.3, 2.1.5, 2.1.6, 3.0.3, 3.1.0, 3.1.2, 3.1.3
Development tools
Intel Parallel Studio XE Cluster Edition 2017.8, 2018.2, 2018.3, 2019.3
PGI Professional Edition 17.10, 18.5, 18.10, 19.3
NVIDIA CUDA SDK 8.0, 9.0, 9.1, 9.2, 10.0
cuDNN 5.1, 6.0, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
NCCL 1.3.5, 2.1, 2.2, 2.3, 2.4
Intel MKL 2017.8, 2018.2, 2018.3, 2019.3
GCC, Python, Ruby, R, OpenJDK, Go, Perl
Deep Learning
Caffe, Caffe2, TensorFlow, Theano, Torch, PyTorch, CNTK, MXNet, Chainer, Keras, etc.
(Frameworks provided by NVIDIA GPU Cloud can also be deployed)
Big Data Processing Hadoop, Spark
ABCI Compute Node
[Node block diagram: two Xeon Gold 6148 (Skylake) CPUs linked by UPI x3 (10.4 GT/s), each with 32GB x 6 DDR4-2666 (128 GB/s per socket); per CPU, a PCIe gen3 x16 switch (x48 / x64 lanes) connects one InfiniBand EDR HCA (100 Gbps) and two Tesla V100 SXM2 GPUs; the four GPUs are interconnected with NVLink2 x2; one NVMe SSD attached]
FUJITSU PRIMERGY Server (2 servers in 2U)
CPU Xeon Gold 6148 (27.5M Cache, 2.40 GHz, 20 Core) x2
GPU NVIDIA Tesla V100 (SXM2) x4
Memory 384GiB DDR4 2666MHz RDIMM
Local Storage 1.6TB NVMe SSD (Intel SSD DC P4600 u.2) x1
Interconnect InfiniBand EDR x2
ABCI Node/Rack Interconnect
[Topology diagram: in each rack, 17 CX400 chassis (2 CX2570 nodes each; 34 nodes per rack) connect to 4 leaf switches (Mellanox SB7890) with InfiniBand EDR x1 links per HCA; leaves connect to 3 full-bisection-bandwidth (FBB) switches (SB7890) with EDR x6; FBBs connect to 2 spine switches (Mellanox CS7500) with EDR x4. Intra-rack: full bisection BW (IB-EDR x 72). Inter-rack: 1/3 oversubscription BW (IB-EDR x 24).]
• Densely packaged: 34 nodes per rack, with high cooling capability
• Interconnect
– Fat tree topology
– Intra-rack: full bisection BW
– Inter-rack: 1/3 oversubscription BW
– Without adaptive routing
– Without Mellanox SHARP
Grounds for the Interconnect Design
• 1/3 over-subscription network
– A cost effective solution
• Many DL training apps do not use large number of nodes
• High bandwidth network is required by only a small set of nodes
• Without adaptive routing
– InfiniBand is shared by compute and IO communication, and the delivery
vendor advised against using adaptive routing in such a configuration
• Without Mellanox SHARP
– We use EDR switches, which do not support large messages in
SHARP
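The "1/3 over-subscription" figure follows directly from the link counts on the topology slide. A minimal sketch of that arithmetic (100 Gbps per EDR link is the line rate, not payload bandwidth):

```python
# Why the inter-rack network is "1/3 over-subscription": each rack has
# 72 EDR links of intra-rack full-bisection bandwidth but only 24 EDR
# uplinks toward the spine switches (link counts from the topology slide).
EDR_GBPS = 100               # EDR line rate per link
intra_rack_links = 72
uplinks_to_spine = 24

oversubscription = uplinks_to_spine / intra_rack_links
print(oversubscription)               # 0.333... -> "1/3"
print(uplinks_to_spine * EDR_GBPS)    # 2400 Gbps of inter-rack capacity per rack
```

The cost saving comes from the 48 spine-facing links (and spine ports) each rack does not need.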
Distribution of Number of Used GPUs/Nodes in DL Jobs
• Workload is collected from pre-
ABCI system, AAIC
– A system dedicated for AI research
• Single-GPU jobs are dominant and
the degree of parallelism is low
• ABCI is designed as a capacity
computing system for AI
[Charts: share of single-GPU, single-node and multi-node jobs; distribution of node counts in multi-node jobs. AAIC: 32-38 nodes, 8 GPUs/node, 256-304 GPUs]
Workload is published under: https://github.com/aistairc/aaic-workload
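Tallying the degree of parallelism from such a trace is a one-liner with a counter. A sketch with a hypothetical record layout and toy data (see the published AAIC workload for the actual format):

```python
# Sketch: degree-of-parallelism distribution from a job trace.
# The records below are toy data, not real AAIC jobs.
from collections import Counter

jobs = [  # (job_id, num_gpus) -- hypothetical field layout
    ("j1", 1), ("j2", 1), ("j3", 1), ("j4", 4), ("j5", 8), ("j6", 16),
]

dist = Counter(gpus for _, gpus in jobs)
single_gpu_share = dist[1] / len(jobs)
print(dict(dist))        # {1: 3, 4: 1, 8: 1, 16: 1}
print(single_gpu_share)  # 0.5 here; single-GPU jobs dominate the real trace too
```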
Performance Impact on MPI
under 1/3 over-subscription without adaptive routing
• Measured P2P host-mem to host-mem transfer
performance using OpenMPI 2.1.6
• Intra-rack: 17 nodes to 17 nodes
• Inter-rack: 34 nodes to 34 nodes
• Increased the number of node pairs
[Diagram: node pairs communicating between Group0 (Rack0) and Group1 (Rack1)]
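The "Ideal" curves in the following charts can be reproduced with simple link arithmetic: each node injects over two EDR HCAs, and all inter-rack pairs share the 24 uplink EDR links. A minimal model, where ~12.5 GB/s of payload bandwidth per EDR link is an assumption:

```python
# Simple model of ideal aggregate throughput under the ABCI topology.
# Per-link payload bandwidth is an assumption (EDR is 100 Gbps line rate).
PER_LINK_GBS = 12.5
PER_NODE_GBS = 2 * PER_LINK_GBS          # two HCAs per node
INTER_RACK_CAP_GBS = 24 * PER_LINK_GBS   # shared spine uplinks per rack

def ideal_total(n_pairs, inter_rack):
    total = n_pairs * PER_NODE_GBS       # linear scaling until a cap is hit
    return min(total, INTER_RACK_CAP_GBS) if inter_rack else total

print(ideal_total(17, inter_rack=False))  # 425.0 GB/s intra-rack
print(ideal_total(34, inter_rack=True))   # 300.0 GB/s, capped by the uplinks
```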
Intra-Rack P2P Performance
[Charts: per-node average throughput and aggregated total throughput (GB/s) vs. number of intra-group connections, ideal vs. measured]
More than 80% of the theoretical performance is achieved.
Inter-Rack P2P Performance
[Charts: per-node average throughput and total rack throughput (GB/s) vs. number of inter-rack connections, ideal vs. measured]
Throughput degrades sharply at 6 and 20 connections and falls far short of the ideal performance
Collective Comm. Performance on ABCI
• Measured the performance of two collective communication patterns
– Allreduce: essential in distributed training of DL
– Reduce: the heaviest communication in our application
• Used OSU Micro-Benchmarks 5.6.1
– NCCL Bench 2.0.0 was used for NCCL
– Default parameters
Host-to-host transfer:
• MVAPICH2 2.3.1
• IntelMPI 2018 Update 2
• OpenMPI 3.1.3
GPU-to-GPU transfer:
• MVAPICH2 2.3.1
• MVAPICH2-GDR 2.3.1 (gdrcopy 1.2)
• OpenMPI 3.1.3
• NCCL 2.4.7
Compiler: GCC 4.8.5, CUDA 10.0.130
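Allreduce dominates distributed DL training, and the libraries compared below implement it with ring- or tree-based schedules. A minimal pure-Python ring allreduce over simulated ranks, for illustration only (MVAPICH2, OpenMPI and NCCL implement this natively over the network):

```python
# Minimal ring allreduce (sum) over simulated ranks: each of the p ranks
# owns a vector; after 2*(p-1) steps every rank holds the elementwise sum.
def ring_allreduce(bufs):
    p = len(bufs)
    data = [list(b) for b in bufs]
    n = len(data[0])
    assert n % p == 0, "vector length must divide evenly into p segments"
    s = n // p

    def seg(r, i):
        return data[r][i * s:(i + 1) * s]   # copy of segment i on rank r

    # Phase 1: reduce-scatter -- after p-1 steps, rank r owns the complete
    # sum for segment (r+1) % p.
    for step in range(p - 1):
        sends = [((r + 1) % p, (r - step) % p, seg(r, (r - step) % p))
                 for r in range(p)]
        for dst, i, payload in sends:
            for k in range(s):
                data[dst][i * s + k] += payload[k]

    # Phase 2: allgather -- circulate the completed segments around the ring.
    for step in range(p - 1):
        sends = [((r + 1) % p, (r + 1 - step) % p, seg(r, (r + 1 - step) % p))
                 for r in range(p)]
        for dst, i, payload in sends:
            data[dst][i * s:(i + 1) * s] = payload
    return data

print(ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
# every rank ends with [12, 15, 18]
```

The ring schedule is bandwidth-optimal for large messages, which is one reason small- and large-message rankings differ across the libraries in the next slides.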
H2H Allreduce
Intra-Rack (34 Nodes) / Inter-Rack (68 Nodes)
• MVAPICH2 is comparable to IntelMPI
• Performance does not suffer from the 1/3 over-subscribed inter-rack bandwidth
H2H Reduce
Intra-Rack (34 Nodes) / Inter-Rack (68 Nodes)
MVAPICH2 is better than the others for small messages, but IntelMPI outperforms MVAPICH2 for large messages
D2D Allreduce
Intra-Rack (34 Nodes) / Inter-Rack (68 Nodes)
MVAPICH2-GDR and NCCL outperform the others for large messages
D2D Reduce
Intra-Rack (34 Nodes) / Inter-Rack (68 Nodes)
MVAPICH2-GDR achieves the best performance
Application Performance: iFDK
• A high-speed and high-resolution CT image reconstruction framework
running on GPU clusters
– Create 3D images from multiple 2D x-ray images
– Demands for high-resolution CT image:
non-invasive inspection, reverse engineering, etc.
• What we did
– GPU kernel whose compute cost is 1/6 of the standard
algorithm
– Efficient FDK computation by overlapping CPU comp.,
GPU comp. and comm.
– Distributed framework for high-resolution
image reconstruction
Peng Chen, Mohamed Wahib, Shinichiro Takizawa, Ryousei Takano, Satoshi Matsuoka.
iFDK: A Scalable Framework for Instant High-resolution Image Reconstruction
[Figure: CBCT geometry and trajectory with a micro-focus X-ray source]
Will be presented at SC19
Distributed Framework for High-resolution
Image Reconstruction
[Pipeline diagram: Load (on CPUs) → Filtering (on CPUs) → AllGather → Back-projection (on GPUs) → Reduce → Store (on CPUs); input: 2D projections, output: 3D volume]
Example Case
[Diagram: 4096 input 2D projections (16 MB/img) are filtered and allgathered across processes; back-projection produces partial volumes that are reduced into 64 output sub-volumes (8 GB/vol)]
Input: 4096 2K images, Output: 4K^3 volume, 32 nodes
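The Reduce stage is the data movement that later dominates the MPI performance gap: every node holds a partial back-projection of each sub-volume, and sub-volume v is summed onto its owner. A toy sketch of that reduction with scaled-down sizes (the real job reduces 64 sub-volumes of 8 GB across 32 nodes):

```python
# Toy model of iFDK's final Reduce: each node holds a partial result for
# every sub-volume; each sub-volume is summed across all nodes.
NUM_NODES = 4      # scaled down from 32
NUM_VOLS = 8       # scaled down from 64
VOL_LEN = 3        # scaled down from 8 GB of voxels

partials = [[[1.0] * VOL_LEN for _ in range(NUM_VOLS)]
            for _ in range(NUM_NODES)]

def reduce_volumes(partials):
    num_vols = len(partials[0])
    out = []
    for v in range(num_vols):
        acc = [0.0] * len(partials[0][v])
        for node in partials:      # MPI_Reduce performs this sum per owner rank
            acc = [a + x for a, x in zip(acc, node[v])]
        out.append(acc)
    return out

vols = reduce_volumes(partials)
print(vols[0])   # [4.0, 4.0, 4.0] -- contributions from all 4 nodes
```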
Performance Result
• Large performance gap between
MPI implementations
– Mainly due to Reduce
– IntelMPI was used in the paper
• Next steps
– Tune parameters for collectives
– Use CUDA-aware MPI collectives
[Chart: execution time (seconds) broken down into Allgather, Reduce, IO and Compute for IntelMPI, MVAPICH2 and OpenMPI]
Input: 4096 2K images, Output: 4K^3 volume, 32 nodes
Summary
• Introduced ABCI, an open AI platform
– Architecture, software stack and latest achievements
– Detailed network architecture
• MPI communication performance on ABCI
– Effect of the 1/3 inter-rack over-subscription bandwidth on P2P and
collectives
– An application performance example
• Future work
– Evaluate the latest MVAPICH2 and MVAPICH2-GDR
– Provide detailed communication performance results on ABCI to users
https://abci.ai/
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
 
Data Parallel Deep Learning
Data Parallel Deep LearningData Parallel Deep Learning
Data Parallel Deep Learning
 
Making Supernovae with Jets
Making Supernovae with JetsMaking Supernovae with Jets
Making Supernovae with Jets
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolvers
 
Scientific Applications and Heterogeneous Architectures
Scientific Applications and Heterogeneous ArchitecturesScientific Applications and Heterogeneous Architectures
Scientific Applications and Heterogeneous Architectures
 
SW/HW co-design for near-term quantum computing
SW/HW co-design for near-term quantum computingSW/HW co-design for near-term quantum computing
SW/HW co-design for near-term quantum computing
 

Recently uploaded

What is an RPA CoE? Session 2 – CoE Roles
What is an RPA CoE?  Session 2 – CoE RolesWhat is an RPA CoE?  Session 2 – CoE Roles
What is an RPA CoE? Session 2 – CoE Roles
DianaGray10
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
Javier Junquera
 
Christine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptxChristine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptx
christinelarrosa
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Neo4j
 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Neo4j
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
Neo4j
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
Fwdays
 
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
zjhamm304
 
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
Edge AI and Vision Alliance
 
ScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking ReplicationScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking Replication
ScyllaDB
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
Ivo Velitchkov
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
Alex Pruden
 
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
DanBrown980551
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
Fwdays
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
LizaNolte
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
christinelarrosa
 
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeckPoznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
FilipTomaszewski5
 

Recently uploaded (20)

What is an RPA CoE? Session 2 – CoE Roles
What is an RPA CoE?  Session 2 – CoE RolesWhat is an RPA CoE?  Session 2 – CoE Roles
What is an RPA CoE? Session 2 – CoE Roles
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
 
Christine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptxChristine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptx
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
 
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
 
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
 
ScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking ReplicationScyllaDB Tablets: Rethinking Replication
ScyllaDB Tablets: Rethinking Replication
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
 
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
 
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeckPoznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
 

AI Bridging Cloud Infrastructure (ABCI) and its communication performance

  • 1. AI Bridging Cloud Infrastructure (ABCI) and its Communication Performance Shinichiro Takizawa The National Institute of Advanced Industrial Science and Technology (AIST), Japan
  • 2. Outline • Introduction of AIST and ABCI • ABCI Detail – Architecture, software stack, network topology • MPI performance on ABCI • A latest application on ABCI
  • 3. Introduction of AIST • A research institute as a part of the Ministry of Economy, Trade and Industry (METI) of Japan • Our mission – Integrate scientific and engineering knowledge to address industry and society needs – Bridge the gap between innovative technological seeds and commercialization We built ABCI to popularize AI technologies in Japan
  • 4. 4 ABCI: AI Bridging Cloud Infrastructure. A large-scale open computing infrastructure for AI (Big Data, Algorithms, Software and Applications), built to accelerate joint academic-industry R&D for AI. Located at the Univ. Tokyo / Kashiwa II Campus. OPERATED SINCE AUGUST 1ST, 2018
  • 5. Some of 1 Year Achievements • 100+ projects, 1000+ users – Academia, research institutes and companies • Large Japanese companies have started to use ABCI as their R&D platform • ABCI supports NGC containers • Sony and Fujitsu Lab achieved record performance on ImageNet-1k classification on ABCI • Two research papers that use ABCI were accepted by SC19 https://blogs.nvidia.com/blog/2019/06/17/abci-adopts-ngc/
  • 6. 6 [chart: ImageNet / ResNet-50 training, relative speedup and accuracy, for MSRA (2015), Facebook (2017), Google Brain (2017), Preferred Networks (2017), Tencent (2018), Sony + ABCI (2018), Google (2018), Google (2018), Sony + ABCI (2019), Fujitsu Lab + ABCI (2019), NVIDIA (2019), Google (2019), Fujitsu Lab + ABCI (2019)] World’s Highest Speed in ImageNet-1k Training MLPerf Training v0.6 SONY’s work https://arxiv.org/abs/1811.05233 Fujitsu Lab’s work https://arxiv.org/abs/1903.12650
  • 7. 7 Gateway and Firewall Interactive Nodes x 4 Interconnect (InfiniBand EDR) Service Network (10GbE) Management and Gateway Nodes x 15 High-Performance Computing System 550 PFlops(FP16), 37.2 PFlops(FP64) 476 TiB Memory, 1.74 PB NVMe SSD Computing Nodes (w/ GPU) x 1088 Multi-platform Nodes (w/o GPU) x 10 • Intel Xeon Gold 6132 (2.6GHz/14cores) x 2 • 768GiB Memory, 3.8TB NVMe SSD, 1.5TB Intel Optane x2 22 PB GPFS (Group Shared Directory, etc.) DDN SFA14K (w/ SS8462 Enclosure x 10) x3 • 12TB 7.2Krpm NL-SAS HDD x 2400 • 3.84TB SAS SSD x 216 • Mellanox CS7500 x 2 • Mellanox SB7890 x 229 • Nexus 3232C x2 • FortiGate 1500D x2 • FortiAnalyzer 400E x1 100Gbps SINET5 GPU NVIDIA Tesla V100 SXM2 x 4 CPU Intel Xeon Gold 6148 (2.4GHz/20cores) x 2 Memory 384GiB Local Storage Intel SSD DC P4600 (NVMe) 1.6TB x 1 Interconnect InfiniBand EDR x 2 ABCI Hardware Overview Large-scale Storage System 1 PB Lustre (Home Directory) DDN SFA14KX (w/ SS9012 Enclosure x 10) x1 • 7.68TB SAS SSD x 185 for data • 960GB SAS SSD x 13 for metadata 17 PB Object Storage (Preparing) HPE Apollo 4510 Gen10 x 24 • 12TB SATA HDD x 1440 • 3.2TB SSD x 24
  • 8. 8 ABCI Software Stack Stack Operating System RHEL / CentOS 7.6 Job Scheduler Univa Grid Engine 8.6.3 Container Engine Docker 17.12.0 (Users can use only supported container images) Singularity 2.6.1 (Users can use any container images) MPI Intel MPI 2018.2.199 MVAPICH2 2.3rc2, 2.3 / MVAPICH2-GDR 2.3a, 2.3rc1, 2.3, 2.3.1 OpenMPI 1.10.7, 2.1.3, 2.1.5, 2.1.6, 3.0.3, 3.1.0, 3.1.2, 3.1.3 Development tools Intel Parallel Studio XE Cluster Edition 2017.8, 2018.2, 2018.3, 2019.3 PGI Professional Edition 17.10, 18.5, 18.10, 19.3 NVIDIA CUDA SDK 8.0, 9.0, 9.1, 9.2, 10.0 cuDNN 5.1, 6.0, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5 NCCL 1.3.5, 2.1, 2.2, 2.3, 2.4 Intel MKL 2017.8, 2018.2, 2018.3, 2019.3 GCC, Python, Ruby, R, OpenJDK, Go, Perl Deep Learning Caffe, Caffe2, TensorFlow, Theano, Torch, PyTorch, CNTK, MXnet, Chainer, Keras, etc. (Frameworks provided by NVIDIA GPU Cloud can also be deployed) Big Data Processing Hadoop, Spark
  • 9. 9 ABCI Compute Node. [diagram: Xeon Gold 6148 x2 linked by UPI 10.4GT/s x3, DDR4-2666 32GB x 6 per socket (128GB/s each), IB HCA (100Gbps) x2, NVMe SSD; PCIe gen3 x16 via x48 and x64 PCIe switches to Tesla V100 SXM2 x4, GPUs interconnected with NVLink2 x2] FUJITSU PRIMERGY Server (2 servers in 2U) CPU Xeon Gold 6148 (27.5M Cache, 2.40 GHz, 20 Core) x2 GPU NVIDIA Tesla V100 (SXM2) x4 Memory 384GiB DDR4 2666MHz RDIMM Local Storage 1.6TB NVMe SSD (Intel SSD DC P4600 u.2) x1 Interconnect InfiniBand EDR x2
  • 10. 10 ABCI Rack and Interconnect. [diagram: each rack holds 17 CX400 chassis (34 CX2570 nodes) connected through LEAF#1-4 (SB7890) and FBB#1-3 (SB7890) with full bisection BW (IB-EDR x 72); racks connect to SPINE#1-2 (CS7500) with 1/3 oversubscription BW (IB-EDR x 24); links are InfiniBand EDR x1, x4 and x6] Interconnect: fat-tree topology; intra-rack: full bisection bandwidth; inter-rack: 1/3 oversubscription; without adaptive routing; without Mellanox SHARP
  • 11. Grounds for the Interconnect Design • 1/3 over-subscription network – A cost-effective solution • Many DL training apps do not use a large number of nodes • High-bandwidth network is required by only a small set of nodes • Without adaptive routing – InfiniBand is shared by compute and IO communication, and the delivery vendor suggested not using adaptive routing in such a configuration • Without Mellanox SHARP – We use EDR and switches that do not support large messages in SHARP
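As a quick sanity check of the 1/3 figure, the link counts on the rack diagram (72 full-bisection EDR links inside a rack vs. 24 EDR uplinks per rack) give the ratio directly. This is a minimal sketch using those numbers; the per-link rate (EDR ≈ 12.5 GB/s per direction) is an assumption.

```python
# Sketch: derive the inter-rack over-subscription ratio from the link
# counts on the rack diagram (72 full-bisection EDR links inside a rack,
# 24 EDR uplinks from each rack to the spine switches).
EDR_GBS = 12.5  # assumed: InfiniBand EDR, ~100 Gbps ~= 12.5 GB/s per link

intra_rack_links = 72
uplinks_per_rack = 24

ratio = intra_rack_links / uplinks_per_rack
print(f"over-subscription: {ratio:.0f}:1 -> inter-rack bandwidth is 1/{ratio:.0f}")
print(f"rack uplink capacity: {uplinks_per_rack * EDR_GBS:.0f} GB/s")
```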
  • 12. Distribution of Number of Used GPUs/Nodes in DL Jobs • Workload is collected from the pre-ABCI system, AAIC – A system dedicated to AI research • Single-GPU jobs are dominant and the degree of parallelism is low • ABCI is designed as a capacity computing system for AI #Nodes: 32-38 #GPUs/Node: 8 #GPUs: 256-304 Single GPU, Single Node and Multi Node jobs # Used Nodes in Multi Node Jobs Workload is published under: https://github.com/aistairc/aaic-workload
  • 13. Performance Impact on MPI under 1/3 over-subscription without adaptive routing • Measured P2P host-mem to host-mem transfer performance using OpenMPI 2.1.6 • Intra-rack: 17 nodes - 17 nodes • Inter-rack: 34 nodes - 34 nodes • Increase # of node pairs [diagram: node pairs communicating across Group0 (Rack0) and Group1 (Rack1)]
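To set expectations for the inter-rack results, here is a toy model (my assumption, not from the slides): each node pair can drive its two EDR rails until the aggregate traffic hits the rack's 24-uplink cap. Link rates (12.5 GB/s per EDR link) are likewise assumed.

```python
# Toy model of inter-rack P2P throughput under 1/3 over-subscription:
# n concurrent node pairs share the 24 EDR uplinks between two racks.
EDR_GBS = 12.5             # assumed per-link rate
PER_NODE = 2 * EDR_GBS     # two EDR rails per node (ideal injection rate)
UPLINK_CAP = 24 * EDR_GBS  # rack-to-spine capacity

def modeled(n_pairs):
    """Return (per-node GB/s, aggregate GB/s) for n concurrent pairs."""
    aggregate = min(n_pairs * PER_NODE, UPLINK_CAP)
    return aggregate / n_pairs, aggregate

for n in (1, 6, 12, 24, 34):
    per_node, total = modeled(n)
    print(f"{n:2d} pairs: {per_node:5.1f} GB/s per node, {total:6.1f} GB/s total")
```

The measured dips at 6 and 20 connections on slide 15 fall below even this cap, consistent with path conflicts under static routing, which this capacity-only model ignores.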
  • 14. Intra-Rack P2P Performance 14 [charts: Per-Node Average Throughput (GB/s) and Aggregated Throughput (GB/s) vs. # Intra-Group Connections, Ideal vs. Measured] More than 80% of the theoretical performance is achieved.
  • 15. Inter-Rack P2P Performance 15 [charts: Per-Node Average Throughput (GB/s) and Total Rack Throughput (GB/s) vs. # Inter-Rack Connections, Ideal vs. Measured] Large performance degradation at 6 and 20 connections, and far from ideal performance
  • 16. Collective Comm. Performance on ABCI • Measured performance of two collective comm. patterns – Allreduce: essential in distributed training of DL – Reduce: the heaviest comm. in our application • Used OSU Micro-Benchmarks 5.6.1 – NCCL Bench 2.0.0 is used for NCCL – Default parameters MPI Version MVAPICH2 MVAPICH2 2.3.1 IntelMPI IntelMPI 2018 Update 2 OpenMPI OpenMPI 3.1.3 MPI, NCCL Version MVAPICH2 MVAPICH2 2.3.1 MVAPICH2-GDR MVAPICH2-GDR 2.3.1, gdrcopy 1.2 OpenMPI OpenMPI 3.1.3 NCCL NCCL 2.4.7 Compiler GCC 4.8.5 CUDA 10.0.130 Host to Host Transfer GPU to GPU Transfer
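For intuition on why Allreduce can scale well even over the over-subscribed links: in the classic ring algorithm (the textbook scheme NCCL is known for; whether these particular runs used it is not stated on the slides), the bytes each rank sends are nearly independent of the rank count. A minimal sketch:

```python
# Ring allreduce traffic model: for an n-byte buffer across p ranks, each
# rank sends 2*(p-1)/p * n bytes (reduce-scatter phase + allgather phase),
# which approaches 2n regardless of p.
def ring_allreduce_bytes_per_rank(n_bytes: int, p: int) -> float:
    return 2 * (p - 1) / p * n_bytes

msg = 256 * 2**20  # illustrative 256 MiB gradient buffer
for p in (4, 34, 68):
    sent = ring_allreduce_bytes_per_rank(msg, p)
    print(f"p={p:2d}: {sent / 2**20:6.1f} MiB sent per rank")
```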
  • 17. H2H Allreduce 17 Inter-Rack (68 Nodes) Intra-Rack (34 Nodes) • MVAPICH2 is comparable to IntelMPI • Performance does not suffer from the 1/3 over-subscribed inter-rack bandwidth
  • 18. H2H Reduce 18 Inter-Rack (68 Nodes) Intra-Rack (34 Nodes) MVAPICH2 is better than the others for small messages, but IntelMPI outperforms MVAPICH2 for large messages
  • 19. D2D Allreduce 19 Inter-Rack (68 Nodes) Intra-Rack (34 Nodes) MVAPICH2-GDR and NCCL outperform the others for large messages
  • 20. D2D Reduce 20 Inter-Rack (68 Nodes) Intra-Rack (34 Nodes) MVAPICH2-GDR achieves the best performance
  • 21. Application Performance: iFDK • A high-speed and high-resolution CT image reconstruction framework running on GPU clusters – Creates 3D images from multiple 2D X-ray images – Demand for high-resolution CT images: non-invasive inspection, reverse engineering, etc. • What we did – A GPU kernel whose compute cost is 1/6 of the standard algorithm – Efficient FDK computation by overlapping CPU computation, GPU computation and communication – A distributed framework for high-resolution image reconstruction Peng Chen, Mohamed Wahib, Shinichiro Takizawa, Ryousei Takano, Satoshi Matsuoka. iFDK: A Scalable Framework for Instant High-resolution Image Reconstruction CBCT geometry and trajectory Micro-focus X-ray source Will be presented at SC19
  • 22. Distributed Framework for High-resolution Image Reconstruction 22 [pipeline: Load (on CPUs) → Filtering (on CPUs) → Back-projection (on GPUs) → AllGather → Reduce → Store (on CPUs); Input: 2D Projections, Output: 3D Volume]
  • 23. Example Case 23 [diagram: 4096 input 2D projections (16 MB/img x 4096) distributed over 32 nodes; output 4k^3 3D volume as 64 subvolumes, vol 0 … vol 63 (8 GB/vol x 64); Allgather then Reduce] Input: 4k count of 2k images, Output: 4k^3, 32 Nodes
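The slide's data sizes check out under plausible element-size assumptions (4-byte pixels for the projections and 8-byte voxels for the output volume; these are my assumptions, chosen to match the stated 16 MB/img and 8 GB/vol):

```python
# Verify the slide's figures: 4096 projections of 2048x2048 pixels at
# 16 MB each, and a 4096^3 output volume split into 64 subvolumes of
# 8 GB each. Element sizes (4 B/pixel, 8 B/voxel) are assumptions that
# make the arithmetic match.
proj_bytes = 2048 * 2048 * 4         # one 2k x 2k projection, 4 B/pixel
subvol_bytes = (4096**3 // 64) * 8   # one of 64 subvolumes of the 4k^3 volume

print(f"projection: {proj_bytes // 2**20} MiB each, x 4096 images")
print(f"subvolume:  {subvol_bytes // 2**30} GiB each, x 64 volumes")
print(f"total output: {64 * subvol_bytes // 2**30} GiB")
```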
  • 24. Performance Result • Large performance gap between MPI implementations – Mainly due to Reduce – Used IntelMPI in the paper • Next steps – Tune parameters for collectives – Use CUDA-aware MPI collectives [chart: Time (seconds) for Allgather, Reduce, IO and Compute with IntelMPI, MVAPICH2 and OpenMPI] Input: 4k count of 2k images, Output: 4k^3, 32 Nodes
  • 25. Summary • Introduced ABCI, an open AI platform – Architecture, software stack and latest achievements – Detailed network architecture • MPI communication performance on ABCI – Effect of the 1/3 inter-rack over-subscribed bandwidth on P2P and collectives – An application performance example • Future work – Evaluate the latest MVAPICH2 and MVAPICH2-GDR – Provide detailed comm. performance results on ABCI to users https://abci.ai/