ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
National Institute of
Advanced Industrial Science and Technology (AIST)
Artificial Intelligence Research Center
Hitoshi Sato
1
About AIST
2
ITRI
Information Technology
Research Institute
Artificial Intelligence
Research Center
Real World Big-Data
Computing – Open
Innovation Laboratory
One of the public research institutes in Japan under METI*, focusing on
“bridging” the gap between innovative technological seeds and commercialization.
*Ministry of Economy, Trade and Industry
@Odaiba, Tokyo Joint Lab w/ Tokyo Tech
HQ @Tsukuba
Current Status of AI R&D in Japan:
AI requires the convergence of Algorithm, Big Data, and
Computing Infrastructure
3
Big Data
Algorithm
Computing Infrastructure
Deep Learning, Machine Learning, etc.
Cutting edge infrastructures,
Clouds, Supercomputers, etc.
We lack cutting-edge infrastructure
dedicated to AI and Big Data that is open to public use
Big Companies AI/BD
R&D (also Science)
AI Venture Startups
AI/BD Centers & Labs in National
Labs & Universities
• Dense Problem
• Compute-intensive
• Dense Matrix Op., Multiply-Add
• Reduced Precision
• Regular data/memory accesses
• Sparse Problem
– Graph Data Analytics / Symbol Processing
• Data-intensive
• Sparse Matrix Op., Integer
• Big Data Kernels (Sort, PrefixSum, etc.)
• Irregular data/memory accesses
4
Different Workload Characteristics, But Similar to HPC
AI requires HPC (see the dense-vs-sparse sketch below)
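To make the dense-vs-sparse contrast above concrete, here is a minimal NumPy sketch (not from the slides; the matrix sizes and the CSR-style index arrays are illustrative assumptions). The dense kernel is a regular multiply-add over contiguous memory; the sparse kernel is dominated by integer indexing and irregular, data-dependent gathers.

import numpy as np

# Dense problem: compute-intensive, regular accesses, multiply-add over contiguous data.
A = np.random.rand(1024, 1024).astype(np.float32)   # reduced precision is typical for DL
B = np.random.rand(1024, 1024).astype(np.float32)
C = A @ B                                            # dense matrix multiply (GEMM)

# Sparse problem: data-intensive, integer ops, irregular accesses (CSR-style SpMV).
data    = np.array([3.0, 1.0, 2.0, 5.0], dtype=np.float32)  # stored nonzeros
indices = np.array([1, 3, 0, 2])                            # column index of each nonzero
indptr  = np.array([0, 2, 3, 4])                            # row start offsets
x = np.random.rand(4).astype(np.float32)

y = np.zeros(3, dtype=np.float32)
for row in range(3):
    cols = indices[indptr[row]:indptr[row + 1]]   # irregular, data-dependent gather
    vals = data[indptr[row]:indptr[row + 1]]
    y[row] = np.dot(vals, x[cols])                # multiply-add on the gathered values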
Example of Real AI/DL Apps Usage
• lang2program
– https://github.com/kelvinguu/
lang2program
– Implementation for ACL 2017
(http://acl2017.org/ ) paper
– Provided as Dockerfile
• Including TensorFlow, CUDA, CuDNN,
PostgreSQL, Python pip packages, etc.
• Many software dependencies to install
5
Can traditional shared HPC systems support
such DL Apps?
• Docker is not allowed for security reasons
(rootless execution is required)
• Manual installation takes elaborate effort
6
HPC software stack (figure): User Applications, Numerical Libraries, Parallel Filesystems (Lustre), Batch Job Schedulers, Linux OS, Infiniband, Accelerators, Storage
Cloud software stack (figure):
VMs / Linux Containers
Ethernet
Local Node Storage
x86 CPU
Distributed Filesystems: HDFS
MapReduce Frameworks: Spark/Hadoop
User Applications
RDB: PostgreSQL
ML Libraries: MLlib/Mahout
Graph Libraries: GraphX/Giraph
Java, Scala, IDL
SQL: Hive/Pig
CloudDB/NoSQL: HBase/Cassandra/MongoDB
Coordination Engine: ZooKeeper
Clouds vs. Supercomputers: Applications, System Software, Hardware
• Cloud jobs are often interactive, with resource control via REST APIs
• HPC jobs are batch-oriented, with resource control by MPI
• Cloud employs high-productivity languages, but performance is neglected; the focus is on
data analytics and frequent dynamic changes
• HPC employs high-performance languages but requires “ninja” programmers and has low
productivity; kernels and compilers are well tuned and their results shared by many programs,
with little rewriting
• Cloud HW is based on “commodity” x86 web servers, with distributed storage on the
nodes assuming REST API access
• HPC HW aggressively adopts new technologies such as GPUs, focuses on ultimate
performance at higher cost, and uses shared storage to support legacy apps
Various convergence research efforts are underway, but there is no realistic converged
SW stack yet
Software Ecosystems: Workflow, Cloud vs. HPC
• Cloud focuses on databases and data-manipulation workflows
• HPC focuses on compute kernels, even for data processing; jobs scale to thousands of
nodes, making debugging and performance tuning difficult
• Cloud requires purpose-specific computing/data environments as well as their mutual
isolation and security
• HPC requires an environment for fast and lean use of resources, which on modern machines
requires considerable system-software support
Characteristics of Deep Learning I/O Workloads
• Small READ I/O to Huge Files
– Native media files such as JPEG, WAV, text, etc.
– Painful I/O patterns for shared parallel file
systems
– Increasing metadata operations
• Randomness is essential
– Runtime data augmentation for
training accuracy (see the sketch below)
• Random Crop
• Random Horizontal Flip
• Normalization, etc.
7
Training dataset for ImageNet-1K
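A minimal sketch of the runtime augmentation listed above, in plain NumPy (the 256x256 input size, 224x224 crop, and the mean/std constants are illustrative assumptions, not ABCI or ImageNet defaults):

import numpy as np

def augment(img, crop=224, mean=0.5, std=0.25, rng=np.random):
    # Random crop: pick a random window of the input image (HWC layout).
    h, w, _ = img.shape
    top = rng.randint(0, h - crop + 1)
    left = rng.randint(0, w - crop + 1)
    img = img[top:top + crop, left:left + crop, :]
    # Random horizontal flip with probability 0.5.
    if rng.rand() < 0.5:
        img = img[:, ::-1, :]
    # Normalization.
    return (img - mean) / std

sample = np.random.rand(256, 256, 3).astype(np.float32)  # stands in for a decoded JPEG
print(augment(sample).shape)  # (224, 224, 3)

Because every sample is cropped and flipped differently each epoch, the input pipeline issues small random reads rather than large sequential ones, which is exactly the access pattern that stresses shared parallel filesystems.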
• Expected Power Limit of Real HPC Systems
– Multi MW for systems
– 60–70 kW per rack
• General-purpose Data Centers
– Multi kW per rack
– Lack of power density for HPC
→ How to introduce supercomputer cooling technologies
to data centers
✓ Power efficiency, thermal density, maintainability, etc.
✓ Inexpensive build
8
AIST AI Cloud: Air Cooling, #3 Green500 June 2017
TSUBAME-KFC: Liquid Immersion Cooling, #1 Green500 Nov 2013
TSUBAME2: Water Cooling, #2 Green500 Nov 2010
• Accelerating AI/Big Data R&D
– Dedicated HPC Infra converged w/ AI/Big Data
• Extreme Computing Power
• Ultra High Bandwidth Low Latency in Memory, Network, and Storage
• Extreme Green
• Bridging AI Tech through Common Platform
– For AI R&D, Apps, Services, and Infrastructure Designs
– Aiming fast tech transfer through industry and academia collaboration
– De facto Standard Software
• Cloud-Datacenter-ready Infrastructure, but Supercomputing
– Supercomputer Cooling Technologies to Datacenter
– Inexpensive Facilities, etc.
9
AI Bridging Cloud Infrastructure
as the World’s First Large-scale Open AI Infrastructure
• Open, Public, and Dedicated infrastructure for AI/Big Data
• Platform to accelerate joint academic-industry R&D for AI in Japan
• Top-level compute capability w/ 0.550 EFlops (AI), 37.2 PFlops (DP)
10
Univ. Tokyo Kashiwa II Campus
Operation Scheduled in 2018
• 1088x compute nodes w/ 4352x NVIDIA Tesla V100 GPUs, 476TiB of Memory,
1.6PB of NVMe SSDs, 22PB of HDD-based Storage and Infiniband EDR network
• Ultra-dense IDC design from the ground-up w/ 20x thermal density of standard IDC
• Extreme Green w/ ambient warm liquid cooling, high-efficiency power supplies, etc.,
commoditizing supercomputer cooling technologies to clouds ( 2.3MW, 70kW/rack)
11
Gateway and
Firewall
Computing Nodes: 0.550 EFlops(HP), 37 PFlops(DP)
476 TiB Mem, 1.6 PB NVMe SSD
Storage: 22 PB GPFS
High Performance Computing Nodes (w/GPU) x1088
• Intel Xeon Gold6148 (2.4GHz/20cores) x2
• NVIDIA Tesla V100 (SXM2) x 4
• 384GiB Memory, 1.6TB NVMe SSD
Multi-platform Nodes (w/o GPU) x10
• Intel Xeon Gold6132 (2.6GHz/14cores) x2
• 768GiB Memory, 3.8TB NVMe SSD
Interactive Nodes
DDN SFA14K
(w/ SS8462 Enclosure x 10) x 3
• 12TB 7.2Krpm NL-SAS HDD x 2400
• 3.84TB SAS SSD x 216
• NSD Servers x 12
Object Storage for Protocol Nodes
100GbE
Service Network (10GbE)
External
Networks
SINET5
Interconnect (Infiniband EDR)
ABCI: AI Bridging Cloud Infrastructure
12
System (32 Racks)
Rack (17 Chassis)
Compute Node
(4GPUs, 2CPUs)
Chips
(GPU, CPU)
Node Chassis
(2 Compute Nodes)
NVIDIA Tesla V100
(16GB SXM2)
3.72 TB/s MEM BW
384 GiB MEM
200 Gbps NW BW
1.6TB NVMe SSD
1.16 PFlops(DP)
17.2 PFlops (AI)
37.2 PFlops(DP)
0.550 EFlops (AI)
68.5 TFlops(DP)
1.01 PFlops (AI)
34.2 TFlops(DP)
506 TFlops (AI)
GPU:
7.8 TFlops(DP)
125 TFlops (AI)
CPU:
1.53 TFlops(DP)
3.07 TFlops (AI)
Intel Xeon Gold 6148
(27.5M Cache,
2.40 GHz, 20 Core)
0.550 EFlops(AI), 37.2 PFlops(DP)
19.88 PFlops (Linpack), Ranked #5 Top500 June 2018
131TB/s MEM BW
Full Bisection BW within Rack
70kW Max
1088 Compute Nodes
4352 GPUs
4.19 PB/s MEM BW
1/3 of Oversubscription BW
2.3MW
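The per-level numbers above are consistent with a simple roll-up of the chip figures on this slide (a consistency check, rounding to three significant digits):

Node (DP): 4 GPUs x 7.8 + 2 CPUs x 1.53 ≈ 34.2 TFlops; Node (AI): 4 x 125 + 2 x 3.07 ≈ 506 TFlops
Chassis (2 nodes): ≈ 68.5 TFlops (DP), ≈ 1.01 PFlops (AI)
Rack (17 chassis, 34 nodes): ≈ 1.16 PFlops (DP), ≈ 17.2 PFlops (AI)
System (32 racks, 1088 nodes, 4352 GPUs): ≈ 37.2 PFlops (DP), ≈ 0.550 EFlops (AI)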
GPU Compute Nodes
• NVIDIA TESLA V100
(16GB, SXM2) x 4
• Intel Xeon Gold 6148
x 2 Sockets
– 20 cores per Socket
• 384GiB of DDR4 Memory
• 1.6TB NVMe SSD x 1
– Intel DC P4600 u.2
• EDR Infiniband HCA x 2
– Connected to other Compute Nodes
and Filesystems
13
Node block diagram: two Xeon Gold 6148 sockets linked by UPI x3 (10.4 GT/s), each with DDR4-2666 32GB x 6 (128 GB/s) and an IB HCA (100 Gbps); an NVMe SSD; PCIe gen3 x16 links through x48/x64 PCIe switches connect the four Tesla V100 SXM2 GPUs, which are interconnected with NVLink2 x2.
Rack as Dense-packaged “Pod”
Pod = one rack: 17 node chassis (34 compute nodes), full-bisection InfiniBand EDR within the rack, warm-water cooling
14
Pod #1
LEAF#1
(SB7890)
LEAF#2
(SB7890)
LEAF#3
(SB7890)
LEAF#4
(SB7890)
SPINE#1
(CS7500)
SPINE#2
(CS7500)
CX400#1
CX2570#1
CX2570#2
CX400#2
CX2570#3
CX2570#4
CX400#3
CX2570#5
CX2570#6
CX400#17
CX2570#33
CX2570#34
FBB#1
(SB7890)
FBB#2
(SB7890)
FBB#3
(SB7890)
1/3 Oversubscription BW
IB-EDR x 24
Full bisection BW
IB-EDR x 72
InfiniBand EDR x1
InfiniBand EDR x6
InfiniBand EDR x4
x 32 pods
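The stated 1/3 oversubscription follows directly from the link counts above: each pod has full-bisection InfiniBand EDR internally (72 links), but only 24 EDR uplinks leave the pod through the FBB switches toward the spine, so inter-pod bandwidth is 24 / 72 = 1/3 of the intra-pod bisection bandwidth.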
Hierarchical Storage Tiers
• Local Storage
– 1.6 TB NVMe SSD (Intel DC P4600 u.2) per Node
– Local Storage Aggregation w/ BeeOND
• Parallel Filesystem
– 22PB of GPFS
• DDN SFA14K (w/ SS8462 Enclosure x 10) x 3 sets
• Bare Metal NSD servers and Flash-based Metadata
Volumes for metadata operation acceleration
– Home and Shared Use
• Object Storage
– Part of GPFS using OpenStack Swift
– S3-like API Access, Global Shared Use
– Additional Secure Volumes w/ Encryption
(Planned)
15
Parallel Filesystem
Local Storage
as Burst Buffers
Object Storage as Campaign Storage
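A minimal sketch of the burst-buffer pattern implied above: stage the dataset once from the shared parallel filesystem to the node-local NVMe SSD, then let training do its small random reads locally. The paths and the LOCAL_SCRATCH environment variable are hypothetical placeholders, not actual ABCI mount points.

import os
import shutil

SHARED_DATASET = os.path.expanduser("~/datasets/imagenet")           # on GPFS (assumed path)
LOCAL_SCRATCH  = os.environ.get("LOCAL_SCRATCH", "/tmp/local_nvme")  # node-local SSD (assumed)

def stage_to_local(src=SHARED_DATASET, dst=LOCAL_SCRATCH):
    # Copy the dataset to local NVMe once, so the training loop never touches GPFS directly.
    target = os.path.join(dst, os.path.basename(src))
    if not os.path.exists(target):
        shutil.copytree(src, target)
    return target

# local_path = stage_to_local()
# train(data_dir=local_path)   # training then reads from the burst buffer, not from GPFS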
16
• Environments
– ABCI 64 nodes (256 GPUs)
– Framework: ChainerMN v1.3.0
• Chainer 4.2.0, Cupy 4.2.3, mpi4py 3.0.0, Python 3.6.5
– Baremetal
• CentOS 7.4, gcc-4.8.5,
CUDA 9.2, CuDNN 7.1.4, NCCL 2.2, OpenMPI 2.1.3
• Settings
– Dataset: ImageNet-1K
– Model: ResNet-50
– Training:
• Batch size: 32 per GPU, 32 x 256 in total
• Learning rate: starts at 0.1, multiplied by 0.1 at epochs 30, 60, and 80,
with warm-up scheduling (sketched below)
• Optimization: Momentum SGD (momentum=0.9)
• Weight Decay: 0.0001
• Training Epoch: 100
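A minimal, framework-agnostic sketch of the learning-rate schedule above (the 5-epoch linear warm-up length is an assumption; the slide only says “warm up scheduling”):

def learning_rate(epoch, base_lr=0.1, warmup_epochs=5):
    # Linear warm-up to base_lr, then x0.1 at epochs 30, 60, and 80.
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    lr = base_lr
    for boundary in (30, 60, 80):
        if epoch >= boundary:
            lr *= 0.1
    return lr

# Other settings from the slide: Momentum SGD (momentum=0.9), weight decay 0.0001,
# batch size 32 per GPU (32 x 256 = 8192 in total), 100 training epochs.
for epoch in (0, 4, 10, 30, 60, 80, 99):
    print(epoch, learning_rate(epoch))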
ABCI Datacenter Facility: commoditizing supercomputer cooling technologies to datacenters
17
19m x 24m x 6m
Rack layout plan (figure): 32 compute racks, 2 InfiniBand switch racks, storage racks (STRG 33–38), MPF racks, and others; layout option with the IB switch racks placed in the UPS zone
High Voltage Transformers (3.25MW)
Passive
Cooling Tower
Free cooling
Cooling Capacity: 3MW
Lithium Battery
(Plan)
1MWh, 1MVA
Active Chillers
Cooling Capacity: 200kW
Future Expansion Space
UPS
72 Racks
w/ Hybrid Cooling
18 Racks
w/ UPS
Server Room Cooling
• Skeleton frame structure for
computer racks, water circuit,
air-cooling units, and cables
• 70kW/rack cooling capacity
60kW liquid (32°C water) + 10kW air
18
19 or 23 inch
Rack (48U)
Computing Server
Hot Water
Circuit
Front side
Air
CDU
Coolability 10kW
Water
Hot Aisle Capping
35°C
Water Block
(CPU and/or Accelerator, etc.)
Air 40°C
32°C → 40°C
Cold Water
Circuit
Fan Coil Unit
Coolability 10kW
Cold Aisle
35°C
Hot Aisle
40°C
Capping Wall
Capping Wall
Fan Coil Unit
Server Rack
Water
Circuit
Fan Coil
Unit
Server
Rack
Hot
Aisle
Software Stack for ABCI
• Batch Job Scheduler
– High throughput computing
• Minimum Pre-installed Software
– Users can deploy their environments
using Anaconda, pip, Python venv, etc. (see the sketch below)
– Reduce operational cost
• Container Support
– Singularity for multi-node jobs w/
user customized images
– Docker for single-node jobs w/
site-certified images
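A minimal sketch of the “bring your own environment” model above, using only the Python standard library (the environment path and package list are illustrative, not an ABCI recipe):

import subprocess
import venv
from pathlib import Path

env_dir = Path.home() / "envs" / "myproject"      # illustrative location in the user's home
venv.EnvBuilder(with_pip=True).create(env_dir)    # create an isolated Python environment

# Install user-chosen packages without touching the system-wide installation.
pip = env_dir / "bin" / "pip"
subprocess.run([str(pip), "install", "chainer", "cupy", "mpi4py"], check=True)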
19
User Applications
DL
Frameworks
Python, Ruby, R, Java, Scala, Perl, Lua, etc.
ISV AppsOSSHadoop/
Spark
GCC PGI
OpenMPI MVAPICH2
Intel Parallel
Studio XE
Cluster Edition
CUDA/CUDNN/NCCL
GPFS BeeOND
OpenStack
Swift
Univa Grid Engine
Singularity Docker
CentOS/RHEL
• Singularity
– HPC Linux Container developed at LBNL and Sylabs Inc.
• http://singularity.lbl.gov/
• https://github.com/singularityware/singularity
– Rootless program execution, storage access
• No root-privileged daemons, unlike Docker
– Computing resources are managed by job schedulers such as UGE
• Root privilege is given by commands w/ setuid
– Mount, Linux namespace creation, Bind paths between host and container
– Existing Docker images available via DockerHub
– HPC software stacks available
• CUDA, MPI, Infiniband, etc.
20
How to Create Container Images for Distributed DL
• Host
– Mount native device drivers and
system libraries to containers
• GPU, Infiniband, etc.
• cf. nvidia-docker
• Containers
– Install user space libraries
• CUDA, CuDNN, NCCL2, ibverbs, MPI, etc.
– Install distributed deep learning
frameworks
• Optimized build for underlying computing
resources
21
Base Drivers, Libraries on Host
CUDA
Drivers
Infiniband
Drivers
Filesystem
Libraries
(GPFS, Lustre)
Userland Libraries on Container
CUDA CuDNN NCCL2
MPI
(mpi4py)
Mount
ibverbs
Distributed Deep Learning
Frameworks
Caffe2, ChainerMN, Distributed TensorFlow, MXNet
AICloudModules: Container Image Collection for AI
• Base Images
– Optimized OS Images for
AI Cloud
• Including CUDA, CuDNN, NCCL,
MPI, etc. userland libraries
• Excluding native drivers on the host
(CUDA, Infiniband, GPFS, etc.)
• DL Images
– Installed Deep Learning
Frameworks and AI Applications
based on Base OS Images
22
Base Images
Base OS Images
incl. CUDA, CuDNN, NCCL2, MPI, Infiniband
DL Images
Images for Deep Learning Frameworks
based on HPCBase
CentOS, Ubuntu
Base Drivers, Libraries on Host
GPU, Infiniband, GPFS, MPI
Caffe2, ChainerMN, Distributed TensorFlow, MXNet, CNTK
https://hub.docker.com/r/aistairc/aaic/
as a prototype
• Accommodate open ecosystems of data and software for AI R&D
– Sharable/Reusable/Reproducible algorithm methods,
training data w/ labels, trained models, etc.
• Creating various social applications that use AI (ML, DL, etc.)
23
Training
Dataset
Public Data
on Internet Deep Learning
Framework
Data
Collection and
Cleansing
Modeling Learning
Open
Data in
Enterprise
Open
Data in
Government
Trained
Model
Inference
Publish trained models to the public
to apply them in various domains
Prepare Training Datasets
Trial and error for designing
networks, parameters
Accelerate learning with HPC
to improve accuracy of models
Reuse
Trained
Model
Implementing
AI Techs
to Society
Retail Stores
Factories
Logistics Centers
Object Detection and
Tracking
Summary
• Issues for Public AI Infrastructures
– Accelerating AI R&D
– Bridging AI Tech through Common Platform
– Cloud-Datacenter-ready Infrastructure, but Supercomputing
• ABCI (AI Bridging Cloud Infrastructure)
– 0.550 EFlops (AI), 37.2 PFlops (DP),
19.88 PFlops (Linpack), #5 Top500 June 2018
– Open, Public, Dedicated Infrastructure for AI/Big Data
– A Reference for AI/Big Data and HPC Systems
24
25