Akihiro Nomura, 2019-06-17, ISC'19, Frankfurt, Germany
Global Scientific Information and
Computing Center
Introducing Container Technology to
TSUBAME3.0 Supercomputer
Part of this work was supported by JST CREST Grant Number JPMJCR1501, Japan
Part of this work was conducted as research activities of AIST - Tokyo Tech
Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL)
What is TSUBAME3.0?
• TSUBAME: Supercomputer series at Tokyo Tech
• TSUBAME1.0: ClearSpeed
• TSUBAME1.2: NVIDIA Tesla S1070
• The first supercomputer to use GPUs as compute units (2008.11, T1.2)
• TSUBAME2.0: NVIDIA Tesla M2050 → TSUBAME2.5: K20X
• Operated for 7 years: 2010.11 – 2017.10
• TSUBAME-KFC(/DL): Oil-submerged supercomputer testbed
• #1 in Green500 (2013.11, 2014.06)
• TSUBAME3.0: NVIDIA Tesla P100
• #1 in Green500 (2017.06)
• Currently #25 in TOP500 (2019.06)
• Operation started on 2017.08
• ~1500 users from academia and industry, not only from Tokyo Tech
• Various application domains, levels of expertise, and software (ISV, self-made, serial, parallel)
• Important to keep each user's research confidential from other users
2019/6/17 ISC-HPC 2019 2
Experience from the 7-year operation of
the TSUBAME2 supercomputer
• Our previous TSUBAME2 operated from 2010.11 to 2017.10
• We faced many problems during long-term operations
• Resource separation is required
• How to keep software up to date
• How to stop users from over-consuming resources on shared nodes
• How to make the system energy efficient
• How to cope with decaying network cables
(SIAM PP18)
• Other problems that I cannot disclose, or
do not want to remember again
2019/6/17 ISC-HPC 2019 3
Why resource separation?
• TSUBAME2's 1408 compute nodes are too fat
• To utilize the 3 GPU + 2 CPU/node configuration, users need to program with
CUDA/OpenACC for the GPUs, OpenMP for intra-node parallelism, and MPI for inter-
node communication… just too hard for most users
• Three types of users (or workloads)
• Expert, Guru: fully utilize the 3 GPU + 2 CPU/node configuration
• GPU user: uses 1-3 GPUs, but not many CPU threads
• CPU user: doesn't use GPUs at all
• Assigning a full node to every user is just a waste of resources
2019/6/17 ISC-HPC 2019 4
Resource separation accomplished in T2
(in 2010)
• VM (KVM)-based approach
• Run a CPU VM inside GPU-job nodes
• GPUs couldn't be virtualized
• NW performance is limited due to IPoIB
• Nice usability
• Users can SSH into both the GPU part and the CPU part for debugging / monitoring
• Many TSUBAME1 users did this during their jobs
• Good isolation
• GPU users cannot see what's going on in the CPU part, and vice versa
• Bad flexibility
• We cannot dynamically change the number of nodes that are split in two
[Diagram: a node with 2 CPUs, 3 GPUs and 2 IB HCAs, split into a bare-metal GPU part (8 cores, "G") and a 4-core CPU VM ("U/V"), connected via IP over IB]
2019/6/17 ISC-HPC 2019 5
What happens to the SW environment
if we operate one system for a long time
• Everything gets stale
• System software compatibility problem
• GPU and Lustre drivers won't support a 5-year-old OS distro
• OS support problem
• OS vendors drop support for 5-year-old distros
• ISV software compatibility problem
• Some newer versions won't work on the old OS
• Some stable versions aren't verified on the new OS
• Library version hell
• Upgrading to a newer OS version is painful
• Everything must be validated again, esp. ISV software
• We did it once (SLES11 → SLES11SP3, 2014.08), at a large cost
2019/6/17 ISC-HPC 2019 6
When I tried to install Caffe on T2.5
(2015.05)
• SLES11SP3: two years after its release, <1 year after the system update
• SP4 appeared just after verification and installation
• Got a request from a user on Friday evening, thought it would be easy
• Experienced library hell; it took 3 days to install
• Lots of missing libraries
• >20 Python packages, gflags, glog, leveldb, lmdb, protobuf, snappy, cuDNN, OpenCV
• GCC is too old, let's install a newer one…
• Ah, I need to recompile everything with the new GCC…
• Also tried TensorFlow later, but gave up
• Some binary-shipped parts require a newer glibc
∴ Introducing bleeding-edge software into an old system is quite painful
2019/6/17 ISC-HPC 2019 7
Our expectations of container technology
for the upcoming TSUBAME3 (as of late 2016)
• We just wanted something that lets us
• Make the OS kernel version and the userland version independent
• Provide new system software and libraries at minimal cost
• Provide the old userland if necessary
• Then we can skip re-validation of all ISV software in the newer environment
• Also (partially) useful for replaying old experiments later
• Split resources (CPU, GPU, memory, NW) without performance
drawbacks
• Secure isolation between the separated partitions
• Dynamic partitioning
• Allow users to do what they did on previous systems
• In our case, SSH into a compute node while a job is running
2019/6/17 ISC-HPC 2019 8
Our choices for resource separation
(again, as of late 2016)
2019/6/17 ISC-HPC 2019 9
• VM and Docker were the available choices
• Other container technologies (Shifter, Singularity, …) were not mature yet
VM vs. Docker container, by metric:
• Performance
VM: GPU can be virtualized; interconnect: IB supports SR-IOV, but Omni-Path has no support
Docker: almost no overhead
• Usability
VM: SSH is not a problem
Docker: SSH into the container requires some integration
• Isolation
VM: isolated without problems
Docker: OK if cgroup works well
• Userland virtualization
VM: hard to deploy OSes dynamically
Docker: userland can be chosen
• Flexibility
VM: turning VMs on/off is costly
Docker: the container itself won't be a problem
We didn't specify VM or Docker explicitly,
but requested the functionality in the procurement
The vendor chose Docker
What a TSUBAME3 node looks like
• The node is larger than a T2 node
• 28 CPU cores
• 4 GPUs
• 4 Omni-Path HFIs
• Too large for most users
• Expert, Guru
• GPU user
• CPU user
• We expect most users to
split the node
2019/6/17 ISC-HPC 2019 10
How we separate the node physically
• Separate the node
hierarchically (worked example below)
• Inspired by the buddy system in
the Linux kernel's page allocator
• Less flexible because of the fixed
mem/CPU and mem/GPU ratios
• Better scheduling to minimize
scattered resources
2019/6/17 ISC-HPC 2019 11
[Diagram: one node (2 CPUs / 28 cores, 4 GPUs, 4 Omni-Path HFIs) split hierarchically into partition types: H (14 cores), Q (7 cores), G (2 cores), C4 (4 cores)]
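As an illustrative reading of the diagram above (the exact GPU and memory shares per partition type are site configuration, so the numbers here are assumptions): a full node of 28 cores and 4 GPUs splits into two H partitions of 14 cores each, an H splits into two Q partitions of 7 cores each, and the smaller G (2-core) and C4 (4-core) types are carved out below Q, so that, buddy-style, freed sibling partitions can later be merged back into larger ones.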
Resource Utilization in TSUBAME3
(2019.04)
• ~70% of jobs (measured by vnode×time) run on partitioned nodes rather than full nodes
• The sum of the vnode×time product exceeded 540 × 30 days in busy months,
i.e., more than the 540 physical nodes could deliver if every job occupied a full node
• We couldn't have served those jobs without partitioning
2019/6/17 ISC-HPC 2019 12
How we separate the node logically
• Integration by HPE (primary vendor) and UNIVA (scheduler vendor)
• Just using cgroup (minimal sketch after this list)
• To achieve the minimal goal of resource separation in a short development time
• Userland virtualization is not urgent; it should be implemented by the time the initial
userland becomes obsolete
• SSH to (part of) compute nodes is desirable, but not a requisite
• Using Docker, integrated with the scheduler
• To achieve the full goal, including the goals triaged out of the cgroup implementation
• Multi-node Docker integration was challenging; there was no predecessor at that time
• It took almost two years to get the Docker part into service
• The integration broke scheduling priorities etc. in specific situations
• Finally started the Docker-based service on 2019.04
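A minimal sketch of what the cgroup-only separation amounts to, assuming cgroup v1 and a 7-core / 1-GPU partition; on TSUBAME3 the scheduler prolog does this automatically, and the job name, core range, memory cap and device numbers below are illustrative assumptions:

# run as root when the job starts (cgroup v1 layout assumed)
JOB=job12345
for c in cpuset memory devices; do mkdir -p /sys/fs/cgroup/$c/$JOB; done
echo 0-6 > /sys/fs/cgroup/cpuset/$JOB/cpuset.cpus              # give the job 7 cores
echo 0   > /sys/fs/cgroup/cpuset/$JOB/cpuset.mems              # and the matching NUMA node
echo 60G > /sys/fs/cgroup/memory/$JOB/memory.limit_in_bytes    # cap its memory
for m in 1 2 3; do                                             # hide GPUs 1-3 so only /dev/nvidia0 is visible
  echo "c 195:$m rwm" > /sys/fs/cgroup/devices/$JOB/devices.deny
done
for c in cpuset memory devices; do                             # attach the job's top process ($JOBPID comes from the scheduler)
  echo $JOBPID > /sys/fs/cgroup/$c/$JOB/tasks
done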
2019/6/17 ISC-HPC 2019 13
Our requirements for container technology
• DO NOT GIVE root TO USERS
• We use several filesystems in our network
• Cluster NFS for home storage
• Lustre for high-speed shared storage
• (local SSD + BeeOND)
• We MUST prevent users from accessing other users' data
→ We decided NOT to allow users to bring their own images
• In Docker, root in the container is (sometimes restricted) root on the host OS
• We cannot filter out a malicious image that allows escaping from the jail
• Files with the setuid bit, local vulnerability exploits, … (a quick check is sketched below)
• Drawback: users cannot bring their own images
• We initially thought that was not a problem, or an inevitable compromise
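To illustrate the setuid concern, this is the kind of first-pass scan an admin could run over an untrusted image without ever executing it; the image name is hypothetical, and catching setuid files is only a filter, not a guarantee against local exploits:

# create (but never start) a container from the image, then list setuid/setgid files in its filesystem
cid=$(docker create example.org/user/untrusted-image:latest true)
docker export "$cid" | tar -tvf - \
  | awk 'substr($1,4,1) ~ /[sS]/ || substr($1,7,1) ~ /[sS]/'   # mode char 4 = setuid bit, char 7 = setgid bit
docker rm "$cid" > /dev/null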
2019/6/17 ISC-HPC 2019 14
Time flies like an arrow, in just 2 years
• During introduction and preparation, container tech evolved rapidly and we fell out of sync
• What users expected from containers was not what we planned to do with containers
• Lots of application containers appeared, including HPC apps
Pics from
http://www.projectcartoon.com
2019/6/17 ISC-HPC 2019 15
Other container choices: Singularity
• Docker was a general-purpose container
• Not designed to be used by untrusted users
• HPC-aware containers are being implemented
• Shifter
• Prevents users in the container image from getting root
• Singularity
• Runs containers without root (except for startup, cgroup and FS mounts); see the example below
• There is a security document describing the setuid-related implementation!!
• Can we accept user-provided container images using Singularity?
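For context, the rootless user workflow with Singularity 2.6 looks roughly like this; the image is just an example and the generated filename may differ:

singularity pull docker://ubuntu:18.04                      # converts the Docker image into a local .simg file
singularity exec ubuntu-18.04.simg cat /etc/os-release      # run a command inside the container's userland
singularity shell ubuntu-18.04.simg                         # or open an interactive shell in it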
2019/6/17 ISC-HPC 2019 16
Introducing Singularity to TSUBAME3.0
(2018.08-09)
• The request came from a user, with a pointer to the security
consideration document
• Checked the Singularity source code (setuid-related parts) with
multiple staff members (the setuid helpers in question are listed by the check below)
• Discussion in the research computer system audit board
• Not the usual path for ordinary software, but Singularity requires setuid
binaries
• Finally installed Singularity 2.6
• Singularity 3.2.1 has also been available since last week
• Did the same setuid-related code check, as the implementation changed
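For reference, the setuid-root helpers that such a review focuses on can be enumerated directly; the install prefix below is an assumption, since it depends on how Singularity was configured at build time:

# list the setuid-root helper binaries shipped by the Singularity install
find /usr/local/libexec/singularity -type f -perm -4000 -ls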
2019/6/17 ISC-HPC 2019 17
Pros and cons for Docker and Singularity
2019/6/17 ISC-HPC 2019 18
• Note: this is just TSUBAME3's case;
the metrics may differ at other supercomputer sites
Docker in TSUBAME3 vs. Singularity in TSUBAME3, by metric:
• Usability
Docker: can SSH into the container; an IP address is assigned
Singularity: running daemons inside the container is not supported; no IP address is assigned
• Isolation
Docker: already integrated
Singularity: needs to be done from outside, but possible
• Userland virtualization
Docker: userland can be chosen, but only by system admins
Singularity: users can bring arbitrary images
• Service start
Docker: delayed to 2019.04
Singularity: 2018.09
Yes, HPC containers started working with
Singularity; is that all?
• Unfortunately NO for MPI apps
• Requires integration of both kernel (host)-level drivers and userland libraries
• Also, the process launch must be done on the host side, not from inside the container
• mpirun …… singularity exec …… path/to/mpiapp (expanded example below)
• Many container implementations have mechanisms to bridge differences in the
NVIDIA GPU driver version
• NVIDIA-docker, the --nv option of Singularity…
• Yes, TSUBAME3 is NVIDIA GPU Cloud Ready
• TSUBAME3 uses Omni-Path, while other HPC sites often use
InfiniBand (or Tofu, Aries, …)
• Users (except gurus) don't care what the underlying interconnect is
• Unlike accelerators: users don't expect CUDA to work on an FPGA
• However, the system software required inside the container differs
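A concrete (hypothetical) expansion of the host-side launch pattern above; the module name, image file, path inside the image and process counts are illustrative assumptions:

module load openmpi                                  # host MPI stack (module name is site-specific)
mpirun -np 16 -npernode 4 \
  singularity exec --nv ./myapp.simg \
  /opt/app/bin/mpiapp                                # --nv maps the host NVIDIA driver libraries into the container
# this only works if the MPI library inside the image is ABI-compatible with the host mpirun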
2019/6/17 ISC-HPC 2019 19
What we expect from an MPI-implementation-
independent container
• An MPI equivalent of the --nv option?
• Auto-introduces MPI-related system software
• Requires MPI ABI compatibility at some level
• MPI ABI compatibility initiatives
• libfabric
• Recompile MPI apps against a specific MPI,
when the image is built for a specific system
• Fat container images that choose the MPI lib
dynamically? (sketch below)
2019/6/17 ISC-HPC 2019 20
[Diagram: a fat container bundling the App together with an MPI build for InfiniBand and an MPI build for Omni-Path]
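One possible shape of the fat-image idea, as a sketch: an entrypoint baked into the image probes which fabric the host exposes and puts the matching MPI build on the library path. The /opt paths are hypothetical; the device probes assume the hfi1 (Omni-Path) and verbs (InfiniBand) drivers:

#!/bin/sh
# entrypoint.sh inside the fat image
if [ -e /dev/hfi1_0 ]; then                # Omni-Path HFI exposed by the host
  MPI_HOME=/opt/mpi-psm2
elif [ -d /dev/infiniband ]; then          # InfiniBand verbs devices exposed by the host
  MPI_HOME=/opt/mpi-verbs
else
  MPI_HOME=/opt/mpi-tcp                    # fall back to plain TCP
fi
export PATH=$MPI_HOME/bin:$PATH
export LD_LIBRARY_PATH=$MPI_HOME/lib:$LD_LIBRARY_PATH
exec "$@"                                  # run the application against the selected MPI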
Wrap up
• We tried to introduce Docker to TSUBAME3.0 in order to implement resource
separation and flexible userland updates, not treating containers as the goal, but just
as a tool
• However, users' expectations of containers were different from what we had thought
• Pursuing the full goal at once with Docker was too adventurous and took a very long
time to get into service, but it is now working well
• It is sometimes important to change one's mind during system operation; opinions from
users matter
• For system administrators, security documentation is very important
• Running massively parallel applications everywhere using containers still has
several problems to solve
• I believe I did (and am doing) something stupid, due to historical reasons, or just from
not knowing the appropriate technology
• Your input is always welcome
2019/6/17 ISC-HPC 2019 21
Acknowledgement
• TSUBAME3 operation working group members
• ~15 faculty and other staff members
• HPE and UNIVA engineers, who finally realized the container-based
TSUBAME3.0 system with a lot of effort
• We expect to upgrade to SLES15 in 2020.03
• Many container vendors, for formal and informal discussions
• And users, especially those who requested bleeding-edge software
• TSUBAME Computing Services: https://www.t3.gsic.titech.ac.jp/en/
2019/6/17 ISC-HPC 2019 22