www.univa.com
Ian Lumb
Solutions Architect
SUSE, Booth #1681
SC17, Denver, CO
Managing Containerized
HPC and AI Workloads on
TSUBAME3.0
TSUBAME 3.0 - Compute Node Overview
A compute node:
■ 256 GB DDR4 RAM
■ 2 TB SSDs
■ 2x 14-core CPUs
■ 4x GPUs
■ 4x HFI (1000 Gbps)
⇒ This is what they call a “fat compute node”
TSUBAME 3.0 - The Challenges
12.2 PetaFLOPS within only 20 racks or 540 compute nodes
➢ It is the smallest >10 PFLOPS machine in the world
➢ Wasted/unreachable resources (parts of a node) have a much bigger impact on such a “small” cluster
➢ Performance is also highly dependent on job placement due to additional resources, such as GPUs and HFI devices (the closer, the better)
➢ It needs smart and flexible partitioning to ensure high utilization
TSUBAME 3.0 - UGE Enhancements
▪ Core Bindings
  ▪ Enhanced PE support and strategies
▪ RSMAPs (illustrative configuration sketch after this list)
  ▪ Enhanced PE support and chaining
▪ Docker
  ▪ Define unique but known container hostnames
  ▪ Configure the InfiniBand device in the container
  ▪ Map all job users into the container
  ▪ Provide execution host and Docker container hostnames to the job
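A minimal sketch of how the gpu and hfi resources above might be modelled as RSMAP consumables and attached to a compute node; the column layout, defaults, and per-host values are illustrative assumptions, not the actual TSUBAME3.0 configuration. The attribute names gpu and hfi match the qsub request on the next slide.

# Sketch only (assumptions, not the actual TSUBAME3.0 setup):
# 1) Declare gpu and hfi as RSMAP consumables in the complex list (qconf -mc):
#      name  shortcut  type   relop  requestable  consumable  default  urgency
#      gpu   gpu       RSMAP  <=     YES          YES         0        0
#      hfi   hfi       RSMAP  <=     YES          YES         0        0
# 2) Attach four device IDs per compute node (qconf -me <hostname>):
#      complex_values   gpu=4(0 1 2 3),hfi=4(0 1 2 3)
#    The scheduler then grants concrete IDs that a job can reference as
#    ${gpu(0)} and ${hfi(0)} in its -xd Docker options.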
Putting it all together …
qsub -l docker,docker_images="*ubuntu:14.04*" \
     -l gpu=1,hfi=1,hosts=1 \
     -xd '--device=/dev/gpu${gpu(0)}:/dev/gpu,--device=/dev/hfi${hfi(0)}:/dev/hfi' \
     -xd '--hostname ${hosts(0)}' \
     -binding one_socket_balanced:4 \
     -pe rr 4 jobscript.sh
Regardless of the host OS, the application gets whatever OS it needs (if the site runs its own Docker registry, the image can even be prepared exactly as required)
Each PE task gets 1 GPU and 1 HFI device (both with the same ID, i.e. in the same “location”) and a unique hostname
No matter which physical devices are granted, the application sees only /dev/gpu and /dev/hfi inside the container and can use them directly, without any performance penalty!
Even though the RSMAP would allow 7 cores per GPU, we bind only 4 per PE task, leaving room for other jobs that need neither a GPU nor an HFI. We also stay on one socket per host.
Container gets a unique, known (!) hostname
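To make the callouts concrete, here is a minimal sketch of what jobscript.sh could do inside its container; the echo/ls lines and the commented-out launch are assumptions for illustration, not the actual TSUBAME3.0 job script.

#!/bin/bash
# Illustrative jobscript.sh (assumption; not the real TSUBAME3.0 script).
# Each PE task runs in a container with a unique, known hostname and sees
# the granted devices under fixed paths, regardless of the physical IDs.
echo "Running in container: $(hostname)"   # hostname injected via -xd '--hostname ${hosts(0)}'
ls -l /dev/gpu /dev/hfi                    # always these paths, thanks to the --device remapping
# The real application (e.g. an MPI + CUDA binary) would be launched here:
# mpirun -np "$NSLOTS" ./my_app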