When HPC meet ML/DL: Manage HPC Data Center with Kubernetes
1. When HPC Meet ML/DL
manage HPC Data Center
with Kubernetes
Yong Feng (yongfeng@ca.ibm.com)
2. IBM Systems
Please Note:
• IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice
and at IBM’s sole discretion.
• Information regarding potential future products is intended to outline our general product direction and it
should not be relied on in making a purchasing decision.
• The information mentioned regarding potential future products is not a commitment, promise, or legal
obligation to deliver any material, code or functionality. Information about potential future products may not be
incorporated into any contract.
• The development, release, and timing of any future features or functionality described for our products
remains at our sole discretion.
• Performance is based on measurements and projections using standard IBM benchmarks in a controlled
environment. The actual throughput or performance that any user will experience will vary depending upon
many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the
I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be
given that an individual user will achieve results similar to those stated here.
3. IBM Systems
Senior Architect of IBM Spectrum (formerly Platform Computing)
• 12+ years of work on resource managers and workload schedulers since Ph.D.
• Lead teams on open source development, from OpenStack, YARN, Mesos, and Kubernetes to Spark
• Lead the team for core platform development of IBM Cloud Private
Who am I?
4. IBM Systems
Agenda
• What does ML/DL mean for HPC?
• What do Containers/Docker mean for HPC?
• Kubernetes Basics
• Run MPI Jobs on Kubernetes
• Run an ML/DL Pipeline on Kubernetes
• Gaps of Kubernetes for the HPC Data Center
• What about Now?
6. IBM Systems
• New business challenges, especially Big Data, bring new topics: HPDA, AI, and IoT.
• Algorithm scientists have to keep optimizing their codes with new technology.
• ML/DL solves business problems across many domains.
• New hardware technology makes ML/DL possible.
ML/DL is HPC’s 1st Consumer Killer App?
7. IBM Systems
[Diagram: Scheduler; Compute Resources & Network running Simulation, Visualization, Analytics, and Machine Learning applications; Remote Users; data exchange among them]
• Scheduler controls job start and placement
• Applications exchange data as needed, acting as producers, consumers, or both
• Remote users receive/provide feedback
HPC Solution Workflow
8. IBM Systems
• HPC common requirements
  • Hardware: high-IOPS storage, low-latency networks, powerful CPUs, large memory, etc.
  • Software: parallel accelerators, job scheduler
• GPUs become critical
• Various frameworks beyond batch jobs, such as in-memory databases and long-running services
• MPI is still important
• Development pipeline
• Containers do matter
Infrastructure and Software Challenge
10. IBM Systems
• Portability to resolve complexity
• Scalability to fit the nature of distributed/parallel computing
• Developer-friendly pipeline: develop, build, distribute, and deploy
• Improved resource utilization
• Less overhead
• Network and resource isolation
• Supported by existing HPC job schedulers
Values
11. IBM Systems
• Old Linux kernels
• Support for infrastructure devices and software: IB, parallel FS, GPU, FPGA, etc.
• Security
• Limits on HPC-specific optimization
• Image control
• Troubleshooting
Challenge
From: https://www.hpcwire.com/2017/05/04/singularity-hpc-container-technology-moves-lab/
From: http://www.hpctoday.com/viewpoints/containers-meet-hpc/
13. IBM Systems
Kubernetes Features
• Intelligent scheduling
• Self-healing
• Horizontal scaling
• Service discovery & load balancing
• Automated rollouts & rollbacks
• Management of secrets & configuration
• Storage orchestration
• Batch execution
14. IBM Systems
Kubernetes Concepts
• Pod: a group of co-located containers.
• Service: defines a set of pods and a means by which to access them, such as a single stable IP address and corresponding DNS name.
• Volume: a directory, possibly with some data in it, which is accessible to a container as part of its filesystem.
• Label: a key/value pair that is attached to a resource, such as a pod, to convey a user-defined identifying attribute.
• ReplicaSet: ensures that a specified number of pod replicas are running at any one time.
• StatefulSet: a controller that provides a unique identity to its pods, with guarantees about the ordering of deployment and scaling.
• Job (batch job): creates one or more pods and ensures that a specified number of them successfully terminate.
• Secret: an object that contains a small amount of sensitive data. Such information might otherwise be put in a pod specification or in an image.
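To make the label and service concepts concrete, here is a minimal Python sketch of equality-based label selection, the mechanism by which a service picks the pods it fronts. The pod names and label values are made up for illustration, and the sketch models only the matching rule, not the Kubernetes API itself.

```python
from typing import Dict, List


def selector_matches(selector: Dict[str, str], labels: Dict[str, str]) -> bool:
    """Equality-based match: every selector key must exist with the same value."""
    return all(labels.get(key) == value for key, value in selector.items())


def select_pods(selector: Dict[str, str], pods: List[dict]) -> List[str]:
    """Return the names of the pods whose labels satisfy the selector."""
    return [pod["name"] for pod in pods if selector_matches(selector, pod["labels"])]


# Hypothetical pods; names and labels are illustrative only.
pods = [
    {"name": "serving-0", "labels": {"app": "serving", "tier": "frontend"}},
    {"name": "serving-1", "labels": {"app": "serving", "tier": "frontend"}},
    {"name": "worker-0", "labels": {"app": "training"}},
]

print(select_pods({"app": "serving"}, pods))
```

Because the selector is evaluated against labels rather than pod names, replicas created later by a ReplicaSet are picked up by the same service without any change to the service definition.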
18. IBM Systems
• Docker image of the MPI runtime environment
• Kubernetes batch Job to manage the MPI job lifecycle
• Kubernetes Secret for password-less ssh access among workers
• Bootstrap to integrate with MPI Process Lifecycle Management (PLM)
• Kubernetes platform to work with other services and resources
• Kubernetes as a general-purpose data center platform
Run MPI in Kubernetes
[Diagram: kube-api and three Job pods, each with a bootstrap process; one pod runs mpirun and the other two run sshd]
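The pieces listed on this slide can be sketched in Python as plain data: a batch Job manifest for the sshd worker pods with the ssh-key Secret mounted, plus an Open MPI style hostfile that a bootstrap step could hand to mpirun. This is only a sketch under assumptions: the image name, Secret name, mount path, and the `/bootstrap.sh` entry point are hypothetical, and the talk does not show the actual bootstrap implementation.

```python
import json
from typing import Dict, List


def build_worker_job(name: str, image: str, workers: int, ssh_secret: str) -> Dict:
    """Build a batch/v1 Job manifest (as a dict) for the sshd worker pods."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "labels": {"mpi-job": name}},
        "spec": {
            "parallelism": workers,   # run all workers at the same time
            "completions": workers,   # the Job is done when every worker exits 0
            "template": {
                "metadata": {"labels": {"mpi-job": name, "role": "worker"}},
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "worker",
                        "image": image,
                        # Hypothetical bootstrap script that starts sshd and
                        # exits when the MPI job finishes.
                        "command": ["/bootstrap.sh", "sshd"],
                        "volumeMounts": [{
                            "name": "ssh-keys",
                            "mountPath": "/etc/ssh-keys",  # assumed path
                            "readOnly": True,
                        }],
                    }],
                    # Keys for password-less ssh come from a Secret.
                    "volumes": [{"name": "ssh-keys",
                                 "secret": {"secretName": ssh_secret}}],
                },
            },
        },
    }


def render_hostfile(worker_ips: List[str], slots_per_worker: int) -> str:
    """Render an Open MPI hostfile: one 'host slots=N' line per worker."""
    return "".join(f"{ip} slots={slots_per_worker}\n" for ip in worker_ips)


job = build_worker_job("mpi-demo", "example/mpi-runtime:latest", 2, "mpi-ssh-keys")
print(json.dumps(job, indent=2))
print(render_hostfile(["10.0.0.1", "10.0.0.2"], 4))
```

In a real deployment the worker IPs would be discovered through the Kubernetes API once the pods are running; here they are passed in directly to keep the sketch self-contained.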
19. IBM Systems
• Docker image of the TensorFlow runtime environment
• Kubernetes batch Job to manage the TensorFlow training job lifecycle
• Kubernetes Volume to share the data
• Kubernetes Deployment/Service to provide the TensorFlow serving service
• Kubernetes platform to work with other services and resources
• Kubernetes as a general-purpose data center platform
Run Tensorflow Pipeline In Kubernetes
[Diagram: training Job with two ps tasks and three worker tasks; Volume holding input, log, and model; dashboard Deployment/Service; serving Deployment/Service with two serving instances; test Job]
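For the training Job on this slide, each ps and worker task needs to know the addresses of all the others. A minimal Python sketch follows, assuming the `TF_CONFIG` environment-variable convention that TensorFlow uses for distributed training; the host names, port, and job name are hypothetical, and the talk does not say how its pods were actually wired together.

```python
import json
from typing import Dict, List


def cluster_spec(job_name: str, ps_count: int, worker_count: int,
                 port: int = 2222) -> Dict[str, List[str]]:
    """List one 'host:port' address per ps task and per worker task.

    The host names are assumed to be stable DNS names, for example from
    per-task Services; that naming scheme is an assumption of this sketch.
    """
    return {
        "ps": [f"{job_name}-ps-{i}:{port}" for i in range(ps_count)],
        "worker": [f"{job_name}-worker-{i}:{port}" for i in range(worker_count)],
    }


def tf_config(cluster: Dict[str, List[str]], task_type: str, task_index: int) -> str:
    """Serialize the TF_CONFIG value for one task of the cluster."""
    if task_type not in cluster:
        raise ValueError(f"unknown task type: {task_type}")
    if not 0 <= task_index < len(cluster[task_type]):
        raise IndexError(f"no {task_type} task with index {task_index}")
    return json.dumps(
        {"cluster": cluster, "task": {"type": task_type, "index": task_index}},
        sort_keys=True,
    )


# Two ps tasks and three worker tasks, as in the diagram.
cluster = cluster_spec("mnist", ps_count=2, worker_count=3)
print(tf_config(cluster, "worker", 1))
```

Each pod would receive its own value through the `env` section of its container spec, so the same image can serve as any ps or worker task.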
20. IBM Systems
• Kubernetes Deployment/Service for rolling upgrade
• Integrate with CI/CD utilities
Extend the Pipeline to Iterative Development
[Diagram: the TensorFlow pipeline (training Job with ps and worker tasks; Volume holding input, log, and model; dashboard and serving Deployment/Services; test Job), plus "new algorithm" and "new image" inputs]
22. IBM Systems
• Missing job scheduling features
  • Job group: ps task and worker task
  • Job queue: priority, fair share, preemption, etc.
  • MPI: gang scheduling, PLM integration, placement policy
  • Advance reservation
• Missing container support features
  • MPI optimization: placement-topology-based optimization, shared IPC, NUMA/CPU binding, job recovery
• Missing security features
  • Image control
Gaps of Kubernetes for HPC
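Gang scheduling, one of the gaps listed above, means that an MPI job starts only if all of its tasks can be placed at once. The following Python sketch shows the all-or-nothing rule on a toy model of free slots per node. The node names are hypothetical, and the "largest node first" packing is just one simple placement policy chosen for illustration, not something Kubernetes or the talk prescribes.

```python
from typing import Dict, Optional


def gang_schedule(free_slots: Dict[str, int], tasks: int) -> Optional[Dict[str, int]]:
    """Place all tasks or none.

    Returns a node -> task-count placement, or None when the cluster cannot
    hold the whole job, so no partial allocation is ever made.
    """
    if tasks <= 0:
        raise ValueError("tasks must be positive")
    if sum(free_slots.values()) < tasks:
        return None  # the whole gang does not fit: place nothing

    placement: Dict[str, int] = {}
    remaining = tasks
    # Fill the nodes with the most free slots first to keep the job compact.
    for node, free in sorted(free_slots.items(), key=lambda item: item[1], reverse=True):
        if remaining == 0:
            break
        take = min(free, remaining)
        if take > 0:
            placement[node] = take
            remaining -= take
    return placement


print(gang_schedule({"node-a": 2, "node-b": 4, "node-c": 1}, 5))
print(gang_schedule({"node-a": 2, "node-b": 1}, 4))
```

Without this rule, a scheduler that places pods one at a time can start some ranks of an MPI job while the rest stay pending, leaving the started ranks holding resources while they wait.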
23. IBM Systems
• Job queue: (#36716)
  • Introduce a job queue concept and a related resource sharing policy
Planned Project in Community
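As an illustration of what a job queue with a resource sharing policy could look like, here is a small Python sketch of fair-share ordering: the next job comes from the user whose consumed resources are lowest relative to their share, with job priority as the tie-breaker. This is not the design of the community proposal referenced above; the data model, the field names, and the default share of 1.0 are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PendingJob:
    name: str
    user: str
    priority: int  # higher value runs first among jobs of the same user


def next_job(pending: List[PendingJob], usage: Dict[str, float],
             shares: Dict[str, float]) -> Optional[PendingJob]:
    """Pick the pending job whose owner is furthest below their fair share."""
    if not pending:
        return None

    def sort_key(job: PendingJob):
        # Users without an explicit share get 1.0 (assumption of this sketch).
        ratio = usage.get(job.user, 0.0) / shares.get(job.user, 1.0)
        return (ratio, -job.priority)

    # min() keeps the earliest submission when keys are equal.
    return min(pending, key=sort_key)


queue = [
    PendingJob("a1", "alice", priority=1),
    PendingJob("b1", "bob", priority=5),
    PendingJob("a2", "alice", priority=9),
]
usage = {"alice": 10.0, "bob": 40.0}  # e.g. CPU-hours consumed so far

print(next_job(queue, usage, {"alice": 1.0, "bob": 1.0}).name)
print(next_job(queue, usage, {"alice": 1.0, "bob": 8.0}).name)
```

With equal shares, alice has used less and her highest-priority job goes first; giving bob a share of 8 makes his usage proportionally lower, so his job is chosen instead.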
HPDA = Data-Intensive Computing Using HPC
Domains: Manufacturing, Retail, Life Science, Travel, Finance, Energy & Utility
Applications are different, and each serves a purpose in computing an overall actionable solution to a problem.
Not all applications need the same data, or any data at all; hence each application is classified as a data producer, a consumer, or both.
Remote users can be located on an intranet or the Internet.
There are many point-to-point data transfers: every application needs to know whom to send data to and whom to receive data from. This is cumbersome and potentially complicated if an application fails or a new application starts.
Complexity:
Dependencies: tools, compilers, libraries, etc.
Software stack: academic software is difficult to install, configure, and deploy
Heterogeneous platforms/architectures: laptop to supercomputer, x86 to POWER
http://www.hpctoday.com/viewpoints/containers-meet-hpc/
https://www.nextplatform.com/2016/09/13/will-containers-total-package-hpc/
Security:
Containers launched as root
Access to bare metal, filesystems, and device drivers
Infrastructure devices: low-level kernel incompatibility
Image control: vulnerabilities
Limits on HPC-specific optimization: MPI local memory sharing, HDFS/GPFS data locality