When HPC meet ML/DL: Manage HPC Data Center with Kubernetes


LinuxCon + ContainerCon + CloudOpen China 2017

Published in: Internet

  1. When HPC Meet ML/DL: Manage HPC Data Center with Kubernetes. Yong Feng (
  2. IBM Systems. Please Note:
     • IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice and at IBM’s sole discretion.
     • Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.
     • The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract.
     • The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
     • Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
  3. Who am I? Senior Architect of IBM Spectrum (formerly Platform Computing)
     • 12+ years of work on resource managers and workload schedulers after Ph.D.
     • Lead the team on open source development, from OpenStack, YARN, Mesos, and Kubernetes to Spark
     • Lead the team on core platform development of IBM Cloud Private
  4. Agenda
     • What does ML/DL mean for HPC?
     • What does Container/Docker mean for HPC?
     • Kubernetes basics
     • Run an MPI job on Kubernetes
     • Run an ML/DL pipeline on Kubernetes
     • Gaps in Kubernetes for the HPC data center
     • What about now?
  5. What ML/DL Means for HPC
  6. ML/DL is HPC’s 1st Consumer Killer App?
     • New business challenges, especially Big Data, bring new topics: HPDA, AI, and IoT.
     • Algorithm scientists have to keep optimizing their code with new technology.
     • ML/DL solves business problems across many domains.
     • New hardware technology makes ML/DL possible.
  7. HPC Solution Workflow
     • Scheduler controls job start and placement
     • Applications exchange data as needed
         • Producers
         • Consumers
         • Both
     • Remote users receive/provide feedback
     (Diagram labels: Scheduler; Simulation, Visualization, Analytics, and Machine Learning applications with data exchange between them; Compute Resources & Network; Remote Users)
  8. Infrastructure and Software Challenge
     • HPC common requirements
         • Hardware: high-IOPS storage, low-latency networks, powerful CPUs, large memory, etc.
         • Software: parallel accelerators, job scheduler
     • GPUs become critical
     • Various frameworks beyond batch jobs, such as in-memory databases and long-running services
     • MPI is still important
     • Development pipeline
     • Containers do matter
  9. What Container/Docker Means for HPC
  10. Values
      • Portability to resolve complexity
      • Scalability to fit the nature of distributed/parallel computing
      • Developer-friendly pipeline: develop, build, distribute, and deploy
      • Improved resource utilization
      • Less overhead
      • Network and resource isolation
      • Supported by existing HPC job schedulers
  11. Challenge
      • Old Linux kernels
      • Support for infrastructure devices and software: IB, parallel FS, GPU, FPGA, etc.
      • Security
      • Limits on HPC-specific optimization
      • Image control
      • Troubleshooting
  12. Kubernetes
  13. Kubernetes Features
      • Intelligent scheduling
      • Self-healing
      • Horizontal scaling
      • Service discovery & load balancing
      • Automated rollouts & rollbacks
      • Management of secrets & configuration
      • Storage orchestration
      • Batch execution
  14. Kubernetes Concepts
      • Pod: a group of co-located containers.
      • Service: defines a set of pods and a means by which to access them, such as a single stable IP address and corresponding DNS name.
      • Volume: a directory, possibly with some data in it, which is accessible to a container as part of its filesystem.
      • Label: a key/value pair attached to a resource, such as a pod, to convey a user-defined identifying attribute.
      • ReplicaSet: ensures that a specified number of pod replicas are running at any one time.
      • StatefulSet: a controller that provides a unique identity to its pods, with guarantees about the ordering of deployment and scaling.
      • Job: creates one or more pods and ensures that a specified number of them successfully terminate.
      • Secret: an object that contains a small amount of sensitive data, which might otherwise be put in a pod specification or in an image.
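To make the label and service definitions above concrete, here is a minimal sketch in Python that builds Pod and Service manifests as plain dictionaries and checks which pods a service's selector would match. The names `make_pod`, `make_service`, and `selects`, and the `example/...` images, are hypothetical illustrations and not part of the talk; only equality-based selectors are modeled.

```python
def make_pod(name, labels, image):
    """Build a minimal v1 Pod manifest as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": dict(labels)},
        "spec": {"containers": [{"name": name, "image": image}]},
    }


def make_service(name, selector, port):
    """Build a minimal v1 Service manifest that targets pods by label."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "selector": dict(selector),
            "ports": [{"port": port, "targetPort": port}],
        },
    }


def selects(service, pod):
    """Equality-based selection: every selector pair must be in the pod's labels."""
    pod_labels = pod["metadata"].get("labels", {})
    selector = service["spec"]["selector"]
    return all(pod_labels.get(key) == value for key, value in selector.items())


# Hypothetical images and names, used only for illustration.
pods = [
    make_pod("sim-0", {"app": "sim", "tier": "compute"}, "example/sim:1.0"),
    make_pod("viz-0", {"app": "viz"}, "example/viz:1.0"),
]
svc = make_service("sim", {"app": "sim"}, 8080)
backends = [p["metadata"]["name"] for p in pods if selects(svc, p)]
print(backends)  # ['sim-0']
```

The dictionaries can be serialized with `json.dumps`, since `kubectl` accepts JSON manifests as well as YAML.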
  15. Kubernetes Architecture
  16. Getting Started
  17. Manage GPU Resources
      • Auto-discovery of GPU resources
      • GPU scheduling
      • Monitoring of GPU resource utilization
      • GPU driver injection
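The GPU scheduling bullet can be illustrated with a pod that asks the scheduler for whole GPUs. This is a sketch under a stated assumption: it uses the `nvidia.com/gpu` extended resource name advertised by the NVIDIA device plugin, a mechanism that postdates this talk, so the setup shown on the slides may have used a different resource name. `gpu_pod` and the image name are hypothetical.

```python
import json

# Assumption: resource name exposed by the NVIDIA device plugin.
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_pod(name, image, gpus):
    """Build a Pod manifest whose single container requests whole GPUs."""
    if gpus < 1:
        raise ValueError("gpus must be a positive integer")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Extended resources such as GPUs are requested under limits.
                    "resources": {"limits": {GPU_RESOURCE: gpus}},
                }
            ],
        },
    }


manifest = gpu_pod("train-0", "example/trainer:latest", gpus=2)
print(json.dumps(manifest, indent=2))
```

The scheduler then places the pod only on a node that still has that many unallocated GPUs.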
  18. Run MPI in Kubernetes
      • Docker image of the MPI runtime environment
      • Kubernetes batch Job to manage the MPI job lifecycle
      • Kubernetes Secret for password-less ssh access among workers
      • Bootstrap to integrate with MPI Process Lifecycle Management (PLM)
      • Kubernetes platform to work with other services and resources
      • Kubernetes as a general-purpose data center platform
      (Diagram labels: kube-api; three Job pods, one running (bootstrap) mpirun and two running (bootstrap) sshd)
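The pieces listed for the MPI slide can be sketched as follows: a batch Job that starts the worker pods with an ssh key mounted from a Secret, plus the hostfile and `mpirun` command line the launcher would use once pod IPs are known. The slides do not show the bootstrap itself, so `/opt/bootstrap.sh`, the mount path, and all names here are hypothetical; the hostfile syntax and the `--hostfile` and `-np` flags are those of Open MPI.

```python
def mpi_worker_job(name, image, workers, ssh_secret):
    """Build a batch/v1 Job that runs `workers` pods sharing one ssh key Secret."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "parallelism": workers,
            "completions": workers,
            "template": {
                "metadata": {"labels": {"mpi-job": name}},
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "worker",
                            "image": image,
                            # Hypothetical entrypoint standing in for the talk's bootstrap.
                            "command": ["/opt/bootstrap.sh"],
                            "volumeMounts": [
                                {"name": "ssh-keys", "mountPath": "/etc/mpi-ssh", "readOnly": True}
                            ],
                        }
                    ],
                    "volumes": [
                        # 0o400 keeps the private key readable by its owner only.
                        {"name": "ssh-keys", "secret": {"secretName": ssh_secret, "defaultMode": 0o400}}
                    ],
                },
            },
        },
    }


def hostfile(pod_ips, slots_per_pod):
    """Render an Open MPI hostfile: one 'host slots=N' line per worker pod."""
    return "".join(f"{ip} slots={slots_per_pod}\n" for ip in pod_ips)


def mpirun_command(hostfile_path, pod_count, slots_per_pod, program):
    """Build the mpirun argument list for the launcher pod."""
    total_ranks = pod_count * slots_per_pod
    return ["mpirun", "--hostfile", hostfile_path, "-np", str(total_ranks), program]


job = mpi_worker_job("mpi-demo", "example/openmpi:latest", workers=2, ssh_secret="mpi-ssh-key")
ips = ["10.1.0.4", "10.1.0.5"]  # would come from the Kubernetes API in practice
print(hostfile(ips, slots_per_pod=4), end="")
print(" ".join(mpirun_command("/tmp/hosts", len(ips), 4, "./a.out")))
```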
  19. Run TensorFlow Pipeline in Kubernetes
      • Docker image of the TensorFlow runtime environment
      • Kubernetes batch Job to manage the TensorFlow training job lifecycle
      • Kubernetes Volume to share the data
      • Kubernetes Deployment/Service to provide the TensorFlow serving service
      • Kubernetes platform to work with other services and resources
      • Kubernetes as a general-purpose data center platform
      (Diagram labels: a training Job with two ps tasks and three worker tasks; Volumes for input, log, and model; dashboard as a Deployment/Service; two serving instances as a Deployment/Service; test as a Job)
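Each ps and worker task in the training Job needs to know the addresses of all the others. One common way to pass that information is the `TF_CONFIG` environment variable, a JSON document with a `cluster` map and a `task` entry. The slides do not say how the talk wired its tasks together, so the following is a sketch under assumptions: the `<job>-<type>-<index>` host naming, port 2222, and the helper names are hypothetical.

```python
import json


def cluster_spec(job_name, num_ps, num_workers, port=2222):
    """Map each task type to the list of host:port addresses of its tasks."""
    return {
        "ps": [f"{job_name}-ps-{i}:{port}" for i in range(num_ps)],
        "worker": [f"{job_name}-worker-{i}:{port}" for i in range(num_workers)],
    }


def tf_config(cluster, task_type, index):
    """Render the TF_CONFIG value for one task of the cluster."""
    if task_type not in cluster:
        raise ValueError(f"unknown task type: {task_type}")
    if not 0 <= index < len(cluster[task_type]):
        raise ValueError(f"no {task_type} task with index {index}")
    return json.dumps({"cluster": cluster, "task": {"type": task_type, "index": index}})


# Two ps tasks and three worker tasks, matching the diagram on the slide.
cluster = cluster_spec("mnist", num_ps=2, num_workers=3)
env_for_worker_1 = {"name": "TF_CONFIG", "value": tf_config(cluster, "worker", 1)}
print(env_for_worker_1["value"])
```

The resulting dictionary has the shape of an entry in a container's `env` list, so each pod in the Job can receive its own task identity.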
  20. Extend the Pipeline to Iterative Development
      • Kubernetes Deployment/Service for rolling upgrades
      • Integration with CI/CD utilities
      (Diagram labels: the same pipeline as the previous slide, with a new algorithm feeding the training Job and a new image feeding the serving Deployment/Service)
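A rolling upgrade of the serving tier amounts to changing the image in the Deployment's pod template and letting Kubernetes replace pods gradually. A minimal sketch, assuming the current `apps/v1` API (the talk predates it and would have used an earlier beta version); `serving_deployment`, `with_new_image`, and the image tags are hypothetical.

```python
import copy


def serving_deployment(name, image, replicas):
    """Build a Deployment that replaces pods one at a time without losing capacity."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "strategy": {
                "type": "RollingUpdate",
                # Start one extra pod first, never drop below the desired count.
                "rollingUpdate": {"maxSurge": 1, "maxUnavailable": 0},
            },
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {"containers": [{"name": "serving", "image": image}]},
            },
        },
    }


def with_new_image(deployment, image):
    """Return a copy of the Deployment whose pod template uses the new image."""
    updated = copy.deepcopy(deployment)
    updated["spec"]["template"]["spec"]["containers"][0]["image"] = image
    return updated


current = serving_deployment("serving", "example/model-server:v1", replicas=2)
upgraded = with_new_image(current, "example/model-server:v2")
print(upgraded["spec"]["template"]["spec"]["containers"][0]["image"])  # example/model-server:v2
```

A CI/CD job could apply the updated manifest after each successful image build, which is the integration point the slide alludes to.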
  21. Gaps
  22. Gaps of Kubernetes for HPC
      • Missing job scheduling features
          • Job group: ps tasks and worker tasks
          • Job queue: priority, fair-sharing, pre-emption, etc.
          • MPI: gang scheduling, PLM integration, placement policy
          • Advance reservation
      • Missing container support features
          • MPI optimization: optimization based on placement topology, shared IPC, NUMA/CPU binding, job recovery
      • Missing security features
          • Image control
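Gang scheduling, listed above as a gap for MPI, means that a job's tasks are placed all together or not at all, because a partially started MPI job holds resources without making progress. The following toy model illustrates the all-or-nothing rule only; it is not how any Kubernetes scheduler implements it, and `gang_schedule` is a hypothetical name.

```python
def gang_schedule(free_slots, tasks_needed):
    """Place all tasks or none.

    free_slots maps node name to free slot count. Returns {node: tasks placed}
    when the whole job fits, otherwise None so that nothing is started.
    """
    if tasks_needed <= 0:
        raise ValueError("tasks_needed must be positive")
    if sum(free_slots.values()) < tasks_needed:
        return None  # a partial start would only hold resources idle

    placement = {}
    remaining = tasks_needed
    # Fill the emptiest nodes first so the job spans as few nodes as possible.
    for node, free in sorted(free_slots.items(), key=lambda kv: (-kv[1], kv[0])):
        if remaining == 0:
            break
        take = min(free, remaining)
        if take > 0:
            placement[node] = take
            remaining -= take
    return placement


nodes = {"n1": 4, "n2": 2, "n3": 1}
print(gang_schedule(nodes, 5))  # {'n1': 4, 'n2': 1}
print(gang_schedule(nodes, 8))  # None
```

By contrast, a scheduler that places pods one at a time could start seven of the eight tasks in the second case and leave them waiting indefinitely.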
  23. Planned Project in Community
      • Job queue (#36716): introduce a job queue concept and a related resource sharing policy
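The slides do not describe the design proposed in #36716, so the following is only a generic illustration of what a job queue with a resource sharing policy does: it orders pending jobs by priority and, within a priority, favors the user who has consumed the least relative to their share. All names and numbers are hypothetical.

```python
def next_job(pending, usage, shares):
    """Pick the next job to dispatch.

    pending: list of (job_id, user, priority) in submission order.
    usage:   resources already consumed per user.
    shares:  configured share per user; a larger share entitles more usage.
    Order: highest priority, then lowest usage/share ratio, then submission order.
    """
    if not pending:
        return None

    def sort_key(indexed_job):
        position, (_job_id, user, priority) = indexed_job
        return (-priority, usage.get(user, 0) / shares[user], position)

    _position, (job_id, _user, _priority) = min(enumerate(pending), key=sort_key)
    return job_id


pending = [("j1", "alice", 1), ("j2", "bob", 1), ("j3", "alice", 5)]
usage = {"alice": 80, "bob": 10}
shares = {"alice": 1, "bob": 1}

print(next_job(pending, usage, shares))       # j3
print(next_job(pending[:2], usage, shares))   # j2
```

Pre-emption, also listed among the gaps, would additionally let a high-priority pending job evict running lower-priority work; that is not modeled here.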
  24. What about Now?
  25. Kubernetes + HPC Job Scheduler
      • Run an HPC job scheduler as the workload manager on Kubernetes
          • IBM Spectrum LSF
          • Univa
  26. Q&A