#ibmedge © 2016 IBM Corporation
Enabling Cognitive Workloads on the Cloud:
GPUs with Mesos, Docker and Marathon on
POWER
Seetharami Seelam, IBM Research
Indrajit Poddar, IBM Systems
#ibmedge
Please Note:
• IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice
and at IBM’s sole discretion.
• Information regarding potential future products is intended to outline our general product direction and it
should not be relied on in making a purchasing decision.
• The information mentioned regarding potential future products is not a commitment, promise, or legal
obligation to deliver any material, code or functionality. Information about potential future products may not be
incorporated into any contract.
• The development, release, and timing of any future features or functionality described for our products
remains at our sole discretion.
• Performance is based on measurements and projections using standard IBM benchmarks in a controlled
environment. The actual throughput or performance that any user will experience will vary depending upon
many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the
I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be
given that an individual user will achieve results similar to those stated here.
1
#ibmedge
About Seelam
Expertise:
• 10+ years in large-scale, high-performance and
distributed systems
• Built multiple cloud services for IBM Bluemix:
autoscaling, business rules, containers, POWER
containers, and Deep Learning as a service
• Enabled and scaled Docker on POWER/Z for
extreme density (tens of thousands of
containers)
• Enabling GPUs in the cloud for container-based
workloads (Mesos/Kub/Docker)
2
Dr. Seetharami R. Seelam
Research Staff Member
IBM T. J. Watson Research Center
Yorktown Heights, NY
sseelam@us.ibm.com
Twitter: sseelam
#ibmedge
About Indrajit (a.k.a. I.P)
Expertise:
• Accelerated Cloud Data Services, Machine
Learning and Deep Learning
• Apache Spark, TensorFlow… with GPUs
• Distributed Computing (scale out and up)
• Cloud Foundry, Spectrum Conductor, Mesos,
Kubernetes, Docker, OpenStack, WebSphere
• Cloud computing on High Performance Systems
• OpenPOWER, IBM POWER
3
Indrajit Poddar
Senior Technical Staff Member,
Master Inventor, IBM Systems
ipoddar@us.ibm.com
Twitter: @ipoddar
#ibmedge
Agenda
• Introduction to Cognitive workloads and POWER
• Requirements for GPUs in the Cloud
• Mesos/GPU enablement
• Kubernetes/GPU enablement
• Demo of GPU usage with Docker on OpenPOWER to identify dog
breeds
• Machine and Deep Learning Ecosystem on OpenPOWER
• Summary and Next Steps
4
#ibmedge
Cognition
5
What you and I (our brains) do without even thinking about it… we recognize a bicycle
#ibmedge
Now machines are learning the way we learn….
6
From "Texture of the Nervous System
of Man and the Vertebrates" by
Santiago Ramón y Cajal.
Artificial Neural Networks
#ibmedge
But training needs a lot of computational resources
• Deep Learning model training is hard to distribute, even with easy scale-out frameworks
• Training can take hours, days or weeks
• Input data and model sizes are becoming larger than ever (e.g. video input, billions of
features, etc.)
• Real-time analytics adds to the pressure
Resulting in… unprecedented demand for offloaded computation, accelerators, and
higher memory bandwidth systems, just as Moore’s law is dying
#ibmedge
OpenPOWER: Open Hardware for High Performance
8
Systems designed for
big data analytics
and superior cloud economics
Up to:
12 cores per CPU
96 hardware threads per CPU
1 TB RAM
7.6 Tb/s combined I/O bandwidth
GPUs and FPGAs coming…
[Chart comparing OpenPOWER with traditional Intel x86; see
http://www.softlayer.com/bare-metal-search?processorModel[]=9]
#ibmedge
New OpenPOWER Systems with NVLink
9
S822LC-hpc “Minsky”:
2 POWER8 CPUs with 4 NVIDIA® Tesla® P100 GPUs, hooked directly to the CPUs using
NVIDIA’s NVLink high-speed interconnect
http://www-03.ibm.com/systems/power/hardware/s822lc-hpc/index.html
#ibmedge
Transparent acceleration for Deep Learning on
OpenPOWER and GPUs
10
Huge speed-ups
with GPUs and
OpenPOWER!
http://openpower.devpost.com/
Impressive acceleration examples:
• artNet Genre classifier
• Distributed TensorFlow for cancer detection
• Scale up and out genetics bioinformatics
• Full red blood cell modeling
• Accelerated ultrasound imaging
• Emergency service prediction
#ibmedge
Enabling Accelerators/GPUs in the cloud stack
11
[Stack diagram: Deep Learning apps, packaged as containers and images, scheduled by
clustering frameworks on top of accelerators]
#ibmedge
Requirements for GPUs in the Cloud
12
Function/Feature | Comments
GPUs exposed to Dockerized applications | Apps need access to /dev/nvidia* to use the GPUs
Support for NVIDIA GPUs | Both IBM Cloud and POWER systems support NVIDIA GPUs
Support Multiple GPUs per node | IBM Cloud machines have up to 2 K80s (4 GPUs); POWER nodes have many more
Containers require no GPU drivers | GPU drivers are huge, and drivers in a container create portability problems, so we need to support mounting GPU drivers into the container from the host (volume injection)
GPU Isolation | GPUs allocated to a workload should be invisible to other workloads
GPU Auto-discovery | The worker node agent automatically discovers the GPU types and counts and reports them to the scheduler
GPU Usage metrics | GPU utilization is critical for developers, so these metrics need to be exposed
Support for heterogeneous GPUs in a cluster (including app to pick a GPU type) | IBM Cloud has M60, K80, etc., and different workloads need different GPUs
GPU sharing | GPUs should be isolated between workloads, but sharable in some cases
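To make the auto-discovery requirement concrete, here is a minimal sketch (not the actual agent code): on Linux, each NVIDIA GPU appears as a /dev/nvidiaN device node next to control nodes such as /dev/nvidiactl, so an agent can enumerate GPUs by counting those nodes.

```shell
# Count per-GPU device nodes. The pattern /dev/nvidia[0-9]* matches
# /dev/nvidia0, /dev/nvidia1, ... but not /dev/nvidiactl or /dev/nvidia-uvm.
gpu_count=$(ls /dev/nvidia[0-9]* 2>/dev/null | wc -l | tr -d ' ')
echo "GPUs visible on this node: ${gpu_count}"
```

On a machine without GPUs this simply reports 0; real agents additionally query NVML for the GPU types and health.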
#ibmedge
NVIDIA Docker
13
Credit: https://github.com/NVIDIA/nvidia-docker
• A Docker wrapper and tools
to package and run GPU-based apps
• Uses drivers on the host
• Manual GPU assignment
• Good for single node
• Available on POWER
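A typical single-node invocation (nvidia-docker 1.x) selects GPUs manually through the NV_GPU environment variable. The image name and GPU indices below are illustrative, and the echo turns the final command into a dry run so the snippet works even without GPU hardware:

```shell
# Manual GPU assignment: expose only GPUs 0 and 1 to the container.
NV_GPU="0,1"
IMAGE="nvidia/cuda"   # any CUDA-enabled image built per nvidia-docker's guidelines
# Drop the echo on a real GPU host to actually run it.
echo NV_GPU=${NV_GPU} nvidia-docker run --rm ${IMAGE} nvidia-smi
```

Under the covers the wrapper injects the host driver volume and the /dev/nvidia* devices into a plain docker run call.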
#ibmedge
Mesos and Ecosystem
• Open-source cluster manager
• Enables siloed applications to be consolidated on a shared pool of resources, delivering:
• Rich framework ecosystem
• Emerging GPU support
14
#ibmedge
Mesos GPU support
(Joint work between Mesosphere, NVIDIA and IBM)
Credit: Kevin Klues, Mesosphere
Mesos support for GPUs (v1.1)
• Mesos will support GPUs in two different
containerizers
– Unified containerizer
• No docker support initially
• Remove Docker daemon from the node
– Docker containerizer
• Traditional executor for Docker
• Docker container based deployment
• Ongoing work
– Code to allocate GPUs at the node in the Docker
containerizer
– Code to share the state with unified containerizer
– Logic for node recovery (NVIDIA driving this work)
• Limitations
– No GPU sharing between Docker containers
– Limited GPU usage information exposed in the UI
– Slave recovery code will evolve over time
– NVIDIA GPUs only
#ibmedge
Implementation
• GPUs shared between the Mesos containerizer and the Docker containerizer
• mesos-docker-executor extended to handle device isolation/exposure through the Docker daemon
• Native Docker implementation for CPU/memory/GPU/GPU driver volume management
16
[Architecture diagram: the Nvidia GPU Isolator inside the Mesos Agent combines an
Nvidia GPU Allocator and an Nvidia Volume Manager, serving both the Mesos
Containerizer and the Docker Containerizer; the mesos-docker-executor drives the
Docker Daemon, which manages CPU, memory, GPU, and the GPU driver volume; a
Docker image label check looks for com.nvidia.volumes.needed="nvidia_driver"]
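The executor's volume-injection decision can be sketched as follows. The label name comes from the slide above; the driver volume name and version are illustrative assumptions, and the docker inspect command in the comment is how the label would be read on a real host:

```shell
# Inject the host's GPU driver volume only when the image asks for it via
# the com.nvidia.volumes.needed label. Hard-coded here; on a real host:
#   docker inspect --format '{{ index .Config.Labels "com.nvidia.volumes.needed" }}' <image>
needed="nvidia_driver"
extra_args=""
if [ "${needed}" = "nvidia_driver" ]; then
  # Illustrative volume name; the actual one depends on the installed driver version.
  extra_args="--volume=nvidia_driver_361.48:/usr/local/nvidia:ro"
fi
echo docker run ${extra_args} nvidia/cuda nvidia-smi
```

Keeping the drivers in a host-managed volume rather than in the image is what makes the GPU images portable across driver versions.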
#ibmedge
Mesos GPU monitor and Marathon on OpenPOWER
17
#ibmedge
Usage and Progress
• Usage
• Compile Mesos with the flag: ../configure --with-nvml=/nvml-header-path &&
make -j install
• Build GPU images following nvidia-docker:
(https://github.com/NVIDIA/nvidia-docker)
• Run a Docker task with an additional resource such as “gpus=1”
• Release
• Target release: 1.1
• GPU allocator for docker containerizer (code review)
• GPU isolation/exposure support for mesos-docker-executor (code review)
• GPU driver volume injection (under development)
18
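Once Mesos is built with NVML support as above, GPU scheduling can be smoke-tested from the command line. The invocation below follows the Mesos 1.x GPU documentation; the master address is illustrative, frameworks must opt in with the GPU_RESOURCES capability to receive GPU offers, and the echo keeps this a dry run:

```shell
# Ask the master for one GPU and run nvidia-smi in a Docker image.
MASTER="10.0.0.1:5050"   # illustrative master address
echo mesos-execute --master=${MASTER} --name=gpu-test \
  --docker_image=nvidia/cuda --command="nvidia-smi" \
  --framework_capabilities="GPU_RESOURCES" --resources="gpus:1"
```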
#ibmedge
Eco-system Activities
• Marathon
• GPU support for Mesos Containerizer in release 1.3
• GPU support for the Docker Containerizer ready for release (waiting for the
Mesos-side code merge)
19
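With that support in place, requesting a GPU from Marathon is a one-line addition to the app definition. A minimal sketch for the Mesos containerizer path (field layout per Marathon 1.3-era GPU support; the image and resource sizes are illustrative):

```json
{
  "id": "/gpu-test",
  "cmd": "nvidia-smi",
  "cpus": 1,
  "mem": 512,
  "gpus": 1,
  "container": {
    "type": "MESOS",
    "docker": { "image": "nvidia/cuda" }
  }
}
```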
#ibmedge
Kubernetes
• Open source
orchestration system for
Docker containers
• Handles scheduling onto
nodes in a compute
cluster
• Actively manages
workloads to ensure that
their state matches the
user’s declared intentions
• Emerging support for
GPUs
20
[Architecture diagram: a Kubernetes master (API server, Scheduler, Controller Manager;
HA mode supported) backed by an etcd cluster holding the cluster state, managing hosts
that each run a Kubelet/Proxy and a Docker Engine]
#ibmedge
Kubernetes GPU support
• Design Doc for GPU support in K8s has been out for a while
– https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/gpu-support.md
Function/Feature | Kubernetes Community | Our Contribution
GPUs exposed to Dockerized applications | Yes |
Support for NVIDIA GPUs | Yes |
Support Multiple GPUs per node | Yes, a PR is pending* |
Containers require no GPU drivers | No | PoC complete
GPU Isolation | Yes |
GPU Auto-discovery | No | Future
GPU Usage metrics | No | Future
Support for heterogeneous GPUs in a cluster (including app to pick a GPU type) | No | Future
GPU sharing | No | Future
*GPU on Kubernetes updates in community: https://github.com/kubernetes/kubernetes/pull/28216
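In the Kubernetes of this era (v1.3+), a pod asks for a GPU through the alpha resource name shown below. This is a minimal sketch following the community design doc; the image is illustrative, and it assumes the node has GPUs with drivers handled outside the container (the PoC above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda
    command: ["nvidia-smi"]
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
```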
#ibmedge
Status of GPUs in Mesos and Kubernetes
22
Function/Feature | NVIDIA Docker | Mesos | Kubernetes
GPUs exposed to Dockerized applications | ✔ | ✔ | ✔
Support for NVIDIA GPUs | ✔ | ✔ | ✔
Support Multiple GPUs per node | ✔ | ✔ | ✔
Containers require no GPU drivers | ✔ | ✔ | Future
GPU Isolation | ✔ | ✔ | ✔
GPU Auto-discovery | Future | Future | ✔
GPU Usage metrics | ✔ | Future | Future
Support for heterogeneous GPUs in a cluster (including app to pick a GPU type) | ✘ | Future | Future
GPU sharing | ✔ (not controlled) | Future | Future
© 2016 IBM Corporation #ibmedge
Demo
23
#ibmedge
Machine Learning and Deep Learning analytics on
OpenPOWER
No code changes needed!!
24
ATLAS (Automatically Tuned Linear Algebra
Software)
#ibmedge
Learn More and Get Started…
25
Power-Efficient Machine Learning on
POWER Systems using FPGA Acceleration
Machine and Deep Learning on Power Systems
Register for a SuperVessel account and take the deep learning
notebooks running in Docker containers for a spin!
https://ny1.ptopenlab.com/bigdata_cluster
#ibmedge
Summary and Next Steps
• Cognitive, Machine and Deep Learning workloads are everywhere
• OpenPOWER and Accelerators will help speed up these workloads
• Containers can be leveraged with accelerators for agile deployment of these new
workloads
• Docker, Mesos and Kubernetes are making rapid progress to support accelerators
• OpenPOWER and this emerging cloud stack make it the preferred platform for Cognitive
workloads
#ibmedge
Special Thanks to Collaborators
• Kevin Klues, Mesosphere
• Yu Bo Li, IBM
• Rajat Phull, NVIDIA
• Guangya Liu, IBM
• Qian Zhang, IBM
• Benjamin Mahler, Mesosphere
• Vikrama Ditya, NVIDIA
• Yong Feng, IBM
• Christy L Norman Perez, IBM
• Kubernetes Team
© 2016 IBM Corporation #ibmedge
Thank You
Seelam – sseelam@us.ibm.com
IP - ipoddar@us.ibm.com
© 2016 IBM Corporation #ibmedge
Backup
29
#ibmedge
Notices and Disclaimers
30
Copyright © 2016 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form without written permission
from IBM.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM.
Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of
initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. THIS DOCUMENT IS
DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IN NO EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE
USE OF THIS INFORMATION, INCLUDING BUT NOT LIMITED TO, LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT OR LOSS OF OPPORTUNITY.
IBM products and services are warranted according to the terms and conditions of the agreements under which they are provided.
IBM products are manufactured from new parts or new and used parts. In some cases, a product may not be new and may have been previously installed. Regardless, our
warranty terms apply.
Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.
Performance data contained herein was generally obtained in a controlled, isolated environments. Customer examples are presented as illustrations of how those customers
have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary.
References in this document to IBM products, programs, or services does not imply that IBM intends to make such products, programs or services available in all countries in
which IBM operates or does business.
Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM. All materials
and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or advice to any individual participant or
their specific situation.
It is the customer’s responsibility to insure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification and
interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the customer may need to take to comply with such
laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer is in compliance with any law
#ibmedge
Notices and Disclaimers Con’t.
31
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not
tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products.
Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the
ability of any such third-party products to interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT
NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual
property right.
IBM, the IBM logo, ibm.com, Aspera®, Bluemix, Blueworks Live, CICS, Clearcase, Cognos®, DOORS®, Emptoris®, Enterprise Document Management System™, FASP®,
FileNet®, Global Business Services ®, Global Technology Services ®, IBM ExperienceOne™, IBM SmartCloud®, IBM Social Business®, Information on Demand, ILOG,
Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®,
PureExperience®, PureFlex®, pureQuery®, pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, Smarter Commerce®, SoDA, SPSS, Sterling Commerce®,
StoredIQ, Tealeaf®, Tivoli®, Trusteer®, Unica®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of International Business
Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM
trademarks is available on the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.