Build FAST Deep Learning Apps with Docker on OpenPOWER and GPUs
Accelerated training and inference
Indrajit Poddar (I.P)
Ashwin Srinivas
IBM
Deep Learning
What you and I (our brains) do without even thinking about it: we recognize a bicycle
Now machines are learning the way we learn….
From "Texture of the
Nervous System of Man and the
Vertebrates" by
Santiago Ramón y Cajal.
Artificial Neural Networks
But training needs a lot of computational resources
• Deep Learning model training is not easy to distribute, even with frameworks built for easy scale-out and real-time analytics
• Training can take hours, days, or weeks
• Input data and model sizes are becoming larger than ever (e.g. video input, billions of features, etc.)
• The result: unprecedented demand for offloaded computation, accelerators, and higher memory bandwidth systems, just as Moore’s law is dying
OpenPOWER: Open Hardware for High Performance
Systems designed for big data analytics and superior cloud economics
Up to:
• 12 cores per CPU
• 96 hardware threads per CPU
• 1 TB RAM
• 7.6 Tb/s combined I/O bandwidth
GPUs and FPGAs coming…
(Comparison: OpenPOWER vs. traditional Intel x86)
http://www.softlayer.com/POWER-SERVERS
https://mc.jarvice.com/
Nimbix Cloud Adds IBM “Minsky” S822LC for HPC
PaaS+SaaS
Containerized: the platform delivers industry-best performance and agility at the lowest cost to the customer
“True HPC Cloud Eliminates Virtualization and Embraces Containerization + Acceleration for Native Bare-Metal Performance”
Nimbix Cloud Advantages
• Easier to use
• Highest performance
• Ultra-fast launch times
• Lower cost
• Faster time to value
• Bare-metal acceleration
• Enterprise accounting
• Application marketplace
• Private apps
• Private cloud option
https://mc.jarvice.com/
https://power.jarvice.com
The Cognitive Revolution
New Paradigm, New Chip, New Servers
New Chip: “POWER8 with NVLink” for Accelerated AI
New Power Linux Servers:
• S822LC for High Performance Computing: POWER8 + coherent CAPI + novel NVLink for high-bandwidth coherent CPU/GPU acceleration
• S821LC: High Density 2-Socket 1U
• S822LC for Big Data
M.Gschwind, Bringing the Deep Learning Revolution into the Enterprise
Introducing the S822LC Power System for HPC
First Custom-Built GPU Accelerator Server with NVLink
2.5x Faster CPU-GPU Data Communication via NVLink
(Diagram) POWER8 NVLink Server: NVIDIA P100 Pascal GPUs connected to POWER8 CPUs, and to each other, over NVLink at 80 GB/s.
x86 Servers with PCIe: GPUs connected over PCIe at 32 GB/s; with no NVLink between CPU & GPU on x86 servers, PCIe is the bottleneck.
• Custom-built GPU Accelerator Server
• High-speed NVLink connections between CPUs & GPUs and among GPUs
• Features the novel NVIDIA P100 Pascal GPU accelerator
M.Gschwind, Bringing the Deep Learning Revolution into the Enterprise
TensorFlow on Tesla P100: PowerAI is 30% faster
IBM S822LC: 20 cores @ 2.86 GHz, 512 GB memory, 4 NVIDIA Tesla P100 GPUs, Ubuntu 16.04, CUDA 8.0.44, cuDNN 5.1, TensorFlow 0.12.0, Inception v3 benchmark (64-image minibatch)
Intel Broadwell E5-2640v4: 20 cores @ 2.6 GHz, 512 GB memory, 4 NVIDIA Tesla P100 GPUs, Ubuntu 16.04, CUDA 8.0.44, cuDNN 5.1, TensorFlow 0.12.0, Inception v3 benchmark (64-image minibatch)
Larger value is better
PowerAI vs DGX-1: 1.6x TensorFlow Throughput / Dollar
▪ TensorFlow 0.12 on the IBM PowerAI platform takes advantage of the full capabilities of NVLink
▪ For image classification and analysis, this means a 1.6x price-performance advantage relative to the NVIDIA DGX-1
System                                   Images / Second   List Price   $ / Image / Second
NVIDIA DGX-1 (8 P100 GPUs, 512 GB Mem)   330               $129,000     $390
PowerAI (4 P100 GPUs, 512 GB Mem)        273               $67,000      $241
Lower cost is better
NVLink and P100 advantage
• NVLink reduces communication time and overhead
• Incorporating the fastest GPU for deep learning
• Data moves GPU-to-GPU and memory-to-GPU faster, for shorter training times
ImageNet / AlexNet, minibatch size = 128: x86-based GPU system 170 ms vs. POWER8 + Tesla P100 + NVLink 78 ms
IBM advantage: data communication and GPU performance
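One way to see the CPU-GPU data path difference for yourself is NVIDIA's bandwidthTest sample. A minimal sketch, not part of the deck, assuming a default CUDA 8.0 toolkit install on the host:

    # Copy the CUDA samples somewhere writable, then build and run bandwidthTest
    cuda-install-samples-8.0.sh ~
    cd ~/NVIDIA_CUDA-8.0_Samples/1_Utilities/bandwidthTest
    make
    ./bandwidthTest --memory=pinned   # reports host-to-device and device-to-device copy bandwidth

On an NVLink-attached P100, the host-to-device numbers should land well above what a PCIe Gen3 x16 link can deliver.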
Introducing PowerAI: Get Started Fast with Deep Learning
• Enabled by High Performance Computing infrastructure
• Package of pre-compiled major Deep Learning frameworks
• Easy to install and get started with Deep Learning, with enterprise-class support
• Optimized for performance to take advantage of NVLink
Machine Learning and Deep Learning analytics on OpenPOWER
No code changes needed!!
ATLAS (Automatically Tuned Linear Algebra Software)
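As a quick way to convince yourself that nothing architecture-specific is needed before pointing unchanged framework workloads at the machine, a small sanity check on the POWER host (a sketch, not from the deck; all commands are standard Linux/NVIDIA tools):

    uname -m                  # expect ppc64le on an OpenPOWER system
    nvidia-smi                # confirm the Tesla GPUs and driver are visible
    dpkg -l | grep -i mldl    # list the installed PowerAI (power-mldl) packages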
OpenPOWER: GPU support
Credit: Kevin Klaues, Mesosphere
IBM Spectrum Conductor includes enhanced support for fine-grained GPU and CPU scheduling with Apache Spark and Docker
Mesos supports GPUs
Huge speed-ups with GPUs and OpenPOWER!
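To make "Mesos supports GPUs" concrete, here is a rough sketch of how an agent can advertise its GPUs to frameworks such as Spark. The flag values are illustrative and should be checked against the Mesos 1.x release in use:

    # Sketch: start a Mesos agent with the Nvidia GPU isolator enabled,
    # advertising four GPUs as a schedulable resource
    mesos-agent \
      --master=zk://zk-host:2181/mesos \
      --work_dir=/var/lib/mesos \
      --isolation="filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia" \
      --resources="gpus:4"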
Enabling Accelerators/GPUs in the Cloud Stack
• Deep Learning Training + Inference
• Containers and images
• Accelerators
• Clustering frameworks
Build Deep Learning Docker Images Using PowerAI Software
Dockerfile

FROM nimbix/ubuntu-cuda-ppc64le

# Add NVIDIA's machine-learning repository for ppc64el (cuDNN, etc.)
RUN wget --no-verbose http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1404/ppc64el/nvidia-machine-learning-repo-ubuntu1404_1.0.0-1_ppc64el.deb && \
    dpkg --install nvidia-*.deb && \
    rm -f nvidia-*.deb && \
    apt-get update

# Add the PowerAI (mldl) local repository and install the deep learning frameworks
RUN wget --no-verbose http://download.boulder.ibm.com/ibmdl/pub/software/server/mldl/mldl-repo-local_3.3.0_ppc64el.deb && \
    dpkg --install mldl*.deb && \
    apt-get update && \
    apt-get -y install power-mldl openmpi-bin numactl libopenmpi-dev && \
    apt-get clean
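To turn the Dockerfile into an image and sanity-check it, something like the following works; the image tag powerai-frameworks and the check command are illustrative, not part of the deck:

    # Build on a POWER (ppc64le) host with Docker installed
    docker build -t powerai-frameworks .

    # Confirm the PowerAI packages were baked into the image
    docker run --rm powerai-frameworks sh -c "dpkg -l | grep mldl"

GPU access from inside the container is handled by nvidia-docker, covered on the next slide.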
NVIDIA Docker
https://github.com/NVIDIA/nvidia-docker
• A Docker wrapper and tools to package and run GPU-based apps
• Uses the NVIDIA drivers on the host
• No need to include drivers in the Docker image
• No GPU scheduling
• Good for single-node use
• Available on POWER
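A hedged usage sketch (nvidia-docker 1.x syntax; the image names are the ones used earlier in this deck): the wrapper mounts the host's driver files and GPU devices into the container at run time:

    # GPUs and nvidia-smi become visible inside an otherwise ordinary container
    nvidia-docker run --rm nimbix/ubuntu-cuda-ppc64le nvidia-smi

    # Run the PowerAI image built earlier, restricted to two GPUs via NV_GPU
    NV_GPU=0,1 nvidia-docker run --rm -it powerai-frameworks bash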
Demo on NIMBIX
Thank you.
ibm.com/systems

Editor's Notes

  • #8 S821LC: built for data-intensive workloads such as databases, data analytics, machine & deep learning, and HPC. S822LC: superior performance and TCO for data workloads; built for accelerated computing with new accelerator interconnect technologies.
  • #11 This becomes even more powerful when you look at it in terms of images/second/dollar. Assuming list price (in USD): PowerAI on S822LC for HPC at $67,000; DGX-1 at $129,000. DGX-1: $129,000 / 330 images/s ≈ $390 per image/s. Minsky: $67,000 / 278 images/s ≈ $241 per image/s. IBM PowerAI has a 1.6x price-performance advantage when you compare image-processing throughput against a fully loaded DGX-1. For the same price as a DGX-1 at 330 images/second, you could purchase 2x PowerAI systems for a total of 546 images/second.
  • #12 Digits01: transfer of images from CPU to GPU at 3-4 GB/s (very low). Minsky: transfer of images from CPU to GPU at 17 GB/s.
  • #15 Supports up to 18 GPUs; tested with up to 24 devices. Exploits the IBM design for big data: the large address space enables rich acceleration, with a 1 TB address space per PCI host interface and standard little-endian (LE) Linux drivers.