3. Machine Learning / Deep Learning Process
[Diagram: the two-phase ML/DL workflow]
• Training/Development: historic training data → ML/DL training → trained model
• Inference/Deployment/Application: live data → trained model → action
11. Deep Belief Network: The first “DEEP” network
Deep belief networks produced the initial deep learning breakthroughs in speech recognition.
12. Deep Convolutional Networks: Image processing/classification
Today, ReLU activations, together with operations such as max-pooling, have replaced sigmoid units, making training easier and accuracy higher (see the sketch below).
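As an illustration (not from the deck), a minimal Keras convolutional block in this style, pairing ReLU activations with max-pooling; the layer sizes and input shape are arbitrary:

    from tensorflow import keras

    # A small conv net: ReLU activations with max-pool downsampling,
    # in place of the sigmoid units used in earlier networks.
    model = keras.Sequential([
        keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
        keras.layers.MaxPooling2D(2),
        keras.layers.Conv2D(64, 3, activation='relu'),
        keras.layers.MaxPooling2D(2),
        keras.layers.Flatten(),
        keras.layers.Dense(10, activation='softmax'),
    ])
    model.summary()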
20. How can I train deep learning models many times faster?
21. PowerAI Rel. 4 with Distributed Deep Learning (tech. preview)
Performance: faster training and inferencing, with near-ideal scaling to 256 GPUs and beyond.
[Chart: scaling from 1 system to 64 systems cuts training time from 16 days down to 7 hours — 58x faster]
Benchmark: ResNet-101, ImageNet-22K, Caffe with PowerAI DDL, running on Minsky (S822LC) Power Systems.
25. POWER9: An acceleration superhighway
The only processor specifically designed for the AI era:
• 4x threads per core vs. x86
• Up to 9.5x more I/O bandwidth than x86
• 2.6x more RAM possible vs. x86
• 1st CPU to deliver PCIe Gen 4
27. Faster Training Time with Distributed Deep Learning
• TensorFlow 1.4 performance on IBM POWER9 with Nvidia V100
• Single node: 35% more images processed per second vs. tested x86 systems
ResNet-50 testing on the ILSVRC 2012 dataset (aka ImageNet 2012): training on 1.2M images, validation on 50K images.
▪ Results are based on IBM internal measurements running 1000 iterations of HPM ResNet-50 on 1.2M images and validation on 50K images, with the dataset from ILSVRC 2012, also known as ImageNet 2012.
▪ Software: TensorFlow 1.4.0 framework and HPM ResNet-50, https://github.com/tensorflow/benchmarks.git (commit: f5d85aef), with the following parameters: batch size: 64 per GPU; iterations: 1100; data: ImageNet; local-parameter-device: gpu; variable-update: replicated.
▪ Date of testing: November 26, 2017
28. Faster Training Time with Distributed Deep Learning
• TensorFlow 1.4 performance on IBM POWER9 with Nvidia V100
• Multiple nodes with Distributed Deep Learning: IBM POWER9™ with Nvidia Tesla V100 processes 2.3X more data on TensorFlow vs. tested x86 systems
• The PowerAI Distributed Deep Learning (DDL) library provides innovative distribution methods, enabling AI frameworks to scale to multiple servers while leveraging all attached GPUs
▪ ResNet-50 testing on the ILSVRC 2012 dataset (also known as ImageNet 2012): training on 1.2M images, validation on 50K images.
▪ Date of testing: December 2, 2017
30. Run using ddlrun
• ddlrun is a tool for running DDL-enabled scripts
• See the DDL README at /opt/DL/ddl/doc/README.md
• https://developer.ibm.com/linuxonpower/2018/05/01/improved-ease-use-ddl-powerai/
Example invocation:
    ddlrun -H system1,system2 python mnist.py
Yes. It’s really that easy!
32. Adjust the Keras callbacks
Adjust the learning rate: since the data is split between the GPUs, the learning rate has to be scaled by the total number of GPUs.
The two primary operations that must be restricted to rank 0 are model checkpointing and logging; this is accomplished by adding those callbacks only on rank 0.
An extra callback is needed to keep all metrics in sync across all nodes; this ensures that early stopping and learning rate scheduling also remain in sync (see the sketch below).
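A minimal sketch of this pattern, written with Horovod's Keras API as a stand-in (PowerAI DDL follows the same rank/size pattern, but its exact module names are not shown in this deck); the model and data below are illustrative placeholders:

    import numpy as np
    from tensorflow import keras
    import horovod.tensorflow.keras as hvd

    hvd.init()

    # Tiny stand-in model and data so the sketch is self-contained.
    model = keras.Sequential([
        keras.layers.Dense(64, activation='relu', input_shape=(32,)),
        keras.layers.Dense(10, activation='softmax'),
    ])
    x = np.random.rand(512, 32)
    y = keras.utils.to_categorical(np.random.randint(10, size=512), 10)

    # Scale the learning rate by the number of ranks: the effective
    # global batch size grows with the number of GPUs.
    opt = hvd.DistributedOptimizer(keras.optimizers.Adam(0.001 * hvd.size()))
    model.compile(optimizer=opt, loss='categorical_crossentropy')

    # Average metrics across ranks so early stopping and learning
    # rate scheduling make the same decision on every node.
    callbacks = [hvd.callbacks.MetricAverageCallback()]

    # Restrict checkpointing and logging to rank 0 only.
    if hvd.rank() == 0:
        callbacks.append(keras.callbacks.ModelCheckpoint('ckpt-{epoch}.h5'))

    model.fit(x, y, batch_size=64, epochs=2, callbacks=callbacks)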
33. How can I train models that don’t fit in GPU memory?
34. Why POWER9: 3.8X faster than x86 architectures
• Supports training with models and data sets that are too large for GPU memory, in DL and in HPC/simulation
• Memory coherency also makes GPU programming easier for developers by automatically moving data between POWER9 system memory and V100 GPU memory
35. Power Systems: 7-10X bandwidth over x86 architectures
Bandwidth between the GPUs and memory is critical, and there is no CPU-side NVLink for x86 servers: PCIe is the bottleneck.
[Diagram: three interconnect topologies]
• Minsky (POWER8) + NVLink: 80 GB/s NVLink for both CPU<->GPU and GPU<->GPU; POWER8 with NVLink delivers 2.5X the bandwidth of PCIe
• x86 + NVLink: 80 GB/s NVLink for GPU<->GPU only; CPU<->GPU traffic still crosses 32 GB/s PCIe
• x86 using PCIe: GPUs access system memory through the x86 CPU over slow 32 GB/s PCIe
[Diagram: POWER9 with NVLink 2.0 — 150 GB/s links between the P9 CPU and Tesla V100 GPUs and between GPUs, with 170 GB/s DDR4 memory bandwidth]
POWER9 and NVLink 2.0 deliver a 7-10X bandwidth increase over x86 architectures.
36. TensorFlow Large Model Support (TFLMS)
• Swap out unused tensors (feature maps, parameters) to CPU memory after GPU computation
• Swap in before use in the backward propagation phase
• Implemented as a Python module that statically edits the model graph before training
• Supports training with the Session, Estimator, and tf.keras APIs
• Code contributed to the TensorFlow community: https://github.com/tensorflow/tensorflow/pull/19845
[Diagram: tensors produced in the forward pass through layers 1…l…L are swapped out from GPU memory to CPU memory, then swapped back in when the backward pass reaches them]
37. Why 3D image segmentation?
Training 3D U-Net models for image segmentation generally has high memory requirements, which can limit the size of the 3D images that can be used for training and can also force lower batch sizes.
The annual International Multimodal Brain Tumor Segmentation Challenge (BraTS) [1] drives advancements in 3D image segmentation models.
We enabled TFLMS in a Keras model written by David G. Ellis (University of Nebraska). The model processes multimodal MRI scans, following the architecture described by Isensee et al. in the 2017 BraTS proceedings (page 100), which received 3rd place in the challenge [2].
A real-world use case of large model support — the maximum image resolution at batch size 1 in a 16GB GPU:
• 144^3 without TFLMS
• 192^3 with TFLMS: ~2.4x the resolution
Higher-resolution image processing allows for learning and labeling finer details and structures.
[1] http://www.med.upenn.edu/sbia/brats2018/data.html
[2] https://www.cbica.upenn.edu/sbia/Spyridon.Bakas/MICCAI_BraTS/MICCAI_BraTS_2017_proceedings_shortPapers.pdf
39. TFLMS runtime performance: POWER9 vs. x86
The 3D U-Net model was run with TFLMS on an IBM AC922 and an x86-based GPU server; both systems have NVIDIA Volta V100 GPUs.
The x86 server shows a significant slowdown, which gets worse when GPUs share the same PCI bus.
The nvprof view of processing one image with the model shows that on the x86 server the GPU goes idle (white space in the trace) while waiting on memory copies over the PCI bus. Corresponding kernel runtimes between the runs are linked in red.
Note that the 4-GPU times come from running 4 individual models concurrently, not one distributed model. Distributed results follow.
[Chart: epoch times at 192^3 with TFLMS, in seconds — IBM AC922 (4 GPU version), IBM AC922 (6 GPU version), x86 server, x86 server with PCI contention]
40. GPU and interconnect usage
[Chart: average memory copy throughput over 30 batches, host-to-GPU and GPU-to-host, in GB/s — IBM AC922 (4 GPU version), x86 server, x86 server while sharing the PCI bus]
[Chart: average GPU utilization over 30 batches, in % — IBM AC922 (4 GPU version), x86 server, x86 server while sharing the PCI bus]
Higher memory copy throughput drives higher GPU utilization.
44. How to enable TFLMS

Session-based training:
Step 1: define optimizer/solver scopes
    with tf.name_scope('adam_optimizer'):
        optimizer = tf.train.AdamOptimizer(1e-4)
        train_step = optimizer.minimize(cross_entropy)
Step 2: define an LMS object and run it
    from tensorflow.contrib.lms import LMS
    lms_obj = LMS({'adam_optimizer'})
    lms_obj.run(graph=tf.get_default_graph())

Estimator-based training:
Step 1: define optimizer/solver scopes
    with tf.name_scope('graddescopt'):
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
        train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())
Step 2: define an LMSHook
    from tensorflow.contrib.lms import LMSHook
    lms_hook = LMSHook({'graddescopt'})
Step 3: add the LMSHook to the Estimator's hook list
    mnist_classifier.train(input_fn=train_input_fn, steps=20000, hooks=[logging_hook, lms_hook])

TF-Keras based training:
Step 1: define an LMSKerasCallback
    from tensorflow.contrib.lms import LMSKerasCallback
    lms_callback = LMSKerasCallback()
Step 2: pass the callback to the Keras fit or fit_generator function
    model.fit_generator(generator=training_gen, callbacks=[lms_callback])
45. How can I train machine learning models at terabyte scale with GPUs?
46. Snap ML: Rapid training of logistic regression/SVMs on GPUs — a tera-scale ML benchmark
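An illustrative sketch of GPU-accelerated logistic regression with Snap ML's scikit-learn-style interface; the snapml module name and use_gpu flag are assumptions based on the publicly released library (the PowerAI release packaged Snap ML differently), and the data is a small synthetic stand-in — a tera-scale run would stream from disk:

    import numpy as np
    from snapml import LogisticRegression  # module name is an assumption

    # Synthetic stand-in data; Snap ML targets data sets far larger than this.
    X = np.random.rand(100000, 50).astype(np.float32)
    y = np.random.randint(2, size=100000).astype(np.float32)

    clf = LogisticRegression(use_gpu=True)  # train on the GPU
    clf.fit(X, y)
    print(clf.predict(X[:5]))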
50. How can I do this collaboratively with optimal resource utilization?
51. Why POWER9: GPU-accelerated Spark + multi-tenancy with Spectrum Conductor
• Physical view: Spectrum Conductor is installed on each Linux server
• Logical view: users and groups get their own Spark cluster — isolated, protected, and secured by Spark Instance Groups
• Manage all DL resources with Conductor SLAs
• The scheduler interfaces with Spark, ensuring accelerated GPU resources for priority applications and users
[Diagram: a shared pool of Linux management and compute nodes on Spectrum Scale, partitioned into Spark instance groups #1-#5; an administrator creates Spark instance groups from the web console; LOB users, data scientists, and researchers consume virtual Spark clusters (PaaS) for workloads such as customer behavior, trend analysis, HPC, marketing, and fraud detection, while IT/data warehouse ETL/batch and LOB IoT workloads run in their own instance groups]
Conductor with Spark components: session scheduler; service management (ASC/K8s); security; report/log management; container multi-tenancy; notebooks; Spark; ELK; GPU and acceleration; data connector.