Elastic Distributed Deep Learning Training at Large Scale in On-Prem and Cloud Production
Junfeng Liu, STSM, jfliu@ca.ibm.com
Kelvin Lui, Technical Product Manager, kelvinl@ca.ibm.com
Yonggang Hu, Distinguished Engineer, yhu@ca.ibm.com
ibm.com/spectrum-computing
ibm.com/us-en/marketplace/deep-learning-platform
Red Bull Racing: Competing with Computing
Every week a new challenge, but part of a season-long strategy
52 wins, 58 poles, 135 podiums, 52 fastest laps, 4 Formula One Constructors' World Championships
A decade of racing successes, a decade with Spectrum Computing
New Car for 2017
• Tailored to the track
• Complex virtual design and simulation models
• >200-step simulations
• 30K engineering changes per season
Real-time Decision Making
• Car sensors and real-time telemetry drive decisions before, during, and after the race
• >100 sensors per car
• Pit stops under 2 seconds
Race Strategy
• Scenario-driven decision making
• 1000s of scenarios run per race
• Modeled environments: rain, heat, delays
• Pit stops and tire choices win or lose races
Agenda
The needs and challenges of running distributed training
Elastic Distributed Training
• Architecture
• Benchmark
• Interface
Use Cases
Demo (time permitting)
Next Steps
Deep Learning Needs HPC & Big Data
AI Demand for Compute
Resource matters
• Tesla V100 – 32GB: ~$11,458 (roughly the price of a Nissan Versa)
• IBM AC922 – 4x V100: ~$80,000 (an Audi A8)
• Nvidia DGX-2: ~$399,000 (a Lamborghini Aventador)
• $500,000,000
Workload matters
• Inference – a simple language model: 125 TFLOPS; needs 1 TFLOPS – in milliseconds
• Training – ResNet-50 on ImageNet-1K: 14 days to train on 1 GPU; 29 hours on 8 GPUs; minutes on Summit; requires proper implementation and tuning
• At production scale: tens of models, hundreds of tuning runs, hundreds of users, thousands of GPU-days for NAS, millions of pictures, thousands of datasets, millions of jobs, billions of inferences, SLAs from milliseconds to days
Faster Training Time with Large-scale Deep Learning
• Image recognition training: 9 days per run reduced to 4 hours per run – 54x more learning runs with Power8
• What will you do? Iterate more and create more accurate models? Create more models? Both?
• From 2015 to 2018
– GPU Compute: 18 TFLOPS to 112 TFLOPS (FP16)
– GPU Memory: 16GB to 32GB
– Communication: 100Gbps to 200Gbps
3x
Large-scale Deep Learning
[skymind.ai]
• Data parallelism: constant traffic per GPU (only the network/model size matters)
• Model parallelism: partitioning dictates traffic (needs significant research)
[DeepSpeech2]
IBM Spectrum Conductor
VGG as an example
• 128.3M values per GPU
• allreduce–broadcast sync model
• Data transferred grows with the number of GPUs
• 4 GPUs ≈ 1,026M values transferred every iteration (a back-of-the-envelope sketch follows below)
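To make the arithmetic concrete, the short Python sketch below estimates the per-iteration traffic of the reduce-plus-broadcast scheme described above; the 128.3M figure comes from the slide, and the counting convention (gradients in, model out) is an assumption for illustration only.

# Rough traffic estimate for a reduce + broadcast synchronization scheme:
# every GPU sends its full gradient and receives the full updated model back,
# so the per-iteration volume grows linearly with the number of GPUs.
PARAMS = 128.3e6  # gradient/model values per GPU (VGG, from the slide)

def reduce_broadcast_volume(num_gpus, params=PARAMS):
    """Values moved per iteration: gradients in plus model out, for all GPUs."""
    return 2 * num_gpus * params

for n in (1, 2, 4, 8):
    print(f"{n} GPUs: ~{reduce_broadcast_volume(n) / 1e6:.0f}M values per iteration")
# 4 GPUs -> ~1026M values, matching the number on the slide.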
Distribution Challenge
Source: https://www.semanticscholar.org/paper/Poseidon-An-Efficient-Communication-Architecture-f-Zhang-Zheng/c37145669be8e7f14f4cdd5ddc3935ea03a54673
Using Allreduce for SGD
• More performant
• MPI, NCCL
• Used by all large-scale studies
• Scalable
(Figure: each worker's local gradient is combined via allreduce into an aggregated gradient; a minimal sketch follows below.)
[skymind.ai, mpitutorial.com]
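As a minimal illustration of allreduce-based synchronous SGD, the sketch below uses stock torch.distributed (NCCL or MPI backend). It is not the IBM DDL API; the model, data loading, and process-group initialization are assumed to exist elsewhere.

import torch.distributed as dist

def allreduce_sgd_step(model, optimizer):
    """Average gradients across all workers, then apply one optimizer step."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum gradients over workers
            p.grad.div_(world_size)                        # turn the sum into a mean
    optimizer.step()
    optimizer.zero_grad()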
Prior Art: Ring-based Allreduce (Thakur 2005, Baidu Feb/2017)
Two phases: reduce-scatter, then all-gather (see the sketch below)
• Bandwidth-optimal for homogeneous network architectures
– Each step is throttled by the worst link bandwidth (i.e., pipelined)
• Linear dependency on latency
– N GPUs incur roughly N× the latency overhead
– Recursive schemes exist but are only optimal for 2^m learners
NOT SCALABLE for many learners
- Too many iterations (latency adds up fast)
- The weakest link slows down the others (cluster, cloud)
[Baidu.com]
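The NumPy sketch below simulates the two phases on one machine purely to show the data movement: in each of the 2(N-1) steps every worker exchanges one chunk with its ring neighbour, which is exactly where the N-proportional latency overhead comes from. This is an illustrative simulation under those assumptions, not Baidu's or IBM's implementation.

import numpy as np

def ring_allreduce(grads):
    """Simulate ring allreduce (reduce-scatter + all-gather) over equally sized
    per-worker gradient vectors; every worker ends with the element-wise sum."""
    n = len(grads)
    chunks = [np.array_split(np.asarray(g, dtype=float).copy(), n) for g in grads]

    # Phase 1: reduce-scatter. In each of the n-1 steps every worker sends one
    # chunk to its right neighbour, which adds it to its own copy. Afterwards
    # worker i holds the fully reduced chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, chunks[i][(i - step) % n].copy()) for i in range(n)]
        for src, c, payload in sends:
            chunks[(src + 1) % n][c] += payload

    # Phase 2: all-gather. Each worker forwards the chunk it has just completed,
    # so after another n-1 steps every worker holds every reduced chunk.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy()) for i in range(n)]
        for src, c, payload in sends:
            chunks[(src + 1) % n][c] = payload

    return [np.concatenate(c) for c in chunks]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = [rng.standard_normal(12) for _ in range(4)]  # 4 simulated GPUs
    results = ring_allreduce(grads)
    assert all(np.allclose(r, sum(grads)) for r in results)
    print("every worker ends with the identical summed gradient")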
Prior Art: Two-step Approach (Tencent Jul/2018, Uber Oct/2018)
NOT SCALABLE for many learners
- Still too many iterations (latency adds up fast)
- Sub-optimal traffic pattern due to the additional reduce and broadcast
- Only the master GPUs are active
DDL: Mix-and-Match for Best Performance
• Communication libraries: MPI, NCCL, IB_Verb, SharedMem, OpenFabric, custom libraries
• Hardware: IBM, Nvidia, Mellanox, Intel, OpenCAPI
• Algorithms: Ring, Recursive, Tree
[NeurIPS18, SysML19]
IBM DDL: https://arxiv.org/pdf/1708.02188.pdf
More Challenges
• Flexibility
• Support for multiple DL frameworks
• Developer transparency
• Auto-scaling & elastic training
• Fault tolerance
• Service quality
• Scalability, performance, and accuracy
Training challenges and reactions to Elastic Training
"Distributed training is great, but we only run training on a single GPU."
"Why? Don't you want the speed-up?"
"300+ GPUs and 300+ students. Each researcher is entitled to use 1 GPU."
"You are in meetings right now. Are you using the GPUs allocated to you?"
"No …"
"If you ask for 16 GPUs, you will never get them in a busy cluster."
"A classic large-job starvation problem!"
"If you run large jobs and use more than your share, your jobs will be killed."
"What if you start with 1 GPU, your job can grow, and if there are other high-priority jobs, your job gracefully shrinks back to your own quota?"
"F*&?% brilliant idea!"
# cluster specification
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

parameter_servers = ["pc-01:2222"]
workers = ["pc-02:2222", "pc-03:2222", "pc-04:2222"]
cluster = tf.train.ClusterSpec({"ps": parameter_servers, "worker": workers})

tf.app.flags.DEFINE_string("job_name", "", "Either 'ps' or 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job")
FLAGS = tf.app.flags.FLAGS

server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

if FLAGS.job_name == "ps":
    server.join()
elif FLAGS.job_name == "worker":
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % FLAGS.task_index, cluster=cluster)):
        with tf.name_scope('input'):
            x = tf.placeholder(tf.float32, shape=[None, 784], name="x-input")
            y_ = tf.placeholder(tf.float32, shape=[None, 10], name="y-input")
        with tf.name_scope("weights"):
            W1 = tf.Variable(tf.random_normal([784, 100]))
            W2 = tf.Variable(tf.random_normal([100, 10]))
        with tf.name_scope("softmax"):
            y = tf.nn.softmax(z3)
        # ... lines of code omitted ...
        with tf.name_scope('train'):
            # the optimizer is an "operation" which we can execute in a session
            grad_op = tf.train.GradientDescentOptimizer(learning_rate)
            train_op = grad_op.minimize(cross_entropy, global_step=global_step)

    sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                             global_step=global_step,
                             init_op=init_op)
    with sv.prepare_or_wait_for_session(server.target) as sess:
        if FLAGS.task_index == 0:
            # the chief manages the model, the logs, and the other workers
            writer = tf.train.SummaryWriter(logs_path, graph=tf.get_default_graph())
        for epoch in range(training_epochs):
            for i in range(batch_count):
                batch_x, batch_y = mnist.train.next_batch(batch_size)
                _, cost, summary, step = sess.run(
                    [train_op, cross_entropy, summary_op, global_step],
                    feed_dict={x: batch_x, y_: batch_y})
                writer.add_summary(summary, step)
Distributed TensorFlow: the snippet above mixes static cluster configuration; data ingest, PS, and worker logic; and training runtime management with the model/graph definition (EDT only needs the model/graph definition).
MPI – A reference point for parallel applications
IBM Spectrum MPI V10.1, optimized PAMI point-to-point performance – latency
PingPong latency (IMB benchmark, Firestone/EDR), in microseconds:

Message size (bytes)   SPECTRUM MPI   MVAPICH   ompi   MPICH
0                      1.09           1.66      1.40   1.44
1                      1.10           1.81      1.38   1.45
2                      1.10           1.81      1.37   1.45
4                      1.10           1.81      1.37   1.44
8                      1.13           1.80      1.37   1.45
16                     1.17           1.85      1.48   1.46
32                     1.21           1.84      1.50   1.64
64                     1.29           1.86      1.51   1.67
128                    1.35           1.71      1.74   1.99
256                    1.85           2.18      1.87   2.14
512                    2.00           2.33      2.28   2.28
1024                   2.17           2.57      2.53   2.93
2048                   2.50           3.00      3.04   3.62
4096                   3.16           3.98      4.23   4.18
8192                   4.15           5.34      5.21   5.55
16384                  5.20           12.38     9.90   8.47
MPI – A reference point for parallel applications
SPMD, peer-to-peer programming model: a common binary runs on each core or CPU and discovers its rank at run time (figure: ranks 0–14 laid out on a grid; a minimal mpi4py example follows below).
Advantages:
• Standard, portable
• Fast, low-latency communications
• Many features: point-to-point, message selectivity, collective operations, process groups, etc.
But challenges too (not cloud native):
• Not fault-tolerant
• Programmer needs to keep track of rank
• Exception handling is left to the developer
• Resource allocations are static
• Computations are not distributed optimally
• Challenging to debug
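For readers unfamiliar with the SPMD style, here is a minimal mpi4py sketch: the same script is launched on every core (e.g. mpirun -np 4 python spmd.py) and each process discovers its rank at run time. The rank bookkeeping and error handling it leaves to the programmer are exactly the challenges listed above.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # discovered at run time, not configured statically
size = comm.Get_size()

# Each rank contributes a local value; allreduce gives every rank the global sum.
local_value = rank + 1
global_sum = comm.allreduce(local_value, op=MPI.SUM)
print(f"rank {rank}/{size}: global sum = {global_sum}")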
Traditional HPC
• The app is linked with a communication library; pseudocode sketch:
  main() { initialize(host1, host2, host3, ...); printf(...); send(work1, ...); }

Elastic Fabric – converging HPC and cloud native
• Each app instance only handles messages: getMsg(...); calc(); send(work2, ...);
• MPI, TCP; Spark, MR, Symphony SOAM, etc.
• A high-performance fabric manages workload and state – it calls out to user code to enable elasticity, resilience, and mobility, and hides infrastructure complexity and deployment
(Figure: a client submits PriceFXOpt() and PriceFXFW() tasks to a session; a master dispatches them across elastic service instances.)
Elastic Deep Learning
• The best of both performance and flexibility
• Automatic scale-up and scale-down based on the resource plan
• Priority, real-time fairshare, FIFO
• Transparent for TensorFlow, PyTorch, and Caffe
• Convergence and hyperparameter awareness
Combines the best scheduling and the fastest communication with DL-specific high-performance optimization
Elastic Distributed Training Engine – components:
• Session Scheduler / Elastic Scaling – resource policy (scaling, preemption, migration)
• DL Driver – training planning and scheduling (training tasks, micro-batch pipeline, sync plan)
• Worker Wrapper – model transparency, data ingest
• DL Framework – TensorFlow, Caffe, PyTorch, Keras
• Sync Engine (DDL) – high-performance synchronization (sync, async, peer-to-peer, centralized; new RDMA library)

Elastic distribution challenges:
• Graceful preemption
• Auto scaling
• Dynamic priority
• Fault tolerance
• Speed-up and performance (DDL)
• Accuracy
• Synchronization algorithm
• Topology and GPU awareness
• Model transparency
Auto-scaling and preemption
• Resource policy drives the scale-up and scale-down
  • Priority
  • Sharing policy: fairshare, FIFO
  • GPU demand
  • Cluster-wide utilization
• The fabric handles the scaling
  • Automatically, without interruption
  • Supports both sync and async models
  • Keeps or adjusts the batch size and other hyperparameters during scaling (see the sketch below)
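The slides do not spell out EDT's exact policy, so the sketch below only illustrates the kind of bookkeeping involved when the GPU count changes mid-training: either keep the global batch size fixed and repartition it, or keep the per-GPU batch and rescale the learning rate (the linear scaling rule). Function and parameter names are hypothetical.

def rescale_on_resize(base_lr, per_gpu_batch, old_gpus, new_gpus, keep_global_batch=True):
    """Illustrative hyperparameter adjustment when workers are added or removed."""
    if keep_global_batch:
        # Keep the global batch constant: redistribute samples, leave the LR alone.
        new_per_gpu_batch = max(1, per_gpu_batch * old_gpus // new_gpus)
        return base_lr, new_per_gpu_batch
    # Keep the per-GPU batch: the global batch changes, so scale the LR with it.
    return base_lr * new_gpus / old_gpus, per_gpu_batch

# Example: preemption shrinks a job from 4 GPUs to 2 GPUs.
lr, per_gpu_batch = rescale_on_resize(base_lr=0.01, per_gpu_batch=64, old_gpus=4, new_gpus=2)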
Keras AutoScaling – Go Beyond One GPU with EDT

mlp = Sequential()
mlp.add(Dense(1000, input_shape=(784,)))
mlp.add(Activation('relu'))
mlp.add(Dense(250))
mlp.add(Activation('relu'))
mlp.add(Dense(10))
mlp.add(Activation('softmax'))

trainer = ElasticDL(model=mlp,
                    loss='categorical_crossentropy',
                    optimizer=optimizer_mlp,
                    batch_size=4, num_epoch=1)
trainer.fit(training_set, epochs=4)

mlp = Sequential()
mlp.add(Dense(1000, input_shape=(784,)))
mlp.add(Activation('relu'))
mlp.add(Dense(250))
mlp.add(Activation('relu'))
mlp.add(Dense(10))
mlp.add(Activation('softmax'))

mlp.compile(loss='categorical_crossentropy',
            optimizer=optimizer_mlp,
            callback_MAO)          # MAO (multi-worker sync) hidden as a callback
mlp.fit(training_set, epochs=epochs)

Going from 1 GPU to N GPUs:
• Same model definition
• Same optimizer and loss function
• MAO hidden as a callback
• Similar API for fit and compile
• Data source supports Spark dataframes
PyTorch – Transparent Scaling through EDT

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

model = Net()
optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum)

# EDT PyTorch
model = ElasticDL(model, optimizer, F.nll_loss, dataLoader)
# hide MAO and data ingest in the workers
model.train(200, 64)

# Native PyTorch
def train(model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
Transparent DL Insight
• No code modification
• Customizable live metrics
• Plugin interface for third-party monitoring (a hypothetical plugin sketch follows below)
• Elastic and interactive with notebooks and developer tools
• Scale up and down without interruption
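To make the monitoring plugin point concrete, here is a hypothetical sketch of what such a plugin could look like; the class name, hook signature, and endpoint are illustrative assumptions, not the WML-A plugin API.

import json
import time
import urllib.request

class MetricsForwarder:
    """Illustrative plugin: forward live training metrics to an external monitoring sink."""

    def __init__(self, endpoint="http://monitoring.example.com/dl-metrics"):  # assumed endpoint
        self.endpoint = endpoint

    def on_metrics(self, job_id, iteration, metrics):
        # Hypothetical hook, called with e.g. {"loss": 0.42, "accuracy": 0.88}.
        payload = json.dumps({"job": job_id, "iteration": iteration,
                              "timestamp": time.time(), **metrics}).encode()
        request = urllib.request.Request(self.endpoint, data=payload,
                                         headers={"Content-Type": "application/json"})
        urllib.request.urlopen(request, timeout=2)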
Autoscaling with Accuracy, Transparency
Maintain the same accuracy when scaling GPUs up and down
(Figure: accuracy curves for a run preempted from 4 GPUs to 2 GPUs and for a 1-GPU run.)
One line of code change – train anywhere, on multiple nodes and multiple GPUs
Interactive experience in notebooks
Elastic AI Solutions:
IBM Power with IBM Storage
For questions, contact:
Olga Yiparaki yiparaki@us.ibm.com
Chief engineer, IBM Storage Performance
Eric Fiala Eric.J.Fiala@ibm.com
Solution Architect, Spectrum Computing
Constantine Arnold Constantine.Arnold@ibm.com
Data Science and Storage Systems Research
Jay Vaddi jayvaddi@us.ibm.com
IBM Storage Performance
Brian Porter bporter1@us.ibm.com
Client Technical Specialist, IBM Systems
IBM Storage: Spectrum Scale NVMe all-flash appliance
• IB EDR fabric; up to 8 hosts were used in these tests; compute nodes can be increased elastically
• IBM Spectrum Scale NVMe all-flash appliance: only one storage AFA node was used throughout these tests; additional storage can be added independently of the compute nodes
• NVMe-based storage provides more than ample performance for these AI benchmarks, which saturate the GPUs
• A single AFA storage node uses 2U of rack space and provides ~63 TB of user capacity
• Max read from storage: over 35 GB/s, assuming enough network adapters
• Storage can be increased linearly to meet capacity and/or performance requirements
Power9 with IBM Watson Machine Learning Accelerator (WMLA)
• Up to 8 Power9 AC922 hosts in this environment (GTX model, water-cooled)
• 512 GB RAM per Power9 host
• 6 GPUs per Power9 host, up to 48 GPUs in this environment
• A single dual-ported IB adapter per Power9 host (2x IB links per host; 8x IB links on the storage side)
Elastic Distributed Training scaling efficiency
Benchmark configuration:
• Framework: TensorFlow (Elastic Distributed Training)
• Spark instance group: dliauto
• Model: InceptionV3
• Batch size: 64
• Dataset: flowers
• Hyperparameters: learning-rate policy exponential, base learning rate 0.01, decay steps 4000, learning-rate decay 0.9, staircase TRUE, solver type GradientDescent, maximum iterations 10000 (a sketch of this schedule follows below)
• With the Elastic Distributed Training capability included in Watson Machine Learning Accelerator, the system dynamically scales out to accommodate the demands of growing AI applications
• Rapid growth in demand is accommodated elastically, with ease of management
• The measured data below shows high scaling efficiency as the number of GPUs and hosts increases
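For reference, the learning-rate schedule listed above maps onto the TF 1.x API of that era roughly as follows; this is a sketch of the schedule only, not the benchmark's actual training script.

import tensorflow as tf  # TF 1.x style, matching the era of the benchmark

global_step = tf.train.get_or_create_global_step()
learning_rate = tf.train.exponential_decay(
    learning_rate=0.01,     # base learning rate
    global_step=global_step,
    decay_steps=4000,
    decay_rate=0.9,
    staircase=True)
optimizer = tf.train.GradientDescentOptimizer(learning_rate)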
Speedup by scaling out hosts & GPUs (iterations/min vs. a single host): 1 host = 1.0x, 2 hosts = 2.0x, 4 hosts = 3.8x, 8 hosts = 7.5x
Efficiency vs. the 6-GPU baseline (50K iterations): 6 GPUs = 100%, 12 GPUs = 100%, 24 GPUs = 96%, 48 GPUs = 94%
In all these cases, the NVMe-based storage remains unchanged and provides more than ample performance, since the AI workload saturates the GPUs, as evidenced by the high scaling efficiency.
IBM WML-A enables Service Level Agreements by absorbing multi-tenant growth
• POWER9 enables the rapid growth of AI demands: as the number of tenants increases, every new job or user adds negligible overhead, enabling predictable behavior and SLAs (Service Level Agreements)
• Coupled with EDT (Elastic Distributed Training), multi-tenant workloads are accommodated elastically, with ease of management
• The measured data below shows the same negligible overheads as the number of GPUs varies
(Chart: overhead of each additional tenant relative to the average tenant, plotted on a ±10% scale, for 3 GPUs × 16 tenants, 6 GPUs × 8 tenants, 12 GPUs × 4 tenants, 24 GPUs × 2 tenants, and 48 GPUs × 1 tenant.)
In all these cases, the NVMe-based storage remains unchanged and provides more than ample performance, enabling the GPUs to accommodate the multi-tenant AI workload without any slowdowns, as evidenced by the negligible overheads.
Multitenancy overheads are on par with the corresponding overheads when each server uses local storage instead of the external IBM storage used in these tests.
A Simple Scenario
Improve data scientists' productivity by 31% and IT resource utilization by 33% in a multi-user shared cluster of GPUs running IBM WML-Accelerator on POWER9 AC922 servers with Nvidia Tesla V100 GPUs connected via NVLink 2.0
• 1.31x reduction in the time to train multiple concurrent experiments vs. the tested x86 systems
  • 4 jobs of InceptionV3 trained for 15,000 iterations on the Flowers dataset, requiring 3 GPUs each, running on 2 AC922 nodes with 4 GPUs each
• 1.33x improvement in the utilization of the POWER9 DL infrastructure vs. the tested x86 systems
• 0 wait time – submitted jobs are executed even if the cluster is busy, thanks to IBM WML-Accelerator's elastic scaling
  • Supports multi-tenancy and elasticity with a fairshare scheduling policy
• Results are based on IBM internal measurements running 15,000-iteration training of the InceptionV3 model (mini-batch size = 32 per GPU) on the Flowers dataset.
• Power AC922: 40 cores (2 x 20c chips), POWER9 with NVLink 2.0, 3.8 GHz, 1 TB memory, 4x Tesla V100 GPUs; Red Hat Enterprise Linux 7.5 for Power Little Endian (POWER9) with CUDA 9.2 / cuDNN 7.2.1; WML-A v1.1.1.
• Competitive stack: 2x Xeon(R) Gold 6150, 36 cores (2 x 18c chips), 2.70 GHz, 512 GB memory, 4x Tesla V100 GPUs, Ubuntu 16.04.4 with CUDA 9.1 / cuDNN 7.1.2; NGC image nvcr.io/nvidia/tensorflow version 18.08-py2; Kubernetes v1.11.2.
(Chart: Multiuser Jobs 3-3-3-3 – 4 jobs of InceptionV3/Flowers training for 15,000 iterations; time taken: x86 34.4 minutes vs. AC922 26.22 minutes; idle resource shown in the legend.)
ML/DL Training & Execution – Watson ML Accelerator (end-to-end pipeline):
• Data sources: traditional business data, sensor data, data from collaboration partners, data from mobile apps & social media, legacy data, new data
• Data preparation: data ingestion, pre-processing, heavy I/O instrumentation, training and testing datasets
• Model training: AI deep learning frameworks (TensorFlow, Caffe, …), distributed & elastic training for deep learning, parallel hyperparameter search & optimization, network models, hyperparameters; monitor & advise, iterate
• Inference: trained-model life-cycle management, deploy in production using the trained model (REST API)
• Foundation: multi-tenant, shared-services architecture (Conductor) – resource groups, consumers, resource plans, instance groups, resiliency, workload management, notebooks, Anaconda, reporting, security
Watson ML Accelerator Technical References
• Classify images with IBM Watson Machine Learning Accelerator – https://developer.ibm.com/tutorials/use-computer-vision-with-dli-watson-machine-learning-accelerator/
• Train Keras and MLlib models with IBM Watson Machine Learning Accelerator – https://developer.ibm.com/tutorials/training-keras-and-mllib-model-with-watson-machine-learning-accelerator/
• Get dynamic, elastic, and fine-grained resource allocations and controls for accelerating multiple model trainings simultaneously – https://developer.ibm.com/tutorials/dynamic-resilient-and-elastic-deep-learning-with-watson-machine-learning-accelerator/
• Train XGBoost models with IBM Watson Machine Learning Accelerator – https://developer.ibm.com/tutorials/train-xgboost-models-within-watson-ml-accelerator/
• Accelerate Generalized Linear Model training with IBM Watson Machine Learning Accelerator and Snap ML – https://developer.ibm.com/tutorials/accelerate-machine-model-training-with-watson-ml-accelerator-snap-ml/
• Accelerate tree-based model training with Watson Machine Learning Accelerator and Snap ML – https://developer.ibm.com/tutorials/accelerate-random-forest-model-training-with-watson-ml-accelerator/

Machine Learning and Deep Learning with IBM Watson Machine Learning Accelerator series
• Offers walk-throughs and hands-on experience with Watson ML Accelerator's key differentiators
• English series: https://developer.ibm.com/series/learn-watson-machine-learning-accelerator/
• Chinese series: https://developer.ibm.com/cn/blog/2019/learn-ibm-powerai-enterprise/
Thank you

More Related Content

What's hot

Image Object Detection Pipeline
Image Object Detection PipelineImage Object Detection Pipeline
Image Object Detection Pipeline
Abhinav Dadhich
 
Surge: Rise of Scalable Machine Learning at Yahoo!
Surge: Rise of Scalable Machine Learning at Yahoo!Surge: Rise of Scalable Machine Learning at Yahoo!
Surge: Rise of Scalable Machine Learning at Yahoo!
DataWorks Summit
 
Introduction to TensorFlow
Introduction to TensorFlowIntroduction to TensorFlow
Introduction to TensorFlow
Matthias Feys
 
Beyond data and model parallelism for deep neural networks
Beyond data and model parallelism for deep neural networksBeyond data and model parallelism for deep neural networks
Beyond data and model parallelism for deep neural networks
JunKudo2
 
[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big Data
[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big Data[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big Data
[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big Data
PingCAP
 
Spectral clustering - Houston ML Meetup
Spectral clustering - Houston ML MeetupSpectral clustering - Houston ML Meetup
Spectral clustering - Houston ML Meetup
Yan Xu
 
Recent progress on distributing deep learning
Recent progress on distributing deep learningRecent progress on distributing deep learning
Recent progress on distributing deep learning
Viet-Trung TRAN
 
The next generation of the Montage image mosaic engine
The next generation of the Montage image mosaic engineThe next generation of the Montage image mosaic engine
The next generation of the Montage image mosaic engine
G. Bruce Berriman
 
DARTS: Differentiable Architecture Search at 社内論文読み会
DARTS: Differentiable Architecture Search at 社内論文読み会DARTS: Differentiable Architecture Search at 社内論文読み会
DARTS: Differentiable Architecture Search at 社内論文読み会
Masashi Shibata
 
Deep learning with Tensorflow in R
Deep learning with Tensorflow in RDeep learning with Tensorflow in R
Deep learning with Tensorflow in R
mikaelhuss
 
Pycon 2016-open-space
Pycon 2016-open-spacePycon 2016-open-space
Pycon 2016-open-space
Chetan Khatri
 
Skytree big data london meetup - may 2013
Skytree   big data london meetup - may 2013Skytree   big data london meetup - may 2013
Skytree big data london meetup - may 2013
bigdatalondon
 
LocationTech Projects
LocationTech ProjectsLocationTech Projects
LocationTech Projects
Jody Garnett
 
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
MLconf
 
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
MLconf
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Jen Aman
 
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
PingCAP
 
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
MLconf
 
Distributed Processing Frameworks
Distributed Processing FrameworksDistributed Processing Frameworks
Distributed Processing Frameworks
Antonios Katsarakis
 

What's hot (20)

Image Object Detection Pipeline
Image Object Detection PipelineImage Object Detection Pipeline
Image Object Detection Pipeline
 
Surge: Rise of Scalable Machine Learning at Yahoo!
Surge: Rise of Scalable Machine Learning at Yahoo!Surge: Rise of Scalable Machine Learning at Yahoo!
Surge: Rise of Scalable Machine Learning at Yahoo!
 
Introduction to TensorFlow
Introduction to TensorFlowIntroduction to TensorFlow
Introduction to TensorFlow
 
Beyond data and model parallelism for deep neural networks
Beyond data and model parallelism for deep neural networksBeyond data and model parallelism for deep neural networks
Beyond data and model parallelism for deep neural networks
 
[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big Data
[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big Data[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big Data
[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big Data
 
Spectral clustering - Houston ML Meetup
Spectral clustering - Houston ML MeetupSpectral clustering - Houston ML Meetup
Spectral clustering - Houston ML Meetup
 
Recent progress on distributing deep learning
Recent progress on distributing deep learningRecent progress on distributing deep learning
Recent progress on distributing deep learning
 
The next generation of the Montage image mosaic engine
The next generation of the Montage image mosaic engineThe next generation of the Montage image mosaic engine
The next generation of the Montage image mosaic engine
 
DARTS: Differentiable Architecture Search at 社内論文読み会
DARTS: Differentiable Architecture Search at 社内論文読み会DARTS: Differentiable Architecture Search at 社内論文読み会
DARTS: Differentiable Architecture Search at 社内論文読み会
 
cnsm2011_slide
cnsm2011_slidecnsm2011_slide
cnsm2011_slide
 
Deep learning with Tensorflow in R
Deep learning with Tensorflow in RDeep learning with Tensorflow in R
Deep learning with Tensorflow in R
 
Pycon 2016-open-space
Pycon 2016-open-spacePycon 2016-open-space
Pycon 2016-open-space
 
Skytree big data london meetup - may 2013
Skytree   big data london meetup - may 2013Skytree   big data london meetup - may 2013
Skytree big data london meetup - may 2013
 
LocationTech Projects
LocationTech ProjectsLocationTech Projects
LocationTech Projects
 
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
 
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
 
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
 
Distributed Processing Frameworks
Distributed Processing FrameworksDistributed Processing Frameworks
Distributed Processing Frameworks
 

Similar to Toronto meetup 20190917

Spark and Deep Learning frameworks with distributed workloads
Spark and Deep Learning frameworks with distributed workloadsSpark and Deep Learning frameworks with distributed workloads
Spark and Deep Learning frameworks with distributed workloads
S N
 
Open power ddl and lms
Open power ddl and lmsOpen power ddl and lms
Open power ddl and lms
Ganesan Narayanasamy
 
Anirudh Koul. 30 Golden Rules of Deep Learning Performance
Anirudh Koul. 30 Golden Rules of Deep Learning PerformanceAnirudh Koul. 30 Golden Rules of Deep Learning Performance
Anirudh Koul. 30 Golden Rules of Deep Learning Performance
Lviv Startup Club
 
Data Parallel Deep Learning
Data Parallel Deep LearningData Parallel Deep Learning
Data Parallel Deep Learning
inside-BigData.com
 
Large-Scale Training with GPUs at Facebook
Large-Scale Training with GPUs at FacebookLarge-Scale Training with GPUs at Facebook
Large-Scale Training with GPUs at Facebook
Faisal Siddiqi
 
Gopher in performance_tales_ms_go_cracow
Gopher in performance_tales_ms_go_cracowGopher in performance_tales_ms_go_cracow
Gopher in performance_tales_ms_go_cracow
MateuszSzczyrzyca
 
Parallel Application Performance Prediction of Using Analysis Based Modeling
Parallel Application Performance Prediction of Using Analysis Based ModelingParallel Application Performance Prediction of Using Analysis Based Modeling
Parallel Application Performance Prediction of Using Analysis Based Modeling
Jason Liu
 
RAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceRAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data Science
Data Works MD
 
Distributed TensorFlow on Hops (Papis London, April 2018)
Distributed TensorFlow on Hops (Papis London, April 2018)Distributed TensorFlow on Hops (Papis London, April 2018)
Distributed TensorFlow on Hops (Papis London, April 2018)
Jim Dowling
 
Parallel Computing - Lec 6
Parallel Computing - Lec 6Parallel Computing - Lec 6
Parallel Computing - Lec 6
Shah Zaib
 
Introduction to GPUs for Machine Learning
Introduction to GPUs for Machine LearningIntroduction to GPUs for Machine Learning
Introduction to GPUs for Machine Learning
Sri Ambati
 
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
DataWorks Summit/Hadoop Summit
 
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Indrajit Poddar
 
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the CloudData Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Alluxio, Inc.
 
Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...
Omid Vahdaty
 
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSAccelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Databricks
 
GPU and Deep learning best practices
GPU and Deep learning best practicesGPU and Deep learning best practices
GPU and Deep learning best practices
Lior Sidi
 
SigOpt at GTC - Reducing operational barriers to optimization
SigOpt at GTC - Reducing operational barriers to optimizationSigOpt at GTC - Reducing operational barriers to optimization
SigOpt at GTC - Reducing operational barriers to optimization
SigOpt
 
Danny Bickson - Python based predictive analytics with GraphLab Create
Danny Bickson - Python based predictive analytics with GraphLab Create Danny Bickson - Python based predictive analytics with GraphLab Create
Danny Bickson - Python based predictive analytics with GraphLab Create
PyData
 
C3 w3
C3 w3C3 w3

Similar to Toronto meetup 20190917 (20)

Spark and Deep Learning frameworks with distributed workloads
Spark and Deep Learning frameworks with distributed workloadsSpark and Deep Learning frameworks with distributed workloads
Spark and Deep Learning frameworks with distributed workloads
 
Open power ddl and lms
Open power ddl and lmsOpen power ddl and lms
Open power ddl and lms
 
Anirudh Koul. 30 Golden Rules of Deep Learning Performance
Anirudh Koul. 30 Golden Rules of Deep Learning PerformanceAnirudh Koul. 30 Golden Rules of Deep Learning Performance
Anirudh Koul. 30 Golden Rules of Deep Learning Performance
 
Data Parallel Deep Learning
Data Parallel Deep LearningData Parallel Deep Learning
Data Parallel Deep Learning
 
Large-Scale Training with GPUs at Facebook
Large-Scale Training with GPUs at FacebookLarge-Scale Training with GPUs at Facebook
Large-Scale Training with GPUs at Facebook
 
Gopher in performance_tales_ms_go_cracow
Gopher in performance_tales_ms_go_cracowGopher in performance_tales_ms_go_cracow
Gopher in performance_tales_ms_go_cracow
 
Parallel Application Performance Prediction of Using Analysis Based Modeling
Parallel Application Performance Prediction of Using Analysis Based ModelingParallel Application Performance Prediction of Using Analysis Based Modeling
Parallel Application Performance Prediction of Using Analysis Based Modeling
 
RAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceRAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data Science
 
Distributed TensorFlow on Hops (Papis London, April 2018)
Distributed TensorFlow on Hops (Papis London, April 2018)Distributed TensorFlow on Hops (Papis London, April 2018)
Distributed TensorFlow on Hops (Papis London, April 2018)
 
Parallel Computing - Lec 6
Parallel Computing - Lec 6Parallel Computing - Lec 6
Parallel Computing - Lec 6
 
Introduction to GPUs for Machine Learning
Introduction to GPUs for Machine LearningIntroduction to GPUs for Machine Learning
Introduction to GPUs for Machine Learning
 
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
 
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
 
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the CloudData Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
 
Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...
 
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSAccelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
 
GPU and Deep learning best practices
GPU and Deep learning best practicesGPU and Deep learning best practices
GPU and Deep learning best practices
 
SigOpt at GTC - Reducing operational barriers to optimization
SigOpt at GTC - Reducing operational barriers to optimizationSigOpt at GTC - Reducing operational barriers to optimization
SigOpt at GTC - Reducing operational barriers to optimization
 
Danny Bickson - Python based predictive analytics with GraphLab Create
Danny Bickson - Python based predictive analytics with GraphLab Create Danny Bickson - Python based predictive analytics with GraphLab Create
Danny Bickson - Python based predictive analytics with GraphLab Create
 
C3 w3
C3 w3C3 w3
C3 w3
 

More from Bill Liu

Walk Through a Real World ML Production Project
Walk Through a Real World ML Production ProjectWalk Through a Real World ML Production Project
Walk Through a Real World ML Production Project
Bill Liu
 
Redefining MLOps with Model Deployment, Management and Observability in Produ...
Redefining MLOps with Model Deployment, Management and Observability in Produ...Redefining MLOps with Model Deployment, Management and Observability in Produ...
Redefining MLOps with Model Deployment, Management and Observability in Produ...
Bill Liu
 
Productizing Machine Learning at the Edge
Productizing Machine Learning at the EdgeProductizing Machine Learning at the Edge
Productizing Machine Learning at the Edge
Bill Liu
 
Transformers in Vision: From Zero to Hero
Transformers in Vision: From Zero to HeroTransformers in Vision: From Zero to Hero
Transformers in Vision: From Zero to Hero
Bill Liu
 
Deep AutoViML For Tensorflow Models and MLOps Workflows
Deep AutoViML For Tensorflow Models and MLOps WorkflowsDeep AutoViML For Tensorflow Models and MLOps Workflows
Deep AutoViML For Tensorflow Models and MLOps Workflows
Bill Liu
 
Metaflow: The ML Infrastructure at Netflix
Metaflow: The ML Infrastructure at NetflixMetaflow: The ML Infrastructure at Netflix
Metaflow: The ML Infrastructure at Netflix
Bill Liu
 
Practical Crowdsourcing for ML at Scale
Practical Crowdsourcing for ML at ScalePractical Crowdsourcing for ML at Scale
Practical Crowdsourcing for ML at Scale
Bill Liu
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
Bill Liu
 
Deep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its ApplicationsDeep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its Applications
Bill Liu
 
Big Data and AI in Fighting Against COVID-19
Big Data and AI in Fighting Against COVID-19Big Data and AI in Fighting Against COVID-19
Big Data and AI in Fighting Against COVID-19
Bill Liu
 
Highly-scalable Reinforcement Learning RLlib for Real-world Applications
Highly-scalable Reinforcement Learning RLlib for Real-world ApplicationsHighly-scalable Reinforcement Learning RLlib for Real-world Applications
Highly-scalable Reinforcement Learning RLlib for Real-world Applications
Bill Liu
 
Build computer vision models to perform object detection and classification w...
Build computer vision models to perform object detection and classification w...Build computer vision models to perform object detection and classification w...
Build computer vision models to perform object detection and classification w...
Bill Liu
 
Causal Inference in Data Science and Machine Learning
Causal Inference in Data Science and Machine LearningCausal Inference in Data Science and Machine Learning
Causal Inference in Data Science and Machine Learning
Bill Liu
 
Weekly #106: Deep Learning on Mobile
Weekly #106: Deep Learning on MobileWeekly #106: Deep Learning on Mobile
Weekly #106: Deep Learning on Mobile
Bill Liu
 
Weekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
Weekly #105: AutoViz and Auto_ViML Visualization and Machine LearningWeekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
Weekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
Bill Liu
 
AISF19 - On Blending Machine Learning with Microeconomics
AISF19 - On Blending Machine Learning with MicroeconomicsAISF19 - On Blending Machine Learning with Microeconomics
AISF19 - On Blending Machine Learning with Microeconomics
Bill Liu
 
AISF19 - Travel in the AI-First World
AISF19 - Travel in the AI-First WorldAISF19 - Travel in the AI-First World
AISF19 - Travel in the AI-First World
Bill Liu
 
AISF19 - Unleash Computer Vision at the Edge
AISF19 - Unleash Computer Vision at the EdgeAISF19 - Unleash Computer Vision at the Edge
AISF19 - Unleash Computer Vision at the Edge
Bill Liu
 
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
Bill Liu
 
Feature Engineering for NLP
Feature Engineering for NLPFeature Engineering for NLP
Feature Engineering for NLP
Bill Liu
 

More from Bill Liu (20)

Walk Through a Real World ML Production Project
Walk Through a Real World ML Production ProjectWalk Through a Real World ML Production Project
Walk Through a Real World ML Production Project
 
Redefining MLOps with Model Deployment, Management and Observability in Produ...
Redefining MLOps with Model Deployment, Management and Observability in Produ...Redefining MLOps with Model Deployment, Management and Observability in Produ...
Redefining MLOps with Model Deployment, Management and Observability in Produ...
 
Productizing Machine Learning at the Edge
Productizing Machine Learning at the EdgeProductizing Machine Learning at the Edge
Productizing Machine Learning at the Edge
 
Transformers in Vision: From Zero to Hero
Transformers in Vision: From Zero to HeroTransformers in Vision: From Zero to Hero
Transformers in Vision: From Zero to Hero
 
Deep AutoViML For Tensorflow Models and MLOps Workflows
Deep AutoViML For Tensorflow Models and MLOps WorkflowsDeep AutoViML For Tensorflow Models and MLOps Workflows
Deep AutoViML For Tensorflow Models and MLOps Workflows
 
Metaflow: The ML Infrastructure at Netflix
Metaflow: The ML Infrastructure at NetflixMetaflow: The ML Infrastructure at Netflix
Metaflow: The ML Infrastructure at Netflix
 
Practical Crowdsourcing for ML at Scale
Practical Crowdsourcing for ML at ScalePractical Crowdsourcing for ML at Scale
Practical Crowdsourcing for ML at Scale
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
 
Deep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its ApplicationsDeep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its Applications
 
Big Data and AI in Fighting Against COVID-19
Big Data and AI in Fighting Against COVID-19Big Data and AI in Fighting Against COVID-19
Big Data and AI in Fighting Against COVID-19
 
Highly-scalable Reinforcement Learning RLlib for Real-world Applications
Highly-scalable Reinforcement Learning RLlib for Real-world ApplicationsHighly-scalable Reinforcement Learning RLlib for Real-world Applications
Highly-scalable Reinforcement Learning RLlib for Real-world Applications
 
Build computer vision models to perform object detection and classification w...
Build computer vision models to perform object detection and classification w...Build computer vision models to perform object detection and classification w...
Build computer vision models to perform object detection and classification w...
 
Causal Inference in Data Science and Machine Learning
Causal Inference in Data Science and Machine LearningCausal Inference in Data Science and Machine Learning
Causal Inference in Data Science and Machine Learning
 
Weekly #106: Deep Learning on Mobile
Weekly #106: Deep Learning on MobileWeekly #106: Deep Learning on Mobile
Weekly #106: Deep Learning on Mobile
 
Weekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
Weekly #105: AutoViz and Auto_ViML Visualization and Machine LearningWeekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
Weekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
 
AISF19 - On Blending Machine Learning with Microeconomics
AISF19 - On Blending Machine Learning with MicroeconomicsAISF19 - On Blending Machine Learning with Microeconomics
AISF19 - On Blending Machine Learning with Microeconomics
 
AISF19 - Travel in the AI-First World
AISF19 - Travel in the AI-First WorldAISF19 - Travel in the AI-First World
AISF19 - Travel in the AI-First World
 
AISF19 - Unleash Computer Vision at the Edge
AISF19 - Unleash Computer Vision at the EdgeAISF19 - Unleash Computer Vision at the Edge
AISF19 - Unleash Computer Vision at the Edge
 
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
 
Feature Engineering for NLP
Feature Engineering for NLPFeature Engineering for NLP
Feature Engineering for NLP
 

Recently uploaded

IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 

Recently uploaded (20)

IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Neuro-symbolic is not enough, we need neuro-*semantic*

Toronto meetup 20190917

  • 14. Prior Arts: Two-step Approach (Tencent Jul/2018, Uber Oct/2018) §15 NOT SCALABLE for Many Learners - Still too many iterations (latency adds up fast) - Sub-optimal traffic pattern due to additional Reduce and Broadcast - Only master GPUs are active
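The two-step (hierarchical) pattern above is straightforward to sketch with MPI split communicators: reduce gradients to a per-node master, allreduce across masters only, then broadcast back inside each node. The snippet below is a minimal illustration using mpi4py and NumPy; the gradient array and the one-rank-per-GPU layout are assumptions for the example, and this is not the Tencent or Uber (Horovod) implementation.

    # Minimal two-step (hierarchical) allreduce sketch with mpi4py; illustrative only,
    # not the Tencent/Uber implementation. Assumes one MPI rank per GPU.
    import numpy as np
    from mpi4py import MPI

    world = MPI.COMM_WORLD

    # Group ranks that share a node; local rank 0 plays the "master GPU" role.
    node_comm = world.Split_type(MPI.COMM_TYPE_SHARED, key=world.Get_rank())
    is_master = (node_comm.Get_rank() == 0)
    master_comm = world.Split(color=0 if is_master else MPI.UNDEFINED,
                              key=world.Get_rank())

    grad = np.random.rand(1024).astype(np.float32)   # stand-in for a local gradient
    agg = np.zeros_like(grad)

    # Step 1: intra-node reduce onto the master rank (fast local path).
    node_comm.Reduce(grad, agg, op=MPI.SUM, root=0)

    # Step 2: inter-node allreduce among masters only (fewer participants on the slower network).
    if is_master:
        total = np.zeros_like(agg)
        master_comm.Allreduce(agg, total, op=MPI.SUM)
        agg = total

    # Step 3: intra-node broadcast of the aggregated gradient back to every rank.
    node_comm.Bcast(agg, root=0)

As the slide notes, during steps 1 and 3 only the master GPUs participate in cross-node traffic, and the extra reduce/broadcast rounds still add latency per iteration.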
  • 15. DDL: Mix-Match for Best Performance - Communication libraries: MPI, NCCL, IB_Verb, SharedMem, OpenFabric, Custom-lib (IBM, Nvidia, Mellanox, Intel, OpenCAPI) - Topologies/algorithms: Ring, Recursive, Tree [NeurIPS18, SYSML19] - IBM DDL: https://arxiv.org/pdf/1708.02188.pdf
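For intuition on why mixing algorithms pays off: latency-bound small messages favour tree/recursive schemes (O(log N) rounds), while bandwidth-bound large messages favour ring reduce-scatter/all-gather. The toy selector below illustrates that trade-off; the cost model, thresholds and parameter values are assumptions for the example, not DDL's actual selection logic.

    # Toy allreduce-algorithm selector: illustrative cost model only, not IBM DDL's logic.
    import math

    def pick_allreduce(message_bytes, n_learners, latency_s=5e-6, bandwidth_Bps=12.5e9):
        """Compare two simple cost models: ring vs. recursive/tree allreduce."""
        # Ring: 2*(N-1) latency-bound steps, but near bandwidth-optimal data volume.
        ring_cost = 2 * (n_learners - 1) * latency_s + 2 * message_bytes / bandwidth_Bps
        # Recursive/tree: log2(N) rounds, each moving the full message.
        tree_cost = (math.ceil(math.log2(n_learners)) *
                     (latency_s + message_bytes / bandwidth_Bps))
        return "ring" if ring_cost < tree_cost else "tree/recursive"

    print(pick_allreduce(256, 128))           # tiny gradient bucket -> tree/recursive
    print(pick_allreduce(100 * 2**20, 128))   # 100 MB bucket -> ring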
  • 16. More Challenges • Flexibility • Multiple DL frameworks support • Developer transparency • Auto Scaling & Elastic training • Fault tolerance • Service Quality • Scalability & Performance & Accuracy
  • 17. 18 Training challenges and reactions to Elastic Training Distributed training is great, but why do we only run training on a single GPU? You do not want the speed-up? 300+ GPUs and 300+ students; each researcher is entitled to use 1 GPU. You are in meetings right now. Are you using the GPUs allocated to you? No … if you ask for 16 GPUs, you will never get them in a busy cluster. A classic large-job starvation problem! If you run large jobs and use more than your share, your jobs will be killed. What if you start with 1 GPU; your job can grow; if there are other high-priority jobs, your job gracefully shrinks back to your own quota. F*&?% brilliant idea!
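The grow-then-shrink-to-quota behaviour described above can be expressed as a simple scheduling rule. The sketch below is a hypothetical illustration of that policy; the class and function names are invented for the example and this is not the actual Spectrum Conductor scheduler.

    # Hypothetical grow/shrink policy: a job may borrow idle GPUs beyond its quota,
    # but is trimmed back to the quota when higher-priority demand appears.
    from dataclasses import dataclass

    @dataclass
    class ElasticJob:
        quota: int          # GPUs the owner is entitled to
        allocated: int = 0  # GPUs currently held

    def rebalance(job: ElasticJob, idle_gpus: int, higher_priority_demand: int) -> int:
        """Return the new allocation for `job` (illustrative policy only)."""
        if higher_priority_demand > 0:
            # Gracefully shrink, but never below the owner's quota (or 1 GPU).
            return max(min(job.allocated, job.quota), 1)
        # Otherwise grow opportunistically into idle capacity.
        return job.allocated + idle_gpus

    job = ElasticJob(quota=1, allocated=1)
    job.allocated = rebalance(job, idle_gpus=15, higher_priority_demand=0)   # grows to 16
    job.allocated = rebalance(job, idle_gpus=0, higher_priority_demand=8)    # shrinks to 1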
  • 18. Distributed Tensorflow (static cluster configuration; mixed data ingest, PS and worker logic; model/graph definition — EDT only needs this part; training runtime management)
        # cluster specification (static)
        parameter_servers = ["pc-01:2222"]
        workers = ["pc-02:2222", "pc-03:2222", "pc-04:2222"]
        cluster = tf.train.ClusterSpec({"ps": parameter_servers, "worker": workers})
        tf.app.flags.DEFINE_string("job_name", "", "Either 'ps' or 'worker'")
        tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job")
        FLAGS = tf.app.flags.FLAGS
        server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)
        mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
        if FLAGS.job_name == "ps":
            server.join()
        elif FLAGS.job_name == "worker":
            with tf.device(tf.train.replica_device_setter(
                    worker_device="/job:worker/task:%d" % FLAGS.task_index, cluster=cluster)):
                with tf.name_scope('input'):
                    x = tf.placeholder(tf.float32, shape=[None, 784], name="x-input")
                    y_ = tf.placeholder(tf.float32, shape=[None, 10], name="y-input")
                with tf.name_scope("weights"):
                    W1 = tf.Variable(tf.random_normal([784, 100]))
                    W2 = tf.Variable(tf.random_normal([100, 10]))
                with tf.name_scope("softmax"):
                    y = tf.nn.softmax(z3)
                # …... lines of code elided
                with tf.name_scope('train'):
                    # optimizer is an "operation" which we can execute in a session
                    grad_op = tf.train.GradientDescentOptimizer(learning_rate)
                    train_op = grad_op.minimize(cross_entropy, global_step=global_step)
            sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                                     global_step=global_step,
                                     init_op=init_op)
            with sv.prepare_or_wait_for_session(server.target) as sess:
                if FLAGS.task_index == 0:
                    # the chief manages the model, logging and other bookkeeping
                    writer = tf.train.SummaryWriter(logs_path, graph=tf.get_default_graph())
                for epoch in range(training_epochs):
                    for i in range(batch_count):
                        batch_x, batch_y = mnist.train.next_batch(batch_size)
                        _, cost, summary, step = sess.run(
                            [train_op, cross_entropy, summary_op, global_step],
                            feed_dict={x: batch_x, y_: batch_y})
                        writer.add_summary(summary, step)
  • 19. MPI – A reference point for parallel applications: IBM Spectrum MPI V10.1, optimized PAMI, point-to-point performance (latency). PingPong latency (IMB, Firestone/EDR), in microseconds by message size:
        Message size (bytes) | SPECTRUM MPI | MVAPICH | ompi | MPICH
        0                    | 1.09         | 1.66    | 1.4  | 1.44
        1                    | 1.1          | 1.81    | 1.38 | 1.45
        2                    | 1.1          | 1.81    | 1.37 | 1.45
        4                    | 1.1          | 1.81    | 1.37 | 1.44
        8                    | 1.13         | 1.8     | 1.37 | 1.45
        16                   | 1.17         | 1.85    | 1.48 | 1.46
        32                   | 1.21         | 1.84    | 1.5  | 1.64
        64                   | 1.29         | 1.86    | 1.51 | 1.67
        128                  | 1.35         | 1.71    | 1.74 | 1.99
        256                  | 1.85         | 2.18    | 1.87 | 2.14
        512                  | 2            | 2.33    | 2.28 | 2.28
        1024                 | 2.17         | 2.57    | 2.53 | 2.93
        2048                 | 2.5          | 3       | 3.04 | 3.62
        4096                 | 3.16         | 3.98    | 4.23 | 4.18
        8192                 | 4.15         | 5.34    | 5.21 | 5.55
        16384                | 5.2          | 12.38   | 9.9  | 8.47
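The numbers above come from the IMB PingPong benchmark. For intuition, a bare-bones two-rank ping-pong latency measurement can be sketched with mpi4py as below; the message sizes, repetition count and timing loop are assumptions for the example, and this is not the IMB code.

    # Minimal two-rank ping-pong latency sketch (illustrative, not Intel MPI Benchmarks).
    # Run with: mpirun -np 2 python pingpong.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    reps = 1000

    for size in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]:
        buf = np.zeros(size, dtype=np.uint8)
        comm.Barrier()
        start = MPI.Wtime()
        for _ in range(reps):
            if rank == 0:
                comm.Send(buf, dest=1, tag=0)
                comm.Recv(buf, source=1, tag=0)
            else:
                comm.Recv(buf, source=0, tag=0)
                comm.Send(buf, dest=0, tag=0)
        elapsed = MPI.Wtime() - start
        if rank == 0:
            # Half the average round-trip time is the usual one-way latency estimate.
            print(f"{size:6d} bytes: {elapsed / reps / 2 * 1e6:.2f} us")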
  • 20. MPI – A reference point for parallel applications §SPMD programming model §Peer-to-peer model: a common binary runs on each core or CPU and discovers its rank at run-time (ranks 0 … N-1) §Advantages • Standard, portable • Fast, low-latency communications • Many features • point-to-point, message selectivity, collective operations, process groups, etc. §But challenges too! (Not cloud native) • Not fault-tolerant • Programmer needs to keep track of rank • Exception handling left to the developer • Resource allocations are static • Computations not distributed optimally • Challenging to debug
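As a concrete illustration of the SPMD / rank-discovery model described above, a minimal mpi4py program (a generic example, not tied to Spectrum MPI) looks like this:

    # SPMD "hello world": the same program runs everywhere and discovers its rank.
    # Run with: mpirun -np 4 python hello_mpi.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()          # this process's identity within the job
    size = comm.Get_size()          # total number of processes
    host = MPI.Get_processor_name()

    print(f"rank {rank} of {size} running on {host}")

    # The programmer, not the runtime, decides what each rank does:
    if rank == 0:
        data = {"msg": "work item"}
        for dest in range(1, size):
            comm.send(data, dest=dest, tag=0)
    else:
        data = comm.recv(source=0, tag=0)

The challenges listed on the slide follow directly from this model: if one rank fails the whole job fails, and the rank-to-work mapping is hard-coded by the developer rather than managed by the runtime.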
  • 21. Elastic Fabric – Converge HPC and Cloud native. Traditional HPC app linked with a communication lib (MPI, TCP): the application owns the control flow, e.g. Main { Initialize(host1, host2, host3, …); Printf(); Send(work1, …); } with peers running GetMes(…); Cal(); Send(work2, …). Cloud-native fabrics (Spark, MR, Symphony SOAM, etc.): a high-performance fabric manages workload and state and calls out to user code, to enable elasticity, resilience and mobility and to hide infrastructure complexity and deployment. [Diagram: a client submits a session of tasks such as PriceFXOpt() and PriceFXFW(); the fabric's master dispatches them across service instances.]
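The key inversion of control is that the fabric, not the application, owns the main loop and calls out to user code. The sketch below is a hypothetical illustration of that pattern; ToyFabric and its methods are invented for the example and are not the Symphony SOAM or Conductor API.

    # Callback-style worker: the fabric owns the loop and the placement;
    # user code only supplies the per-task computation. Illustrative only.
    from typing import Any, Callable, List, Optional

    class ToyFabric:
        """Stand-in for an elastic fabric that schedules tasks onto workers."""
        def __init__(self) -> None:
            self._handler: Optional[Callable[[Any], Any]] = None

        def register(self, handler: Callable[[Any], Any]) -> None:
            # User code is registered once; the fabric decides when and where it runs.
            self._handler = handler

        def run(self, tasks: List[Any]) -> List[Any]:
            # A real fabric would distribute tasks, recover failures and rescale;
            # calling the handler locally here just shows the inversion of control.
            return [self._handler(t) for t in tasks]

    def price_task(task: dict) -> float:
        # User-supplied compute kernel, e.g. one pricing step of a session.
        return task["notional"] * task["rate"]

    fabric = ToyFabric()
    fabric.register(price_task)
    print(fabric.run([{"notional": 1e6, "rate": 0.012},
                      {"notional": 5e5, "rate": 0.015}]))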
  • 22. Elastic Deep Learning • Have the best of performance and flexibility • Auto Scale up and down based on resource plan • Priority, Real time Fairshare, FIFO • Transparent for Tensorflow, Pytorch and Caffe • Convergence and Hyperparameter awareness Combine the best scheduling and fastest communication with DL specific high performance optimization
  • 23. Elastic Distributed Training Engine Session Scheduler Elastic Scaling DL Driver DL Framework Sync Engine (DDL) Work Wrapper Resource policy (Scaling, preemption, migration) Training planning, scheduling (Training task, micro-batch pipeline, sync-plan) Worker wrapper (model transparency, Data ingest) Tensorflow, Caffe, Pytorch, Keras High performance synchronization (sync, async, p-2-p, centralized) (New RDMA Library) Elastic distribution challenges: • Graceful pre-emption • Auto scale • Dynamic priority • Fault tolerant • Speed up, performance (DDL) • Accuracy • Synchronization algorithm • Topology & GPU Aware • Model transparency
  • 24. Auto scaling and pre-emption • Resource policy drives the scale up and down • Priority • Sharing policy (fair share, FIFO) • GPU demand • Cluster-wide utilization • The fabric handles the scaling • Automatically, without interruption • Supports both sync and async models • Keeps/adjusts batch size and other hyperparameters during scaling (see the sketch below)
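One common way to keep training semantics stable while the GPU count changes is to hold the effective global batch size constant and rescale the learning rate with the effective batch (the linear-scaling rule). The helper below is a minimal sketch of that convention; the function name and the linear-scaling choice are assumptions for illustration, not necessarily what EDT does internally.

    # Sketch: keep the global batch size constant as workers come and go, and
    # rescale the learning rate with the effective batch (linear scaling rule).
    def rescale_for_workers(global_batch: int, base_lr: float, base_batch: int,
                            num_workers: int):
        per_worker_batch = max(1, global_batch // num_workers)
        effective_batch = per_worker_batch * num_workers   # may differ slightly from global_batch
        lr = base_lr * effective_batch / base_batch
        return per_worker_batch, lr

    # Example: a job tuned for batch 256 / lr 0.1 scales from 4 GPUs down to 2.
    print(rescale_for_workers(256, 0.1, 256, 4))   # (64, 0.1)
    print(rescale_for_workers(256, 0.1, 256, 2))   # (128, 0.1)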
  • 25. Keras AutoScaling – Go Beyond one GPU with EDT. Same model definition; same optimizer and loss function; MAO hidden as a callback; similar API for compile and fit; data source support includes Spark dataframes.
        # EDT – N GPUs
        mlp = Sequential()
        mlp.add(Dense(1000, input_shape=(784,)))
        mlp.add(Activation('relu'))
        mlp.add(Dense(250))
        mlp.add(Activation('relu'))
        mlp.add(Dense(10))
        mlp.add(Activation('softmax'))
        trainer = ElasticDL(model=mlp, loss='categorical_crossentropy',
                            optimizer=optimizer_mlp, batch_size=4, num_epoch=1)
        trainer.fit(training_set, epochs=4)

        # Native Keras – 1 GPU
        mlp = Sequential()
        mlp.add(Dense(1000, input_shape=(784,)))
        mlp.add(Activation('relu'))
        mlp.add(Dense(250))
        mlp.add(Activation('relu'))
        mlp.add(Dense(10))
        mlp.add(Activation('softmax'))
        mlp.compile(loss='categorical_crossentropy', optimizer=optimizer_mlp, callback_MAO)
        mlp.fit(training_set, epochs=epochs)
  • 26. Pytorch – Transparent Scaling through EDT
        class Net(nn.Module):
            def __init__(self):
                super(Net, self).__init__()
                self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
                self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
                self.conv2_drop = nn.Dropout2d()
                self.fc1 = nn.Linear(320, 50)
                self.fc2 = nn.Linear(50, 10)

            def forward(self, x):
                x = F.relu(F.max_pool2d(self.conv1(x), 2))
                x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
                x = x.view(-1, 320)
                x = F.relu(self.fc1(x))
                x = F.dropout(x, training=self.training)
                x = self.fc2(x)
                return F.log_softmax(x, dim=1)

        model = Net()
        optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum)

        # EDT PyTorch
        model = ElasticDL(model, optimizer, F.nll_loss, dataLoader)  # hide MAO and data ingest in workers
        model.train(200, 64)

        # Native PyTorch
        def train(model, device, train_loader, optimizer, epoch):
            model.train()
            for batch_idx, (data, target) in enumerate(train_loader):
                data, target = data.to(device), target.to(device)
                optimizer.zero_grad()
                output = model(data)
                loss = F.nll_loss(output, target)
                loss.backward()
                optimizer.step()
  • 27. Transparent DL Insight • No code modification • Customize the live metrics • Plugin interface with third party monitoring • Elastic and interactive with notebook, and developer tools • Scale up and down without interruption
  • 28. Autoscaling with Accuracy, Transparency March 2019 31 Maintain the same accuracy when scaling GPU up and down. 4 GPUs -> 2 GPUs, preemption 1 GPU One line of code change – train anywhere on multiple nodes and multiple GPUs Interactive experience in notebook
  • 29. Elastic AI Solutions: IBM Power with IBM Storage March 18, 2019 March 2019 32
  • 30. For questions, contact: Olga Yiparaki yiparaki@us.ibm.com Chief engineer, IBM Storage Performance Eric Fiala Eric.J.Fiala@ibm.com Solution Architect, Spectrum Computing Constantine Arnold Constantine.Arnold@ibm.com Data Science and Storage Systems Research Jay Vaddi jayvaddi@us.ibm.com IBM Storage Performance Brian Porter bporter1@us.ibm.com Client Technical Specialist, IBM Systems March 2019 33
  • 31. IBM Storage: Spectrum Scale NVMe all-flash appliance March 2019 35 [Diagram: IB EDR fabric; up to 8 hosts were used in these tests and compute nodes can be increased elastically; only one storage AFA node was used throughout these tests, and additional storage can be added independently of compute nodes; 2x IB links per host, 8x IB links to storage] • NVMe-based storage provides more than ample performance for these AI benchmarks, which saturate the GPUs • A single AFA storage node uses 2U of rack space and provides ~63 TB of user capacity • Max read from storage: over 35 GB/s, assuming enough network adapters • Storage can be increased in a linear fashion to meet capacity and/or performance requirements. Power9 with IBM Watson Machine Learning Accelerator (WMLA): • Up to 8 x Power9 AC922 hosts in this environment • GTX model, water cooled • 512 GB RAM per Power9 • 6 GPUs per Power9 host, up to 48 GPUs in this environment • Single dual-ported IB adapter per Power9 host
  • 32. Elastic Distributed Training Scaling efficiency. Framework: TensorFlow (Elastic Distributed Training); Spark instance group: dliauto; Model: InceptionV3; Batch size: 64; Dataset: flowers. Hyperparameters – Learning rate policy: exponential; Base learning rate: 0.01; Decay steps: 4000; Learning rate decay: 0.9; Staircase: TRUE; Solver type: GradientDescent; Maximum iterations: 10000 • With the Elastic Distributed Training capabilities included in Watson Machine Learning Accelerator, the system dynamically scales out to accommodate the demands of growing AI applications • Quick growth demands are accommodated elastically, with ease of management • The measured data below shows high scaling efficiency as the number of GPUs and hosts increases March 2019 36
        Speedup by scaling out hosts & GPUs (iterations/min vs. single-host baseline): 1 host = 1.0x, 2 hosts = 2.0x, 4 hosts = 3.8x, 8 hosts = 7.5x
        Efficiency vs. 6-GPU baseline (50K iterations): 6 GPUs = 100%, 12 GPUs = 100%, 24 GPUs = 96%, 48 GPUs = 94%
        In all these cases, the NVMe-based storage remains unchanged and provides more than ample performance, since the AI workload saturates the GPUs, as evidenced by the high scaling efficiency.
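For reference, scaling efficiency here is simply the measured speedup divided by the scale-out factor: for example, 7.5x on 8 hosts gives 7.5 / 8 ≈ 94%, matching the 48-GPU figure above.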
  • 33. IBM WML-A Enables Service Level Agreements by absorbing multi-tenant growth • POWER9 enables the rapid growth of AI demands: as the number of tenants increases, every new job or user adds negligible overhead, enabling predictable behavior and SLAs (Service Level Agreements) • Coupled with EDT (Elastic Distributed Training), multitenant workloads are accommodated elastically, with ease of management • This showcases measured data, with the same negligible overheads as the number of GPUs varies. March 2019 37 [Chart: overhead of each additional tenant relative to the average (±10% scale), for 3 GPUs x 16 tenants, 6 GPUs x 8 tenants, 12 GPUs x 4 tenants, 24 GPUs x 2 tenants, and 48 GPUs x 1 tenant] In all these cases, the NVMe-based storage remains unchanged and provides more than ample performance, enabling the GPUs to accommodate the multitenant AI workload without any slowdowns, as evidenced by the negligible overheads. Multitenancy overheads are on par with the corresponding overheads when each server uses local storage instead of the external IBM storage used in these tests.
  • 34. Improve data scientists' productivity by 31% and IT resource utilization by 33% in a multi-user shared cluster of GPUs running IBM WML-Accelerator on POWER9 AC922 servers with Nvidia Tesla V100 GPUs connected via NVLink 2.0 • 1.31X reduction in training time for multiple concurrent experiments vs. tested x86 systems • 4 jobs of InceptionV3 trained for 15000 iterations on the Flowers dataset, requiring 3 GPUs each, running on 2 AC922 nodes with 4 GPUs each • 1.33x improvement in utilization of the POWER9 DL infrastructure vs. tested x86 systems • 0 wait time – jobs submitted will be executed even if the cluster is busy, due to IBM WML-Accelerator's elastic scaling • Supports multi-tenancy and elasticity with a fairshare scheduling policy. A Simple Scenario • Results are based on IBM internal measurements running 15000-iteration training of the InceptionV3 model (mini-batch size = 32 per GPU) on the Flowers dataset. • Power AC922; 40 cores (2 x 20c chips), POWER9 with NVLink 2.0; 3.8 GHz, 1 TB memory, 4x Tesla V100 GPU; Red Hat Enterprise Linux 7.5 for Power Little Endian (POWER9) with CUDA 9.2 / cuDNN 7.2.1; WML-A v1.1.1. • Competitive stack: 2x Xeon(R) Gold 6150; 36 cores (2 x 18c chips); 2.70 GHz; 512 GB memory, 4x Tesla V100 GPU, Ubuntu 16.04.4 with CUDA 9.1 / cuDNN 7.1.2; NGC image nvcr.io/nvidia/tensorflow version 18.08-py2; Kubernetes v1.11.2 – idle resource. [Chart: time taken (minutes) for 4 multiuser jobs (3-3-3-3 GPUs), InceptionV3/Flowers training for 15000 iterations: x86 = 34.4 min, AC922 = 26.22 min]
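The headline 1.31X figure follows from the chart data: 34.4 minutes on the tested x86 stack versus 26.22 minutes on AC922, and 34.4 / 26.22 ≈ 1.31.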
  • 35. [End-to-end pipeline diagram] Data sources (traditional business data, sensor data, data from collaboration partners, data from mobile apps & social media, legacy data) → data ingestion, data preparation and pre-processing (heavy IO, instrumentation, training dataset) → ML/DL training & execution with Watson ML Accelerator: AI deep learning frameworks (Tensorflow, Caffe, …), distributed & elastic training for deep learning, parallel hyper-parameter search & optimization (network models, hyper-parameters, testing dataset), monitor & advise, iterate → trained-model life-cycle management → deploy in production using the trained model (REST API) for inference on new data. Underpinned by the multi-tenant, shared-services architecture (Conductor): resource groups, consumers, resource plans, instance groups, resiliency, workload management, notebooks, Anaconda, reporting, security.
  • 36. Watson ML Accelerator Technical References
        Tutorial | URL
        Classify images with IBM Watson Machine Learning Accelerator | https://developer.ibm.com/tutorials/use-computer-vision-with-dli-watson-machine-learning-accelerator/
        Train Keras and Mllib models with IBM Watson Machine Learning Accelerator | https://developer.ibm.com/tutorials/training-keras-and-mllib-model-with-watson-machine-learning-accelerator/
        Get dynamic, elastic, and fine-grained resource allocations and controls for accelerating multiple model trainings simultaneously | https://developer.ibm.com/tutorials/dynamic-resilient-and-elastic-deep-learning-with-watson-machine-learning-accelerator/
        Train Xgboost models with IBM Watson Machine Learning Accelerator | https://developer.ibm.com/tutorials/train-xgboost-models-within-watson-ml-accelerator/
        Accelerate Generalized Linear Model training with IBM Watson Machine Learning Accelerator and Snap ML | https://developer.ibm.com/tutorials/accelerate-machine-model-training-with-watson-ml-accelerator-snap-ml/
        Accelerate tree-based model training with Watson Machine Learning Accelerator and Snap ML | https://developer.ibm.com/tutorials/accelerate-random-forest-model-training-with-watson-ml-accelerator/
        § Machine Learning and Deep Learning with IBM Watson Machine Learning Accelerator Series § Offers a walkthrough and hands-on experience with Watson ML Accelerator's key differentiators § English Series: https://developer.ibm.com/series/learn-watson-machine-learning-accelerator/ § Chinese Series: https://developer.ibm.com/cn/blog/2019/learn-ibm-powerai-enterprise/