Elastic Distributed Deep Learning Training at Large Scale in On-Prem and Cloud Production
Junfeng Liu, STSM, jfliu@ca.ibm.com
Kelvin Lui, Technical Product Manager, kelvinl@ca.ibm.com
Yonggang Hu, Distinguished Engineer, yhu@ca.ibm.com
ibm.com/spectrum-computing
ibm.com/us-en/marketplace/deep-learning-platform
Red Bull Racing: Competing with Computing
Every week a new challenge, but part of a season-long strategy
52 wins, 58 poles, 135 podiums, 52 fastest laps, 4 Formula One Constructors' World Championships
A decade of racing successes, a decade with Spectrum Computing
New Car for 2017
• Tailored to the track
• Complex virtual design and simulation models
• >200-step simulations
• 30K engineering changes per season
Real-time Decision Making
• Car sensors and real-time telemetry drive decisions before, during, and after the race
• >100 sensors per car
• Pit stops under 2 seconds
Race Strategy
• Scenario-driven decision making
• 1000s of scenarios run per race
• Modeled environments: rain, heat, delays
• Pit stops and tire choices win or lose races
Agenda
The needs and challenges of running distributed training
Elastic Distributed Training
• Architecture
• Benchmark
• Interface
Use Cases
Demo (time permitting)
Next Steps
Deep Learning Needs HPC & Big Data
AI Demand for Compute
Resource matters
• Tesla V100 – 32GB: ~$11,458 (roughly the price of a Nissan Versa)
• IBM AC922 – 4x V100: ~$80,000 (an Audi A8)
• Nvidia DGX-2: ~$399,000 (a Lamborghini Aventador)
• $500,000,000
Workload matters
• Inference – a simple language model: 125 TFLOPS; needs 1 TFLOPS – in milliseconds
• Training – ResNet-50 on ImageNet-1K: 14 days to train on 1 GPU; 29 hours on 8 GPUs; minutes on Summit; requires proper implementation and tuning
• At production scale: tens of models, hundreds of tuning runs, hundreds of users, thousands of GPU-days for NAS, millions of pictures, thousands of datasets, millions of jobs, billions of inferences, SLAs from milliseconds to days
Faster Training Time with Large-scale Deep Learning
• Image recognition training: 9 days per run reduced to 4 hours per run – 54x more learning runs with Power8
• What will you do? Iterate more and create more accurate models? Create more models? Both?
• From 2015 to 2018
– GPU Compute: 18 TFLOPS to 112 TFLOPS (FP16)
– GPU Memory: 16GB to 32GB
– Communication: 100Gbps to 200Gbps
3x
Large-scale Deep Learning
[skymind.ai]
• Data parallelism: constant traffic per GPU (only the network/model size matters)
• Model parallelism: partitioning dictates traffic (needs significant research)
[DeepSpeech2]
IBM Spectrum Conductor
VGG as an example
• 128.3M values per GPU
• allreduce–broadcast sync model
• Data transferred grows with the number of GPUs
• 4 GPUs ≈ 1,026M values transferred every iteration (a back-of-the-envelope sketch follows below)
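To make the arithmetic concrete, the short Python sketch below estimates the per-iteration traffic of the reduce-plus-broadcast scheme described above; the 128.3M figure comes from the slide, and the counting convention (gradients in, model out) is an assumption for illustration only.

# Rough traffic estimate for a reduce + broadcast synchronization scheme:
# every GPU sends its full gradient and receives the full updated model back,
# so the per-iteration volume grows linearly with the number of GPUs.
PARAMS = 128.3e6  # gradient/model values per GPU (VGG, from the slide)

def reduce_broadcast_volume(num_gpus, params=PARAMS):
    """Values moved per iteration: gradients in plus model out, for all GPUs."""
    return 2 * num_gpus * params

for n in (1, 2, 4, 8):
    print(f"{n} GPUs: ~{reduce_broadcast_volume(n) / 1e6:.0f}M values per iteration")
# 4 GPUs -> ~1026M values, matching the number on the slide.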
Distribution Challenge
Source: https://www.semanticscholar.org/paper/Poseidon-An-Efficient-Communication-Architecture-f-Zhang-Zheng/c37145669be8e7f14f4cdd5ddc3935ea03a54673
Using Allreduce for SGD
• More performant
• MPI, NCCL
• Used by all large-scale studies
• Scalable
(Figure: each worker's local gradient is combined via allreduce into an aggregated gradient; a minimal sketch follows below.)
[skymind.ai, mpitutorial.com]
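As a minimal illustration of allreduce-based synchronous SGD, the sketch below uses stock torch.distributed (NCCL or MPI backend). It is not the IBM DDL API; the model, data loading, and process-group initialization are assumed to exist elsewhere.

import torch.distributed as dist

def allreduce_sgd_step(model, optimizer):
    """Average gradients across all workers, then apply one optimizer step."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum gradients over workers
            p.grad.div_(world_size)                        # turn the sum into a mean
    optimizer.step()
    optimizer.zero_grad()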
Prior Art: Ring-based Allreduce (Thakur 2005, Baidu Feb/2017)
Two phases: reduce-scatter, then all-gather (see the sketch below)
• Bandwidth-optimal for homogeneous network architectures
– Each step is throttled by the worst link bandwidth (i.e., pipelined)
• Linear dependency on latency
– N GPUs incur roughly N× the latency overhead
– Recursive schemes exist but are only optimal for 2^m learners
NOT SCALABLE for many learners
- Too many iterations (latency adds up fast)
- The weakest link slows down the others (cluster, cloud)
[Baidu.com]
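The NumPy sketch below simulates the two phases on one machine purely to show the data movement: in each of the 2(N-1) steps every worker exchanges one chunk with its ring neighbour, which is exactly where the N-proportional latency overhead comes from. This is an illustrative simulation under those assumptions, not Baidu's or IBM's implementation.

import numpy as np

def ring_allreduce(grads):
    """Simulate ring allreduce (reduce-scatter + all-gather) over equally sized
    per-worker gradient vectors; every worker ends with the element-wise sum."""
    n = len(grads)
    chunks = [np.array_split(np.asarray(g, dtype=float).copy(), n) for g in grads]

    # Phase 1: reduce-scatter. In each of the n-1 steps every worker sends one
    # chunk to its right neighbour, which adds it to its own copy. Afterwards
    # worker i holds the fully reduced chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, chunks[i][(i - step) % n].copy()) for i in range(n)]
        for src, c, payload in sends:
            chunks[(src + 1) % n][c] += payload

    # Phase 2: all-gather. Each worker forwards the chunk it has just completed,
    # so after another n-1 steps every worker holds every reduced chunk.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy()) for i in range(n)]
        for src, c, payload in sends:
            chunks[(src + 1) % n][c] = payload

    return [np.concatenate(c) for c in chunks]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = [rng.standard_normal(12) for _ in range(4)]  # 4 simulated GPUs
    results = ring_allreduce(grads)
    assert all(np.allclose(r, sum(grads)) for r in results)
    print("every worker ends with the identical summed gradient")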
Prior Art: Two-step Approach (Tencent Jul/2018, Uber Oct/2018)
NOT SCALABLE for many learners
- Still too many iterations (latency adds up fast)
- Sub-optimal traffic pattern due to the additional reduce and broadcast
- Only the master GPUs are active
DDL: Mix-and-Match for Best Performance
• Communication libraries: MPI, NCCL, IB_Verb, SharedMem, OpenFabric, custom libraries
• Hardware: IBM, Nvidia, Mellanox, Intel, OpenCAPI
• Algorithms: Ring, Recursive, Tree
[NeurIPS18, SysML19]
IBM DDL: https://arxiv.org/pdf/1708.02188.pdf
More Challenges
• Flexibility
• Support for multiple DL frameworks
• Developer transparency
• Auto-scaling & elastic training
• Fault tolerance
• Service quality
• Scalability, performance, and accuracy
Training challenges and reactions to Elastic Training
"Distributed training is great, but we only run training on a single GPU."
"Why? Don't you want the speed-up?"
"300+ GPUs and 300+ students. Each researcher is entitled to use 1 GPU."
"You are in meetings right now. Are you using the GPUs allocated to you?"
"No …"
"If you ask for 16 GPUs, you will never get them in a busy cluster."
"A classic large-job starvation problem!"
"If you run large jobs and use more than your share, your jobs will be killed."
"What if you start with 1 GPU, your job can grow, and if there are other high-priority jobs, your job gracefully shrinks back to your own quota?"
"F*&?% brilliant idea!"
# cluster specification
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

parameter_servers = ["pc-01:2222"]
workers = ["pc-02:2222", "pc-03:2222", "pc-04:2222"]
cluster = tf.train.ClusterSpec({"ps": parameter_servers, "worker": workers})

tf.app.flags.DEFINE_string("job_name", "", "Either 'ps' or 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job")
FLAGS = tf.app.flags.FLAGS

server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

if FLAGS.job_name == "ps":
    server.join()
elif FLAGS.job_name == "worker":
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % FLAGS.task_index, cluster=cluster)):
        with tf.name_scope('input'):
            x = tf.placeholder(tf.float32, shape=[None, 784], name="x-input")
            y_ = tf.placeholder(tf.float32, shape=[None, 10], name="y-input")
        with tf.name_scope("weights"):
            W1 = tf.Variable(tf.random_normal([784, 100]))
            W2 = tf.Variable(tf.random_normal([100, 10]))
        with tf.name_scope("softmax"):
            y = tf.nn.softmax(z3)
        # ... lines of code omitted ...
        with tf.name_scope('train'):
            # the optimizer is an "operation" which we can execute in a session
            grad_op = tf.train.GradientDescentOptimizer(learning_rate)
            train_op = grad_op.minimize(cross_entropy, global_step=global_step)

    sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                             global_step=global_step,
                             init_op=init_op)
    with sv.prepare_or_wait_for_session(server.target) as sess:
        if FLAGS.task_index == 0:
            # the chief manages the model, the logs, and the other workers
            writer = tf.train.SummaryWriter(logs_path, graph=tf.get_default_graph())
        for epoch in range(training_epochs):
            for i in range(batch_count):
                batch_x, batch_y = mnist.train.next_batch(batch_size)
                _, cost, summary, step = sess.run(
                    [train_op, cross_entropy, summary_op, global_step],
                    feed_dict={x: batch_x, y_: batch_y})
                writer.add_summary(summary, step)
Distributed TensorFlow: the snippet above mixes static cluster configuration; data ingest, PS, and worker logic; and training runtime management with the model/graph definition (EDT only needs the model/graph definition).
MPI – A reference point for parallel applications
IBM Spectrum MPI V10.1, optimized PAMI point-to-point performance – latency
PingPong latency (IMB benchmark, Firestone/EDR), in microseconds:

Message size (bytes)   SPECTRUM MPI   MVAPICH   ompi   MPICH
0                      1.09           1.66      1.40   1.44
1                      1.10           1.81      1.38   1.45
2                      1.10           1.81      1.37   1.45
4                      1.10           1.81      1.37   1.44
8                      1.13           1.80      1.37   1.45
16                     1.17           1.85      1.48   1.46
32                     1.21           1.84      1.50   1.64
64                     1.29           1.86      1.51   1.67
128                    1.35           1.71      1.74   1.99
256                    1.85           2.18      1.87   2.14
512                    2.00           2.33      2.28   2.28
1024                   2.17           2.57      2.53   2.93
2048                   2.50           3.00      3.04   3.62
4096                   3.16           3.98      4.23   4.18
8192                   4.15           5.34      5.21   5.55
16384                  5.20           12.38     9.90   8.47
MPI – A reference point for parallel applications
SPMD, peer-to-peer programming model: a common binary runs on each core or CPU and discovers its rank at run time (figure: ranks 0–14 laid out on a grid; a minimal mpi4py example follows below).
Advantages:
• Standard, portable
• Fast, low-latency communications
• Many features: point-to-point, message selectivity, collective operations, process groups, etc.
But challenges too (not cloud native):
• Not fault-tolerant
• Programmer needs to keep track of rank
• Exception handling is left to the developer
• Resource allocations are static
• Computations are not distributed optimally
• Challenging to debug
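For readers unfamiliar with the SPMD style, here is a minimal mpi4py sketch: the same script is launched on every core (e.g. mpirun -np 4 python spmd.py) and each process discovers its rank at run time. The rank bookkeeping and error handling it leaves to the programmer are exactly the challenges listed above.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # discovered at run time, not configured statically
size = comm.Get_size()

# Each rank contributes a local value; allreduce gives every rank the global sum.
local_value = rank + 1
global_sum = comm.allreduce(local_value, op=MPI.SUM)
print(f"rank {rank}/{size}: global sum = {global_sum}")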
Traditional HPC
• The app is linked with a communication library; pseudocode sketch:
  main() { initialize(host1, host2, host3, ...); printf(...); send(work1, ...); }

Elastic Fabric – converging HPC and cloud native
• Each app instance only handles messages: getMsg(...); calc(); send(work2, ...);
• MPI, TCP; Spark, MR, Symphony SOAM, etc.
• A high-performance fabric manages workload and state – it calls out to user code to enable elasticity, resilience, and mobility, and hides infrastructure complexity and deployment
(Figure: a client submits PriceFXOpt() and PriceFXFW() tasks to a session; a master dispatches them across elastic service instances.)
Elastic Deep Learning
• The best of both performance and flexibility
• Automatic scale-up and scale-down based on the resource plan
• Priority, real-time fairshare, FIFO
• Transparent for TensorFlow, PyTorch, and Caffe
• Convergence and hyperparameter awareness
Combines the best scheduling and the fastest communication with DL-specific high-performance optimization
Elastic Distributed Training Engine – components:
• Session Scheduler / Elastic Scaling – resource policy (scaling, preemption, migration)
• DL Driver – training planning and scheduling (training tasks, micro-batch pipeline, sync plan)
• Worker Wrapper – model transparency, data ingest
• DL Framework – TensorFlow, Caffe, PyTorch, Keras
• Sync Engine (DDL) – high-performance synchronization (sync, async, peer-to-peer, centralized; new RDMA library)

Elastic distribution challenges:
• Graceful preemption
• Auto scaling
• Dynamic priority
• Fault tolerance
• Speed-up and performance (DDL)
• Accuracy
• Synchronization algorithm
• Topology and GPU awareness
• Model transparency
Auto-scaling and preemption
• Resource policy drives the scale-up and scale-down
  • Priority
  • Sharing policy: fairshare, FIFO
  • GPU demand
  • Cluster-wide utilization
• The fabric handles the scaling
  • Automatically, without interruption
  • Supports both sync and async models
  • Keeps or adjusts the batch size and other hyperparameters during scaling (see the sketch below)
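The slides do not spell out EDT's exact policy, so the sketch below only illustrates the kind of bookkeeping involved when the GPU count changes mid-training: either keep the global batch size fixed and repartition it, or keep the per-GPU batch and rescale the learning rate (the linear scaling rule). Function and parameter names are hypothetical.

def rescale_on_resize(base_lr, per_gpu_batch, old_gpus, new_gpus, keep_global_batch=True):
    """Illustrative hyperparameter adjustment when workers are added or removed."""
    if keep_global_batch:
        # Keep the global batch constant: redistribute samples, leave the LR alone.
        new_per_gpu_batch = max(1, per_gpu_batch * old_gpus // new_gpus)
        return base_lr, new_per_gpu_batch
    # Keep the per-GPU batch: the global batch changes, so scale the LR with it.
    return base_lr * new_gpus / old_gpus, per_gpu_batch

# Example: preemption shrinks a job from 4 GPUs to 2 GPUs.
lr, per_gpu_batch = rescale_on_resize(base_lr=0.01, per_gpu_batch=64, old_gpus=4, new_gpus=2)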
Keras AutoScaling – Go Beyond One GPU with EDT

mlp = Sequential()
mlp.add(Dense(1000, input_shape=(784,)))
mlp.add(Activation('relu'))
mlp.add(Dense(250))
mlp.add(Activation('relu'))
mlp.add(Dense(10))
mlp.add(Activation('softmax'))

trainer = ElasticDL(model=mlp,
                    loss='categorical_crossentropy',
                    optimizer=optimizer_mlp,
                    batch_size=4, num_epoch=1)
trainer.fit(training_set, epochs=4)

mlp = Sequential()
mlp.add(Dense(1000, input_shape=(784,)))
mlp.add(Activation('relu'))
mlp.add(Dense(250))
mlp.add(Activation('relu'))
mlp.add(Dense(10))
mlp.add(Activation('softmax'))

mlp.compile(loss='categorical_crossentropy',
            optimizer=optimizer_mlp,
            callback_MAO)          # MAO (multi-worker sync) hidden as a callback
mlp.fit(training_set, epochs=epochs)

Going from 1 GPU to N GPUs:
• Same model definition
• Same optimizer and loss function
• MAO hidden as a callback
• Similar API for fit and compile
• Data source supports Spark dataframes
PyTorch – Transparent Scaling through EDT

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

model = Net()
optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum)

# EDT PyTorch
model = ElasticDL(model, optimizer, F.nll_loss, dataLoader)
# hide MAO and data ingest in the workers
model.train(200, 64)

# Native PyTorch
def train(model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
Transparent DL Insight
• No code modification
• Customizable live metrics
• Plugin interface for third-party monitoring (a hypothetical plugin sketch follows below)
• Elastic and interactive with notebooks and developer tools
• Scale up and down without interruption
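To make the monitoring plugin point concrete, here is a hypothetical sketch of what such a plugin could look like; the class name, hook signature, and endpoint are illustrative assumptions, not the WML-A plugin API.

import json
import time
import urllib.request

class MetricsForwarder:
    """Illustrative plugin: forward live training metrics to an external monitoring sink."""

    def __init__(self, endpoint="http://monitoring.example.com/dl-metrics"):  # assumed endpoint
        self.endpoint = endpoint

    def on_metrics(self, job_id, iteration, metrics):
        # Hypothetical hook, called with e.g. {"loss": 0.42, "accuracy": 0.88}.
        payload = json.dumps({"job": job_id, "iteration": iteration,
                              "timestamp": time.time(), **metrics}).encode()
        request = urllib.request.Request(self.endpoint, data=payload,
                                         headers={"Content-Type": "application/json"})
        urllib.request.urlopen(request, timeout=2)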
Autoscaling with Accuracy, Transparency
Maintain the same accuracy when scaling GPUs up and down
(Figure: accuracy curves for a run preempted from 4 GPUs to 2 GPUs and for a 1-GPU run.)
One line of code change – train anywhere, on multiple nodes and multiple GPUs
Interactive experience in notebooks
Elastic AI Solutions:
IBM Power with IBM Storage
For questions, contact:
Olga Yiparaki yiparaki@us.ibm.com
Chief engineer, IBM Storage Performance
Eric Fiala Eric.J.Fiala@ibm.com
Solution Architect, Spectrum Computing
Constantine Arnold Constantine.Arnold@ibm.com
Data Science and Storage Systems Research
Jay Vaddi jayvaddi@us.ibm.com
IBM Storage Performance
Brian Porter bporter1@us.ibm.com
Client Technical Specialist, IBM Systems
IBM Storage: Spectrum Scale NVMe all-flash appliance
• IB EDR fabric; up to 8 hosts were used in these tests; compute nodes can be increased elastically
• IBM Spectrum Scale NVMe all-flash appliance: only one storage AFA node was used throughout these tests; additional storage can be added independently of the compute nodes
• NVMe-based storage provides more than ample performance for these AI benchmarks, which saturate the GPUs
• A single AFA storage node uses 2U of rack space and provides ~63 TB of user capacity
• Max read from storage: over 35 GB/s, assuming enough network adapters
• Storage can be increased linearly to meet capacity and/or performance requirements
Power9 with IBM Watson Machine Learning Accelerator (WMLA)
• Up to 8 Power9 AC922 hosts in this environment (GTX model, water-cooled)
• 512 GB RAM per Power9 host
• 6 GPUs per Power9 host, up to 48 GPUs in this environment
• A single dual-ported IB adapter per Power9 host (2x IB links per host; 8x IB links on the storage side)
Elastic Distributed Training scaling efficiency
Benchmark configuration:
• Framework: TensorFlow (Elastic Distributed Training)
• Spark instance group: dliauto
• Model: InceptionV3
• Batch size: 64
• Dataset: flowers
• Hyperparameters: learning-rate policy exponential, base learning rate 0.01, decay steps 4000, learning-rate decay 0.9, staircase TRUE, solver type GradientDescent, maximum iterations 10000 (a sketch of this schedule follows below)
• With the Elastic Distributed Training capability included in Watson Machine Learning Accelerator, the system dynamically scales out to accommodate the demands of growing AI applications
• Rapid growth in demand is accommodated elastically, with ease of management
• The measured data below shows high scaling efficiency as the number of GPUs and hosts increases
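For reference, the learning-rate schedule listed above maps onto the TF 1.x API of that era roughly as follows; this is a sketch of the schedule only, not the benchmark's actual training script.

import tensorflow as tf  # TF 1.x style, matching the era of the benchmark

global_step = tf.train.get_or_create_global_step()
learning_rate = tf.train.exponential_decay(
    learning_rate=0.01,     # base learning rate
    global_step=global_step,
    decay_steps=4000,
    decay_rate=0.9,
    staircase=True)
optimizer = tf.train.GradientDescentOptimizer(learning_rate)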
Speedup by scaling out hosts & GPUs (iterations/min vs. a single host): 1 host = 1.0x, 2 hosts = 2.0x, 4 hosts = 3.8x, 8 hosts = 7.5x
Efficiency vs. the 6-GPU baseline (50K iterations): 6 GPUs = 100%, 12 GPUs = 100%, 24 GPUs = 96%, 48 GPUs = 94%
In all these cases, the NVMe-based storage remains unchanged and provides more than ample performance, since the AI workload saturates the GPUs, as evidenced by the high scaling efficiency.
IBM WML-A enables Service Level Agreements by absorbing multi-tenant growth
• POWER9 enables the rapid growth of AI demands: as the number of tenants increases, every new job or user adds negligible overhead, enabling predictable behavior and SLAs (Service Level Agreements)
• Coupled with EDT (Elastic Distributed Training), multi-tenant workloads are accommodated elastically, with ease of management
• The measured data below shows the same negligible overheads as the number of GPUs varies
(Chart: overhead of each additional tenant relative to the average tenant, plotted on a ±10% scale, for 3 GPUs × 16 tenants, 6 GPUs × 8 tenants, 12 GPUs × 4 tenants, 24 GPUs × 2 tenants, and 48 GPUs × 1 tenant.)
In all these cases, the NVMe-based storage remains unchanged and provides more than ample performance, enabling the GPUs to accommodate the multi-tenant AI workload without any slowdowns, as evidenced by the negligible overheads.
Multitenancy overheads are on par with the corresponding overheads when each server uses local storage instead of the external IBM storage used in these tests.
A Simple Scenario
Improve data scientists' productivity by 31% and IT resource utilization by 33% in a multi-user shared cluster of GPUs running IBM WML-Accelerator on POWER9 AC922 servers with Nvidia Tesla V100 GPUs connected via NVLink 2.0
• 1.31x reduction in the time to train multiple concurrent experiments vs. the tested x86 systems
  • 4 jobs of InceptionV3 trained for 15,000 iterations on the Flowers dataset, requiring 3 GPUs each, running on 2 AC922 nodes with 4 GPUs each
• 1.33x improvement in the utilization of the POWER9 DL infrastructure vs. the tested x86 systems
• 0 wait time – submitted jobs are executed even if the cluster is busy, thanks to IBM WML-Accelerator's elastic scaling
  • Supports multi-tenancy and elasticity with a fairshare scheduling policy
• Results are based on IBM internal measurements running 15,000-iteration training of the InceptionV3 model (mini-batch size = 32 per GPU) on the Flowers dataset.
• Power AC922: 40 cores (2 x 20c chips), POWER9 with NVLink 2.0, 3.8 GHz, 1 TB memory, 4x Tesla V100 GPUs; Red Hat Enterprise Linux 7.5 for Power Little Endian (POWER9) with CUDA 9.2 / cuDNN 7.2.1; WML-A v1.1.1.
• Competitive stack: 2x Xeon(R) Gold 6150, 36 cores (2 x 18c chips), 2.70 GHz, 512 GB memory, 4x Tesla V100 GPUs, Ubuntu 16.04.4 with CUDA 9.1 / cuDNN 7.1.2; NGC image nvcr.io/nvidia/tensorflow version 18.08-py2; Kubernetes v1.11.2.
(Chart: Multiuser Jobs 3-3-3-3 – 4 jobs of InceptionV3/Flowers training for 15,000 iterations; time taken: x86 34.4 minutes vs. AC922 26.22 minutes; idle resource shown in the legend.)
ML/DL Training & Execution – Watson ML Accelerator (end-to-end pipeline):
• Data sources: traditional business data, sensor data, data from collaboration partners, data from mobile apps & social media, legacy data, new data
• Data preparation: data ingestion, pre-processing, heavy I/O instrumentation, training and testing datasets
• Model training: AI deep learning frameworks (TensorFlow, Caffe, …), distributed & elastic training for deep learning, parallel hyperparameter search & optimization, network models, hyperparameters; monitor & advise, iterate
• Inference: trained-model life-cycle management, deploy in production using the trained model (REST API)
• Foundation: multi-tenant, shared-services architecture (Conductor) – resource groups, consumers, resource plans, instance groups, resiliency, workload management, notebooks, Anaconda, reporting, security
Watson ML Accelerator Technical References
• Classify images with IBM Watson Machine Learning Accelerator – https://developer.ibm.com/tutorials/use-computer-vision-with-dli-watson-machine-learning-accelerator/
• Train Keras and MLlib models with IBM Watson Machine Learning Accelerator – https://developer.ibm.com/tutorials/training-keras-and-mllib-model-with-watson-machine-learning-accelerator/
• Get dynamic, elastic, and fine-grained resource allocations and controls for accelerating multiple model trainings simultaneously – https://developer.ibm.com/tutorials/dynamic-resilient-and-elastic-deep-learning-with-watson-machine-learning-accelerator/
• Train XGBoost models with IBM Watson Machine Learning Accelerator – https://developer.ibm.com/tutorials/train-xgboost-models-within-watson-ml-accelerator/
• Accelerate Generalized Linear Model training with IBM Watson Machine Learning Accelerator and Snap ML – https://developer.ibm.com/tutorials/accelerate-machine-model-training-with-watson-ml-accelerator-snap-ml/
• Accelerate tree-based model training with Watson Machine Learning Accelerator and Snap ML – https://developer.ibm.com/tutorials/accelerate-random-forest-model-training-with-watson-ml-accelerator/

Machine Learning and Deep Learning with IBM Watson Machine Learning Accelerator series
• Offers walk-throughs and hands-on experience with Watson ML Accelerator's key differentiators
• English series: https://developer.ibm.com/series/learn-watson-machine-learning-accelerator/
• Chinese series: https://developer.ibm.com/cn/blog/2019/learn-ibm-powerai-enterprise/
Thank you

More Related Content

What's hot

Image Object Detection Pipeline
Image Object Detection PipelineImage Object Detection Pipeline
Image Object Detection Pipeline
Abhinav Dadhich
 
Surge: Rise of Scalable Machine Learning at Yahoo!
Surge: Rise of Scalable Machine Learning at Yahoo!Surge: Rise of Scalable Machine Learning at Yahoo!
Surge: Rise of Scalable Machine Learning at Yahoo!
DataWorks Summit
 
Introduction to TensorFlow
Introduction to TensorFlowIntroduction to TensorFlow
Introduction to TensorFlow
Matthias Feys
 
Beyond data and model parallelism for deep neural networks
Beyond data and model parallelism for deep neural networksBeyond data and model parallelism for deep neural networks
Beyond data and model parallelism for deep neural networks
JunKudo2
 
[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big Data
[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big Data[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big Data
[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big Data
PingCAP
 
Spectral clustering - Houston ML Meetup
Spectral clustering - Houston ML MeetupSpectral clustering - Houston ML Meetup
Spectral clustering - Houston ML Meetup
Yan Xu
 
Recent progress on distributing deep learning
Recent progress on distributing deep learningRecent progress on distributing deep learning
Recent progress on distributing deep learning
Viet-Trung TRAN
 
The next generation of the Montage image mosaic engine
The next generation of the Montage image mosaic engineThe next generation of the Montage image mosaic engine
The next generation of the Montage image mosaic engine
G. Bruce Berriman
 
DARTS: Differentiable Architecture Search at 社内論文読み会
DARTS: Differentiable Architecture Search at 社内論文読み会DARTS: Differentiable Architecture Search at 社内論文読み会
DARTS: Differentiable Architecture Search at 社内論文読み会
Masashi Shibata
 
Deep learning with Tensorflow in R
Deep learning with Tensorflow in RDeep learning with Tensorflow in R
Deep learning with Tensorflow in R
mikaelhuss
 
Pycon 2016-open-space
Pycon 2016-open-spacePycon 2016-open-space
Pycon 2016-open-space
Chetan Khatri
 
Skytree big data london meetup - may 2013
Skytree   big data london meetup - may 2013Skytree   big data london meetup - may 2013
Skytree big data london meetup - may 2013
bigdatalondon
 
LocationTech Projects
LocationTech ProjectsLocationTech Projects
LocationTech Projects
Jody Garnett
 
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
MLconf
 
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
MLconf
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Jen Aman
 
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
PingCAP
 
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
MLconf
 
Distributed Processing Frameworks
Distributed Processing FrameworksDistributed Processing Frameworks
Distributed Processing Frameworks
Antonios Katsarakis
 

What's hot (20)

Image Object Detection Pipeline
Image Object Detection PipelineImage Object Detection Pipeline
Image Object Detection Pipeline
 
Surge: Rise of Scalable Machine Learning at Yahoo!
Surge: Rise of Scalable Machine Learning at Yahoo!Surge: Rise of Scalable Machine Learning at Yahoo!
Surge: Rise of Scalable Machine Learning at Yahoo!
 
Introduction to TensorFlow
Introduction to TensorFlowIntroduction to TensorFlow
Introduction to TensorFlow
 
Beyond data and model parallelism for deep neural networks
Beyond data and model parallelism for deep neural networksBeyond data and model parallelism for deep neural networks
Beyond data and model parallelism for deep neural networks
 
[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big Data
[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big Data[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big Data
[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big Data
 
Spectral clustering - Houston ML Meetup
Spectral clustering - Houston ML MeetupSpectral clustering - Houston ML Meetup
Spectral clustering - Houston ML Meetup
 
Recent progress on distributing deep learning
Recent progress on distributing deep learningRecent progress on distributing deep learning
Recent progress on distributing deep learning
 
The next generation of the Montage image mosaic engine
The next generation of the Montage image mosaic engineThe next generation of the Montage image mosaic engine
The next generation of the Montage image mosaic engine
 
DARTS: Differentiable Architecture Search at 社内論文読み会
DARTS: Differentiable Architecture Search at 社内論文読み会DARTS: Differentiable Architecture Search at 社内論文読み会
DARTS: Differentiable Architecture Search at 社内論文読み会
 
cnsm2011_slide
cnsm2011_slidecnsm2011_slide
cnsm2011_slide
 
Deep learning with Tensorflow in R
Deep learning with Tensorflow in RDeep learning with Tensorflow in R
Deep learning with Tensorflow in R
 
Pycon 2016-open-space
Pycon 2016-open-spacePycon 2016-open-space
Pycon 2016-open-space
 
Skytree big data london meetup - may 2013
Skytree   big data london meetup - may 2013Skytree   big data london meetup - may 2013
Skytree big data london meetup - may 2013
 
LocationTech Projects
LocationTech ProjectsLocationTech Projects
LocationTech Projects
 
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
 
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust...
 
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
 
Distributed Processing Frameworks
Distributed Processing FrameworksDistributed Processing Frameworks
Distributed Processing Frameworks
 

Similar to Toronto meetup 20190917

Spark and Deep Learning frameworks with distributed workloads
Spark and Deep Learning frameworks with distributed workloadsSpark and Deep Learning frameworks with distributed workloads
Spark and Deep Learning frameworks with distributed workloads
S N
 
Open power ddl and lms
Open power ddl and lmsOpen power ddl and lms
Open power ddl and lms
Ganesan Narayanasamy
 
Anirudh Koul. 30 Golden Rules of Deep Learning Performance
Anirudh Koul. 30 Golden Rules of Deep Learning PerformanceAnirudh Koul. 30 Golden Rules of Deep Learning Performance
Anirudh Koul. 30 Golden Rules of Deep Learning Performance
Lviv Startup Club
 
Data Parallel Deep Learning
Data Parallel Deep LearningData Parallel Deep Learning
Data Parallel Deep Learning
inside-BigData.com
 
Large-Scale Training with GPUs at Facebook
Large-Scale Training with GPUs at FacebookLarge-Scale Training with GPUs at Facebook
Large-Scale Training with GPUs at Facebook
Faisal Siddiqi
 
Gopher in performance_tales_ms_go_cracow
Gopher in performance_tales_ms_go_cracowGopher in performance_tales_ms_go_cracow
Gopher in performance_tales_ms_go_cracow
MateuszSzczyrzyca
 
Parallel Application Performance Prediction of Using Analysis Based Modeling
Parallel Application Performance Prediction of Using Analysis Based ModelingParallel Application Performance Prediction of Using Analysis Based Modeling
Parallel Application Performance Prediction of Using Analysis Based Modeling
Jason Liu
 
RAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceRAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data Science
Data Works MD
 
Distributed TensorFlow on Hops (Papis London, April 2018)
Distributed TensorFlow on Hops (Papis London, April 2018)Distributed TensorFlow on Hops (Papis London, April 2018)
Distributed TensorFlow on Hops (Papis London, April 2018)
Jim Dowling
 
Parallel Computing - Lec 6
Parallel Computing - Lec 6Parallel Computing - Lec 6
Parallel Computing - Lec 6
Shah Zaib
 
Introduction to GPUs for Machine Learning
Introduction to GPUs for Machine LearningIntroduction to GPUs for Machine Learning
Introduction to GPUs for Machine Learning
Sri Ambati
 
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
DataWorks Summit/Hadoop Summit
 
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Indrajit Poddar
 
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the CloudData Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Alluxio, Inc.
 
Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...
Omid Vahdaty
 
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSAccelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Databricks
 
GPU and Deep learning best practices
GPU and Deep learning best practicesGPU and Deep learning best practices
GPU and Deep learning best practices
Lior Sidi
 
SigOpt at GTC - Reducing operational barriers to optimization
SigOpt at GTC - Reducing operational barriers to optimizationSigOpt at GTC - Reducing operational barriers to optimization
SigOpt at GTC - Reducing operational barriers to optimization
SigOpt
 
Danny Bickson - Python based predictive analytics with GraphLab Create
Danny Bickson - Python based predictive analytics with GraphLab Create Danny Bickson - Python based predictive analytics with GraphLab Create
Danny Bickson - Python based predictive analytics with GraphLab Create
PyData
 
C3 w3
C3 w3C3 w3

Similar to Toronto meetup 20190917 (20)

Spark and Deep Learning frameworks with distributed workloads
Spark and Deep Learning frameworks with distributed workloadsSpark and Deep Learning frameworks with distributed workloads
Spark and Deep Learning frameworks with distributed workloads
 
Open power ddl and lms
Open power ddl and lmsOpen power ddl and lms
Open power ddl and lms
 
Anirudh Koul. 30 Golden Rules of Deep Learning Performance
Anirudh Koul. 30 Golden Rules of Deep Learning PerformanceAnirudh Koul. 30 Golden Rules of Deep Learning Performance
Anirudh Koul. 30 Golden Rules of Deep Learning Performance
 
Data Parallel Deep Learning
Data Parallel Deep LearningData Parallel Deep Learning
Data Parallel Deep Learning
 
Large-Scale Training with GPUs at Facebook
Large-Scale Training with GPUs at FacebookLarge-Scale Training with GPUs at Facebook
Large-Scale Training with GPUs at Facebook
 
Gopher in performance_tales_ms_go_cracow
Gopher in performance_tales_ms_go_cracowGopher in performance_tales_ms_go_cracow
Gopher in performance_tales_ms_go_cracow
 
Parallel Application Performance Prediction of Using Analysis Based Modeling
Parallel Application Performance Prediction of Using Analysis Based ModelingParallel Application Performance Prediction of Using Analysis Based Modeling
Parallel Application Performance Prediction of Using Analysis Based Modeling
 
RAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceRAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data Science
 
Distributed TensorFlow on Hops (Papis London, April 2018)
Distributed TensorFlow on Hops (Papis London, April 2018)Distributed TensorFlow on Hops (Papis London, April 2018)
Distributed TensorFlow on Hops (Papis London, April 2018)
 
Parallel Computing - Lec 6
Parallel Computing - Lec 6Parallel Computing - Lec 6
Parallel Computing - Lec 6
 
Introduction to GPUs for Machine Learning
Introduction to GPUs for Machine LearningIntroduction to GPUs for Machine Learning
Introduction to GPUs for Machine Learning
 
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
 
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
 
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the CloudData Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
 
Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...
 
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSAccelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
 
GPU and Deep learning best practices
GPU and Deep learning best practicesGPU and Deep learning best practices
GPU and Deep learning best practices
 
SigOpt at GTC - Reducing operational barriers to optimization
SigOpt at GTC - Reducing operational barriers to optimizationSigOpt at GTC - Reducing operational barriers to optimization
SigOpt at GTC - Reducing operational barriers to optimization
 
Danny Bickson - Python based predictive analytics with GraphLab Create
Danny Bickson - Python based predictive analytics with GraphLab Create Danny Bickson - Python based predictive analytics with GraphLab Create
Danny Bickson - Python based predictive analytics with GraphLab Create
 
C3 w3
C3 w3C3 w3
C3 w3
 

More from Bill Liu

Walk Through a Real World ML Production Project
Walk Through a Real World ML Production ProjectWalk Through a Real World ML Production Project
Walk Through a Real World ML Production Project
Bill Liu
 
Redefining MLOps with Model Deployment, Management and Observability in Produ...
Redefining MLOps with Model Deployment, Management and Observability in Produ...Redefining MLOps with Model Deployment, Management and Observability in Produ...
Redefining MLOps with Model Deployment, Management and Observability in Produ...
Bill Liu
 
Productizing Machine Learning at the Edge
Productizing Machine Learning at the EdgeProductizing Machine Learning at the Edge
Productizing Machine Learning at the Edge
Bill Liu
 
Transformers in Vision: From Zero to Hero
Transformers in Vision: From Zero to HeroTransformers in Vision: From Zero to Hero
Transformers in Vision: From Zero to Hero
Bill Liu
 
Deep AutoViML For Tensorflow Models and MLOps Workflows
Deep AutoViML For Tensorflow Models and MLOps WorkflowsDeep AutoViML For Tensorflow Models and MLOps Workflows
Deep AutoViML For Tensorflow Models and MLOps Workflows
Bill Liu
 
Metaflow: The ML Infrastructure at Netflix
Metaflow: The ML Infrastructure at NetflixMetaflow: The ML Infrastructure at Netflix
Metaflow: The ML Infrastructure at Netflix
Bill Liu
 
Practical Crowdsourcing for ML at Scale
Practical Crowdsourcing for ML at ScalePractical Crowdsourcing for ML at Scale
Practical Crowdsourcing for ML at Scale
Bill Liu
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
Bill Liu
 
Deep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its ApplicationsDeep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its Applications
Bill Liu
 
Big Data and AI in Fighting Against COVID-19
Big Data and AI in Fighting Against COVID-19Big Data and AI in Fighting Against COVID-19
Big Data and AI in Fighting Against COVID-19
Bill Liu
 
Highly-scalable Reinforcement Learning RLlib for Real-world Applications
Highly-scalable Reinforcement Learning RLlib for Real-world ApplicationsHighly-scalable Reinforcement Learning RLlib for Real-world Applications
Highly-scalable Reinforcement Learning RLlib for Real-world Applications
Bill Liu
 
Build computer vision models to perform object detection and classification w...
Build computer vision models to perform object detection and classification w...Build computer vision models to perform object detection and classification w...
Build computer vision models to perform object detection and classification w...
Bill Liu
 
Causal Inference in Data Science and Machine Learning
Causal Inference in Data Science and Machine LearningCausal Inference in Data Science and Machine Learning
Causal Inference in Data Science and Machine Learning
Bill Liu
 
Weekly #106: Deep Learning on Mobile
Weekly #106: Deep Learning on MobileWeekly #106: Deep Learning on Mobile
Weekly #106: Deep Learning on Mobile
Bill Liu
 
Weekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
Weekly #105: AutoViz and Auto_ViML Visualization and Machine LearningWeekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
Weekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
Bill Liu
 
AISF19 - On Blending Machine Learning with Microeconomics
AISF19 - On Blending Machine Learning with MicroeconomicsAISF19 - On Blending Machine Learning with Microeconomics
AISF19 - On Blending Machine Learning with Microeconomics
Bill Liu
 
AISF19 - Travel in the AI-First World
AISF19 - Travel in the AI-First WorldAISF19 - Travel in the AI-First World
AISF19 - Travel in the AI-First World
Bill Liu
 
AISF19 - Unleash Computer Vision at the Edge
AISF19 - Unleash Computer Vision at the EdgeAISF19 - Unleash Computer Vision at the Edge
AISF19 - Unleash Computer Vision at the Edge
Bill Liu
 
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
Bill Liu
 
Feature Engineering for NLP
Feature Engineering for NLPFeature Engineering for NLP
Feature Engineering for NLP
Bill Liu
 

More from Bill Liu (20)

Walk Through a Real World ML Production Project
Walk Through a Real World ML Production ProjectWalk Through a Real World ML Production Project
Walk Through a Real World ML Production Project
 
Redefining MLOps with Model Deployment, Management and Observability in Produ...
Redefining MLOps with Model Deployment, Management and Observability in Produ...Redefining MLOps with Model Deployment, Management and Observability in Produ...
Redefining MLOps with Model Deployment, Management and Observability in Produ...
 
Productizing Machine Learning at the Edge
Productizing Machine Learning at the EdgeProductizing Machine Learning at the Edge
Productizing Machine Learning at the Edge
 
Transformers in Vision: From Zero to Hero
Transformers in Vision: From Zero to HeroTransformers in Vision: From Zero to Hero
Transformers in Vision: From Zero to Hero
 
Deep AutoViML For Tensorflow Models and MLOps Workflows
Deep AutoViML For Tensorflow Models and MLOps WorkflowsDeep AutoViML For Tensorflow Models and MLOps Workflows
Deep AutoViML For Tensorflow Models and MLOps Workflows
 
Metaflow: The ML Infrastructure at Netflix
Metaflow: The ML Infrastructure at NetflixMetaflow: The ML Infrastructure at Netflix
Metaflow: The ML Infrastructure at Netflix
 
Practical Crowdsourcing for ML at Scale
Practical Crowdsourcing for ML at ScalePractical Crowdsourcing for ML at Scale
Practical Crowdsourcing for ML at Scale
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
 
Deep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its ApplicationsDeep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its Applications
 
Big Data and AI in Fighting Against COVID-19
Big Data and AI in Fighting Against COVID-19Big Data and AI in Fighting Against COVID-19
Big Data and AI in Fighting Against COVID-19
 
Highly-scalable Reinforcement Learning RLlib for Real-world Applications
Highly-scalable Reinforcement Learning RLlib for Real-world ApplicationsHighly-scalable Reinforcement Learning RLlib for Real-world Applications
Highly-scalable Reinforcement Learning RLlib for Real-world Applications
 
Build computer vision models to perform object detection and classification w...
Build computer vision models to perform object detection and classification w...Build computer vision models to perform object detection and classification w...
Build computer vision models to perform object detection and classification w...
 
Causal Inference in Data Science and Machine Learning
Causal Inference in Data Science and Machine LearningCausal Inference in Data Science and Machine Learning
Causal Inference in Data Science and Machine Learning
 
Weekly #106: Deep Learning on Mobile
Weekly #106: Deep Learning on MobileWeekly #106: Deep Learning on Mobile
Weekly #106: Deep Learning on Mobile
 
Weekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
Weekly #105: AutoViz and Auto_ViML Visualization and Machine LearningWeekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
Weekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
 
AISF19 - On Blending Machine Learning with Microeconomics
AISF19 - On Blending Machine Learning with MicroeconomicsAISF19 - On Blending Machine Learning with Microeconomics
AISF19 - On Blending Machine Learning with Microeconomics
 
AISF19 - Travel in the AI-First World
AISF19 - Travel in the AI-First WorldAISF19 - Travel in the AI-First World
AISF19 - Travel in the AI-First World
 
AISF19 - Unleash Computer Vision at the Edge
AISF19 - Unleash Computer Vision at the EdgeAISF19 - Unleash Computer Vision at the Edge
AISF19 - Unleash Computer Vision at the Edge
 
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
 
Feature Engineering for NLP
Feature Engineering for NLPFeature Engineering for NLP
Feature Engineering for NLP
 

Recently uploaded

IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 

Recently uploaded (20)

IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Neuro-symbolic is not enough, we need neuro-*semantic*

Toronto meetup 20190917

  • 14. Prior Arts: Two-step Approach (Tencent Jul/2018, Uber Oct/2018) §15 NOT SCALABLE for Many Learners - Still too many iterations (latency adds up fast) - Sub-optimal traffic pattern due to additional Reduce and Broadcast - Only master GPUs are active
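The two-step (hierarchical) pattern above is straightforward to sketch with MPI split communicators: reduce gradients to a per-node master, allreduce across masters only, then broadcast back inside each node. The snippet below is a minimal illustration using mpi4py and NumPy; the gradient array and the one-rank-per-GPU layout are assumptions for the example, and this is not the Tencent or Uber (Horovod) implementation.

    # Minimal two-step (hierarchical) allreduce sketch with mpi4py; illustrative only,
    # not the Tencent/Uber implementation. Assumes one MPI rank per GPU.
    import numpy as np
    from mpi4py import MPI

    world = MPI.COMM_WORLD

    # Group ranks that share a node; local rank 0 plays the "master GPU" role.
    node_comm = world.Split_type(MPI.COMM_TYPE_SHARED, key=world.Get_rank())
    is_master = (node_comm.Get_rank() == 0)
    master_comm = world.Split(color=0 if is_master else MPI.UNDEFINED,
                              key=world.Get_rank())

    grad = np.random.rand(1024).astype(np.float32)   # stand-in for a local gradient
    agg = np.zeros_like(grad)

    # Step 1: intra-node reduce onto the master rank (fast local path).
    node_comm.Reduce(grad, agg, op=MPI.SUM, root=0)

    # Step 2: inter-node allreduce among masters only (fewer participants on the slower network).
    if is_master:
        total = np.zeros_like(agg)
        master_comm.Allreduce(agg, total, op=MPI.SUM)
        agg = total

    # Step 3: intra-node broadcast of the aggregated gradient back to every rank.
    node_comm.Bcast(agg, root=0)

As the slide notes, during steps 1 and 3 only the master GPUs participate in cross-node traffic, and the extra reduce/broadcast rounds still add latency per iteration.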
  • 15. DDL: Mix-Match for Best Performance - Communication libraries: MPI, NCCL, IB_Verb, SharedMem, OpenFabric, Custom-lib (IBM, Nvidia, Mellanox, Intel, OpenCAPI) - Topologies/algorithms: Ring, Recursive, Tree [NeurIPS18, SYSML19] - IBM DDL: https://arxiv.org/pdf/1708.02188.pdf
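For intuition on why mixing algorithms pays off: latency-bound small messages favour tree/recursive schemes (O(log N) rounds), while bandwidth-bound large messages favour ring reduce-scatter/all-gather. The toy selector below illustrates that trade-off; the cost model, thresholds and parameter values are assumptions for the example, not DDL's actual selection logic.

    # Toy allreduce-algorithm selector: illustrative cost model only, not IBM DDL's logic.
    import math

    def pick_allreduce(message_bytes, n_learners, latency_s=5e-6, bandwidth_Bps=12.5e9):
        """Compare two simple cost models: ring vs. recursive/tree allreduce."""
        # Ring: 2*(N-1) latency-bound steps, but near bandwidth-optimal data volume.
        ring_cost = 2 * (n_learners - 1) * latency_s + 2 * message_bytes / bandwidth_Bps
        # Recursive/tree: log2(N) rounds, each moving the full message.
        tree_cost = (math.ceil(math.log2(n_learners)) *
                     (latency_s + message_bytes / bandwidth_Bps))
        return "ring" if ring_cost < tree_cost else "tree/recursive"

    print(pick_allreduce(256, 128))           # tiny gradient bucket -> tree/recursive
    print(pick_allreduce(100 * 2**20, 128))   # 100 MB bucket -> ring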
  • 16. More Challenges • Flexibility • Multiple DL frameworks support • Developer transparency • Auto Scaling & Elastic training • Fault tolerance • Service Quality • Scalability & Performance & Accuracy
  • 17. 18 Training challenges and reactions to Elastic Training Distributed training is great, but why do we only run training on a single GPU? You do not want the speed-up? 300+ GPUs and 300+ students; each researcher is entitled to use 1 GPU. You are in meetings right now. Are you using the GPUs allocated to you? No … if you ask for 16 GPUs, you will never get them in a busy cluster. A classic large-job starvation problem! If you run large jobs and use more than your share, your jobs will be killed. What if you start with 1 GPU; your job can grow; if there are other high-priority jobs, your job gracefully shrinks back to your own quota. F*&?% brilliant idea!
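The grow-then-shrink-to-quota behaviour described above can be expressed as a simple scheduling rule. The sketch below is a hypothetical illustration of that policy; the class and function names are invented for the example and this is not the actual Spectrum Conductor scheduler.

    # Hypothetical grow/shrink policy: a job may borrow idle GPUs beyond its quota,
    # but is trimmed back to the quota when higher-priority demand appears.
    from dataclasses import dataclass

    @dataclass
    class ElasticJob:
        quota: int          # GPUs the owner is entitled to
        allocated: int = 0  # GPUs currently held

    def rebalance(job: ElasticJob, idle_gpus: int, higher_priority_demand: int) -> int:
        """Return the new allocation for `job` (illustrative policy only)."""
        if higher_priority_demand > 0:
            # Gracefully shrink, but never below the owner's quota (or 1 GPU).
            return max(min(job.allocated, job.quota), 1)
        # Otherwise grow opportunistically into idle capacity.
        return job.allocated + idle_gpus

    job = ElasticJob(quota=1, allocated=1)
    job.allocated = rebalance(job, idle_gpus=15, higher_priority_demand=0)   # grows to 16
    job.allocated = rebalance(job, idle_gpus=0, higher_priority_demand=8)    # shrinks to 1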
  • 18. Distributed Tensorflow (static cluster configuration; mixed data ingest, PS and worker logic; model/graph definition — EDT only needs this part; training runtime management)
        # cluster specification (static)
        parameter_servers = ["pc-01:2222"]
        workers = ["pc-02:2222", "pc-03:2222", "pc-04:2222"]
        cluster = tf.train.ClusterSpec({"ps": parameter_servers, "worker": workers})
        tf.app.flags.DEFINE_string("job_name", "", "Either 'ps' or 'worker'")
        tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job")
        FLAGS = tf.app.flags.FLAGS
        server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)
        mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
        if FLAGS.job_name == "ps":
            server.join()
        elif FLAGS.job_name == "worker":
            with tf.device(tf.train.replica_device_setter(
                    worker_device="/job:worker/task:%d" % FLAGS.task_index, cluster=cluster)):
                with tf.name_scope('input'):
                    x = tf.placeholder(tf.float32, shape=[None, 784], name="x-input")
                    y_ = tf.placeholder(tf.float32, shape=[None, 10], name="y-input")
                with tf.name_scope("weights"):
                    W1 = tf.Variable(tf.random_normal([784, 100]))
                    W2 = tf.Variable(tf.random_normal([100, 10]))
                with tf.name_scope("softmax"):
                    y = tf.nn.softmax(z3)
                # …... lines of code elided
                with tf.name_scope('train'):
                    # optimizer is an "operation" which we can execute in a session
                    grad_op = tf.train.GradientDescentOptimizer(learning_rate)
                    train_op = grad_op.minimize(cross_entropy, global_step=global_step)
            sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                                     global_step=global_step,
                                     init_op=init_op)
            with sv.prepare_or_wait_for_session(server.target) as sess:
                if FLAGS.task_index == 0:
                    # the chief manages the model, logging and other bookkeeping
                    writer = tf.train.SummaryWriter(logs_path, graph=tf.get_default_graph())
                for epoch in range(training_epochs):
                    for i in range(batch_count):
                        batch_x, batch_y = mnist.train.next_batch(batch_size)
                        _, cost, summary, step = sess.run(
                            [train_op, cross_entropy, summary_op, global_step],
                            feed_dict={x: batch_x, y_: batch_y})
                        writer.add_summary(summary, step)
  • 19. MPI – A reference point for parallel applications: IBM Spectrum MPI V10.1, optimized PAMI, point-to-point performance (latency). PingPong latency (IMB, Firestone/EDR), in microseconds by message size:
        Message size (bytes) | SPECTRUM MPI | MVAPICH | ompi | MPICH
        0                    | 1.09         | 1.66    | 1.4  | 1.44
        1                    | 1.1          | 1.81    | 1.38 | 1.45
        2                    | 1.1          | 1.81    | 1.37 | 1.45
        4                    | 1.1          | 1.81    | 1.37 | 1.44
        8                    | 1.13         | 1.8     | 1.37 | 1.45
        16                   | 1.17         | 1.85    | 1.48 | 1.46
        32                   | 1.21         | 1.84    | 1.5  | 1.64
        64                   | 1.29         | 1.86    | 1.51 | 1.67
        128                  | 1.35         | 1.71    | 1.74 | 1.99
        256                  | 1.85         | 2.18    | 1.87 | 2.14
        512                  | 2            | 2.33    | 2.28 | 2.28
        1024                 | 2.17         | 2.57    | 2.53 | 2.93
        2048                 | 2.5          | 3       | 3.04 | 3.62
        4096                 | 3.16         | 3.98    | 4.23 | 4.18
        8192                 | 4.15         | 5.34    | 5.21 | 5.55
        16384                | 5.2          | 12.38   | 9.9  | 8.47
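The numbers above come from the IMB PingPong benchmark. For intuition, a bare-bones two-rank ping-pong latency measurement can be sketched with mpi4py as below; the message sizes, repetition count and timing loop are assumptions for the example, and this is not the IMB code.

    # Minimal two-rank ping-pong latency sketch (illustrative, not Intel MPI Benchmarks).
    # Run with: mpirun -np 2 python pingpong.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    reps = 1000

    for size in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]:
        buf = np.zeros(size, dtype=np.uint8)
        comm.Barrier()
        start = MPI.Wtime()
        for _ in range(reps):
            if rank == 0:
                comm.Send(buf, dest=1, tag=0)
                comm.Recv(buf, source=1, tag=0)
            else:
                comm.Recv(buf, source=0, tag=0)
                comm.Send(buf, dest=0, tag=0)
        elapsed = MPI.Wtime() - start
        if rank == 0:
            # Half the average round-trip time is the usual one-way latency estimate.
            print(f"{size:6d} bytes: {elapsed / reps / 2 * 1e6:.2f} us")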
  • 20. MPI – A reference point for parallel applications §SPMD programming model §Peer-to-peer model: a common binary runs on each core or CPU and discovers its rank at run-time (ranks 0 … N-1) §Advantages • Standard, portable • Fast, low-latency communications • Many features • point-to-point, message selectivity, collective operations, process groups, etc. §But challenges too! (Not cloud native) • Not fault-tolerant • Programmer needs to keep track of rank • Exception handling left to the developer • Resource allocations are static • Computations not distributed optimally • Challenging to debug
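As a concrete illustration of the SPMD / rank-discovery model described above, a minimal mpi4py program (a generic example, not tied to Spectrum MPI) looks like this:

    # SPMD "hello world": the same program runs everywhere and discovers its rank.
    # Run with: mpirun -np 4 python hello_mpi.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()          # this process's identity within the job
    size = comm.Get_size()          # total number of processes
    host = MPI.Get_processor_name()

    print(f"rank {rank} of {size} running on {host}")

    # The programmer, not the runtime, decides what each rank does:
    if rank == 0:
        data = {"msg": "work item"}
        for dest in range(1, size):
            comm.send(data, dest=dest, tag=0)
    else:
        data = comm.recv(source=0, tag=0)

The challenges listed on the slide follow directly from this model: if one rank fails the whole job fails, and the rank-to-work mapping is hard-coded by the developer rather than managed by the runtime.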
  • 21. Elastic Fabric – Converge HPC and Cloud native. Traditional HPC app linked with a communication lib (MPI, TCP): the application owns the control flow, e.g. Main { Initialize(host1, host2, host3, …); Printf(); Send(work1, …); } with peers running GetMes(…); Cal(); Send(work2, …). Cloud-native fabrics (Spark, MR, Symphony SOAM, etc.): a high-performance fabric manages workload and state and calls out to user code, to enable elasticity, resilience and mobility and to hide infrastructure complexity and deployment. [Diagram: a client submits a session of tasks such as PriceFXOpt() and PriceFXFW(); the fabric's master dispatches them across service instances.]
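The key inversion of control is that the fabric, not the application, owns the main loop and calls out to user code. The sketch below is a hypothetical illustration of that pattern; ToyFabric and its methods are invented for the example and are not the Symphony SOAM or Conductor API.

    # Callback-style worker: the fabric owns the loop and the placement;
    # user code only supplies the per-task computation. Illustrative only.
    from typing import Any, Callable, List, Optional

    class ToyFabric:
        """Stand-in for an elastic fabric that schedules tasks onto workers."""
        def __init__(self) -> None:
            self._handler: Optional[Callable[[Any], Any]] = None

        def register(self, handler: Callable[[Any], Any]) -> None:
            # User code is registered once; the fabric decides when and where it runs.
            self._handler = handler

        def run(self, tasks: List[Any]) -> List[Any]:
            # A real fabric would distribute tasks, recover failures and rescale;
            # calling the handler locally here just shows the inversion of control.
            return [self._handler(t) for t in tasks]

    def price_task(task: dict) -> float:
        # User-supplied compute kernel, e.g. one pricing step of a session.
        return task["notional"] * task["rate"]

    fabric = ToyFabric()
    fabric.register(price_task)
    print(fabric.run([{"notional": 1e6, "rate": 0.012},
                      {"notional": 5e5, "rate": 0.015}]))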
  • 22. Elastic Deep Learning • Have the best of performance and flexibility • Auto Scale up and down based on resource plan • Priority, Real time Fairshare, FIFO • Transparent for Tensorflow, Pytorch and Caffe • Convergence and Hyperparameter awareness Combine the best scheduling and fastest communication with DL specific high performance optimization
  • 23. Elastic Distributed Training Engine Session Scheduler Elastic Scaling DL Driver DL Framework Sync Engine (DDL) Work Wrapper Resource policy (Scaling, preemption, migration) Training planning, scheduling (Training task, micro-batch pipeline, sync-plan) Worker wrapper (model transparency, Data ingest) Tensorflow, Caffe, Pytorch, Keras High performance synchronization (sync, async, p-2-p, centralized) (New RDMA Library) Elastic distribution challenges: • Graceful pre-emption • Auto scale • Dynamic priority • Fault tolerant • Speed up, performance (DDL) • Accuracy • Synchronization algorithm • Topology & GPU Aware • Model transparency
  • 24. Auto scaling and pre-emption • Resource policy drives the scale up and down • Priority • Sharing policy (fair share, FIFO) • GPU demand • Cluster-wide utilization • The fabric handles the scaling • Automatically, without interruption • Supports both sync and async models • Keeps/adjusts batch size and other hyperparameters during scaling (see the sketch below)
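One common way to keep training semantics stable while the GPU count changes is to hold the effective global batch size constant and rescale the learning rate with the effective batch (the linear-scaling rule). The helper below is a minimal sketch of that convention; the function name and the linear-scaling choice are assumptions for illustration, not necessarily what EDT does internally.

    # Sketch: keep the global batch size constant as workers come and go, and
    # rescale the learning rate with the effective batch (linear scaling rule).
    def rescale_for_workers(global_batch: int, base_lr: float, base_batch: int,
                            num_workers: int):
        per_worker_batch = max(1, global_batch // num_workers)
        effective_batch = per_worker_batch * num_workers   # may differ slightly from global_batch
        lr = base_lr * effective_batch / base_batch
        return per_worker_batch, lr

    # Example: a job tuned for batch 256 / lr 0.1 scales from 4 GPUs down to 2.
    print(rescale_for_workers(256, 0.1, 256, 4))   # (64, 0.1)
    print(rescale_for_workers(256, 0.1, 256, 2))   # (128, 0.1)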
  • 25. Keras AutoScaling – Go Beyond one GPU with EDT. Same model definition; same optimizer and loss function; MAO hidden as a callback; similar API for compile and fit; data source support includes Spark dataframes.
        # EDT – N GPUs
        mlp = Sequential()
        mlp.add(Dense(1000, input_shape=(784,)))
        mlp.add(Activation('relu'))
        mlp.add(Dense(250))
        mlp.add(Activation('relu'))
        mlp.add(Dense(10))
        mlp.add(Activation('softmax'))
        trainer = ElasticDL(model=mlp, loss='categorical_crossentropy',
                            optimizer=optimizer_mlp, batch_size=4, num_epoch=1)
        trainer.fit(training_set, epochs=4)

        # Native Keras – 1 GPU
        mlp = Sequential()
        mlp.add(Dense(1000, input_shape=(784,)))
        mlp.add(Activation('relu'))
        mlp.add(Dense(250))
        mlp.add(Activation('relu'))
        mlp.add(Dense(10))
        mlp.add(Activation('softmax'))
        mlp.compile(loss='categorical_crossentropy', optimizer=optimizer_mlp, callback_MAO)
        mlp.fit(training_set, epochs=epochs)
  • 26. Pytorch – Transparent Scaling through EDT
        class Net(nn.Module):
            def __init__(self):
                super(Net, self).__init__()
                self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
                self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
                self.conv2_drop = nn.Dropout2d()
                self.fc1 = nn.Linear(320, 50)
                self.fc2 = nn.Linear(50, 10)

            def forward(self, x):
                x = F.relu(F.max_pool2d(self.conv1(x), 2))
                x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
                x = x.view(-1, 320)
                x = F.relu(self.fc1(x))
                x = F.dropout(x, training=self.training)
                x = self.fc2(x)
                return F.log_softmax(x, dim=1)

        model = Net()
        optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum)

        # EDT PyTorch
        model = ElasticDL(model, optimizer, F.nll_loss, dataLoader)  # hide MAO and data ingest in workers
        model.train(200, 64)

        # Native PyTorch
        def train(model, device, train_loader, optimizer, epoch):
            model.train()
            for batch_idx, (data, target) in enumerate(train_loader):
                data, target = data.to(device), target.to(device)
                optimizer.zero_grad()
                output = model(data)
                loss = F.nll_loss(output, target)
                loss.backward()
                optimizer.step()
  • 27. Transparent DL Insight • No code modification • Customize the live metrics • Plugin interface with third party monitoring • Elastic and interactive with notebook, and developer tools • Scale up and down without interruption
  • 28. Autoscaling with Accuracy, Transparency March 2019 31 Maintain the same accuracy when scaling GPU up and down. 4 GPUs -> 2 GPUs, preemption 1 GPU One line of code change – train anywhere on multiple nodes and multiple GPUs Interactive experience in notebook
  • 29. Elastic AI Solutions: IBM Power with IBM Storage March 18, 2019 March 2019 32
  • 30. For questions, contact: Olga Yiparaki yiparaki@us.ibm.com Chief engineer, IBM Storage Performance Eric Fiala Eric.J.Fiala@ibm.com Solution Architect, Spectrum Computing Constantine Arnold Constantine.Arnold@ibm.com Data Science and Storage Systems Research Jay Vaddi jayvaddi@us.ibm.com IBM Storage Performance Brian Porter bporter1@us.ibm.com Client Technical Specialist, IBM Systems March 2019 33
  • 31. IBM Storage: Spectrum Scale NVMe all-flash appliance March 2019 35 [Diagram: IB EDR fabric; up to 8 hosts were used in these tests and compute nodes can be increased elastically; only one storage AFA node was used throughout these tests, and additional storage can be added independently of compute nodes; 2x IB links per host, 8x IB links to storage] • NVMe-based storage provides more than ample performance for these AI benchmarks, which saturate the GPUs • A single AFA storage node uses 2U of rack space and provides ~63 TB of user capacity • Max read from storage: over 35 GB/s, assuming enough network adapters • Storage can be increased in a linear fashion to meet capacity and/or performance requirements. Power9 with IBM Watson Machine Learning Accelerator (WMLA): • Up to 8 x Power9 AC922 hosts in this environment • GTX model, water cooled • 512 GB RAM per Power9 • 6 GPUs per Power9 host, up to 48 GPUs in this environment • Single dual-ported IB adapter per Power9 host
  • 32. Elastic Distributed Training Scaling efficiency. Framework: TensorFlow (Elastic Distributed Training); Spark instance group: dliauto; Model: InceptionV3; Batch size: 64; Dataset: flowers. Hyperparameters – Learning rate policy: exponential; Base learning rate: 0.01; Decay steps: 4000; Learning rate decay: 0.9; Staircase: TRUE; Solver type: GradientDescent; Maximum iterations: 10000 • With the Elastic Distributed Training capabilities included in Watson Machine Learning Accelerator, the system dynamically scales out to accommodate the demands of growing AI applications • Quick growth demands are accommodated elastically, with ease of management • The measured data below shows high scaling efficiency as the number of GPUs and hosts increases March 2019 36
        Speedup by scaling out hosts & GPUs (iterations/min vs. single-host baseline): 1 host = 1.0x, 2 hosts = 2.0x, 4 hosts = 3.8x, 8 hosts = 7.5x
        Efficiency vs. 6-GPU baseline (50K iterations): 6 GPUs = 100%, 12 GPUs = 100%, 24 GPUs = 96%, 48 GPUs = 94%
        In all these cases, the NVMe-based storage remains unchanged and provides more than ample performance, since the AI workload saturates the GPUs, as evidenced by the high scaling efficiency.
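For reference, scaling efficiency here is simply the measured speedup divided by the scale-out factor: for example, 7.5x on 8 hosts gives 7.5 / 8 ≈ 94%, matching the 48-GPU figure above.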
  • 33. IBM WML-A Enables Service Level Agreements by absorbing multi-tenant growth • POWER9 enables the rapid growth of AI demands: as the number of tenants increases, every new job or user adds negligible overhead, enabling predictable behavior and SLAs (Service Level Agreements) • Coupled with EDT (Elastic Distributed Training), multitenant workloads are accommodated elastically, with ease of management • This showcases measured data, with the same negligible overheads as the number of GPUs varies. March 2019 37 [Chart: overhead of each additional tenant relative to the average (±10% scale), for 3 GPUs x 16 tenants, 6 GPUs x 8 tenants, 12 GPUs x 4 tenants, 24 GPUs x 2 tenants, and 48 GPUs x 1 tenant] In all these cases, the NVMe-based storage remains unchanged and provides more than ample performance, enabling the GPUs to accommodate the multitenant AI workload without any slowdowns, as evidenced by the negligible overheads. Multitenancy overheads are on par with the corresponding overheads when each server uses local storage instead of the external IBM storage used in these tests.
  • 34. Improve data scientists' productivity by 31% and IT resource utilization by 33% in a multi-user shared cluster of GPUs running IBM WML-Accelerator on POWER9 AC922 servers with Nvidia Tesla V100 GPUs connected via NVLink 2.0 • 1.31X reduction in training time for multiple concurrent experiments vs. tested x86 systems • 4 jobs of InceptionV3 trained for 15000 iterations on the Flowers dataset, requiring 3 GPUs each, running on 2 AC922 nodes with 4 GPUs each • 1.33x improvement in utilization of the POWER9 DL infrastructure vs. tested x86 systems • 0 wait time – jobs submitted will be executed even if the cluster is busy, due to IBM WML-Accelerator's elastic scaling • Supports multi-tenancy and elasticity with a fairshare scheduling policy. A Simple Scenario • Results are based on IBM internal measurements running 15000-iteration training of the InceptionV3 model (mini-batch size = 32 per GPU) on the Flowers dataset. • Power AC922; 40 cores (2 x 20c chips), POWER9 with NVLink 2.0; 3.8 GHz, 1 TB memory, 4x Tesla V100 GPU; Red Hat Enterprise Linux 7.5 for Power Little Endian (POWER9) with CUDA 9.2 / cuDNN 7.2.1; WML-A v1.1.1. • Competitive stack: 2x Xeon(R) Gold 6150; 36 cores (2 x 18c chips); 2.70 GHz; 512 GB memory, 4x Tesla V100 GPU, Ubuntu 16.04.4 with CUDA 9.1 / cuDNN 7.1.2; NGC image nvcr.io/nvidia/tensorflow version 18.08-py2; Kubernetes v1.11.2 – idle resource. [Chart: time taken (minutes) for 4 multiuser jobs (3-3-3-3 GPUs), InceptionV3/Flowers training for 15000 iterations: x86 = 34.4 min, AC922 = 26.22 min]
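The headline 1.31X figure follows from the chart data: 34.4 minutes on the tested x86 stack versus 26.22 minutes on AC922, and 34.4 / 26.22 ≈ 1.31.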
  • 35. [End-to-end pipeline diagram] Data sources (traditional business data, sensor data, data from collaboration partners, data from mobile apps & social media, legacy data) → data ingestion, data preparation and pre-processing (heavy IO, instrumentation, training dataset) → ML/DL training & execution with Watson ML Accelerator: AI deep learning frameworks (Tensorflow, Caffe, …), distributed & elastic training for deep learning, parallel hyper-parameter search & optimization (network models, hyper-parameters, testing dataset), monitor & advise, iterate → trained-model life-cycle management → deploy in production using the trained model (REST API) for inference on new data. Underpinned by the multi-tenant, shared-services architecture (Conductor): resource groups, consumers, resource plans, instance groups, resiliency, workload management, notebooks, Anaconda, reporting, security.
  • 36. Watson ML Accelerator Technical References
        Tutorial | URL
        Classify images with IBM Watson Machine Learning Accelerator | https://developer.ibm.com/tutorials/use-computer-vision-with-dli-watson-machine-learning-accelerator/
        Train Keras and Mllib models with IBM Watson Machine Learning Accelerator | https://developer.ibm.com/tutorials/training-keras-and-mllib-model-with-watson-machine-learning-accelerator/
        Get dynamic, elastic, and fine-grained resource allocations and controls for accelerating multiple model trainings simultaneously | https://developer.ibm.com/tutorials/dynamic-resilient-and-elastic-deep-learning-with-watson-machine-learning-accelerator/
        Train Xgboost models with IBM Watson Machine Learning Accelerator | https://developer.ibm.com/tutorials/train-xgboost-models-within-watson-ml-accelerator/
        Accelerate Generalized Linear Model training with IBM Watson Machine Learning Accelerator and Snap ML | https://developer.ibm.com/tutorials/accelerate-machine-model-training-with-watson-ml-accelerator-snap-ml/
        Accelerate tree-based model training with Watson Machine Learning Accelerator and Snap ML | https://developer.ibm.com/tutorials/accelerate-random-forest-model-training-with-watson-ml-accelerator/
        § Machine Learning and Deep Learning with IBM Watson Machine Learning Accelerator Series § Offers a walkthrough and hands-on experience with Watson ML Accelerator's key differentiators § English Series: https://developer.ibm.com/series/learn-watson-machine-learning-accelerator/ § Chinese Series: https://developer.ibm.com/cn/blog/2019/learn-ibm-powerai-enterprise/