Distributed and Collaborative Deep Learning and Machine Learning
Amit Juneja, PhD
Cognitive Solution Specialist
IBM
Artificial Intelligence and Cognitive Applications
  Machine Learning
    Deep Learning
The deeper you go, the more value you gain, and the more you know.
Machine Learning / Deep Learning Process
Training/Development: historic training data -> ML/DL training -> trained model
Inference/Deployment/Application: live data -> trained model -> action
A small illustrative sketch of this two-phase workflow follows.
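To illustrate the two phases, here is a tiny, framework-agnostic sketch using scikit-learn with made-up data; the classifier choice and data are purely illustrative, and deep learning frameworks follow the same train-then-infer pattern.

```python
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Training/Development: historic training data -> ML training -> trained model
rng = np.random.default_rng(0)
historic_features = rng.random((1000, 10))    # made-up historic feature vectors
historic_labels = rng.integers(0, 2, 1000)    # made-up historic labels
model = RandomForestClassifier().fit(historic_features, historic_labels)

# Inference/Deployment/Application: live data -> trained model -> action
live_features = rng.random((5, 10))
for prediction in model.predict(live_features):
    print("take action" if prediction == 1 else "no action")
```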
Machine Learning: Linear and Non-linear Classification
(Figure: linear vs. non-linear decision boundaries)
Machine Learning: Regression and Clustering
(Figure: regression and clustering examples)
Machine Learning Models:
• Support Vector Machines
• Random Forests
• K-Means Clustering
• Logistic Regression
• Generalized Linear Models
• Neural Networks
A node of a neural network
A neural network: more hidden layers = more complex modeling
Training by backpropagation of errors
This did not work well for large numbers of hidden layers with sigmoid units.
Restricted Boltzmann Machines
Deep Belief Network: the first “DEEP” network
Initial deep learning breakthroughs came in speech recognition.
Deep Convolutional Networks: image processing/classification
Today, ReLU activations and operations such as max-pooling replace sigmoid units, enabling easier training and higher accuracy.
Recurrent Neural Networks: sequential data (time series, language, etc.)
LSTMs: Long Short-Term Memory networks
Reinforcement Learning
Unsupported frameworks
Flattening the Deep Learning time-to-value curve
Enterprise Deep Learning Distribution
How can I train deep learning models many times faster?
PowerAI Rel. 4 with Distributed Deep Learning tech preview
Performance: faster training and inferencing, near-ideal scaling to 256 GPUs and beyond
16 days down to 7 hours: 58x faster (1 system vs. 64 systems)
ResNet-101, ImageNet-22K, Caffe with PowerAI DDL, running on Minsky (S822Lc) Power System
Training by backpropagation of errors
Training on GPUs
Parallel execution of training
POWER9: an acceleration superhighway. The only processor specifically designed for the AI era.
• 4x threads per core vs. x86
• Up to 9.5x more I/O bandwidth than x86
• 2.6x more RAM possible vs. x86
• 1st CPU to deliver PCIe Gen 4
PowerAI Rel. 4 with Distributed Deep Learning tech preview
Performance: faster training and inferencing, near-ideal scaling to 256 GPUs and beyond
16 days down to 7 hours: 58x faster (1 system vs. 64 systems)
ResNet-101, ImageNet-22K, Caffe with PowerAI DDL, running on Minsky (S822Lc) Power System
Faster Training Time with Distributed Deep Learning
• TensorFlow 1.4 performance on IBM POWER9 with Nvidia V100
• Single node: 35% more images processed per second vs. tested x86 systems
ResNet50 testing on the ILSVRC 2012 dataset (aka Imagenet 2012): training on 1.2M images, validation on 50K images.
▪ Results are based on IBM internal measurements running 1000 iterations of HPM Resnet50 on 1.2M images and validation on 50K images with the dataset from ILSVRC 2012, also known as Imagenet 2012.
▪ Software: TensorFlow 1.4.0 framework and HPM Resnet50 from https://github.com/tensorflow/benchmarks.git (commit: f5d85aef) with the following parameters: Batch-Size: 64 per GPU; Iterations: 1100; Data: Imagenet; local-parameter-device: gpu; variable-update: replicated.
Date of testing: November 26, 2017
Faster Training Time with Distributed Deep Learning
• TensorFlow 1.4 performance on IBM POWER9 with Nvidia V100
• Multi-node Distributed Deep Learning: IBM POWER9™ with Nvidia Tesla V100 processes 2.3X more data on TensorFlow versus tested x86 systems (2.3X more images processed per second).
The PowerAI Distributed Deep Learning (DDL) library provides innovative distribution methods enabling AI frameworks to scale to multiple servers, leveraging all attached GPUs.
ResNet50 testing on the ILSVRC 2012 dataset (also known as Imagenet 2012): training on 1.2M images, validation on 50K images.
Date of testing: December 2, 2017
How to enable DDL
Run using ddlrun
• ddlrun is a tool for running DDL-enabled scripts
• See the DDL README at /opt/DL/ddl/doc/README.md
• https://developer.ibm.com/linuxonpower/2018/05/01/improved-ease-use-ddl-powerai/
Example invocation:
ddlrun -H system1,system2 python mnist.py
Yes. It’s really that easy!
Split the data
Each GPU trains on its own shard of the training data.
Adjust the learning rate
Since the data is split between the GPUs, the learning rate has to be scaled by the total number of GPUs.
Adjust the Keras callbacks
The two primary operations that need to be restricted to rank 0 are model checkpointing and logging; this is accomplished by adding those callbacks only on rank 0. An extra callback is needed to keep all metrics in sync across all nodes, so that early stopping and learning rate scheduling remain in sync. A minimal sketch of these adjustments follows.
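The sketch below shows what those adjustments typically look like in a small Keras script launched with ddlrun. The `ddl` module name and its `rank()`/`size()` helpers are assumptions about the PowerAI DDL Python integration (check /opt/DL/ddl/doc/README.md for the exact API), and the metric-synchronization callback mentioned above is omitted here.

```python
import tensorflow as tf
import ddl  # assumed: PowerAI DDL Python integration (see the DDL README)

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0

rank, size = ddl.rank(), ddl.size()  # assumed helpers: this worker's rank and total GPU count

# 1. Split the data: each rank trains on its own shard.
x_shard, y_shard = x_train[rank::size], y_train[rank::size]

# 2. Adjust the learning rate: scale the base rate by the total number of GPUs.
learning_rate = 0.01 * size

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(512, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer=tf.keras.optimizers.SGD(lr=learning_rate),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# 3. Adjust the callbacks: checkpointing and logging only on rank 0,
#    so the workers do not all write the same files.
callbacks = []
if rank == 0:
    callbacks.append(tf.keras.callbacks.ModelCheckpoint('model-epoch{epoch:02d}.h5'))

model.fit(x_shard, y_shard, epochs=5, batch_size=64, callbacks=callbacks)
```

Launched exactly as shown above: ddlrun -H system1,system2 python mnist.py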
How can I train models that don’t fit in GPU memory?
Why POWER9: 3.8X faster than x86 architectures
Supports training with models and data sets that are too large for GPU memory in DL and HPC/simulation.
Memory coherency also makes GPU programming easier for developers by automatically moving data between POWER9 system memory and V100 GPU memory.
Power Systems: 7-10X bandwidth over x86 architectures
Bandwidth between GPUs and memory is critical. With no NVLink between CPU and GPU on x86 servers, PCIe becomes the bottleneck.
• x86 using PCIe: GPUs access system memory through the x86 CPU over slow PCIe (32 GB/s).
• x86 + NVLink: NVLink (80 GB/s) connects GPU<->GPU only; CPU<->GPU traffic stays on PCIe (32 GB/s).
• Minsky (POWER8) + NVLink: NVLink (80 GB/s) connects both CPU<->GPU and GPU<->GPU; POWER8 with NVLink delivers 2.5X the bandwidth of the PCIe data pipe.
• POWER9 + NVLink Gen 2.0: 150 GB/s NVLink 2.0 links between the P9 CPU and Tesla V100 GPUs, with 170 GB/s DDR4 system memory bandwidth, delivering a 7-10X bandwidth increase over x86 architectures.
TensorFlow Large Model Support (TFLMS)
• Swap out unused tensors (feature maps,
parameters) to CPU memory after GPU
computation
• Swap in before use in backward propagation
phase
• Implemented as a Python module that statically edits the model graph before training.
• Support for training with Session, Estimator,
tf.keras APIs
• Code contributed to TensorFlow community:
https://github.com/tensorflow/tensorflow/pull/19845
(Figure: layers 1 ... l-1, l, l+1 ... L and the loss, with forward and backward passes; unused tensors are swapped out of GPU memory into CPU memory and swapped back in before the backward phase needs them.)
Why 3D Image segmentation?
Training 3DUnet models for image segmentation generally has high memory requirements, which can limit the size of the 3D images used for training and force smaller batch sizes.
The annual International Multimodal Brain Tumor Segmentation Challenge (BraTS) [1] drives advancements in 3D image segmentation models.
We enabled TFLMS in a Keras model written by David G. Ellis, University of Nebraska. The model processes multimodal MRI scans following the architecture described by Isensee et al. in the 2017 BraTS proceedings (page 100), which received 3rd place in the challenge [2].
Real-world use case of large model support
Maximum image resolution at batch size 1 on a 16GB GPU:
144^3 without TFLMS
192^3 with TFLMS, roughly 2.4x the resolution (since (192/144)^3 ≈ 2.4)
Higher-resolution image processing allows for learning and labeling finer details and structures.
[1] http://www.med.upenn.edu/sbia/brats2018/data.html
[2] https://www.cbica.upenn.edu/sbia/Spyridon.Bakas/MICCAI_BraTS/MICCAI_BraTS_2017_proceedings_shortPapers.pdf
POWER9 vs x86 GPU Connectivity
TFLMS runtime performance: POWER9 vs. x86
The 3DUnet model was run with TFLMS on an IBM AC922 and an x86-based GPU server. Both systems have NVIDIA Volta V100 GPUs.
The x86 server shows a significant slowdown, which gets worse when GPUs share the same PCI bus.
The nvprof view of processing one image with the model shows that on the x86 server the GPU goes idle (white space) while waiting on memory copies over the PCI bus. Corresponding kernel runtimes between the runs are linked in red.
Note that the 4-GPU times come from running 4 individual models concurrently, not one distributed model. Distributed results follow.
(Chart: epoch times at 192^3 with TFLMS, in seconds, for the IBM AC922 (4 GPU version), IBM AC922 (6 GPU version), x86 server, and x86 server with PCI contention.)
GPU and interconnect usage
(Chart: average memory copy throughput over 30 batches, host-to-GPU and GPU-to-host, in GB/s, for the IBM AC922 (4 GPU version), x86 server, and x86 server while sharing the PCI bus.)
(Chart: average GPU utilization over 30 batches for the IBM AC922 (4 GPU version), x86 server, and x86 server while sharing the PCI bus.)
Higher memory copy throughput drives higher GPU utilization.
Can I combine DDL and LMS?
IBM PowerAI Distributed Deep Learning (DDL) + TFLMS
(Chart: DDL training speedup vs. number of GPUs, reaching 3.93x at 4 GPUs, 7.76x at 8 GPUs, and 14.75x at 16 GPUs.)

Number of GPUs     Epoch time   Speedup   Efficiency
1 (without DDL)    590s         -         -
4                  150s         3.93      98.33%
8                  76s          7.76      97.04%
16                 40s          14.75     92.19%

A hedged sketch of combining DDL with TFLMS follows.
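The deck does not show the combined script, so the following is only a sketch of how the two pieces fit together: a TFLMS-enabled Keras script (using the LMSKerasCallback shown in the enablement section below) that also applies the DDL rank-0 callback rule, launched across hosts with ddlrun. The `ddl` module and its `rank()` helper are assumptions (see /opt/DL/ddl/doc/README.md), and `model`/`training_gen` stand in for the 3DUnet Keras model and data generator of the training script.

```python
# Launch across hosts as before, e.g.:
#   ddlrun -H system1,system2 python train_with_lms.py
import tensorflow as tf
from tensorflow.contrib.lms import LMSKerasCallback  # TFLMS Keras callback
import ddl  # assumed: PowerAI DDL Python integration

callbacks = [LMSKerasCallback()]              # enable large model support
if ddl.rank() == 0:                           # checkpoint only on rank 0
    callbacks.append(tf.keras.callbacks.ModelCheckpoint('checkpoint.h5'))

# `model` and `training_gen` are the (hypothetical) 3DUnet model and generator
# defined elsewhere in the training script.
model.fit_generator(generator=training_gen, callbacks=callbacks)
```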
How do I enable LMS in my TensorFlow code?
How to enable TFLMS
Session based training:
Step 1: define optimizer/solver scopes
with tf.name_scope('adam_optimizer'):
    optimizer = tf.train.AdamOptimizer(1e-4)
    train_step = optimizer.minimize(cross_entropy)
Step 2: define an LMS object and run it
from tensorflow.contrib.lms import LMS
lms_obj = LMS({'adam_optimizer'})
lms_obj.run(graph=tf.get_default_graph())
Estimator based training:
Step 1: define optimizer/solver scopes
with tf.name_scope('graddescopt'):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
    train_op = optimizer.minimize(loss=loss,
                                  global_step=tf.train.get_global_step())
Step 2: define an LMSHook
from tensorflow.contrib.lms import LMSHook
lms_hook = LMSHook({'graddescopt'})
Step 3: add the LMSHook into the Estimator's hook list
mnist_classifier.train(input_fn=train_input_fn, steps=20000,
hooks=[logging_hook, lms_hook])
TF-Keras based training:
Step 1: define an LMSKerasCallback
from tensorflow.contrib.lms import LMSKerasCallback
lms_callback = LMSKerasCallback()
Step 2: pass the callback to the Keras fit or fit_generator function
model.fit_generator(generator=training_gen,
                    callbacks=[lms_callback])
How can I train machine learning models at terabyte scale with GPUs?
SnapML: rapid training of logistic regression/SVMs on GPUs
Snap ML: rapid training of logistic regression/SVMs on GPUs – tera-scale ML benchmark
Snap ML iteration profile (Intel x86** + Tesla V100 + PCIe Gen3)
Snap ML iteration profile (POWER9 + Tesla V100 + NVLink 2.0)
SnapML is easy:
sklearn.linear_model.LogisticRegression -> snap_ml.LogisticRegression
(a hedged sketch of the swap follows)
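The sketch below illustrates the swap the slide implies, assuming the snap_ml package exposes LogisticRegression with a scikit-learn-style estimator API as shown; the use_gpu argument is an additional assumption, so check the Snap ML documentation for the exact constructor options.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from sklearn.linear_model import LogisticRegression                 # CPU baseline
from snap_ml import LogisticRegression as SnapLogisticRegression    # GPU-accelerated

# Synthetic data stands in for a tera-scale training set.
X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

sk_model = LogisticRegression().fit(X_train, y_train)

snap_model = SnapLogisticRegression(use_gpu=True)  # assumed flag; see Snap ML docs
snap_model.fit(X_train, y_train)

print("sklearn accuracy:", accuracy_score(y_test, sk_model.predict(X_test)))
print("snap_ml accuracy:", accuracy_score(y_test, snap_model.predict(X_test)))
```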
How can I do this collaboratively with optimal resource utilization?
• Physical view: Spectrum Conductor installed on each Linux server
• Logical view: users and groups have their own Spark cluster, isolated, protected, and secured by Spark Instance Groups
• Manage all DL resources with Conductor SLAs
• The scheduler interfaces with Spark and ensures accelerated GPU resources for priority applications and users
(Diagram: an administrator uses the web console to create Spark instance groups across pools of Linux management and compute nodes backed by Spectrum Scale. Lines of business, data scientists, researchers, IT/data-warehouse ETL/batch, and LOB IoT users each get their own virtual Spark cluster (PaaS) as instance groups #1-#5, running workloads such as customer behavior, trend analysis, HPC, marketing, and fraud detection.)
Why POWER9: GPU-accelerated Spark + multi-tenancy with Spectrum Conductor
(Diagram: the Conductor with Spark stack, including session scheduler, service management (ASC/K8s), security, report/log management, containers, multi-tenancy, notebooks, Spark, ELK, data connectors, and GPU acceleration.)
THANK YOU!
Amit Juneja, PhD
Cognitive Solution Specialist
IBM