Invited Lecture on GPUs and Distributed Deep Learning at Uppsala University
1. GPUs, Distributed Deep Learning and Hopsworks
Jim Dowling
Assoc Prof @ KTH – Royal Institute of Technology
CEO at Logical Clocks AB
2. Leadership & Offices
Stockholm: Box 1263, Isafjordsgatan 22, Kista, Sweden
London: IDEALondon, 69 Wilson St, London, UK
Silicon Valley: 470 Ramona St, Palo Alto, California, USA
Dr. Jim Dowling, CEO
Theo Kakantousis, COO
Prof. Seif Haridi, Chief Scientist
Fabio Buso, VP Engineering
Steffen Grohsschmiedt, Head of Cloud
Shraddha Chouhan, Head of Marketing
www.logicalclocks.com
3. Affine Transformation
•Matrix multiplications calculate the weighted sums for feed-forward networks
•Each multiply-and-accumulate maps to a fused multiply-add (FMA) instruction on a GPU
Input $x$, weights $W$, biases $b$, output $y$: $y = Wx + b$, i.e. $y_i = \sum_j w_{ij}\,x_j + b_i$
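As a concrete illustration, a minimal NumPy sketch of this affine transformation (sizes are illustrative):

import numpy as np

m, n = 4, 3                  # m inputs, n outputs (illustrative sizes)
x = np.random.rand(m)        # input vector  (x1..xm)
W = np.random.rand(n, m)     # weight matrix, one row of weights per output
b = np.random.rand(n)        # bias vector
y = W @ x + b                # weighted sums plus biases: y = Wx + b
print(y)                     # n outputs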
4. Convolution Operations
Input Matrix ⊗ Filter =
Output (activation map)
For each output map j
  For each input map k
    For each pixel (x, y)
      For each kernel element (u, v)
        B[x][y][j] += A[x-u][y-v][k] * K[u][v][k][j]
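To make the loop nest concrete, a naive NumPy sketch (written as cross-correlation with A[x+u][y+v], the form deep-learning frameworks actually compute; the slide's A[x-u][y-v] indexing is the flipped-kernel convolution):

import numpy as np

def conv(A, K):
    # A: input maps, shape (H, W, in_maps); K: kernels, shape (kh, kw, in_maps, out_maps)
    H, W, in_maps = A.shape
    kh, kw, _, out_maps = K.shape
    B = np.zeros((H - kh + 1, W - kw + 1, out_maps))   # "valid" output size
    for j in range(out_maps):                          # each output map j
        for k in range(in_maps):                       # each input map k
            for x in range(B.shape[0]):                # each output pixel (x, y)
                for y in range(B.shape[1]):
                    for u in range(kh):                # each kernel element (u, v)
                        for v in range(kw):
                            B[x, y, j] += A[x + u, y + v, k] * K[u, v, k, j]
    return B

print(conv(np.random.rand(8, 8, 3), np.random.rand(3, 3, 3, 2)).shape)  # (6, 6, 2)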
9. GPU Programming: CPU vs GPU
Intel Xeon E5-2680v4 (CPU):
- Clock speed: 2.4 GHz
- 4 instructions per cycle with AVX2, 28 cores
- 2.4 × 4 × 28 = 268.8 Gigaflops double precision
NVIDIA Tesla P100 (GPU):
- Single instruction per cycle, 3584 CUDA cores
- 4.7 Teraflops double precision
11. GPU Programming: CPUs, GPUs, Memory, and the PCI-E Bus
[Diagram: the CPU with its own memory and the GPU with its own memory, connected by the PCI-E bus; a program's sequential code runs on the CPU while compute-intensive functions are offloaded to the GPU.]
13. Single-Instruction Multiple-Data (SIMD)
•A single instruction stream applied to operations that can be naturally parallelized
- Matrix Multiplication
- Convolutions
•x86 SIMD support
- AVX Extensions
- Intel MKL Library
[Diagram: a single instruction pool drives four processors, each reading a different element from the data pool.]
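In Python, SIMD typically arrives through vectorized libraries rather than hand-written intrinsics: NumPy delegates to a BLAS backend (for example Intel MKL) whose kernels use AVX. A small sketch; absolute timings depend on your BLAS build:

import numpy as np
import time

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)

t0 = time.time()
c = a @ b                    # dispatched to a SIMD-vectorized BLAS kernel
print(f"matmul took {time.time() - t0:.3f}s")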
14. Why is SIMD not dominant?
•It’s hard to program
- Have to identify data parallelism in applications
•Specialized hardware is expensive
- Vector processors used to be expensive, so programming
language support lagged
•SIMD extensions (AVX, SSE)
- Not very wide
- No control flow within SIMD
15. Nvidia Cuda: SIMT Programming
•Single Instruction Multiple Thread (SIMT)
- Parallel threads that use SIMD hardware
- CUDA, OpenCL are both SIMT programming models
•What is CUDA?
- Nvidia proprietary API and platform
• Hierarchical thread programming model that combines MIMD/SIMD
ideas
- Lots of Libraries
• CuDNN - Deep Neural Network library
• CUBLAS - CUDA Basic Linear Algebra Subroutines library
• CUDART - CUDA RunTime library
• NVIDIA Collective Communications Library (NCCL)
16. Cuda SIMT
•Many parallel threads of instructions
- Each thread is responsible for a single data element
•Threads are grouped into blocks (<1024) (blocks are
mapped to SMXs)
- Hardware divides blocks into SIMD groups or warps
- Warps of threads are executed together as a single SIMD
instruction
•Threads distinguish computation by querying their
grid/block location (think thread IDs)
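A minimal SIMT sketch using Numba's CUDA bindings (assuming Numba and a CUDA-capable GPU are available; the lecture itself doesn't prescribe a particular CUDA API): each thread derives a global index from its grid/block location and handles exactly one element.

import numpy as np
from numba import cuda

@cuda.jit
def vec_add(a, b, out):
    i = cuda.grid(1)                 # blockIdx.x * blockDim.x + threadIdx.x
    if i < out.size:                 # guard: the last block may be partly idle
        out[i] = a[i] + b[i]

n = 1_000_000
a, b = np.random.rand(n), np.random.rand(n)
out = np.zeros(n)
threads_per_block = 256              # threads grouped into blocks (< 1024)
blocks = (n + threads_per_block - 1) // threads_per_block
vec_add[blocks, threads_per_block](a, b, out)   # launch one thread per element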
17. CUDA memory model
•All threads share Global Memory
•Threads within a block share Shared Memory
•Threads have their own private Registers
•Memory consistency is very relaxed
- Cannot guarantee ordering of instructions across blocks.
- Can insert explicit memory fences/barriers to order threads
within a block
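Again with Numba (same assumptions as above), a sketch of this memory model: threads in a block stage data in Shared Memory and use an explicit barrier to order accesses within the block; no such barrier exists across blocks.

import numpy as np
from numba import cuda, float32

TPB = 128                            # threads per block (compile-time constant)

@cuda.jit
def block_reverse(a, out):
    s = cuda.shared.array(shape=TPB, dtype=float32)  # per-block Shared Memory
    i = cuda.grid(1)
    t = cuda.threadIdx.x
    if i < a.size:
        s[t] = a[i]                  # each thread loads one element
    cuda.syncthreads()               # barrier: orders threads within the block
    if i < a.size:
        out[i] = s[TPB - 1 - t]      # safely read a value another thread wrote

n = 1024                             # a multiple of TPB, to keep the sketch simple
a = np.arange(n, dtype=np.float32)
out = np.zeros_like(a)
block_reverse[n // TPB, TPB](a, out) # out holds each 128-element block reversed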
23. Intra GPU and Inter-Host GPU Connections
• QPI link ~8-12 GB/s
• PCIe ~16-32 GB/s; NVLink ~80 GB/s
• Infiniband ~108-216 Gb/s
[Diagram: a host with 2 CPU sockets, a PCIe bus, 4 GPUs, and an InfiniBand network adapter.]
24. GPU -> Network -> GPU Communications
Oden et al., GGAS: Global GPU Address Spaces for Efficient Communication in Heterogeneous Clusters, IEEE CLUSTER 2013
29. Will GPUs become commodity compute?
•For….
- Gaming GPUs have the best Price/Performance for training
models
- Distributed Deep Learning technology is rapidly improving
• Buy more GPUs to scale out training, HParam Tuning, Ablation
Studies, etc
•Against….
- Nvidia
30. Commodity Fight: AMD Radeon vs Nvidia
                    Nvidia™ 2080Ti                AMD Radeon™ VII
Memory:             11 GB                         16 GB
Framework:          TensorFlow 1.12               TensorFlow 1.13.1
Stack:              CUDA 10.0.130, cuDNN 7.4.1    ROCm 2.3
Model:              ResNet-50                     ResNet-50
Dataset:            ImageNet (synthetic)          ImageNet (synthetic)
FP32 images/sec:    ~322                          ~302
FP16 images/sec:    ~560                          ~415
https://lambdalabs.com/blog/2080-ti-deep-learning-benchmarks/
https://www.phoronix.com/scan.php?page=article&item=nvidia-rtx2080ti-tensorflow&num=2
https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/173
31. AMD Software Stack
[Diagram: the fully open-source ROCm platform stack ("open source foundation for machine learning"). Spark / machine-learning apps run on the latest machine-learning frameworks, with Docker and Kubernetes support; middleware and libraries include Eigen, MIOpen, RCCL, and optimized math (BLAS, FFT, RNG) and communication libraries; programming models: OpenMP, HIP, OpenCL™, Python; drivers are up-streamed to Linux kernel distributions; target devices: GPU, CPU, APU, DLA.]
32. AMD Distributed Training
RCCL:
- Optimized collective communication operations library
- Easy MPI integration
- Support for InfiniBand and RoCE high-speed network fabrics
- ROCm-enabled UCX
- ROCm with ROCmRDMA
[Chart: ResNet-50 multi-GPU scaling (PCIe, CPU parameter server): 1 GPU = 1.00x, 2 GPUs = 1.99x, 4 GPUs = 3.98x, 8 GPUs = 7.64x.]
33. ROCm over Spark/TensorFlow on Hopsworks
•Spark / TensorFlow
applications run
unchanged on ROCm
•Hopsworks runs
Spark/TensorFlow on
YARN and Conda
34. YARN support for ROCm in Hops
A Container is a CGroup that isolates CPU, memory, and GPU resources and has a conda environment and TLS certs.
[Diagram: the YARN ResourceManager schedules containers across NodeManagers; the containers host the Spark Driver and Executors.]
36. Model Parallelism SGD vs Data Parallelism SGD
•Model Parallelism
- Models cannot fit on a
single GPU
- Model is partitioned over
many GPUs
• For example, one layer per
GPU for a ConvNet
- The same data is used to
train the partitioned
models
• E.g., input data enters at the bottom layer (GPU1) and, as it feeds forward to the output layer, passes through many GPUs
•Data Parallelism
- A copy of the model is stored at each worker
- Each worker trains on a different set of samples from the same mini-batch (sketched below)
- Gradients computed at
each worker need to be
aggregated to calculate
the new model for a
mini-batch
- A new model needs to be
broadcast for each
iteration to all workers
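A toy NumPy sketch of data-parallel synchronous SGD for a linear model (purely illustrative, not the Hopsworks implementation): each worker computes a gradient on its shard of the mini-batch, the gradients are averaged, and the updated model would then be broadcast back to all workers.

import numpy as np

def gradient(w, X, y):
    # gradient of mean-squared error for a linear model y ≈ Xw
    return 2.0 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(256, 10)), rng.normal(size=256)
w = np.zeros(10)                       # model replica, copied to each worker
n_workers, lr = 4, 0.1

for step in range(100):
    shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
    grads = [gradient(w, Xs, ys) for Xs, ys in shards]   # one gradient per worker
    w -= lr * np.mean(grads, axis=0)   # aggregate, update, then broadcast w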
40. Synchronous SGD: N/W is the Bottleneck
[Chart: iteration timeline for 1 GPU vs. 4 GPUs; with 4 GPUs each unit of computation is followed by a network (N/W) synchronization phase, so communication dominates the total iteration time.]
Reduce N/W Comms Time, Increase Computation Time
Amdahl’s Law
41. Synchronous SGD Challenges
•Effective size of the batch becomes larger
- Can we train models with very large batch sizes?
•Update time depends on the slowest worker
- Backup workers proposed as a mitigating strategy
42. Facebook: Scaling Synchronous SGD
June 2017: Facebook reduced training time on ImageNet from 2 weeks to 1 hr
https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf
43. Facebook AllReduce Synchronous SGD
•Experiments using ConvNets on ImageNet
- ResNet-50 on up to 256 GPUs
- No loss of accuracy when training with large minibatch
sizes up to 8192 images
- ∼90% scaling efficiency when moving from 8 to 256 GPUs
•Learning rate heuristic
- Make the learning rate proportional to the batch size
•Warm up phase
- Start at a low learning rate that you gradually increase up
to the target learning rate (5 epochs to target rate)
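One common way to implement this heuristic, as a per-epoch sketch of the linear scaling rule with gradual warm-up (the paper interpolates per iteration rather than per epoch; 0.1 and 256 are the paper's reference values):

def scaled_lr(epoch, batch_size, base_lr=0.1, ref_batch=256, warmup_epochs=5):
    target = base_lr * batch_size / ref_batch      # lr proportional to batch size
    if epoch < warmup_epochs:                      # gradual warm-up to the target
        return target * (epoch + 1) / warmup_epochs
    return target

for e in range(7):
    print(e, scaled_lr(e, batch_size=8192))        # ramps up to 3.2, then stays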
46. Instead of Decaying the Learning Rate, Increase Batch Size
•An alternative approach to Facebook’s learning rate heuristic
is to increase the batch size during training
- Quoc Le et al.
•Vast batch size of 65536 images when training Inception-ResNet-V2 using only 2500 iterations (model updates to all GPUs), reaching an accuracy of 77% (cf. Facebook's 76%).
https://arxiv.org/pdf/1711.00489.pdf
63. Concurrency in Ring-AllReduce SGD
•After computing the gradients for a layer, send the
gradients to your neighbor immediately
- Spreads out network traffic over time
63
”Running the model on 40 GPUs takes approximately 650 – 700
milliseconds per iteration, while on a single GPU it takes
approximately 370 milliseconds. Since by our estimate communication
would take 400 milliseconds, we are saving an extra 70 – 120
milliseconds per iteration by overlapping the backpropagation with
the data transfer.”
http://research.baidu.com/bringing-hpc-techniques-deep-learning/
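To make the communication pattern concrete, a toy single-process simulation of ring-allreduce (reduce-scatter followed by all-gather; illustrative only, since NCCL/Horovod implement this on real interconnects):

import numpy as np

def ring_allreduce(grads):
    # grads: one gradient vector per worker; returns the all-reduced copy that
    # each worker ends up with after a reduce-scatter and an all-gather phase.
    n = len(grads)
    chunks = [list(np.array_split(g.astype(float), n)) for g in grads]
    for s in range(n - 1):             # reduce-scatter: pass partial sums around
        for w in range(n):
            c = (w - s - 1) % n        # chunk received from the left neighbour
            chunks[w][c] = chunks[w][c] + chunks[(w - 1) % n][c]
    for s in range(n - 1):             # all-gather: circulate the reduced chunks
        for w in range(n):
            c = (w - s) % n
            chunks[w][c] = chunks[(w - 1) % n][c]
    return [np.concatenate(ch) for ch in chunks]

grads = [np.full(8, i, dtype=float) for i in range(4)]   # 4 simulated GPUs
print(ring_allreduce(grads)[0])        # every worker holds the sum 0+1+2+3 = 6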
64. Support in TensorFlow for AllReduce
•Collective AllReduce in Keras/TensorFlow
- Multi-node collective communication primitives
•Uber Horovod
- Built on NVIDIA Collective Communications Library (NCCL)
- NCCL provides routines such as all-gather, all-reduce,
broadcast, reduce, reduce-scatter over PCIe and NVLink.
67. Hopsworks Project
[Timeline 2017-2019:]
- World's first Hadoop platform to support GPUs-as-a-Resource
- World's fastest HDFS, published at USENIX FAST with Oracle and Spotify
- Winner of the IEEE Scale Challenge 2017 with HopsFS (1.2m ops/sec)
- World's first open-source Feature Store for machine learning
- World's first distributed filesystem to store small files in metadata on NVMe disks
- World's most scalable POSIX-like hierarchical filesystem, with multi-data-center availability (1.6m ops/sec)
- World's first managed Feature Store in the cloud (Hopsworks.ai)
- World's first unified hyperparameter and ablation study parallel programming framework
68. Inner and Outer Loop of Deep Learning
[Diagram: the inner loop (workers 1..N compute gradients ∆1..∆N on the training data, followed by a synchronization step) sits inside the outer loop (a search method consumes the resulting metric and proposes new HParams). http://tiny.cc/51yjdz]
69. Inner and Outer Loop of Deep Learning
[The same diagram, annotated: the inner loop is LEARNING, the outer loop is SEARCH. http://tiny.cc/51yjdz]
71. Parallel Black Box Optimization
Which algorithm to use for search? How to monitor progress? How to aggregate results? Fault tolerance?
[Diagram: a meta-level learning & optimization component draws trials from the search space, dispatches them to a parallel worker queue, and treats each learning run as a black box that returns a metric.]
This should be managed with platform support!
72. Distributed HParam Tuning
[Diagram: the Driver and Executors (each with a conda_env) share HopsFS, which stores TensorBoard logs, models, checkpoints, training data, and logs.]
# RUNS ON THE EXECUTORS
def train(lr, dropout):
    def input_fn():  # return dataset
        ...
    optimizer = ...
    model = ...
    model.add(Conv2D(...))
    model.compile(...)
    model.fit(...)
    model.evaluate(...)

# RUNS ON THE DRIVER
hparams = {'lr': [0.001, 0.0001],
           'dropout': [0.25, 0.5, 0.75]}
experiment.grid_search(train, hparams)
More details: Spark Summit Europe 2018 talk https://www.youtube.com/watch?v=tx6HyoUYGL0
https://github.com/logicalclocks/hops-examples
73. Distributed Training
[Diagram: the Driver and Executors (each with a conda_env) share HopsFS, which stores TensorBoard logs, models, checkpoints, training data, and logs.]
# RUNS ON THE EXECUTORS
def train():
    def input_fn():  # return dataset
        ...
    model = ...
    optimizer = ...
    model.compile(...)
    rc = tf.estimator.RunConfig('CollectiveAllReduceStrategy')
    keras_estimator = tf.keras.estimator.model_to_estimator(...)
    tf.estimator.train_and_evaluate(keras_estimator, input_fn)

# RUNS ON THE DRIVER
experiment.collective_all_reduce(train)
More details: Spark Summit Europe 2018 talk https://www.youtube.com/watch?v=tx6HyoUYGL0
https://github.com/logicalclocks/hops-examples
74. Maggy – Parallel HParam Trials on PySpark
[Diagram: the Driver coordinates long-running tasks Task11..Task1N behind a barrier; tasks stream metrics back and receive new trials or early-stop signals.]
Long running tasks execute many trials, with a global optimizer.
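A sketch of what a Maggy experiment looks like (names such as Searchspace, experiment.lagom, and the reporter callback follow Maggy's published examples from this period; treat them as indicative rather than a definitive API):

from maggy import experiment, Searchspace

sp = Searchspace(lr=('DOUBLE', [1e-4, 1e-2]),
                 dropout=('DOUBLE', [0.25, 0.75]))

def train(lr, dropout, reporter):
    # build and fit the model here ...
    for epoch in range(10):
        acc = 0.0                        # placeholder for a real validation metric
        reporter.broadcast(metric=acc)   # stream metrics to enable early stopping
    return acc

# Long-running Spark tasks execute trials generated by the global optimizer.
experiment.lagom(train, searchspace=sp, optimizer='randomsearch',
                 direction='max', num_trials=20)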
76. ML Model Development
•It's simple, only four steps:
- Explore and Design
- Experimentation: Tune and Search
- Model Training (Distributed)
- Explainability and Ablation Studies
77. Artifacts and Non DRY Code
[Diagram: the same four steps, each producing its own artifacts and duplicated (non-DRY) training code.]
78. Development of ML Models is Iterative
[Diagram: the same four steps, shown as an iterative cycle.]
79. Iterative Development Is a Pain, We Need DRY Code!
•Each step requires different implementations of the training code
[Pipeline: EDA → HParam Tuning → Training (Dist) → Ablation Studies]
80. The Oblivious Training Function
OBLIVIOUS TRAINING FUNCTION

# RUNS ON THE WORKERS
def train():
    def input_fn():  # return dataset
        ...
    model = ...
    optimizer = ...
    model.compile(...)
    rc = tf.estimator.RunConfig('CollectiveAllReduceStrategy')
    keras_estimator = tf.keras.estimator.model_to_estimator(...)
    tf.estimator.train_and_evaluate(keras_estimator, input_fn)

[Pipeline: EDA → HParam Tuning → Training (Dist) → Ablation Studies; the same training function serves every step.]
81. Challenge: Obtrusive Framework Artifacts
•TensorFlow challenges:
- TF_CONFIG
- Distribution Strategy
- Dataset (sharding, DFS)
- Integration in Python: hard from inside a notebook
- Keras vs. Estimator vs. Custom Training Loop
84. Distribution Context
•Single-host vs. parallel multi-host vs. distributed multi-host
[Diagram: single host (one Driver); parallel multi-host (an Experiment Controller on the Driver dispatching independent Worker1..WorkerN); distributed multi-host (Worker1..Worker8 coordinated by the Driver via TF_CONFIG).]
85. Distribution Context
•Single-host vs. parallel multi-host vs. distributed multi-host
[The same diagram, mapped onto the four model-development steps: Explore and Design; Experimentation: Tune and Search; Model Training (Distributed); Explainability and Ablation Studies.]
86. Model Development Best Practices
• Modularize
• Parametrize
• Higher-order training functions
• Use callbacks at runtime
[Diagram: training code factored into Dataset Generation, Model Generation, and Training Logic.]
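A sketch of these practices using Keras (the structure and helper names here are mine, for illustration): dataset generation, model generation, and training logic are separate parametrized functions, composed by a higher-order train() that also accepts runtime callbacks.

import tensorflow as tf

def make_dataset(batch_size):                       # Dataset Generation
    (x, y), _ = tf.keras.datasets.mnist.load_data()
    return tf.data.Dataset.from_tensor_slices(
        (x / 255.0, y)).shuffle(1024).batch(batch_size)

def make_model(lr, dropout):                        # Model Generation
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(dropout),
        tf.keras.layers.Dense(10, activation='softmax')])
    model.compile(optimizer=tf.keras.optimizers.Adam(lr),
                  loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

def train(dataset_fn, model_fn, hparams, callbacks=None):   # Training Logic
    model = model_fn(hparams['lr'], hparams['dropout'])
    model.fit(dataset_fn(hparams['batch_size']), epochs=2, callbacks=callbacks)
    return model

train(make_dataset, make_model,
      {'lr': 1e-3, 'dropout': 0.5, 'batch_size': 64})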
87. Oblivious Training Function as an Abstraction
•Let the system handle the complexities
System takes care of ...
… fixing parameters
… launching the function
… launching trials (parametrized instantiations of the function)
… generating new trials
… collecting and logging results
… setting up TF_CONFIG
… wrapping in a Distribution Strategy
… launching the function as workers
… collecting results
89. Recap: The Solution
•Add Communication and Long Running Tasks
[Diagram: the Driver plus long-running tasks Task11..Task1N, synchronized by a barrier, exchanging metrics and new trials.]
90. What’s New?
•Worker discovery and distribution context set-up
[Diagram: the Driver discovers the workers, then launches the oblivious training function in the appropriate distribution context across Task11..Task1N behind a barrier.]