Invited Lecture on GPUs and Distributed Deep Learning at Uppsala University
1. GPUs, Distributed Deep Learning and Hopsworks
Jim Dowling
Assoc Prof @ KTH – Royal Institute of Technology
CEO at Logical Clocks AB
2. Leadership & Offices
Stockholm: Box 1263, Isafjordsgatan 22, Kista, Sweden
London: IDEALondon, 69 Wilson St, London, UK
Silicon Valley: 470 Ramona St, Palo Alto, California, USA
Dr. Jim Dowling, CEO
Theo Kakantousis, COO
Prof. Seif Haridi, Chief Scientist
Fabio Buso, VP Engineering
Steffen Grohsschmiedt, Head of Cloud
Shraddha Chouhan, Head of Marketing
www.logicalclocks.com
3. Affine Transformation
•Matrix multiplications calculate the weighted sums for feed-forward networks
•Each multiply-and-accumulate maps to a fused multiply-add (FMA) instruction on a GPU
Input $x$, weights $W$, biases $b$, output $y$: $y = Wx + b$, i.e. $y_i = \sum_j w_{ij}\,x_j + b_i$
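As a concrete illustration, a minimal NumPy sketch of this affine transformation (sizes are illustrative):

import numpy as np

m, n = 4, 3                  # m inputs, n outputs (illustrative sizes)
x = np.random.rand(m)        # input vector  (x1..xm)
W = np.random.rand(n, m)     # weight matrix, one row of weights per output
b = np.random.rand(n)        # bias vector
y = W @ x + b                # weighted sums plus biases: y = Wx + b
print(y)                     # n outputs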
4. Convolution Operations
Input Matrix ⊗ Filter =
Output (activation map)
For each output map j
  For each input map k
    For each pixel (x, y)
      For each kernel element (u, v)
        B[x][y][j] += A[x-u][y-v][k] * K[u][v][k][j]
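To make the loop nest concrete, a naive NumPy sketch (written as cross-correlation with A[x+u][y+v], the form deep-learning frameworks actually compute; the slide's A[x-u][y-v] indexing is the flipped-kernel convolution):

import numpy as np

def conv(A, K):
    # A: input maps, shape (H, W, in_maps); K: kernels, shape (kh, kw, in_maps, out_maps)
    H, W, in_maps = A.shape
    kh, kw, _, out_maps = K.shape
    B = np.zeros((H - kh + 1, W - kw + 1, out_maps))   # "valid" output size
    for j in range(out_maps):                          # each output map j
        for k in range(in_maps):                       # each input map k
            for x in range(B.shape[0]):                # each output pixel (x, y)
                for y in range(B.shape[1]):
                    for u in range(kh):                # each kernel element (u, v)
                        for v in range(kw):
                            B[x, y, j] += A[x + u, y + v, k] * K[u, v, k, j]
    return B

print(conv(np.random.rand(8, 8, 3), np.random.rand(3, 3, 3, 2)).shape)  # (6, 6, 2)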
9. GPU Programming: CPU vs GPU
Intel Xeon E5-2680v4 (CPU):
- Clock speed: 2.4 GHz
- 4 instructions per cycle with AVX2, 28 cores
- 2.4 × 4 × 28 = 268.8 Gigaflops double precision
NVIDIA Tesla P100 (GPU):
- Single instruction per cycle, 3584 CUDA cores
- 4.7 Teraflops double precision
11. GPU Programming: CPUs, GPUs, Memory, and the PCI-E Bus
[Diagram: the CPU with its own memory and the GPU with its own memory, connected by the PCI-E bus; a program's sequential code runs on the CPU while compute-intensive functions are offloaded to the GPU.]
13. Single-Instruction Multiple-Data (SIMD)
•A single instruction stream applied to operations that can be naturally parallelized
- Matrix Multiplication
- Convolutions
•x86 SIMD support
- AVX Extensions
- Intel MKL Library
[Diagram: a single instruction pool drives four processors, each reading a different element from the data pool.]
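In Python, SIMD typically arrives through vectorized libraries rather than hand-written intrinsics: NumPy delegates to a BLAS backend (for example Intel MKL) whose kernels use AVX. A small sketch; absolute timings depend on your BLAS build:

import numpy as np
import time

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)

t0 = time.time()
c = a @ b                    # dispatched to a SIMD-vectorized BLAS kernel
print(f"matmul took {time.time() - t0:.3f}s")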
14. Why is SIMD not dominant?
•It’s hard to program
- Have to identify data parallelism in applications
•Specialized hardware is expensive
- Vector processors used to be expensive, so programming
language support lagged
•SIMD extensions (AVX, SSE)
- Not very wide
- No control flow within SIMD
15. Nvidia Cuda: SIMT Programming
•Single Instruction Multiple Thread (SIMT)
- Parallel threads that use SIMD hardware
- CUDA, OpenCL are both SIMT programming models
•What is CUDA?
- Nvidia proprietary API and platform
• Hierarchical thread programming model that combines MIMD/SIMD
ideas
- Lots of Libraries
• CuDNN - Deep Neural Network library
• CUBLAS - CUDA Basic Linear Algebra Subroutines library
• CUDART - CUDA RunTime library
• NVIDIA Collective Communications Library (NCCL)
16. Cuda SIMT
•Many parallel threads of instructions
- Each thread is responsible for a single data element
•Threads are grouped into blocks (<1024) (blocks are
mapped to SMXs)
- Hardware divides blocks into SIMD groups or warps
- Warps of threads are executed together as a single SIMD
instruction
•Threads distinguish computation by querying their
grid/block location (think thread IDs)
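A minimal SIMT sketch using Numba's CUDA bindings (assuming Numba and a CUDA-capable GPU are available; the lecture itself doesn't prescribe a particular CUDA API): each thread derives a global index from its grid/block location and handles exactly one element.

import numpy as np
from numba import cuda

@cuda.jit
def vec_add(a, b, out):
    i = cuda.grid(1)                 # blockIdx.x * blockDim.x + threadIdx.x
    if i < out.size:                 # guard: the last block may be partly idle
        out[i] = a[i] + b[i]

n = 1_000_000
a, b = np.random.rand(n), np.random.rand(n)
out = np.zeros(n)
threads_per_block = 256              # threads grouped into blocks (< 1024)
blocks = (n + threads_per_block - 1) // threads_per_block
vec_add[blocks, threads_per_block](a, b, out)   # launch one thread per element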
17. CUDA memory model
•All threads share Global Memory
•Threads within a block share Shared Memory
•Threads have their own private Registers
•Memory consistency is very relaxed
- Cannot guarantee ordering of instructions across blocks.
- Can insert explicit memory fences/barriers to order threads
within a block
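Again with Numba (same assumptions as above), a sketch of this memory model: threads in a block stage data in Shared Memory and use an explicit barrier to order accesses within the block; no such barrier exists across blocks.

import numpy as np
from numba import cuda, float32

TPB = 128                            # threads per block (compile-time constant)

@cuda.jit
def block_reverse(a, out):
    s = cuda.shared.array(shape=TPB, dtype=float32)  # per-block Shared Memory
    i = cuda.grid(1)
    t = cuda.threadIdx.x
    if i < a.size:
        s[t] = a[i]                  # each thread loads one element
    cuda.syncthreads()               # barrier: orders threads within the block
    if i < a.size:
        out[i] = s[TPB - 1 - t]      # safely read a value another thread wrote

n = 1024                             # a multiple of TPB, to keep the sketch simple
a = np.arange(n, dtype=np.float32)
out = np.zeros_like(a)
block_reverse[n // TPB, TPB](a, out) # out holds each 128-element block reversed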
23. Intra GPU and Inter-Host GPU Connections
• QPI link ~8-12 GB/s
• PCIe ~16-32 GB/s; NVLink ~80 GB/s
• Infiniband ~108-216 Gb/s
[Diagram: a host with 2 CPU sockets, a PCIe bus, 4 GPUs, and an InfiniBand network adapter.]
24. GPU -> Network -> GPU Communications
Oden et al., GGAS: Global GPU Address Spaces for Efficient Communication in Heterogeneous Clusters, IEEE CLUSTER 2013
29. Will GPUs become commodity compute?
•For….
- Gaming GPUs have the best Price/Performance for training
models
- Distributed Deep Learning technology is rapidly improving
• Buy more GPUs to scale out training, HParam Tuning, Ablation
Studies, etc
•Against….
- Nvidia
30. Commodity Fight: AMD Radeon vs Nvidia
                    Nvidia™ 2080Ti                AMD Radeon™ VII
Memory:             11 GB                         16 GB
Framework:          TensorFlow 1.12               TensorFlow 1.13.1
Stack:              CUDA 10.0.130, cuDNN 7.4.1    ROCm 2.3
Model:              ResNet-50                     ResNet-50
Dataset:            ImageNet (synthetic)          ImageNet (synthetic)
FP32 images/sec:    ~322                          ~302
FP16 images/sec:    ~560                          ~415
https://lambdalabs.com/blog/2080-ti-deep-learning-benchmarks/
https://www.phoronix.com/scan.php?page=article&item=nvidia-rtx2080ti-tensorflow&num=2
https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/173
31. AMD Software Stack
[Diagram: the fully open-source ROCm platform stack ("open source foundation for machine learning"). Spark / machine-learning apps run on the latest machine-learning frameworks, with Docker and Kubernetes support; middleware and libraries include Eigen, MIOpen, RCCL, and optimized math (BLAS, FFT, RNG) and communication libraries; programming models: OpenMP, HIP, OpenCL™, Python; drivers are up-streamed to Linux kernel distributions; target devices: GPU, CPU, APU, DLA.]
32. AMD Distributed Training
RCCL:
- Optimized collective communication operations library
- Easy MPI integration
- Support for InfiniBand and RoCE high-speed network fabrics
- ROCm-enabled UCX
- ROCm with ROCmRDMA
[Chart: ResNet-50 multi-GPU scaling (PCIe, CPU parameter server): 1 GPU = 1.00x, 2 GPUs = 1.99x, 4 GPUs = 3.98x, 8 GPUs = 7.64x.]
33. ROCm over Spark/TensorFlow on Hopsworks
•Spark / TensorFlow
applications run
unchanged on ROCm
•Hopsworks runs
Spark/TensorFlow on
YARN and Conda
34. YARN support for ROCm in Hops
A Container is a CGroup that isolates CPU, memory, and GPU resources and has a conda environment and TLS certs.
[Diagram: the YARN ResourceManager schedules containers across NodeManagers; the containers host the Spark Driver and Executors.]
36. Model Parallelism SGD vs Data Parallelism SGD
•Model Parallelism
- Models cannot fit on a
single GPU
- Model is partitioned over
many GPUs
• For example, one layer per
GPU for a ConvNet
- The same data is used to
train the partitioned
models
• E.g., input data enters at the bottom layer (GPU1) and, as it feeds forward to the output layer, passes through many GPUs
•Data Parallelism
- A copy of the model is stored at each worker
- Each worker trains on a different set of samples from the same mini-batch (sketched below)
- Gradients computed at
each worker need to be
aggregated to calculate
the new model for a
mini-batch
- A new model needs to be
broadcast for each
iteration to all workers
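A toy NumPy sketch of data-parallel synchronous SGD for a linear model (purely illustrative, not the Hopsworks implementation): each worker computes a gradient on its shard of the mini-batch, the gradients are averaged, and the updated model would then be broadcast back to all workers.

import numpy as np

def gradient(w, X, y):
    # gradient of mean-squared error for a linear model y ≈ Xw
    return 2.0 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(256, 10)), rng.normal(size=256)
w = np.zeros(10)                       # model replica, copied to each worker
n_workers, lr = 4, 0.1

for step in range(100):
    shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
    grads = [gradient(w, Xs, ys) for Xs, ys in shards]   # one gradient per worker
    w -= lr * np.mean(grads, axis=0)   # aggregate, update, then broadcast w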
40. Synchronous SGD: N/W is the Bottleneck
[Chart: iteration timeline for 1 GPU vs. 4 GPUs; with 4 GPUs each unit of computation is followed by a network (N/W) synchronization phase, so communication dominates the total iteration time.]
Reduce N/W Comms Time, Increase Computation Time
Amdahl’s Law
41. Synchronous SGD Challenges
•Effective size of the batch becomes larger
- Can we train models with very large batch sizes?
•Update time depends on the slowest worker
- Backup workers proposed as a mitigating strategy
42. Facebook: Scaling Synchronous SGD
June 2017: Facebook reduced training time on ImageNet from 2 weeks to 1 hr
https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf
43. Facebook AllReduce Synchronous SGD
•Experiments using ConvNets on ImageNet
- ResNet-50 on up to 256 GPUs
- No loss of accuracy when training with large minibatch
sizes up to 8192 images
- ∼90% scaling efficiency when moving from 8 to 256 GPUs
•Learning rate heuristic
- Make the learning rate proportional to the batch size
•Warm up phase
- Start at a low learning rate that you gradually increase up
to the target learning rate (5 epochs to target rate)
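One common way to implement this heuristic, as a per-epoch sketch of the linear scaling rule with gradual warm-up (the paper interpolates per iteration rather than per epoch; 0.1 and 256 are the paper's reference values):

def scaled_lr(epoch, batch_size, base_lr=0.1, ref_batch=256, warmup_epochs=5):
    target = base_lr * batch_size / ref_batch      # lr proportional to batch size
    if epoch < warmup_epochs:                      # gradual warm-up to the target
        return target * (epoch + 1) / warmup_epochs
    return target

for e in range(7):
    print(e, scaled_lr(e, batch_size=8192))        # ramps up to 3.2, then stays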
46. Instead of Decaying the Learning Rate, Increase Batch Size
•An alternative approach to Facebook’s learning rate heuristic
is to increase the batch size during training
- Quoc Le et al.
•Vast batch size of 65536 images when training Inception-ResNet-V2 using only 2500 iterations (model updates to all GPUs), reaching an accuracy of 77% (cf. Facebook's 76%).
https://arxiv.org/pdf/1711.00489.pdf
63. Concurrency in Ring-AllReduce SGD
•After computing the gradients for a layer, send the
gradients to your neighbor immediately
- Spreads out network traffic over time
63
”Running the model on 40 GPUs takes approximately 650 – 700
milliseconds per iteration, while on a single GPU it takes
approximately 370 milliseconds. Since by our estimate communication
would take 400 milliseconds, we are saving an extra 70 – 120
milliseconds per iteration by overlapping the backpropagation with
the data transfer.”
http://research.baidu.com/bringing-hpc-techniques-deep-learning/
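To make the communication pattern concrete, a toy single-process simulation of ring-allreduce (reduce-scatter followed by all-gather; illustrative only, since NCCL/Horovod implement this on real interconnects):

import numpy as np

def ring_allreduce(grads):
    # grads: one gradient vector per worker; returns the all-reduced copy that
    # each worker ends up with after a reduce-scatter and an all-gather phase.
    n = len(grads)
    chunks = [list(np.array_split(g.astype(float), n)) for g in grads]
    for s in range(n - 1):             # reduce-scatter: pass partial sums around
        for w in range(n):
            c = (w - s - 1) % n        # chunk received from the left neighbour
            chunks[w][c] = chunks[w][c] + chunks[(w - 1) % n][c]
    for s in range(n - 1):             # all-gather: circulate the reduced chunks
        for w in range(n):
            c = (w - s) % n
            chunks[w][c] = chunks[(w - 1) % n][c]
    return [np.concatenate(ch) for ch in chunks]

grads = [np.full(8, i, dtype=float) for i in range(4)]   # 4 simulated GPUs
print(ring_allreduce(grads)[0])        # every worker holds the sum 0+1+2+3 = 6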
64. Support in TensorFlow for AllReduce
•Collective AllReduce in Keras/TensorFlow
- Multi-node collective communication primitives
•Uber Horovod
- Built on NVIDIA Collective Communications Library (NCCL)
- NCCL provides routines such as all-gather, all-reduce,
broadcast, reduce, reduce-scatter over PCIe and NVLink.
67. Hopsworks Project
[Timeline 2017-2019:]
- World's first Hadoop platform to support GPUs-as-a-Resource
- World's fastest HDFS, published at USENIX FAST with Oracle and Spotify
- Winner of the IEEE Scale Challenge 2017 with HopsFS (1.2m ops/sec)
- World's first open-source Feature Store for machine learning
- World's first distributed filesystem to store small files in metadata on NVMe disks
- World's most scalable POSIX-like hierarchical filesystem, with multi-data-center availability (1.6m ops/sec)
- World's first managed Feature Store in the cloud (Hopsworks.ai)
- World's first unified hyperparameter and ablation study parallel programming framework
68. Inner and Outer Loop of Deep Learning
[Diagram: the inner loop (workers 1..N compute gradients ∆1..∆N on the training data, followed by a synchronization step) sits inside the outer loop (a search method consumes the resulting metric and proposes new HParams). http://tiny.cc/51yjdz]
69. Inner and Outer Loop of Deep Learning
[The same diagram, annotated: the inner loop is LEARNING, the outer loop is SEARCH. http://tiny.cc/51yjdz]
71. Parallel Black Box Optimization
Which algorithm to use for search? How to monitor progress? How to aggregate results? Fault tolerance?
[Diagram: a meta-level learning & optimization component draws trials from the search space, dispatches them to a parallel worker queue, and treats each learning run as a black box that returns a metric.]
This should be managed with platform support!
72. Distributed HParam Tuning
[Diagram: the Driver and Executors (each with a conda_env) share HopsFS, which stores TensorBoard logs, models, checkpoints, training data, and logs.]
# RUNS ON THE EXECUTORS
def train(lr, dropout):
    def input_fn():  # return dataset
        ...
    optimizer = ...
    model = ...
    model.add(Conv2D(...))
    model.compile(...)
    model.fit(...)
    model.evaluate(...)

# RUNS ON THE DRIVER
hparams = {'lr': [0.001, 0.0001],
           'dropout': [0.25, 0.5, 0.75]}
experiment.grid_search(train, hparams)
More details: Spark Summit Europe 2018 talk https://www.youtube.com/watch?v=tx6HyoUYGL0
https://github.com/logicalclocks/hops-examples
73. Distributed Training
[Diagram: the Driver and Executors (each with a conda_env) share HopsFS, which stores TensorBoard logs, models, checkpoints, training data, and logs.]
# RUNS ON THE EXECUTORS
def train():
    def input_fn():  # return dataset
        ...
    model = ...
    optimizer = ...
    model.compile(...)
    rc = tf.estimator.RunConfig('CollectiveAllReduceStrategy')
    keras_estimator = tf.keras.estimator.model_to_estimator(...)
    tf.estimator.train_and_evaluate(keras_estimator, input_fn)

# RUNS ON THE DRIVER
experiment.collective_all_reduce(train)
More details: Spark Summit Europe 2018 talk https://www.youtube.com/watch?v=tx6HyoUYGL0
https://github.com/logicalclocks/hops-examples
74. Maggy – Parallel HParam Trials on PySpark
[Diagram: the Driver coordinates long-running tasks Task11..Task1N behind a barrier; tasks stream metrics back and receive new trials or early-stop signals.]
Long running tasks execute many trials, with a global optimizer.
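A sketch of what a Maggy experiment looks like (names such as Searchspace, experiment.lagom, and the reporter callback follow Maggy's published examples from this period; treat them as indicative rather than a definitive API):

from maggy import experiment, Searchspace

sp = Searchspace(lr=('DOUBLE', [1e-4, 1e-2]),
                 dropout=('DOUBLE', [0.25, 0.75]))

def train(lr, dropout, reporter):
    # build and fit the model here ...
    for epoch in range(10):
        acc = 0.0                        # placeholder for a real validation metric
        reporter.broadcast(metric=acc)   # stream metrics to enable early stopping
    return acc

# Long-running Spark tasks execute trials generated by the global optimizer.
experiment.lagom(train, searchspace=sp, optimizer='randomsearch',
                 direction='max', num_trials=20)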
76. ML Model Development
•It's simple, only four steps:
- Explore and Design
- Experimentation: Tune and Search
- Model Training (Distributed)
- Explainability and Ablation Studies
77. Artifacts and Non DRY Code
[Diagram: the same four steps, each producing its own artifacts and duplicated (non-DRY) training code.]
78. Development of ML Models is Iterative
[Diagram: the same four steps, shown as an iterative cycle.]
79. Iterative Development Is a Pain, We Need DRY Code!
•Each step requires different implementations of the training code
[Pipeline: EDA → HParam Tuning → Training (Dist) → Ablation Studies]
80. The Oblivious Training Function
OBLIVIOUS TRAINING FUNCTION

# RUNS ON THE WORKERS
def train():
    def input_fn():  # return dataset
        ...
    model = ...
    optimizer = ...
    model.compile(...)
    rc = tf.estimator.RunConfig('CollectiveAllReduceStrategy')
    keras_estimator = tf.keras.estimator.model_to_estimator(...)
    tf.estimator.train_and_evaluate(keras_estimator, input_fn)

[Pipeline: EDA → HParam Tuning → Training (Dist) → Ablation Studies; the same training function serves every step.]
81. Challenge: Obtrusive Framework Artifacts
•TensorFlow challenges:
- TF_CONFIG
- Distribution Strategy
- Dataset (sharding, DFS)
- Integration in Python: hard from inside a notebook
- Keras vs. Estimator vs. Custom Training Loop
84. Distribution Context
•Single-host vs. parallel multi-host vs. distributed multi-host
[Diagram: single host (one Driver); parallel multi-host (an Experiment Controller on the Driver dispatching independent Worker1..WorkerN); distributed multi-host (Worker1..Worker8 coordinated by the Driver via TF_CONFIG).]
85. Distribution Context
•Single-host vs. parallel multi-host vs. distributed multi-host
[The same diagram, mapped onto the four model-development steps: Explore and Design; Experimentation: Tune and Search; Model Training (Distributed); Explainability and Ablation Studies.]
86. Model Development Best Practices
• Modularize
• Parametrize
• Higher-order training functions
• Use callbacks at runtime
[Diagram: training code factored into Dataset Generation, Model Generation, and Training Logic.]
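A sketch of these practices using Keras (the structure and helper names here are mine, for illustration): dataset generation, model generation, and training logic are separate parametrized functions, composed by a higher-order train() that also accepts runtime callbacks.

import tensorflow as tf

def make_dataset(batch_size):                       # Dataset Generation
    (x, y), _ = tf.keras.datasets.mnist.load_data()
    return tf.data.Dataset.from_tensor_slices(
        (x / 255.0, y)).shuffle(1024).batch(batch_size)

def make_model(lr, dropout):                        # Model Generation
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(dropout),
        tf.keras.layers.Dense(10, activation='softmax')])
    model.compile(optimizer=tf.keras.optimizers.Adam(lr),
                  loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

def train(dataset_fn, model_fn, hparams, callbacks=None):   # Training Logic
    model = model_fn(hparams['lr'], hparams['dropout'])
    model.fit(dataset_fn(hparams['batch_size']), epochs=2, callbacks=callbacks)
    return model

train(make_dataset, make_model,
      {'lr': 1e-3, 'dropout': 0.5, 'batch_size': 64})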
87. Oblivious Training Function as an Abstraction
•Let the system handle the complexities
System takes care of ...
… fixing parameters
… launching the function
… launching trials (parametrized instantiations of the function)
… generating new trials
… collecting and logging results
… setting up TF_CONFIG
… wrapping in a Distribution Strategy
… launching the function as workers
… collecting results
89. Recap: The Solution
•Add Communication and Long Running Tasks
[Diagram: the Driver plus long-running tasks Task11..Task1N, synchronized by a barrier, exchanging metrics and new trials.]
90. What’s New?
•Worker discovery and distribution context set-up
[Diagram: the Driver discovers the workers, then launches the oblivious training function in the appropriate distribution context across Task11..Task1N behind a barrier.]