GPUs, Distributed Deep Learning and Hopsworks
Jim Dowling
Assoc Prof @ KTH – Royal Institute of Technology
CEO at Logical Clocks AB
Leadership & Offices
Stockholm: Box 1263, Isafjordsgatan 22, Kista, Sweden
London: IDEALondon, 69 Wilson St, London, UK
Silicon Valley: 470 Ramona St, Palo Alto, California, USA
Dr. Jim Dowling, CEO | Theo Kakantousis, COO | Prof. Seif Haridi, Chief Scientist | Fabio Buso, VP Engineering | Steffen Grohsschmiedt, Head of Cloud | Shraddha Chouhan, Head of Marketing
www.logicalclocks.com
Affine Transformation
•Matrix multiplications calculate the weighted sums in feed-forward networks
•They map to fused multiply-add (FMA) instructions on a GPU
y = Wx + b, where x = (x1 .. xm) is the input vector, W = (wij) the weight matrix, b = (b1 .. bm) the bias vector, and y = (y1 .. ym) the output.
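As a minimal sketch (NumPy, with made-up dimensions that are not from the slide), the affine transformation of a dense layer is just a matrix-vector product plus a bias:

    import numpy as np

    n_in, n_out = 4, 3               # illustrative sizes
    x = np.random.rand(n_in)         # input vector
    W = np.random.rand(n_out, n_in)  # weight matrix
    b = np.random.rand(n_out)        # bias vector

    y = W @ x + b                    # the affine transformation y = Wx + b
    # On a GPU, every multiply-accumulate inside this product maps to an FMA instruction.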
Convolution Operations
Input Matrix ⊗ Filter = Output (activation map)
for each output map j:
    for each input map k:
        for each pixel (x, y):
            for each kernel element (u, v):
                B[x][y][j] += A[x-u][y-v][k] * K[u][v][k][j]
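The loops above translate almost directly into (deliberately slow) Python. A hedged sketch follows: the shapes are illustrative assumptions, and it uses the cross-correlation convention (A[x+u, y+v]) with a "valid" output size, as most deep-learning frameworks do, rather than the slide's A[x-u][y-v] indexing.

    import numpy as np

    def conv2d_naive(A, K):
        # A: input maps, shape (H, W, in_maps); K: kernels, shape (kh, kw, in_maps, out_maps)
        H, W, in_maps = A.shape
        kh, kw, _, out_maps = K.shape
        B = np.zeros((H - kh + 1, W - kw + 1, out_maps))   # output activation maps
        for j in range(out_maps):                          # for each output map j
            for k in range(in_maps):                       # for each input map k
                for x in range(B.shape[0]):                # for each output pixel x, y
                    for y in range(B.shape[1]):
                        for u in range(kh):                # for each kernel element u, v
                            for v in range(kw):
                                B[x, y, j] += A[x + u, y + v, k] * K[u, v, k, j]
        return B

    out = conv2d_naive(np.random.rand(8, 8, 3), np.random.rand(3, 3, 3, 2))
    print(out.shape)   # (6, 6, 2)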
Von Neumann Architecture
(figure: a CPU connected to memory by an instruction stream and a data stream)
The Memory Hierarchy

Level           Latency      Size               Throughput
Registers       <1 ns        ~6 KiB             -
L1 cache        ~1-2 ns      ~16-64 KB          ~700 GB/s
L2/L3 cache     ~4-10 ns     ~128 KB - 8 MB     ~1-200 GB/s
DRAM            ~50-70 ns    ~32 GB DIMMs       ~10 GB/s per CPU
NVMe            ~10 us       ~2 TB              ~3 GB/s
Magnetic Disk   ~5-10 ms     ~12 TB             ~100 MB/s
CPU gains outperforming Memory gains
SSD/Net gains outperforming Memory gains
https://itblog.sandisk.com/cpu-bandwidth-the-worrisome-2020-trend/
GPU Programming
Intel Xeon E5-2680v4:
Clock speed: 2.4 GHz
4 instructions per cycle with AVX2, 28 cores
2.4 x 4 x 28 = 268.8 Gigaflops double precision
NVIDIA Tesla P100:
Single instruction per cycle
3584 CUDA cores
4.7 Teraflops double precision
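A quick sanity check of the slide's arithmetic in Python (the P100 number is NVIDIA's quoted peak, not a measurement):

    xeon_gflops = 2.4 * 4 * 28    # GHz x instructions/cycle x cores = 268.8 GFLOPS (FP64)
    p100_gflops = 4.7 * 1000      # ~4.7 TFLOPS FP64 peak for the Tesla P100
    print(f"P100 / Xeon E5-2680v4 = {p100_gflops / xeon_gflops:.1f}x")   # ~17.5x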
CPU vs GPU
CPUs, GPUs, Memory, and the PCI-E Bus
(figure: a program runs its sequential code on the CPU, which has its own CPU memory, and offloads compute-intense functions over the PCI-E bus to the GPU, which has its own GPU memory)
Flynn’s Taxonomy
Single-Instruction Multiple-Data (SIMD)
•A single instruction stream applied to data that can be naturally parallelized
 - Matrix multiplication
 - Convolutions
•x86 SIMD support
 - AVX extensions
 - Intel MKL library
(figure, adapted from Berkeley CS 61C: one instruction pool driving several processors, each reading from its own slice of the data pool)
Why is SIMD not dominant?
•It’s hard to program
- Have to identify data parallelism in applications
•Specialized hardware is expensive
- Vector processors used to be expensive, so programming
language support lagged
•SIMD extensions (AVX, SSE)
- Not very wide
- No control flow within SIMD
Nvidia Cuda: SIMT Programming
•Single Instruction Multiple Thread (SIMT)
- Parallel threads that use SIMD hardware
- CUDA, OpenCL are both SIMT programming models
•What is CUDA?
- Nvidia proprietary API and platform
• Hierarchical thread programming model that combines MIMD/SIMD
ideas
- Lots of Libraries
• CuDNN - Deep Neural Network library
• CUBLAS - CUDA Basic Linear Algebra Subroutines library
• CUDART - CUDA RunTime library
• NVIDIA Collective Communications Library (NCCL)
Cuda SIMT
•Many parallel threads of instructions
- Each thread is responsible for a single data input
•Threads are grouped into blocks (<1024) (blocks are
mapped to SMXs)
- Hardware divides blocks into SIMD groups or warps
- Warps of threads are executed together as a single SIMD
instruction
•Threads distinguish computation by querying their
grid/block location (think thread IDs)
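A minimal sketch of this idea from Python, assuming Numba's CUDA support and a CUDA-capable GPU are available (Numba is not used in the slides): each thread computes its global index from its grid/block location and handles exactly one data element.

    import numpy as np
    from numba import cuda

    @cuda.jit
    def vec_add(a, b, out):
        i = cuda.grid(1)            # global thread id = blockIdx.x * blockDim.x + threadIdx.x
        if i < out.size:            # guard: the last block may be only partially full
            out[i] = a[i] + b[i]

    n = 1 << 20
    a = np.ones(n, dtype=np.float32)
    b = np.ones(n, dtype=np.float32)
    out = np.zeros(n, dtype=np.float32)

    threads_per_block = 256         # a warp is 32 threads; blocks are limited to 1024 threads
    blocks = (n + threads_per_block - 1) // threads_per_block
    vec_add[blocks, threads_per_block](a, b, out)   # launch the SIMT kernel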
CUDA memory model
•All threads share Global Memory
•Threads within a block share Shared Memory
•Threads have their own private Registers
•Memory consistency is very relaxed
- Cannot guarantee ordering of instructions across blocks.
- Can insert explicit memory fences/barriers to order threads
within a block
CUDA memory model
(figure: one Global memory shared by all blocks; per-block Shared memory used by Thread 0 and Thread 1 in Block 0 and Block 1; private Registers for each thread)
Reduced Precision
Reduced Precision for Inference
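A tiny NumPy illustration of why reduced precision helps inference, as an assumption-laden sketch (a real deployment would use a framework's FP16/INT8 quantization tooling rather than a plain cast):

    import numpy as np

    w_fp32 = np.random.randn(1024, 1024).astype(np.float32)   # trained weights
    w_fp16 = w_fp32.astype(np.float16)                         # cast for inference

    print(w_fp32.nbytes // 1024, "KB ->", w_fp16.nbytes // 1024, "KB")   # half the memory and bandwidth
    print("max rounding error:", np.abs(w_fp32 - w_fp16.astype(np.float32)).max())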
Specialized Instructions – Tensor Cores
Connecting GPUs over the Network
Intra-Host and Inter-Host GPU Connections
• QPI link ~8-12 GB/s
• PCIe ~16-32 GB/s, NVLink ~80 GB/s
• Infiniband ~108-216 Gb/s
(figure: a host with 2 CPU sockets, a PCIe bus, 4 GPUs, and an Infiniband NIC)
GPU -> Network -> GPU Communications
Oden et al., GGAS: Global GPU Address Spaces for Efficient Communication in Heterogeneous Clusters, IEEE CLUSTER 2013
NVLink – a GPU-to-GPU Bus
NVLink: ~80 GB/s
Single-Root Complex – Commodity Server
Benefit of NVLink
The Deep Learning world beyond Nvidia
Will GPUs become commodity compute?
•For….
- Gaming GPUs have the best Price/Performance for training
models
- Distributed Deep Learning technology is rapidly improving
• Buy more GPUs to scale out training, HParam Tuning, Ablation
Studies, etc
•Against….
- Nvidia
Commodity Fight: AMD Radeon vs Nvidia

Nvidia™ 2080Ti (11 GB), TensorFlow 1.12, CUDA 10.0.130, cuDNN 7.4.1
Model: ResNet-50, Dataset: ImageNet (synthetic)
FP32: ~322 images/sec, FP16: ~560 images/sec

AMD Radeon™ VII (16 GB), TensorFlow 1.13.1, ROCm 2.3
Model: ResNet-50, Dataset: ImageNet (synthetic)
FP32: ~302 images/sec, FP16: ~415 images/sec

Sources:
https://lambdalabs.com/blog/2080-ti-deep-learning-benchmarks/
https://www.phoronix.com/scan.php?page=article&item=nvidia-rtx2080ti-tensorflow&num=2
https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/173
AMD Software Stack - Fully Open Source ROCm Platform
• Latest machine learning frameworks; Docker and Kubernetes support
• Optimized math & communication libraries (BLAS, FFT, RNG, MIOpen, RCCL)
• Up-streamed for Linux kernel distributions
• Programming models: OpenMP, HIP, OpenCL™, Python
• Devices: GPU, CPU, APU, DLA
(figure: the stack from devices up through the ROCm platform, middleware and libraries such as Eigen, frameworks, and Spark / machine-learning apps - an open source foundation for machine learning)
AMD Distributed Training
• RCCL: optimized collective communication operations library with easy MPI integration
• Support for Infiniband and RoCE high-speed network fabrics
• ROCm-enabled UCX, ROCm with ROCmRDMA
Multi-GPU scaling, ResNet-50 (PCIe, CPU parameter server): 1 GPU = 1.00x, 2 GPUs = 1.99x, 4 GPUs = 3.98x, 8 GPUs = 7.64x
ROCm over Spark/TensorFlow on Hopsworks
•Spark / TensorFlow applications run unchanged on ROCm
•Hopsworks runs Spark/TensorFlow on YARN and Conda
YARN support for ROCm in Hops
A Container is a CGroup that isolates CPU, memory, and GPU resources and has a conda environment and TLS certs.
(figure: the YARN Resource Manager schedules containers on Node Managers; the Spark Driver and Executors each run inside such a container)
Distributed Stochastic Gradient Descent
Model Parallelism SGD vs Data Parallelism SGD
•Model Parallelism
 - Used when the model cannot fit on a single GPU
 - The model is partitioned over many GPUs (for example, one layer per GPU for a ConvNet)
 - The same data is used to train the partitioned model: input data enters at the bottom layer (GPU1) and, as it feeds forward to the output layer, it passes through many GPUs
•Data Parallelism
 - A copy of the model is stored at each worker
 - Each worker trains on a different set of samples from the same mini-batch
 - Gradients computed at each worker are aggregated to calculate the new model for the mini-batch
 - The new model is broadcast to all workers on each iteration (see the sketch below)
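As a toy sketch of the data-parallel update (the aggregation here is a plain average standing in for an allreduce or a parameter server; sizes and the learning rate are made up):

    import numpy as np

    def data_parallel_sgd_step(weights, worker_grads, lr=0.1):
        # Aggregate the gradients each worker computed on its shard of the mini-batch ...
        avg_grad = sum(worker_grads) / len(worker_grads)
        # ... apply one update, then broadcast the new model to every worker.
        new_weights = weights - lr * avg_grad
        return [new_weights.copy() for _ in worker_grads]

    weights = np.zeros(4)
    grads = [np.random.randn(4) for _ in range(3)]   # one gradient per worker
    replicas = data_parallel_sgd_step(weights, grads)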
Distributed SGD with Data Parallelism
Asynchronous SGD vs Synchronous SGD
Synchronous SGD: N/W is the Bottleneck
(figure: per-iteration timeline for 1 GPU vs 4 GPUs; with 4 GPUs the compute per iteration shrinks, but the network (N/W) communication phase dominates each iteration)
Reduce N/W Comms Time, Increase Computation Time
Amdahl’s Law
Synchronous SGD Challenges
•Effective size of the batch becomes larger
- Can we train models with very large batch sizes?
•Update time depends on the slowest worker
- Backup workers proposed as a mitigating strategy
Facebook: Scaling Synchronous SGD
June 2017: Facebook reduced training time on ImageNet from 2 weeks to 1 hr
https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf
Facebook AllReduce Synchronous SGD
•Experiments using ConvNets on ImageNet
 - ResNet-50 on up to 256 GPUs
 - No loss of accuracy when training with large minibatch sizes up to 8192 images
 - ~90% scaling efficiency when moving from 8 to 256 GPUs
•Learning rate heuristic
 - Make the learning rate proportional to the batch size
•Warm-up phase
 - Start at a low learning rate and gradually increase it to the target learning rate (5 epochs to reach the target rate)
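A hedged sketch of the two heuristics in plain Python (the constants follow the paper's reference setting of lr = 0.1 for a batch of 256; the warm-up here is per step rather than per epoch for brevity):

    def scaled_lr(base_lr, base_batch, batch):
        # Linear scaling rule: grow the learning rate in proportion to the batch size.
        return base_lr * batch / base_batch

    def warmup_lr(step, target_lr, warmup_steps, start_lr=0.0):
        # Gradual warm-up: ramp linearly from start_lr to target_lr, then hold.
        if step >= warmup_steps:
            return target_lr
        return start_lr + (target_lr - start_lr) * step / warmup_steps

    target = scaled_lr(base_lr=0.1, base_batch=256, batch=8192)   # = 3.2 for a batch of 8192
    for step in range(6):
        print(step, warmup_lr(step, target, warmup_steps=5))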
Facebook: Increasing Batch Size
Facebook: Learning Rate Warmup Evaluation
Instead of Decaying the Learning Rate, Increase Batch Size
•An alternative approach to Facebook's learning rate heuristic is to increase the batch size during training
 - Quoc Le et al.
•A vast batch size of 65536 images when training Inception-ResNet-V2, using only 2500 iterations (model updates to all GPUs), reaches an accuracy of 77% (cf. Facebook's 76%).
https://arxiv.org/pdf/1711.00489.pdf
Ring-AllReduce Distributed SGD
Ring-AllReduce vs Parameter Server(s)
(figure: left, Ring-AllReduce - GPUs 0-3 arranged in a ring, each sending to one neighbour and receiving from the other; right, GPUs 1-4 all exchanging gradients with central parameter server(s))
Network Bandwidth is the Bottleneck for Distributed Training
One Slow Link/GPU/Bus slows down Training
Ring-AllReduce
Ring-AllReduce Algorithm
•This example of AllReduce sums all elements in N
arrays using N GPUs in parallel.
Steps (one per slide):
1. First of N-1 iterations of scatter-reduce
2. Intermediate sums in step 1 of scatter-reduce
3. Further iterations of scatter-reduce
4. Final state after all scatter-reduce transfers
5. First iteration of the allgather
6. Further allgather data transfers
7. Final state after all allgather transfers
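The scatter-reduce and allgather phases can be simulated in a few lines of NumPy. This is a toy, single-process sketch (no real communication; the chunk indexing follows the standard ring-allreduce formulation, not any particular library):

    import numpy as np

    def ring_allreduce(arrays):
        # Each of the n "GPUs" holds one array, split into n chunks.
        n = len(arrays)
        chunks = [np.array_split(a.astype(float), n) for a in arrays]

        # Scatter-reduce: after n-1 steps, GPU i holds the full sum of chunk (i+1) % n.
        for step in range(n - 1):
            for i in range(n):
                dst, c = (i + 1) % n, (i - step) % n
                chunks[dst][c] = chunks[dst][c] + chunks[i][c]   # "send" and add

        # Allgather: circulate the fully reduced chunks until every GPU has all of them.
        for step in range(n - 1):
            for i in range(n):
                dst, c = (i + 1) % n, (i + 1 - step) % n
                chunks[dst][c] = chunks[i][c].copy()             # "send" and overwrite

        return [np.concatenate(c) for c in chunks]

    gpus = [np.arange(8) * (g + 1) for g in range(4)]             # 4 GPUs, 8 values each
    result = ring_allreduce(gpus)
    assert all(np.array_equal(r, sum(gpus)) for r in result)      # everyone ends up with the same sum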
Concurrency in Ring-AllReduce SGD
•After computing the gradients for a layer, send them to your neighbour immediately
 - Spreads out network traffic over time
”Running the model on 40 GPUs takes approximately 650 – 700
milliseconds per iteration, while on a single GPU it takes
approximately 370 milliseconds. Since by our estimate communication
would take 400 milliseconds, we are saving an extra 70 – 120
milliseconds per iteration by overlapping the backpropagation with
the data transfer.”
http://research.baidu.com/bringing-hpc-techniques-deep-learning/
Support in TensorFlow For AllReduce
•Collective AllReduce in Keras/TensorFlow
 - Multi-node collective communication primitives
•Uber Horovod
 - Built on the NVIDIA Collective Communications Library (NCCL)
 - NCCL provides routines such as all-gather, all-reduce, broadcast, reduce, and reduce-scatter over PCIe and NVLink
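A minimal sketch of how Horovod is typically wired into a tf.keras training script; exact APIs and versions vary, and the tiny model and random data are placeholders rather than anything from the slides:

    import numpy as np
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()                                           # one process per GPU
    gpus = tf.config.list_physical_devices('GPU')
    if gpus:                                             # pin each process to its local GPU
        tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

    model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(32,))])
    opt = tf.keras.optimizers.SGD(0.01 * hvd.size())     # linear LR scaling with the number of workers
    opt = hvd.DistributedOptimizer(opt)                  # gradients are averaged with an NCCL allreduce
    model.compile(loss='mse', optimizer=opt)

    x, y = np.random.rand(512, 32), np.random.rand(512, 10)
    model.fit(x, y, batch_size=64, epochs=1,
              callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],  # sync initial weights
              verbose=1 if hvd.rank() == 0 else 0)
    # Launched with, e.g.: horovodrun -np 4 python train.py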
Current State of the Art (Microsoft)
Distributed Deep Learning on Hopsworks
https://www.oreilly.com/content/distributed-tensorflow/
Hopsworks Project - milestones 2017-2019:
- World's first Hadoop platform to support GPUs-as-a-Resource
- World's fastest HDFS, published at USENIX FAST with Oracle and Spotify
- Winner of the IEEE Scale Challenge 2017 with HopsFS - 1.2m ops/sec
- World's first open-source Feature Store for Machine Learning
- World's first distributed filesystem to store small files in metadata on NVMe disks
- World's most scalable POSIX-like hierarchical filesystem, with multi-data-center availability at 1.6m ops/sec
- World's first managed feature store in the cloud (Hopsworks.ai)
- World's first unified hyperparameter and ablation-study parallel programming framework
Inner and Outer Loop of Deep Learning
•Inner loop (LEARNING): worker1 .. workerN train on the training data, producing updates ∆1 .. ∆N that are combined in a synchronization step.
•Outer loop (SEARCH): a search method proposes hyperparameters (HParams) to the inner loop and consumes the resulting metric.
http://tiny.cc/51yjdz
Black Box Optimization
The learning process is treated as a black box: it takes a point from the search space and returns a metric; a meta-level learning & optimization step uses that metric to choose the next point to try.
Parallel Black Box Optimization
Which algorithm to use for search? How to monitor progress? Fault tolerance? How to aggregate results?
(figure: the same black-box loop, but trials are dispatched to a parallel worker queue)
This should be managed with platform support!
Distributed HParam Tuning
(figure: the Driver coordinates Executors, each with its own conda_env; HopsFS stores TensorBoard logs, models, checkpoints, training data, and logs)
# RUNS ON THE EXECUTORS
def train(lr, dropout):
    def input_fn():  # return dataset
        ...
    optimizer = ...
    model = ...
    model.add(Conv2D(...))
    model.compile(...)
    model.fit(...)
    model.evaluate(...)

# RUNS ON THE DRIVER
HParams = {'lr': [0.001, 0.0001],
           'dropout': [0.25, 0.5, 0.75]}
experiment.grid_search(train, HParams)
More details: Spark Summit Europe 2018 talk https://www.youtube.com/watch?v=tx6HyoUYGL0
https://github.com/logicalclocks/hops-examples
Distributed Training
(same figure as above: Driver, Executors with conda_env, and HopsFS for TensorBoard logs, models, checkpoints, and training data)
# RUNS ON THE EXECUTORS
def train():
    def input_fn():  # return dataset
        ...
    model = ...
    optimizer = ...
    model.compile(...)
    rc = tf.estimator.RunConfig('CollectiveAllReduceStrategy')
    keras_estimator = tf.keras.estimator.model_to_estimator(...)
    tf.estimator.train_and_evaluate(keras_estimator, input_fn)

# RUNS ON THE DRIVER
experiment.collective_all_reduce(train)
More details: Spark Summit Europe 2018 talk https://www.youtube.com/watch?v=tx6HyoUYGL0
https://github.com/logicalclocks/hops-examples
Maggy – Parallel HParam Trials on PySpark
(figure: long-running tasks Task11 .. Task1N behind a barrier exchange metrics, new trials, and early-stop signals with the Driver)
Long running tasks execute many trials, with a global optimizer.
ML Model Development
•A simplified view - it's simple, only four steps:
 1. Explore and Design
 2. Experimentation: Tune and Search
 3. Model Training (Distributed)
 4. Explainability and Ablation Studies
Artifacts and Non-DRY Code: the same four steps, each producing its own artifacts and copies of the training code.
Development of ML Models is Iterative: you cycle through the four steps repeatedly.
Iterative Development Is a Pain, We Need DRY Code!
•Each step requires a different implementation of the training code: EDA | HParam Tuning | Training (Dist) | Ablation Studies
The Oblivious Training Function

OBLIVIOUS TRAINING FUNCTION

# RUNS ON THE WORKERS
def train():
    def input_fn():  # return dataset
        ...
    model = ...
    optimizer = ...
    model.compile(...)
    rc = tf.estimator.RunConfig('CollectiveAllReduceStrategy')
    keras_estimator = tf.keras.estimator.model_to_estimator(...)
    tf.estimator.train_and_evaluate(keras_estimator, input_fn)

EDA | HParam Tuning | Training (Dist) | Ablation Studies
Challenge: Obtrusive Framework Artifacts
•TensorFlow challenges:
 - TF_CONFIG
 - Distribution Strategy
 - Dataset (sharding, DFS)
 - Integration in Python - hard from inside a notebook
 - Keras vs. Estimator vs. Custom Training Loop
Trend(s) in DL: Productive High-Level APIs
(figure: the idea -> experiment -> results loop rests on tracking, visualization, frameworks, and infrastructure - e.g. Hopsworks (open source), Databricks, Apache Spark, and the cloud providers)
How do we keep High-Level Code Transparent?

def dataset(batch_size):
    (x_train, y_train) = load_data()
    x_train = x_train / np.float32(255)
    y_train = y_train.astype(np.int64)
    train_dataset = tf.data.Dataset.from_tensor_slices(
        (x_train, y_train)).shuffle(60000).repeat().batch(batch_size)
    return train_dataset

def build_and_compile_cnn_model(lr):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10)
    ])
    model.compile(
        loss=SparseCategoricalCrossentropy(from_logits=True),
        optimizer=SGD(learning_rate=lr))
    return model

NO CHANGES! (the slide shows this code twice, side by side - it is identical in the single-host and distributed contexts)
Distribution Context
•Single-host vs. parallel multi-host vs. distributed multi-host
(figure: a single host; a Driver acting as experiment controller for parallel Worker1 .. WorkerN; and a Driver plus TF_CONFIG wiring Worker1 .. Worker8 for distributed training)
Explore and Design -> Experimentation: Tune and Search -> Model Training (Distributed) -> Explainability and Ablation Studies
Model Development Best Practices
• Modularize
• Parametrize
• Higher-order training functions
• Usage of callbacks at runtime
Structure the training code into: Dataset Generation, Model Generation, Training Logic.
Oblivious Training Function as an Abstraction
•Let the system handle the complexities. The system takes care of ...
 ... fixing parameters
 ... launching the function
 ... launching trials (parametrized instantiations of the function)
 ... generating new trials
 ... collecting and logging results
 ... setting up TF_CONFIG
 ... wrapping in a Distribution Strategy
 ... launching the function as workers
 ... collecting results
Maggy - Asynchronous Trials on Spark
•Spark is bulk-synchronous
(figure: three consecutive Spark stages, each with tasks Task1..TaskN ending at a barrier before reporting Metrics to the Driver via HopsFS; tasks that finish early, or are early-stopped, leave wasted compute while waiting at each barrier)
Recap: The Solution
•Add communication and long-running tasks
(figure: long-running tasks Task11 .. Task1N exchange metrics and new trials with the Driver, with a single barrier at the end)
What's New?
•Worker discovery and distribution context set-up
(figure: the Driver discovers the workers Task11 .. Task1N and launches the oblivious training function in the chosen distribution context, with a single barrier at the end)
What’s New: Distribution Context
sp = maggy.optimization.Searchspace(...)
dist_strat = tf.keras.distribute.MirroredStrategy(...)
ab = maggy.ablation.AblationStudy(...)

maggy.set_context('optimization')
maggy.lagom(training_function, sp)

maggy.set_context('distributed_training')
maggy.lagom(training_function, dist_strat)

maggy.set_context('ablation')
maggy.lagom(training_function, ab)
Contributors – Thanks!
Contributions from colleagues:
- Robin Andersson @robzor92
- Sina Sheikholeslami @cutlash
- Kim Hammar @KimHammar1
- Alex Ormenisan @alex_ormenisan
Maggy: https://github.com/logicalclocks/maggy or https://maggy.readthedocs.io/en/latest/
Show us some love!
@hopsworks
http://github.com/logicalclocks/hopsworks
More Related Content

What's hot

HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Updateinside-BigData.com
 
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER TutorialSCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER TutorialGanesan Narayanasamy
 
Using GPUs to Handle Big Data with Java
Using GPUs to Handle Big Data with JavaUsing GPUs to Handle Big Data with Java
Using GPUs to Handle Big Data with JavaTim Ellison
 
Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.J On The Beach
 
Tesla Accelerated Computing Platform
Tesla Accelerated Computing PlatformTesla Accelerated Computing Platform
Tesla Accelerated Computing Platforminside-BigData.com
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...Ganesan Narayanasamy
 
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...NVIDIA Taiwan
 
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...Indrajit Poddar
 
Ucx an open source framework for hpc network ap is and beyond
Ucx  an open source framework for hpc network ap is and beyondUcx  an open source framework for hpc network ap is and beyond
Ucx an open source framework for hpc network ap is and beyondinside-BigData.com
 
"Fast Deployment of Low-power Deep Learning on CEVA Vision Processors," a Pre...
"Fast Deployment of Low-power Deep Learning on CEVA Vision Processors," a Pre..."Fast Deployment of Low-power Deep Learning on CEVA Vision Processors," a Pre...
"Fast Deployment of Low-power Deep Learning on CEVA Vision Processors," a Pre...Edge AI and Vision Alliance
 
GPU and Deep learning best practices
GPU and Deep learning best practicesGPU and Deep learning best practices
GPU and Deep learning best practicesLior Sidi
 

What's hot (20)

BSC LMS DDL
BSC LMS DDL BSC LMS DDL
BSC LMS DDL
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
IBM HPC Transformation with AI
IBM HPC Transformation with AI IBM HPC Transformation with AI
IBM HPC Transformation with AI
 
SNAP MACHINE LEARNING
SNAP MACHINE LEARNINGSNAP MACHINE LEARNING
SNAP MACHINE LEARNING
 
CFD on Power
CFD on Power CFD on Power
CFD on Power
 
POWER10 innovations for HPC
POWER10 innovations for HPCPOWER10 innovations for HPC
POWER10 innovations for HPC
 
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER TutorialSCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
 
Ac922 cdac webinar
Ac922 cdac webinarAc922 cdac webinar
Ac922 cdac webinar
 
Using GPUs to Handle Big Data with Java
Using GPUs to Handle Big Data with JavaUsing GPUs to Handle Big Data with Java
Using GPUs to Handle Big Data with Java
 
Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.
 
Summit workshop thompto
Summit workshop thomptoSummit workshop thompto
Summit workshop thompto
 
Tesla Accelerated Computing Platform
Tesla Accelerated Computing PlatformTesla Accelerated Computing Platform
Tesla Accelerated Computing Platform
 
2018 bsc power9 and power ai
2018   bsc power9 and power ai 2018   bsc power9 and power ai
2018 bsc power9 and power ai
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
 
OpenPOWER Webinar
OpenPOWER Webinar OpenPOWER Webinar
OpenPOWER Webinar
 
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
 
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
 
Ucx an open source framework for hpc network ap is and beyond
Ucx  an open source framework for hpc network ap is and beyondUcx  an open source framework for hpc network ap is and beyond
Ucx an open source framework for hpc network ap is and beyond
 
"Fast Deployment of Low-power Deep Learning on CEVA Vision Processors," a Pre...
"Fast Deployment of Low-power Deep Learning on CEVA Vision Processors," a Pre..."Fast Deployment of Low-power Deep Learning on CEVA Vision Processors," a Pre...
"Fast Deployment of Low-power Deep Learning on CEVA Vision Processors," a Pre...
 
GPU and Deep learning best practices
GPU and Deep learning best practicesGPU and Deep learning best practices
GPU and Deep learning best practices
 

Similar to Invited Lecture on GPUs and Distributed Deep Learning at Uppsala University

lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxssuser413a98
 
FPGAs in the cloud? (October 2017)
FPGAs in the cloud? (October 2017)FPGAs in the cloud? (October 2017)
FPGAs in the cloud? (October 2017)Julien SIMON
 
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Lablup Inc.
 
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapGeorge Markomanolis
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetupGanesan Narayanasamy
 
Introduction to HPC & Supercomputing in AI
Introduction to HPC & Supercomputing in AIIntroduction to HPC & Supercomputing in AI
Introduction to HPC & Supercomputing in AITyrone Systems
 
Design installation-commissioning-red raider-cluster-ttu
Design installation-commissioning-red raider-cluster-ttuDesign installation-commissioning-red raider-cluster-ttu
Design installation-commissioning-red raider-cluster-ttuAlan Sill
 
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
Making the most out of Heterogeneous Chips with CPU, GPU and FPGAMaking the most out of Heterogeneous Chips with CPU, GPU and FPGA
Making the most out of Heterogeneous Chips with CPU, GPU and FPGAFacultad de Informática UCM
 
HPC Infrastructure To Solve The CFD Grand Challenge
HPC Infrastructure To Solve The CFD Grand ChallengeHPC Infrastructure To Solve The CFD Grand Challenge
HPC Infrastructure To Solve The CFD Grand ChallengeAnand Haridass
 
Mauricio breteernitiz hpc-exascale-iscte
Mauricio breteernitiz hpc-exascale-iscteMauricio breteernitiz hpc-exascale-iscte
Mauricio breteernitiz hpc-exascale-isctembreternitz
 
NVIDIA Rapids presentation
NVIDIA Rapids presentationNVIDIA Rapids presentation
NVIDIA Rapids presentationtestSri1
 
Accelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learningAccelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learningDataWorks Summit
 
FutureGrid Computing Testbed as a Service
 FutureGrid Computing Testbed as a Service FutureGrid Computing Testbed as a Service
FutureGrid Computing Testbed as a ServiceGeoffrey Fox
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networksinside-BigData.com
 

Similar to Invited Lecture on GPUs and Distributed Deep Learning at Uppsala University (20)

lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptx
 
FPGAs in the cloud? (October 2017)
FPGAs in the cloud? (October 2017)FPGAs in the cloud? (October 2017)
FPGAs in the cloud? (October 2017)
 
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
 
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmap
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
 
Manycores for the Masses
Manycores for the MassesManycores for the Masses
Manycores for the Masses
 
Introduction to HPC & Supercomputing in AI
Introduction to HPC & Supercomputing in AIIntroduction to HPC & Supercomputing in AI
Introduction to HPC & Supercomputing in AI
 
Design installation-commissioning-red raider-cluster-ttu
Design installation-commissioning-red raider-cluster-ttuDesign installation-commissioning-red raider-cluster-ttu
Design installation-commissioning-red raider-cluster-ttu
 
GIST AI-X Computing Cluster
GIST AI-X Computing ClusterGIST AI-X Computing Cluster
GIST AI-X Computing Cluster
 
uCluster
uClusteruCluster
uCluster
 
E3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - SundanceE3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - Sundance
 
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
Making the most out of Heterogeneous Chips with CPU, GPU and FPGAMaking the most out of Heterogeneous Chips with CPU, GPU and FPGA
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
 
Current Trends in HPC
Current Trends in HPCCurrent Trends in HPC
Current Trends in HPC
 
HPC Infrastructure To Solve The CFD Grand Challenge
HPC Infrastructure To Solve The CFD Grand ChallengeHPC Infrastructure To Solve The CFD Grand Challenge
HPC Infrastructure To Solve The CFD Grand Challenge
 
Mauricio breteernitiz hpc-exascale-iscte
Mauricio breteernitiz hpc-exascale-iscteMauricio breteernitiz hpc-exascale-iscte
Mauricio breteernitiz hpc-exascale-iscte
 
Rapids: Data Science on GPUs
Rapids: Data Science on GPUsRapids: Data Science on GPUs
Rapids: Data Science on GPUs
 
NVIDIA Rapids presentation
NVIDIA Rapids presentationNVIDIA Rapids presentation
NVIDIA Rapids presentation
 
Accelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learningAccelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learning
 
FutureGrid Computing Testbed as a Service
 FutureGrid Computing Testbed as a Service FutureGrid Computing Testbed as a Service
FutureGrid Computing Testbed as a Service
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networks
 

More from Jim Dowling

ARVC and flecainide case report[EI] Jim.docx.pdf
ARVC and flecainide case report[EI] Jim.docx.pdfARVC and flecainide case report[EI] Jim.docx.pdf
ARVC and flecainide case report[EI] Jim.docx.pdfJim Dowling
 
PyData Berlin 2023 - Mythical ML Pipeline.pdf
PyData Berlin 2023 - Mythical ML Pipeline.pdfPyData Berlin 2023 - Mythical ML Pipeline.pdf
PyData Berlin 2023 - Mythical ML Pipeline.pdfJim Dowling
 
Serverless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData SeattleServerless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData SeattleJim Dowling
 
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdfPyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdfJim Dowling
 
_Python Ireland Meetup - Serverless ML - Dowling.pdf
_Python Ireland Meetup - Serverless ML - Dowling.pdf_Python Ireland Meetup - Serverless ML - Dowling.pdf
_Python Ireland Meetup - Serverless ML - Dowling.pdfJim Dowling
 
Building Hopsworks, a cloud-native managed feature store for machine learning
Building Hopsworks, a cloud-native managed feature store for machine learning Building Hopsworks, a cloud-native managed feature store for machine learning
Building Hopsworks, a cloud-native managed feature store for machine learning Jim Dowling
 
Real-Time Recommendations with Hopsworks and OpenSearch - MLOps World 2022
Real-Time Recommendations  with Hopsworks and OpenSearch - MLOps World 2022Real-Time Recommendations  with Hopsworks and OpenSearch - MLOps World 2022
Real-Time Recommendations with Hopsworks and OpenSearch - MLOps World 2022Jim Dowling
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupJim Dowling
 
Hops fs huawei internal conference july 2021
Hops fs huawei internal conference july 2021Hops fs huawei internal conference july 2021
Hops fs huawei internal conference july 2021Jim Dowling
 
Hopsworks MLOps World talk june 21
Hopsworks MLOps World talk june 21Hopsworks MLOps World talk june 21
Hopsworks MLOps World talk june 21Jim Dowling
 
Hopsworks Feature Store 2.0 a new paradigm
Hopsworks Feature Store  2.0   a new paradigmHopsworks Feature Store  2.0   a new paradigm
Hopsworks Feature Store 2.0 a new paradigmJim Dowling
 
Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Jim Dowling
 
GANs for Anti Money Laundering
GANs for Anti Money LaunderingGANs for Anti Money Laundering
GANs for Anti Money LaunderingJim Dowling
 
Berlin buzzwords 2020-feature-store-dowling
Berlin buzzwords 2020-feature-store-dowlingBerlin buzzwords 2020-feature-store-dowling
Berlin buzzwords 2020-feature-store-dowlingJim Dowling
 
Hopsworks data engineering melbourne april 2020
Hopsworks   data engineering melbourne april 2020Hopsworks   data engineering melbourne april 2020
Hopsworks data engineering melbourne april 2020Jim Dowling
 
The Bitter Lesson of ML Pipelines
The Bitter Lesson of ML Pipelines The Bitter Lesson of ML Pipelines
The Bitter Lesson of ML Pipelines Jim Dowling
 
Asynchronous Hyperparameter Search with Spark on Hopsworks and Maggy
Asynchronous Hyperparameter Search with Spark on Hopsworks and MaggyAsynchronous Hyperparameter Search with Spark on Hopsworks and Maggy
Asynchronous Hyperparameter Search with Spark on Hopsworks and MaggyJim Dowling
 
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleHopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleJim Dowling
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Jim Dowling
 
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019Jim Dowling
 

More from Jim Dowling (20)

ARVC and flecainide case report[EI] Jim.docx.pdf
ARVC and flecainide case report[EI] Jim.docx.pdfARVC and flecainide case report[EI] Jim.docx.pdf
ARVC and flecainide case report[EI] Jim.docx.pdf
 
PyData Berlin 2023 - Mythical ML Pipeline.pdf
PyData Berlin 2023 - Mythical ML Pipeline.pdfPyData Berlin 2023 - Mythical ML Pipeline.pdf
PyData Berlin 2023 - Mythical ML Pipeline.pdf
 
Serverless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData SeattleServerless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData Seattle
 
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdfPyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
 
_Python Ireland Meetup - Serverless ML - Dowling.pdf
_Python Ireland Meetup - Serverless ML - Dowling.pdf_Python Ireland Meetup - Serverless ML - Dowling.pdf
_Python Ireland Meetup - Serverless ML - Dowling.pdf
 
Building Hopsworks, a cloud-native managed feature store for machine learning
Building Hopsworks, a cloud-native managed feature store for machine learning Building Hopsworks, a cloud-native managed feature store for machine learning
Building Hopsworks, a cloud-native managed feature store for machine learning
 
Real-Time Recommendations with Hopsworks and OpenSearch - MLOps World 2022
Real-Time Recommendations  with Hopsworks and OpenSearch - MLOps World 2022Real-Time Recommendations  with Hopsworks and OpenSearch - MLOps World 2022
Real-Time Recommendations with Hopsworks and OpenSearch - MLOps World 2022
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science Meetup
 
Hops fs huawei internal conference july 2021
Hops fs huawei internal conference july 2021Hops fs huawei internal conference july 2021
Hops fs huawei internal conference july 2021
 
Hopsworks MLOps World talk june 21
Hopsworks MLOps World talk june 21Hopsworks MLOps World talk june 21
Hopsworks MLOps World talk june 21
 
Hopsworks Feature Store 2.0 a new paradigm
Hopsworks Feature Store  2.0   a new paradigmHopsworks Feature Store  2.0   a new paradigm
Hopsworks Feature Store 2.0 a new paradigm
 
Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks
 
GANs for Anti Money Laundering
GANs for Anti Money LaunderingGANs for Anti Money Laundering
GANs for Anti Money Laundering
 
Berlin buzzwords 2020-feature-store-dowling
Berlin buzzwords 2020-feature-store-dowlingBerlin buzzwords 2020-feature-store-dowling
Berlin buzzwords 2020-feature-store-dowling
 
Hopsworks data engineering melbourne april 2020
Hopsworks   data engineering melbourne april 2020Hopsworks   data engineering melbourne april 2020
Hopsworks data engineering melbourne april 2020
 
The Bitter Lesson of ML Pipelines
The Bitter Lesson of ML Pipelines The Bitter Lesson of ML Pipelines
The Bitter Lesson of ML Pipelines
 
Asynchronous Hyperparameter Search with Spark on Hopsworks and Maggy
Asynchronous Hyperparameter Search with Spark on Hopsworks and MaggyAsynchronous Hyperparameter Search with Spark on Hopsworks and Maggy
Asynchronous Hyperparameter Search with Spark on Hopsworks and Maggy
 
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleHopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, Sunnyvale
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019
 
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
 

Recently uploaded

Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 

Recently uploaded (20)

Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 

Invited Lecture on GPUs and Distributed Deep Learning at Uppsala University

  • 1. GPUs, Distributed Deep Learning and Hopsworks Jim Dowling Assoc Prof @ KTH – Royal Institute of Technology CEO at Logical Clocks AB
  • 2. Leadership & Offices Stockholm Box 1263, Isafjordsgatan 22 Kista, Sweden London IDEALondon, 69 Wilson St, London,, UK Silicon Valley 470 Ramona St Palo Alto California, USA Dr. Jim Dowling CEO Theo Kakantousis COO Prof. Seif Haridi Chief Scientist Fabio Buso VP Engineering Steffen Grohsschmiedt Head Of Cloud www.logicalclocks.com Shraddha Chouhan Head Of Marketing
  • 3. Affine Transformation •Matrix Multiplications calculate weighted sums for Feed-Forward Networks •Fused multiply-add (FMA) instruction on a GPU 2020-06-04 Guest Lecture, Jim Dowling 3/62 x1 x2 .. xm w11 w12 …w1n w21 w22 …w2n …. wm1 wm2 …wmn y1 y2 .. ym Input Weights Output b1 b2 .. bm Biases
  • 4. Convolution Operations Input Matrix ⊗ Filter = Output (activation map) For each output map j For each input map k For each pixel x,y For each kernel element u,v Bxyj += A(x-u)(y-v)k * Kuvkj 2020-06-04 Guest Lecture, Jim Dowling 4/62
  • 6. The Memory Hierarchy Registers L1 L2/L3 DRAM NVMe <1 ns ~1-2 ns ~4-10 ns ~50-70 ns Magnetic Disk ~10 𝜇𝑠 ~5-10 ms Latency Size/Throughput ~2 TB ~3 GB/s ~12 TB ~100 MB/s ~32 GB DIMMs ~10 GB/s per CPU ~128 KB – 8 MB ~1-200 GB/s ~16-64 KB ~700 GB/s ~6 KiB Guest Lecture, Jim Dowling
  • 7. CPU gains outperforming Memory gains
  • 8. SSD/Net gains outperforming Memory gains 5/30/2012 https://itblog.sandisk.com/cpu-bandwidth-the-worrisome-2020-trend/ 8 https://itblog.sandisk.com/cpu-bandwidth-the-worrisome-2020-trend/
  • 9. GPU Programming Intel Xeon E5-2680v4: Clock speed: 2.4 GHz 4 instructions per cycle with AVX2 CPU - 28 cores 2.4 x 4 x 28 = 268.8 Gigaflops double precision NVIDIA Tesla P100: Single instruction per cycle 3584 CUDA cores 4.7 Teraflops double precision CPU vs GPU
  • 11. GPU ProgrammingCPUs, GPUs, Memory, and the PCI-E Bus CPU Memory CPU GPU GPU MemoryPCI-E Bus Compute-Intense Fns Sequential CPU Code Guest Lecture, Jim Dowling Program
  • 13. Single-Instruction Multiple-Data (SIMD) •A single instruction stream to operations that may be naturally parallelized - Matrix Multiplication - Convolutions •x86 SIMD support - AVX Extensions - Intel MKL Library CS 61c 13 Instruction Pool DataPool Processor Processor Processor Processor
  • 14. Why is SIMD not dominant? •It’s hard to program - Have to identify data parallelism in applications •Specialized hardware is expensive - Vector processors used to be expensive, so programming language support lagged •SIMD extensions (AVX, SSE) - Not very wide - No control flow within SIMD
  • 15. Nvidia Cuda: SIMT Programming •Single Instruction Multiple Thread (SIMT) - Parallel threads that use SIMD hardware - CUDA, OpenCL are both SIMT programming models •What is CUDA? - Nvidia proprietory API and platform • Hierarchical thread programming model that combines MIMD/SIMD ideas - Lots of Libraries • CuDNN - Deep Neural Network library • CUBLAS - CUDA Basic Linear Algebra Subroutines library • CUDART - CUDA RunTime library • NVIDIA Collective Communications Library (NCCL)
  • 16. Cuda SIMT •Many parallel threads of instructions - Each instruction is responsible for a single data input •Threads are grouped into blocks (<1024) (blocks are mapped to SMXs) - Hardware divides blocks into SIMD groups or warps - Warps of threads are executed together as a single SIMD instruction •Threads distinguish computation by querying their grid/block location (think thread IDs)
  • 17. CUDA memory model •All threads share Global Memory •Threads within a block share Shared Memory •Threads have their own private Registers •Memory consistency is very relaxed - Cannot guarantee ordering of instructions across blocks. - Can insert explicit memory fences/barriers to order threads within a block
  • 18. CUDA memory model Global Shared Thread 0 Thread 1 Block 0 Shared Thread 0 Thread 1 Block 1 Registers Registers Registers Registers
  • 20. Reduced Precision for Inference 2020-06-04 Guest Lecture, Jim Dowling 20/62
  • 21. Specialized Instructions – Tensor Cores 5/30/2012 www.hops.io 21
  • 22. Connecting GPUs over the Network 5/30/2012 22
  • 23. Intra GPU and Inter-Host GPU Connections • QPI link ~8-12 GB/s • PCIe ~16-32 GB/s . NVLink ~80 GB/s. • Infiniband ~108-216 Gb/s 23Host: 2 CPU Sockets, PCIe Bus, 4 GPUs, Infiniband Net
  • 24. GPU -> Network -> GPU Communications Oden et al, GGAS: Global GPU Address Spaces for Efficient Communication in Heterogeneous Clusters, IEEE 21 CLUSTER 2013
  • 25. NVLink – a GPU-to-GPU Bus 25 NVLink ~80 GB/s
  • 26. SingleRoot Complex – Commodity Server 26
  • 28. Click to edit 5/30/2012 www.hops.io 28 The Deep Learning world beyond Nvidia
  • 29. 5/30/2012 www.hops.io Will GPUs become commodity compute? •For…. - Gaming GPUs have the best Price/Performance for training models - Distributed Deep Learning technology is rapidly improving • Buy more GPUs to scale out training, HParam Tuning, Ablation Studies, etc •Against…. - Nvidia 29
  • 30. 5/30/2012 www.hops.io Commodity Fight: AMD Radeon vs Nvidia 30 Nvidia™ 2080Ti Memory: 11GB TensorFlow 1.12 CUDA 10.0.130, cuDNN 7.4.1 Model: RESNET-50 Dataset: imagenet (synthetic) ---------------------------------------------------- -------- FP32 total images/sec: ~322 FP16 total images/sec: ~560 AMD Radeon™ VII Memory: 16 GB TensorFlow 1.13.1 ROCm: 2.3 Model: RESNET-50 Dataset: imagenet (synthetic) ------------------------------------------------------ ------ FP32 total images/sec: ~302 FP16 total images/sec: ~415 https://lambdalabs.com/blog/2080-ti-deep-learning-benchmarks/ https://www.phoronix.com/scan.php?page=article&item=nvidia- rtx2080ti-tensorflow&num=2 https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/173
  • 31. 5/30/2012 www.hops.io AMD Software Stack 31 31 Latest Machine Learning Frameworks Dockers and Kubernetes support Optimized Math & Communication Libraries Up-Streamed for Linux Kernel Distributions Frameworks Middleware and Libraries Eigen Spark / Machine Learning Apps Data Platform Tools ROCm Fully Open Source ROCm Platform OpenMP HIP OpenCL™ Python Devices GPU CPU APU DLA RCCL BLAS, FFT, RNG MIOpen O P E N S O U R C E F O U N D A T I O N F O R M A C H I N E L E A R N I N G
• 32. AMD Distributed Training: RCCL is an optimized collective-communication library with easy MPI integration and support for Infiniband and RoCE high-speed network fabrics; ROCm-enabled UCX; ROCm with ROCmRDMA. [Chart: ResNet-50 multi-GPU scaling with PCIe and a CPU parameter server: 1.00x / 1.99x / 3.98x / 7.64x for 1 / 2 / 4 / 8 GPUs]
• 33. ROCm over Spark/TensorFlow on Hopsworks •Spark / TensorFlow applications run unchanged on ROCm •Hopsworks runs Spark/TensorFlow on YARN and Conda
• 34. YARN support for ROCm in Hops: a Container is a CGroup that isolates CPU, memory, and GPU resources and has a conda environment and TLS certs. [Diagram: the Resource Manager schedules containers on Node Managers; the containers host the Spark Driver and Executors]
• 36. Model Parallelism SGD vs Data Parallelism SGD •Model Parallelism - Models cannot fit on a single GPU - The model is partitioned over many GPUs • For example, one layer per GPU for a ConvNet - The same data is used to train the partitioned model • E.g., input data enters at the bottom layer (GPU1) and, as activations feed forward towards the output layer, they pass through many GPUs •Data Parallelism - A copy of the model is stored at each worker - Each worker trains on a different set of samples from the same mini-batch - Gradients computed at each worker need to be aggregated to calculate the new model for a mini-batch - A new model needs to be broadcast to all workers for each iteration
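To make the data-parallel update concrete, here is a minimal NumPy sketch (not from the slides) of one synchronous step: each worker computes a gradient on its own shard of the mini-batch, the gradients are averaged, and the same update is applied to every replica. The least-squares objective and all names are illustrative.

# A NumPy sketch of one synchronous data-parallel SGD step (illustrative only).
import numpy as np

def local_gradient(w, x_shard, y_shard):
    # Least-squares gradient on this worker's shard: d/dw 0.5*||Xw - y||^2
    return x_shard.T @ (x_shard @ w - y_shard) / len(y_shard)

rng = np.random.default_rng(0)
n_workers, d, lr = 4, 8, 0.1
w = np.zeros(d)                                    # identical model replica on every worker
x = rng.normal(size=(64, d))
y = x @ rng.normal(size=d)

# Split one global mini-batch into per-worker shards
x_shards = np.array_split(x, n_workers)
y_shards = np.array_split(y, n_workers)

# 1) each worker computes its gradient, 2) gradients are aggregated
#    (e.g. via all-reduce or a parameter server), 3) the same averaged
#    update is applied everywhere, keeping the replicas in sync.
grads = [local_gradient(w, xs, ys) for xs, ys in zip(x_shards, y_shards)]
avg_grad = np.mean(grads, axis=0)
w -= lr * avg_grad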
• 37. Distributed SGD with Data Parallelism
• 38. Distributed SGD with Data Parallelism
• 39. Asynchronous SGD vs Synchronous SGD
• 40. Synchronous SGD: the Network is the Bottleneck [Chart: amount of work over time for 1 GPU vs 4 GPUs, where each iteration alternates computation with network communication (N/W)] Reduce network communication time, increase computation time: Amdahl's Law
• 41. Synchronous SGD Challenges •The effective size of the batch becomes larger - Can we train models with very large batch sizes? •Update time depends on the slowest worker - Backup workers proposed as a mitigating strategy
• 42. Facebook: Scaling Synchronous SGD (June 2017): Facebook reduced training time on ImageNet from 2 weeks to 1 hr https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf
• 43. Facebook AllReduce Synchronous SGD •Experiments using ConvNets on ImageNet - ResNet-50 on up to 256 GPUs - No loss of accuracy when training with large minibatch sizes up to 8192 images - ~90% scaling efficiency when moving from 8 to 256 GPUs •Learning rate heuristic - Make the learning rate proportional to the batch size •Warm-up phase - Start at a low learning rate and gradually increase it to the target learning rate (5 epochs to reach the target rate)
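Here is a minimal Keras sketch (not Facebook's code) of the two heuristics above: scale the learning rate linearly with the batch size, and warm it up over the first few epochs; the base values, epoch counts, and decay steps are assumptions.

# A Keras sketch of the linear scaling rule plus gradual warmup (illustrative).
import tensorflow as tf

base_lr = 0.1          # reference LR for a batch size of 256 (assumed baseline)
base_batch = 256
batch_size = 8192      # large global batch spread over many GPUs
warmup_epochs = 5

target_lr = base_lr * batch_size / base_batch      # learning rate proportional to batch size

def lr_schedule(epoch, lr):
    # Ramp linearly from a small LR up to the target over the warmup epochs,
    # then fall back to an ordinary step decay (epoch boundaries are assumptions).
    if epoch < warmup_epochs:
        return target_lr * (epoch + 1) / warmup_epochs
    if epoch < 30:
        return target_lr
    if epoch < 60:
        return target_lr * 0.1
    return target_lr * 0.01

warmup_cb = tf.keras.callbacks.LearningRateScheduler(lr_schedule, verbose=1)
# model.fit(train_dataset, epochs=90, callbacks=[warmup_cb])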
• 45. Facebook: Learning Rate Warmup Evaluation
• 46. Instead of Decaying the Learning Rate, Increase the Batch Size •An alternative to Facebook's learning rate heuristic is to increase the batch size during training (Quoc Le et al.) •A vast batch size of 65536 images when training Inception-ResNet-V2, using only 2500 iterations (model updates to all GPUs), reached an accuracy of 77% (cf. Facebook's 76%). https://arxiv.org/pdf/1711.00489.pdf
• 48. Ring-AllReduce vs Parameter Server(s) [Diagram: a ring of GPUs (GPU 0-3), each sending to and receiving from its neighbour, contrasted with GPUs 1-4 all communicating with central parameter server(s)] Network bandwidth is the bottleneck for distributed training
• 49. One Slow Link/GPU/Bus slows down Training [Diagram: Ring-AllReduce]
• 50. Ring-AllReduce Algorithm •This example of AllReduce sums the elements of N arrays using N GPUs in parallel (a NumPy sketch of the full scatter-reduce and allgather sequence follows the walkthrough below)
• 51. First of N-1 iterations of scatter-reduce
• 52. Intermediate sums in step 1 of scatter-reduce
• 53. More iterations of scatter-reduce
• 54. More iterations of scatter-reduce
• 55. More iterations of scatter-reduce
• 56. Final state after all scatter-reduce transfers
• 57. First iteration of the allgather
• 58. Allgather data transfers (Iteration 1)
• 59. Next allgather data transfer
• 60. Next allgather data transfer
• 61. Next allgather data transfer
• 62. Final state after all allgather transfers
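As a sanity check on the walkthrough above, here is a small NumPy simulation (illustrative only; real implementations such as NCCL perform the sends and receives of each step concurrently around the ring) of the scatter-reduce and allgather phases.

# A NumPy simulation of ring-allreduce (illustrative, not a real communication library).
import numpy as np

def ring_allreduce(arrays):
    """Every worker ends up with the element-wise sum of all workers' arrays."""
    n = len(arrays)                                                  # number of workers / GPUs
    chunks = [np.array_split(a.astype(float), n) for a in arrays]   # each array split into n chunks

    # Phase 1: scatter-reduce. After n-1 steps, worker i owns the fully
    # reduced chunk with index (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n                               # chunk worker i forwards in this step
            dst = (i + 1) % n
            chunks[dst][c] = chunks[dst][c] + chunks[i][c]   # neighbour accumulates

    # Phase 2: allgather. Circulate the reduced chunks once more around the ring.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n                           # fully reduced chunk worker i holds now
            dst = (i + 1) % n
            chunks[dst][c] = chunks[i][c]                    # neighbour overwrites with reduced chunk

    return [np.concatenate(c) for c in chunks]

gpus = [np.arange(8.0) * (k + 1) for k in range(4)]   # 4 "GPUs" with 8 gradients each
result = ring_allreduce(gpus)
print(result[0])                                      # element-wise sum, identical on every worker
assert all(np.allclose(r, sum(gpus)) for r in result)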
• 63. Concurrency in Ring-AllReduce SGD •After computing the gradients for a layer, send the gradients to your neighbor immediately - Spreads out network traffic over time. "Running the model on 40 GPUs takes approximately 650 – 700 milliseconds per iteration, while on a single GPU it takes approximately 370 milliseconds. Since by our estimate communication would take 400 milliseconds, we are saving an extra 70 – 120 milliseconds per iteration by overlapping the backpropagation with the data transfer." http://research.baidu.com/bringing-hpc-techniques-deep-learning/
• 64. Support in TensorFlow for AllReduce •Collective AllReduce in Keras/TensorFlow - Multi-node collective communication primitives •Uber Horovod - Built on the NVIDIA Collective Communications Library (NCCL) - NCCL provides routines such as all-gather, all-reduce, broadcast, reduce, and reduce-scatter over PCIe and NVLink
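As an illustration of the Horovod route, here is a minimal Keras sketch (assuming Horovod is installed and the job is launched with something like horovodrun -np 4 python train.py); the model and dataset are placeholders.

# A minimal Horovod + Keras sketch (illustrative).
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                                                 # one process per GPU

# Pin each process to a single local GPU
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10),
])

# Scale the LR by the number of workers and wrap the optimizer so that gradients
# are averaged with allreduce (NCCL under the hood) on every step.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              optimizer=opt)

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]   # sync initial weights from rank 0
# model.fit(train_dataset, epochs=3, callbacks=callbacks,
#           verbose=1 if hvd.rank() == 0 else 0)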
• 65. Current State of the Art (Microsoft)
• 66. Distributed Deep Learning on Hopsworks https://www.oreilly.com/content/distributed-tensorflow/
• 67. Hopsworks Project, milestones 2017-2019: world's first Hadoop platform to support GPUs-as-a-Resource; world's fastest HDFS, published at USENIX FAST with Oracle and Spotify; world's first open source feature store for machine learning; world's first distributed filesystem to store small files in metadata on NVMe disks; winner of the IEEE Scale Challenge 2017 with HopsFS (1.2m ops/sec); world's most scalable POSIX-like hierarchical filesystem with multi-data-center availability (1.6m ops/sec); world's first managed feature store in the cloud (Hopsworks.ai); world's first unified hyperparameter and ablation-study parallel programming framework
• 68. Inner and Outer Loop of Deep Learning [Diagram: in the inner loop, workers 1..N compute updates Δ1..ΔN on the training data and synchronize; in the outer loop, a search method proposes hyperparameters (HParams) and receives a metric back] http://tiny.cc/51yjdz
• 69. Inner and Outer Loop of Deep Learning [Same diagram, annotated: inner loop = LEARNING, outer loop = SEARCH] http://tiny.cc/51yjdz
• 70. Black Box Optimization [Diagram: a meta-level learning & optimization process draws trials from a search space; each trial runs learning as a black box and returns a metric]
• 71. Parallel Black Box Optimization: Which algorithm to use for search? How to monitor progress? Fault tolerance? How to aggregate results? [Diagram: the meta-level learning & optimization process feeds a queue of trials to parallel workers, each running the learning black box and returning a metric] This should be managed with platform support!
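As a bare-bones illustration of the trial-queue idea, here is a sketch (not Hopsworks/Maggy code) of random search over a small search space with parallel workers; a real system would add progress monitoring, fault tolerance, and smarter search algorithms, and the search space and metric are synthetic.

# A bare-bones sketch of parallel black-box search over a trial queue (illustrative only).
import random
from concurrent.futures import ProcessPoolExecutor

SEARCH_SPACE = {'lr': (1e-4, 1e-1), 'dropout': (0.1, 0.7)}

def sample_trial():
    # Random search: draw each hyperparameter uniformly from its range
    return {k: random.uniform(*bounds) for k, bounds in SEARCH_SPACE.items()}

def black_box(trial):
    # Stand-in for training a model and returning a metric (e.g. validation accuracy)
    metric = 1.0 - abs(trial['lr'] - 0.01) - abs(trial['dropout'] - 0.5)
    return metric, trial

if __name__ == '__main__':
    trials = [sample_trial() for _ in range(16)]              # the trial queue
    with ProcessPoolExecutor(max_workers=4) as pool:          # parallel workers
        results = list(pool.map(black_box, trials))
    best_metric, best_trial = max(results, key=lambda r: r[0])
    print(best_metric, best_trial)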
• 72. Distributed HParam Tuning [Diagram: the Driver ships the training function and a conda_env to the Executors, which read training data from HopsFS and write logs, checkpoints, models, and TensorBoard events back]
# RUNS ON THE EXECUTORS
def train(lr, dropout):
    def input_fn():  # return dataset
        ...
    optimizer = ...
    model = ...
    model.add(Conv2D(...))
    model.compile(...)
    model.fit(...)
    model.evaluate(...)
# RUNS ON THE DRIVER
HParams = {'lr': [0.001, 0.0001], 'dropout': [0.25, 0.5, 0.75]}
experiment.grid_search(train, HParams)
More details: Spark Summit Europe 2018 talk https://www.youtube.com/watch?v=tx6HyoUYGL0 https://github.com/logicalclocks/hops-examples
• 73. Distributed Training [Diagram: same Driver / Executors / HopsFS layout as the previous slide]
# RUNS ON THE EXECUTORS
def train():
    def input_fn():  # return dataset
        ...
    model = ...
    optimizer = ...
    model.compile(...)
    rc = tf.estimator.RunConfig('CollectiveAllReduceStrategy')
    keras_estimator = tf.keras.estimator.model_to_estimator(...)
    tf.estimator.train_and_evaluate(keras_estimator, input_fn)
# RUNS ON THE DRIVER
experiment.collective_all_reduce(train)
More details: Spark Summit Europe 2018 talk https://www.youtube.com/watch?v=tx6HyoUYGL0 https://github.com/logicalclocks/hops-examples
• 74. Maggy – Parallel HParam Trials on PySpark [Diagram: the Driver coordinates long-running tasks Task11..Task1N behind a barrier, exchanging metrics, new trials, and early-stop signals] Long-running tasks execute many trials, with a global optimizer.
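For orientation, this is roughly what such a run looks like from a notebook, based on the Maggy documentation at the time; the function names, keyword arguments, and the reporter API are assumptions and may differ between Maggy versions.

# A rough sketch of a Maggy hyperparameter search (names and arguments are assumptions).
from maggy import experiment, Searchspace

sp = Searchspace(lr=('DOUBLE', [0.0001, 0.1]), dropout=('DOUBLE', [0.1, 0.7]))

def training_function(lr, dropout, reporter):
    # build, compile, and train the model with these hyperparameters ...
    acc = 0.0
    # reporter.broadcast(metric=acc)   # stream the metric so the optimizer can early-stop poor trials
    return acc

# Long-running Spark tasks keep pulling new trials from the global optimizer on the Driver
result = experiment.lagom(training_function, searchspace=sp,
                          optimizer='randomsearch', direction='max',
                          num_trials=15, name='hparam_demo')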
  • 75. ML Model Development •A simplified view
• 76. ML Model Development •It's simple, only four steps: Explore and Design; Experimentation: Tune and Search; Model Training (Distributed); Explainability and Ablation Studies
• 77. Artifacts and Non-DRY Code [Diagram: the same four steps, each with its own code artifacts]
• 78. Development of ML Models is Iterative [Diagram: the four steps form a loop]
• 79. Iterative Development Is a Pain, We Need DRY Code! •Each step requires a different implementation of the training code: EDA, HParam Tuning, Training (Dist), Ablation Studies
• 80. The Oblivious Training Function [Diagram: one training function reused across EDA, HParam Tuning, Training (Dist), and Ablation Studies]
OBLIVIOUS TRAINING FUNCTION
# RUNS ON THE WORKERS
def train():
    def input_fn():  # return dataset
        ...
    model = ...
    optimizer = ...
    model.compile(...)
    rc = tf.estimator.RunConfig('CollectiveAllReduceStrategy')
    keras_estimator = tf.keras.estimator.model_to_estimator(...)
    tf.estimator.train_and_evaluate(keras_estimator, input_fn)
• 81. Challenge: Obtrusive Framework Artifacts •TensorFlow challenges: • TF_CONFIG • Distribution Strategy • Dataset (Sharding, DFS) • Integration in Python - hard from inside a notebook • Keras vs. Estimator vs. Custom Training Loop
• 82. Trend(s) in DL: Productive High-Level APIs [Diagram: the Idea → Experiment → Results loop on top of infrastructure, framework, tracking, and visualization layers; providers shown include Hopsworks (open source), Databricks, Apache Spark, and the cloud providers]
• 83. How do we keep High-Level Code Transparent? (The slide shows the identical code twice: NO CHANGES!)
def dataset(batch_size):
    (x_train, y_train) = load_data()
    x_train = x_train / np.float32(255)
    y_train = y_train.astype(np.int64)
    train_dataset = tf.data.Dataset.from_tensor_slices(
        (x_train, y_train)).shuffle(60000).repeat().batch(batch_size)
    return train_dataset

def build_and_compile_cnn_model(lr):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10)
    ])
    model.compile(
        loss=SparseCategoricalCrossentropy(from_logits=True),
        optimizer=SGD(learning_rate=lr))
    return model
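One way to read the slide: the functions above stay untouched whether they run on a single host or inside a distribution strategy. A minimal sketch of that reuse (an assumption about usage, not taken from the slides; requires TensorFlow 2.x and the load_data helper above) could look like this.

# Illustrative only: reusing the unchanged dataset() and build_and_compile_cnn_model()
# functions inside a multi-worker distribution strategy scope.
import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()   # reads TF_CONFIG on each worker

global_batch = 64 * strategy.num_replicas_in_sync
with strategy.scope():                                    # variables are created as mirrored
    model = build_and_compile_cnn_model(lr=0.01)          # same code as above, no changes

model.fit(dataset(global_batch), epochs=3, steps_per_epoch=70)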
• 84. Distribution Context •Single-host vs. parallel multi-host vs. distributed multi-host [Diagram: a single host; a parallel multi-host setup in which an experiment controller on the Driver launches independent Workers 1..N; a distributed multi-host setup in which the Driver sets TF_CONFIG for Workers 1..8]
• 85. Distribution Context •Single-host vs. parallel multi-host vs. distributed multi-host [Same diagram, mapped onto the four steps: Explore and Design; Experimentation: Tune and Search; Model Training (Distributed); Explainability and Ablation Studies]
• 86. Model Development Best Practices • Modularize • Parametrize • Higher-order training functions • Use callbacks at runtime [Modules: Dataset Generation, Model Generation, Training Logic]
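A minimal sketch of these practices (all names, layer sizes, and defaults are illustrative): the dataset, model, and training logic are separate parametrized modules, and a higher-order function assembles them into a training function that a tuner, a distributed runner, or an ablation study can call.

# Illustrative sketch of a parametrized, higher-order training function.
import tensorflow as tf

def make_dataset(batch_size):
    (x, y), _ = tf.keras.datasets.mnist.load_data()
    x = (x / 255.0).astype('float32')
    return tf.data.Dataset.from_tensor_slices((x, y)).shuffle(60000).repeat().batch(batch_size)

def make_model(lr, dropout):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(dropout),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=tf.keras.optimizers.SGD(learning_rate=lr),
        metrics=['accuracy'])
    return model

def make_training_function(batch_size=64, epochs=1, callbacks=None):
    # Returns a self-contained training function that a tuner, a distributed
    # runner, or an ablation study can call with different hyperparameters.
    def train(lr=0.01, dropout=0.5):
        model = make_model(lr, dropout)
        history = model.fit(make_dataset(batch_size), epochs=epochs,
                            steps_per_epoch=200, callbacks=callbacks or [])
        return history.history['accuracy'][-1]
    return train

train_fn = make_training_function(epochs=1)
print(train_fn(lr=0.05, dropout=0.25))   # the metric an outer search loop would optimize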
• 87. Oblivious Training Function as an Abstraction •Let the system handle the complexities. The system takes care of: fixing parameters; launching the function; launching trials (parametrized instantiations of the function); generating new trials; collecting and logging results; setting up TF_CONFIG; wrapping the function in a Distribution Strategy; launching the function as workers; collecting results
• 88. Maggy - Asynchronous Trials on Spark •Spark is bulk-synchronous [Diagram: successive stages of tasks (Task11..Task1N, Task21..Task2N, Task31..Task3N) separated by barriers, each reporting metrics to the Driver via HopsFS; early-stopped trials leave wasted compute before every barrier]
• 89. Recap: The Solution •Add communication and long-running tasks [Diagram: long-running tasks Task11..Task1N behind a single barrier, exchanging metrics and new trials with the Driver]
• 90. What's New? •Worker discovery and distribution context set-up [Diagram: the Driver discovers the workers and then launches the oblivious training function in the chosen distribution context behind the barrier]
• 91. What's New: Distribution Context
sp = maggy.optimization.Searchspace(...)
dist_strat = tf.distribute.MirroredStrategy(...)
ab = maggy.ablation.AblationStudy(...)

maggy.set_context('optimization')
maggy.lagom(training_function, sp)

maggy.set_context('distributed_training')
maggy.lagom(training_function, dist_strat)

maggy.set_context('ablation')
maggy.lagom(training_function, ab)
• 92. Contributors – Thanks! Contributions from colleagues: Robin Andersson @robzor92, Sina Sheikholeslami @cutlash, Kim Hammar @KimHammar1, Alex Ormenisan @alex_ormenisan • Maggy: https://github.com/logicalclocks/maggy or https://maggy.readthedocs.io/en/latest/
  • 93. Show us some love! @hopsworks http://github.com/logicalclocks/hopsworks