Exascale Deep Learning for Climate Analytics
Thorsten Kurth*, Josh Romero*, Sean Treichler, Mayur Mudigonda,

Nathan Luehr, Everett Phillips, Ankur Mahesh, Michael Matheson, Jack Deslippe,

Massimiliano Fatica, Prabhat, Michael Houston
GTC Silicon Valley
03/17/2019, San Jose, CA
The Team
2
Thorsten Kurth Sean Treichler Josh Romero Mayur Mudigonda Nathan Luehr Everett Phillips
Ankur Mahesh Michael Matheson Jack Deslippe Massimiliano Fatica Prabhat Michael Houston
Socio-Economic Impact of
Extreme Weather Events
• tropical cyclones and
atmospheric rivers have major
impact on modern economy and
society
• CA: 50% of rainfall through
atmospheric rivers
• FL: flooding, influence on
insurance premiums and home
prices
• $200B worth of damage in 2017
• costs of ~$10B/event for large
events
3
[Images: Harvey 2017, Katrina 2005, Berkeley 2019, Santa Rosa 2018; credits: pixabay, Chris Samuel]
Understanding Extreme
Weather Phenomena
• will there be more hurricanes?
• will they be more intense?
• will they make landfall more often?
• will atmospheric rivers carry more
water?
• can they help mitigate droughts and
decrease risk of forest fires?
• will they cause flooding and heavy
precipitation?
4
[Images: pixabay]
Typical Climate Analytics
5
• mean/std of temperature
• 14M variables every 3 hours reduced to O(1) variables per year
Impact Quantification of Extreme Weather Events
• detect hurricanes and atmospheric
rivers in climate model projections
• enable geospatial analysis of EW
events and statistical impact
studies for regions around the
world
• flexible and scalable detection
algorithm
• gear up for future simulations with
∼1 km2 spatial resolution
6
M.F. Wehner, doi:10.1002/2013MS000276
Unique Challenges for Climate Analytics
• interpret as segmentation problem
• 3 classes - background (BG), tropical cyclones (TC), atmospheric rivers (AR)
• deep learning has proven successful for these tasks
• climate data is complex
• high imbalance - more than 95% of pixels are background
• high variance - shapes of events change
• many input channels with different properties
• high resolution required
• no static background, highly variable in space and time
7
[Image: NASA]
Unique Challenges for Deep Learning
• need labeled data for the supervised approach
• labels can be derived from existing heuristic-based approaches
• define a neural network architecture
• balance compute performance against model accuracy
• employ high-productivity, flexible frameworks for rapid prototyping
• performance optimization requires a holistic approach
• hyperparameter tuning (HPO)
• necessary for convergence and accuracy
8
Unique Challenges for Deep Learning at Extreme Scale
9
• data management
• shuffling/loading/processing/feeding a 20 TB dataset to keep GPUs busy
• efficient use of the remote filesystem
• multi-node coordination and synchronization
• synchronous reduction of O(50) MB across 27,360 GPUs after each iteration
• hyperparameter tuning (HPO)
• convergence and accuracy are challenging due to larger global batch sizes
Label Creation: Atmospheric Rivers
10
1. The climate model predicts water vapor, wind speeds, and humidity.
2. These observables are used to compute the Integrated Water Vapor Transport (IVT); see the sketch below.
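To make step 2 concrete, here is a minimal NumPy sketch of the standard IVT definition (the vertical integral of the moisture flux over pressure, divided by g). The function name and array layout are illustrative assumptions, not the authors' code.

```python
import numpy as np

G = 9.81  # gravitational acceleration [m/s^2]

def integrated_vapor_transport(q, u, v, p):
    """IVT magnitude from specific humidity q [kg/kg], winds u, v [m/s],
    and pressure levels p [Pa]; arrays are shaped (levels, lat, lon)."""
    # (1/g) * vertical integral over pressure of q * wind, per component
    ivt_u = np.trapz(q * u, p, axis=0) / G
    ivt_v = np.trapz(q * v, p, axis=0) / G
    return np.sqrt(ivt_u**2 + ivt_v**2)  # [kg m^-1 s^-1]
```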
Label Creation: Atmospheric Rivers
11
3. Binarization by thresholding at the 95th percentile.
4. A flood-fill algorithm generates AR candidates by masking out regions in the mid-latitudes (sketch below).
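A hedged NumPy/SciPy sketch of steps 3-4: the 95th-percentile threshold follows the slide, the minimum region size is an illustrative assumption, and the additional mid-latitude masking step is omitted.

```python
import numpy as np
from scipy import ndimage

def ar_candidates(ivt, percentile=95, min_pixels=500):
    """ivt: 2D field (lat, lon). Returns a labeled map of AR candidate regions."""
    binary = ivt >= np.percentile(ivt, percentile)   # step 3: binarize at the 95th percentile
    labels, n = ndimage.label(binary)                # step 4: flood-fill connected regions
    # drop tiny components; min_pixels is an assumption, and the slide's
    # mid-latitude masking is not reproduced here
    for region in range(1, n + 1):
        if np.count_nonzero(labels == region) < min_pixels:
            labels[labels == region] = 0
    return labels
```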
Label Creation: Tropical Cyclones
12
1. Extract the cyclone center and radius using thresholds for pressure, temperature, and vorticity.
2. Binarize a patch around the cyclone center using thresholds for water vapor, wind, and precipitation.
Software: TensorFlow
• high-productivity deep learning framework in Python with a C++ backend, developed by Google
• leverages the optimized cuDNN library for performance-sensitive kernels (e.g. convolutions) on NVIDIA GPUs
• dataflow-style programming and asynchronous graph execution
• provides features for building the I/O input pipeline
• can be combined with external Python modules for good flexibility
13
Software: Horovod
• library developed by Uber for distributed training
• provides straightforward wrappers to convert existing single-process TensorFlow programs into multi-process data-parallel programs (see the sketch below)
• uses MPI for worker coordination/control
• MPI is ubiquitous in HPC and broadly available on systems targeting HPC applications
• can leverage MPI or NCCL for collective communication on NVIDIA GPUs (e.g. allreduce)
14
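As an illustration of the wrapper approach, a minimal TF1-era Horovod setup might look like the following; the optimizer and learning-rate values are placeholders, not the settings used in this work.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process (MPI rank) per GPU

# pin each rank to a single GPU
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# wrap the local optimizer: gradients are allreduced across ranks each step
opt = tf.train.MomentumOptimizer(0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)

# broadcast the initial weights from rank 0 so all workers start identically
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
```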
Systems
15
Piz Daint
• Cray XC50 HPC system at CSCS, 5th on the Top500
• 5320 nodes, each with one Intel Xeon E5-2695v3 CPU and one NVIDIA P100 GPU
• Cray Aries interconnect in a diameter-5 dragonfly topology
• ~54.4 PetaFlop/s peak performance (FP32)
Summit
• leadership-class HPC system at OLCF, 1st on the Top500
• 4609 nodes, each with 2 IBM POWER9 CPUs and 6 NVIDIA V100 GPUs
• 300 GB/s NVLink connections between the 3 GPUs in a group
• 800 GB of node-local NVMe storage per node
• dual-rail EDR InfiniBand in a fat-tree topology
• ~3.45 ExaFlop/s theoretical peak performance (FP16)
[Images: CSCS, Carlos Jones (ORNL)]
Single GPU (cuDNN)
• Things to consider:
• Is my TensorFlow model efficiently using GPU resources?
• Is my data input pipeline keeping up?
• Is my TensorFlow model providing reasonable results?
Single Node (6× cuDNN, NCCL, MPI)
• Things to consider:
• Is my data input pipeline still keeping up?
• Is my data-parallel TensorFlow model providing reasonable results?
• How is my performance scaling over PCIe/NVLink using Horovod?
Multi Node (NCCL, MPI)
• Things to consider:
• Is my data-parallel TensorFlow model still providing reasonable results?
• How is my performance scaling over NVLink + InfiniBand using Horovod?
• How do I distribute my data across so many nodes?
Deep Learning Models for Extreme Weather Segmentation
20
• Tiramisu: 35 layers, 7.8M parameters, 4.2 TF/sample
• DeepLabv3+: 66 layers, 43.7M parameters, 14.4 TF/sample
• both are encoder-decoder segmentation networks: input 1152×768 with 16 channels, output 1152×768 with 3 classes (BG/TC/AR); DeepLabv3+ adds an ASPP module with dilated convolutions
[Architecture diagrams: per-layer configurations of the Tiramisu and DeepLabv3+ encoder/decoder stacks]
A minimal encoder-decoder sketch follows below.
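The tf.keras toy model below is only a structural sketch of an encoder-decoder segmentation network on the same input/output shapes; it is not the actual Tiramisu or DeepLabv3+ used in this work.

```python
import tensorflow as tf

def toy_segmentation_model(h=768, w=1152, channels=16, num_classes=3):
    """Tiny encoder-decoder: downsample with strided convs, upsample with
    transposed convs, and predict a per-pixel class map (BG/TC/AR)."""
    inputs = tf.keras.Input(shape=(h, w, channels))
    x = tf.keras.layers.Conv2D(64, 5, padding="same", activation="relu")(inputs)
    skip = x
    x = tf.keras.layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv2D(256, 3, strides=2, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv2DTranspose(128, 3, strides=2, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(x)
    x = tf.keras.layers.Concatenate()([x, skip])        # skip connection at full resolution
    outputs = tf.keras.layers.Conv2D(num_classes, 1, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```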
Data Staging
21
• 250 training samples/GPU (15 GB), sampled with replacement
• each file is read at most once from the file system
• files are shared between nodes via MPI (mpi4py); see the sketch below
• data is preprocessed and fed to the GPU asynchronously using tf.data and Python multiprocessing

Dataset: 20 TB (~63K samples)
Required bandwidth (27K GPUs): 3.8 TB/s
Available bandwidth: GPFS/Lustre ~400 GB/s | burst buffer ~2 TB/s | node-local NVMe or DRAM ~26 TB/s
[Diagram: shards of ~1.5K samples staged to each node's NVMe, with samples shuffled between nodes]
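A hedged sketch of the MPI-based sharding idea: rank 0 shuffles the global file list once and scatters shards so that each file is read from the global file system at most once before being copied to node-local storage. Paths and the staging directory are placeholders, and the scheme is simplified to disjoint shards rather than the sampling-with-replacement used in this work.

```python
import os, shutil, random
from mpi4py import MPI

comm = MPI.COMM_WORLD

def stage_shard(all_files, local_dir="/tmp/nvme_stage", samples_per_rank=250):
    """Scatter a shard of filenames to each rank and copy it to node-local storage."""
    if comm.rank == 0:
        random.shuffle(all_files)
        shards = [all_files[i::comm.size][:samples_per_rank] for i in range(comm.size)]
    else:
        shards = None
    my_files = comm.scatter(shards, root=0)    # each rank receives its own list
    os.makedirs(local_dir, exist_ok=True)
    for path in my_files:                      # copy shard from the global FS to NVMe
        shutil.copy(path, local_dir)
    return [os.path.join(local_dir, os.path.basename(p)) for p in my_files]
```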
On-Node I/O Pipeline
• files are in HDF5 format, one sample + label per file
• the list of filenames is passed to the TensorFlow Dataset API (tf.data)
• the HDF5 serialization bottleneck is addressed with multiprocessing + h5py
• samples are extracted and batched using the tf.data input pipeline (see the sketch below)
22
[Diagram: shuffled HDF5 filename lists feed a 4-way parallel read + preprocess stage on the CPU, which batches samples for the GPU; network art: asimovinstitute.org/neural-network-zoo]
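The sketch below shows one way to wire HDF5 reads into tf.data. The dataset key names ("climate/data", "climate/labels") are assumptions, and the parallel reads are expressed with tf.data's parallel map rather than the explicit Python multiprocessing used in this work.

```python
import h5py
import numpy as np
import tensorflow as tf

def read_hdf5(path):
    # one sample + label per file; the HDF5 key names are placeholders
    with h5py.File(path.decode(), "r") as f:
        data = f["climate/data"][...].astype(np.float32)
        label = f["climate/labels"][...].astype(np.int32)
    return data, label

def make_dataset(filenames, batch_size=2, parallel_reads=4):
    ds = tf.data.Dataset.from_tensor_slices(filenames)
    ds = ds.shuffle(buffer_size=len(filenames))
    ds = ds.map(lambda p: tf.numpy_function(read_hdf5, [p], [tf.float32, tf.int32]),
                num_parallel_calls=parallel_reads)   # 4-way parallel read + preprocess
    return ds.batch(batch_size).prefetch(2)
```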
Single Node Performance
• GPU execution profiled with the CUDA profiler; kernels grouped by category
• convolution kernels: use the latest cuDNN and favor higher computational intensity
• pay attention to memory layout to reduce transposes and copies
• tune the input pipeline on the CPU to keep it off the critical path
23
DeepLabv3+ FP16 Training

Category              | # Kern | Time (ms) | Math (TF) | Mem (GB) | % Time | % Math | % Mem
Forward convolutions  |    158 |     147.9 |      9.61 |     27.6 |   18.1 |   52.0 |  20.7
Forward point-wise    |    829 |      52.3 |     < 0.1 |     24.3 |    6.4 |        |  51.6
Backward convolutions |    195 |     300.2 |     19.21 |     50.5 |   36.7 |   51.2 |  18.7
Backward point-wise   |    157 |      25.6 |     < 0.1 |      6.3 |    3.1 |        |  27.3
Optimizer             |   1219 |       3.9 |     < 0.1 |      1.1 |    0.5 |        |  31.3
Copies / transposes   |    708 |     213.2 |         - |     92.6 |   26.1 |        |  48.3
Allreduce (NCCL)      |     30 |      58.7 |     < 0.1 |      0.6 |    7.2 |        |   1.1
Type conversions      |    201 |       1.3 |         - |      0.6 |    0.2 |        |  51.3
GPU idle              |        |      14.2 |           |          |    1.7 |        |
Total                 |   3497 |     817.3 |     28.82 |    203.6 |        |   28.2 |  27.7

(% Time is the fraction of the total iteration time; % Math and % Mem are fractions of peak FP16 throughput and memory bandwidth.)
Improvements to Horovod: Original Control Plane
24
[Diagram: every worker gathers its list of ready tensors (e.g. {1, 2, 5, 8, 13}) to a single coordinator rank, which intersects the lists (here {5, 13}) and broadcasts the result before the allreduce of tensors 5 and 13 starts]
Improvements to Horovod: Tree-based Control Plane
25
[Diagram: the gather + intersect step runs asynchronously over a tree of workers, and the result is distributed with a tree-based broadcast before the allreduce of tensors 5 and 13]
Improvements to Horovod: Hybrid All-Reduce
• NCCL uses NVLink for high throughput, but ring-based algorithms are latency-limited at scale
• a hybrid NCCL/MPI strategy uses the strengths of both
• one inter-node allreduce per virtual NIC
• MPI work overlaps well with GPU computation
26
[Diagram: intra-node allreduce (NCCL), then 4× inter-node allreduce (MPI), then 4× intra-node broadcast (NCCL); a minimal sketch follows below]
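To illustrate the hierarchical pattern (not the production implementation, which uses NCCL inside the node and splits the inter-node step across 4 virtual NICs), here is an mpi4py-only sketch: reduce onto a per-node leader, allreduce across leaders over the network, then broadcast back within the node.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

# split into a node-local communicator and a communicator of node "leaders"
local = comm.Split_type(MPI.COMM_TYPE_SHARED)
leaders = comm.Split(color=0 if local.rank == 0 else MPI.UNDEFINED, key=comm.rank)

grads = np.random.rand(1024).astype(np.float32)  # stand-in for a gradient buffer
node_sum = np.empty_like(grads)

# phase 1: reduce gradients onto the node leader (NCCL over NVLink in the slides)
local.Reduce(grads, node_sum, op=MPI.SUM, root=0)

# phase 2: node leaders allreduce across the network (MPI)
if local.rank == 0:
    leaders.Allreduce(MPI.IN_PLACE, node_sum, op=MPI.SUM)

# phase 3: broadcast the global sum back to all GPUs on the node (NCCL in the slides)
local.Bcast(node_sum, root=0)
grads[:] = node_sum / comm.size  # average the gradients
```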
Gradient Pipelining (Lag)
27
[Diagram: with lag-0 (fully synchronous), the gradients g_k of iteration k are allreduced and applied within iteration k; with lag-1, the allreduce of the previous iteration's gradients g_{k-1} runs in the background while iteration k computes, and the averaged ḡ_{k-1} is applied one step late, overlapping communication with computation; see the sketch below]
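A minimal sketch of the lag-1 idea, using a non-blocking MPI allreduce as a stand-in for the framework's gradient reduction: the gradients of step k-1 are averaged in the background while step k computes, and the weights are updated one step late. compute_gradients and apply_update are hypothetical placeholders.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def compute_gradients(step):
    # stand-in for the forward/backward pass of iteration `step`
    return np.full(1024, float(step), dtype=np.float32)

def apply_update(avg_grad):
    pass  # stand-in for the optimizer applying the (one-step-old) averaged gradient

grads_prev = np.zeros(1024, dtype=np.float32)
avg = np.empty_like(grads_prev)
req = None

for step in range(10):
    grads = compute_gradients(step)          # iteration k computes while k-1 communicates
    if req is not None:
        req.Wait()                           # finish the allreduce of step k-1 gradients
        apply_update(avg / comm.size)        # lag-1 weight update
    grads_prev[:] = grads
    req = comm.Iallreduce(grads_prev, avg, op=MPI.SUM)

if req is not None:                          # drain the final outstanding allreduce
    req.Wait()
    apply_update(avg / comm.size)
```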
Importance of Node-Local Memory
28
~10% performance penalty for reading from the global file system instead of node-local storage
Scaling Tiramisu
• the FP16 model is sensitive to communication
• the FP16 model is bandwidth-bound (only 2.5× faster than FP32)
• almost ideal scaling for both precisions on Summit when gradient lag is used
29
Scaling DeepLabv3+
30
• the FP16 model is sensitive to communication
• the FP16 model is bandwidth-bound (only 2.5× faster than FP32)
• excellent scaling for both precisions on Summit when gradient lag is used
31
DeepLabv3+ on 4560 nodes (27,360 GPUs): 999 PetaFlop/s sustained and 1.13 ExaFlop/s peak (FP16)
Concurrency/Precision and Convergence
33
~2.1x improvement in time to solution
Model/Lag and Convergence
34
Segmentation Animation
35
• best intersection-over-union (IoU) result obtained: ~73%
• result at large scale (batch size > 1500): IoU ~55%
What has happened since Gordon Bell?
• Horovod:
• Hierarchical (hybrid) allreduce available in upstream code (NVIDIA/Amazon)
• Control plane bypass via caching available soon (NVIDIA)
• NCCL:
• Version 2.4.x introduced tree-based reduction algorithms for improved
performance at scale
• NERSC:
• ClimateNet: human-annotated labels from domain experts
• the project serves as the science problem for the ISC-HPC19 student cluster AI challenge
• gained insights for the design and evaluation of the upcoming NERSC Perlmutter system
36
Conclusions
• deep learning and HPC converge, achieving exascale performance
• compute capabilities of contemporary HPC systems can be utilized to tackle
challenging scientific deep learning problems
• HPO and convergence at scale are still an open problem, but we can now run experiments at this scale
• software enhancements benefit deep learning community at large
• deep learning-powered techniques usher in a new era of precision analytics for
various science areas
37
Thank You
