Alexey Svyatkovskiy (Microsoft/Princeton U/PPPL),
Julian Kates-Harbeck (Harvard U/PPPL),
William Tang (Princeton U/PPPL)
Training neural networks
with low precision floats
#DL2SAIS
Outline
• Training neural networks with reduced floating-point precision
– Motivation and challenges
– Half-precision and mixed precision training approaches
• Enabling FP16 training on Pascal and Volta GPUs
– Compression of weights and gradients
– Master copy of weights, loss scaling
– TensorCores
• Fusion Recurrent Neural Net (FRNN) analysis framework
– The physics problem we are trying to solve
– Training LSTMs on long sequences of variable length
– Distributed training at the largest scale
– Results on TSUBAME 3.0 and OLCF Summit supercomputers
• Conclusions
2#DL2SAIS
Training with reduced floating-point precision: motivation
• Models are getting larger
– Optimize memory and network bandwidths
• Need faster training
– Improve maximum computational throughput, use larger mini-batch size
– Minimize communication overhead during distributed training
NN architecture    Num. layers    Num. trainable parameters    GFLOPS
AlexNet            8              60 million                   1.4
GoogLeNet          22             4 million                    4
Inception V3       48             24 million                   9
ResNet50           50             26 million                   23
Deep Speech 2      11             38 million                   465
#DL2SAIS 3
FP16 training: motivation and challenges
#DL2SAIS 4
FP16 training: motivation and challenges
Challenges
• IEEE 754 FP16 has a narrow numerical range
– Overflow problem: Inf/NaN cause irreversible damage to training (quite rare)
– “Vanishing gradient“ (quite common)
• Small valued gradients
• Large ratio of weight to weight gradient
Subnormal range: [2^-24, 2^-14]; normal range: [2^-14, 65504]
(Figure: number line comparing the exponent ranges of float and half, roughly 2^-127 to 2^128 for float and 2^-24 to 2^15 for half)
#DL2SAIS 5
Mixed precision with master copy of weights
• Store weights, activations, gradients in FP16; perform weight update in FP32
– Solves “vanishing gradient” problem
– Disadvantage: conversion between FP32 and FP16 may be slow, and extra memory is needed to keep
an additional copy of the weights in FP32
Mixed arithmetic precision
• FP16 matrix multiply, FP32 accumulate,
either FP32/FP16 for activations
• Volta TensorCores enable this on the
hardware level
– Solves “vanishing gradient” problem
– Programmable matrix-multiply-and-accumulate units.
8 TensorCores per streaming multiprocessor
Enabling FP16 training (I)
#DL2SAIS 6
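Below is a minimal NumPy sketch of the master-copy-of-weights idea (illustrative only, not the FRNN implementation): the forward and backward passes run in FP16, while the update is accumulated into an FP32 master copy so that small steps are not rounded away. The toy regression problem, shapes and learning rate are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy problem: learn w_true from (x, y) with FP16 compute + FP32 master weights.
x = rng.standard_normal((64, 32)).astype(np.float16)
w_true = rng.standard_normal((32, 1)).astype(np.float16)
y = x @ w_true

w_master = np.zeros((32, 1), dtype=np.float32)   # FP32 master copy of the weights
lr = np.float32(1e-2)

for step in range(100):
    w16 = w_master.astype(np.float16)            # FP16 copy used for fprop/bprop
    pred = x @ w16                               # forward pass in FP16
    err = pred - y
    grad16 = (x.T @ err) / np.float16(len(x))    # FP16 weight gradient
    # The update itself is applied in FP32, so small lr*grad contributions survive.
    w_master -= lr * grad16.astype(np.float32)
```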
Enabling FP16 training (III)
Loss scaling: shift gradients into the higher bins of the FP16 representable range, which are otherwise
unused
– Apply a scalar factor to the loss function before the backpropagation step
– Unscale the weight gradients before the update step to keep the magnitude of the updates the same
as in FP32; alternatively, re-tune the hyperparameters
• The appropriate scaling factor depends on the neural network and dataset at hand
(Figure: gradient histograms against the FP16 representable range, before and after scaling; a scale factor of 10 shifts values right by about 3 exponent values; everything below the representable range is irrelevant for training)
#DL2SAIS 7
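A small sketch of the effect: the scale factor of 10 is taken from the slide, everything else (the gradient value, parameter array, learning rate) is made up for illustration. A tiny gradient underflows in FP16 unless the loss, and therefore every gradient, is scaled first; the gradients are unscaled again just before the FP32 update.

```python
import numpy as np

loss_scale = 10.0        # factor from the slide; the right value is network/dataset dependent

# Why scaling helps: a tiny gradient underflows to zero in FP16 without it.
tiny_grad = 1e-8
print(np.float16(tiny_grad))                # 0.0      -> the update is silently lost
print(np.float16(tiny_grad * loss_scale))   # ~1.2e-07 -> survives as an FP16 subnormal

# One update with scaling (hypothetical gradient/parameter arrays):
lr = np.float32(1e-3)
param_fp32 = np.ones(4, dtype=np.float32)
scaled_grad_fp16 = np.full(4, tiny_grad * loss_scale, dtype=np.float16)   # from bprop of the scaled loss
unscaled = scaled_grad_fp16.astype(np.float32) / np.float32(loss_scale)   # unscale before the step
param_fp32 -= lr * unscaled                  # update magnitude matches plain FP32 training
```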
Enabling FP16 training (II)
Weight and gradient sharing
• Quantize weight and gradient matrices into
N bins via clustering
– Weights in the same bin share the same value
– Store index of shared weights in integer matrix
Dynamic fixed-point representation
• Train in full precision (float)
• Quantize weights and activations
– Gather statistics on the weights and activations (min, max for each layer)
– Convert each float value to a fixed-point format (the Google TPU uses an int8 representation
and arithmetic for this)
S. Han et al., "Deep Compression"
#DL2SAIS 8
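A sketch of the weight-sharing idea under simple assumptions (plain 1-D k-means in NumPy, 16 bins): the FP weight matrix is replaced by an integer index matrix plus a small codebook of shared values; gradients can be compressed the same way.

```python
import numpy as np

def share_weights(w, n_bins=16, iters=20):
    """Cluster weight values into n_bins via 1-D k-means; return an integer index
    matrix (stored instead of the FP weights) and a codebook of shared values."""
    flat = w.ravel().astype(np.float32)
    centers = np.linspace(flat.min(), flat.max(), n_bins)               # initial centroids
    for _ in range(iters):
        idx = np.abs(flat[:, None] - centers[None, :]).argmin(axis=1)   # assign to nearest centroid
        for k in range(n_bins):
            if np.any(idx == k):
                centers[k] = flat[idx == k].mean()                      # update centroid
    return idx.reshape(w.shape).astype(np.uint8), centers

# usage: reconstruct the (quantized) weights from index matrix + codebook
rng = np.random.default_rng(0)
w = rng.standard_normal((8, 8))
index_matrix, codebook = share_weights(w)
w_quantized = codebook[index_matrix]
```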
Single GPU results
• Volta, no loss scaling required for accurate convergence:

Model          FP32 baseline   Mixed precision
AlexNet        56.77 %         56.93 %
VGG-D          65.40 %         65.43 %
GoogLeNet      68.33 %         68.43 %
Inception V3   73.85 %         74.13 %
ResNet50       75.92 %         76.04 %

• Volta, loss scaling required for accurate convergence:

Model          FP32 baseline   MP, no loss scaling   MP, loss scaling
Faster R-CNN   69.1 %          68.6 %                69.7 %
Multibox SSD   76.9 %          diverges              77.1 %

Results from: P. Micikevicius et al., "Mixed Precision Training", https://arxiv.org/abs/1710.03740
#DL2SAIS 9
OLCF Summit: hardware and setup
• Summit is an IBM system located at the Oak Ridge Leadership Computing Facility. With a theoretical
peak performance of 200 PF, it is one of the "smartest" computers in the world for deep learning
applications, with a mixed-precision capability in excess of 3 EF
– Expected to be ranked the #1 supercomputing system worldwide this summer
– ~4,600 compute nodes; two IBM POWER9 (PPC64LE) processors and
6 NVIDIA Volta V100 accelerators per node, connected with NVLink
– Deployment via Singularity containers
– LSF scheduler, jsrun to launch jobs
– Alpine TDS IBM Spectrum Scale filesystem
– Globus DTN, HPSS
#DL2SAIS 10
Physics problem we are trying to solve
• Fusion energy: a clean source of energy that can be produced on Earth in a tokamak-style
fusion reactor
• Tokamak: a type of design for a nuclear fusion device. It is a torus-shaped vacuum chamber
surrounded by magnetic coils
• Existing tokamak-style fusion reactors: JET, DIII-D, NSTX
• Plasma disruptions are large macroscopic instabilities resulting in:
– Loss of confinement – ends fusion reaction
– Intense radiation – damaging
concentration in small areas
– Current quench – high magnetic forces
• Prediction and avoidance of disruptions is a crucial aspect of tokamak operation,
especially in the high-performance regime
#DL2SAIS 11
Data: scalar and vector time series
• Scalar (0D) time series: plasma current, locked-mode amplitude
• Vector (1D) time series: current, electron temperature and radiation profiles
– All as a function of ρ (the normalized flux surface), from ρ = 0 at the core to ρ = 1 at the edge
(Figure: a raw 1D profile is passed through 1D convolution and activation, then max pooling; the resulting salient features are concatenated into the full feature vector)
#DL2SAIS 12
Fusion Recurrent Neural Net (FRNN)
RNN architecture:
• LSTM, 3 layers
• 300 hidden units per cell
• Stateful, returns sequences
CNN architecture:
• Number of conv. filters: 10
• Size of conv. filters: 3
• Number of conv. layers: 2
• Pool size: 2
Time-distributed FC layer:
• applied to every temporal slice of the LSTM output
(Figure: at each time step T = 0, 1, ..., t the 0D signals are concatenated with CNN features extracted from the 1D data and fed to the stateful LSTM, which carries its internal state forward; a time-distributed FC layer turns each output into an alarm score that is compared against a threshold: "Disruption coming?")
#DL2SAIS 13
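A hedged tf.keras sketch of an FRNN-like model following the figure: a small CNN is applied to every temporal slice of the 1D profile, its features are concatenated with the 0D signals, and the result is fed through three stateful LSTM layers and a time-distributed output head. The shapes, the layer ordering inside the CNN and the single-unit output are illustrative assumptions; the actual implementation lives in the plasma-python repository.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

BATCH = 128          # stateful LSTMs need a fixed batch size
T = 128              # unrolling length of each subsequence
N_0D = 8             # number of scalar (0D) signals (assumed)
PROFILE_LEN = 64     # length of each 1D profile (assumed)

signals_0d = layers.Input(shape=(T, N_0D), batch_size=BATCH)
profiles_1d = layers.Input(shape=(T, PROFILE_LEN, 1), batch_size=BATCH)

# CNN branch applied to every temporal slice of the 1D data
# (2 conv layers, 10 filters of size 3, pool size 2, as on the slide)
cnn = models.Sequential([
    layers.Conv1D(10, 3, activation="relu", padding="same", input_shape=(PROFILE_LEN, 1)),
    layers.Conv1D(10, 3, activation="relu", padding="same"),
    layers.MaxPooling1D(2),
    layers.Flatten(),
])
features_1d = layers.TimeDistributed(cnn)(profiles_1d)

x = layers.Concatenate()([signals_0d, features_1d])
for _ in range(3):                                   # 3 stateful LSTM layers, 300 units each
    x = layers.LSTM(300, return_sequences=True, stateful=True)(x)
out = layers.TimeDistributed(layers.Dense(1))(x)     # per-time-step disruption score vs. threshold

model = tf.keras.Model([signals_0d, profiles_1d], out)
```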
Training LSTM on sequences of variable lengths (I)
• Time series lengths in our problem vary by orders of magnitude: a challenge for sequence-based
learning tasks
• We approximate the gradient computation by truncated backprop through time (BPTT)
– Gradients are computed over a subsequence of the time series, the internal states are saved,
and then the next subsequence is fed to the RNN using the last internal states of the
previous subsequence as the initial internal states of the current one
• The traditional approach of bucketing would not work: sequence length is strongly correlated
with whether a shot is disruptive or non-disruptive, so individual batches would be biased
• We implement a custom approach that resets the internal state of individual examples within a mini-batch
– Since the internal state persists between successive subsequences in time, it is not
possible to use more than one subsequence from a given shot in a given mini-batch
14#DL2SAIS
Training LSTM on sequences of variable lengths (II)
• To train batch-wise we need M independent (stemming from different shots) time slices of equal
length to feed to the GPU
– Maintain a buffer of M separate shots. At every training
step, the first T time slices of the buffer are fed as the
next batch, then shift the buffer by T time steps
• When a shot finishes in the buffer, a new shot
is loaded and the RNN internal states at the
corresponding batch index are reset for training
• Internal states at the other batch indices are
maintained; thus the internal state persists
during learning for the entire length of any given shot
– This allows the RNN to learn temporal patterns much longer than the unrolling length T, potentially as long as
the entire shot (see the sketch after this slide)
• Random offsets of shots against each other and random shuffling of the training set provide a mixture
of disruptive and non-disruptive samples in every batch, stabilizing training
15#DL2SAIS
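A sketch of the batched truncated-BPTT loop with per-example state resets described on the two slides above. It assumes stateful tf.keras LSTM layers (whose states are per-batch-row variables) and a hypothetical buffer object with a next_chunk(T) method that returns the next T time slices plus the batch rows whose shots just finished; it is not the FRNN implementation.

```python
import tensorflow as tf

def reset_states_at(model, batch_indices):
    """Zero the recurrent state of the given batch rows only, leaving other rows intact."""
    for layer in model.layers:
        if isinstance(layer, tf.keras.layers.LSTM) and layer.stateful:
            for state_var in layer.states:                       # [h, c], each of shape (batch, units)
                value = tf.keras.backend.get_value(state_var)
                value[batch_indices, :] = 0.0
                tf.keras.backend.set_value(state_var, value)

def train_on_buffer(model, buffer, T, steps):
    """Truncated BPTT: gradients are computed over each length-T subsequence while
    the internal state carries over between successive subsequences of the same shot."""
    for _ in range(steps):
        x_chunk, y_chunk, finished_rows = buffer.next_chunk(T)   # hypothetical loader API
        model.train_on_batch(x_chunk, y_chunk)                   # BPTT over this subsequence only
        reset_states_at(model, finished_rows)                    # newly loaded shots start from a clean state
```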
Introducing FRNN
• FRNN represents a combined workflow implemented in Python
– Keras for layer definitions
– TensorFlow optimizers
– Custom parameter averaging and global weight update with CUDA-aware MPI
– Not a general-purpose tool: geared toward the fusion energy scientific
community, but certain parts can be useful to a wider audience
– https://github.com/PPPLDeepLearning/plasma-python
• We demonstrate how to enable FP16 for distributed training with this
combined Keras/TensorFlow/MPI based workflow
– The weight update is sensitive to floating-point precision
– Volta/FP16-optimized algorithms are disabled by default in TensorFlow and are
selected by the cuDNN autotuner
16#DL2SAIS
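One way to switch the Keras side of such a workflow to FP16 is to change the default float type before building the model; a minimal sketch with a toy model is shown below (the layer sizes, optimizer and data are placeholders, and the loss scaling from the previous slides would still be needed for a real network).

```python
import numpy as np
import tensorflow as tf

tf.keras.backend.set_floatx("float16")   # weights/activations default to FP16 from here on
tf.keras.backend.set_epsilon(1e-4)       # the default 1e-7 underflows in FP16

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss="mse")

x = np.random.rand(128, 32).astype(np.float16)    # inputs are cast to FP16 as well
y = np.random.rand(128, 1).astype(np.float16)
model.fit(x, y, epochs=1, batch_size=32, verbose=0)
```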
Distributed training flow
(Step 0) Initialize the network parameters randomly, based on the model configuration
(Step 1) Perform fprop and bprop on each worker (Replica 1 ... Replica N, one GPU worker per replica)
(Step 2) Aggregate the local gradients from each worker using MPI_Allreduce, then average them
(Step 3) Update the optimizer internal state and the global weights using the global gradient:
w_global ← w_global − η · ∇w_global
(Step 4) MPI_Broadcast the global parameters after the update back to the workers, then repeat from (Step 1)
17
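A minimal mpi4py sketch of steps 2-3 (FP32 gradients for simplicity): each rank contributes its local gradients to an MPI_Allreduce, the sum is averaged, and every rank applies the same update to its replica of the weights. In this sketch the broadcast of step 4 becomes implicit because all replicas start from identical weights and apply identical updates; the parameter names and learning rate are placeholders.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
n_workers = comm.Get_size()

def global_step(params, local_grads, lr=1e-3):
    """Allreduce-sum the local gradients, average them, update the replicated weights."""
    for p, g in zip(params, local_grads):
        g_sum = np.empty_like(g)
        comm.Allreduce(g, g_sum, op=MPI.SUM)   # Step 2: aggregate local gradients across workers
        g_global = g_sum / n_workers           #         then average them
        p -= lr * g_global                     # Step 3: w_global <- w_global - lr * g_global
```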
No loss in training accuracy
• Use the validation-level area under the ROC curve (AUC) to characterize
the quality of FRNN and the applicability of half-precision training
• Validation AUC as a function of epoch shows similar shapes for both FP16 and FP32
– Reaching a plateau at around AUC = 0.87 by epoch 6
– SGD optimizer with momentum, loss scaling factor = 10
(Figure: validation AUC vs. epoch for base learning rates λ0 = 0.0004 and λ0 = 0.001; the test set AUC = 0.96 for both FP16 and FP32)
#DL2SAIS 18
Performance and scaling properties (I)
• FRNN shows nearly-linear runtime scaling and logarithmic communication
(Figures: speedup over serial vs. validation AUC achieved (ideal vs. measured); time per epoch vs. number of GPUs (data, scaling model, ideal scaling); training loss and validation AUC vs. scaled walltime for N_GPU = 4, 16, 64, 128, 256; fraction of detected disruptions and test AUC vs. time to disruption; runtime in hours for 1 GPU, 10^2 GPUs (parallel training), 10^2 GPUs (parallel tuning) and 10^4 GPUs)
#DL2SAIS 19
Performance and scaling properties (II)
• Results for hyperparameter tuning with 10,000 GPUs,
using parallel random search across
100 models, each trained on 100 GPUs.
• The time to solution for finding a model of given
validation AUC is compared to the scenario
of using a single GPU.
– The inset shows the time required for finding the best
model in the scenarios of using 1, 100 and
10000 total GPUs.
– For 100 GPUs, we distinguish training the 100 models
in serial, but using 100 GPUs for each model
(“parallel training”) vs. running 100 models in parallel,
trained on 1 GPU each (“parallel tuning”),
which both achieve nearly 100x speedup.
(Figure: speedup over serial vs. validation AUC achieved (ideal vs. measured), with an inset showing total runtime in hours for 1 GPU, 10^2 GPUs (parallel training), 10^2 GPUs (parallel tuning) and 10^4 GPUs)
#DL2SAIS 20
Conclusions
• As neural network models grow larger, reduced floating-point precision training
approaches become crucial
– They optimize computational throughput and reduce memory and network bandwidth requirements,
allowing for faster training
• We developed a domain-specific deep learning framework integrating Keras and TensorFlow with
custom parameter averaging and global weight update routines implemented with CUDA-aware
MPI, intended for disruption forecasting in tokamaks
– Strong, nearly linear runtime scaling on GPU clusters
– Hyperparameter tuning with O(10000) GPUs
• Evaluated training of deep RNNs with half-precision floats on a real scientific dataset
– Used the JET plasma disruption time-series dataset
– Used the loss scaling approach to facilitate model convergence at FP16
– Obtained comparable validation-level performance at the end of each epoch for half and
single floating-point precision
21#DL2SAIS
Microsoft Cloud+AI SMART Data Science Team is hiring!
• ML-Enhanced Coding Assistant based on a Markov Chain Model
• Automatic Code Reviews Generation through Deep Learning
• Answering Documentation Questions using a Seq-to-Seq Deep Learning Model
• AI Infused Analytics
• Code Risk Analysis using Graph Learning
#DL2SAIS 22
Contact: neels@microsoft.com
Backup
23#DL2SAIS
NVIDIA Volta architecture: TensorCores
• What are TensorCores?
– Programmable matrix-multiply-and-accumulate units. 8 TensorCores per
streaming multiprocessor
– Each TensorCore performs operations on small 4x4 matrices
– 1 matrix multiply-and-accumulate operation per GPU clock
– During program execution, multiple TensorCores are used concurrently by a full
warp of execution on the GPU
https://devblogs.nvidia.com/tensor-core-ai-performance-milestones/
#DL2SAIS 24
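For intuition, the operation a single TensorCore performs per clock can be emulated in NumPy as below: two FP16 4x4 tiles are multiplied and accumulated into an FP32 4x4 tile. On the hardware this is a single instruction, reached through cuBLAS/cuDNN or the CUDA WMMA API rather than user code.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)).astype(np.float16)   # FP16 input tile
B = rng.standard_normal((4, 4)).astype(np.float16)   # FP16 input tile
C = np.zeros((4, 4), dtype=np.float32)               # FP32 accumulator

# Emulated tensor-core primitive: D = A @ B + C with FP16 inputs and FP32 accumulation
D = A.astype(np.float32) @ B.astype(np.float32) + C
```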
Training flow
Forward pass: mostly
matrix and element-wise
multiplications
Backprop: chain-rule
differentiation,
element-wise multiplications
Optimizer weight update:
multiply gradients by learning
rate (e.g. SGD)
!"
= $" !"%&
!' = $' !&
()
(!"%&
=
()
(!"
$"
()
($"
=
()
(!"
!"%&
!& = $& *
X
!&
!'
Forward pass on the next
mini-batch of data…
!"
= $" !"%&
!& = $& *
X
!&
()
(*
=
()
(!&
$&
()
($&
=
()
(!&
*
()
(!"%&
$" = $" − ,
()
($"
$& = $& − ,
()
($&
()
($"
$"
$&()
($&
-.// )!"-.// )!"
()
(!"
#DL2SAIS 25
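The same flow as a runnable NumPy sketch for a toy two-layer linear network with a mean-squared-error loss (written in the row-major x @ w convention; shapes, data and the learning rate are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 8))           # mini-batch of inputs
y = rng.standard_normal((16, 1))           # targets
w1 = 0.1 * rng.standard_normal((8, 4))
w2 = 0.1 * rng.standard_normal((4, 1))
lam = 0.01                                 # learning rate (lambda on the slide)

for _ in range(10):
    # forward pass: mostly matrix multiplications
    a1 = X @ w1
    a2 = a1 @ w2
    dL_da2 = 2.0 * (a2 - y) / len(X)       # dLoss/da2 for an MSE loss
    # backprop: chain-rule differentiation
    dw2 = a1.T @ dL_da2                    # dL/dw2
    da1 = dL_da2 @ w2.T                    # dL/da1
    dw1 = X.T @ da1                        # dL/dw1
    # optimizer weight update: multiply gradients by the learning rate (SGD)
    w2 -= lam * dw2
    w1 -= lam * dw1
```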
Recurrent Neural Nets: basic description
• RNNs are a family of neural networks to process sequential data
• Feed forward equations are recurrent:
! " = $ + &ℎ " − 1 + *+(")
ℎ " = tanh ! "
2 " = 3 + 4ℎ(")
L
o
h
x
y
U
WV
Unfold
Notations:
x – input sequence,
U – is the input to hidden weight matrix,
W - hidden to hidden,
V – hidden to output weights
b,c are the biases
tanh() is the activation function (non-linearity)
o – output sequence
Loss L and target values are denoted as y
U U
h(t-1) h(t) h(…)h(…)
W W W
x(t-1) x(t)
o(t-1) o(t)
L(t-1) L(t)
y(t-1) y(t)
#DL2SAIS 26
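The three equations above as a runnable NumPy sketch of a single vanilla RNN forward pass (sizes and random weights are placeholders).

```python
import numpy as np

def rnn_forward(x_seq, U, W, V, b, c):
    """x_seq: (T, n_in). Returns the output sequence o of shape (T, n_out)."""
    h = np.zeros(W.shape[0])
    outputs = []
    for x_t in x_seq:
        a = b + W @ h + U @ x_t    # a(t) = b + W h(t-1) + U x(t)
        h = np.tanh(a)             # h(t) = tanh(a(t))
        o = c + V @ h              # o(t) = c + V h(t)
        outputs.append(o)
    return np.stack(outputs)

rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 4, 8, 2, 16
o = rnn_forward(rng.standard_normal((T, n_in)),
                rng.standard_normal((n_hid, n_in)),    # U: input-to-hidden
                rng.standard_normal((n_hid, n_hid)),   # W: hidden-to-hidden
                rng.standard_normal((n_out, n_hid)),   # V: hidden-to-output
                np.zeros(n_hid), np.zeros(n_out))
```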
Model convergence at large N
• Learning rate schedule is crucial to facilitate model convergence during
distributed training when the effective mini-batch size is large
• We use exponential learning rate decay with adjustable base learning rate
• Effective batch size is multiplied by the
number of workers
• In a distributed regime, the base
learning rate is reduced as the number
of workers N is increased
• Learning rate is clipped
!" = !$ %"
!$ (', )) =
!$
1.0 + '/)
#DL2SAIS 27
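A sketch of such a schedule; the functional forms follow the (reconstructed) formulas above, and the decay factor, α and the clipping floor are assumed values.

```python
def base_lr(lr0, n_workers, alpha=16.0):
    """Reduce the base learning rate as the number of workers grows:
    lr0(N, alpha) = lr0 / (1 + N / alpha)."""
    return lr0 / (1.0 + n_workers / alpha)

def lr_at_epoch(lr0, epoch, gamma=0.9, lr_min=1e-5):
    """Exponential decay lr_t = lr0 * gamma**t, clipped from below."""
    return max(lr0 * gamma ** epoch, lr_min)

# example: effective schedule for 64 workers
lr0 = base_lr(0.001, n_workers=64)
schedule = [lr_at_epoch(lr0, e) for e in range(10)]
```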
Half-precision training (I)
Half-precision training:
– Store weights, activations and gradients in FP16
– Perform matrix and element-wise multiplications, reduction ops in FP16 (including during fprop, bprop and weight
update)
• FP16 math, arithmetic, and comparisons are supported on NVIDIA GPUs with compute capability
5.3 and greater
– CUDA half intrinsics (round to nearest even): http://docs.nvidia.com/cuda/cuda-math-
api/group__CUDA__MATH__INTRINSIC__HALF.html#group__CUDA__MATH__INTRINSIC__HALF
– TensorFlow has DT_HALF registered type:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/types.cc
– We add a custom MPI datatype of 2 contiguous bytes: https://github.com/PPPLDeepLearning/plasma-
python/blob/master/plasma/primitives/ops.py
• This talk:
– Review techniques to enable FP16 and mixed precision training
– Introduce the FRNN framework for disruption forecasting in tokamak fusion plasmas, and discuss strong scaling
– Evaluate FP16 training on FRNN framework and a real scientific dataset from JET experiment
– Evaluate FP16 training on the benchmark IMDB dataset
#DL2SAIS 28
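A simplified mpi4py sketch of the half-precision reduction mentioned above, in the spirit of the linked ops.py (details may differ from the repository): MPI has no built-in FP16 type, so a 2-byte contiguous datatype is registered together with a user-defined sum that reinterprets the buffers as float16.

```python
import numpy as np
from mpi4py import MPI

mpi_half = MPI.BYTE.Create_contiguous(2)     # custom MPI datatype: 2 contiguous bytes
mpi_half.Commit()

def _sum_f16(inbuf, outbuf, datatype):
    """User-defined reduction: view both buffers as float16 and accumulate in place."""
    a = np.frombuffer(inbuf, dtype=np.float16)
    b = np.frombuffer(outbuf, dtype=np.float16)
    b += a

op_sum_f16 = MPI.Op.Create(_sum_f16, commute=True)

# usage: allreduce an FP16 gradient buffer
comm = MPI.COMM_WORLD
grad = np.ones(8, dtype=np.float16)
total = np.empty_like(grad)
comm.Allreduce([grad, mpi_half], [total, mpi_half], op=op_sum_f16)
```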
RNN Predictions
(Figure: a true-positive example plotted against time to disruption, TTD [ms]; the FRNN output spikes as the shot shows a current increase, locked mode spike, density drop, internal inductance spike and a drop in stored diamagnetic energy; y-axis in arbitrary units)

More Related Content

What's hot

Apache Flink vs Apache Spark - Reproducible experiments on cloud.
Apache Flink vs Apache Spark - Reproducible experiments on cloud.Apache Flink vs Apache Spark - Reproducible experiments on cloud.
Apache Flink vs Apache Spark - Reproducible experiments on cloud.
Shelan Perera
 
Hive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmarkHive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmark
Dongwon Kim
 
Spark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovSpark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud Ibrahimov
Maksud Ibrahimov
 
Analyzing IOT Data in Apache Spark Across Data Centers and Cloud with NetApp ...
Analyzing IOT Data in Apache Spark Across Data Centers and Cloud with NetApp ...Analyzing IOT Data in Apache Spark Across Data Centers and Cloud with NetApp ...
Analyzing IOT Data in Apache Spark Across Data Centers and Cloud with NetApp ...
Databricks
 
Spark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena LazovikSpark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena Lazovik
Spark Summit
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
Alpine Data
 
Deploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkDeploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using Spark
Jen Aman
 
Beyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflowsBeyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflows
DataWorks Summit
 
CaffeOnSpark Update: Recent Enhancements and Use Cases
CaffeOnSpark Update: Recent Enhancements and Use CasesCaffeOnSpark Update: Recent Enhancements and Use Cases
CaffeOnSpark Update: Recent Enhancements and Use Cases
DataWorks Summit
 
GoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesGoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with Dependencies
DataWorks Summit/Hadoop Summit
 
A Comparative Performance Evaluation of Apache Flink
A Comparative Performance Evaluation of Apache FlinkA Comparative Performance Evaluation of Apache Flink
A Comparative Performance Evaluation of Apache Flink
Dongwon Kim
 
Spark Summit EU talk by Stavros kontopoulos and Justin Pihony
Spark Summit EU talk by Stavros kontopoulos and Justin PihonySpark Summit EU talk by Stavros kontopoulos and Justin Pihony
Spark Summit EU talk by Stavros kontopoulos and Justin Pihony
Spark Summit
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Databricks
 
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...
Databricks
 
Spark tuning
Spark tuningSpark tuning
How to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache SparkHow to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache Spark
Databricks
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
Evan Chan
 
Pedal to the Metal: Accelerating Spark with Silicon Innovation
Pedal to the Metal: Accelerating Spark with Silicon InnovationPedal to the Metal: Accelerating Spark with Silicon Innovation
Pedal to the Metal: Accelerating Spark with Silicon Innovation
Jen Aman
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Cloudera, Inc.
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
hadooparchbook
 

What's hot (20)

Apache Flink vs Apache Spark - Reproducible experiments on cloud.
Apache Flink vs Apache Spark - Reproducible experiments on cloud.Apache Flink vs Apache Spark - Reproducible experiments on cloud.
Apache Flink vs Apache Spark - Reproducible experiments on cloud.
 
Hive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmarkHive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmark
 
Spark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovSpark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud Ibrahimov
 
Analyzing IOT Data in Apache Spark Across Data Centers and Cloud with NetApp ...
Analyzing IOT Data in Apache Spark Across Data Centers and Cloud with NetApp ...Analyzing IOT Data in Apache Spark Across Data Centers and Cloud with NetApp ...
Analyzing IOT Data in Apache Spark Across Data Centers and Cloud with NetApp ...
 
Spark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena LazovikSpark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena Lazovik
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
 
Deploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkDeploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using Spark
 
Beyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflowsBeyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflows
 
CaffeOnSpark Update: Recent Enhancements and Use Cases
CaffeOnSpark Update: Recent Enhancements and Use CasesCaffeOnSpark Update: Recent Enhancements and Use Cases
CaffeOnSpark Update: Recent Enhancements and Use Cases
 
GoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesGoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with Dependencies
 
A Comparative Performance Evaluation of Apache Flink
A Comparative Performance Evaluation of Apache FlinkA Comparative Performance Evaluation of Apache Flink
A Comparative Performance Evaluation of Apache Flink
 
Spark Summit EU talk by Stavros kontopoulos and Justin Pihony
Spark Summit EU talk by Stavros kontopoulos and Justin PihonySpark Summit EU talk by Stavros kontopoulos and Justin Pihony
Spark Summit EU talk by Stavros kontopoulos and Justin Pihony
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
 
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...
 
Spark tuning
Spark tuningSpark tuning
Spark tuning
 
How to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache SparkHow to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache Spark
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
 
Pedal to the Metal: Accelerating Spark with Silicon Innovation
Pedal to the Metal: Accelerating Spark with Silicon InnovationPedal to the Metal: Accelerating Spark with Silicon Innovation
Pedal to the Metal: Accelerating Spark with Silicon Innovation
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 

Similar to Training Distributed Deep Recurrent Neural Networks with Mixed Precision on GPU Clusters with Alexey Svyatkovskiy

In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unit
Jinwon Lee
 
CPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performance
Coburn Watson
 
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORSAFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
cscpconf
 
Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processors
csandit
 
SparkNet presentation
SparkNet presentationSparkNet presentation
SparkNet presentation
Sneh Pahilwani
 
Scolari's ICCD17 Talk
Scolari's ICCD17 TalkScolari's ICCD17 Talk
Scolari's ICCD17 Talk
NECST Lab @ Politecnico di Milano
 
Assisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated ArchitectureAssisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated Architecture
inside-BigData.com
 
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
Edge AI and Vision Alliance
 
An Introduction to TensorFlow architecture
An Introduction to TensorFlow architectureAn Introduction to TensorFlow architecture
An Introduction to TensorFlow architecture
Mani Goswami
 
KSpeculative aspects of high-speed processor design
KSpeculative aspects of high-speed processor designKSpeculative aspects of high-speed processor design
KSpeculative aspects of high-speed processor design
ssuser7dcef0
 
Fast switching of threads between cores - Advanced Operating Systems
Fast switching of threads between cores - Advanced Operating SystemsFast switching of threads between cores - Advanced Operating Systems
Fast switching of threads between cores - Advanced Operating Systems
Ruhaim Izmeth
 
improve deep learning training and inference performance
improve deep learning training and inference performanceimprove deep learning training and inference performance
improve deep learning training and inference performance
s.rohit
 
Hardback solution to accelerate multimedia computation through mgp in cmp
Hardback solution to accelerate multimedia computation through mgp in cmpHardback solution to accelerate multimedia computation through mgp in cmp
Hardback solution to accelerate multimedia computation through mgp in cmp
eSAT Publishing House
 
Machine Learning @NECST
Machine Learning @NECSTMachine Learning @NECST
Machine Learning @NECST
NECST Lab @ Politecnico di Milano
 
Project Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptxProject Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptx
AkshitAgiwal1
 
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Multicore
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- MulticoreLec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Multicore
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Multicore
Hsien-Hsin Sean Lee, Ph.D.
 
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
RISC-V International
 
Multithreading computer architecture
 Multithreading computer architecture  Multithreading computer architecture
Multithreading computer architecture
Haris456
 
Early Application experiences on Summit
Early Application experiences on Summit Early Application experiences on Summit
Early Application experiences on Summit
Ganesan Narayanasamy
 
Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)
Shien-Chun Luo
 

Similar to Training Distributed Deep Recurrent Neural Networks with Mixed Precision on GPU Clusters with Alexey Svyatkovskiy (20)

In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unit
 
CPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performance
 
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORSAFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
 
Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processors
 
SparkNet presentation
SparkNet presentationSparkNet presentation
SparkNet presentation
 
Scolari's ICCD17 Talk
Scolari's ICCD17 TalkScolari's ICCD17 Talk
Scolari's ICCD17 Talk
 
Assisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated ArchitectureAssisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated Architecture
 
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
 
An Introduction to TensorFlow architecture
An Introduction to TensorFlow architectureAn Introduction to TensorFlow architecture
An Introduction to TensorFlow architecture
 
KSpeculative aspects of high-speed processor design
KSpeculative aspects of high-speed processor designKSpeculative aspects of high-speed processor design
KSpeculative aspects of high-speed processor design
 
Fast switching of threads between cores - Advanced Operating Systems
Fast switching of threads between cores - Advanced Operating SystemsFast switching of threads between cores - Advanced Operating Systems
Fast switching of threads between cores - Advanced Operating Systems
 
improve deep learning training and inference performance
improve deep learning training and inference performanceimprove deep learning training and inference performance
improve deep learning training and inference performance
 
Hardback solution to accelerate multimedia computation through mgp in cmp
Hardback solution to accelerate multimedia computation through mgp in cmpHardback solution to accelerate multimedia computation through mgp in cmp
Hardback solution to accelerate multimedia computation through mgp in cmp
 
Machine Learning @NECST
Machine Learning @NECSTMachine Learning @NECST
Machine Learning @NECST
 
Project Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptxProject Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptx
 
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Multicore
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- MulticoreLec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Multicore
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Multicore
 
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
 
Multithreading computer architecture
 Multithreading computer architecture  Multithreading computer architecture
Multithreading computer architecture
 
Early Application experiences on Summit
Early Application experiences on Summit Early Application experiences on Summit
Early Application experiences on Summit
 
Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 

Recently uploaded (20)

The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 

Training Distributed Deep Recurrent Neural Networks with Mixed Precision on GPU Clusters with Alexey Svyatkovskiy

  • 1. Alexey Svyatkovskiy (Microsoft/Princeton U/PPPL), Julian Kates-Harbeck (Harvard U/PPPL), William Tang (Princeton U/PPPL) Training neural networks with low precision floats #DL2SAIS
  • 2. Outline • Training neural networks with reduced floating-point precision – Motivation and challenges – Half-precision and mixed precision training approaches • Enabling FP16 training on Pascal and Volta GPUs – Compression of weights and gradients – Master copy of weights, loss scaling – TensorCores • Fusion Recurrent Neural Net (FRNN) analysis framework – The physics problem we are trying to solve – Training LSTMs on long sequences of variable length – Distributed training at the largest scale – Results on TSUBAME 3.0 and OLCF Summit supercomputers • Conclusions 2#DL2SAIS
  • 3. Training with reduced floating-point precision: motivation • Models are getting larger – Optimize memory and network bandwidths • Need faster training – Improve maximum computational throughput, use larger mini-batch size – Minimize communication overhead during distributed training NN architecture Num. layers Num. trainable parameters GFLOPS AlexNet 8 60 million 1.4 GoogLeNet 22 4 million 4 Inception V3 48 24 million 9 ResNet50 152 26 million 23 Deep Speech 2 11 38 millions 465 #DL2SAIS 3
  • 4. FP16 training: motivation and challenges #DL2SAIS 4
  • 5. FP16 training: motivation and challenges Challenges • IEEE 754 FP16 has a narrow numerical range – Overflow problem: Inf/NaN cause irreversible damage to training (quite rare) – “Vanishing gradient“ (quite common) • Small valued gradients • Large ratio of weight to weight gradient Subnormal range: [2-24 ,2-14 ] Normal range: [214 ,65504] float half -127 -24-14 0 15 128 #DL2SAIS 5
  • 6. Mixed precision with master copy of weights • Store weights, activations, gradients in FP16; perform weight update in FP32 – Solves “vanishing gradient” problem – Disadvantage: conversion between FP32 and FP16 may be slow; extra memory to keep an extra copy of weights in FP32 Mixed arithmetic precision • FP16 matrix multiply, FP32 accumulate, either FP32/FP16 for activations • Volta TensorCores enable this on the hardware level – Solves “vanishing gradient” problem – Programmable matrix-multiply-and-accumulate units. 8 TensorCores per streaming multiprocessor Enabling FP16 training (I) #DL2SAIS 6
  • 7. Enabling FP16 training (III) Loss scaling: shift gradients to occupy higher bins of FP16 representable range, which are not used – Apply a scalar factor to the loss function before backpropagation step – Unscale weight gradients before the update step to maintain the magnitude of updates the same as for FP32; alternatively re-tune the hyperparameters • Scaling factors will depend on the neural network/dataset in hand FP16 representable range FP16 representable rangeScale factor of 10 shifts by 3 exponent values right Everything below is irrelevant for training #DL2SAIS 7
  • 8. Enabling FP16 training (II) Weight and gradient sharing • Quantize weight and gradient matrices into N bins via clustering – Weights in the same bin share the same value – Store index of shared weights in integer matrix Dynamic fixed-point representation • Train in full precision (float) • Quantize weight and activation – Gather statistics about weight and activation (min, max for each layer) – Convert each float value to fixed point format (Google TPU uses int8 equivalent representation and arithmetic for that) S. Han et. al. “Deep compression” #DL2SAIS 8
  • 9. Single GPU results
    – Volta, no loss scaling required for accurate convergence:
        Model          FP32 baseline   Mixed precision
        AlexNet        56.77 %         56.93 %
        VGG-D          65.40 %         65.43 %
        GoogLeNet      68.33 %         68.43 %
        Inception V3   73.85 %         74.13 %
        ResNet50       75.92 %         76.04 %
    – Volta, loss scaling required for accurate convergence:
        Model          FP32 baseline   MP, no loss scaling   MP, loss scaling
        Faster R-CNN   69.1 %          68.6 %                69.7 %
        Multibox SSD   76.9 %          diverges              77.1 %
    – Results from: P. Micikevicius et al., "Mixed Precision Training", https://arxiv.org/abs/1710.03740
  • 10. OLCF Summit: hardware and setup
    – Summit is an IBM system located at the Oak Ridge Leadership Computing Facility. With a theoretical peak performance of 200 PF, it is one of the "smartest" computers in the world for deep learning applications, with mixed-precision capability in excess of 3 EF
      • Expected to be ranked as the #1 supercomputing system worldwide this summer
      • 4600 compute nodes, each with two IBM POWER9 (ppc64le) processors and 6 Volta V100 accelerators, connected by NVLink
      • Deployment via Singularity containers
      • LSF scheduler, jsrun to execute jobs
      • AlpineTDS IBM Spectrum Scale filesystem
      • Globus DTN, HPSS
  • 11. Physics problem we are trying to solve
    – Fusion energy: a clean source of energy that can be produced on Earth in a tokamak-style fusion reactor
    – Tokamak: a type of design for a nuclear fusion device; a torus-shaped vacuum chamber surrounded by magnetic coils
    – Existing tokamak-style fusion reactors: JET, DIII-D, NSTX
    – Plasma disruptions are large macroscopic instabilities resulting in:
      • Loss of confinement, which ends the fusion reaction
      • Intense radiation, with damaging concentration in small areas
      • Current quench, producing high magnetic forces
    – Prediction and avoidance of disruptions is a crucial aspect of tokamak operation, especially in the high-performance regime
  • 12. Data: scalar and vector time series
    – Scalar (0D) time series: plasma current, locked-mode amplitude
    – Vector (1D) time series, all as a function of ρ (normalized flux surface): 1D current, electron temperature and radiation profiles
    – [Figure: a raw 1D profile, sampled from ρ = 0 to ρ = 1, is passed through 1D convolution and activation followed by max pooling to extract salient features, which are concatenated with the 0D signals into the full feature vector]
  • 13. Fusion Recurrent Neural Net (FRNN)
    – Inputs at every time step T: 0D signals plus 1D profile data passed through a CNN feature extractor
    – CNN architecture: 2 convolutional layers, 10 convolutional filters of size 3, pool size 2
    – RNN architecture: LSTM, 3 layers, 300 hidden units per cell; stateful, returns sequences
    – Time-distributed fully connected layer: applied to every temporal slice of the LSTM output
    – The output at each time step is compared against a threshold to raise an alarm: "Disruption coming?"
    – A sketch of this architecture in Keras is shown below
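The following is a minimal Keras sketch of an FRNN-style model using the hyperparameters quoted on this slide. It is not the actual FRNN implementation (see the plasma-python repository referenced later); the batch size, subsequence length, number of 0D signals and profile resolution (batch_size, seq_len, n_scalars, profile_len) are illustrative placeholders.

```python
# Sketch only: an FRNN-like model with the layer sizes quoted on the slide.
# batch_size, seq_len, n_scalars and profile_len are illustrative placeholders.
from tensorflow.keras import layers, models

batch_size, seq_len = 128, 128        # shots are fed as subsequences of seq_len slices
n_scalars, profile_len = 14, 64

# 1D-profile branch: two conv layers (10 filters of width 3) plus max pooling,
# applied to every time slice via TimeDistributed.
profile_in = layers.Input(shape=(seq_len, profile_len, 1), batch_size=batch_size)
x = layers.TimeDistributed(layers.Conv1D(10, 3, activation='relu'))(profile_in)
x = layers.TimeDistributed(layers.Conv1D(10, 3, activation='relu'))(x)
x = layers.TimeDistributed(layers.MaxPooling1D(pool_size=2))(x)
x = layers.TimeDistributed(layers.Flatten())(x)

# 0D scalar signals are concatenated with the CNN features at every time step.
scalar_in = layers.Input(shape=(seq_len, n_scalars), batch_size=batch_size)
h = layers.Concatenate()([scalar_in, x])

# Three stateful LSTM layers, 300 units each, returning full sequences.
for _ in range(3):
    h = layers.LSTM(300, return_sequences=True, stateful=True)(h)

# Time-distributed fully connected layer: one alarm score per time slice.
out = layers.TimeDistributed(layers.Dense(1, activation='sigmoid'))(h)
model = models.Model([scalar_in, profile_in], out)
```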
  • 14. Training LSTMs on sequences of variable length (I)
    – Time series lengths in our problem vary by orders of magnitude, which is a challenge for sequence-based learning tasks
    – We approximate the gradient computation with truncated backpropagation through time (BPTT): gradients are computed over a subsequence of the time series, the internal states are saved, and the next subsequence is fed to the RNN using the last internal states of the previous subsequence as its initial states
    – The traditional approach of bucketing would not work: sequence length is strongly correlated with whether a shot is disruptive or non-disruptive, so individual batches would be biased
    – Instead we implement a custom approach that resets the internal state of individual examples within a mini-batch (a sketch follows below)
    – Since a persistent internal state is carried between successive subsequences in time, it is not possible to use more than one subsequence from a given shot in a given mini-batch
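Below is a minimal sketch, assuming a stateful tf.keras model like the one above, of how the internal state of a single example in the mini-batch can be zeroed when its shot ends while the other rows keep their state. The helper name is hypothetical; the actual bookkeeping lives in the FRNN code.

```python
from tensorflow.keras import backend as K

def reset_state_for_example(model, batch_index):
    """Zero the hidden and cell state of one batch row (one shot) in every
    stateful recurrent layer, leaving the other rows untouched.  Sketch only."""
    for layer in model.layers:
        if not getattr(layer, 'stateful', False):
            continue
        for state in layer.states:          # [h, c] for an LSTM, shape (batch, units)
            values = K.get_value(state)
            values[batch_index, :] = 0.0
            K.set_value(state, values)
```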
  • 15. Training LSTMs on sequences of variable length (II)
    – To train batch-wise we need M independent time slices of equal length (stemming from different shots) to feed to the GPU
      • Maintain a buffer of M separate shots; at every training step, the first T time slices of the buffer are fed as the next batch, then the buffer is shifted by T time steps
    – When a shot in the buffer finishes, a new shot is loaded and the RNN internal states at the corresponding batch index are reset
    – The internal states of the other batch indices are maintained, so the internal state persists during learning for the entire length of any given shot
      • This allows the RNN to learn temporal patterns much longer than the unrolling length T, and potentially as long as the entire shot
    – Random offsets of shots against each other and random shuffling of the training set provide a mixture of disruptive and non-disruptive samples at every batch, which stabilizes training
    – A sketch of this buffering scheme is given below
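A minimal generator sketching the buffering scheme above. It is not the FRNN implementation: it assumes every shot is at least T steps long and omits the random offsets and shuffling mentioned on the slide.

```python
import numpy as np

def batch_generator(shot_list, M, T):
    """Yield (batch, reset_indices): batch has shape (M, T, n_features); the
    rows listed in reset_indices just started a new shot, so their RNN state
    must be zeroed (e.g. with reset_state_for_example above).  Sketch only."""
    shots = iter(shot_list)                     # each shot: array (length, n_features)
    buffers = [next(shots) for _ in range(M)]   # one active shot per batch row
    cursors = [0] * M
    n_features = buffers[0].shape[1]
    while True:
        reset_indices = []
        batch = np.zeros((M, T, n_features), dtype=np.float32)
        for i in range(M):
            if cursors[i] + T > len(buffers[i]):    # shot exhausted: load a new one
                try:
                    buffers[i] = next(shots)
                except StopIteration:
                    return                          # epoch ends when shots run out
                cursors[i] = 0
                reset_indices.append(i)
            batch[i] = buffers[i][cursors[i]:cursors[i] + T]
            cursors[i] += T
        yield batch, reset_indices
```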
  • 16. Introducing FRNN
    – FRNN is a combined workflow implemented in Python:
      • Keras for layer definitions
      • TensorFlow optimizers
      • Custom parameter averaging and global weight update with CUDA-aware MPI
      • Not a general-purpose tool: geared toward the fusion energy scientific community, although certain parts can be useful for a wider audience
      • https://github.com/PPPLDeepLearning/plasma-python
    – We demonstrate how to enable FP16 for distributed training with this combined Keras/TensorFlow/MPI workflow
      • The weight update is sensitive to floating-point precision
      • Volta/FP16-optimized algorithms, chosen by the cuDNN autotuner, are disabled by default in TensorFlow
  • 17. Distributed training flow (data-parallel, with N GPU worker replicas)
    – Step 0: initialize network parameters randomly, based on the model configuration
    – Step 1: perform fprop and bprop on each worker
    – Step 2: aggregate the local gradients from each worker using MPI_Allreduce, then average them
    – Step 3: update the optimizer internal state and the global weights using the global gradient, w_global ← w_global − λ·∇w_global
    – Step 4: MPI_Broadcast the global parameters after the update back to the workers; repeat from Step 1
    – A minimal sketch of steps 2-4 follows below
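A minimal mpi4py sketch of steps 2-4, assuming plain SGD and a Keras model. It is not the actual FRNN routine (those live in plasma-python); the broadcast is kept only so that all replicas stay bit-identical after the update.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def allreduce_average(local_grads):
    """Step 2: sum each local gradient array over all workers, then average."""
    averaged = []
    for g in local_grads:
        buf = np.zeros_like(g)
        comm.Allreduce(g, buf, op=MPI.SUM)
        averaged.append(buf / comm.Get_size())
    return averaged

def apply_global_update(model, local_grads, lr):
    grads = allreduce_average(local_grads)
    # Step 3: plain SGD update of the global weights with the averaged gradient.
    weights = [w - lr * g for w, g in zip(model.get_weights(), grads)]
    # Step 4: broadcast from rank 0 so every replica stays bit-identical.
    weights = comm.bcast(weights, root=0)
    model.set_weights(weights)
```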
  • 18. No loss in training accuracy
    – We use the validation-level area under the ROC curve (AUC) to characterize the quality of FRNN and the applicability of half-precision training
    – Validation AUC as a function of epoch shows similar shapes for FP16 and FP32, reaching a plateau at around AUC = 0.87 by epoch 6
      • SGD optimizer with momentum, loss-scaling factor of 10
      • Panels: base learning rate λ0 = 0.0004 and λ0 = 0.001
    – The test-set AUC is 0.96 for both FP16 and FP32
    – A loss-scaling training step in the spirit of this setup is sketched below
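For illustration, a loss-scaling training step written with the TF2 GradientTape API (the talk-era code used Keras/TF of that time); the momentum value is an assumption, while the loss-scale factor of 10 and the base learning rate 0.0004 come from this slide.

```python
import tensorflow as tf

loss_scale = 10.0   # factor used on the slide; in general it is model/dataset dependent
optimizer = tf.keras.optimizers.SGD(learning_rate=4e-4, momentum=0.9)  # momentum assumed

@tf.function
def train_step(model, x, y, loss_fn):
    with tf.GradientTape() as tape:
        pred = model(x, training=True)
        loss = loss_fn(y, pred)
        scaled_loss = loss * loss_scale              # shift gradients into the FP16 range
    scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
    grads = [g / loss_scale for g in scaled_grads]   # unscale before the weight update
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```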
  • 19. Performance and scaling properties (I)
    – FRNN shows nearly linear runtime scaling and logarithmic communication
    – [Figures: speedup over serial vs. validation AUC achieved (ideal vs. measured); time per epoch vs. number of GPUs (data, scaling model, ideal scaling); training loss vs. scaled walltime for NGPU = 4, 16, 64, 128, 256; fraction of detected disruptions and test AUC vs. time to disruption]
  • 20. Performance and scaling properties (II)
    – Results for hyperparameter tuning with 10000 GPUs: parallel random search across 100 models, each trained on 100 GPUs
    – The time to solution for finding a model of a given validation AUC is compared to the single-GPU scenario
      • The inset shows the time required to find the best model using 1, 100 and 10000 total GPUs
      • For 100 GPUs, we distinguish training the 100 models in serial but using 100 GPUs per model ("parallel training") from running the 100 models in parallel on 1 GPU each ("parallel tuning"); both achieve nearly 100x speedup
    – [Same figure as on the previous slide, with the runtime inset]
  • 21. Conclusions
    – As neural network models grow larger, reduced floating-point precision training becomes crucial: it optimizes computational throughput and reduces memory and network bandwidth requirements, allowing faster training
    – We developed a domain-specific deep learning framework for disruption forecasting in tokamaks, integrating Keras and TensorFlow with custom parameter averaging and global weight update routines implemented with CUDA-aware MPI
      • Strong, nearly linear runtime scaling on GPU clusters
      • Hyperparameter tuning with O(10000) nodes
    – We evaluated training of deep RNNs with half-precision floats on a real scientific dataset
      • Used the JET plasma disruption time-series dataset
      • Used the loss-scaling approach to facilitate model convergence in FP16
      • Obtained comparable validation-level performance at the end of each epoch for half and single floating-point precision
  • 22. Microsoft Cloud+AI SMART Data Science Team is hiring!
    – ML-enhanced coding assistant based on a Markov chain model
    – Automatic code review generation through deep learning
    – Answering documentation questions using a seq-to-seq deep learning model
    – AI-infused analytics
    – Code risk analysis using graph learning
    – Contact: neels@microsoft.com
  • 24. NVIDIA Volta architecture: TensorCores
    – What are TensorCores?
      • Programmable matrix-multiply-and-accumulate units; 8 TensorCores per streaming multiprocessor
      • Each TensorCore performs operations on small 4x4 matrices
      • 1 matrix-multiply-and-accumulate operation per GPU clock
      • During program execution, multiple TensorCores are used concurrently by a full warp of execution on the GPU
    – https://devblogs.nvidia.com/tensor-core-ai-performance-milestones/
    – A small FP16 matrix multiply that is eligible for TensorCore execution is shown below
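For illustration, an FP16 matrix multiply in TensorFlow that is eligible for the TensorCore kernels on Volta. The library chooses the kernel; the usual rule of thumb (not taken from the talk) is that dimensions should be multiples of 8 for the TensorCore path.

```python
import tensorflow as tf

# FP16 operands let cuBLAS/cuDNN dispatch TensorCore GEMM kernels on Volta;
# the higher-precision accumulation happens inside the hardware units.
a = tf.random.normal([1024, 1024], dtype=tf.float16)
b = tf.random.normal([1024, 1024], dtype=tf.float16)
c = tf.matmul(a, b)             # eligible for TensorCore execution on V100
c32 = tf.cast(c, tf.float32)    # continue downstream math in FP32 if desired
```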
  • 25. Training flow
    – Forward pass: mostly matrix and element-wise multiplications
        h1 = W1·x,  h2 = W2·h1,  …,  hn = Wn·hn−1,  loss L(hn)
    – Backprop: chain-rule differentiation, element-wise multiplications
        ∂L/∂hn−1 = ∂L/∂hn·Wn,  ∂L/∂Wn = ∂L/∂hn·hn−1,  …,  ∂L/∂W1 = ∂L/∂h1·x
    – Optimizer weight update: multiply gradients by the learning rate (e.g. SGD)
        Wn ← Wn − λ·∂L/∂Wn,  …,  W1 ← W1 − λ·∂L/∂W1
    – Then the forward pass runs on the next mini-batch of data
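A tiny NumPy illustration of this forward / backward / update pattern for a two-layer linear network; the shapes and the squared-error loss are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
x, y = rng.standard_normal(d), rng.standard_normal(d)
W1, W2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
lr = 1e-3

# Forward pass: matrix multiplications (non-linearities omitted for brevity).
h1 = W1 @ x
h2 = W2 @ h1
loss = 0.5 * np.sum((h2 - y) ** 2)

# Backward pass: chain rule.
dL_dh2 = h2 - y
dL_dW2 = np.outer(dL_dh2, h1)
dL_dh1 = W2.T @ dL_dh2
dL_dW1 = np.outer(dL_dh1, x)

# Optimizer weight update (plain SGD): W <- W - lr * dL/dW.
W2 -= lr * dL_dW2
W1 -= lr * dL_dW1
```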
  • 26. Recurrent Neural Nets: basic description
    – RNNs are a family of neural networks for processing sequential data
    – The feed-forward equations are recurrent:
        a(t) = b + W·h(t−1) + U·x(t)
        h(t) = tanh(a(t))
        o(t) = c + V·h(t)
    – Notation: x is the input sequence; U is the input-to-hidden weight matrix, W the hidden-to-hidden, V the hidden-to-output; b, c are the biases; tanh() is the activation function (non-linearity); o is the output sequence; the loss L compares o(t) with the target values y(t)
    – [Figure: the recurrent cell and its unfolding in time over x(t−1), x(t), … with per-step outputs o(t) and losses L(t)]
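A short NumPy sketch of these recurrences; weight shapes follow the notation above, and the initial hidden state is assumed to be zero.

```python
import numpy as np

def rnn_forward(x_seq, U, W, V, b, c):
    """Vanilla RNN forward pass: a(t) = b + W h(t-1) + U x(t), h(t) = tanh(a(t)),
    o(t) = c + V h(t).  x_seq has shape (T, n_in); returns (T, n_out) outputs."""
    h = np.zeros(W.shape[0])
    outputs = []
    for x_t in x_seq:
        a = b + W @ h + U @ x_t
        h = np.tanh(a)
        outputs.append(c + V @ h)
    return np.stack(outputs), h
```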
  • 27. Model convergence at large N
    – The learning rate schedule is crucial for model convergence during distributed training, when the effective mini-batch size is large
    – We use exponential learning rate decay with an adjustable base learning rate: λ_e = λ_0 · γ^e
    – The effective batch size is multiplied by the number of workers
    – In the distributed regime, the base learning rate is therefore reduced as the number of workers N increases: λ_0(N, k) = λ_0 / (1.0 + N/k)
    – The learning rate is clipped
    – A small sketch of this schedule follows
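A minimal sketch of such a schedule; the decay constant gamma, the scaling constant k, the floor value, and the assumption that clipping is a lower bound are all placeholders, not values from the talk.

```python
def learning_rate(epoch, n_workers, lr0=0.001, gamma=0.97, k=16, lr_min=1e-5):
    # Shrink the base learning rate as the number of workers grows,
    # then apply exponential decay per epoch and clip to a floor.
    base = lr0 / (1.0 + n_workers / k)
    return max(base * gamma ** epoch, lr_min)
```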
  • 28. Half-precision training (I)
    – Half-precision training:
      • Store weights, activations and gradients in FP16
      • Perform matrix and element-wise multiplications and reduction ops in FP16 (including during fprop, bprop and the weight update)
    – FP16 math, arithmetic and comparisons are already possible on NVIDIA GPUs with compute capability 5.3 and greater
      • CUDA half intrinsics (round to nearest even): http://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__HALF.html#group__CUDA__MATH__INTRINSIC__HALF
      • TensorFlow has a registered DT_HALF type: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/types.cc
      • We add a custom MPI datatype of 2 contiguous bytes (sketched below): https://github.com/PPPLDeepLearning/plasma-python/blob/master/plasma/primitives/ops.py
    – This talk:
      • Review techniques to enable FP16 and mixed-precision training
      • Introduce the FRNN framework for disruption forecasting in tokamak fusion plasmas and discuss strong scaling
      • Evaluate FP16 training with the FRNN framework on a real scientific dataset from the JET experiment
      • Evaluate FP16 training on the benchmark IMDB dataset
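A sketch in the spirit of the linked ops.py: register an MPI datatype of two contiguous bytes and a reduction op that reinterprets those bytes as np.float16 before summing. The names here are illustrative, not necessarily those used in the repository.

```python
from mpi4py import MPI
import numpy as np

# MPI has no native half type, so describe it as 2 contiguous bytes.
mpi_float16 = MPI.BYTE.Create_contiguous(2).Commit()

def sum_f16(in_buf, inout_buf, datatype):
    # Reinterpret the raw byte buffers as float16 and accumulate in place.
    a = np.frombuffer(in_buf, dtype=np.float16)
    b = np.frombuffer(inout_buf, dtype=np.float16)
    b += a

op_sum_f16 = MPI.Op.Create(sum_f16, commute=True)

def allreduce_f16(comm, grad):
    send = grad.astype(np.float16)
    recv = np.empty_like(send)
    comm.Allreduce([send, mpi_float16], [recv, mpi_float16], op=op_sum_f16)
    return recv
```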
  • 29. RNN predictions: true positive example
    – [Figure: FRNN input signals and output vs. time to disruption (TTD, ms), in arbitrary units: plasma current increase, locked-mode spike, density drop, internal inductance spike, stored diamagnetic energy drop, and the FRNN output signal spike]