WIFI SSID: SparkAISummit | Password: UnifiedAnalytics
Jim Dowling, CEO @ Logical Clocks
Ajit Mathews, VP Machine Learning @ AMD
ROCm and Distributed Deep Learning on Spark and TensorFlow
#UnifiedAnalytics #SparkAISummit
jim_dowling
Great Hedge of India
• The East India Company was one of the industrial world's first monopolies.
• They assembled a thorny hedge (not a wall!) spanning India.
• You paid customs duty to bring salt over the wall (sorry, hedge).
In 2019, not all graphics cards are allowed to be used in a data center. Monopolies are not good for deep learning!
[Image from Wikipedia]
Nvidia™ 2080Ti vs AMD Radeon™ VII: ResNet-50 Benchmark

                     Nvidia™ 2080Ti                AMD Radeon™ VII
Memory               11 GB                         16 GB
TensorFlow           1.12                          1.13.1
Stack                CUDA 10.0.130, cuDNN 7.4.1    ROCm 2.3
Model                ResNet-50                     ResNet-50
Dataset              ImageNet (synthetic)          ImageNet (synthetic)
FP32 images/sec      ~322                          ~302
FP16 images/sec      ~560                          ~415

Sources:
https://lambdalabs.com/blog/2080-ti-deep-learning-benchmarks/
https://www.phoronix.com/scan.php?page=article&item=nvidia-rtx2080ti-tensorflow&num=2
https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/173
AMD ML Software Strategy: Open Source Foundation for Machine Learning

• Latest machine learning frameworks
• Docker and Kubernetes support
• Optimized math & communication libraries
• Upstreamed into Linux kernel distributions

[Stack diagram] The fully open-source ROCm platform, top to bottom:
• Spark / machine learning apps (data platform and tools)
• Frameworks
• Middleware and libraries: Eigen, RCCL, BLAS/FFT/RNG, MIOpen
• Programming models: OpenMP, HIP, OpenCL™, Python
• Devices: GPU, CPU, APU, DLA
Distro: Upstream Linux Kernel Support

• Linux kernel 4.17
• 700+ upstream ROCm driver commits since the 4.12 kernel
• https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver
Languages: Multiple Programming Options

• Programming models: OpenMP, Python, OpenCL, HIP
• The LLVM -> AMDGCN compiler generates AMDGPU code
• LLVM: https://llvm.org/docs/AMDGPUUsage.html
• CLANG HIP: https://clang.llvm.org/doxygen/HIP_8h_source.html
Libraries: Optimized Math Libraries

• rocBLAS, rocSparse, rocALUTION, rocPrim, rocFFT, rocSolver, rocRAND
• Up to 98% software efficiency
• Supports FP64, FP32, mixed precision, and INT8 (INT4)
• https://github.com/ROCm-Developer-Tools
Machine Learning Frameworks

• TensorFlow v1.13: high-performance FP16/FP32 training with up to 8 GPUs/node; available today as a Docker container or as a Python pip wheel.
• PyTorch/Caffe2: AMD support in the mainline repository, including initial multi-GPU support for Caffe2; available today as a Docker container (or build from source).
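A quick way to confirm that a ROCm TensorFlow build actually sees the GPU is the standard device listing; a minimal sketch using stock TF 1.x API (nothing ROCm-specific):

import tensorflow as tf
from tensorflow.python.client import device_lib

# On the ROCm build of TensorFlow, AMD GPUs are reported with
# device_type 'GPU', just as on the CUDA build.
gpus = [d.name for d in device_lib.list_local_devices()
        if d.device_type == "GPU"]
print(gpus)                        # e.g. ['/device:GPU:0']
print(tf.test.is_gpu_available())  # True if the ROCm runtime found a GPU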
ROCm Distributed Training: RCCL

• Optimized collective communication operations library
• Easy MPI integration
• Support for InfiniBand and RoCE high-speed network fabrics
• ROCm-enabled UCX
• ROCm w/ ROCnRDMA

ResNet-50 multi-GPU scaling (PCIe, CPU parameter server):
1 GPU: 1.00X | 2 GPUs: 1.99X | 4 GPUs: 3.98X | 8 GPUs: 7.64X
(7.64X on 8 GPUs is roughly 95% scaling efficiency.)
GPU Acceleration & Distributed training
Docker and Kubernetes Support
o Docker
o https://hub.docker.com/r/rocm/rocm-
terminal/
o Kubernetes
o https://hub.docker.com/r/rocm/k8s-
device-plugin/
Extensive Language support
including Python
o Anaconda NUMBA
o https://github.com/numba/roctools
o Conda
o https://anaconda.org/rocm
o CLANG HIP
o https://clang.llvm.org/doxygen/HIP_8h_s
ource.html
ROCm supported Kubernetes device
plugin that enables the registration of
GPU in a container cluster for compute
workload
With support for AMD's ROCm
drivers, Numba lets you write parallel
GPU algorithms entirely from Python
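As a taste of the Numba route, here is a minimal vector-add sketch based on the numba.roc target as it shipped around this time; treat the exact launch syntax as an assumption and check the Numba docs for your version:

import numpy as np
from numba import roc

@roc.jit
def vadd(a, b, out):
    i = roc.get_global_id(0)  # global work-item index
    if i < out.size:
        out[i] = a[i] + b[i]

n = 1 << 20
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.zeros_like(a)
vadd[n // 256, 256](a, b, out)  # launch: (workgroups, workgroup size)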
Horovod support with RCCL:
• Ring-allreduce: https://github.com/ROCmSoftwarePlatform/horovod
• TensorFlow CollectiveAllReduceStrategy
• HopsML CollectiveAllReduceStrategy
ROCm supports all-reduce distributed training using Horovod and HopsML; a minimal Horovod sketch follows.
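For reference, the standard horovod.tensorflow wiring looks like the sketch below; the ROCm fork keeps the same public API, with RCCL doing the ring-allreduce under the hood:

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process per GPU, launched via mpirun/horovodrun

# Pin each process to a single GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Scale the learning rate with the worker count, then wrap the optimizer
# so gradients are averaged with ring-allreduce.
opt = hvd.DistributedOptimizer(tf.train.AdamOptimizer(0.001 * hvd.size()))

# Broadcast initial variables from rank 0 so all workers start identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]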
Hopsworks Open-Source Data and AI

[Architecture diagram] Hopsworks runs on-premise or on AWS, Azure, and GCE, spanning batch analytics, streaming, and ML & deep learning:
• Data sources feeding Kafka and HopsFS
• Orchestration with Airflow
• Processing with Spark / Flink
• A Feature Store backed by Spark and Hive
• Deep learning, notebooks, and BI tools & reporting
• Model serving with Kubernetes
• Elastic as an external service alongside the Hopsworks services
Hopsworks Hides ML Complexity

[Diagram adapted from "Hidden Technical Debt in Machine Learning Systems"] Model training is driven through the HopsML API, with the Feature Store feeding it.
https://www.logicalclocks.com/feature-store/
https://hopsworks.readthedocs.io/en/latest/hopsml/hopsML.html#
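From a Spark application, the Feature Store is reached through the hops-util-py library; a minimal sketch, where the feature names are illustrative placeholders and the API shown is the hops featurestore module as of 2019:

from hops import featurestore

# Read curated features from the project's feature store as a
# Spark DataFrame; the feature names here are illustrative only.
train_df = featurestore.get_features(
    ["average_attendance", "team_budget", "average_player_age"])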
ROCm -> Spark / TensorFlow

• Spark / TensorFlow applications run unchanged on ROCm
• Hopsworks runs Spark/TensorFlow on YARN and Conda
YARN support for ROCm in Hops

[Diagram] The ResourceManager schedules a Driver and Executors into containers across the NodeManagers. A container is a cgroup that isolates CPU, memory, and GPU resources and carries a conda environment and TLS certs.
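Hops YARN exposes GPUs as a schedulable resource next to CPU and memory. A minimal sketch of how a PySpark application might request GPU-backed executors; note that the 'spark.executor.gpus' property name is an assumption here, so verify the exact key against the Hopsworks documentation:

from pyspark.sql import SparkSession

# Sketch: ask Hops YARN for four executors with one GPU each.
# ASSUMPTION: 'spark.executor.gpus' is illustrative, not a verified key.
spark = (SparkSession.builder
         .appName("rocm-distributed-training")
         .config("spark.executor.instances", "4")
         .config("spark.executor.gpus", "1")
         .getOrCreate())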
Distributed Deep Learning Spark/TF

[Diagram] The Driver coordinates Executor 1 … Executor N, each running in its own conda_env; HopsFS stores TensorBoard logs, models, checkpoints, and training data.
# RUNS ON THE EXECUTORS
def train(lr, dropout):
    def input_fn():  # return a tf.data.Dataset
        …
    optimizer = …
    model = …
    model.add(Conv2D(…))
    model.compile(…)
    model.fit(…)
    model.evaluate(…)

# RUNS ON THE DRIVER
hparams = {'lr': [0.001, 0.0001],
           'dropout': [0.25, 0.5, 0.75]}
experiment.grid_search(train, hparams)
More details: Spark Summit Europe 2018 talk https://www.youtube.com/watch?v=tx6HyoUYGL0
https://github.com/logicalclocks/hops-examples
Distributed Deep Learning Spark/TF (continued)

[Diagram: same Driver/Executor/HopsFS layout as the previous slide]
# RUNS ON THE EXECUTORS
def train():
    def input_fn():  # return a tf.data.Dataset
        …
    model = …
    optimizer = …
    model.compile(…)
    # TF 1.13: CollectiveAllReduceStrategy lived in tf.contrib.distribute;
    # the RunConfig wiring is sketched here.
    strategy = tf.contrib.distribute.CollectiveAllReduceStrategy()
    rc = tf.estimator.RunConfig(train_distribute=strategy)
    keras_estimator = tf.keras.estimator.model_to_estimator(
        keras_model=model, config=rc)
    tf.estimator.train_and_evaluate(
        keras_estimator,
        tf.estimator.TrainSpec(input_fn),
        tf.estimator.EvalSpec(input_fn))

# RUNS ON THE DRIVER
experiment.collective_all_reduce(train)
More details: Spark Summit Europe 2018 talk https://www.youtube.com/watch?v=tx6HyoUYGL0
https://github.com/logicalclocks/hops-examples
Horizontally Scalable ML Pipelines with Hopsworks

[Pipeline diagram] Airflow orchestrates the stages Ingest -> Data Prep -> Feature Store -> Experiment/Train -> Deploy. Raw and event data are ingested into HopsFS, curated features land in the Feature Store, trained models go to Serving with monitoring, and logs from every stage flow into the Metadata Store.
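Airflow ties the stages together; below is a minimal, hedged DAG sketch using stock Airflow 1.x operators, with the actual Hopsworks job launch left as a placeholder callable:

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def run_hopsworks_job(job_name):
    # Placeholder: in Hopsworks this would call the Jobs service
    # (e.g. over its REST API) and wait for the run to finish.
    pass

dag = DAG("ml_pipeline",
          start_date=datetime(2019, 4, 1),
          schedule_interval="@daily")

stages = ["ingest", "data_prep", "train", "deploy"]
tasks = [PythonOperator(task_id=s,
                        python_callable=run_hopsworks_job,
                        op_kwargs={"job_name": s},
                        dag=dag)
         for s in stages]

# Chain: ingest >> data_prep >> train >> deploy
for upstream, downstream in zip(tasks, tasks[1:]):
    upstream >> downstream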
Pipelines from Jupyter Notebooks

• Interactive: the PySpark kernel talks to the Livy Server (materializing TLS certs and ENV variables) on HopsFS/HopsYARN; .ipynb contents live in HDFS, and logs and results feed the Experiments service and TensorBoard.
• Scheduled: convert the .ipynb to .py, then the Jobs Service runs the .py or .jar on a schedule via the REST API or UI, again materializing certs and ENV variables.
• Old notebooks, experiments, and visualizations remain viewable.
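The same Jobs service is scriptable over REST; a hedged sketch with the requests library, where the base URL, project id, and endpoint path are illustrative rather than the exact Hopsworks API:

import requests

API = "https://hopsworks.example.com/hopsworks-api/api"  # illustrative base URL
HEADERS = {"Authorization": "ApiKey " + "<your-api-key>"}  # assumes an API key

# Illustrative endpoint: start an execution of a registered job.
resp = requests.post(API + "/project/119/jobs/train_model/executions",
                     headers=HEADERS)
resp.raise_for_status()
print(resp.json())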
DEMO
Distributed TensorFlow with Spark/TensorFlow and ROCm
@logicalclocks
www.logicalclocks.com
Open Source Foundation for Machine Learning
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT
Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and
typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product
and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between
differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise
correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the
content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO
RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR
PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER
CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY
ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, Radeon and combinations thereof are trademarks of
Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be
trademarks of their respective owners. Use of third party marks / names is for informational purposes only and no endorsement of or by AMD
is intended or implied.
