WIFI SSID: SparkAISummit | Password: UnifiedAnalytics
Jim Dowling, CEO @ Logical Clocks
Ajit Mathews, VP Machine Learning @ AMD
ROCm and Distributed Deep Learning on Spark and TensorFlow
#UnifiedAnalytics #SparkAISummit
jim_dowling
Great Hedge of India
• The East India Company was one of the industrial world's first monopolies.
• They assembled a thorny hedge (not a wall!) spanning India.
• You paid customs duty to bring salt over the wall (sorry, hedge).
In 2019, not all graphics cards are allowed to be used in a data center. Monopolies are not good for deep learning!
[Image from Wikipedia]
Nvidia™ 2080Ti vs AMD Radeon™ VII: ResNet-50 Benchmark

                     Nvidia™ 2080Ti                AMD Radeon™ VII
Memory               11 GB                         16 GB
TensorFlow           1.12                          1.13.1
Stack                CUDA 10.0.130, cuDNN 7.4.1    ROCm 2.3
Model                ResNet-50                     ResNet-50
Dataset              ImageNet (synthetic)          ImageNet (synthetic)
FP32 images/sec      ~322                          ~302
FP16 images/sec      ~560                          ~415

Sources:
https://lambdalabs.com/blog/2080-ti-deep-learning-benchmarks/
https://www.phoronix.com/scan.php?page=article&item=nvidia-rtx2080ti-tensorflow&num=2
https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/173
AMD ML Software Strategy: Open Source Foundation for Machine Learning

• Latest machine learning frameworks
• Docker and Kubernetes support
• Optimized math & communication libraries
• Upstreamed into Linux kernel distributions

[Stack diagram] The fully open-source ROCm platform, top to bottom:
• Spark / machine learning apps (data platform and tools)
• Frameworks
• Middleware and libraries: Eigen, RCCL, BLAS/FFT/RNG, MIOpen
• Programming models: OpenMP, HIP, OpenCL™, Python
• Devices: GPU, CPU, APU, DLA
Distro: Upstream Linux Kernel Support

• Linux kernel 4.17
• 700+ upstream ROCm driver commits since the 4.12 kernel
• https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver
Languages: Multiple Programming Options

• Programming models: OpenMP, Python, OpenCL, HIP
• The LLVM -> AMDGCN compiler generates AMDGPU code
• LLVM: https://llvm.org/docs/AMDGPUUsage.html
• CLANG HIP: https://clang.llvm.org/doxygen/HIP_8h_source.html
Libraries: Optimized Math Libraries

• rocBLAS, rocSparse, rocALUTION, rocPrim, rocFFT, rocSolver, rocRAND
• Up to 98% software efficiency
• Supports FP64, FP32, mixed precision, and INT8 (INT4)
• https://github.com/ROCm-Developer-Tools
Machine Learning Frameworks

• TensorFlow v1.13: high-performance FP16/FP32 training with up to 8 GPUs/node; available today as a Docker container or as a Python pip wheel.
• PyTorch/Caffe2: AMD support in the mainline repository, including initial multi-GPU support for Caffe2; available today as a Docker container (or build from source).
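A quick way to confirm that a ROCm TensorFlow build actually sees the GPU is the standard device listing; a minimal sketch using stock TF 1.x API (nothing ROCm-specific):

import tensorflow as tf
from tensorflow.python.client import device_lib

# On the ROCm build of TensorFlow, AMD GPUs are reported with
# device_type 'GPU', just as on the CUDA build.
gpus = [d.name for d in device_lib.list_local_devices()
        if d.device_type == "GPU"]
print(gpus)                        # e.g. ['/device:GPU:0']
print(tf.test.is_gpu_available())  # True if the ROCm runtime found a GPU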
ROCm Distributed Training: RCCL

• Optimized collective communication operations library
• Easy MPI integration
• Support for InfiniBand and RoCE high-speed network fabrics
• ROCm-enabled UCX
• ROCm w/ ROCnRDMA

ResNet-50 multi-GPU scaling (PCIe, CPU parameter server):
1 GPU: 1.00X | 2 GPUs: 1.99X | 4 GPUs: 3.98X | 8 GPUs: 7.64X
(7.64X on 8 GPUs is roughly 95% scaling efficiency.)
GPU Acceleration & Distributed training
Docker and Kubernetes Support
o Docker
o https://hub.docker.com/r/rocm/rocm-
terminal/
o Kubernetes
o https://hub.docker.com/r/rocm/k8s-
device-plugin/
Extensive Language support
including Python
o Anaconda NUMBA
o https://github.com/numba/roctools
o Conda
o https://anaconda.org/rocm
o CLANG HIP
o https://clang.llvm.org/doxygen/HIP_8h_s
ource.html
ROCm supported Kubernetes device
plugin that enables the registration of
GPU in a container cluster for compute
workload
With support for AMD's ROCm
drivers, Numba lets you write parallel
GPU algorithms entirely from Python
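As a taste of the Numba route, here is a minimal vector-add sketch based on the numba.roc target as it shipped around this time; treat the exact launch syntax as an assumption and check the Numba docs for your version:

import numpy as np
from numba import roc

@roc.jit
def vadd(a, b, out):
    i = roc.get_global_id(0)  # global work-item index
    if i < out.size:
        out[i] = a[i] + b[i]

n = 1 << 20
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.zeros_like(a)
vadd[n // 256, 256](a, b, out)  # launch: (workgroups, workgroup size)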
Horovod support with RCCL:
• Ring-allreduce: https://github.com/ROCmSoftwarePlatform/horovod
• TensorFlow CollectiveAllReduceStrategy
• HopsML CollectiveAllReduceStrategy
ROCm supports all-reduce distributed training using Horovod and HopsML; a minimal Horovod sketch follows.
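For reference, the standard horovod.tensorflow wiring looks like the sketch below; the ROCm fork keeps the same public API, with RCCL doing the ring-allreduce under the hood:

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process per GPU, launched via mpirun/horovodrun

# Pin each process to a single GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Scale the learning rate with the worker count, then wrap the optimizer
# so gradients are averaged with ring-allreduce.
opt = hvd.DistributedOptimizer(tf.train.AdamOptimizer(0.001 * hvd.size()))

# Broadcast initial variables from rank 0 so all workers start identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]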
Hopsworks Open-Source Data and AI

[Architecture diagram] Hopsworks runs on-premise or on AWS, Azure, and GCE, spanning batch analytics, streaming, and ML & deep learning:
• Data sources feeding Kafka and HopsFS
• Orchestration with Airflow
• Processing with Spark / Flink
• A Feature Store backed by Spark and Hive
• Deep learning, notebooks, and BI tools & reporting
• Model serving with Kubernetes
• Elastic as an external service alongside the Hopsworks services
Hopsworks Hides ML Complexity

[Diagram adapted from "Hidden Technical Debt in Machine Learning Systems"] Model training is driven through the HopsML API, with the Feature Store feeding it.
https://www.logicalclocks.com/feature-store/
https://hopsworks.readthedocs.io/en/latest/hopsml/hopsML.html#
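From a Spark application, the Feature Store is reached through the hops-util-py library; a minimal sketch, where the feature names are illustrative placeholders and the API shown is the hops featurestore module as of 2019:

from hops import featurestore

# Read curated features from the project's feature store as a
# Spark DataFrame; the feature names here are illustrative only.
train_df = featurestore.get_features(
    ["average_attendance", "team_budget", "average_player_age"])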
ROCm -> Spark / TensorFlow

• Spark / TensorFlow applications run unchanged on ROCm
• Hopsworks runs Spark/TensorFlow on YARN and Conda
YARN support for ROCm in Hops

[Diagram] The ResourceManager schedules a Driver and Executors into containers across the NodeManagers. A container is a cgroup that isolates CPU, memory, and GPU resources and carries a conda environment and TLS certs.
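Hops YARN exposes GPUs as a schedulable resource next to CPU and memory. A minimal sketch of how a PySpark application might request GPU-backed executors; note that the 'spark.executor.gpus' property name is an assumption here, so verify the exact key against the Hopsworks documentation:

from pyspark.sql import SparkSession

# Sketch: ask Hops YARN for four executors with one GPU each.
# ASSUMPTION: 'spark.executor.gpus' is illustrative, not a verified key.
spark = (SparkSession.builder
         .appName("rocm-distributed-training")
         .config("spark.executor.instances", "4")
         .config("spark.executor.gpus", "1")
         .getOrCreate())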
Distributed Deep Learning Spark/TF

[Diagram] The Driver coordinates Executor 1 … Executor N, each running in its own conda_env; HopsFS stores TensorBoard logs, models, checkpoints, and training data.
# RUNS ON THE EXECUTORS
def train(lr, dropout):
    def input_fn():  # return a tf.data.Dataset
        …
    optimizer = …
    model = …
    model.add(Conv2D(…))
    model.compile(…)
    model.fit(…)
    model.evaluate(…)

# RUNS ON THE DRIVER
hparams = {'lr': [0.001, 0.0001],
           'dropout': [0.25, 0.5, 0.75]}
experiment.grid_search(train, hparams)
More details: Spark Summit Europe 2018 talk https://www.youtube.com/watch?v=tx6HyoUYGL0
https://github.com/logicalclocks/hops-examples
Distributed Deep Learning Spark/TF (continued)

[Diagram: same Driver/Executor/HopsFS layout as the previous slide]
# RUNS ON THE EXECUTORS
def train():
    def input_fn():  # return a tf.data.Dataset
        …
    model = …
    optimizer = …
    model.compile(…)
    # TF 1.13: CollectiveAllReduceStrategy lived in tf.contrib.distribute;
    # the RunConfig wiring is sketched here.
    strategy = tf.contrib.distribute.CollectiveAllReduceStrategy()
    rc = tf.estimator.RunConfig(train_distribute=strategy)
    keras_estimator = tf.keras.estimator.model_to_estimator(
        keras_model=model, config=rc)
    tf.estimator.train_and_evaluate(
        keras_estimator,
        tf.estimator.TrainSpec(input_fn),
        tf.estimator.EvalSpec(input_fn))

# RUNS ON THE DRIVER
experiment.collective_all_reduce(train)
More details: Spark Summit Europe 2018 talk https://www.youtube.com/watch?v=tx6HyoUYGL0
https://github.com/logicalclocks/hops-examples
Horizontally Scalable ML Pipelines with Hopsworks

[Pipeline diagram] Airflow orchestrates the stages Ingest -> Data Prep -> Feature Store -> Experiment/Train -> Deploy. Raw and event data are ingested into HopsFS, curated features land in the Feature Store, trained models go to Serving with monitoring, and logs from every stage flow into the Metadata Store.
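Airflow ties the stages together; below is a minimal, hedged DAG sketch using stock Airflow 1.x operators, with the actual Hopsworks job launch left as a placeholder callable:

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def run_hopsworks_job(job_name):
    # Placeholder: in Hopsworks this would call the Jobs service
    # (e.g. over its REST API) and wait for the run to finish.
    pass

dag = DAG("ml_pipeline",
          start_date=datetime(2019, 4, 1),
          schedule_interval="@daily")

stages = ["ingest", "data_prep", "train", "deploy"]
tasks = [PythonOperator(task_id=s,
                        python_callable=run_hopsworks_job,
                        op_kwargs={"job_name": s},
                        dag=dag)
         for s in stages]

# Chain: ingest >> data_prep >> train >> deploy
for upstream, downstream in zip(tasks, tasks[1:]):
    upstream >> downstream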
Pipelines from Jupyter Notebooks

• Interactive: the PySpark kernel talks to the Livy Server (materializing TLS certs and ENV variables) on HopsFS/HopsYARN; .ipynb contents live in HDFS, and logs and results feed the Experiments service and TensorBoard.
• Scheduled: convert the .ipynb to .py, then the Jobs Service runs the .py or .jar on a schedule via the REST API or UI, again materializing certs and ENV variables.
• Old notebooks, experiments, and visualizations remain viewable.
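The same Jobs service is scriptable over REST; a hedged sketch with the requests library, where the base URL, project id, and endpoint path are illustrative rather than the exact Hopsworks API:

import requests

API = "https://hopsworks.example.com/hopsworks-api/api"  # illustrative base URL
HEADERS = {"Authorization": "ApiKey " + "<your-api-key>"}  # assumes an API key

# Illustrative endpoint: start an execution of a registered job.
resp = requests.post(API + "/project/119/jobs/train_model/executions",
                     headers=HEADERS)
resp.raise_for_status()
print(resp.json())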
DEMO
Distributed TensorFlow with Spark/TensorFlow and ROCm
@logicalclocks
www.logicalclocks.com
Open Source Foundation for Machine Learning
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT
Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and
typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product
and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between
differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise
correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the
content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO
RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR
PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER
CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY
ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, Radeon and combinations thereof are trademarks of
Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be
trademarks of their respective owners. Use of third party marks / names is for informational purposes only and no endorsement of or by AMD
is intended or implied.
