6. Separate Clusters for Big Data and ML
*Slide from: TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters, YAHOO!
7. Data Science in Enterprises Today
CTO: “I need estimates for the ROI on these candidate features in our product.”
Data Science Team: “We are on it. Need to first sync up with IT and engineering.”
8. Collaboration Overhead is High
Data Science Team: “We need access to these datasets.” Data Engineering: “Ok.”
Preparing dataset samples for Data Science involves IT, Data Engineering, the DataLake and the GPU cluster:
1. Update access rights
2. Copy dataset samples (some time later)
3. Run experiments
9. How it should be
Data Science: “I need help to work on a project for the CTO.” IT: “Here’s someone who can help you out.”
Data Science and Data Engineering share a Project (Conda env, CPU/storage quotas, self-service, GDPR) with access to Kafka topics, the DataLake, the GPU cluster and Elasticsearch.
12. HopsFS
Open-source fork of Apache HDFS
16x faster than HDFS
37x more capacity than HDFS
SSL/TLS instead of Kerberos
Scale Challenge Winner (2017)
https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi
13. HopsYARN GPUs
Native GPU support in YARN - world first
Implications
- Schedule GPUs just like memory or CPU
- Exclusive allocation (no GPU-sharing)
- Distributed, scale-out Machine Learning
14. TensorFlow first-class support in Hops
TensorFlow code runs inside Spark executors; each executor trains with a different hyperparameter combination (e.g. 0.001 learning rate / 0.5 dropout, 0.002 / 0.7, 0.003 / 0.3).
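A minimal PySpark sketch of this pattern (illustrative only, not the HopsUtil API; the train function and its result are placeholders):

from pyspark import SparkContext

def train(params):
    learning_rate, dropout = params
    import tensorflow as tf  # imported on the executor
    # ... build the TensorFlow graph with these hyperparameters and train ...
    validation_loss = 0.0    # placeholder result
    return params, validation_loss

sc = SparkContext(appName="parallel-tf-experiments")
grid = [(0.001, 0.5), (0.002, 0.7), (0.003, 0.3)]
# One partition per combination, so each runs as its own executor task
results = sc.parallelize(grid, len(grid)).map(train).collect()
print(results)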
15. HopsUtil
Library for launching TensorFlow jobs
Manages the TensorBoard lifecycle
Helper Functions for Spark/Kafka/HDFS/etc
16. HopsUtil - Read data
from os import path
import tensorflow as tf
from hopsutil import hdfs

# Build the path to the dataset inside the project's HDFS directory
dataset = path.join(hdfs.project_path(), 'Resources/mnist/tfr/train')
# Glob the TFRecord part-files and feed them to an input queue
files = tf.gfile.Glob(path.join(dataset, 'part-*'))
file_queue = tf.train.string_input_producer(files, …)  # remaining arguments elided
17. HopsUtil - initialize Pydoop HDFS API
The Pydoop HDFS API is a rich API that provides operations such as
- Connecting to an HDFS instance
- General file operations (create, read, write)
- Getting information on files, directories and the filesystem
Connect to HopsFS using HopsUtil:
from hopsutil import hdfs
pydoop_handle = hdfs.get()
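A rough usage sketch (an assumption, not from the slides): it presumes the object returned by hdfs.get() exposes Pydoop hdfs-instance methods such as list_directory() and open_file(), and the README.md file is hypothetical.

from os import path
from hopsutil import hdfs

pydoop_handle = hdfs.get()
dataset = path.join(hdfs.project_path(), 'Resources/mnist')

# List the dataset directory (each entry is an info dict with name, size, ...)
for entry in pydoop_handle.list_directory(dataset):
    print(entry['name'], entry['size'])

# Read one (hypothetical) file through the handle
f = pydoop_handle.open_file(path.join(dataset, 'README.md'), 'r')
print(f.read())
f.close()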
30. Experiment Time and Research Productivity
● Minutes, hours:
○ Interactive analysis!
● 1-4 days:
○ Interactivity replaced by many parallel experiments
● 1-4 weeks:
○ High-value experiments only
● >1 month:
○ Don’t even try!
37. Introducing TensorFlowOnSpark by Yahoo!
Wrapper for Distributed TensorFlow
- Creates clusterspec automatically!
- Runs on a Hadoop/Spark cluster
- Starts the workers/parameter servers automatically
- First attempt at “scheduling” GPUs
- Simplifies the programming model
- Manages TensorBoard
- “Migrate all existing TF programs with < 10 lines of code”
41. Our improved TensorFlowOnSpark - 1
Problem:
TensorFlowOnSpark uses RAM (1 GPU = 27 GB RAM) as a proxy to ‘schedule’ GPUs.
Solution:
Hops provides GPU scheduling!
42. Our improved TensorFlowOnSpark - 2
Problem:
A worker will wait until GPUs become available, potentially forever!
Solution:
GPU scheduling ensures that a GPU is only allocated to that particular worker.
43. Our improved TensorFlowOnSpark - 3
Problem:
Each parameter server allocates one GPU; this is a waste!
Solution:
Only workers may use GPUs.
44. Conversion guide: TensorFlowOnSpark
- TFCluster.run(spark, training_fun, num_executors, num_ps, …)
- Add PySpark and TensorFlowOnSpark imports
- Create your own FileWriter
- Replace tf.train.Server() with TFNode.start_cluster_server()
Full conversion guide for Distributed TensorFlow to TensorFlowOnSpark:
https://github.com/yahoo/TensorFlowOnSpark/wiki/Conversion-Guide
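A minimal driver sketch of the steps above, loosely following the TensorFlowOnSpark examples; argument names and values here are illustrative, so check the conversion guide for the exact API.

from pyspark import SparkContext
from tensorflowonspark import TFCluster, TFNode

def training_fun(argv, ctx):
    # Replace tf.train.Server() with TFNode's helper
    cluster, server = TFNode.start_cluster_server(ctx)
    if ctx.job_name == 'ps':
        server.join()
    else:
        pass  # build the graph, create your own tf.summary.FileWriter, run training

sc = SparkContext(appName='tfos-example')
num_executors, num_ps = 4, 1
cluster = TFCluster.run(sc, training_fun, [], num_executors, num_ps,
                        tensorboard=True, input_mode=TFCluster.InputMode.TENSORFLOW)
cluster.shutdown()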
48. Facebook: Scaling Synchronous SGD
June 2017: training time on ImageNet reduced from 2 weeks to 1 hour
➔ ~90% scaling efficiency going from 8 to 256 GPUs
Learning rate heuristic / warm-up phase / large batches
Paper: https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf
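A rough sketch of that heuristic (numbers are illustrative, not from the slides): scale the base learning rate linearly with the batch-size multiple k, and ramp up to it over the first few epochs.

def learning_rate(epoch, base_lr=0.1, k=32, warmup_epochs=5):
    target_lr = base_lr * k                      # linear scaling rule
    if epoch < warmup_epochs:                    # gradual warm-up phase
        return target_lr * (epoch + 1) / warmup_epochs
    return target_lr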
49. All-Reduce
N GPUs, K parameters
Comm. cost per GPU: 2(N-1) * K/N, which is (almost) independent of the number of GPUs since it approaches 2K as N grows
Overlaps communication and computation
Drawback: synchronous communication
From: http://research.baidu.com/bringing-hpc-techniques-deep-learning/
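A quick worked example of the formula above: per GPU, ring all-reduce moves 2(N-1) * K/N values, which tends to 2K whatever the number of GPUs.

K = 10_000_000                      # number of model parameters
for N in (2, 4, 8, 64, 256):        # number of GPUs in the ring
    print(N, 2 * (N - 1) * K / N)   # approaches 2*K = 2e7 as N grows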
51. Horovod - Better than Baidu All-Reduce?
Fork of Baidu All-Reduce
Improvements
1. Replaced Baidu ring-allreduce with NVIDIA NCCL
2. Tensor Fusion
3. Support for larger models
4. Pip package
5. Horovod Timeline
52. Migrating existing code to run on Horovod
1. Run hvd.init()
2. Pin a server GPU to be used by this process using config.gpu_options.visible_device_list. Local rank maps to a unique GPU for the process.
3. Wrap the optimizer in hvd.DistributedOptimizer.
4. Add hvd.BroadcastGlobalVariablesHook(0) to broadcast initial variable states from rank 0 to all other processes.
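The same four steps in a minimal sketch (pre-TF-2.0 style, following Horovod's documented pattern; the model here is a trivial placeholder).

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                                        # 1. initialize Horovod

config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())    # 2. one GPU per process

x = tf.Variable(1.0)                                              # placeholder "model"
loss = tf.square(x)
opt = hvd.DistributedOptimizer(tf.train.GradientDescentOptimizer(0.1))  # 3. wrap optimizer
global_step = tf.train.get_or_create_global_step()
train_op = opt.minimize(loss, global_step=global_step)

hooks = [hvd.BroadcastGlobalVariablesHook(0),                     # 4. sync initial state from rank 0
         tf.train.StopAtStepHook(last_step=100)]
with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        sess.run(train_op)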
53. Horovod/Baidu AllReduce
Provide as a service on HopsWorks
Integration of All-Reduce with a Hadoop cluster
- Use YARN to schedule GPUs
Scheduling of homogeneous GPUs and network
- YARN supports node labels
HopsFS authentication/authorization
TensorBoard lifecycle management as in HopsUtil
54. The team
Active contributors:
Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos
Kakantousis, Ermias Gebremeskel, Antonios Kouzoupis, Alex Ormenisan, Fabio Buso, Robin Andersson,
August Bonds, Filotas Siskos, Mahmoud Hamed.
Past contributors:
Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca,
Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto Lorente, Andre Moré, Ali Gholami, Davis
Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana
Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid
Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu, Aruna Kumari Yedurupaka, Tobias Johansson,
Roberto Bampi, Fanti Machmount Al Samisti, Braulio Grana, Adam Alpire, Zahin Azher Rashid.