6. Separate Clusters for Big Data and ML
*Slide from: TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters, YAHOO!
7. Data Science in Enterprises Today
CTO: “I need estimates for the ROI on these candidate features in our product.”
Data Science Team: “We are on it. Need to first sync up with IT and engineering.”
8. Collaboration Overhead is High
Data Science Team: “We need access to these datasets.” Data Engineering: “Ok.”
Preparing dataset samples for Data Science involves IT, Data Engineering, the DataLake and the GPU cluster:
1. Update access rights
2. Copy dataset samples (some time later)
3. Run experiments
9. How it should be
Data Science: “I need help to work on a project for the CTO.” IT: “Here’s someone who can help you out.”
Data Science and Data Engineering share a Project (Conda env, CPU/storage quotas, self-service, GDPR) with access to Kafka topics, the DataLake, the GPU cluster and Elasticsearch.
12. HopsFS
Open-source fork of Apache HDFS
16x faster than HDFS
37x more capacity than HDFS
SSL/TLS instead of Kerberos
Scale Challenge Winner (2017)
https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi
13. HopsYARN GPUs
Native GPU support in YARN - world first
Implications
- Schedule GPUs just like memory or CPU
- Exclusive allocation (no GPU-sharing)
- Distributed, scale-out Machine Learning
14. TensorFlow first-class support in Hops
TensorFlow code runs inside Spark executors; each executor trains with a different hyperparameter combination (e.g. 0.001 learning rate / 0.5 dropout, 0.002 / 0.7, 0.003 / 0.3).
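A minimal PySpark sketch of this pattern (illustrative only, not the HopsUtil API; the train function and its result are placeholders):

from pyspark import SparkContext

def train(params):
    learning_rate, dropout = params
    import tensorflow as tf  # imported on the executor
    # ... build the TensorFlow graph with these hyperparameters and train ...
    validation_loss = 0.0    # placeholder result
    return params, validation_loss

sc = SparkContext(appName="parallel-tf-experiments")
grid = [(0.001, 0.5), (0.002, 0.7), (0.003, 0.3)]
# One partition per combination, so each runs as its own executor task
results = sc.parallelize(grid, len(grid)).map(train).collect()
print(results)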
15. HopsUtil
Library for launching TensorFlow jobs
Manages the TensorBoard lifecycle
Helper Functions for Spark/Kafka/HDFS/etc
16. HopsUtil - Read data
from os import path
import tensorflow as tf
from hopsutil import hdfs

# Build the path to the dataset inside the project's HDFS directory
dataset = path.join(hdfs.project_path(), 'Resources/mnist/tfr/train')
# Glob the TFRecord part-files and feed them to an input queue
files = tf.gfile.Glob(path.join(dataset, 'part-*'))
file_queue = tf.train.string_input_producer(files, …)  # remaining arguments elided
17. HopsUtil - initialize Pydoop HDFS API
The Pydoop HDFS API is a rich API that provides operations such as
- Connecting to an HDFS instance
- General file operations (create, read, write)
- Getting information on files, directories and the filesystem
Connect to HopsFS using HopsUtil:
from hopsutil import hdfs
pydoop_handle = hdfs.get()
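A rough usage sketch (an assumption, not from the slides): it presumes the object returned by hdfs.get() exposes Pydoop hdfs-instance methods such as list_directory() and open_file(), and the README.md file is hypothetical.

from os import path
from hopsutil import hdfs

pydoop_handle = hdfs.get()
dataset = path.join(hdfs.project_path(), 'Resources/mnist')

# List the dataset directory (each entry is an info dict with name, size, ...)
for entry in pydoop_handle.list_directory(dataset):
    print(entry['name'], entry['size'])

# Read one (hypothetical) file through the handle
f = pydoop_handle.open_file(path.join(dataset, 'README.md'), 'r')
print(f.read())
f.close()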
30. Experiment Time and Research Productivity
● Minutes, hours:
○ Interactive analysis!
● 1-4 days:
○ Interactivity replaced by many parallel experiments
● 1-4 weeks:
○ High-value experiments only
● >1 month:
○ Don’t even try!
37. Introducing TensorFlowOnSpark by Yahoo!
Wrapper for Distributed TensorFlow
- Creates clusterspec automatically!
- Runs on a Hadoop/Spark cluster
- Starts the workers/parameter servers automatically
- First attempt at “scheduling” GPUs
- Simplifies the programming model
- Manages TensorBoard
- “Migrate all existing TF programs with < 10 lines of code”
41. Our improved TensorFlowOnSpark - 1
Problem:
TensorFlowOnSpark uses RAM (1 GPU = 27 GB RAM) as a proxy to ‘schedule’ GPUs.
Solution:
Hops provides GPU scheduling!
42. Our improved TensorFlowOnSpark - 2
Problem:
A worker will wait until GPUs become available, potentially forever!
Solution:
GPU scheduling ensures that a GPU is only allocated to that particular worker.
43. Our improved TensorFlowOnSpark - 3
Problem:
Each parameter server allocates one GPU; this is a waste!
Solution:
Only workers may use GPUs.
44. Conversion guide: TensorFlowOnSpark
- TFCluster.run(spark, training_fun, num_executors, num_ps, …)
- Add PySpark and TensorFlowOnSpark imports
- Create your own FileWriter
- Replace tf.train.Server() with TFNode.start_cluster_server()
Full conversion guide for Distributed TensorFlow to TensorFlowOnSpark:
https://github.com/yahoo/TensorFlowOnSpark/wiki/Conversion-Guide
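A minimal driver sketch of the steps above, loosely following the TensorFlowOnSpark examples; argument names and values here are illustrative, so check the conversion guide for the exact API.

from pyspark import SparkContext
from tensorflowonspark import TFCluster, TFNode

def training_fun(argv, ctx):
    # Replace tf.train.Server() with TFNode's helper
    cluster, server = TFNode.start_cluster_server(ctx)
    if ctx.job_name == 'ps':
        server.join()
    else:
        pass  # build the graph, create your own tf.summary.FileWriter, run training

sc = SparkContext(appName='tfos-example')
num_executors, num_ps = 4, 1
cluster = TFCluster.run(sc, training_fun, [], num_executors, num_ps,
                        tensorboard=True, input_mode=TFCluster.InputMode.TENSORFLOW)
cluster.shutdown()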
48. Facebook: Scaling Synchronous SGD
June 2017: training time on ImageNet reduced from 2 weeks to 1 hour
➔ ~90% scaling efficiency going from 8 to 256 GPUs
Learning rate heuristic / warm-up phase / large batches
Paper: https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf
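A rough sketch of that heuristic (numbers are illustrative, not from the slides): scale the base learning rate linearly with the batch-size multiple k, and ramp up to it over the first few epochs.

def learning_rate(epoch, base_lr=0.1, k=32, warmup_epochs=5):
    target_lr = base_lr * k                      # linear scaling rule
    if epoch < warmup_epochs:                    # gradual warm-up phase
        return target_lr * (epoch + 1) / warmup_epochs
    return target_lr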
49. All-Reduce
N GPUs, K parameters
Comm. cost per GPU: 2(N-1) * K/N, which is (almost) independent of the number of GPUs since it approaches 2K as N grows
Overlaps communication and computation
Drawback: synchronous communication
From: http://research.baidu.com/bringing-hpc-techniques-deep-learning/
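A quick worked example of the formula above: per GPU, ring all-reduce moves 2(N-1) * K/N values, which tends to 2K whatever the number of GPUs.

K = 10_000_000                      # number of model parameters
for N in (2, 4, 8, 64, 256):        # number of GPUs in the ring
    print(N, 2 * (N - 1) * K / N)   # approaches 2*K = 2e7 as N grows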
51. Horovod - Better than Baidu All-Reduce?
Fork of Baidu All-Reduce
Improvements
1. Replaced Baidu ring-allreduce with NVIDIA NCCL
2. Tensor Fusion
3. Support for larger models
4. Pip package
5. Horovod Timeline
52. Migrating existing code to run on Horovod
1. Run hvd.init()
2. Pin a server GPU to be used by this process using config.gpu_options.visible_device_list. Local rank maps to a unique GPU for the process.
3. Wrap the optimizer in hvd.DistributedOptimizer.
4. Add hvd.BroadcastGlobalVariablesHook(0) to broadcast initial variable states from rank 0 to all other processes.
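The same four steps in a minimal sketch (pre-TF-2.0 style, following Horovod's documented pattern; the model here is a trivial placeholder).

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                                        # 1. initialize Horovod

config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())    # 2. one GPU per process

x = tf.Variable(1.0)                                              # placeholder "model"
loss = tf.square(x)
opt = hvd.DistributedOptimizer(tf.train.GradientDescentOptimizer(0.1))  # 3. wrap optimizer
global_step = tf.train.get_or_create_global_step()
train_op = opt.minimize(loss, global_step=global_step)

hooks = [hvd.BroadcastGlobalVariablesHook(0),                     # 4. sync initial state from rank 0
         tf.train.StopAtStepHook(last_step=100)]
with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        sess.run(train_op)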
53. Horovod/Baidu AllReduce
Provide as a service on HopsWorks
Integration of All-Reduce with a Hadoop cluster
- Use YARN to schedule GPUs
Scheduling of homogeneous GPUs and network
- YARN supports node labels
HopsFS authentication/authorization
TensorBoard lifecycle management as in HopsUtil
54. The team
Active contributors:
Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos
Kakantousis, Ermias Gebremeskel, Antonios Kouzoupis, Alex Ormenisan, Fabio Buso, Robin Andersson,
August Bonds, Filotas Siskos, Mahmoud Hamed.
Past contributors:
Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca,
Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto Lorente, Andre Moré, Ali Gholami, Davis
Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana
Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid
Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu, Aruna Kumari Yedurupaka, Tobias Johansson,
Roberto Bampi, Fanti Machmount Al Samisti, Braulio Grana, Adam Alpire, Zahin Azher Rashid.