Scaling TensorFlow with Hops, Global AI Conference Santa Clara

Scaling TensorFlow to 100s of GPUs
with Spark and Hops Hadoop
Global AI Conference, Santa Clara, January 18th 2018
Hops
Jim Dowling
Associate Prof @ KTH
Senior Researcher @ RISE SICS
CEO @ Logical Clocks AB

AI Hierarchy of Needs
DDL
(Distributed
Deep Learning)
Deep Learning,
RL, Automated ML
A/B Testing, Experimentation, ML
B.I. Analytics, Metrics, Aggregates,
Features, Training/Test Data
Reliable Data Pipelines, ETL, Unstructured and
Structured Data Storage, Real-Time Data Ingestion
[Adapted from https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007?gi=7e13a696e469 ]

[Adapted from https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007?gi=7e13a696e469 ]
DDL
(Distributed
Deep Learning)
Deep Learning,
RL, Automated ML
Analytics
Prediction

DDL
(Distributed
Deep Learning)
Deep Learning,
RL, Automated ML
Hops
[Adapted from https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007?gi=7e13a696e469 ]

More Data means Better Predictions
Prediction
Performance
Traditional AI
Deep Neural Nets
Amount Labelled Data
Hand-crafted
can outperform
1980s 1990s 2000s 2010s 2020s?

What about More Compute?
“Methods that scale with computation
are the future of AI”*
- Rich Sutton (A Founding Father of Reinforcement Learning)
2018-01-19 6/46
* https://www.youtube.com/watch?v=EeMCEQa85tw

More Compute should mean Faster Training
Training
Performance
Single-Host
Distributed
Available Compute
2015 2016 2017 2018?

Reduce DNN Training Time from 2 weeks to 1 hour
In 2017, Facebook reduced training time on ImageNet for a CNN from 2 weeks to 1 hour by scaling out to 256 GPUs using Ring-AllReduce on Caffe2.
https://arxiv.org/abs/1706.02677
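The cited paper attributes much of this result to a linear learning-rate scaling rule with a gradual warmup. The sketch below is a minimal illustration, assuming the paper's reference setting of a learning rate of 0.1 for a minibatch of 256 images and 32 images per GPU. The function names are hypothetical and this is not the Caffe2 code.

```python
def scaled_learning_rate(n_gpus, per_gpu_batch=32, base_lr=0.1, base_batch=256):
    """Linear scaling rule: multiply the learning rate by the same factor
    as the global minibatch size."""
    global_batch = n_gpus * per_gpu_batch
    return base_lr * global_batch / base_batch


def warmup_learning_rate(epoch, target_lr, base_lr=0.1, warmup_epochs=5):
    """Ramp linearly from base_lr to target_lr over the first epochs.
    `epoch` may be fractional, so the ramp can advance every iteration."""
    if epoch >= warmup_epochs:
        return target_lr
    return base_lr + (target_lr - base_lr) * epoch / warmup_epochs


# 256 GPUs x 32 images per GPU gives a global minibatch of 8192 images.
target = scaled_learning_rate(256)
schedule = [warmup_learning_rate(e, target) for e in range(7)]
```

With these assumptions the target learning rate for 256 GPUs is 32 times the single-node reference value, reached after the warmup epochs.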

DNN Training Time and Researcher Productivity
•Distributed Deep Learning
-Interactive analysis!
-Instant gratification!
•Single Host Deep Learning
• Suffer from Google-Envy
“My Model’s Training.”
Training

Distributed Training: Theory and Practice
Image from @hardmaru on Twitter.

Distributed Algorithms are not all Created Equal
Training
Performance
Parameter Servers
AllReduce
Available Compute

Ring-AllReduce vs Parameter Server(s)
[Figure: Ring-AllReduce (GPU 0 to GPU 3 in a ring, each with a send and a recv link to its neighbours) vs Param Server(s) (connected to GPU 1 to GPU 4)]
Network Bandwidth is the Bottleneck for Distributed Training
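To make the ring pattern concrete, here is a small single-process simulation of Ring-AllReduce (sum) in plain Python. It is only a sketch: the function name is hypothetical, it assumes the gradient length divides evenly by the number of workers, and real implementations (for example in NCCL or MPI) overlap the sends and receives across devices.

```python
def ring_allreduce(worker_grads):
    """Simulate a sum Ring-AllReduce over equal-length gradient lists."""
    n = len(worker_grads)
    length = len(worker_grads[0])
    assert length % n == 0, "sketch assumes length divisible by worker count"
    size = length // n
    # Split every worker's buffer into n chunks.
    bufs = [[list(g[c * size:(c + 1) * size]) for c in range(n)]
            for g in worker_grads]

    # Phase 1, scatter-reduce: in step s, worker i sends chunk (i - s) mod n
    # to its right neighbour, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sends = [(i, (i - s) % n, list(bufs[i][(i - s) % n])) for i in range(n)]
        for i, c, data in sends:
            dst = (i + 1) % n
            bufs[dst][c] = [a + b for a, b in zip(bufs[dst][c], data)]

    # Now worker i holds the fully reduced chunk (i + 1) mod n.
    # Phase 2, allgather: pass the finished chunks around the ring.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, list(bufs[i][(i + 1 - s) % n]))
                 for i in range(n)]
        for i, c, data in sends:
            bufs[(i + 1) % n][c] = data

    # Flatten the chunks back into one gradient list per worker.
    return [[x for chunk in b for x in chunk] for b in bufs]


# Four simulated GPUs, eight gradient values each.
grads = [[float(w * 10 + k) for k in range(8)] for w in range(4)]
reduced = ring_allreduce(grads)
expected = [sum(col) for col in zip(*grads)]
assert all(r == expected for r in reduced)
```

Each worker only ever talks to its two neighbours, and every link carries the same amount of data in every step, which is why no single node becomes a hot spot.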

AllReduce outperforms Parameter Servers
16 servers, each with 4 P100 GPUs (64 GPUs in total), connected by a RoCE-capable 25 Gbit/s network
(synthetic data). Speed is measured in images processed per second.*
*https://github.com/uber/horovod
For Bigger Models, Parameter Servers don’t scale
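One way to see why a bigger model hurts is to count the bytes moved per training step. The sketch below compares the data each worker sends in a Ring-AllReduce with the data arriving at one parameter server. It assumes a single, unsharded parameter server and a hypothetical 100 MB gradient; sharding the parameters over several servers reduces the per-server load.

```python
def ring_bytes_sent_per_worker(model_bytes, n_workers):
    """Each worker sends 2 * (N - 1) chunks of size M / N."""
    return 2 * (n_workers - 1) * model_bytes / n_workers


def param_server_bytes_received(model_bytes, n_workers):
    """A single parameter server receives one full gradient per worker."""
    return n_workers * model_bytes


model_mb = 100  # hypothetical gradient size in MB
for n in (4, 16, 64):
    ring = ring_bytes_sent_per_worker(model_mb, n)
    ps = param_server_bytes_received(model_mb, n)
    print(f"{n:3d} workers: ring {ring:7.1f} MB per worker, "
          f"parameter server {ps:7.1f} MB inbound")
```

The ring figure stays below twice the model size however many workers join, while the inbound traffic at the single parameter server grows linearly with the worker count.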

Multiple GPUs on a Single Server

NVLink vs PCI-E Single Root Complex
On Single-Host (dist. Training), the Bus can be the Bottleneck
[Images from: https://www.microway.com/product/octoputer-4u-10-gpu-server-single-root-complex/ ]
NVLink – 80 GB/s PCI-E – 16 GB/s
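As a rough feel for what those bus numbers mean, the following sketch computes the ideal time to move a gradient buffer at each bandwidth. The 1 GB payload is a hypothetical value, and latency and protocol overhead are ignored.

```python
def transfer_seconds(gigabytes, gb_per_second):
    """Ideal transfer time: payload size divided by peak bandwidth."""
    return gigabytes / gb_per_second


payload_gb = 1.0  # hypothetical gradient payload
nvlink_s = transfer_seconds(payload_gb, 80.0)  # NVLink figure from the slide
pcie_s = transfer_seconds(payload_gb, 16.0)    # PCI-E figure from the slide
print(f"NVLink: {nvlink_s * 1000:.1f} ms, PCI-E: {pcie_s * 1000:.1f} ms")
```

Under these assumptions the PCI-E transfer takes five times as long as the NVLink transfer for the same payload.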

Scale: Remove Bus and Net B/W Bottlenecks
A single slow worker, bus, or network link is enough to bottleneck DNN training time.
Ring-AllReduce
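The reason one slow component is enough: in synchronous training every worker waits at the end of each step, so the step takes as long as the slowest participant. A minimal sketch with hypothetical per-worker timings:

```python
def sync_step_seconds(worker_seconds):
    """A synchronous step finishes only when the slowest worker does."""
    return max(worker_seconds)


# Hypothetical timings: three healthy workers and one straggler.
timings = [0.20, 0.21, 0.19, 0.60]
step = sync_step_seconds(timings)
print(f"step time {step:.2f}s, fastest worker {min(timings):.2f}s")
```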

The Cloud is full of Bottlenecks….
Training
Performance
Public Cloud (10 GbE)
Infiniband On-Premise
Available Compute

Deep Learning Hierarchy of Scale
[Figure: pyramid of scale. Levels: DDL AllReduce on GPU Servers; DDL with GPU Servers and Parameter Servers; Parallel Experiments on GPU Servers; Many GPUs on a Single GPU Server; Single GPU. Labels for Training Time for ImageNet: Minutes, Hours, Days/Hours, Days, Weeks.]

Lots of good GPUs > A few great GPUs
Hops
100 x Nvidia 1080Ti (DeepLearning11)
8 x Nvidia P/V100 (DGX-1)
VS
Both top (100 GPUs) and bottom (8 GPUs) cost the same: $150K (2017).
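The price comparison on this slide works out as follows. The sketch only uses the figures given on the slide (the same $150K budget for 100 GPUs or for 8 GPUs) and says nothing about per-GPU performance.

```python
def cost_per_gpu(total_cost_usd, n_gpus):
    """Average cost of one GPU, including its share of the server."""
    return total_cost_usd / n_gpus


budget = 150_000  # USD, 2017 figure from the slide
commodity = cost_per_gpu(budget, 100)  # 100 x Nvidia 1080Ti
dgx = cost_per_gpu(budget, 8)          # 8 x Nvidia P/V100 (DGX-1)
print(f"1080Ti: ${commodity:,.0f} per GPU, DGX-1: ${dgx:,.0f} per GPU")
```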

Consumer GPU Server $15K (10 x 1080Ti)
https://www.oreilly.com/ideas/distributed-tensorflow

Cluster of Commodity GPU Servers
#EUai8
InfiniBand
Max 1-2 GPU Servers per Rack (2-4 KW per server)

TensorFlow Spark Platforms
•TensorFlow-on-Spark
•Deep Learning Pipelines
•Horovod
•Hops

Hops – Running Parallel Experiments
def model_fn(learning_rate, dropout):
    import tensorflow as tf
    from hops import tensorboard, hdfs, devices
    …..

from hops import tflauncher
args_dict = {'learning_rate': [0.001], 'dropout': [0.5]}
tflauncher.launch(spark, model_fn, args_dict)
Launch TF jobs as Mappers in Spark
“Pure” TensorFlow code
in the Executor

Hops – Parallel Experiments
def model_fn(learning_rate, dropout):
    …..

from hops import tflauncher
args_dict = {'learning_rate': [0.001, 0.005, 0.01],
             'dropout': [0.5, 0.6]}
tflauncher.launch(spark, model_fn, args_dict)
Launches 6 Executors, each with a different hyperparameter
combination. Each Executor can have 1-N GPUs.
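The count of 6 comes from the grid of 3 learning rates times 2 dropout values. The sketch below enumerates that grid with the standard library. It is not the tflauncher implementation, and `expand_grid` is a hypothetical helper name.

```python
import itertools


def expand_grid(args_dict):
    """Return one dict per hyperparameter combination (Cartesian product)."""
    names = sorted(args_dict)
    value_lists = [args_dict[name] for name in names]
    return [dict(zip(names, values))
            for values in itertools.product(*value_lists)]


args_dict = {'learning_rate': [0.001, 0.005, 0.01],
             'dropout': [0.5, 0.6]}
combinations = expand_grid(args_dict)
for combo in combinations:
    print(combo)
```

Each resulting dict corresponds to the keyword arguments that one call of `model_fn` would receive.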

Hops AllReduce/Horovod/TensorFlow
import horovod.tensorflow as hvd

def conv_model(feature, target, mode):
    …..

def main(_):
    hvd.init()
    # Wrap the optimizer so that gradients are averaged across workers with AllReduce.
    opt = hvd.DistributedOptimizer(opt)
    if hvd.local_rank() == 0:
        # Broadcast the initial variables from rank 0 to all other workers.
        hooks = [hvd.BroadcastGlobalVariablesHook(0), ..]
        …..
    else:
        hooks = [hvd.BroadcastGlobalVariablesHook(0), ..]
        …..

from hops import allreduce
allreduce.launch(spark, 'hdfs:///Projects/…/all_reduce.ipynb')
“Pure” TensorFlow code

TensorFlow and Hops Hadoop

Don’t do this: Different Clusters for Big Data and ML

Hops: Single ML and Big Data Cluster
IT
DataLake
GPUs Compute
Kafka
Data Engineering
Data Science
Project1 ProjectN
Elasticsearch

HopsFS: Next Generation HDFS*
Faster: 16x Throughput
Bigger: 37x Number of files
Scale Challenge Winner (2017)
*https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi

Size Matters: Improving the Performance of Small Files in HDFS. Salman Niazi, Seif Haridi, Jim Dowling. Poster, EuroSys 2017.
HopsFS now stores Small Files in the DB

GPUs supported as a Resource in Hops 2.8.2*
Hops is the only Hadoop distribution to support GPUs-as-a-Resource.
*Robin Andersson, GPU Integration for Deep Learning on YARN, MSc Thesis, 2017

GPU Resource Requests in Hops
HopsYARN
4 GPUs on any host
10 GPUs on 1 host
100 GPUs on 10 hosts with ‘Infiniband’
20 GPUs on 2 hosts with ‘Infiniband_P100’
HopsFS
Mix of commodity GPUs and more powerful GPUs good for (1) parallel experiments and (2) distributed training

Hopsworks Data Platform
Develop Train Test Serve
MySQL Cluster
Hive
InfluxDB
ElasticSearch
Kafka
Projects, Datasets, Users
HopsFS / YARN
Spark, Flink, Tensorflow
Jupyter, Zeppelin
Jobs, Kibana, Grafana
REST
API
Hopsworks

Python is a First-Class Citizen in Hopsworks
www.hops.io

Custom Python Environments with Conda
Python libraries are usable by Spark/TensorFlow

What is Hopsworks used for?

HopsFS
YARN
Public Cloud or On-Premise
Parquet
ETL Workloads
Hive
Hopsworks
Jobs trigger

HopsFS
YARN
Parquet
Business Intelligence Workloads
Hive
Jupyter/Zeppelin
or Jobs
Kibana
reports
Zeppelin

HopsFS
YARN
Grafana/
InfluxDB
Elastic/
Kibana
Parquet
Data Src
Batch Analytics
Kafka
…...MySQL
Streaming Analytics in Hopsworks
Hive

HopsFS
YARN
FeatureStore
Tensorflow
Serving
Tensorboard
TensorFlow in Hopsworks
Experiments
Kafka
Hive

One Click Deployment of TensorFlow Models

Hops API
•Java/Scala library
- Secure Streaming Analytics with Kafka/Spark/Flink
• SSL/TLS certs, Avro Schema, Endpoints for Kafka/Hopsworks/etc
•Python Library
- Managing tensorboard, Load/save models in HopsFS
- Distributed Tensorflow in Python
- Parameter sweeps for parallel experiments

TensorFlow-as-a-Service in RISE SICS ICE
• Hops
• Spark/Flink/Kafka/TensorFlow/Hadoop-as-a-service
www.hops.site
• RISE SICS ICE
• 250 kW Datacenter, ~400 servers
• Research and test environment
https://www.sics.se/projects/sics-ice-data-center-in-lulea

Summary
•Distribution can make Deep Learning practitioners
more productive.
https://www.oreilly.com/ideas/distributed-tensorflow
•Hopsworks is a new Data Platform built on HopsFS
with first-class support for Python and Deep
Learning / ML
- Tensorflow / Spark

The Team
Active:
Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman
Niazi, Mahmoud Ismail, Theofilos Kakantousis, Ermias
Gebremeskel, Antonios Kouzoupis, Alex Ormenisan, Fabio Buso,
Robin Andersson, August Bonds, Filotas Siskos, Mahmoud Hamed.
Alumni:
Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram
Leenders, Juan Roca, Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto
Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro,
Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana Chandrasekara, Nikolaos
Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid
Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu, Fanti Machmount Al
Samisti, Braulio Grana, Adam Alpire, Zahin Azher Rashid, ArunaKumari Yedurupaka,
Tobias Johansson, Roberto Bampi.
www.hops.io
@hopshadoop

Thank You.
Follow us: @hopshadoop
Star us: http://github.com/hopshadoop/hopsworks
Join us: http://www.hops.io
