This is a typical machine learning workflow, which involves three steps: feature engineering, model training, and online serving. Not surprisingly, the most important thing is to have the right features: those capturing historical information dominate other types of features. Once we have the right features and the right model, the other factors play only small roles.
We first derive feature representations from raw data, then feed these features into a machine learning model, and finally evaluate the candidate models and push the best one into the online service.
The machine learning workflow is complicated, usually involving several steps with the help of several infrastructure components.
The machine learning workflow starts with loading data from different data sources, such as HDFS, AWS S3, or a database system.
After that, we usually join data from different sources to generate a wide table. Apache Hive and Apache Spark are the most appropriate tools for this workload.
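At scale this join would be a Hive or Spark SQL query; the pure-Python sketch below only illustrates the idea of merging two sources into one wide row per key. The table names, columns, and `join_on_key` helper are illustrative, not from the original workflow.

```python
# Sketch: left-join two data sources on a shared key to produce a wide table.
# In production this would be a Hive/Spark SQL join over large datasets.

def join_on_key(left_rows, right_rows, key):
    """Left-join two lists of dicts on `key`, producing wide rows."""
    right_index = {row[key]: row for row in right_rows}
    wide = []
    for row in left_rows:
        merged = dict(row)                      # start from the left row
        merged.update(right_index.get(row[key], {}))  # add right columns if present
        wide.append(merged)
    return wide

users = [{"user_id": 1, "age": 30}, {"user_id": 2, "age": 25}]
clicks = [{"user_id": 1, "click_count": 7}]

wide_table = join_on_key(users, clicks, "user_id")
```

Rows without a match on the right side simply keep only their left-side columns, mirroring a SQL left join.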
Then, data scientists start data exploration via Zeppelin. The most common issue is an unbalanced label distribution in the dataset; for example, positive labels may far outnumber negative ones. To get a more accurate model, we subsample the group with more instances to balance the dataset.
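The downsampling idea can be sketched in a few lines: randomly sample the majority class down to the size of the minority class. This is a minimal pure-Python illustration; the `label` key and the function name are assumptions for the example.

```python
import random

def downsample_majority(rows, label_key="label"):
    """Randomly subsample the larger label group so both groups are equal size."""
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    if len(positives) > len(negatives):
        majority, minority = positives, negatives
    else:
        majority, minority = negatives, positives
    sampled = random.sample(majority, len(minority))  # draw without replacement
    return minority + sampled

random.seed(42)  # fixed seed so the example is reproducible
data = [{"label": 1} for _ in range(90)] + [{"label": 0} for _ in range(10)]
balanced = downsample_majority(data)
```

After balancing, `balanced` contains 10 positives and 10 negatives instead of the original 90/10 skew.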
After that, we randomly split the dataset into training and test sets with the help of Spark.
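In Spark this is a one-liner (`DataFrame.randomSplit`); the pure-Python sketch below shows the same shuffle-and-cut logic for a small in-memory dataset. The fraction and seed values are illustrative.

```python
import random

def random_split(rows, train_fraction=0.8, seed=7):
    """Shuffle the rows and cut them into train/test partitions."""
    rng = random.Random(seed)
    shuffled = rows[:]          # copy so the input is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = random_split(list(range(100)))
```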
Once we have the training data, we can start feature engineering.
Feature engineering technology has made great progress over the past decade, from hand-designed features to automated feature discovery by deep learning.
In many cases, hand-designed features can leverage domain knowledge and lead to near-optimal results; Spark MLlib provides many feature transformation and selection operators to make this simple and easy. But hand-designing features involves heavy manual work and requires hiring experienced engineers.
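Two of the most common hand-designed transforms are standard scaling of numeric columns and one-hot encoding of categorical columns (MLlib exposes these as `StandardScaler` and `OneHotEncoder`). A minimal pure-Python sketch of both, with illustrative input values:

```python
def standard_scale(values):
    """Center a numeric column to mean 0 and scale to unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5 or 1.0  # guard against a constant column
    return [(v - mean) / std for v in values]

def one_hot(categories):
    """Encode a categorical column as one-hot vectors over the sorted vocabulary."""
    vocab = sorted(set(categories))
    index = {c: i for i, c in enumerate(vocab)}
    return [[1.0 if index[c] == i else 0.0 for i in range(len(vocab))]
            for c in categories]

scaled = standard_scale([1.0, 2.0, 3.0])
encoded = one_hot(["ios", "android", "ios"])
```

In a real pipeline these transforms are fit on the training split only, then applied unchanged to the test split to avoid leakage.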
DNNs have been successfully applied in computer vision, speech recognition, and natural language processing in recent years, and more and more scientists and engineers are adopting them with good results. A DNN can learn features automatically via embeddings; the most famous embedding technique is word2vec, which produces a vector space in which each unique word in the corpus is assigned a corresponding vector.
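The useful property of such a vector space is that semantically similar words end up close together, which is typically measured with cosine similarity. The sketch below uses tiny hand-written vectors as stand-ins for learned word2vec embeddings; a real model (e.g. MLlib's `Word2Vec`) would learn hundreds of dimensions from a large corpus.

```python
import math

# Toy hand-written vectors standing in for learned word2vec embeddings.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

sim_royal = cosine(embeddings["king"], embeddings["queen"])
sim_fruit = cosine(embeddings["king"], embeddings["apple"])
```

With real embeddings, related words like "king" and "queen" score much higher than unrelated pairs, exactly as in this toy setup.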
Model training is the most important step of the whole pipeline.
Deploy the model in a distributed fashion for parallel model serving in batch or streaming mode.
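The core of parallel batch serving is partitioning the input and scoring each partition independently. The sketch below uses a thread pool and a hypothetical linear model (`WEIGHTS` and `score` are stand-ins, not part of the original workflow); a production system would run the same pattern across Spark executors.

```python
from concurrent.futures import ThreadPoolExecutor

WEIGHTS = [0.5, 0.25]  # hypothetical trained linear-model weights

def score(row):
    """Score one feature row with the (stand-in) linear model."""
    return sum(w * x for w, x in zip(WEIGHTS, row))

def batch_score(rows, workers=4):
    """Partition the rows and score each partition in parallel."""
    chunks = [rows[i::workers] for i in range(workers)]  # round-robin partitions
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scored_chunks = pool.map(lambda chunk: [score(r) for r in chunk], chunks)
    return [s for chunk in scored_chunks for s in chunk]

scores = batch_score([[1.0, 2.0], [2.0, 4.0]])
```

Streaming mode follows the same shape, except partitions arrive continuously as micro-batches instead of being cut from a fixed dataset.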
Evaluate the model offline or online using different metrics.
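For offline evaluation of a binary classifier, precision and recall are the standard starting point, especially with the unbalanced labels mentioned earlier. A minimal sketch (the labels below are illustrative):

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (offline evaluation)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives, how many are right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of true positives, how many were found
    return precision, recall

p, r = precision_recall([1, 0, 1, 1], [1, 1, 0, 1])
```

Online evaluation instead tracks business metrics (e.g. click-through rate) on live traffic, typically via A/B testing.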