TensorFlow™ is one of the most popular open-source projects for machine learning and deep learning, handling enterprise use cases such as image recognition, video analytics, and audio translation. However, training deep learning models is expensive and requires substantial GPU resources. Moreover, a real-life distributed TensorFlow application needs a set of services (workers, parameter servers, TensorBoard, etc.) working together, and these services must be carefully configured so they can talk to each other.
To make distributed TF applications easy to launch, manage, and monitor with YARN, we introduced the YARN service assembly along with other improvements such as GPU support, container DNS support, and scheduling enhancements. These improvements make running a distributed TF application on YARN as simple as running it locally, letting TF developers focus on deep learning algorithms instead of worrying about the underlying infrastructure. They also allow YARN to better manage a shared cluster that runs TF alongside other services and batch jobs.
During this session, we will take a closer look at these improvements, and we will demo a distributed TF assembly, consisting of workers, parameter servers, TensorBoard, and prediction servers, running on YARN.
Speaker:
Sunil Govindan, Senior Software Engineer, Hortonworks
Data is flooding into every business. In many applications, more training data and bigger models mean better results. We use Hadoop to store large amounts of data, use Spark on YARN for data processing, and can also run machine learning frameworks such as TensorFlow or XGBoost on the Hadoop-based big data platform for machine learning or deep learning.
Another important change is in the roles within machine learning. With growing datasets and increasingly complex problems, one person can no longer do all of the work; data scientists need to work together with software engineers. Data scientists usually explore the data and find the best machine learning pipeline. After that, software engineers deploy the model and make predictions on new input, which could be batch data or streaming data.
This is a typical machine learning workflow, which involves three steps: feature engineering, model training, and online serving. Not surprisingly, the most important thing is to have the right features: those capturing historical information dominate other types of features. Once we have the right features and the right model, other factors play small roles.
We first derive feature representations from raw data, then feed these features into a machine learning model, and finally evaluate the candidate models and push the best one into the online service.
The machine learning workflow is complicated; it usually involves several steps supported by several infrastructure components.
As the workflow shows, only a tiny fraction of the code is actually devoted to model learning. The machine learning workflow usually needs a lot of support from the big data platform, such as data collection from different data sources, feature extraction, feature transformation, and so on.
Let’s see how big data infrastructure can help machine learning, step by step.
The machine learning workflow starts with loading data from different data sources, such as HDFS, AWS S3, or a database system.
After that, we usually join data from different sources to generate a wide table. Apache Hive and Apache Spark are the most appropriate tools for this workload.
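To make the wide-table step concrete, here is a minimal pure-Python sketch of the join; in practice this would be a Hive or Spark SQL join at scale, and the field names (user_id, age, clicks) are hypothetical.

```python
# Sketch of the "wide table" join, in pure Python for illustration.
# In production this would be a Hive / Spark SQL join; the column
# names here are made up.

users = [
    {"user_id": 1, "age": 31},
    {"user_id": 2, "age": 25},
]
activity = [
    {"user_id": 1, "clicks": 12},
    {"user_id": 2, "clicks": 3},
]

def join_on_user_id(left, right):
    """Inner-join two lists of records on user_id into wide records."""
    right_by_id = {r["user_id"]: r for r in right}
    return [
        {**l, **right_by_id[l["user_id"]]}
        for l in left
        if l["user_id"] in right_by_id
    ]

wide = join_on_user_id(users, activity)
print(wide[0])  # {'user_id': 1, 'age': 31, 'clicks': 12}
```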
Then data scientists start data exploration via Zeppelin. The most common issue is an unbalanced label distribution in the dataset; for example, the number of positive labels may be far greater than the number of negative labels. To get a more accurate model, we subsample the class with more instances to make the dataset balanced.
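The balancing step above can be sketched as a simple downsampling of the majority class (in Spark this could be done with `sampleBy`); the data below is synthetic:

```python
# Sketch of balancing an unbalanced dataset by downsampling the
# majority class to the size of the minority class.
import random

random.seed(0)

# Synthetic labeled examples: far more positives than negatives.
positives = [("x%d" % i, 1) for i in range(1000)]
negatives = [("y%d" % i, 0) for i in range(100)]

def balance(majority, minority):
    """Downsample the majority class to match the minority class size."""
    return random.sample(majority, len(minority)) + minority

balanced = balance(positives, negatives)
print(len(balanced))  # 200: 100 positives + 100 negatives
```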
After that, we randomly split the dataset into training and test sets with the help of Spark.
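The random split can be sketched as follows; Spark provides `DataFrame.randomSplit` for the same purpose at scale:

```python
# Sketch of a random train/test split over placeholder rows.
import random

random.seed(42)

dataset = list(range(1000))  # placeholder rows

def random_split(rows, train_fraction=0.8):
    """Shuffle a copy of the rows and cut at the train fraction."""
    rows = rows[:]  # avoid mutating the caller's list
    random.shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

train, test = random_split(dataset)
print(len(train), len(test))  # 800 200
```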
Once we get training data, we can start feature engineering.
Feature engineering has made great progress over the past decade, moving from hand-designed features to automated feature discovery with deep learning.
In many cases, hand-designed features can leverage domain knowledge and lead to excellent results, and Spark MLlib provides many feature transformation and selection operators to make this simple and easy. But it involves heavy manual work and requires hiring experienced engineers.
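As a concrete example of a hand-designed transform, here is a pure-Python sketch of bucketizing a numeric column, the kind of operator Spark MLlib offers as `Bucketizer`; the age boundaries are made up:

```python
# Sketch of a hand-designed feature transform: bucketizing a numeric
# column into ordinal bucket indices. Boundaries are hypothetical.
import bisect

splits = [0, 18, 35, 60]  # hypothetical age bucket boundaries

def bucketize(value, splits):
    """Return the index of the bucket the value falls into."""
    return bisect.bisect_right(splits, value) - 1

ages = [15, 28, 42, 70]
print([bucketize(a, splits) for a in ages])  # [0, 1, 2, 3]
```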
DNNs have been successfully applied to computer vision, speech recognition, and natural language processing in recent years, and more and more scientists and engineers are adopting them with good results. A DNN can learn features automatically via embeddings; the most famous embedding trick is word2vec, which produces a vector space in which each unique word in the corpus is assigned a corresponding vector.
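To illustrate what an embedding buys you, here is a tiny sketch: each word maps to a dense vector, and related words end up with nearby vectors. The vectors below are invented for illustration; real word2vec learns them from a corpus:

```python
# Sketch of embedding lookup plus cosine similarity. The vectors are
# made-up toy values, not trained word2vec output.
import math

embedding = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.12],
    "apple": [0.10, 0.05, 0.90],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# In a good embedding, "king" is closer to "queen" than to "apple".
print(cosine(embedding["king"], embedding["queen"]) >
      cosine(embedding["king"], embedding["apple"]))  # True
```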
Model training is the most important step of the whole pipeline.
Deep learning is becoming more and more powerful, but it can’t solve all of humanity’s problems. In natural language processing, computer vision, and speech or video recognition, deep learning may perform better than traditional models. But for problems like recommendation or CTR estimation, highly scalable linear models still play a major role. And for graph-related models like topic models or PageRank, we still need a graph computation engine.
Furthermore, hybrid models are becoming more and more useful. For example, Facebook presented a hybrid model structure, the concatenation of boosted decision trees and a probabilistic sparse linear classifier, illustrated in the figure. Their experience shows that this hybrid structure significantly increases prediction accuracy.
Google also developed a hybrid model, the wide and deep learning model, which jointly trains a wide linear model (for memorization) alongside a deep neural network (for generalization), combining the strengths of both.
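The wide-and-deep idea can be sketched in a few lines: the final prediction applies a sigmoid to the sum of a wide linear part over sparse (crossed) features and a small deep part over dense features. All weights below are made-up constants, not trained values:

```python
# Sketch of the wide-and-deep combination. Feature names and weights
# are illustrative constants, not a trained model.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def wide_logit(sparse_features, weights):
    # Wide part: linear model over sparse (possibly crossed) features.
    return sum(weights.get(f, 0.0) for f in sparse_features)

def deep_logit(dense_features, w_hidden, w_out):
    # Deep part: one hidden ReLU layer over dense features.
    hidden = [max(0.0, sum(w * x for w, x in zip(row, dense_features)))
              for row in w_hidden]
    return sum(w * h for w, h in zip(w_out, hidden))

sparse = ["user_installed_app=netflix", "impression_app=pandora"]
dense = [0.5, -1.2]
wide_w = {"user_installed_app=netflix": 0.3, "impression_app=pandora": -0.1}
w_hidden = [[0.2, 0.4], [-0.3, 0.1]]
w_out = [0.5, 0.7]

# Jointly, the two parts contribute one summed logit.
p = sigmoid(wide_logit(sparse, wide_w) + deep_logit(dense, w_hidden, w_out))
print(0.0 < p < 1.0)  # True: a click probability
```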
From the above cases, we can see that a machine learning platform should support both traditional machine learning models and deep learning models; both are very useful.
Deploy the model in a distributed fashion for parallel model serving in batch or streaming mode.
Evaluate the model offline or online with different metrics.
Predicting ad CTR is a massive-scale machine learning problem that is central to the multi-billion-dollar online advertising industry.
A typical CTR prediction problem shares similarities with many other industrial machine learning problems, which makes it very representative.
Usually there are billions of ad impressions daily, each with a unique id. We join impressions with the click stream every x minutes to produce the dataset for machine learning.
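The labeling step can be sketched as follows: an impression gets label 1 if a click with the same impression id arrives within the join window, else 0. Ids and feature fields below are illustrative:

```python
# Sketch of producing labeled CTR training data by joining
# impressions with the click stream for one window. Ids and the
# "ad" feature are made up for illustration.

impressions = {"imp-1": {"ad": "a9"}, "imp-2": {"ad": "a7"}}
clicks = {"imp-1"}  # impression ids seen in this window's click stream

dataset = [
    (imp_id, features, 1 if imp_id in clicks else 0)
    for imp_id, features in impressions.items()
]
print(dataset)  # [('imp-1', {'ad': 'a9'}, 1), ('imp-2', {'ad': 'a7'}, 0)]
```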
Each model has advantages and disadvantages:
Non-linear models, on the other hand, are able to utilize different feature combinations and thus could potentially improve estimation performance, but they can’t scale to a large number of parameters.
Deep neural networks (DNNs) are able to extract hidden structures and intrinsic patterns at different levels of abstraction from training data. But training a deep neural network on a large input feature space requires tuning a huge number of parameters, which is computationally expensive. Moreover, the raw inputs are high-dimensional, sparse binary features converted from the raw categorical features, which makes it hard to train traditional DNNs at large scale.