This document summarizes a presentation about TonY, which is LinkedIn's solution for running distributed machine learning jobs like TensorFlow on Hadoop clusters. TonY allows users to develop models interactively using Jupyter notebooks directly on HDFS data and then easily scale training to multiple machines and GPUs with minimal code changes. It handles job orchestration, fault tolerance through model checkpointing, and integration with tools like TensorBoard and Azkaban workflows. The open source TonY project addresses many of the limitations of other solutions for distributed deep learning on Hadoop.
Hadoop Meetup Jan 2019 - TonY: TensorFlow on YARN and Beyond
1. TonY: TensorFlow on YARN and Beyond
Hadoop Contributors Meetup, LinkedIn, 1/30/19
Anthony Hsu, Jonathan Hung
2. Machine Learning at LinkedIn
People You May Know
Job Recommendations
News Feed
LinkedIn Learning Recommendations
3. Challenges of using Machine Learning
https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
[Figure: the tiny "ML Code" box surrounded by the much larger supporting infrastructure required by real-world ML systems]
4-6. Machine Learning process
• The ML process has many parts
• At LinkedIn, we have a Productive ML (Pro-ML) initiative to accelerate this loop. We have teams working on every part of the ML pipeline.
• TonY's focus is on model development and training.
[Diagram: ML pipeline loop: Data Ingestion → Data Preparation → Feature Extraction → Model Development → Model Training → Model Deployment → Model Serving]
7. Model development and training challenges
• Want to interactively explore the data and quickly experiment with different models
• Want to take code developed on a single machine and scale it up to run on a cluster of machines
8. Interactive model development
• Popular solution: notebooks!
• At LinkedIn, we use Jupyter notebooks
• Challenge: cannot directly access HDFS data from a dev box due to security restrictions
• Previous solution: install everything on your own machine and run TensorFlow inside Spark jobs via the Livy service
• Error prone with timeout issues; we really want to run TensorFlow Python programs, not Spark jobs!
9. Notebooks on demand with TonY
• We provide our users with a prebuilt Python pex file with common libraries already installed
• TonY lets users spin up a Jupyter notebook in a container on our Hadoop cluster
○ Can request GPU resources, too
10. Scaling up training
• To train a complex model on large amounts of data, we need to:
○ Parallelize the training (a common strategy is data parallelism; see the sketch after this list)
○ Run on multiple machines (need a fault tolerance strategy)
○ Use GPUs (to accelerate computations; many models are computation-bound, not IO-bound)
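As a rough illustration of data parallelism (not TonY's actual code), here is a minimal TensorFlow 1.x estimator script. A launcher describes the cluster to each task through the TF_CONFIG environment variable, as sketched in the comment; the hostnames are made up, and with TF_CONFIG unset the same script simply trains on one machine.

```python
# Minimal data-parallelism sketch with TensorFlow 1.x estimators.
# Every task runs this same script; a launcher tells each process its
# role via the TF_CONFIG environment variable, e.g.:
#   {"cluster": {"ps": ["ps0:2222"],
#                "worker": ["worker0:2222", "worker1:2222"]},
#    "task": {"type": "worker", "index": 0}}
# With TF_CONFIG unset, the script trains on a single machine.
import numpy as np
import tensorflow as tf  # TensorFlow 1.x APIs


def input_fn():
    # Toy in-memory dataset; a real job would read training data from HDFS.
    x = np.random.rand(1000, 4).astype(np.float32)
    y = (x.sum(axis=1) > 2.0).astype(np.int32)
    return (tf.data.Dataset.from_tensor_slices(({"x": x}, y))
            .shuffle(1000).batch(32).repeat())


def model_fn(features, labels, mode):
    logits = tf.layers.dense(features["x"], 2)
    loss = tf.losses.sparse_softmax_cross_entropy(labels, logits)
    train_op = tf.train.AdagradOptimizer(0.05).minimize(
        loss, global_step=tf.train.get_or_create_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)


estimator = tf.estimator.Estimator(model_fn, model_dir="/tmp/toy_model")
# train_and_evaluate reads TF_CONFIG and assumes the right role per task.
tf.estimator.train_and_evaluate(
    estimator,
    tf.estimator.TrainSpec(input_fn, max_steps=500),
    tf.estimator.EvalSpec(input_fn, steps=10))
```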
11. New Hadoop YARN features useful for ML
• GPU resource support added in Hadoop 3.x
• Docker container support productionized in Hadoop 3.x
• Submarine deep learning CLI in Hadoop 3.2 (released last week)
12. How can we do distributed training on YARN?
• Want to take a program developed on a single machine and run it in distributed mode with little or no modifications
• Want to take advantage of YARN's features
• Some existing open-source solutions we looked at:
○ Kubeflow (Google)
○ TensorFlow on Spark (Yahoo!)
○ Spark Deep Learning (Databricks)
○ ToY: TensorFlow on YARN (Intel)
○ XLearning (Qihoo)
○ Horovod (Uber)
○ YARN Native Service
13. Pros and Cons of open-source solutions
• Kubeflow (Google)
○ Pros: large marketplace of libraries and plugins; active community
○ Cons: does not run on Hadoop; very new project (< 1.5 years old)
• TensorFlow on Spark (Yahoo!)
○ Pros: can integrate TensorFlow into Spark programs
○ Cons: no GPU isolation; no heterogeneous resource support
• Spark Deep Learning (Databricks)
○ Pros: integrates with Spark; integrates with the Databricks MLflow ecosystem
○ Cons: no GPU resource support until Spark 3.0.0 (SPARK-20327); no heterogeneous resource support
• ToY: TensorFlow on YARN (Intel)
○ Pros: lightweight, loosely-coupled system
○ Cons: no activity in over 1.5 years
• XLearning (Qihoo)
○ Pros: support for many ML frameworks
○ Cons: no GPU isolation support
• Horovod (Uber)
○ Pros: fast MPI communication
○ Cons: requires more code modification; no GPU isolation support
• YARN Native Service
○ Pros: no custom application required, native to YARN
○ Cons: distributed TensorFlow requires YARN DNS
14. Ultimately, we decided to build our own solution: TonY
• Client + YARN application for running distributed ML jobs
• We started with just TensorFlow support (hence TensorFlow on YARN (TonY))
• Now we also support PyTorch and are working on R support (so now perhaps Things on YARN is more apt)
• Client lets you easily launch a job with only a few required arguments:
○ Number of workers, parameter servers, GPUs per task
○ Source file
○ Python virtual environment
• Custom Azkaban jobtype plugin lets you run TonY jobs in the same workflow as Spark and MapReduce jobs
15. What is TonY?
• Orchestrates running distributed TensorFlow scripts on Hadoop
• Acquires compute resources from Hadoop (memory, CPU, GPU)
• Sets up and launches distributed TensorFlow jobs on Hadoop clusters (see the sketch below)
• Manages application lifecycle
○ Fault tolerance
○ Job monitoring
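Below is a low-level sketch of the kind of per-task setup such an orchestrator performs, using TensorFlow 1.x's ClusterSpec and Server. The environment variable names (CLUSTER_SPEC, JOB_NAME, TASK_INDEX) are illustrative assumptions, not TonY's exact interface.

```python
# Illustrative sketch of per-task setup by a distributed-TF launcher.
# The env var names here are assumptions for illustration only.
import json
import os
import tensorflow as tf  # TensorFlow 1.x APIs

# Defaults let the same script run standalone on one machine.
cluster = tf.train.ClusterSpec(json.loads(
    os.environ.get("CLUSTER_SPEC", '{"worker": ["localhost:2222"]}')))
job_name = os.environ.get("JOB_NAME", "worker")
task_index = int(os.environ.get("TASK_INDEX", "0"))

# Every task starts a server so the processes can reach each other.
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()  # parameter servers just host variables
else:
    # Workers build the graph; replica_device_setter pins variables
    # to the ps tasks (or keeps them local when there are none).
    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        counter = tf.Variable(0, name="global_counter")
        increment = tf.assign_add(counter, 1)
    with tf.Session(server.target) as sess:
        sess.run(tf.global_variables_initializer())
        print("counter =", sess.run(increment))
```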
23. Scaling distributed TensorFlow on Hadoop
• TensorBoard support before:
○ After training, copy log files to your local machine
○ Start a local TensorBoard instance pointing to the local log files
24. Scaling distributed TensorFlow on Hadoop
• TensorBoard support with TonY:
○ Directly access TensorBoard with one click while the job is running
26-28. Scaling distributed TensorFlow on Hadoop
• Fault tolerance
• More workers = more failures
• First attempt periodically saves model checkpoints to HDFS
• Worker failure -> tear down and restart the application
• Restart reads the checkpoints from HDFS and resumes from where the previous attempt left off (see the sketch below)
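A minimal sketch of this checkpoint-and-resume pattern with TensorFlow 1.x's MonitoredTrainingSession; the HDFS path is a made-up example. Checkpoints are saved periodically to the checkpoint directory, and a restarted attempt automatically restores the latest one from the same directory.

```python
# Sketch of checkpoint-based fault tolerance with TensorFlow 1.x.
# MonitoredTrainingSession periodically saves checkpoints to
# checkpoint_dir and, on restart, restores the latest one, so a
# relaunched attempt resumes where the previous one left off.
import tensorflow as tf  # TensorFlow 1.x APIs

global_step = tf.train.get_or_create_global_step()
w = tf.Variable(0.0)
loss = tf.square(w - 3.0)  # toy objective; converges to w = 3
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
    loss, global_step=global_step)

with tf.train.MonitoredTrainingSession(
        checkpoint_dir="hdfs://namenode:8020/user/alice/model",  # example path
        save_checkpoint_secs=60) as sess:  # checkpoint every minute
    while not sess.should_stop():
        _, step = sess.run([train_op, global_step])
        if step >= 1000:
            break
```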
29. TonY is open source
• https://github.com/linkedin/TonY
○ Pull requests welcome!
• Engineering blog post: https://engineering.linkedin.com/blog/2018/09/open-sourcing-tony--native-support-of-tensorflow-on-hadoop
30. TonY next steps
• TonY History Server for viewing past job executions
• Collecting metrics and integrating with Dr. Elephant to provide tuning
recommendations
• TonY runtime for Submarine (just released in Hadoop 3.2)