This document summarizes a presentation about TonY, which is LinkedIn's solution for running distributed machine learning jobs like TensorFlow on Hadoop clusters. TonY allows users to develop models interactively using Jupyter notebooks directly on HDFS data and then easily scale training to multiple machines and GPUs with minimal code changes. It handles job orchestration, fault tolerance through model checkpointing, and integration with tools like TensorBoard and Azkaban workflows. The open source TonY project addresses many of the limitations of other solutions for distributed deep learning on Hadoop.
Hadoop Meetup Jan 2019 - TonY: TensorFlow on YARN and Beyond
1. TonY: TensorFlow on YARN and Beyond
Hadoop Contributors Meetup, LinkedIn, 1/30/19
Anthony Hsu, Jonathan Hung
2. Machine Learning at LinkedIn
People You May Know
Job Recommendations
News Feed
LinkedIn Learning Recommendations
3. Challenges of using Machine Learning
https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
[Figure: the tiny "ML Code" box surrounded by the much larger supporting infrastructure required by real-world ML systems]
4-6. Machine Learning process
• The ML process has many parts
• At LinkedIn, we have a Productive ML (Pro-ML) initiative to accelerate this loop. We have teams working on every part of the ML pipeline.
• TonY's focus is on model development and training.
[Diagram: ML pipeline loop: Data Ingestion → Data Preparation → Feature Extraction → Model Development → Model Training → Model Deployment → Model Serving]
7. Model development and training challenges
• Want to interactively explore the data and quickly experiment with different models
• Want to take code developed on a single machine and scale it up to run on a cluster of machines
8. Interactive model development
• Popular solution: notebooks!
• At LinkedIn, we use Jupyter notebooks
• Challenge: cannot directly access HDFS data from a dev box due to security restrictions
• Previous solution: install everything on your own machine and run TensorFlow inside Spark jobs via the Livy service
• Error prone with timeout issues; we really want to run TensorFlow Python programs, not Spark jobs!
9. Notebooks on demand with TonY
• We provide our users with a prebuilt Python pex file with common libraries already installed
• TonY lets users spin up a Jupyter notebook in a container on our Hadoop cluster
○ Can request GPU resources, too
10. Scaling up training
• To train a complex model on large amounts of data, we need to:
○ Parallelize the training (a common strategy is data parallelism; see the sketch after this list)
○ Run on multiple machines (need a fault tolerance strategy)
○ Use GPUs (to accelerate computations; many models are computation-bound, not IO-bound)
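As a rough illustration of data parallelism (not TonY's actual code), here is a minimal TensorFlow 1.x estimator script. A launcher describes the cluster to each task through the TF_CONFIG environment variable, as sketched in the comment; the hostnames are made up, and with TF_CONFIG unset the same script simply trains on one machine.

```python
# Minimal data-parallelism sketch with TensorFlow 1.x estimators.
# Every task runs this same script; a launcher tells each process its
# role via the TF_CONFIG environment variable, e.g.:
#   {"cluster": {"ps": ["ps0:2222"],
#                "worker": ["worker0:2222", "worker1:2222"]},
#    "task": {"type": "worker", "index": 0}}
# With TF_CONFIG unset, the script trains on a single machine.
import numpy as np
import tensorflow as tf  # TensorFlow 1.x APIs


def input_fn():
    # Toy in-memory dataset; a real job would read training data from HDFS.
    x = np.random.rand(1000, 4).astype(np.float32)
    y = (x.sum(axis=1) > 2.0).astype(np.int32)
    return (tf.data.Dataset.from_tensor_slices(({"x": x}, y))
            .shuffle(1000).batch(32).repeat())


def model_fn(features, labels, mode):
    logits = tf.layers.dense(features["x"], 2)
    loss = tf.losses.sparse_softmax_cross_entropy(labels, logits)
    train_op = tf.train.AdagradOptimizer(0.05).minimize(
        loss, global_step=tf.train.get_or_create_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)


estimator = tf.estimator.Estimator(model_fn, model_dir="/tmp/toy_model")
# train_and_evaluate reads TF_CONFIG and assumes the right role per task.
tf.estimator.train_and_evaluate(
    estimator,
    tf.estimator.TrainSpec(input_fn, max_steps=500),
    tf.estimator.EvalSpec(input_fn, steps=10))
```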
11. New Hadoop YARN features useful for ML
• GPU resource support added in Hadoop 3.x
• Docker container support productionized in Hadoop 3.x
• Submarine deep learning CLI in Hadoop 3.2 (released last week)
12. How can we do distributed training on YARN?
• Want to take a program developed on a single machine and run it in distributed mode with little or no modifications
• Want to take advantage of YARN's features
• Some existing open-source solutions we looked at:
○ Kubeflow (Google)
○ TensorFlow on Spark (Yahoo!)
○ Spark Deep Learning (Databricks)
○ ToY: TensorFlow on YARN (Intel)
○ XLearning (Qihoo)
○ Horovod (Uber)
○ YARN Native Service
13. Pros and Cons of open-source solutions
• Kubeflow (Google)
○ Pros: large marketplace of libraries and plugins; active community
○ Cons: does not run on Hadoop; very new project (< 1.5 years old)
• TensorFlow on Spark (Yahoo!)
○ Pros: can integrate TensorFlow into Spark programs
○ Cons: no GPU isolation; no heterogeneous resource support
• Spark Deep Learning (Databricks)
○ Pros: integrates with Spark; integrates with the Databricks MLflow ecosystem
○ Cons: no GPU resource support until Spark 3.0.0 (SPARK-20327); no heterogeneous resource support
• ToY: TensorFlow on YARN (Intel)
○ Pros: lightweight, loosely-coupled system
○ Cons: no activity in over 1.5 years
• XLearning (Qihoo)
○ Pros: support for many ML frameworks
○ Cons: no GPU isolation support
• Horovod (Uber)
○ Pros: fast MPI communication
○ Cons: requires more code modification; no GPU isolation support
• YARN Native Service
○ Pros: no custom application required, native to YARN
○ Cons: distributed TensorFlow requires YARN DNS
14. Ultimately, we decided to build our own solution: TonY
• Client + YARN application for running distributed ML jobs
• We started with just TensorFlow support (hence TensorFlow on YARN (TonY))
• Now we also support PyTorch and are working on R support (so now perhaps Things on YARN is more apt)
• Client lets you easily launch a job with only a few required arguments:
○ Number of workers, parameter servers, GPUs per task
○ Source file
○ Python virtual environment
• Custom Azkaban jobtype plugin lets you run TonY jobs in the same workflow as Spark and MapReduce jobs
15. What is TonY?
• Orchestrates running distributed TensorFlow scripts on Hadoop
• Acquires compute resources from Hadoop (memory, CPU, GPU)
• Sets up and launches distributed TensorFlow jobs on Hadoop clusters (see the sketch below)
• Manages application lifecycle
○ Fault tolerance
○ Job monitoring
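Below is a low-level sketch of the kind of per-task setup such an orchestrator performs, using TensorFlow 1.x's ClusterSpec and Server. The environment variable names (CLUSTER_SPEC, JOB_NAME, TASK_INDEX) are illustrative assumptions, not TonY's exact interface.

```python
# Illustrative sketch of per-task setup by a distributed-TF launcher.
# The env var names here are assumptions for illustration only.
import json
import os
import tensorflow as tf  # TensorFlow 1.x APIs

# Defaults let the same script run standalone on one machine.
cluster = tf.train.ClusterSpec(json.loads(
    os.environ.get("CLUSTER_SPEC", '{"worker": ["localhost:2222"]}')))
job_name = os.environ.get("JOB_NAME", "worker")
task_index = int(os.environ.get("TASK_INDEX", "0"))

# Every task starts a server so the processes can reach each other.
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()  # parameter servers just host variables
else:
    # Workers build the graph; replica_device_setter pins variables
    # to the ps tasks (or keeps them local when there are none).
    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        counter = tf.Variable(0, name="global_counter")
        increment = tf.assign_add(counter, 1)
    with tf.Session(server.target) as sess:
        sess.run(tf.global_variables_initializer())
        print("counter =", sess.run(increment))
```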
23. Scaling distributed TensorFlow on Hadoop
• TensorBoard support before:
○ After training, copy log files to your local machine
○ Start a local TensorBoard instance pointing to the local log files
24. Scaling distributed TensorFlow on Hadoop
• TensorBoard support with TonY:
○ Directly access TensorBoard with one click while the job is running
26-28. Scaling distributed TensorFlow on Hadoop
• Fault tolerance
• More workers = more failures
• First attempt periodically saves model checkpoints to HDFS
• Worker failure -> tear down and restart the application
• Restart reads the checkpoints from HDFS and resumes from where the previous attempt left off (see the sketch below)
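A minimal sketch of this checkpoint-and-resume pattern with TensorFlow 1.x's MonitoredTrainingSession; the HDFS path is a made-up example. Checkpoints are saved periodically to the checkpoint directory, and a restarted attempt automatically restores the latest one from the same directory.

```python
# Sketch of checkpoint-based fault tolerance with TensorFlow 1.x.
# MonitoredTrainingSession periodically saves checkpoints to
# checkpoint_dir and, on restart, restores the latest one, so a
# relaunched attempt resumes where the previous one left off.
import tensorflow as tf  # TensorFlow 1.x APIs

global_step = tf.train.get_or_create_global_step()
w = tf.Variable(0.0)
loss = tf.square(w - 3.0)  # toy objective; converges to w = 3
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
    loss, global_step=global_step)

with tf.train.MonitoredTrainingSession(
        checkpoint_dir="hdfs://namenode:8020/user/alice/model",  # example path
        save_checkpoint_secs=60) as sess:  # checkpoint every minute
    while not sess.should_stop():
        _, step = sess.run([train_op, global_step])
        if step >= 1000:
            break
```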
29. TonY is open source
• https://github.com/linkedin/TonY
○ Pull requests welcome!
• Engineering blog post: https://engineering.linkedin.com/blog/2018/09/open-sourcing-tony--native-support-of-tensorflow-on-hadoop
30. TonY next steps
• TonY History Server for viewing past job executions
• Collecting metrics and integrating with Dr. Elephant to provide tuning
recommendations
• TonY runtime for Submarine (just released in Hadoop 3.2)