
Hadoop Meetup Jan 2019 - TonY: TensorFlow on YARN and Beyond


Anthony Hsu and Jonathan Hung of LinkedIn present regarding TonY ("TensorFlow on YARN"), a system for running deep learning workloads in a distributed fashion on top of YARN. They discuss its architecture and implementation, as well as where the project is headed.

This is taken from the Apache Hadoop Contributors Meetup on January 30, hosted by LinkedIn in Mountain View.


  1. TonY: TensorFlow on YARN and Beyond. Hadoop Contributors Meetup, LinkedIn, 1/30/19. Anthony Hsu, Jonathan Hung
  2. Machine Learning at LinkedIn: People You May Know, Job Recommendations, News Feed, LinkedIn Learning Recommendations
  3. Challenges of using Machine Learning: the ML code itself is only a small piece of a real-world ML system (figure from https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf)
  4. Machine Learning process
     • The ML process has many parts: Data Ingestion, Data Preparation, Feature Extraction, Model Development, Model Training, Model Deployment, Model Serving
  5. Machine Learning process
     • The ML process has many parts: Data Ingestion, Data Preparation, Feature Extraction, Model Development, Model Training, Model Deployment, Model Serving
     • At LinkedIn, we have a Productive ML (Pro-ML) initiative to accelerate this loop. We have teams working on every part of the ML pipeline.
  6. Machine Learning process
     • The ML process has many parts: Data Ingestion, Data Preparation, Feature Extraction, Model Development, Model Training, Model Deployment, Model Serving
     • At LinkedIn, we have a Productive ML (Pro-ML) initiative to accelerate this loop. We have teams working on every part of the ML pipeline.
     • TonY's focus is on model development and training.
  7. Model development and training challenges
     • Want to interactively explore the data and quickly experiment with different models
     • Want to take code developed on a single machine and scale it up to run on a cluster of machines
  8. Interactive model development
     • Popular solution: notebooks! At LinkedIn, we use Jupyter notebooks
     • Challenge: cannot directly access HDFS data from a dev box due to security restrictions
     • Previous solution: install on your own machine and run TensorFlow inside Spark jobs via the Livy service
       ○ Error prone, timeout issues; we really want to run TensorFlow Python programs, not Spark jobs!
  9. Notebooks on demand with TonY
     • We provide our users with a prebuilt Python pex file with common libraries already installed
     • TonY lets users spin up this notebook in a container in our Hadoop cluster
       ○ Can request GPU resources, too
  10. Scaling up training
     • To train a complex model on large amounts of data, you need to:
       ○ Parallelize the training (a common strategy is data parallelism)
       ○ Run on multiple machines (need a fault tolerance strategy)
       ○ Use GPUs to accelerate computations (many models are computation-bound, not IO-bound)
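As a minimal sketch of the data-parallelism idea mentioned above (plain Python for illustration, not TonY or TensorFlow code): each worker computes a gradient on its own shard of the batch, and the gradients are averaged before a single shared parameter update.

```python
# Illustrative sketch of synchronous data parallelism: each worker computes
# a gradient on its shard of the batch; the gradients are averaged (as an
# all-reduce or parameter server would do) before one parameter update.

def gradient(w, xs, ys):
    """Gradient of mean squared error for a 1-D linear model y = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def parallel_step(w, batch, num_workers, lr=0.1):
    xs, ys = batch
    shard = len(xs) // num_workers
    grads = []
    for i in range(num_workers):      # in reality, each shard runs on its own machine
        s = slice(i * shard, (i + 1) * shard)
        grads.append(gradient(w, xs[s], ys[s]))
    avg = sum(grads) / num_workers    # gradient averaging across workers
    return w - lr * avg

w = 0.0
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]             # true model: y = 2x
for _ in range(100):
    w = parallel_step(w, (xs, ys), num_workers=2)
print(round(w, 2))                    # converges toward 2.0
```

The averaged update is mathematically the same as computing the gradient over the full batch on one machine, which is why code developed on a single machine can scale out with few changes.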
  11. New Hadoop YARN features useful for ML
     • GPU resource support added in Hadoop 3.x
     • Docker container support productionized in Hadoop 3.x
     • Submarine deep learning CLI in Hadoop 3.2 (released last week)
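For context on the GPU support mentioned above, enabling GPU scheduling in Hadoop 3.x is a matter of cluster configuration along these lines (a sketch based on the Hadoop GPU documentation; exact property names and additional isolation settings vary by version, so check the docs for your release):

```xml
<!-- resource-types.xml: declare the GPU resource type cluster-wide -->
<configuration>
  <property>
    <name>yarn.resource-types</name>
    <value>yarn.io/gpu</value>
  </property>
</configuration>

<!-- yarn-site.xml: enable the GPU resource plugin on each NodeManager -->
<configuration>
  <property>
    <name>yarn.nodemanager.resource-plugins</name>
    <value>yarn.io/gpu</value>
  </property>
</configuration>
```

With this in place, applications such as TonY can request `yarn.io/gpu` alongside memory and vcores, and YARN handles scheduling and isolation.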
  12. How can we do distributed training on YARN?
     • Want to take a program developed on a single machine and run it in distributed mode with little or no modification
     • Want to take advantage of YARN's features
     • Some existing open-source solutions we looked at:
       ○ Kubeflow (Google)
       ○ TensorFlow on Spark (Yahoo!)
       ○ Spark Deep Learning (Databricks)
       ○ ToY: TensorFlow on YARN (Intel)
       ○ XLearning (Qihoo)
       ○ Horovod (Uber)
       ○ YARN Native Service
  13. Pros and cons of open-source solutions
     • Kubeflow (Google)
       ○ Pros: large marketplace of libraries and plugins; active community
       ○ Cons: does not run on Hadoop; very new project (< 1.5 years old)
     • TensorFlow on Spark (Yahoo!)
       ○ Pros: can integrate TensorFlow into Spark programs
       ○ Cons: no GPU isolation; no heterogeneous resource support
     • Spark Deep Learning (Databricks)
       ○ Pros: integrates with Spark; integrates with the Databricks MLflow ecosystem
       ○ Cons: no GPU resource support until Spark 3.0.0 (SPARK-20327); no heterogeneous resource support
     • ToY: TensorFlow on YARN (Intel)
       ○ Pros: lightweight, loosely-coupled system
       ○ Cons: no activity in over 1.5 years
     • XLearning (Qihoo)
       ○ Pros: support for many ML frameworks
       ○ Cons: no GPU isolation support
     • Horovod (Uber)
       ○ Pros: fast MPI communication
       ○ Cons: requires more code modification; no GPU isolation support
     • YARN Native Service
       ○ Pros: no custom application required, native to YARN
       ○ Cons: distributed TensorFlow requires YARN DNS
  14. Ultimately, we decided to build our own solution: TonY
     • Client + YARN application for running distributed ML jobs
     • We started with just TensorFlow support (hence TensorFlow on YARN (TonY))
     • Now we also support PyTorch and are working on R support (so now perhaps Things on YARN is more apt)
     • The client lets you easily launch a job with only a few required arguments:
       ○ Number of workers, parameter servers, GPUs per task
       ○ Source file
       ○ Python virtual environment
     • A custom Azkaban jobtype plugin lets you run TonY jobs in the same workflow as Spark and MapReduce jobs
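To give a feel for those few required arguments, a TonY job configuration looks roughly like the following Hadoop-style XML (property names are taken from the TonY README; treat the exact keys and values as illustrative and check the project repository for the current set):

```xml
<!-- tony.xml: illustrative TonY job configuration -->
<configuration>
  <property>
    <name>tony.worker.instances</name>
    <value>4</value>
  </property>
  <property>
    <name>tony.ps.instances</name>
    <value>1</value>
  </property>
  <property>
    <name>tony.worker.gpus</name>
    <value>1</value>
  </property>
</configuration>
```

The client bundles this configuration together with the training script and the Python virtual environment and submits the whole thing as a YARN application.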
  15. What is TonY?
     • Orchestrates running distributed TensorFlow scripts on Hadoop
     • Acquires compute resources from Hadoop (memory, CPU, GPU)
     • Sets up and launches distributed TensorFlow jobs on Hadoop clusters
     • Manages the application lifecycle
       ○ Fault tolerance
       ○ Job monitoring
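Part of "setting up" a distributed TensorFlow job is telling each task who its peers are. Distributed TensorFlow reads this from the `TF_CONFIG` environment variable; the sketch below shows the shape of that variable (the hostnames and ports are made up for illustration, and this is not TonY's actual code):

```python
import json
import os

def make_tf_config(cluster, job_name, task_index):
    """Build a TF_CONFIG value for one task in a TensorFlow cluster."""
    return json.dumps({
        "cluster": cluster,                            # all tasks, by role
        "task": {"type": job_name, "index": task_index},  # this task's identity
    })

cluster = {
    "ps":     ["ps-0.example.com:2222"],
    "worker": ["worker-0.example.com:2222", "worker-1.example.com:2222"],
}

# The launcher sets TF_CONFIG before starting the user's Python script,
# so the script itself needs no cluster-specific changes.
os.environ["TF_CONFIG"] = make_tf_config(cluster, "worker", 1)
cfg = json.loads(os.environ["TF_CONFIG"])
print(cfg["task"])    # {'type': 'worker', 'index': 1}
```

Because the cluster description is injected from outside, the same training script runs unmodified on one machine or on many.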
  16. TonY Architecture
  17. TonY Architecture: Client
     • Entry point for TonY jobs
     • Packages the user's configurations and model code and submits them as a YARN application
  18. TonY Architecture: Application Master
     • Job setup and lifecycle management
     • Negotiates compute resources from Hadoop
     • Sets up the container environment
     • Launches and monitors containers
  19. TonY Architecture: Task Executor
     • Container = Task Executor
     • Launches the user's provided Python script
     • Heartbeats to the Application Master for liveness
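The liveness heartbeat can be sketched as follows (plain Python, not TonY's implementation; the interval and missed-heartbeat threshold are made-up values, and in TonY the heartbeat is an RPC from the Task Executor to the Application Master rather than a local method call):

```python
import time
import threading

HEARTBEAT_INTERVAL = 0.05   # seconds; illustrative, TonY's interval is configurable
MISSED_LIMIT = 5            # hypothetical: declare a task dead after 5 missed beats

class ApplicationMaster:
    """Tracks the last heartbeat time per task and flags silent tasks."""
    def __init__(self):
        self.last_seen = {}

    def heartbeat(self, task_id):
        self.last_seen[task_id] = time.monotonic()

    def dead_tasks(self):
        now = time.monotonic()
        return [t for t, seen in self.last_seen.items()
                if now - seen > MISSED_LIMIT * HEARTBEAT_INTERVAL]

def task_executor(am, task_id, stop):
    """Heartbeat loop; in TonY this runs alongside the user's Python script."""
    while not stop.is_set():
        am.heartbeat(task_id)   # stand-in for an RPC to the Application Master
        time.sleep(HEARTBEAT_INTERVAL)

am = ApplicationMaster()
stop = threading.Event()
t = threading.Thread(target=task_executor, args=(am, "worker-0", stop))
t.start()
time.sleep(0.2)
print(am.dead_tasks())          # [] while the executor is still heartbeating
stop.set()
t.join()
```

When a task stops heartbeating, the Application Master can treat the worker as failed and trigger the fault-tolerance handling described later.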
  20. Scaling distributed TensorFlow on Hadoop
     • Hadoop is aware of GPU resources
     • Ensures GPU resource isolation and scheduling
  21. Scaling distributed TensorFlow on Hadoop
     • TensorBoard support before:
  22. Scaling distributed TensorFlow on Hadoop
     • TensorBoard support before:
       ○ After training, copy log files to a local machine
  23. Scaling distributed TensorFlow on Hadoop
     • TensorBoard support before:
       ○ After training, copy log files to a local machine
       ○ Start a local TensorBoard instance pointing to the local log files
  24. Scaling distributed TensorFlow on Hadoop
     • TensorBoard support with TonY:
       ○ Directly access TensorBoard with one click while the job is running
  25. Scaling distributed TensorFlow on Hadoop
     • Fault tolerance
     • More workers = more failures
  26. Scaling distributed TensorFlow on Hadoop
     • Fault tolerance
     • More workers = more failures
     • The first attempt periodically saves model checkpoints to HDFS
  27. Scaling distributed TensorFlow on Hadoop
     • Fault tolerance
     • More workers = more failures
     • The first attempt periodically saves model checkpoints to HDFS
     • Worker failure -> tear down and restart the application
  28. Scaling distributed TensorFlow on Hadoop
     • Fault tolerance
     • More workers = more failures
     • The first attempt periodically saves model checkpoints to HDFS
     • Worker failure -> tear down and restart the application
     • Read checkpoints from HDFS and resume from where the previous attempt left off
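The checkpoint-and-resume pattern on the slides above can be sketched in plain Python (this is not TensorFlow or TonY code; a local JSON file stands in for the HDFS checkpoint, and the step counter stands in for real model state):

```python
import json
import os

CKPT = "checkpoint.json"            # stand-in for an HDFS checkpoint path

def load_checkpoint():
    """Resume from the last saved state, or start fresh at step 0."""
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)
    return {"step": 0, "weights": 0.0}

def save_checkpoint(state):
    with open(CKPT, "w") as f:
        json.dump(state, f)

def train(total_steps, crash_at=None):
    state = load_checkpoint()       # resume where the previous attempt left off
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("worker died")   # simulated worker failure
        state["weights"] += 0.1     # placeholder for one real training step
        state["step"] += 1
        if state["step"] % 10 == 0:
            save_checkpoint(state)  # periodic checkpoint, as on the slide
    save_checkpoint(state)
    return state

try:
    train(100, crash_at=57)         # first attempt fails mid-training
except RuntimeError:
    pass
state = train(100)                  # restarted attempt resumes from step 50
print(state["step"])                # 100
os.remove(CKPT)
```

Because the checkpoint lives on shared storage (HDFS), the restarted application can pick it up on any machine, which is what makes "tear down and restart" an acceptable failure-handling strategy.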
  29. TonY is open source
     • https://github.com/linkedin/TonY
       ○ Pull requests welcome!
     • Engineering blog post: https://engineering.linkedin.com/blog/2018/09/open-sourcing-tony--native-support-of-tensorflow-on-hadoop
  30. TonY next steps
     • TonY History Server for viewing past job executions
     • Collecting metrics and integrating with Dr. Elephant to provide tuning recommendations
     • TonY runtime for Submarine (just released in Hadoop 3.2)
  31. Thank you!
