TonY: TensorFlow on YARN and Beyond
Hadoop Contributors Meetup, LinkedIn, 1/30/19
Jeff Weiner
Chief Executive Officer
Anthony Hsu Jonathan Hung
Machine Learning at LinkedIn
People You May Know
Job Recommendations
News Feed
LinkedIn Learning Recommendations
Challenges of using Machine Learning
[Figure: system diagram containing an "ML Code" box, from https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf]
Machine Learning process
• ML process has many parts:
○ Data Ingestion
○ Data Preparation
○ Feature Extraction
○ Model Development
○ Model Training
○ Model Deployment
○ Model Serving
• At LinkedIn, we have a Productive ML (Pro-ML) initiative to accelerate this loop. We have teams working on every part of the ML pipeline.
• TonY's focus is on model development and training.
Model development and training challenges
• Want to interactively explore the data and quickly experiment with different models
• Want to take code developed on a single machine and scale it up to run on a cluster of machines
Interactive model development
• Popular solution: notebooks!
• At LinkedIn, we use Jupyter notebooks
• Challenge: cannot directly access HDFS data from a dev box due to security restrictions
• Previous solution: install on your own machine and run TensorFlow inside Spark jobs via the Livy service
• Error-prone, with timeout issues; we really want to run TensorFlow Python programs, not Spark jobs!
Notebooks on demand with TonY
• We provide our users with a prebuilt Python PEX file with common libraries already installed
• TonY lets users spin up this notebook in a container in our Hadoop cluster.
○ Can request GPU resources, too
Scaling up training
• To train a complex model on large amounts of data, need to:
○ Parallelize the training (common strategy is data parallelism)
○ Run on multiple machines (need fault tolerance strategy)
○ Use GPUs (to accelerate computations; many models are computation-bound, not IO-bound)
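The data parallelism mentioned above can be illustrated with a small single-process simulation. This is only a sketch under simplifying assumptions: the one-parameter model, the function names, and the toy data are all made up for illustration, and real frameworks run the per-shard work on separate machines or GPUs.

```python
"""Synchronous data parallelism, simulated in one process (illustrative only)."""


def shard(data, num_workers):
    # Round-robin split so every simulated worker gets a near-equal share.
    return [data[i::num_workers] for i in range(num_workers)]


def local_gradient(w, batch):
    # Gradient of mean squared error for the model y = w * x on one shard.
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def train(data, num_workers=4, lr=0.01, steps=200):
    w = 0.0
    shards = shard(data, num_workers)
    for _ in range(steps):
        # Each "worker" computes a gradient on its own shard of the data...
        grads = [local_gradient(w, s) for s in shards]
        # ...and the averaged gradient updates the shared parameter.
        w -= lr * sum(grads) / len(grads)
    return w


# Toy data generated from y = 3x, so training should recover w close to 3.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
w = train(data)
print(round(w, 3))
```

Every worker holds the same parameter and sees a different slice of the data; only gradients are exchanged, which is why the approach scales with the number of workers.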
New Hadoop YARN features useful for ML
• GPU resource support added in Hadoop 3.x
• Docker container support productionized in Hadoop 3.x
• Submarine deep learning CLI in Hadoop 3.2 (released last week)
How can we do distributed training on YARN?
• Want to take a program developed on a single machine and run it in distributed mode with little or no modifications
• Want to take advantage of YARN's features
• Some existing open-source solutions we looked at:
○ Kubeflow (Google)
○ TensorFlow on Spark (Yahoo!)
○ Spark Deep Learning (Databricks)
○ ToY: TensorFlow on YARN (Intel)
○ XLearning (Qihoo)
○ Horovod (Uber)
○ YARN Native Service
Pros and Cons of open-source solutions
• Kubeflow (Google)
○ Pros: large marketplace of libraries and plugins; active community
○ Cons: does not run on Hadoop; very new project (< 1.5 years old)
• TensorFlow on Spark (Yahoo!)
○ Pros: can integrate TensorFlow into Spark programs
○ Cons: no GPU isolation; no heterogeneous resource support
• Spark Deep Learning (Databricks)
○ Pros: integrates with Spark; integrates with Databricks MLflow ecosystem
○ Cons: no GPU resource support until Spark 3.0.0 (SPARK-20327); no heterogeneous resource support
• ToY: TensorFlow on YARN (Intel)
○ Pros: lightweight, loosely-coupled system
○ Cons: no activity in over 1.5 years
• XLearning (Qihoo)
○ Pros: support for many ML frameworks
○ Cons: no GPU isolation support
• Horovod (Uber)
○ Pros: fast MPI communication
○ Cons: requires more code modification; no GPU isolation support
• YARN Native Service
○ Pros: no custom application required, native to YARN
○ Cons: distributed TensorFlow requires YARN DNS
Ultimately, we decided to build our own solution: TonY
• Client + YARN application for running distributed ML jobs
• We started with just TensorFlow support (hence TensorFlow on YARN (TonY))
• Now we also support PyTorch and are working on R support (so now perhaps Things on YARN is more apt)
• Client lets you easily launch a job with only a few required arguments:
○ Number of workers, parameter servers, GPUs per task
○ Source file
○ Python virtual environment
• Custom Azkaban jobtype plugin lets you run TonY jobs in the same workflow as Spark and MapReduce jobs
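To make the "few required arguments" concrete, here is a rough sketch of how such a job description could be captured and rendered as a Hadoop-style XML configuration. The `JobSpec` class is hypothetical, and the property names (`tony.worker.instances` and so on) are assumptions made for illustration; the TonY repository is the place to check the actual keys and client flags.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class JobSpec:
    """Hypothetical container for the arguments listed on the slide."""
    workers: int
    parameter_servers: int
    gpus_per_task: int
    source_file: str
    python_venv: str

    def validate(self):
        if self.workers < 1:
            raise ValueError("need at least one worker")
        if self.parameter_servers < 0 or self.gpus_per_task < 0:
            raise ValueError("counts must be non-negative")

    def to_hadoop_xml(self):
        # Property names below are assumed for illustration only.
        props = {
            "tony.worker.instances": self.workers,
            "tony.ps.instances": self.parameter_servers,
            "tony.worker.gpus": self.gpus_per_task,
        }
        root = ET.Element("configuration")
        for name, value in props.items():
            prop = ET.SubElement(root, "property")
            ET.SubElement(prop, "name").text = name
            ET.SubElement(prop, "value").text = str(value)
        return ET.tostring(root, encoding="unicode")


# Example values are placeholders, not taken from the talk.
spec = JobSpec(workers=4, parameter_servers=1, gpus_per_task=1,
               source_file="train.py", python_venv="venv.zip")
spec.validate()
conf_xml = spec.to_hadoop_xml()
print(conf_xml)
```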
What is TonY?
• Orchestrates running distributed TensorFlow scripts on Hadoop
• Acquires compute resources from Hadoop (memory, CPU, GPU)
• Sets up and launches distributed TensorFlow jobs on Hadoop clusters
• Manages application lifecycle
○ Fault tolerance
○ Job monitoring
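Setting up a distributed TensorFlow job largely means telling each task where its peers are. One common way to do that is TensorFlow's `TF_CONFIG` environment variable, a JSON document with a `cluster` map and a `task` entry. The sketch below builds that value for one task; the helper name and the host:port values are made up, and the slides do not say that TonY uses exactly this mechanism.

```python
import json


def build_tf_config(cluster, task_type, task_index):
    """Return the JSON string for TF_CONFIG for one task in the cluster."""
    if task_type not in cluster:
        raise ValueError(f"unknown task type: {task_type}")
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError(f"no {task_type} task with index {task_index}")
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })


# Placeholder addresses; an orchestrator would fill these in after
# containers have been allocated and have reported their ports.
cluster = {
    "ps": ["host0:2222"],
    "worker": ["host1:2222", "host2:2222"],
}

# Environment that would be handed to the second worker's process.
env_for_worker_1 = {"TF_CONFIG": build_tf_config(cluster, "worker", 1)}
print(env_for_worker_1["TF_CONFIG"])
```

Because every task receives the same `cluster` map and only the `task` entry differs, the user's training script can stay identical across workers and parameter servers.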
TonY Architecture
• Client
○ Entry point for TonY jobs
○ Packages the user's configurations and model code and submits them as a YARN application
• Application Master
○ Job setup and lifecycle management
○ Negotiates compute resources from Hadoop
○ Sets up container environment
○ Launches and monitors containers
• Task Executor (one per container)
○ Launches the user-provided Python script
○ Heartbeats to the Application Master for liveness
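The heartbeat-based liveness check between task executors and the Application Master can be sketched as below. This is not TonY's code: the `HeartbeatMonitor` class, the interval, and the missed-heartbeat threshold are hypothetical, and the clock is injected so the example runs without sleeping.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat per task and reports tasks that went silent."""

    def __init__(self, interval_s, max_missed, clock):
        # A task is considered dead after max_missed intervals with no heartbeat.
        self.timeout_s = interval_s * max_missed
        self.clock = clock
        self.last_seen = {}

    def register(self, task_id):
        self.last_seen[task_id] = self.clock()

    def heartbeat(self, task_id):
        self.last_seen[task_id] = self.clock()

    def dead_tasks(self):
        now = self.clock()
        return sorted(
            task for task, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


# Fake clock so the timeline is explicit.
now = [0.0]
monitor = HeartbeatMonitor(interval_s=1.0, max_missed=5, clock=lambda: now[0])
monitor.register("worker:0")
monitor.register("worker:1")

now[0] = 4.0
monitor.heartbeat("worker:0")  # worker:1 stays silent

now[0] = 7.0
dead = monitor.dead_tasks()
print(dead)
```

In a real system the monitor would run inside the coordinating process, and a non-empty result would trigger whatever recovery policy the job uses.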
Scaling distributed TensorFlow on Hadoop
• Hadoop is aware of GPU resources
• Ensures GPU resource isolation and scheduling
• TensorBoard support before:
○ After training, copy log files to local
○ Start local TensorBoard instance pointing to local log files
• TensorBoard support with TonY:
○ Directly access TensorBoard while running with one click
• Fault tolerance
○ More workers = more failures
○ First attempt periodically saves model checkpoints to HDFS
○ Worker failure -> tear down and restart application
○ Read checkpoints from HDFS, resume from where previous attempt left off
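The checkpoint-and-restart strategy can be sketched in a few lines. This is a simplified illustration, not TonY's implementation: a local temporary directory stands in for HDFS, the "training step" is a trivial counter update, the failure is simulated, and all function names are hypothetical.

```python
import json
import os
import tempfile


def latest_checkpoint(ckpt_dir):
    """Load the checkpoint with the highest step number, or None if none exist."""
    names = [n for n in os.listdir(ckpt_dir)
             if n.startswith("ckpt-") and n.endswith(".json")]
    if not names:
        return None
    newest = max(names, key=lambda n: int(n[len("ckpt-"):-len(".json")]))
    with open(os.path.join(ckpt_dir, newest)) as f:
        return json.load(f)


def run_attempt(ckpt_dir, total_steps, save_every, fail_at=None):
    """One application attempt: resume from the latest checkpoint, then train."""
    state = latest_checkpoint(ckpt_dir) or {"step": 0, "weight": 0.0}
    start, step, weight = state["step"], state["step"], state["weight"]
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated worker failure")
        weight += 0.5  # stand-in for one real training step
        step += 1
        if step % save_every == 0:
            with open(os.path.join(ckpt_dir, f"ckpt-{step}.json"), "w") as f:
                json.dump({"step": step, "weight": weight}, f)
    return {"resumed_from": start, "step": step, "weight": weight}


def train_with_restarts(ckpt_dir, total_steps=10, save_every=3,
                        fail_at=7, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            # Only the first attempt hits the simulated failure.
            result = run_attempt(ckpt_dir, total_steps, save_every,
                                 fail_at if attempt == 0 else None)
            return attempt, result
        except RuntimeError:
            continue  # tear down and start a fresh attempt
    raise RuntimeError("all attempts failed")


with tempfile.TemporaryDirectory() as ckpt_dir:
    attempt, result = train_with_restarts(ckpt_dir)
print(attempt, result)
```

The second attempt picks up at the last saved step rather than at step 0, so the work lost to a failure is bounded by the checkpoint interval.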
TonY is open source
• https://github.com/linkedin/TonY
○ Pull requests welcome!
• Engineering blog post: https://engineering.linkedin.com/blog/2018/09/open-sourcing-tony--native-support-of-tensorflow-on-hadoop
TonY next steps
• TonY History Server for viewing past job executions
• Collecting metrics and integrating with Dr. Elephant to provide tuning recommendations
• TonY runtime for Submarine (just released in Hadoop 3.2)
Thank you!

Hadoop Meetup Jan 2019 - TonY: TensorFlow on YARN and Beyond

  • 1. TonY: TensorFlow on YARN and Beyond Hadoop Contributors Meetup, LinkedIn, 1/30/19 Jeff Weiner Chief Executive Officer Anthony Hsu Jonathan Hung
  • 2. Machine Learning at LinkedIn People You May Know Job Recommendations News Feed LinkedIn Learning Recommendations 2
  • 3. Challenges of using Machine Learning https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf ML Code 3
  • 4. Machine Learning process • ML process has many parts 4 Data Ingestion Data Preparation Feature Extraction Model Development Model Training Model Deployment Model Serving
  • 5. Machine Learning process • ML process has many parts • At LinkedIn, we have a Productive ML (Pro-ML) initiative to accelerate this loop. We have teams working on every part of the ML pipeline. 5 Data Ingestion Data Preparation Feature Extraction Model Development Model Training Model Deployment Model Serving
  • 6. Machine Learning process • ML process has many parts • At LinkedIn, we have a Productive ML (Pro-ML) initiative to accelerate this loop. We have teams working on every part of the ML pipeline. • TonY's focus is on model development and training. 6 Data Ingestion Data Preparation Feature Extraction Model Development Model Training Model Deployment Model Serving
  • 7. Model development and training challenges • Want to interactively explore the data and quickly experiment with different models • Want to take code developed on single machine and scale up to run on a cluster of machines 7
  • 8. Interactive model development • Popular solution: notebooks! • At LinkedIn, we use Jupyter notebooks • Challenge: cannot directly access HDFS data from dev box due to security restrictions • Previous solution: you install on your machine, run TensorFlow inside Spark jobs via Livy service • Error prone, timeout issues; we really want to run TensorFlow Python programs, not Spark jobs! 8
  • 9. Notebooks on demand with TonY • We provide our users with a prebuilt Python pex file with common libraries already installed • TonY lets users spin up this notebook in a container in our Hadoop cluster. ○ Can request GPU resources, too 9
  • 10. Scaling up training • To train a complex model on large amounts of data, need to: ○ Parallelize the training (common strategy is data parallelism) ○ Run on multiple machines (need fault tolerance strategy) ○ Use GPUs (to accelerate computations – many models are computation-bound, not IO-bound) 10
  • 11. New Hadoop YARN features useful for ML • GPU resource support added in Hadoop 3.x • Docker container support productionized in Hadoop 3.x • Submarine deep learning CLI in Hadoop 3.2 (released last week) 11
  • 12. How can we do distributed training on YARN? • Want to take a program developed on a single machine and run it in distributed mode with little or no modifications • Want to take advantage of YARN's features • Some existing open-source solutions we looked at: ○ Kubeflow (Google) ○ TensorFlow on Spark (Yahoo!) ○ Spark Deep Learning (Databricks) ○ ToY: TensorFlow on YARN (Intel) ○ XLearning (Qihoo) ○ Horovod (Uber) ○ YARN Native Service 12
  • 13. Pros and cons of open-source solutions
    ○ Kubeflow (Google). Pros: large marketplace of libraries and plugins; active community. Cons: does not run on Hadoop; very new project (< 1.5 years old).
    ○ TensorFlow on Spark (Yahoo!). Pros: can integrate TensorFlow into Spark programs. Cons: no GPU isolation; no heterogeneous resource support.
    ○ Spark Deep Learning (Databricks). Pros: integrates with Spark; integrates with Databricks MLflow ecosystem. Cons: no GPU resource support until Spark 3.0.0 (SPARK-20327); no heterogeneous resource support.
    ○ ToY: TensorFlow on YARN (Intel). Pros: lightweight, loosely-coupled system. Cons: no activity in over 1.5 years.
    ○ XLearning (Qihoo). Pros: support for many ML frameworks. Cons: no GPU isolation support.
    ○ Horovod (Uber). Pros: fast MPI communication. Cons: requires more code modification; no GPU isolation support.
    ○ YARN Native Service. Pros: no custom application required, native to YARN. Cons: distributed TensorFlow requires YARN DNS.
  • 14. Ultimately, we decided to build our own solution: TonY • Client + YARN application for running distributed ML jobs • We started with just TensorFlow support (hence TensorFlow on YARN (TonY)) • Now we also support PyTorch and are working on R support (so now perhaps Things on YARN is more apt) • Client lets you easily launch a job with only a few required arguments: ● Number of workers, parameter servers, GPUs per task ● Source file ● Python virtual environment • Custom Azkaban jobtype plugin lets you run TonY jobs in the same workflow as Spark and MapReduce jobs 14
  • 15. What is TonY? • Orchestrates running distributed TensorFlow scripts on Hadoop • Acquires compute resources from Hadoop (memory, CPU, GPU) • Sets up and launches distributed TensorFlow jobs on Hadoop clusters • Manages application lifecycle ○ Fault tolerance ○ Job monitoring 15
  • 17. TonY Architecture • Entry point for TonY jobs • Package user’s configurations, user’s model code and submit as YARN application 17
  • 18. TonY Architecture • Job setup and lifecycle management • Negotiates compute resources from Hadoop • Sets up container environment • Launches and monitors containers 18
  • 19. TonY Architecture • Container = Task Executor • Launches user’s provided python script • Heartbeats to Application Master for liveness 19
  • 20. Scaling distributed TensorFlow on Hadoop • Hadoop is aware of GPU resources • Ensures GPU resource isolation and scheduling 20
  • 21. Scaling distributed TensorFlow on Hadoop • TensorBoard support before: 21
  • 22. Scaling distributed TensorFlow on Hadoop • TensorBoard support before: ○ After training, copy log files to local 22
  • 23. Scaling distributed TensorFlow on Hadoop • TensorBoard support before: ○ After training, copy log files to local ○ Start local tensorboard instance pointing to local log files 23
  • 24. Scaling distributed TensorFlow on Hadoop • TensorBoard support with TonY: ○ Directly access TensorBoard while running with one click 24
  • 25. Scaling distributed TensorFlow on Hadoop • Fault tolerance • More workers = more failures 25
  • 26. Scaling distributed TensorFlow on Hadoop • Fault tolerance • More workers = more failures • First attempt periodically saves model checkpoints to HDFS 26
  • 27. Scaling distributed TensorFlow on Hadoop • Fault tolerance • More workers = more failures • First attempt periodically saves model checkpoints to HDFS • Worker failure -> tear down and restart application 27
  • 28. Scaling distributed TensorFlow on Hadoop • Fault tolerance • More workers = more failures • First attempt periodically saves model checkpoints to HDFS • Worker failure -> tear down and restart application • Read checkpoints from HDFS, resume from where previous attempt left off 28
  • 29. TonY is open source • https://github.com/linkedin/TonY ○ Pull requests welcome! • Engineering blog post: https://engineering.linkedin.com/blog/2018/09/open-sourcing-tony--native-support-of-tensorflow-on-hadoop 29
  • 30. TonY next steps • TonY History Server for viewing past job executions • Collecting metrics and integrating with Dr. Elephant to provide tuning recommendations • TonY runtime for Submarine (just released in Hadoop 3.2) 30
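
The fault-tolerance flow on slides 25 to 28 (save checkpoints periodically, tear down on worker failure, restart, resume from the last checkpoint) can be sketched roughly as follows. This is a minimal illustration, not TonY's implementation: a local temporary directory stands in for HDFS, a single float stands in for the model, and every name here (`load_checkpoint`, `save_checkpoint`, `run_attempt`, `train_with_restarts`, `fail_at_step`) is hypothetical.

```python
import json
import os
import tempfile


def load_checkpoint(ckpt_dir):
    """Return the last saved training state, or a fresh state if none exists."""
    path = os.path.join(ckpt_dir, "checkpoint.json")
    if not os.path.exists(path):
        return {"step": 0, "weight": 0.0}
    with open(path) as f:
        return json.load(f)


def save_checkpoint(ckpt_dir, state):
    """Write to a temp file, then rename, so a reader never sees a partial file."""
    path = os.path.join(ckpt_dir, "checkpoint.json")
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def run_attempt(ckpt_dir, total_steps, save_every, fail_at_step=None):
    """One application attempt: resume from the checkpoint and train to completion."""
    state = load_checkpoint(ckpt_dir)
    while state["step"] < total_steps:
        if fail_at_step is not None and state["step"] == fail_at_step:
            # Hypothetical stand-in for a worker crashing mid-training.
            raise RuntimeError("simulated worker failure")
        state["weight"] += 0.5  # stand-in for one optimizer update
        state["step"] += 1
        if state["step"] % save_every == 0:
            save_checkpoint(ckpt_dir, state)
    save_checkpoint(ckpt_dir, state)
    return state


def train_with_restarts(ckpt_dir, total_steps, save_every, fail_at_step, max_attempts=3):
    """Restart the whole attempt after a failure; only the first attempt fails here."""
    for attempt in range(max_attempts):
        try:
            inject = fail_at_step if attempt == 0 else None
            state = run_attempt(ckpt_dir, total_steps, save_every, inject)
            return state, attempt + 1
        except RuntimeError:
            continue  # tear down and start a new attempt from the last checkpoint
    raise RuntimeError("all attempts failed")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as ckpt_dir:
        final_state, attempts = train_with_restarts(
            ckpt_dir, total_steps=10, save_every=3, fail_at_step=7
        )
        print(final_state, attempts)
```

In this sketch the first attempt fails at step 7, so the second attempt resumes from the step-6 checkpoint and the work done between steps 6 and 7 is repeated. The checkpoint interval therefore bounds how much training is lost per failure, which matters more as the worker count (and so the failure rate) grows.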