SlideShare a Scribd company logo
1 of 54
Download to read offline
Anthony Hsu
Staff Software Engineer
Scaling Deep Learning on Hadoop
at LinkedIn
DataWorks Summit, Washington, D.C., May 23, 2019
About Me: Anthony Hsu
• https://www.linkedin.com/in/erwaman/
• Staff Software Engineer at LinkedIn working on the Hadoop Dev team
• Been working in the Hadoop space for 5.5 years on workflow scheduling (Azkaban),
dataset access (Dali), machine learning infra (TonY, this talk)
LinkedIn's Vision
Create economic opportunity
for every member of the global workforce
630M
Members
30M
Companies
20M
Jobs
50K
Skills
90K
Schools
Machine Learning at LinkedIn
People You May Know
Job Recommendations
News Feed
LinkedIn Learning Recommendations
4
Why Deep Learning?
5
Building AI Applications Using Deep Learning
https://blog.easysol.net/building-ai-applications/
• Prediction accuracy of traditional ML
models tends to plateau quickly as data
increases
• Deep networks continue to improve as
data increases
Which framework to use?
6
Andrej Karpathy, Director of AI at Tesla
https://twitter.com/karpathy/status/972295865187512320
Machine Learning process
• ML process has many parts
7
Data Ingestion
Data Preparation
Model Training
Model Deployment
Model Serving
Machine Learning process
• ML process has many parts
• At LinkedIn, we have a Productive
ML (Pro-ML) initiative to accelerate
this loop. We have teams working
on every part of the ML pipeline.
8
Data Ingestion
Data Preparation
Model Training
Model Deployment
Model Serving
Machine Learning process
• ML process has many parts
• At LinkedIn, we have a Productive
ML (Pro-ML) initiative to accelerate
this loop. We have teams working
on every part of the ML pipeline.
• This talk will focus on model
training.
9
Data Ingestion
Data Preparation
Model Training
Model Deployment
Model Serving
Early days: how AI engineers did training
• Copy code and
dependencies to each
host
• Manually specify host
and port of each process
• Customize arguments for
each process
10
# On ps0.example.com:
$ python trainer.py 
--ps_hosts=ps0.example.com:2222,ps1.example.com:2222 
--worker_hosts=worker0.example.com:2222,worker1.example.com:2222 
--job_name=ps --task_index=0
# On ps1.example.com:
$ python trainer.py 
--ps_hosts=ps0.example.com:2222,ps1.example.com:2222 
--worker_hosts=worker0.example.com:2222,worker1.example.com:2222 
--job_name=ps --task_index=1
# On worker0.example.com:
$ python trainer.py 
--ps_hosts=ps0.example.com:2222,ps1.example.com:2222 
--worker_hosts=worker0.example.com:2222,worker1.example.com:2222 
--job_name=worker --task_index=0
# On worker1.example.com:
$ python trainer.py 
--ps_hosts=ps0.example.com:2222,ps1.example.com:2222 
--worker_hosts=worker0.example.com:2222,worker1.example.com:2222 
--job_name=worker --task_index=1
Source: https://github.com/tensorflow/examples/blob/master/community/en/docs/deploy/distributed.md
Challenges of scaling up training
• Managing code and dependencies
• Orchestrating distributed training
• Resource contention (especially for GPUs)
• Managing an ML workflow (data preparation, training, deployment)
• Fault tolerance
11
E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to
allocate 693.00M (726663168 bytes) from device:
CUDA_ERROR_OUT_OF_MEMORY: out of memory
Existing YARN features to leverage
• YARN is Hadoop's scheduler
12
Existing YARN features to leverage
• YARN is Hadoop's scheduler
• YARN supports
○ GPU resources and other resource types
13
Existing YARN features to leverage
• YARN is Hadoop's scheduler
• YARN supports
○ GPU resources and other resource types
○ Team-based and hierarchical queues
14
Existing YARN features to leverage
• YARN is Hadoop's scheduler
• YARN supports
○ GPU resources and other resource types
○ Team-based and hierarchical queues
○ Elasticity between queues
15
Existing YARN features to leverage
• YARN is Hadoop's scheduler
• YARN supports
○ GPU resources and other resource types
○ Team-based and hierarchical queues
○ Elasticity between queues
○ User-based limits
16
New and upcoming YARN features useful for ML
• Docker container support productionized in Hadoop 3.x
• YARN Native Service in Hadoop 3.x
• Submarine ML CLI released in Hadoop 3.2.0, now its own Hadoop subproject
17
How can we do distributed training on YARN?
• Want to take a program developed on a single machine and run it in distributed mode
with little or no modifications
• Want to take advantage of YARN's features
• Some existing open-source solutions we looked at:
○ Kubeflow (Google)
○ TensorFlow on Spark (Yahoo!)
○ Spark Deep Learning (Databricks)
○ TOY: TensorFlow on YARN (Intel)
○ XLearning (Qihoo)
○ Horovod (Uber)
○ YARN Native Service (in Hadoop 3.x)
18
Kubeflow + Kubernetes
• Kubeflow is an ML toolkit built on Kubernetes
○ Has a rich ecosystem and active community
• Kubernetes is one of the most popular cluster managers
• Challenges in adopting Kubernetes at LinkedIn
○ Large investment in YARN
■ Many clusters of 1000s of nodes (our largest is ~6000)
■ Expertise and tooling for YARN
○ Scalability: "No more than 5000 nodes"
(https://kubernetes.io/docs/setup/cluster-large/)
○ Need to integrate with Hadoop security (Kerberos and Hadoop delegation tokens)
○ Lack of hierarchical namespaces 19
Spark-based solutions
• TensorFlow on Spark (Yahoo!)
• Spark Deep Learning (Databricks)
• Pros
○ Integrates well with native Spark processing
• Cons
○ GPU resource requests not supported until Spark 3.0 (SPARK-20327)
○ No heterogeneous resource support (e.g.: more memory + GPUs for workers, less
memory + only CPUs for parameter servers)
20
YARN-native solutions
• TOY: TensorFlow on YARN (Intel)
• XLearning (Qihoo)
• Pros
○ Works with YARN out-of-the-box
• Cons
○ No GPU resource support
21
Horovod
• Horovod (Uber)
• Wraps existing optimizer to allow synchronous distributed training
• Works with many frameworks (TensorFlow, PyTorch, Keras, MXNet)
• Uses MPI or NCCL for communication
○ Multi-node MPI on YARN requires Docker containers running sshd daemons
22
YARN Native Service
• YARN Native Service (available in Hadoop 3.x)
• Configure distributed training jobs via XML, YAML, or JSON config file
• Distributed TensorFlow requires deploying YARN DNS Registry and ZooKeeper
• Relatively new, LinkedIn is still on Hadoop 2.x
23
Summary of open-source solutions
Open-source solution Pros Cons
Kubeflow / Kubernetes (Google) ● Large marketplace of libraries and plugins
● Active community
● Does not run on Hadoop
● May not scale to very large clusters
TensorFlow on Spark (Yahoo!)
Spark Deep Learning (Databricks)
● Integrates with Spark ● No GPU resource support until Spark 3.0
(SPARK-20327)
● No heterogeneous resource support
TOY: TensorFlow on YARN (Intel)
XLearning (Qihoo)
● YARN native, works out-of-the-box ● No GPU resource support
Horovod (Uber) ● Supports synchronous distributed training ● MPI on YARN requires Docker
YARN Native Service ● YARN native ● Distributed TensorFlow requires YARN DNS
Registry and ZooKeeper
24
Building our own solution: TonY
• TonY is a YARN application for running distributed ML jobs
• We started with TensorFlow support (hence TensorFlow on YARN (TonY))
• Now we also support PyTorch and Horovod (so perhaps Things on YARN is more apt)
25
A Comparison of MapReduce, Spark, and TonY
26
Map
task
Map
task
Map
task
Reduce
task
Reduce
task
Spark
executor
Spark
executor
Spark
executor
Spark
executor
Foo
task
Foo
task
Foo
task
Bar
task
Bar
task
Qux
task
MapReduce
• 2 task types
• Map tasks connected
to Reduce tasks
Spark
• 1 task type
• All connected to all
TonY
• N task types
• Heterogeneous connections
Baz
task
TonY supports many different models
27
Scoring
task
Scoring
task
Scoring
task
Scoring
task
Scoring
task
Parallel tasks,
no communication
Worker
task
Worker
task
Worker
task
Parameter
server task
Parameter
server task
Worker + Parameter Server Model
Worker
task
Worker
task
Worker
task
Worker
task
Ring All-Reduce Model
TonY also supports more exotic setups
28
Worker
task
Worker
task
Worker
task
Parameter
server task
Parameter
server task
Worker-PS with chief worker and
evaluator
Chief
worker
task
Evaluator
task
Worker
task
Worker
task
Worker
task
Worker
task
Ring All-Reduce with in-memory
distributed hash table (DHT)
DHT
task
DHT
task
DHT
task
TonY supports multiple frameworks
29
TonY under the hood
30
TonY under the hood
31
TonY Client
YARN
ResourceManager
TonY component
YARN component
TonY under the hood
32
TonY Client
YARN
ResourceManager
TonY
ApplicationMaster
TonY component
YARN component
YARN container
TonY under the hood
33
TonY Client
YARN
ResourceManager
TonY
ApplicationMaster
TonY
Task Executor
TonY
Task Executor
TonY
Task Executor
TonY component
YARN component
YARN container
TonY under the hood
34
TonY Client
YARN
ResourceManager
TonY
ApplicationMaster
TonY
Task Executor
TensorFlow
Worker Task
TonY
Task Executor
TensorFlow
Worker Task
TonY
Task Executor
TensorFlow
Parameter
Server Task
TonY component
TensorFlow component
YARN component
YARN container
TonY under the hood
35
TonY Client
YARN
ResourceManager
TonY
ApplicationMaster
TonY
Task Executor
TensorFlow
Worker Task
TonY
Task Executor
TensorFlow
Worker Task
TonY
Task Executor
TensorFlow
Parameter
Server Task
TonY component
TensorFlow component
YARN component
YARN container
TonY under the hood
36
TonY Client
YARN
ResourceManager
TonY
ApplicationMaster
TonY
Task Executor
TensorFlow
Worker Task
TonY
Task Executor
TensorFlow
Worker Task
TonY
Task Executor
TensorFlow
Parameter
Server Task
TonY component
TensorFlow component
YARN component
YARN container
Related YARN changes
37
Related YARN changes
38
• Backport of GPU support to Hadoop 2.x (YARN-8200)
Related YARN changes
39
• Backport of GPU support to Hadoop 2.x (YARN-8200)
• Support for updating tracking URL (YARN-7974)
○ Contributed to Hadoop 2.x and 3.x
Using TonY
• TonY client lets you easily launch a job with only a few required arguments
40
java -cp `hadoop classpath`:tony-cli-0.3.7-all.jar 
com.linkedin.tony.cli.ClusterSubmitter 
--python_venv=venv.zip 
--python_binary_path=Python/bin/python 
--src_dir=src 
--executes=my_model.py 
--conf_file=tony-test.xml
Using TonY
• For a list of all configurations, see
https://github.com/linkedin/Ton
Y/wiki/TonY-Configurations
41
<configuration>
<property>
<name>tony.worker.instances</name>
<value>3</value>
</property>
<property>
<name>tony.worker.gpus</name>
<value>1</value>
</property>
<property>
<name>tony.ps.instances</name>
<value>1</value>
</property>
</configuration>
• Example configuration file:
Using TonY
$ java ... com.linkedin.tony.cli.ClusterSubmitter ...
...
INFO impl.YarnClientImpl: Submitted application application_XXX
INFO tony.TonyClient: URL to track running application
(will proxy to TensorBoard once it has started): http://...
INFO tony.TonyClient: ResourceManager web address for application: http://...
...
INFO tony.TonyClient: Logs for ps 0 at: http://...
INFO tony.TonyClient: Logs for worker 0 at: http://...
INFO tony.TonyClient: Logs for worker 1 at: http://...
INFO tony.TonyClient: Logs for worker 2 at: http://...
TonY Portal for accessing job events and configs
43
Using TonY to launch notebooks and tools on demand
• TonY can be used to launch
○ Jupyter notebooks
○ TensorBoard
○ MLflow
○ etc.
• Run any Python virtual environment, PEX, or shiv
• Run any Docker image
44
TonY is open-source
• Open-source repo: https://github.com/linkedin/tony
○ Contributions welcome!
• OpML '19 paper: https://arxiv.org/abs/1904.01631 (presented 3 days ago)
• LinkedIn engineering blog post: https://bit.ly/2O6L5WD
45
TonY integrations with other projects
Azkaban workflow scheduler integration
• Azkaban is a workflow
scheduler for Hadoop
• Run TonY jobs inside a
workflow that includes
Spark and other data
processing jobs
47
TonY job tuning recommendations by Dr. Elephant
48
• Dr. Elephant is a
job tuning and
performance
analysis tool for
Hadoop jobs.
Run TonY on Google Cloud DataProc
• DataProc lets you run Hadoop and Spark on Google's Cloud
• TonY setup script for DataProc:
https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/tony
• TonY on DataProc blog post: https://bit.ly/2HEYemT
49
TonY runtime for Hadoop Submarine
• Submarine is a deep learning CLI for Hadoop
• TonY is a supported runtime implementation for Submarine (SUBMARINE-40, in
Submarine 0.2.0)
50
TonY on Microsoft Azure HDInsight (coming soon)
• HDInsight lets you run open-source frameworks on Azure, including Hadoop, Spark,
and Kafka
• TonY integration is coming soon
51
+
Demo
52
• Live demo using TonY Client from CLI
• Video of using TonY job in Azkaban: https://youtu.be/DM89y8BGFaY
Future Work
• GPU metrics + tuning suggestions for Dr. Elephant
• Expand TonY Portal to support launching notebooks, visualization,
and managing experiments
• TonY CLI + Python library
• TonY support on Azure HDInsight
• TonY support for other ML frameworks, schedulers, and cloud services
53
+ ?
Thank you!
54
Questions?

More Related Content

What's hot

How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML PlatformHow to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
Databricks
 
Teradata vs-exadata
Teradata vs-exadataTeradata vs-exadata
Teradata vs-exadata
Louis liu
 

What's hot (20)

Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Top 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQLTop 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQL
 
Introduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibIntroduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlib
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
 
Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML PlatformHow to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
 
Optimizing Apache Spark UDFs
Optimizing Apache Spark UDFsOptimizing Apache Spark UDFs
Optimizing Apache Spark UDFs
 
Rocks db state store in structured streaming
Rocks db state store in structured streamingRocks db state store in structured streaming
Rocks db state store in structured streaming
 
Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot
Building a Real-Time Analytics Application with  Apache Pulsar and Apache PinotBuilding a Real-Time Analytics Application with  Apache Pulsar and Apache Pinot
Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot
 
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn SchedulerCloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
 
Data profiling with Apache Calcite
Data profiling with Apache CalciteData profiling with Apache Calcite
Data profiling with Apache Calcite
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
OrientDB Distributed Architecture v2.0
OrientDB Distributed Architecture v2.0OrientDB Distributed Architecture v2.0
OrientDB Distributed Architecture v2.0
 
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Teradata vs-exadata
Teradata vs-exadataTeradata vs-exadata
Teradata vs-exadata
 

Similar to Scaling Deep Learning on Hadoop at LinkedIn

project--2 nd review_2
project--2 nd review_2project--2 nd review_2
project--2 nd review_2
Aswini Ashu
 
project--2 nd review_2
project--2 nd review_2project--2 nd review_2
project--2 nd review_2
aswini pilli
 

Similar to Scaling Deep Learning on Hadoop at LinkedIn (20)

Hadoop Meetup Jan 2019 - TonY: TensorFlow on YARN and Beyond
Hadoop Meetup Jan 2019 - TonY: TensorFlow on YARN and BeyondHadoop Meetup Jan 2019 - TonY: TensorFlow on YARN and Beyond
Hadoop Meetup Jan 2019 - TonY: TensorFlow on YARN and Beyond
 
Intro to Big Data - Spark
Intro to Big Data - SparkIntro to Big Data - Spark
Intro to Big Data - Spark
 
Introduction to DL platform
Introduction to DL platformIntroduction to DL platform
Introduction to DL platform
 
Node labels in YARN
Node labels in YARNNode labels in YARN
Node labels in YARN
 
Node Labels in YARN
Node Labels in YARNNode Labels in YARN
Node Labels in YARN
 
project--2 nd review_2
project--2 nd review_2project--2 nd review_2
project--2 nd review_2
 
project--2 nd review_2
project--2 nd review_2project--2 nd review_2
project--2 nd review_2
 
Bringing Deep Learning into production
Bringing Deep Learning into production Bringing Deep Learning into production
Bringing Deep Learning into production
 
Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016
 
TonY: Native support of TensorFlow on Hadoop
TonY: Native support of TensorFlow on HadoopTonY: Native support of TensorFlow on Hadoop
TonY: Native support of TensorFlow on Hadoop
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Ncku csie talk about Spark
Ncku csie talk about SparkNcku csie talk about Spark
Ncku csie talk about Spark
 
Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014
 
Large scale computing
Large scale computing Large scale computing
Large scale computing
 
Migre sus bases de datos Oracle a la nube
Migre sus bases de datos Oracle a la nube Migre sus bases de datos Oracle a la nube
Migre sus bases de datos Oracle a la nube
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, and Deep Learnin...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, and Deep Learnin...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, and Deep Learnin...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, and Deep Learnin...
 
Data Con LA 2018 - A Tale of DL Frameworks: TensorFlow, Keras, & Deep Learnin...
Data Con LA 2018 - A Tale of DL Frameworks: TensorFlow, Keras, & Deep Learnin...Data Con LA 2018 - A Tale of DL Frameworks: TensorFlow, Keras, & Deep Learnin...
Data Con LA 2018 - A Tale of DL Frameworks: TensorFlow, Keras, & Deep Learnin...
 
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNYARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
 
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...
 

Recently uploaded

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Recently uploaded (20)

Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 

Scaling Deep Learning on Hadoop at LinkedIn

  • 1. Anthony Hsu Staff Software Engineer Scaling Deep Learning on Hadoop at LinkedIn DataWorks Summit, Washington, D.C., May 23, 2019
  • 2. About Me: Anthony Hsu • https://www.linkedin.com/in/erwaman/ • Staff Software Engineer at LinkedIn working on the Hadoop Dev team • Been working in the Hadoop space for 5.5 years on workflow scheduling (Azkaban), dataset access (Dali), machine learning infra (TonY, this talk)
  • 3. LinkedIn's Vision Create economic opportunity for every member of the global workforce 630M Members 30M Companies 20M Jobs 50K Skills 90K Schools
  • 4. Machine Learning at LinkedIn People You May Know Job Recommendations News Feed LinkedIn Learning Recommendations 4
  • 5. Why Deep Learning? 5 Building AI Applications Using Deep Learning https://blog.easysol.net/building-ai-applications/ • Prediction accuracy of traditional ML models tends to plateau quickly as data increases • Deep networks continue to improve as data increases
  • 6. Which framework to use? 6 Andrej Karpathy, Director of AI at Tesla https://twitter.com/karpathy/status/972295865187512320
  • 7. Machine Learning process • ML process has many parts 7 Data Ingestion Data Preparation Model Training Model Deployment Model Serving
  • 8. Machine Learning process • ML process has many parts • At LinkedIn, we have a Productive ML (Pro-ML) initiative to accelerate this loop. We have teams working on every part of the ML pipeline. 8 Data Ingestion Data Preparation Model Training Model Deployment Model Serving
  • 9. Machine Learning process • ML process has many parts • At LinkedIn, we have a Productive ML (Pro-ML) initiative to accelerate this loop. We have teams working on every part of the ML pipeline. • This talk will focus on model training. 9 Data Ingestion Data Preparation Model Training Model Deployment Model Serving
  • 10. Early days: how AI engineers did training • Copy code and dependencies to each host • Manually specify host and port of each process • Customize arguments for each process 10 # On ps0.example.com: $ python trainer.py --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 --job_name=ps --task_index=0 # On ps1.example.com: $ python trainer.py --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 --job_name=ps --task_index=1 # On worker0.example.com: $ python trainer.py --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 --job_name=worker --task_index=0 # On worker1.example.com: $ python trainer.py --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 --job_name=worker --task_index=1 Source: https://github.com/tensorflow/examples/blob/master/community/en/docs/deploy/distributed.md
  • 11. Challenges of scaling up training • Managing code and dependencies • Orchestrating distributed training • Resource contention (especially for GPUs) • Managing an ML workflow (data preparation, training, deployment) • Fault tolerance 11 E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 693.00M (726663168 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
  • 12. Existing YARN features to leverage • YARN is Hadoop's scheduler 12
  • 13. Existing YARN features to leverage • YARN is Hadoop's scheduler • YARN supports ○ GPU resources and other resource types 13
  • 14. Existing YARN features to leverage • YARN is Hadoop's scheduler • YARN supports ○ GPU resources and other resource types ○ Team-based and hierarchical queues 14
  • 15. Existing YARN features to leverage • YARN is Hadoop's scheduler • YARN supports ○ GPU resources and other resource types ○ Team-based and hierarchical queues ○ Elasticity between queues 15
  • 16. Existing YARN features to leverage • YARN is Hadoop's scheduler • YARN supports ○ GPU resources and other resource types ○ Team-based and hierarchical queues ○ Elasticity between queues ○ User-based limits 16
  • 17. New and upcoming YARN features useful for ML • Docker container support productionized in Hadoop 3.x • YARN Native Service in Hadoop 3.x • Submarine ML CLI released in Hadoop 3.2.0, now its own Hadoop subproject 17
  • 18. How can we do distributed training on YARN? • Want to take a program developed on a single machine and run it in distributed mode with little or no modifications • Want to take advantage of YARN's features • Some existing open-source solutions we looked at: ○ Kubeflow (Google) ○ TensorFlow on Spark (Yahoo!) ○ Spark Deep Learning (Databricks) ○ TOY: TensorFlow on YARN (Intel) ○ XLearning (Qihoo) ○ Horovod (Uber) ○ YARN Native Service (in Hadoop 3.x) 18
  • 19. Kubeflow + Kubernetes • Kubeflow is an ML toolkit built on Kubernetes ○ Has a rich ecosystem and active community • Kubernetes is one of the most popular cluster managers • Challenges in adopting Kubernetes at LinkedIn ○ Large investment in YARN ■ Many clusters of 1000s of nodes (our largest is ~6000) ■ Expertise and tooling for YARN ○ Scalability: "No more than 5000 nodes" (https://kubernetes.io/docs/setup/cluster-large/) ○ Need to integrate with Hadoop security (Kerberos and Hadoop delegation tokens) ○ Lack of hierarchical namespaces 19
  • 20. Spark-based solutions • TensorFlow on Spark (Yahoo!) • Spark Deep Learning (Databricks) • Pros ○ Integrates well with native Spark processing • Cons ○ GPU resource requests not supported until Spark 3.0 (SPARK-20327) ○ No heterogeneous resource support (e.g.: more memory + GPUs for workers, less memory + only CPUs for parameter servers) 20
  • 21. YARN-native solutions • TOY: TensorFlow on YARN (Intel) • XLearning (Qihoo) • Pros ○ Works with YARN out-of-the-box • Cons ○ No GPU resource support 21
  • 22. Horovod • Horovod (Uber) • Wraps existing optimizer to allow synchronous distributed training • Works with many frameworks (TensorFlow, PyTorch, Keras, MXNet) • Uses MPI or NCCL for communication ○ Multi-node MPI on YARN requires Docker containers running sshd daemons 22
  • 23. YARN Native Service • YARN Native Service (available in Hadoop 3.x) • Configure distributed training jobs via XML, YAML, or JSON config file • Distributed TensorFlow requires deploying YARN DNS Registry and ZooKeeper • Relatively new, LinkedIn is still on Hadoop 2.x 23
  • 24. Summary of open-source solutions Open-source solution Pros Cons Kubeflow / Kubernetes (Google) ● Large marketplace of libraries and plugins ● Active community ● Does not run on Hadoop ● May not scale to very large clusters TensorFlow on Spark (Yahoo!) Spark Deep Learning (Databricks) ● Integrates with Spark ● No GPU resource support until Spark 3.0 (SPARK-20327) ● No heterogeneous resource support TOY: TensorFlow on YARN (Intel) XLearning (Qihoo) ● YARN native, works out-of-the-box ● No GPU resource support Horovod (Uber) ● Supports synchronous distributed training ● MPI on YARN requires Docker YARN Native Service ● YARN native ● Distributed TensorFlow requires YARN DNS Registry and ZooKeeper 24
  • 25. Building our own solution: TonY • TonY is a YARN application for running distributed ML jobs • We started with TensorFlow support (hence TensorFlow on YARN (TonY)) • Now we also support PyTorch and Horovod (so perhaps Things on YARN is more apt) 25
  • 26. A Comparison of MapReduce, Spark, and TonY 26 Map task Map task Map task Reduce task Reduce task Spark executor Spark executor Spark executor Spark executor Foo task Foo task Foo task Bar task Bar task Qux task MapReduce • 2 task types • Map tasks connected to Reduce tasks Spark • 1 task type • All connected to all TonY • N task types • Heterogeneous connections Baz task
  • 27. TonY supports many different models 27 Scoring task Scoring task Scoring task Scoring task Scoring task Parallel tasks, no communication Worker task Worker task Worker task Parameter server task Parameter server task Worker + Parameter Server Model Worker task Worker task Worker task Worker task Ring All-Reduce Model
  • 28. TonY also supports more exotic setups 28 Worker task Worker task Worker task Parameter server task Parameter server task Worker-PS with chief worker and evaluator Chief worker task Evaluator task Worker task Worker task Worker task Worker task Ring All-Reduce with in-memory distributed hash table (DHT) DHT task DHT task DHT task
  • 29. TonY supports multiple frameworks 29
  • 30. TonY under the hood 30
  • 31. TonY under the hood 31 TonY Client YARN ResourceManager TonY component YARN component
  • 32. TonY under the hood 32 TonY Client YARN ResourceManager TonY ApplicationMaster TonY component YARN component YARN container
  • 33. TonY under the hood 33 TonY Client YARN ResourceManager TonY ApplicationMaster TonY Task Executor TonY Task Executor TonY Task Executor TonY component YARN component YARN container
  • 34. TonY under the hood 34 TonY Client YARN ResourceManager TonY ApplicationMaster TonY Task Executor TensorFlow Worker Task TonY Task Executor TensorFlow Worker Task TonY Task Executor TensorFlow Parameter Server Task TonY component TensorFlow component YARN component YARN container
  • 35. TonY under the hood 35 TonY Client YARN ResourceManager TonY ApplicationMaster TonY Task Executor TensorFlow Worker Task TonY Task Executor TensorFlow Worker Task TonY Task Executor TensorFlow Parameter Server Task TonY component TensorFlow component YARN component YARN container
  • 36. TonY under the hood 36 TonY Client YARN ResourceManager TonY ApplicationMaster TonY Task Executor TensorFlow Worker Task TonY Task Executor TensorFlow Worker Task TonY Task Executor TensorFlow Parameter Server Task TonY component TensorFlow component YARN component YARN container
  • 38. Related YARN changes 38 • Backport of GPU support to Hadoop 2.x (YARN-8200)
  • 39. Related YARN changes 39 • Backport of GPU support to Hadoop 2.x (YARN-8200) • Support for updating tracking URL (YARN-7974) ○ Contributed to Hadoop 2.x and 3.x
  • 40. Using TonY • TonY client lets you easily launch a job with only a few required arguments 40 java -cp `hadoop classpath`:tony-cli-0.3.7-all.jar com.linkedin.tony.cli.ClusterSubmitter --python_venv=venv.zip --python_binary_path=Python/bin/python --src_dir=src --executes=my_model.py --conf_file=tony-test.xml
  • 41. Using TonY • For a list of all configurations, see https://github.com/linkedin/Ton Y/wiki/TonY-Configurations 41 <configuration> <property> <name>tony.worker.instances</name> <value>3</value> </property> <property> <name>tony.worker.gpus</name> <value>1</value> </property> <property> <name>tony.ps.instances</name> <value>1</value> </property> </configuration> • Example configuration file:
  • 42. Using TonY $ java ... com.linkedin.tony.cli.ClusterSubmitter ... ... INFO impl.YarnClientImpl: Submitted application application_XXX INFO tony.TonyClient: URL to track running application (will proxy to TensorBoard once it has started): http://... INFO tony.TonyClient: ResourceManager web address for application: http://... ... INFO tony.TonyClient: Logs for ps 0 at: http://... INFO tony.TonyClient: Logs for worker 0 at: http://... INFO tony.TonyClient: Logs for worker 1 at: http://... INFO tony.TonyClient: Logs for worker 2 at: http://...
  • 43. TonY Portal for accessing job events and configs 43
  • 44. Using TonY to launch notebooks and tools on demand • TonY can be used to launch ○ Jupyter notebooks ○ TensorBoard ○ MLflow ○ etc. • Run any Python virtual environment, PEX, or shiv • Run any Docker image 44
  • 45. TonY is open-source • Open-source repo: https://github.com/linkedin/tony ○ Contributions welcome! • OpML '19 paper: https://arxiv.org/abs/1904.01631 (presented 3 days ago) • LinkedIn engineering blog post: https://bit.ly/2O6L5WD 45
  • 46. TonY integrations with other projects
  • 47. Azkaban workflow scheduler integration • Azkaban is a workflow scheduler for Hadoop • Run TonY jobs inside a workflow that includes Spark and other data processing jobs 47
  • 48. TonY job tuning recommendations by Dr. Elephant 48 • Dr. Elephant is a job tuning and performance analysis tool for Hadoop jobs.
  • 49. Run TonY on Google Cloud DataProc • DataProc lets you run Hadoop and Spark on Google's Cloud • TonY setup script for DataProc: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/tony • TonY on DataProc blog post: https://bit.ly/2HEYemT 49
  • 50. TonY runtime for Hadoop Submarine • Submarine is a deep learning CLI for Hadoop • TonY is a supported runtime implementation for Submarine (SUBMARINE-40, in Submarine 0.2.0) 50
  • 51. TonY on Microsoft Azure HDInsight (coming soon) • HDInsight lets you run open-source frameworks on Azure, including Hadoop, Spark, and Kafka • TonY integration is coming soon 51 +
  • 52. Demo 52 • Live demo using TonY Client from CLI • Video of using TonY job in Azkaban: https://youtu.be/DM89y8BGFaY
  • 53. Future Work • GPU metrics + tuning suggestions for Dr. Elephant • Expand TonY Portal to support launching notebooks, visualization, and managing experiments • TonY CLI + Python library • TonY support on Azure HDInsight • TonY support for other ML frameworks, schedulers, and cloud services 53 + ?