Anthony Hsu
Staff Software Engineer
Scaling Deep Learning on Hadoop
at LinkedIn
DataWorks Summit, Washington, D.C., May 23, 2019
About Me: Anthony Hsu
• https://www.linkedin.com/in/erwaman/
• Staff Software Engineer at LinkedIn working on the Hadoop Dev team
• Been working in the Hadoop space for 5.5 years on workflow scheduling (Azkaban),
dataset access (Dali), machine learning infra (TonY, this talk)
LinkedIn's Vision
Create economic opportunity
for every member of the global workforce
630M Members
30M Companies
20M Jobs
50K Skills
90K Schools
Machine Learning at LinkedIn
People You May Know
Job Recommendations
News Feed
LinkedIn Learning Recommendations
Why Deep Learning?
Building AI Applications Using Deep Learning
https://blog.easysol.net/building-ai-applications/
• Prediction accuracy of traditional ML
models tends to plateau quickly as data
increases
• Deep networks continue to improve as
data increases
Which framework to use?
Andrej Karpathy, Director of AI at Tesla
https://twitter.com/karpathy/status/972295865187512320
Machine Learning process
• ML process has many parts
• At LinkedIn, we have a Productive ML (Pro-ML) initiative to accelerate this loop.
We have teams working on every part of the ML pipeline.
• This talk will focus on model training.
[Figure: loop of Data Ingestion, Data Preparation, Model Training, Model Deployment, Model Serving]
Early days: how AI engineers did training
• Copy code and dependencies to each host
• Manually specify host and port of each process
• Customize arguments for each process
# On ps0.example.com:
$ python trainer.py \
    --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
    --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
    --job_name=ps --task_index=0
# On ps1.example.com:
$ python trainer.py \
    --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
    --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
    --job_name=ps --task_index=1
# On worker0.example.com:
$ python trainer.py \
    --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
    --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
    --job_name=worker --task_index=0
# On worker1.example.com:
$ python trainer.py \
    --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
    --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
    --job_name=worker --task_index=1
Source: https://github.com/tensorflow/examples/blob/master/community/en/docs/deploy/distributed.md
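To make the manual bookkeeping concrete, here is a minimal sketch of what a script such as trainer.py has to do with those flags before any training starts. Only the four flag names come from the commands above; parse_cluster_args is a hypothetical helper written for illustration and is not taken from the TensorFlow example.

```python
import argparse


def parse_cluster_args(argv):
    """Turn the per-process flags into a cluster description (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--ps_hosts", required=True)
    parser.add_argument("--worker_hosts", required=True)
    parser.add_argument("--job_name", choices=["ps", "worker"], required=True)
    parser.add_argument("--task_index", type=int, required=True)
    args = parser.parse_args(argv)

    # Every process needs the full list of peers for both job types.
    cluster = {
        "ps": args.ps_hosts.split(","),
        "worker": args.worker_hosts.split(","),
    }

    # A wrong index is easy to pass by hand, so check it explicitly.
    if not 0 <= args.task_index < len(cluster[args.job_name]):
        raise ValueError("task_index out of range for job " + args.job_name)

    my_address = cluster[args.job_name][args.task_index]
    return cluster, args.job_name, args.task_index, my_address
```

Every process receives the same two host lists and differs only in job_name and task_index, which is exactly the part that is tedious and error-prone to maintain by hand.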
Challenges of scaling up training
• Managing code and dependencies
• Orchestrating distributed training
• Resource contention (especially for GPUs)
• Managing an ML workflow (data preparation, training, deployment)
• Fault tolerance
E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to
allocate 693.00M (726663168 bytes) from device:
CUDA_ERROR_OUT_OF_MEMORY: out of memory
Existing YARN features to leverage
• YARN is Hadoop's scheduler
• YARN supports
○ GPU resources and other resource types
○ Team-based and hierarchical queues
○ Elasticity between queues
○ User-based limits
New and upcoming YARN features useful for ML
• Docker container support productionized in Hadoop 3.x
• YARN Native Service in Hadoop 3.x
• Submarine ML CLI released in Hadoop 3.2.0, now its own Hadoop subproject
How can we do distributed training on YARN?
• Want to take a program developed on a single machine and run it in distributed mode
with little or no modifications
• Want to take advantage of YARN's features
• Some existing open-source solutions we looked at:
○ Kubeflow (Google)
○ TensorFlow on Spark (Yahoo!)
○ Spark Deep Learning (Databricks)
○ TOY: TensorFlow on YARN (Intel)
○ XLearning (Qihoo)
○ Horovod (Uber)
○ YARN Native Service (in Hadoop 3.x)
Kubeflow + Kubernetes
• Kubeflow is an ML toolkit built on Kubernetes
○ Has a rich ecosystem and active community
• Kubernetes is one of the most popular cluster managers
• Challenges in adopting Kubernetes at LinkedIn
○ Large investment in YARN
■ Many clusters of 1000s of nodes (our largest is ~6000)
■ Expertise and tooling for YARN
○ Scalability: "No more than 5000 nodes"
(https://kubernetes.io/docs/setup/cluster-large/)
○ Need to integrate with Hadoop security (Kerberos and Hadoop delegation tokens)
○ Lack of hierarchical namespaces
Spark-based solutions
• TensorFlow on Spark (Yahoo!)
• Spark Deep Learning (Databricks)
• Pros
○ Integrates well with native Spark processing
• Cons
○ GPU resource requests not supported until Spark 3.0 (SPARK-20327)
○ No heterogeneous resource support (e.g.: more memory + GPUs for workers, less
memory + only CPUs for parameter servers)
YARN-native solutions
• TOY: TensorFlow on YARN (Intel)
• XLearning (Qihoo)
• Pros
○ Works with YARN out-of-the-box
• Cons
○ No GPU resource support
Horovod
• Horovod (Uber)
• Wraps existing optimizer to allow synchronous distributed training
• Works with many frameworks (TensorFlow, PyTorch, Keras, MXNet)
• Uses MPI or NCCL for communication
○ Multi-node MPI on YARN requires Docker containers running sshd daemons
YARN Native Service
• YARN Native Service (available in Hadoop 3.x)
• Configure distributed training jobs via XML, YAML, or JSON config file
• Distributed TensorFlow requires deploying YARN DNS Registry and ZooKeeper
• Relatively new; LinkedIn is still on Hadoop 2.x
Summary of open-source solutions
Open-source solution | Pros | Cons
Kubeflow / Kubernetes (Google) | Large marketplace of libraries and plugins; active community | Does not run on Hadoop; may not scale to very large clusters
TensorFlow on Spark (Yahoo!), Spark Deep Learning (Databricks) | Integrates with Spark | No GPU resource support until Spark 3.0 (SPARK-20327); no heterogeneous resource support
TOY: TensorFlow on YARN (Intel), XLearning (Qihoo) | YARN native, works out-of-the-box | No GPU resource support
Horovod (Uber) | Supports synchronous distributed training | MPI on YARN requires Docker
YARN Native Service | YARN native | Distributed TensorFlow requires YARN DNS Registry and ZooKeeper
Building our own solution: TonY
• TonY is a YARN application for running distributed ML jobs
• We started with TensorFlow support (hence TensorFlow on YARN (TonY))
• Now we also support PyTorch and Horovod (so perhaps Things on YARN is more apt)
A Comparison of MapReduce, Spark, and TonY
MapReduce
• 2 task types
• Map tasks connected to Reduce tasks
Spark
• 1 task type
• All connected to all
TonY
• N task types
• Heterogeneous connections
[Figure: Map tasks and Reduce tasks; Spark executors; TonY tasks of several types (Foo, Bar, Baz, Qux)]
TonY supports many different models
• Parallel tasks, no communication (diagram: Scoring tasks)
• Worker + Parameter Server Model (diagram: Worker tasks and Parameter server tasks)
• Ring All-Reduce Model (diagram: Worker tasks)
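As a rough illustration of the ring all-reduce pattern named above, the following sketch simulates it in a single process with plain Python lists. It is not TonY or Horovod code; ring_all_reduce and its data layout are assumptions made for the example. Each of the n workers holds n chunks; a scatter-reduce phase sums each chunk around the ring, then an all-gather phase circulates the finished chunks.

```python
def ring_all_reduce(worker_chunks):
    """Simulate ring all-reduce (sum) over n workers, each holding n chunks.

    worker_chunks[i][c] is chunk c (a list of numbers) on worker i.
    Returns the per-worker state after the reduction; the input is not modified.
    """
    n = len(worker_chunks)
    if any(len(chunks) != n for chunks in worker_chunks):
        raise ValueError("each worker must hold exactly n chunks")
    data = [[list(chunk) for chunk in chunks] for chunks in worker_chunks]

    # Phase 1: scatter-reduce. In step s, worker i sends chunk (i - s) mod n
    # to its right neighbour, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, list(data[i][(i - step) % n])) for i in range(n)]
        for src, c, payload in sends:
            dst = (src + 1) % n
            data[dst][c] = [a + b for a, b in zip(data[dst][c], payload)]

    # Worker i now holds the fully reduced chunk (i + 1) mod n.
    # Phase 2: all-gather. In step s, worker i forwards chunk (i + 1 - s) mod n,
    # and the receiver overwrites its stale copy.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, list(data[i][(i + 1 - step) % n])) for i in range(n)]
        for src, c, payload in sends:
            dst = (src + 1) % n
            data[dst][c] = payload

    return data
```

In each step a worker sends one chunk to one neighbour, so the amount a single worker transmits per step does not grow with the number of workers.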
TonY also supports more exotic setups
• Worker-PS with chief worker and evaluator (diagram: Worker tasks, Parameter server tasks, a Chief worker task, and an Evaluator task)
• Ring All-Reduce with in-memory distributed hash table (DHT) (diagram: Worker tasks and DHT tasks)
TonY supports multiple frameworks
TonY under the hood
[Figure, built up over several slides. Components shown: TonY Client, YARN ResourceManager, TonY ApplicationMaster, and three TonY Task Executors, each running a TensorFlow task (two Worker Tasks and one Parameter Server Task). Legend: TonY component, TensorFlow component, YARN component, YARN container]
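The slides show the components only as a diagram, so the following is a sketch under stated assumptions rather than TonY's actual code. It assumes each task executor reports its job name, task index, and host:port to a coordinator, which assembles a cluster spec and hands each TensorFlow task a TF_CONFIG value. TF_CONFIG is the JSON environment variable TensorFlow reads to learn the cluster layout; build_cluster_spec and tf_config_for are hypothetical names.

```python
import json
from collections import defaultdict


def build_cluster_spec(registrations):
    """Assemble {job_name: [address, ...]} from (job_name, task_index, address) tuples.

    Registrations may arrive in any order; addresses are sorted by task index.
    """
    by_job = defaultdict(dict)
    for job_name, task_index, address in registrations:
        by_job[job_name][task_index] = address
    return {job: [tasks[i] for i in sorted(tasks)] for job, tasks in by_job.items()}


def tf_config_for(cluster_spec, job_name, task_index):
    """Return the TF_CONFIG JSON string for one task in the cluster."""
    return json.dumps(
        {"cluster": cluster_spec, "task": {"type": job_name, "index": task_index}},
        sort_keys=True,
    )
```

With something like this in place, the user no longer passes host lists and task indices by hand; each task reads its own role from the environment.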
Related YARN changes
• Backport of GPU support to Hadoop 2.x (YARN-8200)
• Support for updating tracking URL (YARN-7974)
○ Contributed to Hadoop 2.x and 3.x
Using TonY
• TonY client lets you easily launch a job with only a few required arguments
java -cp `hadoop classpath`:tony-cli-0.3.7-all.jar \
    com.linkedin.tony.cli.ClusterSubmitter \
    --python_venv=venv.zip \
    --python_binary_path=Python/bin/python \
    --src_dir=src \
    --executes=my_model.py \
    --conf_file=tony-test.xml
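If the submission is driven from a script, the same command can be assembled as an argument list. This is only a sketch: build_submit_command is a hypothetical helper, its defaults simply mirror the example values above, and the caller is assumed to supply the output of `hadoop classpath` as a string.

```python
def build_submit_command(hadoop_classpath, tony_jar, conf_file,
                         src_dir="src", executes="my_model.py",
                         python_venv="venv.zip",
                         python_binary_path="Python/bin/python"):
    """Build the argv list for the ClusterSubmitter command shown above."""
    classpath = hadoop_classpath + ":" + tony_jar
    return [
        "java", "-cp", classpath,
        "com.linkedin.tony.cli.ClusterSubmitter",
        "--python_venv=" + python_venv,
        "--python_binary_path=" + python_binary_path,
        "--src_dir=" + src_dir,
        "--executes=" + executes,
        "--conf_file=" + conf_file,
    ]
```

On a host with Hadoop installed, the resulting list could be passed to subprocess.run(cmd, check=True); building a list rather than a shell string avoids quoting problems in paths.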
Using TonY
• For a list of all configurations, see
https://github.com/linkedin/TonY/wiki/TonY-Configurations
• Example configuration file:
<configuration>
  <property>
    <name>tony.worker.instances</name>
    <value>3</value>
  </property>
  <property>
    <name>tony.worker.gpus</name>
    <value>1</value>
  </property>
  <property>
    <name>tony.ps.instances</name>
    <value>1</value>
  </property>
</configuration>
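To show how the example configuration describes heterogeneous task types, here is a small sketch that reads such a file with Python's standard library and groups the settings per task type. load_tony_conf and task_resources are hypothetical helpers; only the three property names come from the example, and the sketch deliberately ignores every other setting.

```python
import xml.etree.ElementTree as ET

EXAMPLE_CONF = """
<configuration>
  <property><name>tony.worker.instances</name><value>3</value></property>
  <property><name>tony.worker.gpus</name><value>1</value></property>
  <property><name>tony.ps.instances</name><value>1</value></property>
</configuration>
"""


def load_tony_conf(xml_text):
    """Parse a Hadoop-style <configuration> document into a flat dict."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.findall("property")}


def task_resources(conf):
    """Group tony.<task>.instances and tony.<task>.gpus by task type."""
    tasks = {}
    for key, value in conf.items():
        parts = key.split(".")
        if len(parts) == 3 and parts[0] == "tony" and parts[2] in ("instances", "gpus"):
            tasks.setdefault(parts[1], {})[parts[2]] = int(value)
    return tasks
```

For the example file this yields three workers with one GPU each and a single parameter server with no GPU request, which is the kind of per-task-type resource shape the Spark-based options were noted as lacking.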
Using TonY
$ java ... com.linkedin.tony.cli.ClusterSubmitter ...
...
INFO impl.YarnClientImpl: Submitted application application_XXX
INFO tony.TonyClient: URL to track running application
(will proxy to TensorBoard once it has started): http://...
INFO tony.TonyClient: ResourceManager web address for application: http://...
...
INFO tony.TonyClient: Logs for ps 0 at: http://...
INFO tony.TonyClient: Logs for worker 0 at: http://...
INFO tony.TonyClient: Logs for worker 1 at: http://...
INFO tony.TonyClient: Logs for worker 2 at: http://...
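The per-task log lines follow a regular pattern, so a script can collect the URLs for later inspection. The sketch below assumes the line format stays exactly as printed above; task_log_urls is a hypothetical helper and the URLs in any usage are placeholders, since the real ones are elided in the slide.

```python
import re

# Matches the "Logs for <job> <index> at: <url>" part of a client log line.
LOG_LINE = re.compile(r"Logs for (?P<job>\w+) (?P<index>\d+) at: (?P<url>\S+)")


def task_log_urls(lines):
    """Map (job_name, task_index) to the log URL printed by the client."""
    urls = {}
    for line in lines:
        match = LOG_LINE.search(line)
        if match:
            urls[(match.group("job"), int(match.group("index")))] = match.group("url")
    return urls
```

Lines that do not match, such as the application submission and tracking URL messages, are skipped.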
TonY Portal for accessing job events and configs
Using TonY to launch notebooks and tools on demand
• TonY can be used to launch
○ Jupyter notebooks
○ TensorBoard
○ MLflow
○ etc.
• Run any Python virtual environment, PEX, or shiv
• Run any Docker image
TonY is open-source
• Open-source repo: https://github.com/linkedin/tony
○ Contributions welcome!
• OpML '19 paper: https://arxiv.org/abs/1904.01631 (presented 3 days ago)
• LinkedIn engineering blog post: https://bit.ly/2O6L5WD
TonY integrations with other projects
Azkaban workflow scheduler integration
• Azkaban is a workflow scheduler for Hadoop
• Run TonY jobs inside a workflow that includes Spark and other data processing jobs
TonY job tuning recommendations by Dr. Elephant
• Dr. Elephant is a job tuning and performance analysis tool for Hadoop jobs.
Run TonY on Google Cloud DataProc
• DataProc lets you run Hadoop and Spark on Google's Cloud
• TonY setup script for DataProc:
https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/tony
• TonY on DataProc blog post: https://bit.ly/2HEYemT
TonY runtime for Hadoop Submarine
• Submarine is a deep learning CLI for Hadoop
• TonY is a supported runtime implementation for Submarine (SUBMARINE-40, in
Submarine 0.2.0)
TonY on Microsoft Azure HDInsight (coming soon)
• HDInsight lets you run open-source frameworks on Azure, including Hadoop, Spark,
and Kafka
• TonY integration is coming soon
Demo
• Live demo using TonY Client from CLI
• Video of using TonY job in Azkaban: https://youtu.be/DM89y8BGFaY
Future Work
• GPU metrics + tuning suggestions for Dr. Elephant
• Expand TonY Portal to support launching notebooks, visualization,
and managing experiments
• TonY CLI + Python library
• TonY support on Azure HDInsight
• TonY support for other ML frameworks, schedulers, and cloud services
Thank you!
Questions?

Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 

Recently uploaded (20)

みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 

Scaling Deep Learning on Hadoop at LinkedIn

  • 1. Anthony Hsu Staff Software Engineer Scaling Deep Learning on Hadoop at LinkedIn DataWorks Summit, Washington, D.C., May 23, 2019
  • 2. About Me: Anthony Hsu • https://www.linkedin.com/in/erwaman/ • Staff Software Engineer at LinkedIn working on the Hadoop Dev team • Been working in the Hadoop space for 5.5 years on workflow scheduling (Azkaban), dataset access (Dali), machine learning infra (TonY, this talk)
  • 3. LinkedIn's Vision: Create economic opportunity for every member of the global workforce. 630M members, 30M companies, 20M jobs, 50K skills, 90K schools.
  • 4. Machine Learning at LinkedIn: People You May Know, Job Recommendations, News Feed, LinkedIn Learning Recommendations
  • 5. Why Deep Learning? • Prediction accuracy of traditional ML models tends to plateau quickly as data increases • Deep networks continue to improve as data increases (Source: Building AI Applications Using Deep Learning, https://blog.easysol.net/building-ai-applications/)
  • 6. Which framework to use? (Andrej Karpathy, Director of AI at Tesla: https://twitter.com/karpathy/status/972295865187512320)
  • 7. Machine Learning process • ML process has many parts: Data Ingestion, Data Preparation, Model Training, Model Deployment, Model Serving
  • 8. Machine Learning process • ML process has many parts: Data Ingestion, Data Preparation, Model Training, Model Deployment, Model Serving • At LinkedIn, we have a Productive ML (Pro-ML) initiative to accelerate this loop. We have teams working on every part of the ML pipeline.
  • 9. Machine Learning process • ML process has many parts: Data Ingestion, Data Preparation, Model Training, Model Deployment, Model Serving • At LinkedIn, we have a Productive ML (Pro-ML) initiative to accelerate this loop. We have teams working on every part of the ML pipeline. • This talk will focus on model training.
  • 10. Early days: how AI engineers did training • Copy code and dependencies to each host • Manually specify host and port of each process • Customize arguments for each process
      # On ps0.example.com:
      $ python trainer.py --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 --job_name=ps --task_index=0
      # On ps1.example.com:
      $ python trainer.py --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 --job_name=ps --task_index=1
      # On worker0.example.com:
      $ python trainer.py --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 --job_name=worker --task_index=0
      # On worker1.example.com:
      $ python trainer.py --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 --job_name=worker --task_index=1
      Source: https://github.com/tensorflow/examples/blob/master/community/en/docs/deploy/distributed.md
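The four commands on this slide differ only in `--job_name` and `--task_index`, which is the part a launcher can derive automatically. The following is a minimal sketch, not something from the talk: the helper name `build_commands` is hypothetical, and the hosts and flags are simply those shown on the slide.

```python
from typing import Dict, List


def build_commands(cluster: Dict[str, List[str]], script: str = "trainer.py") -> Dict[str, str]:
    """Return a mapping from each task's host:port to the command to run there.

    `cluster` maps a job name ("ps" or "worker") to its list of host:port
    addresses, in task-index order.
    """
    ps_hosts = ",".join(cluster["ps"])
    worker_hosts = ",".join(cluster["worker"])
    commands = {}
    for job_name in ("ps", "worker"):
        for task_index, address in enumerate(cluster[job_name]):
            # Every process gets the full host lists; only the job name
            # and task index change from one process to the next.
            commands[address] = (
                f"python {script} "
                f"--ps_hosts={ps_hosts} "
                f"--worker_hosts={worker_hosts} "
                f"--job_name={job_name} --task_index={task_index}"
            )
    return commands


# Cluster layout taken from the slide.
cluster = {
    "ps": ["ps0.example.com:2222", "ps1.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
}

for address, command in build_commands(cluster).items():
    print(f"# On {address}:\n$ {command}")
```

Even with such a helper, someone still has to pick free ports and copy code to each host, which is the manual work the rest of the talk sets out to remove.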
  • 11. Challenges of scaling up training • Managing code and dependencies • Orchestrating distributed training • Resource contention (especially for GPUs) • Managing an ML workflow (data preparation, training, deployment) • Fault tolerance
      Example error log:
      E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 693.00M (726663168 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
  • 12. Existing YARN features to leverage • YARN is Hadoop's scheduler
  • 13. Existing YARN features to leverage • YARN is Hadoop's scheduler • YARN supports ○ GPU resources and other resource types
  • 14. Existing YARN features to leverage • YARN is Hadoop's scheduler • YARN supports ○ GPU resources and other resource types ○ Team-based and hierarchical queues
  • 15. Existing YARN features to leverage • YARN is Hadoop's scheduler • YARN supports ○ GPU resources and other resource types ○ Team-based and hierarchical queues ○ Elasticity between queues
  • 16. Existing YARN features to leverage • YARN is Hadoop's scheduler • YARN supports ○ GPU resources and other resource types ○ Team-based and hierarchical queues ○ Elasticity between queues ○ User-based limits
  • 17. New and upcoming YARN features useful for ML • Docker container support productionized in Hadoop 3.x • YARN Native Service in Hadoop 3.x • Submarine ML CLI released in Hadoop 3.2.0, now its own Hadoop subproject
  • 18. How can we do distributed training on YARN? • Want to take a program developed on a single machine and run it in distributed mode with little or no modification • Want to take advantage of YARN's features • Some existing open-source solutions we looked at: ○ Kubeflow (Google) ○ TensorFlow on Spark (Yahoo!) ○ Spark Deep Learning (Databricks) ○ TOY: TensorFlow on YARN (Intel) ○ XLearning (Qihoo) ○ Horovod (Uber) ○ YARN Native Service (in Hadoop 3.x)
  • 19. Kubeflow + Kubernetes • Kubeflow is an ML toolkit built on Kubernetes ○ Has a rich ecosystem and active community • Kubernetes is one of the most popular cluster managers • Challenges in adopting Kubernetes at LinkedIn ○ Large investment in YARN ■ Many clusters of 1000s of nodes (our largest is ~6000) ■ Expertise and tooling for YARN ○ Scalability: "No more than 5000 nodes" (https://kubernetes.io/docs/setup/cluster-large/) ○ Need to integrate with Hadoop security (Kerberos and Hadoop delegation tokens) ○ Lack of hierarchical namespaces
  • 20. Spark-based solutions • TensorFlow on Spark (Yahoo!) • Spark Deep Learning (Databricks) • Pros ○ Integrates well with native Spark processing • Cons ○ GPU resource requests not supported until Spark 3.0 (SPARK-20327) ○ No heterogeneous resource support (e.g., more memory + GPUs for workers, less memory + only CPUs for parameter servers)
  • 21. YARN-native solutions • TOY: TensorFlow on YARN (Intel) • XLearning (Qihoo) • Pros ○ Works with YARN out-of-the-box • Cons ○ No GPU resource support
  • 22. Horovod • Horovod (Uber) • Wraps existing optimizer to allow synchronous distributed training • Works with many frameworks (TensorFlow, PyTorch, Keras, MXNet) • Uses MPI or NCCL for communication ○ Multi-node MPI on YARN requires Docker containers running sshd
  • 23. YARN Native Service • YARN Native Service (available in Hadoop 3.x) • Configure distributed training jobs via XML, YAML, or JSON config file • Distributed TensorFlow requires deploying YARN DNS Registry and ZooKeeper • Relatively new; LinkedIn is still on Hadoop 2.x
  • 24. Summary of open-source solutions
      Kubeflow / Kubernetes (Google). Pros: large marketplace of libraries and plugins; active community. Cons: does not run on Hadoop; may not scale to very large clusters.
      TensorFlow on Spark (Yahoo!), Spark Deep Learning (Databricks). Pros: integrates with Spark. Cons: no GPU resource support until Spark 3.0 (SPARK-20327); no heterogeneous resource support.
      TOY: TensorFlow on YARN (Intel), XLearning (Qihoo). Pros: YARN native, works out-of-the-box. Cons: no GPU resource support.
      Horovod (Uber). Pros: supports synchronous distributed training. Cons: MPI on YARN requires Docker.
      YARN Native Service. Pros: YARN native. Cons: distributed TensorFlow requires YARN DNS Registry and ZooKeeper.
  • 25. Building our own solution: TonY • TonY is a YARN application for running distributed ML jobs • We started with TensorFlow support (hence TensorFlow on YARN (TonY)) • Now we also support PyTorch and Horovod (so perhaps Things on YARN is more apt)
  • 26. A Comparison of MapReduce, Spark, and TonY • MapReduce: 2 task types; Map tasks connected to Reduce tasks • Spark: 1 task type; all connected to all • TonY: N task types; heterogeneous connections [Diagram: Map and Reduce tasks; Spark executors; TonY Foo, Bar, Baz, and Qux tasks]
  • 27. TonY supports many different models • Parallel tasks, no communication [Diagram: Scoring tasks] • Worker + Parameter Server Model [Diagram: Worker tasks, Parameter server tasks] • Ring All-Reduce Model [Diagram: Worker tasks]
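To make the Ring All-Reduce Model named on this slide concrete, here is a small single-process simulation of the textbook algorithm: a reduce-scatter phase followed by an all-gather phase, each taking n - 1 steps around the ring. This is an illustrative sketch only. It is not TonY's or Horovod's implementation, the function name `ring_all_reduce` is hypothetical, and each worker holds exactly one number per chunk to keep the code short.

```python
def ring_all_reduce(values):
    """Simulate ring all-reduce (sum) across n workers.

    values[i][c] is worker i's local value for chunk c, and every worker
    holds exactly n chunks. Returns each worker's data after the exchange.
    """
    n = len(values)
    data = [list(v) for v in values]

    # Phase 1: reduce-scatter. In each step, worker i sends one chunk to
    # its right neighbour, which adds it to its own copy. After n - 1
    # steps, worker i holds the full sum for chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot the outgoing values first so that all sends in a step
        # behave as if they happen simultaneously.
        sends = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for sender, chunk, value in sends:
            data[(sender + 1) % n][chunk] += value

    # Phase 2: all-gather. Completed chunks travel around the ring and
    # overwrite the partial sums held by the other workers.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for sender, chunk, value in sends:
            data[(sender + 1) % n][chunk] = value

    return data


gradients = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
print(ring_all_reduce(gradients))
```

Each worker only ever talks to its neighbour, so no single task has to receive every worker's gradients, in contrast to the parameter server model on the same slide.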
  • 28. TonY also supports more exotic setups • Worker-PS with chief worker and evaluator [Diagram: Worker tasks, Parameter server tasks, Chief worker task, Evaluator task] • Ring All-Reduce with in-memory distributed hash table (DHT) [Diagram: Worker tasks, DHT tasks]
  • 29. TonY supports multiple frameworks
  • 30. TonY under the hood
  • 31. TonY under the hood [Diagram: TonY Client, YARN ResourceManager. Legend: TonY component, YARN component]
  • 32. TonY under the hood [Diagram: TonY Client, YARN ResourceManager, TonY ApplicationMaster. Legend: TonY component, YARN component, YARN container]
  • 33. TonY under the hood [Diagram: TonY Client, YARN ResourceManager, TonY ApplicationMaster, three TonY Task Executors. Legend: TonY component, YARN component, YARN container]
  • 34. TonY under the hood [Diagram: TonY Client, YARN ResourceManager, TonY ApplicationMaster, three TonY Task Executors running two TensorFlow Worker Tasks and one TensorFlow Parameter Server Task. Legend: TonY component, TensorFlow component, YARN component, YARN container]
  • 35, 36. TonY under the hood [Diagram: same components as slide 34]
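The diagrams show the components but not what passes between them. One piece that has to exist somewhere in this flow is a cluster spec: every TensorFlow task needs to know the addresses of all the others. TensorFlow reads this from a `TF_CONFIG` environment variable containing a `cluster` section and a `task` section. The sketch below shows how a coordinator could assemble that value once each task has reported its address. It is an assumption-laden illustration, not TonY's code: the function names, the registration tuples, and the host names are all hypothetical.

```python
import json
from collections import defaultdict


def build_cluster_spec(registrations):
    """Group (job_name, task_index, address) tuples into a cluster spec.

    Registrations may arrive in any order; addresses are sorted by task
    index so that every task sees the same spec.
    """
    by_job = defaultdict(dict)
    for job_name, task_index, address in registrations:
        by_job[job_name][task_index] = address
    return {job: [tasks[i] for i in sorted(tasks)] for job, tasks in by_job.items()}


def tf_config_for(cluster_spec, job_name, task_index):
    """Return the TF_CONFIG JSON string for one task."""
    return json.dumps(
        {"cluster": cluster_spec, "task": {"type": job_name, "index": task_index}}
    )


# Hypothetical registrations, as a coordinator might collect them.
registrations = [
    ("worker", 1, "host-b:5001"),
    ("ps", 0, "host-c:5002"),
    ("worker", 0, "host-a:5000"),
]
spec = build_cluster_spec(registrations)
print(tf_config_for(spec, "worker", 1))
```

Because the addresses are collected at runtime, nobody has to choose hosts and ports by hand as in the early-days workflow.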
  • 38. Related YARN changes • Backport of GPU support to Hadoop 2.x (YARN-8200)
  • 39. Related YARN changes • Backport of GPU support to Hadoop 2.x (YARN-8200) • Support for updating tracking URL (YARN-7974) ○ Contributed to Hadoop 2.x and 3.x
  • 40. Using TonY • TonY client lets you easily launch a job with only a few required arguments
      java -cp `hadoop classpath`:tony-cli-0.3.7-all.jar com.linkedin.tony.cli.ClusterSubmitter \
        --python_venv=venv.zip \
        --python_binary_path=Python/bin/python \
        --src_dir=src \
        --executes=my_model.py \
        --conf_file=tony-test.xml
  • 41. Using TonY • For a list of all configurations, see https://github.com/linkedin/TonY/wiki/TonY-Configurations • Example configuration file:
      <configuration>
        <property>
          <name>tony.worker.instances</name>
          <value>3</value>
        </property>
        <property>
          <name>tony.worker.gpus</name>
          <value>1</value>
        </property>
        <property>
          <name>tony.ps.instances</name>
          <value>1</value>
        </property>
      </configuration>
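When the number of workers or GPUs changes between experiments, it can be convenient to generate a file of this shape rather than edit it by hand. The sketch below uses only Python's standard library; the helper name `build_tony_config` is hypothetical, and the three property names are the ones shown on the slide.

```python
import xml.etree.ElementTree as ET


def build_tony_config(settings):
    """Render a {name: value} mapping as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        # XML text must be a string, so numeric values are converted.
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


xml_text = build_tony_config(
    {
        "tony.worker.instances": 3,
        "tony.worker.gpus": 1,
        "tony.ps.instances": 1,
    }
)
print(xml_text)
```

The output is a single unindented line of XML, equivalent in content to the example configuration on the slide.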
  • 42. Using TonY
      $ java ... com.linkedin.tony.cli.ClusterSubmitter ...
      ...
      INFO impl.YarnClientImpl: Submitted application application_XXX
      INFO tony.TonyClient: URL to track running application (will proxy to TensorBoard once it has started): http://...
      INFO tony.TonyClient: ResourceManager web address for application: http://...
      ...
      INFO tony.TonyClient: Logs for ps 0 at: http://...
      INFO tony.TonyClient: Logs for worker 0 at: http://...
      INFO tony.TonyClient: Logs for worker 1 at: http://...
      INFO tony.TonyClient: Logs for worker 2 at: http://...
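The per-task log lines in this output follow a regular pattern, so a wrapper script could collect the log URLs for each task. The sketch below assumes the client prints lines in exactly the form shown on the slide, which may differ between TonY versions. The function name `extract_task_logs` is hypothetical, and the URLs in the sample are placeholders because the slide elides the real ones.

```python
import re

# Matches lines such as "... Logs for worker 0 at: http://..."
LOG_LINE = re.compile(r"Logs for (?P<job>\w+) (?P<index>\d+) at: (?P<url>\S+)")


def extract_task_logs(output):
    """Map (job_name, task_index) to the log URL found in client output."""
    logs = {}
    for match in LOG_LINE.finditer(output):
        logs[(match["job"], int(match["index"]))] = match["url"]
    return logs


# Placeholder URLs; the slide shows only "http://...".
sample = """\
INFO tony.TonyClient: Logs for ps 0 at: http://example.invalid/ps0
INFO tony.TonyClient: Logs for worker 0 at: http://example.invalid/worker0
INFO tony.TonyClient: Logs for worker 1 at: http://example.invalid/worker1
"""

for task, url in extract_task_logs(sample).items():
    print(task, url)
```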
  • 43. TonY Portal for accessing job events and configs
  • 44. Using TonY to launch notebooks and tools on demand • TonY can be used to launch ○ Jupyter notebooks ○ TensorBoard ○ MLflow ○ etc. • Run any Python virtual environment, PEX, or shiv • Run any Docker image
  • 45. TonY is open-source • Open-source repo: https://github.com/linkedin/tony ○ Contributions welcome! • OpML '19 paper: https://arxiv.org/abs/1904.01631 (presented 3 days ago) • LinkedIn engineering blog post: https://bit.ly/2O6L5WD
  • 46. TonY integrations with other projects
  • 47. Azkaban workflow scheduler integration • Azkaban is a workflow scheduler for Hadoop • Run TonY jobs inside a workflow that includes Spark and other data processing jobs
  • 48. TonY job tuning recommendations by Dr. Elephant • Dr. Elephant is a job tuning and performance analysis tool for Hadoop jobs.
  • 49. Run TonY on Google Cloud DataProc • DataProc lets you run Hadoop and Spark on Google's Cloud • TonY setup script for DataProc: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/tony • TonY on DataProc blog post: https://bit.ly/2HEYemT
  • 50. TonY runtime for Hadoop Submarine • Submarine is a deep learning CLI for Hadoop • TonY is a supported runtime implementation for Submarine (SUBMARINE-40, in Submarine 0.2.0)
  • 51. TonY on Microsoft Azure HDInsight (coming soon) • HDInsight lets you run open-source frameworks on Azure, including Hadoop, Spark, and Kafka • TonY integration is coming soon
  • 52. Demo • Live demo using TonY Client from CLI • Video of using TonY job in Azkaban: https://youtu.be/DM89y8BGFaY
  • 53. Future Work • GPU metrics + tuning suggestions for Dr. Elephant • Expand TonY Portal to support launching notebooks, visualization, and managing experiments • TonY CLI + Python library • TonY support on Azure HDInsight • TonY support for other ML frameworks, schedulers, and cloud services