Scaling Deep Learning on Hadoop at LinkedIn

Describes LinkedIn's journey in building a training orchestrator, TonY, for doing deep learning on Hadoop. For more details about TonY, visit https://github.com/linkedin/tony.

  1. Anthony Hsu, Staff Software Engineer. Scaling Deep Learning on Hadoop at LinkedIn. DataWorks Summit, Washington, D.C., May 23, 2019
  2. About Me: Anthony Hsu • https://www.linkedin.com/in/erwaman/ • Staff Software Engineer at LinkedIn on the Hadoop Dev team • 5.5 years in the Hadoop space, working on workflow scheduling (Azkaban), dataset access (Dali), and machine learning infrastructure (TonY, this talk)
  3. LinkedIn's Vision: Create economic opportunity for every member of the global workforce. 630M members, 30M companies, 20M jobs, 50K skills, 90K schools
  4. Machine Learning at LinkedIn: People You May Know, Job Recommendations, News Feed, LinkedIn Learning Recommendations
  5. Why Deep Learning? • Prediction accuracy of traditional ML models tends to plateau quickly as data increases • Deep networks continue to improve as data increases (Figure: Building AI Applications Using Deep Learning, https://blog.easysol.net/building-ai-applications/)
  6. Which framework to use? Andrej Karpathy, Director of AI at Tesla: https://twitter.com/karpathy/status/972295865187512320
  7. Machine Learning process • The ML process has many parts: data ingestion, data preparation, model training, model deployment, model serving
  8. Machine Learning process • The ML process has many parts • At LinkedIn, we have a Productive ML (Pro-ML) initiative to accelerate this loop, with teams working on every part of the ML pipeline
  9. Machine Learning process • The ML process has many parts • At LinkedIn, we have a Productive ML (Pro-ML) initiative to accelerate this loop, with teams working on every part of the ML pipeline • This talk will focus on model training
  10. Early days: how AI engineers did training • Copy code and dependencies to each host • Manually specify host and port of each process • Customize arguments for each process. For example:

     # On ps0.example.com:
     $ python trainer.py --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 --job_name=ps --task_index=0
     # On ps1.example.com:
     $ python trainer.py --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 --job_name=ps --task_index=1
     # On worker0.example.com:
     $ python trainer.py --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 --job_name=worker --task_index=0
     # On worker1.example.com:
     $ python trainer.py --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 --job_name=worker --task_index=1

     Source: https://github.com/tensorflow/examples/blob/master/community/en/docs/deploy/distributed.md
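     For context, a minimal sketch of what such a trainer.py might look like, using the TensorFlow 1.x distributed API these flags feed into (an illustrative reconstruction, not the exact script from the linked docs):

     import argparse

     import tensorflow as tf

     # Parse the per-process flags shown above.
     parser = argparse.ArgumentParser()
     parser.add_argument("--ps_hosts", type=str)
     parser.add_argument("--worker_hosts", type=str)
     parser.add_argument("--job_name", type=str)
     parser.add_argument("--task_index", type=int)
     args = parser.parse_args()

     # Every process must be handed the full cluster layout.
     cluster = tf.train.ClusterSpec({
         "ps": args.ps_hosts.split(","),
         "worker": args.worker_hosts.split(","),
     })
     server = tf.train.Server(cluster, job_name=args.job_name,
                              task_index=args.task_index)

     if args.job_name == "ps":
         server.join()  # parameter servers only serve variables
     else:
         # Workers place variables on the parameter servers, then train.
         with tf.device(tf.train.replica_device_setter(
                 worker_device="/job:worker/task:%d" % args.task_index,
                 cluster=cluster)):
             pass  # build the model and training loop here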
  11. Challenges of scaling up training • Managing code and dependencies • Orchestrating distributed training • Resource contention (especially for GPUs) • Managing an ML workflow (data preparation, training, deployment) • Fault tolerance, e.g. recovering from errors like:

     E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 693.00M (726663168 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
  12. Existing YARN features to leverage • YARN is Hadoop's scheduler
  13. Existing YARN features to leverage • YARN is Hadoop's scheduler • YARN supports ○ GPU resources and other resource types
  14. Existing YARN features to leverage • YARN is Hadoop's scheduler • YARN supports ○ GPU resources and other resource types ○ Team-based and hierarchical queues
  15. Existing YARN features to leverage • YARN is Hadoop's scheduler • YARN supports ○ GPU resources and other resource types ○ Team-based and hierarchical queues ○ Elasticity between queues
  16. Existing YARN features to leverage • YARN is Hadoop's scheduler • YARN supports ○ GPU resources and other resource types ○ Team-based and hierarchical queues ○ Elasticity between queues ○ User-based limits
  17. New and upcoming YARN features useful for ML • Docker container support productionized in Hadoop 3.x • YARN Native Service in Hadoop 3.x • Submarine ML CLI released in Hadoop 3.2.0, now its own Hadoop subproject
  18. How can we do distributed training on YARN? • Want to take a program developed on a single machine and run it in distributed mode with little or no modification • Want to take advantage of YARN's features • Some existing open-source solutions we looked at: ○ Kubeflow (Google) ○ TensorFlow on Spark (Yahoo!) ○ Spark Deep Learning (Databricks) ○ TOY: TensorFlow on YARN (Intel) ○ XLearning (Qihoo) ○ Horovod (Uber) ○ YARN Native Service (in Hadoop 3.x)
  19. Kubeflow + Kubernetes • Kubeflow is an ML toolkit built on Kubernetes ○ Has a rich ecosystem and active community • Kubernetes is one of the most popular cluster managers • Challenges in adopting Kubernetes at LinkedIn ○ Large investment in YARN ■ Many clusters of 1000s of nodes (our largest is ~6000) ■ Expertise and tooling for YARN ○ Scalability: "No more than 5000 nodes" (https://kubernetes.io/docs/setup/cluster-large/) ○ Need to integrate with Hadoop security (Kerberos and Hadoop delegation tokens) ○ Lack of hierarchical namespaces
  20. Spark-based solutions • TensorFlow on Spark (Yahoo!) • Spark Deep Learning (Databricks) • Pros ○ Integrate well with native Spark processing • Cons ○ GPU resource requests not supported until Spark 3.0 (SPARK-20327) ○ No heterogeneous resource support (e.g., more memory + GPUs for workers, less memory + only CPUs for parameter servers)
  21. YARN-native solutions • TOY: TensorFlow on YARN (Intel) • XLearning (Qihoo) • Pros ○ Work with YARN out-of-the-box • Cons ○ No GPU resource support
  22. Horovod • Horovod (Uber) • Wraps the existing optimizer to allow synchronous distributed training • Works with many frameworks (TensorFlow, PyTorch, Keras, MXNet) • Uses MPI or NCCL for communication ○ Multi-node MPI on YARN requires Docker containers running sshd
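     A minimal sketch of the optimizer wrapping Horovod uses, based on Horovod's TensorFlow 1.x API (illustrative; the model definition and training loop are elided):

     import tensorflow as tf
     import horovod.tensorflow as hvd

     hvd.init()  # initialize MPI/NCCL communication across processes

     # Pin each process to a single local GPU.
     config = tf.ConfigProto()
     config.gpu_options.visible_device_list = str(hvd.local_rank())

     # Scale the learning rate by the number of workers, then wrap the
     # optimizer so gradients are averaged via ring all-reduce each step.
     opt = tf.train.AdagradOptimizer(0.01 * hvd.size())
     opt = hvd.DistributedOptimizer(opt)

     # Broadcast rank 0's initial variables so all workers start in sync.
     hooks = [hvd.BroadcastGlobalVariablesHook(0)]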
  23. YARN Native Service • Available in Hadoop 3.x • Configure distributed training jobs via an XML, YAML, or JSON config file • Distributed TensorFlow requires deploying the YARN DNS Registry and ZooKeeper • Relatively new; LinkedIn is still on Hadoop 2.x
  24. Summary of open-source solutions:
     • Kubeflow / Kubernetes (Google). Pros: large marketplace of libraries and plugins; active community. Cons: does not run on Hadoop; may not scale to very large clusters.
     • TensorFlow on Spark (Yahoo!) and Spark Deep Learning (Databricks). Pros: integrate with Spark. Cons: no GPU resource support until Spark 3.0 (SPARK-20327); no heterogeneous resource support.
     • TOY: TensorFlow on YARN (Intel) and XLearning (Qihoo). Pros: YARN native, work out-of-the-box. Cons: no GPU resource support.
     • Horovod (Uber). Pros: supports synchronous distributed training. Cons: MPI on YARN requires Docker.
     • YARN Native Service. Pros: YARN native. Cons: distributed TensorFlow requires YARN DNS Registry and ZooKeeper.
  25. Building our own solution: TonY • TonY is a YARN application for running distributed ML jobs • We started with TensorFlow support (hence TensorFlow on YARN, or TonY) • Now we also support PyTorch and Horovod (so perhaps "Things on YARN" is more apt)
  26. A comparison of MapReduce, Spark, and TonY. MapReduce: 2 task types; Map tasks connected to Reduce tasks. Spark: 1 task type; all executors connected to all. TonY: N task types (Foo, Bar, Baz, Qux tasks in the diagram) with heterogeneous connections.
  27. TonY supports many different models: parallel tasks with no communication (e.g., scoring tasks); the worker + parameter server model; and the ring all-reduce model.
  28. TonY also supports more exotic setups: worker-PS with a chief worker and an evaluator, and ring all-reduce with an in-memory distributed hash table (DHT).
  29. TonY supports multiple frameworks
  30. TonY under the hood
  31. TonY under the hood: the TonY Client submits the application to the YARN ResourceManager.
  32. TonY under the hood: the ResourceManager allocates a YARN container that runs the TonY ApplicationMaster.
  33. TonY under the hood: the ApplicationMaster requests further containers and launches a TonY Task Executor in each.
  34. TonY under the hood: each Task Executor starts its assigned TensorFlow task (two workers and one parameter server in this example).
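     To make the hand-off to TensorFlow concrete: TensorFlow's standard convention for distributed training is a TF_CONFIG environment variable, which an orchestrator like TonY's task executors can set before starting the user's process. A sketch of that convention follows (the cluster layout below is hypothetical, and TonY's internal mechanics may differ):

     import json
     import os

     # Full cluster spec, known once every task's host:port is allocated.
     tf_config = {
         "cluster": {
             "ps": ["ps-host-0:2222"],
             "worker": ["worker-host-0:2222", "worker-host-1:2222"],
         },
         # Each process fills in its own role and index.
         "task": {"type": "worker", "index": 0},
     }
     os.environ["TF_CONFIG"] = json.dumps(tf_config)
     # A TensorFlow Estimator launched in this environment discovers the
     # cluster from TF_CONFIG and joins distributed training automatically.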
  37. Related YARN changes
  38. Related YARN changes • Backport of GPU support to Hadoop 2.x (YARN-8200)
  39. Related YARN changes • Backport of GPU support to Hadoop 2.x (YARN-8200) • Support for updating tracking URL (YARN-7974) ○ Contributed to Hadoop 2.x and 3.x
  40. Using TonY • The TonY client lets you easily launch a job with only a few required arguments:

     java -cp `hadoop classpath`:tony-cli-0.3.7-all.jar com.linkedin.tony.cli.ClusterSubmitter \
       --python_venv=venv.zip \
       --python_binary_path=Python/bin/python \
       --src_dir=src \
       --executes=my_model.py \
       --conf_file=tony-test.xml
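     The script passed via --executes can be an ordinary TensorFlow program. Below is a minimal, hypothetical my_model.py using the TF 1.x Estimator API, which reads the cluster spec from the TF_CONFIG environment variable so the same script runs single-node or distributed without changes:

     import tensorflow as tf

     def input_fn():
         # Toy in-memory dataset; a real job would read from HDFS.
         features = {"x": [[1.0], [2.0], [3.0], [4.0]]}
         labels = [[0], [0], [1], [1]]
         return (tf.data.Dataset.from_tensor_slices((features, labels))
                 .repeat().batch(2))

     estimator = tf.estimator.LinearClassifier(
         feature_columns=[tf.feature_column.numeric_column("x")])

     # train_and_evaluate inspects TF_CONFIG (set up by TonY's task
     # executors), so user code contains no distribution logic.
     tf.estimator.train_and_evaluate(
         estimator,
         tf.estimator.TrainSpec(input_fn, max_steps=1000),
         tf.estimator.EvalSpec(input_fn, steps=10))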
  41. Using TonY • Example configuration file:

     <configuration>
       <property>
         <name>tony.worker.instances</name>
         <value>3</value>
       </property>
       <property>
         <name>tony.worker.gpus</name>
         <value>1</value>
       </property>
       <property>
         <name>tony.ps.instances</name>
         <value>1</value>
       </property>
     </configuration>

     • For a list of all configurations, see https://github.com/linkedin/TonY/wiki/TonY-Configurations
  42. Using TonY • Sample client output:

     $ java ... com.linkedin.tony.cli.ClusterSubmitter ...
     ...
     INFO impl.YarnClientImpl: Submitted application application_XXX
     INFO tony.TonyClient: URL to track running application (will proxy to TensorBoard once it has started): http://...
     INFO tony.TonyClient: ResourceManager web address for application: http://...
     ...
     INFO tony.TonyClient: Logs for ps 0 at: http://...
     INFO tony.TonyClient: Logs for worker 0 at: http://...
     INFO tony.TonyClient: Logs for worker 1 at: http://...
     INFO tony.TonyClient: Logs for worker 2 at: http://...
  43. TonY Portal for accessing job events and configs
  44. Using TonY to launch notebooks and tools on demand • TonY can be used to launch ○ Jupyter notebooks ○ TensorBoard ○ MLflow ○ etc. • Run any Python virtual environment, PEX, or shiv • Run any Docker image
  45. TonY is open-source • Open-source repo: https://github.com/linkedin/tony ○ Contributions welcome! • OpML '19 paper: https://arxiv.org/abs/1904.01631 (presented 3 days ago) • LinkedIn engineering blog post: https://bit.ly/2O6L5WD
  46. TonY integrations with other projects
  47. Azkaban workflow scheduler integration • Azkaban is a workflow scheduler for Hadoop • Run TonY jobs inside a workflow that includes Spark and other data processing jobs
  48. TonY job tuning recommendations by Dr. Elephant • Dr. Elephant is a job tuning and performance analysis tool for Hadoop jobs
  49. Run TonY on Google Cloud DataProc • DataProc lets you run Hadoop and Spark on Google Cloud • TonY setup script for DataProc: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/tony • TonY on DataProc blog post: https://bit.ly/2HEYemT
  50. TonY runtime for Hadoop Submarine • Submarine is a deep learning CLI for Hadoop • TonY is a supported runtime implementation for Submarine (SUBMARINE-40, in Submarine 0.2.0)
  51. TonY on Microsoft Azure HDInsight (coming soon) • HDInsight lets you run open-source frameworks on Azure, including Hadoop, Spark, and Kafka • TonY integration is coming soon
  52. Demo • Live demo using the TonY Client from the CLI • Video of using a TonY job in Azkaban: https://youtu.be/DM89y8BGFaY
  53. Future Work • GPU metrics and tuning suggestions for Dr. Elephant • Expand TonY Portal to support launching notebooks, visualization, and managing experiments • TonY CLI and Python library • TonY support on Azure HDInsight • TonY support for other ML frameworks, schedulers, and cloud services
  54. Thank you! Questions?
