SlideShare a Scribd company logo

Scaling Deep Learning on Hadoop at LinkedIn

Describes LinkedIn's journey in building a training orchestrator, TonY, for doing deep learning on Hadoop. For more details about TonY, visit https://github.com/linkedin/tony.

1 of 54
Download to read offline
Anthony Hsu
Staff Software Engineer
Scaling Deep Learning on Hadoop
at LinkedIn
DataWorks Summit, Washington, D.C., May 23, 2019
About Me: Anthony Hsu
• https://www.linkedin.com/in/erwaman/
• Staff Software Engineer at LinkedIn working on the Hadoop Dev team
• Been working in the Hadoop space for 5.5 years on workflow scheduling (Azkaban),
dataset access (Dali), machine learning infra (TonY, this talk)
LinkedIn's Vision
Create economic opportunity
for every member of the global workforce
630M
Members
30M
Companies
20M
Jobs
50K
Skills
90K
Schools
Machine Learning at LinkedIn
People You May Know
Job Recommendations
News Feed
LinkedIn Learning Recommendations
4
Why Deep Learning?
5
Building AI Applications Using Deep Learning
https://blog.easysol.net/building-ai-applications/
• Prediction accuracy of traditional ML
models tends to plateau quickly as data
increases
• Deep networks continue to improve as
data increases
Which framework to use?
6
Andrej Karpathy, Director of AI at Tesla
https://twitter.com/karpathy/status/972295865187512320

Recommended

Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionWes McKinney
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsDatabricks
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 
Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Databricks
 
On-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceOn-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceChin Huang
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
 

More Related Content

What's hot

Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopPrasanna Rajaperumal
 
Software architecture for high traffic website
Software architecture for high traffic websiteSoftware architecture for high traffic website
Software architecture for high traffic websiteTung Nguyen Thanh
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisDvir Volk
 
Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
 Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
Spark Operator—Deploy, Manage and Monitor Spark clusters on KubernetesDatabricks
 
Hadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldDataWorks Summit
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Amazon Web Services
 
What is New with Apache Spark Performance Monitoring in Spark 3.0
What is New with Apache Spark Performance Monitoring in Spark 3.0What is New with Apache Spark Performance Monitoring in Spark 3.0
What is New with Apache Spark Performance Monitoring in Spark 3.0Databricks
 
Apache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeApache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeDremio Corporation
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcachedJurriaan Persyn
 
Kafka Retry and DLQ
Kafka Retry and DLQKafka Retry and DLQ
Kafka Retry and DLQGeorge Teo
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Cloudera, Inc.
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesDatabricks
 

What's hot (20)

Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoop
 
Software architecture for high traffic website
Software architecture for high traffic websiteSoftware architecture for high traffic website
Software architecture for high traffic website
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
 Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
 
Hadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the Field
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
 
Automated master failover
Automated master failoverAutomated master failover
Automated master failover
 
What is New with Apache Spark Performance Monitoring in Spark 3.0
What is New with Apache Spark Performance Monitoring in Spark 3.0What is New with Apache Spark Performance Monitoring in Spark 3.0
What is New with Apache Spark Performance Monitoring in Spark 3.0
 
Apache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeApache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In Practice
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
 
Kafka Retry and DLQ
Kafka Retry and DLQKafka Retry and DLQ
Kafka Retry and DLQ
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 

Similar to Scaling Deep Learning on Hadoop at LinkedIn

Hadoop Meetup Jan 2019 - TonY: TensorFlow on YARN and Beyond
Hadoop Meetup Jan 2019 - TonY: TensorFlow on YARN and BeyondHadoop Meetup Jan 2019 - TonY: TensorFlow on YARN and Beyond
Hadoop Meetup Jan 2019 - TonY: TensorFlow on YARN and BeyondErik Krogen
 
Introduction to DL platform
Introduction to DL platformIntroduction to DL platform
Introduction to DL platformxiaogaozi
 
Node labels in YARN
Node labels in YARNNode labels in YARN
Node labels in YARNWangda Tan
 
project--2 nd review_2
project--2 nd review_2project--2 nd review_2
project--2 nd review_2Aswini Ashu
 
project--2 nd review_2
project--2 nd review_2project--2 nd review_2
project--2 nd review_2aswini pilli
 
Bringing Deep Learning into production
Bringing Deep Learning into production Bringing Deep Learning into production
Bringing Deep Learning into production Paolo Platter
 
Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Zohar Elkayam
 
TonY: Native support of TensorFlow on Hadoop
TonY: Native support of TensorFlow on HadoopTonY: Native support of TensorFlow on Hadoop
TonY: Native support of TensorFlow on HadoopAnthony Hsu
 
Ncku csie talk about Spark
Ncku csie talk about SparkNcku csie talk about Spark
Ncku csie talk about SparkGiivee The
 
Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014spinningmatt
 
Large scale computing
Large scale computing Large scale computing
Large scale computing Bhupesh Bansal
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...Databricks
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, and Deep Learnin...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, and Deep Learnin...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, and Deep Learnin...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, and Deep Learnin...Databricks
 
Data Con LA 2018 - A Tale of DL Frameworks: TensorFlow, Keras, & Deep Learnin...
Data Con LA 2018 - A Tale of DL Frameworks: TensorFlow, Keras, & Deep Learnin...Data Con LA 2018 - A Tale of DL Frameworks: TensorFlow, Keras, & Deep Learnin...
Data Con LA 2018 - A Tale of DL Frameworks: TensorFlow, Keras, & Deep Learnin...Data Con LA
 
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNYARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNHortonworks
 
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...Anyscale
 
Integrating Deep Learning Libraries with Apache Spark
Integrating Deep Learning Libraries with Apache SparkIntegrating Deep Learning Libraries with Apache Spark
Integrating Deep Learning Libraries with Apache SparkDatabricks
 

Similar to Scaling Deep Learning on Hadoop at LinkedIn (20)

Hadoop Meetup Jan 2019 - TonY: TensorFlow on YARN and Beyond
Hadoop Meetup Jan 2019 - TonY: TensorFlow on YARN and BeyondHadoop Meetup Jan 2019 - TonY: TensorFlow on YARN and Beyond
Hadoop Meetup Jan 2019 - TonY: TensorFlow on YARN and Beyond
 
Intro to Big Data - Spark
Intro to Big Data - SparkIntro to Big Data - Spark
Intro to Big Data - Spark
 
Introduction to DL platform
Introduction to DL platformIntroduction to DL platform
Introduction to DL platform
 
Node labels in YARN
Node labels in YARNNode labels in YARN
Node labels in YARN
 
Node Labels in YARN
Node Labels in YARNNode Labels in YARN
Node Labels in YARN
 
project--2 nd review_2
project--2 nd review_2project--2 nd review_2
project--2 nd review_2
 
project--2 nd review_2
project--2 nd review_2project--2 nd review_2
project--2 nd review_2
 
Bringing Deep Learning into production
Bringing Deep Learning into production Bringing Deep Learning into production
Bringing Deep Learning into production
 
Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016
 
TonY: Native support of TensorFlow on Hadoop
TonY: Native support of TensorFlow on HadoopTonY: Native support of TensorFlow on Hadoop
TonY: Native support of TensorFlow on Hadoop
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Ncku csie talk about Spark
Ncku csie talk about SparkNcku csie talk about Spark
Ncku csie talk about Spark
 
Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014
 
Large scale computing
Large scale computing Large scale computing
Large scale computing
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, and Deep Learnin...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, and Deep Learnin...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, and Deep Learnin...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, and Deep Learnin...
 
Data Con LA 2018 - A Tale of DL Frameworks: TensorFlow, Keras, & Deep Learnin...
Data Con LA 2018 - A Tale of DL Frameworks: TensorFlow, Keras, & Deep Learnin...Data Con LA 2018 - A Tale of DL Frameworks: TensorFlow, Keras, & Deep Learnin...
Data Con LA 2018 - A Tale of DL Frameworks: TensorFlow, Keras, & Deep Learnin...
 
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNYARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
 
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...
 
Integrating Deep Learning Libraries with Apache Spark
Integrating Deep Learning Libraries with Apache SparkIntegrating Deep Learning Libraries with Apache Spark
Integrating Deep Learning Libraries with Apache Spark
 

Recently uploaded

Centralized TLS Certificates Management Using Vault PKI + Cert-Manager
Centralized TLS Certificates Management Using Vault PKI + Cert-ManagerCentralized TLS Certificates Management Using Vault PKI + Cert-Manager
Centralized TLS Certificates Management Using Vault PKI + Cert-ManagerSaiLinnThu2
 
Progress Report: Ministry of IT under Dr. Umar Saif Aug 23-Feb'24
Progress Report: Ministry of IT under Dr. Umar Saif Aug 23-Feb'24Progress Report: Ministry of IT under Dr. Umar Saif Aug 23-Feb'24
Progress Report: Ministry of IT under Dr. Umar Saif Aug 23-Feb'24Umar Saif
 
HBR SERIES METAL HOUSED RESISTORS POWER ELECTRICAL ABSORBS HIGH CURRENT DURIN...
HBR SERIES METAL HOUSED RESISTORS POWER ELECTRICAL ABSORBS HIGH CURRENT DURIN...HBR SERIES METAL HOUSED RESISTORS POWER ELECTRICAL ABSORBS HIGH CURRENT DURIN...
HBR SERIES METAL HOUSED RESISTORS POWER ELECTRICAL ABSORBS HIGH CURRENT DURIN...htrindia
 
AI for Educators - Integrating AI in the Classrooms
AI for Educators - Integrating AI in the ClassroomsAI for Educators - Integrating AI in the Classrooms
AI for Educators - Integrating AI in the ClassroomsPremsankar Chakkingal
 
Huntly presentation deck design for Behance
Huntly presentation deck design for BehanceHuntly presentation deck design for Behance
Huntly presentation deck design for Behancewhalesdesign
 
Confoo 2024 Gettings started with OpenAI and data science
Confoo 2024 Gettings started with OpenAI and data scienceConfoo 2024 Gettings started with OpenAI and data science
Confoo 2024 Gettings started with OpenAI and data scienceSusan Ibach
 
Leonis Insights: The State of AI (7 trends for 2023 and 7 predictions for 2024)
Leonis Insights: The State of AI (7 trends for 2023 and 7 predictions for 2024)Leonis Insights: The State of AI (7 trends for 2023 and 7 predictions for 2024)
Leonis Insights: The State of AI (7 trends for 2023 and 7 predictions for 2024)Jay Zhao
 
The Art of the Possible with Graph by Dr Jim Webber Neo4j.pptx
The Art of the Possible with Graph by Dr Jim Webber Neo4j.pptxThe Art of the Possible with Graph by Dr Jim Webber Neo4j.pptx
The Art of the Possible with Graph by Dr Jim Webber Neo4j.pptxNeo4j
 
Automation Ops Series: Session 1 - Introduction and setup DevOps for UiPath p...
Automation Ops Series: Session 1 - Introduction and setup DevOps for UiPath p...Automation Ops Series: Session 1 - Introduction and setup DevOps for UiPath p...
Automation Ops Series: Session 1 - Introduction and setup DevOps for UiPath p...DianaGray10
 
Establishing data sharing standards to promote global industry development
Establishing data sharing standards to promote global industry developmentEstablishing data sharing standards to promote global industry development
Establishing data sharing standards to promote global industry developmentThorsten Huelsmann
 
Mind your App Footprint 🐾⚡️🌱 (@FlutterHeroes 2024)
Mind your App Footprint 🐾⚡️🌱 (@FlutterHeroes 2024)Mind your App Footprint 🐾⚡️🌱 (@FlutterHeroes 2024)
Mind your App Footprint 🐾⚡️🌱 (@FlutterHeroes 2024)François
 
Pragmatic UI testing with Compose Semantics.pdf
Pragmatic UI testing with Compose Semantics.pdfPragmatic UI testing with Compose Semantics.pdf
Pragmatic UI testing with Compose Semantics.pdfinfogdgmi
 
LF Energy Webinar: Introduction to TROLIE
LF Energy Webinar: Introduction to TROLIELF Energy Webinar: Introduction to TROLIE
LF Energy Webinar: Introduction to TROLIEDanBrown980551
 
Dev Dives: Leverage APIs and Gen AI to power automations for RPA and software...
Dev Dives: Leverage APIs and Gen AI to power automations for RPA and software...Dev Dives: Leverage APIs and Gen AI to power automations for RPA and software...
Dev Dives: Leverage APIs and Gen AI to power automations for RPA and software...UiPathCommunity
 
Building Bridges: Merging RPA Processes, UiPath Apps, and Data Service to bu...
Building Bridges:  Merging RPA Processes, UiPath Apps, and Data Service to bu...Building Bridges:  Merging RPA Processes, UiPath Apps, and Data Service to bu...
Building Bridges: Merging RPA Processes, UiPath Apps, and Data Service to bu...DianaGray10
 
Relationship Counselling: From Disjointed Features to Product-First Thinking ...
Relationship Counselling: From Disjointed Features to Product-First Thinking ...Relationship Counselling: From Disjointed Features to Product-First Thinking ...
Relationship Counselling: From Disjointed Features to Product-First Thinking ...Product School
 
ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...
ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...
ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...Neo4j
 
AI improves software testing to be more fault tolerant, focused and efficient
AI improves software testing to be more fault tolerant, focused and efficientAI improves software testing to be more fault tolerant, focused and efficient
AI improves software testing to be more fault tolerant, focused and efficientKari Kakkonen
 
Roundtable_-_API_Research__Testing_Tools.pdf
Roundtable_-_API_Research__Testing_Tools.pdfRoundtable_-_API_Research__Testing_Tools.pdf
Roundtable_-_API_Research__Testing_Tools.pdfMostafa Higazy
 

Recently uploaded (20)

Centralized TLS Certificates Management Using Vault PKI + Cert-Manager
Centralized TLS Certificates Management Using Vault PKI + Cert-ManagerCentralized TLS Certificates Management Using Vault PKI + Cert-Manager
Centralized TLS Certificates Management Using Vault PKI + Cert-Manager
 
Progress Report: Ministry of IT under Dr. Umar Saif Aug 23-Feb'24
Progress Report: Ministry of IT under Dr. Umar Saif Aug 23-Feb'24Progress Report: Ministry of IT under Dr. Umar Saif Aug 23-Feb'24
Progress Report: Ministry of IT under Dr. Umar Saif Aug 23-Feb'24
 
HBR SERIES METAL HOUSED RESISTORS POWER ELECTRICAL ABSORBS HIGH CURRENT DURIN...
HBR SERIES METAL HOUSED RESISTORS POWER ELECTRICAL ABSORBS HIGH CURRENT DURIN...HBR SERIES METAL HOUSED RESISTORS POWER ELECTRICAL ABSORBS HIGH CURRENT DURIN...
HBR SERIES METAL HOUSED RESISTORS POWER ELECTRICAL ABSORBS HIGH CURRENT DURIN...
 
AI for Educators - Integrating AI in the Classrooms
AI for Educators - Integrating AI in the ClassroomsAI for Educators - Integrating AI in the Classrooms
AI for Educators - Integrating AI in the Classrooms
 
In sharing we trust. Taking advantage of a diverse consortium to build a tran...
In sharing we trust. Taking advantage of a diverse consortium to build a tran...In sharing we trust. Taking advantage of a diverse consortium to build a tran...
In sharing we trust. Taking advantage of a diverse consortium to build a tran...
 
Huntly presentation deck design for Behance
Huntly presentation deck design for BehanceHuntly presentation deck design for Behance
Huntly presentation deck design for Behance
 
Confoo 2024 Gettings started with OpenAI and data science
Confoo 2024 Gettings started with OpenAI and data scienceConfoo 2024 Gettings started with OpenAI and data science
Confoo 2024 Gettings started with OpenAI and data science
 
Leonis Insights: The State of AI (7 trends for 2023 and 7 predictions for 2024)
Leonis Insights: The State of AI (7 trends for 2023 and 7 predictions for 2024)Leonis Insights: The State of AI (7 trends for 2023 and 7 predictions for 2024)
Leonis Insights: The State of AI (7 trends for 2023 and 7 predictions for 2024)
 
The Art of the Possible with Graph by Dr Jim Webber Neo4j.pptx
The Art of the Possible with Graph by Dr Jim Webber Neo4j.pptxThe Art of the Possible with Graph by Dr Jim Webber Neo4j.pptx
The Art of the Possible with Graph by Dr Jim Webber Neo4j.pptx
 
Automation Ops Series: Session 1 - Introduction and setup DevOps for UiPath p...
Automation Ops Series: Session 1 - Introduction and setup DevOps for UiPath p...Automation Ops Series: Session 1 - Introduction and setup DevOps for UiPath p...
Automation Ops Series: Session 1 - Introduction and setup DevOps for UiPath p...
 
Establishing data sharing standards to promote global industry development
Establishing data sharing standards to promote global industry developmentEstablishing data sharing standards to promote global industry development
Establishing data sharing standards to promote global industry development
 
Mind your App Footprint 🐾⚡️🌱 (@FlutterHeroes 2024)
Mind your App Footprint 🐾⚡️🌱 (@FlutterHeroes 2024)Mind your App Footprint 🐾⚡️🌱 (@FlutterHeroes 2024)
Mind your App Footprint 🐾⚡️🌱 (@FlutterHeroes 2024)
 
Pragmatic UI testing with Compose Semantics.pdf
Pragmatic UI testing with Compose Semantics.pdfPragmatic UI testing with Compose Semantics.pdf
Pragmatic UI testing with Compose Semantics.pdf
 
LF Energy Webinar: Introduction to TROLIE
LF Energy Webinar: Introduction to TROLIELF Energy Webinar: Introduction to TROLIE
LF Energy Webinar: Introduction to TROLIE
 
Dev Dives: Leverage APIs and Gen AI to power automations for RPA and software...
Dev Dives: Leverage APIs and Gen AI to power automations for RPA and software...Dev Dives: Leverage APIs and Gen AI to power automations for RPA and software...
Dev Dives: Leverage APIs and Gen AI to power automations for RPA and software...
 
Building Bridges: Merging RPA Processes, UiPath Apps, and Data Service to bu...
Building Bridges:  Merging RPA Processes, UiPath Apps, and Data Service to bu...Building Bridges:  Merging RPA Processes, UiPath Apps, and Data Service to bu...
Building Bridges: Merging RPA Processes, UiPath Apps, and Data Service to bu...
 
Relationship Counselling: From Disjointed Features to Product-First Thinking ...
Relationship Counselling: From Disjointed Features to Product-First Thinking ...Relationship Counselling: From Disjointed Features to Product-First Thinking ...
Relationship Counselling: From Disjointed Features to Product-First Thinking ...
 
ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...
ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...
ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...
 
AI improves software testing to be more fault tolerant, focused and efficient
AI improves software testing to be more fault tolerant, focused and efficientAI improves software testing to be more fault tolerant, focused and efficient
AI improves software testing to be more fault tolerant, focused and efficient
 
Roundtable_-_API_Research__Testing_Tools.pdf
Roundtable_-_API_Research__Testing_Tools.pdfRoundtable_-_API_Research__Testing_Tools.pdf
Roundtable_-_API_Research__Testing_Tools.pdf
 

Scaling Deep Learning on Hadoop at LinkedIn

  • 1. Anthony Hsu Staff Software Engineer Scaling Deep Learning on Hadoop at LinkedIn DataWorks Summit, Washington, D.C., May 23, 2019
  • 2. About Me: Anthony Hsu • https://www.linkedin.com/in/erwaman/ • Staff Software Engineer at LinkedIn working on the Hadoop Dev team • Been working in the Hadoop space for 5.5 years on workflow scheduling (Azkaban), dataset access (Dali), machine learning infra (TonY, this talk)
  • 3. LinkedIn's Vision Create economic opportunity for every member of the global workforce 630M Members 30M Companies 20M Jobs 50K Skills 90K Schools
  • 4. Machine Learning at LinkedIn People You May Know Job Recommendations News Feed LinkedIn Learning Recommendations 4
  • 5. Why Deep Learning? 5 Building AI Applications Using Deep Learning https://blog.easysol.net/building-ai-applications/ • Prediction accuracy of traditional ML models tends to plateau quickly as data increases • Deep networks continue to improve as data increases
  • 6. Which framework to use? 6 Andrej Karpathy, Director of AI at Tesla https://twitter.com/karpathy/status/972295865187512320
  • 7. Machine Learning process • ML process has many parts 7 Data Ingestion Data Preparation Model Training Model Deployment Model Serving
  • 8. Machine Learning process • ML process has many parts • At LinkedIn, we have a Productive ML (Pro-ML) initiative to accelerate this loop. We have teams working on every part of the ML pipeline. 8 Data Ingestion Data Preparation Model Training Model Deployment Model Serving
  • 9. Machine Learning process • ML process has many parts • At LinkedIn, we have a Productive ML (Pro-ML) initiative to accelerate this loop. We have teams working on every part of the ML pipeline. • This talk will focus on model training. 9 Data Ingestion Data Preparation Model Training Model Deployment Model Serving
  • 10. Early days: how AI engineers did training • Copy code and dependencies to each host • Manually specify host and port of each process • Customize arguments for each process 10 # On ps0.example.com: $ python trainer.py --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 --job_name=ps --task_index=0 # On ps1.example.com: $ python trainer.py --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 --job_name=ps --task_index=1 # On worker0.example.com: $ python trainer.py --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 --job_name=worker --task_index=0 # On worker1.example.com: $ python trainer.py --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 --job_name=worker --task_index=1 Source: https://github.com/tensorflow/examples/blob/master/community/en/docs/deploy/distributed.md
  • 11. Challenges of scaling up training • Managing code and dependencies • Orchestrating distributed training • Resource contention (especially for GPUs) • Managing an ML workflow (data preparation, training, deployment) • Fault tolerance 11 E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 693.00M (726663168 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
  • 12. Existing YARN features to leverage • YARN is Hadoop's scheduler 12
  • 13. Existing YARN features to leverage • YARN is Hadoop's scheduler • YARN supports ○ GPU resources and other resource types 13
  • 14. Existing YARN features to leverage • YARN is Hadoop's scheduler • YARN supports ○ GPU resources and other resource types ○ Team-based and hierarchical queues 14
  • 15. Existing YARN features to leverage • YARN is Hadoop's scheduler • YARN supports ○ GPU resources and other resource types ○ Team-based and hierarchical queues ○ Elasticity between queues 15
  • 16. Existing YARN features to leverage • YARN is Hadoop's scheduler • YARN supports ○ GPU resources and other resource types ○ Team-based and hierarchical queues ○ Elasticity between queues ○ User-based limits 16
  • 17. New and upcoming YARN features useful for ML • Docker container support productionized in Hadoop 3.x • YARN Native Service in Hadoop 3.x • Submarine ML CLI released in Hadoop 3.2.0, now its own Hadoop subproject 17
  • 18. How can we do distributed training on YARN? • Want to take a program developed on a single machine and run it in distributed mode with little or no modifications • Want to take advantage of YARN's features • Some existing open-source solutions we looked at: ○ Kubeflow (Google) ○ TensorFlow on Spark (Yahoo!) ○ Spark Deep Learning (Databricks) ○ TOY: TensorFlow on YARN (Intel) ○ XLearning (Qihoo) ○ Horovod (Uber) ○ YARN Native Service (in Hadoop 3.x) 18
  • 19. Kubeflow + Kubernetes • Kubeflow is an ML toolkit built on Kubernetes ○ Has a rich ecosystem and active community • Kubernetes is one of the most popular cluster managers • Challenges in adopting Kubernetes at LinkedIn ○ Large investment in YARN ■ Many clusters of 1000s of nodes (our largest is ~6000) ■ Expertise and tooling for YARN ○ Scalability: "No more than 5000 nodes" (https://kubernetes.io/docs/setup/cluster-large/) ○ Need to integrate with Hadoop security (Kerberos and Hadoop delegation tokens) ○ Lack of hierarchical namespaces 19
  • 20. Spark-based solutions • TensorFlow on Spark (Yahoo!) • Spark Deep Learning (Databricks) • Pros ○ Integrates well with native Spark processing • Cons ○ GPU resource requests not supported until Spark 3.0 (SPARK-20327) ○ No heterogeneous resource support (e.g.: more memory + GPUs for workers, less memory + only CPUs for parameter servers) 20
  • 21. YARN-native solutions • TOY: TensorFlow on YARN (Intel) • XLearning (Qihoo) • Pros ○ Works with YARN out-of-the-box • Cons ○ No GPU resource support 21
  • 22. Horovod • Horovod (Uber) • Wraps existing optimizer to allow synchronous distributed training • Works with many frameworks (TensorFlow, PyTorch, Keras, MXNet) • Uses MPI or NCCL for communication ○ Multi-node MPI on YARN requires Docker containers running sshd daemons 22
  • 23. YARN Native Service • YARN Native Service (available in Hadoop 3.x) • Configure distributed training jobs via XML, YAML, or JSON config file • Distributed TensorFlow requires deploying YARN DNS Registry and ZooKeeper • Relatively new, LinkedIn is still on Hadoop 2.x 23
  • 24. Summary of open-source solutions Open-source solution Pros Cons Kubeflow / Kubernetes (Google) ● Large marketplace of libraries and plugins ● Active community ● Does not run on Hadoop ● May not scale to very large clusters TensorFlow on Spark (Yahoo!) Spark Deep Learning (Databricks) ● Integrates with Spark ● No GPU resource support until Spark 3.0 (SPARK-20327) ● No heterogeneous resource support TOY: TensorFlow on YARN (Intel) XLearning (Qihoo) ● YARN native, works out-of-the-box ● No GPU resource support Horovod (Uber) ● Supports synchronous distributed training ● MPI on YARN requires Docker YARN Native Service ● YARN native ● Distributed TensorFlow requires YARN DNS Registry and ZooKeeper 24
  • 25. Building our own solution: TonY • TonY is a YARN application for running distributed ML jobs • We started with TensorFlow support (hence TensorFlow on YARN (TonY)) • Now we also support PyTorch and Horovod (so perhaps Things on YARN is more apt) 25
  • 26. A Comparison of MapReduce, Spark, and TonY 26 Map task Map task Map task Reduce task Reduce task Spark executor Spark executor Spark executor Spark executor Foo task Foo task Foo task Bar task Bar task Qux task MapReduce • 2 task types • Map tasks connected to Reduce tasks Spark • 1 task type • All connected to all TonY • N task types • Heterogeneous connections Baz task
  • 27. TonY supports many different models 27 Scoring task Scoring task Scoring task Scoring task Scoring task Parallel tasks, no communication Worker task Worker task Worker task Parameter server task Parameter server task Worker + Parameter Server Model Worker task Worker task Worker task Worker task Ring All-Reduce Model
  • 28. TonY also supports more exotic setups 28 Worker task Worker task Worker task Parameter server task Parameter server task Worker-PS with chief worker and evaluator Chief worker task Evaluator task Worker task Worker task Worker task Worker task Ring All-Reduce with in-memory distributed hash table (DHT) DHT task DHT task DHT task
  • 29. TonY supports multiple frameworks 29
  • 30. TonY under the hood 30
  • 31. TonY under the hood 31 TonY Client YARN ResourceManager TonY component YARN component
  • 32. TonY under the hood 32 TonY Client YARN ResourceManager TonY ApplicationMaster TonY component YARN component YARN container
  • 33. TonY under the hood 33 TonY Client YARN ResourceManager TonY ApplicationMaster TonY Task Executor TonY Task Executor TonY Task Executor TonY component YARN component YARN container
  • 34. TonY under the hood 34 TonY Client YARN ResourceManager TonY ApplicationMaster TonY Task Executor TensorFlow Worker Task TonY Task Executor TensorFlow Worker Task TonY Task Executor TensorFlow Parameter Server Task TonY component TensorFlow component YARN component YARN container
  • 35. TonY under the hood 35 TonY Client YARN ResourceManager TonY ApplicationMaster TonY Task Executor TensorFlow Worker Task TonY Task Executor TensorFlow Worker Task TonY Task Executor TensorFlow Parameter Server Task TonY component TensorFlow component YARN component YARN container
  • 36. TonY under the hood 36 TonY Client YARN ResourceManager TonY ApplicationMaster TonY Task Executor TensorFlow Worker Task TonY Task Executor TensorFlow Worker Task TonY Task Executor TensorFlow Parameter Server Task TonY component TensorFlow component YARN component YARN container
  • 38. Related YARN changes 38 • Backport of GPU support to Hadoop 2.x (YARN-8200)
  • 39. Related YARN changes 39 • Backport of GPU support to Hadoop 2.x (YARN-8200) • Support for updating tracking URL (YARN-7974) ○ Contributed to Hadoop 2.x and 3.x
  • 40. Using TonY • TonY client lets you easily launch a job with only a few required arguments 40 java -cp `hadoop classpath`:tony-cli-0.3.7-all.jar com.linkedin.tony.cli.ClusterSubmitter --python_venv=venv.zip --python_binary_path=Python/bin/python --src_dir=src --executes=my_model.py --conf_file=tony-test.xml
  • 41. Using TonY • For a list of all configurations, see https://github.com/linkedin/Ton Y/wiki/TonY-Configurations 41 <configuration> <property> <name>tony.worker.instances</name> <value>3</value> </property> <property> <name>tony.worker.gpus</name> <value>1</value> </property> <property> <name>tony.ps.instances</name> <value>1</value> </property> </configuration> • Example configuration file:
  • 42. Using TonY $ java ... com.linkedin.tony.cli.ClusterSubmitter ... ... INFO impl.YarnClientImpl: Submitted application application_XXX INFO tony.TonyClient: URL to track running application (will proxy to TensorBoard once it has started): http://... INFO tony.TonyClient: ResourceManager web address for application: http://... ... INFO tony.TonyClient: Logs for ps 0 at: http://... INFO tony.TonyClient: Logs for worker 0 at: http://... INFO tony.TonyClient: Logs for worker 1 at: http://... INFO tony.TonyClient: Logs for worker 2 at: http://...
  • 43. TonY Portal for accessing job events and configs 43
  • 44. Using TonY to launch notebooks and tools on demand • TonY can be used to launch ○ Jupyter notebooks ○ TensorBoard ○ MLflow ○ etc. • Run any Python virtual environment, PEX, or shiv • Run any Docker image 44
  • 45. TonY is open-source • Open-source repo: https://github.com/linkedin/tony ○ Contributions welcome! • OpML '19 paper: https://arxiv.org/abs/1904.01631 (presented 3 days ago) • LinkedIn engineering blog post: https://bit.ly/2O6L5WD 45
  • 46. TonY integrations with other projects
  • 47. Azkaban workflow scheduler integration • Azkaban is a workflow scheduler for Hadoop • Run TonY jobs inside a workflow that includes Spark and other data processing jobs 47
  • 48. TonY job tuning recommendations by Dr. Elephant 48 • Dr. Elephant is a job tuning and performance analysis tool for Hadoop jobs.
  • 49. Run TonY on Google Cloud DataProc • DataProc lets you run Hadoop and Spark on Google's Cloud • TonY setup script for DataProc: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/tony • TonY on DataProc blog post: https://bit.ly/2HEYemT 49
  • 50. TonY runtime for Hadoop Submarine • Submarine is a deep learning CLI for Hadoop • TonY is a supported runtime implementation for Submarine (SUBMARINE-40, in Submarine 0.2.0) 50
  • 51. TonY on Microsoft Azure HDInsight (coming soon) • HDInsight lets you run open-source frameworks on Azure, including Hadoop, Spark, and Kafka • TonY integration is coming soon 51 +
  • 52. Demo 52 • Live demo using TonY Client from CLI • Video of using TonY job in Azkaban: https://youtu.be/DM89y8BGFaY
  • 53. Future Work • GPU metrics + tuning suggestions for Dr. Elephant • Expand TonY Portal to support launching notebooks, visualization, and managing experiments • TonY CLI + Python library • TonY support on Azure HDInsight • TonY support for other ML frameworks, schedulers, and cloud services 53 + ?