Introduction to DL Platform
Changjian Gao
Table of Contents
• Intro

• Why?

• Goals & Non-Goals

• Heterogeneous Resources and Multi-tenant

• Distributed Training

• Deep Learning as Software Engineering
Deep Learning Frameworks
Why DL platform?
Hidden Technical Debt in Machine Learning Systems, NIPS’15
Goals
• Deep Learning as Software Engineering (think about CI/CD)

• Heterogeneous resource management (CPU, GPU etc.)

• Multi-tenant management (sharing and isolation)

• Distributed training

• Support for multiple DL frameworks

• Easy tuning and diagnosis (logs, metrics, profiling etc.)

• User-friendly interface (CLI, Web UI etc.)

• AutoML

• Feature and model sharing

• Maybe: elastic DL, model zoo
Non-Goals
• Invent yet another DL framework

• Intrusive design
Heterogeneous Resources and Multi-tenant
K8s
• Good
  • Good for heterogeneous resource management and isolation

  • Basic multi-tenant management (namespace etc.)

  • PVCs make data isolation easy

  • Active community

• Bad
  • Batch workload scheduling

  • Flexible multi-tenant management

  • YAML isn’t user-friendly (too many trivial details)

  • So many new concepts (pod, service, deployment etc.)
K8s - Scheduling
• The default scheduler isn’t suited to batch workloads

• DL jobs are usually batch workloads (especially distributed training)

• What we miss from other schedulers (e.g. YARN):

  • Gang scheduling (a.k.a. coscheduling)

  • Fair-share and capacity scheduler

  • Queue

  • Priority

  • Preemption
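Gang scheduling and priority from the list above can be sketched as all-or-nothing admission: a job starts only if every one of its workers fits at once. This is a minimal illustration, not code from YARN or any K8s scheduler; `Job`, `gang_schedule`, and the GPU counts are hypothetical names and numbers.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    workers: int          # number of workers that must start together
    gpus_per_worker: int  # GPUs each worker needs
    priority: int = 0     # larger value is scheduled first


def gang_schedule(jobs, free_gpus):
    """Admit jobs all-or-nothing, highest priority first.

    Returns (admitted job names, GPUs still free). A job is admitted only
    if all of its workers fit; it never holds a partial allocation.
    """
    admitted = []
    # sorted() is stable, so jobs with equal priority keep queue order.
    for job in sorted(jobs, key=lambda j: -j.priority):
        need = job.workers * job.gpus_per_worker
        if need <= free_gpus:
            free_gpus -= need
            admitted.append(job.name)
    return admitted, free_gpus


queue = [
    Job("train-a", workers=4, gpus_per_worker=2),              # needs 8
    Job("train-b", workers=2, gpus_per_worker=2, priority=1),  # needs 4
    Job("train-c", workers=1, gpus_per_worker=2),              # needs 2
]
admitted, left = gang_schedule(queue, free_gpus=8)
print(admitted, left)
```

Without the all-or-nothing check, train-a could hold 4 of its 8 GPUs while waiting for the rest, which is the partial-allocation deadlock that gang scheduling is meant to avoid.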
K8s - Scheduling
• Volcano
  • Batch system built on K8s

  • CNCF sandbox project

  • Led by Huawei Cloud

• SIG Scheduling
  • K8s scheduling framework (since 1.15)

  • Led by IBM and Alibaba Cloud

  • Scheduler Plugins
K8s - GPU Sharing
• GPU sharing is hard

• Current solutions:

  • GPU Sharing Scheduler Extender (Alibaba Cloud)

  • GPU Manager (Tencent Cloud)

  • Virtual GPU Device Plugin (AWS)

  • Multi-Instance GPUs (Nvidia)
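To make the sharing problem concrete, here is a minimal sketch that assumes each pod declares a GPU memory request and that pods are placed first-fit. It is not the algorithm of any project listed above; `place_pods` and all numbers are hypothetical.

```python
def place_pods(requests_gib, gpu_capacity_gib):
    """First-fit placement of pod memory requests onto GPUs.

    requests_gib: dict pod name -> requested GPU memory (GiB)
    gpu_capacity_gib: list of per-GPU memory capacity (GiB)
    Returns dict pod name -> GPU index, or None when nothing fits.
    """
    free = list(gpu_capacity_gib)
    placement = {}
    for pod, need in requests_gib.items():
        placement[pod] = None
        for idx, avail in enumerate(free):
            if need <= avail:
                free[idx] -= need
                placement[pod] = idx
                break
    return placement


placement = place_pods(
    {"notebook": 4, "inference": 6, "train": 12, "debug": 8},
    gpu_capacity_gib=[16, 16],
)
print(placement)
```

The sketch only does bookkeeping on memory. It does not isolate compute or enforce the limits at runtime, which is a large part of why GPU sharing is hard.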
Distributed Training
Goals
• High scaling efficiency
Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow
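The slide gives no formula for scaling efficiency. One common definition, assumed here, is measured throughput on N workers divided by N times the single-worker throughput. The function name and the numbers below are hypothetical.

```python
def scaling_efficiency(throughput_n, n_workers, throughput_1):
    """Measured throughput on n workers divided by ideal linear throughput."""
    return throughput_n / (n_workers * throughput_1)


# Hypothetical numbers: 100 images/s on 1 GPU, 6400 images/s on 128 GPUs.
eff = scaling_efficiency(6400.0, 128, 100.0)
print(f"{eff:.0%}")
```

An efficiency of 1.0 would mean perfectly linear scaling; the gap below 1.0 is mostly communication and synchronization overhead.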
Parameter Server
Large Scale Distributed Deep Networks, NIPS’12
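The parameter server idea can be sketched in a few lines: workers pull the current weights, compute a gradient on their own data shard, and push it back to a server that applies the update. This is a single-process illustration under simplifying assumptions (two workers run in turn, a tiny linear model, plain SGD); it is not the system from the cited paper, and all names are hypothetical.

```python
class ParameterServer:
    """Holds the shared weights and applies gradients pushed by workers."""

    def __init__(self, weights, lr):
        self.weights = list(weights)
        self.lr = lr

    def pull(self):
        return list(self.weights)

    def push(self, grads):
        self.weights = [w - self.lr * g for w, g in zip(self.weights, grads)]


def worker_gradient(weights, shard):
    """Gradient of mean squared error for y = w0 * x + w1 on one data shard."""
    g0 = g1 = 0.0
    for x, y in shard:
        err = weights[0] * x + weights[1] - y
        g0 += 2 * err * x
        g1 += 2 * err
    return [g0 / len(shard), g1 / len(shard)]


# Data from y = 2x + 1, split across two workers.
shards = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
ps = ParameterServer([0.0, 0.0], lr=0.05)
for _ in range(2000):
    for shard in shards:  # each worker: pull, compute, push
        ps.push(worker_gradient(ps.pull(), shard))
print([round(w, 2) for w in ps.weights])
```

In a real deployment the workers run concurrently on separate machines, so every pull and push crosses the network and the server can become the bandwidth bottleneck as workers are added.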
Ring Allreduce
Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow
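Ring allreduce can be simulated in a single process to show the two phases: a scatter-reduce that leaves each worker with one fully summed chunk, then an allgather that circulates those chunks. This is a sketch for illustration only, not Horovod code; `ring_allreduce` is a hypothetical helper and assumes the vector length is divisible by the number of workers.

```python
def ring_allreduce(vectors):
    """Sum equal-length vectors across workers with a simulated ring.

    vectors[i] is worker i's local data; its length must be divisible by
    the number of workers. Returns each worker's buffer after the
    reduction; every buffer equals the element-wise sum.
    """
    n = len(vectors)
    size = len(vectors[0]) // n  # elements per chunk
    bufs = [list(v) for v in vectors]

    def chunk(c):
        return slice(c * size, (c + 1) * size)

    # Phase 1, scatter-reduce: after n - 1 steps worker i holds the
    # complete sum for chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing chunks so all sends in a step are simultaneous.
        sends = [((i - step) % n, bufs[i][chunk((i - step) % n)])
                 for i in range(n)]
        for i, (c, data) in enumerate(sends):
            dst = (i + 1) % n
            bufs[dst][chunk(c)] = [a + b for a, b in zip(bufs[dst][chunk(c)], data)]

    # Phase 2, allgather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        sends = [((i + 1 - step) % n, bufs[i][chunk((i + 1 - step) % n)])
                 for i in range(n)]
        for i, (c, data) in enumerate(sends):
            bufs[(i + 1) % n][chunk(c)] = data

    return bufs


grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
result = ring_allreduce(grads)
print(result)
```

Each worker sends 2(N - 1) chunks, each 1/N of the data, so per-worker traffic stays roughly constant as N grows and no single node has to receive every gradient.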
K8s - Operator
• The Operator pattern aims to capture the key aim of a human operator
who is managing a service or set of services

• Introduced by CoreOS (since acquired by Red Hat)

• Useful operators for distributed training:

  • kubeflow/tf-operator (TensorFlow, PS mode)

  • kubeflow/pytorch-operator (PyTorch, master/worker mode)

  • kubeflow/mxnet-operator (MXNet, PS mode)

  • kubeflow/mpi-operator (Any framework, Allreduce mode)
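At the core of the Operator pattern is a reconcile step that compares the desired spec with the observed state and emits the actions needed to close the gap. The sketch below shows that idea for a training job's replicas. It does not use the Kubernetes API or any of the operators listed above; `reconcile` and the pod naming scheme are hypothetical.

```python
def reconcile(desired_replicas, running_pods):
    """Return the actions that move observed state toward the desired spec.

    desired_replicas: dict role -> wanted pod count, e.g. {"worker": 4}
    running_pods: dict role -> list of pod names currently running
    """
    actions = []
    for role, want in desired_replicas.items():
        have = running_pods.get(role, [])
        # Create missing pods with stable, index-based names.
        for idx in range(len(have), want):
            actions.append(("create", f"{role}-{idx}"))
        # Delete surplus pods, highest index first.
        for name in reversed(have[want:]):
            actions.append(("delete", name))
    return actions


spec = {"ps": 1, "worker": 3}
observed = {"ps": ["ps-0"], "worker": ["worker-0"]}
actions = reconcile(spec, observed)
print(actions)
```

A real operator runs this kind of step repeatedly, triggered by changes to the custom resource or its pods, so the same logic handles initial creation, scaling, and recovery after a pod failure.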
Deep Learning as Software Engineering
Kubeflow Pipelines
• Reusable end-to-end ML workflows built using the Kubeflow Pipelines
SDK

• Integrated with K8s from day one (Kubeflow = Kubernetes + Workflow)

• DAG orchestration based on Argo

• Relies heavily on K8s operators (i.e. CRDs)

• Web UI and API

• Led by Google Cloud
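DAG orchestration means a step runs only after all of its dependencies have finished, and independent steps can run in parallel. The sketch below shows that ordering with Python's standard-library `graphlib`. It does not use the Kubeflow Pipelines SDK or Argo, and the pipeline steps are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each step maps to the set of steps it depends on.
pipeline = {
    "preprocess": set(),
    "train": {"preprocess"},
    "evaluate": {"train"},
    "export": {"train"},
    "deploy": {"evaluate", "export"},
}

sorter = TopologicalSorter(pipeline)
sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle

waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # steps whose dependencies are done
    waves.append(ready)                 # these could run in parallel
    sorter.done(*ready)

print(waves)
```

Here `evaluate` and `export` land in the same wave because both depend only on `train`, while `deploy` waits for both of them.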
MLflow
• An open source platform for the machine learning lifecycle

• Experimental integration with K8s

• Relies on the K8s Job resource

• Web UI and API

• Led by Databricks
Thanks
