9. K8s
• Good
• Good at heterogeneous resource management and isolation
• Basic multi-tenant management (namespaces, etc.)
• PVCs make data isolation easy
• Active community
• Bad
• Weak batch workload scheduling
• Multi-tenant management isn't flexible enough
• YAML isn't user-friendly (too verbose and fiddly)
• Many new concepts to learn (Pod, Service, Deployment, etc.)
10. K8s - Scheduling
• The default scheduler isn't suited to batch workloads
• DL jobs are usually batch workloads (especially distributed training)
• What we miss from other schedulers (e.g. YARN):
• Gang scheduling (a.k.a. coscheduling)
• Fair-share and capacity scheduling
• Queues
• Priority
• Preemption
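
Gang scheduling from the list above can be illustrated with a toy admission check: the pods of a job are placed only if all of them fit at once, otherwise none are. This is a minimal sketch under simplifying assumptions (GPUs are the only resource, first-fit decreasing placement), not the algorithm used by Volcano or any real scheduler. The names `gang_schedule`, `free_gpus` and `pod_gpu_requests` are hypothetical.

```python
from typing import Dict, List, Optional


def gang_schedule(free_gpus: Dict[str, int],
                  pod_gpu_requests: List[int]) -> Optional[List[str]]:
    """Place every pod of a job, or none of them (all-or-nothing).

    Returns one node name per pod, or None if the whole gang does not fit.
    First-fit decreasing is a heuristic and may miss some feasible placements.
    """
    remaining = dict(free_gpus)  # work on a copy so a failed attempt reserves nothing
    placement: List[Optional[str]] = [None] * len(pod_gpu_requests)
    # Try the largest requests first.
    order = sorted(range(len(pod_gpu_requests)), key=lambda i: -pod_gpu_requests[i])
    for i in order:
        need = pod_gpu_requests[i]
        node = next((n for n, free in remaining.items() if free >= need), None)
        if node is None:
            return None  # one pod cannot fit, so the whole gang is rejected
        remaining[node] -= need
        placement[i] = node
    free_gpus.update(remaining)  # commit only after every pod has a node
    return placement
```

With `{"node-a": 2, "node-b": 1}` as free GPUs, a three-worker job asking for one GPU each is admitted, while a four-worker job is rejected without holding any GPU. This avoids the deadlock where two half-started distributed training jobs wait on each other's resources.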
11. K8s - Scheduling
• Volcano
• Batch system built on K8s
• CNCF sandbox project
• Led by Huawei Cloud
• SIG Scheduling
• K8s scheduling framework (since 1.15)
• Led by IBM and Alibaba Cloud
• Scheduler Plugins
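
The scheduling framework lets plugins hook into stages of the scheduling cycle, such as Filter (rule out infeasible nodes) and Score (rank the remaining ones). Real plugins are written in Go against the scheduler's plugin interfaces. The Python below is only a conceptual sketch of the filter-then-score idea, and every name in it (`pick_node`, `enough_gpus`, `least_leftover`, the `free_gpus` field) is hypothetical.

```python
from typing import Callable, Dict, List, Optional

Node = Dict[str, int]  # hypothetical node record, e.g. {"free_gpus": 4}
FilterFn = Callable[[Node, int], bool]
ScoreFn = Callable[[Node, int], int]


def enough_gpus(node: Node, gpus_needed: int) -> bool:
    """Filter step: drop nodes that cannot run the pod at all."""
    return node["free_gpus"] >= gpus_needed


def least_leftover(node: Node, gpus_needed: int) -> int:
    """Score step: prefer the tightest fit (bin packing); higher is better."""
    return -(node["free_gpus"] - gpus_needed)


def pick_node(nodes: Dict[str, Node], gpus_needed: int,
              filters: List[FilterFn], scorers: List[ScoreFn]) -> Optional[str]:
    """Run all filters, then pick the feasible node with the highest total score."""
    feasible = [name for name, node in nodes.items()
                if all(f(node, gpus_needed) for f in filters)]
    if not feasible:
        return None  # the pod stays pending in this cycle
    return max(feasible,
               key=lambda name: sum(s(nodes[name], gpus_needed) for s in scorers))
```

Keeping filters and scorers as plain lists of functions mirrors the plugin idea: batch features can be added as extra plugins rather than by replacing the scheduler.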
17. K8s - Operator
• The Operator pattern aims to capture the key aim of a human operator who is managing a service or set of services
• Introduced by CoreOS (since acquired by Red Hat)
• Useful operators for distributed training:
• kubeflow/tf-operator (TensorFlow, PS mode)
• kubeflow/pytorch-operator (PyTorch, PS mode)
• kubeflow/mxnet-operator (MXNet, PS mode)
• kubeflow/mpi-operator (Any framework, Allreduce mode)
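
At its core, an operator is a control loop that compares the desired state declared in a custom resource with the observed state and acts on the difference. The sketch below shows one pass of such a loop for a hypothetical training job with `ps` and `worker` replicas. It keeps pods as an in-memory set of names and does not call the Kubernetes API; `desired_pods` and `reconcile` are illustrative names, not functions from any of the operators listed above.

```python
from typing import Dict, List, Set, Tuple


def desired_pods(job_name: str, replicas: Dict[str, int]) -> Set[str]:
    """Expand a hypothetical job spec, e.g. {"ps": 1, "worker": 2}, into pod names."""
    return {f"{job_name}-{role}-{i}"
            for role, count in replicas.items()
            for i in range(count)}


def reconcile(job_name: str, replicas: Dict[str, int],
              running: Set[str]) -> Tuple[List[str], List[str]]:
    """One pass of the control loop: return (pods to create, pods to delete)."""
    want = desired_pods(job_name, replicas)
    return sorted(want - running), sorted(running - want)
```

Calling `reconcile` again once the cluster matches the spec returns two empty lists, which is the idempotence that lets an operator run the loop repeatedly without side effects.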
19. Kubeflow Pipelines
• Reusable end-to-end ML workflows built using the Kubeflow Pipelines SDK
• Integrated with K8s from day one (Kubeflow = Kubernetes + Workflow)
• DAG orchestration based on Argo
• Relies heavily on K8s operators (i.e. CRDs)
• Web UI and API
• Led by Google Cloud
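
DAG orchestration means that a step starts only after all of its upstream steps have finished, and independent steps may run in parallel. The sketch below illustrates that ordering with Python's standard `graphlib` module (Python 3.9+). It does not use the Kubeflow Pipelines SDK or Argo, and the step names and the `execution_waves` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical ML workflow: each step maps to the set of steps it depends on.
pipeline = {
    "preprocess": set(),
    "train": {"preprocess"},
    "evaluate": {"train"},
    "export": {"train"},
    "deploy": {"evaluate", "export"},
}


def execution_waves(dag):
    """Group steps into waves; steps in the same wave do not depend on each other."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every step whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock downstream steps
    return waves
```

For the workflow above, `evaluate` and `export` land in the same wave because both depend only on `train`, so an orchestrator could run them as separate pods at the same time.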
21. MLflow
• An open source platform for the machine learning lifecycle
• Experimental K8s integration
• Relies on the K8s Job resource
• Web UI and API
• Led by Databricks
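
Relying on the K8s Job resource means that a training run is wrapped in a `batch/v1` Job object, which runs the container to completion. The sketch below builds such a manifest as a plain Python dict. The field names follow the `batch/v1` Job schema, but the helper `training_job_manifest`, the image and the command are hypothetical, and this is not MLflow's own job template or code.

```python
import json
from typing import Any, Dict, List


def training_job_manifest(name: str, image: str, command: List[str],
                          namespace: str = "default") -> Dict[str, Any]:
    """Build a minimal batch/v1 Job manifest for a run-to-completion task."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "backoffLimit": 0,  # assumption: do not retry a failed training run
            "template": {
                "spec": {
                    "restartPolicy": "Never",  # Jobs accept only Never or OnFailure
                    "containers": [
                        {"name": name, "image": image, "command": command}
                    ],
                }
            },
        },
    }
```

Serializing the result with `json.dumps(manifest, indent=2)` gives a document that `kubectl apply -f` accepts, since JSON is valid input wherever YAML manifests are expected.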