SlideShare a Scribd company logo
1 of 28
Download to read offline
Armada
Batch Workloads Across 10s of Kubernetes Clusters
Who am I
● Background in Chemistry and High Performance Computing
● Entrypoint into Kubernetes was focused on enabling running scientific
workflows across high performance computing clusters and/or Kubernetes
clusters
● Working at G-Research Open Source focused on enabling batch workloads
on multiple kubernetes clusters
G-Research
● FinTech company based out of London (plus an office in Dallas)
● Builds/Operates a large research cluster for quantitative researchers
Batch Introduction
● Finite lifetime computational jobs
● Computational Science
○ Molecular Modeling, Protein Folding, Fluid Dynamics, etc
● Machine Learning
● ETL/ELT data workloads
● Genomics
Embarrassingly Parallel
Input 1
Input 2
Input x
Process 1
Process 2
Process x
Gather
Data
Parallel Computing
Request 2 nodes
Process 1 Process 2
● Message Passing Interface (MPI)
● GPU to GPU communication
SLURM HTCondor
Research Platform
● Researchers run large numbers of batch jobs.
● Jobs are being ported from HTCondor on Windows to Kubernetes on Linux
● Need a queue/scheduler to which jobs are submitted to run “somewhere”
● Target:
○ Manage 100k nodes, 1 M cores and many GPUs
○ Handle submitting ~10k jobs over a “short” span
○ Schedule jobs from single-core to multi-node
○ Divide resources fairly between users
○ Provide detailed visibility for users and admins
Kubernetes Jobs
Batch Processing With Kubernetes
● 10k nodes with Kubernetes is nontrivial
○ https://openai.com/blog/scaling-kubernetes-to-7500-nodes/
● Lack of Queuing system
● Initial support on Job API was for parallel jobs with little communication
● Scheduling is for individual pods
Kubernetes Batch Processing Projects
● Volcano
○ Extend Kubernetes via Custom Resource Definitions
○ Single cluster
○ Own scheduler
○ Queue
● Armada
○ Multiple kubernetes clusters
○ Use Kubernetes scheduler
○ Queue
Extend Kubernetes for batch
Armada Solution
● Use multiple kubernetes clusters
● Kubernetes native scheduler
● Use of pods rather than jobs
● Subscribe to events for a given grouping of jobs
Core Concepts
Job bag of K8s resources (e.g., a pod spec + ingress) to be created +
metadata.
Job Set group of jobs managed as a unit.
Queue represents a user or project- resources are divided fairly between queues.
user guide for more details
Armada Architecture
Queue
● Queue specifies resource
requirements
● RBAC for queue
Job Spec
● API takes a pod spec
● Can use common K8
resources, taints, tolerations,
etc
Clients
● DotNet client
● Python client
● Go client
● CLI
Lookout
● React UI for monitoring progress
Lookout
Roadmap
● Adding ability to support gang scheduling (all or nothing)
○ Extend the API to allow grouping of jobs
● Preemption
● Better Observability/Admin support
● Future Integrations with open source software
Getting Involved
● Kubernetes Working Group Focused on Adding Batch Capabilities
○ https://github.com/kubernetes/community/tree/master/wg-batch
○ Adding Queueing, PodGroups, Enhancing JobApi
● Research User Group
○ CNCF group focusing on enabling research workloads on cloud native architecture
○ https://github.com/cncf/research-user-group
● CNCF Batch Working Group
○ https://github.com/cncf/tag-runtime/blob/master/wg/bsi.md
○ Best practices for batch in K8, etc
● Armada Github
○ https://github.com/G-Research/armada
○ Slack Channel https://cloud-native.slack.com/archives/C03T9CBCEMC
Questions?

More Related Content

Similar to Armada: Batch Workloads Across 10s of Kubernetes Clusters

Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)Kevin Lynch
 
[WSO2Con EU 2018] Deploying Applications in K8S and Docker
[WSO2Con EU 2018] Deploying Applications in K8S and Docker[WSO2Con EU 2018] Deploying Applications in K8S and Docker
[WSO2Con EU 2018] Deploying Applications in K8S and DockerWSO2
 
Flink Forward San Francisco 2019: Managing Flink on Kubernetes - FlinkK8sOper...
Flink Forward San Francisco 2019: Managing Flink on Kubernetes - FlinkK8sOper...Flink Forward San Francisco 2019: Managing Flink on Kubernetes - FlinkK8sOper...
Flink Forward San Francisco 2019: Managing Flink on Kubernetes - FlinkK8sOper...Flink Forward
 
Kubernetes 1.12 Update and Container Security with Liz Rice
Kubernetes 1.12 Update and Container Security with Liz RiceKubernetes 1.12 Update and Container Security with Liz Rice
Kubernetes 1.12 Update and Container Security with Liz RiceCloudOps2005
 
Free GitOps Workshop + Intro to Kubernetes & GitOps
Free GitOps Workshop + Intro to Kubernetes & GitOpsFree GitOps Workshop + Intro to Kubernetes & GitOps
Free GitOps Workshop + Intro to Kubernetes & GitOpsWeaveworks
 
[WSO2Con USA 2018] Deploying Applications in K8S and Docker
[WSO2Con USA 2018] Deploying Applications in K8S and Docker[WSO2Con USA 2018] Deploying Applications in K8S and Docker
[WSO2Con USA 2018] Deploying Applications in K8S and DockerWSO2
 
DevOps Days Boston 2017: Real-world Kubernetes for DevOps
DevOps Days Boston 2017: Real-world Kubernetes for DevOpsDevOps Days Boston 2017: Real-world Kubernetes for DevOps
DevOps Days Boston 2017: Real-world Kubernetes for DevOpsAmbassador Labs
 
GCCP JSCOE Session 2
GCCP JSCOE Session 2GCCP JSCOE Session 2
GCCP JSCOE Session 2GDSC
 
Introduction to Kubernetes Workshop
Introduction to Kubernetes WorkshopIntroduction to Kubernetes Workshop
Introduction to Kubernetes WorkshopBob Killen
 
Intro to Kubernetes & GitOps Workshop
Intro to Kubernetes & GitOps WorkshopIntro to Kubernetes & GitOps Workshop
Intro to Kubernetes & GitOps WorkshopWeaveworks
 
From development to production: Deploying Java and Scala apps to kubernetes
From development to production: Deploying Java and Scala apps to kubernetesFrom development to production: Deploying Java and Scala apps to kubernetes
From development to production: Deploying Java and Scala apps to kubernetesOlanga Ochieng'
 
Kubermatic How to Migrate 100 Clusters from On-Prem to Google Cloud Without D...
Kubermatic How to Migrate 100 Clusters from On-Prem to Google Cloud Without D...Kubermatic How to Migrate 100 Clusters from On-Prem to Google Cloud Without D...
Kubermatic How to Migrate 100 Clusters from On-Prem to Google Cloud Without D...Tobias Schneck
 
How to Migrate 100 Clusters from On-Prem to Google Cloud Without Downtime
How to Migrate 100 Clusters from On-Prem to Google Cloud Without DowntimeHow to Migrate 100 Clusters from On-Prem to Google Cloud Without Downtime
How to Migrate 100 Clusters from On-Prem to Google Cloud Without Downtimeloodse
 
Container orchestration and microservices world
Container orchestration and microservices worldContainer orchestration and microservices world
Container orchestration and microservices worldKarol Chrapek
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthNicolas Brousse
 
How Kubernetes helps Devops
How Kubernetes helps DevopsHow Kubernetes helps Devops
How Kubernetes helps DevopsSreenivas Makam
 

Similar to Armada: Batch Workloads Across 10s of Kubernetes Clusters (20)

Kubernetes intro
Kubernetes introKubernetes intro
Kubernetes intro
 
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
 
[WSO2Con EU 2018] Deploying Applications in K8S and Docker
[WSO2Con EU 2018] Deploying Applications in K8S and Docker[WSO2Con EU 2018] Deploying Applications in K8S and Docker
[WSO2Con EU 2018] Deploying Applications in K8S and Docker
 
Flink Forward San Francisco 2019: Managing Flink on Kubernetes - FlinkK8sOper...
Flink Forward San Francisco 2019: Managing Flink on Kubernetes - FlinkK8sOper...Flink Forward San Francisco 2019: Managing Flink on Kubernetes - FlinkK8sOper...
Flink Forward San Francisco 2019: Managing Flink on Kubernetes - FlinkK8sOper...
 
Kubernetes 1.12 Update and Container Security with Liz Rice
Kubernetes 1.12 Update and Container Security with Liz RiceKubernetes 1.12 Update and Container Security with Liz Rice
Kubernetes 1.12 Update and Container Security with Liz Rice
 
Netty training
Netty trainingNetty training
Netty training
 
Free GitOps Workshop + Intro to Kubernetes & GitOps
Free GitOps Workshop + Intro to Kubernetes & GitOpsFree GitOps Workshop + Intro to Kubernetes & GitOps
Free GitOps Workshop + Intro to Kubernetes & GitOps
 
[WSO2Con USA 2018] Deploying Applications in K8S and Docker
[WSO2Con USA 2018] Deploying Applications in K8S and Docker[WSO2Con USA 2018] Deploying Applications in K8S and Docker
[WSO2Con USA 2018] Deploying Applications in K8S and Docker
 
DevOps Days Boston 2017: Real-world Kubernetes for DevOps
DevOps Days Boston 2017: Real-world Kubernetes for DevOpsDevOps Days Boston 2017: Real-world Kubernetes for DevOps
DevOps Days Boston 2017: Real-world Kubernetes for DevOps
 
GCCP JSCOE Session 2
GCCP JSCOE Session 2GCCP JSCOE Session 2
GCCP JSCOE Session 2
 
Intro to Kubernetes
Intro to KubernetesIntro to Kubernetes
Intro to Kubernetes
 
Netty training
Netty trainingNetty training
Netty training
 
Introduction to Kubernetes Workshop
Introduction to Kubernetes WorkshopIntroduction to Kubernetes Workshop
Introduction to Kubernetes Workshop
 
Intro to Kubernetes & GitOps Workshop
Intro to Kubernetes & GitOps WorkshopIntro to Kubernetes & GitOps Workshop
Intro to Kubernetes & GitOps Workshop
 
From development to production: Deploying Java and Scala apps to kubernetes
From development to production: Deploying Java and Scala apps to kubernetesFrom development to production: Deploying Java and Scala apps to kubernetes
From development to production: Deploying Java and Scala apps to kubernetes
 
Kubermatic How to Migrate 100 Clusters from On-Prem to Google Cloud Without D...
Kubermatic How to Migrate 100 Clusters from On-Prem to Google Cloud Without D...Kubermatic How to Migrate 100 Clusters from On-Prem to Google Cloud Without D...
Kubermatic How to Migrate 100 Clusters from On-Prem to Google Cloud Without D...
 
How to Migrate 100 Clusters from On-Prem to Google Cloud Without Downtime
How to Migrate 100 Clusters from On-Prem to Google Cloud Without DowntimeHow to Migrate 100 Clusters from On-Prem to Google Cloud Without Downtime
How to Migrate 100 Clusters from On-Prem to Google Cloud Without Downtime
 
Container orchestration and microservices world
Container orchestration and microservices worldContainer orchestration and microservices world
Container orchestration and microservices world
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
 
How Kubernetes helps Devops
How Kubernetes helps DevopsHow Kubernetes helps Devops
How Kubernetes helps Devops
 

Recently uploaded

Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetEnjoy Anytime
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 

Recently uploaded (20)

The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 

Armada: Batch Workloads Across 10s of Kubernetes Clusters

  • 1. Armada Batch Workloads Across 10s of Kubernetes Clusters
  • 2. Who am I ● Background in Chemistry and High Performance Computing ● Entrypoint into Kubernetes was focused on enabling running scientific workflows across high performance computing clusters and/or Kubernetes clusters ● Working at G-Research Open Source focused on enabling batch workloads on multiple kubernetes clusters
  • 3. G-Research ● FinTech company based out of London (plus an office in Dallas) ● Builds/Operates a large research cluster for quantitative researchers
  • 4. Batch Introduction ● Finite lifetime computational jobs ● Computational Science ○ Molecular Modeling, Protein Folding, Fluid Dynamics, etc ● Machine Learning ● ETL/ELT data workloads ● Genomics
  • 5. Embarrassingly Parallel Input 1 Input 2 Input x Process 1 Process 2 Process x Gather Data
  • 6. Parallel Computing Request 2 nodes Process 1 Process 2 ● Message Passing Interface (MPI) ● GPU to GPU communication
  • 7.
  • 9.
  • 10. Research Platform ● Researchers run large numbers of batch jobs. ● Jobs are being ported from HTCondor on Windows to Kubernetes on Linux ● Need a queue/scheduler to which jobs are submitted to run “somewhere” ● Target: ○ Manage 100k nodes, 1 M cores and many GPUs ○ Handle submitting ~10k jobs over a “short” span ○ Schedule jobs from single-core to multi-node ○ Divide resources fairly between users ○ Provide detailed visibility for users and admins
  • 12. Batch Processing With Kubernetes ● 10k nodes with Kubernetes is nontrivial ○ https://openai.com/blog/scaling-kubernetes-to-7500-nodes/ ● Lack of Queuing system ● Initial support on Job API was for parallel jobs with little communication ● Scheduling is for individual pods
  • 13. Kubernetes Batch Processing Projects ● Volcano ○ Extend Kubernetes via Custom Resource Definitions ○ Single cluster ○ Own scheduler ○ Queue ● Armada ○ Multiple kubernetes clusters ○ Use Kubernetes scheduler ○ Queue
  • 15. Armada Solution ● Use multiple kubernetes clusters ● Kubernetes native scheduler ● Use of pods rather than jobs ● Subscribe to events for a given grouping of jobs
  • 16. Core Concepts Job bag of K8s resources (e.g., a pod spec + ingress) to be created + metadata. Job Set group of jobs managed as a unit. Queue represents a user or project- resources are divided fairly between queues. user guide for more details
  • 18. Queue ● Queue specifies resource requirements ● RBAC for queue
  • 19. Job Spec ● API takes a pod spec ● Can use common K8 resources, taints, tolerations, etc
  • 20. Clients ● DotNet client ● Python client ● Go client ● CLI
  • 21.
  • 22.
  • 23.
  • 24. Lookout ● React UI for monitoring progress
  • 26. Roadmap ● Adding ability to support gang scheduling (all or nothing) ○ Extend the API to allow grouping of jobs ● Preemption ● Better Observability/Admin support ● Future Integrations with open source software
  • 27. Getting Involved ● Kubernetes Working Group Focused on Adding Batch Capabilities ○ https://github.com/kubernetes/community/tree/master/wg-batch ○ Adding Queueing, PodGroups, Enhancing JobApi ● Research User Group ○ CNCF group focusing on enabling research workloads on cloud native architecture ○ https://github.com/cncf/research-user-group ● CNCF Batch Working Group ○ https://github.com/cncf/tag-runtime/blob/master/wg/bsi.md ○ Best practices for batch in K8, etc ● Armada Github ○ https://github.com/G-Research/armada ○ Slack Channel https://cloud-native.slack.com/archives/C03T9CBCEMC