Presented by Kevin Hannon, Open Source Developer, G-Research Open Source, at Kubernetes Community Days, Washington DC, September 14, 2022
2. Who am I
● Background in Chemistry and High Performance Computing
● Entry point into Kubernetes was enabling scientific workflows to run across high performance computing clusters and/or Kubernetes clusters
● Working at G-Research Open Source on enabling batch workloads across multiple Kubernetes clusters
3. G-Research
● FinTech company based in London (plus an office in Dallas)
● Builds and operates a large research cluster for quantitative researchers
4. Batch Introduction
● Finite lifetime computational jobs
● Computational Science
○ Molecular Modeling, Protein Folding, Fluid Dynamics, etc
● Machine Learning
● ETL/ELT data workloads
● Genomics
10. Research Platform
● Researchers run large numbers of batch jobs.
● Jobs are being ported from HTCondor on Windows to Kubernetes on Linux
● Need a queue/scheduler to which jobs can be submitted and then run “somewhere”
● Target:
○ Manage 100k nodes, 1M cores, and many GPUs
○ Handle submitting ~10k jobs over a “short” span
○ Schedule jobs from single-core to multi-node
○ Divide resources fairly between users
○ Provide detailed visibility for users and admins
12. Batch Processing With Kubernetes
● 10k nodes with Kubernetes is nontrivial
○ https://openai.com/blog/scaling-kubernetes-to-7500-nodes/
● Lack of a queuing system
● Initial Job API support targeted parallel jobs with little inter-pod communication
● Scheduling operates on individual pods
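As an illustration of the parallel-job model the Job API initially targeted, a minimal sketch using the standard batch/v1 Job (the name, image, and command here are placeholders, not from the talk):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pi-batch            # placeholder name
spec:
  completions: 8            # run 8 independent completions...
  parallelism: 2            # ...with at most 2 pods running at a time
  backoffLimit: 4
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: perl:5.34  # placeholder image
          command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
```

Each pod runs to completion independently; there is no built-in queueing across Jobs and no communication between the pods, which is the gap the projects below address.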
13. Kubernetes Batch Processing Projects
● Volcano
○ Extends Kubernetes via Custom Resource Definitions
○ Single cluster
○ Ships its own scheduler
○ Queueing
● Armada
○ Multiple Kubernetes clusters
○ Uses the Kubernetes scheduler
○ Queueing
15. Armada Solution
● Uses multiple Kubernetes clusters
● Relies on the native Kubernetes scheduler
● Creates pods directly rather than Jobs
● Clients subscribe to events for a given grouping of jobs
16. Core Concepts
● Job: a bag of K8s resources (e.g., a pod spec + an ingress) to be created, plus metadata
● Job Set: a group of jobs managed as a unit
● Queue: represents a user or project; resources are divided fairly between queues
● See the user guide for more details
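A sketch of how these concepts might appear in an Armada submission file (field names follow the examples in the Armada repository; the queue name, job set id, and image are hypothetical):

```yaml
queue: research-team-a      # hypothetical queue: one per user or project
jobSetId: simulation-run-1  # hypothetical job set grouping related jobs
jobs:
  - priority: 0
    podSpec:                # a job is a bag of K8s resources; here, one pod spec
      terminationGracePeriodSeconds: 0
      restartPolicy: Never
      containers:
        - name: sim
          image: busybox    # placeholder image
          args: ["sleep", "60s"]
          resources:
            requests: { cpu: 100m, memory: 64Mi }
            limits: { cpu: 100m, memory: 64Mi }
```

Such a file could then be submitted with something like `armadactl submit jobs.yaml`, and events for the whole job set streamed with `armadactl watch <queue> <jobSetId>` (command shapes assumed from the Armada documentation).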
26. Roadmap
● Adding support for gang scheduling (all or nothing)
○ Extend the API to allow grouping of jobs
● Preemption
● Better Observability/Admin support
● Future Integrations with open source software
27. Getting Involved
● Kubernetes Working Group Focused on Adding Batch Capabilities
○ https://github.com/kubernetes/community/tree/master/wg-batch
○ Adding queueing, PodGroups, and enhancing the Job API
● Research User Group
○ CNCF group focusing on enabling research workloads on cloud native architecture
○ https://github.com/cncf/research-user-group
● CNCF Batch Working Group
○ https://github.com/cncf/tag-runtime/blob/master/wg/bsi.md
○ Best practices for batch on Kubernetes, etc.
● Armada Github
○ https://github.com/G-Research/armada
○ Slack Channel https://cloud-native.slack.com/archives/C03T9CBCEMC