SlideShare a Scribd company logo
Igor Sfiligoi, University of California San Diego (UCSD/SDSC)
Serving HTC Users in K8s
by Leveraging HTCondor
Who am I?
Longtime HTC user
• Most recently as part of the
Open Science Grid (OSG)
For the past year actively involved with Kubernetes
• As part of the
Pacific Research Platform (PRP)
Name: Igor Sfiligoi
Employer: UC San Diego
https://opensciencegrid.org
http://pacificresearchplatform.org
2
Let’s define HTC
HTC = High Throughput Computing
Often also called Batch Computing
(although not all Batch Computing is HTC)
3
Let’s define HTC
HTC = High Throughput Computing
Often also called Batch Computing
(although not all Batch Computing is HTC)
The infrastructure for Ingenuously Parallel Computing
4
Ingenious Parallelism
• Restate a big computing problem as
many individually schedulable small problems.
• Minimize your requirements in order to
maximize the raw capacity that you can effectively use.
5
Ingenious Parallelism
• Restate a big computing problem as
many individually schedulable small problems.
• Minimize your requirements in order to
maximize the raw capacity that you can effectively use.
Some call it
Embarrassingly Parallel Computing
but it really takes hard thinking!
6
Example HTC problems
Monte Carlo Simulations
Parameter sweeps
Event processing
Feature extraction
And many more problems can be cast in this paradigm.
7
Example HTC resource
Open Science Grid (OSG)
operates a large scale HTC pool
Number of CPU cores in use by OSG HTC jobs
8
Example HTC users
OSG serving
many different
scientific domains
Weekly CPU hours used by OSG HTC jobs
9
HTC and Kubernetes
Can we use Kubernetes
for HTC?
10
K8s in principle great for HTC
One HTC job per Pod
Job n
Job n
Job n
Job 3
Job 2
Job 1
…
Many Pods per HW node
Job a
…
Job z
Many HW nodes per Pool
VPN Overlay
Shared
Storage
Flexible
Orchestration
11
K8s in practice not so great
• Indexed parameter passing
• Automatic Input/Output handling
Note: HTC jobs typically do not require a shared FS
• Fair Share Scheduling policies
Essential for highly contested resources
• Can it scale to millions of queued Pods?
K8s missing a few features HTC users are used to
12
K8s in practice not so great
• Indexed parameter passing
• Automatic Input/Output handling
Note: HTC jobs typically do not require a shared FS
• Fair Share Scheduling policies
Essential for highly contested resources
• Can it scale to millions of queued Pods?
K8s missing a few features HTC users are used to
Plus, lack of:
• A familiar API/CLI
• Seamless integration
with other resources
13
HTC and Kubernetes
How about leveraging
HTCondor with K8s?
14
Using HTCondor with Kubernetes
Why HTCondor?
• One of the major batch systems
• HTC-focused architecture
• Very flexible, often used in heterogeneous environments
• Native support for containers
• Highly scalable
15
https://research.cs.wisc.edu/htcondor/
Using HTCondor with Kubernetes
Why HTCondor?
• One of the major batch systems
• HTC-focused architecture
• Very flexible, often used in heterogeneous environments
• Native support for containers
• Highly scalable
The system used inside
the Open Science Grid
(OSG)
16
https://research.cs.wisc.edu/htcondor/
Using HTCondor with Kubernetes
Why HTCondor?
• One of the major batch systems
• HTC-focused architecture
• Very flexible, often used in heterogeneous environments
• Native support for containers
The system used inside
the Open Science Grid
(OSG)
17
https://research.cs.wisc.edu/htcondor/
Nov 17th, 2019
HTCondor Architecture
Persistent Job Queue
(can be more than one, but all independent)
Schedd
Submission typically local
(e.g. ssh) Collector
Central manager for bookkeeping
(can have multiple for HA)
Startd
CPU
Startd
CPU
Startd
CPU
Each execute resource
has a control process
18
HTCondor Architecture
Persistent Job Queue
(can be more than one, but all independent)
Schedd
Submission typically local
(e.g. ssh) Collector
Requirements based matchmaking
Startd
CPU
Startd
CPU
Startd
CPU
Each execute resource
has a control process
Negotiator
Central manager for bookkeeping
(can have multiple for HA)
Direct communication with
execute node
Fair share
allocations
(User and/or group
based)
Includes
data transfers
19
HTCondor Architecture
Persistent Job Queue
(can be more than one, but all independent)
Schedd
Submission typically local
(e.g. ssh) Collector Startd
CPU
Startd
CPU
Startd
CPU
Each execute resource
has a control process
Negotiator
Central manager for bookkeeping
(can have multiple for HA)
Direct communication with
execute node
Match valid for
tens of minutes,
can be used for
many jobs
Requirements based matchmaking
Fair share
allocations
(User and/or group
based)
Includes
data transfers
20
HTC and Kubernetes
How do we leverage
HTCondor with K8s?
21
Using HTCondor with Kubernetes
The Kubernetes resources can be joined to
an existing HTCondor Pool
Pod
Startd
CPU
Schedd
Collector
Negotiator
No
persistency
needed
Using CCB for
NAT traversal
22
Using HTCondor with Kubernetes
Or a complete HTCondor Pool can be created
inside Kubernetes
Pod
Startd
CPU
Schedd
Collector
Negotiator
Needs
persistent
storage
Persistent
storage
recommended
Simpler setup
due to VPN
23
Using HTCondor with Kubernetes
Can be used to join
external resources for
K8s users
Pod
Startd
CPU
Schedd
Collector
Negotiator Pod
Startd
CPU
Needs
incoming
networking
24
HTC Users and Containers
Most HTC jobs are application + arguments + data
• Container just a convenient way to package the dependencies
• Usually a department/community maintained one
25
HTCondor and Containers
HTCondor allows for a container to be attached to a job
• Will use singularity to invoke it
• After binding the application and data
Most HTC jobs are application + arguments + data
• Container just a convenient way to package the dependencies
• Usually a department/community maintained one
26
HTCondor and Containers
HTCondor allows for a container to be attached to a job
• Will use singularity to invoke it
• After binding the application and data
Most HTC jobs are application + arguments + data
• Container just a convenient way to package the dependencies
• Usually a department/community maintained one
In principle Docker could be an
option, but not currently
supported
27
Nested containerization
Singularity can be invoked inside a Docker container
• Fully unprivileged with Linux Kernel >= 4.18
Makes HTCondor execute in Kubernetes trivial to implement
Schedd
Pod – CentOS/Ubuntu/…
with Singularity
Startd
Singularity – Dept.Cont.
Application
Dept.Cont. + Appl. + Args. + InData
OurData + Logs
28
Explicit provisioning
Many systems still on older Linux Kernel Versions (e.g. CentOS 7)
• Unprivileged nested containerization not an option there
Some users also do not like singularity
• It does have some differences from Docker
• e.g. The root partition is always Read-Only
Kubernetes Pod can be launched with Container needed by User jobs
• Only jobs needing that Container will match
• Asking users to create a HTCondor-specific Container
usually a non-starter
• Better to inject HTCondor bins and config at Pod startup
A ready-to-use template available at:
https://github.com/sfiligoi/prp-htcondor-pool
Quite effective
when only a few
Container Images
needed
29
Opportunistic use
Most HTC jobs tolerate preemption
• HTC Pods great backfill option for keeping
your Kubernetes resources fully utilized
Just launch HTCondor execute Pods with a very low K8s priority
Works best when you have a single backfill pool
30
To conclude
Kubernetes is a great foundation platform for HTC jobs
• But a bit hard to use by itself
HTCondor can add the needed glue to make it easy to use
• Data handling
• Parametrized argument passing
• Robust, contention-optimized and scalable policy manager
OSG and PRP have been successfully using this combination for awhile
31
Acknowledgments
This work was partially funded by
US National Science Foundation (NSF) awards
CNS-1456638, CNS-1730158,
ACI-1540112, ACI-1541349,
MPS-1148698,
OAC-1826967, OAC 1450871,
OAC-1659169 and OAC-1841530.
32

More Related Content

What's hot

NantOmics
NantOmicsNantOmics
NantOmics
Ceph Community
 
CERN User Story
CERN User StoryCERN User Story
CERN User Story
Tim Bell
 
Euro ht condor_alahiff
Euro ht condor_alahiffEuro ht condor_alahiff
Euro ht condor_alahiff
vandersantiago
 
20170926 cern cloud v4
20170926 cern cloud v420170926 cern cloud v4
20170926 cern cloud v4
Tim Bell
 
How Kubernetes make OpenStack & Ceph better
How Kubernetes make OpenStack & Ceph betterHow Kubernetes make OpenStack & Ceph better
How Kubernetes make OpenStack & Ceph better
TeK Charnsilp Chinprasert
 
Cloud Native Applications on Kubernetes: a DevOps Approach
Cloud Native Applications on Kubernetes: a DevOps ApproachCloud Native Applications on Kubernetes: a DevOps Approach
Cloud Native Applications on Kubernetes: a DevOps Approach
Nicola Ferraro
 
20190620 accelerating containers v3
20190620 accelerating containers v320190620 accelerating containers v3
20190620 accelerating containers v3
Tim Bell
 
[WSO2Con Asia 2018] Deploying Applications in K8S and Docker
[WSO2Con Asia 2018] Deploying Applications in K8S and Docker[WSO2Con Asia 2018] Deploying Applications in K8S and Docker
[WSO2Con Asia 2018] Deploying Applications in K8S and Docker
WSO2
 
Containers on Baremetal and Preemptible VMs at CERN and SKA
Containers on Baremetal and Preemptible VMs at CERN and SKAContainers on Baremetal and Preemptible VMs at CERN and SKA
Containers on Baremetal and Preemptible VMs at CERN and SKA
Belmiro Moreira
 
Workshop actualización SVG CESGA 2012
Workshop actualización SVG CESGA 2012 Workshop actualización SVG CESGA 2012
Workshop actualización SVG CESGA 2012
CESGA Centro de Supercomputación de Galicia
 
The OpenStack Cloud at CERN - OpenStack Nordic
The OpenStack Cloud at CERN - OpenStack NordicThe OpenStack Cloud at CERN - OpenStack Nordic
The OpenStack Cloud at CERN - OpenStack Nordic
Tim Bell
 
Kubernetes and OpenStack at Scale
Kubernetes and OpenStack at ScaleKubernetes and OpenStack at Scale
Kubernetes and OpenStack at Scale
Stephen Gordon
 
Moving from CellsV1 to CellsV2 at CERN
Moving from CellsV1 to CellsV2 at CERNMoving from CellsV1 to CellsV2 at CERN
Moving from CellsV1 to CellsV2 at CERN
Belmiro Moreira
 
Multi-Cell OpenStack: How to Evolve Your Cloud to Scale - November, 2014
Multi-Cell OpenStack: How to Evolve Your Cloud to Scale - November, 2014Multi-Cell OpenStack: How to Evolve Your Cloud to Scale - November, 2014
Multi-Cell OpenStack: How to Evolve Your Cloud to Scale - November, 2014
Belmiro Moreira
 
Unveiling CERN Cloud Architecture - October, 2015
Unveiling CERN Cloud Architecture - October, 2015Unveiling CERN Cloud Architecture - October, 2015
Unveiling CERN Cloud Architecture - October, 2015
Belmiro Moreira
 
Deploying PostgreSQL on Kubernetes
Deploying PostgreSQL on KubernetesDeploying PostgreSQL on Kubernetes
Deploying PostgreSQL on Kubernetes
Jimmy Angelakos
 
Tips Tricks and Tactics with Cells and Scaling OpenStack - May, 2015
Tips Tricks and Tactics with Cells and Scaling OpenStack - May, 2015Tips Tricks and Tactics with Cells and Scaling OpenStack - May, 2015
Tips Tricks and Tactics with Cells and Scaling OpenStack - May, 2015
Belmiro Moreira
 
Deep Dive Into the CERN Cloud Infrastructure - November, 2013
Deep Dive Into the CERN Cloud Infrastructure - November, 2013Deep Dive Into the CERN Cloud Infrastructure - November, 2013
Deep Dive Into the CERN Cloud Infrastructure - November, 2013
Belmiro Moreira
 
Kubernetes stack reliability
Kubernetes stack reliabilityKubernetes stack reliability
Kubernetes stack reliability
Oleg Chunikhin
 
Meet the Experts: InfluxDB Product Update
Meet the Experts: InfluxDB Product UpdateMeet the Experts: InfluxDB Product Update
Meet the Experts: InfluxDB Product Update
InfluxData
 

What's hot (20)

NantOmics
NantOmicsNantOmics
NantOmics
 
CERN User Story
CERN User StoryCERN User Story
CERN User Story
 
Euro ht condor_alahiff
Euro ht condor_alahiffEuro ht condor_alahiff
Euro ht condor_alahiff
 
20170926 cern cloud v4
20170926 cern cloud v420170926 cern cloud v4
20170926 cern cloud v4
 
How Kubernetes make OpenStack & Ceph better
How Kubernetes make OpenStack & Ceph betterHow Kubernetes make OpenStack & Ceph better
How Kubernetes make OpenStack & Ceph better
 
Cloud Native Applications on Kubernetes: a DevOps Approach
Cloud Native Applications on Kubernetes: a DevOps ApproachCloud Native Applications on Kubernetes: a DevOps Approach
Cloud Native Applications on Kubernetes: a DevOps Approach
 
20190620 accelerating containers v3
20190620 accelerating containers v320190620 accelerating containers v3
20190620 accelerating containers v3
 
[WSO2Con Asia 2018] Deploying Applications in K8S and Docker
[WSO2Con Asia 2018] Deploying Applications in K8S and Docker[WSO2Con Asia 2018] Deploying Applications in K8S and Docker
[WSO2Con Asia 2018] Deploying Applications in K8S and Docker
 
Containers on Baremetal and Preemptible VMs at CERN and SKA
Containers on Baremetal and Preemptible VMs at CERN and SKAContainers on Baremetal and Preemptible VMs at CERN and SKA
Containers on Baremetal and Preemptible VMs at CERN and SKA
 
Workshop actualización SVG CESGA 2012
Workshop actualización SVG CESGA 2012 Workshop actualización SVG CESGA 2012
Workshop actualización SVG CESGA 2012
 
The OpenStack Cloud at CERN - OpenStack Nordic
The OpenStack Cloud at CERN - OpenStack NordicThe OpenStack Cloud at CERN - OpenStack Nordic
The OpenStack Cloud at CERN - OpenStack Nordic
 
Kubernetes and OpenStack at Scale
Kubernetes and OpenStack at ScaleKubernetes and OpenStack at Scale
Kubernetes and OpenStack at Scale
 
Moving from CellsV1 to CellsV2 at CERN
Moving from CellsV1 to CellsV2 at CERNMoving from CellsV1 to CellsV2 at CERN
Moving from CellsV1 to CellsV2 at CERN
 
Multi-Cell OpenStack: How to Evolve Your Cloud to Scale - November, 2014
Multi-Cell OpenStack: How to Evolve Your Cloud to Scale - November, 2014Multi-Cell OpenStack: How to Evolve Your Cloud to Scale - November, 2014
Multi-Cell OpenStack: How to Evolve Your Cloud to Scale - November, 2014
 
Unveiling CERN Cloud Architecture - October, 2015
Unveiling CERN Cloud Architecture - October, 2015Unveiling CERN Cloud Architecture - October, 2015
Unveiling CERN Cloud Architecture - October, 2015
 
Deploying PostgreSQL on Kubernetes
Deploying PostgreSQL on KubernetesDeploying PostgreSQL on Kubernetes
Deploying PostgreSQL on Kubernetes
 
Tips Tricks and Tactics with Cells and Scaling OpenStack - May, 2015
Tips Tricks and Tactics with Cells and Scaling OpenStack - May, 2015Tips Tricks and Tactics with Cells and Scaling OpenStack - May, 2015
Tips Tricks and Tactics with Cells and Scaling OpenStack - May, 2015
 
Deep Dive Into the CERN Cloud Infrastructure - November, 2013
Deep Dive Into the CERN Cloud Infrastructure - November, 2013Deep Dive Into the CERN Cloud Infrastructure - November, 2013
Deep Dive Into the CERN Cloud Infrastructure - November, 2013
 
Kubernetes stack reliability
Kubernetes stack reliabilityKubernetes stack reliability
Kubernetes stack reliability
 
Meet the Experts: InfluxDB Product Update
Meet the Experts: InfluxDB Product UpdateMeet the Experts: InfluxDB Product Update
Meet the Experts: InfluxDB Product Update
 

Similar to Serving HTC Users in Kubernetes by Leveraging HTCondor

Auto-scaling HTCondor pools using Kubernetes compute resources
Auto-scaling HTCondor pools using Kubernetes compute resourcesAuto-scaling HTCondor pools using Kubernetes compute resources
Auto-scaling HTCondor pools using Kubernetes compute resources
Igor Sfiligoi
 
The Five Stages of Enterprise Jupyter Deployment
The Five Stages of Enterprise Jupyter DeploymentThe Five Stages of Enterprise Jupyter Deployment
The Five Stages of Enterprise Jupyter Deployment
Frederick Reiss
 
“TensorFlow Lite for Microcontrollers (TFLM): Recent Developments,” a Present...
“TensorFlow Lite for Microcontrollers (TFLM): Recent Developments,” a Present...“TensorFlow Lite for Microcontrollers (TFLM): Recent Developments,” a Present...
“TensorFlow Lite for Microcontrollers (TFLM): Recent Developments,” a Present...
Edge AI and Vision Alliance
 
HNSciCloud Info Day, 7 Sept 2016, Functional Requirements by Helge Meinhard
HNSciCloud Info Day, 7 Sept 2016, Functional Requirements by Helge MeinhardHNSciCloud Info Day, 7 Sept 2016, Functional Requirements by Helge Meinhard
HNSciCloud Info Day, 7 Sept 2016, Functional Requirements by Helge Meinhard
Helix Nebula The Science Cloud
 
Speed & Agility of Innovation with Docker & Kubernetes
Speed & Agility of Innovation with Docker & KubernetesSpeed & Agility of Innovation with Docker & Kubernetes
Speed & Agility of Innovation with Docker & Kubernetes
ICS
 
Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications
OpenEBS
 
Considerations for Operating an OpenStack Cloud
Considerations for Operating an OpenStack CloudConsiderations for Operating an OpenStack Cloud
Considerations for Operating an OpenStack Cloud
All Things Open
 
Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...
Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...
Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...
Demi Ben-Ari
 
DevOps Days Boston 2017: Real-world Kubernetes for DevOps
DevOps Days Boston 2017: Real-world Kubernetes for DevOpsDevOps Days Boston 2017: Real-world Kubernetes for DevOps
DevOps Days Boston 2017: Real-world Kubernetes for DevOps
Ambassador Labs
 
Kubernetes 101
Kubernetes 101Kubernetes 101
Kubernetes 101
Stanislav Pogrebnyak
 
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storageWebinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
MayaData Inc
 
Hyperledger community update 201805
Hyperledger community update 201805Hyperledger community update 201805
Hyperledger community update 201805
Arnaud Le Hors
 
Containers and Microservices for Realists
Containers and Microservices for RealistsContainers and Microservices for Realists
Containers and Microservices for Realists
Oracle Developers
 
Containers and microservices for realists
Containers and microservices for realistsContainers and microservices for realists
Containers and microservices for realists
Karthik Gaekwad
 
DevOps Training institute in Ameerpet
DevOps Training institute in AmeerpetDevOps Training institute in Ameerpet
DevOps Training institute in Ameerpet
Visualpath Training
 
How to Build a Compute Cluster
How to Build a Compute ClusterHow to Build a Compute Cluster
How to Build a Compute Cluster
Ramsay Key
 
SolidFire + Platform9: Simply Faster OpenStack
SolidFire + Platform9: Simply Faster OpenStackSolidFire + Platform9: Simply Faster OpenStack
SolidFire + Platform9: Simply Faster OpenStack
Platform9
 
Power of Choice in Docker EE 2.0 - Anoop - Docker - CC18
Power of Choice in Docker EE 2.0 - Anoop - Docker - CC18Power of Choice in Docker EE 2.0 - Anoop - Docker - CC18
Power of Choice in Docker EE 2.0 - Anoop - Docker - CC18
CodeOps Technologies LLP
 
Kubernetes for All
Kubernetes for AllKubernetes for All
Kubernetes for All
William Jimenez
 
Building a Kubernetes cluster for a large organisation 101
Building a Kubernetes cluster for a large organisation 101Building a Kubernetes cluster for a large organisation 101
Building a Kubernetes cluster for a large organisation 101
Ed Schouten
 

Similar to Serving HTC Users in Kubernetes by Leveraging HTCondor (20)

Auto-scaling HTCondor pools using Kubernetes compute resources
Auto-scaling HTCondor pools using Kubernetes compute resourcesAuto-scaling HTCondor pools using Kubernetes compute resources
Auto-scaling HTCondor pools using Kubernetes compute resources
 
The Five Stages of Enterprise Jupyter Deployment
The Five Stages of Enterprise Jupyter DeploymentThe Five Stages of Enterprise Jupyter Deployment
The Five Stages of Enterprise Jupyter Deployment
 
“TensorFlow Lite for Microcontrollers (TFLM): Recent Developments,” a Present...
“TensorFlow Lite for Microcontrollers (TFLM): Recent Developments,” a Present...“TensorFlow Lite for Microcontrollers (TFLM): Recent Developments,” a Present...
“TensorFlow Lite for Microcontrollers (TFLM): Recent Developments,” a Present...
 
HNSciCloud Info Day, 7 Sept 2016, Functional Requirements by Helge Meinhard
HNSciCloud Info Day, 7 Sept 2016, Functional Requirements by Helge MeinhardHNSciCloud Info Day, 7 Sept 2016, Functional Requirements by Helge Meinhard
HNSciCloud Info Day, 7 Sept 2016, Functional Requirements by Helge Meinhard
 
Speed & Agility of Innovation with Docker & Kubernetes
Speed & Agility of Innovation with Docker & KubernetesSpeed & Agility of Innovation with Docker & Kubernetes
Speed & Agility of Innovation with Docker & Kubernetes
 
Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications
 
Considerations for Operating an OpenStack Cloud
Considerations for Operating an OpenStack CloudConsiderations for Operating an OpenStack Cloud
Considerations for Operating an OpenStack Cloud
 
Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...
Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...
Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...
 
DevOps Days Boston 2017: Real-world Kubernetes for DevOps
DevOps Days Boston 2017: Real-world Kubernetes for DevOpsDevOps Days Boston 2017: Real-world Kubernetes for DevOps
DevOps Days Boston 2017: Real-world Kubernetes for DevOps
 
Kubernetes 101
Kubernetes 101Kubernetes 101
Kubernetes 101
 
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storageWebinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
 
Hyperledger community update 201805
Hyperledger community update 201805Hyperledger community update 201805
Hyperledger community update 201805
 
Containers and Microservices for Realists
Containers and Microservices for RealistsContainers and Microservices for Realists
Containers and Microservices for Realists
 
Containers and microservices for realists
Containers and microservices for realistsContainers and microservices for realists
Containers and microservices for realists
 
DevOps Training institute in Ameerpet
DevOps Training institute in AmeerpetDevOps Training institute in Ameerpet
DevOps Training institute in Ameerpet
 
How to Build a Compute Cluster
How to Build a Compute ClusterHow to Build a Compute Cluster
How to Build a Compute Cluster
 
SolidFire + Platform9: Simply Faster OpenStack
SolidFire + Platform9: Simply Faster OpenStackSolidFire + Platform9: Simply Faster OpenStack
SolidFire + Platform9: Simply Faster OpenStack
 
Power of Choice in Docker EE 2.0 - Anoop - Docker - CC18
Power of Choice in Docker EE 2.0 - Anoop - Docker - CC18Power of Choice in Docker EE 2.0 - Anoop - Docker - CC18
Power of Choice in Docker EE 2.0 - Anoop - Docker - CC18
 
Kubernetes for All
Kubernetes for AllKubernetes for All
Kubernetes for All
 
Building a Kubernetes cluster for a large organisation 101
Building a Kubernetes cluster for a large organisation 101Building a Kubernetes cluster for a large organisation 101
Building a Kubernetes cluster for a large organisation 101
 

More from Igor Sfiligoi

Preparing Fusion codes for Perlmutter - CGYRO
Preparing Fusion codes for Perlmutter - CGYROPreparing Fusion codes for Perlmutter - CGYRO
Preparing Fusion codes for Perlmutter - CGYRO
Igor Sfiligoi
 
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...
Igor Sfiligoi
 
Comparing single-node and multi-node performance of an important fusion HPC c...
Comparing single-node and multi-node performance of an important fusion HPC c...Comparing single-node and multi-node performance of an important fusion HPC c...
Comparing single-node and multi-node performance of an important fusion HPC c...
Igor Sfiligoi
 
The anachronism of whole-GPU accounting
The anachronism of whole-GPU accountingThe anachronism of whole-GPU accounting
The anachronism of whole-GPU accounting
Igor Sfiligoi
 
Speeding up bowtie2 by improving cache-hit rate
Speeding up bowtie2 by improving cache-hit rateSpeeding up bowtie2 by improving cache-hit rate
Speeding up bowtie2 by improving cache-hit rate
Igor Sfiligoi
 
Comparing GPU effectiveness for Unifrac distance compute
Comparing GPU effectiveness for Unifrac distance computeComparing GPU effectiveness for Unifrac distance compute
Comparing GPU effectiveness for Unifrac distance compute
Igor Sfiligoi
 
Managing Cloud networking costs for data-intensive applications by provisioni...
Managing Cloud networking costs for data-intensive applications by provisioni...Managing Cloud networking costs for data-intensive applications by provisioni...
Managing Cloud networking costs for data-intensive applications by provisioni...
Igor Sfiligoi
 
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory AccessAccelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Igor Sfiligoi
 
Using A100 MIG to Scale Astronomy Scientific Output
Using A100 MIG to Scale Astronomy Scientific OutputUsing A100 MIG to Scale Astronomy Scientific Output
Using A100 MIG to Scale Astronomy Scientific Output
Igor Sfiligoi
 
Using commercial Clouds to process IceCube jobs
Using commercial Clouds to process IceCube jobsUsing commercial Clouds to process IceCube jobs
Using commercial Clouds to process IceCube jobs
Igor Sfiligoi
 
Modest scale HPC on Azure using CGYRO
Modest scale HPC on Azure using CGYROModest scale HPC on Azure using CGYRO
Modest scale HPC on Azure using CGYRO
Igor Sfiligoi
 
Data-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud BurstData-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud Burst
Igor Sfiligoi
 
Scheduling a Kubernetes Federation with Admiralty
Scheduling a Kubernetes Federation with AdmiraltyScheduling a Kubernetes Federation with Admiralty
Scheduling a Kubernetes Federation with Admiralty
Igor Sfiligoi
 
Accelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACCAccelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACC
Igor Sfiligoi
 
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
Igor Sfiligoi
 
Porting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUsPorting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUs
Igor Sfiligoi
 
Demonstrating 100 Gbps in and out of the public Clouds
Demonstrating 100 Gbps in and out of the public CloudsDemonstrating 100 Gbps in and out of the public Clouds
Demonstrating 100 Gbps in and out of the public Clouds
Igor Sfiligoi
 
TransAtlantic Networking using Cloud links
TransAtlantic Networking using Cloud linksTransAtlantic Networking using Cloud links
TransAtlantic Networking using Cloud links
Igor Sfiligoi
 
Bursting into the public Cloud - Sharing my experience doing it at large scal...
Bursting into the public Cloud - Sharing my experience doing it at large scal...Bursting into the public Cloud - Sharing my experience doing it at large scal...
Bursting into the public Cloud - Sharing my experience doing it at large scal...
Igor Sfiligoi
 
Demonstrating 100 Gbps in and out of the Clouds
Demonstrating 100 Gbps in and out of the CloudsDemonstrating 100 Gbps in and out of the Clouds
Demonstrating 100 Gbps in and out of the Clouds
Igor Sfiligoi
 

More from Igor Sfiligoi (20)

Preparing Fusion codes for Perlmutter - CGYRO
Preparing Fusion codes for Perlmutter - CGYROPreparing Fusion codes for Perlmutter - CGYRO
Preparing Fusion codes for Perlmutter - CGYRO
 
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...
 
Comparing single-node and multi-node performance of an important fusion HPC c...
Comparing single-node and multi-node performance of an important fusion HPC c...Comparing single-node and multi-node performance of an important fusion HPC c...
Comparing single-node and multi-node performance of an important fusion HPC c...
 
The anachronism of whole-GPU accounting
The anachronism of whole-GPU accountingThe anachronism of whole-GPU accounting
The anachronism of whole-GPU accounting
 
Speeding up bowtie2 by improving cache-hit rate
Speeding up bowtie2 by improving cache-hit rateSpeeding up bowtie2 by improving cache-hit rate
Speeding up bowtie2 by improving cache-hit rate
 
Comparing GPU effectiveness for Unifrac distance compute
Comparing GPU effectiveness for Unifrac distance computeComparing GPU effectiveness for Unifrac distance compute
Comparing GPU effectiveness for Unifrac distance compute
 
Managing Cloud networking costs for data-intensive applications by provisioni...
Managing Cloud networking costs for data-intensive applications by provisioni...Managing Cloud networking costs for data-intensive applications by provisioni...
Managing Cloud networking costs for data-intensive applications by provisioni...
 
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory AccessAccelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
 
Using A100 MIG to Scale Astronomy Scientific Output
Using A100 MIG to Scale Astronomy Scientific OutputUsing A100 MIG to Scale Astronomy Scientific Output
Using A100 MIG to Scale Astronomy Scientific Output
 
Using commercial Clouds to process IceCube jobs
Using commercial Clouds to process IceCube jobsUsing commercial Clouds to process IceCube jobs
Using commercial Clouds to process IceCube jobs
 
Modest scale HPC on Azure using CGYRO
Modest scale HPC on Azure using CGYROModest scale HPC on Azure using CGYRO
Modest scale HPC on Azure using CGYRO
 
Data-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud BurstData-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud Burst
 
Scheduling a Kubernetes Federation with Admiralty
Scheduling a Kubernetes Federation with AdmiraltyScheduling a Kubernetes Federation with Admiralty
Scheduling a Kubernetes Federation with Admiralty
 
Accelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACCAccelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACC
 
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
 
Porting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUsPorting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUs
 
Demonstrating 100 Gbps in and out of the public Clouds
Demonstrating 100 Gbps in and out of the public CloudsDemonstrating 100 Gbps in and out of the public Clouds
Demonstrating 100 Gbps in and out of the public Clouds
 
TransAtlantic Networking using Cloud links
TransAtlantic Networking using Cloud linksTransAtlantic Networking using Cloud links
TransAtlantic Networking using Cloud links
 
Bursting into the public Cloud - Sharing my experience doing it at large scal...
Bursting into the public Cloud - Sharing my experience doing it at large scal...Bursting into the public Cloud - Sharing my experience doing it at large scal...
Bursting into the public Cloud - Sharing my experience doing it at large scal...
 
Demonstrating 100 Gbps in and out of the Clouds
Demonstrating 100 Gbps in and out of the CloudsDemonstrating 100 Gbps in and out of the Clouds
Demonstrating 100 Gbps in and out of the Clouds
 

Recently uploaded

Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
UiPathCommunity
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 
Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.
ViralQR
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 

Recently uploaded (20)

Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 

Serving HTC Users in Kubernetes by Leveraging HTCondor

  • 1. Igor Sfiligoi, University of California San Diego (UCSD/SDSC) Serving HTC Users in K8s by Leveraging HTCondor
  • 2. Who am I? Longtime HTC user • Most recently as part of the Open Science Grid (OSG) For the past year actively involved with Kubernetes • As part of the Pacific Research Platform (PRP) Name: Igor Sfiligoi Employer: UC San Diego https://opensciencegrid.org http://pacificresearchplatform.org 2
  • 3. Let’s define HTC HTC = High Throughput Computing Often also called Batch Computing (although not all Batch Computing is HTC) 3
  • 4. Let’s define HTC HTC = High Throughput Computing Often also called Batch Computing (although not all Batch Computing is HTC) The infrastructure for Ingenuously Parallel Computing 4
  • 5. Ingenious Parallelism • Restate a big computing problem as many individually schedulable small problems. • Minimize your requirements in order to maximize the raw capacity that you can effectively use. 5
  • 6. Ingenious Parallelism • Restate a big computing problem as many individually schedulable small problems. • Minimize your requirements in order to maximize the raw capacity that you can effectively use. Some call it Embarrassingly Parallel Computing but it really takes hard thinking! 6
  • 7. Example HTC problems Monte Carlo Simulations Parameter sweeps Event processing Feature extraction And many more problems can be cast in this paradigm. 7
  • 8. Example HTC resource Open Science Grid (OSG) operates a large scale HTC pool Number of CPU cores in use by OSG HTC jobs 8
  • 9. Example HTC users OSG serving many different scientific domains Weekly CPU hours used by OSG HTC jobs 9
  • 10. HTC and Kubernetes Can we use Kubernetes for HTC? 10
  • 11. K8s in principle great for HTC One HTC job per Pod Job n Job n Job n Job 3 Job 2 Job 1 … Many Pods per HW node Job a … Job z Many HW nodes per Pool VPN Overlay Shared Storage Flexible Orchestration 11
  • 12. K8s in practice not so great • Indexed parameter passing • Automatic Input/Output handling Note: HTC jobs typically do not require a shared FS • Fair Share Scheduling policies Essential for highly contested resources • Can it scale to millions of queued Pods? K8s missing a few features HTC users are used to 12
  • 13. K8s in practice not so great • Indexed parameter passing • Automatic Input/Output handling Note: HTC jobs typically do not require a shared FS • Fair Share Scheduling policies Essential for highly contested resources • Can it scale to millions of queued Pods? K8s missing a few features HTC users are used to Plus, lack of: • A familiar API/CLI • Seamless integration with other resources 13
  • 14. HTC and Kubernetes How about leveraging HTCondor with K8s? 14
  • 15. Using HTCondor with Kubernetes Why HTCondor? • One of the major batch systems • HTC-focused architecture • Very flexible, often used in heterogeneous environments • Native support for containers • Highly scalable 15 https://research.cs.wisc.edu/htcondor/
  • 16. Using HTCondor with Kubernetes Why HTCondor? • One of the major batch systems • HTC-focused architecture • Very flexible, often used in heterogeneous environments • Native support for containers • Highly scalable The system used inside the Open Science Grid (OSG) 16 https://research.cs.wisc.edu/htcondor/
  • 17. Using HTCondor with Kubernetes Why HTCondor? • One of the major batch systems • HTC-focused architecture • Very flexible, often used in heterogeneous environments • Native support for containers The system used inside the Open Science Grid (OSG) 17 https://research.cs.wisc.edu/htcondor/ Nov 17th, 2019
  • 18. HTCondor Architecture Persistent Job Queue (can be more than one, but all independent) Schedd Submission typically local (e.g. ssh) Collector Central manager for bookkeeping (can have multiple for HA) Startd CPU Startd CPU Startd CPU Each execute resource has a control process 18
  • 19. HTCondor Architecture Persistent Job Queue (can be more than one, but all independent) Schedd Submission typically local (e.g. ssh) Collector Requirements based matchmaking Startd CPU Startd CPU Startd CPU Each execute resource has a control process Negotiator Central manager for bookkeeping (can have multiple for HA) Direct communication with execute node Fair share allocations (User and/or group based) Includes data transfers 19
  • 20. HTCondor Architecture Persistent Job Queue (can be more than one, but all independent) Schedd Submission typically local (e.g. ssh) Collector Startd CPU Startd CPU Startd CPU Each execute resource has a control process Negotiator Central manager for bookkeeping (can have multiple for HA) Direct communication with execute node Match valid for tens of minutes, can be used for many jobs Requirements based matchmaking Fair share allocations (User and/or group based) Includes data transfers 20
  • 21. HTC and Kubernetes How do we leverage HTCondor with K8s? 21
  • 22. Using HTCondor with Kubernetes The Kubernetes resources can be joined to an existing HTCondor Pool Pod Startd CPU Schedd Collector Negotiator No persistency needed Using CCB for NAT traversal 22
  • 23. Using HTCondor with Kubernetes Or a complete HTCondor Pool can be created inside Kubernetes Pod Startd CPU Schedd Collector Negotiator Needs persistent storage Persistent storage recommended Simpler setup due to VPN 23
  • 24. Using HTCondor with Kubernetes Can be used to join external resources for K8s users Pod Startd CPU Schedd Collector Negotiator Pod Startd CPU Needs incoming networking 24
  • 25. HTC Users and Containers Most HTC jobs are application + arguments + data • Container just a convenient way to package the dependencies • Usually a department/community maintained one 25
  • 26. HTCondor and Containers HTCondor allows for a container to be attached to a job • Will use singularity to invoke it • After binding the application and data Most HTC jobs are application + arguments + data • Container just a convenient way to package the dependencies • Usually a department/community maintained one 26
  • 27. HTCondor and Containers HTCondor allows for a container to be attached to a job • Will use singularity to invoke it • After binding the application and data Most HTC jobs are application + arguments + data • Container just a convenient way to package the dependencies • Usually a department/community maintained one In principle Docker could be an option, but not currently supported 27
  • 28. Nested containerization Singularity can be invoked inside a Docker container • Fully unprivileged with Linux Kernel >= 4.18 Makes HTCondor execute in Kubernetes trivial to implement Schedd Pod – CentOS/Ubuntu/… with Singularity Startd Singularity – Dept.Cont. Application Dept.Cont. + Appl. + Args. + InData OurData + Logs 28
  • 29. Explicit provisioning Many systems still on older Linux Kernel Versions (e.g. CentOS 7) • Unprivileged nested containerization not an option there Some users also do not like singularity • It does have some differences from Docker • e.g. The root partition is always Read-Only Kubernetes Pod can be launched with Container needed by User jobs • Only jobs needing that Container will match • Asking users to create a HTCondor-specific Container usually a non-starter • Better to inject HTCondor bins and config at Pod startup A ready-to-use template available at: https://github.com/sfiligoi/prp-htcondor-pool Quite effective when only a few Container Images needed 29
  • 30. Opportunistic use Most HTC jobs tolerate preemption • HTC Pods great backfill option for keeping your Kubernetes resources fully utilized Just launch HTCondor execute Pods with a very low K8s priority Works best when you have a single backfill pool 30
  • 31. To conclude Kubernetes is a great foundation platform for HTC jobs • But a bit hard to use by itself HTCondor can add the needed glue to make it easy to use • Data handling • Parametrized argument passing • Robust, contention-optimized and scalable policy manager OSG and PRP have been successfully using this combination for awhile 31
  • 32. Acknowledgments This work was partially funded by US National Science Foundation (NSF) awards CNS-1456638, CNS-1730158, ACI-1540112, ACI-1541349, MPS-1148698, OAC-1826967, OAC 1450871, OAC-1659169 and OAC-1841530. 32