Introduction to security in the Open Science Grid - OSG School 2014 - Igor Sfiligoi
Introduction to Grid computing, including PKI-based security. With emphasis on Security in the Open Science Grid context.
Lecture given at the OSG User School 2014
https://twiki.opensciencegrid.org/bin/view/Education/OSGUserSchool2014
Using ssh as portal - The CMS CRAB over glideinWMS experience - Igor Sfiligoi
The User Analysis of the CMS experiment is performed in a distributed way using both Grid and dedicated resources. In order to insulate the users from the details of the computing fabric, CMS relies on the CRAB (CMS Remote Analysis Builder) package as an abstraction layer. CMS has recently switched from a client-server version of CRAB to a purely client-based solution, with ssh being used to interface with either HTCondor or glideinWMS batch systems. This switch has resulted in a significant improvement in user satisfaction, as well as a significant simplification of the CRAB code base. This presentation covers the reasoning behind the change as well as the new experience.
Presented at CHEP2013 in Amsterdam.
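The ssh-as-portal idea described above can be sketched as follows; the host name, submit-file path and the direct use of condor_submit are illustrative assumptions, not CRAB's actual interface.

```python
# Minimal sketch of ssh as a portal: the client composes an HTCondor
# submission command and runs it on a remote submit node via ssh, so
# no custom server component is needed on either side.
import shlex
import subprocess

def build_remote_submit(host, submit_file):
    """Compose the ssh command line that submits a job remotely."""
    remote_cmd = "condor_submit " + shlex.quote(submit_file)
    return ["ssh", host, remote_cmd]

def submit(host, submit_file):
    """Run the submission; returns the ssh exit code."""
    return subprocess.call(build_remote_submit(host, submit_file))
```

The client only needs ssh access to the submit node; everything else stays server-side, which is what simplifies the code base.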
Introduction to Distributed HTC and overlay systems - OSG User School 2014 - Igor Sfiligoi
Lecture about Distributed HTC and overlay systems, given at the OSG User School 2014.
https://twiki.opensciencegrid.org/bin/view/Education/OSGUserSchool2014
glideinWMS, The OSG overlay DHTC system - OSG School 2014 - Igor Sfiligoi
Lecture about glideinWMS, given in the DHTC training context of the OSG User School 2014.
https://twiki.opensciencegrid.org/bin/view/Education/OSGUserSchool2014
This document provides a high-level overview of how glideinWMS-based instances do matchmaking in CMS (a High Energy Physics experiment). The information is accurate as of early Dec 2012.
VMworld 2013: Performance and Capacity Management of DRS Clusters - VMworld
VMworld 2013
Anne Holler, VMware
Ganesha Shanmuganathan, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
Where to find DHTC resources - OSG School 2014 - Igor Sfiligoi
A lecture about the available resources for Distributed Computing, including pros and cons.
Given at the OSG User School 2014:
https://twiki.opensciencegrid.org/bin/view/Education/OSGUserSchool2014
Augmenting Big Data Analytics with Nirvana - Igor Sfiligoi
It has been proven that Big Data Analytics can provide a major competitive advantage, yet most deployments cannot reach the vast majority of the data owned by an organization!
This presentation introduces the concept of tiered Big Data Analytics, with an eye on the use of Nirvana in this scenario.
Digibury: SciVisum - Making your website fast - and scalable - Lizzie Hodgson
Deri Jones is a renowned speaker and thought-leader in the Web performance arena. In his Digibury talk he not only covered war-stories from many years in the web performance space, he also gave tips on making any page fast, and explained how to use open-source tools in addressing the challenges of scaling.
OSDC 2018 | Migrating to the cloud by Devdas Bhagat - NETWAYS
This is an experience report of a migration from self-hosted services to running in the cloud. While there have been plenty of business case studies showing the benefits of a cloud migration, there are very few reports on the IT side of the migration. This talk covers the migration of Spilgames (a small Dutch games publisher) from a self-hosted OpenStack and hardware-based infrastructure to Google Cloud, the challenges, and the tooling (and lack thereof). This migration is still a work in progress, and the talk will cover as much detail as possible.
Moving from the Iron Age to the Cloud Age in computing is supposed to save us money yet many migrations seem to cost more in the long run and result in infrastructures as complex to manage as what we had before. This is often the result of the so called “lift & shift” approach many take – it’s a short term win that doesn’t address why you wanted to move to the cloud in the first place.
The Cloud Age affords us the opportunity to treat our infrastructure not as something special, but as something disposable. By applying the practices of continuous integration and delivery to our infrastructure and configuration management, we can build truly scalable infrastructures to host our applications' wildest dreams.
In this talk we will look at the tools and processes that can be adopted to truly make use of the possibilities of the Cloud.
Oracle SOA Suite Performance Tuning - UKOUG Application Server & Middleware SI... - C2B2 Consulting
Matt Brasier, C2B2 Head of Consulting, speaking at the UK Oracle User Group App Server & Middleware Special Interest Group Event on Wednesday, the 9th of October 2013.
#OSSPARIS19 - How to improve database observability - CHARLES JUDITH, Criteo - Paris Open Source Summit
#Data management & #Blockchain - Track - Data : database
Delivering a database service is not a simple job; to ensure that everything is working correctly, your platform needs to be observable. In this talk, I’ll explain how we make our MySQL/MariaDB databases observable. We’ll talk about the RED and USE methods and the golden signals. You’ll discover how we dealt with questions like “We think the database is slow”. This talk will show you how to make your databases observable with open source solutions.
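As a toy illustration of the RED method (Rate, Errors, Duration) mentioned in the abstract, the database call path can be wrapped with a decorator that records the three signals; a real deployment would export them to a monitoring system rather than keep them in an in-process dict, and `run_query` is a made-up stand-in for an actual database call.

```python
import time

# In-process RED metrics: request count (rate), error count, durations.
metrics = {"requests": 0, "errors": 0, "durations": []}

def observed(fn):
    """Record Rate, Errors and Duration for every call to fn."""
    def wrapper(*args, **kwargs):
        metrics["requests"] += 1
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        except Exception:
            metrics["errors"] += 1
            raise
        finally:
            metrics["durations"].append(time.perf_counter() - start)
    return wrapper

@observed
def run_query(sql):
    # Stand-in for a real MySQL/MariaDB query.
    return "result of " + sql
```

From these three series one can derive the slow-query answer directly: rate and error ratio per interval, plus duration percentiles.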
Title: Java at Scale - What Works and What Doesn't Work Nearly so Well
Speaker: Matt Schuetze, Product Manager, Azul Systems
Abstract: Java gets used everywhere and for everything due to its efficiency, portability, the productivity it offers developers, and the platform it provides for application frameworks and non-Java languages. But all is not perfect; developers both benefit from and struggle against Java's greatest strength: its memory management. In this session, Matt will describe where Java needs help, the challenges it presents developers who need to provide reliable performance, the reasons those challenges exist, and how developers have traditionally worked around them. He will then discuss where Zing fits in the spectrum of use cases where large memory and predictable performance dominate essential application characteristics.
John Griffith, Block Storage Project PTL, outlines the changes made in the Icehouse release as well as upcoming updates for Juno.
Learn more about Block Storage (Cinder) here: https://wiki.openstack.org/wiki/Cinder
Presentation at the Plone Conference Brazil 2013.
How to create a Plone deployment that performs like crazy and survives not only a datacenter failure, but even keeps on running when all Plone heads are down.
OSMC 2019 | How to improve database observability by Charles Judith - NETWAYS
Delivering a database service is not a simple job; to ensure that everything is working correctly, your platform needs to be observable. In this talk, I’ll explain how we make our MySQL/MariaDB databases observable. We’ll talk about the RED and USE methods and the golden signals. You’ll discover how we dealt with questions like “We think the database is slow”. This talk will show you how to make your databases observable with open source solutions.
I recently presented this 2-hour session about the automation model developed at Videobet and the tools used in R&D, QA and operations:
Issue mgmt.: JIRA/Greenhopper
Build system and repository: Maven & Nexus
Build server: QuickBuild
Code quality: Sonar
Continuous Integration: Selenium Grid
Crash dump analysis: Socorro
Database versioning: Flyway DB
2012 Annual State of the Union for Mobile Ecommerce Performance [Velocity EU] - Strangeloop
On October 3 at Velocity EU, Strangeloop president Joshua Bixby unveiled the findings from the first study ever conducted of mobile performance over cellular networks.
In July and September 2012, Strangeloop conducted an industry first: a mobile performance survey of top ecommerce sites. The "2012 State of Mobile Ecommerce Performance" documents how Strangeloop tested top Alexa-ranked retail sites on a variety of mobile devices to find answers to questions like:
- How long does the median site take to load in mobile browsers?
- Which sites were fastest?
- Do some mobile OS/browsers/devices offer a consistently faster user experience than others?
- How much faster are pages served over LTE than over 3G?
- How do all of these findings compare to similar research conducted for desktop performance, published in Strangeloop’s annual Page Speed and Website Performance State of the Union reports?
The report is available for download at http://www.strangeloopnetworks.com/.
SaltConf14 - Thomas Jackson, LinkedIn - Safety with Power Tools - SaltStack
As infrastructure scales, simple tasks become increasingly difficult. For large infrastructures to be manageable, we use automation. But automation, like any power tool, comes with its own set of risks and challenges. Automation should be handled like production code, and great care should be exercised with power tools. This talk will cover how SaltStack is used at LinkedIn and offer tips and tricks for automating management with SaltStack at massive scale including a look at LinkedIn-inspired Salt features such as blacklist and prereq states. It will also cover Salt master and minion instrumentation and a compilation of how not to use Salt.
Comparing single-node and multi-node performance of an important fusion HPC c... - Igor Sfiligoi
Fusion simulations have traditionally required the use of leadership-scale High Performance Computing (HPC) resources in order to produce advances in physics. The impressive improvements in compute and memory capacity of many-GPU compute nodes are now allowing some problems that once required a multi-node setup to be solvable on a single node as well. When possible, the increased interconnect bandwidth can result in an order of magnitude higher science throughput, especially for communication-heavy applications. In this paper we analyze the performance of the fusion simulation tool CGYRO, an Eulerian gyrokinetic turbulence solver designed and optimized for collisional, electromagnetic, multiscale simulation, which is widely used in the fusion research community. Due to the nature of the problem, the application has to work on a large multi-dimensional computational mesh as a whole, requiring frequent exchange of large amounts of data between the compute processes. In particular, we show that the average-scale nl03 benchmark CGYRO simulation can be run at an acceptable speed on a single Google Cloud instance with 16 A100 GPUs, outperforming 8 NERSC Perlmutter Phase 1 nodes, 16 ORNL Summit nodes and 256 NERSC Cori nodes. Moving from a multi-node to a single-node GPU setup, we get comparable simulation times using less than half the number of GPUs. Larger benchmark problems, however, still require a multi-node HPC setup due to GPU memory capacity needs, since at the time of writing no vendor offers nodes with a sufficient GPU memory setup. The upcoming external NVSwitch does, however, promise to deliver an almost equivalent solution for up to 256 NVIDIA GPUs.
Presented at PEARC22.
Paper DOI: https://doi.org/10.1145/3491418.3535130
The anachronism of whole-GPU accounting - Igor Sfiligoi
NVIDIA has been making steady progress in increasing the compute performance of its GPUs, resulting in order-of-magnitude compute throughput improvements over the years. With several models of GPUs coexisting in many deployments, the traditional accounting method of treating all GPUs as equal no longer reflects compute output. Moreover, for applications that require significant CPU-based compute to complement the GPU-based compute, it is becoming harder and harder to make full use of the newer GPUs, requiring sharing of those GPUs between multiple applications in order to maximize the achievable science output. This further reduces the value of whole-GPU accounting, especially when the sharing is done at the infrastructure level. We thus argue that GPU accounting for throughput-oriented infrastructures should be expressed in GPU core hours, much like it is normally done for CPUs. While GPU core compute throughput does change between GPU generations, the variability is similar to what we expect to see among CPU cores. To validate our position, we present an extensive set of run time measurements of two IceCube photon propagation workflows on 14 GPU models, using both on-prem and Cloud resources. The measurements also outline the influence of GPU sharing at both the HTCondor and Kubernetes infrastructure levels.
Presented at PEARC22.
Document DOI: https://doi.org/10.1145/3491418.3535125
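The core-hour accounting the paper argues for reduces to simple arithmetic; the CUDA-core counts below are published figures for two common models, and the sharing scenario is illustrative.

```python
# GPU accounting in core hours rather than whole-GPU hours.
CUDA_CORES = {"V100": 5120, "A100": 6912}

def gpu_core_hours(model, wall_hours, share=1.0):
    """Core hours charged to a job holding `share` of one GPU."""
    return CUDA_CORES[model] * wall_hours * share

# Two jobs sharing an A100 evenly for 2 hours account for exactly the
# same total as one job using the whole GPU for 2 hours, which is the
# point of the scheme: sharing does not distort the books.
```

Under whole-GPU accounting the same two shared jobs would be billed as 4 GPU hours instead of 2, double-counting the hardware.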
Auto-scaling HTCondor pools using Kubernetes compute resources - Igor Sfiligoi
HTCondor has been very successful in managing globally distributed, pleasantly parallel scientific workloads, especially as part of the Open Science Grid. HTCondor system design makes it ideal for integrating compute resources provisioned from anywhere, but it has very limited native support for autonomously provisioning resources managed by other solutions. This work presents a solution that allows for autonomous, demand-driven provisioning of Kubernetes-managed resources. A high-level overview of the employed architectures is presented, paired with the description of the setups used in both on-prem and Cloud deployments in support of several Open Science Grid communities. The experience suggests that the described solution should be generally suitable for contributing Kubernetes-based resources to existing HTCondor pools.
Presented at PEARC22.
Paper DOI: https://doi.org/10.1145/3491418.3535123
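The demand-driven provisioning described above boils down to a control loop: inspect the HTCondor queue, derive a target worker-pod count, and resize a Kubernetes deployment. The sizing rule below is a hypothetical minimal policy, not the actual algorithm of the presented system.

```python
def desired_workers(idle_jobs, busy_workers, max_workers):
    """Target worker-pod count: keep the busy workers, add one pod
    per idle job, and never exceed the configured cap."""
    return min(max_workers, busy_workers + idle_jobs)

# The surrounding loop would query the HTCondor queue for idle_jobs
# and busy_workers, then patch the deployment's replica count to the
# returned value on every iteration.
```

Because HTCondor workers simply join the pool when their pods start, the loop needs no coordination beyond setting the replica count.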
Performance Optimization of CGYRO for Multiscale Turbulence Simulations - Igor Sfiligoi
Overview of the recent performance optimization of CGYRO, an Eulerian gyrokinetic fusion plasma solver, with emphasis on multiscale turbulence simulations.
Presented at the joint US-Japan Workshop on Exascale Computing Collaboration and the 6th workshop of the US-Japan Joint Institute for Fusion Theory (JIFT) program (Jan 18th, 2022).
Comparing GPU effectiveness for Unifrac distance compute - Igor Sfiligoi
Poster presented at PEARC21.
The poster contains the complete scaling plots for both unweighted and weighted normalized UniFrac compute for sample sizes ranging from 1k to 307k on both GPUs and CPUs.
Managing Cloud networking costs for data-intensive applications by provisioni... - Igor Sfiligoi
Presented at PEARC21.
Many scientific high-throughput applications can benefit from the elastic nature of Cloud resources, especially when there is a need to reduce time to completion. Cost considerations are usually a major issue in such endeavors, with networking often a major component; for data-intensive applications, egress networking costs can exceed the compute costs. Dedicated network links provide a way to lower the networking costs, but they do add complexity. In this paper we provide a description of a 100 fp32 PFLOPS Cloud burst in support of IceCube production compute, which used the Internet2 Cloud Connect service to provision several logically dedicated network links from the three major Cloud providers, namely Amazon Web Services, Microsoft Azure and Google Cloud Platform, that in aggregate enabled approximately 100 Gbps of egress capability to on-prem storage. It provides technical details about the provisioning process, the benefits and limitations of such a setup, and an analysis of the costs incurred.
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access - Igor Sfiligoi
Presented at PEARC21.
Most experimental sciences now rely on computing, and biological sciences are no exception. As datasets get bigger, so do the computing costs, making proper optimization of the codes used by scientists increasingly important. Many of the codes developed in recent years are based on the Python-based NumPy, due to its ease of use and good performance characteristics. The composable nature of NumPy, however, does not generally play well with the multi-tier nature of modern CPUs, making any non-trivial multi-step algorithm limited by the external memory access speeds, which are hundreds of times slower than the CPU’s compute capabilities. In order to fully utilize the CPU compute capabilities, one must keep the working memory footprint small enough to fit in the CPU caches, which requires splitting the problem into smaller portions and fusing together as many steps as possible. In this paper, we present changes based on these principles to two important functions in the scikit-bio library, principal coordinates analysis and the Mantel test, that resulted in over 100x speed improvement in these widely used, general-purpose tools.
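The cache-blocking principle from the abstract can be sketched in NumPy: instead of materializing full-size temporaries across several composed steps, process the array in blocks small enough to stay cache-resident and fuse the steps inside the loop. The function and block size here are illustrative, not the actual scikit-bio changes.

```python
import numpy as np

def centered_sumsq_naive(x, mu):
    d = x - mu                    # full-size temporary array
    return float((d * d).sum())   # another full-size temporary

def centered_sumsq_blocked(x, mu, block=4096):
    # Same result, but the working set of each iteration is one
    # cache-sized block, and the subtract/square/sum steps are fused
    # inside the loop instead of sweeping main memory three times.
    total = 0.0
    for i in range(0, x.size, block):
        d = x[i:i + block] - mu
        total += float((d * d).sum())
    return total
```

For arrays much larger than the CPU caches, the blocked variant trades a little Python loop overhead for far fewer trips to main memory.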
Using A100 MIG to Scale Astronomy Scientific Output - Igor Sfiligoi
Presented at GTC21.
The raw computing power of GPUs has been steadily increasing, significantly outpacing the CPU gains. This poses a problem for many GPU-enabled scientific applications that use CPU code paths to feed data to the GPU code, resulting in lower GPU utilization, and thus reduced gains in scientific output. Applications that are high-throughput in nature, such as astronomy-focused IceCube and LIGO, can partially work around the problem by running several instances of the executable on the same GPU. This approach, however, is sub-optimal both in terms of application performance and workflow management complexity. The recently introduced Multi-Instance GPU (MIG) capability, available on the NVIDIA A100 GPU, provides a much cleaner and easier-to-use alternative by allowing the logical slicing of the powerful GPU and assigning different slices to different applications. And at least in the case of IceCube, it can provide over 3x more scientific output on the same hardware.
Using commercial Clouds to process IceCube jobs - Igor Sfiligoi
Presented at EDUCAUSE CCCG March 2021.
The IceCube Neutrino Observatory is the world’s premier facility to detect neutrinos. Built at the South Pole in natural ice, it requires extensive and expensive calibration to properly track the neutrinos. Most of the required compute power comes from on-prem resources through the Open Science Grid, but IceCube can easily harness Cloud compute at any scale, too, as demonstrated by a series of Cloud bursts. This talk provides details of the performed Cloud bursts, as well as some insight into the science itself.
Fusion simulations have traditionally required the use of leadership-scale HPC resources in order to produce advances in physics. One such package is CGYRO, a premier tool for multi-scale plasma turbulence simulation. CGYRO is a typical HPC application that will not fit into a single node, as it requires several terabytes of memory and O(100) TFLOPS of compute capability for cutting-edge simulations. CGYRO also requires high-throughput and low-latency networking, due to its reliance on global FFT computations. While in the past such compute may have required hundreds, or even thousands, of nodes, recent advances in hardware capabilities allow just tens of nodes to deliver the necessary compute power. We explored the feasibility of running CGYRO on Cloud resources provided by Microsoft on their Azure platform, using the InfiniBand-connected HPC resources in spot mode. We observed both that the CPU-only resources were very efficient and that running in spot mode was doable, with minimal side effects. The GPU-enabled resources were less cost-effective but allowed for higher scaling.
For IceCube, a large amount of photon propagation simulation is needed to properly calibrate the natural ice. The simulation is compute-intensive and ideal for GPU compute. This Cloud run was more data-intensive than previous ones, producing 130 TB of output data. To keep egress costs in check, we created dedicated network links via the Internet2 Cloud Connect Service.
Scheduling a Kubernetes Federation with Admiralty - Igor Sfiligoi
Presented at OSG All-Hands Meeting 2020 - USCMS-USATLAS Session.
This talk presented the PRP experience with using Admiralty as a Kubernetes federation solution, discussing why we need it, why Admiralty is the best (if not the only) solution for our needs, and how it works.
Accelerating microbiome research with OpenACC - Igor Sfiligoi
Presented at OpenACC Summit 2020.
UniFrac is a commonly used metric in microbiome research for comparing microbiome profiles to one another. Computing UniFrac on modest sample sizes used to take a workday on a server class CPU-only node, while modern datasets would require a large compute cluster to be feasible. After porting to GPUs using OpenACC, the compute of the same modest sample size now takes only a few minutes on a single NVIDIA V100 GPU, while modern datasets can be processed on a single GPU in hours. The OpenACC programming model made the porting of the code to GPUs extremely simple; the first prototype was completed in just over a day. Getting full performance did however take much longer, since proper memory access is fundamental for this application.
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie... - Igor Sfiligoi
Presented at PEARC20.
This talk presents the expansion of IceCube’s production HTCondor pool using cost-effective GPU instances in preemptible mode gathered from the three major Cloud providers, namely Amazon Web Services, Microsoft Azure and the Google Cloud Platform. Using this setup, we sustained about 15k GPUs for a whole workday, corresponding to around 170 PFLOP32s, integrating over one EFLOP32 hour worth of science output for a price tag of about $60k. In this paper, we provide the reasoning behind the Cloud instance selection, a description of the setup and an analysis of the provisioned resources, as well as a short description of the actual science output of the exercise.
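The integrated science output quoted above follows directly from the sustained rate; a quick check, assuming an 8-hour workday (the exact duration is not stated here), shows that ~170 PFLOP32s sustained for a workday exceeds one EFLOP32 hour.

```python
def eflop32_hours(pflop32s, hours):
    """Integrated fp32 compute for a rate sustained over `hours`.
    1 EFLOPS = 1000 PFLOPS, so divide the PFLOP-hour product by 1000."""
    return pflop32s * hours / 1000.0
```

At 170 PFLOP32s, an 8-hour run integrates to 1.36 EFLOP32 hours, consistent with the "over one EFLOP32 hour" claim.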
Porting and optimizing UniFrac for GPUs - Igor Sfiligoi
Poster presented at PEARC20.
UniFrac is a commonly used metric in microbiome research for comparing microbiome profiles to one another (“beta diversity”). The recently implemented Striped UniFrac added the capability to split the problem into many independent subproblems and exhibits near-linear scaling. In this poster we describe the steps undertaken in porting and optimizing Striped UniFrac for GPUs. We reduced the run time of computing UniFrac on the published Earth Microbiome Project dataset from 13 hours on an Intel Xeon E5-2680 v4 CPU to 12 minutes on an NVIDIA Tesla V100 GPU, and to about one hour on a laptop with an NVIDIA GTX 1050 (with minor loss in precision). Computing UniFrac on a larger dataset containing 113k samples reduced the run time from over one month on the CPU to less than 2 hours on the V100 and 9 hours on an NVIDIA RTX 2080 Ti GPU (with minor loss in precision). This was achieved by using OpenACC to generate the GPU offload code and by improving the memory access patterns. A BSD-licensed implementation is available, which produces a C shared library linkable from any programming language.
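The split into independent subproblems can be illustrated with a generic block decomposition of the all-pairs distance computation; this is a sketch of the idea, not the exact stripe scheme used by Striped UniFrac.

```python
def pair_blocks(n, block):
    """Partition all i < j sample pairs into independent work blocks,
    each of which can be dispatched to a separate GPU or node.
    Each pair appears in exactly one block, so results can simply be
    concatenated into the full distance matrix."""
    blocks = []
    for start in range(0, n, block):
        rows = range(start, min(start + block, n))
        blocks.append([(i, j) for i in rows for j in range(i + 1, n)])
    return blocks
```

Because the blocks share no pairs, they need no synchronization, which is what gives the near-linear scaling mentioned above.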
Demonstrating 100 Gbps in and out of the public Clouds - Igor Sfiligoi
Poster presented at PEARC20.
There is increased awareness and recognition that public Cloud providers do provide capabilities not found elsewhere, with elasticity being a major driver. The value of elastic scaling is, however, tightly coupled to the capabilities of the networks that connect all involved resources, both in the public Clouds and at the various research institutions. This poster presents results of measurements involving file transfers inside public Cloud providers, fetching data from on-prem resources into public Cloud instances, and fetching data from public Cloud storage into on-prem nodes. The networking of the three major Cloud providers, namely Amazon Web Services, Microsoft Azure and the Google Cloud Platform, has been benchmarked. The on-prem nodes were either managed by the Pacific Research Platform or located at the University of Wisconsin – Madison. The observed sustained throughput was of the order of 100 Gbps in all the tests moving data in and out of the public Clouds, with throughput reaching into the Tbps range for data movements inside the public Cloud providers themselves. All the tests used HTTP as the transfer protocol.
TransAtlantic Networking using Cloud links - Igor Sfiligoi
Scientific communities have only a limited amount of bandwidth available for transferring data between the US and the EU. We know Cloud providers have plenty of bandwidth available, but at what cost?
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 - Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview, including the concepts of Customer Key and Double Key Encryption.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... - UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology pushes into IT, I was wondering, as an “infrastructure container Kubernetes guy”, how does this fancy AI technology get managed from an infrastructure operations view? Is it possible to apply our lovely cloud-native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and provide you with a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply it to our own infrastructure and get it to work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I have already gotten working for real.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
How is glideinWMS different from vanilla HTCondor
1. Aug 2014 How are glideins different 1
glideinWMS Training
How is glideinWMS different from vanilla HTCondor
by Igor Sfiligoi, UC San Diego
2. Overview
● These slides provide an overview of why glideinWMS installations behave differently than dedicated, LAN-based HTCondor ones
3. Very heterogeneous resource pool
● Many user jobs have data constraints
– And data access varies from site to site
● Each site basically results in a different “type of resource”
– Making the resources very heterogeneous, for matchmaking purposes
– O(100) types of resources not unusual
● Leads to autocluster number explosion
– In dedicated HTCondor pools, 5 classes of resources is typically already a lot
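The autocluster explosion can be made concrete with a toy model: HTCondor groups idle jobs by the distinct combinations of their matchmaking-significant attribute values, so a per-site data constraint multiplies the count. This Python sketch is illustrative only; the attribute names (`RequestMemory`, `DESIRED_Sites`) stand in for whatever attributes matter in a real pool:

```python
from itertools import product

# Toy model: an "autocluster" is the set of jobs sharing identical values
# for every attribute that matters to matchmaking.
def count_autoclusters(jobs, significant_attrs):
    signatures = {tuple(job.get(a) for a in significant_attrs) for job in jobs}
    return len(signatures)

# In a LAN pool, jobs may differ only by memory request: few autoclusters.
lan_jobs = [{"RequestMemory": m} for m in (1024, 2048, 4096)] * 10
print(count_autoclusters(lan_jobs, ["RequestMemory"]))  # 3

# In a glideinWMS pool, a per-site data constraint multiplies the count.
sites = [f"Site{i}" for i in range(100)]
grid_jobs = [{"RequestMemory": m, "DESIRED_Sites": s}
             for m, s in product((1024, 2048, 4096), sites)]
print(count_autoclusters(grid_jobs, ["RequestMemory", "DESIRED_Sites"]))  # 300
```

The same three memory classes become 300 autoclusters once 100 site types enter the signature, which is why matchmaking cost grows so quickly.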
4. Provisioning vs matchmaking
● Glideins are provisioned (i.e. requested from sites) because some user jobs need more resources
– But once provisioned they may not match any jobs
● Two main reasons
– Trigger jobs already gone (i.e. not idle anymore)
– Mismatch between provisioning and matchmaking requirements
● Dedicated HTCondor installations don't have 2 levels of matchmaking
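The two decision levels can be sketched in Python: provisioning is triggered by coarse criteria over the idle job queue, while per-job matchmaking happens much later against the glidein that actually materializes, and is stricter. All names and predicates below are a toy model, not glideinWMS code:

```python
# Level 1: the provisioning decision — request a glidein if any idle job
# *could* run at the site (coarse, queue-level criteria).
def provision(idle_jobs, site):
    return any(site in j["desired_sites"] for j in idle_jobs)

# Level 2: per-job matchmaking against the glidein that shows up,
# evaluated much later and with stricter requirements.
def match(job, glidein):
    return (glidein["site"] in job["desired_sites"]
            and glidein["memory"] >= job["request_memory"])

idle = [{"desired_sites": {"SiteA"}, "request_memory": 8192}]
assert provision(idle, "SiteA")             # a glidein gets requested...

glidein = {"site": "SiteA", "memory": 4096}  # ...but arrives too small
assert not match(idle[0], glidein)           # ...and never matches the job
```

This is the second failure mode from the slide (requirements mismatch); the first one (trigger job already gone) simply corresponds to `idle` being empty by the time the glidein starts.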
5. Limited lease lifetime
● Glideins are basically leased execute nodes
– And they come with a limited lifetime
● Lease times usually on the order of one day
– Each glidein typically runs less than 10 user jobs
● User jobs must fit in the remaining lifetime
– Or they will be killed
● Makes for more complex matchmaking decisions
– And requires user help
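The "must fit in the remaining lifetime" check amounts to a simple predicate. glideinWMS glideins advertise their planned end-of-life (e.g. as a `GLIDEIN_ToDie` timestamp), which a job's requirements can compare against its estimated runtime; this Python version is a sketch of the logic, not the actual ClassAd expression:

```python
import time

# Sketch: a job only matches a glidein if its estimated runtime fits
# into the lease time remaining before the glidein's advertised
# end-of-life (GLIDEIN_ToDie-style timestamp, in seconds since epoch).
def fits_remaining_lifetime(glidein_to_die, estimated_runtime, now=None):
    now = time.time() if now is None else now
    return glidein_to_die - now > estimated_runtime

lease_end = 1_000_000 + 86_400  # glidein dies one day after "now"
assert fits_remaining_lifetime(lease_end, 6 * 3600, now=1_000_000)       # 6h job fits
assert not fits_remaining_lifetime(lease_end, 30 * 3600, now=1_000_000)  # 30h job doesn't
```

The "requires user help" bullet refers to the estimated runtime: only the user can supply a realistic value for it.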
6. Multicore and limited lifetimes
● Limited lifetimes particularly problematic for multi-core jobs, resulting in significant waste
– Since it is unlikely all jobs will terminate at exactly the same time
[Figure: jobs 1–9 packed across 4 CPUs over time; once no suitable user jobs remain, cores sit idle (waste) until the pilot job can terminate]
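The tail waste in the figure can be quantified with a toy calculation: each core goes idle when its last job finishes, but the pilot only terminates when the slowest core is done, so every earlier-finishing core contributes idle core-hours. A minimal sketch (the numbers are made up for illustration):

```python
# Toy calculation of the tail waste on a multi-core glidein: the pilot
# can only terminate once the slowest core finishes, so every core that
# drains earlier wastes the difference.
def tail_waste_core_hours(per_core_busy_hours):
    end = max(per_core_busy_hours)              # pilot termination time
    return sum(end - h for h in per_core_busy_hours)

# 4 cores whose job streams dry up at different times, as in the figure:
print(tail_waste_core_hours([20, 23, 18, 24]))  # 11 core-hours wasted
```

A single-core glidein has no such tail by construction, which is why the problem is specific to multi-core provisioning.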
7. Automatic shut down
● Glideins are configured to shut down automatically if not used for some time, unlike a dedicated HTCondor pool
– Those resources could be used by someone else
– HTCondor not the only user of the resources
● Default Unclaimed threshold quite low
– About 10 minutes
● This puts stringent limits on matchmaking
– If a Startd is not matched in time, it is “lost”
– And restarting glideins is expensive
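The shutdown policy boils down to a simple predicate over the slot state. A hedged sketch, where the 600-second threshold merely illustrates the roughly 10-minute default mentioned above (the real value is configurable):

```python
# Sketch of the glidein's idle-shutdown policy: if the startd has been
# Unclaimed for longer than a threshold, give the slot back to the site
# rather than keep holding resources HTCondor is not the only user of.
UNCLAIMED_THRESHOLD = 600  # seconds; illustrative of the ~10 min default

def should_shut_down(state, seconds_unclaimed):
    return state == "Unclaimed" and seconds_unclaimed > UNCLAIMED_THRESHOLD

assert not should_shut_down("Claimed", 3600)    # busy slots stay up
assert should_shut_down("Unclaimed", 900)       # idle too long: shut down
assert not should_shut_down("Unclaimed", 120)   # still within the grace window
```

The matchmaking pressure follows directly: the negotiator has at most this window to claim a freshly started glidein before it is lost.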
8. Strong end-to-end security
● A glideinWMS system will typically span many different locations
● x509 authentication between all nodes required
– At daemon startup, then security session cached
– With the exception of Schedd<->Startd, where security is mediated through the Collector
● All over-the-wire communication integrity checked
– Requires authentication; neither is typically used in LAN deployments
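Integrity checking can be illustrated generically: once two daemons share a session key, each message carries a MAC that the receiver verifies. This is a toy model using Python's standard `hmac` module, not HTCondor's actual wire protocol:

```python
import hashlib
import hmac

# Toy model of over-the-wire integrity checking: sign each message with
# a MAC keyed by the security session established at daemon startup.
def sign(key, message):
    return hmac.new(key, message, hashlib.sha256).hexdigest()

key = b"session-key-established-at-daemon-startup"  # illustrative key
msg = b"claim slot1@glidein_1234"                   # illustrative message
tag = sign(key, msg)

assert hmac.compare_digest(tag, sign(key, msg))              # intact message verifies
assert not hmac.compare_digest(tag, sign(key, msg + b"x"))   # tampered message fails
```

The cost hinted at in the slide comes from doing this (plus the initial authentication) for every daemon pair across the WAN, work that a trusted LAN deployment typically skips.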
9. Not privileged on execute side
● HTCondor daemons on the execute side do not have system privileges
– Limits what HTCondor can do
● UID switching can be achieved with glexec
– But requires proxy delegation from schedd
– Only possible if users collaborate
– Relatively expensive (at least one call per job startup)
● Many other functions not an option
– e.g. cgroups
10. Firewalls
● HTCondor is basically a P2P system
– But execute nodes are often behind firewalls
● Requires the use of CCB and shared_port_daemon to get around it
– But this adds complexity to the system
– Schedd particularly sensitive here
● CCB can become a single point of failure
– Either because temporarily overloaded
– Or if it dies and HA is not used
11. Very dynamic resource pool
● Startds tend to come and go often
– A side effect of limited lease lifetime
– And of provisioning due to new jobs being submitted
● Many HTCondor optimizations less effective
– e.g. security session caching
12. Increased resource pool size
● Most glideinWMS installations bigger than most LAN HTCondor installations
– At least at the peaks
● Increased scale puts more load on non-execute daemons
– Even before all the other considerations are applied