In this KubeCon 2019 presentation, I show how HTCondor can make it easier to use Kubernetes-managed resources for High Throughput Computing.
Presented in San Diego at KubeCon 2019:
https://sched.co/UadF
Performance Optimization of CGYRO for Multiscale Turbulence Simulations - Igor Sfiligoi
Overview of the recent performance optimization of CGYRO, an Eulerian gyrokinetic fusion plasma solver, with emphasis on multiscale turbulence simulations.
Presented at the joint US-Japan Workshop on Exascale Computing Collaboration and the 6th workshop of the US-Japan Joint Institute for Fusion Theory (JIFT) program (Jan 18th 2022).
State of Containers and the Convergence of HPC and BigData - inside-BigData.com
In this deck from 2018 Swiss HPC Conference, Christian Kniep from Docker Inc. presents: State of Containers and the Convergence of HPC and BigData.
"This talk will recap the history of and what constitutes Linux Containers, before laying out how the technology is employed by various engines and what problems these engines have to solve. Afterward Christian will elaborate on why the advent of standards for images and runtimes moved the discussion from building and distributing containers to orchestrating containerized applications at scale. In conclusion attendees will get an update on how containers foster the convergence of Big Data and HPC workloads and the state of native HPC containers."
Learn more: http://docker.com
and
http://www.hpcadvisorycouncil.com/events/2018/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Taking Your Database Beyond the Border of a Single Kubernetes Cluster - Christopher Bradford
Deploying applications on Kubernetes is getting easier every day. From a minimal deployment to distributed, service-mesh-enabled applications, with planning and a little bit of YAML, resilient cloud-native applications are the norm. In this session, Christopher Bradford and Ty Morton will help answer the following questions:
- What about your data behind these apps?
- Are you running those in a multi-cluster environment or sending everything back to a common location?
- How do you modernize to a distributed peer-to-peer data architecture?
- How do you plan for this change?
- Are there pitfalls on the road to enlightened data?
Join this session to explore the key concepts needed when investigating multi-cluster deployments for data. This includes:
- Cluster planning
- Network design
- Security
- Failure handling
10 Years of OpenStack at CERN - From 0 to 300k cores - Belmiro Moreira
CERN, the European Laboratory for Particle Physics, provides the infrastructure and resources to thousands of scientists all around the world to uncover the mysteries of the Universe. In the quest to build a private Cloud Infrastructure to support its users, CERN started early evaluating the OpenStack project, building several prototypes and engaging with the community. Finally, in 2013 CERN released its production Cloud Infrastructure using OpenStack. Since then we moved from a few hundred cores to a multi-cell deployment spread between different regions. After 7 years deploying and managing OpenStack in production at a large scale, we now look back and discuss the challenges of building a massive scale infrastructure from 0 to +300K cores. In this talk we will dive into the history, architecture, tools and technical decisions behind the CERN Cloud Infrastructure over the years.
The next generation of research infrastructure and large scale scientific instruments will face new magnitudes of data.
This talk presents two flagship programmes: the next generation of the Large Hadron Collider (LHC) at CERN and the Square Kilometre Array (SKA) radio telescope. Each, in its own way, will push infrastructure to the limit.
The LHC has been one of the significant users of OpenStack in scientific computing. The SKA is now working to a final software architecture design and is focusing on OpenStack as an underlying middleware function.
Together, we plan to develop a common platform for scaling science: to accommodate new applications and software services, to deliver high ingest rate real-time and batch processing, to integrate high performance storage and to unlock the potential of software defined networking.
This is a presentation of VINETALK, one of the results of the EU-funded project VINEYARD, given at the HiPEAC 2018 Conference in Manchester (22-24 January 2018).
VINETALK is a framework that allows the easy virtualisation of hardware resources and the efficient deployment of hardware accelerators.
More information about the VINEYARD project and other results at www.vineyard-h2020.eu.
You can also follow the latest developments on Twitter @VINEYARD_EU
CERN, the European Organization for Nuclear Research, is one of the world’s largest centres for scientific research. Its business is fundamental physics, finding out what the universe is made of and how it works. At CERN, accelerators such as the 27km Large Hadron Collider are used to study the basic constituents of matter. This talk reviews the challenges of recording and analysing the 25 Petabytes/year produced by the experiments, and the investigations into how OpenStack could help to deliver a more agile computing infrastructure.
Containers on Baremetal and Preemptible VMs at CERN and SKA - Belmiro Moreira
CERN, the European Laboratory for Nuclear Research, and SKA, the Square Kilometre Array, are preparing the next generation of research infrastructure for the new large-scale scientific instruments that will produce new magnitudes of data. At the Sydney OpenStack Summit we presented the collaboration and the platform that we plan to develop for scaling science.
In this talk we will present the work done related to Preemptible VMs and Containers on Baremetal.
Preemptible VMs are instances that use idle allocated resources in the infrastructure and can be terminated when this capacity is required. Containers on Baremetal eliminate the virtualization overhead, enabling the full container performance required for scientific workloads.
We will present the current state, development and integration decisions and how these functionalities can be used in a common OpenStack infrastructure.
Kubernetes and OpenStack at Scale at OpenStack Summit Boston 2017
Imagine being able to stand up thousands of tenants with thousands of apps, running thousands of Docker-formatted container images and routes, all on a self-healing cluster and elastic infrastructure. Now, take that one step further - all of those images being updatable through a single upload to the registry, and with zero downtime. In this session, you will see just that.
In this presentation, we will walk through a recent benchmarking deployment using Kubernetes and OpenStack on the Cloud Native Computing Foundation’s (CNCF's) 1,000 node cluster with OpenStack and Red Hat’s OpenShift Container Platform, the enterprise-ready Kubernetes for developers.
You'll also learn what's been happening in subsequent rounds of testing in Red Hat's own SCALE lab and the CNCF cluster, and how we are working with the relevant open source communities, including OpenStack, Kubernetes, and Ansible, to continue to raise the bar for horizontal scaling of these platforms via community-powered innovation.
CERN, the European Organization for Nuclear Research, has been running a large OpenStack Cloud for several years that helps thousands of scientists to analyze the data from the LHC.
In 2012, early in the design phase of the CERN Cloud we decided to use Nova Cells to enable the infrastructure to scale to thousands of nodes. Now with more than 280K cores spread across 70 cells that are hosted in two data centres we were faced with the challenge to migrate to Nova Cells V2 required in the Pike release.
In this presentation, we will describe how Nova Cells allowed CERN to scale to thousands of nodes, its advantages, and how we mitigated the implementation issues of Nova Cells V1. Next, we will cover how we upgraded Nova from Newton with Cells V1 to Pike with Cells V2. We will explain the steps that we followed and the issues that we faced during the upgrade. Finally, we will report our experience with Cells V2 at scale, its caveats and how we are mitigating them.
What can I expect to learn?
This presentation describes how CERN migrated from Cells V1 to Cells V2 when upgrading from the Newton to the Pike release.
You will learn the procedures followed by CERN in order to migrate Cells V1 to Cells V2 in a large production environment.
The issues found during the upgrade and how we mitigated them will be discussed.
Also, we will present how Cells V2 behaves in a large-scale deployment with several thousand nodes in 70 cells.
Multi-Cell OpenStack: How to Evolve Your Cloud to Scale - November, 2014 - Belmiro Moreira
Multi-Cell OpenStack: How to Evolve Your Cloud to Scale
OpenStack Design Summit, Paris - November, 2014
Belmiro Moreira - CERN
Matt Van Winkle - Rackspace
Sam Morrison - NeCTAR, University of Melbourne
A look at some of the ways available to deploy Postgres in a Kubernetes cloud environment, either at small scale using simple configurations, or at larger scale using tools such as Helm charts and the Crunchy PostgreSQL Operator. A short introduction to Kubernetes will be given to explain the concepts involved, followed by examples from each deployment method and observations on the key differences.
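To make the simple end of that spectrum concrete, here is a minimal sketch of a single-instance Postgres deployment created with the official Kubernetes Python client; the namespace, resource names, image tag, password handling and storage size are illustrative assumptions, not details from the talk.

```python
# Minimal sketch: one PVC plus a single-replica Postgres Deployment,
# created with the official Kubernetes Python client (pip install kubernetes).
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster
core = client.CoreV1Api()
apps = client.AppsV1Api()

NAMESPACE = "databases"  # hypothetical namespace

pvc = client.V1PersistentVolumeClaim(
    api_version="v1", kind="PersistentVolumeClaim",
    metadata=client.V1ObjectMeta(name="pg-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        resources=client.V1ResourceRequirements(requests={"storage": "10Gi"}),
    ),
)
core.create_namespaced_persistent_volume_claim(NAMESPACE, pvc)

deployment = client.V1Deployment(
    api_version="apps/v1", kind="Deployment",
    metadata=client.V1ObjectMeta(name="postgres"),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"app": "postgres"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "postgres"}),
            spec=client.V1PodSpec(
                containers=[client.V1Container(
                    name="postgres",
                    image="postgres:16",  # assumed image tag
                    env=[client.V1EnvVar(name="POSTGRES_PASSWORD", value="change-me")],
                    volume_mounts=[client.V1VolumeMount(
                        name="data", mount_path="/var/lib/postgresql/data")],
                )],
                volumes=[client.V1Volume(
                    name="data",
                    persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                        claim_name="pg-data"),
                )],
            ),
        ),
    ),
)
apps.create_namespaced_deployment(NAMESPACE, deployment)
```

The larger-scale options mentioned above (Helm charts, the Crunchy PostgreSQL Operator) replace this hand-built object graph with packaged, operator-managed resources.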
Tips Tricks and Tactics with Cells and Scaling OpenStack - May, 2015 - Belmiro Moreira
Tips Tricks and Tactics with Cells and Scaling OpenStack
OpenStack Design Summit, Vancouver - May, 2015
Belmiro Moreira - CERN
Matt Van Winkle - Rackspace
Sam Morrison - NeCTAR, University of Melbourne
Self-healing does not equal self-healing. There are multiple layers to it, whether a self-healing infrastructure, cluster, pods, or Kubernetes. Kubernetes itself ensures self-healing pods. But how do you ensure your applications, whose reliability depends on every single layer, are truly reliable?
In this presentation we discuss aspects of reliability and self-healing in the different layers of a comprehensive container management stack: what Kubernetes does and doesn't do (at least not by default), and what you should look out for to ensure truly reliable applications.
Meet the Experts: InfluxDB Product Update - InfluxData
Learn more about InfluxData’s time series platform. InfluxDB 2.0 OSS is generally available, and since launch, we have made updates to the product.
Join Tim Hall, VP of Products, as he demonstrates the latest features in InfluxDB 2.0 Open Source.
Auto-scaling HTCondor pools using Kubernetes compute resources - Igor Sfiligoi
HTCondor has been very successful in managing globally distributed, pleasantly parallel scientific workloads, especially as part of the Open Science Grid. HTCondor system design makes it ideal for integrating compute resources provisioned from anywhere, but it has very limited native support for autonomously provisioning resources managed by other solutions. This work presents a solution that allows for autonomous, demand-driven provisioning of Kubernetes-managed resources. A high-level overview of the employed architectures is presented, paired with the description of the setups used in both on-prem and Cloud deployments in support of several Open Science Grid communities. The experience suggests that the described solution should be generally suitable for contributing Kubernetes-based resources to existing HTCondor pools.
Presented at PEARC22.
Paper DOI: https://doi.org/10.1145/3491418.3535123
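The abstract above describes demand-driven provisioning of Kubernetes-managed resources for an HTCondor pool. The following is a rough sketch of that idea, not the implementation from the paper: it counts idle jobs in the local schedd queue via the HTCondor Python bindings and scales an assumed Deployment of HTCondor worker pods with the Kubernetes client; the deployment name, namespace and cap are placeholders.

```python
# Hypothetical demand-driven scaler: roughly one worker pod per idle job, capped.
# Requires the htcondor Python bindings and the official Kubernetes client.
import htcondor
from kubernetes import client, config

MAX_WORKERS = 50              # assumed cap on provisioned pods
NAMESPACE = "htcondor"        # assumed namespace
DEPLOYMENT = "condor-worker"  # assumed Deployment running condor_startd pods

def count_idle_jobs():
    schedd = htcondor.Schedd()
    # JobStatus == 1 means Idle in HTCondor's ClassAd schema
    return len(schedd.query("JobStatus == 1", ["ClusterId"]))

def scale_workers(target):
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        DEPLOYMENT, NAMESPACE, {"spec": {"replicas": target}})

if __name__ == "__main__":
    config.load_kube_config()
    idle = count_idle_jobs()
    target = min(idle, MAX_WORKERS)
    print(f"{idle} idle jobs -> scaling {DEPLOYMENT} to {target} replicas")
    scale_workers(target)
```

The worker pods would start an HTCondor startd that joins the existing pool, so the central manager keeps doing the actual matchmaking.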
The Five Stages of Enterprise Jupyter Deployment - Frederick Reiss
Meetup talk from May 30, 2018.
Jupyter notebooks are an important tool for data science. For a single user on a laptop, these notebooks are a simple, straightforward tool. But Jupyter in the enterprise is a much more complex affair. Enterprises have large teams of data scientists who need to run their notebooks atop scalable compute infrastructure with secure, audited access to massive, proprietary data sets; all while keeping hardware costs down.
Here at IBM’s Center for Open-Source Data and AI Technologies, we’ve seen multiple enterprise rollouts of Jupyter notebooks, both first-hand, in IBM products and services; and second-hand, in our discussions with other members of the Jupyter community.
In this talk, we merge together the stories of these projects and walk through the process of deploying high-performance, secure, multitenant Jupyter notebooks in an enterprise setting. Our goal here is to inform others who may be at the beginning of this journey of what is coming and how to navigate the challenges ahead.
Along the way, we answer five important questions: What are Jupyter notebooks? What makes Jupyter so attractive to data scientists? Why is deploying Jupyter in the enterprise difficult? What are your deployment options today? And, what are the tradeoffs of those approaches?
We’ll finish with a description of how IBM and other members of the Jupyter community are working towards reducing those tradeoffs with the Jupyter Enterprise Gateway project. Finally, we’ll give a demonstration of multitenant Jupyter notebooks in action.
This talk is aimed at enterprise architects who need to support growing data science teams with multi-user deployments of Jupyter. No knowledge of data science is required.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2022/06/tensorflow-lite-for-microcontrollers-tflm-recent-developments-a-presentation-from-bdti-and-google/
David Davis, Senior Embedded Software Engineer, and John Withers, Automation and Systems Engineer, both of BDTI, present the “TensorFlow Lite for Microcontrollers (TFLM): Recent Developments” tutorial at the May 2022 Embedded Vision Summit.
TensorFlow Lite Micro (TFLM) is a generic inference framework designed to run TensorFlow models on digital signal processors (DSPs), microcontrollers and other embedded targets with small memory footprints and very low power usage. TFLM aims to be easily portable to various embedded targets from those running an RTOS to bare-metal code. TFLM leverages the model optimization tools from the TensorFlow ecosystem and has additional embedded-specific optimizations to reduce the memory footprint. TFLM also integrates with a number of community contributed optimized hardware-specific kernel implementations.
In this talk, Davis and Withers review collaboration between BDTI and Google over the last year, including porting nearly two dozen operators from TensorFlow Lite to TFLM, creation of a separate Arduino examples repository, improved testing and documentation of both Arduino and Colab training examples and transitioning TFLM’s open-source CI framework to use GitHub Actions.
Speed & Agility of Innovation with Docker & Kubernetes - ICS
Docker and Kubernetes pave the way for running federated, scalable and redundant systems on the Cloud and on the Edge with the same technology. In this introductory webinar we will go over the history, fundamentals and usage of Docker and Kubernetes and present key takeaways for CTOs and developers. We'll also cover the key benefits of using this technology in both the development and deployment of Qt applications.
We'll discuss:
A short history of Docker and Kubernetes
Why are containers suddenly so popular?
The benefits of using Docker with Qt applications
Kubernetes for the Cloud and for the Edge
Moving to containers and Kubernetes: How and Why
Latest (storage IO) patterns for cloud-native applications - OpenEBS
Applying microservice patterns to storage, giving each workload its own Container Attached Storage (CAS) system. This puts the DevOps persona in full control of the storage requirements and brings data agility to k8s persistent workloads. We will go over the concept and the implementation of CAS, as well as its orchestration.
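In practical terms, the CAS pattern surfaces to an application as an ordinary per-workload PersistentVolumeClaim backed by an OpenEBS storage class. The snippet below is a minimal sketch of that idea using the Kubernetes Python client; the namespace, claim name and the openebs-hostpath storage class are illustrative assumptions, not details from the webinar.

```python
# Minimal sketch: one dedicated PVC per workload, backed by an assumed
# OpenEBS storage class, created with the official Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

claim = client.V1PersistentVolumeClaim(
    api_version="v1", kind="PersistentVolumeClaim",
    metadata=client.V1ObjectMeta(name="orders-db-data"),   # hypothetical workload
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="openebs-hostpath",             # assumed storage class
        resources=client.V1ResourceRequirements(requests={"storage": "20Gi"}),
    ),
)
core.create_namespaced_persistent_volume_claim(namespace="default", body=claim)
```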
Considerations for Operating an OpenStack Cloud - All Things Open
All Things Open 2014 - Day 2
Thursday, October 23rd, 2014
Mark Voelker
Technical Leader with Cisco
Cloud/OpenStack
Considerations for Operating an OpenStack Cloud
Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @... - Demi Ben-Ari
A talk covering the history of and the reasons for starting to use Kubernetes and implementing a microservices architecture. It goes over Docker, basic Kubernetes terms, and some of the pitfalls you can run into while implementing it.
Also mentioning the use case of Panorays.
DevOps Days Boston 2017: Real-world Kubernetes for DevOps - Ambassador Labs
DevOps Days Boston 2017
Microservices is an increasingly popular approach to building cloud-native applications. Dozens of new technologies that streamline adopting microservices development such as Docker, Kubernetes, and Envoy have been released over the past few years. But how do you actually use these technologies together to develop, deploy, and run microservices?
In this presentation, we’ll cover the nuances of deploying containerized applications on Kubernetes, including creating a Kubernetes manifest, debugging and logging, and how to build an automated continuous deployment pipeline. Then, we’ll do a brief tour of some of the advanced concepts related to microservices, including service mesh, canary deployments, resilience, and security.
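The session description above mentions debugging and logging for containerized applications on Kubernetes. As a small, hedged illustration of that step (not material from the talk itself), the following sketch uses the Kubernetes Python client to list the pods behind a hypothetical Deployment label and pull their recent logs:

```python
# Sketch: fetch recent logs from all pods matching an assumed app label,
# a common first step when debugging a containerized service on Kubernetes.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

NAMESPACE = "default"               # assumed namespace
LABEL_SELECTOR = "app=my-service"   # hypothetical Deployment label

pods = core.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR)
for pod in pods.items:
    name = pod.metadata.name
    logs = core.read_namespaced_pod_log(name, NAMESPACE, tail_lines=50)
    print(f"--- {name} ---")
    print(logs)
```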
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storage - MayaData Inc
Webinar Session - https://youtu.be/_5MfGMf8PG4
In this webinar, we share how the Container Attached Storage pattern makes performance tuning more tractable, by giving each workload its own storage system, thereby decreasing the variables needed to understand and tune performance.
We then introduce MayaStor, a breakthrough in the use of containers and Kubernetes as a data plane. MayaStor is the first containerized data engine available that delivers near the theoretical maximum performance of underlying systems. MayaStor performance scales with the underlying hardware and has been shown, for example, to deliver in excess of 10 million IOPS in a particular environment.
DevOps Training Online - Visualpath is the Leading and Best Software Training institute in Hyderabad. Avail complete job-oriented DevOps Online Training Course by simply enrolling in our institute in Ameerpet. Call on - +91-9989971070.
Visit: https://www.visualpath.in/devops-online-training.html
Interested in how you can get a Private Cloud solution that just works efficiently? With Platform9 and Solidfire, users can get the self-service automation of OpenStack combined with the incredible speed of flash storage. Learn more here!
This presentation was given as the closing session of Container Conference 2018, on 3rd August in Bangalore, by Anoop Kumar from Docker.
"In this session we will get familiarized with the technical aspects of the Docker EE 2.0 Platform. It will involve a walkthrough of the swarm as well as the relatively newly introduced Kubernetes integrations, how it enables organizational agility, choice and security and the future roadmap of the product suite. We'll finally do a quick demo of the platform and close with a Q&A section."
Kubernetes has shown considerable traction since its debut in 2014; however, a significant portion of enterprises have still chosen other solutions for managing containers over Kubernetes. Given its technical leadership in the community, this begs the question: why aren't more using it? In this talk we will address some of the reasons for this gap, and ideas for how we can solve it, including what we are doing at Rancher Labs on this front.
Comparing single-node and multi-node performance of an important fusion HPC c... - Igor Sfiligoi
Fusion simulations have traditionally required the use of leadership scale High Performance Computing (HPC) resources in order to produce advances in physics. The impressive improvements in compute and memory capacity of many-GPU compute nodes are now allowing for some problems that once required a multi-node setup to be also solvable on a single node. When possible, the increased interconnect bandwidth can result in order of magnitude higher science throughput, especially for communication-heavy applications. In this paper we analyze the performance of the fusion simulation tool CGYRO, an Eulerian gyrokinetic turbulence solver designed and optimized for collisional, electromagnetic, multiscale simulation, which is widely used in the fusion research community. Due to the nature of the problem, the application has to work on a large multi-dimensional computational mesh as a whole, requiring frequent exchange of large amounts of data between the compute processes. In particular, we show that the average-scale nl03 benchmark CGYRO simulation can be run at an acceptable speed on a single Google Cloud instance with 16 A100 GPUs, outperforming 8 NERSC Perlmutter Phase1 nodes, 16 ORNL Summit nodes and 256 NERSC Cori nodes. Moving from a multi-node to a single-node GPU setup we get comparable simulation times using less than half the number of GPUs. Larger benchmark problems, however, still require a multi-node HPC setup due to GPU memory capacity needs, since at the time of writing no vendor offers nodes with a sufficient GPU memory setup. The upcoming external NVSWITCH does however promise to deliver an almost equivalent solution for up to 256 NVIDIA GPUs.
Presented at PEARC22.
Paper DOI: https://doi.org/10.1145/3491418.3535130
The anachronism of whole-GPU accounting - Igor Sfiligoi
NVIDIA has been making steady progress in increasing the compute performance of its GPUs, resulting in order of magnitude compute throughput improvements over the years. With several models of GPUs coexisting in many deployments, the traditional accounting method of treating all GPUs as being equal is not reflecting compute output anymore. Moreover, for applications that require significant CPU-based compute to complement the GPU-based compute, it is becoming harder and harder to make full use of the newer GPUs, requiring sharing of those GPUs between multiple applications in order to maximize the achievable science output. This further reduces the value of whole-GPU accounting, especially when the sharing is done at the infrastructure level. We thus argue that GPU accounting for throughput-oriented infrastructures should be expressed in GPU core hours, much like it is normally done for the CPUs. While GPU core compute throughput does change between GPU generations, the variability is similar to what we expect to see among CPU cores. To validate our position, we present an extensive set of run time measurements of two IceCube photon propagation workflows on 14 GPU models, using both on-prem and Cloud resources. The measurements also outline the influence of GPU sharing at both HTCondor and Kubernetes infrastructure level.
Presented at PEARC22.
Document DOI: https://doi.org/10.1145/3491418.3535125
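The abstract above argues for accounting GPU usage in GPU core hours rather than whole-GPU hours. A tiny sketch of what that bookkeeping could look like is shown below; the core counts are public spec-sheet values, while the job records and the half-GPU sharing fraction are made-up illustrative numbers, not data from the paper.

```python
# Illustrative GPU core-hour accounting: charge wall time multiplied by the
# number of GPU cores in the (possibly shared) slice a job actually used.
CUDA_CORES = {
    "V100": 5120,
    "A100": 6912,
}

jobs = [
    # (gpu_model, fraction_of_gpu_allocated, wall_hours) -- invented examples
    ("V100", 1.0, 2.0),   # whole V100 for 2 hours
    ("A100", 0.5, 2.0),   # half of a shared A100 for 2 hours
]

for model, fraction, hours in jobs:
    whole_gpu_hours = hours                                  # traditional accounting
    gpu_core_hours = CUDA_CORES[model] * fraction * hours    # proposed accounting
    print(f"{model}: {whole_gpu_hours:.1f} GPU h vs {gpu_core_hours:,.0f} GPU core h")
```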
Comparing GPU effectiveness for Unifrac distance compute - Igor Sfiligoi
Poster presented at PEARC21.
The poster contains the complete scaling plots for both unweighted and weighted normalized Unifrac compute for sample sizes ranging from 1k to 307k on both GPUs and CPUs.
Managing Cloud networking costs for data-intensive applications by provisioni... - Igor Sfiligoi
Presented at PEARC21.
Many scientific high-throughput applications can benefit from the elastic nature of Cloud resources, especially when there is a need to reduce time to completion. Cost considerations are usually a major issue in such endeavors, with networking often a major component; for data-intensive applications, egress networking costs can exceed the compute costs. Dedicated network links provide a way to lower the networking costs, but they do add complexity. In this paper we provide a description of a 100 fp32 PFLOPS Cloud burst in support of IceCube production compute, that used Internet2 Cloud Connect service to provision several logically-dedicated network links from the three major Cloud providers, namely Amazon Web Services, Microsoft Azure and Google Cloud Platform, that in aggregate enabled approximately 100 Gbps egress capability to on-prem storage. It provides technical details about the provisioning process, the benefits and limitations of such a setup and an analysis of the costs incurred.
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access - Igor Sfiligoi
Presented at PEARC21.
Most experimental sciences now rely on computing, and biological sciences are no exception. As datasets get bigger, so do the computing costs, making proper optimization of the codes used by scientists increasingly important. Many of the codes developed in recent years are based on the Python-based NumPy, due to its ease of use and good performance characteristics. The composable nature of NumPy, however, does not generally play well with the multi-tier nature of modern CPUs, making any non-trivial multi-step algorithm limited by the external memory access speeds, which are hundreds of times slower than the CPU’s compute capabilities. In order to fully utilize the CPU compute capabilities, one must keep the working memory footprint small enough to fit in the CPU caches, which requires splitting the problem into smaller portions and fusing together as many steps as possible. In this paper, we present changes based on these principles to two important functions in the scikit-bio library, principal coordinates analysis and the Mantel test, that resulted in over 100x speed improvement in these widely used, general-purpose tools.
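To make the principle above concrete, here is a small, hedged sketch (not code from scikit-bio) contrasting a naive composed NumPy pipeline with a blocked version that fuses steps so each chunk stays cache-resident; the block size and the toy operation are arbitrary assumptions.

```python
# Toy illustration of cache blocking / step fusion with NumPy.
# The naive version materializes full-size intermediate arrays; the blocked
# version processes small chunks so the working set fits in the CPU caches.
import numpy as np

def naive_pipeline(x):
    centered = x - x.mean()   # full-size intermediate
    squared = centered ** 2   # another full-size intermediate
    return squared.sum()

def blocked_pipeline(x, block=65536):
    mu = x.mean()
    total = 0.0
    for start in range(0, x.size, block):
        chunk = x[start:start + block]
        total += ((chunk - mu) ** 2).sum()   # fused steps on a cache-sized chunk
    return total

x = np.random.default_rng(0).random(10_000_000)
assert np.isclose(naive_pipeline(x), blocked_pipeline(x))
```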
Using A100 MIG to Scale Astronomy Scientific Output - Igor Sfiligoi
Presented at GTC21.
The raw computing power of GPUs has been steadily increasing, significantly outpacing the CPU gains. This poses a problem for many GPU-enabled scientific applications that use CPU code paths to feed data to the GPU code, resulting in lower GPU utilization, and thus reduced gains in scientific output. Applications that are high-throughput in nature, such as astronomy-focused IceCube and LIGO, can partially work around the problem by running several instances of the executable on the same GPU. This approach, however, is sub-optimal both in terms of application performance and workflow management complexity. The recently introduced Multi-Instance GPU (MIG) capability, available on the NVIDIA A100 GPU, provides a much cleaner and easier-to-use alternative by allowing the logical slicing of the powerful GPU and assigning different slices to different applications. And at least in the case of IceCube, it can provide over 3x more scientific output on the same hardware.
Using commercial Clouds to process IceCube jobs - Igor Sfiligoi
Presented at EDUCAUSE CCCG March 2021.
The IceCube Neutrino Observatory is the world’s premier facility to detect neutrinos. Built at the South Pole in natural ice, it requires extensive and expensive calibration to properly track the neutrinos. Most of the required compute power comes from on-prem resources through the Open Science Grid, but IceCube can easily harness Cloud compute at any scale, too, as demonstrated by a series of Cloud bursts. This talk provides both details of the performed Cloud bursts and some insight into the science itself.
Fusion simulations have traditionally required the use of leadership scale HPC resources in order to produce advances in physics. One such package is CGYRO, a premier tool for multi-scale plasma turbulence simulation. CGYRO is a typical HPC application that will not fit into a single node, as it requires several TeraBytes of memory and O(100) TFLOPS compute capability for cutting-edge simulations. CGYRO also requires high-throughput and low-latency networking, due to its reliance on global FFT computations. While in the past such compute may have required hundreds, or even thousands of nodes, recent advances in hardware capabilities allow for just tens of nodes to deliver the necessary compute power. We explored the feasibility of running CGYRO on Cloud resources provided by Microsoft on their Azure platform, using the infiniband-connected HPC resources in spot mode. We observed both that CPU-only resources were very efficient, and that running in spot mode was doable, with minimal side effects. The GPU-enabled resources were less cost effective but allowed for higher scaling.
For IceCube, a large amount of photon propagation simulation is needed to properly calibrate the natural ice. The simulation is compute intensive and ideal for GPU compute. This Cloud run was more data intensive than previous ones, producing 130 TB of output data. To keep egress costs in check, we created dedicated network links via the Internet2 Cloud Connect Service.
Scheduling a Kubernetes Federation with Admiralty - Igor Sfiligoi
Presented at OSG All-Hands Meeting 2020 - USCMS-USATLAS Session.
This talk presented the PRP experience with using Admiralty as a Kubernetes federation solution, discussing why we need it, why Admiralty is the best (if not actually the only) solution for our needs, and how it works.
Accelerating microbiome research with OpenACC - Igor Sfiligoi
Presented at OpenACC Summit 2020.
UniFrac is a commonly used metric in microbiome research for comparing microbiome profiles to one another. Computing UniFrac on modest sample sizes used to take a workday on a server class CPU-only node, while modern datasets would require a large compute cluster to be feasible. After porting to GPUs using OpenACC, the compute of the same modest sample size now takes only a few minutes on a single NVIDIA V100 GPU, while modern datasets can be processed on a single GPU in hours. The OpenACC programming model made the porting of the code to GPUs extremely simple; the first prototype was completed in just over a day. Getting full performance did however take much longer, since proper memory access is fundamental for this application.
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie... - Igor Sfiligoi
Presented at PEARC20.
This talk presents the expansion of IceCube’s production HTCondor pool using cost-effective GPU instances in preemptible mode, gathered from the three major Cloud providers, namely Amazon Web Services, Microsoft Azure and the Google Cloud Platform. Using this setup, we sustained about 15k GPUs for a whole workday, corresponding to around 170 PFLOP32s, integrating over one EFLOP32 hour worth of science output for a price tag of about $60k. In this paper, we provide the reasoning behind the Cloud instance selection, a description of the setup and an analysis of the provisioned resources, as well as a short description of the actual science output of the exercise.
Porting and optimizing UniFrac for GPUs - Igor Sfiligoi
Poster presented at PEARC20.
UniFrac is a commonly used metric in microbiome research for comparing microbiome profiles to one another (“beta diversity”). The recently implemented Striped UniFrac added the capability to split the problem into many independent subproblems and exhibits near linear scaling. In this poster we describe steps undertaken in porting and optimizing Striped UniFrac to GPUs. We reduced the run time of computing UniFrac on the published Earth Microbiome Project dataset from 13 hours on an Intel Xeon E5-2680 v4 CPU to 12 minutes on an NVIDIA Tesla V100 GPU, and to about one hour on a laptop with NVIDIA GTX 1050 (with minor loss in precision). Computing UniFrac on a larger dataset containing 113k samples reduced the run time from over one month on the CPU to less than 2 hours on the V100 and 9 hours on an NVIDIA RTX 2080TI GPU (with minor loss in precision). This was achieved by using OpenACC for generating the GPU offload code and by improving the memory access patterns. A BSD-licensed implementation is available, which produces a C shared library linkable by any programming language.
Demonstrating 100 Gbps in and out of the public Clouds - Igor Sfiligoi
Poster presented at PEARC20.
There is increased awareness and recognition that public Cloud providers do provide capabilities not found elsewhere, with elasticity being a major driver. The value of elastic scaling is however tightly coupled to the capabilities of the networks that connect all involved resources, both in the public Clouds and at the various research institutions. This poster presents results of measurements involving file transfers inside public Cloud providers, fetching data from on-prem resources into public Cloud instances and fetching data from public Cloud storage into on-prem nodes. The networking of the three major Cloud providers, namely Amazon Web Services, Microsoft Azure and the Google Cloud Platform, has been benchmarked. The on-prem nodes were managed by either the Pacific Research Platform or located at the University of Wisconsin – Madison. The observed sustained throughput was of the order of 100 Gbps in all the tests moving data in and out of the public Clouds and throughput reaching into the Tbps range for data movements inside the public Cloud providers themselves. All the tests used HTTP as the transfer protocol.
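Since all the tests described above used HTTP as the transfer protocol, a single-stream throughput check is easy to sketch. The snippet below is only an illustrative measurement loop (the URL is a placeholder), not the benchmark harness used for the poster; sustaining 100 Gbps in practice requires many parallel streams and nodes.

```python
# Rough throughput measurement of one HTTP download (URL is a placeholder).
import time
import requests

url = "https://example.org/testfile"   # placeholder test object
start = time.time()
nbytes = 0
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    for chunk in r.iter_content(chunk_size=1 << 20):   # 1 MiB chunks
        nbytes += len(chunk)
elapsed = time.time() - start
print(f"{nbytes / elapsed * 8 / 1e9:.2f} Gbps over {elapsed:.1f} s")
```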
TransAtlantic Networking using Cloud links - Igor Sfiligoi
Scientific communities have only a limited amount of bandwidth available for transferring data between the US and the EU.
We know Cloud providers have plenty of bandwidth available, but at what cost?
Bursting into the public Cloud - Sharing my experience doing it at large scal... - Igor Sfiligoi
When compute workflow needs spike well in excess of the capacity of a local compute resource, capacity should be temporarily provisioned from somewhere else to both meet deadlines and to increase scientific output. Public Clouds have become an attractive option due to their ability to be provisioned with minimal advance notice. I have recently helped IceCube expand their resource pool by a few orders of magnitude, first to 380 PFLOP32s for a few hours and later to 170 PFLOP32s for a whole workday. In the process we moved O(50 TB) of data to and from the clouds, showing that networking is not a limiting factor, either. While there was a non-negligible dollar cost involved with each, the effort involved was quite modest. In this session I will explain what was done and how, alongside an overview of why IceCube needs so much compute.
Demonstrating 100 Gbps in and out of the Clouds - Igor Sfiligoi
In this presentation, which was supposed to be presented at the cancelled CENIC 2020 Annual Conference, I present an overview of what is possible to achieve in terms of networking inside the Clouds and when exchanging data between cloud resources and on-prem equipment, with an emphasis on research hosted hardware.
There is increased awareness and recognition that public cloud providers do provide capabilities not found elsewhere, with elasticity being a major driver, and funding agencies are taking an increasingly positive stance toward public clouds.
The value of elastic scaling is, however, tightly coupled to the capabilities of the networks that connect all involved resources, both in the public clouds and at the various research institutions.
This presentation tries to shed some light on what is possible today.
The new frontiers of AI in RPA with UiPath Autopilot™ - UiPathCommunity
In this free online event, organized by the Italian UiPath Community, you can explore the new features of Autopilot, the tool that integrates Artificial Intelligence into the development and use of Automations.
📕 Together we will look at some examples of how Autopilot is used across the UiPath Suite:
Autopilot for Studio Web
Autopilot for Studio
Autopilot for Apps
Clipboard AI
GenAI applied to Document Understanding
👨🏫👨💻 Speakers:
Stefano Negro, UiPath MVPx3, RPA Tech Lead @ BSP Consultant
Flavio Martinelli, UiPath MVP 2023, Technical Account Manager @UiPath
Andrei Tasca, RPA Solutions Team Lead @NTT Data
Accelerate your Kubernetes clusters with Varnish Caching - Thijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
PHP Frameworks: I want to break free (IPC Berlin 2024) - Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
UiPath Test Automation using UiPath Test Suite series, part 3 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 - Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview, including the concepts of Customer Key and Double Key Encryption.
Key Trends Shaping the Future of Infrastructure.pdf - Cheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
The Art of the Pitch: WordPress Relationships and Sales - Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
UiPath Test Automation using UiPath Test Suite series, part 4 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf - Peter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
Welcome to ViralQR, your best QR code generator - ViralQR
Welcome to ViralQR, your best QR code generator available on the market!
At ViralQR, we design static and dynamic QR codes. Our mission is to make business operations easier and customer engagement more powerful through the use of QR technology. Be it a small-scale business or a huge enterprise, our easy-to-use platform provides multiple choices that can be tailored according to your company's branding and marketing strategies.
Our Vision
We are here to make the process of creating QR codes easy and smooth, thus enhancing customer interaction and making business more fluid. We very strongly believe in the ability of QR codes to change the world for businesses in their interaction with customers and are set on making that technology accessible and usable far and wide.
Our Achievements
Ever since its inception, we have successfully served many clients by offering QR codes in their marketing, service delivery, and collection of feedback across various industries. Our platform has been recognized for its ease of use and amazing features, which helped a business to make QR codes.
Our Services
At ViralQR, we offer a comprehensive suite of services that caters to your needs:
Static QR Codes: Create free static QR codes. These QR codes are able to store significant information such as URLs, vCards, plain text, emails and SMS, Wi-Fi credentials, and Bitcoin addresses.
Dynamic QR codes: These also have all the advanced features but are subscription-based. They can directly link to PDF files, images, micro-landing pages, social accounts, review forms, business pages, and applications. In addition, they can be branded with CTAs, frames, patterns, colors, and logos to enhance your branding.
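As a quick illustration of the kind of static QR code described above, here is a minimal sketch using the third-party qrcode Python package rather than ViralQR's own tooling; the URL and output filename are placeholders.

```python
# Generate a simple static QR code that encodes a URL and save it as a PNG.
# Requires: pip install "qrcode[pil]"
import qrcode

img = qrcode.make("https://example.com")   # placeholder URL to encode
img.save("example-qr.png")                 # placeholder output file
```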
Pricing and Packages
Additionally, ViralQR offers a 14-day free trial, an exceptional opportunity for new users to get a feel for the platform. From there, one can easily subscribe and experience the full dynamics of using QR codes. The subscription plans are not only meant for business; they are priced flexibly so that practically every business can afford to benefit from our service.
Why choose us?
ViralQR provides services for marketing, advertising, catering, retail, and the like. The QR codes can be posted on fliers, packaging, merchandise, and banners, and can even serve as a substitute for cash and cards in a restaurant or coffee shop. With QR codes integrated into your business, you can improve customer engagement and streamline operations.
Comprehensive Analytics
Subscribers of ViralQR receive detailed analytics and tracking tools in light of having a view of the core values of QR code performance. Our analytics dashboard shows aggregate views and unique views, as well as detailed information about each impression, including time, device, browser, and estimated location by city and country.
So, thank you for choosing ViralQR; we offer nothing but the best in QR code services to meet the needs of a diverse range of businesses!
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATOs (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Serving HTC Users in Kubernetes by Leveraging HTCondor
1. Igor Sfiligoi, University of California San Diego (UCSD/SDSC)
Serving HTC Users in K8s
by Leveraging HTCondor
2. Who am I?
Longtime HTC user
• Most recently as part of the
Open Science Grid (OSG)
For the past year actively involved with Kubernetes
• As part of the
Pacific Research Platform (PRP)
Name: Igor Sfiligoi
Employer: UC San Diego
https://opensciencegrid.org
http://pacificresearchplatform.org
3. Let’s define HTC
HTC = High Throughput Computing
Often also called Batch Computing
(although not all Batch Computing is HTC)
4. Let’s define HTC
HTC = High Throughput Computing
Often also called Batch Computing
(although not all Batch Computing is HTC)
The infrastructure for Ingeniously Parallel Computing
5. Ingenious Parallelism
• Restate a big computing problem as
many individually schedulable small problems.
• Minimize your requirements in order to
maximize the raw capacity that you can effectively use.
6. Ingenious Parallelism
• Restate a big computing problem as
many individually schedulable small problems.
• Minimize your requirements in order to
maximize the raw capacity that you can effectively use.
Some call it
Embarrassingly Parallel Computing
but it really takes hard thinking!
7. Example HTC problems
Monte Carlo Simulations
Parameter sweeps
Event processing
Feature extraction
And many more problems can be cast in this paradigm.
8. Example HTC resource
Open Science Grid (OSG)
operates a large scale HTC pool
Number of CPU cores in use by OSG HTC jobs
9. Example HTC users
OSG serving
many different
scientific domains
Weekly CPU hours used by OSG HTC jobs
11. K8s in principle great for HTC
[Diagram] One HTC job per Pod (Job 1, Job 2, Job 3, … Job n), many Pods per HW node (Job a … Job z), many HW nodes per Pool; plus a VPN overlay, shared storage and flexible orchestration.
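For reference, the "one HTC job per Pod" idea maps naturally onto a Kubernetes Job object. A minimal sketch follows; the image, command, and resource numbers are illustrative and not from the talk:

  apiVersion: batch/v1
  kind: Job
  metadata:
    name: htc-task
  spec:
    parallelism: 10    # how many task Pods may run at the same time
    completions: 100   # total number of task Pods to run
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: worker
          image: example/worker:latest      # placeholder image
          command: ["/app/run-task.sh"]     # placeholder command
          resources:
            requests:
              cpu: "1"
              memory: 1Gi

Note that, out of the box, each of the 100 Pods has no way of knowing which of the 100 tasks it should work on, which is exactly the kind of gap the next slides point out.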
12. K8s in practice not so great
• Indexed parameter passing
• Automatic Input/Output handling
Note: HTC jobs typically do not require a shared FS
• Fair Share Scheduling policies
Essential for highly contested resources
• Can it scale to millions of queued Pods?
K8s is missing a few features HTC users are used to
13. K8s in practice not so great
• Indexed parameter passing
• Automatic Input/Output handling
Note: HTC jobs typically do not require a shared FS
• Fair Share Scheduling policies
Essential for highly contested resources
• Can it scale to millions of queued Pods?
K8s is missing a few features HTC users are used to
Plus, lack of:
• A familiar API/CLI
• Seamless integration
with other resources
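For contrast, a minimal HTCondor submit description illustrates the indexed parameter passing and automatic input/output handling mentioned above; the executable and file names are illustrative:

  # $(Process) runs from 0 to 999, giving each job its own index
  executable              = analyze.sh
  arguments               = --seed $(Process)
  # HTCondor transfers input/output files itself; no shared FS needed
  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT
  transfer_input_files    = input_$(Process).dat
  output                  = job_$(Process).out
  error                   = job_$(Process).err
  log                     = sweep.log
  request_cpus            = 1
  request_memory          = 1GB
  queue 1000

Fair-share scheduling of the resulting jobs is then handled by the pool rather than by the submitter.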
15. Using HTCondor with Kubernetes
Why HTCondor?
• One of the major batch systems
• HTC-focused architecture
• Very flexible, often used in heterogeneous environments
• Native support for containers
• Highly scalable
https://research.cs.wisc.edu/htcondor/
16. Using HTCondor with Kubernetes
Why HTCondor?
• One of the major batch systems
• HTC-focused architecture
• Very flexible, often used in heterogeneous environments
• Native support for containers
• Highly scalable
The system used inside
the Open Science Grid
(OSG)
https://research.cs.wisc.edu/htcondor/
17. Using HTCondor with Kubernetes
Why HTCondor?
• One of the major batch systems
• HTC-focused architecture
• Very flexible, often used in heterogeneous environments
• Native support for containers
The system used inside
the Open Science Grid
(OSG)
https://research.cs.wisc.edu/htcondor/
Nov 17th, 2019
18. HTCondor Architecture
• Schedd: holds the persistent job queue (can be more than one, but all independent); submission is typically local (e.g. ssh)
• Collector: central manager for bookkeeping (can have multiple for HA)
• Startd: each execute resource (CPU) has a control process
19. HTCondor Architecture
• Schedd: holds the persistent job queue (can be more than one, but all independent); submission is typically local (e.g. ssh)
• Collector: central manager for bookkeeping (can have multiple for HA)
• Negotiator: requirements-based matchmaking with fair-share allocations (user and/or group based)
• Startd: each execute resource (CPU) has a control process; once matched, the Schedd communicates directly with the execute node, including for data transfers
20. HTCondor Architecture
• Schedd: holds the persistent job queue (can be more than one, but all independent); submission is typically local (e.g. ssh)
• Collector: central manager for bookkeeping (can have multiple for HA)
• Negotiator: requirements-based matchmaking with fair-share allocations (user and/or group based)
• Startd: each execute resource (CPU) has a control process; once matched, the Schedd communicates directly with the execute node, including for data transfers
• A match stays valid for tens of minutes and can be reused for many jobs
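In HTCondor configuration terms, the roles above are selected per node via DAEMON_LIST. A rough sketch, with a placeholder central manager hostname:

  # Central manager (Collector + Negotiator for bookkeeping and matchmaking)
  CONDOR_HOST = cm.example.org
  DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR

  # Submit node (Schedd holding the persistent job queue)
  CONDOR_HOST = cm.example.org
  DAEMON_LIST = MASTER, SCHEDD

  # Execute node (Startd controlling the local CPUs)
  CONDOR_HOST = cm.example.org
  DAEMON_LIST = MASTER, STARTD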
22. Using HTCondor with Kubernetes
The Kubernetes resources can be joined to an existing HTCondor Pool:
• Each Pod runs just a Startd managing its CPUs; the Schedd, Collector and Negotiator stay in the existing pool
• No persistency needed inside the Pods
• CCB is used for NAT traversal
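A minimal sketch of such an execute-only Pod is shown below; the image name, hostname, and resource sizes are placeholders, and HTCondor is assumed to pick up configuration overrides from _CONDOR_* environment variables (a standard HTCondor mechanism):

  apiVersion: v1
  kind: Pod
  metadata:
    name: htcondor-execute
  spec:
    restartPolicy: Never
    containers:
    - name: startd
      image: example/htcondor-execute:latest   # placeholder image with HTCondor installed
      env:
      # _CONDOR_* environment variables override HTCondor config macros
      - name: _CONDOR_CONDOR_HOST
        value: "cm.example.org"       # central manager of the existing pool
      - name: _CONDOR_DAEMON_LIST
        value: "MASTER, STARTD"
      - name: _CONDOR_CCB_ADDRESS
        value: "cm.example.org"       # CCB broker (typically the Collector) for NAT traversal
      resources:
        requests:
          cpu: "4"
          memory: 8Gi
        limits:
          cpu: "4"
          memory: 8Gi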
23. Using HTCondor with Kubernetes
Or a complete HTCondor Pool can be created inside Kubernetes:
• Pods run the Startds as well as the Schedd, Collector and Negotiator
• The Schedd needs persistent storage; persistent storage is recommended for the Collector and Negotiator
• Simpler setup, since everything shares the VPN overlay
24. Using HTCondor with Kubernetes
Such a pool can also be used to join external resources for K8s users:
• The Schedd, Collector and Negotiator run inside Kubernetes, alongside the in-cluster Startd Pods
• External execute resources join the same pool, which requires incoming networking to the in-cluster services
25. HTC Users and Containers
Most HTC jobs are application + arguments + data
• Container just a convenient way to package the dependencies
• Usually a department/community maintained one
26. HTCondor and Containers
HTCondor allows for a container to be attached to a job
• Will use singularity to invoke it
• After binding the application and data
Most HTC jobs are application + arguments + data
• Container just a convenient way to package the dependencies
• Usually a department/community maintained one
27. HTCondor and Containers
HTCondor allows for a container to be attached to a job
• Will use singularity to invoke it
• After binding the application and data
Most HTC jobs are application + arguments + data
• Container just a convenient way to package the dependencies
• Usually a department/community maintained one
In principle Docker could be an
option, but not currently
supported
28. Nested containerization
Singularity can be invoked inside a Docker container
• Fully unprivileged with Linux Kernel >= 4.18
Makes HTCondor execute in Kubernetes trivial to implement
[Diagram] The Schedd sends the job (department container + application + arguments + input data) to a Startd running in a CentOS/Ubuntu/… Pod with Singularity installed; Singularity runs the application inside the department container, and output data and logs flow back.
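Conceptually, the wrapped invocation on the execute side looks something like the following; the image, paths, and arguments are illustrative, and HTCondor builds the actual command line itself:

  # Run the user application inside the department container,
  # binding the job sandbox (application + data) into it;
  # --userns keeps the whole thing unprivileged on kernels >= 4.18
  singularity exec --userns \
    --bind "$PWD":/srv --pwd /srv \
    docker://example/dept-container:latest \
    ./my_app --input input.dat --output result.dat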
29. Explicit provisioning
Many systems still on older Linux Kernel Versions (e.g. CentOS 7)
• Unprivileged nested containerization not an option there
Some users also do not like singularity
• It does have some differences from Docker
• e.g. The root partition is always Read-Only
Kubernetes Pod can be launched with Container needed by User jobs
• Only jobs needing that Container will match
• Asking users to create an HTCondor-specific Container is usually a non-starter
• Better to inject HTCondor bins and config at Pod startup
A ready-to-use template is available at:
https://github.com/sfiligoi/prp-htcondor-pool
This approach is quite effective when only a few Container Images are needed.
30. Opportunistic use
Most HTC jobs tolerate preemption
• HTC Pods are a great backfill option for keeping your Kubernetes resources fully utilized
Just launch HTCondor execute Pods with a very low K8s priority
Works best when you have a single backfill pool
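In Kubernetes this maps to a dedicated low PriorityClass for the HTCondor execute Pods. A minimal sketch, with an illustrative name and value:

  apiVersion: scheduling.k8s.io/v1
  kind: PriorityClass
  metadata:
    name: htc-backfill
  value: -100          # below the default of 0, so regular Pods preempt backfill Pods
  globalDefault: false
  description: "Low priority for preemptible HTCondor execute Pods."

The HTCondor execute Pods then simply set priorityClassName: htc-backfill in their spec.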
31. To conclude
Kubernetes is a great foundation platform for HTC jobs
• But a bit hard to use by itself
HTCondor can add the needed glue to make it easy to use
• Data handling
• Parametrized argument passing
• Robust, contention-optimized and scalable policy manager
OSG and PRP have been successfully using this combination for a while
32. Acknowledgments
This work was partially funded by
US National Science Foundation (NSF) awards
CNS-1456638, CNS-1730158,
ACI-1540112, ACI-1541349,
MPS-1148698,
OAC-1826967, OAC-1450871,
OAC-1659169 and OAC-1841530.