This presentation provides a detailed look at the internal workings of the glideinWMS glidein_startup script, and of glideins in general. Part of the glideinWMS Training session held in Jan 2012 at UCSD.
This document provides an overview of glidein internals, including how glideins work, what glidein_startup does to configure and start Condor, security considerations for multi-user pilot jobs, and how glexec can help address security issues. It describes the key tasks of glidein_startup such as downloading files, validating nodes, configuring Condor, starting Condor daemons, collecting monitoring info, and cleaning up. It also discusses limiting glidein lifetime and addressing sources of waste.
glideinWMS Frontend Internals - glideinWMS Training Jan 2012, by Igor Sfiligoi
This presentation provides a detailed look at the internal workings of the glideinWMS Frontend. Part of the glideinWMS Training session held in Jan 2012 at UCSD.
glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012, by Igor Sfiligoi
This talk walks you through the monitoring options a glideinWMS Frontend operator has.
Part of the glideinWMS Training session held in Jan 2012 at UCSD.
glideinWMS validation scripts - glideinWMS Training Jan 2012, by Igor Sfiligoi
Description of how to write custom validation scripts in glideinWMS, with an emphasis on VO Frontend operations.
Part of the glideinWMS Training session held in Jan 2012 at UCSD.
Throwing complexity over the wall: Rapid development for enterprise Java (Jav..., by Dan Allen
For many, development of enterprise Java has long been an arduous undertaking. We're of the opinion that application programmers should be free to focus on their business logic only.
In this session, we'll cover:
• What makes us most productive?
• What tasks should we be programming; more importantly, what shouldn't we?
• What is a component model, and what does it buy us?
• How is this stuff usable in the real world?
We'll discuss how testing relates to the features of the Java EE 6 stack. By the end, we'll have introduced a pair of simple and powerful frameworks that render the testing of real enterprise components as natural as calling "add" on a CS101 Calculator.java.
OpenTuesday: The Selenium tool family and its use in web and mobile auto..., by Digicomp Academy AG
This OpenTuesday talk demonstrates the use of the Selenium tool family in a highly scalable web and mobile environment, using practical examples. It also shows the advantages of open-source tools for the competitiveness and innovative strength of companies.
An argument for moving the requirements out of user hands - The CMS Experience, by Igor Sfiligoi
This talk argues that users should not write arbitrary Requirements expressions in Condor, but should only express their desires and let someone else write the actual policy.
Presentation at Condor Week 2012.
Folksonomies allow users to collaboratively tag and categorize content, creating an informal taxonomy without a controlled vocabulary. While this gives users freedom and flexibility, it can also result in inconsistencies and redundancies compared to formal taxonomies designed by experts. An ideal system may combine the user involvement of folksonomies with some structure and controls from traditional taxonomies.
Augmenting Big Data Analytics with Nirvana, by Igor Sfiligoi
It has been proven that Big Data Analytics can provide a major competitive advantage, yet most deployments cannot reach the vast majority of the data owned by an organization!
This presentation introduces the concept of tiered Big Data Analytics, with an eye toward the use of Nirvana in this scenario.
This document provides a high-level overview of how glideinWMS-based instances do matchmaking in CMS (a High Energy Physics experiment). The information is accurate as of early Dec 2012.
Pilot Factory uses schedd glideins to submit pilot jobs locally from a remote resource. A pilot generator program communicates with a database and periodically submits pilots with desired configurations to matchmake and run various job types. This allows bypassing Condor-G and its overhead for large job submissions while taking advantage of local scheduling on the remote resource. Future work includes integrating pilots more directly with Condor startds for additional functionality.
Monitoring and troubleshooting a glideinWMS-based HTCondor pool, by Igor Sfiligoi
This document discusses tools for monitoring and troubleshooting jobs in a glideinWMS-based HTCondor pool. It describes how to determine where jobs are running, why jobs may not be starting, and why jobs are taking a long time to finish. The key tools mentioned are condor_q, condor_history, and the job event log. The document also provides guidance on checking job requirements, user priorities, supported sites, and restarting jobs.
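The troubleshooting workflow described above can be sketched with the standard HTCondor command-line tools; the job id 1234.0 and the log path are placeholders for illustration:

```shell
# Where is the job running? (RemoteHost is only set once a job matches)
condor_q 1234.0 -af:j JobStatus RemoteHost

# Why is the job not starting? -better-analyze compares the job's
# Requirements expression against the machine ads currently in the pool.
condor_q -better-analyze 1234.0

# How do user priorities compare? (lower effective priority wins resources)
condor_userprio -all

# What happened to a job that has already left the queue?
condor_history 1234.0

# Follow the job event log (the path set by "log = ..." in the submit file)
tail -f /path/to/job.log
```

These commands require access to the pool's schedd; the exact output depends on the pool configuration.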
This document provides an introduction to glideinWMS for users with experience in grid computing. It explains that glideinWMS addresses the problems of scheduling many user jobs across multiple grid sites in a fair manner. It does this by using "pilot jobs" that create an "overlay batch system" where user jobs can run. This allows flexible job scheduling policies. The document provides high-level overviews of how glideinWMS interfaces with grid sites, the glidein factory, VO frontend, and user experience.
Wedding convenience and control with RemoteCondor, by Igor Sfiligoi
This presentation explains why Condor is not suitable for use on user-owned machines, and why RemoteCondor is the best available solution to the problem.
1. The document discusses tools for Android platform development and proposes a new architecture using Node.js, Express, and JavaScript for building lightweight monitoring tools.
2. It describes initial pain points such as looking up AIDL files and monitoring processes/filesystem, and demonstrates a tool called Raidl that returns AIDL interfaces.
3. The proposed architecture uses a Node.js backend with Express and a Backbone.js frontend, allowing tools to be built for process monitoring, file browsing, and analyzing Binder relationships between services.
Solving Grid problems through glidein monitoring, by Igor Sfiligoi
This document provides an overview of common problems that can occur in a grid and how they are diagnosed and addressed through glidein monitoring. It discusses issues that may happen at various points such as compute elements refusing glideins, validation failures on worker nodes, authentication problems, and job startup failures due to issues like gLExec configuration. The document aims to help understand the debugging process for grid problems and how glidein monitoring plays a key role in solving grid issues.
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2IUsWca
This CloudxLab tutorial helps you understand running Spark on a cluster in detail. Below are the topics covered in this tutorial:
1) Spark Runtime Architecture
2) Driver Node
3) Scheduling Tasks on Executors
4) Understanding the Architecture
5) Cluster Managers
6) Executors
7) Launching a Program using spark-submit
8) Local Mode & Cluster-Mode
9) Installing Standalone Cluster
10) Cluster Mode - YARN
11) Launching a Program on YARN
12) Cluster Mode - Mesos and AWS EC2
13) Deployment Modes - Client and Cluster
14) Which Cluster Manager to Use?
15) Common flags for spark-submit
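The spark-submit flags covered in the tutorial can be combined as in this sketch; the application class, jar name, and resource sizes are placeholders:

```shell
# Run on YARN in client deploy mode: the driver runs locally,
# while executors are scheduled on the cluster.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --class com.example.MyApp \
  --num-executors 4 \
  --executor-memory 2g \
  --executor-cores 2 \
  myapp.jar input.txt output/

# Local mode for testing: driver and executors share one JVM, 4 threads.
spark-submit --master "local[4]" --class com.example.MyApp myapp.jar input.txt
```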
Condor overview - glideinWMS Training Jan 2012, by Igor Sfiligoi
An overview of the Condor Workload Management System, with emphasis on how it is used within the glideinWMS.
Part of the glideinWMS Training session held in Jan 2012 at UCSD.
The event page is http://hepuser.ucsd.edu/twiki2/bin/view/Main/GlideinFrontend1201
Video available at http://www.youtube.com/watch?v=tpaedg09VMM
The glideinWMS approach to the ownership of System Images in the Cloud World, by Igor Sfiligoi
Presentation at CLOSER 2012.
Scientific communities that are accustomed to using Grid resources are now considering the use of Cloud resources. However, moving from the Grid to the Cloud brings along the need to create and maintain the system image used to configure the provisioned resources, and this presents both opportunities and problems for the users. The impact is especially interesting in the context of glideinWMS due to its layered architecture. This presentation describes the various options available to the glideinWMS project team, with their advantages and disadvantages, and explains why one of them is preferred.
Closer web page: http://closer.scitevents.org/
This talk explains that nowadays it is not enough for a developer to know only a programming language; one also needs to know the tools for development, for improving code quality, and for CI.
https://phpfriends.club/meetups-5.html
Improving Engineering Processes using Hudson - Spark IT 2010, by Arun Gupta
Hudson is an open-source continuous integration (CI) server that automates the build, test, and deployment of code. It monitors code repositories for changes, compiles code, runs tests, and notifies developers of failures. Hudson emphasizes ease of use and extensibility through plugins. It allows teams to spend less time on manual tasks and more time developing code through automation of the engineering process.
In Web hosting services, hosting systems use access controls such as suEXEC on Apache web servers to separate privileges for each virtual host. However, existing access control architectures on web servers suffer from low performance and are not appropriate for dynamic content such as Web APIs, since these architectures require terminating the process after each HTTP session. These access controls are also not easy for system developers to install, since they are conventionally provided per interpreter and program execution method. In this paper, we propose the access control architecture “mod_process_security”. In this architecture, the server process creates a new thread when accepting a request; the web server then separates privileges by thread and processes the content on that thread. A server process with “mod_process_security” installed executes programs faster, and system developers can easily install it on web servers as a replacement for the complicated existing access controls. “mod_process_security” can be installed as an Apache module for the widely used Apache HTTP Server on Linux.
This document provides an overview of attacking Android devices. It discusses Android's architecture and security model, popular versions and their vulnerabilities, and methods for remotely accessing a device and elevating privileges. The presenters demonstrate gaining root access on an Android 2.1 device through a browser vulnerability and vendor kernel bug. They discuss options for backdooring devices at the application or system level for persistent remote access.
The document provides an introduction to GIT and compares it to SVN. It discusses the advantages of GIT such as faster performance, local repositories, and easier branching. The document outlines GIT's basic components like the working directory, staging area, and repository. It also summarizes common GIT commands for managing branches, merging, and pushing/pulling changes.
glideinWMS Training Jan 2012 - Condor tuning, by Igor Sfiligoi
This talk walks you through the various knobs that need to be tuned to get Condor to work with glideinWMS.
Part of the glideinWMS Training session held in Jan 2012 at UCSD.
Preparing Fusion codes for Perlmutter - CGYRO, by Igor Sfiligoi
The document discusses the CGYRO simulation tool, which is used for fusion plasma turbulence simulations. CGYRO is optimized for multi-scale simulations and is both memory and compute intensive. It is inherently parallel and uses OpenMP, OpenACC, and MPI for parallelization across CPU and GPU cores. While initial runs on Perlmutter had communication bottlenecks, improved networking with Slingshot 11 has helped increase performance, though it can interfere with MPS. Overall, CGYRO users are pleased with the transition from Cori to Perlmutter, finding it much faster for equivalent hardware.
Comparing single-node and multi-node performance of an important fusion HPC c..., by Igor Sfiligoi
Fusion simulations have traditionally required the use of leadership scale High Performance Computing (HPC) resources in order to produce advances in physics. The impressive improvements in compute and memory capacity of many-GPU compute nodes are now allowing for some problems that once required a multi-node setup to be also solvable on a single node. When possible, the increased interconnect bandwidth can result in order of magnitude higher science throughput, especially for communication-heavy applications. In this paper we analyze the performance of the fusion simulation tool CGYRO, an Eulerian gyrokinetic turbulence solver designed and optimized for collisional, electromagnetic, multiscale simulation, which is widely used in the fusion research community. Due to the nature of the problem, the application has to work on a large multi-dimensional computational mesh as a whole, requiring frequent exchange of large amounts of data between the compute processes. In particular, we show that the average-scale nl03 benchmark CGYRO simulation can be run at an acceptable speed on a single Google Cloud instance with 16 A100 GPUs, outperforming 8 NERSC Perlmutter Phase1 nodes, 16 ORNL Summit nodes and 256 NERSC Cori nodes. Moving from a multi-node to a single-node GPU setup we get comparable simulation times using less than half the number of GPUs. Larger benchmark problems, however, still require a multi-node HPC setup due to GPU memory capacity needs, since at the time of writing no vendor offers nodes with a sufficient GPU memory setup. The upcoming external NVSWITCH does however promise to deliver an almost equivalent solution for up to 256 NVIDIA GPUs.
Presented at PEARC22.
Paper DOI: https://doi.org/10.1145/3491418.3535130
The anachronism of whole-GPU accounting, by Igor Sfiligoi
NVIDIA has been making steady progress in increasing the compute performance of its GPUs, resulting in order of magnitude compute throughput improvements over the years. With several models of GPUs coexisting in many deployments, the traditional accounting method of treating all GPUs as being equal is not reflecting compute output anymore. Moreover, for applications that require significant CPU-based compute to complement the GPU-based compute, it is becoming harder and harder to make full use of the newer GPUs, requiring sharing of those GPUs between multiple applications in order to maximize the achievable science output. This further reduces the value of whole-GPU accounting, especially when the sharing is done at the infrastructure level. We thus argue that GPU accounting for throughput-oriented infrastructures should be expressed in GPU core hours, much like it is normally done for the CPUs. While GPU core compute throughput does change between GPU generations, the variability is similar to what we expect to see among CPU cores. To validate our position, we present an extensive set of run time measurements of two IceCube photon propagation workflows on 14 GPU models, using both on-prem and Cloud resources. The measurements also outline the influence of GPU sharing at both HTCondor and Kubernetes infrastructure level.
Presented at PEARC22.
Document DOI: https://doi.org/10.1145/3491418.3535125
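The proposed accounting change can be illustrated with a small sketch. The CUDA core counts below are published hardware specs, but the job mix is invented for illustration:

```python
# Whole-GPU accounting treats every GPU-hour as equal; GPU-core-hour
# accounting weights each hour by the size of the GPU (or GPU slice) used.

# CUDA core counts for a few GPU models (published hardware specs)
GPU_CORES = {
    "V100": 5120,
    "A100": 6912,
    "A100-MIG-1g.10gb": 6912 // 7,  # roughly a 1/7 slice of an A100
}

def whole_gpu_hours(jobs):
    """Traditional accounting: one unit per GPU (or slice) per hour."""
    return sum(hours for _, hours in jobs)

def gpu_core_hours(jobs):
    """Proposed accounting: weight each hour by the GPU's core count."""
    return sum(GPU_CORES[model] * hours for model, hours in jobs)

# Hypothetical job mix: (GPU model, hours used)
jobs = [("V100", 10), ("A100", 10), ("A100-MIG-1g.10gb", 10)]

print(whole_gpu_hours(jobs))  # 30: a MIG slice counts the same as a full A100
print(gpu_core_hours(jobs))   # core-hours reflect the very different capacities
```

Note how the whole-GPU number hides the fact that the three jobs consumed very different amounts of compute, which is exactly the argument the talk makes.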
Auto-scaling HTCondor pools using Kubernetes compute resources, by Igor Sfiligoi
HTCondor has been very successful in managing globally distributed, pleasantly parallel scientific workloads, especially as part of the Open Science Grid. HTCondor system design makes it ideal for integrating compute resources provisioned from anywhere, but it has very limited native support for autonomously provisioning resources managed by other solutions. This work presents a solution that allows for autonomous, demand-driven provisioning of Kubernetes-managed resources. A high-level overview of the employed architectures is presented, paired with the description of the setups used in both on-prem and Cloud deployments in support of several Open Science Grid communities. The experience suggests that the described solution should be generally suitable for contributing Kubernetes-based resources to existing HTCondor pools.
Presented at PEARC22.
Paper DOI: https://doi.org/10.1145/3491418.3535123
Performance Optimization of CGYRO for Multiscale Turbulence Simulations, by Igor Sfiligoi
Overview of the recent performance optimization of CGYRO, an Eulerian gyrokinetic fusion plasma solver, with emphasis on multiscale turbulence simulations.
Presented at the joint US-Japan Workshop on Exascale Computing Collaboration and the 6th workshop of the US-Japan Joint Institute for Fusion Theory (JIFT) program (Jan 18, 2022).
Comparing GPU effectiveness for Unifrac distance compute, by Igor Sfiligoi
Poster presented at PEARC21.
The poster contains the complete scaling plots for both unweighted and weighted normalized Unifrac compute for sample sizes ranging from 1k to 307k on both GPUs and CPUs.
Managing Cloud networking costs for data-intensive applications by provisioni..., by Igor Sfiligoi
Presented at PEARC21.
Many scientific high-throughput applications can benefit from the elastic nature of Cloud resources, especially when there is a need to reduce time to completion. Cost considerations are usually a major issue in such endeavors, with networking often a major component; for data-intensive applications, egress networking costs can exceed the compute costs. Dedicated network links provide a way to lower the networking costs, but they do add complexity. In this paper we provide a description of a 100 fp32 PFLOPS Cloud burst in support of IceCube production compute, that used Internet2 Cloud Connect service to provision several logically-dedicated network links from the three major Cloud providers, namely Amazon Web Services, Microsoft Azure and Google Cloud Platform, that in aggregate enabled approximately 100 Gbps egress capability to on-prem storage. It provides technical details about the provisioning process, the benefits and limitations of such a setup and an analysis of the costs incurred.
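The scale of the egress-cost concern can be illustrated with back-of-the-envelope arithmetic; the data volume and per-GB prices below are hypothetical round numbers, not quotes from any provider:

```python
# Rough egress-cost comparison: default Internet egress vs. a discounted
# dedicated-link rate, for a data-intensive run like the one described.

TB = 1000  # GB per TB (decimal, as cloud billing typically uses)

def egress_cost(data_tb, price_per_gb):
    """Total egress cost in dollars for data_tb terabytes at price_per_gb $/GB."""
    return data_tb * TB * price_per_gb

data_tb = 500           # hypothetical amount of data moved off-cloud
internet_rate = 0.08    # hypothetical default egress price, $/GB
dedicated_rate = 0.02   # hypothetical dedicated-link price, $/GB

print(egress_cost(data_tb, internet_rate))   # default-route cost
print(egress_cost(data_tb, dedicated_rate))  # dedicated-link cost
```

Even with made-up rates, the sketch shows why egress can dominate total cost and why dedicated links are worth the added provisioning complexity.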
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access, by Igor Sfiligoi
Presented at PEARC21.
Most experimental sciences now rely on computing, and biological sciences are no exception. As datasets get bigger, so do the computing costs, making proper optimization of the codes used by scientists increasingly important. Many of the codes developed in recent years are based on the Python-based NumPy, due to its ease of use and good performance characteristics. The composable nature of NumPy, however, does not generally play well with the multi-tier nature of modern CPUs, making any non-trivial multi-step algorithm limited by the external memory access speeds, which are hundreds of times slower than the CPU’s compute capabilities. In order to fully utilize the CPU compute capabilities, one must keep the working memory footprint small enough to fit in the CPU caches, which requires splitting the problem into smaller portions and fusing together as many steps as possible. In this paper, we present changes based on these principles to two important functions in the scikit-bio library, principal coordinates analysis and the Mantel test, that resulted in over 100x speed improvement in these widely used, general-purpose tools.
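The optimization principle described, fusing steps and keeping the working set small instead of materializing full intermediate arrays, can be sketched in plain Python. The computation here (a sum of squared centered values) is a toy stand-in for the paper's actual scikit-bio kernels:

```python
# Naive multi-pass version: each step materializes a full-size intermediate
# list, so every step streams the whole dataset through memory again.
def sum_sq_centered_naive(xs):
    mean = sum(xs) / len(xs)
    centered = [x - mean for x in xs]      # full intermediate allocation
    squared = [c * c for c in centered]    # another full intermediate
    return sum(squared)

# Fused, blocked version: one pass per block, no full-size intermediates,
# so the working set stays small (cache-resident in a compiled setting).
def sum_sq_centered_blocked(xs, block=256):
    mean = sum(xs) / len(xs)
    total = 0.0
    for i in range(0, len(xs), block):
        # centering and squaring are fused into a single pass over the block
        total += sum((x - mean) ** 2 for x in xs[i:i + block])
    return total

xs = [float(i % 17) for i in range(10_000)]
assert abs(sum_sq_centered_naive(xs) - sum_sq_centered_blocked(xs)) < 1e-5
```

Pure Python will not show the cache effect itself; the point of the sketch is the restructuring pattern (block the data, fuse the passes) that yields the large speedups when applied to compiled NumPy-backed code.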
Using A100 MIG to Scale Astronomy Scientific Output, by Igor Sfiligoi
The document discusses how Nvidia's A100 GPU with Multi-Instance GPU (MIG) capability can help scale up scientific output for astronomy projects like IceCube and LIGO. The A100 is much faster than previous GPUs, but MIG allows it to be partitioned so multiple jobs or processes can leverage the GPU simultaneously. This results in 200-600% higher throughput compared to using a single GPU, by better utilizing the massive parallelism of the A100. MIG makes the powerful A100 GPU practical for these CPU-bound scientific workloads.
Using commercial Clouds to process IceCube jobs, by Igor Sfiligoi
Presented at EDUCAUSE CCCG March 2021.
The IceCube Neutrino Observatory is the world’s premier facility to detect neutrinos. Built at the South Pole in natural ice, it requires extensive and expensive calibration to properly track the neutrinos. Most of the required compute power comes from on-prem resources through the Open Science Grid, but IceCube can easily harness Cloud compute at any scale, too, as demonstrated by a series of Cloud bursts. This talk provides both details of the performed Cloud bursts and some insight into the science itself.
Fusion simulations have traditionally required the use of leadership scale HPC resources in order to produce advances in physics. One such package is CGYRO, a premier tool for multi-scale plasma turbulence simulation. CGYRO is a typical HPC application that will not fit into a single node, as it requires several terabytes of memory and O(100) TFLOPS compute capability for cutting-edge simulations. CGYRO also requires high-throughput and low-latency networking, due to its reliance on global FFT computations. While in the past such compute may have required hundreds, or even thousands of nodes, recent advances in hardware capabilities allow for just tens of nodes to deliver the necessary compute power. We explored the feasibility of running CGYRO on Cloud resources provided by Microsoft on their Azure platform, using the InfiniBand-connected HPC resources in spot mode. We observed both that CPU-only resources were very efficient, and that running in spot mode was doable, with minimal side effects. The GPU-enabled resources were less cost effective but allowed for higher scaling.
This document discusses a large-scale GPU-based cloud burst simulation run by the IceCube collaboration to calibrate simulations of natural ice. The simulation was data-intensive, producing over 130 TB of data and exceeding 10 Gbps of egress bandwidth. Internet2 Cloud Connect service was used to provision over 20 dedicated network links between collaborators' institutions and cloud providers to enable high-throughput data transfer at a lower cost than commercial routes. Careful planning was required to smoothly ramp up the burst and avoid overloading individual network links.
Scheduling a Kubernetes Federation with AdmiraltyIgor Sfiligoi
This document discusses using Admiralty to federate the Pacific Research Platform (PRP) Kubernetes cluster, called Nautilus, with other clusters. The key points are:
1) PRP/Nautilus has been growing and now has nodes in multiple regions, requiring federation to integrate resources.
2) Admiralty provides a native Kubernetes solution for federation without centralized control. It allows clusters to participate in multiple federations.
3) Installing Admiralty on PRP/Nautilus and other clusters being federated was straightforward using Helm. Pods can be scheduled across clusters automatically.
4) Initial federation is working well between PRP/Nautilus and other clusters for expanded resource sharing
Accelerating microbiome research with OpenACCIgor Sfiligoi
Presented at OpenACC Summit 2020.
UniFrac is a commonly used metric in microbiome research for comparing microbiome profiles to one another. Computing UniFrac on modest sample sizes used to take a workday on a server class CPU-only node, while modern datasets would require a large compute cluster to be feasible. After porting to GPUs using OpenACC, the compute of the same modest sample size now takes only a few minutes on a single NVIDIA V100 GPU, while modern datasets can be processed on a single GPU in hours. The OpenACC programming model made the porting of the code to GPUs extremely simple; the first prototype was completed in just over a day. Getting full performance did however take much longer, since proper memory access is fundamental for this application.
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...Igor Sfiligoi
Presented at PEARC20.
This talk presents expanding the IceCube’s production HTCondor pool using cost-effective GPU instances in preemptible mode gathered from the three major Cloud providers, namely Amazon Web Services, Microsoft Azure and the Google Cloud Platform. Using this setup, we sustained for a whole workday about 15k GPUs, corresponding to around 170 PFLOP32s, integrating over one EFLOP32 hour worth of science output for a price tag of about $60k. In this paper, we provide the reasoning behind Cloud instance selection, a description of the setup and an analysis of the provisioned resources, as well as a short description of the actual science output of the exercise.
Porting and optimizing UniFrac for GPUsIgor Sfiligoi
Poster presented at PEARC20.
UniFrac is a commonly used metric in microbiome research for comparing microbiome profiles to one another (“beta diversity”). The recently implemented Striped UniFrac added the capability to split the problem into many independent subproblems and exhibits near linear scaling. In this poster we describe steps undertaken in porting and optimizing Striped Unifrac to GPUs. We reduced the run time of computing UniFrac on the published Earth Microbiome Project dataset from 13 hours on an Intel Xeon E5-2680 v4 CPU to 12 minutes on an NVIDIA Tesla V100 GPU, and to about one hour on a laptop with NVIDIA GTX 1050 (with minor loss in precision). Computing UniFrac on a larger dataset containing 113k samples reduced the run time from over one month on the CPU to less than 2 hours on the V100 and 9 hours on an NVIDIA RTX 2080TI GPU (with minor loss in precision). This was achieved by using OpenACC for generating the GPU offload code and by improving the memory access patterns. A BSD-licensed implementation is available, which produces a Cshared library linkable by any programming language.
Demonstrating 100 Gbps in and out of the public CloudsIgor Sfiligoi
Poster presented at PEARC20.
There is increased awareness and recognition that public Cloud providers do provide capabilities not found elsewhere, with elasticity being a major driver. The value of elastic scaling is however tightly coupled to the capabilities of the networks that connect all involved resources, both in the public Clouds and at the various research institutions. This poster presents results of measurements involving file transfers inside public Cloud providers, fetching data from on-prem resources into public Cloud instances and fetching data from public Cloud storage into on-prem nodes. The networking of the three major Cloud providers, namely Amazon Web Services, Microsoft Azure and the Google Cloud Platform, has been benchmarked. The on-prem nodes were managed by either the Pacific Research Platform or located at the University of Wisconsin – Madison. The observed sustained throughput was of the order of 100 Gbps in all the tests moving data in and out of the public Clouds and throughput reaching into the Tbps range for data movements inside the public Cloud providers themselves. All the tests used HTTP as the transfer protocol.
TransAtlantic Networking using Cloud linksIgor Sfiligoi
Scientific communities have only limited amount of bandwidth available for transferring data between the US and the EU.
We know Cloud providers have plenty of bandwidth available, but at what cost?
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Full-RAG: A modern architecture for hyper-personalizationZilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
Building RAG with self-deployed Milvus vector database and Snowpark Container...Zilliz
This talk will give hands-on advice on building RAG applications with an open-source Milvus database deployed as a docker container. We will also introduce the integration of Milvus with Snowpark Container Services.
3. Refresher - What is a glidein?
● A glidein is just a properly configured execution node submitted as a Grid job
[Diagram: a Central manager (Collector, Negotiator), several Submit nodes (each with a Schedd) and several Execution nodes; each Execution node runs a glidein whose Startd executes a Job]
UCSD Jan 17th 2012 Glidein internals 3
7. Downloading files
● Files downloaded via HTTP
● From both the factory and the frontend Web servers
● Can use local Web proxy (e.g. Squid)
● Mechanism tamper proof and cache coherent
[Diagram: glidein_startup loads files from the factory Web server and the frontend Web server (each behind an HTTPd, optionally through a Squid proxy), runs the executables, starts the Condor Startd and finally cleans up. Nothing will work if the HTTPd is down!]
8. URLs
● All URLs passed to glidein_startup as arguments
● Factory Web server
● Frontend Web server
● Squid, if any
● glidein_startup arguments defined by the factory
● Frontend Web URL passed to the Factory via request ClassAd
9. Cache coherence
● Files never change, once uploaded to the Web area
● If the source file changes, a file with a different name is created on the Web area
● Essentially fname.date
● There is a (set of) logical to physical name map files
● *_file_list.id.lst
● The names of the list files are in description.id.cfg
● The id of this list is passed as an argument to glidein_startup
● Frontend ids passed to the factory via ClassAd
10. Making the download tamper proof
● Verifying SHA1 hashes
● Each file has a SHA1 hash associated with it
● The hashes delivered in a single file
● signature.id.sha1
● List of (fname, hash) pairs
● Description file hash in the signature file
● Signature file id in the description file
● The hash of the signature file is a Frontend parameter passed to the factory via ClassAd
● That hash is passed to glidein_startup
● This is guaranteed to be secure with GRAM
● Still safe even if loaded out of order
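The per-file check described above can be sketched in a few lines of shell. `check_file_signature` is a hypothetical helper name for illustration, not the real glidein_startup function:

```shell
#!/bin/bash
# Hedged sketch of the tamper-proof check: look up the expected SHA1 of
# a downloaded file in the signature file ("<hash>  <fname>" pairs, the
# format produced by sha1sum) and compare it with the actual hash.
check_file_signature() {
    local fname="$1" signature="$2"
    local expected actual
    expected=$(awk -v f="$fname" '$2 == f {print $1}' "$signature")
    actual=$(sha1sum "$fname" | awk '{print $1}')
    if [ -z "$expected" ] || [ "$expected" != "$actual" ]; then
        echo "Corrupted or tampered file: $fname" 1>&2
        return 1
    fi
}
```

On a mismatch the real script aborts the whole glidein, so a compromised cache or Web area can at worst cause a startup failure, never inject code.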
11. Example Web area
[Screenshot: example Frontend and Factory Web area directory listings, with glidein_startup fetched via Globus]
12. Node validation
● Run scripts / plugins provided by both factory and frontend
● The map file has a flag to distinguish scripts from “regular” files
# Outfile InFile Exec/other
#########################################################
constants.cfg constants.b8ifnW.cfg regular
cat_consts.sh cat_consts.b8ifnW.sh exec
set_home_cms.source set_home_cms.b8ifnW.source wrapper
● If a script returns with exit_code != 0, glidein_startup stops execution
● Condor never started → no user jobs ever pulled
● Will sleep for 20 mins → blackhole protection
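The flag column drives what glidein_startup does with each entry. A minimal sketch, assuming a helper name and file layout chosen here for illustration (the real script is more elaborate):

```shell
#!/bin/bash
# Process a downloaded file list: "<outfile> <infile> <flag>" per line.
# "regular" files are just kept, "exec" scripts are run (a failure
# triggers the 20-minute blackhole-protection sleep before aborting),
# and "wrapper" fragments are collected to be sourced around user jobs.
process_file_list() {
    local list="$1" outfile infile flag
    while read -r outfile infile flag; do
        case "$outfile" in '#'*|'') continue ;; esac   # comments/blanks
        case "$flag" in
            regular) : ;;
            exec)
                chmod +x "$outfile"
                if ! "./$outfile"; then
                    sleep 1200   # blackhole protection
                    return 1
                fi ;;
            wrapper) cat "$outfile" >> job_wrapper.sh ;;
        esac
    done < "$list"
}
```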
13. Condor configuration
● Condor config values coming from files downloaded by glidein_startup
● Static from both factory and frontend config files
● More details at http://tinyurl.com/glideinWMS/doc.prd/factory/custom_vars.html
● Scripts can add, alter or delete any attribute
● Based on dynamic information
● Either discovered on the WN or by combining info from various sources
● More details at http://tinyurl.com/glideinWMS/doc.prd/factory/custom_scripts.html
14. Example (pseudo-)script
# check if CMSSW installed locally
if [ -f "$CMSSW" ]; then
  # get the other env
  source "$CMSSW"
else
  # fail validation
  echo "CMSSW not found!" 1>&2
  exit 1
fi
# publish to Condor
add_config_line "CMSSW_LIST" "$CMS_SW_LIST"
add_condor_vars_line "CMSSW_LIST" "S" "-" "+" "Y" "Y" "-"
# if I got here, passed validation
exit 0
More details at http://tinyurl.com/glideinWMS/doc.prd/factory/custom_scripts.html
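For orientation, the two helpers called by the example script effectively append lines to the glidein's config fragments. A hedged sketch — the target file variables (GLIDEIN_CONFIG, CONDOR_VARS_FILE) are stand-ins chosen here; the real helpers live in glidein_startup's support scripts:

```shell
#!/bin/bash
# Minimal stand-ins for the helpers a validation script calls.
add_config_line() {
    # record "NAME value ..." for the glidein's Condor configuration
    echo "$@" >> "$GLIDEIN_CONFIG"
}
add_condor_vars_line() {
    # register the attribute in the condor_vars list; the seven fields
    # (name, type, default, condor name, and publishing flags) follow
    # the format shown in the example script above
    echo "$1 $2 $3 $4 $5 $6 $7" >> "$CONDOR_VARS_FILE"
}
```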
15. Example Frontend attr
● START expression
● As GLIDECLIENT_Group_Start
● Automatically &&-ed with the Factory provided one
GLIDECLIENT_Group_Start =
  (stringListMember(GLIDEIN_SEs, DESIRED_SEs)=?=True) ||
  (stringListMember(GLIDEIN_Site, DESIRED_Sites)=?=True)
More details at http://tinyurl.com/glideinWMS/doc.prd/factory/custom_vars.html
16. Condor startup
● After validation and configuration, the glidein just starts condor_master
● Which in turn starts condor_startd
● It is just regular Condor from here on
● Any policy the VO needs must be part of the Condor configuration
[Diagram: the Startd launched by glidein_startup on the Execution node joins the Central manager (Collector, Negotiator) and serves the Submit node's Schedd]
17. Post mortem monitoring
● After a glidein terminates, the logs are sent back to the Factory
● Frontend does not have access to them
● Work in progress for a future glideinWMS release
● For now, ask the Factory admins when debugging
18. Cleanup
● glidein_startup will remove all files before terminating
● Unless killed by the OS, of course
● Condor does something similar for the user jobs' disk area after every job
● Users should not expect anything to survive their jobs
20. Glidein lifetime
● Glideins are temporary resources
● Must go away after some time
● We want them to go away of their own will
● So we can monitor progress and clean up
● As opposed to being killed by the Grid batch system
● Condor daemons configured to die by themselves
● Just need to tell them when
21. Limiting glidein lifetime
● Hard End-of-Life deadline
● Condor will terminate if it is ever reached
● Any jobs running at that point in time will be killed
– Resulting in waste
● Deadline site specific
● Set by the Factory
● Controlled by GLIDEIN_Max_Walltime
● Condor advertises it in the glidein ClassAd
● Attribute GLIDEIN_ToDie
22. Deadline for job startup
● Starting jobs just to kill them is wasteful
● Glideins set an earlier deadline for job startup
● After the deadline, Condor will terminate at job end
● Advertised as GLIDEIN_ToRetire
● Need to know how long the expected job lifetime is
● A default exists, but don't count on it
● Should be provided by the Frontend
● Parameter GLIDEIN_Job_Max_Time
● ToRetire = Max_Walltime - Job_Max_Time
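The arithmetic on the last bullet, in seconds since the epoch. Variable names match the slides; the computation itself is an illustrative sketch, and the default values are placeholders:

```shell
#!/bin/bash
# Derive the two glidein deadlines from the Factory/Frontend knobs.
GLIDEIN_Max_Walltime=${GLIDEIN_Max_Walltime:-43200}   # e.g. 12h, Factory-set
GLIDEIN_Job_Max_Time=${GLIDEIN_Job_Max_Time:-7200}    # e.g. 2h, Frontend-set

now=$(date +%s)
GLIDEIN_ToDie=$(( now + GLIDEIN_Max_Walltime ))              # hard kill deadline
GLIDEIN_ToRetire=$(( GLIDEIN_ToDie - GLIDEIN_Job_Max_Time )) # no new jobs after this
```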
23. Termination due to non-use
● Glideins will self-terminate if not used, too
● To limit waste
● Reasons for hitting this
● No more user jobs (spikes)
● Policy problems (Frontend vs Negotiator)
● Typically uniform across the Condor pool
● Can be set by either Frontend or Factory
● Controlled by GLIDEIN_Max_Idle
● Defaults to 20 mins, to give the Negotiator the time to match the glidein
24. Finer grained policies
● Job_Max_Time is the typical expected job lifetime
● What if not all jobs have the same lifetime?
● Use Condor matchmaking to define finer grained policies
● To prevent long jobs from starting if they are expected to be killed before terminating
● This allows (lower priority) shorter jobs to use the CPU
● Or waste just 20 mins (instead of hours)
● Job lifetime can be difficult to know
● Even users often don't know
● Can refine after restarts (assuming they are deterministic)
25. Example START expression
● Have two different thresholds
● One for first startup, one for all following
● Useful when wide dynamic range, unpredictable
GLIDECLIENT_Start =
ifthenelse(
LastVacateTime=?=UNDEFINED,
NormMaxWallTime < (GLIDEIN_ToDie-MyCurrentTime),
MaxWallTime < (GLIDEIN_ToDie-MyCurrentTime)
)
26. Other sources of waste
● Glideins must stay within the limits of the resource lease
● Typical limit is on memory usage
● The OSG Factory defines GLIDEIN_MaxMemMBs
● If user jobs exceed that limit, the glideins may be killed by the resource provider (e.g. Grid batch system)
● Resulting in the user job being killed, too
● Thus waste
27. Example START expression
● Prevent startup of jobs that are known to use too much memory
● Unless the user defines it, known only at the 2nd re-run
● Also kill jobs as soon as they pass that limit
● So we don't have to start a new glidein for other jobs
GLIDECLIENT_Start =
  ImageSize<=(GLIDEIN_MaxMemMBs*1024)
PREEMPT =
  ImageSize>(GLIDEIN_MaxMemMBs*1024)
More details at http://research.cs.wisc.edu/condor/manual/v7.6/10_Appendix_A.html#82447
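The *1024 in the expressions converts megabytes to kilobytes: Condor reports ImageSize in KiB, while GLIDEIN_MaxMemMBs is in MiB. A tiny illustrative check (the helper name is made up here):

```shell
#!/bin/bash
# Succeeds (exit 0) when a job's image size exceeds the glidein's
# memory lease; image size is in KiB, the limit in MiB.
exceeds_mem_limit() {
    local image_size_kb="$1" max_mem_mb="$2"
    [ "$image_size_kb" -gt $(( max_mem_mb * 1024 )) ]
}
```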
29. Talking to the rest of Condor
● Glideins will talk over WAN to the rest of Condor
● Strong security a must
● Collector and glidein must whitelist each other
● Schedd and glidein trusted indirectly
● Security token through Collector
[Diagram: the Startd launched by glidein_startup on the Execution node talks over the WAN to the Central manager (Collector, Negotiator) and the Submit node's Schedd]
30. Multi User Pilot Jobs
● A glidein typically starts with a proxy that is not representing a specific user
● Very few exceptions (e.g. CMS Organized Reco)
● Three potential security issues:
● Site does not know who the final user is (and thus cannot ban specific users, if needed)
● If 2 glideins land on the same node, 2 users may be running as the same UID (no system protection)
● User and pilot code run as the same UID (no system protection)
31. glexec
● We have one tool that can help us with all 3
● Namely glexec
● glexec provides UID switching
● But must be installed by the Site on WNs
● Only a subset of sites have done it so far
● glexec requires a valid proxy from the final users on the submit node
● Or jobs will not match
● Not a problem for long-time Grid users, like CMS, but it may be for other VOs
32. glexec in a picture
[Diagram: on the Execution node, the glidein's Startd (running under the pilot UID) invokes glexec to run the user Job under the user's UID; the Central manager (Collector, Negotiator) and the Submit nodes (Schedd) are unchanged]
33. glexec, cont
● It is in the VO's best interest to use glexec
● Only way to have a secure MUPJ system
● Should push sites to deploy it
● Some sites require glexec for their own interest
● To cryptographically know who the final users are
● To easily ban users (may be a legal requirement)
● Note on site banning
● Condor currently does not handle glexec failures well – VO admin must look for this
34. Turning on glexec
● Factory provides information about glexec
installation at site
● Attribute GLEXEC_BIN = NONE | OSG | glite
● Frontend decides if it should be used
● Parameter GLIDEIN_Glexec_Use
● Possible values:
– NEVER - Do not use, even if available
– OPTIONAL - Use whenever available
– REQUIRED - Only use sites that provide glexec
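The combination of the Factory's GLEXEC_BIN and the Frontend's GLIDEIN_Glexec_Use boils down to a small decision table. A hypothetical sketch (the real logic lives in the Frontend matching code, not in a shell function):

```shell
#!/bin/bash
# Decide whether a glidein at a given site will use glexec.
#   use: NEVER | OPTIONAL | REQUIRED   (Frontend's GLIDEIN_Glexec_Use)
#   bin: NONE | OSG | glite            (Factory's GLEXEC_BIN)
decide_glexec() {
    local use="$1" bin="$2"
    case "$use" in
        NEVER)    echo "no" ;;
        OPTIONAL) if [ "$bin" != "NONE" ]; then echo "yes"; else echo "no"; fi ;;
        # REQUIRED at a site without glexec: no glideins are sent at all
        REQUIRED) if [ "$bin" != "NONE" ]; then echo "yes"; else return 1; fi ;;
    esac
}
```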
36. Pointers
● The official project Web page is
http://tinyurl.com/glideinWMS
● glideinWMS development team is reachable at
glideinwms-support@fnal.gov
● CMS AnaOps Frontend at UCSD
http://glidein-collector.t2.ucsd.edu:8319/vofrontend/monitor/frontend_UCSD-v5_2/frontendStatus.html
37. Acknowledgments
● The glideinWMS is a CMS-led project developed mostly at FNAL, with contributions from UCSD and ISI
● The glideinWMS factory operations at UCSD are sponsored by OSG
● The funding comes from NSF, DOE and the UC system