This talk walks you through the monitoring options a glideinWMS Frontend operator has.
Part of the glideinWMS Training session held in Jan 2012 at UCSD.
Glidein startup Internals and Glidein configuration - glideinWMS Training Jan... - Igor Sfiligoi
This presentation provides a detailed insight into the internal workings of the glideinWMS glidein startup script and the glideins in general. Part of the glideinWMS Training session held in Jan 2012 at UCSD.
glideinWMS validation scripts - glideinWMS Training Jan 2012 - Igor Sfiligoi
Description of how to write custom validation scripts in glideinWMS, with an emphasis on VO Frontend operations.
Part of the glideinWMS Training session held in Jan 2012 at UCSD.
glideinWMS Frontend Internals - glideinWMS Training Jan 2012 - Igor Sfiligoi
This presentation provides a detailed insight into the internal workings of the glideinWMS Frontend. Part of the glideinWMS Training session held in Jan 2012 at UCSD.
Add-On Development: EE Expects that Every Developer will do his Duty - reedmaniac
Add-Ons are what make ExpressionEngine the flexible powerhouse that it is today. Being able to write your own simple plugins or incredibly expansive modules allows you to mold ExpressionEngine to nearly any task that your website might require. However, with that power comes a great responsibility to ensure that your code is not slowing down the entire site or unduly stressing the server through bad code architecture.
There are simple tools already built into ExpressionEngine and PHP that you can use to see precisely what your Add-On is doing during page processing and where it might be doing more work than is absolutely necessary. Every developer should use these to optimize their work from the very beginning of development, prior to release. This workshop will explain these tools and how you can use them effectively. It will also delve deeper into optimization techniques and tricks that will keep your code light and clean, while finding a balance between functionality and performance.
This slide show illustrates preliminary work on the "pilot mechanism" using the Condor system. The goal is to create a uniform user interface to computational resources across the network and, at the same time, to increase the parallelism of user tasks toward optimal throughput in the long run.
This document provides a high-level overview of how glideinWMS-based instances do matchmaking in CMS (a High Energy Physics experiment). The information is accurate as of early Dec 2012.
Condor overview - glideinWMS Training Jan 2012 - Igor Sfiligoi
An overview of the Condor Workload Management System, with emphasis on how it is used within the glideinWMS.
Part of the glideinWMS Training session held in Jan 2012 at UCSD.
The event page is http://hepuser.ucsd.edu/twiki2/bin/view/Main/GlideinFrontend1201
Video available at http://www.youtube.com/watch?v=tpaedg09VMM
glideinWMS Training Jan 2012 - Condor tuning - Igor Sfiligoi
This talk walks you through the various knobs that need to be tuned to get Condor to work with glideinWMS.
Part of the glideinWMS Training session held in Jan 2012 at UCSD.
Wedding convenience and control with RemoteCondor - Igor Sfiligoi
This presentation explains why Condor is not suitable for use on user-owned machines, and why RemoteCondor is the best available solution to the problem.
DCSF 19 Deploying Rootless BuildKit on Kubernetes - Docker, Inc.
DockerCon Open Source Summit: BuildKit
Akihiro Suda, NTT Corporation
Building images on Kubernetes is attractive for distributing workload across multiple nodes, typically in a CI/CD pipeline. However, it had long been considered dangerous due to the dependency on `securityContext.privileged`.
In this talk, Akihiro will show how to use Rootless BuildKit in Kubernetes, which can be executed as a non-root user without extra `securityContext` configuration.
Slides from my beginner-level talk on Frida and its usage while pentesting Android applications. Covers topics like installation of Frida and bypassing pinning and root detection using Frida.
More and more companies are adopting some form of cloud for hosting their applications. Cloud also increasingly means Docker containers and Kubernetes to orchestrate them. Cloud is a different beast when it comes to running Java applications. It focuses on fast provisioning and ephemeral containers that start up and scale quickly and use minimum resources when idle. Correspondingly, the JVM has undergone a number of changes that help it adapt to cloud environments.
This talk looks closely at each of the cloud-specific aspects, explores the changes in Java that a developer needs to be aware of to make the most of running a Java app in the cloud, and provides a GitHub-based tutorial for reference.
Android on Windows 11 - A Developer's Perspective (Windows Subsystem For Andr... - Embarcadero Technologies
The Windows Subsystem for Android (WSA) brings native Android applications to the Windows 11 desktop. Learn how to set up and configure Windows Subsystem for Android for use in software development. See what is required to run WSA as well as what is required to target it from your Android development. Windows Subsystem for Android is available for public preview on Windows 11.
Webinar replay and more: https://blogs.embarcadero.com/?p=134192
Comparing single-node and multi-node performance of an important fusion HPC c... - Igor Sfiligoi
Fusion simulations have traditionally required the use of leadership scale High Performance Computing (HPC) resources in order to produce advances in physics. The impressive improvements in compute and memory capacity of many-GPU compute nodes are now allowing for some problems that once required a multi-node setup to be also solvable on a single node. When possible, the increased interconnect bandwidth can result in order of magnitude higher science throughput, especially for communication-heavy applications. In this paper we analyze the performance of the fusion simulation tool CGYRO, an Eulerian gyrokinetic turbulence solver designed and optimized for collisional, electromagnetic, multiscale simulation, which is widely used in the fusion research community. Due to the nature of the problem, the application has to work on a large multi-dimensional computational mesh as a whole, requiring frequent exchange of large amounts of data between the compute processes. In particular, we show that the average-scale nl03 benchmark CGYRO simulation can be run at an acceptable speed on a single Google Cloud instance with 16 A100 GPUs, outperforming 8 NERSC Perlmutter Phase1 nodes, 16 ORNL Summit nodes and 256 NERSC Cori nodes. Moving from a multi-node to a single-node GPU setup we get comparable simulation times using less than half the number of GPUs. Larger benchmark problems, however, still require a multi-node HPC setup due to GPU memory capacity needs, since at the time of writing no vendor offers nodes with a sufficient GPU memory setup. The upcoming external NVSWITCH does however promise to deliver an almost equivalent solution for up to 256 NVIDIA GPUs.
Presented at PEARC22.
Paper DOI: https://doi.org/10.1145/3491418.3535130
The anachronism of whole-GPU accounting - Igor Sfiligoi
NVIDIA has been making steady progress in increasing the compute performance of its GPUs, resulting in order of magnitude compute throughput improvements over the years. With several models of GPUs coexisting in many deployments, the traditional accounting method of treating all GPUs as being equal is not reflecting compute output anymore. Moreover, for applications that require significant CPU-based compute to complement the GPU-based compute, it is becoming harder and harder to make full use of the newer GPUs, requiring sharing of those GPUs between multiple applications in order to maximize the achievable science output. This further reduces the value of whole-GPU accounting, especially when the sharing is done at the infrastructure level. We thus argue that GPU accounting for throughput-oriented infrastructures should be expressed in GPU core hours, much like it is normally done for the CPUs. While GPU core compute throughput does change between GPU generations, the variability is similar to what we expect to see among CPU cores. To validate our position, we present an extensive set of run time measurements of two IceCube photon propagation workflows on 14 GPU models, using both on-prem and Cloud resources. The measurements also outline the influence of GPU sharing at both HTCondor and Kubernetes infrastructure level.
Presented at PEARC22.
Document DOI: https://doi.org/10.1145/3491418.3535125
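The accounting argument above boils down to a simple weighting. A minimal sketch, using the published CUDA-core counts for two GPU models (illustrative only; this is not the paper's accounting code):

```python
# Illustrative CUDA core counts per GPU model.
GPU_CORES = {"V100": 5120, "A100": 6912}

def whole_gpu_hours(records):
    """Traditional accounting: every GPU-hour counts the same."""
    return sum(hours for _model, hours in records)

def gpu_core_hours(records):
    """Proposed accounting: weight each hour by the GPU's core count,
    much like CPU accounting counts core hours."""
    return sum(GPU_CORES[model] * hours for model, hours in records)

# Ten hours on a V100 and ten hours on an A100 look identical under
# whole-GPU accounting, but differ in delivered compute capacity.
usage_v100 = [("V100", 10.0)]
usage_a100 = [("A100", 10.0)]
print(whole_gpu_hours(usage_v100), whole_gpu_hours(usage_a100))  # 10.0 10.0
print(gpu_core_hours(usage_v100), gpu_core_hours(usage_a100))    # 51200.0 69120.0
```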
Auto-scaling HTCondor pools using Kubernetes compute resources - Igor Sfiligoi
HTCondor has been very successful in managing globally distributed, pleasantly parallel scientific workloads, especially as part of the Open Science Grid. HTCondor system design makes it ideal for integrating compute resources provisioned from anywhere, but it has very limited native support for autonomously provisioning resources managed by other solutions. This work presents a solution that allows for autonomous, demand-driven provisioning of Kubernetes-managed resources. A high-level overview of the employed architectures is presented, paired with the description of the setups used in both on-prem and Cloud deployments in support of several Open Science Grid communities. The experience suggests that the described solution should be generally suitable for contributing Kubernetes-based resources to existing HTCondor pools.
Presented at PEARC22.
Paper DOI: https://doi.org/10.1145/3491418.3535123
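The demand-driven idea can be reduced to a small sizing function. This is a hypothetical sketch of the general principle, not the actual glideinWMS/HTCondor provisioning logic; `desired_workers` and its parameters are invented for illustration:

```python
import math

def desired_workers(idle_jobs, slots_per_worker, max_workers, min_workers=0):
    """Size the worker deployment to drain the idle queue, clamped to
    the configured pool limits."""
    if slots_per_worker <= 0:
        raise ValueError("slots_per_worker must be positive")
    needed = math.ceil(idle_jobs / slots_per_worker)
    return max(min_workers, min(max_workers, needed))

# 25 idle jobs with 8 slots per worker need 4 workers; a large backlog
# is capped at the pool maximum.
print(desired_workers(25, 8, max_workers=50))    # 4
print(desired_workers(1000, 8, max_workers=50))  # 50
```

An autoscaler would periodically query the batch system for idle jobs, call a function like this, and scale the Kubernetes deployment to the returned count.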
Performance Optimization of CGYRO for Multiscale Turbulence Simulations - Igor Sfiligoi
Overview of the recent performance optimization of CGYRO, an Eulerian GyroKinetic Fusion Plasma solver, with emphasis on Multiscale Turbulence Simulations.
Presented at the joint US-Japan Workshop on Exascale Computing Collaboration and the 6th workshop of the US-Japan Joint Institute for Fusion Theory (JIFT) program (Jan 18th 2022).
Comparing GPU effectiveness for UniFrac distance compute - Igor Sfiligoi
Poster presented at PEARC21.
The poster contains the complete scaling plots for both unweighted and weighted normalized Unifrac compute for sample sizes ranging from 1k to 307k on both GPUs and CPUs.
Managing Cloud networking costs for data-intensive applications by provisioni... - Igor Sfiligoi
Presented at PEARC21.
Many scientific high-throughput applications can benefit from the elastic nature of Cloud resources, especially when there is a need to reduce time to completion. Cost considerations are usually a major issue in such endeavors, with networking often a major component; for data-intensive applications, egress networking costs can exceed the compute costs. Dedicated network links provide a way to lower the networking costs, but they do add complexity. In this paper we provide a description of a 100 fp32 PFLOPS Cloud burst in support of IceCube production compute, that used Internet2 Cloud Connect service to provision several logically-dedicated network links from the three major Cloud providers, namely Amazon Web Services, Microsoft Azure and Google Cloud Platform, that in aggregate enabled approximately 100 Gbps egress capability to on-prem storage. It provides technical details about the provisioning process, the benefits and limitations of such a setup and an analysis of the costs incurred.
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access - Igor Sfiligoi
Presented at PEARC21.
Most experimental sciences now rely on computing, and biological sciences are no exception. As datasets get bigger, so do the computing costs, making proper optimization of the codes used by scientists increasingly important. Many of the codes developed in recent years are based on the Python-based NumPy, due to its ease of use and good performance characteristics. The composable nature of NumPy, however, does not generally play well with the multi-tier nature of modern CPUs, making any non-trivial multi-step algorithm limited by the external memory access speeds, which are hundreds of times slower than the CPU’s compute capabilities. In order to fully utilize the CPU compute capabilities, one must keep the working memory footprint small enough to fit in the CPU caches, which requires splitting the problem into smaller portions and fusing together as many steps as possible. In this paper, we present changes based on these principles to two important functions in the scikit-bio library, principal coordinates analysis and the Mantel test, that resulted in over 100x speed improvement in these widely used, general-purpose tools.
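The cache-blocking and step-fusion principle described above can be illustrated with a small NumPy sketch (not scikit-bio's actual code): both functions compute per-column squared norms of a mean-centered matrix, but the second fuses the subtract/square/sum steps and walks the matrix in column blocks so the working set stays cache-sized:

```python
import numpy as np

def centered_sq_norms_naive(x):
    # Two full passes over x: the intermediate (x - mean) array is as
    # large as the input, so for big arrays every step is memory-bound.
    centered = x - x.mean(axis=0)
    return (centered ** 2).sum(axis=0)

def centered_sq_norms_blocked(x, block=256):
    # Fuse subtract/square/sum and process the array in column blocks
    # small enough to stay resident in the CPU caches.
    mean = x.mean(axis=0)
    out = np.empty(x.shape[1])
    for j in range(0, x.shape[1], block):
        c = x[:, j:j + block] - mean[j:j + block]
        out[j:j + block] = np.einsum("ij,ij->j", c, c)  # fused square + sum
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 2000))
assert np.allclose(centered_sq_norms_naive(x), centered_sq_norms_blocked(x))
```

The two versions do the same arithmetic; the blocked one simply never materializes a full-size intermediate, which is where the speedups described in the abstract come from.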
Using A100 MIG to Scale Astronomy Scientific Output - Igor Sfiligoi
Presented at GTC21.
The raw computing power of GPUs has been steadily increasing, significantly outpacing the CPU gains. This poses a problem for many GPU-enabled scientific applications that use CPU code paths to feed data to the GPU code, resulting in lower GPU utilization, and thus reduced gains in scientific output. Applications that are high-throughput in nature, such as astronomy-focused IceCube and LIGO, can partially work around the problem by running several instances of the executable on the same GPU. This approach, however, is sub-optimal both in terms of application performance and workflow management complexity. The recently introduced Multi-Instance GPU (MIG) capability, available on the NVIDIA A100 GPU, provides a much cleaner and easier-to-use alternative by allowing the logical slicing of the powerful GPU and assigning different slices to different applications. And at least in the case of IceCube, it can provide over 3x more scientific output on the same hardware.
Using commercial Clouds to process IceCube jobs - Igor Sfiligoi
Presented at EDUCAUSE CCCG March 2021.
The IceCube Neutrino Observatory is the world’s premier facility to detect neutrinos. Built at the South Pole in natural ice, it requires extensive and expensive calibration to properly track the neutrinos. Most of the required compute power comes from on-prem resources through the Open Science Grid, but IceCube can easily harness Cloud compute at any scale, too, as demonstrated by a series of Cloud bursts. This talk provides both details of the performed Cloud bursts and some insight into the science itself.
Fusion simulations have traditionally required the use of leadership-scale HPC resources in order to produce advances in physics. One such package is CGYRO, a premier tool for multi-scale plasma turbulence simulation. CGYRO is a typical HPC application that will not fit into a single node, as it requires several terabytes of memory and O(100) TFLOPS compute capability for cutting-edge simulations. CGYRO also requires high-throughput and low-latency networking, due to its reliance on global FFT computations. While in the past such compute may have required hundreds or even thousands of nodes, recent advances in hardware capabilities allow for just tens of nodes to deliver the necessary compute power. We explored the feasibility of running CGYRO on Cloud resources provided by Microsoft on their Azure platform, using the InfiniBand-connected HPC resources in spot mode. We observed both that CPU-only resources were very efficient, and that running in spot mode was doable, with minimal side effects. The GPU-enabled resources were less cost-effective but allowed for higher scaling.
For IceCube, a large amount of photon propagation simulation is needed to properly calibrate the natural ice. The simulation is compute intensive and ideal for GPU compute. This Cloud run was more data intensive than previous ones, producing 130 TB of output data. To keep egress costs in check, we created dedicated network links via the Internet2 Cloud Connect Service.
Scheduling a Kubernetes Federation with Admiralty - Igor Sfiligoi
Presented at OSG All-Hands Meeting 2020 - USCMS-USATLAS Session.
This talk presented the PRP experience with using Admiralty as a Kubernetes federation solution, discussing why we need it, why Admiralty is the best (if not the only) solution for our needs, and how it works.
Accelerating microbiome research with OpenACC - Igor Sfiligoi
Presented at OpenACC Summit 2020.
UniFrac is a commonly used metric in microbiome research for comparing microbiome profiles to one another. Computing UniFrac on modest sample sizes used to take a workday on a server class CPU-only node, while modern datasets would require a large compute cluster to be feasible. After porting to GPUs using OpenACC, the compute of the same modest sample size now takes only a few minutes on a single NVIDIA V100 GPU, while modern datasets can be processed on a single GPU in hours. The OpenACC programming model made the porting of the code to GPUs extremely simple; the first prototype was completed in just over a day. Getting full performance did however take much longer, since proper memory access is fundamental for this application.
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie... - Igor Sfiligoi
Presented at PEARC20.
This talk presents expanding IceCube’s production HTCondor pool using cost-effective GPU instances in preemptible mode gathered from the three major Cloud providers, namely Amazon Web Services, Microsoft Azure and the Google Cloud Platform. Using this setup, we sustained for a whole workday about 15k GPUs, corresponding to around 170 PFLOP32s, integrating over one EFLOP32 hour worth of science output for a price tag of about $60k. We provide the reasoning behind Cloud instance selection, a description of the setup and an analysis of the provisioned resources, as well as a short description of the actual science output of the exercise.
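A quick back-of-the-envelope check of the quoted figures; the GPU count and aggregate throughput come from the abstract, while the 8-hour "workday" is an assumption:

```python
# Figures from the abstract (approximate).
gpus = 15_000
aggregate_pflop32 = 170   # sustained fp32 PFLOPS
hours = 8                 # assumed length of "a whole workday"

per_gpu_tflop32 = aggregate_pflop32 * 1000 / gpus   # PFLOPS -> TFLOPS per GPU
eflop32_hours = aggregate_pflop32 * hours / 1000    # PFLOPS*h -> EFLOPS*h

print(round(per_gpu_tflop32, 1))  # ~11.3 TFLOP32s per GPU on average
print(round(eflop32_hours, 2))    # ~1.36 EFLOP32 hours, i.e. "over one"
```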
Porting and optimizing UniFrac for GPUs - Igor Sfiligoi
Poster presented at PEARC20.
UniFrac is a commonly used metric in microbiome research for comparing microbiome profiles to one another (“beta diversity”). The recently implemented Striped UniFrac added the capability to split the problem into many independent subproblems and exhibits near linear scaling. In this poster we describe steps undertaken in porting and optimizing Striped UniFrac to GPUs. We reduced the run time of computing UniFrac on the published Earth Microbiome Project dataset from 13 hours on an Intel Xeon E5-2680 v4 CPU to 12 minutes on an NVIDIA Tesla V100 GPU, and to about one hour on a laptop with NVIDIA GTX 1050 (with minor loss in precision). Computing UniFrac on a larger dataset containing 113k samples reduced the run time from over one month on the CPU to less than 2 hours on the V100 and 9 hours on an NVIDIA RTX 2080TI GPU (with minor loss in precision). This was achieved by using OpenACC for generating the GPU offload code and by improving the memory access patterns. A BSD-licensed implementation is available, which produces a C shared library linkable by any programming language.
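The striping idea can be sketched in a few lines of Python. This is a simplified illustration (the published implementation organizes its stripes differently and computes UniFrac over a phylogenetic tree rather than the stand-in metric used here): each stripe is an independent subproblem, and the union of stripes covers every sample pair exactly once.

```python
import numpy as np

def dist(a, b):
    # Stand-in pairwise metric; the real code computes UniFrac.
    return float(np.abs(a - b).sum())

def stripe(samples, k):
    """Stripe k pairs sample i with sample i+k. Each stripe is an
    independent subproblem, so stripes can be farmed out to different
    GPUs or nodes and computed in parallel."""
    return [dist(samples[i], samples[i + k]) for i in range(len(samples) - k)]

def all_pairs_via_stripes(samples):
    """The union of stripes 1..n-1 covers every unordered pair once."""
    n = len(samples)
    out = {}
    for k in range(1, n):
        for i, d in enumerate(stripe(samples, k)):
            out[(i, i + k)] = d
    return out

samples = [np.arange(4) * s for s in range(5)]
pairs = all_pairs_via_stripes(samples)
assert len(pairs) == 5 * 4 // 2  # all 10 unordered pairs, each exactly once
```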
Demonstrating 100 Gbps in and out of the public Clouds - Igor Sfiligoi
Poster presented at PEARC20.
There is increased awareness and recognition that public Cloud providers do provide capabilities not found elsewhere, with elasticity being a major driver. The value of elastic scaling is however tightly coupled to the capabilities of the networks that connect all involved resources, both in the public Clouds and at the various research institutions. This poster presents results of measurements involving file transfers inside public Cloud providers, fetching data from on-prem resources into public Cloud instances and fetching data from public Cloud storage into on-prem nodes. The networking of the three major Cloud providers, namely Amazon Web Services, Microsoft Azure and the Google Cloud Platform, has been benchmarked. The on-prem nodes were managed by either the Pacific Research Platform or located at the University of Wisconsin – Madison. The observed sustained throughput was of the order of 100 Gbps in all the tests moving data in and out of the public Clouds and throughput reaching into the Tbps range for data movements inside the public Cloud providers themselves. All the tests used HTTP as the transfer protocol.
TransAtlantic Networking using Cloud links - Igor Sfiligoi
Scientific communities have only a limited amount of bandwidth available for transferring data between the US and the EU.
We know Cloud providers have plenty of bandwidth available, but at what cost?
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe - Paige Cruz
Monitoring and observability aren’t traditionally found in software curriculums, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company’s observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring and observability to the purview of ops, infra and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on:
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
DevOps and Testing slides at DASA Connect - Kari Kakkonen
Slides by me and Rik Marselis at the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps means. We also held a lovely workshop where the participants tried to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
The new frontiers of AI in RPA with UiPath Autopilot™ - UiPathCommunity
In this free online event, organized by the Italian UiPath Community, you can explore the new features of Autopilot, the tool that integrates Artificial Intelligence into the development and use of Automations.
📕 Together we will look at some examples of using Autopilot in different tools of the UiPath Suite:
Autopilot for Studio Web
Autopilot for Studio
Autopilot for Apps
Clipboard AI
GenAI applied to Document Understanding
👨🏫👨💻 Speakers:
Stefano Negro, UiPath MVPx3, RPA Tech Lead @ BSP Consultant
Flavio Martinelli, UiPath MVP 2023, Technical Account Manager @UiPath
Andrei Tasca, RPA Solutions Team Lead @NTT Data
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori - Peter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
Key Trends Shaping the Future of Infrastructure - Cheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview - Prayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio’s cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
PHP Frameworks: I want to break free (IPC Berlin 2024) - Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk encourages a more independent use of PHP frameworks, moving towards more flexible and future-proof PHP development.
GraphRAG is All You Need? LLM & Knowledge Graph - Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Securing your Kubernetes cluster: a step-by-step guide to success! - KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Assure Contact Center Experiences for Your Customers With ThousandEyes
glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012
1. glideinWMS Training @ UCSD
glideinWMS Frontend Monitoring
by Igor Sfiligoi (UCSD)
UCSD Jan 18th 2012 Frontend Monitoring 1
2. Overview
● Refresher
● What is available
● What to look for
3. Refresher – glideinWMS
● A glidein is just a properly configured Condor
execution node submitted as a Grid job
● The Frontend drives the submission
[Diagram: the Frontend node monitors the Submit node(s) and sends glidein requests to the Factory node; the Factory submits glideins via Globus or CREAM to Grid execution nodes, where each glidein's startd joins the Condor central manager and is matched to user jobs.]
4. Reminder
Condor is king!
(glideinWMS is just a small layer on top)
5. Refresher – Frontend arch
● Many Groups
● With a “Master” Frontend as an aggregator
[Diagram: the master Frontend spawns one process per Group; each Group talks to Factory Entries, the central manager, and the Submit nodes, and publishes monitoring through a local Web server.]
6. Available monitoring
● Condor monitoring
● It is just a Condor pool (even if a dynamic one)!
● Any Condor monitoring tool will work
● VO Frontend monitoring
● The VO Frontend provides some basic Condor monitoring
● Plus the monitoring of its own internal workings
● Glidein Factory monitoring
● You should not need to use it, but it is publicly accessible
8. Condor Monitoring
● Out of the box you get
● Command line tools
● Log parsing
● Several external tools available, e.g.
● CondorView (Condor external package)
● CycleServer (commercial tool, (semi-)free for Academia)
● Your portal may provide additional monitoring, too
9. Glidein monitoring
● The glideins will register with the Collector
● Condor command to monitor them:
condor_status
● -constraint - To select a subset of them
(same syntax as job Requirements)
● -total - For a quick summary
● Output formatting options
● No arguments - In use/unused
● -long - Full ClassAds
● -format - Select attributes only
● -xml - XML formatting (easier to machine parse)
http://www.cs.wisc.edu/condor/manual/v7.6/condor_status.html
10. Example
$ condor_status
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
glidein_17848@alic LINUX X86_64 Claimed Busy 7.440 18037 0+01:06:06
glidein_15842@alic LINUX X86_64 Claimed Busy 7.010 18037 0+00:35:21
glidein_18249@alic LINUX X86_64 Claimed Busy 7.510 18037 0+01:24:09
glidein_17825@wn89 LINUX X86_64 Unclaimed Idle 11.990 16056 0+00:15:12
glidein_10082@wn91 LINUX X86_64 Claimed Idle 7.000 16056 0+00:02:46
…
glidein_3964@wp-05 LINUX X86_64 Claimed Busy 24.000 64464 0+16:00:29
glidein_5614@wp-05 LINUX X86_64 Claimed Busy 23.360 64464 0+16:12:56
glidein_5861@wp-05 LINUX X86_64 Claimed Retiring 22.140 64464 0+00:23:18
Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 23249 0 22697 552 0 0 0
Total 23249 0 22697 552 0 0 0
12. Collector log(s)
● The place to look when things seem fishy!
● The Collector(s) will log any errors
● The interesting errors will likely be in the leaves of
the Collector tree
~condor/glidecondor/condor_local/log/CondorXXXLog
(yes, you will have 100s of them!)
● Logs rotate, so be sure to look in .old as well
● You also get the glidein authentication logs
● And log verbosity can be further increased with
COLLECTOR_DEBUG
http://www.cs.wisc.edu/condor/manual/v7.6/3_3Configuration.html#param:SubsysDebug
13. Example
01/13/12 17:24:13 ZKM: 1: attempting to map '/DC=org/DC=doegrids/OU=Services/CN=
uscmspilot47/glidein-1.t2.ucsd.edu'
01/13/12 17:24:13 ZKM: 2: mapret: 0 included_voms: 0 canonical_user: glidein47
01/13/12 17:24:13 ZKM: successful mapping to glidein47
...
01/13/12 17:24:19 condor_read() failed: recv() returned -1, errno = 104 Connection reset
by peer, reading 44 bytes from <130.104.133.245:7812>.
01/13/12 17:24:19 DaemonCore: Can't receive command request from 130.104.133.245
(perhaps a timeout?)
...
01/13/12 17:24:41 CCB: rejecting request from SHADOW <169.228.130.26:9615?sock=1
0716_242d_201658> on <169.228.130.26:21142> for ccbid 14018 because no daemon is
currently registered with that id (perhaps it recently disconnected).
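With hundreds of rotating Collector logs to check, a small sweep script can save time when things seem fishy. This is a minimal sketch, not part of glideinWMS; the directory and `Collector*Log*` filename pattern are assumptions based on the path quoted above, so adjust them for your installation:

```python
import glob
import os

def scan_collector_logs(log_dir, keywords=("failed", "ERROR", "DENIED")):
    """Return (filename, line) pairs matching any keyword, across all
    Collector logs in log_dir, including the rotated .old copies.
    NOTE: the 'Collector*Log*' pattern is an assumption; adapt to your setup."""
    hits = []
    for path in sorted(glob.glob(os.path.join(log_dir, "Collector*Log*"))):
        with open(path) as f:
            for line in f:
                if any(k in line for k in keywords):
                    hits.append((os.path.basename(path), line.rstrip()))
    return hits
```

Pointed at ~condor/glidecondor/condor_local/log, this gives a quick first pass before reading individual logs by hand.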
14. Job monitoring
● You can monitor local jobs
● For jobs still in the queue (still waiting or running)
condor_q
● For finished jobs (only a limited number of jobs is preserved)
condor_history
● Similar cmdline args as condor_status
● Remote condor_q possible with -name
http://www.cs.wisc.edu/condor/manual/v7.6/condor_q.html
http://www.cs.wisc.edu/condor/manual/v7.6/condor_history.html
15. Example
$ condor_q
-- Submitter: my.node : <192.168.130.11:9615?sock=9763_cd4c_2> : my.node
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
367788.0 uscms2330 1/6 11:03 0+00:00:00 I 0 0.0 CMSSW.sh 1 1
367788.1 uscms2330 1/6 11:03 0+00:00:00 I 0 0.0 CMSSW.sh 2 1
383995.19 uscms1789 1/11 02:26 2+13:35:38 R 0 1953.1 CMSSW.sh 1118 4
383995.179 uscms1789 1/11 02:26 2+11:29:06 R 0 1464.8 CMSSW.sh 1310 4
383999.32 uscms1789 1/11 02:31 2+09:36:12 R 0 1953.1 CMSSW.sh 299 4
383999.46 uscms1789 1/11 02:31 2+11:00:25 R 0 1953.1 CMSSW.sh 316 4
…
385002.7 uscms3015 1/13 17:31 0+00:01:51 R 0 0.0 CMSSW.sh 70 2
385002.8 uscms3015 1/13 17:31 0+00:01:49 R 0 0.0 CMSSW.sh 89 2
385002.9 uscms3015 1/13 17:31 0+00:01:29 R 0 0.0 CMSSW.sh 91 2
385002.10 uscms3015 1/13 17:31 0+00:01:00 R 0 0.0 CMSSW.sh 97 2
58707 jobs; 39484 idle, 11694 running, 7529 held
16. Job logs
● Users are encouraged to have a log for their jobs
● Provides an easy way to monitor the progress without
calling condor_q/condor_history
000 (001.000.000) 12/15 12:28:05 Job submitted from host: <127.0.0.1:43569>
...
001 (001.000.000) 12/16 08:30:02 Job executing on host: <169.228.130.213:36422>
...
005 (001.000.000) 12/16 13:30:32 Job terminated.
(1) Normal termination (return value 0)
Usr 0 01:00:00, Sys 0 00:05:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 01:00:00, Sys 0 00:05:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
217 - Run Bytes Sent By Job
76 - Run Bytes Received By Job
217 - Total Bytes Sent By Job
76 - Total Bytes Received By Job
...
(this is literally what the log contains)
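The job-log event records above have a regular shape (a three-digit event code, the job id, a timestamp, and a message), so they are easy to consume from a script. A minimal parsing sketch, not an official Condor API:

```python
import re

# Matches event header lines like:
#   "005 (001.000.000) 12/16 13:30:32 Job terminated."
EVENT_RE = re.compile(
    r"^(?P<code>\d{3}) \((?P<cluster>\d+)\.(?P<proc>\d+)\.\d+\) "
    r"(?P<stamp>\d+/\d+ \d+:\d+:\d+) (?P<msg>.*)$"
)

def parse_events(log_text):
    """Extract (event code, job id, timestamp, message) tuples
    from a user job log; continuation lines are skipped."""
    events = []
    for line in log_text.splitlines():
        m = EVENT_RE.match(line)
        if m:
            job_id = "%d.%d" % (int(m["cluster"]), int(m["proc"]))
            events.append((m["code"], job_id, m["stamp"], m["msg"]))
    return events
```

A monitoring cron job could call this periodically and alert on event codes of interest (e.g. terminations) without ever touching condor_q.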
17. Condor Daemon logs
● By default
● Schedd writes a log
/opt/glidecondor/condor_local/log/ScheddLog
● Shadows share a common log
/opt/glidecondor/condor_local/log/ShadowLog
● The logs rotate, look for .old files as well
● Lots of interesting info in them
● Quite high verbosity by default
18. ScheddLog Example
01/12/12 20:38:37 (pid:32035) ZKM: 2: mapret: 0 included_voms: 0 canonical_user: cfeng
01/12/12 20:38:37 (pid:32035) ZKM: successful mapping to cfeng
01/12/12 20:38:37 (pid:28485) GET_JOB_CONNECT_INFO failed: No such job: 157170.4
01/12/12 20:39:27 (pid:32035) ZKM: 1: attempting to map '/DC=org/DC=doegrids/OU=Services/
CN=rokpilot01/osg.ctbp.ucsd.edu'
01/12/12 20:39:27 (pid:32035) ZKM: 2: mapret: 0 included_voms: 0 canonical_user: pilot
01/12/12 20:39:27 (pid:32035) ZKM: successful mapping to pilot
01/12/12 20:39:32 (pid:32035) Shadow pid 11058 for job 157170.82 exited with status 100
...
01/13/12 18:05:02 (pid:32035) Activity on stashed negotiator socket: <169.228.40.37:60824>
01/13/12 18:05:02 (pid:32035) Using negotiation protocol: NEGOTIATE
01/13/12 18:05:02 (pid:32035) Negotiating for owner: cfeng@osg.ctbp.ucsd.edu
01/13/12 18:05:02 (pid:32035) Finished negotiating for cfeng in local pool: 4 matched, 1 rejected
01/13/12 18:05:04 (pid:32035) Completed REQUEST_CLAIM to startd glidein_14642@
cabinet-0-0-30.t2.ucsd.edu <169.228.131.224:55393?CCBID=169.228.40.37:9673#57&noUDP> for cfeng
01/13/12 18:05:04 (pid:32035) Completed REQUEST_CLAIM to startd glidein_16403@
cabinet-1-1-0.t2.ucsd.edu <169.228.131.217:34929?CCBID=169.228.40.37:9623#108&noUDP> for cfeng
01/13/12 18:05:04 (pid:32035) Completed REQUEST_CLAIM to startd glidein_13015@
cabinet-2-2-24.t2.ucsd.edu <169.228.131.156:33784?CCBID=169.228.40.37:9708#95&noUDP> for cfeng
01/13/12 18:05:04 (pid:32035) Starting add_shadow_birthdate(157177.138)
01/13/12 18:05:04 (pid:32035) Started shadow for job 157177.138 on glidein_14642@
cabinet-0-0-30.t2.ucsd.edu <169.228.131.224:55393?CCBID=169.228.40.37:9673#57&noUDP> for cfeng,
(shadow pid = 5238)
19. ShadowLog Example
01/12/12 21:52:36 DaemonCore: command socket at <169.228.40.37:47586?noUDP>
01/12/12 21:52:36 DaemonCore: private command socket at <169.228.40.37:47586>
01/12/12 21:52:36 Setting maximum accepts per cycle 4.
01/12/12 21:52:36 Initializing a VANILLA shadow for job 157171.108
01/12/12 21:52:36 (157171.97) (32318): Request to run on
glidein_23261@cabinet-2-2-26.t2.ucsd.edu <169.228.131.154:48495?
CCBID=169.228.40.37:9644#87&noUDP> was ACCEPTED
01/12/12 21:52:36 (157170.39) (10937): Job 157170.39 terminated:
exited with status 0
01/12/12 21:52:36 (157170.39) (10937): **** condor_shadow (condor_SHADOW)
pid 10937 EXITING WITH STATUS 100
…
01/13/12 18:01:08 (157177.52) (4768): DoUpload: (Condor error code 12, subcode 28)
SHADOW at 169.228.40.37 failed to send file(s) to <169.228.131.178:47976>;
STARTER at 169.228.131.178 failed to write to file /data6/condor_local/execute/
dir_13776/glide_J13834/execute/dir_19620/griddatasourceB.tgz:
(errno 28) No space left on device
01/13/12 18:01:15 (157177.52) (4768): **** condor_shadow (condor_SHADOW)
pid 4768 EXITING WITH STATUS 112
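Shadow exit statuses like the 100 and 112 above are a quick health signal for the submit node, so tallying them across a (large) ShadowLog is a natural first-pass script. A sketch; the exact meaning of each status value is version-dependent, so check your Condor documentation before alerting on specific codes:

```python
import re
from collections import Counter

# Matches the "EXITING WITH STATUS <n>" lines condor_shadow writes on exit.
EXIT_RE = re.compile(r"EXITING WITH STATUS (\d+)")

def shadow_exit_histogram(log_text):
    """Tally condor_shadow exit statuses found in a ShadowLog excerpt."""
    return Counter(int(m.group(1)) for m in EXIT_RE.finditer(log_text))
```

A sudden rise of an unusual status (here, 112 accompanied the "No space left on device" failure) stands out immediately in such a histogram.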
20. Submitter ClassAds
● The schedd will advertise two types of
ClassAds to the Collector
● Schedd daemon ClassAds
condor_status -schedd
● Per-user ClassAds
condor_status -submitter
● Can be useful for getting a summary view
of the system
23. Negotiator Monitoring
● To check user priorities, use
condor_userprio
● -allusers - Without it, only currently active users are shown
● -all - Provides detailed info
● The Negotiator log is useful for troubleshooting
~/glidecondor/condor_local/log/NegotiatorLog
● Look for errors and to monitor cycle times
● The Negotiator also advertises a ClassAd
● Use condor_status -negotiator -long
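Cycle times can be extracted from the NegotiatorLog mechanically. A minimal sketch; it assumes the log delimits cycles with "Started Negotiation Cycle" / "Finished Negotiation Cycle" banner lines, so verify the exact wording in your Condor version's log before relying on it:

```python
import re
from datetime import datetime

# Assumed NegotiatorLog cycle banners; check your Condor version's wording.
CYCLE_RE = re.compile(
    r"^(\d+/\d+/\d+ \d+:\d+:\d+) -+ (Started|Finished) Negotiation Cycle"
)

def cycle_durations(log_text, time_format="%m/%d/%y %H:%M:%S"):
    """Return the duration in seconds of each completed negotiation cycle."""
    durations, start = [], None
    for line in log_text.splitlines():
        m = CYCLE_RE.match(line)
        if not m:
            continue
        stamp = datetime.strptime(m.group(1), time_format)
        if m.group(2) == "Started":
            start = stamp
        elif start is not None:
            durations.append((stamp - start).total_seconds())
            start = None
    return durations
```

Plotting or alerting on these durations makes it obvious when cycles start creeping toward the danger zone discussed later (well above a few minutes).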
29. Frontend monitoring
[Diagram: the master Frontend spawns one process per Group; the Groups talk to the Factory Entries.]
● Helper cmdline tool
● Plus, each Group provides:
● Activity/Error logs
● RRD files with statistics (running, held, etc.)
● XML files with the current snapshot
● Resource ClassAds
● The master Frontend aggregates the RRD and XML
files, and writes them in its own area
● Human readable/viewable Web pages available
30. Helper cmdline tool
● Wrapper around condor_status
glideinWMS/tools/glidein_status.py
● Provides useful formatting
~/glideinWMS/tools$ ./glidein_status.py
Name Site Factory Entry State Activity
glidein_6682@alicegrid26.ba.infn.it Bari v1_0@OSGGOC CMS_T2_IT_Bari_ce01 Claimed Busy
glidein_10678@alicegrid32.ba.infn.it Bari v1_0@OSGGOC CMS_T2_IT_Bari_ce01 Claimed Busy
…
glidein_5861@wp-05-12.pn.pd.infn.it Legnaro v1_0@OSGGOC CMS_T2_IT_Legnaro_. Claimed Retiring
Total Owner Claimed/Busy Claimed/Retiring Claimed/Other Unclaimed
CMS_T2_US_Nebraska_Red_gw2@v1_0@OSGGOC 11 0 11 0 0 0
CMS_T2_US_Purdue_hansen@v1_0@OSGGOC 522 0 517 0 0 5
CMS_T2_US_Purdue_osg@v1_0@OSGGOC 1201 0 1182 14 0 5
…
CMS_T2_US_UCSD_gw4@Production_v4_2@UCSD 135 0 132 0 0 3
Total 21474 0 19742 1264 0 468
31. Log files
● Each Frontend group provides 3 types of logs
log/group_XXX/frontend.date.type.log
● info - Progress and warnings
● err - One line warnings
● debug - Multi line error messages
● The master frontend has similar logs
log/frontend/frontend.date.type.log
● But rarely anything interesting there
32. Example Info Log
:01-07:00 15037] Iteration at Tue Nov 15 10:44:01 2011
:01-07:00 15037] Query condor
:01-07:00 15037] Child processes created
:05-07:00 31633] WARNING: Failed to talk to schedd submit-1.t2.ucsd.edu. See debug log for more details.
:05-07:00 15037] All children terminated
:05-07:00 15037] Jobs found total 4836 idle 1732 (old 1732, voms 1703) running 3104
:05-07:00 15037] Glideins found total 639 idle 8 running 630 limit 800 curb 600
:05-07:00 15037] Using 1 proxies
:05-07:00 15037] Match
:05-07:00 15037] Counting
:05-07:00 15037] Child processes created
:06-07:00 15037] All children terminated
:06-07:00 15037] Total matching idle 1732 (old 1703) running 3104
:06-07:00 15037] Jobs in schedd queues | Glideins | Request
:06-07:00 15037] Idle (match eff old uniq ) Run ( here max ) | Total Idle Run | Idle MaxRun Down Factory
:06-07:00 15037] 171( 1705 170 169 0) 3104( 102 250) | 105 1 103 | 10 3276 Up CMS_T2_US_Nebraska_Red@Produ
:06-07:00 15037] 171( 1705 167 169 0) 3104( 187 250) | 197 4 193 | 10 3276 Up CMS_T2_US_Nebraska_Red_gw1@P
:06-07:00 15037] 171( 1705 171 169 0) 3104( 0 250) | 0 0 0 | 10 3276 Down CMS_T2_US_Nebraska_Red_gw2@P
:06-07:00 15037] 171( 1705 171 169 0) 3104( 62 250) | 62 0 62 | 10 3276 Up CMS_T2_US_Wisconsin_cms01@Pr
:06-07:00 15037] 171( 1705 171 169 0) 3104( 71 250) | 71 0 71 | 10 3276 Up CMS_T2_US_Wisconsin_cms02@Pr
:06-07:00 15037] 171( 1705 169 169 0) 3104( 88 250) | 96 2 94 | 10 3276 Up CMS_T2_US_Nebraska_Red@v1_0@
:06-07:00 15037] 171( 1705 171 169 0) 3104( 1 250) | 1 0 1 | 10 3276 Up CMS_T2_US_Nebraska_Red_gw1@v
:06-07:00 15037] 171( 1705 171 169 0) 3104( 0 250) | 0 0 0 | 10 3276 Down CMS_T2_US_Nebraska_Red_gw2@v
:06-07:00 15037] 171( 1705 171 169 0) 3104( 45 250) | 45 0 45 | 10 3276 Up CMS_T2_US_Wisconsin_cms01@v1
:06-07:00 15037] 171( 1705 170 169 0) 3104( 60 250) | 62 1 61 | 10 3276 Up CMS_T2_US_Wisconsin_cms02@v1
:06-07:00 15037] Jobs in schedd queues | Glideins | Request
:06-07:00 15037] Idle (match eff old uniq ) Run ( here max ) | Total Idle Run | Idle MaxRun Down Factory
:06-07:00 15037] 1368(13640 1360 1352 0) 24832( 616 2000) | 639 8 630 | 80 26208 Up Sum of useful factories
:06-07:00 15037] 342( 3410 342 338 0) 6208( 0 500) | 0 0 0 | 20 6552 Down Sum of down factories
:06-07:00 15037] 27( 27 27 14 27) 0( 0 0) | 0 0 0 | 0 0 Down Unmatched
:06-07:00 15037] Advertizing 10 requests
:07-07:00 15037] Done advertizing
:07-07:00 15037] Advertising 10 glideresource classads to the user pool
:07-07:00 15037] Done advertising glideresource classads
:07-07:00 15037] Writing stats
:07-07:00 15037] Sleep
33. Example log files
frontend.20120113.err.log
[2012-01-13T18:50:47-07:00 16444] Advertizing failed for 2 requests. See debug log for more details.
[2012-01-13T18:52:59-07:00 19312] Failed to talk to schedd. See debug log for more details.
frontend.20120113.debug.log
[2012-01-13T18:50:47-07:00 16444] Advertizing failed: Error running '/home/frontend/glidecondor/sbin/condor_advertise
-pool glidein-1.t2.ucsd.edu -tcp -multiple UPDATE_MASTER_AD /tmp/gfi_aw_276509240_16444_2'
code 1:failed to send classad to <169.228.130.10:9618>
failed to send classad to <169.228.130.10:9618>
failed to send classad to <169.228.130.10:9618>
[2012-01-13T18:52:59-07:00 19312] Failed to talk to schedd: Schedd 'vocms120.cern.ch' not found
34. Web pages 1/3
frontendStatus.html
● Historical overview
● Fully dynamic: allows zooming and selecting which
elements to plot
● Default shows everything, but can be restricted to a group
and/or a Factory
36. Web pages 3/3
frontendGroupGraphStatusNow.html
● Also contains pie charts with the same info
37. RRDs and XML files
● The Web pages are just renderings of the RRD
and XML files
● Raw data loaded in the browser and rendered there
● No server-side code
● Other tools could use those data
● Publicly available, if one knows the URL
● No user-identifying data, only summary stats
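Since the snapshots are plain XML files with no server-side code behind them, any tool can consume them with a standard parser. A sketch using Python's ElementTree; the element and attribute names below are illustrative placeholders only, so inspect your Frontend's actual monitor files for the real schema:

```python
import xml.etree.ElementTree as ET

def sum_factory_counter(xml_text, counter="Running"):
    """Sum one per-factory counter attribute across all <factory> elements.
    NOTE: <factory> and the attribute names are hypothetical placeholders,
    not the real glideinWMS schema."""
    root = ET.fromstring(xml_text)
    return sum(int(f.get(counter, "0")) for f in root.iter("factory"))

# Toy snapshot in the hypothetical schema, for illustration only.
snapshot = """<frontend>
  <factory name="CMS_T2_US_UCSD_gw4" Running="132" Unclaimed="3"/>
  <factory name="CMS_T2_US_Purdue_osg" Running="1182" Unclaimed="5"/>
</frontend>"""
```

The same approach works for any external dashboard: fetch the XML over HTTP, aggregate the counters you care about, and feed them to your own alerting.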
38. Resource ClassAds
● Each Frontend Group advertises one ClassAd
for each Factory entry it is requesting glideins from
● Type glideresource
● They contain pretty much everything the
Frontend Group knows about the Factory:
● Factory attributes used for matchmaking
● Stats about the matching jobs
● What is being requested
● Even what the Factory is doing!
39. Example query
● Not a Condor native type, must use -any
● Then constrain the type
$ condor_status -any -const 'MyType=="glideresource"' -format '%s\n' Name
CMS_T2_US_Caltech_cit2@v1_0@OSGGOC@UCSD-v5_3.main
CMS_T2_US_Caltech_cit@v1_0@OSGGOC@UCSD-v5_3.main
CMS_T2_US_Florida_iogw1@v1_0@OSGGOC@UCSD-v5_3.main
CMS_T2_US_Florida_pg@v1_0@OSGGOC@UCSD-v5_3.main
...
CMS_T2_US_UCSD_gw2@v1_0@OSGGOC@UCSD-v5_3.main
CMS_T2_US_UCSD_gw4@v1_0@OSGGOC@UCSD-v5_3.main
CMS_T2_US_Wisconsin_cms01@v1_0@OSGGOC@UCSD-v5_3.main
CMS_T2_US_Wisconsin_cms02@v1_0@OSGGOC@UCSD-v5_3.main
(remotely queryable)
40. Example ClassAd
$ condor_status -any CMS_T2_US_UCSD_gw4@v1_0@OSGGOC@UCSD-v5_3.main -l
MyType = "glideresource"                                   (identification)
Name = "CMS_T2_US_UCSD_gw4@v1_0@OSGGOC@UCSD-v5_3.main"
GlideClientName = "UCSD-v5_3.main"
...
GlideClientMonitorJobsIdle = 210.000000                    (info about local jobs)
GlideClientMonitorJobsRunningHere = 213
...
GlideClientMonitorGlideinsRequestIdle = 50                 (what is being requested)
GlideClientMonitorGlideinsRequestMaxRun = 445
...
GLIDEIN_Site = "UCSD"                                      (Factory attributes)
GLEXEC_BIN = "OSG"
...
GlideClientMonitorGlideinsRunning = 215                    (info about registered glideins)
GlideClientMonitorGlideinsTotal = 216
...
GlideFactoryMonitorStatusRunning = 339                     (Factory status)
GlideFactoryMonitorStatusPending = 277
GlideFactoryMonitorStatusHeld = 0
...
Currently more information than you get on the Web
41. OK, now you know
what's available.
What will you do
with all that information?
(i.e. What to look for)
42. Monitoring the health of the system
● Six major areas to look after; your goal is
● Few unclaimed glideins
(both globally, and per site)
● No unmatched jobs
● Reasonably low restart rate
(both global, and per site)
● Reasonably low job failure rate
(both global, and per site)
● Negotiation cycle reasonably short
● Schedd node not overloaded
43. Unclaimed glideins
● Frontend and Negotiator policies are
not identical
● You may end up with glideins that
never run any jobs
● The discrepancy can be big enough to be
noticed on a global scale
● But more often it is just for one (or few) sites
● Short spikes are not a problem
● But long periods are
44. How do you notice it?
● Historical Web monitoring
(the plots make good vs. bad periods easy to spot)
● Ask for daily emails from the Factory
● Or write your own scripts that parse the RRDs
(no Frontend report generators in glideinWMS at this time)
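Until report generators exist, even a tiny cron script watching the live numbers helps. As a sketch, the summary line that condor_status prints (shown in the earlier example) already contains the unclaimed count, and the unclaimed fraction is the quantity to alert on; the column positions below follow that example output:

```python
def unclaimed_fraction(total_line):
    """Parse the condor_status summary line
    'Total <Total> <Owner> <Claimed> <Unclaimed> ...'
    and return the unclaimed fraction of the pool."""
    parts = total_line.split()
    total, unclaimed = int(parts[1]), int(parts[4])
    return unclaimed / total if total else 0.0
```

Running the same parse per site (via condor_status -constraint on a site attribute) catches the more common case where only one or a few sites accumulate idle glideins; any alert threshold is your own choice.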
45. How do you find the root cause?
● Analyze the latest snapshots
● condor_status/glidein_status
● condor_q
● Frontend Web
● Limit the search to a few sites, if possible
● Then start comparing
● Job Requirements, with
● Glidein Start expressions
● Can be daunting!
(in theory, there is “condor_q -analyze”, but it is usually worthless)
46. Unmatched jobs
● The other side of the problem
● Glideins were never requested for some jobs,
so those jobs will never start!
● Two possible reasons
● Wrong Frontend matchmaking policy
● No available Factory entries to serve the job
47. How do you notice it?
● “Unmatched Factory” in Web monitoring
48. How do you find the root cause?
● Again, start with the latest snapshot
● condor_q
● condor_status -any -const 'MyType=="glideresource"'
● Get the (python) Match expression from the XML config
● Start comparing!
(can be daunting!)
49. Restarted jobs
● Any restart == wasted CPU
● How do you notice it?
● condor_q is your friend here
condor_q -format '%i\n' NumJobStarts
(no historical/Web monitoring provided)
● Why does it happen?
● Glidein disappears!
● End of lifetime hit
● Preemption policies (not in the default config,
but you may set Condor to do it)
● Submit node overload
(Condor daemons do not like being resource constrained!)
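The condor_q one-liner above prints one NumJobStarts value per job; since no historical monitoring is provided, a small wrapper that turns its output into a restart summary is the natural cron companion. A sketch, assuming you pipe the command's output in:

```python
from collections import Counter

def restart_summary(numjobstarts_output):
    """Summarize the output of:  condor_q -format '%i\\n' NumJobStarts
    Any job with NumJobStarts > 1 has been restarted at least once."""
    starts = [int(s) for s in numjobstarts_output.split()]
    restarted = sum(1 for n in starts if n > 1)
    return {"jobs": len(starts),
            "restarted": restarted,
            "histogram": Counter(starts)}
```

A per-site breakdown follows the same pattern by adding a site attribute to the -format string, matching the slide's advice to check the restart rate both globally and per site.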
50. Why do glideins disappear?
● Three main reasons
● Remote node just died (rare)
● Site preemption policy (some sites do this; nothing you
can do, so learn who they are and act accordingly)
● Glidein killed by the Site because it exceeded slot limits
– Most likely Memory: GLIDEIN_MaxMemMBs,
one of the 2 limits the OSG factory advertises
● Why can limits be exceeded?
● Job underestimated its resource use
● Frontend matchmaking logic problem
(the job told you it needed more resources than the limit!)
● Wrong advertised limits (a Factory problem!)
51. Wallclock limits
● The main resource limit is time
● The glidein automatically deals with it
– Will go away before the deadline
– … killing/preempting any jobs if needed!
● Limit advertised as
– Factory: GLIDEIN_Max_Walltime (in seconds, minus a safety margin)
– Glidein: GLIDEIN_ToDie (UNIX time)
● Why may jobs reach the deadline?
● Like with all other resources
– Job underestimates the time it needs
– Frontend matchmaking logic problems
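Since GLIDEIN_ToDie is a UNIX timestamp, checking whether a job fits in a glidein's remaining lifetime is simple arithmetic. A sketch of the kind of test a Frontend match expression can make; the helper name and the safety margin value are illustrative, not glideinWMS code:

```python
import time

def job_fits_glidein(job_walltime_s, glidein_to_die, now=None, margin_s=600):
    """True if a job needing job_walltime_s seconds of wallclock can
    finish before the glidein's GLIDEIN_ToDie deadline (UNIX time),
    keeping an arbitrary safety margin."""
    now = time.time() if now is None else now
    return now + job_walltime_s + margin_s <= glidein_to_die
```

Jobs failing this test are exactly the ones that would be killed or preempted when the glidein reaches its deadline, so they should be matched elsewhere.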
52. Job failures
● Jobs can fail for many reasons
● You should monitor the ExitCode
condor_history -back -const 'JobStatus==5' -format '%i\n' ExitCode
● Knowing what the users run is often needed to
interpret the errors
● For common worker-node errors, the Frontend admin
should create an appropriate validation script
● So that glideins fail, not user jobs
53. Negotiation time
● The negotiation time should be << 5 mins
● If much longer,
glideins may terminate without running any jobs
● Monitor the NegotiatorLog on the Central Manager
● Possible causes
● CPU starvation (e.g. other processes)
● Autocluster explosion
– Condor tries to be smart about matchmaking
– But if users don't cooperate, it cannot do much
54. Autoclustering
● The Condor Schedd will try to group jobs
● All “similar” jobs will be matched together!
(much faster, if only few groups exist)
● What does “similar” mean?
● Similar == would result in the same match
● How is it implemented?
● As a tuple of the attributes considered during matchmaking
● E.g. (DESIRED_Sites, ImageSize)
● How can the number of autoclusters explode?
● If an attribute that changes a lot is added
(example of a really bad one: JobID)
https://condor-wiki.cs.wisc.edu/index.cgi/attach_get/220/cs739.pdf
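The autocluster count is just the number of distinct value-tuples the jobs take over the matchmaking attributes, which makes the explosion easy to demonstrate. A toy sketch (the attribute names come from the slide's example; the JobID case shows why a per-job attribute is so bad):

```python
def count_autoclusters(jobs, attrs):
    """Number of distinct value-tuples the jobs take over the given
    attributes; each distinct tuple is one autocluster."""
    return len({tuple(job.get(a) for a in attrs) for job in jobs})

# 500 identical jobs, each with its own JobID.
jobs = [{"DESIRED_Sites": "UCSD", "ImageSize": 1000, "JobID": i}
        for i in range(500)]

# Grouping on the real matchmaking attributes yields a single autocluster...
few = count_autoclusters(jobs, ["DESIRED_Sites", "ImageSize"])
# ...but adding a per-job attribute creates one autocluster per job.
many = count_autoclusters(jobs, ["DESIRED_Sites", "ImageSize", "JobID"])
```

With one autocluster, the Negotiator does one match for all 500 jobs; with 500, it does 500, which is exactly the cycle-time blowup the previous slide warns about.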
55. Submit node health
● Condor is very sensitive to resource starvation
● If the submit node is overloaded, expect problems!
● How can we get to resource starvation?
● Poor planning
(trying to run 3k jobs on a 1 GB RAM node?)
● Other processes may steal CPU/RAM/IO from Condor
● Interactive activity is particularly risky
● Due to its unpredictable nature
– Including user errors
● But portals are not immune to resource overuse either
57. Summary
● You have plenty of monitoring options
● Some prettier, some more powerful
● Most of the time, things just work
● So you don't need to constantly watch over your
installation
● But occasionally things will break
● It is in your interest to notice it first
(or the users will tell you!)
● Having good monitoring tools will help you there!
59. Pointers
● The official glideinWMS project Web page is
http://tinyurl.com/glideinWMS
● glideinWMS development team is reachable at
glideinwms-support@fnal.gov
● The OSG glidein factory is reachable at
osg-gfactory-support@physics.ucsd.edu
60. Acknowledgments
● glideinWMS is a CMS-led project
developed mostly at FNAL, with contributions
from UCSD and ISI
● The glideinWMS factory operations at UCSD are
sponsored by OSG
● The funding comes from NSF, DOE and the
UC system