SlideShare a Scribd company logo
1 of 12
Download to read offline
glideinWMS training




       Glidein Factory Operations
                        i.e. How we spend our time?

                           by Igor Sfiligoi (UCSD)




glideinWMS training             G.Factroy Operations   1
G. Factory Operation Categories
 ●   Factory node operations

 ●   Serving VO Frontend Admin requests

 ●   Keeping up with changes in the Grid

 ●   Debugging Grid problems



glideinWMS training   G.Factroy Operations   2
G. Factory Operation Ongoing Costs
 ●   Factory node operations
      ●   Pretty much runs itself, unexpected <1day/month
 ●   Serving VO Frontend Admin requests
      ●   Highly variable, average a few hours/week
 ●   Keeping up with changes in the Grid
      ●   Variable, currently O(10 hours)/week
 ●   Debugging Grid problems                         Better tools could
                                                   drastically reduce this
      ●   More than we have effort for!

glideinWMS training         G.Factroy Operations                             3
Factory node ops
 ●   The factory mostly just runs
      ●   Occasional upgrade of SW needed,
          but typically fast and painless
 ●   Most effort going into investigating          O(hours)
     unexpected behavior, e.g.                        /
      ●   High load                                 month
      ●   Weird problems after a reboot/OS upgrade
 ●   Of course, installing a new node can take
     significant time
      ●   But a very rare event
glideinWMS training         G.Factroy Operations         4
VO FE Admin requests
 ●   Adding a new VO FE can be expensive
      ●   Apart from config changes, to help them start running
      ●   However, relatively rare to have new VOs
 ●   In steady state, VOs may request
                                                    O(hours)
      ●   New sites
                                                       /
          New attributes
                                                     week
      ●


 ●   g.Factory operators also must
     assist with debugging FE config changes
      ●   Error logs come only to GF (currently)
glideinWMS training          G.Factroy Operations            5
Following changes in the Grid
 ●   G.Factory operational principle is trust-but-verify
      ●   G.Factory admins must approve any change
          in the G.Factory config
 ●   Grid a very dynamic place
      ●   At least one site makes a change every single day
      ●   Mostly complaint driven,
          have no good tools to automate change discovery
 ●   G.Factory admins thus must                         Better tools
     change the G.Factory config often                   would be
                                                         welcome
      ●   Currently mostly a manual process
                                            O(10 hours) / week
glideinWMS training        G.Factroy Operations                        6
Grid debugging 1/2
 ●   With O(50k) glideins running at any time,
     we always find something broken somewhere
 ●   Full spectrum of errors
      ●   Broken worker nodes (validation errors)
      ●   Broken CEs (authentication/startup/monitor errors)
      ●   Network problems (glideins not registering)
 ●   Mostly cannot directly solve the problem(s)
      ●   i.e. have to notify remote Admins
      ●   But we have to discover the root cause to get it solved
glideinWMS training         G.Factroy Operations               7
Grid debugging 2/2
 ●   Grid a difficult place to debug
      ●   Most sites are black boxes for us
 ●   Luckily, glideins provide lots of info in the logs
      ●   When we get them... a broken site
          may not return anything useful, or anything at all
 ●   Prodding the black box often needed
      ●   Which is hard!
                                                    Many FTEs
      ●   And some problems
          may be VO specific, too                    DC, if we
                                                     had them
glideinWMS training          G.Factroy Operations              8
What else we do?
●   In order to make our life easier, we also
      ●   Host a test glideinWMS instance
      ●   Develop new helper tools
●   The test glideinWMS instance allows us
    to discover problems early, thus both
      ●   Increasing user satisfaction
      ●   Reducing the time needed in debugging errors
●   We create helper tools to suit our needs
      ●   And anything major we contribute back to glideinWMS
    glideinWMS training        G.Factroy Operations         9
The test glideinWMS Instance
 ●   The test glideinWMS instance contains both a
     G.Factory and a VO Frontend
      ●   This allows us end-to-end testing
 ●   Major focus on the G.Factory, to test before
     deploying in production
      ●   New SW releases
      ●   New sites
      ●   New services on existing sites


glideinWMS training         G.Factroy Operations    10
Summary
 ●   Operating a G.Factory is much more than
     keeping the G.Factory service alive
      ●   Indeed, this part takes almost a
          negligible amount of time
 ●   Most effort going into debugging
     Grid-related problems
      ●   At O(50k) CPUs,
          something is always broken somewhere
 ●   Finally, providing expertise to help
     VO FE Admins also an essential part of the job
glideinWMS training         G.Factroy Operations      11
Acknowledgments
 ●   This document was sponsored by grants from
     the US NSF and US DOE,
     and by the UC system




glideinWMS training       G.Factroy Operations    12

More Related Content

Similar to Glidein Factory Operations

Vietnam qa meetup
Vietnam qa meetupVietnam qa meetup
Vietnam qa meetupSyam Sasi
 
Solving Grid problems through glidein monitoring
Solving Grid problems through glidein monitoringSolving Grid problems through glidein monitoring
Solving Grid problems through glidein monitoringIgor Sfiligoi
 
Yarn optimization (Real life use case)
Yarn optimization (Real life use case)Yarn optimization (Real life use case)
Yarn optimization (Real life use case)Jean-Louis Quéguiner
 
Devops at Startup Weekend BXL
Devops at Startup Weekend BXLDevops at Startup Weekend BXL
Devops at Startup Weekend BXLKris Buytaert
 
Experiences from DevOps production: Deployment, performance, failure.
Experiences from DevOps production: Deployment, performance, failure.Experiences from DevOps production: Deployment, performance, failure.
Experiences from DevOps production: Deployment, performance, failure.Server Density
 
Analysing in depth work manager
Analysing in depth work managerAnalysing in depth work manager
Analysing in depth work managerlpu
 
Continuous Deployment Applied at MyHeritage
Continuous Deployment Applied at MyHeritageContinuous Deployment Applied at MyHeritage
Continuous Deployment Applied at MyHeritageRan Levy
 
Monitoring and troubleshooting a glideinWMS-based HTCondor pool
Monitoring and troubleshooting a glideinWMS-based HTCondor poolMonitoring and troubleshooting a glideinWMS-based HTCondor pool
Monitoring and troubleshooting a glideinWMS-based HTCondor poolIgor Sfiligoi
 
How to Become a Winner in the JVM Performance-Tuning Battle
How to Become a Winner in the JVM Performance-Tuning BattleHow to Become a Winner in the JVM Performance-Tuning Battle
How to Become a Winner in the JVM Performance-Tuning BattleCapgemini
 
Jenkinsconf Presentation - Advance jenkins management with multiple projects.
Jenkinsconf Presentation - Advance jenkins management with multiple projects.Jenkinsconf Presentation - Advance jenkins management with multiple projects.
Jenkinsconf Presentation - Advance jenkins management with multiple projects.Ohad Basan
 
Devops, the future is here, it's just not evenly distributed yet.
Devops, the future is here, it's just not evenly distributed yet.Devops, the future is here, it's just not evenly distributed yet.
Devops, the future is here, it's just not evenly distributed yet.Kris Buytaert
 
Improve the deployment process step by step
Improve the deployment process step by stepImprove the deployment process step by step
Improve the deployment process step by stepDaniel Fahlke
 
DrupalCon 2014: A Perfect Launch, Every Time
DrupalCon 2014: A Perfect Launch, Every TimeDrupalCon 2014: A Perfect Launch, Every Time
DrupalCon 2014: A Perfect Launch, Every TimePantheon
 
Automating MySQL operations with Puppet
Automating MySQL operations with PuppetAutomating MySQL operations with Puppet
Automating MySQL operations with PuppetKris Buytaert
 
Administrivia: Golden Tips for Making JIRA Hum
Administrivia: Golden Tips for Making JIRA HumAdministrivia: Golden Tips for Making JIRA Hum
Administrivia: Golden Tips for Making JIRA HumAtlassian
 
Administrivia: Golden Tips for Making JIRA Hum
Administrivia: Golden Tips for Making JIRA HumAdministrivia: Golden Tips for Making JIRA Hum
Administrivia: Golden Tips for Making JIRA HumAtlassian
 
Grails 4: Upgrade your Game!
Grails 4: Upgrade your Game!Grails 4: Upgrade your Game!
Grails 4: Upgrade your Game!Zachary Klein
 
Cynthia Wu: Satisfaction Not Guaranteed
Cynthia Wu: Satisfaction Not GuaranteedCynthia Wu: Satisfaction Not Guaranteed
Cynthia Wu: Satisfaction Not GuaranteedAnna Royzman
 

Similar to Glidein Factory Operations (20)

Vietnam qa meetup
Vietnam qa meetupVietnam qa meetup
Vietnam qa meetup
 
Solving Grid problems through glidein monitoring
Solving Grid problems through glidein monitoringSolving Grid problems through glidein monitoring
Solving Grid problems through glidein monitoring
 
Yarn optimization (Real life use case)
Yarn optimization (Real life use case)Yarn optimization (Real life use case)
Yarn optimization (Real life use case)
 
Continuous integration (eng)
Continuous integration (eng)Continuous integration (eng)
Continuous integration (eng)
 
Devops at Startup Weekend BXL
Devops at Startup Weekend BXLDevops at Startup Weekend BXL
Devops at Startup Weekend BXL
 
Experiences from DevOps production: Deployment, performance, failure.
Experiences from DevOps production: Deployment, performance, failure.Experiences from DevOps production: Deployment, performance, failure.
Experiences from DevOps production: Deployment, performance, failure.
 
Analysing in depth work manager
Analysing in depth work managerAnalysing in depth work manager
Analysing in depth work manager
 
Griffon demo
Griffon demoGriffon demo
Griffon demo
 
Continuous Deployment Applied at MyHeritage
Continuous Deployment Applied at MyHeritageContinuous Deployment Applied at MyHeritage
Continuous Deployment Applied at MyHeritage
 
Monitoring and troubleshooting a glideinWMS-based HTCondor pool
Monitoring and troubleshooting a glideinWMS-based HTCondor poolMonitoring and troubleshooting a glideinWMS-based HTCondor pool
Monitoring and troubleshooting a glideinWMS-based HTCondor pool
 
How to Become a Winner in the JVM Performance-Tuning Battle
How to Become a Winner in the JVM Performance-Tuning BattleHow to Become a Winner in the JVM Performance-Tuning Battle
How to Become a Winner in the JVM Performance-Tuning Battle
 
Jenkinsconf Presentation - Advance jenkins management with multiple projects.
Jenkinsconf Presentation - Advance jenkins management with multiple projects.Jenkinsconf Presentation - Advance jenkins management with multiple projects.
Jenkinsconf Presentation - Advance jenkins management with multiple projects.
 
Devops, the future is here, it's just not evenly distributed yet.
Devops, the future is here, it's just not evenly distributed yet.Devops, the future is here, it's just not evenly distributed yet.
Devops, the future is here, it's just not evenly distributed yet.
 
Improve the deployment process step by step
Improve the deployment process step by stepImprove the deployment process step by step
Improve the deployment process step by step
 
DrupalCon 2014: A Perfect Launch, Every Time
DrupalCon 2014: A Perfect Launch, Every TimeDrupalCon 2014: A Perfect Launch, Every Time
DrupalCon 2014: A Perfect Launch, Every Time
 
Automating MySQL operations with Puppet
Automating MySQL operations with PuppetAutomating MySQL operations with Puppet
Automating MySQL operations with Puppet
 
Administrivia: Golden Tips for Making JIRA Hum
Administrivia: Golden Tips for Making JIRA HumAdministrivia: Golden Tips for Making JIRA Hum
Administrivia: Golden Tips for Making JIRA Hum
 
Administrivia: Golden Tips for Making JIRA Hum
Administrivia: Golden Tips for Making JIRA HumAdministrivia: Golden Tips for Making JIRA Hum
Administrivia: Golden Tips for Making JIRA Hum
 
Grails 4: Upgrade your Game!
Grails 4: Upgrade your Game!Grails 4: Upgrade your Game!
Grails 4: Upgrade your Game!
 
Cynthia Wu: Satisfaction Not Guaranteed
Cynthia Wu: Satisfaction Not GuaranteedCynthia Wu: Satisfaction Not Guaranteed
Cynthia Wu: Satisfaction Not Guaranteed
 

More from Igor Sfiligoi

Preparing Fusion codes for Perlmutter - CGYRO
Preparing Fusion codes for Perlmutter - CGYROPreparing Fusion codes for Perlmutter - CGYRO
Preparing Fusion codes for Perlmutter - CGYROIgor Sfiligoi
 
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...Igor Sfiligoi
 
Comparing single-node and multi-node performance of an important fusion HPC c...
Comparing single-node and multi-node performance of an important fusion HPC c...Comparing single-node and multi-node performance of an important fusion HPC c...
Comparing single-node and multi-node performance of an important fusion HPC c...Igor Sfiligoi
 
The anachronism of whole-GPU accounting
The anachronism of whole-GPU accountingThe anachronism of whole-GPU accounting
The anachronism of whole-GPU accountingIgor Sfiligoi
 
Auto-scaling HTCondor pools using Kubernetes compute resources
Auto-scaling HTCondor pools using Kubernetes compute resourcesAuto-scaling HTCondor pools using Kubernetes compute resources
Auto-scaling HTCondor pools using Kubernetes compute resourcesIgor Sfiligoi
 
Speeding up bowtie2 by improving cache-hit rate
Speeding up bowtie2 by improving cache-hit rateSpeeding up bowtie2 by improving cache-hit rate
Speeding up bowtie2 by improving cache-hit rateIgor Sfiligoi
 
Performance Optimization of CGYRO for Multiscale Turbulence Simulations
Performance Optimization of CGYRO for Multiscale Turbulence SimulationsPerformance Optimization of CGYRO for Multiscale Turbulence Simulations
Performance Optimization of CGYRO for Multiscale Turbulence SimulationsIgor Sfiligoi
 
Comparing GPU effectiveness for Unifrac distance compute
Comparing GPU effectiveness for Unifrac distance computeComparing GPU effectiveness for Unifrac distance compute
Comparing GPU effectiveness for Unifrac distance computeIgor Sfiligoi
 
Managing Cloud networking costs for data-intensive applications by provisioni...
Managing Cloud networking costs for data-intensive applications by provisioni...Managing Cloud networking costs for data-intensive applications by provisioni...
Managing Cloud networking costs for data-intensive applications by provisioni...Igor Sfiligoi
 
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory AccessAccelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory AccessIgor Sfiligoi
 
Using A100 MIG to Scale Astronomy Scientific Output
Using A100 MIG to Scale Astronomy Scientific OutputUsing A100 MIG to Scale Astronomy Scientific Output
Using A100 MIG to Scale Astronomy Scientific OutputIgor Sfiligoi
 
Using commercial Clouds to process IceCube jobs
Using commercial Clouds to process IceCube jobsUsing commercial Clouds to process IceCube jobs
Using commercial Clouds to process IceCube jobsIgor Sfiligoi
 
Modest scale HPC on Azure using CGYRO
Modest scale HPC on Azure using CGYROModest scale HPC on Azure using CGYRO
Modest scale HPC on Azure using CGYROIgor Sfiligoi
 
Data-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud BurstData-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud BurstIgor Sfiligoi
 
Scheduling a Kubernetes Federation with Admiralty
Scheduling a Kubernetes Federation with AdmiraltyScheduling a Kubernetes Federation with Admiralty
Scheduling a Kubernetes Federation with AdmiraltyIgor Sfiligoi
 
Accelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACCAccelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACCIgor Sfiligoi
 
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...Igor Sfiligoi
 
Porting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUsPorting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUsIgor Sfiligoi
 
Demonstrating 100 Gbps in and out of the public Clouds
Demonstrating 100 Gbps in and out of the public CloudsDemonstrating 100 Gbps in and out of the public Clouds
Demonstrating 100 Gbps in and out of the public CloudsIgor Sfiligoi
 
TransAtlantic Networking using Cloud links
TransAtlantic Networking using Cloud linksTransAtlantic Networking using Cloud links
TransAtlantic Networking using Cloud linksIgor Sfiligoi
 

More from Igor Sfiligoi (20)

Preparing Fusion codes for Perlmutter - CGYRO
Preparing Fusion codes for Perlmutter - CGYROPreparing Fusion codes for Perlmutter - CGYRO
Preparing Fusion codes for Perlmutter - CGYRO
 
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...
O&C Meeting - Evaluation of ARM CPUs for IceCube available through Google Kub...
 
Comparing single-node and multi-node performance of an important fusion HPC c...
Comparing single-node and multi-node performance of an important fusion HPC c...Comparing single-node and multi-node performance of an important fusion HPC c...
Comparing single-node and multi-node performance of an important fusion HPC c...
 
The anachronism of whole-GPU accounting
The anachronism of whole-GPU accountingThe anachronism of whole-GPU accounting
The anachronism of whole-GPU accounting
 
Auto-scaling HTCondor pools using Kubernetes compute resources
Auto-scaling HTCondor pools using Kubernetes compute resourcesAuto-scaling HTCondor pools using Kubernetes compute resources
Auto-scaling HTCondor pools using Kubernetes compute resources
 
Speeding up bowtie2 by improving cache-hit rate
Speeding up bowtie2 by improving cache-hit rateSpeeding up bowtie2 by improving cache-hit rate
Speeding up bowtie2 by improving cache-hit rate
 
Performance Optimization of CGYRO for Multiscale Turbulence Simulations
Performance Optimization of CGYRO for Multiscale Turbulence SimulationsPerformance Optimization of CGYRO for Multiscale Turbulence Simulations
Performance Optimization of CGYRO for Multiscale Turbulence Simulations
 
Comparing GPU effectiveness for Unifrac distance compute
Comparing GPU effectiveness for Unifrac distance computeComparing GPU effectiveness for Unifrac distance compute
Comparing GPU effectiveness for Unifrac distance compute
 
Managing Cloud networking costs for data-intensive applications by provisioni...
Managing Cloud networking costs for data-intensive applications by provisioni...Managing Cloud networking costs for data-intensive applications by provisioni...
Managing Cloud networking costs for data-intensive applications by provisioni...
 
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory AccessAccelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
 
Using A100 MIG to Scale Astronomy Scientific Output
Using A100 MIG to Scale Astronomy Scientific OutputUsing A100 MIG to Scale Astronomy Scientific Output
Using A100 MIG to Scale Astronomy Scientific Output
 
Using commercial Clouds to process IceCube jobs
Using commercial Clouds to process IceCube jobsUsing commercial Clouds to process IceCube jobs
Using commercial Clouds to process IceCube jobs
 
Modest scale HPC on Azure using CGYRO
Modest scale HPC on Azure using CGYROModest scale HPC on Azure using CGYRO
Modest scale HPC on Azure using CGYRO
 
Data-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud BurstData-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud Burst
 
Scheduling a Kubernetes Federation with Admiralty
Scheduling a Kubernetes Federation with AdmiraltyScheduling a Kubernetes Federation with Admiralty
Scheduling a Kubernetes Federation with Admiralty
 
Accelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACCAccelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACC
 
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
 
Porting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUsPorting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUs
 
Demonstrating 100 Gbps in and out of the public Clouds
Demonstrating 100 Gbps in and out of the public CloudsDemonstrating 100 Gbps in and out of the public Clouds
Demonstrating 100 Gbps in and out of the public Clouds
 
TransAtlantic Networking using Cloud links
TransAtlantic Networking using Cloud linksTransAtlantic Networking using Cloud links
TransAtlantic Networking using Cloud links
 

Recently uploaded

Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 

Recently uploaded (20)

Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 

Glidein Factory Operations

  • 1. glideinWMS training Glidein Factory Operations i.e. How we spend our time? by Igor Sfiligoi (UCSD) glideinWMS training G.Factroy Operations 1
  • 2. G. Factory Operation Categories ● Factory node operations ● Serving VO Frontend Admin requests ● Keeping up with changes in the Grid ● Debugging Grid problems glideinWMS training G.Factroy Operations 2
  • 3. G. Factory Operation Ongoing Costs ● Factory node operations ● Pretty much runs itself, unexpected <1day/month ● Serving VO Frontend Admin requests ● Highly variable, average a few hours/week ● Keeping up with changes in the Grid ● Variable, currently O(10 hours)/week ● Debugging Grid problems Better tools could drastically reduce this ● More than we have effort for! glideinWMS training G.Factroy Operations 3
  • 4. Factory node ops ● The factory mostly just runs ● Occasional upgrade of SW needed, but typically fast and painless ● Most effort going into investigating O(hours) unexpected behavior, e.g. / ● High load month ● Weird problems after a reboot/OS upgrade ● Of course, installing a new node can take significant time ● But a very rare event glideinWMS training G.Factroy Operations 4
  • 5. VO FE Admin requests ● Adding a new VO FE can be expensive ● Apart from config changes, to help them start running ● However, relatively rare to have new VOs ● In steady state, VOs may request O(hours) ● New sites / New attributes week ● ● g.Factory operators also must assist with debugging FE config changes ● Error logs come only to GF (currently) glideinWMS training G.Factroy Operations 5
  • 6. Following changes in the Grid ● G.Factory operational principle is trust-but-verify ● G.Factory admins must approve any change in the G.Factory config ● Grid a very dynamic place ● At least one site makes a change every single day ● Mostly complaint driven, have no good tools to automate change discovery ● G.Factory admins thus must Better tools change the G.Factory config often would be welcome ● Currently mostly a manual process O(10 hours) / week glideinWMS training G.Factroy Operations 6
  • 7. Grid debugging 1/2 ● With O(50k) glideins running at any time, we always find something broken somewhere ● Full spectrum of errors ● Broken worker nodes (validation errors) ● Broken CEs (authentication/startup/monitor errors) ● Network problems (glideins not registering) ● Mostly cannot directly solve the problem(s) ● i.e. have to notify remote Admins ● But we have to discover the root cause to get it solved glideinWMS training G.Factroy Operations 7
  • 8. Grid debugging 2/2 ● Grid a difficult place to debug ● Most sites are black boxes for us ● Luckily, glideins provide lots of info in the logs ● When we get them... a broken site may not return anything useful, or anything at all ● Prodding the black box often needed ● Which is hard! Many FTEs ● And some problems may be VO specific, too DC, if we had them glideinWMS training G.Factroy Operations 8
  • 9. What else we do? ● In order to make our life easier, we also ● Host a test glideinWMS instance ● Develop new helper tools ● The test glideinWMS instance allows us to discover problems early, thus both ● Increasing user satisfaction ● Reducing the time needed in debugging errors ● We create helper tools to suit our needs ● And anything major we contribute back to glideinWMS glideinWMS training G.Factroy Operations 9
  • 10. The test glideinWMS Instance ● The test glideinWMS instance contains both a G.Factory and a VO Frontend ● This allows us end-to-end testing ● Major focus on the G.Factory, to test before deploying in production ● New SW releases ● New sites ● New services on existing sites glideinWMS training G.Factroy Operations 10
  • 11. Summary ● Operating a G.Factory is much more than keeping the G.Factory service alive ● Indeed, this part takes almost a negligible amount of time ● Most effort going into debugging Grid-related problems ● At O(50k) CPUs, something is always broken somewhere ● Finally, providing expertise to help VO FE Admins also an essential part of the job glideinWMS training G.Factroy Operations 11
  • 12. Acknowledgments ● This document was sponsored by grants from the US NSF and US DOE, and by the UC system glideinWMS training G.Factroy Operations 12