Glidein Factory Operations



What it takes to operate a glideinWMS glidein factory.

The OSG experience.


  1. glideinWMS training: Glidein Factory Operations, i.e. how do we spend our time? By Igor Sfiligoi (UCSD)
  2. G.Factory operation categories
     ● Factory node operations
     ● Serving VO Frontend Admin requests
     ● Keeping up with changes in the Grid
     ● Debugging Grid problems
  3. G.Factory operation ongoing costs
     ● Factory node operations
         ● Pretty much runs itself; unexpected work is <1 day/month
     ● Serving VO Frontend Admin requests
         ● Highly variable, on average a few hours/week
     ● Keeping up with changes in the Grid
         ● Variable, currently O(10 hours)/week
     ● Debugging Grid problems
         ● More than we have effort for!
     (Callout: better tools could drastically reduce this)
  4. Factory node ops
     ● The factory mostly just runs
     ● Occasional SW upgrades are needed, but they are typically fast and painless
     ● Most effort goes into investigating unexpected behavior, O(hours)/month, e.g.
         ● High load
         ● Weird problems after a reboot/OS upgrade
     ● Of course, installing a new node can take significant time
         ● But that is a very rare event
  5. VO FE Admin requests
     ● Adding a new VO FE can be expensive
         ● Apart from config changes, we need to help them start running
         ● However, it is relatively rare to have new VOs
     ● In steady state, VOs may request, at O(hours)/week
         ● New sites
         ● New attributes
     ● G.Factory operators must also assist with debugging FE config changes
         ● Error logs come only to the GF (currently)
  6. Following changes in the Grid
     ● The G.Factory operational principle is trust-but-verify
         ● G.Factory admins must approve any change in the G.Factory config
     ● The Grid is a very dynamic place
         ● At least one site makes a change every single day
         ● Discovery is mostly complaint driven; we have no good tools to automate change discovery (better tools would be welcome)
     ● G.Factory admins thus must change the G.Factory config often
         ● Currently a mostly manual process, O(10 hours)/week
  7. Grid debugging 1/2
     ● With O(50k) glideins running at any time, we always find something broken somewhere
     ● Full spectrum of errors
         ● Broken worker nodes (validation errors)
         ● Broken CEs (authentication/startup/monitor errors)
         ● Network problems (glideins not registering)
     ● Mostly we cannot directly solve the problem(s)
         ● i.e. we have to notify the remote Admins
         ● But we have to discover the root cause to get it solved
  8. Grid debugging 2/2
     ● The Grid is a difficult place to debug
         ● Most sites are black boxes for us
     ● Luckily, glideins provide lots of info in the logs
         ● When we get them... a broken site may not return anything useful, or anything at all
     ● Prodding the black box is often needed
         ● Which is hard!
     ● And some problems may be VO specific, too
     (Callout: Many FTEs DC, if we had them)
  9. What else do we do?
     ● In order to make our life easier, we also
         ● Host a test glideinWMS instance
         ● Develop new helper tools
     ● The test glideinWMS instance allows us to discover problems early, thus both
         ● Increasing user satisfaction
         ● Reducing the time needed for debugging errors
     ● We create helper tools to suit our needs
         ● And anything major we contribute back to glideinWMS
  10. The test glideinWMS instance
     ● The test glideinWMS instance contains both a G.Factory and a VO Frontend
         ● This allows us end-to-end testing
     ● Major focus on the G.Factory, to test before deploying in production
         ● New SW releases
         ● New sites
         ● New services on existing sites
  11. Summary
     ● Operating a G.Factory is much more than keeping the G.Factory service alive
         ● Indeed, this part takes an almost negligible amount of time
     ● Most effort goes into debugging Grid-related problems
         ● At O(50k) CPUs, something is always broken somewhere
     ● Finally, providing expertise to help VO FE Admins is also an essential part of the job
  12. Acknowledgments
     ● This document was sponsored by grants from the US NSF and US DOE, and by the UC system
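
The trust-but-verify principle from slide 6 (changes at sites are discovered first, then approved by a G.Factory admin before the config is touched) could in principle be supported by a small change-discovery helper. The Python sketch below is not part of glideinWMS: the function names, the `PendingChange` record, and the dictionary layout of site attributes are all assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingChange:
    """One difference awaiting operator review (hypothetical record type)."""
    site: str
    attribute: str
    approved: object  # value in the approved config (None if absent)
    observed: object  # value the site publishes now (None if absent)


def discover_changes(approved, observed):
    """List differences between approved and observed site attributes.

    Both arguments map site name -> {attribute name -> value}.
    Nothing is applied here; differences are only reported for review.
    """
    changes = []
    for site in sorted(set(approved) | set(observed)):
        old = approved.get(site, {})
        new = observed.get(site, {})
        for attr in sorted(set(old) | set(new)):
            if old.get(attr) != new.get(attr):
                changes.append(
                    PendingChange(site, attr, old.get(attr), new.get(attr))
                )
    return changes


def apply_approved(approved, changes, approve):
    """Return a new config holding only the changes the operator approved.

    `approve` is a callable taking a PendingChange and returning True/False;
    the input config is never modified.
    """
    updated = {site: dict(attrs) for site, attrs in approved.items()}
    for change in changes:
        if not approve(change):
            continue  # rejected changes leave the config untouched
        attrs = updated.setdefault(change.site, {})
        if change.observed is None:
            attrs.pop(change.attribute, None)
        else:
            attrs[change.attribute] = change.observed
    return updated


if __name__ == "__main__":
    # Made-up example data; site and attribute names are placeholders.
    approved = {"SiteA": {"gatekeeper": "ce1.example.org", "max_walltime_h": 48}}
    observed = {"SiteA": {"gatekeeper": "ce2.example.org", "max_walltime_h": 48}}
    for change in discover_changes(approved, observed):
        print(change)
```

Keeping discovery and application as separate steps mirrors the principle stated on the slide: the tool only proposes, and a human decision (the `approve` callable here) gates every change to the config.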