Glidein Factory Operations


  1. glideinWMS training: Glidein Factory Operations, i.e. how do we spend our time? By Igor Sfiligoi (UCSD). glideinWMS training, G.Factory Operations
  2. G.Factory Operation Categories
     ● Factory node operations
     ● Serving VO Frontend Admin requests
     ● Keeping up with changes in the Grid
     ● Debugging Grid problems
  3. G.Factory Operation Ongoing Costs
     ● Factory node operations
       ● Pretty much runs itself; unexpected work is <1 day/month
     ● Serving VO Frontend Admin requests
       ● Highly variable, averaging a few hours/week
     ● Keeping up with changes in the Grid
       ● Variable, currently O(10 hours)/week
     ● Debugging Grid problems
       ● More than we have effort for!
       ● Better tools could drastically reduce this
  4. Factory node ops
     ● The factory mostly just runs
       ● Occasional SW upgrades are needed, but they are typically fast and painless
       ● Most effort goes into investigating unexpected behavior, O(hours)/month, e.g.
         ● High load
         ● Weird problems after a reboot/OS upgrade
     ● Of course, installing a new node can take significant time
       ● But this is a very rare event
  5. VO FE Admin requests
     ● Adding a new VO FE can be expensive
       ● Apart from config changes, we need to help them start running
       ● However, new VOs are relatively rare
     ● In steady state, VOs may request, at O(hours)/week
       ● New sites / New attributes
     ● G.Factory operators must also assist with debugging FE config changes
       ● Error logs come only to the GF (currently)
  6. Following changes in the Grid
     ● The G.Factory operational principle is trust-but-verify
       ● G.Factory admins must approve any change in the G.Factory config
     ● The Grid is a very dynamic place
       ● At least one site makes a change every single day
       ● Discovery is mostly complaint driven; we have no good tools to automate change discovery (better tools would be welcome)
     ● G.Factory admins thus must change the G.Factory config often
       ● Currently a mostly manual process, O(10 hours)/week
  7. Grid debugging 1/2
     ● With O(50k) glideins running at any time, we always find something broken somewhere
     ● Full spectrum of errors
       ● Broken worker nodes (validation errors)
       ● Broken CEs (authentication/startup/monitor errors)
       ● Network problems (glideins not registering)
     ● Mostly we cannot directly solve the problem(s)
       ● i.e. we have to notify the remote Admins
       ● But we have to discover the root cause to get it solved
  8. Grid debugging 2/2
     ● The Grid is a difficult place to debug
       ● Most sites are black boxes for us
     ● Luckily, glideins provide lots of info in the logs
       ● When we get them... a broken site may not return anything useful, or anything at all
     ● Prodding the black box is often needed
       ● Which is hard!
       ● And some problems may be VO specific, too
     ● Callout: Many FTEs DC, if we had them
  9. What else do we do?
     ● In order to make our life easier, we also
       ● Host a test glideinWMS instance
       ● Develop new helper tools
     ● The test glideinWMS instance allows us to discover problems early, thus both
       ● Increasing user satisfaction
       ● Reducing the time needed to debug errors
     ● We create helper tools to suit our needs
       ● And anything major we contribute back to glideinWMS
  10. The test glideinWMS Instance
     ● The test glideinWMS instance contains both a G.Factory and a VO Frontend
       ● This allows us to do end-to-end testing
     ● Major focus on the G.Factory, to test before deploying in production
       ● New SW releases
       ● New sites
       ● New services on existing sites
  11. Summary
     ● Operating a G.Factory is much more than keeping the G.Factory service alive
       ● Indeed, this part takes an almost negligible amount of time
     ● Most effort goes into debugging Grid-related problems
       ● At O(50k) CPUs, something is always broken somewhere
     ● Finally, providing expertise to help VO FE Admins is also an essential part of the job
  12. Acknowledgments
     ● This document was sponsored by grants from the US NSF and US DOE, and by the UC system
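
The "Following changes in the Grid" slide describes a trust-but-verify workflow in which site changes are discovered, then approved by a G.Factory admin before they reach the G.Factory config. The sketch below illustrates what a small change-discovery helper along those lines might look like. It is not part of glideinWMS: the names `Change`, `discover_changes` and `apply_approved`, the dictionary layout (site name mapped to attribute name and value), and the attribute names in the usage example are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical layout: {site name: {attribute name: value}}.
Config = Dict[str, Dict[str, object]]


@dataclass(frozen=True)
class Change:
    """One difference between the approved config and what a site now reports."""
    site: str
    attribute: str
    old: Optional[object]  # None means the attribute was not in the approved config
    new: Optional[object]  # None means the attribute is no longer reported


def discover_changes(approved: Config, observed: Config) -> List[Change]:
    """List every attribute whose observed value differs from the approved one."""
    changes = []
    for site in sorted(set(approved) | set(observed)):
        old_attrs = approved.get(site, {})
        new_attrs = observed.get(site, {})
        for attr in sorted(set(old_attrs) | set(new_attrs)):
            old, new = old_attrs.get(attr), new_attrs.get(attr)
            if old != new:
                changes.append(Change(site, attr, old, new))
    return changes


def apply_approved(approved: Config, changes: List[Change],
                   approve: Callable[[Change], bool]) -> Config:
    """Return a new config containing only the changes the admin approved.

    The input config is left untouched, so nothing changes without approval.
    """
    updated = {site: dict(attrs) for site, attrs in approved.items()}
    for change in changes:
        if not approve(change):
            continue  # verify step failed: keep the previously approved value
        attrs = updated.setdefault(change.site, {})
        if change.new is None:
            attrs.pop(change.attribute, None)
        else:
            attrs[change.attribute] = change.new
    return updated
```

A possible use, again with made-up site data: call `discover_changes(approved, observed)` on two snapshots, show the resulting list to an admin, and pass their decision as the `approve` callback to `apply_approved`. Keeping discovery and approval as separate steps mirrors the slide's principle that automation may find changes, while a G.Factory admin still signs off on each one.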
