The document discusses the main operations of a Glidein Factory: factory node operations require minimal time, while serving VO frontend admin requests and keeping up with grid changes require several hours per week. The largest effort goes towards debugging grid problems, which could require multiple full-time employees due to the difficulty of debugging a large distributed system. Operating a test GlideinWMS instance and developing new tools help reduce debugging time and increase user satisfaction.
Glidein Factory Operations
1. glideinWMS training
Glidein Factory Operations
i.e. How do we spend our time?
by Igor Sfiligoi (UCSD)
glideinWMS training G.Factory Operations
2. G. Factory Operation Categories
● Factory node operations
● Serving VO Frontend Admin requests
● Keeping up with changes in the Grid
● Debugging Grid problems
3. G. Factory Operation Ongoing Costs
● Factory node operations
● Pretty much runs itself, unexpected <1day/month
● Serving VO Frontend Admin requests
● Highly variable, average a few hours/week
● Keeping up with changes in the Grid
● Variable, currently O(10 hours)/week
● Debugging Grid problems
● More than we have effort for!
● Better tools could drastically reduce this
4. Factory node ops
● The factory mostly just runs
● Occasional upgrade of SW needed,
but typically fast and painless
● Most effort going into investigating unexpected behavior, O(hours)/month, e.g.
● High load
● Weird problems after a reboot/OS upgrade
● Of course, installing a new node can take
significant time
● But a very rare event
5. VO FE Admin requests
● Adding a new VO FE can be expensive
● Apart from config changes, we must help them start running
● However, relatively rare to have new VOs
● In steady state, VOs may request, O(hours)/week
● New sites
● New attributes
● G.Factory operators also must
assist with debugging FE config changes
● Error logs come only to GF (currently)
6. Following changes in the Grid
● G.Factory operational principle is trust-but-verify
● G.Factory admins must approve any change
in the G.Factory config
● Grid a very dynamic place
● At least one site makes a change every single day
● Change discovery is mostly complaint driven,
we have no good tools to automate it
● G.Factory admins thus must
change the G.Factory config often
● Currently mostly a manual process, O(10 hours)/week
● Better tools would be welcome
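The trust-but-verify step above could be partly assisted by a tool that compares what is approved in the G.Factory config with what a site currently advertises, and lists the differences for an admin to review. The following is only a minimal sketch of that idea: the function name `diff_entry`, the attribute names, and the host names are all hypothetical and are not part of glideinWMS. Nothing is applied automatically; the admin still approves each change.

```python
from typing import Dict, List, Tuple


def diff_entry(approved: Dict[str, str],
               observed: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (attribute, approved_value, observed_value) for every mismatch.

    Both arguments are plain dictionaries; how they would be filled
    (factory config on one side, site information on the other) is
    left open and is an assumption of this sketch.
    """
    changes = []
    for key in sorted(set(approved) | set(observed)):
        old = approved.get(key, "<missing>")
        new = observed.get(key, "<missing>")
        if old != new:
            changes.append((key, old, new))
    return changes


# Hypothetical example data, for illustration only.
approved = {"gatekeeper": "ce01.example.org", "max_walltime_h": "48"}
observed = {"gatekeeper": "ce02.example.org", "max_walltime_h": "48",
            "queue": "long"}

for attr, old, new in diff_entry(approved, observed):
    # Only report; a G.Factory admin decides whether to accept the change.
    print(f"{attr}: {old} -> {new}  (needs admin approval)")
```

A report like this would not replace complaint-driven discovery, but it could shorten the manual comparison once a change is suspected.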
7. Grid debugging 1/2
● With O(50k) glideins running at any time,
we always find something broken somewhere
● Full spectrum of errors
● Broken worker nodes (validation errors)
● Broken CEs (authentication/startup/monitor errors)
● Network problems (glideins not registering)
● Mostly cannot directly solve the problem(s)
● i.e. have to notify remote Admins
● But we have to discover the root cause to get it solved
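Since the operators mostly cannot fix a problem themselves, a first step is deciding which remote admins to contact and about what. One way to picture this is a per-site tally of failures grouped by the error classes listed above. This is an illustrative sketch only: the error labels, the `CATEGORIES` mapping, the site names, and both function names are hypothetical, and real glidein logs would need their own parsing.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical mapping from an error label to the classes named on the slide.
CATEGORIES = {
    "validation": "broken worker node",
    "authentication": "broken CE",
    "startup": "broken CE",
    "monitor": "broken CE",
    "not_registered": "network problem",
}


def summarize(failures: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count failures per site and per category.

    `failures` is an iterable of (site, error_label) pairs.
    Labels not in CATEGORIES are counted as "unknown".
    """
    per_site: Dict[str, Counter] = defaultdict(Counter)
    for site, error in failures:
        per_site[site][CATEGORIES.get(error, "unknown")] += 1
    return per_site


def sites_to_notify(per_site: Dict[str, Counter], threshold: int) -> List[str]:
    """Return the sites whose total failure count reaches the threshold."""
    return sorted(site for site, counts in per_site.items()
                  if sum(counts.values()) >= threshold)


# Hypothetical example data, for illustration only.
failures = [
    ("SiteA", "validation"), ("SiteA", "validation"), ("SiteA", "startup"),
    ("SiteB", "not_registered"),
]
summary = summarize(failures)
for site in sites_to_notify(summary, threshold=2):
    print(site, dict(summary[site]))
```

The tally only points at where to look; finding the root cause still requires reading the logs for the flagged site.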
8. Grid debugging 2/2
● Grid a difficult place to debug
● Most sites are black boxes for us
● Luckily, glideins provide lots of info in the logs
● When we get them... a broken site
may not return anything useful, or anything at all
● Prodding the black box often needed
● Which is hard!
● And some problems
may be VO specific, too
● This could absorb many FTEs, if we had them
9. What else do we do?
● In order to make our life easier, we also
● Host a test glideinWMS instance
● Develop new helper tools
● The test glideinWMS instance allows us
to discover problems early, thus both
● Increasing user satisfaction
● Reducing the time needed in debugging errors
● We create helper tools to suit our needs
● And anything major we contribute back to glideinWMS
10. The test glideinWMS Instance
● The test glideinWMS instance contains both a
G.Factory and a VO Frontend
● This allows us to do end-to-end testing
● Major focus on the G.Factory, to test before
deploying in production
● New SW releases
● New sites
● New services on existing sites
11. Summary
● Operating a G.Factory is much more than
keeping the G.Factory service alive
● Indeed, this part takes an almost
negligible amount of time
● Most effort going into debugging
Grid-related problems
● At O(50k) CPUs,
something is always broken somewhere
● Finally, providing expertise to help
VO FE Admins also an essential part of the job
12. Acknowledgments
● This document was sponsored by grants from
the US NSF and US DOE,
and by the UC system