Solving Grid problems through glidein monitoring


This document presents how Glidein Factory operations help solve problems that develop on Grid resources.

Published in: Technology

  1. glideinWMS training (Grid debugging): Solving Grid problems through glidein monitoring
     i.e. the Grid debugging part of G.Factory operations
     by Igor Sfiligoi (UCSD)
  2. Glidein Factory Operations
     ● Factory node operations
     ● Serving VO Frontend Admin requests
     ● Keeping up with changes in the Grid
     ● Debugging Grid problems
       ● The most time consuming part
       ● Effectively, we help solve Grid problems through glidein monitoring
  3. Reminder - Glideins
     ● A glidein is a properly configured Condor startd daemon submitted as a Grid job
     [Diagram labels: Submit node, Worker node, Frontend, glidein, Central manager, Startd, CE, Job, Factory; Monitor Condor, Match, Request glideins, Submit glideins]
  4. What can go wrong in the Grid?
     ● Many places where things can go wrong
     ● Essentially at any of the arrows below
     [Diagram labels: Submit node, Worker node, glidein, Central manager, Startd, CE, Job, Factory]
  5. What can go wrong in the Grid?
     ● In particular
       ● CE may refuse to accept glideins
     [Diagram repeated from slide 4]
  6. What can go wrong in the Grid?
     ● In particular
       ● CE may not start glideins
       ● Or fail to tell us what the status of the job is
     [Diagram repeated from slide 4]
  7. What can go wrong in the Grid?
     ● In particular
       ● The worker node may be broken/misconfigured
         – Thus validation will fail
       ● Many reasons
     [Diagram repeated from slide 4]
  8. What can go wrong in the Grid?
     ● In particular
       ● The WAN networking may not work properly
       ● The CM never hears from the Startd
       ● Or Schedd cannot talk to Startd
       ● Can be selective
     [Diagram repeated from slide 4]
  9. What can go wrong in the Grid?
     ● In particular
       ● Or the security infrastructure could be broken
         – CAs missing
         – Time discrepancies
         – Etc.
     [Diagram repeated from slide 4]
  10. What can go wrong in the Grid?
     ● In particular
       ● The site may refuse to start the user job
         – e.g. glexec
     [Diagram repeated from slide 4]
  11. What can go wrong with glideins?
     ● And there are also non-Grid problems
       ● Jobs not matching
       ● But that's beyond the scope of this document
     [Diagram repeated from slide 4]
  12. Problem classification
     ● Most often we see WN problems
       ● Typically easy to diagnose
     ● Followed by CEs refusing glideins
     ● Then there are misbehaving CEs
       ● Very hard to diagnose!
     ● Everything else quite rare
       ● But usually hard to diagnose as well
  13. Grid debugging: Validation problems
     i.e. problems on Worker Nodes
  14. WN problems
     ● The glidein startup script runs a list of validation scripts
       ● If any of them fails, the WN is considered broken
       ● This way user jobs never get to broken WNs
     ● Two sources of tests
       ● Glidein Factory
       ● VO Frontend
     ● Of course, if the validation script cannot be fetched from either Web server, it is considered a failure as well
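The fail-fast rule on this slide can be sketched as follows. This is only an illustration in Python, not the actual glidein startup script; the function name `run_validation` and the example commands are hypothetical, and the only rule taken from the slide is that one failing (or missing) validation script marks the WN as broken.

```python
import subprocess
import sys
from typing import List, Optional, Sequence, Tuple


def run_validation(commands: Sequence[List[str]]) -> Tuple[bool, Optional[str]]:
    """Run validation commands in order and stop at the first failure.

    Returns (True, None) if every command exits with status 0, otherwise
    (False, message) describing the first command that failed.
    """
    for command in commands:
        try:
            result = subprocess.run(command, capture_output=True, text=True)
        except OSError as exc:
            # The script is missing or not executable, e.g. because it could
            # not be fetched from the Web server. This counts as a failure too.
            return False, f"{command[0]}: could not be executed ({exc})"
        if result.returncode != 0:
            return False, (
                f"{command[0]}: exit code {result.returncode}: "
                f"{result.stderr.strip()}"
            )
    return True, None


if __name__ == "__main__":
    # Hypothetical tests: the first passes, the second reports a problem.
    tests = [
        [sys.executable, "-c", "pass"],
        [sys.executable, "-c", "import sys; sys.exit('no CA certificates found')"],
    ]
    ok, error = run_validation(tests)
    print("node usable" if ok else f"node considered broken: {error}")
```

Returning the error text of the first failing test mirrors the idea that the message is what gets delivered back to the factory for diagnosis.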
  15. Types of tests
     ● The glideinWMS SW comes with a set of standard tests (provided by the factory):
       ● Grid environment present (e.g. CAs)
       ● Some free disk on $PWD and on /tmp
       ● Enough FE-provided proxy lifetime remaining
       ● gLExec related tests
       ● OS type
     ● Each VO may have its own needs, e.g.:
       ● Is VO SW pre-installed and accessible?
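As an illustration of one of these tests, a free-disk check on the working directory and on /tmp might look like the sketch below. It is not the validation script shipped with glideinWMS; the function names and the 1 GiB threshold are assumptions made for the example.

```python
import os
import shutil
from typing import Iterable, List

# Hypothetical threshold; the real tests use their own configured limits.
MIN_FREE_BYTES = 1 * 1024**3  # 1 GiB


def has_free_space(path: str, min_free_bytes: int) -> bool:
    """Return True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= min_free_bytes


def paths_short_on_disk(paths: Iterable[str], min_free_bytes: int) -> List[str]:
    """Return the subset of `paths` that fail the free-space test."""
    return [p for p in paths if not has_free_space(p, min_free_bytes)]


if __name__ == "__main__":
    failing = paths_short_on_disk([os.getcwd(), "/tmp"], MIN_FREE_BYTES)
    if failing:
        # A clear message makes the later diagnosis at the factory easier.
        print("Not enough free disk on: " + ", ".join(failing))
    else:
        print("Disk space test passed")
```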
  16. Discovering the problems
     ● Any error message printed out by the validation script will be delivered back to the factory
       ● After the glidein terminates
     ● Most validation scripts provide a clear indication of what went wrong
       ● And we strive to get all to do it!
     ● New machine readable format being introduced
       ● With v2_6_2
  17. Typical ops
     ● Noticing that a large fraction of glideins for a site are failing is easy
       ● Just look at the monitoring
       ● And we are getting a daily email as well
     ● Discovering what exactly is broken is not too difficult either
       ● Just parse the logs
       ● Will get even easier when all scripts return machine readable information (with appropriate tools)
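Spotting sites where a large fraction of glideins fail amounts to a per-site count. The sketch below assumes the glidein outcomes have already been extracted from the logs into `(site, succeeded)` pairs; that record layout, the function names and the 50% threshold are illustrative assumptions, not the factory's monitoring code.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def failure_fractions(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each site to the fraction of its glideins that failed validation."""
    total: Counter = Counter()
    failed: Counter = Counter()
    for site, succeeded in records:
        total[site] += 1
        if not succeeded:
            failed[site] += 1
    return {site: failed[site] / total[site] for site in total}


def sites_to_investigate(
    records: Iterable[Tuple[str, bool]], threshold: float = 0.5
) -> List[str]:
    """Return the sites whose failure fraction is at or above `threshold`."""
    fractions = failure_fractions(records)
    return sorted(site for site, frac in fractions.items() if frac >= threshold)


if __name__ == "__main__":
    # Made-up outcomes for three hypothetical sites.
    sample = [("A", True), ("A", False), ("B", True), ("B", True), ("C", False)]
    print(sites_to_investigate(sample))
```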
  18. Action items
     ● Not much we can do directly (unless this is the result of a misconfiguration on our part)
     ● Typically, we open a ticket with the site
       ● Provide the list of nodes where it happens (rare to have the whole site broken)
       ● A concise but complete error report is essential for a speedy resolution
     ● In a minority of cases we have to contact the VO FE admin, e.g.
       ● Unclear error messages
       ● Non-WN specific validation errors
  19. Black hole nodes
     ● There is one further WN problem
       ● Black hole WNs
     ● WNs that accept glidein jobs, but don't execute them
       ● glidein_startup never has the chance to log anything
       ● Not even the node it is running on
       ● Thus, empty log files!
     ● We can infer we have a black hole node at a site by looking at job timing (in Condor-G logs)
       ● Good jobs run for at least 20 mins
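The timing heuristic can be sketched like this, assuming the job runtimes have already been extracted from the Condor-G logs. The `GlideinRun` record and the function names are hypothetical; only the 20 minute figure and the empty-log symptom come from the slide.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable

# From the slide: good jobs run for at least 20 minutes.
MIN_GOOD_RUNTIME_S = 20 * 60


@dataclass
class GlideinRun:
    site: str          # site the glidein was submitted to
    runtime_s: float   # wall-clock runtime taken from the Condor-G log
    log_bytes: int     # size of the glidein log file that came back


def looks_like_black_hole(run: GlideinRun) -> bool:
    """A very short run that produced no log at all is suspicious."""
    return run.runtime_s < MIN_GOOD_RUNTIME_S and run.log_bytes == 0


def black_hole_fraction_by_site(runs: Iterable[GlideinRun]) -> Dict[str, float]:
    """Fraction of suspicious runs per site.

    Grouping is by site because a black hole never reports its node name.
    """
    total: Dict[str, int] = defaultdict(int)
    suspect: Dict[str, int] = defaultdict(int)
    for run in runs:
        total[run.site] += 1
        if looks_like_black_hole(run):
            suspect[run.site] += 1
    return {site: suspect[site] / total[site] for site in total}
```

A short run that still returned a log is left out on purpose: it is more likely an ordinary validation failure, which the logs can explain.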
  20. Grid debugging: CE refusing the glideins
  21. CE refusing the glideins
     ● CE admin has the right to refuse anyone
       ● But usually does not change his mind overnight
       ● Accessing a site for the first time is an issue on its own
         – Not covered here
     ● When things go wrong, the typical reason is
       ● CE service down,
       ● Problems in the Security/Auth infrastructure,
       ● CE seriously misconfigured/broken
  22. Expected vs Unexpected
     ● Some “problems” are expected
       ● e.g. the CE is down for scheduled maintenance
       ● Nothing to do in this case!
         – Just a monitoring issue
       ● So, checking the maintenance DB is important!
     ● If not, we have to notify the site
       ● The VO FEs are not getting the CPU slots they are asking for
  23. Discovering the problem
     ● Condor-G reacts in two different ways
       ● Does nothing
         – We still have monitoring showing the job did not progress from Waiting→Pending
       ● Puts the job on Hold
     ● The G.Factory will react on Held jobs
       ● Releasing them a few times → Condor-G retries
       ● Removing them after a while
         – Just to be replaced with identical glideins
     ● Most non-trivial problems do not resolve by themselves
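The release-then-remove reaction to Held jobs can be expressed as a small decision function. This is a sketch of the policy as described on the slide, not the G.Factory implementation; the function name and the limit of three releases are assumptions.

```python
# Hypothetical limit; the slide only says "a few times".
MAX_RELEASES = 3


def held_job_action(release_count: int, max_releases: int = MAX_RELEASES) -> str:
    """Decide what to do with a glidein that Condor-G has put on Hold.

    `release_count` is how many times this job has already been released.
    Returns "release" to let Condor-G retry, or "remove" once the retries are
    exhausted (the removed glidein is then replaced by an identical one).
    """
    if release_count < 0:
        raise ValueError("release_count must not be negative")
    if release_count < max_releases:
        return "release"
    return "remove"


if __name__ == "__main__":
    for count in range(5):
        print(count, held_job_action(count))
```

Because the replacement glidein is identical, this loop only helps with transient problems, which matches the closing remark on the slide.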
  24. Action items (for unexpected problems)
     ● Most of the time, not much we can do directly
       ● Will just open a ticket with the site
       ● If there is any useful info in the HoldReason, we pass it on
       ● DN of the proxy is the most valuable info
     ● But it could be our problem, too
       ● Found many Condor-G problems in the past
       ● Comparing the behavior of many G.Factory instances can confirm or exclude this
       ● Ad-hoc solutions needed if this is the case
  25. Grid debugging: CE not properly handling the glideins
  26. Problematic CE
     ● Three basic types of problems:
       ● Glideins not starting
       ● Improper monitoring information
       ● Output files not being delivered to client
     ● And there are two more
       ● Unexpected policies that kill glideins
  27. Glideins not starting
     ● The CE scheduling policy is not available to us
       ● So often not obvious if we are just low priority or something else is going on
     ● GF/Condor-G does not see it as an error condition
     ● We usually don't act on it, unless
       ● The VO FE admin complains, or
       ● We have been given explicit guidance on the expected startup rates
     ● Not much for us to investigate
       ● Just tell the site admin “Jobs are not starting”
  28. Glideins being killed by the site
     ● Ideally, our glideins should fit within the policies of the site (but getting this info is not trivial, remember?)
     ● But sometimes they don't
       ● So they get killed hard
     ● Discovering this from our side is very hard
       ● We often just notice empty log files
       ● Not an error for Condor-G
       ● Often learn of this because the VO complains
     ● If and when we understand the problem, we can deal with it ourselves
       ● i.e. we configure the glideins to stay within the limits
  29. Preemption
     ● Some sites will preempt our glideins if higher priority jobs get into the queue
       ● Effectively killing our glideins
     ● Not an actual error
       ● Sites have the right to do it!
     ● But it can mess up our monitoring/ops
       ● We may see killed glideins, or
       ● We may see glideins that seem to run for a very long time (when automatically rescheduled on the CE)
     ● We have to efficiently filter these events out
  30. Improper monitoring info from CE
     ● A CE may not provide reliable information
     ● Each VO FE provides us with monitoring information about its central manager
       ● By comparing what it tells us with what the CE tells us, we can infer if there are problems
       ● A large, consistent discrepancy typically signals problems in the CE monitoring
     ● Very difficult to figure out what is going on
       ● We have no direct detailed data to act upon
       ● Mostly ad-hoc detective work, prodding the black box
       ● Often inconclusive
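The comparison between the two information sources can be sketched as a relative difference that must stay large over several consecutive samples. The sample layout (glideins the CE reports as running versus glideins registered with the VO central manager), the 30% threshold and the window of three samples are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def relative_discrepancy(ce_running: int, fe_registered: int) -> float:
    """Relative difference between the CE view and the VO FE view (0.0 to 1.0)."""
    largest = max(ce_running, fe_registered)
    if largest == 0:
        return 0.0  # both sides report nothing, so there is no disagreement
    return abs(ce_running - fe_registered) / largest


def consistent_discrepancy(
    samples: Sequence[Tuple[int, int]],
    threshold: float = 0.3,
    min_consecutive: int = 3,
) -> bool:
    """True if the most recent `min_consecutive` samples all exceed `threshold`.

    `samples` holds (ce_running, fe_registered) pairs in time order. Requiring
    several samples in a row filters out short-lived differences, such as
    glideins that have started but not yet registered.
    """
    if len(samples) < min_consecutive:
        return False
    recent: List[Tuple[int, int]] = list(samples[-min_consecutive:])
    return all(relative_discrepancy(ce, fe) > threshold for ce, fe in recent)
```

A positive result only says that the two views disagree; deciding which side is wrong still needs the detective work described on the slide.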
  31. Lack of output files
     ● The glidein output files contain
       ● Accounting information
       ● Detailed logging
     ● Without other problems, mostly an annoyance
     ● But much more often paired with glideins failing
       ● Making failure diagnostics close to impossible
     ● Extremely hard to diagnose the root cause
       ● Sometimes we may infer it (black holes, killed glideins, ...)
       ● For actual CE problems it requires help from many parties, including us, the site admins and SW developers
  32. Grid debugging: Networking problems
  33. Glideins are network heavy
     ● Each glidein opens several long‑lived TCP connections (in CCB mode)
     ● Can overwhelm networking gear
       – e.g. NATs can run out of spare ports
     ● Problems can have non-linear behavior
       ● Will work fine on small scale
       ● Will degrade after a while
         – Not necessarily a step function, though
     ● Although outright denials due to firewalls are also a problem
  34. Diagnostics and action items
     ● Not trivial to detect
       ● Errors often in the glidein logs
       ● But difficult to interpret
       ● And we are lacking tools for automatically detecting this
     ● Not much we can do directly
       ● A problem between the VO services and the site
         – So we notify both
     ● However, we usually end up assisting as experts
  35. Grid debugging: Authentication problems
  36. Security is delicate stuff
     ● Grid security mechanisms are paranoid by design
       ● “Availability” is the last to be considered
       ● The main focus is keeping the “bad guys” out
     ● So they are extremely delicate
       ● If any piece of the chain breaks, everything breaks
     ● Things that can go wrong (non-exhaustive list):
       ● Missing CA(s)
       ● Expired CRLs
       ● Expired glidein proxy
       ● Wrong system time (clock skew)
  37. Diagnostics and action items
     ● Finding the root cause is usually hard
       ● Errors are in the glidein logs
       ● But usually do not provide enough info (to avoid giving up too much info to a hypothetical attacker)
       ● And we are lacking tools for automatically detecting this
     ● Have to distinguish between site problems and VO problems, too
       ● Only obvious if only a fraction fails (→ WN problem)
       ● Else, may need to get both sides involved to properly diagnose the root cause
  38. Grid debugging: Job startup problems
  39. gLExec (1)
     ● The biggest source of problems, by far, is gLExec refusing to accept a user proxy
       ● Resulting in jobs not starting
       ● BTW, Condor is not good at handling gLExec denials
     ● We can only partially test gLExec during validation
       ● May behave differently based on the proxy used
       ● Its behavior can change in time
     ● And final users may be the source of the problem
       ● e.g. by letting the proxy expire (Condor could catch these, and hopefully soon will)
  40. gLExec (2)
     ● Non-trivial to detect
       ● Errors are in the glidein logs
       ● But we lack the tools to extract them
     ● Finding the root cause is impossible without site admin help
       ● gLExec policies are a site secret
       ● We thus just notify the site, providing the failing user DN
  41. Configuration problems
     ● Condor can be configured to run a wrapper around the user job
       ● To customize the user environment
       ● Usually provided by the VO FE
     ● If that fails, the user job fails with it
     ● Luckily, failures are rare
       ● If we notice them, we notify the VO FE admins
       ● However, they often notice before we do
  42. Other job startup problems
     ● By default, we validate the node only at glidein startup
       ● WN conditions may change by the time a job is scheduled to run
         – e.g. the disk fills up
     ● The errors are usually only seen by the final users
       ● So we hardly ever notice these kinds of problems
     ● We should do better. Condor supports periodic validation tests, we just don't use them right now.
  43. Summary
     ● The Grid world is a good approximation of a chaotic system
       ● There are thus many failure modes
     ● The pilot paradigm hides most of the failures from the final users
       ● But the failures are still there
       ● Resulting in wasted/underused CPU cycles
     ● The G.Factory operators are in the best position to diagnose the root cause of the failures
       ● By having a global view
       ● However, they cannot solve the problems by themselves
  44. Acknowledgments
     ● This document was sponsored by grants from the US NSF and US DOE, and by the UC system