Monitoring and troubleshooting a glideinWMS-based HTCondor pool

A guide for users of glideinWMS-based HTCondor pools on how to monitor the system, and troubleshoot the most common problems.

  1. glideinWMS for users: Monitoring and troubleshooting a glideinWMS-based HTCondor pool, by Igor Sfiligoi (UCSD). CERN, Dec 2012, glideinWMS monitoring.
  2. Scope of this talk
     This talk describes what information is available when troubleshooting a glideinWMS-based HTCondor pool, and what tools you can use to mine it. The reader is expected to already have a basic understanding of HTCondor and glideinWMS.
  3. HTCondor Architecture
     ● As a reminder
     [Figure: architecture diagram. Labeled components: submit nodes (Schedd), central manager (Negotiator), execute nodes, VO FE, G.F., Grid.]
  4. Typical user questions addressed in this talk
     ● Where is/was my job running?
     ● Why are my jobs not starting?
     ● Why do my jobs take forever to finish?
  5. Where is/was my job running?
  6. Job progress monitoring
     ● HTCondor provides two basic means to monitor job progress
       ● Querying the system for current status
         – Using the cmdline condor_q/condor_history
       ● Parsing the job event log
         – Either plain text or XML formatted
         – Starting with 7.9.1, condor_history can be used to extract the last known state
  7. Job status
     ● Each job has a status associated with it
       ● An integer attribute called JobStatus
         – But has well-known semantics associated with each value
     ● Jobs start in the Idle state
     ● Become Running if everything works fine
     ● Completed when they terminate
     ● If anything goes wrong, a job will go into Hold
     ● If removed before completion, will be Removed
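The slide does not list the integer values themselves. As a small illustration, the sketch below maps the standard HTCondor JobStatus codes for the five states named above to their names. The helper name `describe_job_status` is hypothetical, and any other code is simply reported as unknown.

```python
# Standard HTCondor JobStatus codes for the states named on the slide.
# (The slide itself does not list the numeric values.)
JOB_STATUS_NAMES = {
    1: "Idle",
    2: "Running",
    3: "Removed",
    4: "Completed",
    5: "Held",
}


def describe_job_status(code):
    """Return a human-readable name for a JobStatus integer."""
    return JOB_STATUS_NAMES.get(code, "Unknown (%d)" % code)


if __name__ == "__main__":
    for code in (1, 2, 5):
        print(code, describe_job_status(code))
```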
  8. Monitoring the Job Status
     ● Idle/Running/Held jobs can be polled with condor_q
       ● Will query the Schedd daemon
     ● Once they terminate, or are removed, they leave the Schedd queue
       ● Are put into a file on disk
       ● Can use condor_history to retrieve the last ClassAd
       ● One exception: if a job was running when it was removed, but the execute node does not confirm the job was killed remotely, the job will be kept in the Schedd
     ● The job event log has all the state transitions (of course)
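A minimal sketch of that choice between the two tools: it only builds the command line, it does not run anything. The function name `status_query_command` and the `in_queue` flag are hypothetical; the `-format` option and the `JobStatus` attribute are the ones used elsewhere in this talk.

```python
def status_query_command(job_id, in_queue):
    """Build the command line that prints JobStatus for one job.

    job_id   -- "cluster.proc" string, e.g. "20327.2"
    in_queue -- True if the job is still in the Schedd queue,
                False if it has already left it
    """
    # Jobs still in the queue are served by the Schedd (condor_q);
    # jobs that left the queue are read back from disk (condor_history).
    tool = "condor_q" if in_queue else "condor_history"
    return [tool, "-format", "%d\n", "JobStatus", job_id]


if __name__ == "__main__":
    print(status_query_command("20327.2", in_queue=True))
    print(status_query_command("20327.2", in_queue=False))
```

On a submit node the returned list could be handed to `subprocess.run`; a job that shows up in neither tool would need the event log instead.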
  9. So, where is the job running?
     ● Easy to get the machine name and/or IP
       ● Standard HTCondor attributes RemoteHost & StartdIpAddr
     ● But may not necessarily make sense
       ● Do you recognize all network domains?
       ● And they could be on a private network!
  10. Getting glidein attributes
      ● Glideins have many more attributes that describe them
        ● e.g. symbolic site name GLIDEIN_CMSSite
      ● However, by default, you do not get this info in the job ClassAd
      ● But easy to add
        ● <my attr> = $$(<glidein attr>:Unknown)
          – Will get the info in MATCH_EXP_<my attr>
  11. Standard attributes
      ● Standard glideinWMS attributes (useful for in-depth debugging)
        ● JOB_GLIDEIN_Entry_Name = "$$(GLIDEIN_Entry_Name:Unknown)"
        ● JOB_GLIDEIN_Name = "$$(GLIDEIN_Name:Unknown)"
        ● JOB_GLIDEIN_Factory = "$$(GLIDEIN_Factory:Unknown)"
        ● JOB_GLIDEIN_Schedd = "$$(GLIDEIN_Schedd:Unknown)"
        ● JOB_GLIDEIN_ClusterId = "$$(GLIDEIN_ClusterId:...)"
        ● JOB_GLIDEIN_ProcId = "$$(GLIDEIN_ProcId:Unknown)"
        ● JOB_GLIDEIN_Site = "$$(GLIDEIN_Site:Unknown)"
      ● Standard CMS glideinWMS attribute
        ● JOB_CMSSite = "$$(GLIDEIN_CMSSite:Unknown)"
      ● Configured by the HTCondor admin, no need for the user to do anything
          SUBMIT_EXPRS = JOB_GLIDEIN_Entry_Name, JOB_CMSSite, ...
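For a glidein attribute the admin has not already exported, a user can add the same kind of line to a submit file. The sketch below generates such lines from the `<my attr> = $$(<glidein attr>:Unknown)` pattern. The function names are hypothetical, and the leading `+` is an assumption based on the usual condor_submit syntax for custom job attributes; it does not appear on the slides.

```python
def glidein_attr_line(job_attr, glidein_attr, default="Unknown"):
    """Return a submit-file line copying a glidein attribute into the job ad.

    The leading '+' marks a custom job attribute in a submit file
    (assumption: standard condor_submit syntax, not shown on the slides).
    """
    return '+%s = "$$(%s:%s)"' % (job_attr, glidein_attr, default)


def match_exp_name(job_attr):
    """Name of the job attribute holding the expanded value after a match."""
    return "MATCH_EXP_" + job_attr


if __name__ == "__main__":
    print(glidein_attr_line("JOB_CMSSite", "GLIDEIN_CMSSite"))
    print(match_exp_name("JOB_CMSSite"))
```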
  12. Getting them in the event log
      ● You (or the admins) can also propagate the attributes into the event log
          job_ad_information_attrs = JOB_GLIDEIN_Entry_Name, JOB_CMSSite, …
      ● As a result you get “Job Ad” events
          ...
          001 (20327.002.000) 12/03 00:46:33 Job executing on host: <193.48.85.94:38749>
          ...
          028 (20327.002.000) 12/03 00:46:33 Job ad information event triggered.
          TriggerEventTypeNumber = 1
          Cluster = 20327
          EventTypeNumber = 28
          ExecuteHost = "<193.48.85.94:38749>"
          JOB_CMSSite = "T2_FR_IPHC"
          EventTime = "2012-12-03T00:46:33"
          TriggerEventTypeName = "ULOG_EXECUTE"
          Proc = 2
          Subproc = 0
          CurrentTime = time()
          MyType = "ExecuteEvent"
          ...
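A minimal sketch of how the “Job Ad” events could be mined from a plain-text event log. It assumes the layout shown in the example above: a three-digit event code with the job ID in parentheses on the header line, `Key = value` lines for event 028, and `...` as the event terminator. The names `parse_job_ad_events` and `SAMPLE` are hypothetical.

```python
import re

# Header line of an event: "028 (20327.002.000) 12/03 00:46:33 message"
HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\S+ \S+) (.*)$")


def parse_job_ad_events(text):
    """Return a list of (job_id, attrs) for each 028 'Job ad information' event."""
    events = []
    current = None  # attribute dict of the 028 event being read, if any
    for raw in text.splitlines():
        line = raw.strip()
        if line == "...":  # event terminator in the plain-text log
            current = None
            continue
        m = HEADER.match(line)
        if m:
            current = None
            if m.group(1) == "028":
                job_id = "%d.%d" % (int(m.group(2)), int(m.group(3)))
                current = {}
                events.append((job_id, current))
            continue
        if current is not None and " = " in line:
            key, value = line.split(" = ", 1)
            current[key] = value.strip('"')  # values are kept as strings
    return events


SAMPLE = """\
001 (20327.002.000) 12/03 00:46:33 Job executing on host: <193.48.85.94:38749>
...
028 (20327.002.000) 12/03 00:46:33 Job ad information event triggered.
JOB_CMSSite = "T2_FR_IPHC"
TriggerEventTypeName = "ULOG_EXECUTE"
Proc = 2
...
"""

if __name__ == "__main__":
    for job_id, attrs in parse_job_ad_events(SAMPLE):
        print(job_id, attrs.get("JOB_CMSSite"))
```

Collecting `JOB_CMSSite` per job this way answers "where was my job running?" even after the job has left the queue.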
  13. Why is my job not starting?
  14. Troubleshooting process
      ● First question
        ● Do my jobs match any (logical) resource?
      ● Once you are sure of that
        ● Are there jobs from higher priority users?
        ● Are desired sites just too busy?
        ● Are there problems at desired site(s)?
      ● If nothing gives a satisfying answer
        ● It may be a glideinWMS misconfiguration, seek help from VO FE admins
  15. How do I know if my jobs match?
      ● Good question!
      ● Unfortunately, the answer is not trivial
        ● The FE matching policy is not “public”
        ● And, of course, there are no tools to probe for it
      ● You will have to rely on the FE admins to “explain” the policy
        ● Hopefully in a human readable format
        ● Hopefully without conversion errors!
  16. An example FE policy
      ● See the CMS FE talk for an actual high level view
      ● The actual FE policy is a python expression (a simple example, could be much more complex)
          (glidein["attrs"]["GLIDEIN_CMSSite"] in job["DESIRED_Sites"].split(",")) and
          ((glidein["attrs"].get("GLIDEIN_Is_HTPC")=="True") == (job.get("DESIRES_HTPC")==1))
      ● And then there is the matching HTCondor one
          (stringListMember(GLIDEIN_CMSSite,DESIRED_Sites,",")=?=True) &&
          ((GLIDEIN_Is_HTPC=?=True)=?=(DESIRES_HTPC=?=True))
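To check on paper whether a job should match, the example policy can be evaluated by hand against sample ads. The sketch below wraps the Python expression from this slide in a function. The name `fe_match` and the sample values are made up for illustration; a real FE policy may be much more complex.

```python
def fe_match(job, glidein):
    """Evaluate the example frontend policy for one job/glidein pair."""
    # The glidein's site must be one of the comma-separated desired sites.
    site_ok = glidein["attrs"]["GLIDEIN_CMSSite"] in job["DESIRED_Sites"].split(",")
    # HTPC glideins only take HTPC jobs, and non-HTPC glideins only non-HTPC jobs.
    htpc_ok = (glidein["attrs"].get("GLIDEIN_Is_HTPC") == "True") == (
        job.get("DESIRES_HTPC") == 1
    )
    return site_ok and htpc_ok


if __name__ == "__main__":
    # Illustrative ads only; T2_XX_Example is a made-up site name.
    glidein = {"attrs": {"GLIDEIN_CMSSite": "T2_FR_IPHC"}}
    print(fe_match({"DESIRED_Sites": "T2_FR_IPHC,T2_XX_Example"}, glidein))
    print(fe_match({"DESIRED_Sites": "T2_XX_Example"}, glidein))
```

Note that `split(",")` does not strip whitespace, so under this Python expression a `DESIRED_Sites` value written with a space after each comma would only match on its first entry.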
  17. A word about HTCondor matching
      ● Once glideins start, you can probe their policy
          $ condor_status -format '%s\n' START
          ( ( true ) && ( true ) && ( ( stringListMember(GLIDEIN_CMSSite,DESIRED_Sites,",") =?= true ) && ( ( GLIDEIN_Is_HTPC =?= true ) =?= ( DESIRES_HTPC =?= true ) ) ) ) && ( ( GLIDEIN_ToRetire =?= undefined ) || ( CurrentTime < GLIDEIN_ToRetire ) )
          ...
      ● But no tools to help you understand the matchmaking (M.M.)
        ● The closest is condor_q -analyze
          – But only looks at Job requirements
          – So, not really helping when all/most of the policy is in the glideins
  18. User priorities
      ● So, jobs should be matching, but are not starting
        ● And there are plenty of matching glideins in the system
      ● Likely there are other higher-priority jobs in the system
        ● Possibly from a different user
        ● Possibly on a different schedd
      ● No tools to give you the easy answer
        ● If you need the answer, you will have to investigate (warning: slow!)
            condor_userprio
            condor_status -submitters
  19. Unclaimed glideins
      ● If you see plenty of Unclaimed glideins, but no matching jobs from other users
        ● You have either reached the schedd limit MAX_JOBS_RUNNING
        ● Or something bad is going on!
      ● You can only ask your FE admin for help
        ● But first double check that your jobs should indeed be matching, at least on paper
  20. Supported Sites
      ● What should you do if there are no (new) glideins coming from an expected site?
      ● First off, see if the site is even supported by the glideinWMS instance!
      ● Each Entry has a ClassAd
          condor_status -any -const 'MyType=="glideresource"'
        ● Look for the attributes your FE is matching on, e.g. GLIDEIN_CMSSite
      ● Site not there? Notify your FE admin!
  21. Is the FE even asking for them?
      ● You are sure that your jobs should be matching?
      ● But what if you are wrong?
      ● Check it out
          … -format '%i\n' GlideFactoryMonitorRequestedIdle
      ● But remember it is not just your jobs.
  22. Maybe the site is just busy?
      ● Glideins have to compete with other Grid jobs at most sites
      ● Maybe the site is just busy?
      ● Check if glideinWMS has put any glideins in the Grid queues
          … -format '%i\n' GlideFactoryMonitorStatusPending
      ● If you find zeros, notify your FE admin!
  23. Site problems?
      ● The glideins will validate the worker node before talking to the C.M.
        ● If the test fails, the glidein will “waste” 20 mins on the node to prevent other jobs from failing on it
      ● You can check if there are “Running” glideins in glideinWMS, even though you see none (or few) in the C.M.
          … -format '%i\n' GlideFactoryMonitorStatusRunning
      ● If you find a discrepancy, notify your FE admin!
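The three checks on the last slides (requested, pending, running) can be read together as a rough triage. The sketch below is one possible ordering of those checks, not an official glideinWMS procedure. The function name, the returned messages and the "any difference counts as a discrepancy" rule are assumptions; in practice some lag between the factory counters and the central manager is normal.

```python
def triage_entry(requested_idle, status_pending, status_running, running_in_pool):
    """Suggest a next step from the factory monitoring counters of one entry.

    requested_idle  -- GlideFactoryMonitorRequestedIdle
    status_pending  -- GlideFactoryMonitorStatusPending
    status_running  -- GlideFactoryMonitorStatusRunning
    running_in_pool -- glideins of this entry visible in the central manager
                       (counted separately by the caller)
    """
    if requested_idle == 0:
        return "FE is not requesting glideins: re-check that your jobs match"
    if status_pending == 0 and status_running == 0:
        return "nothing queued at the site: notify your FE admin"
    if status_running > running_in_pool:
        return ("glideins run but do not join the pool: "
                "possible site problem, notify your FE admin")
    if status_pending > 0:
        return "glideins are waiting in the Grid queue: the site is probably busy"
    return "no obvious supply problem for this entry"


if __name__ == "__main__":
    print(triage_entry(5, 3, 4, 4))
    print(triage_entry(5, 0, 10, 2))
```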
  24. Still no clue?
      ● If all your detective work fails
        ● Notify your VO FE admin
        ● They have access to information you do not
  25. Why do my jobs take forever to finish?
  26. My jobs are running, but...
      ● Great, your jobs are happily running
      ● But you are getting no results back!
        ● i.e. the jobs are not finishing in the expected time
      ● Two main likely reasons
        ● They are being restarted
        ● You miscalculated the needed time
  27. Jobs re-starting
      ● HTCondor tries to be user friendly
        ● If a job gets preempted, for almost any reason, it will try to re-start it with the hope it will finish on the next try
        ● And will not ever give up! (by default)
      ● You can easily check how many times it started
          condor_q -format '%i\n' NumJobStarts
      ● You may want to cap the number with periodic_hold/remove
          http://research.cs.wisc.edu/htcondor/manual/v7.8/condor_submit.html#condor-submit-periodic-remove
          http://research.cs.wisc.edu/htcondor/manual/v7.8/3_3Configuration.html#param:SystemPeriodicRemove
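As an illustration of such a cap, the sketch below generates a `periodic_hold` or `periodic_remove` line for a submit file. The function name and the exact expression are assumptions, not taken from the slides. The `JobStatus == 1` (Idle) clause is there so the rule only triggers once the job has fallen back to Idle after its last allowed start, rather than interrupting that run.

```python
def restart_cap_line(max_starts, action="hold"):
    """Return a submit-file line that stops a job after too many starts.

    max_starts -- number of starts allowed before the job is held/removed
    action     -- "hold" (periodic_hold) or "remove" (periodic_remove)
    """
    if action not in ("hold", "remove"):
        raise ValueError("action must be 'hold' or 'remove'")
    if max_starts < 1:
        raise ValueError("max_starts must be at least 1")
    # Trigger only when the job is Idle again (JobStatus == 1) after
    # having started max_starts times.
    return "periodic_%s = (NumJobStarts >= %d) && (JobStatus == 1)" % (
        action,
        max_starts,
    )


if __name__ == "__main__":
    print(restart_cap_line(3))
    print(restart_cap_line(5, action="remove"))
```

Holding rather than removing keeps the job ad in the queue, so the reason for the restarts can still be inspected with condor_q.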
  28. Why is it restarting?
      ● OK, I now know it is restarting... but why?
      ● Most likely, the glidein was killed
        ● Was it due to your job “misbehaving”?
      ● Most Grid sites have limits on resource use
        ● Including CPU, memory and disk
        ● If you exceed them, the glidein (and you) will be killed
      ● Glideins should be configured to detect and hold/remove your job if you “misbehave”
        ● Thus you would not be re-started
      ● If you see many restarts, notify your FE admin
        ● Likely there is a policy rule missing
  29. What is my job doing?
      ● What if it is not restarting... just running forever (or until hitting the time limit)?
      ● HTCondor allows for peeking at a running job
        ● A cmdline tool called condor_ssh_to_job
        ● Unfortunately, needs implicit permission from the site
          – And about half of the sites do not allow it
  30. The End
  31. Pointers
      ● glideinWMS Home Page
          http://tinyurl.com/glideinWMS
      ● HTCondor Home Page
          http://research.cs.wisc.edu/htcondor/
      ● HTCondor support
          htcondor-users@cs.wisc.edu
          htcondor-admin@cs.wisc.edu
      ● glideinWMS support
          glideinwms-support@fnal.gov
  32. Acknowledgments
      ● The creation of this document was sponsored by grants from the US NSF and US DOE, and by the University of California system
