glideinWMS Frontend Monitoring - glideinWMS Training Jan 2012
This talk walks you through the monitoring options a glideinWMS Frontend operator has.
Part of the glideinWMS Training session held in Jan 2012 at UCSD.

Published in: Technology, News & Politics

Transcript

  • 1. glideinWMS Training @ UCSD: glideinWMS Frontend Monitoring, by Igor Sfiligoi (UCSD). UCSD, Jan 18th 2012.
  • 2. Overview
      ● Refresher
      ● What is available
      ● What to look for
  • 3. Refresher – glideinWMS
      ● A glidein is just a properly configured Condor execution node submitted as a Grid job
      ● Frontend drives submission
      [Diagram labels: Configure Condor, G.N., Submit node, Frontend node, Worker node, Monitor, Frontend, Condor, Central manager, glidein Startd, Match, Globus, Job, Request glideins, Factory node, Factory, Execution node, CREAM, Submit glideins.]
  • 4. Reminder: Condor is king! (glideinWMS is just a small layer on top)
  • 5. Refresher – Frontend arch
      ● Many Groups
      ● With a “Master” Frontend as an aggregator
      [Diagram labels: Submit node, Factory, Central manager, Frontend node, Frontend, Group, Entry, glidein, Spawn, Web Server.]
  • 6. Available monitoring
      ● Condor monitoring
          ● It is just a Condor pool! (Even if a dynamic one)
          ● Any Condor monitoring tools will work
      ● VO Frontend monitoring
          ● The VO Frontend provides some basic Condor monitoring
          ● Plus the monitoring of its own internal workings
      ● Glidein Factory monitoring
          ● You should not need to use it, but it is publicly accessible
  • 7. Condor monitoring
  • 8. Condor Monitoring
      ● Out of the box you get
          ● Command line tools
          ● Log parsing
      ● Several external tools available, e.g.
          ● CondorView (Condor external package)
          ● CycleServer (commercial tool, (semi-)free for academia)
      ● Your portal may provide additional monitoring, too
  • 9. Glidein monitoring
      ● The glideins will register with the Collector
      ● Condor command to monitor them: condor_status
          ● -constraint - To select a subset of them (same syntax as Requirements)
          ● -total - For a quick summary
      ● Output formatting options
          ● No arguments - In use/unused
          ● -long - Full ClassAds
          ● -format - Select attributes only
          ● -xml - xml formatting (easier to machine parse)
      http://www.cs.wisc.edu/condor/manual/v7.6/condor_status.html
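The options on this slide can be combined freely. As a minimal sketch, the Python helper below only assembles the argument list; the function name `build_condor_status_cmd` and its parameters are hypothetical, and only the flags named on the slide are used.

```python
def build_condor_status_cmd(constraint=None, formats=(), xml=False, total=False):
    """Assemble an argv list for condor_status (hypothetical helper).

    formats is a sequence of (printf-style format, ClassAd attribute) pairs,
    each emitted as one "-format" option.
    """
    cmd = ["condor_status"]
    for fmt, attr in formats:
        cmd += ["-format", fmt, attr]
    if xml:
        cmd.append("-xml")
    if total:
        cmd.append("-total")
    if constraint:
        cmd += ["-constraint", constraint]
    return cmd


# Select glideins with a long wall time and print two attributes per glidein.
cmd = build_condor_status_cmd(
    constraint="GLIDEIN_Max_Walltime>83000",
    formats=[("%-50s", "Name"), ("%6i\\n", "GLIDEIN_Max_Walltime")],
)
print(cmd)
```

On a node with Condor installed, the list could be handed to `subprocess.run(cmd, capture_output=True, text=True)`; passing a list avoids shell quoting problems with the constraint expression.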
  • 10. Example
      $ condor_status
      Name               OpSys Arch   State     Activity LoadAv Mem   ActvtyTime
      glidein_17848@alic LINUX X86_64 Claimed   Busy      7.440 18037 0+01:06:06
      glidein_15842@alic LINUX X86_64 Claimed   Busy      7.010 18037 0+00:35:21
      glidein_18249@alic LINUX X86_64 Claimed   Busy      7.510 18037 0+01:24:09
      glidein_17825@wn89 LINUX X86_64 Unclaimed Idle     11.990 16056 0+00:15:12
      glidein_10082@wn91 LINUX X86_64 Claimed   Idle      7.000 16056 0+00:02:46
      …
      glidein_3964@wp-05 LINUX X86_64 Claimed   Busy     24.000 64464 0+16:00:29
      glidein_5614@wp-05 LINUX X86_64 Claimed   Busy     23.360 64464 0+16:12:56
      glidein_5861@wp-05 LINUX X86_64 Claimed   Retiring 22.140 64464 0+00:23:18

                   Total Owner Claimed Unclaimed Matched Preempting Backfill
      X86_64/LINUX 23249     0   22697       552       0          0        0
             Total 23249     0   22697       552       0          0        0
  • 11. Another example
      $ condor_status -format "%-50s" Name -format "%6i\n" GLIDEIN_Max_Walltime -const "GLIDEIN_Max_Walltime>83000"
      glidein_10001@we017.grid.hep.ph.ic.ac.uk           86040
      glidein_10006@rossmann-a292.rcac.purdue.edu       114840
      glidein_10007@we033.grid.hep.ph.ic.ac.uk           86040
      ...
      glidein_9990@lxbra6310.cern.ch                    114840
      glidein_9990@rossmann-a212.rcac.purdue.edu        114840
      glidein_9993@grid191.lal.in2p3.fr                 114840

      $ condor_status -format "%-50s" Name -format "%6i\n" GLIDEIN_Max_Walltime -xml -const "GLIDEIN_Max_Walltime>83000"
      <?xml version="1.0"?>
      <!DOCTYPE classads SYSTEM "classads.dtd">
      <classads>
      <c>
          <a n="MyType"><s>Machine</s></a>
          <a n="TargetType"><s>Job</s></a>
          <a n="Name"><s>glidein_10001@we017.grid.hep.ph.ic.ac.uk</s></a>
          <a n="GLIDEIN_Max_Walltime"><i>86040</i></a>
          <a n="CurrentTime"><e>time()</e></a>
      </c>
      ...
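The XML form is the easier one to consume from a script. The following is a minimal sketch using only the Python standard library, assuming the output has the shape shown on the slide (`<classads>` containing `<c>` ads, each with `<a n="...">` attributes wrapping a typed value such as `<s>`, `<i>`, or `<e>`). The function name and the closing tags in the sample are illustrative additions, since the slide truncates the output.

```python
import xml.etree.ElementTree as ET


def parse_classads_xml(xml_text):
    """Return one dict per <c> ClassAd; <i> values become int, others stay text."""
    root = ET.fromstring(xml_text)
    ads = []
    for c in root.iter("c"):
        ad = {}
        for a in c.findall("a"):
            if len(a) == 0:
                continue  # attribute without a typed child element; skipped here
            value = a[0]
            ad[a.get("n")] = int(value.text) if value.tag == "i" else value.text
        ads.append(ad)
    return ads


# First ad from the slide, closed off so that it is a complete document.
SAMPLE_XML = """<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
<c>
    <a n="MyType"><s>Machine</s></a>
    <a n="TargetType"><s>Job</s></a>
    <a n="Name"><s>glidein_10001@we017.grid.hep.ph.ic.ac.uk</s></a>
    <a n="GLIDEIN_Max_Walltime"><i>86040</i></a>
    <a n="CurrentTime"><e>time()</e></a>
</c>
</classads>"""

ads = parse_classads_xml(SAMPLE_XML)
print(ads[0]["Name"], ads[0]["GLIDEIN_Max_Walltime"])
```

Expressions (`<e>`) are kept as unevaluated text in this sketch; other value types that Condor may emit are likewise left as strings.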
  • 12. Collector log(s)
      ● Place to look when things seem fishy!
      ● The Collector(s) will log any errors
      ● The interesting errors will likely be in the leaves of the Collector tree
          ~condor/glidecondor/condor_local/log/CondorXXXLog
      ● Logs rotate, so be sure to look in .old as well
      ● You also get the glidein authentication logs (yes, you will have 100s of them!)
      ● Log verbosity can be further increased with COLLECTOR_DEBUG
          http://www.cs.wisc.edu/condor/manual/v7.6/3_3Configuration.html#param:SubsysDebug
  • 13. Example
      01/13/12 17:24:13 ZKM: 1: attempting to map /DC=org/DC=doegrids/OU=Services/CN=uscmspilot47/glidein-1.t2.ucsd.edu
      01/13/12 17:24:13 ZKM: 2: mapret: 0 included_voms: 0 canonical_user: glidein47
      01/13/12 17:24:13 ZKM: successful mapping to glidein47
      ...
      01/13/12 17:24:19 condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 4 bytes from <130.104.133.245:7812>.
      01/13/12 17:24:19 DaemonCore: Can't receive command request from 130.104.133.245 (perhaps a timeout?)
      ...
      01/13/12 17:24:41 CCB: rejecting request from SHADOW <169.228.130.26:9615?sock=10716_242d_201658> on <169.228.130.26:21142> for ccbid 14018 because no daemon is currently registered with that id (perhaps it recently disconnected).
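With hundreds of authentication entries per log, a small filter helps separate routine mappings from the lines worth reading. The sketch below is one possible approach in Python: it counts successful ZKM mappings per mapped user and collects lines containing a few keywords. The keyword list (`failed`, `rejecting`, `Can't`) is an assumption drawn from the three problem lines in the example above, not a complete list of Collector error messages.

```python
import re
from collections import Counter

# "MM/DD/YY HH:MM:SS message", as in the Collector log example.
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")
MAPPED_RE = re.compile(r"ZKM: successful mapping to (\S+)")
PROBLEM_WORDS = ("failed", "rejecting", "Can't")  # assumed keywords


def summarize_collector_log(lines):
    """Return (Counter of mapped users, list of (timestamp, message) problems)."""
    mapped = Counter()
    problems = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        stamp, msg = m.groups()
        ok = MAPPED_RE.match(msg)
        if ok:
            mapped[ok.group(1)] += 1
        elif any(word in msg for word in PROBLEM_WORDS):
            problems.append((stamp, msg))
    return mapped, problems


SAMPLE_LOG = [
    "01/13/12 17:24:13 ZKM: 2: mapret: 0 included_voms: 0 canonical_user: glidein47",
    "01/13/12 17:24:13 ZKM: successful mapping to glidein47",
    "01/13/12 17:24:19 condor_read() failed: recv() returned -1, errno = 104 "
    "Connection reset by peer, reading 4 bytes from <130.104.133.245:7812>.",
    "01/13/12 17:24:19 DaemonCore: Can't receive command request from "
    "130.104.133.245 (perhaps a timeout?)",
    "01/13/12 17:24:41 CCB: rejecting request from SHADOW "
    "<169.228.130.26:9615?sock=10716_242d_201658> on <169.228.130.26:21142> "
    "for ccbid 14018 because no daemon is currently registered with that id "
    "(perhaps it recently disconnected).",
]

mapped, problems = summarize_collector_log(SAMPLE_LOG)
print(mapped)
for stamp, msg in problems:
    print(stamp, msg[:60])
```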
  • 14. Job monitoring
      ● You can monitor local jobs
          ● For jobs still in the queue (still waiting or running): condor_q
          ● For finished jobs: condor_history (limited number of jobs preserved)
      ● Similar cmdline args as condor_status
      ● Remote condor_q possible with -name
      http://www.cs.wisc.edu/condor/manual/v7.6/condor_q.html
      http://www.cs.wisc.edu/condor/manual/v7.6/condor_history.html
  • 15. Example
      $ condor_q
      -- Submitter: my.node : <192.168.130.11:9615?sock=9763_cd4c_2> : my.node
       ID         OWNER      SUBMITTED   RUN_TIME   ST PRI SIZE   CMD
      367788.0    uscms2330  1/6  11:03  0+00:00:00 I  0      0.0 CMSSW.sh 1 1
      367788.1    uscms2330  1/6  11:03  0+00:00:00 I  0      0.0 CMSSW.sh 2 1
      383995.19   uscms1789  1/11 02:26  2+13:35:38 R  0   1953.1 CMSSW.sh 1118 4
      383995.179  uscms1789  1/11 02:26  2+11:29:06 R  0   1464.8 CMSSW.sh 1310 4
      383999.32   uscms1789  1/11 02:31  2+09:36:12 R  0   1953.1 CMSSW.sh 299 4
      383999.46   uscms1789  1/11 02:31  2+11:00:25 R  0   1953.1 CMSSW.sh 316 4
      …
      385002.7    uscms3015  1/13 17:31  0+00:01:51 R  0      0.0 CMSSW.sh 70 2
      385002.8    uscms3015  1/13 17:31  0+00:01:49 R  0      0.0 CMSSW.sh 89 2
      385002.9    uscms3015  1/13 17:31  0+00:01:29 R  0      0.0 CMSSW.sh 91 2
      385002.10   uscms3015  1/13 17:31  0+00:01:00 R  0      0.0 CMSSW.sh 97 2

      58707 jobs; 39484 idle, 11694 running, 7529 held
  • 16. Job logs
      ● Users are encouraged to have a log for jobs
      ● Provides an easy way to monitor the progress without calling condor_q/condor_history
          000 (001.000.000) 12/15 12:28:05 Job submitted from host: <127.0.0.1:43569>
          ...
          001 (001.000.000) 12/16 08:30:02 Job executing on host: <169.228.130.213:36422>
          ...
          005 (001.000.000) 12/16 13:30:32 Job terminated.
              (1) Normal termination (return value 0)
                  Usr 0 01:00:00, Sys 0 00:05:00  -  Run Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                  Usr 0 01:00:00, Sys 0 00:05:00  -  Total Remote Usage
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
              217  -  Run Bytes Sent By Job
              76  -  Run Bytes Received By Job
              217  -  Total Bytes Sent By Job
              76  -  Total Bytes Received By Job
          ...
      ● The "..." separator lines appear literally in the log
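Because each event starts with a three-digit code, a job ID in parentheses, and a timestamp, the job log is easy to follow from a script. The sketch below, in Python, keeps the most recent event per job. Only the three codes visible in the example (000, 001, 005) are named; the function name and the fallback label for other codes are illustrative.

```python
import re

# "CODE (cluster.proc.subproc) MM/DD HH:MM:SS text", as in the example above.
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$"
)
EVENT_NAMES = {"000": "submitted", "001": "executing", "005": "terminated"}


def latest_job_states(lines):
    """Map 'cluster.proc' to (state, timestamp) of the last event seen for it."""
    states = {}
    for line in lines:
        m = EVENT_RE.match(line.strip())
        if not m:
            continue  # detail lines and "..." separators are ignored
        code, cluster, proc, _subproc, stamp, _text = m.groups()
        job_id = f"{int(cluster)}.{int(proc)}"
        states[job_id] = (EVENT_NAMES.get(code, "event " + code), stamp)
    return states


SAMPLE_JOB_LOG = [
    "000 (001.000.000) 12/15 12:28:05 Job submitted from host: <127.0.0.1:43569>",
    "...",
    "001 (001.000.000) 12/16 08:30:02 Job executing on host: <169.228.130.213:36422>",
    "...",
    "005 (001.000.000) 12/16 13:30:32 Job terminated.",
    "    (1) Normal termination (return value 0)",
    "...",
]

states = latest_job_states(SAMPLE_JOB_LOG)
print(states)
```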
  • 17. Condor Daemon logs
      ● By default
          ● Schedd writes a log
              /opt/glidecondor/condor_local/log/ScheddLog
          ● Shadows share a common log
              /opt/glidecondor/condor_local/log/ShadowLog
      ● The logs rotate, look for .old files as well
      ● Lots of interesting info in them
          ● Quite high verbosity by default
  • 18. ScheddLog Example
      01/12/12 20:38:37 (pid:32035) ZKM: 2: mapret: 0 included_voms: 0 canonical_user: cfeng
      01/12/12 20:38:37 (pid:32035) ZKM: successful mapping to cfeng
      01/12/12 20:38:37 (pid:28485) GET_JOB_CONNECT_INFO failed: No such job: 157170.4
      01/12/12 20:39:27 (pid:32035) ZKM: 1: attempting to map /DC=org/DC=doegrids/OU=Services/CN=rokpilot01/osg.ctbp.ucsd.edu
      01/12/12 20:39:27 (pid:32035) ZKM: 2: mapret: 0 included_voms: 0 canonical_user: pilot
      01/12/12 20:39:27 (pid:32035) ZKM: successful mapping to pilot
      01/12/12 20:39:32 (pid:32035) Shadow pid 11058 for job 157170.82 exited with status 100
      ...
      01/13/12 18:05:02 (pid:32035) Activity on stashed negotiator socket: <169.228.40.37:60824>
      01/13/12 18:05:02 (pid:32035) Using negotiation protocol: NEGOTIATE
      01/13/12 18:05:02 (pid:32035) Negotiating for owner: cfeng@osg.ctbp.ucsd.edu
      01/13/12 18:05:02 (pid:32035) Finished negotiating for cfeng in local pool: 4 matched, 1 rejected
      01/13/12 18:05:04 (pid:32035) Completed REQUEST_CLAIM to startd glidein_14642@cabinet-0-0-30.t2.ucsd.edu <169.228.131.224:55393?CCBID=169.228.40.37:9673#57&noUDP> for cfeng
      01/13/12 18:05:04 (pid:32035) Completed REQUEST_CLAIM to startd glidein_16403@cabinet-1-1-0.t2.ucsd.edu <169.228.131.217:34929?CCBID=169.228.40.37:9623#108&noUDP> for cfeng
      01/13/12 18:05:04 (pid:32035) Completed REQUEST_CLAIM to startd glidein_13015@cabinet-2-2-24.t2.ucsd.edu <169.228.131.156:33784?CCBID=169.228.40.37:9708#95&noUDP> for cfeng
      01/13/12 18:05:04 (pid:32035) Starting add_shadow_birthdate(157177.138)
      01/13/12 18:05:04 (pid:32035) Started shadow for job 157177.138 on glidein_14642@cabinet-0-0-30.t2.ucsd.edu <169.228.131.224:55393?CCBID=169.228.40.37:9673#57&noUDP> for cfeng, (shadow pid = 5238)
  • 19. ShadowLog Example
      01/12/12 21:52:36 DaemonCore: command socket at <169.228.40.37:47586?noUDP>
      01/12/12 21:52:36 DaemonCore: private command socket at <169.228.40.37:47586>
      01/12/12 21:52:36 Setting maximum accepts per cycle 4.
      01/12/12 21:52:36 Initializing a VANILLA shadow for job 157171.108
      01/12/12 21:52:36 (157171.97) (32318): Request to run on glidein_23261@cabinet-2-2-26.t2.ucsd.edu <169.228.131.154:48495?CCBID=169.228.40.37:9644#87&noUDP> was ACCEPTED
      01/12/12 21:52:36 (157170.39) (10937): Job 157170.39 terminated: exited with status 0
      01/12/12 21:52:36 (157170.39) (10937): **** condor_shadow (condor_SHADOW) pid 10937 EXITING WITH STATUS 100
      …
      01/13/12 18:01:08 (157177.52) (4768): DoUpload: (Condor error code 12, subcode 28) SHADOW at 169.228.40.37 failed to send file(s) to <169.228.131.178:47976>; STARTER at 169.228.131.178 failed to write to file /data6/condor_local/execute/dir_13776/glide_J13834/execute/dir_19620/griddatasourceB.tgz: (errno 28) No space left on device
      01/13/12 18:01:15 (157177.52) (4768): **** condor_shadow (condor_SHADOW) pid 4768 EXITING WITH STATUS 112
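In the ShadowLog example, status 100 follows a normal job termination and status 112 follows a failed file transfer. A quick tally of exit statuses is therefore a cheap first check. The Python sketch below assumes each "EXITING WITH STATUS" message sits on one physical line in the real file (the slide wraps it); treating only 100 as expected is an assumption taken from this example and is exposed as a parameter.

```python
import re
from collections import Counter

EXIT_RE = re.compile(
    r"\((\d+\.\d+)\) \((\d+)\): \*\*\*\* condor_shadow \(condor_SHADOW\) "
    r"pid \d+ EXITING WITH STATUS (\d+)"
)


def shadow_exit_summary(lines, expected=(100,)):
    """Count shadow exit statuses and list (job id, status) for unexpected ones."""
    counts = Counter()
    unexpected = []
    for line in lines:
        m = EXIT_RE.search(line)
        if not m:
            continue
        job_id, status = m.group(1), int(m.group(3))
        counts[status] += 1
        if status not in expected:
            unexpected.append((job_id, status))
    return counts, unexpected


SAMPLE_SHADOW_LOG = [
    "01/12/12 21:52:36 (157170.39) (10937): Job 157170.39 terminated: exited with status 0",
    "01/12/12 21:52:36 (157170.39) (10937): **** condor_shadow (condor_SHADOW) "
    "pid 10937 EXITING WITH STATUS 100",
    "01/13/12 18:01:15 (157177.52) (4768): **** condor_shadow (condor_SHADOW) "
    "pid 4768 EXITING WITH STATUS 112",
]

counts, unexpected = shadow_exit_summary(SAMPLE_SHADOW_LOG)
print(dict(counts), unexpected)
```

For each job reported as unexpected, the lines just before its exit message (here the "No space left on device" entry) usually carry the actual cause.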
  • 20. Submitter ClassAds
      ● The schedd will advertise two types of ClassAds to the Collector
          ● Schedd daemon ClassAds: condor_status -schedd
          ● Per-user ClassAds: condor_status -submitter
      ● Can be useful for getting a summary view of the system
  • 21. Example
      $ condor_status -schedd
      Name                 Machine    TotalRunningJobs TotalIdleJobs TotalHeldJobs
      cmsfnal01.fnal.gov   cmsfnal01.                0             0             0
      glidein-2.t2.ucsd.ed glidein-2.            10932         38480          7607
      submit-2.t2.ucsd.edu submit-2.t            11103          8955          1667
      vocms120.cern.ch     vocms120.c                0          4024             2

                           TotalRunningJobs TotalIdleJobs TotalHeldJobs
                     Total            22035         51459          9276

      $ condor_status -schedd -l submit-2.t2.ucsd.edu
      Name = "submit-2.t2.ucsd.edu"
      MaxJobsRunning = 20000
      TotalHeldJobs = 1667
      TotalIdleJobs = 9347
      …
      TotalJobAds = 22096
      TransferQueueDownloadWaitTime = 0
      MyType = "Scheduler"
  • 22. Example
      $ condor_status -submitter
      Name                 Machine    Running IdleJobs HeldJobs
      uscms1789@glidein-2. glidein-2.     344        0       20
      uscms1811@glidein-2. glidein-2.     176     1141        0
      uscms1976@glidein-2. glidein-2.     629        0        7
      …
      uscms742@submit-2.t2 submit-2.t     405        0        0
      cms1279@vocms120.cer vocms120.c       0     4000        0

                           RunningJobs IdleJobs HeldJobs
      uscms019@submit-2.t2          11        0        1
      uscms1537@glidein-2.           0        0        1
      uscms1811@glidein-2.         176     1141        0
      uscms1811@submit-2.t         177     3324        0
      …
      uscms742@glidein-2.t        3107      289       41
      uscms742@submit-2.t2         405        0        0
      uscms911@glidein-2.t           0        0       42
                     Total       22092    51518     9280
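Submitter ads are per user and per schedd, so one user can appear several times (uscms1811 and uscms742 in the example above). As an illustrative Python sketch, the helper below folds rows into per-user totals; the row layout `(name, running, idle, held)` and the function name are assumptions, and the numbers are copied from the slide.

```python
from collections import defaultdict


def totals_per_user(rows):
    """Sum (running, idle, held) over all schedds for each user.

    Each row is (submitter name "user@schedd", running, idle, held).
    """
    totals = defaultdict(lambda: [0, 0, 0])
    for name, running, idle, held in rows:
        user = name.split("@", 1)[0]
        for i, value in enumerate((running, idle, held)):
            totals[user][i] += value
    return {user: tuple(values) for user, values in totals.items()}


SUBMITTER_ROWS = [
    ("uscms1811@glidein-2.", 176, 1141, 0),
    ("uscms1811@submit-2.t", 177, 3324, 0),
    ("uscms742@glidein-2.t", 3107, 289, 41),
    ("uscms742@submit-2.t2", 405, 0, 0),
]

per_user = totals_per_user(SUBMITTER_ROWS)
for user, (running, idle, held) in sorted(per_user.items()):
    print(f"{user}: running={running} idle={idle} held={held}")
```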
  • 23. Negotiator Monitoring
      ● To check for user priorities, use condor_userprio
          ● -allusers - Without it, only running users
          ● -all - Provides detailed info
      ● Negotiator Log useful to troubleshoot
          ~/glidecondor/condor_local/log/NegotiatorLog
          ● Look for errors and monitor cycle times
      ● Negotiator also advertises a ClassAd
          ● Use condor_status -negotiator -long
  • 24. Example 1/2
      $ condor_userprio -all -allusers
      Last Priority Update:  1/13 18:33
                                     Effective   Real     Priority     Res  ...
      User Name                      Priority    Priority Factor       Used ...
      ------------------------------ --------- -------- ------------ ---- ...
      cmspa0029@submit-2.t2.ucsd.edu    158.01    15.80        10.00    0 ...
      cmspa0029@glidein-2.t2.ucsd.ed    205.37    20.54        10.00    0 ...
      uscms506@glidein-2.t2.ucsd.edu    559.11     0.56      1000.00    0 ...
      uscms2450@glidein-2.t2.ucsd.ed    576.15     0.58      1000.00    0 ...
      uscms3501@submit-2.t2.ucsd.edu    775.26     0.78      1000.00    0 ...
      shi034@glidein-2.t2.ucsd.edu      827.95     0.83      1000.00    0 ...
      uscms2450@submit-2.t2.ucsd.edu   1455.42     1.46      1000.00    0 ...
      uscms4043@glidein-2.t2.ucsd.ed   1677.00     1.68      1000.00    0 ...
      uscms2336@glidein-2.t2.ucsd.ed   2113.44     2.11      1000.00    0 ...
      uscms2330@glidein-2.t2.ucsd.ed   2493.31     2.49      1000.00    0 ...
      uscms4084@glidein-2.t2.ucsd.ed   2506.61     2.51      1000.00    0 ...
      uscms2330@submit-2.t2.ucsd.edu   2771.17     2.77      1000.00    0 ...
      uscms4043@submit-2.t2.ucsd.edu   5150.52     5.15      1000.00    0 ...
      uscms2535@glidein-2.t2.ucsd.ed   5357.76     5.36      1000.00  176 ...
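The three priority columns are related: Condor's effective priority is the real priority multiplied by the priority factor. The short Python check below recomputes it for three rows copied from the table. Since the table prints the real priority with two decimals, the match is only expected within a rounding tolerance; that tolerance formula is an assumption of this sketch.

```python
def effective_priority(real_priority, priority_factor):
    """Effective priority = real priority x priority factor."""
    return real_priority * priority_factor


# (user, effective as shown, real as shown, factor), copied from the table.
PRIO_ROWS = [
    ("cmspa0029@submit-2.t2.ucsd.edu", 158.01, 15.80, 10.00),
    ("uscms506@glidein-2.t2.ucsd.edu", 559.11, 0.56, 1000.00),
    ("uscms2535@glidein-2.t2.ucsd.ed", 5357.76, 5.36, 1000.00),
]

checks = []
for user, shown_eff, real, factor in PRIO_ROWS:
    computed = effective_priority(real, factor)
    # Real priority is displayed rounded to 0.01, so allow half a unit
    # in the last digit, scaled by the factor, plus display rounding.
    tolerance = 0.005 * factor + 0.005
    ok = abs(computed - shown_eff) <= tolerance
    checks.append(ok)
    print(f"{user}: computed {computed:.2f}, shown {shown_eff:.2f}, consistent: {ok}")
```

This also explains why users with factor 1000.00 sort below the factor 10.00 users despite much smaller real priorities: a lower effective priority value is better.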
  • 25. Example 2/2
      $ condor_userprio -all -allusers
      Last Priority Update:  1/13 18:33
                                     … Total Usage Usage            Last
      User Name                      … (wghted-hrs) Start Time       Usage Time
      ------------------------------ … ----------- ---------------- ----------------
      cmspa0029@submit-2.t2.ucsd.edu …    82863.87 10/03/2011 01:41  1/11/2012 07:05
      cmspa0029@glidein-2.t2.ucsd.ed …   202430.74 10/31/2011 01:30  1/12/2012 02:00
      uscms506@glidein-2.t2.ucsd.edu …   437667.09  7/02/2011 08:06  1/08/2012 07:29
      uscms2450@glidein-2.t2.ucsd.ed …    47024.87 10/09/2011 13:26  1/07/2012 01:28
      uscms3501@submit-2.t2.ucsd.edu …     3677.14 11/23/2011 08:12  1/10/2012 01:02
      shi034@glidein-2.t2.ucsd.edu   …  1309024.85  6/03/2009 00:48  1/07/2012 15:57
      uscms2450@submit-2.t2.ucsd.edu …    81864.63  9/26/2011 15:22  1/07/2012 05:46
      uscms4043@glidein-2.t2.ucsd.ed …     6966.57 10/10/2011 22:48  1/09/2012 17:35
      uscms2336@glidein-2.t2.ucsd.ed …    57125.01  5/27/2011 02:00  1/09/2012 21:13
      uscms2330@glidein-2.t2.ucsd.ed …    85581.04  8/06/2011 12:45  1/09/2012 07:45
      uscms4084@glidein-2.t2.ucsd.ed …   158894.51 10/11/2011 11:11  1/08/2012 17:17
      uscms2330@submit-2.t2.ucsd.edu …    13528.66  9/05/2011 02:15  1/09/2012 23:46
      uscms4043@submit-2.t2.ucsd.edu …    10824.76  9/28/2011 05:02  1/09/2012 03:27
      uscms2535@glidein-2.t2.ucsd.ed …   304430.61 11/17/2009 11:04  1/13/2012 18:33
  • 26. NegotiatorLog Example
      01/13/12 18:23:05 ---------- Finished Negotiation Cycle ----------
      01/13/12 18:24:09 ---------- Started Negotiation Cycle ----------
      01/13/12 18:24:09 Phase 1: Obtaining ads from collector ...
      01/13/12 18:24:09 Getting all public ads ...
      01/13/12 18:24:44 Sorting 23021 ads ...
      01/13/12 18:24:46 Getting startd private ads ...
      01/13/12 18:24:51 Got ads: 23021 public and 22571 private
      01/13/12 18:24:51 Public ads include 38 submitter, 22568 startd
      01/13/12 18:24:51 Phase 2: Performing accounting ...
      01/13/12 18:25:01 Phase 3: Sorting submitter ads by priority ...
      01/13/12 18:25:01 Phase 4.1: Negotiating with schedds ...
      01/13/12 18:25:01 Negotiating with sfiligoi@submit-2.t2.ucsd.edu at <169.228.130.26:9615?sock=10263_1229_2>
      01/13/12 18:25:01 0 seconds so far
      01/13/12 18:25:02 Request 345869.00000:
      01/13/12 18:25:02 Rejected 345869.0 sfiligoi@submit-2.t2.ucsd.edu <169.228.130.26:9615?sock=10263_1229_2>: no match found
      01/13/12 18:25:02 Got NO_MORE_JOBS; done negotiating
      …
      01/13/12 18:25:06 Request 384970.00170:
      01/13/12 18:25:06 Matched 384970.170 uscms2535@glidein-2.t2.ucsd.edu <169.228.130.11:9615?sock=9763_cd4c_2> preempting none <192.168.3.77:55906?CCBID=169.228.130.23:9823#13833&noUDP> glidein_15335@lnxfarm177.colorado.edu
      01/13/12 18:25:06 Successfully matched with glidein_15335@lnxfarm177.colorado.edu
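To monitor cycle times, it helps to see where a cycle spends its time. The Python sketch below measures the gap between consecutive "Phase N:" lines, using the timestamp format shown in the example; the function name is illustrative. On the lines above it attributes 42 seconds to Phase 1 (fetching ads from the collector) and 10 seconds to Phase 2 (accounting).

```python
import re
from datetime import datetime

PHASE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) Phase ([\d.]+): ")


def phase_durations(lines):
    """Return [(phase label, seconds until the next phase line started)]."""
    marks = []
    for line in lines:
        m = PHASE_RE.match(line.strip())
        if m:
            when = datetime.strptime(m.group(1), "%m/%d/%y %H:%M:%S")
            marks.append((m.group(2), when))
    return [
        (phase, int((next_when - when).total_seconds()))
        for (phase, when), (_next_phase, next_when) in zip(marks, marks[1:])
    ]


SAMPLE_NEGOTIATOR_LOG = [
    "01/13/12 18:24:09 ---------- Started Negotiation Cycle ----------",
    "01/13/12 18:24:09 Phase 1: Obtaining ads from collector ...",
    "01/13/12 18:24:44 Sorting 23021 ads ...",
    "01/13/12 18:24:51 Phase 2: Performing accounting ...",
    "01/13/12 18:25:01 Phase 3: Sorting submitter ads by priority ...",
    "01/13/12 18:25:01 Phase 4.1: Negotiating with schedds ...",
]

durations = phase_durations(SAMPLE_NEGOTIATOR_LOG)
print(durations)
```

The last phase has no following phase line, so its duration is not reported by this sketch; pairing "Started" and "Finished Negotiation Cycle" lines in the same way would give the total cycle time.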
  • 27. CycleServer Screenshots
      ● Can do more than just monitoring
      ● But the rest is beyond the scope of this talk
  • 28. Frontend Monitoring
  • 29. Frontend monitoring
      ● Helper cmdline tool
      ● Plus, each Group provides:
          ● Activity/Error logs
          ● RRD files with statistics (running, held, etc.)
          ● XML files with current snapshot
          ● Resource ClassAds
      ● Master frontend aggregates RRD and XML files, and writes them in its own area
      ● Human readable/viewable Web pages available
      [Diagram labels: Frontend node, Frontend, Group, Entry, Spawn.]
  • 30. Helper cmdline tool
      ● Wrapper around condor_status
          glideinWMS/tools/glidein_status.py
      ● Provides useful formatting
      ~/glideinWMS/tools$ ./glidein_status.py
      Name                                 Site    Factory     Entry               State   Activit
      glidein_6682@alicegrid26.ba.infn.it  Bari    v1_0@OSGGOC CMS_T2_IT_Bari_ce01 Claimed Busy
      glidein_10678@alicegrid32.ba.infn.it Bari    v1_0@OSGGOC CMS_T2_IT_Bari_ce01 Claimed Busy
      …
      glidein_5861@wp-05-12.pn.pd.infn.it  Legnaro v1_0@OSGGOC CMS_T2_IT_Legnaro_. Claimed Retirin

                                              Total Owner Claimed/Busy Claimed/Retiring Claimed/Other Unclaimed M
      CMS_T2_US_Nebraska_Red_gw2@v1_0@OSGGOC     11     0           11                0             0         0
      CMS_T2_US_Purdue_hansen@v1_0@OSGGOC       522     0          517                0             0         5
      CMS_T2_US_Purdue_osg@v1_0@OSGGOC         1201     0         1182               14             0         5
      …
      CMS_T2_US_UCSD_gw4@Production_v4_2@UCSD   135     0          132                0             0         3
                                        Total 21474     0        19742             1264             0       468
  • 31. Log files
      ● Each Frontend group provides 3 types of logs
          log/group_XXX/frontend.date.type.log
          ● info - Progress and warnings
          ● err - One line warnings
          ● debug - Multi line error messages
      ● The master frontend has similar logs
          log/frontend/frontend.date.type.log
          ● But rarely anything interesting there
  • 32. Example Info Log
    :01-07:00 15037] Iteration at Tue Nov 15 10:44:01 2011
    :01-07:00 15037] Query condor
    :01-07:00 15037] Child processes created
    :05-07:00 31633] WARNING: Failed to talk to schedd submit-1.t2.ucsd.edu. See debug log for more details.
    :05-07:00 15037] All children terminated
    :05-07:00 15037] Jobs found total 4836 idle 1732 (old 1732, voms 1703) running 3104
    :05-07:00 15037] Glideins found total 639 idle 8 running 630 limit 800 curb 600
    :05-07:00 15037] Using 1 proxies
    :05-07:00 15037] Match
    :05-07:00 15037] Counting
    :05-07:00 15037] Child processes created
    :06-07:00 15037] All children terminated
    :06-07:00 15037] Total matching idle 1732 (old 1703) running 3104
    :06-07:00 15037] Jobs in schedd queues | Glideins | Request
    :06-07:00 15037] Idle (match eff old uniq ) Run ( here max ) | Total Idle Run | Idle MaxRun Down Factory
    :06-07:00 15037] 171( 1705 170 169 0) 3104( 102 250) | 105 1 103 | 10 3276 Up CMS_T2_US_Nebraska_Red@Produ
    :06-07:00 15037] 171( 1705 167 169 0) 3104( 187 250) | 197 4 193 | 10 3276 Up CMS_T2_US_Nebraska_Red_gw1@P
    :06-07:00 15037] 171( 1705 171 169 0) 3104( 0 250) | 0 0 0 | 10 3276 Down CMS_T2_US_Nebraska_Red_gw2@P
    :06-07:00 15037] 171( 1705 171 169 0) 3104( 62 250) | 62 0 62 | 10 3276 Up CMS_T2_US_Wisconsin_cms01@Pr
    :06-07:00 15037] 171( 1705 171 169 0) 3104( 71 250) | 71 0 71 | 10 3276 Up CMS_T2_US_Wisconsin_cms02@Pr
    :06-07:00 15037] 171( 1705 169 169 0) 3104( 88 250) | 96 2 94 | 10 3276 Up CMS_T2_US_Nebraska_Red@v1_0@
    :06-07:00 15037] 171( 1705 171 169 0) 3104( 1 250) | 1 0 1 | 10 3276 Up CMS_T2_US_Nebraska_Red_gw1@v
    :06-07:00 15037] 171( 1705 171 169 0) 3104( 0 250) | 0 0 0 | 10 3276 Down CMS_T2_US_Nebraska_Red_gw2@v
    :06-07:00 15037] 171( 1705 171 169 0) 3104( 45 250) | 45 0 45 | 10 3276 Up CMS_T2_US_Wisconsin_cms01@v1
    :06-07:00 15037] 171( 1705 170 169 0) 3104( 60 250) | 62 1 61 | 10 3276 Up CMS_T2_US_Wisconsin_cms02@v1
    :06-07:00 15037] Jobs in schedd queues | Glideins | Request
    :06-07:00 15037] Idle (match eff old uniq ) Run ( here max ) | Total Idle Run | Idle MaxRun Down Factory
    :06-07:00 15037] 1368(13640 1360 1352 0) 24832( 616 2000) | 639 8 630 | 80 26208 Up Sum of useful factories
    :06-07:00 15037] 342( 3410 342 338 0) 6208( 0 500) | 0 0 0 | 20 6552 Down Sum of down factories
    :06-07:00 15037] 27( 27 27 14 27) 0( 0 0) | 0 0 0 | 0 0 Down Unmatched
    :06-07:00 15037] Advertizing 10 requests
    :07-07:00 15037] Done advertizing
    :07-07:00 15037] Advertising 10 glideresource classads to the user pool
    :07-07:00 15037] Done advertising glideresource classads
    :07-07:00 15037] Writing stats
    :07-07:00 15037] Sleep
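The two summary lines in the Info log ("Jobs found ..." and "Glideins found ...") are easy to pick up from a script. The following is a minimal sketch, assuming the line layout stays exactly as in the example above; the function name `parse_summary` and the dictionary keys are illustrative choices, not part of glideinWMS.

```python
import re

# Patterns mirror the two summary lines shown in the example Info log.
JOBS_RE = re.compile(
    r"Jobs found total (\d+) idle (\d+) \(old (\d+), voms (\d+)\) running (\d+)")
GLIDEINS_RE = re.compile(
    r"Glideins found total (\d+) idle (\d+) running (\d+) limit (\d+) curb (\d+)")


def parse_summary(lines):
    """Collect job and glidein counters from Frontend Info log lines."""
    stats = {}
    for line in lines:
        m = JOBS_RE.search(line)
        if m:
            keys = ("total", "idle", "old", "voms", "running")
            stats["jobs"] = dict(zip(keys, map(int, m.groups())))
        m = GLIDEINS_RE.search(line)
        if m:
            keys = ("total", "idle", "running", "limit", "curb")
            stats["glideins"] = dict(zip(keys, map(int, m.groups())))
    return stats


sample = [
    "Jobs found total 4836 idle 1732 (old 1732, voms 1703) running 3104",
    "Glideins found total 639 idle 8 running 630 limit 800 curb 600",
]
stats = parse_summary(sample)
print(stats["glideins"]["idle"], "idle glideins out of", stats["glideins"]["total"])
```

Feeding it the lines of one iteration gives a small dictionary that can be compared across iterations or pushed into another monitoring tool.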
  • 33. Example log files
    frontend.20120113.err.log
    [2012-01-13T18:50:47-07:00 16444] Advertizing failed for 2 requests. See debug log for more details.
    [2012-01-13T18:52:59-07:00 19312] Failed to talk to schedd. See debug log for more details.
    frontend.20120113.debug.log
    [2012-01-13T18:50:47-07:00 16444] Advertizing failed: Error running /home/frontend/glidecondor/sbin/condor_advertise -pool glidein-1.t2.ucsd.edu -tcp -multiple UPDATE_MASTER_AD /tmp/gfi_aw_276509240_16444_2
    code 1:failed to send classad to <169.228.130.10:9618>
    failed to send classad to <169.228.130.10:9618>
    failed to send classad to <169.228.130.10:9618>
    failed to send classad to <169.228.130.10:9618>
    [2012-01-13T18:52:59-07:00 19312] Failed to talk to schedd: Schedd vocms120.cern.ch not found
  • 34. Web pages 1/3: frontendStatus.html ● Historical overview ● Fully dynamic; allows zooming and selecting the elements to plot ● Shows everything by default, but can be restricted to a group and/or a Factory
  • 35. Web pages 2/3: frontendGroupGraphStatusNow.html ● Current snapshot in tabular form ● Useful for spotting problems
  • 36. Web pages 3/3: frontendGroupGraphStatusNow.html ● Also contains pie charts with the same info
  • 37. RRDs and XML files ● The Web pages are just a rendering of the RRDs and XML files ● The raw data is loaded in the browser and rendered there ● No server-side code ● Other tools could use the same data ● Publicly available, if one knows the URL ● No user-identifying data, only summary stats
  • 38. Resource ClassAds ● Each Frontend Group advertises one ClassAd for each Factory it is requesting glideins from ● The ClassAd type is glideresource ● They contain pretty much everything the Frontend Group knows about the Factory: – Factory attributes used for matchmaking – Stats about the matching jobs – What is being requested – Even what the Factory is doing!
  • 39. Example query ● glideresource is not a native Condor type, so you must use -any ● Then constrain on the type ● Remotely queryable
    $ condor_status -any -const 'MyType=="glideresource"' -format "%s\n" Name
    CMS_T2_US_Caltech_cit2@v1_0@OSGGOC@UCSD-v5_3.main
    CMS_T2_US_Caltech_cit@v1_0@OSGGOC@UCSD-v5_3.main
    CMS_T2_US_Florida_iogw1@v1_0@OSGGOC@UCSD-v5_3.main
    CMS_T2_US_Florida_pg@v1_0@OSGGOC@UCSD-v5_3.main
    ...
    CMS_T2_US_UCSD_gw2@v1_0@OSGGOC@UCSD-v5_3.main
    CMS_T2_US_UCSD_gw4@v1_0@OSGGOC@UCSD-v5_3.main
    CMS_T2_US_Wisconsin_cms01@v1_0@OSGGOC@UCSD-v5_3.main
    CMS_T2_US_Wisconsin_cms02@v1_0@OSGGOC@UCSD-v5_3.main
  • 40. Example ClassAd ● Currently more information than you get on the Web
    $ condor_status -any CMS_T2_US_UCSD_gw4@v1_0@OSGGOC@UCSD-v5_3.main -l
    [Identification]
    MyType = "glideresource"
    Name = "CMS_T2_US_UCSD_gw4@v1_0@OSGGOC@UCSD-v5_3.main"
    GlideClientName = "UCSD-v5_3.main"
    ...
    [Info about local jobs]
    GlideClientMonitorJobsIdle = 210.000000
    GlideClientMonitorJobsRunningHere = 213
    ...
    [What is being requested]
    GlideClientMonitorGlideinsRequestIdle = 50
    GlideClientMonitorGlideinsRequestMaxRun = 445
    ...
    [Factory attributes]
    GLIDEIN_Site = "UCSD"
    GLEXEC_BIN = "OSG"
    ...
    [Info about registered glideins]
    GlideClientMonitorGlideinsRunning = 215
    GlideClientMonitorGlideinsTotal = 216
    ...
    [Factory status]
    GlideFactoryMonitorStatusRunning = 339
    GlideFactoryMonitorStatusPending = 277
    GlideFactoryMonitorStatusHeld = 0
    ...
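Since the glideresource ClassAds can be queried remotely, the long-format output can be turned into a dictionary for further checks. The following is a rough sketch, assuming the plain `Attr = value` layout shown above; `parse_classad` is a hypothetical helper, not a Condor or glideinWMS function, and it only handles quoted strings and plain numbers.

```python
def parse_classad(text):
    """Turn 'Attr = value' lines into a dict, unquoting strings and converting numbers."""
    ad = {}
    for line in text.splitlines():
        if " = " not in line:
            continue  # skip blank lines and '...' elisions
        key, value = line.split(" = ", 1)
        key, value = key.strip(), value.strip()
        if value.startswith('"') and value.endswith('"'):
            ad[key] = value[1:-1]
        else:
            try:
                ad[key] = float(value)
            except ValueError:
                ad[key] = value  # leave anything else (e.g. expressions) as text
    return ad


# Values copied from the example ClassAd on the slide.
sample = '''MyType = "glideresource"
GlideClientMonitorJobsIdle = 210.000000
GlideClientMonitorGlideinsRunning = 215
GlideClientMonitorGlideinsTotal = 216
GlideFactoryMonitorStatusRunning = 339
'''
ad = parse_classad(sample)
# Registered glideins that are not currently running a job.
not_running = ad["GlideClientMonitorGlideinsTotal"] - ad["GlideClientMonitorGlideinsRunning"]
print(ad["MyType"], not_running)
```

In practice the text would come from the `condor_status -any ... -l` command shown above, for example captured with `subprocess`.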
  • 41. OK, now you know what's available. What will you do with all that information? (i.e. what to look for)
  • 42. Monitoring the health of the system ● Six major areas to look after; your goal is – Few unclaimed glideins (both globally and per site) – No unmatched jobs – Reasonably low restart rate (both globally and per site) – Reasonably low job failure rate (both globally and per site) – Negotiation cycle reasonably short – Schedd node not overloaded
  • 43. Unclaimed glideins ● Frontend and Negotiator policies are not identical ● You may end up with glideins that never run any jobs ● The discrepancy can be big enough to be noticed on a global scale ● But more often it affects just one (or a few) sites ● Short spikes are not a problem ● But long periods are
  • 44. How do you notice it? ● Historical Web monitoring (the slide shows a "Bad" and a "Good" example plot) ● Ask for daily emails from the Factory ● Or write your own scripts that parse the RRDs (there are no Frontend report generators in glideinWMS at this time)
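The rule of thumb from the previous slide (short spikes are fine, long periods are not) is simple to encode once the unclaimed-glidein counts have been extracted from the RRDs. The sketch below assumes the samples are already available as a plain list of numbers at a fixed interval; the function name and the `threshold` and `min_len` parameters are illustrative, and reading the RRD files themselves is left out.

```python
def long_unclaimed_periods(samples, threshold, min_len):
    """Return (start, end) index pairs where the unclaimed count stays above
    `threshold` for at least `min_len` consecutive samples."""
    periods, start = [], None
    for i, value in enumerate(samples):
        if value > threshold:
            if start is None:
                start = i  # a new run above the threshold begins
        else:
            if start is not None and i - start >= min_len:
                periods.append((start, i - 1))
            start = None
    # Handle a run that is still ongoing at the end of the series.
    if start is not None and len(samples) - start >= min_len:
        periods.append((start, len(samples) - 1))
    return periods


# One short spike (ignored) followed by a sustained period (reported).
series = [0, 40, 0, 0, 35, 38, 41, 37, 2, 0]
print(long_unclaimed_periods(series, threshold=10, min_len=3))
```

Running the same check per site narrows the investigation to the few sites where the Frontend and Negotiator policies disagree.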
  • 45. How do you find the root cause? ● Analyze the latest snapshots – condor_status/glidein_status – condor_q – Frontend Web ● Limit the search to a few sites, if possible ● Then start comparing (can be daunting!) – Job Requirements, with – Glidein Start expressions ● In theory, there is “condor_q -ana”, but it is usually worthless
  • 46. Unmatched jobs ● The other side of the problem ● Glideins are never requested for some jobs, so those jobs will never start! ● Two possible reasons – Wrong Frontend matchmaking policy – No available Factory entries to serve the job
  • 47. How do you notice it? ● “Unmatched Factory” in Web monitoring
  • 48. How do you find the root cause? ● Again, start with the latest snapshot – condor_q – condor_status -any -const 'MyType=="glideresource"' ● Get the (python) Match expression from the XML ● Start comparing! (Can be daunting!)
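Because the Match expression is plain Python, the comparison can be replayed offline against job and Factory attributes taken from the snapshots. The following is only a sketch: it assumes the expression refers to two dictionaries named `job` and `glidein`, and the example expression, attribute values and helper names are made up for illustration; use the expression found in your own XML.

```python
def matches(match_expr, job, glidein):
    """Evaluate a Frontend-style Python match expression for one job/entry pair.
    eval() is tolerable here only because the expression comes from the admin's own config."""
    return bool(eval(match_expr, {"__builtins__": {}}, {"job": job, "glidein": glidein}))


def unmatched_jobs(match_expr, jobs, entries):
    """Return the jobs that no Factory entry would be requested for."""
    return [job for job in jobs
            if not any(matches(match_expr, job, entry) for entry in entries)]


# Hypothetical expression and data, not taken from a real configuration.
match_expr = 'glidein["attrs"]["GLIDEIN_Site"] in job["DESIRED_Sites"].split(",")'
entries = [{"attrs": {"GLIDEIN_Site": "UCSD"}},
           {"attrs": {"GLIDEIN_Site": "Nebraska"}}]
jobs = [{"ClusterId": 1, "DESIRED_Sites": "UCSD,Wisconsin"},
        {"ClusterId": 2, "DESIRED_Sites": "Florida"}]

stuck = unmatched_jobs(match_expr, jobs, entries)
print([job["ClusterId"] for job in stuck])
```

A job that shows up in the result points either at the matchmaking policy or at a missing Factory entry, the two causes listed for unmatched jobs.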
  • 49. Restarted jobs ● Any restart == wasted CPU ● How do you notice it? – condor_q is your friend here: condor_q -format "%i\n" NumJobStarts – No historical/Web monitoring provided ● Why does it happen? – Glidein disappears! – End of lifetime hit – Preemption policies (not in the default config, but you may set Condor to do it) – Submit node overload (Condor daemons do not like being resource constrained!)
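Since no historical monitoring is provided for restarts, a small script over the `condor_q` output can fill the gap. This is a minimal sketch, assuming the command prints one NumJobStarts integer per line as in the query above; `restart_summary` and its result keys are illustrative names.

```python
def restart_summary(output):
    """Summarize restarts from text holding one NumJobStarts value per line."""
    starts = [int(tok) for tok in output.split() if tok.isdigit()]
    return {
        "jobs": len(starts),
        # A job that started n times was restarted n - 1 times.
        "restarts": sum(max(n - 1, 0) for n in starts),
        "restarted_jobs": sum(1 for n in starts if n > 1),
    }


# Stand-in for the captured condor_q output.
sample_output = "1\n1\n3\n0\n2\n"
summary = restart_summary(sample_output)
print(summary)
```

Storing the summary at regular intervals gives the historical view that the Web monitoring lacks.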
  • 50. Why do glideins disappear? ● Three main reasons – Remote node just died (rare) – Site preemption policy (some sites do this; there is nothing you can do, so learn who they are and act accordingly) – Glidein killed by the Site because it exceeded slot limits, most likely memory (GLIDEIN_MaxMemMBs, one of the 2 limits the OSG factory advertises) ● Why can limits be exceeded? – Job underestimated its resource use – Frontend matchmaking logic problem (the job told you it needed more resources than the limit!) – Wrong advertised limits (Factory problem!)
  • 51. Wallclock limits ● The main resource limit is time ● The glidein deals with it automatically – It will go away before the deadline – … killing/preempting any jobs if needed! ● The limit is advertised as – Factory: GLIDEIN_Max_Walltime (-Δ), in seconds – Glidein: GLIDEIN_ToDie, as UNIX time ● Why may jobs reach the deadline? Like with all other resources – Job underestimates the time it needs – Frontend matchmaking logic problems
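Because GLIDEIN_ToDie is a UNIX timestamp, checking whether a job still fits into a glidein is simple arithmetic. The sketch below assumes the job's expected runtime in seconds is known from somewhere (for example a user-provided attribute); `seconds_left`, `job_fits` and the `safety_margin` default are hypothetical and not part of glideinWMS.

```python
import time


def seconds_left(glidein_to_die, now=None):
    """Seconds until the glidein deadline (GLIDEIN_ToDie is a UNIX time)."""
    now = time.time() if now is None else now
    return max(0, int(glidein_to_die - now))


def job_fits(glidein_to_die, expected_runtime, now=None, safety_margin=600):
    """True if a job of `expected_runtime` seconds should finish before the deadline.
    `safety_margin` is an arbitrary illustrative buffer, in seconds."""
    return seconds_left(glidein_to_die, now) >= expected_runtime + safety_margin


now = 1_000_000            # fixed "current time" so the example is reproducible
to_die = now + 10_000      # glidein will go away in 10000 seconds
print(job_fits(to_die, 3600, now=now))   # a one-hour job
print(job_fits(to_die, 9600, now=now))   # a job that would cross the deadline
```

The same comparison, expressed in the matchmaking policy, is what keeps jobs from landing on glideins that will preempt them before they finish.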
  • 52. Job failures ● Jobs can fail for many reasons ● You should monitor the ExitCode: condor_history -back -const 'JobStatus==5' -format "%i\n" ExitCode ● Knowing what the users run is often needed to interpret the errors ● For common WN errors, the Frontend admin should create an appropriate validation script – So glideins fail, not user jobs
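A histogram of exit codes is usually the first thing to look at. This is a small sketch, assuming the `condor_history` output has been captured as text with one ExitCode per line; `exit_code_histogram` is an illustrative name.

```python
from collections import Counter


def exit_code_histogram(output):
    """Count how often each ExitCode appears in one-value-per-line text."""
    codes = [int(tok) for tok in output.split() if tok.lstrip("-").isdigit()]
    return Counter(codes)


# Stand-in for the captured condor_history output.
hist = exit_code_histogram("0\n0\n1\n0\n137\n1\n")
for code, count in hist.most_common():
    print(code, count)
```

Exit codes that keep recurring across many users are good candidates for a validation script, so that the glidein fails instead of the user job.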
  • 53. Negotiation time ● The negotiation time should be << 5 min ● If it is much longer, glideins may terminate without running any jobs ● Monitor the NegotiatorLog on the CM ● Possible causes – CPU starvation (e.g. other processes) – Autocluster explosion: Condor tries to be smart about matchmaking, but if users don't cooperate, it cannot do much
  • 54. Autoclustering ● The Condor Schedd will try to group jobs (matchmaking is much faster if only a few groups exist) ● All “similar jobs” will be matched together! ● What does “similar” mean? – Similar == would result in the same match ● How is it implemented? – Tuple of attributes considered during matchmaking – E.g. (DESIRED_Sites,ImageSize) ● How can the number of autoclusters explode? – If an attribute that changes a lot is added – Example of a really bad one: JobID ● Reference: https://condor-wiki.cs.wisc.edu/index.cgi/attach_get/220/cs739.pdf
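The effect of adding a fast-changing attribute can be illustrated with a toy model in which an autocluster is simply a distinct tuple of the attributes considered during matchmaking. This is a simplified sketch of the idea, not Condor's actual implementation; the job dictionaries and `count_autoclusters` are made up for the example.

```python
def count_autoclusters(jobs, significant_attrs):
    """Number of distinct attribute tuples, i.e. groups of 'similar' jobs."""
    return len({tuple(job.get(attr) for attr in significant_attrs) for job in jobs})


# 1000 toy jobs that differ only in JobID and in one of two site lists.
jobs = [{"JobID": i,
         "DESIRED_Sites": "UCSD" if i % 2 else "Nebraska",
         "ImageSize": 2000}
        for i in range(1000)]

few = count_autoclusters(jobs, ["DESIRED_Sites", "ImageSize"])
many = count_autoclusters(jobs, ["DESIRED_Sites", "ImageSize", "JobID"])
print(few, many)
```

With the two original attributes the thousand jobs collapse into two groups; once JobID takes part in matchmaking, every job becomes its own group and the negotiation has to consider each one separately.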
  • 55. Submit node health ● Condor is very sensitive to resource starvation ● If the submit node is overloaded, expect problems! ● How can we get to resource starvation? – Poor planning (trying to run 3k jobs on a node with 1G of RAM???) – Other processes, which may steal CPU/RAM/IO from Condor ● Interactive activity is particularly risky – Due to its unpredictable nature – Including user errors ● But portals are not immune to resource overuse either
  • 56. Summary
  • 57. Summary ● You have plenty of monitoring options ● Some prettier, some more powerful ● Most of the time, things just work ● So you don't need to constantly watch over your installation ● But occasionally things will break ● It is in your interest to notice it (or the users will tell you!) ● Having good monitoring tools will help you there!
  • 58. The End
  • 59. Pointers ● The official glideinWMS project Web page is http://tinyurl.com/glideinWMS ● glideinWMS development team is reachable at glideinwms-support@fnal.gov ● The OSG glidein factory is reachable at osg-gfactory-support@physics.ucsd.edu
  • 60. Acknowledgments ● The glideinWMS is a CMS-led project developed mostly at FNAL, with contributions from UCSD and ISI ● The glideinWMS factory operations at UCSD are sponsored by OSG ● The funding comes from NSF, DOE and the UC system