Condor from the user point of view - glideinWMS Training Jan 2012

A high level view on how final users experience Condor. Part of the glideinWMS Training session held in Jan 2012 at UCSD.

  1. glideinWMS Training @ UCSD: Condor from the user point of view, by Igor Sfiligoi (UCSD). UCSD, Jan 17th 2012.
  2. Acknowledgement
     ● These slides are heavily based on the presentation Todd Tannenbaum gave at CERN in Feb 2011
       https://indico.cern.ch/conferenceTimeTable.py?confId=124982#20110214.detailed
  3. What is Condor
     ● Condor is a Workload Management System
       ● i.e. a batch system
     ● Strong points
       ● Fault tolerant
       ● Robust feature set
       ● Flexible
       ● Development team dedicated to working closely with the scientific community as priority #1
  4. How can Condor be used
     ● Managing local processes (local)
     ● Managing local cluster (~vanilla)
     ● Connecting clusters (flocking)
     ● Handling resource overlays (glideins)
     ● Swiss-knife for accessing other WMS (Condor-G)
       ● e.g. Grid, Cloud, pbs, etc.
     Slide annotation next to the first four items: "We treat these two in this talk (as one)"
  5. Submitting Jobs to Condor
     ● Get access to submit host
     ● Choose a “Universe” for your job
     ● Make your job “batch-ready”
       ● Includes making your data available to your job
     ● Create a submit description file
     ● Run condor_submit to put your job(s) in the queue
     ● Relax while Condor manages and watches over your job(s)
  6. Choose the job “Universe”
     ● Controls how Condor handles jobs
     ● Condor's many universes include:
       ● Vanilla (aka regular single node job)
       ● Parallel
       ● Grid
       ● Java
       ● VM
       ● Standard
  7. Hello World Submit File
         # Simple condor_submit input file
         # (Lines beginning with # are comments)
         # NOTE: the words on the left side are not
         #       case sensitive, but filenames are!
         Universe   = vanilla
         # Job's executable
         Executable = cosmos
         # Job's args
         Arguments  = -k 1543.3
         # Job's STDOUT
         Output     = cosmos.out
         # Job's STDIN
         Input      = cosmos.in
         # Log the job's activities
         Log        = cosmos.log
         # Put the job in the queue!
         Queue 1
  8. condor_submit & condor_q
         % condor_submit sim.submit
         Submitting job(s).
         1 job(s) submitted to cluster 1.
         % condor_q
         -- Submitter: perdita.cs.wisc.edu : <128.105.165.34:1027> :
          ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
           1.0    frieda   6/16 06:52   0+00:00:00 I  0   0.0  sim.exe
         1 jobs; 1 idle, 0 running, 0 held
         %
  9. View the full ClassAd
         % condor_q -long
         -- Submitter: perdita.cs.wisc.edu : <128.105.165.34:1027> :
         MyType = "Job"
         TargetType = "Machine"
         ClusterId = 1
         QDate = 1150921369
         CompletionDate = 0
         Owner = "frieda"
         RemoteWallClockTime = 0.000000
         LocalUserCpu = 0.000000
         LocalSysCpu = 0.000000
         RemoteUserCpu = 0.000000
         RemoteSysCpu = 0.000000
         ExitStatus = 0
         ...
  10. Monitor progress through logs
      ● The log file you specified is updated every time something happens to the job
        ● e.g. submit, start, termination
      ● Example log file
          000 (001.000.000) 05/25 19:10:03 Job submitted from host: <128.105.146.14:1816>
          ...
          001 (001.000.000) 05/25 19:12:17 Job executing on host: <128.105.146.14:1026>
          ...
          005 (001.000.000) 05/25 19:13:06 Job terminated.
                  (1) Normal termination (return value 0)
          ...
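
The log format shown on the slide is regular enough to scan with a short script. The following is a minimal Python sketch, not a Condor tool: it assumes every event starts with a header line shaped like the ones in the example, reads the parenthesised triple as cluster.process.subprocess, and only names the three event codes that appear on the slide. The names `parse_event_header` and `summarize` are made up for this illustration.

```python
import re
from typing import List, NamedTuple, Optional

# Only the event codes visible in the slide's example log are named here.
EVENT_NAMES = {"000": "submitted", "001": "executing", "005": "terminated"}

# Matches an event header such as:
#   000 (001.000.000) 05/25 19:10:03 Job submitted from host: <...>
HEADER = re.compile(
    r"^(?P<code>\d{3}) \((?P<cluster>\d+)\.(?P<proc>\d+)\.\d+\) "
    r"(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<text>.*)$"
)


class LogEvent(NamedTuple):
    code: str       # three-digit event code, e.g. "005"
    name: str       # short label from EVENT_NAMES, or "other"
    job_id: str     # "cluster.process", e.g. "1.0"
    timestamp: str  # "MM/DD HH:MM:SS" exactly as written in the log
    text: str       # rest of the header line


def parse_event_header(line: str) -> Optional[LogEvent]:
    """Return a LogEvent for a header line, or None for any other line."""
    m = HEADER.match(line.strip())
    if m is None:
        return None
    job_id = f"{int(m['cluster'])}.{int(m['proc'])}"
    name = EVENT_NAMES.get(m["code"], "other")
    return LogEvent(m["code"], name, job_id, f"{m['date']} {m['time']}", m["text"])


def summarize(log_text: str) -> List[LogEvent]:
    """Collect all event headers found in a log, in file order."""
    events = (parse_event_header(line) for line in log_text.splitlines())
    return [event for event in events if event is not None]


SAMPLE_LOG = """\
000 (001.000.000) 05/25 19:10:03 Job submitted from host: <128.105.146.14:1816>
...
001 (001.000.000) 05/25 19:12:17 Job executing on host: <128.105.146.14:1026>
...
005 (001.000.000) 05/25 19:13:06 Job terminated.
        (1) Normal termination (return value 0)
...
"""

if __name__ == "__main__":
    for event in summarize(SAMPLE_LOG):
        print(event.job_id, event.name, event.timestamp)
```

Detail lines such as "(1) Normal termination" and the "..." separators are skipped on purpose; a fuller parser would attach them to the preceding event.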
  11. condor_status gives information about the pool:
          % condor_status
          Name          OpSys  Arch   State      Activ  LoadAv  Mem  ActvtyTime
          perdita.cs.wi LINUX  INTEL  Owner      Idle   0.020   511  0+02:28:42
          coral.cs.wisc LINUX  INTEL  Claimed    Busy   0.990   511  0+01:27:21
          doc.cs.wisc.e LINUX  INTEL  Unclaimed  Idle   0.260   511  0+00:20:04
          dsonokwa.cs.w LINUX  INTEL  Claimed    Busy   0.810   511  0+00:01:45
          ferdinand.cs. LINUX  INTEL  Claimed    Suspe  1.130   511  0+00:00:55
      To inspect full ClassAds: condor_status -long
  12. General User Commands
      ● condor_submit    Submit new Jobs
      ● condor_run       Submit and block
      ● condor_q         View Job Queue
      ● condor_status    View Pool Status
      ● condor_rm        Remove Jobs
      ● condor_history   Completed Job Info
      ● condor_hold      Put a job on hold
      ● condor_release   Release a job from hold
      http://www.cs.wisc.edu/condor/manual/v7.6/9_Command_Reference.html
  13. Condor File Transfer
      ● Condor will transfer files between submit and execute nodes (eliminating the need for NFS) if desired:
      ● ShouldTransferFiles
        – YES: Always transfer files to execution site
        – NO: Always rely on a shared filesystem
        – IF_NEEDED: Condor will automatically transfer the files if the submit and execute machines are not in the same FileSystemDomain (use the shared file system if available)
      ● When_To_Transfer_Output
        – ON_EXIT: Transfer the job's output files back to the submitting machine only when the job completes
        – ON_EXIT_OR_EVICT: Like above, but also when the job is evicted
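
The three ShouldTransferFiles settings amount to a small decision rule. The Python sketch below only restates the slide's description; it is not how Condor implements the check, and the function name and the domain strings in the usage lines are invented for illustration.

```python
def will_transfer_files(should_transfer: str, submit_domain: str, execute_domain: str) -> bool:
    """Decide whether files get transferred, following the slide's three settings.

    submit_domain and execute_domain stand in for the FileSystemDomain of the
    submit and execute machines (illustrative values, not read from Condor).
    """
    mode = should_transfer.upper()
    if mode == "YES":
        return True                              # always transfer
    if mode == "NO":
        return False                             # always rely on the shared filesystem
    if mode == "IF_NEEDED":
        return submit_domain != execute_domain   # transfer only across domains
    raise ValueError(f"unknown ShouldTransferFiles value: {should_transfer!r}")


if __name__ == "__main__":
    # Same domain: the shared filesystem is used, nothing is transferred.
    print(will_transfer_files("IF_NEEDED", "example.edu", "example.edu"))
    # Different domains: the files are transferred.
    print(will_transfer_files("IF_NEEDED", "example.edu", "other.example.org"))
```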
  14. Condor File Transfer, cont
      ● Transfer_Input_Files
        ● List of files that you want Condor to transfer to the execute machine
      ● Transfer_Output_Files
        ● List of files that you want Condor to transfer from the execute machine
          – If specified, the files must exist when the job terminates, or the job will fail (put on hold)
          – If not specified, Condor will transfer back all new or modified files in the execute directory (but not subdirectories)
  15. Simple File Transfer Example
          # Example submit file using file transfer
          Universe                = vanilla
          Executable              = cosmos
          Log                     = cosmos.log
          ShouldTransferFiles     = YES
          Transfer_input_files    = cosmos.dat
          Transfer_output_files   = results.dat
          When_To_Transfer_Output = ON_EXIT
          Queue
  16. These annoying emails
      ● Condor will send an email every time a job finishes
        ● Which is potentially nice if you have 10 jobs
        ● But it is annoying when you have O(10k)!
      ● To disable this behavior, add
          Notification = Never
        to the submit file
  17. Referencing job id
      ● Each job has a unique ID
        ● Composed of (Cluster,Process) pair
      ● Can reference them in the submit file with
        ● $(Cluster)
        ● $(Process)
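
To see what $(Cluster) and $(Process) buy you, it helps to expand them by hand. The Python sketch below mimics the substitution for a made-up file name pattern; it illustrates the idea only and is not Condor's macro expansion (the function name and the pattern `results.$(Cluster).$(Process).dat` are assumptions for the example).

```python
import re


def expand_job_macros(template: str, cluster: int, process: int) -> str:
    """Replace $(Cluster) and $(Process) in a template; leave other macros alone."""
    values = {"cluster": str(cluster), "process": str(process)}

    def replace(match: "re.Match[str]") -> str:
        # Names are compared case-insensitively in this sketch.
        return values.get(match.group(1).lower(), match.group(0))

    return re.sub(r"\$\((\w+)\)", replace, template)


if __name__ == "__main__":
    # Three jobs of a hypothetical cluster 42, with process numbers 0, 1, 2.
    for process in range(3):
        print(expand_job_macros("results.$(Cluster).$(Process).dat", 42, process))
```

Each job of the cluster gets its own name this way, so the jobs do not overwrite one another's files.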
  18. Adding arbitrary attributes
      ● The job can add arbitrary attributes
        ● Useful both as mnemonic
          e.g. AnaType="Higgs"
        ● and for use during Matchmaking (assuming glideins provide the requirements)
          e.g. DESIRED_Sites="FNAL,UCSD,Nebraska"
      ● Condor syntax expects a “+” sign
        e.g. +DESIRED_Sites="FNAL,UCSD,Nebraska"
      ● Anything starting with a letter (i.e. without the “+”) must be an attribute that Condor recognizes
  19. Example submit file
          # Example submit file using file transfer
          Universe                = vanilla
          Executable              = cosmos
          Arguments               = cosmos.dat $(Cluster)
          Log                     = cosmos.log
          ShouldTransferFiles     = YES
          Transfer_input_files    = cosmos.dat
          Transfer_output_files   = results.dat
          When_To_Transfer_Output = ON_EXIT
          Notification            = Never
          +MyAna                  = "Higgs"
          +DESIRED_Sites          = "FNAL,UCSD,Nebraska"
          Queue
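
When many similar submit files are needed, they are often generated by a script. Here is a minimal Python sketch that writes out text in the same shape as the example above: plain commands first, then user-defined attributes with the "+" prefix and quoted string values, then Queue. The helper `render_submit` is hypothetical, handles only string-valued custom attributes, and does not call any Condor API.

```python
from typing import Dict


def render_submit(commands: Dict[str, str], custom_attrs: Dict[str, str], queue: int = 1) -> str:
    """Build the text of a submit description file.

    commands:     regular submit commands, written as "Name = value"
    custom_attrs: user-defined string attributes, written as +Name = "value"
    """
    lines = [f"{name} = {value}" for name, value in commands.items()]
    for name, value in custom_attrs.items():
        if '"' in value:
            # Quoting rules for embedded quotes are out of scope for this sketch.
            raise ValueError(f"attribute {name!r} contains a double quote")
        lines.append(f'+{name} = "{value}"')
    lines.append(f"Queue {queue}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_submit(
        {
            "Universe": "vanilla",
            "Executable": "cosmos",
            "Arguments": "cosmos.dat $(Cluster)",
            "Log": "cosmos.log",
            "Notification": "Never",
        },
        {"MyAna": "Higgs", "DESIRED_Sites": "FNAL,UCSD,Nebraska"},
    )
    print(text, end="")
```

The resulting text would be saved to a file and passed to condor_submit as usual.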
  20. We just scratched the surface
      ● Many more options available
      ● See Condor Manual
        http://research.cs.wisc.edu/condor/manual/v7.6/condor_submit.html
      ● See CondorWeek User tutorial
        http://research.cs.wisc.edu/condor/CondorWeek2011/tuesday_condor.html
  21. Priorities
  22. Priorities
      ● Priorities between users are set by the Negotiator administrator
        ● User cannot really influence that
        ● More details tomorrow
      ● But users can set priorities among their own jobs
        ● FIFO by default
        ● User can designate some jobs as higher priority (or even lower priority)
          – Again FIFO within the same priority level
      http://research.cs.wisc.edu/condor/manual/v7.6/condor_prio.html
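
The ordering rule for one user's jobs (higher priority first, FIFO within a priority level) can be written as a two-part sort key. The Python sketch below is only a model of that rule with invented job data; which job actually starts first also depends on matchmaking, which the sketch ignores.

```python
from typing import List, NamedTuple


class QueuedJob(NamedTuple):
    job_id: str        # "cluster.process"
    prio: int          # job priority; a higher number is served first
    submit_order: int  # position in which the job entered the queue


def run_order(jobs: List[QueuedJob]) -> List[QueuedJob]:
    """Order one user's idle jobs: highest priority first, FIFO within a level."""
    return sorted(jobs, key=lambda job: (-job.prio, job.submit_order))


if __name__ == "__main__":
    queue = [
        QueuedJob("1.0", 0, 0),
        QueuedJob("1.1", 5, 1),    # raised by the user
        QueuedJob("1.2", 0, 2),
        QueuedJob("1.3", -10, 3),  # lowered by the user
    ]
    print([job.job_id for job in run_order(queue)])
```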
  23. Managing priorities
      ● Check with condor_q
      ● Modify with condor_prio
          % condor_q 366701.193
          -- Submitter: santa1.claus : <192.168.130.11:9615?sock=9763_cd4c_2> : santa1.claus
           ID          OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
          366701.193   frida  12/25 12:01   0+00:00:00 I  0   0.0  cosmos -k 3.4
          % condor_prio -10 366701.193
          % condor_q 366701.193
          -- Submitter: santa1.claus : <192.168.130.11:9615?sock=9763_cd4c_2> : santa1.claus
           ID          OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
          366701.193   frida  12/25 12:01   0+00:00:00 I  -10 0.0  cosmos -k 3.4
  24. Workflows
  25. What is a workflow?
      ● HTC is all about using many CPUs
      ● Workflow = collection of jobs solving one task
      ● Often users have workflows that are more than “a bunch of jobs”
        ● May have dependencies
      ● Condor provides a tool to handle dependencies
        ● The Condor DAGMan
  26. condor_dagman
      ● condor_dagman is a program to manage a DAG
        ● DAG = Directed Acyclic Graph
      ● Effective way to handle a large workflow
        ● Condor will make sure the workflow is completed
        ● Or tell you which node failed
      ● One condor_dagman will be running per submitted workflow
      [Figure: example DAG, annotated “At most 4 in parallel”]
  27. How to submit a workflow
      ● In a nutshell
        ● Create submit files for nodes
        ● Create a dagman submit file
        ● Submit the DAG using condor_submit_dag
        ● Condor takes care of rest
      [Figure: diamond-shaped DAG with A on top, B and C in the middle, D at the bottom]
          # Example dagman file
          JOB A A.condor
          JOB B B.condor
          JOB C C.condor
          JOB D D.condor
          PARENT A CHILD B C
          PARENT B C CHILD D
      http://research.cs.wisc.edu/condor/manual/v7.6/2_10DAGMan_Applications.html
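
The JOB and PARENT ... CHILD lines describe a dependency graph, and the order in which nodes can run follows from it. The Python sketch below (Python 3.9+, using the standard `graphlib` module) reads only those two kinds of lines and groups the nodes into "waves" that could run in parallel. It illustrates the dependency idea and is not how condor_dagman is implemented; `parse_dag` and `execution_waves` are names made up here.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

DAG_TEXT = """\
# Example dagman file
JOB A A.condor
JOB B B.condor
JOB C C.condor
JOB D D.condor
PARENT A CHILD B C
PARENT B C CHILD D
"""


def parse_dag(text: str) -> Dict[str, Set[str]]:
    """Map each node name to the set of its parents (JOB and PARENT lines only)."""
    parents_of: Dict[str, Set[str]] = {}
    for raw in text.splitlines():
        words = raw.split()
        if not words or words[0].startswith("#"):
            continue  # skip blank lines and comments
        keyword = words[0].upper()
        if keyword == "JOB":
            parents_of.setdefault(words[1], set())
        elif keyword == "PARENT":
            split = [w.upper() for w in words].index("CHILD")
            parents, children = words[1:split], words[split + 1:]
            for parent in parents:
                parents_of.setdefault(parent, set())
            for child in children:
                parents_of.setdefault(child, set()).update(parents)
    return parents_of


def execution_waves(parents_of: Dict[str, Set[str]]) -> List[List[str]]:
    """Group nodes into successive waves; a node appears once all its parents are done."""
    sorter = TopologicalSorter(parents_of)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(parse_dag(DAG_TEXT)), start=1):
        print(f"wave {number}: {' '.join(wave)}")
```

For the diamond example, B and C land in the same wave because both depend only on A, and D comes last because it needs both of them.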
  28. The End
  29. The Condor Project (Established ‘85)
      ● Research and Development in the Distributed High Throughput Computing field
      ● Team of ~35 faculty, full time staff and students
        ● Face software engineering challenges in a distributed UNIX/Linux/NT environment
        ● Are involved in national and international grid collaborations
        ● Actively interact with academic and commercial entities and users
        ● Maintain and support large distributed production environments
        ● Educate and train students
  30. Pointers
      ● Condor Home Page
        http://www.cs.wisc.edu/condor/
      ● Condor Manual
        http://www.cs.wisc.edu/condor/manual/v7.6/
      ● Support
        condor-user@cs.wisc.edu
        condor-admin@cs.wisc.edu
