Linux Cluster Job Management Systems (SGE)

These slides provide an introduction to Sun Grid Engine (SGE), a job scheduler often used on Linux HPC clusters.


  1. Job Management Systems: SGE v1.3. Author: Anand Vaidya [email_address]
  2. Why use SGE? <ul><li>Maintain order in a shared resource, like queuing up at a movie ticket counter rather than mobbing the counter </li></ul><ul><li>Apply different usage policies, e.g. PhDs and Profs get better treatment than first-year grads </li></ul><ul><li>Everyone gets a fair share of the computing resource. </li></ul>
  3. What is SGE? <ul><li>SGE is distributed resource management software. </li></ul><ul><li>It gives users a way to submit computationally demanding tasks to the SGE system, which distributes the associated workload across the cluster transparently. </li></ul>
  4. How does SGE work? <ul><li>Users submit jobs to the Grid Engine. </li></ul><ul><li>Unless resources are immediately available, non-interactive jobs are kept in queues until resources to execute them become available. </li></ul><ul><li>Jobs are then passed on to the available execution hosts. </li></ul><ul><li>Records of each job's progress through the system are kept and reported when requested. </li></ul>
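The submit, queue, dispatch and report cycle above can be followed from a script. A minimal sketch, assuming `qstat -j <id>` exits with a non-zero status once the job has left the system (verify this on your SGE version); `wait_for_job` is a hypothetical helper name, not an SGE command:

```shell
# wait_for_job ID [INTERVAL]: block until SGE no longer knows job ID.
# Assumption: "qstat -j ID" succeeds while the job is queued or running
# and fails once it has finished or been deleted.
wait_for_job() {
    local id="$1" interval="${2:-30}"
    while qstat -j "$id" > /dev/null 2>&1; do
        sleep "$interval"    # poll gently; the qmaster serves every user
    done
    echo "job $id has left the queue"
}
```

For example, `wait_for_job 742 10` polls every 10 seconds. Afterwards, `qacct -j 742` reports the job's accounting record, provided accounting is enabled on the cluster.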
  5. SGE Components <ul><li>Hosts </li></ul><ul><ul><li>Master (coordinates activities, holds queues) </li></ul></ul><ul><ul><li>Execution (workers) </li></ul></ul><ul><ul><li>Administration (sets up the system, queues, etc.) </li></ul></ul><ul><ul><li>Submit (users can submit jobs from these) </li></ul></ul><ul><li>Usually the master and admin host are the same machine </li></ul><ul><li>Queues (defined by the administrator) </li></ul><ul><li>User and Administrator Commands </li></ul><ul><li>Daemons: sge_qmaster (Master Daemon), sge_schedd (Scheduler Daemon), sge_execd (Execution Daemon) and, in SGE 5.x, sge_commd (Communication Daemon; from SGE 6.0 its role is built into the other daemons) </li></ul>
  6. SGE Commands - qhost <ul><li>What is the state of the cluster? How many nodes are there, of what type, and how loaded are they? What is my chance of getting a node? </li></ul><ul><li>[root@shark ~]# qhost </li></ul><ul><li>HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS </li></ul><ul><li>------------------------------------------------------------------------------- </li></ul><ul><li>global - - - - - - - </li></ul><ul><li>shark-c00 lx24-amd64 2 2.02 3.9G 240.8M 4.0G 0.0 </li></ul><ul><li>shark-c02 lx24-amd64 2 2.00 3.9G 214.9M 4.0G 0.0 </li></ul><ul><li>shark-c03 lx24-amd64 2 1.76 3.9G 215.9M 4.0G 0.0 </li></ul>
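qhost prints plain columns, so its output is easy to filter. A minimal sketch with a hypothetical helper `free_hosts` that prints hosts whose load average leaves at least one CPU idle; load is only a rough proxy for a free slot, since the scheduler also counts slots and other resource requests:

```shell
# free_hosts: read "qhost" output on stdin and print each host whose load
# average is at least 1.0 below its CPU count, with that idle capacity.
# Header, separator and "global" lines fail the numeric tests and are skipped.
free_hosts() {
    awk '$3 ~ /^[0-9]+$/ && $4 ~ /^[0-9.]+$/ && $3 - $4 >= 1 {
        printf "%s %.2f\n", $1, $3 - $4
    }'
}
```

Usage: `qhost | free_hosts`. With the sample output above it prints nothing, because every node has a load close to 2 on 2 CPUs.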
  7. SGE Commands - qsub <ul><li>Create a job script (myjob.sh) </li></ul><ul><li>Submit it for execution: </li></ul><ul><li>$ qsub myjob.sh </li></ul><ul><li>Your job 742 (&quot;myjob.sh&quot;) has been submitted. </li></ul><ul><li>Simplest job: </li></ul><ul><li>[vaidya@shark ~]$ cat myjob.sh </li></ul><ul><li>#!/bin/sh </li></ul><ul><li>sleep 10 </li></ul><ul><li>date > /tmp/test1.out.txt </li></ul><ul><li>Note: /tmp/test1.out.txt is written on the execution host, not on the submit host. </li></ul><ul><li>Variation: qsub -cwd myjob.sh (run the job in the current working directory) </li></ul>
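Scripts that submit jobs usually need the job ID for later qstat or qdel calls. A sketch that parses it from the confirmation message shown above; `submit_job` and `parse_job_id` are hypothetical helper names, and array jobs (whose message reads "Your job-array ...") are deliberately not matched:

```shell
# parse_job_id: extract the numeric ID from qsub's confirmation line, e.g.
#   Your job 742 ("myjob.sh") has been submitted.
parse_job_id() {
    sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# submit_job SCRIPT [QSUB_OPTIONS...]: submit SCRIPT and print only its ID.
# qsub options must come before the script name, so they are passed first.
submit_job() {
    local script="$1"
    shift
    qsub "$@" "$script" | parse_job_id
}
```

Usage: `id=$(submit_job myjob.sh -cwd)`, then `qstat -j "$id"` or `qdel "$id"`.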
  8. (C) Anand Vaidya anand@novaglobal.com.sg SGE Commands - qstat <ul><li>Check the status of your jobs: </li></ul><ul><ul><li>qstat ; qstat -f ; </li></ul></ul><ul><ul><li>qstat -u username ; qstat -j job_id </li></ul></ul><ul><li>[root@shark ~]# qstat </li></ul><ul><li>job-ID prior name user state submit/start at queue slots ja-task-ID </li></ul><ul><li>----------------------------------------------------------------------------------------------------------------- </li></ul><ul><li>639 0.55500 HCPDIV7 test1 r 05/17/2006 10:16:31 all.q@shark-c00 1 </li></ul><ul><li>658 0.55500 HCPDIV1 test1 r 05/17/2006 13:37:35 all.q@shark-c00 1 </li></ul><ul><li>694 0.55500 FCCDVI test1 r 05/17/2006 23:52:19 all.q@shark-c02 1 </li></ul><ul><li>695 0.55500 FCCDVI1 test1 r 05/17/2006 23:52:19 all.q@shark-c02 1 </li></ul>
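For a quick summary of a long listing, plain qstat output can be tallied by its state column. A sketch; `count_states` is a hypothetical helper, and it assumes the default column layout shown above (job ID first, state fifth):

```shell
# count_states: read plain "qstat" output on stdin and print
# "STATE COUNT" lines, sorted by state. Only lines that start with a
# numeric job ID are counted, so the header and separator are skipped.
count_states() {
    awk '$1 ~ /^[0-9]+$/ { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}
```

Usage: `qstat -u test1 | count_states`. For the listing above it prints `r 4`.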
  9. SGE Commands - qstat <ul><li>The status of a job is indicated by letters: </li></ul><ul><li>qw - queued, waiting </li></ul><ul><li>t - transferring </li></ul><ul><li>r - running </li></ul><ul><li>s, S - suspended </li></ul><ul><li>R - restarted </li></ul><ul><li>T - threshold </li></ul>
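When post-processing qstat output in scripts, the letters above can be mapped back to words. A small sketch covering only the codes on this slide; `describe_state` is a hypothetical helper, and combined codes such as `Eqw` (a job in error state) fall through to the default branch:

```shell
# describe_state CODE: translate a qstat state code from the list above.
describe_state() {
    case "$1" in
        qw)  echo "queued, waiting" ;;
        t)   echo "transferring" ;;
        r)   echo "running" ;;
        s|S) echo "suspended" ;;
        R)   echo "restarted" ;;
        T)   echo "threshold" ;;
        *)   echo "unknown state: $1" ;;
    esac
}
```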
  10. SGE Commands - qdel <ul><li>Delete your job, if you wish: </li></ul><ul><li>qdel 743 </li></ul><ul><li>vaidya has deleted job 743 </li></ul>
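qdel accepts several job IDs at once, so it combines well with a filter over qstat output. A sketch that selects only your jobs still waiting in the queue; `pending_ids` is a hypothetical helper and assumes the qstat column layout shown earlier:

```shell
# pending_ids: read plain "qstat" output on stdin and print the IDs of jobs
# whose state is exactly "qw" (queued, waiting). Running jobs are left alone.
pending_ids() {
    awk '$1 ~ /^[0-9]+$/ && $5 == "qw" { print $1 }'
}
```

Usage: `qstat -u "$USER" | pending_ids | xargs -r qdel` (the GNU `-r` flag skips qdel when nothing is pending). To remove all of your jobs regardless of state, `qdel -u "$USER"` does it in one step.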
  11. SGE Commands - qmon <ul><li>qmon is an X Window System GUI tool to submit, delete and view jobs and to configure the SGE system </li></ul><ul><li>Example: submit a job using qmon </li></ul><ul><ul><li>Click the Job Submission icon. </li></ul></ul><ul><ul><li>Click the Job Script file selection icon to open a file selection box, select your script file, then click OK. </li></ul></ul><ul><ul><li>Click the Submit button at the bottom of the Job Submission dialog. </li></ul></ul><ul><ul><li>After a couple of seconds, you should be able to monitor your job in the Job Control dialog: click the Job Control icon in the QMON control panel. </li></ul></ul><ul><ul><li>The job first appears under Pending Jobs and quickly moves to Running Jobs once it starts. </li></ul></ul>
  12. SGE Commands – qlogin, qrsh, qsh, qtcsh <ul><li>Submit an interactive session request: </li></ul><ul><li>qlogin </li></ul><ul><li>qrsh </li></ul><ul><li>Ensure you have a valid X server running on your desktop and that remote X clients are allowed to display on it. </li></ul><ul><li>Then submit an interactive session request: </li></ul><ul><li>qsh </li></ul><ul><li>qtcsh </li></ul><ul><li>Note: these features need additional configuration and may not work otherwise. </li></ul>
  13. SGE Commands – jobscript <ul><li>Sample job script: </li></ul><ul><li>#!/bin/bash </li></ul><ul><li># </li></ul><ul><li>#$ -cwd </li></ul><ul><li>#$ -j y </li></ul><ul><li>#$ -S /bin/bash </li></ul><ul><li>#$ -V </li></ul><ul><li>date </li></ul><ul><li>sleep 10 </li></ul><ul><li>env </li></ul><ul><li>date </li></ul>
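  14. SGE Commands – jobscript <ul><li>Sample parallel (MPI) job script: </li></ul><ul><li>#!/bin/bash </li></ul><ul><li># </li></ul><ul><li>#$ -cwd </li></ul><ul><li>#$ -j y </li></ul><ul><li>#$ -S /bin/bash </li></ul><ul><li># </li></ul><ul><li>$MPI_DIR/mpirun -np $NSLOTS -machinefile $TMPDIR/machines myparallelprog.exe {infile.txt outfile.txt} </li></ul><ul><li>Submit it with a parallel environment, e.g. qsub -pe mpich 4 myjob.sh: SGE then sets $NSLOTS to the number of granted slots, and the PE start-up script typically provides $TMPDIR/machines. </li></ul>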
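The parallel script above only works when SGE has actually granted a parallel environment: `NSLOTS` is set by SGE, and with the mpich PE its start-up script is expected to write the host list to `$TMPDIR/machines`. A sketch of a guard to call before `mpirun`; `pe_summary` is a hypothetical helper:

```shell
# pe_summary: sanity-check the parallel environment before calling mpirun.
# Prints the slot count and the number of distinct hosts in the machine file;
# returns 1 if NSLOTS or $TMPDIR/machines is missing (e.g. no -pe request).
pe_summary() {
    local machines="${TMPDIR:-/tmp}/machines" hosts
    if [ -z "${NSLOTS:-}" ] || [ ! -r "$machines" ]; then
        echo "no parallel environment; submit with e.g. qsub -pe mpich 4" >&2
        return 1
    fi
    hosts=$(sort -u "$machines" | awk 'END { print NR }')
    echo "slots=$NSLOTS hosts=$hosts"
}
```

In the job script, a line such as `pe_summary || exit 1` placed just before the mpirun call makes a missing `-pe` request fail fast with a readable message.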
  15. SGE Commands – jobscript <ul><li>-cwd = change to the current directory before running the job </li></ul><ul><li>-j y = merge stderr into stdout </li></ul><ul><li>-r y = the job can be re-run (e.g. if its execution host crashes) </li></ul><ul><li>-N jname = set the job name </li></ul><ul><li>-l h_rt=00:30:00 = hard run-time limit: the job is killed after 30 minutes </li></ul><ul><li>-pe mpich N = run in the mpich parallel environment with N slots </li></ul><ul><li>-pe mpich-ib N = use the InfiniBand parallel environment </li></ul><ul><li>-pe mpich-eth N = use the Ethernet parallel environment </li></ul><ul><li>-S /bin/bash = interpret the job script with bash </li></ul><ul><li>-V = export all environment variables of the submitting shell to the job </li></ul>
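When many similar jobs are needed (for example a parameter sweep), a small generator keeps the directives above consistent. A sketch; `make_job`, `sweep1` and `run.sh` are hypothetical names, and the command arguments are joined with spaces, so arguments that themselves contain spaces need extra quoting:

```shell
# make_job NAME H_RT COMMAND...: print a job script using the directives
# listed above. "#$" followed by a space is not expanded by the shell,
# so the directive prefix is emitted literally.
make_job() {
    local name="$1" h_rt="$2"
    shift 2
    cat <<EOF
#!/bin/bash
#$ -N $name
#$ -cwd
#$ -j y
#$ -r y
#$ -S /bin/bash
#$ -l h_rt=$h_rt
#$ -V
$*
EOF
}
```

Usage: `make_job sweep1 00:30:00 ./run.sh > sweep1.sh && qsub sweep1.sh`.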
  16. Admin Commands <ul><li>The next few slides show commands useful for SGE admins (not users/researchers). </li></ul>
  17. SGE Commands – qconf <ul><li>Show: </li></ul><ul><ul><li>complexes: qconf -sc </li></ul></ul><ul><ul><li>queues (list): qconf -sql </li></ul></ul><ul><ul><li>parallel environments (list): qconf -spl </li></ul></ul><ul><ul><li>exec hosts: qconf -sel (list), qconf -se c35 (details of host c35) </li></ul></ul><ul><ul><li>submit hosts: qconf -ss </li></ul></ul><ul><ul><li>admin hosts: qconf -sh </li></ul></ul><ul><ul><li>calendars (list): qconf -scall </li></ul></ul><ul><ul><li>global configuration: qconf -sconf </li></ul></ul><ul><ul><li>user list: qconf -suserl </li></ul></ul><ul><ul><li>scheduler configuration: qconf -ssconf </li></ul></ul>
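An admin host can use these show options to snapshot the configuration before changing it. A sketch; `backup_sge_config` and the file naming scheme are hypothetical, and besides the list options above it uses `qconf -sq`, `-se` and `-sp` to show one queue, execution host or parallel environment in full:

```shell
# backup_sge_config DIR: save the main SGE configuration objects as text
# files under DIR. Run on an admin host; it only reads, never modifies.
backup_sge_config() {
    local dir="$1" q h pe
    mkdir -p "$dir" || return 1
    qconf -sconf  > "$dir/global.conf"
    qconf -ssconf > "$dir/scheduler.conf"
    qconf -sc     > "$dir/complexes.conf"
    for q in $(qconf -sql); do
        qconf -sq "$q" > "$dir/queue.$q.conf"
    done
    for h in $(qconf -sel); do
        qconf -se "$h" > "$dir/exechost.$h.conf"
    done
    for pe in $(qconf -spl); do
        qconf -sp "$pe" > "$dir/pe.$pe.conf"
    done
}
```

Usage: `backup_sge_config "$HOME/sge-backup-$(date +%Y%m%d)"`.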
  18. SGE Commands – qping <ul><li>Check that the execution daemon on shark-c01 (port 537 on this cluster) is alive: </li></ul><ul><li>[anand@shark-c02 ~]$ qping -info shark-c01 537 execd 1 </li></ul><ul><li>05/24/2006 21:57:34: </li></ul><ul><li>SIRM version: 0.1 </li></ul><ul><li>SIRM message id: 1 </li></ul><ul><li>start time: 05/24/2006 21:31:37 (1148477497) </li></ul><ul><li>run time [s]: 1768 </li></ul><ul><li>messages in read buffer: 0 </li></ul><ul><li>messages in write buffer: 0 </li></ul><ul><li>nr. of connected clients: 2 </li></ul><ul><li>status: 0 </li></ul><ul><li>info: dispatcher: R (0.04) | OK </li></ul><ul><li>Monitor: disabled </li></ul>
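Running the same check across all execution hosts gives a quick health report. A sketch, assuming qping exits with a non-zero status when it cannot reach the daemon; the port 537 comes from the example above and is site-specific, and `check_execds` is a hypothetical helper:

```shell
# check_execds PORT HOST...: ping the execution daemon on each host with
# "qping -info" and report whether it answered.
# Assumption: qping returns non-zero when the daemon does not respond.
check_execds() {
    local port="$1" h
    shift
    for h in "$@"; do
        if qping -info "$h" "$port" execd 1 > /dev/null 2>&1; then
            echo "$h ok"
        else
            echo "$h NOT RESPONDING"
        fi
    done
}
```

Usage: `check_execds 537 $(qconf -sel)` checks every execution host known to the qmaster.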
  19. LSF Commands <ul><li>bsub – submit a job </li></ul><ul><li>bstop – suspend a job </li></ul><ul><li>bresume – resume a suspended job </li></ul><ul><li>btop – move a job to the top of its queue </li></ul><ul><li>bswitch – move jobs between queues </li></ul><ul><li>lsgrun – run a task on a set of hosts </li></ul><ul><li>bkill – kill a job </li></ul>
  20. LSF Commands <ul><li>lsmon – monitor load, resource availability, etc. </li></ul><ul><li>lsid – show LSF details (version, etc.) </li></ul><ul><li>lshosts – show hosts and their static information </li></ul><ul><li>lsload – show load information for hosts </li></ul><ul><li>lsinfo – show LSF configuration information </li></ul><ul><li>busers – show user information </li></ul><ul><li>bacct – show accounting information on finished jobs </li></ul><ul><li>bjobs – show information on jobs </li></ul><ul><li>bpeek – show stdout/stderr of unfinished jobs </li></ul>
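For comparison with the SGE job scripts earlier, LSF reads `#BSUB` lines from a script passed to bsub on standard input (`bsub < job.sh`). A sketch of a generator for such a script; the options are standard bsub ones (-J job name, -o output file with %J replaced by the job ID, -W run limit in [hours:]minutes), while `make_lsf_job`, `sweep1` and `run.sh` are hypothetical names:

```shell
# make_lsf_job NAME LIMIT COMMAND...: print an LSF job script.
# With -o and no -e, LSF also writes stderr to the -o file,
# much like "-j y" in SGE.
make_lsf_job() {
    local name="$1" limit="$2"
    shift 2
    cat <<EOF
#!/bin/bash
#BSUB -J $name
#BSUB -o $name.%J.out
#BSUB -W $limit
$*
EOF
}
```

Usage: `make_lsf_job sweep1 0:30 ./run.sh > sweep1.lsf && bsub < sweep1.lsf`, then `bjobs` to follow it.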
  21. Acknowledgements & Copying <ul><li>This material is based on my experience as well as material collected from the SGE documentation. </li></ul><ul><li>This presentation can be redistributed as follows: </li></ul><ul><ul><ul><li>No commercial redistribution (e.g. as part of a for-profit CD-ROM or of your sales pitch) without my permission; ask me first. </li></ul></ul></ul><ul><ul><ul><li>You must attribute the document's creator. </li></ul></ul></ul><ul><ul><ul><li>Share alike: if you enhance or modify this document, share the modifications or the modified document. </li></ul></ul></ul><ul><ul><ul><li>In other words, this document is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License: http://creativecommons.org/licenses/by-nc-sa/2.5/ </li></ul></ul></ul>
  22. The End <ul><li>Thanks for your time. If you have any feedback, corrections or questions, please contact me: Anand Vaidya, anand@novaglobal.com.sg </li></ul><ul><li>This document was created with OpenOffice on Linux. Email me if you want the .odp file instead of the PDF. </li></ul>
