Contents
- The problem being addressed
- Torque – how it helps
- Maui – how it helps
- Job Submission – job priorities, job dependencies, job queues
- Job Monitoring
- Job Accounting
- Install
The problem
- Run jobs/tasks as soon as possible
- Run higher-priority jobs earlier than others
- Run jobs automatically on any free machine across the cluster, not just on one machine
- Run jobs unattended and send a notification in case of error
- Keep machine utilization high
- Monitor and account for all usage
Torque – how it helps
TORQUE's job as the resource manager:
- Accepting and starting jobs/tasks across a batch farm (qsub command)
- Cancelling jobs (qdel command)
- Monitoring the state of jobs (qstat command)
- Collecting return codes (qstat)
- Accounting of jobs: the time they took, memory used, etc. (tracejob command)
Maui – how it helps
What is MAUI's job? MAUI makes all the scheduling decisions. To decide whether a job should be started, it asks questions like:
- Is there enough resource to start the job?
- Given all the jobs I could start, which one should I start?
MAUI runs a scheduling iteration:
- When a job is submitted
- When a job ends
- At regular, configurable intervals
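The second question above ("which one should I start?") can be illustrated with a toy selection step. This is only a sketch and not Maui's actual algorithm: the function name pick_next_job and the three-column input format are assumptions made for illustration. It picks the highest-priority job whose processor request fits in the currently free processors.

```shell
#!/bin/sh
# Toy selection step, NOT Maui's real scheduling algorithm.
# stdin: one candidate job per line as "<jobid> <priority> <procs_needed>"
# $1:    number of processors currently free
# Prints the id of the highest-priority job that fits, or nothing at all.
pick_next_job() {
    awk -v free="$1" '
        ($3 + 0) <= (free + 0) && (best == "" || ($2 + 0) > (best_prio + 0)) {
            best = $1
            best_prio = $2
        }
        END { if (best != "") print best }
    '
}

# Example: three queued jobs, four free processors.
# printf "11 0 8\n12 20 4\n13 10 2\n" | pick_next_job 4
```

When two jobs have the same priority, this sketch keeps the one listed first. A real scheduler weighs many more factors than priority and processor count.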
Job Submission
Jobs are submitted to the batch system with the qsub command, as in:
    qsub job.sh
You can also add a resource description directly on the command line:
    qsub -l nodes=1:ppn=4,mem=200mb,walltime=120 job.sh
qsub returns a <jobid>.
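The submission pattern on this slide can be wrapped in a small shell helper. This is a minimal sketch: resource_list and submit_job are hypothetical helper names, not part of Torque. It assumes that the resources passed to -l are separated by commas and that qsub prints the job id on standard output.

```shell
#!/bin/sh
# Hypothetical helper: build the value for qsub's -l option.
# Arguments: nodes, processors per node, memory, walltime.
resource_list() {
    printf 'nodes=%s:ppn=%s,mem=%s,walltime=%s\n' "$1" "$2" "$3" "$4"
}

# Hypothetical helper: submit a script with a resource list and print
# the job id that qsub returns. Fails early if qsub is not installed.
submit_job() {
    script=$1
    resources=$2
    if ! command -v qsub >/dev/null 2>&1; then
        echo "qsub not found; is Torque installed?" >&2
        return 1
    fi
    qsub -l "$resources" "$script"
}

# Example (needs a working Torque server):
#   jobid=$(submit_job job.sh "$(resource_list 1 4 200mb 120)")
```

Capturing the job id in a variable is what makes it possible to monitor the job or to make later jobs depend on it.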
Job priority
- A priority can be given with qsub:
    qsub -p 20 job.sh
- Default priority is 0
- Priorities range from -1024 to 1023
Job dependencies
Run a job only after another job ends successfully:
    echo "vflush" | qsub -W depend=afterok:10.penguin7.orchesys.com -p 10 -q flush_queue
Here '10.penguin7.orchesys.com' is the job id of another job; the current job is launched only after that job completes successfully.
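A dependency chain can be scripted by capturing the job id of the first submission and feeding it to the second. The sketch below is illustrative only: afterok_arg is a hypothetical helper and load.sh is a made-up script name. It builds the depend=afterok value for one or more job ids, joined with colons.

```shell
#!/bin/sh
# Hypothetical helper: build the -W value that makes a job wait until
# every listed job id has finished successfully.
afterok_arg() {
    arg='depend=afterok'
    for id in "$@"; do
        arg="$arg:$id"
    done
    printf '%s\n' "$arg"
}

# Example chain (needs a working Torque server; load.sh is hypothetical):
#   first=$(qsub load.sh)
#   echo "vflush" | qsub -W "$(afterok_arg "$first")" -p 10 -q flush_queue
```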
Job Queues
- Batch systems are usually configured with multiple queues
- Each queue can be configured to accept jobs from a certain group of users, or within specified resource limits
- Queue selection is performed with -q queuename on the qsub command line
- Glassbeam has a default queue (batch) and flush_queue (where only one job can run at a time)
Job Monitoring
- For a job id, you can see the command that was run for the job in the file /var/spool/torque/server_priv/jobs/<JOBID.SC>:
    sudo cat 90.localhost.localdomain.SC
    /home/gbprod/testscript_aruba/aruba_parallel_loader qa0 1306219430 aruba_test_pod /glassbeam/core/bin
- Status of all submitted jobs: qstat
- Status of only one job: qstat <jobid>
- Only running jobs: qstat -r
- Email alert for jobs: qsub -m ae -M email@example.com (send email on a = abort, e = end of job)
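For unattended scripts it can help to read a job's state programmatically instead of scanning the qstat table by eye. The following is a sketch under stated assumptions: parse_job_state and wait_for_job are hypothetical helpers, and the parsing assumes the full listing from `qstat -f <jobid>` contains a line of the form "job_state = R".

```shell
#!/bin/sh
# Hypothetical helper: read `qstat -f <jobid>` output on stdin and print
# the value of the job_state attribute (for example Q, R, or C).
parse_job_state() {
    sed -n 's/^[[:space:]]*job_state = //p'
}

# Hypothetical helper: poll until the job is completed (state C) or the
# server no longer reports it. $2 is the polling interval in seconds.
wait_for_job() {
    jobid=$1
    interval=${2:-30}
    while :; do
        state=$(qstat -f "$jobid" 2>/dev/null | parse_job_state)
        if [ -z "$state" ] || [ "$state" = "C" ]; then
            break
        fi
        sleep "$interval"
    done
}

# Example (needs a working Torque server):
#   wait_for_job 90.localhost.localdomain 10
```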
Job accounting …
tracejob can give the job's return status, how long it took, and what happened to the job today:
    tracejob <jobid>
To search the last d days for the job:
    tracejob -n d <jobid>
Fast version of tracejob:
    tracejob -f error -f system -f admin -f security -f sched -f debug -f debug2 -f job -f job_usage 114.localhost
Install
Torque install
- As root user
- Go to folder install/torque-gb-3.0.1
- Run command: ./torque.setup gbprod localhost
Maui install
- As root user
- Go to folder install/maui-gb-3.3.1
- Run command: sh install.sh