2. Sun/Oracle Grid Engine is: a quick and easy way to set up a multi-cluster system using existing hardware. Oracle Grid Engine is the most widely deployed workload management solution in the industry and offers unmatched scalability. On top of a rich set of advanced scheduling capabilities and the flexibility to adapt to any computing environment and application workload, Oracle Grid Engine offers comprehensive support for the cloud computing model.
3. How to Install
Via Webappl.blogspot.com: http://webappl.blogspot.com/2011/05/install-sun-grid-engine-sge-on-ubuntu.html
4. Install SGE on master node
    mpiuser@ub0:~$ sudo apt-get install gridengine-client gridengine-common gridengine-master gridengine-qmon gridengine-exec
    # Remove gridengine-exec from the list if the master node is not supposed to run jobs.
    # During the installation we need to set the cluster CELL name (such as 'default').
5. Install SGE on other nodes
    mpiuser@ub1:~$ sudo apt-get install gridengine-client gridengine-exec
    # Set the CELL name to the same value as on the master node.
6. Set SGE_ROOT and SGE_CELL environment variables
$SGE_ROOT is the installation path of SGE.
$SGE_CELL is the cell name, which is 'default' on our machine.
Edit /etc/profile and /etc/bash.bashrc and add the following two lines:
    export SGE_ROOT=/var/lib/gridengine   # this is the path on our machines
    export SGE_CELL=default
Then source the script:
    source /etc/profile
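Before moving on to qmon it can help to confirm that both variables are visible in the current shell. The following is a minimal sketch, not part of SGE: `check_sge_env` is a hypothetical helper name, and the check for `$SGE_ROOT/$SGE_CELL/common` assumes the cell directory layout referred to later in this deck.

```shell
# Sketch: confirm the Grid Engine environment is usable in the current shell.
# check_sge_env is a hypothetical helper, not an SGE command.
check_sge_env() {
    if [ -z "${SGE_ROOT:-}" ] || [ -z "${SGE_CELL:-}" ]; then
        echo "SGE_ROOT or SGE_CELL is not set" >&2
        return 1
    fi
    # Assumption: the cell keeps its configuration under $SGE_ROOT/$SGE_CELL/common.
    if [ ! -d "$SGE_ROOT/$SGE_CELL/common" ]; then
        echo "missing directory: $SGE_ROOT/$SGE_CELL/common" >&2
        return 1
    fi
    echo "SGE_ROOT=$SGE_ROOT SGE_CELL=$SGE_CELL"
}
```

Running `check_sge_env` after `source /etc/profile` prints the two values on success and returns a non-zero status otherwise.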
7. Configure SGE with qmon
(This section is modified from a note by Junjun Mao.)
Invoke qmon as superuser:
    mpiuser@ub0:~$ sudo qmon
# On our machine, qmon failed to start due to missing fonts '-adobe-helvetica-…'
# To solve the fonts problem:
    mpiuser@ub0:~$ sudo apt-get install xfs xfstt
    mpiuser@ub0:~$ sudo apt-get install t1-xfree86-nonfree ttf-xfree86-nonfree ttf-xfree86-nonfree-syriac xfonts-75dpi xfonts-100dpi
    mpiuser@ub0:~$ sudo reboot
# After the reboot, the problem is gone.
8. Configure hosts
"Host Configuration" -> "Administration Host" -> Add the master node and other administrative nodes
"Host Configuration" -> "Submit Host" -> Add the master node and other submit nodes
"Host Configuration" -> "Execution Host" -> Add the slave nodes
-> Click on "Done" to finish
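The administration and submit host lists can also be filled from the command line with `qconf -ah` and `qconf -as`. The sketch below is an illustration rather than the method used in this deck: `run` and `register_hosts` are hypothetical helpers, and the script only prints the commands unless `DRY_RUN=0` is set. Execution hosts are left out because `qconf -ae` opens an editor on a host template.

```shell
# Sketch: register administration and submit hosts without qmon.
# With DRY_RUN=1 (the default here) the qconf commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

register_hosts() {
    for host in "$@"; do
        run qconf -ah "$host"   # add as administration host
        run qconf -as "$host"   # add as submit host
    done
}
```

For example, `register_hosts ub0` prints the two qconf commands for the master node; setting `DRY_RUN=0` and running it as an SGE manager would apply them.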
9. Configure the user
Add or delete the users that are allowed to access SGE here. In this example, a user is added to an existing group, and this group is later allowed to submit jobs. Everything else is left at its default value.
"User Configuration" -> "Userset" -> Highlight userset "arusers" and click on "Modify" -> Enter the user name in the "User/Group" field
-> Click on "Done" to finish
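The same userset change can be made with `qconf -au`, which takes a comma-separated user list followed by the userset name. This is a minimal sketch under stated assumptions: `add_to_userset` is a hypothetical wrapper, the user names are placeholders, and the command is only printed unless `APPLY=1` is set.

```shell
# Sketch: add users to a userset such as "arusers" from the command line.
# add_to_userset is a hypothetical wrapper around qconf -au.
add_to_userset() {
    userset="$1"; shift
    # Join the remaining arguments with commas, then drop the trailing comma.
    users=$(printf '%s,' "$@")
    users="${users%,}"
    if [ "${APPLY:-0}" = "1" ]; then
        qconf -au "$users" "$userset"
    else
        echo "qconf -au $users $userset"
    fi
}
```

Afterwards `qconf -su arusers` shows the current members of the userset.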
10. Configure the queue
While Host Configuration deals with what computing resources are available and User Configuration defines who has access to those resources, Queue Control defines how hosts and users are connected.
11. Queue Control
"Queue Control" -> "Hosts" -> Confirm that the execution hosts show up there.
"Queue Control" -> "Cluster Queues" -> Click on "Add" -> Name the queue and add the execution nodes to Hostlist
    "User Access" -> Allow access to user group arusers
    "General Configuration" -> Field "Slots" -> Raise the number to the total CPU cores on the slave nodes (it is OK to use a number bigger than the actual CPU cores)
"Queue Control" -> "Queue Instances" -> This is the place to manually assign hosts to queues and to control the state (active, suspend ...) of hosts.
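The "Slots" value is tied to the CPU core count of the nodes, and one way to get that count is to total the NCPU column printed by `qhost`. The sketch below assumes the classic `qhost` layout, where the first two lines are a header and a separator and the third column is NCPU; `total_cores` is a hypothetical helper name.

```shell
# Sketch: sum the NCPU column of qhost output.
# Assumption: column 3 is NCPU; the "global" row shows "-" there and is skipped.
total_cores() {
    awk 'NR > 2 && $3 ~ /^[0-9]+$/ { sum += $3 } END { print sum + 0 }'
}
# Usage on a live cluster: qhost | total_cores
```

If the master node also runs an execution daemon it appears in `qhost` as well, so its cores are included in the total.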
12. Configure parallel environment
"Queue Control" -> "Cluster Queues" -> Select a queue that will run parallel jobs -> Click on "Modify" -> "Parallel Environment" -> Click on the "PE" icon below the right and left arrows -> Click on "Add"
Name the PE and set:
    slots = 999
    start_proc_args = $SGE_ROOT/mpi/startmpi.sh $pe_hostfile
    stop_proc_args = $SGE_ROOT/mpi/stopmpi.sh
    allocation_rule = $fill_up
    "Control slaves" = checked
Make sure the configured PE is moved from "Available PE" to "Referenced PE".
Confirm and close all config windows, then open "Queue Control" -> "Cluster Queues" -> "Parallel Environment" again; the named PE should show up.
Once created and linked to a queue, the PE can also be edited from "Queue Control" -> "PE".
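The parallel environment settings listed here can also be kept in a text file and loaded with `qconf -Ap <file>`. The following is a sketch only: `write_pe_file` is a hypothetical helper, the SGE root is expanded to an absolute path as an assumption, and a real definition usually carries further fields (running `qconf -sp <name>` on an existing PE prints the full set).

```shell
# Sketch: write a PE definition that mirrors the values entered in qmon.
# write_pe_file is a hypothetical helper; $1 = PE name, $2 = output file.
write_pe_file() {
    pe_name="$1"
    out_file="$2"
    sge_root="${SGE_ROOT:-/var/lib/gridengine}"
    # Unquoted heredoc: $pe_name and $sge_root expand here, while \$pe_hostfile
    # and \$fill_up are escaped so that SGE receives them literally.
    cat > "$out_file" <<EOF
pe_name            $pe_name
slots              999
start_proc_args    $sge_root/mpi/startmpi.sh \$pe_hostfile
stop_proc_args     $sge_root/mpi/stopmpi.sh
allocation_rule    \$fill_up
control_slaves     TRUE
EOF
}
```

After loading the file, `qconf -aattr queue pe_list <pe_name> <queue_name>` adds the PE to a queue's referenced list, which corresponds to the "Available PE" to "Referenced PE" step in qmon.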
13. Check whether SGE hosts are running properly
    mpiuser@ub0:~$ qhost                                       # should list the system info from all nodes
    mpiuser@ub0:~$ qconf -sel                                  # should list the hostnames of the nodes
    mpiuser@ub0:~$ qconf -sql                                  # should list the queues
    mpiuser@ub0:~$ ps aux | grep sge_qmaster | grep -v grep    # check master daemon
    mpiuser@ub0:~$ ps aux | grep sge_execd | grep -v grep      # check execute daemon
    mpiuser@ub1:~$ ps aux | grep sge_execd | grep -v grep      # check execute daemon
# If the sge_qmaster or sge_execd daemon is not running, try starting it as a service:
    mpiuser@ub0:~$ sudo service gridengine-master start
    mpiuser@ub1:~$ sudo service gridengine-exec start
    …
# Reboot the node(s) if sge_qmaster or sge_execd fails to start.
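The two daemon checks can be folded into one small function. This is a minimal sketch: `daemon_status` is a hypothetical helper that reads command names on standard input, one per line, so that it can be fed from `ps`.

```shell
# Sketch: report whether sge_qmaster and sge_execd appear in a process list.
# daemon_status reads one command name per line on stdin.
daemon_status() {
    procs=$(cat)
    for daemon in sge_qmaster sge_execd; do
        # -x matches the whole line, so "sge_execd_old" would not count.
        if printf '%s\n' "$procs" | grep -qx "$daemon"; then
            echo "$daemon: running"
        else
            echo "$daemon: not running"
        fi
    done
}
# Usage: ps -e -o comm= | daemon_status
```

On an execution-only node, `sge_qmaster: not running` is the expected result.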
14. Run a test script
Make a script named 'test' with the following content:
    #!/bin/bash
    ### Request bash as the shell for the job
    #$ -S /bin/bash
    ### Use current directory as working directory
    #$ -cwd
    ### Name the job:
    #$ -N test
    echo "Running environment:"
    env
    echo "============================="
    ### end of script
15. Job Submission
To submit the job:
    qsub test     # a job id is returned if successful
Query the job status:
    qstat
If the job runs successfully, two output files are produced in the current working directory: test.oXXX (the standard output) and test.eXXX (the standard error), where test is the job name and XXX is the job id.
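The job id that qsub returns determines the names of the two output files. The sketch below assumes the usual confirmation line of the form `Your job 123 ("test") has been submitted`; `job_id_from_qsub` and `output_files` are hypothetical helper names.

```shell
# Sketch: extract the job id from qsub's confirmation line.
# Assumption: the line looks like: Your job 123 ("test") has been submitted
job_id_from_qsub() {
    sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# Print the expected output file names; $1 = job name, $2 = job id.
output_files() {
    echo "$1.o$2 $1.e$2"
}
# Usage on a live cluster:
#   id=$(qsub test | job_id_from_qsub)
#   ls $(output_files test "$id")
```

The files only appear once the job has started, so `qstat` is still the place to check while the job is waiting.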
16. Always check your logs
Check the log messages if an error occurs:
    mpiuser@ub0:~$ less /var/spool/gridengine/qmaster/messages      # master node
    mpiuser@ub0:~$ less /var/spool/gridengine/execd/ub0/messages    # exec node
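On a busy cluster the messages files grow quickly, so it can be convenient to show only the error entries. The sketch below assumes the pipe-separated layout `date time|component|host|level|text`, with `E` and `C` marking error and critical entries; check a few lines of your own log before relying on it. `errors_only` is a hypothetical helper name.

```shell
# Sketch: keep only error (E) and critical (C) entries from a messages file.
# Assumption: fields are separated by "|" and the fourth field is the level.
errors_only() {
    awk -F'|' '$4 == "E" || $4 == "C"'
}
# Usage: errors_only < /var/spool/gridengine/qmaster/messages
```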
17. Possible Errors
Question: My output file contains "Warning: no access to tty (Bad file descriptor). Thus no job control in this shell."
Answer: This warning appears when tcsh or csh is used as the shell for the submitted job. It is safe to ignore. Alternatively, run the program in a different shell with qsub -S /bin/bash, or add the line '#$ -S /bin/bash' to the job script.
18. Possible Errors
Question: The master host failed to respond properly. The error message is:
    error: commlib error: access denied (client IP resolved to host name 'ub0…'. This is not identical to clients host name 'ub0')
    error: unable to contact qmaster using port 6444 on host 'ub0'
Answer: Reboot the master node, or install SGE from source code on the master node (solutions not confirmed yet).
The error can also occur when the gethostname utility (full path '/usr/lib/gridengine/gethostname' on our machines) returns a hostname different from the one returned by 'hostname -f', for example on a host with multiple network interfaces. If this is the case, create a file named 'host_aliases' under '$SGE_ROOT/$SGE_CELL/common' and populate it as follows:
    # cat host_aliases
    ub0 ub0.my.com ub0-grid
    ub1 ub1.my.com ub1-grid
    ub2 ub2.my.com ub2-grid
    ub3 ub3.my.com ub3-grid
Then restart the gridengine daemon (see the man page of sge_host_aliases for details).
Check the aliases:
    mpiuser@ub0:~$ /usr/lib/gridengine/gethostname -aname ub0-grid
    mpiuser@ub0:~$ /usr/lib/gridengine/gethostname -aname ub0
    # both of them should return ub0
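In a host_aliases file the first name on each line is the name SGE uses, and the remaining names on that line are aliases for it. As a way to sanity-check a file before restarting the daemons, the sketch below looks a name up the same way; `canonical_host` is a hypothetical helper and does not replace the gethostname check.

```shell
# Sketch: look up the primary host name for a name or alias in host_aliases.
# $1 = host name or alias, $2 = path to the host_aliases file.
canonical_host() {
    awk -v name="$1" '
        /^[ \t]*#/ { next }    # skip comment lines
        {
            for (i = 1; i <= NF; i++) {
                if ($i == name) { print $1; found = 1; exit }
            }
        }
        END { if (!found) exit 1 }
    ' "$2"
}
# Usage: canonical_host ub0-grid "$SGE_ROOT/$SGE_CELL/common/host_aliases"
```

A name that is not listed produces no output and a non-zero exit status.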