  1. Running on the SDSC Blue Gene
     Mahidhar Tatineni, Blue Gene Workshop, SDSC, April 5, 2007
  2. BG System Overview: SDSC’s three-rack system
  3. BG System Overview: Integrated system
  4. BG System Overview: Multiple operating systems & functions
     • Compute nodes: run the Compute Node Kernel (CNK = blrts)
       - Each runs only one job at a time
       - Each uses very little memory for the CNK
     • I/O nodes: run Embedded Linux
       - Run CIOD to manage compute nodes
       - Perform file I/O
       - Run GPFS
     • Front-end nodes: run SuSE Linux
       - Support user logins
       - Run cross compilers & linker
       - Run parts of mpirun to submit jobs & LoadLeveler to manage jobs
     • Service node: runs SuSE Linux
       - Uses DB2 to manage four system databases
       - Runs control system software, including MMCS
       - Runs other parts of mpirun & LoadLeveler
     • Software comes in drivers: we are currently running Driver V1R3M1
  5. SDSC Blue Gene: Getting started (logging on & moving files)
     • Logging on
       - ssh bglogin.sdsc.edu
         or
         ssh -l username bglogin.sdsc.edu
       - Alternate login node: bg-login4.sdsc.edu
         (We will use bg-login4 for the workshop)
     • Moving files
       - scp file username@bglogin.sdsc.edu:~
         or
         scp -r directory username@bglogin.sdsc.edu:~
  6. SDSC Blue Gene: Getting started (places to store your files)
     • /users (home directory)
       - 1.1 TB NFS-mounted file system
       - Recommended for storing source / important files
       - Do not write data/output to this area: slow and limited in size!
       - Regular backups
     • /bggpfs: available for parallel I/O via GPFS
       - ~18.5 TB accessed via IA-64 NSD servers
       - No backups
     • /gpfs-wan: 700 TB available for parallel I/O, shared with DataStar and the TG IA-64 cluster
  7. SDSC Blue Gene: Checking your allocation
     • Use the reslist command to check your allocation on the SDSC Blue Gene
     • Sample output is as follows:

       bg-login1 mahidhar/bg_workshop> reslist -u ux452208
       Querying database, this may take several seconds ...
       Output shown is local machine usage. For full usage on roaming accounts, please use tgusage.

       SBG: Blue Gene at SDSC
                                             SU Hours    SU Hours
       Name      UID     ACID  ACC  PCTG    ALLOCATED        USED  USER
       ux452208  452208  1606   U    100        99999           0  Guest8, Hpc
       MKG000            1606                   99999          40
  8. Accessing HPSS from the Blue Gene
     • What is HPSS?
       - The High Performance Storage System (HPSS) is the centralized, long-term data storage system at SDSC
     • Set up your authentication
       - Run the 'get_hpss_keytab' script
     • Use the hsi and htar clients to connect to HPSS. For example:
       - hsi put mytar.tar
       - htar -c -f mytar.tar -L file_or_directory
  9. Using the compilers: Important programming considerations
     • Front-end nodes have different processors & run a different OS than the compute nodes
       - Hence codes must be cross-compiled
       - Care must be taken with configure scripts: discovery of system characteristics during compilation (e.g., via configure) may require modifications to the script
       - If code has to be executed during the configure step, make sure it runs on the compute nodes
       - Alternatively, system characteristics can be specified by the user and the configure script modified to take this into account
     • Some system calls are not supported by the compute node kernel
  10. Using the compilers: Compiler versions, paths, & wrappers
     • Compilers (version numbers the same as on DataStar)
       - XL Fortran V10.1: blrts_xlf & blrts_xlf90
       - XL C/C++ V8.0: blrts_xlc & blrts_xlC
     • Paths to compilers in the default .bashrc
       - export PATH=/opt/ibmcmp/xlf/bg/10.1/bin:$PATH
       - export PATH=/opt/ibmcmp/vac/bg/8.0/bin:$PATH
       - export PATH=/opt/ibmcmp/vacpp/bg/8.0/bin:$PATH
     • Compilers with MPI wrappers (recommended)
       - mpxlf, mpxlf90, mpcc, & mpCC
     • Path to MPI-wrapped compilers in the default .bashrc
       - export PATH=/usr/local/apps/bin:$PATH
  11. Using the compilers: Options
     • Compiler options
       - -qarch=440 uses only a single FPU per processor (minimum option)
       - -qarch=440d allows both FPUs per processor (alternate option)
       - -qtune=440 tunes for the 440 processor
       - -O3 gives minimal optimization with no SIMDization
       - -O3 -qarch=440d adds backend SIMDization
       - -O3 -qhot adds TPO (a high-level inter-procedural optimizer) SIMDization and more loop optimization
       - -O4 adds compile-time interprocedural analysis
       - -O5 adds link-time interprocedural analysis
         (TPO SIMDization is the default with -O4 and -O5)
     • Current recommendation:
       - Start with -O3 -qarch=440d -qtune=440
       - Try -O4, -O5 next
  12. Using libraries
     • ESSL
       - Version 4.2 is available in /usr/local/apps/lib
     • MASS/MASSV
       - Version 4.3 is available in /usr/local/apps/lib
     • FFTW
       - Versions 2.1.5 and 3.1.2 are available in both single & double precision; the libraries are located in /usr/local/apps/V1R3
     • NETCDF
       - Versions 3.6.0p1 and 3.6.1 are available in /usr/local/apps/V1R3
     • Example link paths
       - -Wl,--allow-multiple-definition -L/usr/local/apps/lib -lmassv -lmass -lesslbg -L/usr/local/apps/V1R3/fftw-3.1.2s/lib -lfftw3f
  13. Running jobs: Overview
     • There are two compute modes
       - Coprocessor (CO) mode: one compute processor per node
       - Virtual node (VN) mode: two compute processors per node
     • Jobs run in partitions or blocks
       - These are typically powers of two
       - Blocks must be allocated (or booted) before a run & are restricted to a single user at a time
     • Only batch jobs are supported
       - Batch jobs are managed by LoadLeveler
     • Users can monitor jobs using llq -b & llq -x
  14. Running jobs: LoadLeveler for batch jobs
     • Here is an example LoadLeveler run script (test.cmd):

       #!/usr/bin/ksh
       #@ environment = COPY_ALL;
       #@ job_type = BlueGene
       #@ account_no = <your user account>
       #@ class = parallel
       #@ bg_partition = <partition name; for example: top>
       #@ output = file.$(jobid).out
       #@ error = file.$(jobid).err
       #@ notification = complete
       #@ notify_user = <your email address>
       #@ wall_clock_limit = 00:10:00
       #@ queue
       mpirun -mode VN -np <number of procs> -exe <your executable> -cwd <working directory>

     • Submit as follows:
       llsubmit test.cmd
  15. Running jobs: mpirun options
     • Key mpirun options are
       - -mode     compute mode: CO or VN
       - -np       number of compute processors
       - -mapfile  logical mapping of processors
       - -cwd      full path of the current working directory
       - -exe      full path of the executable
       - -args     arguments of the executable (in double quotes)
       - -env      environment variables (in double quotes)
       (These are mostly different from the options used on TeraGrid)
  16. Running jobs: Partition layout and usage guidelines
     • To make effective use of the Blue Gene, production runs should generally use one-fourth or more of the machine, i.e., 256 or more compute nodes. Predefined partitions are therefore provided for production runs:
       - SDSC: all 3,072 nodes
       - R01R02: 2,048 nodes combining racks 1 & 2
       - rack, R01, R02: all 1,024 nodes of rack 0, rack 1, and rack 2, respectively
       - top, bot; R01-top, R01-bot; R02-top, R02-bot: 512 nodes each
       - top256-1 & top256-2: 256 nodes in each half of the top midplane of rack 0
       - bot256-1 & bot256-2: 256 nodes in each half of the bottom midplane of rack 0
     • Smaller 64-node (bot64-1, ..., bot64-8) and 128-node (bot128-1, ..., bot128-4) partitions are available for test runs.
     • Use the /usr/local/apps/utils/showq command to get more information on the partition requests of jobs in the queue.
  17. Running jobs: Partition Layout
  18. Running Jobs: Reservation
     • There is a reservation in place for today’s workshop for all the guest users.
     • The reservation ID is bgsn.76.r
     • Set the LL_RES_ID variable to bgsn.76.r. This will automatically bind jobs to the reservation.
       - csh/tcsh: setenv LL_RES_ID bgsn.76.r
       - bash: export LL_RES_ID=bgsn.76.r
  19. Running Jobs: Example 1
     • The examples featured in today’s talk are included in the following directory:
       /bggpfs/projects/bg_workshop
     • Copy them to your directory using the following command:
       cp -r /bggpfs/projects/bg_workshop /users/<your_dir>
     • In the first example we will compile a simple MPI program (mpi_hello_c.c / mpi_hello_f.f) and use the sample LoadLeveler script (example1.cmd) to submit and run the job. (A sketch of such a program is shown below.)
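     The slides do not reproduce the source of mpi_hello_c.c; the following is a minimal hello-world sketch of what such a program typically looks like (the workshop file may differ in detail), suitable for compiling with the mpcc wrapper:

       /* Minimal MPI hello-world sketch; the workshop's mpi_hello_c.c may differ. */
       #include <stdio.h>
       #include <mpi.h>

       int main(int argc, char *argv[])
       {
           int rank, size;

           MPI_Init(&argc, &argv);                 /* start the MPI runtime        */
           MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this task's rank             */
           MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of MPI tasks    */

           printf("Hello from rank %d of %d\n", rank, size);

           MPI_Finalize();
           return 0;
       }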
  20. Example 1 (contd.)
     • Compile the example files using the mpcc/mpxlf wrappers
       - mpcc -o hello mpi_hello_c.c
       - mpxlf -o hello mpi_hello_f.f
     • Modify the LoadLeveler submit file (example1.cmd): add the account number, partition name, email address, and mpirun options
     • Use llsubmit to put the job in the queue
       - llsubmit example1.cmd
  21. Running Jobs: Example 2
     • In example 2 we will use an I/O benchmark (IOR) to illustrate the use of arguments with mpirun
     • The mpirun line is as follows:
       mpirun -np 64 -mode CO -cwd /bggpfs/projects/bg_workshop -exe /bggpfs/projects/bg_workshop/IOR -args "-a MPIIO -b 32m -t 4m -i 3"
     • The -mode, -exe, and -args options are used in this example. The -args option passes options through to the IOR executable.
  22. Checkpoint-Restart on the Blue Gene
     • Checkpoint and restart are among the primary techniques for fault recovery on the Blue Gene.
     • The current version of the checkpoint library requires users to manually insert checkpoint calls at the proper places in their codes.
     • The process is initialized by calling the BGLCheckpointInit() function.
     • Checkpoint files are written by calling BGLCheckpoint(). This can be done any number of times, and the checkpoint files are distinguished by a sequence number.
     • The environment variables BGL_CHKPT_RESTART_SEQNO and BGL_CHKPT_DIR_PATH control the restart sequence number and the checkpoint file location.
     (A sketch of where these calls go is shown below.)
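     As a rough illustration of where the calls go (this sketch is not from the slides; BGLCheckpointInit() and BGLCheckpoint() are the calls named above, but the header name and exact signatures are assumptions to be checked against the IBM application development guide):

       /* Hedged sketch: manually inserted checkpoint calls in a solver loop.
        * The header name below is an assumption. */
       #include "bgl_checkpoint.h"   /* assumed name of the checkpoint library header */

       void solve(int nsteps)
       {
           int step;

           BGLCheckpointInit();      /* initialize checkpoint support once */

           for (step = 1; step <= nsteps; step++) {
               /* ... one solver iteration ... */

               /* Checkpoint periodically, e.g. every 1000 steps as in the
                * poisson-chkpt.f example; each call writes checkpoint files
                * tagged with a new sequence number. */
               if (step % 1000 == 0)
                   BGLCheckpoint();
           }
       }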
  23. Example for Checkpoint-Restart
     • Let us look at the entire checkpoint-restart process using the example provided in the /bggpfs/projects/bg_workshop directory.
     • We are using a simple Poisson solver to illustrate the checkpoint process (file: poisson-chkpt.f)
     • Compile the program using mpxlf, including the checkpoint library:
       mpxlf -o pchk poisson-chkpt.f /bgl/BlueLight/ppcfloor/bglsys/lib/libchkpt.rts.a
     • Use the chkpt.cmd file to submit the job
     • The program writes checkpoint files after every 1000 steps. The checkpoint files are tagged with the node IDs and the sequence number. For example:
       ckpt.x06-y01-z00.1.2
  24. Example for Checkpoint-Restart (contd.)
     • Verify that the checkpoint-restart works
     • From the first run (when the checkpoint files were written):

       Done Step #  3997 ; Error=  1.83992678887004613
       Done Step #  3998 ; Error=  1.83991115295111185
       Done Step #  3999 ; Error=  1.83989551716504351
       Done Step #  4000 ; Error=  1.83987988151185511
       Done Step #  4001 ; Error=  1.83986424599153198
       Done Step #  4002 ; Error=  1.83984861060408078
       Done Step #  4003 ; Error=  1.83983297534951951

     • From the second run (continued from step 4000, sequence 4):

       Done Step #  4000 ; Error=  1.83987988151185511
       Done Step #  4001 ; Error=  1.83986424599153198
       Done Step #  4002 ; Error=  1.83984861060408078

     • We get identical results from both runs
  25. BG System Overview: References
     • Blue Gene web site at SDSC
       http://www.sdsc.edu/us/resources/bluegene
     • LoadLeveler guide
       http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.loadl.doc/loadl331/am2ug30305.html
     • Blue Gene application development guide (from IBM Redbooks)
       http://www.redbooks.ibm.com/abstracts/sg247179.html
