1. Big Iron and Parallel Processing
   USArray Data Processing Workshop
   Scott Teige, PhD
   July 30, 2009
2. Overview
   • How big is “Big Iron”?
   • Where is it, what is it?
   • One system, the details
   • Parallelism, the way forward
   • Scaling and what it means to you
   • Programming techniques
   • Examples
   • Exercises
3. What is the TeraGrid?
   • “… a nationally distributed cyberinfrastructure that provides leading edge
     computational and data services for scientific discovery through research
     and education…”
   • A document exists in your training account home directories.
4. Some TeraGrid Systems

   System      Site     Vendor   Peak (TF)   Memory (TB)
   Kraken      NICS     Cray     608         128
   Ranger      TACC     Sun      579         123
   Abe         NCSA     Dell      89         9.4
   Lonestar    TACC     Dell      62         11.6
   Steele      Purdue   Dell      60         12.4
   Queen Bee   LONI     Dell      50         5.3
   Lincoln     NCSA     Dell      47         3.0
   Big Red     IU       IBM       30         6.0
5. System Layout

   System      Clock (GHz)   Cores
   Kraken      2.30          66048
   Ranger      2.66          62976
   Abe         2.33           9600
   Lonestar    2.66           5840
   Steele      2.33           7144
6. Availability

   System      Peak (TFLOPS)   Utilization   Idle (TF)
   Kraken      608             96%           24.3
   Ranger      579             91%           52.2
   Abe          89             90%            8.9
   Lonestar     62             92%            5.0
   Steele       60             67%           19.8
   Queen Bee    51             95%            2.5
   Lincoln      48              4%           45.6
   Big Red      31             83%            5.2
7. Research Cyberinfrastructure
   The Big Picture:
   • Compute
       Big Red (IBM e1350 Blade Center JS21)
       Quarry (IBM e1350 Blade Center HS21)
   • Storage
       HPSS
       GPFS
       OpenAFS
       Lustre
       Lustre/WAN
8. High Performance Systems
   • Big Red [TeraGrid System]
       30 TFLOPS IBM JS21 SuSE Cluster
       768 blades/3072 cores: 2.5 GHz PPC 970MP
       8 GB memory, 4 cores per blade
       Myrinet 2000
       LoadLeveler & Moab
   • Quarry [Future TeraGrid System]
       7 TFLOPS IBM HS21 RHEL Cluster
       140 blades/1120 cores: 2.0 GHz Intel Xeon 5335
       8 GB memory, 8 cores per blade
       1 Gb Ethernet (upgrading to 10 Gb)
       PBS (Torque) & Moab
9. (Figure-only slide; no text content.)
10. Data Capacitor (AKA Lustre)
    High Performance Parallel File System
    • ca. 1.2 PB spinning disk
    • Local and WAN capabilities
    • SC07 Bandwidth Challenge winner: moved 18.2 Gbps across a single 10 Gbps link
11. HPSS
    • High Performance Storage System
    • ca. 3 PB tape storage
    • 75 TB front-side disk cache
    • Ability to mirror data between the IUPUI and IUB campuses
12. Serial vs. Parallel

    Serial:
    • Calculation
    • Flow Control
    • I/O

    Parallel:
    • Calculation
    • Flow Control
    • I/O
    • Synchronization
    • Communication
13. Amdahl’s Law
    (Figure: a serial program with serial fraction 1-F and a parallelizable
    fraction F that runs as F/N on N cores.)

    Amdahl’s Law:        S = 1 / (1 - F + F/N)
    Special case, F = 1: S = N   (ideal scaling)
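As a quick check of the formula, here is a minimal C sketch that evaluates the Amdahl speedup for a few core counts (the parallel fraction F = 0.90 and the core counts are chosen purely for illustration):

    #include <stdio.h>

    /* Amdahl's Law: speedup S = 1 / (1 - F + F/N) for parallel fraction F on N cores */
    static double amdahl(double F, int N) { return 1.0 / ((1.0 - F) + F / N); }

    int main(void) {
        double F = 0.90;                  /* assume 90% of the work parallelizes */
        int cores[] = {1, 8, 64, 1024};
        for (int i = 0; i < 4; i++)
            printf("N = %4d   S = %6.2f\n", cores[i], amdahl(F, cores[i]));
        return 0;                         /* S approaches 1/(1-F) = 10 as N grows */
    }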
14. Speed for various scaling rules

    “Paralyzable process”:  S = N e^(-(N-1)/q)

    “Superlinear scaling”:  S > N
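The “paralyzable process” rule can be tabulated the same way. In this small C sketch the overhead parameter q = 16 is an arbitrary illustrative value; the speedup rises, peaks near N = q, and then collapses as synchronization overhead dominates:

    #include <stdio.h>
    #include <math.h>        /* link with -lm if your compiler needs it */

    /* "Paralyzable process" scaling: S(N) = N * exp(-(N-1)/q) */
    int main(void) {
        double q = 16.0;     /* assumed overhead parameter, for illustration only */
        for (int N = 1; N <= 64; N *= 2) {
            double S = N * exp(-(N - 1) / q);
            printf("N = %2d   S = %5.2f\n", N, S);
        }
        return 0;
    }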
15. MPI vs. OpenMP

    MPI:
    • Code may execute across many nodes
    • The entire program is replicated for each core (sections may or may not execute)
    • Variables are not shared
    • Typically requires structural modification to the code

    OpenMP:
    • Code executes only on the set of cores sharing memory
    • Sections of code may be parallel or serial
    • Variables may be shared
    • Incremental parallelization is easy
16. Other methods exist:
    • Sockets
    • Explicit shared memory calls/operations
    • Pthreads
    • None are recommended
17.   export OMP_NUM_THREADS=8
      icc mp_baby.c -openmp -o mp_baby
      ./mp_baby

      #include <stdio.h>
      #include <omp.h>

      int main(int argc, char *argv[]) {
        int iam = 0, np = 1;
        #pragma omp parallel default(shared) private(iam, np)
        {                                      /* fork */
        #if defined (_OPENMP)
          np = omp_get_num_threads();
          iam = omp_get_thread_num();
        #endif
          printf("Hello from thread %d out of %d\n", iam, np);
        }                                      /* join */
      }
18.         PROGRAM DOT_PRODUCT

            INTEGER N, CHUNKSIZE, CHUNK, I
            PARAMETER (N=100)
            PARAMETER (CHUNKSIZE=10)
            REAL A(N), B(N), RESULT

      !     Some initializations
            DO I = 1, N
              A(I) = I * 1.0
              B(I) = I * 2.0
            ENDDO
            RESULT = 0.0
            CHUNK = CHUNKSIZE

      !$OMP PARALLEL DO
      !$OMP& DEFAULT(SHARED) PRIVATE(I)
      !$OMP& SCHEDULE(STATIC,CHUNK)
      !$OMP& REDUCTION(+:RESULT)
            DO I = 1, N
              RESULT = RESULT + (A(I) * B(I))
            ENDDO
      !$OMP END PARALLEL DO NOWAIT

            PRINT *, 'Final Result= ', RESULT
            END
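For comparison with the Fortran version above, here is a minimal C sketch of the same parallel-loop/reduction pattern (the loop bounds and chunk size simply mirror the slide):

    #include <stdio.h>
    #include <omp.h>

    #define N 100

    int main(void) {
        float a[N], b[N], result = 0.0f;

        /* some initializations */
        for (int i = 0; i < N; i++) {
            a[i] = (i + 1) * 1.0f;
            b[i] = (i + 1) * 2.0f;
        }

        /* each thread accumulates a private partial sum; OpenMP combines them at the end */
        #pragma omp parallel for default(shared) schedule(static, 10) reduction(+:result)
        for (int i = 0; i < N; i++)
            result += a[i] * b[i];

        printf("Final result = %f\n", result);
        return 0;
    }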
19. Synchronization Constructs
    • MASTER: block executed only by the master thread
    • CRITICAL: block executed by one thread at a time
    • BARRIER: each thread waits until all threads reach the barrier
    • ORDERED: block executed sequentially by threads
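A short C sketch showing three of these constructs together (the shared counter is only an illustration):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int sum = 0;

        #pragma omp parallel
        {
            #pragma omp master        /* only the master thread (thread 0) runs this */
            printf("team size: %d\n", omp_get_num_threads());

            #pragma omp barrier       /* everyone waits until the whole team arrives */

            #pragma omp critical      /* one thread at a time updates the shared counter */
            sum += omp_get_thread_num();
        }
        printf("sum of thread ids: %d\n", sum);
        return 0;
    }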
20. Data Scope Attribute Clauses
    • SHARED: variable is shared across all threads
    • PRIVATE: variable is replicated in each thread
    • DEFAULT: change the default scoping of all variables in a region
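A small C sketch of explicit scoping; default(none) is used here so the compiler insists that every variable in the region be listed as shared or private:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int n = 8;      /* shared: one copy, visible to every thread */
        int tmp = 0;    /* private: each thread gets its own copy    */

        #pragma omp parallel default(none) shared(n) private(tmp)
        {
            tmp = omp_get_thread_num() * n;
            printf("thread %d: tmp = %d\n", omp_get_thread_num(), tmp);
        }
        return 0;
    }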
21. Some Useful Library Routines
    • omp_set_num_threads(integer)
    • omp_get_num_threads()
    • omp_get_max_threads()
    • omp_get_thread_num()
    • Others are implementation dependent
22. OpenMP Advice
    • Always explicitly scope variables
    • Never branch into/out of a parallel region
    • Never put a barrier in an if block (illustrated below)
    • Quarry is at an OpenMP version below 3.0, so the TASK construct, for example,
      is not available
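Why the barrier rule matters: in the hypothetical C fragment below only the even-numbered threads reach the barrier, so they wait forever for teammates that never arrive and the program hangs.

    #include <omp.h>

    void bad_pattern(void) {
        #pragma omp parallel
        {
            if (omp_get_thread_num() % 2 == 0) {
                /* only even-numbered threads encounter this barrier: deadlock */
                #pragma omp barrier
            }
        }
    }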
23. Exercise: OpenMP
    • The example programs are in ~/OMP_F_examples or ~/OMP_C_examples
    • Go to https://computing.llnl.gov/tutorials/openMP/excercise.html
    • Skip to step 4; the compiler is “icc” or “ifort”
    • There is no evaluation form
24.   #include <stdio.h>
      #include <stdlib.h>
      #include <mpi.h>

      int myrank;
      int ntasks;

      int main(int argc, char **argv)
      {
        /* Initialize MPI */
        MPI_Init(&argc, &argv);

        /* get number of workers */
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

        /* Find out my identity in the default communicator;
           each task gets a unique rank between 0 and ntasks-1 */
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

        MPI_Barrier(MPI_COMM_WORLD);
        fprintf(stdout, "Hello from MPI_BABY=%d\n", myrank);
        MPI_Finalize();
        exit(0);
      }

      (Figure: the same program replicated on Node 1, Node 2, …)
25.   mpicc mpi_baby.c -o mpi_baby

      mpirun -np 8 mpi_baby

      mpirun -np 32 -machinefile my_list mpi_baby
26.   From the man page: MPI_Scatter - Sends data from one task to all tasks in a
      group … the message is split into n equal segments; the ith segment is sent
      to the ith process in the group.

      C AUTHOR: Blaise Barney
            program scatter
            include 'mpif.h'

            integer SIZE
            parameter(SIZE=4)
            integer numtasks, rank, sendcount, recvcount, source, ierr
            real*4 sendbuf(SIZE,SIZE), recvbuf(SIZE)

      C     Fortran stores this array in column major order, so the
      C     scatter will actually scatter columns, not rows.
            data sendbuf /1.0,  2.0,  3.0,  4.0,
           &              5.0,  6.0,  7.0,  8.0,
           &              9.0, 10.0, 11.0, 12.0,
           &             13.0, 14.0, 15.0, 16.0 /

            call MPI_INIT(ierr)
            call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
            call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)

            if (numtasks .eq. SIZE) then
              source = 1
              sendcount = SIZE
              recvcount = SIZE
              call MPI_SCATTER(sendbuf, sendcount, MPI_REAL, recvbuf,
           &                   recvcount, MPI_REAL, source, MPI_COMM_WORLD, ierr)
              print *, 'rank= ', rank, ' Results: ', recvbuf
            else
              print *, 'Must specify', SIZE, ' processors. Terminating.'
            endif

            call MPI_FINALIZE(ierr)
            end
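A minimal C sketch of the same call (array contents and root rank chosen for illustration); since C is row major, each rank receives one row rather than one column:

    #include <stdio.h>
    #include <mpi.h>

    #define SIZE 4

    int main(int argc, char **argv) {
        int rank, ntasks;
        float sendbuf[SIZE][SIZE], recvbuf[SIZE];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

        if (rank == 0)                          /* only the root's send buffer matters */
            for (int i = 0; i < SIZE; i++)
                for (int j = 0; j < SIZE; j++)
                    sendbuf[i][j] = i * SIZE + j + 1.0f;

        if (ntasks == SIZE) {
            /* each rank receives one SIZE-element segment (one row of sendbuf) */
            MPI_Scatter(sendbuf, SIZE, MPI_FLOAT, recvbuf, SIZE, MPI_FLOAT,
                        0, MPI_COMM_WORLD);
            printf("rank %d got %.1f %.1f %.1f %.1f\n",
                   rank, recvbuf[0], recvbuf[1], recvbuf[2], recvbuf[3]);
        } else if (rank == 0) {
            printf("Must run with %d processes.\n", SIZE);
        }

        MPI_Finalize();
        return 0;
    }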
27. Some linux tricks to get more information:

      man -w MPI
      ls /N/soft/linux-rhel4-x86_64/openmpi/1.3.1/intel-64/share/man/man3
        MPI_Abort  MPI_Allgather  MPI_Allreduce  MPI_Alltoall ...
        MPI_Wait   MPI_Waitall    MPI_Waitany    MPI_Waitsome

      mpicc --showme
        /N/soft/linux-rhel4-x86_64/intel/cce/10.1.022/bin/icc
        -I/N/soft/linux-rhel4-x86_64/openmpi/1.3.1/intel-64/include -pthread
        -L/N/soft/linux-rhel4-x86_64/openmpi/1.3.1/intel-64/lib -lmpi -lopen-rte
        -lopen-pal -ltorque -lnuma -ldl -Wl,--export-dynamic -lnsl -lutil -ldl
        -Wl,-rpath -Wl,/usr/lib64
28. MPI cool stuff:
    • Bi-directional communication
    • Non-blocking communication (see the sketch that follows)
    • User defined types
    • Virtual topologies
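As an illustration of non-blocking communication, a minimal C sketch in which paired ranks exchange an integer with MPI_Isend/MPI_Irecv and could overlap useful work before MPI_Waitall:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, ntasks, sendval, recvval;
        MPI_Request reqs[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

        int partner = rank ^ 1;          /* pair ranks 0-1, 2-3, ... */
        sendval = rank;

        if (partner < ntasks) {
            MPI_Irecv(&recvval, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[0]);
            MPI_Isend(&sendval, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[1]);
            /* ... useful computation could overlap the communication here ... */
            MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
            printf("rank %d received %d from rank %d\n", rank, recvval, partner);
        }

        MPI_Finalize();
        return 0;
    }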
29. MPI Advice
    • Never put a barrier in an if block
    • Use care with non-blocking communication; things can pile up fast
30. So, can I use MPI with OpenMP?
    • Yes you can; extreme care is advised
    • Some implementations of MPI forbid it
    • You can get killed by “oversubscription” real fast; I’ve seen run time increase like N²
    • But sometimes you must… some fftw libraries are OMP multithreaded, for example
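A minimal hybrid sketch (the thread-support level and the printed messages are illustrative choices): request threading support explicitly with MPI_Init_thread, and keep ranks times threads at or below the cores you were allocated to avoid oversubscription.

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv) {
        int provided, rank;

        /* ask for MPI_THREAD_FUNNELED: only the main thread makes MPI calls */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (provided < MPI_THREAD_FUNNELED && rank == 0)
            printf("warning: this MPI does not provide the requested thread support\n");

        #pragma omp parallel
        {
            /* ranks * OMP_NUM_THREADS should not exceed the allocated core count */
            printf("rank %d, thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }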
31. Exercise: MPI
    • Examples are in ~/MPI_F_examples or ~/MPI_C_examples
    • Go to https://computing.llnl.gov/tutorials/mpi/exercise.html
    • Skip to step 6. MPI compilers are “mpif90” and “mpicc”; normal (serial)
      compilers are “ifort” and “icc”.
    • Compile your code: “make all” (overrides section 9)
    • To run an MPI code: “mpirun -np 8 <exe>” …or…
      “mpirun -np 16 -machinefile <ask me> <exe>”
    • Skip section 12
    • There is no evaluation form.
32. Where were those again?
    • https://computing.llnl.gov/tutorials/openMP/excercise.html
    • https://computing.llnl.gov/tutorials/mpi/exercise.html
33. Acknowledgements
    • This material is based upon work supported by the National Science Foundation
      under Grant Numbers 0116050 and 0521433. Any opinions, findings and conclusions
      or recommendations expressed in this material are those of the author and do not
      necessarily reflect the views of the National Science Foundation (NSF).
    • This work was supported in part by the Indiana Metabolomics and Cytomics
      Initiative (METACyt). METACyt is supported in part by Lilly Endowment, Inc.
    • This work was supported in part by the Indiana Genomics Initiative. The Indiana
      Genomics Initiative of Indiana University is supported in part by Lilly
      Endowment, Inc.
    • This work was supported in part by Shared University Research grants from
      IBM, Inc. to Indiana University.