Parallel Programming and MPI


  1. FLASH Tutorial, May 13, 2004: Parallel Computing and MPI
  2. What is Parallel Computing? And why is it useful?
     - Parallel computing is more than one CPU working together on one problem.
     - It is useful when:
       - the problem is large and would take very long on a single processor;
       - the data are too big to fit in the memory of one processor.
     - When to parallelize: when the problem can be subdivided into relatively independent tasks.
     - How much to parallelize: as long as the speedup relative to a single processor remains of the order of the number of processors.
  3. Parallel paradigms
     - SIMD (single instruction, multiple data): processors work in lock-step.
     - MIMD (multiple instruction, multiple data): processors do their own thing, with occasional synchronization.
     - Shared memory: one-sided communication.
     - Distributed memory: message passing.
     - Loosely coupled: the process on each CPU is fairly self-contained and relatively independent of processes on other CPUs.
     - Tightly coupled: CPUs need to communicate with each other frequently.
  4. How to Parallelize
     - Divide the problem into a set of mostly independent tasks (partitioning the problem).
     - Give each task its own data (localize the task).
     - Let tasks operate on their own data for the most part (keep them self-contained).
     - Occasionally:
       - data may be needed from other tasks (inter-process communication);
       - synchronization may be required between tasks (global operations).
     - Map tasks to processors:
       - one processor may get more than one task;
       - the task distribution should be well balanced.
  5. New Code Components
     - Initialization
     - Query parallel state:
       - identify this process;
       - identify the number of processes.
     - Exchange data between processes (local, global)
     - Synchronization (barriers, blocking communication, locks)
     - Finalization
  6. MPI
     - Message Passing Interface: the standard for the distributed-memory model of parallelism.
     - MPI-2 will support one-sided communication, commonly associated with shared-memory operations.
     - Works with communicators: groups of processes.
       - MPI_COMM_WORLD is the default communicator containing all processes.
     - Supports both the lowest-level communication operations and composite operations.
     - Has blocking and non-blocking operations.
  7. Communicators (diagram of two communicator groups, COMM1 and COMM2)
  8. Low-level Operations in MPI
     - MPI_Init
     - MPI_Comm_size: find the number of processes
     - MPI_Comm_rank: find this process's rank
     - MPI_Send / MPI_Recv: communicate with other processes one at a time
     - MPI_Bcast: global data transmission
     - MPI_Barrier: synchronization
     - MPI_Finalize
  9. Advanced Constructs in MPI
     - Composite operations: Gather/Scatter, Allreduce, Alltoall
     - Cartesian grid operations: Shift
     - Communicators: creating subgroups of processes to operate on
     - User-defined datatypes
     - I/O: parallel file operations
  10. Communication Patterns (diagrams of patterns among ranks 0-3: Point to Point, One to All Broadcast, All to All, Shift, Collective)
  11. Communication Overheads
     - Latency vs. bandwidth
     - Blocking vs. non-blocking:
       - overlap of communication with computation;
       - buffering and copying.
     - Scale of communication: nearest neighbor, short range, long range
     - Volume of data: resource contention for links
     - Efficiency: hardware, software, communication method
  12. Parallelism in FLASH
     - Short-range communications: nearest neighbor
     - Long-range communications: regridding
     - Other global operations:
       - all-reduce operations on physical quantities;
       - operations specific to solvers (multipole method, FFT-based solvers).
  13. Domain Decomposition (diagram: the domain split across processors P0, P1, P2, P3)
  14. Border Cells / Ghost Points
     - When solnData is split up, data are needed from other processors.
     - A layer of border cells is needed from each neighboring processor.
     - These cells must be updated every time step.
  15. Border/Ghost Cells (diagram: short-range communication)
  16. Two MPI Methods for Doing It
     - With a Cartesian topology:
       - MPI_Cart_create: create the topology
       - MPE_Decomp1d: domain decomposition on the topology
       - MPI_Cart_shift: who's on the left/right?
       - MPI_Sendrecv: fill ghost cells from the left
       - MPI_Sendrecv: fill ghost cells from the right
     - Manually:
       - MPI_Comm_rank, MPI_Comm_size
       - decompose the grid over processors by hand
       - calculate the left/right neighbors
       - MPI_Send / MPI_Recv, ordered carefully to avoid deadlocks
  17. Adaptive Grid Issues
     - The discretization is not uniform.
     - Simple left-right guard-cell fills are inadequate.
     - Adjacent grid points may not be mapped to nearest neighbors in the processor topology.
     - Redistribution of work becomes necessary.
  18. Regridding
     - The number of cells/blocks changes.
     - Some processors get more work than others: load imbalance.
     - Data are redistributed to even out the work across all processors.
     - This requires long-range communications and moves large quantities of data.
  19. Regridding (diagram)
  20. Other Parallel Operations in FLASH
     - Global max/sum etc. (Allreduce):
       - physical quantities;
       - in solvers;
       - performance monitoring.
     - Alltoall: FFT-based solver on UG
     - User-defined datatypes and file operations: parallel I/O