Parallel and Distributed Computing on Low Latency Clusters

Slides from the thesis defence in Chicago by Vittorio Giovara.

  1. 1. Parallel and Distributed Computing on Low Latency Clusters. Vittorio Giovara, M.S. Electrical Engineering and Computer Science, University of Illinois at Chicago, May 2009
  2. 2. Contents • Motivation • Application • Strategy • Technologies (OpenMP, MPI, Infiniband) • Compiler Optimizations • OpenMP and MPI over Infiniband • Results • Conclusions
  3. 3. Motivation
  4. 4. Motivation • The scaling trend has to stop for CMOS technology: ✓ Direct-tunneling limit in SiO2 ~3 nm ✓ Distance between Si atoms ~0.3 nm ✓ Variability • Fundamental reason: rising fab cost
  5. 5. Motivation • Easy to build multi-core processors • Human intervention is required to modify and adapt concurrent software • New classification for computer architectures
  6. 6. Classification (Diagram of the SISD, SIMD, MISD and MIMD classes, each shown as an instruction pool and a data pool feeding one or more CPUs)
  7. 7. (Diagram: the abstraction levels, algorithm, loop level and process management, ordered by how easy they are to parallelize)
  8. 8. Levels (Diagram relating concerns such as recursion, memory management, profiling, data dependency, branching overhead and control flow to the algorithm, loop and process-management levels, and to SMP, multithreading and scheduling, and multiprogramming)
  9. 9. Backfire • Difficult to fully exploit the parallelism offered • Automatic tools are required to adapt software to parallelism • Compiler support for manual or semi-automatic enhancement
  10. 10. Applications • OpenMP and MPI are two popular tools used to simplify the parallelization of both new and existing software • Mathematics and Physics • Computer Science • Biomedicine
  11. 11. Specific Problem and Background • Sally3D is a micromagnetism program suite for field analysis and modeling developed at Politecnico di Torino (Department of Electrical Engineering) • Computationally intensive (runs can take days of CPU time); a speedup is required • Previous work did not fully cover the problem (no Infiniband or OpenMP+MPI solutions)
  12. 12. Strategy
  13. 13. Strategy • Install a Linux kernel with an ad hoc configuration for scientific computation • Compile an OpenMP-enabled GCC (supported from 4.3.1 onwards) • Add the Infiniband link between the cluster nodes, with the proper drivers in kernel and user space • Select an MPI implementation library
  14. 14. Strategy • Verify the Infiniband network with some MPI test examples • Install the target software • Include OpenMP and MPI directives in the code • Run test cases
  15. 15. OpenMP • standard • supported by most modern compilers • requires little knowledge of the software • very simple constructs
  16. 16. OpenMP - example
  17. 17. OpenMP - example (Diagram: the example is decomposed into Parallel Task 1, 2, 3 and 4)
  18. 18. (Diagram: Parallel Tasks 1-4 are assigned to Thread A and Thread B, which then join back into the master thread)
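As a rough illustration of the fork/join model sketched in slides 17 and 18, a minimal OpenMP sections example in Fortran might look as follows; the do_task subroutine and the four-task split are hypothetical and do not come from Sally3D.

    program omp_sections_example
      use omp_lib
      implicit none

      ! fork: the four independent tasks are distributed among the available threads
      !$omp parallel sections
      !$omp section
      call do_task(1)
      !$omp section
      call do_task(2)
      !$omp section
      call do_task(3)
      !$omp section
      call do_task(4)
      !$omp end parallel sections
      ! join: execution continues on the master thread only
      print *, 'all tasks completed'

    contains

      subroutine do_task(n)
        integer, intent(in) :: n
        print *, 'task', n, 'executed by thread', omp_get_thread_num()
      end subroutine do_task

    end program omp_sections_example

Compiled with an OpenMP-enabled GCC (gfortran -fopenmp), the four sections run concurrently on however many threads the runtime provides.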
  19. 19. OpenMP Scheduler • Which scheduler best fits the hardware? - Static - Dynamic - Guided
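Before the measurements on the next slides, a minimal sketch of how the three schedule kinds are requested in code; the loop bodies and the chunk value of 100 are placeholders, chosen only to mirror the chunk sizes swept in the charts.

    program omp_schedule_example
      implicit none
      integer, parameter :: n = 1000000, chunk = 100
      real(8), allocatable :: a(:)
      integer :: i

      allocate(a(n))

      ! static: iterations are divided into fixed chunks assigned round-robin
      !$omp parallel do schedule(static, chunk)
      do i = 1, n
         a(i) = sqrt(real(i, 8))
      end do
      !$omp end parallel do

      ! dynamic: each thread grabs the next chunk as soon as it becomes idle
      !$omp parallel do schedule(dynamic, chunk)
      do i = 1, n
         a(i) = a(i) + 1.0d0
      end do
      !$omp end parallel do

      ! guided: like dynamic, but the chunk size shrinks as iterations run out
      !$omp parallel do schedule(guided, chunk)
      do i = 1, n
         a(i) = a(i) * 0.5d0
      end do
      !$omp end parallel do

      print *, a(1), a(n)
      deallocate(a)
    end program omp_schedule_example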
  20. 20. OpenMP Scheduler (Chart: static scheduler, run time in microseconds vs. number of threads, 1 to 16, for chunk sizes 1, 10, 100, 1000 and 10000)
  21. 21. OpenMP Scheduler (Chart: dynamic scheduler, run time in microseconds vs. number of threads, 1 to 16, for the same chunk sizes)
  22. 22. OpenMP Scheduler (Chart: guided scheduler, run time in microseconds vs. number of threads, 1 to 16, for the same chunk sizes)
  23. 23. OpenMP Scheduler
  24. 24. OpenMP Scheduler (Chart comparing the static, dynamic and guided schedulers)
  25. 25. MPI • standard • widely used in cluster environments • many transport links supported • different implementations available - OpenMPI - MVAPICH
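A minimal MPI example in Fortran, independent of the transport underneath (TCP or Infiniband) and of the implementation (OpenMPI or MVAPICH2); the message value is arbitrary.

    program mpi_example
      use mpi
      implicit none
      integer :: ierr, rank, nprocs, msg
      integer :: status(MPI_STATUS_SIZE)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

      ! rank 0 sends a single integer to rank 1, which receives and prints it
      if (rank == 0 .and. nprocs >= 2) then
         msg = 42
         call MPI_Send(msg, 1, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr)
      else if (rank == 1) then
         call MPI_Recv(msg, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, status, ierr)
         print *, 'rank 1 received', msg
      end if

      call MPI_Finalize(ierr)
    end program mpi_example

Both implementations compile such a program with their mpif90 wrapper and launch it with their mpirun/mpiexec launcher, e.g. mpirun -np 2 ./mpi_example (the binary name is arbitrary).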
  26. 26. Infiniband • standard • widely used in cluster environments • very low latency for small packets • up to 16 Gb/s transfer speed
  27. 27. MPI over Infiniband (Chart: time in microseconds, log scale, vs. message size from 1 kB up to the gigabyte range, comparing OpenMPI and MVAPICH2)
  28. 28. MPI over Infiniband (Chart: time in microseconds, log scale, vs. message size from 1 kB up to the megabyte range, comparing OpenMPI and MVAPICH2)
  29. 29. Optimizations • Applied at compile time • Available only after porting the software to standard FORTRAN • Consistent documentation available • Unexpectedly positive results
  30. 30. Optimizations • -march=native • -O3 • -ffast-math • -Wl,-O1
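For illustration, with GCC/gfortran the flags above correspond to an invocation along these lines (the source and output names are hypothetical; -fopenmp is added to enable the OpenMP directives):

    gfortran -fopenmp -march=native -O3 -ffast-math -Wl,-O1 -o solver solver.f90

-march=native tunes code generation to the build machine, -O3 and -ffast-math enable aggressive (and, for the latter, non-IEEE-strict) floating-point optimization, and -Wl,-O1 passes an optimization request on to the linker.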
  31. 31. Target Software
  32. 32. Target Software • Sally3D • micromagnetic equation solver • written in FORTRAN with some C libraries • the program uses a linear formulation of the mathematical models
  33. 33. Implementation Scheme (Diagram: the sequential loop of the standard programming model becomes a parallel loop executed by OpenMP threads, and then a distributed loop in which Host 1 and Host 2 each run their own OpenMP threads and communicate over MPI)
  34. 34. Implementation Scheme • Data structure: not embarrassingly parallel • Three-dimensional matrix • Several temporary arrays, so synchronization objects are required ➡ send() and recv() mechanism ➡ critical regions using OpenMP directives ➡ function merging ➡ matrix conversion
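A minimal hybrid OpenMP+MPI sketch of the scheme above; the array name field, its dimensions, the shared accumulator and the single boundary-plane exchange are illustrative only and are not Sally3D's actual data structures.

    program hybrid_sketch
      use mpi
      implicit none
      integer, parameter :: nx = 32, ny = 32, nz = 32
      real(8), allocatable :: field(:, :, :)
      real(8) :: total
      integer :: ierr, rank, nprocs, i, j, k
      integer :: status(MPI_STATUS_SIZE)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

      allocate(field(nx, ny, nz))
      field = real(rank, 8)
      total = 0.0d0

      ! local work: the outer k loop is divided among the OpenMP threads
      !$omp parallel do private(i, j, k) shared(field, total)
      do k = 1, nz
         do j = 1, ny
            do i = 1, nx
               field(i, j, k) = field(i, j, k) + 1.0d0
            end do
         end do
         ! a critical region protects the shared accumulator
         !$omp critical
         total = total + sum(field(:, :, k))
         !$omp end critical
      end do
      !$omp end parallel do

      ! boundary exchange between two hosts with a blocking send()/recv() pair
      if (nprocs >= 2) then
         if (rank == 0) then
            call MPI_Send(field(:, :, nz), nx*ny, MPI_DOUBLE_PRECISION, 1, 0, &
                          MPI_COMM_WORLD, ierr)
         else if (rank == 1) then
            call MPI_Recv(field(:, :, 1), nx*ny, MPI_DOUBLE_PRECISION, 0, 0, &
                          MPI_COMM_WORLD, status, ierr)
         end if
      end if

      print *, 'rank', rank, 'local sum', total
      deallocate(field)
      call MPI_Finalize(ierr)
    end program hybrid_sketch

It would be built with the MPI compiler wrapper plus OpenMP support (e.g. mpif90 -fopenmp) and launched with one MPI rank per host, each rank spawning its own OpenMP threads.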
  35. 35. Results
  36. 36. Results (execution time by configuration; * = enabled, - = disabled)
      OMP  MPI  OPT  seconds
       *    *    *      133
       *    *    -      400
       *    -    *      186
       *    -    -      487
       -    *    *      200
       -    *    -      792
       -    -    *      246
       -    -    -     1062
      Total Speed Increase: 87.52%
  37. 37. Actual Results (* = enabled, - = disabled)
      OMP  MPI  seconds
       *    *       59
       *    -      129
       -    *      174
       -    -      249
      Function Name     Normal   OpenMP   MPI      OpenMP+MPI
      calc_intmudua     24.5 s   4.7 s    14.4 s   2.8 s
      calc_hdmg_tet     16.9 s   3.0 s    10.8 s   1.7 s
      calc_mudua        12.1 s   1.9 s    7.0 s    1.1 s
      campo_effettivo   17.7 s   4.5 s    9.9 s    2.3 s
  38. 38. Actual Results • OpenMP: 6-8x • MPI: 2x • OpenMP + MPI: 14-16x • Total Raw Speed Increment: 76%
  39. 39. Conclusions
  40. 40. Conclusions and Future Works • Computational time has been significantly decreased • Speedup is consistent with expected results • Submitted to COMPUMAG ‘09 • Continue inserting OpenMP and MPI directives • Perform algorithm optimizations • Increase cluster size
