Parallel Concepts

Parallel Programming

Transcript

  • 1. Parallel Concepts, Dr. C.V. Suresh Babu
  • 2. The Goal of Parallelization
    – Reduction of elapsed time of a program
    – Reduction in turnaround time of jobs
    – Overhead (total increase in CPU time):
      • communication
      • synchronization
      • additional work in the algorithm
      • the non-parallel part of the program (one processor works, the others spin idle)
    [Figure: elapsed time from start to finish on 1, 2, 4 and 8 processors, showing the reduction in elapsed time; CPU time on 1 processor vs. CPU time plus communication overhead on 4 processors.]
  • 3. Speedup and Efficiency
    Both measure the parallelization properties of a program.
    – Let T(p) be the elapsed time on p processors.
    – The speedup S(p) and the efficiency E(p) are defined as:
      S(p) = T(1)/T(p)
      E(p) = S(p)/p
    – For ideal parallel speedup we get: T(p) = T(1)/p, so S(p) = T(1)/T(p) = p and E(p) = S(p)/p = 1, or 100%.
    [Figures: speedup vs. number of processors (ideal, super-linear, saturation and disaster curves) and efficiency vs. number of processors.]
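    A minimal C sketch of these two definitions (the timings used below are hypothetical measurements, not values from the slides):

        #include <stdio.h>

        /* speedup and efficiency from measured elapsed times:
           t1 = elapsed time on 1 processor, tp = elapsed time on p processors */
        static double speedup(double t1, double tp)           { return t1 / tp; }
        static double efficiency(double t1, double tp, int p) { return speedup(t1, tp) / p; }

        int main(void)
        {
            double t1 = 100.0, tp = 30.0;   /* hypothetical timings, in seconds */
            int p = 4;
            printf("S(%d) = %.2f, E(%d) = %.0f%%\n",
                   p, speedup(t1, tp), p, 100.0 * efficiency(t1, tp, p));
            return 0;
        }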
  • 4. Amdahl's Law
    This rule states the following for parallel programs: the non-parallel fraction of the code (i.e. the overhead) imposes the upper limit on the scalability of the code. The non-parallel (serial) fraction s of the program includes the communication and synchronization overhead.
    (1) 1 = s + f                (the program has serial and parallel fractions)
    (2) T(1) = T(parallel) + T(serial) = T(1)*(f + s) = T(1)*(f + (1-f))
    (3) T(p) = T(1)*(f/p + (1-f))
    (4) S(p) = T(1)/T(p) = 1/(f/p + 1-f)
    (5) S(p) < 1/(1-f)           for p -> inf.
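    As a worked example of the bound in (5): a program that is 99% parallel (f = 0.99) can never be sped up by more than 1/(1-0.99) = 100, no matter how many processors are used; with f = 0.95 the ceiling is already 20. This is the basis for the summary slide's remark that programs have to be >99% parallel to use large machines.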
  • 5. Amdahl's Law: Time to Solution
    T(p) = T(1)/S(p)
    S(p) = 1/(f/p + (1-f))
    [Figure: hypothetical program run time as a function of the number of processors for several parallel fractions f. Note the log-log plot.]
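    A minimal C sketch of these two formulas, tabulating the hypothetical speedup and run time for a few parallel fractions f (the baseline T(1) = 100 is an arbitrary illustrative value, not from the slides):

        #include <stdio.h>

        /* Amdahl's law: S(p) = 1/(f/p + (1-f)),  T(p) = T(1)/S(p) */
        int main(void)
        {
            const double t1 = 100.0;                      /* illustrative T(1) */
            const double fracs[] = { 0.90, 0.99, 0.999 }; /* parallel fractions */
            for (int k = 0; k < 3; k++) {
                double f = fracs[k];
                printf("f = %.3f:\n", f);
                for (int p = 1; p <= 1024; p *= 4) {
                    double s = 1.0 / (f / p + (1.0 - f));
                    printf("  p = %4d  S(p) = %7.2f  T(p) = %7.2f\n", p, s, t1 / s);
                }
            }
            return 0;
        }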
  • 6. Fine-Grained vs. Coarse-Grained
    – Fine-grain parallelism (typically loop level)
      • can be done incrementally, one loop at a time
      • does not require deep knowledge of the code
      • a lot of loops have to be parallel for decent speedup
      • potentially many synchronization points (at the end of each parallel loop)
    – Coarse-grain parallelism
      • make larger loops parallel at a higher call-tree level, potentially enclosing many small loops
      • more code is parallel at once
      • fewer synchronization points, reducing overhead
      • requires deeper knowledge of the code
    [Figure: call tree rooted at MAIN; coarse-grained parallelism is applied near the root of the tree, fine-grained parallelism at the leaf-level loops.]
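    A hedged OpenMP sketch of the contrast (the array names and loop bounds are made up for illustration): the first routine parallelizes each small loop on its own (fine-grained, an implicit barrier after every loop), while the second encloses both loops in one parallel region (coarse-grained, a single thread team, and with matching static schedules the first barrier can be skipped).

        #include <omp.h>
        #define N 1000

        double a[N], b[N];

        /* Fine-grained: each loop is a separate parallel region;
           an implicit barrier follows each one. */
        void fine_grained(void)
        {
            #pragma omp parallel for
            for (int i = 0; i < N; i++) a[i] = i * 0.5;

            #pragma omp parallel for
            for (int i = 0; i < N; i++) b[i] = a[i] + 1.0;
        }

        /* Coarse-grained: one parallel region encloses both loops.
           Because both loops have the same iteration count and the same
           static schedule, each thread reuses only the a[i] it wrote itself,
           so the nowait clause can safely drop the first barrier. */
        void coarse_grained(void)
        {
            #pragma omp parallel
            {
                #pragma omp for schedule(static) nowait
                for (int i = 0; i < N; i++) a[i] = i * 0.5;

                #pragma omp for schedule(static)
                for (int i = 0; i < N; i++) b[i] = a[i] + 1.0;
            }
        }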
  • 7. Other Impediments to Scalability
    Load imbalance:
    – the time to complete a parallel execution of a code segment is determined by the longest-running thread
    – unequal work-load distribution leads to some processors being idle while others work too much
    – with coarse-grain parallelization, more opportunities for load imbalance exist
    Too many synchronization points:
    – the compiler will put synchronization points at the start and exit of each parallel region
    [Figure: timeline of threads p0-p3 between start and finish; the elapsed time is set by the longest-running thread.]
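    One common mitigation for load imbalance is dynamic scheduling, sketched below with OpenMP (the work() function and its uneven cost are hypothetical): chunks of iterations are handed out at run time, so a thread that finishes early picks up more work instead of sitting idle.

        #include <omp.h>
        #include <math.h>

        /* Hypothetical loop body whose cost grows with i, so a plain
           static block distribution would leave some threads idle. */
        static double work(int i)
        {
            double s = 0.0;
            for (int k = 0; k < i; k++) s += sin((double)k);
            return s;
        }

        double run(int n)
        {
            double total = 0.0;
            /* chunks of 16 iterations are dealt out on demand;
               faster threads simply take more chunks */
            #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
            for (int i = 0; i < n; i++)
                total += work(i);
            return total;
        }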
  • 8. Computing π with DPL
    π = ∫₀¹ 4/(1+x²) dx = Σ_{0≤i<N} 4/(N·(1+((i+0.5)/N)²))

        PROGRAM PIPROG
        INTEGER, PARAMETER :: N = 1000000
        REAL (KIND=8) :: LS, PI, W = 1.0/N
        PI = SUM( (/ (4.0*W/(1.0+((I+0.5)*W)**2), I=1,N) /) )
        PRINT *, PI
        END

    Notes:
    – essentially sequential form
    – automatic detection of parallelism
    – automatic work sharing
    – all variables shared by default
    – number of processors specified outside of the code
    – compile with:
  • 9. Computing π with Shared Memory
    π = ∫₀¹ 4/(1+x²) dx = Σ_{0≤i<N} 4/(N·(1+((i+0.5)/N)²))

        #include <stdio.h>
        #define n 1000000
        int main()
        {
          double pi, l, ls = 0.0, w = 1.0/n;
          int i;
          #pragma omp parallel for private(i,l) reduction(+:ls)
          for (i = 0; i < n; i++) {
            l = (i+0.5)*w;
            ls += 4.0/(1.0+l*l);
          }
          pi = ls*w;
          printf("pi is %f\n", pi);
          return 0;
        }

    Notes:
    – essentially sequential form
    – automatic work sharing
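    For reference (not on the slide): with GCC this version would typically be compiled with the -fopenmp flag, and the number of threads chosen at run time via the OMP_NUM_THREADS environment variable.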
  • 10. Computing π with Message Passing
    π = ∫₀¹ 4/(1+x²) dx = Σ_{0≤i<N} 4/(N·(1+((i+0.5)/N)²))

        #include <stdio.h>
        #include <mpi.h>
        #define N 1000000
        int main(int argc, char **argv)
        {
          double pi, l, ls = 0.0, w = 1.0/N;
          int i, mid, nth;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &mid);
          MPI_Comm_size(MPI_COMM_WORLD, &nth);
          for (i = mid; i < N; i += nth) {
            l = (i+0.5)*w;
            ls += 4.0/(1.0+l*l);
          }
          MPI_Reduce(&ls, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
          if (mid == 0) printf("pi is %f\n", pi*w);
          MPI_Finalize();
          return 0;
        }
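    A brief note on this sketch (build commands are an assumption of a typical MPI installation, not from the slide): each rank handles iterations i = mid, mid+nth, mid+2*nth, ... (a cyclic distribution), MPI_Reduce sums the partial results ls into pi on rank 0, and only rank 0 prints. Such a program is commonly built and launched with the MPI compiler wrapper and launcher, e.g. mpicc pi_mpi.c -o pi_mpi followed by mpirun -np 4 ./pi_mpi (the file name is illustrative).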
  • 11. Comparing Parallel Paradigms
    – Automatic parallelization combined with explicit shared-variable programming (compiler directives) is used on machines with global memory:
      • Symmetric Multi-Processors, CC-NUMA, PVP
      • these methods are collectively known as Shared Memory Programming (SMP)
      • the SMP programming model works at loop level and at coarse-level parallelism:
        – coarse-level parallelism has to be specified explicitly
        – loop-level parallelism can be found by the compiler (implicitly)
    – Explicit message passing methods are necessary on machines that have no global memory addressability:
      • clusters of all sorts, NOW & COW
      • message passing methods require coarse-level parallelism to be scalable
      • limitations: communication overhead, process synchronization
    – Choosing a programming model is largely a matter of the application, personal preference and the target machine; it has nothing to do with scalability.
    – Scalability is mainly a function of the hardware and (your) implementation of the parallelism.
  • 12. Summary
    – The serial part or the communication overhead of the code limits the scalability of the code (Amdahl's Law).
    – Programs have to be >99% parallel to use large (>30 processor) machines.
    – Several programming models are in use today:
      • Shared Memory programming (SMP) (with automatic compiler parallelization, data-parallel and explicit shared-memory models)
      • Message Passing model
    – Choosing a programming model is largely a matter of the application, personal choice and target machine. It has nothing to do with scalability.
      • Don't confuse the algorithm with its implementation.
    – Machines with a global address space can run applications based on both SMP and Message Passing programming models.
