20090720 smith
Transcript

  • 1. Parallel and High Performance Computing
    Burton Smith, Technical Fellow, Microsoft
    7/17/2009
  • 2. Agenda
    Introduction
    Definitions
    Architecture and Programming
    Examples
    Conclusions
  • 3. Introduction
  • 4. “Parallel and High Performance”?
    “Parallel computing is a form of computation in which many calculations are carried out simultaneously” (G.S. Almasi and A. Gottlieb, Highly Parallel Computing, Benjamin/Cummings, 1994)
    A High Performance (Super) Computer is:
    One of the 500 fastest computers as measured by HPL, the High Performance Linpack benchmark
    A computer that costs 200,000,000 rubles or more
    Necessarily parallel, at least since the 1970s
  • 5. Recent Developments
    For 20 years, parallel and high performance computing have been the same subject
    Parallel computing is now mainstream
    It reaches well beyond HPC into client systems: desktops, laptops, mobile phones
    HPC software once had to stand alone
    Now it can be based on parallel PC software
    The result: better tools and new possibilities
  • 6. The Emergence of the Parallel Client
    Uniprocessor performance is leveling off
    Instruction-level parallelism nears a limit (ILP Wall)
    Power is getting painfully high (Power Wall)
    Caches show diminishing returns (Memory Wall)
    Logic density continues to grow (Moore's Law)
    So uniprocessors will collapse in area and cost
    Cores per chip need to increase exponentially
    We must all learn to write parallel programs
    So new “killer apps” will enjoy more speed
  • 7. The ILP Wall
    Instruction-level parallelism preserves the serial programming model
    While getting speed from “undercover” parallelism
    For example, see HPS†: out-of-order issue, in-order retirement, register renaming, branch prediction, speculation, …
    At best, we get a few instructions per clock
    † Y.N. Patt et al., “Critical Issues Regarding HPS, a High Performance Microarchitecture,” Proc. 18th Ann. ACM/IEEE Int'l Symp. on Microarchitecture, 1985, pp. 109–116.
  • 8. The Power Wall
    In the old days, power was kept roughly constant
    Dynamic power, equal to CV²f, dominated
    Every shrink of 0.7 in feature size halved transistor area
    Capacitance C and voltage V also decreased by 0.7
    Even with the clock frequency f increased by 1.4, power per transistor was cut in half
    Now, shrinking no longer reduces V very much
    So even at constant frequency, power density doubles
    Static (leakage) power is also getting worse
    Simpler, slower processors are more efficient
    And to conserve power, we can turn some of them off
  • 9. The Memory Wall
    We can get bigger caches from more transistors
    Does this suffice, or is there a problem scaling up?
    To speed up 2X without changing bandwidth below the cache, the miss rate must be halved
    How much bigger does the cache have to be?†
    For dense matrix multiply or dense LU, 4x bigger
    For sorting or FFTs, the square of its former size
    For sparse or dense matrix-vector multiply, impossible
    Deeper interconnects increase miss latency
    Latency tolerance needs memory access parallelism
    † H.T. Kung, “Memory requirements for balanced computer architectures,” 13th International Symposium on Computer Architecture, 1986, pp. 49–54.
  • 10. Overcoming the Memory Wall
    Provide more memory bandwidth
    Increase DRAM I/O bandwidth per gigabyte
    Increase microprocessor off-chip bandwidth
    Use architecture to tolerate memory latency
    More latency → more threads or longer vectors
    No change in programming model is needed
    Use caches for bandwidth as well as latency
    Let compilers control locality
    Keep cache lines short
    Avoid mis-speculation
  • 11. The End of the von Neumann Model
    “Instructions are executed one at a time…”
    We have relied on this idea for 60 years
    Now it (and things it brought) must change
    Serial programming is easier than parallel programming, at least for the moment
    But serial programs are now slow programs
    We need parallel programming paradigms that will make all programmers successful
    The stakes for our field's vitality are high
    Computing must be reinvented
  • 12. Definitions
  • 13. Asymptotic Notation
    Quantities are often meaningful only within a constant factor
    Algorithm performance analyses, for example
    f(n) = O(g(n)) means there exist constants c and n0 such that n ≥ n0 implies |f(n)| ≤ |c·g(n)|
    f(n) = Ω(g(n)) means there exist constants c and n0 such that n ≥ n0 implies |f(n)| ≥ |c·g(n)|
    f(n) = Θ(g(n)) means both f(n) = O(g(n)) and f(n) = Ω(g(n))
  • 14. Speedup, Time, and Work
    The speedup of a computation is how much faster it runs in parallel compared to serially
    If one processor takes T1 and p of them take Tp, then the p-processor speedup is Sp = T1/Tp
    The work done is the number of operations performed, either serially or in parallel
    W1 = O(T1) is the serial work, Wp the parallel work
    We say a parallel computation is work-optimal if Wp = O(W1) = O(T1)
    We say a parallel computation is time-optimal if Tp = O(W1/p) = O(T1/p)
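    A quick worked example with made-up numbers (not from the slides): if one processor takes T1 = 8 s and four processors take T4 = 2.5 s, then
        S4 = T1/T4 = 8 s / 2.5 s = 3.2
    against an ideal linear speedup of 4, which would require T4 = T1/4 = 2 s.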
  • 15. Latency, Bandwidth, & Concurrency
    In any system that moves items from input to output without creating or destroying them, latency × bandwidth = concurrency
    Queueing theory calls this result Little's law
    (Diagram: latency = 3, bandwidth = 2, so concurrency = 3 × 2 = 6)
  • 16. Architecture and Programming
  • 17. Parallel Processor Architecture
    SIMD: each instruction operates concurrently on multiple data items
    MIMD: multiple instruction sequences execute concurrently
    Concurrency is expressible in space or time
    Spatial: the hardware is replicated
    Temporal: the hardware is pipelined
  • 18. Trends in Parallel Processors
    Today's chips are spatial MIMD at the top level
    To get enough performance, even in PCs
    Temporal MIMD is also used
    SIMD is tending back toward spatial
    Intel's Larrabee combines all three
    Temporal concurrency is easily “adjusted”
    Vector length or number of hardware contexts
    Temporal concurrency tolerates latency
    Memory latency in the SIMD case
    For MIMD, branches and synchronization also
  • 19. Parallel Memory Architecture
    A shared memory system is one in which any processor can address any memory location
    Quality of access can be either uniform (UMA) or nonuniform (NUMA), in latency and/or bandwidth
    A distributed memory system is one in which processors can't address most of memory
    The disjoint memory regions and their associated processors are usually called nodes
    A cluster is a distributed memory system with more than one processor per node
    Nearly all HPC systems are clusters
  • 20. Parallel Programming Variations
    Data parallelism and task parallelism
    Functional style and imperative style
    Shared memory and message passing
    …and more we won't have time to look at
    A parallel application may use all of them
  • 21. Data Parallelism and Task Parallelism
    A computation is data parallel when similar independent sub-computations are done simultaneously on multiple data items
    Applying the same function to every element of a data sequence, for example
    A computation is task parallel when dissimilar independent sub-computations are done simultaneously
    Controlling the motions of a robot, for example
    It sounds like SIMD vs. MIMD, but isn't quite
    Some kinds of data parallelism need MIMD
  • 22. Functional and Imperative Programs
    A program is said to be written in (pure) functional style if it has no mutable state
    Computing = naming and evaluating expressions
    Programs with mutable state are usually called imperative because the state changes must be done when and where specified:
        while (z < x) { x = y; y = z; z = f(x, y); }
        return y;
    Often, programs can be written either way:
        let w(x, y, z) = if (z < x) then w(y, z, f(x, y)) else y;
  • 23. Shared Memory and Message Passing
    Shared memory programs access data in a shared address space
    When to access the data is the big issue
    Subcomputations therefore must synchronize
    Message passing programs transmit data between subcomputations
    The sender computes a value and then sends it
    The receiver receives a value and then uses it
    Synchronization can be built into communication
    Message passing can be implemented very well on shared memory architectures
  • 24. Barrier Synchronization
    A barrier synchronizes multiple parallel sub-computations by letting none proceed until all have arrived
    It is named after the barrier used to start horse races
    It guarantees everything before the barrier finishes before anything after it begins
    It is a central feature in several data-parallel languages such as OpenMP
  • 25. Mutual Exclusion
    This type of synchronization ensures only one subcomputation can do a thing at any time
    If the thing is a code block, it is a critical section
    It classically uses a lock: a data structure with which subcomputations can stop and start
    Basic operations on a lock object L might be
    Acquire(L): blocks until other subcomputations are finished with L, then acquires exclusive ownership
    Release(L): yields L and unblocks some Acquire(L)
    A lot has been written on these subjects
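    The Acquire/Release pair above is abstract; as a minimal concrete sketch (not part of the original slides), C++11's std::mutex plays the role of the lock object L, with lock() and unlock() as Acquire and Release:

        #include <mutex>

        std::mutex L;               // the lock object L
        int shared_counter = 0;     // state protected by L

        void increment() {
            L.lock();               // Acquire(L): block until exclusive ownership is obtained
            ++shared_counter;       // critical section: at most one thread runs this at a time
            L.unlock();             // Release(L): yield L, unblocking some waiting Acquire
        }

    In idiomatic C++ the lock/unlock pair is usually wrapped in a std::lock_guard, so Release happens automatically even if the critical section exits early.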
  • 26. Non-Blocking Synchronization
    The basic idea is to achieve mutual exclusion using memory read-modify-write operations
    Most commonly used is compare-and-swap:
    CAS(addr, old, new) reads memory at addr and, if it contains old, replaces old with new
    Arbitrary update operations at an addr require that {read old; compute new; CAS(addr, old, new);} be repeated until the CAS operation succeeds
    If there is significant updating contention at addr, the repeated computation of new may be wasteful
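    A minimal C++ sketch of that retry loop (an illustration, not from the talk; the doubling update merely stands in for "compute new"), using std::atomic's compare_exchange_weak as the CAS:

        #include <atomic>

        std::atomic<int> value{0};

        void atomic_update(std::atomic<int>& addr) {
            int old_val = addr.load();                               // read old
            int new_val;
            do {
                new_val = 2 * old_val + 1;                           // compute new from old
            } while (!addr.compare_exchange_weak(old_val, new_val)); // CAS: retry until it succeeds
            // On failure, compare_exchange_weak reloads old_val with the current
            // contents of addr, so the next iteration recomputes new from fresh data.
        }

    Any number of threads may call atomic_update(value) concurrently; no thread ever blocks, but under heavy contention the recomputation of new_val is the wasted work the slide mentions.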
  • 27. Load Balancing
    Some processors may be busier than others
    To balance the workload, subcomputations can be scheduled on processors dynamically
    A technique for parallel loops is self-scheduling: processors repetitively grab chunks of iterations
    In guided self-scheduling, the chunk sizes shrink
    Analogous imbalances can occur in memory
    Overloaded memory locations are called hot spots
    Parallel algorithms and data structures must be designed to avoid them
    Imbalanced messaging is sometimes seen
  • 28. Examples
  • 29. A Data Parallel Example: Sorting
        void sort(int *src, int *dst, int size, int nvals) {
            int i, j, t1[nvals], t2[nvals];
            for (j = 0; j < nvals; j++) {
                t1[j] = 0;
            }
            for (i = 0; i < size; i++) {
                t1[src[i]]++;
            }   // t1[] now contains a histogram of the values
            t2[0] = 0;
            for (j = 1; j < nvals; j++) {
                t2[j] = t2[j-1] + t1[j-1];
            }   // t2[j] now contains the origin for value j
            for (i = 0; i < size; i++) {
                dst[t2[src[i]]++] = src[i];
            }
        }
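    To make the parameters concrete, here is a hypothetical driver (not on the slide); it assumes every value in src lies in the range [0, nvals):

        #include <cstdio>

        // sort() from the slide above is assumed to be in scope
        int main() {
            int src[] = {3, 1, 4, 1, 5, 2, 6, 5};    // all values are in [0, 8)
            int dst[8];
            sort(src, dst, 8, 8);                    // size = 8 elements, nvals = 8 possible values
            for (int v : dst) std::printf("%d ", v); // prints: 1 1 2 3 4 5 5 6
            std::printf("\n");
        }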
  • 30. When Is a Loop Parallelizable?
    The loop instances must safely interleave
    One way to do this is to only read the data
    Another way is to isolate data accesses
    Look at the first loop:
        for (j = 0; j < nvals; j++) {
            t1[j] = 0;
        }
    The accesses to t1[] are isolated from each other, so this loop can run in parallel “as is”
  • 31. Isolating Data Updates
    The second loop seems to have a problem:
        for (i = 0; i < size; i++) {
            t1[src[i]]++;
        }
    Two iterations may access the same t1[src[i]]
    If both reads precede both increments, oops!
    A few ways to isolate the iteration conflicts:
    Use an “isolated update” (lock prefix) instruction
    Use an array of locks, perhaps as big as t1[]
    Use non-blocking updates
    Use a transaction
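    As one concrete rendering of the "non-blocking updates" option (a sketch, not from the slides), C++'s std::atomic turns each bin increment into a single isolated read-modify-write:

        #include <atomic>

        // Histogram loop with atomic bins: concurrent iterations that hit the
        // same bin can no longer lose updates, so the loop may run in parallel.
        void count(const int* src, int size, std::atomic<int>* t1) {
            for (int i = 0; i < size; i++) {
                t1[src[i]].fetch_add(1, std::memory_order_relaxed);
            }
        }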
  • 32. Dependent Loop Iterations
    The third loop is an interesting challenge:
        for (j = 1; j < nvals; j++) {
            t2[j] = t2[j-1] + t1[j-1];
        }
    Each iteration depends on the previous one
    This loop is an example of a prefix computation
    If • is an associative binary operation on a set S, the •-prefixes of the sequence x0, x1, x2, … of values from S are x0, x0•x1, x0•x1•x2, …
    Prefix computations are often known as scans
    Scan can be done efficiently in parallel
  • 33. Cyclic Reduction
    Each vertical line represents a loop iteration
    The associated sequence element is to its right
    On step k of the scan, iteration j prefixes its own value with the value from iteration j − 2^k
    (Diagram, one row per step:)
        Step 0:  a   b    c    d     e      f       g
        Step 1:  a   ab   bc   cd    de     ef      fg
        Step 2:  a   ab   abc  abcd  bcde   cdef    defg
        Step 3:  a   ab   abc  abcd  abcde  abcdef  abcdefg
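    A serial sketch of that schedule (an illustration, assuming + as the associative operation); in a real parallel version the inner loop's iterations would run on separate processors with a barrier between steps, as the OpenMP example on slide 37 does:

        #include <vector>

        // In-place inclusive scan of x following the diagram's log-step schedule.
        void cyclic_reduction_scan(std::vector<long>& x) {
            const int n = static_cast<int>(x.size());
            for (int k = 1; k < n; k *= 2) {        // k = 2^step
                std::vector<long> prev = x;         // snapshot: all reads see the previous step
                for (int j = k; j < n; j++) {       // these iterations are independent of each other
                    x[j] = prev[j - k] + prev[j];   // prefix own value with the one from j - 2^step
                }
            }
        }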
  • 34. Applications of Scan
    Linear recurrences like the third loop
    Polynomial evaluation
    String comparison
    High-precision addition
    Finite automata
    Each xi is the next-state function given the i-th input symbol, and • is function composition
    APL compress
    When only the final value is needed, the computation is called a reduction instead
    It's a little bit cheaper than a full scan
  • 35. More Iterations n Than Processors p
    Wp = 3n + O(p log p),  Tp = 3n/p + O(log p)
  • 36. OpenMP
    OpenMP is a widely implemented extension to C++ and Fortran for data† parallelism
    It adds directives to serial programs
    A few of the more important directives:
        #pragma omp parallel for <modifiers>
            <for loop>
        #pragma omp atomic
            <binary op=, ++, or -- statement>
        #pragma omp critical <name>
            <structured block>
        #pragma omp barrier
    † And perhaps task parallelism soon
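    For illustration (a sketch, not from the talk), the first two loops of the earlier sort() could be annotated as follows; the atomic directive provides the isolated update that slide 31 calls for:

        #include <omp.h>

        void histogram_omp(const int* src, int size, int nvals, int* t1) {
            #pragma omp parallel for          // iterations are independent: runs in parallel "as is"
            for (int j = 0; j < nvals; j++) {
                t1[j] = 0;
            }

            #pragma omp parallel for
            for (int i = 0; i < size; i++) {
                #pragma omp atomic            // isolate the conflicting increment
                t1[src[i]]++;
            }
        }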
  • 37. The Sorting Example in OpenMP
    Only the third “scan” loop is a problem
    We can at least do this loop “manually”:
        int nt = omp_get_max_threads();
        int ta[nt], tb[nt];
        #pragma omp parallel num_threads(nt)
        {
            int myt = omp_get_thread_num();
            // Set ta[myt] = local sum of the nvals/nt elements of t1[] owned by this thread
            #pragma omp barrier
            for (int k = 1; k < nt; k *= 2) {
                tb[myt] = ta[myt];
                #pragma omp barrier
                if (myt >= k) ta[myt] += tb[myt - k];
                #pragma omp barrier
            }
            int fix = (myt > 0) ? ta[myt - 1] : 0;
            // Set nvals/nt elements of t2[] to fix + local scan of t1[]
        }
  • 38. Parallel Patterns Library (PPL)
    PPL is a Microsoft C++ library built on top of the ConcRT user-mode scheduling runtime
    It supports mixed data- and task-parallelism:
    parallel_for, parallel_for_each, parallel_invoke
    agent, send, receive, choice, join, task_group
    Parallel loops use C++ lambda expressions:
        parallel_for(0, nvals, [&t1](int j) {
            t1[j] = 0;
        });
    Updates can be isolated using intrinsic functions:
        (void)_InterlockedIncrement(&t1[src[i]]);
    Microsoft and Intel plan to unify PPL and TBB
  • 39. Dynamic Resource Management
    PPL programs are written for an arbitrary number of processors, which could be just one
    Load balancing is mostly done by work stealing
    There are two kinds of work to steal:
    Work that is unblocked and waiting for a processor
    Work that is not yet started and is potentially parallel
    Work of the latter kind will be done serially unless it is first stolen by another processor
    This makes recursive divide and conquer easy
    There is no concern about when to stop parallelism
  • 40. A Quicksort Example
        void quicksort(vector<int>::iterator first,
                       vector<int>::iterator last) {
            if (last - first < 2) { return; }
            int pivot = *first;
            auto mid1 = partition(first, last,
                                  [=](int e) { return e < pivot; });
            auto mid2 = partition(mid1, last,
                                  [=](int e) { return e == pivot; });
            parallel_invoke(
                [=] { quicksort(first, mid1); },
                [=] { quicksort(mid2, last); }
            );
        }
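    A hypothetical driver (not on the slide) showing how the routine might be called; it assumes the unqualified names in the slide's code resolve via the std and concurrency namespaces, with partition from <algorithm> and parallel_invoke from <ppl.h>:

        #include <ppl.h>          // concurrency::parallel_invoke
        #include <algorithm>      // std::partition
        #include <vector>
        #include <cstdio>
        using namespace std;
        using namespace concurrency;

        // quicksort() from the slide above is assumed to be defined here

        int main() {
            vector<int> v = {9, 3, 7, 1, 7, 4};
            quicksort(v.begin(), v.end());
            for (int x : v) printf("%d ", x);   // prints: 1 3 4 7 7 9
        }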
  • 41. LINQ and PLINQ
    LINQ (Language Integrated Query) extends the .NET languages C#, Visual Basic, and F#
    A LINQ query is really just a functional monad
    It queries databases, XML, or any IEnumerable
    PLINQ is a parallel implementation of LINQ
    Non-isolated functions must be avoided
    Otherwise it is hard to tell the two apart
  • 42. A LINQ Example
        var q = from n in names
                where n.Name == queryInfo.Name &&
                      n.State == queryInfo.State &&
                      n.Year >= yearStart &&
                      n.Year <= yearEnd
                orderby n.Year ascending
                select n;
    PLINQ version: change the data source to names.AsParallel()
  • 43. Message Passing Interface (MPI)
    MPI is a widely used message passing library for distributed memory HPC systems
    Some of its basic functions:
        MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Send, MPI_Recv
    A few of its “collective communication” functions:
        MPI_Reduce, MPI_Allreduce, MPI_Scan, MPI_Exscan, MPI_Barrier, MPI_Gather, MPI_Allgather, MPI_Alltoall
  • 44. Sorting in MPI
    Roughly, it could work like this on n nodes:
    Run the first two loops locally
    Use MPI_Allreduce to build a global histogram
    Run the third loop (redundantly) at every node
    Allocate n value intervals to nodes (redundantly), balancing the data per node as well as possible
    Run the fourth loop using the local histogram
    Use MPI_Alltoall to redistribute the data
    Merge the n sorted subarrays on each node
    Collective communication is expensive, but sorting needs it (see the Memory Wall slide)
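    A sketch of the first two steps only (an illustration, not from the slides), assuming each rank holds a local slice src of the data and all values lie in [0, nvals):

        #include <mpi.h>
        #include <vector>

        // Count locally, then sum the per-rank histograms so every node
        // has the global histogram and can run the scan loop redundantly.
        std::vector<int> global_histogram(const int* src, int local_size, int nvals) {
            std::vector<int> local(nvals, 0), global(nvals, 0);
            for (int i = 0; i < local_size; i++) {
                local[src[i]]++;
            }
            MPI_Allreduce(local.data(), global.data(), nvals,
                          MPI_INT, MPI_SUM, MPI_COMM_WORLD);
            return global;
        }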
  • 45. Another Way to Sort in MPI
    The Samplesort algorithm is like Quicksort
    It works like this on n nodes:
    Sort the local data on each node independently
    Take s samples of the sorted data on each node
    Use MPI_Allgather to send all nodes all samples
    Compute n − 1 splitters (redundantly) on all nodes, balancing the data per node as well as possible
    Use MPI_Alltoall to redistribute the data
    Merge the n sorted subarrays on each node
  • 46. Conclusions
  • 47. Parallel Computing Has Arrived
    We must rethink how we write programs
    And we are definitely doing that
    Other things will also need to change:
    Architecture
    Operating systems
    Algorithms
    Theory
    Application software
    We are seeing the biggest revolution in computing since its very beginnings
