20090720 smith



  1. Parallel and High Performance Computing
     Burton Smith
     Technical Fellow
     Microsoft
     7/17/2009
  2. Agenda
     • Introduction
     • Definitions
     • Architecture and Programming
     • Examples
     • Conclusions
  3. Introduction
  4. “Parallel and High Performance”?
     • “Parallel computing is a form of computation in which many calculations are carried out simultaneously” (G.S. Almasi and A. Gottlieb, Highly Parallel Computing, Benjamin/Cummings, 1994)
     • A High Performance (Super) Computer is:
       • One of the 500 fastest computers as measured by HPL, the High Performance Linpack benchmark
       • A computer that costs 200,000,000 rubles or more
       • Necessarily parallel, at least since the 1970s
  5. Recent Developments
     • For 20 years, parallel and high performance computing have been the same subject
     • Parallel computing is now mainstream
       • It reaches well beyond HPC into client systems: desktops, laptops, mobile phones
     • HPC software once had to stand alone
       • Now, it can be based on parallel PC software
       • The result: better tools and new possibilities
  6. The Emergence of the Parallel Client
     • Uniprocessor performance is leveling off
       • Instruction-level parallelism nears a limit (ILP Wall)
       • Power is getting painfully high (Power Wall)
       • Caches show diminishing returns (Memory Wall)
     • Logic density continues to grow (Moore’s Law)
       • So uniprocessors will collapse in area and cost
       • Cores per chip need to increase exponentially
     • We must all learn to write parallel programs
       • So new “killer apps” will enjoy more speed
  7. The ILP Wall
     • Instruction-level parallelism preserves the serial programming model
       • While getting speed from “undercover” parallelism
       • For example, see HPS†: out-of-order issue, in-order retirement, register renaming, branch prediction, speculation, …
     • At best, we get a few instructions/clock
     † Y.N. Patt et al., “Critical Issues Regarding HPS, a High Performance Microarchitecture,” Proc. 18th Ann. ACM/IEEE Int’l Symp. on Microarchitecture, 1985, pp. 109−116.
  8. The Power Wall
     • In the old days, power was kept roughly constant
       • Dynamic power, equal to CV²f, dominated
       • Every shrink of 0.7 in feature size halved transistor area
       • Capacitance C and voltage V also decreased by 0.7
       • Even with the clock frequency f increased by 1.4, power per transistor was cut in half
     • Now, shrinking no longer reduces V very much
       • So even at constant frequency, power density doubles
       • Static (leakage) power is also getting worse
     • Simpler, slower processors are more efficient
       • And to conserve power, we can turn some of them off
  9. The Memory Wall
     • We can get bigger caches from more transistors
       • Does this suffice, or is there a problem scaling up?
     • To speed up 2X without changing bandwidth below the cache, the miss rate must be halved
     • How much bigger does the cache have to be?†
       • For dense matrix multiply or dense LU, 4x bigger
       • For sorting or FFTs, the square of its former size
       • For sparse or dense matrix-vector multiply, impossible
     • Deeper interconnects increase miss latency
       • Latency tolerance needs memory access parallelism
     † H.T. Kung, “Memory requirements for balanced computer architectures,” 13th International Symposium on Computer Architecture, 1986, pp. 49−54.
  10. Overcoming the Memory Wall
     • Provide more memory bandwidth
       • Increase DRAM I/O bandwidth per gigabyte
       • Increase microprocessor off-chip bandwidth
     • Use architecture to tolerate memory latency
       • More latency → more threads or longer vectors
       • No change in programming model is needed
     • Use caches for bandwidth as well as latency
       • Let compilers control locality
       • Keep cache lines short
       • Avoid mis-speculation
  11. The End of The von Neumann Model
     • “Instructions are executed one at a time…”
     • We have relied on this idea for 60 years
       • Now it (and things it brought) must change
     • Serial programming is easier than parallel programming, at least for the moment
       • But serial programs are now slow programs
     • We need parallel programming paradigms that will make all programmers successful
     • The stakes for our field’s vitality are high
     • Computing must be reinvented
  12. Definitions
  13. Asymptotic Notation
     • Quantities are often meaningful only within a constant factor
       • Algorithm performance analyses, for example
     • f(n) = O(g(n)) means there exist constants c and n₀ such that n ≥ n₀ implies |f(n)| ≤ |c·g(n)|
     • f(n) = Ω(g(n)) means there exist constants c and n₀ such that n ≥ n₀ implies |f(n)| ≥ |c·g(n)|
     • f(n) = Θ(g(n)) means both f(n) = O(g(n)) and f(n) = Ω(g(n))
  14. Speedup, Time, and Work
     • The speedup of a computation is how much faster it runs in parallel compared to serially
       • If one processor takes T1 and p of them take Tp, then the p-processor speedup is Sp = T1/Tp
     • The work done is the number of operations performed, either serially or in parallel
       • W1 = O(T1) is the serial work, Wp the parallel work
     • We say a parallel computation is work-optimal if Wp = O(W1) = O(T1)
     • We say a parallel computation is time-optimal if Tp = O(W1/p) = O(T1/p)
  15. Latency, Bandwidth, & Concurrency
     • In any system that moves items from input to output without creating or destroying them,
       latency × bandwidth = concurrency
     • Queueing theory calls this result Little’s law
     • Example from the slide’s figure: with latency = 3 and bandwidth = 2, the concurrency in flight is 3 × 2 = 6
  16. Architecture and Programming
  17. Parallel Processor Architecture
     • SIMD: each instruction operates concurrently on multiple data items
     • MIMD: multiple instruction sequences execute concurrently
     • Concurrency is expressible in space or time
       • Spatial: the hardware is replicated
       • Temporal: the hardware is pipelined
  18. Trends in Parallel Processors
     • Today’s chips are spatial MIMD at top level
       • To get enough performance, even in PCs
     • Temporal MIMD is also used
     • SIMD is tending back toward spatial
       • Intel’s Larrabee combines all three
     • Temporal concurrency is easily “adjusted”
       • Vector length or number of hardware contexts
     • Temporal concurrency tolerates latency
       • Memory latency in the SIMD case
       • For MIMD, branches and synchronization also
  19. Parallel Memory Architecture
     • A shared memory system is one in which any processor can address any memory location
       • Quality of access can be either uniform (UMA) or nonuniform (NUMA), in latency and/or bandwidth
     • A distributed memory system is one in which processors can’t address most of memory
       • The disjoint memory regions and their associated processors are usually called nodes
     • A cluster is a distributed memory system with more than one processor per node
     • Nearly all HPC systems are clusters
  20. Parallel Programming Variations
     • Data parallelism and task parallelism
     • Functional style and imperative style
     • Shared memory and message passing
     • …and more we won’t have time to look at
     • A parallel application may use all of them
  21. Data Parallelism and Task Parallelism
     • A computation is data parallel when similar independent sub-computations are done simultaneously on multiple data items
       • Applying the same function to every element of a data sequence, for example
     • A computation is task parallel when dissimilar independent sub-computations are done simultaneously
       • Controlling the motions of a robot, for example
     • It sounds like SIMD vs. MIMD, but isn’t quite
       • Some kinds of data parallelism need MIMD
  22. Functional and Imperative Programs
     • A program is said to be written in (pure) functional style if it has no mutable state
       • Computing = naming and evaluating expressions
     • Programs with mutable state are usually called imperative because the state changes must be done when and where specified:
       while (z < x) { x = y; y = z; z = f(x, y); } return y;
     • Often, programs can be written either way:
       let w(x, y, z) = if (z < x) then w(y, z, f(x, y)) else y;
  23. Shared Memory and Message Passing
     • Shared memory programs access data in a shared address space
       • When to access the data is the big issue
       • Subcomputations therefore must synchronize
     • Message passing programs transmit data between subcomputations
       • The sender computes a value and then sends it
       • The receiver receives a value and then uses it
       • Synchronization can be built into communication
     • Message passing can be implemented very well on shared memory architectures
  24. Barrier Synchronization
     • A barrier synchronizes multiple parallel sub-computations by letting none proceed until all have arrived
       • It is named after the barrier used to start horse races
     • It guarantees everything before the barrier finishes before anything after it begins
     • It is a central feature in several data-parallel languages such as OpenMP
  25. Mutual Exclusion
     • This type of synchronization ensures only one subcomputation can do a thing at any time
       • If the thing is a code block, it is a critical section
     • It classically uses a lock: a data structure with which subcomputations can stop and start
     • Basic operations on a lock object L might be
       • Acquire(L): blocks until other subcomputations are finished with L, then acquires exclusive ownership
       • Release(L): yields L and unblocks some Acquire(L)
     • A lot has been written on these subjects
  26. Non-Blocking Synchronization
     • The basic idea is to achieve mutual exclusion using memory read-modify-write operations
     • Most commonly used is compare-and-swap:
       • CAS(addr, old, new) reads memory at addr and, if it contains old, replaces old with new
     • An arbitrary update at addr requires that {read old; compute new; CAS(addr, old, new)} be repeated until the CAS operation succeeds
     • If there is significant updating contention at addr, the repeated computation of new may be wasteful
  27. Load Balancing
     • Some processors may be busier than others
     • To balance the workload, subcomputations can be scheduled on processors dynamically
       • A technique for parallel loops is self-scheduling: processors repetitively grab chunks of iterations
       • In guided self-scheduling, the chunk sizes shrink
     • Analogous imbalances can occur in memory
       • Overloaded memory locations are called hot spots
       • Parallel algorithms and data structures must be designed to avoid them
     • Imbalanced messaging is sometimes seen
  28. Examples
  29. A Data Parallel Example: Sorting
     void sort(int *src, int *dst, int size, int nvals) {
       int i, j, t1[nvals], t2[nvals];
       for (j = 0; j < nvals; j++) {
         t1[j] = 0;
       }
       for (i = 0; i < size; i++) {
         t1[src[i]]++;
       } // t1[] now contains a histogram of the values
       t2[0] = 0;
       for (j = 1; j < nvals; j++) {
         t2[j] = t2[j-1] + t1[j-1];
       } // t2[j] now contains the origin for value j
       for (i = 0; i < size; i++) {
         dst[t2[src[i]]++] = src[i];
       }
     }
  30. When Is a Loop Parallelizable?
     • The loop instances must safely interleave
       • A way to do this is to only read the data
       • Another way is to isolate data accesses
     • Look at the first loop:
       for (j = 0; j < nvals; j++) {
         t1[j] = 0;
       }
     • The accesses to t1[] are isolated from each other
     • This loop can run in parallel “as is”
  31. Isolating Data Updates
     • The second loop seems to have a problem:
       for (i = 0; i < size; i++) {
         t1[src[i]]++;
       }
     • Two iterations may access the same t1[src[i]]
       • If both reads precede both increments, oops!
     • A few ways to isolate the iteration conflicts:
       • Use an “isolated update” (lock prefix) instruction
       • Use an array of locks, perhaps as big as t1[]
       • Use non-blocking updates
       • Use a transaction
  32. Dependent Loop Iterations
     • The third loop is an interesting challenge:
       for (j = 1; j < nvals; j++) {
         t2[j] = t2[j-1] + t1[j-1];
       }
     • Each iteration depends on the previous one
     • This loop is an example of a prefix computation
       • If • is an associative binary operation on a set S, the •-prefixes of the sequence x0, x1, x2, … of values from S are x0, x0•x1, x0•x1•x2, …
     • Prefix computations are often known as scans
     • Scan can be done efficiently in parallel
  33. Cyclic Reduction
     • Each vertical line represents a loop iteration
       • The associated sequence element is to its right
     • On step k of the scan, iteration j prefixes its own value with the value from iteration j − 2^k
       step 0:  a   b    c    d     e      f       g
       step 1:  a   ab   bc   cd    de     ef      fg
       step 2:  a   ab   abc  abcd  bcde   cdef    defg
       step 3:  a   ab   abc  abcd  abcde  abcdef  abcdefg
  34. Applications of Scan
     • Linear recurrences like the third loop
     • Polynomial evaluation
     • String comparison
     • High-precision addition
     • Finite automata
       • Each xi is the next-state function given the ith input symbol, and • is function composition
     • APL compress
     • When only the final value is needed, the computation is called a reduction instead
       • It’s a little bit cheaper than a full scan
  35. More Iterations n Than Processors p
     • Wp = 3n + O(p log p), Tp = 3n/p + O(log p)
  36. OpenMP
     • OpenMP is a widely-implemented extension to C++ and Fortran for data† parallelism
     • It adds directives to serial programs
     • A few of the more important directives:
       • #pragma omp parallel for <modifiers> <for loop>
       • #pragma omp atomic <binary op=, ++ or -- statement>
       • #pragma omp critical <name> <structured block>
       • #pragma omp barrier
     † And perhaps task parallelism soon
  37. The Sorting Example in OpenMP
     • Only the third “scan” loop is a problem
     • We can at least do this loop “manually”:
       nt = omp_get_num_threads();
       int ta[nt], tb[nt];
       #pragma omp parallel for
       for (myt = 0; myt < nt; myt++) {
         // Set ta[myt] = local sum of nvals/nt elements of t1[]
         #pragma omp barrier
         for (k = 1; k <= myt; k *= 2) {
           tb[myt] = ta[myt];
           ta[myt] += tb[myt - k];
           #pragma omp barrier
         }
         fix = (myt > 0) ? ta[myt - 1] : 0;
         // Set nvals/nt elements of t2[] to fix + local scan of t1[]
       }
  38. Parallel Patterns Library (PPL)
     • PPL is a Microsoft C++ library built on top of the ConcRT user-mode scheduling runtime
     • It supports mixed data- and task-parallelism:
       • parallel_for, parallel_for_each, parallel_invoke
       • agent, send, receive, choice, join, task_group
     • Parallel loops use C++ lambda expressions:
       parallel_for(0, nvals, [&t1](int j) {
         t1[j] = 0;
       });
     • Updates can be isolated using intrinsic functions:
       (void)_InterlockedIncrement(&t1[src[i]]);
     • Microsoft and Intel plan to unify PPL and TBB
  39. Dynamic Resource Management
     • PPL programs are written for an arbitrary number of processors, which could be just one
     • Load balancing is mostly done by work stealing
     • There are two kinds of work to steal:
       • Work that is unblocked and waiting for a processor
       • Work that is not yet started and is potentially parallel
     • Work of the latter kind will be done serially unless it is first stolen by another processor
       • This makes recursive divide and conquer easy
       • There is no concern about when to stop parallelism
  40. A Quicksort Example
     void quicksort(vector<int>::iterator first,
                    vector<int>::iterator last) {
       if (last - first < 2) { return; }
       int pivot = *first;
       auto mid1 = partition(first, last,
                             [=](int e) { return e < pivot; });
       auto mid2 = partition(mid1, last,
                             [=](int e) { return e == pivot; });
       parallel_invoke(
         [=] { quicksort(first, mid1); },
         [=] { quicksort(mid2, last); }
       );
     }
  41. LINQ and PLINQ
     • LINQ (Language Integrated Query) extends the .NET languages C#, Visual Basic, and F#
       • A LINQ query is really just a functional monad
       • It queries databases, XML, or any IEnumerable
     • PLINQ is a parallel implementation of LINQ
       • Non-isolated functions must be avoided
       • Otherwise it is hard to tell the two apart
  42. A LINQ Example
     var q = from n in names
             where n.Name == queryInfo.Name &&
                   n.State == queryInfo.State &&
                   n.Year >= yearStart &&
                   n.Year <= yearEnd
             orderby n.Year ascending
             select n;
     • PLINQ: add .AsParallel() to the data source, i.e. from n in names.AsParallel()
  43. Message Passing Interface (MPI)
     • MPI is a widely used message passing library for distributed memory HPC systems
     • Some of its basic functions:
       MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Send, MPI_Recv
     • A few of its “collective communication” functions:
       MPI_Barrier, MPI_Reduce, MPI_Allreduce, MPI_Scan, MPI_Exscan, MPI_Gather, MPI_Allgather, MPI_Alltoall
  44. Sorting in MPI
     • Roughly, it could work like this on n nodes:
       • Run the first two loops locally
       • Use MPI_Allreduce to build a global histogram
       • Run the third loop (redundantly) at every node
       • Allocate n value intervals to nodes (redundantly), balancing the data per node as well as possible
       • Run the fourth loop using the local histogram
       • Use MPI_Alltoall to redistribute the data
       • Merge the n sorted subarrays on each node
     • Collective communication is expensive
       • But sorting needs it (see the Memory Wall slide)
  45. Another Way to Sort in MPI
     • The Samplesort algorithm is like Quicksort
     • It works like this on n nodes:
       • Sort the local data on each node independently
       • Take s samples of the sorted data on each node
       • Use MPI_Allgather to send all nodes all samples
       • Compute n − 1 splitters (redundantly) on all nodes, balancing the data per node as well as possible
       • Use MPI_Alltoall to redistribute the data
       • Merge the n sorted subarrays on each node
  46. Conclusions
  47. Parallel Computing Has Arrived
     • We must rethink how we write programs
       • And we are definitely doing that
     • Other things will also need to change
       • Architecture
       • Operating systems
       • Algorithms
       • Theory
       • Application software
     • We are seeing the biggest revolution in computing since its very beginnings