Part I- Fundamental Concepts


  1. 1. Part I Fundamental Concepts Spring 2006 Parallel Processing, Fundamental Concepts Slide 1
  2. 2. About This Presentation
This presentation is intended to support the use of the textbook Introduction to Parallel Processing: Algorithms and Architectures (Plenum Press, 1999, ISBN 0-306-45970-1). It was prepared by the author in connection with teaching the graduate-level course ECE 254B: Advanced Computer Architecture: Parallel Processing, at the University of California, Santa Barbara. Instructors can use these slides in classroom teaching and for other educational purposes. Any other use is strictly prohibited. © Behrooz Parhami
Edition: First; Released: Spring 2005; Revised: Spring 2006; Revised: Spring 2006
Spring 2006 Parallel Processing, Fundamental Concepts Slide 2
  3. 3. I Fundamental Concepts Provide motivation, paint the big picture, introduce the 3 Ts: • Taxonomy (basic terminology and models) • Tools for evaluation or comparison • Theory to delineate easy and hard problems Topics in This Part Chapter 1 Introduction to Parallelism Chapter 2 A Taste of Parallel Algorithms Chapter 3 Parallel Algorithm Complexity Chapter 4 Models of Parallel Processing Spring 2006 Parallel Processing, Fundamental Concepts Slide 3
  4. 4. 1 Introduction to Parallelism Set the stage for presenting the course material, including: • Challenges in designing and using parallel systems • Metrics to evaluate the effectiveness of parallelism Topics in This Chapter 1.1 Why Parallel Processing? 1.2 A Motivating Example 1.3 Parallel Processing Ups and Downs 1.4 Types of Parallelism: A Taxonomy 1.5 Roadblocks to Parallel Processing 1.6 Effectiveness of Parallel Processing Spring 2006 Parallel Processing, Fundamental Concepts Slide 4
  5. 5. 1.1 Why Parallel Processing?
[Plot: processor performance, from KIPS through MIPS and GIPS to TIPS, vs. calendar year 1980-2010, showing a roughly 1.6x/yr trend through the 68000, 80286, 80386, 80486, 68040, Pentium, R10000, and Pentium II.]
Fig. 1.1 The exponential growth of microprocessor performance, known as Moore's law, shown over the past two decades (extrapolated).
Spring 2006 Parallel Processing, Fundamental Concepts Slide 5
  6. 6. The Semiconductor Technology Roadmap
From the 2001 edition of the roadmap [Alla02]:
Calendar year       2001  2004  2007  2010  2013  2016
Halfpitch (nm)       140    90    65    45    32    22
Clock freq. (GHz)      2     4     7    12    20    30
Wiring levels          7     8     9    10    10    10
Power supply (V)     1.1   1.0   0.8   0.7   0.6   0.5
Max. power (W)       130   160   190   220   250   290
Factors contributing to the validity of Moore's law: denser circuits; architectural improvements.
Measures of processor performance: instructions per second (MIPS, GIPS, TIPS, PIPS); floating-point operations per second (MFLOPS, GFLOPS, TFLOPS, PFLOPS); running time on benchmark suites.
[Plot of Fig. 1.1 repeated: processor performance vs. calendar year, roughly 1.6x/yr.]
Spring 2006 Parallel Processing, Fundamental Concepts Slide 6
  7. 7. Why High-Performance Computing? Higher speed (solve problems faster) Important when there are “hard” or “soft” deadlines; e.g., 24-hour weather forecast Higher throughput (solve more problems) Important when there are many similar tasks to perform; e.g., transaction processing Higher computational power (solve larger problems) e.g., weather forecast for a week rather than 24 hours, or with a finer mesh for greater accuracy Categories of supercomputers Uniprocessor; aka vector machine Multiprocessor; centralized or distributed shared memory Multicomputer; communicating via message passing Massively parallel processor (MPP; 1K or more processors) Spring 2006 Parallel Processing, Fundamental Concepts Slide 7
  8. 8. The Speed-of-Light Argument The speed of light is about 30 cm/ns. Signals travel at a fraction of the speed of light (say, 1/3). If signals must travel 1 cm during the execution of an instruction, that instruction will take at least 0.1 ns; thus, performance will be limited to 10 GIPS. This limitation is eased by continued miniaturization, architectural methods such as cache memory, etc.; however, a fundamental limit does exist. How does parallel processing help? Wouldn't multiple processors need to communicate via signals as well? Spring 2006 Parallel Processing, Fundamental Concepts Slide 8
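To make the slide's arithmetic concrete, here is a minimal Python sketch (mine, not part of the original slides) that recomputes the 10 GIPS bound from the stated assumptions: signals at roughly one third of 30 cm/ns and a 1 cm path per instruction.

```python
# Speed-of-light bound on single-processor performance (the slide's arithmetic).
C_CM_PER_NS = 30.0           # speed of light, about 30 cm/ns
SIGNAL_FRACTION = 1.0 / 3    # signals travel at roughly 1/3 the speed of light
PATH_CM = 1.0                # distance a signal must cover per instruction

signal_speed_cm_per_ns = C_CM_PER_NS * SIGNAL_FRACTION   # 10 cm/ns
time_per_instr_ns = PATH_CM / signal_speed_cm_per_ns     # 0.1 ns per instruction
instr_per_second = 1e9 / time_per_instr_ns               # 1 / (0.1 ns) = 10^10

print(f"minimum instruction time: {time_per_instr_ns:.2f} ns")
print(f"performance limit: {instr_per_second / 1e9:.0f} GIPS")   # -> 10 GIPS
```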
  9. 9. Why Do We Need TIPS or TFLOPS Performance?
Reasonable running time = fraction of hour to several hours (10^3-10^4 s). In this time, a TIPS/TFLOPS machine can perform 10^15-10^16 operations.
Example 1: Southern oceans heat modeling (10-minute iterations): 300 GFLOP per iteration x 300 000 iterations per 6 yrs = 10^16 FLOP
Example 2: Fluid dynamics calculations (1000 x 1000 x 1000 lattice): 10^9 lattice points x 1000 FLOP/point x 10 000 time steps = 10^16 FLOP
Example 3: Monte Carlo simulation of nuclear reactor: 10^11 particles to track (for 1000 escapes) x 10^4 FLOP/particle = 10^15 FLOP
Decentralized supercomputing (from MathWorld News, 2006/4/7): a grid of tens of thousands of networked computers discovers 2^30,402,457 - 1, the 43rd Mersenne prime, as the largest known prime (9,152,052 digits).
Spring 2006 Parallel Processing, Fundamental Concepts Slide 9
  10. 10. The ASCI Program
[Plot: performance (TFLOPS) vs. calendar year 1995-2010, with develop/use plan milestones: ASCI Red, 1+ TFLOPS, 0.5 TB; ASCI Blue, 3+ TFLOPS, 1.5 TB; ASCI White, 10+ TFLOPS, 5 TB; ASCI Q, 30+ TFLOPS, 10 TB; ASCI Purple, 100+ TFLOPS, 20 TB.]
Fig. 24.1 Milestones in the Accelerated Strategic (Advanced Simulation &) Computing Initiative (ASCI) program, sponsored by the US Department of Energy, with extrapolation up to the PFLOPS level.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 10
  11. 11. The Quest for Higher Performance
Top three supercomputers in 2005 (IEEE Spectrum, Feb. 2005, pp. 15-16):
1. IBM Blue Gene/L -- LLNL, California; material science, nuclear stockpile sim; 32,768 proc's, 8 TB, 28 TB disk storage; Linux + custom OS; 71 TFLOPS, $100 M; dual-proc Power-PC chips (10-15 W power); full system: 130k proc's, 360 TFLOPS (est).
2. SGI Columbia -- NASA Ames, California; aerospace/space sim, climate research; 10,240 proc's, 20 TB, 440 TB disk storage; Linux; 52 TFLOPS, $50 M; 20x Altix (512 Itanium2) linked by Infiniband.
3. NEC Earth Sim -- Earth Sim Ctr, Yokohama; atmospheric, oceanic, and earth sciences; 5,120 proc's, 10 TB, 700 TB disk storage; Unix; 36 TFLOPS*, $400 M?; built of custom vector microprocessors; volume = 50x IBM, power = 14x IBM.
* Led the top500 list for 2.5 yrs
Spring 2006 Parallel Processing, Fundamental Concepts Slide 11
  12. 12. Supercomputer Performance Growth
[Plot: supercomputer performance, MFLOPS to PFLOPS, vs. calendar year 1980-2010, showing vector supers (Cray X-MP, Y-MP), micros (80386, 80860, Alpha), MPPs (CM-2, CM-5, $30M and $240M MPPs), and the ASCI goals.]
Fig. 1.2 The exponential growth in supercomputer performance over the past two decades (from [Bell92], with ASCI performance goals and microprocessor peak FLOPS superimposed as dotted lines).
Spring 2006 Parallel Processing, Fundamental Concepts Slide 12
  13. 13. 1.2 A Motivating Example
[Figure: the list 2-30 shown over four columns (Init., Pass 1, Pass 2, Pass 3); multiples of 2, 3, and 5 are successively erased, leaving the primes 2, 3, 5, 7, 11, 13, 17, 19, 23, 29.]
Fig. 1.3 The sieve of Eratosthenes yielding a list of 10 primes for n = 30. Marked elements have been distinguished by erasure from the list.
Any composite number has a prime factor that is no greater than its square root.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 13
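As a concrete companion to Fig. 1.3, here is a short sequential sketch (my own code, not the book's) of the sieve of Eratosthenes; it uses exactly the observation quoted above, marking multiples only of primes up to the square root of n.

```python
from math import isqrt

def sieve(n):
    """Return the list of primes in 2..n using the sieve of Eratosthenes."""
    is_prime = [True] * (n + 1)            # the bit vector of Fig. 1.4
    is_prime[0:2] = [False, False]
    for p in range(2, isqrt(n) + 1):       # primes above sqrt(n) need no marking pass
        if is_prime[p]:
            for multiple in range(p * p, n + 1, p):
                is_prime[multiple] = False # "erase" composites
    return [i for i, prime in enumerate(is_prime) if prime]

print(sieve(30))   # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29] -- the 10 primes of Fig. 1.3
```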
  14. 14. Single-Processor Implementation of the Sieve
[Figure: a single processor P with registers for the current prime and the index, operating on a bit vector of length n.]
Fig. 1.4 Schematic representation of single-processor solution for the sieve of Eratosthenes.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 14
  15. 15. Control-Parallel Implementation of the Sieve
[Figure: processors P1, P2, ..., Pp, each with its own index register, sharing the current prime, the bit vector 1..n, and an I/O device through a shared memory.]
Fig. 1.5 Schematic representation of a control-parallel solution for the sieve of Eratosthenes.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 15
  16. 16. Running Time of the Sequential/Parallel Sieve
[Timing chart, time axis 0-1500: with p = 1 the passes for the primes 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, ... run one after another, t = 1411; with p = 2, t = 706; with p = 3, t = 499.]
Fig. 1.6 Control-parallel realization of the sieve of Eratosthenes with n = 1000 and 1 <= p <= 3.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 16
  17. 17. Data-Parallel Implementation of the Sieve
Assume at most sqrt(n) processors, so that all prime factors dealt with are in P1 (which broadcasts them).
[Figure: P1 holds elements 1..n/p, P2 holds n/p+1..2n/p, ..., Pp holds n-n/p+1..n, each with its own current-prime and index registers; the processors are linked by a communication network.]
Fig. 1.7 Data-parallel realization of the sieve of Eratosthenes.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 17
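The following sketch (my own illustration; function and variable names are not from the book) mimics the data-parallel organization of Fig. 1.7 in plain Python: processor P1's block supplies the primes up to sqrt(n), which are then "broadcast" and applied to every block independently.

```python
from math import isqrt

def data_parallel_sieve(n, p):
    """Simulate the data-parallel sieve: the range 2..n is split into p equal blocks."""
    assert p <= isqrt(n), "assumption of Fig. 1.7: at most sqrt(n) processors"
    block = (n - 1 + p - 1) // p                    # about n/p elements per processor
    blocks = [list(range(2 + i * block, min(n, 1 + (i + 1) * block) + 1))
              for i in range(p)]

    # P1 finds the primes up to sqrt(n) locally, then broadcasts them.
    small_primes = [x for x in blocks[0] if x <= isqrt(n) and
                    all(x % q for q in range(2, isqrt(x) + 1))]

    survivors = []                                  # each processor filters its own block
    for b in blocks:
        survivors += [x for x in b
                      if x in small_primes or all(x % q for q in small_primes)]
    return survivors

print(data_parallel_sieve(30, 3))   # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```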
  18. 18. One Reason for Sublinear Speedup: Communication Overhead
[Two plots vs. number of processors: solution time split into computation time and communication time, and the resulting actual speedup falling below the ideal speedup.]
Fig. 1.8 Trade-off between communication time and computation time in the data-parallel realization of the sieve of Eratosthenes.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 18
  19. 19. Another Reason for Sublinear Speedup: Input/Output Overhead
[Two plots vs. number of processors: solution time split into computation time and I/O time, and the resulting actual speedup falling below the ideal speedup.]
Fig. 1.9 Effect of a constant I/O time on the data-parallel realization of the sieve of Eratosthenes.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 19
  20. 20. 1.3 Parallel Processing Ups and Downs
Fig. 1.10 Richardson's circular theater for weather forecasting calculations: using thousands of "computers" (humans + calculators), coordinated by a conductor, for 24-hr weather prediction in a few hours.
1960s: ILLIAC IV (U Illinois) -- four 8 x 8 mesh quadrants, SIMD.
1980s: Commercial interest -- technology was driven by government grants & contracts. Once funding dried up, many companies went bankrupt.
2000s: Internet revolution -- info providers, multimedia, data mining, etc. need lots of power.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 20
  21. 21. Trends in High-Technology Development
[Chart, 1960-2000: the progression from government research to industrial R&D ("transfer of ideas/people") to $1B businesses, shown for graphics, networking, RISC, and parallelism.]
Evolution of parallel processing has been quite different from other high-tech fields.
Development of some technical fields into $1B businesses and the roles played by government research and industrial R&D over time (IEEE Computer, early 90s?).
Spring 2006 Parallel Processing, Fundamental Concepts Slide 21
  22. 22. Trends in Hi-Tech Development (2003) Spring 2006 Parallel Processing, Fundamental Concepts Slide 22
  23. 23. Status of Computing Power (circa 2000)
GFLOPS on desktop: Apple Macintosh, with G4 processor
TFLOPS in supercomputer center: 1152-processor IBM RS/6000 SP (switch-based network); Cray T3E, torus-connected
PFLOPS on drawing board: 1M-processor IBM Blue Gene (2005?); 32 proc's/chip, 64 chips/board, 8 boards/tower, 64 towers
Processor: 8 threads, on-chip memory, no data cache
Chip: defect-tolerant, row/column rings in a 6 x 6 array
Board: 8 x 8 chip grid organized as 4 x 4 x 4 cube
Tower: boards linked to 4 neighbors in adjacent towers
System: 32 x 32 x 32 cube of chips, 1.5 MW (water-cooled)
Spring 2006 Parallel Processing, Fundamental Concepts Slide 23
  24. 24. 1.4 Types of Parallelism: A Taxonomy
Flynn's categories (single/multiple instruction streams x single/multiple data streams): SISD (uniprocessors), SIMD (array or vector processors), MISD (rarely used), MIMD (multiproc's or multicomputers).
Johnson's expansion of MIMD (global/distributed memory x shared variables/message passing): GMSV (shared-memory multiprocessors), GMMP (rarely used), DMSV (distributed shared memory), DMMP (distrib-memory multicomputers).
Fig. 1.11 The Flynn-Johnson classification of computer systems.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 24
  25. 25. 1.5 Roadblocks to Parallel Processing
Grosch's law: Economy of scale applies, or power = cost^2 -- No longer valid; in fact we can get more bang per buck in micros.
Minsky's conjecture: Speedup tends to be proportional to log p -- Has roots in analysis of memory bank conflicts; can be overcome.
Tyranny of IC technology: Uniprocessors suffice (x10 faster/5 yrs) -- Faster ICs make parallel machines faster too; what about x1000?
Tyranny of vector supercomputers: Familiar programming model -- Not all computations involve vectors; parallel vector machines.
Software inertia: Billions of dollars investment in software -- New programs; even uniprocessors benefit from parallelism spec.
Amdahl's law: Unparallelizable code severely limits the speedup.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 25
  26. 26. Amdahl's Law
f = fraction unaffected; p = speedup of the rest
s = 1 / [f + (1 - f)/p] <= min(p, 1/f)
[Plot: speedup s vs. enhancement factor p, both 0-50, for f = 0, 0.01, 0.02, 0.05, 0.1.]
Fig. 1.12 Limit on speed-up according to Amdahl's law.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 26
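A tiny sketch (mine) of the formula above; it can be used to reproduce the curves of Fig. 1.12.

```python
def amdahl_speedup(f, p):
    """Speedup when a fraction f of the work is unaffected and the rest is sped up p-fold."""
    return 1.0 / (f + (1.0 - f) / p)

for f in (0, 0.01, 0.02, 0.05, 0.1):
    s = amdahl_speedup(f, 50)
    bound = 50 if f == 0 else min(50, 1 / f)
    print(f"f = {f:<4}: s(50) = {s:5.2f}, upper bound min(p, 1/f) = {bound:.0f}")
```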
  27. 27. 1.6 Effectiveness of Parallel Processing
p     Number of processors
W(p)  Work performed by p processors
T(p)  Execution time with p processors; T(1) = W(1); T(p) <= W(p)
S(p)  Speedup = T(1) / T(p)
E(p)  Efficiency = T(1) / [p T(p)]
R(p)  Redundancy = W(p) / W(1)
U(p)  Utilization = W(p) / [p T(p)]
Q(p)  Quality = T^3(1) / [p T^2(p) W(p)]
Fig. 1.13 Task graph exhibiting limited inherent parallelism (13 nodes; W(1) = T(1) = 13, T(infinity) = 8).
Spring 2006 Parallel Processing, Fundamental Concepts Slide 27
  28. 28. Reduction or Fan-in Computation
Example: Adding 16 numbers, 8 processors, unit-time additions.
Zero-time communication: S(8) = 15 / 4 = 3.75; E(8) = 15 / (8 x 4) = 47%; R(8) = 15 / 15 = 1; Q(8) = 1.76
Unit-time communication: S(8) = 15 / 7 = 2.14; E(8) = 15 / (8 x 7) = 27%; R(8) = 22 / 15 = 1.47; Q(8) = 0.39
Fig. 1.14 Computation graph for finding the sum of 16 numbers.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 28
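To tie the metric definitions of the previous slide to these numbers, here is a short sketch (my own) that recomputes both cases of Fig. 1.14; the work and time values are taken straight from the slide.

```python
def metrics(T1, Tp, Wp, p):
    """Speedup, efficiency, redundancy, utilization, and quality as defined on slide 27."""
    W1 = T1                                    # T(1) = W(1)
    return {"S": T1 / Tp,
            "E": T1 / (p * Tp),
            "R": Wp / W1,
            "U": Wp / (p * Tp),
            "Q": T1**3 / (p * Tp**2 * Wp)}

# Adding 16 numbers on 8 processors, zero-time communication:
# 15 additions done in 4 time steps.
print(metrics(T1=15, Tp=4, Wp=15, p=8))   # S = 3.75, E ~ 0.47, R = 1, Q ~ 1.76

# Unit-time communication: 7 time steps, W(8) = 22 as given on the slide.
print(metrics(T1=15, Tp=7, Wp=22, p=8))   # S ~ 2.14, E ~ 0.27, R ~ 1.47, Q ~ 0.39
```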
  29. 29. ABCs of Parallel Processing in One Slide
A  Amdahl's Law (Speedup Formula)
Bad news -- Sequential overhead will kill you, because: Speedup = T1/Tp <= 1/[f + (1 - f)/p] <= min(1/f, p).
Moral: For f = 0.1, speedup is at best 10, regardless of peak OPS.
B  Brent's Scheduling Theorem
Good news -- Optimal scheduling is very difficult, but even a naive scheduling algorithm can ensure: T1/p <= Tp <= T1/p + T∞ = (T1/p)[1 + p/(T1/T∞)].
Result: For a reasonably parallel task (large T1/T∞), or for a suitably small p (say, p <= T1/T∞), good speedup and efficiency are possible.
C  Cost-Effectiveness Adage
Real news -- The most cost-effective parallel solution may not be the one with highest peak OPS (communication?), greatest speed-up (at what cost?), or best utilization (hardware busy doing what?).
Analogy: Mass transit might be more cost-effective than private cars even if it is slower and leads to many empty seats.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 29
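A small sketch (mine, following the statement of Brent's theorem above) that evaluates both bounds for a task characterized only by its total work T1 and critical-path time T∞:

```python
def brent_bounds(T1, Tinf, p):
    """Lower and upper bounds on the parallel time Tp from Brent's scheduling theorem."""
    lower = T1 / p
    upper = T1 / p + Tinf        # equivalently (T1/p) * (1 + p / (T1 / Tinf))
    return lower, upper

# Task graph of Fig. 1.13: T1 = 13, Tinf = 8.
for p in (2, 4, 8):
    lo, hi = brent_bounds(13, 8, p)
    print(f"p = {p}: {lo:.2f} <= Tp <= {hi:.2f}")
```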
  30. 30. 2 A Taste of Parallel Algorithms Learn about the nature of parallel algorithms and complexity: • By implementing 5 building-block parallel computations • On 4 simple parallel architectures (20 combinations) Topics in This Chapter 2.1 Some Simple Computations 2.2 Some Simple Architectures 2.3 Algorithms for a Linear Array 2.4 Algorithms for a Binary Tree 2.5 Algorithms for a 2D Mesh 2.6 Algorithms with Shared Variables Spring 2006 Parallel Processing, Fundamental Concepts Slide 30
  31. 31. 2.1 Some Simple Computations
[Figure: x0 through xn-1 combined one at a time with the semigroup operation, starting from the identity element, at t = 0, 1, 2, ..., n; the final result is s = x0 ⊗ x1 ⊗ ... ⊗ xn-1.]
Fig. 2.1 Semigroup computation on a uniprocessor.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 31
  32. 32. Parallel Semigroup Computation
[Figure: x0 through x10 combined pairwise in a binary tree of ⌈log2 n⌉ levels, producing s = x0 ⊗ x1 ⊗ ... ⊗ xn-1.]
Semigroup computation viewed as tree or fan-in computation.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 32
  33. 33. Parallel Prefix Computation
[Figure: at step t the running result s_t = x0 ⊗ x1 ⊗ ... ⊗ x_t is produced, ending with s_(n-1) = x0 ⊗ x1 ⊗ x2 ⊗ ... ⊗ x_(n-1).]
The parallel version is much trickier compared to that of semigroup computation; it requires a minimum of log2 n levels.
Prefix computation on a uniprocessor.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 33
  34. 34. The Five Building-Block Computations Semigroup computation: aka tree or fan-in computation All processors to get the computation result at the end Parallel prefix computation: The ith processor to hold the ith prefix result at the end Packet routing: Send a packet from a source to a destination processor Broadcasting: Send a packet from a source to all processors Sorting: Arrange a set of keys, stored one per processor, so that the ith processor holds the ith key in ascending order Spring 2006 Parallel Processing, Fundamental Concepts Slide 34
  35. 35. 2.2 Some Simple Architectures
Fig. 2.2 A linear array of nine processors (P0-P8) and its ring variant.
Max node degree d = 2
Network diameter D = p - 1 (⌊p/2⌋ for the ring)
Bisection width B = 1 (2 for the ring)
Spring 2006 Parallel Processing, Fundamental Concepts Slide 35
  36. 36. (Balanced) Binary Tree Architecture
Complete binary tree: 2^q - 1 nodes, 2^(q-1) leaves
Balanced binary tree: leaf levels differ by at most 1
Max node degree d = 3
Network diameter D = 2 ⌈log2 p⌉ (- 1)
Bisection width B = 1
Fig. 2.3 A balanced (but incomplete) binary tree of nine processors.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 36
  37. 37. Two-Dimensional (2D) Mesh
Nonsquare mesh (r rows, p/r col's) also possible
Max node degree d = 4
Network diameter D = 2√p - 2 (√p for the torus)
Bisection width B ≈ √p (2√p for the torus)
Fig. 2.4 2D mesh of 9 processors and its torus variant.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 37
  38. 38. Shared-Memory Architecture
Max node degree d = p - 1
Network diameter D = 1
Bisection width B = ⌊p/2⌋ ⌈p/2⌉
Costly to implement; not scalable. But conceptually simple and easy to program.
Fig. 2.5 A shared-variable architecture modeled as a complete graph.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 38
  39. 39. Architecture/Algorithm Combinations
[Table: the five computations (semigroup, parallel prefix, packet routing, broadcasting, sorting) against the four architectures (linear array, binary tree, 2D mesh, shared memory).]
We will spend more time on the linear array and binary tree, and less time on the mesh and shared memory (studied later).
Spring 2006 Parallel Processing, Fundamental Concepts Slide 39
  40. 40. 2.3 Algorithms for a Linear Array
[Figure: nine processors with initial values 5, 2, 8, 6, 3, 7, 9, 1, 4; in each step every processor replaces its value by the maximum of its own and its neighbors' values, until all processors hold the maximum, 9.]
Fig. 2.6 Maximum-finding on a linear array of nine processors.
For general semigroup computation: Phase 1: the partial result is propagated from left to right. Phase 2: the result obtained by processor p - 1 is broadcast leftward.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 40
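Here is a small simulation sketch (my own, not the book's code) of the maximum-finding scheme of Fig. 2.6: in every step each processor keeps the maximum of its own value and its immediate neighbors' values, so all cells hold the global maximum after at most p - 1 steps.

```python
def linear_array_max(values):
    """Simulate synchronous max-finding on a linear array; return the step-by-step trace."""
    cells = list(values)
    trace = [cells[:]]
    for _ in range(len(cells) - 1):          # p - 1 steps always suffice
        cells = [max(cells[max(i - 1, 0):i + 2]) for i in range(len(cells))]
        trace.append(cells[:])
        if len(set(cells)) == 1:             # every processor already agrees
            break
    return trace

for step in linear_array_max([5, 2, 8, 6, 3, 7, 9, 1, 4]):
    print(step)    # reproduces the rows of Fig. 2.6, ending with all cells holding 9
```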
  41. 41. Linear Array Prefix Sum Computation
[Figure: nine processors with initial values 5, 2, 8, 6, 3, 7, 9, 1, 4; the running sum sweeps from left to right, yielding the final prefix sums 5, 7, 15, 21, 24, 31, 40, 41, 45.]
Fig. 2.7 Computing prefix sums on a linear array of nine processors.
Diminished parallel prefix computation: the ith processor obtains the result up to element i - 1.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 41
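The left-to-right sweep of Fig. 2.7 and its diminished variant can be sketched as follows (my illustration):

```python
def prefix_sums(values):
    """Ordinary prefix sums: the ith result includes element i."""
    out, running = [], 0
    for v in values:
        running += v
        out.append(running)
    return out

def diminished_prefix_sums(values):
    """Diminished variant: the ith result covers elements 0 .. i-1 only."""
    out, running = [], 0
    for v in values:
        out.append(running)
        running += v
    return out

data = [5, 2, 8, 6, 3, 7, 9, 1, 4]
print(prefix_sums(data))             # [5, 7, 15, 21, 24, 31, 40, 41, 45]  (Fig. 2.7)
print(diminished_prefix_sums(data))  # [0, 5, 7, 15, 21, 24, 31, 40, 41]
```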
  42. 42. Linear-Array Prefix Sum Computation
[Figure: each processor holds two items; it first computes its local prefixes, then a diminished prefix computation over the local totals is performed on the array, and each processor adds the incoming diminished prefix to its local prefixes to obtain the final results.]
Fig. 2.8 Computing prefix sums on a linear array with two items per processor.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 42
  43. 43. Linear Array Routing and Broadcasting
[Figure: left-moving and right-moving packets on a linear array of nine processors.]
Routing and broadcasting on a linear array of nine processors.
To route from processor i to processor j: compute j - i to determine distance and direction.
To broadcast from processor i: send a left-moving and a right-moving broadcast message.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 43
  44. 44. Linear Array Sorting (Externally Supplied Keys)
[Figure: the keys 5, 2, 8, 6, 3, 7, 9, 1, 4 are fed into the array from the left, one per step; each processor retains the smaller key and passes the larger to the right, so the array ends up holding 1 through 9 in ascending order.]
Fig. 2.9 Sorting on a linear array with the keys input sequentially from the left.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 44
  45. 45. Linear Array Sorting (Internally Stored Keys)
[Figure: odd-even transposition sort of 5, 2, 8, 6, 3, 7, 9, 1, 4; in odd steps 1, 3, 5, etc., odd-numbered processors exchange values with their right neighbors, and in even steps the even-numbered processors do.]
Fig. 2.10 Odd-even transposition sort on a linear array.
T(1) = W(1) = p log2 p    T(p) = p    W(p) ≈ p^2/2
S(p) = log2 p (Minsky's conjecture?)    R(p) = p/(2 log2 p)
Spring 2006 Parallel Processing, Fundamental Concepts Slide 45
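A compact sketch (mine) of the odd-even transposition sort traced in Fig. 2.10; each of the p phases performs the compare-exchanges that the physical array would carry out in parallel.

```python
def odd_even_transposition_sort(keys):
    """Odd-even transposition sort: p phases of neighbor compare-exchanges."""
    a = list(keys)
    p = len(a)
    for phase in range(p):
        # First phase: odd-indexed processors pair with their right neighbors (as in Fig. 2.10).
        start = 1 if phase % 2 == 0 else 0
        for i in range(start, p - 1, 2):     # these exchanges are independent, hence parallel
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_transposition_sort([5, 2, 8, 6, 3, 7, 9, 1, 4]))
# [1, 2, 3, 4, 5, 6, 7, 8, 9] after p = 9 phases, consistent with T(p) = p
```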
  46. 46. 2.4 Algorithms for a Binary Tree
[Figure: two copies of the nine-processor binary tree of Fig. 2.3, illustrating semigroup computation and broadcasting.]
Semigroup computation and broadcasting on a binary tree.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 46
  47. 47. Binary Tree Parallel Prefix Computation
[Figure: upward propagation combines the leaf values x0 ... x4 toward the root; downward propagation then sends each node the combination of everything to its left (starting from the identity), so the leaves end with the results x0, x0 ⊗ x1, ..., x0 ⊗ x1 ⊗ x2 ⊗ x3 ⊗ x4.]
Fig. 2.11 Parallel prefix computation on a binary tree of processors.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 47
  48. 48. Node Function in Binary Tree Parallel Prefix
Two binary operations per node: one during the upward propagation phase, and another during downward propagation.
[Figure: a node receiving [i, j-1] and [j, k] from its children sends [i, k] upward; on the way down it receives [0, i-1], passes [0, i-1] to its left child and [0, j-1] to its right child.]
Insert latches for systolic operation (no long wires or propagation path).
Spring 2006 Parallel Processing, Fundamental Concepts Slide 48
  49. 49. Usefulness of Parallel Prefix Computation
Ranks of 1s in a list of 0s/1s:
Data:             0 0 1 0 1 0 0 1 1 1 0
Prefix sums:      0 0 1 1 2 2 2 3 4 5 5
Ranks of 1s:          1   2     3 4 5
Priority arbitration circuit:
Data:             0 0 1 0 1 0 0 1 1 1 0
Dim'd prefix ORs: 0 0 0 1 1 1 1 1 1 1 1
Complement:       1 1 1 0 0 0 0 0 0 0 0
AND with data:    0 0 1 0 0 0 0 0 0 0 0
Carry-lookahead network: the carry operator ¢ satisfies p ¢ x = x, a ¢ x = a, g ¢ x = g (g = generate, p = propagate, a = annihilate); note the direction of indexing.
[Figure: prefix computation with ¢ over the signal string p g a g g p p p g a, with cin.]
Spring 2006 Parallel Processing, Fundamental Concepts Slide 49
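A short sketch (mine) of the first two uses listed above, both reduced to prefix computations over a bit vector:

```python
from itertools import accumulate

data = [0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0]

# Ranks of 1s: a prefix sum over the bit vector gives each 1 its rank.
sums = list(accumulate(data))
ranks = [s if d == 1 else None for d, s in zip(data, sums)]
print(sums)     # [0, 0, 1, 1, 2, 2, 2, 3, 4, 5, 5]
print(ranks)    # the 1s receive ranks 1..5, the 0s get None

# Priority arbitration: only the lowest-indexed 1 is granted.
dim_prefix_or = [0] + [int(any(data[:i])) for i in range(1, len(data))]
grant = [d & (1 - por) for d, por in zip(data, dim_prefix_or)]
print(grant)    # [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```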
  50. 50. Binary Tree Packet Routing
[Figure: a nine-node binary tree labeled with two indexing schemes: path strings from the root (XXX, LXX, RXX, LLX, LRX, RLX, RRX, RRL, RRR) and preorder indexing.]
Node index is a representation of the path from the tree root.
Packet routing on a binary tree with two indexing schemes.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 50
  51. 51. Binary Tree Sorting
[Figure: the keys 5, 2, 3, 1, 4 stored in the tree; small values "bubble up," causing the root to "see" the values in ascending order.]
Linear-time sorting (no better than the linear array).
Fig. 2.12 The first few steps of the sorting algorithm on a binary tree.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 51
  52. 52. The Bisection-Width Bottleneck in a Binary Tree
Bisection width = 1. Linear-time sorting is the best possible due to B = 1.
Fig. 2.13 The bisection width of a binary tree architecture.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 52
  53. 53. 2.5 Algorithms for a 2D Mesh
Finding the max value on a 2D mesh: compute row maximums, then column maximums.
[Figure: 3 x 3 mesh with values 5 2 8 / 6 3 7 / 9 1 4; the row maximums 8, 7, 9 spread across the rows, then the column maximums leave 9 everywhere.]
Computing prefix sums on a 2D mesh: compute row prefix sums, then diminished prefix sums in the last column, then broadcast in rows and combine.
[Figure: row prefix sums 5 7 15 / 6 9 16 / 9 10 14; diminished column prefixes 0, 15, 31; final prefix sums 5 7 15 / 21 24 31 / 40 41 45.]
Spring 2006 Parallel Processing, Fundamental Concepts Slide 53
  54. 54. Routing and Broadcasting on a 2D Mesh
Nonsquare mesh (r rows, p/r col's) also possible.
Routing: send along the row to the correct column; then route in the column.
Broadcasting: broadcast in the row; then broadcast in all columns.
Routing and broadcasting on a 9-processor 2D mesh or torus.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 54
  55. 55. Sorting on a 2D Mesh Using Shearsort
Number of iterations = log2 √p
Compare-exchange steps in each iteration = 2√p
Total steps = (log2 p + 1) √p
[Figure: each phase consists of a snake-like row sort followed by a top-to-bottom column sort; a final left-to-right row sort completes the ordering of the initial values 5 2 8 / 6 3 7 / 9 1 4 into 1 2 3 / 4 5 6 / 7 8 9.]
Fig. 2.14 The shearsort algorithm on a 3 x 3 mesh.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 55
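The following sketch (my own) implements shearsort on a square mesh stored as a list of rows: odd-indexed rows are sorted right to left to produce the snake-like order, column sorts alternate with the row sorts, and a final left-to-right row sort finishes the job, as in Fig. 2.14.

```python
import math

def shearsort(mesh):
    """Shearsort on a square 2D mesh given as a list of rows."""
    r = len(mesh)
    grid = [row[:] for row in mesh]
    for _ in range(math.ceil(math.log2(r))):        # log2(sqrt p) iterations
        for i in range(r):                          # snake-like row sort
            grid[i].sort(reverse=(i % 2 == 1))
        for j in range(r):                          # top-to-bottom column sort
            column = sorted(grid[i][j] for i in range(r))
            for i in range(r):
                grid[i][j] = column[i]
    for i in range(r):                              # final left-to-right row sort
        grid[i].sort()
    return grid

print(shearsort([[5, 2, 8], [6, 3, 7], [9, 1, 4]]))
# [[1, 2, 3], [4, 5, 6], [7, 8, 9]], matching Fig. 2.14
```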
  56. 56. 2.6 Algorithms with Shared Variables
Semigroup computation: each processor can perform the computation locally.
Parallel prefix computation: same as semigroup, except only data from smaller-index processors are combined.
Packet routing: trivial.
Broadcasting: one step with all-port (p - 1 steps with single-port) communication.
Sorting: each processor determines the rank of its data element; followed by routing.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 56
  57. 57. 3 Parallel Algorithm Complexity Review algorithm complexity and various complexity classes: • Introduce the notions of time and time/cost optimality • Derive tools for analysis, comparison, and fine-tuning Topics in This Chapter 3.1 Asymptotic Complexity 3.2 Algorithm Optimality and Efficiency 3.3 Complexity Classes 3.4 Parallelizable Tasks and the NC Class 3.5 Parallel Programming Paradigms 3.6 Solving Recurrences Spring 2006 Parallel Processing, Fundamental Concepts Slide 57
  58. 58. 3.1 Asymptotic Complexity
[Figure: three plots of f(n) against bounding functions c g(n) beyond a threshold n0, illustrating f(n) = O(g(n)), f(n) = Ω(g(n)), and f(n) = Θ(g(n)).]
Fig. 3.1 Graphical representation of the notions of asymptotic complexity.
Examples: 3n log n = O(n^2);  ½ n log2 n = Ω(n);  2000 n^2 = Θ(n^2)
Spring 2006 Parallel Processing, Fundamental Concepts Slide 58
  59. 59. Little Oh, Big Oh, and Their Buddies
Notation         Growth rate             Example of use
f(n) = o(g(n))   strictly less than      T(n) = cn^2 + o(n^2)
f(n) = O(g(n))   no greater than         T(n, m) = O(n log n + m)
f(n) = Θ(g(n))   the same as             T(n) = Θ(n log n)
f(n) = Ω(g(n))   no less than            T(n) = Ω(√n)
f(n) = ω(g(n))   strictly greater than   T(n) = ω(log n)
Spring 2006 Parallel Processing, Fundamental Concepts Slide 59
  60. 60. Growth Rates for Typical Functions
Table 3.1 Comparing the Growth Rates of Sublinear and Superlinear Functions (K = 1000, M = 1 000 000).
  Sublinear          Linear    Superlinear
  log2 n    n^1/2    n         n log2 n    n^3/2
  --------  -------  --------  ---------   --------
       9        3        10         90          30
      36       10       100      3.6 K         1 K
      81       31       1 K       81 K        31 K
     169      100      10 K      1.7 M         1 M
     256      316     100 K       26 M        31 M
     361      1 K       1 M      361 M      1000 M
Table 3.3 Effect of Constants on the Growth Rates of Running Times Using Larger Time Units and Round Figures.
  n        (n/4) log2 n   n log2 n   100 n^1/2   n^3/2
  -------  ------------   --------   ---------   -------
  10          20 s          2 min       5 min       30 s
  100         15 min        1 hr       15 min      15 min
  1 K          6 hr         1 day       1 hr        9 hr
  10 K         5 day       20 day       3 hr       10 day
  100 K        2 mo         1 yr        9 hr        1 yr
  1 M          3 yr        11 yr        1 day      32 yr
Warning: Table 3.3 in the text needs corrections.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 60
  61. 61. Some Commonly Encountered Growth Rates
Notation          Class name           Notes
O(1)              Constant             Rarely practical
O(log log n)      Double-logarithmic   Sublogarithmic
O(log n)          Logarithmic
O(log^k n)        Polylogarithmic      k is a constant
O(n^a), a < 1                          e.g., O(n^1/2) or O(n^(1-ε))
O(n / log^k n)                         Still sublinear
-------------------------------------------------------------
O(n)              Linear
-------------------------------------------------------------
O(n log^k n)                           Superlinear
O(n^c), c > 1     Polynomial           e.g., O(n^(1+ε)) or O(n^3/2)
O(2^n)            Exponential          Generally intractable
O(2^(2^n))        Double-exponential   Hopeless!
Spring 2006 Parallel Processing, Fundamental Concepts Slide 61
  62. 62. 3.2 Algorithm Optimality and Efficiency
Lower bounds: theoretical arguments based on bisection width, and the like. Upper bounds: deriving/analyzing algorithms and proving them correct.
[Figure: a complexity scale from log n to n^2 with shifting lower bounds (Zak's thm Ω(log n), 1988; Ying's thm Ω(log^2 n), 1994) and improving upper bounds (Anne's alg O(n^2), 1982; Bert's alg O(n log n), 1988; Chin's alg O(n log log n), 1991; Dana's alg O(n), 1996), possibly converging on an optimal algorithm.]
Fig. 3.2 Upper and lower bounds may tighten over time.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 62
  63. 63. Some Notions of Algorithm Optimality
Time optimality (optimal algorithm, for short): T(n, p) = g(n, p), where g(n, p) is an established lower bound; n = problem size, p = number of processors.
Cost-time optimality (cost-optimal algorithm, for short): pT(n, p) = T(n, 1); i.e., redundancy = utilization = 1.
Cost-time efficiency (efficient algorithm, for short): pT(n, p) = Θ(T(n, 1)); i.e., redundancy = utilization = Θ(1).
Spring 2006 Parallel Processing, Fundamental Concepts Slide 63
  64. 64. Beware of Comparing Step Counts
[Figure: a solution taking 4 steps on machine or algorithm A vs. 20 steps on machine or algorithm B.]
For example, one algorithm may need 20 GFLOP, another 4 GFLOP (but float division is a factor of 10 slower than float multiplication).
Fig. 3.2 Five times fewer steps does not necessarily mean five times faster.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 64
  65. 65. 3.3 Complexity Classes
Conceptual view of the P, NP, NP-complete, and NP-hard classes:
[Diagram: P (tractable, polynomial) inside NP (nondeterministic polynomial); NP-complete problems (e.g., the subset sum problem) within NP; NP-hard problems (intractable?) beyond; the open question "P = NP?".]
A more complete view of complexity classes:
[Diagram: P (tractable) inside NP and Co-NP, with NP-complete and Co-NP-complete subsets, all inside PSPACE with PSPACE-complete problems; exponential time (intractable problems) beyond.]
Spring 2006 Parallel Processing, Fundamental Concepts Slide 65
  66. 66. Some NP-Complete Problems Subset sum problem: Given a set of n integers and a target sum s, determine if a subset of the integers adds up to s. Satisfiability: Is there an assignment of values to variables in a product-of-sums Boolean expression that makes it true? (Is in NP even if each OR term is restricted to have exactly three literals) Circuit satisfiability: Is there an assignment of 0s and 1s to inputs of a logic circuit that would make the circuit output 1? Hamiltonian cycle: Does an arbitrary graph contain a cycle that goes through all of its nodes? Traveling salesman: Find a lowest-cost or shortest-distance tour of a number of cities, given travel costs or distances. Spring 2006 Parallel Processing, Fundamental Concepts Slide 66
  67. 67. 3.4 Parallelizable Tasks and the NC Class
NC (Nick's class): the subset of problems in P for which there exist parallel algorithms using p = n^c processors (polynomially many) that run in O(log^k n) time (polylog time); these are the "efficiently" parallelizable problems.
P-complete problem: given a logic circuit with known inputs, determine its output (circuit value problem).
Open question: NC = P?
Fig. 3.4 A conceptual view of complexity classes and their relationships.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 67
  68. 68. 3.5 Parallel Programming Paradigms
Divide and conquer: decompose a problem of size n into smaller problems; solve the subproblems independently; combine the subproblem results into the final answer. T(n) = Td(n) + Ts + Tc(n), where Td is the decompose time, Ts the time to solve in parallel, and Tc the combine time.
Randomization: when it is impossible or difficult to decompose a large problem into subproblems with equal solution times, one might use random decisions that lead to good results with very high probability. Example: sorting with random sampling. Other forms: random search, control randomization, symmetry breaking.
Approximation: iterative numerical methods may use approximation to arrive at solution(s). Example: solving linear systems using Jacobi relaxation. Under proper conditions, the iterations converge to the correct solutions; more iterations yield greater accuracy.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 68
  69. 69. 3.6 Solving Recurrences
f(n) = f(n - 1) + n   {rewrite f(n - 1) as f((n - 1) - 1) + n - 1}
     = f(n - 2) + n - 1 + n
     = f(n - 3) + n - 2 + n - 1 + n
     ...
     = f(1) + 2 + 3 + . . . + n - 1 + n
     = n(n + 1)/2 - 1 = Θ(n^2)
This method is known as unrolling.
f(n) = f(n/2) + 1   {rewrite f(n/2) as f((n/2)/2) + 1}
     = f(n/4) + 1 + 1
     = f(n/8) + 1 + 1 + 1
     ...
     = f(n/n) + 1 + 1 + 1 + . . . + 1   {log2 n times}
     = log2 n = Θ(log n)
Spring 2006 Parallel Processing, Fundamental Concepts Slide 69
  70. 70. More Examples of Recurrence Unrolling
f(n) = 2f(n/2) + 1
     = 4f(n/4) + 2 + 1
     = 8f(n/8) + 4 + 2 + 1
     ...
     = n f(n/n) + n/2 + . . . + 4 + 2 + 1
     = n - 1 = Θ(n)
f(n) = f(n/2) + n
     = f(n/4) + n/2 + n
     = f(n/8) + n/4 + n/2 + n
     ...
     = f(n/n) + 2 + 4 + . . . + n/4 + n/2 + n
     = 2n - 2 = Θ(n)
Solution via guessing: guess f(n) = Θ(n) = cn + g(n); then cn + g(n) = cn/2 + g(n/2) + n, so c = 2 and g(n) = g(n/2).
Spring 2006 Parallel Processing, Fundamental Concepts Slide 70
  71. 71. Still More Examples of Unrolling
f(n) = 2f(n/2) + n
     = 4f(n/4) + n + n
     = 8f(n/8) + n + n + n
     ...
     = n f(n/n) + n + n + n + . . . + n   {log2 n times}
     = n log2 n = Θ(n log n)
Alternate solution method: f(n)/n = f(n/2)/(n/2) + 1; let f(n)/n = g(n); then g(n) = g(n/2) + 1 = log2 n.
f(n) = f(n/2) + log2 n
     = f(n/4) + log2(n/2) + log2 n
     = f(n/8) + log2(n/4) + log2(n/2) + log2 n
     ...
     = f(n/n) + log2 2 + log2 4 + . . . + log2(n/2) + log2 n
     = 1 + 2 + 3 + . . . + log2 n
     = log2 n (log2 n + 1)/2 = Θ(log^2 n)
Spring 2006 Parallel Processing, Fundamental Concepts Slide 71
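As a quick sanity check on the unrolled solutions above, the following sketch (mine, assuming the base case f(1) = 0 implicit in the unrollings) evaluates two of the recurrences numerically and compares them with the claimed closed forms:

```python
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def f_a(n):                 # f(n) = 2 f(n/2) + n, claimed Theta(n log n)
    return 0 if n == 1 else 2 * f_a(n // 2) + n

@lru_cache(maxsize=None)
def f_b(n):                 # f(n) = f(n/2) + log2 n, claimed Theta(log^2 n)
    return 0 if n == 1 else f_b(n // 2) + math.log2(n)

for n in (16, 256, 4096):   # powers of 2, so the halving stays exact
    print(n,
          f_a(n), n * math.log2(n),                        # exactly n log2 n
          f_b(n), math.log2(n) * (math.log2(n) + 1) / 2)   # exactly log2 n (log2 n + 1)/2
```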
  72. 72. Master Theorem for Recurrences
Theorem 3.1: Given f(n) = a f(n/b) + h(n); a, b constant, h an arbitrary function; the asymptotic solution to the recurrence is (c = logb a):
f(n) = Θ(n^c)        if h(n) = O(n^(c - ε)) for some ε > 0
f(n) = Θ(n^c log n)  if h(n) = Θ(n^c)
f(n) = Θ(h(n))       if h(n) = Ω(n^(c + ε)) for some ε > 0
Example: f(n) = 2 f(n/2) + 1; a = b = 2; c = logb a = 1; h(n) = 1 = O(n^(1 - ε)); so f(n) = Θ(n^c) = Θ(n)
Spring 2006 Parallel Processing, Fundamental Concepts Slide 72
  73. 73. Intuition Behind the Master Theorem
Theorem 3.1: Given f(n) = a f(n/b) + h(n); a, b constant, h an arbitrary function; the asymptotic solution to the recurrence is (c = logb a):
f(n) = Θ(n^c) if h(n) = O(n^(c - ε)) for some ε > 0
  f(n) = 2f(n/2) + 1 = 4f(n/4) + 2 + 1 = . . . = n f(n/n) + n/2 + . . . + 4 + 2 + 1   {the last term dominates}
f(n) = Θ(n^c log n) if h(n) = Θ(n^c)
  f(n) = 2f(n/2) + n = 4f(n/4) + n + n = . . . = n f(n/n) + n + n + n + . . . + n   {all terms are comparable}
f(n) = Θ(h(n)) if h(n) = Ω(n^(c + ε)) for some ε > 0
  f(n) = f(n/2) + n = f(n/4) + n/2 + n = . . . = f(n/n) + 2 + 4 + . . . + n/4 + n/2 + n   {the first term dominates}
Spring 2006 Parallel Processing, Fundamental Concepts Slide 73
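A small helper (my own sketch) that applies the three cases of Theorem 3.1 for the common situation h(n) = Theta(n^d), which covers all of the examples on these two slides:

```python
import math

def master_theorem(a, b, d):
    """Solve f(n) = a f(n/b) + Theta(n^d) for constants a >= 1, b > 1, d >= 0."""
    c = math.log(a, b)                   # critical exponent c = log_b a
    if d < c:
        return f"Theta(n^{c:g})"         # the recursive term dominates
    if d == c:
        return f"Theta(n^{c:g} log n)"   # balanced case
    return f"Theta(n^{d:g})"             # h(n) dominates

print(master_theorem(2, 2, 0))   # f(n) = 2 f(n/2) + 1  ->  Theta(n^1)
print(master_theorem(2, 2, 1))   # f(n) = 2 f(n/2) + n  ->  Theta(n^1 log n)
print(master_theorem(1, 2, 1))   # f(n) = f(n/2) + n    ->  Theta(n^1)
```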
  74. 74. 4 Models of Parallel Processing Expand on the taxonomy of parallel processing from Chap. 1: • Abstract models of shared and distributed memory • Differences between abstract models and real hardware Topics in This Chapter 4.1 Development of Early Models 4.2 SIMD versus MIMD Architectures 4.3 Global versus Distributed Memory 4.4 The PRAM Shared-Memory Model 4.5 Distributed-Memory or Graph Models 4.6 Circuit Model and Physical Realizations Spring 2006 Parallel Processing, Fundamental Concepts Slide 74
  75. 75. 4.1 Development of Early Models
Associative memory: parallel masked search of all words (comparand, mask, memory array with comparison logic); bit-serial implementation with RAM.
Associative processor: add more processing logic to the PEs.
Table 4.1 Entering the second half-century of associative processing
Decade  Events and Advances                  Technology           Performance
1940s   Formulation of need & concept        Relays
1950s   Emergence of cell technologies       Magnetic, Cryogenic  Mega-bit-OPS
1960s   Introduction of basic architectures  Transistors
1970s   Commercialization & applications     ICs                  Giga-bit-OPS
1980s   Focus on system/software issues      VLSI                 Tera-bit-OPS
1990s   Scalable & flexible architectures    ULSI, WSI            Peta-bit-OPS
Spring 2006 Parallel Processing, Fundamental Concepts Slide 75
  76. 76. The Flynn-Johnson Classification Revisited
[Diagram: single vs. multiple control streams crossed with single vs. multiple data streams gives SISD ("uniprocessor"), SIMD ("array processor"), MISD (rarely used), and MIMD; MIMD is further split by global vs. distributed memory and shared variables vs. message passing into GMSV ("shared-memory multiprocessor"), GMMP, DMSV ("distributed shared memory"), and DMMP ("distrib-memory multicomputer").]
Fig. 4.1 The Flynn-Johnson classification of computer systems.
Fig. 4.2 [Diagram: a single data stream, from data in to data out, processed under five different instruction streams I1-I5.]
Spring 2006 Parallel Processing, Fundamental Concepts Slide 76
  77. 77. 4.2 SIMD versus MIMD Architectures
Most early parallel machines had SIMD designs: attractive to have skeleton processors (PEs); eventually, many processors per chip; high development cost for custom chips; MSIMD and SPMD variants.
[SIMD timeline, 1960-2010: ILLIAC IV, DAP, Goodyear MPP, TMC CM-2, MasPar MP-1, Clearspeed array coprocessor.]
Most modern parallel machines have MIMD designs: COTS components (CPU chips and switches); MPP: massively or moderately parallel?; tightly coupled versus loosely coupled; explicit message passing versus shared memory.
Network-based NOWs and COWs: networks/clusters of workstations. Grid computing. Vision: plug into wall outlets for computing power.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 77
  78. 78. 4.3 Global versus Distributed Memory
[Figure: processors 0..p-1 connected through a processor-to-processor network and, via a processor-to-memory network, to memory modules 0..m-1, plus parallel I/O.]
Processor-to-memory network options: crossbar (expensive), bus(es) (bottleneck), MIN (complex).
Fig. 4.3 A parallel processor with global memory.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 78
  79. 79. Removing the Processor-to-Memory Bottleneck
[Figure: same organization as Fig. 4.3, with a private cache inserted between each processor and the processor-to-memory network.]
Challenge: cache coherence.
Fig. 4.4 A parallel processor with global memory and processor caches.
Spring 2006 Parallel Processing, Fundamental Concepts Slide 79
