Parallel architecture



  1. Parallel Architecture
     Dr. Doug L. Hoffman
     Computer Science 330, Spring 2002
  2. Parallel Computers
     - Definition: “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.”
     - Questions about parallel computers:
       - How large a collection?
       - How powerful are the processing elements?
       - How do they cooperate and communicate?
       - How are data transmitted?
       - What type of interconnection?
       - What are the HW and SW primitives for the programmer?
       - Does it translate into performance?
  3. Parallel Processors “Religion”
     - The dream of computer architects since 1960: replicate processors to add performance, vs. designing a faster processor
     - Led to innovative organizations tied to particular programming models, since “uniprocessors can’t keep going”
       - e.g., claims that uniprocessors must stop getting faster due to the speed-of-light limit: 1972, ..., 1989
       - Borders on religious fervor: you must believe!
       - Fervor damped somewhat when 1990s companies went out of business: Thinking Machines, Kendall Square, ...
     - The argument now is the “pull” of the opportunity of scalable performance, not the “push” of a uniprocessor performance plateau
  4. Opportunities: Scientific Computing
     - Nearly unlimited demand (Grand Challenge problems):

       App                      Perf (GFLOPS)   Memory (GB)
       48-hour weather          0.1             0.1
       72-hour weather          3               1
       Pharmaceutical design    100             10
       Global Change, Genome    1000            1000

     - Successes in some real industries:
       - Petroleum: reservoir modeling
       - Automotive: crash simulation, drag analysis, engines
       - Aeronautics: airflow analysis, engines, structural mechanics
       - Pharmaceuticals: molecular modeling
       - Entertainment: full-length movies (“Toy Story”)
  5. Opportunities: Commercial Computing
     - Throughput (transactions per minute) at different processor counts (1996):

       Processors                   1      4      8      16     32     64     112
       IBM RS6000 (tpm)             735    1438   3119
       IBM RS6000 (speedup)         1.00   1.96   4.24
       Tandem Himalaya (tpm)                             3043   6067   12021  20918
       Tandem Himalaya (speedup)                         1.00   1.99   3.95   6.87

     - IBM takes a performance hit going 1 => 4 processors, scales well 4 => 8
     - Tandem scales: 112/16 = 7.0
     - Others: file servers, electronic CAD simulation (multiple processes), WWW search engines
  6. What Level of Parallelism?
     - Bit-level parallelism: 1970 to ~1985
       - 4-bit, 8-bit, 16-bit, 32-bit microprocessors
     - Instruction-level parallelism (ILP): 1985 through today
       - Pipelining
       - Superscalar
       - VLIW
       - Out-of-order execution
       - Limits to the benefits of ILP?
     - Process-level or thread-level parallelism; mainstream for general-purpose computing?
       - Servers are parallel
       - High-end desktop dual-processor PC soon??
  7. Parallel Architecture
     - Parallel architecture extends traditional computer architecture with a communication architecture:
       - abstractions (HW/SW interface)
       - organizational structure to realize the abstraction efficiently
  8. Fundamental Issues
     - Three issues characterize parallel machines:
       1) Naming
       2) Synchronization
       3) Latency and bandwidth
  9. Parallel Framework
     - Layers:
       - Programming model:
         - Multiprogramming: lots of jobs, no communication
         - Shared address space: communicate via memory
         - Message passing: send and receive messages
         - Data parallel: several agents operate on several data sets simultaneously and then exchange information globally and simultaneously (shared or message passing)
       - Communication abstraction:
         - Shared address space: e.g., load, store, atomic swap
         - Message passing: e.g., send, receive library calls
         - Debate over this topic (ease of programming, scaling) => many hardware designs are 1:1 with a programming model
  10. Shared Address/Memory Multiprocessor Model
     - Communicate via load and store
       - Oldest and most popular model
     - Based on timesharing: processes on multiple processors vs. sharing a single processor
     - Process: a virtual address space and one thread of control
       - Multiple processes can overlap (share), but ALL threads share a process’s address space
     - Writes to the shared address space by one thread are visible to reads by other threads
       - Usual model: shared code, private stack, some shared heap, some private heap
     - (A minimal threaded sketch of this model follows below.)
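Not from the original slides: a minimal sketch in C of the shared-address-space model, where one thread's store becomes visible to another thread's load through ordinary memory. POSIX threads and the mutex are assumptions chosen for illustration; the slides name no specific API.

```c
/* Hypothetical illustration: two threads communicating through a shared
 * variable via plain loads and stores, as in the shared address space
 * model.  The mutex provides the write serialization discussed later. */
#include <pthread.h>
#include <stdio.h>

static int shared_value = 0;            /* lives in the shared address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *producer(void *arg)
{
    pthread_mutex_lock(&lock);
    shared_value = 42;                  /* a store by one thread ...           */
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *consumer(void *arg)
{
    pthread_mutex_lock(&lock);
    int v = shared_value;               /* ... is visible to a load by another */
    pthread_mutex_unlock(&lock);
    printf("consumer read %d\n", v);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);              /* join first so the example is deterministic */
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(c, NULL);
    return 0;
}
```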
  11. Example: Small-Scale MP Designs
     - Memory: centralized, with uniform memory access time (“UMA”), bus interconnect, and I/O
     - Examples: Sun Enterprise 6000, SGI Challenge, Intel SystemPro
  12. SMP Interconnect
     - Connects processors to memory AND to I/O
     - Bus based: all memory locations have equal access time, so SMP = “Symmetric MP”
       - Shared bandwidth is a limit as processors and I/O are added
       - (see Chapter 1, Figs 1-18/19, pages 42-43 of [CSG96])
     - Crossbar: expensive to expand
     - Multistage network: less expensive to expand than a crossbar, with more bandwidth
     - “Dance hall” designs: all processors on the left, all memories on the right
  13. Small-Scale — Shared Memory
     - Caches serve to:
       - Increase bandwidth versus bus/memory
       - Reduce latency of access
       - Valuable for both private data and shared data
     - What about cache consistency?
  14. What Does Coherency Mean?
     - Informally:
       - “Any read must return the most recent write”
       - Too strict and too difficult to implement
     - Better:
       - “Any write must eventually be seen by a read”
       - All writes are seen in proper order (“serialization”)
     - Two rules to ensure this:
       - “If P writes x and P1 reads it, P’s write will be seen by P1 if the read and write are sufficiently far apart”
       - Writes to a single location are serialized: seen in one order
         - The latest write will be seen
         - Otherwise a processor could see writes in an illogical order (an older value after a newer one)
     - (A small sketch of the serialization rule follows below.)
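Not in the original lecture: a tiny C sketch of the write serialization rule. C11 atomics (well after these 2002 slides) are used only to make the concurrent accesses to a single location well defined; the point is that successive writes to one location must be observed in one order, never "backward".

```c
/* Hypothetical illustration of write serialization: the writer stores
 * increasing values to one location; any reader must see those values in
 * non-decreasing order -- never an older value after a newer one. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int x = 0;

static void *writer(void *arg)
{
    for (int v = 1; v <= 1000; v++)
        atomic_store(&x, v);            /* writes to a single location, in order */
    return NULL;
}

static void *reader(void *arg)
{
    int last = 0;
    for (int i = 0; i < 1000000; i++) {
        int now = atomic_load(&x);
        if (now < last)                 /* would violate write serialization */
            printf("saw writes out of order!\n");
        last = now;
    }
    return NULL;
}

int main(void)
{
    pthread_t w, r;
    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&r, NULL, reader, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}
```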
  15. Potential HW Coherency Solutions
     - Snooping solution (snoopy bus):
       - Send all requests for data to all processors
       - Processors snoop to see if they have a copy and respond accordingly
       - Requires broadcast, since caching information is at the processors
       - Works well with a bus (natural broadcast medium)
       - Dominates for small-scale machines (most of the market)
     - Directory-based schemes:
       - Keep track of what is being shared in one centralized place
       - Distributed memory => distributed directory for scalability (avoids bottlenecks)
       - Send point-to-point requests to processors via the network
       - Scales better than snooping
       - Actually existed BEFORE snooping-based schemes
     - (A sketch of a directory entry appears below.)
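Not in the original slides: a minimal C sketch of what a directory entry might hold in a directory-based protocol. The state names and the bit-vector-of-sharers representation are common textbook choices assumed here, not details taken from the lecture.

```c
/* Hypothetical directory entry for one memory block in a directory-based
 * coherence protocol.  One entry is kept per block at the block's home
 * node; the sharer bit vector records which processors may hold a cached
 * copy, so invalidations can be sent point-to-point instead of broadcast. */
#include <stdint.h>

enum dir_state {
    DIR_UNCACHED,   /* no cache holds the block             */
    DIR_SHARED,     /* one or more caches hold it read-only */
    DIR_EXCLUSIVE   /* exactly one cache holds it writable  */
};

struct dir_entry {
    enum dir_state state;
    uint64_t sharers;   /* bit i set => processor i may have a copy (up to 64 nodes) */
    int owner;          /* valid only in DIR_EXCLUSIVE: which processor owns it      */
};

/* On a write miss from processor p, the home node would consult the entry,
 * send invalidations to every bit set in 'sharers', then hand ownership to
 * p -- the point-to-point traffic the slide describes. */
```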
  16. Large-Scale MP Designs
     - Memory: distributed, with non-uniform memory access time (“NUMA”) and a scalable interconnect (distributed memory)
     - [Figure: distributed-memory node diagram with example access latencies of 1 cycle, 40 cycles, and 100 cycles; labeled “Low Latency, High Reliability”]
  17. Shared Address Model Summary
     - Each processor can name every physical location in the machine
     - Each process can name all the data it shares with other processes
     - Data transfer via load and store
     - Data size: byte, word, ... or cache blocks
     - Uses virtual memory to map virtual addresses to local or remote physical addresses
     - The memory hierarchy model applies: communication now moves data to the local processor’s cache (just as a load moves data from memory to cache)
       - Latency, bandwidth (cache block?), scalability when communicating?
  18. Message Passing Model
     - Whole computers (CPU, memory, I/O devices) communicate via explicit I/O operations
       - Essentially NUMA, but integrated at the I/O devices rather than the memory system
     - Send specifies a local buffer + the receiving process on a remote computer
     - Receive specifies the sending process on a remote computer + a local buffer to place the data
       - Usually the send includes a process tag and the receive has a rule on tags: match one, match any
       - Synchronization: when the send completes, when the buffer is free, when the request is accepted, the receive waits for the send
     - Send + receive => memory-to-memory copy, where each side supplies a local address, AND pairwise synchronization occurs!
     - (A minimal send/receive sketch follows below.)
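Not from the slides: a minimal MPI sketch (MPI is one concrete message passing library, used here as an assumption; the slide itself names no API) showing a send that names a destination and tag, and a receive that names the source, tag, and a local buffer.

```c
/* Hypothetical two-process example: rank 0 sends an integer to rank 1.
 * Each call names the peer, a tag for matching, and a local buffer --
 * the ingredients of the send/receive model on the slide.
 * Build with an MPI compiler wrapper (e.g. mpicc), run with mpirun -np 2. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* local buffer, count, type, destination rank, tag, communicator */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status status;
        /* blocking receive: matches source rank 0 and tag 0 */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```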
  19. Message Passing Model
     - Send + receive => memory-to-memory copy, with synchronization through the OS, even on one processor
     - History of message passing:
       - Network topology was important because a node could only send to its immediate neighbors
       - Typically synchronous, blocking send & receive
       - Later, DMA with non-blocking sends; DMA for the receive into a buffer until the processor does a receive, and then the data is transferred to local memory
       - Later, SW libraries to allow arbitrary communication
     - Example: IBM SP-2, RS6000 workstations in racks
       - Network interface card has an Intel 960
       - 8x8 crossbar switch as the communication building block
       - 40 MByte/sec per link
     - (A non-blocking send/receive sketch follows below.)
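Also not from the slides: a short sketch of the later non-blocking style that the history bullet mentions, again using MPI as an assumed stand-in for the hardware/DMA mechanisms the slide describes.

```c
/* Hypothetical non-blocking variant: the send is posted, the sender can
 * keep computing while the transfer (e.g. via DMA) proceeds, and MPI_Wait
 * later confirms the buffer may be reused.  Run with mpirun -np 2. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 7;
        MPI_Isend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        /* ... overlap useful computation here while the message is in flight ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* now the send buffer may be reused */
    } else if (rank == 1) {
        MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* data has arrived in the local buffer */
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```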
  20. Communication Models
     - Shared memory:
       - Processors communicate through a shared address space
       - Easy on small-scale machines
       - Advantages:
         - Model of choice for uniprocessors and small-scale MPs
         - Ease of programming
         - Lower latency
         - Easier to use hardware-controlled caching
     - Message passing:
       - Processors have private memories and communicate via messages
       - Advantages:
         - Less hardware, easier to design
         - Focuses attention on costly non-local operations
     - Either SW model can be supported on either HW base
  21. Popular Flynn Categories (e.g., ~ RAID level for MPPs)
     - SISD (Single Instruction, Single Data)
       - Uniprocessors
     - MISD (Multiple Instruction, Single Data)
       - ???
     - SIMD (Single Instruction, Multiple Data)
       - Examples: Illiac-IV, CM-2
         - Simple programming model
         - Low overhead
         - Flexibility
         - All custom integrated circuits
     - MIMD (Multiple Instruction, Multiple Data)
       - Examples: Sun Enterprise 5000, Cray T3D, SGI Origin
         - Flexible
         - Use off-the-shelf micros
  22. Data Parallel Model
     - Operations can be performed in parallel on each element of a large regular data structure, such as an array
     - One control processor broadcasts to many PEs (see Ch. 1, Fig. 1-26, page 51 of [CSG96])
       - When computers were large, the control portion could be amortized over many replicated PEs
     - A condition flag per PE allows elements to be skipped
     - Data is distributed across each PE’s memory
     - Early 1980s VLSI => SIMD rebirth: 32 1-bit PEs + memory on a chip was the PE
     - Data parallel programming languages lay out data onto the processors
     - (An element-wise data parallel loop is sketched below.)
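Not part of the lecture: a small C sketch of the data parallel idea, element-wise work over a regular array. The OpenMP pragma is an assumption used purely as a stand-in for "apply the same operation to every element in parallel"; the slide describes SIMD hardware, not OpenMP.

```c
/* Hypothetical data parallel example: the same operation is applied to
 * every element of a regular array; iterations are independent, so they
 * may execute in parallel.  The 'if (mask[i])' test plays the role of the
 * per-PE condition flag on the slide.
 * Compile with, e.g., cc -fopenmp if OpenMP is available. */
#include <stdio.h>

#define N 1024

int main(void)
{
    double a[N], b[N];
    int mask[N];

    for (int i = 0; i < N; i++) {       /* set up inputs */
        a[i] = (double)i;
        mask[i] = (i % 2 == 0);         /* only even elements participate */
    }

    #pragma omp parallel for            /* ignored if OpenMP is not enabled */
    for (int i = 0; i < N; i++) {
        if (mask[i])                    /* per-element condition flag */
            b[i] = 2.0 * a[i] + 1.0;
        else
            b[i] = 0.0;
    }

    printf("b[10] = %f\n", b[10]);
    return 0;
}
```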
  23. Data Parallel Model
     - Vector processors have similar ISAs, but no data placement restriction
     - SIMD led to data parallel programming languages
     - Advancing VLSI led to single-chip FPUs and whole fast microprocessors (making SIMD less attractive)
     - The SIMD programming model led to the Single Program Multiple Data (SPMD) model
       - All processors execute an identical program
     - Data parallel programming languages are still useful; they do communication all at once: “bulk synchronous” phases in which all processors communicate after a global barrier
     - (An SPMD/bulk-synchronous sketch follows below.)
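Not from the slides: a minimal MPI sketch of the SPMD, bulk-synchronous pattern, with MPI again an assumed stand-in. Every process runs the same program on its own slice of data, then all communicate at once at a global synchronization point.

```c
/* Hypothetical SPMD / bulk-synchronous example: each process computes a
 * partial sum over its own portion of the data (compute phase), then all
 * processes exchange results at once with a global reduction
 * (communication phase acting as the global synchronization point).
 * Run with, e.g., mpirun -np 4. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Compute phase: every process runs this same code on its own slice. */
    double local = 0.0;
    for (int i = rank; i < 1000; i += nprocs)
        local += (double)i;

    /* Communication phase: everyone contributes at once; the collective
     * separates one bulk-synchronous phase from the next. */
    double total = 0.0;
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over all processes = %f\n", total);

    MPI_Finalize();
    return 0;
}
```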
  24. Convergence in Parallel Architecture
     - Complete computers connected to a scalable network via a communication assist
     - Different programming models place different requirements on the communication assist:
       - Shared address space: tight integration with memory to capture memory events that interact with other nodes, and to accept requests from other nodes
       - Message passing: send messages quickly and respond to incoming messages (tag match, allocate buffer, transfer data, wait for the receive to be posted)
       - Data parallel: fast global synchronization
     - High Performance Fortran is shared-memory, data parallel; the Message Passing Interface is a message passing library; both work on many machines, with different implementations
  25. Summary: Parallel Framework
     - Layers (top to bottom): Programming Model, Communication Abstraction, Interconnection SW/OS, Interconnection HW
       - Programming model:
         - Multiprogramming: lots of jobs, no communication
         - Shared address space: communicate via memory
         - Message passing: send and receive messages
         - Data parallel: several agents operate on several data sets simultaneously and then exchange information globally and simultaneously (shared or message passing)
       - Communication abstraction:
         - Shared address space: e.g., load, store, atomic swap
         - Message passing: e.g., send, receive library calls
         - Debate over this topic (ease of programming, scaling) => many hardware designs are 1:1 with a programming model
  26. Summary: Small-Scale MP Designs
     - Memory: centralized, with uniform access time (“UMA”) and bus interconnect
     - Examples: Sun Enterprise 5000, SGI Challenge, Intel SystemPro
  27. Summary
     - Caches contain all the information on the state of cached memory blocks
     - Snooping and directory protocols are similar; a bus makes snooping easier because of its broadcast (snooping => uniform memory access)
     - A directory has an extra data structure to keep track of the state of all cache blocks
     - Distributing the directory => a scalable shared-address multiprocessor => cache coherent, non-uniform memory access