Hardware/Software Co-Design
Lecture MPSoC 1
5. Multiprocessor Architectures

5.1 Introduction
  – The focus is on the study of embedded microprocessors.
  – Multiprocessing (MP) is very common in embedded computing because it allows us to meet our performance, cost, and energy/power consumption goals.
  – Embedded MPs are often heterogeneous multiprocessors:
     • Made of several types of processors
     • They run sophisticated SW that must be carefully designed to get the most out of the multiprocessor
  – A multiprocessor is made of multiple processing elements (PEs).
    [Figure: generic multiprocessor (MP): processing elements connected through an interconnection network to memory blocks]
  – An MP consists of 3 major subsystems:
     1. Processing elements that operate on data
     2. Memory blocks that hold data values
     3. Interconnection networks between the PEs and memory
  – In any MP design we have to decide:
     • How many PEs to use
     • How much memory to provide and how to divide it up
     • How rich the interconnection between the PEs and memory should be
  – When designing an embedded multiprocessor, the choices are varied and complex.
     • Servers typically use symmetric MPs built of identical PEs and uniform memory, which simplifies programming the machine
     • But embedded-system designers are willing to trade off some programming complexity for cost, performance, and energy/power, which opens up additional design variables
  – We can vary the types of PEs; they do not have to be of the same type:
     • Different types of CPUs
     • Non-programmable PEs that perform only one function
  – We can use memory blocks of different sizes.
     • We also do not have to require that every PE access all memory: private memories can be shared by only a few PEs, so each memory's performance is optimized for the units that use it
  – We can use specialized interconnection networks that provide only certain connections.
  – Embedded MPs:
     • Make use of SIMD parallelism techniques
     • But MIMD architectures are the dominant mode of parallel machines in embedded computing
     • They tend to have heterogeneous (varied) PEs, while scientific MPs tend to be homogeneous parallel machines (copies of the same type of PE)
5.2 Why Embedded Multiprocessors?
  – MPs are commonly used for scientific and business servers, so why do we need them in embedded computing?
     • Because many embedded systems actually have to support huge amounts of computation, and the best way to meet those demands is to use MPs
     • This is particularly true when we must meet real-time constraints while also limiting power consumption
  – Embedded MPs face more constraints than scientific processors do. Both intend to deliver high performance, but embedded systems must do more:
     • They must provide real-time performance that is predictable
     • They often run at low energy and power levels
     • They have to be cost-effective (i.e., provide high performance without using excessive amounts of HW)
  – The rigorous demands of embedded computing push us toward several design techniques:
     • Heterogeneous multiprocessors are often more energy-efficient and cost-effective than symmetric multiprocessors
     • Heterogeneous memory systems improve real-time performance
     • Networks-on-chip (NoCs) support heterogeneous architectures
5.2.1 Requirements on Embedded Systems
  – Example: computation in cellular telephones
     • A cellular telephone must perform a variety of functions that are basic to telephony:
        » Compute and check error-correction codes
        » Perform voice compression and decompression
        » Respond to the protocol that governs communication with the cellular network
     • Furthermore, modern cell phones must perform a variety of other functions that are required by regulations or demanded by the marketplace:
        » In the US, cell phones must keep track of their position in case the user must be located for emergency services; a GPS receiver is often used to find the phone's position
        » Many cell phones play MP3 audio and also use MIDI or other methods to play music for ring tones
        » High-end cell phones provide cameras for still pictures and video
        » Cell phones may download application code from the network
  – Example: computation in video cameras
     • Video compression requires a great deal of computation, even for small images
     • Most video compression systems combine 3 basic methods to compress video:
        » Lossless compression reduces the size of the representation of the video data stream
        » The discrete cosine transform (DCT) helps quantize the images and reduces the size of the video stream by lossy encoding
        » Motion estimation and compensation allow the contents of one frame to be described in terms of motion from another frame
     • Of these 3 methods, motion estimation is the most computationally intensive:
        » Even an efficient motion estimation algorithm must perform a 16×16 correlation at several points in the video frame, and this must be done for the entire frame
        » A QCIF frame, commonly used in cell phones, has 176×144 pixels, which divides into 11×9 of these 16×16 macroblocks for motion estimation
        » If we perform one correlation for each macroblock, we have to perform 11×9×16×16 = 25,344 pixel comparisons
        » All these calculations must be done on almost every frame, at a rate of 15 or 30 frames/second!
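The macroblock arithmetic above is easy to verify with a short script (a sketch; the per-second figure uses the 30 frames/s case):

```python
# Motion-estimation cost for a QCIF frame (176x144 pixels),
# divided into 16x16 macroblocks for motion estimation.
frame_w, frame_h = 176, 144
mb = 16  # macroblock is mb x mb pixels

blocks_x = frame_w // mb   # 11 macroblocks across
blocks_y = frame_h // mb   # 9 macroblocks down

# One full correlation per macroblock touches mb*mb pixels.
comparisons_per_frame = blocks_x * blocks_y * mb * mb
print(comparisons_per_frame)        # 25344 pixel comparisons per frame

fps = 30  # 15 or 30 frames/second in practice
print(comparisons_per_frame * fps)  # 760320 comparisons per second
```

Note that this counts only one correlation per macroblock; a real search evaluates each macroblock at several candidate positions, multiplying the cost further.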
     • The DCT operator is also computationally intensive:
        » Even efficient algorithms require a large number of multiplications to perform the 8×8 DCT that is commonly used in video and image compression
        » For example, the [Feig and Winograd] algorithm uses 94 multiplications and 454 additions to perform an 8×8 2-D DCT
        » This amounts to 148,896 multiplications per frame for a frame with 1,584 blocks
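A quick check of the per-frame DCT cost (a sketch; the 352×288 CIF frame size is an inference from the 1,584-block figure, since 352×288/64 = 1,584, and is not stated on the slide):

```python
# Cost of the Feig-Winograd 8x8 2-D DCT across a whole frame.
mults_per_block = 94     # multiplications per 8x8 block [Feig and Winograd]
adds_per_block = 454     # additions per 8x8 block

blocks_per_frame = 1584  # matches a 352x288 (CIF) frame: 352*288/64 = 1584

print(mults_per_block * blocks_per_frame)  # 148896 multiplications per frame
print(adds_per_block * blocks_per_frame)   # 719136 additions per frame
```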
5.2.2 Performance and Energy
  – Many embedded applications need lots of raw processing performance, but that is not enough: those computations have to be performed efficiently.
  – [Austin et al. 2004] posed the embedded-system performance problem as "mobile supercomputing":
     • Today's PDAs/cell phones already perform a great deal of what was once considered to require large processors:
        » Speech recognition
        » Video compression and recognition
        » High-resolution graphics
        » High-bandwidth wireless communication
  – [Austin et al.] estimate that a mobile supercomputing workload would require about 10,000 SPECint of performance, about 16× that provided by a 2 GHz Intel Pentium IV processor.
  – In the mobile environment, all this computation must be performed at very low energy: battery capacity is growing at only 5% per year.
  – Given that today's highest-performance batteries have an energy density close to that of TNT, we may be close to the amount of energy that people are willing to carry with them.
  – [Mudge et al.] estimate that to power the mobile supercomputer with a battery for 5 days, with the device in use 20% of the time, it must consume no more than 74 mW.
  – Unfortunately, general-purpose processors do not meet these trends:
     • Moore's law dictates that transistor counts double every 18 months, so circuits can run faster; if we could use all of that potential speedup, we could meet the 10,000 SPECint performance target
     • But the measured performance of commercial processors is not keeping up with the predicted trends
     • Traditional optimizations (pipelining, instruction-level parallelism), which previously helped designers capture Moore's law, are becoming less effective
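The 74 mW figure can be sanity-checked with back-of-the-envelope arithmetic (a sketch; the ~1.8 Wh battery capacity falls out of the numbers and is an inference, not a value given on the slides):

```python
# 5 days of battery life with the device active 20% of the time,
# at the [Mudge et al.] budget of 74 mW while active.
days = 5
duty_cycle = 0.20
active_seconds = days * 24 * 3600 * duty_cycle  # 86,400 s of active use

power_w = 0.074                 # 74 mW budget
energy_j = power_w * active_seconds
print(round(energy_j))          # ~6394 J total
print(round(energy_j / 3600, 2))  # ~1.78 Wh, i.e. a small phone battery
```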
[Figure: performance trends for desktop processors, from Austin et al., IEEE Computer Society]
  – [Mudge et al.] show that power consumption is getting worse: desktop processors consume more power with every new generation, yet we need to reduce the processor's energy consumption to use it in a mobile supercomputer.
  – Breaking away from these trends requires taking advantage of the characteristics of the problem:
     • Adding units that are tuned to the core operations we need to perform, and
     • Eliminating HW that does not directly contribute to performance for this application
  – By designing HW that meets its performance goals efficiently, we reduce the system's power consumption.
[Figure: power consumption trends for desktop processors, from Austin et al.]
  – One key advantage that embedded-system architects can leverage is task-level parallelism:
     • Many embedded applications divide neatly into several tasks or phases that communicate with each other, a natural and easily exploitable source of parallelism
     • Desktop processors rely on instruction-level parallelism (ILP) to improve performance, but only a small amount of ILP is available in most programs
     • We can build custom multiprocessor architectures that reflect the task-level parallelism available in the application, and meet performance targets at much lower cost and with much less energy
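To make task-level parallelism concrete (an illustrative sketch, not taken from the slides), here is a two-stage processing chain whose stages communicate through queues, the same structure a heterogeneous MP would map onto separate PEs:

```python
import threading
import queue

# Two-stage pipeline: stage 1 "captures" samples, stage 2 "encodes" them.
# On an embedded MP, each stage would run on its own processing element.
raw = queue.Queue()
encoded = queue.Queue()

def capture(n):
    for i in range(n):
        raw.put(i)          # stand-in for sensor data
    raw.put(None)           # end-of-stream marker

def encode():
    while (x := raw.get()) is not None:
        encoded.put(x * 2)  # stand-in for real compression work
    encoded.put(None)

t1 = threading.Thread(target=capture, args=(5,))
t2 = threading.Thread(target=encode)
t1.start(); t2.start()
t1.join(); t2.join()

out = []
while (x := encoded.get()) is not None:
    out.append(x)
print(out)  # [0, 2, 4, 6, 8]
```

The queues play the role of the interconnect; each thread plays the role of a PE running one task.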
5.2.3 Specialization and Multiprocessors
  – It is the combination of high performance, low power, and real-time requirements that drives us to use multiprocessors (MPs), and these requirements lead us further toward heterogeneous processors, which contrast starkly with the symmetric multiprocessors used for scientific computation.
  – Multiprocessing vs. uniprocessing:
     • Even if we build a multiprocessor out of several copies of the same type of CPU, we may end up with a more efficient system than if we used a uniprocessor
     • The manufacturing cost of a microprocessor is a nonlinear function of clock speed: customers pay considerably more for modest increases in clock speed
  – Real time and multiprocessing:
     • Real-time requirements also lead to multiprocessing
     • When we put several real-time processes on the same CPU, they compete for cycles, and we cannot use 100% of the CPU if we want to meet real-time deadlines
     • Furthermore, we must pay for those reserved cycles at the nonlinear rate of a higher clock speed
  – Multiprocessing and accelerators:
     • The next step beyond symmetric multiprocessors is heterogeneous multiprocessors
     • We can specialize all aspects of the multiprocessor: the PEs, the memory, and the interconnection network
     • Specialization understandably leads to lower power consumption; perhaps less intuitively, it can also improve real-time behavior
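The point that we cannot reserve 100% of a CPU for real-time processes can be made concrete with the classic Liu and Layland rate-monotonic utilization bound (not cited on the slide; added here for illustration): n periodic tasks are guaranteed schedulable only if total utilization stays below n(2^(1/n) - 1):

```python
# Liu & Layland bound: n periodic tasks under rate-monotonic
# scheduling are guaranteed to meet all deadlines if total CPU
# utilization stays below n * (2**(1/n) - 1).
def rm_bound(n: int) -> float:
    return n * (2 ** (1 / n) - 1)

for n in (1, 2, 5, 100):
    print(n, round(rm_bound(n), 3))
# The bound starts at 1.0 for a single task and falls toward
# ln(2) ~ 0.693 as n grows, so a large task set can only be
# guaranteed roughly 69% of the CPU.
```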
  – Specialization: the following parts of embedded systems lend themselves to specialized implementations:
     • Some operations, particularly those defined by standards, are not likely to change
        » The 8×8 DCT, for example, has become widely used well beyond its original function in JPEG
        » Given the frequency and variety of its uses, it is worthwhile to optimize not just the DCT, but in particular its 8×8 form
     • Some functions require operations that do not map well onto a CPU's data operations
        » The mismatch may be due to several reasons: for instance, bit-level operations are difficult to perform efficiently on some CPUs, or the operations may require too many registers
        » We can design either a specialized CPU or a special-purpose HW unit to perform these functions
     • Highly responsive I/O operations may be best performed by an accelerator with an attached I/O unit
        » If data must be read, processed, and written to meet a tight deadline (for example, in engine control), a dedicated HW unit may be more efficient than a CPU
  – Cost vs. power:
     • Heterogeneity reduces power consumption because it removes unnecessary HW; the additional HW required to generalize functions adds to both dynamic and static power dissipation
     • Excessive specialization can add so much communication cost that the energy gain from specialization is lost
     • However, specializing the right functions can lead to big energy savings
  – Real-time performance:
     • In addition to reducing costs, using multiple CPUs can help with real-time performance
     • We can often meet deadlines and remain responsive to interaction much more easily when we put time-critical processes on separate CPUs
     • Specialized memory systems and interconnects also help make the response time of a process more predictable
5.2.4 Flexibility and Efficiency
  – Use HW and SW together:
     • Many embedded systems perform complex functions that would be too difficult to implement entirely in HW; translating all the relevant standards to HW may be too time-consuming and expensive
     • Multiple standards encourage SW implementation: for example, a device must be able to play audio data in many different formats (MP3, Dolby Digital, Ogg Vorbis, etc.)
     • These standards perform some similar operations but cannot easily be collapsed into a few key HW units
     • The reasonable choice: processors running SW, aided by a few key HW units
5.3 Multiprocessor Design Techniques
  – We discuss embedded multiprocessor design methodologies in detail.

5.3.1 Multiprocessor Design Methodologies
  – The design of embedded multiprocessors is data-driven and relies on analyzing programs.
     • We call these programs the workload, in contrast with the term benchmark commonly used in computer architecture
     • Embedded systems operate under real-time constraints as well as overall throughput requirements, so we often use a sample set of applications to evaluate overall system performance
     • These programs may not be the exact code run on the final system, and the final system may have many modes, but using workloads is still useful and very important
  – Benchmarks are generally treated as independent entities, while embedded multiprocessor design requires evaluating the interaction between programs.
     • The workload, in fact, includes data inputs as well as the programs themselves
  – Multiprocessor-based embedded system design methodology (figure):
    Workload → platform-independent optimizations → platform-independent measurements (operation counts, etc.) → platform design (PE, memory, interconnect design) → platform-dependent optimizations and measurements → implementation
  – This workflow includes both the design of the HW platform and the SW that runs on the platform.
     • Before the workload is used to evaluate the architecture, it generally must be put into good shape with platform-independent optimizations
     • Many programs are not written with embedded-platform restrictions, real-time performance, or low power in mind
     • Using programs designed to work in non-real-time mode with unlimited main memory can often lead to bad architectural decisions
  – Once we have the workload programs in shape, we can perform simple experiments before defining an architecture, to obtain platform-independent measurements.
     • Simple measurements, such as dynamic instruction count and data access patterns, provide valuable information about the nature of the workload
     • Using these platform-independent metrics, we can identify an initial candidate architecture
     • If the platform relies on static allocation, we may need to map the workload programs onto the platform; we then measure platform-dependent characteristics
  – Based on these characteristics, we evaluate the architecture, using both numerical measures and judgment.
     • If the platform is satisfactory, we are finished; if not, we modify the platform and make a new round of measurements
     • Along the way, we need to design the components of the multiprocessor: the processing elements, the memory system, and the interconnects
  – Once we are satisfied with the platform, we can map the SW onto it.
     • During that process we may be aided by libraries of code and compilers
     • Most of the optimizations performed at this phase should be platform-specific: we must allocate operations to processing elements, data to memories, and communications to links
     • We now also have to determine when things happen
5.3.2 Multiprocessor Modeling and Simulation
  – [Cai & Gajski] defined a hierarchy of modeling methods for digital systems and compared their characteristics:

    Model                      | Communication time | Computation time | Communication scheme | PE interface
    Specification              | No                 | No               | Variable channel     | No PEs
    Component assembly         | No                 | Approximate      | Variable channel     | Abstract
    Bus arbitration            | Approximate        | Approximate      | Abstract bus channel | Abstract
    Bus functional             | Cycle-accurate     | Approximate      | Protocol bus channel | Abstract
    Cycle-accurate computation | Approximate        | Cycle-accurate   | Abstract bus channel | Pin-accurate
    Implementation             | Cycle-accurate     | Cycle-accurate   | Wires                | Pin-accurate
  – Most multiprocessor simulators are systems of communicating simulators:
     • The component simulators represent CPUs, memory elements, and routing networks
     • The multiprocessor simulator itself negotiates communication between those component simulators
     • We can use the techniques of parallel computing to build the multiprocessor simulator
     • Each component simulator is a process, both in the simulation metaphor and literally as a process running on the host CPU's operating system
  – Consider the simulation of a write from a PE to an ME (memory element):
     • The PE and ME are each component simulators that run as processes on the host CPU
     • The write operation requires a message from the PE simulator to the ME simulator:
       PE simulator → Message(write address, data to be written) → ME simulator
  – The multiprocessor simulator must route that message by determining which simulation process is responsible for the address of the write operation.
     • After performing the required mapping, it sends a message to the ME simulator, asking it to perform the write
  – Most multiprocessor simulators assume homogeneous MP architectures and use that assumption to build simulation shortcuts.
     • However, many embedded MPs are heterogeneous and therefore cannot use these optimizations
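A minimal sketch of this routing step (illustrative class and method names; the slides do not give concrete simulator interfaces):

```python
# Toy multiprocessor simulator: a router maps a write's address to
# the memory-element (ME) simulator responsible for that range, then
# forwards the message, mirroring the PE -> router -> ME flow above.
class MemSim:
    def __init__(self, base, size):
        self.base, self.mem = base, [0] * size

    def write(self, addr, data):
        self.mem[addr - self.base] = data

class Router:
    def __init__(self, mems):          # mems: list of MemSim instances
        self.mems = mems

    def route_write(self, addr, data):
        for me in self.mems:           # find the ME owning this address
            if me.base <= addr < me.base + len(me.mem):
                me.write(addr, data)
                return
        raise ValueError(f"unmapped address {addr:#x}")

me0, me1 = MemSim(0x0000, 256), MemSim(0x1000, 256)
router = Router([me0, me1])
router.route_write(0x1004, 42)         # the PE's write message
print(me1.mem[4])                      # 42
```

A heterogeneous simulator cannot assume all MEs have the same size or layout, which is why the per-ME address mapping shown here matters.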
  – SystemC (http://www.systemc.org) is a widely used framework for transaction-level design of heterogeneous multiprocessors.
     • It is designed to facilitate the simulation of heterogeneous architectures built from combinations of hardwired blocks and programmable processors
     • SystemC is built on top of C++: it defines a set of classes used to describe the system being simulated, and a simulation manager guides the execution of the simulator