Architectures and Compilers for Embedded Systems (ACES) Laboratory  Center for Embedded Computer Systems University of Cal...
Outline <ul><li>Methodology for Architectural Exploration </li></ul><ul><li>Survey of Architectural Description Languages ...
Traditional Processor-Centric Designs <ul><li>Performance driven designs </li></ul><ul><ul><li>Limitations: </li></ul></ul...
Embedded System-on-Chip (SOC) Designs <ul><li>One or few dedicated applications </li></ul><ul><ul><li>Opportunity to custo...
Embedded S-O-C Design Issues <ul><li>Technology Trends </li></ul><ul><ul><li>1G transistor chips by ~2010 (SIA Roadmap) </...
What to do with all these transistors?  <ul><li>New processor architectures </li></ul><ul><ul><ul><li>E.g., Ultra Large In...
Programmable Embedded Systems:  Boards to SOCs <ul><li>Past </li></ul><ul><ul><li>Board-level IC’s </li></ul></ul><ul><li>...
Networked Embedded System  [Courtesy: R. Gupta]
Programmable SOC Platforms <ul><li>Domain-specific </li></ul><ul><li>Parameterized Cores </li></ul><ul><li>Sample Paramete...
Why Explore Architectures? <ul><li>5.10x exe. </li></ul><ul><li>7.51x power </li></ul><ul><li>2.73x energy </li></ul>[Sour...
Philips Velocity SoC Platform
Configurable Processor Platform : Tensilica Xtensa MMU ALU Pipe Cache I/O Timer Register File Controller
Fixed Programmable SOC Template
Programmable Architectural Trends <ul><li>Recent advances in System-On-Chip Technology </li></ul><ul><ul><ul><li>customiza...
Architecture-Compiler Coupling Parameters : no, size of units no, size, ports of reg files caches memory hierarchy Archite...
Compiler-Architecture-CAD Coupling Parameters : no, size of units no, size, ports of reg files caches memory hierarchy Arc...
Programmable Arch’s: Traditional Design Flow - Application-to-architecture mapping - Early HW/SW partitioning - Ensuing ta...
Programmable Arch’s: Traditional Design Flow Issues: -- Multiple specifications Functional, IS, RT (synthesis) -- Software...
Traditional Design Flow Design Specification Hw/Sw Partitioning Off-Chip Memory Processor Core On-Chip Memory Synthesized ...
IP-Centric Design Flow <ul><li>Increasing use of IP blocks </li></ul><ul><ul><li>COTS => IP, Soft/Hard IP blocks </li></ul...
Main Bottleneck <ul><li>SOC Customization with IP Blocks </li></ul><ul><ul><li>COTS: SW tools available (already developed...
ADL-Driven Design Flow ADL Specification Verification Rapid design space exploration Quality tool-kit generation Design re...
Outline <ul><li>Methodology for Architectural Exploration </li></ul><ul><li>Survey of Architectural Description Languages ...
<ul><li>Specify architecture templates of SOCs </li></ul><ul><ul><li>Blocks/components which reside on the SOC </li></ul><...
ADL-Based SOC Codesign Flow Processor Processor ASICs Memories IFs ASICs Memories IFs Cosimulation HW SW HW/SW Partitionin...
Survey of ADLs <ul><li>Classification Based on Type of Information Captured </li></ul><ul><ul><li>Behavior-centric ADLs </...
Behavior Centric ADLs <ul><li>Primarily capture Instruction Set (IS) </li></ul><ul><ul><li>Provide programmer’s view </li>...
Structure Centric ADLs <ul><li>Provide net-list view of the architecture </li></ul><ul><ul><li>Advantages: </li></ul></ul>...
Mixed-Level ADLs <ul><li>Capture Instruction Set view </li></ul><ul><li>Capture high-level architecture view </li></ul><ul...
Survey of ADLs <ul><li>Classification Based on Type of Information Captured </li></ul><ul><ul><li>Behavior-centric ADLs </...
Synthesis-Oriented ADLs Processor Processor ASICs Memories IFs ASICs Memories IFs Cosimulation HW SW HW/SW Partitioning Sy...
Synthesis-Oriented ADLs <ul><li>MIMOLA (Univ. of Dortmund, Germany)Synthesizable HDL </li></ul><ul><ul><li>Mainly targeted...
Synthesis-Oriented ADLs <ul><li>Summary </li></ul><ul><ul><li>Synthesis and simulation tools available </li></ul></ul><ul>...
Compiler-Oriented ADLs Processor Processor ASICs Memories IFs ASICs Memories IFs Cosimulation HW SW HW/SW Partitioning Syn...
Compiler-Oriented ADLs <ul><li>nML (TU Berlin, Germany) </li></ul><ul><ul><li>Mainly targeted to DSPs and ASIPs </li></ul>...
Compiler-Oriented ADLs <ul><li>MDES (HPLabs & UIUC, USA) </li></ul><ul><ul><li>Used for design space exploration of high-p...
Compiler-Oriented ADLs <ul><li>Other Compiler-Oriented ADLs </li></ul><ul><ul><li>The FlexWare CAD system supporting compi...
Simulator-Oriented ADLs Processor Processor ASICs Memories IFs ASICs Memories IFs Cosimulation HW SW HW/SW Partitioning Sy...
Simulator-Oriented ADLs <ul><li>LISA (RWTH Aachen, Germany) </li></ul><ul><ul><li>Mainly targeted to DSPs </li></ul></ul><...
Simulator-Oriented ADLs <ul><li>Summary </li></ul><ul><ul><li>Capture both the structural and architectural aspect of the ...
Validation-Oriented ADLs Processor Processor ASICs Memories IFs ASICs Memories IFs Cosimulation HW SW HW/SW Partitioning S...
Validation-Oriented ADLs <ul><li>AIDL (Univ. of Tsukuba, Japan) </li></ul><ul><ul><li>Targeted to high-performance supersc...
Future Directions for ADLs <ul><li>Formal Verification </li></ul><ul><ul><li>Detection of pipeline conflicts (resource, da...
Outline <ul><li>Methodology for Architectural Exploration </li></ul><ul><li>Survey of Architectural Description Languages ...
Software Toolkits for Processor Cores <ul><li>SOC designers using processor cores.  </li></ul><ul><li>Major bottleneck: la...
Architecture Description Languages (ADLs) <ul><li>Objectives: </li></ul><ul><ul><li>Support automated SW toolkit generatio...
Software Tools  <ul><li>Estimators </li></ul><ul><ul><li>Code Size, Memory Requirements, Performance, Power etc. </li></ul...
Software Tools  <ul><li>Estimators </li></ul><ul><ul><li>Code Size, Memory Requirements, Performance, Power etc. </li></ul...
Compiler Issues for Embedded SOC <ul><li>Traditional ES Software </li></ul><ul><ul><li>Handcoded in assembly </li></ul></u...
Compiler as an Exploration Tool <ul><li>Analysis Phase of Compiler: Estimation </li></ul><ul><ul><ul><li>Memory size </li>...
Retargetable Compilers Processor Processor ASICs Memories IFs ASICs Memories IFs Cosimulation HW SW HW/SW Partitioning Syn...
Retargetable Compilers <ul><li>Issues: </li></ul><ul><ul><li>Produce efficient code for a wide variety of processor archit...
Compiler Flow (Front-End) Lexical Analysis Semantic Analysis <ul><li>Analysis: </li></ul><ul><ul><li>Data dependence </li>...
Compiler Flow (Back-End) Lowering :  Complex Expressions, Array Subscripts <ul><li>Pre-scheduling optimizations </li></ul>...
Compiler Flow (Back-End) <ul><li>Post-scheduling optimizations :  </li></ul><ul><ul><li>Peephole Optimizations </li></ul><...
Retargetable Compilers Survey (1) <ul><li>CHESS (using nML ADL) </li></ul><ul><ul><li>Mainly targeted to fixed-point DSPs ...
Retargetable Compilers Survey (2) <ul><li>ELCOR (using MDES ADL) </li></ul><ul><ul><li>Mainly targeted to VLIW architectur...
Retargetable Compilers Survey (3) <ul><li>Other Retargetable Compilers </li></ul><ul><ul><li>The FlexWare CAD system </li>...
Software Tools  <ul><li>Estimators </li></ul><ul><ul><li>Code Size, Memory Requirements, Performance, Power etc. </li></ul...
Simulators/Simulator Generators Processor Processor ASICs Memories IFs ASICs Memories IFs Cosimulation HW SW HW/SW Partiti...
Simulators/Simulator Generators <ul><li>Issues: </li></ul><ul><ul><li>Level of abstraction </li></ul></ul><ul><ul><ul><li>...
Simulators/Simulator Generators Survey (1) <ul><li>GENSIM/XSIM (using ISDL ADL) </li></ul><ul><ul><li>Mainly targeted to V...
Simulators/Simulator Generators Survey (2) <ul><li>LISA/S (using LISA ADL) </li></ul><ul><ul><li>Mainly targeted to DSPs <...
Simulators/Simulator Generators Survey (3) <ul><li>Other Retargetable Simulators/Simulator Generators: </li></ul><ul><ul><...
Software Tools  <ul><li>Estimators </li></ul><ul><ul><li>Code Size, Memory Requirements, Performance, Power etc. </li></ul...
ADL-driven Validation/Verification Processor Processor ASICs Memories IFs ASICs Memories IFs Cosimulation HW SW HW/SW Part...
Bottom-up Validation Approach RTL Reverse Engineering High Level Description Manual Verification Property Checking Propert...
ADL-driven Validation RTL Reverse Engineering High Level Description Manual Verification Property Checking Property Checki...
Outline <ul><li>Methodology for Architectural Exploration </li></ul><ul><li>Survey of Architectural Description Languages ...
Memory Libraries Cache SRAM PrefetchBuffer Frame Buffer EDO On-chip RD RAM SD RAM VLIW DSP ASIP Toolkit Generator Toolkit ...
System -Level Exploration Alg. spec C implementation Proced. code Cost estimation (mem,...) Perf. estimation H/S Partition...
MEMOREX:  Memory Exploration Environment System spec in C Parser, FG Generator w/ Semantics Retention Memory Disambiguatio...
Software Toolkit for the System Designer <ul><ul><li>EXPRESS  -  An Extensible, Retargetable, Instruction-Level Paralleliz...
EXPRESS: Compiler Environment for  Embedded Processors GCC + Semantics Retention Analysis Mutating  Transformations Simula...
Outline <ul><li>Methodology for Architectural Exploration </li></ul><ul><li>Survey of Architectural Description Languages ...
The DLX Example Architecture
Design Space Exploration <ul><li>Designer targets various goals (power, area, perf) </li></ul><ul><ul><li>Often conflictin...
1. Forwarding path from All_mem_Latch to A1 2. Forwarding path Mem_WB_Latch to INT 3. Both (1) and (2) 4. Forwarding path ...
1. Forwarding path All_mem_Latch to A1 2. Forwarding path Mem_WB_Latch to INT 3. Both (1) and (2) 4. Forwarding path All_m...
1. Forwarding path All_mem_Latch to A1 2. Forwarding path Mem_WB_Latch to INT 3. Both (1) and (2) 4. Forwarding path All_m...
1. Forwarding path All_mem_Latch to A1 2. Forwarding path Mem_WB_Latch to INT 3. Both (1) and (2) 4. Forwarding path All_m...
1. Forwarding path All_mem_Latch to A1 2. Forwarding path Mem_WB_Latch to INT 3. Both (1) and (2) 4. Forwarding path All_m...
DLX Pipeline DSE Results Innerp Linear_eq State_eq Integrate 1D_particle GLR
DLX Pipelining Experiments Summary <ul><li>Forwarding paths added: </li></ul><ul><ul><li>average performance improvement: ...
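To make concrete why such forwarding paths matter, consider a minimal C sketch (illustrative only, not taken from the benchmarks above): a load followed immediately by a dependent add creates exactly the kind of load-use dependence that a forwarding path from the memory-stage latch to the ALU input resolves without a stall.

#include <stdio.h>

/* Illustrative load-use dependence (not one of the cited benchmarks):
 * each iteration loads a[i] and immediately adds it to the running sum,
 * so the add needs the value that is still sitting in the memory-stage latch.
 * Without a MEM->ALU forwarding path the add stalls until writeback;
 * with the path, the loaded value is bypassed directly to the ALU input. */
int main(void) {
    int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int sum = 0;
    for (int i = 0; i < 8; i++) {
        int t = a[i];   /* load                     */
        sum += t;       /* dependent ALU operation  */
    }
    printf("sum = %d\n", sum);
    return 0;
}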
Outline <ul><li>Methodology for Architectural Exploration </li></ul><ul><li>Survey of Architectural Description Languages ...
Memory-Aware Compilation <ul><li>Traditionally, memory system transparent to compiler:  </li></ul><ul><ul><li>Scheduled al...
Exploiting DRAM Access Modes in  Memory-Aware Compiler <ul><li>Allow Compiler to exploit page-mode, burst-mode accesses </...
Example Exploiting DRAM Access Modes in  Memory-Aware Compiler for(i=0;i<9;i++){ a = a + x[i] + y[i]; b = b + z[i] + u[i];...
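The following is a minimal C sketch of the source-level effect of this optimization; the unroll factor of 4, the loop bound of 8 (the slide uses 9), and the assumption that each array occupies its own DRAM page are illustrative, and the real transformation is performed by the memory-aware compiler on its IR using the DRAM timings captured in the ADL.

#include <stdio.h>

/* Assumed data layout: each array sits in its own DRAM page. */
int x[8], y[8], z[8], u[8];

int main(void) {
    int a = 0, b = 0, i;

    /* Before: x[i], y[i], z[i], u[i] alternate every iteration, so each read
     * lands on a different page and pays a full row decode + column decode
     * + precharge. */
    for (i = 0; i < 8; i++) {
        a = a + x[i] + y[i];
        b = b + z[i] + u[i];
    }

    /* After (sketch): unroll by 4 and group the reads per array, so each run
     * of four stays in one open page (1 row decode, 4 column decodes,
     * 1 precharge), and the memory-aware scheduler can overlap the row
     * decode/precharge of one run with the column decodes of another. */
    a = 0; b = 0;
    for (i = 0; i < 8; i += 4) {
        int x0 = x[i], x1 = x[i+1], x2 = x[i+2], x3 = x[i+3];  /* page-mode run */
        int y0 = y[i], y1 = y[i+1], y2 = y[i+2], y3 = y[i+3];  /* page-mode run */
        int z0 = z[i], z1 = z[i+1], z2 = z[i+2], z3 = z[i+3];  /* page-mode run */
        int u0 = u[i], u1 = u[i+1], u2 = u[i+2], u3 = u[i+3];  /* page-mode run */
        a = a + (x0 + x1 + x2 + x3) + (y0 + y1 + y2 + y3);
        b = b + (z0 + z1 + z2 + z3) + (u0 + u1 + u2 + u3);
    }
    printf("%d %d\n", a, b);
    return 0;
}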
Experiments exploiting DRAM access modes Dynamic cycle counts exploiting page-mode and burst-mode accesses in the compiler...
MIST: Cache miss traffic management <ul><li>Cache misses: most time consuming operations </li></ul><ul><li>Traditionally, ...
Cache miss traffic management Example Cache line size: 4 ... for(i=0;i<12;i+=4){ s=s+temp s=s+a[i+1];  <== HIT s=s+a[i+2];...
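Below is a minimal C-level sketch of the two MIST steps on the loop above; the iteration count of 16, the padded array, and the temp variable are assumptions for illustration, and in the actual compiler the transformation is applied to the scheduling DAG using the miss/hit latencies from the memory model rather than to the source.

#include <stdio.h>

#define N 16          /* assumed iteration count, a multiple of the line size */
int a[N + 4];         /* one extra line so the hoisted load stays in bounds   */

int main(void) {
    int s = 0, i;

    /* Step 0 -- original loop: with a 4-int cache line, every 4th access is a
     * miss, but the compiler schedules all accesses as if they were hits. */
    for (i = 0; i < N; i++)
        s = s + a[i];

    /* Step 1 -- unroll by the line size: a[i] is always the miss,
     * a[i+1..i+3] are always hits, so the miss can be given its true latency. */
    s = 0;
    for (i = 0; i < N; i += 4) {
        s = s + a[i];       /* MISS: fetches the whole line */
        s = s + a[i + 1];   /* HIT                          */
        s = s + a[i + 2];   /* HIT                          */
        s = s + a[i + 3];   /* HIT                          */
    }

    /* Step 2 -- hoist the miss one iteration ahead (software-pipeline it):
     * the load of the *next* line is in flight while hits to the current
     * line execute, breaking the cache dependence inside the loop body. */
    s = 0;
    {
        int temp = a[0];                /* prologue: miss for the first line */
        for (i = 0; i < N; i += 4) {
            s = s + temp;               /* value of a[i], already fetched    */
            s = s + a[i + 1];           /* HIT                               */
            s = s + a[i + 2];           /* HIT                               */
            temp = a[i + 4];            /* MISS for the next line            */
            s = s + a[i + 3];           /* HIT, overlapped with that miss    */
        }
    }
    printf("%d\n", s);
    return 0;
}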
Miss Traffic Management Experiments Dynamic cycle counts for MIST: Memory Miss Traffic Management Algorithm.  Proc. Intern...
Outline <ul><li>Methodology for Architectural Exploration </li></ul><ul><li>Survey of Architectural Description Languages ...
Embedded memories: the programmer’s viewpoint <ul><li>Register files </li></ul><ul><ul><li>Explicit usage in instruction s...
Memory Organizations and Architectures <ul><li>Traditional memory hierarchies </li></ul><ul><ul><li>Caching: spatial and t...
Custom Memory Architectures <ul><li>Disk File systems:  Parsons et al., Patterson et al.:    use file access patterns to i...
APEX: Access Pattern based Memory Exploration <ul><li>Motivation: </li></ul><ul><ul><li>Majority of memory accesses genera...
Customizing Memory Architectures <ul><li>Opportunity for wide range of power, cost, performance </li></ul><ul><ul><li>Anal...
Motivating Example <ul><li>Illustrative example: 2 cases </li></ul><ul><ul><li>1. Traditional Cache-only Memory Architectu...
1. Traditional Cache-only Memory Arch.  for(i=0;i<1000;i++){ …  = a[i] + …; } … for(i=0;i<1000;i++){ code = codetab[code];...
2. APEX: Access Pattern-based Memory Customization for(i=0;i<1000;i++){ …  = a[i] + …; } … for(i=0;i<1000;i++){ code = cod...
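For reference, the two access patterns from this example written out as self-contained C (array sizes are assumed): the stream pattern over a[] is fully predictable and can be serviced by a small stream/prefetch buffer, while the self-indirect pattern over codetab[] is data dependent, defeats a conventional cache, and is the kind of access APEX maps onto a dedicated self-indirect memory module.

#include <stdio.h>

#define N 1000
int a[N];            /* stream pattern: consecutive, fully predictable addresses */
int codetab[N];      /* self-indirect pattern: next address depends on loaded data */

int main(void) {
    int i, sum = 0, code = 0;

    /* Stream access pattern -> candidate for a stream/prefetch buffer:
     * the next address is always the current one plus the element size. */
    for (i = 0; i < N; i++)
        sum = sum + a[i];

    /* Self-indirect access pattern -> candidate for a self-indirect buffer:
     * the value just read is used as the next index, so a cache sees an
     * essentially random address sequence with little spatial locality. */
    for (i = 0; i < N; i++)
        code = codetab[code];

    printf("%d %d\n", sum, code);
    return 0;
}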
Cost/Perf Exploration: Compress
Memory Exploration: Compress (Perf. Paretos)
Perf/Power Exploration:  Compress
Memory Exploration: Compress (Power Paretos)
Memory Organizations and Architectures <ul><li>Traditional memory hierarchies </li></ul><ul><ul><li>Caching: spatial and t...
Outline <ul><li>Methodology for Architectural Exploration </li></ul><ul><li>Survey of Architectural Description Languages ...
Summary <ul><li>Today we reviewed </li></ul><ul><ul><li>ADL-driven architectural exploration of programmable embedded syst...
Outlook <ul><li>Current Focus:  </li></ul><ul><ul><li>Language-driven SW toolkit generation (ADL=>compiler, simulator,…) <...
  • - I will start with a short motivation of our work. - Recent advances in SOC technology make it possible to utilize customizable processor cores, together with a variety of novel on-chip/off-chip memories, allowing customization of SOC architectures for specific embedded applications and tasks. - These trends clearly present tremendous opportunities for system designers to tune and customize SOC designs for diverse goals, such as low power, area, code size. - However, shrinking time-to-market cycles, coupled with increasingly short product lifetimes create a critical need to rapidly evaluate candidate SOC architectures, and complete both software and hardware implementations in parallel. - Thus the need for Design Space Exploration.
  • - The traditional design flow started from a design specification, - Which was then used to drive the Hw/Sw partitioning, - Some early exploration through a set of Estimators predicted the quality of the final design - Then the Hw part was used for synthesis, - while the Sw part was fed to a Compiler - Both Hw and Sw were then used for Cosimulation, - and final implementation. - For the Hw part, we can use CAD tools available. - For the Sw part, - for off-the-shelf processors, the Sw toolkit is made available by the manufacturer. - However, for customizable processor cores from an IP library, Sw toolkits are not available.
  • - To support such DSE, we need an ADL-driven Design Flow, where the ADL is used to specify the arch. template, for the H/S co-design flow. - Since this ADL may mix and match different memory and processor IP blocks, we need to automatically gen. The Sw tools to support the arch. template. - That includes e.g., a compiler, simulator... - The ADL description can also be used for verification purposes, - as well as for synthesis of the final hardware. - Moreover, the ADL description can be used to drive Design Space Exploration, by selecting different IP components from an IP library, - to mix and match different processor cores, memory modules, and so on, - and rapidly explore different candidate designs.
  • - Traditionally, for such processors, the Sw toolkit was built later in the design flow. - This prevents concurrent Hw/Sw codesign and development. - More importantly, w/o such a toolkit, we cannot comparatively evaluate an instance of processor core 1 from the IP library against an instance of processor core 2. - Because the Sw toolkit is unavailable, we cannot simulate or generate code, and cannot determine how many cycles or how much memory a particular application takes. - Therefore, Design Space Exploration is meaningless without the immediate availability of toolkits such as compilers, simulators, assemblers. - The solution is to automatically generate the toolkit from a target machine specification.
  • - The first such optimization technique lets the Memory-Aware compiler exploit the DRAM page-mode and burst-mode accesses. - A normal DRAM access is composed of a row decode, a column decode and a precharge. During the row decode, the row part of the address is used to select a particular row from the DRAM array, and copy it into the row buffer. During the column decode, the column part of the address is used to select a particular element from the row buffer, and output it. Precharge de-activates the bank. - The row decode, column decode and precharge are represented using the nodes in the figure: for instance this node represents a row decode, taking 2 cycles. … - We combine these primitive operations into complete memory accesses, as shown in these 2 accesses. - For instance, a normal DRAM access is composed of a row decode, a column decode and a precharge, - while a set of 4 consecutive page-mode accesses contains 1 row decode, 4 column decodes and 1 precharge. - Using this detailed information, the compiler can better schedule the memory accesses, and hide the latency of the memory operations by overlapping them with CPU and other operations.
  • - I will use the small example in the figure to illustrate the potential gains obtained with our technique. - The example contains a loop and 4 memory accesses. 1. - In the first case, the naïve approach, we assume that there are no efficient DRAM access modes available. Here all the accesses from the example are normal accesses, composed of a row-decode, column-decode and precharge. - In this case, the total cycle count is 180 cycles. 2. - By unrolling the loop and grouping together the accesses to the same page, the locality of the accesses can be improved, and page-mode accesses can be used, as shown in the case 2. Here the total cycle count is reduced to 84 cycles. 3. - However, by providing the compiler with accurate timing information, the performance can be further improved. The compiler can even better overlap the row decode and precharge operations, and reduce the cycle counts even further to 60 cycles, generating a further 40% gain in performance.
  • - In the following I will present a set of experiments to show the performance improvements obtained by our Memory-Aware Compilation technique. - The first column represents the benchmarks. - The second column shows the dynamic cycle counts for the naïve approach, where we assume that no efficient DRAM access modes are available. - The third column shows the dynamic cycle counts for the best traditional approach, where the code has been optimized for page-mode and burst-mode accesses, but without accurate timing information in the compiler. - The fourth column shows the dynamic cycle counts for the Memory-Aware compiler, using accurate timing information - The last column shows the performance improvement of the Memory-Aware compiler approach (column 4), compared to the best traditional approach (column 3). - The performance improvement varies between 6% (for GSR, where there are few optimization opportunities), and 47.9% (for SOR, containing accesses which can be distributed over different DRAM pages and banks). - The average performance improvement over the best traditional approach is 23%. - What I presented so far is one technique which can employ accurate memory timing information in the compiler, in the presence of DRAM modules with page-mode and burst-mode accesses. - In the following I will present a technique which allows the compiler to use this accurate timing information in the presence of caches, obtaining further performance improvements.
  • ... - In the following we will represent the cache misses as a non-shaded node, taking 20 cycles. - The next node represents a cache hit, taking 2 cycles, - and the dark node is a cpu operation, taking 1 cycle. - By using this accurate timing information, the compiler can account for the longer cache miss latencies, and hide it by overlapping the cache misses with cache hits to other cache lines, as we will show in the following example.
  • - We use the simple example in the figure to illustrate the potential gains obtained by our technique. - Assuming for simplicity a cache line size of 4, the original example takes 120 cycles, since every fourth memory access results in a cache miss. - By unrolling the loop 4 times, we isolate the cache misses from the hits, and we obtain the code shown in the figure on the right. The first access in the body of the loop is always a miss, while the next three accesses are always hits. The cycle count is reduced in this case to 108 cycles. - However, we can notice here that the cache miss and the subsequent hits are to the same cache line. Therefore, the hits depend on the miss to bring in the data from the memory into the cache. There is a dependence between the hits and the miss to the same cache line, which we call a cache dependence. - By shifting the cache miss to the previous loop iteration, we reduce the dependences inside the loop body and allow further performance improvements. The cycle count generated by the memory-aware compiler in this case is 87 cycles, which represents a 37% gain.
  • - We present here a set of experiments showing the performance improvement obtained by MIST, our miss traffic management technique. - The first column shows the benchmarks. - The second column shows the dynamic cycle counts for the traditional approach, with no optimizations. - The next 2 pairs of columns show the dynamic cycle counts and performance improvement for the 2 steps of our algorithm. - The last column shows the overall performance improvement. - The average performance improvement for the 2 steps of the algorithm is 34% and 21% respectively, and the overall average is roughly 60%. - What we showed are 2 instances of techniques where a Memory-Aware Compiler can use explicit memory timing information to better schedule and hide the latency of the memory operations, generating significant performance improvements. - Thank you for your attention. I will let Prof. Dutt continue with the next part of the presentation.
  • There has been related work in 3 broad areas: program transformations to optimize the cache hit ratio; cache behavior analysis, to predict the number/moment of cache misses, used mainly for early performance estimation and to guide optimizations such as prefetching; and memory timing extraction and exploitation, addressed in the areas of interface synthesis, high-level synthesis and compilation. However, none of these approaches tries to explicitly manage the cache miss traffic in the compiler, using accurate timing information from the processor/memory architecture to aggressively schedule the cache miss traffic and improve performance.
  • In our MIST: memory miss traffic management approach We allow the memory-aware compiler to explicitly manage the cache miss traffic And overlap the cache misses with other .. By for instance, …. And generate substantial performance …
  • We use the simple example in the figure, containing a loop accessing an array a. We will use the nodes described in the figure on the right to represent the cache hits, misses and CPU operations. The red node represents a cache miss, taking 20 cycles. The blue node represents a cache hit, taking 2 cycles, And the black node, taking 1 cycle, represents an add operation. We will present the performance of the example in 3 cases: The first case is the traditional approach, where the compiler treats all cache accesses as hits, scheduling them optimistically, and relying on the memory controller to account for longer delays. The second case is the first phase of our MIST optimization, where we isolate the cache misses from the hits in the code, and attach accurate timing for the compiler to aggressively schedule the cache miss operations. The third case is the second phase of our MIST optimization, where we further improve the cache miss traffic, by overlapping the cache misses with cache hits to a different cache line.
  • This figure shows the memory and connectivity exploration results for the compress benchmark. Again, the x axis is cost, and the y axis is performance (latency). As we mentioned earlier, the purple line shows the cost/performance pareto points. However, the designer may be interested in the pareto points from different points of view, such as performance and power. The yellow line represents the perf/power pareto points, that is, the best points from the perf/power point of view, represented in the same cost/performance graph. In general, when the designer tries to optimize 2 goals of the system, for instance performance and cost, they have to give up the third dimension. For instance here, most points which are pareto points from the performance and power points of view (the yellow line – the perf/power paretos) do not coincide with the most promising points from the cost/performance point of view. The only point which is on both paretos is this one, which has a large cost.
  • This figure represents an analysis of the cost/performance pareto points, presented earlier for the compress benchmark. The points a and b represent the traditional cache architecture, with 2 different bus configurations: an AMBA bus and a dedicated connection. However, by progressively implementing the access patterns in the application with special memory modules (the architectures c, d, e, and so on), we can progressively improve the performance by trading off the cost of the system. For instance, architecture h, which contains multiple memory modules, such as a self-indirect buffer, a stream buffer and a negative stream buffer, with the connectivity implemented using a MUX and an AMBA APB bus, generates a roughly 25% performance improvement over the traditional architecture. Clearly, by moving more access patterns into special purpose memory modules, it is possible to significantly improve the memory system behavior.
  • This figure represents the same cost/perf and perf/power pareto points, now in the perf/power 2-dimensional space. As mentioned earlier, the two pareto curves have this point in common, which, as we saw in the previous graph, has a rather high cost. By presenting the designer with such design alternatives in multiple spaces, they can better target the specific design goals of the system early during the design process.
    1. 1. Architectures and Compilers for Embedded Systems (ACES) Laboratory Center for Embedded Computer Systems University of California, Irvine [email_address] http://www.cecs.uci.edu/~dutt Architectural Exploration for Programmable Embedded Systems With Contributions from the EXPRESSION team: Peter Grun, Ashok Halambi, Nick Savoiu, Radu Cornea, Prabhat Mishra, Aviral Shrivastava, Partha Biswas, Srikanth Srinivasan, Ilya Issenin, Marcio Buss, Dr. Hiroyuki Tomiyama, and Prof. Alex Nicolau Work Partially Supported by NSF, ONR, and DARPA Nikil D. Dutt
    2. 2. Outline <ul><li>Methodology for Architectural Exploration </li></ul><ul><li>Survey of Architectural Description Languages (ADLs) </li></ul><ul><li>Software Toolkit Generation </li></ul><ul><li>Architectural Exploration </li></ul><ul><li>Summary and Conclusions </li></ul>
    3. 3. Traditional Processor-Centric Designs <ul><li>Performance driven designs </li></ul><ul><ul><li>Limitations: </li></ul></ul><ul><ul><ul><li>from application: only limited by available parallelism </li></ul></ul></ul><ul><ul><ul><li>from architecture: widening processor-memory gap => memory bottleneck </li></ul></ul></ul><ul><ul><li>Solution: </li></ul></ul><ul><ul><ul><li>expose maximally the available parallelism in application (compiler) </li></ul></ul></ul><ul><ul><ul><li>devise memory hierarchy to exploit effectively this parallelism </li></ul></ul></ul><ul><li>Can increase performance by </li></ul><ul><ul><li>explicit exploitation of available parallelism </li></ul></ul><ul><ul><li>implicit exploitation of parallelism to mask operations and memory latencies </li></ul></ul><ul><li>Match processor architecture w/ memory configuration for application suite(s) </li></ul>
    4. 4. Embedded System-on-Chip (SOC) Designs <ul><li>One or few dedicated applications </li></ul><ul><ul><li>Opportunity to customize design </li></ul></ul><ul><li>Diverse requirements </li></ul><ul><ul><li>(Real-time) performance, power, data/code density, testability,…. </li></ul></ul><ul><li>Approach: aggressively exploit application behavior: </li></ul><ul><ul><li>Use coarse-grain and fine-grain compiler techniques </li></ul></ul><ul><ul><li>Evaluate different architectures and memory organizations </li></ul></ul><ul><li>Need for exploration capability without loss of efficiency </li></ul><ul><ul><li>Rapid software toolkit generation (compiler, simulator, debugger,...) </li></ul></ul>
    5. 5. Embedded S-O-C Design Issues <ul><li>Technology Trends </li></ul><ul><ul><li>1G transistor chips by ~2010 (SIA Roadmap) </li></ul></ul><ul><ul><li>Faster processors => Migration of functionality from HW to SW </li></ul></ul><ul><ul><li>Reconfigurable logic => SW </li></ul></ul><ul><ul><li>DRAM merged with logic (plus analog, RF, etc.) </li></ul></ul><ul><li>Market Trends </li></ul><ul><ul><li>Shrinking time-to-market </li></ul></ul><ul><ul><li>Design Reuse </li></ul></ul><ul><ul><ul><li>Componentization, decreasing time between design starts </li></ul></ul></ul><ul><ul><ul><li>Product “versioning” </li></ul></ul></ul><ul><ul><li>New standards, but unique implementations (e.g., Bluetooth, G3) </li></ul></ul><ul><li>Result: </li></ul><ul><ul><li>Intense pressure to rapidly innovate, explore, and differentiate, while meeting complex design constraints </li></ul></ul>
    6. 6. What to do with all these transistors? <ul><li>New processor architectures </li></ul><ul><ul><ul><li>E.g., Ultra Large Instruction Word Machines (i.e., VLIW-like) </li></ul></ul></ul><ul><ul><ul><li>Aggressive use of compiler technology (speculation, sophisticated disambiguation) </li></ul></ul></ul><ul><li>Multiprocessors on a chip </li></ul><ul><ul><ul><li>Heterogeneous processors tuned for specific tasks/functions </li></ul></ul></ul><ul><ul><ul><li>Enhanced compiler technology for better communication/synchronization </li></ul></ul></ul><ul><ul><ul><li>Integration of OS/Multithreading </li></ul></ul></ul><ul><li>Novel memory organizations and hierarchies </li></ul><ul><ul><ul><li>Different types of on-chip memories: multiple cache hierarchies, frame buffers, stream buffers, etc. </li></ul></ul></ul><ul><ul><ul><li>Need “memory-aware”compiler, and processor-memory coexploration </li></ul></ul></ul><ul><li>RESULT: Software issues WILL dominate, requiring rapid generation of software toolkits to support design </li></ul>
    7. 7. Programmable Embedded Systems: Boards to SOCs <ul><li>Past </li></ul><ul><ul><li>Board-level IC’s </li></ul></ul><ul><li>Present </li></ul><ul><ul><li>System-on-a-chip (SOC) and IP “cores” </li></ul></ul><ul><ul><li>Core types </li></ul></ul><ul><ul><ul><li>Hard: layout </li></ul></ul></ul><ul><ul><ul><li>Firm: structural HDL </li></ul></ul></ul><ul><ul><ul><li>Soft: RT-synthesizable HDL </li></ul></ul></ul>Processor Memory Peripheral Board Peripheral Mem Processor IP cores Core library PeripheralA PeripheralB ProcessorX SOC [Source: F. Vahid]
    8. 8. Networked Embedded System [Courtesy: R. Gupta]
    9. 9. Programmable SOC Platforms <ul><li>Domain-specific </li></ul><ul><li>Parameterized Cores </li></ul><ul><li>Sample Parameters: </li></ul><ul><ul><li>Voltage scale </li></ul></ul><ul><ul><li>Size, line, associativity </li></ul></ul><ul><ul><li>Bus width, encoding (gray, invert) </li></ul></ul><ul><ul><li>UART tx/rx buffer size </li></ul></ul><ul><ul><li>DCT resol. </li></ul></ul><ul><li>Configurations impact power/performance </li></ul>[Source: T. Givargis] UART MIPS I-Cache D-Cache Bridge Peripheral Bus DCT CODEC Memory DMA System-on-a-Chip (SOC)
    10. 10. Why Explore Architectures? <ul><li>5.10x exe. </li></ul><ul><li>7.51x power </li></ul><ul><li>2.73x energy </li></ul>[Source: T. Givargis] Example: JPEG implemented on prog. SOC platform Tremendous Variation in Power/Performance! Variations:
    11. 11. Philips Velocity SoC Platform
    12. 12. Configurable Processor Platform : Tensilica Xtensa MMU ALU Pipe Cache I/O Timer Register File Controller
    13. 13. Fixed Programmable SOC Template
    14. 14. Programmable Architectural Trends <ul><li>Recent advances in System-On-Chip Technology </li></ul><ul><ul><ul><li>customizable processor cores, coprocessors, multiple processors on SOC </li></ul></ul></ul><ul><ul><ul><li>novel on-chip/off-chip memory hierarchies, heterogeneous memory organizations </li></ul></ul></ul><ul><ul><ul><li>mixed memory/logic fabrication (on-chip DRAM) </li></ul></ul></ul><ul><li>Customization of SOC architectures for specific embedded applications/tasks. </li></ul><ul><li>Software content of SOCs increasing rapidly </li></ul><ul><li>Tune SOC for diverse goals: power, code size, area, ... </li></ul><ul><li>Shrinking time-to-market + short product lifetimes </li></ul><ul><li>Need: rapidly evaluate SOC architectures </li></ul><ul><ul><li>Design Space Exploration (DSE) </li></ul></ul>
    15. 15. Architecture-Compiler Coupling Parameters : no, size of units no, size, ports of reg files caches memory hierarchy Architecture Compiler Instruction Set Definition : basic instructions sub-word parallelism application-specific instructions cache control instructions … . … .
    16. 16. Compiler-Architecture-CAD Coupling Parameters : no, size of units no, size, ports of reg files caches memory hierarchy Architecture Compiler CAD Instruction Set Definition : basic instructions sub-word parallelism application-specific instructions cache control instructions … . … . Tasks : estimate global memory identify bottlenecks reduce memory traffic … . partition and organize memories Hardware/Software Partitioning Memory-related Optimizations
    17. 17. Programmable Arch’s: Traditional Design Flow - Application-to-architecture mapping - Early HW/SW partitioning - Ensuing tasks of synthesis, SW compilation Design Specification Hw/Sw Partitioning Off-Chip Memory Processor Core On-Chip Memory Synthesized HW Interface HW VHDL, Verilog SW C Synthesis Compiler Cosimulation Estimators
    18. 18. Programmable Arch’s: Traditional Design Flow Issues: -- Multiple specifications Functional, IS, RT (synthesis) -- Software after Hardware -- Limited Exploration Space need compiler/simulator in-the-loop -- Consistency and Validation -- Verification and Testing Design Specification Hw/Sw Partitioning Off-Chip Memory Processor Core On-Chip Memory Synthesized HW Interface HW VHDL, Verilog SW C Synthesis Compiler Cosimulation Estimators
    19. 19. Traditional Design Flow Design Specification Hw/Sw Partitioning Off-Chip Memory Processor Core On-Chip Memory Synthesized HW Interface HW VHDL, Verilog SW C Synthesis Compiler Cosimulation Estimators Predefined Architectural Model
    20. 20. IP-Centric Design Flow <ul><li>Increasing use of IP blocks </li></ul><ul><ul><li>COTS => IP, Soft/Hard IP blocks </li></ul></ul><ul><ul><li>Processor Core Families </li></ul></ul><ul><ul><ul><li>RISC, DSP, VLIW, ASIPs: many attributes parametrizable </li></ul></ul></ul><ul><ul><li>Custom Memory Configurations </li></ul></ul><ul><ul><li>Special-purpose HW blocks </li></ul></ul><ul><ul><ul><li>(video/audio compression/decompression engines, encryption engines, etc.) </li></ul></ul></ul><ul><li>Design Reuse </li></ul><ul><ul><li>Leveraged through predesigned, preverified blocks </li></ul></ul><ul><ul><li>Customization, adaptation </li></ul></ul><ul><li>Reduce time-to-market </li></ul><ul><ul><li>Key Bottleneck: lack of software tools to support use of IP </li></ul></ul><ul><ul><li>Again, urgent need to rapidly generate optimized software toolkits </li></ul></ul>
    21. 21. Main Bottleneck <ul><li>SOC Customization with IP Blocks </li></ul><ul><ul><li>COTS: SW tools available (already developed) </li></ul></ul><ul><ul><li>IP Blocks: no support tools, huge time lag until SW tools are generated/modified </li></ul></ul><ul><li>Need rapid generation of SW toolkit for Embedded SOC (compilers, simulators, debuggers, etc.) </li></ul><ul><li>Language-Based Design Methodology for Embedded SOC </li></ul><ul><ul><li>Application=> Specification Language </li></ul></ul><ul><ul><li>Architecture=> ADL (drives SW tools generation) </li></ul></ul>
    22. 22. ADL-Driven Design Flow ADL Specification Verification Rapid design space exploration Quality tool-kit generation Design reuse Design Specification Hw/Sw Partitioning Off-Chip Memory Processor Core On-Chip Memory Synthesized HW Interface HW VHDL, Verilog SW C Synthesis Compiler Cosimulation Estimators P1 M1 P2 IP Library
    23. 23. Outline <ul><li>Methodology for Architectural Exploration </li></ul><ul><li>Survey of Architectural Description Languages (ADLs) </li></ul><ul><li>Software Toolkit Generation </li></ul><ul><li>Architectural Exploration </li></ul><ul><li>Summary and Conclusions </li></ul>
    24. 24. <ul><li>Specify architecture templates of SOCs </li></ul><ul><ul><li>Blocks/components which reside on the SOC </li></ul></ul><ul><ul><li>How they are connected or interact </li></ul></ul><ul><ul><li>Functionality of each component </li></ul></ul><ul><li>Support </li></ul><ul><ul><li>Automated SW toolkit generation </li></ul></ul><ul><ul><ul><li>ILP compilers </li></ul></ul></ul><ul><ul><ul><li>Simulators (instruction-set-, cycle-, phase-accurate) </li></ul></ul></ul><ul><ul><ul><li>Debuggers </li></ul></ul></ul><ul><ul><ul><li>Real-time OSs </li></ul></ul></ul><ul><ul><li>Verification / Validation </li></ul></ul>Architecture Description Languages
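As a rough illustration of the three kinds of information listed above, the C sketch below encodes a toy architecture template as plain data structures; this is not the syntax of any actual ADL (nML, ISDL, LISA, EXPRESSION, ...), only a hint of what components, connectivity, and per-operation functionality a toolkit generator would read from such a specification.

#include <stdio.h>

typedef enum { UNIT_FETCH, UNIT_ALU, UNIT_CACHE, UNIT_DRAM } UnitKind;

typedef struct {            /* 1. blocks/components residing on the SOC        */
    const char *name;
    UnitKind    kind;
    int         latency;    /* cycles, used by a generated scheduler/simulator */
} Component;

typedef struct {            /* 2. how the components are connected              */
    const char *from, *to;
} Connection;

typedef struct {            /* 3. functionality: operation behavior mapped onto */
    const char *opcode;     /*    the units that execute it (reservation-table  */
    const char *units_used; /*    style resource usage)                         */
} OpBehavior;

static const Component  comps[] = {
    {"IF", UNIT_FETCH, 1}, {"ALU1", UNIT_ALU, 1},
    {"DCache", UNIT_CACHE, 2}, {"OffChipDRAM", UNIT_DRAM, 20},
};
static const Connection conns[] = { {"IF", "ALU1"}, {"ALU1", "DCache"},
                                    {"DCache", "OffChipDRAM"} };
static const OpBehavior ops[]   = { {"ADD",  "IF, ALU1"},
                                    {"LOAD", "IF, ALU1, DCache"} };

int main(void) {            /* a toolkit generator would walk these tables      */
    for (unsigned i = 0; i < sizeof(comps) / sizeof(comps[0]); i++)
        printf("component %-12s latency %2d\n", comps[i].name, comps[i].latency);
    for (unsigned i = 0; i < sizeof(conns) / sizeof(conns[0]); i++)
        printf("connect   %s -> %s\n", conns[i].from, conns[i].to);
    for (unsigned i = 0; i < sizeof(ops) / sizeof(ops[0]); i++)
        printf("op %-4s uses: %s\n", ops[i].opcode, ops[i].units_used);
    return 0;
}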
    25. 25. ADL-Based SOC Codesign Flow Processor Processor ASICs Memories IFs ASICs Memories IFs Cosimulation HW SW HW/SW Partitioning Synthesis Compiler Application Processors ASICs Memories IFs Interconnection System on Chip Specify Synthesize IP Library Verify/Validate Generate ADL Specification Estimator Reuse Estimate Modify
    26. 26. Survey of ADLs <ul><li>Classification Based on Type of Information Captured </li></ul><ul><ul><li>Behavior-centric ADLs </li></ul></ul><ul><ul><li>Structure-centric ADLs </li></ul></ul><ul><ul><li>Mixed-level ADLs </li></ul></ul><ul><li>Classification Based on Their Main Objective </li></ul><ul><ul><li>Synthesis-Oriented ADLs </li></ul></ul><ul><ul><li>Compiler-Oriented ADLs </li></ul></ul><ul><ul><li>Simulation-Oriented ADLs </li></ul></ul><ul><ul><li>Validation-Oriented ADLs </li></ul></ul>
    27. 27. Behavior Centric ADLs <ul><li>Primarily capture Instruction Set (IS) </li></ul><ul><ul><li>Provide programmer’s view </li></ul></ul><ul><ul><li>Organized in a hierarchical manner for conciseness </li></ul></ul><ul><ul><li>Advantages: </li></ul></ul><ul><ul><ul><li>Capture easily available information </li></ul></ul></ul><ul><ul><ul><li>Good for regular architectures </li></ul></ul></ul><ul><ul><li>Disadvantages: </li></ul></ul><ul><ul><ul><li>Tedious for irregular architectures </li></ul></ul></ul><ul><ul><ul><li>Hard to specify pipelining </li></ul></ul></ul><ul><ul><ul><li>Contain an implicit architecture model </li></ul></ul></ul>Examples: nML, ISDL, ValenC, CSDL Instruction-Set Arithmetic Operations: Addition ………………… .. Memory Operations: ………………… .. ………………… .. Constraints: ……………………
    28. 28. Structure Centric ADLs <ul><li>Provide net-list view of the architecture </li></ul><ul><ul><li>Advantages: </li></ul></ul><ul><ul><ul><li>Common specification for both software toolkit generation and hardware synthesis </li></ul></ul></ul><ul><ul><ul><li>Can capture detailed pipelining information </li></ul></ul></ul><ul><ul><li>Disadvantages: </li></ul></ul><ul><ul><ul><li>Hard to extract IS view </li></ul></ul></ul>Instruction-Set Arithmetic Operations: Addition ………………… .. Memory Operations: ………………… .. ………………… .. Constraints: …………………… Examples: MIMOLA, COACH
    29. 29. Mixed-Level ADLs <ul><li>Capture Instruction Set view </li></ul><ul><li>Capture high-level architecture view </li></ul><ul><ul><li>Combine benefits of both </li></ul></ul><ul><ul><li>Advantages: </li></ul></ul><ul><ul><ul><li>Common specification for both software toolkit generation and hardware synthesis </li></ul></ul></ul><ul><ul><ul><li>Can validate/verify structure versus behavior (and vice-versa) </li></ul></ul></ul><ul><ul><li>Disadvantages: </li></ul></ul><ul><ul><ul><li>May require specification of redundant information </li></ul></ul></ul>Instruction-Set Arithmetic Operations: Addition ………………… .. Memory Operations: ………………… .. ………………… .. Constraints: …………………… Examples: MDes, LISA/RADL, EXPRESSION
    30. 30. Survey of ADLs <ul><li>Classification Based on Type of Information Captured </li></ul><ul><ul><li>Behavior-centric ADLs </li></ul></ul><ul><ul><li>Structure-centric ADLs </li></ul></ul><ul><ul><li>Mixed-level ADLs </li></ul></ul><ul><li>Classification Based on Their Main Objective </li></ul><ul><ul><li>Synthesis-Oriented ADLs </li></ul></ul><ul><ul><li>Compiler-Oriented ADLs </li></ul></ul><ul><ul><li>Simulation-Oriented ADLs </li></ul></ul><ul><ul><li>Validation-Oriented ADLs </li></ul></ul>
    31. 31. Synthesis-Oriented ADLs Processor Processor ASICs Memories IFs ASICs Memories IFs Cosimulation HW SW HW/SW Partitioning Synthesis Compiler Application Processors ASICs Memories IFs Interconnection System on Chip Synthesize IP Library Verify/Validate Generate ADL Specification Estimator Reuse Estimate Modify Enable early synthesis of architectures
    32. 32. Synthesis-Oriented ADLs <ul><li>MIMOLA (Univ. of Dortmund, Germany): Synthesizable HDL </li></ul><ul><ul><li>Mainly targeted to DSPs with tightly constrained datapaths </li></ul></ul><ul><ul><li>Used in the MSSQ and RECORD compiler systems </li></ul></ul><ul><ul><li>Capture the structure (RT-level netlist) of the target processor </li></ul></ul><ul><ul><li>Behavior (instruction set) is automatically extracted </li></ul></ul><ul><ul><li>ILP constraints are automatically detected </li></ul></ul><ul><li>COACH (Kyushu Univ., Japan) </li></ul><ul><ul><li>CAD system for ASIPs </li></ul></ul><ul><ul><li>Mainly targeted to simple RISC processors without ILP </li></ul></ul><ul><ul><li>Use the UDL/I HDL for processor description </li></ul></ul><ul><ul><li>Capture the structure </li></ul></ul><ul><ul><li>Behavior is automatically extracted </li></ul></ul><ul><ul><li>Generate compilers and instruction-set simulators </li></ul></ul>
    33. 33. Synthesis-Oriented ADLs <ul><li>Summary </li></ul><ul><ul><li>Synthesis and simulation tools available </li></ul></ul><ul><ul><li>Capture only the structural aspect (RT-level netlist) of the processors </li></ul></ul><ul><ul><li>Low abstraction level => not suited to early and rapid DSE of SOCs </li></ul></ul><ul><ul><li>Behavior extraction and compiler generation are successful for a limited class of processor architectures </li></ul></ul>
    34. 34. Compiler-Oriented ADLs Processor Processor ASICs Memories IFs ASICs Memories IFs Cosimulation HW SW HW/SW Partitioning Synthesis Compiler Application Processors ASICs Memories IFs Interconnection System on Chip Synthesize IP Library Verify/Validate Generate ADL Specification Estimator Reuse Estimate Modify Support automatic generation of compilers
    35. 35. Compiler-Oriented ADLs <ul><li>nML (TU Berlin, Germany) </li></ul><ul><ul><li>Mainly targeted to DSPs and ASIPs </li></ul></ul><ul><ul><li>Generate compilers, instruction-set simulators, and assemblers at TU Berlin, IMEC, Cadence, etc. </li></ul></ul><ul><ul><li>Capture the behavior (instruction set) of the processors as an attribute grammar </li></ul></ul><ul><ul><li>ILP constraints are described in a form of a set of legal combinations of operations </li></ul></ul><ul><li>ISDL (MIT, USA) </li></ul><ul><ul><li>Mainly targeted to VLIW processors </li></ul></ul><ul><ul><li>Generate compilers, assemblers, and cycle-accurate simulators </li></ul></ul><ul><ul><li>Capture the behavior </li></ul></ul><ul><ul><li>ILP constraints are described in a form of a set of Boolean rules all of which must be satisfied </li></ul></ul><ul><ul><li>Can be translated to synthesizable Verilog code </li></ul></ul>
    36. 36. Compiler-Oriented ADLs <ul><li>MDES (HPLabs & UIUC, USA) </li></ul><ul><ul><li>Used for design space exploration of high-performance processors in the Trimaran system </li></ul></ul><ul><ul><li>Generate compilers and cycle-accurate simulators </li></ul></ul><ul><ul><li>Retargetability of cycle-accurate simulators are limited to the HPL-PD processor family </li></ul></ul><ul><ul><li>Mainly captures the behavior (instruction set) </li></ul></ul><ul><ul><li>ILP constraints are described in a form of reservation tables </li></ul></ul><ul><li>EXPRESSION (UC Irvine, USA) </li></ul><ul><ul><li>Targeted to a wide range of architectures (e.g., RISC, VLIW, SS, DSP) </li></ul></ul><ul><ul><li>Generate compilers and cycle-accurate simulators </li></ul></ul><ul><ul><li>Capture both the behavior and the structure (high-level netlist) </li></ul></ul><ul><ul><li>Models complex memory organizations/hierarchies </li></ul></ul><ul><ul><li>ILP constraints are automatically detected through reservation tables </li></ul></ul><ul><ul><li>Graphical front-end for specification and analysis </li></ul></ul>
    37. 37. Compiler-Oriented ADLs <ul><li>Other Compiler-Oriented ADLs </li></ul><ul><ul><li>The FlexWare CAD system supporting compiler and simulator generation for DSPs and ASIPs (TIMA, France) </li></ul></ul><ul><ul><li>The Valen-C compiler system supporting bit-width optimization of RISC-like ASIPs (Kyushu Univ., Japan) </li></ul></ul><ul><ul><li>The Zephyr compiler system supporting development of custom compilers (Univ. of Virginia, USA) </li></ul></ul><ul><li>Summary </li></ul><ul><ul><li>In most compiler-oriented ADLs, the behavior of the target processor is mainly captured. In addition, manual description of ILP constraints is needed for ILP scheduling. </li></ul></ul><ul><ul><li>EXPRESSION captures both the behavior and the structure, enabling automatic detection of ILP constraints </li></ul></ul>
    38. 38. Simulator-Oriented ADLs Processor Processor ASICs Memories IFs ASICs Memories IFs Cosimulation HW SW HW/SW Partitioning Synthesis Compiler Application Processors ASICs Memories IFs Interconnection System on Chip Synthesize IP Library Verify/Validate Generate ADL Specification Estimator Reuse Estimate Modify Support automatic generation of simulators
    39. 39. Simulator-Oriented ADLs <ul><li>LISA (RWTH Aachen, Germany) </li></ul><ul><ul><li>Mainly targeted to DSPs </li></ul></ul><ul><ul><li>Generate bit-true cycle-accurate compiled simulators </li></ul></ul><ul><ul><li>Explicit support for modeling pipeline behaviors such as interlocking, bypassing, stalls, flushes, etc. </li></ul></ul><ul><ul><li>No support for compiler generation </li></ul></ul><ul><li>RADL (Rockwell Semiconductor, USA) </li></ul><ul><ul><li>Extension of the LISA approach </li></ul></ul><ul><ul><li>Mainly targeted to DSPs </li></ul></ul><ul><ul><li>Generate phase-accurate simulators </li></ul></ul><ul><ul><li>Explicit support for modeling delay slots, interrupts, zero-overhead loops, hazards and multi-pipelines in addition to features of LISA </li></ul></ul><ul><ul><li>No support for compiler generation </li></ul></ul>
    40. 40. Simulator-Oriented ADLs <ul><li>Summary </li></ul><ul><ul><li>Capture both the structural and architectural aspect of the processors </li></ul></ul><ul><ul><li>Explicit support for modeling pipeline behaviors such as stalls and flushes </li></ul></ul><ul><ul><li>No explicit support for ILP compiler generation </li></ul></ul>
    41. 41. Validation-Oriented ADLs Processor Processor ASICs Memories IFs ASICs Memories IFs Cosimulation HW SW HW/SW Partitioning Synthesis Compiler Application Processors ASICs Memories IFs Interconnection System on Chip Synthesize IP Library Verify/Validate Generate ADL Specification Estimator Reuse Estimate Modify Enable early verification/validation of architectures
    42. 42. Validation-Oriented ADLs <ul><li>AIDL (Univ. of Tsukuba, Japan) </li></ul><ul><ul><li>Targeted to high-performance superscalar processors </li></ul></ul><ul><ul><li>Describe timing behavior of pipelines (e.g., data-forwarding, out-of-order completion, etc.) using temporal logic </li></ul></ul><ul><ul><li>The timing behavior is validated/verified through simulation </li></ul></ul><ul><ul><li>No support for SW toolkit generation </li></ul></ul><ul><ul><li>Can be translated to synthesizable VHDL code </li></ul></ul><ul><li>Summary </li></ul><ul><ul><li>Limited previous work </li></ul></ul><ul><ul><li>Few properties can be validated </li></ul></ul><ul><ul><li>No support for SW toolkit generation </li></ul></ul>
    43. 43. Future Directions for ADLs <ul><li>Formal Verification </li></ul><ul><ul><li>Detection of pipeline conflicts (resource, data, and control conflicts) </li></ul></ul><ul><ul><li>Consistency checking between the behavior and the structure </li></ul></ul><ul><li>SOC Architecture Synthesis from ADL Specifications </li></ul><ul><li>Automatic Generation of Real-Time OSs </li></ul><ul><ul><li>Optimization of task scheduling, interrupt handling, memory management, etc. </li></ul></ul><ul><li>IP Libraries </li></ul><ul><ul><li>Standard mechanisms to specify SOC architectures </li></ul></ul><ul><ul><li>Standard mechanisms to encapsulate design attributes such as performance, power consumption, feature size, etc.) </li></ul></ul><ul><li>Support for Future SOC Architectures </li></ul><ul><ul><li>Heterogeneous multi-processors with multi-threaded architectures </li></ul></ul><ul><ul><li>On-chip memory hierarchies with various memory types (e.g., DRAM, flash memories, etc.) </li></ul></ul><ul><ul><li>On-chip reconfigurable devices </li></ul></ul>
    44. 44. Outline <ul><li>Methodology for Architectural Exploration </li></ul><ul><li>Survey of Architectural Description Languages (ADLs) </li></ul><ul><li>Software Toolkit Generation </li></ul><ul><li>Architectural Exploration </li></ul><ul><li>Summary and Conclusions </li></ul>
    45. 45. Software Toolkits for Processor Cores <ul><li>SOC designers using processor cores. </li></ul><ul><li>Major bottleneck: lack of supporting software tools (compiler, simulator, …) </li></ul><ul><li>Traditionally: toolkit built at later stages of system design </li></ul><ul><li>Design Space Exploration meaningless w/o toolkit support </li></ul><ul><li>Solution: Generate Toolkit from a Target machine specification </li></ul><ul><ul><li>Architecture Description Language (ADL) used to define architectural template </li></ul></ul><ul><ul><li>ADL is used to drive generation of compiler, simulators, validation/verification, and synthesis </li></ul></ul><ul><ul><li>Approach allows compiler-in-the-loop architectural exploration </li></ul></ul>
    46. 46. Architecture Description Languages (ADLs) <ul><li>Objectives: </li></ul><ul><ul><li>Support automated SW toolkit generation </li></ul></ul><ul><ul><ul><li>exploration (through parametrization & generality) </li></ul></ul></ul><ul><ul><ul><li>production quality SW tools (cycle-accurate simulator, memory-aware compiler..) </li></ul></ul></ul><ul><ul><li>Specify from a variety of architecture classes (VLIWs, DSP, RISC, ASIPs…) </li></ul></ul><ul><ul><li>Specify novel memory organizations </li></ul></ul><ul><ul><li>Specify pipelining and resource constraints </li></ul></ul>Architecture Description File Compiler Simulator Synthesis Architecture Model ADL Compiler Formal Verification
    47. 47. Software Tools <ul><li>Estimators </li></ul><ul><ul><li>Code Size, Memory Requirements, Performance, Power etc. </li></ul></ul><ul><li>Compilers </li></ul><ul><ul><li>Coarse-grain (task-level) and ILP (microarchitecture-level) </li></ul></ul><ul><li>Assembler, Linker, Loader </li></ul><ul><li>Profiler, Debugger, Code Development Environment </li></ul><ul><li>Simulators </li></ul><ul><ul><li>Bus-functional, instruction-, cycle-, and phase- accurate, structural </li></ul></ul><ul><li>Real Time Operating Systems (RTOS) </li></ul><ul><li>Validation/Verification </li></ul>
    48. 48. Software Tools <ul><li>Estimators </li></ul><ul><ul><li>Code Size, Memory Requirements, Performance, Power etc. </li></ul></ul><ul><li>Compilers </li></ul><ul><ul><li>Coarse-grain (task-level) and ILP (microarchitecture-level) </li></ul></ul><ul><li>Assembler, Linker, Loader </li></ul><ul><li>Profiler, Debugger, Code Development Environment </li></ul><ul><li>Simulators </li></ul><ul><ul><li>Bus-functional, instruction-, cycle-, and phase- accurate, structural </li></ul></ul><ul><li>Real Time Operating Systems (RTOS) </li></ul><ul><li>Validation/Verification </li></ul>
    49. 49. Compiler Issues for Embedded SOC <ul><li>Traditional ES Software </li></ul><ul><ul><li>Handcoded in assembly </li></ul></ul><ul><ul><ul><li>Poor code quality from compilers </li></ul></ul></ul><ul><ul><ul><li>Idiosyncratic architectural features (specialized IS, register banks, etc.) </li></ul></ul></ul><ul><li>Embedded SOC </li></ul><ul><ul><ul><li>Widely heterogeneous, customized processors </li></ul></ul></ul><ul><ul><ul><li>Multiple levels of parallelism </li></ul></ul></ul><ul><ul><ul><li>Complex, non-traditional memory organization/hierarchy </li></ul></ul></ul><ul><ul><ul><li>Complex constraints (hard RT, code size, power, cost,…) </li></ul></ul></ul><ul><li>Embedded SOC Software </li></ul><ul><ul><li>Cannot do handcoding </li></ul></ul><ul><ul><li>Need powerful retargetable compiler technology </li></ul></ul><ul><ul><li>Must fully exploit unique/non-traditional IS or architecture features </li></ul></ul><ul><ul><li>Compiler is CRITICAL for Embedded SOC </li></ul></ul><ul><li>Compiler Issues for Embedded SOC </li></ul><ul><li>Language-driven Software Toolkit Generation </li></ul><ul><li>Architectural Exploration of Embedded SOC </li></ul>
    50. 50. Compiler as an Exploration Tool <ul><li>Analysis Phase of Compiler: Estimation </li></ul><ul><ul><ul><li>Memory size </li></ul></ul></ul><ul><ul><ul><li>parallelism </li></ul></ul></ul><ul><ul><ul><li>resources </li></ul></ul></ul><ul><li>“Fast” Compiler Algorithms to Evaluate Tradeoffs </li></ul><ul><ul><ul><li>on-chip parallelism vs. memory </li></ul></ul></ul><ul><ul><ul><li>effect on speed, power, code size </li></ul></ul></ul><ul><li>“Fast” Simulator to evaluate architectural modifications/enhancements </li></ul><ul><ul><ul><li>Customized instructions </li></ul></ul></ul><ul><ul><ul><li>customized units </li></ul></ul></ul><ul><ul><ul><li>data path size (bitwidth) </li></ul></ul></ul><ul><ul><ul><li>customized memory organization/hierarchy </li></ul></ul></ul><ul><li>Compiler Critical for Embedded SOC exploration </li></ul>
    51. 51. Retargetable Compilers [Figure: the ADL-based SOC codesign flow: an ADL specification, with estimation and IP-library reuse, drives HW/SW partitioning, the compiler, synthesis, cosimulation, and verify/validate steps targeting a system-on-chip of processors, ASICs, memories, and interfaces] Automatic generation of compilers from ADLs
    52. 52. Retargetable Compilers <ul><li>Issues: </li></ul><ul><ul><li>Produce efficient code for a wide variety of processor architectures </li></ul></ul><ul><ul><ul><li>DSP, VLIW, RISC, Superscalar </li></ul></ul></ul><ul><ul><ul><li>Multi-processor/Multi-threaded architectures </li></ul></ul></ul><ul><ul><ul><li>Need efficient code optimization techniques </li></ul></ul></ul><ul><ul><ul><ul><li>ILP, Predicated Execution </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Techniques for novel instruction-sets, architectures </li></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Multimedia instructions, cache control instructions </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Specialized addressing modes, specialized functional units </li></ul></ul></ul></ul></ul><ul><ul><ul><li>Need dynamic phase ordering capability </li></ul></ul></ul><ul><ul><li>Produce code that satisfies varied constraints </li></ul></ul><ul><ul><ul><li>Instruction Memory size, Data Memory size </li></ul></ul></ul><ul><ul><ul><li>Power, Performance </li></ul></ul></ul>
    53. 53. Compiler Flow (Front-End) [Figure: Program, Lexical Analysis, Semantic Analysis, High-level IR; the ADL supplies multi-processor/multi-threading and memory-subsystem/power information to the front-end passes] <ul><li>Analysis: </li></ul><ul><ul><li>Data dependence </li></ul></ul><ul><ul><li>Array, Pointer </li></ul></ul><ul><ul><li>Loop </li></ul></ul><ul><li>Memory/Power: </li></ul><ul><ul><li>Estimation </li></ul></ul><ul><ul><li>Loop/Array optimizations </li></ul></ul><ul><li>Parallelization: </li></ul><ul><ul><li>Task-level </li></ul></ul><ul><ul><li>Loop-level </li></ul></ul>
    54. 54. Compiler Flow (Back-End) [Figure: High-level IR, lowering of complex expressions and array subscripts, Medium-level IR, Low-level IR; the ADL supplies memory-subsystem/power, resource, operation-behavior, register-file, and pipeline-conflict/constraint information to the back-end passes] <ul><li>Pre-scheduling optimizations: </li></ul><ul><ul><li>Dead code removal </li></ul></ul><ul><ul><li>Induction Variable Elimination </li></ul></ul><ul><ul><li>Partial Redundancy Elimination, ... </li></ul></ul><ul><li>Memory/Power: </li></ul><ul><ul><li>Initial memory assignment </li></ul></ul><ul><ul><li>Data-Cache Optimizations </li></ul></ul><ul><ul><li>Loop blocking, skewing, etc. </li></ul></ul><ul><li>Transformations: </li></ul><ul><ul><li>Software Pipelining </li></ul></ul><ul><ul><li>Instruction Selection, Register Allocation </li></ul></ul><ul><ul><li>Scheduling (ILP) </li></ul></ul><ul><li>Optimizations: </li></ul><ul><ul><li>Tree Height Reduction </li></ul></ul><ul><ul><li>Strength Reduction </li></ul></ul><ul><ul><li>Spill code optimization </li></ul></ul>
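As a concrete instance of the pre-scheduling optimizations listed on the slide above, here is a hedged before/after sketch of induction-variable elimination combined with strength reduction on a simple strided store loop; the code is illustrative C, not the output of any particular compiler in this tutorial.

/* Before: the index expression i*4 implies a multiply (or repeated
   address arithmetic) on every iteration. */
void scale_before(int *a, int n, int k) {
    for (int i = 0; i < n; i++)
        a[i * 4] = k;
}

/* After induction-variable elimination + strength reduction:
   the multiply becomes a constant-stride pointer increment. */
void scale_after(int *a, int n, int k) {
    int *p   = a;
    int *end = a + 4 * n;
    for (; p < end; p += 4)
        *p = k;
}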
    55. 55. Compiler Flow (Back-End) [Figure: Low-level IR, Code Generation, Object Code; the ADL supplies memory-subsystem/power, operation-format/image, and call-convention/register information] <ul><li>Post-scheduling optimizations: </li></ul><ul><ul><li>Peephole Optimizations </li></ul></ul><ul><ul><li>Machine Specific Optimizations </li></ul></ul><ul><li>Memory/Power: </li></ul><ul><ul><li>Block reordering </li></ul></ul><ul><ul><li>Instruction-Cache Optimizations </li></ul></ul><ul><ul><li>Final memory assignment </li></ul></ul><ul><li>InterProcedural: </li></ul><ul><ul><li>Register Allocation </li></ul></ul><ul><ul><li>Call convention implementation </li></ul></ul><ul><ul><li>Global references aggregation </li></ul></ul>
    56. 56. Retargetable Compilers Survey (1) <ul><li>CHESS (using nML ADL) </li></ul><ul><ul><li>Mainly targeted to fixed-point DSPs and ASIPs </li></ul></ul><ul><ul><li>Performs instruction selection, register allocation, and scheduling. </li></ul></ul><ul><ul><li>Fixed phase ordering </li></ul></ul><ul><ul><li>ILP constraints described as a set of legal combinations of operations </li></ul></ul><ul><li>AVIV (using ISDL ADL) </li></ul><ul><ul><li>Mainly targeted to VLIW processors </li></ul></ul><ul><ul><li>Optimizes for minimal code size </li></ul></ul><ul><ul><li>Branch-and-bound techniques for concurrent scheduling, resource allocation </li></ul></ul><ul><ul><li>ILP constraints described as a set of Boolean rules which must be satisfied </li></ul></ul>
    57. 57. Retargetable Compilers Survey (2) <ul><li>ELCOR (using MDES ADL) </li></ul><ul><ul><li>Mainly targeted to VLIW architectures with speculative execution </li></ul></ul><ul><ul><li>Used for design space exploration of high-performance processors in the Trimaran system </li></ul></ul><ul><ul><li>ILP constraints are explicitly described as reservation tables </li></ul></ul><ul><li>EXPRESS (using EXPRESSION ADL) </li></ul><ul><ul><li>Targeted to a wide range of processor architectures such as RISC, VLIW, Superscalar, and DSP </li></ul></ul><ul><ul><li>Mutation-Scheduling based dynamic phase ordering capability </li></ul></ul><ul><ul><li>ILP constraints are automatically detected using reservation tables </li></ul></ul>
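Since both ELCOR and EXPRESS express ILP constraints as reservation tables, the fragment below sketches one plausible C encoding: a bit per (resource, cycle) slot plus a structural-hazard check between two operations issued a given number of cycles apart. The layout is an assumption for illustration only, not the actual MDES or EXPRESSION data structure.

#include <stdbool.h>
#include <stdint.h>

#define MAX_CYCLES 8   /* deepest pipeline stage modeled (assumed) */

/* One reservation table: bit r of slots[c] is set if the operation
   occupies resource r during cycle c relative to its issue cycle. */
typedef struct {
    uint32_t slots[MAX_CYCLES];
} ResTable;

/* Operations a and b, issued 'offset' cycles apart (b after a),
   conflict if any (resource, cycle) slot is claimed by both. */
bool res_table_conflict(const ResTable *a, const ResTable *b, int offset) {
    for (int c = 0; c < MAX_CYCLES; c++) {
        int bc = c - offset;   /* b's relative cycle at absolute cycle c */
        if (bc >= 0 && bc < MAX_CYCLES && (a->slots[c] & b->slots[bc]))
            return true;
    }
    return false;
}

A list scheduler can run such a check before committing an operation to an issue slot; the difference between the surveyed systems is whether the tables are described explicitly in the ADL (as in MDES) or detected automatically from the specification (as in EXPRESSION).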
    58. 58. Retargetable Compilers Survey (3) <ul><li>Other Retargetable Compilers </li></ul><ul><ul><li>The FlexWare CAD system </li></ul></ul><ul><ul><ul><li>Supports compiler generation for DSPs and ASIPs (TIMA, France) </li></ul></ul></ul><ul><ul><li>The Valen-C compiler system </li></ul></ul><ul><ul><ul><li>Supports bit-width optimization of RISC-like ASIPs (Kyushu Univ., Japan) </li></ul></ul></ul><ul><ul><li>The Zephyr compiler system </li></ul></ul><ul><ul><ul><li>Supports development of custom compilers (Univ. of Virginia, USA) </li></ul></ul></ul><ul><ul><li>SUIF Compiler Infrastructure </li></ul></ul><ul><ul><ul><li>Open compiler infrastructure (Stanford Univ., USA) </li></ul></ul></ul><ul><ul><li>Other Efforts discussed at this workshop </li></ul></ul><ul><ul><ul><li>Dortmund, EPFL, IITB, IITD, ... </li></ul></ul></ul>
    59. 59. Software Tools <ul><li>Estimators </li></ul><ul><ul><li>Code Size, Memory Requirements, Performance, Power etc. </li></ul></ul><ul><li>Compilers </li></ul><ul><ul><li>Coarse-grain (task-level) and ILP (microarchitecture-level) </li></ul></ul><ul><li>Assembler, Linker, Loader </li></ul><ul><li>Profiler, Debugger, Code Development Environment </li></ul><ul><li>Simulators </li></ul><ul><ul><li>Bus-functional, instruction-, cycle-, and phase- accurate, structural </li></ul></ul><ul><li>Real Time Operating Systems (RTOS) </li></ul><ul><li>Validation/Verification </li></ul>
    60. 60. Simulators/Simulator Generators [Figure: the same ADL-based SOC codesign flow as on the Retargetable Compilers slide] Support automatic generation of simulators
    61. 61. Simulators/Simulator Generators <ul><li>Issues: </li></ul><ul><ul><li>Level of abstraction </li></ul></ul><ul><ul><ul><li>Functional (no timing information) </li></ul></ul></ul><ul><ul><ul><li>Cycle-accurate (cycle level timing information) </li></ul></ul></ul><ul><ul><ul><li>Bit-, Phase-accurate (detailed timing information) </li></ul></ul></ul><ul><ul><li>Simulation model </li></ul></ul><ul><ul><ul><li>Interpretation based (easy to generate, flexible but slower) </li></ul></ul></ul><ul><ul><ul><li>Compilation based (fast but not very flexible) </li></ul></ul></ul><ul><ul><ul><ul><li>Static compiled simulation </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Dynamic compiled simulation </li></ul></ul></ul></ul><ul><ul><li>Interoperability (the ability to integrate with other tools) </li></ul></ul><ul><ul><li>Ability to simulate a wide variety of architectures </li></ul></ul>Faster, less detail Slower, more detail
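To ground the interpretation-based vs. compilation-based distinction, here is a minimal, hypothetical interpretive instruction-set simulator in C for a toy three-operation ISA: every instruction is fetched and decoded again each time it executes, which is what makes interpretation flexible and easy to generate but slower than compiled simulation, where decoding is done ahead of time (statically or dynamically).

#include <stdint.h>
#include <stdio.h>

enum { OP_ADD, OP_LOAD, OP_HALT };                 /* toy ISA (assumed) */

typedef struct { uint8_t op, rd, rs1, rs2; } Insn;

static int32_t reg[16];
static int32_t mem[256];

/* Interpretive simulation: fetch/decode/execute repeated per dynamic
   instruction, even when the same static instruction runs many times. */
static void simulate(const Insn *prog) {
    for (uint32_t pc = 0; ; pc++) {
        Insn i = prog[pc];                         /* fetch            */
        switch (i.op) {                            /* decode + execute */
        case OP_ADD:  reg[i.rd] = reg[i.rs1] + reg[i.rs2]; break;
        case OP_LOAD: reg[i.rd] = mem[reg[i.rs1] & 0xff];  break;
        case OP_HALT: return;
        }
    }
}

int main(void) {
    Insn prog[] = { { OP_ADD, 1, 2, 3 }, { OP_HALT, 0, 0, 0 } };
    reg[2] = 4; reg[3] = 5;
    simulate(prog);
    printf("r1 = %d\n", reg[1]);                   /* prints r1 = 9 */
    return 0;
}

A compiled simulator would instead translate prog into host code once, removing the per-instruction decode switch at the cost of flexibility, which matches the trade-off listed above.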
    62. 62. Simulators/Simulator Generators Survey (1) <ul><li>GENSIM/XSIM (using ISDL ADL) </li></ul><ul><ul><li>Mainly targeted to VLIW architectures </li></ul></ul><ul><ul><li>Generate cycle-accurate, bit-true Instruction Level Simulator </li></ul></ul><ul><ul><li>Interpretation based, but perform disassembly off-line to improve speed </li></ul></ul><ul><ul><li>Used for architecture evaluation </li></ul></ul><ul><li>SIMPRESS (using EXPRESSION ADL) </li></ul><ul><ul><li>Targeted to wide range of processor architectures such as RISC, VLIW, Superscalar, and DSP </li></ul></ul><ul><ul><li>Generate cycle-accurate, structural simulator </li></ul></ul><ul><ul><li>Interpretation based. </li></ul></ul><ul><ul><li>Used for design space exploration and architecture evaluation </li></ul></ul>
    63. 63. Simulators/Simulator Generators Survey (2) <ul><li>LISA/S (using LISA ADL) </li></ul><ul><ul><li>Mainly targeted to DSPs </li></ul></ul><ul><ul><li>Generate bit-true, cycle-accurate, static compiled simulators </li></ul></ul><ul><ul><li>Explicit support for modeling pipeline behaviors such as interlocking, bypassing, stalls, flushes, etc. </li></ul></ul><ul><li>RADL (Rockwell Semiconductor, USA) </li></ul><ul><ul><li>Extension of the LISA approach </li></ul></ul><ul><ul><li>Mainly targeted to DSPs </li></ul></ul><ul><ul><li>Generate phase-accurate simulators </li></ul></ul><ul><ul><li>Explicit support for modeling delay slots, interrupts, zero-overhead loops, hazards and multi-pipelines in addition to features of LISA </li></ul></ul>
    64. 64. Simulators/Simulator Generators Survey (3) <ul><li>Other Retargetable Simulators/Simulator Generators: </li></ul><ul><ul><li>HPL-PD simulator (using the MDES ADL) </li></ul></ul><ul><ul><ul><li>Limited retargetability in the form of parameters such as number of FUs, etc. </li></ul></ul></ul><ul><ul><li>MIMOLA ADL </li></ul></ul><ul><ul><ul><li>Convert the processor description into a simulatable HDL model </li></ul></ul></ul><ul><ul><li>Insulin </li></ul></ul><ul><ul><ul><li>Uses a VHDL model of a generic parameterizable machine </li></ul></ul></ul><ul><ul><li>Several Commercial Offerings </li></ul></ul><ul><ul><ul><li>Axys, Lisa, Vast,…. </li></ul></ul></ul>
    65. 65. Software Tools <ul><li>Estimators </li></ul><ul><ul><li>Code Size, Memory Requirements, Performance, Power etc. </li></ul></ul><ul><li>Compilers </li></ul><ul><ul><li>Coarse-grain (task-level) and ILP (microarchitecture-level) </li></ul></ul><ul><li>Assembler, Linker, Loader </li></ul><ul><li>Profiler, Debugger, Code Development Environment </li></ul><ul><li>Simulators </li></ul><ul><ul><li>Bus-functional, instruction-, cycle-, and phase- accurate, structural </li></ul></ul><ul><li>Real Time Operating Systems (RTOS) </li></ul><ul><li>Validation/Verification </li></ul>
    66. 66. ADL-driven Validation/Verification [Figure: the same ADL-based SOC codesign flow as on the Retargetable Compilers slide] Support validation/verification of architecture spec and implementation
    67. 67. Bottom-up Validation Approach [Figure: starting from the RTL, reverse engineering recovers a high-level description, which is related to the specification (an English document) only through manual verification and property checking]
    68. 68. ADL-driven Validation [Figure: an ADL description in EXPRESSION is property-checked against the specification (English document) and tied by equivalence checking to both the high-level description and the RTL, replacing the purely manual bottom-up path] Ref: papers from the EXPRESSION group at HLDVT99-01, VLSI02, DATE02 (Mishra et al.)
    69. 69. Outline <ul><li>Methodology for Architectural Exploration </li></ul><ul><li>Survey of Architectural Description Languages (ADLs) </li></ul><ul><li>Software Toolkit Generation </li></ul><ul><li>Architectural Exploration </li></ul><ul><li>Summary and Conclusions </li></ul>EXPRESSION ADL Toolkit/Framework
    70. 70. EXPRESSION: Our ADL Approach [Figure: the EXPRESSION ADL, together with memory libraries (cache, SRAM, prefetch buffer, frame buffer, EDO, on-chip, RDRAM, SDRAM) and processor libraries (VLIW, DSP, ASIP), drives a Toolkit Generator that produces the EXPRESS compiler and the SIMPRESS simulator/profiler; a generation phase and an application-driven exploration phase are coupled through profiling and verification feedback] EXPRESSION, EXPRESS, and SIMPRESS comprise the toolkit that aids the System Designer: compiler-in-the-loop architectural exploration
    71. 71. System-Level Exploration [Figure: an algorithm spec and its C implementation go through coarse-grain and algorithmic transformations with cost (memory, ...) and performance estimation into HW/SW partitioning; the SW side is compiled by EXPRESS into target code running with an RTOS kernel on the processor, the HW side goes through high-level synthesis into an ASIC (controller + datapath), and both share on-chip memory, ROM, and main memory]
    72. 72. MEMOREX: Memory Exploration Environment [Figure: a system spec in C is parsed (with semantics retention) into a control/data-flow graph; memory disambiguation and multi-dimensional dataflow analysis, memory estimation, transformations and memory optimizations, and virtual-to-physical memory mapping against a memory library yield a CDFG with real memories, which feeds HW/SW codesign: partitioning (SpecSyn), SW synthesis (EXPRESS), and HW synthesis (ISE, Synopsys), all under a user interface]
    73. 73. Software Toolkit for the System Designer <ul><ul><li>EXPRESS - An Extensible, Retargetable, Instruction-Level Parallelizing (ILP) Compiler </li></ul></ul><ul><ul><ul><ul><li>State-of-the-art ILP techniques: </li></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Resource Directed Loop Pipelining (RDLP), Trailblazing Percolation Scheduling (TiPS) </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><li>Mutation Scheduling: framework for dynamically exploring tradeoffs between transformations </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Detailed architecture model (for enhanced retargetability and optimizing capability) </li></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Automatic generation of operation conflict information (as Reservation Tables) from EXPRESSION </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>Very general speculation/predication </li></ul></ul></ul></ul></ul><ul><ul><li>SIMPRESS - A Retargetable, Cycle-accurate Simulator </li></ul></ul><ul><ul><ul><ul><li>Runs on the EXPRESS IR (compiler designers can use the simulator as a debugging tool) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Structural simulation (provides the System Designer with detailed statistics) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Highly retargetable (can be used to simulate VLIWs, DSPs, etc.) </li></ul></ul></ul></ul><ul><ul><li>V-SAT - A Visual Architecture Specification and Analysis Tool </li></ul></ul><ul><ul><ul><ul><li>Visual tool for easy specification of structural and instruction-set information </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Interfaces with SIMPRESS to collect detailed statistical information about the architecture </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Visual display of the statistics in an intuitive manner to aid architecture evaluation </li></ul></ul></ul></ul>
    74. 74. EXPRESS: Compiler Environment for Embedded Processors [Figure: a GCC-based front end with semantics retention feeds analysis, memory-hierarchy transformations, mutating transformations, and a retargetable back end targeting processors Proc 1 ... Proc n; the EXPRESSION ADL controls the flow, and SIMPRESS & V-SAT provide simulation, visualization, and interaction]
    75. 75. Outline <ul><li>Methodology for Architectural Exploration </li></ul><ul><li>Survey of Architectural Description Languages (ADLs) </li></ul><ul><li>Software Toolkit Generation </li></ul><ul><li>Architectural Exploration </li></ul><ul><li>Summary and Conclusions </li></ul><ul><li>Experiments: </li></ul><ul><li>- Pipelining </li></ul><ul><li>- Memory-aware compilation </li></ul><ul><li>- Memory arch exploration </li></ul>
    76. 76. The DLX Example Architecture
    77. 77. Design Space Exploration <ul><li>Designer targets various goals (power, area, perf) </li></ul><ul><ul><li>Often conflicting </li></ul></ul><ul><li>DSE allows trade-offs between these goals. </li></ul><ul><li>Explore changes to: </li></ul><ul><ul><li>processor/memory system architecture </li></ul></ul><ul><ul><ul><li>changing the pipeline structure </li></ul></ul></ul><ul><ul><ul><li>changing the data path structure </li></ul></ul></ul><ul><ul><ul><li>increasing parallelism </li></ul></ul></ul><ul><ul><ul><li>changing the memory components </li></ul></ul></ul><ul><ul><li>instruction set </li></ul></ul><ul><ul><ul><li>adding new operations (e.g., MAC) </li></ul></ul></ul><ul><li>DLX simulation </li></ul><ul><ul><ul><li>Pipeline stalled 53% of time, due to RAW data hazards </li></ul></ul></ul><ul><ul><ul><li>INT and FP Adder units are the most utilized </li></ul></ul></ul><ul><li>Explored several forwarding path placements </li></ul>
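As a rough illustration of why the forwarding paths explored on the next slides attack those RAW-hazard stalls, the toy model below counts the bubbles between a producer and an immediately dependent consumer with and without a bypass. The latencies and the write-back cycle are assumptions for the sketch, not the parameters of the DLX model used in these experiments.

#include <stdio.h>

/* Stall cycles seen by a consumer issued one cycle after its producer:
   without forwarding it waits until the result is written back and
   readable; with forwarding it waits only for the producing unit's
   latency. (Deliberately simplified single-issue model.) */
static int raw_stalls(int producer_latency, int writeback_cycle,
                      int has_forwarding) {
    int result_ready = has_forwarding ? producer_latency : writeback_cycle;
    int stalls = result_ready - 1;
    return stalls > 0 ? stalls : 0;
}

int main(void) {
    /* e.g., a 3-cycle multiply feeding an add, with results otherwise
       readable only after cycle 5 (assumed numbers). */
    printf("no bypass: %d stall cycles\n", raw_stalls(3, 5, 0));
    printf("bypass:    %d stall cycles\n", raw_stalls(3, 5, 1));
    return 0;
}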
    78. 78. 1. Forwarding path from All_mem_Latch to A1 2. Forwarding path Mem_WB_Latch to INT 3. Both (1) and (2) 4. Forwarding path All_mem_latch to INT and (1) 5. Forwarding path Mem_WB_Latch to A1 and (1) Example Design Space Exploration: Pipelining Exploits (mpy,fp_add) sequences
    79. 79. 1. Forwarding path All_mem_Latch to A1 2. Forwarding path Mem_WB_Latch to INT 3. Both (1) and (2) 4. Forwarding path All_mem_latch to INT and (1) 5. Forwarding path Mem_WB_Latch to A1 and (1) Exploits (ld,int_add) sequences Example Design Space Exploration: Pipelining
    80. 80. 1. Forwarding path All_mem_Latch to A1 2. Forwarding path Mem_WB_Latch to INT 3. Both (1) and (2) 4. Forwarding path All_mem_latch to INT and (1) 5. Forwarding path Mem_WB_Latch to A1 and (1) Exploits (mpy,fp_add) and (ld,int_add) sequences Example Design Space Exploration: Pipelining
    81. 81. 1. Forwarding path All_mem_Latch to A1 2. Forwarding path Mem_WB_Latch to INT 3. Both (1) and (2) 4. Forwarding path All_mem_latch to INT and (1) 5. Forwarding path Mem_WB_Latch to A1 and (1) Exploits (mpy,fp_add) and (mpy,int_add) sequences Example Design Space Exploration: Pipelining
    82. 82. 1. Forwarding path All_mem_Latch to A1 2. Forwarding path Mem_WB_Latch to INT 3. Both (1) and (2) 4. Forwarding path All_mem_latch to INT and (1) 5. Forwarding path Mem_WB_Latch to A1 and (1) Exploits (mpy,fp_add) and (ld,fp_add) sequences Example Design Space Exploration: Pipelining
    83. 83. DLX Pipeline DSE Results [Chart: results for the forwarding-path configurations across the benchmarks Innerp, Linear_eq, State_eq, Integrate, 1D_particle, and GLR]
    84. 84. DLX Pipelining Experiments Summary <ul><li>Forwarding paths added: </li></ul><ul><ul><li>average performance improvement: 15% </li></ul></ul><ul><li>Reduced the number of pipeline stages </li></ul><ul><ul><li>Multiply from 7 to 5 stages </li></ul></ul><ul><ul><li>FP Adder from 4 to 3 stages </li></ul></ul><ul><ul><li>average performance improvement: 6% </li></ul></ul><ul><li>Forwarding paths + reduced number of pipeline stages: </li></ul><ul><ul><li>average performance improvement: 25.9% </li></ul></ul><ul><li>Multi-issue version of DLX: </li></ul><ul><ul><li>4 instructions issued every cycle </li></ul></ul><ul><ul><li>average performance improvement: 11.7% </li></ul></ul>
    85. 85. Outline <ul><li>Methodology for Architectural Exploration </li></ul><ul><li>Survey of Architectural Description Languages (ADLs) </li></ul><ul><li>Software Toolkit Generation </li></ul><ul><li>Architectural Exploration </li></ul><ul><li>Summary and Conclusions </li></ul><ul><li>Experiments: </li></ul><ul><li>- Pipelining </li></ul><ul><li>- Memory-aware compilation </li></ul><ul><li>- Memory arch exploration </li></ul>
    86. 86. Memory-Aware Compilation <ul><li>Traditionally, memory system transparent to compiler: </li></ul><ul><ul><li>Scheduled all loads/stores assuming a uniform behavior </li></ul></ul><ul><li>However, memory operations intrinsically non-uniform: </li></ul><ul><ul><li>Modern DRAMs: Page-mode, burst-mode accesses, banking, pipelining </li></ul></ul><ul><ul><li>Caches: cache hits and misses have very different timing </li></ul></ul><ul><li>Our Approach: TIMGEN </li></ul><ul><ul><li>Provide accurate memory timing information to compiler </li></ul></ul><ul><ul><li>Allow compiler to globally hide latencies of lengthy memory operations. </li></ul></ul><ul><ul><li>Generate significant performance improvements </li></ul></ul><ul><ul><li>Two instances: </li></ul></ul><ul><ul><ul><li>DRAM Efficient Access Modes (page, burst-mode accesses) </li></ul></ul></ul><ul><ul><ul><li>In the presence of caches: Cache Miss Traffic Management </li></ul></ul></ul>
    87. 87. Exploiting DRAM Access Modes in Memory-Aware Compiler <ul><li>Allow Compiler to exploit page-mode, burst-mode accesses </li></ul><ul><li>DRAM access: </li></ul><ul><ul><li>Row-decode, Column-decode, Precharge </li></ul></ul><ul><li>Page-mode access: </li></ul><ul><ul><li>Consecutive accesses to the same row </li></ul></ul><ul><ul><li>Row-decode and precharge can be omitted. </li></ul></ul><ul><li>Burst-mode access: </li></ul><ul><ul><li>Starting from an initial address, a number of words are clocked out on consecutive cycles </li></ul></ul>[Timing diagrams: a normal DRAM access takes 8 cycles; a page-mode access to the already-open row takes 5 cycles]
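A hedged sketch of how this timing information could be exposed to the compiler's scheduler, using the cycle counts from the diagrams above (8 cycles for a full access, 5 for a page-mode hit; both taken as illustrative): a tiny per-bank model that returns the latency of a load depending on whether it falls in the currently open row.

#include <stdint.h>

#define ROW_BITS 10                      /* assumed row size: 1 KB */

typedef struct { int32_t open_row; } DramBank;   /* -1 = no open row */

int load_latency(DramBank *bank, uint32_t addr) {
    int32_t row = (int32_t)(addr >> ROW_BITS);
    if (row == bank->open_row)
        return 5;                        /* page-mode hit: skip row decode
                                            and precharge               */
    bank->open_row = row;
    return 8;                            /* full access: precharge +
                                            row decode + column decode  */
}

With a model like this, the scheduler can reorder independent loads so that accesses to the same row become consecutive and cheap, which is the effect quantified in the example on the next slide.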
    88. 88. Example: Exploiting DRAM Access Modes in a Memory-Aware Compiler for(i=0;i<9;i++){ a = a + x[i] + y[i]; b = b + z[i] + u[i]; } [Schedules compared on a time axis] No efficient access modes: 180 cyc; access-mode optimization: 84 cyc (114% gain); memory-aware compiler: 60 cyc (a further 40% gain)
    89. 89. Experiments exploiting DRAM access modes Dynamic cycle counts exploiting page-mode and burst-mode accesses in the compiler. Presented at Design Automation Conference (DAC) 2000.
    90. 90. MIST: Cache miss traffic management <ul><li>Cache misses: most time consuming operations </li></ul><ul><li>Traditionally, compiler assumed all memory accesses as cache hits, relying on the memory controller to account for the cache misses. </li></ul><ul><li>However, hiding latency of cache misses is crucial </li></ul><ul><li>Our approach: MIST. </li></ul><ul><ul><li>Allow compiler to perform global optimizations, and hide the latency of the cache misses. </li></ul></ul>Cache miss (20 cyc) Cache hit (2 cyc) Add (1 cyc)
    91. 91. Cache miss traffic management Example (cache line size: 4)

Original loop (120 cyc):
for(i=0;i<16;i++){ s=s+a[i]; }

Isolate cache misses (108 cyc, 11% gain):
for(i=0;i<16;i+=4){ s=s+a[i]; <== MISS s=s+a[i+1]; <== HIT s=s+a[i+2]; <== HIT s=s+a[i+3]; <== HIT }

Shift each cache miss to the previous iteration (87 cyc, 37% gain):
for(i=0;i<12;i+=4){ s=s+temp; s=s+a[i+1]; <== HIT s=s+a[i+2]; <== HIT s=s+a[i+3]; <== HIT temp=a[i+4]; <== MISS }

(The figure also shows the cache dependences that constrain this reordering.)
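The miss-shifted loop above leaves the prologue and epilogue implicit; a complete, hedged C version of the same rewrite for the 16-element array and 4-word cache line of the example could look like this (the cycle numbers on the slide come from the original experiment, not from this sketch).

int sum_shifted(const int a[16]) {
    int s = 0;
    int temp = a[0];                 /* prologue: take the first miss early */
    for (int i = 0; i < 12; i += 4) {
        s += temp;                   /* a[i], fetched in the previous iter. */
        s += a[i + 1];               /* HIT */
        s += a[i + 2];               /* HIT */
        s += a[i + 3];               /* HIT */
        temp = a[i + 4];             /* MISS for the next line, issued early
                                        so its latency overlaps the adds   */
    }
    s += temp;                       /* epilogue: last line a[12..15] */
    s += a[13];
    s += a[14];
    s += a[15];
    return s;
}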
    92. 92. Miss Traffic Management Experiments Dynamic cycle counts for MIST: Memory Miss Traffic Management Algorithm. Proc. International Conference on Computer Aided Design (ICCAD) 2000
    93. 93. Outline <ul><li>Methodology for Architectural Exploration </li></ul><ul><li>Survey of Architectural Description Languages (ADLs) </li></ul><ul><li>Software Toolkit Generation </li></ul><ul><li>Architectural Exploration </li></ul><ul><li>Summary and Conclusions </li></ul><ul><li>Experiments: </li></ul><ul><li>- Pipelining </li></ul><ul><li>- Memory-aware compilation </li></ul><ul><li>- Memory arch exploration </li></ul>
    94. 94. Embedded memories: the programmer’s viewpoint <ul><li>Register files </li></ul><ul><ul><li>Explicit usage in instruction set </li></ul></ul><ul><li>Caches, TLBs </li></ul><ul><ul><li>Fully implicit </li></ul></ul><ul><li>RAM buffers </li></ul><ul><ul><li>Explicitly controlled through special LD,ST instructions </li></ul></ul><ul><li>Reconfigurable memories </li></ul><ul><ul><li>Explicitly controlled through control instructions </li></ul></ul>For embedded systems: expose the memory architecture to the compiler
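To illustrate what explicit control means for the RAM-buffer case, here is a hedged C sketch that stages a hot coefficient table into an on-chip scratch-pad before a processing loop. The scratch-pad base address, its size, and the use of a plain address-mapped region (rather than special load/store instructions or intrinsics) are assumptions for the sketch; a real platform would use its own mechanism.

#include <string.h>

#define SPM_BASE  ((int *)0x40000000)    /* hypothetical on-chip SRAM address */
#define SPM_WORDS 256                    /* hypothetical scratch-pad capacity */

/* Copy the hot table on-chip once, then serve every access from the
   fast, software-managed scratch-pad instead of the cache/DRAM path.
   Assumes coeff holds at least SPM_WORDS entries. */
int fir_like(const int *coeff, const int *x, int n) {
    int *spm = SPM_BASE;
    memcpy(spm, coeff, SPM_WORDS * sizeof *spm);
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += spm[i % SPM_WORDS] * x[i];
    return acc;
}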
    95. 95. Memory Organizations and Architectures <ul><li>Traditional memory hierarchies </li></ul><ul><ul><li>Caching: spatial and temporal locality </li></ul></ul><ul><li>Embedded memories </li></ul><ul><ul><li>Architectural and circuit techniques </li></ul></ul><ul><li>Custom memory architectures </li></ul><ul><li>Other storage optimization examples </li></ul><ul><ul><li>Spatial locality (multiple banks) </li></ul></ul><ul><ul><li>Parsimony (compression) </li></ul></ul><ul><ul><li>Scratch-pad memories, register files,... </li></ul></ul>
    96. 96. Custom Memory Architectures <ul><li>Disk file systems: Parsons et al., Patterson et al.: use file access patterns to improve the file system. </li></ul><ul><li>High-level Synthesis </li></ul><ul><ul><li>Catthoor et al.: memory allocation, packing data structures into memories </li></ul></ul><ul><ul><li>Panda et al.: Scratch-pad on-chip SRAM together with cache </li></ul></ul><ul><ul><li>Bakshi et al.: memory exploration combining different port configurations </li></ul></ul><ul><li>Computer Architectures </li></ul><ul><ul><li>Jouppi, Kessler et al.: hardware stream buffers to enhance memory performance </li></ul></ul><ul><ul><li>Graphics processors: frame buffers, FIFOs, etc. </li></ul></ul>
    97. 97. APEX: Access Pattern based Memory Exploration <ul><li>Motivation: </li></ul><ul><ul><li>Majority of memory accesses generated by a few instructions </li></ul></ul><ul><ul><ul><li>e.g., Vocoder (15k LOC): just 15 instructions account for 62% of the memory accesses </li></ul></ul></ul><ul><ul><li>Customize the memory architecture for these accesses </li></ul></ul><ul><li>APEX Approach (Grun et al., ISSS 2001) </li></ul><ul><ul><li>Extract, analyze and cluster the most active Access Patterns in the application </li></ul></ul><ul><ul><li>Use a heuristic to prune the design space </li></ul></ul><ul><ul><ul><li>many possible mappings with different power/performance/cost points </li></ul></ul></ul><ul><ul><ul><li>Avoid simulating the entire design space </li></ul></ul></ul>[Grun ISSS2001]
    98. 98. Customizing Memory Architectures <ul><li>Opportunity for wide range of power, cost, performance </li></ul><ul><ul><li>Analyze application behavior (compile-time) </li></ul></ul><ul><ul><li>Map memory accesses to structures supporting access patterns </li></ul></ul>[Figure: a conventional CPU-cache-DRAM path vs. a customized organization where the CPU also reaches DRAM through a stream buffer, a linked-list buffer, and an SRAM]
    99. 99. Motivating Example <ul><li>Illustrative example: 2 cases </li></ul><ul><ul><li>1. Traditional Cache-only Memory Architecture </li></ul></ul><ul><ul><ul><li>All data structures handled by the cache </li></ul></ul></ul><ul><ul><li>2. APEX: Access Pattern-based Memory Customization </li></ul></ul><ul><ul><ul><li>Access Patterns go to Stream buffers, SRAMs, Linked-list, and self-indirect Memory Modules. </li></ul></ul></ul>for(i=0;i<1000;i++){ … = a[i] + …; } … for(i=0;i<1000;i++){ code = codetab[code]; } … while(…){ … p = p->next; } … for(I=0;I<1000;I++){ for(j=0;j<10;j++){ … = coeff[j] + …; } }
    100. 100. 1. Traditional Cache-only Memory Arch. for(i=0;i<1000;i++){ … = a[i] + …; } … for(i=0;i<1000;i++){ code = codetab[code]; } … while(…){ … p = p->next; } … for(I=0;I<1000;I++){ for(j=0;j<10;j++){ … = coeff[j] + …; } } <ul><li>All data structures handled by the cache </li></ul>[Figure: a[], codetab[], the heap, and coeff[] all served through one cache in front of DRAM]
    101. 101. 2. APEX: Access Pattern-based Memory Customization for(i=0;i<1000;i++){ … = a[i] + …; } … for(i=0;i<1000;i++){ code = codetab[code]; } … while(…){ … p = p->next; } … for(I=0;I<1000;I++){ for(j=0;j<10;j++){ … = coeff[j] + …; } } <ul><li>Mapping data structures to memories supporting their access modes: </li></ul><ul><ul><li>stream buffer, linked-list buffer, SRAM, and cache </li></ul></ul>[Figure: a[], codetab[], the heap, and coeff[] are distributed across the stream buffer, linked-list buffer, SRAM, and cache according to their access patterns] [Grun ISSS2001]
    102. 102. Cost/Perf Exploration: Compress
    103. 103. Memory Exploration: Compress (Perf. Paretos)
    104. 104. Perf/Power Exploration: Compress
    105. 105. Memory Exploration: Compress (Power Paretos)
    106. 106. Memory Organizations and Architectures <ul><li>Traditional memory hierarchies </li></ul><ul><ul><li>Caching: spatial and temporal locality </li></ul></ul><ul><li>Embedded memories </li></ul><ul><ul><li>Architectural and circuit techniques </li></ul></ul><ul><li>Custom memory architectures </li></ul><ul><li>Other storage optimization examples </li></ul><ul><ul><li>Spatial locality (multiple banks) </li></ul></ul><ul><ul><li>Parsimony (compression) </li></ul></ul><ul><ul><li>Scratch-pad memories, registers,.. </li></ul></ul>
    107. 107. Outline <ul><li>Methodology for Architectural Exploration </li></ul><ul><li>Survey of Architectural Description Languages (ADLs) </li></ul><ul><li>Software Toolkit Generation </li></ul><ul><li>Architectural Exploration </li></ul><ul><li>Summary and Conclusions </li></ul>
    108. 108. Summary <ul><li>Today we reviewed </li></ul><ul><ul><li>ADL-driven architectural exploration of programmable embedded systems </li></ul></ul><ul><ul><ul><li>methodology, ADL survey, toolkit generation, sample experiments </li></ul></ul></ul><ul><li>Tremendous opportunity for architectural exploration </li></ul><ul><ul><li>Application-specific customization </li></ul></ul><ul><ul><ul><li>Performance, power, size variations </li></ul></ul></ul><ul><ul><ul><li>Processor, coprocessor, memory co-exploration </li></ul></ul></ul><ul><li>Key technologies required </li></ul><ul><ul><li>ADL as an executable specification of the architecture </li></ul></ul><ul><ul><ul><li>toolkit generation, validation/verification,... </li></ul></ul></ul><ul><ul><ul><li>Highly tunable/retargetable compiler technology </li></ul></ul></ul><ul><ul><li>Compiler-in-the-loop architectural evaluation </li></ul></ul><ul><ul><li>Application-Architecture co-evolution </li></ul></ul>
    109. 109. Outlook <ul><li>Current Focus: </li></ul><ul><ul><li>Language-driven SW toolkit generation (ADL=>compiler, simulator,…) </li></ul></ul><ul><ul><li>Memory issues for embedded systems-on-chip: organization, exploration </li></ul></ul><ul><ul><ul><li>performance, power, size </li></ul></ul></ul><ul><ul><li>Flexible, powerful compilation environment for processor-core based designs </li></ul></ul><ul><ul><ul><li>compiler as an exploration tool, and as a software synthesis tool </li></ul></ul></ul><ul><ul><li>Data and Instruction cache sizing for embedded applications </li></ul></ul><ul><ul><li>Estimators, tight bounds on WCET for real-time applications using caches </li></ul></ul><ul><li>Future Directions </li></ul><ul><ul><li>Memory/S-O-C architectures for Embedded DRAM/embedded logic </li></ul></ul><ul><ul><li>Simulation/compilation environment for multiprocessors and novel memory hierarchies on chip </li></ul></ul><ul><ul><li>Customized OS support </li></ul></ul><ul><ul><li>Tight coupling between arch, compiler, CAD, PP and OS </li></ul></ul>