1. A Compiler Approach to Fast Hardware Design Space Exploration in FPGA-based Systems
   Byoungro So, Mary W. Hall and Pedro C. Diniz
   Information Sciences Institute, University of Southern California
   4676 Admiralty Way, Suite 1001, Marina del Rey, California 90292
   {bso,mhall,pedro}@isi.edu

   ABSTRACT: This paper describes an automated approach to hardware design space exploration, through a collaboration between parallelizing compiler technology and high-level synthesis tools. We present a compiler algorithm that automatically explores the large design spaces resulting from the application of several program transformations commonly used in ...

   1. INTRODUCTION: The extreme flexibility of Field Programmable Gate Arrays (FPGAs) has made them the medium of choice for fast hardware prototyping and a popular vehicle for the realization of custom computing machines. FPGAs are composed of thousands of small programmable logic cells dynamically interconnected to allow the implementation of any logic function. Tremendous growth in device capacity has ...
2. Abstract
   - hardware design space exploration
   - parallelizing compiler techniques
   - high-level synthesis tools
   - designing a loop nest computation
   - synthesis estimation techniques
   - evaluated with DEFACTO on five multimedia kernels
   This technology thus significantly raises the level of abstraction for hardware design and explores a design space much larger than is feasible for a human designer.
3. DEFACTO
   - parallelizing compiler technology (in SUIF) combined with hardware synthesis tools [9]
   - takes an application written in C or FORTRAN, and performs pre-processing and several common optimizations
   - in a second step, the code is partitioned into what will execute in software on the host and what will execute on the FPGAs
   [Fig. 1. DEFACTO Design Flow: Source Code -> General Compiler Optimizations -> Design Space Exploration (Partitioning; Memory Access Parallelization; Loop Transformations: Permutation, Unrolling, Tiling; Memory Access Protocols; Reuse Analysis; Scalar Replacement) -> SUIF2VHDL -> Estimation (Target Architecture Library Functions) -> Logic Synthesis, Place & Route -> Host CPU + FPGA-boards; iterate until a good design is reached.]
4. Contributions
   - a compiler algorithm for design space exploration that relies on behavioral synthesis estimates
   - applies loop transformations to explore a space-time trade-off
   - defines a balance metric for guiding design space exploration
   - results for five multimedia kernels
5. Behavioral Synth. vs. Compilers
   (Paper: "...optimizations on the resulting inner loop body, such as parallelizing and pipelining operations and minimizing registers and operators to save space. However, deciding the unroll factor is left up to the programmer.")

   Table 1: Comparison of Behavioral Synthesis and Parallelizing Compiler Technologies.

   | Behavioral Synthesis                                             | Parallelizing Compilers                                                              |
   |------------------------------------------------------------------|--------------------------------------------------------------------------------------|
   | Optimizations only on scalar variables                           | Optimizations on scalars and arrays                                                  |
   | Optimizations only inside loop body                              | Optimizations inside loop body and across loop iterations                            |
   | Supports user-controlled loop unrolling                          | Analyses guide automatic loop transformations                                        |
   | Manages registers and inter-operator communication               | Optimizes memory accesses; evaluates trade-offs of different on- and off-chip storage |
   | Considers only a single FPGA                                     | System-level view: multiple FPGAs, multiple memories                                 |
   | Performs allocation, binding and scheduling of hardware resources | No knowledge of hardware implementation of computation                               |
6. Optimization Goal & Balance
   Optimization criteria:
   - the design must not exceed the capacity constraints of the system
   - the execution time should be minimized
   - for a given level of performance, FPGA space usage should be minimized
   Two metrics are used:
   - the synthesis estimates provide space usage
   - Balance = F/C (F: data fetch rate, C: data consumption rate)
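The balance test that drives the exploration can be sketched as a tiny helper. The names and numbers below are illustrative only; in DEFACTO, F and C are derived from behavioral synthesis estimates rather than passed in directly.

```c
/* Balance = F / C, where F is the data fetch rate the memories sustain
 * and C is the rate at which the datapath consumes data.
 * B < 1: memory bound; B > 1: compute bound; B == 1: balanced. */
typedef enum { MEMORY_BOUND, BALANCED, COMPUTE_BOUND } Boundedness;

double balance(double fetch_rate, double consumption_rate) {
    return fetch_rate / consumption_rate;
}

Boundedness classify(double b) {
    if (b < 1.0) return MEMORY_BOUND;   /* datapath is starved for data */
    if (b > 1.0) return COMPUTE_BOUND;  /* memories outpace the operators */
    return BALANCED;
}
```

A memory-bound design wastes operator capacity, a compute-bound one wastes bandwidth; the search of Section 5 steers toward B = 1.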
7. Analyses & Transformations
   Unroll-and-Jam:
   - unrolling one or more loops
   - fusing inner loop bodies
   Scalar Replacement:
   - eliminates true dependences when reuse is carried (not just by the innermost loop)
   Loop Peeling & Loop-Invariant Code Motion
   Data Layout and Array Renaming
8. Optimization Example: FIR (Figure 1)
   Unroll-and-jam involves unrolling one or more loops in the iteration space and fusing inner loop bodies together, as shown in Figure 1(b). Unrolling exposes operator parallelism to high-level synthesis. In the example, all of the multiplies can be performed in parallel. Two additions can subsequently be performed in parallel, followed by two more additions. Unroll-and-jam can also decrease the dependence distances for reused data accesses, which, when combined with scalar replacement discussed below, can be used to expose opportunities for parallel memory accesses.

   Scalar Replacement. Scalar replacement replaces array references by temporary scalar variables, so that high-level synthesis will exploit reuse in registers [5]. Our approach to scalar replacement closely matches previous work, which eliminates true dependences when reuse is carried by the innermost loop, for accesses in the affine domain with consistent dependences (i.e., constant dependence distances) [5]. There are, however, two differences: (1) we also eliminate unnecessary memory writes on output dependences; and, (2) we exploit reuse across all loops in the nest, not just the innermost loop. The latter difference stems from the observation that many, though not all, algorithms mapped to FPGAs have sufficiently small loop bounds or small reuse distances, and the number of registers that can be configured on an FPGA is sufficiently large. A more detailed description of our scalar replacement and register reuse analysis can be found in [9]. In the example in Figure 1(c), we see the results of scalar replacement, which illustrates some of the above differences.

   int S[96];
   int C[32];
   int D[64];
   for (j=0; j<64; j++)
     for (i=0; i<32; i++)
       D[j] = D[j] + (S[i+j] * C[i]);
   (a) Original code.

   for (j=0; j<64; j+=2)
     for (i=0; i<32; i+=2) {
       D[j]   = D[j]   + (S[i+j] * C[i]);
       D[j]   = D[j]   + (S[i+j+1] * C[i+1]);
       D[j+1] = D[j+1] + (S[i+j+1] * C[i]);
       D[j+1] = D[j+1] + (S[i+j+2] * C[i+1]);
     }
   (b) After unrolling the j loop and i loop by 1 (unroll factor 2) and jamming copies of the i loop together.

   for (j=0; j<64; j+=2) {   /* initialize D registers */
     d_0 = D[j];
     d_1 = D[j+1];
     for (i=0; i<32; i+=2) {
       if (j==0) {           /* initialize C registers */
         c_0_0 = C[i];
         c_1_0 = C[i+1];
       }
       S_0 = S[i+j+1];
       d_0 = d_0 + S[i+j] * c_0_0;    /* unroll(0,0) */
       d_0 = d_0 + S_0 * c_1_0;       /* unroll(0,1) */
       d_1 = d_1 + S_0 * c_0_0;       /* unroll(1,0) */
       d_1 = d_1 + S[i+j+2] * c_1_0;  /* unroll(1,1) */
       rotate_registers(c_0_0, ..., c_0_15);
       rotate_registers(c_1_0, ..., c_1_15);
     }
     D[j] = d_0;
     D[j+1] = d_1;
   }
   (c) After scalar replacement of accesses to C and D across both i and j loops.

   for (j=0; j<32; j++) {    /* initialize D registers */
     d_0 = D2[j];
     d_1 = D3[j];
     for (i=0; i<16; i++) {
       if (j==0) {           /* initialize C registers */
         c_0_0 = C0[i];
         c_1_0 = C1[i];
       }
       S_0 = S1[i+j];
       d_0 = d_0 + S0[i+j] * c_0_0;    /* unroll(0,0) */
       d_0 = d_0 + S_0 * c_1_0;        /* unroll(0,1) */
       d_1 = d_1 + S_0 * c_0_0;        /* unroll(1,0) */
       d_1 = d_1 + S0[i+j+1] * c_1_0;  /* unroll(1,1) */
       rotate_registers(c_0_0, ..., c_0_15);
       rotate_registers(c_1_0, ..., c_1_15);
     }
     D2[j] = d_0;
     D3[j] = d_1;
   }
   (d) Final code generated for FIR, including loop normalization and data layout optimization.

   Figure 1: Optimization Example: FIR.
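Unroll-and-jam is semantics-preserving, so Figure 1(a) and 1(b) must compute the same D. A minimal self-checking sketch (array sizes from the figure, test input invented; scalar replacement and register rotation omitted):

```c
#include <string.h>

#define NJ 64
#define NI 32

/* Original FIR loop nest, as in Figure 1(a). */
void fir_original(const int S[NJ + NI], const int C[NI], int D[NJ]) {
    for (int j = 0; j < NJ; j++)
        for (int i = 0; i < NI; i++)
            D[j] = D[j] + S[i + j] * C[i];
}

/* After unrolling j and i by a factor of 2 and jamming the copies of
 * the i loop, as in Figure 1(b). The four multiplies in the body are
 * independent, so synthesis can schedule them in parallel. */
void fir_unroll_and_jam(const int S[NJ + NI], const int C[NI], int D[NJ]) {
    for (int j = 0; j < NJ; j += 2)
        for (int i = 0; i < NI; i += 2) {
            D[j]     = D[j]     + S[i + j]     * C[i];
            D[j]     = D[j]     + S[i + j + 1] * C[i + 1];
            D[j + 1] = D[j + 1] + S[i + j + 1] * C[i];
            D[j + 1] = D[j + 1] + S[i + j + 2] * C[i + 1];
        }
}

/* Returns 1 if both versions agree on a simple test input. */
int fir_versions_agree(void) {
    int S[NJ + NI], C[NI], D1[NJ], D2[NJ];
    for (int k = 0; k < NJ + NI; k++) S[k] = k % 7;
    for (int k = 0; k < NI; k++)      C[k] = k % 5;
    memset(D1, 0, sizeof D1);
    memset(D2, 0, sizeof D2);
    fir_original(S, C, D1);
    fir_unroll_and_jam(S, C, D2);
    return memcmp(D1, D2, sizeof D1) == 0;
}
```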
9. Optimization Algorithm
   (Paper: "...in the general case. We address this problem by limiting the number of registers in Section 5.4.")

   Outline: Definitions / Saturation Point / Search Space Properties / Algorithm Description / Adjusting Number of On-chip Registers

   5.1 Definitions. We define a saturation point as a vector of unroll factors where the memory parallelism reaches the bandwidth of the architecture, such that the following property holds for the resulting unrolled loop body:

     sum over i in Reads  of width_i = C1 * sum over l in NumMemories of width_l
     sum over j in Writes of width_j = C2 * sum over l in NumMemories of width_l

   Here, C1 and C2 are integer constants. To simplify this discussion, let us assume that the access widths match the memory width, so that we are simply looking for an unroll factor that results in a multiple of NumMemories read and write accesses for the smallest values of C1 and C2. The saturation set, Sat, can then be determined as a function of the number of read and write accesses, R and W, in a single iteration of the loop nest and the unroll factor for each loop in the nest. We consider reads and writes separately because they will be scheduled separately. We are interested in determining the saturation point after ...
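Under that simplifying assumption, the smallest saturating unroll factor for a single loop is the one that makes the per-iteration access count a multiple of the number of memories. A hypothetical helper (names are illustrative, not DEFACTO's code):

```c
/* Smallest unroll factor u >= 1 such that u * accesses is an integer
 * multiple of num_memories, i.e. the smallest constant C1 (for reads)
 * or C2 (for writes) is reached. Assumes each access width equals the
 * memory width, as in the simplified discussion above. */
unsigned saturating_unroll(unsigned accesses, unsigned num_memories) {
    if (accesses == 0 || num_memories == 0) return 1;
    /* u = num_memories / gcd(accesses, num_memories) */
    unsigned a = accesses, b = num_memories;
    while (b != 0) { unsigned t = a % b; a = b; b = t; }  /* a = gcd */
    return num_memories / a;
}
```

For example, with 3 reads per iteration and 4 memories, unrolling by 4 yields 12 reads, the first multiple of 4; with 2 reads, unrolling by 2 suffices. Reads and writes would be computed separately, since they are scheduled separately.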
10. Search Algorithm
    "...stop the search, or it is compute bound and we continue. If it is compute bound, then we consider unroll factors that provide increased operator parallelism, in addition to memory parallelism. Thus, we first look for a loop that carries no dependence (i.e., for all d in D, d_i = 0). All unrolled iterations of such a loop can be executed in parallel. If such a loop i is found, then we set the unroll factor to Sat_i, assuming this unroll factor is in Sat. If no such loop exists, then we instead select an unroll factor that favors loops with the largest dependence distances, because such loops can perform in parallel computations between dependences. The details of how our algorithm selects the initial unroll factor in this case are beyond the scope of this paper, but the key insight is that we unroll all loops in the nest, with larger unroll factors for the loops carrying larger minimum nonzero dependence distances. The monotonicity property also applies when considering simultaneous unrolling for multiple loops, as long as the unroll factors for all loops are either increasing or decreasing.

    If the initial design is space constrained, we must reduce the unroll factor until the design size is less than the size constraint Capacity, resulting in a suboptimal design. The function FindLargestFit simply selects the largest unroll factor between the baseline design corresponding to no unrolling (called Ubase) and Uinit, regardless of balance, because this will maximize available parallelism.

    Assuming the initial design is compute bound, the algorithm increases the unroll factors until it reaches a design that is (1) memory bound; (2) larger than Capacity; or, (3) represents full unrolling of all loops in the nest (i.e., Ucurr = Umax), as follows. The function Increase(Uin) returns an unroll factor vector Uout such that
      (1) P(Uout) = 2 * P(Uin); and,
      (2) for all i, uin_i <= uout_i <= umax_i.
    If there are no such remaining unroll factor vectors, then Increase returns Uin.

    Search Algorithm:
    Input:  Code              /* an n-deep loop nest */
    Output: u1, ..., un       /* a vector of unroll factors */

    Ucurr = Uinit
    Umb = Umax
    ok = False
    while (!ok) do
        Code = Generate(Ucurr)
        Estimate = Synthesize(Code)
        B = Balance(Code, Estimate.Performance)
        /* first deal with space-constrained designs */
        if (Estimate.Space > Capacity) then
            if (Ucurr = Uinit) then
                Ucurr = FindLargestFit(Ubase, Ucurr)
                ok = True
            else
                Ucurr = SelectBetween(Ucb, Ucurr)
        else if (B = 1) then
            ok = True                       /* Balanced, so DONE! */
        else if (B < 1) then                /* memory bound */
            Umb = Ucurr
            if (Ucurr = Uinit) then
                ok = True
            else
                /* Balanced solution is between earlier size and this */
                Ucurr = SelectBetween(Ucb, Umb)
        else if (B > 1) then                /* compute bound */
            Ucb = Ucurr
            if (Umb = Umax) then
                /* Have only seen compute bound so far */
                Ucurr = Increase(Ucb)
            else
                /* Balanced solution is between earlier size and this */
                Ucurr = SelectBetween(Ucb, Umb)
            /* Check if no more points to search */
            if (Ucurr = Ucb) then ok = True
    end
    return Ucurr

    Figure 2: Algorithm for Design Space Exploration.

    In the first place, the design will be smaller and more likely to fit on chip, and secondly, space is freed up so that it can be used to increase the operator parallelism for designs that are compute bound. To adjust the number of on-chip registers, we can use loop tiling to tile the loop nest so that the localized iteration space within a tile matches the desired number of registers. If either a space-constrained or memory bound design is ...
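The control flow of Figure 2 can be sketched for the simplest case of a single loop with power-of-two unroll factors. The estimator below is a mock stand-in for Synthesize(); the capacity, saturation point, and cost model are all invented for illustration.

```c
/* A drastically simplified, single-loop rendition of the search:
 * double the unroll factor while the design is compute bound, and
 * stop when it is balanced, fully unrolled, or over capacity. */
#define CAPACITY 1000   /* hypothetical FPGA space budget */
#define U_MAX    64     /* full unrolling */
#define U_SAT    8      /* mock saturation point */

typedef struct { int space; double balance; } Estimate;

/* Mock estimator: space grows linearly with unrolling, and memory
 * parallelism (hence balance) improves until the memories saturate. */
static Estimate synthesize_mock(int u) {
    Estimate e;
    e.space = 100 * u;
    e.balance = (u >= U_SAT) ? 1.0 : (double)u / U_SAT;
    return e;
}

/* Returns the selected unroll factor. */
int explore(void) {
    int u = 1;
    while (1) {
        Estimate e = synthesize_mock(u);
        if (e.space > CAPACITY)          /* space constrained: back off */
            return u / 2 > 0 ? u / 2 : 1;
        if (e.balance >= 1.0)            /* balanced or memory bound: stop */
            return u;
        if (u >= U_MAX)                  /* fully unrolled */
            return u;
        u *= 2;                          /* compute bound: Increase() doubles */
    }
}
```

With these mock parameters the search visits u = 1, 2, 4, 8 and stops at the saturation point, having examined only four of the 64 possible factors; the real algorithm additionally bisects with SelectBetween() once it has seen both a compute-bound and a memory-bound design.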
11. Experimental Results
    Kernels: FIR / Matrix Multiply / String Pattern Matching / Jacobi Iteration / Sobel Edge Detection
    - SUIF compiler analyses: scalar replacement, data layout, array renaming, data reuse, unroll & jam, tiling
    - SUIF2VHDL translates the transformed SUIF into behavioral VHDL
    - invokes Mentor Graphics' Monet(TM) behavioral synthesis; metrics: area, number of clock cycles
    - the compiler currently fixes the clock period to be 40 ns
    [Figure 3: Compilation and Synthesis Flow — C Application -> SUIF Compiler Analyses -> Unroll Factor Determination -> Transformed SUIF -> SUIF2VHDL -> Behavioral VHDL -> Monet Behavioral Synthesis -> Area/Cycle Metrics -> Balance Calculation -> Balanced Design? (NO: iterate; YES: done).]
12. Result(1)
    [Figure 4: Balance, Execution Time and Area for Non-pipelined FIR — panels (a) Balance, (b) Execution Time, (c) Area; x-axis: Inner Loop Unroll Factor (0-32) or Space (log-scaled); one curve per Outer Loop Unroll Factor (1-64); "selected design" and "max space" marked.]
    [Figure 5: Balance, Execution Cycles and Area for Pipelined FIR — same panel layout.]
13. Result(2)
    [Figure 6: Balance, Execution Cycles and Area for Non-pipelined MM — panels (a) Balance, (b) Execution Time, (c) Area; x-axis: Inner Loop Unroll Factor (0-16) or Space (log-scaled); one curve per Outer Loop Unroll Factor (1-32); "selected design" and "max space" marked.]
    [Figure 7: Balance, Execution Cycles and Area for Pipelined MM — same panel layout.]
14. Result(3)
    [Figure-only slide: Balance, Execution Cycles and Area panels in the same layout as Figures 5-7.]
15. Result(4)
    [Figure 9: Balance, Execution Cycles and Area for Pipelined PAT.]
    [Figure 10: Balance, Execution Time and Area for Pipelined SOBEL — panels (a) Balance, (b) Execution Time, (c) Area; x-axis: Inner Loop Unroll Factor or Space (log-scaled); one curve per Outer Loop Unroll Factor (1-64); "selected design" and "max space" marked.]

    Table 2: Speedup on a single FPGA.

    | Program | Non-Pipelined | Pipelined |
    |---------|---------------|-----------|
    | FIR     | 7.67          | 17.26     |
    | MM      | 4.55          | 13.36     |
    | JAC     | 3.87          | 5.56      |
    | PAT     | 7.53          | 34.61     |
    | SOBEL   | 4.01          | 3.90      |

    "...saturation point, and then decreasing. The execution time is also monotonically nonincreasing, related to Observation 2. In all programs, our algorithm selects a design that is close to best in terms of performance, but uses relatively small unroll factors. Among the designs with comparable performance, in all cases our algorithm selected the design that consumes the smallest amount of space. As a result, we have shown that our approach meets the optimization goals set forth in Section 3. In most cases, the most balanced design is selected by the algorithm. When a less balanced design is selected, it is either because the more balanced design is before a saturation point (as for non-pipelined FIR), or is too large to fit on the FPGA (as for pipelined MM).

    Table 2 presents the speedup results of the selected design for each kernel as compared to the baseline, for both pipelined and non-pipelined designs. The baseline is the ...

    ...using heuristics based on the saturation point and balance, as described in Section 5. This reveals the effectiveness of the algorithm, as it finds the best design point having explored only a small fraction, 0.3%, of the design space consisting of all possible unroll factors for each loop. For larger design spaces, we expect the number of points searched relative to the size to be even smaller."
16. Related Work
    Synthesizing High-Level Constructs:
    - Handel-C (influenced by OCCAM)
    - SA-C
    Design Space Exploration:
    - Monet
    - Derrien/Rajopadhye
    Discussion
