A Compiler Approach to Fast Hardware Design Space
           Exploration in FPGA-based Systems

                               Byoungro So, Mary W. Hall and Pedro C. Diniz
                                         Information Sciences Institute
                                        University of Southern California
                                        4676 Admiralty Way, Suite 1001
                                        Marina del Rey, California 90292
                                                {bso,mhall,pedro}@isi.edu



ABSTRACT
This paper describes an automated approach to hardware design space
exploration, through a collaboration between parallelizing compiler
technology and high-level synthesis tools. We present a compiler algorithm
that automatically explores the large design spaces resulting from the
application of several program transformations commonly used in

1. INTRODUCTION
The extreme flexibility of Field Programmable Gate Arrays (FPGAs) has made
them the medium of choice for fast hardware prototyping and a popular vehicle
for the realization of custom computing machines. FPGAs are composed of
thousands of small programmable logic cells dynamically interconnected to
allow the implementation of any logic function. Tremendous growth in device
capacity has
Abstraction
    hardware design space exploration
      parallelizing compiler technique
      high-level synthesis tools
    designing a loop nest computation
      synthesis estimation techniques
    with DEFACTO, five multi-media kernels

This technology thus significantly raises the level of
abstraction for hardware design and explores a design space
much larger than is feasible for a human designer.
DEFACTO
    parallelizing compiler tech. (in SUIF)
    with hardware synthesis tools [9]

[...]tion written in C or FORTRAN, and performs pre-processing and several
common optimizations. In the second step, the code is partitioned into what
will execute in software on the host and what will execute on the FPGAs.
[Figure: flow diagram. Box labels: Program; General Compiler Optimizations;
Source Code Partitioning; Design Space Exploration (Loop Transformations:
Permutation, Unrolling, Tiling; Reuse Analysis; Scalar Replacement;
SUIF2VHDL; Estimation; "Good Design?" with No/Yes branches); Memory Access
Parallelization; Memory Access Protocols; Logic Synthesis; Place & Route;
Target Architecture; Library Functions; Host CPU; FPGA boards.]

                            Fig. 1. DEFACTO Design Flow.
Contributions

a compiler algorithm for design space
exploration that relies on behavioral
synthesis estimates
  applies loop transformations to explore a
  space-time trade-off
Defines a balance metric for guiding design
space exploration
results for five multimedia kernels
Behavioral Synth. vs. Compilers

[...] optimizations on the resulting inner loop body, such as parallelizing
and pipelining operations and minimizing registers and operators to save
space. However, deciding the unroll factor is left up to the programmer.

Behavioral Synthesis                     | Parallelizing Compilers
-----------------------------------------+-----------------------------------------
Optimizations only on scalar variables   | Optimizations on scalars and arrays
Optimizations only inside loop body      | Optimizations inside loop body and
                                         |   across loop iterations
Supports user-controlled loop unrolling  | Analyses guide automatic loop
                                         |   transformations
Manages registers and inter-operator     | Optimizes memory accesses; evaluates
  communication                          |   trade-offs of different storage on-
                                         |   and off-chip
Considers only single FPGA               | System-level view: multiple FPGAs,
                                         |   multiple memories
Performs allocation, binding and         | No knowledge of hardware
  scheduling of hardware resources       |   implementation of computation

Table 1: Comparison of Behavioral Synthesis and
Parallelizing Compiler Technologies.



Optimization Goal & Balance
Optimization Criteria
     the design must not exceed the capacity
     constraints of the system
     the execution time should be minimized
     for a given level of performance, FPGA space
     usage should be minimized
Using two metrics
     synthesis estimates provide space usage
     Balance = F/C (F: data fetch rate, C: data consumption rate)
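
As a minimal sketch of how the balance metric can be read, the C fragment below computes B = F/C and classifies a design. The thresholds follow the search algorithm later in the deck (B < 1 memory bound, B > 1 compute bound, B = 1 balanced). The names `balance`, `classify_balance`, and the tolerance parameter `tol` are assumptions made for illustration and do not come from the paper.

```c
#include <assert.h>

typedef enum { MEMORY_BOUND, BALANCED, COMPUTE_BOUND } BalanceClass;

/* B = F / C, with F the data fetch rate and C the data consumption rate.
 * Both rates must use the same unit (for example bits per cycle). */
double balance(double fetch_rate, double consume_rate) {
    assert(consume_rate > 0.0);
    return fetch_rate / consume_rate;
}

/* Classify a balance value. `tol` is a hypothetical tolerance, since
 * estimated rates rarely give exactly 1.0. */
BalanceClass classify_balance(double b, double tol) {
    double diff = b - 1.0;
    if (diff < 0.0) diff = -diff;
    if (diff <= tol) return BALANCED;
    return (b < 1.0) ? MEMORY_BOUND : COMPUTE_BOUND;
}
```

For example, a design that fetches 32 bits per cycle but could consume 64 has B = 0.5 and is classified as memory bound.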
Analyses & Transformations
Unroll-and-Jam
 unrolling one or more loops
 fusing inner loop bodies
Scalar Replacement
 eliminates true dependences when reuse
 is carried by any loop in the nest (not just the innermost loop)
Loop Peeling & Loop-Invariant Code Motion
Data Layout and Array Renaming
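
One way to picture data layout with array renaming is a cyclic split of an array across two memories, so that two neighbouring elements can be fetched in the same cycle. The sketch below assumes an even/odd split into two banks, which is how the renamed arrays in the FIR example (S0/S1, C0/C1) appear to be used. The function names are hypothetical and the real compiler may choose a different distribution.

```c
#include <assert.h>
#include <stddef.h>

/* Distribute src[0..n-1] cyclically over two banks:
 * even-indexed elements go to bank0, odd-indexed ones to bank1.
 * bank0 needs room for (n + 1) / 2 elements and bank1 for n / 2. */
void split_even_odd(const int *src, size_t n, int *bank0, int *bank1) {
    assert(src && bank0 && bank1);
    for (size_t k = 0; k < n; k++) {
        if (k % 2 == 0) bank0[k / 2] = src[k];
        else            bank1[k / 2] = src[k];
    }
}

/* Read logical element k of the original array from the two banks. */
int read_banked(const int *bank0, const int *bank1, size_t k) {
    return (k % 2 == 0) ? bank0[k / 2] : bank1[k / 2];
}
```

With this layout, logical elements 2k and 2k+1 live in different banks, so an unrolled loop body that touches both can issue the two reads in parallel.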
[...] unroll-and-jam, involves unrolling one or more loops in the iteration
space and fusing inner loop bodies together, as shown in Figure 1(b).
Unrolling exposes operator parallelism to high-level synthesis. In the
example, all of the multiplies can be performed in parallel. Two additions can
subsequently be performed in parallel, followed by two more additions.
Unroll-and-jam can also decrease the dependence distances for reused data
accesses, which, when combined with scalar replacement discussed below, can be
used to expose opportunities for parallel memory accesses.

Scalar Replacement. Scalar replacement replaces array references by accesses
to temporary scalar variables, so that high-level synthesis will exploit reuse
in registers [5]. Our approach to scalar replacement closely matches previous
work, which eliminates true dependences when reuse is carried by the innermost
loop, for accesses in the affine domain with consistent dependences (i.e.,
constant dependence distances) [5]. There are, however, two differences:
(1) we also eliminate unnecessary memory writes on output dependences; and,
(2) we exploit reuse across all loops in the nest, not just the innermost
loop. The latter difference stems from the observation that many, though not
all, algorithms mapped to FPGAs have sufficiently small loop bounds or small
reuse distances, and the number of registers that can be configured on an FPGA
is sufficiently large. A more detailed description of our scalar replacement
and register reuse analysis can be found in [9].

In the example in Figure 1(c), we see the results of scalar replacement, which
illustrates some of the above differences

    int S[96];
    int C[32];
    int D[64];
    for (j=0; j<64; j++)
      for (i=0; i<32; i++)
        D[j] = D[j] + (S[i+j] * C[i]);

(a) Original code.

    for (j=0; j<64; j+=2)
      for (i=0; i<32; i+=2) {
        D[j]   = D[j]   + (S[i+j]   * C[i]);
        D[j]   = D[j]   + (S[i+j+1] * C[i+1]);
        D[j+1] = D[j+1] + (S[i+j+1] * C[i]);
        D[j+1] = D[j+1] + (S[i+j+2] * C[i+1]);
      }

(b) After unrolling j loop and i loop by 1 (unroll factor 2) and jamming
    copies of i loop together.

    for (j=0; j<64; j+=2) { /* initialize D registers */
      d_0 = D[j];
      d_1 = D[j+1];
      for (i=0; i<32; i+=2) {
        if (j==0) { /* initialize C registers */
          c_0_0 = C[i];
          c_1_0 = C[i+1];
        }
        S_0 = S[i+j+1];
        d_0 = d_0 + S[i+j] * c_0_0;    /* unroll(0,0) */
        d_0 = d_0 + S_0 * c_1_0;       /* unroll(0,1) */
        d_1 = d_1 + S_0 * c_0_0;       /* unroll(1,0) */
        d_1 = d_1 + S[i+j+2] * c_1_0;  /* unroll(1,1) */
        rotate_registers(c_0_0, ... ,c_0_15);
        rotate_registers(c_1_0, ... ,c_1_15);
      }
      D[j] = d_0;
      D[j+1] = d_1;
    }

(c) After scalar replacement of accesses to C and D across both i and j loop.

    for (j=0; j<32; j++) { /* initialize D registers */
      d_0 = D2[j];
      d_1 = D3[j];
      for (i=0; i<16; i++) {
        if (j==0) { /* initialize C registers */
          c_0_0 = C0[i];
          c_1_0 = C1[i];
        }
        S_0 = S1[i+j];
        d_0 = d_0 + S0[i+j] * c_0_0;    /* unroll(0,0) */
        d_0 = d_0 + S_0 * c_1_0;        /* unroll(0,1) */
        d_1 = d_1 + S_0 * c_0_0;        /* unroll(1,0) */
        d_1 = d_1 + S0[i+j+1] * c_1_0;  /* unroll(1,1) */
        rotate_registers(c_0_0, ... ,c_0_15);
        rotate_registers(c_1_0, ... ,c_1_15);
      }
      D3[j] = d_1;
      D2[j] = d_0;
    }

(d) Final code generated for FIR, including loop normalization and data layout
    optimization.

Figure 1: Optimization Example: FIR.
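
The FIR variants of Figure 1(a) to 1(c) can be checked against each other with a small, self-contained C program. The sketch below is an illustration rather than the compiler's output: it models each group of 16 coefficient registers as an array that is rotated left by one position per inner iteration, which is an assumption about what `rotate_registers` does. The helper `fir_variants_agree` and the input pattern are hypothetical.

```c
#include <assert.h>
#include <string.h>

enum { NS = 96, NC = 32, ND = 64, NREG = NC / 2 };

/* Figure 1(a): original loop nest. */
void fir_original(const int S[NS], const int C[NC], int D[ND]) {
    for (int j = 0; j < ND; j++)
        for (int i = 0; i < NC; i++)
            D[j] = D[j] + S[i + j] * C[i];
}

/* Figure 1(b): both loops unrolled by a factor of 2, inner copies jammed. */
void fir_unroll_jam(const int S[NS], const int C[NC], int D[ND]) {
    for (int j = 0; j < ND; j += 2)
        for (int i = 0; i < NC; i += 2) {
            D[j]     = D[j]     + S[i + j]     * C[i];
            D[j]     = D[j]     + S[i + j + 1] * C[i + 1];
            D[j + 1] = D[j + 1] + S[i + j + 1] * C[i];
            D[j + 1] = D[j + 1] + S[i + j + 2] * C[i + 1];
        }
}

/* Assumed meaning of rotate_registers: rotate left by one position. */
static void rotate_registers(int r[NREG]) {
    int first = r[0];
    memmove(r, r + 1, (NREG - 1) * sizeof r[0]);
    r[NREG - 1] = first;
}

/* Figure 1(c): scalar replacement of D (d0, d1), S (s0) and C (c0, c1). */
void fir_scalar_replaced(const int S[NS], const int C[NC], int D[ND]) {
    int c0[NREG] = {0}, c1[NREG] = {0};
    for (int j = 0; j < ND; j += 2) {
        int d0 = D[j], d1 = D[j + 1];
        for (int i = 0; i < NC; i += 2) {
            if (j == 0) {            /* C is read from memory only once */
                c0[0] = C[i];
                c1[0] = C[i + 1];
            }
            int s0 = S[i + j + 1];   /* reused by two multiplies */
            d0 += S[i + j] * c0[0];
            d0 += s0 * c1[0];
            d1 += s0 * c0[0];
            d1 += S[i + j + 2] * c1[0];
            rotate_registers(c0);
            rotate_registers(c1);
        }
        D[j] = d0;
        D[j + 1] = d1;
    }
}

/* Hypothetical check: run all three variants on the same input. */
int fir_variants_agree(void) {
    int S[NS], C[NC], Da[ND], Db[ND], Dc[ND];
    for (int k = 0; k < NS; k++) S[k] = (k * 7) % 13 - 6;
    for (int k = 0; k < NC; k++) C[k] = (k * 5) % 11 - 5;
    for (int k = 0; k < ND; k++) Da[k] = Db[k] = Dc[k] = k;
    fir_original(S, C, Da);
    fir_unroll_jam(S, C, Db);
    fir_scalar_replaced(S, C, Dc);
    int same = memcmp(Da, Db, sizeof Da) == 0 && memcmp(Da, Dc, sizeof Da) == 0;
    assert(same);  /* a transformation must not change the result */
    return same;
}
```

After 16 inner iterations each register array has rotated through a full cycle, so the first register again holds the coefficient needed at the start of the next j iteration. This is why C has to be loaded from memory only when j is 0.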
Optimization Algorithm
    Saturation Point
    Search Space Properties
    Algorithm Description
    Adjusting Number of On-chip Registers

[...] in the general case. We address this problem by limiting the number of
registers in Section 5.4.

5.1 Definitions
We define a saturation point as a vector of unroll factors where the memory
parallelism reaches the bandwidth of the architecture, such that the following
property holds for the resulting unrolled loop body:

\[ \sum_{i \in Reads} width_i = C_1 \cdot \sum_{l \in NumMemories} width_l \]

\[ \sum_{j \in Writes} width_j = C_2 \cdot \sum_{l \in NumMemories} width_l \]

Here, C_1 and C_2 are integer constants. To simplify this discussion, let us
assume that the access widths match the memory width, so that we are simply
looking for an unroll factor that results in a multiple of NumMemories read
and write accesses for the smallest values of C_1 and C_2. The saturation set,
Sat, can then be determined as a function of the number of read and write
accesses, R and W, in a single iteration of the loop nest and the unroll
factor for each loop in the nest. We consider reads and writes separately
because they will be scheduled separately.
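
Under the simplification just described (access widths equal to the memory width), the saturation condition reduces to a divisibility question. The sketch below computes the smallest product of unroll factors P for which both P*R and P*W are multiples of NumMemories. It assumes that the number of accesses grows linearly with P, which ignores any accesses removed by scalar replacement, and the function names are hypothetical. It is not presented as the paper's own formula.

```c
#include <assert.h>

static unsigned gcd_u(unsigned a, unsigned b) {
    while (b != 0) { unsigned t = a % b; a = b; b = t; }
    return a;
}

static unsigned lcm_u(unsigned a, unsigned b) {
    return a / gcd_u(a, b) * b;
}

/* Smallest k >= 1 such that k * accesses is a multiple of num_memories.
 * With no accesses of this kind, any k works, so return 1. */
static unsigned smallest_multiplier(unsigned accesses, unsigned num_memories) {
    if (accesses == 0) return 1;
    return num_memories / gcd_u(accesses, num_memories);
}

/* Smallest unroll product P such that P*reads and P*writes are both
 * multiples of num_memories (reads and writes treated separately). */
unsigned saturation_product(unsigned reads, unsigned writes,
                            unsigned num_memories) {
    assert(num_memories > 0);
    return lcm_u(smallest_multiplier(reads, num_memories),
                 smallest_multiplier(writes, num_memories));
}
```

For a loop body with R = 2 reads and W = 1 write on a board with 4 memories, the sketch gives P = 4: four unrolled copies issue 8 reads and 4 writes, both multiples of 4.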
We are interested in determining the saturation point after
Algorithm

[...] stop the search, or it is compute bound and we continue. If it is
compute bound, then we consider unroll factors that provide increased operator
parallelism, in addition to memory parallelism. Thus, we first look for a loop
that carries no dependence (i.e., ∀d∈D d_i = 0). All unrolled iterations of
such a loop can be executed in parallel. If such a loop i is found, then we
set the unroll factor to Sat_i, assuming this unroll factor is in Sat.

If no such loop exists, then we instead select an unroll factor that favors
loops with the largest dependence distances, because such loops can perform in
parallel computations between dependences. The details of how our algorithm
selects the initial unroll factor in this case is beyond the scope of this
paper, but the key insight is that we unroll all loops in the nest, with
larger unroll factors for the loops carrying larger minimum nonzero dependence
distances. The monotonicity property also applies when considering
simultaneous unrolling for multiple loops as long as unroll factors for all
loops are either increasing or decreasing.

If the initial design is space constrained, we must reduce the unroll factor
until the design size is less than the size constraint Capacity, resulting in
a suboptimal design. The function FindLargestFit simply selects the largest
unroll factor between the baseline design corresponding to no unrolling
(called U_base), and U_init, regardless of balance, because this will maximize
available parallelism.

Assuming the initial design is compute bound, the algorithm increases the
unroll factors until it reaches a design that is (1) memory bound; (2) larger
than Capacity; or, (3) represents full unrolling of all loops in the nest
(i.e., U_curr = U_max), as follows.

The function Increase(U_in) returns unroll factor vector U_out such that
    (1) P(U_out) = 2 * P(U_in); and,
    (2) ∀i u_i^in ≤ u_i^out ≤ u_i^max.
If there are no such remaining unroll factor vectors, then Increase returns
U_in.

If either a space-constrained or memory bound design is [...]

[...] first place, the design will be smaller and more likely to fit on chip,
and secondly, space is freed up so that it can be used to increase the
operator parallelism for designs that are compute bound.

To adjust the number of on-chip registers, we can use loop tiling to tile the
loop nest so that the localized iteration space within a tile matches the
desired number of registers, [...]

    Search Algorithm:
    Input: Code              /* An n-deep loop nest */
    Output: <u1, ..., un>    /* a vector of unroll factors */

    Ucurr = Uinit
    Umb = Umax
    ok = False
    while (!ok) do
      Code = Generate(Ucurr)
      Estimate = Synthesize(Code)
      B = Balance(Code, Estimate.Performance)
      /* first deal with space-constrained designs */
      if (Estimate.Space > Capacity) then
        if (Ucurr = Uinit) then
          Ucurr = FindLargestFit(Ubase, Ucurr)
          ok = True
        else
          Ucurr = SelectBetween(Ucb, Ucurr)
      else if (B = 1) then ok = True /* Balanced, so DONE! */
      else if (B < 1) then /* memory bound */
        Umb = Ucurr
        if (Ucurr = Uinit) then ok = True
        else
          /* Balanced solution is between earlier size and this */
          Ucurr = SelectBetween(Ucb, Umb)
      else if (B > 1) then /* compute bound */
        Ucb = Ucurr
        if (Umb = Umax) then
          /* Have only seen compute bound so far */
          Ucurr = Increase(Ucb)
        else
          /* Balanced solution is between earlier size and this */
          Ucurr = SelectBetween(Ucb, Umb)
      /* Check if no more points to search */
      if (Ucurr = Ucb) ok = True
    end
    return Ucurr

Figure 2: Algorithm for Design Space Exploration.
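
The search in Figure 2 can be sketched in C for the simplest case of a single loop whose unroll factor is a power of two. Everything concrete here is an assumption made for illustration: the estimator `mock_estimate` stands in for Generate, Synthesize and Balance, `select_between` takes the midpoint of two exponents, `increase` doubles the unroll factor, and a flag replaces the `Umb = Umax` sentinel. The paper's version works on vectors of unroll factors and uses real synthesis estimates.

```c
#include <assert.h>
#include <stdbool.h>

/* What one estimation run reports for a candidate design. */
typedef struct {
    double space;    /* estimated area, same unit as capacity */
    double balance;  /* B = F / C */
} Estimate;

typedef Estimate (*EstimateFn)(int unroll);

/* Hypothetical estimator: area grows with the unroll factor and the
 * design is compute bound below unroll 8, memory bound above it. */
Estimate mock_estimate(int unroll) {
    Estimate e = { 100.0 * unroll, 8.0 / unroll };
    return e;
}

static bool is_balanced(double b) {
    double diff = b - 1.0;
    if (diff < 0.0) diff = -diff;
    return diff < 1e-9;
}

/* Unroll factors are handled as exponents: unroll = 1 << e. */
static int select_between(int lo, int hi) { return lo + (hi - lo) / 2; }
static int increase(int e, int e_max) { return e < e_max ? e + 1 : e; }

/* Largest exponent in [0, e] whose design fits; 0 (no unrolling) otherwise. */
static int find_largest_fit(EstimateFn est, int e, double capacity) {
    while (e > 0 && est(1 << e).space > capacity) e--;
    return e;
}

/* Returns the selected unroll factor. */
int search_unroll(EstimateFn est, int e_init, int e_max, double capacity) {
    assert(0 <= e_init && e_init <= e_max && e_max < 30);
    int cur = e_init, cb = 0, mb = e_max;
    bool seen_mb = false, ok = false;
    while (!ok) {
        Estimate e = est(1 << cur);
        if (e.space > capacity) {            /* space constrained */
            if (cur == e_init) {
                cur = find_largest_fit(est, cur, capacity);
                ok = true;
            } else {
                cur = select_between(cb, cur);
            }
        } else if (is_balanced(e.balance)) { /* balanced, done */
            ok = true;
        } else if (e.balance < 1.0) {        /* memory bound */
            mb = cur;
            seen_mb = true;
            if (cur == e_init) ok = true;
            else cur = select_between(cb, mb);
        } else {                             /* compute bound */
            cb = cur;
            cur = seen_mb ? select_between(cb, mb) : increase(cb, e_max);
        }
        if (cur == cb) ok = true;            /* no more points to search */
    }
    return 1 << cur;
}
```

With the mock estimator and ample capacity, the search walks through unroll factors 1, 2, 4 and stops at 8, where the balance is exactly 1. With a capacity of 350 it stops at 2, the largest compute-bound design that still fits.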
Experimental Result
    FIR/Matrix Multiply/String Pattern Matching/Jacobi Iteration/Sobel Edge
    Detection
    SUIF(SUIF2VHDL)
      invokes the Mentor Graphics' Monet(TM)
      the compiler currently fixes the clock period to be 40ns

[Figure: flow diagram. Box labels: C Application; SUIF; Compiler Analyses
(scalar replacement, data layout, array renaming, data reuse, unroll & jam,
tiling); Unroll Factor Determination; SUIF2VHDL (Transformed SUIF in,
Behavioral VHDL out); Monet Behavioral Synthesis (Metrics: Area, Number Clock
Cycles); Balance Calculation; "Balanced Design?" with NO/YES branches.]

                                   Figure 3: Compilation and Synthesis Flow
Result(1)

[Figure: three plots, one curve per Outer Loop Unroll Factor (1, 2, 4, 8, 16,
32, 64), with the selected design marked. (a) Balance vs. Inner Loop Unroll
Factor; (b) Execution Time: Execution Cycles vs. Inner Loop Unroll Factor;
(c) Area: Execution Cycles (log-scaled) vs. Space (log-scaled), with the max
space limit marked.]

                                   Figure 4: Balance, Execution Time and Area for Non-pipelined FIR.

[Figure: a second set of three plots with the same axes, curves and panel
titles: (a) Balance, (b) Execution Time, (c) Area. The caption is not included
in this excerpt.]

                                      Figure 5: Balance, Execution Cycles and Area for Pipelined FIR.


Figure 6: Balance, Execution Cycles and Area for Non-pipelined MM.
Panels: (a) Balance versus Inner Loop Unroll Factor; (b) Execution Time (Execution Cycles versus Inner Loop Unroll Factor); (c) Area (Execution Cycles versus Space, both log-scaled). Each curve corresponds to one Outer Loop Unroll Factor (1 to 32); the selected design is marked in each panel, and panel (c) also marks the maximum space.


Figure 7: Balance, Execution Cycles and Area for Pipelined MM.
Panels: (a) Balance versus Inner Loop Unroll Factor; (b) Execution Time (Execution Cycles versus Inner Loop Unroll Factor); (c) Area (Execution Cycles versus Space, both log-scaled). Each curve corresponds to one Outer Loop Unroll Factor (1 to 32); the selected design is marked in each panel, and panel (c) also marks the maximum space.
Figure 9: Balance, Execution Cycles and Area for Pipelined PAT.
Panels: (a) Balance versus Inner Loop Unroll Factor; (b) Execution Time (Execution Cycles versus Inner Loop Unroll Factor); (c) Area (Execution Cycles versus Space, both log-scaled). Each curve corresponds to one Outer Loop Unroll Factor (1 to 64); the selected design and the maximum space are marked.
[Plot: Balance and Execution Cycles versus Inner Loop Unroll Factor, one curve per Outer Loop Unroll Factor (1 to 64), and Execution Cycles versus Space (both log-scaled), with the maximum space and the selected design marked.]
                                                                                                                                                                                                                                                                                                     Space (log-scaled)
                                                                                                                                                                                                                                                                                                                          10
                                                                                                                                                                                                                                                                                                                            5




ecution Time                                       (a) Balance                                                                                                                     (c) Area
                                                                                                                                                                         (b) Execution Time                                                                                                     (c) Area
                                                   Figure 10: Balance, Execution Time and Area for Pipelined SOBEL.
Time and Area for Pipelined SOBEL.
uration point, and then decreasing. The execution time is
also monotonically nonincreasing, related to Observation 2.
In all programs, our algorithm selects a design that is close
to best in terms of performance, but uses relatively small
unroll factors. Among the designs with comparable
performance, in all cases our algorithm selected the design that
consumes the smallest amount of space. As a result, we
have shown that our approach meets the optimization goals
set forth in Section 3. In most cases, the most balanced
design is selected by the algorithm. When a less balanced
design is selected, it is either because the more balanced
design is before a saturation point (as for non-pipelined FIR),
or is too large to fit on the FPGA (as for pipelined MM).
   Table 2 presents the speedup results of the selected
design for each kernel as compared to the baseline, for both
pipelined and non-pipelined designs. The baseline is the

          Program   Non-Pipelined   Pipelined
          FIR            7.67         17.26
          MM             4.55         13.36
          JAC            3.87          5.56
          PAT            7.53         34.61
          SOBEL          4.01          3.90

          Table 2: Speedup on a single FPGA.

ing heuristics based on the saturation point and balance,
as described in section 5. This reveals the effectiveness of
the algorithm as it finds the best design point having only
explored a small fraction, only 0.3% of the design space
consisting of all possible unroll factors for each loop. For larger
design spaces, we expect the number of points searched
relative to the size to be even smaller.
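
The selection criteria reported in this section (performance close to the best, the smallest space among designs of comparable performance, and a design that fits on the FPGA) can be illustrated with a small Python sketch. This is not the authors' algorithm: it filters an already-evaluated list of design points exhaustively, whereas the paper prunes the search using the saturation point and balance. The `Design` fields, the `capacity` and `tolerance` parameters, and all numbers in the example are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Design:
    """One candidate design point (all field names are hypothetical)."""
    outer_unroll: int
    inner_unroll: int
    cycles: int  # estimated execution cycles
    area: int    # estimated space, in arbitrary units


def select_design(designs: List[Design], capacity: int,
                  tolerance: float = 0.05) -> Optional[Design]:
    """Return the smallest design whose cycle count is within `tolerance`
    of the fastest design that fits in `capacity`, or None if none fits."""
    # Discard designs that are too large for the device.
    fitting = [d for d in designs if d.area <= capacity]
    if not fitting:
        return None
    # Designs count as "comparable" if they are within `tolerance` of the best.
    best_cycles = min(d.cycles for d in fitting)
    comparable = [d for d in fitting
                  if d.cycles <= best_cycles * (1.0 + tolerance)]
    # Among comparable designs, prefer the smallest area (ties: fewer cycles).
    return min(comparable, key=lambda d: (d.area, d.cycles))


def speedup(baseline_cycles: int, design_cycles: int) -> float:
    """Speedup of a design relative to a baseline, as a ratio of cycles."""
    return baseline_cycles / design_cycles


# Made-up design points; they are not taken from the paper's measurements.
candidates = [
    Design(1, 1, 6000, 100),
    Design(2, 2, 3000, 250),
    Design(4, 4, 2000, 600),
    Design(8, 4, 1950, 1100),
    Design(8, 8, 1900, 2500),  # exceeds the capacity used below
]
chosen = select_design(candidates, capacity=2000)
print(chosen, speedup(candidates[0].cycles, chosen.cycles))
```

With these made-up numbers the sketch prefers the design with unroll factors 4 and 4 over the slightly faster but nearly twice as large design with factors 8 and 4, which mirrors the preference for small unroll factors described above.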
Related Work

Synthesizing High-Level Constructs
 Handel-C (influenced by OCCAM)
 SA-C
Design Space Exploration
 Monet
 Derrien/Rajopadhye
Discussion

 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 

A compiler approach_to_fast_hardware_design_exploration_in_fpga-based-systems

  • 1. A Compiler Approach to Fast Hardware Design Space Exploration in FPGA-based Systems
      Byoungro So, Mary W. Hall and Pedro C. Diniz
      Information Sciences Institute, University of Southern California
      4676 Admiralty Way, Suite 1001, Marina del Rey, California 90292
      {bso,mhall,pedro}@isi.edu
      ABSTRACT: This paper describes an automated approach to hardware design space exploration, through a collaboration between parallelizing compiler technology and high-level synthesis tools. We present a compiler algorithm that automatically explores the large design spaces resulting from the application of several program transformations commonly used in
      1. INTRODUCTION: The extreme flexibility of Field Programmable Gate Arrays (FPGAs) has made them the medium of choice for fast hardware prototyping and a popular vehicle for the realization of custom computing machines. FPGAs are composed of thousands of small programmable logic cells dynamically interconnected to allow the implementation of any logic function. Tremendous growth in device capacity has
  • 2. Abstraction hardware design space exploration parallelizing compiler technique high-level synthesis tools designing a loop nest computation synthesis estimation techniques with DEFACTO, five multi-media kernels This technology thus significantly raises the level of abstraction for hardware design and explores a design space much larger than is feasible for a human designer.
  • 3. DEFACTO
      Combines parallelizing compiler technology (in SUIF) with hardware synthesis tools [9].
      Excerpt (Pedro Diniz et al., p. 56): "... written in C or FORTRAN, and performs pre-processing and several common optimizations. In the second step, the code is partitioned into what will execute in software on the host and what will execute on the FPGAs."
      Fig. 1. DEFACTO Design Flow. Stages shown: Program Source Code; General Compiler Optimizations; Partitioning; Design Space Exploration; Memory Access Parallelization; Loop Transformations (Permutation, Unrolling, Tiling); Memory Access Protocols; Reuse Analysis; Scalar Replacement; SUIF2VHDL; Estimation; Logic Synthesis; Place & Route; Target Architecture; Library Functions; decision "Good?" (Yes/No); Host CPU; FPGA boards.
  • 4. Contributions
      A compiler algorithm for design space exploration that relies on behavioral synthesis estimates
      Applies loop transformations to explore a space-time trade-off
      Defines a balance metric for guiding design space exploration
      Results for five multimedia kernels
  • 5. Behavioral Synth. vs. Compilers
      "... optimizations on the resulting inner loop body, such as parallelizing and pipelining operations and minimizing registers and operators to save space. However, deciding the unroll factor is left up to the programmer."
      Table 1: Comparison of Behavioral Synthesis and Parallelizing Compiler Technologies.
      Behavioral Synthesis | Parallelizing Compilers
      Optimizations only on scalar variables | Optimizations on scalars and arrays
      Optimizations only inside loop body | Optimizations inside loop body and across loop iterations
      Supports user-controlled loop unrolling | Analyses guide automatic loop transformations
      Manages registers and inter-operator communication | Optimizes memory accesses; evaluates trade-offs of different storage on- and off-chip
      Considers only single FPGA | System-level view: multiple FPGAs, multiple memories
      Performs allocation, binding and scheduling of hardware resources | No knowledge of hardware implementation of computation
  • 6. Optimization Goal & Balance
      Optimization criteria:
        The design must not exceed the capacity constraints of the system
        The execution time should be minimized
        For a given level of performance, FPGA space usage should be minimized
      Two metrics are used:
        Space usage, provided by the estimation results
        Balance = F/C (F: data fetch rate, C: data consumption rate)
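The balance metric above can be illustrated with a few lines of C. This is a minimal sketch, not the authors' implementation: the function names `balance` and `classify`, the `BoundKind` enum and the tolerance parameter `tol` are all hypothetical, and both rates are assumed to be expressed in the same unit (for example bits per cycle). The labels follow the comments in the deck's search algorithm, where B < 1 is called memory bound and B > 1 compute bound.

```c
#include <assert.h>

/* Hypothetical labels for the three cases distinguished by the balance metric. */
typedef enum { MEMORY_BOUND, BALANCED, COMPUTE_BOUND } BoundKind;

/* Balance = F / C, with F the data fetch rate and C the data consumption rate.
 * Both rates are assumed to use the same unit. */
double balance(double fetch_rate, double consume_rate) {
    assert(consume_rate > 0.0); /* a design that consumes no data has no balance */
    return fetch_rate / consume_rate;
}

/* Classify a balance value; tol is an assumed tolerance around B = 1. */
BoundKind classify(double b, double tol) {
    if (b < 1.0 - tol) return MEMORY_BOUND;  /* data arrives slower than it is consumed */
    if (b > 1.0 + tol) return COMPUTE_BOUND; /* data arrives faster than it is consumed */
    return BALANCED;
}
```

For instance, a design that fetches 16 units per cycle but could consume 32 has a balance of 0.5 and would be treated as memory bound.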
  • 7. Analyses & Transformations
      Unroll-and-Jam
        unrolling one or more loops
        fusing inner loop bodies
      Scalar Replacement
        eliminates true dependences when reuse is carried (not just the innermost loop)
      Loop Peeling & Loop-Invariant
      Data Layout and Array Renaming
  • 8. "... unroll-and-jam, involves unrolling one or more loops in the iteration space and fusing inner loop bodies together, as shown in Figure 1(b). Unrolling exposes operator parallelism to high-level synthesis. In the example, all of the multiplies can be performed in parallel. Two additions can subsequently be performed in parallel, followed by two more additions. Unroll-and-jam can also decrease the dependence distances for reused data accesses, which, when combined with scalar replacement discussed below, can be used to expose opportunities for parallel memory accesses.
      Scalar Replacement. Scalar replacement replaces array references by accesses to temporary scalar variables, so that high-level synthesis will exploit reuse in registers [5]. Our approach to scalar replacement closely matches previous work, which eliminates true dependences when reuse is carried by the innermost loop, for accesses in the affine domain with consistent dependences (i.e., constant dependence distances) [5]. There are, however, two differences: (1) we also eliminate unnecessary memory writes on output dependences; and, (2) we exploit reuse across all loops in the nest, not just the innermost loop. The latter difference stems from the observation that many, though not all, algorithms mapped to FPGAs have sufficiently small loop bounds or small reuse distances, and the number of registers that can be configured on an FPGA is sufficiently large. A more detailed description of our scalar replacement and register reuse analysis can be found in [9].
      In the example in Figure 1(c), we see the results of scalar replacement, which illustrates some of the above differences ..."

      int S[96];
      int C[32];
      int D[64];
      for (j=0; j<64; j++)
        for (i=0; i<32; i++)
          D[j] = D[j] + (S[i+j] * C[i]);
      (a) Original code.

      for (j=0; j<64; j+=2)
        for (i=0; i<32; i+=2) {
          D[j] = D[j] + (S[i+j] * C[i]);
          D[j] = D[j] + (S[i+j+1] * C[i+1]);
          D[j+1] = D[j+1] + (S[i+j+1] * C[i]);
          D[j+1] = D[j+1] + (S[i+j+2] * C[i+1]);
        }
      (b) After unrolling j loop and i loop by 1 (unroll factor 2) and jamming copies of i loop together.

      for (j=0; j<64; j+=2) { /* initialize D registers */
        d_0 = D[j];
        d_1 = D[j+1];
        for (i=0; i<32; i+=2) {
          if (j==0) { /* initialize C registers */
            c_0_0 = C[i];
            c_1_0 = C[i+1];
          }
          S_0 = S[i+j+1];
          d_0 = d_0 + S[i+j] * c_0_0; /* unroll(0,0) */
          d_0 = d_0 + S_0 * c_1_0; /* unroll(0,1) */
          d_1 = d_1 + S_0 * c_0_0; /* unroll(1,0) */
          d_1 = d_1 + S[i+j+2] * c_1_0; /* unroll(1,1) */
          rotate_registers(c_0_0, ... ,c_0_15);
          rotate_registers(c_1_0, ... ,c_1_15);
        }
        D[j] = d_0;
        D[j+1] = d_1;
      }
      (c) After scalar replacement of accesses to C and D across both i and j loop.

      for (j=0; j<32; j++) { /* initialize D registers */
        d_0 = D2[j];
        d_1 = D3[j];
        for (i=0; i<16; i++) {
          if (j==0) { /* initialize C registers */
            c_0_0 = C0[i];
            c_1_0 = C1[i];
          }
          S_0 = S1[i+j];
          d_0 = d_0 + S0[i+j] * c_0_0; /* unroll(0,0) */
          d_0 = d_0 + S_0 * c_1_0; /* unroll(0,1) */
          d_1 = d_1 + S_0 * c_0_0; /* unroll(1,0) */
          d_1 = d_1 + S0[i+j+1] * c_1_0; /* unroll(1,1) */
          rotate_registers(c_0_0, ... ,c_0_15);
          rotate_registers(c_1_0, ... ,c_1_15);
        }
        D2[j] = d_0;
        D3[j] = d_1;
      }
      (d) Final code generated for FIR, including loop normalization and data layout optimization.
      Figure 1: Optimization Example: FIR.
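To make the FIR transformations concrete, the following self-contained C sketch runs the original loop nest, the unroll-and-jam version, and a version that additionally keeps the two D accumulators in scalars, then checks that all three produce the same output. It is an illustration under stated assumptions rather than the compiler's generated code: the function names, the helper `fir_versions_agree` and the test data are hypothetical, and the rotating C registers of the full scalar-replaced version are deliberately left out.

```c
#include <assert.h>
#include <string.h>

#define NS 96 /* samples      */
#define NC 32 /* coefficients */
#define ND 64 /* outputs      */

/* Original FIR loop nest. */
static void fir_original(const int S[NS], const int C[NC], int D[ND]) {
    for (int j = 0; j < ND; j++)
        for (int i = 0; i < NC; i++)
            D[j] = D[j] + S[i + j] * C[i];
}

/* Both loops unrolled by a factor of 2, with the copies of the i loop jammed. */
static void fir_unroll_jam(const int S[NS], const int C[NC], int D[ND]) {
    assert(ND % 2 == 0 && NC % 2 == 0); /* unroll factor must divide the bounds */
    for (int j = 0; j < ND; j += 2)
        for (int i = 0; i < NC; i += 2) {
            D[j]     = D[j]     + S[i + j]     * C[i];
            D[j]     = D[j]     + S[i + j + 1] * C[i + 1];
            D[j + 1] = D[j + 1] + S[i + j + 1] * C[i];
            D[j + 1] = D[j + 1] + S[i + j + 2] * C[i + 1];
        }
}

/* Same loop structure, with D[j], D[j+1] and the shared sample held in scalars. */
static void fir_scalar_d(const int S[NS], const int C[NC], int D[ND]) {
    assert(ND % 2 == 0 && NC % 2 == 0);
    for (int j = 0; j < ND; j += 2) {
        int d_0 = D[j];     /* one load per output instead of one per iteration */
        int d_1 = D[j + 1];
        for (int i = 0; i < NC; i += 2) {
            int s_0 = S[i + j + 1]; /* sample used by two of the four products */
            d_0 = d_0 + S[i + j] * C[i];
            d_0 = d_0 + s_0 * C[i + 1];
            d_1 = d_1 + s_0 * C[i];
            d_1 = d_1 + S[i + j + 2] * C[i + 1];
        }
        D[j] = d_0;         /* one store per output */
        D[j + 1] = d_1;
    }
}

/* Returns 1 when all three versions compute identical outputs on sample data. */
int fir_versions_agree(void) {
    int S[NS], C[NC];
    int d_a[ND] = {0}, d_b[ND] = {0}, d_c[ND] = {0};
    for (int k = 0; k < NS; k++) S[k] = (k % 7) - 3; /* arbitrary test samples */
    for (int k = 0; k < NC; k++) C[k] = (k % 5) + 1; /* arbitrary coefficients */
    fir_original(S, C, d_a);
    fir_unroll_jam(S, C, d_b);
    fir_scalar_d(S, C, d_c);
    return memcmp(d_a, d_b, sizeof d_a) == 0 && memcmp(d_a, d_c, sizeof d_a) == 0;
}
```

The largest sample index touched by the unrolled versions is 30 + 62 + 2 = 94, so the 96-element S array is sufficient without loop peeling.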
  • 9. Optimization Algorithm
      Outline: Definitions; Saturation Point; Search Space Properties; Algorithm Description; Adjusting Number of On-chip Registers
      "... in the general case. We address this problem by limiting the number of registers in Section 5.4.
      5.1 Definitions
      We define a saturation point as a vector of unroll factors where the memory parallelism reaches the bandwidth of the architecture, such that the following property holds for the resulting unrolled loop body:
      $$\sum_{i \in Reads} width_i = C_1 \ast \sum_{l \in NumMemories} width_l$$
      $$\sum_{j \in Writes} width_j = C_2 \ast \sum_{l \in NumMemories} width_l$$
      Here, $C_1$ and $C_2$ are integer constants. To simplify this discussion, let us assume that the access widths match the memory width, so that we are simply looking for an unroll factor that results in a multiple of NumMemories read and write accesses for the smallest values of $C_1$ and $C_2$. The saturation set, Sat, can then be determined as a function of the number of read and write accesses, R and W, in a single iteration of the loop nest and the unroll factor for each loop in the nest. We consider reads and writes separately because they will be scheduled separately.
      We are interested in determining the saturation point after ..."
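Under the simplification stated above (access widths equal to the memory width), finding a saturation point reduces to a small divisibility problem. The sketch below is an assumption-laden illustration, not the paper's procedure: the function name `saturation_unroll_product` is hypothetical, and it assumes that the number of read and write accesses grows linearly with the product of the unroll factors, which ignores any accesses removed by scalar replacement.

```c
#include <assert.h>

/* Smallest unroll-factor product P >= 1 such that P * reads and P * writes
 * are both multiples of num_memories. reads and writes are the access counts
 * (R and W) in one iteration of the original loop body. */
int saturation_unroll_product(int reads, int writes, int num_memories) {
    assert(reads >= 0 && writes >= 0 && num_memories > 0);
    for (int p = 1; p < num_memories; p++) {
        if ((p * reads) % num_memories == 0 && (p * writes) % num_memories == 0)
            return p;
    }
    /* P = num_memories always satisfies both conditions. */
    return num_memories;
}
```

With 2 reads and 1 write per iteration and 4 memories, the sketch returns 4: the unrolled body then issues 8 reads and 4 writes, matching the saturation property with C1 = 2 and C2 = 1.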
  • 10. Algorithm
      "... stop the search, or it is compute bound and we continue. If it is compute bound, then we consider unroll factors that provide increased operator parallelism, in addition to memory parallelism. Thus, we first look for a loop that carries no dependence (i.e., $\forall d \in D,\ d_i = 0$). All unrolled iterations of such a loop can be executed in parallel. If such a loop i is found, then we set the unroll factor to $Sat_i$, assuming this unroll factor is in Sat.
      If no such loop exists, then we instead select an unroll factor that favors loops with the largest dependence distances, because such loops can perform in parallel computations between dependences. The details of how our algorithm selects the initial unroll factor in this case are beyond the scope of this paper, but the key insight is that we unroll all loops in the nest, with larger unroll factors for the loops carrying larger minimum nonzero dependence distances. The monotonicity property also applies when considering simultaneous unrolling for multiple loops as long as unroll factors for all loops are either increasing or decreasing.
      If the initial design is space constrained, we must reduce the unroll factor until the design size is less than the size constraint Capacity, resulting in a suboptimal design. The function FindLargestFit simply selects the largest unroll factor between the baseline design corresponding to no unrolling (called Ubase), and Uinit, regardless of balance, because this will maximize available parallelism.
      Assuming the initial design is compute bound, the algorithm increases the unroll factors until it reaches a design that is (1) memory bound; (2) larger than Capacity; or, (3) represents full unrolling of all loops in the nest (i.e., Ucurr = Umax), as follows.
      The function Increase(Uin) returns unroll factor vector Uout such that
      (1) $P(U_{out}) = 2 \ast P(U_{in})$; and,
      (2) $\forall i,\ u_i^{in} \le u_i^{out} \le u_i^{max}$.
      If there are no such remaining unroll factor vectors, then Increase returns Uin.
      If either a space-constrained or memory bound design is ..."
      "... first place, the design will be smaller and more likely to fit on chip, and secondly, space is freed up so that it can be used to increase the operator parallelism for designs that are compute bound.
      To adjust the number of on-chip registers, we can use loop tiling to tile the loop nest so that the localized iteration space within a tile matches the desired number of registers, ..."

      Search Algorithm:
      Input: Code /* An n-deep loop nest */
      Output: <u1, ..., un> /* a vector of unroll factors */
      Ucurr = Uinit
      Umb = Umax
      ok = False
      while (!ok) do
        Code = Generate(Ucurr)
        Estimate = Synthesize(Code)
        B = Balance(Code, Estimate.Performance)
        /* first deal with space-constrained designs */
        if (Estimate.Space > Capacity) then
          if (Ucurr = Uinit) then
            Ucurr = FindLargestFit(Ubase, Ucurr)
            ok = True
          else
            Ucurr = SelectBetween(Ucb, Ucurr)
        else if (B = 1) then ok = True /* Balanced, so DONE! */
        else if (B < 1) then /* memory bound */
          Umb = Ucurr
          if (Ucurr = Uinit) then ok = True
          else
            /* Balanced solution is between earlier size and this */
            Ucurr = SelectBetween(Ucb, Umb)
        else if (B > 1) then /* compute bound */
          Ucb = Ucurr
          if (Umb = Umax) then
            /* Have only seen compute bound so far */
            Ucurr = Increase(Ucb)
          else
            /* Balanced solution is between earlier size and this */
            Ucurr = SelectBetween(Ucb, Umb)
        /* Check if no more points to search */
        if (Ucurr = Ucb) ok = True
      end
      return Ucurr
      Figure 2: Algorithm for Design Space Exploration.
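The control flow of the search in Figure 2 can be exercised on a toy model. The C sketch below is a one-dimensional simplification under loudly stated assumptions: a single scalar unroll factor stands in for the unroll-factor vector, the `Model` struct with its linear area and consumption formulas is invented for illustration and replaces the real Generate/Synthesize/Balance steps, `select_between` is assumed to be a plain midpoint, and `increase` doubles the factor or returns it unchanged when doubling would exceed the maximum. Integer rates are compared directly so that the B = 1 test is exact.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model standing in for synthesis estimates. */
typedef struct {
    int fetch_rate;       /* F: fixed once memory bandwidth is saturated      */
    int consume_per_copy; /* C(u) = consume_per_copy * u                      */
    int area_per_copy;    /* Space(u) = area_per_copy * u                     */
    int capacity;         /* FPGA capacity                                    */
    int u_base;           /* baseline: no unrolling                           */
    int u_init;           /* starting point of the search                     */
    int u_max;            /* full unrolling                                   */
} Model;

static int space(const Model *m, int u) { return m->area_per_copy * u; }

/* Sign of (B - 1): +1 compute bound, 0 balanced, -1 memory bound. */
static int balance_sign(const Model *m, int u) {
    int c = m->consume_per_copy * u;
    return (m->fetch_rate > c) - (m->fetch_rate < c);
}

/* Doubles the unroll factor, or returns it unchanged if no such point exists. */
static int increase(const Model *m, int u) { return (2 * u <= m->u_max) ? 2 * u : u; }

/* Assumed midpoint selection between two already-visited points. */
static int select_between(int lo, int hi) { return lo + (hi - lo) / 2; }

/* Largest factor in [lo, hi] that fits on chip; falls back to lo. */
static int find_largest_fit(const Model *m, int lo, int hi) {
    for (int u = hi; u > lo; u--)
        if (space(m, u) <= m->capacity) return u;
    return lo;
}

int explore(const Model *m) {
    assert(m->u_base <= m->u_init && m->u_init <= m->u_max);
    int u_curr = m->u_init, u_cb = m->u_init, u_mb = m->u_max;
    bool ok = false;
    while (!ok) {
        int sign = balance_sign(m, u_curr);
        if (space(m, u_curr) > m->capacity) {      /* space constrained */
            if (u_curr == m->u_init) {
                u_curr = find_largest_fit(m, m->u_base, u_curr);
                ok = true;
            } else {
                u_curr = select_between(u_cb, u_curr);
            }
        } else if (sign == 0) {                    /* balanced */
            ok = true;
        } else if (sign < 0) {                     /* memory bound */
            u_mb = u_curr;
            if (u_curr == m->u_init) ok = true;
            else u_curr = select_between(u_cb, u_mb);
        } else {                                   /* compute bound */
            u_cb = u_curr;
            u_curr = (u_mb == m->u_max) ? increase(m, u_cb)
                                        : select_between(u_cb, u_mb);
        }
        if (u_curr == u_cb) ok = true;             /* no more points to search */
    }
    return u_curr;
}
```

With a fetch rate of 16, a consumption rate of 2 per copy and ample capacity, the sketch doubles from 1 to 8 and stops there because the design is balanced. Shrinking the capacity to 500 area units (100 per copy) makes it back off from 8 and settle on 5, the largest compute-bound design that fits.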
  • 11. Experimental Result
      Kernels: FIR, Matrix Multiply, String Pattern Matching, Jacobi Iteration, Sobel Edge Detection
      SUIF (SUIF2VHDL) invokes Mentor Graphics' Monet™
      The compiler currently fixes the clock period to be 40ns
      Figure 3: Compilation and Synthesis Flow. Stages shown: C Application; SUIF Compiler Analyses (scalar replacement, data layout, array renaming, data reuse, unroll & jam, tiling); Unroll Factor Determination; Transformed SUIF; SUIF2VHDL; Behavioral VHDL; Monet Behavioral Synthesis; Metrics: Area, Number of Clock Cycles; Balance Calculation; decision "Balanced Design?" (YES/NO).
  • 12. Result(1)
      Figure 4: Balance, Execution Time and Area for Non-pipelined FIR. Panels: (a) Balance, (b) Execution Time, (c) Area.
      Figure 5: Balance, Execution Cycles and Area for Pipelined FIR. Panels: (a) Balance, (b) Execution Time, (c) Area.
      [Plots: balance and execution cycles versus inner loop unroll factor, and execution cycles versus space (both log-scaled), with one curve per outer loop unroll factor (1 to 64); the maximum space and the selected design are marked.]
  • 13. Result(2)
      Figure 5: Balance, Execution Cycles and Area for Pipelined FIR. Panels: (a) Balance, (b) Execution Time, (c) Area.
      Figure 6: Balance, Execution Cycles and Area for Non-pipelined MM. Panels: (a) Balance, (b) Execution Time, (c) Area.
      Figure 7: Balance, Execution Cycles and Area for Pipelined MM. Panels: (a) Balance, (b) Execution Time, (c) Area.
      [Plots: balance and execution cycles versus inner loop unroll factor, and execution cycles versus space (both log-scaled), with one curve per outer loop unroll factor; the maximum space and the selected design are marked.]
  • 14. Result(3)
      Figure 5: Balance, Execution Cycles and Area for Pipelined FIR. Panels: (a) Balance, (b) Execution Time, (c) Area.
      Figure 6: Balance, Execution Cycles and Area for Non-pipelined MM. Panels: (a) Balance, (b) Execution Time, (c) Area.
      Figure 7: Balance, Execution Cycles and Area for Pipelined MM. Panels: (a) Balance, (b) Execution Time, (c) Area.
      [Plots: balance and execution cycles versus inner loop unroll factor, and execution cycles versus space (both log-scaled), with one curve per outer loop unroll factor; the maximum space and the selected design are marked.]
  • 15. Result(4)
      Figure 9: Balance, Execution Cycles and Area for Pipelined PAT. Panels: (a) Balance, (b) Execution Time, (c) Area.
      Figure 10: Balance, Execution Time and Area for Pipelined SOBEL. Panels: (a) Balance, (b) Execution Time, (c) Area.
      [Plots: balance and execution cycles versus inner loop unroll factor, and execution cycles versus space (both log-scaled), with one curve per outer loop unroll factor; the maximum space and the selected design are marked.]
      Table 2: Speedup on a single FPGA.
      Program | Non-Pipelined | Pipelined
      FIR | 7.67 | 17.26
      MM | 4.55 | 13.36
      JAC | 3.87 | 5.56
      PAT | 7.53 | 34.61
      SOBEL | 4.01 | 3.90
      "... saturation point, and then decreasing. The execution time is also monotonically nonincreasing, related to Observation 2. In all programs, our algorithm selects a design that is close to best in terms of performance, but uses relatively small unroll factors. Among the designs with comparable performance, in all cases our algorithm selected the design that consumes the smallest amount of space. As a result, we have shown that our approach meets the optimization goals set forth in Section 3. In most cases, the most balanced design is selected by the algorithm. When a less balanced design is selected, it is either because the more balanced design is before a saturation point (as for non-pipelined FIR), or is too large to fit on the FPGA (as for pipelined MM).
      Table 2 presents the speedup results of the selected design for each kernel as compared to the baseline, for both pipelined and non-pipelined designs. The baseline is the ..."
      "... heuristics based on the saturation point and balance, as described in Section 5. This reveals the effectiveness of the algorithm as it finds the best design point having only explored a small fraction, only 0.3% of the design space consisting of all possible unroll factors for each loop. For larger design spaces, we expect the number of points searched relative to the size to be even smaller."
  • 16. Related Work
      Synthesizing High-Level Constructs
        Handel-C (influenced by OCCAM)
        SA-C
      Design Space Exploration
        Monet
        Derrien/Rajopadhye
      Discussion
