1. A Compiler Approach to Fast Hardware Design Space Exploration in FPGA-based Systems
Byoungro So, Mary W. Hall and Pedro C. Diniz
Information Sciences Institute
University of Southern California
4676 Admiralty Way, Suite 1001
Marina del Rey, California 90292
{bso,mhall,pedro}@isi.edu
ABSTRACT
This paper describes an automated approach to hardware design space exploration, through a collaboration between parallelizing compiler technology and high-level synthesis tools. We present a compiler algorithm that automatically explores the large design spaces resulting from the application of several program transformations commonly used in […]

1. INTRODUCTION
The extreme flexibility of Field Programmable Gate Arrays (FPGAs) has made them the medium of choice for fast hardware prototyping and a popular vehicle for the realization of custom computing machines. FPGAs are composed of thousands of small programmable logic cells dynamically interconnected to allow the implementation of any logic function. Tremendous growth in device capacity has […]
2. Abstraction
hardware design space exploration
parallelizing compiler techniques
high-level synthesis tools
designing a loop nest computation
synthesis estimation techniques
evaluated with DEFACTO on five multimedia kernels
"This technology thus significantly raises the level of abstraction for hardware design and explores a design space much larger than is feasible for a human designer."
3. DEFACTO
DEFACTO takes as input an application written in C or FORTRAN, and performs pre-processing and several common optimizations using parallelizing compiler technology (in SUIF). In the second step, the code is partitioned into what will execute in software on the host and what will execute on the FPGAs, in collaboration with hardware synthesis tools [9].

Fig. 1. DEFACTO Design Flow. (Diagram labels: program source code; general compiler optimizations; design space exploration, comprising partitioning, parallelization, loop transformations (permutation, unrolling, tiling), memory access protocols, reuse analysis, and scalar replacement; SUIF2VHDL; estimation; logic synthesis and place & route; target architecture library functions; a good/no-good decision loop; host CPU and FPGA boards.)
4. Contributions
a compiler algorithm for design space exploration that relies on behavioral synthesis estimates
applies loop transformations to explore a space-time trade-off
defines a balance metric for guiding design space exploration
results for five multimedia kernels
5. Behavioral Synth. vs. Compilers
Behavioral synthesis applies optimizations on the resulting inner loop body, such as parallelizing and pipelining operations and minimizing registers and operators to save space. However, deciding the unroll factor is left up to the programmer.

Behavioral Synthesis | Parallelizing Compilers
Optimizations only on scalar variables | Optimizations on scalars and arrays
Optimizations only inside loop body | Optimizations inside loop body and across loop iterations
Supports user-controlled loop unrolling | Analyses guide automatic loop transformations
Manages registers and inter-operator communication | Optimizes memory accesses; evaluates trade-offs of different storage on- and off-chip
Considers only a single FPGA | System-level view: multiple FPGAs, multiple memories
Performs allocation, binding and scheduling of hardware resources | No knowledge of hardware implementation of computation

Table 1: Comparison of Behavioral Synthesis and Parallelizing Compiler Technologies.
6. Optimization Goal & Balance
Optimization Criteria:
the design must not exceed the capacity constraints of the system
the execution time should be minimized
for a given level of performance, FPGA space usage should be minimized
Using two metrics:
the result of estimation provides space usage
Balance = F/C (F: data fetch rate, C: data consumption rate)
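As a concrete reading of the metric (a minimal sketch, assuming fetch and consumption rates are already expressed in the same units, e.g., bits per cycle; the function names and example rates are ours, not the paper's):

    #include <stdio.h>

    /* Balance = F / C: ratio of the rate at which memory delivers data
     * to the rate at which the datapath consumes it. */
    double balance(double fetch_rate, double consumption_rate) {
        return fetch_rate / consumption_rate;
    }

    int main(void) {
        double B = balance(128.0, 256.0);  /* illustrative rates */
        if (B < 1.0)
            printf("B = %.2f: memory bound (fetch lags consumption)\n", B);
        else if (B > 1.0)
            printf("B = %.2f: compute bound (datapath lags memory)\n", B);
        else
            printf("B = 1.00: balanced design\n");
        return 0;
    }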
7. Analyses & Transformations
Unroll-and-Jam: unrolling one or more loops and fusing the resulting inner loop bodies
Scalar Replacement: eliminates true dependences when reuse is carried by any loop in the nest (not just the innermost loop)
Loop Peeling & Loop-Invariant Code Motion (see the sketch below)
Data Layout and Array Renaming
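The if (j==0) guard in the scalar-replaced FIR code on the next slide is exactly the kind of conditional these two transformations remove; a minimal sketch on a made-up one-dimensional loop (not the paper's generated code):

    int S[64], C[1], D[64];

    /* Before: the guard runs on every iteration just to load c0 once. */
    void before(void) {
        int j, c0 = 0;
        for (j = 0; j < 64; j++) {
            if (j == 0) c0 = C[0];   /* fires only on the first iteration */
            D[j] = D[j] + S[j] * c0;
        }
    }

    /* After peeling iteration j == 0 and hoisting the invariant load:
     * no guard in the steady-state loop, and C[0] is read only once. */
    void after(void) {
        int j;
        int c0 = C[0];               /* loop-invariant code motion */
        D[0] = D[0] + S[0] * c0;     /* peeled first iteration */
        for (j = 1; j < 64; j++)
            D[j] = D[j] + S[j] * c0;
    }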
8. Optimization Example: FIR
Unroll-and-jam involves unrolling one or more loops in the iteration space and fusing inner loop bodies together, as shown in Figure 1(b). Unrolling exposes operator parallelism to high-level synthesis. In the example, all of the multiplies can be performed in parallel. Two additions can subsequently be performed in parallel, followed by two more additions. Unroll-and-jam can also decrease the dependence distances for reused data accesses, which, when combined with scalar replacement discussed below, can be used to expose opportunities for parallel memory accesses.

Scalar Replacement. Scalar replacement replaces array references by accesses to temporary scalar variables, so that high-level synthesis will exploit reuse in registers [5]. Our approach to scalar replacement closely matches previous work, which eliminates true dependences when reuse is carried by the innermost loop, for accesses in the affine domain with consistent dependences (i.e., constant dependence distances) [5]. There are, however, two differences: (1) we also eliminate unnecessary memory writes on output dependences; and, (2) we exploit reuse across all loops in the nest, not just the innermost loop. The latter difference stems from the observation that many, though not all, algorithms mapped to FPGAs have sufficiently small loop bounds or small reuse distances, and the number of registers that can be configured on an FPGA is sufficiently large. A more detailed description of our scalar replacement and register reuse analysis can be found in [9]. In the example in Figure 1(c), we see the results of scalar replacement, which illustrates some of the above differences […]

(a) Original code:

    int S[96];
    int C[32];
    int D[64];

    for (j=0; j<64; j++)
      for (i=0; i<32; i++)
        D[j] = D[j] + (S[i+j] * C[i]);

(b) After unrolling j loop and i loop by 1 (unroll factor 2) and jamming copies of i loop together:

    for (j=0; j<64; j+=2)
      for (i=0; i<32; i+=2) {
        D[j]   = D[j]   + (S[i+j]   * C[i]);
        D[j]   = D[j]   + (S[i+j+1] * C[i+1]);
        D[j+1] = D[j+1] + (S[i+j+1] * C[i]);
        D[j+1] = D[j+1] + (S[i+j+2] * C[i+1]);
      }

(c) After scalar replacement of accesses to C and D across both i and j loops:

    for (j=0; j<64; j+=2) {
      d_0 = D[j];
      d_1 = D[j+1];
      for (i=0; i<32; i+=2) {
        if (j==0) { /* initialize C registers */
          c_0_0 = C[i];
          c_1_0 = C[i+1];
        }
        S_0 = S[i+j+1];
        d_0 = d_0 + S[i+j] * c_0_0;    /* unroll(0,0) */
        d_0 = d_0 + S_0 * c_1_0;       /* unroll(0,1) */
        d_1 = d_1 + S_0 * c_0_0;       /* unroll(1,0) */
        d_1 = d_1 + S[i+j+2] * c_1_0;  /* unroll(1,1) */
        rotate_registers(c_0_0, ..., c_0_15);
        rotate_registers(c_1_0, ..., c_1_15);
      }
      D[j] = d_0;
      D[j+1] = d_1;
    }

(d) Final code generated for FIR, including loop normalization and data layout optimization:

    for (j=0; j<32; j++) { /* initialize D registers */
      d_0 = D2[j];
      d_1 = D3[j];
      for (i=0; i<16; i++) {
        if (j==0) { /* initialize C registers */
          c_0_0 = C0[i];
          c_1_0 = C1[i];
        }
        S_0 = S1[i+j];
        d_0 = d_0 + S0[i+j] * c_0_0;    /* unroll(0,0) */
        d_0 = d_0 + S_0 * c_1_0;        /* unroll(0,1) */
        d_1 = d_1 + S_0 * c_0_0;        /* unroll(1,0) */
        d_1 = d_1 + S0[i+j+1] * c_1_0;  /* unroll(1,1) */
        rotate_registers(c_0_0, ..., c_0_15);
        rotate_registers(c_1_0, ..., c_1_15);
      }
      D3[j] = d_1;
      D2[j] = d_0;
    }

Figure 1: Optimization Example: FIR.
9. Optimization Algorithm
5.1 Definitions
Saturation Point
Search Space Properties
Algorithm Description
Adjusting Number of On-chip Registers

[…] in the general case. We address this problem by limiting the number of registers in Section 5.4.

5.1 Definitions
We define a saturation point as a vector of unroll factors where the memory parallelism reaches the bandwidth of the architecture, such that the following property holds for the resulting unrolled loop body:

$$\sum_{i \in Reads} width_i = C_1 \cdot \sum_{l \in NumMemories} width_l$$
$$\sum_{j \in Writes} width_j = C_2 \cdot \sum_{l \in NumMemories} width_l$$

Here, C1 and C2 are integer constants. To simplify this discussion, let us assume that the access widths match the memory width, so that we are simply looking for an unroll factor that results in a multiple of NumMemories read and write accesses for the smallest values of C1 and C2. The saturation set, Sat, can then be determined as a function of the number of read and write accesses, R and W, in a single iteration of the loop nest and the unroll factor for each loop in the nest. We consider reads and writes separately because they will be scheduled separately. We are interested in determining the saturation point after […]
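Under that simplifying assumption, the smallest saturating unroll factor has a closed form; a minimal sketch (the helper names are ours, not the paper's):

    /* gcd via Euclid's algorithm. */
    static int gcd(int a, int b) {
        while (b != 0) { int t = a % b; a = b; b = t; }
        return a;
    }

    /* Smallest unroll factor u such that u * accesses_per_iter is an
     * exact multiple of num_memories, i.e., the unrolled body's reads
     * (or writes, computed separately since they are scheduled
     * separately) fill whole groups of parallel memory accesses. */
    int saturating_unroll(int accesses_per_iter, int num_memories) {
        return num_memories / gcd(accesses_per_iter, num_memories);
    }

    /* Example: R = 3 reads per iteration and 4 memories gives u = 4:
     * the 12 reads schedule as exactly 3 groups of 4 parallel accesses. */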
10. Algorithm
[…] stop the search, or it is compute bound and we continue. If it is compute bound, then we consider unroll factors that provide increased operator parallelism, in addition to memory parallelism. Thus, we first look for a loop that carries no dependence (i.e., ∀d ∈ D, d_i = 0). All unrolled iterations of such a loop can be executed in parallel. If such a loop i is found, then we set the unroll factor to Sat_i, assuming this unroll factor is in Sat.

If no such loop exists, then we instead select an unroll factor that favors loops with the largest dependence distances, because such loops can perform in parallel computations between dependences. The details of how our algorithm selects the initial unroll factor in this case are beyond the scope of this paper, but the key insight is that we unroll all loops in the nest, with larger unroll factors for the loops carrying larger minimum nonzero dependence distances. The monotonicity property also applies when considering simultaneous unrolling for multiple loops, as long as unroll factors for all loops are either increasing or decreasing.

If the initial design is space constrained, we must reduce the unroll factor until the design size is less than the size constraint Capacity, resulting in a suboptimal design. The function FindLargestFit simply selects the largest unroll factor between the baseline design corresponding to no unrolling (called Ubase) and Uinit, regardless of balance, because this will maximize available parallelism.

Assuming the initial design is compute bound, the algorithm increases the unroll factors until it reaches a design that is (1) memory bound; (2) larger than Capacity; or (3) represents full unrolling of all loops in the nest (i.e., Ucurr = Umax), as follows. The function Increase(Uin) returns an unroll factor vector Uout such that (1) P(Uout) = 2 ∗ P(Uin); and (2) ∀i: uin_i ≤ uout_i ≤ umax_i. If there are no such remaining unroll factor vectors, then Increase returns Uin.

    Search Algorithm:
    Input: Code /* An n-deep loop nest */
    Output: u1, ..., un /* a vector of unroll factors */

    Ucurr = Uinit
    Umb = Umax
    ok = False
    while (!ok) do
      Code = Generate(Ucurr)
      Estimate = Synthesize(Code)
      B = Balance(Code, Estimate.Performance)
      /* first deal with space-constrained designs */
      if (Estimate.Space > Capacity) then
        if (Ucurr = Uinit) then
          Ucurr = FindLargestFit(Ubase, Ucurr)
          ok = True
        else
          Ucurr = SelectBetween(Ucb, Ucurr)
      else if (B = 1) then ok = True /* Balanced, so DONE! */
      else if (B < 1) then /* memory bound */
        Umb = Ucurr
        if (Ucurr = Uinit) then ok = True
        else
          /* Balanced solution is between earlier size and this */
          Ucurr = SelectBetween(Ucb, Umb)
      else if (B > 1) then /* compute bound */
        Ucb = Ucurr
        if (Umb = Umax) then
          /* Have only seen compute bound so far */
          Ucurr = Increase(Ucb)
        else
          /* Balanced solution is between earlier size and this */
          Ucurr = SelectBetween(Ucb, Umb)
      /* Check if no more points to search */
      if (Ucurr = Ucb) ok = True
    end
    return Ucurr

Figure 2: Algorithm for Design Space Exploration.

[…] first place, the design will be smaller and more likely to fit on chip, and secondly, space is freed up so that it can be used to increase the operator parallelism for designs that are compute bound. To adjust the number of on-chip registers, we can use loop tiling to tile the loop nest so that the localized iteration space within a tile matches the desired number of registers […] If either a space-constrained or memory bound design is […]
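The paper leaves the internals of Increase and SelectBetween unspecified beyond the properties above; the following is one possible reading, a sketch under the assumptions that unroll factors are powers of two and that P(U) is the product of the entries of U:

    #define NLOOPS 2   /* illustrative: a 2-deep nest */

    typedef struct { int u[NLOOPS]; } UnrollVec;

    /* Double P(U) by doubling the first unroll factor that still has
     * room, so P(Uout) = 2 * P(Uin); if none has room, return Uin. */
    UnrollVec increase(UnrollVec in, const int umax[NLOOPS]) {
        for (int i = 0; i < NLOOPS; i++)
            if (in.u[i] * 2 <= umax[i]) { in.u[i] *= 2; return in; }
        return in;
    }

    /* Bisect between a compute-bound point ucb and a memory-bound point
     * umb, stepping each loop's factor by powers of two toward the
     * geometric midpoint; the monotonicity of balance in the unroll
     * factors is what makes such a binary search valid. */
    UnrollVec select_between(UnrollVec ucb, UnrollVec umb) {
        UnrollVec mid;
        for (int i = 0; i < NLOOPS; i++) {
            int lo = ucb.u[i] < umb.u[i] ? ucb.u[i] : umb.u[i];
            int hi = ucb.u[i] > umb.u[i] ? ucb.u[i] : umb.u[i];
            int m = lo;
            while (m < hi && (long)m * m < (long)lo * hi)
                m *= 2;
            mid.u[i] = m;
        }
        return mid;
    }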
11. Experimental Result
Kernels: FIR, Matrix Multiply, String Pattern Matching, Jacobi Iteration, Sobel Edge Detection
C applications compiled in SUIF (SUIF2VHDL); invokes Mentor Graphics' Monet(TM) behavioral synthesis
Compiler analyses and transformations: scalar replacement, data layout, array renaming, data reuse, unroll & jam, tiling
The compiler currently fixes the clock period to be 40 ns

Figure 3: Compilation and Synthesis Flow. (C application → SUIF compiler analyses → unroll factor determination → transformed SUIF → SUIF2VHDL → behavioral VHDL → Monet behavioral synthesis → metrics (area, number of clock cycles) → balance calculation → repeat until a balanced design is found.)
15. Experimental Result (cont.)
Figure 9: Balance, Execution Cycles and Area for Pipelined PAT. (Three panels — (a) Balance, (b) Execution Time, (c) Area, the latter two log-scaled — plotting curves for outer loop unroll factors 1 through 64 against the inner loop unroll factor, with the selected design and the maximum space marked.)

Figure 10: Balance, Execution Time and Area for Pipelined SOBEL. (Same three panels and legend, with the selected design marked.)
[…] saturation point, and then decreasing. The execution time is also monotonically nonincreasing, related to Observation 2. In all programs, our algorithm selects a design that is close to best in terms of performance, but uses relatively small unroll factors. Among the designs with comparable performance, in all cases our algorithm selected the design that consumes the smallest amount of space. As a result, we have shown that our approach meets the optimization goals set forth in Section 3. In most cases, the most balanced design is selected by the algorithm. When a less balanced design is selected, it is either because the more balanced design is before a saturation point (as for non-pipelined FIR), or is too large to fit on the FPGA (as for pipelined MM).

Table 2 presents the speedup results of the selected design for each kernel as compared to the baseline, for both pipelined and non-pipelined designs. The baseline is the […] heuristics based on the saturation point and balance, as described in Section 5. This reveals the effectiveness of the algorithm, as it finds the best design point having explored only a small fraction, only 0.3%, of the design space consisting of all possible unroll factors for each loop. For larger design spaces, we expect the number of points searched relative to the size to be even smaller.

Program | Non-Pipelined | Pipelined
FIR     | 7.67          | 17.26
MM      | 4.55          | 13.36
JAC     | 3.87          | 5.56
PAT     | 7.53          | 34.61
SOBEL   | 4.01          | 3.90

Table 2: Speedup on a single FPGA.
16. Related Work
Synthesizing High-Level Constructs: Handel-C (influenced by OCCAM), SA-C
Design Space Exploration: Monet, Derrien/Rajopadhye
Discussion