Architectural_Synthesis_for_DSP_Structured_Datapaths

Architectural Synthesis of DSP Structured Datapaths
Shereef B. M. Shehata

OUTLINE
• An overview of the architectural-level synthesis problem.
• Subtasks of the high-level synthesis problem:
- Scheduling
- Binding
- Architecture optimization
• NP-hard problems: heuristics versus mathematical programming techniques.
• Novel mathematical programming formulation of the synthesis problem:
- Linearization of the quadratic nonlinear problem
- Optimization of performance and structural complexity
- Techniques to improve the solution time for the ILP formulation
- Heuristics as bounds for mathematical programming
• Results for typical HLS benchmarks.
• Conclusion.

Motivation
To develop an architectural synthesis technique specific to the synthesis of
architectures for DSP targeting FPGA implementations.
The technique is general enough to accommodate other technologies, such as new
submicron technologies.
To provide an accurate evaluation method for our high-level synthesis
methodologies:
• The total execution time, not the number of control steps, is the yardstick for performance comparison.
To exploit important features of FPGA technology:
• Large number of registers.
• FPGA utilization is greatly reduced by complex interconnections.
• High multiplexer cost.
• Wide difference between the delays of multiplications and additions.
• Efficient RAM storage.
• Dedicated high-speed carry-propagation circuit.

Chapter 1

The Symmetrical Array FPGA Module (Xilinx)
- CLB routing is associated with each row and column of the CLB array.
- Global routing consists of dedicated networks primarily designed to distribute clocks throughout the device with minimum delay and skew. It can also be used to distribute high-fan-out signals throughout the device with minimum delay.
- The number of global nets and buffers has increased in the more recent Xilinx 4000 generation to allow more flexibility in routing.
[Figure: symmetrical array FPGA showing the programmable logic blocks, programmable connection matrices and programmable switching matrices.]

XC4000 family switch box architecture
- The SRAM configuration cell allows reuse and prototyping. The hardware becomes reconfigurable and the designer can update the system on the fly.
- The total size of the SRAM configuration cell and the transistor switch that the SRAM drives is larger than the programming devices used in antifuse technologies.
- The horizontal and vertical single- and double-length lines intersect at a box called a programmable switch matrix. Each switch matrix consists of programmable pass transistors used to establish connections between the lines.
[Figure: switch matrix with its interconnect points on the data lines; six pass transistors per switch matrix interconnect point.]

The Xilinx 4000 Configurable logic block (dedicated carry logic is not shown)
- The inputs C1-C4 can also be used to control the use of the F and G LUTs as 32 bits of SRAM.
- The mux control maps the four control inputs (C1-C4) into: LUT input H1, direct in (DIN), enable clock (EC) and set/reset for the flip-flops.
- The XC4000 CLB also has special fast dedicated carry logic hardwired between the CLBs.
[Figure: XC4000 CLB with the G LUT (inputs G1-G4), the F LUT (inputs F1-F4), a third LUT with input H1, the programmable multiplexer driven by C1-C4 (H1, DIN, set/reset), two D flip-flops with outputs Q1 and Q2, the combinational output G, the clock input, and the carry-in/carry-out connections to and from adjacent CLBs.]

Carry propagation paths in Xilinx 4000 series
- The carry chain in the XC4000 can run either up or down. At the top or bottom of the columns, where there are no more CLBs, the carry is propagated to the right.
- The fast carry logic can be accessed by using Relationally Placed Macros (RPMs) that already include special library symbols for using the fast carry logic.
- The carry logic shares operands and control with the function generators.
[Figure: 4 x 4 array of CLBs with the dedicated carry path running through the columns.]

Interconnect Overview for the XC4000 family
[Figure: interconnect resources around an XC4000 CLB: single-length, double-length, quad and long lines, global clock lines, CLB direct connect, and the carry chain.]

Details of XC4000 dedicated carry logic.
- The two 4-input function generators can be configured as a 2-bit adder with built-in hidden carry that can be expanded to any length.
- This dedicated carry circuitry is so fast that conventional speed-up methods like carry generate/propagate have marginal benefit at the 32-bit level and almost no effect at the 16-bit level.
[Figure: 2-bit adder in one CLB. The F function generator takes $A_i$, $B_i$ and $C_i$ and produces $S_i$; the G function generator takes $A_{i+1}$, $B_{i+1}$ and produces $S_{i+1}$; the carry out is $C_{i+2}$.]

Details of a Logic Array Block (LAB) in FLEX 8000 family
- Eight LEs are stacked to form a Logic Array Block (LAB).
[Figure: LAB with eight LEs, the LAB local interconnect, the LAB control signals, the row and column interconnect, the carry-in from the LAB on the left and the carry-out to the LAB on the right.]

FLEX 8000 Logic Element(LE)
- The FLEX LE uses a four-input LUT, a flip-flop, cascade logic and carry logic.
[Figure: FLEX 8000 LE. Inputs DATA1-DATA4 feed the look-up table (LUT); the carry chain (carry-in, carry-out) and the cascade chain (cascade-in, cascade-out) connect to neighbouring LEs; LABCTRL1-LABCTRL4 drive the clear/preset logic and the clock select; a D flip-flop with PRN and CLRN produces the LE output.]

Flex 8000 device block diagram
[Figure: FLEX 8000 device with Logic Array Blocks (LABs) built from logic elements, surrounded by I/O elements (IOEs) and connected by the FastTrack Interconnect.]

General Architecture Model
[Figure: general architecture model. Function units (FUi, FUj) joined by a chaining register R and chaining interconnect; each FU module has an optional register mux, an optional FU mux and an optional sub-module; FU outputs drive one of the pipelined busses through tristate bus drivers; register file (RAM) modules provide storage; a control unit issues the interconnect control signals and the function unit and register control signals.]

[Figure: flow of the synthesis steps. Inputs: CDFG, DFG and technology data.]
STEP-1: Scheduling Bounds
- Heuristics to determine the lower bound on the number of cycles.
- Heuristics to tighten the ASAP/ALAP values under the given resource constraints.
STEP-2: C++: Constraint Generation for ILP
- DFG exploration.
- Dynamic set generation for chaining.
- ILP constraint generation.
STEP-3: ILP: Random Topology
- Scheduling and binding.
- Chaining of operations.
- Clock cycle minimization + FU pipelining choice.
- Minimization of the number of cycles, OR minimization of the total execution time (i.e. throughput maximization).
- Interconnect minimization.
- Bus loading minimization.
STEP-4: ILP: Bus Insertion
- Bus transfer scheduling.
- Bus allocation.
- Bus loading minimization.
STEP-LAST: Register Allocation
- Data storage assignment.
- Storage minimization.
Output: VHDL generation of the datapath and the controller, passed to the logic synthesis tools.
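STEP-1 relies on ASAP/ALAP values for every operation. The following is a minimal sketch of the unconstrained computation only, assuming a DFG given as a duration table and an edge list; the function name `asap_alap` and the data layout are illustrative assumptions, and the resource-constrained tightening mentioned in the flow is not reproduced here.

```python
from collections import defaultdict


def asap_alap(durations, edges, total_steps):
    """Return (asap, alap) start steps, 1-based, without chaining.

    durations: {op: number of control steps}; edges: (source, destination) pairs;
    total_steps: schedule length T used as the ALAP deadline.
    All names here are hypothetical, not taken from the slides.
    """
    succs, preds = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succs[u].append(v)
        preds[v].append(u)

    # Kahn's algorithm: the list grows while it is being traversed.
    indeg = {op: len(preds[op]) for op in durations}
    order = [op for op in durations if indeg[op] == 0]
    for op in order:
        for v in succs[op]:
            indeg[v] -= 1
            if indeg[v] == 0:
                order.append(v)

    asap = {}
    for op in order:  # earliest start: after every predecessor has finished
        asap[op] = max((asap[u] + durations[u] for u in preds[op]), default=1)

    alap = {}
    for op in reversed(order):  # latest start: finish before any successor starts
        latest_end = min((alap[v] - 1 for v in succs[op]), default=total_steps)
        alap[op] = latest_end - durations[op] + 1
    return asap, alap


asap, alap = asap_alap({"m1": 2, "a1": 1, "a2": 1},
                       [("m1", "a1"), ("a1", "a2")], total_steps=5)
print(asap, alap)
```

The interval from ASAP(op) to ALAP(op) is what the ILP slides call Range(op); tighter intervals mean fewer binary variables.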

Flow of the Back-End Tools
- Stage-2 uses Synopsys tools (logic synthesis and FPGA mapping), and Stage-3 uses Xilinx XACT tools for partition, placement and routing (PPR).
[Figure: back-end flow.
Inputs: VHDL source files, Xilinx hard macros, Xilinx library, area constraints, delay constraints, FU pipelining (i.e. register balancing).
Stage-2 (Synopsys): simulate; read HDL and insert pads; compile and optimize the datapath and controller.
Stage-3 (Xilinx): partition, placement and routing; then to simulation.]

Chapter 2

Basic Definitions.
- A polyhedron $P \subseteq \mathbb{R}^n$ is the set of points that satisfy a finite number of linear inequalities, that is:
$$P = \{\, x \in \mathbb{R}^n : A \cdot x \le b \,\}$$
- A polytope is a bounded polyhedron, that is:
$$\exists\, w \in \mathbb{R}^1 : \quad P \subseteq \{\, x \in \mathbb{R}^n : -w \le x_j \le w \;\; \forall j,\ j = 1 \dots n \,\}$$
- A polyhedron face: the set $F = \{\, x \in P : \pi \cdot x = \pi_0 \,\}$ is called a face of $P$, and the valid inequality $\pi \cdot x \le \pi_0$ is said to define the face $F$.

- The convex hull: given a set $S \subseteq \mathbb{R}^n$ and a point $x \in \mathbb{R}^n$, the convex hull of $S$, denoted by Conv(S), is the set of all points that can be written as a convex combination of finitely many points in $S$:
$$\mathrm{Conv}(S) = \Big\{\, x \in \mathbb{R}^n_+ : \; x = \sum_{i=1}^{t} \lambda_i \cdot x^i, \quad \sum_{i=1}^{t} \lambda_i = 1, \quad \lambda \in \mathbb{R}^t_+ \,\Big\}$$
- where $x^1, x^2, \dots, x^t$ is any finite set of points in $S$. The convex hull Conv(S) can be described by a finite set of linear inequalities.

- A partially ordered set $(X, B)$, or poset, is a non-empty set $X$ and a binary relationship $B$ on $X$ which is reflexive, anti-symmetric and transitive. The elements of $X$ are called points and the binary relationship $B$ is called a partial ordering on $X$.
- A strict partially ordered set $(X, \tilde{B})$, or Sposet, is a non-empty set $X$ and a binary relationship $\tilde{B}$ on $X$ which is irreflexive, anti-symmetric and transitive.
- We use $xBy$ to denote that $(x, y) \in B$ and $x\tilde{B}y$ to denote that $(x, y) \in \tilde{B}$.
- A Hasse diagram of a poset $(X, P)$ is a drawing in which the points of $X$ are placed so that if $y$ covers $x$, then $y$ is placed at a higher level than $x$ and joined to $x$ by a line segment. The corresponding graph is called a Hasse graph of the poset.
- A clique in a graph $G = (V, E)$ is a set $C \subseteq V$ with the property that every pair of nodes in $C$ is joined by an edge.
- A subset $A \subseteq V$ of the vertices of the graph $G = (V, E)$ is an r-clique if it induces a complete subgraph, i.e. $G_A \cong K_r$.
- A stable set (or independent set) of vertices is a subset $X$ of the vertex set of a graph $G$, no two of which are adjacent.

- A comparability graph is an undirected graph that is transitively orientable. That is, each edge can be assigned a one-way direction such that the resulting directed graph $G = (V, E)$ satisfies the following condition: $(a, b) \in E$ and $(b, c) \in E$ imply $(a, c) \in E$, $\forall a, b, c \in V$.
- A graph $G$ is a triangulated graph if every simple cycle of length strictly greater than 3 possesses a chord.
- The stability number $\alpha(G)$ of $G$ is the number of vertices in a stable set of maximum cardinality.
- The chromatic number $\gamma(G)$ of $G$ is the smallest possible $k$ for which there exists a proper k-coloring of $G$.
- The clique number $\omega(G)$ of $G$ is the number of vertices in a clique of maximum cardinality.
- The clique cover number $\theta(G)$ is the fewest number of complete subgraphs needed to cover the vertices of $G$, i.e. the size of the smallest possible clique cover of the graph $G$.

- A vertex packing on a graph $G = (V, E)$ is a set of vertices $U \subseteq V$ with the property that no pair of vertices in $U$ is joined by an edge.
- The fractional vertex packing polytope of a graph $G = (V, E)$ is
$$P = \{\, x \in \mathbb{R}^n_+ : \kappa \cdot x \le 1 \,\}$$
where $n = |V|$ and $\kappa$ is the maximal clique matrix of the graph $G$.
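To make the maximal clique matrix $\kappa$ concrete, here is a small sketch that enumerates maximal cliques with the basic Bron-Kerbosch recursion and tests whether a point satisfies $\kappa \cdot x \le 1$. The helper names and the four-vertex example graph are assumptions chosen for illustration; the slides do not say how the cliques are enumerated.

```python
def maximal_cliques(adj):
    """Enumerate maximal cliques of a graph given as {vertex: set(neighbours)}."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in sorted(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


def clique_matrix(adj):
    """One 0/1 row per maximal clique, one column per vertex (sorted order)."""
    vertices = sorted(adj)
    return [[1 if v in c else 0 for v in vertices] for c in maximal_cliques(adj)]


def in_fractional_packing_polytope(adj, x):
    """True if x >= 0 and every maximal clique row sums to at most 1."""
    if any(xi < 0 for xi in x):
        return False
    return all(sum(k * xi for k, xi in zip(row, x)) <= 1
               for row in clique_matrix(adj))


# Hypothetical example: a triangle 1-2-3 with a pendant vertex 4 attached to 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(clique_matrix(adj))
print(in_fractional_packing_polytope(adj, [0.5, 0.5, 0.5, 0.0]))
```

The point (0.5, 0.5, 0.5, 0) satisfies every edge inequality $x_u + x_v \le 1$ of the triangle but violates its clique row, which is why clique rows give a tighter relaxation than edge rows.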

Chapter 3

Simultaneous Performance Optimization and Interconnect Minimization
• Exploration of a much larger solution space, guided by a highly selective objective function that rejects architectures whose extra interconnections make them unsuitable for FPGA implementation.
• Developing an ILP formulation that incorporates:
- Multilevel chaining of operations and deeply pipelined functional units, which are effective for FPGAs.
- Optimal scheduling and binding of operations while minimizing interconnections.
- Determination of the system clock duration.
- Minimization of the total execution time rather than the number of control steps.

Details of the Integer Linear Programming Formulation
• Operation Assignment Constraints
- This constraint assigns every operation of the DFG to exactly one control step and one FU:
$$\sum_{s \in \mathrm{Range}(op)} \; \sum_{n=1}^{N_t} X_{op,n,s} = 1 \qquad \forall op$$
[Figure: grid of variables $X_{op,n,s}$ for $op_i$ (FU instances 1 to 3, c-steps ASAP($op_i$) to ALAP($op_i$)) and $op_j$ (FU instances 1 and 2, c-steps ASAP($op_j$) to ALAP($op_j$)), with precedence $op_i \to op_j$. The variables in the shaded region add up to 1.]

• Function Unit Assignment Constraint
- Each FU has at most one operation assigned at any given time:
$$\sum_{op \in FU_t} \; \sum_{p = s - L(op) + 1}^{s} X_{op,n,p} \le 1 \qquad \forall n, \ \forall s$$
[Figure: grid of variables $X_{op,n,s}$ for $op_i$, $op_j$ and $op_k$ (FU instances 1 and 2 each) over c-step 1 to c-step 5, with precedence $op_i \to op_j$. The summation of the marked variables is at most 1.]
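The two constraints above can be checked against a candidate schedule in a few lines. This is a sketch under stated assumptions: the 0/1 variables are stored in a dictionary keyed by `(op, n, s)` (operation, FU instance, start step), all operations share one FU type, and the function names are hypothetical.

```python
def assignment_ok(x, ranges, n_units):
    """Operation assignment: each op starts exactly once within Range(op)."""
    return all(
        sum(x.get((op, n, s), 0)
            for n in range(1, n_units + 1) for s in steps) == 1
        for op, steps in ranges.items()
    )


def fu_ok(x, latency, n_units, last_step):
    """FU assignment: instance n holds at most one operation at step s.

    An operation started at p with latency L occupies steps p .. p+L-1,
    so it is active at s exactly when s-L+1 <= p <= s.
    """
    for n in range(1, n_units + 1):
        for s in range(1, last_step + 1):
            busy = sum(x.get((op, n, p), 0)
                       for op, lat in latency.items()
                       for p in range(s - lat + 1, s + 1))
            if busy > 1:
                return False
    return True


# Hypothetical example: two 2-step multiplications on a single multiplier.
ranges = {"m1": range(1, 3), "m2": range(1, 4)}
latency = {"m1": 2, "m2": 2}
good = {("m1", 1, 1): 1, ("m2", 1, 3): 1}   # back to back
bad = {("m1", 1, 1): 1, ("m2", 1, 2): 1}    # overlap in step 2
print(assignment_ok(good, ranges, 1), fu_ok(good, latency, 1, 4))
print(assignment_ok(bad, ranges, 1), fu_ok(bad, latency, 1, 4))
```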

• Scheduling partially ordered operations has to follow the precedence order (no chaining):
$$\sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} \; \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; + \sum_{p = \mathrm{ASAP}(op_j)}^{s} \; \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} \le 1$$
$$\mathrm{ASAP}(op_j) \le s \le \mathrm{ALAP}(op_i) + D(op_i) - 1, \qquad \forall s, \ \forall (op_i \to op_j)$$
[Figure: grid of variables $X_{op,n,s}$ for $op_i$ (FU instances 1 to 3) and $op_j$ (FU instances 1 and 2) over six c-steps, with precedence $op_i \to op_j$ and the current c-step marked between ASAP($op_j$) and ALAP($op_i$). The variables in the shaded region add up to at most 1.]
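A quick way to see what this family of rows enforces is to enumerate start steps by brute force. The sketch below assumes a single FU instance per operation, so each sum reduces to a membership test; `violates_precedence` is an illustrative name, not part of the formulation.

```python
def violates_precedence(p, q, d_i, asap_j, alap_i):
    """True if some row (one per s) has a left-hand side of 2.

    p, q: start steps of op_i and op_j; d_i: duration D(op_i).
    Rows exist for ASAP(op_j) <= s <= ALAP(op_i) + D(op_i) - 1.
    """
    for s in range(asap_j, alap_i + d_i):
        lhs = (s - d_i + 1 <= p <= alap_i) + (asap_j <= q <= s)
        if lhs > 1:
            return True
    return False


# Hypothetical ranges: op_i may start in steps 1..3 and lasts 2 steps,
# op_j may start in steps 2..6.
d_i, asap_i, alap_i, asap_j, alap_j = 2, 1, 3, 2, 6
for p in range(asap_i, alap_i + 1):
    for q in range(asap_j, alap_j + 1):
        # The rows reject exactly the pairs where op_j starts before op_i ends.
        assert violates_precedence(p, q, d_i, asap_j, alap_i) == (q < p + d_i)
print("rows are equivalent to start(op_j) >= start(op_i) + D(op_i)")
```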

To Determine the Total length of the schedule
- The following constraint determines the total number of steps $T$ from the schedule of the operations in the set $W$, where $W$ is the set of operations without successors in the DFG:
$$\sum_{s \in \mathrm{Range}(op)} \; \sum_{n=1}^{N_t} s \times X_{op,n,s} \; - \; T \le -\big(D(op) - 1\big) \qquad \forall op \in W$$
- The variable $T$ has both an upper and a lower bound (determined from heuristics):
$$T \ge T_{cr}$$
$$T \le T_{cr} + \Delta T$$

Constraints to minimize the structural complexity of the synthesized Architecture
- Counting the number of Motifs:
$$\sum_{\substack{s \in \mathrm{Range}(op_i) \\ op_i \in FU_t}} X_{op_i,n,s} \; + \sum_{\substack{s \in \mathrm{Range}(op_j) \\ op_j \in FU_{t'}}} X_{op_j,n',s} \; - \; \mathrm{Motif}(FU_t, n, FU_{t'}, n') \le 1$$
$$\forall (op_i \to op_j), \qquad \forall n \ (n = 1 \dots N_t), \qquad \forall n' \ (n' = 1 \dots N_{t'})$$
- A corresponding term to minimize the MOTIFSUM is included in the objective function to increase the utilization of the already assigned interconnect between different function units.
[Figure: grid of variables $X_{op,n,s}$ for $op_i$ (FU instances 1 to 3) and $op_j$ (FU instances 1 and 2) over c-step 1 to c-step 5, between the ASAP and ALAP steps of each operation. The summation of the marked variables sets the value of Motif(A, 2, M, 1).]

Constraints to minimize the structural complexity of the synthesized Architecture
- Counting the number of Chaining Motifs:
$$\sum_{op_i \in FU_t} X_{op_i,n,s} \; + \sum_{op_j \in FU_{t'}} X_{op_j,n',s} \; - \; \mathrm{CMotif}(FU_t, n, FU_{t'}, n') \le 1$$
$$\forall s, \ \forall (op_i \to op_j), \qquad \forall n \ (n = 1 \dots N_t), \qquad \forall n' \ (n' = 1 \dots N_{t'})$$
- A corresponding term to minimize the CMOTIFSUM is included in the objective function to increase the utilization of the already assigned chaining interconnect between different function units.
[Figure: grid of variables $X_{op,n,s}$ for $op_i$ (FU instances 1 to 3) and $op_j$ (FU instances 1 and 2) over c-step 1 to c-step 5, with precedence $op_i \to op_j$. The summation of the marked variables sets the value of CMotif(A, 2, M, 1).]

- Counting Incompatible Motifs
- The idea is to minimize the number of Motifs that terminate on the same function unit. This will decrease the number of multiplexers in the synthesized architecture.
$$\sum_{FU_t} \; \sum_{n=1}^{N_t} \mathrm{Motif}(FU_t, n, FU_{t'}, n') \; - \; \mathrm{Incomp}(FU_{t'}) \le 0 \qquad \forall n', \ \forall FU_{t'}$$
[Figure: example schedules and Motifs next to the resulting architecture with its I/O.]

- Minimizing the maximum number of edges with the same FU destination type (Incompatible Motifs).
An integer variable is introduced to count the number of incompatible Motifs:
$$\sum_{FU_t} \; \sum_{n=1}^{N_t} \mathrm{Motif}(FU_t, n, FU_{t'}, n') \; - \; \mathrm{Incomp}(FU_{t'}) \le 0 \qquad \forall FU_{t'}, \ \forall n'$$
[Figure: three example panels (a), (b), (c).]

Minimizing the Maximum Number of Edge Overlap
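$$K \times \Bigg[ \sum_{\substack{\forall (op_i \to op_j) \\ op_j \in FU_t \\ edge \notin wrap}} \Bigg( \sum_{p = \mathrm{ASAP}(op_i)}^{s} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; - \sum_{p = \mathrm{ASAP}(op_j)}^{s} \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} \Bigg) + \sum_{\substack{\forall (op_i \to op_j) \\ op_j \in FU_t \\ edge \in wrap}} \Bigg( \sum_{p = \mathrm{ASAP}(op_i)}^{s} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; + \sum_{p = s+1}^{\mathrm{ALAP}(op_j)} \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} \Bigg) \Bigg] - \mathrm{Maxovlap}(FU_t) \le 0 \qquad \forall s, \ \forall FU_t$$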
K 1

Formulation for chaining of Two operations per control step
• The destination operation cannot be scheduled "before" the source operation.
- However, they can share the same control step.
$$\sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} \; \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; + \sum_{p = \mathrm{ASAP}(op_j)}^{s-1} \; \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} \le 1 \qquad \forall s, \ \forall (op_i \to op_j)$$
[Figure: grid of variables $X_{op,n,s}$ for $op_i$ (FU instances 1 to 3) and $op_j$ (FU instances 1 and 2) over six c-steps, with precedence $op_i \to op_j$ and the current c-step marked between ASAP($op_j$) and ALAP($op_i$). The summation of the variables in the shaded regions is at most 1.]

• The source operation cannot be scheduled "after" the destination operation.
- However, they can share the same control step. This constraint and the previous one are not redundant: together they tighten the formulation.
$$\sum_{p = s - D(op_i) + 2}^{\mathrm{ALAP}(op_i)} \; \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; + \sum_{p = s} \; \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} \le 1 \qquad \forall s, \ \forall (op_i \to op_j)$$
[Figure: grid of variables $X_{op,n,s}$ for $op_i$ (FU instances 1 to 3) and $op_j$ (FU instances 1 and 2) over six c-steps, with precedence $op_i \to op_j$ and the current c-step marked between ASAP($op_j$) and ALAP($op_i$).]
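Both chaining rows can be explored by brute force in the same way as the non-chained case. The sketch assumes one FU instance per operation and that the rows are generated for every s from 1 to a chosen horizon; the function name and the horizon parameter are assumptions for illustration.

```python
def allowed_by_chaining_rows(p, q, d_i, alap_i, asap_j, horizon):
    """True if no row of either chaining constraint has a left-hand side of 2.

    p, q: start steps of the source op_i and the destination op_j.
    """
    for s in range(1, horizon + 1):
        # Destination may not start before the source finishes (first family).
        first = (s - d_i + 1 <= p <= alap_i) + (asap_j <= q <= s - 1)
        # Source may not still be running after the destination starts (second family).
        second = (s - d_i + 2 <= p <= alap_i) + (q == s)
        if first > 1 or second > 1:
            return False
    return True


d_i, alap_i, asap_j = 2, 3, 1
for p in range(1, alap_i + 1):
    for q in range(asap_j, 7):
        # Chaining lets op_j start in the very step in which op_i finishes.
        expected = q >= p + d_i - 1
        assert allowed_by_chaining_rows(p, q, d_i, alap_i, asap_j, 7) == expected
print("op_j may share the last control step of op_i, but may not start earlier")
```

Each family alone already excludes the same integer points in this small experiment; keeping both only affects the strength of the linear relaxation, which matches the remark that they tighten the formulation.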

• The following constraint prevents chaining of more than two operations in the same control step:
$$\sum_{p = s - D(op_i) + 1} \; \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; + \sum_{p = s} \; \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} \le 1 \qquad \forall s, \ \forall (op_i, op_k) \in \Re_2$$
$$\mathrm{ASAP}(op_k) \le s \le \mathrm{ALAP}(op_i) + D(op_i) - 1$$
[Figure: grid of variables $X_{op,n,s}$ over six c-steps for $op_i$ (FU instances 1 to 3) and $op_k$ (FU instances 1 and 2), with precedence $op_i \to op_j \to op_k$ and the current c-step marked between ASAP($op_k$) and ALAP($op_i$).]

Multi-Level Chaining
- Patterns to look for in the DFG:
[Figure: DFG patterns in which $op_i$ and $op_k$ are additions separated by a multiplication (*), with inputs A, B, C, D.]
- Formulation
- Generate the set $\Delta_M \subseteq O^2$ such that $(op_1, op_2) \in \Delta_M$ if $op_1 \to op_M$, $op_M \to op_2$, and $op_M$ is a multi-cycle operation (e.g. a multiply operation).
- The following constraint then applies to the members of this set:
$$\sum_{p = s - D(op_i) + 1} \; \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; + \sum_{p = s} \; \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} \le 1$$
$$\mathrm{ASAP}(op_k) \le s \le \mathrm{ALAP}(op_i) + D(op_i) - 1 \quad \forall s, \qquad \forall (op_i, op_k) \in \Delta_M$$

Delay model for an N-bit adder implemented in Xilinx FPGAs
$$T_A = T_{OPCY} + \frac{N - 4}{2} \times T_{carry} + T_{sum}$$
For the Xilinx 4000 series, $T_{carry}$ is 0.7/1 ns, and $T_{OPCY}$ is 4 ns.
- The delay is linear in the number of bits. The proportionality factor is $T_{carry}$, which makes these the fastest possible carry-path circuits.
[Figure: N-bit adder with inputs $A_0, B_0$ (LSB) to $A_{N-1}, B_{N-1}$ (MSB) and outputs $S_0$ to $S_{N-1}$. The delay path consists of $T_{OPCY}$, then $T_{carry}$ through each of the $(N-4)/2$ intermediate CLBs, then $T_{sum}$.]
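The delay model is easy to evaluate numerically. The sketch below encodes $T_A$ as a function; the default of 0.7 ns for $T_{carry}$ picks the first of the two values quoted on the slide, and the default of 5 ns for $T_{sum}$ is borrowed from the chained multiplier-adder slide, so both defaults are assumptions rather than values stated for this model.

```python
def adder_delay_ns(n_bits, t_opcy=4.0, t_carry=0.7, t_sum=5.0):
    """T_A = T_OPCY + (N - 4)/2 * T_carry + T_sum, all times in ns.

    Defaults are illustrative assumptions (see the text above), not
    data-sheet values.
    """
    return t_opcy + (n_bits - 4) / 2 * t_carry + t_sum


for n in (8, 16, 32):
    print(n, "bits:", round(adder_delay_ns(n), 2), "ns")
```

Doubling the word length from 16 to 32 bits adds only eight extra carry stages, which is why the fixed terms $T_{OPCY}$ and $T_{sum}$ dominate for short adders.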

Delay model for a pipelined-multiplier chained with an adder
$$T_{pd} = T_{pipe} + T_{sum}$$
For the Xilinx 4000 series, $T_{sum}$ is 5 ns.
[Figure: last pipeline stage of a multiplier chained with an adder, outputs $S_0$ (LSB) to $S_{N-1}$ (MSB); both carry chains are annotated with $T_{OPCY}$, $T_{carry}$ per CLB and $T_{sum}$.]

Scheduling with Multi-level chaining and Interconnect minimization
[Figure: addition tree with inputs i1 to i13 and output out, shown with two schedules and the architectures synthesized from them (Adder 1, Adder 2, Adder 3, registers R1 and R2).]
First result:
Extra number of mux inputs: 2
Number of CLBs: 128
Execution time: 84 ns
Number of registers: 2
Second result:
Extra number of mux inputs: 8
Number of CLBs: 180

Delaying of interconnect optimization after scheduling
- Comparison of our results for an addition tree with methods that restrict the solution space, or do not minimize interconnect simultaneously with the scheduling and binding of operations.
[Figure: addition tree with inputs i1 to i13 and output out, and the architecture obtained when interconnect optimization is delayed until after scheduling (Adder 1 to Adder 4, registers R1 to R4).]
Extra number of mux inputs: 7
Number of CLBs: 168

Scheduling and binding for the CDFG with non-pipelined multipliers and no chaining
• The schedule needs 7 control steps, with a clock duration of 150 ns.
[Figure: CDFG of additions and two multiplications scheduled over c-step 1 to c-step 7.]
Clock cycle: 150 ns
Exec. time: 7 * 150 = 1050 ns
Resources: 2 adders, 1 multiplier
No chaining
Non-pipelined multipliers

Effect of increasing the resources to three adders and one non-pipelined multiplier on the
scheduling of the CDFG
• Increasing the resources by one adder does not affect the execution time for the CDFG.
[Figure: CDFG of additions and two multiplications scheduled over c-step 1 to c-step 7.]
Clock cycle: 150 ns
Exec. time: 7 * 150 = 1050 ns
No chaining
Non-pipelined multipliers

Scheduling and binding for the CDFG with pipelined multipliers and no chaining.
• The schedule needs 8 control steps, with a clock duration of 80 ns.
[Figure: CDFG of additions and two multiplications scheduled over c-step 1 to c-step 8.]
Clock cycle: 80 ns
Execution time: 8 * 80 = 640 ns
No chaining
Pipelined multipliers

Scheduling and binding of the CDFG, using pipelined multiplier and chaining
• The schedule needs 5 control steps, with a clock duration of 90 ns.
[Figure: CDFG with two multiplications scheduled over c-step 1 to c-step 5.]
Clock cycle: 90 ns
Execution time: 5 * 90 = 450 ns
Resources: 3 adders, 1 multiplier
Pipelined multipliers
2-level chaining allowed
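The schedules reported on these slides illustrate why total execution time, not the number of control steps, is the yardstick. The short sketch below only re-tabulates the step counts and clock periods quoted above; the variable and function names are illustrative.

```python
# (description, control steps, clock period in ns), as quoted on the slides
SCHEDULES = [
    ("non-pipelined multiplier, no chaining", 7, 150),
    ("pipelined multiplier, no chaining", 8, 80),
    ("pipelined multiplier, 2-level chaining", 5, 90),
]


def execution_time_ns(steps, clock_ns):
    """Total execution time = number of control steps * clock period."""
    return steps * clock_ns


by_steps = sorted(SCHEDULES, key=lambda c: c[1])
by_time = sorted(SCHEDULES, key=lambda c: execution_time_ns(c[1], c[2]))

for name, steps, clock in by_time:
    print(f"{name}: {steps} steps x {clock} ns = "
          f"{execution_time_ns(steps, clock)} ns")
```

Ranked by control steps, the 7-step schedule beats the 8-step one; ranked by execution time the order flips, because pipelining shortens the clock from 150 ns to 80 ns.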

Synthesized architecture for the scheduling and binding using pipelining and chaining.
[Figure: synthesized datapath with registers R1 to R4 and the multiplier.]

Minimization of the Total Execution time (Performance Optimization)
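- The following constraint sets the clock duration during the solution:
$$\delta(\psi_{ijk}) \times \psi_{ijk} \le \Omega, \qquad \forall \psi_{ijk} \in \Psi$$
- The constraint to set the chaining variable $\psi_{MAA}$ is given below:
$$\sum_{p = s - D(op_i) + 1} \; \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; + \sum_{p = s} \; \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} \; - \; \psi_{MAA} \le 1$$
$$\mathrm{ALAP}(op_i) + D(op_i) - 1 \ge s \ge \mathrm{ASAP}(op_k) \quad \forall s, \qquad \forall (op_i, op_k) \in \Im_{2S}, \qquad op_i \in N_M \ \text{and} \ op_k \in N_A$$
- The upper and lower limits that exist for the clock duration:
$$\Omega_{min} \le \Omega \le \Omega_{max}$$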

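- The values of the upper and lower bounds are determined as follows:
$$\Omega_{max} = \mathrm{MAX}\{\delta(\Psi)\}$$
$$\Omega_{min} = \mathrm{MIN}\{\delta(\Psi)\}$$
- If the clock duration is allowed only discrete values:
$$\delta(\psi_{ijk}) \times \psi_{ijk} \le \Omega_{relaxed}, \qquad \forall \psi_{ijk} \in \Psi$$
$$\Omega = \left\lceil \frac{\Omega_{relaxed}}{\Omega_{min}} \right\rceil \cdot \Omega_{min}$$
- $\Omega_{relaxed}$ is a relaxed version of the discrete-valued $\Omega$ that can assume any positive number.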

Minimization of the DFG Total Execution Time
- The number of control steps (integer) can be represented in terms of binary variables:
$$T = \sum_{i=0}^{n-1} 2^i \cdot \beta_i$$
- The part of the objective function that minimizes the total execution time is nonlinear:
$$I_N = \sum_{i=0}^{n-1} \left( 2^i \cdot CLOCK \right) \cdot \beta_i$$
- The objective function can be conceptually presented as:
$$I = I_N + I_{L1}$$

- Linearization of the nonlinear part of the objective function:
$$I_{L2} = \sum_{i=0}^{n-1} \left( 2^i \cdot CLKMIN \cdot \beta_i + \Theta_i \right)$$
$$\Theta_i \ge 2^i \cdot CLOCK - 2^i \cdot CLKMIN \cdot \beta_i - 2^i \cdot CLKMAX \cdot (1 - \beta_i), \qquad i = 0, \dots, n-1$$
$$\Theta_i \ge 0, \qquad i = 0, \dots, n-1$$
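The linearization can be sanity-checked numerically: when the objective pushes each $\Theta_i$ down to its smallest feasible value, the linear term should reproduce the product $2^i \cdot CLOCK \cdot \beta_i$. The sketch below assumes exactly that (minimization makes $\Theta_i$ tight) and uses the clock values from the schedule slides as sample data; the function name is hypothetical.

```python
def linearized_term(i, beta, clock, clk_min, clk_max):
    """Value of 2^i*CLKMIN*beta_i + Theta_i with Theta_i at its lower bound."""
    w = 2 ** i
    theta = max(0, w * clock - w * clk_min * beta - w * clk_max * (1 - beta))
    return w * clk_min * beta + theta


CLK_MIN, CLK_MAX = 80, 150          # sample bounds in ns (assumed)
for i in range(4):
    for beta in (0, 1):
        for clock in (80, 90, 150):  # any CLK_MIN <= CLOCK <= CLK_MAX
            exact = 2 ** i * clock * beta
            assert linearized_term(i, beta, clock, CLK_MIN, CLK_MAX) == exact
print("linear terms match the bilinear product for every sampled case")
```

For $\beta_i = 0$ the right-hand side $2^i (CLOCK - CLKMAX)$ is never positive, so $\Theta_i = 0$; for $\beta_i = 1$ it equals $2^i (CLOCK - CLKMIN)$, which tops the constant part up to $2^i \cdot CLOCK$.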

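- Linearization does not increase the complexity of the formulation:
$$\Theta_i \ge \begin{cases} 2^i \cdot (CLOCK - CLKMAX), & \text{if } \beta_i \text{ is } 0 \quad (\Theta_i \ge 0) \\ 2^i \cdot (CLOCK - CLKMIN), & \text{if } \beta_i \text{ is } 1 \quad (\Theta_i \ge 0) \end{cases}$$
$$I_{L2} = \sum_{i,\ \beta_i = 1} 2^i \cdot CLOCK$$
$$n = (\log T) / (\log 2)$$
• where $n$ is the number of discrete variables added to the formulation.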

Tree Height Reduction
- The performance of the architecture is bounded by the length of the critical path.

Before THR         After THR            Delay estimation
(A + B) + C + D    (A + B) + (C + D)    $\delta(\psi_{AAA}) = \delta(\psi_{AA})$
(A + B) - C + D    (A + B) + (D - C)    $\delta(\psi_{ASA}) = \mathrm{MAX}\{\delta(\psi_{AA}), \delta(\psi_{SA})\}$
(A + B) + C - D    (A + B) + (C - D)    $\delta(\psi_{AAS}) = \mathrm{MAX}\{\delta(\psi_{AA}), \delta(\psi_{SA})\}$
(A - B) + C + D    (A - B) + (C + D)    $\delta(\psi_{SAA}) = \mathrm{MAX}\{\delta(\psi_{AA}), \delta(\psi_{SA})\}$
(A - B) - C + D    (A - B) + (D - C)    $\delta(\psi_{SSA}) = \mathrm{MAX}\{\delta(\psi_{SA}), \delta(\psi_{SA})\}$
(A - B) - C - D    (A - B) - (C + D)    $\delta(\psi_{SSS}) = \mathrm{MAX}\{\delta(\psi_{SS}), \delta(\psi_{AS})\}$
(A * B) + C + D    (A * B) + (C + D)    $\delta(\psi_{MAA}) = \mathrm{MAX}\{\delta(\psi_{MA}), \delta(\psi_{AA})\}$
(A * B) - C + D    (A * B) + (D - C)    $\delta(\psi_{MSA}) = \mathrm{MAX}\{\delta(\psi_{MA}), \delta(\psi_{SA})\}$
(A * B) - C - D    (A * B) - (C + D)    $\delta(\psi_{MSS}) = \mathrm{MAX}\{\delta(\psi_{MS}), \delta(\psi_{AS})\}$
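Each rewrite in the table must leave the computed value unchanged. The following sketch checks the nine before/after pairs exhaustively over a small integer range; the table encoding as pairs of lambdas is an illustrative choice, and exact integer arithmetic is assumed (fixed-point overflow behaviour is not modelled).

```python
from itertools import product

# (label, expression before THR, expression after THR)
REWRITES = [
    ("AAA", lambda a, b, c, d: (a + b) + c + d, lambda a, b, c, d: (a + b) + (c + d)),
    ("ASA", lambda a, b, c, d: (a + b) - c + d, lambda a, b, c, d: (a + b) + (d - c)),
    ("AAS", lambda a, b, c, d: (a + b) + c - d, lambda a, b, c, d: (a + b) + (c - d)),
    ("SAA", lambda a, b, c, d: (a - b) + c + d, lambda a, b, c, d: (a - b) + (c + d)),
    ("SSA", lambda a, b, c, d: (a - b) - c + d, lambda a, b, c, d: (a - b) + (d - c)),
    ("SSS", lambda a, b, c, d: (a - b) - c - d, lambda a, b, c, d: (a - b) - (c + d)),
    ("MAA", lambda a, b, c, d: (a * b) + c + d, lambda a, b, c, d: (a * b) + (c + d)),
    ("MSA", lambda a, b, c, d: (a * b) - c + d, lambda a, b, c, d: (a * b) + (d - c)),
    ("MSS", lambda a, b, c, d: (a * b) - c - d, lambda a, b, c, d: (a * b) - (c + d)),
]


def all_rewrites_preserve_value(values=range(-3, 4)):
    """True if every before/after pair agrees on every sampled input."""
    return all(before(*v) == after(*v)
               for _, before, after in REWRITES
               for v in product(values, repeat=4))


print(all_rewrites_preserve_value())
```

After the rewrite the two inner operations are independent, so the three-operation chain is replaced by two parallel operations followed by one combining operation, which is what the MAX in the delay column expresses.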

Chapter 4

Hasse Graph for scheduling with n-level chaining
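[Figure: Hasse graph with the operations (op) 1, 2, 3, ..., n-1, n, n+1 as columns, linked by $\alpha_1, \alpha_2, \dots, \alpha_{n-2}, \alpha_{n-1}, \alpha_n$, and the control steps (cstep, s) 1 to 7 as rows. The legend distinguishes assignment edges from timing edges.]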

Topological Sorting of the Hasse Graph can be modified to be used for Coloring the Graph
- Nodes are numbered according to topological sorting.
[Figure: Hasse graph for operations 1 to 3 against control steps 1 to 6, with the nodes numbered 1 to 16 in topological order.]

Two different Colorings for the Hasse Graph for scheduling with 2-level chaining
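- Nodes are numbered according to the corresponding color.
[Figure: two copies of the Hasse graph for operations 1 to 3 against control steps 1 to 6, each node labelled with its color; the first coloring uses colors 0 to 5 and the second uses colors 0 to 5 in a different assignment.]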

Topological Sorting of the Hasse Graph can be modified to be used for Coloring the Graph
[Figure: Hasse graph for operations 1 to 4 against control steps 1 to 6, with the nodes numbered 1 to 22 in topological order.]

Two different Colorings for the Hasse Graph for scheduling with 3-level chaining
- The graph has 22 nodes and 43 edges. The number of maximal cliques therefore cannot be greater than 22 (or even equal to 22).
- The transitive closure of the graph has 115 edges.
[Figure: two copies of the Hasse graph for operations 1 to 4 against control steps 1 to 6, each node labelled with its color (0 to 5) under the two colorings.]

An Odd-Hole graph and A Wheel graph
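An Odd-Hole Graph
[Figure: cycle on the five nodes 1 to 5.]
$$x_1 + x_2 + x_3 + x_4 + x_5 \le 2$$
A Wheel Graph
[Figure: the five rim nodes 1 to 5 together with node 6.]
$$x_1 + x_2 + x_3 + x_4 + x_5 + 2 \cdot x_6 \le 2$$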
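Both inequalities can be verified by enumerating every stable set. The sketch assumes the standard shapes for the two figures, a 5-cycle on nodes 1 to 5 and a wheel whose hub (node 6) is adjacent to every rim node; since the drawings themselves are not recoverable, that adjacency is an assumption.

```python
from itertools import product


def stable_sets(n, edges):
    """Yield every 0/1 vector in which no edge has both endpoints selected."""
    for x in product((0, 1), repeat=n):
        if all(x[u] + x[v] <= 1 for u, v in edges):
            yield x


# Nodes are 0-based here: rim nodes 0..4, hub node 5.
cycle = [(i, (i + 1) % 5) for i in range(5)]
wheel = cycle + [(i, 5) for i in range(5)]

odd_hole_ok = all(sum(x) <= 2 for x in stable_sets(5, cycle))
wheel_ok = all(sum(x[:5]) + 2 * x[5] <= 2 for x in stable_sets(6, wheel))
tight = max(sum(x) for x in stable_sets(5, cycle))

print(odd_hole_ok, wheel_ok, tight)
```

The fractional point with every $x_i = 0.5$ satisfies all edge inequalities of the 5-cycle yet sums to 2.5, so the odd-hole inequality cuts it off; this is the sense in which such constraint classes strengthen the LP relaxation.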

The Extended Wheel Graph
An Extended-Wheel Graph
[Figure: the wheel graph on nodes 1 to 6 extended with node 7.]
$$x_1 + x_2 + x_3 + x_4 + x_5 + 2 \cdot x_6 + 2 \cdot x_7 \le 2$$

An Extended Wheel Graph Constraint Class for 3-level chaining.
Example constraint class: constraint $(\alpha_2 \beta \alpha_1 \beta)$ for a 3-level chain.
[Figure: Hasse graph for operations 1 to 4 against control steps 1 to 5.]
Example, for s = 3:
$$X_{1,3} + X_{3,2} + X_{3,3} + X_{4,2} + X_{4,3} + 2 \cdot X_{1,4} + 2 \cdot X_{1,5} \le 2$$
General form:
$$\sum_{a=1}^{N_{t_i}} X_{op_i,a,(s - D(op_i) + 1)} \; + \sum_{p = s-1}^{s} \sum_{a=1}^{N_{t_k}} X_{op_k,a,p} \; + \sum_{p = s-1}^{s} \sum_{a=1}^{N_{t_l}} X_{op_l,a,p} \; + \; 2 \cdot \Bigg( \sum_{p = s - D(op_i) + 2}^{\mathrm{ALAP}(op_i)} \sum_{a=1}^{N_{t_i}} X_{op_i,a,p} \Bigg) \le 2$$
$$\forall s \in \big( \mathrm{Range}(op_i) \cap \mathrm{Range}(op_l) \big)$$
$$s - D(op_i) + 2 \in \mathrm{Range}(op_i), \qquad (s - 1) \in \mathrm{Range}(op_l)$$
$$\forall (op_i, op_k) \in \Im_{2S}, \qquad \forall (op_i, op_l) \in \Im_{3S}$$

Exploring the Hasse diagram for schedules with 2-level chaining.
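[Figure: state diagram for exploring the Hasse diagram. From the start state, transitions labelled $\beta$, $\alpha_1$, $\alpha_2$ and $\alpha_1/\alpha_2$ lead to the constraint classes 1 to 5.]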

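1- Clique Constraint Class for 2-level chaining: $\hat{\beta}$
[Figure: Hasse graph for operations 1 to 3 against control steps 1 to 6.]
- The constraint class 1, $\hat{\beta}$:
$$\sum_{s \in \mathrm{Range}(op)} \; \sum_{n=1}^{N_t} X_{op,n,s} \le 1 \qquad \forall op \in DFG$$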

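2- Clique Constraint Class for 2-level chaining: $\beta_s\,\alpha_2\,\beta_s$
[Figure: Hasse graph for operations 1 to 3 against control steps 1 to 6.]
- The constraint class 2, $\beta_s\,\alpha_2\,\beta_s$:
$$\sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; + \sum_{p = \mathrm{ASAP}(op_k)}^{s} \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} \le 1$$
$$\forall s \in \big( \mathrm{Range}(op_i) \cap \mathrm{Range}(op_k) \big), \qquad \forall (op_i, op_k) \in \Im_{2S}$$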

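3- Clique Constraint Class for 2-level chaining: $\beta_s\,\alpha_1\,\beta_{(s-1)}$
[Figure: Hasse graph for operations 1 to 3 against control steps 1 to 6.]
- The constraint class 3, $\beta_s\,\alpha_1\,\beta_{(s-1)}$:
$$\sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; + \sum_{p = \mathrm{ASAP}(op_j)}^{s-1} \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s - 1) \in \mathrm{Range}(op_j), \qquad \forall (op_i, op_j) \in \Im_{1}$$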

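4- Clique Constraint Class for 2-level chaining: $\beta_s\,\alpha_1\,\tilde{\beta}^{\,i'}_{(s-2),j,k}\,\alpha_1\,\beta_{(s-2-i')}$
- The example illustrated in the figure for class 4 is for the case $i' = 1$.
[Figure: Hasse graph for operations 1 to 3 against control steps 1 to 6.]
- The constraint class 4, $\beta_s\,\alpha_1\,\tilde{\beta}^{\,i'}_{(s-2),j,k}\,\alpha_1\,\beta_{(s-2-i')}$:
$$\sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; + \sum_{p = \mathrm{ASAP}(op_j)}^{s-1} \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s - 1) \in \mathrm{Range}(op_j)$$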

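5- Clique Constraint Class for 2-level chaining: $\beta_s\,\alpha_1\,\alpha_1\,\beta_{(s-2)}$
[Figure: Hasse graph for operations 1 to 3 against control steps 1 to 6.]
- The constraint class 5, $\beta_s\,\alpha_1\,\alpha_1\,\beta_{(s-2)}$:
$$\sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; + \sum_{n=1}^{N_{t_j}} X_{op_j,n,(s-1)} \; + \sum_{p = \mathrm{ASAP}(op_k)}^{s-2} \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s - 2) \in \mathrm{Range}(op_k)$$
$$\forall (op_i, op_j) \in \Im_{1S}, \qquad \forall (op_j, op_k) \in \Im_{2S}$$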

Exploring the Hasse diagram for schedules with 3-level chaining.
[Figure: state diagram for exploring the Hasse diagram. From the start state, transitions labelled $\beta$, $\alpha_1$, $\alpha_2$, $\alpha_3$ and $\alpha_1/\alpha_2/\alpha_3$ lead to the constraint classes 1 to 14.]

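1- Clique Constraint Class for 3-level chaining: $\hat{\beta}$
[Figure: Hasse graph for operations 1 to 4 (linked by $\alpha_1$, $\alpha_2$, $\alpha_3$) against control steps 1 to 8.]
- The constraint class 1, $\hat{\beta}$:
$$\sum_{s \in \mathrm{Range}(op)} \; \sum_{n=1}^{N_t} X_{op,n,s} \le 1 \qquad \forall op \in DFG$$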

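2- Clique Constraint Class for 3-level chaining: $\beta_s\,\alpha_3\,\beta_s$
[Figure: Hasse graph for operations 1 to 4 (linked by $\alpha_1$, $\alpha_2$, $\alpha_3$) against control steps 1 to 8.]
- The constraint class 2 for 3-level chaining, $\beta_s\,\alpha_3\,\beta_s$:
$$\sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; + \sum_{p = \mathrm{ASAP}(op_l)}^{s} \sum_{n=1}^{N_{t_l}} X_{op_l,n,p} \le 1 \qquad \forall (op_i, op_l) \in \Im_{3S}$$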

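3- Clique Constraint Class for 3-level chaining: $\beta_s\,\alpha_2\,\beta_{(s-1)}$
[Figure: Hasse graph for operations 1 to 4 (linked by $\alpha_1$, $\alpha_2$, $\alpha_3$) against control steps 1 to 8.]
- The constraint class 3, $\beta_s\,\alpha_2\,\beta_{(s-1)}$:
$$\sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; + \sum_{p = \mathrm{ASAP}(op_k)}^{s-1} \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s - 1) \in \mathrm{Range}(op_k), \qquad \forall (op_i, op_k) \in \Im_{2S}$$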

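4- Clique Constraint Class for 3-level chaining: $\beta_s\,\alpha_2\,\tilde{\beta}^{\,i'}_{(s-2),k,l}\,\alpha_1\,\beta_{(s-2-i')}$
[Figure: Hasse graph for operations 1 to 4 (linked by $\alpha_1$, $\alpha_2$, $\alpha_3$) against control steps 1 to 8.]
- The constraint class 4, $\beta_s\,\alpha_2\,\tilde{\beta}^{\,i'}_{(s-2),k,l}\,\alpha_1\,\beta_{(s-2-i')}$:
$$\sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; + \sum_{p = (s-1-i')}^{(s-1)} \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} \; + \sum_{p = \mathrm{ASAP}(op_l)}^{(s-2-i')} \sum_{n=1}^{N_{t_l}} X_{op_l,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s - 2) \in \mathrm{Range}(op_l)$$
$$\forall (op_i, op_j), (op_j, op_k), (op_k, op_l) \in \Im_{1S}, \qquad (op_i, op_k) \in \Im_{2S}, \qquad (op_i, op_l) \in \Im_{3S}$$
$$\forall i' : \ 1 \le i' \le s - 2 - \mathrm{ASAP}(op_l)$$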

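5- Clique Constraint Class for 3-level chaining: $\beta_s\,\alpha_2\,\alpha_1\,\beta_{(s-2)}$
[Figure: Hasse graph for operations 1 to 4 (linked by $\alpha_1$, $\alpha_2$, $\alpha_3$) against control steps 1 to 8.]
$$\sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; + \sum_{n=1}^{N_{t_k}} X_{op_k,n,(s-1)} \; + \sum_{p = \mathrm{ASAP}(op_l)}^{s-2} \sum_{n=1}^{N_{t_l}} X_{op_l,n,p} \le 1$$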

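6- Clique Constraint Class for 3-level chaining: $\beta_s\,\alpha_1\,\beta_{(s-1)}$
[Figure: Hasse graph for operations 1 to 4 (linked by $\alpha_1$, $\alpha_2$, $\alpha_3$) against control steps 1 to 8.]
- The constraint class 6, $\beta_s\,\alpha_1\,\beta_{(s-1)}$:
$$\sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; + \sum_{p = \mathrm{ASAP}(op_j)}^{s-1} \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s - 1) \in \mathrm{Range}(op_j)$$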

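7- Clique Constraint Class for 3-level chaining: $\beta_s\,\alpha_1\,\tilde{\beta}^{\,i'}_{(s-2),j,l}\,\alpha_2\,\beta_{(s-2-i')}$
[Figure: Hasse graph for operations 1 to 4 (linked by $\alpha_1$, $\alpha_2$, $\alpha_3$) against control steps 1 to 8.]
$$\sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; + \sum_{p = (s-1-i')}^{s-1} \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} \; + \sum_{p = \mathrm{ASAP}(op_l)}^{(s-2-i')} \sum_{n=1}^{N_{t_l}} X_{op_l,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s - 2) \in \mathrm{Range}(op_l)$$
$$\forall (op_i, op_j), (op_j, op_k), (op_k, op_l) \in \Im_{1S}, \qquad (op_j, op_l) \in \Im_{2S}, \qquad (op_i, op_l) \in \Im_{3S}$$
$$\forall i' : \ 1 \le i' \le s - 2 - \mathrm{ASAP}(op_l)$$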

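8- Clique Constraint Class for 3-level chaining: $\beta_s\,\alpha_1\,\tilde{\beta}^{\,i'}_{(s-2),j,k}\,\alpha_1\,\beta_{(s-2-i')}$
[Figure: Hasse graph for operations 1 to 4 (linked by $\alpha_1$, $\alpha_2$, $\alpha_3$) against control steps 1 to 8.]
- The constraint class 8, $\beta_s\,\alpha_1\,\tilde{\beta}^{\,i'}_{(s-2),j,k}\,\alpha_1\,\beta_{(s-2-i')}$:
$$\sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; + \sum_{p = (s-1-i')}^{s-1} \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} \; + \sum_{p = \mathrm{ASAP}(op_k)}^{s-2-i'} \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s - 2 - i') \in \mathrm{Range}(op_k)$$
$$\forall (op_i, op_j), (op_j, op_k) \in \Im_{1S}, \qquad (op_i, op_k) \in \Im_{2S}$$
$$\forall i' : \ 1 \le i' \le s - 2 - \mathrm{ASAP}(op_k)$$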

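9- Clique Constraint Class for 3-level chaining: $\beta_s\,\alpha_1\,\tilde{\beta}^{\,i'}_{(s-2),j,l}\,\alpha_1\,\tilde{\beta}^{\,i''}_{(s-2),k,l}\,\alpha_1\,\beta_{(s-i'-i'')}$
[Figure: Hasse graph for operations 1 to 4 (linked by $\alpha_1$, $\alpha_2$, $\alpha_3$) against control steps 1 to 8, drawn for $i', i'' = 1$.]
- The constraint class 9, $\beta_s\,\alpha_1\,\tilde{\beta}^{\,i'}_{(s-2),j,l}\,\alpha_1\,\tilde{\beta}^{\,i''}_{(s-2),k,l}\,\alpha_1\,\beta_{(s-i'-i'')}$:
$$\sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; + \sum_{p = (s-1-i')}^{s-1} \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} \; + \sum_{p = s-2-i'-i''}^{s-2-i'} \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} \; + \sum_{p = \mathrm{ASAP}(op_l)}^{s-3-i'-i''} \sum_{n=1}^{N_{t_l}} X_{op_l,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s - 3 - i' - i'') \in \mathrm{Range}(op_l)$$
$$\forall i' : \ 1 \le i' \le s - 4 - \mathrm{ASAP}(op_l) \quad \text{and} \quad \forall i'' : \ 1 \le i'' \le s - 3 - \mathrm{ASAP}(op_l) - i'$$
$$\max(i' + i'') = s - 3 - \mathrm{ASAP}(op_l)$$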

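10- Clique Constraint Class for 3-level chaining: $\beta_s\,\alpha_1\,\tilde{\beta}^{\,i'}_{(s-2),j,k}\,\alpha_1\,\alpha_1\,\beta_{(s-2-i)}$
[Figure: Hasse graph for operations 1 to 4 (linked by $\alpha_1$, $\alpha_2$, $\alpha_3$) against control steps 1 to 8.]
- The constraint class 10, $\beta_s\,\alpha_1\,\tilde{\beta}^{\,i'}_{(s-2),j,k}\,\alpha_1\,\alpha_1\,\beta_{(s-2-i)}$:
$$\sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; + \sum_{p = (s-1-i')}^{s-1} \sum_{n=1}^{N_{t_j}} X_{op_j,n,p} \; + \sum_{n=1}^{N_{t_k}} X_{op_k,n,(s-2-i')} \; + \sum_{p = \mathrm{ASAP}(op_l)}^{s-3-i'} \sum_{n=1}^{N_{t_l}} X_{op_l,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s - 3 - i') \in \mathrm{Range}(op_l)$$
$$\forall i' : \ 1 \le i' \le s - 3 - \mathrm{ASAP}(op_l)$$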

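11- Clique Constraint Class for 3-level chaining: $\beta_s\,\alpha_1\,\alpha_1\,\beta_{(s-2)}$
[Figure: Hasse graph for operations 1 to 4 (linked by $\alpha_1$, $\alpha_2$, $\alpha_3$) against control steps 1 to 8.]
- The constraint class 11, $\beta_s\,\alpha_1\,\alpha_1\,\beta_{(s-2)}$:
$$\sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; + \sum_{n=1}^{N_{t_j}} X_{op_j,n,(s-1)} \; + \sum_{p = \mathrm{ASAP}(op_k)}^{s-2} \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s - 2) \in \mathrm{Range}(op_k)$$
$$\forall (op_i, op_j), (op_j, op_k) \in \Im_{1S}, \qquad (op_i, op_k) \in \Im_{2S}$$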

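12- Clique Constraint Class for 3-level chaining: $\beta_s\,\alpha_1\,\alpha_1\,\tilde{\beta}^{\,i''}_{(s-2),k,l}\,\alpha_1\,\beta_{(s-i'')}$
[Figure: Hasse graph for operations 1 to 4 (linked by $\alpha_1$, $\alpha_2$, $\alpha_3$) against control steps 1 to 8, drawn for $i' = 1$.]
$$\sum_{p = s - D(op_i) + 1}^{\mathrm{ALAP}(op_i)} \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \; + \sum_{p = (s-2-i')}^{s-2} \sum_{n=1}^{N_{t_k}} X_{op_k,n,p} \; + \sum_{n=1}^{N_{t_j}} X_{op_j,n,(s-1)} \; + \sum_{p = \mathrm{ASAP}(op_l)}^{s-3-i'} \sum_{n=1}^{N_{t_l}} X_{op_l,n,p} \le 1$$
$$\forall s \in \mathrm{Range}(op_i), \quad (s - 3 - i') \in \mathrm{Range}(op_l)$$
$$\forall i' : \ 1 \le i' \le s - 3 - \mathrm{ASAP}(op_l)$$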

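13- Clique Constraint Class for 3-level chaining formulation $\beta_s\,\alpha_1\,\alpha_1\,\alpha_1\,\beta_{(s-3)}$
[Figure: clique diagram on a grid of operations (op) 1-4 versus control steps (cstep, s) 1-8, with chaining levels α1, α2, α3 marked.]
Ë The constraint class 13, $\beta_s\,\alpha_1\,\alpha_1\,\alpha_1\,\beta_{(s-3)}$:
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)}\ \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \;+\; \sum_{n=1}^{N_{t_j}} X_{op_j,n,(s-1)} \;+\; \sum_{n=1}^{N_{t_k}} X_{op_k,n,(s-2)} \;+\; \sum_{p=\mathrm{ASAP}(op_l)}^{s-3}\ \sum_{n=1}^{N_{t_l}} X_{op_l,n,p} \;\le\; 1$$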

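14- Clique Constraint Class for 3-level chaining formulation $\beta_s\,\alpha_1\,\alpha_2\,\beta_{(s-2)}$
[Figure: clique diagram on a grid of operations (op) 1-4 versus control steps (cstep, s) 1-8, with chaining levels α1, α2, α3 marked.]
$$\sum_{p=s-D(op_i)+1}^{\mathrm{ALAP}(op_i)}\ \sum_{n=1}^{N_{t_i}} X_{op_i,n,p} \;+\; \sum_{n=1}^{N_{t_j}} X_{op_j,n,(s-1)} \;+\; \sum_{p=\mathrm{ASAP}(op_l)}^{s-2}\ \sum_{n=1}^{N_{t_l}} X_{op_l,n,p} \;\le\; 1$$
$$\forall\, s \in \mathrm{Range}(op_i),\quad (s-2) \in \mathrm{Range}(op_l)$$
$$\forall\, (op_i, op_j),\ (op_j, op_k),\ (op_k, op_l) \in \Im_{1S},\quad (op_j, op_l) \in \Im_{2S},\quad (op_i, op_l) \in \Im_{3S}$$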

Maximal Clique Constraints are stronger than the Extended Wheel Constraints
[Figure: two grids of operations (op) versus control steps (cstep, s), one per constraint, with chaining levels α1, α2, α3 marked.]
Ë The Extended Wheel Constraint:
$$X_{1,3} + X_{3,2} + X_{3,3} + X_{4,2} + X_{4,3} + 2 \cdot X_{1,4} + 2 \cdot X_{1,5} \le 2$$
Ë The combined maximal cliques constraint:
$$X_{1,3} + X_{1,4} + X_{1,5} + X_{3,1} + X_{3,2} + X_{3,3} + X_{3,4} + X_{3,5} + X_{3,6} + X_{4,2} + X_{4,3} \le 2$$

Comparing the logical formulation vs. the maximal clique formulation for the AR filter
Ë To reach a first integer solution, the maximal clique formulation takes more time.

Metric                                        Logical formulation      Maximal cliques formulation
Number of iterations (primal)                 1,200                    1,480
Number of iterations (integer)                1,420                    1,706
Number of nodes of Branch and Bound           54                       103
CPU time in sec (primal)                      12 (Ultra Sparc 2)       24 (Ultra Sparc 2)
CPU time in sec (integer)                     19 (Ultra Sparc 2)       38 (Ultra Sparc 2)
Total CPU time in sec                         31                       62
Optimality condition                          first integer            first integer
Number of discrete variables                  536                      536
Number of single inequalities                 7,363                    9,256 (25.7% increase)
Termination condition                         first integer solution   first integer solution

Comparing the logical formulation vs. the maximal clique formulation for the AR filter
Ë The maximal clique formulation achieves an optimal solution within tolerance long before the logical formulation.

Metric                                        Logical formulation         Maximal cliques formulation
Number of iterations (integer)                8.45E5                      14,577
Number of nodes of Branch and Bound           42,025                      596
CPU time in sec (primal)                      12 (Ultra Sparc 2)          24 (Ultra Sparc 2)
CPU time in sec (integer)                     14,491 (Ultra Sparc 2)      221 (Ultra Sparc 2)
Total CPU time in sec                         14,503                      245
Optimality condition                          0.07 (not achieved)         0.07 (achieved)
Number of discrete variables                  536                         536
Number of single inequalities                 7,363                       9,256
Termination condition                         after 5 integer solutions   achieved optimal result within tolerance

Comparing the logical formulation vs. the maximal clique formulation for the EWF benchmark
Ë The maximal clique formulation achieves an optimal solution within tolerance before the logical formulation.

Metric                                        Logical formulation         Maximal cliques formulation
Number of iterations (integer)                56,697                      4,668
Number of nodes in Branch and Bound           1,827                       190
CPU time in sec (primal)                      86 (Ultra Sparc 2)          150 (Ultra Sparc 2)
CPU time in sec (integer)                     5.4E3 (Ultra Sparc 2)       518 (Ultra Sparc 2)
Total CPU time in sec                         5.48E3                      668
Number of discrete variables                  940                         940
Number of single inequalities                 11,154                      16,195 (45.2% increase)
Termination condition                         after 5 integer solutions   achieved optimal result within tolerance

Comparing the logical formulation vs. the maximal clique formulation for the DCT benchmark
Ë The maximal clique formulation achieves an optimal solution within tolerance before the logical formulation.

Metric                                        Logical formulation         Maximal cliques formulation
Number of iterations (primal)                 3,288 (Ultra Sparc 2)       4,623 (Ultra Sparc 2)
Number of iterations (integer)                23 (Ultra Sparc 2)          not given (Ultra Sparc 2)
Number of nodes in Branch and Bound           1E4                         168
CPU time in sec (primal)                      83                          312
CPU time in sec (integer)                     2E5                         2,575
Total CPU time in sec                         2E5                         2,887
Number of discrete variables                  1,066                       1,066
Number of single inequalities                 13,623                      18,979 (39.3% increase)
Termination condition                         after 5 integer solutions   achieved optimal result within tolerance
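The trade-off reported in the three "within tolerance" tables (more inequalities, far less total CPU time) can be summarized with a few lines of arithmetic. The sketch below only re-uses the totals printed in those tables; the tuple layout and the function name `summarize` are illustrative choices, and the DCT and EWF logical-formulation times are the rounded figures 2E5 and 5.48E3 given in the tables.

```python
# (benchmark, total CPU s logical, total CPU s maximal cliques,
#  inequalities logical, inequalities maximal cliques)
runs = [
    ("AR filter", 14503.0, 245.0, 7363, 9256),
    ("EWF", 5480.0, 668.0, 11154, 16195),
    ("DCT", 2.0e5, 2887.0, 13623, 18979),
]


def summarize(rows):
    """Map benchmark -> (CPU-time speedup, % growth in inequalities)."""
    summary = {}
    for name, t_logical, t_clique, m_logical, m_clique in rows:
        speedup = t_logical / t_clique
        growth = 100.0 * (m_clique - m_logical) / m_logical
        summary[name] = (speedup, growth)
    return summary


summary = summarize(runs)
for name, (speedup, growth) in summary.items():
    print(f"{name}: {speedup:.1f}x faster with {growth:.1f}% more inequalities")
```

With these figures the maximal clique formulation is roughly 8 to 70 times faster in total CPU time while adding between about 26% and 45% more inequalities.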

Chapter 5

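Convex Bipartite Graph Matching
Ë This bipartite graph corresponds to the multiply operations of the EWF benchmark. The function unit resources are 1 Multiplier.
[Figure: convex bipartite graph. One vertex set holds the multiply operations 17, 18, 8, 29, 33, 24, 4 and 12 (column "InitialOperation"); the other holds vertices numbered 1-12. Interval labels in the figure (column "FU_IOEI"): [3,7], [3,7], [8,11], [8,11], [12,15], [13,15], [13,15], [14,15] and [3,6], [4,7], [8,10], [9,11], [12,12], [13,13], [14,14], [15,15].]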
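To make the matching step concrete, here is a minimal Python sketch of maximum matching in a convex bipartite graph using Glover's greedy rule (scan the slots in increasing order and give each slot to the waiting interval with the smallest right endpoint). This is a standard technique named here for illustration; the slides do not state which matching algorithm the tool uses. Reading the first interval list of the figure as the feasible control-step windows of the eight EWF multiplications on a single multiplier is also an assumption.

```python
import heapq


def convex_bipartite_matching(intervals):
    """Match each interval [lo, hi] to a distinct integer slot inside it.

    Glover's rule: visit slots in increasing order and assign each slot to
    the available interval whose right endpoint is smallest.
    Returns a dict {interval index: slot}; unmatched intervals are absent.
    """
    by_left = sorted(range(len(intervals)), key=lambda i: intervals[i][0])
    first = min(lo for lo, _ in intervals)
    last = max(hi for _, hi in intervals)

    waiting = []   # heap of (right endpoint, interval index)
    nxt = 0        # next interval (in left-endpoint order) to activate
    match = {}
    for slot in range(first, last + 1):
        # Activate every interval whose window has opened.
        while nxt < len(by_left) and intervals[by_left[nxt]][0] <= slot:
            i = by_left[nxt]
            heapq.heappush(waiting, (intervals[i][1], i))
            nxt += 1
        # Drop intervals whose window already closed without a match.
        while waiting and waiting[0][0] < slot:
            heapq.heappop(waiting)
        if waiting:
            _, i = heapq.heappop(waiting)
            match[i] = slot
    return match


# First interval list from the figure, in the order it is printed.
windows = [(3, 7), (3, 7), (8, 11), (8, 11), (12, 15), (13, 15), (13, 15), (14, 15)]
match = convex_bipartite_matching(windows)
print(match)
```

All eight intervals receive a distinct control step, so under the stated reading a single multiplier suffices for these windows.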

Strong components corresponding to the bipartite graph matching
Ë Dotted edges can be pruned at this step.
[Figure: strong components of the graph derived from the bipartite matching; vertices numbered 1-12.]

Chapter 6

The fifth-order Elliptic Wave Filter benchmark
Ë Consists of 34 operations (8 multiplications and 26 additions).
[Figure: signal-flow graph of the filter from input to output, built from adders (+), multipliers (*) and delay elements (Z).]

The DFG of the EWF benchmark
[Figure: DFG of the EWF benchmark drawn against control steps 1-17, from IN to OUT, with signals labelled a-i and numbered operation nodes.]

Effect of Chaining and Pipelining FUs on Datapath Performance
[Figure: Exploration of the Design Space for the EWF benchmark. Axes: Cost (number of CLBs) and total execution time Totexec, Λ (ns). Design points 1-11 are plotted; annotations: 1 (1+, 1*), 2-a (Bus, ours), 2-b (best of others), 3 (non-pipelined), 2-pipe, 4-pipe.]

Effect of Chaining and Pipelining FUs on EWF Datapath Performance

Design Space   CSteps, T   Resources     Pipeline level   Chaining   Cost (CLBs)   T (ns)
1              27          1+, 1*        2-stages         yes        140           2158
2-a (ours)     17          2+, 1*, 1b    2-stages         no         160           1275
2-b [13]       17          2+, 1*        2-stages         no         180           1275
3              10          3+, 1*        No-pipe          yes        195           1650
4              12          3+, 1*        2-stages         yes        185           996
5              11          3+, 1*        2-stages         yes        190           935
6              11          3+, 1*        2-stages         yes        195           913
7              17          3+, 1*        4-stages         yes        225           731
8              19          3+, 1*        4-stages         yes        205           836
9              17          3+, 1*        4-stages         yes        210           765
10             18          3+, 1*        4-stages         yes        215           774
11             17          3+, 1*        4-stages         yes        220           765
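One way to read the design-space table is to keep only the points that are not beaten on both cost and execution time by another point. The sketch below computes that Pareto front from the Cost and T(ns) columns; the function name and the dictionary layout are illustrative, and the slides themselves do not describe the exploration in these terms.

```python
# Design point -> (cost in CLBs, total execution time in ns), from the table.
points = {
    "1": (140, 2158), "2-a": (160, 1275), "2-b": (180, 1275),
    "3": (195, 1650), "4": (185, 996), "5": (190, 935),
    "6": (195, 913), "7": (225, 731), "8": (205, 836),
    "9": (210, 765), "10": (215, 774), "11": (220, 765),
}


def pareto_front(designs):
    """Names of designs that no other design dominates (lower is better)."""
    front = []
    for name, (cost, time) in designs.items():
        dominated = any(
            c <= cost and t <= time and (c < cost or t < time)
            for other, (c, t) in designs.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    # Order the front by increasing cost for readability.
    return sorted(front, key=lambda n: designs[n][0])


front = pareto_front(points)
print(front)
```

On these figures, design 2-a dominates 2-b (same execution time at lower cost), and the 4-stage pipelined point 7 is the fastest point on the front.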

Final FPGA Implementation on Xilinx 4000 series †
† Using XACT 5.0 tools, the best area architecture would fit into an x4006 chip and require about 200 CLBs.

Component                     Our Best Area   Our Best Performance   Best in Literature (Simulated Evolution)
Controller                    33              27                     30
Register_File                 10              Not used               Not used
ROM                           4               4                      4
Multiplier                    110             110                    110
Adder                         10              10                     10
4/3/2 to 1 mux                16/8            16/16/8                16/16/8
Register / Tristate           8/1             8/1                    8/1
7/6/5 to 1 mux                Not used        36/26/25               36/26/25
Total # CLBs                  323             391                    361
Total execution time (nsec)   1275            731                    1275

Scheduling and binding for the AR filter, illustrating register binding.
[Figure: scheduled DFG of the AR filter over control steps 1-9 with operations numbered 1-28; edges are annotated with the registers R1, R2 and R3 they are bound to.]

Synthesized architecture for the AR filter
Ë Resources: 2 multipliers (2-stage pipelined) and 2 adders; uses 3 registers and 12 multiplexer inputs.
[Figure: datapath with registers R1, R2, R3, adders A1, A2 and multipliers M1, M2.]

The scheduling and binding for the AR ﬁlter, using 4-stage pipelined multipliers
[Figure: scheduled DFG of the AR filter over control steps 1-13 with operations numbered 1-28.]

The DFG of the Fast Discrete Transform Benchmark
[Figure: data flow graph of the benchmark with numbered nodes.]

The Fast Discrete Cosine Transform.

                      Ours          SODAS-DSP      MARS
Resources             2*, 2+, 2-    2*, 2+, 2-     2*, 2+, 2-
# mux inputs          37            66             NA
# registers           13            47             NA
Clock (ns)            60            100            NA
# csteps              10            12, dii=8 (a)  8 (b)
Totexec (ns)          600           1200           NA
Throughput (c) (MHz)  1.67          1.25           NA

a. dii is the data initiation rate for the pipelined architecture used in SODAS-DSP.
b. MARS reports 8 cycles. No other details of the scheduling are available.
c. Throughput indicates the highest input-sampling rate of the architecture.
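The throughput row can be reproduced from the other rows. The sketch below assumes that a non-pipelined schedule accepts a new sample every csteps × clock nanoseconds, while the pipelined SODAS-DSP design accepts one every dii × clock nanoseconds; this reading is inferred from the numbers in the table rather than stated on the slide, and the function name is illustrative.

```python
def throughput_mhz(initiation_interval_steps, clock_ns):
    """Highest input-sampling rate in MHz for a given initiation interval."""
    period_ns = initiation_interval_steps * clock_ns
    return 1e3 / period_ns  # 1 / ns = GHz, so 1e3 / ns = MHz


# Ours: 10 control steps at 60 ns, one sample per complete schedule.
ours_mhz = throughput_mhz(10, 60)
# SODAS-DSP: 12 control steps at 100 ns, but a new sample every dii = 8 steps.
sodas_mhz = throughput_mhz(8, 100)

print(round(ours_mhz, 2), round(sodas_mhz, 2))
```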

Synthesized Architecture for the Fast Discrete Cosine Transform benchmark.
[Figure: datapath with adders A1, A2, multipliers M1, M2 and subtracters S1, S2.]
Ë Resources: 2 multipliers, 2 adders and 2 subtracters. Uses 13 registers and 37 mux inputs.

The DFG of the Discrete Cosine Transform Benchmark
[Figure: data flow graph of the DCT benchmark with numbered nodes; the inputs are d0-d7.]

Synthesized Architecture for the Discrete Cosine Transform.
Ë Resources: 2 multipliers (4-stage pipelined) and 4 adders. Uses 11 registers and 28 mux inputs.
[Figure: datapath with multipliers M1, M2 and adders A1, A2, A3, A4.]

Chaining paths for the Discrete Cosine Transform
[Figure: chaining paths among the multipliers M1, M2 and the adders A1, A2, A3, A4.]

Chaining interconnections modeled for false paths detection
[Figure: chaining interconnection model; the function units M1, M2, A1, A2, A3, A4 are repeated in three columns labelled V1, V2 and V3.]

Synthesized Bus architecture of the DCT benchmark
Ë Resources: 1 multiplier (4-stage pipelined) and 3 adders/subtracters. Uses 9 registers, 18 mux inputs and 1 bus.
[Figure: bus architecture with Bus1, adders/subtracters A1, A2, A3, multiplier M, ROM, a register file and registers R1-R9.]
[Figure: path-class diagram with classes class1 to class5, starting at β, with edges labelled α1, α2 and β.]

Synthesized Random topology architecture for the DCT benchmark
Ë Resources: 1 multiplier (4 pipeline stages) and 3 adders/subtracters. Uses 10 registers and 24 mux inputs.
[Figure: datapath with adders/subtracters A1, A2, A3, ROM and registers R1-R10.]

Synthesized Random topology architecture for the DCT benchmark
Ë Resources: 1 multiplier (4 pipeline stages) and 3 adders/subtracters. Uses 12 registers and 20 mux inputs.
[Figure: datapath with adders/subtracters A1, A2, A3, multiplier M, ROM and registers R1-R12.]

The Discrete Cosine Transform Benchmark

                Ours      PSGA_Syn [69]   Tool [23] Chaudhuri/Walker   SALSA [34] (Chain)   SALSA [34]
Resources       2*, 4+    3*, 3+          3*, 4+                       2*, 4+               2*, 4+
# mux inputs    28        NA              NA                           NA                   30
# registers     11        14              NA                           15                   13
Clock (ns)      45        120 (a)         65 (b)                       135 (c)              65 (d)
# csteps        11        18              9                            8                    11
Totexec (ns)    495       2160            585                          1080                 715

a. This tool uses neither chaining nor pipelining for the DCT.
b. The tool described in [23] does not use chaining.
c. The level of chaining is not reported in [34].
d. SALSA [34] does not determine the clock duration of the total execution. However, we have used the same library for comparison.
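The Totexec row is the product of the clock period and the number of control steps, which is why a design with more control steps can still finish sooner. The short check below recomputes that product for each column of the table above; the dictionary layout and names are illustrative only.

```python
# Column -> (clock in ns, number of control steps, reported Totexec in ns)
designs = {
    "Ours": (45, 11, 495),
    "PSGA_Syn [69]": (120, 18, 2160),
    "Tool [23]": (65, 9, 585),
    "SALSA [34] (Chain)": (135, 8, 1080),
    "SALSA [34]": (65, 11, 715),
}


def total_exec_ns(clock_ns, csteps):
    """Total execution time of one pass through the schedule."""
    return clock_ns * csteps


mismatches = [
    name for name, (clock, steps, reported) in designs.items()
    if total_exec_ns(clock, steps) != reported
]
fastest = min(designs, key=lambda name: designs[name][2])
print(mismatches, fastest)
```

Every column is consistent with clock × csteps, and the fastest design is not the one with the fewest control steps.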

The Discrete Cosine Transform Benchmark

                Ours      PSGA_Syn, Tool in [69]   SALSA (Chain) [34]   OSTA no-Chain [70]
Resources       1*, 3+    3*, 3+                   2*, 4+               3*, 6+
# mux inputs    24        NA                       NA                   38
# registers     10        14                       15                   24
Clock (ns)      45        120                      135                  120
# csteps, T     19        18                       8                    9
Totexec (ns)    855       2160                     1080                 1080

Chapter 7

CONCLUSIONS
• Our architectural model is suitable for a broad range of implementation technologies, specifically FPGAs, including bus/SRAM-based ones.
• We introduced optimization criteria for ILP solvers for datapath synthesis:
Ë Our model and criteria can also be used with other solvers (e.g. stochastic).
• The approach:
Ë Scheduling with chaining and deep pipelining of FUs while minimizing "Structural Complexity".
Ë Optimization of the total execution time of the architecture, with clock cycle determination.
Ë Followed by bus assignment, if it is supported by the FPGA.
• This approach has demonstrated that a discriminating search of a larger architectural space can produce:
Ë Regular architectures with minimum interconnections, low resource usage and high throughput.

Contribution of this research
Ë Several interconnect minimization measures were incorporated in the formulation, which significantly improve the quality of the resulting synthesized architectures.
Ë This was demonstrated for different benchmarks, where the numbers of registers and multiplexer inputs were consistently smaller in architectures synthesized with this methodology than in previously published results. This is an important issue in developing a tool geared toward technologies with scarce interconnect resources, such as FPGAs.
Ë For the first time, an Integer Linear Programming (ILP) formulation that includes a non-tabular, non-restricted model of the system clock duration was developed. This has proved to be a significant step in modeling the total execution time of the architecture and, as a result, in minimizing it successfully.
Ë The architectural synthesis scheduling and binding was formulated as a performance optimization problem rather than as the mere minimization of the number of control steps. A theoretical linearization technique for the objective function of this formulation was presented, and it was demonstrated that this technique has negligible impact on the size of the problem.

Contribution of this research
Ë Verification of the validity of the overall methodology by integrating this tool with logic synthesis and back-end tools.
Ë The development of the set of valid inequalities for the scheduling and binding problem: the identification and derivation of both the extended wheel graph inequalities and the maximal clique inequalities. For the first time, this guarantees the tightest formulation for schedules with n levels of chaining and multicycled/pipelined resources.
Ë An algorithmic approach was developed for generating the minimum set of inequality classes necessary for the general scheduling and binding problem. This algorithm explores a Hasse graph representing the scheduling problem and classifies all the maximal paths into maximal path classes. These classes can be incorporated into the automatic generation of the maximal clique constraints, which represent the tightest description of the scheduling and binding problem with n-level chaining.

VERTICAL PAGES

Flow of the Architectural synthesis methodology.

Inputs: CDFG, DFG, Tech
STEP-1: Scheduling Bounds
- Heuristics to determine the lower bound on the number of cycles.
- Heuristics to tighten the ASAP/ALAP values under the given resource constraints.
STEP-2: C++: Constraint Generation for ILP
- DFG exploration.
- Dynamic Set generation for chaining
- ILP constraint generation
STEP-3: ILP: Random Topology
- Scheduling and Binding
- Chaining of Operations
- Clock cycle minimization + FU pipelining choice
- Minimization of the number of cycles, OR minimization of the total execution time (i.e. throughput maximization).
STEP-4: ILP: Bus Insertion
- Bus transfer scheduling
- Bus allocation
- Storage Minimization
- Bus loading minimization.
STEP-LAST: Register Allocation
- Data Storage Assignment
- VHDL generation of the Datapath and the Controller

Flow of the Back-End Tools
Ë Stage-2 uses Synopsys tools (logic synthesis and FPGA mapping), and Stage-3 uses Xilinx (XACT tools) for PPR.

VHDL SOURCE FILES
- Simulate
Stage-2: SYNOPSYS
- Read HDL and insert pads
- Compile and optimize the datapath and controller
- Inputs: Area Constraints, Delay Constraints, FU-Pipelining (i.e. Register-balancing), Xilinx Library, Xilinx Hard-macros
- To simulation
Stage-3: Xilinx
- Partition, Placement and Routing

ASAP Scheduling
Input: Data Flow Graph G
Output: node_array<int> schedule_I, representing the As Soon As Possible (ASAP) scheduling of the nodes of the DFG for a maximum chaining level "Max_Chain_Length".
ASAP {
  // 1- Nodes without predecessors start in step 1; all others wait in S.
  G.for_all_nodes(v) {
    if (input_degree(v) == 0)
      { schedule_I(v) = 1; }
    else
      { schedule_I(v) = 0; insert v into the node set S; }
  }
  // 2- Schedule a node once all of its predecessors are scheduled.
  while (node set S ≠ Φ) {
    G.for_all_nodes(v) {
      if ((v ∈ S) and all_pred_scheduled(G, v, schedule_I)) {
        G.all_input_edges(e, v) {
          w = G.source(e);
          // Neither w nor v is multicycle: chain if the level allows it.
          if ((G.type(w) ≠ "multicycle") and (G.type(v) ≠ "multicycle"))
            if (Ch_Level_ASAP(w) ≤ Max_Chain_Length)
              { temp_schedule = schedule_I(w); }
            else
              { temp_schedule = schedule_I(w) + delay(v); }
          // w is multicycle: v may start in the last step of w only if v is not multicycle.
          if (G.type(w) == "multicycle") {
            if (G.type(v) ≠ "multicycle")
              { temp_schedule = schedule_I(w) + delay(w) - 1; }
            else
              { temp_schedule = schedule_I(w) + delay(w); }
          }
          // Keep the latest start time imposed by any predecessor.
          if (temp_schedule > schedule_I(v))
            { schedule_I(v) = temp_schedule; }
        }
        // 3- Update the chaining level of v for its final schedule.
        Adj_Ch_Level_ASAP(G, v, schedule_I, Ch_Level_ASAP);
        // 4- v is now scheduled.
        delete node v from the node set S;
      }
    }
  }
}
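The listing above can be turned into a small runnable sketch. The Python version below is a simplified variant under stated assumptions, not a transcription of the author's tool: the DFG is a dictionary mapping each node to its predecessors, the chaining level of a node is the number of operations chained into its control step (1 when unchained), an operation may chain onto a predecessor only while that predecessor's level is strictly below `max_chain`, and a single-cycle operation feeding a multicycle one is never chained. All names are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def asap_with_chaining(preds, kind, delay, max_chain):
    """Return (schedule, chain_level) dicts for an ASAP schedule with chaining.

    preds: node -> list of predecessor nodes (every node must be a key)
    kind:  node -> "single" or "multicycle"
    delay: node -> latency in control steps
    """
    sched, level = {}, {}
    # static_order() yields each node after all of its predecessors.
    for v in TopologicalSorter(preds).static_order():
        start, lvl = 1, 1
        for w in preds[v]:
            chainable = kind[v] != "multicycle" and level[w] < max_chain
            if kind[w] == "multicycle" and chainable:
                # v starts in the last control step of the multicycle unit.
                cand, cand_lvl = sched[w] + delay[w] - 1, level[w] + 1
            elif kind[w] != "multicycle" and chainable:
                # v shares the control step of its single-cycle predecessor.
                cand, cand_lvl = sched[w], level[w] + 1
            else:
                # No chaining: v waits until w has completed.
                cand, cand_lvl = sched[w] + delay[w], 1
            if cand > start:
                start, lvl = cand, cand_lvl
            elif cand == start:
                lvl = max(lvl, cand_lvl)
        sched[v], level[v] = start, lvl
    return sched, level


# Hypothetical DFG: adds a -> b -> c, a 2-cycle multiply m, and e = f(m, c).
preds = {"a": [], "b": ["a"], "c": ["b"], "m": [], "e": ["m", "c"]}
kind = {"a": "single", "b": "single", "c": "single", "m": "multicycle", "e": "single"}
delay = {"a": 1, "b": 1, "c": 1, "m": 2, "e": 1}

sched, level = asap_with_chaining(preds, kind, delay, max_chain=2)
print(sched, level)
```

With `max_chain=2`, b chains onto a in step 1, c is pushed to step 2 because the chain is full, and e chains onto both the multiplier output and c in step 2.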

Adjust Chaining level of a node
Input: Data Flow Graph G, node v, the node array schedule_I representing the current schedule, and the node array Ch_Level_ASAP representing the current chaining level.
Output: Adjusted version of Ch_Level_ASAP for node v, according to the current schedule schedule_I.
Adj_Ch_Level_ASAP {
  G.all_input_edges(e, v) {
    w = G.source(e);
    // v is chained behind a single-cycle predecessor in the same control step.
    if ((G.type(w) ≠ "multicycle") and (schedule_I(v) == schedule_I(w))
        and (Ch_Level_ASAP(w) < Max_Chain_Length)
        and (Ch_Level_ASAP(v) < Ch_Level_ASAP(w) + 1))
      { Ch_Level_ASAP(v) = Ch_Level_ASAP(w) + 1; }
    // v is chained onto the last control step of a multicycle predecessor.
    if ((G.type(w) == "multicycle") and (G.type(v) ≠ "multicycle")
        and (schedule_I(v) == schedule_I(w) + mul_delay - 1)
        and (Ch_Level_ASAP(v) ≤ 2))
      { Ch_Level_ASAP(v) = Ch_Level_ASAP(w) + 1; }
  }
}

procedure create_classes_with_β
create_classes_with_β (active_edge, distance, j, classBase) {
  if (distance == 0) {
    class_j = classBase + ;
  }
  else
    for x = 0 to distance {
      if (x == 0) {
        classnew = class_j;
        create_class_without_β (active_edge, distance, j, classnew);
      }
      if (x > 0) {
        j = j + 1;
        if (x == n - i) {
        }
        if (x < n - i) {
          distancenew = distance - x;
          classnew = class_j;
          create_class_with_β (active_edge, distancenew, j, classnew);
        }
      }
    }
}
Class expressions shown beside this listing: αi, {β}, {β}, αi, {αx} + {β}, {αx}, αi.

procedure create_classes_without_β:
create_classes_without_β (active_edge, distance, j, classBase) {
  for t = distance down to 1 {
    if (t == n - i) {
      j = j + 1;
    }
    if (t < n - i) {
      j = j + 1;
      distancenew = distance - t;
      classnew = class_j;
      create_class_with_β (active_edge, distancenew, j, classnew);
    }
  }
}
Class expressions shown beside this listing: αi, {αt} + {β}, {αt}, αi.
