A Flexible High-Performance Network Router with Hybrid Wave ...
A Flexible High-Performance Network Router with
Jabulani Nyathi and Jose G. Delgado-Frias
Department of Electrical Engineering
State University of New York
Binghamton, NY 13902-6000
In this paper a novel hybrid wave-pipelined bit-pattern associative router is pre-
sented. A router is an important component in any communication network systems.
The bit-pattern associative router (BPAR) allows for exibility and can accommodate
a large number of routing algorithms. Wave-pipelining is a high performance approach
which implements pipelining in logic without the use of intermediate registers. In this
study a hybrid wave-pipelined approach has been proposed and implemented. Hybrid
wave-pipelining allows for the reduction of the delay di erence between the maximum
and minimum delays by narrowing the gap between each stage of the system. This
approach yields narrow computing cones" that could allow faster clocks to be run.
This is the rst study in wave-pipelining that deals with a system that has a sub-
stantially di erent set of pipeline stages. The bit-pattern associative router has three
stages: condition match, selection function, and port assignment. In each stage data
delay paths are tightly controlled in order to optimize the proper propagation of signals.
Internal control signals are generated to ensure that data propagates between stages
in a proper fashion. The simulation results show that using a hybrid wave-pipelining
signi cantly reduces the clock period and the circuit delays become the limiting factor,
preventing further clock cycle time reduction.
Communication channels between any two nodes regardless of their physical location within
a communication network system can be established by using routers at each node. The pur-
pose of the router would be to handle messages on the network. A router receives, forwards,
and delivers messages. The router system transfers messages based on a routing algorithm
which is a crucial element of any communication network 2, 3]. Given the recon guring re-
quirements for many computing systems, a number of routing algorithms as well as network
topologies must be supported. This in turn leads to a need for a very high performance exi-
ble router to support these requirements. To maximize a machine's overall performance and
to accommodate recon guration in a distributed environment requires matching the applica-
tion characteristics with a suitable routing algorithm and topology. It has been shown that
machine performance is fairly sensitive to the routing function and tra c 4]. In addition,
having the capability of changing the routing algorithm at run time could facilitate smart
interconnects and adaptability that allow changes on the machine's topology for di erent
A router intended to be used for a number of routing algorithms and/or network
topologies has to be able to accommodate a number of routing requirements. It is of great
importance that the routing algorithm execution time be extremely short. This time dictates
how fast a message can advance through the network, since a message cannot be transferred
until an output port has been selected by the routing algorithm. Thus, the routing algorithm
execution time must be reduced to decrease message delays. Other requirements may include:
exibility to accommodate modi cations to a network, algorithm and/or topology switching
with minimum delay, and programmability to support a large number of routing algorithms
and network topologies.
The major approaches to the realization of exible routers are dedicated processors
and look-up table 5]. A dedicated processor allows bit manipulation, required by most
algorithms for ease of execution. This type of router is capable of handling a number of
interconnection networks, but due to its sequential execution of routing algorithms it is a
slow approach. Also the resulting router can be expensive in terms of hardware. The look-
up table approach requires the storage in memory of a predetermined routing path for all
possible destination nodes. The look-up table approach reduces the time of determining the
output port to a memory access delay, however, large tables are required to store the routing
information. The bit-pattern approach uses an associative approach to determine in parallel
the potential output ports (according to the routing algorithm). The output port is selected
by means of a selection function. This scheme has been shown to be able to accommodate
a large number of routing algorithms and to have fast execution 7, 6]. This approach has
been mapped onto a pipelined VLSI architecture 5].
In this paper we present a high performance hybrid wave-pipelined VLSI router
that constitutes of several modules and uses dynamic circuitry within the modules. Wave-
pipelining is a design method that enables pipelining in logic without the use of intermediate
registers. In order to realize practical systems using wave-pipelining, it is a requirement that
accurate system level and circuit level timing analysis be done. At system level, general-
ized timing constraints for proper clocking and system optimization need to be considered.
Since the performance of the system depends on the timing between multiple signals within
the data path and also the timing between the clock signal and data at the input register,
there is, therefore, room for performance optimization within the data path and the clock
At the circuit level, performance is determined by the maximum circuit delay dif-
ference in propagating signals within a given module of the system. Accurate analysis and
strict control of these delays are required in the study of worst case delay paths of circuits.
Data dependent delays also present a problem that needs to be considered in the use of wave-
pipelining to improve performance. In this study a di erent approach to minimizing the clock
period is undertaken. It is termed hybrid wave-pipelining and hinges on wave-pipelining.
2 Router Functional Organization
The bit-pattern associative router (BPAR) scheme supports the execution of routing algo-
rithms that are used in most communication switches. Using routing algorithms to determine
the output port requires that the following issues be considered:
A routing algorithm examines the status of addressing bits that are important in
determining the output port. Some addressing bits need to be ignored, depending on
the current and destination nodes.
Output port alternatives need to be prioritized. This priority is used to favor a port
that would yield a short path. This feature ensures deterministic routing algorithms.
In this section, an overall description of the router system is presented. This descrip-
tion is at a functional level, in order to illustrate the requirements and capabilities of the
2.1 Router System Description
The associative router uses a content addressable memory as its bit-pattern associative unit,
and this enables the destination address alternatives to be considered in parallel. The des-
tination address is presented as the input to the bit-pattern associative unit for comparison
with the stored data. This operation constitutes the normal operation of this design. The
bit-pattern associative router shown in Figure 1 consists of several modules 5]. In this illus-
tration arrows show the general ow of data from one stage to the next, with the destination
address being presented to the bit-pattern associative unit, which compares (for normal op-
eration) the presented destination address to the current address (stored in the bit-pattern
associative unit). The results of the comparison are passed to the selection function, which
then passes the match output with the highest priority to the port assignment to select the
word corresponding to the selected output port.
De nitions of some the modules and labels that appear in the gure are provided
1. Destination address. The destination address is passed to the routing function from
an input port. It becomes the search argument for the pattern associative mechanism.
2. Bit-Pattern Associative Unit. The destination address is compared to the current
address by means of a bit-pattern matching mechanism. An associative computation
with all the bit-patterns is performed to determine potential routing paths.
3. Selection Function. If more than one condition is met only one assignment should be
processed based on a priority scheme. For an oblivious router, the other matches are
ignored. For an adaptive router, the other matches may be considered as alternatives
to the port assignment.
4. Port assignment. The output port assignment is made by selecting the output port
associated with the selection condition.
destination address priority output port
Routing port assignment
Figure 1: The bit-pattern associative router scheme.
The patterns stored in the bit-pattern associative unit allow the router to make a decision
about the destination port based on the routing algorithm. The address of a destination
node D (dn; : : : d ) needs to be compared to the current node's address C (cn; : : : c ). A
1 0 1 0
routing algorithm compares the bits of the two addresses some bits are ignored since they
do not a ect the current routing decision. These don't care" bits usually occur at di erent
positions for each potential path being considered and need to be customized according to
the routing algorithm requirements. This in turn imposes a requirement for the bit-pattern
associative unit to include don't care" bits at di erent positions in each entry. To provide
the exibility required to support multiple interconnection networks and routing algorithms
the bit-pattern associative router must be programmable.
Modules of the bit-pattern associative router include:
1. A search argument register (SAR), responsible for receiving the destination address
and passing it to the next stage at the appropriate time.
2. The Bit-Pattern Associative Unit, for storing the bit patterns for a given interconnec-
tion network node and for performing a comparison between the destination address
and the current address (speci ed by the bit patterns). All the patterns that match
the input address are passed to the following stage the selection function.
3. The selection function which operates as a priority encoder, determines the priority of
the bit-pattern associative unit outputs. If more than one pattern has been matched,
only one match gets selected. The other matching outputs are ignored. The selected
match pointer gets passed to the next stage. If no matching pattern is found the
selection function sets a no match signal.
4. The port assignment stores the output port assignments. The address where the assign-
ment is read is speci ed by a pointer from the selection function and this assignment
is passed to the port assignment register.
5. The port assignment register is responsible for storing the output port assignment to
be used by the interconnection network.
6. The shift register serves as a pointer to a row of DCAMs (bit-pattern associative unit
array) and DRAMs (port assignment) to be programmed to or to be refreshed. It is
not used during normal operation.
7. The refresh circuitry is responsible for refreshing stored data, since dynamic circuits
are used and it is transparent to the other modes of operation.
The modules described above are shown in Figure 2, where the normal operation of the
associative router is depicted.
2.2 BPAR Normal Operation
Even though programming precedes the normal operation of the bit-pattern associative
router, the organization of the normal operation is presented rst, since this operation con-
stitutes the main function of the router. The system uses a content addressable memory
(CAM) as the basic element of the bit-pattern associative unit. The CAM has a property
that enables some addressing bits to be ignored. Figure 2 shows the bit-pattern associative
router organization for normal operation. The destination address is presented as input
to the search and argument register (SAR) and passed to the bit-pattern associative unit,
which already has the current address stored. The presented destination address is com-
pared simultaneously to all the stored bits within the associative unit. Several matching
conditions can result from this comparison. All the matching conditions are presented to the
selection function which then prioritizes these inputs and selects the one determined to have
the highest priority. The output of the selection function selects a word to be read from the
port assignment. This word is the output port speci er and is passed to the port assignment
Refreshing the stored data is inter-leaved with the comparison of the destination
address to the current address. This makes the refresh operation transparent to the matching
(comparison of destination address with current address) operation. The row select block
for both the bit-pattern associative unit and the port assignment serve to provide a pointer
to a word to be read or written during the refresh operation. Figure 2 depicts a situation in
which the destination address matches two words (two current addresses) in the bit-pattern
associative unit. The selection function, therefore, receives two inputs from the associative
unit and has to select one with the highest priority, which will then be passed as the selected
match to the port assignment. The example presented shows that the topmost match has
a higher priority than the one below it. The organization for the programming mode is the
same as that of the normal mode of operation with minor di erences.
Destination Address To Switching Network
(From Input Port) (Output Port)
Search Argument Port Assignment
selection function (SF)
o Bit - Pattern o
S (DCAM) S
e (DRAM) e
match match e
Refresh (DCAM) Refresh (DRAM)
Figure 2: The bit-pattern associative router organization (normal operation.)
2.3 BPAR Programming Mode
During the programming phase the row select is used to select a word to be written in
both the matching unit and the port assignment while the match (comparing of destination
address to current address), the selection function, and the refresh logic are ignored. The
modules that are ignored during programming appear as dotted lines in the gure. The
search and argument register is used as an input latch during this mode of operation. Also
the output port register serves to latch the input to the port assignment. The row select shift
register points to the row of DCAMs or DRAMs to be programmed to, and this selection
occurs simultaneously for both memories. The procedure for writing into the memories is
illustrated in Figure 3. The selected word to be written to during programming is shown
as a shaded block in both memories (the bit-pattern associative unit and the output port
2.4 Dynamic CAM Cell
The DCAM cell implements a comparison between an input and the ternary digit condition
stored in the cell. The proposed cell has a small transistor count. The circuit of a single
DCAM cell, shown in Figure 4, consists of eight and a half transistors transistor Te is shared
by two cells. The read, write, evaluation, and match line signals are shared by the cells in a
word while the BIT STORE, NBIT STORE, BIT COMPARE and NBIT COMPARE lines
are shared by the corresponding bit in all words of the matching unit. The design uses
a precharged match line to allow fast and simple evaluation of the match condition. This
condition is represented by the DCAM cell state which is set by Sb1 and Sb0 node status.
Program Data Program Data
(Current Address) (Output Port Assignment)
Search Argument Port Assignment
Input Latch Register Input Latch
selectiuon function (SF)
o Bit - Pattern o
Associative Unit Port Assignment
e (DRAM) S
match match e
Refresh (DCAM) Refresh (DRAM)
Figure 3: Router organization (Programming mode).
This state is stored in the gate capacitance of the transistors Tc and Tc . In Table 1 1 0
the possible DCAM cell stored values are listed. Transistors Tref and Tref are used for 0 1
The DCAM cell performs a match between its stored ternary value and a bit input
(provided by BIT COMPARE and NBIT COMPARE which represent a bit and its opposite
value, respectively). The match line is precharged to 1" thus, when a no match exists this
line is discharged by means of transistor Tm .
Table 1: Representation of stored data in a DCAM cell.
Sb1 Sb0 state
0 0 X(don't care
0 1 1(one)
1 0 0(zero)
1 1 not allowed
Normal operation involves comparing the input data to the patterns stored in the
DCAM and determining if a match has been found. During match operation, the input data
is presented on the BIT COMPARE line and its inverse value on the NBIT COMPARE line.
Before the actual matching of these two values is performed, the match line is precharged
to 1" which indicates a match condition. The matching of the input data and the stored
data is performed by means of an exclusive-or operation which is implemented by the two
transistors (Tc and Tc ) that hold the stored value.
write Tw1 Tw0
Figure 4: CMOS circuit for a DCAM cell.
2.5 Selection Function and Port Assignment
The selection function should be designed to ensure the deterministic execution of the routing
algorithms. With a given input (i.e. destination address) and a set of patterns stored in the
matching unit, the port assignment should always be the same. The priority allows only the
highest priority pattern that matches the current input to pass on to the port assignment
memory. The encoded priority (EPi) output depends on the match at the current bit-pattern
and the priority for this row. If both match and priority are 1" then the encoded priority is
true. A priority lookahead scheme has been proposed and implemented it has been reported
The port assignment memory or RAM holds information about the output port that
has to be assigned after a bit-pattern that matches the current input is found. This memory
is proposed to be implemented using a dynamic approach. The selected row address is passed
from the priority encoder. The cells in this row are read and their data is latched in the port
assignment register. The DRAM structure is able to perform an OR function per column
when multiple RAM rows are selected this is when multiple matches are passed on. The
DRAM structure is explained in 5].
Wave-pipelining is an approach to achieve high-performance in pipelined digital systems by
removing intermediate latches or registers 10]. Idle time of individual logic gates within
combinational logic blocks can be minimized using wave-pipelining.
Wave-pipelining reduces clock loads and associated area, power and latency while
retaining the external functionality and timing of a synchronous circuit 10]. Using wave-
pipelining, new data can be applied to the input of combinational blocks or stages within
a digital system before the previous outputs are available, thus e ectively pipelining the
combinational logic and maximizing the utilization of logic without inserting registers 8, 9].
3.1 Modelling Wave-Pipelining
Conventional circuit pipelining uses intermediate latches in addition to the input and output
registers. Intermediate latches (registers) ensure that when the leading edge of the system
clock comes data gets propagated from one stage to the next in a synchronous manner. In
a system set-up like this there is only one set of data between register stages. By using
wave-pipelining in digital systems, the clock frequency of digital circuits can be increased.
This set up allows for coherent waves of data to be sent through a block of combinational
logic and also allows for new inputs to be applied before the previous data propagates to the
output latch. There may be more than one data wave within the system.
The rate at which data propagates through the circuit depends not on the longest path
but on the di erence between the longest and the shortest path delays 10]. Wave-pipelining
requires that delay paths within a combinational logic circuit be balanced so that the data
waves applied at the input simultaneously arrive at the next stage. An illustration of a
wave-pipelined system appears in Figure 5. The intermediate latches have been removed
and replaced by intrinsic" latches. Also the system clock is no longer distributed and
multiple coherent waves" of data are available between storage elements.
3.2 Wave-Pipelining Requirements
Wave-pipelining requires studies across a variety of levels such as process, layout, circuit,
logic, timing and architecture. Several research groups have explored some of these areas.
Some of the challenges of designing wave-pipelined systems within any of the areas mentioned
Preventing intermixing of unrelated data waves." There must be no data overrun in
each circuit block, and it must be ensured that there is no over committing of the data
path. This is achieved by determining an appropriate range of clock frequencies at
which to apply data at the input.
Designing dedicated control circuitry. Control logic circuits must be designed to operate
synchronously with the circuitry of the pipeline stages. The number of these control
logic circuits must be minimal in order to be area e cient.
Balancing delay paths. Delay paths must be equalized to ensure that data waves applied
to the input latch propagate through all the stages synchronously. This is achieved
by inserting delay elements in the shortest paths within logic blocks to equalize their
delays to those of the longest paths. This approach allows for the elimination of the
intermediate latches or registers.
designed from worst
clock case operation
N Logic Logic Logic Logic U
P Circuit T
Circut Circuit Circuit
U Array Array Array P
(LCA0) ( LCA1 ) (LCA2) ( LCAn-1 )
input latch intermediate output latch
designed from worst
clock case operation
waven-1 wave 2 wave1 wave0
input latch output latch
Figure 5: Block diagram of a wave-pipelined digital system.
The requirements stated above are not inclusive but they represent some of the most impor-
tant design issues in wave-pipelining.
In this section some of the parameters of importance in wave-pipelining are discussed.
These include delays within combinational logic, clock skew, and sampling time at output
registers. Following an example from Wayne et al 10], a single combinational logic block is
considered in order to come up with the terminology that is used to determine the timing
requirements for wave-pipelining. The timing constraints derived from using this single block
should hold for any design with more than a single stage of combinational logic.
3.3 Wave-Pipelining Formulation
To present the clock parameters and delays, a simple pipeline stage is shown in Figure 6.
Some important labels are:
Dmin the minimum propagation delay through the combinational logic.
Dmax the Maximum (worst case) delay in the combinational logic.
Tclk the clock period.
input register combinational logic block output register
clock skew ∆
∆ clk ∆ clk
Figure 6: Longest and shortest path delays of a combinational logic block.
Ts and Th the register setup and hold times.
the constructive clock skew.
clk the register's worst case uncontrolled clock skew.
DR the register's propagation delay.
TL the time at which data at the output register can be sampled.
Tsx the temporal separation between the waves at an intermediate node x.
For this discussion it will be assumed that the combinational logic block has two inputs, the
setup and hold times of the input and output registers are equal and the propagation delay
through the register is the same for both the input and output registers. From the diagram
of Figure 6 the propagation delay of the signals through the combinational logic can be
related to the input and output clocks. Figure 7 shows the maximum and minimum delays
in relation to logic depth. The shaded region in the gure represents a period during which
data is being processed within the combinational logic block. Figure 7 can be extended to
represent several data sets being processed as time progresses.
D min Ts + ∆ clk Th + ∆ clk
x i-1 i i+1
000 input clock
Figure 7: Temporal/spatial diagram of a wave-pipelined system.
For wave-pipelining to operate properly, the system clocking must be such that the
output data is clocked after the latest data has arrived at the output and the earliest data
from the next clock cycle arrives at the output 10]. The time at which data can be sampled
at the output register TL is given by:
TL = N Tclk + (1)
where N is the number of clock cycles required to propagate the input through the combi-
national logic block before it could be latched at the output register. Latching the latest
data at the output register requires that the latest possible signal arrives early enough to be
clocked by this register during the N th clock cycle. Thus, the lower bound of TL is:
TL > DR + Dmax + Ts + clk (2)
To latch the same data requires that the arrival of the next wave not interfere with the
latching of the current wave output. The clock skew clk must also be accounted for,
resulting in the upper bound of TL being:
TL < Tclk + DR + Dmin ; ( clk + Th) (3)
The di erence of these two equations results in a value of Tclk given by:
Tclk > (Dmax ; Dmin) + Ts + Th + 2 clk (4)
From Equation 4 it is apparent that the minimum clock period is limited by the
di erence in path delays (Dmax ; Dmin) and the clocking overhead Ts + Th + 2 clk . This
overhead occurs as a result of the inclusion of the input and output registers and the clock
skew. Internal nodes within the system need to be considered in this analysis. It is important
to ensure that there is no data overlap within the system. Therefore, the next earliest possible
set of data should not arrive at a node until the latest possible wave has propagated through.
If x is the output of an internal node within a logic network at a point on the logic depth
axis of Figure 7, its longest and shortest delays from its inputs would be represented by
the variables dmax and dmin, respectively. The equation that describes this internal node's
Tclk dmax x ; dmin x + Tsx + clk
( ) ( ) (5)
where Tsx is the minimum time that node x must be stable in order to correctly propagate
a signal through the gate.
3.4 Minimizing the Clock Period
In order to minimize the clock period, both register constraints and internal node constraints
must be taken into account. According to 11] the feasibility region of valid clock period Tclk
is not continuous. It is composed of a nite set of disjoint regions. Equations 2 and 3 are
revisited in order to derive a two sided constraint on the clock period. It was established that
the register latching time is given by TL = NTclk + and if this is applied to Equations 2
and 3 to obtain the intermediate values between the lower and upper bound of the register
latching time, the following equation results:
DR + Dmax + Ts + clk < NTclk + < Tclk + DR + Dmin + Dmax ; ( clk + Th ) (6)
To avoid writing long expressions two variables Tmax and Tmin are introduced.
Tmax maximum delay through the logic including clocking overhead and clock skews.
Tmin minimum delay through the logic including clocking overhead and clock skews.
Tmax = DR + Dmax + Ts + clk ; (7)
Tmin = DR + Dmin ; ( clk ; Th ) ; (8)
Having established these new variables Equation 6 can be simpli ed to read:
Tmax < T < Tmin + Tclk (9)
The inequality to the right can be simpli ed as follows:
NTclk < Tmin + Tclk
NTclk ; Tclk < Tmin
(N ; 1)Tclk < Tmin
Equation 9 becomes:
Tmax < T < Tmin (10)
N N ;1
The above analysis shows that for N = 1, the clock period is not continuous and is only
bounded below by Tmax .
4 Hybrid Wave-Pipelining
Having presented the timing constraints of wave-pipelining, attention is now turned towards
describing these timing constraints as they apply to a hybrid wave-pipelined approach. Equa-
tions to describe the timing constraints for this approach are derived and compared to those
of the previous section. In many computer/digital systems each stage has a signi cantly
di erent function and circuitry, therefore, wide variations in delays (Dmin and Dmax ) may
not be tolerated. Figure 8 shows an example of a wave-pipelined system. The inputs to
the system are passed in a synchronous manner by means of an external clock. Assuming
that three variables need to be computed in this stage and passed on to the following stage,
it follows that for a given set of inputs, these variables would have delays associated with
each one of them. These delays are denoted as dA, dB , and dC , for inputs A, B , and C
respectively. The di erence in delay may cause stage 2 (Figure 8) to start in a di erent
computation path than what is expected. This in turn produces a false start. This false
start creates unnecessary changes in stage 2 as well as additional delays, that need not have
occurred. These delays dA, dB and dC would depend on the gates associated with each path,
and also on the set of input values.
Figure 8: Example of delays associated with pipeline stages.
These problems are avoided using a hybrid wave-pipelining approach. In order to
provide some insight into the problem de nition and solution, a brief summary is provided
below. A common engineering practice is to consider the worst case delay (Dmax), to ensure
that the system runs properly. Dmax plays a very important role in the system's performance
and safe regions of operation. Dmin (the shortest delay path), on the other hand imposes
a restriction in the valid input window. Getting Dmin closer to Dmax could increase this
window in other words it could decrease clock cycle time.
Ts + ∆clk ∆clk + Th
d min (1)
i-1 i Tsx
d min (0) dhold(0)
Figure 9: Temporal/Spatial diagram of a hybrid wave-pipelined system.
The equations derived for the hybrid wave-pipelining are denoted by the subscript h
to di erentiate them from those of the wave-pipelining time constraints. The de nitions for
the variables presented in the previous subsection still hold. To derive the equations that
describe the timing constraints for the hybrid wave-pipeline, the temporal/spatial diagram
representing this scheme is presented rst. The shaded regions of Figure 9 indicate that data
is not stable therefore, register outputs cannot be sampled. The computational cones in
this diagram have been arranged to represent each stage within the design. Some variables
need to be de ned. They are:
dmin(n) the minimum delay encountered in propagating data within a single stage n.
Dmin hold the overall minimum delay of all the stages and it includes the intrinsic
registers' hold times.
For hybrid wave-pipelining, TL 's lower bound is described as follows:
TLh > DR + Dmax + Ts + clk (11)
The upper bound of TLh is
TLh < Tclk + DR + Dmin hold ; ( clk + Th) (12)
where Dmin hold = dmin(0) + dhold(0) + dmin(1) + dhold(1) + dmin(2) + dhold(2)
This equation takes into consideration the intermediate stages of the design. The
minimum delays and the hold times of each stage are considered. From the above equation
it can be determined that:
Dmin hold Dmin (13)
This implies that this delay di erence is less than Dmax ; Dmin for the wave-pipelining
scheme. If further derivations are carried out, the clock period for the hybrid approach is
determined to be:
Tclk h > (Dmax ; Dmin hold) + Ts + Th + 2 clk
( ) (14)
Comparing Equations 4 and 14 and having Dmin hold Dmin , a conclusion that
Tclkh Tclk can be drawn. This implies that the clock period for the hybrid wave-pipelined
approach allows for the clock signal period to be reduced, hence an increase in performance.
A complete analysis of the hybrid wave-pipelining scheme must include clock cycle minimiza-
tion, taking into consideration the constraints of the internal nodes of the system and the
register constraints. Based on the analysis of Equation 6, the minimum delay of the hybrid
approach can be re-written to include the stage hold times as follows:
Tmin h = DR + Dmin hold ;
( ) clk ; Th ; (15)
The maximum delay through the logic including the overhead and clock skews Tmax , remains
unchanged for the hybrid scheme. From Equation 15, and by taking into consideration the
fact that Dmin hold Dmin, it is determined that: Tmin < Tmin h . Also from Figure 9
it can be noticed that the region in which data is not stable, i.e. the di erence between
Dmax ; Dmin hold , is short. It can then be safely stated that Dmax Dmin hold. The signal
latching time, Equation 6 now becomes:
DR + Dmin hold + Ts + clk ; < NTclk < Tclk + DR + Dmin hold ; ( clk + Th ) ; (16)
This analysis is graphically presented in Figures 10 and 11. Figure 10 shows that
there is room to make the clock cycle smaller, since the distance between labels windowh
and windoww can be reduced. Figure 11 shows how e ectively this could a ect the clock
period, reducing it from Tclk to Tclk .
4.1 Timing Approach
Propagating coherent data waves from one pipeline stage to the next without the use of
a distributed system clock is achieved by balancing the delays within each pipeline stage.
Delay balancing is necessary since the data within a circuit module takes di erent paths
and experiences di erent delays depending on the operation, resulting in the data arriving
at the input of the next stage at di erent times. The data that has the shortest delay
arrives at the input to the next stage earlier than data that takes the critical path (longest
delay), hence the use of intermediate latches between pipeline stages (modules). Wayne 10]
et al, have suggested bu er insertion as a method of delay balancing. Bu ers are inserted
D max windowh
d min (1)
d min (0) dhold(0)
000 Tclk 1111
Figure 10: Temporal/spatial diagram before clock period reduction.
Figure 11: Temporal/spatial diagram after clock period reduction.
in the shortest signal path to force these delays to be equal to the worst case delay, which
in turn permits the signals to arrive at the input of the next stage at the same time without
the use of intermediate latches nor the system clock. A di erent approach is considered in
this research. It does not involve bu er insertion, but uses the delays in the design of the
appropriate signals that will enable data to ripple through the stages coherently without the
use of distributed clock nor intermediate latches. The method used will not be limited in use
to the bit-pattern associative router circuitry and should nd application in many circuits
that employ the wave pipelining technique to improve performance.
The di erence between the maximum (worst case) delay and the minimum delay is of
particular interest in designing the timing scheme of a wave-pipelined bit-pattern associative
router. There will be no bu er insertion to balance the delays, instead specialized circuitry
is designed to provide very precise timing sequence. The advantage of this approach is that
it is not a ected by scaling or use of a di erent technology since all the modules are a ected
by these changes the same. The only major limiting factor becomes the worst case delay of
the circuit. The valid clock pulse width (PW) can be made short, and it cannot be less than
Dmin + (Dmax ; Dmin ) i.e. PW Dmin + (Dmax ; Dmin ). This expression includes the rise
and fall times of the clock and is based on measurements taken from SPICE simulations.
This gives the minimum pulse width that will enable application of wave-pipelining to the
bit-pattern associative router without intermixing the data waves" within any given pipeline
stage. This provides the basis for the design of the clock pulses required to cause the data to
ripple through the pipeline stages. The system clock is responsible for passing data from the
search argument register (SAR) to the bit-pattern associative unit. It has been established
that the bit-pattern associative router requires the following signals for its operation:
Read and write signals for the memories (DCAM and DRAM). The write signal
must be raised to a logic 1 only during programming (storing of current address) and
refreshing of the stored data this is done when the row select signal selects a speci c
word to write to within the bit-pattern associative unit array. Reading occurs when
there is a need to refresh the contents of the words stored within the array. The read
and write signals are generated whenever the stored data has been sensed to degrade.
Bit lines. They pass the input to the bit-pattern associative unit array during pro-
gramming as well as normal operation and they need to be precharged before reading
from the array. There is thus a need to design specialized circuitry to generate the
precharge signal for the bit lines.
The evaluate signal. The evaluate signal is responsible for determining if a matching
condition exists in the bit-pattern associative unit. Special circuitry to provide the
proper evaluate signal has to be designed based on dmax (x) and dmin(x), where dmax (x)
is the maximum delay path for a signal propagating through stage x and dmin(x) the
minimum delay in propagating a signal through stage x of the bit-pattern associative
The precharge signal. The match line of the bit-pattern associative unit is designed
to be normally on, and this is achieved by precharging it before evaluation occurs. A
precharge signal, therefore, has to be designed and it must occur at appropriate times.
The signals dmin(x) and dmax (x) play an important role in the design of the precharge
signal for the match line.
The signal that permits the data wave" to be passed to the selection function. The
signal that allows the match output to be passed to the selection function is generated
by using the evaluate signal and sizing transistors with the threshold voltage of the
transistor used to latch the match outputs.
Precharge for the selection function. This signal is simply the inverse of the evaluate
The priority status signal. Within the selection function the priority status signal is of
importance. If a match has been found in any of the entries other than the last entry of
the matching unit, the selection function has to detect it as a match with the highest
priority and must then propagate a priority status indicating to the entries below that
a match has been found. In this design a priority status Pi = 0 indicates that a match
with the highest priority has been found and hence kills all the other possible matches
below the current one. The delays experienced by this signal are of importance since
it can be used to determine when new input can be applied to the SAR.
The signal required to pass the selected match output to the port assignment. This signal
is required to gate the selected match output which is the one determined to have the
highest priority by the selection function, to the next stage. The system clock is not
used in the selection function, therefore, the signals already generated have to be used
to provide precise timing to pass the selected match output to the port assignment
memory. This signal serves to select the appropriate output port assignment.
The signals described above form the basis for proper timing of the hybrid wave-
pipelined bit-pattern associative router. The primary objective is to reduce clock cycle
time by allowing the delays to propagate data from stage to stage with the use of
neither intermediate latches nor distributed clocks.
As clock rates continue increasing, clock distribution is becoming a major problem
wave-pipelining could alleviate this problem by signi cantly reducing the clock load as well
as the number of inter-stage latches 1].
4.2 Hybrid Wave-Pipelining Design
In this section circuits to generate some of the signals discussed above are presented. Re-
moval of intermediate latches introduces the possibility that two or more data waves can
be intermixed. A wave-pipelined design must then ensure that this data collusion does not
occur. Circuitry to ensure that data propagates within its own wave has been designed.
4.3 Control Signals for The Bit-Pattern Associative Unit
The requirement for precharging the match line is that this occurs before evaluation takes
place and after the output of the match line has been passed to the next stage. As a result
the precharge signal is designed using the signals used to evaluate and pass the match line
output to the next stage. The circuit in Figure 12 is designed to generate the match line
precharge signal. Transistor Tc, when conducting, charges up node V and the status of node
V is passed to the inverter input when the pass signal is at 0". At evaluation time it is
required that the match line precharge signal be at 1" to prevent a con ict of operations on
the match line (attempting to precharge the match line while evaluating it). Also, care must
be taken to prevent precharging the match line before its output is passed to the next stage.
The match line precharge signal, therefore, has to be low only when both the evaluate and
pass signals are at 0", that is, the match line must be precharged when both the evaluate
and pass signals are at 0". This condition is enforced by use of the evaluate signal to turn
on transistor Td , providing a path to ground during evaluation. This occurs when pass is at
logic level 0, ensuring that node V gets discharged to 0", resulting in match line precharge
signal being at logic 1. The precharge device is a p-type transistor it is, therefore, not
conducting when the match line precharge signal is at logic 1. Transistor Toff passes a 1"
to the gate of Tc when the evaluate signal is at logic 1, preventing a 1" from being passed
to node V by turning o transistor Tc. Since the pass and evaluate signals are never at logic
1 at the same time, pass at the gate of Ton, after allowing a time lapse of two inverter delays,
enables Tc to conduct. This arrangement ensures that the precharge signals will operate
precisely as desired.
Figure 12: Generating the match line precharge signal.
The evaluate signal must go to 1" after the data passed to the DCAM for comparison
has stabilized on the bit lines (BIT COMPARE and NBIT COMPARE). There is circuitry
that generates this signal once all the lines have been set to their appropriate input values.
This circuitry takes into account setting a 0" or 1" in the particular CAM array and is
shown in Figure 13. Its operation is governed by the status of the match line. The inverse
of the clock gets passed to an intermediate node when the match line is at 1" and this
intermediate node gets precharged to 1" whenever the match line is at 0". The inverse of
the intermediate node is the evaluate signal. Using the match line to pass the clock signal
and set the evaluate signal to 0" when evaluation is completed provides the proper duration
for the evaluate signal to stay at 0" or 1". Passing the clock signal when the match line is
at logic 1 basically determines that the data on the bit line has had enough time to stabilize
on the bit lines.
Figure 13: Generating the evaluate signal.
Once evaluation has been completed the match line outputs are passed to the selection
function. Thus, the pass signal must immediately be active following completion of the
evaluation process. The circuit used to generate the pass signal is shown in Figure 14.
The pass signal is designed to mimic the path that a zero on the match line (non-matching
condition) takes once evaluation completes. Passing a 0" to the selection function provides
the maximum delay that can be experienced in passing the matching unit's outputs to
the selection function and, therefore, constitutes the worst case propagation delay for this
operation. The node pass is initialized to 1" by precharging node z when the evaluate
signal is 1". A 1" at the pass node ensures that transistor Tpz passes a 0" which gets
inverted and becomes input to transistor Tpd. When transistor Tpd conducts discharging the
pass node, this becomes an indication that the pass signal has stayed at 1" long enough to
pass the output of the match line to the selection function. The duration and voltage level
of both the evaluate and pass signals need to be just long and high enough to accomplish
their purpose. The circuits in Figures 13 and 14 have been designed to sense the duration
and voltage levels of these signals. All the signals presented to this point depend on the
Figure 14: Generating the pass signal.
Another signal of importance is within the selection function. This is the signal that
enables the port assignment select signals i.e. the output of the selection function the pointer
to the selected word of the port assignment. It is required that if a match with the highest
priority has been established, the ones below this match output must be disabled. The
priority status performs this function along with an enable signal, which has been designed
based on the fact that it takes longer to pass 0" from the bit-pattern associative unit. The
required delay between sets of four selection function cells is arrived at by inserting delay
elements within the circuit. This approach ensures the prevention of false starts.
Figure 15: Enabling the selection function outputs.
4.4 Hybrid Wave-Pipelining Signals
The signals generated by the above circuits appear in Figure 16 along with the clock. The
delays of the rst two stages are clearly marked in the gure to show a correlation with the
hybrid wave-pipelined approach. On Figure 16 the minimum delays, dmin and hold times
dhold of each stage are shown. Also the duration of the evaluate and pass signals are depicted.
Using 1 m technology 2.5 V is su cient to determine a logic 1.
Once the output of the match lines have been received by the selection function
a decision needs be made when more than one matching condition has been received. The
selection function is designed to propagate a priority status Pi to the entry below it indicating
whether it has registered a match. This priority must be propagated very fast to prevent false
starts. Some of the signals of importance in this scheme are shown in Figure 17. The selection
function's critical operation occurs when the rst and last entries of the selection function
simultaneously receive inputs indicating that matching conditions have been found in the
corresponding bit-pattern associative unit entries. The rst entry of the selection function
has to propagate a 0" to the last entry of the selection function to prevent generation of
a pointer to the DRAM by the last entry. Signals to lessen the possibility of false starts
are generated and these are labeled enable. They are designed to ensure that even if a false
start occurs the pointer to the DRAM array is not established. In Figure 17 delays from the
second stage of the BPAR and those of the third stage are shown. In Figure 17 signals used
to enable the DRAM pointers for the rst and last entries from a setup in which the last entry
always nds a match and the rst entry nds a match every other clock cycle are shown.
The delays involved are shown. The plots in Figure 18 are those of the DRAM pointers to
the rst and last entries of the DRAM. They serve to show that an output port assignment
can be read from the DRAM array every one and a half clock cycles. Also ensuring that
the matching condition occurs every other clock cycle for the rst entry of the bit-pattern
associative unit, while the last entry always nds a match, makes for a good check for false
SF = selection function
evaluate pass clock
dminCAM d CAM d SF
0.00000 0.00001 0.00249 0.00500 0.00639 0.01000 0.01166
Figure 16: Plot of the bit-pattern associative unit control signals.
output0 output15 DRAMselect15
SF = selection function
0.000 0.001 0.004 0.006 0.010 0.011 0.016 0.019 0.021
Figure 17: Plots of the enable signals.
DRAM DRAM DRAM
select0 clock select15 output bit0
0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
Figure 18: Plot of the DRAM pointers.
5 BPAR's Space/Time Hybrid Wave-Pipelining
Figure 19 shows the delays associated with each stage (as shown in Figures 16, 17, and 18).
The computational cones of Figure 19 show the three stages of the bit-pattern associative
router. Hybrid wave-pipelining has enabled for the delay di erence per stage to be reduced.
Unlike wave-pipelining where Dmin for the system provides the upper bound for the clock
period, hybrid wave-pipelining shows that each stage has it's maximum delay that can be
manipulated independent of other stages to obtain optimal timing.
The bit-pattern associative unit completes worst case operation (computation) in 5.1
nano seconds (ns). This time includes the input latch delays, clock skew and hold and setup
times for this stage. To accommodate the latest data wave arriving at this stage additional
time during which this data can still be passed to the next stage is allocated. This time
appears in the gure as dpass with a value of 0.9 ns. For data arriving late at the bit-
pattern associative unit at least 2 ns is the time during which this data can still be processed
within this stage. The hold time for the selection function is very short, almost equal to the
minimum delay of this stage. This short hold time is a direct result of the selection function
design, which ensures propagation of the priority status to entries below the current one
very fast by means of the priority lookahead. The port assignment has a minimum delay
of 0.4 ns and requires 3 ns of hold time. It also accommodates the latest wave" for the
current data. Reduction of the gap between Dmax and Dmin at each stage is presented here
graphically with the delays displayed to show how the clock period is made shorter in this
design. Comparing this gure with Figure 20 it can be seen how the time during which data
can be sampled at output latches is reduced.
A combined temporal and spatial diagram for a wave-pipelined router is shown in
Figure 20. The horizontal axis represents time while the vertical axis represents the circuitry
of the system which includes: input latch, CAM (bit-pattern associative unit), selection
function, RAM (port assignment), and output latch. The shadowed regions represent the
times when a computation is being performed and data may be unstable. The space between
two adjacent regions corresponds to stable data. In this gure several waves of data are shown
(from data i ; 1 to i + 1). It should be pointed out that the worst time for the selection
function occurs when the top priority CAM entry matches and produces a read. This in
turn creates a delay for the last entry which is ready before the priority lookahead signal
reaches it. In the gure, it is shown how the rst entry a ects the last entry. In a traditional
pipeline this creates a very long delay in the selection function stage since the CAM output
is not known until the pipeline clock comes. The output data should be read when it is
stable thus the output clock should be synchronized to achieve this.
The hybrid wave-pipelined router described in this paper has been fabricated using a
0.5 m technology and the chip tested. Some traces of the signals described in this section
appear in the Appendix.
6 Concluding Remarks
An extremely fast exible router capable of accommodating a number of routing algorithms
and networks has been presented in this paper. The bit-pattern associative router is designed
0.4 ns 3 ns
RAM and output latch
0.7 ns 0.8 ns
selection function (SF)
dmin CAM dholdCAM
1.6 ns 3.5ns
input latch and CAM
Figure 19: Hybrid wave-pipelined delays translated to computational cone.
D min dmin RAM dhold RAM
ouput latch &
dmin SF dhold SF
dmin CAM dhold CAM
input latch &
Figure 20: Router's Temporal/spatial diagram.
using a novel scheme that combines wave-pipelining and conventional pipelining producing
hybrid wave-pipelining. The hybrid wave-pipelining approach narrows the gap between Dmin
and Dmax resulting in a reduction of the clock cycle time. To further improve the performance
of the bit-pattern associative router dynamic circuits along with a high performance selection
function to reduce priority status propagation delays are used. This study is the rst of
it's kind it examines the delays within each module of the design and minimizes them
Issues such as signal generation, uneven delays, and timing have been addressed to
synchronize the pipeline. The bit-pattern associative router presented in this paper has been
designed, implemented and tested. The test results appear in the appendex.
1] C. Thomas Gray, Wentai Liu, Ralph K. Cavin, III, Wave Pipelining: Theory and CMOS
Implementation, Kluwer Academic Publisher 1994.
2] D. Clark and J. Pasquale, Strategic Directions in Networks and Telecommunications,"
ACM Computing Surveys, vol. 28, no. 4, pp. 679-690, Dec. 1996.
3] T. Leighton, Average Case Analysis of Greedy Routing Algorithms on Arrays," 2nd
Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 2-10, Crete,
4] R. Cypher, A. Ho, S. Konstantinidou, and P. Messina, Architectural Requirements
of Parallel Scienti c Applications with Explicit Communication," 20th International
Symposium on Computer Architecture, pp. 2-13, May 1993.
5] J. G. Delgado-Frias, J. Nyathi and D. H. Summerville, A Programmable Dynamic
Interconnection Network Router with Hidden Refresh" IEEE Trans. on Circuits and
Systems, I vol. 45, no.11, pp. 1182-1190, Nov. 1998.
6] D. H. Summerville, J. G. Delgado-Frias, and S. Vassiliadis, A Flexible Bit-Associative
Router for Interconnection Networks," IEEE Trans. on Parallel and Distributed Sys-
tems, vol. 7, no. 5, pp. 477-485, May 1996.
7] D. H. Summerville, J. G. Delgado-Frias, and S. Vassiliadis, A High Performance Pat-
tern Associative Oblivious Router for Tree Topologies," IPPS '94: The 8th International
Parallel Processing Symposium, pp. 541-545, Cancun, Mexico, April 1994.
8] C. Thomas Gray, Wentai Liu, and Ralph K. Cavin, III, Timing Constraints for Wave-
Pipelined Systems," IEEE Trans. on Computer-Aided Design of Integrated Circuits and
Systems, Vol. 13, No. 8, pp. 987-1004, August 1994.
9] D. Ghosh, and S. K. Nandy. Design and Realization of High-Performance Wave-
Pipelined 8 X 8 b Multiplier in CMOS Technology," IEEE Transactions on Very Large
Scale Integration (VLSI) Systems, Vol. 3, No. 1, pp. 36-48, March 1995.
10] W. P. Burleson, M. Ciesielski, F. Klass and W. Liu, Wave-Pipelining: A Tutorial and
Research Survey," IEEE Trans. on Very Large Scale Integration (VLSI) Systems, Vol.
6, No. 3, pp. 464-474, September 1998.
11] W. K. C. Lam, R. K. Brayton, and A. L. Sangiovanni-Vincentelli, Valid Clock Frequen-
cies and Their Computation in Wave-pipelined Circuits," IEEE Trans. on Computer-
Aided Design, vol. 15, no. 7, pp. 791-807, July 1996.
12] J. G. Delgado-Frias and J. Nyathi, A High Performance Encoder with Priority Looka-
head," IEEE Sixth Great Lakes Symp on VLSI, pp. 59-64, Lafayette, Louisiana, Feb.
match-alternately (first and last entries)
Figure 21: Matching the rst and last entries alternately.
Figure 22: Periodic discharge of the matchline.
RAM_output (first entry) RAM-output (last entry)
Figure 23: Port Assignment output.