A Flexible High-Performance Network Router with Hybrid Wave-Pipelining

Jabulani Nyathi and Jose G. Delgado-Frias
Department of Electrical Engineering
State University of New York
Binghamton, NY 13902-6000

Abstract

In this paper a novel hybrid wave-pipelined bit-pattern associative router is presented. A router is an important component in any communication network system. The bit-pattern associative router (BPAR) allows for flexibility and can accommodate a large number of routing algorithms. Wave-pipelining is a high-performance approach that implements pipelining in logic without the use of intermediate registers. In this study a hybrid wave-pipelined approach is proposed and implemented. Hybrid wave-pipelining reduces the difference between the maximum and minimum delays by narrowing the gap between each stage of the system. This approach yields narrow "computing cones" that could allow faster clocks to be run. This is the first study in wave-pipelining that deals with a system that has a substantially different set of pipeline stages. The bit-pattern associative router has three stages: condition match, selection function, and port assignment. In each stage data delay paths are tightly controlled in order to optimize the propagation of signals. Internal control signals are generated to ensure that data propagates between stages in a proper fashion. The simulation results show that hybrid wave-pipelining significantly reduces the clock period, to the point where the circuit delays become the limiting factor, preventing further clock cycle time reduction.

1 Introduction

Communication channels between any two nodes, regardless of their physical location within a communication network system, can be established by using routers at each node. The purpose of the router is to handle messages on the network. A router receives, forwards, and delivers messages.
The router system transfers messages based on a routing algorithm, which is a crucial element of any communication network [2, 3]. Given the reconfiguration requirements of many computing systems, a number of routing algorithms as well as network topologies must be supported. This in turn leads to a need for a very high performance, flexible router to support these requirements. Maximizing a machine's overall performance and accommodating reconfiguration in a distributed environment requires matching the application characteristics with a suitable routing algorithm and topology. It has been shown that
machine performance is fairly sensitive to the routing function and traffic [4]. In addition, having the capability of changing the routing algorithm at run time could facilitate smart interconnects and adaptability that allow changes to the machine's topology for different applications.

A router intended to be used for a number of routing algorithms and/or network topologies has to be able to accommodate a number of routing requirements. It is of great importance that the routing algorithm execution time be extremely short. This time dictates how fast a message can advance through the network, since a message cannot be transferred until an output port has been selected by the routing algorithm. Thus, the routing algorithm execution time must be reduced to decrease message delays. Other requirements may include: flexibility to accommodate modifications to a network, algorithm and/or topology switching with minimum delay, and programmability to support a large number of routing algorithms and network topologies.

The major approaches to the realization of flexible routers are dedicated processors and look-up tables [5]. A dedicated processor allows the bit manipulation required by most algorithms. This type of router is capable of handling a number of interconnection networks, but because it executes routing algorithms sequentially it is a slow approach. The resulting router can also be expensive in terms of hardware. The look-up table approach requires the storage in memory of a predetermined routing path for all possible destination nodes. It reduces the time of determining the output port to a memory access delay; however, large tables are required to store the routing information. The bit-pattern approach uses an associative scheme to determine in parallel the potential output ports (according to the routing algorithm). The output port is selected by means of a selection function.
This scheme has been shown to be able to accommodate a large number of routing algorithms and to have fast execution [6, 7]. This approach has been mapped onto a pipelined VLSI architecture [5].

In this paper we present a high performance hybrid wave-pipelined VLSI router that consists of several modules and uses dynamic circuitry within the modules. Wave-pipelining is a design method that enables pipelining in logic without the use of intermediate registers. In order to realize practical systems using wave-pipelining, accurate system level and circuit level timing analysis is required. At the system level, generalized timing constraints for proper clocking and system optimization need to be considered. Since the performance of the system depends on the timing between multiple signals within the data path and also on the timing between the clock signal and data at the input register, there is room for performance optimization within the data path and the clock generation circuitry.

At the circuit level, performance is determined by the maximum circuit delay difference in propagating signals within a given module of the system. Accurate analysis and strict control of these delays are required in the study of worst case delay paths of circuits. Data dependent delays also present a problem that needs to be considered in the use of wave-pipelining to improve performance. In this study a different approach to minimizing the clock period is undertaken. It is termed hybrid wave-pipelining and builds on wave-pipelining.
2 Router Functional Organization

The bit-pattern associative router (BPAR) scheme supports the execution of routing algorithms that are used in most communication switches. Using routing algorithms to determine the output port requires that the following issues be considered:

  • A routing algorithm examines the status of the addressing bits that are important in determining the output port.
  • Some addressing bits need to be ignored, depending on the current and destination nodes.
  • Output port alternatives need to be prioritized. This priority is used to favor a port that would yield a short path. This feature ensures that routing algorithms are deterministic.

In this section, an overall description of the router system is presented. This description is at a functional level, in order to illustrate the requirements and capabilities of the system.

2.1 Router System Description

The associative router uses a content addressable memory as its bit-pattern associative unit, and this enables the destination address alternatives to be considered in parallel. The destination address is presented as the input to the bit-pattern associative unit for comparison with the stored data. This constitutes the normal operation of the design. The bit-pattern associative router shown in Figure 1 consists of several modules [5]. In this illustration arrows show the general flow of data from one stage to the next. The destination address is presented to the bit-pattern associative unit, which compares it (for normal operation) to the current address stored in the unit. The results of the comparison are passed to the selection function, which then passes the match output with the highest priority to the port assignment to select the word corresponding to the selected output port. Definitions of some of the modules and labels that appear in the figure are provided below.

1. Destination address.
The destination address is passed to the routing function from an input port. It becomes the search argument for the pattern associative mechanism.

2. Bit-Pattern Associative Unit. The destination address is compared to the current address by means of a bit-pattern matching mechanism. An associative computation with all the bit-patterns is performed to determine potential routing paths.

3. Selection Function. If more than one condition is met, only one assignment should be processed, based on a priority scheme. For an oblivious router, the other matches are ignored. For an adaptive router, the other matches may be considered as alternatives to the port assignment.

4. Port assignment. The output port assignment is made by selecting the output port associated with the selected condition.
Figure 1: The bit-pattern associative router scheme.

The patterns stored in the bit-pattern associative unit allow the router to make a decision about the destination port based on the routing algorithm. The address of a destination node $D = (d_{n-1}, \ldots, d_0)$ needs to be compared to the current node's address $C = (c_{n-1}, \ldots, c_0)$. A routing algorithm compares the bits of the two addresses; some bits are ignored since they do not affect the current routing decision. These "don't care" bits usually occur at different positions for each potential path being considered and need to be customized according to the routing algorithm requirements. This in turn imposes a requirement for the bit-pattern associative unit to include "don't care" bits at different positions in each entry. To provide the flexibility required to support multiple interconnection networks and routing algorithms, the bit-pattern associative router must be programmable.

Modules of the bit-pattern associative router include:

1. A search argument register (SAR), responsible for receiving the destination address and passing it to the next stage at the appropriate time.

2. The bit-pattern associative unit, which stores the bit patterns for a given interconnection network node and performs a comparison between the destination address and the current address (specified by the bit patterns). All the patterns that match the input address are passed to the following stage, the selection function.

3. The selection function, which operates as a priority encoder and determines the priority of the bit-pattern associative unit outputs. If more than one pattern has been matched, only one match gets selected. The other matching outputs are ignored. The selected match pointer gets passed to the next stage. If no matching pattern is found, the selection function sets a no match signal.
4. The port assignment, which stores the output port assignments. The address from which the assignment is read is specified by a pointer from the selection function, and this assignment is passed to the port assignment register.

5. The port assignment register, which is responsible for storing the output port assignment to be used by the interconnection network.

6. The shift register, which serves as a pointer to the row of DCAMs (bit-pattern associative unit array) and DRAMs (port assignment) to be programmed or refreshed. It is not used during normal operation.

7. The refresh circuitry, which is responsible for refreshing stored data, since dynamic circuits are used. It is transparent to the other modes of operation.

The modules described above are shown in Figure 2, where the normal operation of the associative router is depicted.

2.2 BPAR Normal Operation

Even though programming precedes the normal operation of the bit-pattern associative router, the organization of the normal operation is presented first, since this operation constitutes the main function of the router. The system uses a content addressable memory (CAM) as the basic element of the bit-pattern associative unit. The CAM has a property that enables some addressing bits to be ignored. Figure 2 shows the bit-pattern associative router organization for normal operation. The destination address is presented as input to the search argument register (SAR) and passed to the bit-pattern associative unit, which already has the current address stored. The presented destination address is compared simultaneously to all the stored patterns within the associative unit. Several matching conditions can result from this comparison. All the matching conditions are presented to the selection function, which prioritizes these inputs and selects the one with the highest priority. The output of the selection function selects a word to be read from the port assignment.
This word is the output port specifier and is passed to the port assignment register. Refreshing the stored data is interleaved with the comparison of the destination address to the current address. This makes the refresh operation transparent to the matching operation (the comparison of the destination address with the current address). The row select blocks for the bit-pattern associative unit and the port assignment provide a pointer to the word to be read or written during the refresh operation.

Figure 2 depicts a situation in which the destination address matches two words (two current addresses) in the bit-pattern associative unit. The selection function, therefore, receives two inputs from the associative unit and has to select the one with the highest priority, which is then passed as the selected match to the port assignment. In the example presented, the topmost match has a higher priority than the one below it. The organization for the programming mode is the same as that of the normal mode of operation, with minor differences.
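The normal-operation flow described above (parallel condition match, priority selection, port read-out) can be mimicked in software. The following is a minimal Python sketch, not the authors' circuit: the table contents, the port names, and the helper names `pattern_matches` and `route` are hypothetical, and row order is assumed to encode priority with the topmost row winning, as in the Figure 2 example.

```python
from typing import Optional, Sequence, Tuple

# Each entry pairs a ternary pattern ('0', '1', or 'x' for don't care)
# with an output port. Assumption: earlier rows have higher priority.
Entry = Tuple[str, str]


def pattern_matches(pattern: str, address: str) -> bool:
    """True when every non-don't-care pattern bit equals the address bit."""
    if len(pattern) != len(address):
        raise ValueError("pattern and address must have the same width")
    return all(p in ("x", a) for p, a in zip(pattern, address))


def route(table: Sequence[Entry], destination: str) -> Optional[str]:
    """Return the port of the highest-priority match, or None for 'no match'."""
    # Condition match: in hardware all rows are compared in parallel.
    match_lines = [pattern_matches(pattern, destination) for pattern, _ in table]
    # Selection function: keep only the topmost asserted match line.
    for row, matched in enumerate(match_lines):
        if matched:
            # Port assignment: read the word addressed by the selected row.
            return table[row][1]
    return None


# Hypothetical routing table for one node (4-bit addresses).
table = [
    ("0101", "local"),
    ("01xx", "east"),
    ("xxxx", "west"),
]

print(route(table, "0101"))
print(route(table, "0110"))
print(route(table, "1110"))
```

With this table, destination 0110 matches both the second and third rows, and the selection step keeps only the topmost one, mirroring the two-match situation discussed for Figure 2.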
Figure 2: The bit-pattern associative router organization (normal operation).

2.3 BPAR Programming Mode

During the programming phase the row select is used to select a word to be written in both the matching unit and the port assignment, while the match (the comparison of the destination address to the current address), the selection function, and the refresh logic are ignored. The modules that are ignored during programming appear as dotted lines in the figure. The search argument register is used as an input latch during this mode of operation. The output port register likewise serves to latch the input to the port assignment. The row select shift register points to the row of DCAMs or DRAMs to be programmed, and this selection occurs simultaneously for both memories. The procedure for writing into the memories is illustrated in Figure 3. The word selected to be written during programming is shown as a shaded block in both memories (the bit-pattern associative unit and the output port assignment).

2.4 Dynamic CAM Cell

The DCAM cell implements a comparison between an input and the ternary digit condition stored in the cell. The proposed cell has a small transistor count. The circuit of a single DCAM cell, shown in Figure 4, consists of eight and a half transistors, since transistor Te is shared by two cells. The read, write, evaluation, and match line signals are shared by the cells in a word, while the BIT STORE, NBIT STORE, BIT COMPARE and NBIT COMPARE lines are shared by the corresponding bit in all words of the matching unit. The design uses a precharged match line to allow fast and simple evaluation of the match condition.
This condition is represented by the DCAM cell state, which is set by the status of nodes Sb1 and Sb0.
Figure 3: Router organization (programming mode).

This state is stored in the gate capacitance of the transistors Tc1 and Tc0. Table 1 lists the possible DCAM cell stored values. Transistors Tref0 and Tref1 are used for refreshing. The DCAM cell performs a match between its stored ternary value and a bit input (provided by BIT COMPARE and NBIT COMPARE, which represent a bit and its opposite value, respectively). The match line is precharged to "1"; thus, when no match exists this line is discharged by means of transistor Tm.

Table 1: Representation of stored data in a DCAM cell.

  Sb1  Sb0  State
  0    0    X (don't care)
  0    1    1 (one)
  1    0    0 (zero)
  1    1    not allowed

Normal operation involves comparing the input data to the patterns stored in the DCAM and determining if a match has been found. During the match operation, the input data is presented on the BIT COMPARE line and its inverse value on the NBIT COMPARE line. Before the actual matching of these two values is performed, the match line is precharged to "1", which indicates a match condition. The matching of the input data and the stored data is performed by means of an exclusive-or operation, which is implemented by the two transistors (Tc1 and Tc0) that hold the stored value.
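The Table 1 encoding and the precharge/discharge behavior of the match line can be modeled at the logic level. The sketch below is a behavioral approximation only, not a transistor-level description: the function names are hypothetical, and the pairing of Sb1 with BIT COMPARE and Sb0 with NBIT COMPARE is an assumption chosen so that a stored 0 mismatches an input 1 and a stored 1 mismatches an input 0.

```python
# Table 1 encoding of the stored ternary value as (Sb1, Sb0).
ENCODE = {"x": (0, 0), "1": (0, 1), "0": (1, 0)}


def cell_discharges(sb1: int, sb0: int, bit: int) -> bool:
    """True when one cell would pull the precharged match line low."""
    if (sb1, sb0) == (1, 1):
        raise ValueError("Sb1 = Sb0 = 1 is not an allowed state")
    bit_compare, nbit_compare = bit, 1 - bit
    # Assumed pairing: Sb1 (stored 0) conducts with BIT_COMPARE,
    # Sb0 (stored 1) conducts with NBIT_COMPARE. A don't care (0, 0)
    # never conducts, so it never discharges the line.
    return bool((sb1 and bit_compare) or (sb0 and nbit_compare))


def word_matches(pattern: str, bits: list) -> int:
    """Evaluate one DCAM word: 1 means match, 0 means no match."""
    if len(pattern) != len(bits):
        raise ValueError("pattern and input must have the same width")
    match_line = 1  # precharged to 1, which indicates a match
    for symbol, bit in zip(pattern, bits):
        sb1, sb0 = ENCODE[symbol]
        if cell_discharges(sb1, sb0, bit):
            match_line = 0  # any mismatching cell discharges the line
    return match_line


print(word_matches("1x0", [1, 0, 0]))
print(word_matches("1x0", [0, 1, 0]))
```

The first call leaves the line at 1 because the middle cell stores a don't care; the second discharges it because the leftmost stored 1 meets an input 0.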
Figure 4: CMOS circuit for a DCAM cell.

2.5 Selection Function and Port Assignment

The selection function should be designed to ensure the deterministic execution of the routing algorithms. With a given input (i.e. destination address) and a set of patterns stored in the matching unit, the port assignment should always be the same. The priority scheme allows only the highest priority pattern that matches the current input to pass on to the port assignment memory. The encoded priority (EPi) output depends on the match at the current bit-pattern and the priority for this row. If both match and priority are "1" then the encoded priority is true. A priority lookahead scheme has been proposed and implemented; it is reported in [12].

The port assignment memory (RAM) holds information about the output port that has to be assigned after a bit-pattern that matches the current input is found. This memory is proposed to be implemented using a dynamic approach. The selected row address is passed from the priority encoder. The cells in this row are read and their data is latched in the port assignment register. The DRAM structure is able to perform an OR function per column when multiple RAM rows are selected, that is, when multiple matches are passed on. The DRAM structure is explained in [5].

3 Wave-Pipelining

Wave-pipelining is an approach to achieve high performance in pipelined digital systems by removing intermediate latches or registers [10]. Idle time of individual logic gates within
combinational logic blocks can be minimized using wave-pipelining. Wave-pipelining reduces clock loads and the associated area, power and latency while retaining the external functionality and timing of a synchronous circuit [10]. Using wave-pipelining, new data can be applied to the input of combinational blocks or stages within a digital system before the previous outputs are available, thus effectively pipelining the combinational logic and maximizing the utilization of logic without inserting registers [8, 9].

3.1 Modelling Wave-Pipelining

Conventional circuit pipelining uses intermediate latches in addition to the input and output registers. Intermediate latches (registers) ensure that when the leading edge of the system clock arrives, data is propagated from one stage to the next in a synchronous manner. In such a system there is only one set of data between register stages. By using wave-pipelining in digital systems, the clock frequency of digital circuits can be increased. Wave-pipelining allows coherent waves of data to be sent through a block of combinational logic and allows new inputs to be applied before the previous data propagates to the output latch. There may be more than one data wave within the system. The rate at which data propagates through the circuit depends not on the longest path but on the difference between the longest and the shortest path delays [10]. Wave-pipelining requires that delay paths within a combinational logic circuit be balanced so that the data waves applied at the input arrive simultaneously at the next stage. An illustration of a wave-pipelined system appears in Figure 5. The intermediate latches have been removed and replaced by "intrinsic" latches. Also, the system clock is no longer distributed, and multiple "coherent waves" of data are present between storage elements.
3.2 Wave-Pipelining Requirements

Wave-pipelining requires studies across a variety of levels such as process, layout, circuit, logic, timing and architecture. Several research groups have explored some of these areas. Some of the challenges of designing wave-pipelined systems within any of the areas mentioned above include:

  • Preventing intermixing of unrelated data "waves". There must be no data overrun in any circuit block, and it must be ensured that the data path is not overcommitted. This is achieved by determining an appropriate range of clock frequencies at which to apply data at the input.
  • Designing dedicated control circuitry. Control logic circuits must be designed to operate synchronously with the circuitry of the pipeline stages. The number of these control logic circuits must be minimal in order to be area efficient.
  • Balancing delay paths. Delay paths must be equalized to ensure that data waves applied to the input latch propagate through all the stages synchronously. This is achieved by inserting delay elements in the shortest paths within logic blocks to equalize their delays to those of the longest paths. This approach allows for the elimination of the intermediate latches or registers.
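The delay-balancing requirement above can be illustrated numerically. This is a sketch under assumed values: the path names and the delays (in picoseconds) are made up, and `balancing_delays` is a hypothetical helper rather than a tool used in the paper. It computes how much delay would have to be inserted in each path so that all paths equal the longest one.

```python
def balancing_delays(path_delays: dict) -> dict:
    """Return the extra delay each path needs to match the longest path."""
    if not path_delays:
        return {}
    longest = max(path_delays.values())
    # Shorter paths receive padding; the longest path receives none.
    return {name: longest - delay for name, delay in path_delays.items()}


# Hypothetical path delays in picoseconds for one logic block.
paths = {"path_a": 120, "path_b": 95, "path_c": 150}
padding = balancing_delays(paths)
print(padding)

# After padding, the spread between longest and shortest path is zero.
balanced = {name: paths[name] + padding[name] for name in paths}
print(max(balanced.values()) - min(balanced.values()))
```

In practice inserted delay elements are quantized and vary with process and data, so the residual spread is not exactly zero; the sketch only shows the bookkeeping.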
Figure 5: Block diagram of a wave-pipelined digital system.

The requirements stated above are not exhaustive, but they represent some of the most important design issues in wave-pipelining. In this section some of the parameters of importance in wave-pipelining are discussed. These include delays within combinational logic, clock skew, and sampling time at output registers. Following an example from Wayne et al. [10], a single combinational logic block is considered in order to introduce the terminology used to determine the timing requirements for wave-pipelining. The timing constraints derived from this single block should hold for any design with more than a single stage of combinational logic.

3.3 Wave-Pipelining Formulation

To present the clock parameters and delays, a simple pipeline stage is shown in Figure 6. Some important labels are:

  • $D_{min}$: the minimum propagation delay through the combinational logic.
  • $D_{max}$: the maximum (worst case) delay in the combinational logic.
  • $T_{clk}$: the clock period.
  • $T_s$ and $T_h$: the register setup and hold times.
  • $\Delta$: the constructive clock skew.
  • $\Delta_{clk}$: the register's worst case uncontrolled clock skew.
  • $D_R$: the register's propagation delay.
  • $T_L$: the time at which data at the output register can be sampled.
  • $T_{sx}$: the temporal separation between the waves at an intermediate node $x$.

Figure 6: Longest and shortest path delays of a combinational logic block.

For this discussion it will be assumed that the combinational logic block has two inputs, that the setup and hold times of the input and output registers are equal, and that the propagation delay through the register is the same for both the input and output registers. From the diagram of Figure 6 the propagation delay of the signals through the combinational logic can be related to the input and output clocks. Figure 7 shows the maximum and minimum delays in relation to logic depth. The shaded region in the figure represents a period during which data is being processed within the combinational logic block. Figure 7 can be extended to represent several data sets being processed as time progresses.
Figure 7: Temporal/spatial diagram of a wave-pipelined system.

For wave-pipelining to operate properly, the system clocking must be such that the output data is clocked after the latest data has arrived at the output and before the earliest data from the next clock cycle arrives at the output [10]. The time at which data can be sampled at the output register, $T_L$, is given by:

$$T_L = N T_{clk} + \Delta \qquad (1)$$

where $N$ is the number of clock cycles required to propagate the input through the combinational logic block before it can be latched at the output register. Latching the latest data at the output register requires that the latest possible signal arrive early enough to be clocked by this register during the $N$th clock cycle. Thus, the lower bound of $T_L$ is:

$$T_L > D_R + D_{max} + T_s + \Delta_{clk} \qquad (2)$$

Latching the same data requires that the arrival of the next wave not interfere with the latching of the current wave output. The clock skew $\Delta_{clk}$ must also be accounted for, resulting in the upper bound of $T_L$ being:

$$T_L < T_{clk} + D_R + D_{min} - (\Delta_{clk} + T_h) \qquad (3)$$

Combining these two bounds results in a constraint on $T_{clk}$ given by:

$$T_{clk} > (D_{max} - D_{min}) + T_s + T_h + 2\Delta_{clk} \qquad (4)$$
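Equations 2 to 4 can be checked with a few lines of arithmetic. The following sketch uses made-up delay values in picoseconds; the class name `StageTiming` and the function names are hypothetical and chosen only for illustration. It evaluates the lower bound on the clock period from Equation 4 and the latching window implied by Equations 2 and 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageTiming:
    d_max: int     # D_max, longest combinational delay
    d_min: int     # D_min, shortest combinational delay
    d_r: int       # D_R, register propagation delay
    t_s: int       # T_s, setup time
    t_h: int       # T_h, hold time
    skew_clk: int  # Delta_clk, worst case uncontrolled clock skew


def min_clock_period(t: StageTiming) -> int:
    """Equation 4: T_clk must exceed this value."""
    return (t.d_max - t.d_min) + t.t_s + t.t_h + 2 * t.skew_clk


def latch_window(t: StageTiming, t_clk: int) -> tuple:
    """Equations 2 and 3: T_L must lie strictly between these bounds."""
    lower = t.d_r + t.d_max + t.t_s + t.skew_clk
    upper = t_clk + t.d_r + t.d_min - (t.skew_clk + t.t_h)
    return lower, upper


# Hypothetical numbers, all in picoseconds.
timing = StageTiming(d_max=1000, d_min=700, d_r=100, t_s=50, t_h=40, skew_clk=20)

print(min_clock_period(timing))
print(latch_window(timing, 500))
print(latch_window(timing, 430))
```

With these numbers the bound from Equation 4 is 430 ps. At a 500 ps clock the latching window is open (1170 to 1240 ps), while at exactly 430 ps the two bounds coincide and the strict inequalities leave no valid sampling time.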
From Equation 4 it is apparent that the minimum clock period is limited by the difference in path delays $(D_{max} - D_{min})$ and the clocking overhead $T_s + T_h + 2\Delta_{clk}$. This overhead occurs as a result of the inclusion of the input and output registers and the clock skew.

Internal nodes within the system need to be considered in this analysis. It is important to ensure that there is no data overlap within the system. Therefore, the next earliest possible set of data should not arrive at a node until the latest possible wave has propagated through. If $x$ is the output of an internal node within a logic network at a point on the logic depth axis of Figure 7, its longest and shortest delays from its inputs are represented by the variables $d_{max}(x)$ and $d_{min}(x)$, respectively. The equation that describes this internal node's constraint is:

$$T_{clk} \geq d_{max}(x) - d_{min}(x) + T_{sx} + \Delta_{clk} \qquad (5)$$

where $T_{sx}$ is the minimum time that node $x$ must be stable in order to correctly propagate a signal through the gate.

3.4 Minimizing the Clock Period

In order to minimize the clock period, both register constraints and internal node constraints must be taken into account. According to [11] the feasibility region of valid clock periods $T_{clk}$ is not continuous. It is composed of a finite set of disjoint regions. Equations 2 and 3 are revisited in order to derive a two-sided constraint on the clock period. It was established that the register latching time is given by $T_L = N T_{clk} + \Delta$. Substituting this into Equations 2 and 3, so that the latching time lies between its lower and upper bounds, gives:

$$D_R + D_{max} + T_s + \Delta_{clk} < N T_{clk} + \Delta < T_{clk} + D_R + D_{min} - (\Delta_{clk} + T_h) \qquad (6)$$

To avoid writing long expressions two variables, $T_{max}$ and $T_{min}$, are introduced:

  • $T_{max}$: maximum delay through the logic including clocking overhead and clock skews.
  • $T_{min}$: minimum delay through the logic including clocking overhead and clock skews.
$$T_{max} = D_R + D_{max} + T_s + \Delta_{clk} - \Delta \qquad (7)$$

and

$$T_{min} = D_R + D_{min} - (\Delta_{clk} + T_h) - \Delta \qquad (8)$$

Having established these new variables, Equation 6 can be simplified to read:

$$\frac{T_{max}}{N} < T_{clk} < \frac{T_{min} + T_{clk}}{N} \qquad (9)$$

The inequality to the right can be simplified as follows:

$$N T_{clk} < T_{min} + T_{clk}$$
$$N T_{clk} - T_{clk} < T_{min}$$
$$(N - 1) T_{clk} < T_{min}$$
Equation 9 becomes:

$$\frac{T_{max}}{N} < T_{clk} < \frac{T_{min}}{N - 1} \qquad (10)$$

The above analysis shows that the valid clock period is not continuous across values of $N$, and that for $N = 1$ it is bounded only from below, by $T_{max}$.

4 Hybrid Wave-Pipelining

Having presented the timing constraints of wave-pipelining, attention is now turned towards describing these timing constraints as they apply to a hybrid wave-pipelined approach. Equations that describe the timing constraints for this approach are derived and compared to those of the previous section.

In many computer/digital systems each stage has a significantly different function and circuitry; therefore, wide variations in delays ($D_{min}$ and $D_{max}$) may not be tolerated. Figure 8 shows an example of a wave-pipelined system. The inputs to the system are passed in a synchronous manner by means of an external clock. Assuming that three variables need to be computed in this stage and passed on to the following stage, it follows that for a given set of inputs each of these variables has a delay associated with it. These delays are denoted $d_A$, $d_B$, and $d_C$, for inputs A, B, and C respectively. The difference in delay may cause stage 2 (Figure 8) to start along a different computation path than expected. This in turn produces a false start. The false start creates unnecessary changes in stage 2 as well as additional delays that need not have occurred. The delays $d_A$, $d_B$ and $d_C$ depend on the gates associated with each path, and also on the set of input values.

Figure 8: Example of delays associated with pipeline stages.

These problems are avoided using a hybrid wave-pipelining approach. In order to provide some insight into the problem definition and solution, a brief summary is provided below. A common engineering practice is to consider the worst case delay ($D_{max}$) to ensure that the system runs properly.
$D_{max}$ plays a very important role in the system's performance and safe regions of operation. $D_{min}$ (the shortest delay path), on the other hand, imposes
a restriction on the valid input window. Bringing $D_{min}$ closer to $D_{max}$ could increase this window; in other words, it could decrease the clock cycle time.

Figure 9: Temporal/spatial diagram of a hybrid wave-pipelined system.

The equations derived for hybrid wave-pipelining are denoted by the subscript $h$ to differentiate them from those of the wave-pipelining time constraints. The definitions of the variables presented in the previous section still hold. To derive the equations that describe the timing constraints for the hybrid wave-pipeline, the temporal/spatial diagram representing this scheme is presented first. The shaded regions of Figure 9 indicate that data is not stable; therefore, register outputs cannot be sampled. The computational cones in this diagram have been arranged to represent each stage within the design. Some variables need to be defined:

  • $d_{min}(n)$: the minimum delay encountered in propagating data within a single stage $n$.
  • $D_{min\_hold}$: the overall minimum delay of all the stages, including the intrinsic registers' hold times.

For hybrid wave-pipelining, the lower bound of $T_{Lh}$ is:

$$T_{Lh} > D_R + D_{max} + T_s + \Delta_{clk} \qquad (11)$$

The upper bound of $T_{Lh}$ is:

$$T_{Lh} < T_{clk} + D_R + D_{min\_hold} - (\Delta_{clk} + T_h) \qquad (12)$$

where

$$D_{min\_hold} = d_{min}(0) + d_{hold}(0) + d_{min}(1) + d_{hold}(1) + d_{min}(2) + d_{hold}(2)$$
  • 16. This equation takes into consideration the intermediate stages of the design. The minimum delays and the hold times of each stage are considered. From the above equation it can be determined that:

$$D_{min\_hold} \geq D_{min} \qquad (13)$$

This implies that the delay difference Dmax − Dmin_hold is less than Dmax − Dmin for the wave-pipelining scheme. If further derivations are carried out, the clock period for the hybrid approach is determined to be:

$$T_{clk(h)} > (D_{max} - D_{min\_hold}) + T_s + T_h + 2\Delta_{clk} \qquad (14)$$

Comparing Equations 4 and 14, and having Dmin_hold ≥ Dmin, it can be concluded that Tclk(h) ≤ Tclk. The hybrid wave-pipelined approach therefore allows the clock period to be reduced, and hence performance to increase. A complete analysis of the hybrid wave-pipelining scheme must include clock cycle minimization, taking into consideration the constraints of the internal nodes of the system and the register constraints. Based on the analysis of Equation 6, the minimum delay of the hybrid approach can be rewritten to include the stage hold times as follows:

$$T_{min(h)} = D_R + D_{min\_hold} - \Delta_{clk} - T_h \qquad (15)$$

The maximum delay through the logic, including the overhead and clock skews, Tmax, remains unchanged for the hybrid scheme. From Equation 15, and by taking into consideration the fact that Dmin_hold ≥ Dmin, it is determined that Tmin < Tmin(h). Also, from Figure 9 it can be noticed that the region in which data is not stable, i.e. the difference Dmax − Dmin_hold, is short. It can then be safely stated that Dmax ≈ Dmin_hold. The signal latching time, Equation 6, now becomes:

$$D_R + D_{min\_hold} + T_s + \Delta_{clk} < N T_{clk} < T_{clk} + D_R + D_{min\_hold} - (\Delta_{clk} + T_h) \qquad (16)$$

This analysis is graphically presented in Figures 10 and 11. Figure 10 shows that there is room to make the clock cycle smaller, since the distance between the labels window_h and window_w can be reduced. Figure 11 shows how effectively this could affect the clock period, reducing it from Tclk to T'clk.
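As a rough numerical illustration of the hybrid timing constraints, the sketch below sums the per-stage minimum delays and hold times into Dmin_hold and evaluates the lower bound on the clock period for both schemes. It assumes that Equation 4 has the same form as Equation 14 with Dmin in place of Dmin_hold. The `Stage` class, the function names, and all delay values (in ns) are hypothetical and are not measurements from the router.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    d_min: float   # minimum propagation delay of the stage (ns), hypothetical
    d_hold: float  # hold time of the stage (ns), hypothetical


def d_min_hold(stages):
    """Sum of d_min(n) + d_hold(n) over all stages (definition under Eq. 12)."""
    return sum(s.d_min + s.d_hold for s in stages)


def min_clock_period(d_max, d_short, t_setup, t_hold, clk_skew):
    """Lower bound on the clock period in the form of Eq. 14.

    d_short is D_min for plain wave-pipelining (assumed form of Eq. 4)
    or D_min_hold for the hybrid scheme.
    """
    return (d_max - d_short) + t_setup + t_hold + 2 * clk_skew


# Hypothetical three-stage pipeline; values chosen only for illustration.
stages = [Stage(1.5, 3.5), Stage(0.5, 1.0), Stage(0.5, 3.0)]
D_MAX, T_S, T_H, SKEW = 11.0, 0.25, 0.25, 0.125

d_min = sum(s.d_min for s in stages)        # shortest path without hold times
t_wave = min_clock_period(D_MAX, d_min, T_S, T_H, SKEW)
t_hybrid = min_clock_period(D_MAX, d_min_hold(stages), T_S, T_H, SKEW)

print(f"D_min = {d_min} ns, D_min_hold = {d_min_hold(stages)} ns")
print(f"wave-pipelined bound: {t_wave} ns, hybrid bound: {t_hybrid} ns")
```

With these made-up numbers the hybrid bound is smaller than the plain wave-pipelined bound, which is the qualitative relationship the comparison of Equations 4 and 14 describes.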
4.1 Timing Approach

Propagating coherent data waves from one pipeline stage to the next without the use of a distributed system clock is achieved by balancing the delays within each pipeline stage. Delay balancing is necessary since the data within a circuit module takes different paths and experiences different delays depending on the operation, resulting in the data arriving at the input of the next stage at different times. The data that has the shortest delay arrives at the input to the next stage earlier than data that takes the critical path (longest delay), hence the use of intermediate latches between pipeline stages (modules). Burleson et al. [10] have suggested buffer insertion as a method of delay balancing. Buffers are inserted
  • 17. Figure 10: Temporal/spatial diagram before clock period reduction. Figure 11: Temporal/spatial diagram after clock period reduction.
  • 18. in the shortest signal path to force these delays to be equal to the worst case delay, which in turn permits the signals to arrive at the input of the next stage at the same time without the use of intermediate latches or the system clock. A different approach is considered in this research. It does not involve buffer insertion, but uses the delays in the design of the appropriate signals that will enable data to ripple through the stages coherently without the use of a distributed clock or intermediate latches. The method used is not limited to the bit-pattern associative router circuitry and should find application in many circuits that employ the wave-pipelining technique to improve performance. The difference between the maximum (worst case) delay and the minimum delay is of particular interest in designing the timing scheme of a wave-pipelined bit-pattern associative router. There will be no buffer insertion to balance the delays; instead, specialized circuitry is designed to provide a very precise timing sequence. The advantage of this approach is that it is not affected by scaling or use of a different technology, since all the modules are affected by these changes in the same way. The only major limiting factor becomes the worst case delay of the circuit. The valid clock pulse width (PW) can be made short, but it cannot be less than Dmin + (Dmax − Dmin), i.e. PW ≥ Dmin + (Dmax − Dmin). This expression includes the rise and fall times of the clock and is based on measurements taken from SPICE simulations. This gives the minimum pulse width that will enable application of wave-pipelining to the bit-pattern associative router without intermixing the "data waves" within any given pipeline stage. This provides the basis for the design of the clock pulses required to cause the data to ripple through the pipeline stages. The system clock is responsible for passing data from the search argument register (SAR) to the bit-pattern associative unit.
It has been established that the bit-pattern associative router requires the following signals for its operation:

Read and write signals for the memories (DCAM and DRAM). The write signal must be raised to a logic 1 only during programming (storing of the current address) and refreshing of the stored data; this is done when the row select signal selects a specific word to write to within the bit-pattern associative unit array. Reading occurs when there is a need to refresh the contents of the words stored within the array. The read and write signals are generated whenever the stored data has been sensed to degrade.

Bit lines. They pass the input to the bit-pattern associative unit array during programming as well as normal operation, and they need to be precharged before reading from the array. There is thus a need to design specialized circuitry to generate the precharge signal for the bit lines.

The evaluate signal. The evaluate signal is responsible for determining if a matching condition exists in the bit-pattern associative unit. Special circuitry to provide the proper evaluate signal has to be designed based on dmax(x) and dmin(x), where dmax(x) is the maximum delay path for a signal propagating through stage x and dmin(x) is the minimum delay in propagating a signal through stage x of the bit-pattern associative router.

The precharge signal. The match line of the bit-pattern associative unit is designed to be normally on, and this is achieved by precharging it before evaluation occurs. A precharge signal, therefore, has to be designed and it must occur at appropriate times.
  • 19. The signals dmin(x) and dmax(x) play an important role in the design of the precharge signal for the match line.

The signal that permits the "data wave" to be passed to the selection function. The signal that allows the match output to be passed to the selection function is generated by using the evaluate signal and sizing transistors with the threshold voltage of the transistor used to latch the match outputs.

Precharge for the selection function. This signal is simply the inverse of the evaluate signal.

The priority status signal. Within the selection function the priority status signal is of importance. If a match has been found in any of the entries other than the last entry of the matching unit, the selection function has to detect it as a match with the highest priority and must then propagate a priority status indicating to the entries below that a match has been found. In this design a priority status Pi = 0 indicates that a match with the highest priority has been found and hence kills all the other possible matches below the current one. The delays experienced by this signal are of importance since it can be used to determine when new input can be applied to the SAR.

The signal required to pass the selected match output to the port assignment. This signal is required to gate the selected match output, which is the one determined to have the highest priority by the selection function, to the next stage. The system clock is not used in the selection function; therefore, the signals already generated have to be used to provide precise timing to pass the selected match output to the port assignment memory. This signal serves to select the appropriate output port assignment.

The signals described above form the basis for proper timing of the hybrid wave-pipelined bit-pattern associative router.
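The priority status behavior described for the selection function can be sketched as a purely behavioral model. This is a minimal illustration, not the circuit: it ignores all delays, the priority lookahead, and the enable signals. The function name and the convention that index 0 is the highest-priority entry are assumptions made for the example.

```python
def select_highest_priority(matches):
    """Return one-hot select lines for the highest-priority match.

    matches[i] is 1 if entry i of the matching unit found a match.
    Index 0 is assumed to be the highest-priority (top) entry.
    """
    priority = 1          # P = 1: no match has been found above this entry
    selects = []
    for match in matches:
        # An entry is selected only if it matches and nothing above it did.
        selects.append(1 if (match and priority) else 0)
        if match:
            priority = 0  # P_i = 0 kills all possible matches below
    return selects


# First and last entries match at the same time: only the first is selected.
print(select_highest_priority([1, 0, 0, 1]))
# Only lower entries match: the topmost of them is selected.
print(select_highest_priority([0, 1, 0, 1]))
```

The first call corresponds to the case the text singles out as critical, where the first and last entries match simultaneously and the last entry must not produce a pointer.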
The primary objective is to reduce clock cycle time by allowing the delays to propagate data from stage to stage with the use of neither intermediate latches nor distributed clocks. As clock rates continue increasing, clock distribution is becoming a major problem; wave-pipelining could alleviate this problem by significantly reducing the clock load as well as the number of inter-stage latches [1].

4.2 Hybrid Wave-Pipelining Design

In this section circuits to generate some of the signals discussed above are presented. Removal of intermediate latches introduces the possibility that two or more data waves can be intermixed. A wave-pipelined design must then ensure that this data collision does not occur. Circuitry to ensure that data propagates within its own wave has been designed.

4.3 Control Signals for the Bit-Pattern Associative Unit

The requirement for precharging the match line is that this occurs before evaluation takes place and after the output of the match line has been passed to the next stage. As a result
  • 20. the precharge signal is designed using the signals used to evaluate and pass the match line output to the next stage. The circuit in Figure 12 is designed to generate the match line precharge signal. Transistor Tc, when conducting, charges up node V, and the status of node V is passed to the inverter input when the pass signal is at "0". At evaluation time it is required that the match line precharge signal be at "1" to prevent a conflict of operations on the match line (attempting to precharge the match line while evaluating it). Also, care must be taken to prevent precharging the match line before its output is passed to the next stage. The match line precharge signal, therefore, has to be low only when both the evaluate and pass signals are at "0"; that is, the match line must be precharged when both the evaluate and pass signals are at "0". This condition is enforced by use of the evaluate signal to turn on transistor Td, providing a path to ground during evaluation. This occurs when pass is at logic level 0, ensuring that node V gets discharged to "0", resulting in the match line precharge signal being at logic 1. The precharge device is a p-type transistor; it is, therefore, not conducting when the match line precharge signal is at logic 1. Transistor Toff passes a "1" to the gate of Tc when the evaluate signal is at logic 1, preventing a "1" from being passed to node V by turning off transistor Tc. Since the pass and evaluate signals are never at logic 1 at the same time, pass at the gate of Ton, after allowing a time lapse of two inverter delays, enables Tc to conduct. This arrangement ensures that the precharge signals will operate precisely as desired.

Figure 12: Generating the match line precharge signal.

The evaluate signal must go to "1" after the data passed to the DCAM for comparison has stabilized on the bit lines (BIT COMPARE and NBIT COMPARE).
There is circuitry that generates this signal once all the lines have been set to their appropriate input values. This circuitry takes into account setting a "0" or "1" in the particular CAM array and is shown in Figure 13. Its operation is governed by the status of the match line. The inverse of the clock gets passed to an intermediate node when the match line is at "1", and this intermediate node gets precharged to "1" whenever the match line is at "0". The inverse of the intermediate node is the evaluate signal. Using the match line to pass the clock signal
  • 21. and set the evaluate signal to "0" when evaluation is completed provides the proper duration for the evaluate signal to stay at "0" or "1". Passing the clock signal when the match line is at logic 1 ensures that the data has had enough time to stabilize on the bit lines.

Figure 13: Generating the evaluate signal.

Once evaluation has been completed the match line outputs are passed to the selection function. Thus, the pass signal must become active immediately following completion of the evaluation process. The circuit used to generate the pass signal is shown in Figure 14. The pass signal is designed to mimic the path that a zero on the match line (non-matching condition) takes once evaluation completes. Passing a "0" to the selection function provides the maximum delay that can be experienced in passing the matching unit's outputs to the selection function and, therefore, constitutes the worst case propagation delay for this operation. The node pass is initialized to "1" by precharging node z when the evaluate signal is "1". A "1" at the pass node ensures that transistor Tpz passes a "0", which gets inverted and becomes the input to transistor Tpd. When transistor Tpd conducts, discharging the pass node, this becomes an indication that the pass signal has stayed at "1" long enough to pass the output of the match line to the selection function. The duration and voltage level of both the evaluate and pass signals need to be just long and high enough to accomplish their purpose. The circuits in Figures 13 and 14 have been designed to sense the duration and voltage levels of these signals. All the signals presented to this point depend on the system clock.

Figure 14: Generating the pass signal.
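The condition stated for the match line precharge signal (low only when both evaluate and pass are at "0", with a p-type precharge device that conducts on a low signal) can be written as a small logic-level model. This is only a sketch of the stated truth table under the assumption of ideal 0/1 levels; it does not model the transistors of Figure 12, the two-inverter delay, or any analog behavior, and the function names are hypothetical.

```python
def matchline_precharge(evaluate: int, pass_: int) -> int:
    """Logic level of the (active-low) match line precharge signal.

    The signal is 0 only when both evaluate and pass are 0.
    """
    if evaluate not in (0, 1) or pass_ not in (0, 1):
        raise ValueError("signals must be 0 or 1")
    if evaluate and pass_:
        # The text states these two signals are never at logic 1 together.
        raise ValueError("evaluate and pass must not both be 1")
    return evaluate | pass_


def precharge_device_conducts(precharge: int) -> bool:
    """The p-type precharge device conducts when its gate signal is 0."""
    return precharge == 0


# One idealized cycle: precharge, then evaluate, then pass, then precharge.
for evaluate, pass_ in [(0, 0), (1, 0), (0, 1), (0, 0)]:
    level = matchline_precharge(evaluate, pass_)
    print(evaluate, pass_, level, precharge_device_conducts(level))
```

In this model the match line is precharged only in the idle phases, so precharging never overlaps evaluation or the hand-off of the match line output to the selection function.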
  • 22. Another signal of importance is within the selection function. This is the signal that enables the port assignment select signals, i.e. the output of the selection function, which is the pointer to the selected word of the port assignment. It is required that if a match with the highest priority has been established, the ones below this match output must be disabled. The priority status performs this function along with an enable signal, which has been designed based on the fact that it takes longer to pass a "0" from the bit-pattern associative unit. The required delay between sets of four selection function cells is arrived at by inserting delay elements within the circuit. This approach ensures the prevention of false starts.

Figure 15: Enabling the selection function outputs.

4.4 Hybrid Wave-Pipelining Signals

The signals generated by the above circuits appear in Figure 16 along with the clock. The delays of the first two stages are clearly marked in the figure to show a correlation with the hybrid wave-pipelined approach. In Figure 16 the minimum delays, dmin, and hold times, dhold, of each stage are shown. The durations of the evaluate and pass signals are also depicted. Using 1 µm technology, 2.5 V is sufficient to determine a logic 1. Once the outputs of the match lines have been received by the selection function, a decision needs to be made when more than one matching condition has been received. The selection function is designed to propagate a priority status Pi to the entry below it, indicating whether it has registered a match. This priority must be propagated very fast to prevent false starts. Some of the signals of importance in this scheme are shown in Figure 17. The selection function's critical operation occurs when the first and last entries of the selection function simultaneously receive inputs indicating that matching conditions have been found in the corresponding bit-pattern associative unit entries.
The first entry of the selection function has to propagate a "0" to the last entry of the selection function to prevent generation of
  • 23. a pointer to the DRAM by the last entry. Signals to lessen the possibility of false starts are generated and these are labeled enable. They are designed to ensure that even if a false start occurs the pointer to the DRAM array is not established. In Figure 17 delays from the second stage of the BPAR and those of the third stage are shown. Figure 17 also shows the signals used to enable the DRAM pointers for the first and last entries, taken from a setup in which the last entry always finds a match and the first entry finds a match every other clock cycle. The delays involved are shown. The plots in Figure 18 are those of the DRAM pointers to the first and last entries of the DRAM. They serve to show that an output port assignment can be read from the DRAM array every one and a half clock cycles. Also, ensuring that the matching condition occurs every other clock cycle for the first entry of the bit-pattern associative unit, while the last entry always finds a match, makes for a good check for false starts.

Figure 16: Plot of the bit-pattern associative unit control signals (evaluate, pass, and clock; voltage in V versus time in µs; SF = selection function).
  • 24. Figure 17: Plots of the enable signals (voltage in V versus time in µs; SF = selection function). Figure 18: Plot of the DRAM pointers (voltage in V versus time in µs).
  • 25. 5 BPAR's Space/Time Hybrid Wave-Pipelining

Figure 19 shows the delays associated with each stage (as shown in Figures 16, 17, and 18). The computational cones of Figure 19 show the three stages of the bit-pattern associative router. Hybrid wave-pipelining has enabled the delay difference per stage to be reduced. Unlike wave-pipelining, where Dmin for the system provides the upper bound for the clock period, hybrid wave-pipelining shows that each stage has its own maximum delay that can be manipulated independently of the other stages to obtain optimal timing. The bit-pattern associative unit completes its worst case operation (computation) in 5.1 nanoseconds (ns). This time includes the input latch delays, clock skew, and hold and setup times for this stage. To accommodate the latest data wave arriving at this stage, additional time during which this data can still be passed to the next stage is allocated. This time appears in the figure as dpass with a value of 0.9 ns. For data arriving late at the bit-pattern associative unit, at least 2 ns remain during which this data can still be processed within this stage. The hold time for the selection function is very short, almost equal to the minimum delay of this stage. This short hold time is a direct result of the selection function design, which ensures very fast propagation of the priority status to entries below the current one by means of the priority lookahead. The port assignment has a minimum delay of 0.4 ns and requires 3 ns of hold time. It also accommodates the latest "wave" for the current data. Reduction of the gap between Dmax and Dmin at each stage is presented here graphically, with the delays displayed to show how the clock period is made shorter in this design. Comparing this figure with Figure 20, it can be seen how the time during which data can be sampled at the output latches is reduced. A combined temporal and spatial diagram for a wave-pipelined router is shown in Figure 20.
The horizontal axis represents time while the vertical axis represents the circuitry of the system, which includes: input latch, CAM (bit-pattern associative unit), selection function, RAM (port assignment), and output latch. The shadowed regions represent the times when a computation is being performed and data may be unstable. The space between two adjacent regions corresponds to stable data. In this figure several waves of data are shown (from data i − 1 to i + 1). It should be pointed out that the worst time for the selection function occurs when the top priority CAM entry matches and produces a read. This in turn creates a delay for the last entry, which is ready before the priority lookahead signal reaches it. In the figure, it is shown how the first entry affects the last entry. In a traditional pipeline this creates a very long delay in the selection function stage, since the CAM output is not known until the pipeline clock comes. The output data should be read when it is stable; thus the output clock should be synchronized to achieve this. The hybrid wave-pipelined router described in this paper has been fabricated using a 0.5 µm technology and the chip tested. Some traces of the signals described in this section appear in the Appendix.

6 Concluding Remarks

An extremely fast flexible router capable of accommodating a number of routing algorithms and networks has been presented in this paper. The bit-pattern associative router is designed
  • 26. Figure 19: Hybrid wave-pipelined delays translated to computational cone (labeled delays: input latch and CAM, dmin 1.6 ns, dhold 3.5 ns, dpass 0.9 ns, devaluate 2 ns; selection function (SF), dmin 0.7 ns, dhold 0.8 ns; RAM and output latch, dmin 0.4 ns, dhold 3 ns).

Figure 20: Router's temporal/spatial diagram.

using a novel scheme that combines wave-pipelining and conventional pipelining, producing hybrid wave-pipelining. The hybrid wave-pipelining approach narrows the gap between Dmin and Dmax, resulting in a reduction of the clock cycle time. To further improve the performance of the bit-pattern associative router, dynamic circuits are used along with a high performance selection function that reduces priority status propagation delays. This study is the first of its kind; it examines the delays within each module of the design and minimizes them
  • 27. individually. Issues such as signal generation, uneven delays, and timing have been addressed to synchronize the pipeline. The bit-pattern associative router presented in this paper has been designed, implemented, and tested. The test results appear in the appendix.
  • 28. References

[1] C. Thomas Gray, Wentai Liu, and Ralph K. Cavin, III, Wave Pipelining: Theory and CMOS Implementation, Kluwer Academic Publishers, 1994.

[2] D. Clark and J. Pasquale, "Strategic Directions in Networks and Telecommunications," ACM Computing Surveys, vol. 28, no. 4, pp. 679-690, Dec. 1996.

[3] T. Leighton, "Average Case Analysis of Greedy Routing Algorithms on Arrays," 2nd Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 2-10, Crete, Greece, 1990.

[4] R. Cypher, A. Ho, S. Konstantinidou, and P. Messina, "Architectural Requirements of Parallel Scientific Applications with Explicit Communication," 20th International Symposium on Computer Architecture, pp. 2-13, May 1993.

[5] J. G. Delgado-Frias, J. Nyathi, and D. H. Summerville, "A Programmable Dynamic Interconnection Network Router with Hidden Refresh," IEEE Trans. on Circuits and Systems I, vol. 45, no. 11, pp. 1182-1190, Nov. 1998.

[6] D. H. Summerville, J. G. Delgado-Frias, and S. Vassiliadis, "A Flexible Bit-Associative Router for Interconnection Networks," IEEE Trans. on Parallel and Distributed Systems, vol. 7, no. 5, pp. 477-485, May 1996.

[7] D. H. Summerville, J. G. Delgado-Frias, and S. Vassiliadis, "A High Performance Pattern Associative Oblivious Router for Tree Topologies," IPPS '94: The 8th International Parallel Processing Symposium, pp. 541-545, Cancun, Mexico, April 1994.

[8] C. Thomas Gray, Wentai Liu, and Ralph K. Cavin, III, "Timing Constraints for Wave-Pipelined Systems," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 13, no. 8, pp. 987-1004, August 1994.

[9] D. Ghosh and S. K. Nandy, "Design and Realization of High-Performance Wave-Pipelined 8 X 8 b Multiplier in CMOS Technology," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 3, no. 1, pp. 36-48, March 1995.

[10] W. P. Burleson, M. Ciesielski, F. Klass, and W. Liu, "Wave-Pipelining: A Tutorial and Research Survey," IEEE Trans.
on Very Large Scale Integration (VLSI) Systems, vol. 6, no. 3, pp. 464-474, September 1998.

[11] W. K. C. Lam, R. K. Brayton, and A. L. Sangiovanni-Vincentelli, "Valid Clock Frequencies and Their Computation in Wave-pipelined Circuits," IEEE Trans. on Computer-Aided Design, vol. 15, no. 7, pp. 791-807, July 1996.

[12] J. G. Delgado-Frias and J. Nyathi, "A High Performance Encoder with Priority Lookahead," IEEE Sixth Great Lakes Symp. on VLSI, pp. 59-64, Lafayette, Louisiana, Feb. 1998.
  • 29. Figure 21: Matching the first and last entries alternately (traces: evaluate; match-alternately, first and last entries).

Figure 22: Periodic discharge of the matchline (traces: matchline; pass).

Figure 23: Port Assignment output (traces: RAM_output, first entry; RAM-output, last entry).