Power efficient solution for network on chip

Term Paper Submission ECE 562 – Fall 2013
ISBs: Bidirectional Buffer-less Router with Intelligent Space Buffers
Dhiraj Chaudhary and Ahmed Louri
Dept. of Electrical and Computer Engineering, University of Arizona, Tucson, AZ 85721
{dhirajchaudhary,louri}@ece.arizona.edu
ABSTRACT
Buffers in routers consume significant power and area. Buffer-less router designs remove this cost but show a significant degradation in performance at high injection rates. We propose a novel NOC architecture based on intelligent space buffers (ISBs) that addresses both power and performance issues, and we make a case for a new approach to power-efficient Network-on-Chip design that uses buffer-less routers with improved performance.
General Terms: Architecture, Algorithm, Design.
Keywords: routing, network on chip, control, buffers, channels.
1. INTRODUCTION
Today, high performance and low power are very tight constraints for Network on Chip (NOC) design. The NOC consumes up to 30% of chip power in the Intel 80-core Terascale chip [1] and about 40% in the MIT RAW chip [2]. A lot of work has been done, and is still in progress, to balance power and performance. As the number of cores increases, latency dominates, and power control mechanisms worsen the situation further. It is therefore essential to design a low-power NOC while keeping performance within acceptable limits. This paper discusses a new low-power design that can be thought of as a balanced implementation for future NOC designs.
Buffers are power hungry. Moscibroda and Mutlu [3] suggest that removing buffers can save up to 60% of total NOC power. However, removing buffers has a potential negative impact on performance and bandwidth efficiency. Their buffer-less design (BLESS) works well at low injection rates, but at high injection rates it consumes a substantial percentage of chip power and its performance degrades. Latif et al. [4] discuss a very straightforward approach: utilize idle buffers. Storing a packet requires more power than transmitting it, so it is better to keep packets moving [9]. Sharing buffers among ports or virtual channels can significantly reduce the buffer count. This design comes with additional computational complexity, which affects area and, in certain cases, possibly power. Kodi et al. [5] introduced adaptive dual-function links, which can be dynamically configured as repeaters or, in case of congestion, as storage units. This can save about 40% of buffer power and is area efficient as well.
In this paper, we propose intelligent space buffers (ISBs), which can achieve high performance with buffer-less routers while keeping power consumption within acceptable limits. We deploy buffers in the space around the router. Congestion control is an inherent function of the control unit, which dynamically manages the number of buffers allocated to each channel according to traffic. Bidirectional links [6] are used to make more effective use of the buffers.
2. RELATED WORK
2.1 BLESS: buffer-less routers
Buffers are responsible for 60% of total power consumption in a network on chip (NOC) and consume about 64% of its static power [7] [8]. Many researchers therefore try to remove buffers from the router completely. The buffer-less router design BLESS by Moscibroda and Mutlu [3] demonstrates a 60% reduction in area, deadlock avoidance, a simplified router design, and freedom from live-lock. However, the results show that eliminating buffers causes a major degradation in performance. The concept works well at low injection rates, but at high injection rates a significant degradation in both power and performance has been observed [3].
In a conventional design, buffers are associated with each virtual channel. In addition, there is large, area-hungry control circuitry, including the VC allocator, switch allocator, and route computation unit.
Figure 1. Traditional switch architecture with buffers
Figure 2. Buffer-less switch architecture
If we go for a buffer-less router, significant area can be saved. BLESS uses a hot-potato routing protocol. It is a deflection-based mechanism in which a router, after receiving a packet or flit, deflects it in some direction based on port availability. The flit-ranking mechanism illustrated in figure -- takes care of the live-lock problem caused by deflection: the oldest packet gets the highest priority, which avoids live-lock in the buffer-less network. Because flits are always in motion, deadlock, one of the major problems in routers with buffers, cannot arise. Another advantage of BLESS is very low router latency, because less routing computation is needed. Pipeline latency is lower in BLESS than in a conventional router with buffers because the virtual channel allocation and switch allocation stages are eliminated.
The major drawback is that buffer-less routing does not perform well at high injection rates: as the injection rate at the router increases, performance degrades drastically. As illustrated in [3], at an injection rate of 0.08 the buffer-less router outperforms the router with buffers, whereas at an injection rate of 0.28 there is a drastic increase in link and router energy. This is because a packet deflected in the wrong direction takes longer to reach its destination. The experimental results in [3] clearly indicate that the buffer-less network saturates at an injection rate of 0.29, compared to 0.35 for a router with 4 VCs and 4-flit buffers. All experiments consider an 8 x 8 network of routers and use synthetic traces with four traffic patterns: uniform random (UR), transpose (TR), mesh tornado (TOR), and bit complement (BC).
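The oldest-first ranking and deflection behavior described above can be sketched in a few lines of Python. This is only an illustrative model, not the BLESS implementation from [3]: the names Flit, PORTS and assign_ports, the use of an age counter as the rank, and the tie-break on flit_id are all assumptions made for the example.

```python
from dataclasses import dataclass

PORTS = ("N", "E", "S", "W")  # output ports of one mesh router (assumed names)


@dataclass
class Flit:
    flit_id: int
    age: int          # cycles since injection; larger means older (assumed rank)
    preferred: tuple  # productive output ports, best first


def assign_ports(flits):
    """Give every incoming flit exactly one output port, oldest flit first."""
    if len(flits) > len(PORTS):
        raise ValueError("a router cannot receive more flits than it has ports")
    free = list(PORTS)
    assignment = {}
    # Oldest-first ranking; flit_id breaks ties so the order is total.
    for flit in sorted(flits, key=lambda f: (-f.age, f.flit_id)):
        productive = [p for p in flit.preferred if p in free]
        # Take a productive port if one is free, otherwise deflect
        # to any port that is still free.
        port = productive[0] if productive else free[0]
        free.remove(port)
        assignment[flit.flit_id] = port
    return assignment


flits = [
    Flit(flit_id=1, age=5, preferred=("E",)),
    Flit(flit_id=2, age=9, preferred=("E",)),
    Flit(flit_id=3, age=2, preferred=("N", "E")),
]
result = assign_ports(flits)
print(result)
```

In this run the oldest flit (2) wins the contested east port, while flits 1 and 3 are deflected to north and south. No flit is ever stored, which is why such a router needs no buffers, and also why deflected flits travel extra hops at high load.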
The BLESS design works well for networks with low traffic. In NOCs it is applicable to the memory-core interface, because memory and core communicate at low injection rates. However, several issues remain with buffer-less routers. The first is flit overhead: every flit must carry its own header. The second is high latency for individual flits to reach the destination. Because flits arrive at different times, reassembling them into a packet may require a large buffer at the receiver. Because of these drawbacks, BLESS has not had much success in terms of practical implementation.
2.2 Shared buffers
In this design, Latif et al. [4] propose sharing the buffers associated with each virtual channel. Figure 3 shows the input part of the router for this design.

Figure 3. Architecture of input part of router for shared buffers NOC design (Courtesy of Latif, Khalid, Tiberiu Seceleanu, and Hannu Tenhunen. "Power and area efficient design of network-on-chip router through utilization of idle buffers." Engineering of Computer Based Systems (ECBS), 2010 17th IEEE International Conference and Workshops on. IEEE, 2010.)
Figure 1 shows the conventional router architecture, in which each virtual channel has its own buffer space. Traffic on virtual channel 1 cannot use the buffers of another virtual channel even when they are free. In practice, the buffers are never 100% utilized. The idea is to put this unused buffer space to work, which is what the shared-buffer architecture in figure 3 does.
The main contribution of [4] lies in the input part, where the channels share a common buffer space. Each packet is divided into flits, the first of which is the head flit, called the beginning of packet (BOP). When a BOP arrives at the buffer allocator unit, the allocator looks for a free buffer slot and allocates it. An allocated signal is then sent to the buffer write controller, which responds with a busy signal. After receiving the busy signal, the buffer allocator sends the allocated-to signal, which sets the multiplexer select pins of the input buffer. After allocation, a grant signal is sent to the port that is sending flits; this signal acts as the virtual channel identifier. For every new flit, the port sends the NewFlit_Dx_x signal to the buffer write controller. When two requests compete for one buffer slot, arbitration is performed using the priority signal shown in figure 3. Status_flag is the logical AND of all the busy signals and indicates that all buffer slots are full. After receiving this signal, the requesting neighboring port decides whether to redirect its flits in some other direction or to hold them until the congestion is resolved.
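As a rough behavioral sketch of the allocation steps above, the following Python model keeps one busy bit per shared slot and derives Status_flag as the AND of those bits. It is not the hardware design from [4]: the class SharedBufferPool, its method names, and the rule that a lower priority number wins arbitration are assumptions introduced for illustration.

```python
class SharedBufferPool:
    """Behavioral model of buffer slots shared by the channels of one input port."""

    def __init__(self, num_slots):
        self.busy = [False] * num_slots   # busy[i] is True while slot i is in use
        self.owner = [None] * num_slots   # channel that currently holds slot i

    @property
    def status_flag(self):
        # Logical AND of all busy signals: True only when every slot is full.
        return all(self.busy)

    def allocate(self, requests):
        """Serve BOP requests given as (priority, channel_id) pairs.

        Assumption: a lower priority number wins arbitration.
        Returns {channel_id: slot_index} for the requests that were granted.
        """
        grants = {}
        for _, channel in sorted(requests):
            if self.status_flag:
                break  # pool is full: remaining requesters must wait or redirect
            slot = self.busy.index(False)  # first free slot
            self.busy[slot] = True
            self.owner[slot] = channel
            grants[channel] = slot
        return grants

    def release(self, slot):
        """Free a slot once the tail flit of its packet has left."""
        self.busy[slot] = False
        self.owner[slot] = None


pool = SharedBufferPool(num_slots=2)
first = pool.allocate([(1, "VC1"), (0, "VC0")])
second = pool.allocate([(0, "VC2")])
print(first, pool.status_flag, second)
```

With two slots, the first two requests are granted and Status_flag goes high, so the third request gets nothing until a slot is released. Any channel can take any free slot, which is the point of sharing.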
2.3 iDEAL- Inter-router Dual-function Energy and Area-efficient Links for NoC architectures
Continuing the line of router design improvements, Kodi et al. [5] present an architecture for NOCs that saves up to 40% of buffer power and 41% of router area. The basic idea is to use the repeaters in the links to act dynamically as buffers. iDEAL replaces conventional buffers with three-state repeaters on the links. When the control signal is low, a three-state repeater behaves like a conventional repeater. When the control signal is high, it acts as a buffer that holds the bit.
Figure 1 illustrates the conventional router architecture, in which each virtual channel has 4 buffer slots of 128 bits each. Some of these buffers can be removed from the router and placed on the link, which saves router area and power. Figure 4 shows the router buffer configuration reduced from v4-r16-c0 to v4-r8-c8. A congestion control signal dynamically configures these adaptive link buffers (ALBs) to act as repeaters or buffers according to the traffic load. iDEAL improves power and area by more than 40% with a 1-2% degradation in performance [5].
Figure 4. Dual function links used in iDEAL NOC architecture (Courtesy of Kodi, Avinash Karanth, Ashwini Sarathy, and Ahmed Louri. "iDEAL: Inter-router dual-function energy and area-efficient links for network-on-chip (NoC) architectures." ACM SIGARCH Computer Architecture News. Vol. 36. No. 3. IEEE Computer Society, 2008)
2.4 BiNoC: A Bidirectional NoC Architecture with Dynamic Self-Reconfigurable Channel
Bidirectional NoCs allow each communication channel to be dynamically configured in either direction to enhance performance. This design shows a significant increase in performance at some area penalty [6]. The aim is to use the channel bandwidth more effectively. In BiNoC, if the outgoing channel carries more traffic than the incoming channel, the direction of the incoming channel can be switched so that the load is shared between the two channels. BiNoC is suited to networks in which traffic density differs widely between opposite directions.
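A deliberately simplified sketch of this direction-switching idea is shown below. It is not the channel-control protocol of [6]: the function configure_channels and its rule, which lends a channel to the other direction only when the reverse direction has nothing pending, are assumptions chosen to keep the example short.

```python
def configure_channels(pending_r1_to_r2, pending_r2_to_r1):
    """Return the directions of the two channels between routers R1 and R2.

    Assumed rule: each direction normally owns one channel; a channel is
    lent to the other direction only when its own direction is idle.
    """
    if pending_r1_to_r2 > 0 and pending_r2_to_r1 == 0:
        return ("R1->R2", "R1->R2")   # both channels carry R1's traffic
    if pending_r2_to_r1 > 0 and pending_r1_to_r2 == 0:
        return ("R2->R1", "R2->R1")   # both channels carry R2's traffic
    return ("R1->R2", "R2->R1")       # default: one channel per direction


print(configure_channels(4, 0))
print(configure_channels(0, 3))
print(configure_channels(2, 2))
```

Lending a channel only when the reverse direction is idle avoids starving that direction. A real design also needs a handshake so that both routers agree on the direction before driving the wires.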
3. DESIGN OF INTELLIGENT SPACE BUFFERS
3.1 NOC router Architecture
We use an n x n 2-D mesh architecture. Routers are buffer-less, and each is connected to a processing element (PE) and to its four adjacent neighbors: north, east, south, and west. Packets are divided into head, body, and tail flits, as in conventional architectures. A deflection routing algorithm is used in this design.
3.2 Problem description
Buffer-less routers show a significant degradation in performance and power consumption at high injection rates, which defeats the purpose of going buffer-less [3].
Figure 5. (a) Dropped packet in case of congestion in the BLESS router architecture. (b) Redirected packet in case of congestion in the BLESS architecture.
In figure 5, suppose that B and C both send packets to the same output port of router A. Router A has to drop one of the packets, because it has no buffers to store packets and only one packet at a time can take that output port. Alternatively, if a deflection-based routing algorithm is used, the losing packet is redirected to any free output port. A deflected packet takes a long time to reach its destination, which degrades the overall performance of the BLESS router design.
3.3 Intelligent space buffers (ISBs) implementation
In this section we detail the implementation of the intelligent space buffers and the associated control unit.
Figure 6. Proposed intelligent space buffers.
Figure 6 illustrates the conventional buffers replaced by a stack of buffers placed outside the router. When the signal from the decision and control unit is low, the buffers are in power-down mode. In case of congestion, the buffers are activated and hold the data bits, and they remain active until the congestion is alleviated. This implementation enables buffer-less routers to perform well at high injection rates. The control unit is the heart of the ISBs and is discussed in the next section.
3.4 Control unit implementation
The control unit switches the buffers between power-down mode and active mode, activating them during congestion. A single control unit is responsible for activating all the space buffers shown in figure 6. As illustrated in figure 7, the control unit contains a counter that counts the number of flits or packets flowing on a particular link. For simplicity only one link is shown, but in a practical implementation the control unit will control two links. A comparator compares the count from the counter with a predetermined stored value P. If the count exceeds this threshold, the decision and control unit sends the activate signal to the respective buffers. In addition, the control unit sends the switching signals to sw1 and sw2, so that all traffic from port A to port B traverses the buffer unit. The overhead of the control unit is negligible compared with the power saving.
Figure 7. Proposed control unit implementation for ISBs
Figure 8. Proposed algorithm implemented at control unit of ISB architecture
Figure 8 illustrates the detailed algorithm to be implemented in the control unit. The main open issue is how to determine the threshold value. Another is how much buffer space to allocate to each channel in case of congestion. We have assumed 80% for the prototype, but this still needs refinement.
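The counter, comparator, and decision steps of the control unit can be illustrated with a small behavioral model. This is a sketch under stated assumptions rather than the algorithm of figure 8: the class ControlUnit, the idea of evaluating the count once per fixed counting window, and the "buffer"/"bypass" encoding of the sw1 and sw2 signals are all hypothetical.

```python
class ControlUnit:
    """Counts flits on one link and compares the count with threshold P."""

    def __init__(self, threshold_p):
        self.threshold_p = threshold_p  # predetermined stored value P
        self.count = 0                  # flits seen in the current window
        self.buffers_active = False     # False means power-down mode

    def observe_flit(self):
        """Counter: called once for every flit that crosses the link."""
        self.count += 1

    def end_of_window(self):
        """Comparator and decision step (assumed to run once per window)."""
        self.buffers_active = self.count > self.threshold_p
        self.count = 0  # start counting the next window from zero
        # sw1 and sw2 steer traffic through the buffers only when active.
        path = "buffer" if self.buffers_active else "bypass"
        return {"activate": self.buffers_active, "sw1": path, "sw2": path}


unit = ControlUnit(threshold_p=3)
for _ in range(5):          # congested window: 5 flits, above P = 3
    unit.observe_flit()
congested = unit.end_of_window()
for _ in range(2):          # quiet window: 2 flits, below P
    unit.observe_flit()
quiet = unit.end_of_window()
print(congested, quiet)
```

The buffers are activated after the congested window and powered down again after the quiet one. How P and the window length should be chosen is exactly the open question raised above.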
3.5 Dynamic space buffers in bidirectional links
The proposed ISB architecture can be further optimized by using bidirectional links [6]. Figure 9 illustrates the behavior of the links when traffic in one direction dominates the other. In figure 9(b), R1 (router 1) configures both channels as outputs when traffic from R1 to R2 exceeds traffic from R2 to R1. Figure 9(c) illustrates the opposite scenario, in which traffic from R2 to R1 dominates.
The block diagram in figure 10 illustrates the bidirectional channel between routers A and B. Introducing bidirectional links can improve performance at high injection rates [6]. There is also scope for power reduction in our design from using bidirectional channels instead of unidirectional ones. The algorithm at the router interface works in a similar fashion to the one described in [6].
Figure 9. (a) Conventional unidirectional link between routers R1 and R2. (b) Reconfigured links for congestion from R1 to R2 router.
Figure 10. Bidirectional links implemented in ISBs
Suppose that a router needs 2 ns to process a packet, and that router A sends a packet to router B at 1 ns, followed by a second packet on the same port interface at 2 ns. Router B cannot accept a new request before 3 ns, so it will drop the second packet. Instead, we can use the incoming channel from router B to router A at the same port, if it is free. Control circuitry is needed to switch the direction of the port. If two or more packets request the same port at 2 ns, the algorithm illustrated in figure 8, running on the control circuitry of the space buffers, starts executing.

3.6 Power gated frame implementation
Figure 11. Proposed pipelined power gating scheme
Power gating suffers from wake-up latency, which impacts performance [10] [11]. We use sleep-mode transistors in the ISBs to optimize performance: 10% of the total transistors are kept in sleep mode and 90% remain completely shut off. When the injection rate at any port is high, the control block redirects the traffic through the buffers. When 8% of the buffers are occupied, 30% of the remaining buffers are triggered to wake up, which hides the wake-up latency. As shown in figure 11, when traffic falls below the threshold we can start returning buffers to power-down mode; we have assumed a 10% drop in buffer space when the load decreases below some threshold value. State 5 indicates that at most 90% of the buffers are utilized. Beyond this point, all packets destined for that port are discarded, which prevents the congestion from affecting other ports. The proposed gating scheme can therefore also perform well at high injection rates. Because it overcomes the wake-up latency, it offers higher performance than conventional power gating. Because unused buffers are kept in power-down mode, which is a complete shut-down, static power dissipation is lower in the pipelined power gating scheme.
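One possible reading of this staged wake-up policy is sketched below as a single decision function. Several details are assumptions rather than part of the proposal: a pool of 100 slots, interpreting "8% occupied" as 80% of the slots that are currently awake, waking 30% of the slots that are still shut off, and a 40% low-load threshold for powering slots back down. The state numbering of figure 11 is not modeled.

```python
TOTAL = 100        # buffer slots around one port (assumed pool size)
SLEEP_FLOOR = 10   # 10% of slots are always kept in quick-wake sleep mode
MAX_USABLE = 90    # at most 90% of slots may be occupied before dropping


def step(ready, occupied):
    """Return (new_ready, drop) for one control decision.

    ready:    slots currently awake or in sleep mode (not fully shut off)
    occupied: slots currently holding flits
    """
    if occupied >= MAX_USABLE:
        return ready, True                   # discard packets for this port
    if occupied * 10 >= ready * 8:           # 80% of the ready slots are in use
        wake = (TOTAL - ready) * 30 // 100   # wake 30% of the shut-off slots
        return min(ready + wake, MAX_USABLE), False
    if occupied * 10 < ready * 4:            # assumed low-load threshold (40%)
        keep = max(SLEEP_FLOOR, occupied)    # never power down a slot in use
        return max(ready - 10, keep), False  # return 10% to power-down mode
    return ready, False


print(step(10, 8))    # high load on the initial 10 slots
print(step(37, 10))   # load has fallen again
print(step(90, 90))   # pool exhausted
```

Waking slots before the awake ones are completely full is what hides the wake-up latency: the newly powered slots have time to become usable while the last free slots fill.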
The pipelined power gating scheme is easy to implement and promising in terms of both power and performance. The exact performance gain can only be calculated after simulation. We estimate a saving of more than 5 clock cycles, since a saving of 5 clock cycles is reported in [11] and pipelined power gating can improve on this further.
4. DESIGN COMPLEXITY
The proposed ISB architecture is not an area-efficient design, because we dynamically control links as well as buffers, and the control circuitry may take a large percentage of the area. Another issue is the predetermined threshold value used in the control unit; the proposed design needs to be re-evaluated under real-time traffic. A learning mechanism could be used to set the threshold, but the area constraint is the major issue that must be addressed for ISBs to succeed.
5. FUTURE WORK
Although ISBs are an appealing design for their balance of power and performance, a large design space spans the gap between traditional and ISB architectures. First, an area-efficient design for the ISB NOC architecture is not discussed in this paper. Second, the permutation and priority schemes to be implemented in the control block in case of congestion remain to be defined. Deadlock may also become a problem for ISBs because new buffers are introduced. Finally, flow control is currently implemented with a counter, which can be improved to make ISBs more efficient in both performance and power.
6. CONCLUSION
In this paper we propose a novel architecture to counter performance and power issues in NOCs. ISBs use buffer-less routers and bidirectional links to achieve significant power savings. To counter the performance issue, we provide self-configured intelligent space buffers. The architecture has not yet been evaluated by simulation because of time constraints. We hope that the proposed architecture will inspire new ideas for work on NOCs.
7. REFERENCES
[1] Y. Hoskote, S. Vangal, A. Singh, N. Borkar, and S. Borkar, "A 5-GHz mesh interconnect for a teraflops processor," IEEE Micro, 27(5), 2007.
[2] M. B. Taylor et al., "Evaluation of the Raw microprocessor: An exposed-wire-delay architecture for ILP and streams," ACM SIGARCH Computer Architecture News, Vol. 32, No. 2, IEEE Computer Society, 2004.
[3] T. Moscibroda and O. Mutlu, "A case for bufferless routing in on-chip networks," ACM SIGARCH Computer Architecture News, Vol. 37, No. 3, ACM, 2009.
[4] K. Latif, T. Seceleanu, and H. Tenhunen, "Power and area efficient design of network-on-chip router through utilization of idle buffers," Engineering of Computer Based Systems (ECBS), 2010 17th IEEE International Conference and Workshops on, IEEE, 2010.
[5] A. K. Kodi, A. Sarathy, and A. Louri, "iDEAL: Inter-router dual-function energy and area-efficient links for network-on-chip (NoC) architectures," ACM SIGARCH Computer Architecture News, Vol. 36, No. 3, IEEE Computer Society, 2008.
[6] Y.C. Lan, S.H. Lo, Y.C. Lin, Y.H. Hu, and S.J. Chen, "BiNoC: A Bidirectional NoC Architecture with Dynamic Self-Reconfigurable Channel," in Proc. of the 3rd ACM/IEEE International Symposium on Networks-on-Chip, pp. 266-275, 2009.
[7] H. Wang, L.-S. Peh, and S. Malik, "Power driven design of router microarchitectures in on-chip networks," Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 105-116, 2003.
[8] X. Chen and L.-S. Peh, "Leakage power modeling and optimization of interconnection networks," Proceedings of International Symposium on Low Power Electronics and Design, pp. 90-95, 2003.
[9] T. T. Ye, L. Benini, and G. De Micheli, "Analysis of power consumption on switch fabrics in network routers," Proceedings of the 39th Design Automation Conference (DAC), pp. 524-529, 2002.
[10] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose, "Microarchitectural techniques for power gating of execution units," in International Symposium on Low Power Electronics and Design (ISLPED), CA, USA, pp. 32-37, 2004.
[11] H. Matsutani, M. Koibuchi, D. Wang, and H. Amano, "Run-time power gating of on-chip routers using look-ahead routing," in 13th Asia and South Pacific Design Automation Conference (ASP-DAC), Piscataway, NJ, USA, pp. 55-60, 2008.
