
  1. Latency Reduction of Selected Data Streams in Network-on-Chips for Adaptive Manycore Systems

Thilo Pionteck, Christoph Osterloh
Institute of Computer Engineering, Universität zu Lübeck, 23538 Lübeck, Germany
Email: {pionteck, osterloh}

Carsten Albrecht
Dräger Medical GmbH, 23558 Lübeck, Germany
Email:

978-1-4244-8971-8/10$26.00 © 2010 IEEE

Abstract: This paper reviews Network-on-Chip architectures with prioritization of selected data streams targeting runtime reconfigurable manycore systems. The common idea of these architectures is to minimize the latency of selected packet transmissions by either bypassing or parallelizing processing stages in routers or by using dedicated links bypassing complete routers. Potential classes of selected data streams are latency-critical messages, e.g. cache accesses in multiprocessor systems, or systems with semi-static data streams, i.e. systems in which the same components continuously exchange data for a longer period. The review categorizes the diverse architectures and evaluates their pros and cons in terms of universality, hardware efficiency and support of changing traffic patterns.

I. INTRODUCTION

With the emergence of manycore systems and the increased need for scalable global on-chip communication architectures, Network-on-Chips (NoCs) are becoming the dominant communication architecture for complex system designs. Compared to shared buses and point-to-point connections, NoCs feature high scalability, high throughput and cost efficiency in terms of area and power [1]. The main drawback of NoCs is that packets have to pass several routers along their path. At each router, packets compete for router resources while going through a complex processing pipeline [2]. Depending on the number of hops, this results in a significant communication latency, limiting the overall system performance. If a large number of consecutive packets follow the same path, some internal router processing steps are not even required, as, for example, the routing decision will be the same for all packets. If these processing steps are left out, latency as well as power consumption can be reduced.

A way to minimize the number of hops is to customize the network topology according to the communication characteristics of the processing elements for a given application scenario. Yet, this contradicts the idea of a universal communication architecture, increases the design effort, and is not applicable to systems with a wide range of application scenarios. This approach also implies that type and location of processing cores are fixed during the lifetime of a system. For runtime reconfigurable systems, where processing cores can be exchanged at runtime, this assumption does not hold. Such systems show changing communication patterns during system lifetime and ask for regular communication architectures.

If the number of hops cannot be reduced, communication latency can still be lowered by reducing the latency of individual routers. Appropriate techniques are the speculative execution of router pipeline stages in parallel [3], [4] and the pre-computation of routing decisions using look-ahead schemes [5], [6], [7]. End-to-end latency can also be reduced by using adaptive routing schemes, which allow packets to bypass nodes with high congestion. The work presented in [8] describes such a NoC in combination with the ability to bypass the router pipeline. Common disadvantages of these approaches are their increased hardware effort and a latency unsuitable for latency-critical messages.

Based on the observation that only a certain amount of messages are latency-critical or show semi-static characteristics, it is favorable to prioritize only these kinds of data. In [9], the composition and amount of latency-critical messages in shared-memory chip multi-processor (CMP) systems are analyzed. The authors identify protocol requests, acknowledgment packets and critical word packets in read and write transactions as the main categories of latency-critical messages. It is also shown that data traffic exhibits a strong temporal and spatial locality and accounts for about 18% of the overall traffic. Examples of semi-static data streams are given in [10]. Here, implementations of wireless communication standards are analyzed with regard to their inter-module communication characteristics. The authors show that for long periods subsequent data items of a stream follow the same route and have periodic behavior.

The aim of this paper is to provide an overview of diverse approaches for prioritizing latency-critical or semi-static data streams in NoC-based communication architectures. The focus is on NoC architectures suitable for runtime reconfigurable manycore systems. Irregular NoC architectures such as [11], [12] are not considered, nor are NoCs based on non-mesh topologies such as [13]. The former are restricted to certain communication patterns while the latter ask for high-radix routers, significantly increasing the area footprint.

Criteria for categorizing NoC architectures with latency reduction techniques are given in section II. Among these criteria, the effect of a prioritization is chosen for categorizing the NoC architectures: per end-to-end connection (section III), per router (section IV), or per path segment (section V). A discussion of pros and cons of the architectures with regard to
  2. the requirements of runtime reconfigurable manycore systems is given in section VI.

II. DESIGN SPACE

There is a huge design space for latency reduction techniques within NoCs. Focusing on NoCs suitable for adaptive manycore systems and applying the design constraints discussed in the previous section, four features that mainly distinguish the architectures are identified: effect of prioritization, switching technique, kind of decision making and link type.

Effect of prioritization: The aim of all presented NoCs is to minimize the core-to-core latency for packets. Yet, the architectures differ in the effect of their prioritization efforts. Prioritization can be applied once for a core-to-core connection, per router or per path segment. When prioritization is done on the basis of core-to-core connections, it is done only once, at the packet's entry point to the communication network. This prioritization is sustained until the packet reaches its final destination. With this technique, packets normally bypass the standard NoC infrastructure, which results in the highest potential for latency reduction. This is in contrast to prioritization on a per-hop basis. Here, prioritization has to be redetermined at every hop, which causes time overhead and thus reduces the achievable latency reduction. In between these two extremes is prioritization on a per path segment basis. Here, packets travel along prioritized path segments. These path segments may not cover the complete route from source to sink. Prioritization decisions have to be made for each path segment individually. As these three prioritization methods are mutually exclusive, they are chosen for categorizing the NoC architectures reviewed in this paper.

Switching technique: For transferring prioritized packets, either packet switching or circuit switching techniques are applied. Packet switching in this context means that a packet travels along several routers from source to destination. At each router, routing decisions and arbitration of network resources have to be done. This is in contrast to circuit switching, where a dedicated path from source to destination exists. Note that within this work, even for circuit-switched transmissions the data stream is divided into packets. In general, this is done for comparability with the non-prioritized traffic.

Decision making: This criterion determines whether routing for the prioritized packets is done deterministically or speculatively. The former means that the routing decision is guaranteed to be correct, which is not the case for the latter. Speculative forwarding also carries the danger of generating dead flits, i.e. flits which are sent on a wrong link and which have to be deleted.

Link type: The fast path to be used by prioritized packets can either be physical or logical. A logical fast path is mapped on top of the physical network. The packets follow the same route as non-prioritized packets; only parts of the router are bypassed. Physical fast paths are dedicated connections implemented in silicon.

Typical parameter combinations of latency-optimized NoCs are presented in figure 1. Primary classification has been done according to the effect of prioritization. In addition, figure 1 shows the parameters of the individual NoCs presented in the next sections.

[Fig. 1. NoC prioritization techniques design space. Columns: prioritization (core-to-core, per router, per path segment), switching technique (packet-switched, circuit-switched), decision making (speculative, deterministic), link type (physical, logical), examples (core-to-core: [14], [15], [16]; per router: [17], [18], [19]; per path segment: [2], [20], [21], [22]).]

III. CORE-TO-CORE PRIORITIZATION

This section comprises NoC architectures featuring core-to-core optimization. For the fast path, they all provide direct connections from source to sink, either by using a bus [14], by providing dedicated long-range links [15] or by forming a logical topology on top of the physical topology [16]. As the first two NoC designs provide additional physical connectivity, they are sub-categorized as heterogeneous NoCs, while the last one is sub-categorized as a logical one.

A. Heterogeneous NoCs

The main characteristic of both architectures presented here is that they employ a standard regular grid-shaped NoC for most of the traffic. Long-distance or latency-sensitive traffic is bypassed using the additional communication infrastructure. An architectural view of both NoCs is given in figure 2.

A combination of a NoC with a shared bus is proposed in [14]. The bus enhanced NoC (BENoC), as shown in figure 2(a), is composed of a packet-switched grid-shaped NoC and a low-latency, low-bandwidth bus. The bus is used for global, latency-critical control signals, provides broadcast as well as multicast capabilities, and can also be used for the configuration and management of the NoC. High-throughput data communication between cores is handled by the NoC. Performance is evaluated by means of a dynamic non-uniform cache access architecture (DNUCA) multi-processor system consisting of 16 processors and 64 L2 cache tiles. The authors show that BENoC facilitates an average system speedup of about 300% for several benchmarks compared to a pure NoC-based communication infrastructure. For the test system, the area requirement of the bus is less than 0.1% of the die area for a 0.18 µm process. Area numbers for the NoC are not given.

Another approach is followed by [15]. Here, long-range links are inserted on top of a regular mesh network (see figure 2(b)). The long-range links consist of segments of fixed-length links connected by repeaters with buffering capabilities.
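As a rough illustration of how long-range links on top of a regular mesh shorten paths, the following sketch counts the minimum number of links a packet traverses in a small mesh, with and without an extra link. It is not the insertion algorithm of [15]; the function name `mesh_hops` and its parameters are hypothetical, links are assumed to be bidirectional, and each long-range link is counted as a single hop although it physically consists of several repeated segments. Deadlock constraints are ignored.

```python
from collections import deque


def mesh_hops(width, height, src, dst, long_links=()):
    """Minimum number of links traversed from src to dst in a width x height mesh.

    long_links is an iterable of ((x1, y1), (x2, y2)) pairs describing
    hypothetical bidirectional long-range links; each counts as one hop.
    """
    extra = {}
    for a, b in long_links:
        extra.setdefault(a, []).append(b)
        extra.setdefault(b, []).append(a)

    seen = {src}
    queue = deque([(src, 0)])
    while queue:  # breadth-first search yields the shortest hop count
        (x, y), hops = queue.popleft()
        if (x, y) == dst:
            return hops
        neighbours = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
        neighbours += extra.get((x, y), [])
        for nx, ny in neighbours:
            if 0 <= nx < width and 0 <= ny < height and (nx, ny) not in seen:
                seen.add((nx, ny))
                queue.append(((nx, ny), hops + 1))
    return None  # unreachable; cannot happen in a connected mesh


if __name__ == "__main__":
    plain = mesh_hops(4, 4, (0, 0), (3, 3))
    shortcut = mesh_hops(4, 4, (0, 0), (3, 3), long_links=[((0, 0), (3, 2))])
    print(plain, shortcut)
```

In this toy 4 × 4 mesh, a single assumed link between nodes (0, 0) and (3, 2) reduces the corner-to-corner path from six links to two. The count says nothing about cycles per link, which depends on the repeater stages.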
  3. Determination of number, length, and location of the long-range links is done at design time for a given application. The long-range links are used for any kind of traffic as long as they provide a shorter path and do not cause deadlocks. The area overhead caused by the long-range links is about 10% for a 4 × 4 mesh with 4 long-range links using a Xilinx Virtex-II FPGA as hardware platform. Overall energy consumption increases by about 1%. For performance evaluation, the critical traffic workload, the number of injected packets per cycle at which packet delivery rate rises abruptly, is considered. Results for a 4 × 4 auto industry benchmark and a 5 × 5 telecom benchmark show that the critical traffic workload is increased by 13.6% and 36.3%, respectively. The application-specific insertion of long-range links at design time hampers the usage of this NoC architecture for systems with unforeseeable communication patterns, i.e. adaptive manycore systems. As the underlying communication architecture is still a regular mesh NoC, this work is nevertheless considered within this paper.

[Fig. 2. Heterogeneous NoC architectures: (a) BENoC [14], (b) long-range links [15]. Legend: processing core, NoC switch, NoC switch with extra port, repeater, NoC link, bus, long-range link.]

B. Logical Topology

A NoC design with a reconfigurable, circuit-switched logical topology called ReNoC (Reconfigurable NoC) is presented in [16]. Conventional NoC routers are wrapped by topology switches which form a configurable layer between routers and links. These topology switches can either be configured to connect a link to a router port or to directly connect two links with each other, bypassing the router. Thus, it is possible to form logical long links between two cores, two routers, or between a processing core and a router. The logical links are created on top of a static NoC topology and form a logical topology with a combination of circuit-switched and packet-switched elements. The topology switches are configured according to the communication needs of the currently running application. Details about the configuration process are not given. For a video object decoder application, the authors show that ReNoC facilitates a decrease in power consumption of about 56% compared to a static mesh topology. The topology switches lead to an area increase of about 10%.

IV. PER ROUTER PRIORITIZATION

Prioritization on a per-hop basis is done by providing prioritized access to router resources [17] or by speculative forwarding and execution of router pipeline stages [18], [19]. A summary of the main characteristics of these NoCs is given in table I. Note that latency values are given per router while overhead values are given for complete test systems.

A. Prioritized Access to Router Resources

A dynamic path management scheme on router level is proposed in [17]. The idea is that flits arriving on a frequently used input/output link combination are prioritized over other traffic during switch allocation. The input/output pair forms a fast path and a virtual channel (VC) is dedicated to the fast flits. Fast paths are determined locally by collecting statistics of intra-router transfer patterns. The router pipeline for flits travelling along a fast path can be reduced further by sending the switch arbitration request to the next node before the flit actually enters the node. To this end, the switch allocation request is sent to the next node while the flit traverses the crossbar. Performance is evaluated using a CMP architecture and executing applications from the SPLASH benchmark. Network latency is reduced by up to 30% and power consumption is reduced by about 2.5% at the cost of an area increase of 1.34%.

B. Speculative Forwarding

Another method to reduce the router pipeline latency is presented in [18]. For each idle input link, the output link that will be used by the next packet transfer is predicted and switch arbitration is speculatively completed. If the prediction hits, the routing computation and switch arbitration stages of the router pipeline are bypassed and switch traversal is completed in a single cycle. Otherwise, packets are transferred to the original router pipeline without any additional latency overhead. Wrongly forwarded flits are masked in the output channel. For prediction, several adaptive and static schemes were proposed and analyzed according to their hit rate. In order to facilitate a better adaptation to different traffic characteristics, it is also possible to implement several prediction schemes for one input link in parallel. The prediction scheme to be used is selected by choosing the one with the highest hit rate over a given time period. In three case studies, the latency is reduced in the range of 30.7% to 48.2% at the cost of an increase in area and power in the range of 6.4% to 15.9% and 8.0% to 10.5%, respectively.

A combination of speculative forwarding and setting up of preferred paths is presented in [19]. The authors adapt the "mad-postman" technique and speculatively forward flits to a pre-configured output, bypassing the router logic. Technically, this is realized by connecting all inputs directly with each output via tri-state buffers. If an output has a preferred input, the corresponding tri-state buffer is preselected and all flits arriving at that input are forwarded. In order to detect mistakenly forwarded flits, incoming flits are also analyzed as to whether the last forwarding led the flit closer to its destination. If this is not the case, the flit is stored in a FIFO and transferred using the standard routing functionality. Mistakenly forwarded flits will be identified as dead flits at the first router not being part of the preferred path and are deleted. Preferred paths are set up by single-flit packets and can be changed at runtime. The preferred path latency is a function of the number of hops, of
  4. the delay of the tri-state buffers and of the links. Area overhead is given as about 13% for a whole test chip.

TABLE I
MAIN CHARACTERISTICS OF NOCS WITH PER HOP PRIORITIZATION

Ref.  No load latency  No load latency            Power           Area           Path           Speculative  Dead
      (standard)       (prioritized)              overhead        overhead       determination  pipeline     flits
[17]  2 clock cycles   1 clock cycle              -2.5%           1.34%          automatic      yes          no
[18]  3 clock cycles   1 clock cycle              8.0% to 10.5%   6.4% to 15.9%  automatic      yes          yes
[19]  1 clock cycle    delay of tri-state buffer  not given       13%            manual         no           yes

V. PRIORITIZATION PER PATH SEGMENT

NoC architectures prioritizing packets along path segments spanning several hops are presented in this section. This kind of prioritization is normally done by selecting dedicated VCs, as realized in the NoC designs of [2], [20], [21]. Along the path segments, the prioritized packets bypass router pipeline stages, which results in a reduced latency. A combination of path segment prioritization and a virtual, circuit-switched network topology is presented in [22]. This network creates paths for prioritized packets which may lead from core to core. Thus, this NoC could have been categorized as a core-to-core prioritization architecture. Yet, as the length of the prioritized path is not guaranteed, it is categorized as a per path segment optimizing NoC. A summary of the main characteristics of all NoC architectures from this section is given in table II.

A. Virtual Links assigned to VCs

The concept of Express Virtual Channels (EVCs) is presented in [2]. At any router port the set of VCs is partitioned between normal VCs (NVCs) and EVCs. EVCs provide virtual express lanes in the network which are used to bypass intermediate routers by skipping the router pipeline. EVCs are restricted to connect routers only along a single dimension and are not allowed to turn. Focusing on dynamic EVCs, each router can act as a source/sink of EVCs or as a bypass node. The length of each EVC can be configured in advance, allowing dynamic adaptation to different traffic patterns. Packets normally try to acquire the longest possible EVC along their path. In case of high contention for a particular EVC, shorter EVCs can be chosen. If all possible EVCs are occupied, NVCs are used. While virtual express lanes are mapped on top of a regular mesh topology and do not require extra wires, extra control lines are needed for flow control between sinks and sources of individual EVCs. For the SPLASH benchmark, the authors show a latency reduction of 84%, a power reduction of 38% and a throughput improvement of 23%.

An improved version of EVCs [2] based on a hybrid interconnect called NOCHI is given in [20]. The EVC network is supplemented by a control plane comprised of global lines spanning all nodes in a row or column. The global lines are used for exchanging broadcast control information and flow control messages and replace the dedicated point-to-point control wires required in the initial EVC design. They are based on capacitive feed-forward circuits and are extended for one-cycle multi-broadcast abilities and collision detection with node quantity determination. The original EVC flow control mechanism is extended to allow a flexible binding of EVCs of arbitrary length to a node and to a more advanced buffer signaling. Compared to the original EVC design, an additional 44% improvement in latency under heavy load and a reduction of power of up to 8.2% is achieved.

Another approach for setting up direct virtual links at runtime is given in [21]. Based on the current NoC state, virtual point-to-point (VIP) paths are created, allowing packets to bypass the pipeline of intermediate routers. Packets travel along VIPs by using a dedicated VC, for which each router is pre-configured to forward the packet to a designated output. Each router port can be used by at most one VIP connection. In combination with prioritizing VIP packets over normal NoC traffic, this ensures that VIP connections cannot encounter busy channels along their path. VIPs are set up using a simple, small bit-width setup network controlled by a root node. This network is also used to collect the monitoring data of each router. Periodically, the root node checks whether an adaptation of the VIP paths is required and manages the tearing down of old and setting up of new VIPs. Evaluation is done using a multicore SoC with different benchmarks running on the same cores. Results show an average latency reduction of 44% and a power reduction of 17% compared to a conventional NoC.

B. Spatial Division Multiplexing

A combination of a packet-switched NoC and a circuit-switched NoC is proposed in [22]. Using spatial-division multiplexing, network resources are split between a packet-switching sub-network (Pnet) and a circuit-switched sub-network (Cnet). Configuration of the Cnet is done by a light-weight setup network called Snet. Flits arriving at the Pnet are processed in the same way as in standard packet-switched NoCs. The only difference is during the routing computation stage. If the Cnet part of the physical output link is free, the flit is moved to the Cnet. The Snet is used to build the longest possible direct link to the destination node. Flits traveling along the Cnet bypass the router pipeline and are sent in a pipelined fashion to the destination node. At the destination node, the flits are either transferred to the local core or are handled in the same way as flits arriving at the Pnet. For synthetic traffic patterns, the authors showed latency and power reductions of 45% and 22%, respectively. An area overhead of less than 10% is mainly caused by the Snet.

VI. DISCUSSION

For runtime reconfigurable manycore systems, not only the standard NoC design parameters such as throughput, latency, power and area requirements are relevant. Key properties having high impact on these parameters are the adaptability
  5. to diverse traffic patterns, the selective prioritization of certain data streams, and implementation issues. Table III summarizes these relevant parameters for the NoC designs presented. Concerning implementation issues, NoC designs for runtime reconfigurable manycore systems are often restricted by the hardware structure of, and design tool limitations for, Xilinx Virtex FPGAs. Apart from a few designs implemented on ASIC-style runtime reconfigurable platforms, these FPGAs form the basis for most runtime reconfigurable systems. Thus, NoC designs have to deal with their restrictions and limitations. Column "technology requirements" of table III lists the hardware requirements of NoC architectures [19] and [20] which hamper a direct realization on the Virtex FPGA platform. Implementation issues that complicate but do not prevent an FPGA realization are given in column "layout anomalies" of table III. These restrictions are caused by the fact that for runtime reconfigurable FPGA designs, a homogeneous and regular system layout is desirable. Even though it is possible, routing signals through regions to be reconfigured is not advisable. As a result, additional, non-uniform control wires between nodes [2], physical fast paths [14], [15] or additional control networks [20], [21], [22] hamper a smooth design flow.

TABLE II
MAIN CHARACTERISTICS OF NOCS WITH PRIORITIZATION OF PATH SEGMENTS

Ref.  No load latency  No load latency  Power      Area             Type of virtual                   Configuration
      (standard)       (prioritized)    reduction  overhead         connection
[2]   4 clock cycles   2 clock cycles   38%        control lines    arbitrary nodes in one dimension  design time
[20]  4 clock cycles   2 clock cycles   44%        control network  arbitrary nodes in one dimension  runtime
[21]  5 clock cycles   2 clock cycles   18%        2%               core to core                      runtime
[22]  5 clock cycles   wire delay       22%        < 10%            core/node to node/core            runtime

Another important design feature for runtime reconfigurable manycore systems is the ability to prioritize selected data streams. Column "prioritization of packets" of table III specifies whether prioritization is selectable or fixed for all packets. As pointed out in the introduction, a universal prioritization neglects the fact that often only a small amount of data is latency-sensitive [9]. NoCs that prioritize all packets along a route are well suited for semi-static data streams, yet small control messages might even be delayed if they do not follow the main route. This is especially true for NoC architectures with speculative forwarding such as [17], [18]. Whether or not a NoC guarantees to prioritize a selected data stream is given in column "guaranteed prioritization" of table III.

With regard to latency, all NoC architectures achieve significant improvements. Yet, they feature significant differences in energy consumption and area overhead. In general, NoCs providing core-to-core prioritization tend to show the highest area increase compared to standard mesh-topology NoC designs. This is a result of the additional physical communication structure such as a bus [14] or long-range links [15], or due to the additional logic for setting up virtual topologies [16]. The main advantage of these architectures is their near-optimal communication latency for prioritized data streams. The additional communication infrastructure bypasses the router pipeline at each hop and reduces the communication delay to either the wire delay between nodes [14], [16], or to one clock cycle per hop [15]. In combination with reduced latency, bypassing of routers also leads to energy savings. As the additional physical communication links exist in parallel to the standard NoC, system throughput is increased, too. Another advantage of this NoC category is that latency-sensitive traffic is guaranteed to be prioritized. There is no speculation involved as in the link prioritization architectures presented in subsection B of section IV. When focusing on runtime reconfigurable manycore architectures, a drawback of this NoC category is its limited flexibility. Apart from [16], these architectures are limited either by the message length to be transmitted along the additional infrastructure or by node locations. The design of [14] is optimized to transfer short control messages along the additional network and, thus, is not appropriate for data-intensive semi-static data streams. Manual insertion of long links at design time restricts the placement of runtime exchangeable processing cores to certain locations if they are to make use of these links [15].

Concerning adaptivity to changing traffic patterns, NoC architectures prioritizing on a per hop basis or per segment basis provide a flexible option. The only exception is [2], where the architecture requires dedicated control lines for connecting source and sink of virtual links. Thus, virtual connections have to be defined at design time, which reduces system flexibility in the same way as the manual insertion of long links in [15]. Yet, the extended version of [2] presented in [20] circumvents this limitation. With the exception of [19], prioritization on a per router basis requires neither dedicated control lines nor additional physical bypasses. As a result, these architectures are universally applicable. Yet, they suffer from reduced optimization potential. Flits always have to pass at least some stages of the router pipeline at each hop, which limits the achievable latency reduction. In order to reduce pipeline depth and, thus, the latency as much as possible, often complex speculative pipeline structures are used [17], [18]. This comes at the cost of area as well as power efficiency.

Concerning energy efficiency, the approaches of [17], [18] are problematic. Both designs speculatively forward flits and have to check afterwards whether this was correct or not. This increases switching activity and in the case of [18] may also lead to congestion. In addition, the speculation failure rate becomes high with increasing network traffic [2]. For runtime reconfigurable systems, the self-adaptive approach of [17], [18] is favorable. These architectures do not require any configuration for prioritizing frequently chosen connections. Yet, the required configuration of routers in [19] has the advantages that a settling phase is avoided and that the communication network can better be adapted to latency-critical data flows.
  6. TABLE III
CHARACTERISTICS OF PRESENTED NOC ARCHITECTURES FOR RUNTIME RECONFIGURABLE MANYCORE DESIGNS

Ref.  Kind of prioritization                     Prioritization         Guaranteed      Technology                        Layout anomalies               Remarks
                                                 of packets             prioritization  requirements
[14]  additional bus                             selectable             yes             -                                 bus in tree topology           bus can handle low-bandwidth only
[15]  physical long links                        destination dependent  yes             -                                 long-links disturb regularity  flexibility restricted by fixed long links
[16]  circuit-switched logical links             fixed                  yes             -                                 -                              configurable virtual topology
[17]  reduced pipeline                           fixed                  no              -                                 -                              -
[18]  reduced pipeline + speculative forwarding  fixed                  no              -                                 -                              -
[19]  reduced pipeline + speculative forwarding  fixed                  no              tri-state buffers                 -                              configurable paths
[2]   virtual express channels                   fixed                  no              -                                 extra control lines            virtual express channels in one dimension only
[20]  virtual express channels                   fixed                  no              capacitive feed-forward circuits  control network                virtual express channels in one dimension only
[21]  virtual point-to-point links               selectable             yes             -                                 setup network                  -
[22]  circuit-switched logical links             fixed                  yes             -                                 setup network                  -

The VC-based NoCs with prioritization per path segment summarized in subsection A of section V suffer from the same drawback as NoCs prioritizing individual input/output connections per router: flits have to pass at least some router pipeline stages, lowering the achievable latency reduction. The approach of [2] and its extension in [20] also limit virtual links to one dimension. If source and sink are not located in the same dimension, flits have to pass the full router pipeline at least three times. In contrast, [21] allows setting up virtual connections between cores or routers directly. With regard to virtual connections, the approach of [16] is the most flexible one, as a complete virtual topology can be generated on top of the physical network. While remaining configurable, this architecture enables the design of an application-specific infrastructure and, thus, is well suited for runtime reconfigurable systems. The only drawback of this approach is that configuration affects an entire link. If a core sends data to two different destinations, a direct point-to-point connection cannot be set up.

VII. ACKNOWLEDGEMENT

This work was funded in part by the German Research Foundation (DFG) within priority programme 1148 under grant reference Ma 1412/5.

REFERENCES

[1] E. Bolotin, I. Cidon, R. Ginosar, and A. Kolodny, "Cost considerations in network on chip," Integration, the VLSI Journal, vol. 38, no. 1, pp. 19–42, 2004.
[2] A. Kumar, L.-S. Peh, P. Kundu, and N. K. Jha, "Express Virtual Channels: Towards the Ideal Interconnection Fabric," in 34th Int. Symp. on Computer Architecture, 2007, pp. 150–161.
[3] R. Mullins, A. West, and S. Moore, "Low-Latency Virtual-Channel Routers for On-Chip Networks," in 31st Int. Symp. on Computer Architecture, 2004, pp. 188–197.
[4] L.-S. Peh and W. Dally, "A delay model and speculative architecture for pipelined routers," in 7th Int. Symp. on High-Performance Computer Architecture (HPCA), 2001, pp. 255–266.
[5] K. Kim, S.-J. Lee, K. Lee, and H.-J. Yoo, "An Arbitration Look-Ahead Scheme for Reducing End-to-End Latency in Networks on Chip," in IEEE Int. Symp. on Circuits and Systems (ISCAS), 2005, pp. 2357–2360.
[6] A. Kodi, A. Louri, and J. Wang, "Design of energy-efficient channel buffers with router bypassing for network-on-chips (NoCs)," in Quality of Electronic Design (ISQED), 2009, pp. 826–832.
[7] L. Xin and C.-s. Choy, "A Low-latency NoC Router with Lookahead Bypass," in IEEE Int. Symp. on Circuits and Systems (ISCAS), 2010, pp. 3981–3984.
[8] A. Kumar, L.-S. Peh, and N. Jha, "Token Flow Control," in 41st IEEE/ACM Int. Symp. on Microarchitecture (MICRO-41), 2008, pp. 342–353.
[9] Z. Li, J. Wu, L. Shang, R. Dick, and Y. Sun, "Latency Criticality Aware On-Chip Communication," in Design, Automation & Test in Europe Conference (DATE), 2009, pp. 1052–1057.
[10] P. T. Wolkotte, G. J. Smit, Rauwerda, and L. T. Smit, "An Energy-Efficient Reconfigurable Circuit-Switched Network-on-Chip," in 19th Int. Parallel and Distributed Processing Symp., 2005, pp. 155a–155a.
[11] J. Chan and S. Parameswaran, "NoCOUT: NoC Topology Generation with Mixed Packet-switched and Point-to-Point Networks," in Asia and South Pacific Design Automation Conference, 2008, pp. 265–270.
[12] B. Grot, J. Hestness, S. Keckler, and O. Mutlu, "Express Cube Topologies for on-Chip Interconnects," in IEEE 15th Int. Symp. on High Performance Computer Architecture (HPCA), 2009, pp. 163–174.
[13] J. Kim, J. Balfour, and W. Dally, "Flattened Butterfly Topology for On-Chip Networks," in 40th IEEE/ACM Int. Symp. on Microarchitecture (MICRO), 2007, pp. 172–182.
[14] R. Manevich, I. Walter, I. Cidon, and A. Kolodny, "Best of Both Worlds: A Bus Enhanced NoC (BENoC)," in 3rd ACM/IEEE Int. Symp. on Networks-on-Chip (NoCS), 2009, pp. 173–182.
[15] U. Ogras and R. Marculescu, "Application-Specific Network-on-Chip Architecture Customization via Long-Range Link Insertion," in Int. Conf. on Computer-Aided Design (ICCAD), 2005, pp. 246–253.
[16] M. Stensgaard and J. Sparso, "ReNoC: A Network-on-Chip Architecture with Reconfigurable Topology," in Second ACM/IEEE Int. Symp. on Networks-on-Chip (NoCS), 2008, pp. 55–64.
[17] D. Park, R. Das, C. Nicopoulos, J. Kim, N. Vijaykrishnan, R. Iyer, and C. Das, "Design of a Dynamic Priority-Based Fast Path Architecture for On-Chip Interconnects," in 15th IEEE Symp. on High-Performance Interconnects (HOTI), 2007, pp. 15–20.
[18] H. Matsutani, M. Koibuchi, H. Amano, and T. Yoshinaga, "Prediction router: Yet another low latency on-chip router architecture," in IEEE 15th Int. Symp. on High Performance Computer Architecture (HPCA), 2009, pp. 367–378.
[19] G. Michelogiannakis, D. Pnevmatikatos, and M. Katevenis, "Approaching Ideal NoC Latency with Pre-Configured Routes," in First Int. Symp. on Networks-on-Chip (NOCS), 2007, pp. 153–162.
[20] T. Krishna, A. Kumar, P. Chiang, M. Erez, and L.-S. Peh, "NoC with Near-Ideal Express Virtual Channels Using Global-Line Communication," in 16th IEEE Symp. on High Performance Interconnects (HOTI), 2008, pp. 11–20.
[21] M. Modarressi, A. Tavakkol, and H. Sarbazi-Azad, "Virtual Point-to-Point Connections for NoCs," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 29, no. 6, pp. 855–868, 2010.
[22] M. Modarressi, H. Sarbazi-Azad, and M. Arjomand, "A Hybrid Packet-Circuit Switched on-Chip Network Based on SDM," in Design, Automation & Test in Europe Conference (DATE), 2009, pp. 566–569.