An Improved Self-Reconfigurable Interconnection Scheme for a Coarse Grain Reconfigurable Architecture

Muhammad Ali Shami, School of ICT, Royal Institute of Technology, KTH, Stockholm, Sweden. Email: email@example.com
Ahmed Hemani, School of ICT, Royal Institute of Technology, KTH, Stockholm, Sweden. Email: firstname.lastname@example.org

Abstract: An improved dynamic, partial and self-reconfigurable interconnection network (Hybrid-2 network) is presented for the Dynamically Reprogrammable Resource Array (DRRA), which is a Coarse Grain Reconfigurable Architecture (CGRA). To justify the design decision, the Hybrid-2 network implementation is compared against possible implementations using Multiplexer, NoC, Crossbar and the already published Hybrid-1 interconnection network. Results show that the newly presented Hybrid-2 interconnection network takes (1.08x, 0.104x, 0.212x and 0.681x) the area and (1x, 0.037x, 0.026x and 0.107x) the configuration bits of the Multiplexer, NoC, Crossbar and Hybrid-1 implementations respectively. The Hybrid-2 network is also 2.87x and 5.86x faster than the Multiplexer and Hybrid-1 networks.

978-1-4244-8971-8/10$26.00 © 2010 IEEE

I. INTRODUCTION

The flexibility of a reconfigurable architecture comes from a) its ability to reconfigure computational logic and b) its ability to reconfigure the interconnection network that connects the computational logic blocks with each other. The interconnection network, in any Coarse Grain Reconfigurable Architecture (CGRA), is a key component which makes a reconfigurable architecture flexible. This paper presents an improved interconnection network for the Dynamically Reprogrammable Resource Array (DRRA), which is a CGRA fabric. The old interconnection network, published previously, will be referred to as Hybrid-1. The new interconnection network, presented in this paper, will be referred to as Hybrid-2 in the rest of the paper. The DRRA fabric has the following properties:

1) Creation of Coarse Grain Instruction (CGI): The interconnection network enables creation of coarse grain instructions by connecting two or more computational resources with each other. The maximum size of the CGI which can be created depends on the maximum connectivity of the reconfigurable system.

2) Arbitrary Parallelism: The interconnection network allows creation of many such CGIs and running them in parallel. This is the property of a computational fabric like an FPGA. A CGRA which has this property is called a CGRA fabric.

3) Implementation of large sub-system: In addition to creation of CGIs, the interconnection network is also able to compose bigger systems from these CGIs by connecting them together. This is also a property of a computational fabric.

4) Local Connectivity: To reduce delay and energy consumption, the interconnection network has local connectivity which is limited to 3-hops communication.

5) Non-blocking and Point to Point/Multi-Point: The DRRA interconnection network is a non-blocking, Point-to-Point and Point-to-Multipoint network.

6) Sliding Window connectivity: Local connectivity in non-overlapping segments restricts the interconnection network to creating CGIs of a fixed maximum size. By having connectivity in overlapping segments, a sliding window style of local connectivity is created which allows creation of arbitrary size CGIs.

7) Dynamic Reconfiguration: Dynamic reconfiguration of a network allows the system to reconfigure the network at run-time. For dynamic reconfiguration, the number of configuration bits and the configuration time should be low. The DRRA network is reconfigurable during runtime on a cycle basis.

8) Partial Reconfiguration: An interconnection network which allows configuration of only a segment of the network is a partially reconfigurable interconnection network. Configuring only a segment of the network results in fewer configuration bits being generated, and allows a part of the network to be configured without disturbing the network connectivity in its surroundings. In DRRA, even a single network connection can be reconfigured without disturbing the other network connections.

9) Self Reconfiguration: The DRRA interconnection network is self-configurable, which means that the CGIs, which are created by combining the CGRA resources, can reconfigure the interconnection network. This allows the algorithms running on a CGI to reprogram the interconnection network, and hence the CGIs, according to their need. It also reduces the configuration time since the main configuration manager doesn't have to generate and send the configurations. This improvement
eliminates the need for a separate configuration network for the interconnection network.

Properties 1, 2, 3, 4, 5, 6 and 7 were implemented with Hybrid-1 in the DRRA fabric. Hybrid-2 implements properties 8 and 9, in addition to properties 1-7, in the DRRA fabric. Property 7 has also been improved by reducing the dynamic reconfiguration time of the interconnection fabric. This paper has two main contributions:

• An improvement over the existing Hybrid-1 interconnection scheme of the DRRA fabric. The improvement not only includes new and improved functionality (properties 7, 8 and 9) but also includes a redesigned switchbox to reduce the number of configuration bits and the configuration memory size.

• A quantitative comparison to other Multiplexer, Crossbar and NoC based interconnect schemes, including Hybrid-1.

Section-2 discusses the related work. Section-3 contains a brief introduction to DRRA. Section-4 presents the different implementations of the DRRA interconnection network. Section-5 presents the results while Section-6 concludes the paper.

II. RELATED WORK

Two decades of research on CGRAs have produced a number of CGRA architectures with different interconnection properties and implementation styles. These architectures have been reviewed elsewhere. This section discusses the interconnection schemes in some of these architectures.

ADRS is a CGRA with a multiplexer based mesh network with topologies like nearest neighbor connectivity, next hop connectivity, an extra connection to a central register file, vertical buses etc. REMARC also has a multiplexer based nearest neighbor connectivity along with full row and column BUS connectivity. Multiplexer based networks are good at providing Point-to-Multipoint connectivity, but this comes at the cost of long wires and high capacitance to drive. This has been recognized by ADRS, which proposes a full custom transistor to disconnect the wire segments that are not used during a specific network configuration. A crossbar provides full connectivity but requires the maximum number of configuration bits and is not scalable. Colt uses a crossbar to communicate between the data port and an array of 4x4 elements which are connected in a mesh network with nearest neighbor connectivity. The VIRAM processor also uses a crossbar for communication between DRAM banks and vector lanes. The crossbar is not scalable and has huge area and configuration overhead. Chameleon and Imagine use a circuit switched NoC for their interconnection network. Recently a Multistage Interconnection Network (MIN) has also been proposed for CGRAs. This network provides arbitrary routing by connecting together different stages of the network. Since creating a communication path in a NoC based network requires the involvement of many geographically distributed switches, creating a self-reconfigurable network is not possible with this approach. MorphoSys has a three-level hybrid interconnection network. The first level offers nearest neighbor connectivity, while the second and third levels consist of local and global buses.

Interconnect exploration for the mapping of algorithms helps to find the best routing and interconnection scheme. This paper is an effort to explore the implementation styles for the DRRA interconnection network discussed in the introduction, in order to find the best implementation for area, configuration bits and power. The DRRA interconnection network is different from the above mentioned architectures because it is a computational fabric like an FPGA, and allows creation of a number of arbitrary size partitions executing different algorithms.

III. DRRA

The Dynamically Reprogrammable Resource Array (DRRA) is a CGRA fabric, as shown in Figure 1, which consists of a pool of a) Arithmetic/Logic (mDPU), b) Storage (RFile) and c) Control (Sequencer) resources. These resources are seamlessly partitionable to compose Coarse Grain Instructions (CGIs). The arithmetic resources are used to create the data-path for the CGI. Two or more mDPUs can be connected together to create a complex data-path which matches the granularity of the algorithm. The RFile not only provides the storage, but also enough memory ports to feed this complex data-path. The sequencers are used to control these resources by instantiating them in the appropriate mode. The sequencers have an instruction memory of only 64 words.

Fig. 1. Dynamically Reprogrammable Resource Array (DRRA) Fabric

In DRRA a CGI is composed by configuring the interconnection network which connects these arithmetic and storage resources with each other. Our goal is to design an interconnection network which can create a CGI as complex as a Radix-4 FFT butterfly or bigger. To compose such big data-paths, we found that a sliding window communication of 3-hops would be required. A 3-hops communication window means that every DRRA resource can communicate with every other DRRA resource in either the right or the left direction up to 3 columns away, as shown in Figure 1. The sliding window means that these communication windows slide with respect
to DRRA columns in such a way that they overlap. Figure 1 shows a 2x8 fabric of DRRA which is created with these properties. It is important to mention that this fabric is a fragment and that, in 90nm technology, a 10x10mm chip can accommodate 324 DRRA cells.

IV. INTERCONNECTION IMPLEMENTATION EXPLORATION

An interconnection network for an architecture is designed with two main considerations: a) the functionality of the interconnection network and b) the physical overheads, e.g. area, power, speed and configuration bits. An interconnection network with the functionality discussed in the introduction can be implemented in multiple implementation styles. Hence it becomes important to explore all these implementation styles to find their physical overheads. For this exploration, we have implemented the interconnection network in the Multiplexer, Crossbar, NoC, Hybrid-1 and Hybrid-2 implementation styles. The implementation details and results are discussed in the sections and subsections below.

A. Multiplexer Based DRRA Network

A DRRA interconnection network, as discussed in the introduction, can be implemented using multiplexers. Every resource input in the DRRA fabric can receive data from resources up to 3 columns away on both sides, as shown in Figure 2. This creates an interconnection window of 7 columns. This window of connectivity moves with the resources, which is why it is called a sliding window. Each column has four resources with two outputs from every resource. This results in selecting one out of 56 (7x4x2) possible outputs for every single input and requires a multiplexer of size 56x1. Since a column has 12 inputs, twelve 56x1 multiplexers are required for every column in the multiplexer based DRRA interconnection network. A 56x1 multiplexer requires 6 bits to configure, therefore a DRRA column requires 72 bits to configure. This interconnection scheme is partial, dynamic and self-reconfigurable, and doesn't require a dedicated interconnect reconfiguration network. A sequencer can configure one input per cycle by providing 6 bits. A complete DRRA column having 12 inputs can be configured in 6 cycles. Since all the sequencers can program their interconnects in parallel, it takes 6 cycles at most to completely program this interconnection network in DRRA. A configuration memory for one DRRA column can be designed which is connected to both sequencers. This enables the two sequencers in a DRRA column to configure all four switchboxes by just configuring the memory. The memory will be organized in 12x8 (12 rows and 10 columns). The first four column bits decide the input multiplexer which is to be configured while the remaining 6 bits configure the 56x1 multiplexer.

Fig. 2. Multiplexer Based DRRA Interconnection Network

The multiplexer based network has two associated problems: a) the large multiplexers cause routing congestion during floorplanning, and b) a Point-to-Multipoint connection results in every output driving all the inputs (7x12) in the interconnection window, as shown in Figure 2. This not only increases the length of the interconnection wire, but also increases the driving load of the output. The result is a slower interconnection network which consumes much energy. We can break the wire length by driving every output in either the right direction or the left direction. That would still result in driving 42 inputs, which is a large load.

B. Circuit Switch Network (NoC)

A circuit switched network can be created for this kind of fabric as shown in Figure 3. A fully non-blocking, sliding window interconnection network with 3-hops connectivity requires 48 rows. Every column has 12 inputs and 8 outputs. These 20 inputs/outputs will be connected to these 48 rows. This will result in 480 4-way switches. Every NoC switch requires four configuration bits, resulting in 1920 bits of configuration memory in every column.

Fig. 3. Circuit Switched NoC Based DRRA Interconnection Network

The problem with this network is that if a physical communication channel is to be established between two resources, the geographically distributed switchboxes in the path between these two resources have to be configured. This can be done only by an external configuration unit, since the sequencers can only configure local switchboxes. So self reconfiguration of this network is not possible. This kind of NoC can also communicate beyond 3-hops. This communication will be blocking and the synthesis tool will report a lower clock frequency. To avoid this, the NoC switches will have to be pipelined, which will increase their power consumption
and area.

C. Crossbar Based Network

A crossbar based sliding window network can be created by using small crossbars connected together in cascade, as shown in Figure 4. To provide connectivity to resources on both sides up to 3-hops away, 48x56 crossbars are required. This results in a configuration memory of 2688 bits per column. These crossbars are used in sliding window fashion, i.e. every crossbar is connected to every other crossbar up to 3-hops away to create a 3-hops sliding window network. A crossbar based network can be used for communication beyond 3-hops, but that communication will be blocking and will decrease the system clock because of the longer network delay.

Fig. 4. Crossbar Based DRRA Interconnection Network

The problem with this implementation is its huge size, its number of configuration bits and its large network delay. A crossbar has to configure 2688 possible connections. If self reconfiguration requires one cycle to configure one connection, it will take the two sequencers 1344 cycles to completely configure the crossbar.

D. Hybrid-1 Network with Crossbars and BUSes

A single column of the DRRA Hybrid-1 interconnection network using crossbars and buses is shown in Figure 5. This interconnection network is organized in horizontal and vertical BUSes with 14x12 crossbars at the intersections, called H2V crossbars. The horizontal BUSes consist of the outputs of the DRRA resources, which are connected to the inputs of the H2V crossbars in sliding window fashion as discussed before. These crossbars receive inputs from resources on both sides up to 3-hops (3 columns) away. Each column has four H2V crossbars. One H2V crossbar requires 14x12=168 bits to configure. A single DRRA column requires 4x168=672 bits to configure. This memory is configured by an external configuration unit through an interconnect configuration network, so self reconfiguration is not possible for this network. The horizontal inputs to the H2V crossbars are configured to connect to the 12 vertical BUSes, which are then connected to the inputs of the resources. This organization of the interconnection network, with H2V crossbar based switchboxes, prevents an output from a DRRA resource from driving all the inputs in the 7-column communication window. However, this interconnection network suffers from the delay of the crossbar based switchboxes.

Fig. 5. DRRA Hybrid-1 Network

E. Hybrid-2 Network with Tri-state Multiplexers and BUSes

Two problems are identified in the Hybrid-1 type interconnection network: a) its configuration bits are more numerous than those of the multiplexer based network, and b) its network delay is also greater than that of the multiplexer based network because of the crossbar based switchboxes. Therefore the Hybrid-1 interconnection network is improved by redesigning the switchboxes. Figure 6 shows the Hybrid-2 interconnection network with the newer switchbox design. This switchbox consists of twelve 14x1 multiplexers, each of which is connected to a tri-state buffer. These tri-state buffers are each permanently connected to one of the twelve vertical buses.

Fig. 6. DRRA Hybrid-2 Network
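The per-column configuration-bit and configuration-cycle counts quoted in subsections A to D follow from simple arithmetic, which the sketch below reproduces. It is only an illustration under stated assumptions: all function and constant names are hypothetical, the numeric inputs are the ones given in the text, and the cycle estimate assumes the "one entry per sequencer per cycle" model used in the text.

```python
# Hypothetical helper names; the numbers are the ones quoted in subsections A-D.

WINDOW_COLUMNS = 7        # own column plus 3 hops on each side
RESOURCES_PER_COLUMN = 4
OUTPUTS_PER_RESOURCE = 2
INPUTS_PER_COLUMN = 12
SEQUENCERS_PER_COLUMN = 2


def select_bits(n_sources: int) -> int:
    """Select bits needed by an n_sources-to-1 multiplexer (ceil(log2(n)))."""
    return (n_sources - 1).bit_length()


def mux_bits_per_column() -> int:
    """Twelve 56x1 multiplexers, 6 select bits each."""
    sources = WINDOW_COLUMNS * RESOURCES_PER_COLUMN * OUTPUTS_PER_RESOURCE
    return INPUTS_PER_COLUMN * select_bits(sources)


def noc_bits_per_column(switches: int = 480, bits_per_switch: int = 4) -> int:
    """480 four-way switches with four configuration bits each."""
    return switches * bits_per_switch


def crossbar_bits_per_column(rows: int = 48, cols: int = 56) -> int:
    """One configuration bit per crosspoint of a 48x56 crossbar."""
    return rows * cols


def hybrid1_bits_per_column(crossbars: int = 4, rows: int = 14, cols: int = 12) -> int:
    """Four 14x12 H2V crossbars per column."""
    return crossbars * rows * cols


def hybrid2_switchbox_select_bits(muxes: int = 12, sources: int = 14) -> int:
    """Raw select bits of the twelve 14x1 multiplexers in one Hybrid-2 switchbox."""
    return muxes * select_bits(sources)


def self_config_cycles(entries: int, sequencers: int = SEQUENCERS_PER_COLUMN) -> int:
    """Cycles needed if each sequencer writes one entry per cycle (ceiling division)."""
    return -(-entries // sequencers)


if __name__ == "__main__":
    print("MUX bits/column:     ", mux_bits_per_column())
    print("NoC bits/column:     ", noc_bits_per_column())
    print("Crossbar bits/column:", crossbar_bits_per_column())
    print("Hybrid-1 bits/column:", hybrid1_bits_per_column())
    print("MUX config cycles:   ", self_config_cycles(INPUTS_PER_COLUMN))
    print("Crossbar cycles:     ", self_config_cycles(crossbar_bits_per_column()))
```

Using `int.bit_length` instead of a floating-point logarithm keeps the select-bit count exact for any multiplexer size.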
This design has three advantages over the previous design: a) the configuration bits are reduced, b) the area of the switchbox is reduced and c) the delay of the switchbox is also reduced. The new switchbox requires 48 bits to configure in this interconnection network. Since all four switchboxes drive the same vertical buses, their tri-state drivers are mutually exclusive. We can use this property to create a memory organized as 12x6 bits (12 rows and 10 columns). Every row corresponds to the output connected to one of the vertical BUSes. The first two column bits select one of the four switchboxes in the column, the next 4 bits select the vertical BUS which is to be driven, and the last 4 bits select the horizontal BUS which will drive the selected vertical BUS.

1) Self Reconfiguration: The new configuration memory has very few bits to configure and is designed as a two-port memory to allow connectivity with the two sequencers present in the same DRRA column. This allows the sequencers to program the configuration memory, hence creating a self-reconfigurable system. Using the sequencers, we can dynamically and partially reprogram the interconnection network without the need for an external configuration unit. So the external reconfiguration network for the interconnect has been completely removed. The interconnect configurations are stored inside the sequencer when the program/configware is stored. The application mapping flow is shown in Figure 7. A DRRA program/configware contains Memory, Data-path and Interconnect instructions. This program is loaded into the DRRA sequencer. When the sequencer starts, it executes the interconnect instructions to configure the interconnection network. Once the network is configured, the data-path and memory instructions are executed. During execution of the algorithm, the sequencer can issue new Interconnect instructions to re-configure the interconnection network. To configure one complete DRRA column, twelve inputs are configured by the two sequencers. A sequencer takes a single cycle to configure one input, hence it takes 6 cycles to completely configure a DRRA column. Since all the DRRA columns are configured independently by their own sequencers, a complete DRRA fabric, no matter how big, can be configured in 6 cycles.

Fig. 7. Application Mapping Flow

V. RESULTS

The above mentioned interconnect implementations are synthesized for DRRA using TSMC 90nm technology in Cadence RTL Compiler. Table I contains the data for area, configuration bits, configuration cycles and network delay of these implementations after synthesis.

TABLE I
COMPARISON BETWEEN DIFFERENT IMPLEMENTATIONS

Implementation   Area (Gates)   Cfg. Bits   Cfg. Cycles   Network Delay (ps)
MUX              8402           120         6             707
NoC              87840          1920        Variable      Variable
Crossbar         43008          2688        1344          Variable
Hybrid-1         13416          672         6*CND         1443
Hybrid-2         9147           120         6             246

This data shows that:

1) Multiplexer based networks are the best in terms of area, configuration bits and number of cycles to configure the network. However, multiplexer based networks are slow because of the long Point-to-Multipoint wires. This problem has been recognized by ADRS as well. To remove this problem, they have designed pass transistor based full custom switches to break the wires.

2) Crossbar and NoC based solutions are very expensive in terms of area, configuration bits, configuration cycles etc. In NoC based solutions, configuration of a link depends on the number of switches in the path. Partial and dynamic reconfiguration can be supported in NoC and Crossbar based networks using an external configuration network. Self reconfiguration cannot be supported in a NoC because the sequencers cannot reconfigure the geographically distributed switches involved in establishing a communication channel between two resources. The configuration cycles in NoC and Crossbar based interconnection networks also depend on the number of switches/crossbars involved and on the configuration network delay (CND).

3) Hybrid-1 is better than the NoC and Crossbar based networks. However, it takes more area and configuration bits than the multiplexer based network. It is also slower than the multiplexer based network. The number of cycles to configure a DRRA column depends on the Configuration Network Delay (CND) of the Hybrid-1 network.

4) The Hybrid-2 network, as can be seen in Table I, has almost the same area and configuration bits as a multiplexer based network. Since the network is
self-reconfigurable, the configuration network delay in this network is one. Hence it takes only 6 cycles to completely reconfigure a DRRA column. Furthermore, all DRRA columns can be reconfigured in parallel, therefore it takes only 6 cycles to completely reconfigure the whole DRRA fabric. The Hybrid-2 network is also faster than the multiplexer based network. The Hybrid-2 network is, in reality, a multiplexer based network with tri-state buffers. The increase in area is because of these tri-state buffers. Using this Hybrid-2 approach, we have broken down the long Point-to-Multipoint wires of the multiplexer based network into Point-to-Point wires using switchboxes. This doesn't affect the Point-to-Multipoint capability of the network. This approach is better than the one in which a pass transistor based switch was used to break the long wires of the multiplexer based network. Furthermore, the switchboxes in Hybrid-2 have been designed completely in standard cell technology, which keeps the design flow simple and reduces the time to market.

5) DRRA with the Hybrid-2 network is synthesized and floorplanned in 90nm using Cadence RTL Compiler and SoC Encounter. Using this network, the 2x8 fabric of DRRA shown in Figure 1 runs at a frequency of 720MHz and can support a peak local bandwidth of 138GB/s.

VI. CONCLUSION AND FUTURE WORK

An improved implementation, the Hybrid-2 interconnection network for the Dynamically Reprogrammable Resource Array, has been presented. To justify the design decisions, an interconnect exploration is done by implementing the same network using Multiplexer, NoC and Crossbar based networks. The Hybrid-2 network is then compared against the Multiplexer, NoC, Crossbar and previously published Hybrid-1 networks. Results show that the newly presented network takes (1.08x, 0.104x, 0.212x and 0.681x) the area and (1x, 0.037x, 0.026x and 0.107x) the configuration bits of the Multiplexer, NoC, Crossbar and Hybrid-1 implementations. The Hybrid-2 network is 2.87x and 5.86x better in terms of speed as compared to the Multiplexer and Hybrid-1 networks. The Hybrid-2 network also takes the minimum number of cycles to configure/reconfigure a complete DRRA column.

A future version of the interconnection network with an adjustable sliding window has been planned. By lowering the clock frequency, the width of the sliding window can be increased to allow mapping of more complex data paths than is possible today. Similarly, at higher clock frequencies this width can be reduced. The future version of DRRA will have voltage frequency scaling and a power shut-off methodology. This may result in some parts of DRRA working in a different voltage/frequency range or being completely turned off. The DRRA switchboxes will be improved to handle such situations by adding level shifters or isolators.

ACKNOWLEDGMENT

The Author is thankful to the Swedish Research Council and the Higher Education Commission of Pakistan for funding this research.

REFERENCES

M. Baron. Trends in use of reconfigurable platforms. In 41st Proceedings of Design Automation Conference, pages 415–415. IEEE, July 2004.

R. Ferreira, M. Laure, A. C. Beck, T. Lo, M. Rutzig, and L. Carro. A low cost and adaptable routing network for reconfigurable systems. In Proc. IEEE Int. Symp. Parallel & Distributed Processing IPDPS 2009, pages 1–8, 2009.

R. Hartenstein. A decade of reconfigurable computing: a visionary retrospective. In Design, Automation and Test in Europe, pages 642–649. IEEE, March 2001.

P. M. Heysters. Coarse-Grained Reconfigurable Processors; Flexibility Meets Efficiency. PhD Thesis, ISBN:90-365-2076-2, Netherlands, 2003.

B. Khailany, W. J. Dally, U. J. Kapasi, P. Mattson, J. Namkoong, J. D. Owens, B. Towles, A. Chang, and S. Rixner. Imagine: media processing with streams. IEEE MICRO, 21(2):35–46, 2001.

C. E. Kozyrakis, S. Perissakis, D. Patterson, T. Anderson, K. Asanovic, N. Cardwell, R. Fromm, J. Golbus, B. Gribstad, K. Keeton, R. Thomas, N. Treuhaft, and K. Yelick. Scalable processors in the billion-transistor era: Iram. Computer, 30(9):75–78, 1997.

Z. Kwok and S. J. E. Wilton. Register file architecture optimization in a coarse-grained reconfigurable architecture. In Proc. 13th Annual IEEE Symp. Field-Programmable Custom Computing Machines FCCM 2005, pages 35–44, 2005.

T. Miyamori and K. Olukotun. Remarc: reconfigurable multimedia array coprocessor. IEICE Transactions on Information and Systems, 82(5):389–397, November 1998.

M. A. Shami and A. Hemani. Morphable dpu: Smart and efficient data path for signal processing applications. In Proc. IEEE Workshop Signal Processing Systems SiPS 2009, pages 167–172, 2009.

M. A. Shami and A. Hemani. Partially reconfigurable interconnection network for dynamically reprogrammable resource array. In IEEE 8th International Conference on ASIC, pages 122–125. IEEE, October 2009.

H. Singh, M.-H. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh, and E. M. Chaves Filho. Morphosys: an integrated reconfigurable system for data-parallel and computation-intensive applications. IEEE Transactions on Computers, 49(5):465–481, 2000.
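As a cross-check, the ratios quoted in the conclusion can be recomputed from the entries of Table I. The sketch below is illustrative only and uses hypothetical names. One assumption is labeled explicitly: the quoted configuration-bit ratios (0.037x, 0.026x, 0.107x) are reproduced when 72 bits per column (12 inputs x 6 select bits) are used for Hybrid-2 rather than the 120-bit memory size listed in Table I; which figure was intended is not stated in the text.

```python
# Values copied from Table I; None marks entries the table lists as "Variable".
TABLE_I = {
    "MUX":      {"area": 8402,  "cfg_bits": 120,  "delay_ps": 707},
    "NoC":      {"area": 87840, "cfg_bits": 1920, "delay_ps": None},
    "Crossbar": {"area": 43008, "cfg_bits": 2688, "delay_ps": None},
    "Hybrid-1": {"area": 13416, "cfg_bits": 672,  "delay_ps": 1443},
    "Hybrid-2": {"area": 9147,  "cfg_bits": 120,  "delay_ps": 246},
}

# Assumption (not stated in the paper): raw select bits per column, 12 x 6.
RAW_SELECT_BITS = 72


def area_ratio(baseline: str) -> float:
    """Hybrid-2 area divided by the baseline's area."""
    return TABLE_I["Hybrid-2"]["area"] / TABLE_I[baseline]["area"]


def cfg_bit_ratio(baseline: str, hybrid2_bits: int) -> float:
    """Hybrid-2 configuration bits (as supplied) divided by the baseline's."""
    return hybrid2_bits / TABLE_I[baseline]["cfg_bits"]


def speedup(baseline: str) -> float:
    """Baseline network delay divided by the Hybrid-2 network delay."""
    return TABLE_I[baseline]["delay_ps"] / TABLE_I["Hybrid-2"]["delay_ps"]


if __name__ == "__main__":
    for name in ("MUX", "NoC", "Crossbar", "Hybrid-1"):
        print(f"{name:9s} area ratio {area_ratio(name):.4f}  "
              f"cfg bits (72) {cfg_bit_ratio(name, RAW_SELECT_BITS):.4f}  "
              f"cfg bits (120) {cfg_bit_ratio(name, 120):.4f}")
    for name in ("MUX", "Hybrid-1"):
        print(f"{name:9s} speedup {speedup(name):.3f}")
```

The area and speed figures agree with the quoted values to the precision given in the text, which appears to be truncated rather than rounded (for example 1.0887 is quoted as 1.08x).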