An Improved Self-Reconfigurable Interconnection Scheme for a Coarse Grain Reconfigurable Architecture

Muhammad Ali Shami, Ahmed Hemani
School of ICT, Royal Institute of Technology, KTH, Stockholm, Sweden
Email: shami@kth.se, hemani@kth.se

978-1-4244-8971-8/10$26.00 © 2010 IEEE

Abstract: An improved dynamic, partial and self-reconfigurable interconnection network (Hybrid-2 network) is presented for the Dynamically Reprogrammable Resource Array (DRRA), which is a Coarse Grain Reconfigurable Architecture (CGRA). To justify the design decisions, the Hybrid-2 implementation is compared against possible implementations using Multiplexer, NoC and Crossbar networks and against the previously published Hybrid-1 interconnection network. Results show that the newly presented Hybrid-2 interconnection network takes (1.08x, 0.104x, 0.212x and 0.681x) the area and (1x, 0.037x, 0.026x and 0.107x) the configuration bits of the Multiplexer, NoC, Crossbar and Hybrid-1 implementations respectively. The Hybrid-2 network is also 2.87x and 5.86x faster than the Multiplexer and Hybrid-1 networks.

I. INTRODUCTION

The flexibility of a reconfigurable architecture comes from a) its ability to reconfigure computational logic and b) its ability to reconfigure the interconnection network that connects the computational logic blocks with each other. The interconnection network is therefore a key component of any Coarse Grain Reconfigurable Architecture (CGRA). This paper presents an improved interconnection network for the Dynamically Reprogrammable Resource Array (DRRA), which is a CGRA fabric. The old interconnection network, published in [10], is referred to as Hybrid-1. The new interconnection network, presented in this paper, is referred to as Hybrid-2 in the rest of the paper. The DRRA fabric has the following properties:

1) Creation of Coarse Grain Instruction (CGI): The interconnection network enables the creation of coarse grain instructions by connecting two or more computational resources with each other. The maximum size of a CGI that can be created depends on the maximum connectivity of the reconfigurable system.
2) Arbitrary Parallelism: The interconnection network allows many such CGIs to be created and run in parallel. This is a property of a computational fabric like an FPGA. A CGRA which has this property is called a CGRA fabric.
3) Implementation of large sub-systems: In addition to creating CGIs, the interconnection network is also able to compose bigger systems by connecting these CGIs together. This is also a property of a computational fabric.
4) Local Connectivity: To reduce delay and energy consumption, the interconnection network has local connectivity which is limited to 3-hop communication.
5) Non-blocking and Point-to-Point/Multi-Point: The DRRA interconnection network is a non-blocking, Point-to-Point and Point-to-Multipoint network.
6) Sliding Window connectivity: Local connectivity in non-overlapping segments restricts the interconnection network to CGIs of a fixed maximum size. By providing connectivity in overlapping segments, a sliding-window style of local connectivity is created which allows the creation of CGIs of arbitrary size.
7) Dynamic Reconfiguration: Dynamic reconfiguration of a network allows the system to reconfigure the network at run-time. For dynamic reconfiguration, the number of configuration bits and the configuration time should be low. The DRRA network is reconfigurable during run-time on a cycle-by-cycle basis.
8) Partial Reconfiguration: An interconnection network which allows configuration of only a segment of the network is a partially reconfigurable interconnection network. Configuring only a segment of the network requires fewer bits to be generated, and allows a part of the network to be configured without disturbing the surrounding network connectivity. In DRRA, even a single network connection can be reconfigured without disturbing the other network connections.
9) Self Reconfiguration: The DRRA interconnection network is self-configurable, which means that the CGIs, which are created by combining the CGRA resources, can reconfigure the interconnection network. This allows the algorithms running on a CGI to reprogram the interconnection network, and hence the CGIs, according to their needs. It also reduces the configuration time since the main configuration manager does not have to generate and send the configurations. This improvement eliminates the need for a separate configuration network for the interconnection network.

Properties 1, 2, 3, 4, 5, 6 and 7 were implemented with Hybrid-1 in the DRRA fabric. Hybrid-2 implements properties 8 and 9, in addition to properties 1-7, in the DRRA fabric. Property 7 has also been improved by reducing the dynamic reconfiguration time of the interconnection fabric. This paper has two main contributions:

• An improvement over the existing Hybrid-1 interconnection scheme of the DRRA fabric. The improvement not only includes new and improved functionality (properties 7, 8 and 9) but also includes a redesigned switchbox to reduce the number of configuration bits and the configuration memory size.
• A quantitative comparison to other Multiplexer, Crossbar and NoC based interconnect schemes, including Hybrid-1.

Section 2 discusses the related work. Section 3 contains a brief introduction to DRRA. Section 4 presents the different implementations of the DRRA interconnection network. Section 5 presents the results while Section 6 concludes the paper.

II. RELATED WORK

Two decades of research on CGRAs have produced a number of CGRA architectures with different interconnection properties and implementation styles. These architectures have been reviewed in [3] and [1]. This section discusses the interconnection schemes in some of these architectures. ADRS [7] is a CGRA with a multiplexer based mesh network with topologies like nearest neighbor connectivity, next hop connectivity, extra connections to a central register file and vertical buses. REMARC [8] also has multiplexer based nearest neighbor connectivity along with full row and column BUS connectivity. Multiplexer based networks are good at providing Point-to-Multipoint connectivity, but this comes at the cost of long wires and a high capacitance to drive. This has been recognized in ADRS, which proposes full custom transistors [7] to disconnect the segments of the wires that are not used during a specific network configuration. A crossbar provides full connectivity but requires the maximum number of configuration bits and is not scalable. Colt uses a crossbar to communicate between the data ports and an array of 4x4 elements which are connected in a mesh network with nearest neighbor connectivity. The VIRAM [6] processor also uses a crossbar for communication between DRAM banks and vector lanes. The crossbar is not scalable and has a huge area and configuration overhead. Chameleon [4] and Imagine [5] use a circuit switched NoC for their interconnection network. Recently a Multistage Interconnection Network (MIN) [2] has also been proposed for CGRAs. This network provides arbitrary routing by connecting together different stages of the network. Since creating a communication path in a NoC based network requires the involvement of many geographically distributed switches, a self-reconfigurable network cannot be created with this approach. MorphoSys [11] has a three-level hybrid interconnection network. The first level offers nearest neighbor connectivity, while the second and third levels consist of local and global buses.

Interconnect exploration for the mapping of algorithms helps to find the best routing and interconnection scheme. This paper explores the implementation styles for the DRRA interconnection network described in the introduction in order to find the best implementation in terms of area, configuration bits and power. The DRRA interconnection network is different from the above mentioned architectures because it is a computational fabric like an FPGA, and allows the creation of a number of arbitrarily sized partitions executing different algorithms.

III. DRRA

Fig. 1. Dynamically Reprogrammable Resource Array (DRRA) Fabric

The Dynamically Reprogrammable Resource Array (DRRA) is a CGRA fabric, as shown in Figure 1, which consists of a pool of a) Arithmetic/Logic (mDPU) [9], b) Storage (RFile) and c) Control (Sequencer) resources. These resources are seamlessly partitionable to compose Coarse Grain Instructions (CGIs). The arithmetic resources are used to create the data-path for the CGI. Two or more mDPUs can be connected together to create a complex data-path which matches the granularity of the algorithm. The RFile not only provides the storage, but also enough memory ports to feed this complex data-path. The sequencers are used to control these resources by instantiating them in the appropriate mode. The sequencers have an instruction memory of only 64 words.

In DRRA a CGI is composed by configuring the interconnection network which connects these arithmetic and storage resources with each other. Our goal is to design an interconnection network which can create a CGI as complex as a Radix-4 FFT butterfly or bigger. To compose such big data-paths, we found that a sliding window communication of 3 hops would be required. A 3-hop communication window means that every DRRA resource can communicate with every other DRRA resource in either the right or the left direction up to 3 columns away, as shown in Figure 1. The sliding window means that these communication windows slide with respect to the DRRA columns in such a way that they overlap. Figure 1 shows a 2x8 fabric of DRRA which is created with these properties. It is important to mention that this fabric is a fragment; in 90nm technology, a 10x10mm chip can accommodate 324 DRRA cells.

IV. INTERCONNECTION IMPLEMENTATION EXPLORATION

An interconnection network for an architecture is designed with two main considerations: a) the functionality of the interconnection network and b) the physical overheads, e.g. area, power, speed and configuration bits. An interconnection network with the functionality discussed in the introduction can be implemented using multiple implementation styles. Hence it becomes important to explore all these implementation styles to find their physical overheads. For this exploration, we have implemented the interconnection network in the Multiplexer, Crossbar, NoC, Hybrid-1 and Hybrid-2 implementation styles. The implementation details and results are discussed in the sections and subsections below.

A. Multiplexer Based DRRA Network

Fig. 2. Multiplexer Based DRRA Interconnection Network

A DRRA interconnection network, as discussed in the introduction, can be implemented using multiplexers. Every resource input in the DRRA fabric can receive data from resources up to 3 columns away on both sides, as shown in Figure 2. This creates an interconnection window of 7 columns. This window of connectivity moves with the resources, which is why it is called a sliding window. Each column has four resources with two outputs from every resource. This results in selecting one out of 56 (7x4x2) possible outputs for every single input and requires a multiplexer of size 56x1. Since a column has 12 inputs, twelve 56x1 multiplexers are required for every column in a multiplexer based DRRA interconnection network. A 56x1 multiplexer requires 6 bits to configure, therefore a DRRA column requires 72 bits to configure. This interconnection scheme is partial, dynamic and self-reconfigurable, and does not require a dedicated interconnect reconfiguration network. A sequencer can configure one input per cycle by providing 6 bits. A complete DRRA column with 12 inputs can be configured by its two sequencers in 6 cycles. Since all the sequencers can program their interconnects in parallel, it takes 6 cycles at most to completely program this interconnection network in DRRA. A configuration memory can be designed for one DRRA column and connected to both sequencers. This enables the two sequencers in a DRRA column to configure all four switch-boxes simply by writing to the memory. The memory is organized as 12x10 bits (12 rows and 10 columns). The first four column bits select the input multiplexer which is to be configured, while the remaining 6 bits configure the 56x1 multiplexer.

The multiplexer based network has two problems: a) the large multiplexers cause routing congestion during floorplanning, and b) a Point-to-Multipoint connection results in every output driving all the inputs (7x12) in the interconnection window, as shown in Figure 2. This not only increases the length of the interconnection wire, but also increases the driving load of the output. The result is a slower interconnection network which consumes more energy. We can shorten the wires by driving every output in either the right or the left direction only. That would still result in driving 42 inputs, which is a large load.

B. Circuit Switch Network (NoC)

Fig. 3. Circuit Switched NoC Based DRRA Interconnection Network

A circuit switched network can be created for this kind of fabric, as shown in Figure 3. A fully non-blocking, sliding window interconnection network with 3-hop connectivity requires 48 rows. Every column has 12 inputs and 8 outputs. These 20 inputs/outputs are connected to these 48 rows. This results in 480 4-way switches. Every NoC switch requires four configuration bits, resulting in 1920 bits of configuration memory in every column.

The problem with this network is that if a physical communication channel is to be established between two resources, the geographically distributed switchboxes in the path between these two resources have to be configured. This can be done only by an external configuration unit, since the sequencers can only configure local switchboxes. Self reconfiguration of this network is therefore not possible. This kind of NoC can also communicate beyond 3 hops. Such communication will be blocking and the synthesis tool will report a lower clock frequency. To avoid this, the NoC switches have to be pipelined, which increases their power consumption and area.

C. Crossbar Based Network

Fig. 4. Crossbar Based DRRA Interconnection Network

A crossbar based sliding window network can be created by cascading small crossbars, as shown in Figure 4. To provide connectivity to resources on both sides up to 3 hops away, 48x56 crossbars are required. This results in a configuration memory of 2688 bits per column. These crossbars are used in sliding window fashion, i.e. every crossbar is connected to every other crossbar up to 3 hops away to create a 3-hop sliding window network. A crossbar based network can be used for communication beyond 3 hops, but that communication will be blocking and will decrease the system clock frequency because of the longer network delay.

The problems with this implementation are its huge size, its number of configuration bits and its large network delay. A crossbar has to configure 2688 possible connections. If self reconfiguration requires one cycle to configure one connection, the two sequencers need 1344 cycles to completely configure the crossbar.

D. Hybrid-1 Network with Crossbars and BUSes

Fig. 5. DRRA Hybrid-1 Network

A single column of the DRRA Hybrid-1 interconnection network using crossbars and buses is shown in Figure 5. This interconnection network is organized in horizontal and vertical BUSes with 14x12 crossbars, called H2V crossbars, at the intersections. The horizontal BUSes consist of the outputs of the DRRA resources, which are connected to the inputs of the H2V crossbars in sliding window fashion as discussed before. These crossbars receive inputs from resources on both sides up to 3 hops (3 columns) away. Each column has four H2V crossbars. One H2V crossbar requires 14x12=168 bits to configure. A single DRRA column requires 4x168=672 bits to configure. This memory is configured by an external configuration unit through an interconnect configuration network, so self reconfiguration is not possible for this network. The horizontal inputs to the H2V crossbars are configured to connect to the 12 vertical BUSes, which are in turn connected to the inputs of the resources. This organization of the interconnection network, with H2V crossbar based switchboxes, means that an output of a DRRA resource no longer has to drive all the inputs in the 7-column communication window. However, this interconnection network suffers from the delay of the crossbar based switchboxes.

E. Hybrid-2 Network with Tri-state Multiplexers and BUSes

Fig. 6. DRRA Hybrid-2 Network

Two problems are identified in the Hybrid-1 type of interconnection network: a) it needs more configuration bits than the multiplexer based network, and b) its network delay is also greater than that of the multiplexer based network because of the crossbar based switchboxes. The Hybrid-1 interconnection network is therefore improved by redesigning the switchboxes. Figure 6 shows the Hybrid-2 interconnection network with the newer switchbox design. This switchbox consists of twelve 14x1 multiplexers, each of which is connected to a tri-state buffer. Each tri-state buffer is permanently connected to one of the twelve vertical buses.
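The per-column configuration-bit and configuration-cycle figures quoted in subsections A to D follow from simple counting, and the arithmetic can be sketched as below. This is only an illustrative cross-check, not code from the paper: all function and constant names are hypothetical, the 480-switch and 48x56 figures are taken from the text as given rather than derived, and the last function assumes that each 14x1 multiplexer of the redesigned switchbox needs a plain binary select field.

```python
# Illustrative cross-check of the per-column figures quoted in Section IV.
# All names are hypothetical; the numeric inputs are the ones given in the text.

WINDOW_COLUMNS = 7         # own column plus 3 hops on each side
RESOURCES_PER_COLUMN = 4
OUTPUTS_PER_RESOURCE = 2
INPUTS_PER_COLUMN = 12
SEQUENCERS_PER_COLUMN = 2


def select_bits(n_choices: int) -> int:
    """Bits needed for a binary select field that picks one of n_choices."""
    return (n_choices - 1).bit_length()


def mux_bits_per_column() -> int:
    """Twelve 56x1 multiplexers, each with a 6-bit select field."""
    candidates = WINDOW_COLUMNS * RESOURCES_PER_COLUMN * OUTPUTS_PER_RESOURCE
    return INPUTS_PER_COLUMN * select_bits(candidates)


def noc_bits_per_column(switches: int = 480, bits_per_switch: int = 4) -> int:
    """480 four-way switches with four configuration bits each."""
    return switches * bits_per_switch


def crossbar_bits_per_column(rows: int = 48, cols: int = 56) -> int:
    """One configuration bit per crosspoint of the 48x56 crossbar."""
    return rows * cols


def hybrid1_bits_per_column(crossbars: int = 4, rows: int = 14, cols: int = 12) -> int:
    """Four 14x12 H2V crossbars, one bit per crosspoint."""
    return crossbars * rows * cols


def hybrid2_switchbox_select_bits(muxes: int = 12, mux_inputs: int = 14) -> int:
    """ASSUMPTION: twelve 14x1 multiplexers with binary select fields."""
    return muxes * select_bits(mux_inputs)


def self_config_cycles(entries: int, sequencers: int = SEQUENCERS_PER_COLUMN) -> int:
    """Cycles needed when each sequencer writes one entry per cycle."""
    return -(-entries // sequencers)  # ceiling division


if __name__ == "__main__":
    print("MUX bits/column:", mux_bits_per_column())
    print("NoC bits/column:", noc_bits_per_column())
    print("Crossbar bits/column:", crossbar_bits_per_column())
    print("Hybrid-1 bits/column:", hybrid1_bits_per_column())
    print("Hybrid-2 switchbox select bits:", hybrid2_switchbox_select_bits())
    print("Cycles for 12 inputs:", self_config_cycles(INPUTS_PER_COLUMN))
    print("Cycles for 2688 crosspoints:", self_config_cycles(2688))
```

Under these assumptions the functions reproduce the 72, 1920, 2688 and 672 bits and the 6 and 1344 cycles stated in the text, and the twelve 4-bit select fields give the 48 bits that the paper quotes for the new switchbox.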
The Hybrid-2 switchbox design has three advantages over the Hybrid-1 design: a) the configuration bits are reduced, b) the area of the switchbox is reduced and c) the delay of the switchbox is also reduced. The new switchbox requires 48 bits to configure in this interconnection network. Since all four switchboxes drive the same vertical buses, their tri-state drivers are mutually exclusive. We can use this property to create a memory organized as 12x10 bits (12 rows and 10 columns). Every row corresponds to the output connected to one of the vertical BUSes. The first two column bits select one of the four switchboxes in the column, the next 4 bits select the vertical BUS which is to be driven, and the last 4 bits select the horizontal BUS which will drive the selected vertical BUS.

1) Self Reconfiguration: The new configuration memory has very few bits to configure and is designed as a two-port memory to allow connectivity with the two sequencers present in the same DRRA column. This allows the sequencers to program the configuration memory, creating a self-reconfigurable system. Using the sequencers, we can dynamically and partially reprogram the interconnection network without the need for an external configuration unit. The external reconfiguration network for the interconnects has therefore been completely removed. The interconnect configurations are stored inside the sequencer when the program/configware is stored. The application mapping flow is shown in Figure 7. A DRRA program/configware contains Memory, Data-path and Interconnect instructions. This program is loaded into the DRRA sequencer. When the sequencer starts, it executes the interconnect instructions to configure the interconnection network. Once the network is configured, the data-path and memory instructions are executed. During execution of the algorithm, the sequencer can issue new Interconnect instructions to reconfigure the interconnection network. To configure one complete DRRA column, twelve inputs are configured by the two sequencers. A sequencer takes a single cycle to configure one input, hence it takes 6 cycles to completely configure a DRRA column. Since all the DRRA columns are configured independently by their own sequencers, a complete DRRA fabric, no matter how big, can be configured in 6 cycles.

Fig. 7. Application Mapping Flow

V. RESULTS

The above mentioned interconnect implementations are synthesized for DRRA using TSMC 90nm technology in Cadence RTL Compiler. Table I contains the area, configuration bits, configuration cycles and network delay of these implementations after synthesis.

TABLE I
COMPARISON BETWEEN DIFFERENT IMPLEMENTATIONS

Implementation   Area (Gates)   Cfg. Bits   Cfg. Cycles   Network Delay (ps)
MUX              8402           120         6             707
NoC              87840          1920        Variable      Variable
Crossbar         43008          2688        1344          Variable
Hybrid-1         13416          672         6*CND         1443
Hybrid-2         9147           120         6             246

This data shows that:

1) Multiplexer based networks are the best in terms of area, configuration bits and number of cycles to configure the network. However, multiplexer based networks are slow because of the long Point-to-Multipoint wires. This problem has been recognized in ADRS as well, where pass transistor based full custom switches were designed to break the wires [7].
2) Crossbar and NoC based solutions are very expensive in terms of area, configuration bits and configuration cycles. In NoC based solutions, the configuration of a link depends on the number of switches in the path. Partial and dynamic reconfiguration can be supported in NoC and Crossbar based networks using an external configuration network. Self reconfiguration cannot be supported in a NoC because the sequencers cannot reconfigure the geographically distributed switches involved in establishing a communication channel between two resources. The configuration cycles in NoC and Crossbar based interconnection networks also depend on the number of switches/crossbars involved and on the configuration network delay (CND).
3) Hybrid-1 is better than the NoC and Crossbar based networks. However, it takes more area and configuration bits than the multiplexer based network. It is also slower than the multiplexer based network. The number of cycles to configure a DRRA column depends on the Configuration Network Delay (CND) of the Hybrid-1 network.
4) The Hybrid-2 network, as can be seen in Table I, has almost the same area and configuration bits as a multiplexer based network. Since the network is self-reconfigurable, the configuration network delay in this network is one. Hence it takes only 6 cycles to completely reconfigure a DRRA column. Furthermore, all DRRA columns can be reconfigured in parallel, so it takes only 6 cycles to completely reconfigure the whole DRRA fabric. The Hybrid-2 network is also faster than the multiplexer based network. The Hybrid-2 network is in reality a multiplexer based network with tri-state buffers. The increase in area is due to these tri-state buffers. Using the Hybrid-2 approach, we have broken down the long Point-to-Multipoint wires of the multiplexer based network into Point-to-Point wires using switchboxes. This does not affect the Point-to-Multipoint capability of the network. This approach is better than [7], in which a pass transistor based switch was used to break the long wires of the multiplexer based network. Furthermore, the switchboxes in Hybrid-2 have been designed completely in standard cell technology, which keeps the design flow simple and reduces the time to market.
5) DRRA with the Hybrid-2 network is synthesized and floorplanned in 90nm using Cadence RTL Compiler and SoC Encounter. Using this network, the 2x8 fabric of DRRA shown in Figure 1 runs at a frequency of 720MHz and can support a peak local bandwidth of 138GB/s.

VI. CONCLUSION AND FUTURE WORK

An improved implementation, the Hybrid-2 interconnection network, has been presented for the Dynamically Reprogrammable Resource Array. To justify the design decisions, an interconnect exploration was done by implementing the same network using Multiplexer, NoC and Crossbar based networks. The Hybrid-2 network was then compared against the Multiplexer, NoC, Crossbar and previously published Hybrid-1 networks. Results show that the newly presented network takes (1.08x, 0.104x, 0.212x and 0.681x) the area and (1x, 0.037x, 0.026x and 0.107x) the configuration bits of the Multiplexer, NoC, Crossbar and Hybrid-1 implementations respectively. The Hybrid-2 network is 2.87x and 5.86x faster than the Multiplexer and Hybrid-1 networks. The Hybrid-2 network also takes the minimum number of cycles to configure or reconfigure a complete DRRA column.

A future version of the interconnection network with an adjustable sliding window has been planned. By lowering the clock frequency, the width of the sliding window can be increased to allow mapping of more complex data paths than is possible today. Similarly, at higher clock frequencies this width can be reduced. The future version of DRRA will have voltage/frequency scaling and a power shut-off methodology. This may result in some parts of DRRA working in a different voltage/frequency range or being completely turned off. The DRRA switchboxes will be improved to handle such situations by adding level shifters or isolators.

ACKNOWLEDGMENT

The Author is thankful to the Swedish Research Council and the Higher Education Commission of Pakistan for funding this research.

REFERENCES

[1] M. Baron. Trends in use of reconfigurable platforms. In 41st Proceedings of Design Automation Conference, pages 415–415. IEEE, July 2004.
[2] R. Ferreira, M. Laure, A. C. Beck, T. Lo, M. Rutzig, and L. Carro. A low cost and adaptable routing network for reconfigurable systems. In Proc. IEEE Int. Symp. Parallel & Distributed Processing IPDPS 2009, pages 1–8, 2009.
[3] R. Hartenstein. A decade of reconfigurable computing: a visionary retrospective. In Design, Automation and Test in Europe, pages 642–649. IEEE, March 2001.
[4] P. M. Heysters. Coarse-Grained Reconfigurable Processors; Flexibility Meets Efficiency. PhD Thesis, ISBN:90-365-2076-2, Netherlands, 2003.
[5] B. Khailany, W. J. Dally, U. J. Kapasi, P. Mattson, J. Namkoong, J. D. Owens, B. Towles, A. Chang, and S. Rixner. Imagine: media processing with streams. IEEE MICRO, 21(2):35–46, 2001.
[6] C. E. Kozyrakis, S. Perissakis, D. Patterson, T. Anderson, K. Asanovic, N. Cardwell, R. Fromm, J. Golbus, B. Gribstad, K. Keeton, R. Thomas, N. Treuhaft, and K. Yelick. Scalable processors in the billion-transistor era: IRAM. Computer, 30(9):75–78, 1997.
[7] Z. Kwok and S. J. E. Wilton. Register file architecture optimization in a coarse-grained reconfigurable architecture. In Proc. 13th Annual IEEE Symp. Field-Programmable Custom Computing Machines FCCM 2005, pages 35–44, 2005.
[8] T. Miyamori and K. Olukotun. REMARC: reconfigurable multimedia array coprocessor. IEICE Transactions on Information and Systems, 82(5):389–397, November 1998.
[9] M. A. Shami and A. Hemani. Morphable DPU: Smart and efficient data path for signal processing applications. In Proc. IEEE Workshop Signal Processing Systems SiPS 2009, pages 167–172, 2009.
[10] M. A. Shami and A. Hemani. Partially reconfigurable interconnection network for dynamically reprogrammable resource array. In IEEE 8th International Conference on ASIC, pages 122–125. IEEE, October 2009.
[11] H. Singh, M.-H. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh, and E. M. Chaves Filho. MorphoSys: an integrated reconfigurable system for data-parallel and computation-intensive applications. IEEE Transactions on Computers, 49(5):465–481, 2000.
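As a cross-check of the ratios quoted in the abstract and the conclusion, the short sketch below recomputes them from the Table I values. It is a worked example under stated assumptions rather than anything taken from the paper: the names are hypothetical, the ratios are truncated (not rounded) to match the printed digits, and the configuration-bit ratios are computed from 72 bits per column for both the Multiplexer and Hybrid-2 networks (twelve 6-bit entries, as in Section IV-A) instead of the 120-bit memory size listed in Table I. That last choice is an inference, made because it appears to be the only reading that reproduces the quoted 0.037x, 0.026x and 0.107x.

```python
import math

# Values copied from Table I (area in gates, network delay in ps).
AREA_GATES = {"MUX": 8402, "NoC": 87840, "Crossbar": 43008,
              "Hybrid-1": 13416, "Hybrid-2": 9147}
DELAY_PS = {"MUX": 707, "Hybrid-1": 1443, "Hybrid-2": 246}

# ASSUMPTION (not stated in the paper): the configuration-bit ratios use
# 72 bits per column for MUX and Hybrid-2, not the 120 bits of Table I.
CFG_BITS = {"MUX": 72, "NoC": 1920, "Crossbar": 2688,
            "Hybrid-1": 672, "Hybrid-2": 72}

BASELINES = ("MUX", "NoC", "Crossbar", "Hybrid-1")


def truncate(value: float, digits: int) -> float:
    """Drop (rather than round) everything after `digits` decimal places."""
    scale = 10 ** digits
    return math.floor(value * scale) / scale


def hybrid2_ratio(table: dict, baseline: str) -> float:
    """Hybrid-2 cost divided by the baseline cost (lower is better)."""
    return table["Hybrid-2"] / table[baseline]


def hybrid2_speedup(baseline: str) -> float:
    """Baseline network delay divided by the Hybrid-2 network delay."""
    return DELAY_PS[baseline] / DELAY_PS["Hybrid-2"]


if __name__ == "__main__":
    for name in BASELINES:
        print(name,
              "area ratio:", truncate(hybrid2_ratio(AREA_GATES, name), 3),
              "cfg-bit ratio:", truncate(hybrid2_ratio(CFG_BITS, name), 3))
    for name in ("MUX", "Hybrid-1"):
        print(name, "speedup:", truncate(hybrid2_speedup(name), 2))
```

With these inputs the area and delay ratios follow directly from Table I, while the configuration-bit ratios depend on the 72-bit assumption noted above.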
