A NoC Based Distributed Memory Architecture with Programmable and Partitionable Capabilities

Mohammad Adeel Tajammul, M. A. Shami, A. Hemani
School of ICT, Royal Institute of Technology, Stockholm, Sweden.
(tajammul, shami, hemani)@kth.se

S. Moorthi
Dept. of EEE, NIT, Tiruchirappalli, INDIA.
firstname.lastname@example.org

This Research is funded by the Higher Education Commission of Pakistan.
978-1-4244-8971-8/10$26.00 © 2010 IEEE

Abstract: The paper focuses on the design of a Network-on-Chip based programmable and partitionable distributed memory architecture which can be integrated with a Coarse Grain Reconfigurable Architecture (CGRA). The proposed interconnect enables better interaction between the computation fabric and the memory fabric. The system can modify its memory to computation element ratio at runtime. The capabilities of the memory system are analyzed by interfacing it with a Dynamically Reconfigurable Resource Array (DRRA), a CGRA. The interconnect can provide multiple interfaces, each of which supports up to 8 GB/s.

Keywords: Distributed Memory System; DSM; Network on Chip; NoC; Coarse Grain Reconfigurable Architecture; CGRA

I. INTRODUCTION

With the increasing integration of distributed processing elements, Network-on-Chip (NoC) is considered to be a promising scheme . An on-chip interconnect network can provide high-bandwidth, low-latency, scalable and reliable communication. Usually, NoCs have a large amount of memory that supports parallel transactions . Exploring new memory architectures is therefore a need for any Network-on-Chip. This paper presents an on-chip distributed memory architecture for the PHY layer of DRRA, with the following salient features:

Distributed: Because DRRA is a fabric, computation is distributed across the chip [3, 4], which runs several applications in parallel. With distributed memory, the proposed design enables multiple private and parallel execution environments (PREX).

Partitioning: Partitioning is also a distributed exercise and happens in parallel. The proposed system also enables runtime re-partitioning.

Streaming: Individual partitions composed of memory banks (mBanks) should be able to act as a unit and stream data to the computational units. The system can introduce elasticity in streaming by adjusting the delay values.

Performance and Energy: The distributed nature of the memory architecture and the concept of private execution environments enable a short distance between storage and computation, which in turn contributes to low latency. Further, it enables effective power management by allowing unused mBanks to be shut down or put into low power mode.

Scalability: The DiMArch is scalable with the size of memory partitions and clock frequency. The circuit-switched segments of the data Network-on-Chip (dNoC) can be optionally pipelined.

II. RELATED WORK

Reconfigurable architectures of the past decade are investigated for memory organization in . Lambrechts et al.  investigate power and performance for multiple forms of interconnects. The design proposed in this paper is very similar to the b-neg design in . The programmable pipelining and private partitioning differentiate the proposed design from . Further, the control traffic is routed over a bus network as it has lower traffic. Data traffic is routed over the crossbar with programmable pipelined interconnects. Memory systems for MP-SoCs can be either cache based or scratch-pad based. Marescaux  provides a case where scratch-pad memory systems perform better than cache based systems.

SMART CELL  describes an innovative reconfigurable architecture with a distributed data and instruction memory architecture. These memory units are directly connected to processing elements. Each mBank is 1K and works as a scratch-pad memory.

MONTIUM Tile processor  has a local memory for each tile processor. Each tile processor is then connected to a Network-on-Chip (NoC) which can bring data from an on-chip memory via an AHB-bridge interface. The local memory per tile remains fixed in each MONTIUM implementation .

The DRRA architecture with the integrated memory system is discussed in sections III & IV. The proposed memory architecture can alter its memory to computational fabric ratio by partitioning, as discussed in section V. Furthermore, if the system runs with a variable clock, it can alter its critical path accordingly, as discussed in section VI.
III. DRRA ARCHITECTURE

DRRA is a Coarse Grain Reconfigurable Architecture (CGRA) capable of hosting multiple, complete Radio and Multimedia applications. It has resources for the physical layer (PHY layer), Protocol Processing layers (PP layer), application and system control, and runtime management. The DRRA fabric for the PHY layer has been implemented in   and is shown in Figure 1 along with the proposed memory system. A single DRRA cell is composed of a morphable DataPath Unit (mDPU), a Register File (RFile), a sequencer and an interconnect scheme gluing these elements together. Presently, the storage of the DRRA fabric is restricted to RFiles, which are 64 words of 16 bits.

Fig. 1 DRRA Architecture with Memory fabric

mDPUs are native 16-bit integer units with four 16-bit inputs corresponding to two complex numbers and two 16-bit outputs corresponding to one complex number. The mDPU also has two comparators, one for each output, and a counter. The results of the comparators, the counter, and the overflow and underflow checks are logged in a status word read by the sequencer. The mDPU can perform saturation, truncation/rounding, and overflow and underflow checks. The bit-width of the end result can be configured to be anything from 8 to 16 bits.

RFile, the DRRA Register File, is a 64 word, 16 bit register file with dual read and write ports. RFile has a DSP style AGU (Address Generation Unit) with vectorised, circular buffer and bit-reverse addressing that is useful in implementing FFT.

Sequencer is a micro-coded sequencing machine that controls a single mDPU, an RFile and the switchbox. Sequencers can be daisy chained to allow a single sequencer to control adjoining Sequencers within the sliding-window reach. This concept is used to implement a hierarchy of controllers, for instance to implement Rx/Tx FSMs of a MODEM or encode/decode FSMs of a CODEC. With the elastic streaming capability of RFile together with the proposed memory architecture described in section IV, the sequencers provide the capability to implement chained elastic streaming functionalities that match very well the nature of most PHY layers for radio and multi-media applications.

The fabric can be as large as the die allows; several thousand DRRA cells can be accommodated in a 45 nm, 300 mm² die. Figure 1 shows only a fragment for clarity.

IV. DISTRIBUTED MEMORY ARCHITECTURE

The proposed Distributed Memory Architecture (DiMArch) for DRRA is composed of (a) a set of distributed memory banks (mBanks), (b) a circuit-switched data Network-on-Chip (dNoC) that transports data between mBanks and RFiles (DRRA Register Files), and (c) a packet-switched instruction Network-on-Chip (iNoC), a NoC and bus hybrid used to create partitions, program mBanks to stream data and transport instructions from the sequencer to the instruction Switch (iSwitch).

A. Memory Banks (mBank): The distributed memory banks are SRAM macros, typically 2 to 4 KB. The size is a design time decision, as the goal is to align mBanks with the columns of the DRRA fabric. mBanks are controlled by mFSMs, state machines that also act as the interface between mBanks and the data Switches (dSwitch). mFSMs act as programmable address generation units with a general timing model. They implement single read/write, vectorized read/writes with programmable address offset, circular buffer and bit-reversed addressing. mFSMs also provide a general purpose timing model using three delays: an initial delay before a loop, an intermittent delay before every read/write within a loop, and an end delay at the end of the loop before repeating the next iteration. These delays are used to
synchronize the memory to register file streams with the computation. Individual delays can be changed depending on the intermediate results of the computation, which makes streaming elastic. mFSMs are programmed via iNoC with special instructions.

B. Data Network-on-Chip (dNoC): dNoC is a half-duplex circuit-switched mesh Network-on-Chip. The streaming nature of the applications, the inherent QoS guarantees and the improved latency compared to a packet-switched network were the motivations for using a circuit-switched network. A memory partition together with a computation partition is called a private execution environment (PREX). The interface between the memory and computational partitions is as wide as the number of RFiles involved; the width here implies the number of dNoC connections. Each dNoC connection is 256 bits wide; this width can be changed, as it is a GENERIC VHDL parameter in a template. Since the data traffic at each RFile/dNoC (RFMI) interface can only be read or written, half-duplex interconnects are proposed.

dNoC is realized as a mesh network of dSwitches. As shown in Figure 2, each dSwitch is made up of five dSwitch cells (dCell) serving the N, E, W, S and mBank directions. Each dCell has four inputs coming from the other four directions; in the output mode, one of these four inputs is multiplexed out; in the input mode, data from the associated direction enters the dCell. The bidirectional I/O is optionally buffered to cope with long wires and to provide flexibility to implement the planned Dynamic Voltage Frequency Scaling.

Fig. 2 dSwitch (five dCells serving the mBank, South, West, East and North directions, each with IMUX/ISEL, REG, PMUX/PSEL and IOSEL elements)

cFSMs control the temporal behaviour of the dSwitch. They are essential to make multiple mBanks behave as a contiguous memory. Figure 3 shows an example of a memory partition made up of three mBanks A, B and C that bring data to a single Register File (RFile) for processing and also take the data back to the mBanks once processed. The cFSMs associated with each dSwitch are programmed to time multiplex the path to and from the register file in a co-ordinated way, so that it appears as if the RFile is reading from or writing to one large contiguous memory. The compiler ensures that the computation is synchronised with the behaviour of the cFSMs controlling the memory transactions. This works well for signal processing applications with deterministic, cyclo-stationary behaviour. The ability to partially reprogram these streams allows them to be elastic as well. The DRRA sequencers have the hooks to chain these elastic streams, but the present DiMArch does not support chained elastic streams. The architecture can also deal with the degenerate case of nondeterministic, random individual memory transactions, like a normal processor; this case will obviously not benefit from the efficiency of the autonomic (elastic) streaming capability of cFSMs in DiMArch. cFSMs are programmed by special instructions via iNoC.

C. Instruction Network-on-Chip (iNoC): iNoC is a packet-switched network used in DiMArch to program the cFSMs and mFSMs, as packet-switched networks are primarily used for short programming messages and the life of a certain path is very short. It also has the ability of a packetized network to reach any node of the DiMArch from any sequencer. The agility of programming or reconfiguring DiMArch's partitions and behaviours is a key goal of the DRRA architecture to make it dynamically reconfigurable. To achieve this agility,
while retaining the generality of a packet-switched network, two architectural measures have been taken. The first is that the horizontal and vertical segments of the iNoC are a hybrid of bus and NoC behaviours. Any message asserted on an iSwitch is broadcast along the entire length of its vertical segment, behaving like a bus, as the broadcast happens in a single cycle. Every iSwitch on the vertical segment analyzes the message in parallel to check whether the message address is on its associated horizontal segment; if it is, a second broadcast happens on the horizontal segment. Again, every iSwitch on the horizontal segment listens to the broadcast and checks whether the message is addressed to it; if it is, the iSwitch forwards it to the zFSM, which analyzes it and acts on it appropriately. Because of the bus like behaviour, each broadcast takes a single cycle, i.e., each iSwitch can be reached in two cycles. The second measure is the partitioning capability, which is explained in section V.

Fig. 3 Single memory partition (mBanks A, B and C connected through dSwitches + cFSMs)

V. PRIVATE PARTITIONING

As in Figure 4, consider three Sequencers which need access to multiple memory banks (mBanks). mBanks in the first row have dedicated access from the Sequencer in the same column. All splitters are open at the start. Sequencers 1 and 2 issue an instruction for their respective instruction switch (iSwitch) to close the vertical splitter for top-to-down access. Horizontal splitters are set to remain open. The instruction switch takes one cycle to process this instruction. After a wait of one cycle, Sequencers 1 and 2 can access the iSwitches in the second row (row 1). This one cycle wait can be used to configure the mFSM/cFSM for the required traffic patterns. Sequencer 1 issues another instruction for iSwitch (1, 1) to get access to iSwitch (1, 2); this instruction closes the vertical splitter top-down. Then Sequencer 1 instructs iSwitch (1, 2) to close the horizontal splitter right-left between iSwitch (0, 2) and iSwitch (1, 2). At this point all sequencers have access to their desired private partitions.

Fig. 4 Private Partitioning (Sequencers 0 to 2 above a 3x3 grid of iSwitches, iSwitch (0,0) to iSwitch (2,2))

At run-time, Sequencer 2 can gain access to iSwitch (2, 2) to allocate more memory for additional memory requirements. However, since the proposed architecture does not provide memory locks, all access conflicts are resolved at compile time. When two PREX need a shared memory space, a shared iSwitch is specified. For example, Sequencer 0 and Sequencer 1 can specify iSwitch (0, 1) as a shared space. In that case, iSwitch (0, 1) will receive RFile instructions from both PREX.

VI. PROGRAMMABLE PIPELINING

Consider an example where three RFiles are connected to nine MTiles in a 3x3 configuration. For a single block transfer from mBank to RFile, the mBank data is first sent to a dSwitch. The pipelined mode is used when PMUX is programmed to use the register in its path (see Figure 2). The pipelined path is omitted in Single Cycle Multi-Hop Transfer (SCMHT) mode. The concerned dSwitch routes the data to the neighbouring dSwitch. At the destination, data is loaded directly into the RFile from the neighbouring dSwitch. The number of cycles of each transfer is not always equal to the number of hops; the cycles may be reduced if any dSwitch is in SCMHT mode. The critical path for a single cycle transfer for any given wireload model is variable; an increase in the number of hops will increase the critical path exponentially. So, increasing the maximum number of hops in SCMHT mode will reduce the clock speed of the system at the same rate. Hence, the number of hops should only be increased in cases where the gain of SCMHT mode exceeds the degradation due to the lower clock frequency.
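The relation between hops and cycles in Section VI can be illustrated with a small Python sketch. The cycle model used here is an assumption made only for illustration: a dSwitch whose PMUX selects the register costs one cycle, a dSwitch in SCMHT mode costs none, and every transfer takes at least one cycle. The function names are hypothetical and are not part of DiMArch.

```python
from typing import Sequence


def transfer_cycles(pipelined: Sequence[bool]) -> int:
    """Cycles for one block transfer along a path of dSwitches.

    pipelined[i] is True when dSwitch i uses the register in its path
    (pipelined mode) and False when it is bypassed (SCMHT mode).
    Assumed model: one cycle per registered dSwitch, minimum one cycle.
    """
    if len(pipelined) == 0:
        raise ValueError("path must contain at least one dSwitch")
    registered_hops = sum(1 for is_registered in pipelined if is_registered)
    return max(1, registered_hops)


def longest_unregistered_run(pipelined: Sequence[bool]) -> int:
    """Longest chain of consecutive SCMHT dSwitches on the path.

    Used here only as a proxy for the combinational path length that
    limits the clock frequency.
    """
    longest = 0
    current = 0
    for is_registered in pipelined:
        current = 0 if is_registered else current + 1
        longest = max(longest, current)
    return longest


# A three-hop path: fully pipelined, pure SCMHT, and a mixed setting.
fully_pipelined = [True, True, True]
pure_scmht = [False, False, False]
mixed = [True, False, True]
```

Under this simplified model the three-hop path takes three cycles when fully pipelined and one cycle in pure SCMHT mode, while the unregistered chain grows from zero to three dSwitches, which is the source of the clock frequency penalty discussed above.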
VII. SYSTEM COSTS AND OVERHEADS

The system has three types of costs and overheads in terms of cycles:

1. Computational Cost ($C_{Computation}$), the time (cycles) spent in processing data by the DPU. It depends on the mode of the mDPU and the data width, and is given by the equation:

$$C_{Computation} = \frac{N_{Sample}}{2} + O_{Computation} \qquad (1)$$

where $N_{Sample}$ is the Number of Samples and $O_{Computation}$ is the Overhead of Computation.

$$O_{Computation} = C_{Pipeline} + C_{Load\,Store} \qquad (2)$$

where $C_{Load\,Store}$ is the RFile load store cost and $C_{Pipeline}$ is the cost of the pipeline, which changes with each mDPU mode.

2. Reconfiguration Overhead, the time required to reconfigure the interconnect partitioning. This time is directly proportional to the number of splitters to be programmed. Programming a single splitter takes three cycles (Instruction identification, Decoding and Partition set/reset).

3. Interconnect Overhead, the amount of time (cycles) it takes for the data to move between mBank and RFile.

The Computation Interconnect Reconfiguration cost ($C_{CIR}$) is the cost of reconfiguration for the computational fabric, which depends directly on the number of interconnects to reconfigure and the cost of reconfiguration. This cost can have a maximum value of six cycles.

$$C_{CIR} = N_{Interconnect} \cdot C_{Comp.\,Reconf} \qquad (3)$$

The mBank Interconnect Reconfiguration cost ($C_{MIR}$) is directly proportional to the number of instructions used to reconfigure the memory partitions ($C_{PR}$).

$$C_{MIR} = N_{Instruction} \cdot C_{PR} \qquad (4)$$

Reconfiguration is performed when (1) a new mBank is to be allocated, (2) the instruction partitioning interconnects are reconfigured to change the direction of instruction flow, or (3) the computational fabric interconnects are reconfigured. The interconnect cost ($C_{Interconnect}$) depends on data transfers and is represented as:

$$C_{Interconnect} = N_{Transfers} \cdot C_{Num\,of\,hops} \qquad (5)$$

where $N_{Transfers}$ is the Number of Transfers and $C_{Num\,of\,hops}$ is the cost in cycles per data transfer.

VIII. CASE STUDIES

A. Mapping one dimensional point FFT

An implementation of a radix-2 FFT butterfly was carried out with four mDPUs used to perform two butterfly operations (one real and one imaginary) . The operations are pipelined and performed in six cycles. Depending on the data access patterns, the butterfly can be reused to implement various stages of the FFT. Hence, an algorithm can be defined for the traffic pattern between RFile, mBank and mDPU. In this case, the FFT traffic is manually mapped. A maximum of sixteen butterflies are used in parallel. A single butterfly is fed with data samples and twiddle factors from RFiles and can perform 32 operations in 40 pipelined cycles. Between FFT stages, reordering and reconfiguration are performed by a common three stage data transfer. If the number of butterfly operations per stage is more than the number of available butterflies, the additional butterflies are processed serially. The interconnect reconfiguration is performed if re-ordering is required between neighbouring RFiles . During reordering the mBank truly behaves as a scratch-pad memory, where intermediate data is stored.

Figure 5 FFT Throughput.

The first step in this mapping is loading the correct data in the correct RFile. Data is loaded from the memory to all the RFiles. By keeping a correct order at instruction level, data is picked up by the correct RFile. Twiddle factors are also loaded to the RFile, which acts as a Look Up Table (LUT).

Figure 6 FFT Overhead.

Figure 5 extends the results of  for the DRRA architecture. To keep the comparison fair, an equal number of computational elements is used. The last two results for SMARTCELL and FPGA are interpolated based on . The proposed system outperforms the others by an order of magnitude more than the expected error in interpolation. For FFTs larger than 512 points, SmartCell is 1.24 to 1.36 times slower than DRRA. The implementations of FFTs smaller than 512 points on one such parallel system do not exploit locality. Figure 6 illustrates that the data for small-sized FFTs (less than 512 points) spends more time in motion than in computation. The memory-data interconnect overhead takes more time than, or time comparable to, the time spent in computation. This can be further elaborated by the fact that it takes marginally fewer cycles to compute the 64 point FFT using a single butterfly (real and imaginary) set. Such a case uses 16 times fewer resources compared to the case presented (16 butterflies).
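The cost model of Section VII can be written down as a minimal Python sketch that evaluates equations (1) to (5) and the three-cycle splitter programming cost. The function names and the numeric inputs in the usage lines are hypothetical and chosen only for illustration; they are not measurements from the experiments reported here.

```python
# Cycles needed to program one splitter, as stated in Section VII:
# instruction identification, decoding, and partition set/reset.
SPLITTER_PROGRAM_CYCLES = 3


def computation_overhead(c_pipeline: int, c_load_store: int) -> int:
    """Equation (2): O_Computation = C_Pipeline + C_LoadStore."""
    return c_pipeline + c_load_store


def computation_cost(n_samples: int, c_pipeline: int, c_load_store: int) -> float:
    """Equation (1): C_Computation = N_Sample / 2 + O_Computation."""
    return n_samples / 2 + computation_overhead(c_pipeline, c_load_store)


def partition_reconfiguration_overhead(n_splitters: int) -> int:
    """Reconfiguration overhead, proportional to the number of splitters."""
    return n_splitters * SPLITTER_PROGRAM_CYCLES


def computation_interconnect_reconf_cost(n_interconnect: int, c_comp_reconf: int) -> int:
    """Equation (3): C_CIR = N_Interconnect * C_Comp.Reconf."""
    return n_interconnect * c_comp_reconf


def mbank_interconnect_reconf_cost(n_instruction: int, c_pr: int) -> int:
    """Equation (4): C_MIR = N_Instruction * C_PR."""
    return n_instruction * c_pr


def interconnect_cost(n_transfers: int, c_num_of_hops: int) -> int:
    """Equation (5): C_Interconnect = N_Transfers * C_NumOfHops."""
    return n_transfers * c_num_of_hops


# Hypothetical inputs: 64 samples, pipeline cost 6, load/store cost 4.
example_computation = computation_cost(n_samples=64, c_pipeline=6, c_load_store=4)
# Hypothetical inputs: 4 splitters to program, 10 transfers of 3 cycles each.
example_partitioning = partition_reconfiguration_overhead(n_splitters=4)
example_interconnect = interconnect_cost(n_transfers=10, c_num_of_hops=3)
```

Keeping each equation in its own function makes it easy to compare the computation term with the interconnect term for a given mapping, which is the comparison that Figure 6 reports for the FFT case.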
B. 2D Mapping vs McNoC

The two dimensional FFT is performed in two steps. First, row-wise FFTs are calculated. This is followed by the column-wise FFT calculation. Hence two dimensional FFTs are broken down into multiple one-dimensional FFTs. They are mapped using the same principles as in section VIII-A. In this experiment, the size of the FFT remains constant and the number of resources is increased. Furthermore, an extra step is added where a horizontal to vertical translation is performed. The results for such a mapping are given in Figure 7.

Fig. 7 Cycle count for 2D-FFT

IX. CONCLUSION

This paper proposes a programmable and partitionable interconnect interface as a method of communication between mBanks and RFiles. The programmable interconnect supports pipelined or bufferless modes. The instruction interconnect also has a private partitioning capability, where multiple sequencers can communicate with different mBanks using different segments of the network. The controllers within the system help in providing patterns of data which can be routed based on the instructions.

The FFT and 2D experiments show the overhead of the interconnect compared to the computational cost. The results show that, for the given programmable interconnect, the best throughput is obtained when reasonable computational resources are utilized with good locality to the data. Moreover, gate level synthesis results show that the system can run at up to 400 MHz on 90 nm technology.

As a part of future work, another sequencer will be added which will act as a dedicated main controller for memory. A compiler to automate the process of mapping is also under development.

REFERENCES

[1] Axel Jantsch and Hannu Tenhunen, “Networks-on-Chip”, Kluwer Academic Publishers, 2003.
[2] William James Dally and Brian Towles, “Principles and Practices of Interconnection Networks”, Morgan Kaufmann Publishers, 2004.
[3] M. A. Shami and A. Hemani, “Morphable DPU: Smart and Efficient Data Path for Signal Processing Applications,” vol. SiPS, 2009, pp. 167–172.
[4] M. A. Shami and A. Hemani, “Partially Reconfigurable Interconnection Network for Dynamically Reprogrammable Resource Array,” in IEEE 8th International Conference on ASIC (ASICON’09), 2009, pp. 122–125.
[5] M. Herz, R. Hartenstein, M. Miranda, and E. Brockmeyer, “Memory Addressing Organization for Streaming-based Reconfigurable Computing,” vol. 2, 2002, pp. 813–817.
[6] A. Lambrechts, P. Raghavan, M. Jayapala, B. Mei, F. Catthoor, and D. Verkest, “Interconnect Exploration for Energy versus Performance tradeoffs for Coarse Grained Reconfigurable Architectures,” vol. 17, no. 1, January 2009, pp. 151–155.
[7] T. Marescaux, E. Brockmeyer, and H. Corporaal, “The Impact of Higher Communication Layers on NoC supported MPSoCs,” in Proceeding of the First International Symposium on Network-on-Chip (NOCS’07), 2007, pp. 107–116.
[8] C. Liang and X. Huang, “SMARTCELL: A Power Efficient Reconfigurable Architecture for Data Streaming application,” in IEEE Workshop on Signal Processing Systems (SiPS’08), 2008, pp. 257–262.
[9] G. Rauwerda, P. Heysters, and G. Smit, “Towards Software Defined Radios using Coarse-Grained Reconfigurable Hardware,” vol. 16, no. 1, 2008, pp. 3–13.
[10] L. Smit, A. Molclerink, P. Wolkotte, and G. Smit, “Implementation of 2-D 8x8 IDCT on Reconfigurable MONTIUM core,” in International Conference on Field Programmable Logic and Applications, 2007, pp. 562–566.
[11] C. Liang and X. Huang, “Mapping parallel FFT algorithm onto SMARTCELL Coarse-Grained Reconfigurable Architecture,” in IEEE 20th International Conference on Application-specific Systems, Architectures and Processors, 2009, pp. 231–234.
[12] X. Chen, Z. Lu, A. Jantsch, and S. Chen, “Supporting Distributed Shared Memory on Multi-core Network-on-chip using a Dual Micro-coded Controller”, 2010, pp. 39–44.