High-Performance NoC Interface with Interrupt Batching for Micronmesh MPSoC Prototype Platform on FPGA

Heikki Kariniemi and Jari Nurmi
Department of Computer systems
Tampere University of Technology
Tampere, Finland
Email: {heikki.kariniemi, jari.nurmi}@tut.fi

This research is funded by the Academy of Finland under grant 122361.
978-1-4244-8971-8/10$26.00 © 2010 IEEE

Abstract: This paper presents a new NoC Interface (NI) targeted for improving the performance of the Micronmesh Multiprocessor System-on-Chip (MPSoC). The previous version of the NI called Micronswitch Interface (MSI) can zero-copy messages as it sends and receives them. It offloads also some functionalities of the communication protocol from software (SW) to hardware (HW), but interrupt processing produces extra SW overhead and reduces the performance. For this reason, an improved version of the MSI called MSI-with-Queues (MSIQ) was designed with a new queue mechanism in order to reduce the frequency of interrupts and the SW overhead. Owing to the new queue mechanism of the MSIQ it is possible to batch and service multiple interrupt service requests by every execution of the Interrupt Service Routine (ISR). Additionally, the new MSIQ HW is able to send and receive messages while the processor is running the ISR. The performance of the MSIQ is also analyzed in this paper. The results show that the queue mechanism improves the performance with moderate hardware costs.

I. INTRODUCTION

In computer systems where computers are connected by high-speed networks the operation of the network interfaces may become a main obstacle for the communication throughput and the performance. This is because the communication between the CPUs and the network interfaces produces extra software overhead. Several methods like, for example, zero-copying, protocol offloading, jumbo frames, message fragmentation, and interrupt coalescing have been presented in literature [1, 2, 3, 4, 5, 6, 7] for eliminating this problem. Due to certain similarities of architectures these same methods can be used for solving the same problem in the MPSoCs where distributed memory and message-passing communication architectures are used.

In the Micronmesh MPSoC platform [8] the tightly coupled operation of the Micron Message-Passing (MMP) protocol [9] and the MSIQ enables direct message transfers between the local variables of the user threads and the MSIQ which is a technique called zero-copying in the literature [1, 2, 3, 5]. The zero-copying reduces communication latency and improves the performance, because it eliminates copying of messages from user memory to MSIQ through intermediate buffers in the kernel memory. The multiplexing and demultiplexing functions of the MMP protocol are also offloaded to the MSIQ HW in order to reduce software overhead. Protocol offloading is used for speeding up the protocol functions by HW and for reducing the software overhead [1, 2, 3, 4, 5].

The interrupt-driven systems provide low latency and low SW overhead if the interrupt rate is low, but the performance degrades if the interrupt frequency grows. Interrupts produce additional SW overhead by causing context switching from a user mode to a kernel mode before the execution of the ISR and back to the user mode from the kernel mode after the execution of the ISR is finished [1, 2, 3, 4, 5, 6, 7, 10, 11]. The last three methods mentioned above are used for reducing the software overhead produced by the interrupt processing and the processor utilization. The usage of the jumbo frames, i.e. large messages, makes it possible to reduce the message rate and the interrupt frequency [2, 3, 5]. The fragmentation is related to the jumbo frames which are usually fragmented to smaller frames before sending [1, 3, 5]. The MSIQ HW also fragments the messages to small fixed sized packets as it sends them to the Micronmesh NoC and assembles the received messages from the received packets.

The interrupt coalescing [2, 3, 5] is a technique used for batching interrupt service requests so that every execution of the ISR could serve several requests, which reduces the interrupt frequency and the software overhead. It has also variants called Interrupt Multiplexing [1] and Enabling Disabling (ED) technique [4]. In a typical implementation the interrupts are delayed until a certain amount of interrupts has been batched or a timeout expires. The implementation used in the new MSIQ works slightly differently. When receiving the messages, the MSIQ generates an interrupt immediately after it has received a new message. If more messages arrive or have arrived in bursts during the execution of the ISR, they are also served. This method provides a low latency and a good burst tolerance against bursts of short messages in addition to the reduced interrupt frequency. When sending the messages, the MSIQ sends several messages successively in batches. It generates the interrupts after finishing the sending of the first message of the batches, which makes it possible to start running the ISR while the sending is still continued. As a consequence of this, the ISR can also be running concurrently with the MSIQ HW, which improves the performance further.

In the MSIQ the interrupt coalescing is implemented with send-request and receive-request queues. The results of the performance analysis and the logic synthesis presented in this paper show that the improved performance is achieved with small additional HW costs compared to the old MSI [12]. The MSIQ could also be used with polling, but polling is usually used with interrupts and more difficult to implement [6, 7]. Furthermore, the length of the polling period must be carefully adapted to the message rate in order to achieve a good performance, because if it is too long, the communication latency grows, and if it is too short, the software overhead grows.

This paper is organized as follows. Section II presents the architecture and the operation of the new MSIQ. Section III presents the performance analysis and the HW costs of the new MSIQ, and finally, Section IV concludes this paper.

II. MICRONSWITCH INTERFACE WITH QUEUES

The Micronmesh MPSoC platforms [8] consist of Micronmesh nodes that contain a local NIOS II processor [13], local on-chip memories, a timer, a local Avalon system bus [14], the MSIQ, and the Micronswitch [8]. The NIOS II processors are running distinct MicroC/OS II real-time kernels [11] in every Micronmesh node. The MSIQs connect the Micronmesh nodes to the Micronmesh NoC through the local Micronswitches.
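The receive-side batching described in the Introduction (an interrupt is raised for the first message, and messages that arrive while the ISR is still running are served by the same execution) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the MSIQ implementation: the function name count_isr_runs, the unbounded request queue, the zero interrupt response time, and the cycle counts in the example are all hypothetical.

```python
def count_isr_runs(arrivals, t_start, t_loop):
    """Count ISR executions needed to serve all messages.

    arrivals: sorted message arrival times in clock cycles.
    t_start:  cycles spent in the ISR before its service loop (hypothetical value).
    t_loop:   cycles spent per serviced message (hypothetical value).
    Assumptions: the request queue never overflows and the interrupt
    response time is zero.
    """
    runs = 0
    i = 0
    while i < len(arrivals):
        runs += 1                      # a new interrupt starts a new ISR execution
        now = arrivals[i] + t_start    # the ISR reaches its service loop
        # Serve every message that has already arrived when the loop
        # gets to it, including messages that arrived during the ISR.
        while i < len(arrivals) and arrivals[i] <= now:
            now += t_loop
            i += 1
    return runs


burst = [0, 10, 20, 30]            # four messages arriving back to back
spaced = [0, 1000, 2000, 3000]     # four messages arriving far apart
burst_runs = count_isr_runs(burst, t_start=20, t_loop=100)
spaced_runs = count_isr_runs(spaced, t_start=20, t_loop=100)
```

With these illustrative numbers the burst is served by a single ISR execution, whereas the widely spaced messages each cause their own interrupt, which is the behaviour the queue mechanism is intended to exploit.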
A. The Architecture of the MSIQ

The MSIQ consists of three main sub-blocks which are the MSIQ Rx-master, the MSIQ Tx-master, and the MSIQ Slave. It is depicted on the bottom of schematic Fig. 1. The MSIQ Rx-master on the left receives messages from the NoC, the MSIQ Tx-master on the right sends messages to the NoC, and the MSIQ Slave in the middle is used for controlling and configuring the operations of the MSIQ Masters through the MSIQ’s register interface. The MSIQ Slave is also responsible for generating interrupt service requests according to the MSIQ Masters status.

Figure 1. The architecture of the MSIQ.

The MSIQ’s register interface is partly presented in Table 1. It contains a status register MSIQ-status which is a combined status of the MSIQ Masters. The values of the Tx-control, the Tx-base-address, the Tx-routing-header, and the Tx-protocol-control-header registers form send-requests that are stored to the HW send-request queue of the MSIQ HW (HW SEND-REQUEST QUEUE). The writing of these registers starts the sending of one message. Respectively, the values of the Rx-routing-header and the Rx-protocol-control-header registers form the receive-requests that are stored into the HW receive-request queue of the MSIQ HW (HW RECEIVE-REQUEST QUEUE). The reading of these registers ends the receiving of one message. The MSIQ Slave contains also four FIFOs for storing the send-requests and two FIFOs for storing the receive-requests like Table I explains.

The MSIQ Tx-master starts sending messages as it receives send-requests through the HW send-request queue from the MSIQ Slave. The MSIQ Tx-master’s Avalon Interface (AVA-TX-IF) reads the messages directly from the Tx-buffers from the local memory (LOCAL MEMORY), fragments the messages, generates packets of the fragments, and writes the packets to the Tx-FIFO from which the MSIQ Tx-master’s Tx-interface (TX-IF) sends them to the Micronswitch. Packets consist of two headers and two payload words [9, 12]. The addresses of the messages are passed to the MSIQ Tx-master’s Avalon interface through the Tx-base-address-FIFO. In Fig. 1 this address points to the beginning of the Tx-buffer A of thread B, which is illustrated by arrow A. The routing headers and the protocol control headers of the packets are stored into the Tx-routing-header-FIFO and the Tx-protocol-control-header-FIFO. The control register values are passed through the Tx-control-FIFO. After finishing the sending of the message, the MSIQ Tx-master changes its status in order to make the MSIQ Slave to generate an interrupt service request, reads the next send-request from the HW send-request queue, and continues sending of messages till the HW send-request queue becomes empty. It can continue sending while the processor is running the ISR. The maximum size of the message batches depends on the size of the HW send-request queue. The larger the HW send-request queue the more messages can be sent without interrupts. If only one message could be sent at a time, the execution time of the interrupts would dominate the total sending time especially if the messages would be short [12]. Hence, owing to the HW send-request queues it is possible to reduce the interrupt frequency and improve the performance.

TABLE I. MSIQ’S REGISTER INTERFACE AND QUEUES

Register | Description
MSIQ-status | The common status register of the MSIQ Masters.
Rx-control | The control register used for controlling the MSIQ Rx-master’s operation.
Rx-base-address | The base-address of the Rx-buffer table.
Rx-routing-header | The Rx-routing header of the last packet of the received message. This register is part of the receive-requests queue and it is the output of the Rx-routing-header-FIFO.
Rx-protocol-control-header | The Rx-protocol-control header of the last packet of the received message. This register is part of the receive-request queue and it is the output of the Rx-protocol-control-header-FIFO.
Tx-control | The Tx-control register used for controlling the MSIQ Tx-master’s operation. This register is part of the send-request queue and it is the input of the Tx-control-FIFO.
Tx-base-address | The start address of the message stored into the Tx-buffer. This register is part of the send-request queue and it is the input of the Tx-base-address-FIFO.
Tx-routing-header | The Tx-routing header template of the packets of the message to be sent. This register is part of the send-request queue and it is the input of Tx-routing-header-FIFO.
Tx-protocol-control-header | The Tx-protocol-control header template of the packets of the message to be sent. This register is part of the send-request queue and it is the input of the Tx-protocol-control-header-FIFO.

The MSIQ Rx-master’s Rx-interface (RX-IF) receives packets from the Micronswitch and writes them to the Rx-FIFO. The MSIQ Rx-master’s Avalon interface (AVA-RX-IF) reads the packets from the Rx-FIFO and writes the packet payloads to the Rx-buffers which are in the local memory. It obtains the Rx-buffer addresses from the
local memory from the Rx-buffer table (RX-BUFFER TABLE), which is referenced by the Rx-base-address register as arrow B in Fig. 1 illustrates, and computes the storage addresses of the packet payloads. When doing this, the MSIQ Rx-master’s Avalon interface demultiplexes and assembles the messages of different Rx-channels from one input packet stream through the Rx-FIFO to multiple Rx-buffers. The Channel Identifiers (CID) of the protocol control headers are used for addressing the Rx-buffer table elements as arrow C illustrates. The Rx-buffer table elements contain the Rx-buffer addresses as arrow D illustrates. They are used by the MSIQ Rx-master for addressing the Rx-buffers as arrow E illustrates. After finishing the receiving of a message, the MSIQ Rx-master’s Avalon interface writes the receive-request to the HW receive-request queue and changes its status in order to make the MSIQ Slave generate an interrupt service request. If the HW receive-request queue is not full, the receiving can be continued while the processor is running the ISR. Since every execution of the ISR can service multiple receive-requests, the number of interrupts can be reduced. This happens especially if messages are short and several messages arrive in bursts between the consecutive executions of the ISR. Furthermore, the performance also improves, because the receiving needs to be stopped less frequently.

B. The MSIQ device driver and the MMP protocol

The main parts of the MSIQ device driver (MSIQ SW) are a state data structure, send (msiq_send) and receive (msiq_receive) functions, and the ISR (msiq_isr). The MSIQ SW is used by the MMP protocol’s functions for controlling the operations of the MSIQ. The MMP protocol is a messaging layer protocol which forms an Application Programming Interface (API) for programming fault-tolerant message-passing applications [9]. This API contains, for example, functions for sending (mmpp_send) and receiving (mmpp_receive) messages. The MSIQ SW’s state data structure also contains a SW send-request queue and a Tx-serviced queue. In the SW send-request queue the send-requests are pointers to the data structures of the MMP protocol’s channels [9] which contain the register values of the send-requests to be stored into the HW send-request queue. The elements of the Tx-serviced queue are pointers to the Tx-channels’ signaling semaphores.

C. Sending of messages

The messages are sent in the following way.

1. A thread calls the mmpp_send function which calls the msiq_send function of the MSIQ SW.

2. The msiq_send function first puts the address of the Tx-channel’s data structure to the SW send-request queue. Then it reads the status of the MSIQ. If the MSIQ Tx-master is idle, it reads the send-request from the SW send-request queue, stores the address of the Tx-channel’s signaling semaphore to the Tx-serviced queue, and writes the send-request to the HW send-request queue. This enables the MSIQ Tx-master to send, and the operation continues in step three. If the MSIQ Tx-master is not idle, the msiq_send function returns and lets the ISR (msiq_isr) of the MSIQ device driver initialize the sending of the next message when the processor starts running it in step four after the previous send is finished. The accessing of the MSIQ SW’s state data structure and the MSIQ’s register interface is controlled by a semaphore so that they can be accessed only by one thread at a time or the msiq_isr. Additionally, because the msiq_isr also has a higher priority than the threads, it can be guaranteed that the MSIQ SW’s data structures and queues are maintained correctly.

3. The MSIQ Tx-master’s Avalon interface reads the send-request from the HW send-request queue and starts reading a message from the Tx-buffer, slices it into packet payloads, generates both of the headers for every packet, and writes complete packets to the Tx-FIFO. The MSIQ Tx-master’s Tx-interface reads packets from the Tx-FIFO and sends them to the Micronswitch. After the sending of the message is finished, the MSIQ Tx-master’s Avalon interface changes its status and the MSIQ Slave generates an interrupt service request accordingly, which starts the execution of the MSIQ ISR in step four. If the HW send-request queue is not empty yet, the MSIQ Tx-master’s Avalon interface reads the next send-request from it and continues sending messages until the queue is empty while the processor is running the msiq_isr (ISR) in step four.

4. The processor starts running msiq_isr (ISR). The msiq_isr acknowledges the interrupt service request, reads the address of the signaling semaphore from the Tx-serviced queue, and posts the signaling semaphore to the thread which called the mmpp_send function. This wakes up the thread and the mmpp_send function returns. If the SW send-request queue is not empty, the msiq_isr reads the next send-request from it, stores the address of the Tx-channel’s signaling semaphore to the Tx-serviced queue, and writes the next send-request to the HW send-request queue, which enables the sending and the interrupts again. These operations are repeated in a loop until all of the signaling semaphores of the serviced send-requests have been posted from the Tx-serviced queue and either the HW send-request queue is full or the SW send-request queue is empty.

As steps three and four show, the HW send-request queue enables the interrupt batching. Additionally, the Tx-buffers are mapped to the local variables of the threads and the MSIQ HW uses DMA (Direct Memory Access) transfers for zero-copying the messages directly from the Tx-buffers. The MSIQ also slices the messages into packets as it multiplexes and sends them in one packet stream to the Micronmesh NoC, which implements message fragmentation.

D. Receiving of messages

The messages are received in the following way.

1. A thread calls the mmpp_receive function, which prepares the Rx-channel for receiving by deasserting the lock bit and by updating the address field of the Rx-channel’s Rx-buffer table element. Then it calls the msiq_receive function of the MSIQ SW which enables the MSIQ Rx-master to receive messages.

2. The MSIQ Rx-master’s Rx-interface receives packets from the Micronswitch and writes them to the Rx-FIFO. The Rx-master’s Avalon interface reads the packets from the Rx-FIFO one by one, computes the addresses of the Rx-buffer table elements by adding the packets’ CIDs multiplied by four to the Rx-base-address register’s value, and reads the Rx-buffer table elements from the local memory. Then it multiplies the packets’ sequence numbers carried in the protocol control headers by eight and the address field of the Rx-buffer table element by four. The sums of these two products are the storage addresses of the packet payloads. These multiplications are performed by simple shift left operations. After computing the storage addresses, the MSIQ Rx-master writes the packet payloads to the Rx-buffers. If successive packets have the same CID, the Rx-master can reuse the Rx-buffer table element and only the storage address must be computed again for each of the packets separately. Otherwise, the Rx-buffer table elements must be read from the memory. After the last packet of the message is received, the MSIQ Rx-master’s Avalon interface asserts the lock bit and updates the address field of the Rx-buffer table element to point to the end of the message, writes the Rx-buffer table element to the memory, writes the receive-request to the HW receive-request queue, and changes its status in order to make the MSIQ Slave generate an interrupt service request. Then it continues receiving messages until the HW receive-request queue is full while the msiq_isr (ISR) is executed in step three.

3. The processor starts running the msiq_isr (ISR) function. The msiq_isr acknowledges the MSIQ Rx-master’s interrupt service request, reads the receive-request from the HW receive-request queue, obtains the address of the Rx-channel’s data structure by the CID from the
MSIQ SW’s data structure, and posts the Rx-channel’s signaling semaphore to the thread that called the mmpp_receive function. These operations are repeated in a loop until the HW receive-request queue is empty or a certain maximum number of receive-requests are serviced.

Hence, the HW receive-request queue of the MSIQ can be used for batching the interrupts. The MSIQ Rx-master’s Avalon interface also offloads the MMP protocol’s functions partly by using the Rx-buffer table for demultiplexing interleaved packets of different channels from a single input packet stream according to the CIDs. Furthermore, because the Rx-channels’ Rx-buffers are mapped to the local variables of the threads [9], it can use DMA for zero-copying and assembling the messages to the Rx-buffers.

III. PERFORMANCE ANALYSIS

A theoretic approach is used for estimating the performances of the MSI and the MSIQs. This is because several factors like, for example, the operation speed of memories, the size of cache memories, the operation delay of interrupt logic etc. affect the performance, and measurements with only one configuration would not produce reliable estimates. However, the execution time of the ISR was measured for calculations with a simple platform where the MSIQ Masters were connected to different ports of a dual-port on-chip SRAM which contained the buffers. Furthermore, the program code and data were stored to a different single-port on-chip SRAM. The performance analysis is targeted for comparing the operations, the costs, and the performances of the new MSIQ and the MSI.

The theoretic maximum throughputs with messages of different sizes represent the peak communication performances achievable when as many messages as possible are sent or received continuously. In the first step of the analysis the performance of the MSIQ HW is analyzed. The result of the first step is used for simplifying the second step of the performance analysis where the performance of both the MSIQ HW and the MSIQ SW is analyzed together.

A. The performance of the MSIQ HW

As messages are sent, the MSIQ Tx-master’s Avalon interface reads packet payloads of two words from the Tx-buffers, generates packets, and stores the packets to the Tx-FIFO. After storing the last packet of the message to the Tx-FIFO, it changes its status in order to make the MSIQ Slave generate an interrupt. The latency of reading the payloads of N_{pck} packets is D_{read}(N_{pck}) = N_{pck} × 4 + 2 clock cycles. This includes the time required for generating and storing N_{pck} packets to the Tx-FIFO. The latency of sending N_{pck} packets from the Tx-FIFO to the Micronswitch is D_{send}(N_{pck}) = N_{pck} × 5 clock cycles respectively. Since D_{read}(N_{pck}) ≤ D_{send}(N_{pck}) when N_{pck} ≥ 2, it can be concluded that the MSIQ Tx-master’s Tx-interface limits the throughput.

The MSIQ Rx-master’s Avalon interface reads packets from the Rx-FIFO, reads the Rx-buffer table elements and computes the storage addresses, and writes the packet payloads to the Rx-buffers. After the last packet of a message it changes its status in order to make the MSIQ Slave generate an interrupt. The latency of writing the payloads of N_{pck} packets to the Rx-buffer is D_{write}(N_{pck}) = 2 + N_{pck} × 2 + 2 clock cycles. The latency of receiving N_{pck} packets through the Rx-interface of the MSIQ Rx-master (RX-IF) is D_{receive}(N_{pck}) = N_{pck} × 5 clock cycles. Since D_{write}(N_{pck}) ≤ D_{receive}(N_{pck}) when N_{pck} ≥ 2, it can be concluded that the MSIQ Rx-master’s Rx-interface limits the throughput.

As was shown, the Tx-interface and the Rx-interface of the MSIQ Masters limit the throughputs like in the original MSI [8]. Therefore, in order to simplify the performance analysis of the MSIQ HW and SW it can be assumed that the processing of every packet takes five clock cycles also by both of the Avalon interfaces of the MSIQ Masters and that D_{read}(N_{pck}) = D_{send}(N_{pck}) = D_{write}(N_{pck}) = D_{receive}(N_{pck}) = N_{pck} × 5 clock cycles. Owing to this simplification and because the interfaces operate at the same clock rate, it is no longer necessary to take into consideration the filling of the Tx-FIFO and the emptying of the Rx-FIFO.

B. The performance of the MSIQ SW and HW

In the performance analysis a couple of things must be taken into consideration. Firstly, the length of the messages and the size of the queues Q_{size} affect the theoretic maximum throughput. Secondly, the MSIQ masters can receive and send messages while the local processors are running the ISR. Additionally, the ISR (msiq_isr) consists of different Tx-ISR and Rx-ISR branches for servicing interrupts caused by the MSIQ Tx-master and the MSIQ Rx-master as was described in sections II.C and II.D.

The execution time of the Tx-ISR is

    T_{tx-isr}(n) = T_{tx-start} + n × T_{tx-loop},    (1)

where T_{tx-start} is the time consumed in the beginning of the execution of the ISR before the Tx-loop iterations and where n = 1, …, Q_{size} is the number of serviced send-requests. Parameter Q_{size} is also the maximum batch size and T_{tx-loop} is the execution time of the Tx-ISR’s Tx-loop executed in step four of sending as described in subsection II.C. The sending of other messages generates new interrupt service requests, but they are masked during the execution of the ISR.

The service time of the Tx-interrupts is

    T_{tx-int}(n) = T_{res} + T_{tx-isr}(n) + T_{rec},    (2)

where parameter T_{res} is the response time between the assertion of the interrupt request and the start of the ISR’s execution, and T_{rec} is the interrupt recovery time. If the NIOS II/f (fast) core is used, parameter T_{res} = 105 clock cycles and parameter T_{rec} = 62 clock cycles [10].

The execution time of the Rx-ISR is

    T_{rx-isr}(n) = T_{rx-start} + n × T_{rx-loop},    (3)

where T_{rx-start} is the time consumed in the beginning of the execution of the ISR before the Rx-loop iterations and where n = 1, …, Q_{size} is the number of the Rx-ISR’s Rx-loop iterations, which is limited by the size of the queues Q_{size}. Parameter T_{rx-loop} is the time consumed by each of the Rx-loop iterations executed in step three of receiving as described in subsection II.D. The receiving of new messages also generates receive-requests, but the interrupts are masked during the execution of the ISR.

The service time of the Rx-interrupts is

    T_{rx-int}(n) = T_{res} + T_{rx-isr}(n) + T_{rec},    (4)

where parameters n, T_{res}, and T_{rec} are equal to those of formula (2).

In the performance analysis the operation of the MSIQ HW and SW can be divided into periods during which the MSIQ masters send or receive a certain number of messages and the ISR is executed once. The length of the periods is denoted by T_{period}(n), where n = 1, …, Q_{size} is the number of serviced send-requests or receive-requests, i.e. the batch size. The length of the period is determined by the execution time of the interrupt services or the time required for sending or receiving n messages. The value of parameter n is not fixed, and it depends also on the message size. The length of the period determines the theoretic maximum message rate

    R_{msg}(n) = n / T_{period}(n)    (5)

and the theoretic maximum bit rate

    R_{bit}(n) = M_{size} × R_{msg}(n) = M_{size} × n / T_{period}(n).    (6)
In formulas (5) and (6), n = 1, …, Q_{size} and parameter M_{size} is the message size in bits. The theoretic maximum bit rate R_{bit}(n) is the theoretic maximum throughput. Formulas of the maximum theoretic throughputs are derived for sending and receiving separately in the following two subsections.

1) The throughput with the send-request queue

If T_{tx-int}(Q_{size}) = Q_{size} × T_{tx-msg}, where parameter T_{tx-msg} = D_{send}(N_{pck}) is the sending time of a message as defined in subsection III.A, the MSIQ Tx-master is able to send messages continuously without stopping the sending while the processor is running the Tx-ISR. The HW send-request queue can never be emptied by the MSIQ Tx-master, because the processor runs the Tx-ISR which puts new send-requests to the HW send-request queue from the SW send-request queue. The MSIQ Tx-master generates interrupts after every sending of a message, but these interrupt service requests are masked if the processor is running the ISR. The performance analysis of the MSIQ Tx-master consists of two separate cases, where either T_{tx-int}(Q_{size}) > Q_{size} × T_{tx-msg} or T_{tx-int}(Q_{size}) ≤ Q_{size} × T_{tx-msg}, since the message size affects the rate at which the interrupt services are requested and the throughput of the MSIQ Tx-master.

In the case that messages are shorter, the interrupt service time is longer than the sending time of Q_{size} messages and T_{tx-int}(Q_{size}) > Q_{size} × T_{tx-msg}. In this case the HW send-request queue is emptied and the MSIQ Tx-master must stop sending messages until the Tx-ISR puts the next send-requests into the HW send-request queue. Thus, with shorter messages the interrupt service time T_{tx-int}(n) determines the length of the period and T_{period}(n) = T_{tx-int}(n). The message rate is R_{msg}(n) = n / T_{period}(n) = n / T_{tx-int}(n), where n = 1, …, Q_{size}, and the bit rate is R_{bit}(n) = M_{size} × R_{msg}(n). The theoretic maximum throughput is achieved with value n = Q_{size}, when the ISR loads Q_{size} send-requests to the HW send-request queue, and the theoretic maximum throughput is

    R_{bit}(Q_{size}) = M_{size} × R_{msg}(Q_{size}) = M_{size} × Q_{size} / T_{tx-int}(Q_{size}).    (7)

In the case that messages are longer, the interrupt service time can be smaller than the sending time of the messages and T_{tx-int}(Q_{size}) ≤ Q_{size} × T_{tx-msg}. Because the Tx-ISR can put a larger number of send-requests to the HW send-request queue than the MSIQ Tx-master can send during the interrupt service time T_{tx-int}(Q_{size}), the HW send-request queue is nonempty most of the time and the sending can continue without stops. Because the number of Tx-loop iterations of the Tx-ISR depends on the message size which determines the sending time, parameter n can also be smaller than Q_{size}. Hence, the sending time of the messages determines the length of the period T_{period}(n) = n × T_{tx-msg}, where n = 1, …, Q_{size}, and the theoretic maximum message rate is R_{msg}(n) = n / T_{period}(n) = n / (n × T_{tx-msg}) = 1 / T_{tx-msg}. In this case the theoretic maximum throughput does not depend on the value of parameter n and it is

    R_{bit}(n) = M_{size} × R_{msg}(n) = M_{size} / T_{tx-msg}.    (8)

2) The throughput with the receive-request queue

If T_{rx-int}(Q_{size}) = Q_{size} × T_{rx-msg}, where parameter T_{rx-msg} = D_{receive}(N_{pck}) is the receiving time of a message as defined in subsection III.A, the MSIQ Rx-master is able to receive the next Q_{size} messages without stopping the receiving while the processor is running the ISR. This is because each interrupt services Q_{size} receive-requests while the MSIQ Rx-master receives the next Q_{size} messages. The MSIQ Rx-master generates a new interrupt service request after receiving a message, but these interrupt service requests are masked if the processor is running the ISR. The analysis also divides into two separate cases, where either T_{rx-int}(Q_{size}) > Q_{size} × T_{rx-msg} or T_{rx-int}(Q_{size}) ≤ Q_{size} × T_{rx-msg}, since the message size affects the rate at which the interrupt services are requested and the throughput of the MSIQ Rx-master.

In the case that messages are shorter, the interrupt service time is longer than the receiving time of Q_{size} messages and T_{rx-int}(Q_{size}) > Q_{size} × T_{rx-msg}. In this case the HW receive-request queue is full most of the time and the MSIQ Rx-master must stop receiving until the Rx-ISR’s Rx-loop iterations read receive-requests from the HW receive-request queue. The interrupt service time T_{rx-int}(n) clearly determines the length of the periods and T_{period}(n) = T_{rx-int}(n). Because at most Q_{size} receive-requests can be read from the HW receive-request queue and Q_{size} messages can be received during the periods, the theoretic maximum throughput is achieved with value n = Q_{size} and T_{period}(Q_{size}) = T_{rx-int}(Q_{size}). Hence, the theoretic maximum throughput is

    R_{bit}(Q_{size}) = M_{size} × Q_{size} / T_{rx-int}(Q_{size}).    (9)

In the case that messages are longer, the interrupt service time can be shorter than the receiving time of Q_{size} messages and T_{rx-int}(Q_{size}) ≤ Q_{size} × T_{rx-msg}. Because the processors can service the receive-requests of Q_{size} messages in a shorter time than the MSIQ Rx-master can receive the next Q_{size} messages, the receiving can be continued without stops and the receive-request queue can never become full. Finally, if the message size is further increased, the Rx-loop is executed only once during every execution of the Rx-ISR and T_{rx-int}(1) ≤ T_{rx-msg}. Hence, if T_{rx-int}(Q_{size}) ≤ Q_{size} × T_{rx-msg}, the message size determines the number of received messages n during the periods and the length of the period T_{period}(n) = n × T_{rx-msg}, where n = 1, …, Q_{size}. Thus, the theoretic maximum message rate is R_{msg}(n) = n / (n × T_{rx-msg}) = 1 / T_{rx-msg} and the theoretic maximum throughput is

    R_{bit}(n) = M_{size} × R_{msg}(n) = M_{size} / T_{rx-msg}.    (10)

C. Comparison of performances and costs

The performances of the MSIQ and the MSI are presented in Fig. 2 where the horizontal axis shows the message size in 32-bit words and the vertical axis shows the throughputs in GBits/s. The throughputs were computed with a 100 MHz clock. The throughputs of the basic MSI, which does not have the queues, are presented with lines Q1(300) and Q1(600). These lines are computed like in [13] with interrupt service times (T_{tx-int}, T_{rx-int}) of 300 and 600 clock cycles. The throughputs of the MSIQ with queues of four send-requests and receive-requests are presented with lines Q4(450) and Q4(900). These lines are computed with equal Tx-loop and Rx-loop execution times (T_{tx-loop}, T_{rx-loop}) of 450 and 900 clock cycles, and with the ISR start times (T_{tx-start}, T_{rx-start}) of 20 clock cycles. The throughputs of the MSIQ with the queues of eight requests are not presented, since they are quite similar to those of Q4(450) and Q4(900). This is because the total execution times of the loops dominate the total interrupt service times as the number of loop iterations increases, which reduces the effect of the other delay parameters. The threshold message sizes of Q4(450) and Q4(900) are 199 and 379 words respectively. With the threshold message sizes T_{tx-int}(Q_{size}) = Q_{size} × T_{tx-msg} = Q_{size} × D_{send}(N_{pck}) and T_{rx-int}(Q_{size}) = Q_{size} × T_{rx-msg} = Q_{size} × D_{receive}(N_{pck}). Thus, with a 100 MHz clock the throughputs of the MSIQ actually saturate to 1.28 GBits/s with smaller messages than Fig. 2 presents. Formulas (7) and (9) are used for computing the throughputs of the MSIQ for message sizes that are smaller than the threshold values, and formulas (8) and (10) are used for computing the throughputs with message sizes that are higher than or equal to the thresholds.

By comparing line Q1(300) to line Q4(450) and line Q1(600) to line Q4(900) it can be concluded that with messages which are smaller than 64 and 128 words the theoretic maximum throughputs of the basic MSI and the MSIQ are quite similar. However, the throughputs Q4(450) and Q4(900) of the MSIQ grow much faster as the message size is increased and they saturate to 1.28 GBits/s already at the point of 256 and 512 words. Furthermore, the results in Fig. 2 do not show the performance with message bursts. Because usually traffic contains
also bursts of messages, it is necessary that the NI is able to achieve a high peak performance for short time intervals under burst traffic. This can be achieved by HW send-request and HW receive-request queues. For example, with queues of eight requests the MSIQ Masters are able to send and receive bursts of eight messages at the maximum rate without stopping their operation.

Figure 2. Theoretic maximum throughput of the MSI and the MSIQ.

The synthesis results are in Table II. The MSIQs and the MSI contain Tx-FIFOs and Rx-FIFOs of four packets. The logic and register consumptions of the MSIQs and the MSI are quite similar, but the amount of block memory bits grows clearly as the size of the queues is increased. The maximum size of the queues is 16 requests. With queues of that size the MSIQ would consume 4096 block memory bits, but it would also provide better theoretic maximum throughput and burst tolerance. Additionally, it would be possible to use a smaller HW send-request queue so as to reduce the HW costs, because the SW send-request queue can store a large number of send-requests in any case. For example, with a HW send-request queue of four requests and a HW receive-request queue of 16 requests the MSIQ would consume also 2560 block memory bits.

TABLE II. RESOURCE CONSUMPTIONS IN STRATIX III EP3SL150 [15]

FPGA resource          MSI, Qsize = 1    MSIQ with Qsize = 4    MSIQ with Qsize = 8
Combinational ALUTs    1550              1665 (7.4%)            1695 (9.3%)
Memory ALUTs           0                 0 (0.0%)               0 (0.0%)
Logic registers        1454              1609 (10.6%)           1609 (10.6%)
Block memory bits      1024              1792 (75.0%)           2560 (150.0%)

IV. CONCLUSIONS

This paper presents the MSIQ NI, in which a new queue mechanism is used for batching interrupts in order to improve the performance. Interrupts generated by the NIs produce a lot of SW overhead, and the performance can be improved by reducing the interrupt frequency. This is achieved by the send-request and the receive-request queues, which make it possible to batch interrupt service requests so that individual ISR executions can serve multiple interrupt requests. The throughput improves especially with longer messages. Furthermore, the burst tolerance against short messages improves. In addition to the interrupt batching, this is also partly because the request queues allow the MSIQ HW to continue sending and receiving messages while the processor is running the ISR. Hence, the new queue mechanism enables more efficient concurrent operation of the MSIQ HW and the SW. The results of the performance analysis and the logic synthesis also show clearly that the performance can be improved with tolerable costs. It would also be possible to reduce the HW costs by using smaller send-request queues in the MSIQ without reducing the performance significantly.

ACKNOWLEDGMENT

This research is funded by the Academy of Finland under grant 122361.

REFERENCES

[1] Z.D. Dittia, G.M. Parulkar, and J.R. Cox, “The APIC Approach to High Performance Interface Design: Protected DMA and Other Techniques,” Proc. of the IEEE International Conference on Computer Communications, Kobe, Japan, Apr. 7-12, 1997, pp. 823-831.
[2] A.F. Diaz, J. Ortega, A. Canas, F.J. Fernandez, M. Anguita, and A. Prieto, “The Lightweight Protocol CLIC on Gigabit Ethernet,” Proc. of the International Parallel and Distributed Processing Symposium, Nice, France, Apr. 22-26, 2003, pp. 8.
[3] P. Gilfeather and A.B. Maccabe, “Modeling Protocol Offload for Message-Oriented Communication,” Proc. of the IEEE International Conference on Cluster Computing, Burlington, Massachusetts, USA, Sept. 27-30, 2005, pp. 1-10.
[4] S.A. AlQahtani, “Performance Evaluation of Handling Interrupts Schemes in Gigabit Networks,” Proc. of the IEEE International Conference on Computer and Information Technology, Aizu-Wakamatsu, Fukushima, Japan, Oct. 16-19, 2007, pp. 497-502.
[5] B. Goglin and N. Furmento, “Finding a Tradeoff between Host Interrupt Load and MPI Latency over Ethernet,” Proc. of the IEEE International Conference on Cluster Computing, New Orleans, Louisiana, USA, Aug. 31-Sept. 4, 2009, pp. 1-9.
[6] J. Mogul and K.K. Ramakrishnan, “Eliminating Receive Livelock in an Interrupt-Driven Kernel,” ACM Transactions on Computer Systems, Vol. 15, No. 3, Aug. 1997, pp. 217-252.
[7] K. Langendoen, J. Romein, R. Bhoedjang, and H. Bal, “Integrating Polling, Interrupts, and Thread Management,” Proc. of the Frontiers of Massively Parallel Computing Symposium, Annapolis, MD, USA, Oct. 27-31, 1996, pp. 13-22.
[8] H. Kariniemi and J. Nurmi, “Micronmesh for Fault-tolerant GALS Multiprocessors on FPGA,” Proc. of the International Symposium on System-on-Chip, Tampere, Finland, Nov. 4-6, 2008, pp. 1-8.
[9] H. Kariniemi and J. Nurmi, “Fault-Tolerant Communication over Micronmesh NoC with Micron Message-Passing Protocol,” Proc. of the 11th International Symposium on System-on-Chip, Tampere, Finland, Oct. 5-7, 2009, pp. 5-12.
[10] Altera Corp., NIOS II software developers handbook, March 2009. Website, <http://www.pldworld.com/_Semiconductors/Altera/one_click_niosII_docs_9_0/files/n2sw_nii5v2.pdf> 20.08.2010
[11] J. Labrosse, MicroC/OS-II The real-time kernel, Second ed., CMP Books, San Francisco, USA, 2002.
[12] H. Kariniemi and J. Nurmi, “NoC Interface for Fault-Tolerant Message-Passing Communication on Multiprocessor SoC Platform,” Proc. of the NORCHIP, Trondheim, Norway, Nov. 2009.
[13] Altera Corp., NIOS II processor reference handbook, November 2009. Website, <http://www.altera.com/literature/hb/nios2/n2cpu_nii5v1.pdf> 20.08.2010
[14] Altera Corp., Quartus II Handbook v10.0, Ch. 2: System interconnect fabric for memory-mapped interfaces, July 2010. Website, <http://www.altera.com/literature/hb/qts/qts_qii54003.pdf> 20.08.2010
[15] Altera Corp., Stratix III device handbook, Volume I, San Jose, USA, July 2010. Website, <http://www.altera.com/literature/hb/stx3/stratix3_handbook.pdf> 20.08.2010
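
As an illustration of the throughput analysis in Section III, the two-case model behind formulas (7), (8) and (10) can be sketched in Python. This is only a minimal sketch under stated assumptions: the function name `max_throughput`, its parameter names and the numeric values in the usage lines are hypothetical and not taken from the paper, and formula (9), which is not shown in this part of the text, is assumed to have the same form as formula (7). The interrupt service time for a full queue is passed in as a plain number, because its exact decomposition into start, loop and other delays is defined elsewhere in the paper.

```python
def max_throughput(msg_bits, t_msg, t_int_full, q_size):
    """Theoretic maximum throughput in bits per clock cycle.

    msg_bits   -- message size Msize in bits
    t_msg      -- sending (or receiving) time of one message, in clock cycles
    t_int_full -- interrupt service time T_int(Qsize), in clock cycles
    q_size     -- number of requests Qsize held by the HW request queue
    """
    if msg_bits <= 0 or t_msg <= 0 or t_int_full <= 0 or q_size <= 0:
        raise ValueError("all parameters must be positive")

    if t_int_full > q_size * t_msg:
        # Short messages: the queue drains while the ISR runs, so the
        # interrupt service time sets the period, as in formula (7)
        # (and, by assumption, formula (9) on the receive side).
        return msg_bits * q_size / t_int_full

    # Long messages: the queue stays nonempty, so the message transfer
    # time sets the period, as in formulas (8) and (10).
    return msg_bits / t_msg


# Hypothetical example values, chosen only to exercise both cases.
CLOCK_HZ = 100e6  # the paper computes its curves with a 100 MHz clock
short_case = max_throughput(msg_bits=1280, t_msg=100, t_int_full=2000, q_size=4)
long_case = max_throughput(msg_bits=1280, t_msg=100, t_int_full=300, q_size=4)
print("ISR-limited: ", short_case * CLOCK_HZ / 1e9, "Gbit/s")
print("link-limited:", long_case * CLOCK_HZ / 1e9, "Gbit/s")
```

At the threshold, where the interrupt service time equals Qsize times the message transfer time, both branches give the same value, which is why the curves described for Fig. 2 change smoothly from the ISR-limited region to the saturated region.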
