2. Ruchira Pradeep Pawar and Dr. S D Shirbahadurkar
http://www.iaeme.com/IJECET/index.asp 20 editor@iaeme.com
1. INTRODUCTION
In traditional System-on-Chip (SoC) design, shared buses are used for data transfer
among various subsystems. As SoC design involves a larger number of subsystems,
so that it becomes more complex and traditional bus-based architecture. It gives rise
to new paradigm for on-chip communication. This paradigm is called Network-on-
Chip (NoC). The NoC consists of the switches and point-to-point links. The internal
structure of the NoC switch is given in figure 1.
Figure 1 NoC Switch Schematic
Each switch consists of four distinct components. Particular set of input blocks are
connected to incoming packet lines. These input blocks contain buffers to queue the
incoming packets. Therefore they can be stored until; they are ready to be transferred
over the shared crossbar matrix. Particular set of output blocks are connected to the
outgoing packet lines. The most constraining aspect in the design of embedded system
is the complexity, arising due to the rapid technological evolution. The issues of cost
and timing add the difficulties in realization of network-on-chip (NoC) applications.
In which, many IPs (Intellectual Property) such as processor cores, memories, DSP
processors and peripheral devices are placed together, on a single die. Most of the
time, these modules communicate by means of a shared resource and the on-chip
network. The increasing demand for higher bandwidth on the network lines along
with the increasing complexity of the individual devices and an operating frequency
hitting new limits with almost every new design over the time. These factors place the
communication and/or computation resources arbitration being the performance
bottleneck of the NoC system. The arbitration is desired to be completed within one
clock cycle (e.g., between a processing element (PE) and a memory block) to avoid
large latencies between the cores on the chip. The overall delay introduced by the
arbitration should be low. Then it will not impact the overall system clock frequency
to achieve the arbitration in one clock cycle. This introduces new challenges for the
design of the arbiters. Though the key research problems in the design of NoC
include; are not limited to topology, channel width, buffer size, floor plan, routing,
switching, scheduling, and IP mapping [8]. The NoC components include the network
adapter, the routing node, and the network links [13]. Basically the routing node
consists of four major components: the input ports, the scheduler, the crossbar switch,
and the output ports. Numbers of researchers have developed scheduling algorithms
[5], [6], [7], [8] for research prototypes [2], [4] and commercial products [3].
2. LIMITATIONS ON THE SCALABILITY OF RRA
The granting of the requests by the Input selector is the most time-consuming
operation of the generic RRA, which also dominates the critical path delay. The
3. Congestion Control For Scalability In Bufferless On-Chip Networks With Fpga
Implementation
http://www.iaeme.com/IJECET/index.asp 21 editor@iaeme.com
complexity of the Input Selector arises due to two main issues: (1) the issue of
changing priority: - The priority of the inputs changes as inputs are granted and the
RPT value changes which requires the Input Selector circuit to consider all possible
priority settings. (2) The issue of circular priority order: - The priority order is
circular, leaving the priority processing even much harder. This leads to two separate
conditions for the grant decisions: Grant decisions for requests at or below the
Request [RPT] and grant decisions for requests above the Request [RPT]. Any grant
produced by the former decision has a higher priority over any grant produced by the
latter decision. This two parted decision causes deepening of the critical path. These
two issues are even more strongly marked as the RRA size gets larger (i.e., an RRA
with more inputs). Also RRA has its throughput near about 63%.
3. THE ISLIP SCHEDULING ALGORITHM
The Islip algorithm is designed so that to meet our goals. Islip is an iterative algorithm
— during each time slot, multiple iterations are performed, by matching inputs to
outputs to select a crossbar configuration. The Islip algorithm uses rotating priority
(“round-robin”) arbitration. This is used to schedule each active input and output in
turn. The main characteristic of Islip is its simplicity. As it is readily implemented in
hardware and can operate at high speed. Islip attempts to quickly converge on a
conflict-free match in multiple iterations, where each of the iteration consists of three
steps. All inputs and outputs are initially unmatched. And only those inputs and
outputs not matched at the end of the one iteration are eligible for matching in the
next.
Following are the steps for each of the iteration:
Step 1: Request - It implies that, each input sends a request to every output for
which it has a queued cell.
Step 2: Grant - It implies that, if an output receives any requests. It chooses the
one that appears next in a fixed round-robin schedule starting from the highest priority
element. Besides the output notifies each input, whether its request was granted or
not.
Step 3: Accept - It implies that, if an input receives a grant, it accepts the one that
appears next in a fixed round-robin schedule starting from the highest priority
element. The pointer to the highest priority element of the round-robin schedule is
incremented (modulo N) to one location beyond the accepted output. The pointer to
the highest priority element at the corresponding output is incremented (modulo N) to
one location beyond the granted input. If and only if the grant is accepted after the
first iteration then only the pointers are updated. By considering only unmatched
inputs and outputs, each of the iteration matches inputs and outputs that were not
matched during earlier iterations.
4. ISLIP SCHEDULER WITH ARRAY
Arbiters are a fundamental component in systems containing shared resources, and a
centralized arbiter is a tightly integrated design for its input requests. The function of
the scheduler is to arbitrate between requests from input blocks to output blocks. The
arbitration scheme is based on a maximal size approach. It is a variant of round robin
matching, which is shown to prevent starvation under uniform traffic.
In this study, we propose a new centralized arbiter, which may be used in
arbitration of a crossbar switch in NoC routers In the proposed system, we have used
4. Ruchira Pradeep Pawar and Dr. S D Shirbahadurkar
http://www.iaeme.com/IJECET/index.asp 22 editor@iaeme.com
four location array for each five ports namely South, East, West, North and Local
port. In this study, it is named as i-slip-array-s, i-slip-array-e, i-slip-array-w, i-slip-
array-n. Packet first stored in the related i-slip-array according to the destination port
of that packet and then it is forwarded to the next node as per the routing path. Figure
3.5 shows 4*4 Mesh topology in which it shows eight routing nodes.
Figure.2 4* 4 Mesh Topology
In this, we have taken packet of 22bits in which packet data is of 16bits and 3 bits
of source address and 3 bits of destination address.
Figure.3 State Diagram for Five Possible Cases
Finite state machine implementated for state variable. FSM starts with state
denoted by INIT_STATE. FSM logic implements priority encoder. Priority encoder
programmed with init_state gives us PPE. State Machine operates in a round robin
format. Overall arbiter/scheduler supports X-Y routing.
5. PACKET IN PROPOSED ALGORITHM
The devices communicate using a simple packet-based protocol. The packets are of
fixed size, and include a 6-bit header and 16-bits of packet data for a total of 22 bits.
The header is comprised of a 3-bit source identifier and 3-bit destination identifier.
5. Congestion Control For Scalability In Bufferless On-Chip Networks With Fpga
Implementation
http://www.iaeme.com/IJECET/index.asp 23 editor@iaeme.com
TABLE.I PACKET FORMAT
Bits 0 2 3 5 6 21
Field Dest Src Packet Data
Table no. 1 8-Bit Input Packet Format
The Packet Data field is multipurpose and may contain commands, addresses,
data, crc, or any other payload. The interconnect pays no attention to the contents of
the Packet Data field, and simply passes it through as a payload. The Src field
specifies the originating device, and the Dest field specifies the destination of the
packet. Since the packets travel a relatively short distance on a well-characterizeable
chip, it is assumed that the interconnect will be robust enough to not require
additional parity, ecc, or crc.
6. IP CORE FPGA DESIGN IMPLEMENTATION
METHODOLOGY
Goal: To develop a synthesizable IP core in HDL, for Islip scheduler block of the
routing logic.
A. HDL:
Design Description using Verilog-HDL
B. Verification Scheme:
Functional Simulation using Xilinx ISE Simulator
Test bench for top level design using Verilog-HDL
Stimulus for testing IP Core
Simulation results / Timing diagram Analysis
C. FPGA Implementation:
Design to be synthesized using Xilinx ISE Software.
Generate a design bit stream for FPGA device.
7. SIMULATION
A. Simulation Environment
We have simulated I-slip Scheduler using I-slip scheduling algorithm in Verilog. The
target FPGA Device is the Xilinx SPARTAN3 xc3s400-4fg456 with 238 slices
containings only related logic. The Xilinx ISE14.2 is used to synthesize the code.
B. Simulaton Result
Packet src-e: 100
Packet dest_e: 111
Packet data_e: ff01
Crossbar_select_e: 1
Grant_e: 1
I-slip buffer_e(array): ff00,0000,0000,0000
6. Ruchira Pradeep Pawar and Dr. S D Shirbahadurkar
http://www.iaeme.com/IJECET/index.asp 24 editor@iaeme.com
7. Congestion Control For Scalability In Bufferless On-Chip Networks With Fpga
Implementation
http://www.iaeme.com/IJECET/index.asp 25 editor@iaeme.com
C. Summary
TABLE II DEVICE UTILIZATION SUMMARY
Logic Utilization
Device Utilization Summary
Used Available Utilization
No. of Slice flipflop 101 7,168 1%
No. of 4 input LUTs 440 7,168 6%
No. of occupied Slices 238 3,584 5%
No. of Slices containing only
related logic
238 238 100%
No. of Slices containing
unrelated logic
0 238 0%
Total no. of 4 input LUTs 440 7,168 6%
No. of bonded IOs 202 264 76%
8. FPGA IMPLEMENTATION
FPGA implementation of I-slip scheduler of router has been performed on Xilinx
Spartan-3E FPGA XC3S250E. The Device Utilization Summary of the I-slip
scheduler is given in Table no .II. The design of the I-slip scheduler has been
implemented on XC3S250E-4pq208. The device utilization summary indicates that
202 IOBs out of 264 IOBs have been used.
8. Ruchira Pradeep Pawar and Dr. S D Shirbahadurkar
http://www.iaeme.com/IJECET/index.asp 26 editor@iaeme.com
The 8-bits version of the input block was used for implementation of the
interconnect on a Spartan-3E FPGA as the 22-bit version could not be accommodated
on this FPGA. The input packets are generated using DIP switches. The input packet
size is 8 bits. They are generated in a two-phase manner, with four bits in the first
phase and four bits in the next phase.
Figure 4 Output of West Condition on FPGA
REFERENCES
[1] McKeown, N. W. 1995 Scheduling Algorithms for Input-Queued Cell Switches.
Doctoral Thesis. UMI Order Number: UMI Order No. GAX96-02658.,
University of California at Berkeley.
[2] C. Partridge; P. Carvey et al. A Fifty-Gigabit per Second IP Router, Submitted to
IEEE/ACM Transactions on Networking, 1996.
[3] T. Anderson et al. High Speed Switch Scheduling for Local Area Networks, ACM
Trans. on Computer Systems, Nov 1993, pp. 319-352.
[4] N. McKeown et al. The Tiny Tera: A small high-bandwidth packet switch core,
IEEE Micro, Jan-Feb 1997.
[5] LaMaire, R.O.; Serpanos, D.N.; Two-dimensional round-robin schedulers for
packet switches with multiple input queues, IEEE/ACM Transactions on
Networking, Oct. 1994, 2(5), pp.471-82.
[6] Lund, C.; Phillips, S.; Reingold, N.; Fair prioritizescheduling in an input-buffered
switch, Proceedings of the IFIP-IEEE Conf. on Broadband Communications ‘96,
Montreal, April 1996. pp 358-69.
[7] Hui, J.; Arthurs, E. A broadband packet switch for integrated transport, IEEE J.
Selected Areas Communications, 5(8), Oct 1987, pp 1264-1273.
[8] Ali, M.; Nguyen, H. A neural network implementation of an input access scheme
in a high-speed packet switch, Proc. of GLOBECOM 1989, pp.1192-1196.
[9] N. McKeown, Scheduling Cells in an input-queued switch, PhD Thesis,
University of California at Berkeley, May 1995.
9. Congestion Control For Scalability In Bufferless On-Chip Networks With Fpga
Implementation
http://www.iaeme.com/IJECET/index.asp 27 editor@iaeme.com
[10] N. McKeown, iSLIP: A Scheduling Algorithm for Input-Queued Switchs,
Submitted to IEEE Transactions on Networking.
[11] T. Bjerregaard, and S. Mahadevan, A Survey of Research and Practices of
Network-on-Chip, ACM Computing Surveys, 38, March 2006.
[12] N. McKeown, Scheduling Algorithms for Input Queued Switches, Ph.D. Thesis,
Department ofElectrical Engineering and Computer Science, University of
California at Berkeley, 1995.
[13] P. Gupta and N. McKeown, Designing and Implementing a Fast Crossbar
Scheduler, Micro, IEEE, 19(1), Jan/Feb 1999, pp. 20-28.
[14] T. Moscibroda and O. Mutlu. A case for bufferless routing in on chip networks.
ISCA-36, 2009.
[15] Prof. Abhinav V. Deshpande. System Designing and Modelling Using FPGA.
International Journal of Electronics and Communication Engineering &
Technology, 5(11), 2014, pp. 47-52
[16] Susan Kuriakose and Ms. Miji Jacob. QPSK Modulation for DSSS-CDMA
Transmitter and Receiver Using FPGA. International Journal of Electronics and
Communication Engineering & Technology, 5(12), 2014, pp. 167-173.