SOC CHIP SCHEDULER EMBODYING I-SLIP ALGORITHM
Trupti B. Salankar, Vilas A. Nitnaware
SRKNEC, Nagpur
Truptis_135@yahoo.com, vilasan30@yahoo.com
Abstract: We describe the methodology, design and implementation of the scheduler block of an interconnect. The scheduler block is implemented in Verilog using the SYNOPSYS tools DVE and Design_vision. The interconnect handles 72-bit packets, a total of 32 packets at a time. There are a total of 8 devices, and we have to establish communication between them. Each device consists of an input block and an output block. The input block receives the 72-bit packets, up to 32 in all, one by one. The input block internally consists of four arrays (destination head, destination tail, packet array and linked-list array) and also a shift register; it stores the packets in the array called the packet array. When the scheduler sends a transmit request, these packets are handed to the scheduler. The scheduler internally consists of grant and accept arbiters and performs its operation in three steps: request, grant and accept. It works on the principle of the i-SLIP algorithm. Finally, the scheduler decides which packet should be sent from the input block of one device to the output block of another; the output block simply receives the packet. Packets are sent and received in two phases: 36 bits are sent in the first phase and the remaining 36 bits in the second. Thus the connection is established between the devices using the interconnect. We also modify the scheduler design to reduce the area required for on-chip implementation: by combining the two sets of arbiters into one, the total number of arbiters required for the modified scheduler is reduced to 8, compared to 16 for the original scheduler.

1. Introduction

With improving fabrication technology, integration of system components onto a single die increases. Communication between these components can become the limiting factor for performance unless careful attention is given to designing high-performance interconnects.
Communication networks connect geographically distributed points, and switching systems reduce the overall network cost by reducing the number of transmission links required to enable a given population of users to communicate. Various switching techniques exist, chosen on the basis of optimizing the usage of bandwidth in the network. The two main switching techniques are circuit switching and packet switching. In data networks there are gaps between messages: the user devices do not need the transmission link all the time, but when they do, they require relatively high bandwidths. Assigning a continuous high-bandwidth connection in such cases is obviously a waste of resources and results in low utilization. If a high-bandwidth circuit were set up and released for each message, the setup time incurred per message would be high compared to the transmission time of the message. Thus, switches in data networks incorporate the store-and-forward technique for transmitting messages.
In store and forward, a message is first sent from the source to the switch to which it is attached. The switch scans the header of the message and decides to which output to forward it. The same scheme is repeated from switch to switch until the message reaches its destination. The advantage of such a scheme is that the transmission links are occupied only for the duration of the transmission of a message; after that the links are released to transmit other messages. In other words, bandwidth allocation in the store-and-forward scheme is determined dynamically, on the basis of a particular message and a particular link in the network.
Packet switching is an extension of message switching. In packet switching, messages are broken into blocks called packets, and packets are transmitted independently using the store-and-forward scheme. Some of the advantages of packet switching over message switching are as follows:
1) Messages are fragmented into packets that cannot exceed a maximum size. This leads to fairness in network utilization, even when messages are long.
2) Successive packets of a message can be transmitted simultaneously on different links,
978-1-4244-8971-8/10$26.00 c 2010 IEEE
reducing the end-to-end transmission delay. (This effect is called pipelining.)
3) Due to the smaller size of packets compared to messages, packets are less likely to be rejected at intermediate nodes due to storage capacity limitations at the switches.
4) Both the probability of error and the error recovery time are lower for packets, since they are smaller. Once an error occurs, only the packet with the error needs to be retransmitted rather than the whole message. This leads to a more efficient use of the transmission bandwidth.
A packet switch is a box with N inputs and N outputs that routes the packets arriving on its inputs to their requested outputs. One can say that the main functions of packet switches are buffering and routing. Besides these basic operations, a switch can have other capabilities, such as handling multicast traffic and priority functions. Small N×N packet switches are the key components of the interconnection networks used in multiprocessors and in integrated communication networking for data, voice, and video. A popular choice in the hardware implementation of packet switches is the crossbar architecture. A crossbar is a non-blocking architecture: any input-output pair can communicate as long as it does not interfere with the other input-output pairs. In other words, any permutation of inputs and outputs is possible as long as each input sends data to a different output and each output receives data from at most one input.
This document describes the design and implementation of an asynchronous transfer mode (ATM) crossbar switch. ATM is a means of digital communication with the potential for replacing the conflicting communication infrastructures (telephone networks, cable TV networks, and computer networks) that nowadays need to be integrated into one. These three information infrastructures overlap in places and are all moving from analog to digital technology for transmission, switching, and multiplexing. New technologies are being developed that step toward merging these three infrastructures. ATM technology is intended for networks that transport a variety of different types of information, including voice traffic traditionally carried over telephone networks, data traffic typically carried on computer networks, and multimedia traffic consisting of a mixture of image, audio and video information. Each of these types of traffic has different requirements and places different demands on switching and transmission facilities. Although ATM has not replaced datagram networks altogether and has not become the one dominant technology (as it promised to be ten years ago), it has been deployed in many networks. Vendors continue to study and improve ATM technology to achieve more and more Quality of Service (QoS). In ATM networks, data is transferred over Virtual Circuits (VCs) in 53-byte packets called cells.
Our implementation is done in Verilog, simulated with the SYNOPSYS DVE tool and synthesized using Design_vision. The ATM crossbar switch that we have implemented is a modular (scalable) design and consists of three main components: input port modules, a crossbar scheduler, and output ports. The functionality of the switch can be described as follows. Packets first enter the input ports of the switch, where they are queued in their order of arrival. Each input port has a port controller that determines the destination of a packet. The port controller then sends a request to the scheduler for the destination output port. The scheduler grants a request based on a priority algorithm that ensures fair service to all the input ports. Once a grant is issued, the crossbar fabric is configured to map the granted input ports to their destination output ports.
This project implements an on-chip SOC interconnect (switch) embodying the i-SLIP algorithm for efficient communication between SOC devices. The name of this algorithm is derived from the serial line internet protocol (SLIP). SLIP is merely a packet framing protocol: it defines a sequence of characters that frame IP packets on a serial line, and nothing more. It provides no addressing, packet type identification, error detection/correction or compression mechanisms. It is a TCP/IP protocol used for communication between two machines that have previously been configured to communicate with each other. SLIP is commonly used on dedicated serial links and sometimes for dialup, usually at line speeds between 1200 bps and 19.2 kbps. It is useful for allowing mixes of hosts and routers to communicate with one another; for example, an internet service provider may give a user a SLIP connection so that the provider's server can respond to requests, pass them on to the internet, and forward the requested internet responses back to the user. Hence the name of this scheduling algorithm: iterative serial line internet protocol (i-SLIP). The algorithm is derived from the round-robin scheduling algorithm.

2. Internal interconnect of the switch
There have been discussions about what the internal interconnect of the switch should be. The internal interconnect can be a single-stage network (shared bus, ring, crossbar) or a multi-stage network of smaller switches arranged in a banyan. Even with a non-blocking interconnect such as the crossbar, some buffering is necessary, because packets arrive at the interconnect unscheduled and the switch has to multiplex them. There are three basic conditions where buffering is necessary:
1) The output port through which the packet needs to be routed is blocked by the next stage of the network.
2) Two packets destined for the same output port arrive simultaneously at different input ports, but the output port can accept only one packet at a time.
3) The packet needs to be held while the routing module in the switch determines the output port to which the packet is sent.

Figure 1: Input-queued packet switches

Many subsequent studies have tackled improving the performance of input-queued packet switches. Some of the proposed techniques are listed below:
1) Using non-FIFO buffers: One scheme in this category is virtual output queuing (VOQ). In this scheme each input has N queues, or blocks of memory, instead of one single FIFO queue. In other words, there is a separate queue for each input-output pair (Figure 2).

Figure 2

2) Operating the switch fabric at a faster speed than the input/output lines (speedup): This scheme reduces the effect of head-of-line (HOL) blocking but does not remove it completely [6]. A speedup by a factor of S can remove S packets from each input port within each time slot. Therefore, for an N×N switch, if output buffers are used the speedup is N, and if input buffers are used the speedup is equal to one. Switches that use speedup require both input and output buffers.
3) Examining the first K cells in a FIFO queue, where K > 1: Consider a switch with input port buffers as shown in Figure 2.9. The packet labels are destination port numbers. We define the array Ai = [ai1, ai2, ai3, ..., aiN]T, where ais = d is the destination port number, i is the column number, and s is the source port number. We also define the transmission array T = [t1, t2, ..., tN]T, where ts = d indicates that input port s is assigned to transmit a packet to output port d.
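The VOQ arrangement of technique 1) can be illustrated with a small behavioral sketch. The following Python model (our own illustration, not part of the Verilog design; all names are ours) keeps a separate queue per input-output pair and derives the request set that an input port would present to the scheduler:

```python
from collections import deque

N = 8  # 8x8 switch, as in this design

# Virtual output queuing: one queue per (input, output) pair,
# instead of a single FIFO per input port.
voq = [[deque() for _ in range(N)] for _ in range(N)]

def enqueue(src, dest, packet):
    """Store an arriving packet in the queue for its destination."""
    voq[src][dest].append(packet)

def requests(src):
    """Outputs this input will request from the scheduler:
    every output whose queue is nonempty."""
    return [d for d in range(N) if voq[src][d]]

# Example: input 0 holds packets destined for outputs 3 and 5,
# so it requests exactly those two outputs.
enqueue(0, 3, "pkt_a")
enqueue(0, 5, "pkt_b")
print(requests(0))  # -> [3, 5]
```

Because each destination has its own queue, a packet blocked at one output never delays packets bound for other outputs, which is exactly how VOQ removes head-of-line blocking.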
Figure 3

Scheduling algorithms:

The scheduler module in a packet switch decides when data is sent from particular inputs to their desired outputs. The scheduling algorithm has to be fast, fair, and easy to implement in hardware. The problem of scheduling, that is, determining which input and output should be connected to each other in each time slot, is equivalent to finding a matching in a bipartite graph. Several scheduling algorithms are:
1) Maximum Size Matching scheduling algorithm
2) Maximum Weight Matching algorithm
3) Oldest Cell First (OCF) scheduling
4) Longest Port First (LPF) algorithm
5) Parallel Iterative Matching (PIM) algorithm
6) Round Robin Matching (RRM)
The basic round-robin algorithm is designed to overcome two problems: complexity and unfairness. This scheduling algorithm is used when all tasks are equally important. The three steps of arbitration are request, grant, and accept.

Figure 4

7) iSLIP, an iterative algorithm obtained by making a small change to the RRM scheme. iSLIP has the same three steps as RRM; only the second step (the Grant step) is changed slightly. The SLIP algorithm is a variation of RRM designed to reduce the synchronization of the output arbiters. SLIP achieves this by not moving the grant pointers unless the grant is accepted, which avoids resynchronization of the arbiters under high load. SLIP is identical to RRM except for a condition placed on updating the grant pointers; the algorithm is modified as follows.
Step 2: Grant. If an output receives any requests, it chooses the one that appears next in a fixed round-robin schedule starting from the highest-priority element. The output notifies each input whether or not its request was granted. The pointer gi to the highest-priority element of the round-robin schedule is incremented to one location beyond the granted input if and only if the grant is accepted in Step 3. In other words, the round-robin priority at the output side is incremented (provided the grant was accepted) only after the Accept step has passed.
Those inputs and outputs that are not matched at the end of one iteration are eligible for matching in the next. This small change to the RRM algorithm makes i-SLIP capable of handling heavy loads of traffic and eliminates starvation of any connection. The algorithm converges in an average of O(log N) iterations and a maximum of N iterations. i-SLIP can fit in a single chip and is readily implemented in hardware.

Properties for high performance:
For practical high-performance systems, we desire algorithms with the following properties:
• High Throughput: an algorithm that keeps the backlog low in the VOQs; ideally, the algorithm will sustain an offered load up to 100% on each input and output.
• Starvation Free: the algorithm should not allow a nonempty VOQ to remain unserved indefinitely.
• Fast: to achieve the highest-bandwidth switch, it is important that the scheduling algorithm does not become the performance bottleneck; the algorithm should therefore find a match as quickly as possible.
• Simple to Implement: if the algorithm is to be fast in practice, it must be implemented in special-purpose hardware, preferably within a single chip.
The i-SLIP algorithm achieves all of these properties.

3. Interconnect overview

On-chip communication design has often been done using rather ad-hoc and informal approaches that fail to meet the challenges posed by next-generation SOC designs. The goal of this design is to provide a fast, efficient SOC interconnect between 8 on-chip devices. The eight devices are connected to one another through a single instance of the routing switch to be designed.
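The request/grant/accept mechanics described above can be modelled in software. The following Python sketch (a behavioral illustration with names of our own choosing, not the Verilog implementation) performs one i-SLIP iteration: each unmatched output grants the requesting input nearest its round-robin pointer, each input accepts the grant nearest its own pointer, and pointers advance only when a grant is accepted:

```python
def islip_iteration(reqs, grant_ptr, accept_ptr, matched_in, matched_out, N):
    """One iteration of i-SLIP over an N x N switch.
    reqs[i][j] is True if input i has a packet for output j.
    grant_ptr[j] / accept_ptr[i] are round-robin pointers."""
    # Request + Grant: each unmatched output grants the requesting
    # input that appears next in its round-robin schedule.
    grants = {}  # input -> list of granting outputs
    for j in range(N):
        if j in matched_out:
            continue
        for k in range(N):
            i = (grant_ptr[j] + k) % N
            if i not in matched_in and reqs[i][j]:
                grants.setdefault(i, []).append(j)
                break
    # Accept: each input accepts the grant that appears next in its
    # own round-robin schedule; pointers move only on acceptance
    # (in the full algorithm, only in the first iteration of a slot).
    accepted = []
    for i, outs in grants.items():
        j = min(outs, key=lambda o: (o - accept_ptr[i]) % N)
        accepted.append((i, j))
        matched_in.add(i)
        matched_out.add(j)
        grant_ptr[j] = (i + 1) % N
        accept_ptr[i] = (j + 1) % N
    return accepted
```

Running this repeatedly until no new pairs are accepted gives the multi-iteration behavior of i-SLIP; a single call corresponds to 1-SLIP.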
The devices communicate using a simple packet-based protocol. The packets are of fixed size and include a 6-bit header and 66 bits of packet data, for a total of 72 bits. The header is comprised of a 3-bit source identifier (Src) and a 3-bit destination identifier (Dest) (Table 2). The Packet Data field is multipurpose and may contain commands, addresses, data, CRC, or any other payload. The interconnect pays no attention to the contents of the Packet Data field and simply passes it through as a payload. The Src field specifies the originating device, and the Dest field specifies the destination of the packet. Since the packets travel a relatively short distance on a well-characterizable chip, it is assumed that the interconnect will be robust enough not to require additional parity, ECC, or CRC.

Figure 5

There are a total of 8 devices, and each device consists of an input block and an output block. Each input block holds 32 packets which are to be sent to the output blocks. To establish the connections between the input blocks and output blocks, we design this interconnect around the i-SLIP scheduling algorithm. The design is an 8x8 crossbar for use as an on-chip SOC interconnect; it serves as a communication portal between the 8 on-chip devices.

4. Concept of linked lists

Many non-numeric applications require that an ordered list of information items be represented and stored in memory in such a way that it is easy to add items to the list, or to delete items from the list at any position, while maintaining the desired order of items.
There are a total of 8 devices, and we have to establish connections between them. Each device consists of an input block and an output block. Each input block holds 32 packets, and each packet consists of 72 bits of data, as seen earlier. The input blocks have three responsibilities:
1. Receive incoming packets
2. Store the packets while waiting for scheduling
3. Transmit the next packet to the selected destination once scheduling is complete

Figure 6: Packets used in this interconnect design
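The 72-bit packet layout described above (3-bit Src, 3-bit Dest, 66-bit Packet Data) and the two 36-bit transfer phases mentioned in the abstract can be checked with a small Python sketch; the field encoding and function names here are our own illustration, not taken from the Verilog source:

```python
def pack(src, dest, payload):
    """Build a 72-bit packet: [Src:3 | Dest:3 | Data:66]."""
    assert 0 <= src < 8 and 0 <= dest < 8 and 0 <= payload < (1 << 66)
    return (src << 69) | (dest << 66) | payload

def unpack(pkt):
    """Split a 72-bit packet back into (src, dest, payload)."""
    return (pkt >> 69) & 0x7, (pkt >> 66) & 0x7, pkt & ((1 << 66) - 1)

def phases(pkt):
    """The 72 bits are transferred in two 36-bit phases."""
    return (pkt >> 36) & ((1 << 36) - 1), pkt & ((1 << 36) - 1)

# Round-trip check: device 5 sends a payload to device 2.
pkt = pack(5, 2, 0x123456789)
assert unpack(pkt) == (5, 2, 0x123456789)
hi, lo = phases(pkt)
assert (hi << 36) | lo == pkt
```

The 3-bit identifiers are exactly enough to address the 8 devices, which is why the header needs only 6 of the 72 bits.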
Figure 7: Input block

The input block is comprised of four memory arrays, a FIFO and a shift register.

5. Scheduler

The scheduler acts as the central switch arbiter. The goal of the scheduling algorithm is to match input queues containing waiting packets with output queues, achieving the maximum throughput while maintaining stability and eliminating starvation. The SLIP algorithm matches inputs to outputs in a single iteration; however, after this iteration, several input and output ports may remain unutilized. The i-SLIP algorithm uses multiple iterations to find paths that utilize as many input and output ports as possible, until it converges and finds no more possible matches. Single-iteration SLIP is thus a specialization of i-SLIP and may be characterized as i-SLIP with only a single iteration, or 1-SLIP.

Figure 8: High-level block diagram of the scheduler

High-level design: Figure 8 is a very high-level block diagram of the scheduler. The scheduling algorithm's three phases (request, grant, and accept) correspond to the three blocks shown in the figure. Because the algorithm's request phase corresponds simply to forwarding the requests to the grant arbiters, our implementation combines the request and grant phases. The figure also shows the decision feedback information from the accept arbiters, which the scheduler uses in successive iterations to mask off requests from already matched inputs and outputs.
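The four arrays of the input block (destination head, destination tail, packet array and linked-list array, as listed in the abstract) amount to eight linked lists threaded through one 32-entry packet store: one list per destination, with the linked-list array holding each slot's "next" index. The following Python model sketches that idea; it is our own illustration of the data structure, not the Verilog implementation:

```python
NIL = -1
DEPTH = 32   # packets stored per input block
PORTS = 8    # destinations

packet_array = [0] * DEPTH      # packet storage
linked_list = [NIL] * DEPTH     # "next slot" index for each entry
dest_head = [NIL] * PORTS       # first queued packet per destination
dest_tail = [NIL] * PORTS       # last queued packet per destination
free_list = list(range(DEPTH))  # unused slots (a FIFO in hardware)

def store(dest, packet):
    """Append an arriving packet to its destination's linked list."""
    slot = free_list.pop(0)
    packet_array[slot] = packet
    linked_list[slot] = NIL
    if dest_head[dest] == NIL:
        dest_head[dest] = slot          # list was empty
    else:
        linked_list[dest_tail[dest]] = slot  # link after old tail
    dest_tail[dest] = slot

def transmit(dest):
    """Remove and return the oldest packet queued for `dest`."""
    slot = dest_head[dest]
    dest_head[dest] = linked_list[slot]
    if dest_head[dest] == NIL:
        dest_tail[dest] = NIL
    free_list.append(slot)
    return packet_array[slot]
```

Insertion and deletion touch only the head, tail and next pointers, so packets for different destinations can share one physical memory while each destination's order is preserved, which is the point of the linked-list arrangement described in Section 4.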
Figure 9: Scheduler block diagram

SIMULATION RESULT OF SCHEDULER

Scheduler timing report on SYNOPSYS:

****************************************
Report : timing
         -path full
         -delay max
         -max_paths 1
         -sort_by group
Design : sc
Version: Y-2006.06-SP6
Date   : Wed Jul 28 16:48:35 2010
****************************************
Operating Conditions: TYPICAL   Library: saed90nm_typ
Wire Load Model Mode: enclosed

Startpoint: datactrl4_reg[4] (rising edge-triggered flip-flop)
Endpoint: in_dec_valid[4] (output port)
Path Group: (none)
Path Type: max

Des/Clust/Port   Wire Load Model   Library
------------------------------------------------
sc               35000             saed90nm_typ

Point                             Incr    Path
------------------------------------------------
datactrl4_reg[4]/CLK (DFFX1)      0.00    0.00 r
datactrl4_reg[4]/Q (DFFX1)        0.22    0.22 f
U456/QN (NOR4X0)                  0.17    0.38 r
U455/QN (NAND4X0)                 0.10    0.48 f
U369/Q (AND2X1)                   0.09    0.57 f
in_dec_valid[4] (out)             0.00    0.57 f
data arrival time                         0.57
------------------------------------------------
(Path is unconstrained)

Scheduler area report on SYNOPSYS:

****************************************
Report : area
Design : sc
Version: Y-2006.06-SP6
Date   : Wed Jul 28 17:25:51 2010
****************************************
Library(s) Used:
    saed90nm_typ (File: /home/student1/today/saed90nm_typ.db)

Number of ports:            220
Number of nets:             852
Number of cells:            380
Number of references:        37

Combinational area:        13071.848633
Noncombinational area:      3035.730469
Net Interconnect area:      1093.291504

Total cell area:           16107.509766
Total area:                17200.800781

6. References

[1] Kangmin Lee, See-Joong Lee and Hui-Jun Yoo, "A High-Speed and Lightweight On-Chip Crossbar Switch for On-Chip Interconnection Networks", Semiconductor System Laboratory, Dept. of Electrical Engineering, KAIST, Daejeon, Korea.
[2] "Addressing the System-on-a-Chip Interconnect Woes through Communication-Based Design", University of California at Berkeley, 2001.
[3] "Algorithm-Hardware Co-Design of Fast Parallel Round-Robin Arbiters", University of Texas, 2004.
[4] Li Bin Jiang and Soung Chang Liew, "An Adaptive Round Robin Scheduler for the Head-of-Line Blocking Problem in Wireless LANs", Department of Information Engineering, 1999.
[5] "Concept of linked lists", from the book by Andrew S. Tanenbaum on computer architecture and organization.
[6] Rodrigo Sierra, "Fair Queuing in Data Networks", Internetworking 2002.
[7] "Head-of-line blocking", Wikipedia, the free encyclopedia.
[8] Maryam Keyvani, "High Speed Symmetric Crossbar Switch", B.Sc. University of Tehran, 1998.
[9] Pankaj Gupta, "Designing and Implementing a Fast Crossbar Scheduler", IEEE paper, Stanford University, 1999.
[10] John D. Pape, "Implementation of an On-Chip Interconnect Using the i-SLIP Scheduling Algorithm", December 11, 2006.
[11] Tomaz Felicijan, "Quality of Service for Asynchronous On-Chip Networks", thesis, Dept. of Computer Science, 2004.
[12] "Study of VOQ Crossbar Switches for Multicast Traffic", National Yunlin University of Science and Technology.
[13] "The iSLIP Scheduling Algorithm for Input-Queued Switches", IEEE Transactions, Vol. 7, April 1999.
[14] Tutorial on "The SLIP Algorithm with Multiple Iterations".
[15] Tutorial on "The SLIP Algorithm with Single Iteration".