OKI et al.: CONCURRENT ROUND-ROBIN-BASED DISPATCHING SCHEMES 831
throughput under uniform traffic² by using RD, the internal expansion ratio has to be set to approximately 1.6 when the switch size is large.

One question arises: Is it possible to achieve a high throughput by using a practical dispatching scheme, without allocating any buffers in the second stage to avoid the out-of-sequence problem, and without expanding the internal bandwidth?

This paper presents two solutions to this question. First,
we introduce an innovative dispatching scheme called the concurrent round-robin dispatching (CRRD) scheme for a Clos-network switch. The basic idea of CRRD is to use the desynchronization effect in the Clos-network switch. The desynchronization effect has been studied using simple scheduling algorithms such as iSLIP and dual round-robin matching (DRRM) in an input-queued crossbar switch. CRRD provides high switch throughput without increasing the internal bandwidth, and its implementation is very simple because only simple round-robin arbiters are employed. We show via simulation that CRRD achieves 100% throughput under uniform traffic. With slightly unbalanced traffic, we also show that CRRD provides better performance than RD.

Second, this paper describes a scalable round-robin-based dispatching scheme, called the concurrent master–slave round-robin dispatching (CMSD) scheme. CMSD is an improved version of CRRD that provides more scalability.

Fig. 1. Analogy among scheduling schemes.

Fig. 2. Clos-network switch with VOQs in the IMs.
To make CRRD more scalable while preserving CRRD's advantages, we introduce two sets of output-link round-robin arbiters in the first-stage module: master arbiters and slave arbiters. These two sets of arbiters operate in a hierarchical round-robin manner. The dispatching scheduling time is reduced, as is the interconnection complexity of the dispatching scheduler. This makes the hardware of CMSD easier to implement than that of CRRD when the switch size becomes large. In addition, CMSD preserves the advantage of CRRD that the desynchronization effect is obtained in a Clos-network switch. Simulation suggests that CMSD also achieves 100% throughput under uniform traffic without expanding the internal switch capacity.

Fig. 1 categorizes several scheduling schemes and shows the analogy of our proposed schemes. In crossbar switches, iSLIP, a round-robin-based algorithm, was developed to overcome the throughput limitation of the parallel iterative matching (PIM) algorithm, which uses randomness. In Clos-network switches, CRRD and CMSD, two round-robin-based algorithms, are developed to overcome the throughput limitation and the high implementation complexity of RD, which also uses randomness.

The remainder of this paper is organized as follows. Section II describes the Clos-network switch model that we reference throughout this paper. Section III explains the throughput limitation of RD. Section IV introduces CRRD. Section V describes CMSD as an improved version of CRRD. Section VI presents a performance study of CRRD and CMSD. Section VII discusses the implementation of CRRD and CMSD. Section VIII summarizes the key points.

²A switch can achieve 100% throughput under uniform traffic if the switch is stable. The stable state is defined in the literature. Note that the 100% definition given there is more general than the one used here, because all independent and admissible arrivals are considered.

II. CLOS-NETWORK SWITCH MODEL

Fig. 2 shows a three-stage Clos-network switch. The terminology used in this paper is as follows.

IM            Input module at the first stage.
CM            Central module at the second stage.
OM            Output module at the third stage.
n             Number of input ports (IPs)/OPs in each IM/OM, respectively.
k             Number of IMs/OMs.
m             Number of CMs.
i             IM number, where 0 ≤ i ≤ k − 1.
j             OM number, where 0 ≤ j ≤ k − 1.
h             IP/OP number in each IM/OM, respectively, where 0 ≤ h ≤ n − 1.
r             CM number, where 0 ≤ r ≤ m − 1.
IM(i)         (i + 1)th IM.
CM(r)         (r + 1)th CM.
OM(j)         (j + 1)th OM.
IP(i, h)      (h + 1)th IP at IM(i).
OP(j, h)      (h + 1)th OP at OM(j).
VOQ(i, j, h)  Virtual output queue (VOQ) at IM(i) that stores cells destined for OP(j, h).
L_I(i, r)     Output link at IM(i) that is connected to CM(r).
L_C(r, j)     Output link at CM(r) that is connected to OM(j).
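The counts implied by this terminology can be made concrete with a small sketch; the function and variable names here are illustrative, not from the paper:

```python
def clos_dimensions(n, m, k):
    """Counts implied by the three-stage Clos model described above
    (n x m IMs, k x k CMs, m x n OMs)."""
    voqs_per_im = n * k   # one VOQ(i, j, h) per output port OP(j, h)
    im_links = m          # L_I(i, r): one IM output link per CM
    cm_links = k          # L_C(r, j): one CM output link per OM
    ports = n * k         # total switch ports N
    return voqs_per_im, im_links, cm_links, ports

print(clos_dimensions(8, 8, 8))  # (64, 8, 8, 64)
```

For the n = m = k = 8 configuration used in the simulations later in the paper, each IM holds 64 VOQs and the switch has 64 ports.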
832 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 10, NO. 6, DECEMBER 2002
The first stage consists of k IMs, each of which has an n × m dimension. The second stage consists of m bufferless CMs, each of which has a k × k dimension. The third stage consists of k OMs, each of which has an m × n dimension.

An IM(i) has nk VOQs to eliminate head-of-line (HOL) blocking. A VOQ is denoted as VOQ(i, j, h). Each VOQ(i, j, h) stores cells that go from IM(i) to OP(j, h) at OM(j). A VOQ can receive at most n cells from the n IPs and can send one cell to a CM in one cell time slot.³

Each IM(i) has m output links. Each output link L_I(i, r) is connected to one CM(r).

A CM(r) has k output links, each of which is denoted as L_C(r, j), which are connected to the k OMs, each of which is OM(j).

An OM(j) has n OPs, each of which is OP(j, h) and has an output buffer.⁴ Each output buffer receives at most m cells in one cell time slot, and each OP at the OM forwards one cell in a first-in-first-out (FIFO) manner to the output line.⁵

III. RD SCHEME

The ATLANTA switch, developed by Lucent Technologies, uses random selection in its dispatching algorithm. This section describes the basic concept and performance characteristics of the RD scheme. This description helps in understanding the CRRD and CMSD schemes.

A. RD Algorithm

Two phases are considered for dispatching from the first to the second stage. The details of RD are described in the literature on the ATLANTA switch. We show an example of RD in Section III-B. In phase 1, up to m VOQs from each IM are selected as candidates, and each selected VOQ is assigned to an IM output link. A request that is associated with this output link is sent from the IM to the CM. This matching between VOQs and output links is performed only within the IM. The number of VOQs that are chosen in this matching is always min(m, v), where v is the number of nonempty VOQs. In phase 2, each selected VOQ that is associated with an IM output link sends a request from the IM to the CM. CMs respond with the arbitration results to IMs so that the matching between IMs and CMs can be completed.

• Phase 1: Matching within IM
— Step 1: At each time slot, nonempty VOQs send requests for candidate selection.
— Step 2: IM(i) selects up to m requests out of the nonempty VOQs. For example, a round-robin arbitration can be employed for this selection. Then, IM(i) proposes up to m candidate cells to randomly selected CMs.

• Phase 2: Matching between IM and CM
— Step 1: A request that is associated with L_I(i, r) is sent out to the corresponding CM(r). A CM(r) has k output links L_C(r, j), each of which corresponds to an OM(j). An arbiter that is associated with L_C(r, j) selects one request among the received requests. A random-selection scheme is used for this arbitration.⁶ CM(r) sends up to k grants, each of which is associated with one L_C(r, j), to the corresponding IMs.
— Step 2: If a VOQ at the IM receives the grant from the CM, it sends the corresponding cell at the next time slot. Otherwise, the VOQ will again be a candidate at step 2 in phase 1 at the next time slot.

Fig. 3. Example of RD scheme (n = m = k = 2).

B. Performance of RD Scheme

Although RD can dispatch cells evenly to CMs, a high switch throughput cannot be achieved, due to contention at the CM, unless the internal bandwidth is expanded. Fig. 3 shows an example of the throughput limitation in the case of n = m = k = 2. Let us assume that every VOQ is always occupied with cells. Each VOQ sends a request for a candidate at every time slot. We estimate how much utilization an output link L_C(r, j) can achieve for the cell transmission.⁷ Since the utilization of every L_C(r, j) is the same, we focus only on a single one, i.e., L_C(0, 0). The link utilization of L_C(0, 0) is obtained from the sum of the cell-transmission rates of the VOQs that can send cells through it. First, we estimate how much traffic VOQ(0, 0, 0) can send through L_C(0, 0). The probability that VOQ(0, 0, 0) uses L_I(0, 0) to request L_C(0, 0) is 1/4, because there are four VOQs in IM(0). Consider that VOQ(0, 0, 0) requests L_C(0, 0) using L_I(0, 0). If either VOQ(1, 0, 0) or VOQ(1, 0, 1) requests L_C(0, 0) through L_I(1, 0), a contention occurs with the request by VOQ(0, 0, 0) at CM(0). The aggregate probability that either VOQ(1, 0, 0) or VOQ(1, 0, 1), among the four VOQs in IM(1), requests L_C(0, 0) through L_I(1, 0) is 1/2. In this case, the winning probability of VOQ(0, 0, 0) is 1/2. If there is no contention for L_C(0, 0) caused by requests in IM(1), VOQ(0, 0, 0) can always send a cell with a probability of 1.0 without contention.

³An L-bit cell must be written to or read from a VOQ memory in a time less than L/C(n + 1), where C is the line speed. For example, when L = 64 × 8 bits, C = 10 Gbit/s, and n = 8, L/C(n + 1) is 5.6 ns. This is feasible when we consider currently available CMOS technologies.

⁴We assume that the output buffer size at OP(j, h) is large enough to avoid cell loss without flow control between IMs and OMs, so that we can focus the discussion on the properties of dispatching schemes in this paper. However, a flow-control mechanism can be adopted between IMs and OMs to avoid cell loss when the output buffer size is limited.

⁵Similar to the VOQ memory, an L-bit cell must be written to or read from an output memory in a time less than L/C(m + 1).

⁶Even when the round-robin scheme is employed as the contention-resolution function for L_C(r, j) at CM(r), the result that we will discuss later is the same as with random selection, due to the effect of the RD.

⁷The output-link utilization is defined as the probability that the output link is used in cell transmission. Since we assume that the traffic load destined for each OP is not less than 1.0, we consider the maximum switch throughput as the output-link utilization throughout this paper, unless specifically stated otherwise.
Therefore, the traffic ρ that VOQ(0, 0, 0) can send through L_C(0, 0) is given as follows:

ρ = (1/4) × (1/2 × 1/2 + 1/2 × 1.0) = 3/16.

Since VOQ(0, 0, 0) can use either L_I(0, 0) or L_I(0, 1) (i.e., two CMs) and VOQ(0, 0, 1) is in the same situation as VOQ(0, 0, 0) (i.e., two VOQs), the total link utilization of L_C(0, 0) is

3/16 × 2 × 2 = 3/4.

Thus, in this example of n = m = k = 2, the switch can achieve a throughput of only 75%, unless the internal bandwidth is expanded.

This observation can be extended to the general case. We derive the formula (3) that gives the maximum throughput achieved by RD under uniform traffic; we describe how to obtain (3) in detail in Appendix A. The factor m/n in (3) expresses the expansion ratio. When the expansion ratio is 1.0 (i.e., m = n), the maximum throughput is only a function of n. As described in the above example of n = m = k = 2, the maximum throughput is 0.75. As n increases, the maximum throughput decreases. When n → ∞, the maximum throughput tends to a limiting value (see Appendix B). In other words, to achieve 100% throughput by using RD, the expansion ratio has to be set to approximately 1.6 or more.

IV. CRRD SCHEME

A. CRRD Algorithm

The switch model described in this section is the same as the one described in Section II. However, to simplify the explanation of CRRD, the order of the VOQs in IM(i) is rearranged for dispatching; VOQ(i, j, h) is redefined as VOQ(i, v).⁸

Fig. 4. CRRD scheme.

Fig. 4 illustrates the detailed CRRD algorithm by showing an example. To determine the matching between a request from a VOQ and an output link L_I(i, r), CRRD uses an iterative matching in IM(i). In IM(i), there are m output-link round-robin arbiters and nk VOQ round-robin arbiters. An output-link arbiter associated with L_I(i, r) has its own pointer. A VOQ arbiter associated with each VOQ has its own pointer. In CM(r), there are k round-robin arbiters, each of which corresponds to an L_C(r, j) and has its own pointer.

As described in Section III-A, two phases are also considered in CRRD. In phase 1, CRRD employs an iterative matching by using round-robin arbiters to assign a VOQ to an IM output link. The matching between VOQs and output links is performed within the IM. It is based on round-robin arbitration and is similar to the iSLIP request/grant/accept approach. A major difference is that in CRRD, a VOQ sends a request to every output-link arbiter; in iSLIP, a VOQ sends a request to only the destined output-link arbiter. In phase 2, each selected VOQ that is matched with an output link sends a request from the IM to the CM. CMs send the arbitration results to IMs to complete the matching between the IM and CM.

• Phase 1: Matching within IM
— First iteration
* Step 1: Each nonempty VOQ sends a request to every output-link arbiter.
* Step 2: Each output-link arbiter chooses one nonempty VOQ request in a round-robin fashion by searching from the position of its pointer. It then sends the grant to the selected VOQ.
* Step 3: The VOQ arbiter sends the accept to one granting output-link arbiter, among all those received, in a round-robin fashion by searching from the position of its pointer.
— Subsequent iterations⁹

⁸The purpose of redefining the VOQs is to easily obtain the desynchronization effect.

⁹The number of iterations is designed by considering the limitation of the arbitration time in advance. CRRD tries to choose as many nonempty VOQs in this matching as possible, but it does not always choose the maximum number of available nonempty VOQs if the number of iterations is not sufficient. On the other hand, RD can choose the maximum number of nonempty VOQs in phase 1.
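The output-link, VOQ, and CM arbiters of CRRD are all instances of one primitive: a pointer-based round-robin search over a request vector. A minimal sketch (the name `rr_choose` is ours):

```python
def rr_choose(requests, pointer):
    """Round-robin arbiter: grant the first asserted request at or
    after `pointer`, wrapping around; returns the granted index,
    or None when no request is asserted."""
    n = len(requests)
    for off in range(n):
        idx = (pointer + off) % n
        if requests[idx]:
            return idx
    return None

# The pointer is advanced to one position past the grant only when
# the grant is finally accepted (and, in CRRD, granted by the CM too).
print(rr_choose([False, True, True, False], pointer=2))  # -> 2
```

Because the pointer only advances on a successful match, arbiters that collide in one slot naturally move apart in later slots, which is the desynchronization effect exploited below.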
Fig. 5. Example of desynchronization effect of CRRD (n = m = k = 2).

* Step 1: Each VOQ that was unmatched at the previous iterations sends another request to all the unmatched output-link arbiters.
* Steps 2 and 3: The same procedure is performed as in the first iteration for matching between unmatched nonempty VOQs and unmatched output links.

• Phase 2: Matching between IM and CM
— Step 1: After phase 1 is completed, IM(i) sends the request associated with each matched output link L_I(i, r) to CM(r). Each round-robin arbiter associated with L_C(r, j) then chooses one request by searching from the position of its pointer and sends the grant to the L_I(i, r) of IM(i).
— Step 2: If the IM receives the grant from the CM, it sends the corresponding cell from that VOQ at the next time slot. Otherwise, the IM cannot send the cell at the next time slot. The request that is not granted by the CM will again be attempted at the next time slot, because the pointers that are related to the ungranted requests are not moved.

As with iSLIP, the round-robin pointers of the output-link arbiters and the VOQ arbiters in IM(i), and of the arbiters in CM(r), are updated to one position after the granted position only if the matching within the IM is achieved at the first iteration in phase 1 and the request is also granted by the CM in phase 2.

Fig. 4 shows an example with n = m = k = 2, where CRRD operates at the first iteration in phase 1. At step 1, the nonempty VOQs send requests to all the output-link arbiters. At step 2, the output-link arbiters each select one request according to their pointers' positions. At step 3, a VOQ that receives grants from both output-link arbiters accepts one by using its own VOQ arbiter; a VOQ that receives a single grant accepts it. With one iteration, one output link cannot be matched with any nonempty VOQ. At the next iteration, the matching between unmatched nonempty VOQs and that output link will be performed.

B. Desynchronization Effect of CRRD

While RD causes contention at the CM, as described in Appendix A, CRRD decreases contention at the CM because the pointers of the output-link arbiters, the VOQ arbiters, and the CM arbiters are desynchronized.

We demonstrate how the pointers are desynchronized by using a simple example. Let us consider the case of n = m = k = 2, as shown in Fig. 5. We assume that every VOQ is always occupied with cells. Each VOQ sends a request to be selected as a candidate at every time slot. All the pointers are set to zero at the initial state. Only one iteration in phase 1 is considered here.

At time slot T = 0, since all the pointers are set to zero, only one VOQ in IM(0), which is VOQ(0, 0, 0), can send a cell, through CM(0). The pointers related to the grant are updated from zero to one. At T = 1, three VOQs can send cells, and the pointers related to the grants are updated. Four VOQs can send cells at T = 2. In this situation, 100% switch throughput is achieved. There is no contention at all at the CMs from T = 2 onward, because the pointers are desynchronized.

As in the above example, CRRD achieves the desynchronization effect and provides high throughput even when the switch size is increased.

V. CMSD SCHEME

CRRD overcomes the problem of the limited switch throughput of the RD scheme by using simple round-robin arbiters. CMSD is an improved version of CRRD that preserves CRRD's advantages and provides more scalability.

A. CMSD Algorithm

Two phases are also considered in the description of the CMSD scheme, as in the CRRD scheme. The difference between CMSD and CRRD is how the iterative matching within the IM operates in phase 1.

Fig. 6. CMSD scheme.

We define several notations to describe the CMSD algorithm, which is shown in an example in Fig. 6. G(i, j) denotes a VOQ group that consists of the n VOQs VOQ(i, j, h). In IM(i), there are m master output-link round-robin arbiters, mk slave output-link round-robin arbiters, and nk VOQ round-robin arbiters. Each master output-link arbiter is associated with L_I(i, r) and has its own pointer. Each slave output-link arbiter is associated with G(i, j) and L_I(i, r), is denoted as SL(i, j, r), and has its own pointer. A VOQ arbiter associated with each VOQ has its own pointer.

• Phase 1: Matching within IM
— First iteration
* Step 1: There are two sets of requests issued to the output-link arbiters. One set is a request that is sent from each nonempty VOQ to every slave output-link arbiter SL(i, j, r) associated with its group G(i, j). The other set is a group-level request sent from each G(i, j) that has at least one nonempty VOQ to every master output-link arbiter.
* Step 2: Each master output-link arbiter chooses one request among the VOQ groups independently, in a round-robin fashion, by searching from the position of its pointer, and sends the grant to the slave arbiter SL(i, j, r) that belongs to the selected G(i, j). SL(i, j, r) receives the grant from its master arbiter and selects one VOQ request in a round-robin fashion by searching from the position of its pointer,¹⁰ and sends the grant to the selected VOQ.
* Step 3: The VOQ arbiter sends the accept to one granting master–slave output-link arbiter pair, among all those received, in a round-robin fashion by searching from the position of its pointer.
— Subsequent iterations
* Step 1: Each VOQ that was unmatched at the previous iterations sends a request to all the slave output-link arbiters again. Each G(i, j) that has at least one unmatched nonempty VOQ sends a request to all the unmatched master output-link arbiters again.
* Steps 2 and 3: The same procedure is performed as in the first iteration for matching between the unmatched nonempty VOQs and unmatched output links.

• Phase 2: Matching between IM and CM
This operation is the same as in CRRD.

As with CRRD, the round-robin pointers of the master, slave, and VOQ arbiters in IM(i) and of the arbiters in CM(r) are updated to one position after the granted position only if the matching within the IM is achieved at the first iteration in phase 1 and the request is also granted by the CM in phase 2.

Fig. 6 shows an example with n = m = k = 2, where CMSD is used at the first iteration in phase 1.

¹⁰In the actual hardware design, SL(i, j, r) can start to search for a VOQ request without waiting to receive the grant from a master arbiter. If SL(i, j, r) does not receive a grant, the search by SL(i, j, r) is invalid. This is an advantage of CMSD. Section VII-A discusses the dispatching scheduling time.
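The hierarchical master/slave selection for one output link can be sketched as follows; the function and variable names are illustrative, and for simplicity the slave search is shown after the master grant, whereas the hardware described in the footnote runs both searches in parallel and discards ungranted slave results:

```python
def rr(requests, ptr):
    """Round-robin: first asserted request at or after ptr, wrapping."""
    n = len(requests)
    for off in range(n):
        i = (ptr + off) % n
        if requests[i]:
            return i
    return None

def cmsd_output_link_grant(nonempty, master_ptr, slave_ptrs):
    """Hierarchical grant for one output link: the master arbiter picks
    a VOQ group with at least one nonempty VOQ; the slave arbiter of
    that group picks a VOQ inside it. nonempty[j][h] marks VOQ h of
    group j as nonempty."""
    group_reqs = [any(group) for group in nonempty]
    j = rr(group_reqs, master_ptr)       # master: choose a group
    if j is None:
        return None
    h = rr(nonempty[j], slave_ptrs[j])   # slave: choose a VOQ in it
    return (j, h)

# Two groups of two VOQs; master pointer at 1, slave pointers at 0.
print(cmsd_output_link_grant([[False, True], [True, True]], 1, [0, 0]))  # (1, 0)
```

The point of the hierarchy is that each arbiter searches over a short vector (k groups or n VOQs) instead of all nk VOQs at once, which shortens the arbitration time discussed in Section VII.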
Fig. 7. Example of the desynchronization effect (n = m = k = 2).

At step 1, the nonempty VOQs send requests to their associated slave output-link arbiters, and the VOQ groups that contain them send requests to all the master output-link arbiters. At step 2, the master output-link arbiters select VOQ groups according to their pointers' positions, and the slave arbiters that belong to the selected VOQ groups choose VOQs according to their own pointers. At step 3, a VOQ that receives two grants from master–slave arbiter pairs accepts one by using its own VOQ arbiter; a VOQ that receives a single grant accepts it. In this iteration, one output link is not matched with any nonempty VOQ. At the next iteration, the matching between an unmatched nonempty VOQ and this output link will be performed.

B. Desynchronization Effect of CMSD

CMSD also decreases contention at the CM, because the pointers of the master, slave, VOQ, and CM arbiters are desynchronized.

We illustrate how the pointers are desynchronized by using a simple example. Let us consider the case of n = m = k = 2, as shown in Fig. 7. We assume that every VOQ is always occupied with cells. Each VOQ sends a request to be selected as a candidate at every time slot. All the pointers are set to zero at the initial state. Only one iteration in phase 1 is considered here.

At time slot T = 0, since all the pointers are set to zero, only one VOQ in IM(0), which is VOQ(0, 0, 0), can send a cell, through CM(0). The pointers related to the grant are updated from zero to one. At T = 1, three VOQs can send cells, and the pointers related to the grants are updated. Four VOQs can send cells at T = 2. In this situation, 100% switch throughput is achieved. There is no contention at the CMs from T = 2 onward, because the pointers are desynchronized.

As in the above example, CMSD can also achieve the desynchronization effect and provide high throughput even when the switch size is increased, as in CRRD.

VI. PERFORMANCE OF CRRD AND CMSD

The performance of CRRD and CMSD was evaluated by simulation using uniform and nonuniform traffic.

A. Uniform Traffic

1) CRRD: Fig. 8 shows that CRRD provides higher throughput than RD under uniform traffic. A Bernoulli arrival process is used for the input traffic. Simulation suggests that CRRD can achieve 100% throughput for any number of iterations in the IM.

The reason CRRD provides 100% throughput under uniform traffic is explained as follows. When the offered load is 1.0, and if the idle state, in which the internal link is not fully utilized, still occurs due to contention in the IM and CM, a VOQ that fails in the contention has to store backlogged cells. Under uniform traffic, every VOQ keeps backlogged cells until the idle state is eliminated, i.e., until the stable state is reached.
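The desynchronization behavior described above can be illustrated with a toy single-stage, 2 × 2 iSLIP-style sketch; this is a simplification of the full two-phase CRRD (one iteration, saturated inputs, no CM phase, pointers advance on every accept). After one slot of full contention, the pointers separate and every later slot is fully matched:

```python
def rr(requests, ptr):
    """Round-robin: first asserted request at or after ptr, wrapping."""
    n = len(requests)
    for off in range(n):
        i = (ptr + off) % n
        if requests[i]:
            return i
    return None

def islip_slot(out_ptr, in_ptr, n=2):
    """One request/grant/accept round; all inputs request all outputs."""
    grants = {}                       # input -> list of granting outputs
    for o in range(n):
        g = rr([True] * n, out_ptr[o])
        grants.setdefault(g, []).append(o)
    matches = []
    for i, outs in grants.items():
        req = [o in outs for o in range(n)]
        o = rr(req, in_ptr[i])        # accept one grant round-robin
        matches.append((i, o))
        out_ptr[o] = (i + 1) % n      # advance past the accepted grant
        in_ptr[i] = (o + 1) % n
    return matches

out_ptr, in_ptr = [0, 0], [0, 0]
sizes = [len(islip_slot(out_ptr, in_ptr)) for _ in range(4)]
print(sizes)  # match size per slot: [1, 2, 2, 2]
```

At slot 0, both output pointers select input 0 and only one match forms; the pointer updates then separate the arbiters, and from slot 1 onward the match is complete, mirroring the T = 0, 1, 2 progression in Figs. 5 and 7.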
Fig. 8. Delay performance of CRRD and RD schemes (n = m = k = 8).

Fig. 9. 99.9% delay of CRRD (n = m = k = 8).

In the stable state, every VOQ is occupied with backlogged cells. In this situation, as illustrated in Fig. 5, the desynchronization effect is always obtained. Therefore, even when the offered load is 1.0, no contention occurs in the stable state.

As the number of iterations increases, the delay performance improves when the offered load is less than 0.7, as shown in Fig. 8. This is because the match size between VOQs and output links within the IM increases. When the offered traffic load is not heavy, the desynchronization of the pointers is not completely achieved. At low offered load, the delay performance of RD is better than that of CRRD with one iteration. This is because the matching within the IM in CRRD is not completely achieved, while the complete matching within the IM in RD is always achieved, as described in Section III-A. When the offered load is larger than 0.7, the delay performance of CRRD is not improved by the number of iterations in the IM. The number of iterations in the IM only improves the matching within the IM; it does not improve the matching between the IM and CM. The delay performance is improved by the number of iterations in the IM when the offered load is not heavy. Fig. 8 shows that, when the number of iterations in the IM increases to four, the delay performances almost converge.

Fig. 9 shows that the 99.9% delay of CRRD has the same tendency as the average delay shown in Fig. 8.

Fig. 10. Delay performance of CRRD affected by bursty traffic (n = m = k = 8).

Our simulation shows, as presented in Fig. 10, that even when the input traffic is bursty, CRRD provides 100% throughput for any number of iterations. However, the delay performance under bursty traffic becomes worse than that under nonbursty traffic at the heavy-load condition. The reason for the 100% throughput even for bursty traffic can be explained as in the Bernoulli traffic discussion above. We assume that the burst length is exponentially distributed for the bursty traffic. In this evaluation, the average burst length is set to ten.

Fig. 11. Average match size ratio R1 in phase 1 (n = m = k = 8).

By looking at Fig. 10, we can make the following observations. When the input offered load is very low, the number of iterations does not affect the delay performance significantly. This is because the average match size ratio R1 for the matching within the IM in phase 1 and the average match size ratio R2 for the matching between the IM and CM in phase 2 are close to 1.0, as shown in Figs. 11 and 12.

The average match size ratios R1 and R2 are defined as follows. R1 is defined as

(4)

where M(i) is the match size within IM(i) in phase 1 and V(i) is the number of nonempty VOQs in IM(i) at each time slot. When there is no nonempty VOQ, R1 is set to 1.0. Note that RD always provides R1 = 1.0, as described in Section III-A.
Fig. 12. Average match size ratio R2 in phase 2 (n = m = k = 8).

R2 is defined as

(5)

where the indicator defined in (6) is 1 if there is a request from L_I(i, r) to L_C(r, j) and 0 otherwise, and the indicator defined in (7) is 1 if there is a grant from L_C(r, j) to L_I(i, r) in phase 2 and 0 otherwise. When there is no request, R2 is set to 1.0.

When the load increases to approximately 0.4, R1 of CRRD with one iteration for Bernoulli traffic decreases due to contention in the IM. R1 for the bursty traffic is less affected by the load than R1 for Bernoulli traffic. The desynchronization effect is more difficult to obtain for Bernoulli traffic than for bursty traffic, because the arrival process is more independent. In this load region, where the traffic load is approximately 0.4, the delay performance with one iteration for Bernoulli traffic is worse than that for bursty traffic.

At a moderate load, the number of iterations affects the delay performance for Bernoulli traffic, as shown in Fig. 10. In this load region, R1 with one iteration for Bernoulli traffic is low, as shown in Fig. 11, while R1 with four iterations is nearly 1.0. On the other hand, at a higher load, the number of iterations affects the delay performance for the bursty traffic, as shown in Fig. 10. In this load region, R1 with one iteration for the bursty traffic is low, as shown in Fig. 11, while R1 with four iterations is nearly 1.0.

When the traffic load becomes very heavy, R1 and R2 with any number of iterations, for both Bernoulli and bursty traffic, approach 1.0. In this load region, the desynchronization effect is easy to obtain, as described above.

Fig. 13. Delay performance of CMSD compared with CRRD (n = m = k = 8).

2) CMSD: This section describes the performance of CMSD under uniform traffic by comparing it to CRRD.

Fig. 13 shows that CMSD provides 100% throughput under uniform traffic, as CRRD does, due to the desynchronization effect described in Section V-B. The throughput of CMSD is also independent of the number of iterations in the IM.

In Fig. 13, the number of iterations in the IM improves the delay performance even in heavy-load regions of the CMSD scheme. On the other hand, the delay performance of CRRD is improved by the number of iterations in the IM only when the offered load is not heavy. As a result, in the heavy-load region, the delay performance of CMSD is better than that of CRRD when the number of iterations in the IM is large.

Since the master output-link arbiter of CMSD chooses a VOQ group, not a single VOQ as in CRRD, the master output-link arbiter of CMSD affects the matching between the IM and CM more than the output-link arbiter of CRRD does. Therefore, CMSD makes the matching between the IM and CM more efficient than CRRD when the number of iterations in the IM increases. Although this is one of the advantages of CMSD, the main advantage is that CMSD is easier to implement than CRRD when the switch size increases; this will be described in Section VII.

Fig. 13 shows that, when the number of iterations in the IM increases to four for CMSD, the delay performance almost converges.

Fig. 14 shows via simulation that CMSD also provides 100% throughput under uniform traffic even when a bursty arrival process is considered. The bursty traffic assumption in Fig. 14 is the same as that in Fig. 10. In the heavy-load region, the delay performance of CMSD is better than that of CRRD under bursty traffic when the number of iterations in the IM is large.

3) Expansion Factor: Fig. 15 shows that RD requires an expansion ratio of over 1.5 to achieve 100% throughput, while CRRD and CMSD need no bandwidth expansion.

B. Nonuniform Traffic

We compared the performance of CRRD, CMSD, and RD using nonuniform traffic. The nonuniform traffic considered here is defined by introducing an unbalanced probability w.
Fig. 14. Delay performance of CMSD in bursty traffic compared with CRRD (n = m = k = 8).

Fig. 15. Relationship between switch throughput and expansion factor (n = m = k = 8).

Fig. 16. Switch throughput under nonuniform traffic (n = m = k = 8).

Let us consider an IP s, an OP d, and the offered input load ρ for each IP. The traffic load from IP s to OP d is given by

ρ(s, d) = ρ [w + (1 − w)/N]   if d = s
ρ(s, d) = ρ (1 − w)/N          otherwise     (8)

where N = nk is the switch size. Here, the aggregate offered load that goes to output d from all the IPs is given by

Σ_s ρ(s, d) = ρ [w + (1 − w)/N] + (N − 1) ρ (1 − w)/N = ρ.     (9)

When w = 0, the offered traffic is uniform. On the other hand, when w = 1, it is completely unbalanced: all the traffic of IP s is destined for only OP d, where d = s.

Fig. 16 shows that the throughput of CRRD and CMSD is higher than that of RD when the traffic is slightly unbalanced.¹¹ The throughput of CRRD is almost the same as that of CMSD. We assume that a sufficiently large number of iterations in the IM is adopted in this evaluation, for both the CRRD and CMSD schemes, to observe how nonuniform traffic impacts the matching between the IM and CM. From w = 0 to around w = 0.4, the throughput of CRRD and CMSD decreases. This is because the complete desynchronization of the pointers is hard to achieve under unbalanced traffic.¹² However, when w is larger than 0.4, the throughput of CRRD and CMSD increases, because the contention at the CM decreases. The reason is that, as w increases, more traffic from each IP is destined for its corresponding OP and less traffic is destined for the other OPs, according to (8). That is why the contention at the CM decreases.

Due to the desynchronization effect, both CRRD and CMSD provide better performance than RD when w is smaller than 0.4. In addition, the throughput of CRRD and CMSD is not worse than that of RD at any w larger than 0.4, as shown in Fig. 16.

Developing a new practical dispatching scheme that avoids throughput degradation under nonuniform traffic, without expanding the internal bandwidth, is left for further study.

¹¹Although both results shown in Fig. 16 are obtained by simulation, we also analytically derived the upper-bound throughput of RD in Appendix C.

¹²For the same reason, when the input traffic is very asymmetric, throughput degradation occurs.
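The unbalanced traffic model and the aggregate-load property in (9) can be sketched as follows; the exact form of (8) was lost in extraction, so the diagonal-plus-uniform matrix below is an assumption chosen to match the stated limiting cases (uniform at w = 0, fully unbalanced at w = 1):

```python
def unbalanced_load(rho, w, N):
    """Traffic matrix for the unbalanced model: source s sends
    rho * (w + (1 - w) / N) to output d = s and rho * (1 - w) / N to
    every other output. This diagonal-plus-uniform form is assumed."""
    return [[rho * (w + (1 - w) / N) if d == s else rho * (1 - w) / N
             for d in range(N)] for s in range(N)]

L = unbalanced_load(rho=1.0, w=0.5, N=4)
col_loads = [sum(L[s][d] for s in range(4)) for d in range(4)]
print([round(x, 6) for x in col_loads])  # every output carries the full load rho
```

Whatever the value of w, each output's aggregate load stays ρ, so the model varies only how concentrated the demand is, not its total, which is what makes it a clean stress test for pointer desynchronization.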
VII. IMPLEMENTATION OF CRRD AND CMSD

This section discusses the implementation of CRRD and CMSD. First, we briefly compare them to RD. Since the CRRD and CMSD schemes are based on round-robin arbitration, their implementations are much simpler than that of RD, which needs random-number generators in the IMs. These generators are difficult and expensive to implement at high speed. Therefore, the implementation cost of CRRD and CMSD is reduced compared to RD.

The following sections describe the dispatching scheduling time, the hardware complexity, and the interconnection-wire complexity of CRRD and CMSD.

A. Dispatching Scheduling Time

1) CRRD: In phase 1, at each iteration, two round-robin arbiters are used. One is an output-link arbiter that chooses one VOQ request out of, at most, nk requests. The other is a VOQ arbiter that chooses one grant from, at most, m output-link grants. We assume that priority encoders are used for the implementation of the round-robin arbiters. Since m ≤ nk, the dispatching scheduling time complexity of each iteration in phase 1 is O(log nk) (see Appendix D). As in the case of iSLIP, ideally, m iterations in phase 1 are preferable, because there are m output links that should be matched with VOQs in the IM. Therefore, the time complexity of phase 1 is O(m log nk). However, we do not practically need m iterations. As described in Section VI, simulation suggests that one iteration is sufficient to achieve 100% throughput under uniform traffic. In this case,
840 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 10, NO. 6, DECEMBER 2002
increasing the number of iterations improves the delay performance.

In phase 2, one IM request is chosen from, at most, $k$ requests. The time complexity at phase 2 is $O(\log k)$.

As a result, the time complexity of CRRD is $O(m \log nk + \log k) = O(m \log N)$, where $N = nk$ is the number of switch ports. If we set the number of iterations in phase 1 as $i_1$, where $i_1 \le m$, for practical purposes, the time complexity of CRRD is expressed as $O(i_1 \log N)$.

Next, we consider the required dispatching scheduling time of CRRD. The required time of CRRD, i.e., $T_{\rm CRRD}$, is

$T_{\rm CRRD} = i_1\{c(\log nk + \log m) + t_1\} + c\log k \qquad (10)$

where $c$ is a constant coefficient, which is determined by device technologies. $t_1$ is the transmission delay between arbiters.

2) CMSD: In phase 1, at each iteration, three round-robin arbiters are used. The first one is a master output-link arbiter that chooses one out of, at most, $k$ VOQ-group requests. The second one is a slave output-link arbiter that chooses one out of, at most, $n$ requests. The third one is a VOQ arbiter that chooses one out of, at most, $m$ output-link grants. The slave output-link arbiter can perform its arbitration without waiting for the result of the master output-link arbitration, as described in Section V-A. In other words, both master output-link and slave output-link arbiters operate simultaneously. Since the two arbitrations overlap, the longer arbitration time of either the master or the slave output-link arbitration, plus the VOQ arbitration time, is dominant. Therefore, the time complexity of each iteration in phase 1 is $O(\max(\log k, \log n) + \log m)$.

The time complexity in phase 1 is $O(m\{\max(\log k, \log n) + \log m\})$ if $m$ iterations are adopted.

In phase 2, one is chosen out of, at most, $k$ IM requests. The dispatching scheduling time complexity at phase 2 is $O(\log k)$.

As a result, the time complexity of CMSD is $O(m\{\max(\log k, \log n) + \log m\} + \log k)$. As in the case of CRRD, for practical purposes, the time complexity of CMSD is expressed as $O(i_1\{\max(\log k, \log n) + \log m\})$. When we assume $n = k = m$, the time complexity of CMSD is $O(i_1 \log n)$, which is equivalent to $O(i_1 \log N)$. This is the same order of time complexity as CRRD.

We consider the required dispatching scheduling time of CMSD. With the same assumption as CRRD, the required time of CMSD, i.e., $T_{\rm CMSD}$, is approximately

$T_{\rm CMSD} = i_1\{c(\max(\log k, \log n) + \log m) + t_2\} + c\log k \qquad (11)$

where $t_2$ is the transmission delay between arbiters.

3) Comparison of Required Dispatching Scheduling Times: Although the order of the scheduling time complexity of CRRD is the same as that of CMSD, the required times are different. To compare both required times, we define the ratio of $T_{\rm CMSD}$ to $T_{\rm CRRD}$ as follows:

$R = T_{\rm CMSD}/T_{\rm CRRD}. \qquad (12)$

First, we assume that the transmission delays between arbiters, $t_1$ and $t_2$, are negligible compared with the required round-robin execution time, to clarify the dependency on $n$, $k$, and $m$ and so that we can further eliminate the device-dependent factor $c$. Using (10)–(12), we obtain the following equation:

$R = \dfrac{i_1\{\max(\log k, \log n) + \log m\} + \log k}{i_1\{\log nk + \log m\} + \log k}. \qquad (13)$

TABLE I
RATIO OF $T_{\rm CMSD}$ TO $T_{\rm CRRD}$, $R$ ($n = k = m$)

Table I shows the relationship between $R$ and $i_1$ when we assume $n = k = m$. Note that when $n = k = m$, $R$ depends only on $i_1$ and is independent of $n$.13 When $i_1 \ge 4$, for example, the required time of CMSD is reduced by more than 30% compared with that of CRRD.

In (13), we ignored the factors of $t_1$ and $t_2$. When $t_1$ and $t_2$ are large, the reduction effect of CMSD in Table I decreases. As the distance between two arbiters gets larger, these factors become significant.

13$R = (2i_1 + 1)/(3i_1 + 1)$ when $n = k = m$.

B. Hardware Complexity for Dispatching in the IM

Since the CM arbitration algorithm of CMSD is the same as that of CRRD, we discuss hardware complexities in the IM only.

1) CRRD: According to implementation results presented in [16], the hardware complexity for each round-robin arbiter is $O(x \log x)$, where $x$ is the number of requests to be selected by the arbiter.

Each IM has $m$ output-link arbiters and $nk$ VOQ arbiters. Each output-link arbiter chooses one out of, at most, $nk$ requests. The hardware complexity for all the output-link arbiters is $O(mnk \log nk)$. Each VOQ arbiter chooses one out of, at most, $m$ requests. The hardware complexity for all the VOQ arbiters is $O(nkm \log m)$. Therefore, the hardware complexity of CRRD in the IM is $O(mnk \log nk)$.

2) CMSD: Each IM has $m$ master output-link arbiters, $mk$ slave output-link arbiters, and $nk$ VOQ arbiters. Each master output-link arbiter chooses one out of, at most, $k$ requests. The hardware complexity for all the master output-link arbiters is $O(mk \log k)$. Each slave output-link arbiter chooses one out of, at most, $n$ requests. The hardware complexity for all the slave output-link arbiters is $O(mkn \log n)$. Each VOQ arbiter chooses one out of, at most, $m$ requests. The hardware complexity for all the VOQ arbiters is $O(nkm \log m)$. Therefore, the hardware complexity of CMSD in the IM is $O(mnk \log nk)$. Thus, the order of the hardware complexity of CMSD is the same as that of CRRD.

C. Interconnection-Wire Complexity Between Arbiters

1) CRRD: In CRRD, each VOQ arbiter is connected to all the output-link arbiters with three groups of wires. The first group is used by a VOQ arbiter to send requests to the output-link arbiters. The second one is used by a VOQ arbiter to receive grants from the output-link arbiters. The third one is used by a VOQ arbiter to send grants to the output-link arbiters.

As the switch size gets larger, the number of wires that connect all the VOQ arbiters with all the output-link arbiters becomes
large. In this situation, a large number of crosspoints is produced. That causes high layout complexity. When we implement the dispatching scheduler in one chip, the layout complexity affects the interconnection between transistors. However, if we cannot implement it in one chip due to a gate limitation, we need to use multiple chips. In this case, the layout complexity influences not only the interconnection within a chip, but also the interconnection between chips on printed circuit boards (PCBs). In addition, the number of pins in the scheduler chips of CRRD increases.

Fig. 17. Interconnection-wire model.

Consider that there are $N_s$ source nodes and $N_d$ destination nodes. We assume that each source node is connected to all the destination nodes with wires without detouring, as shown in Fig. 17. The number of crosspoints $X(N_s, N_d)$ for the interconnection wires between nodes is given by

$X(N_s, N_d) = \dfrac{N_s(N_s - 1)N_d(N_d - 1)}{4}. \qquad (14)$

The derivation of (14) is described in Appendix E. Let the number of crosspoints for interconnection wires between the VOQs and output-link arbiters be $W_{\rm CRRD}$. $W_{\rm CRRD}$ is determined by

$W_{\rm CRRD} = 3X(nk, m). \qquad (15)$

Note that the factor of three in (15) expresses the number of groups of wires, as described at the beginning of Section VII-C-1.

2) CMSD: In CMSD, each VOQ arbiter is connected to its own slave output-link arbiters with three groups of wires. The first group is used by a VOQ arbiter to send requests to the slave output-link arbiters. The second one is used by a VOQ arbiter to receive grants from the slave output-link arbiters. The third one is used by a VOQ arbiter to send grants to the slave output-link arbiters.

In the same way, each VOQ group is connected to all the master output-link arbiters with three groups of wires. The first is used by a VOQ group to send requests to the master output-link arbiters. The second is used by a VOQ group to receive grants from the master output-link arbiters. The third is used by a VOQ group to send grants to the master output-link arbiters.

Let the number of crosspoints for interconnection wires between arbiters in CMSD be $W_{\rm CMSD}$. $W_{\rm CMSD}$ is determined by

$W_{\rm CMSD} = 3kX(n, m) + 3X(k, m). \qquad (16)$

The first term on the right-hand side is the number of crosspoints of the interconnection wires between VOQs and slave output-link arbiters in the $k$ VOQ groups. The second term is the number of crosspoints of the interconnection wires between VOQ groups and master output-link arbiters.

3) Comparison of $W_{\rm CRRD}$ and $W_{\rm CMSD}$: As the switch size increases, $W_{\rm CRRD}$ becomes much larger than $W_{\rm CMSD}$. In general, when $n = k = m$, $W_{\rm CMSD}/W_{\rm CRRD} \approx 1/n$. For example, when $n = k = m = 8$, 16, and 32, the number of crosspoints of the interconnection wires of CMSD is reduced by 87.5%, 93.8%, and 96.9%, respectively, compared with that of CRRD. Thus, CMSD can dramatically reduce the number of crosspoints.

VIII. CONCLUSIONS

A Clos-network switch architecture is attractive because of its scalability. Unfortunately, previously proposed implementable dispatching schemes are not able to achieve high throughput unless the internal bandwidth is expanded.

First, we have introduced a novel dispatching scheme called CRRD for the Clos-network switch. CRRD provides high switch throughput without increasing internal bandwidth, while only simple round-robin arbiters are employed. Our simulation showed that CRRD achieves 100% throughput under uniform traffic. Even when the offered load reaches 1.0, the pointers of the round-robin arbiters that we use at the IMs and CMs are completely desynchronized and contention is avoided.

Second, we have presented CMSD with hierarchical round-robin arbitration, which was developed as an improved version of CRRD to make it more scalable in terms of dispatching scheduling time and interconnection complexity in the dispatching scheduler when the size of the switch increases. We showed that CMSD preserves the advantages of CRRD and achieves a more than 30% reduction of the dispatching scheduling time when the arbitration time is significant. Furthermore, with CMSD, the number of interconnection crosspoints is dramatically reduced, with a ratio of approximately $1/n$ when $n = k = m$. When $n = k = m = 32$, a 96.9% reduction effect is obtained. This makes CMSD easier to implement than CRRD when the switch size becomes large.

This paper assumes that the dispatching scheduling must be completed within one time slot. However, this constraint becomes a bottleneck when the port speed becomes high. In a future study, pipeline-based approaches should be considered. A pipeline-based scheduling approach for crossbar switches is described in [17], which relaxes the scheduling timing constraint without throughput degradation.

APPENDIX A
MAXIMUM SWITCH THROUGHPUT WITH RD SCHEME UNDER UNIFORM TRAFFIC

To derive (3), we consider the throughput of one output link. This is the same as the switch throughput that we would like to derive because the conditions of all the output links are the same. Therefore,

(17)

First, we define several notations. $p$ is the probability that one output link in an IM is used by
a VOQ’s request. Here, $p$ does not depend on which output link in the IM is used for the request because it is selected randomly. $q$ is the probability that the request wins the contention for the output link at the CM.

By using $p$ and $q$, the switch throughput is expressed in the following equation:

(18)

Here, in the last form of (18), we consider that $p$ and $q$ do not depend on either the IM index or the output-link index.

Since the IM has $nk$ VOQs, each VOQ is equally selected by each output link in the IM. $p$ is given by

(19)

Next, we consider $q$. $q$ is obtained as follows:

(20)

The first term on the right-hand side of (20) is the winning probability of the request at the CM when there are, in total, $k$ requests for the output link. The second term is the winning probability of the request at the CM when there are $k - 1$ requests. In the same way, the $i$th term is the winning probability of the request at the CM when there are $k - i + 1$ requests.

The throughput can then be expressed in the following equation:

(21)

By using (17)–(19) and (21), we can derive (3).

APPENDIX B
DERIVATION OF THE TERM IN (3)

We sketch the term in (3) according to the decreasing order in the following:

(24)

In the above deduction, we use the following equation:

(25)

APPENDIX C
UPPER BOUND OF MAXIMUM SWITCH THROUGHPUT WITH RD SCHEME UNDER NONUNIFORM TRAFFIC

The upper bound of the maximum switch throughput with the RD scheme under nonuniform traffic is determined by (26), shown at the bottom of the page.

Equation (26) can be easily derived in the same way as (3). However, since the traffic load of each VOQ is different, (3) has
to be modified. The first term on the right-hand side of (26) is obtained by considering the probability that a heavy-loaded VOQ sends a request to the CM and wins the contention. The second term on the right-hand side is obtained by considering the probability that, when a heavy-loaded VOQ and nonheavy-loaded VOQs send a request to the CM, a nonheavy-loaded VOQ wins the contention. The third term on the right-hand side is obtained by considering the probability that a heavy-loaded VOQ does not send a request to the CM, nonheavy-loaded VOQs send requests to the CM, and a nonheavy-loaded VOQ wins the contention.

Note that (26) gives the upper bound of the switch throughput of the RD scheme in the switch model described in Section II. This is because only one cell can be read from the same VOQ at the same time slot in the switch model, as described in Section II. However, we assume that more than one cell can be read from the same VOQ at the same time slot in the derivation of (26) for simplicity. This different assumption causes different results under nonuniform traffic.14

Fig. 18. Comparison between simulation and analytical upper-bound results under nonuniform traffic ($n = m = k = 8$).

Fig. 18 compares the simulation result with the upper-bound result obtained by (26). As $w$ increases to approximately 0.6, the difference between the simulation and analytical results increases due to the reason described above. In the region where $w$ is larger than 0.6, the difference decreases with $w$ because the contention probability at the CM decreases.

14However, under uniform traffic, both results are exactly the same because all the VOQs can be assumed to be in a nonempty state. This does not cause any difference in the results.

APPENDIX D
TIME COMPLEXITY OF ROUND-ROBIN ARBITER

In a round-robin arbiter, the arbitration time complexity is dominated by an $x$-to-$\log_2 x$ priority encoder, where $x$ is the number of inputs for the arbiter.

The priority encoder is analogous to a maximum or minimum search algorithm. We use a binary search tree, whose computation complexity (and, thus, the timing) is $O(d)$, where $d$ is the depth of the tree, and where $d$ is equal to $\log_2 x$ for a binary tree.

The priority encoder, in the canonical form, has two stages: the AND-gate stage and the OR-gate stage. In the AND-gate stage, there are $x$ AND gates, where the largest gate has $x$ inputs. An $x$-input AND gate can be built by using a number of two-input AND gates, where the level of delay is $\log_2 x$, i.e., in $O(\log x)$. As for the OR-gate stage, there are $\log_2 x$ OR gates, where the largest has $x/2$ inputs. The largest AND-gate delay is then the delay order for the priority encoder.

APPENDIX E
DERIVATION OF (14)

We define an interconnection wire between source node $i$, where $1 \le i \le N_s$, and destination node $j$, where $1 \le j \le N_d$, as $w(i, j)$. We also define the number of crosspoints that wire $w(i, j)$ has as $f(i, j)$.

The total number of crosspoints of the interconnection wires between $N_s$ source nodes and $N_d$ destination nodes is expressed as

$X(N_s, N_d) = \dfrac{1}{2}\sum_{i=1}^{N_s}\sum_{j=1}^{N_d} f(i, j). \qquad (27)$

Here, the factor of 1/2 on the right-hand side of (27) eliminates the double counting of the number of crosspoints.

First consider $f(1, j)$. Since $w(1, 1)$ has no crosspoints with other wires, $f(1, 1) = 0$. Since $w(1, 2)$ has crosspoints with $w(2, 1), \ldots, w(N_s, 1)$, $f(1, 2) = N_s - 1$. $w(1, 3)$ has crosspoints with $w(2, 1), \ldots, w(N_s, 1)$ and $w(2, 2), \ldots, w(N_s, 2)$; therefore, $f(1, 3) = 2(N_s - 1)$. In general, $f(1, j) = (j - 1)(N_s - 1)$. In the same way, $f(i, 1) = (i - 1)(N_d - 1)$.

Next, consider $f(2, j)$. Since $w(2, 1)$ has crosspoints with $w(1, 2), \ldots, w(1, N_d)$, $f(2, 1) = N_d - 1$. $w(2, 2)$ has $N_d - 2$ crosspoints with $w(1, 3), \ldots, w(1, N_d)$ and $N_s - 2$ crosspoints with $w(3, 1), \ldots, w(N_s, 1)$. Therefore, $f(2, 2) = (N_d - 2) + (N_s - 2)$. $w(2, 3)$ has $N_d - 3$ crosspoints with $w(1, 4), \ldots, w(1, N_d)$, and $2(N_s - 2)$ crosspoints with $w(3, 1), \ldots, w(N_s, 1)$ and $w(3, 2), \ldots, w(N_s, 2)$. Therefore, $f(2, 3) = (N_d - 3) + 2(N_s - 2)$. In the same way, $f(2, j) = (N_d - j) + (j - 1)(N_s - 2)$.

We can then derive the following: in general, $f(i, j)$ is given by

$f(i, j) = (i - 1)(N_d - j) + (N_s - i)(j - 1). \qquad (28)$

We substitute (28) into (27). The following equation is obtained:

$X(N_s, N_d) = \dfrac{1}{2}\sum_{i=1}^{N_s}\sum_{j=1}^{N_d}\{(i - 1)(N_d - j) + (N_s - i)(j - 1)\} = \dfrac{N_s(N_s - 1)N_d(N_d - 1)}{4}. \qquad (29)$
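The crosspoint counts in (14)–(16) and the reduction ratios quoted in Section VII-C can be checked numerically. The following Python sketch (the function names are ours, not from the paper) counts crosspoints by brute force from the crossing condition behind (28), compares the result with the closed form in (14), and then evaluates the CMSD-to-CRRD wiring reduction for $n = k = m = 8$, 16, and 32 using the reconstructed forms of (15) and (16).

```python
from itertools import combinations

def crosspoints_bruteforce(ns, nd):
    # Wires w(i, j) run straight from source i to destination j (Fig. 17);
    # two wires cross iff their source order and destination order are opposite.
    wires = [(i, j) for i in range(1, ns + 1) for j in range(1, nd + 1)]
    return sum(1 for (i1, j1), (i2, j2) in combinations(wires, 2)
               if (i1 - i2) * (j1 - j2) < 0)

def crosspoints_closed_form(ns, nd):
    # Eq. (14): X(Ns, Nd) = Ns(Ns - 1)Nd(Nd - 1)/4
    return ns * (ns - 1) * nd * (nd - 1) // 4

def w_crrd(n, k, m):
    # Eq. (15): nk VOQ arbiters fully connected to m output-link arbiters,
    # with three groups of wires (requests, grants, accept-grants).
    return 3 * crosspoints_closed_form(n * k, m)

def w_cmsd(n, k, m):
    # Eq. (16): k groups of (n VOQ arbiters x m slave arbiters),
    # plus (k VOQ groups x m master arbiters), three wire groups each.
    return 3 * k * crosspoints_closed_form(n, m) + 3 * crosspoints_closed_form(k, m)

# The closed form (14) agrees with brute-force counting.
for ns, nd in [(3, 4), (5, 5), (8, 2)]:
    assert crosspoints_bruteforce(ns, nd) == crosspoints_closed_form(ns, nd)

for n in (8, 16, 32):
    reduction = 1 - w_cmsd(n, n, n) / w_crrd(n, n, n)
    print(f"n = k = m = {n}: crosspoint reduction = {100 * reduction:.1f}%")
```

Running the sketch reproduces the 87.5%, 93.8%, and 96.9% reductions cited in Section VII-C-3 and the conclusions, i.e., a ratio of exactly $1/n$ when $n = k = m$.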
REFERENCES

[1] T. Anderson, S. Owicki, J. Saxe, and C. Thacker, “High speed switch scheduling for local area networks,” ACM Trans. Comput. Syst., vol. 11, no. 4, pp. 319–352, Nov. 1993.
[2] T. Chaney, J. A. Fingerhut, M. Flucke, and J. S. Turner, “Design of a Gigabit ATM switch,” in Proc. IEEE INFOCOM, Apr. 1997, pp. 2–11.
[3] H. J. Chao, B.-S. Choe, J.-S. Park, and N. Uzun, “Design and implementation of Abacus switch: A scalable multicast ATM switch,” IEEE J. Select. Areas Commun., vol. 15, pp. 830–843, June 1997.
[4] H. J. Chao and J.-S. Park, “Centralized contention resolution schemes for a large-capacity optical ATM switch,” in Proc. IEEE ATM Workshop, Fairfax, VA, May 1998, pp. 11–16.
[5] H. J. Chao, “Saturn: A terabit packet switch using Dual Round-Robin,” in Proc. IEEE GLOBECOM, Dec. 2000, pp. 487–495.
[6] F. M. Chiussi, J. G. Kneuer, and V. P. Kumar, “Low-cost scalable switching solutions for broadband networking: The ATLANTA architecture and chipset,” IEEE Commun. Mag., pp. 44–53, Dec. 1997.
[7] C. Clos, “A study of nonblocking switching networks,” Bell Syst. Tech. J., pp. 406–424, Mar. 1953.
[8] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms. Cambridge, MA: MIT Press, 1997, sec. 13.2, p. 246.
[9] C. Y. Lee and A. Y. Oruç, “A fast parallel algorithm for routing unicast assignments in Benes networks,” IEEE Trans. Parallel Distrib. Syst., vol. 6, pp. 329–333, Mar. 1995.
[10] T. T. Lee and S.-Y. Liew, “Parallel routing algorithm in Benes–Clos networks,” in Proc. IEEE INFOCOM, 1996, pp. 279–286.
[11] N. McKeown, “Scheduling algorithm for input-queued cell switches,” Ph.D. dissertation, Dept. Elect. Eng. Comput. Sci., Univ. California at Berkeley, Berkeley, CA, 1995.
[12] N. McKeown, V. Anantharam, and J. Walrand, “Achieving 100% throughput in an input-queued switch,” in Proc. IEEE INFOCOM, 1996, pp. 296–302.
[13] N. McKeown, M. Izzard, A. Mekkittikul, W. Ellersick, and M. Horowitz, “Tiny Tera: A packet switch core,” IEEE Micro, pp. 26–33, Jan.–Feb. 1997.
[14] N. McKeown, “The iSLIP scheduling algorithm for input-queued switches,” IEEE/ACM Trans. Networking, vol. 7, pp. 188–200, Apr. 1999.
[15] A. Mekkittikul and N. McKeown, “A practical scheduling algorithm to achieve 100% throughput in input-queued switches,” in Proc. IEEE INFOCOM, 1998, pp. 792–799.
[16] E. Oki, N. Yamanaka, Y. Ohtomo, K. Okazaki, and R. Kawano, “A 10-Gb/s (1.25 Gb/s × 8) 4 × 2 0.25-μm CMOS/SIMOX ATM switch based on scalable distributed arbitration,” IEEE J. Solid-State Circuits, vol. 34, pp. 1921–1934, Dec. 1999.
[17] E. Oki, R. Rojas-Cessa, and H. J. Chao, “A pipeline-based approach for maximal-sized matching scheduling in input-buffered switches,” IEEE Commun. Lett., vol. 5, pp. 263–265, June 2001.
[18] J. Turner and N. Yamanaka, “Architectural choices in large scale ATM switches,” IEICE Trans. Commun., vol. E81-B, no. 2, pp. 120–137, Feb. 1998.

Eiji Oki (M’95) received the B.E. and M.E. degrees in instrumentation engineering and the Ph.D. degree in electrical engineering from Keio University, Yokohama, Japan, in 1991, 1993, and 1999, respectively.
In 1993, he joined the Communication Switching Laboratories, Nippon Telegraph and Telephone Corporation (NTT), Tokyo, Japan, where he has researched multimedia-communication network architectures based on ATM techniques, traffic-control methods, and high-speed switching systems with the NTT Network Service Systems Laboratories. From 2000 to 2001, he was a Visiting Scholar with Polytechnic University of New York, Brooklyn. He is currently engaged in research and development of high-speed optical IP backbone networks as a Research Engineer with NTT Network Innovation Laboratories, Tokyo, Japan. He coauthored Broadband Packet Switching Technologies (New York: Wiley, 2001).
Dr. Oki is a member of the Institute of Electronics, Information and Communication Engineers (IEICE), Japan. He was the recipient of the 1998 Switching System Research Award and the 1999 Excellent Paper Award presented by IEICE, Japan, and the IEEE ComSoc 2001 APB Outstanding Young Researcher Award.

Zhigang Jing (M’01) received the B.S., M.S., and Ph.D. degrees in optical fiber communication from the University of Electronic Science and Technology of China, Chengdu, China, in 1993, 1996, and 1999, respectively.
He then joined the Department of Electrical Engineering, Tsinghua University, Beijing, China, where he was a Post-Doctoral Fellow. Since March 2000, he has been with the Department of Electrical Engineering, Polytechnic University of New York, Brooklyn, as a Post-Doctoral Fellow. His current research interests include high-speed networks, terabit IP routers, multimedia communication, Internet QoS, DiffServ, and MPLS.

Roberto Rojas-Cessa (S’97–M’01) received the B.S. degree in electronic instrumentation from the Universidad Veracruzana (University of Veracruz), Veracruz, Mexico, in 1991. He graduated with an Honorary Diploma. He received the M.S. degree in electrical engineering from the Centro de Investigacion y de Estudios Avanzados del Instituto Politecnico Nacional (CINVESTAV-IPN), Mexico City, Mexico, in 1995, and the M.S. degree in computer engineering and the Ph.D. degree in electrical engineering from the Polytechnic University of New York, Brooklyn, in 2000 and 2001, respectively.
In 1993, he was with the Sapporo Electronics Center, Japan, for a Microelectronics Certification. From 1994 to 1996, he was with UNITEC and Iberchip, Mexico City, Mexico. From 1996 to 2001, he was a Teaching and Research Fellow with the Polytechnic University of Brooklyn. In 2001, he was a Post-Doctoral Fellow with the Department of Electrical and Computer Engineering, Polytechnic University of Brooklyn. He has been involved in application-specific integrated-circuit (ASIC) design for biomedical applications and research on scheduling schemes for high-speed packet switches and switch reliability. He is currently an Assistant Professor with the Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark. His research interests include high-speed and high-performance switching, fault tolerance, and implementable scheduling algorithms for packet switches.
Dr. Rojas-Cessa is a member of the Institute of Electronics, Information and Communication Engineers (IEICE), Japan. He was a CONACYT Fellow.

H. Jonathan Chao (S’83–M’85–SM’95–F’01) received the B.S. and M.S. degrees in electrical engineering from the National Chiao Tung University, Taiwan, R.O.C., and the Ph.D. degree in electrical engineering from The Ohio State University, Columbus.
In January 1992, he joined the Polytechnic University of New York, Brooklyn, where he is currently a Professor of electrical engineering. He has served as a consultant for various companies, such as NEC, Lucent Technologies, and Telcordia, in the areas of ATM switches, packet scheduling, and MPLS traffic engineering. He has given short courses to industry on the subjects of IP/ATM/SONET networks for over a decade. He was a cofounder and, from 2000 to 2001, the CTO of Coree Networks, where he led a team in the implementation of a multiterabit packet switch system with carrier-class reliability. From 1985 to 1992, he was a Member of Technical Staff with Telcordia, where he was involved in transport and switching system architecture designs and ASIC implementations, such as the first SONET-like framer chip, ATM layer chip, sequencer chip (the first chip handling packet scheduling), and ATM switch chip. From 1977 to 1981, he was a Senior Engineer with Telecommunication Laboratories, Taiwan, R.O.C., where he was involved with circuit designs for a digital telephone switching system. He has authored or coauthored over 100 journal and conference papers. He coauthored Broadband Packet Switching Technologies (New York: Wiley, 2001) and Quality of Service Control in High-Speed Networks (New York: Wiley, 2001). He has performed research in the areas of terabit packet switches/routers and QoS control in IP/ATM/MPLS networks. He holds 19 patents, with five pending.
Dr. Chao served as a Guest Editor for the IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS with special topics on “Advances in ATM Switching Systems for B-ISDN” (June 1997) and “Next Generation IP Switches and Routers” (June 1999). He also served as an Editor for the IEEE/ACM TRANSACTIONS ON NETWORKING (1997–2000). He was the recipient of the 1987 Telcordia Excellence Award. He was a corecipient of the 2001 Best Paper Award of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY.