IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 10, NO. 6, DECEMBER 2002

Concurrent Round-Robin-Based Dispatching Schemes for Clos-Network Switches

Eiji Oki, Member, IEEE, Zhigang Jing, Member, IEEE, Roberto Rojas-Cessa, Member, IEEE, and H. Jonathan Chao, Fellow, IEEE

Abstract—A Clos-network switch architecture is attractive because of its scalability. Previously proposed implementable dispatching schemes from the first stage to the second stage, such as random dispatching (RD), are not able to achieve high throughput unless the internal bandwidth is expanded. This paper presents two round-robin-based dispatching schemes to overcome the throughput limitation of the RD scheme. First, we introduce a concurrent round-robin dispatching (CRRD) scheme for the Clos-network switch. The CRRD scheme provides high switch throughput without expanding internal bandwidth. CRRD implementation is very simple because only simple round-robin arbiters are adopted. We show via simulation that CRRD achieves 100% throughput under uniform traffic. When the offered load reaches 1.0, the pointers of the round-robin arbiters at the first- and second-stage modules are completely desynchronized and contention is avoided. Second, we introduce a concurrent master–slave round-robin dispatching (CMSD) scheme as an improved version of CRRD to make it more scalable. CMSD uses hierarchical round-robin arbitration. We show that CMSD preserves the advantages of CRRD, reduces the scheduling time by 30% or more when arbitration time is significant, and has a dramatically reduced number of crosspoints of the interconnection wires between round-robin arbiters in the dispatching scheduler, with a ratio of 1/√N, where N is the switch size. This makes CMSD easier to implement than CRRD when the switch size becomes large.

Index Terms—Arbitration, Clos-network switch, dispatching, packet switch, throughput.

I. INTRODUCTION

AS THE Internet continues to grow, high-capacity switches and routers are needed for backbone networks. Several approaches have been presented for high-speed packet switching systems [2], [3], [13]. Most high-speed packet switching systems use a fixed-sized cell in the switch fabric. Variable-length packets are segmented into several fixed-sized cells when they arrive, switched through the switch fabric, and reassembled into packets before they depart.

For implementation in a high-speed switching system, there are mainly two approaches. One is a single-stage switch architecture. An example of the single-stage architecture is a crossbar switch, in which identical switching elements are arranged on a matrix plane. Several high-speed crossbar switches are described in [4], [13], [16]. However, the number of I/O pins in a crossbar chip limits the switch size. This makes a large-scale switch difficult to implement cost-effectively as the number of chips becomes large.

The other approach is to use a multiple-stage switch architecture, such as a Clos-network switch [7]. The Clos-network switch architecture, which is a three-stage switch, is very attractive because of its scalability. We can categorize the Clos-network switch architecture into two types. One has buffers to store cells in the second-stage modules; the other has no buffers in the second-stage modules.

Reference [2] demonstrated a gigabit asynchronous transfer mode (ATM) switch using buffers in the second stage. In this architecture, every cell is randomly distributed from the first stage to the second-stage modules to balance the traffic load in the second stage. The purpose of implementing buffers in the second-stage modules is to resolve contention among cells from different first-stage modules [18]. The internal speed-up factor was suggested to be set to more than 1.25 in [2]. However, this approach requires a re-sequence function at the third-stage modules or the latter modules, because the buffers in the second-stage modules cause an out-of-sequence problem. Furthermore, as the port speed increases, this re-sequence function makes a switch more difficult to implement.

Reference [6] developed an ATM switch by using nonbuffered second-stage modules. This approach is promising even if the port line speed increases, because it does not cause the out-of-sequence problem. Since there is no buffer in the second-stage modules to resolve the contention, how to dispatch cells from the first to the second stage becomes an important issue. There are some buffers to absorb contention at the first and third stages.¹

A random dispatching (RD) scheme is used for cell dispatching from the first to the second stage [6], as it is adopted in the case of the buffered second-stage modules in [2]. This scheme attempts to distribute traffic evenly to the second-stage modules. However, RD is not able to achieve a high throughput unless the internal bandwidth is expanded, because the contention at the second stage cannot be avoided.

Manuscript received December 14, 2000; revised November 1, 2001; approved by IEEE/ACM TRANSACTIONS ON NETWORKING Editor N. McKeown. This work was supported in part by the New York State Center for Advanced Technology in Telecommunications (CATT), and in part by the National Science Foundation under Grant 9906673 and Grant ANI-9814856.
E. Oki was with the Polytechnic University of New York, Brooklyn, NY 11201 USA. He is now with the NTT Network Innovation Laboratories, Tokyo 180-8585, Japan (e-mail: oki.eiji@lab.ntt.co.jp).
Z. Jing and H. J. Chao are with the Department of Electrical Engineering, Polytechnic University of New York, Brooklyn, NY 11201 USA (e-mail: zgjing@kings.poly.edu; chao@poly.edu).
R. Rojas-Cessa was with the Department of Electrical Engineering, Polytechnic University of New York, Brooklyn, NY 11201 USA. He is now with the Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, NJ 07102 USA (e-mail: rojasccs@njit.edu).
Digital Object Identifier 10.1109/TNET.2002.804823

¹Although there are several studies on scheduling algorithms [9], [10] where every stage has no buffer, the assumption used in that literature is out of the scope of this paper. This is because the assumption requires a contention-resolution function for output ports (OPs) before cells enter the Clos-network switch. This function is not easy to implement in high-speed switching systems.

1063-6692/02$17.00 © 2002 IEEE
To achieve 100% throughput under uniform traffic² by using RD, the internal expansion ratio has to be set to approximately 1.6 when the switch size is large [6]. One question arises: is it possible to achieve high throughput by using a practical dispatching scheme, without allocating any buffers in the second stage (to avoid the out-of-sequence problem) and without expanding the internal bandwidth? This paper presents two solutions to this question.

First, we introduce an innovative dispatching scheme called the concurrent round-robin dispatching (CRRD) scheme for a Clos-network switch. The basic idea of the novel CRRD scheme is to use the desynchronization effect [11] in the Clos-network switch. The desynchronization effect has been studied using such simple scheduling algorithms as iSLIP [11], [14] and dual round-robin matching (DRRM) [4], [5] in an input-queued crossbar switch. CRRD provides high switch throughput without increasing internal bandwidth, and its implementation is very simple because only simple round-robin arbiters are employed. We show via simulation that CRRD achieves 100% throughput under uniform traffic. With slightly unbalanced traffic, we also show that CRRD provides better performance than RD.

Second, this paper describes a scalable round-robin-based dispatching scheme, called the concurrent master–slave round-robin dispatching (CMSD) scheme. CMSD is an improved version of CRRD that provides more scalability. To make CRRD more scalable while preserving its advantages, we introduce two sets of output-link round-robin arbiters in the first-stage module: master arbiters and slave arbiters. These two sets of arbiters operate in a hierarchical round-robin manner. The dispatching scheduling time is reduced, as is the interconnection complexity of the dispatching scheduler. This makes the hardware of CMSD easier to implement than that of CRRD when the switch size becomes large. In addition, CMSD preserves the advantage of CRRD in that the desynchronization effect is obtained in a Clos-network switch. Simulation suggests that CMSD also achieves 100% throughput under uniform traffic without expanding the internal switch capacity.

Fig. 1 categorizes several scheduling schemes and shows the analogy of our proposed schemes. In crossbar switches, iSLIP, a round-robin-based algorithm, was developed to overcome the throughput limitation of the parallel iterative matching (PIM) algorithm [1], which uses randomness. In Clos-network switches, CRRD and CMSD, two round-robin-based algorithms, are developed to overcome the throughput limitation and the high implementation complexity of RD, which also uses randomness.

Fig. 1. Analogy among scheduling schemes.

The remainder of this paper is organized as follows. Section II describes the Clos-network switch model that we reference throughout this paper. Section III explains the throughput limitation of RD. Section IV introduces CRRD. Section V describes CMSD as an improved version of CRRD. Section VI presents a performance study of CRRD and CMSD. Section VII discusses the implementation of CRRD and CMSD. Section VIII summarizes the key points.

II. CLOS-NETWORK SWITCH MODEL

Fig. 2 shows a three-stage Clos-network switch. The terminology used in this paper is as follows.

IM: Input module at the first stage.
CM: Central module at the second stage.
OM: Output module at the third stage.
n: Number of input ports (IPs)/output ports (OPs) in each IM/OM, respectively.
k: Number of IMs/OMs.
m: Number of CMs.
i: IM number, where 0 ≤ i ≤ k − 1.
j: OM number, where 0 ≤ j ≤ k − 1.
h: IP/OP number in each IM/OM, respectively, where 0 ≤ h ≤ n − 1.
r: CM number, where 0 ≤ r ≤ m − 1.
IM(i): (i + 1)th IM.
CM(r): (r + 1)th CM.
OM(j): (j + 1)th OM.
IP(i, h): (h + 1)th IP at IM(i).
OP(j, h): (h + 1)th OP at OM(j).
VOQ(i, j, h): Virtual output queue (VOQ) at IM(i) that stores cells destined for OP(j, h).
L_I(i, r): Output link at IM(i) that is connected to CM(r).
L_C(r, j): Output link at CM(r) that is connected to OM(j).

Fig. 2. Clos-network switch with VOQs in the IMs.

²A switch can achieve 100% throughput under uniform traffic if the switch is stable. The stable state is defined in [12] and [15]. Note that the 100% definition in [15] is more general than the one used here, because all independent and admissible arrivals are considered in [15].
The first stage consists of k IMs, each of which has an n × m dimension. The second stage consists of m bufferless CMs, each of which has a k × k dimension. The third stage consists of k OMs, each of which has an m × n dimension.

An IM(i) has nk VOQs to eliminate head-of-line (HOL) blocking. A VOQ is denoted as VOQ(i, j, h). Each VOQ(i, j, h) stores cells that go from IM(i) to OP(j, h) at OM(j). A VOQ can receive at most n cells from the n IPs and can send one cell to a CM in one cell time slot.³ Each IM(i) has m output links; each output link L_I(i, r) is connected to CM(r).

A CM(r) has k output links, each of which is denoted as L_C(r, j) and is connected to one of the k OMs, OM(j).

An OM(j) has n OPs, each of which is OP(j, h) and has an output buffer.⁴ Each output buffer receives at most m cells in one cell time slot, and each OP at the OM forwards one cell in a first-in-first-out (FIFO) manner to the output line.⁵

Fig. 3. Example of RD scheme (n = m = k = 2).

III. RD SCHEME

The ATLANTA switch, developed by Lucent Technologies, uses random selection in its dispatching algorithm [6]. This section describes the basic concept and performance characteristics of the RD scheme. This description helps in understanding the CRRD and CMSD schemes.

A. RD Algorithm

Two phases are considered for dispatching from the first to the second stage. The details of RD are described in [6]; we show an example of RD in Section III-B. In phase 1, up to m VOQs from each IM are selected as candidates, and each selected VOQ is assigned to an IM output link. A request that is associated with this output link is sent from the IM to the CM. This matching between VOQs and output links is performed only within the IM; the number of VOQs chosen in this matching is always the maximum available, i.e., the smaller of m and the number of nonempty VOQs. In phase 2, each selected VOQ that is associated with an IM output link sends a request from the IM to the CM. The CMs respond with the arbitration results to the IMs so that the matching between IMs and CMs can be completed.

• Phase 1: Matching within IM
— Step 1: At each time slot, the nonempty VOQs send requests for candidate selection.
— Step 2: IM(i) selects up to m requests out of the nonempty VOQs. For example, a round-robin arbitration can be employed for this selection [6]. Then, IM(i) proposes up to m candidate cells to randomly selected CMs.
• Phase 2: Matching between IM and CM
— Step 1: A request that is associated with L_I(i, r) is sent out to the corresponding CM(r). A CM(r) has k output links L_C(r, j), each of which corresponds to an OM(j). An arbiter that is associated with L_C(r, j) selects one request among the k requests; a random-selection scheme is used for this arbitration.⁶ CM(r) sends up to k grants, each of which is associated with one L_C(r, j), to the corresponding IMs.
— Step 2: If a VOQ at the IM receives the grant from the CM, it sends the corresponding cell at the next time slot. Otherwise, the VOQ will be a candidate again at step 2 of phase 1 at the next time slot.

B. Performance of RD Scheme

Although RD can dispatch cells evenly to the CMs, a high switch throughput cannot be achieved, due to the contention at the CM, unless the internal bandwidth is expanded. Fig. 3 shows an example of the throughput limitation in the case of n = m = k = 2. Let us assume that every VOQ is always occupied with cells, so that each VOQ sends a request for a candidate at every time slot. We estimate how much utilization an output link L_C(r, j) can achieve for the cell transmission.⁷ Since the utilization of every L_C(r, j) is the same, we focus on a single one, i.e., L_C(0, 0). The link utilization of L_C(0, 0) is obtained from the sum of the cell-transmission rates of the VOQs destined for OM(0). First, we estimate how much traffic VOQ(0, 0, 0) can send through L_C(0, 0). The probability that VOQ(0, 0, 0) uses L_I(0, 0) to request L_C(0, 0) is 1/4, because there are four VOQs in IM(0). Consider that VOQ(0, 0, 0) requests L_C(0, 0) using L_I(0, 0). If either of the two VOQs in IM(1) destined for OM(0) requests L_C(0, 0) through L_I(1, 0), a contention occurs with the request by VOQ(0, 0, 0) at L_C(0, 0). The aggregate probability that either of those two VOQs, among the four VOQs in IM(1), requests L_C(0, 0) through L_I(1, 0) is 1/2; in this case, the winning probability of VOQ(0, 0, 0) is 1/2. If there is no contention for L_C(0, 0) caused by requests in IM(1), VOQ(0, 0, 0) can always send a cell, i.e., with a probability of 1.0.

³An L-bit cell must be written to or read from a VOQ memory in a time less than L/[C(n + 1)], where C is a line speed. For example, when L = 64 × 8 bits, C = 10 Gbit/s, and n = 8, L/[C(n + 1)] is 5.6 ns. This is feasible when we consider currently available CMOS technologies.
⁴We assume that the output buffer size at OP(j, h) is large enough to avoid cell loss without flow control between IMs and OMs, so that we can focus the discussion on the properties of the dispatching schemes in this paper. However, a flow-control mechanism can be adopted between IMs and OMs to avoid cell loss when the output buffer size is limited.
⁵Similar to the VOQ memory, an L-bit cell must be written to or read from an output memory in a time less than L/[C(m + 1)].
⁶Even when a round-robin scheme is employed as the contention-resolution function for L_C(r, j) at CM(r), the result that we will discuss later is the same as with random selection, due to the effect of the RD.
⁷The output-link utilization is defined as the probability that the output link is used in cell transmission. Since we assume that the traffic load destined for each OP is not less than 1.0, we consider the maximum switch throughput as the output-link utilization throughout this paper, unless specifically stated otherwise.
Therefore, the traffic that VOQ(0, 0, 0) can send through L_C(0, 0) is given as follows:

(1/4) × [(1/2) × (1/2) + (1/2) × 1.0] = 3/16.    (1)

Since a cell destined for OM(0) can arrive at L_C(0, 0) from either of the two IMs, and each IM holds two such VOQs in the same situation as VOQ(0, 0, 0), the total link utilization of L_C(0, 0) is

2 × 2 × (3/16) = 3/4.    (2)

Thus, in this example of n = m = k = 2, the switch can achieve a throughput of only 75%, unless the internal bandwidth is expanded.

This observation can be extended to the general case. The maximum throughput achievable by RD under uniform traffic can be expressed as a function of n, m, and k; we describe how to obtain it in detail in Appendix A. The factor m/n in this expression is the expansion ratio. When the expansion ratio is 1.0 (i.e., m = n), the maximum throughput is only a function of the switch size. As described in the above example of n = m = k = 2, the maximum throughput is 0.75; as the size increases, the maximum throughput decreases, tending in the limit to 1 − e⁻¹ ≈ 0.63 (see Appendix B). In other words, to achieve 100% throughput by using RD, the expansion ratio has to be set to at least 1/(1 − e⁻¹) ≈ 1.6.

Fig. 4. CRRD scheme.

IV. CRRD SCHEME

A. CRRD Algorithm

The switch model described in this section is the same as the one described in Section II. However, to simplify the explanation of CRRD, the order of the VOQs in IM(i) is rearranged for dispatching as follows: VOQ(i, j, h) is redefined as VOQ(i, v), where v = hk + j and 0 ≤ v ≤ nk − 1.⁸

Fig. 4 illustrates the detailed CRRD algorithm by showing an example. To determine the matching between a request from VOQ(i, v) and the output link L_I(i, r), CRRD uses an iterative matching in IM(i). In IM(i), there are m output-link round-robin arbiters and nk VOQ round-robin arbiters. The output-link arbiter associated with L_I(i, r) has its own pointer P_L(i, r); the VOQ arbiter associated with VOQ(i, v) has its own pointer P_V(i, v). In CM(r), there are k round-robin arbiters, each of which corresponds to L_C(r, j) and has its own pointer P_C(r, j).

As described in Section III-A, two phases are also considered in CRRD. In phase 1, CRRD employs an iterative matching using round-robin arbiters to assign a VOQ to an IM output link. The matching between VOQs and output links is performed within the IM. It is based on round-robin arbitration and is similar to the request/grant/accept approach of iSLIP. A major difference is that in CRRD, a VOQ sends a request to every output-link arbiter, whereas in iSLIP, a VOQ sends a request to only the destined output-link arbiter. In phase 2, each selected VOQ that is matched with an output link sends a request from the IM to the CM, and the CMs send the arbitration results to the IMs to complete the matching between the IM and CM.
• Phase 1: Matching within IM
— First iteration
* Step 1: Each nonempty VOQ sends a request to every output-link arbiter.
* Step 2: Each output-link arbiter chooses one nonempty VOQ request in a round-robin fashion by searching from the position of its pointer P_L(i, r). It then sends the grant to the selected VOQ.
* Step 3: Each VOQ arbiter sends the accept to one granting output-link arbiter, chosen among all the grants received in a round-robin fashion by searching from the position of its pointer P_V(i, v).
— Subsequent iterations⁹
* Step 1: Each VOQ left unmatched at the previous iterations sends another request to all the unmatched output-link arbiters.
* Steps 2 and 3: The same procedure is performed as in the first iteration, for matching between the unmatched nonempty VOQs and the unmatched output links.

⁸The purpose of redefining the VOQs is to easily obtain the desynchronization effect.
⁹The number of iterations is determined in advance by considering the limitation of the arbitration time. CRRD tries to choose as many nonempty VOQs in this matching as possible, but it does not always choose the maximum available nonempty VOQs if the number of iterations is not enough. On the other hand, RD can choose the maximum number of nonempty VOQs in phase 1.
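The phase-1 iteration above can be sketched as follows (a simplified illustration in our own notation, not the paper's implementation). Pointer updates are deliberately omitted: CRRD moves a pointer only after the request is also granted by the CM in phase 2.

```python
def crrd_phase1(nonempty, pointers_out, pointers_voq, n_voq, m_links, iters=1):
    """Iterative request/grant/accept matching within one IM (sketch of
    CRRD phase 1). nonempty: set of nonempty VOQ indices; pointers_out[r]
    and pointers_voq[v] play the roles of P_L(i, r) and P_V(i, v).
    Returns a dict mapping matched VOQ -> output link."""
    matched_voq = {}   # voq index -> output link
    matched_link = {}  # output link -> voq index
    for _ in range(iters):
        # Steps 1-2: every unmatched output-link arbiter grants one
        # requesting unmatched VOQ, searching round-robin from its pointer.
        grants = {}  # voq -> list of granting links
        for r in range(m_links):
            if r in matched_link:
                continue
            for off in range(n_voq):
                v = (pointers_out[r] + off) % n_voq
                if v in nonempty and v not in matched_voq:
                    grants.setdefault(v, []).append(r)
                    break
        # Step 3: each granted VOQ accepts one link, searching round-robin
        # from its own pointer.
        for v, links in grants.items():
            accept = min(links, key=lambda r: (r - pointers_voq[v]) % m_links)
            matched_voq[v] = accept
            matched_link[accept] = v
    return matched_voq

# Example: 4 saturated VOQs, 2 output links, all pointers at zero.
# On the first iteration both links grant VOQ 0 (it accepts link 0);
# a second iteration matches link 1 to VOQ 1.
```

As in iSLIP, a single iteration can leave links unmatched because several grants may converge on one VOQ; additional iterations recover some of those links.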
• Phase 2: Matching between IM and CM
— Step 1: After phase 1 is completed, L_I(i, r) sends its request to CM(r). Each round-robin arbiter associated with L_C(r, j) then chooses one request by searching from the position of its pointer P_C(r, j) and sends the grant to the L_I(i, r) of the winning IM.
— Step 2: If the IM receives the grant from the CM, it sends the corresponding cell from that VOQ at the next time slot. Otherwise, the IM cannot send the cell at the next time slot; a request that is not granted by the CM will be attempted again at the next time slot, because the pointers that are related to ungranted requests are not moved.

As with iSLIP, the round-robin pointers P_L(i, r) and P_V(i, v) in IM(i) and P_C(r, j) in CM(r) are updated to one position after the granted position only if the matching within the IM is achieved at the first iteration in phase 1 and the request is also granted by the CM in phase 2.

Fig. 4 shows an example in which CRRD operates at the first iteration in phase 1. At step 1, the nonempty VOQs send requests to all the output-link arbiters. At step 2, the output-link arbiters each select one requesting VOQ according to their pointers' positions. At step 3, a VOQ that has received grants from two output-link arbiters accepts one of them by using its own VOQ arbiter, while a VOQ that has received a single grant simply accepts it. With one iteration, one output link cannot be matched with any nonempty VOQ; at the next iteration, the matching between the unmatched nonempty VOQs and this output link will be performed.

Fig. 5. Example of desynchronization effect of CRRD (n = m = k = 2).

B. Desynchronization Effect of CRRD

While RD causes contention at the CM, as described in Appendix A, CRRD decreases contention at the CM because the pointers P_V(i, v), P_L(i, r), and P_C(r, j) become desynchronized. We demonstrate how the pointers are desynchronized by using a simple example. Let us consider the case of n = m = k = 2, as shown in Fig. 5. We assume that every VOQ is always occupied with cells, so each VOQ sends a request to be selected as a candidate at every time slot. All the pointers are set to zero at the initial state, and only one iteration in phase 1 is considered here.

At the first time slot, since all the pointers are set to zero, only one VOQ can send a cell, and the pointers related to that grant are updated from zero to one. At the second time slot, three VOQs can send cells, and the pointers related to those grants are updated. From the third time slot on, all four VOQs can send cells. In this situation, 100% switch throughput is achieved: there is no contention at all at the CMs from then on, because the pointers are desynchronized.

Similar to this example, CRRD achieves the desynchronization effect and provides high throughput even when the switch size is increased.
V. CMSD SCHEME

CRRD overcomes the limited switch throughput of the RD scheme by using simple round-robin arbiters. CMSD is an improved version of CRRD that preserves CRRD's advantages and provides more scalability.

A. CMSD Algorithm

Two phases are also considered in the description of the CMSD scheme, as in the CRRD scheme. The difference between CMSD and CRRD is how the iterative matching within the IM operates in phase 1.

We define several notations to describe the CMSD algorithm, which is shown in an example in Fig. 6. G(i, j) denotes a VOQ group consisting of the n VOQs VOQ(i, j, h) destined for OM(j). In IM(i), there are m master output-link round-robin arbiters, mk slave output-link round-robin arbiters, and nk VOQ round-robin arbiters. Each master output-link arbiter is associated with L_I(i, r) and has its own pointer P_M(i, r). Each slave output-link arbiter is associated with G(i, j) and L_I(i, r); it is denoted as SL(i, j, r) and has its own pointer P_SL(i, j, r). The VOQ arbiter associated with VOQ(i, j, h) has its own pointer P_V(i, j, h).

Fig. 6. CMSD scheme.

• Phase 1: Matching within IM
— First iteration
* Step 1: There are two sets of requests issued to the output-link arbiters. One set consists of the requests sent from each nonempty VOQ(i, j, h) to its associated slave arbiters SL(i, j, r) within G(i, j). The other set consists of the group-level requests sent from each G(i, j) that has at least one nonempty VOQ to every master output-link arbiter.
* Step 2: Each master output-link arbiter chooses one request among the VOQ groups independently, in a round-robin fashion by searching from the position of its pointer P_M(i, r), and sends the grant to the slave arbiter SL(i, j, r) of the selected group. SL(i, j, r) receives the grant from its master arbiter, selects one VOQ request in a round-robin fashion by searching from the position of P_SL(i, j, r),¹⁰ and sends the grant to the selected VOQ.
* Step 3: Each VOQ arbiter sends the accept to the granting master and slave output-link arbiters, chosen among all the grants received in a round-robin fashion by searching from the position of its pointer P_V(i, j, h).
— Subsequent iterations
* Step 1: Each VOQ left unmatched at the previous iterations sends a request to all the slave output-link arbiters again. Each G(i, j) that has at least one unmatched nonempty VOQ sends a request to all the unmatched master output-link arbiters again.
* Steps 2 and 3: The same procedure is performed as in the first iteration, for matching between the unmatched nonempty VOQs and the unmatched output links.
• Phase 2: Matching between IM and CM
This operation is the same as in CRRD.

As with CRRD, the round-robin pointers in IM(i) and in CM(r) are updated to one position after the granted position only if the matching within the IM is achieved at the first iteration in phase 1 and the request is also granted by the CM in phase 2.

¹⁰In the actual hardware design, SL(i, j, r) can start to search for a VOQ without waiting to receive the grant from a master arbiter. If SL(i, j, r) does not receive a grant, the search by SL(i, j, r) is invalid. This is an advantage of CMSD. Section VII-A discusses the dispatching scheduling time.
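The two-level grant of steps 1 and 2 can be illustrated with a small sketch (our own simplified model and names, covering only a single output link's master/slave search):

```python
def cmsd_grant(nonempty, pm, ps, n, k, link):
    """Hierarchical grant for one output link in CMSD phase 1 (sketch).
    nonempty[g]: set of nonempty VOQ offsets within group g;
    pm[link]: master pointer (role of P_M); ps[(g, link)]: slave
    pointers (role of P_SL). Returns the granted (group, voq) or None."""
    # Master arbiter: pick a VOQ group that issued a group-level request,
    # searching round-robin from pm[link].
    for off in range(k):
        g = (pm[link] + off) % k
        if nonempty[g]:
            # Slave arbiter of the chosen group: pick one VOQ within the
            # group, searching round-robin from ps[(g, link)].
            for voff in range(n):
                v = (ps[(g, link)] + voff) % n
                if v in nonempty[g]:
                    return (g, v)
    return None

# Example: group 0 empty, group 1 has one nonempty VOQ at offset 1.
grant = cmsd_grant({0: set(), 1: {1}}, [0], {(0, 0): 0, (1, 0): 0},
                   n=2, k=2, link=0)
assert grant == (1, 1)
```

The search is sequential here for clarity; in hardware the slave arbiters can search in parallel with the master arbiter and simply discard invalid results (footnote 10), which is where CMSD's scheduling-time saving comes from.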
Fig. 6 shows an example in which CMSD is used at the first iteration in phase 1. At step 1, the nonempty VOQs send requests to their associated slave output-link arbiters, and the VOQ groups containing them send group-level requests to all the master output-link arbiters. At step 2, the master output-link arbiters select VOQ groups according to their pointers' positions, and the slave arbiters belonging to the selected groups each choose one VOQ. At step 3, a VOQ that has received two grants accepts one of them by using its own VOQ arbiter, while a VOQ that has received a single grant accepts it. In this iteration, one output link is not matched with any nonempty VOQ; at the next iteration, the matching between an unmatched nonempty VOQ and this output link will be performed.

Fig. 7. Example of the desynchronization effect (n = m = k = 2).

B. Desynchronization Effect of CMSD

CMSD also decreases contention at the CM, because the pointers P_M(i, r), P_SL(i, j, r), P_V(i, j, h), and P_C(r, j) become desynchronized. We illustrate how the pointers are desynchronized by using a simple example. Let us consider the case of n = m = k = 2, as shown in Fig. 7. We assume that every VOQ is always occupied with cells, so each VOQ sends a request to be selected as a candidate at every time slot. All the pointers are set to zero at the initial state, and only one iteration in phase 1 is considered here.

At the first time slot, since all the pointers are set to zero, only one VOQ can send a cell, and the pointers related to that grant are updated from zero to one. At the second time slot, three VOQs can send cells, and the pointers related to those grants are updated. From the third time slot on, all four VOQs can send cells. In this situation, 100% switch throughput is achieved, and there is no contention at the CMs because the pointers are desynchronized.

Similar to this example, CMSD can also achieve the desynchronization effect and provide high throughput even when the switch size is increased, as in CRRD.

VI. PERFORMANCE OF CRRD AND CMSD

The performance of CRRD and CMSD was evaluated by simulation using uniform and nonuniform traffic.

A. Uniform Traffic

1) CRRD: Fig. 8 shows that CRRD provides higher throughput than RD under uniform traffic. A Bernoulli arrival process is used for the input traffic. Simulation suggests that CRRD can achieve 100% throughput for any number of iterations in the IM.

The reason CRRD provides 100% throughput under uniform traffic is explained as follows. When the offered load is 1.0, if the idle state, in which the internal links are not fully utilized, still occurs due to contention in the IM and CM, a VOQ that fails in the contention has to store backlogged cells. Under uniform traffic, every VOQ keeps backlogged cells until the idle state is eliminated, i.e., until the stable state is reached. The stable state is defined in [12]. In the stable state, every VOQ is occupied with backlogged cells. In this situation, as illustrated in Fig. 5, the desynchronization effect is always obtained.
Therefore, even when the offered load is 1.0, no contention occurs in the stable state.

Fig. 8. Delay performance of CRRD and RD schemes (n = m = k = 8).
Fig. 9. 99.9% delay of CRRD (n = m = k = 8).

As the number of iterations increases, the delay performance improves when the offered load is less than 0.7, as shown in Fig. 8. This is because the matching between VOQs and output links within the IM increases; when the offered traffic load is not heavy, the desynchronization of the pointers is not completely achieved. At low offered load, the delay performance of RD is better than that of CRRD with one iteration. This is because the matching within the IM in CRRD is not completely achieved, while the complete matching within the IM in RD is always achieved, as described in Section III-A. When the offered load is larger than 0.7, the delay performance of CRRD is not improved by the number of iterations in the IM: the number of iterations only improves the matching within the IM, not the matching between the IM and CM. Fig. 8 shows that when the number of iterations in the IM increases to four, the delay performances almost converge.

Fig. 9 shows that the 99.9% delay of CRRD has the same tendency as the average delay shown in Fig. 8.

Our simulation shows, as presented in Fig. 10, that even when the input traffic is bursty, CRRD provides 100% throughput for any number of iterations. However, the delay performance under bursty traffic becomes worse than that under nonbursty traffic at the heavy-load condition. The reason for the 100% throughput even for bursty traffic can be explained as in the Bernoulli traffic discussion above. We assume that the burst length is exponentially distributed; in this evaluation, the average burst length is set to ten.

Fig. 10. Delay performance of CRRD affected by bursty traffic (n = m = k = 8).
Fig. 11. Average match size ratio R_IM in phase 1 (n = m = k = 8).

By looking at Fig. 10, we can make the following observations. When the input offered load is very low, the number of iterations does not affect the delay performance significantly. This is because the average match size ratio R_IM for the matching within the IM in phase 1 and R_CM for the matching between the IM and CM in phase 2 are close to 1.0, as shown in Figs. 11 and 12.

The average match size ratios R_IM and R_CM are defined as follows. R_IM is the ratio, averaged over time slots, of the match size within IM(i) in phase 1 to the number of nonempty VOQs in IM(i) at that time slot; when there is no nonempty VOQ, the ratio at that slot is set to 1.0. Note that RD always provides R_IM = 1.0, as described in Section III-A.
838 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 10, NO. 6, DECEMBER 2002

R2 is defined by (5) as the ratio of the number of grants to the number of requests in the matching between the IM and CM in phase 2, where the indicator in (6) is one if there is a request from the IM to the CM and zero otherwise, and the indicator in (7) is one if there is a grant from the CM to the IM in phase 2 and zero otherwise. When there are no requests, R2 is set to 1.0.

When the offered load increases to approximately 0.4, R1 of CRRD with one iteration for Bernoulli traffic decreases due to contention in the IM. R1 for the bursty traffic is less affected than R1 for Bernoulli traffic. The desynchronization effect is more difficult to obtain for Bernoulli traffic than for bursty traffic because the arrival process is more independent. In this load region, where the traffic load is approximately 0.4, the delay performance with one iteration for Bernoulli traffic is worse than for bursty traffic.

At moderate loads, the number of iterations affects the delay performance for Bernoulli traffic, as shown in Fig. 10: in this load region, R1 with one iteration for Bernoulli traffic is low, as shown in Fig. 11, while R1 with four iterations is nearly 1.0. On the other hand, at heavier loads, the number of iterations affects the delay performance for the bursty traffic: in that load region, R1 with one iteration for the bursty traffic is low, while R1 with four iterations is nearly 1.0.

When the traffic load becomes very heavy, R1 and R2 with any number of iterations, for both Bernoulli and bursty traffic, approach 1.0. In this load region, the desynchronization effect is easy to obtain, as described above.

Fig. 12. Average match size ratio R2 in phase 2 (n = m = k = 8).

2) CMSD: This section describes the performance of CMSD under uniform traffic by comparing it with CRRD. Fig. 13 shows that CMSD provides 100% throughput under uniform traffic, as CRRD does, due to the desynchronization effect described in Section V-B. The throughput of CMSD is also independent of the number of iterations in the IM.

In Fig. 13, the number of iterations in the IM improves the delay performance of CMSD even in heavy-load regions. On the other hand, the delay performance of CRRD is improved by the number of iterations in the IM only when the offered load is not heavy. As a result, in the heavy-load region, the delay performance of CMSD is better than that of CRRD when the number of iterations in the IM is large.

Since the master output-link arbiter of CMSD chooses a VOQ group, not a VOQ itself as in CRRD, the master output-link arbiter of CMSD affects the matching between the IM and CM more than the output-link arbiter of CRRD does. Therefore, CMSD makes the matching between the IM and CM more efficient than CRRD when the number of iterations in the IM increases. Although this is one of the advantages of CMSD, the main advantage is that CMSD is easier to implement than CRRD when the switch size increases; this will be described in Section VII.

Fig. 13. Delay performance of CMSD compared with CRRD (n = m = k = 8).

Fig. 13 also shows that, when the number of iterations in the IM increases to four for CMSD, the delay performance almost converges.

Fig. 14 shows via simulation that CMSD also provides 100% throughput under uniform traffic even when a bursty arrival process is considered. The bursty traffic assumption in Fig. 14 is the same as that in Fig. 10. In the heavy-load region, the delay performance of CMSD is better than that of CRRD under bursty traffic when the number of iterations in the IM is large.

3) Expansion Factor: Fig. 15 shows that RD requires an expansion ratio of over 1.5 to achieve 100% throughput, while CRRD and CMSD need no bandwidth expansion.

B. Nonuniform Traffic

We compared the performance of CRRD, CMSD, and RD using nonuniform traffic. The nonuniform traffic considered here is defined by introducing an unbalanced probability w. Let us
consider IP(i), OP(j), and the offered input load ρ for each IP. The traffic load from IP(i) to OP(j), ρ(i, j), is given by

ρ(i, j) = ρ(w + (1 − w)/N) if i = j, and ρ(1 − w)/N otherwise, (8)

where N is the switch size. Here, the aggregate offered load that goes to output OP(j) from all the IPs is ρ (9).

When w = 0, the offered traffic is uniform. On the other hand, when w = 1, it is completely unbalanced; this means that all the traffic of IP(i) is destined only for OP(j), where i = j.

Fig. 14. Delay performance of CMSD in bursty traffic compared with CRRD (n = m = k = 8).

Fig. 15. Relationship between switch throughput and expansion factor (n = k = 8).

Fig. 16. Switch throughput under nonuniform traffic (n = m = k = 8).

Fig. 16 shows that the throughput of CRRD and CMSD is higher than that of RD when the traffic is slightly unbalanced.11 The throughput of CRRD is almost the same as that of CMSD. We assume that sufficiently large numbers of iterations in the IM are adopted in this evaluation for both the CRRD and CMSD schemes, so that we can observe how nonuniform traffic impacts the matching between the IM and CM. From w = 0 to around w = 0.4, the throughput of CRRD and CMSD decreases. This is because the complete desynchronization of the pointers is hard to achieve under unbalanced traffic.12 However, when w is larger than 0.4, the throughput of CRRD and CMSD increases because the contention at the CM decreases. The reason is that, as w increases, more traffic from an IP is destined for its own output and less traffic is destined for the other outputs, according to (8); that is why the contention at the CM decreases. Due to the desynchronization effect, both CRRD and CMSD provide better performance than RD when w is smaller than 0.4. In addition, the throughput of CRRD and CMSD is not worse than that of RD at any w larger than 0.4, as shown in Fig. 16. Developing a new practical dispatching scheme that avoids the throughput degradation even under nonuniform traffic, without expanding the internal bandwidth, is for further study.

11 Although both results shown in Fig. 16 are obtained by simulation, we also analytically derived the upper-bound throughput of RD in Appendix III.

12 For the same reason, when input traffic is very asymmetric, throughput degradation occurs.

VII. IMPLEMENTATION OF CRRD AND CMSD

This section discusses the implementation of CRRD and CMSD. First, we briefly compare them with RD. Since the CRRD and CMSD schemes are based on round-robin arbitration, their implementations are much simpler than that of RD, which needs random number generators in the IMs. These generators are difficult and expensive to implement at high speed [14]. Therefore, the implementation cost of CRRD and CMSD is reduced compared with RD. The following sections describe the dispatching scheduling time, hardware complexity, and interconnection-wire complexity of CRRD and CMSD.

A.
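The unbalanced traffic model of (8) can be written out directly; the check below verifies (9), namely that every output port receives an aggregate load of exactly ρ for any w. The function and variable names are ours.

```python
def unbalanced_traffic(N, rho, w):
    """Traffic matrix of (8): load from IP(i) to OP(j) with
    unbalanced probability w (w = 0 uniform, w = 1 fully unbalanced)."""
    return [[rho * (w + (1.0 - w) / N) if i == j else rho * (1.0 - w) / N
             for j in range(N)]
            for i in range(N)]

N, rho = 64, 1.0
for w in (0.0, 0.5, 1.0):
    T = unbalanced_traffic(N, rho, w)
    col_loads = [sum(T[i][j] for i in range(N)) for j in range(N)]
    # (9): the aggregate offered load per output is rho, independent of w
    assert all(abs(c - rho) < 1e-9 for c in col_loads)
print("aggregate load per output:", rho)
```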
A. Dispatching Scheduling Time

1) CRRD: In phase 1, at each iteration, two round-robin arbiters are used. One is an output-link arbiter that chooses one VOQ request out of, at most, nk requests. The other is a VOQ arbiter that chooses one grant from, at most, m output-link grants. We assume that priority encoders are used for the implementation of the round-robin arbiters [14]. Since nk ≥ m, the dispatching scheduling time complexity of each iteration in phase 1 is O(log nk) (see Appendix IV) [8]. Ideally, m iterations in phase 1 are preferable because there are m output links that should be matched with VOQs in the IM; the time complexity of phase 1 is then O(m log nk). However, we do not practically need m iterations. As described in Section VI, simulation suggests that one iteration is sufficient to achieve 100% throughput under uniform traffic. In this case,
increasing the number of iterations improves the delay performance.

In phase 2, one IM request is chosen from, at most, k requests; the time complexity of phase 2 is O(log k). As a result, the time complexity of CRRD is O(m log N), where N = nk is the number of switch ports. If we set the number of iterations in phase 1 to i, where i ≤ m, for practical purposes, the time complexity of CRRD is expressed as O(i log N).

Next, we consider the required dispatching scheduling time of CRRD. The required time of CRRD, T(CRRD), is approximately given by (10), where c is a constant coefficient determined by device technologies and t_d is the transmission delay between arbiters.

2) CMSD: In phase 1, at each iteration, three round-robin arbiters are used. The first one is a master output-link arbiter that chooses one out of, at most, k VOQ-group requests. The second one is a slave output-link arbiter that chooses one out of, at most, n requests. The third one is a VOQ arbiter that chooses one out of, at most, m output-link grants. The slave output-link arbiter can perform its arbitration without waiting for the result of the master output-link arbitration, as described in Section V-A. In other words, the master output-link and slave output-link arbiters operate simultaneously; the longer of the master output-link arbitration time and the VOQ arbitration time is therefore dominant at each iteration, so the per-iteration time complexity in phase 1 is again logarithmic in the arbiter size, and the time complexity of phase 1 grows in proportion to the number of iterations i. In phase 2, one request is chosen out of, at most, k IM requests; the dispatching scheduling time complexity of phase 2 is O(log k). As a result, as in the case of CRRD, for practical purposes, the time complexity of CMSD is expressed as O(i log N), which is the same order of time complexity as that of CRRD.

We consider the required dispatching scheduling time of CMSD. With the same assumptions as for CRRD, the required time of CMSD, T(CMSD), is approximately given by (11), where t_d is the transmission delay between arbiters.

3) Comparison of Required Dispatching Scheduling Times: Although the order of the scheduling time complexity of CRRD is the same as that of CMSD, the required times are different. To compare both required times, we define the ratio R of T(CMSD) to T(CRRD) in (12). First, we assume that the transmission delay between arbiters is negligible compared with the required round-robin execution time; this clarifies the dependency on n, m, and k, and further eliminates the device-dependent factor c. Using (10)–(12), we obtain (13).

TABLE I
RATIO OF T(CMSD) TO T(CRRD), R (n = k = m)

Table I shows the relationship between R and the number of iterations i when we assume n = k = m. Note that R then depends only on i and is independent of the switch size.13 For a sufficiently large number of iterations, for example, the required time of CMSD is reduced by more than 30% compared with CRRD. In (13), we ignored the transmission-delay factors between arbiters. When these delays are large, the reduction effect of CMSD in Table I decreases; as the distance between two arbiters gets larger, these factors become significant.

13 R = (2i + 1)/(3i + 1).

B. Hardware Complexity for Dispatching in the IM

Since the CM arbitration algorithm of CMSD is the same as that of CRRD, we discuss hardware complexities in the IM only.

1) CRRD: According to the implementation results presented in [14], the hardware complexity of each round-robin arbiter grows with the number of requests to be selected by the arbiter. Each IM has m output-link arbiters and nk VOQ arbiters. Each output-link arbiter chooses one out of, at most, nk requests, and each VOQ arbiter chooses one out of, at most, m requests; counting arbiter inputs, the hardware complexity of CRRD in the IM is on the order of nmk.

2) CMSD: Each IM has m master output-link arbiters, mk slave output-link arbiters, and nk VOQ arbiters. Each master output-link arbiter chooses one out of, at most, k VOQ-group requests, each slave output-link arbiter chooses one out of, at most, n requests, and each VOQ arbiter chooses one out of, at most, m requests.
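Returning to the scheduling-time comparison: with n = k = m and negligible inter-arbiter transmission delay, (13) reduces, per footnote 13, to R = (2i + 1)/(3i + 1). The short sketch below tabulates this ratio: the reduction 1 − R grows from 25% at one iteration toward 1/3 as the number of iterations i increases, consistent with the "more than 30%" figure quoted in the conclusions.

```python
def cmsd_over_crrd(i):
    """Ratio R of T(CMSD) to T(CRRD) for i iterations, assuming
    n = k = m and negligible inter-arbiter transmission delay
    (footnote 13 of the text)."""
    return (2 * i + 1) / (3 * i + 1)

for i in (1, 2, 4, 8, 100):
    r = cmsd_over_crrd(i)
    print(i, round(r, 3), f"{1 - r:.1%} reduction")
# As i grows, R approaches 2/3, i.e., slightly over a 33% reduction.
```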
Therefore, counting arbiter inputs again, the hardware complexity of CMSD in the IM is also on the order of nmk. Thus, the order of the hardware complexity of CMSD is the same as that of CRRD.

C. Interconnection-Wire Complexity Between Arbiters

1) CRRD: In CRRD, each VOQ arbiter is connected to all the output-link arbiters with three groups of wires. The first group is used by a VOQ arbiter to send requests to the output-link arbiters. The second one is used by a VOQ arbiter to receive grants from the output-link arbiters. The third one is used by a VOQ arbiter to send grants to the output-link arbiters. As the switch size gets larger, the number of wires that connect all the VOQ arbiters with all the output-link arbiters becomes large. In this situation, a large number of crosspoints is produced, which causes high layout complexity. When we implement the dispatching scheduler in one chip, the layout complexity affects the interconnection between transistors. However, if we cannot implement it in one chip due to a gate limitation, we need to use multiple chips. In this case, the layout complexity influences not only the interconnection within a chip, but also the interconnection between chips on printed circuit boards (PCBs). In addition, the number of pins in the scheduler chips of CRRD increases.

Consider N source nodes and N destination nodes. We assume that each source node is connected to all the destination nodes with wires without detouring, as shown in Fig. 17. The number of crosspoints of the interconnection wires between the nodes is given by

X(N) = {N(N − 1)/2}². (14)

The derivation of (14) is described in Appendix V. Let the number of crosspoints of the interconnection wires between the VOQ arbiters and the output-link arbiters be X(CRRD); X(CRRD) is determined by (15). Note that the factor of three in (15) expresses the number of groups of wires, as described at the beginning of Section VII-C-1.

Fig. 17. Interconnection-wire model.

2) CMSD: In CMSD, each VOQ arbiter is connected to its own slave output-link arbiters with three groups of wires. The first group is used by a VOQ arbiter to send requests to the slave output-link arbiters. The second one is used by a VOQ arbiter to receive grants from the slave output-link arbiters. The third one is used by a VOQ arbiter to send grants to the slave output-link arbiters. In the same way, each VOQ group is connected to all the master output-link arbiters with three groups of wires. The first is used by a VOQ group to send requests to the master output-link arbiters. The second is used by a VOQ group to receive grants from the master output-link arbiters. The third is used by a VOQ group to send grants to the master output-link arbiters.

Let the number of crosspoints of the interconnection wires between arbiters in CMSD be X(CMSD); X(CMSD) is determined by (16). The first term on the right-hand side of (16) is the number of crosspoints of the interconnection wires between the VOQs and the slave output-link arbiters in the VOQ groups. The second term is the number of crosspoints of the interconnection wires between the VOQ groups and the master output-link arbiters.

3) Comparison of X(CRRD) and X(CMSD): As the switch size increases, X(CRRD) becomes much larger than X(CMSD); in general, when n = k = m, the ratio X(CMSD)/X(CRRD) is roughly 1/n. For example, when n = 8, 16, and 32, the number of crosspoints of the interconnection wires of CMSD is reduced by 87.5%, 93.8%, and 96.9%, respectively, compared with that of CRRD. Thus, CMSD can dramatically reduce the number of wire crosspoints.

VIII. CONCLUSIONS

A Clos-network switch architecture is attractive because of its scalability. Unfortunately, previously proposed implementable dispatching schemes are not able to achieve high throughput unless the internal bandwidth is expanded.

First, we have introduced a novel dispatching scheme called CRRD for the Clos-network switch. CRRD provides high switch throughput without increasing internal bandwidth, while only simple round-robin arbiters are employed. Our simulation showed that CRRD achieves 100% throughput under uniform traffic. Even though the offered load reaches 1.0, the pointers of the round-robin arbiters that we use at the IMs and CMs are completely desynchronized and contention is avoided.

Second, we have presented CMSD with hierarchical round-robin arbitration, which was developed as an improved version of CRRD to make it more scalable in terms of dispatching scheduling time and interconnection complexity in the dispatching scheduler when the switch size increases. We showed that CMSD preserves the advantages of CRRD and achieves a more than 30% reduction of the dispatching scheduling time when the arbitration time is significant. Furthermore, with CMSD, the number of interconnection crosspoints is dramatically reduced, with a ratio of roughly 1/n; when n = 32, a 96.9% reduction effect is obtained. This makes CMSD easier to implement than CRRD when the switch size becomes large.

This paper assumes that the dispatching scheduling must be completed within one time slot. However, this constraint becomes a bottleneck when the port speed becomes high. In a future study, pipeline-based approaches should be considered. A pipeline-based scheduling approach for crossbar switches is described in [17]; it relaxes the scheduling timing constraint without throughput degradation.

APPENDIX I
MAXIMUM SWITCH THROUGHPUT WITH RD SCHEME UNDER UNIFORM TRAFFIC

To derive (3), we consider the throughput of a single CM output link. This is the same as the switch throughput that we would like to derive, because the conditions of all the output links are the same; the switch throughput is therefore given by (17).

First, we define several notations. p is the probability that one output link in the CM is used by
a given VOQ's request. Here, p does not depend on which output link in the CM is used for the request, because the output link is selected randomly. q is the probability that a request wins the contention for an output link at the CM. By using p and q, the throughput is expressed by (18); in the last form of (18), we use the fact that p and q depend on neither the input nor the output index. Since each IM has nk VOQs and each VOQ is equally likely to be selected by each output link, p is given by (19).

Next, we consider q, which is obtained by (20). The first term on the right-hand side of (20) is the winning probability of the tagged request at the CM output link when there is one request in total for that output link; the second term is the winning probability when there are two contending requests; in the same way, the l-th term is the winning probability when there are l contending requests. q can then be expressed by (21).

By using (17)–(19) and (21), we can derive (3).

APPENDIX II
DERIVATION OF THE LIMIT OF (3)

We expand the term in (3) in decreasing order, as in (22). Hence we obtain (23) and, therefore, (24). In the above deduction, we use the identity (25).

APPENDIX III
UPPER BOUND OF MAXIMUM SWITCH THROUGHPUT WITH RD SCHEME UNDER NONUNIFORM TRAFFIC

The upper bound of the maximum switch throughput with the RD scheme under nonuniform traffic is determined by (26), shown at the bottom of the page. Equation (26) can be easily derived in the same way as (3). However, since the traffic load of each VOQ is different, (3) has
to be modified. The first term on the right-hand side of (26) is obtained by considering the probability that a heavy-loaded VOQ sends a request to the CM and wins the contention. The second term is obtained by considering the probability that, when a heavy-loaded VOQ and nonheavy-loaded VOQs send requests to the CM, a nonheavy-loaded VOQ wins the contention. The third term is obtained by considering the probability that a heavy-loaded VOQ does not send a request to the CM, nonheavy-loaded VOQs send requests to the CM, and a nonheavy-loaded VOQ wins the contention.

Note that (26) gives the upper bound of the switch throughput of the RD scheme in the switch model described in Section II. This is because only one cell can be read from the same VOQ in the same time slot in that switch model, as described in Section II, whereas we assume, for simplicity, that more than one cell can be read from the same VOQ in the same time slot in the derivation of (26). This different assumption causes different results under nonuniform traffic.14

Fig. 18. Comparison between simulation and analytical upper-bound results under nonuniform traffic (n = m = k = 8).

Fig. 18 compares the simulation result with the upper-bound result obtained by (26). As w increases to approximately 0.6, the difference between the simulation and analytical results increases, due to the reason described above. Beyond that region, the difference decreases with w because the contention probability at the CM decreases.

14 However, under uniform traffic, both results are exactly the same, because all the VOQs can be assumed to be in a nonempty state; this does not cause any difference in the results.

APPENDIX IV
TIME COMPLEXITY OF ROUND-ROBIN ARBITER

In a round-robin arbiter, the arbitration time complexity is dominated by the priority encoder, whose input size is the number of inputs of the arbiter. The priority encoder is analogous to a maximum or minimum search algorithm. We use a binary search tree, whose computation complexity (and, thus, the timing) is proportional to the depth of the tree, which is logarithmic in the number of inputs for a binary tree. The priority encoder, in the canonical form, has two stages: the AND-gate stage and the OR-gate stage. In the AND-gate stage, the largest gate has as many inputs as the arbiter; a large AND gate can be built from smaller AND gates with a logarithmic number of delay levels. As for the OR-gate stage, the largest OR gate likewise has as many inputs as the arbiter. The largest AND-gate delay is then the delay order of the priority encoder, i.e., logarithmic in the number of arbiter inputs.

APPENDIX V
DERIVATION OF (14)

We define the interconnection wire between source node s, where 1 ≤ s ≤ N, and destination node d, where 1 ≤ d ≤ N, as w(s, d). We also define the number of crosspoints that wire w(s, d) has as x(w(s, d)). The total number of crosspoints of the interconnection wires between the N source nodes and the N destination nodes is expressed by

X(N) = (1/2) Σ_{s=1..N} Σ_{d=1..N} x(w(s, d)). (27)

Here, the factor of 1/2 on the right-hand side of (27) eliminates the double counting of the crosspoints.

First, consider x(w(1, d)). Since w(1, 1) has no crosspoints with other wires, x(w(1, 1)) = 0. Since w(1, 2) crosses the N − 1 wires w(s, 1) with s > 1, x(w(1, 2)) = N − 1. In the same way, w(1, d) crosses the wires w(s, d') with s > 1 and d' < d, so x(w(1, d)) = (N − 1)(d − 1). Next, consider a general wire w(s, d): it crosses w(s', d') exactly when s' < s and d' > d, or s' > s and d' < d. In general, x(w(s, d)) is therefore given by

x(w(s, d)) = (s − 1)(N − d) + (N − s)(d − 1). (28)

We substitute (28) into (27), and the following equation is obtained:

X(N) = {N(N − 1)/2}². (29)
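The closed form in (29), as reconstructed here, can be checked by brute force: in the model of Fig. 17, wires w(s, d) and w(s', d') cross exactly when (s − s')(d − d') < 0. The script below counts crossings directly and compares the result with {N(N − 1)/2}².

```python
from itertools import combinations

def crosspoints_bruteforce(N):
    """Count crossings of all N*N wires (s, d) drawn between two
    parallel rows of nodes: two wires cross iff their endpoints
    interleave, i.e., (s1 - s2) * (d1 - d2) < 0."""
    wires = [(s, d) for s in range(1, N + 1) for d in range(1, N + 1)]
    return sum(1 for (s1, d1), (s2, d2) in combinations(wires, 2)
               if (s1 - s2) * (d1 - d2) < 0)

def crosspoints_closed_form(N):
    return (N * (N - 1) // 2) ** 2  # reconstructed closed form of (29)

for N in (2, 3, 4, 6, 8):
    assert crosspoints_bruteforce(N) == crosspoints_closed_form(N)
print(crosspoints_closed_form(8))  # 784
```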
REFERENCES

[1] T. Anderson, S. Owicki, J. Saxe, and C. Thacker, "High speed switch scheduling for local area networks," ACM Trans. Comput. Syst., vol. 11, no. 4, pp. 319–352, Nov. 1993.
[2] T. Chaney, J. A. Fingerhut, M. Flucke, and J. S. Turner, "Design of a Gigabit ATM switch," in Proc. IEEE INFOCOM, Apr. 1997, pp. 2–11.
[3] H. J. Chao, B.-S. Choe, J.-S. Park, and N. Uzun, "Design and implementation of Abacus switch: A scalable multicast ATM switch," IEEE J. Select. Areas Commun., vol. 15, pp. 830–843, June 1997.
[4] H. J. Chao and J.-S. Park, "Centralized contention resolution schemes for a large-capacity optical ATM switch," in Proc. IEEE ATM Workshop, Fairfax, VA, May 1998, pp. 11–16.
[5] H. J. Chao, "Saturn: A terabit packet switch using dual round-robin," in Proc. IEEE GLOBECOM, Dec. 2000, pp. 487–495.
[6] F. M. Chiussi, J. G. Kneuer, and V. P. Kumar, "Low-cost scalable switching solutions for broadband networking: The ATLANTA architecture and chipset," IEEE Commun. Mag., pp. 44–53, Dec. 1997.
[7] C. Clos, "A study of nonblocking switching networks," Bell Syst. Tech. J., pp. 406–424, Mar. 1953.
[8] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms. Cambridge, MA: MIT Press, 1997, sec. 13.2, p. 246.
[9] C. Y. Lee and A. Y. Oruc, "A fast parallel algorithm for routing unicast assignments in Benes networks," IEEE Trans. Parallel Distrib. Syst., vol. 6, pp. 329–333, Mar. 1995.
[10] T. T. Lee and S.-Y. Liew, "Parallel routing algorithm in Benes–Clos networks," in Proc. IEEE INFOCOM, 1996, pp. 279–286.
[11] N. McKeown, "Scheduling algorithm for input-queued cell switches," Ph.D. dissertation, Dept. Elect. Eng. Comput. Sci., Univ. California at Berkeley, Berkeley, CA, 1995.
[12] N. McKeown, V. Anantharam, and J. Walrand, "Achieving 100% throughput in an input-queued switch," in Proc. IEEE INFOCOM, 1996, pp. 296–302.
[13] N. McKeown, M. Izzard, A. Mekkittikul, W. Ellersick, and M. Horowitz, "Tiny Tera: A packet switch core," IEEE Micro, pp. 26–33, Jan.–Feb. 1997.
[14] N. McKeown, "The iSLIP scheduling algorithm for input-queued switches," IEEE/ACM Trans. Networking, vol. 7, pp. 188–200, Apr. 1999.
[15] A. Mekkittikul and N. McKeown, "A practical scheduling algorithm to achieve 100% throughput in input-queued switches," in Proc. IEEE INFOCOM, 1998, pp. 792–799.
[16] E. Oki, N. Yamanaka, Y. Ohtomo, K. Okazaki, and R. Kawano, "A 10-Gb/s (1.25 Gb/s × 8) 4 × 2 0.25-µm CMOS/SIMOX ATM switch based on scalable distributed arbitration," IEEE J. Solid-State Circuits, vol. 34, pp. 1921–1934, Dec. 1999.
[17] E. Oki, R. Rojas-Cessa, and H. J. Chao, "A pipeline-based approach for maximal-sized matching scheduling in input-buffered switches," IEEE Commun. Lett., vol. 5, pp. 263–265, June 2001.
[18] J. Turner and N. Yamanaka, "Architectural choices in large scale ATM switches," IEICE Trans. Commun., vol. E81-B, no. 2, pp. 120–137, Feb. 1998.

Eiji Oki (M'95) received the B.E. and M.E. degrees in instrumentation engineering and the Ph.D. degree in electrical engineering from Keio University, Yokohama, Japan, in 1991, 1993, and 1999, respectively. In 1993, he joined the Communication Switching Laboratories, Nippon Telegraph and Telephone Corporation (NTT), Tokyo, Japan, where he researched multimedia-communication network architectures based on ATM techniques, traffic-control methods, and high-speed switching systems with the NTT Network Service Systems Laboratories. From 2000 to 2001, he was a Visiting Scholar with the Polytechnic University of New York, Brooklyn. He is currently engaged in research and development of high-speed optical IP backbone networks as a Research Engineer with NTT Network Innovation Laboratories, Tokyo, Japan. He coauthored Broadband Packet Switching Technologies (New York: Wiley, 2001).
Dr. Oki is a member of the Institute of Electronics, Information and Communication Engineers (IEICE), Japan. He was the recipient of the 1998 Switching System Research Award and the 1999 Excellent Paper Award presented by IEICE, Japan, and the IEEE Communications Society 2001 Asia-Pacific Outstanding Young Researcher Award.

Zhigang Jing (M'01) received the B.S., M.S., and Ph.D. degrees from the University of Electronic Science and Technology of China, Chengdu, China, in 1993, 1996, and 1999, respectively, all in electrical engineering, with research in optical fiber communication. He then joined the Department of Electrical Engineering, Tsinghua University, Beijing, China, where he was a Post-Doctoral Fellow. Since March 2000, he has been with the Department of Electrical Engineering, Polytechnic University of New York, Brooklyn, as a Post-Doctoral Fellow. His current research interests are high-speed networks, terabit IP routers, multimedia communication, Internet QoS, DiffServ, and MPLS.

Roberto Rojas-Cessa (S'97–M'01) received the B.S. degree in electronic instrumentation from the Universidad Veracruzana (University of Veracruz), Veracruz, Mexico, in 1991, graduating with an Honorary Diploma. He received the M.S. degree in electrical engineering from the Centro de Investigacion y de Estudios Avanzados del Instituto Politecnico Nacional (CINVESTAV-IPN), Mexico City, Mexico, in 1995, and the M.S. degree in computer engineering and the Ph.D. degree in electrical engineering from the Polytechnic University of New York, Brooklyn, in 2000 and 2001, respectively. In 1993, he was with the Sapporo Electronics Center, Japan, for a microelectronics certification. From 1994 to 1996, he was with UNITEC and Iberchip, Mexico City, Mexico. From 1996 to 2001, he was a Teaching and Research Fellow with the Polytechnic University of Brooklyn. In 2001, he was a Post-Doctoral Fellow with the Department of Electrical and Computer Engineering, Polytechnic University of Brooklyn. He has been involved in application-specific integrated-circuit (ASIC) design for biomedical applications and research on scheduling schemes for high-speed packet switches and switch reliability. He is currently an Assistant Professor with the Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark. His research interests include high-speed and high-performance switching, fault tolerance, and implementable scheduling algorithms for packet switches.
Dr. Rojas-Cessa is a member of the Institute of Electronics, Information and Communication Engineers (IEICE), Japan. He was a CONACYT Fellow.

H. Jonathan Chao (S'83–M'85–SM'95–F'01) received the B.S. and M.S. degrees in electrical engineering from the National Chiao Tung University, Taiwan, R.O.C., and the Ph.D. degree in electrical engineering from The Ohio State University, Columbus. In January 1992, he joined the Polytechnic University of New York, Brooklyn, where he is currently a Professor of electrical engineering. He has served as a consultant for various companies, such as NEC, Lucent Technologies, and Telcordia, in the areas of ATM switches, packet scheduling, and MPLS traffic engineering. He has given short courses to industry on the subjects of IP/ATM/SONET networks for over a decade. He was cofounder and, from 2000 to 2001, CTO of Coree Networks, where he led a team in the implementation of a multiterabit packet switch system with carrier-class reliability. From 1985 to 1992, he was a Member of Technical Staff with Telcordia, where he was involved in transport and switching system architecture designs and ASIC implementations, such as the first SONET-like framer chip, an ATM layer chip, a sequencer chip (the first chip handling packet scheduling), and an ATM switch chip. From 1977 to 1981, he was a Senior Engineer with Telecommunication Laboratories, Taiwan, R.O.C., where he was involved with circuit designs for a digital telephone switching system. He has authored or coauthored over 100 journal and conference papers. He coauthored Broadband Packet Switching Technologies (New York: Wiley, 2001) and Quality of Service Control in High-Speed Networks (New York: Wiley, 2001). He has performed research in the areas of terabit packet switches/routers and QoS control in IP/ATM/MPLS networks. He holds 19 patents, with five pending.
Dr. Chao served as a Guest Editor for the IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS with the special topics "Advances in ATM Switching Systems for B-ISDN" (June 1997) and "Next Generation IP Switches and Routers" (June 1999). He also served as an Editor for the IEEE/ACM TRANSACTIONS ON NETWORKING (1997–2000). He was the recipient of the 1987 Telcordia Excellence Award and a corecipient of the 2001 Best Paper Award of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY.