HE et al.: ON GUARANTEED SMOOTH SWITCHING FOR BUFFERED CROSSBAR SWITCHES 719

In particular, to guarantee this allocated bandwidth for this flow, the number of credits required, or the crosspoint buffer capacity, shall be at least the product of the required bandwidth and the total fabric-internal round-trip time of this flow.

Actually, the coupled effect of fabric-internal latency and credit-based flow control is often neglected by most previous research. Since a multi-terabit switch must be distributed over multiple racks, the fabric-internal latency increases considerably with the distance between line cards and the switch core. For example, a 4-Tb/s CICQ packet switch is distinguished by a large fabric-internal latency, packaging up to 256 10-Gb/s (OC-192) line cards into multiple racks, with a distance of tens to hundreds of feet between line cards and the switch core. Large fabric-internal latency, if not properly dealt with, sharply degrades the performance of a crossbar switch. With credit-based flow control, an arbitrarily small crosspoint buffer can guarantee no cell loss; but to guarantee bandwidths to flows, or to guarantee input work conservation and output full utilization, the size of each crosspoint buffer shall be at least the product of the link speed and the fabric-internal round-trip time, and the total crosspoint memory will be N^2 times as large, since there are N^2 crosspoint buffers in a switch.

Therefore, CICQ switches with credit-based flow control face new challenges of scalability and predictability. The large crosspoint buffer, induced by large fabric-internal latency and fast link speed, is surely a bottleneck to scalability. Without speedup, it is still open whether CICQ switches can guarantee 100% throughput for either best-effort or real-time services.

In this paper, we propose a design approach of rate-based smoothed switching to tackle these problems. Rate information is more compact and hence more scalable than credit information. It can feasibly be obtained by admission control for real-time services, or by on-line bandwidth demand estimation and allocation for best-effort services. With additional rate information, the predictability of switch design can be enhanced. In particular, for CICQ switches, if rate-based smoothed schedulers are placed at the inputs and outputs, filling and draining crosspoint buffers smoothly, then a small buffer will suffice, regardless of fabric-internal latency or link speed. This will break the bottleneck of large crosspoint buffers, eliminate the credit-based feedback flow control, and reinforce predictability.

Specifically, Section II develops the concept of smoothness from two intuitive and complementary viewpoints, covering smoothness and spacing smoothness, which are then unified in the same model. Section III proposes a smoothed multiplexer sMUX to solve the smooth multiplexing problem, with almost ideal smoothness bounds obtained for each of the flows sharing a link. Section IV proposes a smoothed buffered crossbar sBUX, which uses scheduler sMUX at the inputs and outputs, and a buffer of just two cells at each crosspoint. It is proved that sBUX can use 100% of the switch capacity to support real-time services, with deterministic guarantees of bandwidth and fairness, delay and jitter bounds for each flow. To support best-effort services, Section V introduces a bandwidth regulator that periodically delivers an admissible bandwidth matrix to sBUX by on-line bandwidth demand estimation and allocation. The two-cell crosspoint buffers are still sufficient to guarantee no cell loss even during dynamic bandwidth updating. Section VI conducts simulation to show that sBUX coupled with the bandwidth regulator can deliver a throughput of almost 100% and an average delay of multiple microseconds. In particular, neither credit-based flow control nor speedup is needed, and arbitrary fabric-internal latency is allowed between line cards and the switch core. Section VII concludes the paper with discussion of the rate-based control scheme and the smooth switching problems for some leading switches. All proofs of theorems are placed in the Appendix.

II. SMOOTHNESS

This section defines the concept of smoothness.

There are N flows of fixed-size cells sharing a link of bandwidth B; each flow f is allocated a bandwidth B_f, where B_f > 0 and B_1 + ... + B_N <= B. This specifies an instance (B_1, ..., B_N; B), which can be reduced to its normal form (w_1, ..., w_N), abbr. W, in which w_f = B_f / B and w_1 + ... + w_N <= 1.

Time is slotted, with slot t denoting the real interval [t, t+1), and slot interval [t_1, t_2) denoting the slot set {t_1, t_1 + 1, ..., t_2 − 1}. A schedule or scheduler S for an instance is a function S: {0, 1, 2, ...} → {0, 1, ..., N}, mapping slots to symbols or cell services; symbol 0 stands for an idle cell service and symbol f stands for a cell service to flow f.

The smooth multiplexing problem (SMP) is to generate a smooth schedule such that cell services to each flow are distributed evenly in the whole sequence. Intuitively, in an ideally smooth schedule for an SMP instance (w_1, ..., w_N), any interval of L consecutive slots should cover L·w_f number, or in practice, either ⌊L·w_f⌋ or ⌈L·w_f⌉ number of cell services to flow f. In a complementary view, k successive cell services to flow f should be spaced k/w_f slots apart, or either ⌊k/w_f⌋ or ⌈k/w_f⌉ slots apart. Such different and intuitive views of covering and spacing, along with the integral constraint, shall be taken into account during formalization.

A. Covering

Let C_S^f(t_1, t_2), abbr. C(t_1, t_2), denote the number of cell services to flow f that are scheduled by scheduler S inside slot interval [t_1, t_2). We investigate the whole spectrum of C(t_1, t_2) and its worst-case deviation from an ideal distribution over all possible slot intervals.

Given a slot interval [T_1, T_2), which could be finite or infinite, the minimum and the maximum covers at length L within this interval can be defined as follows:

  C⁻(L) = min { C(t_1, t_2) : [t_1, t_2) ⊆ [T_1, T_2), t_2 − t_1 = L },
  C⁺(L) = max { C(t_1, t_2) : [t_1, t_2) ⊆ [T_1, T_2), t_2 − t_1 = L }.

We measure the covering smoothness of the actual distribution of cell services to flow f within [T_1, T_2) by the following two worst-case covering deviations from the ideal:

  Θ⁻ = sup_L ( L·w_f − C⁻(L) ),  Θ⁺ = sup_L ( C⁺(L) − L·w_f ).

By definition, Θ⁻ ≥ 0 and Θ⁺ ≥ 0. See Table I for illustration. Within interval [T_1, T_2), if both covering deviations are less than 1, or equivalently, if ⌊L·w_f⌋ ≤ C(t_1, t_2) ≤ ⌈L·w_f⌉ for any interval [t_1, t_2) ⊆ [T_1, T_2) of length L, then the distribution of cell services to flow f is ideal from the covering point of view.
720 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 16, NO. 3, JUNE 2008

TABLE I
COVERING AND SPACING PROPERTIES OF A FLOW IN S[0, 6) = ( · , · , · , · , · , · ), w = 4/6

B. Spacing

Let s_f(i), abbr. s(i), denote the slot of the ith cell service to flow f scheduled by scheduler S, and let g_f(i, k), abbr. g(i, k), denote a k-step space, or the number of slots between the positions of the ith and (i+k)th cell services to flow f:

  g(i, k) = s(i + k) − s(i).

Then, given an interval, either finite or infinite, the minimum and the maximum spaces at step k within this interval can be defined as follows:

  G⁻(k) = min_i g(i, k),  G⁺(k) = max_i g(i, k),

where i ranges over the indices for which g(i, k) is defined within the interval.

We measure the spacing smoothness of the actual distribution of cell services to flow f by the following two worst-case spacing deviations from the ideal:

  Φ⁻ = sup_k ( k/w_f − G⁻(k) ),  Φ⁺ = sup_k ( G⁺(k) − k/w_f ).

By definition, Φ⁻ ≥ 0 and Φ⁺ ≥ 0. See Table I for illustration. Within the interval, if both spacing deviations are less than 1, or equivalently, if ⌊k/w_f⌋ ≤ g(i, k) ≤ ⌈k/w_f⌉ for all i and k such that g(i, k) is defined within the interval, then the distribution of cell services to flow f is ideal from the spacing point of view.

C. Bridge Between Covering and Spacing

Before we proceed, we should first notice the different domains of definition for covering and spacing. Suppose covering properties are defined in interval [T_1, T_2), which contains n cell services to flow f located at slots s(1) through s(n); then spacing properties can be defined only within the interval from s(1) to s(n), which is normally a proper subset of [T_1, T_2). Intuitively, the properties defined in the superset should be able to determine the properties defined in the subset, but not vice versa.

Theorem 2.1: In a schedule for an SMP instance, for any flow f and any nonnegative integers i, k, t_1, and t_2, the spaces are determined by the covers: the minimum and maximum spaces can be expressed in terms of the minimum and maximum covers, where the right-hand side of each expression is defined within the interval in which the left-hand side is defined.

Theorem 2.2: In a schedule for an SMP instance, for any flow f, the spacing deviations are bounded in terms of the covering deviations, where the left-hand side of each bound is defined within the interval in which the right-hand side is defined.

Theorem 2.2 describes the relationship between covering smoothness and spacing smoothness, revealing their consistency and correspondence. It has a corollary as follows.

Corollary 2.3: In a schedule for an SMP instance, for any flow f, the covering is ideal if and only if the spacing is ideal, where the right-hand side of the equivalence is defined within the interval in which the left-hand side is defined.

Corollary 2.3 reveals the correspondence between ideal covering and ideal spacing, strengthening the rationality of the smoothness definitions. Given a schedule for an SMP instance, if the covering and spacing deviations of flow f are all less than 1, then the schedule is called ideal for flow f; if the schedule is ideal for every flow, then the schedule is called ideal. In general, an ideal schedule might not exist, e.g., for the SMP instance (2/6, 3/6, 1/6). But it is easy to generate an ideal schedule for a single flow, or for two flows whose normalized bandwidths add up to 1, as shown next.
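The covering and spacing definitions above can be checked by brute force on a finite schedule. The sketch below is ours (the function names and return conventions are not from the paper); a schedule is a list of symbols with 0 for an idle slot and f for a cell service to flow f, and exact Fractions avoid floating-point rounding:

```python
from fractions import Fraction

def covering_deviations(schedule, f, w):
    """Worst-case covering deviations of flow f over all subintervals
    [t1, t2) of [0, len(schedule)); w is the normalized bandwidth of f."""
    n = len(schedule)
    dev_minus = dev_plus = Fraction(0)
    for t1 in range(n):
        c = 0  # running cover C(t1, t2)
        for t2 in range(t1 + 1, n + 1):
            if schedule[t2 - 1] == f:
                c += 1
            L = t2 - t1
            dev_minus = max(dev_minus, L * w - c)  # too few services
            dev_plus = max(dev_plus, c - L * w)    # too many services
    return dev_minus, dev_plus

def spacing_deviations(schedule, f, w):
    """Worst-case spacing deviations of flow f: the k-step space
    g(i, k) = s(i+k) - s(i) is compared against the ideal k / w."""
    s = [t for t, sym in enumerate(schedule) if sym == f]
    dev_minus = dev_plus = Fraction(0)
    for i in range(len(s)):
        for k in range(1, len(s) - i):
            g = s[i + k] - s[i]
            dev_minus = max(dev_minus, Fraction(k) / w - g)
            dev_plus = max(dev_plus, g - Fraction(k) / w)
    return dev_minus, dev_plus
```

For example, for the schedule (1, 2, 1, 0, 1, 1, 2, 1) with w_1 = 5/8, all four deviations of flow 1 come out below one.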
Fig. 1. The sMUX schedule for SMP instance (5/8, 2/8) with common initiation time 0. The dashed lines are eligible times and deadlines for each cell service.

Theorem 2.4: An SMP instance (w_1, w_2) with w_1 + w_2 = 1 always has an ideal schedule, which can be constructed by allocating cell services to flow 1 to slots ⌊(i + θ)/w_1⌋ for all integers i, and cell services to flow 2 to the slots left, where θ is an arbitrary but fixed real number.

While the concept of smoothness has previously been defined within a periodic frame as a certain covering discrepancy, that definition did not consider the viewpoint of spacing, let alone the relationship between covering and spacing, in a general sequence; no formal definitions of smoothness were given there. To our best knowledge, our definitions of smoothness are the most general and comprehensive.

III. SMOOTHED MULTIPLEXER sMUX

This section designs and analyses an algorithm called the smoothed multiplexer, or sMUX.

A. Algorithm sMUX

Assume an SMP instance (w_1, ..., w_N) is given. Besides the normalized bandwidth w_f, each flow f is additionally associated with an initiation time I_f to mark the earliest time slot at which the scheduler is ready to schedule flow f. This is to reflect the dynamic scheduling of newly admitted flows in practical applications. Starting from time I_f, the ith (i = 1, 2, ...) cell service to flow f is eligible at time E(f, i) = I_f + (i − 1)/w_f, and is expected to finish before deadline D(f, i) = I_f + i/w_f. That is, the ith cell service for flow f is expected to be allocated in the ith window [E(f, i), D(f, i)).

If resource allocation could be in units of arbitrarily small size rather than in units of a slot, then an earliest deadline first (EDF) scheduler would feasibly schedule the set of flows (i.e., all the requirements of eligible time, service time, and deadline would be met), since w_1 + ... + w_N <= 1. But solutions to SMP require that each flow be serviced in units of an integral time slot, which makes it impossible to meet all the requirements stated before. Algorithm sMUX, as an EDF scheduler adapted to the integral constraint, tries to achieve the best approximation.

Algorithm sMUX: At each time slot t, among those flows that are eligible for scheduling, i.e., whose next cell services have eligible times no later than t (E(f, i) <= t, or equivalently, ⌈E(f, i)⌉ <= t), allocate slot t to the flow with the earliest upper-rounded deadline ⌈D(f, i)⌉; ties are broken arbitrarily. When no flow is eligible, slot t is left idle.

Fig. 1 is an illustration of how sMUX works on the SMP instance (5/8, 2/8) with all I_f = 0.

It is obvious that sMUX provides the ith cell service for flow f after its eligible time E(f, i). The following theorem shows that the service finishes before ⌈D(f, i)⌉.

Theorem 3.1: In a sMUX schedule for an SMP instance with initiation times I_1, ..., I_N, the ith cell service to flow f is allocated within the slot interval [⌈E(f, i)⌉, ⌈D(f, i)⌉), where E(f, i) = I_f + (i − 1)/w_f and D(f, i) = I_f + i/w_f.

B. Guaranteed Smoothness of sMUX

Theorem 3.2 below characterizes the guaranteed smoothness bounds of sMUX. Owing to the introduction of the individual initiation time I_f, the scheduler is ready to schedule each flow at possibly different times. Hence, the covering properties concerning flow f should be calculated from slot I_f onward.

Theorem 3.2: Given a sMUX schedule for an SMP instance with initiation times I_1, ..., I_N, for any flow f, interval length L, and step size k, the covering deviations of the sMUX schedule are at most one cell service, and its spacing deviations are bounded accordingly.

The covering deviation of sMUX is at most one cell service, which could be outperformed only by an ideal schedule if it exists. In this sense the sMUX schedule is almost ideal, or optimal; when an ideal schedule does not exist, which is the normal case, the sMUX schedule is optimal. The constant deviation also indicates that, compared with the spacing deviation, the covering deviation seems to be a normalized measure of smoothness.

By eligible-time control, cells of flow f are admitted into sMUX at times E(f, i), i = 1, 2, ..., with all spacing and covering deviations equal to zero. If sMUX only introduced a constant delay to each cell, then the spacing deviations of the departing cell sequences would remain zero. Therefore, any non-zero spacing deviations are actually delay jitters, resulting from the variable delays introduced to cells by sMUX.

Simply speaking, sMUX combines the guarantee of circuit multiplexing and the flexibility of packet multiplexing. Frame-based circuit multiplexing guarantees dedicated bandwidth and fixed delay (comparable with the frame duration), but suffers from a service granularity problem: a small frame might waste bandwidth, while a large frame complicates frame scheduling and storage. In contrast, packet multiplexing allows more flexible bandwidth sharing, but the worst-case delay might be unpredictable. For real-time services, e.g., continuous media, we need flexible bandwidth allocation and predictable worst-case delay and jitter bounds. The scheduler sMUX allows arbitrary and almost exact sharing of the link bandwidth; the implicit huge frame reference can be considered to be compactly embedded in the slot-by-slot sMUX algorithm. The worst-case delay and jitter, however, are independent of the length of the implicit frame reference.

In terms of fast hardware implementation, the minimum of 256 24-bit numbers can be found in 4.5 ns in 0.18-μm CMOS technology. For comparison, the transmission time of a 64-byte cell on a 40-Gb/s (OC-768) link is 12.8 ns. This strongly suggests that sMUX can feasibly be used in switches with high link speed (40 Gb/s) and high port density (256 ports).
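Algorithm sMUX can be sketched directly in Python (a sketch under our own naming; exact rational arithmetic keeps the window boundaries E(f, i) and D(f, i) precise, and ties go to the lowest flow index):

```python
from fractions import Fraction
from math import ceil

def smux(weights, num_slots, initiation=None):
    """EDF-style smoothed multiplexer adapted to integral slots.
    weights[f] is the normalized bandwidth of flow f+1 (as a Fraction);
    returns a list of slot symbols: 0 = idle, f = service to flow f."""
    N = len(weights)
    init = [0] * N if initiation is None else initiation
    nxt = [1] * N  # index i of the next cell service of each flow
    schedule = []
    for t in range(num_slots):
        best, best_dl = 0, None
        for f in range(N):
            i, w = nxt[f], weights[f]
            if init[f] + Fraction(i - 1) / w <= t:       # eligible: E(f, i) <= t
                dl = ceil(init[f] + Fraction(i) / w)     # upper-rounded deadline
                if best_dl is None or dl < best_dl:      # earliest deadline wins
                    best, best_dl = f + 1, dl
        schedule.append(best)
        if best:
            nxt[best - 1] += 1
    return schedule
```

On the instance (5/8, 2/8) of Fig. 1 with common initiation time 0, the sketch yields the schedule (1, 2, 1, 0, 1, 1, 2, 1): five services to flow 1, two to flow 2, and one idle slot in every eight.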
Fig. 2. Architecture of a 3 × 3 buffered crossbar switch sBUX.

IV. SMOOTHED BUFFERED CROSSBAR sBUX

This section describes a CICQ switch called the smoothed buffered crossbar, or sBUX, which uses sMUX as both the input and output schedulers. It will be shown that a two-cell crosspoint buffer (XPB) suffices to guarantee no cell loss even without credit-based flow control.

A. Switch Model

Consider an N × N CICQ switch with unit-speed input and output links. For each input-output pair (i, j), the switch has a virtual output queue VOQ_{ij} at input i, and a first-in-first-out crosspoint buffer XPB_{ij} inside the crossbar. There is a scheduler IS_i for each input i, and a scheduler OS_j for each output j; they are all sMUX schedulers. The fabric-internal latency between line cards and the switch core can be arbitrary. See Fig. 2.

Data crossing the switch are organized into flows of fixed-size cells. In this paper we consider unicast flows. Cells of flow (i, j) originally arrive at input i, temporarily queue at VOQ_{ij} and then at XPB_{ij}, and finally go to output j. Associated with flow (i, j) is a bandwidth w_{ij}, and the bandwidth matrix (w_{ij}) is feasible or admissible, i.e., each column sum is at most 1 for all j and each row sum is at most 1 for all i. For bandwidth-guaranteed real-time services, such a matrix should be available through admission control. For best-effort services, such a matrix is obtained periodically by an on-line bandwidth regulator, to be described in Section V.

Time is slotted, with each time slot being the time to transmit one cell at link speed. At each time slot, each input scheduler IS_i will select one VOQ at input i, say VOQ_{ij}, based on the parameters in row i of the matrix, and transmit the head cell to the corresponding crosspoint buffer XPB_{ij}. At the same time slot, each output scheduler OS_j will select one crosspoint buffer, say XPB_{ij}, based on the parameters in column j of the matrix, and transmit the head cell to the corresponding output j. The 2N sMUX schedulers make their decisions independently and in parallel, and then the switch executes their decisions synchronously.

What is unique to switch sBUX is that no credit-based flow control is used for any XPB. Specifically, at each time slot, according to sMUX, each input scheduler will provide a cell service to one VOQ, no matter whether this VOQ is backlogged or the corresponding XPB has any vacancy; each output scheduler will provide a cell service to one XPB, no matter whether this XPB is backlogged. Hence, the input schedulers and output schedulers are further decoupled. While the overhead of credit-based flow control is eliminated, the small crosspoint buffers are in danger of overflow. However, the occupancy of each XPB is always upper-bounded by two cells, as shown next.

B. XPB Occupancy Analysis

Let us take XPB_{ij} as our point of view. In practice, it is reasonable to place OS_j at the switch core and IS_i at the ith line card, with a latency in between. With available hardware technology, this latency can be measured precisely and compensated accordingly, such that when OS_j starts to schedule out of XPB_{ij} at its initiation time, IS_i has already started to schedule out of VOQ_{ij} at an initiation time earlier by exactly that latency. This is functionally equivalent to placing IS_i and VOQ_{ij} at where XPB_{ij} is located, with no latency in between, and starting to schedule into XPB_{ij} at the delayed initiation time, or at the same time that OS_j starts to schedule out of XPB_{ij}. Therefore, from now on, we will assume that both IS_i and VOQ_{ij}, together with OS_j, are placed at where XPB_{ij} is located, with no latency in between. They start scheduling into and out of XPB_{ij} with the same rate w_{ij}, at the same initiation time.

Assume that XPB_{ij} has an infinite capacity and is empty initially. We are interested in the maximum occupancy of XPB_{ij} under the operations of IS_i and OS_j, which will be the minimum capacity of XPB_{ij} needed to guarantee no cell loss. Because the input scheduling and the output scheduling are completely decoupled in a sBUX switch, a key observation is: the critical situation producing the maximum occupancy of XPB_{ij} is when VOQ_{ij} is always backlogged.

Since both input and output scheduling are rate-based, regardless of cell availability, input and output operations may be ineffective when there is no cell available in the VOQ or the XPB. Specifically, in any time interval, the input operations on XPB_{ij} consist of effective ones and ineffective ones; similarly, the output operations on XPB_{ij} consist of effective ones and ineffective ones. Under the critical situation, all input operations are apparently effective, but what happens with the output operations and the crosspoint buffers?

Theorem 4.1: Under the critical situation, at most one output operation on XPB_{ij} is ineffective.

Theorem 4.2: Let M_{ij} denote the maximum occupancy of XPB_{ij}; then M_{ij} is upper-bounded by two cells.

Theorem 4.2 assumes a fixed bandwidth. What happens if the bandwidth is updated dynamically?
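Before turning to dynamic updates, the fixed-rate bounds of Theorems 4.1 and 4.2 can be probed numerically with a toy model. The sketch below is our own (the 3 × 3 row and column weight vectors are an arbitrary admissible example, and we conservatively assume a cell arriving in a slot cannot depart in that same slot); it replays one crosspoint XPB(1,1) under decoupled rate-based input and output sMUX schedules with a perpetually backlogged VOQ:

```python
from fractions import Fraction
from math import ceil

def smux(weights, num_slots):
    """Minimal sMUX, all initiation times 0 (0 = idle, f = service to flow f)."""
    nxt, out = [1] * len(weights), []
    for t in range(num_slots):
        best, best_dl = 0, None
        for f, w in enumerate(weights):
            if Fraction(nxt[f] - 1) / w <= t:       # eligible time E(f, i)
                dl = ceil(Fraction(nxt[f]) / w)     # upper-rounded deadline
                if best_dl is None or dl < best_dl:
                    best, best_dl = f + 1, dl
        out.append(best)
        if best:
            nxt[best - 1] += 1
    return out

# Flow (1,1) at rate 3/8; these row/column vectors are our own example.
row1 = [Fraction(3, 8), Fraction(4, 8), Fraction(1, 8)]   # input scheduler IS_1
col1 = [Fraction(3, 8), Fraction(2, 8), Fraction(2, 8)]   # output scheduler OS_1
T = 800
ins, outs = smux(row1, T), smux(col1, T)

occ = max_occ = ineffective = 0
for t in range(T):
    if outs[t] == 1:       # OS_1 serves XPB(1,1): effective only if nonempty
        if occ > 0:
            occ -= 1
        else:
            ineffective += 1
    if ins[t] == 1:        # IS_1 serves flow (1,1); VOQ(1,1) always backlogged
        occ += 1
    max_occ = max(max_occ, occ)
```

Under this instance the observed occupancy never exceeds the two-cell bound of Theorem 4.2, and exactly one output operation is ineffective, as Theorem 4.1 predicts.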
Theorem 4.3: Assume that once every T slots, the admissible bandwidth matrix is updated with possibly new parameters, and that each bandwidth is represented as r/T, where r is an integer. Then the XPB occupancy is still bounded by two cells.

Note that this two-cell XPB suffices for arbitrary fabric-internal latency and arbitrary link speed. In contrast, crosspoint buffers of 64 cells and of 2 KB have been used in earlier designs, sized for their respective products of fabric-internal round-trip time and flow or line rate.

C. Guaranteed Performances

The sBUX switch, including the input schedulers, the output schedulers, and the crosspoint buffers, is almost equivalent to an OQ switch operated by sMUX, because each sMUX output scheduler of sBUX conducts at most one ineffective service operation when all VOQs are backlogged at the inputs (Theorem 4.1), which is negligible for all practical purposes.

Taking this ineffective operation into account, sBUX provides the following worst-case performance guarantees.

Theorem 4.4: In a sBUX switch, flow (i, j), or the channel from input i to output j, is guaranteed a service whose covering and spacing deviations are bounded by small constants.

Theorem 4.5: In a sBUX switch, once a cell of flow (i, j) reaches the head of VOQ_{ij}, it will wait at most a bounded number of slots before it leaves the crosspoint buffer XPB_{ij} (excluding the constant fabric-internal propagation latency common to all cells).

Simulation shows that all the bounds of Theorems 4.1-4.5 are tight in the sense that they are achievable at certain instances. For example, in one such instance, starting from initiation time 0, the second output operation on the XPB at slot 7 is ineffective, and the XPB occupancy reaches 2 at the end of slot 12. In the cell sequence out of the output, the first cell coming from the input occurs at slot 1 and the second at slot 13. This produces an interval of length 11 covering no cell service to that input, or a covering deviation of 2 cell services; it also produces a 1-step space of 12 slots, or a spacing deviation of 7 slots, within the bound of Theorem 4.4. After the first cell of the flow is scheduled into the XPB at slot 1, the second cell becomes the head of the VOQ and waits 12 slots before it departs from the VOQ, within the bound of Theorem 4.5.

In general, given an admissible bandwidth matrix, if a switch, possibly with constant speedup support, can guarantee constant smoothness bounds for each flow from input i to output j, then we say that this switch achieves smooth switching. Such constant smoothness bounds imply that the switch can use 100% of the switch capacity to provide deterministic rate guarantees for each flow with constant fairness and jitter bounds, or perfect predictability. By definition, sBUX achieves smooth switching.

D. Summary

When supporting real-time services, the CICQ switch sBUX best achieves the two goals of scalability and predictability. sBUX requires the same memory bandwidth as an IQ switch without speedup, which is the lowest possible. Its computation is based on rate rather than per-slot information, and is conducted in a fully distributed manner; this minimizes communication overhead. The key operation of its computation, namely the minimum selection among multiple numbers, is amenable to fast hardware implementation, and seems indispensable to any scheduler with guaranteed performance under non-uniform traffic. It eliminates credit-based feedback flow control for the crosspoint buffers, decreases the crosspoint buffer size to the minimum, and eliminates the output buffer since no speedup is needed; all these greatly simplify the implementation. Its guaranteed smoothness bounds under any admissible bandwidth matrix are small constants, implying almost the best achievable qualities of service, including throughput, bandwidth and fairness, and delay and jitter guarantees. In particular, these guarantees are deterministic, worst-case, and finite-scaled, rather than stochastic, average-case, and asymptotic. To our best knowledge, no other CICQ or CIOQ (combined input- and output-queued) switches have been proved to hold such comprehensive quality-of-service guarantees. In short, sBUX is a transparent switch.

V. BANDWIDTH REGULATOR

sBUX is ready to support bandwidth-guaranteed services, since an admissible bandwidth matrix is already available through admission control. When sBUX is to support best-effort services, such an admissible bandwidth matrix is not available, and has to be obtained dynamically via a bandwidth regulator, whose design is described in this section.

A. Architecture

Fig. 3 shows the architecture of the bandwidth regulator, which is composed of distributed bandwidth demand estimators and a centralized bandwidth allocator. Periodically, the bandwidth demand estimators collect information about traffic arrivals and VOQ backlogs at each input port, and produce an on-line estimation of the bandwidth demand matrix. The bandwidth allocator receives the possibly inadmissible bandwidth demand matrix, transforms it into an admissible and fully loaded one, and then feeds it to the smoothed switch sBUX for scheduling in the next period.

B. Bandwidth Demand Estimator

Two sources of information can be used to design a bandwidth demand estimator: traffic arrival and VOQ backlog. To compensate for the one-period lag due to the bandwidth allocation computing, bandwidth demand estimation at the beginning of a period estimates the demand in the next period instead of the current one.

Traffic arrival prediction has received much attention in previous research, which uses the common technique of the exponentially weighted moving average filter to predict the future arrival based on the information of the past (factual) arrival and the current (predicted) one. Specifically, at the beginning of the kth period, we are to predict the arrival rate â_{ij}(k+1) of flow (i, j) in the (k+1)th period, based on the factual arrival rate a_{ij}(k−1) in the (k−1)th period and the predicted arrival rate â_{ij}(k) in the kth period:

  â_{ij}(k+1) = β · a_{ij}(k−1) + (1 − β) · â_{ij}(k),

where β is the gain or forgetting factor, 0 <= β <= 1.
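The arrival and backlog estimators can be combined into a small per-flow routine. The sketch below reflects our reading of the indexing; the class name, the clamp of the predicted backlog at zero, and the conversion of backlog into a rate by dividing by the period T are our assumptions:

```python
class DemandEstimator:
    """Per-flow bandwidth demand estimator: an EWMA arrival predictor
    plus a backlog term. T is the regulation period in slots; beta is
    the gain or forgetting factor, 0 <= beta <= 1."""

    def __init__(self, T, beta=0.75):
        self.T, self.beta = T, beta
        self.pred_arrival = 0.0  # predicted arrival rate for the current period

    def update(self, factual_arrival, backlog, allocated):
        """Called once per period with the measured arrival rate of the
        previous period, the current VOQ backlog (in cells), and the rate
        allocated for the current period; returns the estimated demand
        rate for the next period."""
        # EWMA: a_hat(k+1) = beta * a(k-1) + (1 - beta) * a_hat(k)
        nxt = self.beta * factual_arrival + (1 - self.beta) * self.pred_arrival
        # Predicted backlog at the end of the current period: current
        # backlog plus the deficit accumulated during this period.
        pred_backlog = max(0.0, backlog + (self.pred_arrival - allocated) * self.T)
        self.pred_arrival = nxt
        # Demand = arrival estimate plus backlog drained over one period.
        return nxt + pred_backlog / self.T
```

Under a constant arrival rate matched by the allocation, the estimate converges to that rate; a standing backlog raises the demand by backlog/T on top of it.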
Fig. 3. Architecture of the bandwidth regulator.

The VOQ backlog information reflects both the actual arrival and the actual allocation information; it is the deficit of allocation compared with arrival, accumulated in the past. This information is actually more important since, based solely on it, a maximum-weight-matching scheduler such as longest queue first guarantees 100% throughput for a crossbar switch. Specifically, at the beginning of the kth period, we are to predict the backlog b̂_{ij}(k+1) of flow (i, j) at the beginning of the (k+1)th period. While previous research uses the current backlog directly, we propose a more accurate estimator by taking into account the current factual backlog b_{ij}(k), plus the arrival rate and the allocated rate for the current period:

  b̂_{ij}(k+1) = b_{ij}(k) + ( â_{ij}(k) − a*_{ij}(k) ) · T,

where a*_{ij}(k) is the rate allocated for the kth period and T is the duration of a period. The final bandwidth demand estimator we use is the sum of the arrival and the backlog estimators:

  d_{ij}(k+1) = â_{ij}(k+1) + b̂_{ij}(k+1)/T.

It is worth noting that certain demand estimators do not fully take backlog information into account, which simplifies the subsequent bandwidth allocation, but probably at the cost of dynamic response.

C. Bandwidth Allocator

The design of the bandwidth allocator should take complexity, throughput, and fairness into account. Previous research considers the maximum-flow model best for this purpose, with bandwidth demand and line rate as capacity constraints. However, if the period degenerates into one slot, then the maximum flow degenerates into a maximum-size matching, which does not guarantee 100% throughput. Besides, the maximum-flow computation is of high complexity. Therefore, the maximum-flow model is neither sound in principle nor feasible in practice.

Another choice is the max-min fair allocator. It is an extension of the classical one-dimensional max-min fair allocator to the two-dimensional matrix, and guarantees that the minimum allocation to an element is the maximum among all admissible allocations. But its computing complexity is also high, and it is more fairness-oriented than throughput-oriented.

Instead, we use the proportional scaler, the simplest among the three. It is amenable to fast hardware implementation and provides proportional fairness. Specifically, suppose flow (i, j) demands bandwidth d_{ij}, with total demand R_i at row or input i, and total demand C_j at column or output j. Then the proportional scaler makes the allocation

  a_{ij} = d_{ij} / max(R_i, C_j).

Such an allocation is admissible, because the sum of row i is at most R_i/R_i = 1, and the sum of column j is at most C_j/C_j = 1.

Under rate-based switch scheduling with periodic rate updating, an admissible yet not fully loaded bandwidth allocation matrix will waste the excess capacity of the sBUX switch. The proportional scaler has a unique merit that is not available with the maximum-flow and max-min fair allocators: iterated proportional scaling can boost the bandwidth allocation proportionally while keeping admissibility, and is guaranteed to converge, possibly to a doubly stochastic, or fully loaded, matrix if each element is originally positive. However, within a fixed number of iterations, proportional scaling cannot converge to a doubly stochastic matrix; there is still excess bandwidth to be used. Therefore, after applying the proportional scaler just once, we use a bandwidth booster to obtain a doubly stochastic allocation matrix.

The bandwidth booster is an extension of the Inukai algorithm for the same purpose. The Inukai algorithm starts from position (1, 1), boosts the element to saturate at least one of the associated two lines, then makes a down or right move and repeats boosting, until a doubly stochastic matrix is obtained within 2N − 1 steps. This algorithm is unfair, since elements appearing earlier in the path get more chances of boosting; besides, the algorithm is sequential. We propose a parallel and fair booster. Specifically, partition the matrix into N diagonals, each one corresponding to a perfect match; boost in parallel the N elements in each diagonal to saturation, which is fair to rows and columns; finish the N diagonals in N steps. To further improve fairness, in each period we select a new starting diagonal in a round-robin manner.

Fig. 4. Illustration of bandwidth allocation. The first matrix is the bandwidth demand matrix. The second is the bandwidth allocation matrix after proportional scaling. The third to fifth are the results of three steps of the bandwidth booster.

Fig. 4 is an illustration of how to transform a bandwidth demand matrix into a doubly stochastic one by the proportional scaler and the bandwidth booster. The matrices have been transformed into integral ones, with the period T as the target line sum. The proportional scaler works as a_{ij} = ⌊d_{ij}·T / max(R_i, C_j)⌋; the bandwidth booster works as a_{ij} ← a_{ij} + T − max(R_i, C_j). Such integral representation and operation are amenable to hardware implementation.

D. Hardware Implementation

The implementation efficiency of the bandwidth regulator determines the minimum period T that is affordable in dynamic bandwidth updating.
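The integral proportional scaler and the diagonal bandwidth booster can be sketched together as follows (our own naming; the floor rounding, the guard against empty rows or columns, and the fixed diagonal offset standing in for the round-robin starting diagonal are our assumptions):

```python
def regulate(demand, T):
    """Proportional scaling followed by the diagonal bandwidth booster
    over an N x N matrix of integral demands; T is the target line sum."""
    N = len(demand)
    R = [sum(row) for row in demand]                       # row demand sums
    C = [sum(demand[i][j] for i in range(N)) for j in range(N)]
    # Proportional scaler: a_ij = floor(d_ij * T / max(R_i, C_j)).
    # The extra 1 in max() only guards against an all-zero line.
    a = [[(demand[i][j] * T) // max(R[i], C[j], 1) for j in range(N)]
         for i in range(N)]
    R = [sum(row) for row in a]
    C = [sum(a[i][j] for i in range(N)) for j in range(N)]
    # Booster: sweep the N diagonals; each boost a_ij += T - max(R_i, C_j)
    # saturates the element's row or column without breaking admissibility.
    for d in range(N):
        for i in range(N):
            j = (i + d) % N
            delta = T - max(R[i], C[j])
            a[i][j] += delta
            R[i] += delta
            C[j] += delta
    return a
```

After the sweep, every row and every column sums exactly to T, i.e., the allocation is doubly stochastic at scale T; the booster only ever adds bandwidth, so the scaled allocation each flow received is never taken away.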
TABLE II: TIME TO IMPLEMENT THE BANDWIDTH REGULATOR

The bandwidth demand estimators are distributed in line cards, and their implementation is trivial. Each line card will transmit bandwidth demands to the bandwidth allocator in the switch core, and then receive bandwidth allocations from the allocator. All line cards work in parallel. Assume each bandwidth demand or allocation is encoded as an -bit integer and is transmitted between line cards and the switch core through high-speed serial links at speed ; then the total time of transmission is .

The proportional scaling is done by . The two line sums and can be computed along with the receiving of bandwidth demands in a pipelined manner, and hence incur no additional time cost. Each comparison operation takes time . The multiplication by takes no time, since can be set as a power of two. Each division operation takes time . Therefore, the proportional scaling of elements needs time . Since all these operations are independent, they can be done in parallel with parallel components, which reduces the time to .

With a proportional bandwidth allocation in hand, before boosting, we should first compute the line sums and . Each element is involved in two additions, and elements in each of the diagonals can be computed in parallel. This incurs time , where is the time of an addition operation. Then the bandwidth boosting is done in steps. At each step, elements in a diagonal are boosted in parallel. Each boosting operation adds to , , and , and involves one comparison and four addition operations. The steps need time .

To sum up, the total time for the bandwidth regulation is . With available hardware technology, , , , , . We conduct emulation with the FPGA chip EP2S180F1508C3 (Altera Stratix II). Since each bandwidth demand is coded in , for , the line sum can be coded in 26 bits. By 20-stage pipelining, a 26-bit divider can be implemented in 10 ns. And 100 dividers consume only 61% of the chip resource. Table II presents the time estimation for different switch sizes , in units of μs and of slots of a 64-byte cell at 10-Gb/s (OC-192) and 40-Gb/s (OC-768) link speed. Even for , 16-bit encoding is enough to represent the number of cells arriving in a period of bandwidth regulation. For , rate information can be extracted within several microseconds.

Fig. 5. The durations of 932 phases in 1 000 000 slots, each generated randomly in [100, 2000].

VI. SIMULATION

In this section, we evaluate the performance of sBUX plus the bandwidth regulator under best-effort traffic.

A. Simulation Design

There are three key components for the simulation design: switch settings, traffic patterns, and performance measures.

The switch is the integration of the bandwidth regulator and the smoothed buffered crossbar sBUX, with switch size 8, 16, 32, 64, and bandwidth regulation period 32, 64, 128, 256, 512, 1024, 2048, 4096.

At inputs, random linear combinations of three key traffic patterns with random durations are generated to reflect the dynamic nature of arriving traffic. This is called the synthetic pattern, and it is more difficult to deal with than any of its individual components. The three component traffic patterns are the uniform (U), log-diagonal (LD), and unbalanced (UB) ; the last one is used at an unbalanced factor of , as that is harsh to most switches. The simulation lasts 1 000 000 slots and contains 932 phases whose durations are generated randomly within [100, 2000] (Fig. 5). In each phase, three random coefficients , and are generated, and is synthesized as the traffic pattern for this phase. Traffic load varies from 0.1 to 1.0, with an increment of 0.1 when and an increment of 0.05 when . We also use a bursty traffic pattern, i.e., an on-off arrival process modulated by a two-state Markov chain, to test the effect of burstiness on sBUX .

At outputs, throughput, average delay, and burstiness are measured over the whole simulation duration of 1 000 000 slots. Throughput is the ratio of the number of cells that appear at outputs to the number of cells that arrive at inputs, when each input is fully loaded . Average delay is measured over all cells that appear at outputs and under all loads . Average burstiness is measured over all the segments at outputs, with each segment being a maximal subsequence of consecutive cells from the same input.

B. Simulation Results

Simulation results are presented in three parts.

The first part studies the bandwidth demand estimator design. Besides selecting , we also test a bandwidth demand estimator that is based solely on backlog, just for comparison. Simulation is done under synthetic traffic at , , with varied in [0.1, 1.0] at an increment of 0.1. All throughputs are 0.997, with no difference at all; but average delays differ, as Fig. 6 shows. Specifically, by the addition of arrival prediction, average delay can be reduced by at least 35% at medium load . The choice of , however, seems insignificant. Simulations with other settings of and produce similar results. To keep a balance between past and present, we select in subsequent simulations. This choice simplifies hardware implementation of the arrival predictor.

IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 16, NO. 3, JUNE 2008

Fig. 6. Average delay under synthetic traffic at N = 32, T = 256.
Fig. 7. Average delay under synthetic traffic at N = 32.
Fig. 8. Average delay under synthetic traffic at T = 256.
TABLE III: THROUGHPUT UNDER SYNTHETIC TRAFFIC
TABLE IV: PERFORMANCES UNDER BURSTY TRAFFIC (N = 32, T = 256)

The second part studies the effects of period and switch size . All combinations of switch size 8, 16, 32, 64 and period 32, 64, 128, 256, 512, 1024, 2048, 4096 are tested. Table III shows the throughputs, all of which are almost 100%, with marginal differences between each other.

Fig. 7 shows the average delay under synthetic traffic at varied periods yet unvaried switch size . When increases, the average delay increases slowly. The tendency is the same with other . With , , a 64-byte cell, and 40-Gb/s link speed, the average delay is 470 slots or 6 μs at .

Fig. 8 shows the average delay under synthetic traffic at varied switch size yet unvaried period . When increases, the average delay increases slowly. The tendency is the same with other .

The third part studies the performance under bursty traffic. As Table IV shows, when the input burst length increases, average delay also increases, but the output burst length remains 1 for any loading, light or heavy, meaning perfect burstiness reduction by sBUX. The throughput remains over 99%.

VII. CONCLUSION

Rate-based smoothed switching is the design approach we propose in this paper, which decomposes a switch design into a generic bandwidth regulator and a specific smoothed switch, with an aim to achieve greater scalability and predictability.

Rate-based control is in sharp contrast to the predominant approach of credit-based control. The latter uses instantaneous buffer occupancy information at each time slot to compute the schedule and prevent buffer overflow. However, this per-slot computing and communication mode is difficult to scale with faster link speed and larger switch size. The scalability consideration urges us to use more compact information and a coarse-grained, or batch, mode of processing. Rate information is compact in nature, as it is applicable for a period of time slots, and it can be extracted feasibly in microseconds by our bandwidth regulator. The rate-based smoothed buffered crossbar sBUX removes the credit-based flow control, augments the distributed feature of input and output scheduling, reduces the capacity of each crosspoint buffer to a minimal constant of two cells, and eliminates the need for speedup. All these greatly simplify the implementation and enhance the scalability.

Rate-based smoothed switching is also highly predictable. With just a two-cell buffer at each crosspoint and no speedup, sBUX guarantees 100% throughput for real-time services.
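The phase-based synthetic pattern and the output-burstiness measure of Section VI can be sketched as follows. This is a simplified illustration: the precise U/LD/UB matrix definitions and coefficient normalization are assumptions, not the paper's exact formulas, and all names are illustrative.

```python
import random

def uniform(n):
    # Uniform (U): each input spreads its load evenly over all outputs.
    return [[1.0 / n] * n for _ in range(n)]

def log_diagonal(n):
    # Log-diagonal (LD): input i sends to output (i+k) mod n with
    # weight proportional to 2^-k (assumed form).
    w = [2.0 ** -k for k in range(n)]
    s = sum(w)
    return [[w[(j - i) % n] / s for j in range(n)] for i in range(n)]

def unbalanced(n, u=0.5):
    # Unbalanced (UB): a fraction u of each input's load goes to the
    # paired output; the rest is spread uniformly (assumed form).
    m = [[(1 - u) / n] * n for _ in range(n)]
    for i in range(n):
        m[i][i] += u
    return m

def synthetic_matrix(n, rng):
    # One phase of the synthetic pattern: a random convex combination
    # of the three component patterns.
    c = [rng.random() for _ in range(3)]
    s = sum(c)
    c = [x / s for x in c]
    parts = [uniform(n), log_diagonal(n), unbalanced(n)]
    return [[sum(c[k] * parts[k][i][j] for k in range(3))
             for j in range(n)] for i in range(n)]

def avg_burstiness(cells):
    # cells: sequence of input indices observed at one output; a segment
    # is a maximal run of consecutive cells from the same input, and the
    # burstiness is the mean segment length.
    if not cells:
        return 0.0
    segments = 1
    for a, b in zip(cells, cells[1:]):
        if a != b:
            segments += 1
    return len(cells) / segments
```

An output burst length of 1, as reported in Table IV, corresponds to `avg_burstiness` returning 1.0, i.e., no two consecutive output cells come from the same input.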
In particular, the almost ideal smoothness guarantees imply that rate guarantees are very accurate even in the worst case and in any time window, short or long, finite or infinite. In some sense, the switch is so predictable that it can be considered transparent. Simulation shows that, coupled with the bandwidth regulator, the switch guarantees almost 100% throughput and perfect burstiness reduction for best-effort services, with average delays of at most tens of microseconds even under heavy load and reasonably large switch size . It is worth noting that the approach of rate-based smoothed switching can support best-effort and real-time services gracefully in the same model, an obvious yet important feature we do not elaborate in the paper.

One of the key problems left is to design a bandwidth regulator with guaranteed 100% throughput and with improved average delay and implementation efficiency. We conjecture that, based on just queueing backlog information, the proportional allocator, after sufficient iterations, can guarantee 100% throughput for best-effort services.

The smooth switching problem offers a critical perspective for switch design and analysis, and it remains open for most switches, even for the classical single-stage crossbar switch, despite the feasible perfect emulations of certain push-in, first-out OQ switches , . We conjecture that a crossbar switch can achieve smooth switching with a speedup of no more than two. Considering that even rate guarantees are difficult to obtain for most switches, the smoothness guarantee may be a luxury. But smoothness does play an indispensable role in bounding the buffer usage in the switches, as demonstrated by sBUX.

The further scalability of buffered crossbar switches might suffer from the crosspoints. When the switch size scales to hundreds or even thousands, three-stage buffered crossbar switches are preferable; a recent example is a 1024 x 1024 three-stage switch, with thirty-two 32 x 32 buffered crossbar chips at each stage . There is a solution for rate-based three-stage bufferless crossbar switches , but we hope that, with sBUX as the switching element, rate-based three-stage buffered switch fabrics can be made simpler in implementation and more predictable in performance.

APPENDIX

A. Proof of Theorem 2.1

The theorem is correct when the interval for definition contains exactly one cell service to flow . Let us consider the general cases where the interval for definition contains at least two cell services to flow . The proof is by contradiction.

(1) Assume that , but . Then , such that . Hence, , , such that . Therefore , contradictory with the assumption .

(2) Assume that , but . Then , such that . Assume that is defined within ; then . Since is also defined within , hence . If , since , it is implied that . Hence, , contradiction. It is the same with . If , we can locate the cell service to flow immediately before slot at , and locate the cell service to flow immediately after slot at . Since , then , and , contradiction.

(3) Assume that , but . Then , such that . The interval contains number of cell services to flow , but has a length of at most . Then , contradictory with the assumption .

(4) Assume that , but . Then , such that . Within interval , there are at least number of cell services to flow , which decide an -step space of at most , contradictory with the assumption .

B. Proof of Theorem 2.2

The theorem is correct when the interval for definition contains exactly one cell service to flow . Let us consider the general cases where the interval for definition contains at least two cell services to flow . We will use two general properties concerning any integer and real number :

Property 1: .

Property 2: .

(1) (2) Assume that and are defined within interval , which covers totally number of cell services to flow , with the first at slot and the last at slot ; and are defined within . Define , i.e., is defined within the interval for definition of the spacing properties and . Since the domain of definition of is a subset of that of , . In the same way we define , and .

(1) For any step size , , let ; then (since , by Theorem 2.1, , hence ; if , by Theorem 2.1, , contradiction). By Property 1, . By Property 2, . Hence, .

(2) For any step size , , let ; then (since , by Theorem 2.1, , hence ; if , by Theorem 2.1, , contradiction). By Property 2, . Therefore, .

(3) (4) Assume that and are defined within interval , , which has a length of . The and are defined within the same interval.

(3) For any interval length , , let . If , then , and (if , then by Theorem 2.1, , contradiction). That is, . By Property 1, . By Property 2, . Hence, .

If and are finite, we have to deal with two special cases. Note that , , . If , set , subject to and . If , that is, . Therefore, .

(4) For any interval length , , let ; then (if , then by Theorem 2.1, , contradiction). That is, .

C. Proof of Theorem 2.4

First we prove that the distribution of cell services to flow is ideal over arbitrary intervals. Consider an arbitrary interval . By Property 1, . By Property 2, . Hence, . Since and are arbitrary, is valid for any interval.

Next, we prove that the distribution of cell services to flow in the schedule is also ideal. For any slot and interval length , since the distribution of cell services to flow is ideal, . Since , , we have . That is, .

Therefore, the distributions of cell services to flow and in the schedule are both ideal, so the schedule is ideal.

D. Proof of Theorem 3.1

Lemma 3.1 (Spuri): Given a set of real-time jobs , where job requires a service time of after its eligible time but before its deadline . Given an interval of time , define the processor demand of the job set on interval as . Then any job set is feasibly scheduled by EDF on a uniprocessor if and only if its processor demand on any interval .

The th cell service to flow can be modeled as a real-time job , where , , and . Scheduler sMUX is to schedule the job set by EDF on a uniprocessor. Given an integral interval , assume that number of cell services to flow , say, from the th cell service to the th, are completely contained inside the interval . By , . Therefore, . The processor demand of the job set on is . Since both sides of are integers, .
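The covering property of smoothness defined in Section II, which Theorems 2.1 and 2.2 relate to the spacing view, can be checked mechanically: in an ideally smooth schedule at rate r, any interval of L consecutive slots covers either floor(L*r) or ceil(L*r) cell services to the flow. The brute-force checker below is a sketch for small schedules; function and variable names are illustrative.

```python
from math import ceil, floor

def covering_smooth(schedule, flow, rate):
    """Check the covering view of smoothness: every interval of L
    consecutive slots contains floor(L*rate) or ceil(L*rate) services
    to `flow`. Brute force over all O(T^2) intervals of the schedule;
    schedule entries are flow symbols (0 denoting an idle service).
    """
    T = len(schedule)
    for start in range(T):
        count = 0
        for L in range(1, T - start + 1):
            if schedule[start + L - 1] == flow:
                count += 1
            if count not in (floor(L * rate), ceil(L * rate)):
                return False
    return True
```

For example, the alternating schedule `[1, 0, 1, 0, 1, 0]` is covering-smooth for flow 1 at rate 1/2, while the clustered `[1, 1, 0, 0]` is not, since its first two slots cover two services where an interval of length two should cover exactly one.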
Given an arbitrary interval , then
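The feasibility test of Lemma 3.1 (Spuri) used in the proof of Theorem 3.1 can be sketched directly: compute the processor demand of the job set on an interval as the total service time of jobs whose eligible times and deadlines both fall inside it, and verify the demand never exceeds the interval length. Since the demand only changes at eligible times and deadlines, checking those endpoints suffices. The job encoding below is illustrative.

```python
def processor_demand(jobs, t1, t2):
    # jobs: list of (eligible_time, deadline, service_time) triples.
    # Demand on (t1, t2): total service time of jobs that must be
    # executed entirely within the interval.
    return sum(c for e, d, c in jobs if e >= t1 and d <= t2)

def edf_feasible(jobs):
    """Lemma 3.1 (Spuri): a job set is feasibly scheduled by EDF on a
    uniprocessor iff its processor demand on every interval (t1, t2)
    is at most t2 - t1. The demand is piecewise constant, so only
    eligible times and deadlines need to be checked as endpoints."""
    starts = sorted({e for e, _, _ in jobs})
    ends = sorted({d for _, d, _ in jobs})
    return all(processor_demand(jobs, t1, t2) <= t2 - t1
               for t1 in starts for t2 in ends if t2 > t1)
```

For instance, two unit-time jobs sharing the window (0, 2) are feasible, while two unit-time jobs that both must finish within (0, 1) overload the interval and fail the test.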
Therefore, by Lemma 3.1, the job set is feasibly scheduled by sMUX.

E. Proof of Theorem 3.2

(1) By Theorem 3.1, . That is, the th cell service to flow is covered by interval ; therefore . By case (1) of Theorem 3.2, . Therefore, .

(2) For any , . Therefore, .

(3) Since , with an ideal schedule as reference, we obtain .

(4) By Theorem 3.1, in the sMUX schedule , the th and the th cell services to flow satisfy . Therefore, , and .

(5) Since , with an ideal schedule as reference, we have . In the same way we can prove .

F. Proof of Theorem 4.1

Under the critical situation, all input operations are effective. This means the position of the th cell service to flow is beyond interval . Therefore, . As a result, .

G. Proof of Theorem 4.2

To obtain , it suffices to merely check the critical situation, under which all input operations are effective and at most one output operation is ineffective (Theorem 4.1). By Theorem 3.2, case (1), and Theorem 4.1, . Therefore, . In the same way, we can prove .

H. Proof of Theorem 4.3

Proof: The key to the proof is the fact that during each period of slots, sMUX will conduct exactly number of input operations and just the same number of output operations. Again, to deduce the maximum occupancy of crosspoint buffer , just consider the critical situation, in which all input operations are effective.
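The counting argument behind the two-cell crosspoint-buffer bound (Theorem 4.3) can be illustrated with a toy occupancy tracker: cells enter on input operations, leave on effective output operations, and an output operation against an empty buffer is ineffective. The operation strings below are illustrative worst cases of the proof's interleaving, not the actual sMUX schedule.

```python
def max_occupancy(ops, initial=0):
    """Track crosspoint-buffer occupancy over a sequence of operations.

    ops: string of 'I' (input operation: a cell enters the buffer) and
    'O' (output operation: a cell leaves if the buffer is non-empty;
    otherwise the output operation is ineffective). Returns the peak
    occupancy reached.
    """
    occ, peak = initial, initial
    for op in ops:
        if op == 'I':
            occ += 1
        elif occ > 0:  # effective output operation
            occ -= 1
        peak = max(peak, occ)
    return peak
```

With one cell left over from the previous period and input operations leading output operations by at most one within the period, as in `max_occupancy("IOIOIO", initial=1)`, the occupancy peaks at two cells, matching the theorem's bound.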
During the first period of slots, Theorem 4.2 has proved . In particular, the same number of input operations and output operations are conducted, and at most one output operation is ineffective (Theorem 4.1). Therefore, at the end of the first period, has at most one cell stored in it.

Now consider the second period. At the beginning, if there is no cell in , then the conclusion is just the same as in the first period. If there is one cell stored in at the beginning, then there will be no ineffective output operation, because from the beginning of the period, the number of output operations exceeds the number of input operations by at most 1 (by case (1) of Theorem 3.2, ). Symmetrically, from the beginning of the period, the number of (effective) input operations exceeds the number of (effective) output operations by at most 1 (for the same reason as above). Taking the initial one cell into account, at any time during the second period, there are at most two cells in , or . Since the number of (effective) input operations and the number of (effective) output operations are equal, there will be one cell left in at the end of the second period.

The conclusion is the same for all the other periods.

I. Proof of Theorem 4.4

(1) Assume that is always backlogged. Let us derive the bounds to , the number of effective output operations in interval , , with the switch core as our point of view.

For the upper bound, by Theorem 3.2, .

For the lower bound, by Theorems 3.2 and 4.1, .

(2) The spacing smoothness bounds can be derived directly by Theorem 2.2.

J. Proof of Theorem 4.5

Just consider the critical situation, which produces the largest delay for the cell at the head of the VOQ. Let denote the position of the th cell of flow in the output sequence. Consider the interval . By case (1) of Theorem 3.2, this interval covers at least output operations. Since at most one output operation is ineffective, this interval will cover at least effective output operations. Therefore, .

We already know that the th cell enters no earlier than its eligible time ; thus the th cell becomes the head of no earlier than . Therefore, from the th cell becoming the head of to its total departure from , the delay is no more than .

REFERENCES

[1] F. Abel, C. Minkenberg, R. P. Luijten, M. Gusat, and I. Iliadis, "A four-terabit packet switch supporting long round-trip times," IEEE Micro, pp. 10–24, Jan./Feb. 2003.
[2] M. Ajmone-Marsan, P. Giaccone, E. Leonardi, and F. Neri, "Local scheduling policies in networks of packet switches with input queues," in Proc. IEEE INFOCOM, 2003.
[3] T. E. Anderson, S. S. Owicki, J. B. Saxe, and C. P. Thacker, "High-speed switch scheduling for local area networks," ACM Trans. Comput. Syst., vol. 11, no. 4, pp. 319–352, Nov. 1993.
[4] H. Balakrishnan, S. Devadas, D. Ehlert, and Arvind, "Rate guarantees and overload protection in input-queued switches," in Proc. IEEE INFOCOM, Mar. 2004.
[5] C.-S. Chang, W.-J. Chen, and H.-Y. Huang, "Birkhoff-von Neumann input buffered crossbar switches," in Proc. IEEE INFOCOM, 2000.
[6] C.-S. Chang, D.-S. Lee, and Y.-S. Jou, "Load balanced Birkhoff-von Neumann switches," Comput. Commun., vol. 25, pp. 611–634, 2002.
[7] H. J. Chao, "Saturn: A terabit packet switch using dual round-robin," IEEE Commun. Mag., vol. 38, no. 12, pp. 78–84, 2000.
[8] F. M. Chiussi and A. Francini, "Scalable electronic packet switches," IEEE J. Sel. Areas Commun., vol. 21, no. 4, pp. 486–500, May 2003.
[9] N. Chrysos and M. Katevenis, "Scheduling in non-blocking buffered three-stage switching fabrics," in Proc. IEEE INFOCOM, 2006.
[10] S.-T. Chuang, A. Goel, N. McKeown, and B. Prabhakar, "Matching output queueing with a combined input/output-queued switch," IEEE J. Sel. Areas Commun., vol. 17, no. 6, pp. 1030–1039, 1999.
[11] S.-T. Chuang, S. Iyer, and N. McKeown, "Practical algorithms for performance guarantees in buffered crossbars," in Proc. IEEE INFOCOM, 2005.
[12] J. G. Dai and B. Prabhakar, "The throughput of data switches with and without speedup," in Proc. IEEE INFOCOM, 2000, pp. 556–564.
[13] Q. Duan and J. N. Daigle, "Resource allocation for quality of service provision in buffered crossbar switches," in Proc. 11th ICCCN, Oct. 2002, pp. 509–513.
[14] P. Giaccone, E. Leonardi, and D. Shah, "On the maximal throughput of networks with finite buffers and its application to buffered crossbar," in Proc. IEEE INFOCOM, 2005.
[15] T. Inukai, "An efficient SS/TDMA time slot assignment algorithm," IEEE Trans. Commun., vol. 27, no. 10, pp. 1449–1455, Oct. 1979.
[16] K. G. I. Harteros, "Fast parallel comparison circuits for scheduling," Inst. Comput. Sci., FORTH, Heraklion, Crete, Greece, Tech. Rep. FORTH-ICS/TR-304.
[17] M. Hosaagrahara and H. Sethu, "Max-min fair scheduling in input-queued switches," IEEE Trans. Parallel Distrib. Syst., in press. DOI: 10.1109/TPDS.2007.70746.
[18] J. Y. Hui, Switching and Traffic Theory for Integrated Broadband Networks. Boston, MA: Kluwer Academic, 1990.
[19] T. Javidi, R. B. Magil, and T. Hrabik, "A high-throughput scheduling algorithm for a buffered crossbar switch fabric," in Proc. IEEE ICC, Jun. 2001, pp. 1586–1591.
[20] M. Katevenis, G. Passas, D. Simos, I. Papaefstathiou, and N. Chrysos, "Variable packet size buffered crossbar (CICQ) switches," in Proc. IEEE ICC, Jun. 2004.
[21] I. Keslassy, M. Kodialam, T. V. Lakshman, and D. Stiliadis, "On guaranteed smooth scheduling for input-queued switches," in Proc. IEEE INFOCOM, 2003.
[22] P. Krishna, N. S. Patel, A. Charny, and R. J. Simcoe, "On the speedup required for work-conserving crossbar switches," IEEE J. Sel. Areas Commun., vol. 17, no. 6, pp. 1057–1066, Jun. 1999.
[23] H. T. Kung and R. Morris, "Credit-based flow control for ATM networks," IEEE Network, pp. 40–48, Mar./Apr. 1995.