Dual-resource TCP/AQM for Processing-constrained Networks



IEEE/ACM TRANS. ON NETWORKING, VOL. 6, NO. 1, JUNE 2008

Dual-resource TCP/AQM for Processing-constrained Networks

Minsu Shin, Student Member, IEEE, Song Chong, Member, IEEE, and Injong Rhee, Senior Member, IEEE

Abstract: This paper examines congestion control issues for TCP flows that require in-network processing on the fly in network elements such as gateways, proxies, firewalls and even routers. Applications of these flows will be increasingly abundant in the future as the Internet evolves. Since these flows require use of CPUs in network elements, both bandwidth and CPU resources can be a bottleneck and thus congestion control must deal with "congestion" on both of these resources. In this paper, we show that conventional TCP/AQM schemes can significantly lose throughput and suffer harmful unfairness in this environment, particularly when CPU cycles become more scarce (which is likely the trend given the recent explosive growth rate of bandwidth). As a solution to this problem, we establish a notion of dual-resource proportional fairness and propose an AQM scheme, called Dual-Resource Queue (DRQ), that can closely approximate proportional fairness for TCP Reno sources with in-network processing requirements. DRQ is scalable because it does not maintain per-flow states while minimizing communication among different resource queues, and is also incrementally deployable because it requires no change in TCP stacks. The simulation study shows that DRQ approximates proportional fairness without much implementation cost and that even an incremental deployment of DRQ at the edge of the Internet improves the fairness and throughput of these TCP flows. Our work is at an early stage and might lead to an interesting development in congestion control research.

Index Terms: TCP-AQM, transmission link capacity, CPU capacity, fairness, efficiency, proportional fairness.

An early version of this paper was presented at the IEEE INFOCOM 2006, Barcelona, Spain, 2006. This work was supported by the center for Broadband OFDM Mobile Access (BrOMA) at POSTECH through the ITRC program of the Korean MIC, supervised by IITA. (IITA-2006-C1090-0603-0037). Minsu Shin and Song Chong are with the School of Electrical Engineering and Computer Science, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 305-701, Korea (email: msshin@netsys.kaist.ac.kr; song@ee.kaist.ac.kr). Injong Rhee is with the Department of Computer Science, North Carolina State University, Raleigh, NC 27695, USA (email: rhee@csc.ncsu.edu).

I. INTRODUCTION

Advances in optical network technology enable a fast-paced increase in physical bandwidth whose growth rate has far surpassed that of other resources such as CPU and memory bus. This phenomenon causes network bottlenecks to shift from bandwidth to other resources. The rise of new applications that require in-network processing hastens this shift, too. For instance, a voice-over-IP call made from a cell phone to a PSTN phone must go through a media gateway that performs audio transcoding "on the fly" as the two end points often use different audio compression standards. Examples of in-network processing services are increasingly abundant, from security and performance-enhancing proxies (PEP) to media translation [1] [2]. These services add additional loads to processing capacity in the network components. New router technologies such as extensible routers [3] or programmable routers [4] also need to deal with scheduling of CPU usage per packet as well as bandwidth usage per packet. Moreover, the standardization activities to embrace various network applications, especially at network edges, are found in [5] [6] under the name of Open Pluggable Edge Services.

In this paper, we examine congestion control issues for an environment where both bandwidth and CPU resources can be a bottleneck. We call this environment the dual-resource environment. In the dual-resource environment, different flows could have different processing demands per byte.

Traditionally, congestion control research has focused on managing only bandwidth. However, we envision (and it is indeed happening now to some degree) that diverse network services reside somewhere inside the network, most likely at the edge of the Internet, processing, storing or forwarding data packets on the fly. As in-network processing is likely to be popular in the future, our work, which examines whether the current congestion control theory can be applied without modification or, if not, what scalable solutions can be applied to fix the problem, is highly timely.

In our earlier work [7], we extended proportional fairness to the dual-resource environment and proposed a distributed congestion control protocol for the same environment where end-hosts are cooperative and explicit signaling is available for congestion control. In this paper, we propose a scalable active queue management (AQM) strategy, called Dual-Resource Queue (DRQ), that can be used by network routers to approximate proportional fairness, without requiring any change in end-host TCP stacks. Since it does not require any change in TCP stacks, our solution is incrementally deployable in the current Internet. Furthermore, DRQ is highly scalable in the number of flows it can handle because it does not maintain per-flow states or queues. DRQ maintains only one queue per resource and works with classes of application flows whose processing requirements are a priori known or measurable.

Resource scheduling and management of one resource type in network environments where different flows could have different demands is a well-studied area of research. Weighted-fair queuing (WFQ) [8] and its variants such as deficit round robin (DRR) [9] are well known techniques to achieve fair and efficient resource allocation. However, these solutions are not scalable and implementing them in a high-speed router with many flows is difficult since they need to maintain per-flow queues and states. Another extreme is to have routers maintain simpler queue management schemes such as RED [10], REM [11] or PI [12]. Our study finds that
these solutions may yield extremely unfair allocation of CPU and bandwidth and sometimes lead to very inefficient resource usage.

Some fair queueing algorithms such as Core-Stateless Fair Queueing (CSFQ) [13] and Rainbow Fair Queueing (RFQ) [14] have been proposed to eliminate the problem of maintaining per-flow queues and states in routers. However, those schemes are concerned with bandwidth sharing only and do not consider joint allocation of bandwidth and CPU cycles. Estimation-based Fair Queueing (EFQ) [15] and Prediction Based Fair Queueing (PBFQ) [16] have also been proposed for fair CPU sharing but they require per-flow queues and do not consider joint allocation of bandwidth and CPU cycles either.

Our study, to the best of our knowledge, is the first to examine the issues of TCP and AQM under the dual-resource environment, and we show by simulation that DRQ achieves fair and efficient resource allocation without imposing much implementation cost. The remainder of this paper is organized as follows. In Section II, we define the problem and fairness in the dual-resource environment, in Sections III and IV, we present DRQ and its simulation study, and in Section V, we conclude our paper.

II. PRELIMINARIES: NETWORK MODEL AND DUAL-RESOURCE PROPORTIONAL FAIRNESS

A. Network model

We consider a network that consists of a set of unidirectional links, $L = \{1, \cdots, L\}$, and a set of CPUs, $K = \{1, \cdots, K\}$. The transmission capacity (or bandwidth) of link $l$ is $B_l$ (bits/sec) and the processing capacity of CPU $k$ is $C_k$ (cycles/sec). These network resources are shared by a set of flows (or data sources), $S = \{1, \cdots, S\}$. Each flow $s$ is associated with its data rate $r_s$ (bits/sec) and its end-to-end route (or path) which is defined by a set of links, $L(s) \subset L$, and a set of CPUs, $K(s) \subset K$, that flow $s$ travels through. Let $S(l) = \{s \in S \mid l \in L(s)\}$ be the set of flows that travel through link $l$ and let $S(k) = \{s \in S \mid k \in K(s)\}$ be the set of flows that travel through CPU $k$. Note that this model is general enough to include various types of router architectures and network elements with multiple CPUs and transmission links.

Flows can have different CPU demands. We represent this notion by the processing density $w_s^k$, $k \in K$, of each flow $s$, which is defined to be the average number of CPU cycles required per bit when flow $s$ is processed by CPU $k$. $w_s^k$ depends on $k$ since different processing platforms (CPU, OS, and software) would require a different number of CPU cycles to process the same flow $s$. The processing demand of flow $s$ at CPU $k$ is then $w_s^k r_s$ (cycles/sec).

Since there are limits on CPU and bandwidth capacities, the amount of processing and bandwidth usage by all flows sharing these resources must be less than or equal to the capacities at any time. We represent this notion by the following two constraints: for each CPU $k \in K$, $\sum_{s \in S(k)} w_s^k r_s \le C_k$ (processing constraint) and for each link $l \in L$, $\sum_{s \in S(l)} r_s \le B_l$ (bandwidth constraint). These constraints are called dual-resource constraints and a nonnegative rate vector $r = [r_1, \cdots, r_S]^T$ satisfying these dual constraints for all CPUs $k \in K$ and all links $l \in L$ is said to be feasible.

A1: We assume that each CPU $k \in K$ knows the processing densities $w_s^k$ for all the flows $s \in S(k)$.

This assumption is reasonable because a majority of Internet applications are known and their processing requirements can be measured either off-line or on-line as discussed below. In practice, network flows could be readily classified into a small number of application types [15], [17]-[19]. That is, there is a finite set of application types, a flow is an instance of an application type, and flows will have different processing densities only if they belong to different application types. In [17], applications have been divided into two categories: header-processing applications and payload-processing applications, and each category has been further divided into a set of benchmark applications. In particular, the authors in [15] experimentally measure the per-packet processing times for several benchmark applications such as encryption, compression, and forward error correction. The measurement results find the network processing workloads to be highly regular and predictable. Based on the results, they propose an empirical model for the per-packet processing time of these applications for a given processing platform. Interestingly, it is a simple affine function of packet size $M$, i.e., $\mu_a^k + \nu_a^k M$ where $\mu_a^k$ and $\nu_a^k$ are the parameters specific to each benchmark application $a$ for a given processing platform $k$. Thus, the processing density (in cycles/bit) of a packet of size $M$ from application $a$ at platform $k$ can be modelled as $\frac{\mu_a^k}{M} + \nu_a^k$. Therefore, the average processing density $w_a^k$ of application $a$ at platform $k$ can be computed upon arrival of a packet using an exponentially weighted moving average (EWMA) filter:

$$w_a^k \leftarrow (1 - \lambda)\, w_a^k + \lambda \left( \frac{\mu_a^k}{M} + \nu_a^k \right), \quad 0 < \lambda < 1. \tag{1}$$

One could also directly measure the quantity $\frac{\mu_a^k}{M} + \nu_a^k$ in Eq. (1) as a whole, instead of relying on the empirical model, by counting the number of CPU cycles actually consumed by a packet while the packet is being processed. Lastly, determining the application type an arriving packet belongs to is an easy task in many commercial routers today since L3/L4 packet classification is a default functionality.

B. Proportional fairness in the dual-resource environment

Fairness and efficiency are two main objectives in resource allocation. The notions of fairness and efficiency have been extensively studied and are well understood with respect to bandwidth sharing. In particular, proportionally fair (PF) rate allocation has been considered as the bandwidth sharing strategy that can provide a good balance between fairness and efficiency [20], [21].

In our recent work [7], we extended the notion of proportional fairness to the dual-resource environment where processing and bandwidth resources are jointly constrained.
[Fig. 1. Single-CPU and single-link network. Flows $s \in S$ with rates $r_s$ (bits/sec) and processing densities $w_s$ (cycles/bit) share a CPU queue served at $C$ (cycles/sec) and a link queue served at $B$ (bits/sec).]

[Fig. 2. Fairness and efficiency in the dual-resource environment (single-CPU and single-link network): (a) and (b) respectively show the normalized bandwidth and CPU allocations enforced by PF rate allocation, and (c) and (d) respectively show the normalized bandwidth and CPU allocations enforced by TCP-like rate allocation. When $C/B < \bar{w}_a$, TCP-like rate allocation gives lower bandwidth utilization than PF rate allocation (shown in (a) and (c)) and has an unfair allocation of CPU cycles (shown in (d)). Axes: normalized throughput ($r_s/B$, $\sum r_s/B$) or normalized CPU usage ($w_s r_s/C$, $\sum w_s r_s/C$) versus $C/B$, with processing-limited, jointly-limited and bandwidth-limited regions separated at $\bar{w}_h$ and $\bar{w}_a$.]

In the following, we present this notion and its potential advantages for the dual-resource environment to define our goal for our main study of this paper on TCP/AQM.

Consider an aggregate log utility maximization problem (P) with dual constraints:

$$\text{P:} \quad \max_{r} \sum_{s \in S} \alpha_s \log r_s \tag{2}$$

subject to

$$\sum_{s \in S(k)} w_s^k r_s \le C_k, \quad \forall\, k \in K \tag{3}$$

$$\sum_{s \in S(l)} r_s \le B_l, \quad \forall\, l \in L \tag{4}$$

$$r_s \ge 0, \quad \forall\, s \in S \tag{5}$$

where $\alpha_s$ is the weight (or willingness to pay) of flow $s$. The solution $r^*$ of this problem is unique since it is a strictly concave maximization problem over a convex set [22]. Furthermore, $r^*$ is weighted proportionally fair since $\sum_{s \in S} \alpha_s \frac{r_s - r_s^*}{r_s^*} \le 0$ holds for all feasible rate vectors $r$ by the optimality condition of the problem. We define this allocation to be (dual-resource) PF rate allocation. Note that this allocation can be different from Kelly's PF allocation [20] since the set of feasible rate vectors can be different from that of Kelly's formulation due to the extra processing constraint (3).

From the duality theory [22], $r^*$ satisfies

$$r_s^* = \frac{\alpha_s}{\sum_{k \in K(s)} w_s^k \theta_k^* + \sum_{l \in L(s)} \pi_l^*}, \quad \forall\, s \in S, \tag{6}$$

where $\theta^* = [\theta_1^*, \cdots, \theta_K^*]^T$ and $\pi^* = [\pi_1^*, \cdots, \pi_L^*]^T$ are Lagrange multiplier vectors for Eqs. (3) and (4), respectively, and $\theta_k^*$ and $\pi_l^*$ can be interpreted as congestion prices of CPU $k$ and link $l$, respectively. Eq. (6) reveals an interesting property that the PF rate of each flow is inversely proportional to the aggregate congestion price of its route with the contribution of each $\theta_k^*$ being weighted by $w_s^k$. The congestion price $\theta_k^*$ or $\pi_l^*$ is positive only when the corresponding resource becomes a bottleneck, and is zero otherwise.

To illustrate the characteristics of PF rate allocation in the dual-resource environment, let us consider a limited case where there are only one CPU and one link in the network, as shown in Figure 1. For now, we drop $k$ and $l$ in the notation for simplicity. Let $\bar{w}_a$ and $\bar{w}_h$ be the weighted arithmetic and harmonic means of the processing densities of flows sharing the CPU and link, respectively. So, $\bar{w}_a = \sum_{s \in S} w_s \frac{\alpha_s}{\sum_{s \in S} \alpha_s}$ and $\bar{w}_h = \left( \sum_{s \in S} \frac{1}{w_s} \frac{\alpha_s}{\sum_{s \in S} \alpha_s} \right)^{-1}$. There exist three cases as below.

  • CPU-limited case ($\theta^* > 0$ and $\pi^* = 0$): $r_s^* = \frac{\alpha_s}{w_s \theta^*}$, $\forall s \in S$, $\sum_{s \in S} w_s r_s^* = C$ and $\sum_{s \in S} r_s^* \le B$. From these, we know that this case occurs when $\frac{C}{B} \le \bar{w}_h$ and PF rate allocation becomes $r_s^* = \frac{\alpha_s C}{w_s \sum_{s \in S} \alpha_s}$, $\forall s \in S$.

  • Bandwidth(BW)-limited case ($\theta^* = 0$ and $\pi^* > 0$): $r_s^* = \frac{\alpha_s}{\pi^*}$, $\forall s \in S$, $\sum_{s \in S} w_s r_s^* \le C$ and $\sum_{s \in S} r_s^* = B$. From these, we know that this case occurs when $\frac{C}{B} \ge \bar{w}_a$ and PF rate allocation becomes $r_s^* = \frac{\alpha_s B}{\sum_{s \in S} \alpha_s}$, $\forall s \in S$.

  • Jointly-limited case ($\theta^* > 0$ and $\pi^* > 0$): This case occurs when $\bar{w}_h < \frac{C}{B} < \bar{w}_a$. By plugging $r_s^* = \frac{\alpha_s}{w_s \theta^* + \pi^*}$, $\forall s \in S$, into $\sum_{s \in S} w_s r_s^* = C$ and $\sum_{s \in S} r_s^* = B$, we can obtain $\theta^*$, $\pi^*$ and consequently $r_s^*$, $\forall s \in S$.

We can apply other increasing and concave utility functions (including the one from TCP itself [23]) in the dual-resource problem in Eqs. (2)-(5). The reason why we give special attention to proportional fairness by choosing the log utility function is that it automatically yields weighted fair CPU sharing ($w_s r_s^* = \frac{\alpha_s C}{\sum_{s \in S} \alpha_s}$, $\forall s \in S$) if CPU is limited, and weighted fair bandwidth sharing ($r_s^* = \frac{\alpha_s B}{\sum_{s \in S} \alpha_s}$, $\forall s \in S$) if bandwidth is limited, as illustrated in the example of Figure 1. This property is obviously desirable and is a direct consequence of the particular form of rate-price relationship given in Eq. (6). Thus, this property is not achievable when other utility functions are used.

Figures 2 (a) and (b) illustrate the bandwidth and CPU allocations enforced by PF rate allocation in the single-CPU and single-link case using an example of four flows with identical weights ($\alpha_s = 1$, $\forall s$) and different processing densities $(w_1, w_2, w_3, w_4) = (1, 2, 4, 8)$ where $\bar{w}_h = 2.13$ and $\bar{w}_a = 3.75$. For comparison, we also consider a rate allocation in which flows with an identical end-to-end path get an equal share of the maximally achievable throughput of the path and call it TCP-like rate allocation. That is, if TCP flows run on the example network in Figure 1 with ordinary AQM schemes such as RED on both CPU and link queues, they would have
the same long-term throughput. Thus, in our example, TCP-like rate allocation is defined to be the maximum equal rate vector satisfying the dual constraints, which is $r_s = \frac{B}{S}$, $\forall s$, if $\frac{C}{B} \ge \bar{w}_a$, and $r_s = \frac{C}{\bar{w}_a S}$, $\forall s$, otherwise. The bandwidth and CPU allocations enforced by TCP-like rate allocation are shown in Figures 2 (c) and (d).

From Figure 2, we observe that TCP-like rate allocation yields far less aggregate throughput than PF rate allocation when $C/B < \bar{w}_a$, i.e., in both CPU-limited and jointly-limited cases. Intuitively, this is because TCP-like allocation, which finds an equal rate allocation, yields unfair sharing of CPU cycles as CPU becomes a bottleneck (see Figure 2 (d)), which causes the severe aggregate throughput drop. In contrast, PF allocation yields equal sharing of CPU cycles, i.e., $w_s r_s$ become equal for all $s \in S$, as CPU becomes a bottleneck (see Figure 2 (b)), which mitigates the aggregate throughput drop. This problem in TCP-like allocation would get more severe when the processing densities of flows have a more skewed distribution.

In summary, in a single-CPU and single-link network, PF rate allocation achieves equal bandwidth sharing when bandwidth is a bottleneck, equal CPU sharing when CPU is a bottleneck, and a good balance between equal bandwidth sharing and equal CPU sharing when bandwidth and CPU form a joint bottleneck. Moreover, in comparison to TCP-like rate allocation, such consideration of CPU fairness in PF rate allocation can increase aggregate throughput significantly when CPU forms a bottleneck either alone or jointly with bandwidth.

III. MAIN RESULT: SCALABLE TCP/AQM ALGORITHM

In this section, we present a scalable AQM scheme, called Dual-Resource Queue (DRQ), that can approximately implement the dual-resource PF rate allocation described in Section II for TCP-Reno flows. DRQ modifies RED [10] to achieve PF allocation without incurring per-flow operations (queueing or state management). DRQ does not require any change in TCP stacks.

A. DRQ objective and optimality

We describe a TCP/AQM network using the fluid model as in the literature [23]-[28]. In the fluid model, the dynamics whose timescale is shorter than several tens (or hundreds) of round-trip times (RTTs) are neglected. Instead, it is convenient to study the longer timescale dynamics, and so it is adequate for modeling the macroscopic dynamics of the long-lived TCP flows that we are concerned with.

Let $x_s(t)$ (bits/sec) be the average data rate of TCP source $s$ at time $t$ where the average is taken over the time interval $\Delta$ (seconds) and $\Delta$ is assumed to be on the order of tens (or hundreds) of RTTs, i.e., large enough to average out the additive-increase and multiplicative-decrease (AIMD) oscillation of TCP. Define the RTT $\tau_s$ of source $s$ by $\tau_s = \tau_{si} + \tau_{is}$ where $\tau_{si}$ denotes forward-path delay from source $s$ to resource $i$ and $\tau_{is}$ denotes backward-path delay from resource $i$ to source $s$.

A2: We assume that each TCP flow $s$ has a constant RTT $\tau_s$, as customary in the fluid modeling of TCP dynamics [23]-[28].

Let $y_l(t)$ be the average queue length at link $l$ at time $t$, measured in bits. Then,

$$\dot{y}_l(t) = \begin{cases} \sum_{s \in S(l)} x_s(t - \tau_{sl}) - B_l, & y_l(t) > 0 \\ \left[ \sum_{s \in S(l)} x_s(t - \tau_{sl}) - B_l \right]^+, & y_l(t) = 0 \end{cases} \tag{7}$$

where $[a]^+ = \max\{a, 0\}$. Similarly, let $z_k(t)$ be the average queue length at CPU $k$ at time $t$, measured in CPU cycles. Then,

$$\dot{z}_k(t) = \begin{cases} \sum_{s \in S(k)} w_s^k x_s(t - \tau_{sk}) - C_k, & z_k(t) > 0 \\ \left[ \sum_{s \in S(k)} w_s^k x_s(t - \tau_{sk}) - C_k \right]^+, & z_k(t) = 0. \end{cases} \tag{8}$$

Let $p_s(t)$ be the end-to-end marking (or loss) probability at time $t$ to which TCP source $s$ reacts. Then, the rate-adaptation dynamics of TCP Reno or its variants, particularly in the timescale of tens (or hundreds) of RTTs, can be readily described by [23]

$$\dot{x}_s(t) = \begin{cases} \frac{M_s (1 - p_s(t))}{N_s \tau_s^2} - \frac{2}{3} \frac{x_s^2(t)\, p_s(t)}{N_s M_s}, & x_s(t) > 0 \\ \left[ \frac{M_s (1 - p_s(t))}{N_s \tau_s^2} - \frac{2}{3} \frac{x_s^2(t)\, p_s(t)}{N_s M_s} \right]^+, & x_s(t) = 0 \end{cases} \tag{9}$$

where $M_s$ is the average packet size in bits of TCP flow $s$ and $N_s$ is the number of consecutive data packets that are acknowledged by an ACK packet in TCP flow $s$ ($N_s$ is typically 2).

In DRQ, we employ one RED queue per resource. Each RED queue computes a probability (we refer to it as the pre-marking probability) in the same way as an ordinary RED queue computes its marking probability.

That is, the RED queue at link $l$ computes a pre-marking probability $\rho_l(t)$ at time $t$ by

$$\rho_l(t) = \begin{cases} 0, & \hat{y}_l(t) \le \underline{b}_l \\ \frac{m_l}{\bar{b}_l - \underline{b}_l} \left( \hat{y}_l(t) - \underline{b}_l \right), & \underline{b}_l \le \hat{y}_l(t) \le \bar{b}_l \\ \frac{1 - m_l}{\bar{b}_l} \left( \hat{y}_l(t) - \bar{b}_l \right) + m_l, & \bar{b}_l \le \hat{y}_l(t) \le 2\bar{b}_l \\ 1, & \hat{y}_l(t) \ge 2\bar{b}_l \end{cases} \tag{10}$$

$$\dot{\hat{y}}_l(t) = \frac{\log_e (1 - \lambda_l)}{\eta_l}\, \hat{y}_l(t) - \frac{\log_e (1 - \lambda_l)}{\eta_l}\, y_l(t) \tag{11}$$

where $m_l \in (0, 1]$, $0 \le \underline{b}_l < \bar{b}_l$ and Eq. (11) is the continuous-time representation of the EWMA filter [25] used by RED, i.e.,

$$\hat{y}_l((k+1)\eta_l) = (1 - \lambda_l)\, \hat{y}_l(k\eta_l) + \lambda_l\, y_l(k\eta_l), \quad \lambda_l \in (0, 1). \tag{12}$$

Eq. (11) does not model the case where the averaging timescale of the EWMA filter is smaller than the averaging timescale $\Delta$ on which $y_l(t)$ is defined. In this case, Eq. (11) must be replaced by $\hat{y}_l(t) = y_l(t)$.

Similarly, the RED queue at CPU $k$ computes a pre-marking probability $\sigma_k(t)$ at time $t$ by

$$\sigma_k(t) = \begin{cases} 0, & \hat{v}_k(t) \le \underline{b}_k \\ \frac{m_k}{\bar{b}_k - \underline{b}_k} \left( \hat{v}_k(t) - \underline{b}_k \right), & \underline{b}_k \le \hat{v}_k(t) \le \bar{b}_k \\ \frac{1 - m_k}{\bar{b}_k} \left( \hat{v}_k(t) - \bar{b}_k \right) + m_k, & \bar{b}_k \le \hat{v}_k(t) \le 2\bar{b}_k \\ 1, & \hat{v}_k(t) \ge 2\bar{b}_k \end{cases} \tag{13}$$
$$\dot{\hat{v}}_k(t) = \frac{\log_e (1 - \lambda_k)}{\eta_k}\, \hat{v}_k(t) - \frac{\log_e (1 - \lambda_k)}{\eta_k}\, v_k(t) \tag{14}$$

where $v_k(t)$ is the translation of $z_k(t)$ in bits.

Given these pre-marking probabilities, the objective of DRQ is to mark (or discard) packets in such a way that the end-to-end marking (or loss) probability $p_s(t)$ seen by each TCP flow $s$ at time $t$ becomes

$$p_s(t) = \frac{\left( \sum_{k \in K(s)} w_s^k \sigma_k(t - \tau_{ks}) + \sum_{l \in L(s)} \rho_l(t - \tau_{ls}) \right)^2}{1 + \left( \sum_{k \in K(s)} w_s^k \sigma_k(t - \tau_{ks}) + \sum_{l \in L(s)} \rho_l(t - \tau_{ls}) \right)^2}. \tag{15}$$

The actual marking scheme that can closely approximate this objective function will be given in Section III-B.

The Reno/DRQ network model given by Eqs. (7)-(15) is called the average Reno/DRQ network, as the model describes the interaction between DRQ and Reno dynamics in long-term average rates rather than explicitly capturing instantaneous TCP rates in the AIMD form. This average network model enables us to study fixed-valued equilibrium and consequently establish, in an average sense, the equilibrium equivalence of a Reno/DRQ network and a network with the same configuration but under dual-resource PF congestion control.

Let $x = [x_1, \cdots, x_S]^T$, $\sigma = [\sigma_1, \cdots, \sigma_K]^T$, $\rho = [\rho_1, \cdots, \rho_L]^T$, $p = [p_1, \cdots, p_S]^T$, $y = [y_1, \cdots, y_L]^T$, $z = [z_1, \cdots, z_K]^T$, $v = [v_1, \cdots, v_K]^T$, $\hat{y} = [\hat{y}_1, \cdots, \hat{y}_L]^T$ and $\hat{v} = [\hat{v}_1, \cdots, \hat{v}_K]^T$.

Proposition 1: Consider an average Reno/DRQ network given by Eqs. (7)-(15) and formulate the corresponding aggregate log utility maximization problem (Problem P) as in Eqs. (2)-(5) with $\alpha_s = \frac{\sqrt{3/2}\, M_s}{\tau_s}$. If the Lagrange multiplier vectors, $\theta^*$ and $\pi^*$, of this corresponding Problem P satisfy the following conditions:

$$\text{C1}: \quad \theta_k^* < 1, \quad \forall k \in K(s),\ \forall s \in S, \tag{16}$$

$$\text{C2}: \quad \pi_l^* < 1, \quad \forall l \in L(s),\ \forall s \in S, \tag{17}$$

then the average Reno/DRQ network has a unique equilibrium point $(x^*, \sigma^*, \rho^*, p^*, y^*, z^*, v^*, \hat{y}^*, \hat{v}^*)$ and $(x^*, \sigma^*, \rho^*)$ is the primal-dual optimal solution of the corresponding Problem P. In addition, $v_k^* > \underline{b}_k$ if $\sigma_k^* > 0$ and $0 \le v_k^* \le \underline{b}_k$ otherwise, and $y_l^* > \underline{b}_l$ if $\rho_l^* > 0$ and $0 \le y_l^* \le \underline{b}_l$ otherwise, for all $k \in K$ and $l \in L$.

Proof: The proof is given in the Appendix.

Proposition 1 implies that once the Reno/DRQ network reaches its steady state (i.e., equilibrium), the average data rates of Reno sources satisfy weighted proportional fairness with weights $\alpha_s = \frac{\sqrt{3/2}\, M_s}{\tau_s}$. In addition, if a CPU $k$ is a bottleneck (i.e., $\sigma_k^* > 0$), its average equilibrium queue length $v_k^*$ stays at a constant value greater than $\underline{b}_k$, and if not, it stays at a constant value between 0 and $\underline{b}_k$. The same is true for link congestion.

The existence and uniqueness of such an equilibrium point in the Reno/DRQ network is guaranteed if conditions C1 and C2 hold in the corresponding Problem P. Otherwise, the Reno/DRQ network does not have an equilibrium point.

In the current Internet environment, however, these conditions will hardly be violated, particularly as the bandwidth-delay products of flows increase. By applying C1 and C2 to the Lagrangian optimality condition of Problem P in Eq. (6) with $\alpha_s = \frac{\sqrt{3/2}\, M_s}{\tau_s}$, we have

$$\frac{r_s^* \tau_s}{M_s} = \frac{\sqrt{3/2}}{\sum_{k \in K(s)} w_s^k \theta_k^* + \sum_{l \in L(s)} \pi_l^*} \tag{18}$$

$$> \frac{\sqrt{3/2}}{\sum_{k \in K(s)} w_s^k + |L(s)|} \tag{19}$$

where $\frac{r_s^* \tau_s}{M_s}$ is the bandwidth-delay product (or window size) of flow $s$, measured in packets. The maximum packet size in the Internet is $M_s = 1{,}536$ bytes (i.e., maximum Ethernet packet size). Flows that have the minimum processing density are IP forwarding applications with maximum packet size [17]. For instance, a measurement study in [15] showed that the per-packet processing time required for NetBSD radix-tree routing table lookup on a Pentium 167 MHz processor is 51 µs (for a faster CPU, the processing time reduces; since what matters is the number of cycles per bit, this estimate applies to other CPUs as well). Thus, the processing density for this application flow is about $w_s^k = 51\ \mu\text{s} \times 167\ \text{MHz} / (1{,}536 \times 8\ \text{bits}) \approx 0.69$ (cycles/bit). Therefore, from Eq. (19), the worst-case lower bound on the window size becomes $\frac{r_s^* \tau_s}{M_s} > 1.77$ (packets), which occurs when the flow traverses only a CPU in the path (i.e., $|K(s)| = 1$ and $|L(s)| = 0$). This shows that the conditions C1 and C2 will never be violated as long as the steady-state average TCP window size is sustainable at a value greater than or equal to 2 packets, even in the worst case.

B. DRQ implementation

In this section, we present a simple scalable packet marking (or discarding) scheme that closely approximates the DRQ objective function we laid out in Eq. (15).

A3: We assume that for all times

$$\left( \sum_{k \in K(s)} w_s^k \sigma_k(t - \tau_{ks}) + \sum_{l \in L(s)} \rho_l(t - \tau_{ls}) \right)^2 \ll 1, \quad \forall\, s \in S. \tag{20}$$

This assumption implies that $(w_s^k \sigma_k(t))^2 \ll 1$, $\forall k \in K$, $\rho_l(t)^2 \ll 1$, $\forall l \in L$, and any product of $w_s^k \sigma_k(t)$ and $\rho_l(t)$ is also much smaller than 1. Note that our analysis is based on long-term average values of $\sigma_k(t)$ and $\rho_l(t)$. The typical operating points of TCP in the Internet during steady state, where TCP shows reasonable performance, are under low end-to-end loss probabilities (less than 1%) [29]. Since the end-to-end average probabilities are low, the marking probabilities at individual links and CPUs can be much lower.

Let $R$ be the set of all the resources (including CPUs and links) in the network. Also, for each flow $s$, let $R(s) = \{1, \cdots, |R(s)|\} \subset R$ be the set of all the resources that it traverses along its path and let $i \in R(s)$ denote the $i$-th
  6. 6.
resource along its path and indicate whether it is a CPU or a link. Then, some manipulation after applying Assumption A3 to Eq. (15) gives

$$p_s(t) \approx \Big( \sum_{k \in K(s)} w_s^k \sigma_k(t - \tau_{ks}) + \sum_{l \in L(s)} \rho_l(t - \tau_{ls}) \Big)^2 = \sum_{i=1}^{|R(s)|} \delta_i(t - \tau_{is}) + \sum_{i=2}^{|R(s)|} \sum_{i'=1}^{i-1} \epsilon_i(t - \tau_{is})\, \epsilon_{i'}(t - \tau_{i's}) \tag{21}$$

where

$$\delta_i(t) = \begin{cases} (w_s^i \sigma_i(t))^2 & \text{if } i \text{ indicates CPU} \\ \rho_i(t)^2 & \text{if } i \text{ indicates link} \end{cases} \tag{22}$$

and

$$\epsilon_i(t) = \begin{cases} \sqrt{2}\, w_s^i \sigma_i(t) & \text{if } i \text{ indicates CPU} \\ \sqrt{2}\, \rho_i(t) & \text{if } i \text{ indicates link.} \end{cases} \tag{23}$$

Eq. (21) shows that each resource $i \in R(s)$ (except the first resource in $R(s)$, i.e., $i = 1$) contributes to $p_s(t)$ with two quantities, $\delta_i(t - \tau_{is})$ and $\sum_{i'=1}^{i-1} \epsilon_i(t - \tau_{is})\, \epsilon_{i'}(t - \tau_{i's})$. Moreover, resource $i$ can compute the former using its own congestion information, i.e., $\sigma_i(t)$ if it is a CPU or $\rho_i(t)$ if it is a link, whereas it cannot compute the latter without knowing the congestion information of its upstream resources on its path ($\forall i' < i$). That is, the latter requires inter-resource signaling to exchange the congestion information. For this reason, we refer to $\delta_i(t)$ as the intra-resource marking probability of resource $i$ at time $t$ and to $\epsilon_i(t)$ as the inter-resource marking probability of resource $i$ at time $t$. We solve this intra- and inter-resource marking problem using two-bit ECN flags without explicit communication between resources.

Consider the two-bit ECN field in the IP header [30]. Among the four possible values of the ECN bits, we use three values to indicate three cases: initial state (ECN=00), signaling-marked (ECN=10) and congestion-marked (ECN=11). When a packet is congestion-marked (ECN=11), the packet is either marked (if TCP supports ECN) or discarded (if not). DRQ sets the ECN bits as shown in Figure 3.

    When a packet arrives at resource i at time t:
        if (ECN != 11)
            set ECN to 11 with probability δ_i(t);
        if (ECN == 00)
            set ECN to 10 with probability ε_i(t);
        else if (ECN == 10)
            set ECN to 11 with probability ε_i(t);

Fig. 3. DRQ's ECN marking algorithm

Below, we verify that the ECN marking scheme in Figure 3 approximately implements the objective function in Eq. (21). Consider a flow $s$ with path $R(s)$. For now, we drop the time index $t$ to simplify the notation. Let $P_{00}^i$, $P_{10}^i$, $P_{11}^i$ respectively denote the probabilities that packets of flow $s$ will have ECN=00, ECN=10, ECN=11 upon departure from resource $i$ in $R(s)$. Then, the proposed ECN marking scheme can be expressed by the following recursion. For $i = 1, 2, \cdots, |R(s)|$,

$$P_{11}^i = P_{11}^{i-1} + (1 - P_{11}^{i-1})\,\delta_i + P_{10}^{i-1}(1 - \delta_i)\,\epsilon_i \tag{24}$$

$$\phantom{P_{11}^i} = 1 - (1 - \delta_i)(1 - P_{11}^{i-1} - P_{10}^{i-1}\epsilon_i), \tag{25}$$

$$P_{10}^i = P_{10}^{i-1}(1 - \delta_i)(1 - \epsilon_i) + P_{00}^{i-1}(1 - \delta_i)\,\epsilon_i, \tag{26}$$

$$P_{00}^i = P_{00}^{i-1}(1 - \delta_i)(1 - \epsilon_i) \tag{27}$$

with the initial condition that $P_{00}^0 = 1$, $P_{10}^0 = 0$, $P_{11}^0 = 0$. Evolving $i$ from 0 to $|R(s)|$, we obtain

$$P_{11}^{|R(s)|} = 1 - \prod_{i=1}^{|R(s)|} (1 - \delta_i) \Big( 1 - \sum_{i=2}^{|R(s)|} \sum_{i'=1}^{i-1} \epsilon_i \epsilon_{i'} \Big) + \Theta \tag{28}$$

where $\Theta$ is the higher-order terms (order $\geq 3$) of the $\epsilon_i$'s. By Assumption A3, we have

$$P_{11}^{|R(s)|} \approx 1 - \prod_{i=1}^{|R(s)|} (1 - \delta_i) \Big( 1 - \sum_{i=2}^{|R(s)|} \sum_{i'=1}^{i-1} \epsilon_i \epsilon_{i'} \Big) \tag{29}$$

$$\phantom{P_{11}^{|R(s)|}} \approx \sum_{i=1}^{|R(s)|} \delta_i + \sum_{i=2}^{|R(s)|} \sum_{i'=1}^{i-1} \epsilon_i \epsilon_{i'} \tag{30}$$

which shows that the proposed ECN marking scheme approximately implements the DRQ objective function in Eq. (21), since $P_{11}^{|R(s)|} = p_s$.

Disclaimer: DRQ requires alternative semantics for the ECN field in the IP header, which are different from the default semantics defined in RFC 3168 [31]. What we have shown here is that DRQ can be implemented using two-bit signaling such as ECN. The coexistence of the default semantics and the alternative semantics required by DRQ needs further study.

C. DRQ stability

In this section, we explore the stability of Reno/DRQ networks. Unfortunately, analyzing their global stability is an extremely difficult task since the dynamics involved are nonlinear and time-delayed. Here, we present a partial result concerning local stability, i.e., stability around the equilibrium point.

Define the $|R| \times |S|$ matrix $\Gamma(z)$ whose $(i, s)$ element is given by

$$\Gamma_{is}(z) = \begin{cases} w_s^i e^{-z\tau_{is}} & \text{if } s \in S(i) \text{ and } i \text{ indicates CPU} \\ e^{-z\tau_{is}} & \text{if } s \in S(i) \text{ and } i \text{ indicates link} \\ 0 & \text{otherwise.} \end{cases} \tag{31}$$

Proposition 2: An average Reno/DRQ network is locally stable if we choose the RED parameters in DRQ such that

$$\max\Big\{ \frac{m_k}{\bar{b}_k - b_k}, \frac{1 - m_k}{\bar{b}_k} \Big\} C_k \in (0, \psi), \ \forall k \in K, \quad \text{and} \quad \max\Big\{ \frac{m_l}{\bar{b}_l - b_l}, \frac{1 - m_l}{\bar{b}_l} \Big\} B_l \in (0, \psi), \ \forall l \in L,$$

and

$$\psi \leq \frac{\tfrac{3}{2}\, \phi_{\min}^2[\Gamma(0)]}{|R|\, \Lambda_{\max}^2\, \tau_{\max}^3\, w_{\max}} \tag{32}$$

where $\Lambda_{\max} = \max\Big\{ \frac{\max\{C_k\}}{\min\{w_s^k\} \min\{M_s\}}, \frac{\max\{B_l\}}{\min\{M_s\}} \Big\}$, $\tau_{\max} = \max\{\tau_s\}$, $w_{\max} = \max\{w_s^k, 1\}$ and $\phi_{\min}[\Gamma(0)]$ denotes the smallest singular value of the matrix $\Gamma(z)$ evaluated at $z = 0$.

Proof: The proof is given in the Appendix and is a straightforward application of the TCP/RED stability result in [32].
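As a numerical check on the marking scheme of Figure 3, the following minimal Python sketch iterates the recursion of Eqs. (25)-(27) and also simulates the per-packet marking rule, then compares both with the squared-sum target of Eq. (21). The function names (`drq_mark`, `p11_recursion`, `p11_simulated`), the ECN bits held as strings, and the per-resource values in `prices` are illustrative assumptions and not part of the paper; each entry of `prices` stands for $w_s^i \sigma_i$ at a CPU or $\rho_i$ at a link, with the time index dropped.

```python
import random


def drq_mark(ecn, delta, eps, rng):
    """Apply one resource's marking step (Fig. 3) to a packet's ECN bits."""
    if ecn != "11" and rng.random() < delta:
        ecn = "11"                      # intra-resource congestion mark
    if ecn == "00":
        if rng.random() < eps:
            ecn = "10"                  # first inter-resource signal
    elif ecn == "10":
        if rng.random() < eps:
            ecn = "11"                  # second signal completes a congestion mark
    return ecn


def p11_recursion(prices):
    """Iterate Eqs. (25)-(27); return P11 after the last resource on the path."""
    p00, p10, p11 = 1.0, 0.0, 0.0       # initial condition: all packets carry ECN=00
    for a in prices:
        delta, eps = a * a, 2 ** 0.5 * a    # Eqs. (22)-(23)
        # The right-hand side is evaluated with the previous resource's values.
        p00, p10, p11 = (
            p00 * (1 - delta) * (1 - eps),
            p10 * (1 - delta) * (1 - eps) + p00 * (1 - delta) * eps,
            1 - (1 - delta) * (1 - p11 - p10 * eps),
        )
    return p11


def p11_simulated(prices, packets=200_000, seed=1):
    """Estimate P11 by pushing packets through every resource on the path."""
    rng = random.Random(seed)
    marked = 0
    for _ in range(packets):
        ecn = "00"
        for a in prices:
            ecn = drq_mark(ecn, a * a, 2 ** 0.5 * a, rng)
        marked += ecn == "11"
    return marked / packets


if __name__ == "__main__":
    prices = [0.02, 0.03, 0.01]         # hypothetical values, chosen small
    print("target (sum of prices)^2:", sum(prices) ** 2)
    print("recursion P11           :", p11_recursion(prices))
    print("simulated P11           :", p11_simulated(prices))
```

The sketch deliberately uses small per-resource values, which is the regime in which the higher-order terms are dropped in Eqs. (29)-(30); with larger values the gap between the recursion and the squared-sum target should be expected to widen.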
  7. 7.
IV. PERFORMANCE

A. Simulation setup

In this section, we use simulation to verify the performance of DRQ in the dual-resource environment with TCP Reno sources. We compare the performance of DRQ with that of the two other AQM schemes that we discussed in the introduction. One scheme takes the simplest approach, where both CPU and link queues use RED; the other uses DRR (a variant of WFQ) to schedule CPU usage among competing flows according to the processing density of each flow. DRR maintains per-flow queues and equalizes CPU usage in a round-robin fashion when the processing demand is higher than the CPU capacity (i.e., CPU-limited). In some sense, these choices of AQM are two extremes: one is simple but less fair in its use of CPU, as RED is oblivious to the differing CPU demands of flows; the other is complex but fair in its use of CPU, as DRR installs equal shares of CPU among these flows. Our goal is to demonstrate through simulation that DRQ, using two FIFO queues, always offers provable fairness and efficiency, which is defined as the dual-resource PF allocation. Note that all three schemes use RED for link queues, but DRQ applies its own marking algorithm, shown in Figure 3, on top of the marking probability obtained from the underlying RED link queue. We call the scheme with DRR for CPU queues and RED for link queues DRR-RED, and the scheme with RED for both CPU queues and link queues RED-RED.

The simulation is performed in the NS-2 [33] environment. We modified NS-2 to emulate the CPU capacity by simply holding a packet for the duration of its processing time. In the simulation, TCP-NewReno sources are used at end hosts and RED queues are implemented using the default setting for the "gentle" RED mode [34] ($m_i = 0.1$, $b_i = 50$ pkts, $\bar{b}_i = 550$ pkts and $\lambda_i = 10^{-4}$; the packet size is fixed at 500 Bytes). The same RED setting is used for the link queues of DRR-RED and RED-RED, and also for both CPU and link queues of DRQ (DRQ uses a function of the marking probabilities to mark or drop packets for both queues). In our analytical model, we separate CPU and link. To simplify the simulation setup and its description, when we refer to a "link" in the simulation setup, we assume that each link $l$ consists of one CPU and one Tx link (i.e., bandwidth).

By adjusting the CPU capacity $C_l$, the link bandwidth $B_l$, and the amount of background traffic, we can control the bottleneck conditions. Our simulation topologies are chosen from a varied set of Internet topologies, from simple dumbbell topologies to more complex WAN topologies. Below we discuss these setups and simulation scenarios in detail, along with the corresponding results for the three schemes discussed above.

B. Dumbbell with long-lived TCP flows

To confirm our analysis in Section II-B, we run a single-link bottleneck case. Figure 4 shows an instance of the dumbbell topology commonly used in congestion control research. We fix the bandwidth of the bottleneck link to 40 Mbps and vary its CPU capacity from 5 Mcycles/s to 55 Mcycles/s. This variation allows the bottleneck to move from the CPU-limited region to the BW-limited region. Four classes of long-lived TCP flows are added for simulation, with processing densities of 0.25, 0.5, 1.0 and 2.0 respectively. We simulate ten TCP Reno flows for each class. All the flows have the same RTT of 40 ms.

[Figure 4: four groups of 10 TCP sources (w = 0.25, 0.50, 1.00, 2.00) connected through routers R1 and R2 over the 40 Mbps bottleneck link L1 (10 ms) to four TCP sinks; access links are 5 ms each.]
Fig. 4. Single link scenario in dumbbell topology

[Figure 5: three panels, (a) RED-RED, (b) DRR-RED, (c) DRQ. Each plots average throughput (Mbps) against CPU capacity (Mcycles/sec) for SG1 (w = 0.25), SG2 (w = 0.50), SG3 (w = 1.00) and SG4 (w = 2.00), with the processing-limited, jointly-limited and bandwidth-limited regions marked.]
Fig. 5. Average throughput of four different classes of long-lived TCP flows in the dumbbell topology. Each class has a different CPU demand per bit (w). No other background traffic is added. The dotted lines indicate the ideal PF rate allocation for each class. In the figure, we find that DRQ and DRR-RED show good fairness under the CPU-limited region while RED-RED does not. Vertical bars indicate 95% confidence intervals.

In presenting our results, we take the average throughput of TCP flows that belong to the same class. Figure 5 plots the average throughput of each class. To see whether DRQ achieves PF allocation, we also plot the ideal proportional fair rate for each class (shown as a dotted line). As shown in Figure 5(a), where we use typical RED schemes at both queues, all TCP flows achieve the same throughput regardless
  8. 8.
of the CPU capacity of the link and their processing densities. Figures 5(b) and (c) show that the average throughput curves of DRR-RED and DRQ follow the ideal PF rates reasonably well. When the CPU is the only bottleneck resource, the PF rate of each flow must be inversely proportional to its processing density $w_s$ in order to share the CPU equally. Under the BW-limited region, the proportionally fair rate of each flow is identical to the equal share of the bandwidth. Under the jointly-limited region, flows maintain the PF rates while fully utilizing both resources. Although DRQ does not employ the per-flow queue structure that DRR does, its performance is comparable to that of DRR-RED.

Figure 6 shows the aggregate throughput achieved by each scheme. It shows that RED-RED has much lower bandwidth utilization than the two other schemes. This is because, as discussed in Section II-B, when the CPU is a bottleneck resource, the equilibrium operating points of TCP flows over the RED CPU queue, which achieve equal bandwidth usage while keeping the total CPU usage below the CPU capacity, are much lower than those of the other schemes, which need to ensure equal sharing of the CPU (not the bandwidth) under the CPU-limited region.

[Figure 6: bandwidth utilization against CPU capacity (Mcycles/sec) for DRQ, DRR-RED and RED-RED.]
Fig. 6. Comparison of bandwidth utilization in the dumbbell single bottleneck topology. RED-RED achieves far less bandwidth utilization than DRR-RED and DRQ when CPU becomes a bottleneck.

C. Impact of flows with high processing demands

In the introduction, we indicated that RED-RED can cause extreme unfairness in the use of resources. To show this by experiment, we construct a simulation run where we fix the CPU capacity to 40 Mcycles/s and add an increasing number of flows with a high CPU demand ($w_s = 10$) in the same setup as the dumbbell single bottleneck environment in Section IV-B. We call these flows high processing flows. From Figure 5, at 40 Mcycles/s, when no high processing flows are added, the CPU is not a bottleneck. But as the number of high processing flows increases, the network moves into the CPU-limited region. Figure 7 shows the results of this simulation run. In Figure 7(a), as we increase the number of high processing flows, the aggregate CPU share of high processing flows under RED-RED deviates significantly from the equal CPU share; with a larger number of high processing flows (e.g., 10 flows), these flows dominate the CPU usage over the other, lower processing density flows, driving them to starvation. In contrast, DRQ and DRR approximately implement the equal CPU sharing policy. Even though the number of high processing flows increases, the bandwidth remains a bottleneck resource as before, so the link is in the jointly-limited region, which is why the CPU share of high processing flows goes beyond 20%.

[Figure 7: two panels, (a) CPU sharing, plotting the normalized CPU sharing of high processing flows, and (b) Total throughput (Mbps), each against the number of high processing flows (1 to 10) for DRQ, DRR-RED and RED-RED.]
Fig. 7. Impact of high processing flows. As the number of high processing flows increases, the network becomes more CPU-bound. Under RED-RED, these flows can dominate the use of CPU, reaching about 80% CPU usage with only 10 flows, starving 40 competing, but low processing flows.

D. Dumbbell with background Internet traffic

No Internet links are without cross traffic. In order to emulate more realistic Internet environments, we add cross traffic modelled on various observations of RTT distribution [35], flow sizes [36] and flow arrival [37]. As modelling Internet traffic is in itself a topic of research, we do not dwell on which model is more realistic. In this paper, we present one model that contains the statistical characteristics that are commonly assumed or confirmed by researchers. These characteristics include that the distribution of flow sizes exhibits long-range dependency [36], [38], that the RTTs of flows are roughly exponentially distributed [39] and that flow inter-arrival times are exponentially distributed [37]. Following these characteristics, our cross traffic consists of a number of short and medium-lived TCP flows that follow a Poisson arrival process and send a random number of packets drawn from a hybrid distribution of Lognormal (body) and Pareto (tail) distributions with a cutoff of 133 KB (55% of packets are from flows larger than the cutoff size). We set the parameters of flow sizes identical to those from the Internet traffic characteristics in [36] (Lognormal: $\mu = 9.357$, $\sigma = 1.318$; Pareto: $\alpha = 1.1$), so that a continuous distribution of flow sizes is included in the background traffic. Furthermore, we also generate reverse-path traffic consisting of 10 long-lived TCP flows and a number of short and medium-lived flows to increase realism and also to reduce the phase effect. The RTT of each cross traffic flow is randomly selected from a range of 20 to 60 ms. We fix the bottleneck bandwidth and CPU capacities to 40 Mbps and 40 Mcycles/s, respectively, and generate the same number