Receiving buffer adaptation for high speed data transfer
 


JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2012

Receiving Buffer Adaptation for High-speed Data Transfer

Hao Liu, Yaoxue Zhang, Yuezhi Zhou, Member, IEEE, Xiaoming Fu, Senior Member, IEEE, and Laurence T. Yang, Member, IEEE

Abstract: New applications based on cloud computing, such as data synchronization for large chain department stores and bank transaction records, require very high-speed data transport. Although a number of high-bandwidth networks have been built, existing transport protocols or their variants over such networks cannot fully exploit the network bandwidth. Our experiments show that the fixed-size application level buffer employed at the receiver side is a major cause of this deficiency. A buffer that is either too small or too large impairs the transfer performance. Due to the varied nature of network conditions and of the real-time packet processing (i.e., consuming) speed at the receiver, it is important to ensure that the buffer size is dynamically adjusted according to the perceived execution situation during runtime. In this paper, we propose Rada, a dynamic receiving buffer adaptation scheme for high-speed data transfer. Rada employs an Exponential Moving Average aided scheme to quantify the data arrival rate and consumption rate in the buffer. Based on these two rates, we develop a Linear Aggressive Increase Conservative Decrease scheme to adjust the buffer size dynamically. Moreover, a Weighted Mean Function is employed to make the adjustment adaptive to the available memory in the receiver. Theoretical analysis is provided to demonstrate the rationale and parameter bounds of Rada. The performance of Rada is also theoretically compared with potential alternatives. We implement Rada on a Linux platform and extensively evaluate its performance in a variety of scenarios.
Experimental results conform to the theoretical results, and show that Rada outperforms the static buffer scheme in terms of throughput, memory footprint, and fairness.

Index Terms: cloud computing, high-speed data transfer, buffer adaptation, rate detection.

H. Liu, Y. Zhang, and Y. Zhou are with the Department of Computer Science and Technology, Tsinghua University, China. E-mail: liuhao.buaa@gmail.com, zyx@moe.edu.cn, zhouyz@mail.tsinghua.edu.cn.
X. Fu is with the Institute of Computer Science, Georg-August-University of Goettingen, Germany. E-mail: fu@cs.uni-goettingen.de.
L. T. Yang is with the Department of Computer Science, St. Francis Xavier University, Canada. E-mail: ltyang@stfx.ca.

Digital Object Identifier 10.1109/TC.2012.109. 0018-9340/12/$31.00 © 2012 IEEE. IEEE TRANSACTIONS ON COMPUTERS. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

1 INTRODUCTION

Cloud computing [13] has been a hot research topic in recent years and is expected to have a higher market value in the near future. As cloud sites are often geographically distributed, they need very high-speed data communication for tasks such as data synchronization and data mining [10]. For example, some research testbeds such as Open Cloud Testbed [15] and UltraScience Net (USN) [16] need a transfer speed of up to multiple Gbps to meet the demands of large-scale science applications. Enterprise cloud applications, such as data synchronization for large chain department stores, also demand high-speed data transfer [17].

Although a number of dedicated high-bandwidth networks have been constructed, including JGN2plus [18], UCLP [19] and DRAGON [20], Transmission Control Protocol (TCP) fails to perform well in these high-speed networks [5]. Thus, TCP variants [21], [22], [23] for high-speed networks have been proposed to improve the transfer throughput over these networks. However, most TCP variants run at the transport layer and need kernel modification for implementation, so that large-scale deployment is difficult. To achieve high-speed data transfer as well as easy deployment, User Datagram Protocol (UDP) based high-speed protocols [5], [24] running at the application level have recently been proposed and deployed. For example, PA-UDP [5] is a recently proposed UDP based protocol. It runs at the application level and can be easily deployed to facilitate bulk data transfers in high-bandwidth networks such as USN [16], JGN2plus [18], UCLP [19] and DRAGON [20].

However, these UDP based high-speed protocols still cannot fully utilize these high-bandwidth networks. In these protocols, the receiver employs a fixed-size application level buffer to hold the data received before delivering it to upper layer protocols. As the buffer size is fixed, the buffer could be exhausted in protocols without a precise flow control mechanism, such as [24]. In this case, the newly arrived packets would be discarded and the throughput would decrease dramatically. Protocols with precise flow control, such as [5], ensure that the buffer would not be exhausted, but they have to employ a conservative mechanism to send data at a speed below a certain threshold. In both cases, the throughput would be limited by the fixed buffer size, which in this paper is referred to as the buffer bottleneck.

Note that the buffer bottleneck that we discuss in this paper concerns the application level receiving buffer at the receiver side of UDP based high-speed protocols, since 1) there is seldom concern about the sending buffer in UDP based protocols [5], [38]; and 2) the OS can manage the kernel UDP buffer on demand by simply monitoring the buffer occupancy, while for the application level receiving buffer in recent UDP based high-speed protocols [5],
buffer occupancy is not an indicator of buffer demand because these protocols' adaptive sending rate ensures that the receiving buffer will not be exhausted.

A common practice used to avoid the buffer bottleneck is to manually set the receiving buffer size to a larger value. However, due to the difficulty of estimating the actual buffer size needed, a manually chosen size tends to be either too small to maximize the transfer throughput due to the buffer bottleneck, or too large to fully utilize the large amounts of memory. Note that the servers in the cloud are often responsible for multiple projects or multiple tasks [15], [16], so that a waste of memory in one application could significantly impair the performance of other applications. Even if the receiver is used only for one task, e.g., high-speed data transfer, the waste of memory in one transfer can still impair the throughput of other transfers. Moreover, a larger buffer usually requires more complicated data structures, resulting in higher overhead in buffer management. In a situation where the memory at the receiver side becomes heavily loaded, a larger buffer can even result in memory swapping and a serious performance drop (this is further demonstrated in Section 5.3). Therefore, an optimal or near optimal buffer size is needed to maximize the throughput while minimizing the memory footprint. We define the optimal buffer size $l_{optimal}$ as

$$l_{optimal} = \min\{\, l : l \in L_{max\ throughput} \,\},$$

where $L_{max\ throughput}$ means the set of static buffer sizes that can achieve the maximum transfer throughput.

We conducted several experiments demonstrating that the optimal receiving buffer size varies during runtime, depending on factors including the transfer data size, the available memory in the receiver, and the number of simultaneous transfers [footnote 1].
A straightforward approach is thus to adapt the receiving buffer size dynamically with the change in such parameters.

Footnote 1: The corresponding results are demonstrated in Section 5.

There are existing works on adapting the TCP sending buffer based on network feedback in terms of a change in packet loss [22], [23], [25], queuing delay [21] and bandwidth-delay product (BDP) [26], [28]. Inspired by these TCP sending buffer adaptation schemes, Gulati et al. proposed a dynamic adaptation algorithm [8], [29] for the sending buffer of I/O requests based on the estimated data access latency. Both types of buffer adaptation algorithms aim at adapting the sending buffer, and use perceived performance metrics such as packet loss, queuing delay and BDP as feedback to adjust the sending buffer size during runtime. However, similar performance metrics are not available at the receiver side. Thus these approaches cannot be applied to adapting the receiving buffer.

Incast, a buffer issue of switches in data center networks, has attracted great interest recently. It refers to the phenomenon in which one client simultaneously requests data from multiple servers, so that a burst of TCP traffic overwhelms the switch buffer and the TCP throughput decreases significantly due to packet loss and TCP timeout. To address this fixed-size hardware buffer issue inside a data center, a number of approaches [1], [2], [3], [4], [6], [11], [14], [27] have been proposed to control the buffer occupation to avoid buffer overflow as well as maximize throughput. Note that the buffer in Incast is the hardware buffer of switches inside a data center. However, the buffer bottleneck we address in this paper refers to the receiving buffer for high-speed data transfer between two servers from two data centers. This buffer is a software buffer allocated from the memory at the receiver machine. Therefore, we could reallocate/deallocate the memory to dynamically adjust the buffer size.
With a dynamic buffer size adjustment, the throughput would not be limited by a fixed buffer size due to congestion control.

This paper describes Rada, a dynamic buffer adaptation approach to address the buffer bottleneck problem:

1) Although the optimal buffer size relates to a number of factors (including the CPU and disk state, the available bandwidth, the number of simultaneous transfers, the available memory, etc.), we observe that all these factors except the available memory could be represented by the data arrival rate and consumption rate in the buffer. Rada employs a rate detection based approach to circumvent the complexity of taking all these factors into account, and it merges the factor of available memory into the adaptation later. Rada decides to increase/decrease the buffer size when the data arrival rate is constantly faster/slower than the data consumption rate. An Exponential Moving Average (EMA) [35] aided scheme is proposed to quantify that the data arrival rate is constantly faster/slower than the data consumption rate.

2) The buffer adaptation motivation is to some extent opposite to that of TCP congestion control. This motivates Rada to employ a Linear Aggressive Increase Conservative Decrease (LAICD) scheme to control the adaptation extent [footnote 2] in each buffer increase/decrease operation. Theoretical analysis is provided to show the rationale and parameter bounds of LAICD. We also theoretically compare the performance of LAICD and potential alternatives (e.g., Multiplicative Increase Additive Decrease (MIAD)). These results are further confirmed by the experimental evaluation.

3) To merge the factor of available memory into the adaptation, we introduce a Weighted Mean Function (WMF) [37] with time-varying weights to adjust the adaptation extent. The time-varying weights are based on the tradeoff between the transfer demand and the memory available during runtime.
Theoretical analysis shows that this scheme achieves the expected mergence effect.

Footnote 2: Adaptation extent means how much the buffer size is increased or decreased in each adaptation operation.

We implement the Rada algorithm in Linux and evaluate its effectiveness in various scenarios with different transfer size, different available memory, and different
number of simultaneous transfers. Our results indicate that Rada outperforms the static buffer scheme in terms of throughput, memory footprint, and fairness in all the scenarios that we have explored.

Note that this paper is an extended version of our previous paper presented at ICCCN 2010 [33], which presents an overview of our dynamic buffer adaptation idea and a brief evaluation with a high-speed protocol deployed in our testbed. There are significant improvements beyond our previous work in this paper: 1) a LAICD mechanism is proposed to improve the buffer adaptation algorithm (Section 3.2); 2) theoretical analysis is provided to show the rationale of our scheme (Theorems 1, 2, 3); 3) more extensive performance evaluation is conducted in various scenarios with a recently proposed UDP based high-speed protocol [5] (Sections 5.2, 5.3, 5.4); and 4) both theoretical analysis and experimental results about parameter tuning in our scheme are presented (Theorem 4, Section 5.5).

The rest of the paper is organized as follows. Section 2 outlines the basic idea of Rada. Section 3 describes the details of the Rada algorithm, including the EMA aided buffer adaptation decision, the LAICD scheme for controlling the adaptation extent, and the memory adaptive buffer adaptation with WMF. Section 4 presents the design of Rada and Section 5 presents the evaluation results of Rada in various scenarios. Section 6 reviews related work, and Section 7 concludes this paper.

2 BASIC IDEA OF RADA

As described in the previous section, the buffer size must be adapted dynamically according to condition variations. However, dynamic buffer adaptation is nontrivial, given that there are so many different conditions. For example, the CPU or disk could alternate between busy and idle states; the memory usage could range from several megabytes to multiple gigabytes; the available bandwidth could vary from several hundred Mbps to tens of Gbps; and the number of simultaneous transfers could range from one to several. Thus, it is difficult to adapt the buffer size for a given specific condition, as so many different conditions must be considered. Furthermore, implementing an algorithm that can cover all possible conditions is also difficult.

Certainly, some obvious heuristic schemes could be adopted: when the CPU or disk becomes busy, increase the buffer; when the memory becomes heavily loaded, decrease the buffer; when the available bandwidth decreases, decrease the buffer; and when the number of simultaneous transfers (transfer number hereinafter) decreases, increase the buffer. However, does the change of these conditions truly require a buffer adaptation? For example, when the memory becomes more heavily loaded while there is no sign that other applications or transfers need to consume a large amount of memory in a short period of time, there is in fact no need to decrease the buffer. Moreover, it is more difficult to deal with situations where several parameters change simultaneously.
For example, there may be a change (decrease) in memory utilization (which requires increasing the buffer) while at the same time there is another change (increase) in transfer number (which requires decreasing the buffer). Therefore, it is important to design an algorithm that can adapt to changes in all types of conditions.

Although it is challenging to design a buffer adaptation approach that can adjust the buffer based on all possible condition changes, we observe that all of these condition changes, except the variation of memory utilization, ultimately lead to the variation of the following two rates: the data arrival rate, $v_{recv}$, and the data consumption rate, $v_{srv}$, in the buffer. When the disk becomes busy, $v_{srv}$ decreases; when the available link bandwidth increases, $v_{recv}$ increases; other condition changes, except the variation of available memory, can also be transformed to the variation of $v_{recv}$ or $v_{srv}$.

This motivates the following adaptation strategy that adapts the buffer according to the variation of these two rates. First, by periodically detecting the values of $v_{recv}$ and $v_{srv}$, the buffer adaptation decision can be made based on these two values. If $v_{recv}$ is constantly larger than $v_{srv}$, the buffer must be increased; if $v_{recv}$ is constantly smaller than $v_{srv}$, the buffer must be decreased; otherwise, the buffer does not need to be adapted. Second, as the factor of available memory cannot be transformed to the variation of $v_{recv}$ and $v_{srv}$, the adaptation extent in each buffer increase/decrease operation is also adapted according to the available memory.

This rate detection based scheme circumvents the complexity involved in covering all possible condition changes and instead focuses on two fundamental factors: $v_{recv}$ and $v_{srv}$. In the next section, we describe this rate detection based scheme and some enhancements to it in detail.

3 DYNAMIC BUFFER ADAPTATION

Rada employs a rate detection based buffer adaptation approach. Three sub-problems lie in such a rate detection based scheme.
First, it is difficult to decide when to adapt the buffer. The basic idea of Rada is that when $v_{recv}$ is constantly larger/smaller than $v_{srv}$, Rada increases/decreases the buffer. However, it is not clear how to quantify that $v_{recv}$ is constantly larger/smaller than $v_{srv}$. Second, there is no existing research about how much the adaptation extent should be or what kind of adaptation pattern (e.g., linear growth, exponential growth) should be employed. Finally, as the factor of available memory cannot be transformed to the variations of $v_{recv}$ and $v_{srv}$, the adaptation extent must be adaptive to the available memory in the receiver.

In this section, we describe the solutions to these three problems respectively. Section 3.1 describes how to quantify that $v_{recv}$ is constantly larger/smaller than $v_{srv}$ based on a periodical detection of $v_{recv}$ and $v_{srv}$,
and an Exponential Moving Average (EMA) [35] transformation of these two values. Section 3.2 proposes a Linear Aggressive Increase Conservative Decrease (LAICD) mechanism to control the adaptation extent, motivated by TCP's congestion control algorithms but with a quite different adaptation pattern. Section 3.3 introduces a Weighted Mean Function (WMF) [37] with time-varying weights to adjust the adaptation extent. The time-varying weights are based on the tradeoff between the transfer demand and the available memory during runtime.

3.1 Buffer Adaptation Decision

Rada makes the buffer adaptation decision (decides when to adapt the buffer) based on whether $v_{recv}$ is constantly larger or smaller than $v_{srv}$ during a certain period of time. Rada determines this relationship by periodically detecting these two values. Let $v_b$ represent the difference between $v_{recv}$ and $v_{srv}$ ($v_b = v_{recv} - v_{srv}$); then the buffer adaptation decision actually depends on the sign of $v_b$ in consecutive epochs.

We conducted several experiments with a UDP based high-speed protocol [24], as well as another UDP based high-speed protocol that we developed. The results show that there are three categories of scenarios for $v_b$ from the aspect of buffer adaptation for these protocols:

Scenario 1: $v_b$ constantly alternates between nearly the same positive and negative values. The buffer requires no adaptation.

Scenario 2: $v_b$ is constantly positive or negative during several consecutive epochs. The buffer size should be increased or decreased respectively.

Scenario 3: Scenarios other than the above two. For example, $v_b$ changes abruptly in certain discrete epochs or varies significantly between positive and negative values. It is not easy to decide how to resize the buffer.

Figure 1 shows the three scenarios respectively.
Figure 1a shows that $v_b$ constantly alternates between nearly the same positive and negative values (Scenario 1), in which case the buffer demand remains relatively unchanged and the buffer requires no adaptation. Figure 1b shows that $v_b$ is constantly positive during several consecutive epochs. Thus, the cumulative effects of these epochs lead to an increase in buffer demand, and the buffer must be increased (Scenario 2). Figure 1c shows that $v_b$ abruptly presents greater values in several epochs, e.g., the 1st and 8th epochs. In this case, it is difficult to determine how to resize the buffer (Scenario 3). Therefore, we need to transform Scenario 3 to either Scenario 1 or Scenario 2 to expose the characteristics of $v_b$ in order to directly apply the buffer adaptation principle. Based on the observation that the large value in certain epochs of Scenario 3 could be reallocated to nearby epochs, we could exploit Moving Average methods, which are useful in smoothing out short-term fluctuations and highlighting long-term trends.

Simple Moving Average (SMA) [39] and Exponential Moving Average (EMA) [35] are widely used Moving Average methods. SMA treats all activities without discrimination, while EMA emphasizes recent activities more than old activities. Thus, EMA performs better if a quick response to recent activities is required. For this reason, EMA has been employed in [8], [21], [36].

To make the buffer adaptation more responsive, we also employ the EMA scheme to make a transformation of $v_b$. Let $v_{b,n}$ represent the value of $v_b$ in the nth epoch. The EMA transformation of $v_b$ is shown in Equation (1). The new value $\bar{v}_b$ is determined by both the detected rate difference in the current epoch and its value in the last epoch, of which the weight given to the current epoch depends on the parameter $\omega$ ($\omega \in [0, 1]$).

$$\bar{v}_{b,n} = \omega\, v_{b,n} + (1 - \omega)\, \bar{v}_{b,n-1} \quad (1)$$

Figure 2 shows the results of the EMA transformation of Figure 1 ($\omega = 0.5$).
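As an illustration of Equation (1), the following minimal Python sketch applies the EMA transformation to a sequence of detected rate differences. The function name `ema_transform`, the sample values, and the choice to seed the first smoothed value with the first raw sample are assumptions made for this example; the text does not state how the recursion is initialized.

```python
def ema_transform(vb_samples, omega=0.5):
    """Smooth raw rate differences v_b with Equation (1):
    smoothed[n] = omega * vb[n] + (1 - omega) * smoothed[n - 1].
    """
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega must lie in [0, 1]")
    smoothed = []
    prev = None
    for vb in vb_samples:
        # Assumption: seed the recursion with the first raw sample.
        prev = vb if prev is None else omega * vb + (1.0 - omega) * prev
        smoothed.append(prev)
    return smoothed


# Hypothetical rate differences: an abrupt spike followed by zeros.
raw = [4.0, 0.0, 0.0, 0.0]
print(ema_transform(raw, omega=0.5))
```

With $\omega = 0.5$ the spike in the first epoch is carried into the following epochs with a weight that halves each time, which is the "reallocation to nearby epochs" effect described above. A larger $\omega$ makes the smoothed value follow the current epoch more closely.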
After the transformation, Figure 2a and Figure 1a, and Figure 2b and Figure 1b, still belong to the same scenarios respectively. However, Figure 2c indicates that Figure 1c is transformed from Scenario 3 to Scenario 2. As a result, the buffer adaptation principle can be easily applied. Although Figure 1c can be transformed from Scenario 3 to Scenario 2, this does not mean that all cases of Scenario 3 can be transformed to Scenario 2. In fact, Scenario 3 could be transformed to any of these three scenarios, so the buffer adaptation decision of Scenario 3 depends on the transformation result.

Let k denote the number of epochs on which the buffer adaptation decision is based, and let $\bar{v}_{b,i}$ represent $\bar{v}_b$ in the ith epoch. Let f denote the buffer adaptation decision; the decisions to increase, decrease and make no adaptation are represented by f = 1, f = −1 and f = 0, respectively. Then, the decision of buffer adaptation can be expressed as follows:

$$f = \begin{cases} 1 & \text{for } \forall i,\ \bar{v}_{b,i} > 0,\ i \in [n-k+1, n] \\ -1 & \text{for } \forall i,\ \bar{v}_{b,i} < 0,\ i \in [n-k+1, n] \\ 0 & \text{otherwise} \end{cases} \quad (2)$$

As a result, Rada makes the buffer adaptation decision based on the values of $\bar{v}_b$ in the last k consecutive epochs. To keep a high buffer utilization, Rada increases the buffer only when f = 1 and the buffer utilization is above a predefined threshold.

3.2 Linear Aggressive Increase Conservative Decrease

Once Rada decides to adapt the buffer, the next problem is how to control the adaptation extent in each increase/decrease operation. Our initial idea regarding this issue is inspired by the congestion control algorithm of TCP. TCP tries to additively increase the congestion window size to maximize the throughput, until packet loss is detected. Furthermore, when packet loss is detected, the congestion window is halved to avoid
further packet loss. In summary, TCP employs a conservative additive increase and an aggressive multiplicative decrease mechanism. Interestingly, we observe that our motivation is opposite to that of TCP. In our scenario, the buffer size is expected to increase aggressively, so that the transfer throughput cannot be limited by the buffer size; it is also expected to decrease conservatively in case of a large rate span ($v_b$) and a sharp increase in buffer demand afterwards. The opposite of TCP's Additive Increase Multiplicative Decrease (AIMD) mechanism is Multiplicative Increase Additive Decrease (MIAD). However, both theoretical analysis and experimental results show that a MIAD scheme is not a good idea in our scenario.

[Fig. 1: Three categories of scenarios for the variation of $v_b$. Axes: Time (x100 ms) versus $v_b$ (Mbps). (a) Scenario 1. The buffer requires no adaptation. (b) Scenario 2. The buffer must be increased. (c) Scenario 3. It is difficult to determine how to resize the buffer.]

[Fig. 2: $v_b$ variation corresponding to Figure 1 after EMA transformation. Axes: Time (x100 ms) versus $v_b$ (Mbps). (a) $v_b$ variation is still Scenario 1. The buffer requires no adaptation. (b) $v_b$ variation is still Scenario 2. The buffer must be increased. (c) $v_b$ variation is transformed from Scenario 3 to Scenario 2. The buffer must be decreased.]

Definition 1. The overhead of an operation is the CPU cycles needed to finish the operation.

In the MIAD case, the following theorem characterizes the overhead of buffer increase operations.

Theorem 1.
In MIAD, the overhead of the (n + 1)th buffer increase operation $C^{MIAD}_{n+1}$ grows exponentially with n.

Proof: Let $L_n$ denote the buffer size in the nth epoch. In a multiplicative increase case, the buffer size after adaptation, i.e., the buffer size in the (n + 1)th epoch, can be calculated by $L_{n+1} = \beta L_n$. Here $\beta$ denotes the multiplicative coefficient, and its physical meaning is increasing the buffer size to $\beta$ times the current buffer size. Thus, considering a simple case when the buffer size increases continuously, we have

$$L_{n+1} = \beta L_n = \beta^2 L_{n-1} = \dots = \beta^{n+1} L_0, \quad (3)$$

where $L_0$ denotes the initial buffer size. The main overhead to increase the buffer comes from memory allocation. If c denotes the allocation overhead for each memory unit, the overhead for the (n + 1)th increase operation $C^{MIAD}_{n+1}$ is

$$C^{MIAD}_{n+1} = c(L_{n+1} - L_n) = c(\beta^{n+1} L_0 - \beta^n L_0) = c L_0 \beta^n (\beta - 1). \quad (4)$$

In this case, the overhead of the buffer increase operation grows exponentially.

In a more general case when the buffer size also decreases before it becomes $L_{n+1}$, the overhead $C^{MIAD}_{n+1}$ is smaller, but the property of exponential growth remains. We only present the proof for the above simple case here, and the proof for the general case is presented in Appendix A.

Using our experimental servers [footnote 3], the time overhead to allocate the initial buffer, i.e., $cL_0$, is approximately 25 ms. If $\beta = 2$, the time overhead could exceed 1 s with only 6 adaptation operations. Such a large overhead would impair the transfer performance significantly. Our experiments show that the MIAD scheme suffers from a sharp throughput drop in each one of the buffer increase operations, which confirms the above analysis.

We therefore employ a Linear Aggressive Increase Conservative Decrease (LAICD) scheme in Rada. It avoids the exponential growth of buffer increase overhead, but retains the property of aggressive increase and conservative decrease. The aggressive and conservative behaviors here are achieved by giving different adaptation extents to the increase and decrease operations.
The increase operation is given a higher extent than the decrease operation, although both are linear. Equation (5) shows this idea.

$$L_{n+1} = \begin{cases} L_n + \beta_1 |\bar{v}_{b,n}| & \text{if } f = 1 \\ L_n & \text{if } f = 0 \\ L_n - \beta_2 |\bar{v}_{b,n}| & \text{if } f = -1 \end{cases} \quad (5)$$

where $\bar{v}_{b,n}$ denotes the value of $\bar{v}_b$ in the nth epoch, as in Equation (2). The introduction of $\bar{v}_b$ to the adaptation is based on the observation that $v_b$ determines the variation of the data in the buffer, so that it is reasonable to adjust the buffer size according to $\bar{v}_b$. With a similar analysis to the MIAD case, we confirm that the overhead of the buffer increase operation now grows linearly with $|\bar{v}_b|$, avoiding the exponential growth in MIAD, as described in Theorem 2.

Footnote 3: The detailed information for the servers is described in Section 5.
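To make the contrast between the two overhead models concrete, the sketch below evaluates Equation (4) for MIAD with the figures quoted in the text ($cL_0 \approx 25$ ms, $\beta = 2$) and the linear per-operation overhead $c\beta_1|\bar{v}_{b,n}|$ that the LAICD update implies. The function names are hypothetical, and reading "exceeds 1 s with only 6 adaptation operations" as a cumulative total is an interpretation made for this example, not something the text states explicitly.

```python
def miad_increase_overhead(n, c_l0, beta):
    """Equation (4): overhead of the (n+1)th MIAD increase,
    c * L0 * beta**n * (beta - 1), with c_l0 standing for c * L0."""
    return c_l0 * beta ** n * (beta - 1)


def laicd_increase_overhead(c, beta1, vb_bar):
    """Linear overhead of one LAICD increase: c * beta1 * |vb_bar|."""
    return c * beta1 * abs(vb_bar)


C_L0_MS = 25.0  # time to allocate the initial buffer, as quoted in the text
BETA = 2.0      # multiplicative coefficient used in the text's example

total_ms = 0.0
for n in range(6):
    step_ms = miad_increase_overhead(n, C_L0_MS, BETA)
    total_ms += step_ms
    print(f"MIAD increase {n + 1}: {step_ms:.0f} ms, cumulative {total_ms:.0f} ms")
```

Under MIAD each increase costs twice as much as the previous one, so the cost depends on how many increases have already happened. Under LAICD the cost of an increase depends only on the current smoothed rate difference, independent of the operation count.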
Theorem 2. In Rada, the overhead of the (n + 1)th buffer increase operation is $C^{Rada}_{n+1} = c \beta_1 |\bar{v}_{b,n}|$.

Proof: The proof is similar to that of Theorem 1.

Therefore, the LAICD scheme employed in Rada avoids the exponential growth of adaptation overhead in MIAD. In Equation (5), $\beta_1$ is given a higher value than $\beta_2$ so that the increase operation is relatively aggressive and the decrease operation is relatively conservative. Next, we perform a theoretical analysis to confirm the benefit of aggressive increase and conservative decrease.

In terms of buffer allocation and release overhead, the rationale of aggressive increase and conservative decrease is presented in Theorem 3.

Theorem 3. The overhead of increasing the buffer size by a constant value $\delta L$ is monotonically decreasing with $\gamma$, where $\gamma$ denotes $\beta_1 / \beta_2$.

Proof: See Appendix B.

Therefore, a large $\gamma$ is needed to limit the buffer adaptation overhead. This conclusion confirms the rationale of the LAICD scheme in Rada. In Rada, $\beta_1$ is given a higher value than $\beta_2$ to obtain the effect of aggressive increase and conservative decrease. This means $\gamma$ is at least larger than 1. With a larger $\gamma$, Rada could perform better with less overhead. However, $\beta_1$ cannot be too large because Rada has to limit the overhead of each increase operation below a certain threshold; otherwise, a similar performance drop would arise, as occurs in the MIAD case. Let $\delta L_{threshold}$ denote the threshold that the increased size in each buffer increase operation needs to be below. The upper and lower bounds for $\beta_1$ to avoid a performance drop are defined by Theorem 4, which has been confirmed by the experimental results shown in Section 5.5.

Theorem 4.
To guarantee a stable performance [footnote 4], $\beta_1$ needs to satisfy

$$k\,\delta t \le \beta_1 < \delta L_{threshold} / |\bar{v}_b|, \quad (6)$$

where $\delta t$ denotes the epoch of rate detection.

Proof: See Appendix C.

Footnote 4: Stable performance means that there is no risk of throughput drop due to buffer bottleneck or large buffer adaptation overhead.

One issue raised by the LAICD mechanism is that the buffer size will stay at a relatively large value during the long sleep period (if there is one) between two transfers. Because this mechanism is not a convergent algorithm, the buffer size will not converge to the initial buffer size automatically. Therefore, Rada decreases the buffer gradually when both $v_{recv}$ and $v_{srv}$ are close to zero for a long period of time.

3.3 Memory Adaptive Buffer Adaptation

In this section, we discuss the memory issue. An adaptation scheme that does not consider the available memory could impair the performance of the transfer as well as the whole system. Consider a situation in which the values of $\beta_1 |\bar{v}_{b,n}|$ and $\beta_2 |\bar{v}_{b,n}|$ (according to Equation (5), these two values determine the adaptation extent in each buffer increase/decrease operation) are 10 MB at a certain moment. If at that moment the available memory is 1000 MB, increasing/decreasing the buffer by 10 MB is acceptable. However, if at that moment the available memory is only 50 MB, increasing/decreasing the buffer by 10 MB would be excessive. Therefore, the adaptation extent must be adaptive to the available memory. Figure 3 shows the expected curve for the adaptation extent with the available memory. A higher memory utilization requires a smaller increase extent and a bigger decrease extent, and vice versa.

[Fig. 3: The expected curve for buffer adaptation. x-axis: Memory Utilization (%), from 0 to 100; y-axis: adaptation extent, from 0 to Max. (a) The expected curve for buffer increase (Increase Extent). (b) The expected curve for buffer decrease (Decrease Extent).]

Note that the expected curve does not have to be linear monotonic.
It can differ somewhat from Figure 3 while still maintaining a similar trend.

Our idea regarding the above issue is inspired by the observation that the transfer and the server bargain over the adaptation extent. In the buffer increase case, the transfer seeks to adapt the buffer size to $L_{n+1} = L_n + \beta_1|v_{b,n}|$ to improve data transfer performance, while the receiving server itself would like to keep the buffer size unchanged (i.e., $L_{n+1} = L_n$) to avoid the overuse of memory. Therefore, Rada could make a tradeoff between the demands of the transfer and the receiving server, based on the state of memory utilization. When the memory is heavily loaded, the receiving server is given a higher priority; otherwise, the transfer is more highly prioritized. This could be achieved by a Weighted Mean Function [37] with the memory utilization as the time-varying weights. Hence, if we let $\alpha(n)$ denote the memory utilization in the $n$th epoch, the buffer size after the increase operation could be expressed as

$$L_{n+1} = \alpha(n) L_n + (1 - \alpha(n))\,(L_n + \beta_1 |v_{b,n}|) = L_n + (1 - \alpha(n))\,\beta_1 |v_{b,n}|.$$

With a similar analysis, the buffer size after the decrease operation is

$$L_{n+1} = L_n - \alpha(n)\,\beta_2 |v_{b,n}|.$$

Therefore, Equation (5) can be extended to Equation (7).

$$L_{n+1} = \begin{cases} L_n + (1 - \alpha(n))\,\beta_1 |v_{b,n}| & \text{if } f = 1 \\ L_n & \text{if } f = 0 \\ L_n - \alpha(n)\,\beta_2 |v_{b,n}| & \text{if } f = -1 \end{cases} \qquad (7)$$

Equation (7) leverages the memory efficiently when there is abundant free memory, and it does not compete for the memory too aggressively otherwise. For example, when there is an abundance of free memory, $\alpha(n)$ is small, so that $1 - \alpha(n)$ is relatively large. Thus, when the buffer is enlarged, $\beta_1 |v_{b,n}|$ is given a greater weight, and the size is increased by a greater margin than it is when there is insufficient free memory. When the buffer is reduced, $\beta_2 |v_{b,n}|$ is given a small weight, and the buffer size is decreased by a smaller margin than it is when there is insufficient free memory. When $\alpha(n)$ approaches its two extremes, namely 1 and 0, Equation (7) becomes Equations (8) and (9), respectively.

$$\lim_{\alpha(n) \to 1} L_{n+1} = \begin{cases} L_n & \text{if } f = 1 \text{ or } f = 0 \\ L_n - \beta_2 |v_{b,n}| & \text{if } f = -1 \end{cases} \qquad (8)$$

$$\lim_{\alpha(n) \to 0} L_{n+1} = \begin{cases} L_n + \beta_1 |v_{b,n}| & \text{if } f = 1 \\ L_n & \text{if } f = -1 \text{ or } f = 0 \end{cases} \qquad (9)$$

Equations (8) and (9) show that the buffer no longer increases when there is no available free memory; on the other hand, it stops decreasing when the memory is totally free. These results conform well with the expected curves shown in Figure 3. In the case of buffer increase, the increase extent is $\delta L = (1 - \alpha(n))\,\beta_1 |v_{b,n}|$, which linearly decreases with $\alpha(n)$; in the case of buffer decrease, the decrease extent is $\delta L = \alpha(n)\,\beta_2 |v_{b,n}|$, which linearly increases with $\alpha(n)$. Furthermore, $\beta_1 |v_{b,n}|$ and $\beta_2 |v_{b,n}|$ are the Max values on the y-axis of Figure 3a and Figure 3b, with $\alpha(n)$ set to 0 and 1, respectively.

4 FROM THEORY TO DESIGN

Algorithm 1 Rate Detection.
procedure Rate-Detection
    δt: the epoch
    prevDataRecv ← data received
    prevDataSrv ← data served
    Sleep(δt)
    currentDataRecv ← data received
    currentDataSrv ← data served
    vrecv ← (currentDataRecv − prevDataRecv)/δt
    vsrv ← (currentDataSrv − prevDataSrv)/δt
    vb ← vrecv − vsrv
    return vb
end procedure

In this section, we describe the design of Rada. 
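As an illustration only, Algorithm 1 might be sketched in Python as follows. This is not the authors' C implementation: the two counter callbacks (`data_received`, `data_served`) and the injectable `sleep` function are hypothetical names introduced here so that the sketch is self-contained, and the default epoch simply mirrors the 10 ms value used later in the evaluation.

```python
import time
from typing import Callable, Tuple


def detect_rates(
    data_received: Callable[[], int],
    data_served: Callable[[], int],
    epoch_s: float = 0.010,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[float, float, float]:
    """Run one epoch of rate detection and return (v_recv, v_srv, v_b).

    data_received / data_served are hypothetical callbacks returning the
    cumulative number of bytes written into and read out of the buffer.
    """
    prev_recv = data_received()
    prev_srv = data_served()
    sleep(epoch_s)  # wait for one epoch (delta t in Algorithm 1)
    v_recv = (data_received() - prev_recv) / epoch_s  # arrival rate
    v_srv = (data_served() - prev_srv) / epoch_s      # consumption rate
    return v_recv, v_srv, v_recv - v_srv              # v_b = v_recv - v_srv
```

A positive third value means data arrive faster than they are consumed during that epoch, which is the condition under which the buffer may need to grow.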
Rada employs an independent thread (the Rada thread hereinafter) for the buffer adaptation. By periodically detecting the values of $v_{recv}$ and $v_{srv}$, the Rada thread calculates the difference between the two values, $v_b$, and makes the EMA transformation to obtain $\bar{v}_b$. If the values of $\bar{v}_b$ are constantly positive or negative in the last $k$ consecutive epochs, the Rada thread adapts the buffer with Equation (7).

Algorithm 2 Buffer Adaptation.
1: k: how many consecutive epochs the buffer adaptation decision is based on
2: ω: the coefficient for the EMA transformation
3: β1: the buffer increase coefficient
4: β2: the buffer decrease coefficient
5: α: memory utilization
6: bFirstEpoch: bool flag to identify the first epoch
7: increaseCount: the counter for consecutive positive v̄b
8: decreaseCount: the counter for consecutive negative v̄b
9: bFirstEpoch ← True
10: increaseCount ← 0
11: decreaseCount ← 0
12: while transfer do
13:     vb ← Rate-Detection()
14:     if bFirstEpoch then
15:         v̄b ← vb
16:         bFirstEpoch ← False
17:     else
18:         v̄b ← ωvb + (1 − ω)v̄b
19:     end if
20:     if v̄b > 0 then
21:         increaseCount ← increaseCount + 1
22:         decreaseCount ← 0
23:         if increaseCount >= k then
24:             α ← GetMemoryUtilization()
25:             Increase the buffer by (1 − α)β1|v̄b|
26:             increaseCount ← 0
27:         end if
28:     else if v̄b < 0 then
29:         decreaseCount ← decreaseCount + 1
30:         increaseCount ← 0
31:         if decreaseCount >= k then
32:             α ← GetMemoryUtilization()
33:             Decrease the buffer by αβ2|v̄b|
34:             decreaseCount ← 0
35:         end if
36:     end if
37: end while

Algorithm 1 presents the pseudocode for the detection of the two rates, $v_{recv}$ and $v_{srv}$. The procedure Rate-Detection detects the values of $v_{recv}$ and $v_{srv}$, and calculates their difference, $v_b$. This procedure is called by the Rada thread iteratively to obtain the values of $v_b$ during consecutive epochs. The EMA operation is presented from line 14 to line 19 in Algorithm 2. 
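The per-epoch logic of Algorithm 2 (EMA smoothing, the k-consecutive-epoch rule, and the memory-weighted step of Equation (7)) can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the authors' code: the `RadaState` class and the `adapt` function are hypothetical names, the memory utilization is passed in as a number in [0, 1] rather than read from the operating system, and the default parameter values are the ones reported in the evaluation section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RadaState:
    buffer_size: float             # L_n, in the same unit as beta * rate
    k: int = 3                     # consecutive epochs required to act
    omega: float = 0.3             # EMA coefficient
    beta1: float = 2.0             # buffer increase coefficient
    beta2: float = 0.02            # buffer decrease coefficient
    v_bar: Optional[float] = None  # smoothed v_b; None before the first epoch
    increase_count: int = 0
    decrease_count: int = 0


def adapt(state: RadaState, v_b: float, mem_util: float) -> float:
    """Process one epoch and return the (possibly updated) buffer size."""
    # EMA transformation (lines 14-19 of Algorithm 2).
    if state.v_bar is None:
        state.v_bar = v_b
    else:
        state.v_bar = state.omega * v_b + (1.0 - state.omega) * state.v_bar

    # Decision and memory-weighted step (lines 20-36, Equation (7)).
    if state.v_bar > 0:
        state.increase_count += 1
        state.decrease_count = 0
        if state.increase_count >= state.k:
            state.buffer_size += (1.0 - mem_util) * state.beta1 * abs(state.v_bar)
            state.increase_count = 0
    elif state.v_bar < 0:
        state.decrease_count += 1
        state.increase_count = 0
        if state.decrease_count >= state.k:
            state.buffer_size -= mem_util * state.beta2 * abs(state.v_bar)
            state.decrease_count = 0
    return state.buffer_size
```

With `mem_util` close to 1 the increase step shrinks toward zero, and with `mem_util` close to 0 the decrease step does, matching the limiting cases in Equations (8) and (9). The sketch does not clamp the buffer to a minimum size; a real implementation would presumably need to.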
Note that in the first epoch, we do not have historical values of $\bar{v}_b$, so the new value $\bar{v}_b$ is simply $v_b$; otherwise, $\bar{v}_b$ is determined by both the detected value of $v_b$ in the current epoch and the value of $\bar{v}_b$ in the last epoch. Based on the values of $\bar{v}_b$, when Rada decides to increase or decrease the buffer, the buffer is increased by $(1 - \alpha)\beta_1 |\bar{v}_b|$ or decreased by $\alpha\beta_2 |\bar{v}_b|$. The memory utilization $\alpha$ is obtained from the operating system (e.g., through the file /proc/meminfo in Linux). These operations are presented from line 20 to line 36 in Algorithm 2.

5 EVALUATION

We implemented Rada within a recently proposed UDP-based high-speed protocol, PA-UDP [5], in Linux with kernel version 2.6.18. Rada consists of roughly 300 lines of C code within PA-UDP, so that Rada is easily portable to other protocols. The initial receiving buffer size in our implementation is 36 MB, and Rada only increases the buffer when the amount of free space in the buffer is less than the initial buffer size. Furthermore, when both $v_{recv}$ and $v_{srv}$ are smaller than 0.01 Mbps (we believe this indicates a sleep between two transfers as described in Section 3.2), Rada decreases the buffer by 4 MB until the buffer size is lower than the initial buffer size. These two techniques help Rada leverage the memory more efficiently.

Rada is proposed to eliminate the buffer bottleneck problem by dynamically adapting the buffer size according to variations in the transfer data size, the available memory in the receiver, the transfer number, etc. In this section, we evaluate the performance of Rada in terms of throughput, fairness, and memory utilization according to these variations. We compare Rada with the original PA-UDP with different static buffer sizes in the following three scenarios.

• Single Transfer Scenario. In this scenario with a single transfer from the sender to the receiver, we evaluate the performance of Rada with variations of transfer data size.
• Loaded Receiver Scenario. In this scenario, we evaluate Rada in a simulated scenario in which the receiver is loaded with other applications and significant amounts of memory have been consumed by other applications. Through a comparison between the Single Transfer Scenario and the Loaded Receiver Scenario, we can evaluate Rada's performance according to variations in available memory in the receiver.
• Multiple Transfers Scenario. In this scenario, there are multiple transfers from the sender to the receiver. Compared with the Single Transfer Scenario, we can evaluate Rada's performance with variations in transfer number.

To explore the parameters employed in Rada, we also conduct several experiments to discuss these parameters (Section 5.5). All the experiments presented in the rest of this section have been performed for five trials, unless otherwise stated.

5.1 Experimental Setup

We test Rada in a Gigabit LAN. The setup consists of a sender and a receiver, which are connected through a TP-LINK TL-SG1048 gigabit switch.

• Sender: a Dell PowerEdge 1900 Server with an Intel Xeon Quad-Core 1.60 GHz CPU, 2 GB DDR2 667 MHz memory, a 1 Gigabit NIC, a 7200 RPM SATA hard disk, and Red Hat Enterprise Linux (RHEL) 5.1 with the Linux Kernel 2.6.18.
• Receiver: a Dell PowerEdge T110 Server with an Intel Xeon Quad-Core 2.40 GHz CPU, 3 GB DDR3 667 MHz memory, a 1 Gigabit NIC, a 7200 RPM SATA hard disk, and Red Hat Enterprise Linux (RHEL) 5 with the Linux Kernel 2.6.18.

In the following experiments, unless otherwise stated, $\delta t = 10$ ms, $k = 3$, $\beta_1 = 2$, $\beta_2 = 0.02$, and $\omega = 0.3$. At the end of this section, we will discuss how these parameters relate to the transfer performance and explain how to set these parameters.

5.2 Single Transfer Scenario

In the first test, we evaluate the performance of Rada with a single data transfer to the receiver. Figure 4 compares the average completion time for file transfers between Rada and the static buffer scheme. When the transfer size is no more than 1000 MB, a static buffer of 500 MB has the shortest completion time. However, with the increase in transfer size, a larger buffer is needed to reduce the completion time. When the transfer size increases to 5000 MB, a buffer size as large as 3000 MB is required. The reason that PA-UDP can benefit from such a large buffer is that a large amount of data can be buffered in the memory and written to disk after the transfer is completed [5]. These results confirm the benefit of dynamic buffer adaptation. 
When the receiver is primarily employed to transfer data of no more than 1000 MB at a certain moment, the optimal buffer size cannot be larger than 500 MB. However, when the receiver is busy transferring a larger amount of data, the optimal buffer size can be much larger. As we need to maximize the throughput and to minimize the memory footprint, a dynamic buffer adaptation approach is desirable.

The MIAD curve in Figure 4 presents the completion time (the best tuning results in our experiment) of the Multiplicative Increase Additive Decrease mechanism. As described in Section 3.2, MIAD was our initial idea to control the adaptation extent. However, both the theoretical analysis in Section 3.2 and the experimental results shown here demonstrate that the MIAD mechanism performs far worse than the LAICD mechanism that Rada employs.

To show that Rada consumes less memory than the static buffer scheme, we compared the memory efficiency of Rada with the static buffer scheme. A simple comparison of the memory consumed would be unfair, because a larger buffer is deserved if it can improve the transfer throughput. We introduce a metric called throughput per unit of memory, i.e., the transfer throughput divided by the buffer size, which measures how much throughput one unit of memory contributes.

The memory efficiency results are illustrated in Figure 5. To make this figure more readable, we have removed the results of memory efficiency for buffer sizes whose completion time is obviously longer than the shortest. In this case, it is easy to find the optimal buffer size $l_{optimal}$ defined in Section 1, since only buffer sizes in the set $L_{max\ throughput}$ are presented in the figure. In Figure 5, all the statistics for the 100 MB buffer are ignored, as the 100 MB buffer clearly requires longer completion time in all cases (see Figure 4). The statistics for the 500 MB buffer with transfer size larger than 1000 MB are also ignored for the same reason.

[Fig. 4: Completion time with the increase in transfer size (Single Transfer). Axes: transfer size (MB) against completion time (s). Curves: 100 MB, 500 MB, 1000 MB, 2000 MB, and 3000 MB static buffers, Rada, and MIAD.]

[Fig. 5: Memory efficiency with the increase in transfer size (Single Transfer). Axes: transfer size (MB) against memory efficiency (Mbps per MB). Curves: 500 MB, 1000 MB, 2000 MB, and 3000 MB static buffers, and Rada.]

The results show that Rada achieves maximum or close to maximum memory efficiency in all cases. In contrast, each one of the static buffer sizes achieves a high memory efficiency only in certain cases. Note that the buffer size is on the order of thousands of MB, so a small advantage in memory efficiency could save as much as several hundred MB of memory. Figure 6 shows the actual memory that Rada consumes. Compared with a 3000 MB buffer, Rada saves between 610 MB and 2760 MB of memory in different cases.

[Fig. 6: Memory consumed by Rada with the increase in transfer size (Single Transfer). Axes: transfer size (MB) against memory consumed (MB).]

5.3 Loaded Receiver Scenario

In the second test, we evaluate Rada with the loaded receiver, which means the receiver's memory has already been occupied by other applications to some extent. We use the same sender and receiver as the Single Transfer Scenario, except that a 1 GB memory module has been removed from the memory slots in the receiver. We attempt to simulate the scenario in which the receiver is loaded with other applications and significant amounts of memory have been consumed (1 GB here) by other applications. 
We do not evaluate this with real applications because the results would depend heavily on the applications' memory access pattern. In addition, the evaluation results of removing one memory module are sufficient to indicate the situation of a loaded receiver.

Figure 7 shows the completion time in this case. When the transfer size is no more than 3000 MB, the completion time is nearly the same as that of the Single Transfer Scenario. However, when the transfer size reaches or exceeds 4000 MB, the completion time is quite different compared with the Single Transfer Scenario (see Figure 4). In the Single Transfer Scenario, the 3000 MB buffer performs quite well in all cases. However, in the Loaded Receiver Scenario, the 3000 MB buffer suffers from a sharp increase in completion time when the transfer size reaches 5000 MB. It requires more time than even the 2000 MB buffer. This indicates that a larger buffer can sometimes result in performance degradation. The reason is that only 2 GB of the receiver's 3 GB of physical memory remains after the 1 GB module is removed, so a 3000 MB buffer forces memory swapping, which impairs the performance of the transfer and other applications.

In the Single Transfer Scenario, the optimal buffer size of all the sizes presented in Figure 4 is 3000 MB, because it has the shortest completion time. However, in the Loaded Receiver Scenario, the 2000 MB buffer achieves the shortest completion time, since 1 GB of memory has already been consumed by other applications. Therefore, the optimal buffer size changes all the time according to system execution and memory consumption in other applications. The optimal buffer size at one moment could impair the transfer performance at another moment. This again raises the demand for a dynamic buffer adaptation approach. In both the Single Transfer Scenario and the Loaded Receiver Scenario, Rada demonstrates a very stable performance despite variations in available memory.

The memory efficiency results are presented in Figure 8. 
When the transfer size is no more than 4000 MB, the memory efficiency results are similar to those of the Single Transfer Scenario. Rada displays high memory efficiency in all these cases. When the transfer size increases to 5000 MB, unlike the Single Transfer Scenario where only the results of the 3000 MB buffer and Rada are presented, here only the results of the 2000 MB buffer and Rada are presented, since the 2000 MB buffer has the shortest completion time instead of the 3000 MB buffer. As described in Section 5.2, we have removed the results for buffer sizes whose completion time is obviously longer than the shortest.
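The memory efficiency metric used in Figures 5 and 8 (throughput per unit of memory, introduced in Section 5.2) is a simple ratio. The Python sketch below shows how it might be computed from a measured transfer. The function names are hypothetical, and the conversion of 1 MB to 8 Mb is an assumption made here, since the exact unit convention is not stated in the text.

```python
def throughput_mbps(transfer_mb: float, completion_s: float) -> float:
    """Average throughput in Mbps, assuming 1 MB = 8 Mb."""
    return transfer_mb * 8.0 / completion_s


def memory_efficiency(throughput: float, buffer_mb: float) -> float:
    """Throughput per unit of memory, in Mbps per MB of buffer."""
    return throughput / buffer_mb


# A 500 MB buffer that sustains the full 1000 Mbps of a gigabit link
# reaches 2 Mbps per MB, which is the upper limit for that buffer size.
print(memory_efficiency(1000.0, 500.0))
```

Under this metric a larger buffer is only justified when it raises throughput by a proportional amount, which is why a static 3000 MB buffer scores poorly on small transfers.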
[Fig. 7: Completion time with the increase in transfer size (Loaded Receiver). Axes: transfer size (MB) against completion time (s). Curves: 100 MB, 500 MB, 1000 MB, 2000 MB, and 3000 MB static buffers, and Rada.]

[Fig. 8: Memory efficiency with the increase in transfer size (Loaded Receiver). Axes: transfer size (MB) against memory efficiency (Mbps per MB). Curves: 500 MB, 1000 MB, 2000 MB, and 3000 MB static buffers, and Rada.]

5.4 Multiple Transfers Scenario

In this section, we compare the performance of Rada with the static buffer scheme when there are multiple simultaneous transfers to the receiver. Since high-speed transfer in cloud environments typically does not require too many simultaneous transfers (see footnote 5), we evaluate Rada in this Multiple Transfers Scenario with 4 simultaneous transfers. Figure 9 shows the total completion time of 4 simultaneous transfers with the increase in transfer size. A 100 MB buffer has a long completion time in all the cases. When the buffer size increases to 500 MB, the completion time is reduced. However, when the buffer size increases to 1000 MB, the completion time does not continue to decrease. Instead, it increases dramatically. Therefore, in the Multiple Transfers Scenario, a buffer size that is either too large or too small can limit the transfer performance. In fact, the optimal buffer size varies with the transfer number. That is why a 1000 MB buffer outperforms a 500 MB buffer in the Single Transfer Scenario, while in the Multiple Transfers Scenario the results are reversed. Note that we only present the results with 4 simultaneous transfers in this section. The results with other transfer numbers show different optimal buffer sizes.

Footnote 5: Both PA-UDP [5] and UDT [24] are evaluated with no more than 4 simultaneous transfers.

[Fig. 9: Completion time with the increase in transfer size (Multiple Transfers). Axes: transfer size (MB) against completion time (s). Curves: 100 MB, 500 MB, and 1000 MB static buffers, and Rada.]

With the participation of new transfers and the departure of existing transfers, the transfer number can vary all the time. In this case, the optimal buffer size also changes constantly. The optimal buffer size at present can limit the transfer performance at a later time. Rada increases the buffer size to a value large enough in the Single Transfer Scenario and adjusts the buffer size to a relatively small value in the Multiple Transfers Scenario, so that in both cases Rada delivers a good performance.

Another important metric in the Multiple Transfers Scenario is the fairness among the transfers. Let $v_i$ be the average throughput of the $i$th transfer, and let $n$ be the total transfer number. The fairness can be evaluated using Jain's fairness index [34]

$$F = \frac{\left(\sum_{i=1}^{n} v_i\right)^2}{n \sum_{i=1}^{n} v_i^2}. \qquad (10)$$

This fairness index always lies between 0 and 1 (a higher value means greater fairness). Figure 10 shows the results of fairness. In most cases, the transfers display good levels of fairness, which means that all the transfers have almost equal throughput. However, when the buffer size increases to 1000 MB, the transfers suffer from poor fairness. These results can only indicate the relative value of fairness. They do not indicate to what extent one transfer's throughput differs from that of another. The standard deviation of the transfers' throughput is further evaluated to explore the actual difference among the transfers, which is defined as $\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(v_i - \bar{v})^2}$ ($\bar{v}$ means the average value of $v_i$). Figure 11 shows the standard deviation of the throughput. In the 1000 MB buffer case, the average difference of throughput among the four transfers can be as high as 20 Mbps, while in other cases this difference does not exceed 5 Mbps.

Combining the results from Figure 9, we can conclude that in the Multiple Transfers Scenario, a large buffer can sometimes limit the throughput and impair fairness among the transfers. 
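The two fairness measures defined above, Jain's index in Equation (10) and the sample standard deviation of the per-transfer throughput, can be computed as in the following Python sketch. The function names and the example throughput values are hypothetical and are not measurements from the paper.

```python
import math
from typing import Sequence


def jain_fairness(throughputs: Sequence[float]) -> float:
    """Jain's fairness index: (sum v_i)^2 / (n * sum v_i^2)."""
    n = len(throughputs)
    total = sum(throughputs)
    sum_sq = sum(v * v for v in throughputs)
    return (total * total) / (n * sum_sq)


def throughput_std(throughputs: Sequence[float]) -> float:
    """Sample standard deviation, using the 1/(n-1) normalization."""
    n = len(throughputs)
    mean = sum(throughputs) / n
    return math.sqrt(sum((v - mean) ** 2 for v in throughputs) / (n - 1))


# Four transfers with equal throughput are perfectly fair (index 1.0).
print(jain_fairness([250.0, 250.0, 250.0, 250.0]))
# One transfer taking twice the share of the others lowers the index
# and shows up as a non-zero standard deviation.
print(jain_fairness([400.0, 200.0, 200.0, 200.0]))
print(throughput_std([400.0, 200.0, 200.0, 200.0]))
```

The index is scale-free, so it cannot say how many Mbps separate the transfers; the standard deviation supplies that absolute figure, which is why both are reported.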
Rada achieves both a high throughput and a good level of fairness by adapting the buffer dynamically. Rada also demonstrates high memory efficiency compared with the static buffer scheme. The results are similar to those of the Single Transfer Scenario, so they are not presented here.

[Fig. 10: Fairness index with the increase in transfer size (Multiple Transfers). Axes: transfer size (MB) against fairness. Curves: 100 MB, 500 MB, and 1000 MB static buffers, and Rada.]

[Fig. 11: The standard deviation of throughput with the increase in transfer size (Multiple Transfers). Axes: transfer size against standard deviation of throughput (Mbps). Curves: 100 MB, 500 MB, and 1000 MB static buffers, and Rada.]

According to the three scenarios listed above, Rada outperforms the static buffer scheme in almost all of the cases that we have explored, while each one of the static buffer sizes performs well only in some cases, but not in others. Figures 12 and 13 summarize the throughput and memory efficiency, respectively, of all the cases that we have previously explored. Each one of the three scenarios above has 7 different transfer sizes, and thus we have a total of 21 cases.

We can see from Figure 12 that transfers with small buffer sizes (i.e., 100 MB and 500 MB) reach only a limited throughput in most cases. As the buffer size increases, the transfer achieves a high throughput in most cases, but the throughput decreases sharply to nearly zero in other cases. We therefore conclude that whether a static buffer is large or small, it may function well in certain situations but may not perform well in others. Our dynamic approach, Rada, demonstrates a good performance with a high throughput in all the cases due to its scheme of dynamic buffer size adaptation.

Figure 13 shows the memory efficiency results for the 21 cases. All transfers with static buffer sizes yield a similar curve, with a stable vertical line in the beginning that gradually drops to zero afterwards. The vertical line shows the extreme memory efficiency of the corresponding buffer size. For example, the 500 MB buffer has the extreme memory efficiency of 1000 Mbps / 500 MB = 2 Mbps/MB in a gigabit network. It is possible that in some scenarios, the transfer cannot consume all the memory of the static buffer, which can result in memory waste and low memory efficiency. 
Rada can adapt the buffer size to a value close to the transfer needs, so that it yields a varying curve, which creates a triangle over the curve of the static buffer size. This triangle indicates how much more efficient Rada is in comparison to different sizes of static buffer. Note that the memory efficiency of the 100 MB buffer is not presented here because the 100 MB buffer performs poorly in all the cases.

[Fig. 12: Throughput for all the cases. Axes: throughput (Mbps) against cumulative distribution function. Curves: 100 MB, 500 MB, 1000 MB, 2000 MB, and 3000 MB static buffers, and Rada.]

[Fig. 13: Memory efficiency for all the cases. Axes: memory efficiency (Mbps per MB) against cumulative distribution function. Curves: 500 MB, 1000 MB, 2000 MB, and 3000 MB static buffers, and Rada.]

5.5 Parameters Discussion

In this section, we discuss the parameters used in Rada. The first parameter we explore is the rate detection epoch, $\delta t$. Rada measures the average data arrival rate, $v_{recv}$, and the data consumption rate, $v_{srv}$, over every cycle of $\delta t$. If $\delta t$ is too small, the measurements run too frequently, which leads to more CPU overhead. In addition, it is usually difficult to measure the average rates precisely in a short epoch, so a small $\delta t$ can cause faulty adaptation decisions. On the other hand, if $\delta t$ is too large, Rada requires a relatively long period of time to detect the increase in buffer demand, and the buffer could be exhausted before the next buffer increase operation.

We have conducted several experiments with different values of $\delta t$ (1 ms, 10 ms, 100 ms, and 1000 ms, respectively). Figure 14 shows the throughput with 5000 MB of data transferred in the Single Transfer Scenario. Clearly, Rada performs best when $\delta t$ is on the order of 10 ms. A $\delta t$ that is either much larger or much smaller than this can limit transfer performance. When $\delta t$ is set as large as 1000 ms, the throughput decreases dramatically to nearly half of its best performance. On the other hand, when $\delta t$ is set as small as 1 ms, the throughput fluctuates widely and has a low average. The results in other cases (different transfer size, different available memory, and different transfer number) are similar and are not presented here.
[Fig. 14: Transfer throughput with different values of $\delta t$. Axes: time (s) against throughput (Mbps). Curves: 1 ms, 10 ms, 100 ms, and 1000 ms.]

[Fig. 15: Transfer throughput with different values of $k$. Axes: $k$ against throughput (Mbps).]

[Fig. 16: Transfer throughput with different values of $\omega$. Axes: $\omega$ against throughput (Mbps).]

The second parameter discussed here is $k$ in Equation (2). Rada determines whether to increase or decrease the buffer based on the values of $\bar{v}_b$ in the last $k$ consecutive epochs. In a very conservative case ($k$ is large), Rada makes the buffer adaptation decision based on the values of $\bar{v}_b$ in a large number of consecutive epochs, so that Rada may fail to adapt the buffer in time. In a very aggressive case ($k$ is small), Rada makes the buffer adaptation decision based simply on the values of $\bar{v}_b$ in one or two epochs. Therefore, Rada may fail to make use of the characteristics of $v_b$ described in Section 3.1, resulting in a poor buffer adaptation decision.

Figure 15 shows the throughput when 5000 MB of data are transferred in the Single Transfer Scenario with different values of $k$. Rada reaches a throughput of no more than 300 Mbps when $k$ is set to 1, in which case the adaptation is very aggressive and incorrect buffer adaptation decisions could be made. As $k$ increases up to 5, Rada achieves a very stable near-maximum throughput. When $k$ exceeds 5, however, the throughput drops dramatically because Rada cannot adapt the buffer quickly enough.

Next, we discuss the parameter $\omega$ used in the EMA transformation. $\omega$ determines how fast we discount older observed values. An $\omega$ that is either too small or too large prevents the EMA transformation from working as described in Section 3.1. Figure 16 shows the throughput when 5000 MB of data are transferred in the Single Transfer Scenario with different values of $\omega$. Rada achieves a high and stable throughput when $\omega$ is in the range of 0.2 to 0.7. However, the throughput decreases significantly with an $\omega$ that is either too small or too large.

The parameters above mainly relate to the buffer adaptation decision. Next, we discuss the parameters involved with the buffer adaptation extent. In Rada, $\beta_1$ and $\beta_2$ combined with the value of $\bar{v}_b$ determine the base of the buffer adaptation extent in each increase/decrease operation. As described in Section 3.2, $\beta_1$ is set to a larger value than $\beta_2$, so that Rada can adapt the buffer with an effect of aggressive increase and conservative decrease. Figure 17 shows the throughput with different $\beta_1$ and $\beta_2$ when 5000 MB of data are transferred in the Single Transfer Scenario. Rada performs much better with $\beta_1 > \beta_2$ than with $\beta_1 < \beta_2$. This confirms the need for the LAICD mechanism that Rada has employed.

[Fig. 17: Transfer throughput with different $\beta_1$ and $\beta_2$. Axes: $\beta_1$ and $\beta_2$ (0.02 to 2) against throughput (Mbps).]

Rada performs poorly when $\beta_1$ and $\beta_2$ are outside the range of 0.02 to 2, because a $\beta$ that is too small has little adaptation effect, and a $\beta$ that is too large would result in excessive adaptation overhead. The range [0.02, 2] in fact conforms with Theorem 4 quite well, where $k\,\delta t = 3 \times 0.01 = 0.03$ (very close to the lower bound 0.02 here) and $\delta L_{threshold}/|v_b|$ confines the upper bound of $\beta_1$.

6 RELATED WORK

The topic of dynamic buffer adaptation relates to a number of research issues. Related work has been studied in TCP congestion window adjustment, I/O buffer adaptation, buffer Incast inside a data center, etc. In this section, we summarize related studies and discuss them individually.

TCP Congestion Window Adjustment. TCP congestion window or sending buffer adjustment [25] is a classic buffer adaptation problem in networking. TCP employs an Additive Increase Multiplicative Decrease (AIMD) algorithm to adapt the sending buffer based on the feedback of packet loss [25]. 
TCP variants modify the basic AIMD algorithm and adapt the sending buffer with the feedback of packet loss [22], [23], queuing delay [21], or Bandwidth-Delay Product (BDP) [26], [28]. All of these algorithms aim at adapting the sending buffer based on the aforementioned feedback. However, Rada aims to adapt the receiving buffer, and similar feedback is not available on the receiver side. Interestingly, the LAICD scheme in Rada is initially inspired by TCP congestion control, as we observe that our motivation is opposite to that of TCP.

I/O Buffer Adaptation. Gulati et al. [8], [29] proposed a buffer adaptation algorithm to access the storage array fairly and efficiently based on the estimated data access latency, which is very similar to TCP's congestion control. This algorithm also operates in the sender and cannot be applied to the receiver, because feedback such as latency is not available in the receiver. Zhang et al. [9] extended the idea from Supply Chain Management (SCM) to improve the efficiency of prefetching in file systems, which allocates more cache for applications with faster access rates. However, the SCM-based prefetching approach aims to improve the hit rate of the buffer and to maximize the data output speed from the buffer, while Rada is designed to increase the data input speed to the receiving buffer so that the data transfer throughput can be improved.

Buffer Incast inside a Data Center. Inside a data center, when a client simultaneously requests data from multiple servers through TCP, the fixed-size switch buffer will be overwhelmed and thus the TCP throughput will decrease significantly due to packet loss and TCP timeout. This phenomenon is referred to as Incast [2], [11], [27]. To solve the Incast problem, existing approaches have tried to 1) reduce the TCP retransmission timeout to mitigate the underutilized link capacity during the timeout period [6], 2) use Ethernet flow control to avoid switch buffer overflow [11], 3) reduce the switch buffer occupation with Explicit Congestion Notification (ECN) [3], Fair Quantized Congestion Notification (FQCN) [1], or an Incast Congestion Control scheme on the receiver side [4], and 4) design custom congestion-aware UDP-based protocols that manage congestion across multiple connections rather than within a single connection [14]. Note that the switch buffer in Incast is a fixed-size hardware buffer inside a data center, so that all existing solutions to this problem try to control the buffer occupation to avoid buffer overflow as well as maximize throughput. 
However, the buffer bottleneck we address in this paper is about the receiving buffer for high-speed data transfer between two servers from two data centers. This buffer is a software buffer allocated from the memory at the receiver machine. Therefore, we could reallocate/deallocate the memory to dynamically adjust the buffer size. With a dynamic buffer size adjustment, the throughput would not be limited by a fixed buffer size due to congestion control. Our experimental results show that our dynamic buffer adaptation approach outperforms the fixed-size buffer scheme employed in PA-UDP [5].

Others. Eryilmaz et al. [30] studied the problem of allocating resources at a base station to many competing flows. Through a combination of queue-length-based scheduling and congestion control, their approach leads to fair resource allocation and queue length stability. Unlike Rada, their approach primarily allocates time resources such as time slots, frequency and power, while Rada allocates the space resource, i.e., the memory. There are also other resource or buffer allocation approaches [7], [12], [31], [32], but their scenarios or motivations still have little in common with ours.

7 CONCLUSION

Although a number of high-bandwidth networks have been constructed, existing high-speed protocols cannot fully utilize the bandwidth of these networks, as their fixed-size application level receiving buffers suffer from buffer bottleneck. In this paper, we analyze the buffer bottleneck problem and propose Rada, a rate detection based buffer adaptation scheme to remedy this problem. By periodically detecting the data arrival rate and consumption rate in the buffer, Rada adapts the buffer size dynamically. Rada decides to increase/decrease the buffer when the data arrival rate is constantly faster/slower than the data consumption rate. 
The adaptation extent in each buffer increase/decrease operation is based on a Linear Aggressive Increase Conservative Decrease scheme, and is also automatically adjusted according to the receiver's memory utilization using a Weighted Mean Function. Through a number of detailed evaluations in various scenarios, we demonstrate that Rada is effective and efficient compared with the static buffer scheme. Rada outperforms the static buffer scheme in terms of throughput, memory footprint, and fairness in all the scenarios that we have explored. Our future work includes 1) exploring Rada in more types of high-speed protocols and in different local and wide-area networks; and 2) implementing Rada as libraries so that Rada can be easily ported to existing protocols.

REFERENCES

[1] Y. Zhang and N. Ansari, "On mitigating TCP Incast in Data Center Networks," in Proceedings of IEEE INFOCOM, 2011.
[2] J. Zhang, F. Ren, and C. Lin, "Modeling and understanding TCP incast in data center networks," in Proceedings of IEEE INFOCOM, 2011.
[3] M. Alizadeh, A. Greenberg, D.A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan, "Data center TCP (DCTCP)," in Proceedings of ACM SIGCOMM, 2010.
[4] H. Wu, Z. Feng, C. Guo, and Y. Zhang, "ICTCP: Incast Congestion Control for TCP in data center networks," in Proceedings of CoNEXT, 2010.
[5] B. Eckart, X. He, Q. Wu, and C. Xie, "A Dynamic Performance-Based Flow Control Method for High-Speed Data Transfer," IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 21, 2010, pp. 114-125.
[6] V. Vasudevan, A. Phanishayee, H. Shah, E. Krevat, D. G. Andersen, G. R. Ganger, G. A. Gibson, and B. Mueller, "Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication," in Proceedings of ACM SIGCOMM, 2009.
[7] G. Michelogiannakis, D.U. Becker, and W.J. Dally, "Evaluating Elastic Buffer and Wormhole Flow Control," IEEE Transactions on Computers (TC), vol. 60, 2011, pp. 896-903.
[8] A. Gulati and C.A. Waldspurger, "PARDA: Proportional Allocation of Resources for Distributed Storage Access," in Proceedings of the USENIX Conference on File and Storage Technologies (FAST), 2009.
[9] Z. Zhang, A. Kulkarni, X. Ma, and Y. Zhou, "Memory Resource Allocation for File System Prefetching From a Supply Chain Management Perspective," in Proceedings of ACM EuroSys, 2009.
[10] Y. Lin, Q. Wu, N.S. Rao, and M. Zhu, "On Design of Scheduling Algorithms for Advance Bandwidth Reservation in Dedicated Networks," in Proceedings of IEEE INFOCOM, 2008.
[11] A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, "Measurement and analysis of TCP throughput collapse in cluster-based storage systems," in Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST), 2008.
[12] G. Michelogiannakis, J. Balfour, and W.J. Dally, "Elastic-buffer flow control for on-chip networks," in Proceedings of IEEE International Symposium on High Performance Computer Architecture (HPCA), 2009.
[13] R. Buyya, C.S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, "Cloud computing and emerging IT platforms: Vision, Hype, and Reality for Delivering Computing as the 5th Utility," Future Generation Computer Systems, vol. 25, 2009, pp. 599-616.
[14] Presentation Summary-high Performance at Massive Scale Lessons Learned at Facebook. http://idleprocess.wordpress.com/2009/11/24/presentation-summary-high-performance-at-massive-scale-lessons-learned-at-facebook/.
[15] Open Cloud Testbed. http://opencloudconsortium.org/testbed/.
[16] DOE UltraScience Net. http://www.csm.ornl.gov/ultranet/.
[17] N.S. Rao, Q. Wu, S. Ding, S.M. Carter, W.R. Wing, A. Banerjee, D. Ghosal, and B. Mukherjee, "Control Plane for Advance Bandwidth Scheduling in Ultra High-Speed Networks," in Proceedings of IEEE INFOCOM, 2006.
[18] JGN2plus: Advanced Testbed for Research and Development. http://www.jgn.nict.go.jp/english/index.html.
[19] User Controlled LightPath Project. http://uclp.uwaterloo.ca/.
[20] DRAGON: Dynamic Resource Allocation via GMPLS Optical Networks. http://dragon.maxgigapop.net/twiki/bin/view/DRAGON/WebHome.
[21] D. Wei and S. Low, "FAST TCP: Motivation, Architecture, Algorithms, Performance," IEEE/ACM Transactions on Networking (TON), vol. 14, 2006, pp. 2490-2501.
[22] S. Floyd, HighSpeed TCP for Large Congestion Windows, RFC Editor, 2003.
[23] T. Kelly, "Scalable TCP: Improving Performance in Highspeed Wide Area Networks," ACM SIGCOMM Computer Communication Review, vol. 33, 2002, pp. 83-91.
[24] Y. Gu and R. Grossman, "UDT: UDP-based Data Transfer for High-speed Wide Area Networks," Computer Networks, vol. 51, 2007, pp. 1777-1799.
[25] V. Jacobson, "Congestion Avoidance and Control," ACM SIGCOMM Computer Communication Review, vol. 18, 1988, pp. 314-329.
[26] S. Tao, L.K. Jacob, and A. Ananda, "A TCP Socket Buffer Auto-Tuning Daemon," in Proceedings of the International Conference on Computer Communications and Networks (ICCCN), 2003, pp. 299-304.
[27] D. Nagle, D. Serenyi, and A. Matthews, "The Panasas ActiveScale Storage Cluster: Delivering Scalable High Bandwidth Storage," in Proceedings of the ACM/IEEE conference on Supercomputing, 2004.
[28] R.S. Prasad, M. Jain, and C. Dovrolis, "Socket Buffer Auto-Sizing for High-Performance Data Transfers," Journal of Grid Computing, vol. 1, 2003, pp. 361-376.
[29] A. Gulati and I. Ahmad, "Towards Distributed Storage Resource Management Using Flow Control," ACM SIGOPS Operating Systems Review, vol. 42, 2008, pp. 10-16.
[30] A. Eryilmaz and R. Srikant, "Fair Resource Allocation in Wireless Networks Using Queue-Length-Based Scheduling and Congestion Control," IEEE/ACM Transactions on Networking (TON), vol. 15, 2007, pp. 1333-1344.
[31] L. Bui, R. Srikant, and A. Stolyar, "Optimal Resource Allocation for Multicast Sessions in Multi-hop Wireless Networks," Philosophical Transactions. Series A, Mathematical, physical, and engineering sciences, vol. 366, 2008, pp. 2059-2074.
[32] J. Andrews and B. Evans, "Adaptive Resource Allocation in Multiuser OFDM Systems with Proportional Rate Constraints," IEEE Transactions on Wireless Communications (TWC), vol. 4, 2005, pp. 2726-2737.
[33] H. Liu, Y. Zhang, Y. Zhou, and R. Xue, "A Rate and Resource Detection Based Receive Buffer Adaptation Approach for High-speed Data Transportation," in Proceedings of the International Conference on Computer Communications and Networks (ICCCN), 2010.
[34] R. Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation and Modeling. John Wiley and Sons, Inc., 1991.
[35] J.S. Hunter, "The Exponentially Weighted Moving Average," Journal of Quality Technology, vol. 18, 1986, pp. 203-209.
[36] E.B. Nightingale and J. Flinn, "Energy-efficiency and storage flexibility in the blue file system," in Proceedings of the conference on Symposium on Operating Systems Design and Implementation (OSDI), 2004, pp. 25-25.
[37] F. Qi, "Generalized Weighted Mean Values with Two Parameters," in Proceedings: Mathematical, Physical and Engineering Sciences, vol. 454, 1998, pp. 2723-2732.
[38] UDP Buffering Background. http://www.29west.com/docs/THPM/udp-buffering-background.html.
[39] J. D. Schwager, Getting Started in Technical Analysis. John Wiley and Sons, Inc., 1999.

Hao Liu received the B.S. degree in Software Engineering from Beijing University of Aeronautics and Astronautics, China, in 2009. He is currently a Ph.D. Candidate at the Department of Computer Science and Technology, Tsinghua University, China. His research interests include distributed computing, high speed transport protocols and pervasive computing with smartphones.

Yaoxue Zhang received his BS degree from Northwest Institute of Telecommunication Engineering, China, and received his PhD degree in computer networking from Tohoku University, Japan in 1989. Then, he joined Department of Computer Science, Tsinghua University, China. He was a visiting professor of Massachusetts Institute of Technology (MIT) and University of Aizu, in 1995 and 1998, respectively. Currently, he is a fellow of the Chinese Academy of Engineering and a professor in computer science and technology in Tsinghua University, China. Additionally, he serves as an editorial board member of four international journals.
His major research areas include computer networking, operating systems, ubiquitous/pervasive computing, transparent computing, and active services. He has published over 170 technical papers in international journals and conferences, as well as 8 monographs and textbooks.

Yuezhi Zhou obtained his PhD degree in Computer Science and Technology from Tsinghua University, China in 2004 and is now working as an associate professor at the same department. He worked as a visiting scientist at the Computer Science Department in Carnegie Mellon University in 2005. His research interests include ubiquitous/pervasive computing, distributed systems, and mobile devices and systems. He has published over 30 technical papers in international journals or conferences. He received the IEEE Best Paper Award in the 21st IEEE AINA International Conference in 2007. He is a member of IEEE and ACM.

Xiaoming Fu is a professor and head of the Computer Networks Group at the University of Goettingen, Germany. He received his Ph.D. degree in Computer Science from Tsinghua University, China in 2000. He was a research staff member at Technical University Berlin, before moving to Goettingen as assistant professor in 2002. His research interests include network architectures, protocols, mobile communications and service overlays. He has served as TPC member/session chair/chair for several networking conferences such as INFOCOM, ICNP, ICDCS, and MobiArch. He currently serves as vice chair of the Technical Committee on Computer Communications (TCCC), IEEE Communications Society (ComSoc). He is a senior member of IEEE, a member of ACM and a member of GI.

Laurence T. Yang received the B.E. in Computer Science from Tsinghua University, China and the Ph.D. in Computer Science from University of Victoria, Canada. He is a professor in Computer Science at St Francis Xavier University, Canada. His current research includes parallel and distributed computing, embedded and ubiquitous/pervasive computing. He has published many papers in various refereed journals, conference proceedings and book chapters in these areas (including around 100 international journal papers such as IEEE and ACM Transactions). His research has been supported by NSERC and CFI of Canada.

IEEE TRANSACTIONS ON COMPUTERS
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.