A Dynamic Performance-Based Flow Control Method for High-Speed Data Transfer

Ben Eckart, Student Member, IEEE, Xubin He, Senior Member, IEEE, Qishi Wu, Member, IEEE, and Changsheng Xie

B. Eckart and X. He are with the Department of Electrical and Computer Engineering, Tennessee Technological University, Cookeville, TN 38505. E-mail: {bdeckart21, hexb}@tntech.edu.
Q. Wu is with the Department of Computer Science, University of Memphis, Memphis, TN 38152. E-mail: qishiwu@memphis.edu.
C. Xie is with the Data Storage Division of Wuhan National Laboratory for Optoelectronics, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China. E-mail: cs_xie@hust.edu.cn.
Manuscript received 8 July 2008; revised 8 Jan. 2009; accepted 17 Feb. 2009; published online 24 Feb. 2009. Recommended for acceptance by C.-Z. Xu. Digital Object Identifier no. 10.1109/TPDS.2009.37.

Abstract—New types of specialized network applications are being created that need to be able to transmit large amounts of data across dedicated network links. TCP fails to be a suitable method of bulk data transfer in many of these applications, giving rise to new classes of protocols designed to circumvent TCP's shortcomings. It is typical in these high-performance applications, however, that the system hardware is simply incapable of saturating the bandwidths supported by the network infrastructure. When the bottleneck for data transfer occurs in the system itself and not in the network, it is critical that the protocol scales gracefully to prevent buffer overflow and packet loss. It is therefore necessary to build a high-speed protocol adaptive to the performance of each system by including a dynamic performance-based flow control. This paper develops such a protocol, Performance Adaptive UDP (henceforth PA-UDP), which aims to dynamically and autonomously maximize performance under different systems. A mathematical model and related algorithms are proposed to describe the theoretical basis behind effective buffer and CPU management. A novel delay-based rate-throttling model is also demonstrated to be very accurate under diverse system latencies. Based on these models, we implemented a prototype under Linux, and the experimental results demonstrate that PA-UDP outperforms other existing high-speed protocols on commodity hardware in terms of throughput, packet loss, and CPU utilization. PA-UDP is efficient not only for high-speed research networks, but also for reliable high-performance bulk data transfer over dedicated local area networks where congestion and fairness are typically not a concern.

Index Terms—Flow control, high-speed protocol, reliable UDP, bulk transfer.

1 INTRODUCTION

A certain class of next generation science applications needs to be able to transfer increasingly large amounts of data between remote locations. Toward this goal, several new dedicated networks with bandwidths upward of 10 Gbps have emerged to facilitate bulk data transfers. Such networks include UltraScience Net (USN) [1], CHEETAH [2], OSCARS [3], User-Controlled Light Paths (UCLPs) [4], Enlightened [5], Dynamic Resource Allocation via GMPLS Optical Networks (DRAGONs) [6], Japanese Gigabit Network II [7], Bandwidth on Demand (BoD) on Geant2 network [8], Hybrid Optical and Packet Infrastructure (HOPI) [9], Bandwidth Brokers [10], and others.

The goal of our work is to present a protocol that can maximally utilize the bandwidth of these private links through a novel performance-based system flow control. As multigigabit speeds become more pervasive in dedicated LANs and WANs and as hard drives remain relatively stagnant in read and write speeds, it becomes increasingly important to address these issues inside of the data transfer protocol. We demonstrate a mathematical basis for the control algorithms we use, and we implement and benchmark our method against other commonly used applications and protocols. A new protocol is necessary, unfortunately, due to the fact that the de facto standard of network communication, TCP, has been found to be unsuitable for high-speed bulk transfer. It is difficult to configure TCP to saturate the bandwidth of these links due to several assumptions made during its creation.

The first shortcoming is that TCP was made to distribute bandwidth equally among the current participants in a network and uses a congestion control mechanism based on packet loss. Throughput is halved in the presence of detected packet loss and only additively increased during subsequent loss-free transfer. This is the so-called Additive Increase Multiplicative Decrease algorithm (AIMD) [13]. If packet loss is a good indicator of network congestion, then transfer rates will converge to an equal distribution among the users of the network. In a dedicated link, however, packet loss due to congestion can be avoided. The partitioning of bandwidth, therefore, can be done via some other, more intelligent bandwidth scheduling process, leading to more precise throughput and higher link utilization. Examples of advanced bandwidth scheduling systems include the centralized control plane of USN and Generalized Multiple Protocol Label Switching (GMPLS) for DRAGON [11], [38]. On a related note, there is no need for TCP's slow-start mechanism because dedicated links with automatic bandwidth partitioning remove the risk of a new connection overloading the network. For more information, see [12].

A second crucial shortcoming of TCP is its congestion window.
To ensure in-order, reliable delivery, both parties maintain a buffer the size of the congestion window and the sender sends a burst of packets. The receiver then sends back positive acknowledgments (ACKs) in order to receive the next window. Using timeouts and logic, the sender decides which packets are lost in the window and resends them. This synchronization scheme ensures that the receiver receives all packets sent, in-order, and without duplicates; however, it can come at a price. On networks with high latencies, reliance on synchronous communication can severely stunt any attempt for high-bandwidth utilization because the protocol relies on latency-bound communication. For example, consider the following throughput equation relating latency to throughput. Disregarding the effects of queuing delays or packet loss, the effective throughput can be expressed as

\[
  \text{throughput} = \frac{cwin \times MSS}{rtt}, \qquad (1)
\]

where cwin is the window size, MSS the maximum segment size, and rtt the round-trip time. With a congestion window of 100 packets and a maximum segment size of 1,460 bytes (the difference between the MTU and TCP/IP header), a network with an infinite bandwidth and 10 ms round-trip time would only be able to achieve approximately 120 Mbps effective throughput. One could attempt to mitigate the latency bottleneck by letting cwin scale to the bandwidth-delay product (BW × rtt) or by striping and parallelizing TCP streams (see BBCP [15]), but there are also difficulties associated with these techniques. Regardless, (1) illustrates the potentially deleterious effect of synchronous communication on high-latency channels.

Solutions to these problems have come in primarily two forms: modifications to the TCP algorithm and application-level protocols which utilize UDP for asynchronous data transfer and TCP for control and data integrity issues. This paper focuses on the class of high-speed reliable UDP protocols [20], which include SABUL/UDT [16], [17], RBUDP [18], Tsunami [19], and Hurricane [39]. Despite the primary focus on these protocols, most of the techniques outlined in this paper could be applied to any protocol for which transfer bandwidths are set using interpacket delay.

The rest of the paper is organized as follows: TCP and high-speed reliable UDP are discussed in Sections 2 and 3, respectively. The goals for high-speed bulk data transfer over reliable UDP are discussed in Section 4. Section 5 defines our mathematical model. Section 6 describes the architecture and algorithms for the PA-UDP protocol. Section 7 discusses the implementation details of our PA-UDP protocol. Experimental results and CPU utilization statistics are presented in Section 8. We examine related work in Section 9 and draw our conclusions in Section 10.

2 TCP SOLUTIONS

As mentioned in Section 1, the congestion window provided by TCP can make it impossible to saturate link bandwidth under certain conditions. In the example pertaining to (1), one obvious speed boost would be to increase the congestion window beyond one packet. Assuming a no-loss link, a window size of n packets would allow for 12.5n Mbps throughput. On real networks, however, it turns out that the Bandwidth-Delay Product (BDP) of the network is integral to the window size. As the name suggests, the BDP is simply the product of the bandwidth of the channel multiplied by the end-to-end delay of the hosts. In a sense, this is the amount of data present "on the line" at any given moment. A 10 Gbps channel with an RTT of 10 ms would need approximately a 12.5 Megabyte buffer on either end, because at any given time, 12.5 Megabytes would be on the line that potentially would need to be resent due to errors in the line or packet loss at the receiving end. Ideally, a channel could sustain maximum throughput by setting the BDP equal to the congestion window, but it can be difficult to determine these parameters accurately. Moreover, the TCP header field uses only 16 bits to specify window size. Therefore, unless the TCP protocol is rewritten at the kernel level, the largest usable window is 65 Kilobytes. Note that there are modifications to TCP that can increase the window size for large BDP networks [14]. Efforts in this area also include dynamic windows, different acknowledgment procedures, and statistical measurements for channel parameters. Other TCP variants attempt to modify the congestion control algorithm to be more amenable to characteristics of high-speed networks. Still others look toward multiple TCP streams, like bbFTP, GridFTP, and pTCP. Most employ a combination of these methods, including (but not limited to) High-Speed TCP [43], Scalable TCP [44], and FAST TCP [46].

Many of the TCP-based algorithms are based in the transport layer, and thus, kernel modification is usually necessary to implement them. Some also rely on specially configured routers. As a result, the widespread deployment of any of these algorithms would be a very daunting task. It would be ideal to be able to run a protocol on top of the two standard transport layer protocols, TCP and UDP, so that any computer could implement them. This would entail an application-level protocol which could combine the strengths of UDP and TCP and which could be applied universally to these types of networks.
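To make these numbers concrete, the short C program below (not part of PA-UDP; written here purely as a worked example) recomputes the effective-throughput bound of (1) for the parameters quoted above, along with the 12.5 Megabyte bandwidth-delay product of a 10 Gbps link with a 10 ms RTT.

```c
#include <stdio.h>

int main(void)
{
    /* Eq. (1): effective throughput bounded by window size and RTT. */
    double cwin = 100.0;        /* congestion window, packets      */
    double mss  = 1460.0 * 8.0; /* maximum segment size, bits      */
    double rtt  = 0.010;        /* round-trip time, seconds        */
    printf("Eq. (1) bound: %.1f Mbps\n", cwin * mss / rtt / 1e6);

    /* Bandwidth-delay product of a 10 Gbps link with a 10 ms RTT. */
    double bw = 10e9;           /* link bandwidth, bits per second */
    printf("BDP: %.1f MB\n", bw * rtt / 8.0 / 1e6);
    return 0;
}
```

Run as written, this prints roughly 116.8 Mbps and 12.5 MB, matching the figures in the text.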
3 HIGH-SPEED RELIABLE UDP

High-speed reliable UDP protocols include SABUL/UDT [16], [17], RBUDP [18], Tsunami [19], and Hurricane [39], among others [20]. High-speed UDP-based protocols generally follow a similar structure: UDP is used for bulk data transfer and TCP is used marginally for control mechanisms. Most high-speed reliable UDP protocols use delay-based rate control to remove the need for congestion windows. This control scheme allows a host to statically set the rate and undoes the throughput-limiting stairstep effects of AIMD. Furthermore, reliable delivery is ensured with either delayed, selective, or negative acknowledgments of packets. Negative acknowledgments are optimal in cases where packet loss is minimal. If there is little loss, acknowledging only lost packets will incur the least amount of synchronous communication between the hosts. A simple packet numbering scheme and application-level logic can provide in-order, reliable delivery of data. Finally, reliable UDP is positioned at the application level, which allows users to explore more customized approaches to suit the type of transfer, whether it is disk-to-disk, memory-to-disk, or any combination thereof.

Due to deliberate design choices, most high-speed reliable UDP protocols have no congestion control or fairness mechanisms. Eschewing fairness for simplicity and speed improvements, UDP-based protocols are meant to be deployed only on private networks where congestion is not an issue, or where bandwidth is partitioned apart from the protocol.
Reliable UDP protocols have shown varying degrees of success in different environments, but they all ignore the effects of disk throughput and CPU latency for data transfer applications. In such high-performance distributed applications, it is critical that system attributes be taken into account to make sure that both sending and receiving parties can support the required data rates. Many tests show artificially high packet loss because of the limitations of the end systems in acquiring the data and managing buffers. In this paper, we show that this packet loss can be largely attributed to the effects of lackluster disk and CPU performance. We then show how these limitations can be circumvented by a suitable architecture and a self-monitoring rate control.

4 GOALS FOR HIGH-SPEED BULK TRANSFER

Ideally, we would want a high-performing protocol suitable for a variety of high-speed, high-latency networks without much configuration necessary at the user level. Furthermore, we would like to see good performance on many types of hardware, including commodity hardware and disk systems. Understanding the interplay between these algorithms and the host properties is crucial.

On high-speed, high-latency, congestion-free networks, a protocol should strive to accomplish two goals: to maximize goodput by minimizing synchronous, latency-bound communication and to maximize the data rate according to the receiver's capacity. (Here, we define goodput as the throughput of usable data, discounting any protocol headers or transport overhead [21].)

Latency-bound communication is one of the primary problems of TCP due to the positive acknowledgment congestion window mechanism. As previous solutions have shown, asynchronous communication is the key to achieving maximum goodput. When UDP is used in tandem with TCP, UDP packets can be sent asynchronously, allowing the synchronous TCP component to do its job without limiting the overall bandwidth.

High-speed network throughputs put considerable strain on the receiving system. It is often the case that disk throughput is less than half of the network's potential, and high-speed processing of packets greatly taxes the CPU. Due to this large discrepancy, it is critical that the data rate is set by the receiver's capacity. An overly high data rate will cause a system buffer to grow at a rate relative to the difference between receiving and processing the data. If this mismatch continues, packet loss will inexorably occur due to finite buffer sizes. Therefore, any protocol attempting to prevent this must continually communicate with the sender to make sure that the sender only sends at the receiver's specific capacity.

5 A MATHEMATICAL MODEL

Given the relative simplicity of high-speed UDP algorithms, mathematical models can be constructed with few uncontrollable parameters. We can exploit this determinism by tweaking system parameters for maximum performance. In this section, we produce a mathematical model relating buffer sizes to network rates and sending rates to interpacket delay times. These equations will be used to predict the theoretical maximum bandwidth of any data transfer given a system's disk and CPU performance characteristics.

Since the host receiving the data is under considerably more system strain than the sender, we shall concentrate on a model for the receiver, and then, briefly consider the sender. The receiver's capacity can be thought of as an equation relating its internal system characteristics with those of the network. Two buffers are of primary importance in preventing packet loss at the receiving end: the kernel's UDP buffer and the user buffer at the application level.

5.1 Receiving Application Buffers

For the protocols which receive packets and write to disk asynchronously, the time before the receiver has a full application buffer can be calculated with a simple formula. Let t be the time in seconds, r(·) be a function which returns the data rate in bits per second (bps) of its argument, and m be the buffer size in bits. The time before m is full is given by

\[
  t = \frac{m}{r(recv) - r(disk)}. \qquad (2)
\]

At time t, the receiver will not be able to accept any more packets, and thus, will have to drop some. We found this to be a substantial source of packet loss in most high-speed reliable UDP protocols. To circumvent this problem, one may put a restriction on the size of the file sent by relating file size to r(recv) × t. Let f be the size of a file and f_max be its maximum size:

\[
  f_{max} = \frac{m}{1 - \frac{r(disk)}{r(recv)}}. \qquad (3)
\]

Note that f_max can never be negative since r(disk) can only be as fast as r(recv). Also, note that if the two rates are equally matched, f_max will be infinite since the application buffer will never overflow.

Designing a protocol that limits file sizes is certainly not an acceptable solution, especially since we have already stipulated that these protocols need to be designed to sustain very large amounts of data. Therefore, if we can set the rate of the sender, we can design an equation to accommodate our buffer size and r(disk). Rearranging, we see that

\[
  r(recv) = \frac{r(disk)}{1 - \frac{m}{f}}, \qquad (4)
\]

or if we let

\[
  \gamma = \frac{1}{1 - \frac{m}{f}}, \qquad (5)
\]

we can then arrive at

\[
  r(recv) = \gamma \, r(disk), \qquad (6)
\]

or

\[
  \gamma = \frac{r(recv)}{r(disk)}. \qquad (7)
\]

We can intuitively see that if the ratio between disk and network activity remains constant at γ, the transfer will make full use of the buffer while minimizing the maximum value of r(recv).
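The buffer equations translate directly into code. The sketch below is ours, not taken from PA-UDP; the 750 MB buffer matches the one used later in the paper's experiments, while the 3 GB file size and the disk and network rates are assumed values chosen only to exercise (2), (3), and (5)-(6).

```c
#include <stdio.h>

/* Eq. (2): seconds until a buffer of m bits fills when packets arrive at
 * r_recv and drain to disk at r_disk (both in bits per second). */
static double time_until_full(double m, double r_recv, double r_disk)
{
    return m / (r_recv - r_disk);
}

/* Eq. (3): largest file (in bits) that avoids overflow at a fixed r_recv. */
static double max_file_size(double m, double r_recv, double r_disk)
{
    return m / (1.0 - r_disk / r_recv);
}

/* Eqs. (5)-(6): gamma = 1/(1 - m/f), and the receive rate that fills the
 * buffer exactly at the end of an f-bit transfer. */
static double target_recv_rate(double m, double f, double r_disk)
{
    return r_disk / (1.0 - m / f);
}

int main(void)
{
    double m = 750e6 * 8, f = 3e9 * 8;      /* 750 MB buffer, 3 GB file */
    double r_disk = 400e6, r_recv = 900e6;  /* assumed rates, bps       */

    printf("buffer full after %.1f s\n", time_until_full(m, r_recv, r_disk));
    printf("f_max at 900 Mbps: %.2f GB\n",
           max_file_size(m, r_recv, r_disk) / 8e9);
    printf("target r(recv) for the 3 GB file: %.0f Mbps\n",
           target_recv_rate(m, f, r_disk) / 1e6);
    return 0;
}
```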
Fig. 1. File versus buffer completion during the course of a transfer. Three paths are shown; the path that adheres to γ is optimal.

To see why this is the case, consider Fig. 1. The middle line represents a transfer which adheres to γ. If the transfer is to make full use of the buffer, then any deviations from γ at some point will require a slope greater than γ, since r(disk) is assumed to be at its peak for the duration of the transfer. Thus, r(recv) must be increased to compensate. Adjusting r(recv) to maintain γ while r(disk) fluctuates will keep the transfer optimal in the sense that r(recv) has the lowest possible maximum value, while total throughput for the data transfer is maximized. The CPU has a maximum processing rate, and by keeping the receiving rate from spiking, we remove the risk of overloading the CPU. Burstiness has been recognized as a limiting factor in the previous literature [22]. Additionally, the entirety of the buffer is used during the course of transfer, avoiding the situation of a suboptimal transfer rate due to unused buffer.

Making sure that the buffer is only full at the end of the data transfer has other important consequences as well. Many protocols fill up the application buffer as fast as possible, without regard to the state of the transfer. When the buffer fills completely, the receiver must issue a command to halt any further packets from being sent. Such a requirement is problematic due to the latency involved with this type of synchronous communication. With a 100-millisecond round-trip time (rtt) on a 10 Gbps link, the receiver would potentially have to drop in excess of 80,000 packets of size 1,500 bytes before successfully halting the sender. Furthermore, we do not want to use a higher peak bandwidth than is absolutely necessary for the duration of the transfer, especially if we are held to an imposed bandwidth cap by some external application or client. Holding to this γ ratio will achieve optimal throughput in terms of disk and CPU performance.

Fig. 2. Throughputs for various parameters.

The theoretical effects of various system parameters on a 3 GB transfer are shown in Fig. 2. Note how simply increasing the buffer size does not appreciably affect the throughput, but increasing both r(disk) and m provides the maximum performance gain. This graph also gives some indication of the computational and disk power required for transfers exceeding 1 Gbps for bulk transfer.

5.2 Receiving Kernel Buffers

Another source of packet loss occurs when the kernel's receiving buffer fills up. Since UDP was not designed for anything approximating reliable bulk transfer, the default buffer size for UDP on most operating systems is very small; on Linux 2.6.9, for example, it is set to a default of 131 kB. At 131 kB, a 1 Gbps transfer will quickly deplete a buffer of size m:

\[
  t = \frac{m}{r(recv)} = \frac{131\ \text{kB}}{1{,}000\ \text{Mbps}} \approx 1.0\ \text{ms}.
\]

Note that full depletion would only occur in the complete absence of any receiving calls from the application. Nevertheless, any CPU scheduling latency must be made to be shorter than this time, and the average latency rate must conform to the processing rate of the CPU such that the queue does not slowly build and overflow over time. A rigorous mathematical treatment of the kernel buffer would involve modeling the system as a queuing network, but this is beyond the scope of the paper.

Let t% represent the percentage of time during execution that the application is actively receiving packets, and r(CPU) be the rate at which the CPU can process packets:

\[
  t_{\%} \geq \frac{r(recv)}{r(CPU)}. \qquad (8)
\]

For example, if r(CPU) = 2 × r(recv), then the application will only need to be actively receiving packets from the buffer 50 percent of the time.

Rate modeling is an important factor in all of these calculations. Indeed, (4), (5), and (6) would be useless if one could not set a rate to a high degree of precision. TCP has been known to produce complicated models for throughputs, but fortunately, our discussion is greatly simplified by a delay-based rate that can be employed in congestion-free environments. Let L be the datagram size (set to the MTU) and t_d be the time interval between transmitted packets. Thus, we have

\[
  r(recv) = \frac{L}{t_d}. \qquad (9)
\]

In practice, it is difficult to use this equation to any degree of accuracy due to context switching and timing precision limitations. We found that by using system timers to measure the amount of time spent sending and sleeping for the difference between the desired time span and the sending time, we could set the time delay to our desired time with a predictably decreasing error rate.
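A minimal sketch of the timer-based pacing just described, assuming a POSIX environment: each iteration measures how long the send took and sleeps off the remainder of the interpacket delay t_d = L/rate from (9). send_one_datagram() is a placeholder, not a PA-UDP function.

```c
#include <time.h>

/* Placeholder for one UDP send of an L-bit datagram (sendto() in a real
 * implementation). */
static void send_one_datagram(void) { }

/* Current time in seconds from a monotonic clock. */
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Pace datagrams at rate_bps by sleeping off whatever remains of the
 * interpacket delay after each send. */
void paced_send(long npackets, double L_bits, double rate_bps)
{
    double td = L_bits / rate_bps;   /* Eq. (9) solved for t_d */
    for (long i = 0; i < npackets; i++) {
        double start = now_sec();
        send_one_datagram();
        double remaining = td - (now_sec() - start);
        if (remaining > 0) {
            struct timespec slp;
            slp.tv_sec  = (time_t)remaining;
            slp.tv_nsec = (long)((remaining - slp.tv_sec) * 1e9);
            nanosleep(&slp, NULL);
        }
    }
}
```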
We found the error rate as a percentage difference between the desired sending rate and the actual sending rate as

\[
  e(recv) = \frac{\beta}{t_d}, \qquad (10)
\]

where t_d is the desired interpacket delay and β is a value which can be determined programmatically during the transfer. We used a floating β, dynamic to the statistics of the transfer. Using the pthreads library under Linux 2.6.9, we found that β generally was about 2e-6 for each transfer. Taking this error into account, we can update our original rate formula to obtain

\[
  r^{*}(recv) = \frac{L}{t_d} - \frac{\beta L}{t_d^{2}}. \qquad (11)
\]

Fig. 3. Actual and predicted error rates versus interpacket delay.

Fig. 4. Send rate versus interpacket delay. Note that the actual and error-corrected predicted rates are nearly indistinguishable.

Fig. 3 shows the percentage error rate between (9) and the true sending rate. As shown by Projected, we notice that the error due to scheduling can be predicted with a good degree of certainty by (10). In Fig. 4, the different rate calculations for various interpacket delays can be seen. Equation (11), with the error rate factored in, is sufficiently accurate for our purpose.

It should be noted that under extremely high bandwidths, certain aspects of a system that one might take for granted begin to break down. For instance, many kernels support only up to microsecond precision in system-level timing functions. This is good enough for bandwidths lower than 1 Gbps, but unacceptable for higher capacity links. As shown in Fig. 5, the resolution of the timing mechanism has a profound impact on the granularity of the delay-based rates. Even a 1 Gbps channel with microsecond precision has some trouble matching the desired sending rate. This problem has been noted previously in [22] and has been usually solved by timing using clock cycles. SABUL/UDT uses this technique for increased precision.

Fig. 5. Effects of timing granularity.

Another source of breakdown can occur at the hardware level. To sustain a 10 Gbps file transfer for a 10 GB file, according to (6), a receiver must have a sequential disk write rate of

\[
  r(disk) = 10 \times 10^{9}\ \text{bps} \times \left(1 - \frac{m}{80 \times 10^{9}}\right), \qquad (12)
\]

where m is in bits and r(disk) in bits per second. We can see the extreme strain this would cause to a system. In the experiments described in [23], five Ultra-SCSI disks in RAID 0 could not achieve 800 Mbps for a 10 GB file. Assuming a sequential write speed of 1 Gbps, (12) shows that we would require a 9 GB buffer. Similarly, a 10 Gbps transfer rate would put considerable strain on the CPU. The exact relation to CPU utilization would depend on the complexity of the algorithms behind the protocol.
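As a quick check of the 9 GB figure, solve (12) for m with the left-hand side fixed at the assumed 1 Gbps sequential write speed:

```latex
\[
  10\times10^{9}\left(1-\frac{m}{80\times10^{9}}\right) = 1\times10^{9}
  \;\Longrightarrow\;
  \frac{m}{80\times10^{9}} = 0.9
  \;\Longrightarrow\;
  m = 72\times10^{9}\ \text{bits} = 9\ \text{GB}.
\]
```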
5.3 The Data Sender

Depending on the application, the sender may be locked into the same kinds of performance-limiting factors as the receiver. For disk-to-disk transfers, if the disk read rate is slower than the bandwidth of the channel, the host must rely on preallocated buffers before the transfer. This is virtually the same relationship as seen in (6). Unfortunately, if the bottleneck occurs at this point, nothing can be done but to improve the host's disk performance. Unlike the receiver, however, CPU latency and kernel buffers are less crucial to performance, and disk read speeds are almost universally faster than disk write speeds. Therefore, if buffers of comparable size are used (meaning γ will be the same), the burden will always be on the receiver to keep up with the sender and not vice versa. Note that this only applies for disk-to-disk transfers. If the data are being generated in real time, transfer speed limitations will depend on the computational
    • ECKART ET AL.: A DYNAMIC PERFORMANCE-BASED FLOW CONTROL METHOD FOR HIGH-SPEED DATA TRANSFER 119aspects of the data being generated. If the generation rate ishigher than the channel bandwidth, then the generation ratemust be throttled down or buffers must be used. Otherwise, ifthe generation rate is lower than channel bandwidth, abottleneck occurs at the sending side and maximum linkutilization may be impossible.6 ARCHITECTURE AND ALGORITHMSFirst, we discuss a generic architecture which takesadvantage of the considerations related in the previoussection. In the next three sections, a real-life implementationis presented and its performance is analyzed and comparedto other existing high-speed protocols.6.1 Rate Control AlgorithmsAccording to (6), given certain system characteristics of thehost receiving the file, an optimum rate can be calculated sothat the receiver will not run out of memory during thetransfer. Thus, a target rate can be negotiated at connectiontime. We propose a simple three-way handshake protocolwhere the first SYN packet from the sender asks for a rate. Thesender may be restricted to 500 Mbps, for instance. The Fig. 6. A dynamic rate control algorithm based on the buffer manage-receiver then checks its system parameters rðdiskÞ, rðrecvÞ, ment equations of Section 5.and m, and either accepts the supplied rate, or throttles therate down to the maximum allowed by the system. The Priority should be given to the receiving portion of thefollowing SYNACK packet would instruct the sender of a program given the limitations of the CPU. When the CPUchange, if any. cannot receive data as fast as they are sent, the kernel UDP Data could then be sent over the UDP socket at the target buffer will overflow. Thus, a multithreaded programrate, with the receiver checking for lost packets and sending structure is mandated so that disk activity can be decoupledretransmission requests periodically over the TCP channel with the receiving algorithm. Given that disk activity andupon discovery of lost packets. The requests must be spaced disk latencies are properly decoupled, appropriate schedul-out in time relative to the RTT of the channel, which can ing priority is given to the receiving thread, and rate controlalso be roughly measured during the initial handshake, so is properly implemented, optimal transfer rates will bethat multiple requests are not made for the same packet, obtained given virtually any two host configurations.while the packet has already been sent but not yet received. Reliable UDP works by assigning an ordered ID to eachThis is an example of a negative acknowledgment system, packet. In this way, the receiver knows when packets arebecause the sender assumes that the packets were received missing and how to group and write the packets to disk. Ascorrectly unless it receives data indicating otherwise. stipulated previously, the receiver gets packets from the TCP should also be used for dynamic rate control. The network and writes them to disk in parallel. Since most disksdisk throughput will vary over the course of a transfer, and have written speeds well below that of a high-speed network,as a consequence, should be monitored throughout. Rate a growing buffer of data waiting to be written to disk willadjustments can then proceed according to (6). To do this, occur. It is therefore a priority to maximize disk performance.disk activity, memory usage, and data rate must be If datagrams are received out of order, they can bemonitored at specified time intervals. 
The dynamic rate dynamically rearranged from within the buffer, but a systemcontrol algorithm is presented in Fig. 6. A specific waiting for a packet will have to halt disk activity at someimplementation is given in Section 7. point. In this scenario, we propose that when using PA-UDP, most of the time it is desirable from a performance standpoint6.2 Processing Packets to naively write packets to disk as they are received,Several practical solutions exist to decrease CPU latency for regardless of order. The file can then be reordered afterwardreceiving packets. Multithreading is an indispensable step from a log detailing the order of ID reception.to decouple other processes which have no sequential See Fig. 7 for pseudocode of this algorithm. Note that thisliability with one another. Minimizing I/O and system calls algorithm is only superior to in-order disk writing if there areand appropriately using mutexes can contribute to overall not too many packets lost and written out of order. If the rateefficiency. Thread priorities can often guarantee CPU control of PA-UDP functions as it should, little packet lossattentiveness on certain kernel scheduler implementations. should occur and this method should be optimal. Otherwise,Also, libraries exist which guarantee high-performance, it may be better to wait for incoming packets that have beenlow-latency threads [24], [25]. Regardless of the measures lost before flushing a section of the buffer to disk.mentioned above to curb latency, great care must be madeto keep the CPU attentive to the receiving portion of theprogram. Even the resulting latencies from a single print 7 IMPLEMENTATION DETAILSstatement inline with the receiving algorithm may cause the To verify the effectiveness of our proposed protocol, webuildup and eventual overflow of the UDP buffer. have implemented PA-UDP according to the architecture Authorized licensed use limited to: Tamil Nadu College of Engineering. Downloaded on July 10,2010 at 03:56:09 UTC from IEEE Xplore. Restrictions apply.
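Section 6.2 above proposes writing datagrams to disk in the order they arrive and restoring file order afterward from a log of received IDs (the post-file processing algorithm of Fig. 7). The fragment below is a minimal sketch of that bookkeeping under assumed names, not the PA-UDP implementation; a post-transfer pass would read the log and move each record to offset ID × record length in the final file.

```c
#include <stdint.h>
#include <stdio.h>

/* Append one datagram to the data file in arrival order and record its ID
 * and length in a plain-text log for the post-transfer reordering pass. */
void log_and_append(FILE *data, FILE *log, uint64_t id,
                    const void *payload, size_t len)
{
    fwrite(payload, 1, len, data);
    fprintf(log, "%llu %zu\n", (unsigned long long)id, len);
}
```

Writing sequentially and repairing the order later keeps the disk streaming at its sequential write speed, which is why the paper prefers this over waiting for missing packets when loss is low.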
    • 120 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 21, NO. 1, JANUARY 2010 Fig. 9. PA-UDP: the data receiver.Fig. 7. The postfile processing algorithm in pseudocode. UDP kernel buffer was increased to 16 Megabytes from thediscussed in Section 6. Written mostly in C for use in Linux default of 131 kB. We found this configuration adequate forand Unix environments, PA-UDP is a multithreaded transfers of any size. Timing was done with microsecondapplication designed to be self-configuring with minimal precision by using the gettimeofday function. Note, however,human input. We have also included a parametric latency that better timing granularity is needed for the applicationsimulator, so we could test the effects of high latencies over to support transfers in excess of 1 Gbps.a low-latency Gigabit LAN. The PA-UDP protocol handles only a single client at a7.1 Data Flow and Structures time, putting the others in a wait queue. Thus, the threads are not shared among multiple connections. Since our goalA loose description of data flow and important data was maximum link utilization over a private network, westructures for both the sender and receiver is shown in were not concerned with multiple users at a time.Figs. 8 and 9. The sender sends data through the UDPsocket, which is asynchronous, while periodically probing 7.2 Disk Activitythe TCP socket for control and retransmission requests. A In the disk write threads, it is very important from abuffer is maintained, so the sender does not have to performance standpoint that writing is done synchronouslyreread from disk when a retransmitted packet is needed. with the kernel. File streams normally default to beingAlternatively, when the data are generated, a buffer might buffered, but in our case, this can have adverse effects onbe crucial to the integrity of the received data if data are CPU latencies. Normally, the kernel allocates as much spacetaken from sensors or other such nonreproducible events. as necessary in unused RAM to allow for fast returns on At the receiver end, as shown in Fig. 9, there are six threads. disk writing operations. The RAM buffer is then asynchro-Threads serve to provide easily attainable parallelism, nously written to disk, depending on which algorithm iscrucially hiding latencies. Furthermore, the use of threading used, write-through, or write-back. We do not care if ato achieve periodicity of independent functions simplifies the system call to write to disk halts thread activity, becausesystem code. As the Recv thread receives packets, two Disk disk activity is decoupled from data reception and haltingthreads write them to disk in parallel. Asynchronously, the will not affect the rate at which packets are received. Thus,Rexmt thread sends retransmit requests, and the Rate control it is not pertinent that a buffer be kept in unused RAM. Inthread profiles and sends the current optimum sending rate fact, if the transfer is large enough, eventually, this willto the sender. The File processing thread ensures that the data cause a premature flushing of the kernel’s disk buffer,are in the correct order once the transfer is over. which can introduce unacceptably high latencies across all The Recv thread is very sensitive to CPU scheduling threads. We found this to be the cause of many droppedlatency, and thus, should be given high scheduling priority packets even for file transfers having sizes less than theto prevent packet loss from kernel buffer overflows. The application buffers. 
Our solution was to force synchrony with repeated calls to fsync. As shown in Fig. 9, we employed two parallel threads to write to disk. Since part of the disk thread’s job is to corral data together and do memory management, better effi- ciency can be achieved by having one thread do memory management, while the other is blocked by the hard disk and vice versa. A single-threaded solution would introduce a delay during memory management. Parallel disk threads remove this delay because execution is effectively pipe- lined. We found that the addition of a second thread significantly augmented disk performance. Since data may be written out of order due to packet loss, it is necessary to have a reordering algorithm which works to put the file in its proper order. The algorithm discussedFig. 8. PA-UDP: the data sender. in Section 6 is given in Fig. 7. Authorized licensed use limited to: Tamil Nadu College of Engineering. Downloaded on July 10,2010 at 03:56:09 UTC from IEEE Xplore. Restrictions apply.
7.3 Retransmission and Rate Control

TCP is used for both retransmission requests and rate control. PA-UDP simply waits for a set period of time, and then, makes grouped retransmission requests if necessary. The retransmission packet structure is identical to Hurricane [39]. An array of integers is used, denoting datagram IDs that need to be retransmitted. The sender prioritizes these requests, locking down the UDP data flow with a mutex while sending the missed packets.

It is not imperative that retransmission periods be calibrated except in cases where the sending buffer is small or there is a very large rtt. Care needs to be made to make sure that the rtt is not more than the retransmission wait period. If this is the case, requests will be sent multiple times before the sender can possibly resend them, resulting in duplicate packets. Setting the retransmission period at least five times higher than the rtt ensures that this will not happen while preserving the efficacy of the protocol.

The retransmission period does directly influence the minimum size of the sending buffer, however. For instance, if a transfer is disk-to-disk and the sender does not have a requested packet in the application buffer, a seek time cost will incur when the disk is accessed nonsequentially for the packet. In this scenario, the retransmission request would considerably slow down the transfer during this time. This can be prevented by either increasing the application buffer or sufficiently lowering the retransmission sleep period.

As outlined in Fig. 6, the rate control is computationally inexpensive. Global count variables are updated per received datagram and per written datagram. A profile is stored before and after a set sleep time. After the sleep time, the pertinent data can be constructed, including r(recv), r(disk), m, and f. These parameters are used in conjunction with (6) and (7) to update the sending rate accordingly. The request is sent over the TCP socket in the simple form "RATE: R," where R is an integer speed in megabits per second (Mbps). The sender receives the packet in the TCP monitoring thread and derives new sleep times from (11). Specifically, the equation used by the protocol is

\[
  t_d = \frac{L + \sqrt{L^{2} - 4\beta L R}}{2R}, \qquad (13)
\]

where R represents the newly requested rate.

As per the algorithm in Fig. 6, if the memory left is larger than the amount left to be transferred, the rate can be set at the allowed maximum.
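Transcribed into C under the notation used here (β is the measured scheduling-error term of (10), roughly 2e-6 in the authors' tests), the sender's new interpacket delay for a requested rate R follows directly from (13); the function name is illustrative.

```c
#include <math.h>

/* Eq. (13): interpacket delay t_d that achieves rate R (bits per second)
 * for L-bit datagrams once the scheduling error beta is accounted for;
 * it is the positive root of R = L/t_d - beta*L/t_d^2. */
double interpacket_delay(double L_bits, double R_bps, double beta)
{
    return (L_bits + sqrt(L_bits * L_bits - 4.0 * beta * L_bits * R_bps))
           / (2.0 * R_bps);
}
```

With beta set to 0, this reduces to t_d = L/R, the uncorrected delay of (9).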
7.4 Latency Simulator

We included a latency simulator to more closely mimic the characteristics of high-rtt, high-speed WANs over low-latency, high-speed LANs. The reasons for the simulator are twofold: the first reason is simply a matter of convenience, given that testing could be done locally, on a LAN. The second reason is that simulations provide the means for parametric testing which would otherwise be impossible in a real environment. In this way, we can test for a variety of hypothetical rtt's without porting the applications to different networks. We can also use the simulator to introduce variance in latency according to any parametric distribution.

The simulator works by intercepting and time stamping every packet sent to a socket. A loop runs in the background which checks to see if the current time minus the time stamp is greater than the desired latency. If the packet has waited for the desired latency, it is sent over the socket. We should note that the buffer size needed for the simulator is related to the desired latency and the sending rate. Let b be the size of the latency buffer and t_l be the average latency:

\[
  b \approx r(send) \times t_l. \qquad (14)
\]

By testing high-latency effects in a parametric way, we can find out how adaptable the timing aspects are. For instance, if the retransmission thread has a static sleep time before resending retransmission requests, a high latency could result in successive yet unnecessary requests before the sender could send back the dropped packets. The profiling power of the rate control algorithm is also somewhat affected by latencies, since ideally, the performance monitor would be real-time. In our tests, we found that PA-UDP could run with negligibly small side effects with rtt's over 1 second. This is mainly due to the relatively low variance of r(disk) that we observed on our systems.

7.5 Memory Management

For high-performance applications such as these, efficient memory management is crucial. It is not necessary to delete packets which have been written to disk, since this memory can be reallocated by the application when future packets come through the network. Therefore, we used a scheme whereby each packet's memory address is marked once the data it contains are written to disk. When the network receives a new packet, if a marked packet exists, the new packet is assigned to the old allocated memory of the marked packet. In this way, we do not have to use the C function free until the transfer is over. The algorithm is presented in Fig. 10.

Fig. 10. Memory management algorithm.

8 RESULTS AND ANALYSIS

8.1 Throughput and Packet Loss Performance

We tested PA-UDP over a Gigabit Ethernet switch on a LAN. Our setup consisted of two Dell PowerEdge 850's, each equipped with a 1 Gigabit NIC, dual Pentium 4 processors, 1 GB of RAM, and a 7,200 RPM IDE hard drive. We compared PA-UDP to three UDP-based protocols: Tsunami, Hurricane, and UDT (UDT4). Five trials were conducted at each file size for both protocols using the same parameters for buffers and speeds. We used buffers 750 MB large for each protocol and generated test data both on-the-
    • 122 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 21, NO. 1, JANUARY 2010 TABLE 1 Throughput Averages TABLE 2 Packet Loss Averagesfly and from the disk. The average throughputs and packet Tables 1 and 2. UDT does not seem to have these problems,loss percentages are given in Tables 1 and 2, respectively, but shows lower throughputs than PA-UDP.for the case when data were generated dynamically. Theresults are very similar for disk-to-disk transfers. 8.2 CPU Utilization PA-UDP performs favorably to the other protocols, As discussed in Section 5, one of the primary benefits of ourexcelling at each file size. Tsunami shows high throughputs, flow control method is its low CPU utilization. The flowbut fails to be consistent at higher file sizes due to large control limits the transfer speeds to the optimal range for theretransmission errors. At larger file sizes, Tsunami fails to current hardware profile of the host. Other protocols withoutcomplete the transfers, instead restarting ad infinitum due to this type of flow control essentially have to “discover” theinternal logic decisions for retransmission. Hurricane hardware-imposed maximum by running at a unsustainablecompletes all transfers, but does not perform consistently rate, and then, reactively curbing throughput when packetand suffers dramatically due to high packet loss. UDT loss occurs. In contrast to other high-speed protocols, PA-shows consistent and stable throughputs, especially for UDP maintains a more stable and more efficient rate.large transfers, but adopts a somewhat more conservative A simple CPU utilization average during a transferrate control than the others. would be insufficient to compare the various protocols’ In addition to having better throughputs as compared to computational efficiency, since higher throughputs affectTsunami, Hurricane, and UDT, PA-UDP also has virtually CPU utilization adversely. Thus, a transfer that spends mostzero packet loss due to buffer overflow. This is a direct result of its time waiting for the retransmission of lost packetsof the rate control algorithm from Fig. 6, which preemptively may look more efficient from a CPU utilization perspectivethrottles bandwidth before packet loss from buffer overflows though, in fact, it would perform much worse. To alleviateoccurs. Tsunami and Hurricane perform poorly in these tests this problem, we introduce a measure of CPU utilization perlargely due to unstable rate control. When the receiving rate is units of throughput. Using this metric, a protocol whichset above the highest rate sustainable by the hardware, packet incurs high packet loss and spends time idling would beloss eventually occurs. Since the transmission rates are punished, and its computational efficiency would be morealready at or above the maximum capable by the hardware, accurately reflected. Fig. 11a shows this metric comparedany extra overhead incurred by retransmission requests andthe handling of retransmitted packets causes even more for several different high-speed protocols at three differentpacket loss, often spiraling out of control. This process can file sizes over three different runs each. To obtain thelead to final packet retransmission rates of 100 percent or throughput efficiency, the average CPU utilization ismore in some cases, depending on the file size and protocol divided by the throughput of the transfer. For complete-employed. 
Tsunami has a simple protection scheme against ness, we included a popular high-speed TCP-basedretransmission spiraling that involves completely restarting application, BBCP, as well as the other UDP-based proto-the transfer after too much packet loss has occurred. Starting cols. The results shown are from the receiver, since it is thethe transfer over voids the pool of packets to be retransmitted most computationally burdened. PA-UDP is considerablywith the hope that the packet loss was a one-time error. more efficient than the other protocols, with the discre-Unfortunately, this scheme causes the larger files in our tests pancy being most noticeable at 1 GB. The percentageto endlessly restart and, thus, never complete, as shown in utilization is averaged across both CPU’s in our testbed. Authorized licensed use limited to: Tamil Nadu College of Engineering. Downloaded on July 10,2010 at 03:56:09 UTC from IEEE Xplore. Restrictions apply.
    • ECKART ET AL.: A DYNAMIC PERFORMANCE-BASED FLOW CONTROL METHOD FOR HIGH-SPEED DATA TRANSFER 123Fig. 11. (a) Percentage CPU utilization per megabits per second for three file sizes: 100, 1,000, and 10,000 MB. PA-UDP can drive data faster at aconsistently lower computational cost. Note that we could not get UDT or Tsunami to successfully complete a 10 GB transfer, so the bars are notshown. (b) A section of a CPU trace for three transfers of a 10 GB file using PA-UDP, Hurricane, and BBCP. PA-UDP not only incurs the lowest CPUutilization, but it is also the most stable. TABLE 3 Throughputs to Predicted Maxima To give a more complete picture of PA-UDP’s efficiency, can be attributed to the impreciseness of the measuringFig. 11b shows a CPU utilization trace over a period of time methods. Nevertheless, it is constructive to see that theduring a 10 GB transfer for the data receiver. Two trials are transfers are at the predicted maxima given the systemrepresented for each of the three applications: PA-UDP, characteristics profiled during the transfer.Hurricane, and BBCP. PA-UDP is not only consistently lesscomputationally expensive than other two protocols duringthe course of the transfer, but it is also the most stable. 9 RELATED WORKHurricane, for instance, jumps between 50 and 100 percent High-bandwidth data transport is required for large-scaleCPU utilization during the course of the transfer. We note distributed scientific applications. The default implementa-here also that BBCP, a TCP-based application, outperforms tions of Transmission Control Protocol (TCP) [30] and UserHurricane, a UDP-based protocol implementation. Though Datagram Protocol (UDP) do not adequately meet theseUDP-based protocols typically have less overhead, which is requirements. While several Internet backbone links havethe main impetus for moving from TCP to UDP, the I/O been upgraded to OC-192 and 10GigE WAN PHY, end usersefficiency of a protocol is also very important and BBCP have not experienced proportional throughput increases. Theappears to have better I/O efficiency compared to Hurri- weekly traffic measurements reported in [41] reveal that mostcane. Again, the CPU utilization is averaged between both of bulk TCP traffic carrying more than 10 MB of data onprocessors on the Dell PowerEdge 850. Internet2 only experiences throughput of 5 Mbps or less. For8.3 Predicted Maxima control applications, TCP may result in jittery dynamics on lossy links [37].To demonstrate how PA-UDP achieves the predicted Currently, there are two approaches to transport protocolmaximum performance, Table 3 shows the rate-controlled design: TCP enhancements and UDP-based transport withthroughputs for various file sizes in relation to the predictedmaximum throughput given disk performance over the non-Additive Increase Multiplicative Decrease (AIMD) con-time of the transfer. Again, a buffer of 750 Megabytes was trol. In the recent years, many changes to TCP have beenused at the receiver. introduced to improve its performance for high-speed net- For 400, 800, and 1,000 Megabyte transfers, the dis- works [29]. Efforts by Kelly have resulted in a TCP variantcrepancy between predicted and real comes from the fact called Scalable TCP [32]. High-Speed TCP Low Prioritythat the transfers were saturating the link’s capacity. The (HSTCP-LP) is a TCP-LP version with an aggressive windowrest of the transfers showed that the true throughputs were increase policy targeted toward high-bandwidth and long-very close to the predicted maxima. 
The slight error present distance networks [33]. The Fast Active-Queue-Management Authorized licensed use limited to: Tamil Nadu College of Engineering. Downloaded on July 10,2010 at 03:56:09 UTC from IEEE Xplore. Restrictions apply.
    • 124 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 21, NO. 1, JANUARY 2010Scalable TCP (FAST) is based on a modification of TCP Vegas ACKNOWLEDGMENTS[26], [34]. The Explicit Control Protocol (XCP) has a conges- This research was supported in part by the US Nationaltion control mechanism designed for networks with a high Science Foundation under grants OCI-0453438 and CNS-BDP [31], [45] and requires hardware support in routers. TheStream Control Transmission Protocol (SCTP) is a new 0720617 and a Chinese 973 project under grant numberstandard for robust Internet data transport proposed by the 2004CB318203.Internet Engineering Task Force [42]. Other efforts in this areaare devoted to TCP buffer tuning, which retains the core REFERENCESalgorithms of TCP but adjusts the send or receive buffer sizes [1] N.S.V. Rao, W.R. Wing, S.M. Carter, and Q. Wu, “Ultrascienceto enforce supplementary rate control [27], [36], [40]. Net: Network Testbed for Large-Scale Science Applications,” Transport protocols based on UDP have been developed IEEE Comm. Magazine, vol. 43, no. 11, pp. S12-S17, Nov. 2005.by using various rate control algorithms. Such works [2] X. Zheng, M. Veeraraghavan, N.S.V. Rao, Q. Wu, and M. Zhu, “CHEETAH: Circuit-Switched High-Speed End-to-End Transportinclude SABUL/UDT [16], [17], Tsunami [19], Hurricane Architecture Testbed,” IEEE Comm. Magazine, vol. 43, no. 8, pp. 11-[39], FRTP [35], and RBUDP [18] (see [20], [28] for an 17, Aug. 2005. [3] On-Demand Secure Circuits and Advance Reservation System,overview). These transport methods are implemented over http://www.es.net/oscars, 2009.UDP at the application layer for easy deployment. The main [4] User Controlled LightPath Provisioning, http://phi.badlab.crc.advantage of these protocols is that their efficiency in ca/uclp, 2009. [5] Enlightened Computing, www.enlightenedcomputing.org, 2009.utilizing the available bandwidth is much higher than that [6] Dynamic Resource Allocation via GMPLS Optical Networks,achieved by TCP. On the other hand, these protocols may http://dragon.maxgigapop.net, 2009.produce non-TCP-friendly flows and are better suited for [7] JGN II: Advanced Network Testbed for Research and Develop- ment, http://www.jgn.nict.go.jp, 2009.dedicated network environments. [8] Geant2, http://www.geant2.net, 2009. PA-UDP falls under the class of reliable UDP-based [9] Hybrid Optical and Packet Infrastructure, http://networks.protocols and like the others is implemented at the internet2.edu/hopi, 2009.application layer. PA-UDP differentiates itself from the [10] Z.-L. Zhang, “Decoupling QoS Control from Core Routers: A Novel Bandwidth Broker Architecture for Scalable Support of Guaran-other high-speed reliable UDP protocols by intelligent teed Services,” Proc. ACM SIGCOMM ’00, pp. 71-83, 2000.buffer management based on dynamic system profiling [11] N.S.V. Rao, Q. Wu, S. Ding, S.M. Carter, W.R. Wing, A. Banerjee,considering the impact of network, CPU, and disk. D. Ghosal, and B. Mukherjee, “Control Plane for Advance Bandwidth Scheduling in Ultra High-Speed Networks,” Proc. IEEE INFOCOM, 2006. [12] K. Wehrle, F. Pahlke, H. Ritter, D. Muller, and M. Bechler, Linux10 CONCLUSIONS Network Architecture. Prentice-Hall, Inc., 2004.The protocol based on the ideas in this paper has shown that [13] S. 
Floyd, “RFC 2914: Congestion Control Principles,” Category: Best Current Practise, ftp://ftp.isi.edu/in-notes/rfc2914.txt, Sept.transfer protocols designed for high-speed networks should 2000.not only rely on good theoretical performance but also be [14] V. Jacobson, R. Braden, and D. Borman, “RFC 2647: Tcpintimately tied to the system hardware on which they run. Extensions for High Performance,” United States, http:// www.ietf.org/rfc/rfc1323.txt, 1992.Thus, a high-performance protocol should adapt in different [15] A. Hanushevsky, “Peer-to-Peer Computing for Secure Highenvironments to ensure maximum performance, and transfer Performance Data Cop,” http://www.osti.gov/servlets/purl/rates should be set appropriately to proactively curb packet 826702-5UdHlZ/native/, Apr. 2007. [16] R.L. Grossman, M. Mazzucco, H. Sivakumar, Y. Pan, and Q.loss. If this relationship is properly understood, optimal Zhang, “Simple Available Bandwidth Utilization Library fortransfer rates can be achieved over high-speed, high-latency High-Speed Wide Area Networks,” J. Supercomputing, vol. 34,networks at all times without excessive amounts of user no. 3, pp. 231-242, 2005. [17] Y. Gu and R.L. Grossman, “UDT: UDP-Based Data Transfer forcustomization and parameter guesswork. High-Speed Wide Area Networks,” Computer Networks, vol. 51, In addition to low packet loss and high throughput, PA- no. 7, pp. 1777-1799, 2007.UDP has shown to be computationally efficient in terms of [18] E. He, J. Leigh, O.T. Yu, and T.A. DeFanti, “Reliable Blast UDP: Predictable High Performance Bulk Data Transfer,” Proc. IEEE Int’lprocessing power per throughput. The adaptive nature of Conf. Cluster Computing, pp. 317-324, http://csdl.computer.org/,PA-UDP shows that it can scale computationally, given 2002.different hardware constraints. PA-UDP was tested against [19] M. Meiss, “Tsunami: A High-Speed Rate-Controlled Protocolmany other high-speed reliable UDP protocols, and also for File Transfer,” www.evl.uic.edu/eric/atp/TSUNAMI.pdf/, 2009.against BBCP, a high-speed TCP variant. Among all [20] M. Goutelle, Y. Gu, and E. He, “A Survey of Transport Protocolsprotocols tested, PA-UDP consistently outperformed the Other than Standard tcp,” citeseer.ist.psu.edu/he05survey.html,other protocols in CPU utilization efficiency. 2004. [21] D. Newman, “RFC 2647: Benchmarking Terminology for Firewall The algorithms presented in this paper are computation- Performance,” www.ietf.org/rfc/rfc2647.txt, 1999.ally inexpensive and can be added into existing protocols [22] Y. Gu and R.L. Grossman, “Optimizing udp-Based Protocolwithout much recoding as long as the protocol supports Implementations,” Proc. Third Int’l Workshop Protocols for Fastrate control via interpacket delay. Additionally, these Long-Distance Networks (PFLDnet), 2005. [23] R.L. Grossman, Y. Gu, D. Hanley, X. Hong, and B. Krishnaswamy,techniques can be used to maximize throughput for bulk “Experimental Studies of Data Transport and Data Access oftransfer on Gigabit LANs, where disk performance is a Earth-Science Data over Networks with High Bandwidth Delaylimiting factor. Our preliminary results are very promising, Products,” Computer Networks, vol. 46, no. 3, pp. 411-421, http://with PA-UDP matching the predicted maximum perfor- dx.doi.org/10.1016/j.comnet.2004.06.016, 2004. [24] A.C. Heursch and H. Rzehak, “Rapid Reaction Linux: Linux withmance. The prototype code for PA-UDP is available online Low Latency and High Timing Accuracy,” Proc. Fifth Ann. 
PA-UDP falls under the class of reliable UDP-based protocols and, like the others, is implemented at the application layer. PA-UDP differentiates itself from the other high-speed reliable UDP protocols through intelligent buffer management based on dynamic system profiling that accounts for the impact of the network, the CPU, and the disk.
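The profiling just described amounts, at its simplest, to bounding the sending rate by the slowest of these components and deriving the corresponding interpacket delay. The fragment below is only a back-of-the-envelope illustration of that idea, not the dynamic model developed in this paper; all three component rates are invented placeholder values.

    /* Back-of-the-envelope sketch: cap the send rate at the slowest
     * component (network, disk, or CPU) and derive the interpacket
     * delay that enforces it. The rates are placeholders, not
     * measurements. */
    #include <stdio.h>

    static double min3(double a, double b, double c)
    {
        double m = a < b ? a : b;
        return m < c ? m : c;
    }

    int main(void)
    {
        double link_bps = 10.0e9;  /* placeholder: 10 Gbps network link       */
        double disk_bps =  3.2e9;  /* placeholder: ~400 MB/s sustained writes */
        double cpu_bps  =  6.0e9;  /* placeholder: CPU-limited processing     */
        double payload_bits = 8800 * 8.0;

        double target_bps = min3(link_bps, disk_bps, cpu_bps);
        double delay_us   = payload_bits / target_bps * 1e6;

        printf("target rate: %.2f Gbps, interpacket delay: %.2f us\n",
               target_bps / 1e9, delay_us);
        return 0;
    }

With these numbers the disk is the bottleneck, so the sender would pace 8,800-byte datagrams roughly 22 microseconds apart rather than at the 10 Gbps line rate.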
10 CONCLUSIONS

The protocol built on the ideas in this paper has shown that transfer protocols designed for high-speed networks should not only rely on good theoretical performance but also be intimately tied to the system hardware on which they run. A high-performance protocol should therefore adapt to different environments to ensure maximum performance, and transfer rates should be set appropriately to proactively curb packet loss. If this relationship is properly understood, optimal transfer rates can be achieved over high-speed, high-latency networks at all times without excessive user customization and parameter guesswork.

In addition to low packet loss and high throughput, PA-UDP has been shown to be computationally efficient in terms of processing power per unit of throughput. Its adaptive design allows it to scale computationally under different hardware constraints. PA-UDP was tested against many other high-speed reliable UDP protocols, as well as against BBCP, a high-speed TCP variant, and it consistently outperformed them in CPU utilization efficiency.

The algorithms presented in this paper are computationally inexpensive and can be added to existing protocols with little recoding, as long as the protocol supports rate control via interpacket delay. These techniques can also be used to maximize throughput for bulk transfer on Gigabit LANs, where disk performance is a limiting factor. Our preliminary results are very promising, with PA-UDP matching the predicted maximum performance. The prototype code for PA-UDP is available online at http://iweb.tntech.edu/hexb/pa-udp.tgz.

ACKNOWLEDGMENTS

This research was supported in part by the US National Science Foundation under grants OCI-0453438 and CNS-0720617 and by a Chinese 973 project under grant number 2004CB318203.

REFERENCES

[1] N.S.V. Rao, W.R. Wing, S.M. Carter, and Q. Wu, "UltraScience Net: Network Testbed for Large-Scale Science Applications," IEEE Comm. Magazine, vol. 43, no. 11, pp. S12-S17, Nov. 2005.
[2] X. Zheng, M. Veeraraghavan, N.S.V. Rao, Q. Wu, and M. Zhu, "CHEETAH: Circuit-Switched High-Speed End-to-End Transport Architecture Testbed," IEEE Comm. Magazine, vol. 43, no. 8, pp. 11-17, Aug. 2005.
[3] On-Demand Secure Circuits and Advance Reservation System, http://www.es.net/oscars, 2009.
[4] User Controlled LightPath Provisioning, http://phi.badlab.crc.ca/uclp, 2009.
[5] Enlightened Computing, www.enlightenedcomputing.org, 2009.
[6] Dynamic Resource Allocation via GMPLS Optical Networks, http://dragon.maxgigapop.net, 2009.
[7] JGN II: Advanced Network Testbed for Research and Development, http://www.jgn.nict.go.jp, 2009.
[8] Geant2, http://www.geant2.net, 2009.
[9] Hybrid Optical and Packet Infrastructure, http://networks.internet2.edu/hopi, 2009.
[10] Z.-L. Zhang, "Decoupling QoS Control from Core Routers: A Novel Bandwidth Broker Architecture for Scalable Support of Guaranteed Services," Proc. ACM SIGCOMM '00, pp. 71-83, 2000.
[11] N.S.V. Rao, Q. Wu, S. Ding, S.M. Carter, W.R. Wing, A. Banerjee, D. Ghosal, and B. Mukherjee, "Control Plane for Advance Bandwidth Scheduling in Ultra High-Speed Networks," Proc. IEEE INFOCOM, 2006.
[12] K. Wehrle, F. Pahlke, H. Ritter, D. Muller, and M. Bechler, Linux Network Architecture. Prentice-Hall, 2004.
[13] S. Floyd, "RFC 2914: Congestion Control Principles," Best Current Practice, ftp://ftp.isi.edu/in-notes/rfc2914.txt, Sept. 2000.
[14] V. Jacobson, R. Braden, and D. Borman, "RFC 1323: TCP Extensions for High Performance," http://www.ietf.org/rfc/rfc1323.txt, 1992.
[15] A. Hanushevsky, "Peer-to-Peer Computing for Secure High Performance Data Copying," http://www.osti.gov/servlets/purl/826702-5UdHlZ/native/, Apr. 2007.
[16] R.L. Grossman, M. Mazzucco, H. Sivakumar, Y. Pan, and Q. Zhang, "Simple Available Bandwidth Utilization Library for High-Speed Wide Area Networks," J. Supercomputing, vol. 34, no. 3, pp. 231-242, 2005.
[17] Y. Gu and R.L. Grossman, "UDT: UDP-Based Data Transfer for High-Speed Wide Area Networks," Computer Networks, vol. 51, no. 7, pp. 1777-1799, 2007.
[18] E. He, J. Leigh, O.T. Yu, and T.A. DeFanti, "Reliable Blast UDP: Predictable High Performance Bulk Data Transfer," Proc. IEEE Int'l Conf. Cluster Computing, pp. 317-324, http://csdl.computer.org/, 2002.
[19] M. Meiss, "Tsunami: A High-Speed Rate-Controlled Protocol for File Transfer," www.evl.uic.edu/eric/atp/TSUNAMI.pdf/, 2009.
[20] M. Goutelle, Y. Gu, and E. He, "A Survey of Transport Protocols Other than Standard TCP," citeseer.ist.psu.edu/he05survey.html, 2004.
[21] D. Newman, "RFC 2647: Benchmarking Terminology for Firewall Performance," www.ietf.org/rfc/rfc2647.txt, 1999.
[22] Y. Gu and R.L. Grossman, "Optimizing UDP-Based Protocol Implementations," Proc. Third Int'l Workshop Protocols for Fast Long-Distance Networks (PFLDnet), 2005.
[23] R.L. Grossman, Y. Gu, D. Hanley, X. Hong, and B. Krishnaswamy, "Experimental Studies of Data Transport and Data Access of Earth-Science Data over Networks with High Bandwidth Delay Products," Computer Networks, vol. 46, no. 3, pp. 411-421, http://dx.doi.org/10.1016/j.comnet.2004.06.016, 2004.
[24] A.C. Heursch and H. Rzehak, "Rapid Reaction Linux: Linux with Low Latency and High Timing Accuracy," Proc. Fifth Ann. Linux Showcase & Conf. (ALS '01), p. 4, 2001.
[25] "Low Latency: Eliminating Application Jitter with Solaris," White Paper, Sun Microsystems, May 2007.
[26] L.S. Brakmo and S.W. O'Malley, "TCP Vegas: New Techniques for Congestion Detection and Avoidance," Proc. ACM SIGCOMM '94, pp. 24-35, Oct. 1994.
[27] T. Dunigan, M. Mathis, and B. Tierney, "A TCP Tuning Daemon," Proc. Supercomputing Conf.: High-Performance Networking and Computing, Nov. 2002.
[28] A. Falk, T. Faber, J. Bannister, A. Chien, R. Grossman, and J. Leigh, "Transport Protocols for High Performance," Comm. ACM, vol. 46, no. 11, pp. 43-49, 2002.
[29] S. Floyd, "HighSpeed TCP for Large Congestion Windows," Internet Draft, Feb. 2003.
[30] V. Jacobson, "Congestion Avoidance and Control," Proc. ACM SIGCOMM '88, pp. 314-329, 1988.
[31] D. Katabi, M. Handley, and C. Rohrs, "Internet Congestion Control for Future High-Bandwidth-Delay Product Environments," Proc. ACM SIGCOMM '02, www.acm.org/sigcomm/sigcomm2002/papers/xcp.pdf, Aug. 2002.
[32] T. Kelly, "Scalable TCP: Improving Performance in Highspeed Wide Area Networks," Proc. Workshop Protocols for Fast Long-Distance Networks, Feb. 2003.
[33] A. Kuzmanovic, E. Knightly, and R.L. Cottrell, "HSTCP-LP: A Protocol for Low-Priority Bulk Data Transfer in High-Speed High-RTT Networks," Proc. Second Int'l Workshop Protocols for Fast Long-Distance Networks, Feb. 2004.
[34] S.H. Low, L.L. Peterson, and L. Wang, "Understanding Vegas: A Duality Model," J. ACM, vol. 49, no. 2, pp. 207-235, Mar. 2002.
[35] A.P. Mudambi, X. Zheng, and M. Veeraraghavan, "A Transport Protocol for Dedicated End-to-End Circuits," Proc. IEEE Int'l Conf. Comm., 2006.
[36] R. Prasad, M. Jain, and C. Dovrolis, "Socket Buffer Auto-Sizing for High-Performance Data Transfers," J. Grid Computing, vol. 1, no. 4, pp. 361-376, 2004.
[37] N.S.V. Rao, J. Gao, and L.O. Chua, "On Dynamics of Transport Protocols in Wide Area Internet Connections," chapter in Complex Dynamics in Communication Networks, Springer-Verlag, 2004.
[38] N. Rao, W. Wing, Q. Wu, N. Ghani, Q. Liu, T. Lehman, C. Guok, and E. Dart, "Measurements on Hybrid Dedicated Bandwidth Connections," Proc. High-Speed Networks Workshop, pp. 41-45, May 2007.
[39] N.S.V. Rao, Q. Wu, S.M. Carter, and W.R. Wing, "High-Speed Dedicated Channels and Experimental Results with Hurricane Protocol," Annals of Telecomm., vol. 61, nos. 1/2, pp. 21-45, 2006.
[40] J. Semke, J. Mahdavi, and M. Mathis, "Automatic TCP Buffer Tuning," Proc. ACM SIGCOMM '98, Aug. 1998.
[41] S. Shalunov and B. Teitelbaum, "A Weekly Version of the Bulk TCP Use and Performance on Internet2," Internet2 Netflow: Weekly Reports, 2004.
[42] R. Stewart and Q. Xie, "Stream Control Transmission Protocol," IETF RFC 2960, www.ietf.org/rfc/rfc2960.txt, Oct. 2000.
[43] S. Floyd, "HighSpeed TCP for Large Congestion Windows," citeseer.ist.psu.edu/article/floyd02highspeed.html, 2002.
[44] T. Kelly, "Scalable TCP: Improving Performance in Highspeed Wide Area Networks," ACM SIGCOMM Computer Comm. Rev., vol. 33, no. 2, pp. 83-91, 2003.
[45] Y. Zhang and M. Ahmed, "A Control Theoretic Analysis of XCP," Proc. IEEE INFOCOM, pp. 2831-2835, 2005.
[46] C. Jin, D.X. Wei, S.H. Low, J.J. Bunn, H.D. Choe, J.C. Doyle, H.B. Newman, S. Ravot, S. Singh, F. Paganini, G. Buhrmaster, R.L. Cottrell, O. Martin, and W.-c. Feng, "FAST TCP: From Theory to Experiments," IEEE Network, vol. 19, no. 1, pp. 4-11, Jan./Feb. 2005.
Ben Eckart received the BS degree in computer science from Tennessee Technological University, Cookeville, in 2008. He is currently a graduate student in electrical engineering at Tennessee Technological University in the Storage Technology Architecture Research (STAR) Lab. His research interests include distributed computing, virtualization, fault-tolerant systems, and machine learning. He is a student member of the IEEE.

Xubin He received the PhD degree in electrical engineering from the University of Rhode Island in 2002, and the BS and MS degrees in computer science from Huazhong University of Science and Technology, China, in 1995 and 1997, respectively. He is currently an associate professor in the Department of Electrical and Computer Engineering, Tennessee Technological University, and supervises the Storage Technology Architecture Research (STAR) Lab. His research interests include computer architecture, storage systems, virtualization, and high availability computing. He received the Ralph E. Powe Junior Faculty Enhancement Award in 2004 and the TTU Chapter Sigma Xi Research Award in 2005. He is a senior member of the IEEE and a member of the IEEE Computer Society.

Qishi Wu received the BS degree in remote sensing and GIS from Zhejiang University, China, in 1995, the MS degree in geomatics from Purdue University in 2000, and the PhD degree in computer science from Louisiana State University in 2003. He was a research fellow in the Computer Science and Mathematics Division at Oak Ridge National Laboratory during 2003-2006. He is currently an assistant professor in the Department of Computer Science, University of Memphis. His research interests include computer networks, remote visualization, distributed sensor networks, high-performance computing, algorithms, and artificial intelligence. He is a member of the IEEE.

Changsheng Xie received the BS and MS degrees in computer science from Huazhong University of Science and Technology (HUST), China, in 1982 and 1988, respectively. He is currently a professor in the Department of Computer Engineering at HUST. He is also the director of the Data Storage Systems Laboratory of HUST and the deputy director of the Wuhan National Laboratory for Optoelectronics. His research interests include computer architecture, disk I/O systems, networked data-storage systems, and digital media technology. He is the vice chair of the expert committee of the Storage Networking Industry Association (SNIA), China.