This paper was presented as part of the main technical program at IEEE INFOCOM 2011.

Mahout: Low-Overhead Datacenter Traffic Management using End-Host-Based Elephant Detection

Andrew R. Curtis* (University of Waterloo, Waterloo, Ontario, Canada), Wonho Kim* (Princeton University, Princeton, NJ, USA), Praveen Yalagandula (HP Labs, Palo Alto, CA, USA)

Abstract: Datacenters need high-bandwidth interconnection fabrics. Several researchers have proposed highly-redundant topologies with multiple paths between pairs of end hosts for datacenter networks. However, traffic management is necessary to effectively utilize the bisection bandwidth provided by these topologies. This requires timely detection of elephant flows (flows that carry large amounts of data) and managing those flows. Previously proposed approaches incur high monitoring overheads, consume significant switch resources, and/or have long detection times. We propose, instead, to detect elephant flows at the end hosts. We do this by observing the end hosts' socket buffers, which provide better, more efficient visibility of flow behavior. We present Mahout, a low-overhead yet effective traffic management system that follows the OpenFlow-like central controller approach for network management but augments the design with our novel end host mechanism. Once an elephant flow is detected, an end host signals the network controller using in-band signaling with low overheads. Through analytical evaluation and experiments, we demonstrate the benefits of Mahout over previous solutions.

I. INTRODUCTION

Datacenter switching fabrics have enormous bandwidth demands due to the recent uptick in bandwidth-intensive applications used by enterprises to manage their exploding data. These applications transfer huge quantities of data between thousands of servers. For example, Hadoop [18] performs an all-to-all transfer of up to petabytes of files during the shuffle phase of a MapReduce job [15]. Further, to better consolidate employee desktop and other computation needs, enterprises are leveraging virtualized datacenter frameworks (e.g., using VMWare [29] and Xen [10], [30]), where timely migration of virtual machines requires a high-throughput network.

Designing datacenter networks using redundant topologies such as Fat-tree [6], [12], HyperX [5], and Flattened Butterfly [22] solves the high-bandwidth requirement. However, traffic management is necessary to extract the best bisection bandwidth from such topologies [7]. A key challenge is that flows come and go too quickly in a data center to compute a route for each individually; e.g., Kandula et al. report 100K flow arrivals a second in a 1,500 server cluster [21].

For effective utilization of the datacenter fabric, we need to detect elephant flows (flows that transfer a significant amount of data) and dynamically orchestrate their paths. Datacenter measurements [17], [21] show that a large fraction of datacenter traffic is carried in a small fraction of flows. The authors report that 90% of the flows carry less than 1MB of data and more than 90% of bytes transferred are in flows greater than 100MB. Hash-based flow forwarding techniques such as Equal-Cost Multi-Path (ECMP) routing [19] work well only for large numbers of small (or mice) flows and no elephant flows. For example, Al-Fares et al.'s Hedera [7] shows that managing elephant flows effectively can yield as much as 113% higher aggregate throughput compared to ECMP.

Existing elephant flow detection methods have several limitations that make them unsuitable for datacenter networks. These proposals use one of three techniques to identify elephants: (1) periodic polling of statistics from switches, (2) streaming techniques like sampling or window-based algorithms, or (3) application-level modifications (full details of each approach are given in Section II). We have not seen support for Quality of Service (QoS) solutions take hold, which implies that modifying applications is probably an unacceptable solution. We will show that the other two approaches fall short in the datacenter setting due to high monitoring overheads, significant switch resource consumption, and/or long detection times.

We assert that the right place for elephant flow detection is at the end hosts. In this paper, we describe Mahout, a low-overhead yet effective traffic management system using end-host-based elephant detection. We subscribe to the increasingly popular simple-switch/smart-controller model (as in OpenFlow [4]), and so our system is similar to NOX [28] and Hedera [7]. Mahout augments this basic design. It has low overhead, as it monitors and detects elephant flows at the end host via a shim layer in the OS, rather than monitoring at the switches in the network. Mahout does timely management of elephant flows through an in-band signaling mechanism between the shim layer at the end hosts and the network controller. At the switches, any flow not signaled as an elephant is routed using a static load-balancing scheme (e.g., ECMP). Only elephant flows are monitored and managed by the central controller. The combination of end host elephant detection and in-band signaling eliminates the need for per-flow monitoring in the switches, and hence incurs low overhead and requires few switch resources.

We demonstrate the benefits of Mahout using analytical evaluation and simulations and through experiments on a small testbed. We have built a Linux prototype for our end host elephant flow detection algorithm and tested its effectiveness.

*This work was performed while Andrew and Wonho were interns at HP Labs, Palo Alto.
We have also built a Mahout controller for setting up switches with default entries and for processing the tagged packets from the end hosts. Our analytical evaluation shows that Mahout offers one to two orders of magnitude of reduction in the number of flows processed by the controller and in switch resource requirements, compared to Hedera-like approaches. Our simulations show that Mahout can achieve considerable throughput improvements compared to static load balancing techniques while incurring an order of magnitude lower overhead than Hedera. Our prototype experiments show that the Mahout approach can detect elephant flows at least an order of magnitude sooner than statistics-polling based approaches.

The key contributions of our work are: 1) a novel end host based mechanism for detecting elephant flows, 2) the design of a centralized datacenter traffic management system that has low overhead yet high effectiveness, and 3) simulation and prototype experiments demonstrating the benefits of the proposed design.

II. BACKGROUND & RELATED WORK

A. Datacenter networks and traffic

The heterogeneous mix of applications running in datacenters produces flows that are generally sensitive to either latency or throughput. Latency-sensitive flows are usually generated by network protocols (such as ARP and DNS) and interactive applications. They typically transfer up to a few kilobytes. On the other hand, throughput-sensitive flows, created by, e.g., MapReduce, scientific computing, and virtual machine migration, transfer up to gigabytes. This traffic mix implies that a datacenter network needs to deliver high bisection bandwidth for throughput-sensitive flows without introducing setup delay on latency-sensitive flows.

Designing datacenter networks using redundant topologies such as Fat-tree [6], [12], HyperX [5], or Flattened Butterfly [22] solves the high-bandwidth requirement. However, these networks use multiple end-to-end paths to provide this high bandwidth, so they need to load balance traffic across them. Load balancing can be performed with no overhead using oblivious routing, where the path a flow from node i to node j is routed on is randomly selected from a probability distribution over all i-to-j paths, but it has been shown to achieve less than half the optimal throughput when the traffic mix contains many elephant flows [7]. The other extreme is to perform online scheduling by selecting the path for all new flows using a load balancing algorithm, e.g., greedily adding a flow along the path with least congestion. This approach doesn't scale well (flows arrive too quickly for a single scheduler to keep up) and it adds too much setup time to latency-sensitive flows. For example, flow installation using NOX can take up to 10ms [28]. Partition-aggregate applications (such as search and other web applications) partition work across multiple machines and then aggregate the responses. Jobs have a deadline of 10-100ms [8], so a 10ms flow setup delay can consume the entire time budget. Therefore, online scheduling is not suitable for latency-sensitive flows.

B. Identifying elephant flows

The mix of latency- and throughput-sensitive flows in data centers means that effective flow scheduling needs to balance visibility and overhead; a one-size-fits-all approach is not sufficient in this setting. To achieve this balance, elephant flows must be identified so that they are the only flows touched by the controller. The following are the previously considered mechanisms for identifying elephants:

• Applications identify their flows as elephants: This solution accurately and immediately identifies elephant flows. It is a common assumption for a plethora of research work in network QoS, where the focus is to give higher priority to latency- and throughput-sensitive flows such as voice and video applications (see, e.g., [11]). However, this solution is impractical for traffic management in datacenters, as each and every application must be modified to support it. If all applications are not modified, an alternative technique will still be needed to identify elephant flows initiated by unmodified applications. A related approach is to classify flows based on which application is initiating them. This classifies flows using stochastic machine learning techniques [27], or using simple matching based on the packet header fields (such as TCP port numbers). While this approach might be suitable for enterprise network management, it is unsuitable for datacenter network management because of the enormous amount of traffic in the datacenter and the difficulty in obtaining flow traces to train the classification algorithms.

• Maintain per-flow statistics: In this approach, each flow is monitored at the first switch that the flow goes through. These statistics are pulled from switches by the controller at regular intervals and used to classify elephant flows. Hedera [7] and Helios [16] are examples of systems proposing to use such a mechanism. However, this approach does not scale to large networks. First, it consumes significant switch resources: a flow table entry for each flow monitored at a switch. We'll show in Section IV that this requires a considerable number of flow table entries. Second, bandwidth between switches and the controller is limited, so much so that transferring statistics becomes the bottleneck in traffic management in a datacenter network. As a result, the flow statistics cannot be quickly transferred to the controller, resulting in prolonged sub-par routings.

• Sampling: Instead of monitoring each flow in the network, in this approach a controller samples packets from all ports of the switches using switch sampling features such as sFlow [3]. Only a small fraction of packets are sampled (typically, 1 in 1000) at the switches, and only headers of the packets are transferred to the controller. The controller analyzes the samples and identifies a flow as an elephant after it has seen a sufficient number of samples from the flow. However, such an approach cannot reliably detect an elephant flow before it has carried more than 10K packets, or roughly 15MB [25]. Additionally, sampling has high overhead, since the controller must process each sampled packet.
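To make the static baseline used throughout the paper concrete, the sketch below illustrates hash-based path selection in the spirit of ECMP [19]: a deterministic hash over a flow's headers pins all of its packets to one of several equal-cost paths, which spreads mice well but lets two elephants collide on the same path. This is an illustrative sketch under our own assumptions (the 5-tuple fields and uplink names), not code from Mahout or from an ECMP implementation.

import hashlib

def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto, paths):
    # Hash the flow 5-tuple so every packet of a flow takes the same
    # path (no reordering); two large flows can still hash together.
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.md5(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(paths)
    return paths[index]

# Example: four equal-cost uplinks from a rack switch.
uplinks = ["core-1", "core-2", "core-3", "core-4"]
print(ecmp_path("10.0.1.5", "10.0.9.7", 43512, 80, 6, uplinks))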
C. OpenFlow

OpenFlow [23] aims to open up the traditionally closed designs of commercial switches to enable network innovation. OpenFlow switches maintain a flow table where each entry contains a pattern to match and the actions to perform on a packet that matches that entry. OpenFlow defines a protocol for communication between a controller and an OpenFlow switch to add and remove entries from the flow table of the switch and to query statistics of the flows.

Upon receiving a packet, if an OpenFlow switch does not have an entry in the flow table or TCAM that matches the packet, the switch encapsulates and forwards the packet to the controller over a secure connection. The controller responds back with a flow table entry and the original packet. The switch then installs the entry into its flow table and forwards the packet according to the actions specified in the entry. The flow table entries expire after a set amount of time, typically 60 seconds. OpenFlow switches maintain statistics for each entry in their flow table. These statistics include a packet counter, byte counter, and duration.

The OpenFlow 1.0 specification [2] defines matching over 12 fields of the packet header (see the top line in Figure 3). The specification defines several actions, including forwarding on a single physical port, forwarding on multiple ports, forwarding to the controller, drop, queue (to a specified queue), and defaulting to traditional switching. To support such flexibility, current commercial switch implementations of OpenFlow use TCAMs for the flow table.

III. OUR SOLUTION: MAHOUT

Fig. 1: Mahout architecture. [Diagram: end hosts in racks connect through rack switches into a multi-path fabric, with a central Mahout controller managing the switches.]

Mahout's architecture is shown in Figure 1. In Mahout, a shim layer on each end host monitors the flows originating from that host. When this layer detects an elephant flow, it marks subsequent packets of that flow using an in-band signaling mechanism. The switches in the network are configured to forward these marked packets to the Mahout controller. This simple approach allows the controller to detect elephant flows without any switch CPU- and bandwidth-intensive monitoring. The Mahout controller then manages only the elephant flows, to maintain a globally optimal arrangement of them.

In the following, we describe Mahout's end host shim layer for detecting elephant flows, our in-band signaling method for informing the controller about the elephant flows, and the Mahout network controller.

A. Detecting Elephant Flows

An end host based implementation for detecting elephant flows is better than in-network monitoring/sampling based methods, particularly in datacenters, because: (1) The network behavior of a flow is affected by how rapidly the end-point applications are generating data for the flow, and this is not biased by congestion in the network. In contrast to in-network monitors, the end host OS has better visibility into the behavior of applications. (2) In datacenters, it is possible to augment the end host OS; this is enabled by the single administrative domain and software uniformity typical of modern datacenters. (3) Mahout's elephant detection mechanism has very little overhead (it is implemented with two if statements) on commodity servers. In contrast, using an in-network mechanism to do fine-grained flow monitoring (e.g., using exact matching on OpenFlow's 12-tuple) can be infeasible, even on an edge switch, and even more so on a core switch, especially on commodity hardware. For example, assume that 32 servers are connected, as is typical, to a rack switch. If each server generates 20 new flows per second, with a default flow timeout period of 60 seconds, an edge switch needs to maintain and monitor 38,400 flow entries. This number is infeasible in any of the real switch implementations of OpenFlow that we are aware of.

A key idea of the Mahout system is to monitor end host socket buffers, and thus determine elephant flows earlier and with lower overheads than with in-network monitoring systems. We demonstrate the rationale for this approach with a micro-benchmark: an ftp transfer of a 50MB file from host 1 to host 2, connected via two switches, all with 1Gbps links.

Fig. 2: Amount of data observed in the TCP buffers vs. data observed at the network layer for a flow. [Plot: cumulative bytes over time (µs) for "TCP Buffer" vs. "Sent Data".]

In Figure 2, we show the cumulative amount of data observed on the network, and in the TCP buffer, as time progresses. The time axis starts when the application first provides data to the kernel. From the graph, one can observe that the application fills the TCP buffers at a rate much higher than the observed network rate. If the threshold for considering a flow as an elephant is 100KB (Figure 2 of [17] shows that more than 85% of flows are less than 100KB), we can see that Mahout's end host shim layer can detect a flow to be an elephant 3x sooner than in-network monitoring. In this experiment there were no other active flows on the network. In further experimental results, presented in Section V, we observe an order of magnitude faster detection when there are other flows.
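To make the socket-buffer observation concrete, the sketch below polls a TCP socket's send-queue backlog from user space on Linux using the SIOCOUTQ/TIOCOUTQ ioctl, which reports the bytes written by the application but not yet acknowledged. Mahout's shim actually runs inside the kernel on the transmit path; this userspace polling loop is only a rough approximation for reproducing a Figure 2-style measurement, and the one-millisecond poll interval is an arbitrary choice.

import fcntl
import socket
import struct
import termios
import time

def send_queue_bytes(sock: socket.socket) -> int:
    # Bytes sitting in the TCP send buffer (Linux-only TIOCOUTQ ioctl).
    buf = fcntl.ioctl(sock.fileno(), termios.TIOCOUTQ, struct.pack("i", 0))
    return struct.unpack("i", buf)[0]

def trace_buffer(sock, duration_s=1.0, interval_s=0.001):
    # Sample send-queue occupancy over time, as in the Figure 2 measurement.
    samples, start = [], time.monotonic()
    while time.monotonic() - start < duration_s:
        samples.append((time.monotonic() - start, send_queue_bytes(sock)))
        time.sleep(interval_s)
    return samples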
Mahout uses a shim layer in the end hosts to monitor the socket buffers. When a socket buffer crosses a chosen threshold, the shim layer determines that the flow is an elephant. This simple approach ensures that flows that are bottlenecked at the application layer and not in the network layer, irrespective of how long-lived they are or how many bytes they have transferred, will not be determined to be elephant flows. Such flows need no special management in the network. In contrast, if an application is generating data for a flow faster than the flow's achieved network throughput, the socket buffer will fill up, and hence Mahout will detect this as an elephant flow that needs management. Algorithm 1 shows pseudocode for the end host shim layer function that is executed when a TCP packet is being sent.

Algorithm 1 Pseudocode for end host shim layer
 1: When sending a packet
 2: if number of bytes in buffer ≥ threshold_elephant then
 3:     /* Elephant flow */
 4:     if now() − last_tagged_time ≥ T_tagperiod then
 5:         set DS = 00001100
 6:         last_tagged_time = now()
 7:     end if
 8: end if
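For readers who prefer running code, here is a direct Python transcription of Algorithm 1. It is a sketch of the per-packet logic only: the real shim is a Linux kernel module on the transmit path, and the FlowState container, the packet object with a .ds attribute, the 100KB threshold, and the 1-second tag period are illustrative values taken from the surrounding text rather than a definitive implementation.

import time
from dataclasses import dataclass

THRESHOLD_ELEPHANT = 100 * 1024   # bytes queued before a flow counts as an elephant
T_TAGPERIOD = 1.0                 # seconds between re-tagging the same flow
DSCP_ELEPHANT = 0b000011          # codepoint from the experimental xxxx11 space [20]
DS_FIELD_ELEPHANT = DSCP_ELEPHANT << 2   # DS byte 00001100, as in Algorithm 1

@dataclass
class FlowState:
    last_tagged_time: float = float("-inf")

def on_send_packet(flow: FlowState, bytes_in_buffer: int, packet) -> None:
    # 'packet' is a hypothetical object exposing the IPv4 DS byte as .ds.
    if bytes_in_buffer >= THRESHOLD_ELEPHANT:            # elephant flow
        now = time.monotonic()
        if now - flow.last_tagged_time >= T_TAGPERIOD:   # tag at most once per period
            packet.ds = DS_FIELD_ELEPHANT                # set DS = 00001100
            flow.last_tagged_time = now

From user space, an approximate equivalent of the marking step is sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, 0x0c), which sets the DS byte on all subsequent packets of a socket rather than on a single packet.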
B. In-band Signaling

Once Mahout's shim layer has detected an elephant flow, it needs to signal this to the network controller. We do this indirectly, by marking the packets in a way that is easily and efficiently detected by OpenFlow switches, and then the switches divert the marked packets to the network controller. To avoid inundating the controller with too many packets of the same flow, the end host shim layer marks the packets of an elephant flow only once every T_tagperiod seconds (we use 1 second in our prototype).

To mark a packet, we repurpose the Differentiated Services Field (DS Field) [26] in the IPv4 header. This field was originally called the IP Type-of-Service (IP ToS) byte. The first 6 bits of the DS Field, called the Differentiated Services Code Point (DSCP), define the per-hop behavior of a packet. The current OpenFlow specification [2] allows matching on DSCP bits, and most commercial switch implementations of OpenFlow support this feature in hardware; hence, we use the DS Field for signaling between the end host shim layer and the network controller. Currently the code point space corresponding to xxxx11 (x denotes a wild-card bit) is reserved for experimental or local usage [20], and we leverage this space. When an end host detects an elephant flow, it sets the DSCP bits to 000011 in the packets belonging to that flow.

C. Mahout Controller

At each rack switch, the Mahout controller initially configures two default OpenFlow flow table entries: (i) an entry to send a copy of packets with the DSCP bits set to 000011 to the controller and (ii) the lowest-priority entry to switch packets using the NORMAL forwarding action. We set up switches to perform ECMP forwarding by default in the NORMAL operation mode. Figure 3 shows the two default entries at the bottom. In this figure, an entry has a higher priority over (is matched before) entries below that entry.

When a flow starts, it normally will match the lowest-priority (NORMAL) rule, so its packets will follow ECMP forwarding. When an end host detects a flow as an elephant, it marks a packet of that flow. That packet, marked with DSCP 000011, matches the other default rule, and the rack switch forwards it to the Mahout controller. The controller then computes the best path for this elephant, and installs a flow-specific entry in the rack switch.

In Figure 3, we show a few example entries for the elephant flows. Note that these entries are installed with higher priority than Mahout's two default rules; hence, the packets corresponding to these elephant flows are switched using the actions of these flow-specific entries rather than the actions of the default entries. Also, the DS field is set to wildcard for these elephant flow entries, so that once the flow-specific rule is installed, any tagged packets from the end hosts are not forwarded to the controller.

Fig. 3: An example flow table setup at a switch by the Mahout controller. [Diagram: flow-specific entries for detected elephant flows at the top (exact header matches with actions such as "Forward to port 4" or "Forward to port 3"), above the two default rules: "Send to Controller" for DSCP-tagged packets, and NORMAL routing (ECMP) at the lowest priority.]

Once an elephant flow is reported to the Mahout controller, it needs to be placed on the best available path. We define the best path for a flow from s to t as the least congested of all paths from s to t. The least congested s-t path is found by enumerating over all such paths.

To manage the elephant flows, Mahout regularly pulls statistics on the elephant flows and link utilizations from the switches, and uses these statistics to optimize the elephant flows' routes. This is done with the increasing first fit algorithm given in Algorithm 2. Correa and Goemans introduced this algorithm and proved that it finds routings that have at most a 10% higher link utilization than the optimal routing [13]. While we cannot guarantee this bound because we re-route only the elephant flows, we expect this algorithm to perform as well as any other heuristic.

Algorithm 2 Offline increasing first fit
 1: sort(F); reverse(F)   /* F: set of elephant flows */
 2: for f ∈ F do
 3:     for l ∈ f.path do
 4:         l.load = l.load − f.rate
 5:     end for
 6: end for
 7: for f ∈ F do
 8:     best_paths[f].congest = ∞
 9:     /* P_st: set of all s-t paths */
10:     for path ∈ P_st do
11:         congest = (f.rate + path.load) / path.bandwidth
12:         if congest < best_paths[f].congest then
13:             best_paths[f] = path
14:             best_paths[f].congest = congest
15:         end if
16:     end for
17: end for
18: return best_paths
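The following is a compact Python rendering of Algorithm 2, included to make the bookkeeping explicit. The Flow, Path, and Link classes, the definition of a path's load as the load of its busiest link, and the step that commits each placement before the next flow is considered are our assumptions for the sketch; the paper's pseudocode leaves these representation details open.

from dataclasses import dataclass

@dataclass
class Link:
    load: float = 0.0              # current traffic on the link

@dataclass
class Path:
    links: list                    # Links along the path
    bandwidth: float               # capacity of the path

    @property
    def load(self) -> float:       # congestion is driven by the busiest link
        return max(link.load for link in self.links)

@dataclass(eq=False)
class Flow:
    rate: float                    # current sending rate
    path: Path                     # path the flow is routed on now
    candidate_paths: list          # all s-t paths for this flow's endpoints

def increasing_first_fit(flows):
    # Lines 1-6: consider flows largest-first, with their own load removed.
    flows = sorted(flows, key=lambda f: f.rate, reverse=True)
    for f in flows:
        for link in f.path.links:
            link.load -= f.rate
    # Lines 7-18: place each flow on its least congested candidate path.
    best_paths = {}
    for f in flows:
        best, best_congest = None, float("inf")
        for path in f.candidate_paths:
            congest = (f.rate + path.load) / path.bandwidth
            if congest < best_congest:
                best, best_congest = path, congest
        best_paths[f] = best
        for link in best.links:    # commit, so later flows see this placement
            link.load += f.rate
    return best_paths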
D. Discussion

a) DSCP bits: In Mahout, the end host shim layer uses the DSCP bits of the DS field in the IP header for signaling elephant flows. However, there may be some datacenters where DSCP may be needed for other uses, such as for prioritization among different types of flows (voice, video, and data) or for prioritization among different customers. In such scenarios, we plan to use the VLAN Priority Code Point (PCP) [1] bits. OpenFlow supports matching on these bits too. We can leverage the fact that it is very unlikely for both these code point fields (PCP and DSCP) to be in use simultaneously.

b) Virtualized Datacenter: In a virtualized datacenter, a single server will host multiple guest virtual machines, each possibly running a different operating system. In such a scenario, the Mahout shim layer needs to be deployed in each of the guest virtual machines. Note that the host operating system will not have visibility into the socket buffers of a guest virtual machine. However, in cloud computing infrastructures such as Amazon EC2 [9], typically the infrastructure provider makes available a few preconfigured OS versions, which include the paravirtualization drivers to work with the provider's hypervisor. Thus, we believe that it is feasible to deploy the Mahout shim layer in virtualized datacenters, too.
c) Elephant flow threshold: Choosing too low a value for threshold_elephant in Algorithm 1 can cause many flows to be recognized as elephants, and hence cause the rack switches to forward too many packets to the controller. When there are many elephant flows, to avoid controller overload, we could provide a means for the controller to signal the end hosts to increase the threshold value. However, this would require an out-of-band control mechanism. An alternative is to use multiple DSCP values to denote different levels of thresholds. For example, xxxx11 can be designated to denote that a flow has more than 100KB of data, xxx111 to denote more than 1MB, xx1111 to denote more than 10MB, and so on. The controller can then change the default entry corresponding to the tagged packets (second from bottom in Figure 3) to select higher thresholds, based on the load at the controller. Further study is needed to explore these approaches.
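The multi-level encoding sketched above is easy to state in code. The concrete codepoints below follow the xxxx11 / xxx111 / xx1111 pattern from the text with all wildcard bits taken as zero; that instantiation, and the exact threshold ladder, are illustrative assumptions rather than part of the prototype.

# Threshold ladder -> DSCP codepoint, per the xxxx11/xxx111/xx1111 scheme.
# Wildcard bits are taken as 0 here; any choice in the reserved space works.
THRESHOLD_TO_DSCP = [
    (10 * 1024 * 1024, 0b001111),   # more than 10MB buffered
    (1 * 1024 * 1024,  0b000111),   # more than 1MB buffered
    (100 * 1024,       0b000011),   # more than 100KB buffered
]

def dscp_for_backlog(bytes_in_buffer: int):
    # Return the strongest elephant codepoint this backlog qualifies for.
    for threshold, dscp in THRESHOLD_TO_DSCP:
        if bytes_in_buffer > threshold:
            return dscp
    return None   # not an elephant at any level; leave packets untagged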
IV. ANALYTICAL EVALUATION

In this section, we analyze the expected overhead of detecting elephant flows with Mahout, with flow sampling, and by maintaining per-flow statistics (e.g., the approach used by Hedera). We set up an analytical framework to evaluate the number of switch table entries and control messages used by each method. We evaluate each method using an example datacenter, and show that Mahout is the only solution that can scale to support large datacenters.

Flow sampling identifies elephants by sampling an expected 1 out of k packets. Once it has seen enough packets from the same flow, the flow is classified as an elephant. The number of packets needed to classify an elephant does not affect our analysis in this section, so we ignore it for now.

Hedera [7] uses periodic polling for elephant flow detection. Every t seconds, the Hedera controller pulls the per-flow statistics from each switch. In order to estimate the true rate of a flow (i.e., the rate of the flow if its rate is only constrained by its endpoints' NICs and not by any link in the network), the statistics for every flow in the network must be collected. Pulling statistics for all flows using OpenFlow requires setting up a flow table entry for every flow, so each flow must be sent to the controller before it can be started; we include this cost in our analysis.

TABLE I: Parameters and typical values for the analytical evaluation.

Parameter   Description                                   Value
N           Num. of end hosts                             2^20 (1M)
T           Num. of end hosts per rack switch             32
S           Num. of rack switches                         2^15 (32K)
F           Avg. new flows per second per end host        20 [28]
D           Avg. duration of a flow in the flow table     60 seconds
c           Size of counters in bytes                     24 [2]
r_stat      Rate of gathering statistics                  1 per second
P           Num. of bytes in a packet                     1500
f_m         Fraction of mice                              0.99
f_e         Fraction of elephants                         0.01
r_sample    Rate of sampling                              1 in 1000
b_sample    Size of packet sample (bytes)                 60

We consider a million server network for the following analysis. Our notation and the assumed values are shown in Table I.

Hedera [7]: As table entries need to be maintained for all flows, the number of flow table entries needed at each rack switch is T·F·D. In our example, this translates to 32·20·60 = 38,400 entries at each rack switch. We are not aware of any existing switch with OpenFlow support that can support this many entries in the flow table in hardware; for example, HP ProCurve 5400zl switches support up to 1.7K OpenFlow entries per linecard. It is unlikely that any switch in the near future will support so many table entries, given the expense of high-speed memory.

The Hedera controller needs to handle N·F flow setups per second, or more than 20 million requests per second in our example. A single NOX controller can handle only 30,000 requests per second; hence one needs 667 controllers just to handle the flow setup load [28], assuming that the load can be perfectly distributed. Flow scheduling, however, does not seem to be a simple task to distribute.

The rate at which the controller needs to process the statistics packets is

    (c·T·F·D / P) · S · r_stat

In our example, this implies (24·38,400)/1500 · 2^15 ≈ 20 million control packets per second. Assuming that a NOX controller can handle these packets at the rate it can handle the flow setup requests (30,000 per second), this translates to needing 670 controllers just to process these packets. Or, if we consider only one controller, then the statistics can be gathered only once every 670 seconds (≈ 11 minutes).
Sampling: Sampling incurs the messaging overhead of taking samples, and then installs flow table entries when an elephant is detected. The rate at which the controller needs to process the sampled packets is

    throughput · r_sample · (b_sample / P)

We assume that each sample contains only a 60-byte header and that headers can be combined into 1500-byte packets, so there are 25 samples per message to the controller. The aggregate throughput of a datacenter network changes frequently, but if 10% of the hosts are sending traffic, the aggregate throughput (in Gbps) is 0.10·N. We then find the messaging overhead of sampling to be around 550K messages per second, or if we bundle samples into packets (i.e., 25 samples fit in a 1500-byte packet), then this drops to 22K messages per second.

At first blush, this messaging overhead does not seem like too much; however, as the network utilization increases, the messaging overhead can reach 3.75 million (or 150K if there are 25 samples per packet) packets per second. Therefore, sampling incurs the highest overhead when load balancing is most needed. Decreasing the sampling rate reduces this overhead but adversely impacts the effects of flow scheduling, since not all elephants are detected. We expect the number of elephants identified by sampling to be similar to Mahout, so we do not analyze the flow table entry overhead of sampling separately.

Mahout: Because elephant flow detection is done at the end host, switches contain flow table entries for elephant flows only. Also, statistics are only gathered for the elephant flows. So, the number of flow entries per rack switch in Mahout is T·F·D·f_e = 384 entries. The number of flow setups that the Mahout controller needs to handle is N·F·f_e, which is about 200K requests per second, which needs 7 controllers. Also, the number of packets per second that need to be processed for gathering statistics is an f_e fraction of that in the case of Hedera. Thus 7 controllers are needed for gathering statistics at the rate of once per second, or the statistics can be gathered by a single controller at the rate of once every 7 seconds.
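The headline numbers in this section follow directly from Table I; the snippet below reproduces the Hedera and Mahout arithmetic (taking N as an even 10^6 hosts, as the text does) so the comparison can be re-derived or re-run with different parameters.

import math

# Parameters from Table I (N taken as 10^6, as in the text).
N, T, S = 10**6, 32, 2**15        # end hosts, hosts per rack switch, rack switches
F, D = 20, 60                     # new flows/s per host, seconds a flow stays in the table
c, P = 24, 1500                   # counter bytes per entry, bytes per packet
f_e = 0.01                        # fraction of flows that are elephants
NOX_RATE = 30_000                 # requests/s one NOX controller handles [28]

# Hedera-style per-flow statistics.
entries_hedera = T * F * D                      # 38,400 entries per rack switch
setups_hedera = N * F                           # 20,000,000 flow setups/s
print(math.ceil(setups_hedera / NOX_RATE))      # 667 controllers for setups alone
stats_pkts = (c * entries_hedera / P) * S       # ~20.1 million control packets/s
print(stats_pkts / NOX_RATE)                    # roughly 670 controllers' worth

# Mahout: only elephants get flow entries and statistics.
entries_mahout = T * F * D * f_e                # 384 entries per rack switch
setups_mahout = N * F * f_e                     # 200,000 flow setups/s
print(math.ceil(setups_mahout / NOX_RATE))      # 7 controllers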
V. EXPERIMENTS

A. Simulations

Our goal is to compare the performance and overheads of Mahout against the competing approaches described in the previous section. To do so, we implemented a flow-level, event-based simulator that can scale to a few thousand end hosts connected using a Clos topology [12]. We now describe this simulator and our evaluation of Mahout with it.

1) Methodology: We simulate a datacenter network by modeling the behavior of flows. The network topology is modeled as a capacitated, directed graph and forms a three-level Clos topology. All simulations here are of a 1,600 server datacenter network, and they use a network with rack-to-aggregation and aggregation-to-core links that are 1:5 oversubscribed, i.e., the network has 320Gb bisection bandwidth. All servers have 1Gbps NICs and links have 1Gbps capacity. Our simulation is event-based, so there is no discrete clock; instead, the timing of events is accurate to floating-point precision. Input to the simulator is a file listing the start time, bytes, and endpoints of a set of flows (our workloads are described below). When a flow starts or completes, the rate of each flow is recomputed.

We model the OpenFlow protocol only by accounting for the delay when a switch sets up a flow table entry for a flow. When this occurs, the switch sends the flow to the OpenFlow controller by placing it in its OpenFlow queue. This queue has 10Mbps of bandwidth (this number was measured from an OpenFlow switch [24]). This queue has infinite capacity, so our model optimistically estimates the delay between a switch and the OpenFlow controller, since a real system drops arriving packets if one of these queues is full, resulting in TCP timeouts. Moreover, we assume that there is no other overhead when setting up a flow, so the OpenFlow controller deals with the flow and installs flow table entries instantly.

We simulate three different schedulers: (1) an offline scheduler that periodically pulls flow statistics from the switches, (2) a scheduler that behaves like the Mahout scheduler, but uses sampling to detect elephant flows, and (3) the Mahout scheduler as described in Sec. III-C.

The stat-pulling controller behaves like Hedera [7] and Helios [16]. Here, the controller pulls flow statistics from each switch at regular intervals. The statistics from a flow table entry are 24 bytes, so the amount of time to transfer the statistics from a switch to the controller is proportional to the number of flow table entries at the switch. When transferring statistics, we assume that the CPU-to-controller rate is the bottleneck, not the network or OpenFlow controller itself. Once the controller has statistics for all flows, it computes a new routing for elephant flows and reassigns paths instantly. In practice, computing this routing and inserting updated flow table entries into the switches will take up to hundreds of milliseconds. We allow this to be done instantaneously to find the theoretical best achievable results using an offline approach. The global re-routing of flows is computed using the increasing first fit algorithm described in Algorithm 2. This algorithm is simpler than the simulated annealing employed by Hedera; however, we expect the results to be similar, since this heuristic is likely to be as good as any other (as discussed in Sec. III-C).

As we are doing flow-level simulations, sampling packets is not straightforward, since there are no packets to sample from. Instead, we sample from flows by determining the amount of time it will take for k packets to traverse a link, given its rate, and then sample from the flows on the link by weighting each flow by its rate. Full details are in [14].
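A minimal version of that flow-level sampling step looks as follows: given the flows currently on a link, the next "sampled packet" is attributed to a flow chosen with probability proportional to its rate. This sketch assumes the flow objects carry a rate attribute and uses Python's random.choices for the weighted draw; the full procedure, including the timing of samples, is in the technical report [14].

import random

def sample_flow_on_link(flows, k=1000, packet_bits=1500 * 8):
    # Flow-level stand-in for 1-in-k packet sampling on one link.
    # Returns (seconds until the next sample is taken, sampled flow);
    # a flow's chance of being sampled is proportional to its rate.
    link_rate = sum(f.rate for f in flows)            # bits/s on this link
    if link_rate == 0:
        return float("inf"), None
    time_for_k_packets = k * packet_bits / link_rate  # one sample per k packets
    sampled = random.choices(flows, weights=[f.rate for f in flows], k=1)[0]
    return time_for_k_packets, sampled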
a) Workloads: We simulate background traffic modeled on recent measurements [21] and add traffic modeled on MapReduce to stress the network. We assume that the MapReduce job has just gone into its shuffle phase. In this phase, each end host transfers 128MB to each other host. Each end host opens a connection to at most five other end hosts simultaneously (as done by default in Hadoop's implementation of MapReduce). Once one of these connections completes, the host opens a connection to another end host, repeating this until it has transferred its 128MB file to each other end host. The order of these outgoing connections is randomized for each end host. For all experiments here, we used 250 randomly selected end hosts in the shuffle load. The reduce phase shuffle begins three minutes after the background traffic is started to allow the background traffic to reach a steady state, and measurements shown here are taken for five minutes after the reduce phase began.

We added background traffic following the macroscopic flow measurements collected by Kandula et al. [17], [21] because datacenters run a heterogeneous mix of services simultaneously. They give the fraction of correspondents a server has within its rack and outside of its rack over a ten second interval. We follow this distribution to decide how many inter- and intra-rack flows a server starts over ten seconds; however, they do not give a more detailed breakdown of flow destinations than this, so we assume that the selection of a destination host is uniformly random across the source server's rack or the remaining racks for an intra- or inter-rack flow, respectively. We select the number of bytes in a flow following the distribution of flow sizes in their measurements as well. Before starting the shuffle job, we simulate this background traffic for three minutes. The simulation ends whenever the last shuffle job flow completes.

b) Metrics: To measure the performance of each scheduler, we tracked the aggregate throughput of all flows; this is the amount of bisection bandwidth the scheduler is able to extract from the network. We measure overhead as before in Section IV, i.e., by counting the number of control messages and the number of flow table entries at each switch. All numbers shown here are averaged over ten runs.

2) Results: The per-second aggregate throughput for the various scheduling methods is shown in Figure 4.

Fig. 4: Throughput results for the schedulers with various parameters. Error bars on all charts show 95% confidence intervals. [Chart: aggregate throughput for Mahout (by threshold), Sampling (by sampling fraction), and Pulling (by polling period).]

We compare these schedulers to static load balancing with equal-cost multipath (ECMP), which uniformly randomizes the outgoing flows across a set of ports [7]. We used three different elephant thresholds for Mahout: 128KB, 1MB, and 100MB, and flows carrying at least this threshold of bytes were classified as an elephant after sending 2, 20, or 2000 packets, respectively. As expected, controlling elephant flows extracts more bisection bandwidth from the network: Mahout extracts 16% more bisection bandwidth from the network than ECMP, and the other schedulers obtain similar results depending on their parameters.

Hedera's results found that flow scheduling gives a much larger improvement over ECMP than our results (up to 113% on some workloads) [7]. This is due to the differences in workloads. Our workload is based on measurements [21], whereas their workloads are synthetic. We have repeated our simulations using some of their workloads and find similar results: the schedulers improve throughput by more than 100% compared to ECMP on their traffic mix.

We examine the overhead versus performance tradeoff by counting the maximum number of flow table entries per rack switch and the number of messages to the controller. These results are shown in Figures 5 and 6.
Fig. 5: Number of packets sent to controller by various schedulers. Here, we bundled samples together into a single packet (there are 25 samples per packet); each bundle of samples counts as a single controller message.

Fig. 6: Average and maximum number of flow table entries at each switch used by the schedulers.

Mahout has the least overhead of any scheduling approach considered. Pulling statistics requires too many flow table entries per switch and sends too many packets to the controller to scale to large datacenters; here, the stat-pulling scheduler used nearly 800 flow table entries per rack switch on average, no matter how frequently the statistics were pulled. This is more than seven times the number of entries used by the sampling and Mahout controllers, and makes the offline scheduler infeasible in larger datacenters because the flow tables will not be able to support such a large number of entries. Also, when pulling stats every 1 sec., the controller receives 10x more messages than when using Mahout with an elephant threshold of 100MB.

These simulations indicate that, for our workload, the value of threshold_elephant affects the overhead of the Mahout controller, but does not have much of an impact on performance (up to a point: when we set this threshold to 1GB (not shown on the charts), the Mahout scheduler performed no better than ECMP). The number of packets to the Mahout controller goes from 328 per sec. when the elephant threshold is 128KB to 214 per sec. when the threshold is 100MB, indicating that tuning it can reduce controller overhead without affecting the scheduler's performance. Even so, we suggest making this threshold as small as possible to save memory at the end hosts and for quicker elephant flow detection (see the experiments on our prototype in the next section). We believe a threshold of 200-500KB is best for most workloads.
B. Prototype & Microbenchmarks

We have implemented a prototype of the Mahout system. The shim layer is implemented as a kernel module inserted between the TCP/IP stack and the device driver, and the controller is built upon NOX [28], an open-source OpenFlow controller written in Python. For the shim layer, we created a function for the pseudocode shown in Algorithm 1 and invoke it for the outgoing packets, after the IP header creation in the networking stack. Implementing it as a separate kernel module improves deployability because it can be installed without modifying or upgrading the Linux kernel. Our controller leverages the NOX platform to learn the topology and configure switches with the default entries. It also processes the packets marked by the shim layer and installs entries for the elephant flows.

Fig. 8: Testbed for prototype experiments. [Diagram: two Linux end hosts, each running a Mahout shim, connected by two switches over 1Gbps links, with the controller attached via the OpenFlow protocol and additional hosts generating background traffic.]

Our testbed for experimenting with the different components of the prototype is shown in Figure 8. Switches 1 and 2 in the figure are HP ProCurve 5400zl switches running firmware with OpenFlow support. We have two end hosts in the testbed, one acting as the server for the flows and another acting as the client. For some experiments, we also run background flows to emulate other network traffic.

Since this is not a large scale testbed, we perform microbenchmark experiments focusing on the timeliness of elephant flow detection, and compare Mahout against a Hedera-like polling approach. We first present measurements of the time it takes to detect an elephant flow at the end host and then present the overall time it takes for the controller to detect an elephant flow. Our workload consists of a file transfer using ftp. For experiments with background flows present, we run 10 simultaneous iperf connections.

a) End host elephant flow detection time: In this experiment, we ftp a 50MB file from Host-1 to Host-2. We track the number of bytes in the socket buffer for that flow and the number of bytes transferred on the network, along with the timestamps. We did 30 trials of this experiment. Figure 2 shows a single run. In Figure 7, we show the time it takes before a flow can be classified as an elephant based on information from the buffer utilization versus based on the number of bytes sent on the network. Here we consider different thresholds for considering a flow as an elephant. We present both cases, with and without background flows. It is clear from these results that Mahout's approach of monitoring the TCP buffers can significantly quicken elephant flow detection (more than an order of magnitude sooner in some cases) at the end hosts, and is also not affected by congestion in the network.

b) Elephant flow detection time at the controller: In this experiment, we measure how long it takes for an elephant flow to be detected at the controller using the Mahout approach versus using a Hedera-like periodic polling approach. To be fair to the polling approach, we have done the periodic polling at the fastest rate possible (poll in a loop without any wait periods in between). As can be seen from Table II, the Mahout controller can detect an elephant flow in a few milliseconds.
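To round out the prototype description, here is a schematic of the controller-side packet-in handling from Section III-C, written as plain Python rather than against the actual NOX API, whose interfaces we do not reproduce here. The injected paths_between and install_flow_entry callables, and the packet dictionary layout, are hypothetical stand-ins for the corresponding NOX/OpenFlow operations; this is a sketch of the logic, not the prototype's code.

DSCP_ELEPHANT = 0b000011

def dscp_of(packet) -> int:
    # DSCP is the upper six bits of the IPv4 DS byte.
    return packet["ds"] >> 2

def flow_key(packet):
    return (packet["src"], packet["dst"],
            packet["sport"], packet["dport"], packet["proto"])

class MahoutController:
    # Schematic of Mahout's packet-in logic (not the real NOX component).

    def __init__(self, paths_between, install_flow_entry):
        # paths_between(src, dst) -> candidate paths, each with .load and
        # .bandwidth; install_flow_entry pushes a rule to a switch. Both
        # are supplied by the hosting platform.
        self.paths_between = paths_between
        self.install = install_flow_entry
        self.elephants = {}   # flow 5-tuple -> assigned path

    def on_packet_in(self, switch, packet):
        if dscp_of(packet) != DSCP_ELEPHANT:
            return                      # mice never reach the controller
        key = flow_key(packet)
        if key in self.elephants:
            return                      # flow-specific rule already installed
        paths = self.paths_between(packet["src"], packet["dst"])
        path = min(paths, key=lambda p: p.load / p.bandwidth)
        self.elephants[key] = path
        # Installed above the two default rules, with the DS field
        # wildcarded so later tagged packets stay in hardware.
        self.install(switch, match=key, path=path,
                     priority="above-default", wildcard_ds=True)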
Fig. 7: For each threshold of bytes, the time taken for the TCP buffer to fill versus the time taken for that many bytes to appear on the network, with (a) no background flows and (b) 10 background TCP flows. Error bars show 95% confidence intervals. [Charts: "Time to Queue" vs. "Time to Send" at thresholds of 100KB, 200KB, 500KB, and 1MB.]

TABLE II: Time it takes to detect an elephant flow at the Mahout controller vs. the Hedera controller, with no other active flows. [Table values not recovered.]

In contrast, Hedera takes 189.83ms before the flow can be detected as an elephant, irrespective of the threshold. All times are the same due to the switch's overheads in collecting statistics and relaying them to the central controller.

Overall, our working prototype demonstrates the deployment feasibility of Mahout. The experiments show an order of magnitude difference in the elephant flow detection times at the controller in Mahout vs. a competing approach.

VI. CONCLUSION

Previous research in datacenter network management has shown that elephant flows (flows that carry a large amount of data) need to be detected and managed for better utilization of multi-path topologies. However, the previous approaches for elephant flow detection are based on monitoring the behavior of flows in the network and hence incur long detection times, high switch resource usage, and/or high control bandwidth and processing overhead. In contrast, we propose a novel end host based solution that monitors the socket buffers to detect elephant flows and signals the network controller using an in-band mechanism. We present Mahout, a low-overhead yet effective traffic management system based on this idea. Our experimental results show that our system can detect elephant flows an order of magnitude sooner than polling based approaches while incurring an order of magnitude lower controller overhead than other approaches.

ACKNOWLEDGEMENTS

We sincerely thank Sujata Banerjee, Jeff Mogul, Puneet Sharma, and Jean Tourrilhes for several beneficial discussions and comments on earlier drafts of this paper.

REFERENCES

[1] IEEE Std. 802.1Q-2005, Virtual Bridged Local Area Networks.
[2] OpenFlow Switch Specification, Version 1.0.0. http://www.openflowswitch.org/documents/openflow-spec-v1.0.0.pdf.
[3] sFlow. http://www.sflow.org/.
[4] The OpenFlow Switch Consortium. http://www.openflowswitch.org/.
[5] J. H. Ahn, N. Binkert, A. Davis, M. McLaren, and R. S. Schreiber. HyperX: Topology, routing, and packaging of efficient large-scale networks. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09), 2009.
[6] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center network architecture. In SIGCOMM, 2008.
[7] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat. Hedera: Dynamic flow scheduling for data center networks. In NSDI, 2010.
[8] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan. DCTCP: Efficient packet transport for the commoditized data center. In SIGCOMM, 2010.
[9] Amazon Elastic Compute Cloud (EC2). http://aws.amazon.com/ec2/.
[10] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In SOSP, pages 164-177, 2003.
[11] R. Braden, D. Clark, and S. Shenker. Integrated services in the internet architecture: An overview. Technical report, IETF, Network Working Group, June 1994.
[12] C. Clos. A study of non-blocking switching networks. Bell System Technical Journal, 32(5):406-424, 1953.
[13] J. R. Correa and M. X. Goemans. Improved bounds on nonblocking 3-stage Clos networks. SIAM J. Comput., 37(3):870-894, 2007.
[14] A. R. Curtis, W. Kim, and P. Yalagandula. Mahout: Low-Overhead Datacenter Traffic Management using End-Host-Based Elephant Detection. Technical Report HPL-2010-91, HP Labs, 2010.
[15] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.
[16] N. Farrington, G. Porter, S. Radhakrishnan, H. H. Bazzaz, V. Subramanya, Y. Fainman, G. Papen, and A. Vahdat. Helios: A hybrid electrical/optical switch architecture for modular data centers. In SIGCOMM, 2010.
[17] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. Maltz, P. Patel, and S. Sengupta. VL2: A scalable and flexible data center network. In SIGCOMM, 2009.
[18] Hadoop MapReduce. http://hadoop.apache.org/mapreduce/.
[19] C. Hopps. Analysis of an Equal-Cost Multi-Path Algorithm. RFC 2992 (Informational), Nov. 2000.
[20] IANA DSCP registry. http://www.iana.org/assignments/dscp-registry.
[21] S. Kandula, S. Sengupta, A. Greenberg, and P. Patel. The nature of datacenter traffic: Measurements & analysis. In IMC, 2009.
[22] J. Kim and W. J. Dally. Flattened butterfly: A cost-efficient topology for high-radix networks. In ISCA, 2007.
[23] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner. OpenFlow: Enabling innovation in campus networks. ACM CCR, 2008.
[24] J. C. Mogul, J. Tourrilhes, P. Yalagandula, P. Sharma, A. R. Curtis, and S. Banerjee. DevoFlow: Cost-effective flow management for high performance enterprise networks. In HotNets, 2010.
[25] T. Mori, M. Uchida, R. Kawahara, J. Pan, and S. Goto. Identifying elephant flows through periodically sampled packets. In Proc. IMC, pages 115-120, Taormina, Oct. 2004.
[26] K. Nichols, S. Blake, F. Baker, and D. Black. Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers. RFC 2474 (Proposed Standard), Dec. 1998.
[27] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield. Class-of-service mapping for QoS: A statistical signature-based approach to IP traffic classification. In IMC '04, pages 135-148, 2004.
[28] A. Tavakoli, M. Casado, T. Koponen, and S. Shenker. Applying NOX to the datacenter. In HotNets-VIII, 2009.
[29] VMware. http://www.vmware.com/.
[30] Xen. http://www.xen.org/.
2000.overhead yet effective traffic management system based on this [20] lANA DSCP registry. http://www.iana.orglassignments/dscp-registry. [21] S. Kandula,S. Sengupta, A. Greenberg, and P. Patel. The nature ofidea. Our experimental results show that our system can detect datacenter traffic: Measurements & analysis. In IMC, 2009.elephant flows an order of magnitude sooner than polling [22] J. Kim and W. J. Dally. Flattened butterHy: A cost-efficient topologybased approaches while incurring an order of magnitude lower for high-radix networks. In ISCA, 2007. [23] N. McKeown,T. Anderson,H. Balakrishnan,G. Parulkar,L. Peterson,controller overhead than other approaches. J. Rexford,S. Shenker,and J. Turner. OpenFlow: Enabling Innovation in Campus Networks. ACM CCR, 2008. ACKNOWLEDGEMENTS [24] J. C. Mogul, J. Tourrilhes,P. Yalagandula,P. Sharma, A. R. Curtis, We sincerely thank Sujata Banerjee, Jeff Mogul, Puneet and S. Banerjee. DevoHow: Cost-effective How management for high performance enterprise networks. In HotNets, 2010.Sharma, and Jean Tourrilhes for several beneficial discussions [25] T. Mori, M. Uchida, R. Kawahara, J. Pan, and S. Goto. Identifyingand comments on earlier drafts of this paper. elephant Hows through periodically sampled packets. In Proc. IMC, pages 115-120,Taormina,Oct. 2004. REFERENCES [26] K. Nichols, S. Blake, F. Baker, and D. Black. Definition of the Differentiated Services Field (OS Field) in the IPv4 and IPv6 Headers. [I] IEEE Std. 802.IQ-2005,Virtual Bridged Local Area Networks. RFC 2474 (Proposed Standard),Dec. 1998. [2] OpenFlow Switch Specification, Version 1.0.0. [27] M. Roughan,S. Sen,O. Spatscheck,and N. Duffield. Class-of-service http://www.openHowswitch. orgidocuments/openHow-spec-v1.0.0.pdf. mapping for qos: A statistical signature-based approach to ip traffic [3] sFlow. http://www.sHow.orgi. classification. In In IMC04, pages 135-148,2004. [4] The OpenFlow Switch Consortium. http://www.openHowswitch.orgi. [28] A. Tavakoli,M. Casado,T. Koponen, and S. Shenker. Applying NOX [5] J. H. Ahn,N. Binkert,A. Davis,M. McLaren,and R. S. Schreiber. Hy- to the datacenter. In HotNets-Vlll, 2009. perx: topology,routing,and packaging of efficient large-scale networks. [29] VMware. In Proceedings of the Conference on High Performance Computing [30] Xen. Networking, Storage and Analysis (SC 09), 2009. 1637