VL2: A Scalable and Flexible Data Center Network



  1. VL2: A Scalable and Flexible Data Center Network
Albert Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, David A. Maltz, Parveen Patel, Sudipta Sengupta
Microsoft Research

Abstract
To be agile and cost effective, data centers should allow dynamic resource allocation across large server pools. In particular, the data center network should enable any server to be assigned to any service. To meet these goals, we present VL2, a practical network architecture that scales to support huge data centers with uniform high capacity between servers, performance isolation between services, and Ethernet layer-2 semantics. VL2 uses (1) flat addressing to allow service instances to be placed anywhere in the network, (2) Valiant Load Balancing to spread traffic uniformly across network paths, and (3) end-system based address resolution to scale to large server pools, without introducing complexity to the network control plane. VL2's design is driven by detailed measurements of traffic and fault data from a large operational cloud service provider. VL2's implementation leverages proven network technologies, already available at low cost in high-speed hardware implementations, to build a scalable and reliable network architecture. As a result, VL2 networks can be deployed today, and we have built a working prototype. We evaluate the merits of the VL2 design using measurement, analysis, and experiments. Our VL2 prototype shuffles 2.7 TB of data among 75 servers in 395 seconds – sustaining a rate that is 94% of the maximum possible.

Categories and Subject Descriptors: C.2.1 [Computer-Communication Network]: Network Architecture and Design
General Terms: Design, Performance, Reliability
Keywords: Data center network, commoditization

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGCOMM'09, August 17–21, 2009, Barcelona, Spain. Copyright 2009 ACM 978-1-60558-594-9/09/08 ...$10.00.

1. INTRODUCTION
Cloud services are driving the creation of data centers that hold tens to hundreds of thousands of servers and that concurrently support a large number of distinct services (e.g., search, email, map-reduce computations, and utility computing). The motivations for building such shared data centers are both economic and technical: to leverage the economies of scale available to bulk deployments and to benefit from the ability to dynamically reallocate servers among services as workload changes or equipment fails [ , ]. The cost is also large — upwards of $12 million per month for a 50,000 server data center — with the servers themselves comprising the largest cost component. To be profitable, these data centers must achieve high utilization, and key to this is the property of agility — the capacity to assign any server to any service.

Agility promises improved risk management and cost savings. Without agility, each service must pre-allocate enough servers to meet difficult to predict demand spikes, or risk failure at the brink of success. With agility, the data center operator can meet the fluctuating demands of individual services from a large shared server pool, resulting in higher server utilization and lower costs.

Unfortunately, the designs for today's data center network prevent agility in several ways. First, existing architectures do not provide enough capacity between the servers they interconnect. Conventional architectures rely on tree-like network configurations built from high-cost hardware. Due to the cost of the equipment, the capacity between different branches of the tree is typically oversubscribed by factors of 1:5 or more, with paths through the highest levels of the tree oversubscribed by factors of 1:80 to 1:240. This limits communication between servers to the point that it fragments the server pool — congestion and computation hot-spots are prevalent even when spare capacity is available elsewhere. Second, while data centers host multiple services, the network does little to prevent a traffic flood in one service from affecting the other services around it — when one service experiences a traffic flood, it is common for all those sharing the same network sub-tree to suffer collateral damage. Third, the routing design in conventional networks achieves scale by assigning servers topologically significant IP addresses and dividing servers among VLANs. Such fragmentation of the address space limits the utility of virtual machines, which cannot migrate out of their original VLAN while keeping the same IP address. Further, the fragmentation of address space creates an enormous configuration burden when servers must be reassigned among services, and the human involvement typically required in these reconfigurations limits the speed of deployment.

To overcome these limitations in today's design and achieve agility, we arrange for the network to implement a familiar and concrete model: give each service the illusion that all the servers assigned to it, and only those servers, are connected by a single non-interfering Ethernet switch — a Virtual Layer 2 — and maintain this illusion even as the size of each service varies from 1 server to 100,000. Realizing this vision concretely translates into building a network that meets the following three objectives:

• Uniform high capacity: The maximum rate of a server-to-server traffic flow should be limited only by the available capacity on the network-interface cards of the sending and receiving servers, and assigning servers to a service should be independent of network topology.

• Performance isolation: Traffic of one service should not be affected by the traffic of any other service, just as if each service was connected by a separate physical switch.

• Layer-2 semantics: Just as if the servers were on a LAN — where any IP address can be connected to any port of an Ethernet switch due to flat addressing — data-center management software should be able to easily assign any server to any service and configure
  2. that server with whatever IP address the service expects. Virtual machines should be able to migrate to any server while keeping the same IP address, and the network configuration of each server should be identical to what it would be if connected via a LAN. Finally, features like link-local broadcast, on which many legacy applications depend, should work.

Figure 1: A conventional network architecture for data centers (adapted from figure by Cisco [ ]).

In this paper we design, implement and evaluate VL2, a network architecture for data centers that meets these three objectives and thereby provides agility. In creating VL2, a goal was to investigate whether we could create a network architecture that could be deployed today, so we limit ourselves from making any changes to the hardware of the switches or servers, and we require that legacy applications work unmodified. However, the software and operating systems on data-center servers are already extensively modified (e.g., to create hypervisors for virtualization or blob file-systems to store data). Therefore, VL2's design explores a new split in the responsibilities between host and network — using a layer 2.5 shim in servers' network stack to work around limitations of the network devices. No new switch software or APIs are needed.

VL2 consists of a network built from low-cost switch ASICs arranged into a Clos topology [ ] that provides extensive path diversity between servers. Our measurements show data centers have tremendous volatility in their workload, their traffic, and their failure patterns. To cope with this volatility, we adopt Valiant Load Balancing (VLB) [ , ] to spread traffic across all available paths without any centralized coordination or traffic engineering. Using VLB, each server independently picks a path at random through the network for each of the flows it sends to other servers in the data center. Common concerns with VLB, such as the extra latency and the consumption of extra network capacity caused by path stretch, are overcome by a combination of our environment (propagation delay is very small inside a data center) and our topology (which includes an extra layer of switches that packets bounce off of). Our experiments verify that our choice of using VLB achieves both the uniform capacity and performance isolation objectives.

The feasibility of our design rests on several questions that we experimentally evaluate. First, the theory behind Valiant Load Balancing, which proves that the network will be hot-spot free, requires that (a) randomization is performed at the granularity of small packets, and (b) the traffic sent into the network conforms to the hose model [ ]. For practical reasons, however, VL2 picks a different path for each flow rather than packet (falling short of (a)), and it also relies on TCP to police the offered traffic to the hose model (falling short of (b), as TCP needs multiple RTTs to conform traffic to the hose model). Nonetheless, our experiments show that for data-center traffic, the VL2 design choices are sufficient to offer the desired hot-spot free properties in real deployments. Second, the directory system that provides the routing information needed to reach servers in the data center must be able to handle heavy workloads at very low latency. We show that designing and implementing such a directory system is achievable.

The switches that make up the network operate as layer-3 routers with routing tables calculated by OSPF, thereby enabling the use of multiple paths (unlike Spanning Tree Protocol) while using a well-trusted protocol. However, the IP addresses used by services running in the data center cannot be tied to particular switches in the network, or the agility to reassign servers between services would be lost. Leveraging a trick used in many systems [ ], VL2 assigns servers IP addresses that act as names alone, with no topological significance. When a server sends a packet, the shim-layer on the server invokes a directory system to learn the actual location of the destination and then tunnels the original packet there. The shim-layer also helps eliminate the scalability problems created by ARP in layer-2 networks, and the tunneling improves our ability to implement VLB. These aspects of the design enable VL2 to provide layer-2 semantics, while eliminating the fragmentation and waste of server pool capacity that the binding between addresses and locations causes in the existing architecture.

Taken together, VL2's choices of topology, routing design, and software architecture create a huge shared pool of network capacity that each pair of servers can draw from when communicating. We implement VLB by causing the traffic between any pair of servers to bounce off a randomly chosen switch in the top level of the Clos topology and leverage the features of layer-3 routers, such as Equal-Cost MultiPath (ECMP), to spread the traffic along multiple subpaths for these two path segments. Further, we use anycast addresses and an implementation of Paxos [ ] in a way that simplifies the design of the Directory System and, when failures occur, provides consistency properties that are on par with existing protocols.

In the remainder of this paper we will make the following contributions, in roughly this order.

• We make a first of its kind study of the traffic patterns in a production data center, and find that there is tremendous volatility in the traffic, cycling among 50-60 different patterns during a day and spending less than 100 s in each pattern at the 60th percentile.

• We design, build, and deploy every component of VL2 in an 80-server cluster. Using the cluster, we experimentally validate that VL2 has the properties set out as objectives, such as uniform capacity and performance isolation. We also demonstrate the speed of the network, such as its ability to shuffle 2.7 TB of data among 75 servers in 395 s.

• We apply Valiant Load Balancing in a new context, the inter-switch fabric of a data center, and show that flow-level traffic splitting achieves almost identical split ratios (within 1% of optimal fairness index) on realistic data center traffic, and it smoothes utilization while eliminating persistent congestion.

• We justify the design trade-offs made in VL2 by comparing the cost of a VL2 network with that of an equivalent network based on existing designs.

2. BACKGROUND
In this section, we first explain the dominant design pattern for data-center architecture today [ ]. We then discuss why this architecture is insufficient to serve large cloud-service data centers.

As shown in Figure 1, the network is a hierarchy reaching from a layer of servers in racks at the bottom to a layer of core routers at the top. There are typically 20 to 40 servers per rack, each singly connected to a Top of Rack (ToR) switch with a 1 Gbps link. ToRs connect to two aggregation switches for redundancy, and these switches aggregate further connecting to access routers. At the top of the hierarchy, core routers carry traffic between access routers and manage traffic into and out of the data center.

  3. All links use Ethernet as a physical-layer protocol, with a mix of copper and fiber cabling. All switches below each pair of access routers form a single layer-2 domain, typically connecting several thousand servers. To limit overheads (e.g., packet flooding and ARP broadcasts) and to isolate different services or logical server groups (e.g., email, search, web front ends, web back ends), servers are partitioned into virtual LANs (VLANs). Unfortunately, this conventional design suffers from three fundamental limitations:

Limited server-to-server capacity: As we go up the hierarchy, we are confronted with steep technical and financial barriers in sustaining high bandwidth. Thus, as traffic moves up through the layers of switches and routers, the over-subscription ratio increases rapidly. For example, servers typically have 1:1 over-subscription to other servers in the same rack — that is, they can communicate at the full rate of their interfaces (e.g., 1 Gbps). We found that up-links from ToRs are typically 1:5 to 1:20 oversubscribed (i.e., 1 to 4 Gbps of up-link for 20 servers), and paths through the highest layer of the tree can be 1:240 oversubscribed. This large over-subscription factor fragments the server pool by preventing idle servers from being assigned to overloaded services, and it severely limits the entire data-center's performance.

Fragmentation of resources: As the cost and performance of communication depends on distance in the hierarchy, the conventional design encourages service planners to cluster servers nearby in the hierarchy. Moreover, spreading a service outside a single layer-2 domain frequently requires reconfiguring IP addresses and VLAN trunks, since the IP addresses used by servers are topologically determined by the access routers above them. The result is a high turnaround time for such reconfiguration. Today's designs avoid this reconfiguration lag by wasting resources; the plentiful spare capacity throughout the data center is often effectively reserved by individual services (and not shared), so that each service can scale out to nearby servers to respond rapidly to demand spikes or to failures. Despite this, we have observed instances when the growing resource needs of one service have forced data center operations to evict other services from nearby servers, incurring significant cost and disruption.

Poor reliability and utilization: Above the ToR, the basic resilience model is 1:1, i.e., the network is provisioned such that if an aggregation switch or access router fails, there must be sufficient remaining idle capacity on a counterpart device to carry the load. This forces each device and link to be run up to at most 50% of its maximum utilization. Further, multiple paths either do not exist or aren't effectively utilized. Within a layer-2 domain, the Spanning Tree Protocol causes only a single path to be used even when multiple paths between switches exist. In the layer-3 portion, Equal Cost Multipath (ECMP), when turned on, can use multiple paths to a destination if paths of the same cost are available. However, the conventional topology offers at most two paths.

3. MEASUREMENTS & IMPLICATIONS
To design VL2, we first needed to understand the data center environment in which it would operate. Interviews with architects, developers, and operators led to the objectives described in Section 1, but developing the mechanisms on which to build the network requires a quantitative understanding of the traffic matrix (who sends how much data to whom and when?) and churn (how often does the state of the network change due to changes in demand or switch/link failures and recoveries, etc.?). We analyze these aspects by studying production data centers of a large cloud service provider and use the results to justify our design choices as well as the workloads used to stress the VL2 testbed.

Figure 2: Mice are numerous; 99% of flows are smaller than 100 MB. However, more than 90% of bytes are in flows between 100 MB and 1 GB.

Our measurement studies found two key results with implications for the network design. First, the traffic patterns inside a data center are highly divergent (as even over 50 representative traffic matrices only loosely cover the actual traffic matrices seen), and they change rapidly and unpredictably. Second, the hierarchical topology is intrinsically unreliable — even with huge effort and expense to increase the reliability of the network devices close to the top of the hierarchy, we still see failures on those devices resulting in significant downtimes.

3.1 Data-Center Traffic Analysis
Analysis of Netflow and SNMP data from our data centers reveals several macroscopic trends. First, the ratio of traffic volume between servers in our data centers to traffic entering/leaving our data centers is currently around 4:1 (excluding CDN applications). Second, data-center computation is focused where high speed access to data on memory or disk is fast and cheap. Although data is distributed across multiple data centers, intense computation and communication on data does not straddle data centers due to the cost of long-haul links. Third, the demand for bandwidth between servers inside a data center is growing faster than the demand for bandwidth to external hosts. Fourth, the network is a bottleneck to computation. We frequently see ToR switches whose uplinks are above 80% utilization.

To uncover the exact nature of traffic inside a data center, we instrumented a highly utilized 1,500 node cluster in a data center that supports data mining on petabytes of data. The servers are distributed roughly evenly across 75 ToR switches, which are connected hierarchically as shown in Figure 1. We collected socket-level event logs from all machines over two months.

3.2 Flow Distribution Analysis
Distribution of flow sizes: Figure 2 illustrates the nature of flows within the monitored data center. The flow size statistics (marked as '+'s) show that the majority of flows are small (a few KB); most of these small flows are hellos and meta-data requests to the distributed file system. To examine longer flows, we compute a statistic termed total bytes (marked as 'o's) by weighting each flow size by its number of bytes. Total bytes tells us, for a random byte, the distribution of the flow size it belongs to. Almost all the bytes in the data center are transported in flows whose lengths vary from about 100 MB to about 1 GB. The mode at around 100 MB springs from the fact that the distributed file system breaks long files into 100-MB size chunks. Importantly, flows over a few GB are rare.
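The total bytes statistic described above can be reproduced from a flow log by weighting each flow by its own size. A minimal sketch of the computation (the function name and input format are ours, not from the paper):

```python
from itertools import accumulate

def flow_and_byte_cdfs(flow_sizes):
    """Given a list of flow sizes in bytes, return (sizes, flow_cdf, byte_cdf).

    flow_cdf[i]: fraction of flows with size <= sizes[i] (the '+' curve).
    byte_cdf[i]: fraction of *bytes* carried in flows with size <= sizes[i],
                 i.e. each flow weighted by its own size (the 'o' curve).
    """
    sizes = sorted(flow_sizes)
    n, total = len(sizes), sum(sizes)
    flow_cdf = [(i + 1) / n for i in range(n)]
    byte_cdf = [c / total for c in accumulate(sizes)]
    return sizes, flow_cdf, byte_cdf
```

On a log dominated by mice with a few elephants, the two curves separate sharply: most flows fall under a small size while most bytes live in the largest flows, which is exactly the shape Figure 2 reports.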
  4. 4. 0.04 1 Fraction of Time PDF 40 Index of the Containing Cluster 0.8 Cumulative 0.03 CDF 35 300 0.6 30 200 0.02 0.4 Frequency Frequency 25 200 0.01 0.2 20 50 100 0 0 15 100 1 10 100 1000 10 Number of Concurrent flows in/out of each Machine 5 0 0 0 Figure : Number of concurrent connections has two modes: ( ) 0 200 400 600 800 1000 0 5 10 20 2.0 3.0 4.0 ows per node more than of the time and ( ) ows per Time in 100s intervals Run Length log(Time to Repeat) node for at least of the time. (a) (b) (c) Figure : Lack of short-term predictability: e cluster to which Similar to Internet ow characteristics [ ], we nd that there are a tra c matrix belongs, i.e., the type of tra c mix in the TM, myriad small ows (mice). On the other hand, as compared with changes quickly and randomly. Internet ows, the distribution is simpler and more uniform. e reason is that in data centers, internal ows arise in an engineered Surprisingly, the number of representative tra c matrices in environment driven by careful design decisions (e.g., the -MB our data center is quite large. On a timeseries of TMs, indicat- chunk size is driven by the need to amortize disk-seek times over ing a day’s worth of tra c in the datacenter, even when approximat- read times) and by strong incentives to use storage and analytic tools ing with 50 − 60 clusters, the tting error remains high ( ) and with well understood resilience and performance. only decreases moderately beyond that point. is indicates that the Number of Concurrent Flows: Figure shows the probability variability in datacenter tra c is not amenable to concise summa- density function (as a fraction of time) for the number of concur- rization and hence engineering routes for just a few tra c matrices rent ows going in and out of a machine, computed over all , is unlikely to work well for the tra c encountered in practice. monitored machines for a representative day’s worth of ow data. 
Instability of tra c patterns: Next we ask how predictable is ere are two modes. More than of the time, an average ma- the tra c in the next interval given the current tra c? Tra c pre- chine has about ten concurrent ows, but at least of the time it dictability enhances the ability of an operator to engineer routing has greater than concurrent ows. We almost never see more as tra c demand changes. To analyze the predictability of tra c in than concurrent ows. the network, we nd the best TM clusters using the technique e distributions of ow size and number of concurrent ows above and classify the tra c matrix for each s interval to the both imply that VLB will perform well on this tra c. Since even big best tting cluster. Figure (a) shows that the tra c pattern changes ows are only MB ( s of transmit time at Gbps), randomiz- nearly constantly, with no periodicity that could help predict the fu- ing at ow granularity (rather than packet) will not cause perpetual ture. Figure (b) shows the distribution of run lengths - how many congestion if there is unlucky placement of a few ows. Moreover, intervals does the network tra c pattern spend in one cluster be- adaptive routing schemes may be di cult to implement in the data fore shi ing to the next. e run length is to the th percentile. center, since any reactive tra c engineering will need to run at least Figure (c) shows the time between intervals where the tra c maps once a second if it wants to react to individual ows. to the same cluster. But for the mode at s caused by transitions within a run, there is no structure to when a tra c pattern will next 3.3 Traffic Matrix Analysis appear. Poor summarizability of tra c patterns: Next, we ask the e lack of predictability stems from the use of randomness to question: Is there regularity in the tra c that might be exploited improve the performance of data-center applications. For exam- through careful measurement and tra c engineering? 
If tra c in the ple, the distributed le system spreads data chunks randomly across DC were to follow a few simple patterns, then the network could be servers for load distribution and redundancy. e volatility implies easily optimized to be capacity-e cient for most tra c. To answer, that it is unlikely that other routing strategies will outperform VLB. we examine how the Tra c Matrix(TM) of the , server cluster changes over time. For computational tractability, we compute the 3.4 Failure Characteristics ToR-to-ToR TM — the entry T M (t)i,j is the number of bytes sent To design VL to tolerate the failures and churn found in data from servers in ToR i to servers in ToR j during the s beginning centers, we collected failure logs for over a year from eight produc- at time t. We compute one TM for every s interval, and servers tion data centers that comprise hundreds of thousands of servers, outside the cluster are treated as belonging to a single “ToR”. host over a hundred cloud services and serve millions of users. We Given the timeseries of TMs, we nd clusters of similar TMs analyzed hardware and so ware failures of switches, routers, load using a technique due to Zhang et al. [ ]. In short, the technique balancers, rewalls, links and servers using SNMP polling/traps, recursively collapses the tra c matrices that are most similar to each syslogs, server alarms, and transaction monitoring frameworks. In other into a cluster, where the distance (i.e., similarity) re ects how all, we looked at M error events from over K alarm tickets. much tra c needs to be shu ed to make one TM look like the What is the pattern of networking equipment failures? We other. We then choose a representative TM for each cluster, such de ne a failure as the event that occurs when a system or compo- that any routing that can deal with the representative TM performs nent is unable to perform its required function for more than s. no worse on every TM in the cluster. 
Using a single representative As expected, most failures are small in size (e.g., of network TM per cluster yields a tting error (quanti ed by the distances be- device failures involve devices and of network device fail- tween each representative TMs and the actual TMs they represent), ures involve devices) while large correlated failures are rare which will decrease as the number of clusters increases. Finally, if (e.g., the largest correlated failure involved switches). However, there is a knee point (i.e., a small number of clusters that reduces downtimes can be signi cant: of failures are resolved in min, the tting error considerably), the resulting set of clusters and their in hr, . in day, but . last days. representative TMs at that knee corresponds to a succinct number What is the impact of networking equipment failure? As dis- of distinct tra c matrices that summarize all TMs in the set. cussed in Section , conventional data center networks apply : re-
  5. 5. Link-state network Internet Separating names from locators: e data center network carrying only LAs (e.g., 10/8) must support agility, which means, in particular, support for hosting DA/2 x Intermediate Switches any service on any server, for rapid growing and shrinking of server Int ... pools, and for rapid virtual machine migration. In turn, this calls DI x10G for separating names from locations. VL ’s addressing scheme sep- DA/2 x 10G arates server names, termed application-speci c addresses (AAs), Aggr ... from their locations, termed location-speci c addresses (LAs). VL DI x Aggregate Switches uses a scalable, reliable directory system to maintain the mappings 2 x10G DA/2 x 10G DADI/4 x ToR Switches between names and locators. A shim layer running in the network ... ToR stack on every server, called the VL agent, invokes the directory system’s resolution service. We evaluate the performance of the di- `Y ecbdcba 20(DADI/4) x Servers rectory system in § . . .... Fungible pool of Embracing End Systems: e rich and homogeneous pro- servers owning AAs (e.g., 20/8) grammability available at data-center hosts provides a mechanism to rapidly realize new functionality. For example, the VL agent en- Figure : An example Clos network between Aggregation and In- ables ne-grained path control by adjusting the randomization used termediate switches provides a richly-connected backbone well- in VLB. e agent also replaces Ethernet’s ARP functionality with suited for VLB. e network is built with two separate address queries to the VL directory system. e directory system itself is families — topologically signi cant Locator Addresses (LAs) and at Application Addresses (AAs). also realized on servers, rather than switches, and thus o ers exi- bility, such as ne-grained, context-aware server access control and dynamic service re-provisioning. 
dundancy to improve reliability at higher layers of the hierarchical tree. Despite these techniques, we find that in 0.3% of failures all redundant components in a network device group became unavailable (e.g., the pair of switches that comprise each node in the conventional network (Figure 1) or both the uplinks from a switch). In one incident, the failure of a core switch (due to a faulty supervisor card) affected ten million users for about four hours. We found the main causes of these downtimes are network misconfigurations, firmware bugs, and faulty components (e.g., ports). With no obvious way to eliminate all failures from the top of the hierarchy, VL2's approach is to broaden the topmost levels of the network so that the impact of failures is muted and performance degrades gracefully, moving from 1:1 redundancy to n:m redundancy.

4. VIRTUAL LAYER TWO NETWORKING

Before detailing our solution, we briefly discuss our design principles and preview how they will be used in the VL2 design.

Randomizing to Cope with Volatility: VL2 copes with the high divergence and unpredictability of data-center traffic matrices by using Valiant Load Balancing (VLB) to do destination-independent (e.g., random) traffic spreading across multiple intermediate nodes. We introduce our network topology suited for VLB in §4.1, and the corresponding flow spreading mechanism in §4.2. VLB, in theory, ensures a non-interfering packet-switched network [ ], the counterpart of a non-blocking circuit-switched network, as long as (a) traffic spreading ratios are uniform, and (b) the offered traffic patterns do not violate edge constraints (i.e., line card speeds). To meet the latter condition, we rely on TCP's end-to-end congestion control mechanism. While our mechanisms to realize VLB do not perfectly meet either of these conditions, we show in §5 that our scheme's performance is close to the optimum.

Building on proven networking technology: VL2 is based on IP routing and forwarding technologies that are already available in commodity switches: link-state routing, equal-cost multi-path (ECMP) forwarding, IP anycasting, and IP multicasting. VL2 uses a link-state routing protocol to maintain the switch-level topology, but not to disseminate end hosts' information. This strategy protects switches from needing to learn voluminous, frequently-changing host information. Furthermore, the routing design uses ECMP forwarding along with anycast addresses to enable VLB with minimal control-plane messaging or churn.

We next describe each aspect of the VL2 system and how the pieces work together to implement a virtual layer-2 network. These aspects include the network topology, the addressing and routing design, and the directory that manages name-location mappings.

4.1 Scale-out Topologies

As described in §2, conventional hierarchical data-center topologies have poor bisection bandwidth and are also susceptible to major disruptions due to device failures at the highest levels. Rather than scale up individual network devices with more capacity and features, we scale out the devices — building a broad network that offers huge aggregate capacity using a large number of simple, inexpensive devices, as shown in the topology figure. This is an example of a folded Clos network [ ] in which the links between the Intermediate switches and the Aggregation switches form a complete bipartite graph. As in the conventional topology, ToRs connect to two Aggregation switches, but the large number of paths between any two Aggregation switches means that if there are n Intermediate switches, the failure of any one of them reduces the bisection bandwidth by only 1/n — a desirable graceful degradation of bandwidth that we evaluate in §5. Further, it is easy and less expensive to build a Clos network for which there is no over-subscription (further discussion of cost appears in §6). For example, with DA-port Aggregation and DI-port Intermediate switches connected in this way, the capacity between the two layers is DI·DA/2 times the link capacity.

The Clos topology is exceptionally well suited for VLB in that, by indirectly forwarding traffic through an Intermediate switch at the top tier or "spine" of the network, the network can provide bandwidth guarantees for any traffic matrix subject to the hose model. Meanwhile, routing is extremely simple and resilient on this topology — take a random path up to a random Intermediate switch and a random path down to the destination ToR switch.

VL2 leverages the fact that at every generation of technology, switch-to-switch links are typically faster than server-to-switch links, and trends suggest that this gap will remain. Our current design uses 1G server links and 10G switch links, and the next design point will probably be 10G server links with 40G switch links. By exploiting this gap, we reduce the number of cables required to implement the Clos (as compared with a fat-tree [ ]), and we simplify the task of spreading load over the links (§4.2).
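As a concrete illustration of the DI·DA/2 capacity formula and the 1/n graceful degradation described above, the following sketch computes the inter-layer capacity of the folded Clos. The port counts and the 10 Gbps link speed are assumed values for illustration; this is not code from the paper.

```python
def clos_capacity(d_a: int, d_i: int, link_gbps: float = 10.0):
    """Aggregate inter-layer capacity of a folded Clos built from
    d_a-port Aggregation switches and d_i-port Intermediate switches."""
    n_int = d_a // 2       # each Aggregation switch points half its ports up
    n_agg = d_i            # each Intermediate switch links to every Aggregation switch
    links = n_agg * n_int  # complete bipartite graph: d_i * d_a / 2 links
    return n_int, n_agg, links * link_gbps

def bisection_after_failures(capacity_gbps: float, n_int: int, failed: int) -> float:
    """Losing one of n Intermediate switches removes only 1/n of the bandwidth."""
    return capacity_gbps * (n_int - failed) / n_int
```

With hypothetical 96-port Aggregation and 72-port Intermediate switches, the sketch yields 48 Intermediate switches and 34,560 Gbps between the layers; one Intermediate failure removes only 1/48 of that capacity.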
Figure: VLB in an example VL2 network. Sender S sends packets to destination D via a randomly-chosen Intermediate switch using IP-in-IP encapsulation. The switches form a link-state network with LAs drawn from 10/8, while the servers sit in IP subnets with AAs drawn from 20/8. H(ft) denotes a hash of the five-tuple.

4.2 VL2 Addressing and Routing

This section explains how packets flow through a VL2 network, and how the topology, routing design, VL2 agent, and directory system combine to virtualize the underlying network fabric — creating the illusion that hosts are connected to a big, non-interfering, data-center-wide layer-2 switch.

4.2.1 Address resolution and packet forwarding

VL2 uses two different IP-address families, as illustrated in the figure. The network infrastructure operates using location-specific IP addresses (LAs); all switches and interfaces are assigned LAs, and switches run an IP-based (layer-3) link-state routing protocol that disseminates only these LAs. This allows switches to obtain the complete switch-level topology, as well as forward packets encapsulated with LAs along shortest paths. Applications, on the other hand, use application-specific IP addresses (AAs), which remain unaltered no matter how servers' locations change due to virtual-machine migration or re-provisioning. Each AA (server) is associated with an LA, the identifier of the ToR switch to which the server is connected. The VL2 directory system stores the mapping of AAs to LAs, and this mapping is created when application servers are provisioned to a service and assigned AA addresses.

The crux of offering layer-2 semantics is having servers believe they share a single large IP subnet (i.e., the entire AA space) with other servers in the same service, while eliminating the ARP and DHCP scaling bottlenecks that plague large Ethernets.

Packet forwarding: To route traffic between servers, which use AA addresses, over an underlying network that knows routes only for LA addresses, the VL2 agent at each server traps packets from the host and encapsulates each packet with the LA address of the destination's ToR, as shown in the figure. Once the packet arrives at the LA (the destination ToR), the switch decapsulates the packet and delivers it to the destination AA carried in the inner header.

Address resolution: Servers in each service are configured to believe that they all belong to the same IP subnet. Hence, when an application sends a packet to an AA for the first time, the networking stack on the host generates a broadcast ARP request for the destination AA. The VL2 agent running on the host intercepts this ARP request and converts it to a unicast query to the VL2 directory system. The directory system answers the query with the LA of the ToR to which packets should be tunneled. The VL2 agent caches this AA-to-LA mapping, similar to a host's ARP cache, so that subsequent communication need not entail a directory lookup.

Access control via the directory service: A server cannot send packets to an AA if the directory service refuses to provide it with an LA through which it can route its packets. This means that the directory service can enforce access-control policies. Further, since the directory system knows which server is making the request when handling a lookup, it can enforce fine-grained isolation policies. For example, it could enforce the policy that only servers belonging to the same service can communicate with each other. An advantage of VL2 is that, when inter-service communication is allowed, packets flow directly from source to destination, without being detoured to an IP gateway, as is required to connect two VLANs in the conventional architecture.

These addressing and forwarding mechanisms were chosen for two reasons. First, they make it possible to use low-cost switches, which often have small routing tables (typically just 16K entries) that can hold only LA routes, without concern for the huge number of AAs. Second, they reduce overhead in the network control plane by shielding it from the churn in host state, tasking the more scalable directory system with that burden instead.

4.2.2 Random traffic spreading over multiple paths

To offer hot-spot-free performance for arbitrary traffic matrices, VL2 uses two related mechanisms: VLB and ECMP. The goals of both are similar — VLB distributes traffic across a set of intermediate nodes and ECMP distributes it across equal-cost paths — but each is needed to overcome limitations in the other. VL2 uses flows, rather than packets, as the basic unit of traffic spreading and thus avoids out-of-order delivery.

The figure illustrates how the VL2 agent uses encapsulation to implement VLB by sending traffic through a randomly-chosen Intermediate switch. The packet is first delivered to one of the Intermediate switches, decapsulated by the switch, delivered to the ToR's LA, decapsulated again, and finally sent to the destination.

While encapsulating packets to a specific, but randomly chosen, Intermediate switch correctly realizes VLB, it would require updating a potentially huge number of VL2 agents whenever an Intermediate switch's availability changes due to switch or link failures. Instead, we assign the same LA address to all Intermediate switches, and the directory system returns this anycast address to agents upon lookup. Since all Intermediate switches are exactly three hops away from a source host, ECMP takes care of delivering packets encapsulated with the anycast address to any one of the active Intermediate switches. Upon switch or link failures, ECMP reacts on its own, eliminating the need to notify agents and ensuring scalability.

In practice, however, the use of ECMP leads to two problems. First, switches today support only up to 16-way ECMP, with 256-way ECMP being released by some vendors this year. If more paths are available than ECMP can use, then VL2 defines several anycast addresses, each associated with only as many Intermediate switches as ECMP can accommodate. When an Intermediate switch fails, VL2 reassigns the anycast addresses from that switch to other Intermediate switches so that all anycast addresses remain live, and servers can remain unaware of the network churn. Second, some inexpensive switches cannot correctly retrieve the five-tuple values (e.g., the TCP ports) when a packet is encapsulated with multiple IP headers. Thus, the agent at the source computes a hash of the five-tuple values and writes that value into the source IP address field of the encapsulation headers, which all switches do use in making ECMP forwarding decisions.

The greatest concern with both ECMP and VLB is that if "elephant flows" are present, the random placement of flows could lead to persistent congestion on some links while others are underutilized. Our evaluation did not find this to be a problem on data-center workloads (§5.2). Should it occur, initial results show the VL2 agent can detect and deal with such situations through simple mechanisms, such as re-hashing to change the path of large flows when TCP detects a severe congestion event (e.g., a full window loss).
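The forwarding path described above — hash the five-tuple, write the hash into the outer source address, then tunnel via the Intermediate anycast LA and the destination ToR's LA — can be sketched as follows. This is an illustrative model, not the actual VL2 agent: the anycast address, the dict-as-header representation, and the hash-to-address mapping are all assumptions.

```python
import hashlib

INT_ANYCAST_LA = "10.1.1.1"  # hypothetical anycast LA shared by all Intermediate switches

def five_tuple_hash(src, dst, proto, sport, dport) -> str:
    """Fold the flow's five-tuple into a pseudo source address, since some
    switches hash only on outer IP headers when making ECMP decisions."""
    d = hashlib.sha1(f"{src}|{dst}|{proto}|{sport}|{dport}".encode()).digest()
    return "10." + ".".join(str(b) for b in d[:3])

def encapsulate(packet: dict, dst_tor_la: str) -> dict:
    """Double IP-in-IP encapsulation: outer header to the Intermediate
    anycast LA, inner header to the destination ToR's LA."""
    h = five_tuple_hash(packet["src"], packet["dst"], packet["proto"],
                        packet["sport"], packet["dport"])
    inner = {"src": h, "dst": dst_tor_la, "payload": packet}
    return {"src": h, "dst": INT_ANYCAST_LA, "payload": inner}
```

Because the pseudo source address is a deterministic function of the five-tuple, every packet of a flow takes the same path (preserving in-order delivery), while distinct flows are spread across the Intermediate switches.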
4.2.3 Backwards Compatibility

This section describes how a VL2 network handles external traffic, as well as general layer-2 broadcast traffic.

Interaction with hosts in the Internet: 20% of the traffic handled in our cloud-computing data centers is to or from the Internet, so the network must be able to handle these large volumes. Since VL2 employs a layer-3 routing fabric to implement a virtual layer-2 network, the external traffic can flow directly across the high-speed silicon of the switches that make up VL2, without being forced through gateway servers to have their headers rewritten, as required by some designs (e.g., Monsoon [ ]).

Servers that need to be directly reachable from the Internet (e.g., front-end web servers) are assigned two addresses: an LA in addition to the AA used for intra-data-center communication with back-end servers. This LA is drawn from a pool that is announced via BGP and is externally reachable. Traffic from the Internet can then reach the server directly, and traffic from the server to external destinations exits toward the Internet from the Intermediate switches, while being spread across the egress links by ECMP.

Handling Broadcast: VL2 provides layer-2 semantics to applications for backwards compatibility, and that includes supporting broadcast and multicast. VL2 completely eliminates the most common sources of broadcast: ARP and DHCP. ARP is replaced by the directory system, and DHCP messages are intercepted at the ToR using conventional DHCP relay agents and unicast-forwarded to DHCP servers. To handle other general layer-2 broadcast traffic, every service is assigned an IP multicast address, and all broadcast traffic in that service is handled via IP multicast using the service-specific multicast address. The VL2 agent rate-limits broadcast traffic to prevent storms.

4.3 Maintaining Host Information using the VL2 Directory System

The VL2 directory provides three key functions: (1) lookups and (2) updates for AA-to-LA mappings, and (3) a reactive cache-update mechanism so that latency-sensitive updates (e.g., updating the AA-to-LA mapping for a virtual machine undergoing live migration) happen quickly. Our design goals are to provide scalability, reliability, and high performance.

4.3.1 Characterizing requirements

We expect the lookup workload for the directory system to be frequent and bursty. As discussed in Section 3, servers can communicate with up to hundreds of other servers in a short time period, with each flow generating a lookup for an AA-to-LA mapping. For updates, the workload is driven by failures and server startup events; as also discussed in Section 3, most failures are small in size, and large correlated failures are rare.

Performance requirements: The bursty nature of the workload implies that lookups require high throughput and low response time; hence, we choose 10ms as the maximum acceptable response time. For updates, however, the key requirement is reliability, and response time is less critical. Further, for updates that are scheduled ahead of time, as is typical of planned outages and upgrades, high throughput can be achieved by batching updates.

Consistency requirements: Conventional L2 networks provide eventual consistency for the IP-to-MAC address mapping, as hosts will use a stale MAC address to send packets until the ARP cache times out and a new ARP request is sent. VL2 aims for a similar goal: eventual consistency of AA-to-LA mappings, coupled with a reliable update mechanism.

Figure: VL2 Directory System Architecture.

4.3.2 Directory System Design

The differing performance requirements and workload patterns of lookups and updates led us to a two-tiered directory system architecture. Our design consists of (1) a modest number of read-optimized, replicated directory servers that cache AA-to-LA mappings and handle queries from VL2 agents, and (2) a small number of write-optimized, asynchronous replicated state machine (RSM) servers that offer a strongly consistent, reliable store of AA-to-LA mappings. The directory servers ensure low latency, high throughput, and high availability for a high lookup rate. Meanwhile, the RSM servers ensure strong consistency and durability, using the Paxos [ ] consensus algorithm, for a modest rate of updates.

Each directory server caches all the AA-to-LA mappings stored at the RSM servers and independently replies to lookups from agents using the cached state. Since strong consistency is not required, a directory server lazily synchronizes its local mappings with the RSM every 30 seconds. To achieve high availability and low latency, an agent sends a lookup to k (two in our prototype) randomly-chosen directory servers. If multiple replies are received, the agent simply chooses the fastest reply and stores it in its cache.

The network provisioning system sends directory updates to a randomly-chosen directory server, which then forwards the update to a RSM server. The RSM reliably replicates the update to every RSM server and then replies with an acknowledgment to the directory server, which in turn forwards the acknowledgment back to the originating client. As an optimization to enhance consistency, the directory server can optionally disseminate the acknowledged updates to a few other directory servers. If the originating client does not receive an acknowledgment within a timeout, the client sends the same update to another directory server, trading response time for reliability and availability.

Updating caches reactively: Since AA-to-LA mappings are cached at directory servers and in VL2 agents' caches, an update can lead to inconsistency. To resolve inconsistency without wasting server and network resources, our design employs a reactive cache-update mechanism. The cache-update protocol leverages this observation: a stale host mapping needs to be corrected only when that mapping is used to deliver traffic. Specifically, when a stale mapping is used, some packets arrive at a stale LA — a ToR that no longer hosts the destination server. The ToR may forward a sample of such non-deliverable packets to a directory server, triggering the directory server to gratuitously correct the stale mapping in the source's cache via unicast.
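The agent-side lookup path just described — query k randomly-chosen directory servers, take the fastest reply, and cache it like an ARP entry — can be sketched as follows. This is an illustrative model under assumed interfaces (directory servers as callables returning a mapping plus a latency), not the VL2 implementation.

```python
import random

class VL2AgentResolver:
    """Sketch of the agent's AA-to-LA resolution with caching."""

    def __init__(self, directory_servers, k=2):
        self.servers = directory_servers  # callables: aa -> (la, latency_ms)
        self.k = k
        self.cache = {}                   # AA -> LA, like a host's ARP cache

    def resolve(self, aa: str) -> str:
        if aa in self.cache:              # hit: no directory traffic at all
            return self.cache[aa]
        replies = [s(aa) for s in random.sample(self.servers, self.k)]
        la, _ = min(replies, key=lambda r: r[1])  # fastest reply wins
        self.cache[aa] = la
        return la

    def invalidate(self, aa: str):
        """Called on a reactive correction from the directory system."""
        self.cache.pop(aa, None)
```

In the real system the k queries are issued in parallel and the first response is used; the sequential loop here just keeps the sketch short.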
5. EVALUATION

In this section we evaluate VL2 using a prototype running on an 80-server testbed and commodity switches (see the testbed figure). Our goals are first to show that VL2 can be built from components that are available today, and second, that our implementation meets the design objectives set out earlier.

Figure: VL2 testbed comprising 80 servers and 10 switches.

The testbed is built using the Clos network topology of §4.1, consisting of Intermediate switches, Aggregation switches, and ToRs. The Aggregation and Intermediate switches provide 10Gbps Ethernet ports; each ToR is connected to two Aggregation switches via 10Gbps links and to its servers via 1Gbps links. Internally, the switches use commodity Broadcom ASICs, although any switch that supports line-rate L3 forwarding, OSPF, ECMP, and IP-in-IP decapsulation will work. To enable detailed analysis of the TCP behavior seen during experiments, the servers' kernels are instrumented to log TCP extended statistics [ ] (e.g., congestion window (cwnd) and smoothed RTT) after each socket buffer is sent. This logging only marginally affects goodput, i.e., useful information delivered per second to the application layer.

We first investigate VL2's ability to provide high and uniform network bandwidth between servers. Then, we analyze performance isolation and fairness between traffic flows, measure convergence after link failures, and finally quantify the performance of address resolution. Overall, our evaluation shows that VL2 provides an effective substrate for a scalable data center network; VL2 achieves (1) 94% of optimal network capacity, (2) a TCP fairness index of 0.995, (3) graceful degradation under failures with fast reconvergence, and (4) high lookup rates with response times under 10ms for fast address resolution.

5.1 VL2 Provides Uniform High Capacity

A central objective of VL2 is uniform high capacity between any two servers in the data center. How closely do the performance and efficiency of a VL2 network match those of a Layer 2 switch with 1:1 over-subscription?

To answer this question, we consider an all-to-all data shuffle stress test: all servers simultaneously initiate TCP transfers to all other servers. This data shuffle pattern arises in large-scale sorts, merges, and join operations in the data center. We chose this test because, in our interactions with application developers, we learned that many use such operations with caution, since the operations are highly expensive in today's data center network. However, data shuffles are required, and if they can be supported efficiently, that could have a large impact on the overall algorithmic and data storage strategy.

We create an all-to-all data shuffle traffic matrix involving 75 servers. Each of the 75 servers must deliver 500MB of data to each of the 74 other servers — a shuffle of 2.7TB from memory to memory.

Figure: Aggregate goodput during a 2.7TB shuffle among 75 servers.

The figure shows how the sum of the goodput over all flows varies with time during a typical run of the 2.7TB data shuffle. All data is carried over TCP connections, all of which attempt to connect beginning at time 0 (some flows start late due to a bug in our traffic generator). VL2 completes the shuffle in 395s. During the run, the sustained utilization of the core links in the Clos network is about 86%. For the majority of the run, VL2 achieves an aggregate goodput of 58.8Gbps. The goodput is evenly divided among the flows for most of the run, with a fairness index between the flows of 0.995 [ ], where 1.0 indicates perfect fairness. This goodput is more than 10x what the network in our current data centers can achieve with the same investment (see §6).

How close is VL2 to the maximum achievable throughput in this environment? To answer this question, we compute the goodput efficiency for this data transfer. The goodput efficiency of the network for any interval of time is defined as the ratio of the sent goodput summed over all interfaces to the sum of the interface capacities. An efficiency of 1 would mean that all the capacity on all the interfaces is used entirely for carrying useful bytes from the time the first flow starts to when the last flow ends.

To calculate the goodput efficiency, two sources of inefficiency must be accounted for. First, to achieve an efficiency of 1, the server network interface cards must be completely full-duplex: able to both send and receive 1Gbps simultaneously. Measurements show our interfaces are able to support a sustained rate of 1.8Gbps (summing the sent and received capacity), introducing an inefficiency of 1 − (1.8/2) = 10%. The source of this inefficiency is largely the device driver implementation. Second, for every two full-size data packets there is a TCP ACK, and these three frames carry unavoidable Ethernet, IP, and TCP header overhead for every data byte sent over the network, resulting in a further inefficiency of roughly 8%. Therefore, our current testbed has an intrinsic inefficiency of about 17%, resulting in a maximum achievable goodput for our testbed of 75 × 0.83 = 62.3Gbps. We derive this number by noting that every unit of traffic has to sink at a server, of which there are 75 instances, each with a 1Gbps link. Taking this into consideration, the VL2 network sustains an efficiency of 58.8/62.3 = 94%, with the difference from perfect due to the encapsulation headers, TCP congestion control dynamics, and TCP retransmissions.

To put this number in perspective, we note that a conventional hierarchical design with the same servers and over-subscription at the first-level switch would take several times longer to shuffle the same amount of data, as traffic from each server not in the rack to each server within the rack needs to flow through the over-subscribed downlink from the first-level switch to the ToR switch.

The 94% efficiency combined with the fairness index of 0.995 demonstrates that VL2 promises to achieve uniform high bandwidth across all servers in the data center.

5.2 VL2 Provides VLB Fairness

Due to its use of an anycast address on the Intermediate switches, VL2 relies on ECMP to split traffic in equal ratios among the Intermediate switches. Because ECMP does flow-level splitting, coexisting elephant and mice flows might be split unevenly at small time scales.
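Jain's index, used for the per-flow goodput fairness above and for the split-ratio measurements that follow, is easy to state: for n rates it is (Σx)²/(n·Σx²), equal to 1.0 when all rates are equal and 1/n in the worst case. A minimal sketch (illustrative, not the paper's tooling):

```python
def jains_fairness(rates):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2).
    1.0 when all n rates are equal; 1/n when one rate carries everything."""
    n = len(rates)
    total = sum(rates)
    return total * total / (n * sum(r * r for r in rates))
```

For example, three equal rates give an index of 1.0, while one active flow out of three gives 1/3.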
To evaluate the effectiveness of VL2's implementation of Valiant Load Balancing in splitting traffic evenly across the network, we created an experiment on our testbed with traffic characteristics extracted from the data-center workload of Section 3. Each server initially picks a value from the distribution of the number of concurrent flows and maintains this number of flows throughout the experiment. At the start, or after a flow completes, it picks a new flow size from the associated distribution and starts the flow(s). Because all flows pass through the Aggregation switches, it is sufficient to check at each Aggregation switch for the split ratio among the links to the Intermediate switches. We do so by collecting SNMP counters at regular intervals for all links from Aggregation to Intermediate switches.

Before proceeding further, we note that, unlike the efficiency experiment above, the traffic mix here is indicative of actual data-center workload: we mimic the flow size distribution and the number of concurrent flows observed by the measurements in §3.

Figure: Fairness measures how evenly flows are split to Intermediate switches from Aggregation switches.

In the fairness figure, for each Aggregation switch, we plot Jain's fairness index [ ] for the traffic to Intermediate switches as a time series. The average utilization of the links was between 10% and 20%. As shown in the figure, the VLB split-ratio fairness index averages more than 0.98 for all Aggregation switches over the duration of this experiment. VL2 achieves such high fairness because there are enough flows at the Aggregation switches that randomization benefits from statistical multiplexing. This evaluation validates that our implementation of VLB is an effective mechanism for preventing hot spots in a data center network.

Our randomization-based traffic splitting in Valiant Load Balancing takes advantage of the 10x gap in speed between server line cards and core network links. If the core network were built out of links with the same speed as the server line cards, then only one full-rate flow would fit on each link, and the spreading of flows would have to be perfect to prevent two long-lived flows from traversing the same link and causing congestion. Splitting at a sub-flow granularity (for example, flowlet switching [ ]) might alleviate this problem.

5.3 VL2 Provides Performance Isolation

One of the primary objectives of VL2 is agility, which we define as the ability to assign any server, anywhere in the data center, to any service (§1). Achieving agility critically depends on providing sufficient performance isolation between services, so that if one service comes under attack or a bug causes it to spray packets, it does not adversely impact the performance of other services.

Performance isolation in VL2 rests on the mathematics of VLB — that any traffic matrix obeying the hose model is routed by splitting to intermediate nodes in equal ratios (through randomization) to prevent any persistent hot spots. Rather than have VL2 perform admission control or rate shaping to ensure that the traffic offered to the network conforms to the hose model, we instead rely on TCP to ensure that each flow offered to the network is rate-limited to its fair share of its bottleneck.

A key question we need to validate for performance isolation is whether TCP reacts sufficiently quickly to control the offered rate of flows within services. TCP works with packets and adjusts their sending rate at the time-scale of RTTs. Conformance to the hose model, however, requires instantaneous feedback to avoid over-subscription of traffic ingress/egress bounds. Our next set of experiments shows that TCP is fast enough to enforce the hose model for traffic in each service so as to provide the desired performance isolation across services.

In this experiment, we add two services to the network. Each server in the first service starts a single TCP transfer to one other server at time 0, and these flows last for the duration of the experiment. The second service starts with one server at 60 seconds, and a new server is assigned to it every few seconds thereafter. Every server in service two starts a bulk TCP transfer as soon as it starts up. Both services' servers are intermingled among the ToRs to demonstrate agile assignment of servers.

Figure: Aggregate goodput of two services with servers intermingled on the ToRs. Service one's goodput is unaffected as service two ramps traffic up and down.

This figure shows the aggregate goodput of both services as a function of time. As seen in the figure, there is no perceptible change to the aggregate goodput of service one as the flows in service two start or complete, demonstrating performance isolation when the traffic consists of large long-lived flows. Through extended TCP statistics, we inspected the congestion window size (cwnd) of service one's TCP flows, and found that the flows fluctuate around their fair share briefly due to service two's activity but stabilize quickly.

We would expect that a service sending unlimited rates of UDP traffic might violate the hose model and hence performance isolation. We do not observe such UDP traffic in our data centers, although techniques such as STCP to make UDP "TCP friendly" are well known if needed [ ]. However, large numbers of short TCP connections (mice), which are common in data centers (Section 3), have the potential to cause problems similar to UDP, as each flow can transmit small bursts of packets during slow start.

To evaluate this aspect, we conduct a second experiment with service one sending long-lived TCP flows, as in the first experiment. Servers in service two create bursts of short TCP connections, each burst containing progressively more connections.

Figure: Aggregate goodput of service one as service two creates bursts containing successively more short TCP connections.

This figure shows the aggregate goodput of service one's flows along with the total number of TCP connections created by service two. Again,