Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

VL2: A scalable and flexible Data Center Network

This Data Center Network Architecture introduces a virtual layer 2.5 in the protocol stack of hosts and uses a directory service to achieve efficient forwarding. It uses separate location/identifier IPs

  • Login to see the comments

VL2: A scalable and flexible Data Center Network

  1. 1. VL2: A Scalable and Flexible Data Center Network Microsoft Research Presented by: Ankita Mahajan
  2. 2. INTRODUCTION • Cloud services need data centers with hundreds of thousands of servers and that concurrently support a large number of distinct services. • To be profitable, DC must achieve high utilization, and key to this is agility — the capacity to assign any server to any service. • Agility promises: improved risk management and cost savings.
  3. 3. Agility in today’s DCN • Designs for today’s DCN prevent agility in several ways: • Oversubscription : Existing architectures do not provide enough capacity between the servers they interconnect • Traffic flood in 1 service affects other services. • Topologically significant IP addresses. • Dividing servers among VLANs: reates Fragmentation of Address Space.
  4. 4. Objectives of building VL2 To overcome limitations we need a n/w with following objectives: • Uniform high capacity: Hot-spot free • Performance isolation • Layer-2 semantics: Just as if the servers were on a LAN it should be easy to assign any server to any service and configure that server with whatever IP address the service expects.
  5. 5. • 20 to 40 servers per rack, each singly connected to • a Top of Rack (ToR) switch with a 1 Gbps link. • ToRs connect to two aggregation switches for redundancy, and • these switches aggregate further connecting to access routers. • At the top, core routers carry traffic between access routers. • All links use Ethernet as a physical-layer protocol • To limit overheads (e.g., packet flooding and ARP broadcasts) and to isolate different services servers are partitioned into virtual LANs (VLANs)
  6. 6. Limitations 3 fundamental Limitations: • Limited server-to-server capacity: ToRs are 1:5 to 1:20 oversubscribed and paths through the highest layer can be 1:240 oversubscribed. • Fragmentation of resources: spare capacity is reserved by individual services. • Poor reliability and utilization: Resilience model forces each device and link to be run up to at most 50% of its maximum utilization.
  7. 7. Data-Center Traffic Analysis 1. The ratio of traffic volume between servers into data centers, to, traffic entering/leaving data centers is currently around 4:1 2. data-center computation is focused where high speed access to data on memory or disk is fast and cheap. 3. The demand for bandwidth between servers inside a data center is growing faster than the demand for bandwidth to external hosts. 4. The network is a bottleneck to computation.
  8. 8. Flow Distribution Analysis: Distribution of flow sizes: • Similar to Internet traffic, 99% of flows are smaller than 100 MB. • But the distribution is simpler and more uniform than Internet. • More than 90% of bytes are in flows between 100MB and 1 GB.
  9. 9. Flow Distribution Analysis: Number of Concurrent Flows: • More than 50% of the time, an average machine has about ten concurrent flows. • At least 5% of the time it has greater than 80 concurrent ows. • We almost never see more than 100 concurrent ows. Both the above Flow Distribution Analysis imply that VLB will perform well on this traffic. Since even big flows are only 100 MB. adaptive routing schemes may be dicult to implement in the data center since any reactive traffic engineering will need to run at least once a second if it wants to react to individual flows.
  10. 10. Traffic Matrix Analysis • Poor summarizability of trac patterns: Is there regularity in the trac that might be exploited through careful measurement and trac engineering? TM(t)ij clustering: a day’s worth of trac in the datacenter, even when approximating with 50-60 clusters, the fitting error remains 60% • Instability of trac patterns: how predictable is the trac in the next interval given the current trac?
  11. 11. Failure Characteristics 1. pattern of networking equipment failures: • Most failures are small in size: 50% of network device failures involve < 4 devices and 95% of network device failures involve < 20 devices • Large correlated failures are rare: the largest correlated failure involved 217 switches • downtimes can be signicant: 95% of failures are resolved in 10min, 98% in < 1 hr, 99.6% in < 1 day, but 0.09% last > 10 days.
  12. 12. impact of networking equipment failure? • in 0.3% of failures all redundant components in a network device group became unavailable • The main causes of these downtimes are network miscongurations, firmware bugs, and faulty components (e.g., ports). • With no obvious way to eliminate all failures from the top of the hierarchy, VL2’sapproach is to broaden the topmost levels of the network so that the impact of failures is muted and performance degrades gracefully, • moving from 1:1 redundancy to n:m redundancy.
  13. 13. Terminology • Goodput: useful information delivered per second to the application layer. • VLB: each server independently picks a path at random through the network for each of the flows it sends to other servers in the data center. • ECMP: distributes traffic across equal-cost paths • anycast addresses for the Directory System