
MPLS in DC and inter-DC networks: the unified forwarding mechanism for network programmability at scale


  1. 1. Dmitry Afanasiev, fl0w@yandex-team.ru Daniel Ginsburg, dbg@yandex-team.ru Network Architects MPLS in DC and inter-DC networks: the unified forwarding mechanism for network programmability at scale
  2. 2. About Us
  3. 3. 3 • Founded in 1993 • NASDAQ:YNDX, Mkt Cap ~$12.5B • One of Europe's largest internet companies and the leading search provider in Russia • Over 60% of the local search market • Monthly user audience of over 90 million worldwide. • Services: search, music, video, cloud storage, news, weather, maps, traffic, email, ads ... What is Yandex
  4. 4. 4 • We're a rather typical MS-DC • Several DCs in Russia and abroad + an MPLS backbone to connect them • About 100k servers and growing fast • Mostly IPv6 internally, need to serve external IPv4 • Network architecture is a bit outdated, needs rethinking Our Infrastructure
  5. 5. In Search of New Arch
  6. 6. 6 • It needs to be: – Scalable – Flexible – Programmable • Lots of approaches out there, some get many things right… • But no single one combines all the right pieces in the right way • It's really surprising, because the right combination seems almost inevitable. In Search of New Arch
  7. 7. 7 • Many of the ideas have been around for years (or even decades) • Interconnection network topology – folded Clos • Let the edge handle complexity • Core just delivers packets edge to edge • Overlay/underlay logical split • Control: mix of centralized and distributed. Needs a nice way to combine both • Simple commodity network elements • Hierarchy and automation to scale the network Ideas to Build Upon
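To make the folded Clos idea above concrete, here is a minimal sketch of a two-level leaf-spine fabric expressed as an adjacency map; the sizes and device names are illustrative, not the fabric discussed in the talk.

```python
# A two-level folded Clos (leaf-spine) fabric as an adjacency map.
# Sizes and names are illustrative only.

def folded_clos(num_spines: int, num_leaves: int) -> dict[str, list[str]]:
    """Every leaf connects to every spine; servers attach only to leaves."""
    return {
        f"leaf{leaf}": [f"spine{spine}" for spine in range(num_spines)]
        for leaf in range(num_leaves)
    }

fabric = folded_clos(num_spines=4, num_leaves=8)
# Each leaf gets num_spines equal-cost uplinks, so capacity grows by adding spines.
print(fabric["leaf0"])   # ['spine0', 'spine1', 'spine2', 'spine3']
```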
  8. 8. 8 • All these ideas are well known, well understood and almost universally accepted in the industry • People are trying to implement them using a wild mix of data plane mechanisms • And this introduces enormous complexity • What's missing? A unified forwarding mechanism What’s missing
  9. 9. 9 • Life is much easier when we don't have to deal with a multitude of data planes and forwarding mechanisms. • Fortunately, there is already a well-known, well-understood, standardized forwarding plane mechanism upon which we can implement all those ideas without compromising their value. • It has well-defined and standardized mappings to many other popular forwarding planes. • It's known as MPLS. Missing… or overlooked?
  10. 10. Unified Forwarding: Why and How
  11. 11. 11 • Different data plane mechanisms – different features • The unified data plane should be able to support all useful features and produce their combinations • MPLS is very flexible: – forwarding over a pre-signalled virtual circuit a la ATM - this is what RSVP-TE does – source routing over a previously discovered topology a la Token Ring networks - see the Segment Routing proposal – hierarchical LPM a la IP - just split the address over several labels and let routers act on the topmost one (not that we suggest it is practical, but it is definitely possible) Flexibility
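As a toy illustration of the "hierarchical LPM a la IP" bullet above, the sketch below splits a 32-bit IPv4 address into two 16-bit chunks, each carried in its own label, so a core router would act only on the topmost one. It is purely illustrative; the slide itself notes this approach is possible rather than practical.

```python
import ipaddress

# Split a 32-bit IPv4 address into two 16-bit chunks, each small enough to fit
# into a 20-bit MPLS label; a core LSR would act only on the topmost chunk.

def address_to_labels(addr: str) -> list[int]:
    value = int(ipaddress.IPv4Address(addr))
    return [value >> 16, value & 0xFFFF]          # top of stack first

def labels_to_address(labels: list[int]) -> str:
    return str(ipaddress.IPv4Address((labels[0] << 16) | labels[1]))

stack = address_to_labels("10.1.2.3")
print(stack)                                      # [2561, 515]
assert labels_to_address(stack) == "10.1.2.3"
```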
  12. 12. 12 • The best way to implement arbitrary semantics is to get rid of any semantics in protocol headers and assign it externally • Hardware works with protocol headers • Control software defines the semantics An Abstract Note on Semantics
  13. 13. 13 • Why combine? To have the right features in the right place, or to produce a useful combination of features • There are basically two ways to combine different data planes: stitch (interwork) them, or overlay them on top of each other Combining Data Planes
  14. 14. 14 • It’s a pain • Might be done for a subset of protocol features • Need to translate between protocols (complex, never perfect, loses information) • Need to provision interworking points: fragile, an operational nightmare, and they create bottlenecks • Seems nobody really does this anymore… Or maybe we still have to sometimes? Stitching Data Planes
  15. 15. 15 • Overlay to: scale, virtualize, augment one data plane with properties of another • Overlaying is building hierarchy • But with multiple data planes it is limited and ad-hoc • Often ugly: IP over Ethernet over VXLAN over IP over Ethernet • MPLS is intrinsically hierarchical (overlayable, if you will) Overlaying Data Planes
  16. 16. 16 • Many hierarchical structures are already in the network: topology, addressing, management and control • Hierarchy is the most important and the most reliable way to scale things Hierarchy is your friend
  17. 17. 17 • The ability to implement hierarchy natively enables us to ditch the notion of hard overlay/underlay boundary. • In a stack of DC-label, ToR-label, port-label, slice-label, vm-label, where's the boundary of overlay/underlay? Not in the packet • Placement of the boundary only depends on how you structure your control Overlay/underlay split is a metaphor
  18. 18. 18 • Can be as granular or coarse-grained as one wishes. There's no network-imposed limitation • Easy behavior aggregation. Just add an extra label on top • Easy behavior disaggregation. One can expose additional granularity by adding extra label on bottom FEC is hierarchical
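A minimal sketch of the aggregation/disaggregation idea above, treating the label stack from slide 17 as a plain list with the top of the stack first; the label names are illustrative placeholders rather than real 20-bit label values.

```python
# Label stack from slide 17, top of stack first; names stand in for real labels.
stack = ["DC", "ToR", "port", "slice", "VM"]

# Aggregation: push a coarser label on top; transit nodes now act on it alone.
aggregated = ["region"] + stack

# Disaggregation: expose extra granularity by adding a finer label at the bottom.
disaggregated = stack + ["flow"]

print(aggregated)      # ['region', 'DC', 'ToR', 'port', 'slice', 'VM']
print(disaggregated)   # ['DC', 'ToR', 'port', 'slice', 'VM', 'flow']
```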
  19. 19. How to Control MPLS
  20. 20. 20 • The MPLS control plane is notoriously complex • Good news: you don’t have to use all of it, you can pick the good parts • Classical distributed control is OK for transport • Centralized control seems better for higher-level artifacts on the edge, sometimes called services • Both styles can (and should) be combined MPLS is complex?
  21. 21. 21 • The device has to be a bit smarter than in OpenFlow • It gets parts of the label stack from different control plane components • It assembles the full stack from those parts, using local logic to follow assembly instructions provided by the control plane • Assembly instructions come in the form of references by “name” • Assembly uses late binding Enabling combinability
  22. 22. 22 • MPLS VPN (abstraction A) refers to MPLS tunnels (abstraction B) using next-hop resolution. • The resolution happens on the device itself, and the two control plane entities are loosely coupled - MPLS tunnels can change their paths, their assigned labels, etc., without MP-BGP caring about it • The VPN abstraction refers to the tunnel abstraction using next-hops. The next-hop is the name by which one control plane abstraction refers to another Enabling combinability – example
  23. 23. 23 • Recursive next-hop resolution with labeled routes (RFC 3107) is a powerful way to overlay one control plane abstraction over another • It can express almost anything we currently want. Still, a more expressive way is desirable • BGP 3107 is the way to interact with all classically controlled MPLS networks Enabling Combinability – BGP 3107
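A sketch of the late-binding mechanism described on the last three slides: one control plane component hands the device a service (VPN) label plus a next-hop "name", and the device resolves that name recursively through a labeled-route (RFC 3107) table and a transport table to assemble the full stack. All addresses, labels and table layouts are invented for illustration.

```python
# Three loosely coupled "control plane" tables; everything here is invented.
vpn_routes = {("vrf-blue", "192.0.2.0/24"): {"label": 300, "next_hop": "10.0.0.7"}}
labeled_routes = {"10.0.0.7": {"label": 17, "next_hop": "10.0.0.1"}}   # RFC 3107 /32
transport = {"10.0.0.1": {"label": 1001, "out_port": "eth0"}}          # IGP/tunnel

def assemble_stack(vrf: str, prefix: str) -> tuple[list[int], str]:
    vpn = vpn_routes[(vrf, prefix)]
    bgp = labeled_routes[vpn["next_hop"]]   # resolve by "name" (the next-hop)
    igp = transport[bgp["next_hop"]]        # resolve again, recursively
    return [igp["label"], bgp["label"], vpn["label"]], igp["out_port"]

print(assemble_stack("vrf-blue", "192.0.2.0/24"))   # ([1001, 17, 300], 'eth0')

# If the transport path (and label 1001) changes, only `transport` is updated;
# the VPN and labeled-route components never notice - that is the loose coupling.
```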
  24. 24. 24 • If you can ensure that the labels at some point in the network always stay the same (because you assigned them to be so), you can use static configuration on the other side • This is the way to go when one wants to avoid any signaling dependencies • Static configuration can be calculated and disseminated automatically Static Configuration
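A sketch of how such static configuration could be calculated automatically, assuming a deterministic allocation scheme so that the labels "always stay the same"; the label ranges below are invented.

```python
# Deterministic label allocation: the same inputs always yield the same labels,
# so the other side can be configured statically, offline, with no signaling.
# The ranges below are invented for illustration.

SWITCH_LABEL_BASE = 16000    # labels 16000+ identify egress fabric switches
PORT_LABEL_BASE = 20000      # labels 20000+ identify access ports on a switch

def switch_label(switch_id: int) -> int:
    return SWITCH_LABEL_BASE + switch_id

def port_label(port_id: int) -> int:
    return PORT_LABEL_BASE + port_id

print(switch_label(42), port_label(7))   # 16042 20007
```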
  25. 25. 25 • On the host! Or even right from the application • The hypervisor switch is the easiest point. SW only, very flexible. • Naturally fits centralized control • Helps to scale. Lots of RAM, and each element keeps only the state it needs • Modern CPUs can forward 10s of Gbps without breaking a sweat Where should MPLS start?
  26. 26. 26 • A simple forwarding plane (3 simple ops) • A simple software agent on the device (receives parts of the label stack from different control plane components, assembles the full stack, and programs the HW) • Centralized and distributed control, or anything in between • Combinability of different control plane components with late binding via names, which the device resolves Looks SDNish
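For reference, the "3 simple ops" of the forwarding plane reduce to the following label-stack manipulations (a minimal sketch, with the top of the stack at index 0; the label values are arbitrary).

```python
# The three MPLS data plane operations on a label stack (top of stack at index 0).

def push(stack: list[int], label: int) -> list[int]:
    return [label] + stack

def swap(stack: list[int], label: int) -> list[int]:
    return [label] + stack[1:]

def pop(stack: list[int]) -> list[int]:
    return stack[1:]

s = push(push([], 300), 17)   # ingress LER imposes [17, 300]
s = swap(s, 18)               # a transit LSR swaps the top label
s = pop(s)                    # the egress (or penultimate) hop pops it
print(s)                      # [300] - the service label is exposed
```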
  27. 27. 27 • “Modularity based on abstraction is the way things get done” --Liskov • “SDN ...Not a revolutionary technology... ...just a way of organizing network functionality” -- Shenker • “SDN is merely set of abstractions for control plane, not a specific set of mechanisms.” -- Shenker • “Most lasting legacy of SDN is not better datacenters - But better ways of reasoning about network control” --Shenker What SDN is
  28. 28. 28 • Let the edge handle complexity – do it on host • Core just delivers packets edge to edge – hierarchy enables the devices to be agnostic to changes on the edge • Overlay/underlay logical split – just a way to implement hierarchy • Control: mix of centralized and distributed. Needs a nice way to combine both – yeah! • Simple commodity network elements – cheap MPLS capable silicon is finally there How Ideas Map to MPLS
  29. 29. 29 • The key point of Seamless MPLS (S-MPLS) was to extend MPLS to the access and to separate transport and service in the MPLS network • NFV describes how to host service nodes in the DC. If you don’t have MPLS in the DC it’s no longer seamless • The fix is obvious – extend MPLS into the DC • Labels can carry additional metadata if one wants them to NFV and Seamless MPLS
  30. 30. Case Study: New Yandex DC
  31. 31. 31 • Cheap and abundant bandwidth • Scalable forwarding with minimal state • Multitenancy (=> network virtualization) • Efficient resource pooling • InterDC traffic engineering • Function chaining: load balancing, FW, etc. • Interconnection with existing infrastructure • Means to integrate all of the above • Local response to some events, e.g. failures • Automation at scale What do we need?
  32. 32. 32 We are trying to keep the design really simple. We don’t need many functions often perceived as desirable: • L2 (neither real nor emulated) • VM mobility – In scale-out applications nodes coming and going is the norm, no need to move them around while preserving state and identity – VM mobility increases complexity as it depends on other features • Multicast • We don't have too many changes in topology What we don’t need
  33. 33. 33 • Host with vLER (MPLS capable vRouter) • Fabric switching elements – LSRs • Centralized controller • Legacy routers. Need to interwork with fabric LSRs and controller. BGP 3107 is the tool Components
  34. 34. 34 • 3-label stack: topmost for egress switch, next for egress port, bottom for VM • vRouter uses {dst prefix, VRF} to impose label stack • Bottom label processed by destination vLER • Expected state on a fabric switch: #switches_in_the_fabric + #local_access_ports Forwarding model
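A minimal sketch of the forwarding model above: the vLER on the host keys on {VRF, destination prefix} and imposes the 3-label stack [egress switch, egress port, VM]. The label values and FIB entry are invented, and the last lines merely restate the expected fabric-switch state estimate.

```python
# The host vLER keys on {VRF, destination prefix} and imposes the 3-label stack
# [egress switch, egress port, VM]; all values below are invented.

fib = {
    ("tenant-a", "2001:db8:1::/64"): (16042, 20007, 30123),   # switch, port, VM
}

def impose(vrf: str, prefix: str) -> list[int]:
    switch, port, vm = fib[(vrf, prefix)]
    return [switch, port, vm]            # top of stack first

print(impose("tenant-a", "2001:db8:1::/64"))   # [16042, 20007, 30123]

# Expected FIB size on a fabric switch: one entry per switch in the fabric plus
# one per local access port, e.g. 64 switches + 48 ports = 112 entries.
print(64 + 48)
```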
  35. 35. 35 • iBGP 3107 (in-path RR w/ NHS) inside the fabric for reachability and label distribution (draft-lapukhov…, but with iBGP and labels) • iBGP 3107 to interwork with legacy routers – Session with the connected network element, with NHS, for the switch label – Session with the controller for the remaining labels, which bind to the switch label via the next hop • Label mappings on the edge of the fabric are stable and can be provisioned rather than signaled • Internal fabric failures are handled locally • Label mappings on vRouters are distributed centrally Control plane
  36. 36. Why Now and What’s Next?
  37. 37. 37 “The world is changed… I smell it in the air” • A lot of similar ideas in the industry • It seems that thinking is converging on something • But ... a lot of ugly ad-hoc solutions are popping up here and there • Better to implement a good solution before bad ones become entrenched • It would be a shame and a missed opportunity to stick with VXLAN/… for years when we could get MPLS instead Why Now?
  38. 38. 38 • Merchant silicon is finally MPLS capable. And the price is almost right. • Modern CPUs can process tens of Mpps in SW, making host-based switching feasible. • Several open source MPLS data plane implementations are emerging • Some "classical" MPLS control plane components, such as BGP 3107, are very useful and have been around for quite a long time. What’s Ready?
  39. 39. 39 • All RFC 3107 implementations are broken (multiple labels). Talk to your vendor • Silicon is not perfect. Talk to your vendor • A more expressive way to control late binding of control plane artifacts than BGP 3107 is needed • Perception of MPLS as a complex technology. It's the current MPLS control plane that is complex • Perception of MPLS as a WAN or metro technology Gaps
  40. 40. Thank you! Questions?
