
MPLS in DC and inter-DC networks: the unified forwarding mechanism for network programmability at scale


  1. 1. Dmitry Afanasiev, fl0w@yandex-team.ru Daniel Ginsburg, dbg@yandex-team.ru Network Architects MPLS in DC and inter-DC networks: the unified forwarding mechanism for network programmability at scale
  2. 2. About Us
  3. 3. 3 • Founded in 1993 • NASDAQ:YNDX, Mkt Cap ~$12.5B • One of Europe's largest internet companies and the leading search provider in Russia • Over 60% of the local search market • Monthly user audience of over 90 million worldwide. • Services: search, music, video, cloud storage, news, weather, maps, traffic, email, ads ... What is Yandex
  4. 4. 4 • We're a rather typical MS-DC • Several DCs in Russia and abroad + an MPLS backbone to connect them • About 100k servers and growing fast • Mostly IPv6 internally, need to serve external IPv4 • Network architecture is a bit outdated, needs rethinking Our Infrastructure
  5. 5. In Search of New Arch
  6. 6. 6 • It needs to be: – Scalable – Flexible – Programmable • Lots of approaches out there, some get many things right… • But no single one combines all the right pieces in the right way • It's really surprising, because the right combination seems almost inevitable. In Search of New Arch
  7. 7. 7 • Many of the ideas have been around for years (or even decades) • Interconnection network topology – folded Clos • Let the edge handle complexity • Core just delivers packets edge to edge • Overlay/underlay logical split • Control: mix of centralized and distributed. Needs a nice way to combine both • Simple commodity network elements • Hierarchy and automation to scale the network Ideas to Build Upon
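To make the folded Clos idea above concrete, here is a minimal sketch of a two-level leaf-spine fabric expressed as an adjacency map; the sizes and device names are illustrative, not the fabric discussed in the talk.

```python
# A two-level folded Clos (leaf-spine) fabric as an adjacency map.
# Sizes and names are illustrative only.

def folded_clos(num_spines: int, num_leaves: int) -> dict[str, list[str]]:
    """Every leaf connects to every spine; servers attach only to leaves."""
    return {
        f"leaf{leaf}": [f"spine{spine}" for spine in range(num_spines)]
        for leaf in range(num_leaves)
    }

fabric = folded_clos(num_spines=4, num_leaves=8)
# Each leaf gets num_spines equal-cost uplinks, so capacity grows by adding spines.
print(fabric["leaf0"])   # ['spine0', 'spine1', 'spine2', 'spine3']
```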
  8. 8. 8 • All these ideas are well known, well understood and almost universally accepted in the industry • People are trying to implement them using a wild mix of data plane mechanisms • And this introduces enormous complexity • What's missing? A unified forwarding mechanism What’s missing
  9. 9. 9 • Life is much easier when we don't have to deal with a multitude of data planes and forwarding mechanisms. • Fortunately, there is already a well-known, well-understood, standardized forwarding plane mechanism upon which we can implement all those ideas without compromising their value. • It has well-defined and standardized mappings to many other popular forwarding planes. • It's known as MPLS. Missing… or overlooked?
  10. 10. Unified Forwarding: Why and How
  11. 11. 11 • Different data plane mechanisms – different features • The unified data plane should be able to support all useful features and produce their combinations • MPLS is very flexible: – forwarding over a pre-signalled virtual circuit a la ATM - this is what RSVP-TE does – source routing over a previously discovered topology a la Token Ring networks - see the Segment Routing proposal – hierarchical LPM a la IP - just split the address over several labels and let routers act on the topmost one (not that we suggest it is practical, but it is definitely possible) Flexibility
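As a toy illustration of the "hierarchical LPM a la IP" bullet above, the sketch below splits a 32-bit IPv4 address into two 16-bit chunks, each carried in its own label, so a core router would act only on the topmost one. It is purely illustrative; the slide itself notes this approach is possible rather than practical.

```python
import ipaddress

# Split a 32-bit IPv4 address into two 16-bit chunks, each small enough to fit
# into a 20-bit MPLS label; a core LSR would act only on the topmost chunk.

def address_to_labels(addr: str) -> list[int]:
    value = int(ipaddress.IPv4Address(addr))
    return [value >> 16, value & 0xFFFF]          # top of stack first

def labels_to_address(labels: list[int]) -> str:
    return str(ipaddress.IPv4Address((labels[0] << 16) | labels[1]))

stack = address_to_labels("10.1.2.3")
print(stack)                                      # [2561, 515]
assert labels_to_address(stack) == "10.1.2.3"
```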
  12. 12. 12 • The best way to implement arbitrary semantics is to get rid of any semantics in protocol headers and assign it externally • Hardware works with protocol headers • Control software defines the semantics An Abstract Note on Semantics
  13. 13. 13 • Why combine? To have the right features in the right place, or to produce a useful combination of features • There are basically two ways to combine different data planes: stitch (interwork) them, or overlay them on top of each other Combining Data Planes
  14. 14. 14 • It’s a pain • Might be done for a subset of protocol features • Need to translate between protocols (complex, never perfect, loses information) • Need to provision interworking points: fragile, an operational nightmare, and they create bottlenecks • Seems nobody really does this anymore… Or maybe we still have to sometimes? Stitching Data Planes
  15. 15. 15 • Overlay to: scale, virtualize, augment one data plane with properties of another • Overlaying is building hierarchy • But with multiple data planes it is limited and ad-hoc • Often ugly: IP over Ethernet over VXLAN over IP over Ethernet • MPLS is intrinsically hierarchical (overlayable, if you will) Overlaying Data Planes
  16. 16. 16 • Many hierarchical structures are already in the network: topology, addressing, management and control • Hierarchy is the most important and the most reliable way to scale things Hierarchy is your friend
  17. 17. 17 • The ability to implement hierarchy natively enables us to ditch the notion of hard overlay/underlay boundary. • In a stack of DC-label, ToR-label, port-label, slice-label, vm-label, where's the boundary of overlay/underlay? Not in the packet • Placement of the boundary only depends on how you structure your control Overlay/underlay split is a metaphor
  18. 18. 18 • Can be as granular or coarse-grained as one wishes. There's no network-imposed limitation • Easy behavior aggregation. Just add an extra label on top • Easy behavior disaggregation. One can expose additional granularity by adding extra label on bottom FEC is hierarchical
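A minimal sketch of the aggregation/disaggregation idea above, treating the label stack from slide 17 as a plain list with the top of the stack first; the label names are illustrative placeholders rather than real 20-bit label values.

```python
# Label stack from slide 17, top of stack first; names stand in for real labels.
stack = ["DC", "ToR", "port", "slice", "VM"]

# Aggregation: push a coarser label on top; transit nodes now act on it alone.
aggregated = ["region"] + stack

# Disaggregation: expose extra granularity by adding a finer label at the bottom.
disaggregated = stack + ["flow"]

print(aggregated)      # ['region', 'DC', 'ToR', 'port', 'slice', 'VM']
print(disaggregated)   # ['DC', 'ToR', 'port', 'slice', 'VM', 'flow']
```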
  19. 19. How to Control MPLS
  20. 20. 20 • The MPLS control plane is notoriously complex • Good news: you don’t have to use all of it, you can pick the good parts • Classical distributed control is OK for transport • Centralized control seems better for higher-level artifacts on the edge, sometimes called services • Both styles can (and should) be combined MPLS is complex?
  21. 21. 21 • The device has to be a bit smarter than in OpenFlow • It gets parts of the label stack from different control plane components • It assembles the full stack from those parts, using local logic to follow assembly instructions provided by the control plane • Assembly instructions come in the form of references by “name” • Assembly uses late binding Enabling combinability
  22. 22. 22 • MPLS VPN (abstraction A) refers to MPLS tunnels (abstraction B) using next-hop resolution. • The resolution happens on the device itself, and the two control plane entities are loosely coupled - MPLS tunnels can change their paths, their assigned labels, etc., without MP-BGP caring about it • The VPN abstraction refers to the tunnel abstraction using next-hops. The next-hop is the name by which one control plane abstraction refers to another Enabling combinability – example
  23. 23. 23 • Recursive next-hop resolution with labeled routes (RFC 3107) is a powerful way to overlay one control plane abstraction over another • It can express almost anything we currently want. Still, a more expressive way is desirable • BGP 3107 is the way to interact with all classically controlled MPLS networks Enabling Combinability – BGP 3107
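A sketch of the late-binding mechanism described on the last three slides: one control plane component hands the device a service (VPN) label plus a next-hop "name", and the device resolves that name recursively through a labeled-route (RFC 3107) table and a transport table to assemble the full stack. All addresses, labels and table layouts are invented for illustration.

```python
# Three loosely coupled "control plane" tables; everything here is invented.
vpn_routes = {("vrf-blue", "192.0.2.0/24"): {"label": 300, "next_hop": "10.0.0.7"}}
labeled_routes = {"10.0.0.7": {"label": 17, "next_hop": "10.0.0.1"}}   # RFC 3107 /32
transport = {"10.0.0.1": {"label": 1001, "out_port": "eth0"}}          # IGP/tunnel

def assemble_stack(vrf: str, prefix: str) -> tuple[list[int], str]:
    vpn = vpn_routes[(vrf, prefix)]
    bgp = labeled_routes[vpn["next_hop"]]   # resolve by "name" (the next-hop)
    igp = transport[bgp["next_hop"]]        # resolve again, recursively
    return [igp["label"], bgp["label"], vpn["label"]], igp["out_port"]

print(assemble_stack("vrf-blue", "192.0.2.0/24"))   # ([1001, 17, 300], 'eth0')

# If the transport path (and label 1001) changes, only `transport` is updated;
# the VPN and labeled-route components never notice - that is the loose coupling.
```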
  24. 24. 24 • If you can ensure that the labels at some point in the network always stay the same (because you assigned them to be so), you can use static configuration on the other side • This is the way to go when one wants to avoid any signaling dependencies • Static configuration can be calculated and disseminated automatically Static Configuration
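A sketch of how such static configuration could be calculated automatically, assuming a deterministic allocation scheme so that the labels "always stay the same"; the label ranges below are invented.

```python
# Deterministic label allocation: the same inputs always yield the same labels,
# so the other side can be configured statically, offline, with no signaling.
# The ranges below are invented for illustration.

SWITCH_LABEL_BASE = 16000    # labels 16000+ identify egress fabric switches
PORT_LABEL_BASE = 20000      # labels 20000+ identify access ports on a switch

def switch_label(switch_id: int) -> int:
    return SWITCH_LABEL_BASE + switch_id

def port_label(port_id: int) -> int:
    return PORT_LABEL_BASE + port_id

print(switch_label(42), port_label(7))   # 16042 20007
```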
  25. 25. 25 • On the host! Or even right from the application • The hypervisor switch is the easiest point. SW only, very flexible. • Naturally fits centralized control • Helps to scale. Lots of RAM, and each element keeps only the state it needs • Modern CPUs can forward 10s of Gbps without breaking a sweat Where should MPLS start?
  26. 26. 26 • A simple forwarding plane (3 simple ops) • A simple software agent on the device (receives parts of the label stack from different control plane components, assembles the full stack, and programs the HW) • Centralized and distributed control, or anything in between • Combinability of different control plane components with late binding via names, which the device resolves Looks SDNish
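For reference, the "3 simple ops" of the forwarding plane reduce to the following label-stack manipulations (a minimal sketch, with the top of the stack at index 0; the label values are arbitrary).

```python
# The three MPLS data plane operations on a label stack (top of stack at index 0).

def push(stack: list[int], label: int) -> list[int]:
    return [label] + stack

def swap(stack: list[int], label: int) -> list[int]:
    return [label] + stack[1:]

def pop(stack: list[int]) -> list[int]:
    return stack[1:]

s = push(push([], 300), 17)   # ingress LER imposes [17, 300]
s = swap(s, 18)               # a transit LSR swaps the top label
s = pop(s)                    # the egress (or penultimate) hop pops it
print(s)                      # [300] - the service label is exposed
```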
  27. 27. 27 • “Modularity based on abstraction is the way things get done” --Liskov • “SDN ...Not a revolutionary technology... ...just a way of organizing network functionality” -- Shenker • “SDN is merely set of abstractions for control plane, not a specific set of mechanisms.” -- Shenker • “Most lasting legacy of SDN is not better datacenters - But better ways of reasoning about network control” --Shenker What SDN is
  28. 28. 28 • Let the edge handle complexity – do it on host • Core just delivers packets edge to edge – hierarchy enables the devices to be agnostic to changes on the edge • Overlay/underlay logical split – just a way to implement hierarchy • Control: mix of centralized and distributed. Needs a nice way to combine both – yeah! • Simple commodity network elements – cheap MPLS capable silicon is finally there How Ideas Map to MPLS
  29. 29. 29 • The key point of Seamless MPLS (S-MPLS) was to extend MPLS to the access and to separate transport and service in the MPLS network • NFV describes how to host service nodes in the DC. If you don’t have MPLS in the DC it’s no longer seamless • The fix is obvious – extend MPLS into the DC • Labels can carry additional metadata if one wants them to NFV and Seamless MPLS
  30. 30. Case Study: New Yandex DC
  31. 31. 31 • Cheap and abundant bandwidth • Scalable forwarding with minimal state • Multitenancy (=> network virtualization) • Efficient resource pooling • InterDC traffic engineering • Function chaining: load balancing, FW, etc. • Interconnection with existing infrastructure • Means to integrate all of the above • Local response to some events, e.g. failures • Automation at scale What do we need?
  32. 32. 32 We are trying to keep the design really simple. We don’t need many functions often perceived as desirable: • L2 (neither real nor emulated) • VM mobility – In scale-out applications nodes coming and going is the norm, no need to move them around while preserving state and identity – VM mobility increases complexity as it depends on other features • Multicast • We don't have too many changes in topology What we don’t need
  33. 33. 33 • Host with vLER (MPLS capable vRouter) • Fabric switching elements – LSRs • Centralized controller • Legacy routers. Need to interwork with fabric LSRs and controller. BGP 3107 is the tool Components
  34. 34. 34 • 3-label stack: topmost for egress switch, next for egress port, bottom for VM • vRouter uses {dst prefix, VRF} to impose label stack • Bottom label processed by destination vLER • Expected state on a fabric switch: #switches_in_the_fabric + #local_access_ports Forwarding model
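A minimal sketch of the forwarding model above: the vLER on the host keys on {VRF, destination prefix} and imposes the 3-label stack [egress switch, egress port, VM]. The label values and FIB entry are invented, and the last lines merely restate the expected fabric-switch state estimate.

```python
# The host vLER keys on {VRF, destination prefix} and imposes the 3-label stack
# [egress switch, egress port, VM]; all values below are invented.

fib = {
    ("tenant-a", "2001:db8:1::/64"): (16042, 20007, 30123),   # switch, port, VM
}

def impose(vrf: str, prefix: str) -> list[int]:
    switch, port, vm = fib[(vrf, prefix)]
    return [switch, port, vm]            # top of stack first

print(impose("tenant-a", "2001:db8:1::/64"))   # [16042, 20007, 30123]

# Expected FIB size on a fabric switch: one entry per switch in the fabric plus
# one per local access port, e.g. 64 switches + 48 ports = 112 entries.
print(64 + 48)
```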
  35. 35. 35 • iBGP 3107 (in-path RR w/ NHS) inside the fabric for reachability and label distribution (draft-lapukhov…, but with iBGP and labels) • iBGP 3107 to interwork with legacy routers – Session with the connected network element, with NHS, for the switch label – Session with the controller for the remaining labels, which bind to the switch label via the next hop • Label mappings on the edge of the fabric are stable and can be provisioned rather than signaled • Internal fabric failures are handled locally • Label mappings on vRouters are distributed centrally Control plane
  36. 36. Why Now and What’s Next?
  37. 37. 37 “The world is changed… I smell it in the air” • A lot of similar ideas in the industry • It seems that thinking is converging on something • But ... a lot of ugly ad-hoc solutions are popping up here and there • Better to implement a good solution before bad ones become entrenched • It would be a shame and a missed opportunity to stick with VXLAN/… for years when we could get MPLS instead Why Now?
  38. 38. 38 • Merchant silicon is finally MPLS capable. And the price is almost right. • Modern CPUs can process tens of Mpps in SW, making host-based switching feasible. • Several open source MPLS data plane implementations are emerging • Some "classical" MPLS control plane components, such as BGP 3107, are very useful and have been around for quite a long time. What’s Ready?
  39. 39. 39 • All RFC 3107 implementations are broken (multiple labels). Talk to your vendor • Silicon is not perfect. Talk to your vendor • A more expressive way to control late binding of control plane artifacts than BGP 3107 is needed • Perception of MPLS as a complex technology. It's the current MPLS control plane that is complex • Perception of MPLS as a WAN or metro technology Gaps
  40. 40. Thank you! Questions?
