RIFT: Routing In Fat Trees
A New DC Routing Protocol
Antoni Przygienda & Zhaohui (Jeffrey) Zhang
Juniper Distinguished Engineer
Background: DC Technologies Evolution
• Tree to CLOS topology
• Tree: core/aggregation/access layers
• Folded CLOS, or Fat Tree
• Spine & Leaf
• Layer 2 switching to Layer 3 routing
• Layer 3 routing underlay with Layer 2/3 overlay
• Layer 3 underlay routing: IGP → eBGP → RIFT
• For scaling, convergence and Opex considerations
Issues with LSR IGP
• Link State Routing protocols like OSPF/IS-IS have the following issues
when used in large-scale DCs
• Failure Impact Scope (aka Blast Radius)
• A small change (e.g. a single link up/down on a leaf) is flooded throughout an IGP
area, triggering an SPF recalculation on every node
• Rich connections make flooding unnecessarily redundant and inefficient
• Every node holds host routes of all servers (aggregation limits prefix mobility
and leads to blackholes on failures)
• As a result, eBGP was introduced as a DC underlay routing
technology
• RFC 7938
Issues with eBGP Solution
• The eBGP solution has the following prominent issues
• Cannot take advantage of the well-defined network topology
• E.g. ideally a leaf (tier-3) node only needs a default route, and a tier-2 node only needs a default route
and routes for destinations south of it
• This cannot be done due to black-holing upon link failure
• A node needs to keep all paths learnt from different neighbors
• A leaf node connecting to 32 tier-2 nodes needs to keep 32 paths for each of all the prefixes in the DC
• eBGP peering configuration burden
• There are other subtle issues making the eBGP band-aid solution not that
simple
• See appendix slide
RIFT: LINK-STATE UP, DISTANCE VECTOR DOWN & BOUNCE
RIFT 2017, Juniper Confidential
Northbound LSR
• Link State flooded northbound to the top tier
• With flooding reduction
• Each node has full view of the southbound topology
• A top tier node has full set of prefixes from the SPF calculation
• A middle tier node has only information necessary for its level
• All destinations south of the node, from its SPF calculation
• Default route (next slide)
• Potential disaggregated routes (next slide)
• Fast convergence and ECMP benefits of LSR
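The tiered split of routing state described above can be sketched as follows. This is an illustrative summary, not RIFT spec pseudocode; the tier names and dictionary layout are assumptions made for the example.

```python
# Illustrative sketch of the per-tier routing state in RIFT's northbound LSR
# (names and structure are this example's, not the RIFT specification's).

def routing_state(tier: str) -> dict:
    """Return the categories of routing information a node at the given tier holds."""
    if tier == "top":
        # Top-of-fabric node: full southbound topology view, full prefix
        # set computed from its northbound-flooded link state via SPF.
        return {"full_topology": True,
                "routes": ["all fabric prefixes (from SPF)"]}
    if tier == "middle":
        # Middle tier: only the information necessary for its level.
        return {"full_topology": False,  # sees only its southbound subtree
                "routes": ["destinations south of the node (from SPF)",
                           "default route (from north)",
                           "disaggregated routes (from north, when needed)"]}
    if tier == "leaf":
        return {"full_topology": False,
                "routes": ["locally attached prefixes",
                           "default route (from north)",
                           "disaggregated routes (from north, when needed)"]}
    raise ValueError(f"unknown tier: {tier}")
```

Note how only the top tier carries the full prefix set; leaves get by with little more than a default route, which is the scaling win over a flat IGP area.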
Southbound Distance Vector Routing
• Default route and automatically disaggregated routes (when needed)
advertised one-hop southbound
• When a level-2 node A detects that another level-2 node B cannot reach one
of A’s south destinations P, it advertises P via southbound DVR
• That way a south level-3 node will route P traffic only towards A (via the more specific
route) not towards B (via the default route)
• A node’s local link state is advertised one-hop southbound and then
reflected one-hop northbound
• So that node A can detect if node B can reach A’s south destinations
• Other than that, link state is not propagated south, greatly reducing impact
scope
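The disaggregation check described above can be sketched in a few lines: node A compares its own south reachability against the reflected southbound representation of a same-level node B. This is a simplified illustration under assumed data structures, not the protocol's actual algorithm.

```python
# Illustrative sketch (not RIFT spec pseudocode) of the automatic
# disaggregation check: A learns B's southbound adjacencies because B's
# link state is advertised one hop south and reflected one hop north.

def prefixes_to_disaggregate(my_south_routes, peer_south_adjacencies):
    """
    my_south_routes: dict mapping prefix -> set of southbound next-hop node IDs
    peer_south_adjacencies: set of southbound node IDs the same-level peer B
        is adjacent to (known via the south-then-reflected-north link state)
    Returns the prefixes B cannot reach, which A must advertise southbound
    as more-specific routes so traffic avoids B's default route.
    """
    broken = set()
    for prefix, next_hops in my_south_routes.items():
        # If B is adjacent to none of the viable next-hops toward this
        # prefix, traffic following B's default route would be black-holed.
        if not (next_hops & peer_south_adjacencies):
            broken.add(prefix)
    return broken

# Example: A reaches P1 only via leaf L1; B has lost its link to L1.
routes = {"P1": {"L1"}, "P2": {"L1", "L2"}}
print(prefixes_to_disaggregate(routes, {"L2"}))  # {'P1'}
```

In the example, P1 gets disaggregated (B cannot reach L1) while P2 does not (B still reaches L2), so only the genuinely broken destination generates a more-specific southbound route.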
Automatic De-Aggregation
• South representation of the red spine is reflected by the green layer
• Lower red spine sees that the upper node has no adjacency to the only
available next-hop to P1
• Lower red node disaggregates P1
Zero Touch Provisioning
• Only top tier nodes need to be configured
• Nodes that must be leaves or have leaf-leaf connections may be configured
• Nodes with specific configuration can be mixed with others
• Upon connection nodes will fully auto-configure themselves and form
adjacencies in a well defined north/south topology
• With optional east-west connections
• ZTP makes DC fabric like RAM banks
• No one configures RAM banks and CAS/RAS manually in a laptop
• DC fabric HW is largely commodity already
• DC fabric OPEX must and will commoditize
• RIFT enables that
Other Features of RIFT
• Optimal Reduction and Load-Balancing of Flooding
• Mobility Support
• Built-in support for rapidly moving a prefix from one leaf to another
• Key/Value Store
• Fabric Bandwidth Balancing: weighted all paths routing (RIFT is loop-free)
• Northbound: modify the distance of default route received from a neighbor based on available BW through
that neighbor
• Southbound: during SPF consider available BW through lower level nodes
• Segment Routing Support
• Leaf-to-leaf Procedures
• Allow E-W traffic strictly for local prefixes
• Policy Guided Prefixes
• Moved to a separate draft
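The northbound side of fabric bandwidth balancing can be sketched as a simple proportional weighting: the preference of the default route learned from each northbound neighbor is scaled by the bandwidth available through that neighbor. The formula below is illustrative only; RIFT defines its own metric adjustment.

```python
# Hedged sketch of northbound bandwidth-aware load balancing. The
# proportional split below is this example's simplification, not the
# distance-modification rule from the RIFT specification.

def weighted_default_routes(neighbors):
    """
    neighbors: dict mapping northbound neighbor ID -> available bandwidth
        through that neighbor (e.g. in Gbps)
    Returns per-neighbor weights proportional to bandwidth, so traffic
    splits across all loop-free northbound paths rather than pure ECMP.
    """
    total = sum(neighbors.values())
    if total == 0:
        raise ValueError("no northbound bandwidth available")
    return {n: bw / total for n, bw in neighbors.items()}

# A leaf with two spines, one of which has lost half its uplink capacity:
print(weighted_default_routes({"S1": 100, "S2": 50}))
# S1 carries two thirds of the northbound traffic, S2 one third.
```

Because RIFT is loop-free by construction (traffic only goes north then south), such non-equal-cost splitting is safe, which is what the "weighted all paths routing" bullet refers to.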
Summary of RIFT Advantages
• Advantages of Link-State and Distance Vector
• Fastest Possible Convergence
• Automatic Detection of Topology
• Minimal Routes/Info on ToRs
• High Degree of ECMP
• Fast De-commissioning of Nodes
• Maximum Propagation Speed with Flexible # of Prefixes in an Update
• No Disadvantages of Link-State or Distance Vector
• Reduced and Balanced Flooding
• Automatic Neighbor Detection
• Unique RIFT Advantages
• True ZTP
• Minimal Blast Radius on Failures
• Can Utilize All Paths Through Fabric Without Looping
• Automatic Disaggregation on Failures
• Simple Leaf Implementation that Can Scale Down to Servers
• Key-Value Store
• Horizontal Links Used for Protection Only
• Supports Non-Equal-Cost Multipath and Can Replace MC-LAG
• Optimal Flooding Reduction and Load-Balancing
Standardization/Industry Status
• IETF Standard Track Working Group
• https://datatracker.ietf.org/wg/rift/about/
• Juniper co-chair
• Base protocol standard targeted at Feb 2019
• YANG and other things forthcoming
• Juniper main author with Cisco/Comcast co-authors
• Other design team members from Bloomberg, HPE, Mellanox and
open source community
• Two independent implementations
• Juniper: available on-box (JUNOS) or standalone package (for public
evaluation)
• Open Source: by Bruno Rijsman (on-going)
Appendices
eBGP Solution Nuances
• ASN related knobs/tricks
• Numbering scheme to control path-hunting
• Allow-own-as knob to reuse private ASNs
• “Relaxed ECMP Multipath” since ECMP over different AS does not normally work with eBGP
• Additional work to get beyond 65K ASNs
• To see fabric topology BGP-LS must be run on top
• Add-path to support multi-homing, N-ECMP and prevent oscillation
• Vendor-specific provisioning & configuration
• Emerging work on peer auto-discovery diametrically opposite to BGP design principles
• “Violations” of BGP FSM (e.g. restart and MRAI timers)
• Reliance on peer-groups to prevent withdraw dispersion and path hunting upon server link
failures
• Less-than-successful attempts at prefix summarization without black-holing/micro-loops
Requirements Breakdown (RFC 7938+) for a "Minimal OPEX" Fabric
• Peer Discovery/True ZTP/Preventing Cabling Violations ⚠️ ⚠️
• Minimal Amount of Routes/Information on ToRs
• High Degree of ECMP (BGP needs lots of knobs, memory, own-AS-path
violations) and ideally NEC and LFA ⚠️
• Non-Equal-Cost Multi-Path, ECMP-Independent Anycast, MC-LAG Replacement
• Traffic Engineering by Next-Hops, Prefix Modifications
• See All Links in Topology to Support PCE/SR ⚠️
• Carry Opaque Configuration Data (Key-Value) Efficiently ⚠️
• Take a Node out of Production Quickly and Without Disruption
• Automatic Disaggregation on Failures to Prevent Black-Holing and
Back-Hauling
• Minimal Blast Radius on Failures (On Failure the Smallest Possible Part of
the Network "Shakes")
• Fastest Possible Convergence on Failures
• Bandwidth Load Balancing
• Simplest Initial Implementation
