BGP in the Datacenter
Pete Lumbis – @PeteCCDE
Datacenter Architect
CCIE #28677, CCDE 2012::3
cumulusnetworks.com 1
Pete Who?
CCIE R&S #28677, CCDE 2012::3
Former Cisco TAC Routing Escalation
Current Cumulus Networks SE
DC Automation and Architecture
Agenda
The history of L2
Routing in the datacenter
BGP in the datacenter
Troubleshooting improvements
BGP on Servers
In the Beginning…
There was L2…
…but it had problems:
▪ 50% bandwidth loss due to STP blocking redundant links
▪ Unexpected root bridge changes
▪ STP brownouts: flooding, temporary loops, STP block on TCN
Agenda
The history of L2
Routing in the datacenter
BGP in the datacenter
Troubleshooting improvements
BGP on Servers
Layer 3 Clos
Server gateway is the attached leaf
Routing between spine and leafs (OSPF or BGP)
Leaf subnets: 10.1.1.0/24, 10.2.2.0/24, 10.3.3.0/24
Layer 3 – Spine and Leaf
Full ECMP
Manageable Oversubscription
  ▪ 48 x 10Gig = 480 Gigs of downlink
  ▪ 2 x 40Gig = 80 Gigs of uplink = 6:1 oversubscription
Easy to Adjust
  ▪ 3 x 40Gig = 120 Gigs of uplink = 4:1 oversubscription
Massive Scale
Controlled Failures
  ▪ Leaf failure reduces compute
  ▪ Spine failure increases oversubscription
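The oversubscription figures on these slides are plain arithmetic: total server-facing bandwidth divided by total uplink bandwidth. A minimal sketch of that calculation follows; the function name and the extra 4-uplink case are illustrative and not from the deck.

```python
from fractions import Fraction


def oversubscription(down_ports, down_gbps, up_ports, up_gbps):
    """Return downlink bandwidth : uplink bandwidth as a reduced fraction."""
    return Fraction(down_ports * down_gbps, up_ports * up_gbps)


# 48 x 10G server ports on the leaf, a varying number of 40G uplinks.
for uplinks in (2, 3, 4):
    ratio = oversubscription(48, 10, uplinks, 40)
    print(f"{uplinks} x 40G uplinks -> {ratio.numerator}:{ratio.denominator}")
```

The same function covers the failure case: a leaf with three uplinks that loses one spine is left with two uplinks, so it moves from 4:1 back to 6:1.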
Agenda
The history of L2
Routing in the datacenter
BGP in the datacenter
Troubleshooting improvements
BGP on Servers
BGP as an IGP
RFC Draft submitted 2014
Microsoft and Facebook
Targeting DC
All the hows and whys
But I thought BGP was…
…slow
  ▪ Nope. Not with BFD and timer tuning. Just as fast as OSPF.
…hard to configure
  ▪ We'll get to that one later, but it can be easy.
…only for service providers
  ▪ SPs build for scale and stability. You should too.
…hard to troubleshoot
  ▪ Nice and easy when everything is defined, plus recent advances.
BGP Datacenter Design
Single ASN for Spines
Unique ASN for Leafs
Use Private ASN range
  ▪ 2-byte (1023 ASNs): 64512 – 65534
  ▪ 4-byte (about 95 million ASNs): 4200000000 – 4294967294
Diagram: both spines in AS 65534; leafs in AS 64512, 64513 and 64514
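The ASN plan on this slide (one shared spine ASN, one unique private ASN per leaf) is easy to generate and sanity-check in a script. The sketch below is only an illustration: the helper names and the leaf names are hypothetical, and the two ranges are the private ranges quoted on the slide.

```python
# Private ASN ranges as quoted on the slide (upper bounds are inclusive).
PRIVATE_2BYTE = range(64512, 65534 + 1)
PRIVATE_4BYTE = range(4200000000, 4294967294 + 1)


def is_private_asn(asn):
    """True if the ASN falls in either private range."""
    return asn in PRIVATE_2BYTE or asn in PRIVATE_4BYTE


def assign_leaf_asns(leaf_names, spine_asn=65534, first_leaf_asn=64512):
    """Give every leaf its own private ASN, never reusing the spine ASN."""
    plan = {}
    for offset, name in enumerate(leaf_names):
        asn = first_leaf_asn + offset
        if asn == spine_asn or not is_private_asn(asn):
            raise ValueError(f"ran out of usable private ASNs at {name}")
        plan[name] = asn
    return plan


print(len(PRIVATE_2BYTE), len(PRIVATE_4BYTE))
print(assign_leaf_asns(["leaf1", "leaf2", "leaf3"]))
```

With the 2-byte range and the spine on 65534, this simple scheme runs out after 1022 leafs, which is one reason to reach for the 4-byte range in larger fabrics.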
Reducing BGP Configuration Complexity
Classically lots to manage
router bgp 65534
 router-id 10.0.0.1
 neighbor 10.1.1.1 remote-as 64512
 neighbor 10.1.1.2 remote-as 64513
 neighbor 10.1.1.3 remote-as 64514
 neighbor 10.1.1.1 timers 1 3
 neighbor 10.1.1.2 timers 1 3
 neighbor 10.1.1.3 timers 1 3
 neighbor 10.1.1.1 timers connect 3
 neighbor 10.1.1.2 timers connect 3
 neighbor 10.1.1.3 timers connect 3
Reducing BGP Configuration Complexity
First – Simplify Remote AS
router bgp 65534
 router-id 10.0.0.1
 neighbor 10.1.1.1 remote-as external
 neighbor 10.1.1.2 remote-as external
 neighbor 10.1.1.3 remote-as external
 neighbor 10.1.1.1 timers 1 3
 neighbor 10.1.1.2 timers 1 3
 neighbor 10.1.1.3 timers 1 3
 neighbor 10.1.1.1 timers connect 3
 neighbor 10.1.1.2 timers connect 3
 neighbor 10.1.1.3 timers connect 3
remote-as internal works as well (for iBGP peers)
Reducing BGP Configuration Complexity
Next – Use Peer Groups
router bgp 65534
 router-id 10.0.0.1
 neighbor 10.1.1.1 peer-group leafs
 neighbor 10.1.1.2 peer-group leafs
 neighbor 10.1.1.3 peer-group leafs
 neighbor leafs remote-as external
 neighbor leafs timers 1 3
 neighbor leafs timers connect 3
Reducing BGP Configuration Complexity
Finally – BGP Unnumbered
router bgp 65534
 router-id 10.0.0.1
 neighbor swp1 peer-group leafs
 neighbor swp2 peer-group leafs
 neighbor swp3 peer-group leafs
 neighbor leafs remote-as external
 neighbor leafs timers 1 3
 neighbor leafs timers connect 3
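To see how much the three simplifications save as the fabric grows, the two styles can be rendered from a template. This is a rough sketch: both functions are hypothetical helpers that reproduce the lines exactly as they appear on the slides, and a real Quagga or FRR release may need extra statements (for example, declaring the peer group first), so check the documentation for your version before using the output.

```python
def classic_config(asn, router_id, neighbors):
    """neighbors is a list of (ip, remote_asn) pairs: three lines per neighbor."""
    lines = [f"router bgp {asn}", f" router-id {router_id}"]
    lines += [f" neighbor {ip} remote-as {remote}" for ip, remote in neighbors]
    lines += [f" neighbor {ip} timers 1 3" for ip, _ in neighbors]
    lines += [f" neighbor {ip} timers connect 3" for ip, _ in neighbors]
    return lines


def unnumbered_config(asn, router_id, interfaces, group="leafs"):
    """One line per interface, plus three shared peer-group lines."""
    lines = [f"router bgp {asn}", f" router-id {router_id}"]
    lines += [f" neighbor {ifname} peer-group {group}" for ifname in interfaces]
    lines += [
        f" neighbor {group} remote-as external",
        f" neighbor {group} timers 1 3",
        f" neighbor {group} timers connect 3",
    ]
    return lines


classic = classic_config(
    65534, "10.0.0.1",
    [("10.1.1.1", 64512), ("10.1.1.2", 64513), ("10.1.1.3", 64514)],
)
unnumbered = unnumbered_config(65534, "10.0.0.1", ["swp1", "swp2", "swp3"])
print("\n".join(unnumbered))
print(len(classic), "lines classic vs", len(unnumbered), "lines unnumbered")
```

The classic form grows by three lines per neighbor and needs each peer's IP and ASN; the unnumbered form grows by one line per port and needs neither, which is what makes it easy to template across every switch in the fabric.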
BGP Unnumbered
Uses IPv6 link-local addresses
  ▪ Automatically assigned, no address management
No need for infrastructure IPs
  ▪ Only need loopbacks
Advertises both IPv4 and IPv6 routes
  ▪ RFC 5549. Full interop with Cisco, Arista, Juniper
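"Automatically assigned" can be made concrete. One common way an interface gets its IPv6 link-local address is the modified EUI-64 method, which derives it from the port's MAC address. The sketch below assumes that method; hosts configured for other address-generation modes will produce different addresses, and the MAC shown is only an example value.

```python
import ipaddress


def link_local_from_mac(mac):
    """Derive an fe80::/64 address from a MAC using modified EUI-64."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 6-octet MAC such as 44:38:39:00:00:01")
    octets[0] ^= 0x02  # flip the universal/local bit
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    # fe80 prefix, 48 zero bits, then the 64-bit interface identifier
    return ipaddress.IPv6Address(b"\xfe\x80" + bytes(6) + interface_id)


addr = link_local_from_mac("44:38:39:00:00:01")
print(addr, addr.is_link_local)
```

Because each end of the link can learn the other's link-local address on its own, neither side needs a configured /30 or /31, which is why only loopbacks remain to be managed.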
Agenda
The history of L2
Routing in the datacenter
BGP in the datacenter
Troubleshooting improvements
BGP on Servers
BGP Troubleshooting Improvements - Traceroute
How do you troubleshoot links without IPs?
Traceroute improvements
  ▪ Report back loopback IP
BGP Troubleshooting Improvements - Hostnames
Who is the peer?
Hostname BGP extension
  ▪ draft-walton-bgp-hostname-capability
Comparing BGP Configurations
Traditional Config
router bgp 65534
 router-id 10.0.0.1
 maximum-paths 64
 bgp bestpath as-path multipath-relax
 neighbor 10.1.1.1 remote-as 64512
 neighbor 10.1.1.2 remote-as 64513
 neighbor 10.1.1.3 remote-as 64514
 neighbor 10.1.1.1 timers 1 3
 neighbor 10.1.1.2 timers 1 3
 neighbor 10.1.1.3 timers 1 3
 neighbor 10.1.1.1 timers connect 3
 neighbor 10.1.1.2 timers connect 3
 neighbor 10.1.1.3 timers connect 3
Cumulus Config
router bgp 65534
 router-id 10.0.0.1
 neighbor swp1 peer-group leafs
 neighbor swp2 peer-group leafs
 neighbor swp3 peer-group leafs
 neighbor leafs remote-as external
Agenda
The history of L2
Routing in the datacenter
BGP in the datacenter
Troubleshooting improvements
BGP on Servers
BGP to the Server
Why stop at the top of rack?
BGP to the Server!
Cumulus Quagga, GoBGP, BIRD
  ▪ Just Linux apps!
No L2, no MLAG, no infrastructure IPs
  ▪ Use BGP Unnumbered
Same troubleshooting and monitoring
Summary
L3 > L2
  ▪ At least 1 better
  ▪ Routing provides better scale and stability
Easy to configure, automate, troubleshoot
BGP all the way to the server!
Smart defaults and configuration simplifications
© 2014 Cumulus Networks. Cumulus Networks, the Cumulus Networks Logo, and Cumulus Linux are trademarks or registered trademarks of Cumulus Networks, Inc. or its affiliates in the U.S. and other countries. Other names may be trademarks of their respective owners. The registered trademark Linux® is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a world-wide basis.
Thank You!
Asaf Wachtel, Sr. Director Enterprise
July 2016
25GbE Technology Update
© 2016 Mellanox Technologies 44- Mellanox Confidential -
Open Composable Networks
Open APIs, Automation, End-to-End Interconnect, Network OS Choice, SONiC
Open Networking is Real: OCP Summit March 2016
25/50/100GbE: The Future is Here!
Today: 10GbE compute nodes, 40GbE storage nodes, 40GbE network
Next:
  • Compute nodes at 25GbE: 150% higher bandwidth
  • Storage nodes at 50GbE: 25% higher bandwidth
  • Network at 100GbE: 150% higher bandwidth
Similar Connectors, Similar Infrastructure, Similar Cost / Power
Who needs more than 10GbE?
▪ Latest multi-core Intel CPUs can easily drive more than 10Gb/s
▪ Cloud (public or private)
  • Multi-tenancy
  • Need to deliver higher SLAs with lower predictability
▪ Hyperconverged / Software Defined Storage / NVMe
  • Network & storage on the same wire
  • Faster & cheaper storage media
▪ Database / Big Data
  • Increasing volumes
  • Moving from batch to real-time
▪ Network Function Virtualization (NFV)
  • I/O intensive data plane
Why 25GbE? Do the Math!
▪ Best match for current PCI technology
  • PCIe3 x8 = ~52Gb/s; 2 x 25 = 50Gb/s
▪ Most efficient switch silicon design
  • Maximizes both ports and bandwidth
  • 40GbE requires 4 lanes per port == cost + power
▪ Unmatched price-performance / best price per Gb/s
  • 25G = 2.5X BW at 1.5x the price
▪ Lower OPEX & TCO
  • Cut number of NICs, cables, switch ports in half
  • Lower power & cooling
▪ Better switch port density
  • Fewer uplinks needed to maintain 1:1 subscription
▪ Uses existing fiber infrastructure (single lane)
▪ Fully backward compatible
  • Mix/match new 25GbE components and existing 10GbE
▪ Future proof + economies of scale (50/100GbE)
  • 50Gb is 2x25G, 100G is 4x25G
2.5X bandwidth with single-lane technology
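"Do the math" can be taken literally. The sketch below works through three of the claims on this slide using only the slide's own numbers (2.5X bandwidth at 1.5x the price, ~52Gb/s for a PCIe3 x8 slot, 4 lanes for 40GbE). The function and variable names are illustrative, and the price multiplier is the presenter's figure rather than a measured one.

```python
def relative_cost_per_gbps(bandwidth_multiplier, price_multiplier):
    """Cost per Gb/s relative to the baseline (1.0 means no change)."""
    return price_multiplier / bandwidth_multiplier


# 25G = 2.5X the bandwidth of 10G at 1.5x the price.
cost_ratio = relative_cost_per_gbps(2.5, 1.5)
print(f"25GbE cost per Gb/s is {cost_ratio:.0%} of 10GbE")

# A dual-port 25GbE adapter against the slide's ~52 Gb/s PCIe3 x8 budget.
pcie3_x8_gbps = 52
dual_port_gbps = 2 * 25
print("dual-port 25GbE fits PCIe3 x8:", dual_port_gbps <= pcie3_x8_gbps)

# Bandwidth per lane: 40GbE spreads 40G over 4 lanes, 25GbE uses a single lane.
print("Gb/s per lane:", {"40GbE": 40 / 4, "25GbE": 25 / 1})
```

Read this way, the pitch is that each Gb/s costs about 40% less and that each lane carries 2.5 times as much traffic.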
25GbE Industry Timeline
▪ March 2014: Microsoft presents proposal for 25GbE to IEEE, leveraging existing activities such as 25G PHY (100GbE) & SFP28 (32G FC)
▪ July 2014: Open industry consortium formed to bring 25 and 50 Gigabit Ethernet to cloud-scale networks
▪ August 2015: First products ship to end customers
▪ September 2015: The 25G Ethernet Consortium specification draft completed
▪ December 2015: Multi-vendor interoperability validated by multiple customers
▪ Q4 2015 – Q2 2016: Ecosystem grows and matures
▪ June 2016: IEEE 802.3by standard approved by the IEEE-SA Standards Board
25GbE vs 10GbE
                                  25GbE                         10GbE
Standard                          SFP28                         SFP+
Physical form factor              SFP                           SFP
Number of lanes                   1                             1
Lane speed                        25Gbps                        10Gbps
Encoding                          64b/66b                       64b/66b
Backward/forward compatibility    Fully interoperable @ 10Gb/s  Fully interoperable @ 10Gb/s
Max copper reach                  5m                            7m
MM fiber reach                    100m                          300m
SM fiber reach                    10km                          10km
3 Types of Connectivity Products
Direct Attach Copper (DAC)
  • "Transceiver" with 4 transmit channels and 4 receive channels, joined by copper wires
  • Directly attaches one system to another
  • Key feature = lowest priced link
  • <3m reaches
Optical Transceiver
  • Converts electrical signals to optical
  • Transmits blinking laser light over optical fiber
  • Key feature = long reach, up to 10km
Active Optical Cable
  • 2 transceivers with optical fiber bonded in
  • Key feature = lowest priced optical link
  • 100m/200m reaches
Pictured: SFP28 LC transceiver, QSFP28 LC transceiver, QSFP28 MPO transceiver
As Data Rates Increase, Distances Decrease
Favoring Silicon Photonics + Single-mode Fiber
[Chart: data rate per lane (Gb/s) against link length (m), with regions for copper (DACs), multi-mode fiber (OM3/OM4, SR/SR4 VCSELs) and single-mode fiber (silicon photonics); reach labels 3-5m, 70m, 100m, 2km/10km]
Direct Attach Copper
  • Zero power
  • Demo'd 8m at 100G
  • Best fit 3m
Active Optical Cables
  • VCSEL 100m
  • Silicon Photonics 200m
  • Best fit for 5-20m
SR/SR4 VCSEL Transceivers
  • Reaches to 100m
  • Best fit for MMF
  • Structured cabling
Silicon Photonics Transceivers
  • Reaches to 2km
  • Best fit for SMF
  • Parallel PSM4 or WDM4
MMF = multi-mode fiber; SMF = single-mode fiber
Webscale IT Innovation:
QSFP TOR for 4x Density and Lower COGS
Break-out cabling vs standard cabling
  • Break-out: single cable, EST = $166
  • Standard: Qty (4) cables @ $54 = $216
  • Benefits: easier cable management, fewer cables, 23% lower cost
  • $1000 less cable cost per rack
Ideal port density and configuration deployment options
Mellanox
  • 16 QSFP28 ports (32 in 1 RU*)
  • Up to 128 10/25GbE ports in 1 RU
  • Logical configuration options: redundant "48 + 4" in 1 RU
  • Benefits: flexible configuration options, highest port density, lowest power consumption, half-width deployment option
Competition (to achieve equivalent bandwidth)
  • 4 SFP+ plus 4 QSFP+ ports
  • Up to 128 ports of 10GbE in 2 RU
  • Illogical configuration with wasted ports
* RU = rack unit
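The cabling figures on this slide can be checked the same way. The prices below are the slide's estimates; the count of 20 break-out cables per rack is an assumption, chosen because it is the number that reproduces the $1000 per-rack saving, and the deck does not state it.

```python
STANDARD_CABLE_USD = 54   # one 10GbE cable, per the slide
BREAKOUT_CABLE_USD = 166  # one QSFP break-out cable replacing four standard cables


def breakout_saving_usd(breakout_cables):
    """Saving from using break-out cables instead of four standard cables each."""
    standard_cost = breakout_cables * 4 * STANDARD_CABLE_USD
    breakout_cost = breakout_cables * BREAKOUT_CABLE_USD
    return standard_cost - breakout_cost


per_cable = breakout_saving_usd(1)
percent = per_cable / (4 * STANDARD_CABLE_USD) * 100
print(f"${per_cable} saved per break-out cable ({percent:.0f}% lower)")

# Assumed rack size: 20 break-out cables (80 server ports).
print(f"${breakout_saving_usd(20)} saved per rack")
```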
Summary: 25/50/100GbE is Here!
▪ 100GbE Adapter
  • 10 / 25 / 40 / 50 / 56 / 100GbE
  • 150 million messages per second
▪ Switch
  • 32 100GbE ports, 64 25/50GbE ports
  • 10 / 25 / 40 / 50 / 56 / 100GbE
  • Throughput of 6.4Tb/s
▪ Transceivers, Active Optical and Copper Cables
  • 10 / 25 / 40 / 50 / 56 / 100GbE
  • VCSELs, Silicon Photonics and Copper

July NYC Open Networking Meetup


Editor's Notes

  • #16, #17, #18 Depends on oversubscription needs. Given 2:1 as the requirement: a two-tier design with a 32-port 40 gig spine and a 48-port 10 gig leaf supports ~1500 hosts; a three-tier design supports ~65,000 hosts.
  • #20 This is important as it created an easy to read set of guidelines on what to build and how to do it as well as the justification of their choices.
  • #45 Now that you understand how the hyperscalers build their cloud network infrastructure, you might start thinking: OK, that is great, but I don't have the manpower and the large number of software developers needed to follow this model. The good news is that Mellanox and our ecosystem partners make things easy for you. We have a solution called Open Composable Networks that provides a set of high-performance, highly programmable networking components, including switches, server adapters, optical modules and cables, and network processors, which support open APIs such as SAI and switchdev for Linux. On top of these standard interfaces, you have a slew of network operating system and software application choices. As a matter of fact, at this year's OCP Summit last month, we did a live demo of 5 different network operating systems running on our flagship Spectrum switches. We also provide the middleware that makes it easy to compose your ideal cloud network infrastructure, and simple to monitor, manage and scale it.
  • #55 Not only did we carve performance on our flag, we carved Mellanox on the performance flag as well by continuously providing leading technology to a variety of applications. What we are looking at here is just a glimpse of the latest capabilities provided by Mellanox, demonstrating a complete set of solutions for the highest-performing Ethernet speeds. Each element leads its market as a standalone product and provides the complete experience when combined into an end-to-end solution. In today's session we will focus on the switch side, but in our demo we will show adapters and cables as well.
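The ~1500-host figure in the notes for slides #16 to #18 can be reproduced with a short calculation. This is a sketch under stated assumptions: every leaf connects to every spine with one 40 gig uplink, every spine port goes to a leaf, and the leaf has enough 40 gig uplink ports for the target ratio. The function name is illustrative, and the three-tier figure is not derived here.

```python
import math


def two_tier_hosts(leaf_host_ports, host_gbps, uplink_gbps, spine_ports, oversub):
    """Return (spines, leafs, hosts) for a two-tier leaf/spine fabric."""
    uplink_bw = leaf_host_ports * host_gbps / oversub  # uplink bandwidth needed per leaf
    spines = math.ceil(uplink_bw / uplink_gbps)        # one uplink to each spine
    leafs = spine_ports                                # each spine port feeds one leaf
    return spines, leafs, leafs * leaf_host_ports


spines, leafs, hosts = two_tier_hosts(48, 10, 40, 32, oversub=2)
print(f"{spines} spines, {leafs} leafs, {hosts} hosts")
```

At 2:1, each 48-port leaf needs 240 Gigs of uplink, which is six 40 gig links, so the fabric is 6 spines and 32 leafs.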