PLNOG16: Data center interconnect dla opornych (Data Center Interconnect for Dummies), Krzysztof Mazepa


  1. Data Center Interconnect dla opornych ... Krzysztof Mazepa
  2. Agenda
     • Introduction
     • Business Continuity Solutions
     • DCI Solutions for LAN extension
       o OTV, EVPN and others
     • Interconnecting Fabrics
       o VXLAN (Stretched Fabric & Dual Fabrics)
       o ACI (Stretched Fabric & Dual Fabrics)
     • Key Takeaways
  3. Network Evolution — Business Drivers
     DC fabrics are increasing in scale, extended across sites, and require more security.
     • Virtualization drives 10G to the edge: high-density 10G at the edge, 40G & 100G in core/aggregation, unified I/O & fabric
     • Clustered applications & Big Data drive East-West (E-W) traffic: non-blocking FabricPath / VXLAN / ACI, ECMP, predictable lower latency
     • Multi-tenancy: secure segmentation, resource sharing
     • Business continuity: multi-site DCI extensions, storage extensions, workload mobility
     • More virtual workloads per server: large L2 domains, non-disruptive migration, increased dependence on East-West traffic in the data center
     • Low cost, standard protocols, open architectures; automated, policy-driven provisioning and management
  4. Business Continuity and Disaster Recovery
     The ability to absorb the impact of a disaster and continue to provide an acceptable level of service.
     • "Disaster Avoidance (DA)": planned or unplanned service continuity within the metro, without interruption of services (including the loss of a single data center)
     • "Disaster Recovery (DR)": loss of regional data centers leads to recovery in a remote data center
     Active-Active main DCs — business continuity with no interruption:
     • Shorter distances, synchronous replication, low latency
     • Distribute apps, hot live migration, active/active DCs
     • Integration of stateful devices, private cloud, LAN extension
     • RTO/RPO ~= 0
     Disaster recovery / hybrid cloud — pervasive data protection + infrastructure rebalance:
     • Long distances, beyond app latency, asynchronous replication
     • Move apps (not distribute); cold migration of stateless services
     • Public/hybrid cloud, subnet extension
     • RTO > hours/days; RPO > several seconds to minutes
  5. Latency Considerations
     • Intrinsically, LAN extension technologies do not impose a latency limit between sites
     • However, some latency limits are imposed by DC technologies:
       o Live Migration is in general limited to 10 ms maximum
       o ACI Stretched Fabric is limited to 10 ms
     • Mostly, latency is imposed by data storage:
       o Cisco I/O Acceleration improves latency by 50%
       o EMC VPLEX Metro or NetApp MetroCluster (Active/Active) theoretically allow a maximum of 10 ms
       o If replication is asynchronous, distance is "unlimited" (usually referred to as a "GEO cluster")
     • Usually an A/A HA stretched cluster (Campus or Metro Cluster) allows a maximum of 10 ms
     • In addition, multi-tier applications cannot tolerate high latency between tiers:
       o Recommendation: move all tiers and storage for an application together (host affinity)
       o Stateful devices imply return traffic to the owner of the stateful session
     • User-to-application latency and bandwidth consumption can be mitigated using path optimization:
       o LISP Mobility
       o Host routing advertisement
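The 10 ms budgets above translate directly into fiber distance. A rough sketch (my own illustration, not from the slides) using the usual ~5 µs/km one-way propagation delay of light in fiber:

```python
# Rough DCI latency budget: light in fiber travels ~200,000 km/s,
# i.e. about 5 microseconds per km one way (10 us/km round trip).
FIBER_DELAY_US_PER_KM = 5.0  # one-way propagation delay

def max_distance_km(rtt_budget_ms: float) -> float:
    """Max site separation for a given round-trip latency budget,
    ignoring serialization, queuing and transceiver delays."""
    one_way_us = (rtt_budget_ms * 1000.0) / 2.0
    return one_way_us / FIBER_DELAY_US_PER_KM

# A 10 ms RTT budget (Live Migration, ACI Stretched Fabric) caps sites
# at roughly 1000 km of fiber -- in practice much less, since equipment
# and protocol overhead eat into the same budget.
print(max_distance_km(10.0))  # -> 1000.0
```

This is why the slides pair 10 ms limits with "metro" distances: the propagation delay alone consumes the budget well before continental scale.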
  6. Requirements for the Active-Active Metro Design — Hot Live Migration
     Move a virtual workload across metro data centers while maintaining stateful services.
     Business continuity use cases for live mobility (most business-critical applications, lowest RPO/RTO):
     • Stateful live workload migrations
     • Operations: rebalancing / maintenance / consolidation of live workloads
     • Disaster avoidance of live workloads
     • Application HA clusters spanning metro DCs (<10 ms)
     Hypervisor tools for live mobility:
     • VMware vMotion or Hyper-V Live Migration; stretched HA clusters across metro DCs (<10 ms)
     • Host affinity rules to manage resource allocation
     • Distributed vCenter or System Center across metro DCs
     Metro DC infrastructure to support live workload mobility:
     • Network: LAN-extension DCI and localized E-W traffic; virtual switches distributed across metro distances; maintained multi-tenant containers; localized E-W traffic using a distributed default gateway
     • Services: maintain stateful services for active connections; minimize traffic tromboning between metro DCs
     • Compute: support single-tier and multi-tier applications
     • Storage: shared storage extended across the metro; synchronous data replication; distributed virtual volumes; Hyper-V Shared Nothing Live Migration (storage agnostic)
  7. Requirements for Metro/Geo Data Centers — Cold Migration
     Move a stopped virtual workload across metro/geo DCs and reboot the machine at the new site.
     Business continuity use cases for cold mobility (less business-critical applications, medium to high RPO/RTO):
     • Planned workload migrations of stopped VMs
     • Operations: rebalancing / maintenance / consolidation of stopped workloads
     • Disaster avoidance and disaster recovery of stopped workloads
     Hypervisor tools for cold mobility:
     • VMware Site Recovery Manager (SRM) or Hyper-V Failover Clustering
     • Geo-clusters across A/A or A/S geographically dispersed DCs
     • Host affinity rules to manage resource allocation; many-to-one site recovery scenarios
     VMDC infrastructure to support cold workload mobility:
     • Network: subnet-extension DCI (LAN extension optional); localized N-S traffic using ingress path optimization; creation of new multi-tenant containers; cold migration across unlimited distances
     • Services: service connections temporarily disrupted; new service containers created at the new site; traffic tromboning between DCs can be reduced
     • Compute: support single-tier and multi-tier applications
     • Storage: asynchronous data replication to the remote site (NetApp SnapMirror); Hyper-V Replica asynchronous data replication (storage agnostic); virtual volumes silo'd to each DC
  8. Interconnecting Data Centers with LAN extension
  9. Scope of the L2 DCI Requirements
     Must have:
     • Failure domain containment
       o Control-plane independence: STP domain confined inside the DC; EVPN multi-domains
       o Control-plane MAC learning
       o Reduce any flooding; control the BUM*
     • Site independence: dual-homing with independent paths
     • Reduced hair-pinning
       o Distributed L2 gateway on the ToR
       o Localized E-W traffic: FHRP isolation, anycast L3 default gateway
     • Fast convergence
     • Transport agnostic
     Additional improvements:
     • ARP suppress, ARP caching
     • VLAN translation
     • Choice of IP multicast or non-IP-multicast transport
     • Multi-homing
     • Path diversity (VLAN-based / flow-based / IP-based)
     • Load balancing (A/S, VLAN-based, flow-based)
     • Localized N-S traffic (for long distances)
       o Ingress path optimization (LISP)
       o Works in conjunction with egress path optimization (FHRP localization, anycast L3 gateway)
     * BUM: Broadcast, Unknown unicast and Multicast
  10. LAN Extension for DCI — Technology Selection
      Metro style — over dark fiber or protected DWDM:
      • VSS & vPC: dual-site interconnection
      • FabricPath, VXLAN & ACI Stretched Fabric: multiple-site interconnection
      SP style — MPLS transport:
      • EoMPLS: transparent point-to-point
      • VPLS: large scale & multi-tenant, point-to-multipoint
      • PBB-EVPN: large scale & multi-tenant, point-to-multipoint
      IP style — IP transport:
      • OTV: enterprise-style inter-site MAC routing
      • LISP: for subnet extension and path optimization
      • VXLAN/EVPN: emerging A/A site interconnect (Layer 2 only, or with anycast L3 gateway)
  11. DCI LAN Extension — IP-Based Solution: OTV
  12. Traditional Layer 2 VPNs — Extending the Failure Domain
      • Unknown unicast flooding is used to propagate MAC reachability
      • The flooding domain is therefore extended to every site
      Our goal: provide Layer 2 connectivity, yet restrict the reach of the unknown-unicast flooding domain in order to contain failures and preserve resiliency.
  13. Overlay Transport Virtualization — Technology Pillars
      OTV is a "MAC in IP" technique to extend Layer 2 domains over any transport.
      • Protocol learning: built-in loop prevention, preserved failure boundary, site independence, automated multi-homing
      • Dynamic encapsulation: no pseudo-wire state maintenance, optimal multicast replication
      • Multipoint connectivity: point-to-cloud model
      Platforms: Nexus 7000 was the first to support OTV (since NX-OS release 5.0); the ASR 1000 now supports it as well (since XE release 3.5).
  14. Overlay Transport Virtualization — OTV Terminology
      • Edge Device (ED): connects the site to the (WAN/MAN) core and is responsible for performing all the OTV functions
      • Internal interfaces: L2 interfaces (usually 802.1Q trunks) of the ED that face the site
      • Join interface: L3 interface of the ED that faces the core
      • Overlay interface: logical multi-access, multicast-capable interface; it encapsulates Layer 2 frames in IP unicast or multicast headers
  15. Overlay Transport Virtualization — OTV Control Plane
      • Neighbor discovery and adjacency over:
        o Multicast (Nexus 7000 and ASR 1000)
        o Unicast (Adjacency Server mode, currently available on the Nexus 7000 from release 5.2)
      • OTV proactively advertises/withdraws MAC reachability (control-plane learning): new MACs learned on a site VLAN are announced in OTV updates exchanged via the L3 core, and remote EDs install them pointing at the advertising ED's IP address
      • IS-IS is the OTV control protocol; no specific configuration is required
  16. Overlay Transport Virtualization — Inter-Site Packet Flow
      1. Server 1 (West site) sends a frame to MAC 3 (East site); the West ED's Layer 2 lookup finds MAC 3 reachable via IP B
      2. The West ED encapsulates the original frame in an IP packet (IP A -> IP B) and sends it across the transport infrastructure
      3. The East ED decapsulates the packet, performs a Layer 2 lookup, and forwards the original frame out its local interface toward Server 3
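The control-plane learning and forwarding logic of slides 15–16 (and the unknown-unicast behavior of slide 18) can be sketched in a few lines of Python. The class and method names are illustrative only, not Cisco's implementation:

```python
# Toy model of an OTV edge device's MAC table: local MACs point at an
# Ethernet interface, remote MACs (learned via control-plane updates,
# never via data-plane flooding) point at the remote ED's join IP.
class OtvEdgeDevice:
    def __init__(self, join_ip: str):
        self.join_ip = join_ip
        self.mac_table = {}  # (vlan, mac) -> ("local", ifname) or ("remote", ip)

    def learn_local(self, vlan: int, mac: str, ifname: str) -> None:
        """Data-plane learning for MACs inside the site."""
        self.mac_table[(vlan, mac)] = ("local", ifname)

    def receive_update(self, vlan: int, mac: str, remote_ip: str) -> None:
        """Control-plane learning: a remote ED advertised this MAC."""
        self.mac_table[(vlan, mac)] = ("remote", remote_ip)

    def forward(self, vlan: int, dst_mac: str):
        """Return the forwarding decision for a unicast frame."""
        entry = self.mac_table.get((vlan, dst_mac))
        if entry is None:
            # Unlike classic L2 VPNs, OTV does NOT flood unknown
            # unicast across the overlay (failure-domain isolation).
            return ("drop", None)
        kind, nexthop = entry
        if kind == "remote":
            return ("encap", nexthop)  # MAC-in-IP toward the remote ED
        return ("switch", nexthop)     # plain L2 switching in-site

west = OtvEdgeDevice("10.0.0.1")
west.learn_local(100, "MAC1", "Eth2")
west.receive_update(100, "MAC3", "10.0.0.2")  # advertised by the East ED
print(west.forward(100, "MAC3"))  # -> ('encap', '10.0.0.2')
print(west.forward(100, "MAC9"))  # -> ('drop', None)
```

The key design point is visible in `forward`: reachability arrives only through `receive_update`, so a lookup miss results in a drop, not a flood, which is what keeps each site's failure domain contained.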
  17. OTV Failure Domain Isolation — Spanning-Tree Site Independence
      • Site transparency: no changes to the STP topology
      • Total isolation of the STP domain: BPDUs are sent and received ONLY on internal interfaces (the BPDUs stop at the ED)
      • Default behavior: no configuration is required
  18. OTV Failure Domain Isolation — Preventing Unknown Unicast Storms
      • No requirement to forward unknown unicast frames: if the destination MAC is not in the MAC table, the frame is not flooded across the overlay
      • Assumption: end hosts are not silent or unidirectional
      • Since 6.2(4): Selective Unicast Flooding for the exceptions
      • Default behavior: no configuration is required
  19. Authoritative ED Election — Site VLAN and Site Identifier
      Fully automated multi-homing:
      • Site adjacency is established across the site VLAN; the OTV site VLAN is used to discover OTV neighbors in the same site
      • Overlay adjacency is established via the join interface across the Layer 3 network
      • A single OTV device is elected as Authoritative Edge Device (AED) on a per-VLAN basis (e.g. one ED for even VLANs, the other for odd VLANs)
      • The AED is responsible for:
        o MAC address advertisement for its VLANs
        o Forwarding its VLANs' traffic inside and outside the site
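The per-VLAN AED split can be sketched as a deterministic function over the site's ED list. This illustrates the odd/even idea on the slide only; it is not the actual NX-OS election algorithm:

```python
# Toy per-VLAN AED election: every ED in a site sees the same set of
# site-adjacent EDs (discovered over the site VLAN) and independently
# computes the same owner per VLAN, so no extra negotiation is needed.
def elect_aed(vlan: int, site_eds: list[str]) -> str:
    """Deterministically pick one edge device per VLAN."""
    return sorted(site_eds)[vlan % len(site_eds)]

eds = ["ED-A", "ED-B"]
# With two EDs this degenerates to the odd/even VLAN split shown on
# the slide: ED-A owns even VLANs, ED-B owns odd VLANs.
print(elect_aed(100, eds))  # -> 'ED-A'
print(elect_aed(101, eds))  # -> 'ED-B'
```

Because the result depends only on shared state, both EDs agree on ownership without electing a coordinator, which is what makes the multi-homing "fully automated".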
  20. OTV Fast Convergence — Optimized Local and Remote Convergence
      • AED server: centralized model where a single edge device runs the AED election for each VLAN and assigns VLANs to edge devices
      • Per-VLAN AED and backup AED are assigned and advertised to all sites
      • Fast remote convergence: on remote AED failure, OTV routes are updated to the new AED immediately
      • Fast failure detection: site-VLAN failures are detected faster with BFD, and core failures with route tracking
  21. Placement of the OTV Edge Device — OTV in the DC Aggregation
      • L2-L3 boundary at aggregation; the DC core performs only an L3 role
      • STP and L2 broadcast domains are isolated between PODs
      • Intra-DC and inter-DC LAN extension provided by OTV
        o Requires the deployment of dedicated OTV VDCs
      • Ideal for single-aggregation-block topologies
      • Recommended for greenfield deployments (Nexus 7000 required in aggregation)
      Alternative — OTV in the DC core:
      • Easy deployment for brownfield; L2-L3 boundary in the DC core
      • DC core devices perform L2, L3 and OTV functions; leverage FabricPath between PODs
  22. Placement of the OTV Edge Device — SVIs Enabled on Different Platforms
      • The default gateway (SVI) is distributed among the leafs (anycast gateway), or the firewalls host the default gateway
      • No SVIs at the aggregation layer or the DCI layer
      • No need for the OTV VDC: OTV runs on a dedicated DCI layer at the L2-L3 boundary
  23. DCI Convergence Summary — Robust HA Is the Guiding Principle
      Common failures (all converge sub-second):
      1. Core failures — multipath routing (or TE FRR)
      2. Join interface failures — link aggregates across line cards
      3. Internal interface failures — multipath topology (vPC) & LAGs
      4. ED component failures — HW/SW resiliency
      Extreme failures (unlikely): core partition, site partition, device down — handled by OTV reconvergence in 6.2, in under 5 s
  24. Additional Innovations with OTV
      • Selective unknown unicast flooding (6.2.2)
      • Join interface with loopback address (roadmap)
      • Tunnel depolarization & secondary IP (6.2.8)
      • VLAN translation (6.2.2): direct translation or transit mode
      • Dedicated broadcast group (6.2.2)
      • OTV 2.5: VXLAN encapsulation (7.2)
  25. OTV Summary — Only 5 CLI Commands
      • Extensions over any transport (IP, MPLS)
      • Failure boundary preservation; site independence
      • Optimal bandwidth utilization (IP multicast)
      • Automated built-in multi-homing; end-to-end loop prevention
      • Operational simplicity; VXLAN encapsulation
      • Improvements: selective unicast flooding, F-series internal interfaces, logical source interfaces with multiple uplinks*, dedicated distribution group for data broadcasting, tunnel depolarization, VLAN translation, improved scale & convergence
  26. Interconnecting Modern Fabrics
  27. Interconnecting VXLAN Fabrics
  28. What Is a Fabric?
      A new L3 switching model: a fat-tree architecture that maintains scalability with tens of thousands of servers.
      • Spine: evolution toward plain switching in the spine; concept of border-spine to connect outside the fabric
      • Leaf: evolution toward enabling an L3 gateway at the leaf; concept of border-leaf to connect to the outside world
      Types of fabrics:
      • MAC-in-MAC encapsulation: FabricPath
      • MAC-in-IP encapsulation: VXLAN, ACI (Application Centric Infrastructure)
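The "tens of thousands of servers" claim follows from simple leaf-spine arithmetic. A back-of-the-envelope sizing sketch (my own illustrative numbers, not from the slides):

```python
# Rough 2-tier leaf-spine (fat tree) sizing, assuming every leaf uses
# one uplink to each spine and the remaining ports for servers.
def fabric_size(spine_count: int, leaf_ports: int, leaf_count: int) -> dict:
    server_ports_per_leaf = leaf_ports - spine_count  # uplinks consumed
    return {
        "leafs": leaf_count,
        "servers": leaf_count * server_ports_per_leaf,
        # Oversubscription ratio, assuming equal port speeds everywhere.
        "oversub": server_ports_per_leaf / spine_count,
    }

# Example: 4 spines, 48-port leafs, 64 leafs (leaf count is bounded by
# the spine port count) -> 64 * 44 = 2816 server ports at 11:1 oversub.
print(fabric_size(4, 48, 64))
```

Scaling the spine width and port counts (e.g. modular spines with hundreds of ports) is how such fabrics reach the tens-of-thousands-of-servers range while keeping every server a fixed, predictable number of hops from every other.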
  29. Stretched Fabric Considerations
      One fabric stretched over metro distance (dark fiber / DWDM):
      • No clear boundary demarcation; shared multicast domain
      • Gateway localization per site:
        o Anycast gateway; E-W traffic routed locally
        o N-S egress path optimization
        o N-S ingress requires additional techniques (LISP, RHI, GSLB)
      • Hardware based:
        o One global L3-only fabric; anycast VTEP with distributed L2 or L3 gateway
        o VXLAN EVPN (ToR) or ACI (ToR)
        o VLAN translation with locally significant VLANs
  30. Network Dual-Fabric Considerations
      DCI model with a dedicated DCI device (dual-site vPC over dark fiber/DWDM at metro distance, or OTV/EVPN over an L3 WAN at any distance):
      • Failure domain isolation
      • E-W traffic localization: distributed active L3 gateways
      • N-S traffic localization: egress path optimization; ingress path optimization (LISP or IGP assist)
      • Dual homing; flow control between sites; path diversity
      • VLAN translation per site, per ToR, or per port
      • Unicast L3 WAN supported
      N-S traffic localization is a trade-off between efficiency (latency-sensitive applications) and elaboration.
  31. VXLAN Stretched Fabric
  32. VXLAN Stretched Fabric — Transit Leaf Nodes
      • VXLAN/EVPN stretched fabric using transit leaf (or spine) nodes:
        o Host reachability information is distributed end-to-end
        o The transit node can be a pure Layer-3-only platform; it does not necessarily terminate the overlay tunnel
        o The data plane is stretched end-to-end: VXLAN tunnels are established from site to site, and traffic is encapsulated and de-encapsulated at each far end
      • When to use it: across metro distances, private L3 DCI, IP multicast available end-to-end; currently up to 256 leaf nodes end-to-end
      • Why to use it: VXLAN/EVPN intra-fabric within multiple greenfield DCs
      • Cisco value: VXLAN EVPN MP-BGP control plane; IRB symmetrical routing and anycast L3 gateway; storm control, BPDU Guard, HMM route tracking; ARP suppress; bud-node support
      • Sites can run separate BGP domains (e.g. iBGP AS 100 and iBGP AS 200 interconnected via eBGP)
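The "encapsulated & de-encapsulated on each far end side" above refers to the standard VXLAN header of RFC 7348: 8 bytes carrying a flags field and a 24-bit VNI, prepended to the original Ethernet frame. A minimal sketch of just that header (the outer UDP/IP headers, destination port 4789, are omitted):

```python
import struct

def vxlan_encap(vni: int, inner_frame: bytes) -> bytes:
    """Prepend the 8-byte VXLAN header (RFC 7348) to an Ethernet frame.
    First word: flags byte 0x08 ('VNI present') + 24 reserved bits.
    Second word: 24-bit VNI + 8 reserved bits."""
    header = struct.pack("!II", 0x08 << 24, vni << 8)
    return header + inner_frame

def vxlan_decap(packet: bytes) -> tuple:
    """Strip the VXLAN header and return (vni, inner_frame)."""
    flags_word, vni_word = struct.unpack("!II", packet[:8])
    assert flags_word >> 24 == 0x08, "VNI-present flag not set"
    return vni_word >> 8, packet[8:]

vni, frame = vxlan_decap(vxlan_encap(5000, b"\x00\x11\x22\x33\x44\x55"))
print(vni)  # -> 5000
```

The 24-bit VNI (~16 million segments, versus 4096 VLAN IDs) is what gives VXLAN its multi-tenant scale, and the fixed 8-byte header plus outer UDP/IP is the overhead the transit leafs carry across the DCI.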
  33. VXLAN Stretched Fabric — Design Considerations
      Control-plane function delineation:
      • Underlay network (Layer 3): used to exchange VTEP reachability information; separate IGP areas, with area 0 on the inter-site links
      • Overlay routing control plane: separate MP-iBGP domains (one AS per site) interconnected via MP-eBGP sessions
      • Data plane & host reachability information are end-to-end: VTEP tunnels are extended inside and across sites
  34. Interconnecting ACI Fabrics
  35. ACI Multi-Fabric Design Options
      Single APIC cluster / single domain:
      • Stretched Fabric: one ACI fabric spanning two sites
      • Multi-POD (Q2CY16): one APIC cluster, PODs interconnected over an IP network with MP-BGP EVPN
      Multiple APIC clusters / multiple domains:
      • Dual-Fabric Connected: two independent ACI fabrics with L2 and L3 extension between them
      • Multi-Site (future, pre-CC): independent sites interconnected via MP-BGP EVPN
  36. Supported Distances and Interconnection Technologies — Dark Fiber
      Transceivers (cable type SMF for all):
      • QSFP-40G-LR4  — 10 km
      • QSFP-40GE-LR4 — 10 km
      • QSFP-40GLR4L  — 2 km
      • QSFP-40G-ER4  — 30 km in 1.0(4h) or earlier; 40 km in 1.1 and later
      Transit leafs in each site connect the stretched fabric between DC Site 1 and DC Site 2; the APIC cluster is distributed across the sites.
  37. Stretched Fabric — Future Enhancement
      • QSA adapters and 10G inter-site DWDM links
      • Only supported on newer platforms (-EX)
      • Longer distance: up to 10 ms RTT over metro fiber
  38. Supported Distances and Interconnection Technologies — EoMPLS for Speed Adaptation
      • Port-mode EoMPLS pseudowires used to stretch the ACI fabric over long distance
        o DC interconnect links can be 10G (minimum) or higher (100GE), with 40G facing the leafs/spines
        o DWDM or dark fiber provides connectivity between the two sites (validated), or an MPLS core network with QoS for the control plane (not validated)
      • Tested and validated up to 10 ms RTT (public document)
      • Other ports on the router can be used for connecting to the WAN via L3Out
      • MACsec supported on the DCI (DWDM between two ASR 9000s)
  39. ACI Multi-POD Solution (Q2CY16)
  40. ACI Multi-POD Solution — Details
      • One APIC cluster manages all PODs; same namespace (VTEP addresses, VNIDs, class IDs, GIPo, etc.)
      • Host reachability is propagated via BGP EVPN and not exposed to the transit IP network
      • The transit network is IP-based, with multicast required in the inter-POD network
      • Ability to support advanced forwarding (direct ARP forwarding, no unknown unicast flooding)
      • Support for multiple PODs (future roadmap)
  41. ACI Multi-POD Solution — WAN Connectivity
      • Each POD can have a dedicated connection to the WAN
      • Traditional L3Out configuration, shared between tenants or dedicated per tenant (VRF-Lite)
      • VTEPs always select the local connection based on IS-IS metric
      • Requires an inbound path optimization solution for achieving traffic symmetry
  42. ACI Multi-POD Solution — Topologies
      • Two DC sites connected back-to-back: dark fiber/DWDM, up to 10 ms RTT, 40G/100G fabric links, speed-agnostic 10G/40G/100G inter-site links
      • Intra-DC: multiple PODs (POD 1 ... POD n) under one APIC cluster at 40G/100G
      • Three DC sites: dark fiber/DWDM, up to 10 ms RTT
      • Multiple sites interconnected by a generic L3 network; target: up to 20 sites
  43. Key Takeaways
  44. Interconnecting Multiple Fabrics — Multiple Scenarios, Multiple Options
      • Multi-site models: classical DC to classical DC, VXLAN fabric to VXLAN fabric, ACI to ACI, any to any
      • DCI models (2 sites): (campus/metro) native Ethernet using vPC, dual sites back-to-back (fiber/DWDM)
      • DCI models (2 or more sites): (metro/geo) MPLS-based using VPLS or PBB-EVPN; (metro/geo) IP-based using OTV or VXLAN
      • Stretched fabric (VXLAN, ACI); multi-fabric (VXLAN, ACI)
      • Localized E-W traffic: FHRP isolation, anycast HSRP, anycast L3 gateway
      • Localized N-S traffic: LISP Mobility, IGP assist, host-route injection