
OpenNebula - Mellanox Considerations for Smart Cloud


Considerations on Smart Cloud Implementations



  1. Considerations on Smart Cloud Implementations | OpenNebula TechDay, Frankfurt, Sept. 26, 2018 | Arne Heitmann, Staff System Engineer, EMEA (© 2018 Mellanox Technologies)
  2. Agenda
     • Introduction to Mellanox
     • Faster storage and networks
     • Linbit architecture with RDMA/RoCE
     • RDMA/RoCE
     • ASAP2 (OVS offloading)
     • Overlay networking
     • EVPN/VXLAN routing
     • Conclusion
  3. Mellanox Overview
     • Ticker: MLNX
     • Worldwide offices
     • Company headquarters: Yokneam, Israel and Sunnyvale, California
     • Employees worldwide: ~2,900
  4. Comprehensive End-to-End Ethernet Product Portfolio: NICs, cables, optics, switches, management software, automation & monitoring
  5. Unique Engine in Mellanox Ethernet Switches
     • Mellanox switches are powered by Mellanox's own ASIC
     • Wire-speed, cut-through switching at any packet size
     • Zero jitter
     • Low power
     • 10GbE to 100GbE
     • Passive copper (DAC) cabling for 10/25/40/50/100GbE; active copper by comparison is more expensive, less reliable and suffers from interoperability issues
     • Active fiber for 10/25/40/50/100GbE
  6. New Storage Media Require Faster Networks
     • The transition to faster storage media requires faster networks
     • Flash SSDs move the bottleneck from the storage to the network
     • What does it take to saturate one 10 Gb/s link? (a rough calculation follows below)
       - 24 x HDDs
       - 2 x SATA SSDs
       - 1 x SAS SSD
       - NVMe…
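To put rough numbers behind that bullet list, here is a quick back-of-the-envelope calculation; the per-device throughput figures are illustrative assumptions, not vendor specifications:

    # Rough estimate of how many devices it takes to saturate a network link.
    # Per-device throughput figures below are illustrative assumptions only.
    LINK_GBPS = 10                      # 10 Gb/s Ethernet link
    link_mbytes = LINK_GBPS * 1000 / 8  # ~1250 MB/s, ignoring protocol overhead

    devices = {
        "HDD (7.2k rpm)": 50,   # ~50 MB/s sustained, assumed
        "SATA SSD": 550,        # ~550 MB/s, assumed
        "SAS SSD": 1100,        # ~1.1 GB/s, assumed
        "NVMe SSD": 3000,       # ~3 GB/s, assumed
    }

    for name, mb_per_s in devices.items():
        count = link_mbytes / mb_per_s
        print(f"{name}: ~{count:.1f} device(s) saturate a {LINK_GBPS} Gb/s link")

With these assumptions the script reproduces the slide's point: roughly two dozen HDDs, a couple of SATA SSDs, or a single SAS/NVMe device is enough to fill a 10 Gb/s link.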
  7. DRBD and RDMA – Architectural Advantage (architecture diagram)
  8. Excursion: RoCE – RDMA over Converged Ethernet
     • Remote Direct Memory Access (RDMA) is a technology that enables data to be read from and written to a remote server without involving either server's CPU, which results in:
       - Reduced latency
       - Increased throughput
       - CPU cycles freed up for other tasks
     • Twice the bandwidth at less than half the CPU utilization
     • RoCE needs a lossless network!
     A quick host-side check is sketched below.
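Before relying on RoCE, it is useful to confirm that the host actually exposes RDMA devices. The sketch below assumes the rdma-core utilities (ibv_devinfo) and the iproute2 rdma tool are installed:

    # Minimal host-side check that RDMA-capable devices are present before
    # relying on RoCE. Assumes rdma-core and iproute2's "rdma" tool are installed.
    import subprocess

    def rdma_present() -> bool:
        """Return True if at least one RDMA link is visible to the host."""
        out = subprocess.run(["rdma", "link", "show"],
                             capture_output=True, text=True, check=False)
        return out.returncode == 0 and bool(out.stdout.strip())

    if rdma_present():
        # Print per-device details (ports, link layer, state).
        subprocess.run(["ibv_devinfo"], check=False)
    else:
        print("No RDMA devices found; the RoCE transport will not be available.")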
  9. RoCE Done Right!
     (charts of pause time in microseconds over time: on other switches the application is blocked by the switch; on the Mellanox switch the application runs smoothly)
  10. Best Congestion Management for RoCE
     • Configuration
       - 4 hosts connected to 1 switch in a star topology
       - ECN enabled, PFC enabled
       - 3 sources to 1 common destination
     • Results
       - The other ASIC sends pauses to the hosts (up to 21% pause time); no pauses are sent by Spectrum
     (charts: pause time over time, Spectrum switch vs. other ASIC-based switch)
  11. Better RoCE with Fast Congestion Notification
     • Fast congestion notification: packets are marked as they leave the queue
       - Up to 10 ms faster alerts, so servers react faster
       - Reduces average queue depth, which lowers real-world latency
       - Improves application performance
     • Legacy congestion notification: packets are marked as they enter the queue, so the notification is delayed until the queue empties; the delay inside the switch is equivalent to placing the server hundreds of miles away
     (diagram: two 10 Gigabit Ethernet switches, one marking packets as they enter the queue, one marking packets as they exit the queue)
     A rough calculation of the difference follows below.
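A rough calculation (with assumed queue depth and link speed) illustrates why marking at dequeue gets the congestion signal to the sender sooner: a mark applied at enqueue only leaves the switch after the packet has drained through the queue.

    # Illustrative only: how long an ECN mark sits inside the switch before the
    # sender can see it, for enqueue-marking vs. dequeue-marking.
    # Queue depth and link speed are assumptions for the example.
    QUEUE_DEPTH_BYTES = 4 * 1024 * 1024   # 4 MB of buffered traffic ahead of the packet
    LINK_BPS = 10e9                       # 10 Gb/s egress link

    drain_seconds = QUEUE_DEPTH_BYTES * 8 / LINK_BPS

    print(f"Mark on enqueue: signal held ~{drain_seconds * 1000:.1f} ms inside the switch")
    print("Mark on dequeue: signal leaves with the very next transmitted packet")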
  12. Predictable QoS with RoCE
     • Configuration
       - 7 hosts connected to 1 switch in a star topology
       - ECN enabled, PFC enabled
       - 6 sources to 1 common destination
     • Results
       - Unequal bandwidth distribution on the other ASIC: one source host gets 50% of the total bandwidth, the 5 others get only 10% each
       - On the Spectrum switch, each host gets 16.66% of the total bandwidth
     (charts: KBytes/second over time, other ASIC-based switch vs. Spectrum switch)
  13. RoCE Runs Best in Lossless Networks
     • Configuration can be complex
     • 2 modes:
       - Enhanced mode for experts
       - User mode for easy configuration, covering 98% of implementations
  14. NEO Simplifies RoCE Provisioning
     • Automated setup of RoCE across the entire fabric
       - Mellanox switches
       - Mellanox NICs
     • Ideal for end-to-end Mellanox deployments
     • No manual configuration needed
  15. Single Root I/O Virtualization (SR-IOV)
     • The PCIe device presents multiple instances to the OS/hypervisor
     • Enables direct access from the application
     • Bare-metal performance for the VM
     • Reduces CPU overhead
     • Enables many advanced NIC features (e.g. DPDK, RDMA, ASAP2)
     (diagram: para-virtualized path through NIC, hypervisor vSwitch and VMs vs. SR-IOV path through the NIC's eSwitch with a Physical Function (PF) and Virtual Functions (VFs))
     A sketch of creating VFs on a Linux host follows below.
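On a Linux host, VFs are typically created through the standard sysfs interface. The sketch below is illustrative; the interface name and VF count are placeholders, and SR-IOV must already be enabled in the BIOS and NIC firmware:

    # Minimal sketch: create SR-IOV Virtual Functions via the standard Linux
    # sysfs interface. The PF interface name (ens1f0) and VF count are assumptions.
    from pathlib import Path

    IFACE = "ens1f0"   # hypothetical Physical Function interface name
    NUM_VFS = 4

    sysfs = Path(f"/sys/class/net/{IFACE}/device/sriov_numvfs")

    # Writing 0 first is a common precaution, since the value can only be
    # changed while no VFs are currently allocated.
    sysfs.write_text("0")
    sysfs.write_text(str(NUM_VFS))

    print(f"Requested {NUM_VFS} VFs on {IFACE}; verify with 'ip link show {IFACE}'")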
  16. Introduction to OVS (Open vSwitch)
     • Software component that typically provides switching between virtual machines
     • Target application: multi-server virtualization deployments
     • OVS is an open project; code and materials at http://openvswitch.org/
     • OVS main functionality:
       - Bridge traffic between virtual machines (VMs) on the same host
       - Bridge traffic between VMs and the outside world
       - Migration of VMs with all of their associated configuration: L2 learning table, L3 forwarding state, ACLs, QoS, policy and more
       - OpenFlow support
       - VM tagging and manipulation
       - Flow-based switching
       - Support for tunneling: VXLAN, GRE and more
     • OVS works on KVM, XenServer, OpenStack
     A minimal usage sketch follows below.
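As a minimal illustration of day-to-day OVS usage, the sketch below creates a bridge, attaches a VM-facing port, and adds a VXLAN tunnel to a remote hypervisor. Bridge and port names, the remote IP, and the VNI are placeholders:

    # Sketch of basic OVS provisioning with ovs-vsctl, driven from Python.
    import subprocess

    def sh(*cmd):
        subprocess.run(cmd, check=True)

    sh("ovs-vsctl", "add-br", "br-int")
    sh("ovs-vsctl", "add-port", "br-int", "vnet0")      # tap/vnet port of a VM (assumed to exist)
    sh("ovs-vsctl", "add-port", "br-int", "vxlan0",
       "--", "set", "interface", "vxlan0", "type=vxlan",
       "options:remote_ip=192.168.100.2",               # placeholder remote VTEP
       "options:key=100")                               # placeholder VNI
    sh("ovs-vsctl", "show")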
  17. Open vSwitch (OVS) Challenges
     • Virtual switches such as Open vSwitch (OVS) are used as the forwarding plane in the hypervisor
     • Virtual switches implement extensive support for SDN (e.g. policy enforcement) and are widely used by the industry
     • Support L2-L3 networking features: L2 & L3 forwarding, NAT, ACLs, connection tracking, etc.
     • Flow based
     • OVS challenges:
       - Poor packet performance: <1M PPS with 2-4 cores
       - Heavy CPU consumption: even with 12 cores, OVS cannot reach one third of 100G NIC line rate
       - Bad user experience: high and unpredictable latency, packet drops
     • Solution: offload the OVS data plane into the Mellanox NIC using ASAP2 technology
  18. ASAP2 SR-IOV – Example
     • Enable the SR-IOV data path with the OVS control plane
     • Use Open vSwitch as the management interface and offload the OVS data plane to the Mellanox embedded switch (eSwitch) using ASAP2 Direct
     A configuration sketch follows below.
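The usual enabling steps look roughly like the sketch below: put the ConnectX eSwitch into switchdev mode and turn on OVS hardware offload. The PCI address is a placeholder and the exact procedure depends on the MLNX_OFED/ASAP2 release, so treat this as a sketch rather than a reference:

    # Sketch of the typical steps to let OVS offload flows to the eSwitch.
    import subprocess

    PF_PCI = "pci/0000:03:00.0"   # hypothetical PCI address of the Physical Function

    def sh(*cmd):
        subprocess.run(cmd, check=True)

    # 1. Switch the NIC's embedded switch from legacy SR-IOV mode to switchdev,
    #    which exposes VF representor netdevs that OVS can attach to.
    sh("devlink", "dev", "eswitch", "set", PF_PCI, "mode", "switchdev")

    # 2. Tell Open vSwitch to push matching flows down into the hardware.
    sh("ovs-vsctl", "set", "Open_vSwitch", ".", "other_config:hw-offload=true")

    # 3. Restart OVS so the new setting takes effect (service name varies by distro).
    sh("systemctl", "restart", "openvswitch")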
  19. Virtualized Datapath Options Today
     • Legacy vSwitch (kernel datapath): default for OpenStack; switching, bonding, overlay, live migration
     • Accelerated vSwitch (Open vSwitch over DPDK): direct I/O to NIC or vNIC; switching, bonding, overlay
     • SR-IOV (Single-Root I/O Virtualization): hardware dependent; NIC line rate with no CPU overhead; ToR for switching
     (diagram: kernel-space vs. user-space datapaths, with VF0/VF1 virtual functions)
  20. OVS over DPDK vs. OVS Offload – ConnectX-5
     • ConnectX-5 provides a significant performance boost without adding CPU resources: 66 MPPS with 0 dedicated cores (OVS offload) vs. 7.6 MPPS with 4 dedicated cores (OVS over DPDK)
     • Test results (message rate and dedicated hypervisor cores):

       Test              ASAP2 Direct   OVS DPDK          Benefit
       1 flow, VXLAN     66M PPS        7.6M PPS (VLAN)   8.6x
       60K flows, VXLAN  19.8M PPS      1.9M PPS          10.4x
  21. ASAP2 – Facts (current)
     • Offloads:
       - Match on a 12-tuple and forward to VM/network, or drop
       - Ethernet Layer 2
       - IP (v4/v6)
       - TCP/UDP
     • Actions:
       - Forwarding
       - Drop/allow
       - VXLAN encap/decap
       - VLAN push/pop
       - ConnectX-4 Lx: per port; ConnectX-5: per flow
       - Header rewrite (ConnectX-5): up to and including Layer 4
     • VF mirroring: mirroring traffic from one VF to another in the same eSwitch
     • VF LAG with LACP: active/backup and bonding (50 Gb/s from 2 ports of 25GbE)
     • Supported OS (as of today): RHEL 7.4, RHEL 7.5, CentOS 7.2
     • Packages required: MLNX_OFED 4.4, iproute2 4.12 and up, Open vSwitch 2.8 and up (for CentOS 7.2, from Mellanox.com)
  22. Overlay Networks
     • Traditional VLAN-based networks:
       - Layer 2 segmentation
       - VLAN scalability is 4K
       - No support for VM mobility
     • Overlay networks with VXLAN:
       - Layer 2 over Layer 3 segmentation
       - VXLAN scalability is 16M
       - Support for VM mobility
       - Multi-tenant isolation
     • Overlay networks run as independent virtual networks on top of a physical network infrastructure
  23. VXLAN Overview
     • VXLAN (Virtual eXtensible Local Area Network):
       - A standard overlay protocol that enables multiple Layer 2 logical networks over a single Layer 3 underlay network
       - Each virtual network is a VXLAN logical Layer 2 segment
       - Encapsulates MAC-based Layer 2 Ethernet frames in Layer 3 UDP/IP packets
       - Uses a 24-bit VXLAN network identifier (VNI) in the VXLAN header, hence scales to 16 million Layer 2 segments
     A sketch of a Linux software VTEP follows below.
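For reference, a software VTEP can be stood up on any Linux host with iproute2 alone; the sketch below uses placeholder addresses, device names and VNI:

    # Sketch: a VXLAN device with VNI 100, bridged so local workloads can attach.
    import subprocess

    def sh(*cmd):
        subprocess.run(cmd, check=True)

    sh("ip", "link", "add", "vxlan100", "type", "vxlan",
       "id", "100",            # 24-bit VNI
       "local", "10.0.0.1",    # underlay source address of this VTEP (placeholder)
       "dstport", "4789",      # IANA-assigned VXLAN UDP port
       "dev", "eth0")          # underlay uplink (placeholder)
    sh("ip", "link", "add", "br100", "type", "bridge")
    sh("ip", "link", "set", "vxlan100", "master", "br100")
    sh("ip", "link", "set", "vxlan100", "up")
    sh("ip", "link", "set", "br100", "up")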
  24. VTEP – VXLAN Tunnel End Point
     • VTEP on the host:
       - VXLAN agents run in the host's hypervisor
       - Encapsulation/de-encapsulation in software leads to degraded performance
       - Mellanox network adapters support VXLAN offloads, so encapsulation/de-encapsulation can be offloaded to the NIC (see the check below)
     • VTEP on the ToR:
       - VXLAN agents run on the ToR switches
       - Encapsulation/de-encapsulation in switch hardware gives efficient performance
       - Cumulus Linux running on Mellanox switches supports the VTEP on the switch
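Whether a NIC advertises VXLAN (UDP tunnel) offloads can be checked with ethtool; a small sketch, with the interface name as a placeholder:

    # Sketch: list the UDP-tunnel offload features the NIC reports via ethtool.
    import subprocess

    IFACE = "eth0"   # placeholder interface name
    out = subprocess.run(["ethtool", "-k", IFACE],
                         capture_output=True, text=True, check=True)
    for line in out.stdout.splitlines():
        if "udp_tnl" in line:     # e.g. tx-udp_tnl-segmentation: on
            print(line.strip())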
  25. VTEP on the Host – Accelerating Overlay Networks
     • Virtual overlay networks simplify management and VM migration
     • Overlay accelerators in the NIC enable bare-metal performance
  26. VTEP on the ToR
     • VTEP on the ToR enables scalability and flexibility
     • Multi-tenancy / integration of legacy services
  27. Why BGP-EVPN + VXLAN?
     • BGP-EVPN is an open, controller-less solution
     • Controllers are centralized and limit the scale of the solution
     • Controllers lock customers into certain technologies and increase costs
     • BGP-EVPN with VXLAN is a better alternative to legacy VPLS/VLL
  28. VXLAN Bridging with EVPN
     • Ethernet Virtual Private Network (EVPN):
       - Often used to implement controller-less VXLAN
       - Standards-based control plane for VXLAN, defined in RFC 7432
       - Relies on multi-protocol BGP (MP-BGP) for information exchange
     • Key features include:
       - VNI membership exchange between VTEPs
       - Exchange of host MAC and IP addresses
       - Support for host/VM mobility (MAC and IP moves)
       - Support for inter-VXLAN routing
       - Support for Layer 3 multi-tenancy with VRFs
       - Support for dual-attached hosts via VXLAN active-active mode
     A minimal configuration sketch follows below.
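As an illustration of how little configuration a controller-less EVPN control plane needs, the sketch below drives FRR through vtysh; the ASN, router-id and neighbor interface are placeholders, and the commands only change the running configuration:

    # Sketch: minimal BGP EVPN setup on an FRR-based VTEP, applied via vtysh.
    import subprocess

    frr_commands = [
        "configure terminal",
        "router bgp 65001",                            # placeholder ASN
        "bgp router-id 10.0.0.1",                      # placeholder router-id
        "neighbor swp51 interface remote-as external", # unnumbered BGP to the spine
        "address-family l2vpn evpn",
        "neighbor swp51 activate",
        "advertise-all-vni",                           # announce all locally defined VNIs
        "exit-address-family",
    ]

    # vtysh keeps CLI context across multiple -c arguments in one invocation.
    subprocess.run(["vtysh"] + [arg for c in frr_commands for arg in ("-c", c)],
                   check=True)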
  29. VXLAN Routing Modes
     • EVPN supports three VXLAN routing modes:
     • Centralized routing:
       - Specific VTEPs act as designated Layer 3 gateways and route between subnets
       - The other VTEPs act only as bridges
     • Distributed asymmetric routing:
       - Every VTEP participates in routing
       - For a given flow, only the ingress VTEP routes; the egress VTEP acts only as a bridge
     • Distributed symmetric routing:
       - Every VTEP participates in routing
       - Both the ingress VTEP and the egress VTEP participate in routing
  30. Distributed VXLAN Routing
     • Distributed asymmetric routing:
       - Each VTEP acts as a Layer 3 gateway, performing routing for its attached hosts
       - Only the ingress VTEP performs routing; the egress VTEP only performs bridging
       - Advantages: easy to deploy, no additional special VNIs, fewer routing hops between VXLANs
       - Disadvantage: each VTEP must be provisioned with all VLANs/VNIs
     • Distributed symmetric routing:
       - Each VTEP acts as a Layer 3 gateway, performing routing for its attached hosts
       - Both the ingress VTEP and the egress VTEP route the packets
       - A new special transit VNI, called the L3VNI, is used for all routed VXLAN traffic
       - Advantage: each VTEP only needs its local VLANs, local VNIs, and the L3VNI with its associated VLAN
       - Disadvantages: more complex configuration, and an extra routing hop that might add latency
     The L3VNI mapping is sketched below.
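The extra piece that symmetric routing needs is the VRF-to-L3VNI mapping; a hedged FRR sketch (the VRF name and VNI are placeholders, and the corresponding VRF and VXLAN interfaces must already exist on the Linux side):

    # Sketch: map a tenant VRF to its transit L3VNI in FRR via vtysh.
    import subprocess

    cmds = [
        "configure terminal",
        "vrf tenant1",        # placeholder tenant VRF
        "vni 104001",         # L3VNI used for all routed traffic of this tenant
        "exit-vrf",
    ]

    subprocess.run(["vtysh"] + [arg for c in cmds for arg in ("-c", c)], check=True)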
  31. Conclusion
     • Cloud infrastructures with virtualized topologies, storage and compute:
       - Provide flexibility at any scale
       - Require intelligent use of protocol and feature tool sets
     • Fast, distributed storage requires higher bandwidth and efficient CPU management; RoCE done right accelerates storage performance
     • VMs require internal and external communication over a virtually switched infrastructure; ASAP2 helps take the load off the OVS, and thus the CPU, while optimizing the communication path
     • Highly virtualized environments need to extend L2 segmentation beyond VLAN limits and across L3 boundaries
       - Overlay networking with VXLAN adds scale
       - VXLAN with EVPN adds flexibility and manageability
       - VXLAN routing adds agility
  32. Thank You
  33. For Further Reading (Addendum)
     • RoCE/RDMA:
       - Mellanox RoCE Homepage
       - Boosting Performance With RDMA – A Case Study
       - What is RDMA?
       - RDMA/RoCE Solutions
       - Recommended Network Configuration Examples for RoCE Deployment
     • ASAP2:
       - Mellanox ASAP2 Homepage
       - Getting started with Mellanox ASAP^2
       - The Ideal Network for Containers and NFV Microservices
     • Overlay Networking/VXLAN/EVPN:
       - EVPN with Mellanox Switches
       - Top 3 considerations for picking your BGP EVPN VXLAN infrastructure
       - VXLAN is finally simple, use EVPN and set up VXLAN in 3 steps
     • Mellanox Ethernet Solutions:
       - Mellanox Open Ethernet Switches
       - Mellanox Open Ethernet Switches Product Brief
