Christopher Lim, Sr. Engineer
July 2016
Mellanox Efficient Virtual Network for Service Providers
© 2016 Mellanox Technologies 2- Mellanox Confidential -
Leading Supplier of End-to-End Interconnect Solutions
Store and Analyze – Enabling the Use of Data
Comprehensive End-to-End InfiniBand and Ethernet Portfolio (VPI): ICs, Adapter Cards, Switches/Gateways, Metro/WAN, Cables/Modules, Software, NPU & Multicore (NPS, TILE)
© 2016 Mellanox Technologies 3- Mellanox Confidential -
Cloud-Native NFV Architecture Dictates Efficient Virtual Network
Mellanox EVN: Foundation for Efficient Telco Cloud Infrastructure
Efficient Virtual Network: Enabling High-Performance, Reliable and Scalable Infrastructure for Cloud Service Delivery
Three pillars: Virtualization, Acceleration, Automation
• Compute: Higher Workload Density
• Network: Line-Rate Packet Processing
• Storage: Higher IOPS, Lower Latency
© 2016 Mellanox Technologies 4- Mellanox Confidential -
SR-IOV – Overcome Compute Virtualization Penalty
Diagram: an SR-IOV-capable NIC exposes a Physical Function (PF) and multiple Virtual Functions (VFs) over the PCIe bus, interconnected by the NIC embedded switch. VMs with VF drivers get direct application I/O access, bypassing the hypervisor virtual switch; other VMs attach to virtual NICs behind the hypervisor's software switch.
• VMs leveraging SR-IOV and the Mellanox eSwitch achieve near-line-rate, bare-metal I/O performance without CPU overhead
• Software-switched VMs suffer from the compute virtualization penalty
A minimal host-side configuration sketch follows below.
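As an illustrative sketch only (not from the slides): on a Linux host, SR-IOV VFs are typically created through the PCI device's sysfs attribute. The interface name enp3s0f0 and the VF count below are assumptions.

```c
/* Minimal sketch: create SR-IOV Virtual Functions by writing to sysfs.
 * Assumptions: a Linux host, an SR-IOV-capable NIC whose netdev is named
 * enp3s0f0 (hypothetical), driver already loaded, running as root. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *path = "/sys/class/net/enp3s0f0/device/sriov_numvfs";
    FILE *f = fopen(path, "w");

    if (!f) {
        perror("open sriov_numvfs");
        return EXIT_FAILURE;
    }
    /* Request 4 VFs; each VF then shows up as its own PCIe function that
     * can be passed through to a VM and driven by the guest VF driver. */
    if (fprintf(f, "4\n") < 0 || fclose(f) != 0) {
        perror("write sriov_numvfs");
        return EXIT_FAILURE;
    }
    printf("Requested 4 VFs on enp3s0f0\n");
    return EXIT_SUCCESS;
}
```

Each VF can then be handed to a VM with PCI passthrough, where the guest loads the VF driver shown in the diagram; writing 0 to the same attribute removes the VFs.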
© 2016 Mellanox Technologies 5- Mellanox Confidential -
SR-IOV + DPDK: Better Together with Mellanox PMD
Diagram: each VM runs the DPDK library with the Mellanox poll-mode driver (PMD) on top of its own SR-IOV Virtual Function; the NIC embedded switch connects the VFs across the PCIe bus.
• Further accelerates packet processing by eliminating interrupts and context switches (see the receive-loop sketch below)
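A minimal sketch of such a poll-mode receive/forward loop, using the standard DPDK ethdev API; the port number, ring sizes, and pool sizes are assumptions, and error handling is kept to a minimum.

```c
/* Minimal sketch of a DPDK poll-mode receive/forward loop on one port
 * (for example an SR-IOV VF bound to the Mellanox PMD). Port number,
 * ring and pool sizes are illustrative. Uses the standard ethdev API. */
#include <stdint.h>
#include <stdlib.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_debug.h>

#define RX_RING_SIZE 1024
#define TX_RING_SIZE 1024
#define NUM_MBUFS    8191
#define MBUF_CACHE   250
#define BURST_SIZE   32

int main(int argc, char *argv[])
{
    uint16_t port = 0;                       /* assume the VF is port 0 */
    struct rte_eth_conf port_conf = { 0 };   /* default port configuration */

    if (rte_eal_init(argc, argv) < 0)
        rte_exit(EXIT_FAILURE, "EAL init failed\n");

    struct rte_mempool *pool = rte_pktmbuf_pool_create("MBUF_POOL", NUM_MBUFS,
            MBUF_CACHE, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    if (pool == NULL)
        rte_exit(EXIT_FAILURE, "mbuf pool creation failed\n");

    /* One RX and one TX queue on the port. */
    if (rte_eth_dev_configure(port, 1, 1, &port_conf) < 0 ||
        rte_eth_rx_queue_setup(port, 0, RX_RING_SIZE,
                               rte_eth_dev_socket_id(port), NULL, pool) < 0 ||
        rte_eth_tx_queue_setup(port, 0, TX_RING_SIZE,
                               rte_eth_dev_socket_id(port), NULL) < 0 ||
        rte_eth_dev_start(port) < 0)
        rte_exit(EXIT_FAILURE, "port setup failed\n");

    /* Busy-poll loop: no interrupts, no context switches. The core spins on
     * the NIC RX ring and echoes each burst back out of the same port. */
    for (;;) {
        struct rte_mbuf *bufs[BURST_SIZE];
        uint16_t nb_rx = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
        if (nb_rx == 0)
            continue;
        uint16_t nb_tx = rte_eth_tx_burst(port, 0, bufs, nb_rx);
        while (nb_tx < nb_rx)                /* free anything not transmitted */
            rte_pktmbuf_free(bufs[nb_tx++]);
    }
    return 0;
}
```

The spinning rte_eth_rx_burst/rte_eth_tx_burst loop is what removes interrupts and context switches from the data path.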
© 2016 Mellanox Technologies 6- Mellanox Confidential -
Mellanox Sets New DPDK Performance Records
Chart – Superior DPDK Packet Performance at Various Frame Sizes (ConnectX-4 Lx 40G), in millions of frames per second:
• 64B: 42.11 • 128B: 30.58 • 256B: 17.96 • 512B: 9.36 • 1024B: 4.78 • 1280B: 3.83 • 1518B: 3.24
Test setup:
• ConnectX-4 Lx 40GbE, single port
• 4 cores dedicated to DPDK
Product / Single-port TCP Throughput / DPDK 64B Packet Throughput:
• ConnectX-4 100G: 93.4 Gb/s, 74.4 million p/s
• ConnectX-4 Lx 40G: 37.6 Gb/s, 42.1 million p/s
• ConnectX-4 Lx 25G: 23.5 Gb/s, 34 million p/s
• ConnectX-4 40G: 37.6 Gb/s, 56.4 million p/s
© 2016 Mellanox Technologies 7- Mellanox Confidential -
Turbocharge Overlay Networks with ConnectX-3/4 NICs
 Solution:
• Overlay network accelerators in the NIC (the VXLAN header they parse is sketched below)
• Penalty-free overlays at bare-metal speed
• Integrated and validated by major SDN vendors
 Benefits:
• 37.5 Gb/s on a 40G link, >2X compared to running without VXLAN offload
• On a 20-core system, 7 cores are freed to run additional VMs, saving 35% of total cores while doubling the throughput
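For context only (this detail is not on the slide): VXLAN, per RFC 7348, wraps the tenant frame in an outer UDP packet (destination port 4789) with an 8-byte header carrying a 24-bit VNI. Offload-capable NICs parse this header so they can checksum, steer and segment the inner frame. A minimal sketch of that header:

```c
/* Minimal sketch of the VXLAN header (RFC 7348) that overlay-offload-capable
 * NICs must parse in order to operate on the encapsulated inner frame. */
#include <stdint.h>
#include <stdio.h>
#include <arpa/inet.h>

#define VXLAN_UDP_PORT        4789   /* IANA-assigned VXLAN destination port */
#define VXLAN_FLAG_VNI_VALID  0x08   /* "I" flag: VNI field is valid */

struct vxlan_hdr {
    uint8_t  flags;        /* 0x08 when a valid VNI is present */
    uint8_t  reserved1[3];
    uint32_t vni_reserved; /* upper 24 bits: VNI, lower 8 bits: reserved */
};

/* Fill a VXLAN header for a given 24-bit Virtual Network Identifier. */
static void vxlan_hdr_init(struct vxlan_hdr *h, uint32_t vni)
{
    h->flags = VXLAN_FLAG_VNI_VALID;
    h->reserved1[0] = h->reserved1[1] = h->reserved1[2] = 0;
    h->vni_reserved = htonl((vni & 0xFFFFFFu) << 8);
}

int main(void)
{
    struct vxlan_hdr h;
    vxlan_hdr_init(&h, 5001);   /* example VNI */
    printf("VXLAN header: flags=0x%02x vni=%u, UDP dst port %d\n",
           h.flags, (unsigned)(ntohl(h.vni_reserved) >> 8), VXLAN_UDP_PORT);
    return 0;
}
```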
© 2016 Mellanox Technologies 8- Mellanox Confidential -
Cumulus Overlay Solution
VMware NSX
PLUMgrid ONS
Nuage VSP
Midokura Midonet
Juniper OpenContrail
Akanda Astara
Cumulus LNV
 Switch VXLAN tunnel endpoint (VTEP) is used
• To connect bare-metal servers to the VXLAN network
• To connect VXLAN and legacy networks
 Cumulus is integrated with every major overlay solution
 Available with Mellanox switches since April 2016
© 2016 Mellanox Technologies 9- Mellanox Confidential -
Accelerated Switching And Packet Processing (ASAP2)
 Best of both worlds: a hardware-accelerated data plane with an SDN/virtual switch control plane
 Multiple options for the accelerated data plane, including DPDK on the CPU, the embedded switch, an FPGA, a network processor or multi-core processor on the server adapter, the ToR switch, or a centralized acceleration pool
 A standard hardware API lets the control plane and data plane operate and innovate independently
Roadmap diagram: the virtual switch control plane drives a hardware-accelerated data plane through a standard hardware abstraction interface (ASAP2).
© 2016 Mellanox Technologies 10- Mellanox Confidential -
ASAP2 Phase 1: ASAP2 Direct
• OVS control plane, optionally combined with an SDN controller
• Direct application I/O access through SR-IOV
• Accelerated forwarding and classification through the embedded switch (eSwitch) on the Mellanox NIC
Diagram: some VMs attach through para-virtualized tap devices, others connect via SR-IOV directly to the VM; the embedded switch forwards traffic for both.
© 2016 Mellanox Technologies 11- Mellanox Confidential -
OVS Architecture and Operations
Diagram: ovs-vswitchd runs in user space and the OVS kernel module in the kernel; the first packet of a flow goes up to user space, subsequent packets stay in the kernel.
 Forwarding
• Flow-based forwarding
• The first packet of a new flow (match miss) is directed to user space (ovs-vswitchd)
• ovs-vswitchd determines flow handling and programs the kernel (fast path)
• Following packets hit the kernel flow entries and are forwarded in the fast path (a simplified model of this split appears in the sketch below)
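The following toy model (not OVS source code; the flow key, table size, and port-selection rule are invented for illustration) shows the miss-then-cache behavior described above:

```c
/* Conceptual model (not OVS code) of OVS's two-tier forwarding: the kernel
 * fast path keeps a flow cache; a miss punts the packet to the user-space
 * daemon, which resolves the flow and installs a cache entry so subsequent
 * packets of the same flow stay in the fast path. */
#include <stdio.h>
#include <string.h>

#define MAX_FLOWS 64

struct flow_key   { char src[16]; char dst[16]; };
struct flow_entry { struct flow_key key; int out_port; };

static struct flow_entry kernel_cache[MAX_FLOWS];
static int n_flows;

/* Fast path: look up the flow cache (models the OVS kernel module). */
static int fast_path_lookup(const struct flow_key *k)
{
    for (int i = 0; i < n_flows; i++)
        if (!memcmp(&kernel_cache[i].key, k, sizeof *k))
            return kernel_cache[i].out_port;
    return -1;  /* miss */
}

/* Slow path: models ovs-vswitchd resolving the flow (e.g. via OpenFlow
 * tables or an SDN controller) and programming the kernel cache. */
static int slow_path_resolve(const struct flow_key *k)
{
    int port = (k->dst[strlen(k->dst) - 1] % 4) + 1;  /* arbitrary decision */
    kernel_cache[n_flows].key = *k;
    kernel_cache[n_flows].out_port = port;
    n_flows++;
    return port;
}

static void forward(const struct flow_key *k)
{
    int port = fast_path_lookup(k);
    if (port < 0) {                      /* first packet of the flow */
        port = slow_path_resolve(k);
        printf("%s -> %s: miss, resolved in user space, out port %d\n",
               k->src, k->dst, port);
    } else {                             /* subsequent packets */
        printf("%s -> %s: fast-path hit, out port %d\n", k->src, k->dst, port);
    }
}

int main(void)
{
    struct flow_key k = { "10.0.0.1", "10.0.0.2" };
    forward(&k);   /* first packet: miss, goes to the slow path */
    forward(&k);   /* second packet: handled entirely in the fast path */
    return 0;
}
```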
© 2016 Mellanox Technologies 12- Mellanox Confidential -
ASAP2 – Let the Hardware Do the Heavy Lifting
New flow
• A new flow results in a 'miss' action in the Mellanox eSwitch and is directed to the OVS kernel module
• A miss in the kernel punts the packet to ovs-vswitchd in user space
Configuration
• ovs-vswitchd resolves the flow entry and, based on a policy decision to offload, propagates it to the corresponding eSwitch tables for offload-enabled flows
Fast forwarding
• Subsequent frames of offload-enabled flows are processed and forwarded by the eSwitch
Diagram: the first packet climbs from hardware to kernel to user space; subsequent packets of offloaded flows are forwarded in hardware, with the kernel path retained as a fallback forwarding path. (A toy extension of the previous sketch follows below.)
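Extending the previous toy model (again purely illustrative, not OVS or driver code), a policy hook copies resolved flows into a modeled eSwitch table so that subsequent packets never touch the CPU:

```c
/* Toy extension of the previous flow-cache model: resolved flows that the
 * policy marks as offloadable are copied into a modeled eSwitch table, so
 * subsequent packets never reach the CPU. Real systems push these entries
 * through the NIC driver's flow-offload interface, not a plain array. */
#include <stdio.h>
#include <string.h>

struct flow_key   { char src[16]; char dst[16]; };
struct flow_entry { struct flow_key key; int out_port; int valid; };

static struct flow_entry eswitch_table[32];   /* models NIC eSwitch tables  */
static struct flow_entry kernel_cache[32];    /* models OVS kernel datapath */

/* Policy decision: here, offload everything; a real policy might exclude
 * flows that need features the hardware cannot handle. */
static int should_offload(const struct flow_key *k) { (void)k; return 1; }

static struct flow_entry *lookup(struct flow_entry *tbl, int n,
                                 const struct flow_key *k)
{
    for (int i = 0; i < n; i++)
        if (tbl[i].valid && !memcmp(&tbl[i].key, k, sizeof *k))
            return &tbl[i];
    return NULL;
}

static void forward(const struct flow_key *k)
{
    if (lookup(eswitch_table, 32, k)) {            /* hardware fast path */
        printf("%s -> %s: forwarded by eSwitch, no CPU involved\n",
               k->src, k->dst);
        return;
    }
    if (!lookup(kernel_cache, 32, k)) {            /* miss: slow path resolves */
        struct flow_entry e = { *k, 1, 1 };        /* out_port picked by vswitchd */
        kernel_cache[0] = e;
        if (should_offload(k))
            eswitch_table[0] = e;                  /* propagate to hardware */
        printf("%s -> %s: miss resolved, flow offloaded to eSwitch\n",
               k->src, k->dst);
        return;
    }
    printf("%s -> %s: kernel fallback forwarding path\n", k->src, k->dst);
}

int main(void)
{
    struct flow_key k = { "10.0.0.1", "10.0.0.2" };
    forward(&k);   /* first packet: software path, flow pushed to eSwitch */
    forward(&k);   /* second packet: handled entirely in hardware */
    return 0;
}
```

In a real deployment this propagation happens through the NIC driver rather than an array, but the control flow is the same as described on the slide.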
© 2016 Mellanox Technologies 13- Mellanox Confidential -
OVS and SR-IOV, Working Seamlessly Together
Representor ports enable OVS to "know" and service the VMs that use SR-IOV
Representor ports are used for eSwitch / OVS communication (miss flows and PV-to-SR-IOV communication)
Diagram: each VF has a netdev representor on the host; VMs using OVS offload (SR-IOV) and VMs using para-virtualization both connect through the NIC eSwitch, with policy-based flow sync between OVS and the eSwitch.
© 2016 Mellanox Technologies 14- Mellanox Confidential -
Software Defined Networking, at Full Speed
 Highest performance (high throughput, low and deterministic latency)
• Offload becomes increasingly important as server I/O speed goes up
 Low CPU overhead, higher infrastructure efficiency
 Software defined
 Everything in-box
• All changes will be upstreamed; no proprietary OVS or kernel patches
Diagram: an SDN controller or other network orchestration layer pushes configuration to, and collects stats from, the eSwitch on each node.
© 2016 Mellanox Technologies 15- Mellanox Confidential -
Benchmark Targets
 Metrics
• Message Rate (PPS)
• Network related CPU Load
 Environments
• 25Gbps network
• Extreme performance
• Open Source
• Free
 Standard Benchmark
• RFC 2544
© 2016 Mellanox Technologies 16- Mellanox Confidential -
Benchmark Topology and Traffic Flow
Diagram: two setups compared side by side on the same Mellanox NIC, each with a 25GbE link and a VM running DPDK testpmd:
• OVS over DPDK: OVS runs over DPDK in hypervisor user space and switches all traffic in software
• OVS Offload: OVS remains the control point, but flows are offloaded to the NIC eSwitch and forwarded in hardware
© 2016 Mellanox Technologies 17- Mellanox Confidential -
Results and Conclusions
 330% higher message rate compared to OVS over DPDK
• 33M PPS vs. 7.6M PPS
• OVS Offload reaches near line rate at 25G (64B line rate is 37.2M PPS)
 Zero CPU utilization on the hypervisor, compared to 4 cores with OVS over DPDK
• This delta will grow further with packet rate and link speed
 Same CPU load on the VM
Chart – Message Rate and Dedicated Hypervisor Cores: OVS Offload delivers 33M PPS using 0 dedicated cores; OVS over DPDK delivers 7.6M PPS using 4 dedicated cores.
© 2016 Mellanox Technologies 18- Mellanox Confidential -
Accelerated Data Movement End to End: 25 is the New 10
One Switch. A World of Options.
Flexibility, Opportunities, Speed
Open Ethernet, Zero Packet Loss
Most Cost-Effective Ethernet Adapter
2.5X the Network Performance
Same Infrastructure, Same Connector
25G and 50G at Your Fingertips
© 2016 Mellanox Technologies 19- Mellanox Confidential -
Spectrum: The Ultimate 25/100GbE Switch
 The only predictable 25/50/100Gb/s Ethernet switch
 Full wire speed, non-blocking switch
• Doesn't drop packets per RFC 2544
 ZPL: Zero Packet Loss for all packet sizes
© 2016 Mellanox Technologies 20- Mellanox Confidential -
25GbE-to-25GbE Latency Test Results: Not All Ethernet Switches Were Born Equal
Chart – Microburst Absorption Capability (max burst size absorbed, in MB, by packet size): Spectrum 5.2 / 8.4 / 9.6 / 9.7 vs. Tomahawk 0.3 / 0.9 / 1.0 / 1.1 at 64B / 512B / 1.5KB / 9KB
Charts – Microburst Absorption Fairness and Avoidable Packet Loss (Broadcom vs. Spectrum, packet sizes 64B–9216B); Consistently Low Latency
www.Mellanox.com/tolly
www.zeropacketloss.com
© 2016 Mellanox Technologies 21- Mellanox Confidential -
Open Composable Networks
Diagram: end-to-end interconnect combined with open APIs, automation, and a choice of network OS (including SONiC).
© 2016 Mellanox Technologies 22- Mellanox Confidential -
RDMA Acceleration – Overcome Transport Protocol Inefficiencies
• Zero-copy remote data transfer
• Low-latency, high-performance data transfers
• InfiniBand – 100Gb/s; RoCE* – 100Gb/s
• Kernel bypass, protocol offload
* RDMA over Converged Ethernet
Diagram: application buffers are moved directly between hosts by the NIC hardware, user space to user space, bypassing the kernel on both sides. (A minimal memory-registration sketch follows below.)
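A minimal sketch (assumptions: a Linux host with an RDMA-capable NIC and libibverbs; queue-pair setup and exchanging the buffer address/rkey with the peer are omitted) of the memory registration that makes zero-copy, kernel-bypass transfers possible:

```c
/* Minimal sketch of the RDMA verbs setup behind zero-copy transfers: the
 * application registers a buffer with the NIC, which can then DMA it
 * directly to/from a remote peer with no kernel on the data path.
 * QP creation and rkey/address exchange are omitted. Link with -libverbs. */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

#define BUF_SIZE 4096

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return EXIT_FAILURE;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);      /* first device */
    struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;      /* protection domain */
    if (!ctx || !pd) {
        fprintf(stderr, "failed to open device or allocate PD\n");
        return EXIT_FAILURE;
    }

    /* Register an application buffer; the NIC pins it and returns keys that
     * let the local (lkey) and remote (rkey) sides reference it directly. */
    void *buf = malloc(BUF_SIZE);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, BUF_SIZE,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr) {
        perror("ibv_reg_mr");
        return EXIT_FAILURE;
    }

    printf("registered %d bytes: lkey=0x%x rkey=0x%x\n",
           BUF_SIZE, mr->lkey, mr->rkey);
    /* A peer that learns buf's address and rkey can now RDMA WRITE/READ it
     * with no data copies and no CPU involvement on this host. */

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return EXIT_SUCCESS;
}
```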
© 2016 Mellanox Technologies 23- Mellanox Confidential -
RDMA Increases Memcached Performance
 Memcached: high-performance, in-memory distributed object caching system
• Simple key-value store (see the client sketch below)
• Speeds up applications by eliminating database accesses
• Used by YouTube, Facebook, Zynga, Twitter, etc.
 RDMA improved Memcached performance:
• 1/3 the query latency
• >3X the throughput
D. Shankar, X. Lu, J. Jose, M.W. Rahman, N. Islam, and D.K. Panda, Can RDMA Benefit On‐Line Data Processing Workloads with Memcached and MySQL, ISPASS’15
Charts (OLDP workload): query latency in seconds and throughput in Kq/s vs. number of clients (64–400), Memcached-TCP vs. Memcached-RDMA: latency reduced by 66%, throughput increased by >200%.
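For orientation only (an illustrative sketch, not code from the cited study): the key-value caching pattern Memcached serves, written with the standard libmemcached TCP client. The server address, key, and TTL are assumptions; the RDMA work cited above accelerates these same get/set operations at the transport level.

```c
/* Minimal cache-aside sketch using the standard libmemcached TCP client
 * (server address, key and value are illustrative). Build with -lmemcached. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libmemcached/memcached.h>

int main(void)
{
    memcached_return_t rc;
    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "127.0.0.1", 11211);    /* assumed server */

    const char *key = "user:42:name";

    /* Try the cache first. */
    size_t len = 0;
    uint32_t flags = 0;
    char *val = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
    if (rc == MEMCACHED_SUCCESS) {
        printf("cache hit: %.*s\n", (int)len, val);
        free(val);
    } else {
        /* Cache miss: a real application would query the database here,
         * then populate the cache so later lookups skip the database. */
        const char *db_value = "Alice";                /* pretend DB result */
        rc = memcached_set(memc, key, strlen(key),
                           db_value, strlen(db_value), (time_t)600, 0);
        printf("cache miss: %s\n",
               rc == MEMCACHED_SUCCESS ? "stored value with 600s TTL"
                                       : "store failed (is memcached running?)");
    }

    memcached_free(memc);
    return 0;
}
```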
© 2016 Mellanox Technologies 24- Mellanox Confidential -
Case Studies
© 2016 Mellanox Technologies 25- Mellanox Confidential -
Server I/O Decides Affirmed Networks Virtual EPC Efficiency
 When server I/O is the bottleneck, Affirmed MCC deployment efficiency suffers, resulting in underutilized resources and a larger server footprint
 Mellanox 40G NICs enable MCC to fully utilize CPU resources, reduce server footprint and enhance efficiency
Diagram – A Typical Datapath Traffic Pattern: the MCC cluster is a single "composite" virtualized network function with distributed microservices that can scale in and out independently. IOM instances exchange north-south traffic with the SP router; east-west traffic flows within the cluster between IOM and WSM instances.
• MCM – Management Control Module
• CCM – Centralized Control Module
• DCM – Distributed Control Module
• IOM – Input/Output Module
• WSM – Workflow Services Module (data plane)
• ASM – Advanced Services Module (data plane)
An Example of Server Efficiency Improvement – supporting 20 Gbps of cluster I/O (the IOM servers must handle roughly twice that, since north-south traffic from the SP router is also redistributed east-west within the cluster):
• With 10G NICs: 4 servers needed
• With 40G NICs: 1 server needed
© 2016 Mellanox Technologies 26- Mellanox Confidential -
SR-IOV & Data Plane Acceleration Essential for Affirmed MCC
Diagram: three server I/O options compared, each serving VMs on the same hypervisor and NIC:
• Native Open vSwitch (OVS): ~20-30% of line rate
• DPDK-Accelerated vSwitch (AVS): ~80% of line rate
• SR-IOV: near line rate
© 2016 Mellanox Technologies 27- Mellanox Confidential -
Conclusion - The Mellanox EVN Differentiation
• Higher Workload Density
• Faster Data Movement
• Cloud-Native Scalability and Reliability
• Operation and Cost Efficiency
Thank You

Editor's Notes

  • #9 Another aspect of overlay networking is the switch VXLAN tunnel endpoint (VTEP). Our approach is that the VTEP should be implemented on the NIC as much as possible, as it allows better scaling and performance. But in some cases, such as connecting bare-metal servers or connecting VXLAN to VLAN networks, that is not possible. For such cases the VTEP needs to be implemented on the switch rather than on the NIC. We will start supporting switch VTEP together with our Cumulus release in 2Q16. MLNX-OS will follow later this year.
  • #12 This is how packets are forwarded by OVS in a paravirtualized environment. Both the vswitchd slow path and the OVS kernel module fast path run on the CPU. OVS performs flow-based forwarding: the first packet of a new flow hits the OVS kernel module, results in a match miss, and is punted to the user-space vswitch daemon. The vswitchd resolves the flow entry, possibly with help from an SDN controller, and programs that flow entry into the fast path, the OVS kernel module. The following packets in the same flow then hit the flow entry programmed into the OVS kernel module, and packet forwarding is executed in the kernel fast path.
  • #13 With OVS Offload, we add the fast and efficient eSwitch in the NIC into the picture. This is how networking on routers and switches has evolved over the last 20-25 years: no router or switch from any reputable networking vendor today does packet forwarding with the CPU any more. Instead, packet processing and forwarding are offloaded to a hardware fast path, normally implemented in ASICs or network processors. A new flow results in a 'miss' action in the eSwitch and is directed to the OVS kernel module. A miss in the kernel punts the packet to ovs-vswitchd in user space. ovs-vswitchd resolves the flow entry and, based on a policy decision to offload, propagates it to the corresponding eSwitch tables for offload-enabled flows. Subsequent frames of offload-enabled flows are processed and forwarded by the eSwitch.
  • #14 SR-IOV and OVS were like oil and water. The architecture design takes into consideration both VMs directly attached to Virtual Functions (VFs) and paravirtualized (PV) VMs. VF representors are a netdev modeling of eSwitch ports. The VF representor supports the following operations: flow configuration, flow statistics read, and send/receive of packets (from the host CPU to the VF).
  • #15 Three immediate benefits here, as you can expect: much higher performance, significantly lower CPU overhead, and software defined. The high performance refers to high throughput, not only bit throughput but packet throughput, which shows up as how fast the system can process packets, something that is really important for virtualized network functions, especially real-time multimedia applications. High performance is also reflected in low and deterministic latency. Offload only gets more important as server I/O speed goes up. At 10G I/O, you are looking at a theoretical max of about 15 million packets per second, and if you throw in a few CPU cores, you might achieve a decent packet rate with software forwarding; we've heard numbers like 9 million packets per second with 4 CPU cores. But at 25G the theoretical max packet rate is 37.5 million pps, at 40G it is 60 million pps, and at 100G it will be 150 million pps, and software will not be able to catch up even if you dedicate all your CPU cores to processing packets instead of actually running applications to do service processing. OVS offload lowers your CPU overhead and frees the CPU from packet processing to run applications, resulting in higher infrastructure efficiency. The hardware offloads are transparent to the user and the Open vSwitch interfaces remain untouched, so users don't need any changes in their environment. The interaction between the SDN controller and OVS remains the same, and you get the best of both worlds: SDN at full speed. Last but not least, we are contributing all changes upstream to OVS and OpenStack, so you don't need to run yet another proprietary version of OVS to take advantage of these offload capabilities.
  • #22 Now that you understand how the hyperscalers build their cloud network infrastructure, you might start thinking: OK, that is great, but I don't have the manpower and large number of software developers to follow this model. The good news is, Mellanox and our ecosystem partners make things easy for you. We have a solution called Open Composable Networks that provides a set of high-performance, highly programmable networking components, including switches, server adapters, optical modules and cables, and network processors, which support open APIs such as SAI and switchdev for Linux; on top of these standard interfaces you have a slew of network operating system and software application choices. As a matter of fact, at this year's OCP Summit last month we did a live demo of 5 different network operating systems running over our flagship Spectrum switches. We also provide the middleware that makes it easy to compose your ideal cloud network infrastructure, and simple to monitor, manage and scale.
  • #24 Illustrated with an OLDP (On-Line Data Processing) workload using a modified mysqlslap load-testing tool. Memcached in its implementation uses sockets; we show the advantages one could get by using RDMA in this environment. Kq/s stands for kilo-queries per second. The above numbers were generated on InfiniBand QDR.
  • #26 The IOM module takes in packets from the service provider gateway router (north-south traffic) and distributes them to one of the datapath modules, such as the WSM (east-west traffic). The WSM processes packets and sends them back to the IOM, which in turn sends the traffic back to the SP router. For the IOM to take in X Gbps of traffic from the SP router, the server I/O must be able to handle 2X Gbps of traffic.