Software Network Data Plane - Satisfying the need for speed - FD.io - VPP and CSIT projects
1. Software Network Data Plane – Satisfying the Need for Speed
FastData.io – VPP and CSIT Projects
16th of April 2018, Out of the Box Net Dev Meetup, London
Maciek Konstantynowicz, lf-id/irc: mackonstan, mkonstan@cisco.com
CSIT-CPL - Continuous System Integration and Testing, a.k.a. “Continuous Performance Lab”
https://wiki.fd.io/view/CSIT
VPP – Vector Packet Processing
https://wiki.fd.io/view/VPP
4. Topics
• What is FD.io
• The “Magic” of Vectors
• SW Data Plane Benchmarking
• Deployment Applicability
• Addressing the Continuity Problem
• Some Results, Reports and Analysis
5. Breaking the Barrier of Software Defined Network Services
1 Terabit Services on a Single Intel® Xeon® Server!
[Slide graphic: A Universal Terabit Network Platform for Cloud-native Network Services – Superior Performance, Most Efficient on the Planet, Flexible and Extensible, Open Source, Cloud Native; keywords: Efficiency, Performance, Software Defined Networking, Cloud Network Services, Linux Foundation]
6. FD.io VPP – Vector Packet Processor
Compute Optimized SW Network Platform
Packet Processing Software Platform:
• High performance
• Linux user space
• Runs on compute CPUs – and "knows" how to run them well!
[Slide diagram: software stack layers – Dataplane Management Agent, Packet Processing, Network IO – running on bare-metal / VM / container]
7. FD.io VPP – The "Magic" of Vectors
Compute Optimized SW Network Platform
1. Packet processing is decomposed into a directed graph of nodes …
2. … packets move through graph nodes in vectors …
3. … graph nodes are optimized to fit inside the microprocessor's instruction cache …
4. … packets are pre-fetched into the data cache.
[Slide diagram: a vector of packets (Packet 0 … Packet 10) flowing through graph nodes such as dpdk-input, af-packet-input, vhost-user-input, ethernet-input, lldp-input, cdp-input, arp-input, mpls-input, ip4-input, ip6-input, l2-input, ...-no-checksum, ip4-lookup*, ip4-lookup-multicast, ip4-load-balance, ip4-rewrite-transit, ip4-midchain, mpls-policy-encap, interface-output]
* Each graph node implements a "micro-NF", a "micro-NetworkFunction" processing packets.
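To make the vector idea concrete, here is a minimal, hypothetical C sketch (not actual VPP source; all names are illustrative) of how a graph node can walk a vector of packets while prefetching packet data a few slots ahead, so the node's small loop stays resident in the instruction cache while prefetches hide memory latency:

/* Hypothetical sketch, not actual VPP source: process a vector of packets in
 * one graph node, prefetching packet data a few slots ahead so header reads
 * hit the data cache while the node's code stays hot in the i-cache. */
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t  *data;        /* pointer to the packet headers           */
    uint32_t  next_index;  /* graph node this packet is dispatched to */
} pkt_t;

static inline void process_one(pkt_t *p) {
    /* header parsing / table lookup for one packet would go here */
    (void)p;
}

void example_node(pkt_t **pkts, size_t n_pkts) {
    for (size_t i = 0; i < n_pkts; i++) {
        if (i + 4 < n_pkts)                      /* look-ahead distance of 4 */
            __builtin_prefetch(pkts[i + 4]->data, 0 /* read */, 3 /* keep in cache */);
        process_one(pkts[i]);
    }
}

Real VPP nodes go further, typically handling two or four packets per loop iteration to expose more instruction-level parallelism; the sketch above only shows the prefetch-ahead pattern.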
8. FD.io Benefits from Intel® Xeon® Processor Developments
Increased Processor I/O Improves Packet Forwarding Rates
YESTERDAY – Intel® Xeon® E5-2699v4 (Broadwell):
22 Cores, 2.2 GHz, 55 MB Cache
Network I/O: 160 Gbps
Core ALU: 4-wide parallel µops
Memory: 4 channels, 2400 MHz
Max power: 145 W (TDP)
TODAY – Intel® Xeon® Platinum 8168 (Skylake):
24 Cores, 2.7 GHz, 33 MB Cache
Network I/O: 280 Gbps
Core ALU: 5-wide parallel µops
Memory: 6 channels, 2666 MHz
Max power: 205 W (TDP)
[Slide diagrams: 2-socket Broadwell and Skylake server block diagrams – CPUs linked by QPI/UPI, DDR4 channels, PCIe slots carrying x8 50GE and x16 100GE Ethernet NICs, SATA, BIOS, PCH / Lewisburg PCH]
[Slide chart: PCIe packet forwarding rate [Gbps], Intel® Xeon® v3/v4 vs Intel® Xeon® Platinum processors – Server [1 Socket]: 160 → 280; Server [2 Sockets]: 320 → 560; Server 2x [2 Sockets]: 640 → 1,120*; +75% in each case]
* On compute platforms with all PCIe lanes from the Processors routed to PCIe slots.
FD.io Takes Full Advantage of Faster Intel® Xeon® Scalable Processors – No Code Change Required
Breaking the Barrier of Software Defined Network Services
1 Terabit Services on a Single Intel® Xeon® Server!
https://goo.gl/UtbaHy
9. FD.io VPP – The "Magic" Behind the Equation
FD.io Takes Full Advantage of Faster Intel® Xeon® Scalable Processors – No Code Change Required
[Slide diagrams: two 2-socket Skylake server block diagrams, each 2-CPU machine delivering 490 Gbps of network I/O plus 100 Gbps of crypto I/O – CPUs linked by UPI, DDR4 channels, PCIe slots carrying x8 50GE and x16 100GE NICs, SATA, BIOS, Lewisburg PCH]
FD.io Data Plane Efficiency
Metrics: { + } higher is better, { - } lower is better

| Metric                                | YESTERDAY: Intel® Xeon® E5-2699v4 | TODAY: Intel® Xeon® Platinum 8168 | Improvement |
| { + } 4-socket forwarding rate [Gbps] | 560 Gbps                          | 948 Gbps*                         | +69 %       |
| { - } Cycles / Packet                 | 180                               | 158                               | -12 %       |
| { + } Instructions / Cycle (HW max.)  | 2.8 (4)                           | 3.28 (5)                          | +17 %       |
| { - } Instructions / Packet           | 499                               | 497                               | ~0 %        |

Machine with Intel® Xeon® Platinum 8168 – per processor: 24 cores, 48 threads, 2.7 GHz; on-board LBG-NS 100G QAT Crypto.
* Measured 4-socket forwarding rate is limited by the PCIe I/O slot layout of the tested compute machines; the nominal forwarding rate for the tested FD.io VPP configuration is 280 Gbps per Platinum processor. Not all cores are used.
Breaking the Barrier of Software Defined Network Services
1 Terabit Services on a Single Intel® Xeon® Server!
https://goo.gl/UtbaHy
10. DP Benchmarking Metrics – External and Internal
Compute CPP from PPS or vice versa..

throughput [bps] = throughput [pps] * packet_size [bits]
program_unit_execution_time [sec] = (#instructions / program_unit) * (#cycles / instruction) * cycle_time
packet_processing_time [sec] = (#instructions / packet) * (#cycles / instruction) * cycle_time
#cycles_per_packet = (#instructions / packet) * (#cycles / instruction)
throughput [pps] = 1 / packet_processing_time [sec] = CPU_freq [Hz] / #cycles_per_packet

Treat the software network data plane as one would any program, with instructions per packet as the program unit, and arrive at the main data-plane benchmarking metrics:
External metrics: PPS, BPS
Internal metrics: CPP, IPP, CPI (or 1/IPC)
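As a rough worked example (an illustration only, borrowing the 180 cycles/packet figure from the efficiency table on slide 9 and the 2.2 GHz clock of the E5-2699v4; actual rates depend on workload, packet size and core count):

throughput [pps] = 2.2e9 [Hz] / 180 [cycles/packet] ≈ 12.2 Mpps per core
throughput [bps] at 64-byte frames ≈ 12.2e6 * 64 * 8 ≈ 6.3 Gbps per core (L2 frame bits only, excluding preamble and inter-frame gap)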
11. Metrics – Mapping Them to Resources
Main architecture resources used for packet-centric operations (a metric-computation sketch follows below):
1. Packet processing operations – how many CPU core cycles are required to process a packet?
2. Memory bandwidth – how many memory read and write accesses are made per packet?
3. I/O bandwidth – how many bytes are transferred over the PCIe link per packet?
4. Inter-socket transactions – how many bytes are accessed from the other socket, or from another core in the same socket, per packet?
[Slide diagram: 2-socket Broadwell server block diagram with the four resources annotated – CPU cores, DDR4 channels, PCIe slots with x8 50GE and x16 100GE Ethernet NICs, QPI links, SATA, BIOS, PCH]
In-depth introduction on SW data plane performance benchmarking:
https://fd.io/resources/performance_analysis_sw_data_planes.pdf
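The per-packet metrics behind items 1–4 follow directly from counter deltas taken over a measurement interval. A minimal sketch (illustrative names, not from any particular library), assuming the caller obtains cycle and instruction deltas from PMU counters (e.g. collected with perf or pmu-tools) and the packet delta from interface statistics:

/* Derive the internal data-plane metrics from counter deltas taken over one
 * measurement interval. */
#include <stdint.h>

typedef struct {
    double cycles_per_packet;       /* CPP – lower is better  */
    double instructions_per_packet; /* IPP – lower is better  */
    double instructions_per_cycle;  /* IPC – higher is better */
} dp_metrics_t;

dp_metrics_t dp_metrics(uint64_t d_cycles, uint64_t d_instructions,
                        uint64_t d_packets)
{
    dp_metrics_t m = {0.0, 0.0, 0.0};
    if (d_packets) {
        m.cycles_per_packet       = (double)d_cycles / (double)d_packets;
        m.instructions_per_packet = (double)d_instructions / (double)d_packets;
    }
    if (d_cycles)
        m.instructions_per_cycle  = (double)d_instructions / (double)d_cycles;
    return m;
}

With the CPU frequency known, per-core PPS then follows as CPU_freq / CPP, tying the internal metrics back to the external ones.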
12. Applicability – SW Network Services within a Node
• Start simple
• Benchmark the NIC to NIC packet path
• Use the right test and telemetry tools and approaches
• Analyze all key metrics: PPS, CPP, IPC, IPP and more
• Find performance ceilings of Network Function data plane
• Then apply the same methodology to other packet paths and services
13. The Continuity Problem
• FD.io VPP works today
• Great external and internal performance metrics
• But the world keeps moving on
• New functions and features are added continuously
• New generations of hardware show up periodically
• So, how do you keep the world happy, i.e.:
• maintain best-in-class performance?
• prevent rogue patches from going in?
• qualify further optimizations of existing code?
• quantify HW accelerators, processors and device setting changes?
14. Addressing the Continuity Problem with FD.io CSIT-CPL
Continuous Performance Lab
• CSIT-CPL goals and aspirations
• FD.io VPP benchmarking
• VPP functionality per specifications (RFCs1)
• VPP performance and efficiency (PPS2, CPP3)
• Network data plane – throughput (Non-Drop Rate), bandwidth, PPS, packet delay
• Network control plane and management plane interactions (memory leaks!)
• Performance baseline references for HW + SW stack (PPS, CPP)
• Range of deterministic operation for HW + SW stack (NDR, PDR4)
• Provide a testing platform and tools to the FD.io VPP developer and user community
• Automated functional and performance tests
• Automated telemetry feedback with conformance, performance and efficiency metrics
• Help drive good practice and engineering discipline into the FD.io dev community
• Drive innovative optimizations into the source code – verify they work
• Enable innovative functional, performance and efficiency additions & extensions
• Prevent unnecessary code "harm"
Legend:
1 RFC – Request For Comments – IETF specifications
2 PPS – Packets Per Second
3 CPP – Cycles Per Packet (a metric of packet processing efficiency)
4 NDR, PDR – Non-Drop Rate, Partial Drop Rate
15. FD.io CSIT-CPL
Continuous Performance Lab – CI/CD for SW network data planes
• Continuous Testing and Reporting
• Functional – Pass/Fail
• Device Drivers – Pass/Fail
• Performance Benchmarking – Throughput and Latency
• no Pass/Fail, but a spectrum of data that needs to be analyzed and classified further
• Continuous Analysis
• Performance trending, spotting progressions and regressions
• Anomaly detection and notification (see the sketch below)
• All in open source and published
• Tools and code
• Results and analytics
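The trending analytics CSIT actually publishes are considerably more elaborate; purely as an illustration of the anomaly-detection idea, here is a minimal C sketch that flags a new throughput sample as a possible regression when it falls more than k standard deviations below the mean of a recent window of results (all names are illustrative):

/* Return 1 when new_sample drops more than k standard deviations below the
 * mean of the recent result window, i.e. a candidate regression to notify on. */
#include <math.h>
#include <stddef.h>

int looks_like_regression(const double *window, size_t n,
                          double new_sample, double k)
{
    if (n == 0)
        return 0;
    double mean = 0.0, var = 0.0;
    for (size_t i = 0; i < n; i++)
        mean += window[i];
    mean /= (double)n;
    for (size_t i = 0; i < n; i++)
        var += (window[i] - mean) * (window[i] - mean);
    var /= (double)n;
    return new_sample < mean - k * sqrt(var);
}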
16. FD.io CSIT-CPL
Per release test and performance reports
https://docs.fd.io/csit/rls1801/report/index.html
17. Measuring and Trending Performance – a Spectrum of Data
https://docs.fd.io/csit/master/trending/
18. CSIT-CPL – Getting "C" right in "CI/CD"..
• Need "bare metal" to execute tests
• Many functional and all performance tests need to run on physical servers
• Lots of tests and many combinations – they take time
• Physical resources, testbeds and servers are always in short supply!
• Dealing with scarce physical resources
• Focus on efficiency and execution time
• Reduce infra overhead
• Speed up build time for per-patch tests
• Reduce execution time
• smarter NDR/PDR throughput rate search algorithms (see the sketch after this list)
• parallelize
• Keep optimizing..
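One concrete way to cut execution time is to replace a linear sweep of offered loads with an interval-halving search for the highest rate that stays within the allowed loss ratio (zero loss for NDR, a small bounded loss for PDR). A minimal sketch; measure_loss_ratio() is a hypothetical hook into the traffic-generator harness, and the real CSIT search logic adds warm-up trials, repeated verification runs and adaptive trial durations:

/* Interval-halving throughput search. measure_loss_ratio() runs one trial at
 * the offered rate and returns the observed loss ratio. allowed_loss = 0.0
 * targets NDR; a small positive value targets PDR. */
double search_throughput(double min_rate, double max_rate,
                         double allowed_loss, double precision,
                         double (*measure_loss_ratio)(double offered_rate))
{
    double best = min_rate;
    while (max_rate - min_rate > precision) {
        double rate = (min_rate + max_rate) / 2.0;
        if (measure_loss_ratio(rate) <= allowed_loss) {
            best = rate;      /* sustained within the loss bound */
            min_rate = rate;  /* try a higher offered load       */
        } else {
            max_rate = rate;  /* too fast, back off              */
        }
    }
    return best;
}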
19. CSIT-CPL – Testbeds Today
2-Node Topology and 3-Node Topology
[Slide diagrams: physical testbeds built from x86 servers, each with two Xeon E5-2699v3 processors connected over QPI, DDR4 memory, and three NICs per socket on x8 PCIe; the servers are wired NIC-to-NIC and labelled Systems Under Test and "SW Devices" Under Test]
20. CSIT-CPL – Where we got to..
• Enabled per-patch performance tests
• In POC phase due to limited physical testbed capacity – to be fixed shortly
• Growing the physical performance lab
• 20 2-socket Xeon Skylake servers
• Each Skylake server can do 280 Gbps of full-duplex I/O per socket!
• https://goo.gl/UtbaHy
21. CSIT-CPL – .. and where we are going with this..
• Every patch performance-benchmarked
• Because once it is merged, it is gone
• Results summarized and abstracted for a meaningful feedback loop
• To humans: contributors, committers, testers, users
• To downstream projects
• To trending analytics for anomaly detection and notification
• To telemetry analytics for efficiency verification
22. Future: Planned Summary Data Views
Results and Analysis – #cycles/packet (CPP) and Throughput (Mpps)
#cycles/packet = cpu_freq [MHz] / throughput [Mpps]
See Kubecon Dec-2017, Benchmarking and Analysis.., https://wiki.fd.io/view/File:Benchmarking-sw-data-planes-Dec5_2017.pdf
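As a quick per-core sanity check against the slide 9 table (illustrative only; system throughput also depends on how many cores run the data plane and on I/O limits): with 158 cycles/packet on a 2700 MHz Platinum 8168 core, throughput [Mpps] = 2700 / 158 ≈ 17.1 Mpps per core; conversely, a measured per-core packet rate and the known clock give the CPP directly.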
23. Future: Planned Summary Data Views
Xeon Telemetry Analytics – Compute Usage Efficiency
[Slide chart: TMAM Level 1 Distribution (HT) – per-workload breakdown of %Retiring, %Bad_Speculation, %Frontend_Bound, %Backend_Bound (left axis, 0–100%) and IPC (right axis, 0–3.5) for CoreMark, DPDK-Testpmd L2 Loop, DPDK-L3Fwd IPv4 Forwarding, VPP L2 Patch Cross-Connect, VPP L2 MAC Switching, OVS-DPDK L2 Cross-Connect, VPP IPv4 Routing]
Observations:
• IPC – good IPC for all network workloads due to code optimization; HT makes it even better.
• Retiring – instructions retired; drives IPC.
• Bad_Speculation – minimal bad branch speculations, attributed to architecture logic and software pragmas.
• Backend stalls – the major contributor causing low IPC in noHT cases; HT hides backend stalls.
• Frontend stalls – become a factor with HT as more instructions are executed by both logical cores.
See Kubecon Dec-2017, Benchmarking and Analysis.., https://wiki.fd.io/view/File:Benchmarking-sw-data-planes-Dec5_2017.pdf
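For readers new to TMAM: at Level 1 the methodology attributes every pipeline slot to exactly one of the four categories, so each workload's bars in the chart sum to 100%:

%Retiring + %Bad_Speculation + %Frontend_Bound + %Backend_Bound = 100% of pipeline slots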
24. Breaking the Barrier of Software Defined Network Services
1 Terabit Services on a Single Intel® Xeon® Server!
[Slide graphic (repeat of slide 5): A Universal Terabit Network Platform for Cloud-native Network Services – Superior Performance, Most Efficient on the Planet, Flexible and Extensible, Open Source, Cloud Native; keywords: Efficiency, Performance, Software Defined Networking, Cloud Network Services, Linux Foundation]
25. Summary
• Terabit-level SW network services are within reach
• FD.io is here, available to all
• And it is continuously improving..
• Next is to make use of them in the cloud
• Birth of cloud-native network services
• E.g. integration into the k8s eco-system
• Industry collaboration in open source is essential
• Code development, benchmarking
• Publishing all work and results, dev and test
• Benchmarking automation tools
• Automated telemetry data analytics
26. References
FD.io VPP, CSIT-CPL and related projects
• VPP: https://wiki.fd.io/view/VPP
• CSIT-CPL: https://wiki.fd.io/view/CSIT
• pma_tools: https://wiki.fd.io/view/Pma_tools
Benchmarking Methodology
• Kubecon Dec-2017, Benchmarking and Analysis.., https://wiki.fd.io/view/File:Benchmarking-sw-data-planes-Dec5_2017.pdf
• "Benchmarking and Analysis of Software Network Data Planes" by M. Konstantynowicz, P. Lu, S.M. Shah, https://fd.io/resources/performance_analysis_sw_data_planes.pdf
Benchmarks
• EEMBC CoreMark® – http://www.eembc.org/index.php
• DPDK testpmd – http://dpdk.org/doc/guides/testpmd_app_ug/index.html
• FD.io VPP – Fast Data IO packet processing platform, docs: https://wiki.fd.io/view/VPP, code: https://git.fd.io/vpp/
Performance Analysis Tools
• "Intel Optimization Manual" – Intel® 64 and IA-32 architectures optimization reference manual
• Linux PMU-tools, https://github.com/andikleen/pmu-tools
TMAM
• Intel Developer Zone, Tuning Applications Using a Top-down Microarchitecture Analysis Method, https://software.intel.com/en-us/top-down-microarchitecture-analysis-method-win
• Technion presentation on TMAM, "Software Optimizations Become Simple with Top-Down Analysis Methodology (TMAM) on Intel® Microarchitecture Code Name Skylake", Ahmad Yasin. Intel Developer Forum, IDF 2015. [Recording]
• "A Top-Down Method for Performance Analysis and Counters Architecture", Ahmad Yasin. In IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2014, https://sites.google.com/site/analysismethods/yasin-pubs
27. Opportunities to Contribute
We invite you to Participate in FD.io
• Get the Code, Build the Code, Run the Code
• Try the vpp user demo
• Install vpp from binary packages (yum/apt)
• Install Honeycomb from binary packages
• Read/Watch the Tutorials
• Join the Mailing Lists
• Join the IRC Channels
• Explore the wiki
• Join FD.io as a member
Areas where contributions are welcome:
• Container Integration
• Firewall
• IDS
• Hardware Accelerators
• Control plane – support your favorite SDN Protocol Agent
• DPI
• Test tools
• Packaging
• Testing