3. Express I/O for XDP
• Kernel network I/O has been a performance bottleneck
• Netmap and DPDK claim a 10x performance advantage
• Bypass is not low-hanging fruit
• Could rebuilding EVERYTHING in userspace really do better?
• Unless all bottlenecks are removed, there is still a long way to go
• The kernel is the place for a better driver/platform ecosystem
• Multi-vendor NICs and accelerators
• x86, ARM, Power, SPARC, etc.
• The programmability of XDP will enable innovation in “Network Function Applications”
4. History of CETH (Common Ethernet Driver Framework)
Designed for Performance and Virtualization:
1. Improve kernel networking performance for virtualization, particularly vSwitch and virtual I/O
2. Simplify NIC drivers by consolidating common functions, particularly for “internal” new NIC accelerators
3. Standalone module for various kernel versions
Supports:
• Huawei’s EVS (Elastic Virtual Switch)
• NICs:
• Intel ixgbe
• Intel i40e (40G)
• Broadcom bnx2x
• Mellanox mlnx-en
• Emulex be2net
• Accelerators:
• Huawei SNP-lite
• Broadcom XLP
• Ezchip Gx36
• Huawei VDR
• vNIC:
• ctap(tap+vhost)
• virtio-net
• ceth-pair
5. Design Considerations (before XDP)
1. Efficient Memory/Buffer Management
o Pre-allocated packet buffer pool
o Efficient buffer acquire/recycle mechanism
o Data Prefetching
o Batched packet processing
o Optimized for efficient cache usage
o Locking reduction/avoidance
o High performance copy
o Reduction of DMA mapping
o Huge pages, etc.
2. Flexible TX/RX Scheduling
o Threaded_irq
o All-in-interrupt handling
o Optional run-to-completion (R2C) or pipeline threading models
o Feature-triggered mode switching
3. Customizable Meta-data structure
o Cache-friendly data structure
o Hardware/accelerator friendly
o Extensible: metadata format is customizable
o SKB compatible
4. Compatible with Kernel IP stack
o Hardware Offloading friendly
o Checksum, VLAN, etc.
o TSO/GSO, LRO/GRO
o Easy to port existing Linux device drivers
o Reuse most existing non-datapath functions
o Guide for easy driver porting
5. Tools for easy performance tuning
o “ceth” tool to tune all parameters
o sysfs interfaces
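Items 1 and 2 above center on batched, cache-friendly RX processing: instead of handling one packet per driver callback, the RX path pulls up to a full batch from the ring and walks it in a tight loop, amortizing per-packet overhead. A minimal userspace sketch of that idea (the batch size, `struct pkt`, and `rx_process_batch()` are illustrative names, not taken from the slides):

```c
#include <assert.h>
#include <stddef.h>

#define CETH_BATCH_SIZE 64  /* hypothetical batch size, for illustration */

struct pkt { int len; };    /* stand-in for a real packet descriptor */

/* Walk up to one batch of packets from the RX ring in a tight loop.
 * Returns the number of packets handled this poll. */
int rx_process_batch(struct pkt *ring, int avail)
{
    int n = avail < CETH_BATCH_SIZE ? avail : CETH_BATCH_SIZE;
    for (int i = 0; i < n; i++) {
        /* a real driver would prefetch ring[i + 1]'s data here */
        ring[i].len++;      /* placeholder for per-packet work */
    }
    return n;
}
```

Batching also bounds the work done per poll, which is what makes mode switching between interrupt-driven and polled operation practical.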
6. Simplified CETH for XDP
1. Efficient Memory/Buffer Management
o Pre-allocated packet buffer pool
o Efficient buffer acquire/recycle mechanism
o Data Prefetching
o Batched packet processing
o Optimized for efficient cache usage
o Locking reduction/avoidance
o High performance copy
o Reduction of DMA mapping
o Huge pages, etc.
2. Flexible TX/RX Scheduling
o Threaded_irq
o All-in-interrupt handling
o Optional run-to-completion (R2C) or pipeline threading models
o Feature-triggered mode switching
3. Customizable Meta-data structure
o Cache-friendly data structure
o Hardware/accelerator friendly
o Extensible: metadata format is customizable
o SKB compatible
4. Compatible with Kernel IP stack
o Hardware Offloading friendly
o Checksum, VLAN, etc.
o TSO/GSO, LRO/GRO
o Easy to port existing Linux device drivers
o Easy driver porting: less than 200 LOC per driver
5. Tools for easy performance tuning
o “ceth” tool to tune all parameters
o Sysfs interfaces
7. Simple interfaces for drivers
• New Functions (CETH module)
o ceth_pkt_acquire()
o ceth_pkt_recycle()
o ceth_pkt_to_skb()
• Kernel modification
o __kfree_skb()
• Driver modifications
o allocate buffers from CETH
o optional: use pkt_t by default
o optimize the driver!
• Performance
o 30% performance improvement for packet switching (br, ovs)
o 40% of pktgen performance
o 100% improvement for XDP forwarding
o 33 Mpps XDP drop rate with 2 CPU threads
o Scalable with multiple hardware queues
o Patch available based on the latest XDP kernel tree
Preliminary Performance numbers:
https://docs.google.com/spreadsheets/d/1nT0DO25lfS1QpB
LQkdIMm4LJl1v_VMScZVSOcRgkQOI/edit#gid=0
NOTE: all numbers were internally tested, for development purposes only.
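The driver-facing API above is deliberately small: acquire a buffer from the pool, recycle it when done, or convert it to an skb for the host stack. The slides do not give the kernel signatures, so the following userspace mock uses illustrative types and a toy 4-entry pool purely to show the call flow:

```c
#include <assert.h>
#include <stdlib.h>

struct ceth_pkt { void *data; int in_use; };  /* stand-in descriptor */
struct sk_buff  { void *data; };              /* stand-in for the kernel skb */

static struct ceth_pkt pool[4];               /* toy pre-allocated buffer pool */

/* Hand out a free buffer from the pre-allocated pool; no per-packet malloc. */
struct ceth_pkt *ceth_pkt_acquire(void)
{
    for (int i = 0; i < 4; i++)
        if (!pool[i].in_use) { pool[i].in_use = 1; return &pool[i]; }
    return NULL;                              /* pool exhausted */
}

/* Return a buffer to the pool instead of freeing it. */
void ceth_pkt_recycle(struct ceth_pkt *pkt)
{
    pkt->in_use = 0;
}

/* Conversion "with cost": wrap the buffer so the kernel IP stack can use it. */
struct sk_buff *ceth_pkt_to_skb(struct ceth_pkt *pkt)
{
    struct sk_buff *skb = malloc(sizeof(*skb));
    if (skb) skb->data = pkt->data;
    return skb;
}
```

A driver's RX path would acquire buffers for its descriptor ring, and for packets destined to the host stack call the to-skb conversion before `netif_receive_skb()`; the `__kfree_skb()` hook is what routes freed skbs back into recycling.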
8. Memory and Buffer Management
• Separate memory management layer for various optimizations, like huge pages
• Per-CPU or per-queue buffer pool mechanisms
• May use skb by default (pkt_t as buffer data structure only)
• Can use non-skb metadata across all XDP functions
[Figure: CETH packet management for XDP and the protocol stack. A memory-management layer hands out contiguous pages of batch size — by default paged memory from the buddy allocator via alloc_pages(), optionally a huge-page implementation for mapping to user space. Above it, per-CPU (or per-device-queue) ceth_pkt buffer pools carve the pages into ceth_pkt batches and keep a recycled batch list. The driver fills RX descriptor rings from the current batch via ceth_pkt_acquire(); a received ceth_pkt either goes to XDP (drop/forward, then ceth_pkt_recycle(pkt)) or is converted with ceth_pkt_to_skb(pkt) and handed to the host protocol stack via netif_receive_skb(skb), returning through __kfree_skb(skb). When the current batch is used up and the recycled list is not empty, the pool takes the first batch from the recycled list; if the list is empty, it falls back to alloc_pages(); if the list grows too long, the batch is freed directly; if the list idles for too long, all pkt batches in it are freed. Whoever frees the last in-use ceth_pkt in a batch pushes the batch to the head of the recycled list while taking the recycle-list lock.]
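The batch-recycling rule — the last free of a batch pushes the whole batch onto the pool's recycled list, and the RX side refills from that list before falling back to the page allocator — can be modeled in a few lines of userspace C. Struct names and fields below are assumptions for illustration, not the kernel implementation:

```c
#include <assert.h>
#include <stddef.h>

struct pkt_batch {
    int in_use;                 /* packets of this batch still held by driver/stack */
    struct pkt_batch *next;     /* link on the recycled batch list */
};

struct buffer_pool {
    struct pkt_batch *recycled; /* head of the recycled batch list */
    int recycled_len;
};

/* The driver or stack frees one packet belonging to batch `b`.
 * Whoever frees the LAST in-use packet pushes the whole batch onto
 * the recycled list (the real code takes the recycle-list lock here). */
void ceth_pkt_free(struct buffer_pool *pool, struct pkt_batch *b)
{
    if (--b->in_use == 0) {
        b->next = pool->recycled;
        pool->recycled = b;
        pool->recycled_len++;
    }
}

/* RX side: the current batch is used up; take the first recycled batch.
 * Returns NULL when the list is empty, so the caller falls back to
 * the page allocator (alloc_pages() in the kernel design). */
struct pkt_batch *pool_take_batch(struct buffer_pool *pool)
{
    struct pkt_batch *b = pool->recycled;
    if (!b)
        return NULL;
    pool->recycled = b->next;
    pool->recycled_len--;
    return b;
}
```

Recycling whole batches rather than single packets is what keeps the hot path nearly lock-free: the lock is taken once per batch, not once per packet.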
9. CETH pkt_t Structure
• Use one page for one packet
• Customizable meta data (for XDP)
• Header room for overlay
• SKB data structure ready
• Easy conversion between pkt_t and sk_buff (with cost)
• Reuse skb_shared_info for fragments
[Figure: ceth_pkt_buffer layout in one 4K page (64x64 bytes). Regions shown: the ceth_pkt descriptor (list head, handle, data_offset, signature, meta data), head room of 128 bytes (64x2), a data area of 2880 bytes (64x45), and an skb_shared_info region (320 bytes, 64x5) with frags[17] at the end. For SKB compatibility the diagram also shows an sk_buff_fclones pair (sk_buff + sk_buff2, fclone_ref=2, 232x2+8 bytes, 64x8) whose head/data/end pointers reference the same buffer.]
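The one-page-per-packet layout can be sketched as a C struct. The 4K page size, 128-byte head room, and 320-byte skb_shared_info region come from the figure; the descriptor field widths and the metadata size are guesses for illustration only:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_4K 4096   /* 64 x 64, one page per packet */
#define HEAD_ROOM    128    /* 64 x 2, room for overlay encapsulation headers */
#define SHINFO_SIZE  320    /* 64 x 5, skb_shared_info region from the figure */
#define META_AREA    64     /* customizable XDP metadata size (assumed) */

/* Packet descriptor at the front of the page; fields from the figure,
 * widths assumed. */
struct ceth_pkt {
    struct ceth_pkt *list;  /* list head */
    void    *handle;
    uint32_t data_offset;   /* where packet data starts within the page */
    uint32_t signature;
    uint8_t  meta[META_AREA]; /* customizable meta data */
};

/* Whole-page buffer: descriptor, head room, data, shared_info at the end.
 * The data area is sized so the struct fills exactly one 4K page. */
struct ceth_pkt_buffer {
    struct ceth_pkt pkt;
    uint8_t head_room[HEAD_ROOM];
    uint8_t data[PAGE_SIZE_4K - sizeof(struct ceth_pkt)
                 - HEAD_ROOM - SHINFO_SIZE];
    uint8_t shinfo[SHINFO_SIZE]; /* skb_shared_info-compatible; frags live here */
};
```

Placing an skb_shared_info-compatible region at a fixed offset from the end of the page is what makes the pkt_t-to-sk_buff conversion cheap: the fragment list can be reused rather than rebuilt.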
10. Next Steps (w/ XDP)
• Ongoing
1. Port more mm/bm features
2. Measure performance with XDP use cases
3. Optimize performance with drivers (need help from driver developers!)
4. Measure performance improvement of virtio
5. Direct socket interface for userspace applications
• Discussions on mailing lists
1. Meta-data format
2. Offloading features, like TSO
3. Acceleration API
4. Virtualization support