ixgbe Internals
SUSE Labs Taipei Technology Sharing Day 2018
Gary Lin
Software Engineer, SUSE Labs
glin@suse.com
ixgbe?
Intel 10 Gigabit PCI Express Linux Driver
Why ixgbe?
● The first driver supporting XDP_REDIRECT in the Linux kernel mainline
● Using a different memory model for XDP
● Just For Fun™
Overview
NIC
driver
Application
Network Stack
QDISC
From Driver to AP
TX
RX
NAPI
TX – QDISC
[Figure: packets sent from CPU 1 ... CPU n pass through the QDISC; netdev_pick_tx() + ndo_start_xmit() place them on TX Q 1 ... TX Q n, each backed by a ring buffer.]
RX – NAPI
[Figure: the network card DMAs packets into the RX queue ring buffers (RX Q 1 ... RX Q n) in RAM and raises an IRQ; napi_poll() on CPU i polls the ring buffer, and enqueue_to_backlog() hands packets to the per-CPU queues (CPU Q 1 ... CPU Q n) of CPU 1 ... CPU n.]
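The interrupt-then-poll pattern on this slide can be modeled in user space. This is a sketch, not kernel code: demo_rx_queue, demo_irq(), and demo_napi_poll() are hypothetical names, and it assumes the usual NAPI convention that a poll routine handles at most `budget` packets and switches back to interrupt mode only when the ring drained before the budget was used up.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for one RX queue; not an ixgbe structure. */
struct demo_rx_queue {
	int pending;       /* packets the NIC has already DMA'd into the ring */
	bool irq_enabled;  /* whether the queue may raise another interrupt */
};

/* IRQ handler: mask further interrupts and leave the work to polling. */
static void demo_irq(struct demo_rx_queue *q)
{
	q->irq_enabled = false;
}

/* Poll routine: process at most `budget` packets and return how many. */
static int demo_napi_poll(struct demo_rx_queue *q, int budget)
{
	int work_done = 0;

	assert(budget > 0);
	while (work_done < budget && q->pending > 0) {
		q->pending--;   /* stand-in for handing one packet up the stack */
		work_done++;
	}

	/* Ring drained before the budget ran out: return to interrupt mode. */
	if (work_done < budget)
		q->irq_enabled = true;

	return work_done;
}
```

With 100 pending packets and a budget of 64, the first poll uses the whole budget and stays in polling mode; the second poll handles the remaining 36 and re-enables the interrupt.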
Ring Buffers in ixgbe
TX Ring
RX Ring
XDP Ring
Legacy Interrupt
[Figure: one IRQ q_vector serves RX Q 1 / RX Ring 1, RX Q 2 / RX Ring 2, TX Q 1 / TX Ring 1, and TX Q 2 / TX Ring 2.]
MSI-X
[Figure: several IRQ q_vectors, each serving its own RX Q / RX Ring and/or TX Q / TX Ring.]
Ring Buffer
buffer1
buffer2
buffer3
buffer4
next_to_use
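A minimal user-space sketch of a descriptor ring driven by a next_to_use index. The slide names only next_to_use; the consumer index next_to_clean, the `count` field, and every demo_* name are assumptions made for illustration, and the ring is kept at four slots to match the figure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_RING_SIZE 4u  /* four slots, as drawn; real rings are larger */

/* Hypothetical ring; `count` is a simplification for full/empty checks. */
struct demo_ring {
	void *buffer[DEMO_RING_SIZE];
	uint16_t next_to_use;    /* next slot the producer will fill */
	uint16_t next_to_clean;  /* next slot the consumer will release */
	uint16_t count;          /* slots currently in flight */
};

/* Advance an index by one slot, wrapping at the end of the ring. */
static uint16_t demo_ring_next(uint16_t i)
{
	return (uint16_t)((i + 1u) % DEMO_RING_SIZE);
}

/* Producer side: place a buffer at next_to_use, or fail if the ring is full. */
static bool demo_ring_post(struct demo_ring *r, void *data)
{
	assert(data != NULL);
	if (r->count == DEMO_RING_SIZE)
		return false;
	r->buffer[r->next_to_use] = data;
	r->next_to_use = demo_ring_next(r->next_to_use);
	r->count++;
	return true;
}

/* Consumer side: take the oldest buffer back, or NULL if the ring is empty. */
static void *demo_ring_clean(struct demo_ring *r)
{
	void *data;

	if (r->count == 0)
		return NULL;
	data = r->buffer[r->next_to_clean];
	r->buffer[r->next_to_clean] = NULL;
	r->next_to_clean = demo_ring_next(r->next_to_clean);
	r->count--;
	return data;
}
```

Both indices only move forward and wrap, so buffers are released in the order they were posted.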
TX Ring
ixgbe_tx_buffer
sk_buff
QDISC
struct sk_buff *skb
unsigned int bytecount
XDP Ring
ixgbe_tx_buffer
void *data
xdp_buff
XDP
unsigned int bytecount
void *data
TX Ring + XDP Ring
ixgbe_q_vector
struct ixgbe_ring_container tx
TX Ring 1
TX Ring 2
TX Ring n
XDP Ring 1
XDP Ring 2
XDP Ring n
RX Ring
ixgbe_rx_buffer
struct page *page
__u32 page_offset
__u16 pagecnt_bias
page
dev_alloc_pages()
RX Ring
ixgbe_rx_buffer
struct page *page
__u32 page_offset
__u16 pagecnt_bias
page
DMA
RX Ring
ixgbe_rx_buffer
struct page *page
__u32 page_offset
__u16 pagecnt_bias
page
SKB
RX Ring
ixgbe_rx_buffer
struct page *page
__u32 page_offset
__u16 pagecnt_bias
page
Flip
Flip: page_offset ^= page_size / 2
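The flip can be checked with a few lines of user-space C. This sketch assumes a 4096-byte page split into two halves; DEMO_PAGE_SIZE and demo_flip() are hypothetical names, not driver symbols.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u  /* assumed page size for this example */

/* Toggle between the lower half (offset 0) and the upper half of the page. */
static uint32_t demo_flip(uint32_t page_offset)
{
	const uint32_t half = DEMO_PAGE_SIZE / 2;

	assert(page_offset == 0 || page_offset == half);
	return page_offset ^ half;
}
```

Because XOR with the half-page size is its own inverse, flipping twice returns to the original half, so the driver needs no branch to decide which half comes next.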
RX Ring
ixgbe_rx_buffer
struct page *page
__u32 page_offset
__u16 pagecnt_bias
page
RX Ring – Recycle
ixgbe_rx_buffer
struct page *page
__u32 page_offset
__u16 pagecnt_bias
page
Flip
RX Ring – Replace
ixgbe_rx_buffer
struct page *page
__u32 page_offset
__u16 pagecnt_bias
page
page
Page Ref Count
● Tracking the number of users of the page
● Possible page ref counts (ideally):
– 1: The whole page is available.
– 2: One half of the page is in use.
– 3: The whole page is in use.
Page Count Operations
static inline void set_page_count(struct page *page, int v)
{
	atomic_set(&page->_refcount, v);
	if (page_ref_tracepoint_active(__tracepoint_page_ref_set))
		__page_ref_set(page, v);
}

static inline void page_ref_add(struct page *page, int nr)
{
	atomic_add(nr, &page->_refcount);
	if (page_ref_tracepoint_active(__tracepoint_page_ref_mod))
		__page_ref_mod(page, nr);
}

static inline void page_ref_sub(struct page *page, int nr)
{
	atomic_sub(nr, &page->_refcount);
	if (page_ref_tracepoint_active(__tracepoint_page_ref_mod))
		__page_ref_mod(page, -nr);
}

Atomic operations are expensive!
Adjusted Ref Count (1/3)
● A locally maintained pagecnt_bias for the RX page
● Initial value of pagecnt_bias: 1
● adj_pagecnt = pagecnt - pagecnt_bias
– 0: The whole page is available.
– 1: One half of the page is in use.
– 2: The whole page is in use.
Adjusted Ref Count (2/3)
● Harvesting a packet: pagecnt_bias--
● Recycling the page: pagecnt_bias++
– The XDP program returns XDP_DROP or an error.
– The packet is small enough to be copied into the allocated skb.
– The packet fails to be packed into an skb.
● If pagecnt_bias == 0, set pagecnt_bias to USHRT_MAX and add USHRT_MAX to pagecnt.
Adjusted Ref Count (3/3)
● Packet consumed: pagecnt--
– The packet is consumed by the network stack.
– XDP_TX or XDP_REDIRECT is completed.
● Releasing the page: pagecnt -= pagecnt_bias
– With the help of __page_frag_cache_drain()
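The adjusted ref count rules across the three slides can be walked through in a small user-space model. This is only a sketch: the demo_* names are hypothetical, and a plain int stands in for the page's atomic _refcount so that the arithmetic is easy to follow.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical stand-ins; a plain int replaces the atomic page->_refcount. */
struct demo_page {
	int pagecnt;
};

struct demo_rx_buffer {
	struct demo_page *page;
	uint16_t pagecnt_bias;
};

/* Fresh page: pagecnt = 1 and pagecnt_bias = 1, so the adjusted count is 0. */
static void demo_init(struct demo_rx_buffer *b, struct demo_page *p)
{
	p->pagecnt = 1;
	b->page = p;
	b->pagecnt_bias = 1;
}

/* adj_pagecnt = pagecnt - pagecnt_bias */
static int demo_adj_pagecnt(const struct demo_rx_buffer *b)
{
	return b->page->pagecnt - b->pagecnt_bias;
}

/* Harvesting a packet hands one reference to the packet's new owner. */
static void demo_harvest(struct demo_rx_buffer *b)
{
	assert(b->pagecnt_bias > 0);
	b->pagecnt_bias--;
}

/* Recycling (e.g. XDP_DROP) takes that reference back with a local increment. */
static void demo_recycle(struct demo_rx_buffer *b)
{
	b->pagecnt_bias++;
}

/* When the bias runs out, top up both counters in one bulk step. */
static void demo_refill_bias(struct demo_rx_buffer *b)
{
	if (b->pagecnt_bias == 0) {
		b->page->pagecnt += USHRT_MAX;
		b->pagecnt_bias = USHRT_MAX;
	}
}

/* The stack, or an XDP_TX/XDP_REDIRECT completion, drops its reference. */
static void demo_consumed(struct demo_page *p)
{
	p->pagecnt--;
}

/* Releasing the page returns every reference the driver still holds. */
static void demo_release(struct demo_rx_buffer *b)
{
	b->page->pagecnt -= b->pagecnt_bias;
	b->pagecnt_bias = 0;
}
```

In this model only demo_refill_bias(), demo_consumed(), and demo_release() touch the shared count; harvesting and recycling change just the driver-local bias, which is the point of the scheme.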
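XDP: eXpress Data Path
[Figure: inside the NIC driver, packets from the RX Q are handed to the eBPF program. PASS sends the packet to the network stack, DROP discards it, TX sends it to the TX Q, and REDIRECT sends it to another CPU or NIC.]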
Page 1 Page 2 Page 3
Conventional RX Buffer
Page 1 Page 2 Page 3
One Packet Per Page
Memory Model Switch
XDP
NOTE: This applies to drivers other than ixgbe.
For ixgbe, a memory model switch is not necessary!
XDP in ixgbe
eBPF
Program
XDP_PASS
Network
Stack
RX Ring
XDP Ring
XDP_DROP
XDP_TX
XDP_REDIRECT
Other CPU or NIC
XDP_REDIRECT
ixgbe_xdp_xmit()
xdp_do_redirect()
ixgbe_xmit_xdp_ring()
pagecnt_bias++
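The verdict handling on these slides amounts to a switch on the eBPF program's return value. The following user-space sketch mirrors that shape. The DEMO_XDP_* values follow enum xdp_action from the kernel UAPI header linux/bpf.h, while enum demo_path and demo_dispatch() are hypothetical names for illustration, and failures of the transmit or redirect calls are left out.

```c
#include <assert.h>

/* Values mirror enum xdp_action in the kernel UAPI header linux/bpf.h. */
enum demo_xdp_action {
	DEMO_XDP_ABORTED = 0,
	DEMO_XDP_DROP,
	DEMO_XDP_PASS,
	DEMO_XDP_TX,
	DEMO_XDP_REDIRECT,
};

/* Hypothetical labels for where the RX buffer goes next. */
enum demo_path {
	DEMO_TO_STACK,     /* XDP_PASS: build an skb for the network stack */
	DEMO_TO_XDP_RING,  /* XDP_TX: ixgbe_xmit_xdp_ring() on the XDP ring */
	DEMO_TO_REDIRECT,  /* XDP_REDIRECT: xdp_do_redirect() to another CPU or NIC */
	DEMO_RECYCLE,      /* drop or error: pagecnt_bias++ and reuse the buffer */
};

static enum demo_path demo_dispatch(unsigned int verdict)
{
	switch (verdict) {
	case DEMO_XDP_PASS:
		return DEMO_TO_STACK;
	case DEMO_XDP_TX:
		return DEMO_TO_XDP_RING;
	case DEMO_XDP_REDIRECT:
		return DEMO_TO_REDIRECT;
	case DEMO_XDP_DROP:
	case DEMO_XDP_ABORTED:
	default:
		/* Drops, aborts, and unknown verdicts all give the buffer back. */
		return DEMO_RECYCLE;
	}
}
```

Treating unknown verdicts like a drop keeps the buffer accounting simple: any path that does not hand the page to someone else ends in the same recycle step.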
Incoming New Features
● XDP for ixgbevf (linux-next)
– ixgbe blocks XDP if SR-IOV is enabled.
● XDP redirect memory return API (net-next)
– Managing pages across drivers
– Adopted by ixgbe, i40e, mlx5, tuntap, and virtio_net
– Preparing for the AF_XDP zero-copy patch set
– ixgbe tweaked the page ref counting scheme for the new API.
Questions?
Thank You!
References

Linux kernel v4.15
https://github.com/torvalds/linux/tree/v4.15/drivers/net/ethernet/intel/ixgbe

[0/5] Enable XDP for ixgbevf
http://patchwork.ozlabs.org/cover/887197/

[net-next V11 PATCH 00/17] XDP redirect memory return API
https://www.spinics.net/lists/netdev/msg495995.html

ixgbe: tweak page counting for XDP_REDIRECT
https://patchwork.ozlabs.org/patch/889261/

Monitoring and Tuning the Linux Networking Stack: Sending Data
https://blog.packagecloud.io/eng/2017/02/06/monitoring-tuning-linux-networking-stack-sending-data/

Monitoring and Tuning the Linux Networking Stack: Receiving Data
https://blog.packagecloud.io/eng/2016/06/22/monitoring-tuning-linux-networking-stack-receiving-data/
