Linux Network Stack

Adrien Mahieux
System Engineer (Soft to Hard) & Linux expert at BNP Paribas Corporate and Institutional Banking
Linux Network Stack
To Hell and back
Sysadmin #8
Adrien Mahieux - Performance Engineer (nanosecond hunter)
gh: github.com/Saruspete
tw: @Saruspete
em: adrien.mahieux@gmail.com
Summary
A lot of text...
- What’s available atm ?
- Definitions
- How does it really work ?
- Monitoring
- Configuration
- What I’d like to have...
And a big graph !
- Receive Flow
- (Transmit Flow)
Disclaimer : this is not a presentation
Timeline....
- Renchap proposed this talk
- Nice, should be interesting
- Should be easy
- Indexed source with opengrok
- Should be lots of comments...
- I “know” globally how it works
- (turns out, nope…)
- Let’s do it very complete !
- 1 version for sysadmin, 1 for Kernel dev.
- Hey hardware interrupts… vectors ?
- Drivers ? NAPI ? DCA ? XDP ?
- WTF SO MUCH FUNC POINTERS
- Wait, sysadmindays are tomorrow ?!
What’s available atm ?
- Lots of tutorials
- Lots of LKML discussions
- Lots of (copy/pasted) code
- Applications using black magic
- Robust and tested protocols
- New tools to explore
- Clear and detailed graphs…
Linux Network Stack
ftrace shows a nice stack
net_rx_action() {
  ixgbe_poll() {
    ixgbe_clean_tx_irq();
    ixgbe_clean_rx_irq() {
      ixgbe_fetch_rx_buffer() {
        ... // allocate buffer for packet
      } // returns the buffer containing packet data
      ... // housekeeping
      napi_gro_receive() {
        // generic receive offload
        dev_gro_receive() {
          inet_gro_receive() {
            udp4_gro_receive() {
              udp_gro_receive();
            }
          }
        }
        napi_skb_finish() {
          netif_receive_skb_internal() {
            __netif_receive_skb() {
              __netif_receive_skb_core() {
                ...
                ip_rcv() {
                  ...
                  ip_rcv_finish() {
                    ...
                    ip_local_deliver() {
                      ip_local_deliver_finish() {
                        raw_local_deliver();
                        udp_rcv() {
                          __udp4_lib_rcv() {
                            __udp4_lib_mcast_deliver() {
                              ...
                              // clone skb & deliver
                              flush_stack() {
                                udp_queue_rcv_skb() {
                                  ... // data preparation
                                  // deliver UDP packet
                                  // check if buffer is full
                                  __udp_queue_rcv_skb() {
                                    // deliver to socket queue
                                    // check for delivery error
                                    sock_queue_rcv_skb() {
                                      ...
                                      _raw_spin_lock_irqsave();
                                      // enqueue packet to socket buffer list
                                      _raw_spin_unlock_irqrestore();
                                      // wake up listeners
                                      sock_def_readable() {
                                        __wake_up_sync_key() {
                                          _raw_spin_lock_irqsave();
                                          __wake_up_common() {
                                            ep_poll_callback() {
                                              ...
                                              _raw_spin_unlock_irqrestore();
                                            }
                                          }
                                          _raw_spin_unlock_irqrestore();
                                        }
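A trace like this one can be reproduced with the function_graph tracer. A minimal sketch using the raw tracefs interface (paths assume tracefs is mounted under /sys/kernel/debug/tracing; the entry-point function is just an example):

# cd /sys/kernel/debug/tracing
# echo function_graph > current_tracer
# echo net_rx_action > set_graph_function   # only trace below this entry point
# echo 1 > tracing_on
# sleep 1                                   # let some traffic flow
# echo 0 > tracing_on
# less trace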
Plenty of comments !
* struct net_device - The DEVICE structure.
*
* Actually, this whole structure is a big mistake. It mixes I/O
* data with strictly "high-level" data, and it has to know about
* almost every data structure used in the INET module.
/* Accept zero addresses only to limited broadcast;
* I even do not know to fix it or not. Waiting for complains :-)
/* An explanation is required here, I think.
* Packet length and doff are validated by header prediction,
* Really tricky (and requiring careful tuning) part of algorithm
* is hidden in functions tcp_time_to_recover() and
tcp_xmit_retransmit_queue().
/* skb reference here is a bit tricky to get right, since
* shifting can eat and free both this skb and the next,
* so not even _safe variant of the loop is enough.
/* Here begins the tricky part :
* We are called from release_sock() with :
/* This is TIME_WAIT assassination, in two flavors.
* Oh well... nobody has a sufficient solution to this
* protocol bug yet.
BUG(); /* "Please do not press this button again." */
/* The socket is already corked while preparing it. */
/* ... which is an evident application bug. --ANK */
/* Ugly, but we have no choice with this interface.
* Duplicate old header, fix ihl, length etc.
* Parse and mangle SNMP message according to mapping.
* (And this is the fucking 'basic' method).
/* 2. Fixups made earlier cannot be right.
* If we do not estimate RTO correctly without them,
* all the algo is pure shit and should be replaced
* with correct one. It is exactly, which we pretend to do.
/* OK, ACK is valid, create big socket and
* feed this segment to it. It will repeat all
* the tests. THIS SEGMENT MUST MOVE SOCKET TO
* ESTABLISHED STATE. If it will be dropped after
* socket is created, wait for troubles.
* packets force peer to delay ACKs and calculation is correct too.
* The algorithm is adaptive and, provided we follow specs, it
* NEVER underestimate RTT. BUT! If peer tries to make some clever
* tricks sort of "quick acks" for time long enough to decrease RTT
* to low value, and then abruptly stops to do it and starts to delay
* ACKs, wait for troubles.
Let’s do it myself !
Definitions
- Intel’s “ixgbe” 10G driver + MSI-x mode + Kernel 4.14.68
- Well-tested in production
- Fairly generic and available on many systems
- 10G PCIe is now the default in high-workload servers
- 10G drivers have multiple queues
- MSI-x is the current standard (nice for VMs)
- Kernel 4.14 is the current LTS at the time of analysis
Definitions (keep it handy)
Optimization                                          | Hardware              | Software
Distribute load of 1 NIC across multiple CPUs         | RSS (multiple queues) | RPS (only 1 queue)
CPU locality between consumer app and net processing  | aRFS                  | RFS
Select queue to transmit packets                       |                       | XPS
Combining data of “similar enough” packets             | LRO                   | GRO
Push data in CPU cache                                 | DCA                   |
Processing packets before reaching higher protocols    |                       | XDP + BPF
NIC : Network Interface Controller
INTx : Legacy interrupt
MSI : Message Signaled Interrupt
MSI-x : MSI extended
VF : Virtual Function
Coalescing : delaying events to group them in one shot
RingBuffer : pre-defined area of memory for data
DMA : Direct Memory Access
DCA : Direct Cache Access
XDP : eXpress Data Path
BPF : Berkeley Packet Filter
RSS : Receive Side Scaling (multiqueue)
RPS : Receive Packet Steering (CPU distribution)
RFS : Receive Flow Steering
aRFS : Accelerated RFS
XPS : Transmit Packet Steering
LRO : Large Receive Offload
GRO : Generic Receive Offload
GSO : Generic Segmentation Offload
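A minimal sketch of where these knobs surface on a running system (eth0 is a placeholder interface name):

# ethtool -l eth0                                  # RSS: number of hardware queues (channels)
# ethtool -x eth0                                  # RSS: indirection table and hash key
# cat /sys/class/net/eth0/queues/rx-0/rps_cpus     # RPS: CPU mask for software steering of rx-0
# cat /sys/class/net/eth0/queues/tx-0/xps_cpus     # XPS: CPU mask for transmit queue selection
# ethtool -k eth0 | grep -E 'large-receive-offload|generic-receive-offload'   # LRO / GRO state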
How does it really work ?
It’s a stack of multiple layers
SOFTWARE L7: socket, write
SYSCALL libc: vfs write, sendmsg, sendmmsg
TCP/UDP L4+: Session, congestion, state, membership, timeouts...
ROUTING L3: IPv4/6 IP routing
DRIVER L2: MAC addr, offloading, checksum...
DEVICE L1: hardware, interrupts, statistics, duplex, ethtool...
Let’s get physical
Speed and duplex negotiation : the speed and duplex used by the NIC for
transfers are determined by autonegotiation (required for 1000Base-T).
If only one device can do autoneg, it will detect and match the speed of the other,
but assume half-duplex.
Autoneg is based on pulses to detect the presence of another device. Called
Normal Link Pulses (NLP), they are generated every 16ms and must be sent if
no packet is being transmitted. If neither a packet nor an NLP is received for
50-150ms, a link failure is detected.
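To check or force these settings from userspace, a hedged sketch (eth0 is a placeholder; forcing speed/duplex only makes sense when both ends are configured consistently):

# ethtool eth0                                          # current speed, duplex, autoneg and link state
# ethtool -s eth0 autoneg on                            # re-enable autonegotiation
# ethtool -s eth0 autoneg off speed 100 duplex full     # force 100M full-duplex (both sides must match)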
HardIRQ / Interrupts
Level-triggered vs edge-triggered
- Edge-triggered interrupts are generated once upon event
arrival, by an edge (rising or falling electrical voltage). The event
needs to happen again to generate a new interrupt.
This is used to reflect an occurring event. In /proc/interrupts,
this is shown as “IO-APIC-edge” or “PCI-MSI-edge”.
Remember the time you configured your SoundBlaster to “IRQ
5” and “DMA 1” ? If multiple cards were on the same IRQ line,
their signals collided and were lost.
- Level-triggered interrupts keep being generated until acknowledged
in the programmable interrupt controller.
This is mostly used to reflect a state. In /proc/interrupts, this is
shown as “IO-APIC-level” or “IO-APIC-fasteoi” (end of interrupt).
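The trigger mode of each line can be read straight from /proc/interrupts; a quick sketch (the interface name in the grep is a placeholder):

# grep -E 'edge|fasteoi|level' /proc/interrupts | less   # trigger type shown next to the controller name
# grep eth0 /proc/interrupts                             # per-CPU counters for the NIC’s MSI-x vectors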
Standard / fixed IRQs
- 0 - Timer
- 1 - PS2 Keyboard (i8042)
- 2 - cascade
- 3,4 - Serial
- 5 - (Modem)
- 6 - Floppy
- 7 - (Parallel port for printer)
- 8 - RealTime Clock
- 9 - (ACPI / sound card)
- 10 - (sound / network card)
- 11 - (Video Card)
- 12 - (PS2 Mouse)
- 13 - Math Co-Processor
- 14 - EIDE disk chain 1
- 15 - EIDE disk chain 2
SoftIRQ / Ksoftirqd
- In realtime kernels, the hard IRQ is also handled by a kernel thread, to allow
priority management.
- While processing a critical code path, interrupt handling can be disabled; NMIs
and SMIs, however, cannot be disabled.
- Softirqs process jobs in a specified order (kernel/softirq.c :: softirq_to_name[]).
- Most interrupts are handled per CPU: you’ll find one “ksoftirqd/X” thread per CPU,
where X is the CPU-ID
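A sketch of watching softirq activity (counters are per CPU; NET_RX and NET_TX are the networking ones):

# cat /proc/softirqs                                    # per-CPU counters for each softirq type
# watch -d -n1 'grep -E "NET_RX|NET_TX" /proc/softirqs' # highlight which CPUs do the network work
# ps -eLo pid,psr,comm | grep ksoftirqd                 # one ksoftirqd/X thread per CPU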
Netfilter Hooks
Hook overhead is considered low, as no processing is done if
no rule is loaded.
Overhead is even lower when jump labels are used:
CC_HAVE_ASM_GOTO && CONFIG_JUMP_LABEL
⇒ #define HAVE_JUMP_LABEL
The recurring pattern for functions and callbacks using NF_HOOK:
funcname ⇒ NF_HOOK ⇒ funcname_finish
static inline int nf_hook(...) {
    struct nf_hook_entries *hook_head;
    int ret = 1;

#ifdef HAVE_JUMP_LABEL
    if (__builtin_constant_p(pf) &&
        __builtin_constant_p(hook) &&
        !static_key_false(&nf_hooks_needed[pf][hook]))
        return 1;
#endif

    rcu_read_lock();
    hook_head = rcu_dereference(net->nf.hooks[pf][hook]);
    if (hook_head) {
        struct nf_hook_state state;

        nf_hook_state_init(&state, hook, pf,
                           indev, outdev,
                           sk, net, okfn);

        ret = nf_hook_slow(skb, &state, hook_head, 0);
    }
    rcu_read_unlock();

    return ret;
}
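To check whether the jump-label optimisation is compiled into a given kernel, a hedged sketch (config file locations vary by distribution; both paths below are common conventions):

# grep CONFIG_JUMP_LABEL /boot/config-$(uname -r)   # =y when static keys are available
# zgrep CONFIG_JUMP_LABEL /proc/config.gz           # alternative when the config is exported here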
Monitoring
ethtool
ethtool manages the NIC internals: PHY, firmware, counters, indirection table, ntuple
table, ring-buffer, WoL...
It’s the driver’s duty to provide these features through “struct ethtool_ops”
ethtool -i $iface : details on the iface (kmod, firmware, version…)
ethtool -c $iface : Interrupt coalescing values
ethtool -k $iface : Features (pkt-rate, GRO, TSO…)
ethtool -g $iface : Show the ring-buffer sizes
ethtool -n $iface : Show ntuple configuration table
ethtool -S $iface : Show NIC internal statistics (driver-dependent)
ethtool -S | --statistics
ethtool -S $iface : shows custom interface statistics. There is no standard;
counter names and meanings depend on the driver (a way to watch them live is
sketched after the listing below).
tx_queue_0_packets: 15742031
tx_queue_0_bytes: 13149125228
tx_queue_0_restart: 0
tx_queue_1_packets: 14280094
tx_queue_1_bytes: 6326351422
tx_queue_1_restart: 0
tx_queue_2_packets: 33093890
tx_queue_2_bytes: 29099316424
tx_queue_2_restart: 0
tx_queue_3_packets: 6375636
tx_queue_3_bytes: 3191645669
tx_queue_3_restart: 0
rx_queue_0_packets: 32746660
rx_queue_0_bytes: 32200636085
rx_queue_0_drops: 0
rx_queue_0_csum_err: 0
rx_queue_0_alloc_failed: 0
rx_queue_1_packets: 12761541
rx_queue_1_bytes: 2017306804
rx_queue_1_drops: 0
rx_queue_1_csum_err: 0
rx_queue_1_alloc_failed: 0
rx_queue_2_packets: 17801690
rx_queue_2_bytes: 4863010445
rx_queue_2_drops: 0
rx_queue_2_csum_err: 0
rx_queue_2_alloc_failed: 0
rx_queue_3_packets: 8640774
rx_queue_3_bytes: 2566051868
rx_queue_3_drops: 0
rx_queue_3_csum_err: 0
rx_queue_3_alloc_failed: 0
tx_timeout_count: 0
rx_long_length_errors: 0
rx_short_length_errors: 0
rx_align_errors: 0
tx_tcp_seg_good: 4266569
tx_tcp_seg_failed: 0
rx_flow_control_xon: 0
rx_flow_control_xoff: 0
tx_flow_control_xon: 0
tx_flow_control_xoff: 0
rx_long_byte_count: 42215356407
tx_dma_out_of_sync: 0
tx_smbus: 12468
rx_smbus: 479
dropped_smbus: 0
os2bmc_rx_by_bmc: 336860
os2bmc_tx_by_bmc: 12468
os2bmc_tx_by_host: 336860
os2bmc_rx_by_host: 12468
tx_hwtstamp_timeouts: 0
tx_hwtstamp_skipped: 0
rx_hwtstamp_cleared: 0
rx_errors: 0
tx_errors: 0
tx_dropped: 0
rx_length_errors: 0
rx_over_errors: 0
rx_frame_errors: 0
rx_fifo_errors: 0
tx_fifo_errors: 0
tx_heartbeat_errors: 0
rx_packets: 71938676
tx_packets: 69504119
rx_bytes: 42215356407
tx_bytes: 52355860539
rx_broadcast: 675957
tx_broadcast: 346481
rx_multicast: 1245202
tx_multicast: 2847
multicast: 1245202
collisions: 0
rx_crc_errors: 0
rx_no_buffer_count: 0
rx_missed_errors: 0
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_window_errors: 0
tx_abort_late_coll: 0
tx_deferred_ok: 0
tx_single_coll_ok: 0
tx_multi_coll_ok: 0
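Since the counter names are driver-specific, a pragmatic approach is to watch the ones that should stay flat; a hedged sketch (eth0 and the grep patterns are examples):

# watch -d -n1 'ethtool -S eth0 | grep -E "drop|err|fail|restart"'   # highlight counters that move between refreshes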
ethtool -x | --show-rxfh-indir
ethtool -x enp5s0
RX flow hash indirection table for enp5s0 with 4 RX ring(s):
0: 0 0 0 0 0 0 0 0
8: 0 0 0 0 0 0 0 0
16: 0 0 0 0 0 0 0 0
24: 0 0 0 0 0 0 0 0
32: 1 1 1 1 1 1 1 1
40: 1 1 1 1 1 1 1 1
48: 1 1 1 1 1 1 1 1
56: 1 1 1 1 1 1 1 1
64: 2 2 2 2 2 2 2 2
72: 2 2 2 2 2 2 2 2
80: 2 2 2 2 2 2 2 2
88: 2 2 2 2 2 2 2 2
96: 3 3 3 3 3 3 3 3
104: 3 3 3 3 3 3 3 3
112: 3 3 3 3 3 3 3 3
120: 3 3 3 3 3 3 3 3
RSS hash key:
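The indirection table can also be rewritten with the uppercase counterpart, -X; a hedged sketch (enp5s0 as above, values are examples):

# ethtool -X enp5s0 equal 4          # spread the table entries evenly over 4 queues
# ethtool -X enp5s0 weight 3 1 1 1   # bias the spread towards queue 0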
procfs
/proc/net contains a lot of statistics on
many parts of the network… but most of
them are undocumented:
- netlink
- netstat : protocol stats
- ptype : registered transports
- snmp : details for multiple protocols
- softnet_stat : NAPI polling
/proc/interrupts : show the CPU / interrupt matrix table
/proc/irq/$irq/smp_affinity : see & set the CPU mask for interrupt processing
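A sketch of pinning an interrupt to a given CPU through procfs (the IRQ number and mask are examples; irqbalance may overwrite manual settings):

# grep eth0 /proc/interrupts          # find the IRQ numbers used by the NIC queues
# cat /proc/irq/42/smp_affinity       # current CPU bitmask (hex)
# echo 4 > /proc/irq/42/smp_affinity  # allow only CPU 2 (bit 2) to handle IRQ 42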
sysfs
/sys/class/net/$iface/ : Details on the interface status (duplex, speed, mtu..)
/sys/class/net/$iface/statistics/ : file-based access to standard counters
/sys/class/net/$iface/queues/[rt]x-[0-9]/ : RPS / XPS configuration and display
Configuration
ethtool
Every lowercase option that shows a category of parameters has an uppercase
counterpart to set the same category of parameters.
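A few hedged examples of this lowercase/uppercase pairing (eth0 and the values are placeholders, and not every driver accepts every setting):

# ethtool -g eth0 && ethtool -G eth0 rx 4096 tx 4096   # show, then resize, the ring buffers
# ethtool -k eth0 && ethtool -K eth0 gro off lro off   # show, then toggle, offload features
# ethtool -c eth0 && ethtool -C eth0 rx-usecs 50       # show, then tune, interrupt coalescing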
ethtool bonus - Built-in firewall
ethtool -N eth4 flow-type udp4 src-ip 10.10.10.100 m 255.255.255.0 action -1
ethtool -n eth4
4 RX rings available
Total 1 rules
Filter: 8189
Rule Type: UDP over IPv4
Src IP addr: 0.0.0.100 mask: 255.255.255.0
Dest IP addr: 0.0.0.0 mask: 255.255.255.255
TOS: 0x0 mask: 0xff
Src port: 0 mask: 0xffff
Dest port: 0 mask: 0xffff
VLAN EtherType: 0x0 mask: 0xffff
VLAN: 0x0 mask: 0xffff
User-defined: 0x0 mask: 0xffffffffffffffff
Action: Drop
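Rules are referenced by the filter ID reported by -n; a hedged sketch of managing them (eth4 and the filter ID match the example above, the queue number and port are examples):

# ethtool -N eth4 delete 8189                            # remove the drop rule shown above
# ethtool -N eth4 flow-type udp4 dst-port 4789 action 2  # steer matching packets to RX queue 2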
netlink
Special socket type for communicating with the network stack
Used by iproute2 (ip), and any tool that wants to configure it
Also used to monitor events (ip monitor, nltrace)
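A sketch of the netlink-based tooling in action (the interface name is a placeholder):

# ip -s link show dev eth0       # interface statistics fetched over netlink
# ip monitor link address route  # print netlink events (link, address, route changes) as they happen
# ss -ti                         # TCP socket details, also retrieved via netlink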
sysctls
Tweaks most values of the network stack
- Polling budget
- TCP features & timeouts
- Buffer sizes
- Forwarding
- Reverse Path filtering
- Routing Table cache size
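A few hedged examples of the corresponding sysctl keys (values are illustrative, not recommendations):

# sysctl net.core.netdev_budget             # softirq polling budget per NET_RX round
# sysctl -w net.core.rmem_max=16777216      # maximum receive buffer size
# sysctl -w net.ipv4.ip_forward=1           # enable forwarding
# sysctl net.ipv4.conf.all.rp_filter        # reverse path filtering mode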
sysfs
/sys/class/net/$iface/
/statistics : file-based access to standard counters
/queues/tx-* : show and configure XPS + rate limits
/queues/rx-* : show and configure RPS
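A sketch of configuring RPS and XPS through these files (eth0, queue numbers and masks are examples; the masks are hexadecimal CPU bitmasks):

# echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus   # let CPUs 0-3 process packets from rx-0
# echo 3 > /sys/class/net/eth0/queues/tx-0/xps_cpus   # map tx-0 to CPUs 0-1
# cat /sys/class/net/eth0/queues/tx-0/tx_maxrate      # per-queue rate limit (0 = unlimited)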
Under the hood
Global schema… in progress
Global schema… please contribute
That’s a lot of work ! I need your help…
- Complete the schema to map the most used protocols
- Add more details to the documentation
- Make the schema more interactive to have a live view from the OS
- Reference other schemas and map their details
Documents and analysis : https://github.com/saruspete/LinuxNetworking
Source Code browser : https://dev.mahieux.net/kernelgrok/
Schema made with: https://draw.io
Thank you
Bibliography
https://blog.packagecloud.io/eng/2016/06/22/monitoring-tuning-linux-networking-stack-receiving-data/
https://blog.packagecloud.io/eng/2017/02/06/monitoring-tuning-linux-networking-stack-sending-data/
https://blog.packagecloud.io/eng/2016/10/11/monitoring-tuning-linux-networking-stack-receiving-data-illustrated/
https://blog.cloudflare.com/how-to-receive-a-million-packets/
https://blog.cloudflare.com/how-to-drop-10-million-packets/
https://blog.cloudflare.com/the-story-of-one-latency-spike/
https://wiki.linuxfoundation.org/networking/kernel_flow
https://www.lmax.com/blog/staff-blogs/2016/05/06/navigating-linux-kernel-network-stack-receive-path/
http://www.jauu.net/talks/data/runtime-analyse-of-linux-network-stack.pdf
https://www.xilinx.com/Attachment/Xilinx_Answer_58495_PCIe_Interrupt_Debugging_Guide.pdf
https://www.cubrid.org/blog/understanding-tcp-ip-network-stack
https://events.static.linuxfound.org/sites/events/files/slides/Chaiken_ELCE2016.pdf
Editor's Notes

  1. RSS: distribution is usually done by calculating a hash (Toeplitz) of “ipsrc/ipdst/portsrc/portdst” + a random rss_key, using its low-order bits as an index into an indirection table to read the target queue. It may also be done by programmable filters on advanced NICs. RPS: interfaces can also compute the hash beforehand and put it in skb.rxhash (to avoid computation on the CPU and a cache miss). CPU mask from /sys/class/net/ethX/queues/rx-Y/rps_cpus. XPS: configuration from /sys/class/net/ethX/queues/tx-Y/xps_cpus
  2. Top/hard IRQ is the actual handling of the IRQ. An interrupt stops the currently running process to do its job. It must start as soon as possible, to avoid filling the buffer holding the data to be processed, and finish as quickly as possible, to avoid interrupting the running process for too long (the time is stolen from its scheduling slice).
  3. Bottom/soft IRQ is the actual processing of the data (network, IO…), often under the ksoftirqd kernel process; in our case, the napi_schedule() function (as the processing loop).