Copyright © 2016 NTT Corp. All Rights Reserved.
Boost UDP Transaction Performance
Toshiaki Makita
NTT Open Source Software Center
• Background
• Basic technologies for network performance
• How to improve UDP performance
Today's topics
• Linux kernel engineer at NTT Open Source Software Center
• Technical support for NTT group companies
• Active patch submitter on kernel networking subsystem
Who is Toshiaki Makita?
Background
• Services using UDP
• DNS
• RADIUS
• NTP
• SNMP
• ...
• Heavily used by network service providers
UDP transactions on the Internet
• Ethernet bandwidth evolution
• 10M -> 100M -> 1G -> 10G -> 40G -> 100G -> ...
• 10G (or faster) NICs are becoming common on commodity servers
• Transactions on a 10G network
• In the shortest packet case:
• Maximum 14,880,952 packets/s*1
• Hard to handle on a single server...
Ethernet Bandwidth and Transactions
*1 Shortest Ethernet frame (64 bytes) + preamble and IFG (20 bytes) = 84 bytes = 672 bits
10,000,000,000 / 672 = 14,880,952
• UDP payload sizes
• DNS
• A/AAAA query: 40+ bytes
• A/AAAA response: 100+ bytes
• RADIUS
• Access-Request: 70+ bytes
• Access-Accept: 30+ bytes
• Typically 100+ bytes with some attributes
• In many cases 100+ bytes
• 100-byte transactions on a 10G network
• Max 7,530,120 transactions/s*1
• Fewer than in the shortest packet case, but still challenging
How many transactions to handle?
*1 100 bytes + IP/UDP/Ethernet headers (46 bytes) + preamble and IFG (20 bytes) = 166 bytes = 1328 bits
10,000,000,000 / 1328 = 7,530,120
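The two footnote calculations (shortest frame and 100-byte payload) can be reproduced with a few lines of C. This is a minimal sketch: the function and constant names are made up for illustration, and the 46-byte header figure is taken from the footnote as is.

```c
#include <assert.h>
#include <stdint.h>

/* Per-frame sizes used in the footnotes (all in bytes). */
enum {
    ETH_IP_UDP_HEADERS = 46, /* Ethernet + IP + UDP headers, per the footnote */
    PREAMBLE_AND_IFG   = 20, /* preamble + inter-frame gap */
    MIN_ETH_FRAME      = 64  /* shortest Ethernet frame */
};

/* Frames per second that fit on a link when each frame occupies wire_bytes. */
uint64_t max_frames_per_sec(uint64_t link_bps, uint64_t wire_bytes)
{
    assert(wire_bytes > 0);
    return link_bps / (wire_bytes * 8);
}

/* Bytes on the wire for one UDP datagram with the given payload size. */
uint64_t udp_wire_bytes(uint64_t payload_bytes)
{
    uint64_t frame = payload_bytes + ETH_IP_UDP_HEADERS;
    if (frame < MIN_ETH_FRAME)
        frame = MIN_ETH_FRAME; /* short frames are padded to the minimum */
    return frame + PREAMBLE_AND_IFG;
}
```

For example, max_frames_per_sec(10000000000, 84) gives the shortest-packet figure, and max_frames_per_sec(10000000000, udp_wire_bytes(100)) gives the 100-byte figure.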
Basic technologies for network performance (not only for UDP)
• TSO/GSO/GRO
• Packet segmentation/aggregation
• Reduce packets to process within server
• Applicable to TCP*1 (byte stream)
• Not applicable to UDP*2 (datagram)
• UDP has explicit boundary between datagrams
• Cannot segment/aggregate packets
Basic technologies for network performance
[Figure: Tx server to Rx server. TCP (byte stream): TSO/GSO (segmentation) into MTU-size packets on Tx, GRO (aggregation) on Rx; great performance gain. UDP (datagram): not applicable.]
*1 TCP in UDP tunneling (e.g. VXLAN) is OK as well
*2 Other than UFO, which is rarely implemented on physical NICs
• RSS
• Scale network Rx processing in multi-core server
• RSS itself is a NIC feature
• Distribute packets to multi-queue in a NIC
• Each queue has a different interrupt vector
(Packets on each queue can be processed by different core)
• Applicable to TCP/UDP
• Common 10G NICs have RSS
Basic technologies for network performance
[Figure: RSS in the NIC spreads incoming packets over multiple rx queues; each rx queue interrupts a different core]
• 100 bytes UDP transaction performance
• Measured by simple*1 (multi-threaded) echo server
• OS: kernel 4.6.3 (in RHEL 7.2 environment)
• Mid-range commodity server with 20 cores and 10G NIC:
• NIC: Intel 82599ES (has RSS, max 64 queues)
• CPU: Xeon E5-2650 v3 (2.3 GHz 10 cores) * 2 sockets
Hyper-threading off (make analysis easy, enabled later)
• Results: 270,000 transactions/s (tps) (approx. 360Mbps)
• 3.6% utilization of 10G bandwidth
Performance with RSS enabled NIC
[Figure: echo server with 20 threads sharing one UDP socket; the client sends bulk 100-byte UDP packets*2 and the server echoes them back]
*1 Creates as many threads as there are cores; each thread just calls recvfrom() and sendto()
*2 There is only 1 client (IP address). To spread UDP traffic across the NIC queues, RSS is configured to hash on UDP port numbers. This setting is not needed for common UDP servers.
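The per-thread loop of such an echo server might look like the sketch below. This is not the benchmark program itself: the function names and the 2048-byte buffer are assumptions, and every thread is assumed to share the same socket descriptor, as in the baseline setup.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Receive one datagram and send it back to its sender.
 * Returns the number of bytes echoed, or -1 on error.
 */
ssize_t echo_once(int sock)
{
    char buf[2048];
    struct sockaddr_storage peer;
    socklen_t peer_len = sizeof(peer);

    assert(sock >= 0);
    ssize_t n = recvfrom(sock, buf, sizeof(buf), 0,
                         (struct sockaddr *)&peer, &peer_len);
    if (n < 0)
        return -1;
    return sendto(sock, buf, (size_t)n, 0,
                  (struct sockaddr *)&peer, peer_len);
}

/* Thread body: arg points to the shared socket descriptor. */
void *echo_thread(void *arg)
{
    int sock = *(int *)arg;

    while (echo_once(sock) >= 0)
        ; /* keep echoing until an error occurs */
    return NULL;
}
```

echo_thread has the signature expected by pthread_create(), so one thread per core can be started on the same socket.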
How to improve this?
• sar -u ALL -P ALL 1
• softirq (interrupt processing) is performed only
on NUMA Node 0, why?
• although we have enough (64) queues for 20 cores...
Identify bottleneck
19:57:54 CPU %usr %nice %sys %iowait %steal %irq %soft %guest %gnice %idle
19:57:54 all 0.37 0.00 42.58 0.00 0.00 0.00 50.00 0.00 0.00 7.05
19:57:54 0 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
19:57:54 1 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
19:57:54 2 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
19:57:54 3 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
19:57:54 4 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
19:57:54 5 1.82 0.00 83.64 0.00 0.00 0.00 0.00 0.00 0.00 14.55
19:57:54 6 0.00 0.00 87.04 0.00 0.00 0.00 0.00 0.00 0.00 12.96
19:57:54 7 0.00 0.00 85.19 0.00 0.00 0.00 0.00 0.00 0.00 14.81
19:57:54 8 0.00 0.00 85.45 0.00 0.00 0.00 0.00 0.00 0.00 14.55
19:57:54 9 0.00 0.00 85.19 0.00 0.00 0.00 0.00 0.00 0.00 14.81
19:57:54 10 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
19:57:54 11 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
19:57:54 12 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
19:57:54 13 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
19:57:54 14 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
19:57:54 15 1.82 0.00 83.64 0.00 0.00 0.00 0.00 0.00 0.00 14.55
19:57:54 16 0.00 0.00 87.04 0.00 0.00 0.00 0.00 0.00 0.00 12.96
19:57:54 17 1.82 0.00 83.64 0.00 0.00 0.00 0.00 0.00 0.00 14.55
19:57:54 18 0.00 0.00 85.45 0.00 0.00 0.00 0.00 0.00 0.00 14.55
19:57:54 19 0.00 0.00 85.45 0.00 0.00 0.00 0.00 0.00 0.00 14.55
(Node 0: cores 0-4 and 10-14; Node 1: the remaining cores)
• RSS distributes packets to rx-queues
• Interrupt destination of each queue is
determined by /proc/irq/<irq>/smp_affinity
• smp_affinity is usually set by irqbalance
daemon
softirq (interrupt processing) with RSS
[Figure: RSS spreads packets over rx queues; smp_affinity decides which core each queue's interrupt goes to]
• smp_affinity*1
• irqbalance is using only Node 0
(cores 0-4, 10-14)
• Can we change this?
Check smp_affinity
$ for ((irq=105; irq<=124; irq++)); do
> cat /proc/irq/$irq/smp_affinity
> done
01000 -> 12 -> Node 0
00800 -> 11 -> Node 0
00400 -> 10 -> Node 0
00400 -> 10 -> Node 0
01000 -> 12 -> Node 0
04000 -> 14 -> Node 0
00400 -> 10 -> Node 0
00010 -> 4 -> Node 0
00004 -> 2 -> Node 0
02000 -> 13 -> Node 0
04000 -> 14 -> Node 0
00001 -> 0 -> Node 0
02000 -> 13 -> Node 0
01000 -> 12 -> Node 0
00008 -> 3 -> Node 0
00800 -> 11 -> Node 0
00800 -> 11 -> Node 0
04000 -> 14 -> Node 0
00800 -> 11 -> Node 0
02000 -> 13 -> Node 0
*1 irq number can be obtained from /proc/interrupts
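The "mask -> core" annotations above come from reading each hex value as a CPU bitmask. A minimal sketch of that decoding follows; the function name is hypothetical, and commas are skipped on the assumption that wider masks are printed in comma-separated 32-bit groups.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/*
 * Decode a hex CPU bitmask such as "01000" into CPU numbers.
 * Returns the number of CPUs written to cpus[], or -1 on a bad character.
 */
int affinity_mask_to_cpus(const char *hex, int *cpus, int max_cpus)
{
    int count = 0;
    int bit_base = 0;

    assert(hex != NULL && cpus != NULL);
    /* Walk from the least significant (rightmost) hex digit. */
    for (size_t i = strlen(hex); i-- > 0; ) {
        char c = hex[i];

        if (c == ',' || c == '\n')
            continue; /* group separators and trailing newline */
        if (!isxdigit((unsigned char)c))
            return -1;
        int nibble = isdigit((unsigned char)c)
                         ? c - '0'
                         : tolower((unsigned char)c) - 'a' + 10;
        for (int b = 0; b < 4; b++) {
            if ((nibble & (1 << b)) && count < max_cpus)
                cpus[count++] = bit_base + b;
        }
        bit_base += 4; /* each hex digit covers four CPUs */
    }
    return count;
}
```

With this, "01000" decodes to core 12 and "00010" to core 4, matching the listing.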
• Some NIC drivers provide affinity_hint
• affinity_hint is evenly distributed
• To honor the hint, add "-h exact" option to
irqbalance (via /etc/sysconfig/irqbalance, etc.)*1
Check affinity_hint
$ for ((irq=105; irq<=124; irq++)); do
> cat /proc/irq/$irq/affinity_hint
> done
00001 -> 0
00002 -> 1
00004 -> 2
00008 -> 3
00010 -> 4
00020 -> 5
00040 -> 6
00080 -> 7
00100 -> 8
00200 -> 9
00400 -> 10
00800 -> 11
01000 -> 12
02000 -> 13
04000 -> 14
08000 -> 15
10000 -> 16
20000 -> 17
40000 -> 18
80000 -> 19
*1 If your NIC driver doesn't provide a hint, you can use the "-i" option, or stop irqbalance, and set the affinity manually
• Added "-h exact" and restarted irqbalance
• With hint honored, irqs are distributed to all
cores
Change irqbalance option
$ for ((irq=105; irq<=124; irq++)); do
> cat /proc/irq/$irq/smp_affinity
> done
00001 -> 0
00002 -> 1
00004 -> 2
00008 -> 3
00010 -> 4
00020 -> 5
00040 -> 6
00080 -> 7
00100 -> 8
00200 -> 9
00400 -> 10
00800 -> 11
01000 -> 12
02000 -> 13
04000 -> 14
08000 -> 15
10000 -> 16
20000 -> 17
40000 -> 18
80000 -> 19
• sar -u ALL -P ALL 1
• Although the irqs look evenly distributed, cores 16-19 are not used for softirq...
• NUMA nodes look irrelevant this time
Change irqbalance option
20:06:07 CPU %usr %nice %sys %iowait %steal %irq %soft %guest %gnice %idle
20:06:07 all 0.00 0.00 19.18 0.00 0.00 0.00 80.82 0.00 0.00 0.00
20:06:07 0 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:06:07 1 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:06:07 2 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:06:07 3 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:06:07 4 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:06:07 5 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:06:07 6 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:06:07 7 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:06:07 8 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:06:07 9 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:06:07 10 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:06:07 11 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:06:07 12 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:06:07 13 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:06:07 14 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:06:07 15 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:06:07 16 0.00 0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
20:06:07 17 0.00 0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
20:06:07 18 0.00 0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
20:06:07 19 0.00 0.00 93.33 0.00 0.00 0.00 6.67 0.00 0.00 0.00
(Node 0: cores 0-4 and 10-14; Node 1: the remaining cores)
• ethtool -S*1
• Revealed RSS has not distributed packets to
queues 16-19
Check rx-queue stats
$ ethtool -S ens1f0 | grep 'rx_queue_.*_packets'
rx_queue_0_packets: 198005155
rx_queue_1_packets: 153339750
rx_queue_2_packets: 162870095
rx_queue_3_packets: 172303801
rx_queue_4_packets: 153728776
rx_queue_5_packets: 158138563
rx_queue_6_packets: 164411653
rx_queue_7_packets: 165924489
rx_queue_8_packets: 176545406
rx_queue_9_packets: 165340188
rx_queue_10_packets: 150279834
rx_queue_11_packets: 150983782
rx_queue_12_packets: 157623687
rx_queue_13_packets: 150743910
rx_queue_14_packets: 158634344
rx_queue_15_packets: 158497890
rx_queue_16_packets: 4
rx_queue_17_packets: 3
rx_queue_18_packets: 0
rx_queue_19_packets: 8
*1 Output format depends on drivers
• RSS has an indirection table that determines which queue each packet is sent to
• Can be shown by ethtool -x
• Only rx-queues 0-15 are used; 16-19 are not
RSS Indirection Table
$ ethtool -x ens1f0
RX flow hash indirection table for ens1f0 with 20 RX ring(s):
0: 0 1 2 3 4 5 6 7
8: 8 9 10 11 12 13 14 15
16: 0 1 2 3 4 5 6 7
24: 8 9 10 11 12 13 14 15
32: 0 1 2 3 4 5 6 7
40: 8 9 10 11 12 13 14 15
48: 0 1 2 3 4 5 6 7
56: 8 9 10 11 12 13 14 15
64: 0 1 2 3 4 5 6 7
72: 8 9 10 11 12 13 14 15
80: 0 1 2 3 4 5 6 7
88: 8 9 10 11 12 13 14 15
96: 0 1 2 3 4 5 6 7
104: 8 9 10 11 12 13 14 15
112: 0 1 2 3 4 5 6 7
120: 8 9 10 11 12 13 14 15
(Row labels: flow hash index, derived from the hash value of the packet header. Table entries: rx-queue number.)
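How such a table maps a flow hash to an rx-queue can be sketched as follows. The names are made up for illustration, and the lookup assumes the NIC indexes the 128-entry table with the low-order bits of the hash; the real indexing is hardware-specific.

```c
#include <assert.h>
#include <stdint.h>

#define RETA_SIZE 128 /* number of entries in the table shown above */

/* Fill the table round-robin over the first num_queues queues. */
void reta_fill_equal(uint8_t reta[RETA_SIZE], unsigned num_queues)
{
    assert(num_queues > 0 && num_queues <= 256);
    for (unsigned i = 0; i < RETA_SIZE; i++)
        reta[i] = (uint8_t)(i % num_queues);
}

/* Pick the rx-queue for a packet from its flow hash (assumed indexing). */
unsigned reta_lookup(const uint8_t reta[RETA_SIZE], uint32_t flow_hash)
{
    return reta[flow_hash % RETA_SIZE];
}

/* Highest queue number this table can ever select. */
unsigned reta_max_queue(const uint8_t reta[RETA_SIZE])
{
    unsigned max = 0;

    for (unsigned i = 0; i < RETA_SIZE; i++)
        if (reta[i] > max)
            max = reta[i];
    return max;
}
```

With a table filled over 16 queues, as in the listing, reta_max_queue() returns 15, so no hash value can ever land on queues 16-19.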
• Change to use all 0-19?
• The indirection table of this NIC can actually reference at most 16 rx-queues, so we cannot use 20 queues
• although we have 64 rx-queues...
• Use RPS instead
• Software emulation of RSS
RSS Indirection Table
# ethtool -X ens1f0 equal 20
Cannot set RX flow hash configuration: Invalid argument
[Figure: RSS delivers packets to rx queues and interrupts a core; with RPS, that core's softirq hands packets to the softirq backlog of other cores]
• This time I spread flows from rx-queue 6-9 to
core 6-9 and 16-19
• Because they are all in Node 1
• rx-queue 6 -> core 6, 16
• rx-queue 7 -> core 7, 17
• rx-queue 8 -> core 8, 18
• rx-queue 9 -> core 9, 19
Use RPS
# echo 10040 > /sys/class/net/ens1f0/queues/rx-6/rps_cpus
# echo 20080 > /sys/class/net/ens1f0/queues/rx-7/rps_cpus
# echo 40100 > /sys/class/net/ens1f0/queues/rx-8/rps_cpus
# echo 80200 > /sys/class/net/ens1f0/queues/rx-9/rps_cpus
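The hex values written to rps_cpus are CPU bitmasks with one bit per core. Below is a minimal sketch of building and writing such a mask from C. The helper names are hypothetical, and the sketch only handles cores below 32, since wider masks need the comma-separated form.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Build a CPU bitmask from a list of core numbers (each must be < 32). */
uint32_t cpu_mask(const int *cores, int n)
{
    uint32_t mask = 0;

    for (int i = 0; i < n; i++) {
        assert(cores[i] >= 0 && cores[i] < 32);
        mask |= UINT32_C(1) << cores[i];
    }
    return mask;
}

/* Write the mask for one rx-queue; returns 0 on success, -1 on failure. */
int set_rps_cpus(const char *ifname, int rxq, uint32_t mask)
{
    char path[256];

    snprintf(path, sizeof(path),
             "/sys/class/net/%s/queues/rx-%d/rps_cpus", ifname, rxq);
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int ok = fprintf(f, "%x\n", (unsigned)mask) > 0;
    /* The data reaches sysfs when the stream is flushed on close. */
    return (fclose(f) == 0 && ok) ? 0 : -1;
}
```

For rx-queue 6, cores 6 and 16 give the mask 0x10040, the same value as in the first echo command above. Writing requires root, like the shell version.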
• sar -u ALL -P ALL 1
• softirq is almost evenly distributed
Use RPS
20:18:53 CPU %usr %nice %sys %iowait %steal %irq %soft %guest %gnice %idle
20:18:54 all 0.00 0.00 2.38 0.00 0.00 0.00 97.62 0.00 0.00 0.00
20:18:54 0 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:18:54 1 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:18:54 2 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:18:54 3 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:18:54 4 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:18:54 5 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:18:54 6 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:18:54 7 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:18:54 8 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:18:54 9 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:18:54 10 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:18:54 11 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:18:54 12 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:18:54 13 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:18:54 14 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:18:54 15 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
20:18:54 16 0.00 0.00 15.56 0.00 0.00 0.00 84.44 0.00 0.00 0.00
20:18:54 17 0.00 0.00 6.98 0.00 0.00 0.00 93.02 0.00 0.00 0.00
20:18:54 18 0.00 0.00 18.18 0.00 0.00 0.00 81.82 0.00 0.00 0.00
20:18:54 19 2.27 0.00 6.82 0.00 0.00 0.00 90.91 0.00 0.00 0.00
• Now, thanks to affinity_hint and RPS, we succeeded in spreading flows almost evenly
• Performance change
• Before: 270,000 tps (approx. 360Mbps)
• After: 17,000 tps (approx. 23Mbps)
• Got worse...
• The likely cause is excessively heavy softirq processing
• softirq is almost 100% in total
• Need finer-grained profiling than sar
RSS & affinity_hint & RPS
• perf
• Profiling tool developed in kernel tree
• Identify hot spots by sampling CPU cycles
• Example usage of perf
• perf record -a -g -- sleep 5
• Save sampling results for 5 seconds to perf.data file
• FlameGraph
• Visualize perf.data in svg format
• https://github.com/brendangregg/FlameGraph
Profile softirq
• FlameGraph of CPU0*1
• x-axis (width): CPU consumption
• y-axis (height): Depth of call stack
• queued_spin_lock_slowpath: lock is contended
• udp_queue_rcv_skb: acquires socket lock
Profile softirq
*1 I filtered call stack under irq context from output of perf script to make the chart easier to see
irq context is shown as "interrupt" here
[Flame graph of CPU0, with queued_spin_lock_slowpath and udp_queue_rcv_skb highlighted]
• Echo server has only one socket bound to a
certain port
• softirq of each core pushes packets into socket
queue concurrently
• socket lock gets contended
Socket lock contention
[Figure: softirqs on several cores (driven by interrupts and RPS backlogs) all enqueue packets to the single UDP socket, causing lock contention]
• Split sockets by SO_REUSEPORT
• Introduced in kernel 3.9
• SO_REUSEPORT allows multiple UDP sockets to bind to the same port
• One of the sockets is chosen on queueing each packet
Avoid lock contention
[Figure: each core's softirq passes packets to a socket selector*1, which picks one of several UDP sockets bound to the same port]
int on = 1;
int sock = socket(AF_INET, SOCK_DGRAM, 0);
setsockopt(sock, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on));
bind(sock, ...);
*1 select a socket by flow (packet header) hash by default
• sar -u ALL -P ALL 1
• CPU consumption in softirq became somewhat more reasonable
Use SO_REUSEPORT
20:44:33 CPU %usr %nice %sys %iowait %steal %irq %soft %guest %gnice %idle
20:44:34 all 3.26 0.00 37.23 0.00 0.00 0.00 59.52 0.00 0.00 0.00
20:44:34 0 3.33 0.00 28.33 0.00 0.00 0.00 68.33 0.00 0.00 0.00
20:44:34 1 3.33 0.00 25.00 0.00 0.00 0.00 71.67 0.00 0.00 0.00
20:44:34 2 1.67 0.00 23.33 0.00 0.00 0.00 75.00 0.00 0.00 0.00
20:44:34 3 3.28 0.00 32.79 0.00 0.00 0.00 63.93 0.00 0.00 0.00
20:44:34 4 3.33 0.00 33.33 0.00 0.00 0.00 63.33 0.00 0.00 0.00
20:44:34 5 1.69 0.00 23.73 0.00 0.00 0.00 74.58 0.00 0.00 0.00
20:44:34 6 3.28 0.00 50.82 0.00 0.00 0.00 45.90 0.00 0.00 0.00
20:44:34 7 3.45 0.00 50.00 0.00 0.00 0.00 46.55 0.00 0.00 0.00
20:44:34 8 1.69 0.00 37.29 0.00 0.00 0.00 61.02 0.00 0.00 0.00
20:44:34 9 1.67 0.00 33.33 0.00 0.00 0.00 65.00 0.00 0.00 0.00
20:44:34 10 1.69 0.00 18.64 0.00 0.00 0.00 79.66 0.00 0.00 0.00
20:44:34 11 3.23 0.00 35.48 0.00 0.00 0.00 61.29 0.00 0.00 0.00
20:44:34 12 1.69 0.00 27.12 0.00 0.00 0.00 71.19 0.00 0.00 0.00
20:44:34 13 1.67 0.00 21.67 0.00 0.00 0.00 76.67 0.00 0.00 0.00
20:44:34 14 1.67 0.00 21.67 0.00 0.00 0.00 76.67 0.00 0.00 0.00
20:44:34 15 3.33 0.00 35.00 0.00 0.00 0.00 61.67 0.00 0.00 0.00
20:44:34 16 6.67 0.00 68.33 0.00 0.00 0.00 25.00 0.00 0.00 0.00
20:44:34 17 5.00 0.00 65.00 0.00 0.00 0.00 30.00 0.00 0.00 0.00
20:44:34 18 6.78 0.00 54.24 0.00 0.00 0.00 38.98 0.00 0.00 0.00
20:44:34 19 4.92 0.00 63.93 0.00 0.00 0.00 31.15 0.00 0.00 0.00
• Before: [flame graph: interrupt processing (irq)]
• After: [flame graph: interrupt processing (irq) and userspace thread (sys, user)]
Userspace starts to work
Use SO_REUSEPORT
• Performance change
• RSS: 270,000 tps (approx. 360Mbps)
• +affinity_hint+RPS: 17,000 tps (approx. 23Mbps)
• +SO_REUSEPORT: 2,540,000 tps (approx. 3370Mbps)
• Great improvement!
• but...
Use SO_REUSEPORT
• More analysis
• Socket lock is still contended
Use SO_REUSEPORT
[Flame graph: queued_spin_lock_slowpath highlighted]
• SO_REUSEPORT uses the flow hash to select the socket (queue) by default
• The same socket can be selected by different cores
• Socket lock still gets contended
Socket lock contention again
[Figure: softirqs on Core0-Core2 pass packets to the socket selector, which selects the queue by flow hash; different cores can pick the same socket among socket0-socket2, causing lock contention]
• Select socket by core number
• Realized by SO_ATTACH_REUSEPORT_CBPF/EBPF*1
• Introduced in kernel 4.5
• No lock contention between softirqs
• Usage
• See example in kernel source tree
• tools/testing/selftests/net/reuseport_bpf_cpu.c
Avoid socket lock contention
[Figure: the socket selector selects the queue by core number, so softirqs on Core0, Core1 and Core2 deliver to socket0, socket1 and socket2 respectively]
*1 BPF allows much more flexible logic, but this time only the CPU number is used
• Before: [flame graph: interrupt processing (irq) and userspace thread (sys, user)]
• After: [flame graph: interrupt processing (irq) and userspace thread (sys, user)]
irq overhead decreases
Use SO_ATTACH_REUSEPORT_EBPF
• Performance change
• RSS: 270,000 tps (approx. 360Mbps)
• +affinity_hint+RPS: 17,000 tps (approx. 23Mbps)
• +SO_REUSEPORT: 2,540,000 tps (approx. 3370Mbps)
• +SO_ATTACH_...: 4,250,000 tps (approx. 5640Mbps)
Use SO_ATTACH_REUSEPORT_EBPF
• Userspace threads : sockets == 1 : 1
• No lock contention
• But not necessarily on the same core as softirq
• Pin userspace thread on the same core for better cache affinity
• cgroup, taskset, pthread_setaffinity_np(), ... any way you like
Pin userspace threads
[Figure, before: thread0-thread2 each own one socket (1:1) but may run on a different core from the softirq that fills that socket. After: each thread is pinned to the core whose softirq feeds its socket]
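Pinning from inside the worker thread can be sketched as below. Note the swap: this uses sched_setaffinity() with pid 0 (the calling thread on Linux) instead of pthread_setaffinity_np(), only to keep the example free of pthread setup. The function name is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/*
 * Pin the calling thread to a single core.
 * Returns 0 on success, -1 on error (errno set by sched_setaffinity).
 */
int pin_current_thread(int core)
{
    cpu_set_t set;

    assert(core >= 0 && core < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    /* pid 0 means "the calling thread" on Linux. */
    return sched_setaffinity(0, sizeof(set), &set);
}
```

A worker that owns socket N would call pin_current_thread(N) before entering its receive loop, assuming sockets are indexed by core number.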
• Performance change
• RSS: 270,000 tps (approx. 360Mbps)
• +affinity_hint+RPS: 17,000 tps (approx. 23Mbps)
• +SO_REUSEPORT: 2,540,000 tps (approx. 3370Mbps)
• +SO_ATTACH_...: 4,250,000 tps (approx. 5640Mbps)
• +Pin threads: 5,050,000 tps (approx. 6710Mbps)
Pin userspace threads
• So far everything has been about Rx
• No lock contention on Tx?
Tx lock contention?
• The kernel has Qdisc (queueing discipline)
• Each Qdisc is linked to a NIC tx-queue
• Each Qdisc has its own lock
Tx queue
[Figure: userspace threads on each core send through Qdisc queues in the kernel; each Qdisc queue feeds one tx queue of the NIC]
• By default Qdisc is selected by flow hash
• Thus lock contention can happen
• We haven't seen contention on Tx, why?
Tx queue lock contention
[Figure: the Qdisc queue is selected by flow hash, so threads on different cores can pick the same Qdisc queue, causing lock contention]
• Because ixgbe (Intel 10GbE NIC driver) can set XPS automatically
Avoid Tx queue lock contention
$ for ((txq=0; txq<20; txq++)); do
> cat /sys/class/net/ens1f0/queues/tx-$txq/xps_cpus
> done
00001 -> core 0
00002 -> core 1
00004 -> core 2
00008 -> core 3
00010 -> core 4
00020 -> core 5
00040 -> core 6
00080 -> core 7
00100 -> core 8
00200 -> core 9
00400 -> core 10
00800 -> core 11
01000 -> core 12
02000 -> core 13
04000 -> core 14
08000 -> core 15
10000 -> core 16
20000 -> core 17
40000 -> core 18
80000 -> core 19
• XPS allows kernel to select Tx queue (Qdisc) by
core number
• Tx has no lock contention
XPS
[Figure: with XPS the Qdisc queue (tx queue) is selected by core number, so the thread on each core uses its own Qdisc queue]
• Try disabling it
• Before: 5,050,000 tps (approx. 6710Mbps)
• After: 1,086,000 tps (approx. 1440Mbps)
How effective is XPS?
# for ((txq=0; txq<20; txq++)); do
> echo 0 > /sys/class/net/ens1f0/queues/tx-$txq/xps_cpus
> done
• XPS enabled: [flame graph: interrupt processing (irq), userspace thread Tx, userspace thread Rx]
• XPS disabled: [flame graph: interrupt processing (irq), userspace thread Tx, userspace thread Rx, with queued_spin_lock_slowpath (lock contention) highlighted]
Disabling XPS
• Enable XPS again
• Although ixgbe can automatically set XPS, not
all drivers can do that
• Make sure to check that xps_cpus is configured
Enable XPS
# echo 00001 > /sys/class/net/<NIC>/queues/tx-0/xps_cpus
# echo 00002 > /sys/class/net/<NIC>/queues/tx-1/xps_cpus
# echo 00004 > /sys/class/net/<NIC>/queues/tx-2/xps_cpus
# echo 00008 > /sys/class/net/<NIC>/queues/tx-3/xps_cpus
...
• By making full use of multiple cores while avoiding contention, we achieved
• 5,050,000 tps (approx. 6710Mbps)
• To get more performance, reduce overhead per core
Optimization per core
• GRO is enabled by default
• Consuming 4.9% of CPU time
Optimization per core
[Flame graph: dev_gro_receive highlighted]
• GRO is not applicable to UDP*1
• Disable it for UDP servers
• WARNING:
• Don't disable it if TCP performance matters
• Disabling GRO makes TCP rx throughput miserably low
• Don't disable it on KVM hypervisors as well
• GRO boosts throughput of tunneling protocol traffic as well as guests' TCP traffic on hypervisors
GRO
# ethtool -K <NIC> gro off
*1 Other than UDP tunneling, like VXLAN
• Performance change
• RSS (+XPS): 270,000 tps (approx. 360Mbps)
• +affinity_hint+RPS: 17,000 tps (approx. 23Mbps)
• +SO_REUSEPORT: 2,540,000 tps (approx. 3370Mbps)
• +SO_ATTACH_...: 4,250,000 tps (approx. 5640Mbps)
• +Pin threads: 5,050,000 tps (approx. 6710Mbps)
• +Disable GRO: 5,180,000 tps (approx. 6880Mbps)
Disable GRO
• iptables-related processing (nf_iterate) is
performed
• Although I have not added any rule to iptables
• Consuming 3.00% of CPU time
Optimization per core
[Flame graph: nf_iterate highlighted in two places]
• With iptables kernel module loaded, even if
you don't have any rules, it can incur some
overhead
• Some distributions load iptables module even
when you don't add any rule
• If you are not using iptables, unload the
module
iptables (netfilter)
# modprobe -r iptable_filter
# modprobe -r ip_tables
• Performance change
• RSS (+XPS): 270,000 tps (approx. 360Mbps)
• +affinity_hint+RPS: 17,000 tps (approx. 23Mbps)
• +SO_REUSEPORT: 2,540,000 tps (approx. 3370Mbps)
• +SO_ATTACH_...: 4,250,000 tps (approx. 5640Mbps)
• +Pin threads: 5,050,000 tps (approx. 6710Mbps)
• +Disable GRO: 5,180,000 tps (approx. 6880Mbps)
• +Unload iptables: 5,380,000 tps (approx. 7140Mbps)
Unload iptables
• On Rx, FIB (routing table) lookup is done
twice
• Each is consuming 1.82%~ of CPU time
Optimization per core
[Flame graph: fib_table_lookup highlighted in two places]
• One of the two table lookups is for validating source IP addresses
• Reverse path filter
• Local address check
• If you really don't need source validation, you
can skip it
FIB lookup on Rx
# sysctl -w net.ipv4.conf.all.rp_filter=0
# sysctl -w net.ipv4.conf.<NIC>.rp_filter=0
# sysctl -w net.ipv4.conf.all.accept_local=1
• Performance change
• RSS (+XPS): 270,000 tps (approx. 360Mbps)
• +affinity_hint+RPS: 17,000 tps (approx. 23Mbps)
• +SO_REUSEPORT: 2,540,000 tps (approx. 3370Mbps)
• +SO_ATTACH_...: 4,250,000 tps (approx. 5640Mbps)
• +Pin threads: 5,050,000 tps (approx. 6710Mbps)
• +Disable GRO: 5,180,000 tps (approx. 6880Mbps)
• +Unload iptables: 5,380,000 tps (approx. 7140Mbps)
• +Disable validation: 5,490,000 tps (approx. 7290Mbps)
Disable source validation
• Audit adds noticeable overhead under heavy packet processing
• Consuming 2.5% of CPU time
Optimization per core
[Flame graph: audit-related processing highlighted]
• If you don't need audit, disable it
Audit
# systemctl disable auditd
# reboot
• Performance change
• RSS (+XPS): 270,000 tps (approx. 360Mbps)
• +affinity_hint+RPS: 17,000 tps (approx. 23Mbps)
• +SO_REUSEPORT: 2,540,000 tps (approx. 3370Mbps)
• +SO_ATTACH_...: 4,250,000 tps (approx. 5640Mbps)
• +Pin threads: 5,050,000 tps (approx. 6710Mbps)
• +Disable GRO: 5,180,000 tps (approx. 6880Mbps)
• +Unload iptables: 5,380,000 tps (approx. 7140Mbps)
• +Disable validation: 5,490,000 tps (approx. 7290Mbps)
• +Disable audit: 5,860,000 tps (approx. 7780Mbps)
Disable audit
• IP ID field calculation (__ip_select_ident) is
heavy
• Consuming 4.82% of CPU time
Optimization per core
[Flame graph: __ip_select_ident highlighted]
• This is an environment-specific issue
• This happens if many clients have the same IP address
• Cache contention by atomic operations
• It is very likely you won't see this amount of CPU consumption unless you use a tunneling protocol
• If you really see this problem...
• You can skip it only if you never send packets larger than the MTU
• Though this is a very strict condition
IP ID field calculation
int pmtu = IP_PMTUDISC_DO;
setsockopt(sock, IPPROTO_IP, IP_MTU_DISCOVER, &pmtu, sizeof(pmtu));
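The same call with error checking and a read-back might look like this minimal sketch. The helper names are hypothetical. Once this mode is set, a send larger than the path MTU is expected to fail with EMSGSIZE instead of being fragmented, which is why the condition above is strict.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Forbid local fragmentation on this socket (DF is always set).
 * Returns 0 on success, -1 on error.
 */
int set_pmtudisc_do(int sock)
{
    int pmtu = IP_PMTUDISC_DO;

    assert(sock >= 0);
    return setsockopt(sock, IPPROTO_IP, IP_MTU_DISCOVER, &pmtu, sizeof(pmtu));
}

/* Read the current mode back; returns the mode, or -1 on error. */
int get_pmtudisc(int sock)
{
    int pmtu = -1;
    socklen_t len = sizeof(pmtu);

    if (getsockopt(sock, IPPROTO_IP, IP_MTU_DISCOVER, &pmtu, &len) < 0)
        return -1;
    return pmtu;
}
```

With several SO_REUSEPORT sockets, the option has to be set on each of them.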
• Performance change
• RSS (+XPS): 270,000 tps (approx. 360Mbps)
• +affinity_hint+RPS: 17,000 tps (approx. 23Mbps)
• +SO_REUSEPORT: 2,540,000 tps (approx. 3370Mbps)
• +SO_ATTACH_...: 4,250,000 tps (approx. 5640Mbps)
• +Pin threads: 5,050,000 tps (approx. 6710Mbps)
• +Disable GRO: 5,180,000 tps (approx. 6880Mbps)
• +Unload iptables: 5,380,000 tps (approx. 7140Mbps)
• +Disable validation: 5,490,000 tps (approx. 7290Mbps)
• +Disable audit: 5,860,000 tps (approx. 7780Mbps)
• +Skip ID calculation: 6,010,000 tps (approx. 7980Mbps)
Skip IP ID calculation
• So far we have not enabled hyper-threading
• It makes the number of logical cores 40
• The number of physical cores is 20 in this box
• With 40 cores we need to rely more on RPS
• Reminder: Max usable rx-queues == 16
• Enable hyper-threading and set RPS on all rx-queues
• queue 0 -> core 0, 20
• queue 1 -> core 1, 21
• ...
• queue 10 -> core 10, 16, 30
• queue 11 -> core 11, 17, 31
• ...
Hyper threading
• Performance change
• RSS (+XPS): 270,000 tps (approx. 360Mbps)
• +affinity_hint+RPS: 17,000 tps (approx. 23Mbps)
• +SO_REUSEPORT: 2,540,000 tps (approx. 3370Mbps)
• +SO_ATTACH_...: 4,250,000 tps (approx. 5640Mbps)
• +Pin threads: 5,050,000 tps (approx. 6710Mbps)
• +Disable GRO: 5,180,000 tps (approx. 6880Mbps)
• +Unload iptables: 5,380,000 tps (approx. 7140Mbps)
• +Disable validation: 5,490,000 tps (approx. 7290Mbps)
• +Disable audit: 5,860,000 tps (approx. 7780Mbps)
• +Skip ID calculation: 6,010,000 tps (approx. 7980Mbps)
• +Hyper threading: 7,010,000 tps (approx. 9310Mbps)
• I suspect more rx-queues would yield even better performance
Hyper threading
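The "approx. Mbps" column can be cross-checked from the tps column using the per-packet wire size from the earlier footnote (100 bytes payload + 46 bytes headers + 20 bytes preamble/IFG). A minimal sketch with made-up function names; results are truncated, while the slides round to the nearest 10 Mbps.

```c
#include <assert.h>
#include <stdint.h>

enum { WIRE_OVERHEAD_BYTES = 46 + 20 }; /* headers + preamble/IFG */

/* One-way wire bandwidth in Mbps for tps transactions of payload_bytes. */
uint64_t tps_to_mbps(uint64_t tps, uint64_t payload_bytes)
{
    uint64_t bits_per_txn = (payload_bytes + WIRE_OVERHEAD_BYTES) * 8;

    return tps * bits_per_txn / 1000000;
}

/* Share of the link in use, in units of 0.1% (e.g. 35 means 3.5%). */
uint64_t link_utilization_permille(uint64_t tps, uint64_t payload_bytes,
                                   uint64_t link_bps)
{
    assert(link_bps > 0);
    return tps * (payload_bytes + WIRE_OVERHEAD_BYTES) * 8 * 1000 / link_bps;
}
```

For the last row, tps_to_mbps(7010000, 100) gives 9309, which the table rounds to 9310Mbps.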
• Tx Qdisc lock (_raw_spin_lock) is heavy
• Not contended but involves many atomic
operations
• Being optimized in Linux netdev community
More hot spots
[Flame graph: _raw_spin_lock highlighted]
• Memory alloc/free (slab)
• Being optimized in netdev community as well
More hot spots
[Flame graph: kmem/kmalloc/kfree highlighted]
• Virtualization
• UDP servers as guests
• Hypervisor can saturate CPUs or drop packets
• We are going to investigate ways to boost performance
in virtualized environment as well
Other challenges
• For 100-byte payloads, we can achieve almost 10G
• From: 270,000 tps (approx. 360Mbps)
• To: 7,010,000 tps (approx. 9310Mbps)
• Of course we need to take into account additional userspace work in real
applications so this number is not applicable as is
• To boost UDP performance
• Applications (Most important!)
• implement SO_REUSEPORT
• implement SO_ATTACH_REUSEPORT_EBPF/CBPF
• These are useful for TCP listening sockets as well
• OS settings
• Check smp_affinity
• Use RPS if rx-queues are not enough
• Make sure XPS is configured
• Consider other tunings to reduce per-core overhead
• Disable GRO
• Unload iptables
• Disable source IP validation
• Disable auditd
• Hardware
• Use NICs which have enough RSS rx-queues if possible
(as many queues as core num)
Summary
Thank you!
 
How VXLAN works on Linux
How VXLAN works on LinuxHow VXLAN works on Linux
How VXLAN works on LinuxEtsuji Nakai
 
Grafana LokiではじめるKubernetesロギングハンズオン(NTT Tech Conference #4 ハンズオン資料)
Grafana LokiではじめるKubernetesロギングハンズオン(NTT Tech Conference #4 ハンズオン資料)Grafana LokiではじめるKubernetesロギングハンズオン(NTT Tech Conference #4 ハンズオン資料)
Grafana LokiではじめるKubernetesロギングハンズオン(NTT Tech Conference #4 ハンズオン資料)
NTT DATA Technology & Innovation
 

What's hot (20)

Data Center Networks:Virtual Bridging
Data Center Networks:Virtual BridgingData Center Networks:Virtual Bridging
Data Center Networks:Virtual Bridging
 
BGP Unnumbered で遊んでみた
BGP Unnumbered で遊んでみたBGP Unnumbered で遊んでみた
BGP Unnumbered で遊んでみた
 
Understanding DPDK
Understanding DPDKUnderstanding DPDK
Understanding DPDK
 
Linux Linux Traffic Control
Linux Linux Traffic ControlLinux Linux Traffic Control
Linux Linux Traffic Control
 
암호화 이것만 알면 된다.
암호화 이것만 알면 된다.암호화 이것만 알면 된다.
암호화 이것만 알면 된다.
 
DPDKによる高速コンテナネットワーキング
DPDKによる高速コンテナネットワーキングDPDKによる高速コンテナネットワーキング
DPDKによる高速コンテナネットワーキング
 
DPDK in Containers Hands-on Lab
DPDK in Containers Hands-on LabDPDK in Containers Hands-on Lab
DPDK in Containers Hands-on Lab
 
さくらのVPS で IPv4 over IPv6ルータの構築
さくらのVPS で IPv4 over IPv6ルータの構築さくらのVPS で IPv4 over IPv6ルータの構築
さくらのVPS で IPv4 over IPv6ルータの構築
 
SRv6 study
SRv6 studySRv6 study
SRv6 study
 
TRex Realistic Traffic Generator - Stateless support
TRex  Realistic Traffic Generator  - Stateless support TRex  Realistic Traffic Generator  - Stateless support
TRex Realistic Traffic Generator - Stateless support
 
知っているようで知らないNeutron -仮想ルータの冗長と分散- - OpenStack最新情報セミナー 2016年3月
知っているようで知らないNeutron -仮想ルータの冗長と分散- - OpenStack最新情報セミナー 2016年3月 知っているようで知らないNeutron -仮想ルータの冗長と分散- - OpenStack最新情報セミナー 2016年3月
知っているようで知らないNeutron -仮想ルータの冗長と分散- - OpenStack最新情報セミナー 2016年3月
 
Linux Networking Explained
Linux Networking ExplainedLinux Networking Explained
Linux Networking Explained
 
マルチコアとネットワークスタックの高速化技法
マルチコアとネットワークスタックの高速化技法マルチコアとネットワークスタックの高速化技法
マルチコアとネットワークスタックの高速化技法
 
OVN 設定サンプル | OVN config example 2015/12/27
OVN 設定サンプル | OVN config example 2015/12/27OVN 設定サンプル | OVN config example 2015/12/27
OVN 設定サンプル | OVN config example 2015/12/27
 
Taking Security Groups to Ludicrous Speed with OVS (OpenStack Summit 2015)
Taking Security Groups to Ludicrous Speed with OVS (OpenStack Summit 2015)Taking Security Groups to Ludicrous Speed with OVS (OpenStack Summit 2015)
Taking Security Groups to Ludicrous Speed with OVS (OpenStack Summit 2015)
 
NATS Streaming - an alternative to Apache Kafka?
NATS Streaming - an alternative to Apache Kafka?NATS Streaming - an alternative to Apache Kafka?
NATS Streaming - an alternative to Apache Kafka?
 
Docker Networking with New Ipvlan and Macvlan Drivers
Docker Networking with New Ipvlan and Macvlan DriversDocker Networking with New Ipvlan and Macvlan Drivers
Docker Networking with New Ipvlan and Macvlan Drivers
 
SR-IOV Networking in OpenStack - OpenStack最新情報セミナー 2016年3月
SR-IOV Networking in OpenStack - OpenStack最新情報セミナー 2016年3月SR-IOV Networking in OpenStack - OpenStack最新情報セミナー 2016年3月
SR-IOV Networking in OpenStack - OpenStack最新情報セミナー 2016年3月
 
How VXLAN works on Linux
How VXLAN works on LinuxHow VXLAN works on Linux
How VXLAN works on Linux
 
Grafana LokiではじめるKubernetesロギングハンズオン(NTT Tech Conference #4 ハンズオン資料)
Grafana LokiではじめるKubernetesロギングハンズオン(NTT Tech Conference #4 ハンズオン資料)Grafana LokiではじめるKubernetesロギングハンズオン(NTT Tech Conference #4 ハンズオン資料)
Grafana LokiではじめるKubernetesロギングハンズオン(NTT Tech Conference #4 ハンズオン資料)
 

Viewers also liked

NVMe Over Fabrics Support in Linux
NVMe Over Fabrics Support in LinuxNVMe Over Fabrics Support in Linux
NVMe Over Fabrics Support in Linux
LF Events
 
SR-IOV ixgbe Driver Limitations and Improvement
SR-IOV ixgbe Driver Limitations and ImprovementSR-IOV ixgbe Driver Limitations and Improvement
SR-IOV ixgbe Driver Limitations and Improvement
LF Events
 
Evaluating MLC vs TLC vs V-NAND for Enterprise SSDs – Whitepaper
Evaluating MLC vs TLC vs V-NAND for Enterprise SSDs – WhitepaperEvaluating MLC vs TLC vs V-NAND for Enterprise SSDs – Whitepaper
Evaluating MLC vs TLC vs V-NAND for Enterprise SSDs – Whitepaper
Samsung Business USA
 
Scale-out Storage on Intel® Architecture Based Platforms: Characterizing and ...
Scale-out Storage on Intel® Architecture Based Platforms: Characterizing and ...Scale-out Storage on Intel® Architecture Based Platforms: Characterizing and ...
Scale-out Storage on Intel® Architecture Based Platforms: Characterizing and ...
Odinot Stanislas
 
Userspace Linux I/O
Userspace Linux I/O Userspace Linux I/O
Userspace Linux I/O
Garima Kapoor
 
Hardware accelerated virtio networking for nfv linux con
Hardware accelerated virtio networking for nfv linux conHardware accelerated virtio networking for nfv linux con
Hardware accelerated virtio networking for nfv linux con
sprdd
 
LISA15: systemd, the Next-Generation Linux System Manager
LISA15: systemd, the Next-Generation Linux System Manager LISA15: systemd, the Next-Generation Linux System Manager
LISA15: systemd, the Next-Generation Linux System Manager
Alison Chaiken
 
Oracle Performance On Linux X86 systems
Oracle  Performance On Linux  X86 systems Oracle  Performance On Linux  X86 systems
Oracle Performance On Linux X86 systems Baruch Osoveskiy
 
IRQs: the Hard, the Soft, the Threaded and the Preemptible
IRQs: the Hard, the Soft, the Threaded and the PreemptibleIRQs: the Hard, the Soft, the Threaded and the Preemptible
IRQs: the Hard, the Soft, the Threaded and the Preemptible
Alison Chaiken
 
Tuning systemd for embedded
Tuning systemd for embeddedTuning systemd for embedded
Tuning systemd for embedded
Alison Chaiken
 
NVMe PCIe and TLC V-NAND It’s about Time
NVMe PCIe and TLC V-NAND It’s about TimeNVMe PCIe and TLC V-NAND It’s about Time
NVMe PCIe and TLC V-NAND It’s about Time
Dell World
 
Comparing file system performance: Red Hat Enterprise Linux 6 vs. Microsoft W...
Comparing file system performance: Red Hat Enterprise Linux 6 vs. Microsoft W...Comparing file system performance: Red Hat Enterprise Linux 6 vs. Microsoft W...
Comparing file system performance: Red Hat Enterprise Linux 6 vs. Microsoft W...
Principled Technologies
 
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based HardwareRed hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red_Hat_Storage
 
CPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performance
Coburn Watson
 
QCT Ceph Solution - Design Consideration and Reference Architecture
QCT Ceph Solution - Design Consideration and Reference ArchitectureQCT Ceph Solution - Design Consideration and Reference Architecture
QCT Ceph Solution - Design Consideration and Reference Architecture
Patrick McGarry
 
Function Level Analysis of Linux NVMe Driver
Function Level Analysis of Linux NVMe DriverFunction Level Analysis of Linux NVMe Driver
Function Level Analysis of Linux NVMe Driver
인구 강
 
VMworld 2015: Extreme Performance Series - vSphere Compute & Memory
VMworld 2015: Extreme Performance Series - vSphere Compute & MemoryVMworld 2015: Extreme Performance Series - vSphere Compute & Memory
VMworld 2015: Extreme Performance Series - vSphere Compute & Memory
VMworld
 
Docker, LinuX Container
Docker, LinuX ContainerDocker, LinuX Container
Docker, LinuX Container
Araf Karsh Hamid
 
Ceph on Intel: Intel Storage Components, Benchmarks, and Contributions
Ceph on Intel: Intel Storage Components, Benchmarks, and ContributionsCeph on Intel: Intel Storage Components, Benchmarks, and Contributions
Ceph on Intel: Intel Storage Components, Benchmarks, and Contributions
Red_Hat_Storage
 
2016 Flash Storage-NVMe Brand Leader Mini-Report
2016 Flash Storage-NVMe Brand Leader Mini-Report2016 Flash Storage-NVMe Brand Leader Mini-Report
2016 Flash Storage-NVMe Brand Leader Mini-Report
IT Brand Pulse
 

Viewers also liked (20)

NVMe Over Fabrics Support in Linux
NVMe Over Fabrics Support in LinuxNVMe Over Fabrics Support in Linux
NVMe Over Fabrics Support in Linux
 
SR-IOV ixgbe Driver Limitations and Improvement
SR-IOV ixgbe Driver Limitations and ImprovementSR-IOV ixgbe Driver Limitations and Improvement
SR-IOV ixgbe Driver Limitations and Improvement
 
Evaluating MLC vs TLC vs V-NAND for Enterprise SSDs – Whitepaper
Evaluating MLC vs TLC vs V-NAND for Enterprise SSDs – WhitepaperEvaluating MLC vs TLC vs V-NAND for Enterprise SSDs – Whitepaper
Evaluating MLC vs TLC vs V-NAND for Enterprise SSDs – Whitepaper
 
Scale-out Storage on Intel® Architecture Based Platforms: Characterizing and ...
Scale-out Storage on Intel® Architecture Based Platforms: Characterizing and ...Scale-out Storage on Intel® Architecture Based Platforms: Characterizing and ...
Scale-out Storage on Intel® Architecture Based Platforms: Characterizing and ...
 
Userspace Linux I/O
Userspace Linux I/O Userspace Linux I/O
Userspace Linux I/O
 
Hardware accelerated virtio networking for nfv linux con
Hardware accelerated virtio networking for nfv linux conHardware accelerated virtio networking for nfv linux con
Hardware accelerated virtio networking for nfv linux con
 
LISA15: systemd, the Next-Generation Linux System Manager
LISA15: systemd, the Next-Generation Linux System Manager LISA15: systemd, the Next-Generation Linux System Manager
LISA15: systemd, the Next-Generation Linux System Manager
 
Oracle Performance On Linux X86 systems
Oracle  Performance On Linux  X86 systems Oracle  Performance On Linux  X86 systems
Oracle Performance On Linux X86 systems
 
IRQs: the Hard, the Soft, the Threaded and the Preemptible
IRQs: the Hard, the Soft, the Threaded and the PreemptibleIRQs: the Hard, the Soft, the Threaded and the Preemptible
IRQs: the Hard, the Soft, the Threaded and the Preemptible
 
Tuning systemd for embedded
Tuning systemd for embeddedTuning systemd for embedded
Tuning systemd for embedded
 
NVMe PCIe and TLC V-NAND It’s about Time
NVMe PCIe and TLC V-NAND It’s about TimeNVMe PCIe and TLC V-NAND It’s about Time
NVMe PCIe and TLC V-NAND It’s about Time
 
Comparing file system performance: Red Hat Enterprise Linux 6 vs. Microsoft W...
Comparing file system performance: Red Hat Enterprise Linux 6 vs. Microsoft W...Comparing file system performance: Red Hat Enterprise Linux 6 vs. Microsoft W...
Comparing file system performance: Red Hat Enterprise Linux 6 vs. Microsoft W...
 
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based HardwareRed hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
 
CPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performance
 
QCT Ceph Solution - Design Consideration and Reference Architecture
QCT Ceph Solution - Design Consideration and Reference ArchitectureQCT Ceph Solution - Design Consideration and Reference Architecture
QCT Ceph Solution - Design Consideration and Reference Architecture
 
Function Level Analysis of Linux NVMe Driver
Function Level Analysis of Linux NVMe DriverFunction Level Analysis of Linux NVMe Driver
Function Level Analysis of Linux NVMe Driver
 
VMworld 2015: Extreme Performance Series - vSphere Compute & Memory
VMworld 2015: Extreme Performance Series - vSphere Compute & MemoryVMworld 2015: Extreme Performance Series - vSphere Compute & Memory
VMworld 2015: Extreme Performance Series - vSphere Compute & Memory
 
Docker, LinuX Container
Docker, LinuX ContainerDocker, LinuX Container
Docker, LinuX Container
 
Ceph on Intel: Intel Storage Components, Benchmarks, and Contributions
Ceph on Intel: Intel Storage Components, Benchmarks, and ContributionsCeph on Intel: Intel Storage Components, Benchmarks, and Contributions
Ceph on Intel: Intel Storage Components, Benchmarks, and Contributions
 
2016 Flash Storage-NVMe Brand Leader Mini-Report
2016 Flash Storage-NVMe Brand Leader Mini-Report2016 Flash Storage-NVMe Brand Leader Mini-Report
2016 Flash Storage-NVMe Brand Leader Mini-Report
 

Similar to Boost UDP Transaction Performance

Fine grained monitoring
Fine grained monitoringFine grained monitoring
Fine grained monitoring
Iben Rodriguez
 
[IGC2018] AMD Don Woligroski - WHY Ryzen
[IGC2018] AMD Don Woligroski - WHY Ryzen[IGC2018] AMD Don Woligroski - WHY Ryzen
[IGC2018] AMD Don Woligroski - WHY Ryzen
강 민우
 
Linux 系統管理與安全:進階系統管理系統防駭與資訊安全
Linux 系統管理與安全:進階系統管理系統防駭與資訊安全Linux 系統管理與安全:進階系統管理系統防駭與資訊安全
Linux 系統管理與安全:進階系統管理系統防駭與資訊安全
維泰 蔡
 
realestate and MySQL devops melbourne
realestate and MySQL devops melbournerealestate and MySQL devops melbourne
realestate and MySQL devops melbourne
mysqldbahelp
 
Microcontrollers and RT programming 3
Microcontrollers and RT programming 3Microcontrollers and RT programming 3
Microcontrollers and RT programming 3SSGMCE SHEGAON
 
The Data Center and Hadoop
The Data Center and HadoopThe Data Center and Hadoop
The Data Center and Hadoop
DataWorks Summit
 
Kauli SSPにおけるVyOSの導入事例
Kauli SSPにおけるVyOSの導入事例Kauli SSPにおけるVyOSの導入事例
Kauli SSPにおけるVyOSの導入事例Kazuhito Ohkawa
 
Day 11 eigrp
Day 11 eigrpDay 11 eigrp
Day 11 eigrp
CYBERINTELLIGENTS
 
6th floorsharingsession ep 1 - networking - arp v 1.0
6th floorsharingsession ep 1 - networking - arp v 1.06th floorsharingsession ep 1 - networking - arp v 1.0
6th floorsharingsession ep 1 - networking - arp v 1.0
A Achyar Nur
 
Lagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
Lagopus presentation on 14th Annual ON*VECTOR International Photonics WorkshopLagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
Lagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
Lagopus SDN/OpenFlow switch
 
Особенности архитектуры и траблшутинга маршрутизаторов серии ASR1000
Особенности архитектуры и траблшутинга маршрутизаторов серии ASR1000Особенности архитектуры и траблшутинга маршрутизаторов серии ASR1000
Особенности архитектуры и траблшутинга маршрутизаторов серии ASR1000
Cisco Russia
 
Nat
NatNat
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitchDPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
Jim St. Leger
 
Microprocessors-based systems (under graduate course) Lecture 9 of 9
Microprocessors-based systems (under graduate course) Lecture 9 of 9 Microprocessors-based systems (under graduate course) Lecture 9 of 9
Microprocessors-based systems (under graduate course) Lecture 9 of 9
Randa Elanwar
 
Cisco EuroMPI'13 vendor session presentation
Cisco EuroMPI'13 vendor session presentationCisco EuroMPI'13 vendor session presentation
Cisco EuroMPI'13 vendor session presentationJeff Squyres
 
001 network toi_basics_v1
001 network toi_basics_v1001 network toi_basics_v1
001 network toi_basics_v1
Hisao Tsujimura
 
How deep is your buffer – Demystifying buffers and application performance
How deep is your buffer – Demystifying buffers and application performanceHow deep is your buffer – Demystifying buffers and application performance
How deep is your buffer – Demystifying buffers and application performance
Cumulus Networks
 
World IPv6 Day in indonesia
World IPv6 Day in indonesiaWorld IPv6 Day in indonesia
World IPv6 Day in indonesia
Affan Basalamah
 

Similar to Boost UDP Transaction Performance (20)

Fine grained monitoring
Fine grained monitoringFine grained monitoring
Fine grained monitoring
 
[IGC2018] AMD Don Woligroski - WHY Ryzen
[IGC2018] AMD Don Woligroski - WHY Ryzen[IGC2018] AMD Don Woligroski - WHY Ryzen
[IGC2018] AMD Don Woligroski - WHY Ryzen
 
Linux 系統管理與安全:進階系統管理系統防駭與資訊安全
Linux 系統管理與安全:進階系統管理系統防駭與資訊安全Linux 系統管理與安全:進階系統管理系統防駭與資訊安全
Linux 系統管理與安全:進階系統管理系統防駭與資訊安全
 
realestate and MySQL devops melbourne
realestate and MySQL devops melbournerealestate and MySQL devops melbourne
realestate and MySQL devops melbourne
 
Microcontrollers and RT programming 3
Microcontrollers and RT programming 3Microcontrollers and RT programming 3
Microcontrollers and RT programming 3
 
Ppt of routing protocols
Ppt of routing protocolsPpt of routing protocols
Ppt of routing protocols
 
The Data Center and Hadoop
The Data Center and HadoopThe Data Center and Hadoop
The Data Center and Hadoop
 
Kauli SSPにおけるVyOSの導入事例
Kauli SSPにおけるVyOSの導入事例Kauli SSPにおけるVyOSの導入事例
Kauli SSPにおけるVyOSの導入事例
 
Day 11 eigrp
Day 11 eigrpDay 11 eigrp
Day 11 eigrp
 
6th floorsharingsession ep 1 - networking - arp v 1.0
6th floorsharingsession ep 1 - networking - arp v 1.06th floorsharingsession ep 1 - networking - arp v 1.0
6th floorsharingsession ep 1 - networking - arp v 1.0
 
Lagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
Lagopus presentation on 14th Annual ON*VECTOR International Photonics WorkshopLagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
Lagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
 
Особенности архитектуры и траблшутинга маршрутизаторов серии ASR1000
Особенности архитектуры и траблшутинга маршрутизаторов серии ASR1000Особенности архитектуры и траблшутинга маршрутизаторов серии ASR1000
Особенности архитектуры и траблшутинга маршрутизаторов серии ASR1000
 
Nat
NatNat
Nat
 
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitchDPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
 
Microprocessors-based systems (under graduate course) Lecture 9 of 9
Microprocessors-based systems (under graduate course) Lecture 9 of 9 Microprocessors-based systems (under graduate course) Lecture 9 of 9
Microprocessors-based systems (under graduate course) Lecture 9 of 9
 
Cisco EuroMPI'13 vendor session presentation
Cisco EuroMPI'13 vendor session presentationCisco EuroMPI'13 vendor session presentation
Cisco EuroMPI'13 vendor session presentation
 
HP C7000 Cconfiguration Guide v.10
HP C7000 Cconfiguration Guide v.10HP C7000 Cconfiguration Guide v.10
HP C7000 Cconfiguration Guide v.10
 
001 network toi_basics_v1
001 network toi_basics_v1001 network toi_basics_v1
001 network toi_basics_v1
 
How deep is your buffer – Demystifying buffers and application performance
How deep is your buffer – Demystifying buffers and application performanceHow deep is your buffer – Demystifying buffers and application performance
How deep is your buffer – Demystifying buffers and application performance
 
World IPv6 Day in indonesia
World IPv6 Day in indonesiaWorld IPv6 Day in indonesia
World IPv6 Day in indonesia
 

More from LF Events

Feature rich BTRFS is Getting Richer with Encryption
Feature rich BTRFS is Getting Richer with EncryptionFeature rich BTRFS is Getting Richer with Encryption
Feature rich BTRFS is Getting Richer with Encryption
LF Events
 
KASan in a Bare-Metal Hypervisor
 KASan in a Bare-Metal Hypervisor  KASan in a Bare-Metal Hypervisor
KASan in a Bare-Metal Hypervisor
LF Events
 
Efficient kernel backporting
Efficient kernel backportingEfficient kernel backporting
Efficient kernel backporting
LF Events
 
Raspberry pi Update - Encourage your IOT
Raspberry pi Update - Encourage your IOTRaspberry pi Update - Encourage your IOT
Raspberry pi Update - Encourage your IOT
LF Events
 
Introduction to Open-O
Introduction to Open-OIntroduction to Open-O
Introduction to Open-O
LF Events
 
CNCF and Fujitsu
CNCF and FujitsuCNCF and Fujitsu
CNCF and Fujitsu
LF Events
 
Linxu conj2016 96boards
Linxu conj2016 96boardsLinxu conj2016 96boards
Linxu conj2016 96boards
LF Events
 
Taking over to the Next Generation
Taking over to the Next GenerationTaking over to the Next Generation
Taking over to the Next Generation
LF Events
 
Learning From Real Practice of Providing Highly Available Hybrid Cloud Servic...
Learning From Real Practice of Providing Highly Available Hybrid Cloud Servic...Learning From Real Practice of Providing Highly Available Hybrid Cloud Servic...
Learning From Real Practice of Providing Highly Available Hybrid Cloud Servic...
LF Events
 
Generating a Reproducible and Maintainable Embedded Linux Environment with Po...
Generating a Reproducible and Maintainable Embedded Linux Environment with Po...Generating a Reproducible and Maintainable Embedded Linux Environment with Po...
Generating a Reproducible and Maintainable Embedded Linux Environment with Po...
LF Events
 
Secure IOT Gateway
Secure IOT GatewaySecure IOT Gateway
Secure IOT Gateway
LF Events
 
Trading Derivatives on Hyperledger
Trading Derivatives on HyperledgerTrading Derivatives on Hyperledger
Trading Derivatives on Hyperledger
LF Events
 
Introducing Oracle Linux and Securing It With ksplice
Introducing Oracle Linux and Securing It With kspliceIntroducing Oracle Linux and Securing It With ksplice
Introducing Oracle Linux and Securing It With ksplice
LF Events
 
Containers: Don't Skeu Them Up, Use Microservices Instead
Containers: Don't Skeu Them Up, Use Microservices InsteadContainers: Don't Skeu Them Up, Use Microservices Instead
Containers: Don't Skeu Them Up, Use Microservices Instead
LF Events
 

More from LF Events (14)

Feature rich BTRFS is Getting Richer with Encryption
Feature rich BTRFS is Getting Richer with EncryptionFeature rich BTRFS is Getting Richer with Encryption
Feature rich BTRFS is Getting Richer with Encryption
 
KASan in a Bare-Metal Hypervisor
 KASan in a Bare-Metal Hypervisor  KASan in a Bare-Metal Hypervisor
KASan in a Bare-Metal Hypervisor
 
Efficient kernel backporting
Efficient kernel backportingEfficient kernel backporting
Efficient kernel backporting
 
Raspberry pi Update - Encourage your IOT
Raspberry pi Update - Encourage your IOTRaspberry pi Update - Encourage your IOT
Raspberry pi Update - Encourage your IOT
 
Introduction to Open-O
Introduction to Open-OIntroduction to Open-O
Introduction to Open-O
 
CNCF and Fujitsu
CNCF and FujitsuCNCF and Fujitsu
CNCF and Fujitsu
 
Linxu conj2016 96boards
Linxu conj2016 96boardsLinxu conj2016 96boards
Linxu conj2016 96boards
 
Taking over to the Next Generation
Taking over to the Next GenerationTaking over to the Next Generation
Taking over to the Next Generation
 
Learning From Real Practice of Providing Highly Available Hybrid Cloud Servic...
Learning From Real Practice of Providing Highly Available Hybrid Cloud Servic...Learning From Real Practice of Providing Highly Available Hybrid Cloud Servic...
Learning From Real Practice of Providing Highly Available Hybrid Cloud Servic...
 
Generating a Reproducible and Maintainable Embedded Linux Environment with Po...
Generating a Reproducible and Maintainable Embedded Linux Environment with Po...Generating a Reproducible and Maintainable Embedded Linux Environment with Po...
Generating a Reproducible and Maintainable Embedded Linux Environment with Po...
 
Secure IOT Gateway
Secure IOT GatewaySecure IOT Gateway
Secure IOT Gateway
 
Trading Derivatives on Hyperledger
Trading Derivatives on HyperledgerTrading Derivatives on Hyperledger
Trading Derivatives on Hyperledger
 
Introducing Oracle Linux and Securing It With ksplice
Introducing Oracle Linux and Securing It With kspliceIntroducing Oracle Linux and Securing It With ksplice
Introducing Oracle Linux and Securing It With ksplice
 
Containers: Don't Skeu Them Up, Use Microservices Instead
Containers: Don't Skeu Them Up, Use Microservices InsteadContainers: Don't Skeu Them Up, Use Microservices Instead
Containers: Don't Skeu Them Up, Use Microservices Instead
 

Recently uploaded

FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Accelerate your Kubernetes clusters with Varnish Caching
Boost UDP Transaction Performance

  • 1. Copyright © 2016 NTT Corp. All Rights Reserved. Boost UDP Transaction Performance Toshiaki Makita NTT Open Source Software Center
  • 2. 2Copyright © 2016 NTT Corp. All Rights Reserved. • Background • Basic technologies for network performance • How to improve UDP performance Today's topics
  • 3. 3Copyright © 2016 NTT Corp. All Rights Reserved. • Linux kernel engineer at NTT Open Source Software Center • Technical support for NTT group companies • Active patch submitter on kernel networking subsystem Who is Toshiaki Makita?
  • 4. 4Copyright © 2016 NTT Corp. All Rights Reserved. Background
  • 5. 5Copyright © 2016 NTT Corp. All Rights Reserved. • Services using UDP • DNS • RADIUS • NTP • SNMP • ... • Heavily used by network service providers UDP transactions in the Internet
  • 6. 6Copyright © 2016 NTT Corp. All Rights Reserved. • Ethernet bandwidth evolution • 10M -> 100M -> 1G -> 10G -> 40G -> 100G -> ... • 10G (or more) NICs are getting common on commodity servers • Transactions in 10G network • In the shortest packet case: • Maximum 14,880,952 packets/s*1 • Getting hard to handle in a single server... Ethernet Bandwidth and Transactions *1 shortest ethernet frame size 64bytes + preamble+IFG 20bytes = 84 bytes = 672 bits 10,000,000,000 / 672 = 14,880,952
  • 7. 7Copyright © 2016 NTT Corp. All Rights Reserved. • UDP payload sizes • DNS • A/AAAA query: 40~ bytes • A/AAAA response: 100~ bytes • RADIUS • Access-Request: 70~ bytes • Access-Accept: 30~ bytes • Typically 100~ bytes with some attributes • In many cases 100~ bytes • 100 bytes transactions in 10G network • Max 7,530,120 transactions/s*1 • Less than shortest packet case, but still challenging How many transactions to handle? *1 100 bytes + IP/UDP/Ether headers 46bytes + preamble+IFG 20bytes = 166 bytes = 1328 bits 10,000,000,000 / 1328 = 7,530,120
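The two footnote calculations (slides 6 and 7) can be reproduced with a few lines of C. This is only a sketch: the function and constant names are made up for illustration, and it assumes IPv4 without options and untagged Ethernet, which is what the 46-byte header figure on the slide implies.

```c
#include <stdint.h>

/* Per-packet overhead on the wire, in bytes. */
enum {
    WIRE_ETH_HDR_FCS  = 18, /* 14-byte Ethernet header + 4-byte FCS */
    WIRE_IPV4_HDR     = 20, /* IPv4 header without options */
    WIRE_UDP_HDR      = 8,
    WIRE_PREAMBLE_IFG = 20, /* preamble + inter-frame gap, as on the slide */
    WIRE_MIN_FRAME    = 64  /* shortest Ethernet frame, FCS included */
};

/* Bytes one UDP datagram occupies on the wire (hypothetical helper). */
uint64_t udp_wire_bytes(uint64_t payload)
{
    uint64_t frame = payload + WIRE_UDP_HDR + WIRE_IPV4_HDR + WIRE_ETH_HDR_FCS;
    if (frame < WIRE_MIN_FRAME)
        frame = WIRE_MIN_FRAME; /* short frames are padded to 64 bytes */
    return frame + WIRE_PREAMBLE_IFG;
}

/* Upper bound on packets per second for a link speed given in bits/s. */
uint64_t max_packets_per_sec(uint64_t link_bps, uint64_t wire_bytes)
{
    return link_bps / (wire_bytes * 8);
}
```

With a 10G link, `max_packets_per_sec(10000000000ULL, 84)` gives the shortest-frame figure and `max_packets_per_sec(10000000000ULL, udp_wire_bytes(100))` gives the 100-byte transaction figure.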
  • 8. 8Copyright © 2016 NTT Corp. All Rights Reserved. Basic technologies for network performance (not only for UDP)
  • 9. 9Copyright © 2016 NTT Corp. All Rights Reserved. • TSO/GSO/GRO • Packet segmentation/aggregation • Reduce packets to process within server • Applicable to TCP*1 (byte stream) • Not applicable to UDP*2 (datagram) • UDP has explicit boundary between datagrams • Cannot segment/aggregate packets Basic technologies for network performance TCP UDP byte stream datagram TSO/GSO (segmentation) GRO (aggregation) MTU size Tx Server Rx Server :< *1 TCP in UDP tunneling (e.g. VXLAN) is OK as well *2 Other than UFO, which is rarely implemented on physical NICs Great performance gain! Not applicable
  • 10. 10Copyright © 2016 NTT Corp. All Rights Reserved. • RSS • Scale network Rx processing in multi-core server • RSS itself is a NIC feature • Distribute packets to multi-queue in a NIC • Each queue has a different interrupt vector (Packets on each queue can be processed by different core) • Applicable to TCP/UDP • Common 10G NICs have RSS Basic technologies for network performance NIC rx queue rx queue Core Core packets RSS :) interrupt
  • 11. • 100-byte UDP transaction performance • Measured with a simple*1 multi-threaded echo server • OS: kernel 4.6.3 (in RHEL 7.2 environment) • Mid-range commodity server with 20 cores and 10G NIC: • NIC: Intel 82599ES (has RSS, max 64 queues) • CPU: Xeon E5-2650 v3 (2.3 GHz, 10 cores) * 2 sockets, hyper-threading off (to keep the analysis easy; enabled later) • Result: 270,000 transactions/s (tps) (approx. 360Mbps) • 3.6% utilization of 10G bandwidth Performance with RSS enabled [Diagram: bulk 100-byte UDP packets*2 arrive at the NIC; 20 echo server threads share one UDP socket and echo each packet back] *1 Creates as many threads as there are cores; each thread just calls recvfrom() and sendto(). *2 There is only 1 client (IP address). To spread UDP traffic on the NIC, RSS is configured to hash on UDP port numbers. This setting is not needed for common UDP servers.
  • 12. 12Copyright © 2016 NTT Corp. All Rights Reserved. How to improve this?
  • 13. 13Copyright © 2016 NTT Corp. All Rights Reserved. • sar -u ALL -P ALL 1 • softirq (interrupt processing) is performed only on NUMA Node 0, why? • although we have enough (64) queues for 20 cores... Identify bottleneck 19:57:54 CPU %usr %nice %sys %iowait %steal %irq %soft %guest %gnice %idle 19:57:54 all 0.37 0.00 42.58 0.00 0.00 0.00 50.00 0.00 0.00 7.05 19:57:54 0 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 19:57:54 1 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 19:57:54 2 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 19:57:54 3 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 19:57:54 4 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 19:57:54 5 1.82 0.00 83.64 0.00 0.00 0.00 0.00 0.00 0.00 14.55 19:57:54 6 0.00 0.00 87.04 0.00 0.00 0.00 0.00 0.00 0.00 12.96 19:57:54 7 0.00 0.00 85.19 0.00 0.00 0.00 0.00 0.00 0.00 14.81 19:57:54 8 0.00 0.00 85.45 0.00 0.00 0.00 0.00 0.00 0.00 14.55 19:57:54 9 0.00 0.00 85.19 0.00 0.00 0.00 0.00 0.00 0.00 14.81 19:57:54 10 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 19:57:54 11 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 19:57:54 12 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 19:57:54 13 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 19:57:54 14 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 19:57:54 15 1.82 0.00 83.64 0.00 0.00 0.00 0.00 0.00 0.00 14.55 19:57:54 16 0.00 0.00 87.04 0.00 0.00 0.00 0.00 0.00 0.00 12.96 19:57:54 17 1.82 0.00 83.64 0.00 0.00 0.00 0.00 0.00 0.00 14.55 19:57:54 18 0.00 0.00 85.45 0.00 0.00 0.00 0.00 0.00 0.00 14.55 19:57:54 19 0.00 0.00 85.45 0.00 0.00 0.00 0.00 0.00 0.00 14.55 Node 0 Node 1
  • 14. 14Copyright © 2016 NTT Corp. All Rights Reserved. • RSS distributes packets to rx-queues • Interrupt destination of each queue is determined by /proc/irq/<irq>/smp_affinity • smp_affinity is usually set by irqbalance daemon softirq (interrupt processing) with RSS NIC rx queue rx queue Core Core packets RSS smp_affinity interrupt
  • 15. 15Copyright © 2016 NTT Corp. All Rights Reserved. • smp_affinity*1 • irqbalance is using only Node 0 (cores 0-4, 10-14) • Can we change this? Check smp_affinity $ for ((irq=105; irq<=124; irq++)); do > cat /proc/irq/$irq/smp_affinity > done 01000 -> 12 -> Node 0 00800 -> 11 -> Node 0 00400 -> 10 -> Node 0 00400 -> 10 -> Node 0 01000 -> 12 -> Node 0 04000 -> 14 -> Node 0 00400 -> 10 -> Node 0 00010 -> 4 -> Node 0 00004 -> 2 -> Node 0 02000 -> 13 -> Node 0 04000 -> 14 -> Node 0 00001 -> 0 -> Node 0 02000 -> 13 -> Node 0 01000 -> 12 -> Node 0 00008 -> 3 -> Node 0 00800 -> 11 -> Node 0 00800 -> 11 -> Node 0 04000 -> 14 -> Node 0 00800 -> 11 -> Node 0 02000 -> 13 -> Node 0 *1 irq number can be obtained from /proc/interrupts
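The "mask -> core" mapping shown on this slide is just the position of the set bit in the hexadecimal mask. A small sketch of the decoding follows; the helper name is hypothetical, and it assumes a mask of at most 64 CPUs written as a single hex string (larger machines print comma-separated groups, which this does not handle).

```c
#include <stdlib.h>

/* Return the lowest CPU number set in a hex mask string such as "01000",
 * or -1 if no bit is set. */
int first_cpu_in_mask(const char *hex)
{
    unsigned long long mask = strtoull(hex, NULL, 16);

    for (int cpu = 0; cpu < 64; cpu++)
        if (mask & (1ULL << cpu))
            return cpu;
    return -1;
}
```

For example, `first_cpu_in_mask("01000")` is 12 and `first_cpu_in_mask("00400")` is 10, matching the first entries listed above.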
  • 16. 16Copyright © 2016 NTT Corp. All Rights Reserved. • Some NIC drivers provide affinity_hint • affinity_hint is evenly distributed • To honor the hint, add "-h exact" option to irqbalance (via /etc/sysconfig/irqbalance, etc.)*1 Check affinity_hint $ for ((irq=105; irq<=124; irq++)); do > cat /proc/irq/$irq/affinity_hint > done 00001 -> 0 00002 -> 1 00004 -> 2 00008 -> 3 00010 -> 4 00020 -> 5 00040 -> 6 00080 -> 7 00100 -> 8 00200 -> 9 00400 -> 10 00800 -> 11 01000 -> 12 02000 -> 13 04000 -> 14 08000 -> 15 10000 -> 16 20000 -> 17 40000 -> 18 80000 -> 19 *1 If your NIC doesn't provide hint, you can use "-i" option or stop irqbalance to set their affinity manually
  • 17. 17Copyright © 2016 NTT Corp. All Rights Reserved. • Added "-h exact" and restarted irqbalance • With hint honored, irqs are distributed to all cores Change irqbalance option $ for ((irq=105; irq<=124; irq++)); do > cat /proc/irq/$irq/smp_affinity > done 00001 -> 0 00002 -> 1 00004 -> 2 00008 -> 3 00010 -> 4 00020 -> 5 00040 -> 6 00080 -> 7 00100 -> 8 00200 -> 9 00400 -> 10 00800 -> 11 01000 -> 12 02000 -> 13 04000 -> 14 08000 -> 15 10000 -> 16 20000 -> 17 40000 -> 18 80000 -> 19
  • 18. • sar -u ALL -P ALL 1 • Although irqs look evenly distributed, cores 16-19 are not used for softirq... • NUMA nodes look irrelevant this time Change irqbalance option
      Time      CPU   %usr  %nice    %sys  %iowait  %steal   %irq    %soft  %guest  %gnice   %idle
      20:06:07  all   0.00   0.00   19.18     0.00    0.00   0.00    80.82    0.00    0.00    0.00
      20:06:07    0   0.00   0.00    0.00     0.00    0.00   0.00   100.00    0.00    0.00    0.00
      20:06:07    1   0.00   0.00    0.00     0.00    0.00   0.00   100.00    0.00    0.00    0.00
      20:06:07    2   0.00   0.00    0.00     0.00    0.00   0.00   100.00    0.00    0.00    0.00
      20:06:07    3   0.00   0.00    0.00     0.00    0.00   0.00   100.00    0.00    0.00    0.00
      20:06:07    4   0.00   0.00    0.00     0.00    0.00   0.00   100.00    0.00    0.00    0.00
      20:06:07    5   0.00   0.00    0.00     0.00    0.00   0.00   100.00    0.00    0.00    0.00
      20:06:07    6   0.00   0.00    0.00     0.00    0.00   0.00   100.00    0.00    0.00    0.00
      20:06:07    7   0.00   0.00    0.00     0.00    0.00   0.00   100.00    0.00    0.00    0.00
      20:06:07    8   0.00   0.00    0.00     0.00    0.00   0.00   100.00    0.00    0.00    0.00
      20:06:07    9   0.00   0.00    0.00     0.00    0.00   0.00   100.00    0.00    0.00    0.00
      20:06:07   10   0.00   0.00    0.00     0.00    0.00   0.00   100.00    0.00    0.00    0.00
      20:06:07   11   0.00   0.00    0.00     0.00    0.00   0.00   100.00    0.00    0.00    0.00
      20:06:07   12   0.00   0.00    0.00     0.00    0.00   0.00   100.00    0.00    0.00    0.00
      20:06:07   13   0.00   0.00    0.00     0.00    0.00   0.00   100.00    0.00    0.00    0.00
      20:06:07   14   0.00   0.00    0.00     0.00    0.00   0.00   100.00    0.00    0.00    0.00
      20:06:07   15   0.00   0.00    0.00     0.00    0.00   0.00   100.00    0.00    0.00    0.00
      20:06:07   16   0.00   0.00  100.00     0.00    0.00   0.00     0.00    0.00    0.00    0.00
      20:06:07   17   0.00   0.00  100.00     0.00    0.00   0.00     0.00    0.00    0.00    0.00
      20:06:07   18   0.00   0.00  100.00     0.00    0.00   0.00     0.00    0.00    0.00    0.00
      20:06:07   19   0.00   0.00   93.33     0.00    0.00   0.00     6.67    0.00    0.00    0.00
  • 19. 19Copyright © 2016 NTT Corp. All Rights Reserved. • ethtool -S*1 • Revealed RSS has not distributed packets to queues 16-19 Check rx-queue stats $ ethtool -S ens1f0 | grep 'rx_queue_.*_packets' rx_queue_0_packets: 198005155 rx_queue_1_packets: 153339750 rx_queue_2_packets: 162870095 rx_queue_3_packets: 172303801 rx_queue_4_packets: 153728776 rx_queue_5_packets: 158138563 rx_queue_6_packets: 164411653 rx_queue_7_packets: 165924489 rx_queue_8_packets: 176545406 rx_queue_9_packets: 165340188 rx_queue_10_packets: 150279834 rx_queue_11_packets: 150983782 rx_queue_12_packets: 157623687 rx_queue_13_packets: 150743910 rx_queue_14_packets: 158634344 rx_queue_15_packets: 158497890 rx_queue_16_packets: 4 rx_queue_17_packets: 3 rx_queue_18_packets: 0 rx_queue_19_packets: 8 *1 Output format depends on drivers
  • 20. 20Copyright © 2016 NTT Corp. All Rights Reserved. • RSS has indirection table which determines to which queue it spreads packets • Can be shown by ethtool -x • Only rx-queue 0-15 are used, 16-19 not used RSS Indirection Table $ ethtool -x ens1f0 RX flow hash indirection table for ens1f0 with 20 RX ring(s): 0: 0 1 2 3 4 5 6 7 8: 8 9 10 11 12 13 14 15 16: 0 1 2 3 4 5 6 7 24: 8 9 10 11 12 13 14 15 32: 0 1 2 3 4 5 6 7 40: 8 9 10 11 12 13 14 15 48: 0 1 2 3 4 5 6 7 56: 8 9 10 11 12 13 14 15 64: 0 1 2 3 4 5 6 7 72: 8 9 10 11 12 13 14 15 80: 0 1 2 3 4 5 6 7 88: 8 9 10 11 12 13 14 15 96: 0 1 2 3 4 5 6 7 104: 8 9 10 11 12 13 14 15 112: 0 1 2 3 4 5 6 7 120: 8 9 10 11 12 13 14 15 flow hash (hash value from packet header) rx-queue number
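A toy model of the indirection table makes it clear why queues 16-19 stay idle. The sketch below fills a 128-entry table the way the `ethtool -x` output above looks and then checks which queues are reachable. The names are hypothetical, and indexing the table with the low bits of the flow hash is an assumption about the hardware, not something the slide states.

```c
#include <stddef.h>
#include <stdint.h>

#define RETA_SIZE 128 /* number of entries in the ethtool -x output above */

/* Fill the table so that entry i points at queue i % nqueues. */
void reta_fill_equal(uint8_t table[RETA_SIZE], unsigned nqueues)
{
    for (size_t i = 0; i < RETA_SIZE; i++)
        table[i] = (uint8_t)(i % nqueues);
}

/* Assumption: the flow hash modulo the table size selects the entry. */
unsigned reta_lookup(const uint8_t table[RETA_SIZE], uint32_t flow_hash)
{
    return table[flow_hash % RETA_SIZE];
}

/* Highest queue number that any hash value can reach. */
unsigned reta_max_queue(const uint8_t table[RETA_SIZE])
{
    unsigned max = 0;

    for (size_t i = 0; i < RETA_SIZE; i++)
        if (table[i] > max)
            max = table[i];
    return max;
}
```

After `reta_fill_equal(table, 16)`, `reta_max_queue(table)` is 15: no hash value can ever land on queues 16-19, however many rx rings are allocated.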
  • 21. 21Copyright © 2016 NTT Corp. All Rights Reserved. • Change to use all 0-19? • This NIC's max rx-queues in the indirection table is actually 16 so we cannot use 20 queues • although we have 64 rx-queues... • Use RPS instead • Software emulation of RSS RSS Indirection Table # ethtool -X ens1f0 equal 20 Cannot set RX flow hash configuration: Invalid argument NIC rx queue rx queue Core Core packets RSS Core softirq backlog softirq backlog softirq backlog RPS softirq interrupt
  • 22. 22Copyright © 2016 NTT Corp. All Rights Reserved. • This time I spread flows from rx-queue 6-9 to core 6-9 and 16-19 • Because they are all in Node 1 • rx-queue 6 -> core 6, 16 • rx-queue 7 -> core 7, 17 • rx-queue 8 -> core 8, 18 • rx-queue 9 -> core 9, 19 Use RPS # echo 10040 > /sys/class/net/ens1f0/queues/rx-6/rps_cpus # echo 20080 > /sys/class/net/ens1f0/queues/rx-7/rps_cpus # echo 40100 > /sys/class/net/ens1f0/queues/rx-8/rps_cpus # echo 80200 > /sys/class/net/ens1f0/queues/rx-9/rps_cpus
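The values written to rps_cpus above are CPU bitmasks: bit N set means core N may process packets from that queue. The sketch below shows how such a mask can be derived from a list of cores; the helper names are hypothetical and the five-digit zero padding only mirrors the style used on the slides.

```c
#include <stdint.h>
#include <stdio.h>

/* OR together one bit per CPU number (CPUs 0-63 only). */
uint64_t cpu_mask(const int *cpus, int n)
{
    uint64_t mask = 0;

    for (int i = 0; i < n; i++)
        mask |= UINT64_C(1) << cpus[i];
    return mask;
}

/* Format the mask as hex without a "0x" prefix, e.g. "10040". */
int format_cpu_mask(char *buf, size_t len, uint64_t mask)
{
    return snprintf(buf, len, "%05llx", (unsigned long long)mask);
}
```

For rx-queue 6, cores 6 and 16 give `cpu_mask((int[]){6, 16}, 2) == 0x10040`, which is the value echoed into rx-6/rps_cpus above.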
  • 23. 23Copyright © 2016 NTT Corp. All Rights Reserved. • sar -u ALL -P ALL 1 • softirq is almost evenly distributed Use RPS 20:18:53 CPU %usr %nice %sys %iowait %steal %irq %soft %guest %gnice %idle 20:18:54 all 0.00 0.00 2.38 0.00 0.00 0.00 97.62 0.00 0.00 0.00 20:18:54 0 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 20:18:54 1 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 20:18:54 2 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 20:18:54 3 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 20:18:54 4 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 20:18:54 5 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 20:18:54 6 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 20:18:54 7 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 20:18:54 8 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 20:18:54 9 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 20:18:54 10 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 20:18:54 11 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 20:18:54 12 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 20:18:54 13 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 20:18:54 14 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 20:18:54 15 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 20:18:54 16 0.00 0.00 15.56 0.00 0.00 0.00 84.44 0.00 0.00 0.00 20:18:54 17 0.00 0.00 6.98 0.00 0.00 0.00 93.02 0.00 0.00 0.00 20:18:54 18 0.00 0.00 18.18 0.00 0.00 0.00 81.82 0.00 0.00 0.00 20:18:54 19 2.27 0.00 6.82 0.00 0.00 0.00 90.91 0.00 0.00 0.00
  • 24. • Thanks to affinity_hint and RPS, we have now succeeded in spreading flows almost evenly • Performance change • Before: 270,000 tps (approx. 360Mbps) • After: 17,000 tps (approx. 23Mbps) • Got worse... • The likely reason is that softirq is too heavy • softirq is almost 100% in total • Need finer-grained profiling than sar RSS & affinity_hint & RPS
  • 25. 25Copyright © 2016 NTT Corp. All Rights Reserved. • perf • Profiling tool developed in kernel tree • Identify hot spots by sampling CPU cycles • Example usage of perf • perf record -a -g -- sleep 5 • Save sampling results for 5 seconds to perf.data file • FlameGraph • Visualize perf.data in svg format • https://github.com/brendangregg/FlameGraph Profile softirq
  • 26. • FlameGraph of CPU0*1 • x-axis (width): CPU consumption • y-axis (height): depth of call stack • queued_spin_lock_slowpath: the lock is contended • udp_queue_rcv_skb: acquires the socket lock Profile softirq [FlameGraph labels: queued_spin_lock_slowpath, udp_queue_rcv_skb] *1 I filtered the call stacks under irq context from the output of perf script to make the chart easier to read. irq context is shown as "interrupt" here.
  • 27. 27Copyright © 2016 NTT Corp. All Rights Reserved. • Echo server has only one socket bound to a certain port • softirq of each core pushes packets into socket queue concurrently • socket lock gets contended Socket lock contention NIC rx queue rx queue Core Core packets Core softirq backlog softirq backlog softirq backlog softirq UDP socket lock contention!! interrupt RPS
  • 28. • Split sockets by SO_REUSEPORT • Introduced in kernel 3.9 • SO_REUSEPORT allows multiple UDP sockets to bind the same port • One of the sockets is chosen by a socket selector*1 when each packet is queued Avoid lock contention [Diagram: after RPS, the softirq on each core passes packets to the socket selector, which queues them to one of several UDP sockets] int on = 1; int sock = socket(AF_INET, SOCK_DGRAM, 0); setsockopt(sock, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on)); bind(sock, ...); *1 By default a socket is selected by flow (packet header) hash
  • 29. • sar -u ALL -P ALL 1 • CPU consumption in softirq became somewhat more reasonable Use SO_REUSEPORT
      Time      CPU   %usr  %nice    %sys  %iowait  %steal   %irq    %soft  %guest  %gnice   %idle
      20:44:34  all   3.26   0.00   37.23     0.00    0.00   0.00    59.52    0.00    0.00    0.00
      20:44:34    0   3.33   0.00   28.33     0.00    0.00   0.00    68.33    0.00    0.00    0.00
      20:44:34    1   3.33   0.00   25.00     0.00    0.00   0.00    71.67    0.00    0.00    0.00
      20:44:34    2   1.67   0.00   23.33     0.00    0.00   0.00    75.00    0.00    0.00    0.00
      20:44:34    3   3.28   0.00   32.79     0.00    0.00   0.00    63.93    0.00    0.00    0.00
      20:44:34    4   3.33   0.00   33.33     0.00    0.00   0.00    63.33    0.00    0.00    0.00
      20:44:34    5   1.69   0.00   23.73     0.00    0.00   0.00    74.58    0.00    0.00    0.00
      20:44:34    6   3.28   0.00   50.82     0.00    0.00   0.00    45.90    0.00    0.00    0.00
      20:44:34    7   3.45   0.00   50.00     0.00    0.00   0.00    46.55    0.00    0.00    0.00
      20:44:34    8   1.69   0.00   37.29     0.00    0.00   0.00    61.02    0.00    0.00    0.00
      20:44:34    9   1.67   0.00   33.33     0.00    0.00   0.00    65.00    0.00    0.00    0.00
      20:44:34   10   1.69   0.00   18.64     0.00    0.00   0.00    79.66    0.00    0.00    0.00
      20:44:34   11   3.23   0.00   35.48     0.00    0.00   0.00    61.29    0.00    0.00    0.00
      20:44:34   12   1.69   0.00   27.12     0.00    0.00   0.00    71.19    0.00    0.00    0.00
      20:44:34   13   1.67   0.00   21.67     0.00    0.00   0.00    76.67    0.00    0.00    0.00
      20:44:34   14   1.67   0.00   21.67     0.00    0.00   0.00    76.67    0.00    0.00    0.00
      20:44:34   15   3.33   0.00   35.00     0.00    0.00   0.00    61.67    0.00    0.00    0.00
      20:44:34   16   6.67   0.00   68.33     0.00    0.00   0.00    25.00    0.00    0.00    0.00
      20:44:34   17   5.00   0.00   65.00     0.00    0.00   0.00    30.00    0.00    0.00    0.00
      20:44:34   18   6.78   0.00   54.24     0.00    0.00   0.00    38.98    0.00    0.00    0.00
      20:44:34   19   4.92   0.00   63.93     0.00    0.00   0.00    31.15    0.00    0.00    0.00
  • 30. 30Copyright © 2016 NTT Corp. All Rights Reserved. • before • after Userspace starts to work Use SO_REUSEPORT Interrupt processing (irq) Interrupt processing (irq) userspace thread (sys, user)
  • 31. • Performance change • RSS: 270,000 tps (approx. 360Mbps) • +affinity_hint+RPS: 17,000 tps (approx. 23Mbps) • +SO_REUSEPORT: 2,540,000 tps (approx. 3370Mbps) • Great improvement! • but... Use SO_REUSEPORT
  • 32. 32Copyright © 2016 NTT Corp. All Rights Reserved. • More analysis • Socket lock is still contended Use SO_REUSEPORT queued_spin_lock_slowpath
  • 33. • By default SO_REUSEPORT selects the socket (and thus its queue) by flow hash • The same socket can be selected by different cores • The socket lock still gets contended Socket lock contention again [Diagram: softirqs on Core0-Core2 pass packets to the socket selector, which picks among UDP socket0-socket2 by flow hash, so two cores can hit the same socket: lock contention]
  • 34. • Select socket by core number • Realized by SO_ATTACH_REUSEPORT_CBPF/EBPF*1 • Introduced in kernel 4.5 • No lock contention between softirqs • Usage • See the example in the kernel source tree • tools/testing/selftests/net/reuseport_bpf_cpu.c Avoid socket lock contention [Diagram: the socket selector picks UDP socket N for packets processed by the softirq on Core N] *1 BPF allows much more flexible logic, but this time only the CPU number is used
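For orientation, a classic-BPF selector in the spirit of the kernel selftest named on this slide could look like the sketch below. It is not the selftest itself: the function names are made up, the fallback value 51 for SO_ATTACH_REUSEPORT_CBPF is an assumption that holds for the generic socket ABI only, and the note about socket ordering reflects my understanding rather than anything stated in the deck.

```c
#include <linux/filter.h>
#include <sys/socket.h>

#ifndef SO_ATTACH_REUSEPORT_CBPF
#define SO_ATTACH_REUSEPORT_CBPF 51 /* assumption: asm-generic value, for old headers */
#endif

/* Build a two-instruction classic BPF program: "return current CPU number". */
void build_cpu_selector(struct sock_filter code[2])
{
    /* A = number of the CPU running the receive softirq */
    code[0] = (struct sock_filter){ BPF_LD | BPF_W | BPF_ABS, 0, 0,
                                    SKF_AD_OFF + SKF_AD_CPU };
    /* return A; the kernel uses it as an index into the reuseport group */
    code[1] = (struct sock_filter){ BPF_RET | BPF_A, 0, 0, 0 };
}

/* Attach the selector to a socket that already has SO_REUSEPORT set.
 * Returns 0 on success, -1 on error (as setsockopt does). */
int attach_cpu_selector(int sock)
{
    struct sock_filter code[2];
    struct sock_fprog prog = { .len = 2, .filter = code };

    build_cpu_selector(code);
    return setsockopt(sock, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF,
                      &prog, sizeof(prog));
}
```

Because the program returns an index, the sockets presumably need to be opened in core order (socket 0 first, then socket 1, and so on) so that index N corresponds to the socket served by the thread on core N. Attaching the program to one socket of the group should be enough, as the selftest does.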
  • 35. • before • after: irq overhead decreases Use SO_ATTACH_REUSEPORT_EBPF [FlameGraphs before and after: interrupt processing (irq) vs. userspace thread (sys, user)]
  • 36. • Performance change • RSS: 270,000 tps (approx. 360Mbps) • +affinity_hint+RPS: 17,000 tps (approx. 23Mbps) • +SO_REUSEPORT: 2,540,000 tps (approx. 3370Mbps) • +SO_ATTACH_...: 4,250,000 tps (approx. 5640Mbps) Use SO_ATTACH_REUSEPORT_EBPF
  • 37. 37Copyright © 2016 NTT Corp. All Rights Reserved. • Userspace threads : sockets == 1 : 1 • No lock contention • But not necessarily on the same core as softirq • Pin userspace thread on the same core for better cache affinity • cgroup, taskset, pthread_setaffinity_np(), ... any way you like Pin userspace threads NIC rx queue rx queue Core1 Core2 packets Core0 softirq backlog softirq backlog softirq backlog UDP socket0 UDP socket1 UDP socket2 interrupt thread1 thread2 thread0 userspace kernel softirq NIC rx queue rx queue Core1 Core2 packets Core0 softirq backlog softirq backlog softirq backlog UDP socket0 UDP socket1 UDP socket2 interrupt thread0 thread1 thread2 userspace kernel softirq 1:1 pin RPS RPS
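Pinning can be done from inside the thread itself. The sketch below uses sched_setaffinity(2), the system call behind taskset, instead of pthread_setaffinity_np(), purely to keep the example free of extra link flags; with pid 0 it affects only the calling thread. The function name is hypothetical.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to a single CPU.
 * Returns 0 on success, -1 on error. */
int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 means "the calling thread" */
    return sched_setaffinity(0, sizeof(set), &set);
}
```

In the 1:1 layout on this slide, the thread that owns socket N would call `pin_self_to_cpu(N)` before entering its receive loop.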
  • 38. • Performance change • RSS: 270,000 tps (approx. 360Mbps) • +affinity_hint+RPS: 17,000 tps (approx. 23Mbps) • +SO_REUSEPORT: 2,540,000 tps (approx. 3370Mbps) • +SO_ATTACH_...: 4,250,000 tps (approx. 5640Mbps) • +Pin threads: 5,050,000 tps (approx. 6710Mbps) Pin userspace threads
  • 39. 39Copyright © 2016 NTT Corp. All Rights Reserved. • So far everything has been about Rx • No lock contention on Tx? Tx lock contention?
  • 40. 40Copyright © 2016 NTT Corp. All Rights Reserved. • kernel has Qdisc (Queueing discipline) • Each Qdisc is linked to NIC tx-queue • Each Qdisc has its lock Tx queue Core Core Qdisc queue Qdisc queue thread thread threadCore NIC tx queue tx queue tx queueQdisc queue userspace kernel
  • 41. 41Copyright © 2016 NTT Corp. All Rights Reserved. • By default Qdisc is selected by flow hash • Thus lock contention can happen • We haven't seen contention on Tx, why? Tx queue lock contention Core Core Qdisc queue Qdisc queue thread thread threadCore NIC tx queue tx queue tx queueQdisc queue userspace kernel select queue by flow hash lock contention!!
  • 42. 42Copyright © 2016 NTT Corp. All Rights Reserved. • Because ixgbe (Intel 10GbE NIC driver) has an ability to set XPS automatically Avoid Tx queue lock contention $ for ((txq=0; txq<20; txq++)); do > cat /sys/class/net/ens1f0/queues/tx-$txq/xps_cpus > done 00001 -> core 0 00002 -> core 1 00004 -> core 2 00008 -> core 3 00010 -> core 4 00020 -> core 5 00040 -> core 6 00080 -> core 7 00100 -> core 8 00200 -> core 9 00400 -> core 10 00800 -> core 11 01000 -> core 12 02000 -> core 13 04000 -> core 14 08000 -> core 15 10000 -> core 16 20000 -> core 17 40000 -> core 18 80000 -> core 19
  • 43. 43Copyright © 2016 NTT Corp. All Rights Reserved. • XPS allows kernel to select Tx queue (Qdisc) by core number • Tx has no lock contention XPS Core Core Qdisc queue Qdisc queue thread thread threadCore NIC tx queue tx queue tx queueQdisc queue userspace kernel select queue by core number
  • 44. 44Copyright © 2016 NTT Corp. All Rights Reserved. • Try disabling it • Before: 5,050,000 tps (approx. 6710Mbps) • After: 1,086,000 tps (approx. 1440Mbps) How effective is XPS? # for ((txq=0; txq<20; txq++)); do > echo 0 > /sys/class/net/ens1f0/queues/tx-$txq/xps_cpus > done
  • 45. 45Copyright © 2016 NTT Corp. All Rights Reserved. • XPS enabled • XPS disabled Disabling XPS Interrupt processing (irq) userspace thread Tx Interrupt processing (irq) userspace thread Tx userspace thread Rx userspace thread Rx queued_spin_lock_slowpath (lock contention)
  • 46. 46Copyright © 2016 NTT Corp. All Rights Reserved. • Enable XPS again • Although ixgbe can automatically set XPS, not all drivers can do that • Make sure to check xps_cpus is configured Enable XPS # echo 00001 > /sys/class/net/<NIC>/queues/tx-0/xps_cpus # echo 00002 > /sys/class/net/<NIC>/queues/tx-1/xps_cpus # echo 00004 > /sys/class/net/<NIC>/queues/tx-2/xps_cpus # echo 00008 > /sys/class/net/<NIC>/queues/tx-3/xps_cpus ...
  • 47. • By making full use of multiple cores while avoiding contention, we achieved • 5,050,000 tps (approx. 6710Mbps) • To get more performance, reduce the overhead per core Optimization per core
  • 48. 48Copyright © 2016 NTT Corp. All Rights Reserved. • GRO is enabled by default • Consuming 4.9% of CPU time Optimization per core dev_gro_receive
  • 49. • GRO is not applicable to UDP*1 • Disable it for UDP servers • WARNING: • Don't disable it if TCP performance matters • Disabling GRO makes TCP rx throughput miserably low • Don't disable it on KVM hypervisors either • GRO boosts throughput of tunneling protocol traffic as well as guests' TCP traffic on hypervisors GRO # ethtool -K <NIC> gro off *1 Other than UDP tunneling, like VXLAN
  • 50. • Performance change • RSS (+XPS): 270,000 tps (approx. 360Mbps) • +affinity_hint+RPS: 17,000 tps (approx. 23Mbps) • +SO_REUSEPORT: 2,540,000 tps (approx. 3370Mbps) • +SO_ATTACH_...: 4,250,000 tps (approx. 5640Mbps) • +Pin threads: 5,050,000 tps (approx. 6710Mbps) • +Disable GRO: 5,180,000 tps (approx. 6880Mbps) Disable GRO
  • 51. 51Copyright © 2016 NTT Corp. All Rights Reserved. • iptables-related processing (nf_iterate) is performed • Although I have not added any rule to iptables • Consuming 3.00% of CPU time Optimization per core nf_iterate nf_iterate
  • 52. 52Copyright © 2016 NTT Corp. All Rights Reserved. • With iptables kernel module loaded, even if you don't have any rules, it can incur some overhead • Some distributions load iptables module even when you don't add any rule • If you are not using iptables, unload the module iptables (netfilter) # modprobe -r iptable_filter # modprobe -r ip_tables
  • 53. • Performance change • RSS (+XPS): 270,000 tps (approx. 360Mbps) • +affinity_hint+RPS: 17,000 tps (approx. 23Mbps) • +SO_REUSEPORT: 2,540,000 tps (approx. 3370Mbps) • +SO_ATTACH_...: 4,250,000 tps (approx. 5640Mbps) • +Pin threads: 5,050,000 tps (approx. 6710Mbps) • +Disable GRO: 5,180,000 tps (approx. 6880Mbps) • +Unload iptables: 5,380,000 tps (approx. 7140Mbps) Unload iptables
  • 54. 54Copyright © 2016 NTT Corp. All Rights Reserved. • On Rx, FIB (routing table) lookup is done twice • Each is consuming 1.82%~ of CPU time Optimization per core fib_table_lookup fib_table_lookup
  • 55. • One of the two table lookups is for validating the source IP address • Reverse path filter • Local address check • If you really don't need source validation, you can skip it FIB lookup on Rx # sysctl -w net.ipv4.conf.all.rp_filter=0 # sysctl -w net.ipv4.conf.<NIC>.rp_filter=0 # sysctl -w net.ipv4.conf.all.accept_local=1
  • 56. • Performance change • RSS (+XPS): 270,000 tps (approx. 360Mbps) • +affinity_hint+RPS: 17,000 tps (approx. 23Mbps) • +SO_REUSEPORT: 2,540,000 tps (approx. 3370Mbps) • +SO_ATTACH_...: 4,250,000 tps (approx. 5640Mbps) • +Pin threads: 5,050,000 tps (approx. 6710Mbps) • +Disable GRO: 5,180,000 tps (approx. 6880Mbps) • +Unload iptables: 5,380,000 tps (approx. 7140Mbps) • +Disable validation: 5,490,000 tps (approx. 7290Mbps) Disable source validation
  • 57. • Audit is somewhat heavy at high packet rates • Consuming 2.5% of CPU time Optimization per core [FlameGraph label: audit-related processing]
  • 58. 58Copyright © 2016 NTT Corp. All Rights Reserved. • If you don't need audit, disable it Audit # systemctl disable auditd # reboot
  • 59. • Performance change • RSS (+XPS): 270,000 tps (approx. 360Mbps) • +affinity_hint+RPS: 17,000 tps (approx. 23Mbps) • +SO_REUSEPORT: 2,540,000 tps (approx. 3370Mbps) • +SO_ATTACH_...: 4,250,000 tps (approx. 5640Mbps) • +Pin threads: 5,050,000 tps (approx. 6710Mbps) • +Disable GRO: 5,180,000 tps (approx. 6880Mbps) • +Unload iptables: 5,380,000 tps (approx. 7140Mbps) • +Disable validation: 5,490,000 tps (approx. 7290Mbps) • +Disable audit: 5,860,000 tps (approx. 7780Mbps) Disable audit
  • 60. 60Copyright © 2016 NTT Corp. All Rights Reserved. • IP ID field calculation (__ip_select_ident) is heavy • Consuming 4.82% of CPU time Optimization per core __ip_select_ident
  • 61. IP ID field calculation
      • This is an environment-specific issue
          • It happens if many clients have the same IP address
              • Cache contention caused by atomic operations
          • You are very unlikely to see this much CPU consumption unless a tunneling protocol is in use
      • If you really see this problem...
          • You can skip the calculation only if you never send packets larger than the MTU
              • That is a very strict condition, though
      int pmtu = IP_PMTUDISC_DO;
      setsockopt(sock, IPPROTO_IP, IP_MTU_DISCOVER, &pmtu, sizeof(pmtu));
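The two-line snippet above can be fleshed out as follows. This is only a sketch for Linux: the helper names `open_udp_socket_pmtudisc_do` and `get_pmtudisc_mode` are invented for illustration, and it assumes an IPv4 UDP socket. With `IP_PMTUDISC_DO` the kernel sets the DF bit on every datagram and rejects sends larger than the path MTU with `EMSGSIZE`, which is why the option is only safe if the application never sends over-MTU packets.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create an IPv4 UDP socket with IP_MTU_DISCOVER set to IP_PMTUDISC_DO
 * (hypothetical helper). Returns the socket fd, or -1 on failure. */
int open_udp_socket_pmtudisc_do(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0)
        return -1;

    int pmtu = IP_PMTUDISC_DO;
    if (setsockopt(sock, IPPROTO_IP, IP_MTU_DISCOVER, &pmtu, sizeof(pmtu)) < 0) {
        close(sock);
        return -1;
    }
    return sock;
}

/* Read the current IP_MTU_DISCOVER mode back from the socket.
 * Returns the mode, or -1 on failure. */
int get_pmtudisc_mode(int sock)
{
    int mode = -1;
    socklen_t len = sizeof(mode);

    if (getsockopt(sock, IPPROTO_IP, IP_MTU_DISCOVER, &mode, &len) < 0)
        return -1;
    assert(len == sizeof(mode)); /* the option value is a plain int */
    return mode;
}
```

An application that adopts this should be prepared to handle `EMSGSIZE` from `sendto()`, for example by capping response sizes.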
  • 62. Skip IP ID calculation
      • Performance change
          • RSS (+XPS): 270,000 tps (approx. 360Mbps)
          • +affinity_hint+RPS: 17,000 tps (approx. 23Mbps)
          • +SO_REUSEPORT: 2,540,000 tps (approx. 3370Mbps)
          • +SO_ATTACH_...: 4,250,000 tps (approx. 5640Mbps)
          • +Pin threads: 5,050,000 tps (approx. 6710Mbps)
          • +Disable GRO: 5,180,000 tps (approx. 6880Mbps)
          • +Unload iptables: 5,380,000 tps (approx. 7140Mbps)
          • +Disable validation: 5,490,000 tps (approx. 7290Mbps)
          • +Disable audit: 5,860,000 tps (approx. 7780Mbps)
          • +Skip ID calculation: 6,010,000 tps (approx. 7980Mbps)
  • 63. Hyper threading
      • So far we have not enabled hyper-threading
      • Enabling it raises the number of logical cores to 40
          • The number of physical cores is 20 in this box
      • With 40 cores we need to rely more on RPS
          • Reminder: max usable rx-queues == 16
      • Enable hyper-threading and set RPS on all rx-queues
          • queue 0 -> core 0, 20
          • queue 1 -> core 1, 21
          • ...
          • queue 10 -> core 10, 16, 30
          • queue 11 -> core 11, 17, 31
          • ...
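RPS is configured per rx-queue by writing a hexadecimal CPU bitmask to `/sys/class/net/<NIC>/queues/rx-<N>/rps_cpus`, with 32-bit groups separated by commas. The queue-to-core mapping above can be turned into such masks with a small helper. This is a sketch only: the function names are hypothetical, and it assumes at most 64 logical CPUs, which covers the 40-core box described here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Build a CPU bitmask from a list of CPU ids; each id must be in 0..63. */
uint64_t cpu_list_to_mask(const int *cpus, size_t n)
{
    uint64_t mask = 0;

    for (size_t i = 0; i < n; i++) {
        assert(cpus[i] >= 0 && cpus[i] < 64);
        mask |= UINT64_C(1) << cpus[i];
    }
    return mask;
}

/* Format the mask as two comma-separated 32-bit hex groups, high group
 * first, suitable for writing to rps_cpus.
 * Returns the number of characters snprintf would have written. */
int format_rps_mask(uint64_t mask, char *buf, size_t buflen)
{
    return snprintf(buf, buflen, "%x,%08x",
                    (unsigned)(mask >> 32),
                    (unsigned)(mask & UINT64_C(0xffffffff)));
}
```

For queue 0 (cores 0 and 20) this gives the mask 0x100001, formatted as `0,00100001`; for queue 10 (cores 10, 16, 30) it gives 0x40010400.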
  • 64. Hyper threading
      • Performance change
          • RSS (+XPS): 270,000 tps (approx. 360Mbps)
          • +affinity_hint+RPS: 17,000 tps (approx. 23Mbps)
          • +SO_REUSEPORT: 2,540,000 tps (approx. 3370Mbps)
          • +SO_ATTACH_...: 4,250,000 tps (approx. 5640Mbps)
          • +Pin threads: 5,050,000 tps (approx. 6710Mbps)
          • +Disable GRO: 5,180,000 tps (approx. 6880Mbps)
          • +Unload iptables: 5,380,000 tps (approx. 7140Mbps)
          • +Disable validation: 5,490,000 tps (approx. 7290Mbps)
          • +Disable audit: 5,860,000 tps (approx. 7780Mbps)
          • +Skip ID calculation: 6,010,000 tps (approx. 7980Mbps)
          • +Hyper threading: 7,010,000 tps (approx. 9310Mbps)
      • I expect that more rx-queues would yield even better numbers
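The Mbps figures in these tables follow from the per-packet wire size given earlier in the deck: 100 bytes of payload + 46 bytes of IP/UDP/Ethernet headers + 20 bytes of preamble and IFG = 166 bytes = 1328 bits. The conversion can be sketched as below; the function names are illustrative, not from the deck, and results are truncated to whole Mbps where the slides round to the nearest 10.

```c
#include <assert.h>
#include <stdint.h>

/* Bits on the wire for one UDP packet with the given payload size:
 * payload + 46 bytes of IP/UDP/Ethernet headers
 *         + 20 bytes of preamble and inter-frame gap. */
uint64_t wire_bits(uint64_t payload_bytes)
{
    return (payload_bytes + 46 + 20) * 8;
}

/* Convert a transaction rate to wire throughput in Mbps (truncated). */
uint64_t tps_to_mbps(uint64_t tps, uint64_t payload_bytes)
{
    return tps * wire_bits(payload_bytes) / 1000000;
}

/* Upper bound on packets per second for a link of link_bps bits/s. */
uint64_t max_tps(uint64_t link_bps, uint64_t payload_bytes)
{
    assert(link_bps > 0);
    return link_bps / wire_bits(payload_bytes);
}
```

With these definitions, 7,010,000 tps at 100 bytes works out to 9309 Mbps, about 93% of the 7,530,120 tps ceiling of a 10G link.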
  • 65. More hot spots
      • Tx Qdisc lock (_raw_spin_lock) is heavy
          • It is not contended but involves many atomic operations
          • It is being optimized in the Linux netdev community
      • [Figure: CPU profile; highlighted entry: _raw_spin_lock]
  • 66. More hot spots
      • Memory alloc/free (slab)
          • Being optimized in the netdev community as well
      • [Figure: CPU profile; highlighted entries: kmem/kmalloc/kfree]
  • 67. Other challenges
      • Virtualization
          • UDP servers as guests
          • Hypervisor can saturate CPUs or drop packets
      • We are going to investigate ways to boost performance in virtualized environments as well
  • 68. Summary
      • For 100-byte transactions, we can achieve almost 10G
          • From: 270,000 tps (approx. 360Mbps)
          • To: 7,010,000 tps (approx. 9310Mbps)
          • Of course, real applications do additional userspace work, so this number does not apply as is
      • To boost UDP performance
          • Applications (Most important!)
              • Implement SO_REUSEPORT
              • Implement SO_ATTACH_REUSEPORT_EBPF/CBPF
              • These are useful for TCP listening sockets as well
          • OS settings
              • Check smp_affinity
              • Use RPS if rx-queues are not enough
              • Make sure XPS is configured
              • Consider other tunings to reduce per-core overhead
                  • Disable GRO
                  • Unload iptables
                  • Disable source IP validation
                  • Disable auditd
          • Hardware
              • Use NICs which have enough RSS rx-queues if possible (as many queues as cores)
  • 69. Thank you!