2. This talk is not about ‘fixing’
the Linux networking stack
The Linux networking stack isn’t broken.
• The people who take care of the stack know
what they’re doing & do good work.
• Based on all the measurements I’m aware of,
Linux has the fastest & most complete stack of
any OS.
3. This talk is about fixing
an architectural problem
created a long time ago
in a place far, far away. . .
5. In the beginning . . .
ARPA created MULTICS
• First OS networking stack (MIT, 1970)
• Ran on a multi-user ‘super-computer’
(GE-640 @ 0.4 MIPS)
• Rarely fewer than 100 users; took ~2 minutes
to page in a user.
• Since ARPAnet performance depended only
on how fast a host could empty its 6 IMP
buffers, had to put stack in kernel.
6. The Multics stack begat
many other stacks . . .
• First TCP/IP stack done on Multics (1980)
• People from that project went to BBN to
do first TCP/IP stack for Berkeley Unix
(1983).
• Berkeley CSRG used BBN stack as
functional spec for 4.1c BSD stack (1985).
• CSRG wins long battle with University of
California lawyers & makes stack source
available under ‘BSD copyright’ (1987).
7. Multics architecture, as
elaborated by Berkeley,
became ‘Standard Model’
[Diagram: packet → ISR → Softint (skb) → Socket read() → byte stream → Application, partitioned into interrupt level, task level / system, and application]
8. “The way we’ve always done it”
is not necessarily the same as
“the right way to do it”
There are a lot of problems associated
with this style of implementation . . .
9. Protocol Complication
• Since data is received by destination
kernel, “window” was added to distinguish
between “data has arrived” & “data was
consumed”.
• This addition more than triples the size of
the protocol (window probes, persist
states) and is responsible for at least half
the interoperability issues (Silly Window
Syndrome, FIN wars, etc.)
10. Internet Stability
• You can view a network connection as a
servo-loop:
[Diagram: S ↔ R]
• A kernel-based protocol implementation
converts this to two coupled loops:
[Diagram: S ↔ K ↔ R]
• A very general theorem (Routh-Hurwitz)
says that the two coupled loops will always
be less stable than one.
• The kernel loop also hides the receiving
app dynamics from the sender, which screws
up the RTT estimate & causes spurious
retransmissions.
11. Compromises
Even for a simple stream abstraction like TCP,
there’s no such thing as a “one size fits all”
protocol implementation.
• The packetization and send strategies are
completely different for bulk data vs.
transactions vs. event streams.
• The ack strategies are completely different
for streaming vs. request-response.
Some of this can be handled with sockopts
but some app / kernel implementation
mismatch is inevitable.
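For example, an application can already nudge the kernel’s packetization strategy per socket; a minimal sketch, with illustrative helper names (not from the talk):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Transaction-style traffic: send small writes immediately
 * (disable Nagle). */
static int tune_for_transactions(int fd)
{
	int one = 1;
	return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}

/* Bulk streaming: let the kernel coalesce partial writes into
 * full segments (Linux-specific TCP_CORK). */
static int tune_for_bulk(int fd)
{
	int one = 1;
	return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &one, sizeof(one));
}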
12. Performance
(the topic for the rest of this talk)
• Kernel-based implementations often have
extra data copies (packet to skb to user).
• Kernel-based implementations often have
extra boundary crossings (hardware
interrupt to software interrupt to context
switch to syscall return).
• Kernel-based implementations often have
lock contention and hotspots.
13. Why should we care?
• Networking gear has gotten fast enough
(10Gb/s) and cheap enough ($10 for an 8
port Gb switch) that it’s changing from a
communications technology to a backplane
technology.
• The huge mismatch between processor
clock rate & memory latency has forced
chip makers to put multiple cores on a die.
14. Why multiple cores?
• Vanilla 2GHz P4 issues 2-4 instr / clock =
4-8 instr / ns.
• Internal structure of DRAM chip makes
cache line fetch take 50-100ns (FSB speed
doesn’t matter).
• If you did 400 instructions of computing on
every cache line, system would be 50%
efficient with one core & 100% with two.
• Typical number is more like 20 instr / line,
or 2.5% efficient with one core (20 cores
for 100%).
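To spell out the arithmetic (assuming 4 instr/ns and a 100ns line fetch): 400 instructions take 100ns, exactly covering one fetch, so a single core computes roughly half the time and a second core can fill the other half. At 20 instructions per line, compute covers only ~5ns of every ~100ns fetch, so one core sits idle almost all the time and it takes on the order of 20 cores just to keep the memory system busy.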
15. Good system performance
comes from having lots of
cores working independently
• This is the canonical Internet problem.
• The solution is called the “end-to-end
principle”. It says you should push all work
to the edge & do the absolute minimum
inside the net.
16. The end of the wire isn’t
the end of the net
On a uni-processor it doesn’t matter but on a
multi-processor the protocol work should be done
on the processor that’s going to consume the data.
This means ISR & Softint should
do almost nothing and Socket read()
should do everything.
[Diagram: ISR → Softint → Socket read(), with essentially all of the work pushed into Socket read()]
17. How good is the stack at
spreading out the work?
Let’s look at some measurements.
18. Test setup
• Two Dell Poweredge 1750s (2.4GHz P4 Xeon,
dual processor, hyperthreading off) hooked up
back-to-back via Intel e1000 gig ether cards.
• Running stock 2.6.15 plus current Sourceforge
e1000 driver (6.3.9).
• Measurements done with oprofile 0.9.1. Each
test was 5 5-minute runs. Showing median of 5.
• Booted with idle=poll_idle. Irqbalance off.
20. Uni vs. dual processor
(netperf)
• 1cpu: run netserver with cpu affinity set to
same cpu as e1000 interrupts.
• 2cpu: run netserver with cpu affinity set to
different cpu from e1000 interrupts.

       (%)  Busy  Intr  Softint  Socket  Locks  Sched  App
  1cpu       50     7      11      16       8      5     1
  2cpu       77     9      13      24      14     12     1
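The affinity setup described above can be reproduced with taskset(1), or programmatically with sched_setaffinity(2) before running netserver; the interrupt’s cpu is steered separately via /proc/irq/<n>/smp_affinity (with irqbalance off so it stays put). A minimal sketch, where cpu 0 is an illustrative choice:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);		/* pin to cpu 0 (illustrative) */
	if (sched_setaffinity(0, sizeof(set), &set) != 0)
		perror("sched_setaffinity");
	/* ... run or exec the benchmark from here ... */
	return 0;
}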
22. This is just Amdahl’s law
in action
When adding additional processors:
• Benefit (cycles to do work) grows at most
linearly.
• Cost (contention, competition,
serialization, etc.) grows quadratically.
• System capacity goes as C(n) = an − bn².
For big enough n, the quadratic always wins.
• The key to good scaling is to minimize b.
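To make that concrete: dC/dn = a − 2bn, so capacity peaks at n = a/(2b) and falls back to zero at n = a/b; halving b doubles the number of cores you can use productively.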
23. Locking destroys
performance two ways
• The lock has multiple writers so each has
to do a (fabulously expensive) RFO cache
cycle.
• The lock requires an atomic update which
is implemented by freezing the cache.
To go fast you want to have a single writer
per line and no locks.
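A small user-space illustration of the single-writer rule (not from the talk; the 64-byte line size and the names are assumptions): each thread increments only its own padded counter, so no line ever has two writers and no lock is needed; a reader just sums the slots.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS   4
#define CACHE_LINE 64			/* assumed line size */

struct padded_counter {
	unsigned long val;
	char pad[CACHE_LINE - sizeof(unsigned long)];
} counters[NTHREADS];			/* one cache line per writer */

static void *worker(void *arg)
{
	long id = (long)arg;
	for (int i = 0; i < 1000000; i++)
		counters[id].val++;	/* single writer of this line */
	return NULL;
}

int main(void)
{
	pthread_t t[NTHREADS];
	unsigned long total = 0;

	for (long i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, worker, (void *)i);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	for (int i = 0; i < NTHREADS; i++)
		total += counters[i].val;	/* reader sums the slots */
	printf("total = %lu\n", total);
	return 0;
}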
24. • Networking involves a lot of queues.
They’re often implemented as doubly linked
lists:
• This is the poster child for cache thrashing.
Every user has to write every line and
every change has to be made in multiple
places.
• Since most network components have a
producer / consumer relationship, a lock
free fifo can work a lot better.
25. net_channel - a cache aware,
cache friendly queue
typedef struct {
	uint16_t tail;       /* next element to add */
	uint8_t  wakecnt;    /* do wakeup if != consumer wakecnt */
	uint8_t  pad;
} net_channel_producer_t;

typedef struct {
	uint16_t head;       /* next element to remove */
	uint8_t  wakecnt;    /* increment to request wakeup */
	uint8_t  wake_type;  /* how to wakeup consumer */
	void*    wake_arg;   /* opaque argument to wakeup routine */
} net_channel_consumer_t;

typedef struct {
	net_channel_producer_t p CACHE_ALIGN;  /* producer's header */
	uint32_t q[NET_CHANNEL_Q_ENTRIES];
	net_channel_consumer_t c;              /* consumer's header */
} net_channel_t;
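A minimal sketch, not from the slides, of how such a channel might be driven, assuming the definitions above (with CACHE_ALIGN and NET_CHANNEL_Q_ENTRIES defined) and q[] carrying opaque packet handles; memory barriers and the real sleep/wakeup path are omitted:

/* Producer side (driver / softint): the only writer of p.tail. */
static int net_channel_put(net_channel_t *nc, uint32_t handle)
{
	uint16_t tail = nc->p.tail;
	uint16_t next = (tail + 1) % NET_CHANNEL_Q_ENTRIES;

	if (next == nc->c.head)			/* full */
		return -1;
	nc->q[tail] = handle;			/* publish the handle */
	nc->p.tail = next;
	if (nc->p.wakecnt != nc->c.wakecnt) {	/* consumer wants a wakeup */
		nc->p.wakecnt = nc->c.wakecnt;
		/* ... wake consumer per c.wake_type / c.wake_arg ... */
	}
	return 0;
}

/* Consumer side (Socket read() on the consuming cpu): the only
 * writer of c.head. */
static int net_channel_get(net_channel_t *nc, uint32_t *handle)
{
	uint16_t head = nc->c.head;

	if (head == nc->p.tail) {		/* empty: request a wakeup */
		nc->c.wakecnt++;
		return -1;			/* caller sleeps, then retries */
	}
	*handle = nc->q[head];
	nc->c.head = (head + 1) % NET_CHANNEL_Q_ENTRIES;
	return 0;
}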
35. 10Gb/s ixgb netpipe tests
[Plot: NPtcp streaming test between two nodes.]
[Plot: NPtcp ping-pong test between two nodes (one-way latency measured).]
(4.3Gb/s throughput limit due to DDR333 memory; cpus were loafing)
36. more 10Gb/s
[Plot: NPtcp ping-pong test between two nodes (one-way latency measured).]
37. more 10Gb/s
[Plot: LAM MPI, Intel MPI Benchmark (IMB) using 4 boxes (8 processes): SendRecv bandwidth (bigger is better).]
[Plot: LAM MPI, Intel MPI Benchmark (IMB) using 4 boxes (8 processes): SendRecv latency (smaller is better).]
38. more 10Gb/s
[Plot: LAM MPI, Intel MPI Benchmark (IMB) using 4 boxes (8 processes): SendRecv latency (smaller is better).]
39. Conclusion
• With some relatively trivial changes, it’s
possible to finish the good work started by
NAPI & get rid of almost all the interrupt /
softint processing.
• As a result, everything gets a lot faster.
• Get linear scalability on multi-cpu / multicore systems.
40. Conclusion (cont.)
• Drivers get simpler (hard_start_xmit &
napi_poll become generic; drivers only
service hardware interrupts).
• Anything can send or receive packets,
without locks, very cheaply.
• Easy, incremental transition strategy.