Van Jacobson netchannels


1. Speeding up Networking
   Van Jacobson <van@packetdesign.com>, Bob Felderman <feldy@precisionio.com>
   Precision I/O
   Linux.conf.au 2006, Dunedin, NZ

2. This talk is not about ‘fixing’ the Linux networking stack
   The Linux networking stack isn’t broken.
   • The people who take care of the stack know what they’re doing & do good work.
   • Based on all the measurements I’m aware of, Linux has the fastest & most complete stack of any OS.

3. This talk is about fixing an architectural problem created a long time ago in a place far, far away. . .

5. In the beginning . . . ARPA created MULTICS
   • First OS networking stack (MIT, 1970)
   • Ran on a multi-user ‘super-computer’ (GE-640 @ 0.4 MIPS)
   • Rarely fewer than 100 users; took ~2 minutes to page in a user.
   • Since ARPAnet performance depended only on how fast the host could empty its 6 IMP buffers, the stack had to be put in the kernel.

6. The Multics stack begat many other stacks . . .
   • First TCP/IP stack done on Multics (1980)
   • People from that project went to BBN to do the first TCP/IP stack for Berkeley Unix (1983).
   • Berkeley CSRG used the BBN stack as a functional spec for the 4.1c BSD stack (1985).
   • CSRG wins a long battle with University of California lawyers & makes the stack source available under the ‘BSD copyright’ (1987).

7. Multics architecture, as elaborated by Berkeley, became the ‘Standard Model’
   [diagram: packet → ISR (interrupt level) → softint (skb) → socket (task level, in the system) → read() byte stream → application]

8. “The way we’ve always done it” is not necessarily the same as “the right way to do it”
   There are a lot of problems associated with this style of implementation . . .

9. Protocol Complication
   • Since data is received by the destination kernel, a “window” was added to distinguish between “data has arrived” & “data was consumed”.
   • This addition more than triples the size of the protocol (window probes, persist states) and is responsible for at least half the interoperability issues (Silly Window Syndrome, FIN wars, etc.)

10. Internet Stability
    • You can view a network connection as a servo-loop: S ↔ R
    • A kernel-based protocol implementation converts this to two coupled loops: S ↔ K ↔ R. A very general theorem (Routh-Hurwitz) says that two coupled loops will always be less stable than one.
    • The kernel loop also hides the receiving app dynamics from the sender, which screws up the RTT estimate & causes spurious retransmissions.

11. Compromises
    Even for a simple stream abstraction like TCP, there’s no such thing as a “one size fits all” protocol implementation.
    • The packetization and send strategies are completely different for bulk data vs. transactions vs. event streams.
    • The ack strategies are completely different for streaming vs. request-response.
    Some of this can be handled with sockopts but some app / kernel implementation mismatch is inevitable.

12. Performance (the topic for the rest of this talk)
    • Kernel-based implementations often have extra data copies (packet to skb to user).
    • Kernel-based implementations often have extra boundary crossings (hardware interrupt to software interrupt to context switch to syscall return).
    • Kernel-based implementations often have lock contention and hotspots.

13. Why should we care?
    • Networking gear has gotten fast enough (10Gb/s) and cheap enough ($10 for an 8 port Gb switch) that it’s changing from a communications technology to a backplane technology.
    • The huge mismatch between processor clock rate & memory latency has forced chip makers to put multiple cores on a die.

14. Why multiple cores?
    • Vanilla 2GHz P4 issues 2-4 instr / clock → 4-8 instr / ns.
    • Internal structure of the DRAM chip makes a cache line fetch take 50-100ns (FSB speed doesn’t matter).
    • If you did 400 instructions of computing on every cache line, the system would be 50% efficient with one core & 100% with two.
    • Typical number is more like 20 instr / line, or 2.5% efficient with one core (20 cores for 100%).
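    One consistent way to read that arithmetic (a back-of-the-envelope restatement, assuming roughly 4 instructions per ns and a 100 ns miss; the slide’s own figures span the quoted ranges):

    \[
    \text{core utilization} \approx \frac{t_{\text{compute}}}{t_{\text{compute}} + t_{\text{miss}}}, \qquad
    400\ \text{instr/line}: \frac{100\,\text{ns}}{100\,\text{ns}+100\,\text{ns}} = 50\%, \qquad
    20\ \text{instr/line}: \frac{5\,\text{ns}}{5\,\text{ns}+100\,\text{ns}} \approx 5\%
    \]

    You need on the order of 1/utilization independent cores (two in the first case, roughly twenty in the second; the slide’s 2.5% corresponds to the faster end of the issue-rate range) to keep the memory system busy, which is the point of the slide.
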
15. Good system performance comes from having lots of cores working independently
    • This is the canonical Internet problem.
    • The solution is called the “end-to-end principle”. It says you should push all work to the edge & do the absolute minimum inside the net.

16. The end of the wire isn’t the end of the net
    On a uni-processor it doesn’t matter, but on a multi-processor the protocol work should be done on the processor that’s going to consume the data. This means ISR & softint should do almost nothing and socket read() should do everything.
    [diagram: packets pass through ISR and softint, then fan out to socket read() on each consuming processor]

17. How good is the stack at spreading out the work?
    Let’s look at some measurements.

18. Test setup
    • Two Dell Poweredge 1750s (2.4GHz dual-processor P4 Xeon, hyperthreading off) hooked up back-to-back via Intel e1000 gig ether cards.
    • Running stock 2.6.15 plus current Sourceforge e1000 driver (6.3.9).
    • Measurements done with oprofile 0.9.1. Each test was 5 5-minute runs. Showing median of 5.
    • Booted with idle=poll_idle. Irqbalance off.

19. Digression: comparing two profiles
    [scatter plot: per-function profile share with device interrupt & app on the same cpu (x-axis) vs. on different cpus (y-axis); labelled outliers include e1000_intr, e1000_clean, __copy_user_intel, schedule, __switch_to, tcp_v4_rcv and _raw_spin_lock]

20. Uni vs. dual processor (netperf)
    • 1cpu: run netserver with cpu affinity set to same cpu as e1000 interrupts.
    • 2cpu: run netserver with cpu affinity set to different cpu from e1000 interrupts.

    (%)    Busy  Intr  Softint  Socket  Locks  Sched  App
    1cpu     50     7       11      16      8      5    1
    2cpu     77     9       13      24     14     12    1

22. This is just Amdahl’s law in action
    When adding additional processors:
    • Benefit (cycles to do work) grows at most linearly.
    • Cost (contention, competition, serialization, etc.) grows quadratically.
    • System capacity goes as C(n) = an - bn^2. For big enough n, the quadratic always wins.
    • The key to good scaling is to minimize b.
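    The formula itself says why b is the thing to attack (this step isn’t on the slide, but follows directly from differentiating it):

    \[
    C(n) = a n - b n^{2}, \qquad \frac{dC}{dn} = a - 2bn = 0 \;\Rightarrow\; n^{\ast} = \frac{a}{2b}, \qquad C(n^{\ast}) = \frac{a^{2}}{4b}
    \]

    Capacity peaks at n* = a/(2b) processors and falls beyond that; halving b doubles both the number of processors you can usefully add and the peak capacity. The rest of the talk goes after b (locks and shared cache lines) rather than a.
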
23. Locking destroys performance two ways
    • The lock has multiple writers, so each has to do a (fabulously expensive) RFO cache cycle.
    • The lock requires an atomic update, which is implemented by freezing the cache.
    To go fast you want to have a single writer per line and no locks.

24. • Networking involves a lot of queues. They’re often implemented as doubly linked lists.
    • This is the poster child for cache thrashing. Every user has to write every line and every change has to be made in multiple places.
    • Since most networking components have a producer / consumer relationship, a lock-free fifo can work a lot better.

25. net_channel - a cache aware, cache friendly queue

    typedef struct {
        uint16_t tail;       /* next element to add */
        uint8_t  wakecnt;    /* do wakeup if != consumer wakecnt */
        uint8_t  pad;
    } net_channel_producer_t;

    typedef struct {
        uint16_t head;       /* next element to remove */
        uint8_t  wakecnt;    /* increment to request wakeup */
        uint8_t  wake_type;  /* how to wakeup consumer */
        void*    wake_arg;   /* opaque argument to wakeup routine */
    } net_channel_consumer_t;

    typedef struct {
        net_channel_producer_t p CACHE_ALIGN;   /* producer's header */
        uint32_t               q[NET_CHANNEL_Q_ENTRIES];
        net_channel_consumer_t c;               /* consumer's header */
    } net_channel_t;

26. net_channel (cont.)

    #define NET_CHANNEL_ENTRIES 512   /* approx number of entries in channel q */
    #define NET_CHANNEL_Q_ENTRIES \
        ((ROUND_UP(NET_CHANNEL_ENTRIES*sizeof(uint32_t), CACHE_LINE_SIZE) \
          - sizeof(net_channel_producer_t) \
          - sizeof(net_channel_consumer_t)) / sizeof(uint32_t))
    #define CACHE_ALIGN __attribute__((aligned(CACHE_LINE_SIZE)))

    static inline void net_channel_queue(net_channel_t *chan, uint32_t item)
    {
        uint16_t tail = chan->p.tail;
        uint16_t nxt = (tail + 1) % NET_CHANNEL_Q_ENTRIES;

        if (nxt != chan->c.head) {
            chan->q[tail] = item;
            STORE_BARRIER;
            chan->p.tail = nxt;
            if (chan->p.wakecnt != chan->c.wakecnt) {
                ++chan->p.wakecnt;
                net_chan_wakeup(chan);
            }
        }
    }
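    The slides only show the producer side. A matching consumer side in the same spirit might look like the sketch below; it is an assumption, not from the slides: net_channel_dequeue and LOAD_BARRIER are invented names, with LOAD_BARRIER playing the same role for the reader that STORE_BARRIER plays for the writer.

    /* Sketch (not from the slides): consumer-side dequeue.  Only consumer-owned
     * fields (c.head, c.wakecnt) are written, so producer and consumer never
     * write the same cache line and no lock is needed. */
    static inline int net_channel_dequeue(net_channel_t *chan, uint32_t *item)
    {
        uint16_t head = chan->c.head;

        if (head == chan->p.tail)       /* channel empty */
            return 0;
        LOAD_BARRIER;                   /* don't read q[head] before seeing the new tail */
        *item = chan->q[head];
        chan->c.head = (head + 1) % NET_CHANNEL_Q_ENTRIES;
        return 1;
    }

    To sleep until more data arrives, the consumer would bump c.wakecnt so the producer’s next enqueue triggers net_chan_wakeup(), then block via whatever c.wake_type / c.wake_arg describe.
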
27. “Channelize” driver
    • Remove e1000 driver hard_start_xmit & napi_poll routines. No softint code left in driver & no skb’s (driver deals only in packets).
    • Send packets to generic_napi_poll via a net channel (a sketch of the receive path follows the profile table below).

    (%)    Busy  Intr  Softint  Socket  Locks  Sched  App
    1cpu     50     7       11      16      8      5    1
    2cpu     77     9       13      24     14     12    1
    drvr     58     6       12      16      9      9    1

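    To make the slide concrete, here is one way the channelized receive path could look. generic_napi_poll and the net channel come from the slides; e1000_rx_to_channel, the buffer-pool index, and the 16/16-bit descriptor encoding are illustrative assumptions only.

    /* Sketch: driver RX path with no skb's and no softint work.  The interrupt
     * handler just publishes a small packet descriptor into the net channel;
     * generic_napi_poll (or a socket / app consumer) drains it later on the
     * consuming cpu. */
    static void e1000_rx_to_channel(net_channel_t *chan, uint32_t buf_index, uint32_t len)
    {
        /* Assumed encoding: buffer-pool index in the high 16 bits, length in
         * the low 16 bits, so one 32-bit channel entry describes one packet. */
        uint32_t item = (buf_index << 16) | (len & 0xffff);

        net_channel_queue(chan, item);   /* wakes the consumer only if it asked for a wakeup */
    }

    The driver touches only producer-side cache lines; everything else happens on the cpu that will consume the data.
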
29. “Channelize” socket
    • Socket “registers” transport signature with driver on “accept()”. Gets back a channel. (A possible signature layout is sketched below.)
    • Driver drops all packets with matching signature into socket’s channel & wakes app if sleeping in socket code. Socket code processes packet(s) on wakeup.

    (%)    Busy  Intr  Softint  Socket  Locks  Sched  App
    1cpu     50     7       11      16      8      5    1
    2cpu     77     9       13      24     14     12    1
    drvr     58     6       12      16      9      9    1
    sock     28     6        0      16      1      3    1

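    The “transport signature” is whatever tuple lets the driver classify a packet to a flow; the slides don’t define it, so the layout and registration call below are purely illustrative assumptions.

    /* Sketch: a plausible transport signature handed to the driver on accept().
     * Field layout and function name are assumptions, not from the slides. */
    struct net_chan_signature {
        uint32_t saddr, daddr;   /* IP source / destination address */
        uint16_t sport, dport;   /* TCP source / destination port */
        uint8_t  proto;          /* e.g. IPPROTO_TCP */
    };

    /* Hypothetical registration hook: the driver starts steering packets that
     * match sig into the returned channel instead of up the normal stack. */
    net_channel_t *net_chan_register(const struct net_chan_signature *sig);
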
32. “Channelize” App
    • App “registers” transport signature. Gets back an (mmaped) channel & buffer pool.
    • Driver drops matching packets into channel & wakes app if sleeping. TCP stack in library processes packet(s) on wakeup. (A sketch of the user-space loop follows the table.)

    (%)    Busy  Intr  Softint  Socket  Locks  Sched  App
    1cpu     50     7       11      16      8      5    1
    2cpu     77     9       13      24     14     12    1
    drvr     58     6       12      16      9      9    1
    sock     28     6        0      16      1      3    1
    App      14     6        0       0      0      2    5

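    Putting the pieces together, a user-space receive loop over the mmapped channel might look like the sketch below. It layers on the earlier assumed helpers (net_channel_dequeue, the descriptor encoding) plus hypothetical tcp_lib_input(), net_chan_block() and BUF_SIZE; none of these names are from the slides.

    /* Sketch: user-space consumer.  Packets go from the driver's channel
     * straight into the library TCP code with no kernel socket work. */
    static void app_rx_loop(net_channel_t *chan, uint8_t *buf_pool)
    {
        uint32_t item;

        for (;;) {
            while (net_channel_dequeue(chan, &item)) {
                uint32_t idx = item >> 16;     /* assumed buffer-pool index */
                uint32_t len = item & 0xffff;  /* assumed packet length */
                tcp_lib_input(buf_pool + idx * BUF_SIZE, len);  /* hypothetical library TCP input */
            }
            ++chan->c.wakecnt;                 /* ask the producer for a wakeup ... */
            if (chan->c.head == chan->p.tail)  /* ... re-check to close the sleep/wake race */
                net_chan_block(chan);          /* hypothetical: block per wake_type / wake_arg */
        }
    }
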
35. 10Gb/s ixgb netpipe tests
    [charts: NPtcp streaming test between two nodes; NPtcp ping-pong test between two nodes (one-way latency measured)]
    (4.3Gb/s throughput limit due to DDR333 memory; cpus were loafing)

36. more 10Gb/s
    [chart: NPtcp ping-pong test between two nodes (one-way latency measured)]

37. more 10Gb/s
    [charts: LAM MPI, Intel MPI Benchmark (IMB) using 4 boxes (8 processes); SendRecv bandwidth (bigger is better) and SendRecv latency (smaller is better)]

38. more 10Gb/s
    [chart: LAM MPI, Intel MPI Benchmark (IMB) using 4 boxes (8 processes); SendRecv latency (smaller is better)]

39. Conclusion
    • With some relatively trivial changes, it’s possible to finish the good work started by NAPI & get rid of almost all the interrupt / softint processing.
    • As a result, everything gets a lot faster.
    • Get linear scalability on multi-cpu / multicore systems.

40. Conclusion (cont.)
    • Drivers get simpler (hard_start_xmit & napi_poll become generic; drivers only service hardware interrupts).
    • Anything can send or receive packets, without locks, very cheaply.
    • Easy, incremental transition strategy.
