SlideShare a Scribd company logo
1 of 40
Download to read offline
Speeding up
Networking
Van Jacobson
van@packetdesign.com

Bob Felderman
feldy@precisionio.com

Precision I/O

Linux.conf.au 2006
Dunedin, NZ
This talk is not about ‘fixing’
the Linux networking stack
The Linux networking stack isn’t broken.
care of
• The people who takedo goodthe stack know
what they’re doing &
work.
on
• Based has all the measurements I’m aware of, of
Linux
the fastest & most complete stack
any OS.
LCA06 - Jan 27, 2006 - Jacobson / Felderman

2
This talk is about fixing
an architectural problem
created a long time ago
in a place far, far away. . .
LCA06 - Jan 27, 2006 - Jacobson / Felderman

3
LCA06 - Jan 27, 2006 - Jacobson / Felderman

4
In the beginning . . .
ARPA created MULTICS
• First OS networking stack (MIT, 1970)
multi-user ‘super-computer’
• Ran on a @ 0.4 MIPS)
(GE-640
than100
• Rarely feweruser. users; took ~2 minutes
to page in a

• Since ARPAnet performance depended only
on how fast host could empty its 6 IMP
buffers, had to put stack in kernel.
LCA06 - Jan 27, 2006 - Jacobson / Felderman

4
The Multics stack begat
many other stacks . . .
• First TCP/IP stack done on Multics (1980)
went to
• People from that projectBerkeley BBN to
do first TCP/IP stack for
Unix
(1983).
used BBN stack as
• Berkeley CSRGfor 4.1c BSD stack (1985).
functional spec
University of
• CSRG wins long battle withstack source
California lawyers & makes
available under ‘BSD copyright’ (1987).
LCA06 - Jan 27, 2006 - Jacobson / Felderman

5
Multics architecture, as
elaborated by Berkeley,
became ‘Standard Model’
Interrupt level

Task level

System
ISR
packet

LCA06 - Jan 27, 2006 - Jacobson / Felderman

Softint
skb

Application
Socket

read()

byte stream

6
“The way we’ve always done it”
is not necessarily the same as
“the right way to do it”

There are a lot of problems associated
with this style of implementation . . .

LCA06 - Jan 27, 2006 - Jacobson / Felderman

7
Protocol Complication
• Since data is received by destination
kernel, “window” was added to distinguish
between “data has arrived” & “data was
consumed”.
addition more than triples the size
• Thisprotocol (window probes, persist of
the
states) and is responsible for at least half
the interoperability issues (Silly Window
Syndrome, FIN wars, etc.)
LCA06 - Jan 27, 2006 - Jacobson / Felderman

8
Internet Stability
• You can view a network connection as a
servo-loop:
S

R

• A kernel-based protocol implementation
converts this to two coupled loops:
•

S
K
R
A very general theorem (Routh-Hurwitz)
says that the two coupled loops will always
be less stable then one.

also hides the receiving
• The kernel loopthe sender which screws app
dynamics from
up
the RTT estimate & causes spurious
retransmissions.
LCA06 - Jan 27, 2006 - Jacobson / Felderman

9
Compromises
Even for a simple stream abstraction like TCP,
there’s no such thing as a “one size fits all”
protocol implementation.

• The packetization and send strategies are
completely different for bulk data vs.
transactions vs. event streams.
ack strategies are completely
• Thestreaming vs. request response.different
for
Some of this can be handled with sockopts
but some app / kernel implementation
mismatch is inevitable.
LCA06 - Jan 27, 2006 - Jacobson / Felderman

10
Performance
(the topic for the rest of this talk)
have
• Kernel-based implementations oftenuser).
extra data copies (packet to skb to

• Kernel-based implementations often have
extra boundary crossings (hardware
interrupt to software interrupt to context
switch to syscall return).

• Kernel-based implementations often have
lock contention and hotspots.
LCA06 - Jan 27, 2006 - Jacobson / Felderman

11
Why should we care?
gotten fast
• Networking gear hasenough ($10enough8
(10Gb/s) and cheap
for an
port Gb switch) that it’s changing from a
communications technology to a backplane
technology.

• The huge mismatch between processor
clock rate & memory latency has forced
chip makers to put multiple cores on a die.

LCA06 - Jan 27, 2006 - Jacobson / Felderman

12
Why multiple cores?
• Vanilla 2GHz P4 issues 2-4 instr / clock
 4-8 instr / ns.

• Internal structure of DRAM chip makes
cache line fetch take 50-100ns (FSB speed
doesn’t matter).
did 400 instructions of computing
• If you cache line, system would be 50% on
every
efficient with one core & 100% with two.
more like 20
/ line
• Typical number iswith one core instrcores
or 2.5% efficient
(20
for 100%).
LCA06 - Jan 27, 2006 - Jacobson / Felderman

13
Good system performance
comes from having lots of
cores working independently
• This is the canonical Internet problem.
is called the “end-to-end
• The solutionsays you should push all work
principle”. It
to the edge & do the absolute minimum
inside the net.

LCA06 - Jan 27, 2006 - Jacobson / Felderman

14
The end of the wire isn’t
the end of the net
On a uni-processor it doesn’t matter but on a
multi-processor the protocol work should be done
on the processor that’s going to consume the data.
This means ISR & Softint should
()
do almost nothing and Socket
ead
r
t
should do everything.
cke
So
ISR

Softint

Socket

read()

S oc

ket
rea
d

()

LCA06 - Jan 27, 2006 - Jacobson / Felderman

15
How good is the stack at
spreading out the work?
Let’s look at some

LCA06 - Jan 27, 2006 - Jacobson / Felderman

16
Test setup
Dell Poweredge 1750s (2.4GHz P4 Xeon,
• Two processor, hyperthreading off) hooked up
dual
back-to-back via Intel e1000 gig ether cards.

2.6.15
• Running stock(6.3.9). plus current Sourceforge
e1000 driver
with oprofile
• Measurements done runs. Showing 0.9.1. Each 5.
test was 5 5-minute
median of

• booted with idle=poll_idle. Irqbalance off.
LCA06 - Jan 27, 2006 - Jacobson / Felderman

17
15

Digression: comparing
two profiles
●

e1000_intr
●

10

●

e1000_clean

●

●

__copy_user_intel

5

schedule

__switch_to
● ● tcp_v4_rcv
●

0

device interrupt & app on different cpu

_raw_spin_lock

●
●
●
● ● ●●
● ●
● ● ●
● ● ●●
● ● ●●
●
●
● ● ●●●
●●
●
● ●
●
●● ●
●●
●●
●● ● ●
●● ●●●
●
●
●●
●● ●
●
●
●
●● ●
●●
●●
●●
●●
●●
●

0

●
●

5

10

15

device interrupt & app on same cpu

LCA06 - Jan 27, 2006 - Jacobson / Felderman

18
Uni vs. dual processor
(netperf) with cpu
• 1cpu: run netservercpu as e1000 interrupts.
affinity set to same
netserver with
• 2cpu: runcpu from e1000 cpu affinity set to
different
interrupts.

(%) Busy
1cpu 50
2cpu 77

Intr

7
9

LCA06 - Jan 27, 2006 - Jacobson / Felderman

Softint Socket Locks Sched

11
13

16
24

8
14

5
12

App

1
1
19
Uni vs. dual processor
(netperf) with cpu
• 1cpu: run netservercpu as e1000 interrupts.
affinity set to same
netserver with
• 2cpu: runcpu from e1000 cpu affinity set to
different
interrupts.

(%) Busy
1cpu 50
2cpu 77

Intr

7
9

LCA06 - Jan 27, 2006 - Jacobson / Felderman

Softint Socket Locks Sched

11
13

16
24

8
14

5
12

App

1
1
19
This is just Amdahl’s law
in action
When adding additional processors:

• Benefit (cycles to do work) grows at most
linearly.
competition,
• Cost (contention, grows quadratically.
serialization, etc.)

•

System capacity goes as C(n) = an - bn2
For big enough n, the quadratic always wins.

• The key to good scaling is to minimize b.
LCA06 - Jan 27, 2006 - Jacobson / Felderman

20
Locking destroys
performance two ways
has multiple writers
• The lock(fabulously expensive) so each has
to do a
RFO cache
cycle.
update which
• The lock requires an atomicthe cache.
is implemented by freezing
To go fast you want to have a single writer
per line and no locks.

LCA06 - Jan 27, 2006 - Jacobson / Felderman

21
• Networking involves a lot of queues.
They’re often implented as doubly linked
lists:
the
child
• This isuser poster write for cache thrashing.
Every
has to
every line and
every change has to be made multiple
places.
network
• Since most consumercomponents have a
producer /
relationship, a lock
free fifo can work a lot better.

LCA06 - Jan 27, 2006 - Jacobson / Felderman

22
net_channel - a cache aware,
cache friendly queue
typedef struct {
uint16_t
tail;
uint8_t
wakecnt;
uint8_t
pad;
} net_channel_producer_t;
typedef struct {
uint16_t
head;
uint8_t
wakecnt;
uint8_t
wake_type;
void*
wake_arg;
} net_channel_consumer_t;
struct {

/* next element to add */
/* do wakeup if != consumer wakecnt */

/*
/*
/*
/*

next element to remove */
increment to request wakeup */
how to wakeup consumer */
opaque argument to wakeup routine */

net_channel_producer_t p CACHE_ALIGN;
uint32_t q[NET_CHANNEL_Q_ENTRIES];
net_channel_consumer_t c;
} net_channel_t ;

LCA06 - Jan 27, 2006 - Jacobson / Felderman

/* producer's header */
/* consumer's header */

23
net_channel (cont.)
#define NET_CHANNEL_ENTRIES 512 /* approx number of entries in channel q */
#define NET_CHANNEL_Q_ENTRIES 
((ROUND_UP(NET_CHANNEL_ENTRIES*sizeof(uint32_t),CACHE_LINE_SIZE) 
- sizeof(net_channel_producer_t) - sizeof(net_channel_consumer_t)) 
/ sizeof(uint32_t))
#define CACHE_ALIGN __attribute__((aligned(CACHE_LINE_SIZE)))
static inline void net_channel_queue(net_channel_t *chan, uint32_t item) {
uint16_t tail = chan->p.tail;
uint16_t nxt = (tail + 1) % NET_CHANNEL_Q_ENTRIES;
if (nxt != chan->c.head) {
chan->q[tail] = item;
STORE_BARRIER;
chan->p.tail = nxt;
if (chan->p.wakecnt != chan->c.wakecnt) {
++chan->p.wakecnt;
net_chan_wakeup(chan);
}
}
}
LCA06 - Jan 27, 2006 - Jacobson / Felderman

24
“Channelize” driver
• Remove e1000 driver hard_start_xmit &in
napi_poll routines. No softint code left
driver & no skb’s (driver deals only in packets).

• Send packets to generic_napi_poll via a net
channel.
(%) Busy
1cpu 50
2cpu 77
drvr 58

Intr

7
9
6

LCA06 - Jan 27, 2006 - Jacobson / Felderman

Softint Socket Locks Sched

11
13
12

16
24
16

8
14
9

5
12
9

App

1
1
1
25
“Channelize” driver
• Remove e1000 driver hard_start_xmit &in
napi_poll routines. No softint code left
driver & no skb’s (driver deals only in packets).

• Send packets to generic_napi_poll via a net
channel.
(%) Busy
1cpu 50
2cpu 77
drvr 58

Intr

7
9
6

LCA06 - Jan 27, 2006 - Jacobson / Felderman

Softint Socket Locks Sched

11
13
12

16
24
16

8
14
9

5
12
9

App

1
1
1
25
“Channelize” socket
“registers” transport signature with
• socket on “accept()”. Gets back a channel.
driver
with matching
• driver drops all packetschannel & wakes app
signature into socket’s
if sleeping in socket code. Socket code
processes packet(s) on wakeup.

(%)
1cpu
2cpu
drvr
sock

Busy

Intr

50
77
58
28

7
9
6
6

LCA06 - Jan 27, 2006 - Jacobson / Felderman

Softint Socket Locks Sched

11
13
12
0

16
24
16
16

8
14
9
1

5
12
9
3

App

1
1
1
1
26
“Channelize” socket
“registers” transport signature with
• socket on “accept()”. Gets back a channel.
driver
with matching
• driver drops all packetschannel & wakes app
signature into socket’s
if sleeping in socket code. Socket code
processes packet(s) on wakeup.

(%)
1cpu
2cpu
drvr
sock

Busy

Intr

50
77
58
28

7
9
6
6

LCA06 - Jan 27, 2006 - Jacobson / Felderman

Softint Socket Locks Sched

11
13
12
0

16
24
16
16

8
14
9
1

5
12
9
3

App

1
1
1
1
26
“Channelize” socket
“registers” transport signature with
• socket on “accept()”. Gets back a channel.
driver
with matching
• driver drops all packetschannel & wakes app
signature into socket’s
if sleeping in socket code. Socket code
processes packet(s) on wakeup.

(%)
1cpu
2cpu
drvr
sock

Busy

Intr

50
77
58
28

7
9
6
6

LCA06 - Jan 27, 2006 - Jacobson / Felderman

Softint Socket Locks Sched

11
13
12
0

16
24
16
16

8
14
9
1

5
12
9
3

App

1
1
1
1
26
“Channelize” App

• App “registers” transport signature. Gets back an
(mmaped) channel & buffer pool.
matching packets into
• driver drops sleeping. TCP stack in channel &
wakes app if
library
processes packet(s) on wakeup.

(%)
1cpu
2cpu
drvr
sock
App

Busy

Intr

50
77
58
28
14

7
9
6
6
6

LCA06 - Jan 27, 2006 - Jacobson / Felderman

Softint Socket Locks Sched

11
13
12
0
0

16
24
16
16
0

8
14
9
1
0

5
12
9
3
2

App

1
1
1
1
5
27
“Channelize” App

• App “registers” transport signature. Gets back an
(mmaped) channel & buffer pool.
matching packets into
• driver drops sleeping. TCP stack in channel &
wakes app if
library
processes packet(s) on wakeup.

(%)
1cpu
2cpu
drvr
sock
App

Busy

Intr

50
77
58
28
14

7
9
6
6
6

LCA06 - Jan 27, 2006 - Jacobson / Felderman

Softint Socket Locks Sched

11
13
12
0
0

16
24
16
16
0

8
14
9
1
0

5
12
9
3
2

App

1
1
1
1
5
27
“Channelize” App

• App “registers” transport signature. Gets back an
(mmaped) channel & buffer pool.
matching packets into
• driver drops sleeping. TCP stack in channel &
wakes app if
library
processes packet(s) on wakeup.

(%)
1cpu
2cpu
drvr
sock
App

Busy

Intr

50
77
58
28
14

7
9
6
6
6

LCA06 - Jan 27, 2006 - Jacobson / Felderman

Softint Socket Locks Sched

11
13
12
0
0

16
24
16
16
0

8
14
9
1
0

5
12
9
3
2

App

1
1
1
1
5
27
10Gb/s ixgb netpipe tests
NPtcp streaming test between two nodes.

NPtcp ping-pong test between two nodes (one-way latency measured).

(4.3Gb/s throughput limit due to DDR333 memory;
cpus were loafing)
LCA06 - Jan 27, 2006 - Jacobson / Felderman

28
more 10Gb/s
NPtcp ping-pong test between two nodes (one-way latency measured).

LCA06 - Jan 27, 2006 - Jacobson / Felderman

29
more 10Gb/s
LAM MPI: Intel MPI Benchmark (IMB) using 4 boxes (8 processes)
SendRecv bandwidth (bigger is better)

LAM MPI: Intel MPI Benchmark (IMB) using 4 boxes (8 processes)
SendRecv Latency (smaller is better)
LCA06 - Jan 27, 2006 - Jacobson / Felderman

30
more 10Gb/s
LAM MPI: Intel MPI Benchmark (IMB) using 4 boxes (8 processes)
SendRecv Latency (smaller is better)

LCA06 - Jan 27, 2006 - Jacobson / Felderman

31
Conclusion
relatively trivial changes, it’s
• With some finish the good work started by
possible to
NAPI & get rid of almost all the interrupt /
softint processing.

• As a result, everything gets a lot faster.
• Get linear scalability on multi-cpu / multicore systems.
LCA06 - Jan 27, 2006 - Jacobson / Felderman

32
Conclusion (cont.)
• Drivers get simpler (hard_start_xmit &
napi_poll become generic; drivers only
service hardware interrupts).
can
• Anythinglocks,send or receive packets,
without
very cheaply.

• Easy, incremental transition strategy.
LCA06 - Jan 27, 2006 - Jacobson / Felderman

33

More Related Content

What's hot

Odi installation guide
Odi installation guideOdi installation guide
Odi installation guide
prakashdas05
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax Enterprise
DataStax Academy
 
Less04 database instance
Less04 database instanceLess04 database instance
Less04 database instance
Amit Bhalla
 

What's hot (20)

Java Performance Tuning
Java Performance TuningJava Performance Tuning
Java Performance Tuning
 
SQL-Server Database.pdf
SQL-Server Database.pdfSQL-Server Database.pdf
SQL-Server Database.pdf
 
Automating the Entire PostgreSQL Lifecycle
Automating the Entire PostgreSQL Lifecycle Automating the Entire PostgreSQL Lifecycle
Automating the Entire PostgreSQL Lifecycle
 
From Mainframe to Microservice: An Introduction to Distributed Systems
From Mainframe to Microservice: An Introduction to Distributed SystemsFrom Mainframe to Microservice: An Introduction to Distributed Systems
From Mainframe to Microservice: An Introduction to Distributed Systems
 
Odi installation guide
Odi installation guideOdi installation guide
Odi installation guide
 
KSQL – An Open Source Streaming Engine for Apache Kafka
KSQL – An Open Source Streaming Engine for Apache KafkaKSQL – An Open Source Streaming Engine for Apache Kafka
KSQL – An Open Source Streaming Engine for Apache Kafka
 
Open HFT libraries in @Java
Open HFT libraries in @JavaOpen HFT libraries in @Java
Open HFT libraries in @Java
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax Enterprise
 
How IBM's Massive POWER9 UNIX Servers Benefit from InfluxDB and Grafana Techn...
How IBM's Massive POWER9 UNIX Servers Benefit from InfluxDB and Grafana Techn...How IBM's Massive POWER9 UNIX Servers Benefit from InfluxDB and Grafana Techn...
How IBM's Massive POWER9 UNIX Servers Benefit from InfluxDB and Grafana Techn...
 
Splunk conf2014 - Lesser Known Commands in Splunk Search Processing Language ...
Splunk conf2014 - Lesser Known Commands in Splunk Search Processing Language ...Splunk conf2014 - Lesser Known Commands in Splunk Search Processing Language ...
Splunk conf2014 - Lesser Known Commands in Splunk Search Processing Language ...
 
Less04 database instance
Less04 database instanceLess04 database instance
Less04 database instance
 
Performance tuning a quick intoduction
Performance tuning   a quick intoductionPerformance tuning   a quick intoduction
Performance tuning a quick intoduction
 
Exadata SMART Monitoring - OEM 13c
Exadata SMART Monitoring - OEM 13cExadata SMART Monitoring - OEM 13c
Exadata SMART Monitoring - OEM 13c
 
Oracle Client Failover - Under The Hood
Oracle Client Failover - Under The HoodOracle Client Failover - Under The Hood
Oracle Client Failover - Under The Hood
 
The C10k Problem
The C10k ProblemThe C10k Problem
The C10k Problem
 
Ceph Introduction 2017
Ceph Introduction 2017  Ceph Introduction 2017
Ceph Introduction 2017
 
Less03 db dbca
Less03 db dbcaLess03 db dbca
Less03 db dbca
 
What You Should Know About WebLogic Server 12c (12.2.1.2) #oow2015 #otntour2...
What You Should Know About WebLogic Server 12c (12.2.1.2)  #oow2015 #otntour2...What You Should Know About WebLogic Server 12c (12.2.1.2)  #oow2015 #otntour2...
What You Should Know About WebLogic Server 12c (12.2.1.2) #oow2015 #otntour2...
 
GoldenGate and Stream Processing with Special Guest Rakuten
GoldenGate and Stream Processing with Special Guest RakutenGoldenGate and Stream Processing with Special Guest Rakuten
GoldenGate and Stream Processing with Special Guest Rakuten
 
Microservices Integration Patterns with Kafka
Microservices Integration Patterns with KafkaMicroservices Integration Patterns with Kafka
Microservices Integration Patterns with Kafka
 

Similar to Van jaconson netchannels

Porting_uClinux_CELF2008_Griffin
Porting_uClinux_CELF2008_GriffinPorting_uClinux_CELF2008_Griffin
Porting_uClinux_CELF2008_Griffin
Peter Griffin
 
Atf 3 q15-2 - product preview
Atf 3 q15-2 - product previewAtf 3 q15-2 - product preview
Atf 3 q15-2 - product preview
Mason Mei
 

Similar to Van jaconson netchannels (20)

DPDK Integration: A Product's Journey - Roger B. Melton
DPDK Integration: A Product's Journey - Roger B. MeltonDPDK Integration: A Product's Journey - Roger B. Melton
DPDK Integration: A Product's Journey - Roger B. Melton
 
FPGA based 10G Performance Tester for HW OpenFlow Switch
FPGA based 10G Performance Tester for HW OpenFlow SwitchFPGA based 10G Performance Tester for HW OpenFlow Switch
FPGA based 10G Performance Tester for HW OpenFlow Switch
 
D031201021027
D031201021027D031201021027
D031201021027
 
100G Networking Berlin.pdf
100G Networking Berlin.pdf100G Networking Berlin.pdf
100G Networking Berlin.pdf
 
XS Boston 2008 Network Topology
XS Boston 2008 Network TopologyXS Boston 2008 Network Topology
XS Boston 2008 Network Topology
 
The Data Center and Hadoop
The Data Center and HadoopThe Data Center and Hadoop
The Data Center and Hadoop
 
Scaling Networks Lab Manual 1st Edition Cisco Solutions Manual
Scaling Networks Lab Manual 1st Edition Cisco Solutions ManualScaling Networks Lab Manual 1st Edition Cisco Solutions Manual
Scaling Networks Lab Manual 1st Edition Cisco Solutions Manual
 
CPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performance
 
ISSCC 2018: "Zeppelin": an SoC for Multi-chip Architectures
ISSCC 2018: "Zeppelin": an SoC for Multi-chip ArchitecturesISSCC 2018: "Zeppelin": an SoC for Multi-chip Architectures
ISSCC 2018: "Zeppelin": an SoC for Multi-chip Architectures
 
High performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User GroupHigh performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User Group
 
Porting_uClinux_CELF2008_Griffin
Porting_uClinux_CELF2008_GriffinPorting_uClinux_CELF2008_Griffin
Porting_uClinux_CELF2008_Griffin
 
10 sdn-vir-6up
10 sdn-vir-6up10 sdn-vir-6up
10 sdn-vir-6up
 
Lustre Generational Performance Improvements & New Features
Lustre Generational Performance Improvements & New FeaturesLustre Generational Performance Improvements & New Features
Lustre Generational Performance Improvements & New Features
 
Optimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at DropboxOptimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at Dropbox
 
Ccnp3 lab 3_1_en (hacer)
Ccnp3 lab 3_1_en (hacer)Ccnp3 lab 3_1_en (hacer)
Ccnp3 lab 3_1_en (hacer)
 
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitchDPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
 
Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDK
 
淺談 Live patching technology
淺談 Live patching technology淺談 Live patching technology
淺談 Live patching technology
 
Testing Persistent Storage Performance in Kubernetes with Sherlock
Testing Persistent Storage Performance in Kubernetes with SherlockTesting Persistent Storage Performance in Kubernetes with Sherlock
Testing Persistent Storage Performance in Kubernetes with Sherlock
 
Atf 3 q15-2 - product preview
Atf 3 q15-2 - product previewAtf 3 q15-2 - product preview
Atf 3 q15-2 - product preview
 

More from Susant Sahani (20)

systemd
systemdsystemd
systemd
 
systemd
systemdsystemd
systemd
 
How to debug systemd problems fedora project
How to debug systemd problems   fedora projectHow to debug systemd problems   fedora project
How to debug systemd problems fedora project
 
Systemd vs-sys vinit-cheatsheet.jpg
Systemd vs-sys vinit-cheatsheet.jpgSystemd vs-sys vinit-cheatsheet.jpg
Systemd vs-sys vinit-cheatsheet.jpg
 
Systemd cheatsheet
Systemd cheatsheetSystemd cheatsheet
Systemd cheatsheet
 
Systemd
SystemdSystemd
Systemd
 
Systemd for administrators
Systemd for administratorsSystemd for administrators
Systemd for administrators
 
Pdf c1t tlawaxb
Pdf c1t tlawaxbPdf c1t tlawaxb
Pdf c1t tlawaxb
 
Systemd mlug-20140614
Systemd mlug-20140614Systemd mlug-20140614
Systemd mlug-20140614
 
Summit demystifying systemd1
Summit demystifying systemd1Summit demystifying systemd1
Summit demystifying systemd1
 
Systemd evolution revolution_regression
Systemd evolution revolution_regressionSystemd evolution revolution_regression
Systemd evolution revolution_regression
 
Systemd for administrators
Systemd for administratorsSystemd for administrators
Systemd for administrators
 
Systemd poettering
Systemd poetteringSystemd poettering
Systemd poettering
 
Interface between kernel and user space
Interface between kernel and user spaceInterface between kernel and user space
Interface between kernel and user space
 
Week3 binary trees
Week3 binary treesWeek3 binary trees
Week3 binary trees
 
Trees
TreesTrees
Trees
 
Synchronization linux
Synchronization linuxSynchronization linux
Synchronization linux
 
Demo preorder-stack
Demo preorder-stackDemo preorder-stack
Demo preorder-stack
 
Bacnet white paper
Bacnet white paperBacnet white paper
Bacnet white paper
 
Api presentation
Api presentationApi presentation
Api presentation
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 

Van jaconson netchannels

  • 1. Speeding up Networking Van Jacobson van@packetdesign.com Bob Felderman feldy@precisionio.com Precision I/O Linux.conf.au 2006 Dunedin, NZ
  • 2. This talk is not about ‘fixing’ the Linux networking stack The Linux networking stack isn’t broken. care of • The people who takedo goodthe stack know what they’re doing & work. on • Based has all the measurements I’m aware of, of Linux the fastest & most complete stack any OS. LCA06 - Jan 27, 2006 - Jacobson / Felderman 2
  • 3. This talk is about fixing an architectural problem created a long time ago in a place far, far away. . . LCA06 - Jan 27, 2006 - Jacobson / Felderman 3
  • 4. LCA06 - Jan 27, 2006 - Jacobson / Felderman 4
  • 5. In the beginning . . . ARPA created MULTICS • First OS networking stack (MIT, 1970) multi-user ‘super-computer’ • Ran on a @ 0.4 MIPS) (GE-640 than100 • Rarely feweruser. users; took ~2 minutes to page in a • Since ARPAnet performance depended only on how fast host could empty its 6 IMP buffers, had to put stack in kernel. LCA06 - Jan 27, 2006 - Jacobson / Felderman 4
  • 6. The Multics stack begat many other stacks . . . • First TCP/IP stack done on Multics (1980) went to • People from that projectBerkeley BBN to do first TCP/IP stack for Unix (1983). used BBN stack as • Berkeley CSRGfor 4.1c BSD stack (1985). functional spec University of • CSRG wins long battle withstack source California lawyers & makes available under ‘BSD copyright’ (1987). LCA06 - Jan 27, 2006 - Jacobson / Felderman 5
  • 7. Multics architecture, as elaborated by Berkeley, became ‘Standard Model’ Interrupt level Task level System ISR packet LCA06 - Jan 27, 2006 - Jacobson / Felderman Softint skb Application Socket read() byte stream 6
  • 8. “The way we’ve always done it” is not necessarily the same as “the right way to do it” There are a lot of problems associated with this style of implementation . . . LCA06 - Jan 27, 2006 - Jacobson / Felderman 7
  • 9. Protocol Complication • Since data is received by destination kernel, “window” was added to distinguish between “data has arrived” & “data was consumed”. addition more than triples the size • Thisprotocol (window probes, persist of the states) and is responsible for at least half the interoperability issues (Silly Window Syndrome, FIN wars, etc.) LCA06 - Jan 27, 2006 - Jacobson / Felderman 8
  • 10. Internet Stability • You can view a network connection as a servo-loop: S R • A kernel-based protocol implementation converts this to two coupled loops: • S K R A very general theorem (Routh-Hurwitz) says that the two coupled loops will always be less stable then one. also hides the receiving • The kernel loopthe sender which screws app dynamics from up the RTT estimate & causes spurious retransmissions. LCA06 - Jan 27, 2006 - Jacobson / Felderman 9
  • 11. Compromises Even for a simple stream abstraction like TCP, there’s no such thing as a “one size fits all” protocol implementation. • The packetization and send strategies are completely different for bulk data vs. transactions vs. event streams. ack strategies are completely • Thestreaming vs. request response.different for Some of this can be handled with sockopts but some app / kernel implementation mismatch is inevitable. LCA06 - Jan 27, 2006 - Jacobson / Felderman 10
  • 12. Performance (the topic for the rest of this talk) have • Kernel-based implementations oftenuser). extra data copies (packet to skb to • Kernel-based implementations often have extra boundary crossings (hardware interrupt to software interrupt to context switch to syscall return). • Kernel-based implementations often have lock contention and hotspots. LCA06 - Jan 27, 2006 - Jacobson / Felderman 11
  • 13. Why should we care? gotten fast • Networking gear hasenough ($10enough8 (10Gb/s) and cheap for an port Gb switch) that it’s changing from a communications technology to a backplane technology. • The huge mismatch between processor clock rate & memory latency has forced chip makers to put multiple cores on a die. LCA06 - Jan 27, 2006 - Jacobson / Felderman 12
  • 14. Why multiple cores? • Vanilla 2GHz P4 issues 2-4 instr / clock  4-8 instr / ns. • Internal structure of DRAM chip makes cache line fetch take 50-100ns (FSB speed doesn’t matter). did 400 instructions of computing • If you cache line, system would be 50% on every efficient with one core & 100% with two. more like 20 / line • Typical number iswith one core instrcores or 2.5% efficient (20 for 100%). LCA06 - Jan 27, 2006 - Jacobson / Felderman 13
  • 15. Good system performance comes from having lots of cores working independently • This is the canonical Internet problem. is called the “end-to-end • The solutionsays you should push all work principle”. It to the edge & do the absolute minimum inside the net. LCA06 - Jan 27, 2006 - Jacobson / Felderman 14
  • 16. The end of the wire isn’t the end of the net On a uni-processor it doesn’t matter but on a multi-processor the protocol work should be done on the processor that’s going to consume the data. This means ISR & Softint should () do almost nothing and Socket ead r t should do everything. cke So ISR Softint Socket read() S oc ket rea d () LCA06 - Jan 27, 2006 - Jacobson / Felderman 15
  • 17. How good is the stack at spreading out the work? Let’s look at some LCA06 - Jan 27, 2006 - Jacobson / Felderman 16
  • 18. Test setup Dell Poweredge 1750s (2.4GHz P4 Xeon, • Two processor, hyperthreading off) hooked up dual back-to-back via Intel e1000 gig ether cards. 2.6.15 • Running stock(6.3.9). plus current Sourceforge e1000 driver with oprofile • Measurements done runs. Showing 0.9.1. Each 5. test was 5 5-minute median of • booted with idle=poll_idle. Irqbalance off. LCA06 - Jan 27, 2006 - Jacobson / Felderman 17
  • 19. 15 Digression: comparing two profiles ● e1000_intr ● 10 ● e1000_clean ● ● __copy_user_intel 5 schedule __switch_to ● ● tcp_v4_rcv ● 0 device interrupt & app on different cpu _raw_spin_lock ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ●●● ●● ● ● ● ● ●● ● ●● ●● ●● ● ● ●● ●●● ● ● ●● ●● ● ● ● ● ●● ● ●● ●● ●● ●● ●● ● 0 ● ● 5 10 15 device interrupt & app on same cpu LCA06 - Jan 27, 2006 - Jacobson / Felderman 18
  • 20. Uni vs. dual processor (netperf) with cpu • 1cpu: run netservercpu as e1000 interrupts. affinity set to same netserver with • 2cpu: runcpu from e1000 cpu affinity set to different interrupts. (%) Busy 1cpu 50 2cpu 77 Intr 7 9 LCA06 - Jan 27, 2006 - Jacobson / Felderman Softint Socket Locks Sched 11 13 16 24 8 14 5 12 App 1 1 19
  • 21. Uni vs. dual processor (netperf) with cpu • 1cpu: run netservercpu as e1000 interrupts. affinity set to same netserver with • 2cpu: runcpu from e1000 cpu affinity set to different interrupts. (%) Busy 1cpu 50 2cpu 77 Intr 7 9 LCA06 - Jan 27, 2006 - Jacobson / Felderman Softint Socket Locks Sched 11 13 16 24 8 14 5 12 App 1 1 19
  • 22. This is just Amdahl’s law in action When adding additional processors: • Benefit (cycles to do work) grows at most linearly. competition, • Cost (contention, grows quadratically. serialization, etc.) • System capacity goes as C(n) = an - bn2 For big enough n, the quadratic always wins. • The key to good scaling is to minimize b. LCA06 - Jan 27, 2006 - Jacobson / Felderman 20
  • 23. Locking destroys performance two ways has multiple writers • The lock(fabulously expensive) so each has to do a RFO cache cycle. update which • The lock requires an atomicthe cache. is implemented by freezing To go fast you want to have a single writer per line and no locks. LCA06 - Jan 27, 2006 - Jacobson / Felderman 21
  • 24. • Networking involves a lot of queues. They’re often implented as doubly linked lists: the child • This isuser poster write for cache thrashing. Every has to every line and every change has to be made multiple places. network • Since most consumercomponents have a producer / relationship, a lock free fifo can work a lot better. LCA06 - Jan 27, 2006 - Jacobson / Felderman 22
  • 25. net_channel - a cache aware, cache friendly queue typedef struct { uint16_t tail; uint8_t wakecnt; uint8_t pad; } net_channel_producer_t; typedef struct { uint16_t head; uint8_t wakecnt; uint8_t wake_type; void* wake_arg; } net_channel_consumer_t; struct { /* next element to add */ /* do wakeup if != consumer wakecnt */ /* /* /* /* next element to remove */ increment to request wakeup */ how to wakeup consumer */ opaque argument to wakeup routine */ net_channel_producer_t p CACHE_ALIGN; uint32_t q[NET_CHANNEL_Q_ENTRIES]; net_channel_consumer_t c; } net_channel_t ; LCA06 - Jan 27, 2006 - Jacobson / Felderman /* producer's header */ /* consumer's header */ 23
  • 26. net_channel (cont.) #define NET_CHANNEL_ENTRIES 512 /* approx number of entries in channel q */ #define NET_CHANNEL_Q_ENTRIES ((ROUND_UP(NET_CHANNEL_ENTRIES*sizeof(uint32_t),CACHE_LINE_SIZE) - sizeof(net_channel_producer_t) - sizeof(net_channel_consumer_t)) / sizeof(uint32_t)) #define CACHE_ALIGN __attribute__((aligned(CACHE_LINE_SIZE))) static inline void net_channel_queue(net_channel_t *chan, uint32_t item) { uint16_t tail = chan->p.tail; uint16_t nxt = (tail + 1) % NET_CHANNEL_Q_ENTRIES; if (nxt != chan->c.head) { chan->q[tail] = item; STORE_BARRIER; chan->p.tail = nxt; if (chan->p.wakecnt != chan->c.wakecnt) { ++chan->p.wakecnt; net_chan_wakeup(chan); } } } LCA06 - Jan 27, 2006 - Jacobson / Felderman 24
  • 27. “Channelize” driver • Remove e1000 driver hard_start_xmit &in napi_poll routines. No softint code left driver & no skb’s (driver deals only in packets). • Send packets to generic_napi_poll via a net channel. (%) Busy 1cpu 50 2cpu 77 drvr 58 Intr 7 9 6 LCA06 - Jan 27, 2006 - Jacobson / Felderman Softint Socket Locks Sched 11 13 12 16 24 16 8 14 9 5 12 9 App 1 1 1 25
  • 28. “Channelize” driver • Remove e1000 driver hard_start_xmit &in napi_poll routines. No softint code left driver & no skb’s (driver deals only in packets). • Send packets to generic_napi_poll via a net channel. (%) Busy 1cpu 50 2cpu 77 drvr 58 Intr 7 9 6 LCA06 - Jan 27, 2006 - Jacobson / Felderman Softint Socket Locks Sched 11 13 12 16 24 16 8 14 9 5 12 9 App 1 1 1 25
  • 29. “Channelize” socket “registers” transport signature with • socket on “accept()”. Gets back a channel. driver with matching • driver drops all packetschannel & wakes app signature into socket’s if sleeping in socket code. Socket code processes packet(s) on wakeup. (%) 1cpu 2cpu drvr sock Busy Intr 50 77 58 28 7 9 6 6 LCA06 - Jan 27, 2006 - Jacobson / Felderman Softint Socket Locks Sched 11 13 12 0 16 24 16 16 8 14 9 1 5 12 9 3 App 1 1 1 1 26
  • 30. “Channelize” socket “registers” transport signature with • socket on “accept()”. Gets back a channel. driver with matching • driver drops all packetschannel & wakes app signature into socket’s if sleeping in socket code. Socket code processes packet(s) on wakeup. (%) 1cpu 2cpu drvr sock Busy Intr 50 77 58 28 7 9 6 6 LCA06 - Jan 27, 2006 - Jacobson / Felderman Softint Socket Locks Sched 11 13 12 0 16 24 16 16 8 14 9 1 5 12 9 3 App 1 1 1 1 26
  • 31. “Channelize” socket “registers” transport signature with • socket on “accept()”. Gets back a channel. driver with matching • driver drops all packetschannel & wakes app signature into socket’s if sleeping in socket code. Socket code processes packet(s) on wakeup. (%) 1cpu 2cpu drvr sock Busy Intr 50 77 58 28 7 9 6 6 LCA06 - Jan 27, 2006 - Jacobson / Felderman Softint Socket Locks Sched 11 13 12 0 16 24 16 16 8 14 9 1 5 12 9 3 App 1 1 1 1 26
  • 32. “Channelize” App • App “registers” transport signature. Gets back an (mmaped) channel & buffer pool. matching packets into • driver drops sleeping. TCP stack in channel & wakes app if library processes packet(s) on wakeup. (%) 1cpu 2cpu drvr sock App Busy Intr 50 77 58 28 14 7 9 6 6 6 LCA06 - Jan 27, 2006 - Jacobson / Felderman Softint Socket Locks Sched 11 13 12 0 0 16 24 16 16 0 8 14 9 1 0 5 12 9 3 2 App 1 1 1 1 5 27
  • 33. “Channelize” App • App “registers” transport signature. Gets back an (mmaped) channel & buffer pool. matching packets into • driver drops sleeping. TCP stack in channel & wakes app if library processes packet(s) on wakeup. (%) 1cpu 2cpu drvr sock App Busy Intr 50 77 58 28 14 7 9 6 6 6 LCA06 - Jan 27, 2006 - Jacobson / Felderman Softint Socket Locks Sched 11 13 12 0 0 16 24 16 16 0 8 14 9 1 0 5 12 9 3 2 App 1 1 1 1 5 27
  • 34. “Channelize” App • App “registers” transport signature. Gets back an (mmaped) channel & buffer pool. matching packets into • driver drops sleeping. TCP stack in channel & wakes app if library processes packet(s) on wakeup. (%) 1cpu 2cpu drvr sock App Busy Intr 50 77 58 28 14 7 9 6 6 6 LCA06 - Jan 27, 2006 - Jacobson / Felderman Softint Socket Locks Sched 11 13 12 0 0 16 24 16 16 0 8 14 9 1 0 5 12 9 3 2 App 1 1 1 1 5 27
  • 35. 10Gb/s ixgb netpipe tests NPtcp streaming test between two nodes. NPtcp ping-pong test between two nodes (one-way latency measured). (4.3Gb/s throughput limit due to DDR333 memory; cpus were loafing) LCA06 - Jan 27, 2006 - Jacobson / Felderman 28
  • 36. more 10Gb/s NPtcp ping-pong test between two nodes (one-way latency measured). LCA06 - Jan 27, 2006 - Jacobson / Felderman 29
  • 37. more 10Gb/s LAM MPI: Intel MPI Benchmark (IMB) using 4 boxes (8 processes) SendRecv bandwidth (bigger is better) LAM MPI: Intel MPI Benchmark (IMB) using 4 boxes (8 processes) SendRecv Latency (smaller is better) LCA06 - Jan 27, 2006 - Jacobson / Felderman 30
  • 38. more 10Gb/s LAM MPI: Intel MPI Benchmark (IMB) using 4 boxes (8 processes) SendRecv Latency (smaller is better) LCA06 - Jan 27, 2006 - Jacobson / Felderman 31
  • 39. Conclusion relatively trivial changes, it’s • With some finish the good work started by possible to NAPI & get rid of almost all the interrupt / softint processing. • As a result, everything gets a lot faster. • Get linear scalability on multi-cpu / multicore systems. LCA06 - Jan 27, 2006 - Jacobson / Felderman 32
  • 40. Conclusion (cont.) • Drivers get simpler (hard_start_xmit & napi_poll become generic; drivers only service hardware interrupts). can • Anythinglocks,send or receive packets, without very cheaply. • Easy, incremental transition strategy. LCA06 - Jan 27, 2006 - Jacobson / Felderman 33