1. Recent advances in netmap/VALE (mSwitch)
Michio Honda, Felipe Huici
(NEC Europe Ltd.)
Giuseppe Lettieri and Luigi Rizzo
(Università di Pisa)
Kernel/VM@Jimbo-cho, Japan, Dec. 8, 2013
michio.honda@neclab.eu / @michioh
2. Outline
• netmap API basics
– Architecture
– How to write apps
• VALE (mSwitch)
– Architecture
– System design
– Evaluation
– Use cases
4. netmap overview
• A fast packet I/O mechanism between the NIC and user space
– Removes unnecessary metadata (e.g., sk_buff) allocation
– Amortizes system call costs; reduces or removes data copies
(Figure from http://info.iet.unipi.it/~luigi/netmap/)
5. Performance
• Saturates a 10 Gbps pipe even at low CPU frequency
(Figure from http://info.iet.unipi.it/~luigi/netmap/)
6. netmap API (initialization)
• open(“/dev/netmap”) returns a file descriptor
• ioctl(fd, NIOCREG, arg) puts an interface in netmap mode
• mmap(…, fd, 0) maps buffers and rings
(Figure from http://info.iet.unipi.it/~luigi/netmap/)
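To make the three steps concrete, here is a minimal sketch, assuming the struct nmreq / NIOCREGIF spelling found in the released netmap headers (the slide's NIOCREG is the same registration ioctl under its early name); netmap_attach is a name made up for this example, and error checks are omitted:

    #include <fcntl.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <net/netmap.h>
    #include <net/netmap_user.h>

    /* Attach to a NIC in netmap mode; returns the interface descriptor. */
    static struct netmap_if *
    netmap_attach(const char *ifname, int *fdp)
    {
        struct nmreq req;
        void *mem;
        int fd = open("/dev/netmap", O_RDWR);

        memset(&req, 0, sizeof(req));
        strncpy(req.nr_name, ifname, sizeof(req.nr_name) - 1);
        req.nr_version = NETMAP_API;
        ioctl(fd, NIOCREGIF, &req);      /* put the interface in netmap mode */

        /* map the shared buffers and rings into our address space */
        mem = mmap(NULL, req.nr_memsize, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
        *fdp = fd;
        return NETMAP_IF(mem, req.nr_offset);
    }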
7. netmap API (TX)
• TX
– Fill up to avail buffers, starting from slot cur
– ioctl(fd, NIOCTXSYNC) queues the packets
• poll() can be used for blocking I/O
(Figure from http://info.iet.unipi.it/~luigi/netmap/)
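A sketch of this TX loop against the cur/avail ring layout of this netmap generation (tx_one is a hypothetical helper; fd and nifp come from the initialization sketch above):

    #include <string.h>
    #include <sys/ioctl.h>
    #include <net/netmap.h>
    #include <net/netmap_user.h>

    /* Queue one packet on TX ring 0; returns 0 on success, -1 if the ring
     * is full (the caller can poll(fd, POLLOUT) and retry). */
    static int
    tx_one(int fd, struct netmap_if *nifp, const char *pkt, size_t pktlen)
    {
        struct netmap_ring *ring = NETMAP_TXRING(nifp, 0);

        if (ring->avail == 0)
            return -1;
        struct netmap_slot *slot = &ring->slot[ring->cur];
        memcpy(NETMAP_BUF(ring, slot->buf_idx), pkt, pktlen);
        slot->len = (uint16_t)pktlen;
        ring->cur = NETMAP_RING_NEXT(ring, ring->cur);
        ring->avail--;
        ioctl(fd, NIOCTXSYNC, NULL);     /* hand the filled slot(s) to the NIC */
        return 0;
    }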
8. netmap API (RX)
• RX
– ioctl(fd, NIOCRXSYNC) reports newly received packets
– Process up to avail buffers, starting from slot cur
• poll() can be used for blocking I/O
(Figure from http://info.iet.unipi.it/~luigi/netmap/)
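And the mirror-image RX side, again as a hedged sketch (rx_drain and handler are names invented here):

    #include <sys/ioctl.h>
    #include <net/netmap.h>
    #include <net/netmap_user.h>

    /* Drain RX ring 0, handing each packet to a caller-supplied callback. */
    static void
    rx_drain(int fd, struct netmap_if *nifp,
             void (*handler)(const char *buf, uint16_t len))
    {
        struct netmap_ring *ring = NETMAP_RXRING(nifp, 0);

        ioctl(fd, NIOCRXSYNC, NULL);     /* ask the kernel for new packets */
        while (ring->avail > 0) {
            struct netmap_slot *slot = &ring->slot[ring->cur];
            handler(NETMAP_BUF(ring, slot->buf_idx), slot->len);
            ring->cur = NETMAP_RING_NEXT(ring, ring->cur);
            ring->avail--;               /* slot handed back to the kernel */
        }
    }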
9. Other features
• Multi-queue support
– One netmap ring is initialized per physical ring
• e.g., different pthreads can be assigned to different netmap/physical rings (see the binding sketch below)
• Host stack support
– The NIC is put into netmap mode, resetting its PHY
– The host stack still sees the interface, and packets can be sent to/from the NIC via "software rings"
• Either implicitly by the kernel or explicitly by the app
(Figure from http://info.iet.unipi.it/~luigi/netmap/)
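Ring selection happens at registration time. A sketch, assuming the nr_ringid field and the NETMAP_HW_RING / NETMAP_SW_RING flags from the netmap headers of this era (bind_ring is a name invented here):

    #include <string.h>
    #include <sys/ioctl.h>
    #include <net/netmap.h>
    #include <net/netmap_user.h>

    /* Bind an already-open /dev/netmap fd to one hardware ring pair (so
     * each pthread can own its own ring), or to the host-stack rings. */
    static int
    bind_ring(int fd, const char *ifname, int ring_nr, int host_rings)
    {
        struct nmreq req;

        memset(&req, 0, sizeof(req));
        strncpy(req.nr_name, ifname, sizeof(req.nr_name) - 1);
        req.nr_version = NETMAP_API;
        req.nr_ringid = host_rings ? NETMAP_SW_RING
                                   : (NETMAP_HW_RING | ring_nr);
        return ioctl(fd, NIOCREGIF, &req);
    }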
10. Implementation
• Available for FreeBSD and Linux
– The Linux code is glued onto the FreeBSD code
• Common code
– Control path, system call backends, memory allocator, etc.
• Device-specific code
– Each supported driver implements a few functions (skeleton below)
• nm_register(struct ifnet *ifp, int onoff)
– Puts the NIC into netmap mode; allocates netmap rings and slots
• nm_txsync(struct ifnet *ifp, u_int ring_nr)
– Flushes out the packets from the netmap ring filled by the user
• nm_rxsync(struct ifnet *ifp, u_int ring_nr)
– Refills the netmap ring with the received packets
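A skeleton of the three hooks using exactly the signatures above (the int return type is an assumption; the bodies, which in a real port manipulate the device's descriptor rings, are left as placeholders):

    #include <sys/types.h>

    struct ifnet;                        /* opaque here; defined by the OS */

    /* Enter/leave netmap mode: detach the NIC's rings from the host stack
     * and attach them to netmap's rings (or the reverse). */
    static int
    my_nm_register(struct ifnet *ifp, int onoff)
    {
        (void)ifp; (void)onoff;          /* stop NIC, swap buffers, restart */
        return 0;
    }

    /* TX: push the slots the user filled up to 'cur' into the device's
     * hardware TX descriptor ring 'ring_nr'. */
    static int
    my_nm_txsync(struct ifnet *ifp, u_int ring_nr)
    {
        (void)ifp; (void)ring_nr;
        return 0;
    }

    /* RX: report descriptors the device completed on 'ring_nr' back into
     * the netmap ring, so the user sees them via cur/avail. */
    static int
    my_nm_rxsync(struct ifnet *ifp, u_int ring_nr)
    {
        (void)ifp; (void)ring_nr;
        return 0;
    }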
13. Software switch
• Switching packets between network interfaces
– General-purpose OS and processor
– Virtual and hardware interfaces
• 20 years ago
– Prototyping, a low-performance alternative
• Now and in the near future
– Replacement for hardware switches
– Hosting virtual machines (incl. virtual network functions)
14. Performance of today's software switches
• Forward packets from one 10 Gbps NIC to another
– Xeon E5-1650 (3.8 GHz), 1 CPU core used
– Lower than 1 Gbps for minimum-sized packets
[Figure: throughput (Gbps) vs. packet size (64–1024 bytes) for the FreeBSD bridge and Open vSwitch.]
15. Problems
1. Inefficient packet I/O mechanism
– Today's software switches use large, dynamically allocated metadata (e.g., sk_buff) designed for end systems
– This should be simplified, because the packet is just forwarded
– For switches it is more important to process small packets efficiently
2. Inefficient packet switching algorithm
– How to move packets from the source to the destination(s) efficiently?
– Traditional way
• Lock a destination, send a single packet, then unlock the destination
• Inefficient due to locking cost/contention
16. Problems (cont.)
3. Lack of flexibility in packet processing
– How to decide a packet's destination?
– One could use a layer 2 learning bridge to decide it
– One could use OpenFlow packet matching to do so
17. Solutions
1. "Inefficient packet I/O mechanisms"
– Simple, minimalistic packet representation (netmap API*)
• No metadata allocation cost
• Reduced cache pollution
2. "Inefficient packet switching"
– Group multiple packets going to the same destination
– Lock the destination only once for a group of packets

* Netmap – a novel framework for fast packet I/O. Luigi Rizzo, Università di Pisa.
http://info.iet.unipi.it/~luigi/netmap/
18. Bitmap-based forwarding algorithm
• Algorithm in the original VALE*
• Supports unicast, multicast and broadcast (see the sketch after this slide)
– Get pointers to a batch of packets
– Identify the destination(s) of each packet and represent them as a bitmap
– Lock each destination, and send all the packets going there
• Problem
– Scalability issue in the presence of many destinations

    pkt id  dst
    p0      0010
    p1      0001
    p2      0010
    p3      1111
    p4      0010

[Figure 3. Bitmap-based packet forwarding: packets are labeled p0 to p4; for each packet, destination(s) are identified and represented in a bitmap (a bit for each possible destination port). The forwarder considers each destination port in turn, scanning the corresponding column of the bitmap to identify the packets bound to the current destination port.]

* VALE, a Virtual Local Ethernet. Luigi Rizzo, Giuseppe Lettieri, Università di Pisa.
http://info.iet.unipi.it/~luigi/vale/
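The per-column scan can be sketched as follows (types and names are hypothetical, and the bitmap is a single 64-bit word here for brevity; the point is that the scan visits every port for every batch, which is what hurts with many destinations):

    #include <stdint.h>

    struct bpkt {
        char    *buf;
        uint16_t len;
        uint64_t dstmap;     /* bit i set => deliver to port i */
    };

    static void
    bitmap_forward(struct bpkt *batch, int n, int nports,
                   void (*deliver)(int port, struct bpkt *p))
    {
        for (int d = 0; d < nports; d++) {   /* scan ports one by one... */
            uint64_t bit = 1ULL << d;
            /* ...so cost is O(nports * batch), even when most ports
             * receive no packets at all */
            for (int i = 0; i < n; i++)
                if (batch[i].dstmap & bit)
                    deliver(d, &batch[i]);
        }
    }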
19. List-based forwarding algorithm
• Algorithm in the current VALE (mSwitch)
• Supports unicast and broadcast (see the sketch below)
– Make a linked list for each destination
– Broadcast packets are mapped to destination index 254
– Scan each destination; broadcast packets are inserted in order

After the batch has been scanned, each packet carries the index of the next packet bound to the same destination:

    pkt id  dst   next        dst   d0   d1   d2   ...  d254
    p0      d1    p2          head  p1   p0             p3
    p1      d0    null        tail  p1   p4             p3
    p2      d1    p4
    p3      d254  null
    p4      d1    null

[Figure 4. List-based packet forwarding: packets are labeled p0 to p4; destination port indices are labeled d0, d1, ...]
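A sketch of the two phases (hypothetical types and names; the in-order splicing of broadcast packets into each destination's list is omitted, so broadcast appears simply as its own queue at index 254):

    #include <pthread.h>

    #define BDG_BROADCAST 254            /* reserved broadcast index */

    struct pkt {                         /* hypothetical batch entry */
        char    *buf;
        int      len;
        int      dst;                    /* destination port index */
        int      next;                   /* next packet for same dst, -1 = end */
    };

    struct dstq { int head, tail; };     /* per-destination list, -1 = empty */

    /* Phase 1: one pass over the batch, building a list per destination. */
    static void
    build_lists(struct pkt *batch, int n, struct dstq *q, int nq)
    {
        for (int d = 0; d < nq; d++)
            q[d].head = q[d].tail = -1;
        for (int i = 0; i < n; i++) {
            int d = batch[i].dst;
            batch[i].next = -1;
            if (q[d].head < 0)
                q[d].head = i;
            else
                batch[q[d].tail].next = i;
            q[d].tail = i;
        }
    }

    /* Phase 2: one lock acquisition per destination, not per packet. */
    static void
    flush_lists(struct pkt *batch, struct dstq *q, int nq,
                pthread_mutex_t *locks,
                void (*deliver)(int port, struct pkt *p))
    {
        for (int d = 0; d < nq; d++) {
            if (q[d].head < 0)
                continue;                /* no packets for this port */
            pthread_mutex_lock(&locks[d]);
            for (int i = q[d].head; i >= 0; i = batch[i].next)
                deliver(d, &batch[i]);
            pthread_mutex_unlock(&locks[d]);
        }
    }

Compared to the bitmap scan, the flush only pays for destinations that actually have packets queued, so the cost grows with the batch rather than with the number of ports.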
20. Solutions (cont.)
3. "Lack of flexibility in packet processing"
– Separate the switching fabric from packet processing
– Switching fabric
• Moves packets quickly
– Packet processing
• Decides packets' destinations and tells the switching fabric (an example lookup function follows this slide)

    typedef u_int (*BDG_LOOKUP_T)(char *buf, u_int len,
                                  uint8_t *ring_nr,
                                  struct netmap_adapter *srcif);

– Return value
• The index of the destination port for unicast
• NM_BDG_BROADCAST for broadcast
• NM_BDG_NOPORT to drop the packet
– By default, L2 learning is set
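For illustration, a hypothetical plug-in against that prototype (my_lookup and the 4-port modulo rule are invented for this example; NM_BDG_BROADCAST and NM_BDG_NOPORT are the constants listed above):

    /* Toy lookup: flood multicast/broadcast frames, steer unicast frames
     * by the low byte of the destination MAC, drop runts. */
    static u_int
    my_lookup(char *buf, u_int len, uint8_t *ring_nr,
              struct netmap_adapter *srcif)
    {
        (void)ring_nr; (void)srcif;      /* unused in this toy module */
        if (len < 14)                    /* shorter than an Ethernet header */
            return NM_BDG_NOPORT;
        if (buf[0] & 1)                  /* group bit in the destination MAC */
            return NM_BDG_BROADCAST;
        return (u_int)(unsigned char)buf[5] % 4;   /* spread over ports 0..3 */
    }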
21. VALE (mSwitch) architecture
[Figure: app1/vm1 … appN/vmN attach to virtual interfaces through the netmap API; regular apps go through the OS stack via the socket API; in the kernel, the packet-processing stage sits on top of the switching fabric, which connects the virtual interfaces, the OS stack and the NIC.]
• Packet forwarding (identifying packets' destinations (packet processing) and copying packets to the destination rings) takes place in the sender's context
– The receiver just consumes the packets
22. Other features (both sketched below)
• Indirect buffer support
– A netmap slot can contain a pointer to the actual buffer
– Useful to eliminate the data copy from a VM's backend into a netmap slot
• Support for large packets
– Multiple netmap slots (by default 2048 bytes each) can be used to contain a single packet
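A hedged sketch of both features, assuming the NS_INDIRECT and NS_MOREFRAG slot flags and the ptr slot field from the netmap headers (avail accounting and the NIOCTXSYNC are left to the caller):

    #include <stdint.h>
    #include <string.h>
    #include <net/netmap.h>
    #include <net/netmap_user.h>

    /* Indirect buffer: make TX slot 'i' point at an external buffer (e.g.,
     * a VM backend's guest memory) instead of copying into the slot. */
    static void
    tx_indirect(struct netmap_ring *ring, uint32_t i, void *ext, uint16_t len)
    {
        struct netmap_slot *s = &ring->slot[i];

        s->ptr   = (uintptr_t)ext;
        s->len   = len;
        s->flags = NS_INDIRECT;
    }

    /* Large packet: spread one payload over consecutive slots; every slot
     * but the last carries NS_MOREFRAG. */
    static void
    tx_large(struct netmap_ring *ring, const char *pkt, size_t len)
    {
        uint32_t cur = ring->cur;

        while (len > 0) {
            struct netmap_slot *s = &ring->slot[cur];
            size_t chunk = len < 2048 ? len : 2048;  /* default slot size */

            memcpy(NETMAP_BUF(ring, s->buf_idx), pkt, chunk);
            s->len   = (uint16_t)chunk;
            s->flags = (len > chunk) ? NS_MOREFRAG : 0;
            pkt += chunk;
            len -= chunk;
            cur = NETMAP_RING_NEXT(ring, cur);
        }
        ring->cur = cur;
    }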
23. Bare mSwitch performance
• NIC to NIC (10 Gbps)
• "Dummy" processing module
• Experiments are done with a Xeon E5-1650 CPU (6 cores, 3.8 GHz with Turbo Boost), 16 GB DDR3 RAM (quad channel) and Intel X520-T2 10 Gbps NICs
[Figure 5(a), NIC to NIC: throughput (Gbps) vs. packet size (64–1518 bytes), and vs. CPU clock frequency (1.3–3.8 GHz) for 64/128/256-byte packets, against the 10 Gbps line rate.]
24. Bare mSwitch performance
• Virtual port to virtual port
• "Dummy" processing module
• One CPU core is assigned per port pair; results are similar to the NIC-to-NIC case, with line-rate values for all packet sizes at 3.8 GHz. mSwitch scales well with the number of ports and CPU cores: line rate is achieved for two 10 Gb/s port pairs (20 Gb/s) for all packet sizes, and the opposite direction, from virtual ports to NICs, shows similar results.
[Figure 6. Forwarding performance between two virtual ports: throughput (Gbps, up to ~200) vs. packet size (60 bytes to 64 KB) for 1, 2 and 3 CPU cores.]
25. Bare mSwitch performance
• "Dummy" packet processing module
• N virtual ports to N virtual ports
[Figure 7. Switching capacity (Gbps, up to ~250) with an increasing number of virtual ports (2–8), for 64 B, 1514 B and 64 KB packets: (a) experiment topologies, (b) unicast throughput, (c) broadcast throughput. For unicast, each src/dst port pair is assigned a single CPU core; for broadcast, each port is given a core. For setups with more than 6 ports (our system has 6 cores) we assign cores in a round-robin fashion.]
26. mSwitch's scalability
• A single virtual port to many virtual ports
– Bitmap- vs. list-based algorithm
– The list-based algorithm scales very well
• We use minimum-sized packets with a single CPU core
[Figure 9. Comparison of mSwitch's forwarding algorithm (list) to that of VALE (bitmap) in the presence of a large number of active destination ports: aggregate throughput (Gbps, up to 14) vs. number of destination ports (1–250), single sender, minimum-sized packets.]
29. Conclusion
• Our contribution
– VALE (mSwitch): a fast, modular software switch
– Very fast packet forwarding on bare metal
• 200 Gbps between virtual ports (with 1500-byte packets and 3 CPU cores)
• Almost line rate using 1 CPU core and two 10 Gbps NICs
– Useful for implementing various systems
• A very fast learning bridge
• Accelerates Open vSwitch by up to 2.6 times
– Small modifications, preserving the control interface
• A fast protocol multiplexer/demultiplexer for user-space protocol stacks
• Code (Linux, FreeBSD) is available at:
– http://info.iet.unipi.it/~luigi/netmap/