New sendfile(2)
Gleb Smirnoff
glebius@FreeBSD.org
FreeBSD Storage Summit
Netflix
20 February 2015
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 1 / 23
History of sendfile(2) Before sendfile(2)
Miserable life w/o sendfile(2)
while ((cnt = read(filefd, buf, (u_int)blksize)) > 0 &&
    write(netfd, buf, cnt) == cnt)
        byte_count += cnt;
send_data() in src/libexec/ftpd/ftpd.c,
FreeBSD 1.0, 1993
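For comparison, here is a minimal sketch of the same userland copy loop written as a standalone helper. It is not the ftpd code: the name copy_fd() and the fixed 8 KB bounce buffer are assumptions made for illustration, and unlike the loop above it retries short writes. Every byte still crosses the user/kernel boundary twice, which is the cost sendfile(2) was introduced to avoid.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy everything from filefd to netfd through a
 * userland buffer.  Returns the number of bytes copied, or -1 on error.
 */
static off_t
copy_fd(int filefd, int netfd, size_t blksize)
{
	char buf[8192];
	off_t byte_count = 0;
	ssize_t cnt, n, off;

	assert(blksize > 0 && blksize <= sizeof(buf));
	for (;;) {
		cnt = read(filefd, buf, blksize);
		if (cnt == 0)
			break;			/* end of file */
		if (cnt < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry the read */
			return (-1);
		}
		/* write(2) may accept fewer bytes than requested. */
		for (off = 0; off < cnt; ) {
			n = write(netfd, buf + off, (size_t)(cnt - off));
			if (n < 0) {
				if (errno == EINTR)
					continue;
				return (-1);
			}
			off += n;
		}
		byte_count += cnt;
	}
	return (byte_count);
}
```

A caller would invoke it as copy_fd(filefd, netfd, blksize) and compare the result with the file size.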
History of sendfile(2) sendfile(2) introduced
sendfile(2) introduced
int
sendfile(int fd, int s, off_t offset, size_t nbytes, .. );
1997: HP-UX 11.00
1998: FreeBSD 3.0 and Linux 2.2
History of sendfile(2) sendfile(2) in FreeBSD
sendfile(2) in FreeBSD
First implementation: mapping the userland loop into the
kernel:
read(filefd) → VOP_READ(vnode)
write(netfd) → sosend(socket)
blksize → PAGE_SIZE
Further optimisations:
2004: SF_NODISKIO flag
2006: inner loop, working on sbspace() bytes
2013: sending data from a shared memory descriptor
What’s not right with sendfile(2) blocking on I/O
Problem #1: blocking on I/O
Algorithm of a modern HTTP-server:
1 Take yet another descriptor from kevent(2)
2 Do write(2)/read(2)/sendfile(2) on it
3 Go to 1
Bottleneck: the time spent in any single syscall.
What’s not right with sendfile(2) blocking on I/O
Attempts to solve problem #1
Separate I/O contexts: processes, threads
Apache
nginx 2
SF_NODISKIO + aio_read(2)
nginx
Varnish
What’s not right with sendfile(2) blocking on I/O
More attempts . . .
aio_mlock(2) instead of aio_read(2)
aio_sendfile(2) ???
What’s not right with sendfile(2) control over
Problem #2: control over VM
VOP_READ() leaves pages in VM cache
VOP_READ() [for UFS] does readahead
Not easy to prevent it doing that!
New sendfile(2) implementation above pager
What if VOP_GETPAGES()?
VOP_READ() → VOP_GETPAGES()
Pros:
sendfile() already works on pages
implementations for vnode and shmem converge
control over VM becomes an easier task
Cons:
Losing readahead heuristics
But no one used them!
New sendfile(2) VOP_GETPAGES_ASYNC()
VOP_GETPAGES_ASYNC()
int
VOP_GETPAGES(struct vnode *vp, vm_page_t *ma,
int count, int reqpage);
1 Initialize buf(9)
2 buf->b_iodone = bdone;
3 bstrategy(buf);
4 bwait(buf); /* sleeps until I/O completes */
5 return;
New sendfile(2) VOP_GETPAGES_ASYNC()
VOP_GETPAGES_ASYNC()
int
VOP_GETPAGES_ASYNC(struct vnode *vp,
vm_page_t *ma, int count, int reqpage,
vop_getpages_iodone_t *iodone, void *arg);
1 Initialize buf(9)
2 buf->b_iodone = vnode_pager_async_iodone;
3 bstrategy(buf);
4 return;
vnode_pager_async_iodone() calls iodone().
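The difference between the two variants can be modelled in userland. The sketch below is a toy, not kernel code: struct toy_buf, toy_strategy(), toy_disk_complete(), toy_getpages_async() and count_iodone() are all hypothetical names, the callback signature is simplified, and the "disk" is a single pending-request slot. It only illustrates the control flow: the async call returns right after queuing the request, and the caller's iodone callback runs later, when the I/O completes.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified completion callback; the real kernel type differs. */
typedef void toy_iodone_t(void *arg, int error);

struct toy_buf {
	toy_iodone_t	*b_iodone;	/* called when the I/O completes */
	void		*b_arg;		/* opaque argument for b_iodone */
	int		 b_done;	/* set once the I/O has completed */
};

/* Hypothetical "disk": holds at most one request in flight. */
static struct toy_buf *toy_pending;

/* Stand-in for bstrategy(): queue the request and return at once. */
static void
toy_strategy(struct toy_buf *bp)
{
	toy_pending = bp;
}

/* Stand-in for the disk interrupt: finish the pending request. */
static void
toy_disk_complete(int error)
{
	struct toy_buf *bp = toy_pending;

	assert(bp != NULL);
	toy_pending = NULL;
	bp->b_done = 1;
	if (bp->b_iodone != NULL)
		bp->b_iodone(bp->b_arg, error);
}

/* Async variant: set the callback, start the I/O, do not wait. */
static void
toy_getpages_async(struct toy_buf *bp, toy_iodone_t *iodone, void *arg)
{
	bp->b_iodone = iodone;
	bp->b_arg = arg;
	bp->b_done = 0;
	toy_strategy(bp);
}

/* Example callback: count completions in the int that arg points to. */
static void
count_iodone(void *arg, int error)
{
	(void)error;
	++*(int *)arg;
}
```

After toy_getpages_async(&b, count_iodone, &n) the counter n is still zero; it becomes one only after toy_disk_complete(0) runs.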
New sendfile(2) non-blocking sendfile(2)
naive non-blocking sendfile(2)
In kern_sendfile():
1 nios++;
2 VOP_GETPAGES_ASYNC(sendfile_iodone);
In sendfile_iodone():
1 nios--;
2 if (nios) return;
3 sosend();
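The counting scheme above can be sketched as a small userland model. The names struct toy_sfio, toy_io_start() and toy_iodone() are hypothetical, and the sent flag merely stands in for the call to sosend(); the sketch ignores locking and any race between issuing requests and their completion.

```c
#include <assert.h>

struct toy_sfio {
	int	nios;	/* I/O requests still in flight */
	int	sent;	/* set once the data went to the socket */
};

/* Called once per page run handed to the pager. */
static void
toy_io_start(struct toy_sfio *sfio)
{
	sfio->nios++;
}

/*
 * Called once per completed I/O.  Only the last completion sends the
 * data.  Returns 1 if this call triggered the send, 0 otherwise.
 */
static int
toy_iodone(struct toy_sfio *sfio)
{
	assert(sfio->nios > 0);
	if (--sfio->nios > 0)
		return (0);
	sfio->sent = 1;		/* stands in for sosend() */
	return (1);
}
```

With three requests started, the first two calls to toy_iodone() return 0 and only the third one sets sent.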
New sendfile(2) non-blocking sendfile(2)
the problem with the naive implementation
sendfile(filefd, sockfd, ..);
write(sockfd, ..);
New sendfile(2) “not ready” data in socket buffers
socket buffer
mbuf mbuf mbuf mbuf mbuf mbuf
struct sockbuf
struct mbuf *sb_mb
struct mbuf *sb_mbtail
u_int sb_cc
New sendfile(2) “not ready” data in socket buffers
socket buffer with “not ready” data
mbuf mbuf mbuf mbuf mbuf mbuf
page page
struct sockbuf
struct mbuf *sb_mb
struct mbuf *sb_fnrdy
struct mbuf *sb_mbtail
u_int sb_acc
u_int sb_ccc
New sendfile(2) final implementation
non-blocking sendfile(2)
In kern_sendfile():
1 nios++;
2 VOP_GETPAGES_ASYNC(sendfile_iodone);
3 sosend(NOT_READY);
In sendfile_iodone():
1 nios--;
2 if (nios) return;
3 soready();
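How "not ready" data keeps a later write(2) behind an earlier sendfile(2) can be shown with a toy socket buffer. The field names sb_mb, sb_fnrdy, sb_mbtail, sb_acc and sb_ccc come from the slides; everything else (struct toy_mbuf, toy_sbappend(), toy_sbready(), the m_notready flag) is a hypothetical simplification and not the kernel implementation. Here sb_ccc counts all bytes placed in the buffer, while sb_acc counts only the ready bytes in front of the first not-ready mbuf, which is what the protocol may send.

```c
#include <assert.h>
#include <stddef.h>

struct toy_mbuf {
	struct toy_mbuf	*m_next;
	unsigned	 m_len;
	int		 m_notready;	/* page I/O still in flight */
};

struct toy_sockbuf {
	struct toy_mbuf	*sb_mb;		/* first mbuf in the buffer */
	struct toy_mbuf	*sb_fnrdy;	/* first not-ready mbuf, or NULL */
	struct toy_mbuf	*sb_mbtail;	/* last mbuf in the buffer */
	unsigned	 sb_acc;	/* available (sendable) bytes */
	unsigned	 sb_ccc;	/* claimed bytes, ready or not */
};

/* Append one mbuf; it is sendable only if nothing before it is pending. */
static void
toy_sbappend(struct toy_sockbuf *sb, struct toy_mbuf *m)
{
	m->m_next = NULL;
	if (sb->sb_mbtail != NULL)
		sb->sb_mbtail->m_next = m;
	else
		sb->sb_mb = m;
	sb->sb_mbtail = m;
	sb->sb_ccc += m->m_len;
	if (sb->sb_fnrdy == NULL) {
		if (m->m_notready)
			sb->sb_fnrdy = m;
		else
			sb->sb_acc += m->m_len;
	}
}

/* I/O for m finished: release it and any ready data queued behind it. */
static void
toy_sbready(struct toy_sockbuf *sb, struct toy_mbuf *m)
{
	assert(m->m_notready);
	m->m_notready = 0;
	if (m != sb->sb_fnrdy)
		return;		/* earlier data is still pending */
	while (m != NULL && !m->m_notready) {
		sb->sb_acc += m->m_len;
		m = m->m_next;
	}
	sb->sb_fnrdy = m;
}
```

Appending 100 ready bytes, a 4096-byte not-ready page and then 50 bytes from a later write leaves sb_ccc at 4246 but sb_acc at 100; the 50 bytes become sendable only after toy_sbready() is called on the page, so the byte order on the wire is preserved.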
New sendfile(2) comparison with old sendfile(2)
traffic
New sendfile(2) comparison with old sendfile(2)
CPU idle
New sendfile(2) comparison with old sendfile(2)
profiling sendfile(2) in head
aio_daemon 13.64%
sys_sendfile 7.40%
t4_intr 5.66%
xpt_done 1.04%
pagedaemon 4.16%
scheduler 5.28%
New sendfile(2) comparison with old sendfile(2)
profiling new sendfile(2)
sys_sendfile 16.9% (vm_page_grab 9.24% !!)
t4_intr 8.17% (tcp_output() 2.07% !!)
xpt_done 9.91% (m_freem() 3.11% !!)
pagedaemon 6.54%
scheduler 3.58%
New sendfile(2) comparison with old sendfile(2)
What changed?
New code always sends a full socket buffer
Which is good for TCP (as a protocol)
Which hurts the VM, the mbuf allocator,
and, unexpectedly, the TCP stack
Will fix that!
New sendfile(2) comparison with old sendfile(2)
old sendfile(2) @ Netflix
New sendfile(2) comparison with old sendfile(2)
new sendfile(2) @ Netflix
New sendfile(2) plans and problems
TODO list
Problems:
VM & I/O overcommit
ZFS
SCTP
Future plans:
sendfile(2) doing TLS
New sendfile(2)
Questions?