
New sendfile

A new implementation of the sendfile(2) system call that does not block on disk I/O. Presented at the FreeBSD storage devsummit at Netflix in February 2015.

  1. New sendfile(2). Gleb Smirnoff, glebius@FreeBSD.org. FreeBSD Storage Summit, Netflix, 20 February 2015.
  2. History of sendfile(2): before sendfile(2). Miserable life without sendfile(2): while ((cnt = read(filefd, buf, (u_int)blksize)) > 0 && write(netfd, buf, cnt) == cnt) byte_count += cnt; (send_data() in src/libexec/ftpd/ftpd.c, FreeBSD 1.0, 1993)
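The slide's loop treats a short write as a failure and stops. A minimal userland sketch of the same read/write copy that retries short writes is shown here; copy_fd and its 4096-byte buffer are illustrative choices, not code from ftpd.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy everything readable from `in` to `out` through a userland buffer.
 * Returns the number of bytes copied, or -1 on a read or write error.
 * Hypothetical helper for illustration only.
 */
ssize_t copy_fd(int in, int out, size_t blksize)
{
    char buf[4096];
    ssize_t cnt, total = 0;

    assert(blksize > 0 && blksize <= sizeof(buf));
    while ((cnt = read(in, buf, blksize)) > 0) {
        ssize_t off = 0;

        while (off < cnt) {             /* write(2) may accept less than asked */
            ssize_t n = write(out, buf + off, (size_t)(cnt - off));

            if (n < 0) {
                if (errno == EINTR)     /* interrupted before writing: retry */
                    continue;
                return -1;
            }
            off += n;
        }
        total += cnt;
    }
    return cnt < 0 ? -1 : total;        /* cnt == 0 means end of input */
}
```

Every byte crosses the user/kernel boundary twice (once in read(2), once in write(2)), which is the overhead sendfile(2) was introduced to remove.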
  3. History of sendfile(2): sendfile(2) introduced. int sendfile(int fd, int s, off_t offset, size_t nbytes, .. ); 1997: HP-UX 11.00. 1998: FreeBSD 3.0 and Linux 2.2.
  7. History of sendfile(2): sendfile(2) in FreeBSD. The first implementation mapped the userland loop into the kernel: read(filefd) → VOP_READ(vnode); write(netfd) → sosend(socket); blksize → PAGE_SIZE. Further optimisations: 2004: SF_NODISKIO flag; 2006: inner loop, working on sbspace() bytes; 2013: sending data from a shared memory descriptor.
  9. What’s not right with sendfile(2). Problem #1: blocking on I/O. Algorithm of a modern HTTP server: (1) take yet another descriptor from kevent(2); (2) do write(2)/read(2)/sendfile(2) on it; (3) go to 1. Bottleneck: the time spent in any syscall.
  11. What’s not right with sendfile(2). Attempts to solve problem #1: separate I/O contexts, i.e. processes or threads (Apache, nginx 2); SF_NODISKIO + aio_read(2) (nginx, Varnish).
  12. What’s not right with sendfile(2). More attempts: aio_mlock(2) instead of aio_read(2); aio_sendfile(2)???
  14. What’s not right with sendfile(2). Problem #2: control over VM. VOP_READ() leaves pages in the VM cache. VOP_READ() [for UFS] does readahead. It is not easy to prevent it from doing that!
  17. New sendfile(2): implementation above the pager. What if VOP_GETPAGES()? VOP_READ() → VOP_GETPAGES(). Pros: sendfile() already works on pages; implementations for vnode and shmem converge; control over VM becomes an easier task. Cons: losing readahead heuristics. But no one used them!
  18. New sendfile(2): VOP_GETPAGES_ASYNC(). int VOP_GETPAGES(struct vnode *vp, vm_page_t *ma, int count, int reqpage); (1) initialize buf(9); (2) buf->b_iodone = bdone; (3) bstrategy(buf); (4) bwait(buf); /* sleeps until I/O completes */ (5) return;
  19. New sendfile(2): VOP_GETPAGES_ASYNC(). int VOP_GETPAGES_ASYNC(struct vnode *vp, vm_page_t *ma, int count, int reqpage, vop_getpages_iodone_t *iodone, void *arg); (1) initialize buf(9); (2) buf->b_iodone = vnode_pager_async_iodone; (3) bstrategy(buf); (4) return; vnode_pager_async_iodone calls iodone().
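The difference between the two calls is only where the caller waits: the blocking variant sleeps in bwait(), while the async variant records a callback and returns. A userland toy model of that callback hand-off could look like the sketch below. All toy_* names are hypothetical, and the "disk" is simulated by an explicit completion call; none of this is FreeBSD kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Completion callback type, modelled loosely on vop_getpages_iodone_t. */
typedef void toy_iodone_t(void *arg, int error);

/* Toy stand-in for buf(9): remembers who to notify when I/O finishes. */
struct toy_buf {
    toy_iodone_t *b_iodone;
    void         *b_arg;
    int           b_inflight;   /* 1 while the simulated read is pending */
};

/* Start a read and return at once, without sleeping for completion. */
int toy_getpages_async(struct toy_buf *bp, toy_iodone_t *iodone, void *arg)
{
    assert(!bp->b_inflight);
    bp->b_iodone = iodone;
    bp->b_arg = arg;
    bp->b_inflight = 1;         /* stands in for bstrategy(buf) */
    return 0;
}

/* Called by the simulated disk when the read finishes. */
void toy_disk_complete(struct toy_buf *bp, int error)
{
    assert(bp->b_inflight);
    bp->b_inflight = 0;
    bp->b_iodone(bp->b_arg, error);
}

/* Example callback: count successfully completed reads. */
void toy_count_done(void *arg, int error)
{
    if (error == 0)
        ++*(int *)arg;
}
```

Between toy_getpages_async() and toy_disk_complete() the caller is free to serve other descriptors, which is the property the event loop from problem #1 needs.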
  20. New sendfile(2): naive non-blocking sendfile(2). In kern_sendfile(): (1) nios++; (2) VOP_GETPAGES_ASYNC(sendfile_iodone); In sendfile_iodone(): (1) nios--; (2) if (nios) return; (3) sosend();
  21. New sendfile(2): the problem of the naive implementation. sendfile(filefd, sockfd, ..); write(sockfd, ..); (the write(2) data would enter the socket buffer ahead of file data that is still being read from disk)
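The reordering can be reproduced with a small userland model, assuming the naive scheme queues file data only from the completion callback. The toy socket below is just a byte string, and every name in it is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Toy socket: bytes in the order they would go out on the wire. */
struct toy_sock {
    char   data[64];
    size_t len;
};

/* State of one naive sendfile call (hypothetical, userland only). */
struct toy_sendfile {
    struct toy_sock *so;
    const char      *pages;  /* file data, usable once all reads finish */
    int              nios;   /* outstanding disk reads */
};

static void toy_append(struct toy_sock *so, const char *p)
{
    size_t n = strlen(p);

    assert(so->len + n < sizeof(so->data));
    memcpy(so->data + so->len, p, n + 1);   /* copy the NUL terminator too */
    so->len += n;
}

/* Start the disk reads and return immediately; nothing is queued yet. */
void naive_sendfile(struct toy_sendfile *sf, struct toy_sock *so,
    const char *pages, int nios)
{
    assert(nios > 0);
    sf->so = so;
    sf->pages = pages;
    sf->nios = nios;
}

/* Completion callback: the last read to finish queues the data. */
void naive_sendfile_iodone(struct toy_sendfile *sf)
{
    if (--sf->nios)
        return;
    toy_append(sf->so, sf->pages);          /* the slide's sosend() */
}

/* write(2) on the same socket queues its data right away. */
void toy_write(struct toy_sock *so, const char *p)
{
    toy_append(so, p);
}
```

Calling naive_sendfile() with "FILE", then toy_write() with "TRAILER", then completing both reads leaves the socket holding the trailer first and the file data second, which is the wrong order for the peer.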
  22. New sendfile(2): “not ready” data in socket buffers. Socket buffer [diagram: chain of six mbufs]. struct sockbuf: struct mbuf *sb_mb; struct mbuf *sb_mbtail; u_int sb_cc;
  23. New sendfile(2): “not ready” data in socket buffers. Socket buffer with “not ready” data [diagram: chain of six mbufs and two pages]. struct sockbuf: struct mbuf *sb_mb; struct mbuf *sb_fnrdy; struct mbuf *sb_mbtail; u_int sb_acc; u_int sb_ccc;
  24. New sendfile(2): final implementation of non-blocking sendfile(2). In kern_sendfile(): (1) nios++; (2) VOP_GETPAGES_ASYNC(sendfile_iodone); (3) sosend(NOT_READY); In sendfile_iodone(): (1) nios--; (2) if (nios) return; (3) soready();
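A userland sketch of the idea follows: file data is queued immediately but marked not ready, so later writes line up behind it and nothing becomes sendable until the disk reads finish. The field names sb_mb, sb_mbtail, sb_fnrdy, sb_acc and sb_ccc come from the slides; the toy_* types and functions, the m_notready field and the one-mbuf-per-request simplification are assumptions made for illustration, not the kernel code.

```c
#include <assert.h>
#include <stddef.h>

struct toy_mbuf {
    struct toy_mbuf *m_next;
    unsigned         m_len;
    int              m_notready;  /* data still being read from disk */
};

struct toy_sockbuf {
    struct toy_mbuf *sb_mb;       /* head of the chain */
    struct toy_mbuf *sb_mbtail;   /* tail of the chain */
    struct toy_mbuf *sb_fnrdy;    /* first not-ready mbuf, or NULL */
    unsigned         sb_acc;      /* bytes available to the protocol */
    unsigned         sb_ccc;      /* bytes claimed in the buffer */
};

/* Append at the tail; data behind a not-ready mbuf is claimed, not available. */
void toy_sbappend(struct toy_sockbuf *sb, struct toy_mbuf *m)
{
    m->m_next = NULL;
    if (sb->sb_mbtail != NULL)
        sb->sb_mbtail->m_next = m;
    else
        sb->sb_mb = m;
    sb->sb_mbtail = m;
    sb->sb_ccc += m->m_len;
    if (sb->sb_fnrdy == NULL) {
        if (m->m_notready)
            sb->sb_fnrdy = m;
        else
            sb->sb_acc += m->m_len;
    }
}

/* Mark m ready and release everything up to the next not-ready mbuf. */
void toy_sbready(struct toy_sockbuf *sb, struct toy_mbuf *m)
{
    assert(m->m_notready);
    m->m_notready = 0;
    if (m != sb->sb_fnrdy)
        return;                   /* still blocked by an earlier mbuf */
    for (; m != NULL && !m->m_notready; m = m->m_next)
        sb->sb_acc += m->m_len;
    sb->sb_fnrdy = m;
}

/* Per-request state: outstanding reads and the mbuf they will fill. */
struct toy_sfio {
    int                 nios;
    struct toy_sockbuf *sb;
    struct toy_mbuf    *m;
};

/* Completion callback: the last read to finish makes the data ready. */
void toy_sendfile_iodone(struct toy_sfio *sfio)
{
    if (--sfio->nios > 0)
        return;
    toy_sbready(sfio->sb, sfio->m);
}
```

In this model a write(2) issued right after sendfile(2) stays behind the file data in the chain, and sb_acc only jumps to the full byte count once the last read completes.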
  25. New sendfile(2): comparison with old sendfile(2). [figure: traffic]
  26. New sendfile(2): comparison with old sendfile(2). [figure: CPU idle]
  27. New sendfile(2): comparison with old sendfile(2). Profiling sendfile(2) in head: aio_daemon 13.64%; sys_sendfile 7.40%; t4_intr 5.66%; xpt_done 1.04%; pagedaemon 4.16%; scheduler 5.28%.
  29. New sendfile(2): comparison with old sendfile(2). Profiling new sendfile(2): sys_sendfile 16.9% (vm_page_grab 9.24% !!); t4_intr 8.17% (tcp_output() 2.07% !!); xpt_done 9.91% (m_freem() 3.11% !!); pagedaemon 6.54%; scheduler 3.58%.
  31. New sendfile(2): comparison with old sendfile(2). What changed? The new code always sends a full socket buffer, which is good for TCP (as a protocol) but hurts VM, the mbuf allocator and, unexpectedly, the TCP stack. Will fix that!
  32. New sendfile(2): comparison with old sendfile(2). [figure: old sendfile(2) @ Netflix]
  33. New sendfile(2): comparison with old sendfile(2). [figure: new sendfile(2) @ Netflix]
  35. New sendfile(2): plans and problems. TODO list. Problems: VM & I/O overcommit; ZFS; SCTP. Future plans: sendfile(2) doing TLS.
  36. New sendfile(2): Questions?
