
Real-time in the real world: DIRT in production


1. Real-time in the real world: DIRT in production
   Bryan Cantrill, SVP, Engineering (bryan@joyent.com, @bcantrill)
   Brendan Gregg, Lead Performance Engineer (brendan@joyent.com, @brendangregg)

2. Previously, on #surgecon...
   • Two years ago at Surge, we described the emergence of real-time data
     semantics in web-facing applications
   • We dubbed this data-intensive real-time (DIRT)
   • Last year at Surge 2011, we presented experiences building a DIRTy system
     of our own: a facility for real-time analytics of latency in the cloud
   • While this system is interesting, it is somewhat synthetic in that it does
     not need to scale (much) with respect to users...

3. #surgecon 2012
   • Accelerated by the rise of mobile applications, DIRTy systems are becoming
     increasingly common
   • In the past year, we've seen DIRTy apps in production at scale
   • There are many examples of this, but for us, a paragon of the emerging
     DIRTy apps has been Voxer
   • Voxer is a push-to-talk mobile app that can be thought of as the
     confluence of voice mail and SMS
   • A canonical DIRTy app: latency and scale both matter!
   • Our experiences debugging latency bubbles with Voxer over the past year
     have taught us quite a bit about the new challenges that DIRTy apps pose...

4. The challenge of DIRTy apps
   • DIRTy applications tend to have the human in the loop
   • Good news: deadlines are soft; microseconds only matter when they add up
     to tens of milliseconds
   • Bad news: because humans are in the loop, demand on the system can be
     non-linear
   • One must deal not only with the traditional challenge of scalability, but
     also with the challenges of a real-time system
   • Worse, emerging DIRTy apps have mobile devices at their edge: network
     transience makes clients seem ill-behaved with respect to connection state!

5. The lessons of DIRTy apps
   • Many latency bubbles originate deep in the stack; OS understanding and
     instrumentation have been essential even when the OS is not at fault
   • For up-stack problems, tooling has been essential
   • Latency outliers can come from many sources: application restarts, dropped
     connections, slow disks, boundless memory growth
   • We have also seen some traditional real-time problems with respect to CPU
     scheduling, e.g. priority inversions
   • Enough foreplay; on with the DIRTy disaster pr0n!

6. Application restarts
   • Modern internet-facing architectures are designed to be resilient in the
     face of many failure modes...
   • ...but application restarts can induce pathological, cascading latency
     bubbles as clients reconnect, clusters reconverge, etc.
   • For example, Voxer ran into a node.js bug in which the process would
     terminate on ECONNABORTED from accept(2)
   • A classic difference in OS semantics: BSD and illumos variants (including
     SmartOS) can return ECONNABORTED from accept(); Linux doesn't
   • Much more likely over a transient network!

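A quick way to catch this in the act: a minimal DTrace sketch (ours, not from
the deck), assuming illumos, where errno 130 is ECONNABORTED:

    # dtrace -n 'syscall::accept:return
        /(int)arg0 == -1 && errno == 130/    /* 130 == ECONNABORTED on illumos */
        {
            printf("%Y %s (pid %d): accept() failed with ECONNABORTED",
                walltimestamp, execname, pid);
        }'
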
7. Dropped connections
   • If an application can't keep up with its TCP listen backlog, packets
     (SYNs) are dropped:

     $ netstat -s | grep Drop
             tcpTimRetransDrop   =    56     tcpTimKeepalive     =  2582
             tcpTimKeepaliveProbe=  1594     tcpTimKeepaliveDrop =    41
             tcpListenDrop       =3089298    tcpListenDropQ0     =     0
             tcpHalfOpenDrop     =     0     tcpOutSackRetrans   =1400832
             icmpOutDrops        =     0     icmpOutErrors       =     0
             sctpTimRetrans      =     0     sctpTimRetransDrop  =     0
             sctpTimHearBeatProbe=     0     sctpTimHearBeatDrop =     0
             sctpListenDrop      =     0     sctpInClosed        =     0

   • The client waits, then retransmits the SYN (after 1 or 3 seconds),
     inducing tremendous latency outliers; terrible for DIRTy apps!

8. Dropped connections, cont.
   • The fix for dropped connections:
     • If due to a surge, increase the TCP backlog
     • If due to sustained load, increase CPU resources, decrease CPU
       consumption, or scale the app
   • If fixed by increasing the TCP backlog, check that the system backlog
     tunable took effect!
     • If not, does the app need to be restarted?
     • If not, is the application providing its own backlog that takes
       precedence?
   • How close are we to dropping?

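To watch that last question continuously, a minimal sketch (ours) counting
listen drops per second with the same mib provider used by the scripts that
follow:

    # dtrace -n 'mib:::tcpListenDrop { @drops = count(); }
        tick-1sec { printa("tcpListenDrop/s: %@d\n", @drops); trunc(@drops); }'
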
9. Dropped connections, cont.
   • Networking 101
     [diagram: SYNs flow into the TCP listen backlog; the app drains it via
     accept(); once the queue reaches max, further SYNs become listen drops]

10. Dropped connections, cont.
    The kernel code (usr/src/uts/common/inet/tcp/tcp_input.c):

    /*
     * THIS FUNCTION IS DIRECTLY CALLED BY IP VIA SQUEUE FOR SYN.
     * tcp_input_data will not see any packets for listeners since the listener
     * has conn_recv set to tcp_input_listener.
     */
    /* ARGSUSED */
    static void
    tcp_input_listener(void *arg, mblk_t *mp, void *arg2, ip_recv_attr_t *ira)
    {
    [...]
            if (listener->tcp_conn_req_cnt_q >= listener->tcp_conn_req_max) {
                    mutex_exit(&listener->tcp_eager_lock);
                    TCP_STAT(tcps, tcp_listendrop);
                    TCPS_BUMP_MIB(tcps, tcpListenDrop);
                    if (lconnp->conn_debug) {
                            (void) strlog(TCP_MOD_ID, 0, 1, SL_TRACE|SL_ERROR,
                                "tcp_input_listener: listen backlog (max=%d) "
                                "overflow (%d pending) on %s",
                                listener->tcp_conn_req_max,
                                listener->tcp_conn_req_cnt_q,
                                tcp_display(listener, NULL, DISP_PORT_ONLY));
                    }
                    goto error2;
            }
    [...]

11. Dropped connections, cont.
    SEE ALL THE THINGS!

    tcp_conn_req_cnt_q distributions:

      cpid:3063  max_q:8
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
               1 |                                         0

      cpid:11504  max_q:128
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     7279
               1 |@@                                      405
               2 |@                                       255
               4 |@                                       138
               8 |                                        81
              16 |                                        83
              32 |                                        62
              64 |                                        67
             128 |                                        34
             256 |                                        0

    tcpListenDrops:

      cpid:11504  max_q:128    34

12. Dropped connections, cont.
    • Uses DTrace to get a distribution of TCP backlog queue length on SYN;
      max_q is the backlog limit, per process:

      fbt::tcp_input_listener:entry
      {
          this->connp = (conn_t *)arg0;
          this->tcp = (tcp_t *)this->connp->conn_proto_priv.cp_tcp;
          self->max = strjoin("max_q:",
              lltostr(this->tcp->tcp_conn_req_max));
          self->pid = strjoin("cpid:", lltostr(this->connp->conn_cpid));
          @[self->pid, self->max] = quantize(this->tcp->tcp_conn_req_cnt_q);
      }

      mib:::tcpListenDrop
      {
          this->max = self->max != NULL ? self->max : "<null>";
          this->pid = self->pid != NULL ? self->pid : "<null>";
          @drops[this->pid, this->max] = count();
          printf("%Y %s:%s %s\n", walltimestamp, probefunc, probename,
              this->pid);
      }

    • Script is on http://github.com/brendangregg/dtrace-cloud-tools as
      net/tcpconnreqmaxq-pid*.d

13. Dropped connections, cont.
    Or, snoop each drop:

    # ./tcplistendrop.d
    TIME                  SRC-IP         PORT          DST-IP          PORT
    2012 Jan 19 01:22:49  10.17.210.103  25691  ->  192.192.240.212    80
    2012 Jan 19 01:22:49  10.17.210.108  18423  ->  192.192.240.212    80
    2012 Jan 19 01:22:49  10.17.210.116  38883  ->  192.192.240.212    80
    2012 Jan 19 01:22:49  10.17.210.117  10739  ->  192.192.240.212    80
    2012 Jan 19 01:22:49  10.17.210.112  27988  ->  192.192.240.212    80
    2012 Jan 19 01:22:49  10.17.210.106  28824  ->  192.192.240.212    80
    2012 Jan 19 01:22:49  10.12.143.16   65070  ->  192.192.240.212    80
    2012 Jan 19 01:22:49  10.17.210.100  56392  ->  192.192.240.212    80
    2012 Jan 19 01:22:49  10.17.210.99   24628  ->  192.192.240.212    80
    2012 Jan 19 01:22:49  10.17.210.98   11686  ->  192.192.240.212    80
    2012 Jan 19 01:22:49  10.17.210.101  34629  ->  192.192.240.212    80
    [...]

14. Dropped connections, cont.
    • That script parses the IP and TCP headers from the in-kernel packet
      buffer:

      fbt::tcp_input_listener:entry  { self->mp = args[1]; }
      fbt::tcp_input_listener:return { self->mp = 0; }

      mib:::tcpListenDrop
      /self->mp/
      {
          this->iph = (ipha_t *)self->mp->b_rptr;
          this->tcph = (tcph_t *)(self->mp->b_rptr + 20);
          printf("%-20Y %-18s %-5d -> %-18s %-5d\n", walltimestamp,
              inet_ntoa(&this->iph->ipha_src),
              ntohs(*(uint16_t *)this->tcph->th_lport),
              inet_ntoa(&this->iph->ipha_dst),
              ntohs(*(uint16_t *)this->tcph->th_fport));
      }

    • Script is tcplistendrop*.d, also on github

15. Dropped connections, cont.
    • To summarize: dropped connections induce acute latency bubbles
    • With Voxer, we found that failures often cascaded: high CPU utilization
      due to unrelated issues would induce TCP listen drops
    • Tunables don't always take effect: they need confirmation
    • Having a quick tool (DTrace) to check such scalability issues has been
      invaluable

16. Slow disks
    • Slow I/O in a cloud computing environment can be caused by
      multi-tenancy, which is to say, neighbors:
      • a neighbor running a backup
      • a neighbor running a benchmark
    • Neighbors can't be seen by tenants...
    • ...but is it really a neighbor?

17. Slow disks, cont.
    • Unix 101
      [diagram: Process → Syscall Interface → VFS → ZFS (among other
      filesystems) → Block Device Interface → Disks]

18. Slow disks, cont.
    • Unix 101
      [diagram: the same stack; I/O at the syscall interface is synchronous
      from the process's view, while the block device interface observed by
      iostat(1) is often asynchronous due to write buffering and read-ahead]

19. Slow disks, cont.
    • VFS-level iostat: vfsstat

    # vfsstat -Z 1
     r/s  w/s kr/s kw/s ractv wactv read_t writ_t %r %w  d/s del_t zone
     1.2  2.8  0.6  0.2   0.0   0.0    0.0    0.0  0  0  0.0   0.0 global (0)
     0.1  0.0  0.1  0.0   0.0   0.0    0.0    0.0  0  0  0.0  34.9 9cc2d0d3 (2)
     0.1  0.0  0.1  0.0   0.0   0.0    0.0    0.0  0  0  0.0  46.5 72188ca0 (3)
     0.0  0.0  0.0  0.0   0.0   0.0    0.0    0.0  0  0  0.0  16.5 4d2a62bb (4)
     0.3  0.1  0.1  0.3   0.0   0.0    0.0    0.0  0  0  0.0  27.6 8bbc4000 (5)
     5.9  0.2  0.5  0.1   0.0   0.0    0.0    0.0  0  0  5.0  11.3 d305ee44 (6)
     0.1  0.0  0.1  0.0   0.0   0.0    0.0    0.0  0  0  0.0 132.0 9897c8f5 (7)
     0.1  0.0  0.1  0.0   0.0   0.0    0.0    0.1  0  0  0.0  40.7 5f3c7d9e (9)
     0.2  0.8  0.5  0.6   0.0   0.0    0.0    0.0  0  0  0.0  31.9 22ef87fc (10)

    • Kernel changes, new kstats (thanks, Bill Pijewski)

20. Slow disks, cont.
    • zfsslower.d:

    # ./zfsslower.d 10
    TIME                 PROCESS  D       B  ms FILE
    2012 Sep 27 13:45:33 zlogin   W     372  11 /zones/b8b2464c/var/adm/wtmpx
    2012 Sep 27 13:45:36 bash     R       8  14 /zones/b8b2464c/opt/local/bin/zsh
    2012 Sep 27 13:45:58 mysqld   R 1048576  19 /zones/b8b2464c/var/mysql/ibdata1
    2012 Sep 27 13:45:58 mysqld   R 1048576  22 /zones/b8b2464c/var/mysql/ibdata1
    2012 Sep 27 13:46:14 master   R       8   6 /zones/b8b2464c/root/opt/local/libexec/postfix/qmgr
    2012 Sep 27 13:46:14 master   R    4096   5 /zones/b8b2464c/root/opt/local/etc/postfix/master.cf
    [...]

    • The go-to tool: is there any VFS-level I/O slower than 10 ms (the
      argument)?
    • Stupidly easy to do

21. Slow disks, cont.
    • Written in DTrace:

      [...]
      fbt::zfs_read:entry,
      fbt::zfs_write:entry
      {
          self->path = args[0]->v_path;
          self->kb = args[1]->uio_resid / 1024;
          self->start = timestamp;
      }

      fbt::zfs_read:return,
      fbt::zfs_write:return
      /self->start && (timestamp - self->start) >= min_ns/
      {
          this->iotime = (timestamp - self->start) / 1000000;
          this->dir = probefunc == "zfs_read" ? "R" : "W";
          printf("%-20Y %-16s %1s %4d %6d %s\n", walltimestamp,
              execname, this->dir, self->kb, this->iotime,
              self->path != NULL ? stringof(self->path) : "<null>");
      }
      [...]

    • zfsslower.d, also on github, originated from the DTrace book

22. Slow disks, cont.
    • Traces the VFS/ZFS interface (kernel), from
      usr/src/uts/common/fs/zfs/zfs_vnops.c:

      /*
       * Regular file vnode operations template
       */
      vnodeops_t *zfs_fvnodeops;
      const fs_operation_def_t zfs_fvnodeops_template[] = {
              VOPNAME_OPEN,           { .vop_open = zfs_open },
              VOPNAME_CLOSE,          { .vop_close = zfs_close },
              VOPNAME_READ,           { .vop_read = zfs_read },
              VOPNAME_WRITE,          { .vop_write = zfs_write },
              VOPNAME_IOCTL,          { .vop_ioctl = zfs_ioctl },
              VOPNAME_GETATTR,        { .vop_getattr = zfs_getattr },
      [...]

23. Slow disks, cont.
    • Unix 101
      [diagram: the same I/O stack, with zfsslower.d tracing at the VFS/ZFS
      interface and iosnoop tracing at the block device interface; correlate
      the two]

24. Slow disks, cont.
    • Correlating the layers narrows down where the latency originates
    • Or you can associate the two layers in the same D script
    • Via text, filtering on slow I/O, this works fine
    • For high-frequency I/O, use heat maps

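For the block-device side of that correlation, a minimal sketch (ours,
assembled from the stable io provider) that quantizes block I/O latency:

    # dtrace -n 'io:::start { ts[arg0] = timestamp; }
        io:::done /ts[arg0]/ {
            @["block I/O latency (ns)"] = quantize(timestamp - ts[arg0]);
            ts[arg0] = 0;
        }'
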
25. Slow disks, cont.
    • WHAT DOES IT MEAN?

26. Slow disks, cont.
    • Latency outliers:
      [latency heat map screenshot]

27. Slow disks, cont.
    • Latency outliers:
      [the same heat map, annotated from top to bottom: Inconceivable, Very
      Bad, Bad, Good]

28. Slow disks, cont.
    • Inconceivably bad, 1000+ ms VFS-level latency:
      • queueing behind large ZFS SPA syncs (tunable)
      • other tenants benchmarking (before we added I/O throttling to SmartOS)
    • Reads queueing behind writes: needed to tune ZFS and the LSI PERC
      (shakes fist!)
      [heat map: latency (y axis, ~60 ms scale) vs. time (s); read = red,
      write = blue]

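To confirm the SPA-sync queueing theory, one might time each spa_sync() pass
directly; a sketch (ours; fbt probes are unstable and specific to the illumos
ZFS implementation):

    # dtrace -n 'fbt::spa_sync:entry { self->ts = timestamp; }
        fbt::spa_sync:return /self->ts/ {
            @["spa_sync (ms)"] = quantize((timestamp - self->ts) / 1000000);
            self->ts = 0;
        }'
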
29. Slow disks, cont.
    • Deeper tools rolled as needed, anywhere in ZFS:

    # dtrace -n 'io:::start { @[stack()] = count(); }'
    dtrace: description 'io:::start ' matched 6 probes
    ^C
              genunix`ldi_strategy+0x53
              zfs`vdev_disk_io_start+0xcc
              zfs`zio_vdev_io_start+0xab
              zfs`zio_execute+0x88
              zfs`zio_nowait+0x21
              zfs`vdev_mirror_io_start+0xcd
              zfs`zio_vdev_io_start+0x250
              zfs`zio_execute+0x88
              zfs`zio_nowait+0x21
              zfs`arc_read_nolock+0x4f9
              zfs`arc_read+0x96
              zfs`dsl_read+0x44
              zfs`dbuf_read_impl+0x166
              zfs`dbuf_read+0xab
              zfs`dmu_buf_hold_array_by_dnode+0x189
              zfs`dmu_buf_hold_array+0x78
              zfs`dmu_read_uio+0x5c
              zfs`zfs_read+0x1a3
              genunix`fop_read+0x8b
              genunix`read+0x2a7
              143

30. Slow disks, cont.
    • On Joyent's IaaS architecture, it's usually not the disks or the
      filesystem; it's useful to rule that out quickly
    • Some of the time it is, due to bad disks (1000+ ms I/O); the heat map or
      iosnoop correlation matches
    • Some of the time it's due to big I/O (how quick is a 40 Mbyte read from
      cache?)
    • Some of the time it is other tenants (benchmarking!); much less for us
      now with ZFS I/O throttling
    • With ZFS and an SSD-based intent log, HW RAID is not just unobservable,
      but entirely unnecessary: adios, PERC!

31. Memory growth
    • Riak had endless memory growth
    • Expected 9 GB; after two days:

    $ prstat -c 1
    Please wait...
       PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
     21722 103        43G   40G cpu0    59    0  72:23:41 2.6% beam.smp/594
     15770 root     7760K  540K sleep   57    0  23:28:57 0.9% zoneadmd/5
        95 root        0K    0K sleep   99  -20   7:37:47 0.2% zpool-zones/166
     12827 root      128M   73M sleep  100    -   0:49:36 0.1% node/5
     10319 bgregg     10M 6788K sleep   59    0   0:00:00 0.0% sshd/1
     10402 root       22M  288K sleep   59    0   0:18:45 0.0% dtrace/1
    [...]

    • The system eventually hits paging and terrible performance, requiring a
      restart
    • Remember: application restarts are a latency disaster!

32. Memory growth, cont.
    • What is in the heap?

    $ pmap 14719
    14719:  beam.smp
    0000000000400000       2168K r-x--  /opt/riak/erts-5.8.5/bin/beam.smp
    000000000062D000        328K rw---  /opt/riak/erts-5.8.5/bin/beam.smp
    000000000067F000    4193540K rw---  /opt/riak/erts-5.8.5/bin/beam.smp
    00000001005C0000    4194296K rw---    [ anon ]
    00000002005BE000    4192016K rw---    [ anon ]
    0000000300382000    4193664K rw---    [ anon ]
    00000004002E2000    4191172K rw---    [ anon ]
    00000004FFFD3000    4194040K rw---    [ anon ]
    00000005FFF91000    4194028K rw---    [ anon ]
    00000006FFF4C000    4188812K rw---    [ anon ]
    00000007FF9EF000     588224K rw---    [ heap ]
    [...]

    • ...and why does it keep growing?

33. Memory growth, cont.
    • Is this a memory leak?
      • in the app logic: Voxer?
      • in the DB logic: Riak?
      • in the DB's Erlang VM?
      • in the OS libraries (libc, lib*)?
      • in the OS kernel?
    • Or application growth?
    • Where would you guess?

34. Memory growth, cont.
    • Voxer (app): we don't think it's us
    • Basho (Riak): we don't think it's us
    • Joyent (OS): we don't think it's us
    • This sort of issue is usually app growth...
    • ...but we can check the libraries and kernel to be sure

35. Memory growth, cont.
    • libumem was in use for allocations
      • fast, scalable, object-caching, multi-threaded support
      • the user-land version of kmem (Bonwick's slab allocator)

36. Memory growth, cont.
    • Fixing by experimentation (backend=mmap, other allocators) wasn't
      working
    • Detailed observability can be enabled in libumem, allowing heap
      profiling and leak detection
    • While designed with speed and production use in mind, it still comes
      with some cost (time and space) and isn't on by default: a restart is
      required
    • UMEM_DEBUG=audit

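As a sketch of how that audit mode gets used (the launch line is hypothetical;
UMEM_DEBUG, UMEM_LOGGING, and ::findleaks are documented libumem/mdb
facilities, and the pid is illustrative): restart with auditing on, capture a
core, and look for leaks postmortem:

    $ UMEM_DEBUG=audit UMEM_LOGGING=transaction riak start
    $ gcore `pgrep beam.smp`
    $ mdb core.21722
    > ::findleaks
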
37. Memory growth, cont.
    • libumem provides some default observability
    • E.g., slabs:

    > ::umem_malloc_info
                CACHE  BUFSZ MAXMAL BUFMALLC AVG_MAL  MALLOCED  OVERHEAD   %OVER
     0000000000707028      8      0        0       0         0         0    0.0%
     000000000070b028     16      8     8730       8     69836   1054998 1510.6%
     000000000070c028     32     16     8772      16    140352   1130491  805.4%
     000000000070f028     48     32  1148038      25  29127788 156179051  536.1%
     0000000000710028     64     48   344138      40  13765658  58417287  424.3%
     0000000000711028     80     64       36      62      2226      4806  215.9%
     0000000000714028     96     80     8934      79    705348   1168558  165.6%
     0000000000715028    112     96  1347040      87 117120208 190389780  162.5%
     0000000000718028    128    112   253107     111  28011923  42279506  150.9%
     000000000071a028    160    144    40529     118   4788681   6466801  135.0%
     000000000071b028    192    176      140     155     21712     25818  118.9%
     000000000071e028    224    208       43     188      8101      6497   80.1%
     000000000071f028    256    240      133     229     30447     26211   86.0%
     0000000000720028    320    304       56     276     15455     12276   79.4%
     0000000000723028    384    368       35     335     11726      7220   61.5%
    [...]

38. Memory growth, cont.
    • ...and heap (captured @14 GB RSS):

    > ::vmem
    ADDR             NAME                  INUSE       TOTAL SUCCEED  FAIL
    fffffd7ffebed4a0 sbrk_top         9090404352 14240165888 4298117 84403
    fffffd7ffebee0a8 sbrk_heap        9090404352  9090404352 4298117     0
    fffffd7ffebeecb0 vmem_internal     664616960   664616960   79621     0
    fffffd7ffebef8b8 vmem_seg          651993088   651993088   79589     0
    fffffd7ffebf04c0 vmem_hash          12583424    12587008      27     0
    fffffd7ffebf10c8 vmem_vmem             46200       55344      15     0
    00000000006e7000 umem_internal     352862464   352866304   88746     0
    00000000006e8000 umem_cache           113696      180224      44     0
    00000000006e9000 umem_hash          13091328    13099008      86     0
    00000000006ea000 umem_log                  0           0       0     0
    00000000006eb000 umem_firewall_va          0           0       0     0
    00000000006ec000 umem_firewall             0           0       0     0
    00000000006ed000 umem_oversize    5218777974  5520789504 3822051     0
    00000000006f0000 umem_memalign             0           0       0     0
    0000000000706000 umem_default     2552131584  2552131584  307699     0

    • The heap is 9 GB (as expected), but sbrk_top total is 14 GB (equal to
      RSS). And growing.
    • Are there Gbyte-sized malloc()/free()s?

39. Memory growth, cont.

    # dtrace -n 'pid$target::malloc:entry { @ = quantize(arg0); }' -p 17472
    dtrace: description 'pid$target::malloc:entry ' matched 3 probes
    ^C
           value  ------------- Distribution ------------- count
               2 |                                         0
               4 |                                         3
               8 |@                                        5927
              16 |@@@@                                     41818
              32 |@@@@@@@@@                                81991
              64 |@@@@@@@@@@@@@@@@@@                       169888
             128 |@@@@@@@                                  69891
             256 |                                         2257
             512 |                                         406
            1024 |                                         893
            2048 |                                         146
            4096 |                                         1467
            8192 |                                         755
           16384 |                                         950
           32768 |                                         83
           65536 |                                         31
          131072 |                                         11
          262144 |                                         15
          524288 |                                         0
         1048576 |                                         1
         2097152 |                                         0

    • No huge malloc()s, but RSS continues to climb.

40. Memory growth, cont.
    • Tracing why the heap grows via brk():

    # dtrace -n 'syscall::brk:entry /execname == "beam.smp"/ { ustack(); }'
    dtrace: description 'syscall::brk:entry ' matched 1 probe
    CPU     ID                    FUNCTION:NAME
     10     18                          brk:entry
              libc.so.1`_brk_unlocked+0xa
              libumem.so.1`vmem_sbrk_alloc+0x84
              libumem.so.1`vmem_xalloc+0x669
              libumem.so.1`vmem_alloc+0x14f
              libumem.so.1`vmem_xalloc+0x669
              libumem.so.1`vmem_alloc+0x14f
              libumem.so.1`umem_alloc+0x72
              libumem.so.1`malloc+0x59
              libstdc++.so.6.0.14`_Znwm+0x20
              libstdc++.so.6.0.14`_Znam+0x9
              eleveldb.so`_ZN7leveldb9ReadBlockEPNS_16RandomAccessFileERKNS_11Rea...
              eleveldb.so`_ZN7leveldb5Table11BlockReaderEPvRKNS_11ReadOptionsERKN...
              eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIterator13InitDataBl...
              eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIterator4SeekERKNS_5...
              eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIterator4SeekERKNS_5...
              eleveldb.so`_ZN7leveldb12_GLOBAL__N_115MergingIterator4SeekERKNS_5S...
              eleveldb.so`_ZN7leveldb12_GLOBAL__N_16DBIter4SeekERKNS_5SliceE+0xcc
              eleveldb.so`eleveldb_get+0xd3
              beam.smp`process_main+0x6939
              beam.smp`sched_thread_func+0x1cf
              beam.smp`thr_wrapper+0xbe

41. Memory growth, cont.
    • More DTrace showed the size of the malloc()s causing the brk()s:

    # dtrace -x dynvarsize=4m -n '
        pid$target::malloc:entry { self->size = arg0; }
        syscall::brk:entry /self->size/ { printf("%d bytes", self->size); }
        pid$target::malloc:return { self->size = 0; }' -p 17472
    dtrace: description 'pid$target::malloc:entry ' matched 7 probes
    CPU     ID                    FUNCTION:NAME
      0     44                          brk:entry 8343520 bytes
      0     44                          brk:entry 8343520 bytes
    [...]

    • These 8 Mbyte malloc()s grew the heap
    • Even though the heap has Gbytes not in use
    • This is starting to look like an OS issue

42. Memory growth, cont.
    • More tools were created:
      • show memory entropy (+ malloc, - free) along with heap growth, over
        time (a sketch follows)
      • show the code path taken for allocations; compare successful with
        unsuccessful (heap growth)
      • show allocator internals: sizes, options, flags
    • And run them in the production environment
      • briefly: tracing frequent allocations does cost overhead
    • Casting light into what was a black box

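The first of those, as a minimal sketch (ours, a simplification of the
production tooling, which the deck does not show): track net allocated bytes
by pairing malloc() sizes with the free()s of the same addresses:

    # dtrace -x dynvarsize=4m -n '
        pid$target::malloc:entry { self->size = arg0; }
        pid$target::malloc:return /self->size/ {
            sizes[arg1] = self->size;          /* remember size by address */
            @net = sum(self->size);
            self->size = 0;
        }
        pid$target::free:entry /sizes[arg0]/ {
            @net = sum(-(int64_t)sizes[arg0]); /* subtract on free */
            sizes[arg0] = 0;
        }' -p 17472
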
43. Memory growth, cont.

      4      <- vmem_xalloc                           0
      4      -> _sbrk_grow_aligned                    4096
      4      <- _sbrk_grow_aligned                    17155911680
      4    -> vmem_xalloc                             7356400
      4      | vmem_xalloc:entry                      umem_oversize
      4      -> vmem_alloc                            7356416
      4        -> vmem_xalloc                         7356416
      4          | vmem_xalloc:entry                  sbrk_heap
      4          -> vmem_sbrk_alloc                   7356416
      4            -> vmem_alloc                      7356416
      4              -> vmem_xalloc                   7356416
      4                | vmem_xalloc:entry            sbrk_top
      4                -> vmem_reap                   16777216
      4                <- vmem_reap                   3178535181209758
      4                | vmem_xalloc:return           vmem_xalloc() == NULL,
                         vm: sbrk_top, size: 7356416, align: 4096, phase: 0,
                         nocross: 0, min: 0, max: 0, vmflag: 1

              libumem.so.1`vmem_xalloc+0x80f
              libumem.so.1`vmem_sbrk_alloc+0x33
              libumem.so.1`vmem_xalloc+0x669
              libumem.so.1`vmem_alloc+0x14f
              libumem.so.1`vmem_xalloc+0x669
              libumem.so.1`vmem_alloc+0x14f
              libumem.so.1`umem_alloc+0x72
              libumem.so.1`malloc+0x59
              libstdc++.so.6.0.3`_Znwm+0x2b
              libstdc++.so.6.0.3`_ZNSs4_Rep9_S_createEmmRKSaIcE+0x7e

44. Memory growth, cont.
    • These new tools and metrics pointed to the allocation algorithm:
      "instant fit"
    • This had been hypothesized earlier; the tools provided solid evidence
      that this really was the case here
    • A new version of libumem was built to force use of VM_BESTFIT...
    • ...which was added by Robert Mustacchi as a tunable:
      UMEM_OPTIONS=allocator=best
    • Riak was restarted with the new libumem version, solving the problem

45. Memory growth, cont.
    • Not the first issue with the system memory allocator; depending on
      configuration, Riak may use libc's malloc(), which isn't designed to be
      scalable
      • the man page does say it isn't multi-thread scalable
    • libumem was the answer (with the fix)

46. Memory growth, cont.
    • The fragmentation problem was interesting because it was unusual; it is
      not the most common source of memory growth!
    • DIRTy systems are often event-oriented...
    • ...and in event-oriented systems, memory growth can be a consequence of
      either surging or drowning
    • In an interpreted environment, memory growth can also come from memory
      that is semantically leaked
    • Voxer, like many emerging DIRTy apps, has a substantial node.js
      component; how does one debug node.js memory growth?

47. Memory growth, cont.
    • We have developed a postmortem technique for making sense of a node.js
      heap:

    OBJECT   #OBJECTS #PROPS CONSTRUCTOR: PROPS
    fe806139        1      1 Object: Queue
    fc424131        1      1 Object: Credentials
    fc424091        1      1 Object: version
    fc4e3281        1      1 Object: message
    fc404f6d        1      1 Object: uncaughtException
    ...
    fafcb229     1007     23 ClientRequest: outputEncodings, _headerSent, ...
    fafc5e75     1034      5 Timing: req_start, res_end, res_bytes, req_end, ...
    fafcbecd     1037      3 Object: aborted, data, end
    8045475      1060      1 Object:
    fb0cee9d     1220      9 HTTPParser: socket, incoming, onHeadersComplete, ...
    fafc58d5     1271     25 Socket: _connectQueue, bytesRead, _httpMessage, ...
    fafc4335     1311     16 ServerResponse: outputEncodings, statusCode, ...

    • Used by @izs to debug a nasty node.js leak
    • Search for "findjsobjects" (one word) for details

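For reference, the workflow is postmortem, via mdb on a core file (a sketch
assuming SmartOS, where the V8 mdb module ships with the platform; <pid> is a
placeholder for the node process of interest):

    $ gcore <pid>
    $ mdb core.<pid>
    > ::load v8
    > ::findjsobjects
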
48. CPU scheduling
    • Problem: occasional latency outliers
    • Analysis: no smoking gun. No slow I/O or locks. Some random dispatcher
      queue latency, but with CPU headroom.

    $ prstat -mLc 1
       PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
     17930 103       21 7.6 0.0 0.0 0.0  53  16 9.1 57K   1 73K   0 beam.smp/265
     17930 103       20 7.0 0.0 0.0 0.0  57  16 0.4 57K   2 70K   0 beam.smp/264
     17930 103       20 7.4 0.0 0.0 0.0  53  18 1.7 63K   0 78K   0 beam.smp/263
     17930 103       19 6.7 0.0 0.0 0.0  60  14 0.4 52K   0 65K   0 beam.smp/266
     17930 103      2.0 0.7 0.0 0.0 0.0  96 1.6 0.0  6K   0  8K   0 beam.smp/267
     17930 103      1.0 0.9 0.0 0.0 0.0  97 0.9 0.0   4   0  47   0 beam.smp/280
    [...]

49. CPU scheduling, cont.
    • Unix 101
      [diagram: threads on a CPU run queue; R = ready to run, O = on-CPU;
      scheduler preemption returns the on-CPU thread to the run queue]

50. CPU scheduling, cont.
    • Unix 102
    • TS (and FSS) check for CPU starvation
      [diagram: a long run queue causing CPU starvation; priority promotion
      lifts a starved thread so that it can preempt the on-CPU thread]

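Run queue latency itself can be measured with the stable sched provider; a
minimal sketch (ours), timing each thread from enqueue to dequeue:

    # dtrace -n 'sched:::enqueue {
            ts[args[0]->pr_lwpid, args[1]->pr_pid] = timestamp; }
        sched:::dequeue /ts[args[0]->pr_lwpid, args[1]->pr_pid]/ {
            @["dispq latency (ns)"] = quantize(timestamp -
                ts[args[0]->pr_lwpid, args[1]->pr_pid]);
            ts[args[0]->pr_lwpid, args[1]->pr_pid] = 0;
        }'
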
51. CPU scheduling, cont.
    • Experimentation: run 2 CPU-bound threads, 1 CPU
    • Subsecond-offset heat maps:
      [heat map screenshots]

52. CPU scheduling, cont.
    • Experimentation: run 2 CPU-bound threads, 1 CPU
    • Subsecond-offset heat maps:
      [the same heat maps, annotated: THIS SHOULDN'T HAPPEN]

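These heat maps are built from sampled data; roughly how one might collect it
(a sketch, ours: sample on-CPU threads at 99 Hz and bucket each sample by its
10 ms offset within the second; "burn1" is the test workload named on the next
slide):

    # dtrace -n 'profile-99 /execname == "burn1"/ {
            @[tid, (timestamp % 1000000000) / 10000000] = count();
        }'
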
53. CPU scheduling, cont.
    • Worst case (4 threads, 1 CPU): 44 seconds of dispatcher queue latency!

    # dtrace -n 'sched:::off-cpu /execname == "burn1"/ { self->s = timestamp; }
        sched:::on-cpu /self->s/ {
            @["off-cpu (ms)"] = lquantize((timestamp - self->s) / 1000000,
                0, 100000, 1000);
            self->s = 0;
        }'

      off-cpu (ms)
           value  ------------- Distribution ------------- count
             < 0 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 387184   <- expected
            1000 |                                         2256
            2000 |                                         1078
            3000 |                                         862
            4000 |                                         1070     <- bad
            5000 |                                         637
            6000 |                                         535
            [...]
           41000 |                                         3        <- inconceivable
           42000 |                                         2
           43000 |                                         2
           44000 |                                         1
           45000 |                                         0

    • ts_maxwait at priority 59 is 32 seconds; what does FSS use?

54. CPU scheduling, cont.
    • Findings:
      • an FSS scheduler class bug: FSS uses a more complex technique to avoid
        CPU starvation, and a thread's priority could stay high, keeping it
        on-CPU for many seconds, before the priority decayed to allow another
        thread to run
      • analyzed (more DTrace) and fixed (thanks, Jerry Jelinek)
    • DTrace analysis of the scheduler was invaluable
    • Under (too) high CPU load, your runtime can be bound by how well you
      schedule, not by how fast you do work
    • Not the only scheduler issue we've encountered

55. CPU scheduling, cont.
    • CPU caps to throttle tenants in our cloud
    • Experiment: add hot-CPU threads (saturation):
      [heat map screenshot]

56. CPU scheduling, cont.
    • CPU caps to throttle tenants in our cloud
    • Experiment: add hot-CPU threads:
      [the same experiment, annotated: :-(]

57. Visualizing CPU latency
    • Using a node.js ustack helper and the DTrace profile provider, we can
      determine the relative frequency of stack backtraces in terms of CPU
      consumption
    • Stacks can be visualized with flame graphs, a stack visualization we
      developed:
      [flame graph screenshot]

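Roughly, the pipeline (a sketch; it assumes a node binary with the ustack
helper, stackcollapse.pl and flamegraph.pl from the FlameGraph tools, and
<PID> as a placeholder for the node process):

    # dtrace -n 'profile-97 /pid == $target/ { @[jstack(100, 8192)] = count(); }
        tick-60s { exit(0); }' -p <PID> -o out.stacks
    # stackcollapse.pl out.stacks | flamegraph.pl > node-cpu.svg
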
58. DIRT in production
    • node.js is particularly amenable to the DIRTy apps that typify the
      real-time web
    • The ability to understand latency must be considered when deploying
      node.js-based systems into production!
    • Understanding latency requires dynamic instrumentation and novel
      visualization
    • At Joyent, we have added DTrace-based dynamic instrumentation for
      node.js to SmartOS, and novel visualization to our cloud and software
      offerings
    • Better production support (better observability, better debuggability)
      remains an important area of node.js development!

59. Beyond node.js
    • node.js is adept at connecting components in the system; it is unlikely
      to be the only component!
    • As such, when using node.js to develop a DIRTy app, you can expect to
      spend as much time (if not more!) understanding the other components as
      the app itself
    • When selecting components (operating system, in-memory data store,
      database, distributed data store), observability must be a primary
      consideration!
    • When building a team, look for full-stack engineers; DIRTy apps pose a
      full-stack challenge!

60. Thank you!
    • @mranney for being an excellent guinea pig customer
    • @dapsays for the V8 DTrace ustack helper and V8 debugging support
    • More information: http://dtrace.org/blogs/brendan,
      http://dtrace.org/blogs/dap, and http://smartos.org
