Real-time in the real world: DIRT in production

My talk with @brendangregg at Surge 2012

1 Comment
9 Likes
Statistics
Notes
  • http://www.youtube.com/watch?v=IQkPoIXJsEo
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
No Downloads
Views
Total views
14,391
On SlideShare
0
From Embeds
0
Number of Embeds
3,626
Actions
Shares
0
Downloads
99
Comments
1
Likes
9
Embeds 0
No embeds

No notes for slide

Real-time in the real world: DIRT in production

  1. Real-time in the real world: DIRT in production
     Bryan Cantrill, SVP, Engineering (bryan@joyent.com, @bcantrill)
     Brendan Gregg, Lead Performance Engineer (brendan@joyent.com, @brendangregg)
  2. Previously, on #surgecon...
     • Two years ago at Surge, we described the emergence of real-time data semantics in web-facing applications
     • We dubbed this data-intensive real-time (DIRT)
     • Last year at Surge 2011, we presented experiences building a DIRTy system of our own — a facility for real-time analytics of latency in the cloud
     • While this system is interesting, it is somewhat synthetic in nature in that it does not need to scale (much) with respect to users...
  3. #surgecon 2012
     • Accelerated by the rise of mobile applications, DIRTy systems are becoming increasingly common
     • In the past year, we've seen apps in production at scale
     • There are many examples of this, but for us, a paragon of the emerging DIRTy apps has been Voxer
     • Voxer is a push-to-talk mobile app that can be thought of as the confluence of voice mail and SMS
     • A canonical DIRTy app: latency and scale both matter!
     • Our experiences debugging latency bubbles with Voxer over the past year have taught us quite a bit about the new challenges that DIRTy apps pose...
  4. The challenge of DIRTy apps
     • DIRTy applications tend to have the human in the loop
     • Good news: deadlines are soft — microseconds only matter when they add up to tens of milliseconds
     • Bad news: because humans are in the loop, demand for the system can be non-linear
     • One must deal not only with the traditional challenge of scalability, but also the challenge of a real-time system
     • Worse, emerging DIRTy apps have mobile devices at their edge — network transience makes clients seem ill-behaved with respect to connection state!
  5. The lessons of DIRTy apps
     • Many latency bubbles originate deep in the stack; OS understanding and instrumentation have been essential even when the OS is not at fault
     • For up-stack problems, tooling has been essential
     • Latency outliers can come from many sources: application restarts, dropped connections, slow disks, boundless memory growth
     • We have also seen some traditional real-time problems with respect to CPU scheduling, e.g. priority inversions
     • Enough foreplay; on with the DIRTy disaster pr0n!
  6. Application restarts
     • Modern internet-facing architectures are designed to be resilient with respect to many failure modes...
     • ...but application restarts can induce pathological, cascading latency bubbles, as clients reconnect, clusters reconverge, etc.
     • For example, Voxer ran into a node.js bug where it would terminate on ECONNABORTED from accept(2) (see the sketch below)
     • Classic difference in OS semantics: BSD and illumos variants (including SmartOS) do this; Linux doesn't
     • Much more likely over a transient network!
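     • A minimal sketch (not from the deck) of confirming this behavior with DTrace before it bites: watch accept(2) returning ECONNABORTED in the application. The process name "node" and the errno value 130 (ECONNABORTED on illumos/SmartOS; check sys/errno.h) are assumptions here.

           /* Count accept(2) failures with ECONNABORTED, per process, every 10s. */
           syscall::accept:return
           /execname == "node" && errno == 130/   /* 130 == ECONNABORTED (illumos) */
           {
                   @aborted[pid] = count();
           }

           tick-10s
           {
                   printa("pid %d: %@d accept(2) ECONNABORTED errors\n", @aborted);
                   trunc(@aborted);
           }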
  7. Dropped connections
     • If an application can't keep up with its TCP backlog, packets (SYNs) are dropped:

       $ netstat -s | grep Drop
       tcpTimRetransDrop   =    56      tcpTimKeepalive     =  2582
       tcpTimKeepaliveProbe=  1594      tcpTimKeepaliveDrop =    41
       tcpListenDrop       =3089298     tcpListenDropQ0     =     0
       tcpHalfOpenDrop     =     0      tcpOutSackRetrans   =1400832
       icmpOutDrops        =     0      icmpOutErrors       =     0
       sctpTimRetrans      =     0      sctpTimRetransDrop  =     0
       sctpTimHearBeatProbe=     0      sctpTimHearBeatDrop =     0
       sctpListenDrop      =     0      sctpInClosed        =     0

     • The client waits, then retransmits (after 1 or 3 seconds), inducing tremendous latency outliers; terrible for DIRTy apps!
  8. Dropped connections, cont.
     • The fix for dropped connections:
       • If due to a surge, increase the TCP backlog
       • If due to sustained load, increase CPU resources, decrease CPU consumption, or scale the app
     • If fixed by increasing the TCP backlog, check that the system backlog tunable took effect!
       • If not, does the app need to be restarted?
       • If not, is the application providing its own backlog that is taking precedence?
     • How close are we to dropping? (see the sketch below)
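     • A minimal sketch (not from the deck), using the illumos mib provider's tcpListenDrop probe (the same probe used by the scripts on the following slides), to watch how often the system is dropping right now:

           /* Report TCP listen drops per second. */
           mib:::tcpListenDrop
           {
                   @drops = count();
           }

           tick-1sec
           {
                   printa("tcpListenDrop/s: %@d\n", @drops);
                   clear(@drops);
           }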
  9. Dropped connections, cont.
     • Networking 101
       [Diagram: SYNs arrive at TCP and queue on the listen backlog until the app calls accept(); once the backlog reaches its max, further SYNs are dropped (listen drop)]
 10. Dropped connections, cont.
     The kernel code (usr/src/uts/common/inet/tcp/tcp_input.c):

     /*
      * THIS FUNCTION IS DIRECTLY CALLED BY IP VIA SQUEUE FOR SYN.
      * tcp_input_data will not see any packets for listeners since the listener
      * has conn_recv set to tcp_input_listener.
      */
     /* ARGSUSED */
     static void
     tcp_input_listener(void *arg, mblk_t *mp, void *arg2, ip_recv_attr_t *ira)
     {
     [...]
             if (listener->tcp_conn_req_cnt_q >= listener->tcp_conn_req_max) {
                     mutex_exit(&listener->tcp_eager_lock);
                     TCP_STAT(tcps, tcp_listendrop);
                     TCPS_BUMP_MIB(tcps, tcpListenDrop);
                     if (lconnp->conn_debug) {
                             (void) strlog(TCP_MOD_ID, 0, 1, SL_TRACE|SL_ERROR,
                                 "tcp_input_listener: listen backlog (max=%d) "
                                 "overflow (%d pending) on %s",
                                 listener->tcp_conn_req_max,
                                 listener->tcp_conn_req_cnt_q,
                                 tcp_display(listener, NULL, DISP_PORT_ONLY));
                     }
                     goto error2;
             }
     [...]
 11. Dropped connections, cont.
     SEE ALL THE THINGS!

     tcp_conn_req_cnt_q distributions:

       cpid:3063   max_q:8
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
               1 |                                         0

       cpid:11504  max_q:128
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     7279
               1 |@@                                       405
               2 |@                                        255
               4 |@                                        138
               8 |                                         81
              16 |                                         83
              32 |                                         62
              64 |                                         67
             128 |                                         34
             256 |                                         0

     tcpListenDrops:
       cpid:11504  max_q:128          34
 12. Dropped connections, cont.
     • Uses DTrace to get a distribution of the TCP backlog queue length on SYN; max_q is the backlog limit (tcp_conn_req_max), per process:

     fbt::tcp_input_listener:entry
     {
         this->connp = (conn_t *)arg0;
         this->tcp = (tcp_t *)this->connp->conn_proto_priv.cp_tcp;
         self->max = strjoin("max_q:",
             lltostr(this->tcp->tcp_conn_req_max));
         self->pid = strjoin("cpid:", lltostr(this->connp->conn_cpid));
         @[self->pid, self->max] = quantize(this->tcp->tcp_conn_req_cnt_q);
     }

     mib:::tcpListenDrop
     {
         this->max = self->max;
         this->pid = self->pid;
         this->max != NULL ? this->max : "<null>";
         this->pid != NULL ? this->pid : "<null>";
         @drops[this->pid, this->max] = count();
         printf("%Y %s:%s %s\n", walltimestamp, probefunc, probename,
             this->pid);
     }

     • Script is on http://github.com/brendangregg/dtrace-cloud-tools as net/tcpconnreqmaxq-pid*.d
 13. Dropped connections, cont.
     Or, snoop each drop:

     # ./tcplistendrop.d
     TIME                  SRC-IP          PORT          DST-IP           PORT
     2012 Jan 19 01:22:49  10.17.210.103   25691   ->    192.192.240.212  80
     2012 Jan 19 01:22:49  10.17.210.108   18423   ->    192.192.240.212  80
     2012 Jan 19 01:22:49  10.17.210.116   38883   ->    192.192.240.212  80
     2012 Jan 19 01:22:49  10.17.210.117   10739   ->    192.192.240.212  80
     2012 Jan 19 01:22:49  10.17.210.112   27988   ->    192.192.240.212  80
     2012 Jan 19 01:22:49  10.17.210.106   28824   ->    192.192.240.212  80
     2012 Jan 19 01:22:49  10.12.143.16    65070   ->    192.192.240.212  80
     2012 Jan 19 01:22:49  10.17.210.100   56392   ->    192.192.240.212  80
     2012 Jan 19 01:22:49  10.17.210.99    24628   ->    192.192.240.212  80
     2012 Jan 19 01:22:49  10.17.210.98    11686   ->    192.192.240.212  80
     2012 Jan 19 01:22:49  10.17.210.101   34629   ->    192.192.240.212  80
     [...]
 14. Dropped connections, cont.
     • That code parsed the IP and TCP headers from the in-kernel packet buffer:

     fbt::tcp_input_listener:entry  { self->mp = args[1]; }
     fbt::tcp_input_listener:return { self->mp = 0; }

     mib:::tcpListenDrop
     /self->mp/
     {
         this->iph = (ipha_t *)self->mp->b_rptr;
         this->tcph = (tcph_t *)(self->mp->b_rptr + 20);
         printf("%-20Y %-18s %-5d -> %-18s %-5d\n", walltimestamp,
             inet_ntoa(&this->iph->ipha_src),
             ntohs(*(uint16_t *)this->tcph->th_lport),
             inet_ntoa(&this->iph->ipha_dst),
             ntohs(*(uint16_t *)this->tcph->th_fport));
     }

     • Script is tcplistendrop*.d, also on github
 15. Dropped connections, cont.
     • To summarize: dropped connections induce acute latency bubbles
     • With Voxer, we found that failures often cascaded: high CPU utilization due to unrelated issues would induce TCP listen drops
     • Tunables don't always take effect: confirmation is needed
     • Having a quick tool to check scalability issues (DTrace) has been invaluable
 16. Slow disks
     • Slow I/O in a cloud computing environment can be caused by multi-tenancy — which is to say, neighbors:
       • Neighbor running a backup
       • Neighbor running a benchmark
     • Neighbors can't be seen by tenants...
     • ...but is it really a neighbor?
 17. Slow disks, cont.
     • Unix 101
       [Diagram: the I/O stack — Process, Syscall Interface, VFS, ZFS (among other filesystems), Block Device Interface, Disks]
 18. Slow disks, cont.
     • Unix 101
       [Diagram: the same I/O stack — the syscall/VFS level is synchronous with the application, while the block device level measured by iostat(1) is often asynchronous: write buffering, read ahead]
 19. Slow disks, cont.
     • VFS-level iostat: vfsstat

     # vfsstat -Z 1
       r/s   w/s  kr/s  kw/s ractv wactv read_t writ_t  %r  %w   d/s  del_t zone
       1.2   2.8   0.6   0.2   0.0   0.0    0.0    0.0   0   0   0.0    0.0 global (0)
       0.1   0.0   0.1   0.0   0.0   0.0    0.0    0.0   0   0   0.0   34.9 9cc2d0d3 (2)
       0.1   0.0   0.1   0.0   0.0   0.0    0.0    0.0   0   0   0.0   46.5 72188ca0 (3)
       0.0   0.0   0.0   0.0   0.0   0.0    0.0    0.0   0   0   0.0   16.5 4d2a62bb (4)
       0.3   0.1   0.1   0.3   0.0   0.0    0.0    0.0   0   0   0.0   27.6 8bbc4000 (5)
       5.9   0.2   0.5   0.1   0.0   0.0    0.0    0.0   0   0   5.0   11.3 d305ee44 (6)
       0.1   0.0   0.1   0.0   0.0   0.0    0.0    0.0   0   0   0.0  132.0 9897c8f5 (7)
       0.1   0.0   0.1   0.0   0.0   0.0    0.0    0.1   0   0   0.0   40.7 5f3c7d9e (9)
       0.2   0.8   0.5   0.6   0.0   0.0    0.0    0.0   0   0   0.0   31.9 22ef87fc (10)

     • Kernel changes, new kstats (thanks Bill Pijewski)
 20. Slow disks, cont.
     • zfsslower.d:

     # ./zfsslower.d 10
     TIME                  PROCESS  D       B ms FILE
     2012 Sep 27 13:45:33  zlogin   W     372 11 /zones/b8b2464c/var/adm/wtmpx
     2012 Sep 27 13:45:36  bash     R       8 14 /zones/b8b2464c/opt/local/bin/zsh
     2012 Sep 27 13:45:58  mysqld   R 1048576 19 /zones/b8b2464c/var/mysql/ibdata1
     2012 Sep 27 13:45:58  mysqld   R 1048576 22 /zones/b8b2464c/var/mysql/ibdata1
     2012 Sep 27 13:46:14  master   R       8  6 /zones/b8b2464c/root/opt/local/libexec/postfix/qmgr
     2012 Sep 27 13:46:14  master   R    4096  5 /zones/b8b2464c/root/opt/local/etc/postfix/master.cf
     [...]

     • Go-to tool. Are there VFS-level I/Os slower than 10 ms (the script's argument)?
     • Stupidly easy to do
 21. Slow disks, cont.
     • Written in DTrace:

     [...]
     fbt::zfs_read:entry,
     fbt::zfs_write:entry
     {
         self->path = args[0]->v_path;
         self->kb = args[1]->uio_resid / 1024;
         self->start = timestamp;
     }

     fbt::zfs_read:return,
     fbt::zfs_write:return
     /self->start && (timestamp - self->start) >= min_ns/
     {
         this->iotime = (timestamp - self->start) / 1000000;
         this->dir = probefunc == "zfs_read" ? "R" : "W";
         printf("%-20Y %-16s %1s %4d %6d %s\n", walltimestamp,
             execname, this->dir, self->kb, this->iotime,
             self->path != NULL ? stringof(self->path) : "<null>");
     }
     [...]

     • zfsslower.d, also on github, originated from the DTrace book
 22. Slow disks, cont.
     • Traces the VFS/ZFS interface (kernel), from usr/src/uts/common/fs/zfs/zfs_vnops.c:

     /*
      * Regular file vnode operations template
      */
     vnodeops_t *zfs_fvnodeops;
     const fs_operation_def_t zfs_fvnodeops_template[] = {
         VOPNAME_OPEN,       { .vop_open = zfs_open },
         VOPNAME_CLOSE,      { .vop_close = zfs_close },
         VOPNAME_READ,       { .vop_read = zfs_read },
         VOPNAME_WRITE,      { .vop_write = zfs_write },
         VOPNAME_IOCTL,      { .vop_ioctl = zfs_ioctl },
         VOPNAME_GETATTR,    { .vop_getattr = zfs_getattr },
     [...]
 23. Slow disks, cont.
     • Unix 101
       [Diagram: the same I/O stack — zfsslower.d traces at the VFS/ZFS level, iosnoop traces at the Block Device Interface; correlate the two]
 24. Slow disks, cont.
     • Correlating the layers narrows down where the latency originates
       • Or you can associate both layers in the same D script (see the sketch below)
     • Via text, filtering on slow I/O, this works fine
     • For high-frequency I/O, use heat maps
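     • A minimal sketch (not the deck's script) of the "associate in the same D script" idea: time the VFS/ZFS calls and the block-device I/O in one script and compare the two latency distributions. Assumes the illumos fbt and io providers; run briefly, since fbt tracing of hot functions adds overhead.

           fbt::zfs_read:entry, fbt::zfs_write:entry
           {
                   self->vstart = timestamp;
           }

           fbt::zfs_read:return, fbt::zfs_write:return
           /self->vstart/
           {
                   /* VFS/ZFS-level latency, in milliseconds */
                   @["VFS level (ms)", probefunc] =
                       quantize((timestamp - self->vstart) / 1000000);
                   self->vstart = 0;
           }

           io:::start
           {
                   start[arg0] = timestamp;          /* keyed by buf pointer */
           }

           io:::done
           /start[arg0]/
           {
                   /* block-device-level latency, in milliseconds */
                   @["block level (ms)", args[1]->dev_statname] =
                       quantize((timestamp - start[arg0]) / 1000000);
                   start[arg0] = 0;
           }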
 25. Slow disks, cont.
     • WHAT DOES IT MEAN?
 26. Slow disks, cont.
     • Latency outliers:
       [Latency heat map]
 27. Slow disks, cont.
     • Latency outliers:
       [Latency heat map, annotated from top to bottom: Inconceivable, Very Bad, Bad, Good]
 28. Slow disks, cont.
     • Inconceivably bad, 1000+ ms VFS-level latency:
       • Queueing behind large ZFS SPA syncs (tunable)
       • Other tenants benchmarking (before we added I/O throttling to SmartOS)
       • Reads queueing behind writes; needed to tune ZFS and the LSI PERC (shakes fist!)
       [Heat maps: latency vs. time (s); read = red, write = blue]
 29. Slow disks, cont.
     • Deeper tools rolled as needed, anywhere in ZFS:

     # dtrace -n 'io:::start { @[stack()] = count(); }'
     dtrace: description 'io:::start' matched 6 probes
     ^C
               genunix`ldi_strategy+0x53
               zfs`vdev_disk_io_start+0xcc
               zfs`zio_vdev_io_start+0xab
               zfs`zio_execute+0x88
               zfs`zio_nowait+0x21
               zfs`vdev_mirror_io_start+0xcd
               zfs`zio_vdev_io_start+0x250
               zfs`zio_execute+0x88
               zfs`zio_nowait+0x21
               zfs`arc_read_nolock+0x4f9
               zfs`arc_read+0x96
               zfs`dsl_read+0x44
               zfs`dbuf_read_impl+0x166
               zfs`dbuf_read+0xab
               zfs`dmu_buf_hold_array_by_dnode+0x189
               zfs`dmu_buf_hold_array+0x78
               zfs`dmu_read_uio+0x5c
               zfs`zfs_read+0x1a3
               genunix`fop_read+0x8b
               genunix`read+0x2a7
                 143
 30. Slow disks, cont.
     • On Joyent's IaaS architecture, it's usually not the disks or the filesystem; it is useful to rule that out quickly
     • Some of the time it is, due to bad disks (1000+ ms I/O); the heat map or iosnoop correlation matches
     • Some of the time it's due to big I/O (how quick is a 40 Mbyte read from cache? see the sketch below)
     • Some of the time it is other tenants (benchmarking!); much less for us now with ZFS I/O throttling
     • With ZFS and an SSD-based intent log, HW RAID is not just unobservable, but entirely unnecessary — adios PERC!
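     • A minimal sketch (not the deck's script) of one way to check whether slow VFS-level reads are simply big reads: bucket zfs_read() latency by request size. It reuses the zfs_read(vnode, uio, ...) argument access from zfsslower.d above; the 128 KB cutoff is an arbitrary assumption.

           fbt::zfs_read:entry
           {
                   self->kb = args[1]->uio_resid / 1024;   /* request size in KB */
                   self->start = timestamp;
           }

           fbt::zfs_read:return
           /self->start/
           {
                   /* latency distribution in microseconds, keyed by size bucket */
                   @[self->kb < 128 ? "zfs_read < 128 KB (us)" :
                       "zfs_read >= 128 KB (us)"] =
                       quantize((timestamp - self->start) / 1000);
                   self->kb = 0;
                   self->start = 0;
           }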
 31. Memory growth
     • Riak had endless memory growth
     • Expected 9 GB; after two days:

     $ prstat -c 1
     Please wait...
        PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
      21722 103        43G   40G cpu0    59    0  72:23:41 2.6% beam.smp/594
      15770 root     7760K  540K sleep   57    0  23:28:57 0.9% zoneadmd/5
         95 root        0K    0K sleep   99  -20   7:37:47 0.2% zpool-zones/166
      12827 root      128M   73M sleep  100    -   0:49:36 0.1% node/5
      10319 bgregg     10M 6788K sleep   59    0   0:00:00 0.0% sshd/1
      10402 root       22M  288K sleep   59    0   0:18:45 0.0% dtrace/1
     [...]

     • Eventually the system hits paging and terrible performance, needing a restart (see the sketch below)
     • Remember, application restarts are a latency disaster!
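     • A minimal sketch (not from the deck), using the illumos vminfo provider, to confirm when that growth starts causing anonymous page-ins and to attribute them to a process:

           /* Count anonymous page-ins (heap being paged back in) by process. */
           vminfo:::anonpgin
           {
                   @pgins[execname] = count();
           }

           tick-10s
           {
                   printa("%-16s %@d anon page-ins\n", @pgins);
                   clear(@pgins);
           }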
 32. Memory growth, cont.
     • What is in the heap?

     $ pmap 14719
     14719:  beam.smp
     0000000000400000    2168K r-x--  /opt/riak/erts-5.8.5/bin/beam.smp
     000000000062D000     328K rw---  /opt/riak/erts-5.8.5/bin/beam.smp
     000000000067F000 4193540K rw---  /opt/riak/erts-5.8.5/bin/beam.smp
     00000001005C0000 4194296K rw---    [ anon ]
     00000002005BE000 4192016K rw---    [ anon ]
     0000000300382000 4193664K rw---    [ anon ]
     00000004002E2000 4191172K rw---    [ anon ]
     00000004FFFD3000 4194040K rw---    [ anon ]
     00000005FFF91000 4194028K rw---    [ anon ]
     00000006FFF4C000 4188812K rw---    [ anon ]
     00000007FF9EF000  588224K rw---    [ heap ]
     [...]

     • ...and why does it keep growing?
 33. Memory growth, cont.
     • Is this a memory leak?
       • In the app logic: Voxer?
       • In the DB logic: Riak?
       • In the DB's Erlang VM?
       • In the OS libraries (libc, lib*)?
       • In the OS kernel?
     • Or application growth?
     • Where would you guess?
       [Diagram: the software stack — Voxer / Riak / Erlang VM / libc, lib* / kernel — with a "?" at each layer]
 34. Memory growth, cont.
     • Voxer (App): don't think it's us
     • Basho (Riak): don't think it's us
     • Joyent (OS): don't think it's us
     • This sort of issue is usually app growth...
     • ...but we can check the libraries & kernel to be sure
 35. Memory growth, cont.
     • libumem was in use for allocations
       • fast, scalable, object-caching, multi-threaded support
       • user-land version of kmem (slab allocator, Bonwick)
 36. Memory growth, cont.
     • Fixing it by experimentation (backend=mmap, other allocators) wasn't working
     • Detailed observability can be enabled in libumem, allowing heap profiling and leak detection
     • While designed with speed and production use in mind, it still comes with some cost (time and space), and isn't on by default: a restart is required
     • UMEM_DEBUG=audit
 37. Memory growth, cont.
     • libumem provides some default observability, e.g. slabs:

     > ::umem_malloc_info
                CACHE  BUFSZ MAXMAL BUFMALLC AVG_MAL  MALLOCED  OVERHEAD   %OVER
     0000000000707028      8      0        0       0         0         0    0.0%
     000000000070b028     16      8     8730       8     69836   1054998 1510.6%
     000000000070c028     32     16     8772      16    140352   1130491  805.4%
     000000000070f028     48     32  1148038      25  29127788 156179051  536.1%
     0000000000710028     64     48   344138      40  13765658  58417287  424.3%
     0000000000711028     80     64       36      62      2226      4806  215.9%
     0000000000714028     96     80     8934      79    705348   1168558  165.6%
     0000000000715028    112     96  1347040      87 117120208 190389780  162.5%
     0000000000718028    128    112   253107     111  28011923  42279506  150.9%
     000000000071a028    160    144    40529     118   4788681   6466801  135.0%
     000000000071b028    192    176      140     155     21712     25818  118.9%
     000000000071e028    224    208       43     188      8101      6497   80.1%
     000000000071f028    256    240      133     229     30447     26211   86.0%
     0000000000720028    320    304       56     276     15455     12276   79.4%
     0000000000723028    384    368       35     335     11726      7220   61.5%
     [...]
 38. Memory growth, cont.
     • ...and the heap (captured @ 14 GB RSS):

     > ::vmem
     ADDR             NAME                   INUSE       TOTAL  SUCCEED  FAIL
     fffffd7ffebed4a0 sbrk_top          9090404352 14240165888  4298117 84403
     fffffd7ffebee0a8 sbrk_heap         9090404352  9090404352  4298117     0
     fffffd7ffebeecb0 vmem_internal      664616960   664616960    79621     0
     fffffd7ffebef8b8 vmem_seg           651993088   651993088    79589     0
     fffffd7ffebf04c0 vmem_hash           12583424    12587008       27     0
     fffffd7ffebf10c8 vmem_vmem              46200       55344       15     0
     00000000006e7000 umem_internal      352862464   352866304    88746     0
     00000000006e8000 umem_cache            113696      180224       44     0
     00000000006e9000 umem_hash           13091328    13099008       86     0
     00000000006ea000 umem_log                   0           0        0     0
     00000000006eb000 umem_firewall_va           0           0        0     0
     00000000006ec000 umem_firewall              0           0        0     0
     00000000006ed000 umem_oversize     5218777974  5520789504  3822051     0
     00000000006f0000 umem_memalign              0           0        0     0
     0000000000706000 umem_default      2552131584  2552131584   307699     0

     • The heap is 9 GB (as expected), but the sbrk_top total is 14 GB (equal to the RSS). And growing.
     • Are there Gbyte-sized malloc()/free()s?
 39. Memory growth, cont.
     # dtrace -n 'pid$target::malloc:entry { @ = quantize(arg0); }' -p 17472
     dtrace: description 'pid$target::malloc:entry' matched 3 probes
     ^C
                value  ------------- Distribution ------------- count
                    2 |                                         0
                    4 |                                         3
                    8 |@                                        5927
                   16 |@@@@                                     41818
                   32 |@@@@@@@@@                                81991
                   64 |@@@@@@@@@@@@@@@@@@                       169888
                  128 |@@@@@@@                                  69891
                  256 |                                         2257
                  512 |                                         406
                 1024 |                                         893
                 2048 |                                         146
                 4096 |                                         1467
                 8192 |                                         755
                16384 |                                         950
                32768 |                                         83
                65536 |                                         31
               131072 |                                         11
               262144 |                                         15
               524288 |                                         0
              1048576 |                                         1
              2097152 |                                         0

     • No huge malloc()s, but RSS continues to climb.
 40. Memory growth, cont.
     • Tracing why the heap grows via brk():

     # dtrace -n 'syscall::brk:entry /execname == "beam.smp"/ { ustack(); }'
     dtrace: description 'syscall::brk:entry' matched 1 probe
     CPU     ID                    FUNCTION:NAME
      10     18                        brk:entry
                   libc.so.1`_brk_unlocked+0xa
                   libumem.so.1`vmem_sbrk_alloc+0x84
                   libumem.so.1`vmem_xalloc+0x669
                   libumem.so.1`vmem_alloc+0x14f
                   libumem.so.1`vmem_xalloc+0x669
                   libumem.so.1`vmem_alloc+0x14f
                   libumem.so.1`umem_alloc+0x72
                   libumem.so.1`malloc+0x59
                   libstdc++.so.6.0.14`_Znwm+0x20
                   libstdc++.so.6.0.14`_Znam+0x9
                   eleveldb.so`_ZN7leveldb9ReadBlockEPNS_16RandomAccessFileERKNS_11Rea...
                   eleveldb.so`_ZN7leveldb5Table11BlockReaderEPvRKNS_11ReadOptionsERKN...
                   eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIterator13InitDataBl...
                   eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIterator4SeekERKNS_5...
                   eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIterator4SeekERKNS_5...
                   eleveldb.so`_ZN7leveldb12_GLOBAL__N_115MergingIterator4SeekERKNS_5S...
                   eleveldb.so`_ZN7leveldb12_GLOBAL__N_16DBIter4SeekERKNS_5SliceE+0xcc
                   eleveldb.so`eleveldb_get+0xd3
                   beam.smp`process_main+0x6939
                   beam.smp`sched_thread_func+0x1cf
                   beam.smp`thr_wrapper+0xbe
 41. Memory growth, cont.
     • More DTrace showed the size of the malloc()s causing the brk()s:

     # dtrace -x dynvarsize=4m -n '
         pid$target::malloc:entry { self->size = arg0; }
         syscall::brk:entry /self->size/ { printf("%d bytes", self->size); }
         pid$target::malloc:return { self->size = 0; }' -p 17472
     dtrace: description 'pid$target::malloc:entry' matched 7 probes
     CPU     ID                    FUNCTION:NAME
       0     44                        brk:entry 8343520 bytes
       0     44                        brk:entry 8343520 bytes
     [...]

     • These 8 Mbyte malloc()s grew the heap
     • Even though the heap has Gbytes not in use
     • This is starting to look like an OS issue
 42. Memory growth, cont.
     • More tools were created:
       • Show memory entropy (+ malloc, - free) along with heap growth, over time (see the sketch below)
       • Show the code path taken for allocations; compare successful with unsuccessful (heap growth)
       • Show allocator internals: sizes, options, flags
     • And run in the production environment
       • Briefly. Tracing frequent allocations does cost overhead.
     • Casting light into what was a black box
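     • A minimal sketch of the "memory entropy" idea (not the deck's tool; assumes a -p PID invocation and enough dynvarsize to track outstanding allocations, so run it briefly): sum bytes malloc()ed minus bytes free()d and count brk() calls each second. If entropy stays roughly flat while brk() keeps firing, the heap is growing without a matching rise in live allocations.

           pid$target::malloc:entry  { self->size = arg0; }

           pid$target::malloc:return
           /self->size/
           {
                   allocsize[arg1] = self->size;     /* remember size by address */
                   @entropy = sum(self->size);
                   self->size = 0;
           }

           pid$target::free:entry
           /allocsize[arg0]/
           {
                   @entropy = sum(-allocsize[arg0]); /* subtract freed bytes */
                   allocsize[arg0] = 0;
           }

           syscall::brk:entry
           /pid == $target/
           {
                   @brks = count();
           }

           tick-1sec
           {
                   printa("entropy (bytes): %@d   brk() calls: %@d\n",
                       @entropy, @brks);
                   clear(@brks);
           }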
 43. Memory growth, cont.
     4  <- vmem_xalloc              0
     4  -> _sbrk_grow_aligned       4096
     4  <- _sbrk_grow_aligned       17155911680
     4  -> vmem_xalloc              7356400
     4  | vmem_xalloc:entry         umem_oversize
     4  -> vmem_alloc               7356416
     4  -> vmem_xalloc              7356416
     4  | vmem_xalloc:entry         sbrk_heap
     4  -> vmem_sbrk_alloc          7356416
     4  -> vmem_alloc               7356416
     4  -> vmem_xalloc              7356416
     4  | vmem_xalloc:entry         sbrk_top
     4  -> vmem_reap                16777216
     4  <- vmem_reap                3178535181209758
     4  | vmem_xalloc:return        vmem_xalloc() == NULL, vm:sbrk_top,
        size: 7356416, align: 4096, phase: 0, nocross: 0, min: 0, max: 0,
        vmflag: 1
              libumem.so.1`vmem_xalloc+0x80f
              libumem.so.1`vmem_sbrk_alloc+0x33
              libumem.so.1`vmem_xalloc+0x669
              libumem.so.1`vmem_alloc+0x14f
              libumem.so.1`vmem_xalloc+0x669
              libumem.so.1`vmem_alloc+0x14f
              libumem.so.1`umem_alloc+0x72
              libumem.so.1`malloc+0x59
              libstdc++.so.6.0.3`_Znwm+0x2b
              libstdc++.so.6.0.3`_ZNSs4_Rep9_S_createEmmRKSaIcE+0x7e
 44. Memory growth, cont.
     • These new tools and metrics pointed to the allocation algorithm, "instant fit"
     • This had been hypothesized earlier; the tools provided solid evidence that this really was the case here
     • A new version of libumem was built to force use of VM_BESTFIT
     • ...and added by Robert Mustacchi as a tunable: UMEM_OPTIONS=allocator=best
     • Riak was restarted with the new libumem version, solving the problem
 45. Memory growth, cont.
     • Not the first issue with the system memory allocator; depending on configuration, Riak may use libc's malloc(), which isn't designed to be scalable
       • the man page does say it isn't multi-thread scalable
     • libumem was the answer (with the fix)
 46. Memory growth, cont.
     • The fragmentation problem was interesting because it was unusual; it is not the most common source of memory growth!
     • DIRTy systems are often event-oriented...
     • ...in event-oriented systems, memory growth can be a consequence of either surging or drowning
     • In an interpreted environment, memory growth can also come from memory that is semantically leaked
     • Voxer — like many emerging DIRTy apps — has a substantial node.js component; how to debug node.js memory growth?
 47. Memory growth, cont.
     • We have developed a postmortem technique for making sense of a node.js heap:

     OBJECT   #OBJECTS #PROPS CONSTRUCTOR: PROPS
     fe806139        1      1 Object: Queue
     fc424131        1      1 Object: Credentials
     fc424091        1      1 Object: version
     fc4e3281        1      1 Object: message
     fc404f6d        1      1 Object: uncaughtException
     ...
     fafcb229     1007     23 ClientRequest: outputEncodings, _headerSent, ...
     fafc5e75     1034      5 Timing: req_start, res_end, res_bytes, req_end, ...
     fafcbecd     1037      3 Object: aborted, data, end
      8045475     1060      1 Object:
     fb0cee9d     1220      9 HTTPParser: socket, incoming, onHeadersComplete, ...
     fafc58d5     1271     25 Socket: _connectQueue, bytesRead, _httpMessage, ...
     fafc4335     1311     16 ServerResponse: outputEncodings, statusCode, ...

     • Used by @izs to debug a nasty node.js leak
     • Search for "findjsobjects" (one word) for details
 48. CPU scheduling
     • Problem: occasional latency outliers
     • Analysis: no smoking gun. No slow I/O or locks. Some random dispatcher queue latency, but with CPU headroom.

     $ prstat -mLc 1
        PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
      17930 103       21 7.6 0.0 0.0 0.0  53  16 9.1 57K   1 73K   0 beam.smp/265
      17930 103       20 7.0 0.0 0.0 0.0  57  16 0.4 57K   2 70K   0 beam.smp/264
      17930 103       20 7.4 0.0 0.0 0.0  53  18 1.7 63K   0 78K   0 beam.smp/263
      17930 103       19 6.7 0.0 0.0 0.0  60  14 0.4 52K   0 65K   0 beam.smp/266
      17930 103      2.0 0.7 0.0 0.0 0.0  96 1.6 0.0  6K   0  8K   0 beam.smp/267
      17930 103      1.0 0.9 0.0 0.0 0.0  97 0.9 0.0   4   0  47   0 beam.smp/280
     [...]
 49. CPU scheduling, cont.
     • Unix 101
       [Diagram: threads on a CPU run queue — R = ready to run, O = on-CPU; scheduler preemption takes the on-CPU thread off so a queued thread can run]
 50. CPU scheduler, cont.
     • Unix 102
     • TS (and FSS) check for CPU starvation
       [Diagram: a long CPU run queue of ready threads behind the on-CPU thread; starving threads receive a priority promotion]
 51. CPU scheduling, cont.
     • Experimentation: run 2 CPU-bound threads on 1 CPU
     • Subsecond-offset heat maps:
 52. CPU scheduling, cont.
     • Experimentation: run 2 CPU-bound threads on 1 CPU
     • Subsecond-offset heat maps (gathered as in the sketch below):
       [Heat map annotated: THIS SHOULDN'T HAPPEN]
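     • A minimal sketch (not the deck's tooling) of gathering subsecond-offset data with the DTrace profile provider, assuming the CPU-bound test threads run as "burn1" as in the worst case on the next slide:

           /*
            * For each wall-clock second, record the 10 ms offset within that
            * second at which a burn1 thread was sampled on-CPU.
            */
           profile-99
           /execname == "burn1"/
           {
                   @samples[walltimestamp / 1000000000,
                       (walltimestamp % 1000000000) / 10000000 * 10] = count();
           }

           END
           {
                   /* one row per (second, offset-in-ms); feed this to a plotter */
                   printa("%d %d %@d\n", @samples);
           }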
 53. CPU scheduling, cont.
     • Worst case (4 threads, 1 CPU): 44 seconds of dispatcher queue latency

     # dtrace -n '
         sched:::off-cpu /execname == "burn1"/ { self->s = timestamp; }
         sched:::on-cpu /self->s/ {
             @["off-cpu (ms)"] = lquantize((timestamp - self->s) / 1000000,
                 0, 100000, 1000);
             self->s = 0;
         }'

       off-cpu (ms)
                value  ------------- Distribution ------------- count
                  < 0 |                                         0
                    0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 387184    <- Expected
                 1000 |                                         2256
                 2000 |                                         1078
                 3000 |                                         862
                 4000 |                                         1070
                 5000 |                                         637      <- Bad
                 6000 |                                         535
     [...]
                41000 |                                         3
                42000 |                                         2        <- Inconceivable
                43000 |                                         2
                44000 |                                         1
                45000 |                                         0

     • ts_maxwait @ pri 59 = 32s; FSS uses ?
 54. CPU scheduling, cont.
     • Findings:
       • FSS scheduler class bug: FSS uses a more complex technique to avoid CPU starvation, and a thread's priority could stay high, keeping it on-CPU for many seconds before the priority decayed enough to let another thread run (see the sketch below)
       • Analyzed (more DTrace) and fixed (thanks Jerry Jelinek)
     • DTrace analysis of the scheduler was invaluable
     • Under (too) high CPU load, your runtime can be bound by how well you schedule, not by how fast you do work
     • Not the only scheduler issue we've encountered
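     • A minimal sketch (not the deck's analysis), assuming the illumos sched provider's change-pri probe: watch the new priorities assigned to beam.smp threads, to see whether a thread's priority stays pinned high while it hogs the CPU.

           sched:::change-pri
           /stringof(args[1]->pr_fname) == "beam.smp"/
           {
                   /* distribution of new priorities, per LWP */
                   @pri[args[0]->pr_lwpid] = lquantize(args[2], 0, 100, 10);
           }

           tick-30s
           {
                   exit(0);    /* aggregations print on exit */
           }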
 55. CPU scheduling, cont.
     • CPU caps to throttle tenants in our cloud
     • Experiment: add hot-CPU threads (saturation):
 56. CPU scheduling, cont.
     • CPU caps to throttle tenants in our cloud
     • Experiment: add hot-CPU threads: :-(
 57. Visualizing CPU latency
     • Using a node.js ustack helper and the DTrace profile provider, we can determine the relative frequency of stack backtraces in terms of CPU consumption (see the sketch below)
     • Stacks can be visualized with flame graphs, a stack visualization we developed:
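     • A minimal sketch (not necessarily the deck's exact invocation) of collecting that data, assuming SmartOS's node.js ustack helper is available so jstack() can translate JavaScript frames:

           /* Sample user-level stacks of node processes at 97 Hz for 60s. */
           profile-97
           /execname == "node" && arg1/
           {
                   @[jstack(80, 8192)] = count();
           }

           tick-60s
           {
                   exit(0);
           }

       The aggregated stacks can then be collapsed and rendered as a flame graph (e.g. with the stackcollapse and flamegraph scripts).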
 58. DIRT in production
     • node.js is particularly amenable to the DIRTy apps that typify the real-time web
     • The ability to understand latency must be considered when deploying node.js-based systems into production!
     • Understanding latency requires dynamic instrumentation and novel visualization
     • At Joyent, we have added DTrace-based dynamic instrumentation for node.js to SmartOS, and novel visualization into our cloud and software offerings
     • Better production support — better observability, better debuggability — remains an important area of node.js development!
 59. Beyond node.js
     • node.js is adept at connecting components in the system; it is unlikely to be the only component!
     • As such, when using node.js to develop a DIRTy app, you can expect to spend as much time (if not more!) understanding the components as the app
     • When selecting components — operating system, in-memory data store, database, distributed data store — observability must be a primary consideration!
     • When building a team, look for full-stack engineers — DIRTy apps pose a full-stack challenge!
 60. Thank you!
     • @mranney for being an excellent guinea pig customer
     • @dapsays for the V8 DTrace ustack helper and V8 debugging support
     • More information: http://dtrace.org/blogs/brendan, http://dtrace.org/blogs/dap, and http://smartos.org
