DTrace in the Non-global Zone

My presentation at the BayLISA SmartOS meetup on August 16th, 2012.

  • 1. DTrace in the Non-global Zone
      • Bryan Cantrill, SVP Engineering, Joyent
      • @bcantrill
      • bryan@joyent.com
  • 2. DTrace and zones: Fraternal twins
      • DTrace and zones were developed in parallel during development of Solaris 10
      • DTrace integrated (September 2003) before zones (early 2004)
      • When zones integrated, the priority was enabling DTrace in the global zone to meaningfully instrument non-global zones
      • DTrace in the non-global zone was hard, and a lower priority than other work on both technologies
  • 3. DTrace and zones: Basic functionality
      • In 2006, Dan Price (with help from Adam Leventhal and Jonathan Adams) added initial support for DTrace in the non-global zone
      • Allowed use of syscall provider, pid provider and (in a deranged, broken way) the profile provider
      • This was significant work: required modifications to both the zones privilege model and the DTrace privilege model
      • For example, required an implicit predicate on syscall and profile probes
  • 4. DTrace and zones in SmartOS
      • As the world's heaviest user of zones, we at Joyent ran into (and fixed) a number of annoying bugs:
          • USDT probes from the non-global zone were not properly being enabled in the global zone (illumos#908)
          • Tick and profile probes did not properly fire when used in the non-global zone (illumos#1456)
      • Fixing the latter required an extension of the DTrace privilege model: introduced a notion of restricted operation in which args could not be referenced
  • 5. DTrace and zones in SmartOS
      • Other (very) annoying issues still lurked:
          • Inability to read “cpu” in the non-global zone
          • Inability to read any fields from “curlwpsinfo” and “curpsinfo” (especially “pr_dmodel”)
          • Inability to read the “fds[]” array
      • Failure mode highly obnoxious:
          [my-non-global-zone ~]# dtrace -n 'BEGIN{trace(curpsinfo->pr_psargs)}'
          dtrace: description 'BEGIN' matched 1 probe
          dtrace: error on enabled probe ID 1 (ID 1: dtrace:::BEGIN): invalid kernel access in action #1 at DIF offset 44
  • 6. Divide and conquer
      • curlwpsinfo and curpsinfo are translators over the current thread (“kthread_t”) and current process (“proc_t”), respectively
      • Importantly, the state contained in one's own kthread_t and proc_t:
          • Is safe to read while executing (threads cannot disappear out from under themselves)
          • Does not represent potential privilege escalation
      • This can be fixed by simply allowing these loads when one has privileges to the current process!
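The fix described on this slide amounts to a bounds check: a load is permitted if it falls entirely inside the current thread's or current process's own structure. Here is a minimal sketch of that idea, assuming cut-down stand-ins for `proc_t` and `kthread_t`; `in_range()`, `can_load()` and the `has_proc_priv` flag are hypothetical names, and the real in-kernel check is not shown in the talk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-ins for the kernel's proc_t and kthread_t. */
typedef struct proc {
	int p_pid;
	char p_args[80];
} proc_t;

typedef struct kthread {
	int t_id;
	proc_t *t_procp;
} kthread_t;

/* True if [addr, addr + size) lies entirely within [base, base + len). */
static bool
in_range(uintptr_t addr, size_t size, uintptr_t base, size_t len)
{
	/* Arranged so that no addition can overflow. */
	return (addr >= base && size <= len && addr - base <= len - size);
}

/*
 * Decide whether a probe running in thread `cur` may load `size` bytes at
 * `addr`. `has_proc_priv` models "has privileges to the current process".
 */
static bool
can_load(const kthread_t *cur, bool has_proc_priv, uintptr_t addr,
    size_t size)
{
	assert(cur != NULL && cur->t_procp != NULL);
	if (!has_proc_priv)
		return (false);
	return (in_range(addr, size, (uintptr_t)cur, sizeof (*cur)) ||
	    in_range(addr, size, (uintptr_t)cur->t_procp, sizeof (proc_t)));
}
```

Under this model, a translator member that resolves to a field of one's own `proc_t` passes the check, while a load from any other process's structure is refused.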
  • 7. fds[]: A magic bullet?
      • Somehow, I convinced myself that the problem with fds[] was the translator that translates the member accesses into kernel accesses:
          inline fileinfo_t fds[int fd] = xlate <fileinfo_t> (
              fd >= 0 && fd < curthread->t_procp->p_user.u_finfo.fi_nfiles ?
              curthread->t_procp->p_user.u_finfo.fi_list[fd].uf_file : NULL);
      • If the problem was the static translators, the solution must be dynamic translators, a(n in)famously unimplemented feature of DTrace!
      • After dtrace.conf(12), I realized that how the access is expressed is orthogonal to the real constraint: the in-kernel implementation must not allow privilege escalation
  • 8. fds[]: No magic bullets
      • Focusing on the implementation allows one to consider the specifics of the fds[] case
      • Helped by the fact that the fi_list implementation uses memory retiring for scalability of file descriptor lookups: the array is only freed upon process exit
      • This assures that one's own fi_list is always pointing to memory that is (or was) an array of uf_entry_t
      • Leaves the file_t itself, which can be freed during probe context (specifically, by another thread in the same process)
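Memory retiring, as used here, means that when the descriptor array grows, the old array is set aside rather than freed, so a reader holding a stale pointer still sees valid (if outdated) entries. The following is a simplified sketch of the technique, not the illumos code: `entry_t`, `fdtable_t` and the `fdtable_*()` functions are hypothetical analogues of `uf_entry_t`, the `fi_list`/`fi_nfiles` pair and the kernel's table management.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified analogue of a uf_entry_t slot. */
typedef struct entry {
	void *file;
} entry_t;

/* A retired array, kept on a list until the table is destroyed. */
typedef struct retired {
	entry_t *arr;
	struct retired *next;
} retired_t;

typedef struct fdtable {
	entry_t *list;		/* current array (analogue of fi_list) */
	int nfiles;		/* current length (analogue of fi_nfiles) */
	retired_t *retired;	/* old arrays, not freed until "exit" */
} fdtable_t;

static int
fdtable_init(fdtable_t *t, int n)
{
	assert(n > 0);
	t->list = calloc((size_t)n, sizeof (entry_t));
	t->nfiles = (t->list != NULL) ? n : 0;
	t->retired = NULL;
	return (t->list != NULL ? 0 : -1);
}

/* Double the table; retire (rather than free) the old array. */
static int
fdtable_grow(fdtable_t *t)
{
	int n = t->nfiles * 2;
	entry_t *bigger = calloc((size_t)n, sizeof (entry_t));
	retired_t *r = malloc(sizeof (*r));

	if (bigger == NULL || r == NULL) {
		free(bigger);
		free(r);
		return (-1);
	}
	memcpy(bigger, t->list, (size_t)t->nfiles * sizeof (entry_t));
	r->arr = t->list;	/* stale readers can still dereference this */
	r->next = t->retired;
	t->retired = r;
	t->list = bigger;
	t->nfiles = n;
	return (0);
}

/* "Process exit": the only point at which any array is freed. */
static void
fdtable_destroy(fdtable_t *t)
{
	while (t->retired != NULL) {
		retired_t *r = t->retired;
		t->retired = r->next;
		free(r->arr);
		free(r);
	}
	free(t->list);
	t->list = NULL;
	t->nfiles = 0;
}
```

The cost is some memory held until exit; the benefit is that a lookup never has to guard against the array itself vanishing, which is the property the slide relies on.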
  • 9. Dealing with file_t
      • We can deal with this by forcing everyone out of probe context after a file_t has been removed from the uf_entry_t, but before it is freed
      • This is done by issuing a dtrace_sync(), a synchronous (empty) cross-call to all CPUs
      • This is expensive, and required answering an important question: just how hot is the closef() path, anyway?
      • By instrumenting our production cloud, we could answer this concisely: closef() is pretty damned hot (> 5,000/second on some machines!)
  • 10. Adding getf()
      • To track when fds[] was active in the non-global zone, we added a getf() subroutine (ht: ken)
      • Allows us to issue the sync only when we have a closef() from a non-global zone using fds[]
      • Had to take the final step of cleaning up the path output to strip off the zone path from the file name (as a cleanliness issue, not a security issue)
      • De-mo, de-mo, de-mo!
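The ordering described on the last two slides (unpublish the file_t, wait for probe context to drain, then free) and the getf()-based gating can be modelled in a few lines. This is a sketch under stated assumptions only: `zone_t`, `used_getf`, `model_getf()`, `model_close()` and `model_sync()` are hypothetical stand-ins, the sync is reduced to a counter, and how the real implementation tracks getf() use is not described in the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct file { int f_id; } file_t;	/* stand-in for file_t */
typedef struct slot { file_t *file; } slot_t;	/* stand-in for uf_entry_t */
typedef struct zone { bool used_getf; } zone_t;	/* hypothetical per-zone flag */

static unsigned long sync_calls;	/* how many "cross-calls" were issued */

/* Stand-in for dtrace_sync(): here it only counts invocations. */
static void
model_sync(void)
{
	sync_calls++;
}

/* Model of getf() from probe context: note the use, then look up the fd. */
static file_t *
model_getf(zone_t *z, const slot_t *slots, int nslots, int fd)
{
	z->used_getf = true;
	if (fd < 0 || fd >= nslots)
		return (NULL);
	return (slots[fd].file);
}

/* Model of the close path: unpublish, sync only if needed, then free. */
static void
model_close(zone_t *z, slot_t *slots, int nslots, int fd)
{
	file_t *fp;

	assert(fd >= 0 && fd < nslots);
	fp = slots[fd].file;
	slots[fd].file = NULL;		/* 1. remove from the slot */
	if (fp == NULL)
		return;
	if (z->used_getf)
		model_sync();		/* 2. force everyone out of probe context */
	free(fp);			/* 3. only now is freeing safe */
}
```

A zone that never touches fds[] pays nothing extra on close; only zones that have called the getf() model incur the sync, which is the point of the gating given how hot closef() is.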
  • 11. sched and proc providers
      • With fds[] done, focus turned to the only remaining meaningful impediment to DTrace in the non-global zone: enabling the sched and proc providers
      • Recall the restricted operation introduced for the profile provider in the non-global zone...
      • Used this to have limited (non-global) DTrace privileges imply restricted operation for some SDT providers
      • Thanks to the curlwpsinfo/curpsinfo work, these providers can be meaningfully used without access to arguments
  • 12. Thank you.
      • For more information, visit www.joyent.com or www.smartos.org