Plumbr case study

  1. Nikita Salnikov-Tarnovski: TECHNICAL OBSTACLES WHEN BUILDING PLUMBR. Monday, April 1, 13
  2. AGENDA: Who we were and who we are. Object lifecycle with little overhead. Graph analysis in low memory. The problem of quitting.
  3. OUR BACKGROUND: 2 developers, Nikita Salnikov-Tarnovski (@iNikem) and Vladimir Šor (@vovencij). 10+ years in custom software house Nortal. Mostly Java EE development. Web sites, backend systems, batch processes.
  4. NEW PROBLEM: Memory leaks. 130,000 monthly searches for OutOfMemoryError in Google. 20,000 monthly unique visitors on our site. 400 monthly downloads. 1700+ leaks discovered.
  5. PLUMBR: Automated performance consultant. Giving you the exact location of the leak with enough information to fix it. The foundation is based on machine learning trained on 500,000 memory snapshots from 3,000 different applications. Finding 88% of the existing leaks. Quality only going up with the additional data gathered each day.
  6. PLUMBR AGENT... JVM TI agents, both Java and native, OS specific: welcome malloc and free! JNI code for communication between them.
  7. ... WATCHES YOU: We monitor object creation and disposal. On-the-fly bytecode instrumentation. Hooks into GC events.
  8. OBJECT MONITORING I: Java agent registers java.lang.instrument.ClassFileTransformer. Modifies bytecode as classes are loaded, using the ASM library, to capture all newly created objects.
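
A minimal sketch of the registration step described on this slide, using only the JDK's java.lang.instrument API. The class names `LoadCountingAgent` and `CountingTransformer` and the counter are hypothetical; this transformer only counts loaded classes and leaves the bytes untouched, whereas a real agent would rewrite `classfileBuffer` (for example with ASM) at the marked place.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;
import java.util.concurrent.atomic.AtomicLong;

final class LoadCountingAgent {
    // Number of non-JDK classes seen by the transformer (illustration only).
    static final AtomicLong seen = new AtomicLong();

    static final class CountingTransformer implements ClassFileTransformer {
        @Override
        public byte[] transform(ClassLoader loader, String className,
                                Class<?> classBeingRedefined,
                                ProtectionDomain protectionDomain,
                                byte[] classfileBuffer) {
            // className is in internal form ("java/util/ArrayList") and may be null.
            if (className != null && !className.startsWith("java/")) {
                seen.incrementAndGet();
                // A real agent would return rewritten bytecode here.
            }
            return null; // null means "keep the original class bytes"
        }
    }

    // Invoked by the JVM before main() when started with -javaagent:<agent jar>;
    // the jar manifest must name this class in its Premain-Class attribute.
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new CountingTransformer());
    }
}
```

Returning null rather than a copy of the input keeps the transformer cheap for classes it does not want to touch.
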
  9. PROBLEMS: Different compilers produce slightly different bytecode. Some classes are too fragile or broken already. The new instruction and its chain of <init> calls. Objects also appear via clone, deserialization, and reflection.
  10. OBJECT MONITORING II: We keep some data about each live object. Creating and associating that data takes time. On every object creation!
  11. OBJECT MONITORING II: If you cannot do it in-process, do it off-process.
  12. PROBLEMS: BlockingQueue implementations are slow. Locks are slow. Atomic* are slow! No existing library fits; even Disruptor doesn’t suit. We’ve written a no-guarantee-lock-free-many-producers-one-consumer buffer. Concurrent programming IS hard.
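
The slides do not show Plumbr's buffer, so the sketch below is not that buffer. It illustrates a different, well-known way to cut per-event synchronization for many producers and one consumer: each producer thread fills a private batch and touches the shared queue only once per batch. The class name `BatchingEventChannel`, the batch size, and the use of `long` event ids are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.LongConsumer;

final class BatchingEventChannel {
    private static final int BATCH_SIZE = 4; // tiny on purpose; real batches would be larger

    // Per-thread scratch space: no other thread ever reads or writes it.
    private static final class Local {
        final long[] events = new long[BATCH_SIZE];
        int size;
    }

    private final ThreadLocal<Local> local = ThreadLocal.withInitial(Local::new);
    private final ConcurrentLinkedQueue<long[]> handedOver = new ConcurrentLinkedQueue<>();

    // Called by any producer thread; shared state is touched once per full batch.
    void record(long event) {
        Local l = local.get();
        l.events[l.size++] = event;
        if (l.size == BATCH_SIZE) {
            flush();
        }
    }

    // Hands the calling thread's partial batch to the consumer.
    void flush() {
        Local l = local.get();
        if (l.size == 0) {
            return;
        }
        handedOver.add(Arrays.copyOf(l.events, l.size));
        l.size = 0;
    }

    // Called by the single consumer thread; returns the number of events delivered.
    int drain(LongConsumer sink) {
        int delivered = 0;
        long[] batch;
        while ((batch = handedOver.poll()) != null) {
            for (long event : batch) {
                sink.accept(event);
                delivered++;
            }
        }
        return delivered;
    }
}
```

Events still sitting in a producer's private batch are invisible to the consumer until that producer flushes, which is the price of the cheaper hot path.
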
  13. MORE PROBLEMS: Have to store all that object-related data somewhere. Java Collections are too fat. No lock-free thread-safe reading. We use Trove to save memory. Hand-written clone with dirty check. Testing persistent immutable data structures.
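
One possible reading of "clone with dirty check", sketched under assumptions: a single writer thread mutates a private primitive array and publishes a copy for readers only when something has changed, so readers never take a lock. The class `SnapshotCounters` and its methods are hypothetical and are not taken from Plumbr; a plain `long[]` stands in for a primitive collection such as those in Trove.

```java
import java.util.Arrays;

final class SnapshotCounters {
    private final long[] live;         // mutated by the single writer thread only
    private boolean dirty;             // writer-only flag: has 'live' changed since the last publish?
    private volatile long[] published; // never modified after publication

    SnapshotCounters(int size) {
        live = new long[size];
        published = new long[size];
    }

    // Writer thread: cheap in-place update, no copying.
    void add(int index, long delta) {
        live[index] += delta;
        dirty = true;
    }

    // Writer thread: copy only if something changed since the last publish.
    void publishIfDirty() {
        if (!dirty) {
            return;
        }
        published = Arrays.copyOf(live, live.length); // volatile write makes the copy visible
        dirty = false;
    }

    // Any thread: lock-free read of the latest published value.
    long read(int index) {
        return published[index];
    }

    // Any thread: the current snapshot; callers must treat it as read-only.
    long[] snapshot() {
        return published;
    }
}
```

Readers may see slightly stale values, but every snapshot is internally consistent because it is never changed after the volatile write.
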
  14. LEAK HUNTING: When leaks are detected, we need to find out who is holding them. Paths to GC roots. While the application is still running.
  15. PROBLEMS: Java objects keep no record of incoming refs. You can walk the heap in C code, but that stops the world. Standard heap dump loses information. So we make a custom heap dump and traverse the reference graph on it.
  16. STILL PROBLEMS: We’ve tried many graph traversal libraries and NoSQL solutions. All somewhat work, if you give them gigs of memory. But we have to do this on-site, while the application is still running. We needed a memory-sensitive solution.
  17. ONE MORE BICYCLE: We’ve written our own specialized version of Dijkstra’s path search. Again had to replace many Java Collections with more memory-efficient implementations.
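
The slides do not describe what was specialized, so the following is only a sketch of the general idea: Dijkstra's algorithm over a graph stored in flat primitive arrays, with a hand-rolled binary heap of packed `long` keys instead of `PriorityQueue` and boxed objects. The compressed-row layout (`offsets`, `targets`, `weights`), the class name `CompactDijkstra`, and the assumption of small non-negative `int` weights are illustrative choices, not Plumbr's actual data model.

```java
import java.util.Arrays;

final class CompactDijkstra {
    // Edges of node u are targets[offsets[u] .. offsets[u + 1] - 1], with matching weights.
    // Returns the node ids on a shortest path from source to goal, or an empty array if unreachable.
    static int[] shortestPath(int[] offsets, int[] targets, int[] weights, int source, int goal) {
        int n = offsets.length - 1;
        int[] dist = new int[n];
        int[] prev = new int[n];
        Arrays.fill(dist, Integer.MAX_VALUE);
        Arrays.fill(prev, -1);
        // Heap entries pack (distance << 32 | node); at most one push per edge plus the source.
        long[] heap = new long[targets.length + 1];
        int size = 0;
        dist[source] = 0;
        heap[size++] = source; // distance 0 in the high bits
        while (size > 0) {
            long top = heap[0];
            long last = heap[--size];
            int i = 0;
            while (true) { // sift 'last' down from the root
                int c = 2 * i + 1;
                if (c >= size) break;
                if (c + 1 < size && heap[c + 1] < heap[c]) c++;
                if (heap[c] >= last) break;
                heap[i] = heap[c];
                i = c;
            }
            heap[i] = last;
            int d = (int) (top >>> 32);
            int u = (int) top;
            if (d > dist[u]) continue; // stale entry, a shorter path was found later
            if (u == goal) break;
            for (int e = offsets[u]; e < offsets[u + 1]; e++) {
                int v = targets[e];
                int nd = d + weights[e];
                if (nd < dist[v]) {
                    dist[v] = nd;
                    prev[v] = u;
                    long key = ((long) nd << 32) | v;
                    int j = size++;
                    while (j > 0) { // sift the new key up
                        int p = (j - 1) >>> 1;
                        if (heap[p] <= key) break;
                        heap[j] = heap[p];
                        j = p;
                    }
                    heap[j] = key;
                }
            }
        }
        if (dist[goal] == Integer.MAX_VALUE) return new int[0];
        int len = 0;
        for (int v = goal; v != -1; v = prev[v]) len++;
        int[] path = new int[len];
        int k = len - 1;
        for (int v = goal; v != -1; v = prev[v]) path[k--] = v;
        return path;
    }
}
```

For example, with edges 0→1 (weight 1), 0→2 (weight 5), 1→2 (weight 1) and 2→3 (weight 1), the arrays are `offsets = {0, 2, 3, 4, 4}`, `targets = {1, 2, 2, 3}`, `weights = {1, 5, 1, 1}`, and the path from 0 to 3 goes through nodes 0, 1, 2, 3. Apart from the result, everything lives in four `int` arrays and one `long` array sized by node and edge counts.
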
  18. TIME TO DIE: Plumbr runs inside the JVM alongside an application. It isn’t the main actor, just a supporter. So Plumbr must be ready to quit whenever the main application wishes.
  19. WHEN JVM QUITS: It turns out the JVM is quite survivable. No shutdown notification or something like that. It just quits when there are no more non-daemon threads. And some threads live for far too long.
  20. PROBLEMS: Plumbr’s own threads. Threads from libraries that Plumbr uses. ExecutorService with daemon thread factory.
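
A short sketch of an ExecutorService whose worker threads are daemons and therefore do not keep the JVM alive. It uses only standard java.util.concurrent calls; the helper class `DaemonExecutors` and the thread-name prefix are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

final class DaemonExecutors {
    // Creates named daemon threads: prefix-1, prefix-2, ...
    static ThreadFactory daemonFactory(String prefix) {
        AtomicInteger counter = new AtomicInteger();
        return runnable -> {
            Thread t = new Thread(runnable, prefix + "-" + counter.incrementAndGet());
            t.setDaemon(true); // must be set before the thread is started
            return t;
        };
    }

    static ExecutorService newDaemonPool(int threads, String prefix) {
        return Executors.newFixedThreadPool(threads, daemonFactory(prefix));
    }
}
```

Usage is the same as for any other pool, for example `DaemonExecutors.newDaemonPool(2, "agent-worker")`. Tasks still running when the last non-daemon thread ends are simply abandoned, so this only suits work that may be cut off at any point.
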
  21. PROBLEMS: The RMI Reaper thread keeps the JVM alive as long as some JMX resources are in use. We must clean up after ourselves: MBeans, JMX connections, JMX servers. But when??? Implemented our own monitor thread with some heuristics.
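
The heuristics themselves are not given in the slides. One plausible shape for such a monitor, sketched under assumptions: a daemon thread polls the live threads and runs a cleanup callback once no non-daemon thread remains that belongs to the application. Which threads to ignore (the agent's own, JVM-internal ones such as "DestroyJavaVM", the RMI Reaper) is exactly the heuristic part, so it is left to a caller-supplied predicate. The names `ShutdownWatch`, `foreignNonDaemonThreads` and `startMonitor` are hypothetical.

```java
import java.util.function.Predicate;

final class ShutdownWatch {
    // Counts live non-daemon threads that the caller does not want to ignore.
    static int foreignNonDaemonThreads(Predicate<Thread> ignore) {
        int count = 0;
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.isAlive() && !t.isDaemon() && !ignore.test(t)) {
                count++;
            }
        }
        return count;
    }

    // Starts a daemon thread that runs 'cleanup' once only ignorable threads keep the JVM alive.
    static Thread startMonitor(Predicate<Thread> ignore, Runnable cleanup, long pollMillis) {
        Thread monitor = new Thread(() -> {
            try {
                while (foreignNonDaemonThreads(ignore) > 0) {
                    Thread.sleep(pollMillis);
                }
                cleanup.run(); // e.g. unregister MBeans, close JMX connections
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // asked to stop: skip the cleanup
            }
        }, "shutdown-watch");
        monitor.setDaemon(true); // the monitor itself must never keep the JVM alive
        monitor.start();
        return monitor;
    }
}
```

Polling is crude, and taking a snapshot of all threads is not free, so the poll interval trades shutdown latency against overhead.
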
  22. PROBLEMS: Earlier versions used some Swing components, e.g. a systray icon. And the JVM will not quit while there are some displayable Swing components. Should kill them before quitting. Again, when???
  23. CONCLUSION: Don’t spend all your time writing web components or web services or Swing. There is more to Java than that. There are many Java libraries, but not enough.