Lab: JVM Production Debugging 101

Tomer Gabel
Consulting Engineer at Substrate Software Services
Java Production Debugging 101
A Reversim Summit Lab, February, 2013
PRODUCTION DEBUGGING = FORENSICS
Business Requirements

Requirements        Prod. Debugging      Forensics
Timeframe           Severely limited     Hours, days, weeks…
Chain of Custody    Meaningless          Sacred
Documentation       Useful               Sacred
Endgame

Production Debugging          Forensics
1. Gather evidence            1. Identify crime in progress
2. Restore functionality      2. Gather evidence
                3. Figure out what happened
Our Forensic Process


Gather Evidence

  Restore Production

    Analyze Findings

      Implement Solution

        Post-Mortem
Evidence toolchain
WHAT SHALL WE COLLECT?
Our focus points for today

•   Thread dump
•   Heap dump
•   VM (especially GC) metrics
•   System metrics
•   Logs
jstack

• Minimalistic tool
• Against a running process:
 jstack <pid>
• Outputs to stdout
• Identifies deadlocks
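
When jstack cannot be run against the process, a similar snapshot can be taken from inside the JVM. The following is a minimal sketch, assuming a HotSpot-style JVM that supports monitor and synchronizer reporting; the class and method names (ThreadDumpSketch, dumpAllThreads, findDeadlocks) are hypothetical and not part of the lab material. It uses the standard java.lang.management.ThreadMXBean API and does not reproduce the exact jstack output format.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: an in-process approximation of a jstack thread dump.
class ThreadDumpSketch {

    /** Renders every live thread with its state and stack frames. */
    static String dumpAllThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // true, true: also collect locked monitors and ownable synchronizers
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    /** Returns the ids of deadlocked threads, or an empty array if none. */
    static long[] findDeadlocks() {
        long[] ids = ManagementFactory.getThreadMXBean().findDeadlockedThreads();
        return ids == null ? new long[0] : ids;
    }

    public static void main(String[] args) {
        System.out.print(dumpAllThreads());
        System.out.println("Deadlocked threads: " + findDeadlocks().length);
    }
}
```

Like jstack, the output is plain text, so it can be written to a log and searched later with grep or less.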
jmap

• Heap-dump from a running process
  – Lengthy process
  – Freezes VM
• Some extras
• Command:
  jmap -dump:format=b,file=<output> <pid>
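
A heap dump can also be requested from inside the process. The sketch below assumes a HotSpot JVM, where the com.sun.management.HotSpotDiagnosticMXBean interface is available; the class name HeapDumpSketch and the output file name are hypothetical. The same caveats as jmap apply: the dump takes time and pauses the VM while it is written.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;

// Hypothetical helper: writes an .hprof file readable by heap analysis tools.
class HeapDumpSketch {

    /**
     * Dumps the heap to the given file. The file must not exist yet, and
     * recent JDKs require the name to end with ".hprof".
     * liveOnly = true dumps only reachable objects (forces a full GC first).
     */
    static void dumpHeap(File target, boolean liveOnly) {
        HotSpotDiagnosticMXBean diagnostics =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            diagnostics.dumpHeap(target.getAbsolutePath(), liveOnly);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed output location; pick a disk with enough free space.
        File target = new File("evidence-" + System.currentTimeMillis() + ".hprof");
        dumpHeap(target, true);
        System.out.println("Wrote " + target.length() + " bytes to " + target);
    }
}
```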
jstat

•   JVM metrics: classloader, JIT, GC
•   Tracking over time
•   Console-based
•   jstat -gcutil <pid> 5s
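
Where jstat is not at hand, comparable GC counters can be sampled in-process. This is a rough sketch using the standard java.lang.management beans; the class name GcMetricsSketch and the output format are made up for illustration, and the figures are cumulative counts and times per collector rather than the percentage columns jstat prints.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper: one line of heap and GC counters per sample.
class GcMetricsSketch {

    /** Returns heap usage plus cumulative count/time for each collector. */
    static String sample() {
        StringBuilder line = new StringBuilder();
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        line.append("heapUsedMB=").append(heap.getUsed() / (1024 * 1024));
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            line.append(' ').append(gc.getName().replace(' ', '_'))
                .append("{count=").append(gc.getCollectionCount())
                .append(",timeMs=").append(gc.getCollectionTime()).append('}');
        }
        return line.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        // Three samples one second apart; a real tracker would run longer.
        for (int i = 0; i < 3; i++) {
            System.out.println(sample());
            Thread.sleep(1000);
        }
    }
}
```

Tracking the difference between consecutive samples shows whether collections are becoming more frequent or taking longer.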
The JVM GC
jvisualvm

• Combines most of the above, with GUI
• Remote via X11 forwarding (dreadful!)
So…

SHALL WE DANCE?
Scenario 1

• Phone call in the middle of the night
  – “The application is stuck!”


• What do you do?
Scenario 2


• Looks familiar?
   – “The application is
     crawling to a halt!”
   – “So restart it.”
   – “OK, it’s good now.”


• This is a lie.
   – You will get another
     call.
Scenario 3

• 1st tier support engineer (maybe
  you?) calls:
  – “I get OutOfMemoryExceptions on this
    service.”
  – “Restart it.”
  – “Already have. Happened again.”
  – “Well, shit.”
BREAK TIME!
Without further ado…

FORENSIC
TOOLCHAIN
GNU toolchain is your friend


• bash, ps, grep, less, awk
  – ’nuff said


• … or:
  – http://gnuwin32.sourceforge.net/
MAT

• Eclipse plugin/standalone
• Reads heap dumps
• Easy drill-down
And most important…
RESOLUTION TIME!
Back to: Scenario 1

• What did we gather?
  –   CPU – 100% single-core utilization
  –   GC metrics – no useful data
  –   Heap dump – no useful data
  –   Thread dump
       • java.util.regex * gazillion
• Where the problem is implies…
   what the problem is
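
One possible way to contain a runaway regex until the pattern itself is fixed is to give each match a time budget. The sketch below is an illustration, not the lab's solution: it wraps the input in a CharSequence that throws once a deadline passes, which works because java.util.regex reads its input through charAt. All names (RegexDeadlineSketch, DeadlineCharSequence, RegexTimeoutException, matchesWithin) are hypothetical.

```java
import java.time.Duration;
import java.util.regex.Pattern;

// Hypothetical guard: abort a regex match that exceeds its time budget.
class RegexDeadlineSketch {

    static final class RegexTimeoutException extends RuntimeException {
        RegexTimeoutException(String message) { super(message); }
    }

    /** CharSequence view that refuses reads once the deadline has passed. */
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence delegate;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence delegate, long deadlineNanos) {
            this.delegate = delegate;
            this.deadlineNanos = deadlineNanos;
        }

        @Override public char charAt(int index) {
            // Overflow-safe comparison of System.nanoTime() values
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new RegexTimeoutException("regex match exceeded its time budget");
            }
            return delegate.charAt(index);
        }

        @Override public int length() { return delegate.length(); }

        @Override public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(delegate.subSequence(start, end), deadlineNanos);
        }

        @Override public String toString() { return delegate.toString(); }
    }

    /** Runs pattern.matches() on input, throwing if the budget runs out. */
    static boolean matchesWithin(Pattern pattern, CharSequence input, Duration budget) {
        long deadline = System.nanoTime() + budget.toNanos();
        return pattern.matcher(new DeadlineCharSequence(input, deadline)).matches();
    }
}
```

The guard only limits the damage of a single match; as the notes for this scenario say, the real fix is at the algorithmic level.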
Back to: Scenario 2

• What did we gather?
  –   CPU – 100% single-core utilization
  –   Heap dump – no useful data
  –   Thread dump
  –   GC metrics
       • Frequent, long GCs (YGC, FGC, FGCT)
• Rapid HashMap insertions: recipe for
  disaster
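
The notes for this scenario attribute the GC storm to a hash collection growing without an appropriate initial capacity. A minimal sketch of presizing follows, assuming the expected element count is known up front and the default load factor of 0.75 is kept; the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: size a HashSet once instead of letting it resize repeatedly.
class PresizedCollectionsSketch {

    static final float DEFAULT_LOAD_FACTOR = 0.75f;

    /** Smallest initial capacity that holds expectedEntries without a resize. */
    static int capacityFor(int expectedEntries) {
        return (int) Math.ceil(expectedEntries / (double) DEFAULT_LOAD_FACTOR);
    }

    /** Creates a HashSet that should not need to grow for expectedEntries elements. */
    static <T> Set<T> newHashSetFor(int expectedEntries) {
        return new HashSet<>(capacityFor(expectedEntries), DEFAULT_LOAD_FACTOR);
    }

    public static void main(String[] args) {
        Set<String> seen = newHashSetFor(1_000_000);
        for (int i = 0; i < 1_000_000; i++) {
            seen.add("item-" + i);
        }
        System.out.println(seen.size());
    }
}
```

Each resize allocates a new, larger table and rehashes the existing entries, so avoiding resizes removes one source of allocation pressure; it does not help if the collection is simply too large for the heap.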
Back to: Scenario 3

• What did we gather?
  –   CPU – low utilization
  –   Thread dump – no useful data
  –   GC metrics – high heap utilization, low GC
  –   Heap dump
       • Predictably high number of strings
       • Strings are abnormally large
       • Strings contain entire HTML subset!
• Substring/regex can be dangerous!
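
On JVMs older than Java 7u6, String.substring returned a view that shared the original string's character array, so keeping a short substring kept the whole HTML page alive. The sketch below shows the usual defensive copy; the class name SubstringCopySketch and the title-extraction task are hypothetical stand-ins for whatever the service actually stored. On current JVMs substring already copies, so the extra copy is redundant but harmless.

```java
// Hypothetical helper: keep only the characters that are actually needed.
class SubstringCopySketch {

    private static final String OPEN = "<title>";
    private static final String CLOSE = "</title>";

    /** Returns the text between <title> and </title>, or null if absent. */
    static String extractTitle(String html) {
        int start = html.indexOf(OPEN);
        if (start < 0) {
            return null;
        }
        int contentStart = start + OPEN.length();
        int end = html.indexOf(CLOSE, contentStart);
        if (end < 0) {
            return null;
        }
        // new String(...) forces a private copy of just these characters,
        // so the large page can be garbage collected on pre-7u6 JVMs too.
        return new String(html.substring(contentStart, end));
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Hello</title></head><body>...</body></html>";
        System.out.println(extractTitle(page));
    }
}
```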
Headache? Take two of these!

AFTERWORD
Adieu
• Thank you for attending!

• Presentation and demos:
            http://git.io/7LK4fw

• Tomer Gabel
  – tomer@tomergabel.com
  – http://www.tomergabel.com/
  – @tomerg
Thank you
 our sponsors

Editor's Notes

  1. Picture source: CSI Las Vegas (http://flowtv.org/wp-content/uploads/2007/11/csi3.jpg)
  2. Image source: http://www.about-larnaca.info/2012/06/thief-is-caught-red-handed-in-kiti.html
  3. Invite discussion. Ask audience to point out different data that is (a) useful and (b) readily accessible. Limit to 3 minutes.
     Image source: http://lets-rap.com/wp-content/uploads/2011/05/house-md-d-house-md-1048019_1152_864.jpg (copyright Fox)
  4. Expound a bit on anything that hasn’t been raised in the earlier discussion. Limit to 2 minutes, less if possible.
  5. “All this and more” sales pitch. Mention the profiler.
  6. Actual scenario details: pathological regular expression in a service (http://swtch.com/~rsc/regexp/regexp1.html).
     Exhibited behavior: very high single-core CPU utilization. Little or no GC activity. Possible StackOverflowError if left long enough.
     Analysis: the stack trace will exhibit a very deep, repetitive call stack into java.util.regex.* classes. Package and class names will indicate a regex issue.
     Bonus points to whoever recognizes the pathological regex scenario without looking at the code, beyond the “problematic Regex” axiom.
     Workaround: depends on the details; in some cases the offending input can be deleted or routed to a dead letter queue. In this case, none: this has to be resolved at the algorithmic level.
     Image source: http://blog.rogersbroadcasting.com/billhart/2010/01/19/tuesday-january-19th-2010-dialing-miscue-avatar-can-kill-you-look-where-kellie-flew/
  7. Actual scenario details: GC storm as a result of an exponentially-growing HashSet. Adding small elements to a java.util.HashSet (or HashMap) rapidly without specifying an appropriate capacity is a recipe for disaster.
     Exhibited behavior: ~100% single-core CPU utilization. Very high GC activity – eden generation fills up rapidly, overflowing to old gen with FGCs in increasing frequency. Eventual OutOfMemoryError.
     Analysis: jstat -gcutil or tracking the graphs in VisualVM clearly exhibit high GC pressure and inability to clear up enough RAM on each GC cycle. Stack trace is hit-or-miss (may exhibit a thread very clearly working on HashMap.resize, but as likely to point at code generating the strings added to the HashSet). Heap dump will likely exhibit a HashSet with very high load factor and an appropriately high count of items in the map.
     Bonus points to whoever recognizes the exponential expansion scenario without looking at the code.
     Workaround: restart the service, and possibly set up a cron task to restart periodically. VM flags that watchdog OOM situations are useless because OOME fires way too late, if at all.
     Image source: http://pathogenomics.bham.ac.uk/blog/wp-content/uploads/You-Cant-Handle-the-Truth.jpg (A Few Good Men)
  8. Actual scenario details: memory leak as a result of saving substrings from web calls. Substring actually references the original string, or in practical terms the resulting HTML from each website is kept in active memory until a large-enough allocation fires OOME. GC isn’t immediately overwhelmed because it mostly has large, long-lived objects, so once memory is defragmented (a single FGC cycle) it just doesn’t have much it can do about the memory pressure.
     Exhibited behavior: service process dead. OOME clearly marked in the logs.
     Analysis: set the logs aside; restart the application to collect more information. jstat -gcutil output and a pre-failure thread dump should be saved as a matter of course, but are not useful in the analysis of this scenario. Either saving pre-failure heap dumps or turning on the -XX:+HeapDumpOnOutOfMemoryError JVM option is necessary to resolve this problem.
     Bonus points for legitimate scenario suggestions at this point, but cut it off very quickly as a good lesson in “not jumping ahead of our data.”
     Workaround: restart the service, and possibly set up a cron task to restart periodically. VM flags that watchdog OOM situations can help.
  9. 5-10 minutes.
     Image source: http://www.penny-arcade.com/comic/2002/02/22
  10. Image source: http://www.forensicinnovations.com/blog/wp-includes/images/mobilekit.jpg
  11. Image source: http://etc.usf.edu/clipart/28300/28384/brain_28384.htm (H. Newell Martin, The Human Body (New York: Henry Holt and Company, 1917) 145)
  12. Image source: http://www.ign.com/articles/2008/03/28/guitar-hero-aerosmith-first-look?page=2 (Guitar Hero: Aerosmith @ IGN)
  13. Follow through on evidence in the form of discussion (who used…? What did you find?) – don’t discount useful alternative theories!
      Forensic analysis: read through stack trace. Identify likely culprit. Bonus points to whoever yells “pathological regex” first.
  14. Heap dump actually has useful data (hashmap capacity, size and load factor) but it’s not likely anyone will notice that.
      Forensics: jstat -gcutil clearly shows GC storm. Stack trace to figure out the where, and common sense to figure out the why.
  15. Forensics: analyzing the heap dump will clearly evidence an abundance of suspiciously large strings; drilling into a couple of these will let us see that the strings kept in the heap are in fact full-blown HTML responses.
      Bonus points to whoever recognizes the substring scenario without looking at the code (nBA employees don’t count).
      Evidently this behavior was changed in Java 7u6: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4513622