Java Production Debugging 101
A Reversim Summit Lab, February 2013

Lab: JVM Production Debugging 101

A lab given at the Reversim Summit on 19 February 2013.
http://summit2013.reversim.com/#/sessions/Lab:%20Java%20Production%20Debugging%20101
The code for the sample scenarios can be found on GitHub: https://github.com/holograph/examples/tree/master/reversim-proddbg-lab

Published in: Technology

  • Picture source: CSI Las Vegas (http://flowtv.org/wp-content/uploads/2007/11/csi3.jpg)
  • Image source: http://www.about-larnaca.info/2012/06/thief-is-caught-red-handed-in-kiti.html
  • Invite discussion. Ask the audience to point out different data that is (a) useful and (b) readily accessible. Limit to 3 minutes.
    Image source: http://lets-rap.com/wp-content/uploads/2011/05/house-md-d-house-md-1048019_1152_864.jpg (copyright Fox)
  • Expound a bit on anything that hasn’t been raised in the earlier discussion. Limit to 2 minutes, less if possible.
  • “All this and more” sales pitch. Mention the profiler.
  • Actual scenario details: pathological regular expression in a service (http://swtch.com/~rsc/regexp/regexp1.html).
    Exhibited behavior: very high single-core CPU utilization. Little or no GC activity. Possible StackOverflowError if left running long enough.
    Analysis: the stack trace will show a very deep, repetitive call stack into java.util.regex.* classes. Package and class names will indicate a regex issue.
    Bonus points to whoever recognizes the pathological regex scenario without looking at the code, beyond the “problematic regex” axiom.
    Workaround: depends on the details; in some cases the offending input can be deleted or routed to a dead letter queue. In this case there is none: this has to be resolved at the algorithmic level.
    Image source: http://blog.rogersbroadcasting.com/billhart/2010/01/19/tuesday-january-19th-2010-dialing-miscue-avatar-can-kill-you-look-where-kellie-flew/
  • Actual scenario details: GC storm as a result of an exponentially-growing HashSet. Adding small elements to a java.util.HashSet (or HashMap) rapidly without specifying an appropriate capacity is a recipe for disaster.
    Exhibited behavior: ~100% single-core CPU utilization. Very high GC activity: the eden space fills up rapidly, overflowing to the old generation with full GCs (FGCs) at increasing frequency. Eventual OutOfMemoryError.
    Analysis: jstat -gcutil or tracking the graphs in VisualVM clearly shows high GC pressure and an inability to free enough RAM on each GC cycle. The stack trace is hit-or-miss (it may show a thread very clearly working on HashMap.resize, but is as likely to point at the code generating the strings added to the HashSet). The heap dump will likely show a HashSet with very high load factor and an appropriately high count of items in the map.
    Bonus points to whoever recognizes the exponential expansion scenario without looking at the code.
    Workaround: restart the service, and possibly set up a cron task to restart it periodically. VM flags that watchdog OOM situations are useless because the OOME fires way too late, if at all.
    Image source: http://pathogenomics.bham.ac.uk/blog/wp-content/uploads/You-Cant-Handle-the-Truth.jpg (A Few Good Men)
  • Actual scenario details: memory leak as a result of saving substrings from web calls. A substring actually references the original string; in practical terms, the resulting HTML from each website is kept in active memory until a large-enough allocation fires an OOME. GC isn’t immediately overwhelmed because the heap mostly holds large, long-lived objects, so once memory is defragmented (a single FGC cycle) there isn’t much more it can do about the memory pressure.
    Exhibited behavior: service process dead. OOME clearly marked in the logs.
    Analysis: save the logs aside; restart the application to collect more information. jstat -gcutil output and a pre-failure thread dump should be saved aside as a matter of course, but are not useful in the analysis of this scenario. Either saving pre-failure heap dumps or turning on the -XX:+HeapDumpOnOutOfMemoryError JVM option is necessary to resolve this problem.
    Bonus points for legitimate scenario suggestions at this point, but cut it off very quickly as a good lesson in “not jumping ahead of our data.”
    Workaround: restart the service, and possibly set up a cron task to restart it periodically. VM flags that watchdog OOM situations can help.
  • 5-10 minutes.
    Image source: http://www.penny-arcade.com/comic/2002/02/22
  • Image source: http://www.forensicinnovations.com/blog/wp-includes/images/mobilekit.jpg
  • Image source: http://etc.usf.edu/clipart/28300/28384/brain_28384.htm
    H. Newell Martin, The Human Body (New York: Henry Holt and Company, 1917) 145
  • Image source: http://www.ign.com/articles/2008/03/28/guitar-hero-aerosmith-first-look?page=2 (Guitar Hero: Aerosmith @ IGN)
  • Follow through on the evidence in the form of a discussion (who used…? What did you find?). Don’t discount useful alternative theories!
    Forensic analysis: read through the stack trace. Identify the likely culprit. Bonus points to whoever yells “pathological regex” first.
  • The heap dump actually has useful data (HashMap capacity, size and load factor) but it’s not likely anyone will notice that.
    Forensics: jstat -gcutil clearly shows the GC storm. Use the stack trace to figure out the where, and common sense to figure out the why.
  • Forensics: analyzing the heap dump will clearly show an abundance of suspiciously large strings; drilling into a couple of these will let us see that the strings kept in the heap are in fact full-blown HTML responses.
    Bonus points to whoever recognizes the substring scenario without looking at the code (nBA employees don’t count).
    Evidently this behavior was changed in Java 7u6: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4513622
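
The substring leak described in the Scenario 3 notes can be sidestepped by copying the extracted text before storing it. The following is a minimal sketch, not the lab's actual code: the class name `SubstringCopy` and the method `detach` are hypothetical. It assumes a JVM older than Java 7u6, where `substring` shares the parent string's character array; on later JVMs `substring` already copies, so the extra `new String(...)` is merely redundant.

```java
// Sketch only: SubstringCopy and detach are hypothetical names, not taken from the lab code.
final class SubstringCopy {

    // Returns the characters in [begin, end) as a String with its own backing storage,
    // so that keeping the result does not keep the whole page reachable on old JVMs.
    static String detach(String page, int begin, int end) {
        return new String(page.substring(begin, end));
    }

    public static void main(String[] args) {
        String page = "<html><title>Reversim</title></html>";
        int begin = page.indexOf("<title>") + "<title>".length();
        int end = page.indexOf("</title>");
        System.out.println(detach(page, begin, end));
    }
}
```

A heap dump taken after such a change should show many small strings rather than a few page-sized ones.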

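The Scenario 2 notes blame rapid insertions into a HashSet created without an appropriate capacity. As a rough illustration, assuming the number of elements can be estimated up front, the set can be presized so that it does not repeatedly resize and rehash while filling. `PresizedSets`, `capacityFor` and `newSetFor` are hypothetical names; 0.75 is the documented default load factor of `java.util.HashSet`. Presizing does not help if the set grows without bound.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch only: hypothetical helper, not part of the lab code.
final class PresizedSets {

    // Default load factor documented for java.util.HashSet and HashMap.
    static final double DEFAULT_LOAD_FACTOR = 0.75;

    // Smallest initial capacity that holds expectedElements without triggering a resize.
    static int capacityFor(int expectedElements) {
        return (int) Math.ceil(expectedElements / DEFAULT_LOAD_FACTOR);
    }

    // Creates an empty set sized for the expected number of elements.
    static <T> Set<T> newSetFor(int expectedElements) {
        return new HashSet<>(capacityFor(expectedElements));
    }
}
```
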
    1. Java Production Debugging 101
       A Reversim Summit Lab, February 2013
    2. PRODUCTION DEBUGGING = FORENSICS
    3. Business Requirements

       Requirements       Prod. Debugging     Forensics
       Timeframe          Severely limited    Hours, days, weeks…
       Chain of custody   Meaningless         Sacred
       Documentation      Useful              Sacred

    4. Endgame
       Production Debugging:
         1. Gather evidence
         2. Restore functionality
       Forensics:
         1. Identify crime in progress
         2. Gather evidence
         3. Figure out what happened
    5. Our Forensic Process
       Gather Evidence → Restore Production → Analyze Findings → Implement Solution → Post-Mortem
    6. Evidence toolchain
    7. WHAT SHALL WE COLLECT?
    8. Our focus points for today
       • Thread dump
       • Heap dump
       • VM (especially GC) metrics
       • System metrics
       • Logs
    9. jstack
       • Minimalistic tool
       • Against a running process: jstack <pid>
       • Outputs to stdout
       • Identifies deadlocks
    10. jmap
        • Heap dump from a running process
          – Lengthy process
          – Freezes VM
        • Some extras
        • Command: jmap -dump:format=b,file=<output> <pid>
    11. jstat
        • JVM metrics: classloader, JIT, GC
        • Tracking over time
        • Console-based
        • jstat -gcutil <pid> 5s
    12. The JVM GC
    13. jvisualvm
        • Combines most of the above, with GUI
        • Remote via X11 forwarding (dreadful!)
    14. So… SHALL WE DANCE?
    15. Scenario 1
        • Phone call in the middle of the night
          – “The application is stuck!”
        • What do you do?
    16. Scenario 2
        • Looks familiar?
          – “The application is crawling to a halt!”
          – “So restart it.”
          – “OK, it's good now.”
        • This is a lie.
          – You will get another call.
    17. Scenario 3
        • 1st tier support engineer (maybe you?) calls:
          – “I get OutOfMemoryExceptions on this service.”
          – “Restart it.”
          – “Already have. Happened again.”
          – “Well, shit.”
    18. BREAK TIME!
    19. Without further ado… FORENSIC TOOLCHAIN
    20. GNU toolchain is your friend
        • bash, ps, grep, less, awk
          – 'nuff said
        • … or:
          – http://gnuwin32.sourceforge.net/
    21. MAT
        • Eclipse plugin/standalone
        • Reads heap dumps
        • Easy drill-down
    22. And most important…
    23. RESOLUTION TIME!
    24. Back to: Scenario 1
        • What did we gather?
          – CPU: 100% single-core utilization
          – GC metrics: no useful data
          – Heap dump: no useful data
          – Thread dump
            • java.util.regex * gazillion
        • Where the problem is implies… what the problem is
    25. Back to: Scenario 2
        • What did we gather?
          – CPU: 100% single-core utilization
          – Heap dump: no useful data
          – Thread dump
          – GC metrics
            • Frequent, long GCs (GC, FGC, FGCT)
        • Rapid HashMap insertions: recipe for disaster
    26. Back to: Scenario 3
        • What did we gather?
          – CPU: low utilization
          – Thread dump: no useful data
          – GC metrics: high heap utilization, low GC
          – Heap dump
            • Predictably high number of strings
            • Strings are abnormally large
            • Strings contain entire HTML subset!
        • Substring/regex can be dangerous!
    27. Headache? Take two of these! AFTERWORD
    28. Adieu
        • Thank you for attending!
        • Presentation and demos: http://git.io/7LK4fw
        • Tomer Gabel
          – tomer@tomergabel.com
          – http://www.tomergabel.com/
          – @tomerg
    29. Thank you to our sponsors
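
The deadlock detection that the jstack slide mentions is also exposed inside the JVM through the standard `java.lang.management` API. The sketch below shows one way a service could report deadlocked threads about itself, for example from a health-check endpoint. The class name `DeadlockProbe` and its method are hypothetical; the MXBean calls are standard platform APIs.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Sketch only: hypothetical helper that mirrors the deadlock section of a thread dump.
final class DeadlockProbe {

    // Returns one descriptive line per deadlocked thread, or an empty list if there are none.
    static List<String> deadlockedThreadNames() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // findDeadlockedThreads also covers java.util.concurrent locks where the VM supports it.
        long[] ids = threads.isSynchronizerUsageSupported()
                ? threads.findDeadlockedThreads()
                : threads.findMonitorDeadlockedThreads();
        List<String> names = new ArrayList<>();
        if (ids == null) {
            return names; // null means no deadlock was found
        }
        for (ThreadInfo info : threads.getThreadInfo(ids)) {
            if (info != null) { // a thread may have exited in the meantime
                names.add(info.getThreadName() + " waiting on " + info.getLockName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("Deadlocked threads: " + deadlockedThreadNames());
    }
}
```

Unlike jstack, this runs inside the affected process, so it only helps while the application can still execute the check.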

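The collection counts and times that the jstat slide tracks over time can also be read from within the process via the standard `GarbageCollectorMXBean`. This is a minimal sketch under the assumption that periodic in-process logging is acceptable; `GcSnapshot` is a hypothetical name, and the values only roughly correspond to the count and time columns of jstat -gcutil (it does not report space utilization percentages).

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Sketch only: hypothetical helper for logging cumulative GC activity.
final class GcSnapshot {

    // Returns one line per collector with its cumulative collection count and time.
    // Either value is -1 when the collector does not report it.
    static List<String> snapshot() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : snapshot()) {
            System.out.println(line);
        }
    }
}
```

Logging such a snapshot at a fixed interval gives a history to compare against when the next call comes in.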