Troubleshooting JVM Outages: 3 Fortune 500 Case Studies
Ram Lakshmanan, Architect, yCrash
Slowdown
Major Financial Institution in N. America
Analysis Report: https://tinyurl.com/5da3ft8z
360° Troubleshooting artifacts
Open-source script: https://github.com/ycrash/yc-data-script
1. GC Log
2. Thread Dump
3. Heap Dump
4. Heap Substitute
5. top
6. ps
7. top -H
8. Disk Usage
9. dmesg
10. netstat
11. ping
12. vmstat
13. iostat
14. Kernel Params
15. App Logs
16. metadata
./yc -p <PROCESS_ID>
What is a Thread Dump?
Threads: Blood System of Your Application
https://blog.fastthread.io/understanding-java-thread-dumps/
A thread dump contains:
1. Timestamp at which the thread dump was triggered (e.g. 2023-09-16 17:13:23)
2. JVM version info
3. Thread details (fields listed below)

Each thread's details contain:
1. Thread Name - InvoiceThread-A996
2. Priority - Can have values from 1 to 10
3. Thread Id - 0x00002b7cfc6fb000 - Unique ID assigned by the JVM. In HotSpot dumps this tid value is the address of the JVM's internal thread structure; the value returned by Thread.getId() is printed separately as #<number> in JDK 8 and later.
4. Native Id - 0x4479 - This ID is highly platform dependent. On Linux, it is the ID of the lightweight process (LWP) backing the thread, which 'top -H' shows in its PID column. On Windows, it is simply the OS-level thread ID within a process. On Mac OS X, it is said to be the native pthread_t value.
5. Address space - 0x00002b7d17ab8000
6. Thread State - RUNNABLE
7. Stack trace
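As a rough illustration of the fields above, the following sketch parses a single thread header line with a regular expression. The sample line is hypothetical: it is assembled from the values on this slide in the layout HotSpot typically prints, and the `#500`, `daemon`, and `os_prio=0` parts are placeholders. The names `HEADER_RE` and `parse_thread_header` are made up for this example.

```python
import re

# Hypothetical header line built from the slide's values (not from the original dump).
SAMPLE = ('"InvoiceThread-A996" #500 daemon prio=5 os_prio=0 '
          'tid=0x00002b7cfc6fb000 nid=0x4479 runnable [0x00002b7d17ab8000]')

HEADER_RE = re.compile(
    r'^"(?P<name>[^"]+)"'                 # thread name in double quotes
    r'.*?\bprio=(?P<prio>\d+)'            # Java priority, 1 to 10
    r'.*?\btid=(?P<tid>0x[0-9a-fA-F]+)'   # tid value as printed by the JVM
    r'\s+nid=(?P<nid>0x[0-9a-fA-F]+)'     # native (OS-level) thread id, hex
    r'\s+(?P<state>.*?)'                  # short state text, e.g. "runnable"
    r'\s*\[(?P<addr>0x[0-9a-fA-F]+)\]$'   # address printed in brackets
)


def parse_thread_header(line):
    """Return the header fields as a dict, or None if the line does not match."""
    m = HEADER_RE.match(line.strip())
    if m is None:
        return None
    return {
        "name": m.group("name"),
        "priority": int(m.group("prio")),
        "tid": m.group("tid"),
        "nid": m.group("nid"),
        # Decimal form of the native id, handy when comparing with OS tools.
        "nid_decimal": int(m.group("nid"), 16),
        "state": m.group("state"),
        "address": m.group("addr"),
    }


if __name__ == "__main__":
    print(parse_thread_header(SAMPLE))
```

Real dumps vary by JVM vendor and version, so a pattern like this would likely need adjusting before use on production dumps.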
How to analyze a thread dump?
IBM TMDA: https://www.ibm.com/support/pages/ibm-thread-and-monitor-dump-analyzer-java-tmda
FastThread: https://fastthread.io/
yCrash: https://ycrash.io/
Sample thread report: https://tinyurl.com/wq95weo
How To Do Thread Dump Analysis
https://youtu.be/MFOPb3PZXPA?si=FDSwEj0a5adhoW5f
Poor Response Time
Major Cloud Service Provider
Blog: https://blog.gceasy.io/garbage-collection-tuning-success-story-reducing-young-gen-size/
What is Garbage?
An HTTP request creates objects in memory; once they are no longer needed, those objects become garbage.
How are objects Garbage Collected?
Evolution: Manual -> Automatic
3 decades ago: the developer wrote code to manually evict garbage.
Now: the JVM automatically evicts garbage.
Automatic GC sounds good, right?
Yes, but for:
GC pauses
CPU consumption
Sample GC Log
2019-08-31T01:09:19.397+0000: 1.606: [GC (Metadata GC Threshold) [PSYoungGen: 545393K->18495K(2446848K)] 545393K->18519K(8039424K), 0.0189376 secs] [Times: user=0.15 sys=0.01, real=0.02 secs]
2019-08-31T01:09:19.416+0000: 1.625: [Full GC (Metadata GC Threshold) [PSYoungGen: 18495K->0K(2446848K)] [ParOldGen: 24K->17366K(5592576K)] 18519K->17366K(8039424K), [Metaspace: 20781K->20781K(1067008K)], 0.0416162 secs] [Times: user=0.38 sys=0.03, real=0.04 secs]
2019-08-31T01:18:19.288+0000: 541.497: [GC (Metadata GC Threshold) [PSYoungGen: 1391495K->18847K(2446848K)] 1408861K->36230K(8039424K), 0.0568365 secs] [Times: user=0.31 sys=0.75, real=0.06 secs]
2019-08-31T01:18:19.345+0000: 541.554: [Full GC (Metadata GC Threshold) [PSYoungGen: 18847K->0K(2446848K)] [ParOldGen: 17382K->25397K(5592576K)] 36230K->25397K(8039424K), [Metaspace: 34865K->34865K(1079296K)], 0.0467640 secs] [Times: user=0.31 sys=0.08, real=0.04 secs]
2019-08-31T02:33:20.326+0000: 5042.536: [GC (Allocation Failure) [PSYoungGen: 2097664K->11337K(2446848K)] 2123061K->36742K(8039424K), 0.3298985 secs] [Times: user=0.00 sys=9.20, real=0.33 secs]
2019-08-31T03:40:11.749+0000: 9053.959: [GC (Allocation Failure) [PSYoungGen: 2109001K->15776K(2446848K)] 2134406K->41189K(8039424K), 0.0517517 secs] [Times: user=0.00 sys=1.22, real=0.05 secs]
2019-08-31T05:11:46.869+0000: 14549.079: [GC (Allocation Failure) [PSYoungGen: 2113440K->24832K(2446848K)] 2138853K->50253K(8039424K), 0.0392831 secs] [Times: user=0.02 sys=0.79, real=0.04 secs]
2019-08-31T06:26:10.376+0000: 19012.586: [GC (Allocation Failure) [PSYoungGen: 2122496K->25600K(2756096K)] 2147917K->58149K(8348672K), 0.0371416 secs] [Times: user=0.01 sys=0.75, real=0.04 secs]
2019-08-31T07:50:03.442+0000: 24045.652: [GC (Allocation Failure) [PSYoungGen: 2756096K->32768K(2763264K)] 2788645K->72397K(8355840K), 0.0709641 secs] [Times: user=0.16 sys=1.39, real=0.07 secs]
2019-08-31T09:04:21.406+0000: 28503.616: [GC (Allocation Failure) [PSYoungGen: 2763264K->32768K(2733568K)] 2802893K->83469K(8326144K), 0.0789178 secs] [Times: user=0.12 sys=1.59, real=0.08 secs]
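To make the fields in these log entries concrete, here is a minimal sketch that extracts the GC type, cause, heap sizes, pause duration, and user/sys/real times from one entry. It handles only the single-line ParallelGC-style layout shown in this sample; other collectors and unified logging use different layouts. The names `GC_RE` and `parse_gc_line` are made up for this example.

```python
import re

# One entry copied from the sample log above, joined onto a single line.
LINE = ("2019-08-31T02:33:20.326+0000: 5042.536: [GC (Allocation Failure) "
        "[PSYoungGen: 2097664K->11337K(2446848K)] 2123061K->36742K(8039424K), "
        "0.3298985 secs] [Times: user=0.00 sys=9.20, real=0.33 secs]")

GC_RE = re.compile(
    r'^(?P<ts>\S+): (?P<uptime>[\d.]+): '                    # wall clock and JVM uptime
    r'\[(?P<kind>Full GC|GC) \((?P<cause>[^)]+)\)'           # GC type and cause
    r'.*?(?P<before>\d+)K->(?P<after>\d+)K\((?P<total>\d+)K\), '  # whole-heap sizes
    r'(?:\[Metaspace: [^\]]*\], )?'                          # present on Full GC entries
    r'(?P<pause>[\d.]+) secs\] '                             # pause duration
    r'\[Times: user=(?P<user>[\d.]+) sys=(?P<sys>[\d.]+), '
    r'real=(?P<real>[\d.]+) secs\]'
)


def parse_gc_line(line):
    """Return the parsed fields as a dict, or None if the line does not match."""
    m = GC_RE.match(line.strip())
    if m is None:
        return None
    return {
        "timestamp": m.group("ts"),
        "uptime_secs": float(m.group("uptime")),
        "kind": m.group("kind"),
        "cause": m.group("cause"),
        "heap_before_kb": int(m.group("before")),
        "heap_after_kb": int(m.group("after")),
        "heap_total_kb": int(m.group("total")),
        "pause_secs": float(m.group("pause")),
        "user_secs": float(m.group("user")),
        "sys_secs": float(m.group("sys")),
        "real_secs": float(m.group("real")),
    }


if __name__ == "__main__":
    entry = parse_gc_line(LINE)
    print(entry["kind"], entry["cause"], entry["pause_secs"],
          entry["user_secs"], entry["sys_secs"], entry["real_secs"])
```

Parsed this way, the 02:33:20 entry shows a sys time (9.20) far above its user time (0.00), which is easy to miss when reading the raw text.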
How to analyze GC Log?
IBM GC & Memory visualizer: https://developer.ibm.com/javasdk/tools/
GCeasy: https://gceasy.io/
yCrash: https://ycrash.io/
Google Garbage cat (cms): https://code.google.com/archive/a/eclipselabs.org/p/garbagecat
HP Jmeter: https://h20392.www2.hpe.com/portal/swdepot/displayProductInfo.do?productNumber=HPJMETER
GC Throughput
What is 96% GC Throughput?
96% GC throughput means the application spends 96% of its time processing customer transactions and the remaining 4% of its time processing GC activities.
1 day has 1440 minutes (i.e. 24 hours x 60 minutes).
4% of 1440 minutes = 57.6 minutes per day, per JVM, spent in GC pauses.
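The arithmetic above can be checked with a few lines of code. This is a small sketch, assuming throughput is defined as the share of elapsed time not spent in GC pauses; the function names are made up for this example.

```python
MINUTES_PER_DAY = 24 * 60  # 1440


def gc_minutes_per_day(throughput_pct):
    """Minutes per day spent on GC for a given GC throughput percentage."""
    return round((100.0 - throughput_pct) / 100.0 * MINUTES_PER_DAY, 1)


def gc_throughput_pct(pause_secs, elapsed_secs):
    """Throughput as the percentage of elapsed time not spent in GC pauses."""
    paused = sum(pause_secs)
    return 100.0 * (elapsed_secs - paused) / elapsed_secs


if __name__ == "__main__":
    print(gc_minutes_per_day(96.0))          # 4% of 1440 minutes
    print(gc_throughput_pct([4.0], 100.0))   # 4 s of pauses in a 100 s window
```

The second function takes the pause durations from a GC log and the length of the observation window, which is how a throughput figure such as 96% would be derived in the first place.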
How to Tune GC Performance?
https://www.youtube.com/watch?v=udFJm7u0Pv0
More GC Tuning case studies
Uber Saves Millions of $
https://blog.gceasy.io/2022/03/04/garbage-collection-tuning-success-story-reducing-young-gen-size/
Large Automobile Manufacturer Improves Response Time
https://blog.gceasy.io/2022/03/04/garbage-collection-tuning-success-story-reducing-young-gen-size/
CloudBees (Jenkins Parent company) optimizes
https://blog.gceasy.io/2019/08/01/cloudbees-gc-performance-optimized-with-gceasy/
Oracle optimizes App performance by tuning GC
https://blog.gceasy.io/2022/12/06/oracle-architect-optimizes-performance-using-gceasy/
Large SaaS company CEO’s tweet
Intermittent HTTP 502 Errors
Major Travel Service Provider
EBS Architecture
Clue: Nginx Error
CPU Spike (Bonus)
Major Trading App in N. America
https://blog.fastthread.io/2020/04/23/troubleshooting-cpu-spike-in-a-major-trading-applicati
We might have used ‘top’
Secret option: ‘top -H -p <PROCESS_ID>’
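'top -H' lists each thread's ID in decimal, while a thread dump prints the native ID (nid) in hexadecimal, so matching a hot thread to its stack trace needs a conversion. The sketch below shows one way to do that. The dump text is hypothetical: it reuses the thread name and nid from the earlier slide, and the second thread is invented. The function names are made up for this example.

```python
import re

# Hypothetical two-thread dump excerpt; only the first name and nid come from the deck.
SAMPLE_DUMP = (
    '"InvoiceThread-A996" #500 prio=5 os_prio=0 tid=0x00002b7cfc6fb000 '
    'nid=0x4479 runnable [0x00002b7d17ab8000]\n'
    '"OtherThread-1" #501 prio=5 os_prio=0 tid=0x00002b7cfc6fc000 '
    'nid=0x447a runnable [0x00002b7d17bb9000]\n'
)

HEADER_NID_RE = re.compile(r'^"([^"]+)".*\bnid=(0x[0-9a-fA-F]+)')


def to_nid(top_thread_id):
    """Convert the decimal thread ID shown by 'top -H' to the dump's hex form."""
    return hex(top_thread_id)


def find_thread(dump_text, top_thread_id):
    """Return the name of the thread whose nid equals the given decimal ID."""
    for line in dump_text.splitlines():
        m = HEADER_NID_RE.match(line)
        if m and int(m.group(2), 16) == top_thread_id:
            return m.group(1)
    return None


if __name__ == "__main__":
    hot_thread_id = 17529  # decimal ID as 'top -H' would list it
    print(to_nid(hot_thread_id), find_thread(SAMPLE_DUMP, hot_thread_id))
```

Comparing the IDs as integers, rather than as strings, avoids mismatches caused by letter case or leading zeros in the hex value.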
Ram Lakshmanan ram@tier1app.com
@tier1app https://www.linkedin.com/company/ycrash
This deck will be published in:
https://blog.ycrash.io
THANK YOU
FRIENDS

