Large Scale Enterprise Crash Dump Analysis
By Christoph Neumüller
Product Architect @ Dynatrace
Large Scale Enterprise Crash Dump Analysis
• The journey that led to a tool (SuperDump) that fully automates crash dump analysis
• How we reduced the time it takes to analyze a crash
dump from to
• How automation transformed our workflow
A story from 2014
• Peter (Customer): "We have a problem with your product. It's crashing."
• Steffi (Support): "Ok, please create a crash-dump and upload it to our support-portal."
• Peter (Customer): "Ok, here you go."
• Steffi (Support): "Development, this customer has [problem X], please have a look at these crash dumps."
• Luke (Development): "Oh, I've never done this before. Sarah, you have experience with this. Can you help?"
• Sarah (Development): "Sure. (Downloads 500 MB file). Oh, it's a Windows dump. Can't analyze that on my Linux box. Tom, can you take this?"
• Tom (Development): "Ok. (Downloads 500 MB file). Configures symbol server. Uses Visual Studio to see stack traces. Makes screenshots and attaches them to JIRA."
• Next day: Luke (dev): "Thanks Tom. I've almost got it. Can you find out this [detail X] for me?"
• Tom (dev): "Sigh. Loads dump again, this time in a different tool (WinDbg), as it allows deeper research. Finds [detail X]."
• Luke (dev): Finds and fixes problem.
• ...
• Next week: "Hey Tom, I have 20 new crash dumps, can you analyze them?"
• Tom: "Great Scott. We need to automate this."
Crash dump analysis?
Crash dump analysis
• A crash dump is:
• Windows: ".dmp" (FullDump, MiniDump)
• Linux: ".core" (Coredump)
• Crash dump analysis is like going back in time to inspect a certain event
• The goal is usually to find the faulting thread, the faulting stack frame and thus the line of code that caused the fault (e.g. access violation, segfault, ...)
• We're focused on native (C++) and managed (.NET) crash analysis
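The goal stated above (faulting thread, then faulting stack frame, then line of code) can be sketched in a few lines of Python. This is only an illustration: the `Frame` and `Thread` classes, the sample thread list and the `find_fault` helper are hypothetical and are not taken from SuperDump or from any debugger API.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Frame:
    module: str
    function: str
    source: Optional[str] = None  # "file:line" when symbols could resolve it


@dataclass
class Thread:
    thread_id: int
    frames: List[Frame] = field(default_factory=list)  # innermost frame first
    exception: Optional[str] = None  # e.g. "access violation"; None if none recorded


def find_fault(threads: List[Thread]) -> Optional[Tuple[Thread, Frame]]:
    """Return the first thread that recorded an exception and its innermost frame."""
    for thread in threads:
        if thread.exception and thread.frames:
            return thread, thread.frames[0]
    return None


# Hypothetical sample data standing in for what a dump would contain.
threads = [
    Thread(1, [Frame("ntdll", "NtWaitForSingleObject")]),
    Thread(15,
           [Frame("myapp", "Parser::Next", "parser.cpp:212"),
            Frame("myapp", "main", "main.cpp:40")],
           exception="access violation"),
]
fault = find_fault(threads)
```

The innermost frame is only a starting point. With memory corruptions the real cause often sits elsewhere, which is why deeper analysis is still needed for nasty crashes.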
Common tools for crash dump analysis (C++, .NET)
• Visual Studio: Easy. Basic analysis. Windows.
• DebugDiag: Easy. Emits HTML report. Windows.
• GDB: Intermediate. Advanced analysis. Linux.
• WinDbg: Hard. Advanced analysis. Windows.
An example: WinDbg
|. (status of the current process)
~15s (select thread 15)
k (native stack)
~* k (all native stacks)
lmf (show loaded modules)
.exr -1 (last exception)
.cordll -ve -u -l (get SOS loaded)
!clrstack (managed .NET stack)
~*e !clrstack (show all managed .NET stacks)
x *! (show symbol paths)
• Expert tool: very powerful, but hard to learn
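Commands like these can be replayed without an interactive session, because the console debugger cdb.exe accepts a dump file (`-z`), a symbol path (`-y`) and startup commands (`-c`). The Python sketch below only builds such an argument list. The file paths, the `build_cdb_args` helper and the choice of commands are assumptions for illustration, not the way SuperDump invokes the debugger.

```python
from typing import List

# A subset of the commands listed above; "q" quits cdb once they have run.
COMMANDS = ["|.", "~* k", "lmf", ".exr -1", "q"]


def build_cdb_args(dump_path: str, symbol_path: str,
                   cdb_exe: str = "cdb.exe") -> List[str]:
    """Build an argument list that replays COMMANDS against a dump file."""
    return [
        cdb_exe,
        "-z", dump_path,            # -z: open a crash dump file
        "-y", symbol_path,          # -y: symbol search path
        "-c", "; ".join(COMMANDS),  # -c: commands to execute at startup
    ]


# Hypothetical paths; the symbol server URL is Microsoft's public one.
args = build_cdb_args(
    r"C:\dumps\mydump.dmp",
    r"srv*C:\symbols*https://msdl.microsoft.com/download/symbols",
)
```

On a machine with the Debugging Tools for Windows installed, the list could be handed to `subprocess.run(args, capture_output=True, text=True)` and the captured output posted wherever results are collected.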
Crash dump analysis is time-consuming and sometimes hard
• Simple analysis needs preparation
• Tools installed
• Symbol servers properly configured
• Different tools required for Windows and Linux
• Simple analysis is repetitive
• Download crash dump
• Open tool (e.g. WinDbg)
• Find and list all stacks with exceptions
• Post results to JIRA
• Deep analysis is considered a "dark magic" art
• Nasty crashes are hard to crack (memory corruptions, deadlocks)
What was our problem in our story?
Our problems
• Experts required
• Multiple devs needed to be involved
• Although we had a few distinguished experts, not nearly all developers were experienced in crash dump analysis
• Workflow cumbersome
• Passing around large files (what about data security and retention?)
• Time effort
• Setting up and running an analysis is time-consuming. Expert time is wasted.
• How can we scale this?
• We want to become more proactive about bugs & crashes. Automatically capture every
crash from Test, Staging, Production (selected) & Support.
Our journey to automation
Step 1: Automate analysis
• SuperDump.Analyzer.exe (uses CLRMD)
• Text output
That’s cute. But does it
help productivity yet?
Step 2: Web frontend
• SuperDump.Analyzer.exe (CLRMD): reads the .dmp, writes result.json
• SuperDump.Service.exe (ASP.NET Core): web frontend for developers
• Hangfire (background jobs): https://github.com/HangfireIO/Hangfire
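Hangfire is a background job library for .NET, which suggests that uploaded dumps are queued and analyzed asynchronously while the web frontend stays responsive. The Python sketch below shows that general pattern with a queue and one worker thread. The `start_worker` and `fake_analyze` names and the result layout are hypothetical, and the real service is built on ASP.NET Core and Hangfire rather than anything shown here.

```python
import json
import queue
import threading
from typing import Callable, Dict


def start_worker(jobs: "queue.Queue", analyze: Callable[[str], dict],
                 results: Dict[str, str]) -> threading.Thread:
    """Process dump paths from the queue until a None sentinel arrives."""
    def run() -> None:
        while True:
            dump_path = jobs.get()
            if dump_path is None:
                jobs.task_done()
                break
            # Keep the analysis as JSON text, similar in spirit to a result.json.
            results[dump_path] = json.dumps(analyze(dump_path))
            jobs.task_done()

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker


def fake_analyze(dump_path: str) -> dict:
    # Stand-in for launching the real analyzer process on the dump.
    return {"dump": dump_path, "status": "analyzed"}


jobs: "queue.Queue" = queue.Queue()
results: Dict[str, str] = {}
worker = start_worker(jobs, fake_analyze, results)
jobs.put("mydump.dmp")   # what an upload handler would do
jobs.put(None)           # sentinel: stop the worker
jobs.join()              # wait until everything queued has been handled
```

The upload handler only enqueues work, so a large dump never blocks a web request. The frontend can show the stored result as soon as the worker has finished.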
Nice! Inexperienced people can analyze dumps without special tools and know-how.
It also helps non-Windows developers to assess crash dumps quickly!
Crash dumps can be referred to by URL:
https://superdump.acme.org/Home/Report?bundleId=zgi5110&dumpId=wkc9242
Step 3: Automate workflow
• SuperDump.Analyzer.exe (CLRMD): reads the .dmp, writes result.json
• SuperDump.Service.exe (ASP.NET Core, Hangfire): web frontend and REST API
• Clients: Developers, JIRA, Support, Tests
curl -X POST \
  --header 'Content-Type: application/json' \
  --header 'Accept: application/json' \
  -d '{ "url": "https://dumps.local/mydump.dmp" }' \
  'http://superdump.local/api/Dumps'
Response:
{
"location": "http://superdump.local/Home/BundleCreated?bundleId=czs6140",
"date": "Fri, 05 May 2017 20:13:04 GMT"
}
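A test job or support tool could issue the same request from code instead of curl. The Python sketch below builds the equivalent POST with the standard library. The host names and the `/api/Dumps` path come from the curl example above, while the `build_submit_request` helper is a hypothetical name and nothing is assumed about the response beyond what is shown.

```python
import json
import urllib.request


def build_submit_request(base_url: str, dump_url: str) -> urllib.request.Request:
    """Build the POST that asks the service to fetch and analyze a dump."""
    body = json.dumps({"url": dump_url}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/Dumps",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )


request = build_submit_request("http://superdump.local",
                               "https://dumps.local/mydump.dmp")
# Sending it is one more call: urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the sketch runnable without a SuperDump instance and makes the payload easy to check in a unit test.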
Awesome. Analysis is already finished by the time a dev gets involved.
But still not enough. What if I want to investigate a very special case? I want all the power of WinDbg. But in the browser...
Step 4: Allow deep analysis
• SuperDump.Analyzer.exe (CLRMD): writes result.json
• SuperDump.Service.exe (ASP.NET Core, Hangfire): web frontend and REST API
• cdb.exe (WinDbg): I/O redirected over WebSockets to jquery.console in the browser
• Clients: Developers, JIRA, Support, Tests
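The I/O redirect in Step 4 boils down to keeping a console debugger alive and relaying text lines between it and a remote client. The Python sketch below shows only the process side of that idea. The `ConsoleSession` class is hypothetical, a small echo script stands in for cdb.exe so the example runs anywhere, and the WebSocket transport is left out entirely.

```python
import subprocess
import sys
from typing import List


class ConsoleSession:
    """Keep a console program running and exchange text lines with it."""

    def __init__(self, argv: List[str]) -> None:
        self.process = subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,  # line-buffered pipes
        )

    def send(self, command: str) -> None:
        # Forward one command line, as typed in the browser console.
        self.process.stdin.write(command + "\n")
        self.process.stdin.flush()

    def read_line(self) -> str:
        # Read one line of output to relay back to the client.
        return self.process.stdout.readline().rstrip("\n")

    def close(self) -> None:
        self.process.stdin.close()
        self.process.wait()
        self.process.stdout.close()


# Stand-in for the debugger: echoes every line it receives.
ECHO = ("import sys\n"
        "for line in sys.stdin:\n"
        "    print('echo: ' + line.strip(), flush=True)\n")

session = ConsoleSession([sys.executable, "-u", "-c", ECHO])
session.send("~* k")
reply = session.read_line()
session.close()
```

A real debugger prints many lines per command and no fixed terminator, so a production relay would stream output continuously instead of reading one line per request.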
Wow. Now even deep investigations can be made
in the browser. No need for local tools anymore.
This is a game changer for non-Windows
developers.
Architecture with Linux analysis:
• SuperDump.Analyzer.exe (CLRMD): writes result.json
• SuperDump.Service.exe (ASP.NET Core, Hangfire): web frontend and REST API
• cdb.exe (WinDbg): I/O redirected over WebSockets to jquery.console in the browser
• Remote Docker (Linux): SuperDump.Analyzer.Linux.dll (libunwind), writes result.json
• Clients: Developers, JIRA, Support, Tests
Neat. No more Linux VMs necessary for Windows developers to debug Linux coredumps.
Linux Architecture
• SuperDump.Analyzer.exe (CLRMD): writes result.json
• SuperDump.Service.exe (ASP.NET Core, Hangfire): web frontend and REST API
• cdb.exe (WinDbg): I/O redirected over WebSockets to jquery.console in the browser
• Docker for Windows, Linux container: SuperDump.Analyzer.Linux.exe (libunwind), writes result.json
• Linux container: GDB, I/O redirected through gotty (remote TTY), https://github.com/yudai/gotty
• Clients: Developers, JIRA, Support, Tests
More goodness...
• LDAP Authentication & User Roles
• Audit Logging
• JIRA integration (backlink detection)
• Automatic data retention
• Slack notifications
• Similarity detection
• Elasticsearch storage (for indexing and search)
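Similarity detection is listed without detail, so the following is only one plausible way to compare two crashes: treat each stack trace as a set of frames and compute the Jaccard similarity. The function name, the sample stacks and the metric itself are assumptions for illustration and do not describe SuperDump's actual algorithm.

```python
from typing import List


def stack_similarity(a: List[str], b: List[str]) -> float:
    """Jaccard similarity of the frame sets of two stack traces (0.0 to 1.0)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty stacks are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical stacks, innermost frame first.
crash_1 = ["Parser::Next", "Parser::Run", "main"]
crash_2 = ["Parser::Next", "Parser::Run", "worker_main"]
crash_3 = ["Socket::Read", "Net::Poll", "worker_main"]

same_bug_score = stack_similarity(crash_1, crash_2)
other_bug_score = stack_similarity(crash_1, crash_3)
```

A set-based metric ignores frame order and recursion depth. Giving more weight to the innermost frames would be a natural refinement when duplicates need to be grouped more precisely.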
Demo Time
Demo
Automation transformed our workflow!
What changed by automatic crash dump analysis? (1)
• Speed
• Triaging crash dumps down from to !
• Enabling people
• Inexperienced people are capable of simple crash analysis
• No more local tools & setup required (all in the browser)
• Experts are no longer blocked as often
• Communication
• Referring to a crash via URL changed a lot. Can be referenced in JIRA, E-Mail, Slack.
Better than passing huge files around.
What changed by automatic crash dump analysis? (2)
• Security
• Files are kept in a secure location. Audit-log for access. Automatic retention.
• Scalability
• We can now assess every single crash dump from tests, from staging, from production.
• We can analyze 1000+ crash dumps per day.
• Quality improved
• Since analysis is easier, we are much more proactive and feed all available sources into SuperDump. This has increased our product quality.
SuperDump and Open Source
• Open-sourced in 2017 with permissive license (MIT):
https://github.com/Dynatrace/superdump
• Maintained and actively used at Dynatrace
• (not as a commercial product)
• Roadmap:
• Generic analyzer framework to enable not only crash dump analysis but also analysis of log files, Java hs_err_pid files, … (a.k.a. generic "dumps" of data)
• Kubernetize SuperDump (be able to scale analyzers up and down)
• Better clustering and visualization of duplicates
• Contributions and feedback are welcome ☺
Summary
• What crash dump analysis is and how we did it in 2014
• The journey to automation and how it led to SuperDump
• How automation via SuperDump transformed us
• This led to
• Analysis time down from to !
involved
quality through
Appendix
How to create a crash dump
• Windows Task Manager (manual, be aware of bitness!)
• Process Explorer (SysInternals, manual)
• ProcDump (SysInternals, can dump on crash!)
• Windows Error Reporting (automatic, if enabled)
• DebugDiag (automatic, if enabled)
• dbghelp.dll API (MiniDumpWriteDump, it’s on you!)
• Linux: Adapt “kernel.core_pattern”
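On Linux, `kernel.core_pattern` is a file name template with `%` specifiers such as `%e` (executable name), `%p` (PID) and `%t` (time of the dump as a Unix timestamp). The Python sketch below expands such a template to show which file name a given pattern would produce. The `/var/crash` path and the sample values are hypothetical, and the helper covers only plain file patterns, not the piped form that starts with `|`.

```python
from typing import Dict


def expand_core_pattern(pattern: str, values: Dict[str, str]) -> str:
    """Expand %-specifiers of a core_pattern template using the given values."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch != "%":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(pattern):
            break  # a single trailing % is dropped
        spec = pattern[i + 1]
        if spec == "%":
            out.append("%")  # %% is a literal percent sign
        else:
            out.append(values.get(spec, ""))  # unknown specifiers are dropped
        i += 2
    return "".join(out)


# Hypothetical pattern and process values.
name = expand_core_pattern(
    "/var/crash/core.%e.%p.%t",
    {"e": "myapp", "p": "4242", "t": "1493999584"},
)
```

Such a pattern would typically be set with `sysctl -w kernel.core_pattern=...`. Including the executable name and PID keeps dumps from different processes from overwriting each other.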

PAC 2019 virtual Christoph NEUMÜLLER
