PAC 2019 virtual Christoph NEUMÜLLER
1. Large Scale Enterprise Crash Dump Analysis
By Christoph Neumüller
Product Architect @ Dynatrace
2. Large Scale Enterprise Crash Dump Analysis
• The journey that led to a tool
(SuperDump) that fully automates crash dump analysis
• How we reduced the time it takes to analyze a crash
dump from to
• How automation transformed our workflow
4. A story from 2014
• Peter (Customer): "We have a problem with your product. It's crashing."
• Steffi (Support): "Ok, please create a crash-dump and upload it to our support-portal."
• Peter (Customer): "Ok, here you go."
• Steffi (Support): "Development, this customer has [problem X], please have a look at these crash dumps."
• Luke (Development): "Oh, I've never done this before. Sarah, you have experience with this. Can you help?"
• Sarah (Development): "Sure. (Downloads 500 MB file). Oh, it's a Windows dump. Can't analyze that on my Linux box. Tom, can you take this?"
• Tom (Development): "Ok. (Downloads 500MB file). Configures symbol server. Uses Visual Studio to see stacktraces. Makes screenshots and
attaches them to JIRA."
• Next day: Luke (dev): "Thanks Tom. I've almost got it. Can you find out this [detail X] for me?"
• Tom (dev): "Sigh. Loads dump again, this time in a different tool (WinDbg), as it allows deeper research. Finds [detail X]."
• Luke (dev): Finds and fixes problem.
• ...
• Next week: "Hey Tom, I have 20 new crash dumps, can you analyze them?"
• Tom: "Great Scott. We need to automate this."
6. Crash dump analysis
• A crash dump is:
• Windows: „.dmp“ (FullDump, MiniDump)
• Linux: „.core“ (Coredump)
• Crash dump analysis is like going back in time to inspect a certain event
• The goal is usually to find the faulting thread, the faulting stack frame and
thus the line of code that caused the fault (e.g. access violation, segfault, ...)
• We're focused on native (C++) and managed (.NET) crash analysis
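Because Windows and Linux dumps need different tools, a first automation step is to tell them apart. The following is a minimal sketch, not SuperDump's actual code: it assumes that user-mode Windows dumps written in the minidump format start with the signature `MDMP`, and that Linux core dumps are ELF files whose `e_type` field equals `ET_CORE` (4). The function name and the returned labels are made up for illustration.

```python
import struct

ET_CORE = 4  # ELF e_type value that marks a core file


def detect_dump_kind(header: bytes) -> str:
    """Classify a crash dump from its first bytes (18 bytes are enough)."""
    if header[:4] == b"MDMP":
        return "windows-minidump-format"
    if header[:4] == b"\x7fELF" and len(header) >= 18:
        # EI_DATA (byte 5) gives the byte order: 1 = little-endian, 2 = big-endian.
        byte_order = "<" if header[5] == 1 else ">"
        # e_type is a 16-bit field at offset 16 of the ELF header.
        (e_type,) = struct.unpack_from(byte_order + "H", header, 16)
        return "linux-coredump" if e_type == ET_CORE else "elf-not-core"
    return "unknown"
```

In practice one would read the first bytes of an uploaded file, for example `detect_dump_kind(open(path, "rb").read(18))`, and route the dump to the matching analyzer.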
8. An example: WinDbg
|. (status about process)
~15s (select thread 15)
k (native stack)
~* k (all native stacks)
lmf (show loaded modules)
.exr -1 (last exception)
.cordll -ve -u -l (get SOS loaded)
!clrstack (managed .net stack)
~*e !clrstack (show all managed .net stacks)
x *! (show symbol paths)
• Expert tool: very powerful, but hard to learn
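The commands above can also be run without the GUI. The sketch below builds a command line for `cdb.exe`, the console debugger that ships alongside WinDbg, using its `-z` (dump file), `-y` (symbol path) and `-c` (initial command) options. It is an illustration only and has not been run against a real debugger here: the choice of commands, the helper names and the assumption that `cdb.exe` is on the PATH are all hypothetical, and it is not how SuperDump itself is implemented.

```python
import subprocess

# A small subset of the commands from the slide, run in order.
TRIAGE_COMMANDS = ["|", ".exr -1", "lmf", "~* k"]


def build_cdb_args(dump_path, symbol_path=None, cdb_exe="cdb.exe"):
    """Build the argument list for a non-interactive cdb session."""
    # Commands are separated by ';' and the session ends with 'q' (quit).
    script = "; ".join(TRIAGE_COMMANDS + ["q"])
    args = [cdb_exe, "-z", str(dump_path)]
    if symbol_path:
        args += ["-y", symbol_path]
    args += ["-c", script]
    return args


def run_triage(dump_path, symbol_path=None):
    """Run cdb on the dump and return its text output (requires cdb.exe)."""
    result = subprocess.run(
        build_cdb_args(dump_path, symbol_path),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

Calling `run_triage("crash.dmp")` would return the debugger output as text, which could then be attached to a ticket instead of screenshots.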
9. Crash dump analysis is time-consuming and sometimes hard
• Simple analysis needs preparation
• Tools installed
• Symbol servers properly configured
• Different tools required for Windows and Linux
• Simple analysis is repetitive
• Download crashdump
• Open tool (e.g. WinDbg)
• Find and list all stacks with exceptions
• Post results to JIRA
• Deep analysis is considered „dark magic“ art
• Nasty crashes are hard to crack (memory corruptions, deadlocks)
11. Our problems
• Experts required
• Multiple devs needed to be involved
• Although we had a few distinguished experts, not nearly all developers were
experienced in crash dump analysis
• Workflow cumbersome
• Passing around large files (what about data security and retention?)
• Time effort
• Setting up and running an analysis is time-consuming. Expert time is wasted.
• How can we scale this?
• We want to become more proactive about bugs & crashes. Automatically capture every
crash from Test, Staging, Production (selected) & Support.
20. Step 3: Automate workflow
It also helps non-Windows developers to quickly
assess crash dumps!
Nice! Non-experienced people can analyze dumps
without special tools and know-how.
Crash dumps can be referred to by URL
https://superdump.acme.org/Home/Report?bundleId=zgi5110&dumpId=wkc9242
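A report link like the one above is easy to generate and parse from other tools, for example a bot that posts links into tickets. The sketch below only assumes the URL shape shown on the slide (`/Home/Report` with `bundleId` and `dumpId` query parameters); the helper names are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def report_url(base, bundle_id, dump_id):
    """Build a report link from a base URL and the two identifiers."""
    query = urlencode({"bundleId": bundle_id, "dumpId": dump_id})
    return f"{base.rstrip('/')}/Home/Report?{query}"


def parse_report_url(url):
    """Extract (bundleId, dumpId) from a report link."""
    params = parse_qs(urlsplit(url).query)
    return params["bundleId"][0], params["dumpId"][0]
```

With this, `parse_report_url(report_url(base, b, d))` returns `(b, d)` again, so a link pasted into JIRA, e-mail or Slack can be mapped back to the dump it refers to.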
22. Awesome. Analysis is already finished by the time a
dev gets involved.
But still not enough. What if I want to investigate a
very special case. I want all the power of WinDbg.
But in the browser...
34. What changed by automatic crash dump analysis? (1)
• Speed
• Triaging crash dumps down from to !
• Enabling people
• Non-experienced people are capable of simple crash analysis
• No more local tools & setup required (all in the browser)
• Experts not blocked so much anymore
• Communication
• Referring to a crash via URL changed a lot. Can be referenced in JIRA, E-Mail, Slack.
Better than passing huge files around.
35. What changed by automatic crash dump analysis? (2)
• Security
• Files are kept in a secure location. Audit-log for access. Automatic retention.
• Scalability
• We can now assess every single crash dump from tests, from staging, from production.
• Can analyze up to 1000+ crash dumps per day.
• Quality improved
• Since analysis is easier, we are much more pro-active and feed all available sources into
SuperDump. It has increased our product quality.
37. SuperDump and Open Source
• Open-sourced in 2017 with permissive license (MIT):
https://github.com/Dynatrace/superdump
• Maintained and actively used at Dynatrace
• (not as a commercial product)
• Roadmap:
• Generic analyzer framework to enable not only crash-dump analysis but also analysis
of logfiles, java hs_err_pid, … (a.k.a. generic “dumps” of data)
• Kubernetize SuperDump (be able to scale analyzers up and down)
• Better clustering and visualization of duplicates
• Contributions and feedback are welcome ☺
39. Summary
• What is crash-dump analysis and how we did it in 2014
• The journey to automation and how it led to SuperDump
• How automation via SuperDump transformed us
• This led to
• Analysis time down from to !
involved
quality through
42. How to create a crash dump
• Windows Task Manager (manual, be aware of bitness!)
• Process Explorer (SysInternals, manual)
• ProcDump (SysInternals, can dump on crash!)
• Windows Error Reporting (automatic, if enabled)
• DebugDiag (automatic, if enabled)
• dbghelp.dll API (MiniDumpWriteDump, it’s on you!)
• Linux: Adapt “kernel.core_pattern”
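On Linux, `kernel.core_pattern` (readable at `/proc/sys/kernel/core_pattern`) decides where core dumps go: either a file name template with `%` specifiers, or, if it starts with `|`, a program the dump is piped to. The sketch below inspects a pattern and expands a few common specifiers (`%e` executable name, `%p` PID, `%t` timestamp, `%%` literal percent). It is a simplified illustration, not the kernel's full logic, and the example path `/var/crash` is an assumption.

```python
from pathlib import Path

CORE_PATTERN_FILE = Path("/proc/sys/kernel/core_pattern")


def describe_core_pattern(pattern: str) -> dict:
    """Tell whether cores are piped to a program or written to a file."""
    pattern = pattern.strip()
    piped = pattern.startswith("|")
    return {"piped": piped, "target": pattern[1:].strip() if piped else pattern}


def expand_core_pattern(pattern: str, *, pid: int, exe: str, timestamp: int) -> str:
    """Expand %e, %p, %t and %%; other specifiers are left unchanged."""
    values = {"%": "%", "p": str(pid), "e": exe, "t": str(timestamp)}
    out = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "%" and i + 1 < len(pattern):
            spec = pattern[i + 1]
            out.append(values.get(spec, "%" + spec))
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)
```

On a Linux machine, `describe_core_pattern(CORE_PATTERN_FILE.read_text())` shows the current setting; a template such as `/var/crash/core.%e.%p.%t` gives every crash its own file that an uploader could pick up.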