PAC 2019 virtual Christoph NEUMÜLLER

Large Scale Enterprise Crash Dump Analysis
By Christoph Neumüller
Product Architect @ Dynatrace

Large Scale Enterprise Crash Dump Analysis
• The journey that led to a tool (SuperDump) that fully automates crash dump analysis
• How we reduced the time it takes to analyze a crash
dump from to
• How automation transformed our workflow

A story from 2014
• Peter (Customer): "We have a problem with your product. It's crashing."
• Steffi (Support): "Ok, please create a crash-dump and upload it to our support-portal."
• Peter (Customer): "Ok, here you go."
• Steffi (Support): "Development, this customer has [problem X], please have a look at these crash dumps."
• Luke (Development): "Oh, I've never done this before. Sarah, you have experience with this. Can you help?"
• Sarah (Development): "Sure. (Downloads 500 MB file). Oh, it's a Windows dump. Can't analyze that on my Linux box. Tom, can you take this?"
• Tom (Development): "Ok. (Downloads 500 MB file). Configures symbol server. Uses Visual Studio to see stack traces. Takes screenshots and attaches them to JIRA."
• Next day: Luke (dev): "Thanks Tom. I've almost got it. Can you find out this [detail X] for me?"
• Tom (dev): "Sigh. Loads dump again, this time in a different tool (WinDbg), as it allows deeper research. Finds [detail X]."
• Luke (dev): Finds and fixes problem.
• ...
• Next week: "Hey Tom, I have 20 new crash dumps, can you analyze them?"
• Tom: "Great Scott. We need to automate this."

Crash dump analysis
• A crash dump is:
• Windows: „.dmp“ (FullDump, MiniDump)
• Linux: „.core“ (Coredump)
• Crash dump analysis is like going back in time to inspect a certain event
• The goal is usually to find the faulting thread, the faulting stack frame and
thus the line of code that caused the fault (e.g. access violation, segfault, ...)
• We're focused on native (C++) and managed (.NET) crash analysis
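As a small illustration of the two dump families named above, the following sketch guesses the dump type from the first bytes of a file. It relies on the "MDMP" signature of Windows user-mode dumps written via MiniDumpWriteDump and on the ELF header field e_type being ET_CORE (4) for Linux core files. The function name and the returned labels are hypothetical and not part of SuperDump.

```python
import struct

ET_CORE = 4  # ELF e_type value that marks a core file


def classify_dump(header: bytes) -> str:
    """Guess the crash dump type from the first bytes of the file."""
    if header[:4] == b"MDMP":
        return "windows-minidump"
    if header[:4] == b"\x7fELF" and len(header) >= 18:
        # EI_DATA (byte 5) gives the byte order: 1 = little endian, 2 = big endian
        byte_order = "<" if header[5] == 1 else ">"
        (e_type,) = struct.unpack_from(byte_order + "H", header, 16)
        return "linux-coredump" if e_type == ET_CORE else "elf-not-core"
    return "unknown"
```

In practice the header would come from reading the first 32 bytes or so of the uploaded file, which is enough to route the dump to a Windows or a Linux analyzer.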

Common tools for crash dump analysis (C++, .NET)
• Visual Studio
• Easy. Basic analysis. Windows.
• DebugDiag
• Easy. Emits HTML report. Windows.
• GDB
• Intermediate. Advanced analysis. Linux.
• WinDbg
• Hard. Advanced analysis. Windows.

An example: WinDbg
|. (status about process)
~15s (select thread 15)
k (native stack)
~* k (all native stacks)
lmf (show loaded modules)
.exr -1 (last exception)
.cordll -ve -u -l (get SOS loaded)
!clrstack (managed .net stack)
~*e !clrstack (show all managed .net stacks)
x *! (show symbol paths)
• Expert tool: very powerful, but hard to learn
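Commands like the ones above can also be run non-interactively with cdb.exe, the console sibling of WinDbg. The sketch below only builds the command line, using the cdb options -z (dump file), -y (symbol path) and -c (initial commands). The helper name and the default command list are assumptions for illustration, not how SuperDump invokes the debugger.

```python
from typing import List, Sequence

# A few of the commands from the slide, plus "q" appended later to quit.
DEFAULT_COMMANDS = (".exr -1", "k", "lmf")


def build_cdb_command(
    dump_path: str,
    symbol_path: str,
    commands: Sequence[str] = DEFAULT_COMMANDS,
) -> List[str]:
    """Build an argument list that opens a dump, runs the commands and exits."""
    script = "; ".join(list(commands) + ["q"])
    return ["cdb.exe", "-z", dump_path, "-y", symbol_path, "-c", script]
```

On a Windows machine with the Debugging Tools installed, the resulting list could be passed to subprocess.run(..., capture_output=True, text=True) to collect the debugger output as text.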

Crash dump analysis is time-consuming and sometimes hard
• Simple analysis needs preparation
• Tools installed
• Symbol servers properly configured
• Different tools required for Windows and Linux
• Simple analysis is repetitive
• Download crashdump
• Open tool (e.g. WinDbg)
• Find and list all stacks with exceptions
• Post results to JIRA
• Deep analysis is considered a „dark magic“ art
• Nasty crashes are hard to crack (memory corruptions, deadlocks)

What was our problem in our story?

Our problems
• Experts required
• Multiple devs needed to be involved
• Although we had a few distinguished experts, not nearly all developers were
experienced in crash dump analysis
• Workflow cumbersome
• Passing around large files (what about data security and retention?)
• Time effort
• Setting up and running an analysis is time-consuming. Expert time is wasted.
• How can we scale this?
• We want to become more proactive about bugs & crashes. Automatically capture every
crash from Test, Staging, Production (selected) & Support.

Step 1: Automate analysis
[Architecture diagram. Components: SuperDump.Analyzer.exe (CLRMD), Text Output]
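To make the idea of an automated analyzer concrete, here is a minimal sketch of the summarizing step: given per-thread stacks that some debugger library has already extracted, pick the faulting thread and its top frame. The data model and field names are hypothetical. The real analyzer is a .NET program built on CLRMD, and its output format is not shown in these slides.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class ThreadInfo:
    thread_id: int
    frames: List[str]                # top of stack first
    exception: Optional[str] = None  # e.g. "Access violation", if any


def summarize(threads: List[ThreadInfo]) -> dict:
    """Pick the first thread that carries an exception and report its top frame."""
    faulting = next((t for t in threads if t.exception), None)
    top_frame = faulting.frames[0] if faulting and faulting.frames else None
    return {
        "threadCount": len(threads),
        "faultingThread": faulting.thread_id if faulting else None,
        "exception": faulting.exception if faulting else None,
        "faultingFrame": top_frame,
        "threads": [asdict(t) for t in threads],
    }
```

Serializing the returned dictionary with json.dumps would give a machine-readable result that later steps can consume.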

That’s cute. But does it
help productivity yet?

Step 2: Web Frontend
[Architecture diagram. Components: Developers, Web-Frontend, SuperDump.Service.exe, ASP.NET Core, Hangfire, CLRMD, .dmp, result.json]
Hangfire: https://github.com/HangfireIO/Hangfire
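The web frontend hands uploaded dumps to background jobs so that uploads return quickly. SuperDump uses Hangfire (a .NET library) for this. The sketch below swaps in a plain Python thread pool with a queue to show the same pattern. The class and method names are hypothetical.

```python
import queue
import threading
from typing import Callable, Dict


class AnalysisQueue:
    """Runs analysis jobs on background worker threads."""

    def __init__(self, analyze: Callable[[str], dict], workers: int = 2):
        self._analyze = analyze
        self._jobs: "queue.Queue[str]" = queue.Queue()
        self._lock = threading.Lock()
        self.results: Dict[str, dict] = {}
        for _ in range(workers):
            threading.Thread(target=self._work, daemon=True).start()

    def enqueue(self, dump_id: str) -> None:
        """Called by the upload handler; returns immediately."""
        self._jobs.put(dump_id)

    def wait(self) -> None:
        """Block until every enqueued job has been processed."""
        self._jobs.join()

    def _work(self) -> None:
        while True:
            dump_id = self._jobs.get()
            try:
                result = self._analyze(dump_id)
            except Exception as exc:  # keep the worker alive on bad dumps
                result = {"error": str(exc)}
            with self._lock:
                self.results[dump_id] = result
            self._jobs.task_done()
```

A persistent job store, retries and a dashboard are what a library like Hangfire adds on top of this basic pattern.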

Step 3: Automate workflow
Nice! Non-experienced people can analyze dumps without special tools and know-how.
It also helps non-Windows developers to quickly assess crash dumps!
Crash dumps can be referred to per URL:
https://superdump.acme.org/Home/Report?bundleId=zgi5110&dumpId=wkc9242

Step 3: Automate workflow
[Architecture diagram. Components: Developers, JIRA, Support, Tests, Web-Frontend, REST API, ASP.NET Core, Hangfire, CLRMD, .dmp, result.json]
curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' -d '{
  "url": "https://dumps.local/mydump.dmp"
}' 'http://superdump.local/api/Dumps'
Response:
{
  "location": "http://superdump.local/Home/BundleCreated?bundleId=czs6140",
  "date": "Fri, 05 May 2017 20:13:04 GMT"
}
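The same request can be issued from a test suite or a support tool instead of curl. This sketch builds the POST with Python's standard library, using the endpoint path and the "url" field from the curl example. The helper name is hypothetical, and the host names are the placeholder hosts from the slide.

```python
import json
import urllib.request


def build_submit_request(base_url: str, dump_url: str) -> urllib.request.Request:
    """Build the POST request that asks the service to fetch and analyze a dump."""
    body = json.dumps({"url": dump_url}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/Dumps",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "Accept": "application/json"},
    )
```

Passing the request to urllib.request.urlopen would send it. Building and sending are kept separate here so the request can be inspected without a running server.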

Awesome. Analysis is already finished by the time a
dev gets involved.
But still not enough. What if I want to investigate a
very special case? I want all the power of WinDbg.
But in the browser...

Step 4: Allow deep analysis
[Architecture diagram. Components: Developers, JIRA, Support, Tests, Browser (jquery.console), Websockets, I/O Redirect, cdb.exe (WinDbg), Web-Frontend, REST API, ASP.NET Core, Hangfire, SuperDump.Analyzer.exe, CLRMD, result.json]
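The core of the I/O redirect is a debugger process whose stdin and stdout are attached to pipes, so that text typed in the browser can be forwarded to it and its output sent back. The sketch below shows only that pipe handling and leaves the websocket part out. It uses a tiny echo process as a stand-in for cdb.exe and reads a single line per command, whereas a real debugger prints many lines and a prompt. All names are hypothetical.

```python
import subprocess
import sys
from typing import List

# Stand-in for an interactive debugger: answers every input line with one output line.
FAKE_DEBUGGER = [
    sys.executable, "-u", "-c",
    "import sys\n"
    "while True:\n"
    "    line = sys.stdin.readline()\n"
    "    if not line:\n"
    "        break\n"
    "    print('output of: ' + line.strip(), flush=True)\n",
]


class DebuggerSession:
    """Keeps one debugger process alive and relays lines to and from it."""

    def __init__(self, command: List[str]):
        self._proc = subprocess.Popen(
            command,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,  # line buffered on our side
        )

    def send(self, line: str) -> str:
        """Forward one command and return the first line of output."""
        self._proc.stdin.write(line + "\n")
        self._proc.stdin.flush()
        return self._proc.stdout.readline().rstrip("\n")

    def close(self) -> None:
        self._proc.stdin.close()
        self._proc.stdout.close()
        self._proc.wait()
```

A websocket handler would call send for each message from the browser console and push the returned text back to the client.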

Wow. Now even deep investigations can be made
in the browser. No need for local tools anymore.
This is a game changer for non-Windows
developers.

[Architecture diagram, extended for Linux. Components: Developers, JIRA, Support, Tests, Browser (jquery.console), Websockets, I/O Redirect, cdb.exe (WinDbg), ASP.NET Core, Hangfire, SuperDump.Analyzer.exe, CLRMD, result.json, Remote Docker (Linux), SuperDump.Analyzer.Linux.dll, libunwind, result.json]

Neat. No more Linux VMs necessary for
Windows developers to debug Linux
coredumps.

Linux Architecture
[Architecture diagram. Components: Developers, JIRA, Support, Tests, Browser (jquery.console), Websockets, I/O Redirect, cdb.exe (WinDbg), ASP.NET Core, Hangfire, SuperDump.Analyzer.exe, CLRMD, result.json, Docker for Windows, Linux container, SuperDump.Analyzer.Linux.exe, libunwind, GDB, gotty (remote TTY), I/O Redirect, result.json]
gotty: https://github.com/yudai/gotty

More goodness...
• LDAP Authentication & User Roles
• Audit Logging
• JIRA integration (backlink detection)
• Automatic data retention
• Slack-Notifications
• Similarity detection
• Elasticsearch storage (for indexing and search)
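The slides do not say how similarity detection works. As one possible illustration, the sketch below scores two crashes by the Jaccard overlap of their stack frames, so that dumps sharing most frames can be flagged as likely duplicates. The function name and the choice of metric are assumptions, not SuperDump's actual algorithm.

```python
from typing import Iterable


def stack_similarity(frames_a: Iterable[str], frames_b: Iterable[str]) -> float:
    """Jaccard similarity of two stacks: shared frames divided by all distinct frames."""
    a, b = set(frames_a), set(frames_b)
    if not a and not b:
        return 1.0  # two empty stacks are treated as identical
    return len(a & b) / len(a | b)
```

A threshold on this score (for example, treating anything above 0.8 as a duplicate) would have to be tuned on real data.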

Automation transformed our workflow!

What changed by automatic crash dump analysis? (1)
• Speed
• Triaging crash dumps down from to !
• Enabling people
• Non-experienced people are capable of simple crash analysis
• No more local tools & setup required (all in the browser)
• Experts not blocked so much anymore
• Communication
• Referring to a crash via URL changed a lot. Can be referenced in JIRA, E-Mail, Slack.
Better than passing huge files around.

What changed by automatic crash dump analysis? (2)
• Security
• Files are kept in a secure location. Audit-log for access. Automatic retention.
• Scalability
• We can now assess every single crash dump from tests, from staging, from production.
• Can analyze 1000+ crash dumps per day.
• Quality improved
• Since analysis is easier, we are much more pro-active and feed all available sources into
SuperDump. It has increased our product quality.

SuperDump and Open Source
• Open-sourced in 2017 with permissive license (MIT):
https://github.com/Dynatrace/superdump
• Maintained and actively used at Dynatrace
• (not as a commercial product)
• Roadmap:
• Generic analyzer framework to enable not only crash-dump analysis but also analysis
of logfiles, java hs_err_pid, … (a.k.a. generic “dumps” of data)
• Kubernetize SuperDump (be able to scale analyzers up and down)
• Better clustering and visualization of duplicates
• Contributions and feedback are welcome ☺

Summary
• What is crash-dump analysis and how we did it in 2014
• The journey to automation and how it led to SuperDump
• How automation via SuperDump transformed us
• This led to
• Analysis time down from to !
involved
quality through

How to create a crash dump
• Windows Task Manager (manual, be aware of bitness!)
• Process Explorer (SysInternals, manual)
• ProcDump (SysInternals, can dump on crash!)
• Windows Error Reporting (automatic, if enabled)
• DebugDiag (automatic, if enabled)
• dbghelp.dll API (MiniDumpWriteDump, it’s on you!)
• Linux: Adapt “kernel.core_pattern”
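On Linux, kernel.core_pattern controls where core files go: a value starting with "|" pipes the dump to a program, anything else is a file name template with specifiers such as %e (executable name), %p (PID) and %t (time of dump in seconds since the epoch). The sketch below expands only those three specifiers plus %% to show what a pattern produces. It is an illustration of the template syntax, not the kernel's implementation, and it leaves any other specifier untouched.

```python
def is_piped(pattern: str) -> bool:
    """True if the kernel would pipe the core dump to a program."""
    return pattern.startswith("|")


def expand_core_pattern(pattern: str, pid: int, exe: str, timestamp: int) -> str:
    """Expand %e, %p, %t and %% in a core_pattern file name template."""
    values = {"p": str(pid), "e": exe, "t": str(timestamp), "%": "%"}
    out = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "%" and i + 1 < len(pattern) and pattern[i + 1] in values:
            out.append(values[pattern[i + 1]])
            i += 2
        else:
            out.append(pattern[i])  # ordinary character or unsupported specifier
            i += 1
    return "".join(out)
```

A pattern such as /var/crash/core.%e.%p.%t can be set with sysctl -w kernel.core_pattern=... so that every crashing process leaves a uniquely named core file for later upload.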
