Automatic crash analysis system

This talk covers a quality-control system based on capturing and analyzing crash dump files (primarily on Windows). It presents an architecture and toolset for automatically collecting and analyzing crash dumps captured at the moment an application runs into trouble: crashes, hangs, and excessive resource consumption (memory, file handles, and so on).

Automatic Crash Analysis System
Anton Naumovich

About me
Anton Naumovich
Development Manager at LogicNow
Developer at Microsoft (Hyper-V) in the past
Specializing in performance, debugging, troubleshooting

Deviations
Crashes
Hangs
Excessive CPU and RAM usage
And more...

Sources of deviations
Developers’ mistakes
Third-party library issues
Environment diversity (software, hardware)

Take a memory dump!
A dump is a snapshot of process memory
The root cause of a problem can be located from the dump
The very fact that a dump was taken is an “attention!” signal

Dump kinds
Minidump: threads and handles
Full dump: + virtual memory
Kernel dump: full OS

Taking a process dump
We need a “non-involved” controller process
SuperController.exe (controller app): monitors SuperApp.exe and takes dumps
SuperApp.exe (worker app): the process being monitored
Output: a dump file
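
The controller's monitoring step can be sketched roughly as below. This is a minimal Python sketch, not the talk's implementation: the class name ThresholdTrigger and the "N consecutive samples" model are assumptions for illustration, loosely mirroring the idea of waiting for a sustained deviation before taking a dump. How the metric is sampled and how the dump is actually written are left out.

```python
from dataclasses import dataclass


@dataclass
class ThresholdTrigger:
    """Hypothetical helper: fires once a metric stays above a threshold
    for a given number of consecutive samples."""
    threshold: float
    required_samples: int
    _streak: int = 0

    def update(self, value: float) -> bool:
        # Extend the streak while the metric is above the threshold; reset otherwise.
        self._streak = self._streak + 1 if value > self.threshold else 0
        if self._streak >= self.required_samples:
            self._streak = 0  # re-arm so the next dump needs a fresh streak
            return True
        return False
```

A controller loop would call update() once per sampling interval with, say, the worker's committed memory in MB, and take a dump whenever it returns True.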

Apps capable of taking dumps
Process Explorer (full, mini)
Task Manager (full)
ProcDump (full, mini, and much more)

ProcDump: basics
-c CPU threshold above which to create a dump of the process
-e Write a dump when the process encounters an unhandled exception
-m Memory commit threshold in MB at which to create a dump
-t Write a dump when the process terminates
-h Write dump if process has a hung window
-p Trigger on the performance counter when the threshold is exceeded

ProcDump: advanced
-w Wait for the specified process to launch if it's not running
-s Consecutive seconds before dump is written (default is 10)
-n Number of dumps to write before exiting
-r Dump using a clone
-i Install ProcDump as the AeDebug postmortem debugger
-ma Write a dump file with all process memory

procdump: controlling apps
SuperApp.exe (worker app)
procdump -c 30 SuperApp.exe
procdump -h SuperApp.exe
procdump -m 300 SuperApp.exe
procdump -t SuperApp.exe
procdump -p "\Process(SuperApp)\Handle Count" 1000 SuperApp.exe
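
A controller script could assemble these command lines programmatically. The sketch below is one possible way to do it in Python, using only flags listed on the slides (-c, -m, -h, -t, -ma). The function names procdump_args and start_monitor are hypothetical, and the sketch assumes procdump is on PATH on a Windows machine.

```python
import subprocess
from typing import List, Optional


def procdump_args(target: str, *, cpu: Optional[int] = None,
                  commit_mb: Optional[int] = None, hung_window: bool = False,
                  on_terminate: bool = False, full: bool = False) -> List[str]:
    """Build a procdump command line for one monitoring rule (hypothetical helper)."""
    args = ["procdump"]
    if full:
        args.append("-ma")            # full dump with all process memory
    if cpu is not None:
        args += ["-c", str(cpu)]      # CPU threshold in percent
    if commit_mb is not None:
        args += ["-m", str(commit_mb)]  # memory commit threshold in MB
    if hung_window:
        args.append("-h")             # dump when a window hangs
    if on_terminate:
        args.append("-t")             # dump when the process terminates
    args.append(target)
    return args


def start_monitor(args: List[str]) -> subprocess.Popen:
    # Assumption: procdump is installed and on PATH (Windows only).
    return subprocess.Popen(args)
```

For example, procdump_args("SuperApp.exe", cpu=30) yields the same arguments as the first command above.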

Fetching info from the dump
Dump analysis is just static debugging
Easily automatable:
cdb.exe -y C:\lab -i C:\lab -z C:\lab\SuperApp.dmp -c "~*k;q" > C:\analysis.txt
Debugger inputs:
SuperApp.pdb: debugging symbols
SuperApp.dmp: memory dump
SuperApp.exe: app executable
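
The same cdb invocation can be driven from a script. The following Python sketch wraps the command shown on the slide. The helper names cdb_args and analyze are hypothetical, and it assumes cdb.exe from Debugging Tools for Windows is on PATH.

```python
import subprocess
from typing import List


def cdb_args(dump: str, symbols: str, images: str,
             commands: str = "~*k;q") -> List[str]:
    """Build the cdb command line: -y symbol path, -i image path,
    -z dump file, -c debugger commands ("~*k" prints all thread stacks, "q" quits)."""
    return ["cdb.exe", "-y", symbols, "-i", images, "-z", dump, "-c", commands]


def analyze(dump: str, symbols: str, images: str) -> str:
    # Assumption: runs on Windows with cdb.exe on PATH; returns the raw debugger output.
    result = subprocess.run(cdb_args(dump, symbols, images),
                            capture_output=True, text=True, check=False)
    return result.stdout
```

Capturing stdout in the script replaces the shell redirection to C:\analysis.txt and lets the next step parse the stacks directly.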

Analysis results
It’s all about thread stacks
008afcf0 MSVCP120!std::_Xout_of_range+0x36
008fc86b SuperApp!WorkerProcessor::GetNextChunk+0x1e1
0061d914 SuperApp!WorkerProcessor::CalculateAverage+0x202
0062875c SuperApp!WorkerModule::ProcessQueueEvent+0xdf
0012877a SuperApp!WorkerModule::TakeSingleItem+0x54
004dc89a SuperApp!WorkerModule::Run+0x67
00bdc100 SuperApp!main+0x1955
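
Stack lines like these are easy to turn into structured data. The Python sketch below assumes the simplified "address module!function+0xoffset" layout shown on the slide; real debugger output has more columns and may contain symbol names with spaces, which this regular expression does not handle. Frame, parse_frames and first_app_frame are hypothetical names.

```python
import re
from typing import List, NamedTuple, Optional


class Frame(NamedTuple):
    module: str
    function: str
    offset: int


FRAME_RE = re.compile(
    r"^\s*(?:[0-9a-fA-F`]+\s+)+"               # one or more hex address columns
    r"(?P<module>[^!\s]+)!"                     # module name before '!'
    r"(?P<function>\S+?)"                       # symbol name (no spaces)
    r"(?:\+0x(?P<offset>[0-9a-fA-F]+))?\s*$"    # optional +0x offset
)


def parse_frames(text: str) -> List[Frame]:
    """Extract frames from debugger output, skipping lines that do not match."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            frames.append(Frame(m["module"], m["function"],
                                int(m["offset"] or "0", 16)))
    return frames


def first_app_frame(frames: List[Frame], app_module: str) -> Optional[Frame]:
    """Topmost frame that belongs to the application's own module."""
    return next((f for f in frames if f.module == app_module), None)
```

For the stack above, the topmost SuperApp frame is WorkerProcessor::GetNextChunk, which is a natural first place to look.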

Key analysis features
Fuzzy matching of dumps and grouping by stack
Integration with issue tracking (Jira)
On-demand dump analysis at a user's request
Notifications about new/critical problems
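
The talk does not say how the fuzzy matching is implemented. One simple way to approximate it, shown here as a Python sketch under that assumption, is to compare stacks as sequences of function names with difflib and put a dump into the first group whose representative stack is similar enough. The 0.8 threshold and the function names are illustrative choices.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Sequence


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Similarity of two stacks in [0, 1], comparing frame names as sequence items."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def group_stacks(stacks: Dict[str, List[str]],
                 threshold: float = 0.8) -> List[List[str]]:
    """Greedy grouping: each dump joins the first group whose representative
    stack is at least `threshold` similar, otherwise it starts a new group."""
    groups: List[List[str]] = []            # dump ids per group
    representatives: List[List[str]] = []   # first stack seen in each group
    for dump_id, stack in stacks.items():
        for i, rep in enumerate(representatives):
            if similarity(stack, rep) >= threshold:
                groups[i].append(dump_id)
                break
        else:
            representatives.append(stack)
            groups.append([dump_id])
    return groups
```

Each resulting group can then map to a single issue in the tracker, so that a new group signals a new problem and a growing group signals a frequent one.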

Symbol Server
- Storage and access to app debugging symbols
- Dramatically speeds up debugging
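
With a symbol server in place, the debugger only needs one symbol path. As a rough Python sketch, the helper below builds a path in the srv*cache*store form understood by the Windows debuggers. The function name symbol_path and the share and cache locations in the usage note are hypothetical.

```python
def symbol_path(store: str, cache: str = "") -> str:
    """Build a symbol path entry: "srv*<cache>*<store>", or "srv*<store>"
    when no local cache directory is given."""
    return f"srv*{cache}*{store}" if cache else f"srv*{store}"
```

The result can be passed to cdb with -y or exported as the _NT_SYMBOL_PATH environment variable, for example symbol_path(r"\\buildsrv\symbols", r"C:\symcache").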

Analogues
Windows Error Reporting
http://msdn.microsoft.com/en-us/library/windows/desktop/bb513641(v=vs.85).aspx
Mozilla Crash Reporter
https://support.mozilla.org/en-US/kb/mozillacrashreporter
Dr. Dump
https://drdump.com/crash-reporting-system

Example: Dr. Dump
https://drdump.com/AppVersion.aspx

What can you do tomorrow?
Set up a symbol server (simply a shared folder)
Use a script to monitor problems and capture dumps
Use a script to analyze dumps

Toolset
Debugging Tools for Windows (cdb, windbg)
http://msdn.microsoft.com/en-us/windows/hardware/hh852365.aspx
Sysinternals tool suite (procdump, procexp)
http://technet.microsoft.com/en-us/sysinternals/bb545021.aspx
Google Breakpad library
https://code.google.com/p/google-breakpad/
Windows API: Debug Help family
http://msdn.microsoft.com/en-us/library/windows/desktop/ms679309(v=vs.85).aspx
Microsoft Symbol Server
http://en.wikipedia.org/wiki/Microsoft_Symbol_Server

1. Speed up defect location
2. Immediate reaction to critical problems
3. Version quality indicators
4. Improve stability
Profit

Thanks! Questions?
Anton.Naumovich@LogicNow.com

Editor's Notes

I will talk about quality control in real time: not in "lab" conditions, but while the application is doing its real work on end users' machines. In effect, this is about building a feedback loop from production to "Mission Control" :) Such a system is of interest to many departments and sits at the intersection of their responsibilities: Development, QA and Automation, Support.
We are interested in the following quality indicators: crashes (an illegal operation was performed; everyone has seen that little dialog, and soon we will find out what happens behind the scenes when we agree to send a report); hangs, when the application stops responding to external input; memory consumption, whether leaks or simply wasteful use; CPU consumption; and others that are domain-specific or combine the ones listed above.
The cause is typically, in 90% of cases, the human factor, that is, developers' mistakes. We are all human; we all make mistakes and will keep making them. Quirks of third-party libraries and the diversity of environments are often human errors too, just made by other people.
Windows, like other operating systems, has a built-in ability to take a memory snapshot of a process at any moment. A quick analysis is enough to find the cause of a given deviation. Moreover, and this is very important, the deviation parameters can be tuned so that the mere existence of a dump is already an "Attention" signal.
Let's talk about how to take dumps. The talk uses a running example, an application called SuperApp. Usually, if an application has to run in the background, it is paired with a controller application, SuperController, which is responsible for keeping its ward alive and functioning. This controller can be given extra work: monitor the worker's vital signs and, when they deviate from the norm, take a dump of the monitored process. Dumps can also be taken by Task Manager (built into Windows) and by procdump, a very powerful utility from Sysinternals; we will look at it in more detail to show the range of possibilities.
Examples of performance counters include the number of open files, the number of bytes read from disk or sent over the network, and so on.
Some technical details about dump analysis. Analysis requires three components: the memory dump, the executable, and the debugging symbols. Analysis is very simple: it is the same as debugging, so any programmer already knows how to do it. It is trivial to automate, for example with the cdb debugger.
Example: the problem usually lies in the most recent (topmost) frames. Here it is an out-of-range access to a vector.
We know how to collect and how to analyze dumps. Connecting it all together gives the following picture.
On the client: SuperApp and SuperController work as a pair. SuperApp does its job and SuperController watches it. As soon as a deviation occurs, SuperController takes a dump and sends it to the server (for example, over FTP or HTTP) together with supporting information.
On the server: the uploaded dump lands in storage, for example on the file system or in a database. In the background, an analyst process, SuperAnalyst, runs the dump analysis and extracts the information an expert needs. It reports the most critical problems to the expert, for example by email or SMS. Sometimes an immediate reaction is required.
A technology that simplifies access to debugging information for different versions. There is no need to spend time hunting for symbols: just specify a single address and the debugger does the rest.
Any more or less serious application has a similar feedback system, for example Windows, Mozilla, and so on.
With the free Debugging Tools for Windows and Sysinternals, such a system can be set up in a test lab in a matter of days, with practically no extra effort from the programmers.
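
The server-side step in which the analyst process picks up new dumps could look roughly like this. It is a minimal Python sketch assuming file-system storage and a convention, invented here for illustration, that the analysis of X.dmp is saved next to it as X.txt; the function name pending_dumps is hypothetical.

```python
from pathlib import Path
from typing import List


def pending_dumps(storage: Path) -> List[Path]:
    """Return dumps in `storage` that have no analysis report next to them yet.
    Assumption: the report for X.dmp is stored as X.txt in the same folder."""
    return sorted(
        dump for dump in storage.glob("*.dmp")
        if not dump.with_suffix(".txt").exists()
    )
```

A background job can call this periodically, run the debugger on each returned dump, and write the report file so the dump is not analyzed twice.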
