SELF-AWARE APPLICATIONS
AUTOMATIC PRODUCTION MONITORING
DINA GOLDSHTEIN,
Agenda
Motivation

Hierarchy of self-monitoring

CPU profiling

GC monitoring

Heap analysis

Deadlock detection

Not today: profiling tools, 3rd party monitoring, dashboards
Motivation
Why monitor?

Obviously a must for servers and/or anything at scale

But also a must for “simple” shrink-wrap software

Why our own?

Customized for specific business needs

Diagnostics flow from ground up

That’s what all the cool kids do :-)
Master Plan
Use a hierarchical monitoring strategy

Lightweight for continuous and frequent monitoring of all basics

Numerical resource consumption: CPU, memory, disk

Medium for less frequent events

Rare exceptions, deadlocks

Invasive for deep-dive and concrete diagnostics

Memory leaks, bulk call-stack data (e.g. CPU profiling)
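One way to picture the hierarchy is as an escalation ladder: the cheap tier runs all the time, and each costlier tier runs only when the tier below it raises a flag. The sketch below is illustrative Python rather than code from the talk; the `Tier` class, the tier names, and the sample fields (`cpu_percent`, `blocked_threads`) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Tier:
    name: str
    check: Callable[[dict], bool]  # returns True when this tier sees a problem


def run_hierarchy(tiers: List[Tier], sample: dict) -> List[str]:
    """Run tiers from cheapest to most invasive, stopping at the first tier
    that sees nothing suspicious. Returns the names of the tiers that ran."""
    ran = []
    for tier in tiers:
        ran.append(tier.name)
        if not tier.check(sample):
            break  # nothing suspicious: do not pay for the costlier tiers
    return ran


# Hypothetical checks; real ones would read counters, logs, or a heap snapshot.
TIERS = [
    Tier("lightweight", lambda s: s["cpu_percent"] > 90),
    Tier("medium", lambda s: s["blocked_threads"] > 0),
    Tier("invasive", lambda s: True),
]
```

With this shape, a healthy sample only ever pays for the lightweight tier; the invasive tier runs only after both cheaper tiers have flagged a problem.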
TOOLS
Lightweight Monitoring
Performance counters - provide numeric information about the
system and specific processes (CPU, memory, disk, …)

Win32 APIs

GetProcessTimes - provides CPU information for a process

GetProcessMemoryInfo - query for memory usage 

And others…

Stopwatch - you know what this is :-)
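GetProcessTimes reports accumulated kernel and user time, so a CPU percentage has to be derived from two readings: the CPU time consumed in an interval divided by the wall-clock time available across all cores. The sketch below shows that arithmetic in Python, using `time.process_time()` as a stand-in for the Win32 call and `time.perf_counter()` as the stopwatch; the function names are hypothetical.

```python
import os
import time


def cpu_percent(cpu_delta: float, wall_delta: float, cores: int) -> float:
    """CPU usage over an interval as a percentage of total machine capacity."""
    if wall_delta <= 0 or cores <= 0:
        return 0.0
    return 100.0 * cpu_delta / (wall_delta * cores)


def sample_self(interval: float = 0.1) -> float:
    """Measure this process's own CPU usage over `interval` seconds."""
    cpu_start, wall_start = time.process_time(), time.perf_counter()
    time.sleep(interval)
    cpu_used = time.process_time() - cpu_start      # user + system CPU time
    wall_used = time.perf_counter() - wall_start    # elapsed wall-clock time
    return cpu_percent(cpu_used, wall_used, os.cpu_count() or 1)
```

Dividing by the core count keeps the result in the 0 to 100 range; some tools skip that step and report up to 100% per core instead.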
Halfway Monitoring Tools
ETW - a structured logging framework incorporated into Windows
which can provide very rich data with relatively low overhead

Evaluate your own logs

ClrMD - open source C# library which provides programmatic APIs
which until recently were only available using a fully fledged debugger
Invasive Analysis APIs
CLR profiling APIs - a whole bunch of native APIs which provide lots
of data about what’s going on in your managed (attached) application

AppDomain creation, assembly loading, JIT, thread creation, GC

CLR debugging APIs - a whole bunch of native APIs which provide
programmatic access to debugging actions

Setting breakpoints, exception notifications, state modification

Hooks - to specific APIs of your choosing
LET’S GET TO BUSINESS
(Self) CPU-Profiling
Monitor CPU using performance counters

Are we above a certain threshold for a certain amount of time? 

Turn on ETW and collect stacks (live using LiveStacks)

Find hot paths, produce flame graphs

Suggest recommendations
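The "above a certain threshold for a certain amount of time" check can be sketched as a small state machine that remembers when the metric first crossed the line. This is an illustrative Python sketch, not the demo code; the class name and parameters are hypothetical.

```python
from typing import Optional


class SustainedThreshold:
    """Fires once a metric has stayed above `threshold` for `duration` seconds."""

    def __init__(self, threshold: float, duration: float) -> None:
        self.threshold = threshold
        self.duration = duration
        self._above_since: Optional[float] = None

    def update(self, value: float, now: float) -> bool:
        """Feed one sample taken at time `now`; return True while the alert holds."""
        if value <= self.threshold:
            self._above_since = None  # dipped below: restart the clock
            return False
        if self._above_since is None:
            self._above_since = now
        return now - self._above_since >= self.duration
```

Requiring the condition to hold for a while avoids turning on the more expensive stack collection for every short spike.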
What Can Be Done?
AuthenticationController takes 95% CPU, maybe we're being
DDoS'ed

Image processing component takes 100% CPU, need to auto-scale
the app

Encoding this 30 second video takes 3 minutes at 100% CPU, tell the
user she can send us a bug report
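Statements like "AuthenticationController takes 95% CPU" come from aggregating sampled call stacks. The Python sketch below assumes each sample is a list of frame names with the outermost frame first (a hypothetical shape, not the actual ETW payload). It folds identical stacks into counts, the usual input for flame graph generators, and computes the share of samples per frame at a chosen depth.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def fold_stacks(samples: Iterable[Sequence[str]]) -> Counter:
    """Count identical call stacks, each joined into one ';'-separated key."""
    return Counter(";".join(stack) for stack in samples)


def component_share(samples: List[Sequence[str]], depth: int = 0) -> List[Tuple[str, float]]:
    """Percentage of samples per frame at position `depth`, hottest first."""
    counts = Counter(stack[depth] for stack in samples if len(stack) > depth)
    total = sum(counts.values())
    return [(name, 100.0 * n / total) for name, n in counts.most_common()]
```

A recommendation engine could then map the hottest component to an action, such as raising a possible-DDoS alert or requesting a scale-out.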
DEMO
MONITOR FOR CPU SPIKES
http://jeremiahblatz.com/personal/pics/Australia_Travel_Pictures_2009/day2/22_Kangaroo_Island_Kangaroo_Portrait_Kangaroo_Island.html (CC BY-NC-SA 1.0)
(Self) GC-Monitoring
Monitor GC performance using performance counters

Subscribe to ETW’s GC events such as GCAllocationTick

Types of objects allocated and their stacks(!)

Number of GCs of each kind and size of reclaimed memory

Duration of GC pauses

Attach ClrMD to get heap breakdown

Generations, segments, reserved/committed, number of objects
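Once allocation events are flowing, spotting an allocation spike is mostly a matter of aggregating them by type. The sketch below is illustrative Python that assumes each event has already been reduced to a `(type_name, bytes)` pair; that tuple shape and the function names are hypothetical and are not the real ETW event layout.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def allocation_by_type(events: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Sum sampled allocation sizes (in bytes) per type name."""
    totals: Dict[str, int] = defaultdict(int)
    for type_name, size in events:
        totals[type_name] += size
    return dict(totals)


def top_allocators(events: Iterable[Tuple[str, int]], n: int = 3) -> List[Tuple[str, int]]:
    """The `n` types responsible for the most sampled bytes, largest first."""
    totals = allocation_by_type(events)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]
```

Because the events are samples rather than a full log, the totals are estimates, which is usually good enough to identify the dominant allocating types.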
DEMO
MONITOR FOR ALLOCATION SPIKES
https://www.flickr.com/photos/thegirlsny/2803794829/
(Self) Heap Analysis
Monitor memory usage using performance counters

Has memory increased above a certain threshold for a certain time?

Attach ClrMD to get heap statistics

Compare snapshots

Report which objects are not freed

Combine with ETW GCAllocationTick to get rate of allocation
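Comparing snapshots can be sketched as follows, assuming each heap snapshot has been reduced to a mapping from type name to instance count (a hypothetical shape; a real snapshot would come from a tool such as ClrMD). This illustrative Python reports types whose count never drops between snapshots and grows overall, which is a common heuristic for leak suspects rather than proof of a leak.

```python
from typing import Dict, List, Tuple


def growing_types(snapshots: List[Dict[str, int]], min_growth: int = 1) -> List[Tuple[str, int]]:
    """Types whose instance count never decreased across `snapshots` and grew
    by at least `min_growth` overall, sorted by growth, largest first."""
    if len(snapshots) < 2:
        return []
    suspects = []
    for type_name in snapshots[-1]:
        counts = [snap.get(type_name, 0) for snap in snapshots]
        never_dropped = all(a <= b for a, b in zip(counts, counts[1:]))
        growth = counts[-1] - counts[0]
        if never_dropped and growth >= min_growth:
            suspects.append((type_name, growth))
    return sorted(suspects, key=lambda item: item[1], reverse=True)
```

Using three or more snapshots instead of two helps filter out types that merely fluctuate between collections.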
DEMO
FIND A LEAK
https://www.flickr.com/photos/53368913@N05/6038464060 (CC BY-NC-SA 2.0)
(Self) Deadlock Detection
Monitor for potential deadlock

Low CPU

Request timeouts

Increased thread count

Attach ClrMD to create wait chains and detect deadlocks

Report, try to break, pray for a miracle…
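A wait chain is a path through a "waits for" graph, and a deadlock is a cycle in it. The sketch below is illustrative Python under a simplifying assumption: each blocked thread waits on exactly one lock, so the graph is a mapping from a blocked thread to the thread that owns the lock it wants. Building that mapping from a live process is the part a tool like ClrMD would handle; the function name and input shape are hypothetical.

```python
from typing import Dict, List, Optional


def find_deadlock(waits_for: Dict[str, str]) -> Optional[List[str]]:
    """Return one cycle of threads waiting on each other, or None if there is none.
    `waits_for` maps a blocked thread to the owner of the lock it is waiting for."""
    for start in waits_for:
        chain: List[str] = []
        seen = set()
        current: Optional[str] = start
        # Follow the wait chain until it ends or revisits a thread.
        while current is not None and current not in seen:
            seen.add(current)
            chain.append(current)
            current = waits_for.get(current)
        if current is not None:
            # Revisited a thread: the cycle starts where it first appeared.
            return chain[chain.index(current):]
    return None
```

Threads that are merely queued behind a deadlocked thread are left out of the returned cycle, which keeps the report focused on the threads that actually need to be broken apart.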
DEMO
DETECT A DEADLOCK
https://www.flickr.com/photos/34547542@N08/3205125391/ (CC BY 2.0)
Food for Thought
Many more scenarios are possible

Monitor heap fragmentation and compact the large object heap if needed

Native memory leak analysis using ETW 

Side notes:

ClrMD is also very suitable for automating crash dump analysis

You can automate opening tickets in a bug tracker and consolidate the
same issue across different users, versions, etc.
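Consolidating the same issue from different users usually comes down to computing a stable fingerprint per report. The sketch below is illustrative Python; the report shape `(user, version, exception type, frames)`, the choice of the first five frames, and the function names are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple

Report = Tuple[str, str, str, Sequence[str]]  # user, version, exception type, frames


def fingerprint(exception_type: str, frames: Sequence[str], top_n: int = 5) -> str:
    """Stable ID built from the exception type and the first `top_n` frames."""
    key = exception_type + "|" + "|".join(frames[:top_n])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]


def consolidate(reports: Iterable[Report]) -> Dict[str, List[Tuple[str, str]]]:
    """Group reports by fingerprint, keeping who hit the issue and on which version."""
    groups: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for user, version, exception_type, frames in reports:
        groups[fingerprint(exception_type, frames)].append((user, version))
    return dict(groups)
```

Each fingerprint would then correspond to a single ticket, with the list of affected users and versions attached to it.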
Not Everything is Perfect
The pros are obvious (visibility, easy scaling…)

But there are some cons as well…

Adds complexity (reduce risk by using separate process)

Adds overhead

Requires additional development
Summary
Self-monitoring is important for all kinds of software

Best to create a hierarchy of monitoring (and overhead and
complexity)

Lots of scenarios: CPU, GC, memory, deadlocks

Demos: https://github.com/dinazil/self-aware-applications
THANK YOU
DINA GOLDSHTEIN,
@DINAGOZIL
