The road to zen winds through self-monitoring applications, which can reduce the time and effort in diagnosing and correcting production issues. In this talk, we will see how modern Windows applications can self-monitor, self-diagnose, and potentially self-recover without needing an external monitoring agent or a brute-force restarting watchdog. By harnessing the power of ETW for low-level accurate monitoring, Windows performance counters for zero-overhead statistics, and the CLRMD library for inspecting your own threads, heap objects, and locks, you can take your applications one step closer to self-awareness. This will be illustrated through a series of demos: automatic CPU profiling and pinpointing the busy threads and stacks; automatic GC monitoring, including object allocations; automatic heap analysis to reveal unraveling memory leaks; and more. At the end of the talk, you will be equipped with tools and techniques for implementing self-monitoring in your own applications.
6. Motivation
Why monitor?
Obviously a must for servers and/or anything at scale
But also a must for “simple” shrink-wrap software
Why our own?
6
7. Motivation
Why monitor?
Obviously a must for servers and/or anything at scale
But also a must for “simple” shrink-wrap software
Why our own?
Customized for specific business needs
7
8. Motivation
Why monitor?
Obviously a must for servers and/or anything at scale
But also a must for “simple” shrink-wrap software
Why our own?
Customized for specific business needs
Diagnostics flow from ground up
8
9. Motivation
Why monitor?
Obviously a must for servers and/or anything at scale
But also a must for “simple” shrink-wrap software
Why our own?
Customized for specific business needs
Diagnostics flow from ground up
That’s what all the cool kids do :-)
9
11. Master Plan
Use a hierarchical monitoring strategy
Lightweight for continuous and frequent monitoring of all basics
Numerical resource consumption: CPU, memory, disk
11
12. Master Plan
Use a hierarchical monitoring strategy
Lightweight for continuous and frequent monitoring of all basics
Numerical resource consumption: CPU, memory, disk
Medium for less frequency events
Rare exceptions, deadlocks
12
13. Master Plan
Use a hierarchical monitoring strategy
Lightweight for continuous and frequent monitoring of all basics
Numerical resource consumption: CPU, memory, disk
Medium for less frequency events
Rare exceptions, deadlocks
Invasive for deep-dive and concrete diagnostics
Memory leaks, bulk call-stack data (e.g. CPU profiling)
13
16. Lightweight Monitoring
Performance counters - provide numeric information about the
system and specific processes (CPU, memory, disk, …)
Win32 APIs
GetProcessTimes - provides CPU information for a process
GetProcessMemoryInfo - query for memory usage
And others…
16
17. Lightweight Monitoring
Performance counters - provide numeric information about the
system and specific processes (CPU, memory, disk, …)
Win32 APIs
GetProcessTimes - provides CPU information for a process
GetProcessMemoryInfo - query for memory usage
And others…
Stopwatch - you know what this is :-)
17
19. Halfway Monitoring Tools
ETW - a structured logging framework incorporated into Windows
which can provide very rich data with relatively low overhead
19
20. Halfway Monitoring Tools
ETW - a structured logging framework incorporated into Windows
which can provide very rich data with relatively low overhead
Evaluate your own logs
20
21. Halfway Monitoring Tools
ETW - a structured logging framework incorporated into Windows
which can provide very rich data with relatively low overhead
Evaluate your own logs
ClrMD - open source C# library which provides programatic APIs
which until recently were only available using a fully fledged debugger
21
22. Invasive Analysis APIs
CLR profiling APIs - a whole bunch of native APIs which provide lots
of data about what’s going on in your managed (attached) application
AppDomain creation, assembly loading, JIT, thread creation, GC
22
23. Invasive Analysis APIs
CLR profiling APIs - a whole bunch of native APIs which provide lots
of data about what’s going on in your managed (attached) application
AppDomain creation, assembly loading, JIT, thread creation, GC
CLR debugging APIs - a whole bunch of native APIs which provide
programmatic APIs to debugging actions
Setting breakpoints, exception notifications, state modification
23
24. Invasive Analysis APIs
CLR profiling APIs - a whole bunch of native APIs which provide lots
of data about what’s going on in your managed (attached) application
AppDomain creation, assembly loading, JIT, thread creation, GC
CLR debugging APIs - a whole bunch of native APIs which provide
programmatic APIs to debugging actions
Setting breakpoints, exception notifications, state modification
Hooks - to specific APIs of your choosing
24
26. (Self) CPU-Profiling
Monitor CPU using performance counters
Are we above a certain threshold for a certain amount of time?
Turn on ETW and collect stacks (live using LiveStacks)
Find hot paths, produce flame graphs
Suggest recommendations
26
27. What Can Be Done?
AuthenticationController takes 95% CPU, maybe we're being
DDoS'ed
Image processing component takes 100% CPU, need to auto-scale
the app
Encoding this 30 second video takes 3 minutes at 100% CPU, tell the
user she can send us a bug report
27
28. DEMO
MONITOR FOR CPU SPIKES
http://jeremiahblatz.com/personal/pics/Australia_Travel_Pictures_2009/day2/22_Kangaroo_Island_Kangaroo_Portrait_Kangaroo_Island.html (CC BY-NC-SA 1.0)
29. (Self) GC-Monitoring
Monitor GC performance using performance counters
Register on ETW’s GC events such as GCAllocationTick
Types of objects allocated and their stacks(!)
Number of GCs of each kind and size of reclaimed memory
Duration of GC pauses
Attach ClrMD to get heap breakdown
Generations, segments, reserved/committed, number of objects
29
31. (Self) Heap Analysis
Monitor memory usage using performance counters
Has memory increased above a certain threshold for a certain time?
Attach ClrMD to get heap statistics
Compare snapshots
Report which objects are not freed
Combine with ETW GCAllocationTick to get rate of allocation
31
33. (Self) Deadlock Detection
Monitor for potential deadlock
Low CPU
Request timeouts
Increased thread count
Attach ClrMD to create wait chains and detect deadlocks
Report, try to break, pray for a miracle…
33
36. Food for Thought
Many more scenarios are possible
Monitor heap fragmentation and compact large objects if needed
36
37. Food for Thought
Many more scenarios are possible
Monitor heap fragmentation and compact large objects if needed
Native memory leak analysis using ETW
37
38. Food for Thought
Many more scenarios are possible
Monitor heap fragmentation and compact large objects if needed
Native memory leak analysis using ETW
Side notes:
ClrMD is also very suitable for automating crash dump analysis
You can automate opening tickets in bug tracker, consolidate same
issue from different users, versions, etc.
38
39. Not Everything is Perfect
The pros are obvious (visibility, easy scaling…)
39
40. Not Everything is Perfect
The pros are obvious (visibility, easy scaling…)
But there are some cons as well…
40
41. Not Everything is Perfect
The pros are obvious (visibility, easy scaling…)
But there are some cons as well…
Adds complexity (reduce risk by using separate process)
41
42. Not Everything is Perfect
The pros are obvious (visibility, easy scaling…)
But there are some cons as well…
Adds complexity (reduce risk by using separate process)
Adds overhead
42
43. Not Everything is Perfect
The pros are obvious
But there are some cons as well…
Adds complexity (reduce risk by using separate process)
Adds overhead
Requires additional development
43
44. Summary
Self-monitoring is important for all kinds of software
Best to create a hierarchy of monitoring (and overhead and
complexity)
Lots of scenarios: CPU, GC, memory, deadlocks
Demos: https://github.com/dinazil/self-aware-applications
44