Self-Aware Applications: Automatic Production Monitoring (NDC Sydney 2017)

SELF-AWARE APPLICATIONS
AUTOMATIC PRODUCTION MONITORING
DINA GOLDSHTEIN,

Agenda
Motivation

Hierarchy of self-monitoring

CPU proﬁling

GC monitoring

Heap analysis

Deadlock detection
2

Agenda
Motivation

Hierarchy of self-monitoring

CPU proﬁling

GC monitoring

Heap analysis

Deadlock detection

Not today: proﬁling tools, 3rd party monitoring, dashboards
3

Motivation
Why monitor?

Obviously a must for servers and/or anything at scale
4

Motivation
Why monitor?


But also a must for “simple” shrink-wrap software
5

Motivation
Why monitor?



Why our own?
6

Motivation
Why monitor?



Why our own?

Customized for speciﬁc business needs
7

Motivation
Why monitor?



Why our own?


Diagnostics ﬂow from ground up
8

Motivation
Why monitor?



Why our own?


Diagnostics ﬂow from ground up

That’s what all the cool kids do :-)
9

Master Plan
Use a hierarchical monitoring strategy
10

Master Plan

Lightweight for continuous and frequent monitoring of all basics

Numerical resource consumption: CPU, memory, disk
11

Master Plan



Medium for less frequency events

Rare exceptions, deadlocks
12

Master Plan



Medium for less frequency events

Rare exceptions, deadlocks

Invasive for deep-dive and concrete diagnostics

Memory leaks, bulk call-stack data (e.g. CPU proﬁling)
13

Lightweight Monitoring
Performance counters - provide numeric information about the
system and speciﬁc processes (CPU, memory, disk, …)
15


Win32 APIs

GetProcessTimes - provides CPU information for a process

GetProcessMemoryInfo - query for memory usage

And others…
16


Win32 APIs

GetProcessTimes - provides CPU information for a process

GetProcessMemoryInfo - query for memory usage

And others…

Stopwatch - you know what this is :-)
17

Halfway Monitoring Tools
ETW - a structured logging framework incorporated into Windows
which can provide very rich data with relatively low overhead
19


Evaluate your own logs
20


Evaluate your own logs

ClrMD - open source C# library which provides programatic APIs
which until recently were only available using a fully ﬂedged debugger
21

Invasive Analysis APIs
CLR proﬁling APIs - a whole bunch of native APIs which provide lots
of data about what’s going on in your managed (attached) application

AppDomain creation, assembly loading, JIT, thread creation, GC
22



CLR debugging APIs - a whole bunch of native APIs which provide
programmatic APIs to debugging actions

Setting breakpoints, exception notiﬁcations, state modiﬁcation
23



CLR debugging APIs - a whole bunch of native APIs which provide
programmatic APIs to debugging actions

Setting breakpoints, exception notifications, state modification

Hooks - to specific APIs of your choosing
24

(Self) CPU-Proﬁling
Monitor CPU using performance counters

Are we above a certain threshold for a certain amount of time?

Turn on ETW and collect stacks (live using LiveStacks)

Find hot paths, produce ﬂame graphs

Suggest recommendations
26

What Can Be Done?
AuthenticationController takes 95% CPU, maybe we're being
DDoS'ed

Image processing component takes 100% CPU, need to auto-scale
the app

Encoding this 30 second video takes 3 minutes at 100% CPU, tell the
user she can send us a bug report
27

DEMO
MONITOR FOR CPU SPIKES
http://jeremiahblatz.com/personal/pics/Australia_Travel_Pictures_2009/day2/22_Kangaroo_Island_Kangaroo_Portrait_Kangaroo_Island.html (CC BY-NC-SA 1.0)

(Self) GC-Monitoring
Monitor GC performance using performance counters

Register on ETW’s GC events such as GCAllocationTick

Types of objects allocated and their stacks(!)

Number of GCs of each kind and size of reclaimed memory

Duration of GC pauses

Attach ClrMD to get heap breakdown

Generations, segments, reserved/committed, number of objects
29

DEMO
MONITOR FOR ALLOCATION SPIKES
https://www.ﬂickr.com/photos/thegirlsny/2803794829/

(Self) Heap Analysis
Monitor memory usage using performance counters

Has memory increased above a certain threshold for a certain time?

Attach ClrMD to get heap statistics

Compare snapshots

Report which objects are not freed

Combine with ETW GCAllocationTick to get rate of allocation
31

DEMO
FIND A LEAK
https://www.ﬂickr.com/photos/53368913@N05/6038464060 (CC BY-NC-SA 2.0)

(Self) Deadlock Detection
Monitor for potential deadlock

Low CPU

Request timeouts

Increased thread count

Attach ClrMD to create wait chains and detect deadlocks

Report, try to break, pray for a miracle…
33

DEMO
DETECT A DEADLOCK
https://www.ﬂickr.com/photos/34547542@N08/3205125391/ (CC BY 2.0)

Food for Thought
Many more scenarios are possible
35

Food for Thought

Monitor heap fragmentation and compact large objects if needed
36

Food for Thought


Native memory leak analysis using ETW
37

Food for Thought


Native memory leak analysis using ETW

Side notes:

ClrMD is also very suitable for automating crash dump analysis

You can automate opening tickets in bug tracker, consolidate same
issue from diﬀerent users, versions, etc.
38

Not Everything is Perfect
The pros are obvious (visibility, easy scaling…)
39


But there are some cons as well…
40



Adds complexity (reduce risk by using separate process)
41




Adds overhead
42

The pros are obvious



Adds overhead

Requires additional development
43

Summary
Self-monitoring is important for all kinds of software

Best to create a hierarchy of monitoring (and overhead and
complexity)

Lots of scenarios: CPU, GC, memory, deadlocks

Demos: https://github.com/dinazil/self-aware-applications
44

THANK YOU
DINA GOLDSHTEIN,
@DINAGOZIL

Self-Aware Applications: Automatic Production Monitoring (NDC Sydney 2017)

More Related Content

What's hot

Similar to Self-Aware Applications: Automatic Production Monitoring (NDC Sydney 2017)

More from Dina Goldshtein

Recently uploaded

Self-Aware Applications: Automatic Production Monitoring (NDC Sydney 2017)