At a certain scale, millions of events happen every second, and all of them matter when evaluating the health of the system. If not handled correctly, such a volume of information can overwhelm both the infrastructure that has to support it and the people who have to make sense of thousands of signals and make decisions based on them, fast. By understanding how our rational mind works and how people process information, we can present data so that it is more evident and intuitive. This talk explains how to collect useful metrics and how to create the perfect monitoring dashboard to organise and display them, letting our intuition operate automatically and quickly and saving attention and mental effort for the activities that demand it.
7. Monitoring mindset
“You can’t control what you can’t measure” (Tom DeMarco)
Design systems to be monitored
Observe patterns and automate most things
Good reporting: difference between noticing and not having a clue
http://www.threesixtymag.co.uk/2012/12/state-of-mind-tee/
8. Monitoring mindset
The hardest part
Good reporting: difference between noticing and not having a clue
10. Dashboard: what is it?
Tool to display
PIs and KPIs
quantitative analysis
Immediacy, intuitiveness
and appropriate context
11. Operational Strategic Analytic
Operational:
    monitors functions which need constant, real-time, minute-by-minute attention
    immediacy and practicality
    no statistics or analyzing
Strategic:
    quick overview of an organization’s health
    assist with executive decisions
    what is going on right now is not important - what is pressing is what has been going on
Analytic:
    comparisons, reviewing extensive histories, evaluating performance
    assists with data analysis
    doesn’t require real-time data
12. Multiple dashboard views
Operational: Ops / Engineering
Strategic: CEO / CIO
Analytic: Marketing / Accountancy
Different view for each audience: keep metrics relevant to each group
13. Multiple dashboard views
Operational: Ops / Engineering
This talk is about this one (but the others are important too)
16. A tale of two systems
Intuition: operates automatically and quickly, with little or no effort and no sense of voluntary control
    2+2=?
    involuntary, fast, effortless, invisible
Reasoning: consciously allocates attention to the effortful mental activities that demand it
    216 × 725 = ?
    voluntary, slow, difficult, visible
17. A tale of two systems
Intuition (System 1): operates automatically and quickly, with little or no effort and no sense of voluntary control
    2+2=?
    involuntary, fast, effortless, invisible
Monitoring should rely on System 1
18. A tale of two systems
Reasoning (System 2): consciously allocates attention to the effortful mental activities that demand it
    216 × 725 = ?
    voluntary, slow, difficult, visible
System 2 regulates our intuition and is ready to jump in when attention is required
19. Model “Normality”
http://www.flickr.com/photos/fwooper7/4942474212/
20. Be surprised by anomalies
http://animal.discovery.com/tv-shows/wild-kingdom/about-animals/lions-elephant-hunters-pictures.htm
26. Clarity and immediacy FTW
Charles Joseph Minard, Napoleon’s March on Moscow
“Probably the best statistical graphic ever drawn” - Edward Tufte
http://www.edwardtufte.com/tufte/posters
27. Clarity and immediacy FTW
Charles Joseph Minard, Napoleon’s March on Moscow
worst
“Probably the best statistical graphic ever drawn” - Edward Tufte
http://www.edwardtufte.com/tufte/posters
28. Graphs fit short-term memory
Sales    Jan    Feb    Mar    Apr    May    Jun    Jul
US     23923  21695  20032  24030  24302  25032  26203
EU     14390  16400  17303  21900  23547  20142  27321
29. Graphs fit short-term memory
Give values a visual shape
[Line chart: Sales (y-axis, 10000 to 30000) by month (x-axis, Jan to Aug), one line each for US and EU, plotting the same US and EU sales figures as the table]
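A minimal sketch of "giving values a visual shape", assuming a text-only terminal instead of a real charting tool. It renders the US and EU sales figures from the table as one-line Unicode sparklines on a shared scale. The `sparkline` helper and the `BLOCKS` constant are hypothetical names introduced for this illustration, not part of any tool mentioned in the talk.

```python
# Hypothetical helper: render a numeric series as a one-line Unicode sparkline.
BLOCKS = "▁▂▃▄▅▆▇█"


def sparkline(values, lo=None, hi=None):
    """Map each value to one of eight block characters between lo and hi."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    span = (hi - lo) or 1  # avoid division by zero for a flat series
    last = len(BLOCKS) - 1
    return "".join(BLOCKS[round((v - lo) / span * last)] for v in values)


# Sales figures copied from the slide (Jan to Jul).
sales = {
    "US": [23923, 21695, 20032, 24030, 24302, 25032, 26203],
    "EU": [14390, 16400, 17303, 21900, 23547, 20142, 27321],
}

# Share one scale so that the two rows are visually comparable.
all_values = [v for series in sales.values() for v in series]
lo, hi = min(all_values), max(all_values)

for region, series in sales.items():
    print(region, sparkline(series, lo, hi))
```

Even in this crude form, the EU dip in June and the gap between the two regions are visible at a glance, which is hard to get from fourteen five-digit numbers.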
32. Busy Dashboards Are Busy
http://img.photobucket.com/albums/v254/tomklipp/Misc/C-130e-flight-station.jpg
33. Dashboard design mistakes
Too much data,
too little information
At a glance, tell if there’s a
problem, not a precise analysis
34. The only thing I want to know
Everything is alright
http://www.x929.ca/shows/newsboy/?cat=28&paged=2
35. Attention as limited resource
http://www.climateshifts.org/wp-content/uploads/2010/12/coal_hands.jpg
36. Attention has a limited budget
Attention
depletion
Leverage intuition
whenever possible
37. Strain and effort ➔ Heuristics
It takes 5 machines 5 minutes to make 5 widgets,
how long would it take
100 machines to make 100 widgets?
40. Strain and effort ➔ Heuristics
Tendency to answer questions with the first idea that comes to mind, without checking it
It takes 5 machines 5 minutes to make 5 widgets,
how long would it take
100 machines to make 100 widgets?
Intuitive answer: 100 minutes
Correct answer: 5 minutes
41. Swap out difficult tasks for easier ones
Heuristic, n.
simple procedure that helps find
adequate, though often imperfect,
answers to difficult questions.
44. Let the visual cortex do the work
http://chariotsolutions.com/presentations/the-programming-ape
45. Dashboard best practices
Organise information
to support meaning
Apply the latest understanding of human visual perception
to the visual presentation of information
47. Organised by context
[Dashboard grid. Columns (context): Shopping Cart, Product Catalog, Auth Service. Rows: Memory, Traffic, DB. Label: BETTER]
49. Correlate events to add context
Releases / Events: Feature X, hotfix, TV Ads
Performance: Last 7 Days
Symptoms: 5% users locked out, DB load -40%, 90th percentile latency +730%
57. (3D) Pie charts
[Figure: 3D pie charts with percentage labels]
Size of round areas difficult to evaluate
Distortion in the perceived size (and value of data)
They sacrifice accuracy for aesthetic appeal
http://www.dashboardinsight.com/articles/digital-dashboards/building-dashboards/the-case-against-3d-charts-in-dashboards.aspx
58. Pie chart vs. Bar chart
[Figure: the same data shown as a pie chart and as a bar chart (bar axis 0 to 100)]
A  27%
B  23%
C  22%
D  16%
E   6%
F   5%
59. Pie chart vs. Bar chart
About the same screen real estate
[Figure: pie chart and bar chart of A to F, side by side]
61. Pie chart vs. Bar chart
Easier to compare size of bars
(i.e. the value of the data)
[Figure: pie chart and bar chart of A to F, side by side]
63. Mind tricks
WHAT I IF TOLD
YOU
YOU READ THAT
WRONG
http://www.quora.com/Optical-Illusions/What-are-some-great-optical-illusions
64. A machine for jumping to conclusions
WYSIATI: What You See Is All There Is
Intuitive thinking jumps to conclusions on the basis of limited evidence
67. Neglect of ambiguity
Ann approached the bank
Fabrication of coherent stories
http://www.flickr.com/photos/27000501@N08/5613967601
70. WYSIATI and the need for more data
[Graph: Data Throughput for Server 1, Server 2 and Server 3]
First reaction: “Oh cr*p. Surely, we’re losing data :-( No doubt about it.”
Second look: “Wait, all other metrics are OK....”
Actual cause: Platform OK. Metrics couldn’t reach the stats server. (Stats server rebooted without eth1 interface)
75. Multiple perspectives / facets
Examine data from multiple perspectives simultaneously (one of them will hopefully make sense)
Uncover meaningful relationships that exist in the data
76. Grids / Crosstabs
[Grid of small charts (crosstab). Columns, failures by service: Auth Mgr, Product Catalog, Shopping Cart. Rows, failures by type: Out Of Memory, Timeout, Unreachable. Each cell plots US and EU failures on a 0 to 20K scale.]
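The grid on this slide is a crosstab: failures counted per (type, service) cell and split by region. A minimal sketch of that aggregation follows, assuming failure events arrive as simple tuples. The event data and the `crosstab` helper are hypothetical and only illustrate the shape of the computation; a real dashboard would read the counts from its metrics store.

```python
from collections import Counter

# Hypothetical failure events: (failure type, service, region).
events = [
    ("Timeout", "Auth Mgr", "EU"),
    ("Timeout", "Auth Mgr", "EU"),
    ("Timeout", "Shopping Cart", "US"),
    ("Out Of Memory", "Product Catalog", "EU"),
    ("Unreachable", "Auth Mgr", "US"),
]


def crosstab(events):
    """Count failures per (type, service) cell, split by region."""
    cells = {}
    for failure_type, service, region in events:
        cells.setdefault((failure_type, service), Counter())[region] += 1
    return cells


grid = crosstab(events)
types = sorted({e[0] for e in events})
services = sorted({e[1] for e in events})

# Print one row per failure type and one column per service.
print(f"{'':15}" + "".join(f"{s:>22}" for s in services))
for t in types:
    row = []
    for s in services:
        counts = grid.get((t, s), Counter())
        row.append(f"US={counts['US']} EU={counts['EU']}")
    print(f"{t:15}" + "".join(f"{cell:>22}" for cell in row))
```

Keeping every cell on the same scale, as the slide does with 0 to 20K, is what lets the eye compare cells directly.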
80. Halo effect - Biases
Judgement influenced by previous information
Information processed earlier might skew our perception of new data.
No evidence required to jump to conclusions.
81. Halo effect - Biases
[Bar chart titled “Garbage Collection”, scale 0 to 80; labels include C++, Java, Ruby and R]
84. Biases stronger than hard evidence
[Diagram: Data In → A (C++) → B (Java) → No Data Out]
Which component is broken? A or B?
Don’t guess, look at metrics!!!
91. Pattern detection
Colors Shapes Sounds
GOOD
BAD
Our brain is good at creating associations and detecting patterns
http://www.vladstudio.com/wallpaper/?violin
98. Monitoring At Different Levels
UX / Business metrics
Is there a problem?
System monitors
Where is the problem?
Application monitors
What is the problem?
101. Getting started with monitoring
Monigusto
A single-server box that contains the most
common/current tools for monitoring like
graphite, statsd, collectd, nagios, logstash,
jmxtrans, tasseo, gdash, librato and sensu
https://github.com/monigusto
Real-Time Graphing With Graphite
http://bit.ly/rt-graphite
102. StatsD + Graphite
Example
StatsD: Node.js daemon. Listens for messages on a UDP port and extracts metrics, which are sent to Graphite for further processing and visualisation.
Graphite: real-time graphing system. Data is sent to carbon (the processing back-end), which stores it in Graphite’s database. Data is visualised via Graphite’s web interface.
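To make the "messages over a UDP port" part concrete, here is a minimal client sketch. It assumes the plain-text StatsD line format, `name:value|ms` for timers and `name:value|c` for counters, with one metric per datagram. The helper names and the `127.0.0.1` default host are hypothetical choices for this illustration, and the metric names reuse the `workerX` examples from this talk.

```python
import socket


def format_timing(name, millis):
    """Build a StatsD timer line, e.g. 'workerX.processing_time:42|ms'."""
    return f"{name}:{int(millis)}|ms"


def format_counter(name, delta=1):
    """Build a StatsD counter line, e.g. 'workerX.received.type.order:1|c'."""
    return f"{name}:{int(delta)}|c"


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget: one UDP datagram per metric, no reply expected."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Only the formatting is exercised here; call send_metric(...) with one of
# these lines to emit it to a StatsD daemon.
print(format_timing("workerX.processing_time", 42))
print(format_counter("workerX.received.type.order"))
```

Because UDP needs no connection and no acknowledgement, instrumenting a hot code path this way adds very little overhead, and a stats daemon that is down does not take the application with it.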
103. StatsD metrics
; statsd.ini
[statsd]
host = yourhost
port = 8125

<?php
foreach ($items as $item) {
    // time how long it takes to process this item...
    $time_start = microtime(true);

    // ... process item here ...

    $time = (int)(1000 * (microtime(true) - $time_start));
    StatsD::timing('workerX.processing_time', $time); // in ms

    // count items by type
    StatsD::increment('workerX.received.type.'.$item['type']);
}
https://github.com/etsy/statsd/
105. StatsD metrics
Define a hierarchy of event names:
    workerX.processing_time
    workerX.received.type.<item type>
https://github.com/etsy/statsd/
108. Bell curve
“Normal” distribution of response times:
Average = Median, i.e. observed performance represents the majority of the transactions
[Figure: bell curve; x-axis: response time (below average to above average); y-axis: # of requests; Average / Median at the peak]
http://apmblog.compuware.com/2012/11/14/why-averages-suck-and-percentiles-are-great/
109. Bell curve - Alerting levels
[Figure: bell curve; x-axis: response time; y-axis: # of requests; Median at the peak]
Within 1 standard deviation of the mean: about 68% of transactions, with the mean in the middle
110. Bell curve - Alerting levels
Within 2 standard deviations of the mean: about 95% of transactions (the vast majority)
111. Bell curve - Alerting levels
Everything outside 2 standard deviations of the mean: outlier
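For an exactly normal distribution, the share of transactions within k standard deviations of the mean follows from the error function: P(|X - mean| < k·sigma) = erf(k / sqrt(2)). The short sketch below evaluates it for the alerting bands; `share_within` is a hypothetical helper name, and real response times are rarely this well behaved.

```python
import math


def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} std dev of the mean: {share_within(k):.1%}")
```

Under that assumption, treating everything beyond 2 standard deviations as an outlier flags roughly 1 transaction in 20.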
112. “Normal” vs. Real distribution
Real life: few very heavy outliers and long tail
Median ≠ Average
The average looks a lot faster than most transactions
~20% of very fast transactions
[Histogram: number of requests (0 to 8) by response time (1 to 12), with the 20th percentile, Average and Median marked]
http://apmblog.compuware.com/2012/11/14/why-averages-suck-and-percentiles-are-great/
114. Averages vs. Percentiles
[Line chart: load time (ms), 0 to 200, from 8AM to 4PM; series: Average, 50th percentile, 90th percentile]
Percentiles allow us to understand the distribution
The 50th percentile is more stable than the average
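A small sketch of why the 50th percentile is steadier than the average, using a hypothetical list of load times with a single heavy outlier. The `percentile` helper uses the nearest-rank definition, which is one of several common conventions; monitoring tools may interpolate differently.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical load times in ms: nine ordinary requests and one heavy outlier.
load_times_ms = [48, 50, 51, 52, 53, 55, 56, 58, 60, 900]

print("average:", statistics.mean(load_times_ms))
print("50th percentile:", percentile(load_times_ms, 50))
print("90th percentile:", percentile(load_times_ms, 90))
```

The single 900 ms request drags the average far above what nine out of ten users experienced, while the 50th and 90th percentiles stay close to the typical values.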
116. Automatic Baselining and Alerts
[Line chart: load time (ms), 0 to 200, from 8AM to 4PM; series: 50th percentile and 90th percentile; threshold X marked]
Alert if std deviation of 50th percentile is over X
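One possible reading of this rule, sketched under stated assumptions: keep a sliding window of recent 50th-percentile samples and raise an alert when the sample standard deviation of that window exceeds the threshold X. The class name, the window size and the sample values are all hypothetical; the slide does not specify how the baseline is computed.

```python
import statistics
from collections import deque


class MedianDriftAlert:
    """Alert when the std deviation of recent p50 samples exceeds a threshold."""

    def __init__(self, threshold_ms, window=4):
        self.threshold_ms = threshold_ms
        self.samples = deque(maxlen=window)  # old samples fall off automatically

    def observe(self, p50_ms):
        """Record one p50 sample and return True if the alert should fire."""
        self.samples.append(p50_ms)
        if len(self.samples) < 2:
            return False  # a std deviation needs at least two samples
        return statistics.stdev(self.samples) > self.threshold_ms


alert = MedianDriftAlert(threshold_ms=10, window=4)
for p50 in [50, 52, 51, 49, 120]:  # hypothetical per-interval medians in ms
    fired = alert.observe(p50)
    print(p50, "ALERT" if fired else "ok")
```

Because the baseline comes from the data itself, the same rule can be applied to metrics with very different absolute values without hand-tuning a fixed limit for each one.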
142. 10-40GB links - Bandwidth monitor
Great, but not enough!
Contextualise
metrics
http://www.network-weathermap.com/
http://cacti.net
143. HeatMaps: Cacti + WeatherMap
Cacti: Network graphing solution harnessing the power of RRDTool’s
data storage and graphing functionality. Provides a fast poller, graph
templating, multiple data acquisition methods.
Weathermap: Cacti plugin to integrate network maps into the
Cacti web UI. Includes a web-based map editor.
147. Server load: memory, CPU, disk...
[Weathermap of server load; one element shows 500%]
Questions the map prompts:
CPU/memory overload on filtering node?
Slow DB queries?
Disk storage running out of space?
Graphite datasource for Weathermap: https://github.com/alexforrow/php-weathermap-graphite
152. Guidelines: dashboards for humans
Make the subtle obvious
Make the complex/busy simple/clean
Group data by context, not means of production
Detect anomalies/deviation from norm
Turn raw numbers into graphs
Appeal to intuition, conserve attention
153. References
http://www.alberton.info/talks
Daniel Kahneman, “Thinking, Fast and Slow”, Penguin Books 2012
Slawek Ligus, “Effective Monitoring and Alerting”, O’Reilly 2012
Stephen Few - http://www.perceptualedge.com/
http://www.dashboardinsight.com
Coda Hale, The Programming APE