Monitoring at scale - Intuitive dashboard design

At a certain scale, millions of events happen every second, and all of them matter when evaluating the health of the system. Handled badly, such a volume of information can overwhelm both the infrastructure that has to support it and the people who have to make sense of thousands of signals and act on them, fast. By understanding how our rational mind works and how people process information, we can present data so that it is more evident and intuitive. This talk explains how to collect useful metrics and how to build a monitoring dashboard that organises and displays them, letting our intuition operate automatically and quickly, and saving attention and mental effort for the activities that demand them.


  1. Lorenzo Alberton (@lorenzoalberton). Monitoring at scale: intuitive dashboard design. Make decisions, fast. PHP UK, Saturday 23rd February 2013
  2. Lorenzo Alberton, Chief Technical Architect, DataSift. http://alberton.info, @lorenzoalberton, http://bit.ly/scaleds
  3. Big Data, little clue? Monitoring is crucial. http://www.flickr.com/photos/mrflip/5150336351/lightbox/
  4. Complex architectures
  5. Identify (and prevent) failures. No output data: where is the problem?
  7. Monitoring mindset. "You can't control what you can't measure" (Tom DeMarco). Design systems to be monitored. Observe patterns and automate most things. Good reporting: the difference between noticing and not having a clue. http://www.threesixtymag.co.uk/2012/12/state-of-mind-tee/
  8. Monitoring mindset. The hardest part. Good reporting: the difference between noticing and not having a clue.
  9. Dashboard Design. Learning the appropriate language.
  10. Dashboard: what is it? A tool to display PIs and KPIs, quantitative analysis. Immediacy, intuitiveness and appropriate context.
  11. Operational, Strategic, Analytic. Operational: monitors functions that need constant, real-time, minute-by-minute attention; immediacy and practicality; what is going on right now; no statistics or analysis required. Strategic: quick overview of an organization's health; assists with executive decisions; what is going on right now is not important, what is pressing is what has been going on; doesn't require real-time data. Analytic: comparisons, reviewing extensive histories, evaluating performance; assists with data analysis.
  12. Multiple dashboard views. Operational: Ops / Engineering. Strategic: CEO / CIO. Analytic: Marketing / Accountancy. Different view for each audience: keep metrics relevant to each group.
  13. Multiple dashboard views. Operational (Ops / Engineering): this talk is about this one (but the others are important too).
  14. Effective Monitoring. Understanding how we think.
  15. Thinking, Fast and Slow
  16. A tale of two systems. Intuition: operates automatically and quickly, with little or no effort and no sense of voluntary control (2 + 2 = ?); involuntary, fast, effortless, invisible. Reasoning: consciously allocates attention to the effortful mental activities that demand it (216 × 725 = ?); voluntary, slow, difficult, visible.
  17. A tale of two systems. Intuition operates automatically and quickly, with little or no effort and no sense of voluntary control (2 + 2 = ?); involuntary, fast, effortless, invisible. Monitoring should rely on System 1.
  18. A tale of two systems. Reasoning consciously allocates attention to the effortful mental activities that demand it (216 × 725 = ?); voluntary, slow, difficult, visible. System 2 regulates our intuition and is ready to jump in when attention is required.
  19. Model "Normality". http://www.flickr.com/photos/fwooper7/4942474212/
  20. Be surprised by anomalies. http://animal.discovery.com/tv-shows/wild-kingdom/about-animals/lions-elephant-hunters-pictures.htm
  21. Create surprise with alerts
  23. Over-use of color. [Chart: Revenue vs. Goal, Jan to Jun] Only attract attention when things go bad.
  25. Dashboard best practices. Show, don't tell. Keep text/numbers to a minimum.
  26. Clarity and immediacy FTW. Charles Joseph Minard, Napoleon's March on Moscow. "Probably the best statistical graphic ever drawn" (Edward Tufte). http://www.edwardtufte.com/tufte/posters
  27. Clarity and immediacy FTW. The same graphic, with "best" amended to "worst".
  29. Graphs fit short-term memory. Sales table, Jan to Jul: US 23923, 21695, 20032, 24030, 24302, 25032, 26203; EU 14390, 16400, 17303, 21900, 23547, 20142, 27321. Give values a visual shape. [Line chart of the same US and EU sales data]
  30. Dashboard best practices. Communicate with clarity. Simplicity is key.
  31. Dashboard design mistakes
  32. Busy Dashboards Are Busy. http://img.photobucket.com/albums/v254/tomklipp/Misc/C-130e-flight-station.jpg
  33. Dashboard design mistakes. Too much data, too little information. At a glance, tell if there's a problem; don't give a precise analysis.
  34. The only thing I want to know: everything is alright. http://www.x929.ca/shows/newsboy/?cat=28&paged=2
  35. Attention as a limited resource. http://www.climateshifts.org/wp-content/uploads/2010/12/coal_hands.jpg
  36. Attention has a limited budget. Attention depletion. Leverage intuition whenever possible.
  37. Strain and effort ➔ Heuristics. If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets? The first answer that comes to mind: 100! The correct answer: 5 minutes (each machine makes one widget in 5 minutes). Tendency to answer questions with the first idea that comes to mind, without checking it.
  41. Swap out difficult tasks for easier ones. Heuristic, n.: a simple procedure that helps find adequate, though often imperfect, answers to difficult questions.
  42. Human-centric software? Attention is LAZY. Too subtle: didn't notice. Too tired: didn't care.
  44. Let the visual cortex do the work. http://chariotsolutions.com/presentations/the-programming-ape
  45. Dashboard best practices. Organise information to support meaning. Apply the latest understanding of human visual perception to the visual presentation of information.
  46. Organised by means of production: CPU load, DB queries, bandwidth. BAD
  47. Organised by context: Shopping Cart, Product Catalog, Auth Service (each with memory, traffic, DB). BETTER
  49. Correlate events to add context. Releases and events (Feature X, TV ads, hotfix) overlaid on performance, last 7 days. Symptoms: 5% of users locked out; DB load -40%; 90th percentile latency +730%.
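
A cheap way to get those release/event markers onto the same time axis as the metrics is to emit a counter at deploy time and overlay it on the performance charts. A minimal sketch, reusing the StatsD client from the instrumentation example later in the deck; the metric name is hypothetical:

    <?php
    // Emit a deploy marker at release time so it can be overlaid on the
    // performance graphs and correlated with symptoms (hypothetical name).
    StatsD::increment('deploys.feature_x');
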
  50. Dashboard best practices. Reduce visual noise: clutter, distractions, clichés, animations and embellishments create confusion.
  51. Gauges / Speedometers: 3D effect, glass reflection, bouncing needle, ... Bacon?
  57. (3D) Pie charts. The size of round areas is difficult to evaluate. Distortion in the perceived size (and value) of the data. They sacrifice accuracy for aesthetic appeal. http://www.dashboardinsight.com/articles/digital-dashboards/building-dashboards/the-case-against-3d-charts-in-dashboards.aspx
  58. Pie chart vs. bar chart. [Pie and bar chart of the same six values: A 27%, B 23%, C 22%, D 16%, E 6%, F 5%] About the same screen estate. Easier to compare the size of bars (i.e. the value of the data).
  62. Mind tricks
  63. Mind tricks. WHAT I IF TOLD YOU / YOU READ THAT WRONG. http://www.quora.com/Optical-Illusions/What-are-some-great-optical-illusions
  64. A machine for jumping to conclusions. WYSIATI: What You See Is All There Is. Intuitive thinking jumps to conclusions on the basis of limited evidence.
  65. Neglect of ambiguity. Suppression of doubt.
  67. Neglect of ambiguity. "Ann approached the bank." Fabrication of coherent stories. http://www.flickr.com/photos/27000501@N08/5613967601
  70. WYSIATI and the need for more data. [Chart: data throughput across Server 1, Server 2, Server 3]
  71. Same chart: "oh cr*p."
  72. "Surely, we're losing data :-( No doubt about it."
  73. "Wait, all other metrics are OK..."
  74. Platform OK. The metrics couldn't reach the stats server (the stats server was rebooted without its eth1 interface).
  75. Multiple perspectives / facets. Examine data from multiple perspectives simultaneously (one of them will hopefully make sense). Uncover meaningful relationships that exist in the data.
  76. Grids / Crosstabs. [Grid of small multiples: failures by type (Out of Memory, Timeout, Unreachable) vs. service (Auth Mgr, Product Catalog, Shopping Cart), US and EU series, 0-20K scale]
  80. Halo effect / biases. Judgement influenced by previous information. Information processed earlier might skew our perception of new data. No evidence required to jump to conclusions.
  81. Halo effect / biases. [Bar chart: C++, Java, C++, Ruby, R; garbage collection, 0-80 scale]
  82. Biases are stronger than hard evidence. Data in, components A (C++) and B (Java), no data out. Which component is broken: A or B? Don't guess, look at the metrics!
  85. Priming effect: WASH
  86. Priming effect: S _ AP
  87. Priming effect: SOAP
  88. Priming effect: SLAP
  89. Priming effect: SNAP
  90. Priming effect: SWAP
  91. Pattern detection. Colors, shapes, sounds: GOOD vs. BAD. Our brain is good at creating associations and detecting patterns. http://www.vladstudio.com/wallpaper/?violin
  92. Shapes that create emotions
  94. Normalise data, keep patterns consistent. [Normalised charts]
  95. Going Real-Time
  96. Monitoring at different levels. UX / business metrics: is there a problem? System monitors: where is the problem? Application monitors: what is the problem?
  99. Instrumentation: Monitoring + Alerting. Unconventional alerting tools can be surprisingly effective. www.android-zenoss.info
  101. Getting started with monitoring. Monigusto: a single-server box that contains the most common/current tools for monitoring, like Graphite, StatsD, collectd, Nagios, Logstash, jmxtrans, Tasseo, gdash, Librato and Sensu. https://github.com/monigusto Real-Time Graphing with Graphite: http://bit.ly/rt-graphite
  102. StatsD + Graphite example. StatsD: a Node.js daemon that listens for messages on a UDP port and extracts metrics, which are dumped to Graphite for further processing and visualisation. Graphite: a real-time graphing system. Data is sent to carbon (the processing back-end), which stores it in Graphite's database; data is visualised via Graphite's web interface.
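
Under the hood, a StatsD client is just a fire-and-forget UDP datagram in the format name:value|type, where |ms marks a timer and |c a counter. A minimal sketch of such a sender in PHP, with host and port assumed to match the statsd.ini on the next slide:

    <?php
    // Minimal fire-and-forget StatsD sender: one metric per UDP datagram.
    // UDP means no blocking and no failures propagated into the application.
    function statsd_send($message, $host = 'yourhost', $port = 8125)
    {
        $socket = socket_create(AF_INET, SOCK_DGRAM, SOL_UDP);
        if ($socket === false) {
            return; // monitoring should never break the app
        }
        socket_sendto($socket, $message, strlen($message), 0, $host, $port);
        socket_close($socket);
    }

    statsd_send('workerX.processing_time:42|ms');      // a timer sample, in ms
    statsd_send('workerX.received.type.comment:1|c');  // a counter increment
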
  103. StatsD metrics (https://github.com/etsy/statsd/). Configuration plus instrumentation; the metric names define a hierarchy of event names:

    ; statsd.ini
    [statsd]
    host = yourhost
    port = 8125

    <?php
    foreach ($items as $item) {
        // time how long it takes to process this item...
        $time_start = microtime(true);
        // ... process item here ...
        $time = (int) (1000 * (microtime(true) - $time_start));
        StatsD::timing('workerX.processing_time', $time); // in ms

        // count items by type
        StatsD::increment('workerX.received.type.' . $item['type']);
    }
  106. Graphite output: workerX.processing_time.mean, workerX.processing_time.90percentile. http://graphite.wikidot.com/
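
The same series can then be pulled out of Graphite's render API as an image or as JSON for embedding in a dashboard; the hostname is assumed, and the targets follow the names above:

    http://graphite.example.com/render?target=workerX.processing_time.mean&from=-24h&format=png
    http://graphite.example.com/render?target=workerX.processing_time.90percentile&from=-24h&format=json
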
  107. Understanding distribution. Why averages suck.
  108. Bell curve: a "normal" distribution of response times. Average = median, i.e. the observed performance represents the majority of the transactions. [Chart: # of requests vs. response time, below/above average] http://apmblog.compuware.com/2012/11/14/why-averages-suck-and-percentiles-are-great/
  109. Bell curve, alerting levels. Standard deviation: 33% of transactions, with the mean as the middle (within 1 standard deviation of the mean).
  110. Bell curve, alerting levels. 2x standard deviation: 66% of transactions, the majority (within 2 times the standard deviation of the mean).
  111. Bell curve, alerting levels. Everything outside 2 times the standard deviation of the mean is an outlier.
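
A rough sketch of those alerting levels in PHP (the sample values are invented): compute the mean and standard deviation over a window of response times and flag anything outside two standard deviations as an outlier.

    <?php
    // Flag samples outside 2 standard deviations of the mean (slide 111).
    function mean(array $xs)
    {
        return array_sum($xs) / count($xs);
    }

    function std_dev(array $xs)
    {
        $m = mean($xs);
        $sum = 0;
        foreach ($xs as $x) {
            $sum += ($x - $m) * ($x - $m);
        }
        return sqrt($sum / count($xs));
    }

    $times = array(120, 130, 125, 118, 122, 640); // response times in ms (made up)
    $m  = mean($times);
    $sd = std_dev($times);
    foreach ($times as $t) {
        if (abs($t - $m) > 2 * $sd) {
            echo "outlier: {$t}ms\n"; // only the 640ms sample is flagged
        }
    }
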
  112. "Normal" vs. real distribution. Real life: a few very heavy outliers and a long tail. Median ≠ average: the average looks a lot faster than most transactions; ~20% of transactions are very fast. [Chart: number of requests vs. response time, marking average, 20th percentile and median] http://apmblog.compuware.com/2012/11/14/why-averages-suck-and-percentiles-are-great/
  113. Averages vs. percentiles. [Chart: load time (ms), 8AM to 4PM: average, 50th percentile, 90th percentile] Percentiles allow us to understand the distribution. The 50th percentile is more stable than the average.
  115. Automatic baselining and alerts. [Same chart with a threshold X] Alert if the standard deviation of the 50th percentile is over X.
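
A sketch of that alert rule, assuming a nearest-rank percentile; $windows (the per-interval samples), the threshold and trigger_alert() are placeholders, and std_dev() is the helper from the earlier sketch:

    <?php
    // Nearest-rank percentile over one window of samples.
    function percentile(array $xs, $p)
    {
        sort($xs);
        $idx = (int) ceil(($p / 100) * count($xs)) - 1;
        return $xs[max(0, $idx)];
    }

    // One 50th-percentile sample per interval, e.g. per minute.
    $p50_history = array();
    foreach ($windows as $samples) { // $windows: array of per-interval load times (assumed)
        $p50_history[] = percentile($samples, 50);
    }

    // "Alert if the std deviation of the 50th percentile is over X"
    $threshold_x = 25; // ms, hypothetical
    if (std_dev($p50_history) > $threshold_x) {
        trigger_alert('50th percentile load time is unstable'); // placeholder
    }
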
  117. Tips and tricks. Patterns our brain should recognise.
  118. Normalise + add baseline. Let machines determine the baseline.
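
A sketch of the normalisation step: express each series as a percentage of its own baseline, so unrelated metrics share one visual scale and deviations from "normal" stand out. Where the baseline comes from (e.g. last week's series for the same period) is an assumption:

    <?php
    // Express each sample as a percentage of its baseline: 100 means "normal".
    function normalise(array $series, array $baseline)
    {
        $out = array();
        foreach ($series as $i => $v) {
            $out[] = $baseline[$i] > 0 ? 100 * $v / $baseline[$i] : 0;
        }
        return $out;
    }

    // $cpu_now / $cpu_last_week etc. are placeholders for fetched series.
    $cpu  = normalise($cpu_now, $cpu_last_week);
    $iops = normalise($iops_now, $iops_last_week);
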
  123. Anomaly detection in fluctuating traffic. [Chart: IOPS]
  126. Derivative (detect big spikes). [Chart: derivative(IOPS), with normal fluctuation marked OK and anomalies highlighted]
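
A sketch of the same trick client-side: plot the difference between consecutive samples instead of the raw series, so large jumps stand out regardless of the absolute level (the threshold here is arbitrary). Graphite itself offers a derivative() render function that does this server-side.

    <?php
    // Discrete derivative: difference between consecutive samples.
    function derivative(array $xs)
    {
        $d = array();
        for ($i = 1, $n = count($xs); $i < $n; $i++) {
            $d[] = $xs[$i] - $xs[$i - 1];
        }
        return $d;
    }

    $iops = array(220, 230, 225, 228, 950, 232); // made-up IOPS samples
    foreach (derivative($iops) as $i => $delta) {
        if (abs($delta) > 500) { // arbitrary spike threshold
            echo "spike between samples {$i} and " . ($i + 1) . "\n";
        }
    }
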
  129. Different visuals to spot differences: stacked area.
  131. Different visuals to spot differences: overlapping lines.
  133. Flattening effect: saturation of a resource or discontinuation of flow. (Slawek Ligus, "Effective Monitoring and Alerting", O'Reilly 2012)
  135. Regular anomalies: check your cron jobs. (Slawek Ligus, "Effective Monitoring and Alerting", O'Reilly 2012)
  137. Advanced Heatmaps
  138. Heat-Maps
  140. Look! Rib cages! Network load visualisation. http://www.network-weathermap.com/ http://cacti.net
  141. 10-40Gb links, bandwidth monitor. Great, but not enough: contextualise metrics. http://www.network-weathermap.com/ http://cacti.net
  143. HeatMaps: Cacti + Weathermap. Cacti: a network graphing solution harnessing the power of RRDTool's data storage and graphing functionality; provides a fast poller, graph templating and multiple data acquisition methods. Weathermap: a Cacti plugin that integrates network maps into the Cacti web UI; includes a web-based map editor.
  144. Network throughput / latency. [Weathermap: per-link message rates, e.g. 5320/s, 2954/s, 345/s] Annotations: augmentation service timing out? Consumer slower than producer? Graphite datasource for Weathermap: https://github.com/alexforrow/php-weathermap-graphite
  147. Server load: memory, CPU, disk... [Weathermap with one node at 500% load] Annotations: CPU/memory overload on a filtering node? Slow DB queries? Disk storage running out of space? Graphite datasource for Weathermap: https://github.com/alexforrow/php-weathermap-graphite
  151. Conclusions. Almost beer time...
  152. Guidelines: dashboards for humans. Make the subtle obvious. Make the complex/busy simple/clean. Group data by context, not by means of production. Detect anomalies and deviations from the norm. Turn raw numbers into graphs. Appeal to intuition, conserve attention.
  153. References: http://www.alberton.info/talks. Daniel Kahneman, "Thinking, Fast and Slow", Penguin Books 2012; Slawek Ligus, "Effective Monitoring and Alerting", O'Reilly 2012; Stephen Few, http://www.perceptualedge.com/; http://www.dashboardinsight.com; Coda Hale, "The Programming Ape".
  154. We're Hiring! http://datasift.com/about-us/careers lorenzo@datasift.com
  155. Lorenzo Alberton (@LorenzoAlberton). Thank you! lorenzo@alberton.info http://www.alberton.info/talks http://joind.in/8060
