Monitoring at scale - Intuitive dashboard design

At a certain scale, millions of events happen every second, and all of them are important to evaluate the health of the system. If not handled correctly, such a volume of information can overwhelm both the infrastructure that needs to support them, and people who have to make a sense out of thousands of signals and make decisions upon them, fast. By understanding how our rational mind works, how people process information, we can present data so it's more evident and intuitive. This talk will explain how to collect useful metrics, and to create the perfect monitoring dashboard to organise and display them, letting our intuition operate automatically and quickly, and saving attention and mental effort to activities that demand it.


Transcript

  • 1. Lorenzo Alberton @lorenzoalberton — Monitoring at scale: intuitive dashboard design. Make decisions, fast. PHP UK, Saturday 23rd February 2013
  • 2. Lorenzo Alberton Chief Technical Architect, DataSift http://alberton.info @lorenzoalberton http://bit.ly/scaleds 2
  • 3. Big Data, little clue? Monitoring is crucial http://www.flickr.com/photos/mrflip/5150336351/lightbox/ 3
  • 4. Complex architectures 4
  • 5. Identify (and prevent) failures. No output data: where is the problem?
  • 7. Monitoring mindset. “You can’t control what you can’t measure” (Tom DeMarco). Design systems to be monitored. Good reporting: the difference between noticing and not having a clue. Observe patterns and automate most things. http://www.threesixtymag.co.uk/2012/12/state-of-mind-tee/
  • 8. Monitoring mindset The hardest part Good reporting: difference between noticing and not having a clue 7
  • 9. Dashboard Design Learning the appropriate language 8
  • 10. Dashboard: what is it? A tool to display PIs and KPIs for quantitative analysis, with immediacy, intuitiveness and appropriate context.
  • 11. Operational: monitors functions which need constant, real-time, minute-by-minute attention; immediacy and practicality; what is going on right now. Strategic: quick overview of an organization’s health; assists with executive decisions; no statistics or analyzing; what is pressing is what has been going on, so it doesn’t require real-time data. Analytic: comparisons, reviewing extensive histories, evaluating performance; assists with data analysis; real-time is not important.
  • 12. Multiple dashboard views Operational: Strategic: Analytic: Ops / Engineering CEO / CIO Marketing / Accountancy Different view for each audience: keep metrics relevant to each group 11
  • 13. Multiple dashboard views Operational: Ops / Engineering This talk is about this one (but the others are important too) 12
  • 14. Effective Monitoring Understanding how we think 13
  • 15. Thinking, Fast and Slow 14
  • 16. A tale of two systems. Intuition: operates automatically and quickly, with little or no effort and no sense of voluntary control (2 + 2 = ?) — involuntary, fast, effortless, invisible. Reasoning: consciously allocates attention to the effortful mental activities that demand it (216 × 725 = ?) — voluntary, slow, difficult, visible.
  • 17. A tale of two systems. Intuition: operates automatically and quickly, with little or no effort and no sense of voluntary control (2 + 2 = ?) — involuntary, fast, effortless, invisible. Monitoring should rely on System 1.
  • 18. A tale of two systems. Reasoning: consciously allocates attention to the effortful mental activities that demand it (216 × 725 = ?) — voluntary, slow, difficult, visible. System 2 regulates our intuition and is ready to jump in when attention is required.
  • 19. Model “Normality” http://www.flickr.com/photos/fwooper7/4942474212/ 18
  • 20. Be surprised by anomalies http://animal.discovery.com/tv-shows/wild-kingdom/about-animals/lions-elephant-hunters-pictures.htm 19
  • 21. Create surprise with alerts 20
  • 23. Over-use of color [chart: Revenue vs. Goal, Jan–Jun]. Only attract attention when things go bad.
  • 25. Dashboard best practices Show, don’t tell Keep text/numbers to a minimum 22
  • 26. Clarity and immediacy FTW. Charles Joseph Minard, Napoleon’s March on Moscow — “Probably the best [crossed out: worst] statistical graphic ever drawn” (Edward Tufte). http://www.edwardtufte.com/tufte/posters
  • 28. Graphs fit short-term memory. Sales table (Jan–Jul): US 23923, 21695, 20032, 24030, 24302, 25032, 26203; EU 14390, 16400, 17303, 21900, 23547, 20142, 27321. Give values a visual shape: [line chart of the same US/EU sales].
  • 30. Dashboard best practices Communicate with clarity Simplicity is key 25
  • 31. Dashboard design mistakes 26
  • 32. Busy Dashboards Are Busy http://img.photobucket.com/albums/v254/tomklipp/Misc/C-130e-flight-station.jpg 27
  • 33. Dashboard design mistakes Too much data, too little information At a glance, tell if there’s a problem, not a precise analysis 28
  • 34. The only thing I want to know Everything is alright http://www.x929.ca/shows/newsboy/?cat=28&paged=2 29
  • 35. Attention as limited resource http://www.climateshifts.org/wp-content/uploads/2010/12/coal_hands.jpg 30
  • 36. Attention has a limited budget Attention depletion Leverage intuition whenever possible 31
  • 37. Strain and effort → Heuristics. “It takes 5 machines 5 minutes to make 5 widgets; how long would it take 100 machines to make 100 widgets?” The tendency is to answer with the first idea that comes to mind, without checking it: “100!” The correct answer is 5.
  • 41. Swap out difficult tasks for easier ones Heuristic, n. simple procedure that helps find adequate, though often imperfect, answers to difficult questions. 34
  • 42. Human-centric software? 35
  • 43. Human-centric software? Attention is LAZY. Too subtle: didn’t notice. Too tired: didn’t care.
  • 44. Let the visual cortex do the work http://chariotsolutions.com/presentations/the-programming-ape 36
  • 45. Dashboard best practices Organise information to support meaning Apply the latest understanding of human visual perception to the visual presentation of information 37
  • 46. Organised by means of production: CPU load, bandwidth, DB queries — BAD
  • 47. Organised by context: Shopping Cart, Product Catalog, Auth Service (each with memory, traffic, DB) — BETTER
  • 49. Correlate events to add context. Releases/events: Feature X, TV ads, hotfix. Symptoms over the last 7 days: 5% of users locked out, DB load -40%, 90th-percentile latency +730%.
  • 50. Dashboard best practices Reduce Visual Noise Clutter, Distractions, Clichés, Animations, Embellishments create confusion 41
  • 51. Gauges / Speedometers: 3D effects, glass reflections, bouncing needles... bacon?
  • 57. (3D) Pie charts: the size of round areas is difficult to evaluate, and 3D adds distortion in the perceived size (and value) of the data → they sacrifice accuracy for aesthetic appeal. http://www.dashboardinsight.com/articles/digital-dashboards/building-dashboards/the-case-against-3d-charts-in-dashboards.aspx
  • 58. Pie chart vs. bar chart, on about the same screen estate [pie and bar charts of the same values: A 27%, B 23%, C 22%, D 16%, E 6%, F 5%]. With bars it’s easier to compare sizes, i.e. the value of the data.
  • 62. Mind tricks 45
  • 63. Mind tricks WHAT I IF TOLD YOU YOU READ THAT WRONG http://www.quora.com/Optical-Illusions/What-are-some-great-optical-illusions 46
  • 64. A machine for jumping to conclusions. WYSIATI: What You See Is All There Is. Intuitive thinking jumps to conclusions on the basis of limited evidence.
  • 65. Neglect of ambiguity Suppression of doubt 48
  • 67. Neglect of ambiguity Ann approached the bank Fabrication of coherent stories http://www.flickr.com/photos/27000501@N08/5613967601 49
  • 70. WYSIATI and the need for more data [chart: data throughput on Server 1, Server 2, Server 3 dropping to zero].
  • 71. “oh cr*p.”
  • 72. “Surely, we’re losing data :-( No doubt about it.”
  • 73. “wait, all other metrics are OK...”
  • 74. Platform OK: the metrics couldn’t reach the stats server. (The stats server was rebooted without the eth1 interface.)
  • 75. Multiple perspectives / facets Examine data from multiple perspectives simultaneously (one of them will hopefully make sense) Uncover meaningful relationships that exist in the data 51
  • 76. Grids / Crosstabs: failures by service (Auth Mgr, Product Catalog, Shopping Cart) crossed with failure type (Out Of Memory, Timeout, Unreachable) and region (US, EU).
  • 80. Halo effect - Biases Judgement influenced by previous information Information processed earlier might skew our perception of new data. No evidence required to jump to conclusions. 53
  • 81. Halo effect [bar chart: garbage collection across C++, Java, Ruby, R].
  • 82. Biases stronger than hard evidence [pipeline: data in → component A (C++) → component B (Java) → no data out]. Which component is broken, A or B? Don’t guess, look at metrics!
  • 85. Priming effect: after seeing WASH, people complete S _ AP as SOAP rather than SLAP, SNAP or SWAP.
  • 91. Pattern detection Colors Shapes Sounds GOOD BAD Our brain is good at creating associations and detecting patterns http://www.vladstudio.com/wallpaper/?violin 62
  • 92. Shapes that create emotions 63
  • 94. Normalise data, keep patterns consistent.
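The "normalise data" advice can be sketched as a tiny helper. This is an illustrative Python sketch (the talk itself doesn't prescribe a formula); min-max scaling is one common choice for putting metrics of very different magnitudes on a shared axis while preserving their shape:

```python
def normalise(series):
    """Min-max scale a series to [0, 1] so metrics with very
    different magnitudes can share an axis and keep their pattern."""
    lo, hi = min(series), max(series)
    if hi == lo:
        # flat series: no variation to show
        return [0.0 for _ in series]
    return [(x - lo) / (hi - lo) for x in series]

# e.g. requests/sec in the thousands vs. error counts in the tens
print(normalise([1000, 3000, 5000]))  # [0.0, 0.5, 1.0]
print(normalise([10, 30, 50]))        # [0.0, 0.5, 1.0] — same visual pattern
```

Two series that move together now overlap on the dashboard, so a divergence between them stands out to System 1 without any mental arithmetic.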
  • 95. Going Real-Time 65
  • 96. Monitoring at different levels. UX / business metrics: is there a problem? System monitors: where is the problem? Application monitors: what is the problem?
  • 99. Instrumentation: Monitoring + Alerting www.android-zenoss.info 67
  • 100. Instrumentation: monitoring + alerting. Unconventional alerting tools can be surprisingly effective.
  • 101. Getting started with monitoring. Monigusto: a single-server box that contains the most common/current tools for monitoring, like graphite, statsd, collectd, nagios, logstash, jmxtrans, tasseo, gdash, librato and sensu. https://github.com/monigusto — Real-Time Graphing With Graphite: http://bit.ly/rt-graphite
  • 102. StatsD + Graphite Example StatsD: Node.JS daemon. Listens for messages over a UDP port and extracts metrics, which are dumped to Graphite for further processing and visualisation. Graphite: Real-time graphing system. Data is sent to carbon (processing back-end) which stores data into Graphite’s db. Data visualised via Graphite’s web interface. 69
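The StatsD wire format behind this setup is simple enough to sketch by hand: a plain-text `name:value|type` datagram sent over UDP. A minimal Python sketch (host, port and metric names are placeholders, not values from the talk):

```python
import socket

def statsd_packet(metric, value, kind):
    """Build a StatsD datagram: kind is 'c' (counter), 'ms' (timer) or 'g' (gauge)."""
    return f"{metric}:{value}|{kind}".encode()

def send_metric(metric, value, kind, host="127.0.0.1", port=8125):
    """Fire-and-forget over UDP: monitoring must never block the app."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(statsd_packet(metric, value, kind), (host, port))

send_metric("workerX.processing_time", 42, "ms")    # a timing, in ms
send_metric("workerX.received.type.image", 1, "c")  # a counter increment
```

Because the transport is UDP, a down or slow stats server costs the application nothing — which is exactly the trade-off StatsD makes.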
  • 103. StatsD metrics: define a hierarchy of event names. https://github.com/etsy/statsd/

        ; statsd.ini
        [statsd]
        host = yourhost
        port = 8125

        <?php
        foreach ($items as $item) {
            // time how long it takes to process this item...
            $time_start = microtime(true);
            // ... process item here ...
            $time = (int)(1000 * (microtime(true) - $time_start));
            StatsD::timing('workerX.processing_time', $time); // in ms
            // count items by type
            StatsD::increment('workerX.received.type.' . $item['type']);
        }
  • 106. Graphite output workerX.processing_time.mean workerX.processing_time.90percentile http://graphite.wikidot.com/ 71
  • 107. Understanding Distribution Why averages suck 72
  • 108. Bell curve: “normal” distribution of response times [histogram: # of requests vs. response time]. Average = Median, i.e. the observed performance represents the majority of the transactions. http://apmblog.compuware.com/2012/11/14/why-averages-suck-and-percentiles-are-great/
  • 109. Bell curve, alerting levels: roughly 68% of transactions fall within one standard deviation of the mean, and about 95% (the vast majority) within two; everything outside two standard deviations is an outlier.
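The "outside two standard deviations = outlier" rule from these slides can be sketched in a few lines of Python (sample data invented for illustration):

```python
from statistics import mean, stdev

def outliers(samples, k=2):
    """Return the samples more than k standard deviations from the mean."""
    mu, sigma = mean(samples), stdev(samples)
    return [x for x in samples if abs(x - mu) > k * sigma]

# nine typical response times and one very slow request (ms)
times = [100, 102, 98, 101, 99, 103, 97, 100, 101, 400]
print(outliers(times))  # [400]
```

Note this naive version computes the mean and deviation over a window that includes the outlier itself; in practice you'd compare against a baseline window, as the later slides on automatic baselining suggest.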
  • 112. “Normal” vs. real distribution. In real life there are a few very heavy outliers and a long tail, so Median ≠ Average: the average looks a lot faster than most transactions, with ~20% of very fast transactions [chart: # of requests vs. response time, marking average, 20th percentile and median]. http://apmblog.compuware.com/2012/11/14/why-averages-suck-and-percentiles-are-great/
  • 113. Averages vs. percentiles [chart: average, 50th and 90th percentile load time (ms), 8AM–4PM]. Percentiles allow us to understand the distribution; the 50th percentile is more stable than the average.
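The "why averages suck" point is easy to reproduce with a handful of invented numbers — a minimal Python sketch:

```python
from statistics import mean, median, quantiles

# response times (ms) with a long tail: two very slow requests
latencies = [20, 22, 21, 23, 20, 24, 22, 21, 900, 1200]

avg = mean(latencies)    # 227.3 — dragged far above the typical request
p50 = median(latencies)  # 22.0  — what most users actually experienced
p90 = quantiles(latencies, n=10)[8]  # 90th percentile: the slow tail

print(avg, p50, p90)
```

Two outliers make the average ten times larger than the median, so an average-based alert would fire on a platform where 80% of requests are perfectly healthy, while the 50th/90th percentiles describe both the typical case and the tail.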
  • 115. Automatic baselining and alerts [chart: 50th and 90th percentile load time with threshold X]. Alert if the standard deviation of the 50th percentile is over X.
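One way to read the alert rule on this slide — flag the current 50th percentile when it drifts too far from its historical baseline — can be sketched as follows. The function and variable names are mine, not from the talk, and a real system would build the baseline from comparable times of day:

```python
from statistics import mean, median, stdev

def p50_alert(recent, baseline_p50s, k=3):
    """Alert when the current 50th percentile is more than k standard
    deviations away from its historical baseline values.

    recent:        latest window of samples (e.g. last 5 min of load times)
    baseline_p50s: 50th-percentile values observed at comparable periods
    """
    current = median(recent)
    mu, sigma = mean(baseline_p50s), stdev(baseline_p50s)
    return abs(current - mu) > k * sigma

baseline = [100, 105, 98, 102, 101]          # historical p50s (ms)
print(p50_alert([100, 103, 99], baseline))   # False — within the norm
print(p50_alert([150, 160, 155], baseline))  # True  — clear deviation
```

Letting the machine derive the threshold from history is what makes this "automatic": nobody has to guess a magic number per metric.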
  • 117. Tips and tricks: patterns our brain should recognise.
  • 118. Normalise + add baseline: let machines determine the baseline.
  • 123. Anomaly detection in fluctuating traffic [chart: IOPS].
  • 126. Derivative, to detect big spikes [chart: derivative(IOPS), with OK regions vs. anomalies].
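The derivative trick is just the difference between consecutive samples, which turns a big spike in fluctuating traffic into a pair of obvious excursions. A sketch of the idea (data and threshold invented for illustration):

```python
def derivative(series):
    """Rate of change between consecutive samples, Graphite-style."""
    return [b - a for a, b in zip(series, series[1:])]

def spike_indices(series, limit):
    """Indices where the value jumps by more than `limit` per step."""
    return [i + 1 for i, d in enumerate(derivative(series)) if abs(d) > limit]

iops = [500, 510, 505, 2000, 515, 508]
print(derivative(iops))          # [10, -5, 1495, -1485, -7]
print(spike_indices(iops, 500))  # [3, 4] — the jump up and the drop back
```

Normal fluctuation produces small derivatives regardless of the absolute traffic level, so a single threshold on the derivative catches spikes that a threshold on the raw value would miss at off-peak times.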
  • 129. Different visuals to spot differences: stacked area vs. overlapping lines.
  • 133. Flattening effect: saturation of a resource, or discontinuation of flow. (Slawek Ligus, “Effective Monitoring and Alerting”, O’Reilly 2012)
  • 135. Regular anomalies: check your cron jobs. (Slawek Ligus, “Effective Monitoring and Alerting”, O’Reilly 2012)
  • 137. Advanced Heatmaps 88
  • 138. Heat-Maps 89
  • 140. Look! Rib cages! Network load viz http://www.network-weathermap.com/ http://cacti.net 90
  • 141. 10–40GB links, bandwidth monitor. Great, but not enough: contextualise metrics. http://www.network-weathermap.com/ http://cacti.net
  • 143. HeatMaps: Cacti + WeatherMap Cacti: Network graphing solution harnessing the power of RRDTool’s data storage and graphing functionality. Provides a fast poller, graph templating, multiple data acquisition methods. Weathermap: Cacti plugin to integrate network maps into the Cacti web UI. Includes a web-based map editor. 92
  • 144. Network throughput / latency map [nodes annotated with message rates, e.g. 5320/s]. Augmentation service timing out? Consumer slower than producer? Graphite datasource for Weathermap: https://github.com/alexforrow/php-weathermap-graphite
  • 147. Server load: memory, CPU, disk... [map with a node at 500%]. CPU/memory overload on a filtering node? Slow DB queries? Disk storage running out of space? Graphite datasource for Weathermap: https://github.com/alexforrow/php-weathermap-graphite
  • 151. Conclusions Almost beer time... 95
  • 152. Guidelines: dashboards for humans. Make the subtle obvious. Make the complex/busy simple/clean. Group data by context, not means of production. Detect anomalies / deviation from the norm. Turn raw numbers into graphs. Appeal to intuition, conserve attention.
  • 153. References: http://www.alberton.info/talks — Daniel Kahneman, “Thinking, Fast and Slow”, Penguin Books 2012; Slawek Ligus, “Effective Monitoring and Alerting”, O’Reilly 2012; Stephen Few, http://www.perceptualedge.com/; http://www.dashboardinsight.com; Coda Hale, “The Programming APE”.
  • 154. We’re hiring! http://datasift.com/about-us/careers — lorenzo@datasift.com
  • 155. Lorenzo Alberton @LorenzoAlberton — Thank you! lorenzo@alberton.info — http://www.alberton.info/talks — http://joind.in/8060