Tips And Tricks: patterns our brain should recognise
Normalise + Add baseline: let machines determine the ba...
Anomaly detection in fluctuating traffic (IOPS)
Derivative (detect big spikes): derivative(IOPS), OK vs. anomalies
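The derivative idea can be sketched in a few lines of PHP. This is only an illustration, not the talk's implementation: the helper names `derivative()` and `findSpikes()`, the sample IOPS values and the threshold of 300 are all assumptions made up for the example.

```php
<?php
// Hypothetical helper: discrete derivative (difference between consecutive samples).
function derivative(array $series): array
{
    $deltas = [];
    for ($i = 1, $n = count($series); $i < $n; $i++) {
        $deltas[] = $series[$i] - $series[$i - 1];
    }
    return $deltas;
}

// Hypothetical helper: indexes (in the original series) where the change exceeds the threshold.
function findSpikes(array $series, float $threshold): array
{
    $spikes = [];
    foreach (derivative($series) as $i => $delta) {
        if (abs($delta) > $threshold) {
            $spikes[] = $i + 1; // delta $i is the change from sample $i to sample $i + 1
        }
    }
    return $spikes;
}

// Made-up IOPS samples: gentle fluctuation, one sudden jump, one sudden drop.
$iops = [1000, 1050, 1100, 1080, 1900, 1950, 1200, 1150];
$spikes = findSpikes($iops, 300.0);
print_r($spikes);
```

Slow drifts in traffic produce small deltas and stay below the threshold, while abrupt jumps or drops stand out regardless of the absolute traffic level.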
Different visuals to spot differences: Stacked Area
Different visuals to spot differences: Overlapping Lines
Flattening effect: saturation of a resource or discontinuation of flow. Slawek Ligus, “Effective Monitoring and Alerting”, O’Reilly 2012
Regular anomalies: check your cron jobs. Slawek Ligus, “Effective Monitoring and Alerting”, O’Reilly 2012
Advanced Heatmaps
Heat-Maps
Look! Rib cages! Network load viz. http://www.network-weathermap.com/ http://cacti.net
10-40GB links - Bandwidth monitor. http://www.network-weathermap.com/ http://cacti.net
10-40GB links - Bandwidth monitor. Great, but not enough! Contextualise ...
HeatMaps: Cacti + WeatherMap. Cacti: Network graphing solution harnessing the power of RRDTool’s data storage and graphing ...
Network throughput / latency
Server load: memory, CPU, disk... CPU/memory overload on filtering node? Slow DB. Disk Storage. 500%. Graphite datasource for Weathermap: https://github.com/alexforro...
Conclusions: almost beer time...
Guidelines: dashboards for humans. Make the subtle obvious. Make the complex/busy simple/clean. Group data by context, no...
References: http://www.alberton.info/talks Daniel Kahneman, “Thinking, Fast and Slow”, Pen...
We’re Hiring! http://datasift.com/about-us/careers lorenzo@datasift.com
Lorenzo Alberton (@LorenzoAlberton). Thank you! lorenzo@alberton.info http://www.alberton.info/talks

Monitoring at scale - Intuitive dashboard design

At a certain scale, millions of events happen every second, and all of them matter when evaluating the health of the system. If not handled correctly, such a volume of information can overwhelm both the infrastructure that has to support it and the people who have to make sense of thousands of signals and act on them, fast. By understanding how our rational mind works and how people process information, we can present data so that it is more evident and intuitive. This talk explains how to collect useful metrics and how to create the perfect monitoring dashboard to organise and display them, letting our intuition operate automatically and quickly, and saving attention and mental effort for the activities that demand it.


  1. Lorenzo Alberton (@lorenzoalberton). Monitoring at scale: intuitive dashboard design. Make decisions, fast. PHP UK, Saturday 23rd February 2013
  2. Lorenzo Alberton, Chief Technical Architect, DataSift. http://alberton.info @lorenzoalberton http://bit.ly/scaleds
  3. Big Data, little clue? Monitoring is crucial. http://www.flickr.com/photos/mrflip/5150336351/lightbox/
  4. Complex architectures
  5. Identify (and prevent) failures? No output data: where is the problem?
  7. Monitoring mindset. “You can’t control what you can’t measure” (Tom DeMarco). Design systems to be monitored. Good reporting: the difference between noticing and not having a clue. Observe patterns and automate most things. http://www.threesixtymag.co.uk/2012/12/state-of-mind-tee/
  8. Monitoring mindset. The hardest part. Good reporting: the difference between noticing and not having a clue.
  9. Dashboard Design: learning the appropriate language
  10. Dashboard: what is it? A tool to display PIs and KPIs. Quantitative analysis. Immediacy, intuitiveness and appropriate context.
  11. Dashboard types. Operational: monitors functions which need constant, real-time, minute-by-minute attention; assists with immediacy and practicality; no statistics or analyzing. Strategic: quick overview of an organization’s health; assist with executive decisions; what is going on right now is not important - what is pressing is what has been going on. Analytic: comparisons, reviewing extensive histories, evaluating performance; data analysis; doesn’t require real-time data.
  12. Multiple dashboard views. Operational: Ops / Engineering. Strategic: CEO / CIO. Analytic: Marketing / Accountancy. Different view for each audience: keep metrics relevant to each group.
  13. Multiple dashboard views. Operational: Ops / Engineering. This talk is about this one (but the others are important too).
  14. Effective Monitoring: understanding how we think
  15. Thinking, Fast and Slow
  16. A tale of two systems. Intuition operates automatically and quickly, with little or no effort and no sense of voluntary control (2 + 2 = ?): involuntary, fast, effortless, invisible. Reasoning consciously allocates attention to the effortful mental activities that demand it (216 × 725 = ?): voluntary, slow, difficult, visible.
  17. A tale of two systems. Intuition operates automatically and quickly, with little or no effort and no sense of voluntary control (2 + 2 = ?): involuntary, fast, effortless, invisible. Monitoring should rely on System 1.
  18. A tale of two systems. Reasoning consciously allocates attention to the effortful mental activities that demand it (216 × 725 = ?): voluntary, slow, difficult, visible. System 2 regulates our intuition and is ready to jump in when attention is required.
  19. Model “Normality”. http://www.flickr.com/photos/fwooper7/4942474212/
  20. Be surprised by anomalies. http://animal.discovery.com/tv-shows/wild-kingdom/about-animals/lions-elephant-hunters-pictures.htm
  21. Create surprise with alerts
  23. Over-use of color. [Chart: Revenue vs. Goal, Jan to Jun, scale 0 to 80] Only attract attention when things go bad.
  25. Dashboard best practices. Show, don’t tell. Keep text/numbers to a minimum.
  26. Clarity and immediacy FTW. Charles Joseph Minard, Napoleon’s March on Moscow. “Probably the best statistical graphic ever drawn” - Edward Tufte. http://www.edwardtufte.com/tufte/posters
  27. Clarity and immediacy FTW. Charles Joseph Minard, Napoleon’s March on Moscow. worst. “Probably the best statistical graphic ever drawn” - Edward Tufte. http://www.edwardtufte.com/tufte/posters
  28. Graphs fit short-term memory. Sales, Jan to Jul. US: 23923, 21695, 20032, 24030, 24302, 25032, 26203. EU: 14390, 16400, 17303, 21900, 23547, 20142, 27321. Give values a visual shape. [Line chart: US and EU sales, Jan to Aug, scale 10000 to 30000]
  30. Dashboard best practices. Communicate with clarity. Simplicity is key.
  31. Dashboard design mistakes
  32. Busy Dashboards Are Busy. http://img.photobucket.com/albums/v254/tomklipp/Misc/C-130e-flight-station.jpg
  33. Dashboard design mistakes. Too much data, too little information. At a glance, tell if there’s a problem, not a precise analysis.
  34. The only thing I want to know: everything is alright. http://www.x929.ca/shows/newsboy/?cat=28&paged=2
  35. Attention as limited resource. http://www.climateshifts.org/wp-content/uploads/2010/12/coal_hands.jpg
  36. Attention has a limited budget. Attention depletion. Leverage intuition whenever possible.
  37. Strain and effort ➔ Heuristics. It takes 5 machines 5 minutes to make 5 widgets; how long would it take 100 machines to make 100 widgets? Intuitive answer: 100! Correct answer: 5. Tendency to answer questions with the first idea that comes to mind, without checking it.
  41. Swap out difficult tasks for easier ones. Heuristic, n.: simple procedure that helps find adequate, though often imperfect, answers to difficult questions.
  42. Human-centric software? Attention is LAZY. Too subtle: didn’t notice. Too tired: didn’t care.
  44. Let the visual cortex do the work. http://chariotsolutions.com/presentations/the-programming-ape
  45. Dashboard best practices. Organise information to support meaning. Apply the latest understanding of human visual perception to the visual presentation of information.
  46. Organised by means of production: CPU Load, DB queries, Bandwidth. BAD
  47. Organised by context: Shopping Cart, Product Catalog, Auth Service (each with Memory, Traffic, DB). BETTER
  49. Correlate events to add context. Releases, Performance / Events. Last 7 days: Feature X, TV Ads, hotfix. Symptoms: 5% users locked out, DB load -40%, 90th percentile latency +730%.
  50. Dashboard best practices. Reduce visual noise. Clutter, distractions, clichés, animations, embellishments create confusion.
  51. Gauges / Speedometers: 3D effect, glass reflection, bouncing needle, ... Bacon?
  57. (3D) Pie charts. Size of round areas difficult to evaluate. Distortion in the perceived size (and value of data) ➡ they sacrifice accuracy for aesthetic appeal. [Pie chart slices: 17%, 23%, 13%, 1%, 2%, 4%, 21%, 17%] http://www.dashboardinsight.com/articles/digital-dashboards/building-dashboards/the-case-against-3d-charts-in-dashboards.aspx
  58. Pie chart vs. Bar chart. Data: A 27%, B 23%, C 22%, D 16%, E 6%, F 5%. About the same screen estate. Easier to compare size of bars (i.e. the value of the data).
  62. Mind tricks
  63. Mind tricks. WHAT I IF TOLD YOU YOU READ THAT WRONG. http://www.quora.com/Optical-Illusions/What-are-some-great-optical-illusions
  64. A machine for jumping to conclusions. WYSIATI: What You See Is All There Is. Intuitive thinking jumps to conclusions on the basis of limited evidence.
  65. Neglect of ambiguity. Suppression of doubt.
  67. Neglect of ambiguity. “Ann approached the bank”. Fabrication of coherent stories. http://www.flickr.com/photos/27000501@N08/5613967601
  70. WYSIATI and the need for more data. [Graph: data throughput for Server 1, Server 2, Server 3] “Oh cr*p. Surely, we’re losing data :-( No doubt about it.” “Wait, all other metrics are OK...” Platform OK. Metrics couldn’t reach the stats server (stats server rebooted without eth1 interface).
  75. Multiple perspectives / facets. Examine data from multiple perspectives simultaneously (one of them will hopefully make sense). Uncover meaningful relationships that exist in the data.
  76. Grids / Crosstabs. [Grid of small charts: failures by service (Auth Mgr, Product Catalog, Shopping Cart) against failures by type (Out Of Memory, Timeout, Unreachable), each cell split into US and EU on a 0 to 20K scale]
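A crosstab like the one on this slide is just a count per (type, service) pair. The following is a minimal sketch in PHP, assuming failure events arrive as arrays with `service` and `type` keys; the `crosstab()` helper and the sample events are hypothetical, only the service and type names come from the slide.

```php
<?php
// Hypothetical helper: count failures per (type, service) pair, one grid cell per pair.
function crosstab(array $failures): array
{
    $grid = [];
    foreach ($failures as $failure) {
        $type = $failure['type'];
        $service = $failure['service'];
        $grid[$type][$service] = ($grid[$type][$service] ?? 0) + 1;
    }
    return $grid;
}

// Made-up failure events; service and type names are taken from the slide.
$failures = [
    ['service' => 'Auth Mgr',        'type' => 'Timeout'],
    ['service' => 'Auth Mgr',        'type' => 'Timeout'],
    ['service' => 'Shopping Cart',   'type' => 'Out Of Memory'],
    ['service' => 'Product Catalog', 'type' => 'Unreachable'],
    ['service' => 'Shopping Cart',   'type' => 'Timeout'],
];

$grid = crosstab($failures);
foreach ($grid as $type => $byService) {
    foreach ($byService as $service => $count) {
        printf("%-14s %-16s %d\n", $type, $service, $count);
    }
}
```

Adding a third key (for example the region, US or EU) to the grouping would give the per-cell split shown in the slide's small charts.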
  80. Halo effect - Biases. Judgement influenced by previous information. Information processed earlier might skew our perception of new data. No evidence required to jump to conclusions.
  81. Halo effect - Biases. [Bar chart: Garbage Collection, scale 0 to 80; labels: C++, Java, C++, Ruby, R]
  82. Biases stronger than hard evidence. Data In ➔ A (C++) ➔ B (Java) ➔ No Data Out. Which component is broken, A or B? Don’t guess, look at metrics!!!
  85. Priming effect: WASH
  86. Priming effect: S _ AP
  87. Priming effect: SOAP
  88. Priming effect: SLAP
  89. Priming effect: SNAP
  90. Priming effect: SWAP
  91. Pattern detection. Colors, Shapes, Sounds: GOOD / BAD. Our brain is good at creating associations and detecting patterns. http://www.vladstudio.com/wallpaper/?violin
  92. Shapes that create emotions
  94. Normalise data, keep patterns consistent. [Chart label: Normalised]
  95. Going Real-Time
  96. Monitoring At Different Levels. UX / Business metrics: is there a problem? System monitors: where is the problem? Application monitors: what is the problem?
  99. Instrumentation: Monitoring + Alerting. Unconventional alerting tools can be surprisingly effective. www.android-zenoss.info
  101. Getting started with monitoring. Monigusto: a single-server box that contains the most common/current tools for monitoring, like graphite, statsd, collectd, nagios, logstash, jmxtrans, tasseo, gdash, librato and sensu. https://github.com/monigusto Real-Time Graphing With Graphite: http://bit.ly/rt-graphite
  102. StatsD + Graphite example. StatsD: Node.JS daemon. Listens for messages over a UDP port and extracts metrics, which are dumped to Graphite for further processing and visualisation. Graphite: real-time graphing system. Data is sent to carbon (processing back-end), which stores data into Graphite’s db. Data is visualised via Graphite’s web interface.
  103. StatsD metrics. Define a hierarchy of event names. https://github.com/etsy/statsd/

       ; statsd.ini
       [statsd]
       host = yourhost
       port = 8125

       <?php
       foreach ($items as $item) {
           // time how long it takes to process this item...
           $time_start = microtime(true);
           // ... process item here ...
           $time = (int)(1000 * (microtime(true) - $time_start));
           StatsD::timing('workerX.processing_time', $time); // in ms
           // count items by type
           StatsD::increment('workerX.received.type.' . $item['type']);
       }
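For readers without a StatsD client class at hand, here is a rough sketch of what such calls boil down to: one plain-text line per metric, sent over UDP. It assumes the standard StatsD line format (`bucket:value|ms` for timers, `bucket:value|c` for counters); the helper names, the `127.0.0.1` host and the `tweet` item type are hypothetical and not part of the talk.

```php
<?php
// Hypothetical helpers illustrating the StatsD plain-text line format.
function statsdTimingLine(string $bucket, int $milliseconds): string
{
    return sprintf('%s:%d|ms', $bucket, $milliseconds);
}

function statsdCounterLine(string $bucket, int $delta = 1): string
{
    return sprintf('%s:%d|c', $bucket, $delta);
}

// Fire-and-forget UDP send; failures are ignored so monitoring never breaks the application.
function statsdSend(string $line, string $host = '127.0.0.1', int $port = 8125): void
{
    $socket = @fsockopen('udp://' . $host, $port, $errno, $errstr, 1.0);
    if ($socket === false) {
        return;
    }
    @fwrite($socket, $line);
    fclose($socket);
}

$timing = statsdTimingLine('workerX.processing_time', 42);
$counter = statsdCounterLine('workerX.received.type.tweet');
echo $timing, "\n", $counter, "\n";
statsdSend($timing);
statsdSend($counter);
```

Because UDP is connectionless, the sender does not wait for the daemon, which keeps the instrumentation overhead low even when the stats server is unreachable.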
  106. Graphite output: workerX.processing_time.mean, workerX.processing_time.90percentile (http://graphite.wikidot.com/)
  107. Understanding Distribution: why averages suck
  108. Bell curve: “normal” distribution of response times. Average = Median, i.e. the observed performance represents the majority of the transactions. [Figure: # of requests vs. response time; the peak marks the average/median, with below-average times to its left and above-average times to its right.] Source: http://apmblog.compuware.com/2012/11/14/why-averages-suck-and-percentiles-are-great/
  109. Bell curve, alerting levels. Within 1 standard deviation of the mean: about 68% of transactions, with the mean as the middle. [Figure: # of requests vs. response time, median marked.]
  110. Bell curve, alerting levels. Within 2 standard deviations of the mean: about 95% of transactions (the vast majority).
  111. Bell curve, alerting levels. Everything outside 2 standard deviations of the mean, on either side, is an outlier.
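The alerting levels can be sketched as a small classifier. This is an illustration only: the band names (`normal`, `warning`, `outlier`) and the sample data are assumptions, and it presumes the response times are roughly bell-shaped.

```python
from statistics import mean, pstdev


def classify(samples: list[float], value: float) -> str:
    """Return the band `value` falls in, measured in standard deviations from the mean."""
    mu = mean(samples)
    sigma = pstdev(samples)
    if sigma == 0:
        # All samples identical: anything different is by definition unusual.
        return "normal" if value == mu else "outlier"
    distance = abs(value - mu) / sigma
    if distance <= 1:
        return "normal"
    if distance <= 2:
        return "warning"
    return "outlier"


response_times_ms = [98, 102, 100, 97, 103, 101, 99, 100]
for candidate in (101, 103.5, 140):
    print(candidate, classify(response_times_ms, candidate))
```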
  112. “Normal” vs. real distribution. Real life: a few very heavy outliers and a long tail, so Median ≠ Average. Here ~20% of very fast transactions make the average look a lot faster than most transactions. [Figure: number of requests (0 to 8) vs. response time (1 to 12), with the average, 20th percentile and median marked.] Source: http://apmblog.compuware.com/2012/11/14/why-averages-suck-and-percentiles-are-great/
  113-114. Averages vs. Percentiles. Percentiles allow us to understand the distribution. The 50th percentile is more stable than the average. [Figure, built in two steps: load time (ms, 0 to 200) from 8AM to 4PM, first the average alone, then with the 50th and 90th percentiles added.]
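A small Python sketch of why the average can mislead on a skewed distribution. It assumes the nearest-rank definition of a percentile (other definitions interpolate), and the sample data is made up: 20% very fast requests alongside ordinary and slow ones.

```python
import math
from statistics import mean


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical load times in ms: 20 very fast, 60 ordinary, 20 slow requests.
load_times_ms = [5] * 20 + [100] * 60 + [120] * 20

print("average:", mean(load_times_ms))
print("50th percentile:", percentile(load_times_ms, 50))
print("90th percentile:", percentile(load_times_ms, 90))
```

In this data set the average sits below what 80% of the requests actually experienced, while the 50th and 90th percentiles land on values that real requests took.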
  115-116. Automatic Baselining and Alerts. Alert if the standard deviation of the 50th percentile is over X. [Figure: load time (ms, 0 to 200) from 8AM to 4PM, showing the 50th and 90th percentiles and a horizontal threshold X.]
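One way to read the alert rule, sketched in Python: keep a sliding window of recent 50th-percentile readings and fire when their standard deviation exceeds the threshold. The class name, window size and threshold are assumptions chosen for illustration.

```python
from collections import deque
from statistics import pstdev


class MedianStabilityAlert:
    """Fires when recent 50th-percentile readings vary more than `threshold`."""

    def __init__(self, window: int, threshold: float) -> None:
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, p50: float) -> bool:
        """Record one reading; return True if an alert should fire."""
        self.recent.append(p50)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough history yet
        return pstdev(self.recent) > self.threshold


alert = MedianStabilityAlert(window=5, threshold=10.0)
readings_ms = [100, 102, 99, 101, 100, 160, 170]
fired = [alert.observe(reading) for reading in readings_ms]
print(fired)
```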
  117. Tips and Tricks: patterns our brain should recognise
  118-122. Normalise + add baseline: let machines determine the baseline. [Figure built up over five steps.]
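A minimal sketch of a machine-determined baseline, under the assumption that traffic follows a daily pattern: the baseline for each hour is the median of what was seen at that hour on previous days, and current values are expressed as a ratio to it. The function names and numbers are hypothetical.

```python
from statistics import median


def hourly_baseline(history: dict[int, list[float]]) -> dict[int, float]:
    """`history` maps hour-of-day to the values seen at that hour on previous days."""
    return {hour: median(values) for hour, values in history.items()}


def normalise(hour: int, value: float, baseline: dict[int, float]) -> float:
    """1.0 means 'as usual for this hour'; 2.0 means twice the usual level.

    Assumes the baseline for that hour is non-zero.
    """
    return value / baseline[hour]


history = {9: [400, 420, 380], 13: [900, 950, 870]}
baseline = hourly_baseline(history)
print(normalise(9, 800, baseline))   # 800 at 9AM is unusual
print(normalise(13, 900, baseline))  # 900 at 1PM is business as usual
```

Plotting the ratio rather than the raw value gives every metric the same reference line at 1.0, so a deviation reads the same way on every graph.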
  123-125. Anomaly detection in fluctuating traffic. [Figure: IOPS over time.]
  126-128. Derivative (detect big spikes). [Figure: derivative(IOPS) over time, with regions labelled OK and Anomalies.]
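The derivative idea in plain Python, as a sketch: take the difference between consecutive samples and flag steps larger than a chosen limit. Graphite offers a `derivative()` function for the graphing side; the `spikes` helper, the limit and the sample data below are assumptions.

```python
def derivative(series: list[float]) -> list[float]:
    """Difference between consecutive samples (one item fewer than the input)."""
    return [after - before for before, after in zip(series, series[1:])]


def spikes(series: list[float], max_step: float) -> list[int]:
    """Indices into `series` where the value jumped by more than `max_step`."""
    return [i + 1 for i, step in enumerate(derivative(series)) if abs(step) > max_step]


# Traffic that climbs steadily, with one sudden burst at index 5.
iops = [500, 520, 560, 610, 650, 1900, 640, 600]
print(derivative(iops))
print(spikes(iops, max_step=200))
```

The steady climb produces small steps and is left alone; both the jump into the burst and the drop out of it are flagged.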
  129-132. Different visuals to spot differences. [Figures: a stacked area chart, then overlapping lines.]
  133-134. Flattening effect: saturation of a resource or discontinuation of flow. (Slawek Ligus, “Effective Monitoring and Alerting”, O’Reilly 2012)
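A flattened graph can also be spotted mechanically. The sketch below looks for a long run of samples that stay within a tolerance of the value that started the run; the function names, the tolerance and the minimum run length are illustrative assumptions, not a method from the slides or the cited book.

```python
def longest_flat_run(series: list[float], tolerance: float = 0.0) -> int:
    """Length of the longest run of consecutive samples within `tolerance` of the run's first sample."""
    if not series:
        return 0
    longest = current = 1
    anchor = series[0]
    for value in series[1:]:
        if abs(value - anchor) <= tolerance:
            current += 1
        else:
            anchor = value  # start a new run at this sample
            current = 1
        longest = max(longest, current)
    return longest


def is_flattened(series: list[float], min_run: int, tolerance: float = 0.0) -> bool:
    """True if the series plateaus for at least `min_run` samples."""
    return longest_flat_run(series, tolerance) >= min_run


# Throughput that climbs, then hits a ceiling at 500.
throughput = [310, 420, 480, 500, 500, 500, 500, 500, 470]
print(longest_flat_run(throughput))
print(is_flattened(throughput, min_run=4))
```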
  135-136. Regular anomalies: check your cron jobs. (Slawek Ligus, “Effective Monitoring and Alerting”, O’Reilly 2012)
  137. Advanced Heatmaps
  138-139. Heat-Maps
  140. Look! Rib cages! Network load visualisation. http://www.network-weathermap.com/ http://cacti.net
  141-142. 10-40GB links, bandwidth monitor. Great, but not enough! Contextualise metrics. http://www.network-weathermap.com/ http://cacti.net
  143. HeatMaps: Cacti + WeatherMap. Cacti: network graphing solution harnessing the power of RRDTool’s data storage and graphing functionality. Provides a fast poller, graph templating and multiple data acquisition methods. Weathermap: Cacti plugin to integrate network maps into the Cacti web UI. Includes a web-based map editor.
  144-146. Network throughput / latency. [Figure: network map with a rate on each link, e.g. 345/s, 225/s, 7312/s, 5320/s.] Annotations added over the build steps: augmentation service timing out? Consumer slower than producer? Graphite datasource for Weathermap: https://github.com/alexforrow/php-weathermap-graphite
  147-150. Server load: memory, CPU, disk... [Figure: network map with a node at 500% load.] Annotations added over the build steps: CPU/memory overload on filtering node? Slow DB queries? Disk storage running out of space? Graphite datasource for Weathermap: https://github.com/alexforrow/php-weathermap-graphite
  151. Conclusions: almost beer time...
  152. Guidelines: dashboards for humans. Make the subtle obvious. Make the complex/busy simple/clean. Group data by context, not means of production. Detect anomalies/deviation from norm. Turn raw numbers into graphs. Appeal to intuition, conserve attention.
  153. References: http://www.alberton.info/talks; Daniel Kahneman, “Thinking, Fast and Slow”, Penguin Books 2012; Slawek Ligus, “Effective Monitoring and Alerting”, O’Reilly 2012; Stephen Few, http://www.perceptualedge.com/; http://www.dashboardinsight.com; Coda Hale, The Programming APE
  154. We’re Hiring! http://datasift.com/about-us/careers lorenzo@datasift.com
  155. Lorenzo Alberton, @LorenzoAlberton. Thank you! lorenzo@alberton.info http://www.alberton.info/talks http://joind.in/8060
