Winning the metrics battle (finally)
Slides from a presentation at Velocity Europe 2012 about how the Guardian does metrics and monitoring.

The original proposal is at http://velocityconf.com/velocityeu2012/public/schedule/detail/26576 and there is also an article about it at http://www.guardian.co.uk/info/developer-blog/2012/oct/04/winning-the-metrics-battle

Published in: Technology

Winning the metrics battle

  1. Winning the metrics battle (finally)
  2. Winning the metrics battle (finally). Simon Hildrew, Infrastructure Developer, The Guardian; Nick Satterly, Monitoring Engineer, The Guardian
  3. The metrics battlefield
  4. Total metrics: 180,000; 50,000; 1,400; 2,800
  5. 5 minutes; every 15 seconds (images: http://www.flickr.com/photos/ghostsigns/6676069121, http://www.flickr.com/photos/millynet/134071210)
  6. developer dashboards
  7. Physical screens; Screensaver hacks (chart, axis from 0 to 20)
  8. devhack
  9. business dashboards
  10. metrics + dashboards = culture change
  11. http://www.flickr.com/photos/chrisjames_taylor/5454315456
  12. our approach: Side project ➡ Prioritise; Incremental upgrade ➡ Understand the real problem; Use off the shelf tool ➡ Question the tools; Pragmatic solution ➡ Be ambitious; Done in a year ➡ Keep learning
  13. Prioritise
  14. drowning in work http://www.flickr.com/photos/iampeas/246738971
  15. a dedicated monitoring and metrics engineer
  16. Understand the real problem
  17. Urgent issue - current tool end of life
  18. The story so far...
  19. metrics were not helping us solve production outages
  20. ballooning number of applications
  21. but... difficult to instrument applications
  22. T.T. Fix = T.T. Detect + T.T. Diagnose + T.T. Resolve
  23. inaccessible tools http://www.flickr.com/photos/kdashy/2678539087
  24. inconsistent data http://www.flickr.com/photos/sybrenstuvel/2468506922
  25. hypothesising & arguing easier than measuring http://www.flickr.com/photos/nouqraz/200049988
  26. The ‘right’ thing: • measure everything • measure frequently • measure each data point once • input and output must be open
  27. Question the tools
  28. Brute force? http://www.flickr.com/photos/epublicist/3546059144
  29. The safe option? http://www.flickr.com/photos/alicebartlett/2361209195
  30. Unintuitive? http://www.flickr.com/photos/merlijnhoek/2841785343
  31. Imposing a flawed model? http://www.flickr.com/photos/evansville/8953838/
  32. Too difficult / no progress? http://www.flickr.com/photos/ginja_andy/4165849136/
  33. Nagios: • the “IBM” of monitoring tools • compromise over quantity and frequency of checks • < insert your criticism of nagios here >
  34. Zabbix: • metric collection tightly coupled to monitoring tool • confusing UI with poor visualisation • needed brute force to make limited API work
  35. The ‘right’ thing: • measure everything • measure frequently • measure each data point once • input and output must be open
  36. don’t compromise
  37. Be ambitious
  38. Throw work away http://www.flickr.com/photos/mugley/2961131550
  39. Draw your dream
  40. Get as far as you can http://www.flickr.com/photos/sk8geek/7358702704
  41. (diagram labels) screens users db? alerting? Etsy dashboard message queue graphite SNMP? syslog? FITB ganglia api? network hosts applications
  42. Develop missing pieces http://www.flickr.com/photos/kalexanderson/5969012589
  43. (diagram labels) screens users mongodb alerta elastic search Etsy dashboard message queue syslog SNMP graphite ganglia alerts alerts alerts FITB ganglia ganglia-api network hosts applications
  44. Guardian Management https://github.com/guardian/guardian-management
  45. Ganglia API https://github.com/guardian/ganglia-api
  46. Alerta https://github.com/guardian/alerta
  47. Current stack: • Ganglia • FITB • Graphite • Etsy dashboards • Guardian management https://github.com/guardian/guardian-management • Guardian ganglia-api https://github.com/guardian/ganglia-api • Guardian alerta https://github.com/guardian/alerta
  48. Keep learning
  49. we are not there yet
  50. Watch the cultural changes
  51. detecting
  52. diagnosis
  53. diagnosis
  54. performance testing
  55. confirmation
  56. #monitoringsucks
  57. ➡ Prioritise ➡ Understand the real problem ➡ Question the tools ➡ Be ambitious ➡ Keep learning
  58. tools can change culture
  59. Thank you. http://github.com/guardian http://gu.com/p/3ap5f Simon Hildrew (@sihil, simon.hildrew@guardian.co.uk); Nick Satterly (@nicksatterly, nick.satterly@guardian.co.uk)
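
A minimal sketch of what "input and output must be open" (slides 26 and 35) can look like in practice. Graphite, listed in the current stack on slide 47, accepts metrics over a plaintext protocol in which each line carries a metric path, a value, and a Unix timestamp. The metric name, the host, and the helper names `format_metric` and `send_metric` are illustrative assumptions, not taken from the talk; port 2003 is Graphite's default plaintext listener.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one newline-terminated line: metric path, value, Unix timestamp."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(line, host="localhost", port=2003):
    """Send one formatted line to a Carbon plaintext listener (hypothetical host)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# Hypothetical metric name; the line is only formatted and printed here, not sent.
print(format_metric("frontend.requests.count", 42, 1349308800), end="")
```

`send_metric` is defined but never called above, because it needs a reachable Carbon listener. Any application that can open a socket can feed data in this way, which is the sense of "open input" assumed in this sketch.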