Winning the metrics battle

These slides, from a presentation at Velocity Europe 2012, describe how the Guardian does metrics and monitoring.

The original proposal is at http://velocityconf.com/velocityeu2012/public/schedule/detail/26576 and there is also an article about it at http://www.guardian.co.uk/info/developer-blog/2012/oct/04/winning-the-metrics-battle

Published in: Technology

  1. 1. Winning the metrics battle (finally)
  2. 2. Winning the metrics battle (finally) Simon Hildrew, Infrastructure Developer, The Guardian • Nick Satterly, Monitoring Engineer, The Guardian
  3. 3. The metrics battlefield
  4. 4. Total metrics 180,000 50,000 1,400 2,800
  5. 5. http://www.flickr.com/photos/ghostsigns/6676069121 5 minutes every 15 seconds http://www.flickr.com/photos/millynet/134071210
  6. 6. developer dashboards
  7. 7. Physical screens • Screensaver hacks
  8. 8. devhack
  9. 9. business dashboards
  10. 10. metrics + dashboards = culture change
  11. 11. http://www.flickr.com/photos/chrisjames_taylor/5454315456
  12. 12. our approach • Side project ➡ Prioritise • Incremental upgrade ➡ Understand the real problem • Use off the shelf tool ➡ Question the tools • Pragmatic solution ➡ Be ambitious • Done in a year ➡ Keep learning
  13. 13. Prioritise
  14. 14. drowning in work http://www.flickr.com/photos/iampeas/246738971
  15. 15. a dedicated monitoring and metrics engineer
  16. 16. Understand the real problem
  17. 17. Urgent issue: current tool end of life
  18. 18. The story so far...
  19. 19. metrics were not helping us solve production outages
  20. 20. ballooning number of applications
  21. 21. but... difficult to instrument applications
  22. 22. T.T. Detect + T.T. Fix = T.T. Diagnose + T.T. Resolve
  23. 23. inaccessible tools http://www.flickr.com/photos/kdashy/2678539087
  24. 24. inconsistent data http://www.flickr.com/photos/sybrenstuvel/2468506922
  25. 25. hypothesising & arguing easier than measuring http://www.flickr.com/photos/nouqraz/200049988
  26. 26. The ‘right’ thing • measure everything • measure frequently • measure each data point once • input and output must be open
  27. 27. Question the tools
  28. 28. Brute force? http://www.flickr.com/photos/epublicist/3546059144
  29. 29. The safe option? http://www.flickr.com/photos/alicebartlett/2361209195
  30. 30. Unintuitive? http://www.flickr.com/photos/merlijnhoek/2841785343
  31. 31. Imposing a flawed model? http://www.flickr.com/photos/evansville/8953838/
  32. 32. Too difficult / no progress? http://www.flickr.com/photos/ginja_andy/4165849136/
  33. 33. Nagios • the “IBM” of monitoring tools • compromise over quantity and frequency of checks • < insert your criticism of Nagios here >
  34. 34. Zabbix • metric collection tightly coupled to monitoring tool • confusing UI with poor visualisation • needed brute force to make limited API work
  35. 35. The ‘right’ thing • measure everything • measure frequently • measure each data point once • input and output must be open
  36. 36. don’t compromise
  37. 37. Be ambitious
  38. 38. http://www.flickr.com/photos/mugley/2961131550 Throw work away
  39. 39. Draw your dream
  40. 40. http://www.flickr.com/photos/sk8geek/7358702704 Get as far as you can
  41. 41. screens users db? alerting? Etsy dashboard message queue graphite SNMP? syslog? FITB ganglia api? network hosts applications
  42. 42. Develop missing pieces http://www.flickr.com/photos/kalexanderson/5969012589
  43. 43. screens users mongodb alerta elastic search Etsy dashboard message queue syslog SNMP graphite ganglia alerts alerts alerts FITB ganglia ganglia-api network hosts applications
  44. 44. Guardian Management https://github.com/guardian/guardian-management
  45. 45. Ganglia API https://github.com/guardian/ganglia-api
  46. 46. rescale image??? Alerta https://github.com/guardian/alerta
  47. 47. Current stack • Ganglia • FITB • Graphite • Etsy dashboards • Guardian management https://github.com/guardian/guardian-management • Guardian ganglia-api https://github.com/guardian/ganglia-api • Guardian alerta https://github.com/guardian/alerta
  48. 48. Keep learning
  49. 49. we are not there yet
  50. 50. Watch the cultural changes
  51. 51. detecting
  52. 52. diagnosis
  53. 53. diagnosis
  54. 54. performance testing
  55. 55. confirmation
  56. 56. #monitoringsucks
  57. 57. ➡ Prioritise ➡ Understand the real problem ➡ Question the tools ➡ Be ambitious ➡ Keep learning
  58. 58. tools can change culture
  59. 59. Thank you http://github.com/guardian http://gu.com/p/3ap5f • Simon Hildrew @sihil simon.hildrew@guardian.co.uk • Nick Satterly @nicksatterly nick.satterly@guardian.co.uk
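
The slides name Graphite as part of the current stack and ask that metric input be open, but they do not show how an application would publish a data point. As a rough illustration only, the sketch below formats and sends a data point using Graphite's plaintext protocol (one `path value timestamp` line per data point, sent over TCP, conventionally to port 2003). The metric name, host, and helper function names are hypothetical and are not taken from the Guardian's setup.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one data point as a Graphite plaintext-protocol line.

    The metric path used by callers is a made-up example, not a Guardian name.
    """
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines, host="localhost", port=2003, timeout=5.0):
    """Send pre-formatted lines to a carbon plaintext listener.

    Host and port are assumptions; 2003 is the conventional default.
    """
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
```

Calling `send_metrics([graphite_line("example.frontend.requests", 42)])` from a scheduled job would publish one data point per run, provided a carbon listener is reachable at the assumed address.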