Winning the metrics battle

Slides from a talk at Velocity Europe 2012 on how the Guardian does metrics and monitoring.

The original proposal is at http://velocityconf.com/velocityeu2012/public/schedule/detail/26576 and there is also an article about it at http://www.guardian.co.uk/info/developer-blog/2012/oct/04/winning-the-metrics-battle


Transcript

  • 1. Winning the metrics battle (finally)
  • 2. Winning the metrics battle (finally) Simon Hildrew (Infrastructure Developer, The Guardian) and Nick Satterly (Monitoring Engineer, The Guardian)
  • 3. The metrics battlefield
  • 4. Total metrics 180,000 50,000 1,400 2,800
  • 5. http://www.flickr.com/photos/ghostsigns/6676069121 5 minutes every 15 seconds http://www.flickr.com/photos/millynet/134071210
  • 6. developer dashboards
  • 7. Physical screens; screensaver hacks
  • 8. devhack
  • 9. business dashboards
  • 10. metrics + dashboards = culture change
  • 11. http://www.flickr.com/photos/chrisjames_taylor/5454315456
  • 12. our approach: Side project ➡ Prioritise; Incremental upgrade ➡ Understand the real problem; Use off the shelf tool ➡ Question the tools; Pragmatic solution ➡ Be ambitious; Done in a year ➡ Keep learning
  • 13. Prioritise
  • 14. drowning in work http://www.flickr.com/photos/iampeas/246738971
  • 15. a dedicated monitoring and metrics engineer
  • 16. Understand the real problem
  • 17. Urgent issue - current tool end of life
  • 18. The story so far...
  • 19. metrics were not helping us solve production outages
  • 20. ballooning number of applications
  • 21. but... difficult to instrument applications
  • 22. T.T. Detect + T.T. Fix = T.T. Diagnose + T.T. Resolve
  • 23. inaccessible tools http://www.flickr.com/photos/kdashy/2678539087
  • 24. inconsistent data http://www.flickr.com/photos/sybrenstuvel/2468506922
  • 25. hypothesising & arguing easier than measuring http://www.flickr.com/photos/nouqraz/200049988
  • 26. The ‘right’ thing: measure everything; measure frequently; measure each data point once; input and output must be open
  • 27. Question the tools
  • 28. Brute force? http://www.flickr.com/photos/epublicist/3546059144
  • 29. The safe option? http://www.flickr.com/photos/alicebartlett/2361209195
  • 30. Unintuitive? http://www.flickr.com/photos/merlijnhoek/2841785343
  • 31. Imposing a flawed model? http://www.flickr.com/photos/evansville/8953838/
  • 32. Too difficult / no progress? http://www.flickr.com/photos/ginja_andy/4165849136/
  • 33. Nagios: the “IBM” of monitoring tools; compromise over quantity and frequency of checks; < insert your criticism of nagios here >
  • 34. Zabbix: metric collection tightly coupled to monitoring tool; confusing UI with poor visualisation; needed brute force to make limited API work
  • 35. The ‘right’ thing: measure everything; measure frequently; measure each data point once; input and output must be open
  • 36. don’t compromise
  • 37. Be ambitious
  • 38. http://www.flickr.com/photos/mugley/2961131550 Throw work away
  • 39. Draw your dream
  • 40. http://www.flickr.com/photos/sk8geek/7358702704 Get as far as you can
  • 41. screens users db? alerting? Etsy dashboard message queue graphite SNMP? syslog? FITB ganglia api? network hosts applications
  • 42. Develop missing pieces http://www.flickr.com/photos/kalexanderson/5969012589
  • 43. screens users mongodb alerta elastic search Etsy dashboard message queue syslog SNMP graphite ganglia alerts alerts alerts FITB ganglia ganglia-api network hosts applications
  • 44. Guardian Management https://github.com/guardian/guardian-management
  • 45. Ganglia API https://github.com/guardian/ganglia-api
  • 46. Alerta https://github.com/guardian/alerta
  • 47. Current stack: Ganglia; FITB; Graphite; Etsy dashboards; Guardian management (https://github.com/guardian/guardian-management); Guardian ganglia-api (https://github.com/guardian/ganglia-api); Guardian alerta (https://github.com/guardian/alerta)
  • 48. Keep learning
  • 49. we are not there yet
  • 50. Watch the cultural changes
  • 51. detecting
  • 52. diagnosis
  • 53. diagnosis
  • 54. performance testing
  • 55. confirmation
  • 56. #monitoringsucks
  • 57. ➡ Prioritise ➡ Understand the real problem ➡ Question the tools ➡ Be ambitious ➡ Keep learning
  • 58. tools can change culture
  • 59. Thank you http://github.com/guardian http://gu.com/p/3ap5f Simon Hildrew (@sihil, simon.hildrew@guardian.co.uk) Nick Satterly (@nicksatterly, nick.satterly@guardian.co.uk)
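
The slides list Graphite in the stack and call for measuring frequently through open inputs, but they show no code. As an illustration only, here is a minimal Python sketch of pushing data points to Graphite over its plaintext protocol, which takes one "path value timestamp" line per data point. The metric path, the host name, and the helper names (graphite_line, sample_times, send_lines) are hypothetical and not from the talk. Port 2003 is Carbon's default plaintext listener, and the 15-second interval borrows the "every 15 seconds" figure from slide 5.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one data point as a Graphite plaintext line: 'path value timestamp' plus newline."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def sample_times(start, count, interval=15):
    """Return `count` timestamps spaced `interval` seconds apart (15 s is an assumption)."""
    return [start + i * interval for i in range(count)]


def send_lines(lines, host, port=2003, timeout=5.0):
    """Send already formatted lines to a Carbon plaintext listener over TCP.

    `host` is whatever Carbon endpoint you run; nothing here is specific
    to the Guardian's setup.
    """
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)


def build_batch(path, values, start, interval=15):
    """Pair each value with a timestamp and format the whole batch."""
    stamps = sample_times(start, len(values), interval)
    return [graphite_line(path, v, t) for v, t in zip(values, stamps)]
```

For example, graphite_line("guardian.frontend.requests", 42, 1350000000) returns "guardian.frontend.requests 42 1350000000\n", and send_lines(build_batch(...), "graphite.example.com") would ship a batch to a hypothetical host. Formatting is kept separate from sending so each data point is produced once and the same lines can go to any consumer.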