Successfully reported this slideshow.

Metrics-Driven Engineering

89

Share

Upcoming SlideShare
Take My Logs. Please!
Take My Logs. Please!
Loading in …3
×
1 of 106
1 of 106

More Related Content

Related Books

Free with a 30 day trial from Scribd

See all

Related Audiobooks

Free with a 30 day trial from Scribd

See all

Metrics-Driven Engineering

  1. 1. Metrics-Driven Engineering Mike Brittain @ mikebrittain Director of engineering, Infrastructure October 13, 2011
  2. 2. Tools and Process at Etsy
  3. 3. How many new visits? How many listings created? How many registrations? How do people use Etsy? How many convos sent? How many purchases? How many new shops?
  4. 4. Search indexing? How fast are pages generating? Async tasks currently in queue? What is the application doing? Developer API auth and rate limiting? Images resized and stored? Error and warning rates?
  5. 5. Replication slave lag? Memcache hits/misses? Available connections? Are the servers in good shape ? Database queries per second? Total outgoing bandwidth? CPU, Memory, I/O?
  6. 6. Business Metrics
  7. 7. Application Metrics
  8. 8. System Metrics
  9. 9. Visibility EVERYWHERE
  10. 10. Constant Change
  11. 11. $314 Million GMS 2010 $180 Million GMS 2009 $87 Million GMS 2008 $26 Million GMS 2007 credit: pentarux (flickr)
  12. 12. 25 Million Unique Visitors 1 Billion page views per month credit: pentarux (flickr)
  13. 13. Engineering team grew 500% over 18 months credit: martin_heigan (flickr)
  14. 14. Less talk, more do.
  15. 15. Always Be Shipping credit: ibailemon (flickr)
  16. 16. Always Be Shipping (even if it’s your first day) credit: ibailemon (flickr)
  17. 17. 90+ Engineers 40+ Deploys / day credit: misswired (flickr)
  18. 18. credit: digidave (flickr)
  19. 19. Code Reviews
  20. 20. Automated Tests
  21. 21. $cfg = array( 'checkout' => array('enabled' => 'on'), 'homepage' => array('enabled' => 'on'), 'profiles' => array('enabled' => 'on'), 'new_search' => array('enabled' => 'off'), ); Config Flags Enable and disable features quickly
  22. 22. $cfg = array( 'checkout' => array('enabled' => 'on'), 'homepage' => array('enabled' => 'on'), 'profiles' => array('enabled' => 'on'), 'new_search' => array('enabled' => 'off'), ); Config Flags Enable and disable features quickly Plus “admin-only,” percentage ramp-up, A/B testing, whitelists, blacklists, etc...
  23. 23. Failure is not an option
  24. 24. inevitable! Failure is not an option
  25. 25. inevitable! Failure is not an option a learning opportunity!
  26. 26. inevitable! Failure is not an option a learning opportunity! DETECTABLE!
  27. 27. Access
  28. 28. Detect problems quickly
  29. 29. CONFIDENCE
  30. 30. A: Well, the Ops team manages the network, racks the servers, installed the monitoring tools, wears the pagers, blah, blah, blah...
  31. 31. Engineers build the application
  32. 32. Logging Graphing OPS ENG Trending Alerting
  33. 33. “Engineers are too busy writing features to build metrics.”
  34. 34. Metrics are part of every feature ...and so are config flags
  35. 35. Dead Simple
  36. 36. Simple, open source tools
  37. 37. Cacti (network, SNMP) Ganglia (machines) Graphite (application) Splunk (log analysis, nightly reports) Nagios (alerting) Logging Logster StatsD
  38. 38. Ganglia
  39. 39. Ganglia Cluster-oriented Huge community contributed recipes Custom metrics (gmetad)
  40. 40. Graphite
  41. 41. Graphite Single-instance Create new metrics on-the-fly Customize via URLs and display functions
  42. 42. Logging
  43. 43. It’s 2:48 PM. Do you know where your logs are?
  44. 44. Logger::log_error("User login failed. Reason: $msg for $username", “login”);
  45. 45. Logger::log_error("User login failed. Reason: $msg for $username", “login”);
  46. 46. web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [mk04gw1p71] User login failed. Reason: wrong password for ...
  47. 47. web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [mk04gw1p71] User login failed. Reason: wrong password for ...
  48. 48. web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [mk04gw1p71] User login failed. Reason: wrong password for ...
  49. 49. web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [mk04gw1p71] User login failed. Reason: wrong password for ...
  50. 50. web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [mk04gw1p71] User login failed. Reason: wrong password for ...
  51. 51. web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [mk04gw1p71] User login failed. Reason: wrong password for ...
  52. 52. LogFormat "%h %l %u %t "%r" %>s %b" common
  53. 53. LogFormat %{True-Client-IP}i %l %t "%r " %>s %b "%{Referer}i" "%{User-Agent}i" %{etsy_shop_id}n %{etsy_uaid}n %V %{etsy_ab_selections}n %{etsy_request_uuid}n %{etsy_api_consumer_key}n %{etsy_api_method_name}n %{php_memory_usage_bytes}n %{php_time_microsec}n %D" combined
  54. 54. apache_note()
  55. 55. LogFormat %{True-Client-IP}i %l %t "%r " %>s %b "%{Referer}i" "%{User-Agent}i" %{etsy_shop_id}n %{etsy_uaid}n %V %{etsy_ab_selections}n %{etsy_request_uuid}n %{etsy_api_consumer_key}n %{etsy_api_method_name}n %{php_memory_usage_bytes}n %{php_time_microsec}n %D" combined
  56. 56. LogFormat %{True-Client-IP}i %l %t "%r " %>s %b "%{Referer}i" "%{User-Agent}i" %{etsy_shop_id}n %{etsy_uaid}n %V %{etsy_ab_selections}n %{etsy_request_uuid}n %{etsy_api_consumer_key}n %{etsy_api_method_name}n %{php_memory_usage_bytes}n %{php_time_microsec}n %D" combined
  57. 57. LogFormat %{True-Client-IP}i %l %t "%r " %>s %b "%{Referer}i" "%{User-Agent}i" %{etsy_shop_id}n %{etsy_uaid}n %V %{etsy_ab_selections}n %{etsy_request_uuid}n %{etsy_api_consumer_key}n %{etsy_api_method_name}n %{php_memory_usage_bytes}n %{php_time_microsec}n %D" combined
  58. 58. grep "/listing/" access.log | awk '{sum=sum+$(NF-2)} END {print sum/NR}'
  59. 59. web0001 [04:28:54 2011] [error] [client 10.101.x.x] Help me, Rhonda. web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo! web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh! web0001 [04:28:54 2011] [error] [client 10.101.x.x] Heeeeeeellllllllllllllppppp! web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo! web0001 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh! web0201 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh! web0034 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!! web1101 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!! web0201 [04:28:54 2011] [error] [client 10.101.x.x] You've been eaten by a grue. web0055 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!! web0002 [04:28:54 2011] [warning] [client 10.101.x.x] Sky is falling. web0089 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!! web0020 [04:28:54 2011] [error] [client 10.101.x.x] Sky is falling. web1101 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh! web0055 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh! web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!! web0034 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!! web0087 [04:28:54 2011] [fatal] [client 10.101.x.x] Sky is falling. web0002 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo! web0201 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh! web0077 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh! web0355 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo web0052 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!! web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!! web0003 [04:28:54 2011] [error] [client 10.101.x.x] You've been eaten by a grue. web0066 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
  60. 60. Logster Fatals Errors Warnings
  61. 61. Logster Run by cron Keeps a cursor on your log file Aggregate lines anyway you want Output to Ganglia or Graphite Simple parsers github.com/etsy
  62. 62. web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [mk04gw1p71] User login failed. Reason: wrong password for ...
  63. 63. ^.+ [.+] [(?P<log_level>.+)]
  64. 64. if (fields['log_level'] == “fatal”): self.fatals += 1 elif (fields['log_level'] == “error”): self.errors += 1 elif (fields['log_level'] == “warning”): self.warnings += 1 ...
  65. 65. MetricObject("fatals", (self.fatals / self.duration), "per sec") MetricObject("errors", (self.errors / self.duration), "per sec") MetricObject("warning", (self.warnings / self.duration), "per sec")
  66. 66. Fatals Errors Warnings
  67. 67. StatsD
  68. 68. StatsD Network daemon (node.js) Accepts data over UDP Flushes to Graphite every 10 sec One-line of code github.com/etsy
  69. 69. StatsD::increment("logins.success");
  70. 70. StatsD::increment("logins.success"); logins
  71. 71. StatsD::timing("gearman.time", $msec);
  72. 72. StatsD::timing("gearman.time", $msec); 90th pct average lower
  73. 73. Ad hoc name value timestamp
  74. 74. echo "events.deploy.site 1 `date +%s`" | nc graphite.etsycorp.com 2003
  75. 75. Vertical Line Technology! target=drawAsInfinite(events.deploy.site)
  76. 76. We could stare at graphs all day...
  77. 77. http://graphite/render? from=-1hours&width=600&height=200 &target=webs.errorLog.warning&rawData=1
  78. 78. http://graphite/render? from=-1hours&width=600&height=200 &target=webs.errorLog.warning&rawData=1 webs.errorLog.warning,1318444930,1318448530,60| 5.0,1.0,3.0,1.0,0.0,9.0,0.0,1.0,3.0,2.0,1.0,6.0,2.0,6.0,3.0,6.0,4.0,4.0,2.0, 1.0,1.0,8.0,2.0,3.0,6.0,3.0,5.0,3.0,0.0,4.0,6.0,2.0,0.0,2.0,0.0,4.0,0.0,3.0, 1.0,3.0,4.0,2.0,10.0,3.0,0.0,6.0,0.0,4.0,2.0,5.0,18.0,1.0,1.0,2.0,1.0,8.0,5. 0,1.0,1.0,None
  79. 79. Holt-Winters Confidence Bands upper lower
  80. 80. Holt-Winters Aberration
  81. 81. Business metrics + Confidence bands _____________ Alertable metrics
  82. 82. 40,000+ metrics at Etsy Systems, Applications, Business
  83. 83. Dashboards
  84. 84. Dashboards
  85. 85. Kind of Hard :-/ <a href="http://graphite.etsycorp.com/render?from=-1hours&width=800&height=600&title=File+or +Script+Not+Found&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite %28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production %29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite %28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff, %23ff0000,%23006633,%23cc6600"> <img src="http://graphite.etsycorp.com/render? from=-1hours&width=280&height=220&title=File+or+Script+Not +Found&hideLegend=1&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite %28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production %29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite %28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff, %23ff0000,%23006633,%23cc6600"> </a>
  86. 86. Super Easy! $g = new Graphite($time); $g->setTitle('File Not Found'); $g->addMetric('webs.errorLog.notExist', '#00cc00'); echo $g->getDashboardHTML(280, 220);
  87. 87. Metrics!
  88. 88. Metrics! Metrics + Events
  89. 89. Metrics! Metrics + Events Metrics + Alerts
  90. 90. Metrics! Metrics + Events Metrics + Alerts Metrics + Metrics
  91. 91. High-level, real-time visibility
  92. 92. Detect problems quickly
  93. 93. CONFIDENCE
  94. 94. Make them required features
  95. 95. Make them dead simple
  96. 96. Make them accessible
  97. 97. Make them!
  98. 98. Homework codeascraft.etsy.com github.com/etsy Get in touch mike @ etsy . com We’re always looking for people @ mikebrittain who are interested in this kind of stuff... Thank You etsy.com/careers

×