Metrics-Driven Engineering

Presented at Web 2.0 Expo, Oct. 13 2011


  1. Metrics-Driven Engineering. Mike Brittain (@mikebrittain), Director of Engineering, Infrastructure. October 13, 2011
  2. Tools and Process at Etsy
  3. How do people use Etsy? How many new visits? How many listings created? How many registrations? How many convos sent? How many purchases? How many new shops?
  4. What is the application doing? Search indexing? How fast are pages generating? Async tasks currently in queue? Developer API auth and rate limiting? Images resized and stored? Error and warning rates?
  5. Are the servers in good shape? Replication slave lag? Memcache hits/misses? Available connections? Database queries per second? Total outgoing bandwidth? CPU, Memory, I/O?
  6. Business Metrics
  7. Application Metrics
  8. System Metrics
  9. Visibility EVERYWHERE
  10. Constant Change
  11. $314 Million GMS 2010; $180 Million GMS 2009; $87 Million GMS 2008; $26 Million GMS 2007 (credit: pentarux, flickr)
  12. 25 Million Unique Visitors; 1 Billion page views per month (credit: pentarux, flickr)
  13. Engineering team grew 500% over 18 months (credit: martin_heigan, flickr)
  14. Less talk, more do.
  15. Always Be Shipping (credit: ibailemon, flickr)
  16. Always Be Shipping (even if it’s your first day) (credit: ibailemon, flickr)
  17. 90+ Engineers; 40+ Deploys / day (credit: misswired, flickr)
  18. (credit: digidave, flickr)
  19. Code Reviews
  20. Automated Tests
  21-22. Config Flags: enable and disable features quickly. Plus “admin-only,” percentage ramp-up, A/B testing, whitelists, blacklists, etc...
      $cfg = array(
          'checkout'   => array('enabled' => 'on'),
          'homepage'   => array('enabled' => 'on'),
          'profiles'   => array('enabled' => 'on'),
          'new_search' => array('enabled' => 'off'),
      );
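The flag table and the "percentage ramp-up" idea on the slide above can be sketched as follows. This is a minimal illustration in Python (the language of the Logster parser later in the deck), not Etsy's PHP implementation. The `"ramp"` state, the `percent` key, the `new_cart` feature, and the `bucket` and `is_enabled` helpers are all assumptions made for the example.

```python
import hashlib

# Hypothetical flag table modelled on the slide's $cfg array.
# The "ramp" state and "percent" key are assumptions for illustration.
CFG = {
    "checkout":   {"enabled": "on"},
    "new_search": {"enabled": "off"},
    "new_cart":   {"enabled": "ramp", "percent": 25},
}


def bucket(feature, user_id):
    """Map a (feature, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(cfg, feature, user_id=None):
    """Return True if the feature is on for this user; unknown flags are off."""
    flag = cfg.get(feature)
    if flag is None:
        return False
    state = flag.get("enabled")
    if state == "on":
        return True
    if state == "ramp" and user_id is not None:
        # The same user always lands in the same bucket, so the experience
        # is stable while the percentage is raised.
        return bucket(feature, user_id) < flag.get("percent", 0)
    return False
```

Hashing the feature name together with the user ID keeps each ramp-up independent, so a user in the first 25% for one feature is not automatically in the first 25% for every other feature.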
  23. Failure is not an option
  24. Failure is not an option: it is inevitable!
  25. Failure is not an option: it is inevitable, and a learning opportunity!
  26. Failure is not an option: it is inevitable, a learning opportunity, and DETECTABLE!
  27. Access
  28. Detect problems quickly
  29. CONFIDENCE
  30. A: Well, the Ops team manages the network, racks the servers, installed the monitoring tools, wears the pagers, blah, blah, blah...
  31. Engineers build the application
  32. OPS and ENG: Logging, Graphing, Trending, Alerting
  33. “Engineers are too busy writing features to build metrics.”
  34. Metrics are part of every feature ...and so are config flags
  35. Dead Simple
  36. Simple, open source tools
  37. Cacti (network, SNMP); Ganglia (machines); Graphite (application); Splunk (log analysis, nightly reports); Nagios (alerting); Logging; Logster; StatsD
  38. Ganglia
  39. Ganglia: cluster-oriented; huge set of community-contributed recipes; custom metrics (gmetad)
  40. Graphite
  41. Graphite: single-instance; create new metrics on-the-fly; customize via URLs and display functions
  42. Logging
  43. It’s 2:48 PM. Do you know where your logs are?
  44-45. Logger::log_error("User login failed. Reason: $msg for $username", "login");
  46-51. web0054 [Fri Mar 04 16:27:48 2011][error] [login] [mk04gw1p71] User login failed. Reason: wrong password for ...
  52. LogFormat "%h %l %u %t \"%r\" %>s %b" common
  53. LogFormat "%{True-Client-IP}i %l %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{etsy_shop_id}n %{etsy_uaid}n %V %{etsy_ab_selections}n %{etsy_request_uuid}n %{etsy_api_consumer_key}n %{etsy_api_method_name}n %{php_memory_usage_bytes}n %{php_time_microsec}n %D" combined
  54. apache_note()
  55-57. LogFormat "%{True-Client-IP}i %l %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{etsy_shop_id}n %{etsy_uaid}n %V %{etsy_ab_selections}n %{etsy_request_uuid}n %{etsy_api_consumer_key}n %{etsy_api_method_name}n %{php_memory_usage_bytes}n %{php_time_microsec}n %D" combined
  58. grep "/listing/" access.log | awk '{sum=sum+$(NF-2)} END {print sum/NR}'
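The grep and awk one-liner above averages the third field from the end of each matching access-log line (counting from the end sidesteps the spaces inside the quoted User-Agent). A rough Python equivalent is sketched below; the function name `mean_field_from_end` and the sample log lines in the usage note are made up for illustration. Unlike awk, this sketch skips lines that are too short instead of counting them as zero.

```python
def mean_field_from_end(lines, needle, offset=2):
    """Average the whitespace-separated field `offset` places before the last
    one, over lines containing `needle` (mirrors awk's $(NF-offset))."""
    total = 0.0
    matched = 0
    for line in lines:
        if needle not in line:
            continue
        fields = line.split()
        if len(fields) <= offset:
            continue  # not enough fields to index from the end
        total += float(fields[-1 - offset])
        matched += 1
    return total / matched if matched else None
```

With `offset=2` and the custom LogFormat shown earlier, the averaged field would be the one written just before the last two, so a call such as `mean_field_from_end(open("access.log"), "/listing/")` plays the role of the shell pipeline.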
  59. web0001 [04:28:54 2011] [error] [client 10.101.x.x] Help me, Rhonda.
      web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
      web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!
      web0001 [04:28:54 2011] [error] [client 10.101.x.x] Heeeeeeellllllllllllllppppp!
      web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
      web0001 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
      web0201 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
      web0034 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
      web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
      web1101 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
      web0201 [04:28:54 2011] [error] [client 10.101.x.x] You've been eaten by a grue.
      web0055 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
      web0002 [04:28:54 2011] [warning] [client 10.101.x.x] Sky is falling.
      web0089 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
      web0020 [04:28:54 2011] [error] [client 10.101.x.x] Sky is falling.
      web1101 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
      web0055 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
      web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
      web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
      web0034 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
      web0087 [04:28:54 2011] [fatal] [client 10.101.x.x] Sky is falling.
      web0002 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
      web0201 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
      web0077 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
      web0355 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
      web0052 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
      web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
      web0003 [04:28:54 2011] [error] [client 10.101.x.x] You've been eaten by a grue.
      web0066 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
  60. Logster: Fatals, Errors, Warnings
  61. Logster: run by cron; keeps a cursor on your log file; aggregate lines any way you want; output to Ganglia or Graphite; simple parsers. github.com/etsy
  62. web0054 [Fri Mar 04 16:27:48 2011][error] [login] [mk04gw1p71] User login failed. Reason: wrong password for ...
  63. ^.+ \[.+\] \[(?P<log_level>.+)\]
  64. if (fields['log_level'] == "fatal"):
          self.fatals += 1
      elif (fields['log_level'] == "error"):
          self.errors += 1
      elif (fields['log_level'] == "warning"):
          self.warnings += 1
      ...
  65. MetricObject("fatals", (self.fatals / self.duration), "per sec")
      MetricObject("errors", (self.errors / self.duration), "per sec")
      MetricObject("warning", (self.warnings / self.duration), "per sec")
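The parser fragments on slides 63 to 65 can be pulled together into a self-contained sketch. It does not use the real Logster classes; `LevelCounter`, `parse_line`, and `rates` are hypothetical names, and the regular expression is a tightened variant of the one on the slide, written to cope with the missing space between the timestamp and the level in the sample line.

```python
import re

# Host, bracketed timestamp, then the bracketed log level.
# A stricter variant of the slide's pattern; an assumption, not Logster's own.
LINE_RE = re.compile(r"^\S+ \[[^\]]+\]\s*\[(?P<log_level>[^\]]+)\]")


class LevelCounter:
    """Count fatal/error/warning lines and report them as per-second rates."""

    def __init__(self):
        self.counts = {"fatal": 0, "error": 0, "warning": 0}

    def parse_line(self, line):
        match = LINE_RE.match(line)
        if match is None:
            return  # ignore lines that do not look like application log lines
        level = match.group("log_level")
        if level in self.counts:
            self.counts[level] += 1

    def rates(self, duration):
        """Return counts divided by the sampling window length in seconds."""
        return {level: count / duration for level, count in self.counts.items()}
```

Feeding every new line since the last cron run through `parse_line` and then calling `rates(duration)` yields the three per-second values that the slide reports as metric objects.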
  66. Fatals, Errors, Warnings
  67. StatsD
  68. StatsD: network daemon (node.js); accepts data over UDP; flushes to Graphite every 10 sec; one line of code. github.com/etsy
  69. StatsD::increment("logins.success");
  70. StatsD::increment("logins.success"); (graph series: logins)
  71. StatsD::timing("gearman.time", $msec);
  72. StatsD::timing("gearman.time", $msec); (graph series: 90th pct, average, lower)
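What the two PHP one-liners put on the wire can be sketched in Python. StatsD's UDP format is `name:value|c` for counters and `name:value|ms` for timers; the helper names below (`counter_packet`, `timing_packet`, `send`) are assumptions for this example, and the default host and port are placeholders for wherever the daemon runs.

```python
import socket


def counter_packet(name, delta=1):
    """Build a StatsD counter datagram, e.g. b"logins.success:1|c"."""
    return f"{name}:{delta}|c".encode("ascii")


def timing_packet(name, msec):
    """Build a StatsD timer datagram, e.g. b"gearman.time:320|ms"."""
    return f"{name}:{msec}|ms".encode("ascii")


def send(packet, host="127.0.0.1", port=8125):
    """Fire-and-forget over UDP: no connection and no reply to wait for."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(packet, (host, port))
    finally:
        sock.close()
```

Because UDP needs no handshake and no acknowledgement, a call like `send(counter_packet("logins.success"))` adds very little overhead to the request that emits it, which is what makes one line of instrumentation per feature practical.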
  73. Ad hoc: name value timestamp
  74. echo "events.deploy.site 1 `date +%s`" | nc graphite.etsycorp.com 2003
  75. Vertical Line Technology! target=drawAsInfinite(events.deploy.site)
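The `echo ... | nc` command above writes one line of Graphite's plaintext protocol, `name value timestamp`, to port 2003. A Python sketch of the same step, for example from a deploy script, might look like this; `plaintext_line` and `send_event` are hypothetical helper names and the host is whatever Graphite server applies.

```python
import socket
import time


def plaintext_line(name, value, timestamp=None):
    """Format one Graphite plaintext-protocol line: "name value timestamp\\n"."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{name} {value} {int(timestamp)}\n"


def send_event(line, host, port=2003, timeout=5.0):
    """Open a TCP connection to the Graphite listener and write the line."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line.encode("ascii"))
```

Recording a deploy then becomes `send_event(plaintext_line("events.deploy.site", 1), "graphite.example.com")`, and the resulting series can be drawn as vertical lines with `drawAsInfinite`.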
  76. We could stare at graphs all day...
  77. http://graphite/render?from=-1hours&width=600&height=200&target=webs.errorLog.warning&rawData=1
  78. http://graphite/render?from=-1hours&width=600&height=200&target=webs.errorLog.warning&rawData=1
      webs.errorLog.warning,1318444930,1318448530,60|5.0,1.0,3.0,1.0,0.0,9.0,0.0,1.0,3.0,2.0,1.0,6.0,2.0,6.0,3.0,6.0,4.0,4.0,2.0,1.0,1.0,8.0,2.0,3.0,6.0,3.0,5.0,3.0,0.0,4.0,6.0,2.0,0.0,2.0,0.0,4.0,0.0,3.0,1.0,3.0,4.0,2.0,10.0,3.0,0.0,6.0,0.0,4.0,2.0,5.0,18.0,1.0,1.0,2.0,1.0,8.0,5.0,1.0,1.0,None
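The `rawData=1` response above has the shape `target,start,end,step|v1,v2,...`, with `None` marking missing points. A script (for instance a monitoring check) could read it along the following lines; `parse_raw_series` is a hypothetical helper, and the layout is inferred from the sample on the slide.

```python
def parse_raw_series(line):
    """Split one rawData line into its header fields and a list of values."""
    header, _, body = line.strip().partition("|")
    # rsplit keeps any commas inside the target expression itself intact.
    name, start, end, step = header.rsplit(",", 3)
    values = [None if v == "None" else float(v) for v in body.split(",")] if body else []
    return {
        "name": name,
        "start": int(start),
        "end": int(end),
        "step": int(step),
        "values": values,
    }
```

In the slide's sample the window runs from 1318444930 to 1318448530 with a 60-second step, which is one hour of per-minute data points.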
  79. Holt-Winters Confidence Bands (upper, lower)
  80. Holt-Winters Aberration
  81. Business metrics + Confidence bands = Alertable metrics
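Once upper and lower confidence bands exist for a metric, turning them into an alert is a simple comparison. The sketch below assumes the bands have already been computed elsewhere (it does not implement Holt-Winters forecasting), and the names `aberration` and `alertable_points` are made up for the example.

```python
def aberration(value, lower, upper):
    """Distance outside the band: positive above upper, negative below lower,
    0.0 when inside the band or when any input is missing."""
    if value is None or lower is None or upper is None:
        return 0.0
    if value > upper:
        return value - upper
    if value < lower:
        return value - lower
    return 0.0


def alertable_points(values, lowers, uppers):
    """Return indexes of data points that fall outside their confidence band."""
    return [
        i
        for i, (v, lo, hi) in enumerate(zip(values, lowers, uppers))
        if aberration(v, lo, hi) != 0.0
    ]
```

An alerting check could then page when `alertable_points` is non-empty for several consecutive intervals, which tolerates a single noisy data point.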
  82. 40,000+ metrics at Etsy: Systems, Applications, Business
  83-84. Dashboards
  85. Kind of Hard :-/
      <a href="http://graphite.etsycorp.com/render?from=-1hours&width=800&height=600&title=File+or+Script+Not+Found&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600">
      <img src="http://graphite.etsycorp.com/render?from=-1hours&width=280&height=220&title=File+or+Script+Not+Found&hideLegend=1&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600">
      </a>
  86. Super Easy!
      $g = new Graphite($time);
      $g->setTitle('File Not Found');
      $g->addMetric('webs.errorLog.notExist', '#00cc00');
      echo $g->getDashboardHTML(280, 220);
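The contrast between slides 85 and 86 is a small helper that assembles the render URL so nobody has to hand-encode it. A Python sketch of such a helper follows; it is not the PHP `Graphite` class from the slide, and `render_url` with its parameters is an assumed interface for illustration.

```python
from urllib.parse import urlencode


def render_url(base, targets, title=None, width=280, height=220,
               since="-1hours", colors=None):
    """Build a Graphite render URL with one target parameter per series."""
    params = [("from", since), ("width", width), ("height", height)]
    if title:
        params.append(("title", title))
    params.extend(("target", target) for target in targets)
    if colors:
        params.append(("colorList", ",".join(colors)))
    # urlencode percent-escapes parentheses, '#', and spaces for us.
    return f"{base}/render?{urlencode(params)}"
```

Passing deploy events as extra targets, for example `"drawAsInfinite(deploys.web.production)"`, overlays them on the metric in the same graph, as in the hand-written URL on slide 85.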
  87. Metrics!
  88. Metrics! Metrics + Events
  89. Metrics! Metrics + Events. Metrics + Alerts
  90. Metrics! Metrics + Events. Metrics + Alerts. Metrics + Metrics
  91. High-level, real-time visibility
  92. Detect problems quickly
  93. CONFIDENCE
  94. Make them required features
  95. Make them dead simple
  96. Make them accessible
  97. Make them!
  98. Homework: codeascraft.etsy.com, github.com/etsy. Get in touch: mike @ etsy . com, @mikebrittain. We’re always looking for people who are interested in this kind of stuff... etsy.com/careers. Thank You