Metrics-Driven Engineering
Presented at Web 2.0 Expo, Oct. 13 2011

Transcript

  • 1. Metrics-Driven Engineering. Mike Brittain (@mikebrittain), Director of Engineering, Infrastructure. October 13, 2011
  • 2. Tools and Process at Etsy
  • 3. How do people use Etsy? How many new visits? How many listings created? How many registrations? How many convos sent? How many purchases? How many new shops?
  • 4. What is the application doing? Search indexing? How fast are pages generating? Async tasks currently in queue? Developer API auth and rate limiting? Images resized and stored? Error and warning rates?
  • 5. Are the servers in good shape? Replication slave lag? Memcache hits/misses? Available connections? Database queries per second? Total outgoing bandwidth? CPU, Memory, I/O?
  • 6. Business Metrics
  • 7. Application Metrics
  • 8. System Metrics
  • 9. Visibility EVERYWHERE
  • 10. Constant Change
  • 11. $314 Million GMS 2010; $180 Million GMS 2009; $87 Million GMS 2008; $26 Million GMS 2007. credit: pentarux (flickr)
  • 12. 25 Million Unique Visitors, 1 Billion page views per month. credit: pentarux (flickr)
  • 13. Engineering team grew 500% over 18 months. credit: martin_heigan (flickr)
  • 14. Less talk, more do.
  • 15. Always Be Shipping. credit: ibailemon (flickr)
  • 16. Always Be Shipping (even if it’s your first day). credit: ibailemon (flickr)
  • 17. 90+ Engineers, 40+ Deploys / day. credit: misswired (flickr)
  • 18. credit: digidave (flickr)
  • 19. Code Reviews
  • 20. Automated Tests
  • 21. Config Flags: enable and disable features quickly
        $cfg = array(
            'checkout'   => array('enabled' => 'on'),
            'homepage'   => array('enabled' => 'on'),
            'profiles'   => array('enabled' => 'on'),
            'new_search' => array('enabled' => 'off'),
        );
  • 22. Config Flags: enable and disable features quickly (same $cfg array as slide 21). Plus “admin-only,” percentage ramp-up, A/B testing, whitelists, blacklists, etc.
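The percentage ramp-up mentioned on the config flags slide could look roughly like the sketch below. It is not Etsy's implementation: the `rampup` state, the `percent` field, the `new_profiles` flag, and the hashing scheme are all assumptions made for illustration. The idea is only that each user lands in a stable bucket, so raising the percentage adds users without flipping existing ones back and forth.

```python
import hashlib

# Hypothetical flag table loosely mirroring the slide's $cfg array.
# The "rampup" state and "percent" field are assumptions, not from the talk.
FLAGS = {
    "checkout": {"enabled": "on"},
    "new_search": {"enabled": "off"},
    "new_profiles": {"enabled": "rampup", "percent": 25},
}


def bucket(feature, user_id):
    """Map a (feature, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(feature, user_id, flags=FLAGS):
    """Return True if the feature is on for this user; unknown flags are off."""
    cfg = flags.get(feature, {"enabled": "off"})
    state = cfg["enabled"]
    if state == "on":
        return True
    if state == "rampup":
        # Users whose bucket falls below the configured percentage get the feature.
        return bucket(feature, user_id) < cfg.get("percent", 0)
    return False
```

Because the bucket depends on the feature name as well as the user, different features ramp up over different slices of users.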
  • 23. Failure is not an option
  • 24. Failure is inevitable! (replaces “not an option”)
  • 25. Failure is inevitable! Failure is a learning opportunity!
  • 26. Failure is inevitable! Failure is a learning opportunity! Failure is DETECTABLE!
  • 27. Access
  • 28. Detect problems quickly
  • 29. CONFIDENCE
  • 30. A: Well, the Ops team manages the network, racks the servers, installed the monitoring tools, wears the pagers, blah, blah, blah...
  • 31. Engineers build the application
  • 32. OPS | ENG: Logging, Graphing, Trending, Alerting
  • 33. “Engineers are too busy writing features to build metrics.”
  • 34. Metrics are part of every feature ...and so are config flags
  • 35. Dead Simple
  • 36. Simple, open source tools
  • 37. Cacti (network, SNMP); Ganglia (machines); Graphite (application); Splunk (log analysis, nightly reports); Nagios (alerting); Logging; Logster; StatsD
  • 38. Ganglia
  • 39. Ganglia: cluster-oriented; huge set of community-contributed recipes; custom metrics (gmetric)
  • 40. Graphite
  • 41. Graphite: single-instance; create new metrics on the fly; customize via URLs and display functions
  • 42. Logging
  • 43. It’s 2:48 PM. Do you know where your logs are?
  • 44. Logger::log_error("User login failed. Reason: $msg for $username", "login");
  • 46. web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [mk04gw1p71] User login failed. Reason: wrong password for ...
  • 52. LogFormat "%h %l %u %t \"%r\" %>s %b" common
  • 53. LogFormat "%{True-Client-IP}i %l %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{etsy_shop_id}n %{etsy_uaid}n %V %{etsy_ab_selections}n %{etsy_request_uuid}n %{etsy_api_consumer_key}n %{etsy_api_method_name}n %{php_memory_usage_bytes}n %{php_time_microsec}n %D" combined
  • 54. apache_note()
  • 55. LogFormat "%{True-Client-IP}i %l %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{etsy_shop_id}n %{etsy_uaid}n %V %{etsy_ab_selections}n %{etsy_request_uuid}n %{etsy_api_consumer_key}n %{etsy_api_method_name}n %{php_memory_usage_bytes}n %{php_time_microsec}n %D" combined
  • 58. grep "/listing/" access.log | awk '{sum=sum+$(NF-2)} END {print sum/NR}'
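The grep/awk one-liner on slide 58 averages the third-from-last whitespace-separated field over all requests whose line contains "/listing/". A rough Python equivalent is sketched below. The function name and parameters are made up for illustration, and unlike awk (which counts a non-numeric field as 0) this version skips lines whose field is not a number.

```python
def average_field(lines, needle="/listing/", index_from_end=3):
    """Average one numeric field over the log lines that contain `needle`.

    index_from_end=3 corresponds to awk's $(NF-2): the third field from the end.
    Returns None when no matching line has a numeric value in that position.
    """
    total = 0.0
    count = 0
    for line in lines:
        if needle not in line:
            continue
        fields = line.split()
        if len(fields) < index_from_end:
            continue
        try:
            total += float(fields[-index_from_end])
        except ValueError:
            # Skip placeholders such as "-" instead of counting them as zero.
            continue
        count += 1
    return total / count if count else None
```

Usage would be along the lines of `average_field(open("access.log"))`, assuming the custom fields are logged at the end of each line as in the LogFormat slides.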
  • 59. web0001 [04:28:54 2011] [error] [client 10.101.x.x] Help me, Rhonda.
        web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
        web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!
        web0001 [04:28:54 2011] [error] [client 10.101.x.x] Heeeeeeellllllllllllllppppp!
        web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
        web0001 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
        web0201 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
        web0034 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
        web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
        web1101 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
        web0201 [04:28:54 2011] [error] [client 10.101.x.x] You've been eaten by a grue.
        web0055 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
        web0002 [04:28:54 2011] [warning] [client 10.101.x.x] Sky is falling.
        web0089 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
        web0020 [04:28:54 2011] [error] [client 10.101.x.x] Sky is falling.
        web1101 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
        web0055 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
        web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
        web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
        web0034 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
        web0087 [04:28:54 2011] [fatal] [client 10.101.x.x] Sky is falling.
        web0002 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
        web0201 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
        web0077 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
        web0355 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
        web0052 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
        web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
        web0003 [04:28:54 2011] [error] [client 10.101.x.x] You've been eaten by a grue.
        web0066 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
  • 60. Logster (graph: Fatals, Errors, Warnings)
  • 61. Logster: run by cron; keeps a cursor on your log file; aggregate lines any way you want; output to Ganglia or Graphite; simple parsers. github.com/etsy
  • 62. web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [mk04gw1p71] User login failed. Reason: wrong password for ...
  • 63. ^.+ \[.+\] \[(?P<log_level>.+)\]
  • 64. if (fields['log_level'] == "fatal"):
            self.fatals += 1
        elif (fields['log_level'] == "error"):
            self.errors += 1
        elif (fields['log_level'] == "warning"):
            self.warnings += 1
        ...
  • 65. MetricObject("fatals", (self.fatals / self.duration), "per sec")
        MetricObject("errors", (self.errors / self.duration), "per sec")
        MetricObject("warning", (self.warnings / self.duration), "per sec")
  • 66. Fatals Errors Warnings
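Slides 62 to 65 can be pulled together into one runnable sketch. It does not import Logster: `Metric` is a stand-in for Logster's `MetricObject`, the class name is made up, and the `parse_line` / `get_state` method names are assumed to follow the shape of a Logster parser. The regular expression is deliberately stricter than the one on slide 63, since a greedy `.+` can capture a later bracketed field instead of the log level.

```python
import re
from collections import namedtuple

# Stand-in for Logster's MetricObject so the sketch runs without Logster installed.
Metric = namedtuple("Metric", ["name", "value", "units"])

# Host, bracketed timestamp, then the bracketed log level (lowercase letters only).
LINE_RE = re.compile(r"^\S+ \[[^\]]+\] ?\[(?P<log_level>[a-z]+)\]")

# Metric names as they appear on slide 65.
METRIC_NAMES = {"fatal": "fatals", "error": "errors", "warning": "warning"}


class ErrorLogCounter:
    """Counts fatal/error/warning lines and reports them as per-second rates."""

    def __init__(self):
        self.counts = {level: 0 for level in METRIC_NAMES}

    def parse_line(self, line):
        """Update counters for one log line; unrecognized lines are ignored."""
        match = LINE_RE.match(line)
        if match and match.group("log_level") in self.counts:
            self.counts[match.group("log_level")] += 1

    def get_state(self, duration):
        """Return one Metric per level, as a rate over `duration` seconds."""
        return [
            Metric(METRIC_NAMES[level], count / duration, "per sec")
            for level, count in self.counts.items()
        ]
```

A cron-driven wrapper would feed only the lines added since the last run into `parse_line` and pass the elapsed seconds to `get_state`, which is the "cursor" behavior described on slide 61.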
  • 67. StatsD
  • 68. StatsD: network daemon (node.js); accepts data over UDP; flushes to Graphite every 10 sec; one line of code. github.com/etsy
  • 69. StatsD::increment("logins.success");
  • 70. StatsD::increment("logins.success"); (graph: logins)
  • 71. StatsD::timing("gearman.time", $msec);
  • 72. StatsD::timing("gearman.time", $msec); (graph: 90th pct, average, lower)
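What the two StatsD calls put on the wire is a short text datagram: `name:value|c` for a counter and `name:value|ms` for a timer. A minimal Python sketch follows. The function names are illustrative, and the host and port are assumptions (8125 is the usual StatsD default, but a deployment may differ).

```python
import socket


def counter_packet(name, delta=1):
    """Build a StatsD counter datagram, e.g. b"logins.success:1|c"."""
    return f"{name}:{delta}|c".encode("ascii")


def timing_packet(name, msec):
    """Build a StatsD timer datagram, e.g. b"gearman.time:320|ms"."""
    return f"{name}:{msec}|ms".encode("ascii")


def send(packet, host="127.0.0.1", port=8125):
    """Fire-and-forget over UDP; host and port are assumed defaults."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))
```

UDP keeps the "one line of code" cheap for the application: if the daemon is down the packet is simply lost, and the request that emitted it is not slowed or failed.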
  • 73. Ad hoc: name value timestamp
  • 74. echo "events.deploy.site 1 `date +%s`" | nc graphite.etsycorp.com 2003
  • 75. Vertical Line Technology! target=drawAsInfinite(events.deploy.site)
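The `echo ... | nc` line on slide 74 uses Graphite's plaintext protocol: one `name value timestamp` line per data point, sent over TCP to port 2003. The same thing from Python might look like this sketch. The function names and the `graphite.example.com` host are placeholders, not taken from the talk.

```python
import socket
import time


def plaintext_line(name, value, timestamp=None):
    """Format one data point as "name value timestamp\\n" (Unix epoch seconds)."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{name} {value} {ts}\n"


def send_event(name, value=1, host="graphite.example.com", port=2003, timestamp=None):
    """Send a single data point to a Graphite listener; host is a placeholder."""
    line = plaintext_line(name, value, timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
```

Calling something like `send_event("events.deploy.site")` from a deploy script is what makes the `drawAsInfinite` vertical lines on slide 75 possible.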
  • 76. We could stare at graphs all day...
  • 77. http://graphite/render?from=-1hours&width=600&height=200&target=webs.errorLog.warning&rawData=1
  • 78. http://graphite/render?from=-1hours&width=600&height=200&target=webs.errorLog.warning&rawData=1
        webs.errorLog.warning,1318444930,1318448530,60|5.0,1.0,3.0,1.0,0.0,9.0,0.0,1.0,3.0,2.0,1.0,6.0,2.0,6.0,3.0,6.0,4.0,4.0,2.0,1.0,1.0,8.0,2.0,3.0,6.0,3.0,5.0,3.0,0.0,4.0,6.0,2.0,0.0,2.0,0.0,4.0,0.0,3.0,1.0,3.0,4.0,2.0,10.0,3.0,0.0,6.0,0.0,4.0,2.0,5.0,18.0,1.0,1.0,2.0,1.0,8.0,5.0,1.0,1.0,None
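The raw output on slide 78 has the shape `target,start,end,step|v1,v2,...`, with `None` for missing points, which makes it easy to check from a script instead of staring at graphs. Below is a small parsing sketch. The `exceeds` helper and its threshold logic are illustrative assumptions, not the alerting rule used at Etsy.

```python
def parse_raw(line):
    """Parse one rawData line: "target,start,end,step|v1,v2,...,None"."""
    header, _, body = line.strip().partition("|")
    # Split from the right, since a target expression may itself contain commas.
    target, start, end, step = header.rsplit(",", 3)
    values = [None if v == "None" else float(v) for v in body.split(",")]
    return {
        "target": target,
        "start": int(start),
        "end": int(end),
        "step": int(step),
        "values": values,
    }


def exceeds(series, threshold, last_n=5):
    """True if any of the last `last_n` non-missing points is above `threshold`."""
    recent = [v for v in series["values"][-last_n:] if v is not None]
    return bool(recent) and max(recent) > threshold
```

A check like this is the kind of thing a Nagios plugin could wrap: fetch the URL, parse the line, and return a warning or critical status based on the recent values.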
  • 79. Holt-Winters Confidence Bands (upper, lower)
  • 80. Holt-Winters Aberration
  • 81. Business metrics + Confidence bands = Alertable metrics
  • 82. 40,000+ metrics at Etsy Systems, Applications, Business
  • 83. Dashboards
  • 85. Kind of Hard :-/ <a href="http://graphite.etsycorp.com/render?from=-1hours&width=800&height=600&title=File+or+Script+Not+Found&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600"> <img src="http://graphite.etsycorp.com/render?from=-1hours&width=280&height=220&title=File+or+Script+Not+Found&hideLegend=1&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600"></a>
  • 86. Super Easy! $g = new Graphite($time); $g->setTitle('File Not Found'); $g->addMetric('webs.errorLog.notExist', '#00cc00'); echo $g->getDashboardHTML(280, 220);
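The "Super Easy" helper on slide 86 is Etsy-internal PHP, so its internals are not shown. A comparable helper is sketched here in Python under that caveat: the function name, its parameters, and the `graphite.example.com` host are all hypothetical. It builds a render URL and overlays deploy events with `drawAsInfinite`, as the hand-written URL on slide 85 does.

```python
from urllib.parse import urlencode


def render_url(base, targets, events=(), params=None):
    """Build a Graphite render URL.

    targets: metric paths to plot.
    events:  metric paths drawn as vertical lines via drawAsInfinite().
    params:  other query parameters, e.g. {"from": "-1hours", "width": 280}.
    """
    query = list((params or {}).items())
    query += [("target", t) for t in targets]
    query += [("target", f"drawAsInfinite({e})") for e in events]
    return f"{base}/render?{urlencode(query)}"


def dashboard_html(full_url, thumb_url):
    """Wrap a thumbnail graph in a link to the full-size graph."""
    # Ampersands are escaped because the URLs sit inside HTML attributes.
    full = full_url.replace("&", "&amp;")
    thumb = thumb_url.replace("&", "&amp;")
    return f'<a href="{full}"><img src="{thumb}"></a>'
```

Keeping the deploy-event targets in one place means every dashboard graph gets the same vertical lines without repeating the long query string by hand.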
  • 87. Metrics!
  • 88. Metrics! Metrics + Events
  • 89. Metrics! Metrics + Events; Metrics + Alerts
  • 90. Metrics! Metrics + Events; Metrics + Alerts; Metrics + Metrics
  • 91. High-level, real-time visibility
  • 92. Detect problems quickly
  • 93. CONFIDENCE
  • 94. Make them required features
  • 95. Make them dead simple
  • 96. Make them accessible
  • 97. Make them!
  • 98. Homework: codeascraft.etsy.com, github.com/etsy. Get in touch: mike @ etsy . com, @mikebrittain. We’re always looking for people who are interested in this kind of stuff... etsy.com/careers. Thank You