Metrics driven engineering (velocity 2011)


Transcript

  • 1. METRICS-DRIVEN ENGINEERING at Etsy. Kellan Elliott-McCrea, VP of Eng., kellan@etsy.com, @kellan. Tuesday, June 5, 12
  • 4. What is Etsy?
  • 5. 8.5+ million items in the marketplace
  • 6. 400,000+ active
  • 7. $300+ million in sales in 2010 ~$41 million/month
  • 8. > $1000 / minute
  • 9. > 1 billion page views / month
  • 10. business in over 150 countries
  • 11. deploy the site, every ~20 minutes
  • 12. engineering team grew ~4x in 2010
  • 13. Metrics?
  • 14. Logs, Graphs, Trends, and Correlations
  • 15. Metrics Driven?
  • 16. Making Decisions
  • 17. How many visitors are using this thing?
  • 18. Can we deploy that to 100% of our visitors?
  • 19. Did we make it faster?
  • 20. Did I just break something?
  • 21. Q. WHO MAKES THESE GRAPHS? A. Well, the Ops team manages the racks, the network, the servers, installed monitoring tools, wears the pagers, blah, blah, blah...
  • 22. but... Engineers build the application.
  • 23. Dev + Ops
  • 24. ACCESS
  • 25. Yes! No.
  • 26. “Engineers are too busy!”
  • 27. Here’s the BIG SECRET...
  • 28. ... MAKE IT EASY!
  • 29. Simple, open source tools
  • 30. Cacti (network, SNMP); Ganglia (machines); Graphite (application); Splunk (log analysis, nightly reports); Nagios (alerting)
  • 31. Ganglia ★ cluster oriented ★ huge community contributed recipes ★ 2.0 released today (including several Flickr and Etsy patches!) ★ gmetad makes it easy to track custom metrics
  • 33. Graphite ★ super flexible collection and display ★ per metrics buckets ★ single instance ★ super easy to write and use custom display functions
  • 34. Logging
  • 35. Logger::log_error("User login failed. Reason: $msg for $username", "login");
  • 36. web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [14531658] User login failed. Reason: wrong password for ...
  • 39. web0054 [Fri Mar 04 16:27:48 2011] [info] [login] [14531658] User login failed. Reason: wrong password for ...
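A minimal sketch of how a log line in the format shown on slides 36-41 could be split into fields for later counting. The field names (`host`, `level`, `namespace`, `ident`) are assumptions made for illustration; the slides do not name them, and the meaning of the bracketed number is not stated.

```python
import re
from datetime import datetime

# One bracketed group per field, in the order they appear on the slide.
LINE_RE = re.compile(
    r"^(?P<host>\S+) "
    r"\[(?P<timestamp>[^\]]+)\] "
    r"\[(?P<level>\w+)\] "
    r"\[(?P<namespace>\w+)\] "
    r"\[(?P<ident>\d+)\] "
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Timestamp layout as on the slide, e.g. "Fri Mar 04 16:27:48 2011".
    fields["timestamp"] = datetime.strptime(
        fields["timestamp"], "%a %b %d %H:%M:%S %Y"
    )
    return fields
```

Because the level and namespace sit in fixed positions, a downstream tool can group by them without understanding the free-text message.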
  • 42. Counting and Timing http://code.flickr.com/blog/2008/10/27/counting-timing/
  • 43. Logster
  • 44. Logster https://github.com/etsy/logster
  • 45. Forked from ganglia-logtailer: - Daemon mode (only cron mode), + Support for Graphite, + Simplified parsing scripts
  • 46. web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
    web0001 [04:28:54 2011] [error] [client 10.101.x.x] Help me, Rhonda.
    web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
    web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!
    web0001 [04:28:54 2011] [error] [client 10.101.x.x] Heeeeeeellllllllllllllppppp!
    web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
    web0001 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
    web0201 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
    web0034 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
    web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
    web1101 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
    web0201 [04:28:54 2011] [error] [client 10.101.x.x] You've been eaten by a grue.
    web0055 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
    web0002 [04:28:54 2011] [warning] [client 10.101.x.x] Sky is falling.
    web0089 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
    web0020 [04:28:54 2011] [error] [client 10.101.x.x] Sky is falling.
    web1101 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
    web0055 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
    web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
    web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
    web0034 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
    web0087 [04:28:54 2011] [fatal] [client 10.101.x.x] Sky is falling.
    web0002 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
    web0201 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
    web0077 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
    web0355 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
    web0052 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
    web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
    web0003 [04:28:54 2011] [error] [client 10.101.x.x] You've been eaten by a grue.
    web0066 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
    web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Sky is falling
  • 47. Fatals, Errors, Warnings
  • 48. ★ runs out of cron ★ maintains a cursor into log files ★ supports ganglia and graphite ★ custom parsers much easier to write than gmetad
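The cron-plus-cursor idea can be sketched in a few lines: each run reads only what was appended since the last run and turns it into per-severity counts. This is not Logster's actual code or parser API. The function names are hypothetical, and the cursor is simplified to a byte offset that the caller stores between runs.

```python
import re
from collections import Counter

# Severity tag as it appears in the sample log on slide 46.
LEVEL_RE = re.compile(r"\[(warning|error|fatal)\]")


def count_levels(lines):
    """Count how many lines carry each severity tag."""
    counts = Counter()
    for line in lines:
        match = LEVEL_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def read_new_lines(path, offset):
    """Return (complete lines appended after offset, new offset).

    A trailing partial line is left unread so the next run sees it whole.
    """
    with open(path, "rb") as handle:
        handle.seek(offset)
        data = handle.read()
    end = data.rfind(b"\n") + 1  # 0 when no complete line is available
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end
```

A cron job built on this would load the saved offset, call `read_new_lines`, pass the result to `count_levels`, submit the three counts to Ganglia or Graphite, and save the new offset. Log rotation is not handled here.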
  • 49. Apache access logs
  • 50. LogFormat "%h %l %u %t \"%r\" %>s %b" common
  • 51. LogFormat "%{X-Forwarded-For}i %{True-Client-IP}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{etsy_shop_id}n %{etsy_uaid}n %V %{etsy_ab_selections}n %{etsy_request_uuid}n %{etsy_api_consumer_key}n %{etsy_api_method_name}n %{php_memory_usage_bytes}n %{php_time_microsec}n %D" combined
  • 52. %{etsy_ab_selections}n
  • 53. %{etsy_uaid}n
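Once lines are written in a known LogFormat, they can be parsed back into fields for analysis. The sketch below handles only the stock `common` format from slide 50. The function name is illustrative, and the custom Etsy notes on slide 51 would need extra fields whose contents the slides do not show.

```python
import re

# Fields of: %h %l %u %t "%r" %>s %b
COMMON_RE = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_common(line):
    """Parse one access-log line; return None if it does not match."""
    match = COMMON_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # %b logs "-" instead of 0 when no body bytes were sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record
```

Appending request-scoped values to the format, as slide 51 does, means every extra question (which A/B variant, how much memory, how long) can be answered from the same line.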
  • 54. Graphs
  • 55. “If Engineering at Etsy has a religion, it’s the Church of Graphs. If it moves, we track it.” - Erik Kastner http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/
  • 57. StatsD
  • 58. StatsD https://github.com/etsy/statsd/
  • 59. StatsD::increment("logins.success"); StatsD::timing("gearman.time", $msec);
  • 60. 90th pct, average, lower. StatsD::timing("gearman.time", $msec);
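Calls like these end up as small UDP datagrams sent to the StatsD daemon. A sketch of the StatsD line format follows: `bucket:value|c` for counters and `bucket:value|ms` for timers. The helper names and the localhost address are assumptions for illustration (8125 is StatsD's default port); this is not Etsy's PHP client.

```python
import socket


def format_increment(bucket, delta=1):
    """Counter line: the daemon sums these per flush interval."""
    return f"{bucket}:{delta}|c"


def format_timing(bucket, msec):
    """Timer line: the daemon derives mean, upper and lower values."""
    return f"{bucket}:{msec}|ms"


def send(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget over UDP, so a down daemon never blocks the app."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload.encode("ascii"), (host, port))
    finally:
        sock.close()
```

For example, `send(format_increment("logins.success"))` mirrors the first call on slide 59, and `send(format_timing("gearman.time", 320))` mirrors the second.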
  • 61. Ad hoc: name value timestamp
  • 62. echo "events.deploy.site 1 `date +%s`" | nc graphite.etsycorp.com 2003
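The same `name value timestamp` line can be sent from application code instead of the shell. A minimal sketch, assuming a Graphite carbon listener on the plaintext port 2003 as in the shell example; the function names are illustrative.

```python
import socket
import time


def format_metric(name, value, timestamp=None):
    """Build one plaintext-protocol line: 'name value timestamp\\n'."""
    if timestamp is None:
        timestamp = int(time.time())  # same as `date +%s`
    return f"{name} {value} {int(timestamp)}\n"


def send_metric(line, host, port=2003, timeout=5.0):
    """Open a TCP connection, write the line, and close."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line.encode("ascii"))
```

Calling `send_metric(format_metric("events.deploy.site", 1), "graphite.etsycorp.com")` would be the equivalent of the one-liner on slide 62.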
  • 63. Correlations
  • 64. echo "events.deploy.site 1 `date +%s`" | nc graphite.etsycorp.com 2003
  • 65. Trends + Events: target=drawAsInfinite(events.deploy.site)
  • 66. What Happened?
  • 67. Holt-Winters
  • 68. "Forecasting Sales by Exponentially Weighted Moving Averages". Peter
  • 69. "Aberrant Behavior Detection in Time Series for Network Monitoring".
  • 70. "Holt-Winters Forecasting Applied to Poisson Processes in Real-Time".
  • 71. holtWintersConfidence(Upper|Lower)
  • 72. holtWintersAberration
  • 73. business metrics with confidence bands == alertable business metrics
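To make the idea of confidence bands concrete, here is a simplified additive Holt-Winters sketch in the style of the aberrant-behavior paper named on slide 69. It is not Graphite's implementation. The smoothing parameters, the band width `delta`, and the `warmup` cutoff are illustrative assumptions, and the initialisation is deliberately naive.

```python
def holt_winters_bands(series, season_length, alpha=0.1, beta=0.0035,
                       gamma=0.1, delta=3.0):
    """Return one (forecast, lower, upper) tuple per point in series."""
    if not series:
        return []
    if season_length < 1:
        raise ValueError("season_length must be at least 1")
    level = float(series[0])
    trend = 0.0
    seasonal = [0.0] * season_length   # seasonal offset per slot
    deviation = [0.0] * season_length  # smoothed absolute error per slot
    bands = []
    for t, y in enumerate(series):
        s = t % season_length
        # Forecast uses the seasonal and deviation values from one season ago.
        forecast = level + trend + seasonal[s]
        width = delta * deviation[s]
        bands.append((forecast, forecast - width, forecast + width))
        new_level = alpha * (y - seasonal[s]) + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        seasonal[s] = gamma * (y - new_level) + (1 - gamma) * seasonal[s]
        deviation[s] = gamma * abs(y - forecast) + (1 - gamma) * deviation[s]
        level = new_level
    return bands


def aberrations(series, bands, warmup):
    """Indexes at or after warmup where the value falls outside its band."""
    return [
        i for i, (y, (_, lower, upper)) in enumerate(zip(series, bands))
        if i >= warmup and (y < lower or y > upper)
    ]
```

An alert on a business metric then reduces to checking whether `aberrations` returns anything for the most recent points. The `warmup` argument exists because the bands start at zero width and are meaningless until at least one full season has been observed.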
  • 74. 16,000 metrics in GRAPHITE (plus 32,000 metrics in GANGLIA)
  • 76. Dashboards
  • 79. Hard: <a href="http://graphite.etsycorp.com/render?from=-1hours&width=800&height=600&title=File+or+Script+Not+Found&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600"> <img src="http://graphite.etsycorp.com/render?from=-1hours&width=280&height=220&title=File+or+Script+Not+Found&hideLegend=1&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600"> </a>
  • 80. Easy! $g = new Graphite($time); $g->setTitle("File Not Found"); $g->addMetric("webs.errorLog.notExist", "#00cc00"); $g->showDeploys(true); echo $g->getDashboardHTML(280, 220);
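The gap between "hard" and "easy" is mostly URL assembly. A hedged sketch of a helper that builds a Graphite render URL with deploy markers overlaid is shown below. The function and its parameter names are hypothetical and are not the Graphite PHP class from the slide; the query parameters themselves (`from`, `width`, `height`, `title`, `yMin`, `hideLegend`, `target`) are the ones visible in the hand-written URL.

```python
from urllib.parse import urlencode


def render_url(base, metric, deploy_metrics=(), title="", since="-1hours",
               width=280, height=220, hide_legend=False):
    """Build a render URL for one metric plus vertical deploy lines."""
    params = [
        ("from", since),
        ("width", width),
        ("height", height),
        ("title", title),
        ("yMin", 0),
    ]
    if hide_legend:
        params.append(("hideLegend", 1))
    params.append(("target", metric))
    # Each deploy event series is drawn as a full-height vertical line.
    for name in deploy_metrics:
        params.append(("target", f"drawAsInfinite({name})"))
    return f"{base}/render?{urlencode(params)}"
```

With a helper like this, a dashboard tile is one call per graph, which is what makes it realistic for every engineer to build their own.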
  • 81. 48 dashboards by 32 engineers
  • 82. Application health
  • 83. High-level visibility
  • 84. Low MTTD
  • 85. Confidence
  • 86. Make metrics
  • 89. Not that much
  • 90. codeascraft.etsy.com, github.com/etsy/statsd, github.com/etsy/logster, bitbucket.org/maplebed/ganglia-logtailer
  • 91. Questions?