Metrics-Driven Engineering at Etsy

Published in: Technology

  1. Metrics-driven Engineering at Etsy MIKE BRITTAIN mike@etsy.com @mikebrittain
  2. Logs, Graphs, Trends, and Correlations
  3. Making Decisions
  4. How many visitors are using this thing?
  5. Can we deploy that to 100% of our visitors?
  6. Did we make it faster?
  7. Did I just break something?
  8. Q. Who makes the graphs? A. Well, the Ops team manages the network, racks the servers, installed the monitoring tools, wears the pagers, blah, blah, blah...
  9. (but...) Engineers build the application.
  10. Dev + Ops
  11. Access
  12. Yes No
  13. “Engineers are too busy meeting our product deadlines.”
  14. Here’s the big secret...
  15. Cacti (network, SNMP); Ganglia (machines); Graphite (application); Splunk (log analysis, nightly reports); Nagios (alerting)
  16. Logging
  17. Logger::log_error("User login failed. Reason: $msg for $username", "login");
  18-22. web0054 [Fri Mar 04 16:27:48 2011] [info] [login] User login failed. Reason: wrong password for ...
  23. Logster
  24. Forked from ganglia-logtailer...: - Daemon mode (only cron mode); + Support for Graphite; + Simplified parsing scripts
  25. web0001 [04:28:54 2011] [error] [client 10.101.x.x] Help me, Rhonda.
      web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
      web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!
      web0001 [04:28:54 2011] [error] [client 10.101.x.x] Heeeeeeellllllllllllllppppp!
      web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
      web0001 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
      web0201 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
      web0034 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
      web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
      web1101 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
      web0201 [04:28:54 2011] [error] [client 10.101.x.x] Youve been eaten by a grue.
      web0055 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
      web0002 [04:28:54 2011] [warning] [client 10.101.x.x] Sky is falling.
      web0089 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
      web0020 [04:28:54 2011] [error] [client 10.101.x.x] Sky is falling.
      web1101 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
      web0055 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
      web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
      web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
      web0034 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
      web0087 [04:28:54 2011] [fatal] [client 10.101.x.x] Sky is falling.
      web0002 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
      web0201 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
      web0077 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
      web0355 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
      web0052 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
      web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
      web0003 [04:28:54 2011] [error] [client 10.101.x.x] Youve been eaten by a grue.
      web0066 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
  26. Fatals, Errors, Warnings
  27. StatsD
  28. StatsD::increment("logins.success"); StatsD::timing("gearman.time", $msec);
  29. 90th pct, average, lower. StatsD::timing("gearman.time", $msec);
  30. Ad hoc: name value timestamp\n
  31. echo "events.deploy.site 1 `date +%s`" | nc graphite.etsycorp.com 2003
  32. Trends + Events: target=drawAsInfinite(events.deploy.site)
  33. What Happened?
  34. 16,000 metrics in Graphite (plus 32,000 metrics in Ganglia)
  35. Dashboards
  36. Mix & Match Dashboards
  37. Hard: <a href="http://graphite.etsycorp.com/render?from=-1hours&width=800&height=600&title=File+or+Script+Not+Found&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600"> <img src="http://graphite.etsycorp.com/render?from=-1hours&width=280&height=220&title=File+or+Script+Not+Found&hideLegend=1&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600"></a>
  38. Easy: $g = new Graphite($time); $g->setTitle('File Not Found'); $g->addMetric('webs.errorLog.notExist', '#00cc00'); $g->showDeploys(true); echo $g->getDashboardHTML(280, 220);
  39. 20 dashboards by 25 engineers
  40. Application health correlated with events
  41. High-level visibility
  42. Low MTTD
  43. Validation
  44. Confidence
  45. codeascraft.etsy.com, github.com/etsy/statsd, github.com/etsy/logster, bitbucket.org/maplebed/ganglia-logtailer
  46. Q&A. Does this sound like fun? Get in touch with us. chad@etsy.com kellan@etsy.com kastner@etsy.com mike@etsy.com
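
The logging slides (17 through 26) show application log lines tagged with a severity and a namespace, then graphs of fatals, errors and warnings. A minimal Python sketch of that counting step follows. It is not Logster's own parser: the line layout is inferred from the sample lines on the slides, and the `webs.errorLog.<level>` metric names, `parse_line`, `count_levels` and `to_metrics` are hypothetical names chosen for illustration.

```python
import re
from collections import Counter

# Assumed layout, inferred from the sample lines on the slides:
#   host [timestamp] [level] [namespace-or-client] message
LINE_RE = re.compile(
    r"^(?P<host>\S+) \[(?P<ts>[^\]]+)\] \[(?P<level>\w+)\] "
    r"\[(?P<ns>[^\]]+)\] (?P<msg>.*)$"
)

# A few lines copied from the slides, plus one line that does not match.
SAMPLE = [
    "web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!",
    "web0001 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!",
    "web0201 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!",
    "web0054 [Fri Mar 04 16:27:48 2011] [info] [login] "
    "User login failed. Reason: wrong password for ...",
    "not a log line",
]


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def count_levels(lines):
    """Count parsed lines per severity level, skipping lines that do not match."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None:
            counts[record["level"]] += 1
    return counts


def to_metrics(counts, prefix="webs.errorLog"):
    """Map level counts to dotted metric names (hypothetical naming scheme)."""
    return {f"{prefix}.{level}": n for level, n in sorted(counts.items())}


if __name__ == "__main__":
    for name, value in to_metrics(count_levels(SAMPLE)).items():
        print(name, value)
```

Run from cron against the lines written since the last run, a counter like this yields one data point per severity per interval, which is the shape of data a time-series graph needs.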
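
Slides 28 and 29 call `StatsD::increment` and `StatsD::timing` from PHP and then graph the 90th percentile, average and lower bound of the timings. The sketch below shows, in Python, the plain-text packets a StatsD client sends over UDP (`name:value|c` for counters, `name:value|ms` for timers) and one way to summarize a batch of timings. The helper names, the default host and the nearest-rank percentile are assumptions for illustration; the StatsD daemon's own percentile calculation may differ.

```python
import socket


def counter_packet(name, delta=1):
    """Format a StatsD counter packet, e.g. 'logins.success:1|c'."""
    return f"{name}:{delta}|c"


def timing_packet(name, msec):
    """Format a StatsD timer packet, e.g. 'gearman.time:320|ms'."""
    return f"{name}:{msec}|ms"


def send_udp(packet, host="127.0.0.1", port=8125):
    """Fire-and-forget send of one packet; 8125 is StatsD's usual default port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet.encode("ascii"), (host, port))


def summarize_timings(samples, pct=90):
    """Return lower bound, mean and nearest-rank percentile of the samples.

    pct is an integer from 1 to 100. Nearest rank is an assumption here,
    not necessarily the method the StatsD daemon uses.
    """
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("no samples to summarize")
    rank = (len(ordered) * pct + 99) // 100  # integer ceiling of n * pct / 100
    return {
        "lower": ordered[0],
        "mean": sum(ordered) / len(ordered),
        f"upper_{pct}": ordered[rank - 1],
    }


if __name__ == "__main__":
    print(counter_packet("logins.success"))
    print(timing_packet("gearman.time", 320))
    print(summarize_timings([120, 95, 310, 101, 99]))
```

Because the packets travel over UDP, the application does not wait for the metrics server, which keeps instrumentation cheap enough to sprinkle through request-handling code.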
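
Slides 30 and 31 record an ad hoc event by piping one `name value timestamp` line to Graphite's plaintext listener on port 2003. A Python sketch of the same idea follows; `plaintext_line` and `send_lines` are hypothetical helper names, and the host is left as a parameter because the hostname on the slide is internal to Etsy.

```python
import socket
import time


def plaintext_line(name, value, timestamp=None):
    """Format one metric as 'name value timestamp' followed by a newline."""
    if timestamp is None:
        timestamp = time.time()  # default to now, like `date +%s` on the slide
    return f"{name} {value} {int(timestamp)}\n"


def send_lines(lines, host, port=2003):
    """Send already formatted lines over one TCP connection to the given host."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


if __name__ == "__main__":
    # Equivalent in spirit to the echo | nc one-liner on slide 31.
    print(plaintext_line("events.deploy.site", 1), end="")
```

A deploy script could call `send_lines([plaintext_line("events.deploy.site", 1)], host)` as its last step, so that every deploy leaves a mark that graphs can overlay later.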
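
Slides 37 and 38 contrast hand-written render URLs with a small wrapper class that builds them. The Python sketch below shows what such a wrapper might do under the hood; it is not Etsy's `Graphite` class. The query parameters (`from`, `width`, `height`, `title`, `hideLegend`, `yMin`, `target`) and `drawAsInfinite` come from the slide, while `render_url`, `dashboard_html` and the `graphite.example.com` host are hypothetical.

```python
from html import escape
from urllib.parse import urlencode


def render_url(base, title, metric, events=(), width=280, height=220,
               hours=1, hide_legend=False):
    """Build a render URL plotting one metric plus vertical event markers."""
    params = [
        ("from", f"-{hours}hours"),
        ("width", width),
        ("height", height),
        ("title", title),
    ]
    if hide_legend:
        params.append(("hideLegend", 1))
    params.append(("yMin", 0))
    params.append(("target", metric))
    # Each event series is drawn as a vertical line, as on slide 32.
    params.extend(("target", f"drawAsInfinite({name})") for name in events)
    return f"{base}/render?{urlencode(params)}"


def dashboard_html(base, title, metric, events=(), width=280, height=220):
    """Return a thumbnail image that links to an 800x600 version of the graph."""
    full = render_url(base, title, metric, events, width=800, height=600)
    thumb = render_url(base, title, metric, events, width=width,
                       height=height, hide_legend=True)
    return f'<a href="{escape(full)}"><img src="{escape(thumb)}"></a>'


if __name__ == "__main__":
    print(dashboard_html(
        "http://graphite.example.com",  # hypothetical host
        "File or Script Not Found",
        "webs.errorLog.notExist",
        events=["deploys.web.production"],
    ))
```

Keeping the deploy overlays inside the helper means every dashboard gets the trend-plus-events view without each engineer retyping the `drawAsInfinite` targets.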
