Metrics-Driven Engineering at Etsy

Published in: Technology

Transcript of "Metrics-Driven Engineering at Etsy"

  1. Metrics-driven Engineering at Etsy. Mike Brittain, mike@etsy.com, @mikebrittain
  2. Logs, Graphs, Trends, and Correlations
  3. Making Decisions
  4. How many visitors are using this thing?
  5. Can we deploy that to 100% of our visitors?
  6. Did we make it faster?
  7. Did I just break something?
  8. Q. Who makes the graphs? A. Well, the Ops team manages the network, racks the servers, installed the monitoring tools, wears the pagers, blah, blah, blah...
  9. (but...) Engineers build the application.
  10. Dev + Ops
  11. Access
  12. Yes / No
  13. "Engineers are too busy meeting our product deadlines."
  14. Here's the big secret...
  15. Cacti (network, SNMP); Ganglia (machines); Graphite (application); Splunk (log analysis, nightly reports); Nagios (alerting)
  16. Logging
  17. Logger::log_error("User login failed. Reason: $msg for $username", "login");
  18. web0054 [Fri Mar 04 16:27:48 2011] [info] [login] User login failed. Reason: wrong password for ...
  19. web0054 [Fri Mar 04 16:27:48 2011] [info] [login] User login failed. Reason: wrong password for ...
  20. web0054 [Fri Mar 04 16:27:48 2011] [info] [login] User login failed. Reason: wrong password for ...
  21. web0054 [Fri Mar 04 16:27:48 2011] [info] [login] User login failed. Reason: wrong password for ...
  22. web0054 [Fri Mar 04 16:27:48 2011] [info] [login] User login failed. Reason: wrong password for ...
  23. Logster
  24. Forked from ganglia-logtailer... (-) Daemon mode (only cron mode); (+) Support for Graphite; (+) Simplified parsing scripts
  25. web0001 [04:28:54 2011] [error] [client 10.101.x.x] Help me, Rhonda.
      web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
      web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!
      web0001 [04:28:54 2011] [error] [client 10.101.x.x] Heeeeeeellllllllllllllppppp!
      web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
      web0001 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
      web0201 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
      web0034 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
      web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
      web1101 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
      web0201 [04:28:54 2011] [error] [client 10.101.x.x] You've been eaten by a grue.
      web0055 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
      web0002 [04:28:54 2011] [warning] [client 10.101.x.x] Sky is falling.
      web0089 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
      web0020 [04:28:54 2011] [error] [client 10.101.x.x] Sky is falling.
      web1101 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
      web0055 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
      web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
      web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
      web0034 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
      web0087 [04:28:54 2011] [fatal] [client 10.101.x.x] Sky is falling.
      web0002 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
      web0201 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
      web0077 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
      web0355 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
      web0052 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
      web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
      web0003 [04:28:54 2011] [error] [client 10.101.x.x] You've been eaten by a grue.
      web0066 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
  26. Fatals, Errors, Warnings
  27. StatsD
  28. StatsD::increment("logins.success"); StatsD::timing("gearman.time", $msec);
  29. 90th pct, average, lower. StatsD::timing("gearman.time", $msec);
  30. Ad hoc: name value timestamp\n
  31. echo "events.deploy.site 1 `date +%s`" | nc graphite.etsycorp.com 2003
  32. Trends + Events: target=drawAsInfinite(events.deploy.site)
  33. What Happened?
  34. 16,000 metrics in Graphite (plus 32,000 metrics in Ganglia)
  35. Dashboards
  36. Mix & Match Dashboards
  37. Hard: <a href="http://graphite.etsycorp.com/render?from=-1hours&width=800&height=600&title=File+or+Script+Not+Found&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600"> <img src="http://graphite.etsycorp.com/render?from=-1hours&width=280&height=220&title=File+or+Script+Not+Found&hideLegend=1&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600"></a>
  38. Easy: $g = new Graphite($time); $g->setTitle('File Not Found'); $g->addMetric('webs.errorLog.notExist', '#00cc00'); $g->showDeploys(true); echo $g->getDashboardHTML(280, 220);
  39. 20 dashboards by 25 engineers
  40. Application health correlated with events
  41. High-level visibility
  42. Low MTTD
  43. Validation
  44. Confidence
  45. codeascraft.etsy.com, github.com/etsy/statsd, github.com/etsy/logster, bitbucket.org/maplebed/ganglia-logtailer
  46. Q&A. Does this sound like fun? Get in touch with us: chad@etsy.com, kellan@etsy.com, kastner@etsy.com, mike@etsy.com
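
Slides 25 and 26 go from raw error-log lines to three graphed series (fatals, errors, warnings). The sketch below shows one way that counting step could look. It is not Logster's actual parser interface; the function name `count_levels` and the regular expression are assumptions made for illustration, based only on the line format shown on slide 25.

```python
import re
from collections import Counter

# Assumed pattern: the severity appears in square brackets, as on slide 25.
LEVEL_RE = re.compile(r"\[(fatal|error|warning)\]")


def count_levels(lines):
    """Count fatal, error, and warning lines in an iterable of log lines."""
    counts = Counter({"fatal": 0, "error": 0, "warning": 0})
    for line in lines:
        match = LEVEL_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "web0001 [04:28:54 2011] [error] [client 10.101.x.x] Help me, Rhonda.",
        "web0001 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!",
        "web0201 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!",
    ]
    for level, count in sorted(count_levels(sample).items()):
        print(level, count)
```

Run from cron against the lines added since the previous run, per-level totals like these are the kind of values that would be sent on to Graphite.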
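
Slide 28 shows the PHP client calls for StatsD. As a rough sketch of what such calls put on the wire, the Python below builds the StatsD datagrams for a counter (`bucket:value|c`) and a timer (`bucket:value|ms`) and sends them over UDP. The helper names and the default host and port (`127.0.0.1`, `8125`) are assumptions for illustration, not details taken from the slides.

```python
import socket


def counter_packet(bucket, count=1):
    """Build a StatsD counter datagram, e.g. b'logins.success:1|c'."""
    return f"{bucket}:{count}|c".encode("ascii")


def timing_packet(bucket, msec):
    """Build a StatsD timer datagram, e.g. b'gearman.time:320|ms'."""
    return f"{bucket}:{msec}|ms".encode("ascii")


def send_udp(packet, host="127.0.0.1", port=8125):
    """Fire-and-forget: send one datagram without waiting for a reply."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(packet, (host, port))
    finally:
        sock.close()


if __name__ == "__main__":
    print(counter_packet("logins.success"))
    print(timing_packet("gearman.time", 320))
```

Because UDP needs no connection and no acknowledgement, instrumenting a code path this way adds very little overhead to the application.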
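
Slides 30 and 31 describe the ad hoc route into Graphite: one line of text, `name value timestamp`, terminated by a newline and written to port 2003. A minimal Python sketch of the same idea follows; the function names are hypothetical, and the host must be supplied by the caller.

```python
import socket
import time


def graphite_line(name, value, timestamp=None):
    """Format one metric as 'name value timestamp' plus a trailing newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{name} {value} {int(timestamp)}\n"


def send_to_graphite(line, host, port=2003, timeout=2.0):
    """Open a TCP connection to the plaintext listener and write one line."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line.encode("ascii"))


if __name__ == "__main__":
    # Prints the line only; call send_to_graphite() with a real host to send it.
    print(graphite_line("events.deploy.site", 1), end="")
```

Formatting the line and passing it to `send_to_graphite` does the same job as the `echo ... | nc` one-liner on slide 31.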
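
Slides 37 and 38 contrast a hand-written Graphite render URL with a small helper class. The helper's source is not shown in the deck, so the following is only a guess at the underlying idea, written in Python: assemble the query parameters seen on slide 37 and let the standard library handle the escaping. The function name `render_url` and the host `graphite.example.com` are placeholders.

```python
from urllib.parse import urlencode


def render_url(base, title, metric, deploy_events=(), width=280, height=220, hours=1):
    """Build a Graphite render URL for one metric overlaid with deploy events."""
    params = [
        ("from", f"-{hours}hours"),
        ("width", width),
        ("height", height),
        ("title", title),
        ("yMin", 0),
        ("target", metric),
    ]
    # Each deploy event becomes a vertical line via drawAsInfinite(), as on slide 32.
    for event in deploy_events:
        params.append(("target", f"drawAsInfinite({event})"))
    return f"{base}/render?{urlencode(params)}"


if __name__ == "__main__":
    print(render_url(
        "http://graphite.example.com",
        "File or Script Not Found",
        "webs.errorLog.notExist",
        deploy_events=["deploys.web.production"],
    ))
```

Keeping the parameters in a list of pairs rather than a dict allows `target` to repeat, which is how the URL on slide 37 layers several deploy markers on one graph.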