Metrics driven engineering (velocity 2011)

  1. METRICS-DRIVEN ENGINEERING at Etsy. Kellan Elliott-McCrea, VP of Eng. kellan@etsy.com @kellan. Tuesday, June 5, 12
  2. (no text)
  3. (no text)
  4. What is Etsy?
  5. 8.5+ million items in the marketplace
  6. 400,000+ active
  7. $300+ million in sales in 2010, ~$41 million/month
  8. > $1000 / minute
  9. > 1 billion page views / month
  10. business in over 150 countries
  11. deploy the site every ~20 minutes
  12. engineering team grew ~4x in 2010
  13. Metrics?
  14. Logs, Graphs, Trends, and Correlations
  15. Metrics Driven?
  16. Making Decisions
  17. How many visitors are using this thing?
  18. Can we deploy that to 100% of our visitors?
  19. Did we make it faster?
  20. Did I just break something?
  21. Q. WHO MAKES THESE GRAPHS? A. Well, the Ops team manages the racks, the network, the servers, installed monitoring tools, wears the pagers, blah, blah, blah...
  22. but... Engineers build the application.
  23. Dev + Ops
  24. ACCESS
  25. Yes! No.
  26. “Engineers are too busy!”
  27. Here’s the BIG SECRET...
  28. ... MAKE IT EASY!
  29. Simple, open source tools
  30. Cacti (network, SNMP); Ganglia (machines); Graphite (application); Splunk (log analysis, nightly reports); Nagios (alerting)
  31. Ganglia: ★ cluster oriented ★ huge community contributed recipes ★ 2.0 released today (including several Flickr and Etsy patches!) ★ gmetad makes it easy to track custom metrics
  32. (no text)
  33. Graphite: ★ super flexible collection and display ★ per metrics buckets ★ single instance ★ super easy to write and use custom display functions
  34. Logging
  35. Logger::log_error("User login failed. Reason: $msg for $username", "login");
  36. web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [14531658] User login failed. Reason: wrong password for ...
  37. web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [14531658] User login failed. Reason: wrong password for ...
  38. web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [14531658] User login failed. Reason: wrong password for ...
  39. web0054 [Fri Mar 04 16:27:48 2011] [info] [login] [14531658] User login failed. Reason: wrong password for ...
  40. web0054 [Fri Mar 04 16:27:48 2011] [info] [login] [14531658] User login failed. Reason: wrong password for ...
  41. web0054 [Fri Mar 04 16:27:48 2011] [info] [login] [14531658] User login failed. Reason: wrong password for ...
  42. Counting and Timing: http://code.flickr.com/blog/2008/10/27/counting-timing/
  43. Logster
  44. Logster: https://github.com/etsy/logster
  45. Forked from ganglia-logtailer: - Daemon mode (only cron mode); + Support for Graphite; + Simplified parsing scripts
  46. web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
      web0001 [04:28:54 2011] [error] [client 10.101.x.x] Help me, Rhonda.
      web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
      web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!
      web0001 [04:28:54 2011] [error] [client 10.101.x.x] Heeeeeeellllllllllllllppppp!
      web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
      web0001 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
      web0201 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
      web0034 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
      web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
      web1101 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
      web0201 [04:28:54 2011] [error] [client 10.101.x.x] Youve been eaten by a grue.
      web0055 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
      web0002 [04:28:54 2011] [warning] [client 10.101.x.x] Sky is falling.
      web0089 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
      web0020 [04:28:54 2011] [error] [client 10.101.x.x] Sky is falling.
      web1101 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
      web0055 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
      web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
      web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
      web0034 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
      web0087 [04:28:54 2011] [fatal] [client 10.101.x.x] Sky is falling.
      web0002 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
      web0201 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
      web0077 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
      web0355 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
      web0052 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
      web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
      web0003 [04:28:54 2011] [error] [client 10.101.x.x] Youve been eaten by a grue.
      web0066 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
      web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Sky is falling
  47. Fatals, Errors, Warnings
  48. ★ runs out of cron ★ maintains a cursor into log files ★ supports ganglia and graphite ★ custom parsers much easier to write than gmetad
  49. Apache access logs
  50. LogFormat "%h %l %u %t \"%r\" %>s %b" common
  51. LogFormat "%{X-Forwarded-For}i %{True-Client-IP}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{etsy_shop_id}n %{etsy_uaid}n %V %{etsy_ab_selections}n %{etsy_request_uuid}n %{etsy_api_consumer_key}n %{etsy_api_method_name}n %{php_memory_usage_bytes}n %{php_time_microsec}n %D" combined
  52. %{etsy_ab_selections}n
  53. %{etsy_uaid}n
  54. Graphs
  55. “If Engineering at Etsy has a religion, it’s the Church of Graphs. If it moves, we track it.” - Erik Kastner, http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/
  56. (no text)
  57. StatsD
  58. StatsD: https://github.com/etsy/statsd/
  59. StatsD::increment("logins.success"); StatsD::timing("gearman.time", $msec);
  60. 90th pct / average / lower: StatsD::timing("gearman.time", $msec);
  61. Ad hoc: name value timestamp
  62. echo "events.deploy.site 1 `date +%s`" | nc graphite.etsycorp.com 2003
  63. Correlations
  64. echo "events.deploy.site 1 `date +%s`" | nc graphite.etsycorp.com 2003
  65. Trends + Events: target=drawAsInfinite(events.deploy.site)
  66. What Happened?
  67. Holt-Winters
  68. "Forecasting Sales by Exponentially Weighted Moving Averages". Peter
  69. "Aberrant Behavior Detection in Time Series for Network Monitoring".
  70. "Holt-Winters Forecasting Applied to Poisson Processes in Real-Time".
  71. holtWintersConfidence(Upper|Lower)
  72. holtWintersAberration
  73. business metrics with confidence bands == alertable business metrics
  74. 16,000 metrics in GRAPHITE (plus 32,000 metrics in GANGLIA)
  75. 16,000 metrics in GRAPHITE (plus 32,000 metrics in GANGLIA)
  76. Dashboards
  77. Dashboards
  78. Dashboards
  79. Hard: <a href="http://graphite.etsycorp.com/render?from=-1hours&width=800&height=600&title=File+or+Script+Not+Found&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600"> <img src="http://graphite.etsycorp.com/render?from=-1hours&width=280&height=220&title=File+or+Script+Not+Found&hideLegend=1&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600"> </a>
  80. Easy! $g = new Graphite($time); $g->setTitle("File Not Found"); $g->addMetric("webs.errorLog.notExist", "#00cc00"); $g->showDeploys(true); echo $g->getDashboardHTML(280, 220);
  81. 48 dashboards by 32 engineers
  82. Application health
  83. High-level visibility
  84. Low MTTD
  85. Confidence
  86. Make metrics
  87. Make metrics
  88. Make metrics
  89. Not that much
  90. codeascraft.etsy.com, github.com/etsy/statsd, github.com/etsy/logster, bitbucket.org/maplebed/ganglia-logtailer
  91. Questions?
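
The StatsD calls on slides 59-60 and the ad hoc Graphite line on slides 61-62 both come down to very small text payloads. The following is a minimal Python sketch of those wire formats, assuming StatsD's customary UDP port 8125 and Graphite's plaintext listener on port 2003 (the port used on slide 62). The host names and helper function names are hypothetical and do not come from the talk, whose own client is the PHP StatsD class.

```python
import socket
import time


def statsd_counter(name, count=1):
    """Build a StatsD counter packet: '<name>:<count>|c'."""
    return f"{name}:{count}|c"


def statsd_timing(name, msec):
    """Build a StatsD timer packet: '<name>:<milliseconds>|ms'."""
    return f"{name}:{msec}|ms"


def graphite_line(name, value, timestamp=None):
    """Build one Graphite plaintext line: 'name value timestamp' plus newline."""
    if timestamp is None:
        timestamp = int(time.time())  # same role as `date +%s` on slide 62
    return f"{name} {value} {timestamp}\n"


def send_statsd(packet, host="statsd.example.com", port=8125):
    """Fire-and-forget UDP send; the host name is a placeholder assumption."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet.encode("ascii"), (host, port))


def send_graphite(line, host="graphite.example.com", port=2003):
    """TCP send, equivalent to piping the line into `nc host 2003`."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


if __name__ == "__main__":
    # Only print the payloads; nothing is sent over the network here.
    print(statsd_counter("logins.success"))   # logins.success:1|c
    print(statsd_timing("gearman.time", 320)) # gearman.time:320|ms
    print(graphite_line("events.deploy.site", 1, 1299274068), end="")
```

Keeping payload construction separate from the socket calls means the formats can be checked without a running StatsD or Graphite instance. UDP suits the StatsD side because a lost packet costs one sample rather than blocking the application.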