METRICS-DRIVEN ENGINEERING at Etsy
Kellan Elliott-McCrea, VP of Eng.
kellan@etsy.com @kellan




Tuesday, June 5, 12
What is Etsy?



8.5+ million items in the marketplace




400,000+ active




$300+ million in sales in 2010
~$41 million/month


> $1000 / minute



> 1 billion page views / month


business in over 150 countries


deploy the site every ~20 minutes


engineering team grew ~4x in 2010


Metrics?



Logs, Graphs, Trends, and Correlations


Metrics Driven?



Making Decisions



How many visitors are using this thing?


Can we deploy that to 100% of our visitors?


Did we make it faster?


Did I just break something?


Q. WHO MAKES THESE GRAPHS?
A. Well, the Ops team manages the racks, the network, the servers, installed monitoring tools, wears the pagers, blah, blah, blah...




but... Engineers build the application.


Dev + Ops


ACCESS


Yes!   No.




“Engineers are too busy!”


Here’s the BIG SECRET...


... MAKE IT EASY!



Simple, open source tools


Cacti (network, SNMP)
Ganglia (machines)
Graphite (application)
Splunk (log analysis, nightly reports)
Nagios (alerting)



Ganglia
★ cluster oriented
★ huge community contributed recipes
★ 2.0 released today (including several Flickr and Etsy patches!)
★ gmetad makes it easy to track custom metrics


Graphite
★ super flexible collection and display
★ per metrics buckets
★ single instance
★ super easy to write and use custom display functions



Logging


Logger::log_error("User login failed. Reason: $msg for $username", "login");




web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [14531658] User login failed. Reason: wrong password for ...




web0054 [Fri Mar 04 16:27:48 2011] [info] [login] [14531658] User login failed. Reason: wrong password for ...




Counting and Timing
http://code.flickr.com/blog/2008/10/27/counting-timing/




Logster
https://github.com/etsy/logster




Forked from ganglia-logtailer:
- Daemon mode (only cron mode)
+ Support for Graphite
+ Simplified parsing scripts




web0001        [04:28:54   2011]   [warning] [client 10.101.x.x] Gaaaaahhh!
       web0001        [04:28:54   2011]   [error] [client 10.101.x.x] Help me, Rhonda.
       web0001        [04:28:54   2011]   [error] [client 10.101.x.x] Oh noooooo!
       web0001        [04:28:54   2011]   [error] [client 10.101.x.x] Gaaaaahhh!
       web0001        [04:28:54   2011]   [error] [client 10.101.x.x] Heeeeeeellllllllllllllppppp!
       web0001        [04:28:54   2011]   [error] [client 10.101.x.x] Oh noooooo!
       web0001        [04:28:54   2011]   [fatal] [client 10.101.x.x] Gaaaaahhh!
       web0201        [04:28:54   2011]   [warning] [client 10.101.x.x] Gaaaaahhh!
       web0034        [04:28:54   2011]   [warning] [client 10.101.x.x] Oh nooooooooooo
       web0001        [04:28:54   2011]   [error] [client 10.101.x.x] Gaaaaahhh!!!
       web1101        [04:28:54   2011]   [error] [client 10.101.x.x] Gaaaaahhh!!!
       web0201        [04:28:54   2011]   [error] [client 10.101.x.x] You've been eaten by a grue.
       web0055        [04:28:54   2011]   [fatal] [client 10.101.x.x] Gaaaaahhh!!!
       web0002        [04:28:54   2011]   [warning] [client 10.101.x.x] Sky is falling.
       web0089        [04:28:54   2011]   [error] [client 10.101.x.x] Gaaaaahhh!!!
       web0020        [04:28:54   2011]   [error] [client 10.101.x.x] Sky is falling.
       web1101        [04:28:54   2011]   [fatal] [client 10.101.x.x] Gaaaaahhh!
       web0055        [04:28:54   2011]   [warning] [client 10.101.x.x] Gaaaaahhh!
       web0001        [04:28:54   2011]   [warning] [client 10.101.x.x] Oh nooooooooooo
       web0001        [04:28:54   2011]   [error] [client 10.101.x.x] Gaaaaahhh!!!
       web0034        [04:28:54   2011]   [error] [client 10.101.x.x] Gaaaaahhh!!!
       web0087        [04:28:54   2011]   [fatal] [client 10.101.x.x] Sky is falling.
       web0002        [04:28:54   2011]   [error] [client 10.101.x.x] Oh noooooo!
       web0201        [04:28:54   2011]   [fatal] [client 10.101.x.x] Gaaaaahhh!
       web0077        [04:28:54   2011]   [warning] [client 10.101.x.x] Gaaaaahhh!
       web0355        [04:28:54   2011]   [warning] [client 10.101.x.x] Oh nooooooooooo
       web0052        [04:28:54   2011]   [error] [client 10.101.x.x] Gaaaaahhh!!!
       web0001        [04:28:54   2011]   [error] [client 10.101.x.x] Gaaaaahhh!!!
       web0003        [04:28:54   2011]   [error] [client 10.101.x.x] You've been eaten by a grue.
       web0066        [04:28:54   2011]   [fatal] [client 10.101.x.x] Gaaaaahhh!!!
       web0001        [04:28:54   2011]   [warning] [client 10.101.x.x] Sky is falling
Fatals   Errors   Warnings




★ runs out of cron
★ maintains a cursor into log files
★ supports ganglia and graphite
★ custom parsers much easier to write than gmetad
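The two ideas in the bullets above, reading a log from a saved cursor and turning lines into counters, can be sketched in a few lines of Python. This is only an illustration under assumptions, not Logster's actual parser interface: the names `count_levels` and `read_from_cursor` are hypothetical, and the regex is guessed from the sample log lines shown earlier.

```python
import re
from collections import Counter

# Assumed line shape, taken from the sample log lines in this deck:
#   web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
LEVEL_RE = re.compile(r"^\S+\s+\[[^\]]*\]\s+\[(?P<level>\w+)\]")


def count_levels(lines):
    """Count log lines per severity level, skipping lines that do not match."""
    counts = Counter()
    for line in lines:
        match = LEVEL_RE.match(line.strip())
        if match:
            counts[match.group("level")] += 1
    return counts


def read_from_cursor(path, cursor):
    """Return (new_lines, new_cursor), reading only the bytes after `cursor`.

    Hypothetical helper: it ignores log rotation and partially written
    last lines, which a real tailer has to handle.
    """
    with open(path, "rb") as handle:
        handle.seek(cursor)
        data = handle.read()
    lines = data.decode("utf-8", errors="replace").splitlines()
    return lines, cursor + len(data)
```

Run from cron, each pass would read from the stored cursor, count, ship the counts to Ganglia or Graphite, and store the new cursor for the next pass.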




Apache access logs


LogFormat "%h %l %u %t \"%r\" %>s %b" common




LogFormat "%{X-Forwarded-For}i %{True-Client-IP}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{etsy_shop_id}n %{etsy_uaid}n %V %{etsy_ab_selections}n %{etsy_request_uuid}n %{etsy_api_consumer_key}n %{etsy_api_method_name}n %{php_memory_usage_bytes}n %{php_time_microsec}n %D" combined

%{etsy_ab_selections}n




%{etsy_uaid}n




Graphs


“If Engineering at Etsy has a religion, it’s the Church of Graphs. If it moves, we track it.” - Erik Kastner

http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/




StatsD
https://github.com/etsy/statsd/




StatsD::increment("logins.success");
StatsD::timing("gearman.time", $msec);
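A sketch of what those two calls put on the wire, written in Python rather than the PHP client on the slide. StatsD's line format is `bucket:value|type`, with `c` for counters and `ms` for timers, sent over UDP. The helper names, the default host, and port 8125 (StatsD's usual default) are assumptions for illustration, not Etsy's client code.

```python
import socket


def counter_packet(bucket, delta=1):
    """Format a StatsD counter update, e.g. 'logins.success:1|c'."""
    return f"{bucket}:{delta}|c"


def timing_packet(bucket, msec):
    """Format a StatsD timer sample in milliseconds, e.g. 'gearman.time:320|ms'."""
    return f"{bucket}:{msec}|ms"


def send(packet, host="127.0.0.1", port=8125):
    """Fire-and-forget one packet over UDP.

    Errors are swallowed on purpose: emitting a metric should never be
    able to break the request that emits it.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(packet.encode("ascii"), (host, port))
    except OSError:
        pass


# Rough equivalents of the two calls above:
send(counter_packet("logins.success"))
send(timing_packet("gearman.time", 320))
```

UDP is the design choice that keeps this cheap: the application does not wait for, or care about, an acknowledgement.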




90th pct, average, lower
StatsD::timing("gearman.time", $msec);
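To make the three graph lines concrete, here is a minimal Python sketch of summarising one flush interval of timer samples. It assumes a nearest-rank percentile; StatsD's own calculation may differ in detail, and the function and key names are illustrative.

```python
import math


def summarize_timings(samples_ms, pct=90):
    """Return the lowest sample, the mean, and the pct-th percentile
    (nearest-rank method) of a batch of timings in milliseconds."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return {
        "lower": ordered[0],
        "mean": sum(ordered) / len(ordered),
        f"upper_{pct}": ordered[rank - 1],
    }


# One slow outlier drags the mean up but leaves the 90th percentile alone.
print(summarize_timings([10, 20, 30, 40, 50, 60, 70, 80, 90, 1000]))
```

That is the reason to graph a percentile next to the average: a single pathological request moves the mean, while the 90th percentile shows what most requests experienced.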




Ad hoc
name value timestamp




echo "events.deploy.site 1 `date +%s`" | nc graphite.etsycorp.com 2003
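The same ad hoc write can be sketched in Python for places where shelling out to `nc` is awkward. The line format, `name value timestamp` followed by a newline, and port 2003 come from the slides; the function names and the timeout are assumptions.

```python
import socket
import time


def graphite_line(name, value, timestamp=None):
    """Format one sample as 'name value timestamp\\n' (Graphite plaintext format)."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{name} {value} {int(timestamp)}\n"


def send_to_graphite(line, host, port=2003, timeout=2.0):
    """Send one plaintext line over TCP, like `echo ... | nc host 2003`."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line.encode("ascii"))
```

A deploy script would then call something like `send_to_graphite(graphite_line("events.deploy.site", 1), "graphite.example.com")`, where the host name is a placeholder.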




Correlations



echo "events.deploy.site 1 `date +%s`" | nc graphite.etsycorp.com 2003




Trends + Events
target=drawAsInfinite(events.deploy.site)




What Happened?


Holt-Winters


"Forecasting Sales by Exponentially Weighted Moving Averages". Peter



"Aberrant Behavior Detection in Time Series for Network Monitoring".



"Holt-Winters Forecasting Applied to Poisson Processes in Real-Time".



holtWintersConfidence(Upper|Lower)




holtWintersAberration
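For intuition about what confidence bands and aberrations mean here, the following is a rough Python sketch of additive Holt-Winters forecasting with deviation-based bands, in the spirit of the aberrant-behavior paper cited above. It is not Graphite's implementation: the initialisation, the default smoothing parameters, the band width `delta`, and the function name are all assumptions.

```python
def holt_winters_aberrations(series, period, alpha=0.5, beta=0.5, gamma=0.5, delta=2.0):
    """Return (index, forecast, lower, upper, aberrant) for every point after
    the first period. A point is aberrant when it falls outside
    forecast +/- delta * smoothed absolute deviation for its seasonal slot."""
    if len(series) < 2 * period:
        raise ValueError("need at least two full periods to initialise")
    first, second = series[:period], series[period:2 * period]
    level = sum(first) / period
    trend = (sum(second) - sum(first)) / (period * period)
    seasonal = [y - level for y in first]
    deviation = [0.0] * period
    results = []
    for t in range(period, len(series)):
        slot = t % period
        y = series[t]
        forecast = level + trend + seasonal[slot]
        band = delta * deviation[slot]
        lower, upper = forecast - band, forecast + band
        # Only judge points once each slot has seen one deviation update.
        warmed_up = t >= 2 * period
        aberrant = warmed_up and (y < lower or y > upper)
        results.append((t, forecast, lower, upper, aberrant))
        # Standard additive Holt-Winters updates, plus the deviation update.
        previous_level = level
        level = alpha * (y - seasonal[slot]) + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
        seasonal[slot] = gamma * (y - level) + (1 - gamma) * seasonal[slot]
        deviation[slot] = gamma * abs(y - forecast) + (1 - gamma) * deviation[slot]
    return results


# A perfectly periodic series with one spike injected at index 13.
series = [10, 20, 30, 20] * 4
series[13] = 100
flagged = [t for t, _, _, _, aberrant in holt_winters_aberrations(series, period=4) if aberrant]
print(flagged[0])
```

On this toy series the bands have zero width because the data is perfectly periodic; real metrics are noisy, so the bands widen and only unusual points fall outside them. Points right after a spike can also be flagged, since the spike perturbs the model.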




business metrics with confidence bands == alertable business metrics


16,000 metrics in GRAPHITE
(plus 32,000 metrics in GANGLIA)




Dashboards





Hard
<a href="http://graphite.etsycorp.com/render?from=-1hours&width=800&height=600&title=File+or+Script+Not+Found&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600">
<img src="http://graphite.etsycorp.com/render?from=-1hours&width=280&height=220&title=File+or+Script+Not+Found&hideLegend=1&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600">
</a>




Easy!
$g = new Graphite($time);
$g->setTitle('File Not Found');
$g->addMetric('webs.errorLog.notExist', '#00cc00');
$g->showDeploys(true);
echo $g->getDashboardHTML(280, 220);
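The internals of the `Graphite` helper class are not shown in the deck. As a guess at the kind of work such a wrapper does, here is a small Python sketch that builds the render URLs and the thumbnail-plus-link HTML. The function names, the fixed `-1hours` window, and the parameter choices are assumptions modelled on the hand-written HTML in the "Hard" slide; `graphite.example.com` is a placeholder host.

```python
from html import escape
from urllib.parse import urlencode


def render_url(base, metrics, deploy_events=(), params=None):
    """Build a Graphite render URL: one target per metric, plus one
    drawAsInfinite(...) target per deploy event series."""
    query = list((params or {}).items())
    query += [("target", metric) for metric in metrics]
    query += [("target", f"drawAsInfinite({event})") for event in deploy_events]
    return f"{base}/render?{urlencode(query)}"


def dashboard_html(base, title, metrics, deploy_events=(), width=280, height=220):
    """Return a thumbnail graph that links to a larger version of itself."""
    common = {"from": "-1hours", "title": title, "yMin": 0}
    thumb = render_url(base, metrics, deploy_events,
                       {**common, "width": width, "height": height, "hideLegend": 1})
    full = render_url(base, metrics, deploy_events,
                      {**common, "width": 800, "height": 600})
    return f'<a href="{escape(full)}"><img src="{escape(thumb)}"></a>'


print(dashboard_html(
    "http://graphite.example.com",
    "File Not Found",
    ["webs.errorLog.notExist"],
    deploy_events=["deploys.web.production", "deploys.config.production"],
))
```

The point of the wrapper is the one on the slide: once URL encoding and deploy overlays are handled in one place, adding a graph to a dashboard is a few lines rather than a wall of query string.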




48 dashboards by 32 engineers


Application health


High-level visibility


Low MTTD


Confidence


Make metrics




Not that much


codeascraft.etsy.com
github.com/etsy/statsd
github.com/etsy/logster
bitbucket.org/maplebed/ganglia-logtailer




Questions?





Metrics driven engineering (velocity 2011)
