On Failure and Resilience


             Mike Brittain
             DIRECTOR OF ENGINEERING, ETSY
             @mikebrittain
             Presented at 37signals on Aug 21, 2012
“Software Infrastructure”
“Framework” code, caching, ORM, file storage tier,
developer tools, CI/deployment, site performance,
             front-end architecture.
Managing failures and building
resilience into systems, applications,
         process, and people.
$61 M in goods sold in the marketplace
2.9 M items sold
1.2 B page views




                                                           Photo: http://www.etsy.com/shop/TheOldTimeJunkShop

  http://www.etsy.com/blog/news/2012/etsy-statistics-june-weather-report/
Architecture
Linux, Apache, MySQL, PHP, Postgres,
Solr, Gearman, Memcache, Chef,
Hadoop, EC2/S3/EMR

                      30+ Logical data stores
                 (23 shards + more functionally partitioned)


   Search and storage tiers as “services”
150 Engineers + Designers + Product
                                (this was 20 in Feb 2010)




credit: martin_heigan (flickr)
Buyers, sellers, support,
developer api, i18n,
core infrastructure, storage,
payments, security, fraud
detection, big data and BI,
email delivery, corp IT,
operations, developer tools,
continuous integration and
testing, site performance,
search, advertising, seller
economics, mobile web,
iOS.
Zero Release Managers
There Will Be Fail



                     Credit: wilkee.deviantart.com
We cannot comprehend all of the ways in
which the individual parts of a complex
system will interact. We cannot know all
of the states and scenarios.

We cannot prevent failures.
Yet, we can mitigate them.

Redundant system architectures.
Small, well-understood changes to production.
Control application using config flags.
Gratuitous metrics collection.
Resilient user interfaces.
GameDay exercises.
“Uptime” is not binary.
Functionally Partitioned
[Diagram: separate databases for Convos, Async Tasks, Ads, and Auth]
Master-Master Replication
[Diagram: each of the Convos, Async Tasks, Ads, and Auth databases runs on a pair of servers]
Sharded Tables
[Diagram: shard1 through shard4, each on a pair of servers]
~4% of listing data is stored on shard3.
Outage is limited to ~4% of data set.
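A minimal sketch of why a single shard outage stays small. It assumes listings are placed by listing ID modulo the shard count; the slides do not say how rows are actually assigned to shards, so shard_for() and readable_listings() are hypothetical names. With the 23 shards from the Architecture slide, one shard holds about 1/23 of the rows, which is in line with the ~4% figure above.

```php
<?php
// Assumption: placement is listing_id modulo the shard count.
// The slides do not describe the real placement scheme.
const SHARD_COUNT = 23; // "23 shards" from the Architecture slide

// Hypothetical helper: maps a listing to a shard numbered 1..23.
function shard_for($listing_id) {
    return ($listing_id % SHARD_COUNT) + 1;
}

/**
 * Returns only the listings whose shard is not marked down,
 * so one failed shard hides only its own share of the data.
 */
function readable_listings(array $listing_ids, array $down_shards) {
    $readable = array();
    foreach ($listing_ids as $id) {
        if (!in_array(shard_for($id), $down_shards, true)) {
            $readable[] = $id;
        }
    }
    return $readable;
}

$ids      = range(1, 2300);
$readable = readable_listings($ids, array(3)); // shard3 is down
$lost     = count($ids) - count($readable);
printf("%.1f%% of listings unavailable\n", 100 * $lost / count($ids));
```

Every other listing stays readable, so the page can still render most results while shard3 is repaired.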
“Uptime” is not binary.
Uptime of the application is the
responsibility of our Operations team.
Uptime of the application is the
responsibility of our Operations, Engineering,
Product, and Design teams.

If you are committing code, you are
operating the site.
Branching in Code
“All existing revision control systems
 were built by people who build
 installed software”


                                  Always Ship Trunk
                                      Paul Hammond
                                   Velocity Conf 2010
Config Flags
Enable and disable features quickly.
Features for staff or for beta groups.
Percentage ramp-up of users or requests.
A/B “experiments.”
$cfg['new_search']   =   array('enabled'   =>   'on');
$cfg['sign_in']      =   array('enabled'   =>   'on');
$cfg['checkout']     =   array('enabled'   =>   'on');
$cfg['homepage']     =   array('enabled'   =>   'on');

// Meanwhile...

// Test the flag's value: a non-empty array is always truthy in PHP,
// so checking $cfg['new_search'] alone would never switch the feature off.
if ($cfg['new_search']['enabled'] === 'on') {
  # New hotness
  $results = do_solr();
} else {
  # old and boring
  $results = do_grep();
}
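The "percentage ramp-up of users or requests" item from the Config Flags slide could be sketched as follows. This is an illustration under assumptions, not the deck's implementation: the 'rampup' state, the 'percent' key, and feature_enabled() are hypothetical and do not appear on the slides.

```php
<?php
// Hypothetical flag table: 'rampup' and 'percent' are assumed, not from the slides.
$cfg = array(
    'new_search' => array('enabled' => 'rampup', 'percent' => 25),
    'checkout'   => array('enabled' => 'on'),
    'sign_in'    => array('enabled' => 'off'),
);

/**
 * Returns true if $feature is enabled for $user_id.
 * Hashing feature + user puts each user in a stable bucket (0-99),
 * so the same user sees the same behavior on every request.
 */
function feature_enabled(array $cfg, $feature, $user_id) {
    if (!isset($cfg[$feature]['enabled'])) {
        return false; // Unknown flags default to off.
    }
    $state = $cfg[$feature]['enabled'];
    if ($state === 'on') {
        return true;
    }
    if ($state === 'rampup') {
        $percent = isset($cfg[$feature]['percent']) ? (int) $cfg[$feature]['percent'] : 0;
        // First 6 hex digits of md5 fit comfortably in an int on any platform.
        $bucket = hexdec(substr(md5($feature . ':' . $user_id), 0, 6)) % 100;
        return $bucket < $percent;
    }
    return false; // 'off' or any unrecognized state.
}

$results = feature_enabled($cfg, 'new_search', 42) ? 'solr' : 'grep';
```

Raising 'percent' from 25 to 100 only adds users; nobody who already had the feature loses it, because each user's bucket never changes.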
But...
“Doesn’t that mean you have conditionals
 all over your code?”
                  Yes.
                 “Does anyone ever clean those up?”
               Sometimes.
   “That sounds like it sucks.”
                     Really?
   “Wait a minute... all of the counter arguments are
    in Comic Sans. WTF?!?”
                         Oh, you noticed? ;)
DB Server Maintenance, June 16, 2012
 00:00   Site down for maintenance
+01:47   Site up, disabled login and registration
+06:40   Site up, some seller tools disabled
+07:41   All features restored
http://etsystatus.com/2012/06/16/planned-outage-june-16th-7am-gmt/
“Uptime” is not binary.
Features are launched by flipping a
   config flag, not by deploying
    hundreds of lines of code.
“If Engineering at Etsy has a religion,
           it’s the Church of Graphs.”
                     Ian Malpass, Code as Craft
                                http://etsy.me/ePkoZB
THIS IS HOW
                      YOU RUN
               A COMPLEX
                        SYSTEM
http://www.flickr.com/photos/flyforfun/2694158656/
Config flags
                        Operator                       Metrics




http://www.flickr.com/photos/flyforfun/2694158656/
Oh, you want to talk about how we collect
metrics and make graphs?


                http://www.slideshare.net/mikebrittain/metricsdriven-engineering
Resilient User Interfaces
Interfaces and user experiences
that adapt to technical and
architectural failure.
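// Assumed wrapper: the slide shows only the constructor. The parent call and
// the mysqli_connect_* checks suggest the class extends mysqli.
class DBConnection_Exception extends Exception {}

class DBConnection extends mysqli {
    /**
     * Creates a database connection.
     */
    public function __construct($host, $user, $pass, $db) {
        parent::__construct($host, $user, $pass, $db);

        // Turn a failed connection into an exception the caller can catch.
        if (mysqli_connect_error()) {
            throw new DBConnection_Exception(
                sprintf("Error: %s, %s",
                    mysqli_connect_errno(),
                    mysqli_connect_error()));
        }
    }
}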
try {
    $conn = new DBConnection('viewsdb.host', 'db_read_user',
                             'ssssshh!', 'views_db');
} catch (DBConnection_Exception $e) {

    // TODO: Someone should figure out what to do if
    // we can't connect to the views db.
    throw $e;
}
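One way to resolve that TODO, sketched under assumptions: instead of rethrowing, the caller catches the failure and hands the page a degraded panel. connect_views_db(), load_click_metrics(), fetch_views(), and the message wording are hypothetical stand-ins so the example runs on its own; only the DBConnection_Exception name comes from the slides.

```php
<?php
// Stand-ins so the sketch is self-contained; the bodies are hypothetical.
class DBConnection_Exception extends Exception {}

function connect_views_db() {
    // Simulate the views database being unreachable.
    throw new DBConnection_Exception('Error: 2002, Connection refused');
}

/**
 * Returns data for a click-metrics panel. On a connection failure the
 * rest of the page still renders; the panel carries a message instead
 * of numbers.
 */
function load_click_metrics() {
    try {
        $conn = connect_views_db();
        // fetch_views() is a hypothetical method, never reached in this sketch.
        return array('available' => true, 'views' => $conn->fetch_views());
    } catch (DBConnection_Exception $e) {
        // Record the failure for operators, then degrade instead of failing.
        error_log('views db unavailable: ' . $e->getMessage());
        return array(
            'available' => false,
            'message'   => 'Click metrics are unavailable right now. Your data is safe.',
        );
    }
}

$panel = load_click_metrics();
echo $panel['available'] ? $panel['views'] : $panel['message'], "\n";
```

The failure is confined to one panel: the user sees a specific, reassuring message rather than the generic catch-all error page.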
Site navigation
           Logo

          Cute Picture

Generic, catch-all
error messaging....
Every back-end service is an
  opportunity for failure.
[Figure: page layout with numbered callouts, 1 through 14]
Critical Path
#srsly?
< 400 ms
Non-blocking Ajax
Google Calendar




   Google Docs
GMail
“Oops, we aren’t able to
access click metrics right
now, do not worry — your
      data is safe.”
Product design doesn’t stop
    at 100% availability.
Dev   Ops   Product
[Figure: page layout with numbered callouts, 1 through 14]
Operability Reviews
“What could possibly go wrong?”

What is changing about the architecture?
What kind of data access patterns are we using?
How much traffic, how many queries?
What metrics are we collecting?
Are there automated alerts?
How do we know the thresholds are right?
How do we turn it off? ...and what happens when we do?
“GameDay” Exercises
Pedro
Homepage (95th perc.)

Surprise!!! Turning off multi-language support improves
our page generation times by up to 25%.
(Blameless) Post-Mortems
How could this have gone better?

How quickly did we find out that something was wrong?
Did we communicate well to our visitors and each other?
Why did we have confidence that what we were doing was OK?
Did we have the right tools, did we use them properly?
Did we collect metrics, and could we find them?
Where did we make the wrong decisions?

What steps do we take to reduce the chance of this
happening again in the future?
“... an engineer who thinks they’re going to be
reprimanded are disincentivized to give the details
necessary to get an understanding of the mechanism,
pathology, and operation of the failure.

This lack of understanding of how the accident occurred
all but guarantees that it will repeat. If not with the
original engineer, another one in the future.”

                                                                    John Allspaw
                                                       VP, Technical Operations, Etsy


         http://codeascraft.etsy.com/2012/05/22/blameless-postmortems/
We should try to learn not only what went
wrong, but also what went right.
DB Server Maintenance, June 16, 2012
 00:00   Site down for maintenance
+01:47   Site up, disabled login and registration
+06:40   Site up, some seller tools disabled
+07:41   All features restored
http://etsystatus.com/2012/06/16/planned-outage-june-16th-7am-gmt/
Operational Mindset
Dev   Ops   Product
Business Priorities
Introspection
page views for error template
...or, how are we screwing our users?
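Counting views of the error template could look like the sketch below. It assumes a StatsD-style counter sent over UDP; the slides do not name the collection tool, and the bucket name, host, and port are placeholders. The point is that the counter is fire-and-forget, so a metrics failure can never break the page.

```php
<?php
/**
 * Builds a StatsD counter line, e.g. "errors.template.views:1|c".
 */
function statsd_counter_line($bucket, $delta = 1) {
    return sprintf('%s:%d|c', $bucket, $delta);
}

/**
 * Sends one counter increment over UDP and swallows every failure.
 * Host and port are assumptions (8125 is the usual StatsD default).
 */
function statsd_increment($bucket, $host = '127.0.0.1', $port = 8125) {
    $fp = @fsockopen('udp://' . $host, $port, $errno, $errstr, 0.1);
    if ($fp === false) {
        return false;
    }
    $ok = @fwrite($fp, statsd_counter_line($bucket));
    fclose($fp);
    return $ok !== false;
}

// Called wherever the error template is rendered (hypothetical bucket name).
statsd_increment('errors.template.views');
```

Graphing this counter next to deploys and config-flag changes shows how often users actually hit the error page.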
Risk mitigation in a complex system

Redundant system architectures.
Small, well-understood changes to production.
Control application using config flags.
Gratuitous metrics collection.
Resilient user interfaces.
GameDay exercises.
Thank you.
  Mike Brittain
 mike@etsy.com
 @mikebrittain
PHOTO CREDITS
Flickr: roboppy
http://www.flickr.com/photos/51035735481@N01/163374138/
Flickr: jamesjyu
http://www.flickr.com/photos/32593095@N00/3465022/
Flickr: circulating
http://www.flickr.com/photos/26835318@N00/2318226026/
