On Failure and Resilience

Mike Brittain
DIRECTOR OF ENGINEERING, ETSY
@mikebrittain
Presented at 37signals on Aug 21, 2012

“Software Infrastructure”
“Framework” code, caching, ORM, file storage tier,
developer tools, CI/deployment, site performance,
front-end architecture.

Managing failures and building
resilience into systems, applications,
process, and people.

$61 M in goods sold in the marketplace
2.9 M items sold
1.2 B page views

Photo: http://www.etsy.com/shop/TheOldTimeJunkShop

http://www.etsy.com/blog/news/2012/etsy-statistics-june-weather-report/

Architecture
Linux, Apache, MySQL, PHP, Postgres,
Solr, Gearman, Memcache, Chef,
Hadoop, EC2/S3/EMR

30+ Logical data stores
(23 shards + more functionally partitioned)

Search and storage tiers as “services”

150 Engineers + Designers + Product
(this was 20 in Feb 2010)

credit: martin_heigan (flickr)

Buyers, sellers, support,
developer api, i18n,
core infrastructure, storage,
payments, security, fraud
detection, big data and BI,
email delivery, corp IT,
operations, developer tools,
continuous integration and
testing, site performance,
search, advertising, seller
economics, mobile web,
iOS.

There Will Be Fail

Credit: wilkee.deviantart.com

We cannot comprehend all of the ways in
which the individual parts of a complex
system will interact. We cannot know all
of the states and scenarios.

We cannot prevent failures.

Yet, we can mitigate them.

Redundant system architectures.
Small, well-understood changes to production.
Control application using config flags.
Gratuitous metrics collection.
Resilient user interfaces.
GameDay exercises.

Functionally Partitioned
[Diagram: separate databases for Convos, Ads, Auth, and Async Tasks]

Master-Master Replication
[Diagram: each functional database (Convos, Ads, Auth, Async Tasks) runs as a master-master pair]

Sharded Tables
[Diagram: shard1 through shard4, each a master-master pair]
~4% of listing data is stored on shard3.
Outage is limited to ~4% of data set.
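
The ~4% figure is consistent with the 23 shards mentioned earlier: with data spread evenly, one shard holds about 1/23 of it, or roughly 4.3%. A minimal sketch of that arithmetic follows. The modulo placement and the names `shard_for` and `outage_fraction` are assumptions for illustration; the deck does not say how rows are actually mapped to shards.

```php
<?php
// Hypothetical sketch: modulo placement is an assumption, not a
// description of the real shard lookup.
const SHARD_COUNT = 23; // shard count quoted earlier in the deck

// Map a non-negative numeric record ID to a shard number in 1..$shards.
function shard_for(int $id, int $shards = SHARD_COUNT): int {
    return ($id % $shards) + 1;
}

// Fraction of evenly distributed data that is unavailable
// while $down shards are offline.
function outage_fraction(int $down, int $shards = SHARD_COUNT): float {
    return $down / $shards;
}

printf("listing 1234567 lives on shard%d\n", shard_for(1234567));
printf("one shard down affects %.1f%% of data\n", 100 * outage_fraction(1));
```

The point of the sketch is the blast radius: a failure in one shard pair leaves the other 22 shards serving traffic.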

Uptime of the application is the
responsibility of our Operations team.

Uptime of the application is the
responsibility of our Operations, Engineering,
Product, and Design teams.

If you are committing code, you are
operating the site.

“All existing revision control systems
were built by people who build
installed software”

Always Ship Trunk
Paul Hammond
Velocity Conf 2010

Config Flags
Enable and disable features quickly.
Features for staff or for beta groups.
Percentage ramp-up of users or requests.
A/B “experiments.”

$cfg['new_search'] = array('enabled' => 'on');
$cfg['sign_in'] = array('enabled' => 'on');
$cfg['checkout'] = array('enabled' => 'on');
$cfg['homepage'] = array('enabled' => 'on');

// Meanwhile...

// Check the flag value itself: a non-empty array is always truthy in PHP.
if ($cfg['new_search']['enabled'] == 'on') {
    # New hotness
    $results = do_solr();
} else {
    # old and boring
    $results = do_grep();
}
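
The "percentage ramp-up of users or requests" use listed above could look roughly like the sketch below. The `percent` key and the `feature_enabled` helper are hypothetical and are not taken from the deck; the idea shown is only that a stable hash gives each user a fixed bucket, so a user who sees the feature at 10% still sees it at 20%.

```php
<?php
// Hypothetical flag shape: the 'percent' key and feature_enabled()
// are illustrative assumptions, not an actual config API.
$cfg = array(
    'new_search' => array('enabled' => 'on', 'percent' => 10),
    'checkout'   => array('enabled' => 'off'),
);

// Returns true when the feature is on for this user.
// Unknown or disabled features are always off.
function feature_enabled(array $cfg, string $name, int $userId): bool {
    if (($cfg[$name]['enabled'] ?? 'off') !== 'on') {
        return false;
    }
    $percent = $cfg[$name]['percent'] ?? 100;
    // Stable bucket in 0..99 derived from the feature name and user ID.
    $bucket = hexdec(substr(md5($name . ':' . $userId), 0, 6)) % 100;
    return $bucket < $percent;
}

$results = feature_enabled($cfg, 'new_search', 42) ? 'solr' : 'grep';
```

Hashing the feature name together with the user ID keeps the same users from always landing in the first bucket for every feature.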

“Doesn’t that mean you have conditionals all over your code?”
Yes.
“Does anyone ever clean those up?”
Sometimes.
“That sounds like it sucks.”
Really?
“Wait a minute... all of the counter arguments are in Comic Sans. WTF?!?”
Oh, you noticed? ;)

DB Server Maintenance, June 16, 2012
http://etsystatus.com/2012/06/16/planned-outage-june-16th-7am-gmt/

00:00   Site down for maintenance
+01:47  Site up, disabled login and registration
+06:40  Site up, some seller tools disabled
+07:41  All features restored

Features are launched by flipping a
config flag, not by deploying
hundreds of lines of code.

“If Engineering at Etsy has a religion,
it’s the Church of Graphs.”
Ian Malpass, Code as Craft
http://etsy.me/ePkoZB

THIS IS HOW YOU RUN A COMPLEX SYSTEM

Config flags
Operator Metrics

Photo: http://www.flickr.com/photos/flyforfun/2694158656/

Oh, you want to talk about how we collect
metrics and make graphs?

http://www.slideshare.net/mikebrittain/metricsdriven-engineering

Interfaces and user experiences
that adapt to technical and
architectural failure.

class DBConnection_Exception extends Exception {}

class DBConnection extends mysqli {
    /**
     * Creates a database connection.
     */
    public function __construct($host, $user, $pass, $db) {
        parent::__construct($host, $user, $pass, $db);

        // Surface a failed connection as an exception the caller can catch.
        if (mysqli_connect_error()) {
            throw new DBConnection_Exception(
                sprintf("Error: %s, %s",
                    mysqli_connect_errno(),
                    mysqli_connect_error()));
        }
    }
}

try {
    $conn = new DBConnection('viewsdb.host', 'db_read_user',
        'ssssshh!', 'views_db');
} catch (DBConnection_Exception $e) {
    // TODO: Someone should figure out what to do if
    // we can't connect to the views db.
    throw $e;
}

Site navigation
Logo

Cute Picture

Generic, catch-all
error messaging....

Every back-end service is an
opportunity for failure.

[Figure: page annotated with numbered callouts 1 through 14]

Google Calendar

Google Docs

“Oops, we aren’t able to
access click metrics right
now, do not worry — your
data is safe.”
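
One way to get from the rethrown exception shown earlier to a message like this is to catch the failure at the boundary of the feature and render a fallback for that feature only. The sketch below illustrates the pattern; `fetch_click_metrics`, `render_click_metrics`, and `Metrics_Unavailable_Exception` are hypothetical names, and the boolean argument merely simulates an outage.

```php
<?php
// Hypothetical names throughout; this is a sketch of the pattern only.
class Metrics_Unavailable_Exception extends Exception {}

// Stand-in for a call to a back-end service that may be down.
function fetch_click_metrics(bool $serviceUp): array {
    if (!$serviceUp) {
        throw new Metrics_Unavailable_Exception('click metrics store unreachable');
    }
    return array('clicks' => 128);
}

// Render one widget. A failure here degrades this widget only,
// instead of sending the whole page to a generic error template.
function render_click_metrics(bool $serviceUp): string {
    try {
        $m = fetch_click_metrics($serviceUp);
        return sprintf('Clicks: %d', $m['clicks']);
    } catch (Metrics_Unavailable_Exception $e) {
        error_log($e->getMessage()); // keep the failure visible to operators
        return "Oops, we aren't able to access click metrics right now. "
             . "Do not worry, your data is safe.";
    }
}

echo render_click_metrics(false), "\n";
```

The rest of the page renders normally; only the part that depends on the failed service changes.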

Product design doesn’t stop
at 100% availability.

“What could possibly go wrong?”

What is changing about the architecture?
What kind of data access patterns are we using?
How much traffic, how many queries?
What metrics are we collecting?
Are there automated alerts?
How do we know the thresholds are right?
How do we turn it off? ...and what happens when we do?

Homepage (95th perc.)

Surprise!!!
Turning off multi-language support improves our page
generation times by up to 25%.
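
The graph tracks the 95th percentile of homepage generation time rather than an average. A minimal sketch of a nearest-rank percentile follows; the `percentile` helper and the sample timings are invented for illustration and are not Etsy data.

```php
<?php
// Nearest-rank percentile: the smallest sample such that at least
// $p percent of all samples are less than or equal to it.
function percentile(array $samples, float $p): float {
    $n = count($samples);
    if ($n === 0) {
        throw new InvalidArgumentException('no samples');
    }
    sort($samples); // sorts the local copy only
    $rank = max(1, (int) ceil($p * $n / 100)); // 1-based rank
    return (float) $samples[$rank - 1];
}

// Invented page generation times in milliseconds, for illustration only.
$timings = array(120, 135, 110, 900, 140, 125, 130, 115, 145, 150);
printf("p50 = %d ms, p95 = %d ms\n",
    percentile($timings, 50), percentile($timings, 95));
```

With these samples the median is 130 ms while the 95th percentile is 900 ms, which is why a tail percentile exposes slow requests that a typical value hides.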

How could this have gone better?

How quickly did we find out that something was wrong?
Did we communicate well to our visitors and each other?
Why did we have confidence that what we were doing was OK?
Did we have the right tools, did we use them properly?
Did we collect metrics, and could we find them?
Where did we make the wrong decisions?

What steps do we take to reduce the chance of this
happening again in the future?

“... an engineer who thinks they’re going to be
reprimanded are disincentivized to give the details
necessary to get an understanding of the mechanism,
pathology, and operation of the failure.

This lack of understanding of how the accident occurred
all but guarantees that it will repeat. If not with the
original engineer, another one in the future.”

John Allspaw
VP, Technical Operations, Etsy

http://codeascraft.etsy.com/2012/05/22/blameless-postmortems/

We should try to learn not only what went
wrong, but also what went right.

Operational Mindset

Dev Ops Product

Business Priorities

page views for error template
...or, how are we screwing our users?
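
Counting views of the error template can be as small as one counter increment where the template is rendered. The sketch below assumes a StatsD-style collector listening on UDP port 8125 on localhost; the deck does not say which tool produces this graph, and the metric name and helper names are hypothetical.

```php
<?php
// Build a StatsD counter packet, e.g. "errors.template_views:1|c".
function statsd_counter_packet(string $metric, int $delta = 1): string {
    return sprintf('%s:%d|c', $metric, $delta);
}

// Fire-and-forget UDP send. Metrics must never break the page, so
// warnings are suppressed and failures are ignored.
// Host and port are assumptions (8125 is the usual StatsD port).
function statsd_increment(string $metric, string $host = '127.0.0.1', int $port = 8125): void {
    $sock = @fsockopen('udp://' . $host, $port, $errno, $errstr, 0.1);
    if ($sock === false) {
        return;
    }
    @fwrite($sock, statsd_counter_packet($metric));
    fclose($sock);
}

// Call this wherever the generic error template is rendered.
statsd_increment('errors.template_views');
```

Graphing that counter next to deploys and flag changes shows how often users hit the error page and when that started.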

Risk mitigation in a complex system

Redundant system architectures.
Small, well-understood changes to production.
Control application using config flags.
Gratuitous metrics collection.
Resilient user interfaces.
GameDay exercises.

Thank you.
Mike Brittain
mike@etsy.com
@mikebrittain

PHOTO CREDITS
Flickr: roboppy
http://www.flickr.com/photos/51035735481@N01/163374138/

Flickr: jamesjyu

Flickr: circulating
