3. Managing failures and building
resilience into systems, applications,
process, and people.
5. $61 M in goods sold in the marketplace
2.9 M items sold
1.2 B page views
Photo: http://www.etsy.com/shop/TheOldTimeJunkShop
http://www.etsy.com/blog/news/2012/etsy-statistics-june-weather-report/
6. Architecture
Linux, Apache, MySQL, PHP, Postgres,
Solr, Gearman, Memcache, Chef,
Hadoop, EC2/S3/EMR
30+ Logical data stores
(23 shards + more functionally partitioned)
Search and storage tiers as “services”
7. 150 Engineers + Designers + Product
(this was 20 in Feb 2010)
credit: martin_heigan (flickr)
8. Buyers, sellers, support,
developer api, i18n,
core infrastructure, storage,
payments, security, fraud
detection, big data and BI,
email delivery, corp IT,
operations, developer tools,
continuous integration and
testing, site performance,
search, advertising, seller
economics, mobile web,
iOS.
12. We cannot comprehend all of the ways in
which the individual parts of a complex
system will interact. We cannot know all
of the states and scenarios.
We cannot prevent failures.
13. Yet, we can mitigate them.
Redundant system architectures.
Small, well-understood changes to production.
Control application using config flags.
Gratuitous metrics collection.
Resilient user interfaces.
GameDay exercises.
24. Uptime of the application is the
responsibility of our Operations team.
26. Uptime of the application is the
responsibility of our Operations, Engineering,
Product, and Design teams.
If you are committing code, you are
operating the site.
36. “Doesn’t that mean you have conditionals
all over your code?”
Yes.
“Does anyone ever clean those up?”
Sometimes.
“That sounds like it sucks.”
Really?
“Wait a minute... all of the counterarguments are
in Comic Sans. WTF?!?”
Oh, you noticed? ;)
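A minimal sketch of what one of those conditionals might look like. The FeatureFlags class, the flag names, and renderHeader() are hypothetical and exist only for illustration; the slides do not show the actual flag implementation.

```php
<?php
// Hypothetical flag store; the class and flag names are assumptions,
// not the implementation described in the talk.
final class FeatureFlags {
    /** @var array<string, bool> */
    private $flags;

    public function __construct(array $flags) {
        $this->flags = $flags;
    }

    // Unknown flags are treated as off, so a missing entry fails closed.
    public function isEnabled(string $name): bool {
        return isset($this->flags[$name]) && $this->flags[$name] === true;
    }
}

function renderHeader(FeatureFlags $flags): string {
    // The conditional in question: one branch per flag state.
    if ($flags->isEnabled('login')) {
        return 'Sign in | Register';
    }
    return 'Sign in is temporarily unavailable';
}

// Usage: login is switched off in config while the rest of the site stays up.
$flags = new FeatureFlags(['login' => false, 'seller_tools' => true]);
echo renderHeader($flags), "\n";
```

The cost is a branch at each call site; the benefit is that a feature can be turned off in production without a code deploy.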
37. DB Server Maintenance, June 16, 2012
00:00 Site down for maintenance
+01:47 Site up, disabled login and registration
+06:40 Site up, some seller tools disabled
+07:41 All features restored
http://etsystatus.com/2012/06/16/planned-outage-june-16th-7am-gmt/
47. Interfaces and user experiences
that adapt to technical and
architectural failure.
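54. // Assumed context: the slide shows only the constructor; the class
// wrapper is inferred from the mysqli calls and the usage on the next slide.
class DBConnection extends mysqli {
    /**
     * Creates a database connection.
     */
    public function __construct($host, $user, $pass, $db) {
        parent::__construct($host, $user, $pass, $db);
        if (mysqli_connect_error()) {
            throw new DBConnection_Exception(
                sprintf("Error: %s, %s",
                    mysqli_connect_errno(),
                    mysqli_connect_error()));
        }
    }
}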
55. try {
    $conn = new DBConnection('viewsdb.host', 'db_read_user',
        'ssssshh!', 'views_db');
} catch (DBConnection_Exception $e) {
    // TODO: Someone should figure out what to do if
    // we can't connect to the views db.
    throw $e;
}
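One way to resolve that TODO, sketched under assumptions: treat the view count as optional and render the page without it when the views database is unreachable. getViewCount(), renderListing(), and the $loadCount callable are hypothetical names introduced for illustration; only DBConnection_Exception comes from the slides, and it is stubbed here so the example runs on its own.

```php
<?php
// Stand-in for the exception type named on the slide.
class DBConnection_Exception extends Exception {}

/**
 * Returns the view count, or null when the views database is unreachable.
 * $loadCount is a hypothetical callable that connects and runs the query.
 */
function getViewCount(callable $loadCount): ?int {
    try {
        return $loadCount();
    } catch (DBConnection_Exception $e) {
        // Record the failure, then degrade instead of rethrowing.
        error_log('views db unavailable: ' . $e->getMessage());
        return null;
    }
}

// Renders the listing; the view count is omitted when it is unavailable.
function renderListing(string $title, ?int $views): string {
    $html = '<h1>' . htmlspecialchars($title) . '</h1>';
    if ($views !== null) {
        $html .= '<p>' . $views . ' views</p>';
    }
    return $html;
}

// Usage: simulate a failed connection to the views database.
$failing = function (): int {
    throw new DBConnection_Exception('Error: 2002, Connection refused');
};
echo renderListing('Vintage lamp', getViewCount($failing)), "\n";
```

The page still renders; only the secondary feature that depends on the failed data store disappears.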
58. Site navigation
Logo
Cute Picture
Generic, catch-all
error messaging....
81. “What could possibly go wrong?”
What is changing about the architecture?
What kind of data access patterns are we using?
How much traffic, how many queries?
What metrics are we collecting?
Are there automated alerts?
How do we know the thresholds are right?
How do we turn it off? ...and what happens when we do?
88. How could this have gone better?
How quickly did we find out that something was wrong?
Did we communicate well to our visitors and each other?
Why did we have confidence that what we were doing was OK?
Did we have the right tools, did we use them properly?
Did we collect metrics, and could we find them?
Where did we make the wrong decisions?
What steps do we take to reduce the chance of this
happening again in the future?
89. “... an engineer who thinks they’re going to be
reprimanded are disincentivized to give the details
necessary to get an understanding of the mechanism,
pathology, and operation of the failure.
This lack of understanding of how the accident occurred
all but guarantees that it will repeat. If not with the
original engineer, another one in the future.”
John Allspaw
VP, Technical Operations, Etsy
http://codeascraft.etsy.com/2012/05/22/blameless-postmortems/
90. We should try to learn not only what went
wrong, but also what went right.
91. DB Server Maintenance, June 16, 2012
00:00 Site down for maintenance
+01:47 Site up, disabled login and registration
+06:40 Site up, some seller tools disabled
+07:41 All features restored
http://etsystatus.com/2012/06/16/planned-outage-june-16th-7am-gmt/
96. page views for error template
...or, how are we screwing our users?
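A rough sketch of how such a metric could be counted: increment a counter each time the error template is rendered. The slides do not say how the number is collected; the StatsD-style line format, the metric name, and both function names below are assumptions for illustration.

```php
<?php
// Formats a counter increment in the StatsD line format "<name>:<value>|c".
function counterLine(string $metric, int $by = 1): string {
    return sprintf('%s:%d|c', $metric, $by);
}

// Renders the error page and reports one page view through $emit.
// $emit is a hypothetical callable; in production it might send the
// line to a metrics daemon, here it is injected so the sketch is testable.
function renderErrorTemplate(callable $emit): string {
    $emit(counterLine('error_template.page_views'));
    return 'Something went wrong. Please try again shortly.';
}

// Usage: collect emitted lines in memory instead of sending them anywhere.
$sent = [];
$emit = function (string $line) use (&$sent): void {
    $sent[] = $line;
};
renderErrorTemplate($emit);
print_r($sent);
```

Graphing this one counter gives a direct view of how often visitors hit the generic error page.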
97. Risk mitigation in a complex system
Redundant system architectures.
Small, well-understood changes to production.
Control application using config flags.
Gratuitous metrics collection.
Resilient user interfaces.
GameDay exercises.
98. Thank you.
Mike Brittain
mike@etsy.com
@mikebrittain