On Failure and Resilience

Published in: Technology

  1. On Failure and Resilience Mike Brittain DIRECTOR OF ENGINEERING, ETSY @mikebrittain Presented at 37signals on Aug 21, 2012
  2. “Software Infrastructure”: “Framework” code, caching, ORM, file storage tier, developer tools, CI/deployment, site performance, front-end architecture.
  3. Managing failures and building resilience into systems, applications, process, and people.
  4. $61 M in goods sold in the marketplace. 2.9 M items sold. 1.2 B page views. Photo: http://www.etsy.com/shop/TheOldTimeJunkShop http://www.etsy.com/blog/news/2012/etsy-statistics-june-weather-report/
  5. Architecture: Linux, Apache, MySQL, PHP, Postgres, Solr, Gearman, Memcache, Chef, Hadoop, EC2/S3/EMR. 30+ logical data stores (23 shards + more functionally partitioned). Search and storage tiers as “services”.
  6. 150 Engineers + Designers + Product (this was 20 in Feb 2010). Credit: martin_heigan (flickr)
  7. Buyers, sellers, support, developer API, i18n, core infrastructure, storage, payments, security, fraud detection, big data and BI, email delivery, corp IT, operations, developer tools, continuous integration and testing, site performance, search, advertising, seller economics, mobile web, iOS.
  8. Zero Release Managers
  9. There Will Be Fail Credit: wilkee.deviantart.com
  10. We cannot comprehend all of the ways in which the individual parts of a complex system will interact. We cannot know all of the states and scenarios. We cannot prevent failures.
  11. Yet, we can mitigate them. Redundant system architectures. Small, well-understood changes to production. Control application using config flags. Gratuitous metrics collection. Resilient user interfaces. GameDay exercises.
  12. “Uptime” is not binary.
  13. Functionally Partitioned [diagram: Async, Convos, Ads, Auth, Tasks]
  14. Functionally Partitioned [diagram: Async, Convos, Ads, Auth, Tasks]
  15. Master-Master Replication [diagram: Async, Convos, Ads, Auth, and Tasks, each shown as a pair]
  16. Master-Master Replication [diagram: Async, Convos, Ads, Auth, and Tasks, each shown as a pair]
  17. Master-Master Replication [diagram: Async, Convos, Ads, Auth, and Tasks, each shown as a pair]
  18. Sharded Tables [diagram: shard1 through shard4, each shown as a pair]. ~4% of listing data is stored on shard3.
  19. Sharded Tables [diagram: shard1 through shard4, each shown as a pair]
  20. Sharded Tables [diagram: shard1 through shard4, each shown as a pair]. Outage is limited to ~4% of data set.
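
The "~4%" figure on the sharding slides is consistent with one shard out of the 23 mentioned on the architecture slide. The following is a minimal sketch of that arithmetic, not Etsy's actual shard lookup: the modulo placement, the even spread of rows, and the function names `shard_for`, `is_available`, and `affected_fraction` are all assumptions made for illustration.

```php
<?php
// Shard count taken from the architecture slide ("23 shards").
const SHARD_COUNT = 23;

/**
 * Maps a user id to a shard number from 1 to SHARD_COUNT.
 * Plain modulo placement is an assumption made for this sketch.
 */
function shard_for($user_id)
{
    return ($user_id % SHARD_COUNT) + 1;
}

/** True when the shard holding this user's rows is not in the outage list. */
function is_available($user_id, array $down_shards)
{
    return !in_array(shard_for($user_id), $down_shards, true);
}

/** Fraction of the data set affected, assuming rows are spread evenly. */
function affected_fraction(array $down_shards)
{
    return count(array_unique($down_shards)) / SHARD_COUNT;
}

// Example: shard 3 is down, every other shard keeps serving.
$down = array(3);
printf("%.1f%% of data affected\n", 100 * affected_fraction($down));
```

With a single shard in the outage list, requests for users on the other 22 shards are unaffected, which is the sense in which “uptime” is not binary.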
  21. “Uptime” is not binary.
  22. Uptime of the application is the responsibility of our Operations team.
  23. Uptime of the application is the responsibility of our Operations, Engineering, Product, and Design teams.
  24. Uptime of the application is the responsibility of our Operations, Engineering, Product, and Design teams. If you are committing code, you are operating the site.
  25. Branching in Code
  26. “All existing revision control systems were built by people who build installed software” Always Ship Trunk Paul Hammond Velocity Conf 2010
  27. Config Flags. Enable and disable features quickly. Features for staff or for beta groups. Percentage ramp-up of users or requests. A/B “experiments.”
  28. $cfg['new_search'] = array('enabled' => 'on');
      $cfg['sign_in'] = array('enabled' => 'on');
      $cfg['checkout'] = array('enabled' => 'on');
      $cfg['homepage'] = array('enabled' => 'on');
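  29. $cfg['new_search'] = array('enabled' => 'on');
      // Meanwhile...
      // Check the 'enabled' value: a non-empty array is always truthy in PHP.
      if ($cfg['new_search']['enabled'] === 'on') {
          # New hotness
          $results = do_solr();
      } else {
          # old and boring
          $results = do_grep();
      }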
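
The config flags slide also lists percentage ramp-up, which the snippets on the slides do not show. One way it might look is sketched below. The `'rampup'` state, the `'percent'` key, and the `feature_enabled` helper are hypothetical names introduced for this example, not part of the talk or of any Etsy library.

```php
<?php
// Hypothetical flag table. The flag names mirror the slides; the 'rampup'
// state and 'percent' option are assumptions made for this sketch.
$cfg = array(
    'new_search' => array('enabled' => 'on'),
    'checkout'   => array('enabled' => 'off'),
    'homepage'   => array('enabled' => 'rampup', 'percent' => 10),
);

/**
 * Returns true when the named feature should be shown to this user.
 * Unknown flags are treated as off, so a typo fails closed.
 */
function feature_enabled(array $cfg, $name, $user_id)
{
    if (!isset($cfg[$name]['enabled'])) {
        return false;
    }
    $flag = $cfg[$name];
    switch ($flag['enabled']) {
        case 'on':
            return true;
        case 'rampup':
            // Hash flag name + user id so each user lands in a stable
            // bucket from 0 to 99, independent of other flags.
            $bucket = hexdec(substr(md5($name . ':' . $user_id), 0, 6)) % 100;
            return $bucket < (int) $flag['percent'];
        default:
            return false;
    }
}

// Usage: the same user always gets the same answer for a given flag.
$results = feature_enabled($cfg, 'new_search', 42) ? 'solr' : 'grep';
```

Hashing the user id rather than drawing a random number per request keeps each user's experience consistent while the percentage is raised.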
  30. But...
  31. “Doesn’t that mean you have conditionals all over your code?” Yes.
  32. “Doesn’t that mean you have conditionals all over your code?” Yes. “Does anyone ever clean those up?” Sometimes.
  33. “Doesn’t that mean you have conditionals all over your code?” Yes. “Does anyone ever clean those up?” Sometimes. “That sounds like it sucks.” Really?
  34. “Doesn’t that mean you have conditionals all over your code?” Yes. “Does anyone ever clean those up?” Sometimes. “That sounds like it sucks.” Really? “Wait a minute... all of the counter arguments are in Comic Sans. WTF?!?” Oh, you noticed? ;)
  35. DB Server Maintenance, June 16, 2012. 00:00 Site down for maintenance. +01:47 Site up, disabled login and registration. +06:40 Site up, some seller tools disabled. +07:41 All features restored. http://etsystatus.com/2012/06/16/planned-outage-june-16th-7am-gmt/
  36. “Uptime” is not binary.
  37. Features are launched by flipping a config flag, not by deploying hundreds of lines of code.
  38. “If Engineering at Etsy has a religion, it’s the Church of Graphs.” Ian Malpass, Code as Craft http://etsy.me/ePkoZB
  39. THIS IS HOW YOU RUN A COMPLEX SYSTEM http://www.flickr.com/photos/flyforfun/2694158656/
  40. Config flags, Operator, Metrics http://www.flickr.com/photos/flyforfun/2694158656/
  41. Oh, you want to talk about how we collect metrics and make graphs? http://www.slideshare.net/mikebrittain/metricsdriven-engineering
  42. Resilient User Interfaces
  43. Interfaces and user experiences that adapt to technical and architectural failure.
  44. /**
       * Creates a database connection.
       */
      public function __construct($host, $user, $pass, $db) {
          parent::__construct($host, $user, $pass, $db);
          if (mysqli_connect_error()) {
              throw new DBConnection_Exception(
                  sprintf("Error: %s, %s", mysqli_connect_errno(), mysqli_connect_error()));
          }
      }
  45. try {
          $conn = new DBConnection('viewsdb.host', 'db_read_user', 'ssssshh!', 'views_db');
      } catch (DBConnection_Exception $e) {
          // TODO: Someone should figure out what to do if
          // we can't connect to the views db.
          throw $e;
      }
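
Rethrowing the exception sends the whole page to the generic error template even though only the view counts are missing. A minimal sketch of the alternative, degrading just the affected fragment, could look like this. The `$connect` factory, the `view_count` method, and both function names are hypothetical stand-ins, not code from the talk.

```php
<?php
class DBConnection_Exception extends Exception {}

/**
 * Fetches the view count for a listing, or null when the views DB is down.
 * $connect is a hypothetical factory returning an object with a
 * view_count($listing_id) method; it stands in for `new DBConnection(...)`.
 */
function fetch_view_count(callable $connect, $listing_id)
{
    try {
        $conn = $connect();
        return $conn->view_count($listing_id);
    } catch (DBConnection_Exception $e) {
        // Log for operators, but keep rendering the rest of the page.
        error_log('views db unavailable: ' . $e->getMessage());
        return null;
    }
}

/** Renders the view-count fragment, hiding it when no data is available. */
function render_view_count($count)
{
    if ($count === null) {
        return '';
    }
    return sprintf('<span class="views">%d views</span>', $count);
}
```

The caller decides what a missing value means for the interface. Here the fragment is simply omitted, but a short message in its place would follow the same pattern.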
  46. Site navigation, Logo, Cute Picture. Generic, catch-all error messaging....
  47. Every back-end service is an opportunity for failure.
  48. [Diagram: callouts numbered 1 to 14]
  49. Critical Path
  50. #srsly?
  51. < 400 ms
  52. Non-blocking Ajax
  53. Google Calendar Google Docs
  54. GMail
  55. “Oops, we aren’t able to access click metrics right now. Do not worry: your data is safe.”
  56. Product design doesn’t stop at 100% availability.
  57. Dev Ops
  58. Dev Ops Product
  59. [Diagram: callouts numbered 1 to 14]
  60. Operability Reviews
  61. “What could possibly go wrong?” What is changing about the architecture? What kind of data access patterns are we using? How much traffic, how many queries? What metrics are we collecting? Are there automated alerts? How do we know the thresholds are right? How do we turn it off? ... and what happens when we do?
  62. “What could possibly go wrong?” What is changing about the architecture? What kind of data access patterns are we using? How much traffic, how many queries? What metrics are we collecting? Are there automated alerts? How do we know the thresholds are right? How do we turn it off? ... and what happens when we do?
  63. “GameDay” Exercises
  64. Pedro
  65. Homepage (95th perc.) Surprise!!! Turning off multi-language support improves our page generation times by up to 25%.
  66. (Blameless) Post-Mortems
  67. How could this have gone better? How quickly did we find out that something was wrong? Did we communicate well to our visitors and each other? Why did we have confidence that what we were doing was OK? Did we have the right tools, did we use them properly? Did we collect metrics, and could we find them? Where did we make the wrong decisions? What steps do we take to reduce the chance of this happening again in the future?
  68. “... an engineer who thinks they’re going to be reprimanded are disincentivized to give the details necessary to get an understanding of the mechanism, pathology, and operation of the failure. This lack of understanding of how the accident occurred all but guarantees that it will repeat. If not with the original engineer, another one in the future.” John Allspaw, VP, Technical Operations, Etsy http://codeascraft.etsy.com/2012/05/22/blameless-postmortems/
  69. We should try to learn not only what went wrong, but also what went right.
  70. DB Server Maintenance, June 16, 2012. 00:00 Site down for maintenance. +01:47 Site up, disabled login and registration. +06:40 Site up, some seller tools disabled. +07:41 All features restored. http://etsystatus.com/2012/06/16/planned-outage-june-16th-7am-gmt/
  71. Operational Mindset: Dev, Ops, Product
  72. Operational Mindset: Dev, Ops, Product, Business Priorities
  73. Introspection
  74. page views for error template
  75. page views for error template ... or, how are we screwing our users?
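
Counting page views of the error template takes only a few lines. The sketch below assumes a StatsD-style collector listening on UDP port 8125 on localhost. The metric name `errors.template.page_views` and all three function names are hypothetical, and the talk does not say how this particular graph was produced.

```php
<?php
/**
 * Formats a StatsD counter increment, e.g. "errors.template.page_views:1|c".
 * The metric name used below is hypothetical.
 */
function statsd_counter_line($metric, $count = 1)
{
    return sprintf('%s:%d|c', $metric, $count);
}

/**
 * Sends one counter increment over UDP and reports whether it was written.
 * Failures are swallowed: metrics must never break the error page itself.
 */
function statsd_increment($metric, $host = '127.0.0.1', $port = 8125)
{
    $socket = @fsockopen('udp://' . $host, $port, $errno, $errstr, 0.1);
    if ($socket === false) {
        return false;
    }
    $written = @fwrite($socket, statsd_counter_line($metric));
    fclose($socket);
    return $written !== false;
}

/** Renders the generic error template and counts the page view. */
function render_error_page()
{
    statsd_increment('errors.template.page_views');
    return '<h1>Something went wrong</h1>';
}
```

Because UDP is fire-and-forget, an unreachable collector costs the request almost nothing, which suits a code path that only runs when something else has already failed.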
  76. Risk mitigation in a complex system. Redundant system architectures. Small, well-understood changes to production. Control application using config flags. Gratuitous metrics collection. Resilient user interfaces. GameDay exercises.
  77. Thank you. Mike Brittain mike@etsy.com @mikebrittain
  78. PHOTO CREDITS. Flickr: roboppy http://www.flickr.com/photos/51035735481@N01/163374138/ Flickr: jamesjyu http://www.flickr.com/photos/32593095@N00/3465022/ Flickr: circulating http://www.flickr.com/photos/26835318@N00/2318226026/