Epidemic Failures

Slides originally written in April 2013 for a private conference and internal use at Netflix. Publishing now since Heartbleed is another example of an epidemic failure mode.

Published in: Technology, News & Politics

  1. Cloud Native and Epidemic Failures
     April 2014
     Adrian Cockcroft @adrianco @BatteryVentures
     http://www.linkedin.com/in/adriancockcroft
  2. Cloud Native?
     Epidemic Failures
     Automated Diversity
  3. Cloud Native
     Construct a highly agile and highly available service from ephemeral and often broken components
  4. Inspiration
  5. Numquam ponenda est pluralitas sine necessitate
     Plurality must never be posited without necessity
     Occam’s Razor
  6. Monoculture
     Replicate “the best” as patterns
     Reduce interaction complexity
     Epidemic single point of failure
  7. Pattern Failures
     Infrastructure Pattern Failures
     Software Stack Pattern Failures
     Application Pattern Failures
  8. Infrastructure Pattern Failures
     • Device failures – bad batch of disks, PSUs, etc.
     • CPU failures – cache corruption, math errors
     • Datacenter failures – power, network, disaster
     • Routing failures – DNS, Internet/ISP path
  9. Software Stack Pattern Failures
     • Time bombs – Counter wrap, memory leak
     • Date bombs – Leap year, leap second, epoch
     • Expiration – Certs timing out
     • Trust revocation – Certificate Authority fails
     • Security exploit – e.g. Heartbleed
     • Language bugs – compile time
     • Runtime bugs – JVM, Linux, Hypervisor
     • Network bugs – routers, firewalls, protocols
  10. Application Pattern Failures
      • Time bombs – Counter wrap, memory leak
      • Date bombs – Leap year, leap second, epoch
      • Content bombs – Data dependent failure
      • Configuration – wrong/bad syntax
      • Versioning – incompatible mixes
      • Cascading failures – error handling bugs etc.
      • Cascading overload – excessive logging etc.
  11. What to do?
      Automated diversity management
      Diversified automation
      Efficient vs. Antifragile
  12. Specific Ideas
      • Automate running a mixture
        – Diversity as default for any service stack
        – No developer overhead, stay agile, low cost
      • Support oldest and newest versions together
        – Automate running 50/50 mix CentOS/Ubuntu
        – Mix versions of JDK, Tomcat, etc.
      • Vendor diversity
        – Multiple DNS vendors, cloud regions, costs more
        – Multiple cloud vendors? Much higher cost.
  13. Generate Permutations
      > epi <- data.frame(java=gl(2,1,8,c("java6","java7")), linux=gl(2,2,8,c("centos","ubuntu")), codeversion=gl(2,4,8,c("v34","v35")))
      > epi
         java  linux codeversion
      1 java6 centos         v34
      2 java7 centos         v34
      3 java6 ubuntu         v34
      4 java7 ubuntu         v34
      5 java6 centos         v35
      6 java7 centos         v35
      7 java6 ubuntu         v35
      8 java7 ubuntu         v35
  14. Deployment
      • Builds
        – Manual to test, automate if it works
        – Modify build to generate permutation AMIs
        – Modify Asgard to auto-deploy permutations
      • Data collection
        – Tag each instance with its permutation
        – Gather metrics by permutation per instance
        – Do R-based Design of Experiments analysis
  15. Analysis
      • As a function of permutations
        – Error rate
        – Response time
        – CPU Utilization
      • Interactions
        – E.g. interaction between linux and java
        – Contrasts identify components with issues
        – Small changes with high statistical significance
  16. GCS Total API Outage for ~1hr
  17. Takeaway
      Watch out for monocultures
      A|B Testing – it’s not just for personalization
      http://perfcap.blogspot.com
      http://slideshare.net/adrianco – Netflix
      http://slideshare.net/adriancockcroft – Battery
      http://www.linkedin.com/in/adriancockcroft
      @adrianco @BatteryVentures
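
The permutation and analysis steps on slides 13 to 15 can also be sketched outside R. The Python below is a minimal illustration, not the code used at Netflix. It enumerates the same three two-level factors, builds a tag for each combination, and computes a main-effect contrast and a two-factor interaction contrast for one metric. The function names, the tag format, and the error_rate values are all invented for the example; real samples would be per-instance metrics gathered by permutation.

```python
from itertools import product
from statistics import mean

# Factor levels copied from the slide-13 example; each list is [low, high].
FACTORS = {
    "java": ["java6", "java7"],
    "linux": ["centos", "ubuntu"],
    "codeversion": ["v34", "v35"],
}


def generate_permutations(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


def permutation_tag(perm):
    """Build an instance tag such as 'java6-centos-v34' (hypothetical format)."""
    return "-".join(perm.values())


def main_effect(samples, factor, low, high, metric):
    """Mean of `metric` at the high level minus its mean at the low level."""
    high_vals = [s[metric] for s in samples if s[factor] == high]
    low_vals = [s[metric] for s in samples if s[factor] == low]
    return mean(high_vals) - mean(low_vals)


def interaction(samples, factor_a, factor_b, factors, metric):
    """Two-factor interaction contrast for a two-level design.

    Half the difference between the effect of factor_a measured at the
    high level of factor_b and the same effect at the low level.
    """
    a_low, a_high = factors[factor_a]
    b_low, b_high = factors[factor_b]
    at_b_high = [s for s in samples if s[factor_b] == b_high]
    at_b_low = [s for s in samples if s[factor_b] == b_low]
    effect_high = main_effect(at_b_high, factor_a, a_low, a_high, metric)
    effect_low = main_effect(at_b_low, factor_a, a_low, a_high, metric)
    return (effect_high - effect_low) / 2


def synthetic_samples(perms):
    """Invented data: errors rise only when java7 runs on ubuntu."""
    samples = []
    for perm in perms:
        bad_combo = perm["java"] == "java7" and perm["linux"] == "ubuntu"
        rate = 0.01 + (0.02 if bad_combo else 0.0)
        samples.append(dict(perm, tag=permutation_tag(perm), error_rate=rate))
    return samples


if __name__ == "__main__":
    perms = generate_permutations(FACTORS)
    samples = synthetic_samples(perms)
    for s in samples:
        print(s["tag"], s["error_rate"])
    print("java effect:",
          main_effect(samples, "java", "java6", "java7", "error_rate"))
    print("codeversion effect:",
          main_effect(samples, "codeversion", "v34", "v35", "error_rate"))
    print("java x linux interaction:",
          interaction(samples, "java", "linux", FACTORS, "error_rate"))
```

With these made-up numbers the extra errors appear only when java7 runs on ubuntu, so the java and linux interaction contrast is nonzero while the codeversion main effect is zero. That is the kind of signal slide 15 describes: a contrast pointing at the components that have an issue together. A real analysis would also need replicated measurements and a significance test, which this sketch leaves out.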
