
4Developers 2015: Designing for failure - architecting fault-tolerant system - Jakub Derda



Jakub Derda

Language: English

Designing an enterprise system is not black magic nowadays: many developers have both knowledge of and experience in designing solutions. However, when SLAs, distribution and integration with legacy systems come into play, the complexity increases sharply.
This lecture will discuss the most common pitfalls in designing fault-tolerant systems, ways to plan for failure, how to predict where (and how) a system may fail, and ways to fail gracefully.
We'll explore both simple systems and more complex architectural blueprints, using Enterprise Integration Patterns (EIP) to solve or reduce the impact of some of these problems.

Published in: Software

  1. Architecting for failure: building fault-tolerant systems. Jakub Derda, Warsaw, 2015
  2. ‘Tree’ component – overview
  3. ‘Tree’ component – detailed view
  4. ‘Tree’ component – detailed view: client, network connection, server
  5. ‘Tree’ component – detailed view: human factor software client library ISP protocol stack network load balancers OS power source client network connection server
  6. Your component – detailed view. What is a fault?
  7. What is not a fault? Service is not working on our side* (* caused by e.g. technical failures, outages, corrupted data, attacks)
  8. What is a fault? The real fault is when we don’t deliver value to customers.
  9. Value delivering without working system: “Bring your own wine, we’re waiting for license.” Last election in Poland
  10. What fault-tolerance is not? It’s NOT making sure your system never goes down. It (eventually) will.
  11. What is fault-tolerance? It’s making sure that the system can quickly recover and/or the client is not impacted.
  12. How to solve it?
  13. Solving – redundancy: hot/warm replicas; caches; geographical distribution, CDNs; hardware redundancy; alternative systems and procedures
  14. Solving – design: stateless; auditing; idempotent requests; uniqueness / randomness; asynchronous and decoupling; EIPs; commands, not data; break the rules
  15. Solving – procedures: backup creation, cleanup and restore; QA & potential problems; continuous integration; deployment
  16. Solving – observe: dive deep, post-mortems; identify bottlenecks; observe key metrics; verify assumptions; predict traffic
  17. Tradeoffs - simple: 1/scope, QUALITY
  18. Tradeoffs - real: cost, durability, time, consistency, trust, audit (traceability), complexity, security, scalability, functionality, stability, reliability, extensibility, performance, maintainability, manageability
  19. Summary: Learn to live with crashes
  20. Summary: Automate procedures
  21. Summary: Don’t be afraid to cross the line
  22. Fault tolerance is not a property of a design, it’s a process.
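
The "idempotent requests" item from the design slide can be illustrated with a minimal sketch. It is not taken from the talk: the names IdempotentHandler, withdraw and the request ID "req-1" are hypothetical, and the in-memory dictionary stands in for whatever durable store a real system would use. The idea is that a client retrying after a timeout reuses the same request ID, so the side effect is applied only once.

```python
from typing import Callable, Dict


class IdempotentHandler:
    """Runs an operation at most once per request ID (hypothetical helper)."""

    def __init__(self, operation: Callable[[dict], dict]) -> None:
        self.operation = operation
        # Assumption: an in-memory dict; a real system would persist this.
        self._results: Dict[str, dict] = {}

    def handle(self, request_id: str, payload: dict) -> dict:
        # A retried request carries the same ID, so replay the stored result
        # instead of applying the side effect a second time.
        if request_id in self._results:
            return self._results[request_id]
        result = self.operation(payload)
        self._results[request_id] = result
        return result


# Illustrative state and operation, both made up for this example.
balances = {"alice": 100}


def withdraw(payload: dict) -> dict:
    balances[payload["account"]] -= payload["amount"]
    return {"balance": balances[payload["account"]]}


handler = IdempotentHandler(withdraw)
first = handler.handle("req-1", {"account": "alice", "amount": 30})
# The client did not see the response and retries with the same request ID.
retry = handler.handle("req-1", {"account": "alice", "amount": 30})
```

Here the retry returns the stored result and the balance is debited only once. The sketch ignores concurrency and expiry of stored results, which a production design would have to address.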