SF Chaos Engineering Meetup: Building Disaster Recovery via Resilience Engineering

How often have you heard stories where someone thought they had a disaster strategy, never tested it, and it failed when they needed it the most? LinkedIn has evolved from serving live traffic out of one data center to serving it out of four geographically distributed data centers. Serving live traffic from four data centers at the same time has taken the company from a disaster recovery model to a disaster avoidance model, where an unhealthy data center can be taken out of rotation and its traffic redistributed to the healthy data centers within minutes, with virtually no visible impact to users.

As LinkedIn transitioned from big monolithic applications to microservices, it became difficult to determine whether individual services had the capacity to handle extra load during disaster scenarios. Stress testing individual services with artificial load in a complex microservices architecture did not provide enough confidence in a data center’s capacity. To solve this problem, LinkedIn sends live traffic to services site-wide by shifting traffic between data centers, simulating a disaster every business day.
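The redistribution described above can be sketched in a few lines. This is only an illustration under stated assumptions, not LinkedIn's tooling: the function names `redistribute` and `over_capacity`, the data center names, and all traffic and capacity numbers are hypothetical. It assumes the failed data center's share is spread over the remaining ones in proportion to their current shares, then checks which data centers would exceed an assumed capacity.

```python
from typing import Dict, List


def redistribute(shares: Dict[str, float], unhealthy: str) -> Dict[str, float]:
    """Drop `unhealthy` and rescale the remaining shares so they sum to 1.

    Assumption: traffic is spread proportionally to each healthy
    data center's current share.
    """
    if unhealthy not in shares:
        raise KeyError(unhealthy)
    healthy = {dc: s for dc, s in shares.items() if dc != unhealthy}
    total = sum(healthy.values())
    if total <= 0:
        raise ValueError("no healthy data center left to absorb traffic")
    return {dc: s / total for dc, s in healthy.items()}


def over_capacity(
    shares: Dict[str, float], site_qps: float, capacity_qps: Dict[str, float]
) -> List[str]:
    """Return the data centers whose projected load exceeds their capacity."""
    return sorted(
        dc for dc, share in shares.items() if share * site_qps > capacity_qps[dc]
    )


# Hypothetical numbers: four data centers, each taking a quarter of the traffic.
shares = {"dc1": 0.25, "dc2": 0.25, "dc3": 0.25, "dc4": 0.25}
after = redistribute(shares, "dc4")
hot = over_capacity(after, 90_000, {"dc1": 35_000, "dc2": 28_000, "dc3": 40_000})
```

With these made-up figures, each surviving data center ends up with about a third of the traffic, and `hot` names the one data center whose assumed capacity is too small. A daily live-traffic shift answers the same question with real load instead of estimates.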

  1. Building Disaster Recovery via Resilience Engineering. Michael Kehoe, Staff SRE, LinkedIn
  2. Tonight’s agenda: 1) Introductions 2) What is Resilience Engineering 3) The Problem Statement 4) Project Overview 5) Testing Process 6) Project Outcomes 7) Key Takeaways 8) Q&A
  3. Introduction
  4. Michael Kehoe: /USR/BIN/WHOAMI • Staff Site Reliability Engineer @ LinkedIn • Production-SRE Team • Funny accent = Australian + 4 years American • Former Network Engineer at the University of Queensland
  5. Who are we? PRODUCTION-SRE TEAM AT LINKEDIN • Disaster Recovery Planning and Automation • Incident Response and Automation • Visibility Engineering • Reliability Principles
  6. LinkedIn: EVOLUTION OF THE INFRASTRUCTURE (timeline: 2003, 2010, 2011, 2013, 2014, 2015) • Active & Passive • Active & Active • Multi-colo 3-way Active & Active • Multi-colo n-way Active & Active
  7. LinkedIn 2018 • 4 Data Centers • 21 PoPs • 1000+ services
  8. What is Resilience Engineering?
  9. What is Resilience Engineering? • Projects that directly demand increased resilience from our applications and infrastructure. • Application Injection Failure • Infrastructure Injection Failure • Full Disaster-Recovery Tests
  10. Problem Statement
  11. How often have you heard stories where someone thought they had a disaster strategy, never tested it, and it failed when they needed it the most?
  12. Problem Statement • How do we ensure that we always have disaster recovery ability without incident? • How do we consistently test for disaster recovery ability without disrupting the company?
  13. Project Overview
  14. Project Overview • Build a process (with automation) to facilitate disaster recovery • Operate the process on a regular cadence • Provide reporting on the outcomes of tests to engineering executives
  15. Testing Process
  16. What is Load Testing? • 5x a week • Peak hour traffic • Fixed SLA
  17. LinkedIn Traffic-Tier (diagram labels: Edge, Fabric, Border Router, IPVS, ATS, ATS, Frontend, Stickyrouting)
  18. LinkedIn Traffic-Tier: Fabric Buckets (diagram: buckets numbered 1 through 100)
  19. LinkedIn Traffic-Tier (diagram labels: Edge, Fabric, DC1, DC2, Stickyrouting, "DC1 in Cookie", "Got DC2 as secondary fabric", "Gets secondary fabric for user")
  20. TrafficShift Architecture (diagram labels: Web application, Salt master, Stickyrouting Service, Couchbase, Backend, Worker Processes, Fabric Buckets)
  21. Load Testing (diagram labels: Fabric, DC1, DC2, DC3, 60%, Traffic Percentage)
  22. Load Testing
  23. Project Outcomes
  24. Benefits of Load-testing • Capacity Planning • Identify Bugs • Confidence
  25. Benefits of Load-testing: CAPACITY PLANNING • Through this process, we continuously validate our infrastructure capacity • This is the best signal we can possibly get since we’re simulating a real disaster
  26. Benefits of Load-testing: IDENTIFY BUGS • Some bugs are only found at high load (under duress) • Helps find inefficiencies that otherwise may not be found until it’s too late • Gives us clues on how to make our code more resilient to potential failure
  27. Benefits of Load-testing: CONFIDENCE • Through load-testing, we’ve built confidence in our disaster recovery strategy • We understand exactly: • What process to follow • How long it takes to avert disaster • What risks are associated with a disaster incident
  28. Key Takeaways
  29. Key Takeaways • Resilience Engineering is a must for LinkedIn • Design infrastructure to facilitate disaster recovery • Disaster-test regularly to avoid surprises • Automate your testing process to reduce engagement time
  30. Q&A
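
The fabric-bucket diagram on slide 18 suggests that traffic is moved between data centers in bucket-sized units. The slides do not describe the mechanism, so the following is a minimal sketch under loud assumptions rather than a description of TrafficShift or Stickyrouting: members are hashed into 100 buckets with SHA-256, buckets are assigned to fabrics round-robin, and a fail-out hands the offline fabric's buckets to the remaining fabrics in turn. The names `bucket_for`, `assign_buckets`, `fail_out`, and `bucket_counts` are hypothetical.

```python
import hashlib
from typing import Dict, List

NUM_BUCKETS = 100  # the slide shows buckets numbered 1 through 100


def bucket_for(member_id: str) -> int:
    """Map a member to a stable bucket in 1..NUM_BUCKETS (hashing is an assumption)."""
    digest = hashlib.sha256(member_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS + 1


def assign_buckets(fabrics: List[str]) -> Dict[int, str]:
    """Assign buckets to fabrics round-robin (an assumed starting layout)."""
    return {b: fabrics[(b - 1) % len(fabrics)] for b in range(1, NUM_BUCKETS + 1)}


def fail_out(assignment: Dict[int, str], offline: str) -> Dict[int, str]:
    """Move every bucket owned by `offline` to the remaining fabrics in turn."""
    targets = sorted({f for f in assignment.values() if f != offline})
    if not targets:
        raise ValueError("no fabric left to take the traffic")
    result = dict(assignment)
    moved = [b for b, f in sorted(assignment.items()) if f == offline]
    for i, bucket in enumerate(moved):
        result[bucket] = targets[i % len(targets)]
    return result


def bucket_counts(assignment: Dict[int, str]) -> Dict[str, int]:
    """Count how many buckets each fabric currently serves."""
    counts: Dict[str, int] = {}
    for fabric in assignment.values():
        counts[fabric] = counts.get(fabric, 0) + 1
    return counts


before = assign_buckets(["dc1", "dc2", "dc3", "dc4"])
after = fail_out(before, "dc4")
print(bucket_counts(before))
print(bucket_counts(after))
```

Because each bucket holds roughly one percent of members in this sketch, moving buckets one at a time would allow a shift to be ramped gradually and reversed by restoring the earlier assignment.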