SF Chaos Engineering Meetup: Building Disaster Recovery via Resilience Engineering

Description

How often have you heard stories where someone thought they had a disaster strategy, never tested it, and watched it fail when they needed it most? LinkedIn has evolved from serving live traffic out of one data center to serving it from four geographically distributed data centers. Serving live traffic from four data centers at the same time has moved the company from a disaster recovery model to a disaster avoidance model, in which an unhealthy data center can be taken out of rotation and its traffic redistributed to the healthy data centers within minutes, with virtually no visible impact on users.

As LinkedIn transitioned from large monolithic applications to microservices, it became difficult to determine the capacity constraints of individual services and whether they could handle extra load during disaster scenarios. Stress testing individual services with artificial load in a complex microservices architecture did not provide enough confidence in a data center’s capacity. To solve this problem, LinkedIn shifts live traffic between data centers every business day, putting site-wide load on services to simulate a disaster.
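To make the capacity question concrete, here is a minimal sketch of how much extra traffic each healthy data center absorbs when one is taken out of rotation. The equal 25% starting shares, the data center names, and the proportional redistribution rule are assumptions for illustration only; the talk does not state how LinkedIn actually splits the traffic.

```python
def redistribute(shares, offline):
    """Return new traffic shares after taking `offline` out of rotation.

    Assumption: the removed share is spread across the remaining data
    centers in proportion to the share each one already carries.
    """
    if offline not in shares:
        raise KeyError(offline)
    remaining = {dc: s for dc, s in shares.items() if dc != offline}
    total = sum(remaining.values())
    if total <= 0:
        raise ValueError("no healthy data center left to absorb traffic")
    return {dc: s / total for dc, s in remaining.items()}


# Hypothetical starting point: four data centers with equal shares.
shares = {"DC1": 0.25, "DC2": 0.25, "DC3": 0.25, "DC4": 0.25}
after = redistribute(shares, "DC1")

# Relative increase in load on one surviving data center.
extra_load = after["DC2"] / shares["DC2"] - 1
```

Under these assumptions each surviving data center goes from a quarter to a third of site traffic, roughly a one-third increase in load, which is the kind of headroom a daily traffic shift is meant to validate.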

Transcript

  1. Building Disaster Recovery via Resilience Engineering. Michael Kehoe, Staff SRE, LinkedIn
  2. Tonight’s agenda: 1 Introductions 2 What is Resilience Engineering 3 The Problem Statement 4 Project Overview 5 Testing Process 6 Project Outcomes 7 Key Takeaways 8 Q&A
  3. Introduction
  4. Michael Kehoe /USR/BIN/WHOAMI • Staff Site Reliability Engineer @ LinkedIn • Production-SRE Team • Funny accent = Australian + 4 years American • Former Network Engineer at the University of Queensland
  5. Who are we? PRODUCTION-SRE TEAM AT LINKEDIN • Disaster Recovery Planning and Automation • Incident Response and Automation • Visibility Engineering • Reliability Principles
  6. LinkedIn EVOLUTION OF THE INFRASTRUCTURE 2003 2010 2011 2013 2014 2015 Active & Passive Active & Active Multi-colo 3-way Active & Active Multi-colo n-way Active & Active
  7. LinkedIn 2018 • 4 Data Centers • 21 PoPs • 1000+ services
  8. What is Resilience Engineering?
  9. What is Resilience Engineering? • Projects that directly demand increased resilience from our applications and infrastructure. • Application Injection Failure • Infrastructure Injection Failure • Full Disaster-Recovery Tests
  10. Problem Statement
  11. How often have you heard stories where someone thought they had a disaster strategy, never tested it and it fails when you need it the most?
  12. Problem Statement • How do we ensure that we always have disaster recovery ability without incident? • How do we consistently test for disaster recovery ability without disrupting the company?
  13. Project Overview
  14. Project Overview • Build a process (with automation) to facilitate disaster recovery • Operate the process on a regular cadence • Provide reporting on outcomes of tests to engineering executives
  15. Testing Process
  16. What is Load Testing? • 5x a week • Peak hour traffic • Fixed SLA
  17. LinkedIn Traffic-Tier Border Router IPVS ATS ATS Frontend EDGE FABRIC Stickyrouting
  18. LinkedIn Traffic-Tier Fabric Buckets (diagram showing buckets numbered 1 through 100)
  19. LinkedIn Traffic-Tier EDGE FABRIC DC1 DC2 DC1 in Cookie Got DC2 as secondary fabric Gets secondary fabric for user Stickyrouting
  20. TrafficShift Architecture Web application Salt master Stickyrouting Service Couchbase Backend Worker Processes FABRIC BUCKETS
  21. Load Testing FABRIC DC3 DC1 DC2 60% Traffic Percentage
  22. Load Testing
  23. Project Outcomes
  24. Benefits of Load-testing • Capacity Planning • Identify Bugs • Confidence
  25. Benefits of Load-testing CAPACITY PLANNING • Through this process, we continuously validate our infrastructure capacity • This is the best signal we can possibly get since we’re simulating a real disaster
  26. Benefits of Load-testing IDENTIFY BUGS • Some bugs are only found at high load (under duress) • Helps find inefficiencies that otherwise may not be found until it’s too late • Gives us clues on how to make our code more resilient to potential failure
  27. Benefits of Load-testing CONFIDENCE • Through load-testing, we’ve built confidence in our disaster recovery strategy • We understand exactly: • What process to follow • How long it takes to avert disaster • What the risks associated with a disaster incident are
  28. Key Takeaways
  29. Key Takeaways • Resilience Engineering is a must for LinkedIn • Design infrastructure to facilitate disaster recovery • Disaster-test regularly to avoid surprises • Automate your testing/process to reduce engagement time
  30. Q&A

Editor's Notes

  • Anil

    TrafficShift is a two-part application. A web application gives engineers an easy way to create planned and emergency offline plans.

    We use Couchbase as our key/value persistence store.

    Python backend worker processes talk to the Salt master via the Salt API
    and instruct the Stickyrouting service to turn buckets online and offline.
    We use this toolset to run load tests, or stress tests, of our data centers.


    Uff, that’s a lot of talk about how to mitigate issues with a traffic shift. But if you look closely, we are already migrating live traffic across data centers, so why not use the same mechanism to stress test a data center? How awesome is that? We don’t stress test a single service; we stress the whole system. I am going to talk about load testing next.
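The bucket mechanism described above can be sketched roughly as follows. This is an illustrative model, not LinkedIn's Stickyrouting implementation: the hash-based bucket assignment, the `Fabric` class, and the `route` function are all hypothetical names and choices. The only details taken from the talk are that there are 100 buckets, that each user has a primary and a secondary fabric, and that buckets can be turned online and offline.

```python
import hashlib

NUM_BUCKETS = 100  # the slides show fabric buckets numbered 1 to 100


def bucket_for(member_id: str) -> int:
    """Map a member to a bucket in 1..100 (hashing is an assumption)."""
    digest = hashlib.sha256(member_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS + 1


class Fabric:
    """Hypothetical model of one data center and its offline buckets."""

    def __init__(self, name: str):
        self.name = name
        self.offline_buckets = set()

    def set_offline(self, buckets):
        self.offline_buckets.update(buckets)

    def set_online(self, buckets):
        self.offline_buckets.difference_update(buckets)


def route(member_id: str, primary: Fabric, secondary: Fabric) -> str:
    """Serve from the primary fabric unless the member's bucket is offline there."""
    if bucket_for(member_id) in primary.offline_buckets:
        return secondary.name
    return primary.name


dc1, dc2 = Fabric("DC1"), Fabric("DC2")
bucket = bucket_for("member-42")

before = route("member-42", dc1, dc2)   # primary serves the member
dc1.set_offline({bucket})
during = route("member-42", dc1, dc2)   # member moves to the secondary fabric
dc1.set_online({bucket})
restored = route("member-42", dc1, dc2)  # member returns to the primary
```

Because each bucket represents roughly 1% of members in this model, turning a chosen number of buckets offline moves a predictable slice of traffic, which is what makes the same tool usable for both mitigation and stress testing.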
  • Anil

    As you can see, by turning a precise number of buckets offline in US-West and US-East, we can reroute that extra traffic to the target data center.

    We do this in a controlled manner, in steps, until the threshold level of 50% is reached. If an alert fires for any reason during the stress test, our TrafficShift tool acknowledges it, automatically rebalances the site traffic, and sends the stress test report to the SREs.
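The stepped ramp with an automatic abort can be sketched as a small control loop. This is a simplified illustration, not the TrafficShift code: `set_traffic_pct` and `alert_firing` are hypothetical callbacks, and the 25% baseline, 5% step size, and the rebalance after a successful run are assumptions. Only the 50% threshold and the rebalance-and-report behavior on an alert come from the notes above.

```python
def run_stress_test(set_traffic_pct, alert_firing,
                    baseline_pct=25, target_pct=50, step_pct=5):
    """Ramp the target data center's traffic share in steps.

    set_traffic_pct(pct): hypothetical hook that shifts traffic so the
        target data center serves `pct` percent of the site.
    alert_firing(): hypothetical hook that returns True if an alert fired.
    """
    report = {"steps": [], "aborted": False}
    pct = baseline_pct
    while pct < target_pct:
        pct = min(pct + step_pct, target_pct)
        set_traffic_pct(pct)
        report["steps"].append(pct)
        if alert_firing():
            # Stop ramping as soon as an alert fires.
            report["aborted"] = True
            break
    # Rebalance site traffic, whether the test completed or was aborted.
    set_traffic_pct(baseline_pct)
    report["peak_pct"] = max(report["steps"]) if report["steps"] else baseline_pct
    return report


# Usage: simulate an alert firing on the third step.
applied = []
alerts = iter([False, False, True])
report = run_stress_test(applied.append, lambda: next(alerts))
```

In this run the ramp stops at 40%, traffic is returned to the baseline, and the returned dictionary plays the role of the stress test report sent to the SREs.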