
SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale



LinkedIn has evolved from serving live traffic out of one data center to four geographically distributed data centers. Serving live traffic from four data centers at the same time has taken us from a disaster recovery model to a disaster avoidance model, where we can take an unhealthy data center out of rotation and redistribute its traffic to the healthy data centers within minutes, with virtually no visible impact to users.

As we transitioned from a big monolithic application to micro-services, we found it painful to determine the capacity constraints of individual services when they had to handle extra load during disaster scenarios. Stress testing individual services with artificial load in a complex micro-services architecture wasn’t sufficient to provide enough confidence in a data center’s capacity. To solve this problem, we at LinkedIn leverage live traffic to stress services site-wide by shifting traffic to simulate a disaster load.

This talk provides details on how LinkedIn uses Traffic Shifts to mitigate user impact by migrating live traffic between its data centers, and to stress test site-wide services for improved capacity handling and member experience.

Published in: Engineering


  1. 1. TrafficShift - Avoiding Disasters at Scale Michael Kehoe Staff SRE LinkedIn Anil Mallapur SRE LinkedIn
  2. 2. Overview LinkedIn Architectural Overview Fabric Disaster Recovery Questions
  3. 3. 467+ million members World’s largest professional network 200+ Countries
  4. 4. Who are we ? Production-SRE team at LinkedIn ● Assist in restoring stability to services during site critical issues ● Developing applications to improve MTTD and MTTR ● Provide direction and guidelines for site monitoring ● Build tools for efficient site issue troubleshooting, issue detection & correlation
  5. 5. Terminologies ● Fabric/Colo: Data center with full application stack deployed ● PoP/Edge: Entry point to LinkedIn network (TCP/SSL termination) ● Load Test: Planned stress testing of data centers
  6. 6. 2003 2010 2011 2013 2014 2015 Active & Passive Active & Active Multi-colo 3-way Active & Active Multi-colo n-way Active & Active
  7. 7. 2017 4 Data Centers 13 PoPs 1000+ services
  8. 8. What are Disasters ? Service Degradation Infrastructure Issues Human Error Data Center on Fire
  9. 9. One solution for all disasters TrafficShift - Reroute user traffic to different datacenters without any user interruption.
  10. 10. Whaaaat ?
  11. 11. Border Router IPVS ATS EDGE ATS Frontend FABRIC Stickyrouting Service Internet
  12. 12. ATS Request Stickyrouting Service Gets primary colo for user If not cookie in header DC1 in cookie DC1 DC2 Got DC2 as primary colo for user FABRIC EDGE
  13. 13. US-East 1 2 3 10 91 92 93 100 BUCKETS FABRIC Stickyrouting
  14. 14. How StickyRouting assigns users to a colo ? Capacity of Fabric Offline job to assign colo to users Geographic distance to users
  15. 15. Advantages of sticky routing Less latency for users Store data where it’s necessary Provides precise control over capacity allotment
  16. 16. When to TrafficShift ? Impact Mitigation Planned Maintenance Stress Test
  17. 17. Site Traffic and Disaster Recovery US-West US-Central US-East APAC EDGE 0% Distributed Load 50% Distributed Load 50% Distributed Load 0% Distributed Load Traffic stops being served to offline fabrics Traffic is shifted to online fabrics
  18. 18. TrafficShift Architecture Web application Salt master Stickyrouting Service Couchbase Backend Worker Processes FABRIC BUCKETS
  19. 19. What is Load Testing ? 3 times a week Peak hour traffic Fixed SLA USW Target Data Center USW
  20. 20. Load Testing FABRIC Target US-West US-East 50% Traffic Percentage
  21. 21. Benefits of Load Test Capacity Planning Leverage production traffic to stress test services Identify bugs in production Confidence in Disaster Recovery
  22. 22. Big Red Button Kill switch (No Kidding) Failout of a datacenter and PoP in less than 10 minutes Minimal user impact
  23. 23. Key Takeaways ● Design infrastructure to facilitate disaster recovery ● Stress test regularly to avoid surprises ● Automate everything to reduce time to mitigate impact
  24. 24. Questions
  25. 25. Edge Failout
  26. 26. Edge Presence
  27. 27. LinkedIn’s PoP Architecture • Using IPVS - Each PoP announces a unicast address and a regional anycast address • APAC, EU and NAMER anycast regions • Use GeoDNS to steer users to the ‘best’ PoP • DNS will either provide users with an anycast or unicast address for • US and EU member traffic is nearly all anycast • APAC is all unicast
  28. 28. LinkedIn’s PoP DR • Sometimes need to fail out of PoPs • 3rd party provider issues (e.g. transit links going down) • Infrastructure maintenance • Withdraw anycast route announcements • Fail healthchecks on proxy to drain unicast traffic
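
Slides 12 to 17 describe members grouped into buckets, each bucket pinned to a primary fabric by the Stickyrouting service, and traffic being shifted to online fabrics when one fabric goes offline. The following is a minimal Python sketch of that idea, not LinkedIn's implementation. The modulo mapping from member to bucket, the function names `bucket_for`, `route`, and `fail_out`, and the least-loaded rebalancing rule are all assumptions made for illustration; the slides only state that an offline job assigns colos based on fabric capacity and geographic distance.

```python
from collections import Counter

NUM_BUCKETS = 100  # slide 13 shows buckets numbered 1..100


def bucket_for(member_id: int) -> int:
    """Map a member to a bucket in 1..NUM_BUCKETS.

    Assumption: a plain modulo stands in for the real (unpublished) mapping.
    """
    return member_id % NUM_BUCKETS + 1


def route(member_id: int, assignment: dict[int, str]) -> str:
    """Return the primary fabric for a member via its bucket."""
    return assignment[bucket_for(member_id)]


def fail_out(assignment: dict[int, str], offline: str) -> dict[int, str]:
    """Return a new bucket -> fabric map with the `offline` fabric drained.

    Each bucket owned by the offline fabric is moved to whichever online
    fabric currently holds the fewest buckets (hypothetical policy; ties
    are broken by fabric name). Buckets on other fabrics are not touched.
    """
    online = sorted(set(assignment.values()) - {offline})
    if not online:
        raise ValueError("no online fabric left to absorb traffic")

    # Current bucket count per online fabric.
    load = Counter(f for f in assignment.values() if f != offline)

    shifted = dict(assignment)
    for bucket in sorted(assignment):
        if assignment[bucket] == offline:
            target = min(online, key=lambda f: (load[f], f))
            shifted[bucket] = target
            load[target] += 1
    return shifted


if __name__ == "__main__":
    fabrics = ["US-West", "US-Central", "US-East", "APAC"]
    # Hypothetical starting point: buckets spread evenly over four fabrics.
    assignment = {b: fabrics[(b - 1) % 4] for b in range(1, NUM_BUCKETS + 1)}
    after = fail_out(assignment, "US-East")
    print(Counter(after.values()))
```

With 100 buckets split evenly over four fabrics, failing out US-East in this sketch leaves the three remaining fabrics with 33 or 34 buckets each. Members whose bucket was not on the drained fabric keep their primary fabric, which matches the stated goal of shifting traffic with minimal user impact. A real policy would weight the move by each fabric's spare capacity rather than by bucket count alone.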