SouthBay SRE Meetup Jan 2016


LinkedIn Traffic Shifting

Published in: Engineering

  1. Michael Kehoe, Senior Site Reliability Engineer, LinkedIn. SouthBay SRE Meetup: LinkedIn Traffic Shifting
  2. $ whoami: Michael Kehoe
       • Sr Site Reliability Engineer (SRE)
       • Member of PROD-SRE
  3. LinkedIn Multicolo History
  4. What is a Traffic Shift?
       • Edge (PoP) shift
       • Datacenter load shift
       • Single-master failover
  5. Why do we do traffic shifts?
       • To mitigate user impact from problems with a 3rd party provider or LinkedIn’s infrastructure/services
       • To validate Disaster Recovery (DR) in case of any datacenter failure
       • To validate and test capacity headroom across our datacenters
       • To expose bugs and suboptimal configurations by load testing one or more datacenters
       • To perform planned maintenance
       • To validate and exercise the traffic shift automation
  6. Traffic shifting: How do we do it?
  7. Edge Traffic shifts: How does it work?
       • We use IPVS to load balance at our edges
       • We can withdraw anycast routes to remove traffic from that PoP
       • DNS providers probe health checks on our edge proxy to verify whether that PoP is in rotation
       • We can fail those health checks to remove unicast traffic from that PoP
  8. Edge Traffic shifts
  9. Datacenter Traffic shifts: How does it work?
       • Different traffic types are partitioned and controlled separately
           • Logged-in vs Logged-out
           • CDN
           • Monitoring
           • Microsites
       • Logged-in users are placed into ‘buckets’ and have primary/secondary datacenter assignments
       • Buckets are marked online/offline to move site traffic
  10. Mitigating Impact: What a traffic shift looks like
  11. Load testing: How do we do it?
  12. Load testing: How do we do it?
  13. Single Master Failover: How does it work?
       • Only used in extreme cases
       • Leverages distributed locking in Apache ZooKeeper
       • Single-master services have a Spring component that checks the mastership of the service in a particular datacenter
  14. Single Master Failover: How does it work?
  15. Conclusion
       • The best way to prepare for a disaster is to practice one regularly!
       • Tooling and automation are your best friends during an outage
       • Capacity planning/management is extremely important
  16. Questions? Thank You
  17. ©2014 LinkedIn Corporation. All Rights Reserved.
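
Slide 7 notes that a PoP can be taken out of rotation by deliberately failing the health check that DNS providers probe. The following is a minimal Python sketch of that idea, not LinkedIn's implementation. The `/health` path, the `PopState` class, the `in_rotation` switch, and the choice of HTTP 503 as the failing response are all assumptions made for illustration.

```python
from dataclasses import dataclass
from http.server import BaseHTTPRequestHandler


@dataclass
class PopState:
    """Hypothetical in-memory state for one PoP's edge proxy."""
    proxy_healthy: bool = True  # can the proxy itself serve traffic?
    in_rotation: bool = True    # operator switch flipped during a traffic shift


def health_status(state: PopState) -> int:
    """Return the HTTP status a health probe should see (assumed: 200 or 503)."""
    if state.proxy_healthy and state.in_rotation:
        return 200
    return 503


# Shared state for the handler below; a real deployment would persist this.
STATE = PopState()


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /health with the status computed from STATE."""

    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        self.send_response(health_status(STATE))
        self.end_headers()
```

Setting `STATE.in_rotation = False` makes every probe see a failing status, which is the "fail the health check" step from the slide. To try the handler locally, it can be served with `http.server.HTTPServer(("", 8080), HealthHandler).serve_forever()`; the port is arbitrary.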
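
Slide 9 describes logged-in users being placed into buckets that carry primary/secondary datacenter assignments, with buckets marked online or offline to move traffic. A rough Python sketch of that scheme might look like the following. The hashing function, the datacenter names, and the rule that an offline bucket is served from its secondary datacenter are assumptions; the slides do not specify them.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class Bucket:
    primary: str         # datacenter that normally serves this bucket
    secondary: str       # datacenter used when the bucket is offline (assumed)
    online: bool = True


def bucket_for(member_id: int, num_buckets: int) -> int:
    """Map a member to a bucket with a stable hash (assumed scheme)."""
    digest = hashlib.sha256(str(member_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def route(member_id: int, buckets: List[Bucket]) -> str:
    """Return the datacenter that should serve this member right now."""
    bucket = buckets[bucket_for(member_id, len(buckets))]
    return bucket.primary if bucket.online else bucket.secondary


def shift_away(buckets: List[Bucket], datacenter: str) -> int:
    """Mark every online bucket whose primary is `datacenter` as offline.

    Returns the number of buckets moved.
    """
    moved = 0
    for bucket in buckets:
        if bucket.primary == datacenter and bucket.online:
            bucket.online = False
            moved += 1
    return moved
```

For example, with buckets whose primaries are spread over hypothetical datacenters `dc-a`, `dc-b`, and `dc-c`, calling `shift_away(buckets, "dc-a")` leaves `route` returning only `dc-b` or `dc-c`, provided no bucket lists `dc-a` as its secondary. Marking buckets offline a few at a time would give a gradual shift rather than an all-at-once move.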
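
Slide 13 says single-master services rely on distributed locking in ZooKeeper and on a component that checks mastership per datacenter. The Python sketch below illustrates only the shape of that check. `LockService` is an in-memory stand-in for a distributed lock service, not a ZooKeeper client, and `MastershipCheck`, the `/mastership/...` path layout, and the datacenter names are hypothetical.

```python
import threading
from typing import Dict, Optional


class LockService:
    """In-memory stand-in for a distributed lock service such as ZooKeeper."""

    def __init__(self) -> None:
        self._holders: Dict[str, str] = {}
        self._mutex = threading.Lock()

    def try_acquire(self, path: str, owner: str) -> bool:
        """Take the lock if it is free; True if `owner` holds it afterwards."""
        with self._mutex:
            return self._holders.setdefault(path, owner) == owner

    def release(self, path: str, owner: str) -> None:
        """Release the lock, but only if `owner` currently holds it."""
        with self._mutex:
            if self._holders.get(path) == owner:
                del self._holders[path]

    def holder(self, path: str) -> Optional[str]:
        with self._mutex:
            return self._holders.get(path)


class MastershipCheck:
    """Hypothetical per-datacenter check for a single-master service."""

    def __init__(self, locks: LockService, service: str, datacenter: str) -> None:
        self.locks = locks
        self.path = f"/mastership/{service}"  # assumed path layout
        self.datacenter = datacenter

    def is_master(self) -> bool:
        return self.locks.holder(self.path) == self.datacenter

    def claim(self) -> bool:
        return self.locks.try_acquire(self.path, self.datacenter)

    def resign(self) -> None:
        self.locks.release(self.path, self.datacenter)
```

In this sketch a failover is the current master calling `resign()` and the instance in the target datacenter calling `claim()`. Because only one owner can hold the lock at a time, at most one datacenter reports `is_master()` as true.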