
SouthBay SRE Meetup Jan 2016


  1. Michael Kehoe, Senior Site Reliability Engineer, LinkedIn. SouthBay SRE Meetup: LinkedIn Traffic Shifting
  2. $ whoami: Michael Kehoe • Sr Site Reliability Engineer (SRE) • Member of PROD-SRE • https://www.linkedin.com/in/michaelkkehoe
  3. LinkedIn Multicolo History
  4. What is a Traffic Shift? • Edge (PoP) shifts • Datacenter load shifts • Single Master failovers
  5. Why do we do traffic shifts? • To mitigate user impact from problems with a 3rd-party provider or LinkedIn's infrastructure/services • To validate Disaster Recovery (DR) in case of any datacenter failure • To validate and test capacity headroom across our datacenters • To expose bugs and suboptimal configurations by load testing one or more datacenters • To perform planned maintenance • To validate and exercise the traffic shift automation
  6. Traffic shifting: how do we do it?
  7. Edge traffic shifts: how does it work? • We use IPVS to load balance at our edges • We can withdraw anycast routes to remove traffic from a PoP • DNS providers test health checks on our edge proxies to verify whether a PoP is in rotation • We can fail those health checks to remove unicast traffic from a PoP
  8. Edge traffic shifts
  9. Datacenter traffic shifts: how does it work? • Different traffic types are partitioned and controlled separately: logged-in vs logged-out, CDN, monitoring, microsites • Logged-in users are placed into 'buckets' and have primary/secondary datacenter assignments • Buckets are marked online/offline to move site traffic
  10. Mitigating impact: what a traffic shift looks like
  11. Load testing: how do we do it?
  12. Load testing: how do we do it? (continued)
  13. Single Master failover: how does it work? • Only used in extreme cases • Leverages distributed locking in Apache ZooKeeper • Single-master services have a Spring component that checks the mastership of the service in a particular datacenter
  14. Single Master failover: how does it work? (continued)
  15. Conclusion • The best way to prepare for a disaster is to practice one regularly! • Tooling and automation are your best friends during an outage • Capacity planning/management is extremely important
  16. Questions? Thank you
  17. ©2014 LinkedIn Corporation. All Rights Reserved.

Editor's Notes

  • Prior to 2010: running out of Chicago only (ECH3)
    2010: completion of our second production data center (ELA4); formal disaster recovery strategy
    2011: site was not serving traffic; maintaining active/passive was not easy, and recovery from a true disaster would not have been easy
    2013: built LVA1; re-architected services to be (mostly) multi-master; invested in how to recover from a disaster/service outage quickly
    2014: started multi-colo load testing; Single Master Failover 1; built LTX1
    2015: Single Master Failover 2; started LSG1
    2016: ramp LSG1; LOR1 (NextGen DC design)

  • Edge (PoP) shifts. LinkedIn currently operates 12 PoPs around the world, with more on the way, that help improve page load times for our users. These PoPs give LinkedIn more flexibility about where a user enters our network and also give us added redundancy in the case of an outage. We work with our DNS providers to direct users to an appropriate PoP or to alter the flow of traffic to each PoP. See Ritesh Maheshwari's post for more details on this approach.

    Data Center Load Shifts. From each PoP, we direct traffic to a specific data center. Logged-in users are assigned to a specific data center by default, but during a traffic shift, we can instruct the PoPs to reroute any portion of traffic to one or more different data centers.

    Single master failovers. Some of our legacy services have not been fully migrated to a multi-data center architecture, and operate in single master mode in one data center. This includes both user-facing services and back-end services whose traffic may not be directly related to user page views. When performing maintenance, addressing site issues, or exploring capacity issues, we must also take these single master services into account. Although some of these legacy services require special attention in these situations, many have been converted to a "fast-failover" mechanism that allows us to switch masters between data centers in seconds, with no downtime. Being able to move these master services around at will also allows us to balance that part of the load between data centers.



    Edge
    LinkedIn currently has 12 PoPs around the world
    These help improve page load times for our users
    Give flexibility on where a user enters our network
    Give us extra redundancy
    See Ritesh Maheshwari's blog post (a rough health-check sketch follows at the end of this note)

    Fabric
    We assign users to a specific datacenter, but during a traffic shift we can instruct the PoPs to reroute users to other datacenters to increase the load

    Single Master Failovers
    Some older, more complicated services have not been fully migrated to a multi-datacenter architecture. They operate in single-master mode in one datacenter.
    These services may not be directly related to user page views but are still important to the running of the site.
    We have converted most of these services to fast-failover, which allows us to change the mastership between datacenters in seconds with no downtime.
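
    A rough sketch of the "fail the health check" idea from the Edge notes above (slide 7). The endpoint path, port, and flag file below are hypothetical illustrations, not LinkedIn's actual tooling:

    import os
    from http.server import BaseHTTPRequestHandler, HTTPServer

    DISABLE_FLAG = "/etc/pop_disabled"  # operator touches this file to drain the PoP

    class HealthCheckHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/admin/healthcheck":
                self.send_response(404)
                self.end_headers()
                return
            if os.path.exists(DISABLE_FLAG):
                # Deliberately fail the check so DNS providers stop sending
                # unicast traffic to this PoP.
                self.send_response(503)
            else:
                self.send_response(200)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), HealthCheckHandler).serve_forever()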

  • To mitigate user impact from problems with a 3rd party provider or LinkedIn’s infrastructure
    To validate Disaster Recovery (DR) in case of any datacenter failure
    To validate and test capacity headroom across our datacenters
    To expose bugs and suboptimal configurations by load testing one or more datacenters
    To perform planned maintenance
    To validate and exercise the traffic shift automation
  • The traffic shifting process is orchestrated by a system we developed internally, which is designed to make the load shift process hands-off.

    The portal gives a holistic view of the site
  • The following graphs illustrate the traffic-shift process – from last night

    Here we see the number of buckets online for each data center. At roughly 6PM, we progressively marked 100 buckets offline, and then ramped them back online gradually.

    The following graph shows actual measured request percentage for one segment of our traffic. The pattern corresponds to the bucket graph, and shows traffic going to zero in one data center, and being redistributed to the other two.

    Our ability to offline a data center with as little member impact as possible is one of our top priorities. We perform weekly load tests to validate the process and guarantee we can offline a colo successfully with minimal member impact. We load test by shifting a percentage of traffic to a targeted data center and evaluating sustainability.
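
    Slide 9 and the bucket graph described above rely on logged-in members being "sticky" to a bucket that has a primary/secondary data center assignment. A toy illustration only (the bucket count, hash scheme, and data center mapping below are invented, not LinkedIn's sticky-routing service):

    import hashlib

    NUM_BUCKETS = 100
    DATACENTERS = ["ech3", "ela4", "lva1"]                 # example site names
    offline_buckets = {dc: set() for dc in DATACENTERS}    # buckets marked offline per DC

    def bucket_for(member_id: int) -> int:
        # Sticky: a member always hashes to the same bucket.
        digest = hashlib.md5(str(member_id).encode()).hexdigest()
        return int(digest, 16) % NUM_BUCKETS

    def datacenter_for(member_id: int) -> str:
        bucket = bucket_for(member_id)
        primary = DATACENTERS[bucket % len(DATACENTERS)]
        secondary = DATACENTERS[(bucket + 1) % len(DATACENTERS)]
        # A bucket marked offline in its primary DC fails over to its secondary.
        return secondary if bucket in offline_buckets[primary] else primary

    # Draining a data center amounts to progressively marking its buckets offline.
    offline_buckets["lva1"].update(range(NUM_BUCKETS))
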
  • You simply schedule a load test and the system does the rest.

    A load test is preceded by a series of email notifications starting several hours prior to shifting traffic. At the designated time, the system starts shifting traffic to the targeted data center by offlining buckets from the remaining data centers.

    The manipulation of these buckets is facilitated by underlying libraries that interface with our “sticky routing” service developed in-house. The system has a feedback loop that uses our alerting system to check for any errors potentially triggered by the traffic shift. If an alert is detected, the traffic shift automatically halts and issues notifications, allowing an engineer to manually inspect the reason for the alert and determine whether it is safe to proceed.
     
    The stress test period is reached once the system has successfully redirected the desired volume of traffic to the targeted data center. The stress test period is typically 1 1/2 hours during which we observe the impact the extra load is having on the data center.

    If impact is detected we immediately rebalance the load and begin investigating the source of the impact. We then work with service owners on reviewing their system and determine a solution. If the stress test period is completed without causing impact, the system rebalances traffic and is considered complete once the rebalance is finished.
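
    A rough, illustrative-only sketch of that feedback loop (mark_bucket_offline and active_alerts are stand-in stubs for the internal sticky-routing and alerting systems, not real APIs):

    import time

    def mark_bucket_offline(bucket: int, datacenter: str) -> None:
        print(f"offlining bucket {bucket} in {datacenter}")    # stub

    def active_alerts() -> list:
        return []                                              # stub: query the alerting system

    def shift_traffic(target_dc: str, source_dcs: list,
                      buckets_per_step: int = 5, pause_seconds: int = 300) -> bool:
        """Offline buckets in the source DCs step by step, halting if any alert fires."""
        for start in range(0, 100, buckets_per_step):
            for dc in source_dcs:
                for bucket in range(start, start + buckets_per_step):
                    mark_bucket_offline(bucket, dc)
            time.sleep(pause_seconds)          # let traffic settle before checking health
            alerts = active_alerts()
            if alerts:
                print(f"halting shift toward {target_dc}: {alerts}")
                return False                   # an engineer inspects before proceeding
        return True                            # desired volume reached; stress test begins
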
  • Our single master mechanism relies on Apache ZooKeeper to maintain the status of all single master services. On startup, all instances within a cluster of single master services check the value of a cluster master node in ZooKeeper. Each service determines whether or not it is a master based on the value stored in the cluster master node. All services also establish a watch on the cluster master node. When services accept master status, they create an ephemeral node in ZooKeeper that acts as a lockfile.
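
    A minimal sketch of that mechanism using the open-source kazoo ZooKeeper client. The znode paths, ZooKeeper hosts, and data center name are hypothetical, and this stands in for (rather than reproduces) LinkedIn's Spring-based component:

    from kazoo.client import KazooClient
    from kazoo.exceptions import NodeExistsError

    MASTER_PATH = "/services/example-service/master"      # stores the current master DC
    LOCK_PATH = "/services/example-service/master-lock"   # ephemeral "lockfile" held by the master
    MY_DATACENTER = "lva1"

    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
    zk.start()
    zk.ensure_path(MASTER_PATH)
    i_am_master = False

    def apply_mastership(master_dc: str) -> None:
        global i_am_master
        if master_dc == MY_DATACENTER and not i_am_master:
            try:
                # Accept mastership: the ephemeral node vanishes automatically if this
                # instance loses its ZooKeeper session, so a dead master cannot hold the lock.
                zk.create(LOCK_PATH, MY_DATACENTER.encode(), ephemeral=True, makepath=True)
                i_am_master = True
            except NodeExistsError:
                pass                           # another instance in this DC already holds it
        elif master_dc != MY_DATACENTER and i_am_master:
            zk.delete(LOCK_PATH)               # resign mastership and release the lock
            i_am_master = False

    @zk.DataWatch(MASTER_PATH)
    def on_master_change(data, stat):
        # Runs on startup and again whenever the cluster master node's value changes.
        apply_mastership(data.decode() if data else "")
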
  • We can perform a single master failover with a command line tool that handles all the communication with ZooKeeper instances along with the workflow.

    But in a disaster scenario, it can be useful to have an easy-to-use interface to this functionality, as well as a visual overview of the system state.

    This interface allows an engineer to fail over all of our "Fast Failover"-enabled single-master services with one click. The interface also integrates the LiX- and ZooKeeper-based mechanisms into a single location.
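
    In the same hypothetical terms, a one-shot failover command might look like the sketch below; this is not the actual tool, just the ZooKeeper interaction it would need (paths match the previous sketch):

    import sys
    import time
    from kazoo.client import KazooClient

    MASTER_PATH = "/services/example-service/master"
    LOCK_PATH = "/services/example-service/master-lock"

    def failover(new_master_dc: str, zk_hosts: str = "zk1:2181") -> None:
        zk = KazooClient(hosts=zk_hosts)
        zk.start()
        old = zk.get(MASTER_PATH)[0].decode()
        print(f"moving mastership: {old} -> {new_master_dc}")
        zk.set(MASTER_PATH, new_master_dc.encode())   # watches fire in every data center
        # "Fast failover": the new master should take the ephemeral lock within seconds.
        for _ in range(30):
            if zk.exists(LOCK_PATH) and zk.get(LOCK_PATH)[0].decode() == new_master_dc:
                print("failover complete")
                return
            time.sleep(1)
        print("timed out waiting for the new master; investigate before retrying")

    if __name__ == "__main__":
        failover(sys.argv[1])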

