SouthBay SRE Meetup Jan 2016

Architect of reliable, scalable infrastructure at LinkedIn
Jan. 28, 2016

Editor's Notes

  1. Prior to 2010: running out of Chicago only (ECH3).
     2010: completion of our second production data center (ELA4); formal disaster recovery strategy.
     2011: the second site was not serving traffic; maintaining active/passive was not easy, and recovery from a true disaster would not have been easy.
     2013: built LVA1; re-architected services to be (mostly) multi-master; invested in how to recover from a disaster or service outage quickly.
     2014: started multi-colo load testing; Single Master Failover 1; built LTX1.
     2015: Single Master Failover 2; started LSG1.
     2016: ramp LSG1; LOR1, our next-generation data center design.
  2. Edge (PoP) shifts: LinkedIn currently operates 12 PoPs around the world, with more on the way, that help improve page load times for our users. These PoPs give LinkedIn more flexibility about where a user enters our network and also give us added redundancy in the case of an outage. We work with our DNS providers to direct users to an appropriate PoP or to alter the flow of traffic to each PoP. See Ritesh Maheshwari's post for more details on this approach.
     Data center load shifts: From each PoP, we direct traffic to a specific data center. Logged-in users are assigned to a specific data center by default, but during a traffic shift we can instruct the PoPs to reroute any portion of traffic to one or more different data centers.
     Single master failovers: Some of our legacy services have not been fully migrated to a multi-data-center architecture and operate in single-master mode in one data center. This includes both user-facing services and back-end services whose traffic may not be directly related to user page views. When performing maintenance, addressing site issues, or exploring capacity issues, we must also take these single-master services into account. Although some of these legacy services require special attention in these situations, many have been converted to a "fast-failover" mechanism that allows us to switch masters between data centers in seconds, with no downtime. Being able to move these master services around at will also lets us balance that part of the load between data centers.
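
To make the data center load-shift idea concrete, here is a rough sketch of how a PoP-level router might pick a data center: honor a sticky default assignment unless a traffic shift has taken that data center offline. All names and weights here are hypothetical; this is an illustration of the concept, not LinkedIn's actual routing code.

```python
# Hypothetical sketch of PoP-level data center selection during a traffic shift.
import hashlib

DATA_CENTERS = ["ELA4", "LVA1", "LTX1"]

# During a traffic shift, operators publish per-data-center weights;
# weight 0 means "send no new traffic to this data center".
shift_overrides = {"ELA4": 0, "LVA1": 50, "LTX1": 50}

def default_assignment(member_id: int) -> str:
    """Sticky default: hash the member to a data center."""
    digest = hashlib.sha256(str(member_id).encode()).hexdigest()
    return DATA_CENTERS[int(digest, 16) % len(DATA_CENTERS)]

def pick_data_center(member_id: int) -> str:
    """Honor the default assignment unless a shift has taken it offline."""
    preferred = default_assignment(member_id)
    if shift_overrides.get(preferred, 100) > 0:
        return preferred
    # Redistribute to the remaining data centers in proportion to their weights.
    candidates = [(dc, w) for dc, w in shift_overrides.items() if w > 0]
    total = sum(w for _, w in candidates)
    slot = int(hashlib.sha256(f"shift:{member_id}".encode()).hexdigest(), 16) % total
    for dc, weight in candidates:
        if slot < weight:
            return dc
        slot -= weight
    return candidates[-1][0]

print(pick_data_center(12345))
```
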
  3. To mitigate user impact from problems with a third-party provider or LinkedIn's infrastructure
     To validate disaster recovery (DR) in case of any data center failure
     To validate and test capacity headroom across our data centers
     To expose bugs and suboptimal configurations by load testing one or more data centers
     To perform planned maintenance
     To validate and exercise the traffic shift automation
  4. The traffic-shifting process is orchestrated by a system we developed internally, designed to make the load shift process hands-off. The portal gives a holistic view of the site.
  5. The following graphs illustrate the traffic-shift process, from last night. The first shows the number of buckets online for each data center: at roughly 6 PM, we progressively marked 100 buckets offline, then ramped them back online gradually. The second graph shows the actual measured request percentage for one segment of our traffic. The pattern corresponds to the bucket graph, with traffic going to zero in one data center and being redistributed to the other two. Our ability to offline a data center with as little member impact as possible is one of our top priorities. We perform weekly load tests to validate the process and guarantee we can offline a colo successfully with minimal member impact. We load test by shedding a percentage of traffic onto a targeted data center and evaluating its sustainability.
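
For intuition about the second graph, the redistribution follows directly from the online bucket counts: a data center's expected share of traffic is roughly its online buckets divided by the total online buckets across all data centers. A tiny illustration (bucket counts here are made up):

```python
# Illustrative arithmetic only: expected traffic share per data center,
# derived from its online bucket count (numbers here are hypothetical).
def traffic_shares(online_buckets):
    total = sum(online_buckets.values())
    return {dc: count / total for dc, count in online_buckets.items()}

print(traffic_shares({"ELA4": 100, "LVA1": 100, "LTX1": 100}))  # balanced: one third each
print(traffic_shares({"ELA4": 0, "LVA1": 100, "LTX1": 100}))    # ELA4 offlined: 0%, 50%, 50%
```
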
  6. You simply schedule a load test and the system does the rest. A load test is preceded by a series of email notifications starting several hours prior to shifting traffic. At the designated time, the system starts shifting traffic to the targeted data center by offlining buckets in the remaining data centers. The manipulation of these buckets is facilitated by underlying libraries that interface with our "sticky routing" service, developed in-house. The system has a feedback loop that uses our alerting system to check for any errors potentially triggered by the traffic shift. If an alert is detected, the traffic shift automatically halts and issues notifications, allowing an engineer to manually inspect the reason for the alert and determine whether it is safe to proceed. The stress test period is reached once the system has successfully redirected the desired volume of traffic to the targeted data center. The stress test period is typically 1.5 hours, during which we observe the impact the extra load is having on the data center. If impact is detected, we immediately rebalance the load and begin investigating the source of the impact, then work with service owners to review their system and determine a solution. If the stress test period completes without causing impact, the system rebalances traffic and the test is considered complete once the rebalance is finished.
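
The control loop described in this note could be sketched roughly as follows. Every helper here (offline_bucket, active_alerts, rebalance, notify) is a hypothetical stand-in for internal LinkedIn systems; the sketch only shows the shape of the workflow: ramp, check alerts, hold for the stress window, then rebalance.

```python
# Hypothetical sketch of the automated load test control loop described above.
# The helpers below are stand-ins for internal systems, not real APIs.
import time

def offline_bucket(dc: str, bucket: int) -> None:
    print(f"offlining bucket {bucket} in {dc}")

def active_alerts() -> list:
    return []  # stub: would query the alerting system for shift-related alerts

def rebalance(dcs: list) -> None:
    print(f"rebalancing traffic back across {dcs}")

def notify(message: str) -> None:
    print(f"[loadtest] {message}")  # stub: would send email notifications

STRESS_TEST_SECONDS = 90 * 60  # the notes mention a roughly 1.5 hour stress window

def run_load_test(target_dc, other_dcs, buckets_to_offline, pause=60):
    notify(f"Load test targeting {target_dc} starting")
    # Shift traffic toward the target by offlining buckets in the other data centers.
    for bucket in buckets_to_offline:
        for dc in other_dcs:
            offline_bucket(dc, bucket)
        # Feedback loop: halt immediately if the alerting system fires.
        if active_alerts():
            notify("Halting load test; an engineer reviews the alerts before proceeding")
            return False
        time.sleep(pause)

    # Stress test period: hold the extra load and watch for member impact.
    deadline = time.time() + STRESS_TEST_SECONDS
    while time.time() < deadline:
        if active_alerts():
            notify("Impact detected; rebalancing immediately")
            rebalance(other_dcs)
            return False
        time.sleep(pause)

    rebalance(other_dcs)  # restore the normal traffic distribution
    notify("Load test complete")
    return True

# Example (stubbed dry run): shift toward LTX1 by offlining buckets 1-3 elsewhere.
# run_load_test("LTX1", ["ELA4", "LVA1"], buckets_to_offline=[1, 2, 3], pause=1)
```
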
  7. Our single-master mechanism relies on Apache ZooKeeper to maintain the status of all single-master services. On startup, all instances within a cluster of single-master services check the value of a cluster master node in ZooKeeper. Each service determines whether or not it is the master based on the value stored in the cluster master node. All services also establish a watch on the cluster master node. When a service accepts master status, it creates an ephemeral node in ZooKeeper that acts as a lockfile.
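
A minimal sketch of that pattern using the open-source kazoo ZooKeeper client is shown below; the znode paths, data center names, and connection string are hypothetical, and this is not LinkedIn's internal implementation.

```python
# Minimal sketch of the single-master pattern described above, using the
# open-source kazoo ZooKeeper client. Paths and names are hypothetical.
from kazoo.client import KazooClient

CLUSTER_MASTER_NODE = "/singlemaster/example-service/master"  # stores which DC is master
LOCK_NODE = "/singlemaster/example-service/lock"              # ephemeral "lockfile"
MY_DATACENTER = "LVA1"

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()
zk.ensure_path("/singlemaster/example-service")

def accept_mastership():
    # The ephemeral node vanishes if this instance loses its ZooKeeper session,
    # so it acts as the lockfile for the current master.
    if not zk.exists(LOCK_NODE):
        zk.create(LOCK_NODE, MY_DATACENTER.encode(), ephemeral=True)
    print("accepted master status")

@zk.DataWatch(CLUSTER_MASTER_NODE)
def on_master_change(data, stat):
    # Every instance watches the cluster master node; a failover just updates
    # its value and each instance re-evaluates its role.
    if data is None:
        return
    if data.decode() == MY_DATACENTER:
        accept_mastership()
    else:
        print("running as non-master")
```
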
  8. We can perform a single-master failover with a command-line tool that handles all the communication with ZooKeeper instances along with the workflow. But in a disaster scenario, it can be useful to have an easy-to-use interface to this functionality, as well as a visual overview of the system state. This interface allows an engineer to fail over all of our "Fast Failover"-enabled single-master services with one click. The interface also integrates the LiX and ZooKeeper-based mechanisms into a single location.
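
For illustration, a one-shot command-line front end to that kind of failover might look like the sketch below; the znode layout and the set_master helper are hypothetical and simply continue the kazoo example above, not the internal tool.

```python
# Rough sketch of a failover CLI; set_master() is a hypothetical stand-in for
# the ZooKeeper workflow, not LinkedIn's internal tooling.
import argparse
from kazoo.client import KazooClient

def set_master(zk: KazooClient, service: str, new_dc: str) -> None:
    """Point the service's cluster master node at the new data center."""
    path = f"/singlemaster/{service}/master"
    zk.ensure_path(path)
    zk.set(path, new_dc.encode())

def main():
    parser = argparse.ArgumentParser(description="Fail over single-master services")
    parser.add_argument("--zk", default="zk1:2181", help="ZooKeeper connection string")
    parser.add_argument("--to", required=True, help="target data center, e.g. LTX1")
    parser.add_argument("services", nargs="+", help="services to fail over")
    args = parser.parse_args()

    zk = KazooClient(hosts=args.zk)
    zk.start()
    for service in args.services:
        set_master(zk, service, args.to)
        print(f"{service}: mastership moved to {args.to}")
    zk.stop()

if __name__ == "__main__":
    main()
```
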