TrafficShift: Avoiding Disasters at Scale
Jeff Weiner, Chief Executive Officer
Michael Kehoe, Staff SRE
Anil Mallapur, Sr. SRE
Today’s agenda
1. Introductions
2. Evolution of the Infrastructure
3. Planning for Disaster
4. LinkedIn Traffic-Tier
5. TrafficShift
6. Load Testing
7. Q&A
Key Takeaways
• Design infrastructure to facilitate disaster recovery
• Test regularly
• Automate everything
Introductions
World’s largest professional network
• Largest global network of professionals
• 500+M members
• Serving users worldwide
• 200+ countries
Who are we?
PRODUCTION-SRE TEAM AT LINKEDIN
• Assist in restoring stability to services during site-critical issues
• Develop applications to improve MTTD and MTTR
• Provide direction and guidelines for site monitoring
• Build tools for efficient site-issue detection, correlation & troubleshooting
Terminologies
• Fabric/Colo: a data center with the full application stack deployed
• PoP/Edge: an entry point to the LinkedIn network (TCP/SSL termination)
• Load Test: planned stress testing of data centers
Evolution of the Infrastructure
Timeline (2003-2017): Active & Passive → Active & Active → Multi-colo → 3-way Active & Active → Multi-colo n-way Active & Active
2017: 4 data centers, 13 PoPs, 1000+ services
Planning for Disaster
Why care about Disasters?
What are Disasters?
• Service degradation
• Infrastructure issues
• Human error
• Data center on fire
One Solution for all Disasters
• TrafficShift: reroute user traffic to different datacenters without any user interruption.
LinkedIn Traffic-Tier
A request enters at the Edge (Border Router → IPVS → ATS) and is proxied to a Fabric (ATS → Frontend); Stickyrouting determines which fabric serves the user.
Sticky routing in action: edge ATS checks the request for a fabric cookie. If the cookie already names a fabric (e.g., DC1), the request goes straight there; otherwise ATS asks Stickyrouting for the user’s primary fabric (e.g., DC2), routes the request to it, and sets the cookie.
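As a rough illustration of that cookie check (not LinkedIn’s actual ATS plugin; the cookie name, fabric names, and the lookup_primary_fabric call are hypothetical stand-ins), an edge proxy’s decision might look like this:

```python
# Minimal sketch of cookie-based sticky routing at the edge. The cookie
# name and the Stickyrouting lookup below are hypothetical stand-ins,
# not LinkedIn's real ATS plugin or service API.

FABRICS = {"DC1", "DC2", "DC3", "DC4"}

def lookup_primary_fabric(member_id: str) -> str:
    """Stand-in for a call to the Stickyrouting service."""
    return "DC2"  # placeholder response

def route_request(cookies: dict, member_id: str):
    """Return (target_fabric, cookies_to_set) for one request."""
    fabric = cookies.get("primary_fabric")
    if fabric in FABRICS:
        # Fast path: the cookie already names the user's fabric.
        return fabric, {}
    # Slow path: ask Stickyrouting, then cache the answer in a cookie.
    fabric = lookup_primary_fabric(member_id)
    return fabric, {"primary_fabric": fabric}

print(route_request({}, "member-42"))                         # slow path -> DC2
print(route_request({"primary_fabric": "DC1"}, "member-42"))  # fast path -> DC1
```

The cookie makes the common case a local decision at the edge, so the Stickyrouting service is only consulted once per user rather than on every request.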
Fabric buckets: Stickyrouting partitions members into numbered buckets (1-100), and each fabric is assigned a set of buckets to serve.
How does Stickyrouting assign users to a fabric? The assignment is computed offline in Hadoop, weighing the capacity of each datacenter against its geographic distance to users.
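The slides name the inputs (capacity, geography, Hadoop) but not the algorithm. As a toy sketch only, a greedy assignment that respects capacity quotas while preferring nearby fabrics could look like the following; the capacity shares and the distance ranking are invented, and only the 100-bucket count comes from the deck:

```python
# Toy sketch of bucket-to-fabric assignment. The real computation runs
# offline in Hadoop; the capacity shares, distance ranking, and greedy
# strategy here are illustrative assumptions.

CAPACITY = {"DC1": 0.30, "DC2": 0.30, "DC3": 0.25, "DC4": 0.15}  # traffic share
NUM_BUCKETS = 100

def nearest_fabrics(bucket: int) -> list:
    """Hypothetical geo ranking: fabrics ordered by distance to the
    users hashed into this bucket (stubbed as a rotating order)."""
    order = ["DC1", "DC2", "DC3", "DC4"]
    i = bucket % len(order)
    return order[i:] + order[:i]

def assign_buckets() -> dict:
    """Greedily place each bucket in its nearest fabric that still has
    capacity headroom."""
    quota = {dc: round(share * NUM_BUCKETS) for dc, share in CAPACITY.items()}
    assignment = {}
    for bucket in range(1, NUM_BUCKETS + 1):
        for dc in nearest_fabrics(bucket):
            if quota[dc] > 0:
                quota[dc] -= 1
                assignment[bucket] = dc
                break
    return assignment

buckets = assign_buckets()
print(sum(1 for dc in buckets.values() if dc == "DC1"), "buckets in DC1")
```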
Advantages of Stickyrouting
• Lower latency
• Store data where it is needed
• Control over capacity
TrafficShift
Site Traffic and Disaster Recovery
Diagram: the Edge distributing load across DC1-DC4, with per-fabric shares of 30%, 50%, 50%, and 10% shown.
Traffic stops being served to offline fabrics when we mark their buckets offline; it is shifted to the online fabrics as ATS redirects those users to their secondary fabric.
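A minimal sketch of that failout mechanic, with toy primary/secondary tables (in the real system the mapping comes from Stickyrouting):

```python
# Sketch of the failout mechanic above: with a fabric marked offline,
# the edge sends its buckets to their secondary fabric instead. The
# bucket-to-fabric tables are toy values.

PRIMARY   = {1: "DC1", 2: "DC2", 3: "DC3", 4: "DC4"}  # bucket -> primary
SECONDARY = {1: "DC2", 2: "DC3", 3: "DC1", 4: "DC2"}  # bucket -> secondary

def serving_fabric(bucket: int, offline: set) -> str:
    """Fabric that serves this bucket, given the set of offline fabrics."""
    primary = PRIMARY[bucket]
    return primary if primary not in offline else SECONDARY[bucket]

print([serving_fabric(b, set()) for b in PRIMARY])    # ['DC1', 'DC2', 'DC3', 'DC4']
print([serving_fabric(b, {"DC1"}) for b in PRIMARY])  # ['DC2', 'DC2', 'DC3', 'DC4']
```

Because users are already bucketed, failing out a datacenter is just a state change on its buckets; no per-user work is needed.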
When to TrafficShift
• Impact mitigation
• Planned maintenance
• Stress tests
TrafficShift Architecture
Components: a web application, a Salt master, backend worker processes, the Stickyrouting service, and Couchbase, which together manage the state of the fabric buckets.
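The slides name the components but not the control flow. One plausible, purely illustrative shape for the worker side is below; the in-memory dict stands in for the Couchbase-backed bucket store, and the batching parameters are invented:

```python
# Hypothetical sketch of the orchestration flow: a shift submitted via
# the web application becomes work for backend worker processes, which
# flip bucket state (held in Couchbase in the real system) in batches
# so traffic drains gradually rather than all at once.

import time

def drain_fabric(store: dict, buckets: list, batch_size: int = 10,
                 pause_s: float = 0.0) -> None:
    """Worker loop: mark buckets offline a batch at a time."""
    for i in range(0, len(buckets), batch_size):
        for bucket in buckets[i:i + batch_size]:
            store[bucket] = "offline"   # bucket state read by the edge
        time.sleep(pause_s)             # let the edge pick up each batch

state = {}
drain_fabric(state, list(range(1, 26)), batch_size=5)
print(len(state), "buckets marked offline")  # 25 buckets marked offline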
Load Testing
What is Load Testing?
• 3x a week
• Peak-hour traffic
• Fixed SLA
Load Testing
Chart: traffic percentage across fabrics (DC1, DC2, DC3) during a load test, ramping the fabric under test to roughly 60% of site traffic.
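A hedged sketch of such a ramp, with the SLA gate stubbed out (the target, step size, and baseline share are invented for the example):

```python
# Illustrative load-test ramp: step traffic into the fabric under test
# toward ~60% of site traffic, gating each step on a (stubbed) SLA
# check. All numbers here are invented for the sketch.

def sla_ok() -> bool:
    """Stand-in for checking site health (latency, error rates)."""
    return True

def ramp_load(target_pct: int = 60, step_pct: int = 10) -> int:
    current = 25  # rough even share across four fabrics as a baseline
    while current < target_pct:
        if not sla_ok():
            print("SLA breached: aborting ramp, shifting traffic back")
            break
        current = min(current + step_pct, target_pct)
        print(f"fabric under test now at {current}% of site traffic")
    return current

ramp_load()
```

Gating each step on live health metrics is what lets the same mechanism double as both a stress test and a safe, reversible operation during peak hours.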
Benefits of Load Testing
• Capacity planning
• Stress testing
• Identifying bugs
• Confidence
Big Red Button
• Kill-switch for a datacenter
• Fails out a datacenter & PoP in minutes
• Minimal user impact
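A hedged sketch of such a kill-switch, reusing the bucket-offline primitive from earlier (the fabric/PoP inventory and the DNS-withdrawal stub are invented for illustration):

```python
# Hedged sketch of a "big red button": one action that fails a whole
# datacenter (all of its buckets) and its PoPs out of rotation. The
# inventory mapping and the withdrawal stub are toy values.

BUCKETS_BY_FABRIC = {"DC1": [1, 5, 9], "DC2": [2, 6, 10]}        # toy inventory
POPS_BY_FABRIC = {"DC1": ["pop-a", "pop-b"], "DC2": ["pop-c"]}   # toy inventory

def withdraw_pop(pop: str) -> None:
    """Stand-in for pulling a PoP out of DNS/anycast rotation."""
    print(f"(stub) withdrawing {pop} from rotation")

def big_red_button(fabric: str, state: dict) -> None:
    """Mark every bucket homed in `fabric` offline and drain its PoPs."""
    for bucket in BUCKETS_BY_FABRIC.get(fabric, []):
        state[bucket] = "offline"
    for pop in POPS_BY_FABRIC.get(fabric, []):
        withdraw_pop(pop)

state = {}
big_red_button("DC1", state)
print(state)  # {1: 'offline', 5: 'offline', 9: 'offline'}
```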
Key Takeaways
• Design infrastructure to facilitate disaster recovery
• Stress test regularly to avoid surprises
• Automate everything to reduce time to mitigate impact
Q & A
LinkedIn has evolved from serving live traffic out of one data center to four data centers spread geographically. Serving live traffic from four data centers at the same time has taken the company from a disaster recovery model to a disaster avoidance model, where an unhealthy data center can be taken out of rotation and its traffic redistributed to the healthy data centers within minutes, with virtually no visible impact to users.

As LinkedIn transitioned from big monolithic applications to microservices, it became difficult to determine the capacity constraints of individual services under the extra load of disaster scenarios. Stress testing individual services with artificial load in a complex microservices architecture wasn’t sufficient to provide confidence in a data center’s capacity. To solve this problem, LinkedIn leverages live traffic to stress services site-wide, shifting traffic to simulate disaster load.

Michael Kehoe and Anil Mallapur discuss how LinkedIn uses traffic shifts to mitigate user impact by migrating live traffic between its data centers and to stress-test services site-wide for improved capacity handling and member experience.


