
APRICOT 2017: Trafficshifting: Avoiding Disasters & Improving Performance at Scale

LinkedIn serves traffic for its 467 million members from four data centers and multiple PoPs spread geographically around the world. Serving live traffic from many places at the same time has taken us from a disaster recovery model to a disaster avoidance model, where we can take an unhealthy data center or PoP out of rotation and redistribute its traffic to a healthy one within minutes, with virtually no visible impact to users. The geographical distribution of our infrastructure also allows us to optimize the end user's experience by geo-routing users to the best possible PoP and data center.

This talk provides details on how LinkedIn shifts traffic between its PoPs and data centers to provide the best possible performance and availability for its members. We will also touch on the complexities of performance in APAC, how IPv6 is helping our members, and how LinkedIn stress-tests its data centers to verify its disaster recovery capabilities.


  1. 1. Trafficshifting: Avoiding Disasters & Improving Performance at Scale Michael Kehoe Staff Site Reliability Engineer LinkedIn
  2. 2. 2 Overview • Problem Statement • Solution – How LinkedIn trafficshifts • Datacenter shifting • PoP steering • Challenges of the APAC region • IPv4 vs IPv6 • Questions
  3. 3. $ whoami 3 Michael Kehoe • Staff Site Reliability Engineer (SRE) @ LinkedIn • Production-SRE team • Funny accent = Australian + 3 years American
  4. 4. $ whatis SRE 4 Michael Kehoe • Site Reliability Engineering • Operations for the production application environment • Responsibilities include • Architecture design • Capacity planning • Operations • Tooling • My responsibilities include DNS/CDN management & traffic infrastructure
  5. 5. 5 Terminology • PoP - Where LinkedIn terminates incoming requests. • Fabric – Datacenter with full LinkedIn production stack deployed • Loadtest – Stress test of a Fabric – to simulate a disaster scenario
  6. 6. Disaster Recovery 6 Problem Statement • Fail between Fabrics • Performance of applications is degraded • Validate disaster recovery (DR) scenarios • Expose bugs and suboptimal configurations via loadtests • Planned maintenance • Fail between PoPs • Mitigate impact of 3rd-party provider maintenance/failure (e.g. transport links) • Software/configuration bugs
  7. 7. Performance 7 Problem Statement • Fabric Assignment • Assign a preferred and secondary fabric to all members based on: • Member location • Capacity • PoP/CDN steering • Use GeoDNS to steer users to the 'best' PoP • Use RUM DNS to steer users to the 'best' CDN
  8. 8. United States Performance (Global) 8 Problem Statement
  9. 9. APAC Performance (APAC cities) 9 Problem Statement
  10. 10. Delta US & APAC 10 Problem Statement
  11. 11. Site Speed 11 Problem Statement • Site Speed affects User Engagement • User Engagement affects page-views & transactions • Bottom Line: Site Speed has an impact on revenue
  12. 12. LinkedIn’s Traffic Architecture 12 Solution
  13. 13. LinkedIn’s Traffic Architecture 13 Solution
  14. 14. Fabric shifting 14 Solution • Stickyrouting • Using a Hadoop job, we calculate a primary and secondary datacenter for each user based on location • This data is stored in a key-value store (Espresso) • Stickyrouting serves this information over a RESTful interface to our edge PoPs
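The stickyrouting flow above lends itself to a small sketch. The Python snippet below shows how an edge proxy might look up a member's precomputed primary/secondary fabric over a REST interface and fall back to the secondary when the primary is out of rotation; the URL, field names, and fabric names are placeholders for illustration, not LinkedIn's actual API.

```python
# Minimal sketch of an edge PoP consulting a stickyrouting-style service
# for a member's fabric assignment. All names (URL, fields, fabrics) are
# hypothetical.
import requests

STICKYROUTING_URL = "https://stickyrouting.example.com/members/{member_id}/fabric"

def pick_fabric(member_id: int, online_fabrics: set[str]) -> str:
    """Return the fabric this member's request should be proxied to."""
    resp = requests.get(STICKYROUTING_URL.format(member_id=member_id), timeout=0.2)
    resp.raise_for_status()
    assignment = resp.json()  # e.g. {"primary": "dc-west", "secondary": "dc-east"}

    # Prefer the precomputed primary fabric; fall back to the secondary
    # if the primary has been taken out of rotation.
    for fabric in (assignment["primary"], assignment["secondary"]):
        if fabric in online_fabrics:
            return fabric
    raise RuntimeError(f"No healthy fabric for member {member_id}")

# Example: if dc-west is offline, traffic for this member shifts to dc-east.
# print(pick_fabric(42, online_fabrics={"dc-east"}))
```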
  15. 15. Fabric shifting 15 Solution • Different traffic types are partitioned and controlled separately • Logged-in vs logged-out • CDNs • Monitoring • Microsites • Logged-in users are placed into 'buckets' • Buckets are marked online/ offline to move site traffic
  16. 16. Fabric shifting 16 Solution • Stickyrouting – Benefits • Ensure we serve the request as close to the user as possible • Capacity management for datacenters • We can assign a percentage of users to a datacenter • Enables personal data routing (PDR) • Only store data where we need it
  17. 17. Fabric shifting Automation 17 Solution
  18. 18. Fabric shifting Automation 18 Solution
  19. 19. Fabric Shifting 19 Solution
  20. 20. Fabric Shifting Load tests 20 Solution
  21. 21. Fabric Shifting Loadtests 21 Solution
  22. 22. LinkedIn’s Traffic Architecture 22 Solution
  23. 23. LinkedIn’s PoP Distribution 23 Solution
  24. 24. LinkedIn’s PoP Architecture 24 Solution • Using IPVS - Each PoP announces a unicast address and a regional anycast address • APAC, EU and NAMER anycast regions • Use GeoDNS to steer users to the 'best' PoP • DNS will provide users with either an anycast or a unicast address for www.linkedin.com • Traffic for US and EU members is nearly all anycast • APAC traffic is all unicast
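A minimal sketch of that GeoDNS decision, assuming a simple region lookup: US and EU resolvers receive a regional anycast answer, while APAC resolvers are pinned to a specific PoP's unicast address. The addresses and PoP codes below are placeholders, not LinkedIn's real ones.

```python
# Rough sketch of the GeoDNS steering described above. US/EU resolvers get
# a regional anycast address; APAC resolvers get a specific PoP's unicast
# address. IPs and PoP names are documentation placeholders.
ANYCAST = {"NAMER": "192.0.2.1", "EU": "192.0.2.2", "APAC": "192.0.2.3"}
UNICAST = {"sin": "198.51.100.10", "hkg": "198.51.100.20", "bom": "198.51.100.30"}

def answer_for(resolver_region: str, nearest_pop: str) -> str:
    """Choose the DNS answer for www.linkedin.com (illustrative only)."""
    if resolver_region in ("NAMER", "EU"):
        return ANYCAST[resolver_region]   # BGP routes the user to the closest PoP
    return UNICAST[nearest_pop]           # APAC: pin the user to one PoP

# e.g. answer_for("EU", "ams")  -> "192.0.2.2"
#      answer_for("APAC", "sin") -> "198.51.100.10"
```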
  25. 25. LinkedIn’s PoP DR 25 Solution • Sometimes we need to fail out of PoPs • 3rd-party provider issues (e.g. transit links going down) • Infrastructure maintenance • Withdraw anycast route announcements • Fail healthchecks on the proxy to drain unicast traffic
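The unicast half of that drain procedure can be pictured as a healthcheck endpoint that deliberately starts failing once a drain flag is set, so upstream checks take the PoP out of rotation. The flag path, port, and endpoint below are assumptions for illustration, not LinkedIn's actual tooling.

```python
# Minimal sketch of the "fail the healthcheck to drain" pattern: the proxy's
# healthcheck endpoint returns 503 while a drain flag is present, so upstream
# checks pull the PoP from rotation. Flag path and port are assumptions.
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

DRAIN_FLAG = "/etc/pop/drain"   # hypothetical: touch this file to start draining

class Healthcheck(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthcheck":
            self.send_error(404)
            return
        if os.path.exists(DRAIN_FLAG):
            self.send_error(503, "draining")   # checks fail -> traffic moves away
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"OK")

if __name__ == "__main__":
    # Anycast traffic is drained separately by withdrawing the BGP
    # announcement; this endpoint only drains unicast traffic.
    HTTPServer(("0.0.0.0", 8080), Healthcheck).serve_forever()
```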
  26. 26. LinkedIn’s PoP Performance 26 Solution • PoP DNS Steering • LinkedIn currently uses GeoDNS for routing • Piloting RumDNS • Pick the best PoP based on network, not country • CDN Steering • Mix CDNs to get the best performance • Constantly evaluate performance/availability • Automatically adjust CDN weighting
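One simple way to picture performance-based CDN weighting is to set each CDN's weight inversely proportional to its measured RUM latency and then pick a CDN per request by weighted random choice. The sketch below does exactly that; the CDN names and latencies are made up for the example.

```python
# Illustrative sketch of performance-based CDN weighting: weights are set
# inversely proportional to measured RUM latency, then a CDN is chosen per
# request by weighted random choice. Names and numbers are invented.
import random

rum_p50_ms = {"cdn-a": 80.0, "cdn-b": 120.0, "cdn-c": 95.0}   # from RUM beacons

def compute_weights(latencies: dict[str, float]) -> dict[str, float]:
    inverse = {cdn: 1.0 / ms for cdn, ms in latencies.items()}
    total = sum(inverse.values())
    return {cdn: w / total for cdn, w in inverse.items()}

def pick_cdn(weights: dict[str, float]) -> str:
    return random.choices(list(weights), weights=list(weights.values()))[0]

weights = compute_weights(rum_p50_ms)   # cdn-a gets the largest share
# pick_cdn(weights) returns "cdn-a" most often, but slower CDNs still get
# some traffic so their performance keeps being measured.
```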
  27. 27. LinkedIn’s PoP Performance 27 Solution • Chart: US CDN request time, 50th percentile, over 24 hours
  28. 28. Working around fiber cuts 28 APAC Challenges • Case study: Failing out of the India PoP due to fiber cuts • Chart: Connection time for Indian members (90th percentile)
  29. 29. GeoDNS Suboptimal PoPs 29 APAC Challenges • Diagram: Mumbai–Singapore paths for ASN 15802 and ASN 5384, with link latencies of 45 ms, 220 ms and 70 ms • ASN 15802's RTT to Singapore is 220 + 70 = 290 ms (all at the 50th percentile) • Source: http://www.submarinecablemap.com/#/submarine-cable/bay-of-bengal-gateway-bbg
  30. 30. GeoDNS Suboptimal PoPs 30 APAC Challenges • Diagram: Mumbai, Singapore, Hong Kong, London and Dublin for ASN 15802 and ASN 5384, with link latencies of 35 ms, 45 ms, 70 ms, 160 ms, 160 ms and 350 ms
  31. 31. GeoDNS Suboptimal PoPs 31 APAC Challenges
  32. 32. Performance & Adoption 32 IPv4 vs IPv6 • IPv6 performs better for our members • Fewer request timeouts on IPv6 for mobile users • Mobile carriers are adopting IPv6 faster • A win for LinkedIn and our members! • In July 2014 (IPv6 launch): 3% of traffic was IPv6 • Today: ~12% of traffic is IPv6
  33. 33. Key Takeaways 33 Conclusion • Application level traffic engineering is extremely important for content providers • RUM data is extremely useful for finding anomalies • Route traffic based on performance, not just location • IPv6 performs better for LinkedIn users
  34. 34. 34 Questions?
