APRICOT 2017: Trafficshifting: Avoiding Disasters & Improving Performance at Scale

Trafficshifting: Avoiding Disasters &
Improving Performance at Scale
Michael Kehoe
Staff Site Reliability Engineer
LinkedIn

2
Overview
• Problem Statement
• Solution – How LinkedIn trafficshifts
• Datacenter shifting
• PoP steering
• Challenges of APAC region
• IPv4 vs IPv6
• Questions

$ whoami
3
Michael Kehoe
• Staff Site Reliability Engineer (SRE) @ LinkedIn
• Production-SRE team
• Funny accent = Australian + 3 years American

$ whatis SRE
4
Michael Kehoe
• Site Reliability Engineering
• Operations for the production application environment
• Responsibilities include
• Architecture design
• Capacity planning
• Operations
• Tooling
• Responsibilities include DNS/CDN management & Traffic infrastructure

5
Terminology
• PoP – Point of Presence; where LinkedIn terminates incoming requests
• Fabric – Datacenter with full LinkedIn production stack deployed
• Loadtest – Stress test of a Fabric – to simulate a disaster scenario

Disaster Recovery
6
Problem Statement
• Fail between Fabrics
• Performance of applications is degraded
• Validate disaster recovery (DR) scenario
• Expose bugs and suboptimal configurations via loadtest
• Planned maintenance
• Fail between PoPs
• Mitigate impact of a 3rd party provider maintenance/failure (e.g. transport links)
• Software/ Configuration Bugs

Performance
7
Problem Statement
• Fabric Assignment
• Assign preferred and secondary fabric to all members based on:
• Member location
• Capacity
• PoP/ CDN steering
• Use GeoDNS to steer users to the ‘best’ PoP
• Use RUM (Real User Monitoring) DNS to steer users to the ‘best’ CDN

United States Performance (Global)
8
Problem Statement

APAC Performance (APAC cities)
9
Problem Statement

Delta US & APAC
10
Problem Statement

Site Speed
11
Problem Statement
• Site Speed affects User Engagement
• User Engagement affects page-views & transactions
• Bottom Line: Site Speed has an impact on revenue

LinkedIn’s Traffic Architecture
12
Solution

13
Solution

Fabric shifting
14
Solution
• Stickyrouting
• Using a Hadoop job, we calculate a primary and secondary datacenter for the user based on location
• This data is stored in a Key-Value store (Espresso)
• Stickyrouting serves this information over a RESTful interface to our Edge PoPs
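
A minimal sketch of the assignment step described above, assuming that "based on location" means nearest fabric by great-circle distance. The fabric names, coordinates, and the in-memory dict standing in for the key-value store are all hypothetical; the slides do not describe the actual Hadoop job or the Espresso schema.

```python
from math import radians, sin, cos, asin, sqrt

# Hypothetical fabric names and coordinates, for illustration only.
FABRICS = {
    "fabric-a": (37.4, -122.1),
    "fabric-b": (32.8, -96.8),
    "fabric-c": (1.35, 103.8),
}

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))

def assign_fabrics(member_location, fabrics=FABRICS):
    """Rank fabrics by distance; the closest two become primary and secondary."""
    ranked = sorted(fabrics, key=lambda name: haversine_km(member_location, fabrics[name]))
    return {"primary": ranked[0], "secondary": ranked[1]}

# Stand-in for the key-value store (assumption: keyed by member ID).
routing_table = {}

def run_batch(members):
    """Offline batch step: compute and store an assignment for every member."""
    for member_id, location in members.items():
        routing_table[member_id] = assign_fabrics(location)

def lookup(member_id):
    """What a lookup service might return to an edge PoP; None if unknown."""
    return routing_table.get(member_id)
```

In this sketch the batch step and the lookup are deliberately separate, mirroring the split between an offline job that writes assignments and a service that only reads them.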

Fabric shifting
15
Solution
• Different traffic types are partitioned and controlled separately
• Logged-in vs logged-out
• CDNs
• Monitoring
• Microsites
• Logged-in users are placed into ‘buckets’
• Buckets are marked online/ offline to move site traffic
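
The bucket mechanism can be sketched as follows. How members map to buckets is not stated in the slides; hashing the member ID into a fixed number of buckets is an assumption, as are the bucket count and the names `bucket_for` and `route`.

```python
import hashlib

NUM_BUCKETS = 100  # hypothetical bucket count

def bucket_for(member_id: str) -> int:
    """Map a member ID to a stable bucket number (assumed hashing scheme)."""
    digest = hashlib.sha256(member_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS

def route(member_id, assignment, offline):
    """Pick the fabric for a request.

    assignment: {"primary": fabric, "secondary": fabric}
    offline:    {fabric: set of bucket numbers marked offline in that fabric}
    """
    bucket = bucket_for(member_id)
    for fabric in (assignment["primary"], assignment["secondary"]):
        if bucket not in offline.get(fabric, set()):
            return fabric
    raise RuntimeError("bucket is offline in both assigned fabrics")
```

Marking a growing set of buckets offline in one fabric moves a proportional share of logged-in traffic to each member's secondary fabric, which is one way a gradual shift could be staged.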

Fabric shifting
16
Solution
• Stickyrouting – Benefits
• Ensure we serve the request as close to the user as possible
• Capacity management for datacenters
• We can assign a percentage of users to a datacenter
• Enables personal data routing (PDR)
• Only store data where we need it
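
One way to picture "assign a percentage of users to a datacenter" is to split a fixed set of traffic buckets between datacenters in proportion to target shares. This is only a sketch under that assumption: the function name, the bucket count, and the largest-remainder rounding are illustrative choices, not something the slides specify.

```python
def allocate_buckets(shares, num_buckets=100):
    """Split bucket numbers 0..num_buckets-1 between datacenters by share.

    shares: {datacenter: relative share}, e.g. percentages.
    Returns {bucket number: datacenter}.
    """
    total = sum(shares.values())
    exact = {dc: num_buckets * share / total for dc, share in shares.items()}
    counts = {dc: int(value) for dc, value in exact.items()}
    # Hand out buckets lost to rounding, largest fractional remainder first.
    leftover = num_buckets - sum(counts.values())
    by_remainder = sorted(exact, key=lambda dc: exact[dc] - counts[dc], reverse=True)
    for dc in by_remainder[:leftover]:
        counts[dc] += 1
    owner = {}
    next_bucket = 0
    for dc, n in counts.items():
        for bucket in range(next_bucket, next_bucket + n):
            owner[bucket] = dc
        next_bucket += n
    return owner
```

For example, `allocate_buckets({"dc-a": 50, "dc-b": 30, "dc-c": 20}, num_buckets=10)` gives five buckets to `dc-a`, three to `dc-b`, and two to `dc-c`.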

Fabric shifting Automation
17
Solution

Fabric shifting Automation
18
Solution

Fabric Shifting Loadtests
20
Solution

Fabric Shifting Loadtests
21
Solution

22
Solution

LinkedIn’s PoP Distribution
23
Solution

LinkedIn’s PoP Architecture
24
Solution
• Using IPVS, each PoP announces a unicast address and a regional anycast address
• APAC, EU and NAMER anycast regions
• Use GeoDNS to steer users to the ‘best’ PoP
• DNS will provide users with either an anycast or a unicast address for www.linkedin.com
• US and EU member traffic is nearly all anycast
• APAC is all unicast
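
The anycast/unicast decision above can be sketched as a lookup. The PoP names are hypothetical and the addresses come from the reserved documentation ranges; the sketch also simplifies "nearly all anycast" for NAMER and EU to "always anycast".

```python
# Documentation-range addresses and hypothetical PoP names, for illustration only.
ANYCAST = {"NAMER": "192.0.2.1", "EU": "198.51.100.1", "APAC": "203.0.113.1"}
UNICAST = {
    "pop-sin": "203.0.113.10",
    "pop-bom": "203.0.113.20",
    "pop-lon": "198.51.100.10",
}
UNICAST_ONLY_REGIONS = {"APAC"}  # per the slide: APAC is all unicast

def dns_answer(region, best_pop):
    """Return the address a GeoDNS-style resolver might hand to a member.

    region:   one of "NAMER", "EU", "APAC"
    best_pop: the PoP that GeoDNS picked for this member
    """
    if region in UNICAST_ONLY_REGIONS:
        return UNICAST[best_pop]
    return ANYCAST[region]
```

With unicast the DNS layer chooses the PoP explicitly; with anycast the choice is left to BGP routing toward the shared regional address.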

LinkedIn’s PoP DR
25
Solution
• Sometimes we need to fail out of PoPs
• 3rd party provider issues (e.g. transit links going down)
• Infrastructure maintenance
• Withdraw anycast route announcements
• Fail healthchecks on proxy to drain unicast traffic
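
A minimal sketch of the "fail healthchecks to drain" idea, using only the Python standard library. The drain flag, handler, port, and response bodies are assumptions for illustration; the slides do not describe the proxy software or its healthcheck endpoint.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Set this event to start draining; clear it to take traffic again.
DRAINING = threading.Event()

def health_status(draining: bool):
    """Return (HTTP status, body) for the healthcheck endpoint."""
    return (503, b"draining\n") if draining else (200, b"ok\n")

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        code, body = health_status(DRAINING.is_set())
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port=8080):
    """Run the healthcheck endpoint until interrupted (hypothetical port)."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

Once the endpoint starts returning 503, whatever polls it (a DNS health monitor in this sketch's assumption) stops handing out the PoP's unicast address, and traffic drains as cached DNS answers expire.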

LinkedIn’s PoP Performance
26
Solution
• PoP DNS Steering
• LinkedIn currently uses GeoDNS for routing
• Piloting RUM DNS
• Pick the best PoP based on network, not country
• CDN Steering
• Mix CDNs to get best performance
• Constantly evaluate performance/ availability
• Automatically adjust CDN weighting
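
Automatic CDN weighting could look roughly like the sketch below. The rule used here (weight inversely proportional to median request time, with CDNs below an availability threshold excluded) is an assumption chosen for illustration, as are the function name and the threshold; the slides do not give the actual formula.

```python
from statistics import median

def cdn_weights(samples, availability, min_availability=0.99):
    """Compute traffic weights per CDN from measured data.

    samples:      {cdn: list of request times in ms}
    availability: {cdn: fraction of successful requests, 0.0 to 1.0}
    Returns {cdn: weight}, with weights summing to 1.0, or {} if none qualify.
    """
    scores = {}
    for cdn, times in samples.items():
        if not times or availability.get(cdn, 0.0) < min_availability:
            continue  # skip CDNs with no data or poor availability
        scores[cdn] = 1.0 / median(times)  # faster CDN, higher score
    total = sum(scores.values())
    return {cdn: score / total for cdn, score in scores.items()}
```

Re-running this on a rolling window of measurements would shift weight toward whichever CDN is currently faster, while an availability drop removes a CDN from rotation entirely.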

LinkedIn’s PoP Performance
27
Solution
[Chart: US CDN request time, 50th percentile, over 24 hours]

Working around fiber cuts
28
APAC Challenges
• Case Study: Fail out of India PoP due to fiber cuts
[Chart: Connection time for Indian members (90th percentile)]

GeoDNS Suboptimal PoPs
29
APAC Challenges
[Diagram labels: ASN 15802, ASN 5384, Singapore, Mumbai; latencies 45 ms, 220 ms, 70 ms]
Source: http://www.submarinecablemap.com/#/submarine-cable/bay-of-bengal-gateway-bbg
ASN 15802 RTT to Singapore is 220 + 70 = 290 ms (all at 50th percentile)
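
The RTT arithmetic on this slide, and the idea of choosing a PoP by measured latency rather than geography, can be sketched as below. The 220 ms and 70 ms segments are the ones the slide sums for ASN 15802 to Singapore. The 45 ms figure also appears on the slide, but which link it labels is not recoverable from the text, so using it as the RTT of an alternative PoP is an assumption, and `alternative-pop` is a hypothetical name.

```python
def path_rtt(segments_ms):
    """Total round-trip time of a path made of consecutive segments."""
    return sum(segments_ms)

def best_pop(paths):
    """Pick the PoP whose measured path has the lowest total RTT.

    paths: {pop name: list of segment RTTs in ms}
    """
    return min(paths, key=lambda pop: path_rtt(paths[pop]))

# Segment values from the slide for ASN 15802 to Singapore.
singapore_path = [220, 70]

# Assumed alternative for comparison; see the note above about the 45 ms value.
candidate_paths = {"singapore": singapore_path, "alternative-pop": [45]}
```

Here `path_rtt(singapore_path)` is 290, matching the slide, and `best_pop(candidate_paths)` prefers the lower-latency alternative even if a country-based rule would have chosen Singapore.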

30
APAC Challenges
[Diagram labels: London, Dublin, Singapore, Mumbai, Hong Kong, ASN 15802, ASN 5384; latencies 160 ms, 45 ms, 70 ms, 35 ms, 350 ms, 160 ms]

31
APAC Challenges

Performance & Adoption
32
IPv4 vs IPv6
• IPv6 performs better for our members
• Fewer request timeouts on IPv6 for mobile users
• Mobile carriers are adopting IPv6 faster
• Win for LinkedIn and our members!
• In July 2014 (IPv6 launch): 3% of traffic was IPv6
• Today: ~12% of traffic is IPv6

Key Takeaways
33
Conclusion
• Application-level traffic engineering is extremely important for content providers
• RUM data is extremely useful for finding anomalies
• Route traffic based on performance, not just location
• IPv6 performs better for LinkedIn users
