This is the slidedeck for my Velocity 2015 talk (http://velocityconf.com/devops-web-performance-2015/public/schedule/detail/42026).
LinkedIn, like many web companies, uses PoPs (point of presence) to terminate TCP connections closer to end-users. This improves the content download time. In this talk, we describe techniques used by LinkedIn to better utilize PoPs through new RUM measurements and in-depth data analysis.
We analyze the following questions, and show how JavaScript code tied with RUM data can be used to help answer these questions and improve performance:
Given a set of PoPs spread across the world, which set of end users should be routed to PoP A vs PoP B?
What is a better strategy for PoP selection: DNS GSLB or Anycast?
Where should we build the next set of PoPs?
This talk will cover the following topics:
How PoPs help in performance
Why accurate user-to-PoP assignment is important for performance
How to utilize your end-user browser sessions to emulate millions of “measurement agents” spread across the world
How to use the above technique, along with RUM data, to:
Measure which PoP is optimal for your end users
Analyze inefficiencies in DNS load-balancing techniques
Compare IP Anycast and DNS GSLB as PoP selection techniques
Decide where to build the next PoP from a candidate list of multiple geographic locations
Answer other relevant network performance questions
Why every web company should use RUM to drive performance optimizations
Attendees of this talk will learn how to utilize their (potentially millions of) end-user visits to do performance measurements, distributed across the globe, with results that are more accurate and realistic than synthetic measurements. Attendees will also be able to directly apply this knowledge to instrument RUM on their site, and use that data to make better optimization decisions.
In the end, we would like to encourage web companies to use RUM data from their users not only for performance monitoring but also to drive optimizations.
6. Story: In a nutshell
1. We built PoPs
2. …used RUM to assign users to Optimal PoPs
3. …found DNS based assignment is suboptimal
4. …evaluated Anycast as a solution using RUM
5. …now using Anycast to assign users to PoPs
6. …new PoPs locations chosen using RUM
7. What is RUM?
JavaScript (client code) to measure performance
Navigation Timing API
LinkedIn uses
https://github.com/lognormal/boomerang/
Metrics like:
• DNS Time
• Connection Time
• First Byte Time
• Download Time
• Page Load Time
8. What are PoPs?
Point of Presence / PoP
• Small-scale data centers
– Cheap, Many
• Proxy servers at LinkedIn (ATS)
11. World Without PoPs
Browser Data Center
connection time
first
byte
time
+
page
download
time
5 RTTs = 5x250ms = 1250ms
server
compute
time
250ms
Total = 2000ms
500ms
14. With PoPs
Browser Data CenterPoP
100msconnection time
first
byte
time
+
page
download
time
server
compute
time
500ms
15. With PoPs
Browser Data CenterPoP
100msconnection time
5 RTTs = 5x100ms = 500ms
Total = 1100ms
900 ms gain!first
byte
time
+
page
download
time
500ms
server
compute
time
16. How are users assigned to PoPs?
Through 3rd party DNS services
IP handed based on user’s resolver location
# Spain
$ dig @109.69.8.51 +short www.linkedin.com
91.225.248.80
# California
$ dig +short www.linkedin.com
216.52.242.80
17. Should India connect to Singapore or
Dublin?
How to assure optimal PoPs assignment?
• Geographical Distance
• Knowledge about Internet Connectivity
18. Lets Make Data-based Decisions
Synthetic measurements?
No realistic agent distribution across networks
No realistic agent connectivity distribution
Can our users become measurements agent?
• RUM!
19. RUM beacons
Fetch a tiny object from each candidate PoP
For each pop_name,
1. Start timer
2. Fetch {pop_name}.perf.linkedin.com/pop/admin
3. Stop timer
Send data back to our servers
• Millions of agents!
• Analyze data to find “optimal” PoP per country
20. We can assign countries to new PoPs!
Country PoP Median Beacon Time
China Hong Kong 434
China Dublin 1216
China Singapore 515
India Hong Kong 1368
India Dublin 1042
India Singapore 898
21. We can audit current assignment!
Country Is PoP optimal? Current PoP Optimal PoP
India TRUE Singapore Singapore
Pakistan FALSE Singapore Dublin
Spain TRUE Dublin Dublin
Brazil FALSE US West Coast US East Coast
Netherlands TRUE Dublin Dublin
UAE FALSE US West Coast Dublin
Italy TRUE Dublin Dublin
Mexico TRUE US West Coast US West Coast
Russia FALSE US West Coast Dublin
23. Trust, but Verify
Is 100% of India connecting to Singapore PoP?
Can RUM identify which PoP served this page view?
1. RUM is JavaScript
2. PoP selected through DNS
3. JavaScripts can’t know DNS answers
JavaScripts can read response headers
24. Can RUM identify which PoP served
this page view?
After page load:
1. Fetch www.linkedin.com/pop/admin
2. Read X-Li-Pop header
3. Add that data to RUM and send it back
Offline analysis:
1. What % of traffic in India going to Singapore PoP?
2. A/B testing PoP assignment.
25. Plot Twist:
Assignment far from optimal
• About 31% of US traffic gets assigned to a
suboptimal PoP.
– 45% of East Coast
• About 10% of traffic globally gets assigned to a
suboptimal PoP.
26. DNS PoP assignment is suboptimal
• Assignment based on Resolver IP, not Client IP
DNS
Resolver
PoP
US
East
PoP
US
West
New YorkCalifornia
27. DNS PoP assignment is suboptimal
• Assignment based on Resolver IP, not Client IP
• Bad IP to Geo databases
– Resolver really in NY, but database says CA
28. Story so far
1. We built PoPs
2. …used RUM to assign users to Optimal PoPs
3. …found DNS based assignment is suboptimal
29. Accurate PoP assignment Problem
• Bug our DNS providers (31% -> 27%)
• Run our own DNS
How about Anycast?
30. Anycast – One IP, Multiple Servers
PoP A
PoP B
PoP C
Bob
1.1.1.1
1.1.1.1
1.1.1.1
• Unicast – one IP per server
• Seamless support at routing layer
• Packets routed to closest server by
# of hops
Client IP, not Resolver IP used!
No Geo-IP Databases
31. How does Anycast compare to DNS?
Will anycast send more users to optimal PoP?
Lets test it!
32. RUM to rescue
For each PoP:
1. Announce same anycast IP (108.174.13.10)
2. Configure a domain
ac.perf.linkedin.com to point to
108.174.13.10
33. RUM to rescue
For each page view:
1. RUM downloads a tiny object :
ac.perf.linkedin.com/pop/admin
2. Read X-Li-Pop response header to record which PoP served
the object
3. Send this back to LinkedIn with RUM data
Data:
1. For each user, the anycast PoP
2. For each user, the optimal PoP (from pop beacons)
34. Results
Region or
Country
DNS % Optimal
Assignment
Anycast % Optimal
Assignment
Illinois 70 90
Florida 73 95
Georgia 75 93
Pennsylvania 85 95
36. Fewer hops != Lower Latency
• Carriers prefer to haul packets within
their own network
• Peering can create inter-continental
short cuts
Z
X
Alice
Y
1.1.1.1
1.1.1.1
1.1.1.1
37. Maybe DNS wasn’t so bad
Continent level identification
City / State level identification
38. “Regional” Anycast
1 anycast IP per continent
Ran a RUM experiment,
all was fine Z
X
Alice
Y
2.2.2.2
1.1.1.1
1.1.1.1
40. Story so far
1. We built PoPs
2. …used RUM to assign users to Optimal PoPs
3. …found DNS based assignment is suboptimal
4. …evaluated Anycast as a solution using RUM
5. …now using Anycast to assign users to PoPs
Next play:
• Build more PoPs!
41. Build More PoPs: But where?
• Previously, based on
– Intuition
– Knowledge
• Can we use RUM data?
– Get a list of candidate PoP locations
– Build a machine learning model
– Rank locations based on most performance benefit
42. Plot Twist: Miami vs Sao Paolo
• Can we use RUM again?
– Fetch a tiny object and measure latency
– But we have no servers there
Our CDNs do!
1. Get IP address of CDN servers
2. Configure a domain to point to the IP
3. RUM to fetch a tiny object ( /cdo/rum/id )
4. Measure time and report back
43. Miami vs Sao Paolo - Results
Sao Paolo is better than US East Coast
Miami is better than Sao Paolo for many in South America
Country Median beacon time
Miami Sao Paolo US East Coast
Columbia 648 970 778
Argentina 918 970 1213
Brazil 868 549 1074
44. Story: Recap
1. We built PoPs
2. …used RUM to assign users to Optimal PoPs
3. …found DNS based assignment is suboptimal
4. …evaluated Anycast as a solution using RUM
5. …now using Anycast to assign users to PoPs
6. …new PoPs locations chosen using RUM
10-25% gain
30% suboptimal in US
“Regional” Anycast
30% -> 10%
45. Story: The End
• Clients are your measurement agents
• Make data-driven decisions
• Trust, but verify
• Collaboration leads to bigger impact
48. What about Mobile?
• Packet Gateways are likely to be in the same
part of the country as users
• Wired measurements should be applicable for
Mobile
– India wired users are closer to Singapore, Mobile
should be similar
49. Is Synthetic Monitoring dead?
No, not yet
• Much more powerful than RUM
• Not everything can be done in production
50. Next Play
• Keep evaluating Anycast
• Keep building new PoPs
• Dynamic RUM-based PoP assignment
Editor's Notes
Here's a slide summarizing the high level story of how it went. There were ups and downs, success and failure and interesting plot twists on the way. Overall it was a bumpy ride.
Simplified example
That’s a 900ms gain! This is huge! Just by adding a proxy server closer to the user, we have managed to gain 900ms in download time.
So we know now that PoPs are great, lets get into the mechanics of how a user gets assigned to a PoP.
At LinkedIn, our primary methodology of doing this is through DNS.
There are two obvious ways to do this:….. But we know that network distance is not equal to geographical distance
Median time to download the tiny object
This is a slide that makes me very happy. This is the .....
Not so fast...... I always like to verify major decisions with data. In this case, we wanted to verify whether the changes we made were effective.
So, what did we find?
We found that ...
Digging deeper
15 mins more
And how do we test it? Lets use RUM
But of course, there was another plot twist.
Can you add graph instead?
So why was it happening?
So we came up with what we “regional” anycast concept. Basically, we will still use DNS, but only to decide which continent is this user from. Within a continent, we will use anycast