Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

How LinkedIn used RUM to drive optimizations and make the site faster

3,096 views

Published on

This is the slidedeck for my Velocity 2015 talk (http://velocityconf.com/devops-web-performance-2015/public/schedule/detail/42026).

LinkedIn, like many web companies, uses PoPs (point of presence) to terminate TCP connections closer to end-users. This improves the content download time. In this talk, we describe techniques used by LinkedIn to better utilize PoPs through new RUM measurements and in-depth data analysis.

We analyze the following questions, and show how JavaScript code tied with RUM data can be used to help answer these questions and improve performance:

Given a set of PoPs spread across the world, which set of end users should be routed to PoP A vs PoP B?
What is a better strategy for PoP selection: DNS GSLB or Anycast?
Where should we build the next set of PoPs?
This talk will cover the following topics:

How PoPs help in performance
Why accurate user-to-PoP assignment is important for performance
How to utilize your end-user browser sessions to emulate millions of “measurement agents” spread across the world
How to use the above technique, along with RUM data, to:
Measure which PoP is optimal for your end users
Analyze inefficiencies in DNS load-balancing techniques
Compare IP Anycast and DNS GSLB as PoP selection techniques
Decide where to build the next PoP from a candidate list of multiple geographic locations
Answer other relevant network performance questions
Why every web company should use RUM to drive performance optimizations
Attendees of this talk will learn how to utilize their (potentially millions of) end-user visits to do performance measurements, distributed across the globe, with results that are more accurate and realistic than synthetic measurements. Attendees will also be able to directly apply this knowledge to instrument RUM on their site, and use that data to make better optimization decisions.

In the end, we would like to encourage web companies to use RUM data from their users not only for performance monitoring but also to drive optimizations.

Published in: Technology
  • Be the first to comment

How LinkedIn used RUM to drive optimizations and make the site faster

  1. 1. Story of How LinkedIn Used RUM & PoPs to Make the Site Faster Ritesh Maheshwari Performance Engineer LinkedIn @ritesh
  2. 2. 2013
  3. 3. 2015
  4. 4. Story PoP RUM
  5. 5. Story: In a nutshell 1. We built PoPs 2. …used RUM to assign users to Optimal PoPs 3. …found DNS based assignment is suboptimal 4. …evaluated Anycast as a solution using RUM 5. …now using Anycast to assign users to PoPs 6. …new PoPs locations chosen using RUM
  6. 6. What is RUM? JavaScript (client code) to measure performance  Navigation Timing API  LinkedIn uses https://github.com/lognormal/boomerang/ Metrics like: • DNS Time • Connection Time • First Byte Time • Download Time • Page Load Time
  7. 7. What are PoPs? Point of Presence / PoP • Small-scale data centers – Cheap, Many • Proxy servers at LinkedIn (ATS)
  8. 8. World Without PoPs Browser Data Center connection time 250ms
  9. 9. World Without PoPs Browser Data Center connection time server compute time 250ms 500ms
  10. 10. World Without PoPs Browser Data Center connection time first byte time + page download time 5 RTTs = 5x250ms = 1250ms server compute time 250ms Total = 2000ms 500ms
  11. 11. With PoPs Browser Data CenterPoP 100ms 250ms
  12. 12. With PoPs Browser Data CenterPoP 100msconnection time
  13. 13. With PoPs Browser Data CenterPoP 100msconnection time first byte time + page download time server compute time 500ms
  14. 14. With PoPs Browser Data CenterPoP 100msconnection time 5 RTTs = 5x100ms = 500ms Total = 1100ms 900 ms gain!first byte time + page download time 500ms server compute time
  15. 15. How are users assigned to PoPs? Through 3rd party DNS services IP handed based on user’s resolver location # Spain $ dig @109.69.8.51 +short www.linkedin.com 91.225.248.80 # California $ dig +short www.linkedin.com 216.52.242.80
  16. 16. Should India connect to Singapore or Dublin? How to assure optimal PoPs assignment? • Geographical Distance • Knowledge about Internet Connectivity
  17. 17. Lets Make Data-based Decisions Synthetic measurements? No realistic agent distribution across networks No realistic agent connectivity distribution Can our users become measurements agent? • RUM!
  18. 18. RUM beacons Fetch a tiny object from each candidate PoP For each pop_name, 1. Start timer 2. Fetch {pop_name}.perf.linkedin.com/pop/admin 3. Stop timer Send data back to our servers • Millions of agents! • Analyze data to find “optimal” PoP per country
  19. 19. We can assign countries to new PoPs! Country PoP Median Beacon Time China Hong Kong 434 China Dublin 1216 China Singapore 515 India Hong Kong 1368 India Dublin 1042 India Singapore 898
  20. 20. We can audit current assignment! Country Is PoP optimal? Current PoP Optimal PoP India TRUE Singapore Singapore Pakistan FALSE Singapore Dublin Spain TRUE Dublin Dublin Brazil FALSE US West Coast US East Coast Netherlands TRUE Dublin Dublin UAE FALSE US West Coast Dublin Italy TRUE Dublin Dublin Mexico TRUE US West Coast US West Coast Russia FALSE US West Coast Dublin
  21. 21. 0% 5% 10% 15% 20% 25% 30% India Pakistan Singapore Russia Brazil PercentageImprovement LinkedIn Homepage Download Time Improvement Median Improvement 90th Percentile Improvement
  22. 22. Trust, but Verify Is 100% of India connecting to Singapore PoP? Can RUM identify which PoP served this page view? 1. RUM is JavaScript 2. PoP selected through DNS 3. JavaScripts can’t know DNS answers  JavaScripts can read response headers
  23. 23. Can RUM identify which PoP served this page view? After page load: 1. Fetch www.linkedin.com/pop/admin 2. Read X-Li-Pop header 3. Add that data to RUM and send it back Offline analysis: 1. What % of traffic in India going to Singapore PoP? 2. A/B testing PoP assignment.
  24. 24. Plot Twist: Assignment far from optimal • About 31% of US traffic gets assigned to a suboptimal PoP. – 45% of East Coast • About 10% of traffic globally gets assigned to a suboptimal PoP.
  25. 25. DNS PoP assignment is suboptimal • Assignment based on Resolver IP, not Client IP DNS Resolver PoP US East PoP US West New YorkCalifornia
  26. 26. DNS PoP assignment is suboptimal • Assignment based on Resolver IP, not Client IP • Bad IP to Geo databases – Resolver really in NY, but database says CA
  27. 27. Story so far 1. We built PoPs 2. …used RUM to assign users to Optimal PoPs 3. …found DNS based assignment is suboptimal
  28. 28. Accurate PoP assignment Problem • Bug our DNS providers (31% -> 27%) • Run our own DNS How about Anycast?
  29. 29. Anycast – One IP, Multiple Servers PoP A PoP B PoP C Bob 1.1.1.1 1.1.1.1 1.1.1.1 • Unicast – one IP per server • Seamless support at routing layer • Packets routed to closest server by # of hops  Client IP, not Resolver IP used!  No Geo-IP Databases
  30. 30. How does Anycast compare to DNS? Will anycast send more users to optimal PoP? Lets test it!
  31. 31. RUM to rescue For each PoP: 1. Announce same anycast IP (108.174.13.10) 2. Configure a domain ac.perf.linkedin.com to point to 108.174.13.10
  32. 32. RUM to rescue For each page view: 1. RUM downloads a tiny object : ac.perf.linkedin.com/pop/admin 2. Read X-Li-Pop response header to record which PoP served the object 3. Send this back to LinkedIn with RUM data Data: 1. For each user, the anycast PoP 2. For each user, the optimal PoP (from pop beacons)
  33. 33. Results  Region or Country DNS % Optimal Assignment Anycast % Optimal Assignment Illinois 70 90 Florida 73 95 Georgia 75 93 Pennsylvania 85 95
  34. 34. Results  Region or Country DNS % Optimal Assignment Anycast % Optimal Assignment Arizona 60 39 Brazil 88 33 New York 77 74
  35. 35. Fewer hops != Lower Latency • Carriers prefer to haul packets within their own network • Peering can create inter-continental short cuts Z X Alice Y 1.1.1.1 1.1.1.1 1.1.1.1
  36. 36. Maybe DNS wasn’t so bad Continent level identification City / State level identification
  37. 37. “Regional” Anycast 1 anycast IP per continent Ran a RUM experiment, all was fine Z X Alice Y 2.2.2.2 1.1.1.1 1.1.1.1
  38. 38. 50.00 55.00 60.00 65.00 70.00 75.00 80.00 85.00 90.00 95.00 100.00 20141206 20141208 20141210 20141212 20141214 20141216 20141218 %TrafficgoingtoOptimalPoP Date Illinois Florida North Carolina Indiana New York New Jersey Virginia West Virginia Louisiana USA Ramp Results Ramp outside USA In progress
  39. 39. Story so far 1. We built PoPs 2. …used RUM to assign users to Optimal PoPs 3. …found DNS based assignment is suboptimal 4. …evaluated Anycast as a solution using RUM 5. …now using Anycast to assign users to PoPs Next play: • Build more PoPs!
  40. 40. Build More PoPs: But where? • Previously, based on – Intuition – Knowledge • Can we use RUM data? – Get a list of candidate PoP locations – Build a machine learning model – Rank locations based on most performance benefit
  41. 41. Plot Twist: Miami vs Sao Paolo • Can we use RUM again? – Fetch a tiny object and measure latency – But we have no servers there  Our CDNs do! 1. Get IP address of CDN servers 2. Configure a domain to point to the IP 3. RUM to fetch a tiny object ( /cdo/rum/id ) 4. Measure time and report back
  42. 42. Miami vs Sao Paolo - Results  Sao Paolo is better than US East Coast  Miami is better than Sao Paolo for many in South America Country Median beacon time Miami Sao Paolo US East Coast Columbia 648 970 778 Argentina 918 970 1213 Brazil 868 549 1074
  43. 43. Story: Recap 1. We built PoPs 2. …used RUM to assign users to Optimal PoPs 3. …found DNS based assignment is suboptimal 4. …evaluated Anycast as a solution using RUM 5. …now using Anycast to assign users to PoPs 6. …new PoPs locations chosen using RUM 10-25% gain 30% suboptimal in US “Regional” Anycast 30% -> 10%
  44. 44. Story: The End • Clients are your measurement agents • Make data-driven decisions • Trust, but verify • Collaboration leads to bigger impact
  45. 45. Questions? in/RiteshMaheshwari@ritesh
  46. 46. ©2014 LinkedIn Corporation. All Rights Reserved.©2014 LinkedIn Corporation. All Rights Reserved.
  47. 47. What about Mobile? • Packet Gateways are likely to be in the same part of the country as users • Wired measurements should be applicable for Mobile – India wired users are closer to Singapore, Mobile should be similar
  48. 48. Is Synthetic Monitoring dead? No, not yet • Much more powerful than RUM • Not everything can be done in production
  49. 49. Next Play • Keep evaluating Anycast • Keep building new PoPs • Dynamic RUM-based PoP assignment

×