Making Facebook Faster


Slides from the talk on frontend performance engineering delivered at Velocity 2009 by David Wei and Changhao Jiang

  1. 1. (c) 2009 Facebook, Inc. or its licensors.  "Facebook" is a registered trademark of Facebook, Inc.. All rights reserved. 1.0 Sunday, September 27, 2009 1
  2. 2. Making Facebook faster: Frontend performance engineering. David Wei and Changhao Jiang. Velocity 2009, Jun 24, 2009, San Jose, CA.
  3. 3. Agenda: 1. Site speed matters; 2. Performance monitoring; 3. Static resource management; 4. Ajaxification; 5. Client side cache.
  4. 4. Site speed matters!
     Speaker notes: First things first: site speed matters.
  5. 5. Site speed matters: large scale. 200 million users, more than 4 billion page views / day. ▪ 10 ms per page = more than 1 man-year per day = more than 5 human lifetimes per year.
     Speaker notes: Facebook cares about site speed. At our scale, our 200 million users generate more than 4 billion page loads per day. If we speed up each page load by 10 ms, in aggregate we save our users more than 1 man-year of time per day; accumulated over a year, that is more than 5 human lifetimes. Site speed also affects our bottom line. Experiments show that if we reduce latency by 600 ms, the user click rate improves by more than 5%. We are currently running an in-depth experiment on the impact of latency.
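
A minimal sketch of the arithmetic behind this slide. The page-view count and the 10 ms saving come from the slide; the 80-year lifetime is an assumption made here only to turn years into "lifetimes".

```javascript
// Back-of-the-envelope check of the slide's claim.
const PAGE_VIEWS_PER_DAY = 4e9;    // from the slide: more than 4 billion page views / day
const SAVED_MS_PER_PAGE = 10;      // from the slide: 10 ms per page
const ASSUMED_LIFETIME_YEARS = 80; // assumption, not stated in the talk

const SECONDS_PER_DAY = 24 * 60 * 60;
const DAYS_PER_YEAR = 365;

function savedTime(pageViewsPerDay, savedMsPerPage) {
  const savedSecondsPerDay = (pageViewsPerDay * savedMsPerPage) / 1000;
  const manYearsPerDay = savedSecondsPerDay / SECONDS_PER_DAY / DAYS_PER_YEAR;
  const manYearsPerYear = manYearsPerDay * DAYS_PER_YEAR;
  return {
    manYearsPerDay,
    lifetimesPerYear: manYearsPerYear / ASSUMED_LIFETIME_YEARS,
  };
}

const saved = savedTime(PAGE_VIEWS_PER_DAY, SAVED_MS_PER_PAGE);
console.log(saved.manYearsPerDay.toFixed(2));   // 1.27
console.log(saved.lifetimesPerYear.toFixed(1)); // 5.8
```

Under these assumptions the numbers agree with the slide: a little over one man-year per day, and between five and six lifetimes per year.
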
  6. 6. Site speed matters: emerging • Agile development
     Speaker notes: On the other hand, a site like Facebook faces huge challenges in site performance optimization. Here are a few major ones. Move fast, no stable code base. Fast development: every week we release a new version of the site with hundreds of code changes, and tens of small code changes are pushed every day. So the code base is never stable, and there is no time to stop for pure optimization.
  7. 7. Site speed matters: emerging • Agile development • Deep integration
     Speaker notes: Deep integration: each Facebook home page is customized for a particular user, with features developed by many teams. Some of them are applications by third-party developers and some are internal Facebook features, depending on the user's adoption of the features and applications. It also takes a lot of JavaScript to run them.
  8. 8. Site speed matters: emerging • Agile development • Deep integration • Viral adoption
     Speaker notes: Viral adoption: it is very hard to predict whether a feature released today will be used by 1 million users or 10 million users next week, so it is difficult to optimize beforehand. The infrastructure has to adapt to the growth of user adoption.
  9. 9. • Agile development • Deep integration • Viral adoption • Heavily interactive
     Speaker notes: … this talk, we will share our experience on how to make a site faster with these challenges. Heavy interaction: our pages have many dynamic features that rely on JavaScript. For example, the in-browser chat and the application dock provide a very convenient user experience, but they also take a lot of JavaScript to run.
  10. 10. Site speed matters: emerging • Agile development • Deep integration • Viral adoption • Heavily interactive
      Speaker notes: In summary, we have a lot of challenges. These challenges are actually essential to making Facebook a paradise for people who want to build new things: you can write something cool tonight and push it out tomorrow to 200 million users. At the same time, they make site performance hard to predict and maintain. In this talk, we will share our experience on how to optimize frontend performance with these challenges.
  11. 11. Site speed: end-to-end latency experienced by ▪ From a user request to the presentation of the page at the browser, interactive: ▪ Network Transfer Time ▪ Server Generation Time ▪ Client Render Time
      [Diagram labels: Browsers (Render); Content Distribution Network (CDN); FB Server; NetTime; GenTime.]
      Speaker notes: Before going into details, we define our problem domain. We define end-to-end user latency as the time from when the user starts a page request to when the page is presented in the browser and is interactive. There are three components of latency in this process: network transfer time is the time from the user's browser to the Facebook server and back; server generation time is the time spent on the Facebook servers; and client render time is the time the browser spends parsing the HTML, loading JavaScript/CSS/images and rendering the content.
  12. 12. Site speed: end-to-end latency experienced by: User latency = RenderTime + NetTime + GenTime ▪ RenderTime: ~50% of end-user latency ▪ NetTime: ~25% of end-user latency ▪ GenTime: ~25% of end-user latency
      Speaker notes: Looking at Facebook's user latency, client-side render time is about 50% of the end-to-end latency; network time and server-side generation time are about 25% each.
  13. 13. Site speed: end-to-end latency experienced by: User latency = RenderTime + NetTime + GenTime ▪ RenderTime: ~50% of end-user latency ▪ NetTime: ~25% of end-user latency ▪ GenTime: ~25% of end-user latency
      Speaker notes: In this talk, we focus on the biggest chunk: render time.
  14. 14. Cavalry: Site speed monitoring
  15. 15. User-based measurement. What’s our speed? ▪ Sampling 1/10000 page loads
      [Diagram labels: Server; First bytes of HTML; JS; All content loaded, Page Interactive; Report.]
      Speaker notes: To make the site faster, the first question we want to ask is: what is our site speed? There are usually two approaches: run in-house tests, or sample real users. We did both and found the second approach much more helpful. We learned lessons from the first approach: our pages are vastly different for different users, and Facebook employees are most likely to be outliers, because they tend to have many more features and functionalities than normal users and have many plugins installed, such as Firebug and IE developer tools. Even finding a “typical” user is hard, as the usage behavior of our users changes all the time. Our approach is to take samples from our users. We run JavaScript measurement on a sample of users, 1/10000, to measure the real speed. The red arrows are the events that we record. This gives us a real picture of what site speed looks like for Facebook. By the way, we load the JavaScript before our CSS, because the JavaScript files are loaded in parallel, along with CSS and images.
  16. 16. User-based measurement. What’s our speed? ▪ Sampling 1/10000 page loads
      [Diagram labels: Server; First bytes of HTML; JS; All content loaded, Page Interactive; Report.]
      Speaker notes: The last thing I want to point out on this slide is that we load the JavaScript before our CSS, which violates the common best practice of putting CSS in front of JS. However, we download most of our JavaScript in parallel. If we put JS at the top, JS, CSS and images all load in parallel. Half a year ago, we tested this and found it faster. We are running another set of experiments to see whether things have changed.
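
The user-based measurement described on these two slides could be sketched roughly as follows. This is only an illustration: the function names, the event names and the reporting step are assumptions, and the talk does not show Cavalry's actual code.

```javascript
// Hypothetical sketch; names and structure are illustrative, not Cavalry's code.
const SAMPLE_RATE = 1 / 10000; // from the slide: 1 in 10,000 page loads

// Decide once per page load whether to measure it.
// `random` is injectable so the decision can be tested deterministically.
function shouldSample(rate, random = Math.random) {
  return random() < rate;
}

// Records named events as millisecond offsets from the moment the timer is created.
function createPageTimer(now = Date.now) {
  const start = now();
  const events = {};
  return {
    mark(name) { events[name] = now() - start; },
    report() { return { ...events }; },
  };
}

// Usage: create the timer as early as possible and mark events as they happen.
const timer = shouldSample(SAMPLE_RATE) ? createPageTimer() : null;
if (timer) {
  timer.mark('firstBytes');
  timer.mark('allContentLoaded');
  console.log(timer.report()); // in a browser this would be sent to the server instead
}
```

Sampling on the client keeps the overhead negligible for almost all users while still yielding a large number of real-world data points at Facebook's traffic levels.
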
  17. 17. Cavalry: Day-to-day monitoring. What’s our speed? ▪ Collect gen time / network transfer time and render time
      [Diagram labels: GenTime; Network Time; Browser onload time; Cavalry Logs; Daily site speed monitoring.]
      Speaker notes: We combine the JS measurement with our server-side measurements of page generation time and network round-trip time, and put it all into a database. Now we can yell to the company, “Hey, the site is slower today!” However, we still don’t know who caused it. We are continuously launching different features every week, and it is hard to stop and test for performance.
  18. 18. Cavalry: Project-based analysis. Who made it faster / slower? ▪ Integrated with Launch System
      [Diagram labels: GenTime; Network Time; Browser onload time; Cavalry Logs; Launch System; Daily site speed monitoring; Project-based regression detection.]
      Speaker notes: 1. The second step of our measurement is to hook the logs up to our launch system. For each measurement sample, we record which new features were launched in that page load. 2. When there is a regression, we can go over the samples and identify the feature launch that caused it. 3. This makes the corresponding team much more responsive to a regression. 4. Then there is still a question: “Why is it slow? How can I fix it?”
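
One simple way to picture project-based regression detection is to compare samples that had a feature enabled against those that did not. The sample shape and the feature names below are hypothetical, and a plain difference of means is a deliberately naive stand-in for whatever analysis the real system performs.

```javascript
// Hypothetical sample shape: { renderMs: number, features: string[] }.
// Returns mean(renderMs with feature) - mean(renderMs without feature), or null
// when one of the two groups is empty and no comparison is possible.
function featureImpact(samples, feature) {
  const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const withFeature = samples
    .filter((s) => s.features.includes(feature))
    .map((s) => s.renderMs);
  const withoutFeature = samples
    .filter((s) => !s.features.includes(feature))
    .map((s) => s.renderMs);
  if (withFeature.length === 0 || withoutFeature.length === 0) return null;
  return mean(withFeature) - mean(withoutFeature);
}

const samples = [
  { renderMs: 1200, features: ['newChat'] },
  { renderMs: 1300, features: ['newChat', 'giftShop'] },
  { renderMs: 900, features: [] },
  { renderMs: 1000, features: ['giftShop'] },
];
console.log(featureImpact(samples, 'newChat'));  // 300
console.log(featureImpact(samples, 'giftShop')); // 100
```

In this made-up data, `newChat` stands out as the larger suspect. A real analysis would also need to account for sample size and for features that tend to launch together.
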
  19. 19. Cavalry: Numeric metrics. Why are we fast / slow? How can I fix it? ▪ YSlow-like technical metrics
      [Diagram labels: GenTime; Network Time; Browser onload time; YSlow-like metrics; Cavalry Logs; Gate Keeper; Daily site speed monitoring; Project-based regression detection; Regression analysis.]
      Speaker notes: To answer the “why” question, YSlow is a good tool. 1. We instrument a subset of the YSlow metrics into our sampled page loads. We measure the number of images, DOM nodes, script tags, HTML bytes, CSS rules, etc. These metrics can indicate what causes a perf regression. 2. The missing piece is that we still don’t have a mapping from the YSlow metrics to actual time (msec).
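
A few of the counts mentioned in the notes can be collected with standard DOM calls. The sketch below is an assumption about how such instrumentation might look; `collectPageMetrics` is a made-up name, and it takes the document as a parameter so that it can also be exercised outside a browser.

```javascript
// `doc` is anything that offers getElementsByTagName, e.g. the browser's `document`.
// `htmlText` is the page's HTML source as a string.
function collectPageMetrics(doc, htmlText) {
  const count = (tag) => doc.getElementsByTagName(tag).length;
  return {
    images: count('img'),
    scripts: count('script'),
    domNodes: count('*'),       // '*' matches every element in the document
    htmlChars: htmlText.length, // character count, only an approximation of bytes
  };
}

// In a browser one might call:
//   collectPageMetrics(document, document.documentElement.outerHTML)
```

Metrics like these are cheap to gather on a sampled page load and can be logged next to the timing data for later regression analysis.
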
  20. 20. “WWW” in performance monitoring: What? Who? Why? ▪ User-based measurement: unbiased, representative results ▪ Feature-launch integration: identify the regression ▪ Technical metrics: define actionable items for improvement
      Speaker notes: The missing part is prioritization: how many milliseconds do we save if we reduce the number of CSS rules by 10%, versus moving the JS down to the bottom?
  21. 21. Haste: Static resource management
  22. 22. Why we need SR Management? • Day 1: Some smart engineers start a project! “Let’s write a new page with features A, B and C!”
          <Print css tag for feature A>
          <Print css tag for feature B>
          <Print css tag for feature C>
          <print HTML of feature A>
          <print HTML of feature B>
          <print HTML of feature C>
  23. 23. Why we need SR Management? • Day 2: Some smart engineers run PageSpeed and think… “A & B & C are always used; let’s package them together!”
          <Print css tag for feature A>
          <Print css tag for feature B>
          <Print css tag for feature C>
          <print HTML of feature A>
          <print HTML of feature B>
          <print HTML of feature C>
  24. 24. Why we need SR Management? • Day 2: Awesome!
          <Print css tag for feature A&B&C>
          <print HTML of feature A>
          <print HTML of feature B>
          <print HTML of feature C>
          …
  25. 25. Why we need SR Management? • Day 3: feature C evolves…
          <Print css tag for feature A & B & C>
          <print HTML of feature A>
          <print HTML of feature B>
          If (users_signup_for_C()) { <print HTML of feature C> }
          …
  26. 26. Why we need SR Management? • Day 3: A&B are always used, while C is not...
          <Print css tag for feature A & B & C>
          <print HTML of feature A>
          <print HTML of feature B>
          If (users_signup_for_C()) { <print HTML of feature C> }
          …
  27. 27. Why we need SR Management? • Day 4: feature C is deprecated
          <Print css tag for feature A & B & C>
          <print HTML of feature A>
          <print HTML of feature B>
          // no one uses C { <print HTML of feature C> }
          …
  28. 28. Why we need SR Management? • Day 4: we start to send unused bits. It is hard to remember we should remove C here.
          <Print css tag for feature A & B & C>
          <print HTML of feature A>
          <print HTML of feature B>
          // no one uses C { <print HTML of feature C> }
          …
  29. 29. Why we need SR Management? • One month later… Thousands of dead CSS rules in the package.
          <Print css tag for feature A & B & C & D & E & F & G…>
          if (F is used) <print HTML of feature F>
          <print HTML of feature G>
          if (F is not used) { <print HTML of feature E> }
          …
  30. 30. Static Resource Management @
      Challenges: • Deep Integration • Viral Adoption • Agile Development
      Responses: • Separate requirement declaration and delivery of static resources • Requirement declaration: lives with HTML generation • Delivery: globally optimized
      Speaker notes: Deep integration: each page has many features. Viral adoption: usage patterns change quickly. Agile development: features change fast.
  31. 31. Haste: Static Resource Management. Separate Declaration from actual Delivery • Back to Day 1:
          require_static(A_css); <render HTML of feature A>
          require_static(B_css); <render HTML of feature B>
          require_static(C_css); <render HTML of feature C>
          <deliver all required CSS>
          <print all rendered HTML>
      Requirement Declaration lives with HTML. Global Optimization on Delivery.
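
The separation of declaration from delivery on this slide can be sketched in JavaScript as follows. The slide only shows `require_static()` in pseudocode; every other name here, including the `/css/` URL scheme, is a hypothetical choice made for illustration.

```javascript
// Collects resource requirements while features render; decides delivery afterwards.
function createStaticResourceCollector() {
  const required = new Set(); // keeps insertion order and removes duplicates
  return {
    // Declaration: called next to the HTML generation of each feature.
    requireStatic(name) { required.add(name); },
    // Delivery: decided in one place, after all features have rendered.
    deliverCss(toUrl) {
      return [...required]
        .map((name) => `<link rel="stylesheet" href="${toUrl(name)}">`)
        .join('\n');
    },
  };
}

function renderPage(features) {
  const resources = createStaticResourceCollector();
  const body = features.map((f) => f.render(resources)).join('\n');
  const head = resources.deliverCss((name) => `/css/${name}.css`); // hypothetical URL scheme
  return head + '\n' + body;
}

const featureA = { render(r) { r.requireStatic('A_css'); return '<div>A</div>'; } };
const featureB = { render(r) { r.requireStatic('B_css'); return '<div>B</div>'; } };
console.log(renderPage([featureA, featureB]));
```

Because each feature declares its own requirement, removing a feature removes its CSS from the page automatically, which is the failure that the "Day 4" slides illustrate. The delivery step is also the single place where packaging decisions can later be swapped in.
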
  32. 32. Haste: Global Optimization
      Online process:
          require_static(A_css); <render HTML of feature A>
          require_static(B_css); <render HTML of feature B>
          require_static(C_css); <render HTML of feature C>
          <deliver all required CSS>
          <print all rendered HTML>
      Offline analysis: Usage Pattern logs; Clustering algorithms; “Optimal” packages.
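
The talk does not say which clustering algorithm is used offline. As a deliberately simple stand-in, the sketch below groups resources that were required by exactly the same set of logged page loads, so that resources always used together end up in one package. The trace format and the function name are assumptions.

```javascript
// traces: array of arrays, each listing the resources one logged page load required.
// Returns packages: arrays of resources that share an identical usage pattern.
function packageByUsage(traces) {
  const usage = new Map(); // resource -> indexes of the traces that used it
  traces.forEach((resources, i) => {
    for (const r of new Set(resources)) {
      if (!usage.has(r)) usage.set(r, []);
      usage.get(r).push(i);
    }
  });
  const packages = new Map(); // usage signature -> resources with that signature
  for (const [resource, hits] of usage) {
    const signature = hits.join(',');
    if (!packages.has(signature)) packages.set(signature, []);
    packages.get(signature).push(resource);
  }
  return [...packages.values()].map((p) => p.sort());
}

const traces = [
  ['A.css', 'B.css', 'C.css'],
  ['A.css', 'B.css'],
  ['A.css', 'B.css', 'C.css'],
];
console.log(packageByUsage(traces)); // [ [ 'A.css', 'B.css' ], [ 'C.css' ] ]
```

With these made-up traces, A and B are packaged together because they always appear together, while C stays separate because some page loads do not need it. A production system would need a looser notion of similarity than exact equality.
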
  33. 33. Haste: Trace-based Packaging, Nov 2008 => May 2009

      | Date     | # of JS files | # of JS bytes | # of pkg at a home.php | # of bytes at a home.php |
      |----------|---------------|---------------|------------------------|--------------------------|
      | Nov 2008 | 461           | 4.4 MB        | 29                     | 629 KB                   |
      | May 2009 | 729           | 5.9 MB        | 14                     | 560 KB                   |

      Speaker notes: The number of JS files increased by about 60% and their byte size by about 30%, while the number of packages sent was halved and the byte size sent is about 10% less.
          find | grep -v .svn | grep -v intern | grep .css$ -c
          find | grep -v .svn | grep -v intern | grep .css$ | xargs cat > /tmp/dwei_2008
  34. 34. Haste: Trace-based Packaging, Nov 2008 => May 2009. Example package:
          'js/careers/jobs.js',
          'js/lib/ui/timeeditor.js',
          'resume/js/resumepro.js',
          'resume/js/resumesection.js'
      Speaker notes: Developers think that timeeditor.js is a library file; in fact, it is only used on one production page (careers). On the other hand, it turns out that the “resume” function is almost always used on the careers page.
  35. 35. Haste: Trace-based Packaging, Nov 2008 => May 2009

      | Date     | # CSS files | # of CSS bytes | # of pkg at a home.php | # of bytes at a home.php |
      |----------|-------------|----------------|------------------------|--------------------------|
      | Nov 2008 | 487         | 1.7 MB         | 24                     | 69 KB                    |
      | May 2009 | 706         | 1.9 MB         | 15                     | 64 KB                    |

      Speaker notes: CSS is a similar story.
  36. 36. Haste: Trace-based Analysis. Potentials for image sprites too! • Thousands of virtual gifts with static images, which to sprite?
      Speaker notes: The same trace-based analysis techniques can be used in image spriting too.
  37. 37. Haste: Trace-based Analysis. Potentials for image sprites too! • The answer is…
      Speaker notes: The answer is… In retrospect, this is pretty straightforward.
  38. 38. Haste: Trace-based Analysis. Adaptive Performance Optimization • JS / CSS package optimization • Guidance for image spriting • Guidance of progressive rendering
      Speaker notes: Once we separate the declaration and delivery of static resources, there are many areas for automatic optimization with trace analysis. You can do automatic packaging, automatic spriting, and also automatic progressive rendering: you can look at the most frequently used resources and flush them out before generating the page.
  39. 39. Quickling: Ajaxify the Facebook site
  42. 42. Remove redundant work via Ajax
      [Diagram: with full page loads, each of Page 1 to Page 4 in a user session has its own load and unload; with Ajax calls, Page 1 to Page 4 share a single load and unload for the whole session.]
  48. 48. How Quickling works? 1. User clicks a link or back/forward button 2. Quickling sends an Ajax request to the server 3. Response arrives 4. Quickling blanks the content area 5. Download JavaScript/CSS 6. Show new content
  49. 49. LinkController: Intercept user clicks on links ▪ Dynamically attach a handler to all link clicks:

          $('a').click(function() {
            // 'payload' is a JSON-encoded response from the server
            $.get(this.href, function(payload) {
              // Dynamically load 'js', 'css' resources for this page.
              bootload(payload.bootload, function() {
                // Swap in the new page's content
                $('#content').html(payload.html);
                // Execute the onloadRegister'ed js code
                execute(payload.onload);
              });
            });
            // Keep the browser from following the link with a full page load.
            return false;
          });
  50. 50. HistoryManager: Enable ‘Back/Forward’ buttons for AJAX requests ▪ Set target page URL as the fragment of the URL ▪ http://www.facebook.com/home.php ▪ http://www.facebook.com/home.php#/cjiang?ref=profile ▪ http://www.facebook.com/home.php#/friends/?ref=tn
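
The URL scheme on this slide can be illustrated with two small helpers. The helper names are hypothetical; the slide only shows the resulting URLs, not how HistoryManager builds or parses them.

```javascript
// Builds a URL whose fragment carries the target page, as in the slide's examples.
function toFragmentUrl(baseUrl, targetPath) {
  return baseUrl.split('#')[0] + '#' + targetPath;
}

// Extracts the target page from such a URL, or null if there is no fragment.
function targetFromFragmentUrl(url) {
  const i = url.indexOf('#');
  return i === -1 || i === url.length - 1 ? null : url.slice(i + 1);
}

const url = toFragmentUrl('http://www.facebook.com/home.php', '/cjiang?ref=profile');
console.log(url);                        // http://www.facebook.com/home.php#/cjiang?ref=profile
console.log(targetFromFragmentUrl(url)); // /cjiang?ref=profile
```

Changing only the fragment does not trigger a full page load, yet it still creates a browser history entry, which is what lets the Back and Forward buttons work. Detecting fragment changes (for example with the `hashchange` event in current browsers) is left out of this sketch.
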
  51. 51. Bootloader: Load static resources via ‘script’, ‘link’ tag injection

          function requestResource(type, source) {
            var h = document.getElementsByTagName('head')[0];
            switch (type) {
              case 'js':
                var script = document.createElement('script');
                script.src = source;
                script.type = 'text/javascript';
                h.appendChild(script);
                break;
              case 'css':
                var link = document.createElement('link');
                link.rel = "stylesheet";
                link.type = "text/css";
                link.media = "all";
                link.href = source;
                h.appendChild(link);
                break;
            }
          }
  52. 52. Other details ▪ All pages now share a single global JavaScript scope: ▪ Explicitly reclaim resources or reset states before leaving a page ▪ Stub out setTimeout and setInterval ▪ All CSS rules will be accumulated ▪ Name-spacing CSS rules with page-specific information ▪ Busy indicator ▪ iframe transport ▪ Permanent link ▪ prelude inlined js code to redirect if necessary
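
One plausible reading of "stub out setTimeout and setInterval" is to route timers through a wrapper that remembers them, so that timers started by one page can be cancelled before the next page's content is swapped in. The sketch below assumes that reading; `createPageTimers` and its methods are hypothetical names.

```javascript
// Tracks timers started by the current page so they can be cancelled when leaving it.
// `host` provides the real timer functions (the global object by default).
function createPageTimers(host = globalThis) {
  const timeouts = new Set();
  const intervals = new Set();
  return {
    setTimeout(fn, ms) {
      const id = host.setTimeout(() => {
        timeouts.delete(id); // a fired timeout no longer needs tracking
        fn();
      }, ms);
      timeouts.add(id);
      return id;
    },
    setInterval(fn, ms) {
      const id = host.setInterval(fn, ms);
      intervals.add(id);
      return id;
    },
    // Call before swapping in the next page's content.
    clearAll() {
      timeouts.forEach((id) => host.clearTimeout(id));
      intervals.forEach((id) => host.clearInterval(id));
      timeouts.clear();
      intervals.clear();
    },
    pendingCount() { return timeouts.size + intervals.size; },
  };
}
```

Without something like this, a polling loop started on one page would keep running after an Ajax navigation, because the shared global scope is never torn down.
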
  53. 53. Current status ▪ Turned on for Firefox and IE users (>90% of users) ▪ ~60% of page hits to the Facebook site are Quickling requests
  54. 54. Performance improvement: 40% ~ 50% reduction in render time
  55. 55. PageCache: Cache visited pages at client side
  56. 56. PageCache: Cache user visited pages in browsers ▪ Motivation: ▪ A typical user session: ▪ home -> profile -> photo -> home -> notes -> home -> photo -> photo ▪ Some pages are likely to be revisited soon (temporal locality) ▪ Home page visited every 3 ~ 5 page views ▪ Back/Forward button
  58. 58. How PageCache works? 1. User clicks a link or back button 2. Quickling sends ajax to server 3. Response arrives 3.5 Save response in cache 4. Quickling blanks the content area 5. Download javascript/CSS 6. Show new content
  60. 60. How PageCache works? 1. User clicks a link or back button 2. Find Page in the cache 3. Response arrives 4. Quickling blanks the content area 5. Download javascript/CSS 6. Show new content
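
The two flows above (save the response on a miss, look it up on a revisit) can be sketched as a small wrapper around the Ajax request. The names `createPageLoader`, `fetchPage` and `showPage` are assumptions; the cache is a plain in-memory map, in line with the "only cache in memory" status reported later in the talk.

```javascript
// fetchPage(url, callback) performs the Ajax request; showPage(payload, info) renders it.
function createPageLoader(fetchPage, showPage) {
  const cache = new Map(); // url -> payload, kept in memory only
  return function loadPage(url) {
    if (cache.has(url)) {
      // Revisit: step 2 becomes a cache lookup, so no server round trip is needed.
      showPage(cache.get(url), { fromCache: true });
      return;
    }
    fetchPage(url, (payload) => {
      cache.set(url, payload); // step 3.5: save response in cache
      showPage(payload, { fromCache: false });
    });
  };
}
```

This sketch ignores cache consistency entirely; the following slides describe three mechanisms for keeping a cached page up to date.
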
  63. 63. Cache consistency 1: Incremental updates. Poll server for incremental updates via ajax calls. ▪ Allow registering JavaScript functions to be called right before a cached page is shown. ▪ Used by the home page to refresh ads and fetch the latest stories. [Images: cached version; restored version.]
      Speaker notes: Provide functions that let programmers register a JavaScript function to be called right before a cached page is shown. Used by the home page to refresh ads and fetch the latest stories.
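
A registration API of the kind described here might look like the sketch below. The names `onBeforeShow` and `runBeforeShow` are hypothetical, as is the idea of keying the callbacks by page URL.

```javascript
// Lets page code register callbacks that run right before its cached copy is shown.
function createRestoreHooks() {
  const hooks = new Map(); // url -> array of callbacks
  return {
    onBeforeShow(url, callback) {
      if (!hooks.has(url)) hooks.set(url, []);
      hooks.get(url).push(callback);
    },
    // Called by the cache when it is about to restore `url` from `cachedPayload`.
    runBeforeShow(url, cachedPayload) {
      (hooks.get(url) || []).forEach((callback) => callback(cachedPayload));
    },
  };
}

const restoreHooks = createRestoreHooks();
restoreHooks.onBeforeShow('/home.php', (payload) => {
  // A real callback would poll the server here, e.g. for fresh ads and new stories.
  payload.needsRefresh = true;
});
```

Each callback receives the cached payload, so it can patch in whatever has changed since the page was stored rather than refetching the whole page.
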
  66. 66. Cache consistency 2: In-page writes. Record and replay ▪ Automatically record all state-changing operations in a cached page ▪ Automatically replay those operations when the cached page is restored. [Images: cached version; restored version.]
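
Record and replay can be pictured as an operation log kept next to each cached page. The sketch below is an assumption about the general shape only; the talk does not describe how operations are captured automatically, so here they are passed in explicitly as functions.

```javascript
// Keeps the state-changing operations performed on a page so they can be re-applied later.
function createOperationLog() {
  const log = [];
  return {
    // Apply a state-changing operation now and remember it for later replay.
    record(operation, state) {
      log.push(operation);
      operation(state);
    },
    // Re-apply every recorded operation, in order, to a restored copy of the page state.
    replay(state) {
      log.forEach((operation) => operation(state));
      return state;
    },
  };
}

const opLog = createOperationLog();
const liveState = { liked: false, comments: [] };
opLog.record((s) => { s.liked = true; }, liveState);
opLog.record((s) => { s.comments.push('Nice photo!'); }, liveState);

// Later, the cached (stale) version is restored and brought up to date.
const restoredState = opLog.replay({ liked: false, comments: [] });
console.log(restoredState); // { liked: true, comments: [ 'Nice photo!' ] }
```

Replaying the log on the stale copy reproduces the user's own changes without asking the server for the page again.
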
  70. 70. Cache consistency 3: Cross-page writes. Server side invalidation ▪ Instrument the server-side database access API; whenever a write operation is detected, send a signal to the client to invalidate the cache. [Images: cached version; state-changing op; restored version.]
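
On the client, handling such a signal can be as small as deleting entries from the page cache. The talk does not describe the signal's format, so the `invalidate` field below is purely an assumption.

```javascript
// Assumes every server response may carry an `invalidate` array of cached page URLs
// (a hypothetical format). Returns how many entries were actually removed.
function applyInvalidation(cache, response) {
  let removed = 0;
  for (const url of response.invalidate || []) {
    if (cache.delete(url)) removed++;
  }
  return removed;
}

const pageCache = new Map([
  ['/home.php', { html: '<div>home</div>' }],
  ['/profile.php', { html: '<div>profile</div>' }],
]);
// A write made on another page causes the server to flag the cached home page as stale.
applyInvalidation(pageCache, { html: '...', invalidate: ['/home.php'] });
console.log([...pageCache.keys()]); // [ '/profile.php' ]
```

The next visit to an invalidated page then falls back to a normal Quickling request.
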
  71. 71. Current status ▪ Deployed in production ▪ Only caches in memory ▪ Only turned on for the home page
  72. 72. 20%: ~20% savings on page hits to home page
  73. 73. Performance improvement: 3X ~ 4X speedup in render time vs Quickling
  75. 75. Summary ▪ Performance monitoring: What, Who, and Why (“WWW”) ▪ Static resource management: Adaptive to fast evolution ▪ Ajaxify the website ▪ Client side caching of user visited pages
      Speaker notes: Measurement: we need to answer three questions: what is the speed, who made it faster or slower, and why it is faster or slower. Static resource management needs to adapt to the fast evolution of code changes and user adoption. Ajaxifying websites where pages in a user session share a lot of common work removes the redundant work and improves user-perceived performance. Caching users' visited pages on the client side reduces the servers' overall load and improves user-perceived performance.
  76. 76. Thank you!
