
Web Performance & Search Engines - A look beyond rankings

London Web Performance Meetup - 10th November 2020

There is a lot of talk about web performance as a ranking signal in search engines and how important it is (or isn't), but people often overlook how performance affects multiple phases of a search engine's pipeline, such as crawling, rendering, and indexing.

In this talk, we'll try to understand how a search engine works and how some aspects of web performance affect the online presence of a website.


  1. 1. Web Performance & Search Engines A look beyond rankings 2020/11/10 @giacomozecchini
  2. 2. Hi, I’m Giacomo Zecchini Technical SEO @ Verve Search Technical background and previous experiences in development Love: understanding how things work and Web Performance @giacomozecchini
  3. 3. We are going to talk about... @giacomozecchini
  4. 4. We are going to talk about... ● How Web Performance Affects Rankings @giacomozecchini
  5. 5. We are going to talk about... ● How Web Performance Affects Rankings ● How Search Engines Crawl and Render pages @giacomozecchini
  6. 6. We are going to talk about... ● How Web Performance Affects Rankings ● How Search Engines Crawl and Render pages ● How It Affects Your Website @giacomozecchini
  7. 7. How Web Performance Affects Rankings
  8. 8. Photo by Sam Balye on Unsplash Let’s talk about the elephant in the room
  9. 9. Search engines have been using, and talking about, speed as a ranking factor for a while ● Using site speed in web search ranking https://webmasters.googleblog.com/2010/04/using-site-speed-in-web-search-ranking.html ● Is your site ranking rank? Do a site review https://blogs.bing.com/webmaster/2010/06/24/is-your-site-ranking-rank-do-a-site-review-part-5-sem-101 ● Using page speed in mobile search ranking https://webmasters.googleblog.com/2018/01/using-page-speed-in-mobile-search.html @giacomozecchini
  10. 10. Bing - “How Bing ranks your content” Page load time: Slow page load times can lead a visitor to leave your website, potentially before the content has even loaded, to seek information elsewhere. Bing may view this as a poor user experience and an unsatisfactory search result. Faster page loads are always better, but webmasters should balance absolute page load speed with a positive, useful user experience. https://www.bing.com/webmaster/help/webmaster-guidelines-30fba23a @giacomozecchini
  11. 11. Yandex - “Site Quality” “How do I speed up my site? The speed of page loading is an important indicator of a site's quality. If your site is slow, the user may not wait for a page to open and switch to a different site. This undermines their trust in your site, affects traffic and other statistical indicators. https://yandex.com/support/webmaster/yandex-indexing/page-speed.html @giacomozecchini
  12. 12. Google - “Evaluating page experience for a better web” “Earlier this month, the Chrome team announced Core Web Vitals, a set of metrics related to speed, responsiveness and visual stability, to help site owners measure user experience on the web. Today, we’re building on this work and providing an early look at an upcoming Search ranking change that incorporates these page experience metrics.” https://webmasters.googleblog.com/2020/05/evaluating-page-experience.html @giacomozecchini
  13. 13. https://webmasters.googleblog.com/2020/05/evaluating-page-experience.html
  14. 14. Is speed important for ranking? Google’s Webmaster Trends Analyst https://twitter.com/methode/status/1255224116648476675 @giacomozecchini
  15. 15. Is speed important for ranking? There are hundreds of ranking signals; speed is one of them, but not the most important one. An empty page would be damn fast but not that useful. @giacomozecchini
  16. 16. Where does Google get data from for Core Web Vitals? @giacomozecchini
  17. 17. Where does Google get data from for Core Web Vitals? ● Real field data, something similar to the Chrome User Experience Report (CrUX) https://youtu.be/7HKYsJJrySY?t=45 @giacomozecchini
  18. 18. Where does Google get data from for Core Web Vitals? ● Real field data, something similar to the Chrome User Experience Report (CrUX) Likely a raw version of CrUX that may contain all the “URL-Keyed Metrics” that Chrome records. https://source.chromium.org/chromium/chromium/src/+/master:tools/metrics/ukm/ukm.xml @giacomozecchini
  19. 19. CrUX - Chrome User Experience Report The Chrome User Experience Report provides user experience metrics for how real-world Chrome users experience popular destinations on the web. It’s powered by real user measurement of key user experience metrics across the public web. https://developers.google.com/web/tools/chrome-user-experience-report @giacomozecchini
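As an aside (not in the slides): CrUX field data can also be queried programmatically. Below is a minimal sketch in TypeScript, assuming the public CrUX API and a placeholder API key; the response contains histograms and p75 values for metrics such as LCP, FID and CLS.

```typescript
// Minimal sketch: query the CrUX API for field data of a single URL.
// The API key is a placeholder; a URL with too little traffic returns 404 (no data).
const CRUX_ENDPOINT = 'https://chromeuxreport.googleapis.com/v1/records:queryRecord';

async function getCruxMetrics(url: string, apiKey: string) {
  const res = await fetch(`${CRUX_ENDPOINT}?key=${apiKey}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url, formFactor: 'PHONE' }),
  });
  if (!res.ok) throw new Error(`CrUX API returned ${res.status}`);
  const data = await res.json();
  return data.record.metrics; // histograms and percentiles, e.g. largest_contentful_paint
}
```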
  20. 20. What if I’m not in CrUX? CrUX uses a usage threshold for specific websites: if there is less data than that threshold, websites or pages are not included in the BigQuery / API dataset. @giacomozecchini
  21. 21. What if I’m not in CrUX? CrUX uses a usage threshold for specific websites: if there is less data than that threshold, websites or pages are not included in the BigQuery / API dataset. We can end up with: ● No data for a single page ● No data for the whole origin / website @giacomozecchini
  22. 22. What if CrUX has no data for my pages? @giacomozecchini
  23. 23. What if CrUX has no data for my pages? If the URL structure is easy to understand and there is a way to split your website into multiple parts by looking at the URL, Google might group pages per subfolder or URL pattern, grouping URLs that have similar content and resources. If that is not possible, Google may use the aggregate data across the whole website. https://youtu.be/JV7egfF29pI?t=848 @giacomozecchini
  24. 24. What if CrUX has no data for my pages? https://www.example.com/forum/thread-1231 This URL may use the aggregate data of URLs with similar /forum/ structure https://www.example.com/fantastic-product-98 This URL may use the subdomain aggregate data You should remember this if planning a new website. @giacomozecchini
  25. 25. What if CrUX has no data for my pages? Looking at the Core Web Vitals Report in Search Console, you can check how Google is already grouping “similar URLs” of your website. @giacomozecchini
  26. 26. What if CrUX has no data for my website? @giacomozecchini
  27. 27. What if CrUX has no data for my website? This is not really clear at the moment. @giacomozecchini
  28. 28. What if CrUX has no data for my website? Possible solutions: @giacomozecchini
  29. 29. What if CrUX has no data for my website? Possible solutions: ● Not using any positive or negative value for the Core Web Vitals @giacomozecchini
  30. 30. What if CrUX has no data for my website? Possible solutions: ● Not using any positive or negative value for the Core Web Vitals ● Using data over a longer period of time to have enough data (BigQuery CrUX data is aggregated on a monthly basis; the API uses the last 28 days of aggregated data) @giacomozecchini
  31. 31. What if CrUX has no data for my website? Possible solutions: ● Not using any positive or negative value for the Core Web Vitals ● Using data over a longer period of time to have enough data (BigQuery CrUX data is aggregated on a monthly basis; the API uses the last 28 days of aggregated data) ● Lab data, calculating a theoretical speed @giacomozecchini
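Where lab data is the fallback, one way to collect it is the Lighthouse Node API. The sketch below is illustrative only (it assumes the `lighthouse` and `chrome-launcher` npm packages and a recent Lighthouse version with an ES-module export); it is not a description of what Google itself does.

```typescript
// Sketch: run a headless Lighthouse performance audit and read the lab LCP value.
import lighthouse from 'lighthouse';
import { launch } from 'chrome-launcher';

async function labLcp(url: string): Promise<number | undefined> {
  const chrome = await launch({ chromeFlags: ['--headless'] });
  try {
    const result = await lighthouse(url, { port: chrome.port, onlyCategories: ['performance'] });
    return result?.lhr.audits['largest-contentful-paint']?.numericValue; // milliseconds
  } finally {
    await chrome.kill();
  }
}
```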
  32. 32. What if CrUX has no data for my website? We might have more information on this when Google starts using Core Web Vitals in Search (May 2021). https://webmasters.googleblog.com/2020/11/timing-for-page-experience.html @giacomozecchini
  33. 33. @giacomozecchini Let’s debunk a few myths...
  34. 34. Is Google using Page Speed / Lighthouse performance score for rankings? @giacomozecchini
  35. 35. Is Google using Page Speed / Lighthouse performance score for rankings? NO @giacomozecchini
  36. 36. What about AMP? @giacomozecchini
  37. 37. What about AMP? ● AMP is not a ranking factor, never has been @giacomozecchini
  38. 38. What about AMP? ● AMP is not a ranking factor, never has been ● Google will remove the AMP requirement from Top Stories eligibility in May, 2021 https://webmasters.googleblog.com/2020/05/evaluating-page-experience.html @giacomozecchini
  39. 39. How Search Engines Crawl And Render Pages
  40. 40. We can split what a Search Engine does into two main parts ● What happens when a user searches for something ● What happens in the background ahead of time @giacomozecchini
  41. 41. What happens when a user searches for something When a Search Engine gets a query from a user, it starts processing it, trying to understand the meaning behind the search, retrieving and scoring documents in the index, and eventually serving a list of results to the user. @giacomozecchini
  42. 42. What happens in the background ahead of time To be able to serve users pages that match their queries, a search engine has to: @giacomozecchini
  43. 43. What happens in the background ahead of time To be able to serve users pages that match their queries, a search engine has to: ● Crawl the web @giacomozecchini
  44. 44. What happens in the background ahead of time To be able to serve users pages that match their queries, a search engine has to: ● Crawl the web ● Analyse crawled pages @giacomozecchini
  45. 45. What happens in the background ahead of time To be able to serve users pages that match their queries, a search engine has to: ● Crawl the web ● Analyse crawled pages ● Build an Index @giacomozecchini
  46. 46. https://developers.google.com/search/docs/guides/javascript-seo-basics @giacomozecchini
  47. 47. If a crawler can’t access your content, that page won’t be indexed by search engines, nor will it be ranked. @giacomozecchini
  48. 48. @giacomozecchini
  49. 49. Even if your pages are being crawled, it doesn't mean they will be indexed. Having your pages indexed doesn't mean they will rank. @giacomozecchini
  50. 50. Crawler “A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (web spidering).” https://en.wikipedia.org/wiki/Web_crawler @giacomozecchini
  51. 51. Crawler Features it must have: ● Robustness ● Politeness @giacomozecchini Features it should have: ● Distributed ● Scalable ● Performance and efficiency ● Quality ● Freshness ● Extensible
  52. 52. Crawler Features it must have: ● Robustness ● Politeness @giacomozecchini Features it should have: ● Distributed ● Scalable ● Performance and efficiency ● Quality ● Freshness ● Extensible
  53. 53. Crawler - Politeness Politeness can be: ● Explicit - Webmasters can define what portion of a site can be crawled using the robots.txt file https://tools.ietf.org/html/draft-koster-rep-00 @giacomozecchini
  54. 54. Crawler - Politeness Politeness can be: ● Explicit - Webmasters can define what portion of a site can be crawled using the robots.txt file ● Implicit - Search Engines should avoid requesting any site too often; they have algorithms to determine the optimal crawl speed for a site. @giacomozecchini
  55. 55. Crawler - Politeness - Crawl Rate Crawl Rate defines the max number of parallel connections and the min time between fetches. Together with the Crawl Demand (Popularity + Staleness), it is part of the Crawl Budget. https://webmasters.googleblog.com/2017/01/what-crawl-budget-means-for-googlebot.html @giacomozecchini
  56. 56. Crawler - Politeness - Crawl Rate Crawl Rate is based on Crawl Health and the limit you can manually set in Search Console. Crawl Health depends on the server response time. If the server answers quickly, the crawl rate goes up. If the server slows down, or starts emitting a significant number of 5xx errors or connection timeouts, crawling slows down (see the sketch below). @giacomozecchini
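To make that feedback loop concrete, here is a hypothetical sketch of a crawl-rate controller that adapts a politeness delay to server health. The thresholds and back-off factors are invented for illustration; this is not Googlebot's actual algorithm.

```typescript
// Hypothetical adaptive crawl-rate controller: back off on 5xx errors, timeouts,
// or slow responses; recover slowly while the server looks healthy.
interface FetchResult {
  status: number;
  responseTimeMs: number;
  timedOut: boolean;
}

class CrawlRateController {
  private delayMs = 1_000; // current politeness delay between fetches

  constructor(private minDelayMs = 250, private maxDelayMs = 60_000) {}

  record(result: FetchResult): void {
    const unhealthy =
      result.timedOut || result.status >= 500 || result.responseTimeMs > 2_000;
    this.delayMs = unhealthy
      ? Math.min(this.delayMs * 2, this.maxDelayMs)    // back off quickly on trouble
      : Math.max(this.delayMs * 0.9, this.minDelayMs); // speed up gradually when healthy
  }

  nextDelayMs(): number {
    return this.delayMs;
  }
}
```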
  57. 57. Crawler - Performance and Efficiency A crawler should make efficient use of resources such as processor, storage, and network bandwidth. @giacomozecchini
  58. 58. @giacomozecchini Crawler - Super simplified architecture
  59. 59. @giacomozecchini Crawler - Super simplified architecture
  60. 60. @giacomozecchini
  61. 61. @giacomozecchini
  62. 62. @giacomozecchini
  63. 63. @giacomozecchini
  64. 64. @giacomozecchini
  65. 65. A crawler should make efficient use of resources. Using HTTP persistent connections, also called HTTP keep-alive connections, helps keep robots (or threads) busy and saves time. Reusing the same TCP connection gives crawlers advantages such as lower latency on subsequent requests, less CPU usage (no repeated TLS handshakes), and reduced network congestion. @giacomozecchini
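As a small illustration of persistent connections from a fetcher's point of view, this Node.js sketch reuses one TCP/TLS connection for sequential requests via a keep-alive agent. It is a generic example, not how any particular search engine's crawler is implemented.

```typescript
// Sketch: reuse a single keep-alive connection for several sequential fetches.
import https from 'node:https';

const agent = new https.Agent({ keepAlive: true, maxSockets: 1 });

function fetchPage(url: string): Promise<string> {
  return new Promise((resolve, reject) => {
    https
      .get(url, { agent }, res => {
        let body = '';
        res.on('data', chunk => (body += chunk));
        res.on('end', () => resolve(body));
      })
      .on('error', reject);
  });
}

// Subsequent calls skip the TCP + TLS handshake because the socket is reused.
```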
  66. 66. @giacomozecchini
  67. 67. @giacomozecchini
  68. 68. @giacomozecchini
  69. 69. @giacomozecchini
  70. 70. @giacomozecchini
  71. 71. Crawler HTTP/1.1 vs HTTP/2 @giacomozecchini
  72. 72. Crawler - HTTP/1.1 and HTTP/2 When I first started writing this presentation, none of the most popular Search Engine crawlers were using HTTP/2 to make requests. @giacomozecchini
  73. 73. Crawler - HTTP/1.1 and HTTP/2 I also remembered a tweet from Google’s John Mueller: @giacomozecchini
  74. 74. Crawler - HTTP/1.1 and HTTP/2 Instead of thinking “How can crawlers benefit from using HTTP/2?”, I started my research from the (wrong) conclusion: crawlers have no advantages in using HTTP/2. @giacomozecchini
  75. 75. Crawler - HTTP/1.1 and HTTP/2 Instead of thinking “How can crawlers benefit from using HTTP/2?”, I started my research from the (wrong) conclusion: crawlers have no advantages in using HTTP/2. But then Google published this article: Googlebot will soon speak HTTP/2. https://webmasters.googleblog.com/2020/09/googlebot-will-soon-speak-http2.html @giacomozecchini
  76. 76. Crawler - HTTP/1.1 and HTTP/2 How can crawlers benefit from using HTTP/2? From the Article: Some of the many, but most prominent benefits in using H2 include: ● Multiplexing and concurrency ● Header compression ● Server push @giacomozecchini
  77. 77. Crawler - HTTP/1.1 and HTTP/2 Multiplexing and concurrency What they were achieving with multiple robots (or threads), each with a single HTTP/1.1 connection, will be possible using a single HTTP/2 connection (or fewer connections) with multiple parallel requests. Crawl Rate HTTP/1.1: max number of parallel connections Crawl Rate HTTP/2: max number of parallel requests @giacomozecchini
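A minimal sketch of that multiplexing, using Node's built-in http2 client to issue several requests concurrently over one connection (the origin and paths in the usage comment are made up for illustration):

```typescript
// Sketch: several concurrent requests multiplexed over a single HTTP/2 connection.
import http2 from 'node:http2';

async function crawlOverOneConnection(origin: string, paths: string[]): Promise<void> {
  const session = http2.connect(origin);
  await Promise.all(
    paths.map(
      path =>
        new Promise<void>((resolve, reject) => {
          const req = session.request({ ':path': path });
          let bytes = 0;
          req.on('data', chunk => (bytes += chunk.length));
          req.on('end', () => {
            console.log(`${path}: ${bytes} bytes`);
            resolve();
          });
          req.on('error', reject);
        })
    )
  );
  session.close();
}

// crawlOverOneConnection('https://www.example.com', ['/', '/style.css', '/app.js']);
```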
  78. 78. Crawler - HTTP/1.1 and HTTP/2 Header Compression HTTP/2’s HPACK compression reduces HTTP header sizes, saving bandwidth. HPACK is even more effective for crawlers than for browsers: crawlers are stateless, send mostly the same HTTP headers with every request, and may also request multiple pages (and assets) over one H2 connection. @giacomozecchini
  79. 79. Crawler - HTTP/1.1 and HTTP/2 Server push “This feature is not yet enabled; it's still in the evaluation phase. It may be beneficial for rendering, but we don't have anything specific to say about it at this point.” @giacomozecchini
  80. 80. Crawler - HTTP/1.1 and HTTP/2 Server push “This feature is not yet enabled; it's still in the evaluation phase. It may be beneficial for rendering, but we don't have anything specific to say about it at this point.” Google is making massive use of caching and this seems to be a really good reason to not use server push. I guess they will probably never enable this. @giacomozecchini
  81. 81. Crawler - HTTP/1.1 and HTTP/2 Server push We too often look at protocols in a browser-centric way, forgetting that other people might use a specific feature in a beneficial way. E.g. REST APIs and server push @giacomozecchini
  82. 82. Crawler - HTTP/1.1 and HTTP/2 Why did it take Google so long to adopt HTTP/2? ● Wide support and maturation of the protocol ● Code complexity ● Regression testing @giacomozecchini
  83. 83. WRS (Web Rendering Service) Google is using a Web Rendering Service in order to render pages for Search. It’s based on the Chromium rendering engine and is regularly updated to ensure support for the latest web platform features. https://webmasters.googleblog.com/2019/05/the-new-evergreen-googlebot.html @giacomozecchini
  84. 84. WRS @giacomozecchini
  85. 85. WRS ● Doesn’t obey HTTP caching rules WRS caches every GET request for an undefined period of time (it uses an internal heuristic) @giacomozecchini
  86. 86. WRS ● Doesn’t obey HTTP caching rules ● Limits the number of fetches WRS might stop fetching resources after a number of requests or a period of time. It may not fetch known Analytics software. @giacomozecchini
  87. 87. WRS ● Doesn’t obey HTTP caching rules ● Limits the number of fetches ● Built to be resilient WRS will process and render a page even if some fetches fail @giacomozecchini
  88. 88. WRS ● Doesn’t obey HTTP caching rules ● Limits the number of fetches ● Built to be resilient ● Might interrupt scripts (excessive CPU usage, error loops, etc) @giacomozecchini
  89. 89. @giacomozecchini
  90. 90. @giacomozecchini
  91. 91. WRS If resources are not in the cache (or stale), the crawler will request those on behalf of WRS. @giacomozecchini
  92. 92. @giacomozecchini HTML
  93. 93. @giacomozecchini HTML CSS JS
  94. 94. @giacomozecchini HTML CSS JS JSFETCH
  95. 95. @giacomozecchini HTML
  96. 96. @giacomozecchini HTML CSS JS JSFETCH HTML CSS JS JSFETCH
  97. 97. @giacomozecchini HTML CSS JS JSFETCH HTML CSS JS JSFETCH
  98. 98. How It Affects Your Website
  99. 99. Cache and Rendering WRS caches everything without respecting HTTP caching rules. Using fingerprinted file names and defining a cache-busting strategy is the way to go: bundle.ap443f.js E.g. bundle.js will be cached for an undefined period of time (days, weeks, months) and will be used for rendering even if you change the code. A sketch of such a setup follows below. @giacomozecchini
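A minimal sketch of that fingerprinting, assuming a webpack 5 build (the entry path and hash length are arbitrary choices, not from the slides):

```typescript
// webpack.config.ts sketch: content-hashed filenames so every deploy produces a new
// URL, and WRS's long-lived cache can never serve stale code for the new HTML.
import type { Configuration } from 'webpack';

const config: Configuration = {
  entry: './src/index.ts',
  output: {
    filename: 'bundle.[contenthash:6].js', // e.g. bundle.ap443f.js
    clean: false, // keep previously hashed bundles around for a while (see slide 111)
  },
};

export default config;
```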
  100. 100. Crawl Rate and Rendering Crawl Rate is shared between crawlers, and even the requests that the crawler makes on behalf of WRS are no exception. If the server slows down during rendering, the Crawl Rate will decrease and rendering may fail. Btw, rendering is quite resilient and it may retry later. Tip: Monitor server response time. @giacomozecchini
  101. 101. Politeness and Rendering Robots.txt can block a crawler from requesting a specific part of a website. What can go wrong? ● If you are blocking a specific file, it won’t be fetched and used ● If you have a JS script with a fetch/retry loop for a resource that is blocked by a rule in your robots.txt, that script will be interrupted (see the sketch below) @giacomozecchini
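One defensive pattern is to bound retries for optional resources, so a blocked or failing fetch can't loop forever; a generic sketch:

```typescript
// Sketch: fetch an optional resource with a bounded number of retries, so a
// resource blocked by robots.txt (or simply failing) can't trap the script in a
// loop that WRS would eventually interrupt.
async function fetchWithRetry(url: string, attempts = 3): Promise<Response | null> {
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetch(url);
      if (res.ok) return res;
    } catch {
      // network error: fall through and retry
    }
    await new Promise(resolve => setTimeout(resolve, 500 * (i + 1))); // simple backoff
  }
  return null; // give up and render the page without this resource
}
```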
  102. 102. CPU usage and Rendering WRS limits CPU consumption and can stop scripts that run excessively. Performance matters: you should analyse runtime performance, debug issues, and remove bottlenecks, for example with the sketch below. @giacomozecchini
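One way to spot the main-thread work that leads to interrupted scripts is the Long Tasks API; a small sketch that logs long tasks in the field:

```typescript
// Sketch: log main-thread tasks longer than 50 ms using the Long Tasks API.
const longTaskObserver = new PerformanceObserver(list => {
  for (const entry of list.getEntries()) {
    console.warn(`Long task (${entry.name}): ${Math.round(entry.duration)} ms`);
  }
});
longTaskObserver.observe({ type: 'longtask', buffered: true });
```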
  103. 103. Third-party stuff Third parties can cause a few problems: ● Resources can be blocked through robots.txt on their domains ● Request timeouts, connection errors @giacomozecchini
  104. 104. Cookies Cookies, local storage and session storage are enabled but cleared across page loads. If you check for the presence of a specific cookie to decide whether or not to redirect a user to a welcome page, WRS won’t be able to render those pages. @giacomozecchini
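A hedged sketch of a safer pattern: keep the content on the page and layer the welcome message on top, instead of redirecting when the cookie is missing (the #welcome-banner element is hypothetical):

```typescript
// Anti-pattern: if (!document.cookie.includes('visited=1')) location.href = '/welcome';
// WRS clears cookies between page loads, so that redirect would fire on every render.

// Safer: the content stays on the page; first-time visitors just see an extra banner.
if (!document.cookie.includes('visited=1')) {
  document.cookie = 'visited=1; max-age=31536000; path=/';
  document.getElementById('welcome-banner')?.removeAttribute('hidden');
}
```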
  105. 105. Service Workers and Rendering Service Worker registration promises are refused. Web Workers are supported. @giacomozecchini
  106. 106. Service Workers and Rendering Service Worker registration promises are refused. @giacomozecchini
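A small sketch of defensive registration, so the rejected promise can't break anything critical during rendering:

```typescript
// Sketch: register the service worker, but treat failure (as in WRS) as non-fatal.
if ('serviceWorker' in navigator) {
  navigator.serviceWorker.register('/sw.js').catch(err => {
    // The page should work without offline support.
    console.info('Service worker not available:', err);
  });
}
```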
  107. 107. WebSockets and WebRTC WebSockets and WebRTC are not supported. @giacomozecchini
  108. 108. Render Queue and Rendering Google states that the Render Queue median time is ~5 seconds. In the past this wasn't the case, and pages could wait hours or days to be rendered. That might still be the case for other search engines. @giacomozecchini
  109. 109. Render Queue and Rendering I believe Google reduced the Render Queue time for two main reasons: ● Freshness ● Errors with assets / dependencies @giacomozecchini
  110. 110. Render Queue and Rendering When the crawler first requests a page, it also tries to fetch and cache the assets visible on that page. During the rendering phase, the bundle.js dependencies are discovered, requested, and cached. @giacomozecchini HTML JS
  111. 111. Render Queue and Rendering But, if you delete the dependencies of bundle.js before the rendering phase, they can’t be fetched even if bundle.js is cached. I guess this happened a lot in the past, but it shouldn’t happen anymore, at least in Google’s WRS, as the time span between the two phases is very short. Not sure about other search engines yet. TIP: keep old assets around for a while, even if you're not using them anymore. @giacomozecchini
  112. 112. Browser Events and Rendering WRS Chrome instances don’t scroll or click; if you want to use JavaScript lazy-load functionality, use the Intersection Observer (a sketch follows below). WRS Chrome instances start rendering pages with two fixed viewports for mobile (412 x 732) and desktop (1024 x 1024). Then they increase the viewport height to a very big number of pixels (tens of thousands), which is dynamically calculated on a per-page basis. @giacomozecchini
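A minimal IntersectionObserver-based lazy-loading sketch (the data-src attribute convention is an assumption, not something the slides prescribe):

```typescript
// Sketch: load images when they enter the viewport instead of on scroll events,
// which WRS never fires because it doesn't scroll.
const io = new IntersectionObserver(entries => {
  for (const entry of entries) {
    if (!entry.isIntersecting) continue;
    const img = entry.target as HTMLImageElement;
    const realSrc = img.dataset.src; // real URL stored in data-src (assumption)
    if (realSrc) img.src = realSrc;
    io.unobserve(img);
  }
});

document.querySelectorAll<HTMLImageElement>('img[data-src]').forEach(img => io.observe(img));
```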
  113. 113. Debugging Rendering problems Search Console is the best way to do it. @giacomozecchini
  114. 114. Debugging Rendering problems Search Console is the best way to do it. @giacomozecchini
  115. 115. Debugging Rendering problems @giacomozecchini
  116. 116. Debugging Rendering problems In the “page resources” tab you shouldn't worry if there are errors for font, image, and analytics JS files. Those files are not requested in the rendering phase. @giacomozecchini
  117. 117. Debugging Rendering problems If you don’t have Search Console access, you can use the Mobile-Friendly Test. WARNING Mobile-Friendly Test, Search Console Live Test, AMP Test, and Rich Results Test use the same infrastructure as WRS, but because they bypass the cache and use stricter timeouts than Googlebot / WRS, final results can be very different. https://youtu.be/24TZiDVBwSY?t=816 @giacomozecchini
  118. 118. @giacomozecchini
