How Search Works

  1. @patrickstox @ahrefs #pubcon How Search Works Presented by: Patrick Stox
  2. @patrickstox @ahrefs #pubcon Who is Patrick Stox? Product Advisor, Technical SEO, & Brand Ambassador at Ahrefs • I write for the Ahrefs blog but have written for many industry publications in the past • I speak at some conferences like SMX, Pubcon, UnGagged, DMO Advanced, TechSEO Boost, BrightonSEO • Organizer for the Raleigh SEO Meetup (most successful in US) and the Beer & SEO Meetup • We also run a conference, the Raleigh SEO Conference • Founder of the Technical SEO Slack Group • Moderator of /r/TechSEO on Reddit • Helped define the role of Search Marketing Strategist for the US Department of Labor • Lead author for the SEO Chapter of the 2021 Web Almanac, reviewer for the 2022 Chapter • Technical Review Editor for The Art of SEO 4th Edition
  3. @patrickstox @ahrefs #pubcon Disclaimer This is my understanding of systems and is based on a lot of public statements from Google and my own knowledge. Warning: It’s not going to be 100% complete or accurate.
  4. @patrickstox @ahrefs #pubcon How Many Domains Exist? Q3 2022 according to Verisign: 349.9 million registered. January 2023 according to Netcraft: 270.9 million unique domains responded. Ahrefs: 213.1 million (after removing spam domains).
  5. @patrickstox @ahrefs #pubcon How Many Pages? Google in 2016: 130T known
  6. @patrickstox @ahrefs #pubcon How Big Is The Index? Google: hundreds of billions of pages indexed 100 PB in size Ahrefs: ~380B pages
  7. @patrickstox @ahrefs #pubcon A Fraction Of The Web Is Useful Content Rough math: (400B / 130T) * 100 = 0.3%
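The rough math on this slide can be checked directly. This is a minimal sketch using the slide's own round figures (about 400B indexed pages against 130T known pages); the variable names are illustrative.

```python
# Figures taken from the slides; both are rough public estimates.
indexed_pages = 400e9   # ~400 billion pages in the index
known_pages = 130e12    # 130 trillion known pages (Google, 2016)

share_percent = indexed_pages / known_pages * 100
print(f"{share_percent:.1f}% of known pages are indexed")
```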
  8. @patrickstox @ahrefs #pubcon https://twitter.com/lilyraynyc/status/1509176261884747781
  9. @patrickstox @ahrefs #pubcon Spam Google 2021: “every day, we discover 40 billion spammy pages” That’s 14.6T spam pages a year.
  10. @patrickstox @ahrefs #pubcon Googlebot Googlebot is a lot of systems (1000+) and there are multiple Googlebots. • Googlebot Image • Googlebot News • Googlebot Video • Googlebot Desktop • Googlebot Mobile • +Ads and more https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers
  11. @patrickstox @ahrefs #pubcon Googlebot Is A Protocol Buffer It stores structured data. Similar to JSON, but smaller and faster.
  12. @patrickstox @ahrefs #pubcon Googlebot Rendering Pipeline (Simplified)
  13. @patrickstox @ahrefs #pubcon URL Sources • Links on pages, or anything that even looks like a link • Sitemaps • Request indexing in GSC • Indexing API (limited use cases) • RSS Feeds • WebSub (formerly PubSubHubbub)
  14. @patrickstox @ahrefs #pubcon Crawler Queue / Scheduler Determines what URLs to crawl and when. 2 main purposes: • Discovery • Refresh
  15. @patrickstox @ahrefs #pubcon What SEOs Call Crawl Budget, Google Calls Crawl demand How much Google wants to crawl your site. Crawl rate limit How much crawling your website can support.
  16. @patrickstox @ahrefs #pubcon What Counts Against Your Crawl Budget? All URLs and requests including: • Pages/files • Alternate URLs like AMP or m-dot pages, hreflang • CSS • JavaScript, including XHR requests • Embedded content ***All Googlebots share the same crawl budget, including the ones for Ads, images, etc.
  17. @patrickstox @ahrefs #pubcon Crawl Demand Factors • PageRank • How often pages change (freshness/staleness) • When it was last crawled • Any major changes
  18. @patrickstox @ahrefs #pubcon Crawl Rate Factors • Stability / crawl health • Slow responses • Errors. 5xx (server errors) or 429 (too many requests) HTTP status codes. They don’t want to crash the sites and the crawlers will generally back down if they start seeing issues.
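The idea of backing down on errors can be sketched as a simple adaptive delay. Google's actual crawl rate logic is not public, so the function `next_delay`, its doubling and halving policy, and the delay bounds below are all assumptions made for illustration.

```python
def next_delay(current_delay, status_code, min_delay=1.0, max_delay=60.0):
    """Return the delay in seconds to wait before the next request.

    Hypothetical policy: double the delay on 429 or 5xx responses,
    otherwise ease back toward the minimum delay.
    """
    if status_code == 429 or 500 <= status_code <= 599:
        return min(current_delay * 2, max_delay)
    return max(current_delay / 2, min_delay)


delay = 1.0
for status in [200, 503, 503, 429, 200]:
    delay = next_delay(delay, status)
    print(status, delay)
```

The delay grows while the server reports trouble and shrinks again once healthy responses return, which is the general behavior the slide describes.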
  19. @patrickstox @ahrefs #pubcon Crawl Rate Settings GSC
  20. @patrickstox @ahrefs #pubcon Crawling The little spider is named Crawley.
  21. @patrickstox @ahrefs #pubcon Crawling Mostly from Mountain View, CA, USA. Every request needs to respect robots.txt. 15MB max HTML size.
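A crawler that respects robots.txt and a size cap might look roughly like this. It is a sketch using Python's standard `urllib.robotparser`, not Googlebot's implementation; the robots.txt rules, the example.com URLs, and the helper names `may_crawl` and `truncate_html` are hypothetical, and treating 15MB as 15 * 1024 * 1024 bytes is an assumption.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (hypothetical site rules).
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

MAX_HTML_BYTES = 15 * 1024 * 1024  # the 15MB limit from the slide


def may_crawl(url, user_agent="Googlebot"):
    """Check the parsed robots.txt rules before requesting a URL."""
    return parser.can_fetch(user_agent, url)


def truncate_html(body):
    """Keep only the first MAX_HTML_BYTES bytes of a response body."""
    return body[:MAX_HTML_BYTES]


print(may_crawl("https://example.com/blog/post"))
print(may_crawl("https://example.com/private/report"))
```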
  22. @patrickstox @ahrefs #pubcon Google Doesn’t Navigate Like Users Sends requests for the files individually, doesn’t navigate between pages like a user.
  23. @patrickstox @ahrefs #pubcon Caching Files They Crawl more than HTML: • Pages and other file types • JavaScript • CSS
  24. @patrickstox @ahrefs #pubcon Caching Files Files are stored for use in rendering. Google will ignore your cache timings and fetch a new copy when they want to. JS HTML HTML HTML JS CSS CSS CSS Cache Cache
  25. @patrickstox @ahrefs #pubcon Processing – We’ll Cover This Shortly
  26. @patrickstox @ahrefs #pubcon Web Rendering Service (WRS) Needed to process JavaScript Evergreen (up-to-date) Googlebot Headless (no Graphical User Interface)
  27. @patrickstox @ahrefs #pubcon Web Rendering Service (WRS) • Stateless (storage and cookies cleared between loads) • Denies Permissions • Flattens light DOM and shadow DOM • Date / Time functions adjusted • Service workers rejected • Animations may differ • Random may not be random
  28. @patrickstox @ahrefs #pubcon Myth: 5 Second Limit I think this started with a test from Max Prin on the time when the testing tools took a screenshot. They need to have reasonable time limits for testing tools. https://maxprin.com/tests/js-timer/
  29. @patrickstox @ahrefs #pubcon No 5 Second Limit They’ll try to wait for pages to finish, something like networkidle0 (no more activity). Eventually cuts off in case something gets stuck or someone is trying to mine bitcoin.
  30. @patrickstox @ahrefs #pubcon It Doesn’t Even Make Sense They’re basically loading a page with everything cached already. WRS JS HTML HTML HTML JS CSS CSS CSS Cache Cache
  31. @patrickstox @ahrefs #pubcon This System Causes Other Issues Impossible states – previous file versions used when rendering. File versioning /fingerprinting should help. XHR requests are done in real time.
  32. @patrickstox @ahrefs #pubcon Myth: Weeks To Render All pages go through the renderer. The average wait time is 5 seconds according to Google’s Martin Splitt. The 90th percentile is only minutes, not weeks. Probably comes from pages not being prioritized for crawling.
  33. @patrickstox @ahrefs #pubcon Rendering At Web Scale The 8th wonder of the world.
  34. @patrickstox @ahrefs #pubcon They Use Some Hacks “In Google search we don’t really care about the pixels because we don’t really want to show it to someone. We want to process the information and the semantic information so we need something in the intermediate state. We don’t have to actually paint the pixels.” – Martin Splitt
  35. @patrickstox @ahrefs #pubcon What That Looks Like Gray = downloads Blue = HTML Yellow = JavaScript Purple = Layout Green = Painting
  36. @patrickstox @ahrefs #pubcon They Won’t Render Noindexed Pages <meta name="robots" content="noindex"> <meta name="robots" content="none"> None = noindex, nofollow
  37. @patrickstox @ahrefs #pubcon They’re Not Taking Actions They don’t scroll. They generally don’t click.
  38. @patrickstox @ahrefs #pubcon Mobile Desktop
  39. @patrickstox @ahrefs #pubcon They Don’t Click Load content into the Document Object Model (DOM) by default. They won’t see the content if it requires a click that makes an XHR request to pull it in. DOM Tree and CSS Object Model (CSSOM) form the Render Tree. That’s what gets indexed.
  40. @patrickstox @ahrefs #pubcon DOM Tree (pictured) CSSOM (not pictured) would add info like font size, weight, color, etc. to each element. Render Tree
  41. @patrickstox @ahrefs #pubcon Collapser • Error handling • Retries • Soft 404s
  42. @patrickstox @ahrefs #pubcon Processing – Now We’ll Talk About It
  43. @patrickstox @ahrefs #pubcon Processing - Duplicates Duplicate detection - content hashes or checksum They’ll remove boilerplate content (nav, footer) for the checksum.
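Checksum-based duplicate detection can be sketched as follows. The regex-based removal of `<nav>` and `<footer>` blocks is only a stand-in for whatever boilerplate detection Google really uses, and the choice of SHA-256 and the function name `content_checksum` are assumptions.

```python
import hashlib
import re


def content_checksum(html):
    """Hash the main text of a page, ignoring nav and footer blocks."""
    # Drop <nav> and <footer> blocks as a stand-in for boilerplate removal.
    main = re.sub(r"<(nav|footer)\b.*?</\1>", " ", html,
                  flags=re.DOTALL | re.IGNORECASE)
    # Strip remaining tags, then normalize whitespace and case.
    text = re.sub(r"<[^>]+>", " ", main)
    text = " ".join(text.split()).lower()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


page_a = "<nav>Home | About</nav><p>How search works.</p><footer>2023</footer>"
page_b = "<nav>Home | Blog</nav><p>How  Search works.</p><footer>Contact</footer>"
page_c = "<nav>Home | About</nav><p>How crawling works.</p><footer>2023</footer>"

print(content_checksum(page_a) == content_checksum(page_b))
print(content_checksum(page_a) == content_checksum(page_c))
```

Pages A and B differ only in navigation, footer, spacing, and case, so they hash the same; page C has different main content and does not.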
  44. @patrickstox @ahrefs #pubcon Near Duplicates
  45. @patrickstox @ahrefs #pubcon Processing – Duplicate Elimination Canonicalization
  46. @patrickstox @ahrefs #pubcon ~20 Canonicalization Signals • Duplicates • Redirects (high weight) • Canonical link elements - multiple will be ignored • Sitemap URLs • Links (Internal/External, PageRank) • Alternates – mobile, AMP, print, Hreflang • HTTPS pages over HTTP • Shorter URLs over longer URLs • Where content was first published / seen • Site level signals like a history of scraped content • Pages over PDFs Machine learning system
  47. @patrickstox @ahrefs #pubcon 301 = Permanent, 302 = Temporary Holds true for other perm and temp redirects
  48. @patrickstox @ahrefs #pubcon Warning! Speculation
  49. @patrickstox @ahrefs #pubcon Processing – Link Parser Good: <a> tag with an href attribute. <a href="/page">simple is good</a> <a href="/page" onclick="goTo('page')">still okay</a>
  50. @patrickstox @ahrefs #pubcon Processing – Link Parser Bad (but may be parsed): <a routerLink="products/category">no href</a> <a onclick="goTo('page')">no href</a> <a href="javascript:goTo('page')">kind of nested</a> <a href="javascript:void(0)">missing link</a> <span onclick="goTo('page')">not the right HTML element or href</span> <span href="page">not the right HTML element</span> <option value="page">not the right HTML element</option> <a href="#">no link</a> Button, ng-click, there are many more ways this can be done incorrectly.
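The difference between the good and bad patterns can be illustrated with a strict link extractor. This sketch uses Python's standard `html.parser` and only accepts `<a>` tags with a real `href`; it is deliberately stricter than Google, which the slide notes may still parse some of the bad forms. The class name `LinkExtractor` and the sample markup are illustrative.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags that look like crawlable URLs."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip missing hrefs, javascript: pseudo-links, and bare fragments.
        if not href or href.lower().startswith("javascript:") or href.startswith("#"):
            return
        self.links.append(href)


sample_html = """
<a href="/page">simple is good</a>
<a href="/page2" onclick="goTo('page')">still okay</a>
<a onclick="goTo('page')">no href</a>
<a href="javascript:void(0)">missing link</a>
<span href="page">not the right HTML element</span>
<a href="#">no link</a>
"""

extractor = LinkExtractor()
extractor.feed(sample_html)
print(extractor.links)
```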
  51. @patrickstox @ahrefs #pubcon Processing – Link Parser • Link location, where it goes • Anchor text • Surrounding text • …
  52. @patrickstox @ahrefs #pubcon Link Tagging • Penguin • Location on page (footer, main content) • Disavow • …
  53. @patrickstox @ahrefs #pubcon Processing – Content Parser • Content – tokenized, vectorized. Words become numbers. • Content language • Content location • Extract meta tags • Extract Schema • HTML Lexer – normalize the HTML • Topic analysis. Content on other topics may be weighted less in ranking. • Semantic analysis. Linguistic, knowledge graph, address extraction • …
  54. @patrickstox @ahrefs #pubcon Content Tagging • YMYL • Adult / safe search • Mobile-friendly • …
  55. @patrickstox @ahrefs #pubcon Signal Collectors • PageRank • Spam • Page Experience • Freshness • …
  56. @patrickstox @ahrefs #pubcon A Lot More In Processing Like Drop anything after # in URLs. (some exceptions to this) Most Restrictive Directives index + noindex + index = noindex They’ll drop low quality content
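Two of these processing rules are easy to show in code. The sketch below drops URL fragments with the standard `urllib.parse.urldefrag` and resolves conflicting robots directives to the most restrictive one; the helper names are hypothetical, and the exceptions to fragment dropping mentioned on the slide are not modeled.

```python
from urllib.parse import urldefrag


def strip_fragment(url):
    """Drop everything after # in a URL."""
    return urldefrag(url).url


def most_restrictive(directives):
    """Resolve conflicting index/noindex directives to the strictest one."""
    tokens = {d.strip().lower() for d in directives}
    # "none" is shorthand for "noindex, nofollow".
    if "noindex" in tokens or "none" in tokens:
        return "noindex"
    return "index"


print(strip_fragment("https://example.com/page#section"))
print(most_restrictive(["index", "noindex", "index"]))
```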
  57. @patrickstox @ahrefs #pubcon Other Files May Be Processed Differently Adobe Portable Document Format (.pdf) •Adobe PostScript (.ps) •Google Earth (.kml, .kmz) •GPS eXchange Format (.gpx) •Hancom Hanword (.hwp) •HTML (.htm, .html, other file extensions) •Lotus •Microsoft Excel (.xls, .xlsx) •Microsoft PowerPoint (.ppt, .pptx) •Microsoft Word (.doc, .docx) •OpenOffice presentation (.odp) •OpenOffice spreadsheet (.ods) •OpenOffice text (.odt) •Rich Text Format (.rtf) •Scalable Vector Graphics (.svg) •TeX/LaTeX (.tex) •Text (.txt, .text, other file extensions), including source code in common programming languages: • Basic source code (.bas) • C/C++ source code (.c, .cc, .cpp, .cxx, .h, .hpp) • C# source code (.cs) • Java source code (.java) • Perl source code (.pl) • Python source code (.py) •Wireless Markup Language (.wml, .wap) •XML (.xml)
  58. @patrickstox @ahrefs #pubcon Image Processing • Text around the image • Content of the image. They tag what is in the image. Not super reliable. • Alt attribute • Image name (minimal weight) • Webpage title and description Photo from a Gary Illyes Presentation at Pubcon.
  59. @patrickstox @ahrefs #pubcon Robots.txt for Images Blocking Googlebot Image from crawling means that your images will not be indexed.
  60. @patrickstox @ahrefs #pubcon Video Processing • OCR to get text • Objects identified from visuals • Speech converted to text • Structured data • Text and other signals from the page, URL, title, description
  61. @patrickstox @ahrefs #pubcon PDFs • PDFs are converted and indexed as HTML • OCR to get text • Images get indexed • Links get picked up • Title • File name • …
  62. @patrickstox @ahrefs #pubcon Google Index Named Caffeine
  63. @patrickstox @ahrefs #pubcon Data Infrastructure Many data centers around the world. Each has a copy of the index. Millions of servers and hard drives. Index is an inverted index. Maps things like words to documents. Index shards are split into words and phrases. Other shards for metadata.
  64. @patrickstox @ahrefs #pubcon Indexing Tiers – Based On Doc Popularity • RAM (fastest) • SSD (fast) • Hard drives (slowest)
  65. @patrickstox @ahrefs #pubcon Mobile Version Is Indexed (Mostly) Some sites may remain on desktop-only indexing. They don’t work on mobile.
  66. @patrickstox @ahrefs #pubcon Life Of A Query
  67. @patrickstox @ahrefs #pubcon Fun Fact 15% of queries have never been seen before
  68. @patrickstox @ahrefs #pubcon Start Typing - Autocomplete Powered by real search data and patterns across the web + • The language of the query • The location a query is coming from • Trending interest in a query • Your past searches Probably reduces misspellings
  69. @patrickstox @ahrefs #pubcon Query parsing and understanding BERT (DeepRank) – combinations of words express different meanings and intents. They won’t drop important words from the queries. Neural matching – words to searches. “For example, neural matching helps Google understand that a search for “why does my TV look strange” is related to the concept of “the soap opera effect.” We can then return pages about the soap opera effect, even if the exact words aren’t used.”
  70. @patrickstox @ahrefs #pubcon Misspelling 1/10 searches are misspelled
  71. @patrickstox @ahrefs #pubcon Google Training Misspelling Example Over 600 ways people misspelled Britney Spears. http://archive.google.com/jobs/britney.html
  72. @patrickstox @ahrefs #pubcon Spelling Old Vs New Old way: How often terms were searched +probability of typos from neighboring keys New way: Deep neural net with 680M parameters
  73. @patrickstox @ahrefs #pubcon Query Expansion When the query is sent, it’s going to also pull pages with terms that include: • Synonyms • Antonyms • Acronyms • Plural/singular • Stemming – root words • Diacritical expansion – accented and unaccented versions of characters These will mostly get lower weights in scoring than the main term used.
  74. @patrickstox @ahrefs #pubcon Concepts & Entities People, places, things “RankBrain helps Google better relate pages to concepts – This means Google can better return relevant pages even if they don’t contain the exact words used in a search, by understanding the page is related to other words and concepts.”
  75. @patrickstox @ahrefs #pubcon Speculation All the query expansion things may not be necessary anymore. They may just pull close terms in vector space.
  76. @patrickstox @ahrefs #pubcon Stop Words The, is, and, of, a, are, an, if, etc. Removed for some queries. Used for other queries, like when it matches a concept.
  77. @patrickstox @ahrefs #pubcon Segmenter Splits up strings (languages without spaces). '上海浦东开发与建设同步' → ['上海', '浦东', '开发', '与', '建设', '同步']
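One classic way to segment text without spaces is forward maximum matching against a dictionary. The sketch below is not how Google's segmenter works (that is not public); the tiny `vocabulary` set and the `segment` function are assumptions chosen to reproduce the example on the slide.

```python
def segment(text, vocabulary, max_len=4):
    """Split text into known words, longest match first."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical dictionary; a real segmenter would use a far larger one.
vocabulary = {"上海", "浦东", "开发", "建设", "同步"}
print(segment("上海浦东开发与建设同步", vocabulary))
```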
  78. @patrickstox @ahrefs #pubcon Retrieval – Posting List Remember that inverted index? Map of terms to pages that contain those terms. Get all those.
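An inverted index and posting list lookup can be sketched in a few lines. The three documents are made up, and requiring every query term to match (set intersection) is an assumption; a real engine also pulls expanded terms and works across many shards.

```python
from collections import defaultdict

# Toy document collection (hypothetical content).
documents = {
    1: "how search works",
    2: "how crawling works",
    3: "search index basics",
}

# Build the inverted index: term -> set of document IDs (the posting list).
index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.lower().split():
        index[term].add(doc_id)


def retrieve(query):
    """Return IDs of documents containing every query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()


print(retrieve("search works"))
print(retrieve("how"))
```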
  79. @patrickstox @ahrefs #pubcon Sum Of The Total Pages From All Shards
  80. @patrickstox @ahrefs #pubcon Popular Queries Are Cached
  81. @patrickstox @ahrefs #pubcon Make A Smaller List - Ranking Google is going to cut all those results down to the top 1000 by ranking them.
  82. @patrickstox @ahrefs #pubcon Ranking / Scoring – Query Dependent Feature of a page & query • Keyword hits • All those other versions from the query expansion like synonyms • Proximity • Content relevance, topicality • …
  83. @patrickstox @ahrefs #pubcon Ranking / Scoring – Query Independent Feature of a page • PageRank, site queries, mentions, & other E-E-A-T signals • Language • Mobile-friendliness • Page experience • … Numbers multiplied by other numbers in the scoring
  84. @patrickstox @ahrefs #pubcon They’re Like Nah, We Can Do Better
  85. @patrickstox @ahrefs #pubcon Reranking / Post-Retrieval Adjustments Has a smaller number of results - 1000 With the smaller number, they can run more intelligent but resource intensive systems to re-order the results.
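The two-stage pattern described here, a cheap first pass to build a shortlist followed by a costlier reranker, can be sketched as below. Both scoring functions and the `quality` field are invented stand-ins; nothing here reflects Google's actual signals.

```python
import heapq


def cheap_score(doc, query_terms):
    """Fast first-pass score: count of query term hits (illustrative only)."""
    words = doc["text"].lower().split()
    return sum(words.count(term) for term in query_terms)


def expensive_score(doc, query_terms):
    """Stand-in for a costly model: term hits weighted by a quality value."""
    return cheap_score(doc, query_terms) * doc["quality"]


def search(docs, query, shortlist_size=1000):
    """Shortlist with the cheap score, then reorder with the expensive one."""
    terms = query.lower().split()
    shortlist = heapq.nlargest(shortlist_size, docs,
                               key=lambda d: cheap_score(d, terms))
    return sorted(shortlist, key=lambda d: expensive_score(d, terms),
                  reverse=True)


docs = [
    {"id": "a", "text": "search search search", "quality": 0.1},
    {"id": "b", "text": "search works", "quality": 0.9},
    {"id": "c", "text": "cooking recipes", "quality": 1.0},
]
print([d["id"] for d in search(docs, "search", shortlist_size=2)])
```

Document "a" wins the cheap pass on raw keyword hits, but "b" moves to the top once the more expensive score is applied to the shortlist.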
  86. @patrickstox @ahrefs #pubcon RankBrain & BERT - Again “Based on its complex language understanding, BERT can very quickly rank documents for relevance.” Depending on the search, Google’s algorithm can use either RankBrain, BERT, or both.
  87. @patrickstox @ahrefs #pubcon Host Clustering Limits the results you see from the same domain. Add &filter=0 to your search URL to see unfiltered results.
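Building a search URL with the `filter=0` parameter is straightforward with the standard library. The helper name `unfiltered_search_url` is hypothetical, and whether Google continues to honor the parameter is not guaranteed.

```python
from urllib.parse import urlencode


def unfiltered_search_url(query):
    """Build a Google search URL with host clustering turned off."""
    params = {"q": query, "filter": "0"}
    return "https://www.google.com/search?" + urlencode(params)


print(unfiltered_search_url("how search works"))
```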
  88. @patrickstox @ahrefs #pubcon Hreflang Tries to swap to the most relevant country/language version of a page.
  89. @patrickstox @ahrefs #pubcon DMCA, Privacy Removals, URL Removal Tool
  90. @patrickstox @ahrefs #pubcon Spelling Corrections
  91. @patrickstox @ahrefs #pubcon Trending Topics Are Promoted
  92. @patrickstox @ahrefs #pubcon Spam Spam demotions Manual actions
  93. @patrickstox @ahrefs #pubcon Query Other Systems - Universal results News, Maps, Images, Videos, etc. Results are bidding for their position