#SMX #23A2 @maxxeight

How To Make Sure Google Can Understand Your Pages

Crawling
– Don't block resources via robots.txt
– onclick + window.location != <a href="link.html">
– 1 unique “clean” URL per piece of content (and vice-versa)

URL Structures (with AJAX websites)
Fragment identifier: example.com/#url
– Not supported. Ignored. URL = example.com
Hashbang: example.com/#!url (pretty URL)
– Google and Bing will request example.com/?_escaped_fragment_=url (ugly URL)
– The _escaped_fragment_ URL should return an HTML snapshot
Clean URL: example.com/url
– Leverages the pushState function from the History API
– Must return a 200 status code when loaded directly

Rendering
– Load content automatically, not based on user interaction (click, mouseover, scroll)
– The 5-second rule
– Avoid JavaScript errors (bots vs. browsers)

The “Old” AJAX Crawling Scheme And HTML Snapshots
HTML snapshots are only required with uncrawlable URLs (#!)
When used with clean URLs: 2 URLs requested for each piece of content (crawl budget!)
HTML snapshots must be:
– Served directly to (other) crawlers (Facebook, Twitter, LinkedIn, etc.)
– Matching the content in the DOM
– No JavaScript (except JSON-LD markup)
– Not blocked from crawling
[Diagram: DOM → HTML snapshot]

Indexing
– Mind the order of precedence (SEO signals and content)

Tools For SEO And JavaScript
Google cache (unless HTML snapshots)
Google Fetch & Render (Search Console)
– Limited in terms of bytes (~200 KB)
– Doesn't show the HTML snapshot (DOM)
Fetch & Render As Any Bot (TechnicalSEO.com)
Chrome DevTools (JavaScript Console)
SEO Crawlers
– Screaming Frog
– Botify
– Scalpel (Merkle proprietary tool)

Let's start with the basics. Typically, search engine bots crawl, index and then rank pages.
Crawl: request the URL from the server and download the HTML document.
Index: parse and index the content of the HTML document.
Rank: algorithmically rank the URL based on a bunch of factors, including the content of the HTML document.
For a long time, this was enough. The HTML document included everything search engines (and users) needed to know about the page: meta data (title, description) and the main content of the page.
So, what happened? Technology happened. New web development technologies took the Internet from…
THIS…
To this.
Flipkart: one of the largest online retailers in India. Over the last couple of years, their web dev teams worked really hard to create a responsive/progressive web app website. It's fast, mobile-friendly, works offline, sends push notifications, can be added to the phone's home screen like a native app, etc.
Those web dev technologies, including the ones shown on this slide, have allowed and helped webmasters create fantastic websites, designs and user experiences.
Now, let's remind ourselves of a search engine's mission: serving the best results. This means understanding web pages better: to be able to return the best results, Google first has to find and understand them.
Solely looking at the HTML document (i.e. the source code) is not enough. When using JavaScript frameworks, it's often the case that the "source code" doesn't include any of the relevant information for search engines to base their results on.
Everything (meta data and content) is in the Document Object Model.
When clicking “inspect element” in a browser, you access a visual representation of the DOM. It represents what the page looks like (from a code standpoint) after the browser has done its job of parsing, rendering/painting the page.
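For illustration, here's a minimal, hypothetical snippet showing the gap between the two views (the element ID and content are made up):

  <!-- What the crawler downloads (the "source code"): an empty shell -->
  <div id="app"></div>
  <script>
    // Once this runs, the DOM contains the real content -
    // but "view source" still shows nothing more than the empty div.
    document.getElementById('app').innerHTML =
      '<h1>Product name</h1><p>Description, price, reviews...</p>';
  </script>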
“Rendering” is the keyword. Google has been rendering web pages for a few years now, after crawling and before indexing, in order to understand them better.
The purpose of rendering goes further than simply getting the content that is dynamically inserted on the page. It's about understanding the user experience provided by the website.
It's because Google is rendering pages (loading CSS, executing media queries) that they're able to determine their mobile-friendliness and make it an (important) ranking factor.
Bing also renders web pages to determine mobile-friendliness, and we've seen instances where Bingbot executes some JavaScript (redirects), but the search engine has not yet made any official announcement about its ability to fully render websites.
“Is that it? Google can execute JavaScript. Great, so I don't have to worry about using a JavaScript framework?”
Unfortunately, no.
There are a few things to consider in order to make sure Google can understand your JavaScript.
Let’s look at them from a search engine standpoint, in the following order: crawling, rendering and indexing.
First and foremost: do not block bots from accessing vital resources: CSS, JavaScript and images.
If CSS can't be accessed, a responsive website won't be considered mobile-friendly.
The same goes for content inserted via JavaScript -> if it's not seen, it's not indexed.
This is a test I ran to see whether Google, in order to prevent spammy results, considers the "severity" of a blocked resource when indexing.
It appears that the URL was indexed and ranking for the “puppies” content.
Source code: “puppies” content
DOM (after JavaScript execution): “Amoxicillin” content
Script blocked via robots.txt
Now think about what is happening if your website doesn't have anything in the source code and everything is in the DOM, but the scripts are blocked from crawling.
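As a hypothetical sketch (the /js/ and /css/ paths are made up), the difference between a robots.txt that hides the rendering resources and one that exposes them can be as small as:

  # Over-restrictive: Googlebot gets the HTML but can't fetch the scripts
  # and styles that build the page, so it may never "see" the content.
  User-agent: *
  Disallow: /js/
  Disallow: /css/

  # Safer: let bots fetch the resources needed for rendering.
  User-agent: *
  Allow: /js/
  Allow: /css/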
Remember that JavaScript functions, while they can include URL paths and strings, are not regular “a href” links.
Google might or might not be able to “crawl” those URLs.
Even if they are crawled, it is unknown how much weight is given to those “links” and how they are considered within the site architecture. We all know the importance of internal linking.
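A quick, hypothetical comparison (the /category/shoes URL is made up):

  <!-- Pseudo-link: the URL only lives inside JavaScript, so there may be
       nothing to crawl and no clear internal-linking signal -->
  <span onclick="window.location.href='/category/shoes'">Shoes</span>

  <!-- Real link: a crawlable URL that passes internal-linking signals -->
  <a href="/category/shoes">Shoes</a>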
As Gary Illyes said: “URLs are the bridges between Google and your content”.
It is essential to have every piece of content accessible via its own URL.
Single Page Applications (SPAs) should actually not be using a "single page" or single URL when delivering the content.
Each URL within the application should render and include the relevant content (the same applies to infinite scroll implementations).
Fragment identifier: this URL structure is an existing concept on the web and relates to deep linking into content on a particular page (“jump links”).
Can’t be accessed/crawled/indexed.
Hashbang: Used with the “old” AJAX crawling scheme. Not recommended, more complex to implement.
Clean URL using History API’s pushState function.
With pushState, we can “manipulate” the browser’s address bar and history. The app can push URLs in the history stack so users can use the back and forward buttons.
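A minimal sketch of that pattern (render() and the URLs are placeholders; remember that each of these URLs must also return a 200 status code when requested directly):

  // Intercept clicks on internal links and update the address bar without a reload
  document.addEventListener('click', function (event) {
    var link = event.target.closest('a');
    if (!link) return;
    event.preventDefault();
    history.pushState({}, '', link.getAttribute('href')); // push the clean URL onto the history stack
    render(link.getAttribute('href'));                    // placeholder: fetch and inject that URL's content
  });

  // Support the browser's back/forward buttons
  window.addEventListener('popstate', function () {
    render(location.pathname);
  });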
=> With access to the resources, real <a href> links to follow and unique URLs to find content, Google will be able to crawl a website properly.
Googlebot is a lame user: it doesn’t click on buttons or scroll down the page, etc.
Therefore the content needs to be loaded into the DOM automatically, not based on user interactions:
Mega menu – mouseover + AJAX
Tabs/accordions – click + AJAX
Load more/infinite scroll – click/scroll + AJAX
It appears that Google takes an HTML snapshot of the page around 5 seconds after the load event. This is important because it means that content inserted later won't be seen and therefore won't be indexed.
Load event + 5 seconds = HTML snapshot from Googlebot
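A hypothetical illustration of what that observation implies (the element IDs, content and timing are made up):

  window.addEventListener('load', function () {
    // Inserted right after the load event: likely captured in the snapshot
    document.getElementById('reviews').innerHTML = '<p>4.5 stars (120 reviews)</p>';

    // Inserted 8 seconds after the load event: likely missed, therefore not indexed
    setTimeout(function () {
      document.getElementById('related').innerHTML =
        '<a href="/related-product">Related product</a>';
    }, 8000);
  });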
This sounds like common sense but it happens.
Real world example:
Page is loading
Script sets a JavaScript cookie
Script triggers an AJAX request to load a partial HTML file holding the content of a mega menu.
=> Regular users (browsers) have no issue seeing the full page.
=> Googlebot wasn't "accepting" the JavaScript cookie – the script failed, stopped executing, the AJAX request was never made, and Google didn't see the mega menu with all of the top-nav links.
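A more defensive version of that pattern might look like this sketch (the endpoint, cookie name and element ID are made up): a cookie failure shouldn't stop the content from loading.

  try {
    // The original script only carried on if this cookie could be set
    document.cookie = 'menu_seen=1; path=/';
  } catch (e) {
    // Some clients (including certain bots) reject or ignore cookies - don't bail out
  }

  // Request the mega-menu partial regardless of whether the cookie "took"
  fetch('/partials/mega-menu.html')
    .then(function (response) { return response.text(); })
    .then(function (html) {
      document.getElementById('mega-menu').innerHTML = html;
    });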
=> With content loaded automatically and not delayed over 5 seconds, Google should be able to render your pages properly.
Google can find SEO signals and/or content in different sources: the HTTP headers, the HTML source code, the DOM, and, when there is one, the HTML snapshot.
It is important to understand the order of precedence:
For content and "hints" (such as a canonical tag), Google will take the last-seen signal into consideration. Content in the source code that is not present in the DOM will be disregarded. This makes perfect sense: if the content is not in the DOM, it is not visible to the user.
For directives, however, such as noindex tags, Google will respect them wherever they find them. A noindex tag in the source code will prevent the page from being indexed despite an explicit “index” tag in the DOM.
rel="nofollow": it appears that Google won't respect a rel="nofollow" dynamically inserted into an <a href> present in the source code. If the link is found in the source code without the rel="nofollow" attribute, it will be queued for crawling – regardless of whether a rel="nofollow" is added in the DOM.
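For example, with a hypothetical page like this, the noindex in the downloaded HTML wins, even though the rendered DOM ends up saying "index, follow":

  <!-- In the HTML document that Googlebot downloads: -->
  <meta name="robots" content="noindex">

  <script>
    // Injected at render time - too late: the directive above has already been seen
    document.querySelector('meta[name="robots"]').setAttribute('content', 'index, follow');
  </script>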
Don't use the Google cache: https://maxxeight.com/blog/google-cache-seo/
It doesn't represent how Google rendered the page. The cache only shows the HTML document that was downloaded.
Using the cache as a source of information can also be misleading. In this case, the Google cache domain can't render the page as it doesn't have access to the resources.
https://webcache.googleusercontent.com can’t make AJAX requests to https://technicalseo.com
Google’s Fetch & Render is limited: https://www.deepcrawl.com/blog/news/google-webmaster-hangout-notes-december-16th-2016/#link1?platform=hootsuite
Large webpages might not be rendered properly in Fetch & Render.
The tool doesn't return an HTML snapshot of the page. The screenshot is a good way to see if the page is rendered, but it won't help confirm that a mega menu has been loaded, for example.
But it’s still a better source of info than the cache.
It fetches pages from a Google IP (this sometimes makes a difference for websites that block the "Googlebot" user-agent when requests don't come from a known Google IP).
It leverages Googlebot's JavaScript rendering engine, which is likely to be more advanced than PhantomJS.
In order to overcome those limitations, I've built a tool accessible (for free) on TechnicalSEO.com.
Only caveat: it's not Googlebot. The headless browser is PhantomJS, which in some cases is not as sophisticated as Googlebot.
TechnicalSEO.com Fetch as any bot: https://technicalseo.com/seo-tools/fetch-render/
GSC Fetch as Google limitations:
It doesn't fully render "large" HTML documents (> ~200 KB).
It doesn't return a DOM snapshot (or "HTML snapshot") of the rendered page – only a screenshot, which is useful but not enough to make sure the page is properly rendered (e.g. a dynamically generated mega menu, content in tabs, etc. => non-visible content). This tool gives you both.
It always respects robots.txt files (no way to get an idea of how the page would be rendered without the blocked resources) – this tool lets you disregard robots.txt statements while still reporting potentially blocked URLs.
It doesn’t follow redirects – this tool reports on HTTP (301, 302, etc.), meta refresh and JavaScript redirects (even pushState events).
It doesn’t have the option to change the Accept-Language header and/or IP (for locale-adaptive pages or international redirects) when we know Googlebot can actually crawl with different settings and locations.
And obviously, it doesn’t give the option to crawl with different (non-Google) user-agents.
Probably the best and most useful SEO tool available. Use it to troubleshoot any JavaScript errors that might occur in specific situations: change your user-agent to Googlebot or Googlebot Smartphone, disable cookies, etc., and confirm that your pages are rendered properly.
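For example, a couple of console snippets that can help (run them in the JavaScript Console; copy() is a DevTools console utility, not standard JavaScript):

  // Count the links actually present in the rendered DOM
  document.querySelectorAll('a[href]').length;

  // Copy the rendered DOM to the clipboard to diff it against "view-source"
  copy(document.documentElement.outerHTML);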
Lastly, some SEO crawlers finally have the ability to execute JavaScript and render web pages.
Screaming Frog has integrated the Chromium project library into its spider, which makes it the crawler closest to what Googlebot is.
Botify uses PhantomJS but is a cloud-based crawler, so it's more scalable for large websites (although Screaming Frog can be installed on a very powerful virtual machine).
Screamingfrog: https://www.screamingfrog.co.uk/crawl-javascript-seo/
Botify: https://www.botify.com/blog/crawling-javascript-for-technical-seo-audits/