Crawl Budget
Some Insights + Ideas
Jan Hendrik Merlin Jacob
Founder + CTO
Twitter: @jhmjacob
Email: jhm@onpage.org
LinkedIn: linkedin.com/in/jhmjacob

! @jhmjacob
Agenda
» Philosophy
» Parameters to influence Crawl Budget
» Best practice & next steps
! @jhmjacob
Crawl Budget
Definition
The resources (aka money) Google invests in 

your website by sending its crawlers
! @jhmjacob
Philosophy
What would you do,

if you were Google?
! @jhmjacob
Primary Target: 

Make money!

Secondary Target:

The best search results
Philosophy
! @jhmjacob
With its crawlers, Google invests money
to find the “best” webpages
in order to provide the best search results.
Philosophy
! @jhmjacob
Problem 1: 

The size of the web is infinite

Problem 2: 

Even Google's resources are limited
Philosophy
! @jhmjacob
Source: Netcraft
! @jhmjacob
Size of the Google index:
Something between 5 billion

and 1 trillion documents*
(means: around 5-1000 pages per domain)
* = As a matter of fact, there is no real data on this. 

Probably even Google doesn’t know.
Philosophy
! @jhmjacob
Conclusion
Search engines like Google have to

constantly decide if they continue

spending resources on the 

current website or rather go to another.
! @jhmjacob
What is Bing saying about this?
“By providing clear, deep, easy to find 

content on your website, we are more likely 

to index and show your content in search results.”
More: https://www.bing.com/webmaster/help/webmaster-guidelines-30fba23a
! @jhmjacob
“clear”
» Distinct Canonical settings

» Valid redirects (not via Meta-Refresh!)

» Exactly one main headline (H1) per page

» Title, description, alt, links to relevant (!) content

» Standard HTML links (“No Rich Media like JS or Flash”)

» Clean and readable HTML site-navigation

» Clean and normalized URL structure

» “Clear keyword focus”
What is Bing saying about this?
More: https://www.bing.com/webmaster/help/webmaster-guidelines-30fba23a
! @jhmjacob
“deep”
» No “Thin content”

» “Do not copy from other websites”

» Be as relevant as possible for one topic (“Holistic”)

» Keep your pages updated (“freshness”)
What is Bing saying about this?
More: https://www.bing.com/webmaster/help/webmaster-guidelines-30fba23a
! @jhmjacob
“easy to find”
» Clean and up-to-date Sitemap.xml (last-mod!)
» “keep valuable content close to the home page” 

(aka short click-path aka “page level”)

» “use targeted keywords wherever possible”

(regarding internal linking)

» Well structured navigation

(found in URL + Breadcrumbs)
What is Bing saying about this?
More: https://www.bing.com/webmaster/help/webmaster-guidelines-30fba23a
! @jhmjacob
Between the lines:
» Sitemap.xml is used to identify new articles and get 

them indexed asap.

» If the system recognizes regular updates on a page,

it will be crawled more frequently.

» Relevancy of a page is calculated based on internal 

(& external) links as well as the “click distance from

the homepage” (aka “page-level”).

» Pagespeed matters: Otherwise Bounce-Rate can

have negative effects on crawl budget (+ rankings)
What is Bing saying about this?
More: https://www.bing.com/webmaster/help/webmaster-guidelines-30fba23a
! @jhmjacob
What is Yandex saying about this?
More: https://yandex.com/support/webmaster/yandex-indexing/webmaster-advice.xml
Summary of the Webmaster Guidelines:
» Do not use cloaking

» Do not use auto-generated / gibberish text

» No thin content

» No hidden text

» Popups + Downunders = Bad Quality Indicator
» Do not do “User Behaviour Emulation”
! @jhmjacob
What is Google saying about this?
“The best way to think about it is that the number
of pages that we crawl is roughly proportional to
your PageRank. So if you have a lot of incoming
links on your root page, we’ll definitely crawl that.
Then your root page may link to other pages, and
those will get PageRank and we’ll crawl those as
well. As you get deeper and deeper in your site,
however, PageRank tends to decline.”
More: https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
! @jhmjacob
Reminder
» Internal Links are responsible

for passing Pagerank through your

pages 

(Some believe Pagerank is only

generated out of external Backlinks)
» Pagerank “0 to 10” is just a simplified

display for humans. In reality this score

is way more precise.
! @jhmjacob
“Another way to think about it is that the low
PageRank pages on your site are competing
against a much larger pool of pages with the
same or higher PageRank. There are a large
number of pages on the web that have very little or
close to zero PageRank. The pages that get linked
to a lot tend to get discovered and crawled quite
quickly. The lower PageRank pages are likely to
be crawled not quite as often.”
What is Google saying about this?
More: https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
! @jhmjacob
“If we can only take two pages from a site at any
given time, and we are only crawling over a certain
period of time, that can then set some sort of
upper bound on how many pages we are able to
fetch from that host.”
What is Google saying about this?
More: https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
! @jhmjacob
“Imagine we crawl three pages from a site, and
then we discover that the two other pages were
duplicates of the third page. We’ll drop two out of
the three pages and keep only one, and that’s why
it looks like it has less good content. So we might
tend to not crawl quite as much from that site.

…

If there are a large number of pages that we
consider low value, then we might not crawl quite
as many pages from that site, but that is
independent of rel=canonical.”
What is Google saying about this?
More: https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
! @jhmjacob
“If you link to three pages that are duplicates, a
search engine might be able to realize that those
three pages are duplicates and transfer the
incoming link juice to those merged pages.”
What is Google saying about this?
More: https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
! @jhmjacob
“There are some things that we will run a
HEAD for. For example, our image crawl may
use HEAD requests because images might
be much, much larger in content than web
pages…In terms of crawling the web and text
content and HTML, we’ll typically just use a
GET and not run a HEAD query first”
What is Google saying about this?
More: https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
! @jhmjacob
» “There is also not a hard limit on our crawl.”
» Pages with higher Pagerank will get crawled more often

» Free crawling resources will be spent on low-PR pages,

but the chances that the bot will leave the page are higher

(how are they chosen?!)

» You compete against all other pages. Give the bots

reasons to stay.

» Limitation is not based on “Amount of URLs”, rather in 

form of “Machine-Hours” (time-based limits)

(Loadtime matters!)

» Bad page-quality + bad content metrics can scare away bots

(Exit-Condition like “Amount of Unique Content / Time”)

» Google tries to avoid wasting bandwidth

(HEAD Requests for images + if-modified-since)
What is Google saying about this?
! @jhmjacob
Google Search Console
! @jhmjacob
Searchability
Definitions
(aka Findability)
! @jhmjacob
Crawlability + Indexability + Rankability
=
Searchability
(aka Findability)
! @jhmjacob
Crawlability
Is your Webpage (URL)
accessible for crawlers?
! @jhmjacob
Indexability
Should the crawled, extracted
and interpreted content be
added to a search index?
! @jhmjacob
Rankability
Should a particular page

be displayed in the 

search results for a

particular keyword 

(search phrase).
! @jhmjacob
Crawlability + Indexability + Rankability
have a direct or indirect influence
on the Crawl Budget
! @jhmjacob
Technical SEO Buzzword Bingo
“Crawlability” “Indexability” “Rankability”
robots.txt
robots Directive (Response Header / Meta Tag)
rel=prev (Response Header / Meta Tag)
Status Code (Response Header)
Canonical (Response Header / Meta Tag)
hreflang Directives (Response Header / Meta Tag / Sitemap)
Load time (DNS+Server)
Redirects (Response Header / Meta Tag)
Device Directives (Response Header / Meta Tag)
Fragment aka Ajax Crawling (Meta Tag)
Unique Content (Content)
Content Quality (Content)
URL-Structure (URL)
Encoding (Content)
Rendertime (Server+Content)
Vary (Response Header)
File Size (Content)
Location Directives (Content)
if-modified-since Support (Response Header)
Rendering (CSS+JS)
! @jhmjacob
Analyzed by OnPage.org
We offer the most
comprehensive
analysis on website
quality assurance!
! @jhmjacob
robots.txt
This is obvious!
» Learn how to set up your robots.txt file

» Block irrelevant URLs, so the bots don’t waste 

their time on those pages

» Basics: https://en.onpage.org/wiki/Robots.txt
Always remember: If a page is blocked via robots.txt,

the bots can’t see additional settings like 

Canonicals or “noindex” directives.
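To make the blocking rule concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules and the example.com URLs are made-up assumptions for illustration, not taken from a real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps bots away from internal search and cart URLs.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /cart/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(url, user_agent="Googlebot"):
    """Return True if the rules above allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(is_crawlable("https://example.com/search?q=shoes"))
print(is_crawlable("https://example.com/products/shoe"))
```

A URL that fails this check is never downloaded, which is why a canonical or “noindex” on such a page stays invisible to the bot.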
! @jhmjacob
Even though a page might look fine,
under the hood it can still be

broken as hell.
Status Code
! @jhmjacob
200 Valid Page


301 Permanent redirect (after Redesigns)
302 Temporary Redirect
303 See Other (alternative version)
304 Page did not change since last visit

403 Access forbidden

404 Page does not exist
Status Code
! @jhmjacob
Loadtime
Page A: Nice - only 82 milliseconds
until Googlebot got the source code of the page

Page B: Not so nice - on average 1.76 seconds
until the source code has been transferred
! @jhmjacob
[Bar chart: pages per second. Page A: 12.2 pages / second, Page B: 0.59 pages / second]
Loadtime
! @jhmjacob
            Page A              Page B
Per Second  12.2 Pages          0.59 Pages
Per Minute  731.71 Pages        35.29 Pages
Per Hour    43,902.44 Pages     2,117.65 Pages
Per Day     1,053,658.54 Pages  50,823.53 Pages
Loadtime
ouch!
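The throughput figures follow from simple arithmetic. The sketch below assumes the bot fetches one page at a time, which is a simplification. Page A's figures match 0.082 seconds per page; Page B's figures match 1.7 seconds per page rather than the 1.76 seconds quoted on the slide.

```python
def pages_per_period(seconds_per_page):
    """Pages a single sequential fetcher can download per second, minute, hour and day."""
    per_second = 1.0 / seconds_per_page
    return {
        "second": per_second,
        "minute": per_second * 60,
        "hour": per_second * 3600,
        "day": per_second * 86400,
    }


for name, load_time in (("Page A", 0.082), ("Page B", 1.7)):
    rates = pages_per_period(load_time)
    print(name, {period: round(value, 2) for period, value in rates.items()})
```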
! @jhmjacob
Fragment aka Ajax Crawling
More: https://angularjs.org/
excursion
! @jhmjacob
Why angularjs?
Tries to achieve a better User Experience, 

by transferring only small segments

instead of complete pages.
Provides testing functionalities.
Fragment aka Ajax Crawling
excursion
! @jhmjacob
Fragment aka Ajax Crawling
Easy way to identify

an AngularJS site (“ng-app”)
! @jhmjacob
Fragment aka Ajax Crawling
! @jhmjacob
<!DOCTYPE html>
<!--<html lang="en" data-ng-app="MainApp">-->
<html lang="en" id="ng-app" data-ng-app="MainApp">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1, minimum-scale=1, …">
<meta name="keywords" content="mercedes films, mercedes clip, …">
<link href="assets/images/favicon.ico" type="image/x-icon" rel="shortcut icon">
<title>Mercedes-Benz Video Channel</title>
<meta name="keywords" content="{{keywords}}"/>
</head>

…
Fragment aka Ajax Crawling
! @jhmjacob
AngularJS placeholder in the source code: {{keywords}}
<meta name="keywords" content="{{keywords}}"/>
Fragment aka Ajax Crawling
! @jhmjacob
Fragment aka Ajax Crawling
! @jhmjacob
More: https://developers.facebook.com/tools/debug/
Fragment aka Ajax Crawling
! @jhmjacob
More: https://cards-dev.twitter.com/validator
Fragment aka Ajax Crawling
! @jhmjacob
Fragment aka Ajax Crawling
! @jhmjacob
Why?
1) There are also other JS testing frameworks

Jasmine / PhantomJS
2) WallabyJS 

Nice Plugin for realtime JS Unit Tests
3) IMO: AngularJS is rather suited 

for web-apps

Not so well for content based sites which 

rely on their fit in the web eco-system
! @jhmjacob
Ajax Crawling Scheme
1) Within <head> Tag

<meta name="fragment" content="!"/>
2) Hashbang URLs (“#!”)

https://www.seokomm.at/#!agenda
+ Snapshot URL with “real” HTML
! @jhmjacob
Ajax Crawling Scheme
What happens here?
1) GET http://video.mercedes-benz.co.uk/#!/
   Complete Sourcecode (9kb)
2) GET http://video.mercedes-benz.co.uk/?_escaped_fragment_=/
   Complete Sourcecode (9kb) without AngularJS placeholders
Two requests were required to gather the valid HTML code!
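The mapping from a hashbang URL to its snapshot URL can be sketched as follows. This is an illustration of the (deprecated) scheme, not a crawler implementation. The function name and the choice of characters left unescaped are assumptions.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def escaped_fragment_url(url):
    """Map a "#!" URL to the "_escaped_fragment_" URL a supporting crawler would request."""
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        return url  # no hashbang, nothing to rewrite
    # Escape characters such as %, #, & and + that would be ambiguous inside a query string.
    escaped = quote(parts.fragment[1:], safe="/=:,;@!$'()*")
    param = "_escaped_fragment_=" + escaped
    query = parts.query + "&" + param if parts.query else param
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


print(escaped_fragment_url("http://video.mercedes-benz.co.uk/#!/"))
print(escaped_fragment_url("https://www.seokomm.at/#!agenda"))
```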
! @jhmjacob
Ajax Crawling Scheme
Support of Ajax Crawling
            “Ajax Crawling Scheme”   Native Ajax Crawling
Google      Yes, but “deprecated”    Yes
Bing        Yes                      Nope
OnPage.org  Yes                      Nope
Facebook    Nope                     Nope
Twitter     Nope                     Nope
Pinterest   Nope                     Nope
! @jhmjacob
URL Structure
1) Speaking URLs (aka Hackable URLs)

https://www.ccc.de/events/2015/congress
2) Sort GET Parameter (predefined order)

https://de.onpage.org/?currency=de&lang=de
3) Relevant Content on top tier (subfolder),

should correlate with Pagerank flow
4) Session IDs in URLs are a No-Go!

If no other way: Remove them via GSC
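Points 2 and 4 can be enforced when links are generated. The sketch below sorts GET parameters alphabetically and drops session parameters. The parameter names in `SESSION_PARAMS` are hypothetical, and alphabetical order is just one possible predefined order.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical session parameter names; adjust to whatever the site actually uses.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}


def normalize_url(url):
    """Return the URL with session parameters removed and the rest sorted by name."""
    parts = urlsplit(url)
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    query = urlencode(sorted(params))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


print(normalize_url("https://de.onpage.org/?lang=de&currency=de"))
```

With a fixed order, `?lang=de&currency=de` and `?currency=de&lang=de` collapse into one URL instead of two crawlable duplicates.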
! @jhmjacob
Vary Response Header
1) Does the page provide compression? (Must!!!) 

Vary: Accept-Encoding
2) Do Cookies (notably) change the content?

Vary: Cookie
3) Is the page multi-lingual? (Within same URL!)

Vary: Accept-Language
! @jhmjacob
“if-modified-since” Workflow
! @jhmjacob
11/01/2015:

GoogleBot calls en.onpage.org
Server response:
Complete Sourcecode (10.3 KB)

+ Response Header “Last-Modified”
“if-modified-since” Workflow
! @jhmjacob
11/5/2015:

GoogleBot calls en.onpage.org again 

and includes an additional

Request Header “If-Modified-Since”
(carrying the “Last-Modified” value of the previous response)
Server response:
Empty body (0kb)

+ Status Code “304 Not Modified”
“if-modified-since” Workflow
! @jhmjacob
» Dramatically reduces downloaded file size 

for unchanged content
» Enables bots + users to download more relevant

content within the same timespan
» Requires good Infrastructure / CMS 

like Page-Caching - more on that later!
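The server side of this workflow can be sketched in a few lines. The function below is framework-independent and purely illustrative. The name `conditional_get` and the plain-dict headers are assumptions, and `last_modified` is expected to be a timezone-aware UTC datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_get(request_headers, last_modified, body):
    """Return (status, headers, body); answer 304 with an empty body if nothing changed."""
    last_modified = last_modified.replace(microsecond=0)  # HTTP dates have second precision
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    since = request_headers.get("If-Modified-Since")
    if since:
        try:
            if last_modified <= parsedate_to_datetime(since):
                return 304, headers, b""
        except (TypeError, ValueError):
            pass  # unparsable date: fall through to a full response
    return 200, headers, body


changed_at = datetime(2015, 11, 1, 10, 0, tzinfo=timezone.utc)
status, headers, _ = conditional_get({}, changed_at, b"<html>...</html>")
print(status, headers)
print(conditional_get({"If-Modified-Since": headers["Last-Modified"]}, changed_at, b"<html>...</html>")[0])
```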
“if-modified-since” Workflow
! @jhmjacob
robots Directive
1) Within <head> Tag

<meta name="robots"
content="noindex,follow"/>
2) Via Response Header

X-Robots-Tag: noindex,follow
Remember: A lot of “noindex” pages have a negative effect

on the crawl budget … because resources are wasted

to find out that the URL has no real content.
! @jhmjacob
robots Directive: “unavailable_after”
More: https://googleblog.blogspot.de/2007/07/robots-exclusion-protocol-now-with-even.html
1) Within <head> Tag

<meta name="robots" content="unavailable_after: 20-Nov-2015 15:35:00 CET">
2) Via Response Header

X-Robots-Tag: unavailable_after: 20 Nov 2015 15:35:00 CET
! @jhmjacob
Canonical
1) Within <head> Tag

<link rel="canonical" href="https://de.onpage.org/"/>
2) Via Response Header

Link: <https://de.onpage.org/>; rel="canonical"
The Response Header Version can also be used for

PDF files and images (yummy).
Remember: A lot of “canonicalized” pages (canonical to 

other URL) have a negative effect on the crawl budget … 

because resources are wasted to find out that the URL 

has no real content.
! @jhmjacob
Redirects
1) Via Response Header

Status Code: 301

Location: https://de.onpage.org/
2) Within <head> Tag

<meta http-equiv="refresh" content="5; url=http://example.com/">
Redirect chains should be avoided. 

Best practice is to avoid internal redirects altogether.

Rather update old links and point them to the new URL.
Search engines do not like redirects via Meta Tags or JavaScript. 

These should only be used with caution to navigate users.

The semantically correct way is the response header (“301 vs 302”)
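To update old links, it helps to resolve every internal link to its final target once. The following is a small sketch that works on an in-memory redirect map, with made-up paths. How the map is obtained (crawl, server config, CMS) is left open.

```python
def resolve_redirect(redirects, url, max_hops=10):
    """Follow a redirect map to its final target; return (final_url, number_of_hops)."""
    seen = [url]
    while url in redirects:
        url = redirects[url]
        if url in seen:
            raise ValueError("redirect loop: " + " -> ".join(seen + [url]))
        seen.append(url)
        if len(seen) - 1 > max_hops:
            raise ValueError("too many redirects starting at " + seen[0])
    return url, len(seen) - 1


# Hypothetical redirect map: /old was moved twice, so links to it form a chain.
redirects = {"/old": "/newer", "/newer": "/newest"}
print(resolve_redirect(redirects, "/old"))
```

Any link that resolves with one or more hops is a candidate for being rewritten to the final URL.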
! @jhmjacob
Unique & Relevant Content
1) No thin content
2) No duplicate content
3) No auto-translated pages
In terms of indexability
! @jhmjacob
Crawler: Behind the Scenes
Bloomfilter
De-Duplication
Index
! @jhmjacob
The Challenge: Big Data Scale
» Was a given URL already crawled?

(if so: Does a reload make sense?)

Solution: Bloomfilter + Key-Value Store
» Is the content of a crawled URL 

valuable enough to be added in the index? 

Solution: Content-Fingerprinting + Hamming Distance
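The “was this URL already crawled?” check can be illustrated with a toy Bloom filter. This sketch shows the principle only. The sizes and the SHA-256 based hashing are arbitrary assumptions and say nothing about how any search engine implements it.

```python
import hashlib


class BloomFilter:
    """Set-like structure: no false negatives, a small chance of false positives."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions per item by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


seen = BloomFilter()
seen.add("https://en.onpage.org/")
print("https://en.onpage.org/" in seen)
```

Because a positive answer can be wrong, a hit is typically confirmed against the key-value store, while a miss is definitive.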
! @jhmjacob
“Most algorithms for near-duplicate detection
run in batch- mode over the entire collection
of documents. For web crawling, an online
algorithm is necessary because the decision
to ignore the hyper-links in a recently-
crawled page has to be made quickly”
More: http://www2007.cpsc.ucalgary.ca/papers/paper215.pdf
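Content fingerprinting with Hamming distance can be sketched as a simplified simhash: similar documents get fingerprints that differ in only a few bits. Whitespace tokenization, MD5 as the token hash and the 64-bit width are assumptions made for brevity.

```python
import hashlib


def simhash(text, bits=64):
    """Compute a simplified simhash fingerprint over whitespace-separated tokens."""
    weights = [0] * bits
    for token in text.lower().split():
        token_hash = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    # A fingerprint bit is set when the majority of tokens had that bit set.
    return sum(1 << i for i, weight in enumerate(weights) if weight > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


page_1 = "crawl budget is the amount of resources a search engine spends on a site"
page_2 = "crawl budget is the amount of resources a search engine invests in a site"
print(hamming_distance(simhash(page_1), simhash(page_2)))
```

A crawler would treat two pages as near-duplicates when the distance stays below a small threshold. The right threshold depends on the fingerprint width and the corpus.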
Crawler: Behind the Scenes
! @jhmjacob
Encoding
1) Via Response Header

Content-Type: text/html; charset=UTF-8
2) Within <head> Tag

<meta charset="UTF-8" />
Charset should always be defined.

Try to work with UTF-8 - saves a lot of headaches in the long run.
! @jhmjacob
Encoding
This is what an

encoding f*ckup
looks like
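A typical failure can be reproduced in a few lines: the server sends UTF-8 bytes, but the client decodes them as Latin-1 because the charset was missing or wrong. The sample word is an arbitrary choice.

```python
original = "Müller"

# UTF-8 bytes wrongly interpreted as Latin-1 produce the classic garbled output.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)  # MÃ¼ller

# Reversing the wrong step recovers the text, as long as no bytes were lost.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # Müller
```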
! @jhmjacob
File Size
1) Within “Google Search Appliance”: Max. 20 MB

But that's the Enterprise version of Google
2) In the wild the limit is probably way lower

(somewhere between 500 KB and 1 MB)
The bigger the file, the longer it takes to download.

Rule of thumb: The smaller, the better!
! @jhmjacob
Rendering
1) Javascript and CSS files have to be accessible 

for GoogleBot

OnPage.org provides good reports on that
2) If Google has issues rendering the page, indexation

is at risk
3) Also make sure that the rendering does not take too long

(Pagespeed Test).
4) Does the rendering on mobile devices look fine?

(Viewport Tag)
! @jhmjacob
rel=prev
1) Within <head> Tag

<link rel="prev" href="http://abc.com/article?page=1" />

<link rel="next" href="http://abc.com/article?page=3" />
2) Via Response Header

Link: <http://abc.com/article?page=1>; rel="prev"

Link: <http://abc.com/article?page=3>; rel="next"
More: http://googlewebmastercentral.blogspot.co.at/2011/09/pagination-with-relnext-and-relprev.html
! @jhmjacob
rel=prev
» Semantic Markup 

for Paginations

Groups multiple pages

into one ranking
» Intended for multi-page

articles (newspapers).

But Google now also

shows product-listings

as use case.
More: http://googlewebmastercentral.blogspot.co.at/2011/09/pagination-with-relnext-and-relprev.html
! @jhmjacob
rel=prev alternative: “show all page”
More: http://googlewebmastercentral.blogspot.co.at/2011/09/view-all-in-search-results.html
! @jhmjacob
Already sleepy?!

;)
! @jhmjacob
hreflang Directives
More: https://moz.com/blog/using-the-correct-hreflang-tag-a-new-generator-tool
! @jhmjacob
[Diagram: Article XYZ exists in three language versions: German (“de”), Spanish (“es”) and English (“x-default”). Each version points to the other two via hreflang=“de”, hreflang=“es” or hreflang=“x-default”.]
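The markup behind such a setup can be generated from one mapping of language codes to URLs, so all versions always reference each other consistently. The example.com URLs below are placeholders, and the function name is an assumption.

```python
from html import escape


def hreflang_links(versions):
    """Build <link rel="alternate"> tags for every language version of one article."""
    return [
        f'<link rel="alternate" hreflang="{escape(lang)}" href="{escape(url)}" />'
        for lang, url in versions.items()
    ]


# Hypothetical URLs for the three versions of "Article XYZ".
versions = {
    "de": "https://example.com/de/article-xyz",
    "es": "https://example.com/es/article-xyz",
    "x-default": "https://example.com/article-xyz",
}
print("\n".join(hreflang_links(versions)))
```

Emitting the same block of tags on every version keeps the references reciprocal by construction.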
! @jhmjacob
Device Directives
1) Viewport Tag

<meta name="viewport" content="width=device-width, initial-scale=1.0" />
2) Media Queries

<link rel="stylesheet" media="only screen and (max-width: 800px)" href="/mobile.min.css" />
3) Dedicated URL for mobile devices

<link rel="alternate" media="only screen and (max-width: 640px)" href="http://m.example.com/page-1" >
! @jhmjacob
Content Quality
1) The basics 

Title, Description etc.
2) Zero tolerance for broken pages
3) Avoid internal redirects

Update links instead
4) Lightweight Sourcecode

Get rid of unnecessary inline JS + CSS, remove Whitespaces, Line
Breaks, Tabs, etc.
! @jhmjacob
Location Directives
1) schema.org Markup (“LocalBusiness”)

Seems to be used by Google for “Local Search”
2) Address / Telephone

So your website also matches Query-Modifications
3) Dublin Core Markup

Not really relevant for SEO, but does not hurt (semantic!)
More: https://plus.google.com/+JohnMueller/posts/1EwfjTuCzPQ
More: http://schema.org/LocalBusiness
! @jhmjacob
Outlook
! @jhmjacob
Static CMS
! @jhmjacob
Static CMS
More: https://www.staticgen.com/
! @jhmjacob
Static CMS
! @jhmjacob
More: https://www.getkirby.com/
Static CMS
! @jhmjacob
Wordpress is kind of

the Internet Explorer

in the CMS space
Static CMS
! @jhmjacob
Static File System in the Wild
! @jhmjacob
if-modified-since: OnPage.org
1. First download of the page: The system generates the final sourcecode
! @jhmjacob
2. An optimized version of the source code gets saved on disk (“Page-Caching”). 

The cache filename is generated based on relevant cookie values.

(in our case: language + currency of visitor)
if-modified-since: OnPage.org
! @jhmjacob
3. The same URL (+ same cookie settings) gets called again.

Search engines will send the “Last-Modified” value (from the
previous response) back in the “If-Modified-Since” Request Header.
if-modified-since: OnPage.org
! @jhmjacob
4. The response for the second call is just taken from the cache file

Means: Ultra fast Time to First Byte, because server doesn’t need to “think”
We dropped irrelevant characters (newlines, tabs, spaces) when we saved the cache file.
-> We have seen clients who reduced 30% (!) of their filesizes with that simple step

-> This results in better loadtimes
if-modified-since: OnPage.org
! @jhmjacob
5. Part of the returned response was the “Last-Modified” setting. 

It was calculated based on the cache file timestamp.
if-modified-since: OnPage.org
! @jhmjacob
» Super fast Time to First Byte

When the file is cached
» Sends optimized source code to reduce

bandwidth usage

for both parties: Our servers + Google Crawlers
» If the file was loaded before, only send what’s

really required 

=> “304 Not modified” aka 

“Everything is cool, you have the latest version in
your index”
» Bonus: This workflow enables us to set

the last-mod attribute in sitemap.xml
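The page-caching steps can be sketched as follows. This is not OnPage.org's actual code. The function names, the SHA-1 based filename and the crude whitespace minifier are assumptions for illustration.

```python
import hashlib
import os
import re
from email.utils import formatdate


def cache_path(cache_dir, url, language, currency):
    """Cache filename derived from the URL plus the cookie values that change the content."""
    key = f"{url}|{language}|{currency}"
    return os.path.join(cache_dir, hashlib.sha1(key.encode("utf-8")).hexdigest() + ".html")


def minify(html):
    """Collapse runs of whitespace; too crude for <pre> blocks or inline scripts."""
    return re.sub(r"\s+", " ", html).strip()


def get_page(cache_dir, url, language, currency, render):
    """Return (body, last_modified); render() only runs when no cache file exists yet."""
    path = cache_path(cache_dir, url, language, currency)
    if not os.path.exists(path):
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(minify(render()))
    with open(path, encoding="utf-8") as handle:
        body = handle.read()
    # The Last-Modified value is taken from the cache file's timestamp.
    return body, formatdate(os.path.getmtime(path), usegmt=True)
```

The returned `last_modified` string can be sent as the “Last-Modified” response header and reused for the last-mod entry in sitemap.xml.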
if-modified-since: OnPage.org
! @jhmjacob
Other design principles of 

our homegrown static CMS
! @jhmjacob
Static CMS: Design Principles
1) File-Position: Folders in URL are the same 

as on the filesystem

Authors are conditioned to build a clean structure + file-hierarchy
! @jhmjacob
2) Separation of Code, Design and Content

Every member of the team sees his part
Static CMS: Design Principles
For Designers:
affiliate.tpl
! @jhmjacob
2) Separation of Code, Design and Content

Every member of the team sees his part
Static CMS: Design Principles
For Texters:
affiliate.de.json
! @jhmjacob
2) Separation of Code, Design and Content

Makes MS Word etc. redundant.



If a new translation needs to be added, the translator gets the

English version, renames the file, translates the contents and uploads
the file. 



Bam! It’s online.



Text updates, Design changes and new images are versioned by
git.
Static CMS: Design Principles
! @jhmjacob
3) Multilinguality by nature

If a new translation is uploaded, the system starts a couple of 

cool things
Static CMS: Design Principles
! @jhmjacob
Editor Friendliness + File-Management
3) Multilinguality by nature

If a user navigates to the wrong language version of a page, he will

see a friendly reminder that there is a localized version for him
! @jhmjacob
Editor Friendliness + File-Management
3) Multilinguality by nature

Links to the translated versions of the current page are
automatically added to the footer
! @jhmjacob
Editor Friendliness + File-Management
3) Multilinguality by nature

And hreflang markup is automatically added to the <head> section
of the document
! @jhmjacob
Editor Friendliness + File-Management
4) Fast + Secure

No Database which slows down server responses! Git keeps track of
changes and provides rollback functionalities! No other
dependencies / services which might cause security holes
! @jhmjacob
Editor Friendliness + File-Management
5) Transparent and logical structure

Images reside where they belong: In the same folder as the article
itself - like its template, translations, additional script logic.



Cleaning up made easy: If an article needs to be deleted, just
remove the folder -> All files are gone, no more deserted files in 

“images” folders or localization databases, etc.
! @jhmjacob
Outlook
What we want to build next
! @jhmjacob
Outlook
» Multi-Language Images

The same URL for all localized versions of an image
https://en.onpage.org/beispiel/teaser.jpg https://en.onpage.org/beispiel/teaser.jpg
! @jhmjacob
Outlook
Be careful: This is untested freestyle code - just to give you an idea :)
» Multi-Language Images

htaccess file detects that an image file is requested
! @jhmjacob
Outlook
» Multi-Language Images

The browser exposes the preferred languages of the user
! @jhmjacob
Outlook
» Multi-Language 

Images

A script takes the 

request, checks if a localized

version exists and returns

the value (or the default image).



Result is cached in 

browser cache.
Be careful: 

This is untested freestyle 

code - just to give 

you an idea :)
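The language negotiation behind this idea could look roughly like the sketch below. It is equally untested in production. The `teaser.de.jpg` naming convention and both function names are assumptions, not the scheme used on the slides.

```python
def preferred_languages(accept_language):
    """Parse an Accept-Language header into primary language codes, best first."""
    ranked = []
    for part in accept_language.split(","):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        ranked.append((quality, tag.strip().lower().split("-")[0]))
    ranked.sort(key=lambda item: item[0], reverse=True)  # stable: ties keep header order
    languages = []
    for quality, lang in ranked:
        if quality > 0 and lang != "*" and lang not in languages:
            languages.append(lang)
    return languages


def pick_image(name, extension, accept_language, available):
    """Return the localized filename if one exists, otherwise the default image."""
    for lang in preferred_languages(accept_language):
        candidate = f"{name}.{lang}.{extension}"
        if candidate in available:
            return candidate
    return f"{name}.{extension}"


files = {"teaser.jpg", "teaser.de.jpg", "teaser.en.jpg"}
print(pick_image("teaser", "jpg", "de-DE,de;q=0.9,en;q=0.8", files))
```

Since the same URL then returns different bytes per language, the response would also need `Vary: Accept-Language` so caches keep the variants apart.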
! @jhmjacob
Outlook
» Last-Modified Logging

To find out how popular a page is among search engines
! @jhmjacob
Outlook
» Last-Modified Logging

To find out how popular a page is among search engines
» By setting the last-modified response header,
Search engines will include its value in the next
request of the page 

(for if-modified-since checks)

Knowing this, we can calculate the timespan between this visit and
the last one.
! @jhmjacob
Outlook
» Low timespan

= URL seems to be relevant for the search engine

= Good chances to rank
» High timespan

= URL seems to be rather irrelevant for the SE

= Less chances to rank

= Alerting based on the importance of the page
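The timespan calculation can be sketched like this. It only measures the time between visits if the “Last-Modified” header sent on the previous response encoded the time of that response, which is an assumption here. The function name is hypothetical as well.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def revisit_interval(if_modified_since, now=None):
    """Time between the date a bot sends in If-Modified-Since and the current request."""
    now = now or datetime.now(timezone.utc)
    try:
        previous = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return None  # header missing or not a valid HTTP date
    if previous.tzinfo is None:
        previous = previous.replace(tzinfo=timezone.utc)
    return now - previous


interval = revisit_interval(
    "Sun, 01 Nov 2015 10:00:00 GMT",
    now=datetime(2015, 11, 5, 10, 0, tzinfo=timezone.utc),
)
print(interval)
```

Logging this interval per URL gives the low/high timespan signal described above and a basis for alerting on important pages.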
! @jhmjacob
“It’s not that Google will penalize
you, it’s the opportunity cost for
dirty architecture based on a finite
crawl budget.”
More: http://www.blindfiveyearold.com/crawl-optimization
Last words
Thanks!
OnPage.org GmbH
Twitter: http://twitter.com/onpage_org
Facebook: http://fb.me/onpage.org
Web: https://en.onpage.org

Jan Hendrik Merlin Jacob
Founder + CTO
Twitter: https://twitter.com/jhmjacob
Email: jhm@onpage.org
LinkedIn: http://linkedin.com/in/jhmjacob

http://onpa.ge/V141p

More Related Content

What's hot

Advanced data-driven technical SEO - SMX London 2019
Advanced data-driven technical SEO - SMX London 2019Advanced data-driven technical SEO - SMX London 2019
Advanced data-driven technical SEO - SMX London 2019
Bastian Grimm
 
Structured Data & Schema.org - SMX Milan 2014
Structured Data & Schema.org - SMX Milan 2014Structured Data & Schema.org - SMX Milan 2014
Structured Data & Schema.org - SMX Milan 2014
Bastian Grimm
 
Rendering SEO (explained by Google's Martin Splitt)
Rendering SEO (explained by Google's Martin Splitt)Rendering SEO (explained by Google's Martin Splitt)
Rendering SEO (explained by Google's Martin Splitt)
Anton Shulke
 
Keeping Things Lean & Mean: Crawl Optimisation - Search Marketing Summit AU
Keeping Things Lean & Mean: Crawl Optimisation - Search Marketing Summit AUKeeping Things Lean & Mean: Crawl Optimisation - Search Marketing Summit AU
Keeping Things Lean & Mean: Crawl Optimisation - Search Marketing Summit AU
Jason Mun
 
Migration Best Practices - SMX West 2019
Migration Best Practices - SMX West 2019Migration Best Practices - SMX West 2019
Migration Best Practices - SMX West 2019
Bastian Grimm
 
The Technical SEO Renaissance
The Technical SEO RenaissanceThe Technical SEO Renaissance
The Technical SEO Renaissance
Michael King
 
OK Google, Whats next? - OMT Wiesbaden 2018
OK Google, Whats next? - OMT Wiesbaden 2018OK Google, Whats next? - OMT Wiesbaden 2018
OK Google, Whats next? - OMT Wiesbaden 2018
Bastian Grimm
 
Troubleshooting SEO for JS Frameworks - Patrick Stox - DTD 2018
Troubleshooting SEO for JS Frameworks - Patrick Stox - DTD 2018Troubleshooting SEO for JS Frameworks - Patrick Stox - DTD 2018
Troubleshooting SEO for JS Frameworks - Patrick Stox - DTD 2018
patrickstox
 
Crawl the entire web in 10 minutes...and just 100€
Crawl the entire web  in 10 minutes...and just 100€Crawl the entire web  in 10 minutes...and just 100€
Crawl the entire web in 10 minutes...and just 100€
Danny Linden
 
SearchLove London 2016 | Dom Woodman | How to Get Insight From Your Logs
SearchLove London 2016 | Dom Woodman | How to Get Insight From Your LogsSearchLove London 2016 | Dom Woodman | How to Get Insight From Your Logs
SearchLove London 2016 | Dom Woodman | How to Get Insight From Your Logs
Distilled
 
Migration Best-Practices: Successfully re-launching your website - SMX New Yo...
Migration Best-Practices: Successfully re-launching your website - SMX New Yo...Migration Best-Practices: Successfully re-launching your website - SMX New Yo...
Migration Best-Practices: Successfully re-launching your website - SMX New Yo...
Bastian Grimm
 
Three site speed optimisation tips to make your website REALLY fast - Brighto...
Three site speed optimisation tips to make your website REALLY fast - Brighto...Three site speed optimisation tips to make your website REALLY fast - Brighto...
Three site speed optimisation tips to make your website REALLY fast - Brighto...
Bastian Grimm
 
Migration Best Practices - Peak Ace on Air
Migration Best Practices - Peak Ace on AirMigration Best Practices - Peak Ace on Air
Migration Best Practices - Peak Ace on Air
Bastian Grimm
 
Digital Olympus Technical SEO Findings Whilst Taming An SEO Beast
Digital Olympus Technical SEO Findings Whilst Taming An SEO BeastDigital Olympus Technical SEO Findings Whilst Taming An SEO Beast
Digital Olympus Technical SEO Findings Whilst Taming An SEO Beast
Dawn Anderson MSc DigM
 
What's Next for Page Experience - SMX Next 2021 - Patrick Stox
What's Next for Page Experience - SMX Next 2021 - Patrick StoxWhat's Next for Page Experience - SMX Next 2021 - Patrick Stox
What's Next for Page Experience - SMX Next 2021 - Patrick Stox
Ahrefs
 
React JS and Search Engines - Patrick Stox at Triangle ReactJS Meetup
React JS and Search Engines - Patrick Stox at Triangle ReactJS MeetupReact JS and Search Engines - Patrick Stox at Triangle ReactJS Meetup
React JS and Search Engines - Patrick Stox at Triangle ReactJS Meetup
patrickstox
 
SMX Advanced 2018 SEO for Javascript Frameworks by Patrick Stox
SMX Advanced 2018 SEO for Javascript Frameworks by Patrick StoxSMX Advanced 2018 SEO for Javascript Frameworks by Patrick Stox
SMX Advanced 2018 SEO for Javascript Frameworks by Patrick Stox
patrickstox
 
Technical SEO Myths Facts And Theories On Crawl Budget And The Importance Of ...
Technical SEO Myths Facts And Theories On Crawl Budget And The Importance Of ...Technical SEO Myths Facts And Theories On Crawl Budget And The Importance Of ...
Technical SEO Myths Facts And Theories On Crawl Budget And The Importance Of ...
Dawn Anderson MSc DigM
 
A Crash Course in Technical SEO from Patrick Stox - Beer & SEO Meetup May 2019
A Crash Course in Technical SEO from Patrick Stox - Beer & SEO Meetup May 2019A Crash Course in Technical SEO from Patrick Stox - Beer & SEO Meetup May 2019
A Crash Course in Technical SEO from Patrick Stox - Beer & SEO Meetup May 2019
patrickstox
 
JavaScript SEO Ungagged 2019 Patrick Stox
JavaScript SEO Ungagged 2019 Patrick StoxJavaScript SEO Ungagged 2019 Patrick Stox
JavaScript SEO Ungagged 2019 Patrick Stox
patrickstox
 

Crawl Budget - Some Insights & Ideas @ seokomm 2015

  • 1. Crawl Budget: Some Insights + Ideas
 Jan Hendrik Merlin Jacob, Founder + CTO
 @jhmjacob
 jhm@onpage.org
 linkedin.com/in/jhmjacob

  • 2. Agenda
 » Philosophy
 » Parameters to influence Crawl Budget
 » Best practice & next steps
  • 3. Crawl Budget: Definition
 The resources (aka money) Google invests in your website by sending its crawlers.
  • 4. Philosophy
 What would you do if you were Google?
  • 5. Philosophy
 Primary target: Make money!
 Secondary target: The best search results
  • 6. Philosophy
 With its crawlers, Google invests money to find the “best” webpages in order to provide the best search results.
  • 7. Philosophy
 Problem 1: The size of the web is infinite
 Problem 2: Even Google’s resources are limited
  • 9. Philosophy
 Size of the Google index: something between 5 billion and 1 trillion documents* (that is, around 5 to 1,000 pages per domain)
 * = As a matter of fact, there is no real data on this. Probably even Google doesn’t know.
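The per-domain range on this slide comes down to a simple division. A minimal sketch, assuming roughly one billion domains; that round figure is an assumption that makes the slide's numbers work out, the slide itself does not state it:

```python
# Rough arithmetic behind "5 to 1,000 pages per domain".
# ASSUMPTION: about one billion domains; the slide does not state this number.
ASSUMED_DOMAINS = 1_000_000_000

INDEX_SIZE_LOW = 5_000_000_000        # 5 billion documents (lower estimate)
INDEX_SIZE_HIGH = 1_000_000_000_000   # 1 trillion documents (upper estimate)


def pages_per_domain(index_size, domains=ASSUMED_DOMAINS):
    """Average number of indexed documents per domain."""
    return index_size / domains


low = pages_per_domain(INDEX_SIZE_LOW)
high = pages_per_domain(INDEX_SIZE_HIGH)
print(f"{low:.0f} to {high:.0f} pages per domain")
```

With a different domain count, the range shifts proportionally, which is one more reason to treat these numbers as an order-of-magnitude estimate.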
  • 10. Conclusion
 Search engines like Google constantly have to decide whether to keep spending resources on the current website or move on to another.
  • 11. What is Bing saying about this?
 “By providing clear, deep, easy to find content on your website, we are more likely to index and show your content in search results.”
 More: https://www.bing.com/webmaster/help/webmaster-guidelines-30fba23a
  • 12. What is Bing saying about this? “clear”
 » Distinct canonical settings
 » Valid redirects (not via meta refresh!)
 » Exactly one main headline (H1) per page
 » Title, description, alt, links to relevant (!) content
 » Standard HTML links (“No Rich Media like JS or Flash”)
 » Clean and readable HTML site navigation
 » Clean and normalized URL structure
 » “Clear keyword focus”
 More: https://www.bing.com/webmaster/help/webmaster-guidelines-30fba23a
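Three of the “clear” points (one canonical, no meta refresh, exactly one H1) can be checked mechanically. A minimal sketch using only Python's standard `html.parser`; the class name `ClarityCheck`, the function `check_page`, and the sample markup are hypothetical and only illustrate the idea:

```python
from html.parser import HTMLParser


class ClarityCheck(HTMLParser):
    """Counts H1 tags, canonical links and meta refreshes in one HTML document."""

    def __init__(self):
        super().__init__()
        self.h1_count = 0
        self.canonicals = []
        self.has_meta_refresh = False

    def handle_starttag(self, tag, attrs):
        # Attribute values can be None (e.g. <input disabled>), so normalize them.
        attr = {name: (value or "") for name, value in attrs}
        if tag == "h1":
            self.h1_count += 1
        elif tag == "link" and attr.get("rel", "").lower() == "canonical":
            self.canonicals.append(attr.get("href", ""))
        elif tag == "meta" and attr.get("http-equiv", "").lower() == "refresh":
            self.has_meta_refresh = True


def check_page(html):
    """Return a pass/fail flag for each of the three checks."""
    parser = ClarityCheck()
    parser.feed(html)
    parser.close()
    return {
        "exactly_one_h1": parser.h1_count == 1,
        "exactly_one_canonical": len(parser.canonicals) == 1,
        "no_meta_refresh": not parser.has_meta_refresh,
    }


# Hypothetical sample page with two H1 tags and a meta refresh.
SAMPLE = """<html><head>
<link rel="canonical" href="https://example.com/page">
<meta http-equiv="refresh" content="0; url=https://example.com/new">
</head><body><h1>One</h1><h1>Two</h1></body></html>"""

result = check_page(SAMPLE)
print(result)
```

The sketch only inspects static HTML; it does not follow redirects or evaluate JavaScript, so it covers just part of the list above.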
  • 13. What is Bing saying about this? “deep”
 » No “thin content”
 » “Do not copy from other websites”
 » Be as relevant as possible for one topic (“holistic”)
 » Keep your pages updated (“freshness”)
 More: https://www.bing.com/webmaster/help/webmaster-guidelines-30fba23a
  • 14. What is Bing saying about this? “easy to find”
 » Clean and up-to-date sitemap.xml (lastmod!)
 » “keep valuable content close to the home page” (aka short click path aka “page level”)
 » “use targeted keywords wherever possible” (regarding internal linking)
 » Well-structured navigation (found in URL + breadcrumbs)
 More: https://www.bing.com/webmaster/help/webmaster-guidelines-30fba23a
  • 15. ! @jhmjacob Between the lines: » Sitemap.xml is used to identify new articles and get 
 them indexed asap. » If the system recognizes regular updates on a page,
 it will be crawled more frequently. » Relevancy of a page is calculated based on internal 
 (& external) links as well as the “click distance from
 the homepage” (aka “page-level”). » Pagespeed matters: Otherwise Bounce-Rate can
 have negative effects on crawl budget (+ rankings) What is bing saying about this? More: https://www.bing.com/webmaster/help/webmaster-guidelines-30fba23a
  • 16. ! @jhmjacob What is Yandex saying about this? More: https://yandex.com/support/webmaster/yandex-indexing/webmaster-advice.xml Summary of the Webmaster Guidelines: » Do not use cloaking » Do not use auto-generated / gibberish text » No thin content » No hidden text » Popups + Downunders = Bad Quality Indicator » Do not do “User Behaviour Emulation”
  • 17. ! @jhmjacob What is Google saying about this? “The best way to think about it is that the number of pages that we crawl is roughly proportional to your PageRank. So if you have a lot of incoming links on your root page, we’ll definitely crawl that. Then your root page may link to other pages, and those will get PageRank and we’ll crawl those as well. As you get deeper and deeper in your site, however, PageRank tends to decline.” More: https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
  • 18. ! @jhmjacob Reminder » Internal Links are responsible
 for passing Pagerank through your
 pages 
 (Some believe Pagerank is only
 generated out of external Backlinks) » Pagerank “0 to 10” is just a simplified
 display for humans. In reality this score
 is way more precise.
  • 19. ! @jhmjacob “Another way to think about it is that the low PageRank pages on your site are competing against a much larger pool of pages with the same or higher PageRank. There are a large number of pages on the web that have very little or close to zero PageRank. The pages that get linked to a lot tend to get discovered and crawled quite quickly. The lower PageRank pages are likely to be crawled not quite as often.” What is Google saying about this? More: https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
  • 20. ! @jhmjacob “If we can only take two pages from a site at any given time, and we are only crawling over a certain period of time, that can then set some sort of upper bound on how many pages we are able to fetch from that host.” What is Google saying about this? More: https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
  • 21. ! @jhmjacob “Imagine we crawl three pages from a site, and then we discover that the two other pages were duplicates of the third page. We’ll drop two out of the three pages and keep only one, and that’s why it looks like it has less good content. So we might tend to not crawl quite as much from that site.
 …
 If there are a large number of pages that we consider low value, then we might not crawl quite as many pages from that site, but that is independent of rel=canonical.” What is Google saying about this? More: https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
  • 22. ! @jhmjacob “If you link to three pages that are duplicates, a search engine might be able to realize that those three pages are duplicates and transfer the incoming link juice to those merged pages.” What is Google saying about this? More: https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
  • 23. ! @jhmjacob “There are some things that we will run a HEAD for. For example, our image crawl may use HEAD requests because images might be much, much larger in content than web pages…In terms of crawling the web and text content and HTML, we’ll typically just use a GET and not run a HEAD query first” What is Google saying about this? More: https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
  • 24. ! @jhmjacob » “There is also not a hard limit on our crawl.” » Pages with higher Pagerank will get crawled more often » Free crawling resources will be spent on low-PR pages,
 but the chances that the bot leaves the page are higher
 (how are they chosen?!) » You compete against all other pages. Give the bots
 reasons to stay. » The limitation is not based on the “Amount of URLs”, but rather on
 “Machine-Hours” (time-based limits)
 (Loadtime matters!) » Bad page-quality + bad content metrics can scare away bots
 (Exit-Condition like “Amount of Unique Content / Time”) » Google tries to avoid wasting bandwidth
 (HEAD Requests for images + if-modified-since) What is Google saying about this?
  • 27. ! @jhmjacob ility! Crawlability + Indexability + Rankability = Searchability (aka Findability)
  • 28. ! @jhmjacob ility! Crawlability + Indexability + Rankability = Searchability (aka Findability) Crawlability Is your Webpage (URL) accessible for crawlers?
  • 29. ! @jhmjacob ility! Crawlability + Indexability + Rankability = Searchability (aka Findability) Indexability Should the crawled, extracted and interpreted content be added to a search index?
  • 30. ! @jhmjacob ility! Crawlability + Indexability + Rankability = Searchability (aka Findability) Rankability Should a particular page
 be displayed in the 
 search results for a
 particular keyword 
 (search phrase).
  • 31. ! @jhmjacob Crawlability + Indexability + Rankability have a direct or indirect influence on the Crawl Budget
  • 32. ! @jhmjacob Technical SEO Buzzword Bingo “Crawlability" “Indexability" “Rankability” robots.txt robots Directive
 (Response Header / Meta Tag) rel=prev
 (Response Header / Meta Tag) Status Code 
 (Response Header) Canonical 
 (Response Header / Meta Tag) hreflang Directives
 (Response Header / Meta Tag / Sitemap) Load time
 (DNS+Server) Redirects 
 (Response Header / Meta Tag) Device Directives
 (Response Header / Meta Tag) Fragment aka Ajax Crawling
 (Meta Tag) Unique Content
 (Content) Content Quality
 (Content) URL-Structure (URL) Encoding
 (Content) Rendertime
 (Server+Content) Vary
 (Response Header) File Size
 (Content) Location Directives
 (Content) if-modified-since Support
 (Response Header) Rendering
 (CSS+JS)
  • 33. ! @jhmjacob Analyzed by OnPage.org “Crawlability" “Indexability” “Rankability” robots.txt robots Directive
 (Response Header / Meta Tag) rel=prev
 (Response Header / Meta Tag) Status Code 
 (Response Header) Canonical 
 (Response Header / Meta Tag) hreflang Directives
 (Response Header / Meta Tag / Sitemap) Load time
 (DNS+Server) Redirects 
 (Response Header) Device Directives
 (Response Header / Meta Tag) Fragment aka Ajax Crawling
 (Meta Tag) Unique Content
 (Content) Content Quality
 (Content) URL-Structure (URL) Encoding
 (Content) Rendertime
 (Server+Content) Vary
 (Response Header) File Size
 (Content) Location Directives
 (Content) if-modified-since Support
 (Response Header) Rendering
 (CSS+JS) We offer the most comprehensive analysis on website quality assurance!
  • 34. ! @jhmjacob robots.txt This is obvious! » Learn how to setup your robots.txt file » Block irrelevant URLs, so the bots don’t waste 
 their time on those pages » Basics: https://en.onpage.org/wiki/Robots.txt Always remember: If a page is blocked via robots.txt,
 the bots can’t see additional settings like 
 Canonicals or “noindex” directives.
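A quick way to check which URLs a robots.txt actually blocks is Python's standard `urllib.robotparser`. The sketch below is only an illustration: the rules in `ROBOTS_TXT`, the example.com URLs and the helper name `is_crawlable` are hypothetical, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /cart/
"""


def is_crawlable(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Internal search results are blocked, product pages stay crawlable.
print(is_crawlable(ROBOTS_TXT, "Googlebot", "https://example.com/search?q=shoes"))
print(is_crawlable(ROBOTS_TXT, "Googlebot", "https://example.com/products/shoes"))
```

Keep in mind the reminder from the slide: for a blocked URL the bot never sees the page, so a canonical or "noindex" on it has no effect.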
  • 35. ! @jhmjacob Even though a page might look fine - under the hood it can still be
 broken as hell. Status Code
  • 36. ! @jhmjacob Status Code
 200 Valid Page
 301 Permanent redirect (after Redesigns)
 302 Temporary Redirect
 303 See Other (alternative version)
 304 Page did not change since last visit
 403 Access forbidden
 404 Page does not exist
  • 37. ! @jhmjacob Loadtime
 Page A: Nice - only 82 milliseconds until Googlebot got the sourcecode of the page
 Page B: Not so nice - on average 1.76 seconds until the sourcecode has been transferred
  • 38. ! @jhmjacob Loadtime (bar chart, pages per second): Page A 12.2 Pages / Second, Page B 0.59 Pages / Second
  • 39. ! @jhmjacob Loadtime
 Interval | Page A | Page B
 Per Second | 12.2 Pages | 0.59 Pages
 Per Minute | 731.71 Pages | 35.29 Pages
 Per Hour | 43,902.44 Pages | 2,117.65 Pages
 Per Day | 1,053,658.54 Pages | 50,823.53 Pages (ouch!)
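The numbers in this table come from a simple division, assuming the crawler fetches one page after another from the host. A minimal sketch in Python (the function name `pages_per_day` is chosen here for illustration). Note that the Page B column corresponds to a load time of about 1.7 seconds, slightly less than the 1.76 seconds quoted on the earlier slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def pages_per_day(seconds_per_page: float) -> float:
    """Pages a single sequential crawler could fetch in 24 hours."""
    return SECONDS_PER_DAY / seconds_per_page


page_a = pages_per_day(0.082)  # 82 milliseconds per page
page_b = pages_per_day(1.7)    # roughly 1.7 seconds per page

print(f"Page A: {page_a:,.2f} pages per day")
print(f"Page B: {page_b:,.2f} pages per day")
print(f"Page A gets about {page_a / page_b:.0f}x more pages crawled in the same time")
```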
  • 40. ! @jhmjacob Fragment aka Ajax Crawling More: https://angularjs.org/ excursion
  • 41. ! @jhmjacob Why angularjs? Tries to achieve a better User Experience, 
 by transferring only small segments
 instead of complete pages. Provides testing functionalities. Fragment aka Ajax Crawling excursion
  • 42. ! @jhmjacob Fragment aka Ajax Crawling easy way to identify
 an AngularJS site (“ng-app”)
  • 43. ! @jhmjacob Fragment aka Ajax Crawling
  • 45. ! @jhmjacob Fragment aka Ajax Crawling
 <!DOCTYPE html>
 <!--<html lang="en" data-ng-app="MainApp">-->
 <html lang="en" id="ng-app" data-ng-app="MainApp">
 <head>
 <meta charset="utf-8">
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
 <meta name="viewport" content="width=device-width, initial-scale=1, minimum-scale=1, …">
 <meta name="keywords" content="mercedes films, mercedes clip, …">
 <link href="assets/images/favicon.ico" type="image/x-icon" rel="shortcut icon">
 <title>Mercedes-Benz Video Channel</title>
 <meta name="keywords" content="{{keywords}}"/>
 </head>
 …
 angularjs placeholder: {{keywords}}
  • 46. ! @jhmjacob Fragment aka Ajax Crawling
  • 47. ! @jhmjacob Fragment aka Ajax Crawling
  • 48. ! @jhmjacob Fragment aka Ajax Crawling
  • 51. ! @jhmjacob Fragment aka Ajax Crawling
  • 52. ! @jhmjacob Why? 1) There are also other JS testing frameworks
 Jasmine / PhantomJS 2) WallabyJS 
 Nice Plugin for realtime JS Unit Tests 3) IMO: AngularJS is rather suited 
 for web-apps
 Not so well for content based sites which 
 rely on their fit in the web eco-system
  • 53. ! @jhmjacob Ajax Crawling Scheme 1) Within <head> Tag
 <meta name="fragment" content="!"/> 2) Hashbang URLs (“#!”)
 https://www.seokomm.at/#!agenda + Snapshot URL with “real” HTML
  • 54. ! @jhmjacob Ajax Crawling Scheme What happens here? GET http://video.mercedes-benz.co.uk/#!/ Complete Sourcecode (9kb) 1
  • 55. ! @jhmjacob Ajax Crawling Scheme GET http://video.mercedes-benz.co.uk/?_escaped_fragment_=/ Complete Sourcecode (9kb) without AngularJS placeholders 2 Two requests were required to gather the valid HTML code! What happens here?
  • 56. ! @jhmjacob Ajax Crawling Scheme: Support of Ajax Crawling
 Crawler | “Ajax Crawling Scheme” | Native Ajax Crawling
 Google | Yes, but “deprecated” | Yes
 Bing | Yes | Nope
 OnPage.org | Yes | Nope
 Facebook | Nope | Nope
 Twitter | Nope | Nope
 Pinterest | Nope | Nope
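The second request in the Ajax Crawling Scheme is derived mechanically from the hashbang URL: everything after "#!" moves into an `_escaped_fragment_` query parameter. The following Python sketch shows that mapping for the two URLs from the slides. The helper name `snapshot_url` and the exact set of characters left unescaped are assumptions made for illustration, not a complete implementation of the scheme.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def snapshot_url(url: str) -> str:
    """Map a hashbang URL to the snapshot URL a crawler would request."""
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        return url  # not a hashbang URL, nothing to rewrite
    # Escape characters such as '&', '#', '%', '+' and spaces in the fragment.
    fragment = quote(parts.fragment[1:], safe="/=:,;@!$'()*")
    extra = "_escaped_fragment_=" + fragment
    query = parts.query + "&" + extra if parts.query else extra
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


print(snapshot_url("http://video.mercedes-benz.co.uk/#!/"))
print(snapshot_url("https://www.seokomm.at/#!agenda"))
```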
  • 57. ! @jhmjacob URL Structure 1) Speaking URLs (aka Hackable URLs)
 https://www.ccc.de/events/2015/congress 2) Sort GET Parameter (predefined order)
 https://de.onpage.org/?currency=de&lang=de 3) Relevant Content on top tier (subfolder),
 should correlate with Pagerank flow 4) Session IDs in URLs are a No-Go!
 If no other way: Remove them via GSC
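Points 2 and 4 can be enforced in code wherever internal links are generated. This is a rough Python sketch: it sorts GET parameters alphabetically and drops session parameters. The names in `SESSION_PARAMS` are hypothetical examples, so adjust them to whatever your platform actually uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical session parameter names, for illustration only.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}


def normalize_url(url: str) -> str:
    """Sort GET parameters into a fixed order and strip session IDs."""
    parts = urlsplit(url)
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    query = urlencode(sorted(params))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


# Both spellings collapse to one URL, so bots only see a single version.
print(normalize_url("https://de.onpage.org/?lang=de&currency=de&sid=abc123"))
print(normalize_url("https://de.onpage.org/?currency=de&lang=de"))
```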
  • 58. ! @jhmjacob Vary Response Header 1) Does the page provide compression? (Must!!!) 
 Vary: Accept-Encoding 2) Do Cookies (notably) change the content?
 Vary: Cookie 3) Is the page multi-lingual? (Within same URL!)
 Vary: Accept-Language
  • 60. ! @jhmjacob 11/01/2015:
 GoogleBot calls en.onpage.org Server response: Complete Sourcecode (10.3kb)
 + Response Header “Last-Modified” “if-modified-since” Workflow
  • 61. ! @jhmjacob 11/05/2015:
 GoogleBot calls en.onpage.org again 
 and includes an additional
 Request Header “If-Modified-Since” (carrying the “Last-Modified” value from the first response) Server response: Empty body (0kb)
 + Status Code “304 Not Modified” “if-modified-since” Workflow
  • 62. ! @jhmjacob » Dramatically reduces downloaded file size 
 for unchanged content » Enables bots + users to download more relevant
 content within the same timespan » Requires good Infrastructure / CMS 
 like Page-Caching - more on that later! “if-modified-since” Workflow
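The server-side decision behind this workflow is small. The sketch below is a simplified Python illustration, not the OnPage.org implementation: `conditional_response` is a hypothetical helper that receives the page's last modification time (timezone-aware) and the value of the incoming If-Modified-Since request header, and returns status code, headers and body.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional, Tuple


def conditional_response(
    last_modified: datetime, if_modified_since: Optional[str], body: bytes
) -> Tuple[int, dict, bytes]:
    """Choose between a full "200 OK" and an empty "304 Not Modified"."""
    # HTTP dates have one-second resolution, so drop microseconds first.
    last_modified = last_modified.astimezone(timezone.utc).replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: fall back to a full response
        if since is not None and since.tzinfo is not None and last_modified <= since:
            return 304, headers, b""
    return 200, headers, body


changed_at = datetime(2015, 11, 1, 10, 0, tzinfo=timezone.utc)
first = conditional_response(changed_at, None, b"<html>...</html>")
second = conditional_response(changed_at, first[1]["Last-Modified"], b"<html>...</html>")
print(first[0], len(first[2]))    # first visit: full body
print(second[0], len(second[2]))  # repeat visit: empty body
```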
  • 63. ! @jhmjacob robots Directive 1) Within <head> Tag
 <meta name="robots" content="noindex,follow"/> 2) Via Response Header
 X-Robots-Tag: noindex,follow Remember: A lot of “noindex” pages have a negative effect
 on the crawl budget … because resources are wasted
 to find out that the URL has no real content.
  • 64. ! @jhmjacob robots Directive: “unavailable-after” More: https://googleblog.blogspot.de/2007/07/robots-exclusion-protocol-now-with-even.html 1) Within <head> Tag
 <meta name="robots" content="unavailable_after: 20-Nov-2015 15:35:00 CET"> 2) Via Response Header
 X-Robots-Tag: unavailable_after: 20 Nov 2015 15:35:00 CET
  • 65. ! @jhmjacob Canonical 1) Within <head> Tag
 <link rel="canonical" href="https://de.onpage.org/"/> 2) Via Response Header
 Link: <https://de.onpage.org/>; rel="canonical" The Response Header Version can also be used for
 PDF files and images (yummy). Remember: A lot of “canonicalized” pages (canonical pointing to 
 another URL) have a negative effect on the crawl budget … 
 because resources are wasted to find out that the URL 
 has no real content.
  • 66. ! @jhmjacob Redirects 1) Via Response Header
 Status Code: 301
 Location: https://de.onpage.org/ 2) Within <head> Tag
 <meta http-equiv="refresh" content="5; url=http://example.com/"> Redirect-Chains should be avoided. 
 Best practice is to avoid internal redirects altogether.
 Rather update old links and point them to the new URL. Search Engines do not like redirects via Meta Tags or Javascript. 
 These should only be used with caution to navigate users.
 The semantically correct way is the response header (“301 vs 302”)
  • 67. ! @jhmjacob Unique & Relevant Content 1) No thin content 2) No duplicate content 3) No auto-translated pages In terms of indexability
  • 68. ! @jhmjacob Crawler: Behind the Scenes Bloomfilter De-Duplication Index
  • 69. ! @jhmjacob The Challenge: Big Data Scale » Was a given URL already crawled?
 (if so: Does a reload make sense?)
 Solution: Bloomfilter + Key-Value Store » Is the content of a crawled URL 
 valuable enough to be added in the index? 
 Solution: Content-Fingerprinting + Hamming Distance
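To give an idea of the first building block: a Bloom filter answers "was this URL already crawled?" with a fixed amount of memory. It can return false positives, but never false negatives. The class below is a toy Python sketch with arbitrarily chosen sizes, not what any search engine runs in production.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter for 'have we seen this URL before?' checks."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive several bit positions per item by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


seen = BloomFilter()
seen.add("https://example.com/")
print("https://example.com/" in seen)
print("https://example.com/other" in seen)
```

A "maybe seen" answer would then be confirmed against the key-value store, which also holds the details needed to decide whether a reload makes sense.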
  • 70. ! @jhmjacob “Most algorithms for near-duplicate detection run in batch-mode over the entire collection of documents. For web crawling, an online algorithm is necessary because the decision to ignore the hyper-links in a recently-crawled page has to be made quickly” More: http://www2007.cpsc.ucalgary.ca/papers/paper215.pdf Crawler: Behind the Scenes
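The paper linked above builds on simhash fingerprints: near-duplicate documents get fingerprints that differ in only a few bits, so comparing two pages boils down to a Hamming distance. The following Python code is a heavily simplified sketch of that idea (whitespace tokens, equal weights, 64 bits). The tokenization and the hash function are assumptions made here for illustration.

```python
import hashlib

FINGERPRINT_BITS = 64


def simhash(text: str) -> int:
    """Compute a 64-bit simhash fingerprint from whitespace-separated tokens."""
    weights = [0] * FINGERPRINT_BITS
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        for i in range(FINGERPRINT_BITS):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


page = "crawl budget is the time google invests in crawling your website"
copy = "crawl budget is the time google invests in crawling your site"
print(hamming_distance(simhash(page), simhash(page)))  # identical text: 0
print(hamming_distance(simhash(page), simhash(copy)))  # usually small for near-duplicates
```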
  • 71. ! @jhmjacob Encoding 1) Via Response Header
 Content-Type: text/html; charset=UTF-8 2) Within <head> Tag
 <meta charset="UTF-8" /> Charset should always be defined.
 Try to work with UTF-8 - saves a lot of headaches in the long run.
  • 72. ! @jhmjacob Encoding This is what an
 encoding f*ckup looks like
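The typical cause is easy to reproduce: the server sends UTF-8 bytes, but the client decodes them with a different charset because none was declared. A minimal Python illustration, using an example string chosen here:

```python
original = "Müller"

# The server sends UTF-8 bytes ...
wire_bytes = original.encode("utf-8")

# ... but the client guesses ISO-8859-1 because no charset was declared.
garbled = wire_bytes.decode("latin-1")
print(garbled)

# With the declared charset the text survives the round trip.
restored = wire_bytes.decode("utf-8")
print(restored)
```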
  • 73. ! @jhmjacob File Size 1) Within “Google Search Appliance”: Max. 20 MB
 But that’s the Enterprise version of Google 2) In the wild the limit is probably way lower
 (somewhere between 500 KB and 1 MB) The bigger the file, the longer it takes to download.
 Rule of thumb: The smaller, the better!
  • 74. ! @jhmjacob Rendering 1) Javascript and CSS files have to be accessible 
 for GoogleBot
 OnPage.org provides good reports on that 2) If Google has issues rendering the page, indexation
 is at risk 3) Also make sure that the rendering does not take too long
 (Pagespeed Test). 4) Does the rendering on mobile devices look fine?
 (Viewport Tag)
  • 75. ! @jhmjacob rel=prev 1) Within <head> Tag
 <link rel="prev" href="http://abc.com/article?page=1" />
 <link rel="next" href="http://abc.com/article?page=3" /> 2) Via Response Header
 Link: <http://abc.com/article?page=1>; rel="prev"
 Link: <http://abc.com/article?page=3>; rel="next" More: http://googlewebmastercentral.blogspot.co.at/2011/09/pagination-with-relnext-and-relprev.html
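In a template, these tags can be generated from the current page number. A small Python sketch, assuming pagination via a `page` query parameter as in the example URLs above (the function name `pagination_links` is made up for illustration):

```python
from html import escape


def pagination_links(base_url: str, page: int, last_page: int) -> list:
    """Build rel=prev / rel=next tags for one page of a paginated series."""

    def tag(rel: str, number: int) -> str:
        href = escape(f"{base_url}?page={number}", quote=True)
        return f'<link rel="{rel}" href="{href}" />'

    tags = []
    if page > 1:
        tags.append(tag("prev", page - 1))  # the first page has no "prev"
    if page < last_page:
        tags.append(tag("next", page + 1))  # the last page has no "next"
    return tags


for line in pagination_links("http://abc.com/article", page=2, last_page=3):
    print(line)
```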
  • 76. ! @jhmjacob rel=prev » Semantic Markup 
 for Paginations
 Groups multiple pages
 into one ranking » Intended for multi-page
 articles (newspapers).
 But Google now also
 shows product-listings
 as use case. More: http://googlewebmastercentral.blogspot.co.at/2011/09/pagination-with-relnext-and-relprev.html
  • 77. ! @jhmjacob rel=prev alternative: “show all page” More: http://googlewebmastercentral.blogspot.co.at/2011/09/view-all-in-search-results.html
  • 79. ! @jhmjacob hreflang Directives More: https://moz.com/blog/using-the-correct-hreflang-tag-a-new-generator-tool
  • 80. ! @jhmjacob hreflang Directives More: https://moz.com/blog/using-the-correct-hreflang-tag-a-new-generator-tool Article XYZ
 (“de” = German) Article XYZ
 (“es” = Spanish) hreflang=“es” hreflang=“de” Article XYZ
 (English) hreflang=“x-default” hreflang=“de” hreflang=“es” hreflang=“x-default”
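As the diagram shows, every language version carries the same complete set of hreflang links, including one to itself and the x-default. Generating that set from a single mapping keeps the versions in sync. A Python sketch with hypothetical example.com URLs:

```python
from html import escape


def hreflang_tags(versions: dict) -> list:
    """Build the hreflang link tags that every language version should carry."""
    return [
        f'<link rel="alternate" hreflang="{escape(lang)}" href="{escape(url)}" />'
        for lang, url in versions.items()
    ]


# Hypothetical URLs for the three versions of "Article XYZ".
ARTICLE_XYZ = {
    "x-default": "https://example.com/article-xyz",
    "de": "https://example.com/de/article-xyz",
    "es": "https://example.com/es/article-xyz",
}

# The same block goes into the <head> of all three versions.
for line in hreflang_tags(ARTICLE_XYZ):
    print(line)
```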
  • 81. ! @jhmjacob Device Directives 1) Viewport Tag
 <meta name="viewport" content="width=device-width, initial-scale=1.0" /> 2) Media Queries
 <link rel="stylesheet" media="only screen and (max-width: 800px)" href="/mobile.min.css" /> 3) Dedicated URL for mobile devices
 <link rel="alternate" media="only screen and (max-width: 640px)" href="http://m.example.com/page-1" >
  • 82. ! @jhmjacob Content Quality 1) The basics 
 Title, Description etc. 2) Zero tolerance for broken pages 3) Avoid internal redirects
 Update links instead 4) Lightweight Sourcecode
 Get rid of unnecessary inline JS + CSS, remove Whitespaces, Line Breaks, Tabs, etc.
  • 83. ! @jhmjacob Location Directives 1) schema.org Markup (“LocalBusiness”)
 Seems to be used by Google for “Local Search” 2) Address / Telephone
 So your website also matches Query-Modifications 3) Dublin Core Markup
 Not really relevant for SEO, but does not hurt (semantic!) More: https://plus.google.com/+JohnMueller/posts/1EwfjTuCzPQ More: http://schema.org/LocalBusiness
  • 86. ! @jhmjacob Static CMS More: https://www.staticgen.com/
  • 89. ! @jhmjacob Wordpress is kind of
 the Internet Explorer
 in the CMS space Static CMS
  • 90. ! @jhmjacob Static File System in the Wild
  • 91. ! @jhmjacob if-modified-since: OnPage.org 1. First download of the page: The system generates the final sourcecode
  • 92. ! @jhmjacob 2. An optimized version of the sourcecode gets saved on disk (“Page-Caching”). 
 The cache filename is generated based on relevant cookie values.
 (in our case: language + currency of visitor) if-modified-since: OnPage.org
  • 93. ! @jhmjacob 3. The same URL (+ same cookie settings) gets called again.
 Search Engines will send the “Last-Modified” value (from the previous response) back in the “If-Modified-Since” Request Header. if-modified-since: OnPage.org
  • 94. ! @jhmjacob 4. The response for the second call is just taken from the cache file
 Means: Ultra fast Time to First Byte, because server doesn’t need to “think” We dropped irrelevant characters (newlines, tabs, spaces) when we saved the cache file. -> We have seen clients who reduced their filesizes by 30% (!) with that simple step
 -> This results in better loadtimes if-modified-since: OnPage.org
  • 95. ! @jhmjacob 5. Part of the returned response was the “Last-Modified” setting. 
 It was calculated based on the cache file timestamp. if-modified-since: OnPage.org
  • 96. ! @jhmjacob » Super fast Time to First Byte
 When the file is cached » Sends optimized sourcecode to reduce
 bandwidth usage
 for both parties: Our servers + Google Crawlers » If the file was loaded before, only send what’s
 really required 
 => “304 Not modified” aka 
 “Everything is cool, you have the latest version in your index” » Bonus: This workflow enables us to set
 the lastmod element in sitemap.xml if-modified-since: OnPage.org
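Step 2 of this workflow, deriving the cache filename from the URL plus the relevant cookie values, could look roughly like the Python sketch below. The cookie names `language` and `currency`, the `cache` directory and the hashing scheme are assumptions for illustration. The slides do not show the actual implementation.

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path("cache")  # hypothetical cache location


def cache_path(url: str, cookies: dict, relevant=("language", "currency")) -> Path:
    """Derive the cache file for a URL from the cookies that change its content."""
    key_parts = [url] + [f"{name}={cookies.get(name, '')}" for name in relevant]
    digest = hashlib.sha256("|".join(key_parts).encode("utf-8")).hexdigest()
    return CACHE_DIR / f"{digest}.html"


german = cache_path("/pricing", {"language": "de", "currency": "EUR", "tracking": "a1"})
german_again = cache_path("/pricing", {"language": "de", "currency": "EUR", "tracking": "b2"})
english = cache_path("/pricing", {"language": "en", "currency": "USD"})

print(german == german_again)  # irrelevant cookies do not split the cache
print(german == english)       # language and currency do
```

The modification time of that file can then serve as the Last-Modified value, as described in step 5.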
  • 97. ! @jhmjacob Other design principles of 
 our homegrown static CMS
  • 98. ! @jhmjacob Static CMS: Design Principles 1) File-Position: Folders in URL are the same 
 as on the filesystem
 Authors are conditioned to build a clean structure + file-hierarchy
  • 99. ! @jhmjacob 2) Separation of Code, Design and Content
 Every member of the team sees his part Static CMS: Design Principles For Designers: affiliate.tpl
  • 100. ! @jhmjacob 2) Separation of Code, Design and Content
 Every member of the team sees his part Static CMS: Design Principles For Texters: affiliate.de.json
  • 101. ! @jhmjacob 2) Separation of Code, Design and Content
 Makes MS Word etc. redundant.
 
 If a new translation needs to be added, the translator gets the
 english version. Renames the file, translates the contents, uploads the file. 
 
 Bam! It’s online.
 
 Text updates, Design changes and new images are versioned by git. Static CMS: Design Principles
  • 102. ! @jhmjacob 3) Multilinguality by nature
 If a new translation is uploaded, the system starts a couple of 
 cool things Static CMS: Design Principles
  • 103. ! @jhmjacob Editor Friendliness + File-Management 3) Multilinguality by nature
 If a user navigates to the wrong language version of a page, he will
 see a friendly reminder that there is a localized version for him
  • 104. ! @jhmjacob Editor Friendliness + File-Management 3) Multilinguality by nature
 Links to the translated versions of the current page are automatically added to the footer
  • 105. ! @jhmjacob Editor Friendliness + File-Management 3) Multilinguality by nature
 And hreflang markup is automatically added to the <head> section of the document
  • 106. ! @jhmjacob Editor Friendliness + File-Management 4) Fast + Secure
 No Database which slows down server responses! Git keeps track of changes and provides rollback functionalities! No other dependencies / services which might cause security holes
  • 107. ! @jhmjacob Editor Friendliness + File-Management 5) Transparent and logical structure
 Images reside where they belong: In the same folder as the article itself - like its template, translations, additional script logic.
 
 Cleaning up made easy: If an article needs to be deleted, just remove the folder -> All files are gone, no more deserted files in 
 “images” folders or localization databases, etc.
  • 108. ! @jhmjacob Outlook What we want to build next
  • 109. ! @jhmjacob Outlook » Multi-Language Images
 The same URL for all localized versions of an image https://en.onpage.org/beispiel/teaser.jpg https://en.onpage.org/beispiel/teaser.jpg
  • 110. ! @jhmjacob Outlook Be careful: This is untested freestyle code - just to give you an idea :) » Multi-Language Images
 htaccess file detects that an image file is requested
  • 111. ! @jhmjacob Outlook » Multi-Language Images
 The browser exposes the preferred languages of the user
  • 112. ! @jhmjacob Outlook » Multi-Language 
 Images
 A script takes the 
 request, checks if a localized
 version exists and returns
 the value (or the default image).
 
 Result is cached in 
 browser cache. Be careful: 
 This is untested freestyle 
 code - just to give 
 you an idea :)
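In the same freestyle spirit, the selection logic could be sketched in Python as follows. Assumptions made here: localized files follow a `teaser.de.jpg` naming pattern (analogous to `affiliate.de.json` above), the set of existing files is passed in instead of touching the filesystem, and the Accept-Language parsing is deliberately simplified.

```python
def preferred_languages(accept_language: str) -> list:
    """Parse an Accept-Language header into language codes, best first."""
    weighted = []
    for position, item in enumerate(accept_language.split(",")):
        lang, _, params = item.strip().partition(";")
        if not lang:
            continue
        quality = 1.0
        if params.strip().startswith("q="):
            try:
                quality = float(params.strip()[2:])
            except ValueError:
                quality = 0.0
        # Keep only the primary language subtag ("de-AT" -> "de").
        weighted.append((-quality, position, lang.strip().lower().split("-")[0]))
    return [lang for _, _, lang in sorted(weighted)]


def localized_image(path: str, accept_language: str, existing: set) -> str:
    """Return the localized variant of an image if one exists, else the default."""
    stem, dot, suffix = path.rpartition(".")
    for lang in preferred_languages(accept_language):
        candidate = f"{stem}.{lang}{dot}{suffix}"
        if candidate in existing:
            return candidate
    return path


files = {"/beispiel/teaser.jpg", "/beispiel/teaser.de.jpg"}
print(localized_image("/beispiel/teaser.jpg", "de-AT,de;q=0.9,en;q=0.8", files))
print(localized_image("/beispiel/teaser.jpg", "fr-FR,fr;q=0.9", files))
```

Since the same URL then returns different content per language, the response would also need the "Vary: Accept-Language" header mentioned on the Vary slide.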
  • 113. ! @jhmjacob Outlook » Last-Modified Logging
 To find out how popular a page is among search engines
  • 114. ! @jhmjacob Outlook » Last-Modified Logging
 To find out how popular a page is among search engines » By setting the last-modified response header, Search engines will include its value in the next request of the page 
 (for if-modified-since checks)
 Knowing this, we can calculate the timespan between this visit and the last one.
  • 115. ! @jhmjacob Outlook » Low timespan
 = URL seems to be relevant for the search engine
 = Good chances to rank » High timespan
 = URL seems to be rather irrelevant for the SE
 = Lower chances to rank
 = Alerting based on the importance of the page
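The timespan calculation itself is short. This Python sketch rests on one assumption that the slides leave open: the Last-Modified value the server sent must encode the time of the previous visit. If it is the cache file timestamp, the result is the age of the cached version instead. The function name `revisit_interval` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def revisit_interval(if_modified_since: Optional[str], now: datetime) -> Optional[timedelta]:
    """Time between the date echoed by the bot and the current request."""
    if not if_modified_since:
        return None  # first visit or unconditional request
    try:
        previous = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return None  # header present but not a valid HTTP date
    if previous.tzinfo is None:
        return None  # no usable timezone information
    return now - previous


now = datetime(2015, 11, 5, 10, 0, tzinfo=timezone.utc)
print(revisit_interval("Sun, 01 Nov 2015 10:00:00 GMT", now))
print(revisit_interval(None, now))
```

Logged per URL, these intervals could feed the alerting idea above: an important page whose interval keeps growing deserves a closer look.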
  • 116. ! @jhmjacob “It’s not that Google will penalize you, it’s the opportunity cost for dirty architecture based on a finite crawl budget.” More: http://www.blindfiveyearold.com/crawl-optimization Last words
  • 117. Thanks! OnPage.org GmbH ! http://twitter.com/onpage_org
 $ http://fb.me/onpage.org
 % https://en.onpage.org
 Jan Hendrik Merlin Jacob Founder + CTO ! https://twitter.com/jhmjacob
 " jhm@onpage.org
 # http://linkedin.com/in/jhmjacob
 http://onpa.ge/V141p