Technical SEO - Generational cruft in SEO - there is never a new site when there’s history - BrightonSEO concise deck

@dawnieando from @MoveItMarketing
Dawn Anderson
@dawnieando

CRUFT

The Great 302s Pass PageRank Debate

GENERATIONAL CRUFT
MULTIPLE GENERATIONS OF A WEBSITE

NOT ‘Crufts’ – THE WORLD’S LARGEST DOG SHOW
ERIC

CONTENT CRUFT
https://moz.com/blog/clean-site-cruft-before-it-causes-ranking-problems-whiteboard-friday

THIS TYPE OF CRUFT IS NOT THE SAME AS CONTENT CRUFT

SOFTWARE CRUFT

‘URL CRUFT’ IS A THING
“characters relevant or meaningful only to the people who created the site, such as implementation details of the computer system which serves the page. Examples of URL cruft include filename extensions such as .php or .html, and internal organizational details such as /public/ or /Users/john/work/drafts/.” (Wikipedia Definition)
ALL THE RANDOM CRAP PEOPLE ADD TO QUERY STRINGS, PARAMETERS, DIRECTORY FOLDERS AND URL STRUCTURES

CODE & URL CRUFT MAKES CRAWLING SLUGGISH

“COOL URIs DON’T CHANGE”
Sir Tim Berners-Lee (Inventor of the World Wide Web)
https://www.w3.org/Provider/Style/URI
Attribution: By Uldis Bojārs (Flickr) [CC BY-SA 2.0 (http://creativecommons.org/licenses/by-sa/2.0)], via Wikimedia Commons

A Clean Slate
LET’S START WITH A CLEAN SLATE

Websites (AND URLs) are not disposable

SEARCH ENGINES NEVER FORGET
Search engines have a long memory and a lot of storage

404 NOT FOUND & 410 GONE
§ “Of course, we won’t redirect everything…”
§ “Not everything will be worth redirecting”

410 Gone
§ “Some, we’ll just kill off with a 410…”
§ “Then the URLs will be gone”

https://twitter.com/JohnMu/status/903904602617204738

302 == Default
301 == Intentional
404 == Default
410 == Intentional
“The 410 response is primarily intended to assist the task of web maintenance by notifying the recipient that the resource is intentionally unavailable and that the server owners desire that remote links to that resource be removed.” (RFC 7231)
https://tools.ietf.org/html/rfc7231#section-6.5.9
ARE YOU SURE?
MAYBE YES

https://www.youtube.com/watch?v=xp5Nf8ANfOw
THE DIFFERENCE BETWEEN HOW GOOGLE TREATS 404 VERSUS 410s

DO NOT THINK 410s WON’T BE RECRAWLED AGAIN
Source: https://www.docsplace.org/4578/09/410-gone-stops-crawling-dead-urls/

“We knew there was content there at some point so we just swing by every now and then to see if anything came back” (John Mueller, 2016)
In Reality… Gone Is Never Gone

ZOMBIES ARE NEVER GONE
NO URLS ARE EVER GONE
ONLY THE RESOURCE THERE IS GONE
https://www.seroundtable.com/google-410-indexing-22584.html
5 YEARS LATER

HOW ABOUT 14 YEARS LATER?
https://www.webmasterworld.com/google/4864613.htm
2 HOURS ALIVE… 14 YEARS LATER

YOU END UP WITH A CONGA LINE OF LEGACY URLS, SUBDOMAINS & VARIOUS SITE PROTOCOLS

“Forever, And ever, And ever, And ever… You’ll be a URL”

GOOGLEBOT GETS WHERE WATER COULDN’T
https://petermeadit.com/blog/block-web-crawlers/

EVEN YOUR STAGING & DEV SITES
Found with a very simple wildcard * site: query

THE CHALLENGE IS NOT IN INDEXING… BUT IN KEEPING EVERYTHING INDEXED UP TO DATE

INCREMENTAL CRAWLING NEVER ENDS
“Crawling method based on crawl frequency based on URL historical change & importance rate”
Crawling Which Never Ends
Ongoing

The Crawling ‘Frontier’ (THE URL QUEUE)
‘TO BE EXPLORED’ (OR REVISITED)

URLs Take Their Place in The Frontier Queue (New & Revisit)
The Queue Gets Long & Congested
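To make the frontier idea concrete, here is a minimal sketch of a URL queue in which new and revisit URLs compete for the same crawl slots. It is an illustration only, not how any search engine implements its frontier: the `Frontier` class, the example URLs and the importance values are all made up for the example.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class FrontierEntry:
    # heapq pops the smallest item first, so importance is stored negated:
    # a higher importance score gives a smaller sort key.
    sort_key: float
    url: str = field(compare=False)


class Frontier:
    """Toy crawl frontier: new and revisit URLs share one priority queue."""

    def __init__(self) -> None:
        self._heap: list[FrontierEntry] = []
        self._waiting: set[str] = set()

    def add(self, url: str, importance: float) -> bool:
        """Queue a URL unless it is already waiting; return True if queued."""
        if url in self._waiting:
            return False
        self._waiting.add(url)
        heapq.heappush(self._heap, FrontierEntry(-importance, url))
        return True

    def next_url(self) -> str:
        """Pop the most important URL still waiting in the queue."""
        entry = heapq.heappop(self._heap)
        self._waiting.discard(entry.url)
        return entry.url

    def __len__(self) -> int:
        return len(self._heap)


# Hypothetical URLs and importance scores, invented for the example.
frontier = Frontier()
frontier.add("https://example.com/", 0.9)
frontier.add("https://example.com/category?sessionid=123", 0.1)  # legacy cruft
frontier.add("https://example.com/new-product", 0.5)

crawl_order = [frontier.next_url() for _ in range(len(frontier))]
```

Every crufty URL that enters the queue still has to be popped and dealt with at some point, which is one way to picture the queue getting long and congested.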

EVEN THE RANDOM CRAP

PAST DATA ON CHANGE IS A GREAT PREDICTOR OF FUTURE DATA
PREDICTION BASED PRIORITY SCHEDULING … WHEN THERE IS CONSISTENCY
“past changes to a page are a good predictor of future changes. This result has practical implications for incremental web crawlers that seek to maximize the freshness of a web page collection or index.” (

BASED ON ROLLING AVERAGES OF PAST CRAWL VISITS

IMPORTANCE TIERING FOR SCALE (EFFICIENCY)

A NEW URL HAS NO
BUT YOUR OLD ONES HAVE LOTS

Stored in Search Engine History Logs

TO BUILD PROBABILITY & PREDICTABILITY MODELS

History Log Records Include:
• URL fingerprint
• Timestamp (last crawl or download attempt)
• Crawl status (success or error) (Response code)
• Content checksum (binary code)
• Source ID (accessed from cache or downloaded)
• Segment identifier (Crawl segment assigned to??)
• Page importance (a measure of importance assigned to the URL)
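The fields listed above can be pictured as one record per crawl attempt. The sketch below is an assumption-laden illustration: the field names follow the list, but the types, the `short_hash` helper (a truncated SHA-256) and the `record_crawl` and `content_changed` functions are invented for the example and are not taken from any search engine.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


def short_hash(text: str) -> str:
    """Stand-in fingerprint/checksum: first 16 hex characters of SHA-256."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


@dataclass(frozen=True)
class HistoryLogRecord:
    url_fingerprint: str     # hash of the URL
    timestamp: datetime      # last crawl or download attempt
    crawl_status: int        # HTTP response code
    content_checksum: str    # hash of the fetched content
    source_id: str           # "cache" or "download"
    segment_id: int          # crawl segment the URL was assigned to
    page_importance: float   # importance score at crawl time


def record_crawl(url, body, status, source_id, segment_id, importance, when):
    """Build one history log record for a single crawl attempt."""
    return HistoryLogRecord(
        short_hash(url), when, status, short_hash(body),
        source_id, segment_id, importance,
    )


def content_changed(previous: HistoryLogRecord, current: HistoryLogRecord) -> bool:
    """Compare checksums to see whether the content differs between crawls."""
    return previous.content_checksum != current.content_checksum


# Hypothetical crawl history for one URL.
url = "https://example.com/shoes"
first = record_crawl(url, "<h1>Shoes</h1>", 200, "download", 1, 0.6,
                     datetime(2017, 9, 1, tzinfo=timezone.utc))
second = record_crawl(url, "<h1>Shoes</h1>", 200, "cache", 1, 0.6,
                      datetime(2017, 9, 8, tzinfo=timezone.utc))
third = record_crawl(url, "<h1>Shoes and boots</h1>", 200, "download", 1, 0.7,
                     datetime(2017, 9, 15, tzinfo=timezone.utc))
```

Comparing consecutive records for the same fingerprint is what lets a scheduler build up a picture of how often a URL really changes.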

“The URL page importance score can be retrieved from the … URL history log … or it can be obtained by obtaining the historical page importance score for the URL for a predefined number of prior crawls and then performing a predefined filtering function on those values to obtain the URL page importance score.”
Scheduler for Search Engine Crawler
https://www.google.com/patents/US8042112

IMPORTANCE RECORD PER CRAWL
DOC ID     CRAWL 1   CRAWL 2   CRAWL 3   CRAWL 4   CRAWL 5   CRAWL 6
DOC ID 1   1         0.8       0.6       0.4       0.2       0
DOC ID 2   0         0.2       0.4       0.6       0.8       1
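The quote speaks of a “predefined filtering function” without saying which one. As a hedged sketch, the code below applies two possible filters to the six importance records in the table: a plain mean and a recency-weighted mean. Both filters and the linear weights are assumptions made for illustration, not the patent's method.

```python
def mean_filter(scores):
    """Plain average of the historical importance scores."""
    return sum(scores) / len(scores)


def recency_weighted_filter(scores):
    """Weighted average where later crawls count more (weights 1, 2, 3, ...)."""
    weights = range(1, len(scores) + 1)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Importance records for crawls 1 to 6, taken from the table above.
history = {
    "DOC ID 1": [1, 0.8, 0.6, 0.4, 0.2, 0],   # importance falling over time
    "DOC ID 2": [0, 0.2, 0.4, 0.6, 0.8, 1],   # importance rising over time
}

flat = {doc: mean_filter(scores) for doc, scores in history.items()}
weighted = {doc: recency_weighted_filter(scores) for doc, scores in history.items()}
```

With a plain mean both documents come out at about 0.5, so the trend is invisible. With the recency weights DOC ID 1 drops to roughly a third and DOC ID 2 rises to roughly two thirds. Either way, the old scores stay in the calculation until enough new crawls have pushed them out.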

URL_SEEN TEST
YOU CAN’T JUST KEEP TRYING TO JUMP THE INDEXING QUEUE EITHER
PUSH INDEXING: E.G. FETCH AS GOOGLEBOT & SUBMIT TO INDEX
PULL INDEXING: VISITS BY NATURAL CRAWLING & DISCOVERY OF URLS / URL VISIT SCHEDULING / REVISITS

‘Sampling’ in Crawling for Efficiency
‘SMALL TEST VISITS TO A SITE TO UNDERSTAND WHETHER IT IS WORTH CRAWLING & UNDERSTAND URL PATTERNS & RESOURCES THERE’

Popular CMS ‘Rule Patterns’ (URL Parameters)
ALL WILL HAVE COMMON CANONICALIZATION PATTERNS WHICH CAN BE LEARNED

DUSTBUSTER & DUST CRAWLING RULES
DO NOT CRAWL IN THE DUST
BUILDS ‘HINTS’ ON WHAT NOT TO CRAWL
EVERY SITE WILL HAVE ITS OWN CRAWLING RULES

Aged ‘Patchwork Quilt’ Sites
A LITTLE BIT OF THIS CMS AND A LITTLE BIT OF THAT CMS
MANY HISTORICAL PARAMETERS CREATED & CRAWLING SAMPLE PATTERNS

Every Version of Your Past Ecommerce Sites
“Exponentially multiplicative URLs”
Had potential to spew… at some point…
DIFFERENT PARAMETERS & URL PATTERNS WHICH ARE LEARNED BY CRAWLERS… AND REMEMBERED… FOREVER

‘Transitive’??
Transitive - if A == B and B == C, then A == C
For some types of content more than others – e.g. ecommerce/directories but not news
SAMPLING

EFFICIENCY IS NOT JUST ABOUT URL SCHEDULING. IT IS ABOUT NEAR MEMORY STORAGE (e.g. CACHING) TOO

REUSING
PRE-EMPTING (PARTICULARLY POPULAR DOCUMENTS / QUERIES) & REUSING WHAT WAS ALREADY IN NEARBY (MEMORY V DISC) STORAGE

REUSE: LOW IMPORTANCE and / or DOESN’T CHANGE OFTEN
REUSE IF NOT MODIFIED SINCE: LIKELY TO CHANGE BY X DATE (SINCE DATE)
DOWNLOAD: CHANGES FREQUENTLY WITH IMPORTANT CHANGE OR IS AN IMPORTANT DOCUMENT

CRAWL SAMPLES ALSO HELP WITH MODELLING TO MAP DOCS TO TOPIC RELEVANCE

YOU BROKE YOUR SILO STRUCTURE
Image credit: https://www.slideshare.net/patrickstox/nlp-sitemap-smx-2016-patrick-stox-latest-in-advanced-technical-seo
SEMANTIC LOSS

‘CONCEPT DRIFT’ IS A THING
fuzzy: difficult to perceive; indistinct or vague.
synonyms: blurry, blurred, indistinct, unclear, bleary, misty, distorted, out of focus, unfocused, lacking definition, low resolution, nebulous, ill-defined, indefinite, vague, hazy, imprecise, inexact, loose, woolly
"a fuzzy picture"
https://en.wikipedia.org/wiki/Concept_drift
AI ALERT

BOOLEAN LOGIC – EXTREME CASES OF TRUTH (TRUE (1) OR FALSE (0))

‘FUZZY LOGIC’ – DEGREES OF TRUTH
SEMANTIC LOSS

BIG TOPICAL URL FISH IN A SMALL TOPICAL POND

SMALL TOPICAL URL FISH IN A BIG TOPICAL POND
SEMANTIC LOSS

‘Fuzzy’ URL Targets with Each Site Generation
EVERYTHING GETS A BIT BLURRED
‘Which is the target URL again?’

GENERATIONAL CRUFT CAN SNOWBALL
• Past infinite loops
• Dodgy URL parameters
• Misconfigured URL parameters
• Old URL crawling ‘rules / hints’
• Old ‘importance / quality’ scores
• Filtered dupes & near-dupes
• Mixed messaging canonicals
• 410s still being revisited
• Internal links to old sites / protocols

SOME SYMPTOMS
WRONG URL RANKING
‘SWAPPING OUT’ (Especially multiple child nodes)
SHARP & VOLATILE RANKING FLUX

SOME SYMPTOMS
A LOT OF WRONG TARGETS RANKING POST MIGRATION

MIXED CONTENT & MULTIPLE SITE VERSIONS
http://www.itv.com/news/
BOTH HTTP & HTTPS FIGHTING EACH OTHER

PEOPLE CHURN
INTERNAL TEAM CHURN
EXTERNAL AGENCY CHURN

FIND SITES ON THE SAME SERVER

DIAGNOSE: Validate & Retain in GSC ALL Past Domains & Past Site Versions (Protocols: HTTPS / HTTP)
THERE MAY STILL BE UNDETECTED ACTIVITY GOING ON THERE

URL Parameter Handling is Your Friend
Help Google Build ‘Crawling Rules’ for your site rather than wasting time on ‘sampling’ and giving a bad impression
GIVE HELP AND GUIDANCE WITH THE CRAWL RULE AND HINT BUILDING

BE VERY CAREFUL

PEOPLE CANONICALIZE WRONG ON MULTIPLE GENERATIONS OF SITES

47% of TECHNICAL SEOs thought: “REL=NEXT / REL=PREV” IS A FORM OF CANONICALIZATION

Lots OF SEOs were unaware that: “301s and 302s are BOTH forms of canonicalization”

Only 64% of ‘Technical SEOs’ realised Hreflang is a form of Canonicalization (Internationalization)

REVIEW & UNDERSTAND - THE CANONICAL LINK RELATION In ‘ALL’ its forms
§ 30X redirects
§ Canonical tag
§ Hreflang
§ HTTPS protocol
§ Global canonicalization rules
§ URL normalization

PEOPLE APPEND (ADD TO FILES) - SOMETIMES IT’S FEAR OF DEPENDENCIES

DIAGNOSE: HEAD BACK TO THE SERVER
YOU NEED TO KNOW WHAT’S ON THAT SERVER

DIAGNOSE: SERVER LOG FILE ANALYSIS
BUT WATCH OUT FOR OTHER TOOLS EMULATING GOOGLEBOT AND FILTER THEM OUT
ANALYSE THE LOGS FOR ‘ALL’ YOUR SITES AND ‘ALL’ PROTOCOLS TO SEE THE PATTERNS EMERGE
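As a rough sketch of the log analysis step, the code below counts Googlebot hits per path and status code and drops requests that only claim to be Googlebot. It assumes logs in the common Apache/Nginx “combined” format. The genuineness check follows the reverse-then-forward DNS approach Google documents for verifying Googlebot. The function names and sample log lines are invented, and the demo swaps in a fake verifier so that it runs without network access.

```python
import re
import socket
from collections import Counter

# Combined log format: ip, identity, user, [time], "request", status, bytes, "referer", "agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def reverse_dns(ip: str) -> str:
    return socket.gethostbyaddr(ip)[0]


def forward_dns(host: str) -> str:
    return socket.gethostbyname(host)


def is_verified_googlebot(ip, reverse=reverse_dns, forward=forward_dns) -> bool:
    """Reverse DNS must end in a Google domain and resolve forward to the same IP."""
    try:
        host = reverse(ip)
        return host.endswith((".googlebot.com", ".google.com")) and forward(host) == ip
    except OSError:  # lookup failures are treated as "not verified"
        return False


def googlebot_hits(lines, verify=is_verified_googlebot) -> Counter:
    """Count (path, status) pairs for requests from verified Googlebot only."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None or "Googlebot" not in match["agent"]:
            continue
        if not verify(match["ip"]):
            continue  # a tool emulating Googlebot: filter it out
        hits[(match["path"], int(match["status"]))] += 1
    return hits


AGENT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample_log = [
    f'66.249.66.1 - - [10/Sep/2017:10:00:00 +0000] "GET /old-page.html HTTP/1.1" 410 0 "-" "{AGENT}"',
    f'203.0.113.9 - - [10/Sep/2017:10:00:05 +0000] "GET /old-page.html HTTP/1.1" 410 0 "-" "{AGENT}"',
]

# Demo-only stand-in for the DNS check so the example runs offline.
demo_hits = googlebot_hits(sample_log, verify=lambda ip: ip.startswith("66.249."))
```

Run the same count over the logs for every legacy domain and both protocols and the long-dead URLs that are still being requested start to show up.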

When analysing logs you’re often viewing URLs from ‘A LONNNNGGGG Time Ago’
LOOKING AT LEGACY

REVISIT ALL PAST .HTACCESS FILES
Can you rewrite the rules to be more efficient with regex or cut out some old rules still firing unnecessarily? (CREATE SHORTCUTS)
REMEMBER .HTACCESS RULES RUN IN ORDER OF THEIR APPEARANCE IN THE FILE. CAN YOU USE WILDCARDS TO OPTIMIZE OR SKIP STEPS?
.HTACCESS SITE 1
.HTACCESS SITE 2
.HTACCESS SITE 3

CHOP BACK REDIRECT CHAINS
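One way to chop back chains is to work out, for every legacy URL, where it finally ends up and then redirect it there in a single hop. The sketch below assumes the redirect rules from all past sites have already been exported into a simple source-to-target mapping. The mapping, the URLs and the function names are hypothetical.

```python
def resolve_final(url, redirects, max_hops=10):
    """Follow a redirect map to its end; return (final_url, hops_taken)."""
    seen = [url]
    while url in redirects:
        url = redirects[url]
        if url in seen:
            raise ValueError("redirect loop: " + " -> ".join(seen + [url]))
        seen.append(url)
        if len(seen) - 1 > max_hops:
            raise ValueError(f"more than {max_hops} hops from {seen[0]}")
    return url, len(seen) - 1


def flatten(redirects):
    """Rewrite every rule so each source points straight at its final target."""
    return {source: resolve_final(source, redirects)[0] for source in redirects}


# Hypothetical chain built up over three generations of a site.
legacy_redirects = {
    "http://example.com/a": "http://www.example.com/a",
    "http://www.example.com/a": "https://www.example.com/a",
    "https://www.example.com/a": "https://www.example.com/a/",
}

final_url, hops = resolve_final("http://example.com/a", legacy_redirects)
flattened = flatten(legacy_redirects)
```

In this example the oldest URL takes three hops to reach its destination. After flattening, every source maps directly to `https://www.example.com/a/`.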

Help Googlebot Get Round its Shopping List
OPEN MORE CHECKOUTS
WIDEN THE AISLES
MAKE THINGS EASY TO FIND
DON’T CONFUSE GOOGLEBOT
HELP FILL THE TROLLEY QUICKLY
SPEED, SPEED, SPEED

XML Sitemaps Are Your Friend… (Strong Foundations)
They help to pass ‘importance’ signals to URLs
But… never leave them to just autogenerate without periodically checking
‘The foundations’ underneath a site

EXTERNALLY HOSTED XML SITEMAPS
• Take back control
• Jump the dev queue
• Allows for custom configuration of optimal canonical click paths
• Allows for consistent signals of importance to included URLs
• Forget about setting priority
• Forget about last modified
• Even a simple list of URLs will do (FTW)
• Keep them organised for granular analysis of problem site sections
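A hand-maintained sitemap along these lines can be very small. The sketch below turns a plain list of canonical URLs into a minimal XML sitemap containing only `<loc>` entries, with no priority or last-modified values, and groups URLs by their first path segment so each site section can get its own file. The URLs and function names are assumptions for the example. The namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from urllib.parse import urlsplit

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls) -> str:
    """Return a minimal XML sitemap: one <url><loc> entry per unique URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in sorted(set(urls)):
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + ET.tostring(urlset, encoding="unicode")


def group_by_section(urls) -> dict:
    """Group URLs by first path segment so each section gets its own sitemap."""
    sections = defaultdict(list)
    for url in urls:
        segments = [part for part in urlsplit(url).path.split("/") if part]
        sections[segments[0] if segments else "root"].append(url)
    return dict(sections)


# Hypothetical list of target URLs.
target_urls = [
    "https://www.example.com/",
    "https://www.example.com/shoes/red-trainers",
    "https://www.example.com/shoes/blue-boots",
    "https://www.example.com/socks/wool-hiking",
]

sections = group_by_section(target_urls)
sitemaps = {name: build_sitemap(urls) for name, urls in sections.items()}
```

Keeping one sitemap per section makes it easier to see which part of a site has the indexing problem.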

INSTEAD OF REMOVE… CONSIDER… DISTRACT & ITERATIVELY IMPROVE
STRATEGIC USE OF INTERNAL LINK POPULARITY
REDUCE IMPORTANCE SIGNALS TO DIFFERENT PAGES
INCLUDE IMPORTANT PAGES IN XML SITEMAPS
INCLUDE IMPORTANT PAGES IN HTML SITEMAPS

BUILD WELL CATEGORIZED AND CONCEPTUALLY STRUCTURED SITEMAPS
https://www.slideshare.net/patrickstox/nlp-sitemap-smx-2016-patrick-stox-latest-in-advanced-technical-seo

SOLUTION: Increase ‘Importance’ quickly of target URLs
• Internal link optimization
• Canonicalise to (if relevant)
• Strengthen up importance signals
• Inclusion in front facing HTML and XML sitemaps
• Improve the content & keep it updated
• 301 redirect to (if relevant redundant content)
• Topical hubs and strong information views to navigate users & add relevance

SOLUTION: Reduce ‘Importance’ quickly of old URLs
• Internal link UNOPTIMIZATION
• 410
• Dig out URLs with links to them
• Orphan URLs
• Canonicals to HTTPS
• EXCLUSION from XML sitemaps (even old ones on the server)
• Archiving of content

IT’S VERY IMPORTANT… YOU STAY OUT OF SERVER ERROR STATUS 500
‘Try again’ intervals likely extended between each failed connection attempt

Consistency is
REMEMBER ‘ROLLING AVERAGES’

APPENDIX

410s Likely Get Deindexed Quicker
“Usually seeing it (410) 1-2 times is enough for us to drop those URLs from the index”
John M on Google+ (https://plus.google.com/u/0/+JohnMueller/posts/NEsqE7Sr4Z4)

LEGACY ISSUES VIA CANONICALS OR REDIRECTION (COMMON MISTAKES)
• PAGE CANONICALIZED TO IS NOT A SUPERSET OR DUPLICATIVE (IT IS NOT RELEVANT ENOUGH)
• 301s TO IRRELEVANT PAGES BECOME SOFT 404s
• FOLDING UP PRODUCT PAGES TO CATEGORIES (PEOPLE WERE LOOKING FOR A SPECIFIC PRODUCT)
• CANONICALIZATION TO PAGES WHICH IN THE FUTURE 301 REDIRECT TO ANOTHER URL, THEREFORE NEGATING THE PAGES CANONICALIZING TO THEM
• CONFLICTS BETWEEN HREFLANG AND CANONICALIZATION

MORE CAUSES
SEARCH ENGINES ARE CRAWLING MORE CODE THAN YOU MIGHT HAVE INTENDED IN THE FIRST PLACE
JAVASCRIPT ERRORS FROM LEGACY CODE & LIBRARIES
LEGACY 302s FROM REDIRECTED LEGACY DOMAINS WHICH CONFUSE INTERMEDIATE SIGNALS BETWEEN 301s (WHICH ARE INTENDED DEFINITE REDIRECTIONS)
ABANDONED URLS
AJAX URLS (NOT THE SAME AS THE NAMED ANCHOR) – DEPRECATION OF AJAX CRAWLING (ASYNCHRONOUS JAVASCRIPT & XML)

“If “change” means “any change”, then about 40% of all web pages change weekly [12]. Even if we consider only pages that change by a third or more, about 7% of all web pages change weekly [17].” (Broder, A.Z., Najork, M. and Wiener, J.L., 2003)
EVEN AS FAR BACK AS 2003
40% of ALL web pages changed weekly
7% of web pages changed a 1/3 of their page content or more weekly

HOW MUCH BIGGER & MORE DYNAMIC IS THE WEB NOW IN 2017?
http://www.internetlivestats.com/total-number-of-websites/

FUZZY LOGIC
• Rule based logic
• Been around for 20+ years
• Is within a subset of AI

THESE THINGS ADD UP
THEY ALSO STILL NEED TO BE DISCOVERED WHICH REQUIRES INITIAL CRAWLING
https://twitter.com/dawnieando/status/906465965029969920

“404 vs 410 doesn't affect the recrawl rate: we'll still occasionally check to see if these pages are still gone, especially when we spot a new link to them”
John Mueller, Google+ 2015
https://plus.google.com/u/0/+JohnMueller/posts/NEsqE7Sr4Z4
ESPECIALLY IF THERE ARE LINKS TO IT

Pass Strong Clues - Highly Relevant New Conceptual Structures
STRONG SEMANTICS & CONCEPTUALLY CO-OCCURRING TERMS

THINK CAREFULLY ABOUT URL CREATION
Not EVERYTHING is worthy of its own URL
VARIANTS
STEMMINGS
PLURALS
RANDOM TAGS
LONG, LONG, LONG TAIL PARAMETERS

ONLY DOWNLOAD IF THERE IS SUBSTANTIVE CHANGE
TAKE SOME CONTROL WITH 304 & EXPIRES AFTER HEADERS ON LESS IMPORTANT PAGES
https://developers.google.com/web/fundamentals/performance/optimizing-content-efficiency/http-caching
VALID REPRESENTATION
THE URL WILL STILL BE VISITED BUT 0 (ZERO) WILL BE DOWNLOADED SO IT IS STRAIGHT ON TO THE NEXT URL VERY QUICKLY
https://webmasters.googleblog.com/2006/09/better-details-about-when-googlebot.html
https://tools.ietf.org/html/rfc7232#section-4.1
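The 304 behaviour described above can be sketched as a small server-side decision: if the crawler sends `If-Modified-Since` and the page has not changed since that date, answer 304 with an empty body. This is a simplified illustration that handles only the `If-Modified-Since` / `Last-Modified` pair. The `respond` function and the example page are hypothetical, and real servers and frameworks do this for you.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def respond(request_headers: dict, last_modified: datetime, body: bytes):
    """Return (status, headers, body), answering 304 when nothing has changed."""
    last_modified = last_modified.replace(microsecond=0)  # HTTP dates have 1s resolution
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    since = request_headers.get("If-Modified-Since")
    if since is not None:
        try:
            since_dt = parsedate_to_datetime(since)
        except (TypeError, ValueError):
            since_dt = None  # unparseable date: ignore the header
        if since_dt is not None and since_dt.tzinfo is not None and last_modified <= since_dt:
            return 304, headers, b""  # visited, but zero bytes downloaded

    return 200, headers, body


# Hypothetical low-importance page that last changed on 5 February 2004.
page_body = b"<html>Archive page</html>"
page_modified = datetime(2004, 2, 5, tzinfo=timezone.utc)

first_status, first_headers, first_body = respond({}, page_modified, page_body)

# The crawler replays the Last-Modified value it stored from the first visit.
revisit_status, _, revisit_body = respond(
    {"If-Modified-Since": first_headers["Last-Modified"]}, page_modified, page_body
)
```

The first visit downloads the page. The revisit is answered with 304 and an empty body, so the crawler can move straight on to the next URL.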

A URI is like a fine wine
Maturing over time
“COOL URIs DON’T CHANGE”
Sir Tim Berners-Lee (Inventor of the World Wide Web)
https://www.w3.org/Provider/Style/URI

A LONG, LONG TIME AGO
• You need to go right back to the beginning
• What domains did the organisation EVER register?
• Where do they redirect to?
• Is it via 301, 302 or are they merely parked domains?
• Who would know? Who is responsible?
• Verify them all in Google Search Console
• Some of these may EVEN HAVE PENALTIES HISTORICALLY
• If there are links to any there is likely still crawling activity there
• Analyse logs across multiple subdomains & protocols

QUESTIONS TO ASK
HOW MANY MICRO-SITES HAVE YOU HAD?
HOW MANY SUBDOMAINS?
HOW MANY OTHER DOMAINS?
WHO IS RESPONSIBLE FOR DOMAIN REGISTRATION?
WHO KNOWS WITHIN THE ORGANISATION?
WHO REGISTERED THE DOMAINS?
WHO CAN UPDATE DNS RECORDS?
ARE THESE SITES STILL ON SERVERS?
HAVE ANY OF THESE SITES HAD MANUAL ACTIONS?
HOW ARE THESE SITES REDIRECTED?
ARE THEY PARKED DOMAINS?

DATA FROM HISTORY LOGS CONTRIBUTE TO WHEN TO REVISIT URIs ON THE WEB

SOLUTION: REVISITING BLOATED APPENDED .HTACCESS FILES ON ALL LEGACY SITES (IF NOT REDIRECTING AT A DNS LEVEL)
NOT JUST THE .HTACCESS FILE ON THE EXISTING SITE EITHER. GOOGLEBOT MAY HIT .HTACCESS ON PAST SITES SO THEY MAY ALSO NEED OPTIMIZING
.HTACCESS RULES RUN IN ORDER SO PROVIDE OPPORTUNITY FOR SHORT CUTS

SOME TYPES OF URL CRUFT
• INCORRECTLY APPLIED CANONICAL TAGS
• CONFLICTING HREFLANG & CANONICAL TAGS
• MIXED CONTENT
• URL SHORTENERS
• SESSION IDS
• UTM TAGGING
• OLD AJAX FRAGMENTS
• PARAMETERS FROM MULTI FACET DROP DOWN CHOICES
• .html, .php, .index.html, .aspx
• LEGACY URL REWRITING & PARAMETERS IN .HTACCESS FILES
• LEGACY FOLDERS WHICH CONTRIBUTE NO MEANING TO SITE ONTOLOGY
UNCRUFTY
www.myeasyurlwillmakeyouwonder.com/resume
CRUFTY
www.myeasyurlwillmakeyouwonder.com/resume.html
CRUFTY
http://nymag.com/scienceofus/2015/07/how-to-recover-from-an-all-nighter.html?om_rid=AAENcg&om_mid=_BTtFa0B869PyJp&utm_content=buffer8fdd1&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
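The last crufty URL above can be reduced to its meaningful part by dropping the tracking parameters. The sketch below does this with Python's standard `urllib.parse`. The list of prefixes treated as tracking (`utm_` and `om_`) is an assumption based on that one example, and the function name is made up.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: parameters starting with these prefixes carry tracking data only.
TRACKING_PREFIXES = ("utm_", "om_")


def strip_tracking(url: str) -> str:
    """Remove tracking parameters and keep every other part of the URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


crufty = (
    "http://nymag.com/scienceofus/2015/07/how-to-recover-from-an-all-nighter.html"
    "?om_rid=AAENcg&om_mid=_BTtFa0B869PyJp&utm_content=buffer8fdd1"
    "&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer"
)
cleaned = strip_tracking(crufty)

# Parameters that change what the page shows are left alone.
mixed = strip_tracking("https://example.com/shoes?colour=red&utm_source=newsletter")
```

The same function is handy when grouping log entries, because every tracked variant collapses back to one URL.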

INDEX TIERING
Presented by B Cambazoglu at European Summer School Information Retrieval 2017 – (Cambazoglu, B.B. and Baeza-Yates, R., 2011. Scalability challenges in web search engines. In Advanced topics in information retrieval (pp. 27-50). Springer Berlin Heidelberg.)

TWO-PHASE RANKING IN A SEARCH NODE
Presented by B Cambazoglu at European Summer School Information Retrieval 2017 – (Cambazoglu, B.B. and Baeza-Yates, R., 2011. Scalability challenges in web search engines. In Advanced topics in information retrieval (pp. 27-50). Springer Berlin Heidelberg.)

FUZZY LOGIC – DEGREES OF TRUTH
0.8
Doc ID likely to be a correct URI to choose from term / query cluster

EVERY SINGLE TIME YOU MIGRATE, CHANGE DESIGN, REDIRECT, REINVENT A SITE / URL
A CLEAN START
REDIRECTIONS
ANOTHER STRUCTURE
FIRST SITE STRUCTURE
NEW CRAWLING ‘RULES’ BUILT
CRAWLING ‘RULES’ BUILT
EVERYTHING IS ‘200 OK’
MORE URLs
MIXED RESPONSE CODES
REDIRECTIONS
‘FUZZINESS’ IS EMERGING
NEW CRAWLING ‘RULES’ BUILT
MORE URLs
REDIRECT CHAINS & MIXED RESPONSE CODES
NEW SEOs DON’T KNOW THE ‘HISTORY’
TARGET URLs NOW ‘VERY FUZZY’

BUT WHEN DATA IS INCONSISTENT FUZZY LOGIC MAY FAIL
‘DEGREES OF TRUTH’ MAY BECOME MORE BLURRED / VAGUE

SOLUTION: XML SITEMAPS

TERM-FREQUENCY INVERSE DOCUMENT FREQUENCY
Cruft can also skew term-frequency inverse document frequency
AND THE QUERY CLUSTERS DOCUMENTS BELONG TO
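A toy calculation shows how this skew can happen. The sketch below uses the textbook inverse document frequency, log(N / df), on a tiny invented corpus, then adds near-duplicate cruft pages such as parameter variants of one product page. The corpus and the choice of IDF variant are assumptions. Real search engines use far more elaborate weighting.

```python
import math


def idf(term: str, documents: list) -> float:
    """Textbook inverse document frequency: log(N / number of docs with the term)."""
    containing = sum(1 for doc in documents if term in doc.split())
    return math.log(len(documents) / containing) if containing else 0.0


# Hypothetical clean site: four pages, two of them about shoes.
clean_site = [
    "red running shoes",
    "blue walking boots",
    "leather office shoes",
    "wool hiking socks",
]

# Six crufty parameter variants of the first page get crawled as well.
crufty_site = clean_site + ["red running shoes"] * 6

idf_clean = idf("shoes", clean_site)     # log(4 / 2)
idf_crufty = idf("shoes", crufty_site)   # log(10 / 8)
```

With the duplicates included, “shoes” appears in 8 of 10 documents instead of 2 of 4, so its IDF falls and the term looks less distinctive across the site than it really is.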

The Generational ‘Snail Trail’
Leaving its slithery footprint
• Old XML sitemaps
• Redirects drop away on old site .htaccess
• DNS issues
• People link to old site but wrong protocol
• Old sites not verified in GSC
• Not all protocols redirecting

URL NORMALIZATION
Can be problematic and ‘crufty’ too
https://en.wikipedia.org/wiki/URL_normalization

REDUCTION & REPOPULATION OF INTERNAL LINK POPULARITY (IBP) BETWEEN URL SCHEDULING
IT’S NOT ONLY THEIR ‘INTERNAL PAGE RANK’ BUT ALSO THE ANCHORS, INTER-CONNECTING CONCEPTUAL / TOPIC RELEVANCE IN CONTENT AND THE TEXT SURROUNDING INTERNAL LINK ANCHORS (AND PROBABLY OTHER THINGS TOO)
SEMANTIC ‘CLUES’ WERE LOST ALONG THE WAY
SEMANTIC ‘CONTEXT’ & IBP BUCKET IS LEAKING

SOLUTION: Wiki Page Redirects on Topics
https://dbpedia.org/sparql
Wikipedia Redirects
thesaurus.com
OR A GOOD OLD FASHIONED THESAURUS

Understand How URLs with Multiple Parameters Are Handled
The most restrictive parameter blocked overrules lesser restrictions

THE USE OF REUSE TABLES
TABLE I
Reuse Table Example

URL Record No.   URL Fingerprint (FP)   Reuse Type                    If Modified Since
1                2123242                REUSE
2                2323232                REUSE IF NOT MODIFIED SINCE   Feb. 5, 2004
3                3343433                DOWNLOAD
.                .                      .                             .
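The reuse table can be read as a lookup that tells the crawler what to do with each URL fingerprint. The sketch below encodes the three example rows and turns them into a fetch plan. The `plan_fetch` function, its return values and the default of downloading unknown fingerprints are assumptions added for illustration.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class ReuseType(Enum):
    REUSE = "REUSE"
    REUSE_IF_NOT_MODIFIED_SINCE = "REUSE IF NOT MODIFIED SINCE"
    DOWNLOAD = "DOWNLOAD"


@dataclass(frozen=True)
class ReuseRecord:
    fingerprint: int
    reuse_type: ReuseType
    if_modified_since: Optional[date] = None


# The three example rows from the reuse table above.
REUSE_TABLE = {
    record.fingerprint: record
    for record in (
        ReuseRecord(2123242, ReuseType.REUSE),
        ReuseRecord(2323232, ReuseType.REUSE_IF_NOT_MODIFIED_SINCE, date(2004, 2, 5)),
        ReuseRecord(3343433, ReuseType.DOWNLOAD),
    )
}


def plan_fetch(fingerprint: int):
    """Return (action, since_date) for a URL fingerprint.

    Assumption: fingerprints missing from the table are treated as new URLs
    and downloaded in full.
    """
    record = REUSE_TABLE.get(fingerprint)
    if record is None or record.reuse_type is ReuseType.DOWNLOAD:
        return ("download", None)
    if record.reuse_type is ReuseType.REUSE:
        return ("reuse cached copy", None)
    return ("conditional request", record.if_modified_since)


plans = {fp: plan_fetch(fp) for fp in (2123242, 2323232, 3343433, 9999999)}
```

Only the second row leads to a conditional request, which is where a 304 response from the server saves a full download.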

REMEMBER
“Gone is Never Gone”
“Search Engines Never Forget”
Dawn Anderson
@dawnieando

REFERENCES

Sources & References
Bar-Yossef, Z., Keidar, I. and Schonfeld, U., 2009. Do not crawl in the dust: different urls with similar text. ACM Transactions on the Web (TWEB), 3(1), p.3.
Broder, A.Z., Najork, M. and Wiener, J.L., 2003, May. Efficient URL caching for world wide web crawling. In Proceedings of the 12th international conference on World Wide Web (pp. 679-689). ACM.
Cambazoglu, B.B. and Baeza-Yates, R., 2011. Scalability challenges in web search engines. In Advanced topics in information retrieval (pp. 27-50). Springer Berlin Heidelberg.
Cho, J., Garcia-Molina, H. and Page, L., 1998. Efficient crawling through URL ordering. Computer Networks and ISDN Systems, 30(1), pp.161-172.
Fetterly, D., Manasse, M., Najork, M. and Wiener, J., 2003, May. A large-scale study of the evolution of web pages. In Proceedings of the 12th international conference on World Wide Web (pp. 669-678). ACM.

• Olston, C. and Najork, M., 2010. Web crawling. Foundations and Trends® in Information Retrieval, 4(3), pp.175-246.
• Pandey, S. and Olston, C., 2008, February. Crawl ordering by search impact. In Proceedings of the 2008 International Conference on Web Search and Data Mining (pp. 3-14). ACM.
• Olston, C. and Pandey, S., 2008, April. Recrawl scheduling based on information longevity. In Proceedings of the 17th international conference on World Wide Web (pp. 437-446). ACM.
• Pandey, S. and Olston, C., 2005, May. User-centric web crawling. In Proceedings of the 14th international conference on World Wide Web (pp. 401-411). ACM.

• https://patentimages.storage.googleapis.com/US8042112B1/US08042112-20111018-D00000.png
• Randall, K.H., Google Inc., 2010. Scheduler for search engine crawler. U.S. Patent 7,725,452.
