Technical SEO - Generational cruft in SEO - there is never a new site when there's history - BrightonSEO concise deck

Published in: Marketing

  1. 1. @dawnieando from @MoveItMarketing Dawn Anderson @dawnieando
  2. 2. @dawnieando from  @MoveItMarketing CRUFT
  3. 3. @dawnieando from  @MoveItMarketing The Great 302s Pass PageRank Debate
  4. 4. @dawnieando from @MoveItMarketing GENERATIONAL CRUFT: MULTIPLE GENERATIONS OF A WEBSITE
  5. 5. @dawnieando from  @MoveItMarketing NOT ‘Crufts’ – THE WORLD’S LARGEST DOG SHOW ERIC
  6. 6. @dawnieando from @MoveItMarketing CONTENT CRUFT https://moz.com/blog/clean-site-cruft-before-it-causes-ranking-problems-whiteboard-friday
  7. 7. @dawnieando from  @MoveItMarketing THIS TYPE OF CRUFT IS NOT THE SAME AS CONTENT CRUFT
  8. 8. @dawnieando from  @MoveItMarketing SOFTWARE  CRUFT
  9. 9. @dawnieando from @MoveItMarketing ‘URL CRUFT’ IS A THING “characters relevant or meaningful only to the people who created the site, such as implementation details of the computer system which serves the page. Examples of URL cruft include filename extensions such as .php or .html, and internal organizational details such as /public/ or /Users/john/work/drafts/.” (Wikipedia definition)
  10. 10. ALL  THE  RANDOM CRAP PEOPLE  ADD  TO QUERY  STRINGS,   PARAMETERS,  DIRECTORY   FOLDERS  AND  URL   STRUCTURES
  11. 11. @dawnieando from  @MoveItMarketing CODE  &  URL   CRUFT  MAKES   CRAWLING   SLUGGISH
  12. 12. @dawnieando from @MoveItMarketing “COOL URIs DON’T CHANGE” Sir Tim Berners-Lee (Inventor of the World Wide Web) https://www.w3.org/Provider/Style/URI Attribution: By Uldis Bojārs (Flickr) [CC BY-SA 2.0 (http://creativecommons.org/licenses/by-sa/2.0)], via Wikimedia Commons
  13. 13. @dawnieando from  @MoveItMarketing A Clean Slate LET’S START WITH A CLEAN SLATE
  14. 14. @dawnieando from  @MoveItMarketing Websites (AND URLs) are not disposable
  15. 15. @dawnieando from @MoveItMarketing SEARCH ENGINES NEVER FORGET Search engines have a long memory and a lot of storage
  16. 16. @dawnieando from  @MoveItMarketing 404  NOT   FOUND &  410   GONE § “Of  course,  we   won’t  redirect   everything…” § “Not  everything   will  be  worth   redirecting”
  17. 17. @dawnieando from  @MoveItMarketing 410 Gone § “Some,  we’ll  just  kill   off  with  a  410…” § “Then  the  URLs  will   be  gone”
  18. 18. @dawnieando from  @MoveItMarketing https://twitter.com/JohnMu/status/903904602617204738
  19. 19. @dawnieando from @MoveItMarketing 302 == Default; 301 == Intentional; 404 == Default; 410 == Intentional. “The 410 response is primarily intended to assist the task of web maintenance by notifying the recipient that the resource is intentionally unavailable and that the server owners desire that remote links to that resource be removed.” (RFC 7231) https://tools.ietf.org/html/rfc7231#section-6.5.9 ARE YOU SURE? MAYBE YES
  20. 20. @dawnieando from  @MoveItMarketing https://www.youtube.com/watch?v=xp5Nf8ANfOw THE  DIFFERENCE  BETWEEN  HOW  GOOGLE  TREATS  404  VERSUS  410s
  21. 21. @dawnieando from @MoveItMarketing DO NOT THINK 410s WON’T BE RECRAWLED AGAIN Source: https://www.docsplace.org/4578/09/410-gone-stops-crawling-dead-urls/
  22. 22. @dawnieando from  @MoveItMarketing “We  knew  there  was  content   there  at  some  point  so  we   just  swing  by  every  now  and   then  to  see  if  anything  came   back”  (John  Mueller,  2016) In Reality… Gone Is Never Gone
  23. 23. @dawnieando from @MoveItMarketing ZOMBIES ARE NEVER GONE. NO URLS ARE EVER GONE, ONLY THE RESOURCE THERE IS GONE https://www.seroundtable.com/google-410-indexing-22584.html 5 YEARS LATER
  24. 24. @dawnieando from  @MoveItMarketing HOW ABOUT 14 YEARS LATER? https://www.webmasterworld.com/google/4864613.htm 2  HOURS  ALIVE…   14  YEARS  LATER
  25. 25. @dawnieando from  @MoveItMarketing YOU END UP WITH A CONGA LINE OF LEGACY URLS, SUBDOMAINS & VARIOUS SITE PROTOCOLS
  26. 26. @dawnieando from  @MoveItMarketing “Forever, And ever, And ever, And ever… You’ll be a URL”
  27. 27. @dawnieando from @MoveItMarketing GOOGLEBOT GETS WHERE WATER COULDN’T https://petermeadit.com/blog/block-web-crawlers/
  28. 28. @dawnieando from  @MoveItMarketing EVEN YOUR STAGING & DEV SITES Found  with  a  very  simple  wildcard  *  site:  query
  29. 29. @dawnieando from  @MoveItMarketing THE CHALLENGE IS NOT IN INDEXING… BUT IN KEEPING EVERYTHING INDEXED UP TO DATE
  30. 30. @dawnieando from  @MoveItMarketing INCREMENTAL CRAWLING NEVER ENDS “Crawling  method   based  on  crawl   frequency  based  on   URL  historical   change  &   importance   rate” Crawling Which Never Ends Ongoing
  31. 31. @dawnieando from @MoveItMarketing The Crawling ‘Frontier’ (THE URL QUEUE) ‘TO BE EXPLORED’ (OR REVISITED)
  32. 32. @dawnieando from  @MoveItMarketing URLs Take Their Place in The Frontier Queue (New & Revisit) The  Queue  Gets  Long  &   Congested
  33. 33. @dawnieando from  @MoveItMarketing EVEN THE RANDOM  CRAP
  34. 34. @dawnieando from  @MoveItMarketing PAST DATA ON CHANGE IS A GREAT PREDICTOR OF FUTURE DATA PREDICTION  BASED   PRIORITY   SCHEDULING …  WHEN   THERE  IS   CONSISTENCY “past  changes  to  a  page  are  a  good  predictor  of  future  changes.  This  result   has  practical  implications  for  incremental  web  crawlers  that  seek  to   maximize  the  freshness  of  a  web  page  collection  or  index.”  (
  35. 35. @dawnieando from  @MoveItMarketing BASED  ON  ROLLING   AVERAGES OF  PAST CRAWL  VISITS
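
As a rough illustration of slides 34 and 35, a scheduler could keep a rolling average of whether a URL had changed on its recent crawl visits, and revisit steadier URLs less often. This is only a sketch under stated assumptions: the window size, the interval bounds and both function names are hypothetical, and nothing here describes how any real search engine schedules crawls.

```python
from collections import deque


def rolling_change_rate(changed_flags, window=5):
    """Share of the last `window` crawl visits on which the page had changed."""
    recent = deque(changed_flags, maxlen=window)  # keeps only the newest visits
    if not recent:
        return None  # a brand new URL has no history to average
    return sum(recent) / len(recent)


def revisit_interval_days(changed_flags, window=5, min_days=1, max_days=30):
    """Map the rolling change rate to a revisit interval (hypothetical policy)."""
    rate = rolling_change_rate(changed_flags, window)
    if rate is None:
        return min_days  # no history: check again soon to start building some
    # High change rate -> short interval; no observed change -> longest interval.
    return round(max_days - rate * (max_days - min_days))
```

The point the sketch makes is the one on the slides: an inconsistent change history gives the average little to work with, so consistency matters.
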
  36. 36. @dawnieando from  @MoveItMarketing IMPORTANCE TIERING FOR SCALE (EFFICIENCY)
  37. 37. @dawnieando from @MoveItMarketing A NEW URL HAS NO HISTORY BUT YOUR OLD ONES HAVE LOTS
  38. 38. @dawnieando from  @MoveItMarketing Stored in Search Engine History Logs
  39. 39. @dawnieando from  @MoveItMarketing TO  BUILD   PROBABILITY  &   PREDICTABILITY   MODELS
  40. 40. @dawnieando from @MoveItMarketing History Log Records Include: • URL fingerprint • Timestamp (last crawl or download attempt) • Crawl status (success or error) (Response code) • Content checksum (binary code) • Source ID (accessed from cache or downloaded) • Segment identifier (Crawl segment assigned to??) • Page importance (a measure of importance assigned to the URL)
  41. 41. @dawnieando from @MoveItMarketing “The URL page importance score can be retrieved from the … URL history log … or it can be obtained by obtaining the historical page importance score for the URL for a predefined number of prior crawls and then performing a predefined filtering function on those values to obtain the URL page importance score.” Scheduler for Search Engine Crawler https://www.google.com/patents/US8042112 IMPORTANCE RECORD PER CRAWL (CRAWL 1 TO CRAWL 6): DOC ID 1: 1, 0.8, 0.6, 0.4, 0.2, 0; DOC ID 2: 0, 0.2, 0.4, 0.6, 0.8, 1
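
To make the quoted idea concrete, the sketch below applies a filtering function to the importance records in the slide's example table. The patent text quoted here does not say what the "predefined filtering function" is, so the plain mean over the last three crawls is an assumption, and `filtered_importance` and `prior_crawls` are hypothetical names.

```python
from statistics import mean

# Importance records for crawls 1..6, copied from the slide's example table.
HISTORY = {
    "DOC ID 1": [1, 0.8, 0.6, 0.4, 0.2, 0],
    "DOC ID 2": [0, 0.2, 0.4, 0.6, 0.8, 1],
}


def filtered_importance(records, prior_crawls=3):
    """Average the importance recorded over the most recent `prior_crawls` crawls.

    The plain mean is an assumption standing in for the unspecified
    "predefined filtering function".
    """
    return mean(records[-prior_crawls:])


for doc_id, records in HISTORY.items():
    print(doc_id, round(filtered_importance(records), 2))
```

DOC ID 1 is trending down and DOC ID 2 is trending up, so an average over recent crawls scores the second document higher even though both started and ended at opposite extremes.
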
  42. 42. @dawnieando from @MoveItMarketing URL_SEEN TEST: YOU CAN’T JUST KEEP TRYING TO JUMP THE INDEXING QUEUE EITHER. PUSH INDEXING: E.G. FETCH AS GOOGLEBOT & SUBMIT TO INDEX. PULL INDEXING: VISITS BY NATURAL CRAWLING & DISCOVERY OF URLS / URL VISIT SCHEDULING / REVISITS
  43. 43. @dawnieando from  @MoveItMarketing ‘Sampling’ in Crawling for Efficiency ‘SMALL  TEST  VISITS  TO  A  SITE  TO   UNDERSTAND  WHETHER  IT  IS  WORTH   CRAWLING  &  UNDERSTAND    URL   PATTERNS  &  RESOURCES  THERE’
  44. 44. @dawnieando from  @MoveItMarketing Popular CMS ’Rule Patterns’ (URL Parameters) ALL  WILL  HAVE  COMMON   CANONICALIZATION  PATTERNS  WHICH   CAN  BE  LEARNED
  45. 45. @dawnieando from  @MoveItMarketing DUSTBUSTER & DUST CRAWLING RULES DO  NOT   CRAWL  IN   THE  DUST BUILDS   ‘HINTS’  ON   WHAT  NOT   TO  CRAWL EVERY  SITE  WILL   HAVE  ITS  OWN   CRAWLING   RULES
  46. 46. @dawnieando from  @MoveItMarketing Aged ‘Patchwork Quilt’ Sites A  LITTLE  BIT  OF  THIS  CMS  AND  A   LITTLE  BIT  OF  THAT  CMS MANY  HISTORICAL  PARAMETERS   CREATED  &  CRAWLING  SAMPLE   PATTERNS
  47. 47. @dawnieando from  @MoveItMarketing Every Version of Your Past Ecommerce Sites “Exponentially   multiplicative   URLs” Had  potential  to  spew…  at  some  point… DIFFERENT  PARAMETERS  &  URL   PATTERNS  WHICH  ARE  LEARNED  BY   CRAWLERS…  AND  REMEMBERED…   FOREVER
  48. 48. @dawnieando from  @MoveItMarketing ‘Transitive’?? Transitive  -­‐ A  ==  B  +  B  ==  C  then  A  ==  C For  some  types  of  content  more  than   others  – e.g.  ecommerce/directories  but   not  news SAMPLING
  49. 49. @dawnieando from  @MoveItMarketing EFFICIENCY IS  NOT  JUST  ABOUT  URL   SCHEDULING.   IT  IS  ABOUT  NEAR  MEMORY   STORAGE  (e.g.  CACHING)  TOO
  50. 50. @dawnieando from  @MoveItMarketing REUSING PRE-­‐EMPTING  (PARTICULARLY   POPULAR  DOCUMENTS  /  QUERIES  )   &  REUSING  WHAT  WAS  ALREADY  IN   NEARBY  (MEMORY  V  DISC)   STORAGE
  51. 51. @dawnieando from @MoveItMarketing REUSE: LOW IMPORTANCE and / or DOESN’T CHANGE OFTEN. REUSE IF NOT MODIFIED SINCE: LIKELY TO CHANGE BY X DATE (SINCE DATE). DOWNLOAD: CHANGES FREQUENTLY WITH IMPORTANT CHANGE OR IS AN IMPORTANT DOCUMENT https://www.google.com/patents/US8042112
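
The three reuse types on this slide can be read as a small decision rule. The sketch below is one possible reading, not the patent's method: the inputs are assumed to be scores between 0 and 1, all three thresholds are invented for illustration, and `choose_reuse_type` is a hypothetical name.

```python
from enum import Enum


class ReuseType(Enum):
    REUSE = "REUSE"
    REUSE_IF_NOT_MODIFIED_SINCE = "REUSE IF NOT MODIFIED SINCE"
    DOWNLOAD = "DOWNLOAD"


def choose_reuse_type(importance, change_rate,
                      important_at=0.7, frequent_at=0.5, rare_at=0.1):
    """Pick a reuse type from an importance score and a change rate (both 0..1).

    All three thresholds are hypothetical; the slide gives no numbers.
    """
    if importance >= important_at or change_rate >= frequent_at:
        return ReuseType.DOWNLOAD  # important or frequently changing document
    if change_rate <= rare_at:
        return ReuseType.REUSE  # low importance and rarely changes
    return ReuseType.REUSE_IF_NOT_MODIFIED_SINCE  # may have changed: ask first
```
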
  52. 52. @dawnieando from  @MoveItMarketing CRAWL  SAMPLES  ALSO   HELP  WITH   MODELLING  TO  MAP   DOCS  TO  TOPIC   RELEVANCE
  53. 53. @dawnieando from @MoveItMarketing YOU BROKE YOUR SILO STRUCTURE Image credit: https://www.slideshare.net/patrickstox/nlp-sitemap-smx-2016-patrick-stox-latest-in-advanced-technical-seo SEMANTIC LOSS
  54. 54. @dawnieando from @MoveItMarketing ‘CONCEPT DRIFT’ IS A THING fuzzy: difficult to perceive; indistinct or vague. synonyms: blurry, blurred, indistinct, unclear, bleary, misty, distorted, out of focus, unfocused, lacking definition, low resolution, nebulous; ill-defined, indefinite, vague, hazy, imprecise, inexact, loose, woolly "a fuzzy picture" https://en.wikipedia.org/wiki/Concept_drift AI ALERT
  55. 55. @dawnieando from  @MoveItMarketing BOOLEAN LOGIC – EXTREME CASES OF TRUTH (TRUE (1) OR FALSE (0))
  56. 56. @dawnieando from  @MoveItMarketing ‘FUZZY LOGIC’ – DEGREES OF TRUTH SEMANTIC   LOSS
  57. 57. @dawnieando from  @MoveItMarketing BIG TOPICAL URL FISH IN A SMALL TOPICAL POND
  58. 58. @dawnieando from  @MoveItMarketing SMALL TOPICAL URL FISH IN A BIG TOPICAL POND SEMANTIC   LOSS
  59. 59. @dawnieando from @MoveItMarketing ’Fuzzy’ URL Targets with Each Site Generation EVERYTHING GETS A BIT BLURRED ‘Which is the target URL again?’
  60. 60. @dawnieando from  @MoveItMarketing GENERATIONAL   CRUFT  CAN   SNOWBALL • Past  infinite  loops • Dodgy  URL  parameters • Misconfigured  URL  parameters • Old  URL  crawling  ‘rules  /  hints’ • Old  ‘importance  /  quality’   scores • Filtered  dupes  &  near-­‐dupes • Mixed  messaging  canonicals • 410s  still  being  revisited • Internal  links  to  old  sites  /   protocols
  61. 61. @dawnieando from  @MoveItMarketing WRONG URL RANKING ’SWAPPING OUT’ (Especially   multiple   child  nodes) SHARP  &   VOLATILE RANKING   FLUX SOME  SYMPTOMS
  62. 62. @dawnieando from  @MoveItMarketing A  LOT  OF  WRONG  TARGETS   RANKING  POST  MIGRATION SOME  SYMPTOMS
  63. 63. @dawnieando from  @MoveItMarketing MIXED CONTENT & MULTIPLE SITE VERSIONS http://www.itv.com/news/
  64. 64. @dawnieando from  @MoveItMarketing MIXED CONTENT & MULTIPLE SITE VERSIONS http://www.itv.com/news/ BOTH  HTTP  &   HTTPS  FIGHTING   EACH  OTHER
  65. 65. @dawnieando from  @MoveItMarketing PEOPLE CHURN INTERNAL  TEAM   CHURN EXTERNAL  AGENCY   CHURN
  66. 66. @dawnieando from  @MoveItMarketing FIND SITES ON THE SAME SERVER
  67. 67. @dawnieando from @MoveItMarketing DIAGNOSE: Validate & Retain in GSC ALL Past Domains & Past Site Versions (Protocols: HTTPS / HTTP) THERE MAY STILL BE UNDETECTED ACTIVITY GOING ON THERE
  68. 68. @dawnieando from  @MoveItMarketing URL Parameter Handling is Your Friend Help  Google  Build  ‘Crawling   Rules’  for  your  site  rather   than  wasting  time  on   ‘sampling’  and  giving  a  bad   impression GIVE  HELP  AND   GUIDANCE  WITH  THE   CRAWL  RULE  AND   HINT  BUILDING
  69. 69. @dawnieando from  @MoveItMarketing Help  Google  Build   ‘Crawling  Rules’  for   your  site  rather  than   wasting  time  on   ‘sampling’  and  giving   a  bad  impression BE  VERY   CAREFUL
  70. 70. @dawnieando from  @MoveItMarketing PEOPLE CANONICALIZE WRONG ON  MULTIPLE  GENERATIONS  OF  SITES
  71. 71. @dawnieando from  @MoveItMarketing 47%  of  TECHNICAL   SEOs  thought: “REL=NEXT  /  REL  =   PREV”  IS  A  FORM  OF   CANONICALIZATION
  72. 72. @dawnieando from  @MoveItMarketing Lots  OF  SEOS  were   unaware  that: “301s  and  302s  are   BOTH  forms  of   canonicalization”
  73. 73. @dawnieando from  @MoveItMarketing Only  64%  of  ’Technical SEOs’  realised Href Lang  is  a  form  of Canonicalization (Internationalization)
  74. 74. @dawnieando from  @MoveItMarketing
  75. 75. @dawnieando from  @MoveItMarketing REVIEW & UNDERSTAND - THE CANONICAL LINK RELATION § 30X  redirects § Canonical  tag § Href lang § HTTPS  protocol § Global  canonicalization   rules § URL  normalization In  ’ALL’  its  forms
  76. 76. @dawnieando from  @MoveItMarketing PEOPLE APPEND (ADD TO FILES) - SOMETIMES IT’S FEAR OF DEPENDENCIES
  77. 77. @dawnieando from  @MoveItMarketing YOU  NEED   TO  KNOW   WHAT’S  ON   THAT   SERVER DIAGNOSE: HEAD BACK TO THE SERVER
  78. 78. @dawnieando from  @MoveItMarketing DIAGNOSE: SERVER LOG FILE ANALYSIS BUT  WATCH  OUT  FOR   OTHER  TOOLS  EMULATING   GOOGLEBOT  AND  FILTER   THEM  OUT ANALYSE  THE  LOGS  FOR   ‘ALL’  YOUR  SITES  AND  ‘ALL’   PROTOCOLS  TO  SEE  THE   PATTERNS  EMERGE
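
One way to filter out tools that only emulate Googlebot's user agent is the reverse-then-forward DNS check that Google documents for verifying its crawlers. The sketch below assumes IPv4 addresses pulled from your logs; the two lookup arguments are hypothetical hooks added so the check can be exercised without network access, and the function name is made up for illustration.

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-then-forward DNS check for a client IP claiming to be Googlebot.

    The lookup arguments are hypothetical hooks so the check can be exercised
    offline; by default the standard library resolver is used.
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip)
        if not host.endswith(GOOGLE_SUFFIXES):
            return False  # user agent may say Googlebot, DNS says otherwise
        return ip in forward_lookup(host)  # hostname must map back to the same IP
    except OSError:
        return False  # unresolvable addresses are treated as unverified
```

Log lines whose client IP fails this check can then be dropped before looking for crawl patterns across sites and protocols. DNS lookups are slow, so caching the result per IP is worth considering on large log files.
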
  79. 79. @dawnieando from  @MoveItMarketing When analysing logs you’re often viewing URLs from a ‘A LONNNNGGGG Time Ago’ LOOKING   AT LEGACY
  80. 80. @dawnieando from @MoveItMarketing REVISIT ALL PAST .HTACCESS FILES Can you rewrite the rules to be more efficient with regex or cut out some old rules still firing unnecessarily? (CREATE SHORTCUTS) REMEMBER .HTACCESS RULES RUN IN ORDER OF THEIR APPEARANCE IN THE FILE. CAN YOU USE WILDCARDS TO OPTIMIZE OR SKIP STEPS? .HTACCESS SITE 1, .HTACCESS SITE 2, .HTACCESS SITE 3
  81. 81. @dawnieando from  @MoveItMarketing CHOP BACK REDIRECT CHAINS
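
Chopping back a chain means pointing every legacy URL straight at its final destination. A minimal sketch, assuming the redirect rules from each site generation have already been collected into a single source-to-target mapping (the function name and the input shape are hypothetical):

```python
def flatten_redirects(redirects):
    """Point every source URL straight at its final destination.

    `redirects` maps source -> target, e.g. rules gathered from several
    generations of .htaccess files. Sources caught in a loop are left out
    and reported separately.
    """
    flat, loops = {}, set()
    for source in redirects:
        seen, current = {source}, redirects[source]
        while current in redirects:  # keep hopping while the target also redirects
            if current in seen:
                loops.add(source)
                break
            seen.add(current)
            current = redirects[current]
        else:
            flat[source] = current  # reached a URL that no longer redirects
    return flat, loops
```

For example, `flatten_redirects({"/v1": "/v2", "/v2": "/v3"})` maps both `/v1` and `/v2` directly to `/v3`, so a crawler arriving on the oldest URL makes one hop instead of two.
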
  82. 82. @dawnieando from  @MoveItMarketing Help Googlebot Get Round its Shopping List OPEN  MORE  CHECKOUTS WIDEN  THE  AISLES MAKE  THINGS  EASY  TO  FIND DON’T  CONFUSE   GOOGLEBOT HELP  FILL  THE  TROLLEY   QUICKLY SPEED,  SPEED,  SPEED
  83. 83. @dawnieando from  @MoveItMarketing XML Sitemaps Are Your Friend… (Strong Foundations) They  help  to   pass   ‘importance’   signals  to  URLs But…  never   leave  them  to   just   autogenerate without   periodically   checking ‘The   foundations’   underneath  a   site
  84. 84. @dawnieando from  @MoveItMarketing EXTERNALLY HOSTED XML SITEMAPS • Take  back  control • Jump  the  dev  queue • Allows  for  custom  configuration  of  optimal   canonical  click  paths • Allows  for  consistent  signals  of  importance  to   included  URLs • Forget  about  setting  priority • Forget  about  last  modified • Even  a  simple  list  of  URLs  FTW  will  do • Keep  them  organised for  granular  analysis  of   problem  site  sections
  85. 85. @dawnieando from  @MoveItMarketing INSTEAD  OF   REMOVE…   CONSIDER…   DISTRACT  &   ITERATIVELY IMPROVE STRATEGIC  USE  OF  INTERNAL  LINK   POPULARITY REDUCE  IMPORTANCE  SIGNALS   TO  DIFFERENT  PAGES INCLUDE  IMPORTANT  PAGES  IN   XML  SITEMAPS INCLUDE  IMPORTANT  PAGES  IN   HTML  SITEMAPS
  86. 86. @dawnieando from @MoveItMarketing BUILD WELL CATEGORIZED AND CONCEPTUALLY STRUCTURED SITEMAPS https://www.slideshare.net/patrickstox/nlp-sitemap-smx-2016-patrick-stox-latest-in-advanced-technical-seo
  87. 87. @dawnieando from  @MoveItMarketing SOLUTION: Increase ‘Importance’ quickly of target URLs • Internal  link  optimization • Canonicalise to  (if  relevant) • Strengthen  up  importance  signals • Inclusion  in  front  facing  HTML  and  XML   sitemaps • Improve  the  content  &  keep  it  updated • 301  redirect  to  (if  relevant  redundant   content) • Topical  hubs  and  strong  information   views  to  navigate  users  &  add  relevance
  88. 88. @dawnieando from  @MoveItMarketing SOLUTION: Reduce ‘Importance’ quickly of old URLs • Internal  link  UNOPTIMIZATION • 410 • Dig  out  URLs  with  links  to  them • Orphan  URLs • Canonicals  to  HTTPs • EXCLUSION  from  XML  sitemaps   (even  old  ones  on  the  server) • Archiving  of  content
  89. 89. @dawnieando from @MoveItMarketing CONTENT CRUFT https://moz.com/blog/clean-site-cruft-before-it-causes-ranking-problems-whiteboard-friday
  90. 90. @dawnieando from  @MoveItMarketing IT’S  VERY   IMPORTANT…   YOU  STAY  OUT   OF  SERVER   ERROR  STATUS 500 ‘Try  again’  intervals  likely  extended   between  each  failed  connection   attempt
  91. 91. @dawnieando from  @MoveItMarketing Consistency is REMEMBER  ’ROLLING   AVERAGES’
  92. 92. @dawnieando from  @MoveItMarketing APPENDIX
  93. 93. @dawnieando from @MoveItMarketing 410s Likely Get Deindexed Quicker https://plus.google.com/+JohnMueller/posts/NEsqE7Sr4Z4 “Usually seeing it (410) 1-2 times is enough for us to drop those URLs from the index” John M on Google+ (https://plus.google.com/u/0/+JohnMueller/posts/NEsqE7Sr4Z4)
  94. 94. @dawnieando from @MoveItMarketing LEGACY ISSUES VIA CANONICALS OR REDIRECTION (COMMON MISTAKES) • PAGE CANONICALIZED TO IS NOT A SUPERSET OR DUPLICATIVE (IT IS NOT RELEVANT ENOUGH) • 301s TO IRRELEVANT PAGES BECOME SOFT 404 • FOLDING UP PRODUCT PAGES TO CATEGORIES (PEOPLE WERE LOOKING FOR A SPECIFIC PRODUCT) • CANONICALIZATION TO PAGES WHEN IN THE FUTURE 301 REDIRECT TO ANOTHER URL THEREFORE NEGATING THE PAGES CANONICALIZING TO THEM • CONFLICTS BETWEEN HREF LANG AND CANONICALIZATION
  95. 95. @dawnieando from  @MoveItMarketing MORE CAUSES SEARCH ENGINES ARE CRAWLING MORE CODE THAN YOU MIGHT HAVE INTENDED IN THE FIRST PLACE JAVASCRIPT ERRORS FROM LEGACY CODE & LIBRARIES LEGACY 302s FROM REDIRECTED LEGACY DOMAINS WHICH CONFUSE INTERMEDIATE SIGNALS BETWEEN 301S (WHICH ARE INTENDED DEFINITE REDIRECTIONS) ABANDONED URLS AJAX URLS (NOT THE SAME AS THE NAMED ANCHOR) – DEPRECATION OF AJAX CRAWLING (ASYNCHRONOUS JAVASCRIPT & XML)
  96. 96. @dawnieando from @MoveItMarketing “If “change” means “any change”, then about 40% of all web pages change weekly [12]. Even if we consider only pages that change by a third or more, about 7% of all web pages change weekly [17].” (Broder, A.Z., Najork, M. and Wiener, J.L., 2003) EVEN AS FAR BACK AS 2003, 40% of ALL web pages changed weekly. 7% of web pages changed a 1/3 of their page content or more weekly
  97. 97. @dawnieando from @MoveItMarketing HOW MUCH BIGGER & DYNAMIC IS THE WEB NOW IN 2017? http://www.internetlivestats.com/total-number-of-websites/
  98. 98. @dawnieando from @MoveItMarketing FUZZY LOGIC • Rule based logic • Been around for 20+ years • Is within a subset of AI
  99. 99. @dawnieando from  @MoveItMarketing THESE  THINGS  ADD  UP THEY  ALSO  STILL  NEED  TO  BE  DISCOVERED   WHICH  REQUIRES  INITIAL  CRAWLING https://twitter.com/dawnieando/status/906465965029969920
  100. 100. @dawnieando from @MoveItMarketing “404 vs 410 doesn't affect the recrawl rate: we'll still occasionally check to see if these pages are still gone, especially when we spot a new link to them” John Mueller, Google+ 2015 https://plus.google.com/u/0/+JohnMueller/posts/NEsqE7Sr4Z4 ESPECIALLY IF THERE ARE LINKS TO IT
  101. 101. @dawnieando from  @MoveItMarketing Pass Strong Clues - Highly Relevant New Conceptual Structures STRONG SEMANTICS  &   CONCEPTUALLY   CO-­‐OCCURRING   TERMS
  102. 102. @dawnieando from  @MoveItMarketing THINK CAREFULLY ABOUT URL CREATION Not  EVERYTHING  is   worthy  of  its  own  URL VARIANTS STEMMINGS PLURALS RANDOM  TAGS LONG,  LONG,  LONG   TAIL  PARAMETERS
  103. 103. @dawnieando from @MoveItMarketing ONLY DOWNLOAD IF THERE IS SUBSTANTIVE CHANGE. TAKE SOME CONTROL WITH 304 & EXPIRES AFTER HEADERS ON LESS IMPORTANT PAGES https://developers.google.com/web/fundamentals/performance/optimizing-content-efficiency/http-caching VALID REPRESENTATION: THE URL WILL STILL BE VISITED BUT 0 (ZERO) WILL BE DOWNLOADED SO IT IS STRAIGHT ON TO THE NEXT URL VERY QUICKLY https://webmasters.googleblog.com/2006/09/better-details-about-when-googlebot.html https://tools.ietf.org/html/rfc7232#section-4.1
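
The 304 behaviour on this slide comes down to comparing the page's last-modified time with the `If-Modified-Since` header the crawler sends. The sketch below shows only that comparison; the function name and the `(status, headers)` return shape are hypothetical, and a real site would normally get this from its web server or framework rather than hand-written code.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(last_modified, if_modified_since=None):
    """Return (status, headers) for a GET, honouring If-Modified-Since.

    `last_modified` must be a timezone-aware datetime for the page.
    """
    # HTTP dates are expressed in GMT with one-second resolution.
    last_modified = last_modified.astimezone(timezone.utc).replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # an invalid date means the header is ignored
        if since is not None and since.tzinfo is not None and last_modified <= since:
            return 304, headers  # nothing new: no body is sent
    return 200, headers  # changed (or no validator sent): send the full page


page_changed = datetime(2004, 2, 5, tzinfo=timezone.utc)
status, _ = conditional_response(page_changed, "Thu, 05 Feb 2004 00:00:00 GMT")
print(status)
```

The URL is still visited either way; the saving is that an unchanged page costs the crawler a header exchange rather than a full download.
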
  104. 104. @dawnieando from  @MoveItMarketing A  URI  is  like  a  fine   wine Maturing  over   time “COOL  URIs   DON’T   CHANGE” Sir  Tim  Berners-­‐Lee (Inventor  of  the  World  Wide  Web) https://www.w3.org/Provider/Style/URI
  105. 105. @dawnieando from  @MoveItMarketing A  LONG,  LONG  TIME  AGO • You  need  to  go  right  back  to  the  beginning • What  domains  did  the  organisation EVER  register? • Where  do  they  redirect  to? • Is  it  via  301,  302  or  are  they  merely  parked  domains? • Who  would  know?    Who  is  responsible? • Verify  them  all  in  Google  Search  Console • Some  of  these  may  EVEN  HAVE  PENALTIES  HISTORICALLY • If  there  are  links  to  any  there  is  likely  still  crawling  activity  there • Analyse logs  across  multiple  subdomains  &  protocols
  106. 106. @dawnieando from @MoveItMarketing QUESTIONS TO ASK: HOW MANY MICRO-SITES HAVE YOU HAD? HOW MANY SUBDOMAINS? HOW MANY OTHER DOMAINS? WHO IS RESPONSIBLE FOR DOMAIN REG? WHO KNOWS WITHIN THE ORGANISATION? WHO REGISTERED THE DOMAINS? WHO CAN UPDATE DNS RECORDS? ARE THESE SITES STILL ON SERVERS? HAVE ANY OF THESE SITES HAD MANUAL ACTIONS? HOW ARE THESE SITES REDIRECTED? ARE THEY PARKED DOMAINS?
  107. 107. @dawnieando from  @MoveItMarketing DATA FROM HISTORY LOGS CONTRIBUTE TO WHEN TO REVISIT URIs ON THE WEB
  108. 108. @dawnieando from  @MoveItMarketing SOLUTION: REVISITING BLOATED APPENDED .HTACCESS FILES ON ALL LEGACY SITES (IF NOT REDIRECTING AT A DNS LEVEL) NOT  JUST  THE  .HTACCESS  FILE  ON  THE  EXISTING   SITE  EITHER. GOOGLEBOT  MAY  HIT  .HTACCESS  ON  PAST  SITES   SO  THEY  MAY  ALSO  NEED  OPTIMIZING .HTACCESS  RUN  IN  ORDER  SO  PROVIDE   OPPORTUNITY  FOR  SHORT  CUTS  
  109. 109. @dawnieando from @MoveItMarketing SOME TYPES OF URL CRUFT • INCORRECTLY APPLIED CANONICAL TAGS • CONFLICTING HREF LANG & CANONICAL TAGS • MIXED CONTENT • URL SHORTENERS • SESSION IDS • UTM TAGGING • OLD AJAX FRAGMENTS • PARAMETERS FROM MULTI FACET DROP DOWN CHOICES • .html, .php, .index.html, .aspx • LEGACY URL REWRITING & PARAMETERS IN .HTACCESS FILES • LEGACY FOLDERS WHICH CONTRIBUTE NO MEANING TO SITE ONTOLOGY UNCRUFTY: www.myeasyurlwillmakeyouwonder.com/resume CRUFTY: www.myeasyurlwillmakeyouwonder.com/resume.html CRUFTY: http://nymag.com/scienceofus/2015/07/how-to-recover-from-an-all-nighter.html?om_rid=AAENcg&om_mid=_BTtFa0B869PyJp&utm_content=buffer8fdd1&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
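
Tracking and session parameters like those in the crufty example can be stripped when normalising URLs for internal links, sitemaps or log analysis. A small sketch using only the standard library; the prefixes and parameter names in the deny-list are assumptions for illustration and should be replaced with whatever your own logs reveal.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical deny-list; tune it to the parameters your own logs reveal.
TRACKING_PREFIXES = ("utm_", "om_")
SESSION_PARAMS = {"sessionid", "phpsessid", "sid"}


def strip_cruft(url):
    """Drop tracking and session parameters, keeping everything else in order."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Applied to the nymag.com example on the slide, every parameter matches the deny-list, so only the path ending in `how-to-recover-from-an-all-nighter.html` remains. The `.html` extension is a separate kind of cruft that this sketch leaves alone.
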
  110. 110. @dawnieando from  @MoveItMarketing INDEX TIERING Presented  by  B  Cambazoglu at  European  Summer  School  Information  Retrieval  2017  – (Cambazoglu,  B.B.  and  Baeza-­‐Yates,  R.,  2011.   Scalability  challenges  in  web  search  engines.  In Advanced  topics  in  information  retrieval (pp.  27-­‐50).  Springer  Berlin  Heidelberg.)
  111. 111. @dawnieando from  @MoveItMarketing FIND SITES ON THE SAME SERVER
  112. 112. @dawnieando from  @MoveItMarketing TWO-PHASE RANKING IN A SEARCH NODE Presented  by  B  Cambazoglu at  European  Summer  School  Information  Retrieval  2017  – (Cambazoglu,  B.B.  and  Baeza-­‐Yates,  R.,   2011.  Scalability  challenges  in  web  search  engines.  In Advanced  topics  in  information  retrieval (pp.  27-­‐50).  Springer  Berlin   Heidelberg.)
  113. 113. @dawnieando from  @MoveItMarketing FUZZY LOGIC – DEGREES OF TRUTH 0.8  Doc  ID  likely  to   be  a  correct  URI  to   choose  from  term  /   query  cluster
  114. 114. @dawnieando from  @MoveItMarketing EVERY  SINGLE  TIME  YOU  MIGRATE,  CHANGE  DESIGN,  REDIRECT,  REINVENT  A  SITE  /  URL A  CLEAN  START REDIRECTIONS ANOTHER  STRUCTURE FIRST  SITE   STRUCTURE NEW  CRAWLING  ‘RULES’   BUILT CRAWLING   ‘RULES’  BUILT EVERYTHING   IS  ‘200  OK’ MORE  URLs MIXED  RESPONSE  CODES REDIRECTIONS ‘FUZZINESS’  IS  EMERGING NEW  CRAWLING  ‘RULES’  BUILT MORE  URLs REDIRECT  CHAINS  &  MIXED   RESPONSE  CODES NEW  SEO’s  DON’T   KNOW  THE  ‘HISTORY’ TARGET  URLs  NOW  ‘VERY  FUZZY’
  115. 115. @dawnieando from  @MoveItMarketing BUT WHEN DATA IS INCONSISTENT FUZZY LOGIC MAY FAIL ‘DEGREES  OF  TRUTH’ MAY  BECOME  MORE   BLURRED  /  VAGUE
  116. 116. @dawnieando from  @MoveItMarketing SOLUTION: XML SITEMAPS
  117. 117. @dawnieando from  @MoveItMarketing TERM-FREQUENCY INVERSE DOCUMENT FREQUENCY Cruft  can  also  skew  term-­‐frequency   inverse  document  frequency AND  THE  QUERY  CLUSTERS  DOCUMENTS  BELONG  TO
  118. 118. @dawnieando from @MoveItMarketing The Generational ’Snail Trail’ • Old XML sitemaps • Redirects drop away on old site .htaccess • DNS issues • People link to old site but wrong protocol • Old sites not verified in GSC • Not all protocols redirecting Leaving its slithery footprint
  119. 119. @dawnieando from  @MoveItMarketing URL NORMALIZATION Can be problematic and ‘crufty’ too https://en.wikipedia.org/wiki/URL_normalization
  120. 120. @dawnieando from  @MoveItMarketing REDUCTION & REPOPULATION OF INTERNAL LINK POPULARITY (IBP) BETWEEN URL SCHEDULING IT’S  NOT  ONLY  THEIR  ‘INTERNAL  PAGE   RANK’  BUT  ALSO  THE  ANCHORS,  INTER-­‐ CONNECTING  CONCEPTUAL  /  TOPIC   RELEVANCE  IN  CONTENT  AND  THE  TEXT   SURROUNDING  INTERNAL  LINK  ANCHORS   (AND  PROBABLY  OTHER  THINGS  TOO) SEMANTIC  ’CLUES’  WERE  LOST  ALONG   THE  WAY SEMANTIC ‘CONTEXT’ & IBP BUCKET IS LEAKING
  121. 121. @dawnieando from  @MoveItMarketing SOLUTION: Wiki Page Redirects on Topics https://dbpedia.org/sparql Wikipedia   Redirects thesaurus.com OR  A  GOOD  OLD  FASHIONED  THESAURUS
  122. 122. @dawnieando from  @MoveItMarketing Understand How URLs with Multiple Parameters Are Handled The  most  restrictive  parameter  blocked  overrules   lesser  restrictions
  123. 123. @dawnieando from @MoveItMarketing THE USE OF REUSE TABLES. TABLE I: Reuse Table Example (URL Record No., URL Fingerprint (FP), Reuse Type, If Modified Since). Record 1: FP 2123242, REUSE. Record 2: FP 2323232, REUSE IF NOT MODIFIED SINCE Feb. 5, 2004. Record 3: FP 3343433, DOWNLOAD. … https://www.google.com/patents/US8042112
  124. 124. @dawnieando from @MoveItMarketing REMEMBER “Gone is Never Gone” “Search Engines Never Forget” Dawn Anderson @dawnieando
  125. 125. @dawnieando from  @MoveItMarketing REFERENCES
  126. 126. @dawnieando from @MoveItMarketing Sources & References • Bar-Yossef, Z., Keidar, I. and Schonfeld, U., 2009. Do not crawl in the dust: different urls with similar text. ACM Transactions on the Web (TWEB), 3(1), p.3. • Broder, A.Z., Najork, M. and Wiener, J.L., 2003, May. Efficient URL caching for world wide web crawling. In Proceedings of the 12th international conference on World Wide Web (pp. 679-689). ACM. • Cambazoglu, B.B. and Baeza-Yates, R., 2011. Scalability challenges in web search engines. In Advanced topics in information retrieval (pp. 27-50). Springer Berlin Heidelberg. • Cho, J., Garcia-Molina, H. and Page, L., 1998. Efficient crawling through URL ordering. Computer Networks and ISDN Systems, 30(1), pp.161-172. • Fetterly, D., Manasse, M., Najork, M. and Wiener, J., 2003, May. A large-scale study of the evolution of web pages. In Proceedings of the 12th international conference on World Wide Web (pp. 669-678). ACM.
  127. 127. @dawnieando from @MoveItMarketing Sources & References • Olston, C. and Najork, M., 2010. Web crawling. Foundations and Trends® in Information Retrieval, 4(3), pp.175-246. • Pandey, S. and Olston, C., 2008, February. Crawl ordering by search impact. In Proceedings of the 2008 International Conference on Web Search and Data Mining (pp. 3-14). ACM. • Olston, C. and Pandey, S., 2008, April. Recrawl scheduling based on information longevity. In Proceedings of the 17th international conference on World Wide Web (pp. 437-446). ACM. • Pandey, S. and Olston, C., 2005, May. User-centric web crawling. In Proceedings of the 14th international conference on World Wide Web (pp. 401-411). ACM.
  128. 128. @dawnieando from @MoveItMarketing Sources & References • https://patentimages.storage.googleapis.com/US8042112B1/US08042112-20111018-D00000.png • Randall, K.H., Google Inc., 2010. Scheduler for search engine crawler. U.S. Patent 7,725,452.
