
SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016

12,199 views

Published on

WHY YOU SHOULD CARE ABOUT TAKING CARE OF CRAWLS (INTELLIGENT USE OF CRAWL ALLOCATION (BUDGET)). Investigating 'crawl budget', 'crawl rank', 'crawl tank' and 'crawl scheduling by Search Engines'

Published in: Marketing
  • John Mueller just mentioned this presentation on an HOA as very insightful! Will look into it now.


  1. 1. SEO ‘Crawl Tank’ - ‘Death and Resurrection’. WHY YOU SHOULD CARE ABOUT TAKING CARE OF CRAWLS (INTELLIGENT USE OF CRAWL ALLOCATION (BUDGET)). THE QUEST FOR ‘CRAWL RANK’. Dawn Anderson @dawnieando
  2. 2. THE WEB IS ‘BIG’. The indexed Web contains at least 4.73 billion pages (13/11/2015). [Chart: total number of websites per year, 2000 to 2014, on a scale up to 1,000,000,000.] SINCE 2013 THE WEB IS THOUGHT TO HAVE INCREASED IN SIZE BY 1/3.
  3. 3. THE ABILITY TO ‘SELF PUBLISH’ EASILY HAS CLEARLY INFLUENCED THIS - WE ALL ‘LOVE CONTENT’. IMPORTANT TO NOTE THAT 75% OF WEBSITES ONLINE ARE DORMANT (E.G. PARKED DOMAINS). IMAGINE HOW MANY UNIQUE URLs THIS AMOUNTS TO COMBINED? A LOT. Source: http://www.internetlivestats.com/total-number-of-websites/
  4. 4. TOO MUCH CONTENT. There are capacity limits on Google’s crawling system. How have search engines responded? By prioritising URLs for crawling, by assigning crawl period intervals to URLs, and by creating work ‘schedules’ for Googlebots.
  5. 5. HERE’S WHY -> EVERYTHING HAS A FINITE CAPACITY (EVEN CRAWLING). “While web pages can be manually selected for crawling, this becomes impracticable as the number of web pages grows. Moreover, to keep within the capacity limits of the crawler, automated selection mechanisms are needed to determine not only which web pages to crawl, but which web pages to avoid crawling. For instance, as of the end of 2003, the WWW is believed to include well in excess of 10 billion distinct documents or web pages, while a search engine may have a crawling capacity that is less than half as many documents.” - Scheduler for search engine crawler, Google Patent US 8042112 B1 (Zhu et al)
  6. 6. SOME GOOGLE CRAWL SCHEDULER PATENTS include: ‘Managing items in crawl schedule’ - US 8666964 B1; ‘Scheduling a recrawl’ - US 8386459 B1; ‘Web crawler scheduler that utilizes sitemaps from websites’ - US 8037054 B2; ‘Document reuse in a search engine crawler’ - US 8707312 B1; ‘Minimizing visibility of stale content in web searching including revising web crawl intervals of documents’ - US 8407204 B2; ‘Scheduler for search engine crawler’ - US 8042112 B1; ‘Distributed crawling of hyperlinked documents’ - US 7305610 B1. IT SEEMS PRIORITIZATION AND GOOGLEBOT CRAWL EFFICIENCY ARE IMPORTANT TO SEARCH ENGINES.
  7. 7. “MANAGING ITEMS IN A CRAWL SCHEDULE” (GOOGLE PATENT US 8666964 B1). PAGE ‘IMPORTANCE’ AND URL SCHEDULING. 3 layers / tiers: Real Time Crawl (crawled multiple times daily); Daily Crawl (crawled daily or bi-daily); Base Layer Crawl (crawled least, on a ‘round robin’ basis; split into segments on random rotation; only the ‘active’ segment is crawled). URLs are moved in and out of layers based on past visits data (retrieved from logs).
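The three layers above can be pictured with a rough Python sketch. It is only an illustration: the `UrlHistory` fields, the `assign_layer` thresholds and the `active_segment` helper are hypothetical and are not taken from the patent, and the segment rotation here is a simple deterministic one where the slide describes a random rotation.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    REAL_TIME = "real time"  # crawled multiple times daily
    DAILY = "daily"          # crawled daily or bi-daily
    BASE = "base"            # crawled least; only the active segment is crawled


@dataclass
class UrlHistory:
    url: str
    importance: float        # hypothetical 0..1 page importance score
    changes_per_day: float   # hypothetical change rate learned from past visits


def assign_layer(h: UrlHistory) -> Layer:
    """Pick a layer from past-visit data. All thresholds are illustrative assumptions."""
    if h.importance >= 0.8 and h.changes_per_day >= 2:
        return Layer.REAL_TIME
    if h.importance >= 0.4 or h.changes_per_day >= 0.5:
        return Layer.DAILY
    return Layer.BASE


def active_segment(base_urls: list[str], segments: int, tick: int) -> list[str]:
    """Split base-layer URLs into segments and return the one that is active on this tick."""
    ordered = sorted(base_urls)
    return [u for i, u in enumerate(ordered) if i % segments == tick % segments]
```

Calling `assign_layer` again after each batch of visits is one way to model URLs being moved in and out of layers as their history changes.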
  8. 8. THE KEY SEARCH ENGINE (THE APPLIANCE) CHARACTERS: 10 types of Googlebot; The URL Scheduler; Indexer / Ranking Engine. SUPPORTING ROLES (LOG MANAGERS & PAGE RANKERS): History Logs; Link Logs / Link Maps; Anchor Logs / Anchor Maps; Status Logs; Page Rankers.
  9. 9. THE ‘LOG’ MANAGERS (‘The Clerks’). Consider these as ‘record-keepers’ (they record info on the crawled URLs). History Logs - JOBS INCLUDE: retrieves previous copies of documents for comparison with newly retrieved copies for purposes of ‘change frequency’ and ‘change weight’ calculation (last modified & update rate). Link Logs - JOBS INCLUDE: “identifies all the links (e.g., URLs, also called outbound links) that are found in the document associated with the record and the text that surrounds the link” (Brawer et al, Google Patent). INFO USED TO MAKE LINK MAPS. Other Logs include: • Anchor Logs & Maps • Status Logs. A LOT MORE INFO ON LOGS AT: Scheduler for Search Engine Crawler, US 20100241621 A1
  10. 10. SUPERVISOR - TEAM LEADER - ‘THE URL SCHEDULER’. Think of it as Google’s line manager or ‘air traffic controller’ for Googlebots in the web crawling system. JOBS: • Schedules Googlebot visits to URLs • Decides which URLs to ‘feed’ to Googlebot • Uses data from the history logs about past visits • Assigns visit regularity of Googlebot to URLs • Drops ‘hints’ to Googlebot to guide on types of content NOT to crawl and excludes some URLs from schedules • Analyses past ‘change’ periods and predicts future ‘change’ periods for URLs (BASED ON PAST VISIT DATA) for the purposes of scheduling Googlebot visits • Checks ‘page importance’ in scheduling visits (PRIORITIES) • Assigns URLs to ‘layers / tiers’ for crawling schedules (REAL TIME, DAILY, BASE LAYER SEGMENT). The URL Scheduler controls the meal planner and carefully controls the list of URLs Googlebot visits. The scheduler checks URLs for ‘importance’, ‘boost factor’ candidacy and ‘probability of modification’. ‘Budgets’ are allocated.
  11. 11. THE 10 GOOGLEBOTS. GOOGLEBOT WEB SEARCH. MEDIA TYPES: Image, Video, News. PAID SEARCH TYPES: Adsense, Adsbot. MOBILE TYPES: Smartphone, Apps, Featurephone, Mobile Adsense. BOT TYPES HAVE VARYING DEGREES OF ‘BUSY-NESS’. Callouts: Crawls images only; Quality checks; Babybot (‘the Noob’).
  12. 12. GOOGLEBOT JOBS: • ‘Ranks nothing at all’ • Takes a list of URLs to crawl from the URL Scheduler • Job varies based on ‘bot’ type (e.g. Image bot seems a bit of a ‘part timer’, as images change less frequently) • Runs errands & makes deliveries for the URL server, indexer / ranking engine and logs • Makes notes of outbound linked pages and additional links for future crawling (in order for them to be assigned to future crawling schedules) • Takes notes of ‘hints’ from the URL scheduler when crawling • Tells tales of URL accessibility status and server response codes, notes relationships between links and collects content checksums (binary data equivalent of web content) for comparison with past visits by history and link logs
  13. 13. ‘INDEXER’. Looks at all of the evidence from the various logs (and the page rankers) of the search engine to index the URLs. • Uses the combined data collected in order to index the results for a given query • TAKES DATA FROM THE LOGS TO GENERATE INDEXES. “The indexer(s) 724 use the anchor maps 718 and other logs 716 to generate index(es) 726. The index(es) are used by the search engine to identify documents matching queries entered by users of the search engine.” (Web crawler scheduler that utilizes sitemaps from websites, US 8037054 B2, Google Patent, Brawer et al, pub 2011)
  14. 14. BUT WHAT OTHER EVIDENCE DO WE HAVE TO SUPPORT OUR THEORIES? I ASKED JOHN MUELLER AT A WEBMASTER HANGOUT ABOUT URL QUEUES (GOOGLE WEBMASTER HANGOUT QUESTION ON ‘URL QUEUEING’). “URLS ARE NOT ALL CRAWLED IN ORDER, BUT THAT SOME RECEIVE MULTIPLE DAILY CRAWLS, SOME DAILY, SOME WEEKLY AND SOME VERY INFREQUENTLY” https://www.seroundtable.com/google-explains-why-the-search-console-has-reporting-delays-21688.html LOW IMPORTANCE URLs APPEAR TO BE ‘QUEUED FOR LATER’ AND VISITED INFREQUENTLY WHEN THERE IS SPARE CAPACITY (LOWER PRIORITY) (SCHEDULES)
  15. 15. WHICH APPEARED TO SUPPORT… “Priority scores are computed for each remaining document identifier based on predetermined criteria (e.g., a page importance score of the document).” (Zhu et al, 2011) PATENT - Scheduler for search engine crawler, US 8042112 B1
  16. 16. CRAWL BUDGET. 1. CRAWL BUDGET - “AN ALLOCATION OF CRAWL VISITS TO A HOST” 2. ROUGHLY PROPORTIONATE TO PAGERANK AND HOST SPEED / CAPACITY 3. PAGES WITH A LOT OF LINKS GET CRAWLED MORE 4. THE VAST MAJORITY OF URLS ON THE WEB DON’T GET A LOT OF BUDGET ALLOCATED TO THEM (LOW TO 0 PAGERANK URLS). Mostly taken from Eric Enge’s interview with Matt Cutts (@mattcutts) from 2010: https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
  17. 17. I ASKED SOME STUFF ABOUT CRAWL BUDGET ALLOCATION. DISTRIBUTED CRAWLING OF HYPERLINKED DOCUMENTS - Patent Abstract - “Hyperlinked documents to be crawled are grouped by host and the host to be crawled next is selected according to a stall time of the host. The stall time can indicate the earliest time that the host should be crawled and the stall times can be a predetermined amount of time, vary by host and be adjusted according to actual retrieval times from the host” (Dean et al (Google, 2014)). IT SEEMS BUDGET IS ASSIGNED TO THE HOST (I.P) AND THEN SHARED BETWEEN THE SITES THERE
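The stall-time idea in the abstract can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the patented method: the `HostQueue` class, the one-second default stall and the multiplier of ten on retrieval time are all hypothetical choices made for the example.

```python
import heapq


class HostQueue:
    """Choose the next host to crawl by earliest stall time (illustrative sketch only)."""

    def __init__(self, default_stall: float = 1.0, factor: float = 10.0):
        self.default_stall = default_stall  # assumed minimum wait between visits, in seconds
        self.factor = factor                # assumed multiplier applied to retrieval time
        self._heap: list[tuple[float, str]] = []  # (earliest next crawl time, host)

    def add_host(self, host: str, now: float = 0.0) -> None:
        """Register a host that may be crawled from time `now` onwards."""
        heapq.heappush(self._heap, (now, host))

    def next_host(self, now: float):
        """Return a host whose stall time has passed, or None if every host is stalled."""
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[1]
        return None

    def record_fetch(self, host: str, now: float, retrieval_seconds: float) -> None:
        """Re-queue the host; slower responses push its next visit further away."""
        stall = max(self.default_stall, self.factor * retrieval_seconds)
        heapq.heappush(self._heap, (now + stall, host))
```

In this sketch a host that answers in 0.05 seconds becomes eligible again after one second, while a host that takes two seconds waits twenty, which is one way to read "adjusted according to actual retrieval times from the host".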
  18. 18. I ASKED SOME STUFF ABOUT LINKS AND CRAWL BUDGET (in light of 2012 ‘DISAVOW TOOL’). TIP (IMHO - DAWN) - YOU MAY NEED TO RESTRUCTURE / FLATTEN SO ‘BUDGET’ CAN REACH IMPORTANT URLS. “Thanks John” - Waving :)
  19. 19. IT SEEMS THERE ARE MORE FACTORS AFFECTING ‘CRAWL BUDGET’?? WEB PROMOS Q & A WITH GOOGLE’S ANDREY LIPATTSEV. Andrey, chatting with Ammon J, seemed to imply that a lot more things affect crawl frequency now than just PageRank. Transcript: https://searchenginewatch.com/2016/04/06/webpromos-qa-with-googles-andrey-lipattsev-transcript/
  20. 20. ARE THERE OTHER FACTORS AFFECTING BUDGET AND / OR ‘CRAWL RANK’ AS WELL AS PAGERANK AND SPEED? I ASKED @johnmu IF I COULD ASK WHETHER THE FACTORS AFFECTING CRAWL BUDGET / CRAWL FREQUENCY HAD CHANGED, I.E. WHETHER THERE ARE ADDITIONAL FACTORS. JOHN SAID “Sure…You can always ask” :) :) BUT HE DIDN’T TELL ME WHAT THEY WERE (IF ANY)
  21. 21. GOOGLE PATENT - ‘NOT ALL ‘CHANGE’ IS CONSIDERED EQUAL’ (CRITICAL & NON-CRITICAL). “Changes can be described as critical or non-critical and that determination may depend on the portion of the document changed, or the context of the changes, rather than the amount of text or content changed. Sometimes a change to a document may be insubstantial, e.g., the change of advertisements associated with a document. In this case, it is more appropriate to ignore those accessory materials in a document prior to making content comparisons. In other cases, e.g., as part of a product search, not every piece of information in a document is weighted equally by a potential user. For instance, the user may care more about the unit price of the product and the availability of the product. In this case, it is more appropriate to focus on the changes associated with information that is deemed critical to a potential user rather than something that is less significant, e.g., a change in a product's colour” (Minimizing Visibility of Stale Content in Web Searching Including Revising Web Crawl Intervals of Documents - Anton Carver, Google Patent - US 20130226897 A1, pub 2013). Probability & predictability of future ‘freshness’ (newness or critical material change) (‘CHANGE RATE’ APPEARS TO BE ‘LEARNED’). ‘CHANGE RATE & CHANGE WEIGHT THRESHOLDS’
  22. 22. CRITICAL MATERIAL CONTENT CHANGE (IMPORTANT CHANGE) & FEATURE WEIGHTS. \( C = \sum_{i=0}^{n-1} \mathrm{weight}_i \cdot \mathrm{feature}_i \). NOT JUST ‘RANDOM’ CHANGE like Shuffle($variable) or RAND($variable). NOT ALL ‘FEATURES’ ARE CREATED EQUAL ACCORDING TO THIS LINE IN PATENTS - “weight i * feature”. EXAMPLE FEATURES - E.G. A CHANGE IN PRICE (FEATURE) MAY BE WEIGHTED HIGHER THAN A CHANGE IN COLOUR (FEATURE) - FEATURE WEIGHT PRICE > FEATURE WEIGHT COLOUR. “DEPENDS ON HOW OFTEN THE PAGE CHANGES” IS MENTIONED A LOT IN WEBMASTER HANGOUTS. Minimizing Visibility of Stale Content in Web Searching Including Revising Web Crawl Intervals of Documents - Anton Carver, Google Patent - US 20130226897 A1, pub 2013
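To make the weighted sum concrete, here is a small Python sketch. The feature names, the weights in `WEIGHTS` and the 0.5 threshold in `is_critical` are hypothetical values chosen for illustration; the patent does not publish actual weights.

```python
# Hypothetical feature weights: a price change matters more than a colour change.
WEIGHTS = {"price": 0.6, "availability": 0.3, "colour": 0.1}


def change_score(features: dict[str, float], weights: dict[str, float]) -> float:
    """C = sum of weight_i * feature_i. A feature with no known weight contributes 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def is_critical(features: dict[str, float], threshold: float = 0.5) -> bool:
    """Treat a change as critical when its weighted score reaches an assumed threshold."""
    return change_score(features, WEIGHTS) >= threshold
```

Here each feature value is 1.0 if that part of the page changed and 0.0 otherwise, so a price change alone scores 0.6 and counts as critical, while a colour change alone scores 0.1 and does not.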
  23. 23. “BE CONSISTENT” - (@johnmu, Nov 2015). SMX MILAN (November 2015), reported here by SERoundtable on quote from Google’s John Mueller @johnmu: https://www.seroundtable.com/google-number-one-seo-advice-be-consistent-21196.html DA - I HAVE A FEELING CONSISTENCY IS IMPORTANT FOR ‘HISTORY LOGS’ TO ‘LEARN’ CHANGE RATES / THRESHOLDS
  24. 24. URL EXCLUSIONS FOR TRIPPING ‘MINIMUM-CRAWL-THRESHOLD’ REVISIT ‘HINTS’ AND ‘SPAM’ URLs. ‘RANDOM’ CHANGE created programmatically like Shuffle($variable) or RAND($variable) may even be seen as ‘hints’ TO GOOGLEBOT TO ‘NOT’ CRAWL. HINTS = ‘MEH CHANGES’ (E.G. PATTERNS OF ‘SAME OLD, SAME OLD STUFF’, DUPLICATES, PROGRAMMATICALLY GENERATED CONTENT). “Hints may also be employed on pages that are automatically generated and/or contain dynamically generated elements that result in the page having a different checksum every time it is crawled” (Managing Items In A Crawl Schedule, Google Patent - US 8666964 B1)
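A rough Python sketch of how "a different checksum every time it is crawled" could be spotted from stored visit data. It is an assumption-laden illustration: SHA-256, the `min_visits` cut-off and both function names are choices made for this example and do not describe how Google computes checksums or hints.

```python
import hashlib


def content_checksum(html: str) -> str:
    """Checksum of the fetched content (SHA-256 is used here only for illustration)."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def changes_every_visit(checksums: list[str], min_visits: int = 4) -> bool:
    """True when every recorded visit produced a checksum different from the one before."""
    if len(checksums) < min_visits:
        return False  # not enough history to call it a pattern
    return all(a != b for a, b in zip(checksums, checksums[1:]))
```

A page that reshuffles its listings on each request would return True here even though nothing of substance changed, which is the kind of pattern the slide suggests may work against a URL.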
  25. 25. GOOGLE THINKS CRAWL BUDGET IS IMPORTANT FOR SEO (CIRCA JULY 2015). BUT… NO ONE HAS EVER OFFICIALLY SAID THAT THERE’S ANY KIND OF RANKING BENEFIT FROM POSITIVE CRAWL ACTIVITY
  26. 26. ENTER ‘CRAWL RANK’ - A BENEFIT OF CRAWL OPTIMISATION?? “The pages that aren’t crawled as often are pages with little to no PageRank. CrawlRank is the difference in this very large pool of pages. You win if you get your low PageRank pages crawled more frequently than the competition.” “I’m still not entirely convinced this is what is happening, but I’m seeing success using this philosophy.” - A J Kohn @ajkohn. OTHERS SEEM TO BE TRACKING IT TOO - E.G. SEO CLARITY. DOES THE MYTHOLOGICAL ‘CRAWL RANK’ BENEFIT EVEN EXIST?
  27. 27. DOES ‘CRAWL RANK’ STILL APPLY? I ASKED A J KOHN IF HE STILL THOUGHT IT APPLIED NOW. “I still see evidence that getting pages crawled frequently (within 7-10 days) seems to have an impact on their ability to rank well” (AJ Kohn, 2016). “Thanks A.J” - Waving :)
  28. 28. IS LONG-TAIL ‘LEAP-FROGGING’ (AND SOME CLUSTERING) WHAT ‘CRAWL RANK’ LOOKS LIKE? SITES JUMPING OVER EACH OTHER ON ‘LONG TAILED QUERIES’ IN AN ENDLESS LAST LAP RACE?
  29. 29. HOW IT APPEARS TO WORK - ‘YOU DON’T ALWAYS HAVE TO FIGHT THE ‘BOSS’ URLS’. Why fight with the Hulk when you can be Yoda? Image credit: Flickr
  30. 30. EVEN STRONGER DOMAINS HAVE WEAKER URLS. THE SITES MAY ALL BE STRONGER THAN YOU BUT THERE ARE A LOT OF PAGES ON BIG SITES WITH NO STRENGTH. YOU WON’T BEAT THE STRONG URLs WITH CRAWL OPTIMISATION ALONE. Strong URLs: you are unlikely to beat these URLs with crawl optimisation techniques alone. These URLs are not the intended target for these tactics - TOO STRONG. SAVE SOME BATTLES FOR LATER
  31. 31. FIGHT AT A URL V URL OR TEMPLATE V TEMPLATE LEVEL WITH LOW TO 0 PAGERANK URLS. PICK OFF THE WEAKER URLS WHEN BATTLING WITH A BIG SITE - LOW TO NO PAGERANK URLS. • TARGET THE LOW STRENGTH PAGES FURTHER DOWN IN THE SITES OF COMPETITORS (SUBCATEGORY PAGES, E.G. IN ECOMMERCE SITES) • THERE ARE A LOT OF PAGES (MILLIONS) WITH LITTLE TO NO PAGERANK • YOU’RE AIMING TO BEAT THOSE. Weak URLs: VIRTUALLY NO STRENGTH IN 1,000s OF URLS. POWERFUL, WELL KNOWN BRANDS BUT NO STRENGTH LOWER DOWN THE ARCHITECTURE. MANY LOW VOL / DEEP URLs ARE COMPLETE WEEDS ON BEHEMOTH SITES
  32. 32. A BIG FACTOR? - ‘EMPHASIS OF ‘URL IMPORTANCE’’ (E.G. ON PARAMETERS). THIS WAS IN THE ORIGINAL INTERVIEW WITH MATT CUTTS. ALSO LOTS OF THE PATENTS MENTION “PAGE IMPORTANCE (WHICH MAY INCLUDE PAGERANK)”. FULL TRANSCRIPT - https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
  33. 33. WHICH SEEMS TO SUPPORT THIS PAPER BY PAGE ET AL ON IMPORTANCE: Efficient Crawling Through URL Ordering (Page et al). THIS REFERENCES THE PROBLEM OF THE SIZE OF THE WEB AND PRIORITIZES IMPORTANT PAGES. “Thanks Bill” - Waving :)
  34. 34. ‘POINT TO THE NEEDLE IN THE HAY’ - EMPHASISE IMPORTANCE. • Googlebot is also ‘hunting’… Hunting for relevant ‘needles’ in 1,000,000,000s of straws of ‘hay’ on the web • It’s about making your ‘one needle’ stand out in importance in not just your own site’s haystack, but among tens of thousands of competing similar straws of hay in other sites’ haystacks… (DON’T JUST MAKE YOUR HAYSTACK BIGGER). “Hey, you Googlebot… This is the needle” via architectural internal linking without blur of duplication or too many redirects or canonicalization
  35. 35. WHICH OF YOUR URLs ARE IMPORTANT? “If you don’t consistently indicate the importance of your URLs, via clean internal emphasis of individual URL importance, how will Googlebot know which are the most important?”
  36. 36. INTERNAL LINKS COUNT (A LOT) (RELATIVE IMPORTANCE VOTES ON URL IMPORTANCE FROM YOUR OWN SITE). THESE ARE YOUR ‘VOTES’ TO GOOGLEBOT ON THE IMPORTANCE OF EACH URL. EMPLOY ‘CONSISTENT’ INTERNAL LINK STRATEGIES. THINK OF THESE AS ‘WALL-TIES’ HOLDING YOUR BUILDING (SITE ARCHITECTURE) TOGETHER. STOP VOTING FOR THE WRONG URLS FROM WITHIN YOUR OWN SITE. WRONG TARGETS RANKING?… CHECK INTERNAL LINKS. From Google Support Pages: consistent internal & external emphasis of a URL’s ‘IMPORTANCE’
  37. 37. BUT IS THERE PERHAPS AN OPPOSITE OF ‘CRAWL RANK’? - ‘CRAWL TANK’?? IS THERE AN ADVERSE EFFECT WHEN CRAWLING GOES BAD? NEGATIVE CONSEQUENCES FROM POOR CRAWL VISITS (E.G. SPIDER TRAPS (INFINITE LOOPS), INDIVIDUAL URLS VISITED LESS AND LESS FREQUENTLY BECAUSE THERE ARE TOO MANY)
  38. 38. WELL - I’VE SEEN ‘CRAWL TANK’ - IT AIN’T PRETTY. SITE SEO DEATH BY TOO MANY URLS AND INSUFFICIENT CRAWL BUDGET TO SUPPORT THEM (EITHER DUMPING A NEW THIN PARAMETER INTO A SITE OR AN INFINITE LOOP (CODING ERROR) (SPIDER TRAP)). “BEEN THERE, DONE THAT”
  39. 39. IT KIND OF LOOKS A BIT LIKE THIS. “BEEN THERE, DONE THAT” DEFINITELY
  40. 40. ‘EXPONENTIAL URL UNIMPORTANCE’? Your URLs exponentially, CONSISTENTLY confirmed unimportant to queries with each iterative crawl visit to other similar or duplicate content checksum URLs? MULTIPLE RANDOM URLs competing for the same query confirm irrelevance of all competing in-site URLs with no dominant relevant IMPORTANT URL?
  41. 41. STILL…SILVER LININGS. “EVERY SEO NEEDS A ‘FLATLINER’ SITE TO RESURRECT AND MAKE BETTER…” RIGHT?
  42. 42. LIKES: Going ‘where the action is’ in sites; The ‘need for speed’; Logical structure; Correct ‘response’ codes; XML sitemaps; ‘Successful’ crawl visits; ‘Seeing everything’ on a page; Taking ‘hints’; Clear unique single ‘URL fingerprints’ (no duplicates); Predicting likelihood of ‘future change’. DISLIKES: Slow sites; Too many redirects; Being bored (Meh) (‘hints’ are built in by the search engine systems - takes ‘hints’); Being lied to (e.g. on XML sitemap priorities); Crawl traps and dead ends; Going round in circles (infinite loops); Spam URLs; Crawl-wasting minor content change URLs; ‘Hidden’ and blocked content; Uncrawlable URLs; Duplicate URLs. CHANGE IS KEY: Not just any change; Critical material change; Predicting future change; Dropping ‘hints’ to Googlebot; Sending Googlebot where ‘the action is’. BASED ON DATA FROM THE HISTORY LOGS - CAN WE INFLUENCE VIA CRAWL OPTIMISATION TO ESCAPE THE ‘BASE LAYER HOME’ OF THE ‘UNIMPORTANT’ URLS?
  43. 43. HERE’S ONE I MADE EARLIER…SOME CAVEATS. THIS IS A PERSONAL PROJECT - MY 20 IN A 70:20:10 MIX. IT’S NOT MOBILE FRIENDLY OR HTTPS (HANGS HEAD IN SHAME), AND YES, IT NEEDS A MAKEOVER… BUT… TIME…, RESOURCES, BUDGET…BLAH BLAH. THERE IS NO ‘BIG BRAND’ MARKETING, VC BACKING, TV OR RADIO ADS (LIKE COMPETITORS) - JUST ME - ‘CHIPPING AWAY’. 90%+ OF TRAFFIC IS NON-BRANDED GENERIC ORGANIC
  44. 44. URL CRAWL FREQUENCY ‘CLOCKING’. Spreadsheet provided by @johnmu during Webmaster Hangout: https://goo.gl/1pToL8 ARE THE URLS THAT YOU WANT BEING CRAWLED ‘REAL TIME’, DAILY OR INFREQUENTLY? (REGULAR LOG ANALYSIS AND INTERVENTION TO EMPHASISE IMPORTANCE) MY THOUGHTS (DA) - You need to find out which URLs are getting crawled in the ‘real time’ schedule, the ‘daily crawl’ schedule and via random selection in the ‘dross’ (or UNLIKELY TO CHANGE A LOT / UNIMPORTANT) ‘base layer’ section. If they are not the URLs that you want to be there, then formulate a plan to improve the ‘importance’ of URLs. (NOTE: JOHN DID NOT SAY THIS)
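One way to start ‘clocking’ crawl frequency is to count Googlebot requests per URL in server access logs. The Python sketch below assumes logs in the common Apache/Nginx "combined" format; the `LOG_LINE` pattern, the function names and the bucket cut-offs are illustrative assumptions, not anything Google publishes. A user agent string can be spoofed, so a real audit would also verify that the requests come from Google (for example with a reverse DNS lookup).

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumes combined log format: ip - - [time] "METHOD path proto" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'\[(?P<ts>[^\]]+)\] "(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def googlebot_visits(lines):
    """Map each requested path to the datetimes at which a 'Googlebot' user agent fetched it."""
    visits = defaultdict(list)
    for line in lines:
        m = LOG_LINE.search(line)
        if m and "Googlebot" in m.group("ua"):
            ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
            visits[m.group("path")].append(ts)
    return visits


def crawl_bucket(timestamps, period_days):
    """Rough frequency bucket from visits per day over the log period (cut-offs are assumed)."""
    per_day = len(timestamps) / period_days
    if per_day > 1:
        return "multiple daily"
    if per_day >= 0.5:
        return "daily or bi-daily"
    return "infrequent"
```

Running `crawl_bucket` over each path's visits for, say, a 30-day log gives a first view of which URLs appear to be treated as ‘real time’, daily or base layer, which can then be compared with the URLs you actually want crawled often.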
  45. 45. LOSE THE ‘DEAD WOOD’ SO GOOGLEBOT DETECTS ‘IMPORTANCE’. FIX IT FOR A BETTER CRAWL. EMBRACE THE ‘410 GONE’. FLATTENING ARCHITECTURES, CONSISTENTLY AVOIDING CANNIBALISATION, INTERNAL LINK STRATEGIES, LINKING RELEVANT CONTENT TO RELEVANT CONTENT, UTILISING XML & FRONT FACING SITEMAPS AND STRONG HUB PAGES TO ‘HERD’ GOOGLEBOT AROUND THE SITE
  46. 46. 40,000 TOWNS, CITIES & VILLAGES. 40,000+ towns, cities and villages across the UK multiplied by X site categories (THAT’S A LOT OF LONG TAIL QUERY VOLUME)
  47. 47. FWIW - LONG TAIL CRAWL TECHNIQUES SEEM TO APPLY TO OTHER SEARCH ENGINES TOO. By shortening crawl paths and crawl frequency intervals and emphasising the importance of subcategory URLs on frequently changed (fresh) URLs, it appears you may gain a competitive advantage on long tail queries
  48. 48. IT’S ALIVE… NEEDS WORK… BUT ALIVE. CAVEAT: IT’S TOO COMPLEX TO ANSWER WITH A SIMPLE FEW EXAMPLES OF COURSE (TOO MANY FACTORS) - BUT… FOOD FOR THOUGHT. ‘CRITICAL MATERIAL CHANGE FREQUENCY’ (FRESHNESS) AND DETECTED URL IMPORTANCE EMPHASIS VIA EXTERNAL OR INTERNAL SIGNALS (INC PAGERANK) SEEM KEY. IS IT ‘CRAWL RANK’, OR IS IT ‘EMPHASISING URL IMPORTANCE’ BETTER THAN COMPETITORS EMPHASISE THE IMPORTANCE OF LOW TO NO PAGERANK PAGES, WHERE FEW OTHER FACTORS SEPARATE THEM?
  49. 49. CRAWL BUDGET & ‘CRAWL RANK’ - OTHER FACTORS?? 1. IT APPEARS TO BE APPORTIONED BY THE URL SCHEDULER (BUDGET) 2. PAGES WITH A LOT OF (HEALTHY??) LINKS GET CRAWLED MORE (EXTERNAL AND INTERNAL?) (BUDGET AND RANK?) 3. THERE ARE URL EXCLUSIONS (‘HINT TRIPPERS’, OBJECTIONABLE CONTENT AND ‘SPAM URLS’??) (BUDGET) 4. ‘CRITICAL MATERIAL CHANGE’ (FRESHNESS) AND THE PROBABILITY AND PREDICTABILITY OF CHANGE CORRELATE (BUDGET) 5. ‘CONSISTENT’ EMPHASIS OF URL IMPORTANCE MAY BE ‘CRAWL RANK’ (BUT I THINK THAT THIS WAS ALWAYS THERE) (BUDGET AND RANK??). ‘CRAWL RANK’ - IS IT CORRELATION OR CAUSATION? (DO IMPORTANT PAGES GET CRAWLED MORE, OR ARE THEY IMPORTANT BECAUSE THEY ARE CRAWLED MORE?)
  50. 50. CAN WEB PAGES CRAWLED INFREQUENTLY STILL RANK? YES. THEY CAN STILL BE ‘IMPORTANT’. IT’S THE ONES YOU’RE INDICATING ARE UNIMPORTANT THAT YOU WANT TO KEEP AN EYE ON - #JUSTSAYING ;)
  51. 51. “BE SMART ABOUT YOUR TAGS AND SITE ARCHITECTURE, STAY FRESH AND RELEVANT” (@maileohye, 2016). SLIDE FROM APRIL 2016’S SEJSUMMIT ON SEO INSTRUCTIONS 2016 FROM GOOGLE’S @maileohye
  52. 52. EITHER WAY - ARE ALL THE CHECKS AND BALANCES INDICATING YOU ARE STILL ON TRACK? BECAUSE BRINGING A ROCKET BACK ON COURSE IS ‘CHALLENGING’. REGULAR TESTS AND EARLY DIAGNOSIS ARE CRUCIAL - STOP, CHECK AND KEEP CHECKING. ‘TANK’ OR ‘RANK’? - YOU DECIDE
  53. 53. THANKS FOR LISTENING FOLKS :) ENJOY BRIGHTON SEO. Dawn Anderson @dawnieando. TWITTER - @dawnieando; GOOGLE+ - +DawnAnderson888; LINKEDIN - msdawnanderson
  54. 54. REFERENCES: http://www.internetlivestats.com/total-number-of-websites/; Scheduler for search engine crawler, Google Patent US 8042112 B1 (Zhu et al) - https://www.google.com/patents/US8707313; Managing items in crawl schedule, Google Patent (Alpert) - http://www.google.ch/patents/US8666964; Document reuse in a search engine crawler, Google Patent (Zhu et al) - https://www.google.com/patents/US8707312; Web crawler scheduler that utilizes sitemaps (Brawer et al) - http://www.google.com/patents/US8037054; Distributed crawling of hyperlinked documents (Dean et al) - http://www.google.co.uk/patents/US7305610; Minimizing visibility of stale content (Carver) - http://www.google.ch/patents/US20130226897
  55. 55. REFERENCES: Efficient Crawling Through URL Ordering (Page et al) - http://oak.cs.ucla.edu/~cho/papers/cho-order.pdf; Crawl Optimisation (Blind Five Year Old, A J Kohn, @ajkohn) - http://www.blindfiveyearold.com/crawl-optimization; Scheduling a recrawl (Auerbach) - http://www.google.co.uk/patents/US8386459; Scheduler for search engine crawler (Zhu et al) - http://www.google.co.uk/patents/US8042112; Google Explains Why The Search Console Reporting Is Not Real Time (SERoundtable) - https://www.seroundtable.com/google-explains-why-the-search-console-has-reporting-delays-21688.html; Crawl Data Aggregation Propagation (Mueller) - https://goo.gl/1pToL8; Matt Cutts Interviewed By Eric Enge - https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/; Web Promo Q and A with Google’s Andrey Lipattsev - https://searchenginewatch.com/2016/04/06/webpromos-qa-with-googles-andrey-lipattsev-transcript/; Google Number 1 SEO Advice - Be Consistent - https://www.seroundtable.com/google-number-one-seo-advice-be-consistent-21196.html
