
Negotiating crawl budget with googlebots


  1. NEGOTIATING CRAWL BUDGET WITH GOOGLEBOTS: Using 'page importance' in an ongoing conversation with Googlebot to get just a bit more than your allocated crawl budget. Dawn Anderson @dawnieando
  2. Another Rainy Day In Manchester. @dawnieando
  3. WTF???
  4. 1994-1998: "THE GOOGLE INDEX IN 1998 HAD 60 MILLION PAGES" (GOOGLE). (Source: Wikipedia.org)
  5. 2000: "INDEXED PAGES REACHES THE ONE BILLION MARK" (GOOGLE), "IN OVER 17 MILLION WEBSITES" (INTERNETLIVESTATS.COM)
  6. 2001 ONWARDS: ENTER WORDPRESS, DRUPAL, PHP-DRIVEN CMSs, ECOMMERCE PLATFORMS, DYNAMIC SITES AND AJAX, WHICH CAN GENERATE 10,000s, 100,000s OR 1,000,000s OF DYNAMIC URLS ON THE FLY FROM DATABASE 'FIELD-BASED' CONTENT. DYNAMIC CONTENT CREATION GROWS. ENTER FACETED NAVIGATION (WITH MANY PATHS TO THE SAME CONTENT). 2003: WE'RE AT 40 MILLION WEBSITES.
  7. 2003 ONWARDS: USERS BEGIN TO JUMP ON THE CONTENT GENERATION BANDWAGON. LOTS OF CONTENT, IN MANY FORMS.
  8. WE KNEW THE WEB WAS BIG… (GOOGLE, 2008). "1 trillion (as in 1,000,000,000,000) unique URLs on the web at once!" (Jesse Alpert on Google's Official Blog, 2008). 2008: EVEN GOOGLE ENGINEERS STOPPED IN AWE. https://googleblog.blogspot.co.uk/2008/07/we-knew-web-was-big.html
  9. 2010: USER-GENERATED CONTENT GROWS. "Let me repeat that: we create as much information in two days now as we did from the dawn of man through 2003." "The real issue is user-generated content." (Eric Schmidt, 2010, Techonomy Conference Panel). Source: http://techcrunch.com/2010/08/04/schmidt-data/
  10. CONTENT KEEPS GROWING. The indexed Web contains at least 4.73 billion pages (13/11/2015). [Chart: total number of websites, 2000-2014.] THE NUMBER OF WEBSITES DOUBLED BETWEEN 2011 AND 2012, AND GREW AGAIN BY A THIRD IN 2014.
  11. 2014: WE PASS A BILLION INDIVIDUAL WEBSITES ONLINE. EVEN SIR TIM BERNERS-LEE (inventor of the WWW) TWEETED.
  12. 2014: WE ARE ALL PUBLISHERS. Source: http://wordpress/activity/posting
  13. YUP, WE ALL 'LOVE CONTENT'. IMAGINE HOW MANY UNIQUE URLs, COMBINED, THIS AMOUNTS TO – A LOT. http://www.internetlivestats.com/total-number-of-websites/
  14. CAPACITY LIMITATIONS – EVEN FOR SEARCH ENGINES. "As of the end of 2003, the WWW is believed to include well in excess of 10 billion distinct documents or web pages, while a search engine may have a crawling capacity that is less than half as many documents" (MANY GOOGLE PATENTS). Source: Scheduler for search engine crawler, Google Patent US 8042112 B1 (Zhu et al)
  15. NOT ENOUGH TIME – SOME THINGS MUST BE FILTERED. "So how many unique pages does the web really contain? We don't know; we don't have time to look at them all! :-)" (Jesse Alpert, Google, 2008). Source: https://googleblog.blogspot.co.uk/2008/07/we-knew-web-was-big.html
  16. A LOT OF THE CONTENT IS 'KIND OF THE SAME'. "There's a needle in here somewhere." "It's an important needle too."
  17. WHAT IS THE SOLUTION? How have search engines responded to the capacity limits on Google's crawling system? By prioritising URLs for crawling, by assigning crawl period intervals to URLs, and by creating work 'schedules' for Googlebots. "To keep within the capacity limits of the crawler, automated selection mechanisms are needed to determine not only which web pages to crawl, but which web pages to avoid crawling." – Scheduler for search engine crawler (Zhu et al)
  18. EFFICIENCY IS NECESSARY. GOOGLE CRAWL SCHEDULER PATENTS include: 'Managing items in a crawl schedule', 'Scheduling a recrawl', 'Web crawler scheduler that utilizes sitemaps from websites', 'Document reuse in a search engine crawler', 'Minimizing visibility of stale content in web searching including revising web crawl intervals of documents', and 'Scheduler for search engine'.
  19. CRAWL BUDGET: 1. Crawl budget – "an allocation of crawl frequency visits to a host (IP level)". 2. Roughly proportionate to PageRank and host load / speed / host capacity. 3. Pages with a lot of links get crawled more. 4. The vast majority of URLs on the web don't get a lot of budget allocated to them (low to 0 PageRank URLs). https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
  20. BUT… MAYBE THINGS HAVE CHANGED? CRAWL BUDGET / CRAWL FREQUENCY IS NOT JUST ABOUT HOST LOAD AND PAGERANK ANY MORE.
  21. STOP THINKING IT'S JUST ABOUT 'PAGERANK'. "You keep focusing on PageRank"… "There's a shit-ton of other stuff going on" (Illyes, G, Google, 2016). http://www.youtube.com/watch?v=GVKcMU7YNOQ&t=4m45s
  22. THERE ARE A LOT OF OTHER THINGS AFFECTING 'CRAWLING'. WEB PROMOS Q&A WITH GOOGLE'S ANDREY LIPATTSEV. Transcript: https://searchenginewatch.com/2016/04/06/webpromos-qa-with-googles-andrey-lipattsev-transcript/
  23. WHY? BECAUSE… THE WEB GOT 'MAHOOOOOSIVE', AND CONTINUES TO GET 'MAHOOOOOOSIVER'. SITES GOT MORE DYNAMIC, COMPLEX, AUTO-GENERATED, MULTI-FACETED, DUPLICATED, INTERNATIONALISED AND BIGGER, AND BECAME PAGINATED AND SORTED.
  24. GOOGLEBOT'S TO-DO LIST GOT REALLY BIG. WE NEED MORE WAYS TO GET MORE EFFICIENT AND FILTER OUT TIME-WASTING CRAWLING, SO WE CAN FIND IMPORTANT CHANGES QUICKLY.
  25. FURTHER IMPROVED CRAWLING EFFICIENCY SOLUTIONS NEEDED: hard and soft crawl limits; importance thresholds; min and max 'hints' and 'hint ranges'; importance crawl periods; scheduling; prioritization; tiered crawling buckets ('real time', 'daily', 'base layer').
  26. SEVERAL PATENTS UPDATED (THEY SEEM TO WORK TOGETHER): 'Managing URLs' (Alpert et al, 2013) – page importance determining soft and hard limits on crawling; 'Managing Items in a Crawl Schedule' (Alpert, 2014); 'Scheduling a Recrawl' (Auerbach, Alpert, 2013) – predicting change frequency in order to schedule the next visit, employing hints (min and max); 'Minimizing visibility of stale content in web searching including revising web crawl intervals of documents' – includes employing hints to detect pages NOT to crawl.
  27. MANAGING ITEMS IN A CRAWL SCHEDULE (GOOGLE PATENT): 3 layers / tiers / buckets for scheduling. Real Time Crawl – crawled multiple times daily. Daily Crawl – crawled daily or bi-daily. Base Layer Crawl (most unimportant) – crawled least, on a 'round robin' basis; split into segments on random rotation, and only the 'active' segment is crawled. URLs are moved in and out of layers based on past-visit data.
  28. CAN WE ESCAPE THE 'BASE LAYER' CRAWL BUCKET RESERVED FOR 'UNIMPORTANT' URLS?
  29. SOME OF THE MAJOR SEARCH ENGINE CHARACTERS: the 10 types of Googlebot, the History Logs / History Server, and the URL Scheduler / Crawl Manager.
  30. HISTORY LOGS / HISTORY SERVER: builds a picture of historical data, the past behaviour of the URL and its 'importance' score, to predict and plan for future crawl scheduling.
  • Last crawled date
  • Next crawl due
  • Last server response
  • Page importance score
  • Collaborates with link logs
  • Collaborates with anchor logs
  • Contributes info to scheduling
  31. 'BOSS' – URL SCHEDULER / URL MANAGER: think of it as Google's line manager or 'air traffic controller' for Googlebots in the web crawling system. JOBS:
  • Schedules Googlebot visits to URLs
  • Decides which URLs to 'feed' to Googlebot
  • Uses data from the history logs about past visits (change rate and importance)
  • Calculates the importance crawl threshold
  • Assigns visit regularity of Googlebot to URLs
  • Drops 'max and min hints' to Googlebot to guide on types of content NOT to crawl, or to crawl as exceptions
  • Excludes some URLs from schedules
  • Assigns URLs to 'layers / tiers' for crawling schedules
  • Checks URLs for 'importance', 'boost factor' candidacy and 'probability of modification'
  • Budgets are allocated to IPs and shared amongst the domains there
  32. GOOGLEBOT – CRAWLER. JOBS:
  • 'Ranks nothing at all'
  • Takes a list of URLs to crawl from the URL Scheduler
  • Runs errands and makes deliveries for the URL server, the indexer / ranking engine and the logs
  • Makes notes of outbound linked pages and additional links for future crawling
  • Follows directives (robots) and takes 'hints' when crawling
  • Tells tales of URL accessibility status and server response codes, notes relationships between links, and collects content checksums (the binary-data equivalent of web content) for comparison with past visits by the history and link logs
  • Will go beyond the crawl schedule if it finds something more important than the URLs scheduled
  33. WHAT MAKES THE DIFFERENCE BETWEEN BASE LAYER AND 'REAL TIME' SCHEDULE ALLOCATION?
  34. CONTRIBUTING FACTORS: 1. Page importance (which may include PageRank). 2. Hints (max and min). 3. Soft and hard crawl limits. 4. Host load capability and past site performance (speed and access), at IP level and domain level within it. 5. Probability / predictability of 'CRITICAL MATERIAL' change, plus the importance crawl period.
  35. 1 – PAGE IMPORTANCE: the importance of a page independent of a query. IMPORTANT PARENTS ARE LIKELY SEEN TO HAVE IMPORTANT CHILD PAGES.
  • Location in site (e.g. the home page is more important than third-level parameter output)
  • PageRank
  • Page type / file type
  • Internal PageRank
  • Internal backlinks
  • In-site anchor text consistency
  • Relevance (content, anchors and elements) to a topic (similarity importance)
  • Directives from in-page robots and robots.txt management
  • Parent quality brushes off on child page quality
  36. 2 – HINTS: 'MIN' HINTS AND 'MAX' HINTS. E.g. rel="prev" and rel="next" act as hints to Google, not absolute directives: https://support.google.com/webmasters/answer/1663744?hl=en&ref_topic=4617741
  MIN HINT / MIN HINT RANGES:
  • e.g. programmatically generated content which changes the content checksum on every load
  • Unimportant duplicate parameter URLs
  • Canonicals
  • rel=next, rel=prev
  • hreflang
  • Duplicate content
  • Spammy URLs?
  • Objectionable content
  MAX HINT / MAX HINT RANGES:
  • Change considered 'CRITICAL MATERIAL CHANGE' (useful to users, e.g. availability, price), and/or improved site sections, or change to IMPORTANT but infrequently changing content
  • Important pages / page range updates
  37. 3 – HARD AND SOFT LIMITS ON CRAWLING: a 'soft' crawl limit is set (the original schedule) and a 'hard' crawl limit is set (e.g. 130% of the schedule) for important findings. If URLs are discovered during crawling that are more important than those scheduled to be crawled, Googlebot can go beyond its schedule to include them, up to the hard crawl limit.
  38. 4 – HOST LOAD CAPACITY / PAST SITE PERFORMANCE: Googlebot has a list of URLs to crawl. Naturally, if your site is fast that list can be crawled quicker. If Googlebot experiences 500s, for example, she will retreat, and 'past performance' is noted. If Googlebot doesn't get round the list you may end up with 'overdue' URLs to crawl.
  39. 5 – CHANGE:
  • Not all change is considered equal
  • There are many dynamic sites with low-importance pages changing frequently – SO WHAT
  • Constantly changing your page just to get Googlebot back won't work if the page is low importance (importance crawl period < change rate) – POINTLESS
  • Hints are employed to detect pages which simply change the content checksum with every visit
  • Features are weighted for change importance to the user (e.g. price > colour)
  • Change identified as useful to users is considered 'CRITICAL MATERIAL CHANGE'
  • Don't just try to randomise things to catch Googlebot's eye
  • That counter or clock you added probably isn't going to help you get more attention; nor will random or shuffle
  • Change on some types of pages is more important than on others (e.g. CNN home page > SME about-us page)
  40. FACTORS AFFECTING HIGHER GOOGLEBOT VISIT FREQUENCY:
  • Current capacity of the web crawling system is high
  • Your URL has a high 'importance score'
  • Your URL is in the real time (HIGH IMPORTANCE), daily crawl (LESS IMPORTANT) or 'active' base layer segment (UNIMPORTANT BUT SELECTED)
  • Your URL changes a lot with CRITICAL MATERIAL CONTENT change (AND IS IMPORTANT)
  • Probability and predictability of CRITICAL MATERIAL CONTENT change is high for your URL (AND THE URL IS IMPORTANT)
  • Your website speed is fast and Googlebot gets the time to visit your URL on its bucket list of scheduled URLs for that visit
  • Your URL has been 'upgraded' to a daily or real time crawl layer as its importance is detected as raised
  • History logs and the URL Scheduler 'learn' together
  41. FACTORS AFFECTING LOWER GOOGLEBOT VISIT FREQUENCY:
  • Current capacity of the web crawling system is low
  • Your URL has been detected as a 'spam' URL
  • Your URL is in an 'inactive' base layer segment (UNIMPORTANT)
  • Your URLs are 'tripping hints' built into the system to detect non-critical-change dynamic content
  • Probability and predictability of critical material content change is low for your URL
  • Your website speed is slow and Googlebot doesn't get the time to visit your URL
  • Your URL has been 'downgraded' to an 'inactive' base layer segment (UNIMPORTANT)
  • Your URL has returned an 'unreachable' server response code recently
  • In-page robots management or robots.txt send the wrong signals
  42. GET MORE CRAWL BY 'TURNING GOOGLEBOT'S HEAD': MAKE YOUR URLS MORE IMPORTANT AND 'EMPHASISE' IMPORTANCE.
  43. GOOGLEBOT DOES AS SHE'S TOLD – WITH A FEW EXCEPTIONS:
  • Hard limits and soft limits
  • Follows 'min' and 'max' hints
  • If she finds something important she will go beyond a scheduled crawl (SOFT LIMIT) to seek out importance (up to the HARD LIMIT)
  • You need to IMPRESS Googlebot
  • If you 'bore' Googlebot she will return to boring URLs less (e.g. pages that are all the same (duplicate content) or dynamically generated low-usefulness content)
  • If you 'delight' Googlebot she will return to delightful URLs more (they became more important or they changed with 'CRITICAL MATERIAL CHANGE')
  • If she doesn't get her crawl completed you will end up with an 'overdue' list of URLs to crawl
  44. GETTING MORE CRAWL BY IMPROVING PAGE IMPORTANCE:
  • Your URL became more important and achieved a higher 'importance score' via increased PageRank
  • Your URL became more important via increased IB(P) (INTERNAL BACKLINKS IN YOUR OWN SITE) relative to other URLs within your site (you emphasised importance)
  • You made the URL content more relevant to a topic and improved the importance score
  • The parent of your URL became more important (e.g. improved topic relevance (similarity), PageRank or local (in-site) importance metric)
  • The 'importance score' of some URLs exceeded the 'importance soft limit threshold', so they are included for crawling and visited up to the point of the 'hard limit' on crawling (e.g. 130% of scheduled crawling)
  45. HOW DO WE DO THIS?
  46. TO DO – FIND GOOGLEBOT: automate server log retrieval via a cron job, then analyse the logs. grep Googlebot access_log > googlebot_access.txt
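If you prefer scripting over shelling out to grep, the same filter can be written in a few lines and dropped into the cron job. This is a minimal sketch, assuming a combined-format access_log in the working directory; the googlebot_access.txt filename mirrors the one-liner above, and the reverse-DNS helper is an optional extra for spot-checking that a 'Googlebot' user agent really is Google, not part of the original slide.

```python
import socket

def is_probably_googlebot(ip):
    """Optional spot check: genuine Googlebot IPs reverse-resolve to googlebot.com / google.com."""
    try:
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    return host.endswith(".googlebot.com") or host.endswith(".google.com")

# Same filter as: grep Googlebot access_log > googlebot_access.txt
with open("access_log", encoding="utf-8", errors="replace") as src, \
     open("googlebot_access.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if "Googlebot" in line:
            dst.write(line)
```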
  47. LOOK THROUGH SPIDER EYES – PREPARE TO BE HORRIFIED:
  • Incorrect URL header response codes
  • 301 redirect chains
  • Old files or XML sitemaps left on the server from years ago
  • Infinite / endless loops (circular dependency)
  • On parameter-driven sites, URLs crawled which produce the same output
  • AJAX content fragments pulled in alone
  • URLs generated by spammers
  • Dead image files being visited
  • Old CSS files still being crawled and loading EVERYTHING
  • You may even see 'mini' abandoned projects within the site
  • Legacy URLs generated by long-forgotten .htaccess regex pattern matching
  • Googlebot hanging around in your 'ever-changing' blog but nowhere else
  48. URL CRAWL FREQUENCY 'CLOCKING': note where Googlebot goes. Do you recognise all the URLs and URL ranges that are appearing? If not… why not? Identify your 'real time', 'daily' and 'base layer' URLs – ARE THEY THE ONES YOU WANT THERE? WHAT IS BEING SEEN AS UNIMPORTANT? Spreadsheet provided by @johnmu during a Webmaster Hangout: https://goo.gl/1pToL8
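A rough way to 'clock' crawl frequency yourself is to count Googlebot hits per URL and per day from the filtered log. A minimal sketch, assuming the googlebot_access.txt produced above is in combined log format; the regex and the 'layer' interpretation in the final comment are illustrative, not taken from the patents or the slide.

```python
import re
from collections import Counter

# Combined log format, e.g.: 66.249.66.1 - - [10/Nov/2015:06:25:14 +0000] "GET /some/path HTTP/1.1" 200 ...
LOG_RE = re.compile(r'\[(\d{2}/\w{3}/\d{4}):[^\]]*\]\s+"[A-Z]+\s+(\S+)')

hits = Counter()   # total Googlebot requests per URL
days_seen = {}     # distinct days on which each URL was requested

with open("googlebot_access.txt", encoding="utf-8", errors="replace") as log:
    for line in log:
        match = LOG_RE.search(line)
        if not match:
            continue
        day, url = match.groups()
        hits[url] += 1
        days_seen.setdefault(url, set()).add(day)

# Frequently hit URLs look 'real time' / 'daily'; rarely hit ones look 'base layer'.
for url, count in hits.most_common(50):
    print(f"{count:6d} hits over {len(days_seen[url]):3d} day(s)  {url}")
```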
  49. IMPROVE AND EMPHASISE PAGE IMPORTANCE:
  • Cross-modular internal linking
  • Canonicalization
  • Important URLs in XML sitemaps
  • Anchor text target consistency (but not spammy repetition of anchors everywhere – it's still output)
  • Internal links in the right descending order – emphasise IMPORTANCE (a quick way to sanity-check this is sketched below)
  • Reduce boilerplate content and improve the relevance of content and elements to the specific topic (if a category) / product (if a product page) / subcategory (if a subcategory)
  • Reduce duplicate-content parts of the page to allow primary targets to take 'IMPORTANCE'
  • Improve parent pages to raise the IMPORTANCE reputation of the children, rather than over-optimising the child pages and cannibalising the parent
  • Improve content so it is more 'relevant' to a topic to increase 'IMPORTANCE' and get reassigned to a different crawl layer
  • Flatten 'architectures'
  • Avoid content cannibalisation
  • Link relevant content to relevant content
  • Build strong, highly relevant 'hub' pages to tie together strength and IMPORTANCE
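Internal backlink counts (the IB(P) idea above) are worth sanity-checking before you start moving links around. A minimal sketch, assuming you have exported your internal links to a CSV with 'Source' and 'Destination' columns (Screaming Frog's 'All Inlinks' export is one way to get something like this); the all_inlinks.csv filename is illustrative.

```python
import csv
from collections import Counter

internal_backlinks = Counter()   # IB(P): how many internal links point at each URL

with open("all_inlinks.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        source, destination = row["Source"], row["Destination"]
        if source != destination:          # ignore self-links
            internal_backlinks[destination] += 1

# If your 'most important' pages aren't near the top, internal linking is skewed.
for url, count in internal_backlinks.most_common(20):
    print(f"{count:6d}  {url}")
```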
  50. EMPHASISE IMPORTANCE WISELY: USE CUSTOM XML SITEMAPS (E.G. XML UNLIMITED SITEMAP GENERATOR). PUT IMPORTANT URLS IN THERE. IF EVERYTHING IS IMPORTANT THEN IMPORTANCE IS NOT DIFFERENTIATED.
  51. KEEP CUSTOM SITEMAPS 'CURRENT' AUTOMATICALLY: AUTOMATE UPDATES WITH CRON JOBS OR WEB CRON JOBS. IT'S NOT AS TECHNICAL AS YOU MAY THINK – USE WEB CRON JOBS.
  52. BE 'PICKY' ABOUT WHAT YOU INCLUDE IN XML SITEMAPS: EXCLUDE AND INCLUDE CRAWL PATHS IN XML SITEMAPS TO EMPHASISE IMPORTANCE (see the sketch below).
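Generating a hand-picked sitemap of only your important URLs is a few lines of scripting, and the script can then be run from a cron (or web cron) job as slide 51 suggests. A minimal sketch, assuming the list of important URLs comes from your own CMS or a curated file; example.com, the URL list and the output filename are all placeholders.

```python
from datetime import date
from xml.etree.ElementTree import Element, SubElement, ElementTree

# Placeholder list – in practice, pull your genuinely important URLs from the CMS or a curated file.
IMPORTANT_URLS = [
    "https://www.example.com/",
    "https://www.example.com/category/widgets/",
    "https://www.example.com/category/widgets/blue-widget/",
]

def build_sitemap(urls, path="sitemap-important.xml"):
    urlset = Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for url in urls:
        entry = SubElement(urlset, "url")
        SubElement(entry, "loc").text = url
        # Ideally lastmod reflects the page's real last critical change, not the build date.
        SubElement(entry, "lastmod").text = date.today().isoformat()
    ElementTree(urlset).write(path, encoding="utf-8", xml_declaration=True)

if __name__ == "__main__":
    build_sitemap(IMPORTANT_URLS)
    # cron example: 0 3 * * * /usr/bin/python3 /path/to/build_sitemap.py
```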
  53. IF YOU CAN'T IMPROVE, EXCLUDE (VIA NOINDEX) FOR NOW:
  • You're out for now; when you improve you can come back in
  • Tell Googlebot quickly that you're out (via temporary XML sitemap inclusion)
  • But 'follow', because there will be some relevance within these URLs
  • Include them again when you've improved
  • Don't try to canonicalize me to something in the index
  54. OR REMOVE – 410 GONE (IF IT'S NEVER COMING BACK). EMBRACE THE '410 GONE'. There's even a song about it: http://faxfromthefuture.bandcamp.com/track/410-gone-acoustic-demo (see the sketch below)
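How you return the 'noindex, follow' signal or the 410 depends entirely on your stack; the sketch below uses Flask purely as an illustration of the two responses the slides describe. GONE_FOREVER, EXCLUDED_FOR_NOW and render_page are all hypothetical stand-ins for whatever your CMS or database actually provides.

```python
from flask import Flask, Response, abort

app = Flask(__name__)

# Hypothetical lists – in practice these would come from your CMS or database.
GONE_FOREVER = {"/old-campaign-2012/", "/discontinued-product/"}
EXCLUDED_FOR_NOW = {"/thin-faceted-page/"}

def render_page(path):
    # Stand-in for whatever actually renders the page.
    return f"<html><body>Content for {path}</body></html>"

@app.route("/<path:page>/")
def serve(page):
    path = f"/{page}/"
    if path in GONE_FOREVER:
        abort(410)                                   # never coming back: 410 Gone
    response = Response(render_page(path))
    if path in EXCLUDED_FOR_NOW:
        # Out of the index for now, links still followed; remove the header once improved.
        response.headers["X-Robots-Tag"] = "noindex, follow"
    return response
```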
  55. #BIGSITEPROBLEMS – LOSE THE INDEX BLOAT. LOSE THE BLOAT TO INCREASE THE CRAWL: the number of unimportant URLs indexed extends far beyond the available importance crawl threshold allocation.
  56. #BIGSITEPROBLEMS – LOSE THE CRAZY TAG MAN. Creating 'thin' content and even more URLs to crawl. Tags: I, must, tag, this, blog, post, with, every, possible, word, that, pops, into, my, head, when, I, look, at, it, and, dilute, all, relevance, from, it, to, a, pile, of, mush, cow, shoes, sheep, the, and, me, of, it. (Image credit: Buzzfeed)
  57. #BIGSITEPROBLEMS – INTERNAL BACKLINKS SKEWED. IMPORTANCE DISTORTED BY DISPROPORTIONATE INTERNAL LINKING – LOCAL IB(P) (INTERNAL BACKLINKS). Most Important Page 1, Most Important Page 2, Most Important Page 3… IS THIS YOUR BLOG?? HOPE NOT.
  58. #BIGSITEPROBLEMS – WARNING SIGNS – LOSE THE 'MISTER OVER-OPTIMIZER' ('OPTIMIZE ALL THE THINGS'). Optimize everything: "I must optimize ALL the pages across a category's descendants for the same terms as my primary target category page, so that each of them is of almost equal relevance to the target page, and confuse crawlers as to which is the important one. I'll put them all in a sitemap as standard too, just for good measure." HOW CAN SEARCH ENGINES KNOW WHICH PAGE IS MOST IMPORTANT TO A TOPIC IF 'EVERYTHING' IS IMPORTANT?? (Image credit: Buzzfeed)
  59. #BIGSITEPROBLEMS – WARNING SIGNS – LOSE THE 'MISTER DUPLICATER' ('DUPLICATE ALL THE THINGS'). Duplicate everything: "I must have a massive boilerplate area in the footer, identical sidebars and a massive mega menu with all the same output in it sitewide. I'll put very little unique content into the page body and it will also look very much like its parents and grandparents too. From time to time I'll outrank my parent and grandparent pages but 'Meh'…" HOW CAN SEARCH ENGINES KNOW WHICH IS THE MOST IMPORTANT PAGE IF ALL ITS CHILDREN AND GRANDCHILDREN ARE NEARLY THE SAME?? (Image credit: Buzzfeed)
  60. IMPROVE SITE PERFORMANCE – HELP GOOGLEBOT GET THROUGH THE 'BUCKET LIST' – GET FAST AND RELIABLE. Avoid wasting time on 'overdue-URL' crawling (e.g. send correct response codes, speed up your site, etc.). (Patent US 8,666,964 B1.) Example: a site added to the Cloudflare CDN – half the load time, more than 2x page crawls per day.
  61. 'GET FRESH' AND STAY 'FRESH' – BUT DON'T TRY TO FAKE FRESH, AND USE FRESH WISELY. GOOGLEBOT GOES WHERE THE ACTION IS. USE 'ACTION' WISELY. DON'T TRY TO TRICK GOOGLEBOT BY FAKING 'FRESHNESS' ON LOW-IMPORTANCE PAGES – GOOGLEBOT WILL REALISE. UPDATE IMPORTANT PAGES OFTEN. NURTURE SEASONAL URLS TO GROW IMPORTANCE WITH FRESHNESS (regular updates) AND MATURITY (HISTORY). DON'T TURN GOOGLEBOT'S HEAD INTO THE WRONG PLACES. (Image credit: Buzzfeed)
  62. IMPROVE TO GET THE HARD LIMITS ON CRAWLING. CAN IMPROVING YOUR SITE HELP TO 'OVERRIDE' THE SOFT-LIMIT CRAWL PERIODS SET? By improving your URL importance on an ongoing basis via increased PageRank, content improvements (e.g. quality hub pages), internal link strategies, IB(P) and restructuring, you can get the 'hard limit', or get visited more generally.
  63. YOU THINK IT DOESN'T MATTER… RIGHT? YOU SAY… "GOOGLE WILL WORK IT OUT", "LET'S JUST MAKE MORE CONTENT".
  64. WRONG – 'CRAWL TANK' IS UGLY.
  65. WRONG – CRAWL TANK CAN LOOK LIKE THIS: SITE SEO DEATH BY TOO MANY URLS AND INSUFFICIENT CRAWL BUDGET TO SUPPORT THEM (EITHER DUMPING A NEW 'THIN' PARAMETER INTO A SITE, OR AN INFINITE LOOP (CODING ERROR) (SPIDER TRAP)). WHAT'S WORSE THAN AN INFINITE LOOP? 'A LOGICAL INFINITE LOOP'. IMPORTANCE DISTORTED BY BADLY CODED PARAMETERS GENERATING 'JUNK', OR EVEN WORSE, PULLING LOGIC TO CRAWLERS BUT NOT HUMANS.
  66. WRONG – SITE DROWNED IN ITS OWN SEA OF UNIMPORTANT URLS.
  67. VIA 'EXPONENTIAL URL UNIMPORTANCE': your URLs are exponentially confirmed unimportant with each iterative crawl visit to other similar or duplicate-content-checksum URLs. Fewer and fewer internal links, and 'thinner and thinner' relevant content. MULTIPLE RANDOM URLS competing for the same query confirm the irrelevance of all the competing in-site URLs, with no dominant single relevant IMPORTANT URL.
  68. WRONG – 'SENDING WRONG SIGNALS TO GOOGLEBOT' COSTS DEARLY. "2015 was the year where website owners managed to be mostly at fault, all by themselves" (Sistrix 2015 Organic Search Review, 2016). (Source: Sistrix)
  69. WRONG – NO-ONE IS EXEMPT. "It doesn't matter how big your brand is; if you 'talk to the spider' (Googlebot) wrong" – you can still 'tank'. (Source: Sistrix)
  70. WRONG – GOOGLE THINKS SEOS SHOULD UNDERSTAND CRAWL BUDGET.
  71. SORT OUT CRAWLING. "EMPHASISE IMPORTANCE." "Make sure the right URLs get on Googlebot's menu and increase URL importance to build Googlebot's appetite for your site more." Dawn Anderson @dawnieando
  72. THANK YOU. Twitter: @dawnieando. Google+: +DawnAnderson888. LinkedIn: msdawnanderson. Dawn Anderson @dawnieando
  73. UNDERSTAND GOOGLEBOT AND THE URL SCHEDULER – LIKES AND DISLIKES.
  LIKES:
  • Going 'where the action is' in sites
  • The 'need for speed'
  • Logical structure
  • Correct 'response' codes
  • XML sitemaps with important URLs
  • Successful crawl visits
  • 'Seeing everything' on a page
  • Taking MAX 'hints'
  • Clear, unique, single 'URL fingerprints' (no duplicates)
  • Predicting the likelihood of 'future change'
  • Finding 'more' important content worth crawling
  DISLIKES:
  • Slow sites
  • Too many redirects
  • Being bored (Meh) (MIN 'hints' are built in by the search engine systems – it takes 'hints')
  • Being lied to (e.g. on XML sitemap priorities)
  • Crawl traps and dead ends
  • Going round in circles (infinite loops)
  • Spam URLs
  • Crawl-wasting minor-content-change URLs
  • 'Hidden' and blocked content
  • Uncrawlable URLs
  CHANGE IS KEY: not just any change – critical material change; predicting future change; dropping 'hints' to Googlebot; sending Googlebot where 'the action is'; not just page change designed to catch Googlebot's eye with no added value.
  74. CRAWL OPTIMISATION – STAGE 1: UNDERSTAND GOOGLEBOT AND THE URL SCHEDULER – LIKES AND DISLIKES (the same likes, dislikes and 'CHANGE IS KEY' points as slide 73).
  75. FIX GOOGLEBOT'S JOURNEY – SPEED UP YOUR SITE TO 'FEED' GOOGLEBOT MORE. TECHNICAL 'FIXES':
  • Speed up your site
  • Implement compression, minification and caching
  • Fix incorrect header response codes
  • Fix nonsensical 'infinite loops' generated by database-driven parameters or 'looping' relative URLs
  • Use absolute rather than relative internal links
  • Ensure no parts of the content are blocked from crawlers (e.g. in carousels, concertinas and tabbed content)
  • Ensure no CSS or JavaScript files are blocked from crawlers
  • Unpick 301 redirect chains (see the sketch below)
  • Consider using a CDN such as Cloudflare (implementation of a content delivery network)
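For the redirect-chain item in the list above, a quick script can show every hop Googlebot would have to follow before reaching the final URL. A minimal sketch using the requests library; the URLs in URLS_TO_CHECK are placeholders for a list pulled from your sitemap or server logs.

```python
import requests

# Placeholder list – in practice, feed in URLs from your XML sitemap or server logs.
URLS_TO_CHECK = [
    "http://www.example.com/old-page/",
    "https://www.example.com/category/widgets/",
]

for url in URLS_TO_CHECK:
    response = requests.get(url, allow_redirects=True, timeout=10)
    chain = response.history                      # one entry per redirect hop
    if len(chain) > 1:
        hops = " -> ".join(f"{r.status_code} {r.url}" for r in chain)
        print(f"CHAIN ({len(chain)} hops): {hops} -> {response.status_code} {response.url}")
    elif chain:
        print(f"Redirect: {chain[0].status_code} {url} -> {response.status_code} {response.url}")
    else:
        print(f"{response.status_code} {url}")
```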
  76. FIX GOOGLEBOT'S JOURNEY – SAVE BUDGET / EMPHASISE IMPORTANCE:
  • Minimise 301 redirects
  • Minimise canonicalisation
  • Use 'if modified' headers on low-importance 'hygiene' pages (see the sketch below)
  • Use 'expires after' headers on content with a short shelf life (e.g. auctions, job sites, event sites)
  • Noindex low-search-volume or near-duplicate URLs (use the noindex directive in robots.txt)
  • Use 410 'gone' headers on dead URLs liberally
  • Revisit the .htaccess file and review legacy pattern-matched 301 redirects
  • Combine CSS and JavaScript files
  • Use minification, compression and caching
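The 'if modified' idea above boils down to honouring If-Modified-Since and answering 304 when nothing material has changed, so Googlebot spends that part of the visit elsewhere. A minimal sketch, again using Flask only as an illustration; last_modified_for and the /privacy-policy/ route are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from flask import Flask, Response, request

app = Flask(__name__)

def last_modified_for(path):
    # Hypothetical lookup: when did this 'hygiene' page last have a critical material change?
    return datetime(2016, 5, 1, tzinfo=timezone.utc)

@app.route("/privacy-policy/")
def privacy_policy():
    last_modified = last_modified_for("/privacy-policy/")
    condition = request.headers.get("If-Modified-Since")
    if condition:
        try:
            if parsedate_to_datetime(condition) >= last_modified:
                return Response(status=304)        # nothing new – no need to re-download the page
        except (TypeError, ValueError):
            pass                                   # malformed header: fall through to a full response
    response = Response("<html><body>...policy text...</body></html>")
    response.headers["Last-Modified"] = format_datetime(last_modified, usegmt=True)
    return response
```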
  77. TRAIN GOOGLEBOT – 'TALK TO THE SPIDER' (PROMOTE URLS TO HIGHER CRAWL LAYERS).
  EMPHASISE PAGE IMPORTANCE – BE CLEAR TO GOOGLEBOT WHICH ARE YOUR MOST IMPORTANT PAGES:
  • Revisit 'votes for self' via internal links in GSC
  • Clear 'unique' URL fingerprints
  • Improve whole site sections / categories
  • Use XML sitemaps for your important URLs (don't put everything in them)
  • Use 'mega menus' (very selectively) to key pages
  • Use 'breadcrumbs'
  • Build 'bridges' and 'shortcuts' via HTML sitemaps and 'cross-modular' 'related' internal linking to key pages
  • Consolidate (merge) important but similar content (e.g. merge FAQs or 'low search volume' content into other relevant pages)
  • Consider flattening your site structure so 'importance' flows further
  • Reduce internal linking to lower-priority URLs
  TRAIN ON CHANGE – GOOGLEBOT GOES WHERE THE ACTION IS, AND WHERE IT IS LIKELY TO BE IN THE FUTURE (AS LONG AS THOSE URLS ARE NOT UNIMPORTANT):
  • Not just any change – critical material change
  • Keep the 'action' in the key areas – NOT JUST THE BLOG
  • Use relevant 'supplementary content' to keep key pages 'fresh'
  • Remember min crawl 'hints'
  • Regularly update key IMPORTANT content
  • Consider 'updating' rather than replacing seasonal content URLs (e.g. annual events): append and update
  • Build 'dynamism' and 'interactivity' into your web development (sites that 'move' win)
  • Keep working to improve and make your URLs more important
  78. SAVINGS, CHANGE AND SPEED TOOLS.
  SAVINGS AND CHANGE:
  • GSC index levels (over-indexation checks)
  • GSC crawl stats
  • Last-accessed tools (versus competitors)
  • Server logs
  • Keyword tools
  SPEED:
  • YSlow
  • Pingdom
  • Google Page Speed tests
  • Minification – JS Compress and CSS Minifier
  • Image compression – compressjpeg.com, tinypng.com
  • Content delivery networks (e.g. Cloudflare)
  79. URL IMPORTANCE AND CRAWL FREQUENCY TOOLS:
  • GSC internal links report (URL importance)
  • Link Research Tools (strongest sub-pages reports)
  • GSC internal links (add site categories and sections as additional profiles)
  • PowerMapper
  • XML sitemap generators for custom sitemaps
  • Crawl frequency clocking (@Johnmu)
  80. SPIDER EYES TOOLS:
  • GSC crawl stats
  • URL Profiler
  • DeepCrawl
  • Screaming Frog
  • Server logs
  • SEMrush (auditing tools)
  • Webconfs (header responses / similarity checker)
  • PowerMapper (bird's-eye view of a site)
  • Lynx browser
  • Crawl frequency clocking (@Johnmu)
  81. REFERENCES
  • Efficient Crawling Through URL Ordering (Page et al): http://oak.cs.ucla.edu/~cho/papers/cho-order.pdf
  • Crawl Optimisation (Blind Five Year Old – A J Kohn, @ajkohn): http://www.blindfiveyearold.com/crawl-optimization
  • Scheduling a recrawl (Auerbach): http://www.google.co.uk/patents/US8386459
  • Scheduler for search engine crawler (Zhu et al): http://www.google.co.uk/patents/US8042112
  • Google Explains Why The Search Console Reporting Is Not Real Time (SERoundtable): https://www.seroundtable.com/google-explains-why-the-search-console-has-reporting-delays-21688.html
  • Crawl Data Aggregation Propagation (Mueller): https://goo.gl/1pToL8
  • Matt Cutts Interviewed By Eric Enge: https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
  • Web Promo Q and A with Google's Andrey Lipattsev: https://searchenginewatch.com/2016/04/06/webpromos-qa-with-googles-andrey-lipattsev-transcript/
  • Google Number 1 SEO Advice – Be Consistent: https://www.seroundtable.com/google-number-one-seo-advice-be-consistent-21196.html
  82. REFERENCES
  • Internet Live Stats: http://www.internetlivestats.com/total-number-of-websites/
  • Scheduler for search engine crawler, Google Patent US 8042112 B1 (Zhu et al): https://www.google.com/patents/US8707313
  • Managing items in crawl schedule, Google Patent (Alpert): http://www.google.ch/patents/US8666964
  • Document reuse in a search engine crawler, Google Patent (Zhu et al): https://www.google.com/patents/US8707312
  • Web crawler scheduler that utilizes sitemaps (Brawer et al): http://www.google.com/patents/US8037054
  • Distributed crawling of hyperlinked documents (Dean et al): http://www.google.co.uk/patents/US7305610
  • Minimizing visibility of stale content (Carver): http://www.google.ch/patents/US20130226897
  83. REFERENCES
  • https://www.sistrix.com/blog/how-nordstrom-bested-zappos-on-google/
  • https://www.xml-sitemaps.com/generator-demo/
