Everything you wanted to know about crawling, but didn't know where to ask
Published in: Technology, Design
Transcript

  • 1. Local Search (Including Importance Metrics and Link Merging). Everything You Wanted to Know About Crawling* (*But Didn't Know Where to Ask). Agile SEO Meetup – South Jersey, Monday, September 10, 2012, 7:00 PM to 9:00 PM. Bill Slawski, Webimax, @bill_slawski
  • 2. In the Early Days of the Web, there was an invasion
  • 3. Robots
  • 4. Spiders Via Thomas Shahan - http://www.flickr.com/photos/opoterser/
  • 5. Crawlers
  • 6. Invaded pages across the World Wide Web
  • 7. The Robots Mailing List was formed to solve the problem!
  • 8. Led by a young Martijn Koster, they developed the Robots.txt protocol
  • 9. Which Asked Robots to be Polite
  • 10. And Not Melt Down Internet Servers
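The politeness check described in slides 8 to 10 can be sketched with Python's standard `urllib.robotparser`. This is a minimal illustration, not how any particular search engine does it: the robots.txt content and the `ExampleBot` user agent are made up, and `Crawl-delay` is a widely recognized extension rather than part of the original protocol.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from the root of the site before requesting any other page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set the crawler can query."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# A polite robot asks before every fetch.
print(parser.can_fetch("ExampleBot", "http://example.com/index.htm"))      # True
print(parser.can_fetch("ExampleBot", "http://example.com/private/a.htm"))  # False

# Seconds to wait between requests, if the site asks for a delay.
print(parser.crawl_delay("ExampleBot"))  # 5
```

A crawler would sleep for the reported delay between requests to the same host, which is the "not melting down servers" part.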
  • 11. A student at Stanford named Lawrence Page went on to co-author a paper on how robots might Crawl web pages to index important pages first. http://ilpubs.stanford.edu:8090/347/1/1998-51.pdf
  • 12. <<Insert Subliminal Advertisement Here>>
  • 13. Important Web Pages 1. Contain words similar to a query that starts the crawl 2. Have a high backlink count 3. Have a high PageRank 4. Have a high forward link count 5. Are in or are close to the root directory for sites Image via Fir0002/Flagstaffotos under http://commons.wikimedia.org/wiki/Commons:GNU_Free_Documentation_License_1.2
  • 14. So most crawlers will not only be Polite, but they will also hunt down important pages first
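One way to picture "important pages first" is a priority queue keyed on one of the importance metrics from slide 13. The sketch below uses backlink count only; the URLs and counts are invented, and a real crawler would likely combine several signals and update them as the crawl proceeds.

```python
import heapq

def crawl_order(backlink_counts: dict[str, int]) -> list[str]:
    """Return URLs ordered so the most-linked-to pages are crawled first."""
    # heapq is a min-heap, so negate the count to pop the largest first.
    heap = [(-count, url) for url, count in backlink_counts.items()]
    heapq.heapify(heap)
    order = []
    while heap:
        _, url = heapq.heappop(heap)
        order.append(url)
    return order

# Hypothetical backlink counts discovered so far.
known = {
    "http://example.com/archive/2009/post.htm": 1,
    "http://example.com/": 10,
    "http://example.com/products.htm": 3,
}
print(crawl_order(known))
```

Swapping the key for PageRank, forward link count, or directory depth changes only the number stored in the heap.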
  • 15. Search Engines filed patents on how they might crawl and collect content found on Web pages, including collecting URLs and Anchor Text associated with them. <a href="http://www.hungryrobots.com">Feed Me</a> http://patft1.uspto.gov/netacgi/nph-Parser?patentnumber=7308643
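Collecting URLs together with their anchor text can be sketched with the standard `html.parser` module. The `AnchorCollector` class is a hypothetical helper written for this example, not something taken from the patent.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (url, anchor text) pairs from <a href="..."> elements."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (url, anchor text) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

collector = AnchorCollector()
collector.feed('<p>Lunch: <a href="http://www.hungryrobots.com">Feed Me</a></p>')
print(collector.links)  # [('http://www.hungryrobots.com', 'Feed Me')]
```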
  • 16. Also, in one embodiment, the robots are configured to not follow "permanent redirects". Thus, when a robot encounters a URL that is permanently redirected to another URL, the robot does not automatically retrieve the document at the target address of the permanent redirect.
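The redirect behavior quoted above might look like the sketch below. Only the "do not automatically retrieve the target of a permanent redirect" rule comes from the patent text. Queueing the target for a later crawl and following temporary redirects immediately are assumptions made for illustration, and `handle_response` is a hypothetical name.

```python
from collections import deque
from typing import Optional

PERMANENT = {301, 308}       # HTTP permanent redirect status codes
TEMPORARY = {302, 303, 307}  # HTTP temporary redirect status codes

def handle_response(status: int, location: Optional[str], frontier: deque) -> bool:
    """Return True if the robot should retrieve `location` right away."""
    if status in PERMANENT and location:
        # Assumption: record the target like any newly discovered URL
        # instead of fetching it as part of this request.
        frontier.append(location)
        return False
    if status in TEMPORARY and location:
        # Assumption: temporary redirects are followed immediately.
        return True
    return False

frontier = deque()
print(handle_response(301, "http://www.example.com/new.htm", frontier))  # False
print(list(frontier))  # ['http://www.example.com/new.htm']
print(handle_response(302, "http://www.example.com/tmp.htm", frontier))  # True
```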
  • 17. “Use a text browser such as Lynx to examine your site, because most search engine spiders see your site much as Lynx would. If fancy features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine spiders may have trouble crawling your site.”* *Google Webmaster Guidelines - http://support.google.com/webmasters/bin/answer.py?hl=en&answer=35769
  • 18. Google’s Webmaster Guidelines make crawlers look pretty unsophisticated, and incapable of much more than the simple Lynx browser… …But we have signs that crawlers can be smarter than that, and Microsoft introduced a Vision-based Page Segmentation Algorithm in 2003. Both Google and Yahoo have also published patents and papers that describe smarter crawlers. IBM filed a patent for a crawler in 2000 that is smarter than most browsers today.
  • 19. VIPS: a Vision-based Page Segmentation Algorithm - http://research.microsoft.com/apps/pubs/default.aspx?id=70027
  • 20. http://patft1.uspto.gov/netacgi/nph-Parser?patentnumber=7519902
  • 21. Link Merging. Web Site Structure Analysis - http://patft1.uspto.gov/netacgi/nph-Parser?patentnumber=7861151
      • S-nodes – Structural link blocks: organizational and navigational link blocks, repeated across pages with the same layout and showing the organization of the site. They are often lists of links that don’t usually contain other content elements such as text.
      • C-nodes – Content link blocks, grouped together by some kind of content association, such as relating to the same topic or sub-topic. These blocks usually point to information resources and aren’t likely to be repeated across more than one page.
      • I-nodes – Isolated links, which are links on a page that aren’t part of a link group. They may be only loosely related to each other, for example by appearing together within the same paragraph of text. Each link on a page might be considered an individual i-node, or they might be grouped together by page as an i-node.
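As a rough illustration of the three node types, the toy classifier below labels link blocks using only two cues from the descriptions above: whether a block repeats across pages, and whether it is a lone link. This is a deliberately simplified heuristic invented for this example, not the method in the patent, and the sample pages are made up.

```python
from collections import Counter

def classify_blocks(pages: dict[str, list[tuple[str, ...]]]) -> dict[tuple[str, ...], str]:
    """Label each link block as 'S', 'C', or 'I'.

    `pages` maps a page URL to its link blocks; each block is a tuple of hrefs.
    """
    # Count how many pages each distinct block appears on.
    seen_on = Counter()
    for blocks in pages.values():
        for block in set(blocks):
            seen_on[block] += 1

    labels = {}
    for blocks in pages.values():
        for block in blocks:
            if seen_on[block] > 1:
                labels[block] = "S"  # repeated across pages: structural
            elif len(block) == 1:
                labels[block] = "I"  # a lone link: isolated
            else:
                labels[block] = "C"  # page-specific group: content
    return labels

# Hypothetical site with a shared navigation block.
nav = ("/", "/about", "/contact")
pages = {
    "/a": [nav, ("/a/part-1", "/a/part-2"), ("/glossary",)],
    "/b": [nav, ("/b/specs", "/b/reviews")],
}
for block, label in classify_blocks(pages).items():
    print(label, block)
```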
  • 22. Crawling and Self Help
  • 23. Canonical = Best! There can be only one:
      http://example.com
      http://www.example.com
      http://example.com/
      http://www.example.com/
      https://example.com
      https://www.example.com
      https://example.com/
      https://www.example.com/
      http://example.com/index.htm
      http://www.example.com/index.htm
      https://example.com/index.htm
      https://www.example.com/index.htm
      http://example.com/INDEX.htm
      http://www.example.com/INDEX.htm
      https://example.com/INDEX.htm
      https://www.example.com/INDEX.htm
      http://example.com/Index.htm
      http://www.example.com/Index.htm
      https://example.com/Index.htm
      https://www.example.com/Index.htm
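The variants on this slide can be collapsed to a single form with a small normalization function. This is a sketch under stated assumptions: the preferred scheme and host (`https`, `www.example.com`) are arbitrary choices for the example, and `canonicalize` is a hypothetical helper. On a live site the same mapping would normally be enforced by server-side redirects.

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url: str, scheme: str = "https", host: str = "www.example.com") -> str:
    """Map the homepage variants of one site onto a single canonical URL."""
    parts = urlsplit(url)
    path = parts.path or "/"  # treat a missing path as the root
    # Treat any capitalization of index.htm as the directory root.
    if path.lower().endswith("/index.htm"):
        path = path[: -len("index.htm")]
    return urlunsplit((scheme, host, path, parts.query, ""))

variants = [
    "http://example.com",
    "http://www.example.com/",
    "https://example.com/index.htm",
    "https://www.example.com/INDEX.htm",
    "http://example.com/Index.htm",
]
print({canonicalize(u) for u in variants})  # {'https://www.example.com/'}
```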
  • 24. Canonical Link Element <link rel="canonical" href="http://example.com/page.html"/>
  • 25. Rel=“prev” & rel=“next”
      On the first page, http://www.example.com/article?story=abc&page=1:
        <link rel="next" href="http://www.example.com/article?story=abc&page=2" />
      On the second page, http://www.example.com/article?story=abc&page=2:
        <link rel="prev" href="http://www.example.com/article?story=abc&page=1" />
        <link rel="next" href="http://www.example.com/article?story=abc&page=3" />
      On the third page, http://www.example.com/article?story=abc&page=3:
        <link rel="prev" href="http://www.example.com/article?story=abc&page=2" />
        <link rel="next" href="http://www.example.com/article?story=abc&page=4" />
      And on the last page, http://www.example.com/article?story=abc&page=4:
        <link rel="prev" href="http://www.example.com/article?story=abc&page=3" />
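The pattern in slide 25 is mechanical enough to generate. The sketch below reproduces the same tags for any page in a series; `pagination_links` is a hypothetical helper, and it assumes the page number is always the last query parameter, as in the slide's URLs.

```python
def pagination_links(base_url: str, page: int, last_page: int) -> list[str]:
    """Return the rel="prev"/rel="next" link elements for one page of a series."""
    tags = []
    if page > 1:  # every page except the first points back
        tags.append(f'<link rel="prev" href="{base_url}&page={page - 1}" />')
    if page < last_page:  # every page except the last points forward
        tags.append(f'<link rel="next" href="{base_url}&page={page + 1}" />')
    return tags

base = "http://www.example.com/article?story=abc"
for page in range(1, 5):
    print(f"page {page}:")
    for tag in pagination_links(base, page, last_page=4):
        print("  " + tag)
```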
  • 26. Paginated Product Pages
  • 27. Paginated Article Pages
  • 28. View All Pages
      Option 1
        • Normal prev/next sequence
        • Self-referential canonicals (each page points to its own URL)
        • Noindex meta element on the View All page
      Option 2
        • Normal prev/next sequence
        • Canonicals (all pages use the view-all page URL)
      http://googlewebmastercentral.blogspot.com/2011/09/view-all-in-search-results.html
  • 29. Rel=“hreflang”
  • 30. Rel=“hreflang”
      HTML link element. In the HTML <head> section of http://www.example.com/, add a link element pointing to the Spanish version of that webpage at http://es.example.com/, like this:
        <link rel="alternate" hreflang="es" href="http://es.example.com/" />
      HTTP header. If you publish non-HTML files (like PDFs), you can use an HTTP header to indicate a different language version of a URL:
        Link: <http://es.example.com/>; rel="alternate"; hreflang="es"
      Sitemap. Instead of using markup, you can submit language version information in a Sitemap.
  • 31. Rel=“hreflang” XML Sitemap
      <?xml version="1.0" encoding="UTF-8"?>
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:xhtml="http://www.w3.org/1999/xhtml">
        <url>
          <loc>http://www.example.com/english/</loc>
          <xhtml:link rel="alternate" hreflang="de" href="http://www.example.com/deutsch/" />
          <xhtml:link rel="alternate" hreflang="de-ch" href="http://www.example.com/schweiz-deutsch/" />
          <xhtml:link rel="alternate" hreflang="en" href="http://www.example.com/english/" />
        </url>
      </urlset>
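A Sitemap entry like the one on this slide can be generated rather than hand-written. The sketch below uses the standard `xml.etree.ElementTree` module; `build_sitemap` is a hypothetical helper that handles a single `<url>` entry, and the URLs are the ones from the slide.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Serialize with the same prefixes the slide uses.
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("xhtml", XHTML_NS)

def build_sitemap(loc: str, alternates: dict[str, str]) -> str:
    """Build a one-entry Sitemap with an xhtml:link per language version."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
    ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
    for lang, href in alternates.items():
        ET.SubElement(url, f"{{{XHTML_NS}}}link",
                      rel="alternate", hreflang=lang, href=href)
    return ET.tostring(urlset, encoding="unicode")

xml = build_sitemap(
    "http://www.example.com/english/",
    {
        "de": "http://www.example.com/deutsch/",
        "de-ch": "http://www.example.com/schweiz-deutsch/",
        "en": "http://www.example.com/english/",
    },
)
print(xml)
```

Building the document through an XML library avoids the broken-namespace and unclosed-element mistakes that are easy to make when editing Sitemaps by hand.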
  • 32. XML Sitemap
  • 33. XML Sitemap
      • Use Canonical links
      • Remove 404s
      • Don’t set priority past 1 week
      • If more than 50,000 URLs, use multiple Sitemaps and a Sitemap index file
      • Validate with an XML Sitemap Validator
      • Include a Sitemap statement in robots.txt
      http://www.sitemaps.org/
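The 50,000-URL rule and the Sitemap index can be sketched as follows. The function names, file names, and URLs are hypothetical; only the 50,000 limit and the `sitemapindex` format come from sitemaps.org.

```python
from xml.sax.saxutils import escape

MAX_URLS = 50_000  # per-file URL limit from the Sitemaps protocol

def split_into_sitemaps(urls: list[str], max_urls: int = MAX_URLS) -> list[list[str]]:
    """Split a URL list into chunks that each fit in one Sitemap file."""
    return [urls[i:i + max_urls] for i in range(0, len(urls), max_urls)]

def sitemap_index(sitemap_urls: list[str]) -> str:
    """Build a Sitemap index document that points at each Sitemap file."""
    entries = "".join(
        f"  <sitemap><loc>{escape(u)}</loc></sitemap>\n" for u in sitemap_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}</sitemapindex>\n"
    )

# Hypothetical site with 120,000 pages.
urls = [f"http://www.example.com/page{i}.htm" for i in range(120_000)]
chunks = split_into_sitemaps(urls)
files = [f"http://www.example.com/sitemap{n}.xml" for n in range(1, len(chunks) + 1)]
print([len(c) for c in chunks])  # [50000, 50000, 20000]
print(sitemap_index(files))
```

The matching robots.txt statement would then be a single line such as `Sitemap: http://www.example.com/sitemap_index.xml`, where the file name is again an assumption.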
  • 34. “Next, we study which of the two crawl systems, Sitemaps and Discovery, sees URLs first. We conduct this test over a dataset consisting of over five billion URLs that were seen by both systems. According to the most recent statistics at the time of the writing, 78% of these URLs were seen by Sitemaps first, compared to 22% that were seen through Discovery first.” Crawling vs. XML Sitemaps: Above and Beyond the Crawl of Duty – http://www.shuri.org/publications/www2009_sitemaps.pdf
  • 35. Crawling Social Media. Ranking of Search Results based on Microblog data - http://appft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.html&r=1&f=G&l=50&s1=%2220110246457%22.PGNR.&OS=DN/20110246457&RS=DN/20110246457
  • 36. Questions? Bill Slawski Webimax @bill_slawski
