Everything you wanted to know about crawling, but didn't know where to ask
Everything You Wanted to Know about Crawling* (*But Didn't Know Where to Ask), Including Importance Metrics and Link Merging
Agile SEO Meetup – South Jersey, Monday, September 10, 2012, 7:00 PM to 9:00 PM
Bill Slawski, Webimax, @bill_slawski
In the Early Days of the Web, there was an invasion
Spiders (image via Thomas Shahan - http://www.flickr.com/photos/opoterser/)
Invaded pages across the World Wide Web
The Robots Mailing List was formed to solve the problem!
Led by a young Martijn Koster, they developed the Robots.txt protocol
Which Asked Robots to be Polite
And Not Melt Down Internet Servers
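A polite robot checks robots.txt before it fetches anything. The following is a minimal sketch using Python's standard `urllib.robotparser`; the robots.txt content, the `ExampleBot` user agent, and the example.com URLs are made up for illustration and are not from the talk.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser a crawler can query."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is an assumed user-agent name.
allowed_home = parser.can_fetch("ExampleBot", "http://example.com/index.htm")
allowed_private = parser.can_fetch("ExampleBot", "http://example.com/private/data.htm")

# Seconds the site asks robots to wait between requests (None if not stated).
delay = parser.crawl_delay("ExampleBot")
```

A crawler would skip any URL where `can_fetch` is false and sleep for `delay` seconds between requests to the same host, which is the "don't melt the server" part.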
A student at Stanford named Lawrence Page went on to co-author a paper on how robots might crawl webpages to index important pages first. http://ilpubs.stanford.edu:8090/347/1/1998-51.pdf
<<Insert Subliminal Advertisement Here>>
Important Web Pages
1. Contain words similar to a query that starts the crawl
2. Have a high backlink count
3. Have a high PageRank
4. Have a high forward link count
5. Are in or are close to the root directory for sites
Image via Fir0002/Flagstaffotos under http://commons.wikimedia.org/wiki/Commons:GNU_Free_Documentation_License_1.2
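One way to picture "important pages first" is a crawl frontier ordered by a score built from the five signals above. This is only a sketch: the `Candidate` fields, the weights in `importance`, and the sample URLs are assumptions chosen for illustration, and the slide does not say how a real crawler combines these metrics.

```python
import heapq
from dataclasses import dataclass
from urllib.parse import urlparse

@dataclass
class Candidate:
    url: str
    backlinks: int = 0             # pages known to link to this URL
    pagerank: float = 0.0          # PageRank estimate so far
    forward_links: int = 0         # links going out from the page
    query_similarity: float = 0.0  # similarity to the query that drives the crawl

def depth(url: str) -> int:
    """Number of path segments between the URL and the site root."""
    return len([segment for segment in urlparse(url).path.split("/") if segment])

def importance(c: Candidate) -> float:
    # Illustrative weights only (an assumption, not taken from the paper).
    return (2.0 * c.query_similarity
            + 1.0 * c.pagerank
            + 0.1 * c.backlinks
            + 0.05 * c.forward_links
            - 0.5 * depth(c.url))

def crawl_order(candidates):
    """Return URLs from most to least important using a heap as the frontier."""
    heap = [(-importance(c), c.url) for c in candidates]
    heapq.heapify(heap)
    order = []
    while heap:
        _, url = heapq.heappop(heap)
        order.append(url)
    return order

candidates = [
    Candidate("http://example.com/a/b/c/deep.htm", backlinks=1, pagerank=0.1),
    Candidate("http://example.com/", backlinks=40, pagerank=0.9, forward_links=20),
    Candidate("http://example.com/products.htm", backlinks=10, pagerank=0.4, forward_links=5),
]
order = crawl_order(candidates)
```

With these made-up numbers the home page is crawled first, then the products page, then the deeply nested page.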
So most crawlers will not only be polite, but they will also hunt down important pages first
Search engines filed patents on how they might crawl and collect content found on Web pages, including collecting URLs and the anchor text associated with them.
<a href="http://www.hungryrobots.com">Feed Me</a>
http://patft1.uspto.gov/netacgi/nph-Parser?patentnumber=7308643
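Collecting a URL together with its anchor text can be sketched with Python's standard `html.parser`. The `AnchorCollector` class and `extract_links` helper are hypothetical names for this example; this is not the method claimed in the patent, just a small illustration of the idea.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collects (URL, anchor text) pairs from <a href="..."> elements."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (url, anchor text) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def extract_links(html: str):
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return collector.links

links = extract_links('<p><a href="http://www.hungryrobots.com">Feed Me</a></p>')
```

For the link on this slide, `links` holds the pair of the hungryrobots.com URL and the anchor text "Feed Me".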
Also, in one embodiment, the robots are configured to not follow "permanent redirects". Thus, when a robot encounters a URL that is permanently redirected to another URL, the robot does not automatically retrieve the document at the target address of the permanent redirect.
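The behavior in that passage can be sketched with canned responses instead of real HTTP requests. Everything here is an assumption made for illustration: the `RESPONSES` table, the example.com URLs, and the choice to put the redirect target on a `deferred` list for later scheduling. The quoted text only says the target is not retrieved automatically.

```python
from collections import deque

# Hypothetical canned responses: url -> (status code, Location header or body).
RESPONSES = {
    "http://example.com/old": (301, "http://example.com/new"),
    "http://example.com/new": (200, "<html>new page</html>"),
}

def crawl(seeds, responses):
    """Fetch seed URLs, recording permanent redirects without following them."""
    frontier = deque(seeds)
    fetched = []    # URLs whose documents were retrieved
    redirects = {}  # source URL -> permanent redirect target
    deferred = []   # targets left for a later crawl cycle (assumed behavior)
    while frontier:
        url = frontier.popleft()
        status, payload = responses[url]
        if status == 301:
            redirects[url] = payload
            deferred.append(payload)  # noted, but not fetched right now
        else:
            fetched.append(url)
    return fetched, redirects, deferred

fetched, redirects, deferred = crawl(["http://example.com/old"], RESPONSES)
```

After this run nothing has been fetched from the target address; the crawler only knows that the old URL points at the new one.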
Google’s Webmaster Guidelines make crawlers look pretty unsophisticated, and incapable of much more than the simple Lynx browser…
…But we have signs that crawlers can be smarter than that, and Microsoft introduced a Vision-based Page Segmentation Algorithm in 2003. Both Google and Yahoo have also published patents and papers that describe smarter crawlers. IBM filed a patent for a crawler in 2000 that is smarter than most browsers today.
VIPS: a Vision-based Page Segmentation Algorithm - http://research.microsoft.com/apps/pubs/default.aspx?id=70027
Link Merging
• S-nodes – Structural link blocks: organizational and navigational link blocks, repeated across pages with the same layout and showing the organization of the site. They are often lists of links that don’t usually contain other content elements such as text.
• C-nodes – Content link blocks, grouped together by some kind of content association, such as relating to the same topic or sub-topic. These blocks usually point to information resources and aren’t likely to be repeated across more than one page.
• I-nodes – Isolated links, which are links on a page that aren’t part of a link group and may be only loosely related to each other, by virtue of something like their appearing together within the same paragraph of text. Each link on a page might be considered an individual i-node, or they might be grouped together by page as an i-node.
Web Site Structure Analysis - http://patft1.uspto.gov/netacgi/nph-Parser?patentnumber=7861151
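A very rough way to picture the three node types is to label link blocks by how often they repeat across pages. The rules below are a loose simplification invented for this sketch, not the patent's method: a single link is treated as an I-node, a multi-link block seen on more than one page as an S-node, and any other multi-link block as a C-node. The `classify_blocks` function and the sample paths are hypothetical.

```python
from collections import Counter

def classify_blocks(pages):
    """pages maps a page URL to its link blocks; each block is a tuple of hrefs."""
    # Count on how many pages each distinct block appears.
    seen_on = Counter()
    for blocks in pages.values():
        for block in set(blocks):
            seen_on[block] += 1

    labels = {}
    for page, blocks in pages.items():
        for block in blocks:
            if len(block) == 1:
                label = "I-node"  # an isolated link
            elif seen_on[block] > 1:
                label = "S-node"  # repeated block, likely navigation
            else:
                label = "C-node"  # page-specific group of links
            labels[(page, block)] = label
    return labels

nav = ("/", "/products", "/about")
pages = {
    "/": [nav, ("/news/a", "/news/b")],
    "/about": [nav, ("/team",)],
}
labels = classify_blocks(pages)
```

Here the navigation block shared by both pages comes out as an S-node, the news links as a C-node, and the lone team link as an I-node.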
Crawling and Self Help
Canonical = Best! There can be only one:
http://example.com
http://www.example.com
http://example.com/
http://www.example.com/
https://example.com
https://www.example.com
https://example.com/
https://www.example.com/
http://example.com/index.htm
http://www.example.com/index.htm
https://example.com/index.htm
https://www.example.com/index.htm
http://example.com/INDEX.htm
http://www.example.com/INDEX.htm
https://example.com/INDEX.htm
https://www.example.com/INDEX.htm
http://example.com/Index.htm
http://www.example.com/Index.htm
https://example.com/Index.htm
https://www.example.com/Index.htm
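All of those variants can be folded into one URL with a small normalizer. This sketch assumes the site prefers `https` with the `www.` host, treats `index.htm` and `index.html` as the default document in any letter case, and serves the same content at every variant. Those are assumptions for illustration; on a real site the preferred form is whatever the server is actually configured to redirect to.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed default-document names, compared case-insensitively.
DEFAULT_DOCS = {"index.htm", "index.html"}

def canonicalize(url: str) -> str:
    """Map scheme, host, and default-document variants to one preferred URL."""
    parts = urlsplit(url)

    # Prefer the www. host (an assumption; a site may prefer the bare domain).
    host = parts.netloc.lower()
    if not host.startswith("www."):
        host = "www." + host

    # An empty path and a default document both collapse to the directory URL.
    path = parts.path or "/"
    head, _, last = path.rpartition("/")
    if last.lower() in DEFAULT_DOCS:
        path = head + "/"

    # Prefer https (also an assumption) and drop any fragment.
    return urlunsplit(("https", host, path, parts.query, ""))

variants = [
    "http://example.com",
    "http://www.example.com/",
    "https://example.com/index.htm",
    "http://www.example.com/INDEX.htm",
    "https://www.example.com/Index.htm",
]
canonical = {canonicalize(v) for v in variants}
```

Every variant in the sample list maps to `https://www.example.com/`, so `canonical` ends up with a single entry.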
Canonical Link Element
<link rel="canonical" href="http://example.com/page.html"/>
Rel="prev" & rel="next"
On the first page, http://www.example.com/article?story=abc&page=1:
<link rel="next" href="http://www.example.com/article?story=abc&page=2" />
On the second page, http://www.example.com/article?story=abc&page=2:
<link rel="prev" href="http://www.example.com/article?story=abc&page=1" />
<link rel="next" href="http://www.example.com/article?story=abc&page=3" />
On the third page, http://www.example.com/article?story=abc&page=3:
<link rel="prev" href="http://www.example.com/article?story=abc&page=2" />
<link rel="next" href="http://www.example.com/article?story=abc&page=4" />
And on the last page, http://www.example.com/article?story=abc&page=4:
<link rel="prev" href="http://www.example.com/article?story=abc&page=3" />
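The pattern is mechanical enough to generate: every page but the first gets a prev link, and every page but the last gets a next link. A minimal sketch follows, with `page_url` and `pagination_links` as hypothetical helper names and the URL scheme taken from the example on this slide.

```python
def page_url(base_url: str, page: int) -> str:
    """Build a page URL in the style of the slide's article example."""
    return f"{base_url}&page={page}"

def pagination_links(base_url: str, page: int, last_page: int):
    """Return the rel prev/next link elements for one page of a series."""
    tags = []
    if page > 1:
        tags.append('<link rel="prev" href="%s" />' % page_url(base_url, page - 1))
    if page < last_page:
        tags.append('<link rel="next" href="%s" />' % page_url(base_url, page + 1))
    return tags

base = "http://www.example.com/article?story=abc"
first = pagination_links(base, 1, 4)
middle = pagination_links(base, 2, 4)
last = pagination_links(base, 4, 4)
```

The first page gets only a next link, the last page only a prev link, and the pages in between get both, which matches the four-page example.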
Paginated Product Pages
Paginated Article Pages
View All Pages
Option 1
• Normal Prev/Next sequence
• Self-referential canonicals (each page points to its own URL)
• Noindex meta element on the View All page
Option 2
• Normal Prev/Next sequence
• Canonicals (all pages use the view-all page URL)
http://googlewebmastercentral.blogspot.com/2011/09/view-all-in-search-results.html
rel="alternate" hreflang="x"
HTML link element. In the HTML <head> section of http://www.example.com/, add a link element pointing to the Spanish version of that webpage at http://es.example.com/, like this:
<link rel="alternate" hreflang="es" href="http://es.example.com/" />
HTTP header. If you publish non-HTML files (like PDFs), you can use an HTTP header to indicate a different language version of a URL:
Link: <http://es.example.com/>; rel="alternate"; hreflang="es"
Sitemap. Instead of using markup, you can submit language version information in a Sitemap.
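The link element and the HTTP header carry the same two facts: a language code and the URL of the alternate version. A small sketch that produces both forms is below; the function names are hypothetical, and the output format simply mirrors the two examples on this slide.

```python
def hreflang_link_element(lang: str, url: str) -> str:
    """HTML form, placed in the <head> of the page."""
    return f'<link rel="alternate" hreflang="{lang}" href="{url}" />'

def hreflang_http_header(lang: str, url: str) -> str:
    """HTTP header form, for non-HTML files such as PDFs."""
    return f'Link: <{url}>; rel="alternate"; hreflang="{lang}"'

element = hreflang_link_element("es", "http://es.example.com/")
header = hreflang_http_header("es", "http://es.example.com/")
```

Generating both from one list of language versions keeps the markup and the headers from drifting apart.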
XML Sitemap
• Use canonical links
• Remove 404s
• Don’t set priority past 1 week
• If more than 50,000 URLs, use multiple Sitemaps and a Sitemap index
• Validate with an XML Sitemap validator
• Include a Sitemap statement in robots.txt
http://www.sitemaps.org/
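The 50,000-URL limit and the Sitemap index can be sketched with Python's standard `xml.etree.ElementTree`. The file naming (`sitemap-1.xml`, `sitemap-2.xml`, and so on) and the example.com base URL are assumptions for illustration; only the `urlset`, `url`, `loc`, `sitemapindex`, and `sitemap` elements and the namespace come from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file URL limit from the Sitemap protocol

def build_sitemaps(urls, base="http://www.example.com/sitemap"):
    """Split URLs into Sitemap files and build an index that lists them."""
    chunks = [urls[i:i + MAX_URLS] for i in range(0, len(urls), MAX_URLS)]

    files = {}  # Sitemap file URL -> XML text
    for n, chunk in enumerate(chunks, start=1):
        urlset = ET.Element("urlset", xmlns=NS)
        for u in chunk:
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = u
        files[f"{base}-{n}.xml"] = ET.tostring(urlset, encoding="unicode")

    # The index has one <sitemap><loc> entry per Sitemap file.
    index = ET.Element("sitemapindex", xmlns=NS)
    for name in files:
        ET.SubElement(ET.SubElement(index, "sitemap"), "loc").text = name
    return files, ET.tostring(index, encoding="unicode")

# Hypothetical canonical URLs; one more than fits in a single file.
urls = [f"http://www.example.com/p{i}" for i in range(50_001)]
files, index_xml = build_sitemaps(urls)
```

The Sitemap statement in robots.txt is then a single line of the form `Sitemap: ` followed by the full URL of the index file.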
Crawling vs. XML
Next, we study which of the two crawl systems, Sitemaps and Discovery, sees URLs first. We conduct this test over a dataset consisting of over five billion URLs that were seen by both systems. According to the most recent statistics at the time of the writing, 78% of these URLs were seen by Sitemaps first, compared to 22% that were seen through Discovery first.
Sitemaps: Above and Beyond the Crawl of Duty – http://www.shuri.org/publications/www2009_sitemaps.pdf
Crawling Social Media
Ranking of Search Results based on Microblog data - http://appft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.html&r=1&f=G&l=50&s1=%2220110246457%22.PGNR.&OS=DN/20110246457&RS=DN/20110246457