Online Collections Crawlability for Libraries, Archives, and Museums


    1. Friendly URLs, Spiders & Robots: Does Google See Your Website?
       or: Online Collections Crawlability for Libraries, Archives, and Museums
       Matt Herbison, Independence Seaport Museum, Philadelphia, PA
       herbison@gmail.com | www.hotbrainstem.org | twitter: @herbison
    2. Online Collections Crawlability for Libraries, Archives, and Museums
       Presentation online at www.bit.ly/LAMcrawling
    3. Don’t assume Google, Bing, and other search engines can see everything on your website.
    4. The Goal: Crawlability
       • Allow and encourage webcrawlers to access everything on your website that you want users to be able to find.
         1. If webcrawlers can’t get to your stuff...
         2. Search engines won’t index your stuff...
         3. Your stuff won’t turn up in users’ web searches...
         4. Users won’t find your stuff!
    5. Search engines crawl: they only follow links
       • If you can’t browse -- that is, click through -- to reach the resources on your website, search engines won’t find those resources
       • Search engines don’t know how to use your search forms -- though they are starting to figure it out (Google, 2008)
       • Content locked in databases is the Deep Web / Hidden Web / Invisible Web problem: searchers and crawlers can’t find anything inside the system from outside the system
    6. Questions to ask about your website
       1. What does your robots.txt file block?
       2. What does your sitemap.xml file include?
       3. How long are your URLs, especially for digital collections items?
       4. Can users find short, permanent URLs for your resources?
       5. Are fancy features hiding your content or data?
       6. Are people (and webcrawlers) getting trapped inside orphaned resources on your website?
    7. Q1: Robots.txt blocking?
       • The robots.txt file tells well-behaved webcrawlers which areas of your website to avoid
       • Typically located at http://yoursite.org/robots.txt
       • Old versions often linger, inadvertently excluding webcrawlers from content you now want indexed
    8. Example: robots.txt
       • Robots.txt file for the Minnesota Historical Society collections website, http://collections.mnhs.org/robots.txt:
         User-agent: *
         Disallow: /VisualResources/
       • This rule tells every webcrawler not to index content in the VisualResources directory/database -- not good!?
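A rule like this can be sanity-checked with Python's standard-library robots.txt parser. A minimal sketch (the specific paths tested below are illustrative, not taken from the real site):

```python
from urllib.robotparser import RobotFileParser

# Parse the two-line robots.txt shown on the slide.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /VisualResources/",
])

# Anything under /VisualResources/ is off-limits to every crawler...
print(rp.can_fetch("*", "http://collections.mnhs.org/VisualResources/Details.cfm"))  # False
# ...while the rest of the site remains crawlable.
print(rp.can_fetch("*", "http://collections.mnhs.org/library/"))  # True
```

In practice you would point `RobotFileParser` at the live file with `set_url(...)` and `read()`; parsing the lines directly just makes the example self-contained.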
    9. Q2: Using sitemap.xml?
       • Where robots.txt files exclude webcrawlers from certain content, sitemap.xml files tell webcrawlers what to include
       • Sitemaps explicitly list the pages to index, with the goal that webcrawlers won’t miss anything
       • Like traditional sitemaps for website visitors, but built so webcrawlers can know every webpage in your site
    10. Q2: Using sitemap.xml? (cont’d)
       • Typically located at http://yoursite.org/sitemap.xml, but you can have multiple sitemaps for sections of a website
       • OAI-PMH is similar to sitemap files: supported by Google, then not, then incorporated into WorldCat
    11. Example: sitemap.xml
       • Vanderbilt Television News Archive is a good, complex example (tvnews.vanderbilt.edu): 1968–now
       • More than 800,000 records, most with summaries that describe a news story, including the event, people involved, and the video shown
       • System caters to both webcrawlers and human users
       • Primary sitemap points to hundreds of year-month sitemaps; each of these in turn points to a URL for each record that describes the content of a TV news story
       • Result: exposes the URLs for each of the 800,000+ resource records -- they come up in web searches
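A sitemap like the ones described above is simple to generate from a list of record URLs. A minimal sketch using Python's standard XML library (the item URLs are hypothetical placeholders, not Vanderbilt's):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(record_urls):
    """Build a sitemap.xml document with one <url><loc> entry per record."""
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for loc in record_urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical permalink-style record URLs:
xml_doc = build_sitemap([
    "http://yoursite.org/items/1",
    "http://yoursite.org/items/2",
])
print(xml_doc)
```

For hundreds of thousands of records, the same approach extends to a sitemap index file pointing at many smaller sitemaps, which is what the Vanderbilt example does.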
    12. A Reminder: The Point Is...
       If webcrawling robots cannot crawl your pages, users will not find them when they do web searches
    13. Q3: Long URLs?
       • Database-driven websites often rely on long URLs for both browsing and searching
       • These long URLs contain a number of parameters that tell the database what to find and display
       • A simple 1-parameter search or browse:
         www.example.com/search?medium=daguerreotype
       • A more complex 4-parameter search or browse:
         www.example.com/search?medium=daguerreotype&decade=1850s&creator=putnam&theme=family
    14. Example: Long URL -- Tidy
       • Used faceted browsing to get to: Duke Libraries > Digital Collections > Sidney D. Gamble Photographs > Russia > closeup > group
       • URL:
         http://library.duke.edu/digitalcollections/gamble/search/results?q=str_dcterms.spatial.Country:Russia&fq=str_dc.subject:closeup+AND+str_dc.subject:group
       • Can also get the same results with a search:
         http://library.duke.edu/digitalcollections/search/results?t=russia+closeup+group
    15. Example: Long URL -- UnTidy
       • The image featured on the NARA "ARC Guide for Genealogists and Family Historians" seems to have the following URL:
         http://arcweb.archives.gov/arc/action/ShowFullRecord?tab=showFullDescriptionTabs/digital&%24searchId=1&%24showFullDescriptionTabs.selectedPaneId=details&%24digiDetailPageModel.currentPage=0&%24resultsPartitionPageModel.targetModel=true&%24resultsSummaryPageModel.pageSize=10&%24partitionIndex=0&%24digiSummaryPageModel.targetModel=true&%24submitId=1&%24resultsDetailPageModel.search=true&%24digiDetailPageModel.resultPageModel=true&%24resultsDetailPageModel.currentPage=0&%24showArchivalDescriptionsTabs.selectedPaneId=&%24resultsDetailPageModel.pageSize=1&%24resultsSummaryPageModel.targetModel=true&%24sort=RELEVANCE_ASC&%24resultsPartitionPageModel.search=true&%24highlight=false
    16. Example: Long URL -- UnTidy (cont’d)
       • But if you request a resource record to be emailed to you, you get a simple URL:
         http://arcweb.archives.gov/arc/action/ExternalIdSearch?id=594479
       • The resources on the NARA website are a nightmare of crawlability: no browsing, enormous and unstable URLs (even though they don’t need to be), and features that would hide descriptive data from webcrawlers even if they could get into the site
    17. Long URLs: Solutions
       • Many different approaches, depending on your collections system. Ask your IT folks about:
         • Apache mod_rewrite, which is useful for “retrofitting” older websites with tidier URLs
         • SEO-friendly URL tools, which convert parameters into “directories” in the URL, whether for searching or faceted browsing
         • Google Webmaster Tools, which allows you to tell the Googlebot webcrawler which parameters to ignore or include
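As a sketch of the mod_rewrite retrofit, a single rule can map a tidy "directory-style" URL onto an existing long-URL handler. The paths and the `id` parameter here are hypothetical, not from any system named in the talk:

```apache
# Hypothetical retrofit: publish /items/594479 as the tidy public URL
# while the underlying application still answers /search?id=594479.
RewriteEngine On
RewriteRule ^items/([0-9]+)$ /search?id=$1 [PT,L]
```

The tidy form is what you would then put in sitemap.xml and share as the permalink; the long form keeps working behind the scenes.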
    18. Q4: Short, permanent URLs?
       • For a database-driven system, it should be possible to assign a single unique ID to any resource
       • These unique IDs function well as a single-parameter, permanent URL (aka permalink), with benefits:
         • Easy for people to share (and link into your website)
         • Easy to share with webcrawlers in sitemap.xml
    19. Examples: Short, permanent URLs
       • http://arcweb.archives.gov/arc/action/ExternalIdSearch?id=594479
       • http://collections.mnhs.org/visualresources/Details.cfm?ImageID=215871
       • http://library.duke.edu/digitalcollections/gamble.russia-35
       Is this the most boring slide yet? Just wait.
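The idea behind these permalinks can be shown in code: a small, hypothetical helper (not part of any system above) that collapses a long multi-parameter URL down to its one essential ID parameter, using Python's standard URL tools. The extra `searchId` and `page` parameters in the example are invented for illustration:

```python
from urllib.parse import parse_qs, urlsplit

def to_permalink(url, id_param, permalink_base):
    """Extract the unique-ID parameter from a long URL and rebuild it
    as a single-parameter permalink. Raises KeyError if the ID is absent."""
    query = parse_qs(urlsplit(url).query)
    resource_id = query[id_param][0]
    return f"{permalink_base}?{id_param}={resource_id}"

# The MNHS URL from the slide, plus invented session clutter:
long_url = ("http://collections.mnhs.org/visualresources/"
            "Details.cfm?ImageID=215871&searchId=7&page=3")
print(to_permalink(long_url, "ImageID",
                   "http://collections.mnhs.org/visualresources/Details.cfm"))
# -> http://collections.mnhs.org/visualresources/Details.cfm?ImageID=215871
```

The stable, one-parameter form is what belongs in sitemap.xml and on "share this" links; the session parameters are exactly the kind of thing you would tell Googlebot to ignore.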
    20. Q5: Hidden in fancy features?
       So sayeth Google: “If fancy features such as AJAX, JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your site in a plain-text browser, then search engine spiders may have trouble crawling your site.”
       (http://www.google.com/support/webmasters/bin/answer.py?answer=35769)
    21. Example and Warning: Hidden in fancy features?
       • Don’t let anyone sell you on a Flash-only system for collections access if you’d actually like users to find your materials.
       • The prettiest sad example may be the Peabody Essex Museum’s ARTscape: http://www.pem.org/artscape
    22. Q6: Trapped too-far-in?
       • Google can dump people deep into your website, where they end up at orphaned webpages
       • Often a problem with webpages inside frames
       • Also happens with pop-up or PDF "see more detail" pages, like some OCR’d transcripts without citation/repository info
       • Minor problem: users can’t click back to the “upstream” resource, but they know which repository they’re in
       • Bigger problem: no indication on the webpage of which repository/website the stuff came from
    23. General approaches for beating the system
       • Incoming links highlight resources wherever they reside inside your website (example: older version of Ohio Memory)
       • Familiarize yourself with the tutorials and tools offered through Google’s Webmaster Tools
       • Static HTML website. Yes? No.
       • Potatoes, &c, &c
    24. Do Google & others see your stuff?
       • Talk to your IT people
       • Good starting point to test your website:
         • Using your website’s own search function, search for a known (unique-ish) phrase in a resource record
           Example: “she joined the committee of transit”
         • Do the same search from Google, quoted and limited to your website domain
           Example: “she joined the committee of transit” site:yoursite.org
    25. And ask questions about your website:
       1. What does your robots.txt file block?
       2. What does your sitemap.xml file include?
       3. How long are your URLs, especially for digital collections items?
       4. Can users find short, permanent URLs for your resources?
       5. Are fancy features hiding your content or data?
       6. Are people (and webcrawlers) getting trapped inside orphaned resources on your website?
    26. Bigger SEO questions for discussion
       • From a web-searching point of view, do we need to distinguish between the value of item-level (e.g., digital object) descriptive resources versus collection-level (or aggregated) descriptive resources?
       • Item-level search engine hits often come from known-item searches
       • Topical searches favor richer content, which can come from more description, multiple items being described, and more web-wise context (aka links)
       • an example...
    27. Bigger SEO questions for discussion (cont’d)
       • ...the question of searches taking users to the item level or the aggregated level. Examples:
       • CONTENTdm browse pages often feature 20 items per page; they are generally more likely to come up in topical searches than any of the item-level pages
       • Web-search results often include within-site search results that some other user ran in the past and linked to somewhere -- things that have been “passively curated” by a user by virtue of a simple in-site search
    28. Bigger SEO questions for discussion (cont’d)
       • Is our goal to get web searchers into our body of resources, so that from there they can look around in our stuff (i.e., in-site searching and browsing)?
       • Since Wikipedia is always top-ranked anyway, does it make more sense to focus efforts on adding logical and helpful links to and from Wikipedia pages? Is this more for the sake of serving researchers than for the SEO value of incoming links?
       • Is it ethically OK to use SEO to push our SERP rankings above other webpages/sites that will probably be more helpful for researchers?
    29. More thorough presentation online: www.bit.ly/LAMcrawling
       Matt Herbison | herbison@gmail.com | www.hotbrainstem.org | twitter: @herbison
