Building Archivable Websites


Presentation for Stanford Drupal Camp on how and why to build archivable websites.

Published in: Internet, Technology, Design
  • You already design for accessibility, performance, SEO, standards compliance, and usability. Why should you design for archivability?
  • You’re not just building your website for users who will access it today and tomorrow; there’s a whole other class of users you may never have thought about.
  • Broken links don’t just diminish the usability of the contemporary web.
  • Broken links also disrupt the continuity in the relationship between a website and a URL.
  • The URL had a redirect to when the latter became active in 2006. The redirect now allows a user to navigate an unbroken timeline that reflects Representative Bono’s website.
  • Warrick allows you to recover a website from files hosted within web archives, such as the Internet Archive Wayback Machine.
  • Web archives allow users to consult earlier and sometimes overwritten content.
  • Unique institutional history, and institutional history more generally, is reflected online, sometimes only online.
  • Websites document not just the history of individual organizations but also our collective culture.
  • The effective collection, organization, and preservation of web content is increasingly vital to records management.
  • Improving the accessibility of your website to an archival crawler will also tend to make it more accessible to search engine crawlers.
  • Adhering to web standards offers the best chance of faithfully re-presenting the website far into the future. Once archived, the accessibility of your website to all future users is only as good as it is now.
  • An archival crawler finds your content by following links; it can’t archive what it hasn’t discovered.
  • Out of the box, Drupal 7’s robots.txt includes directives that block paths that a search engine crawler might not care about but that might be vital to faithfully re-presenting the website.
  • There are good performance reasons for relying on externally-hosted assets. These can even improve archivability.
  • The risk is that those platforms may not view archiving as favorably as you do, as in this case where Google-hosted fonts cannot be crawled due to their robots.txt.
  • The same principle that motivates hosting some assets externally should also motivate serving reusable local assets from a single place.
  • As with any other client, HTTP caching headers will improve archival crawler performance and decrease load on the web server.
  • Prefer open standards and formats where possible, and widely-used standards and formats where not, to help ensure that they remain understandable.
  • User-agent personalization is opaque to the archival crawler, meaning that only one of many possible versions of a website will be archived. A responsive website, by contrast, can continue to respond to diverse clients in the archive.
  • How your website looks in the Internet Archive Wayback Machine is a reasonably good proxy for its archivability.
  • Heritrix is the open source archival crawler used by Internet Archive and the international cultural heritage web archiving community.
  • Wget is a command-line network file retrieval utility that can be used to mirror websites and that recently added support for an archival data format.
  • HTTrack is a desktop-based web archiving tool for small-scale projects.
  • Wayback is the open source implementation of the Internet Archive Wayback Machine. It allows for re-presentation and temporal browsing of archived web content.
  • WAIL integrates Heritrix, Wayback, and other web archiving tools in a user-friendly, portable application for small-scale projects.
  • Memento offers the prospect of a temporal web, where client requests for a particular URL at a particular moment in time can be redirected to a web archive containing the most closely-matching resource.
  • Archive Ready is a web service akin to the W3C Validator or the WAVE Web Accessibility Tool. It will provide specific tips on how to improve the archivability of your website.
  • Building Archivable Websites
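
The note above about robot exclusion rules can be checked mechanically. The following is a minimal sketch using Python's standard `urllib.robotparser`. It assumes a sample rule set modeled on the directory exclusions in a stock Drupal 7 robots.txt (abbreviated, not a verbatim copy). The user agent name `example-archive-bot` and the `example.org` URLs are hypothetical placeholders.

```python
from urllib.robotparser import RobotFileParser

# Sample rules modeled on the directory exclusions in a stock Drupal 7
# robots.txt (assumption: abbreviated for illustration, not a verbatim copy).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /themes/
"""

# Hypothetical page plus the assets it needs in order to render properly.
PAGE_RESOURCES = [
    "https://example.org/node/1",
    "https://example.org/misc/drupal.js",
    "https://example.org/themes/bartik/css/style.css",
]


def blocked_urls(robots_txt, user_agent, urls):
    """Return the URLs that the given user agent is not allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    # "example-archive-bot" is a made-up user agent name for this sketch.
    for url in blocked_urls(SAMPLE_ROBOTS_TXT, "example-archive-bot", PAGE_RESOURCES):
        print("blocked for archival crawler:", url)
```

In this sketch the page itself is crawlable, but the script and stylesheet it depends on are not, which is the situation the note describes: the archived copy would exist but might not render faithfully.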

    1. Building Archivable Websites. Nicholas Taylor, Web Archiving Service Manager, Digital Library Systems and Services. Drupal Camp, April 19, 2014
    2. Why Build ARCHIVABLE WEBSITES? (“Frosted Spiders' Web” by Jess Wood under CC BY 2.0)
    3. future users are users, too (“a connection between past and future” by Gioia De Antoniis under CC BY-NC-ND 2.0)
    4. maintain web usability (“Broken Web Connections? Welcome to 2009...” by Paul:Ritchie under CC BY-NC-ND 2.0)
    5. improve temporal web usability (Internet Archive: “Wayback Machine”)
    6. improve temporal web usability (Internet Archive: “Wayback Machine”)
    7. recover your lost website (“Warrick”)
    8. refer to earlier website versions (“The Iraq War: Wikipedia Historiography” by STML under CC BY-SA 2.0)
    9. institutional history (Internet Archive Wayback Machine: “Stanford University Homepage”)
    10. websites are cultural artifacts (“The World Wide Web project”)
    11. facilitate compliance
    12. optimize for other crawlers (“SEO on a railway platform” by superboreen under CC BY-NC-ND 2.0)
    13. How to IMPROVE ARCHIVABILITY (“metal web” by paul:74 under CC BY-NC-SA 2.0)
    14. follow web standards and accessibility guidelines (“Web Standards Fortune Cookie” by Matt Herzberger under CC BY-SA 2.0)
    15. use a site map, transparent links, and contiguous navigation (“Card sorting” by Manchester Library under CC BY-SA 2.0)
    16. maintain stable URLs and redirect when necessary (“San Francisco-Oakland Bay Bridge 1442a” by Don Barrett under CC BY-NC-ND 2.0)
    17. use semantically-meaningful URLs (“”)
    18. be careful w/ robot exclusion rules (“drupal/robots.txt at 7.x”)
    19. minimize reliance on external assets necessary for presentation (Internet Archive Wayback Machine: “Stanford Department of English”)
    20. minimize reliance on external assets necessary for presentation (“Stanford Department of English”)
    21. serve reusable assets from a single, common location (Google Images: “stanford university seal”)
    22. specify HTTP response headers for caching and content encoding (“time capsule on Alcatraz” by inajeep under CC BY 2.0)
    23. embed metadata, especially character encoding (“Keep the Packaging!” by davidd under CC BY 2.0)
    24. use durable data formats (“Lascaux cave painting” by Christine McIntosh under CC BY-ND 2.0)
    25. prefer responsive design over user-agent personalization (“«Responsive web design» - 217/366” by Roger Ferrer Ibáñez under CC BY-NC-SA 2.0)
    26. examine your site in the Internet Archive Wayback Machine (Internet Archive Wayback Machine: “Welcome to A Multidimensional Perception ~/*= & PCGuru”)
    27. Web Archiving TOOLS AND SERVICES (“giant mechanical spider & crowd” by mjtmail (tiggy) under CC BY 2.0)
    28. Heritrix (Wikimedia Commons: “File:Heritrix-screenshot.png”)
    29. Wget (Wikimedia Commons: “File:Wget_1.13.4.png”)
    30. HTTrack (“HTTrack Website Copier”)
    31. Wayback (“Internet Archive Wayback Machine”)
    32. Web Archiving Integration Layer (“Web Archiving Integration Layer”)
    33. Memento (“Memento”)
    34. assess archivability w/ Archive Ready (“Archive Ready”)
    35. thank you! (“stanford dish at sunset” by Dan under CC BY-NC-SA 2.0) Nicholas Taylor
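
The recommendations on slides 22 and 23 (HTTP response headers for caching, and declared character encoding) lend themselves to a simple automated check. The sketch below is one possible way to flag gaps in a single response's headers. The function name `audit_headers` and the specific warnings are illustrative assumptions, not part of the presentation or of any particular tool.

```python
def audit_headers(headers):
    """Return a list of archivability warnings for one HTTP response.

    `headers` is a plain dict of response header names to values.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    warnings = []

    # Freshness information lets a crawler avoid refetching unchanged assets.
    if "cache-control" not in normalized and "expires" not in normalized:
        warnings.append("no Cache-Control or Expires header")

    # Validators allow conditional requests on later crawls.
    if "etag" not in normalized and "last-modified" not in normalized:
        warnings.append("no ETag or Last-Modified validator")

    # A declared charset helps future software decode the archived text.
    content_type = normalized.get("content-type", "").lower()
    if content_type.startswith("text/") and "charset=" not in content_type:
        warnings.append("text response does not declare a charset")

    return warnings


if __name__ == "__main__":
    # Hypothetical response headers for illustration.
    sample = {"Content-Type": "text/html", "ETag": '"abc123"'}
    for warning in audit_headers(sample):
        print("warning:", warning)
```

Taking a plain dict keeps the check independent of any HTTP client; the headers could come from a live request, a crawl log, or an archived record.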