Preserving the web


An overview of web archiving and the experience of Texas A&M University - Commerce using the Internet Archive's Archive-It service.

Published in: Entertainment & Humor
  • Why save? Importance of the information and the ephemeral nature of the medium. Over the last twenty years the internet has become ubiquitous in our lives. It affects how we create, receive, and disseminate information. It is also transforming rapidly. The way the web looks today is very different from how it looked even five years ago, and from how it will look five years from now. Unless it is actively preserved, the information of today's web, and the ways in which we interact with it, will be lost in this flood of change. At Texas A&M University – Commerce, the website was increasingly being used to disseminate materials that had traditionally been collected by the archive in print formats: the catalog, annual reports, the alumni magazine. For a time many of these were distributed both in print and electronically, but in recent years cost and environmental impact have pushed many publications to online-only formats. Our approach in the archives had been to save and print these items for the collections, but this approach was not ideal for a number of reasons. First, it preserved the content but lost the context in which the original documents were produced and used. Additionally, the ability to perform full-text searching is lost, and some publications (such as the university catalog) that are created as a relational database cannot be reproduced as a print document. Finally, an ad hoc approach to finding and saving documents produced on the web was simply not a systematic or permanent solution. All of this was brought to our attention when, in preparation for the University's reaccreditation with SACS (Southern Association of Colleges and Schools), the prep team came to the archives looking for materials produced for the 2003 accreditation. While the university archive contains extensive materials produced for accreditation over the last fifty years, the reports produced ten years ago had been placed online and were no longer available.
IT was able to retrieve and reproduce much of that material from the back end of the servers, but this served as a wake-up call that the archives really needed a better solution for preserving our university web presence.
  • When considering a web archiving program, one first has to decide whether a DIY approach or a subscription to a web archiving service makes the most sense. There are a variety of open source tools and utilities which can be modified to fit your institutional needs. Additionally, the size and scope of the content you hope to capture from the web will influence your decision. With a hosted solution, you have the reliability of an established digital collection, with support staff to train you, answer questions, and work with you to solve problems as they come up. Another possibility is partnering with other institutions to create a web archive collaborative. In our case, we do not have dedicated IT staff within the library, and the general campus IT cannot support a customized web harvesting installation. So for our needs we looked for either an approach that could be taken on by staff without a programming background, or a hosted solution which would handle the technical end of the project, allowing us to use our expertise as librarians and archivists.
  • Low barriers to entry; does not require server-side installation. But once you've captured a site, you've still got to worry about long-term digital preservation. If you have a robust electronic records strategy then this is not really a problem.
  • OCLC's answer to web archiving. Uses a proprietary tool to capture, search, and display.
  • A fantastic collaborative option for University of California affiliated organizations, using open source tools developed by the Internet Archive. Outside of UC institutions, access is granted through an annual subscription model.
  • Heritrix saves in ARC format; Wayback will display ARC and WARC formats. Since the Internet Archive develops these tools, it is well positioned to deploy updates and new approaches as they are developed. The Internet Archive has been preserving websites since 1996, and the Archive-It service has been in existence since 2006.
  • Smallest subscription level: $8000 per year. Data Budget 768GB and Document Budget 8,000,000. Used to this point: 1,078,045 documents and 47GB.
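To put those figures in proportion, the short sketch below computes how much of each budget has been consumed. The numbers are the ones quoted in the note above; the function and variable names are illustrative only and are not part of any Archive-It tool.

```python
def budget_used(used, budget):
    """Return the share of a budget consumed, as a percentage."""
    return 100.0 * used / budget


# Figures quoted in the note above.
DOCUMENT_BUDGET = 8_000_000
DATA_BUDGET_GB = 768
documents_used = 1_078_045
data_used_gb = 47

doc_pct = budget_used(documents_used, DOCUMENT_BUDGET)
data_pct = budget_used(data_used_gb, DATA_BUDGET_GB)

print(f"Documents: {doc_pct:.1f}% of budget used")
print(f"Data: {data_pct:.1f}% of budget used")
```

By this arithmetic, roughly an eighth of the document budget and well under a tenth of the data budget had been used at that point.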
  • The first step in launching our web archiving program was informing various groups on campus and getting them on board with the project. This began with the head of special collections, our digital collections librarian, and me speaking with the staff of Archive-It and learning about the service. We ran initial crawls of our sites to test the capabilities of the crawler and Wayback browser. Next we presented Archive-It to the library as a whole, and explained how it works and the benefits of capturing university sites. After that the director of the library asked that the head of special collections and I present our proposal to the executive committee of the university, a group of about twenty administrators from all parts of campus, including the president of the university. After we explained that Archive-It would only capture and preserve publicly available materials, with no email or confidential information, the committee expressed near unanimous approval. Following that we met with representatives from campus IT to discuss the technical aspects of the crawler, and then we had support from the whole of campus to proceed.
  • Identified 24 seed URLs covering our campus websites, social media and media outlets
  • Crawls can be set to run anywhere from twice daily to once per year. To determine how often to crawl, one must evaluate the purpose of capture. Are you attempting to get a comprehensive log of changes over time, or to capture a snapshot of a site on a particular day? One must also determine how often sites are updated, or how often content is removed from a site, so that you are assured the crawler will capture that information before it's replaced. The scope tells the crawler how deeply to go into a site when following links, what areas to capture, and what to ignore. Social media sites like Facebook and Twitter are particularly challenging to capture. On these sites, you want to capture a specific feed or profile, not the entire site, so you add specific rules for the crawler on how to navigate the site. An additional challenge is that Facebook and Twitter change the way they structure and deliver their sites, deploying new features or eliminating old ones, without any notice. This leaves Archive-It and its partners scrambling to develop new techniques to capture it all.
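The idea of scoping a crawl to one feed or profile rather than a whole site can be sketched as a simple prefix rule. This is a minimal illustration, not Archive-It's or Heritrix's actual scoping mechanism; the function name `in_scope` and the `social.example.com` URLs are hypothetical.

```python
from urllib.parse import urlparse


def in_scope(url, seed):
    """Return True if url is on the seed's host and at or below the seed's path."""
    candidate, root = urlparse(url), urlparse(seed)
    if candidate.netloc.lower() != root.netloc.lower():
        return False  # different host: outside the crawl
    seed_path = root.path.rstrip("/")
    # Accept the seed page itself, or anything nested beneath it.
    return candidate.path == seed_path or candidate.path.startswith(seed_path + "/")


# Hypothetical seed: one profile on a large social media site.
seed = "http://social.example.com/universityfeed"

print(in_scope("http://social.example.com/universityfeed/status/123", seed))
print(in_scope("http://social.example.com/someoneelse", seed))
```

With a rule like this, links leading from the profile to the rest of the site are ignored, while everything under the profile itself is still captured.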
  • robots.txt is a tool that allows webmasters to restrict a site, in part or in total, from web crawlers and other "robots" that access their pages. Heritrix by default respects robots.txt files, in an effort to conform to the etiquette of the web. If a site designated for capture cannot be crawled because of robots.txt, the webmaster can add an exception that lets the Archive-It crawler in (robots.txt rules are matched by the crawler's user-agent name rather than by IP address). Alternately, Heritrix can be configured to ignore robots.txt.
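As a rough illustration of how such an exception works, the sketch below uses Python's standard `urllib.robotparser` to evaluate a sample robots.txt. The file contents, the user-agent name `example-archive-bot`, and the `www.example.edu` URLs are all made up for the example; they are not the real Archive-It crawler name or any real site's rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is allowed everywhere,
# every other robot is kept out of two areas.
ROBOTS_TXT = """\
User-agent: example-archive-bot
Allow: /

User-agent: *
Disallow: /calendar/
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
page = "http://www.example.edu/calendar/2013/05"

print(parser.can_fetch("example-archive-bot", page))  # the named crawler
print(parser.can_fetch("some-other-bot", page))       # any other robot
```

A crawler that respects robots.txt would perform a check of this kind before requesting each page.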
  • Additionally, when conducting a crawl, one may encounter various crawler traps. These are areas of a website that are dynamically generated and infinitely deep. An example may be a calendar page which advances month by month, farther and farther into the future to 9999, well beyond where any actual content may be stored. The crawler simply advances through the links forever, rather than moving on to other areas of the site with pertinent information. In many cases robots.txt files are there to block areas of a site that contain such traps, in which case respecting robots.txt is actually a good thing.
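One way to picture a guard against the calendar trap described above is a heuristic that rejects URLs which are nested too deeply or which mention a year far in the future. This is only a sketch under assumed thresholds; `is_probable_trap`, `latest_year`, and `max_depth` are hypothetical names and values, and the example says nothing about how Heritrix itself handles traps.

```python
import re
from urllib.parse import urlparse

# A run of exactly four digits, e.g. the "9999" in /calendar/9999/01.
YEAR_PATTERN = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def is_probable_trap(url, latest_year=2020, max_depth=8):
    """Heuristically flag URLs that look like an endless calendar or deep loop."""
    parsed = urlparse(url)
    segments = [part for part in parsed.path.split("/") if part]
    if len(segments) > max_depth:
        return True  # suspiciously deep path
    for match in YEAR_PATTERN.finditer(parsed.path + "?" + parsed.query):
        if int(match.group(1)) > latest_year:
            return True  # a "year" beyond any plausible content
    return False


print(is_probable_trap("http://www.example.edu/calendar/9999/01"))
print(is_probable_trap("http://www.example.edu/calendar/2013/05"))
```

Because any four-digit number above the cutoff is treated as a year, a rule like this can produce false positives (for instance on numeric IDs), so in practice it would need tuning per site.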
  • We've chosen to add only seed-level metadata, as full-text searching allows a high degree of precision in discovery. However, Archive-It is very flexible: it uses the Dublin Core metadata standard and allows users to add as much or as little metadata as they see fit.
  • Most of the effort to establish a collection occurs at the front end of the process: refining seed URLs and scoping rules, and determining crawl frequency. Then it becomes a matter of monitoring the crawls as they occur, and making sure everything is happening the way you want it to. Finally, you must keep a continual eye out for changes, such as new sites or social media being utilized. Our university went through the process of changing domain names. When this happened I had to go back to the seed URLs and change them to the new domain. Periodically I check for new active Facebook pages or Twitter accounts. Many departments or groups will establish a social media presence on a site, then over time abandon it. So I keep an eye out for what's being used, or for a new social media site that is gaining a following, so that I'm sure to capture as much as possible in that arena.
  • And now this content is available to anyone, both through our website and through the Archive-It page. Our landing page identifies us as the collection creators. You can either perform a keyword search or browse by seed URLs. The collection shows when a site was captured and how often, and while you browse the site a banner appears telling you that this is an archived page, captured on this date, and a part of the Texas A&M University – Commerce collections. We are considering expanding our web archive to include other Northeast Texas sites. Specifically, we would like to capture smaller newspaper and media sites in our region which are unlikely to be preserved otherwise. Unlike our University Website and Social Media collection, this would involve capturing content not created by the university. We would have to get permission from the site owners to begin such a capture, which would mean adding a whole other group of stakeholders whose support we would need. As our university web collection grows, we can use it as an example of the value of preserving this content, and as a gateway to expanding our web archive collection.
  • Preserving the web

    1. Preserving the Web: One institution's foray into Digital Preservation through Web Archiving. Jeremy Floyd, Texas A&M University – Commerce, twitter @jjamesfloyd
    2. Why save the web? Google Data Center. The Dalles, Oregon 2012
    3. Approaches and Considerations • Do It Yourself Approach • IT infrastructure • Level of 'In-house' Expertise • Long Term Digital Preservation • Hosted Solutions • Annual Expenditure • Options for Joining a Consortium or Collaborative. Alington, Greg. 1936. "A Book Mark Would be Better." Made for the Illinois WPA Art Project. From Library of Congress Print and Photographs Online Catalog
    4. HTTrack • Free open source software • Allows downloading of websites to a local drive • Preserves content and structure of target sites
    5. OCLC Web Harvester • Runs OCLC's own Webcrawler • Can Import Directly into CONTENTdm and Connexion Catalog • Discoverable in WorldCat • Can be Saved in OCLC Digital Archive
    6. California Digital Library Web Archiving Service • Free to join for all UC departments and organizations (charged only for storage) • Fee based subscription service for all other institutions • Utilizes Heritrix web crawler for capture and Wayback for display and Nutchwax search engine • 56 public archives • 21 partners • 4407 web sites • 616,585,489 documents • 32.3 TB of data
    7. The Internet Archive Archive-It • Subscription Service • Heritrix web crawler • Nutchwax search engine • Wayback Machine browser (all developed and maintained by the Internet Archive) • More than 225 partner organizations • 5,214,935,471 URLs in 2,056 collections • Partners in 45 states and 15 countries including university libraries, state archives, historical societies, federal institutions, NGOs, public libraries, and museums
    8. Texas A&M University – Commerce partnered with Archive-It
    9. Gathering Support Among Constituencies and Stakeholders. All aboard! Liberty Bond fourth issue Sept. 28 - Oct. 19, 1918. From Library of Congress Print and Photographs Online Catalog
    10. Selecting Seed URLs • University Websites • Twitter • Youtube • University News and Media • Athletics /242136009137926?ref=ts/ http://TheEastTexanOnline.com
    11. Managing Scope and Frequency of Crawls
    12. robots.txt. "Robots- Electro and Sparko" 1940. Still image. Computer History Museum
    13. Crawler Traps. "It's A Trap" 2010. Know Your Meme
    14. Adding Descriptive Metadata. Rebecca Goldman. 2009. "Core Values." Derangement and Description.
    15. Establishing a Workflow
    16. Access and Future Growth
    17. Further Resources • Niu, Jinfang. 2012. "An Overview of Web Archiving." D-Lib Magazine 18(3/4) • LOC Signal Blog • International Internet Preservation Consortium (IIPC) • International Web Archiving Workshop (2001 – 2010) • Society of American Archivists: Web Archiving Roundtable. twitter: @jjamesfloyd