From Seed to Harvest: Web Archiving Program Considerations for SUL


Presentation on web archiving program considerations and elements, given at Stanford University Libraries as part of candidacy for the Web Archiving Service Manager position.

Published in: Technology, Education
  • A little bit of my background: I've worked for the last two-and-a-half years for the Library of Congress Web Archiving program. That program has been running for 13 years now, accumulating over 60 collections, many of them focused on public policy and the legislative branch; over 13,000 nominated websites; and over 400 terabytes of content. Our large-scale crawling is provided by Internet Archive, creating additional workflow requirements and complexities. I am 1 of 3 project managers in the 5-person web archiving team and have personally transferred and ingested over 200 terabytes of content into the repository, reviewed hundreds of websites, and have been involved in the planning or have directly managed all upgrades to our workflow tools over the last couple of years, including especially our data management and QR tools.
  • Some of you may have seen or even been involved in the creation of Archive-It’s Web Archiving Life Cycle Model, released in March. I thought this would be a useful way of reviewing program considerations.
  • The outer circle consists of broader program elements. The inner circle is more particular in focus, concentrating on the mechanics and requirements of workflows.
  • I'll start on the outside with program elements and work my way in.
  • The foremost questions for a web archiving program are, “what is it meant to do?” and “what is its relationship to the mission of larger organizational structures?”
  • Stanford University Libraries is and has been involved in so many interesting and innovative digital library and digital preservation projects. Throughout the rest of this presentation, I refer to best practices and approaches for web archiving from other institutions. My vision for Stanford web archiving is not just to stand up a production service, but also to develop tools and approaches that are as innovative as these other projects.
  • Looking at the mission of the Stanford University Libraries, it seemed to me that the key points were “providing diverse resources and services” and, in doing so, “supporting research and instruction.”
  • Looking then at the mission of the Digital Library Systems and Services group, where the web archiving program would be situated, its role seemed to be to provide IT infrastructure, research, and development in furtherance of the Libraries' mission.
  • Considering the missions of its parents, a web archiving program mission might look something like this.
  • The mission may be operationalized by articulating more concrete objectives, such as building infrastructure, developing distributed staff expertise, and identifying classes of content for collecting.
  • Program goals can only be achieved through the allocation of resources and development of workflows. What resources will a web archiving program require?
  • Cost modeling is a foremost concern for resource planning. Stanford has some familiarity with cost models from service providers such as Archive-It and CDL’s Web Archiving service, which are based on quotas for data volume, seed count, crawl duration, and/or number of active collections. One advantage of bringing web archiving in-house is to be able to provide better quality captures than these limits permit. The challenge will be ensuring that the cost model is easy to understand, provides for quality archiving, and is financially sustainable. Cost modeling is difficult. At the Library of Congress, our bulk crawling contract with Internet Archive was based on a ceiling on the number of seeds. When we tried to model the costs for website nominators within the Library on a per-seed basis, we found that the more seeds were submitted to the crawl, the less each seed “cost”, communicating a muddled price signal to nominators.
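
The muddled per-seed price signal can be seen with a little arithmetic. This is a minimal sketch that assumes a flat contract fee divided evenly across however many seeds are submitted; the fee and seed counts are hypothetical illustrations, not Library of Congress figures.

```python
def per_seed_cost(contract_fee: float, seeds_submitted: int) -> float:
    """Average cost attributed to each seed under a flat-fee contract."""
    if seeds_submitted <= 0:
        raise ValueError("seeds_submitted must be positive")
    return contract_fee / seeds_submitted


# Hypothetical flat fee, chosen only to make the arithmetic easy to follow.
CONTRACT_FEE = 100_000.0

for seeds in (1_000, 5_000, 10_000):
    # The total bill stays the same while the apparent per-seed price falls.
    print(f"{seeds:>6} seeds -> {per_seed_cost(CONTRACT_FEE, seeds):.2f} per seed")
```

Under these assumptions a nominator who submits more seeds appears to pay less for each one, even though the overall cost of the contract has not changed.
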
  • A production web archiving program will involve an ongoing or episodic commitment from other staff beyond the two intended FTEs. Curators propose collections and select websites. System administrators maintain the IT infrastructure for web archiving systems. Software engineers enhance web archiving workflows. Technical services staff enhance descriptive metadata and facilitate discovery. Legal counsel helps establish and refine sound legal terms for the operation of the service.
  • Depending on the scale of the program, web archiving may have significant demands on compute, memory, and storage. Indexing and analysis will require robust IT infrastructure.
  • I understood from Stanford’s web archiving report from several years ago that there was interest in the Web Curator Tool. Web Curator Tool would be a good drop-in solution for handling a number of customer-facing elements of the production web archiving workflow.
  • Don’t worry, though, there will be plenty left to automate. At the Library of Congress, for instance, I’ve spent a significant amount of time improving the movement and management of data from Internet Archive to our repository.
  • The eventual aim of workflows is to support access and use.
  • The access policy should specify under what terms content can be made available to which designated users and will need to be determined by assessing relevant legal and other risks.
  • The most typical access method is the Wayback Machine, an open source program developed by Internet Archive for ARC and WARC web archive file format replay. It allows you to browse date snapshots for individual URLs and also provides an XML API.
  • Wayback Machine is in fact the most common access interface used by the international cultural heritage web archiving community.
  • Another advantage of using Wayback Machine is that it natively supports Memento, a prototype extension of the HTTP protocol that will facilitate discovery of resources in distributed web archives.
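
As a rough illustration of how Memento works at the HTTP level: a client asks for a resource as it existed around a given time by sending an `Accept-Datetime` request header. The sketch below only builds that header. The helper name is hypothetical, and no particular archive's endpoint is assumed.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(when: datetime) -> dict:
    """Build an Accept-Datetime request header for datetime negotiation."""
    # HTTP dates are expressed in GMT, so normalize to UTC before formatting.
    when_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}


headers = accept_datetime_header(datetime(2013, 4, 17, tzinfo=timezone.utc))
print(headers)
```

The resulting dictionary could be passed as request headers by whatever HTTP client is in use.
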
  • Of greater importance will be supporting local discovery. I saw that some web archiving collections already had records in SearchWorks. In addition to cataloging web archive collections, it’d be worth assessing the feasibility of website-level records.
  • Full-text search is becoming more common, especially among smaller institutions with smaller collections and larger institutions with more robust infrastructure. This is something that Stanford should consider, to augment other forms of access.
  • Access is interdependent with preservation. How are web archives preserved?
  • The most basic requirement for preservation of web archives, as with other forms of digital content, is bit preservation. Checksums should be generated for all content encapsulated in the SIP and checked every time the data is copied to a new filesystem. An AIP should be stored in SDR.
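
The fixity check described above can be sketched as follows. This is a minimal illustration, not a description of how SDR handles fixity. The choice of SHA-256 and the function names are assumptions.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute a SHA-256 checksum without loading the whole file into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_with_fixity_check(source: Path, destination: Path) -> str:
    """Copy a file, then confirm the copy's checksum matches the original's."""
    expected = sha256_of(source)
    shutil.copyfile(source, destination)
    actual = sha256_of(destination)
    if actual != expected:
        raise IOError(f"checksum mismatch for {destination}")
    # Return the checksum so it can be recorded alongside the package.
    return expected
```

Reading in chunks matters here because WARC files are often too large to hold comfortably in memory.
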
  • Beyond bit preservation, there are not yet widely adopted approaches to web archive preservation engineering. I’ve participated in the IIPC Preservation Working Group for the last couple of years. We recently created and distributed a preservation survey among IIPC members. Of the dozen or so institutions that have filled it out so far, responses are all over the map in terms of approaches to data normalization, perception of file format obsolescence risk, and technical metadata requirements. The most ambitious of current efforts is being undertaken by the Austrian National Library, who are collecting technical metadata for individual items within their ARCs and WARCs. More moderate approaches that Stanford might consider are those of BnF and Harvard, who collect and store technical metadata principally about the container files themselves.
  • Preservation is a means of mitigating one kind of risk. There are other kinds of risks to be managed.
  • These are some of the other risks a web archiving program will have to confront. Over the last year, I’ve become increasingly sensitized to the risk of inadequate use or under-developed stakeholders at the Library of Congress as access server space becomes scarce and IT Services wants to get the best return on investment of limited resources.
  • As visually indicated by the fact that it surrounds the entire life cycle, policy provides the foundation for all of the other program elements.
  • The legal issues in web archiving center primarily though not exclusively on copyright. Section 108 of the Copyright Act provides exceptions for library preservation of at-risk materials, but does not cover web archiving. Web archiving programs address the copyright issue through a combination of fair use best practices and permissions, with provisions for opting out of crawling or de-accessifying crawled content. One of the challenges related to copyright permissions in the Library of Congress context is that permissions requirements mandated by legal counsel are collection-specific, leading to problems when we collect websites with the same content owner in collections where different permissions are specified.
  • The other major policy area is collection development. This will be informed by existing collection development and records policies. The web is much bigger than any one institution’s capability to collect. Collection development policy should help determine what should be collected, how much of it, how comprehensive or representative it should be, what collecting should take place outside of a collection framework (e.g., Technical Services’ EEMs), and so on.
  • So, so far I’ve been talking about the broader elements of a web archiving program. Now I’m going to talk about more workflow-level considerations.
  • Collection development policy defines what the combination of individual collecting projects should look like in the aggregate. Appraisal and selection are how, within an individual collecting project, a curator decides to collect one resource versus another.
  • Selection is a challengingly subjective task, especially given the size of the web. Criteria to consider include the value of the website, the risk of its disappearance, the resources it would take to archive it, and the extent to which it has already been archived by other institutions.
  • There have been some nascent efforts to crowd-source the problem. The UK Web Archive is currently working on a tool to identify frequently-cited links in curated Twitter streams.
  • There’s been some discussion of using a live monitor of Wikipedia edits for the same purpose.
  • Wikipedia itself is a crowd-sourced production and may also be used to seed certain topical collections.
  • Lastly, the University of North Texas Nomination Tool is used collaboratively by many web archiving institutions and archivists to pool seeds, often in response to breaking events. It was used most recently to curate seed lists for a papal transition crawl and an end of presidential term crawl.
  • After appraising and selecting a resource to be collected, the next essential step is to define the scope of the crawl.
  • Scoping is creating instructions for where the crawler should or should not go, after setting out from the seed URLs. It is the primary mechanism for ensuring crawling resources are used most efficiently and the primary focus of QA. Seeds and scopes are sometimes fungible from a crawling perspective but not from a permissions workflow or cataloging perspective.
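
To make the idea of scoping concrete, here is a minimal sketch of a scope decision: stay on the seed's host, stay under the seed's directory, and skip anything matching an exclusion pattern. This is an illustrative rule set under those assumptions, not how Heritrix or any other crawler actually implements scoping. The function and parameter names are hypothetical.

```python
import re
from urllib.parse import urlsplit


def in_scope(url: str, seed: str, exclude_patterns=()) -> bool:
    """Return True if url is under the seed's host and path and is not excluded."""
    target, origin = urlsplit(url), urlsplit(seed)
    if target.scheme not in ("http", "https"):
        return False
    if target.hostname != origin.hostname:
        return False
    # Treat the seed path up to its last slash as the directory to stay under.
    seed_dir = origin.path.rsplit("/", 1)[0] + "/"
    if not (target.path or "/").startswith(seed_dir):
        return False
    # Exclusions let a curator rule out traps such as endless calendars.
    return not any(re.search(pattern, url) for pattern in exclude_patterns)


seed = "http://example.org/news/"
print(in_scope("http://example.org/news/2013/story.html", seed))
print(in_scope("http://example.org/about/", seed))
print(in_scope("http://example.org/news/calendar/2099/", seed, [r"/calendar/"]))
```

A rule like the exclusion pattern is typically the kind of adjustment that comes out of QA, when a reviewer notices the crawler spending its budget somewhere unproductive.
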
  • Once seeds are selected and scoping is configured, you deploy software to capture the data.
  • Just as Wayback Machine is the most typical software used for web archive access, its counterpart Heritrix is the most common software used for data capture. Heritrix is an open source, scalable, archival web crawler and stores captured content in ISO-standard WARC files.
  • Heritrix is not the only data capture tool available, nor the only one that produces WARC files. Wget and the Web Archiving Integration Layer may be useful to consider for test and/or small-scale crawling. George Washington University’s social feed manager is a tool for archiving Twitter streams and is an example of how the web archiving community is exploring other methodologies for capturing web content.
  • The social feed manager hints at a future in which API-enabled archiving becomes more common. As it is, Heritrix and the web crawling paradigm generally are far more suitable to the comparatively static web of 10 years ago than the contemporary web. Continuing efforts and collaborations will be required on the data capture front to maintain the efficacy of web archiving tools.
  • And it’s not that we don’t have the capabilities now to tackle some of the data capture challenges; we just don’t have effective ways to do so at scale, a requirement for a robust production workflow.
  • Once you have the data, you need to organize and store it.
  • Data to be included in the SIP could be the WARCs themselves but also the crawler configuration and logs. It will be important to track the relationship between packages and the collection, website(s), and/or capture date ranges they represent (this may or may not be transparent in the filenames).
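
One way to track those relationships explicitly, rather than relying on filenames, is a small manifest stored with each package. The sketch below is only an illustration. Every field name, identifier, and filename in it is hypothetical and does not reflect an SDR or Library of Congress schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CrawlPackage:
    """Describes one package and what it represents (hypothetical fields)."""
    package_id: str
    collection: str
    seeds: list
    capture_start: str  # ISO 8601 date of the earliest capture
    capture_end: str    # ISO 8601 date of the latest capture
    warc_files: list = field(default_factory=list)
    config_files: list = field(default_factory=list)
    log_files: list = field(default_factory=list)

    def to_manifest(self) -> str:
        """Serialize the package description as JSON for storage with the SIP."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


package = CrawlPackage(
    package_id="example-0001",
    collection="Example Collection",
    seeds=["http://example.org/"],
    capture_start="2013-04-01",
    capture_end="2013-04-17",
    warc_files=["example-0001-00000.warc.gz"],
    config_files=["crawler-config.xml"],
    log_files=["crawl.log"],
)
print(package.to_manifest())
```
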
  • In the life cycle model, QA takes place after storage and organization.
  • I think that it usually takes place before, during, and after data capture. “Before” includes scoping, assessing obstacles to archiving, or surfacing JavaScript links with a web automation framework like PhantomJS. “During” includes checking up on the running crawl to make sure it doesn’t get stuck. “After” includes reviewing crawl logs, inspecting harvested sites, and making scoping adjustments. QA is most important after the first crawl of a resource.
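
As an example of the "after" stage, a first pass over a crawl log can be automated. This sketch assumes a whitespace-delimited log in which the second field is the fetch status code and the fourth is the URI, modeled on the layout of Heritrix's crawl.log. The sample lines are invented, and the threshold for what counts as a problem is an assumption.

```python
from collections import Counter


def summarize_crawl_log(lines):
    """Tally status codes and list URIs whose fetch looks like it failed."""
    status_counts = Counter()
    problem_uris = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        try:
            status = int(fields[1])
        except ValueError:
            continue  # skip lines that do not follow the assumed layout
        uri = fields[3]
        status_counts[status] += 1
        # Assumption: negative codes are crawler-side failures, 4xx/5xx are server errors.
        if status < 0 or status >= 400:
            problem_uris.append((status, uri))
    return status_counts, problem_uris


sample_log = [
    "2013-04-17T12:00:00.000Z 200 5120 http://example.org/ - - text/html",
    "2013-04-17T12:00:01.000Z 404 310 http://example.org/missing L http://example.org/ text/html",
    "2013-04-17T12:00:02.000Z -9998 - http://example.org/private/ L http://example.org/ unknown",
]
counts, problems = summarize_crawl_log(sample_log)
print(counts)
print(problems)
```

A summary like this does not replace inspecting harvested sites, but it can point a reviewer at the URIs most worth checking.
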
  • Descriptive metadata may be created or enhanced during many of the workflow stages of the life cycle.
  • Descriptive metadata would optimally come from multiple sources: selectors, catalogers, and automated methods. cURL is a basic automated method for extracting metadata from the head of archived pages. I’ve experimented some with text analysis tools that could suggest appropriate keywords from a controlled vocabulary, but I’m not aware of any tools that are production-ready.
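
Once a page has been retrieved (with cURL or otherwise), pulling candidate metadata out of its head is straightforward. The following is a minimal sketch using Python's standard-library HTML parser for the parsing step. It is a swapped-in illustration rather than the cURL-based approach mentioned above, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect the <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attributes.get("name") and attributes.get("content"):
            # Lowercase the name so "Description" and "description" are treated alike.
            self.meta[attributes["name"].lower()] = attributes["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_head_metadata(html: str) -> dict:
    """Return the page title plus any named meta tags as a flat dictionary."""
    parser = HeadMetadataParser()
    parser.feed(html)
    return {"title": parser.title.strip(), **parser.meta}


sample_page = (
    "<html><head><title> Example Site </title>"
    '<meta name="Description" content="A test page">'
    '<meta name="keywords" content="web, archive">'
    "</head><body><p>Hello</p></body></html>"
)
print(extract_head_metadata(sample_page))
```

Metadata harvested this way is only as good as what site owners supplied, so it would be a starting point for selectors and catalogers rather than a finished record.
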
  • I acknowledge that the life cycle model doesn’t cover every aspect of what Stanford will need to consider in the creation of its web archiving program.
  • There are many other considerations, such as how the success of the program will be benchmarked and how the requirements of different stakeholders will be balanced.
  • There will also be the challenge of incorporating existing projects. To what extent can the disparate efforts be standardized, and is that even desirable?
  • The web archiving program will need to engage not just with internal stakeholders but external groups and institutions as well. Web archiving is definitely a community effort, and the community needs all the help it can get.
  • Lastly, it will be necessary to revisit and re-evaluate many of the aforementioned program and workflow elements on an ongoing basis to keep pace with the changing information environment and evolving best practice. Stanford University has come a long way since 1996 and finds itself now at a great moment to become more involved with web archiving. I’d welcome the opportunity to help lead that effort.

    1. From Seed to Harvest: Web Archiving Program Considerations for SUL Nicholas Taylor @nullhandle Stanford University Libraries April 17, 2013 “Digital” by Flickr user clickclaker under CC BY-NC-ND 2.0
    2. hello, my name is Nicholas…
    3. Library of Congress Web Archiving Library of Congress: “MINERVA”
    4. Web Archiving Life Cycle Model “Web Archiving Life Cycle Model” by M. Bragg, K. Hanna, et al. (2013). Reproduced with permission.
    5. Web Archiving Life Cycle Model Program Elements • Vision and Objectives • Resources and Workflow • Access / Use / Reuse • Preservation • Risk Management Workflow Elements • Appraisal and Selection • Scoping • Data Capture • Storage and Organization • Quality Assurance and Analysis
    6. PROGRAM ELEMENTS Web Archiving “Element Blocks” by Flickr user Asian Art Museum under CC BY-NC-ND 2.0
    7. Vision and Objectives
    8. web archiving program vision ePADD Discovery Module PASIG
    9. SUL mission “The Stanford University Libraries (SUL) is more than a cluster of libraries; it connects people with information by providing diverse resources and services to the academic community.” “Stanford University Libraries…develops and implements resources and services…that support research and instruction.” SUL: “Stanford University Libraries on Vimeo” SUL: “About The Stanford University Libraries” SUL: “SULAIR Brief Guide”
    10. DLSS mission “DLSS is the information technology production arm of the Stanford Libraries; it serves as the digitization, digital preservation and access systems provider for SUL; and it is the research and development unit for new technologies, standards and methodologies related to library systems.” SUL: “New Images of Rare Books and Digitization Devices” SUL: “SULAIR Digital Library Systems and Services (DLSS)”
    11. proposed program mission “The web archiving program will provide capabilities for the acquisition, preservation, and dissemination of resources that are increasingly and, often, exclusively accessible via the web that are necessary to support University research, instruction, and other purposes.”
    12. objectives • build infrastructure • develop expertise • create research collections • archive records and deprecated content • mirror government documents “Objective” by Flickr user Pedro J. Ferreira under CC BY-NC-ND 2.0
    13. Resources and Workflow
    14. cost modeling “dollar butterfly (2)” by Flickr user eikosi under CC BY-SA 2.0
    15. staffing • service manager • crawl engineer • curators • system administrators • software engineers • technical services • legal counsel “Digitizing Mark Adams cartoons” by Flickr user suldpg under CC BY-NC-SA 2.0
    16. infrastructure “Google Storage Server” by Flickr user Kazuya (Kaz) Yokohama under CC BY-NC-ND 2.0
    17. readily workflow-able • collection management • site nomination • permissions tracking • crawl scheduling • data capture • quality assurance “Web Curator Tool User Manual Version 1.5.2”
    18. workflow challenges • test crawling • automated QA • AIP/DIP generation • SDR ingest • indexing • enabling access • tools testing “Salmon Ladder at Bonneville Dam” by Flickr user Serolynne under CC BY-NC-ND 2.0
    19. Access / Use / Reuse
    20. access policy • dark archive • data redistribution • embargo • onsite/offsite replay • takedown requests “DO NOT DUPLICATE” by Flickr user Sam UL under CC BY-NC-SA 2.0
    21. browse and API: Wayback Internet Archive: “Wayback Machine” UK Web Archive: “Wayback Machine”
    22. many Wayback Machines Wikipedia: “List of Web archiving initiatives”
    23. discovery: Memento “Memento”
    24. discovery: SearchWorks SUL: “SearchWorks”
    25. full-text search: Solr Archive-It: “Explore All Archives”
    26. Preservation
    27. bit preservation “Binary” by Flickr user mikecogh under CC BY-SA 2.0
    28. preservation engineering “Máquina de Rube Goldberg en la base del Alinghi” by Flickr user freshwater2006 under CC BY-NC 2.0
    29. Risk Management
    30. Risk Management • “appified” web • copyright • ephemeral web • financial sustainability • fostering use “Zombie Awareness - Extinguisher” by Flickr user Spiffy0777 under CC BY-NC-SA 2.0
    31. Policy
    32. copyright • § 108 (library exceptions) • fair use • notification vs. permission • opt-out / takedown • robots.txt • third-party sites • exceptions? “Noria con Copyrights” by Flickr user Alex Novoa under CC BY-NC-ND 2.0
    33. collection development “leaf-cutter ants” by Flickr user Vilseskogen under CC BY-NC-SA 2.0
    34. WORKFLOW ELEMENTS Web Archiving “Workflow” by Flickr user luismi_cavalle under CC BY 2.0
    35. Appraisal and Selection
    36. informing selection • value • risk • size • extent to which archived “Fruit market-Barcelona” by Flickr user Marcel Theisen under CC BY-NC-SA 2.0
    37. TwitterVane UK Web Archive: “TwitterVane”
    38. Wikipedia Live Monitor Thomas Steiner: “Wikipedia Live Monitor”
    39. Wikipedia articles Wikipedia: “List of think tanks in the United States”
    40. UNT Nomination Tool University of North Texas Libraries: “Nomination Tool”
    41. Scoping
    42. the purpose of scoping “More god?” by Flickr user one two one three under CC BY-NC-SA 2.0
    43. Data Capture
    44. Heritrix Internet Archive: “A Quick Guide to Running Your First Crawl Job”
    45. other data capture tools Dan Chudnov and Laura Wrubel: “social feed manager” Mat Kelly: “WAIL” Archive Team: “Wget with WARC output”
    46. the elusive web “Light Writing - Spider Web” by Flickr user forcefeed:swede under CC BY-ND 2.0
    47. scale “chutes and ladders” by Flickr user reallyboring under CC BY-NC-SA 2.0
    48. Storage and Organization
    49. packages and their contents “lots and lots and lots of boxes” by Flickr user Toastwife under CC BY-NC-SA 2.0
    50. Quality Assurance and Analysis
    51. QA before, after, during “Check” by Flickr user ex.libris under CC BY-NC-ND 2.0
    52. Metadata / Description
    53. Metadata / Description “Hello! My URL Is...” by Flickr user vasta under CC BY-NC-ND 2.0
    54. BEYOND THE MODEL Considerations “My donut” by Flickr user Molemaster under CC BY-NC-SA 2.0
    55. other program requirements • marketing/outreach • performance metrics • service level definitions • service roadmap • training • user documentation “Sticky notes” by Flickr user Kris Krug under CC BY-SA 2.0
    56. incorporating existing projects • plan capacity • normalize data • ingest into SDR • seek permissions • process • catalog • enable access “Geckos” by Flickr user smashz under CC BY-NC-ND 2.0
    57. community engagement
    58. the web changes Internet Archive: “Wayback Machine”
    59. Nicholas Taylor @nullhandle “Thank You” by Flickr user muffintinmom under CC BY 2.0