Creating and Maintaining Web Archives


"Creating and Maintaining Web Archives"
Presented by Joanne Archer (University of Maryland), Tessa Fallon (Columbia University), Abbie Grotke (Library of Congress), and Kate Odell (Internet Archive)

Published in: Technology, Education

    1. 1. Creating and Maintaining Web Archives
        Joanne Archer (University of Maryland), Kate Odell (Archive-It), Abbie Grotke (Library of Congress), Tessa Fallon (Columbia University)
    2. 2. Session Goals
        • Provide an overview of web archiving and the tasks involved
        • Discuss workflow management and copyright issues
        • Talk about collection strategies and collection development for web archives
        • Analyze the different options for web archiving
        • Discuss some of the commonly encountered technical challenges and problems
        • Examine methods of access and description
    3. 3. What is web archiving?
        • Web archiving is the capture, management, and preservation of websites and web resources.
    4. 4. Web Archiving Initiatives
        • Prominent web archiving initiatives include:
            • Internet Archive
            • International Internet Preservation Consortium
            • Large national libraries:
                • Australia
                • United Kingdom
                • United States
                • Denmark
            • Web at Risk Project
    5. 5. Workflow Management
    6. 6. Copyright/Permissions
        • The legal deposit requirement applies only to “published works” (§ 407)
        • § 108 of the Copyright Act provides library exceptions but doesn’t address digital preservation and web archiving
        • Varying approaches taken:
            • Crawl permissions
            • Access permissions
            • Notification of crawling
            • Respecting robots.txt (or not!)
        • Risk and web archiving policies should be determined by each institution; talk to your lawyers!
    7. 7. Collection Strategies
        • Whole Domain
            • Used by some national libraries and by the Internet Archive. Captures everything within a geographic domain; in the case of Sweden, for example, all sites within the .se domain.
        • Selective Archiving
            • Captures certain portions of the web based on predefined criteria or collection policies.
        • Thematic
            • Event driven (September 11) or theme driven (human rights)
            • Deposit
        • Combination
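To make the "respecting robots.txt (or not!)" choice concrete, here is a minimal sketch of how a crawl tool might check a URL against a site's robots.txt before capturing it. It uses Python's standard `urllib.robotparser`; the robots.txt content, the `example-archiver` user agent, and the `may_crawl` helper are all hypothetical and are not taken from the presentation or from any particular archiving tool.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://example.org/index.html",
        "http://example.org/private/report.pdf",
    ):
        allowed = may_crawl(ROBOTS_TXT, "example-archiver", url)
        print(url, "allowed" if allowed else "disallowed")
```

An institution that decides not to honor robots.txt would simply skip this check; that is a policy decision, not a technical one.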
    8. 8. Collection Development: Topical
    9. 9. Collection Development: Technical
    10. 14. Collection Development Policies/Guidelines
        • Collection development policies or similar documents:
            • Center for Human Rights Documentation and Research, Human Rights Web Archive
            • Library of Congress
            • Tamiment Library Web Archive
            • University of Michigan Bentley Historical Library
            • National Library of Ireland general election 2011 web archive
    11. 15. Tools: HTTrack
    12. 16. Tools: HTTrack
    13. 17. Tools: In-House Program Web Curator Tool
    14. 18. Tools: In-House Program DigiBoard
    15. 19. Tools: Subscriptions, Web Archiving Service
    16. 20. Tools: Subscriptions, Archive-It
    17. 21. How does web archiving work?
        • Curator selects websites (seeds) to archive
        • Curator specifies scope (how much of each website is archived)
        • Seeds and scoping are sent to the crawler (usually Heritrix)
        • Crawler visits seed sites and archives the URLs that are discovered (following the scoping rules)
        • Archived content is processed and stored (.warc format)
        • Access tools (Wayback) allow archived content to be viewed and browsed
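The seed, scope, and crawl steps above can be sketched as a toy crawler. This is only an illustration under stated assumptions: the `LINKS` dictionary is a hypothetical in-memory stand-in for the live web, and the scoping rule (same host, path under the seed's path) is one simple choice. It is not how Heritrix is implemented, and nothing is written to .warc here.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for the live web (page -> links found on it).
LINKS = {
    "http://example.org/": ["http://example.org/about", "http://other.example.com/"],
    "http://example.org/about": ["http://example.org/", "http://example.org/news/1"],
    "http://example.org/news/1": [],
    "http://other.example.com/": ["http://other.example.com/page"],
}


def in_scope(url, seeds):
    """Assumed scoping rule: same host as a seed and located under the seed's path."""
    target = urlparse(url)
    for seed in seeds:
        base = urlparse(seed)
        if target.netloc == base.netloc and target.path.startswith(base.path):
            return True
    return False


def crawl(seeds, links):
    """Breadth-first visit starting from the seeds, keeping only in-scope URLs."""
    queue = deque(seeds)
    seen = set(seeds)
    captured = []
    while queue:
        url = queue.popleft()
        captured.append(url)  # a real crawler would fetch and store the page here
        for link in links.get(url, []):
            if link not in seen and in_scope(link, seeds):
                seen.add(link)
                queue.append(link)
    return captured


if __name__ == "__main__":
    for url in crawl(["http://example.org/"], LINKS):
        print(url)
```

Narrowing the seed to a subdirectory, or adding a second seed, changes what is captured without touching the crawl loop, which is why scoping is treated as a separate curatorial step.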
    18. 22. Quality Review
        • Quality review is different for everyone. Why?
            • The tool(s) being used for harvesting and access
            • Your institution’s goals, needs, and preferences
            • How much time you have
        • Review reports
            • Was any content blocked, or were any sites unreachable?
            • Did you get more content than expected? Less?
        • Review archived web pages
            • Some issues can only be found with the human eye (for now!)
            • Was the look and feel properly captured?
        • Make desired changes
            • Scoping, seeds, crawl settings, etc.
        • Crawl again
    19. 23. Common Problems – “The Web is a Mess”
        • Some web technologies can be tricky (though not impossible!) to capture or to view in the archived version:
            • Database-driven sites
            • JavaScript (only sometimes)
            • Flash (only sometimes)
            • Certain video formats
        • Websites change: what archived perfectly yesterday might not after today’s redesign
    20. 24. Access and Description
        • Access options:
            • Subscription service access page (e.g. Archive-It website)
            • Website of your organization or project (e.g. Human Rights Web Portal, LOC’s Web Archives site)
            • OPAC (e.g. Columbia’s CLIO)
            • OCLC’s WorldCat
        • Examples of description:
            • Columbia University
                • Dublin Core
                • MARC
                • Internet Resource Cataloging Request (IRCR)
            • Library of Congress
                • Creates MODS records for each “site”
                • Collection-level records in MARC (for the OPAC)
            • Archive-It
                • Dublin Core
                • Coming soon: automated transformation to MARC, MODS, and more.
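As a rough illustration of the "review reports" step, the sketch below tallies a crawl report and lists the URLs that need a human look. The report format (URL plus HTTP status, with `None` meaning no response) and the `summarize` helper are hypothetical; actual reports from Heritrix, Archive-It, or other tools are laid out differently.

```python
from collections import Counter

# Hypothetical crawl report rows: (URL, HTTP status or None if the site never answered).
REPORT = [
    ("http://example.org/", 200),
    ("http://example.org/about", 200),
    ("http://example.org/members/", 403),
    ("http://example.org/old-page", 404),
    ("http://example.org/video.flv", None),
]


def classify(status):
    """Bucket a status into the categories a reviewer cares about."""
    if status is None:
        return "unreachable"
    if status >= 400:
        return "blocked_or_missing"
    return "ok"


def summarize(report):
    """Return per-category counts and the list of URLs that were not captured cleanly."""
    counts = Counter(classify(status) for _, status in report)
    problems = [url for url, status in report if classify(status) != "ok"]
    return counts, problems


if __name__ == "__main__":
    counts, problems = summarize(REPORT)
    print(dict(counts))
    for url in problems:
        print("check:", url)
```

A summary like this only covers what the reports can show; look-and-feel problems still require opening the archived pages.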
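For readers unfamiliar with Dublin Core, the following sketch builds a small site-level record using Python's standard `xml.etree.ElementTree`. The namespace URI is the standard Dublin Core element set; the `record` wrapper element, the field values, and the `dublin_core_record` helper are assumptions made for illustration and do not reflect the record layout used by Columbia, the Library of Congress, or Archive-It.

```python
import xml.etree.ElementTree as ET

# Standard namespace for the Dublin Core Metadata Element Set 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Serialize a dict of Dublin Core element names and values as XML.

    The <record> wrapper is a hypothetical container, not part of Dublin Core.
    """
    root = ET.Element("record")
    for name, value in fields.items():
        element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        element.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical description of one archived site.
    xml_text = dublin_core_record(
        {
            "title": "Example Organization Website",
            "identifier": "http://example.org/",
            "subject": "Human rights",
            "date": "2011",
        }
    )
    print(xml_text)
```

Keeping description in a simple element-and-value form like this is what makes later transformation to richer schemas such as MARC or MODS feasible.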
    21. 25. Archive-It Partner Page
    22. 26. Library of Congress Web Archives Page
    23. 27. Library of Virginia
    24. 28. CLIO Record (public view)
    25. 29. WorldCat link back to the Archive-It collection
    26. 30. Staffing
        • Staff needed include:
            • Project management
            • Selectors/curators
            • Technical staff for seed URL preparation (scoping), quality review, analysis of reports, etc.
            • Catalogers
        • Training for staff:
            • Use of tools
            • Selection, and how what can and cannot be archived affects it
            • Permissions
            • Quality review
        • Helpful skills: comfort with the web (not everyone has it, in our experience!), flexibility, a good sense of humor
    27. 31. Taking the First Steps…
        • Is there web content within your collection scope?
            • Your organization’s website(s)
            • Print material that has migrated to web publication
            • Subject-related websites
            • Websites related to manuscript or archival collections
            • State or local government websites
        • Research and talk to similar organizations
        • Talk to subscription services about trial accounts
        • Try out some of the lower-barrier tools (e.g. HTTrack)
        • Get involved with collaborative web archiving efforts
        • Just do it! Jump in!
    28. 32. NDSA Web Archiving Survey
        • The National Digital Stewardship Alliance (NDSA) Content Working Group is sponsoring this survey of organizations in the United States that are actively involved in or planning to archive content from the web.
        • The survey will close October 31, 2011.
    29. 33. Questions? Comments? Suggestions?
        Joanne Archer • Tessa Fallon • Abbie Grotke • Kate Odell