Capture All the URLs: First Steps in Web Archiving



Presentation with Judy Silva (Fine & Performing Arts Librarian and Archivist at Slippery Rock University) and Alexis Antracoli (Records Management Archivist at Drexel University) at the Pennsylvania Library Association's 2013 annual conference in Seven Springs, Pennsylvania.

Abstract: As higher education embraces new technologies, teaching, learning, research, and record-keeping are increasingly taking place on university websites, on university-related social media pages, and elsewhere on the open web. This dynamic digital content, however, is highly vulnerable to degradation and loss. This session will introduce the concept of web archiving and articulate why it’s important for colleges and universities. Speakers will demonstrate the web archiving service Archive-It and then share lessons learned from their institutions’ web archiving initiatives, from unexpected stumbling blocks to strategies for raising funds and support from campus stakeholders.

Published in: Education, Technology

  • Recognition of changing platform. All of our stuff is going here, and it’s dynamic.
  • This is harder than you think. Digital files are highly vulnerable.
  • Already seeing this happen.
  • Often a combination of tools
  • As we’ve seen, Archive-It is the popular favorite and has an impressive list of users. Started looking into web archiving in 2009; followed the topic on professional listservs and saw Archive-It mentioned repeatedly. Had read about Brewster Kahle’s work. Tried the Wayback Machine and was impressed with what was being collected already. In 2010 saw a presentation at MARAC (Mid-Atlantic Regional Archives Conference) by a colleague, Rebecca Goldman at Drexel, who suggested I try a webinar, and that was it. Summary: few options in 2009; Archive-It endorsed by colleagues; Internet Archive’s Wayback Machine; presentation at professional conference; attended a webinar; ran a trial; Archive-It support.
  • First contacted Information Technology (partly to determine they were not already archiving the website somehow). VP Info Technology referred us to PR . . . At Slippery Rock University the Public Relations Office is responsible for the website, so they were a natural in helping to select content for capture and preservation. PR publishes more and more content in electronic format (in some cases only electronic): course catalogs, alumni magazine, press releases . . . Administration seeking storage solutions as network drives fill. Library willing to test the web archiving concept with our content.
  • IT deferred concept (and funding) to PR. PR said they could not afford it. Library ended up paying for it. Will ask Provost’s Office next year.
  • Library Homepage; Archives: Digital Collections; University Homepage; RockPride (campus e-newsletter); Catalogs: undergraduate and graduate. Decided to use library content as prototype; it allowed us to practice and then showcase our own content: library homepage and Digital Collections. The library is responsible for the annual Student Research Symposium, so that provides a stepping stone beyond library content and some additional stakeholders (faculty whose students are participants in the symposium). Also e-newsletter of faculty publications. PR’s suggestions: University homepage, campus e-newsletter, and catalogs (undergraduate and graduate).
  • Popular campus activities like Athletics, student organizations, alumni. Influential friends: president and provost. One-time events: anniversaries. It is now possible to archive intranet content behind a username and password (provided that the partner supplies those credentials in the web application).
  • Archive-It help pages are user-friendly.
  • Options: limit or expand; set a time; set a data limit; crawl frequency. University Homepage: the homepage URL does not necessarily need to be crawled further than the root at the moment; once a month. RockPride (campus e-newsletter): a new Rock Pride online magazine runs every Friday during the school semester, and roughly once a month throughout the summer. Catalog: undergraduate and graduate catalogs, back to 2004; annually.
  • Look at other college and university sites (available from the Archive-It site) to see what they were harvesting and how they were naming collections. Stumbling blocks: Kristen and Alexis?
  • [Alexis first, then Kristen and Judy can add]
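
The crawl frequencies mentioned in the notes above (university homepage about once a month, campus e-newsletter tied to its Friday publication, catalogs annually) could be written down as a simple crawl plan. The Python sketch below is only an illustration, not how Archive-It schedules crawls: the seed URLs are hypothetical placeholders, the weekly frequency for the e-newsletter is an assumption inferred from its Friday schedule, and the interval lengths in days are rough approximations.

```python
from datetime import date, timedelta

# Hypothetical seed URLs (placeholders, not the real campus addresses).
# Frequencies follow the speaker notes; "weekly" for the e-newsletter is
# an assumption based on its Friday publication schedule.
CRAWL_PLAN = {
    "http://www.example.edu/": "monthly",
    "http://www.example.edu/rockpride/": "weekly",
    "http://www.example.edu/catalog/": "annually",
}

# Approximate interval lengths in days (an illustrative simplification).
INTERVAL_DAYS = {"weekly": 7, "monthly": 30, "annually": 365}


def next_crawl(last_crawl: date, frequency: str) -> date:
    """Return the date on which a seed is next due for a crawl."""
    return last_crawl + timedelta(days=INTERVAL_DAYS[frequency])


def due_seeds(last_crawls: dict, today: date) -> list:
    """Return the seeds whose next crawl date is today or earlier, sorted."""
    return sorted(
        seed
        for seed, frequency in CRAWL_PLAN.items()
        if next_crawl(last_crawls[seed], frequency) <= today
    )
```

Keeping the plan in one small table like this makes it easier to review frequencies against the data budget before changing anything in the crawling service itself.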

    1. Capture all the URLs: First Steps in Web Archiving
       • Kristen Yarmey, Digital Services Librarian, University of Scranton
       • Judy Silva, Fine & Performing Arts Librarian and Archivist, Slippery Rock University of Pennsylvania
       • Alexis Antracoli, Records Management Archivist, Drexel University
    2. Where We’re Going
       • Kristen: Intro to web archiving; Web archives in higher ed; Archive-It and other tools
       • Judy: First steps; Getting buy-in; Selecting and scoping
       • Alexis: Metadata; Policies; Workflow
       • All: Challenges; Lessons learned; What’s next?; Q&A
    3. Why archive the web?
    4. What do we put on the web?
       • University publications: Course catalogs; Student handbooks; Newsletters; Press releases; Alumni Journal; Admissions viewbook; University calendar
       • Governance/Planning documents and records: Policies; Assessment reports (Fact Book); Faculty Senate agendas, minutes, and reports; Presidential announcements; Email
       • Campus life: Student clubs; Housing contract; Wellness programming; Community outreach; Athletics scores; Alumni class pages
       • Events: Presidential inauguration; New building construction/dedication
       • Social Media presence: Facebook; Twitter; Blogs; YouTube; …
    5. Web Archiving in Higher Ed
       • “We have the responsibility to preserve things like course information, course roster information and policies — all sorts of things that we used to get in paper but are now just showing up as websites.” Dean B. Krafft, Chief Technology Strategist, Cornell University
       • “Almost every office and unit on campus has a web site with business information. ... Many of our campus publications are only on the web now as pdfs or html. [This content] isn’t preserved anywhere else.” Ed Busch, Electronic Records Archivist, Michigan State University
    6. Goals:
       • Preserve dynamic content: Text; Images; Animation; Video; …
       • Preserve context: Hyperlinks; Embedded media; Document method and date of capture; Relate to prior and later versions
       • Provide access: Full text search; Browsability; User-friendly interface
    7. Once something is posted on the web, it’s there forever… right? (New York Times, September 23, 2013)
    8. Web Archiving in Higher Ed
       • “One finding revealed by the survey was the preponderance of universities that have initiated web archiving programs in the last 5 years.” Web Archiving Survey Report by National Digital Stewardship Alliance, June 2012
    9. National Digital Stewardship Alliance, Web Archiving Survey Report, June 2012
    10. Tools (National Digital Stewardship Alliance, Web Archiving Survey Report, June 2012)
    11. Tools: In-House Options (National Digital Stewardship Alliance, Web Archiving Survey Report, June 2012)
       Proprietary tools:
       • Adobe Acrobat – convert websites into PDFs (internal links remain active but other dynamic functionality is lost)
       • Grab-a-Site and WebWhacker – download files from a website
       • Teleport Pro – “webspidering”
       Open source tools:
       • Heritrix – crawler
       • HTTrack – downloads web content to a local directory
       • Wayback – discovery
       • Memento – access framework
       • NutchWAX – search
       • Solr – search
       • WARCreate – Google Chrome extension for creating WARC files (view with Wayback, store your own data)
       • Wget – retrieve files from a website
       • Web Curator Tool – workflow management
       • NetarchiveSuite – software package
       • Xenu’s Link Sleuth – finds broken links
    12. Tools: Outsourcing Options (National Digital Stewardship Alliance, Web Archiving Survey Report, June 2012)
       Vendor services:
       • Archive-It
       • California Digital Library Web Archiving Service (WAS)
       • OCLC Web Harvester
    13. Archive-It
       • Subscription service
       • Branch of nonprofit Internet Archive
       • Crawls, harvests, and hosts web content, using open source tools and standard formats
       • Yearly fees, based on “data budget”
    14. Archive-It: Partners
       • 279 collecting organizations total
       • 118 colleges & universities
       Pennsylvania Partners:
       • Bryn Mawr, Haverford, and Swarthmore (joint, 2005)
       • Bucknell (2012)
       • Chemical Heritage Foundation (2010)
       • Curtis Institute of Music (2010)
       • Drexel (2009)
       • Free Library of Philadelphia (2010)
       • Gettysburg College (2013)
       • La Salle University (2012)
       • Pennsylvania State University (2012)
       • Slippery Rock University of Pennsylvania (2011)
       • Temple University (2013)
       • University of Pennsylvania Law School (2011)
       • University of Scranton (2012)
    15. Archive-It: Crawl
       • Collection
       • Seeds (regular, one-time, or RSS)
       • Documents = any file with a distinct URL, including HTML, images, video, audio, PDF, …
       • Scope = which URLs are captured and which are not
       • Frequency = how often seed is crawled
    16. Archive-It: Access
       • Users can: Search; Browse
       • From: Archive-It website; Portal page; Embedded search boxes; Library catalog; Finding aids; 404 error pages; Wayback Machine
       • Content can be public or private.
    17. Archive-It: Manage
       • Metadata: Dublin Core; Collection, seed, document level
       • Storage: Archive-It hosts content and backup on multiple servers; Partner can request copy of data
       • Support: Training sessions; Partner support; User community
    18. Why Archive-It?
    19. Campus Stakeholders
       • Library
       • Information Technology
       • Public Relations
       • Administration
       • New President
    20. Funding
       • Information Technology
       • Public Relations
       • Library
       • Provost’s Office
       • Grants
       • Donors
    21. Selecting Content
    22. Selecting More Content
       • Athletics
       • Student organizations
       • Alumni
       • President’s page
       • Provost’s page
       • 125th Anniversary
       • University Curriculum Committee minutes (password protected)
    23. What is a seed?
       A seed is any URL that you want to capture:
       • An entire website
       • A specific part of a website
       • A specific URL
       • nal_security_strategy.pdf
    24. Scoping & Crawls
    25. Before You Start
    26. Building the Program
       • Policy
       • Records Management Benefits
       • Standardizing Metadata
       • Developing Quality Control Procedures
       • Working within organizational constraints
    27. Collection Development
       • Developed policy: Mission; Scope; Designated Community; Intellectual Property; Access
       • Determined/reviewed seeds to crawl and frequency
       • Maintain an up-to-date list of seeds that are regularly crawled
       “Brasseri F – Archives oubliees,” by GuillaBar.
    28. Updating Metadata
       • Selected fields to use consistently: Title; Creator; Description; Collector
       • Standardized names
       • Eliminated groups
    29. Quality Control Procedures
       • New program
       • Excel spreadsheet
       • Track by seed
       • Check basic yes/no problems: Crawl too large; Date Queued; Robots.txt
       • Track errors: Various seed errors; Embedded file problems
       • Track updates: New URLs; Recrawls; Patch crawls; Web administrator contacts
       “Our Quality Control,” by Paphio.
    30. Challenges
       • Staffing
       • Time-intensive
       • Correcting technical problems
       • Not yet knowing how people will use the crawls as a resource
       • Capturing online publications and email newsletters
    31. Lessons Learned
       • Web archiving takes time
       • There are ways to make it work with a small staff
       • Metadata can be basic and still useful
       • Quality Control is important
       • You can’t correct every error with limited staff
       • Need to keep up with new sites and URL changes
       “Lessons,” by Pavel Ivashkov.
    32. Up Next
       • Additional outreach to Web administrators
       • Official launch of Web archiving program to University
       • Exploring cross-training to improve quality control program
       • Institute regular scanning of environment for new content and updates
       • Social Media
    33. Resources
       • Archive-It Knowledge Center (October 2013)
       • Brenda Reyes Ayala’s Web Archiving Bibliography (June 2013)
       • Kalpesh Padia et al., Visualizing Digital Collections at Archive-It (August 2012)
       • National Digital Stewardship Alliance, Web Archiving Survey Report (June 2012)
       • International Internet Preservation Consortium, Future of the Web Workshop (May 2012)
       • Jinfang Niu, “An Overview of Web Archiving” (D-Lib Magazine, March 2012)
       • Inside Higher Ed, Archiving the Web for Scholars (May 2011)
       • WebArchivists, Web-Archives Timeline
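
Slides 15 and 23 describe a seed as any URL to capture (an entire website, a specific part of a website, or a single URL) and scope as the rule for which URLs are captured. As a rough illustration of that idea, the Python sketch below treats a URL as in scope when it sits on the seed's host and under the seed's path. This prefix rule is an assumption made for the example, not Archive-It's documented scoping behavior, and the `example.edu` addresses are hypothetical.

```python
from urllib.parse import urlsplit


def in_scope(seed: str, url: str) -> bool:
    """Return True if url falls under seed (illustrative prefix rule only).

    - A seed ending in "/" (a whole site or a section) covers every URL
      on the same host whose path starts with the seed's path.
    - A seed that names a single file covers only that exact path.
    """
    seed_parts, url_parts = urlsplit(seed), urlsplit(url)
    if seed_parts.hostname != url_parts.hostname:
        return False  # a different host is treated as out of scope
    seed_path = seed_parts.path or "/"
    url_path = url_parts.path or "/"
    if seed_path.endswith("/"):
        return url_path.startswith(seed_path)
    return url_path == seed_path


# Hypothetical seeds matching the three cases on slide 23.
whole_site = "http://www.example.edu/"
one_section = "http://www.example.edu/library/"
single_file = "http://www.example.edu/docs/report.pdf"

print(in_scope(whole_site, "http://www.example.edu/athletics/scores.html"))
print(in_scope(one_section, "http://www.example.edu/athletics/scores.html"))
print(in_scope(single_file, "http://www.example.edu/docs/report.pdf"))
```

Under this rule the first and third checks come out in scope and the second does not, which mirrors the slide's point that a narrower seed captures less of the site.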