In 2015, I created a web archiving fundamentals course for the Society of American Archivists (SAA) Digital Archives Specialist (DAS) program. This is a portion of the slide deck I used for that course.
1. Select slides from:
Web Archiving Fundamentals
(SAA course circa 2015,
assuming crawler-based capture)
Anna Perricci
Anna.Perricci@gmail.com
These slides were made in 2015 and
lightly edited in 2018. They are
not entirely current but are offered for
reference
2. YOUR INSTRUCTOR: Anna Perricci
Timeframe for experience with web archiving: 2007-present
Coursework at the University of Michigan School of Information (including a
2008 web archiving course with Margaret Hedstrom)
ICPSR intern project/recommendations
Grad research on digital and physical art and archives (e.g. video games and
new media art)
Columbia University: over two years of full-time work on collaborative web
archiving
Webrecorder/Rhizome: growing a robust set of open source web archiving tools
Teaching background
Course design: preservation of new media and performance-based artwork
SAA Web Archiving Roundtable Education Coordinator 2013-2016
Interesting extra projects that have shaped how I view capturing and representing
contemporary creative work and social movements
FIGMENT
Occupy Wall Street Archives Working Group
3. Web archiving is a new and growing field
and we need people with
new ideas and evolving skill sets
So glad to have you join us!
4. This course will provide foundational
knowledge of web archiving, including steps to
take in forming a web archiving program and core
concepts in web archiving practice
Constant change will be a given in web archiving as
long as web-based technologies continue to evolve
Let’s get ready for this ongoing challenge!
Goals!
5. Describe current web archiving practice
Identify key steps to go from collection
development policy to initial construction of
collections of archived websites
Explain subsequent steps to test quality, describe,
facilitate preservation and provide access to web
archives
By the end of the course you should be able to
6. Students will get information that will support
further learning and training so they can get the
most out of subsequent instruction, including
instruction on the use of web archiving software,
which is itself subject to regular changes and updates
This course will not teach you “how to do it”
7. Major web archiving software providers output
archived websites in the form of WARC files
These files should be included in wider digital
preservation planning
Opening WARC files and accessing the information
they contain requires software that is not yet in
common use
The immediate storage & use of web archives is
closely connected to web archiving service providers
(e.g. Archive-It, Webrecorder)
Out of scope:
Preservation workflows for web archives
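For orientation, here is a minimal sketch of what reading a WARC file can look like, using the open source warcio Python library (one tool among several that can parse WARCs; the file name is hypothetical):

# A minimal sketch, assuming the open source warcio library
# (pip install warcio) and a hypothetical local file name
from warcio.archiveiterator import ArchiveIterator

with open('example.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        # 'response' records hold the archived HTTP responses
        if record.rec_type == 'response':
            url = record.rec_headers.get_header('WARC-Target-URI')
            print(url)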
8. Website: one or more web pages
Web archiving: the process of selecting, capturing,
saving and making accessible select content
available online (e.g. websites)
Web archive: archived web content/an archived website
Web archives: a group of web-published materials
collected, managed and made accessible
Working definitions for this webinar
9. Eeeek…?!
Crawlers, robots & spiders
The software used to
collect web content is
often referred to as a
crawler, robot or spider
This webinar will focus
on workflows that have
been developed in
conjunction with the use
of Archive-It but should
be relevant to those
using other tools
10. When explaining web archiving concepts I will call
the software for collecting websites a crawler, robot,
spider and/or a harvester
A crawler (aka spider, robot) is software that
indexes web content
In web archiving a crawler is used in conjunction
with software that harvests (collects) websites and
packages that content into a standard file format
(WARC)
A few names for the same thing…
11. Any URL that one directs the crawler to capture
The seeds selected will determine the content in the
collection and the scope of the crawls
Seed URL(s) determine how much of a website will
be archived
What is a seed site?
Source: Archive-It help wiki
https://webarchive.jira.com/wiki/display/ARIH/Selecting+Seed
12. Seed site: URL for an entire website
top level / domain
http://www.kotekan.com/
13. Seed site: URL for specific part
(directory) of a website
http://www.kotekan.com/design.html
14. Seed site: URL for a specific page
http://www.kotekan.com/Southworth_CV_2013.pdf
15. To comprehensively represent records
created in the twenty-first century,
select websites and
other web-based resources
should be captured, stored,
managed, described,
and made accessible
as appropriate
Why archive websites?
16. Content available primarily or solely online is
among the most at-risk born-digital materials
Websites that can be collected are freely and
widely available to anyone at a given time but can
vanish at the site owner's volition
As with other digital materials, web content is
far more vulnerable to loss than information
contained in most analog media
Why archive websites? (cont.)
17. Curated collections of web archives can be a
valuable part of collection development
Some resources that used to be published and
distributed on paper are now only available online
Examples include:
Course catalogs (!)
Reports
Publicity materials for art galleries, events
Why archive websites? (cont.)
18. How to scope your collecting (intellectually
and technically)
Practices for acquiring and ensuring
quality of collected websites
Steps to take to facilitate access (e.g.
description concepts and access systems)
We’ll focus on things to consider before
beginning efforts to archive websites
20. These elements can be scaled
to guide the collection of websites or select
materials from websites
at any institution
but don’t be discouraged
by varying levels of success,
a process or scope that needs to be changed
or a lack of resources to do it “right”
Comprehensive web archiving programs
have a few core elements
21. In the US, major curated collections of web
archives are usually created and maintained by
institutions (often based in academic libraries)
Most use a suite of tools/software as a
service (e.g. Archive-It, Webrecorder.io)
It is common for an institution to focus on its
own/local web presence (e.g. www._.edu,
work of faculty & students)
What is being saved?
22. Code/info in web markup and programming languages
HTML, Flash (captured a bit better recently)
Some formatting (e.g. CSS/Cascading Style Sheets)
Text
Images
Some media files (embedded not streamed)
Documents, spreadsheets, presentations, data sets
XML, PDF, CSV
What is being saved by a crawler?
23. Videos & social media content
are among the hardest things
to capture with a crawler but
capturing them is
becoming more feasible
(e.g. Webrecorder, Brozzler)
24. Content excluded by robots.txt, a file that tells
crawlers (including ones collecting websites for web
archives) not to visit parts of a site
Robots.txt can be ignored in some services
Streamed media content
Database driven features of websites
Password protected content
Dynamically generated content
What is not being saved by a crawler?
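As an illustration, a polite crawler checks a site's robots.txt before fetching a page; a minimal sketch using Python's standard library (the URLs are hypothetical):

# A minimal sketch of a robots.txt check, using Python's
# standard library; the URLs are hypothetical
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://www.example.com/robots.txt')
rp.read()  # fetch and parse the site's robots.txt

# False means the file asks crawlers not to fetch this URL;
# some web archiving services can be configured to ignore this
print(rp.can_fetch('*', 'http://www.example.com/private/page.html'))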
25. A crawler can get caught in an endless loop on a
website
For example: a calendar without an end date
This endless loop is also known as a crawler
trap
Crawler traps
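Crawl software typically lets you exclude URL patterns that lead to traps. A generic sketch (not any particular service's rule syntax; the pattern is hypothetical):

# A generic sketch of excluding calendar-style URLs that can trap
# a crawler in an endless loop of machine-generated pages; the
# pattern is hypothetical, not any service's actual rule syntax
import re

TRAP_PATTERN = re.compile(r'/calendar\?date=\d{4}-\d{2}-\d{2}')

def is_probable_trap(url):
    return bool(TRAP_PATTERN.search(url))

print(is_probable_trap('http://example.com/calendar?date=2099-01-01'))  # True
print(is_probable_trap('http://example.com/about.html'))                # False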
26. The World Wide Web became widely
available in the United States between 1993
and 1995
The Internet Archive (archive.org) began
collecting websites in 1996 (the first archived web
pages were made publicly available in 2001)
For reference
27. The Internet Archive, Library of Congress, and
national libraries in Europe, Australia and New
Zealand were early leaders in web archiving
Web archiving activities at the Library of Congress began
in 2000
In Europe many domain-level crawls are done (e.g. .dk, .fr)
Onsite-only access is the most common model for
national libraries in Europe
A growing number of institutions are making efforts
to collect web archives that fit within their collection
development policies
When and where has web archiving
been done so far?
28. The Internet Archive had an early start with web
archiving but also has a much wider focus, pursued
across several project areas
IA is a service provider to LC (crawls) and, via
Archive-It, to many other institutions
Wayback Machine
What is and isn’t captured
Irregular frequency of page capture
Archive.org
‘Save page now’ via https://archive.org/web/
A few more words about the very
amazing Internet Archive
29. Collection development and planning
Selection
Permissions
Harvesting
Description
Access
Long-term preservation
Web archiving is a multi-step process
31. Intellectually: within your collecting policy, as well as
thinking through what makes sense for you/your
institution with the tools and resources at hand
Careful consideration and plenty of questions
See following slides for framing questions
we’ve used for the collaborative web archiving
pilot projects for Borrow Direct/Ivy Plus
How to scope your collecting
(intellectually)
32. The seed site will initially determine the depth of the crawl
Setting scoping rules (limits and expansions) in
web archiving software
How many pages are expected on a given site?
Identify missing content and try to capture it with
patch crawls and/or by adding more URLs associated
with or within the site you are trying to archive
Read the help documentation for the
software/service you are using for tips
How to scope your collecting
(technically)
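To make the idea of scope concrete, here is a simplified sketch of how a directory-style seed can bound a crawl (real services apply richer scoping rules, limits and expansions; the seed is hypothetical):

# A simplified sketch: only URLs on the seed's host and under the
# seed's directory are treated as in scope; real web archiving
# services apply richer scoping rules (limits and expansions)
from urllib.parse import urlparse

def in_scope(url, seed):
    u, s = urlparse(url), urlparse(seed)
    seed_dir = s.path.rsplit('/', 1)[0] + '/'
    return u.netloc == s.netloc and u.path.startswith(seed_dir)

seed = 'http://www.kotekan.com/design/'  # hypothetical directory seed
print(in_scope('http://www.kotekan.com/design/page1.html', seed))  # True
print(in_scope('http://www.kotekan.com/music.html', seed))         # False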
33. Why collect websites (needs, collection scope)
What to collect
How/what tools to use
When: how often to collect & when will these
materials be used?
Things to consider before beginning
efforts to archive websites
34. Where will the projects be based (institutionally)?
Who will lead this work and complete necessary
tasks?
Who are key stakeholders in this work?
Things to consider before beginning
efforts to archive websites (cont.)
35. What benefits occur/needs are met through web
collecting (selecting, acquiring, organizing,
providing access, preserving)?
Is your institution doing any web archiving? If
so, are there lessons to keep in mind?
Framing questions
36. Have others in your organization discussed this
idea?
How widespread is awareness about web
collecting/archiving?
Do you think the idea would be well received, or
seen as questionable?
What staff (within the library or beyond) would be
most likely to be involved?
Framing questions
37. What types of web content would you be most
interested in collecting? Is social media a high
priority?
Any specific subjects?
Where does web archiving fit into your collection
development policies (existing or in terms of
upcoming revisions)?
Framing questions
38. Example question set to consider
Columbia has thus far shaped its collecting
around certain policies. What issues, if any,
do you see arising from these? Would they
interfere with your local processes or
expectations?
Framing questions
39. Permissions: requests versus notification only
Limiting collecting to content that is freely available
on the web. To date we have not dealt with licensed
or password-protected content
Making the archived content publicly available (i.e.
without restrictions or authentication)
Collecting whole websites rather than individual
documents (for the sake of efficiency), instead of a
separate program for document-based collecting
Example factors to consider
40. Web archiving is
not a process that
can run successfully
using the workflow
casually known as
‘set it and forget it’
A potential workflow to forget…
41. Collecting strategy and establishment of priorities
for collection development could be a group effort
Contributions could include
Suggesting seed URLs
Liaising with site owners to solicit permission to
archive websites
Governance of collaborations if multiple
institutions are involved
Considering ways
to share responsibilities
42. Detailed quality assurance through browsing the
archived website as a user would (e.g. try to
access media files to ensure they have been
successfully captured)
Assessment of efficacy for users?
Considering ways to share
responsibilities (cont.)
43. Determine if you would like to capture all pages on
the website, specific areas of the website (directories)
or a single page
Copy the seed URL from your web browser
Paste the URL in the address bar in another web
browser (Firefox, Chrome) to double check that the
URL leads to the content to be archived
Paste the URL in a document or spreadsheet
Next to the URL add the title of the page and the date
Sample workflow for selecting seeds
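As a small illustration of the last two steps, selected seeds can be logged to a spreadsheet-friendly CSV file; a minimal sketch (the file name and seed entry are hypothetical):

# A minimal sketch of recording nominated seeds with title and
# date in a CSV file; file name and seed entries are hypothetical
import csv
from datetime import date

seeds = [('http://www.kotekan.com/', 'Kotekan home page')]

with open('seed_nominations.csv', 'a', newline='') as f:
    writer = csv.writer(f)
    for url, title in seeds:
        writer.writerow([url, title, date.today().isoformat()])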
44. Running crawls/capturing web archives
Doing initial quality assurance of crawls &/or
overseeing it
Coordinating efforts & fielding questions
Technical elements of web archiving
Web archiving policy
Permissions processing
Needs assessment
User profiles and use cases
Value and usage assessment? (later)
Domain / expertise of web archivists
45. Would there be a need to limit access to what you
are collecting? Why or why not?
Are there any privacy or intellectual property rights
issues that can be anticipated?
Is it necessary to ask permission of the site owner
to archive their website?
Are there any ethical implications of your
collecting?
Considering policies:
permissions, privacy and access
46. The US Copyright Act gives libraries no explicit
exception for web archiving
As of 2015 Columbia University Libraries’ policy was
to request permission from website owners to
harvest their websites and provide access to
archived versions
Permission request email sent to contact info from website
If no response after 2-3 weeks, follow-up request with
notification of intent to archive website
If no response, proceed with archiving
Permission to collect was rarely denied, and takedown
notices will be respected
Permissions
47. Tracking nominations
We used Google Sheets
Tracking permissions
Basecamp
Google Sheets for now, relational database later
Tracking progress
Basecamp, Google Sheets
Tracking QA results
Google Forms (feeds into Google spreadsheets)
Considering project tracking & tools
CCWA as example (shared access needed)
51. Having more complete information
Fidelity is perceived as correlating with accurate
representation of the resource and the information
contained therein
Perfection is not attainable but better is better
Does this take time?
YES
Why bother?
54. Who are the web archives for?
Are they being used?
Could we encourage more effective use?
55. Cataloging & Quality Assurance
Cataloging / metadata
assignment is essential to
discoverability
Quality assurance
testing
See the QA procedural
reference guide from
NYARC
http://wiki.nyarc.org/web-archiving/quality-assurance/
Photo credit: Anna Perricci
56. Cataloging expertise
Alex Thurman (web
archivist and skilled
cataloger) & Russell
Merritt (with decades of
experience cataloging
music resources) made
high quality records for
CAUSEWAY & CCWA
A bibliographic assistant
added metadata to
Archive-It
59. Records can be released to WorldCat
A query can be built for OCLC WorldShare to
obtain the MARC records for CCWA and
CAUSEWAY
The records can be delivered in a batch one
time or periodically on an ongoing basis
Importing records
via OCLC WorldShare
60. Archive-it.org site-level metadata (All thematic
collections, DCMI, copied from MARC records if
possible)
CLIO collection-level MARC records
CLIO site-level MARC records
Document-level MARC records
Human Rights Web Archive portal on CUL website
(using metadata extracted from MARC records)
Description for archived websites:
examples from Columbia
62. Columbia University resource: Guidelines for
Preservable Websites
https://library.columbia.edu/bts/web_resources_collection/guidelines_for_preservable_websites.html
Stanford resource: Archivability
https://library.stanford.edu/projects/web-archiving/archivability
Site creators might care about web archiving,
particularly if practical steps, best practices and
potential benefits to them are made clear
Best Practices for site creators:
work with website creators & guidelines