2. What is web archiving?
The collection, management and
stewardship of web resources
in a stable form (e.g. a WARC file)
that can be accessed over time
independently from the original
3. Why archive web resources?
To accurately represent experiences and materials
created in the twenty-first century, select websites and
other web-based resources should be captured, stored,
managed, described, and made accessible
Online spaces and web-based media are crucial elements
of some current events and crises; a lot of very
important information would be lost or omitted in the
absence of web archiving
4. Why archive web resources (continued)?
• Content available primarily or solely online is among
the most at-risk born-digital materials
• Websites that can be collected are freely and widely
available to anyone at some time but can vanish at the
volition of the site owner and/or service provider
• As with other digital materials, web content is far
more vulnerable to loss than information
contained in most analog media
5. [A few more] Why archive web resources?
• Curated collections of web archives can be a valuable
part of collection development
• Some resources that used to be published and
distributed on paper are now only available online
• Examples include:
• Course catalogs (!)
• Reports
• Publicity materials, e.g. for exhibitions and events
(press kits, brochures)
6. Web archiving is a multi-step process
• Collection development and planning
• Selection
• Permissions / Ethical review
• Collecting / Harvesting
• Description
• Access
• Long-term preservation
7. Is the visual Context also Content?
– CONTEXT → Content
– Are the visual elements or interactive features
– Important
– Defining
– Non-essential
– Is the experience of usage essential to capture?
– For example, is the resource more like a course
catalog or an interactive publication?
– Would anyone truly care about the path of access
enough to prioritize it?
– eBooks created with specific frameworks to weave
together information, however, are different
– Social media is HARD
8. What [IMHO] is NOT web archiving?
– Static screenshots / non-interactive fixed images /
screen recordings of interaction with the site
– There is room for these as supplemental materials
– Stockpiling without any specific strategy for selection,
management and preservation
– For example, YouTube is not a web archive unless
your collection development plan is to document
an enormous, un-curated mass of data
– That would also be nearly impossible to steward
and make accessible on an enduring basis (e.g.
because of financial cost and environmental impact)
– Using, or capturing, web spaces employed as a
PLATFORM or environment for sharing digital archives
9. In the absence of ideal tools*
– * Ideal tools - fully functional, easy to use, open source,
sustainable, well maintained, tested, widely accessible
& affordable with transparent pricing
– What can you afford?
– What is good enough? For now? Longer term?
– Why are you doing this?
– Who is this for and for how long into the future?
10. Some essential terms
– Crawler / spider / robot
– Automated software that traverses web pages per
directions from a human (for indexing or capture)
– Human-scale / browser-based web collecting
– Collecting that is guided by a human in real time
through a web browser; the result is an interactive
web archive, not a screen recording of the process
– Seed URL
– Starting point for collecting (can be at a domain,
directory or page level)
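To make the three seed levels concrete, here is a minimal Python sketch of a scope check that decides whether a discovered URL belongs to a collection started from a given seed. The function name `in_scope`, the `level` values and the example URLs are illustrative assumptions, not the behavior of any particular crawler; real tools apply their own scoping rules.

```python
from urllib.parse import urlsplit


def in_scope(seed: str, candidate: str, level: str = "domain") -> bool:
    """Return True if `candidate` falls inside the scope implied by `seed`.

    `level` is one of "domain", "directory" or "page" (hypothetical names).
    """
    s, c = urlsplit(seed), urlsplit(candidate)
    seed_host = (s.hostname or "").lower()
    cand_host = (c.hostname or "").lower()
    seed_path = s.path or "/"
    cand_path = c.path or "/"

    if level == "domain":
        # Same host, or a subdomain of the seed host.
        return cand_host == seed_host or cand_host.endswith("." + seed_host)
    if cand_host != seed_host:
        return False
    if level == "directory":
        # Keep everything at or below the seed's directory.
        if seed_path.endswith("/"):
            directory = seed_path
        else:
            directory = seed_path.rsplit("/", 1)[0] + "/"
        return cand_path.startswith(directory)
    if level == "page":
        # Only the seed page itself (trailing slash and query string ignored).
        return cand_path.rstrip("/") == seed_path.rstrip("/")
    raise ValueError(f"unknown level: {level}")


seed = "https://example.org/catalog/2021/index.html"  # hypothetical seed URL
print(in_scope(seed, "https://example.org/news/", "domain"))
print(in_scope(seed, "https://example.org/news/", "directory"))
```

The wider the level, the more a crawler will collect from a single seed, which is why the choice of seed belongs in the selection step rather than being left to tool defaults.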
11. Some more essential terms
– WARC (file)
– ISO standard file format for web archives
– Fidelity / quality:
– Similarity to original (e.g. look, functionality)
– Significant properties:
– Defining features of an object or resource – what
about this thing makes it what it is [and
distinguishes it from other things]
– Will illustrate in slides later
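For a sense of what a WARC file actually contains, the sketch below assembles one minimal `resource` record using only the Python standard library. The header field names follow the WARC format (ISO 28500); the function name `build_warc_record`, the example URI and the payload are illustrative assumptions. In practice a capture tool writes these records for you, usually as `response` records that also carry the full HTTP headers.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, payload: bytes, content_type: str) -> bytes:
    """Assemble one WARC 'resource' record as raw bytes (illustrative only)."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.1",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    # Header lines end with CRLF, a blank line separates the headers from
    # the content block, and two CRLFs close the record.
    head = "\r\n".join(headers).encode("utf-8")
    return head + b"\r\n\r\n" + payload + b"\r\n\r\n"


payload = b"<html><body>Course catalog</body></html>"
record = build_warc_record(
    "https://example.org/catalog.html",  # hypothetical URL
    payload,
    "text/html",
)
print(record.decode("utf-8"))
```

A WARC file is a sequence of such records, which is what lets the capture be stored and replayed independently of the live site.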
12. Frequent web archiving project genesis
Archivist: ‘I was just informed that an essential web
resource is about to be taken down/deleted.
Soon! Within weeks or next month.’
– How do I save a functional copy for future use?
– Can I do this in time (within a month or so)?
13. Advocacy within an organization
Administrator: ‘What do you mean web-based content
isn’t just saved [with full fidelity] automatically?
Doesn’t the Internet Archive have a copy?’
– By all means check the Internet Archive but view
captures critically
– Does this capture accurately represent the original?
Why/why not? If so, can you get a copy?
– Advocacy is hard, but leverage the training materials
available. Explaining the limits of web archiving
capabilities in an encouraging way is difficult but
necessary for expectation management
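Checking whether the Internet Archive holds a copy can be partly scripted. The sketch below assumes the Wayback Machine availability endpoint (`https://archive.org/wayback/available`) and its JSON response shape as publicly documented at the time of writing; both may change. The function names and example URLs are illustrative. The parsing step is kept separate from the network call so it can be tried on a saved response.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; verify against current Internet Archive documentation.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url: str, timestamp: str = "") -> str:
    """Build the lookup URL; `timestamp` is optional (YYYYMMDD...)."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(response_text: str) -> Optional[dict]:
    """Return the closest available snapshot from a response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest
    return None


def lookup(url: str, timestamp: str = "") -> Optional[dict]:
    """Query the live service (requires network access)."""
    with urlopen(availability_query(url, timestamp), timeout=30) as resp:
        return closest_snapshot(resp.read().decode("utf-8"))


# A saved response in the assumed shape, for trying the parser offline.
sample = (
    '{"archived_snapshots": {"closest": {"available": true, '
    '"url": "http://web.archive.org/web/20210302000000/http://example.org/", '
    '"timestamp": "20210302000000", "status": "200"}}}'
)
print(closest_snapshot(sample)["timestamp"])
```

A hit only tells you that some capture exists. Open the snapshot and review it critically, as above, before treating it as an adequate copy.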
14. Sharing responsibilities
– Collecting strategy and establishment of priorities for
collection development could be a group effort
– Contributions could include
– Suggest URLs
– Liaise with site owners to solicit permission to
archive websites
– Governance of collaborations if multiple institutions
are involved
– Detailed quality assurance through browsing the
archived website as a user would (e.g. try to access
media files to ensure they have been successfully
captured)
– Assessment of efficacy for users?
15. Let’s go! Collection Development aka Why are we doing this??
• Thinking within any existing collecting policy as well as
thinking through what makes sense for you/your
institution with the tools and resources at hand
• Careful consideration and plenty of questions
• Why collect websites (needs, collection scope)?
• What to collect?
• How/what tools to use?
• When: how often to collect & when will these
materials be used?
16. Collecting + Ethics
– Discussions of ethics are not an afterthought;
remember that some materials can be made private or
embargoed if needed
– Who is at risk? What is the potential for harm?
– High risk
– Low risk
– Do creators understand implications of their content
being collected?
– Archives are useful as evidence – who could
leverage that evidence and for what purpose?
– Intellectual property / rights of creators
17. Next: Collecting materials
– With a plan in mind, browse the live web and/or make a
list of sites you want
– Depending on tools available and associated skills,
collect some resources
– Review what you got – is this what you expected to
get? If not, is it close enough?
– If you did not get what you expected / need, next steps:
contact vendor, tool maker or someone likely to have
the skills to help you troubleshoot
18. Testing is boring & tedious & entirely necessary
• ‘Set it and forget it’ is not recommended
• It’s boring and tedious, but review your captures please
• If you don’t test your captures you have no basis to
expect you collected materials with adequate “fidelity”
• Fidelity is understood here as how accurately the
capture represents the resource and the information
contained therein
• Perfection is not attainable but better is better
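Part of this review can be scripted as a first pass. The sketch below compares a list of URLs you expected to capture against what a capture actually holds, and reports anything missing or recorded with an error status. The function name `review_capture`, the input shapes and the example URLs are hypothetical; how you export a URL and status list from a capture depends on your tool. It does not replace browsing the archived site as a user would.

```python
def review_capture(expected_urls, captured):
    """Compare expected URLs against what a capture actually holds.

    `captured` maps each captured URL to the HTTP status code recorded
    for it (hypothetical input shape).
    """
    expected = set(expected_urls)
    # Expected URLs that never made it into the capture.
    missing = sorted(u for u in expected if u not in captured)
    # Expected URLs that were captured, but as an error page (4xx/5xx).
    failed = sorted(u for u in expected if captured.get(u, 0) >= 400)
    return {
        "expected": len(expected),
        "ok": len(expected) - len(missing) - len(failed),
        "missing": missing,
        "failed": failed,
    }


expected = [
    "https://example.org/",
    "https://example.org/forms/loan.pdf",
    "https://example.org/media/tour.mp4",
]
captured = {
    "https://example.org/": 200,
    "https://example.org/forms/loan.pdf": 404,
}
report = review_capture(expected, captured)
print(report)
```

Media files and downloadable documents are good candidates for the expected list, since they are easy to miss and easy to check.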
28. Remember!
– Despite a lack of perfect solutions, materials on the
web are too important to give up on for collecting,
managing and preserving via web archiving
– What is or is not good enough is your call (up to a point)
– Is this enough to meet the established purpose(s)?
– Something is better than nothing as long as that
‘something’ has been gathered with intent and
managed (stewarded) adequately
29. Upcoming!
– Getting Started with Web Archiving – March 2, 2021
– Featuring presenters working on Archipelago as well as
the team at Carnegie Hall Archives!
Coming soon!
– Web Archiving Ethics and Implications
– Tools to ‘Do’ Web Archiving
– Learning from Long-term Leading Web Archiving
Initiatives
34. Indianapolis Museum of Art → Newfields
– Rebranding → sudden need to collect the website before
it was taken offline
– Motivated archivist who figured out local deployment
of Webrecorder
– Good collection made
– Grateful peers, e.g. because key forms and PDFs on the
prior website were not lost and instead were easy to
find in the web archive!
35. Stanford University Press
Digital projects associate: ‘Our complex publications are
cutting edge and will have a limited lifespan most likely
(lots of technical dependencies).
How can we make sure they are an enduring resource?
How do we explain challenges and benefits to
administrators and funders?’
– Pilot partnership with Webrecorder team – mutual
benefit
– Hands on work and dialog; custom development beta
(Scalar)
36. Use Case: Journalists!
Journalist: ‘There’s some wild stuff online I will be
referencing in my journalistic or academic writings.
I need to cite my sources to write a credible article.’
– Getting past screenshots
– What’s the benefit of something more complex than
screenshots?
– Ongoing credibility, evidence
37. Pelican Bomb
Editors/founders: ‘Our publication is closing. We did good
work and want it to have continued impact. What do we
do?’
– Time limited pilot partnership with Webrecorder team –
mutual benefit
– Work plan formed but primary funder did not buy in so
limited implementation
– Stakeholders: ‘now that we know the benefits of web
archiving we realize there are others in our communities
who need digital preservation help’
– Outreach, including workshop at Common Ground
Convening