Web Archiving Fundamentals
(SAA course circa 2015,
assuming crawler based capture)
Anna Perricci
Anna.Perricci@gmail.com
These slides were made in 2015 and
lightly edited in 2018. They are
not entirely current but are offered for
reference.
 YOUR INSTRUCTOR: Anna Perricci
 Timeframe for experience with web archiving: 2007-present
 Course work at University of Michigan School of Information (including 2008
web archiving course with Margaret Hedstrom)
 ICPSR intern project/recommendations
 Grad research on digital and physical art and archives (e.g. video games and
new media art)
 Columbia University: over two years of full-time work on collaborative web
archiving
 Webrecorder/Rhizome: growing robust set of open source web archiving tools
 Teaching background
 Course design: preservation of new media and performance based artwork
 SAA Web Archiving Roundtable Education Coordinator 2013-2016
 Interesting extra projects that have shaped how I view capturing and representing
contemporary creative work and social movements
 FIGMENT
 Occupy Wall Street Archives Working Group
Web archiving is a new and growing field
and we need people with
new ideas and evolving skill sets
So glad to have you join us!
 This course will provide a foundational
knowledge of web archiving including steps to
take in forming a web archiving program and core
concepts in web archiving practice
 Constant change will be a given in web archiving as
long as web-based technologies continue to evolve
 Let’s get ready for this ongoing challenge!
Goals!
 Describe current web archiving practice
 Identify key steps to go from collection
development policy to initial construction of
collections of archived websites
 Explain subsequent steps to test quality, describe,
facilitate preservation and provide access to web
archives
By the end of the course you should be able to
Students will get information that will support
further learning and training so they can get the
most out of subsequent instruction including on the
use of web archiving software, which is also
subject to regular changes and updates
This course will not teach you “how to do it”
 Major web archiving software providers output
archived websites in the form of WARC files
 These files should be included in wider digital
preservation planning
 Opening WARC files/accessing information
contained requires software that is not yet common
 The immediate storage & use of web archives is
closely connected to web archiving service providers
(e.g. Archive-It, Webrecorder)
Out of scope:
Preservation workflows for web archives
 Website: one or more web pages
 Web archiving: the process of selecting, capturing,
saving and making accessible select content
available online (e.g. websites)
 Web archive: archived web content/website
 Web archives: a group of web-published materials
collected, managed and made accessible
Working definitions for this webinar
Eeeek…?!
Crawlers, robots & spiders
The software used to
collect web content is
often referred to as a
crawler, robot or spider
This webinar will focus
on workflows that have
been developed in
conjunction with the use
of Archive-It but should
be relevant to those
using other tools
 When explaining web archiving concepts I will call
the software for collecting websites a crawler, robot,
spider and/or a harvester
 A crawler (aka spider, robot) is software that
indexes web content
 In web archiving a crawler is used in conjunction
with software that harvests (collects) websites and
packages that content into a standard file format
(WARC)
A few names for the same thing…
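To make the packaging step more concrete, here is a minimal sketch that hand-builds one tiny WARC-style record and reads it back using only the Python standard library. The layout (a version line, `Name: value` header fields, a blank line, then a content block of `Content-Length` bytes) follows the published WARC format. The sample record, the example.org URL and the function name `parse_warc_record` are assumptions made for illustration. Real WARC files are usually compressed and are normally read with dedicated tools.

```python
def parse_warc_record(raw: bytes):
    """Split one uncompressed WARC record into (version, headers, block)."""
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]                       # e.g. "WARC/1.0"
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")  # split on the first colon only
        headers[name.strip()] = value.strip()
    length = int(headers["Content-Length"])
    block = rest[:length]                    # the block is exactly Content-Length bytes
    return version, headers, block


# Hypothetical record built by hand for this example.
payload = b"<html><body>Hello, archive</body></html>"
sample_record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: http://www.example.org/\r\n"
    b"WARC-Date: 2015-10-26T12:27:53Z\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: " + str(len(payload)).encode("ascii") + b"\r\n"
    b"\r\n" + payload + b"\r\n\r\n"
)

version, headers, block = parse_warc_record(sample_record)
print(version, headers["WARC-Type"], headers["WARC-Target-URI"])
print(block.decode("utf-8"))
```

The point of the sketch is only that a WARC file is a sequence of self-describing records, each pairing captured content with metadata about what was captured and when.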
 Any URL that one directs the crawler to capture
 The seeds selected will determine the content in the
collection and the scope of the crawls
 Seed URL(s) determine how much of a website will
be archived
What is a seed site?
Source: Archive-It help wiki
https://webarchive.jira.com/wiki/display/ARIH/Selecting+Seed
Seed site: URL for an entire website
top level / domain
http://www.kotekan.com/
Seed site: URL for specific part
(directory) of a website​
http://www.kotekan.com/design.html
Seed site: URL for a specific page​
http://www.kotekan.com/Southworth_CV_2013.pdf
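The three kinds of seed above can be read as scoping rules. The sketch below is one simplified way to express them in Python. It assumes that a seed with no path covers the whole site, a seed whose path ends in `/` covers that directory, and anything else is a single page. These rules and the function names `seed_scope` and `in_scope` are illustrative assumptions, not the behavior of Archive-It or any other service. A given service may treat a seed such as `http://www.kotekan.com/design.html` differently, so check its help documentation.

```python
from urllib.parse import urlsplit


def seed_scope(seed: str) -> str:
    """Classify a seed as 'site', 'directory' or 'page' (simplified rule)."""
    path = urlsplit(seed).path
    if path in ("", "/"):
        return "site"
    if path.endswith("/"):
        return "directory"
    return "page"


def in_scope(seed: str, url: str) -> bool:
    """Return True if url falls within the scope implied by the seed."""
    s, u = urlsplit(seed), urlsplit(url)
    if s.hostname != u.hostname:
        return False                      # different host: out of scope
    scope = seed_scope(seed)
    if scope == "site":
        return True                       # anything on the same host
    if scope == "directory":
        return u.path.startswith(s.path)  # anything under the directory
    return u.path == s.path               # only the exact page


site_seed = "http://www.kotekan.com/"
page_seed = "http://www.kotekan.com/Southworth_CV_2013.pdf"
print(seed_scope(site_seed), in_scope(site_seed, "http://www.kotekan.com/design.html"))
print(seed_scope(page_seed), in_scope(page_seed, "http://www.kotekan.com/design.html"))
```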
To comprehensively represent records
created in the twenty-first century,
select websites and
other web-based resources
should be captured, stored,
managed, described,
and made accessible
as appropriate
Why archive websites?
 Content available primarily or solely online is
among the most at-risk born-digital materials
 Websites that can be collected are freely and
widely available to anyone at some time but can
vanish at the volition of the site owner
 As with other digital materials, web content is
far more vulnerable to loss than
information contained in most analog media
Why archive websites? (cont.)
 Curated collections of web archives can be a
valuable part of collection development
 Some resources that used to be published and
distributed on paper are now only available online
Examples include:
 Course catalogs (!)
 Reports
 Publicity materials for art galleries, events
Why archive websites? (cont.)
 How to scope your collecting (intellectually
and technically)
 Practices for acquiring and ensuring
quality of collected websites
 Steps to take to facilitate access (e.g.
description concepts and access systems)
We’ll focus on things to consider before
beginning efforts to archive websites
These elements can be scaled
to guide the collection of websites or select
materials from websites
at any institution
but don’t be discouraged
by varying levels of success,
a process or scope that needs to be changed
or a lack of resources to do it “right”
Comprehensive web archiving programs
have a few core elements
 In the US major curated collections of web
archives are usually created and maintained by
institutions (often based in academic libraries)
 Most use a suite of tools/software as a
service (e.g. Archive-It, Webrecorder.io)
 It is common for an institution to focus on its
own/local web presence (e.g. www._.edu,
work of faculty & students)
What is being saved?
 Code/info in web programming language
 HTML, Flash (a bit better recently)
 Some formatting (e.g. CSS/Cascading Style Sheets)
 Text
 Images
 Some media files (embedded not streamed)
 Documents, spreadsheets, presentations, data sets
 XML, PDF, CSV
What is being saved by a crawler?
Videos & social media content
are among the hardest things
to capture with a crawler but
getting them is
becoming more possible
(e.g. Webrecorder, Brozzler)
 Robots.txt is a file that asks crawlers, including
ones set to collect websites for web archives, not to
access some or all of a site
Compliance is voluntary: robots.txt can be ignored in some services
 Streamed media content
 Database driven features of websites
 Password protected content
 Dynamically generated content
What is not being saved by a crawler?
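As a rough illustration of how robots.txt affects a crawl, the sketch below uses Python's standard `urllib.robotparser` module to test two URLs against a small set of rules. The robots.txt content, the user agent name `example-archiver` and the example.org URLs are all hypothetical. A real crawler would fetch the file from the site being crawled.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written inline so the example needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether the given robots.txt rules permit fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed_by_robots(ROBOTS_TXT, "example-archiver",
                        "http://www.example.org/index.html"))
print(allowed_by_robots(ROBOTS_TXT, "example-archiver",
                        "http://www.example.org/private/report.pdf"))
```

A crawler that honors these rules would skip everything under `/private/`, which is one reason an archived site can be missing content that is visible in a browser.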
A crawler can get caught in an endless loop on a
website
For example: a calendar without an end date
This endless loop is also known as a crawler
trap
Crawler traps
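One way to think about crawler traps is as URL patterns that can be spotted before they are followed. The sketch below is a simple heuristic, not how any particular crawler works. It flags very deep paths, path segments that repeat, and calendar-style URLs with a year later than a cutoff you choose. The function name `looks_like_trap`, the thresholds and the example.org URLs are assumptions for illustration, and a heuristic like this can misfire (for instance on a four-digit ID that looks like a year).

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Four-digit numbers that look like years (1900-2999), not part of a longer number.
YEAR = re.compile(r"(?<!\d)((?:19|2\d)\d{2})(?!\d)")


def looks_like_trap(url: str, latest_year: int,
                    max_depth: int = 8, max_repeats: int = 2) -> bool:
    """Heuristic check for URLs typical of crawler traps (thresholds are assumptions)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) > max_depth:
        return True                        # suspiciously deep path
    if segments and max(Counter(segments).values()) > max_repeats:
        return True                        # same segment repeated, e.g. /a/b/a/b/a/b/
    years = [int(y) for y in YEAR.findall(parts.path + "?" + parts.query)]
    return any(y > latest_year for y in years)  # calendar page far in the future


print(looks_like_trap("http://www.example.org/calendar?year=2087&month=3", latest_year=2020))
print(looks_like_trap("http://www.example.org/events/2015/10/", latest_year=2020))
```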
 The World Wide Web became widely
available in the United States between 1993
and 1995
 The Internet Archive (archive.org) began
collecting websites in 1996 (archived pages first
made publicly available in 2001)
For reference
 The Internet Archive, Library of Congress, and
national libraries in Europe, Australia and New
Zealand were early leaders in web archiving
 Web archiving activities at the Library of Congress began
in 2000
 In Europe there are a lot of domain-level crawls (e.g. .dk, .fr)
 Onsite-only access is the most common model for
national libraries in Europe
 A growing number of institutions are making efforts
to collect web archives that fit within their collection
development policies
When and where has web archiving
been done so far?
 The Internet Archive had an early start with web
archiving but also has a much wider focus that is being
publicized in several project areas
 IA is a service provider to the Library of Congress (crawls) and to other institutions via Archive-It
 Wayback Machine
 What is and isn’t captured
 Irregular frequency of page capture
 Archive.org
 ‘Save page now’ via https://archive.org/web/
A few more words about the very
amazing Internet Archive
 Collection development and planning
 Selection
 Permissions
 Harvesting
 Description
 Access
 Long-term preservation
Web archiving is a multi-step process
Planning, scoping,
acquisition &
ensuring quality
 Intellectually—within collecting policy as well as
thinking through what makes sense for you/your
institution with the tools and resources at hand
 Careful consideration and plenty of questions
See following slides for framing questions
we’ve used for the collaborative web archiving
pilot projects for Borrow Direct/Ivy Plus
How to scope your collecting
(intellectually)
 Seed site will initially determine the depth of crawl
 Setting scoping rules (limits and expansions) in
web archiving software
How many pages are expected on a given site?
 Identify missing content and try to capture it with
patch crawls and/or by adding more URLs associated
with or within the site you are trying to archive
 Read the help documentation for the
software/service you are using for tips
How to scope your collecting
(technically)
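Identifying missing content can be as simple as comparing the URLs you expected against the URLs a crawl actually captured. The sketch below does that comparison in Python. The function name `patch_crawl_candidates` and the captured list are hypothetical; in practice the captured list would come from the reports of whatever service you use.

```python
def patch_crawl_candidates(expected_urls, captured_urls):
    """Return expected URLs missing from the capture, ignoring trailing slashes."""
    captured = {u.rstrip("/") for u in captured_urls}
    return sorted(u for u in set(expected_urls) if u.rstrip("/") not in captured)


expected = [
    "http://www.kotekan.com/",
    "http://www.kotekan.com/design.html",
    "http://www.kotekan.com/Southworth_CV_2013.pdf",
]
# Hypothetical result of a first crawl in which the PDF was missed.
captured = [
    "http://www.kotekan.com",
    "http://www.kotekan.com/design.html",
]

print(patch_crawl_candidates(expected, captured))
```

The URLs returned are candidates for a patch crawl or for adding as extra seeds.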
 Why collect websites (needs, collection scope)
 What to collect
 How/what tools to use
 When: how often to collect & when will these
materials be used?
Things to consider before beginning
efforts to archive websites
 Where will the projects be based (institutionally)
 Who will lead this work and complete necessary
tasks
Who are key stakeholders in this work
Things to consider before beginning
efforts to archive websites (cont.)
 What benefits occur/needs are met through web
collecting (selecting, acquiring, organizing,
providing access, preserving)?
 Is your institution doing any web archiving? If
so, are there lessons to keep in mind?
Framing questions
 Have others in your organization discussed this
idea?
 How widespread is awareness about web
collecting/archiving?
 Do you think the idea would be well received, or
seen as questionable?
 What staff (within the library or beyond) would be
most likely to be involved?
Framing questions
 What types of web content would you be most
interested in collecting? Is social media a high
priority?
 Any specific subjects?
 Where does web archiving fit into your collection
development policies (existing or in terms of
upcoming revisions)?
Framing questions
 Example question set to consider
Columbia has thus far shaped its collecting
around certain policies. What issues, if any,
do you see arising from these? Would they
interfere with your local processes or
expectations?
Framing questions
 Permissions: requests versus notification only
 Limiting collecting to content that is freely available
on the web. To date we have not dealt with licensed
or password-protected content
 Making the archived content publicly available (i.e.
without restrictions or authentication)
 Collecting whole websites rather than individual
documents (for the sake of efficiency), with no
separate program for document-based collecting
Factors to consider for example
Web archiving is
not a process that
can run successfully
using the workflow
casually known as
‘set it and forget it’
A potential workflow to forget…
 Collecting strategy and establishment of priorities
for collection development could be a group effort
 Contributions could include
 Suggest seed URLs
 Liaise with site owners to solicit permission to
archive websites
 Governance of collaborations if multiple
institutions are involved
Considering ways
to share responsibilities
 Detailed quality assurance through browsing the
archived website as a user would (e.g. try to
access media files to ensure they have been
successfully captured)
 Assessment of efficacy for users?
Considering ways to share
responsibilities (cont.)
 Determine if you would like to capture all pages on
the website, specific areas of the website (directories)
or a single page
 Copy the seed URL from your web browser
 Paste the URL in the address bar in another web
browser (Firefox, Chrome) to double check that the
URL leads to the content to be archived ​
 Paste the URL in a document or spreadsheet
 Next to the URL add the title of the page and the date
Sample workflow for selecting seeds
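The last two steps of this workflow (recording the URL, title and date) can also be scripted if a spreadsheet becomes unwieldy. The sketch below writes seed nominations as CSV using Python's standard library. The column names, the function names and the sample title and date are placeholders chosen for this example, not part of any prescribed format.

```python
import csv
import io
from datetime import date

FIELDS = ["url", "title", "date"]


def nomination_row(url: str, title: str, noted_on: date) -> dict:
    """Build one row recording a nominated seed URL, its page title and the date."""
    return {"url": url, "title": title, "date": noted_on.isoformat()}


def write_nominations(rows, handle) -> None:
    """Write nomination rows as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer stands in for a file; title and date are placeholders.
buffer = io.StringIO()
write_nominations(
    [nomination_row("http://www.kotekan.com/", "Example seed (placeholder title)",
                    date(2015, 10, 26))],
    buffer,
)
print(buffer.getvalue())
```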
 Running crawls/capturing web archives
 Doing and/or overseeing initial quality assurance of crawls
 Coordinating efforts & fielding questions
 Technical elements of web archiving
 Web archiving policy
 Permissions processing
 Needs assessment
 User profiles and use cases
 Value and usage assessment? (later)
Domain / expertise of web archivists
 Would there be a need to limit access to what you
are collecting? Why or why not?
 Are there any privacy or intellectual property rights
issues that can be anticipated?
 Is it necessary to ask permission of the site owner
to archive their website?
 Are there any ethical implications of your
collecting?
Considering policies:
permissions, privacy and access
 There is no explicit provision in the US Copyright Act
giving libraries an exception for web archiving
 As of 2015 Columbia University Libraries’ policy was
to request permission from website owners to
harvest their websites and provide access to
archived versions
 Permission request email sent to contact info from website
 If no response after 2-3 weeks, follow-up request with
notification of intent to archive website
 If no response, proceed with archiving
 Permission to collect is rarely denied, and
takedown notices will be respected
Permissions
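The request, follow-up and proceed sequence above can be sketched as a small decision function, which may help when tracking many sites at once. This is an illustration under stated assumptions, not Columbia's actual tooling. The function name, the three-week default wait, the wait applied after the follow-up, and the handling of a denial are all assumptions; the slide itself gives only the 2-3 week interval before the follow-up.

```python
from datetime import date, timedelta
from typing import Optional


def next_permission_step(today: date, request_sent: date,
                         followup_sent: Optional[date] = None,
                         response: Optional[str] = None,
                         wait: timedelta = timedelta(weeks=3)) -> str:
    """Suggest the next action for one site under the workflow described above."""
    if response == "denied":
        return "do not archive"            # assumption: a denial ends the process
    if response == "granted":
        return "proceed with archiving"
    if followup_sent is None:
        if today - request_sent >= wait:
            return "send follow-up with notification of intent"
        return "wait for response"
    if today - followup_sent >= wait:      # assumption: same wait after the follow-up
        return "proceed with archiving"
    return "wait for response"


print(next_permission_step(date(2015, 3, 1), request_sent=date(2015, 2, 1)))
print(next_permission_step(date(2015, 4, 1), request_sent=date(2015, 2, 1),
                           followup_sent=date(2015, 3, 1)))
```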
 Tracking nominations
We used Google Sheets
 Tracking permissions
Basecamp
Google Sheets for now, relational database later
 Tracking progress
Basecamp, Google Sheets
 Tracking QA results
Google Forms (feeds into Google spreadsheets)
Considering project tracking & tools
CCWA as example (shared access needed)
Acquiring &
ensuring quality
of collected websites
Challenges: media files & images
(using QA tools)
http://wayback.archive-it.org/4019/20151026122753/http://www.kotekan.com/design.html
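The URL above packs three pieces of information useful for QA notes: a collection number, a 14-digit capture timestamp (YYYYMMDDhhmmss) and the original URL. The sketch below pulls them apart in Python. The pattern is written to fit this one example URL, and the function name `parse_archive_it_url` is made up for illustration, so treat it as a starting point rather than a complete parser.

```python
import re
from datetime import datetime

# Pattern fitted to the example URL: /<collection>/<14-digit timestamp>/<original URL>
WAYBACK_PATTERN = re.compile(
    r"^https?://wayback\.archive-it\.org/"
    r"(?P<collection>\d+)/(?P<timestamp>\d{14})/(?P<original>.+)$"
)


def parse_archive_it_url(url: str):
    """Split an archived-page URL into collection ID, capture time and original URL."""
    match = WAYBACK_PATTERN.match(url)
    if match is None:
        raise ValueError(f"not a recognised archived-page URL: {url}")
    captured = datetime.strptime(match["timestamp"], "%Y%m%d%H%M%S")
    return match["collection"], captured, match["original"]


collection, captured, original = parse_archive_it_url(
    "http://wayback.archive-it.org/4019/20151026122753/http://www.kotekan.com/design.html"
)
print(collection, captured.isoformat(), original)
```

Recording the capture time alongside each documented error makes it easier to tell later whether a patch crawl fixed the problem.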
Documenting errors
 Having more complete information
 Fidelity perceived as correlating with accurate
representation of the resource and the information
contained therein
 Perfection is not attainable but better is better
Does this take time?
YES
Why bother?
Description and
access for
archived websites
Use cases
Photo credit: Anna Perricci
Who are the web archives for?
Are they being used?
Could we encourage more effective use?
Cataloging & Quality Assurance
 Cataloging / Metadata
assignment essential to
discoverability
 Quality assurance
testing
 See QA procedural
reference guide from
NYARC
http://wiki.nyarc.org/web-archiving/quality-assurance/
Photo credit: Anna Perricci
Cataloging expertise
 Alex Thurman (web
archivist and skilled
cataloger) & Russell
Merritt (with decades of
experience cataloging
music resources) made
high quality records for
CAUSEWAY & CCWA
 Bibliographic assistant
added metadata to
Archive-It
 Records can be released to WorldCat
 A query can be built for OCLC WorldShare to
obtain the MARC records for CCWA and
CAUSEWAY
 The records can be delivered in a batch one
time or periodically on an ongoing basis
Importing records
via OCLC WorldShare
 Archive-it.org site-level metadata (All thematic
collections, DCMI, copied from MARC records if
possible)
 CLIO collection-level MARC records
 CLIO site-level MARC records
 Document-level MARC records
 Human Rights Web Archive portal on CUL website
(using metadata extracted from MARC records)
Description for archived websites:
examples from Columbia
http://hrwa.cul.columbia.edu
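For readers unfamiliar with DCMI (Dublin Core) metadata, the sketch below builds a very small site-level record as XML with Python's standard library. Only the Dublin Core element namespace and the element names are standard. The wrapper element `record`, the function name and the sample values are placeholders for illustration and do not represent Archive-It's or Columbia's actual export format.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"   # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields: dict) -> str:
    """Serialize a dict of Dublin Core element names and values as XML text."""
    root = ET.Element("record")              # hypothetical wrapper element
    for name, value in fields.items():
        element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        element.text = value
    return ET.tostring(root, encoding="unicode")


# Sample values are placeholders, not a real catalog record.
xml_text = dublin_core_record({
    "title": "Example archived website (placeholder title)",
    "identifier": "http://www.kotekan.com/",
    "type": "InteractiveResource",
})
print(xml_text)
```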
 Columbia University resource: Guidelines for
Preservable Websites
 https://library.columbia.edu/bts/web_resources_collection/guidelines_for_preservable_websites.html
 Stanford resource: Archivability
 https://library.stanford.edu/projects/web-archiving/archivability
 Site creators might care about web archiving
particularly if practical steps, best practices and
potential benefits to them are made clear
Best Practices for site creators:
work with website creators & guidelines
Thank you!
Anna Perricci
anna.perricci@gmail.com

Establishing and growing a multi-institutional web archiving collaboration f...
 
Progress Made and Lessons Learned through Collaborative Web Archiving Proj...
Progress Made and Lessons Learned through Collaborative Web Archiving Proj...Progress Made and Lessons Learned through Collaborative Web Archiving Proj...
Progress Made and Lessons Learned through Collaborative Web Archiving Proj...
 
Web archiving collaborations: a presentation for colleagues working in the Li...
Web archiving collaborations: a presentation for colleagues working in the Li...Web archiving collaborations: a presentation for colleagues working in the Li...
Web archiving collaborations: a presentation for colleagues working in the Li...
 
Lightning talk on MARC records for the Contemporary Composers Web Archive pre...
Lightning talk on MARC records for the Contemporary Composers Web Archive pre...Lightning talk on MARC records for the Contemporary Composers Web Archive pre...
Lightning talk on MARC records for the Contemporary Composers Web Archive pre...
 


Web Archiving Intro (circa 2015)

  • 1. Select slides from: Web Archiving Fundamentals (SAA course circa 2015, assuming crawler-based capture) Anna Perricci Anna.Perricci@gmail.com These slides were made in 2015 and lightly edited in 2018. They are not entirely current but are offered for reference
  • 2.  YOUR INSTRUCTOR: Anna Perricci  Timeframe for experience with web archiving: 2007-present  Course work at University of Michigan School of Information (including 2008 web archiving course with Margaret Hedstrom)  ICPSR intern project/recommendations  Grad research on digital and physical art and archives (e.g. video games and new media art)  Columbia University: over two years of full-time work on collaborative web archiving  Webrecorder/Rhizome: growing robust set of open source web archiving tools  Teaching background  Course design: preservation of new media and performance based artwork  SAA Web Archiving Roundtable Education Coordinator 2013-2016  Interesting extra projects that have shaped how I view capturing and representing contemporary creative work and social movements  FIGMENT  Occupy Wall Street Archives Working Group
  • 3. Web archiving is a new and growing field and we need people with new ideas and evolving skill sets So glad to have you join us!
  • 4.  This course will provide a foundational knowledge of web archiving including steps to take in forming a web archiving program and core concepts in web archiving practice  Constant change will be a given in web archiving as long as web-based technologies continue to evolve  Let’s get ready for this ongoing challenge! Goals!
  • 5.  Describe current web archiving practice  Identify key steps to go from collection development policy to initial construction of collections of archived websites  Explain subsequent steps to test quality, describe, facilitate preservation and provide access to web archives By the end of course you should be able to
  • 6. Students will get information that will support further learning and training so they can get the most out of subsequent instruction including on the use of web archiving software, which is also subject to regular changes and updates This course will not teach you “how to do it”
  • 7.  Major web archiving software providers output archived websites in the form of WARC files  These files should be included in wider digital preservation planning  Opening WARC files/accessing information contained requires software that is not yet common  The immediate storage & use of web archives is closely connected to web archiving service providers (i.e. Archive-It, Webrecorder) Out of scope: Preservation workflows for web archives
  • 8.  Website: one or more web pages  Web archiving: the process of selecting, capturing, saving and making accessible select content available online (e.g. websites)  Web archive: archived web content/website  Web archives: a group of web-published materials collected, managed and made accessible Working definitions for this webinar
  • 9. Eeeek…?! Crawlers, robots & spiders The software used to collect web content is often referred to as a crawler, robot or spider This webinar will focus on workflows that have been developed in conjunction with the use of Archive-It but should be relevant to those using other tools
  • 10.  When explaining web archiving concepts I will call the software for collecting websites a crawler, robot, spider and/or a harvester  A crawler (aka spider, robot) is software that indexes web content  In web archiving a crawler is used in conjunction with software that harvests (collects) websites and packages that content into a standard file format (WARC) A few names for the same thing…
  • 11.  Any URL that one directs the crawler to capture  The seeds selected will determine the content in the collection and the scope of the crawls  Seed URL(s) determine how much of a website will be archived What is a seed site? Source: Archive-It help wiki https://webarchive.jira.com/wiki/display/ARIH/Selecting+Seed
  • 12. Seed site: URL for an entire website top level / domain http://www.kotekan.com/
  • 13. Seed site: URL for specific part (directory) of a website http://www.kotekan.com/design.html
  • 14. Seed site: URL for a specific page http://www.kotekan.com/Southworth_CV_2013.pdf
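
The relationship between a seed and the content it brings into scope (slides 11 to 14) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name in_scope and its rules (same host; a seed ending in "/" covers everything beneath it; any other seed covers just that one URL) are hypothetical and are not the scoping rules of Archive-It or any other service.

```python
from urllib.parse import urlsplit


def in_scope(seed: str, candidate: str) -> bool:
    """Return True if candidate falls under the seed (hypothetical rules)."""
    s, c = urlsplit(seed), urlsplit(candidate)
    # Content on a different host is treated as out of scope.
    if s.hostname != c.hostname:
        return False
    # A top-level seed covers the entire website.
    if s.path in ("", "/"):
        return True
    # A seed ending in "/" covers everything beneath that directory.
    if s.path.endswith("/"):
        return c.path.startswith(s.path)
    # Any other seed is treated as a single page or document.
    return c.path == s.path


print(in_scope("http://www.kotekan.com/", "http://www.kotekan.com/design.html"))
```

Real services apply their own default scope and let you expand or limit it, so check the help documentation for the tool in use.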
  • 15. To comprehensively represent records created in the twenty-first century, select websites and other web-based resources should be captured, stored, managed, described, and made accessible as appropriate Why archive websites?
  • 16.  Content available primarily or solely online is among the most at-risk born-digital materials  Websites that can be collected are freely and widely available to anyone at some time but can vanish at the volition of the site owner  Like with other digital materials, web content is very vulnerable to loss by comparison to information contained in most analog media Why archive websites? (cont.)
  • 17.  Curated collections of web archives can be a valuable part of collection development  Some resources that used to be published and distributed on paper are now only available online Examples include:  Course catalogs (!)  Reports  Publicity materials for art galleries, events Why archive websites? (cont.)
  • 18.  How to scope your collecting (intellectually and technically)  Practices for acquiring and ensuring quality of collected websites  Steps to take to facilitate access (e.g. description concepts and access systems) We’ll focus on things to consider before beginning efforts to archive websites
  • 19.
  • 20. These elements can be scaled to guide the collection of websites or select materials from websites at any institution but don’t be discouraged by varying levels of success, a process or scope that needs to be changed or a lack of resources to do it “right” Comprehensive web archiving programs have a few core elements
  • 21.  In the US major curated collections of web archives are usually created and maintained by institutions (often based in academic libraries)  Most use a suite of tools/software as a service (e.g. Archive-It, Webrecorder.io)  It is common for an institution to focus on its own/local web presence (e.g. www._.edu, work of faculty & students) What is being saved?
  • 22.  Code/info in web programming language  HTML, Flash (a bit better recently)  Some formatting (e.g. CSS/Cascading Style Sheets)  Text  Images  Some media files (embedded not streamed)  Documents, spreadsheets, presentations, data sets  XML, PDF, CSV What is being saved by a crawler?
  • 23. Videos & social media content are among the hardest things to capture with a crawler but getting them is becoming more possible (e.g. Webrecorder, Brozzler)
  • 24.  Content excluded by robots.txt, a file that asks crawlers (including ones set to collect websites for web archives) not to access all or part of a site; some services can be set to ignore robots.txt  Streamed media content  Database driven features of websites  Password protected content  Dynamically generated content What is not being saved by a crawler?
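
To see how a robots.txt file affects a crawler (slide 24), here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt text, the user agent name "example-archiver" and the example.org URLs are made up for illustration; a real crawler would read the file from the site being archived.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content asking all crawlers to skip /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for url in ("http://www.example.org/index.html",
            "http://www.example.org/private/report.pdf"):
    print(url, allowed(ROBOTS_TXT, "example-archiver", url))
```

A service that is set to ignore robots.txt would simply skip this check.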
  • 25. A crawler can get caught in an endless loop on a website For example: a calendar without an end date This endless loop is also known as a crawler trap Crawler traps
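
One common way to limit the damage from a crawler trap (slide 25) is to cap how many URLs are collected from any one area of a site. The sketch below assumes a made-up TrapGuard class that counts URLs by host and first path segment; the class name, the grouping rule and the limit are illustrative choices, not how any particular crawler works.

```python
from collections import Counter
from urllib.parse import urlsplit


class TrapGuard:
    """Stop fetching once one area of a site exceeds a URL budget."""

    def __init__(self, max_per_prefix: int = 100):
        self.max_per_prefix = max_per_prefix
        self.seen = Counter()

    def should_fetch(self, url: str) -> bool:
        parts = urlsplit(url)
        # Group URLs by host and first path segment, e.g. "/calendar".
        first_segment = parts.path.strip("/").split("/")[0]
        key = (parts.hostname, first_segment)
        self.seen[key] += 1
        return self.seen[key] <= self.max_per_prefix


guard = TrapGuard(max_per_prefix=3)
# A calendar with no end date keeps generating new URLs.
for year in range(2015, 2020):
    url = f"http://www.example.org/calendar/{year}/01"
    print(url, guard.should_fetch(url))
```

With a budget of 3, the first three calendar pages are fetched and the rest are skipped, while other areas of the site keep their own budgets.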
  • 26.  The World Wide Web became widely available in the United States between 1993 and 1995  The Internet Archive (archive.org) began collecting websites in 1996 (first web pages made available in 2001) For reference
  • 27.  The Internet Archive, Library of Congress, and national libraries in Europe, Australia and New Zealand were early leaders in web archiving  Web archiving activities at the Library of Congress began in 2000  In Europe there are a lot of domain level crawls (e.g. .dk, .fr)  Onsite-only access is the most common model for national libraries in Europe  A growing number of institutions are making efforts to collect web archives that fit within their collection development policies When and where has web archiving been done so far?
  • 28.  The Internet Archive had an early start with web archiving but also has a much wider focus that is being publicized in several project areas  IA is a service provider to LC (crawls) and via Archive-It  Wayback Machine  What is and isn’t captured  Irregular frequency of page capture  Archive.org  ‘Save page now’ via https://archive.org/web/ A few more words about the very amazing Internet Archive
  • 29.  Collection development and planning  Selection  Permissions  Harvesting  Description  Access  Long-term preservation Web archiving is a multi-step process
  • 31.  Intellectually—within collecting policy as well as thinking through what makes sense for you/your institution with the tools and resources at hand  Careful consideration and plenty of questions See following slides for framing questions we’ve used for the collaborative web archiving pilot projects for Borrow Direct/Ivy Plus How to scope your collecting (intellectually)
  • 32.  Seed site will initially determine the depth of crawl  Setting scoping rules (limits and expansions) in web archiving software How many pages are expected on a given site?  Identify missing content and try to capture it with patch crawls and/or adding more URLs associated or within the site you are trying to archive  Read the help documentation for the software/service you are using for tips How to scope your collecting (technically)
  • 33.  Why collect websites (needs, collection scope)  What to collect  How/what tools to use  When: how often to collect & when will these materials be used? Things to consider before beginning efforts to archive websites
  • 34.  Where will the projects be based (institutionally)  Who will lead this work and complete necessary tasks Who are key stakeholders in this work Things to consider before beginning efforts to archive websites (cont.)
  • 35.  What benefits occur/needs are met through web collecting (selecting, acquiring, organizing, providing access, preserving)?  Is your institution doing any web archiving? If so, are there lessons to keep in mind? Framing questions
  • 36.  Have others in your organization discussed this idea?  How widespread is awareness about web collecting/archiving?  Do you think the idea would be well received, or seen as questionable?  What staff (within the library or beyond) would be most likely to be involved? Framing questions
  • 37.  What types of web content would you be most interested in collecting? Is social media a high priority?  Any specific subjects?  Where does web archiving fit into your collection development policies (existing or in terms of upcoming revisions)? Framing questions
  • 38.  Example question set to consider Columbia has thus far shaped its collecting around certain policies. What issues, if any, do you see arising from these? Would they interfere with your local processes or expectations? Framing questions
  • 39.  Permissions--requests versus notification only  Limiting collecting to content that is freely available on the web. To date we have not dealt with licensed or password-protected content  Making the archived content publicly available (i.e. without restrictions or authentication)  Collecting whole websites rather than individual documents (for the sake of efficiency) instead of running a separate program for document-based collecting Factors to consider for example
  • 40. Web archiving is not a process that can run successfully using the workflow casually known as ‘set it and forget it’ A potential workflow to forget…
  • 41.  Collecting strategy and establishment of priorities for collection development could be a group effort  Contributions could include  Suggest seed URLs  Liaise with site owners to solicit permission to archive websites  Governance of collaborations if multiple institutions are involved Considering ways to share responsibilities
  • 42.  Detailed quality assurance through browsing the archived website as a user would (e.g. try to access media files to ensure they have been successfully captured)  Assessment of efficacy for users? Considering ways to share responsibilities (cont.)
  • 43.  Determine if you would like to capture all pages on the website, specific areas of the website (directories) or a single page  Copy the seed URL from your web browser  Paste the URL in the address bar in another web browser (Firefox, Chrome) to double check that the URL leads to the content to be archived  Paste the URL in a document or spreadsheet  Next to the URL add the title of the page and the date Sample workflow for selecting seeds
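
The last two steps of the sample workflow (record the URL, then add the title and date next to it) can be sketched as a small script that writes a CSV file a spreadsheet can open. The column names, the helper names seed_row and write_seed_log, and the sample title are assumptions made for illustration only.

```python
import csv
import io
from datetime import date


def seed_row(url: str, title: str, selected_on: date) -> list:
    """Build one log row: seed URL, page title, date selected."""
    return [url.strip(), title.strip(), selected_on.isoformat()]


def write_seed_log(rows, out) -> None:
    """Write a header plus the seed rows as CSV to an open text file."""
    writer = csv.writer(out)
    writer.writerow(["seed_url", "title", "date_selected"])
    writer.writerows(rows)


buffer = io.StringIO()
rows = [seed_row("http://www.kotekan.com/design.html",
                 "Design (sample title)", date(2015, 10, 26))]
write_seed_log(rows, buffer)
print(buffer.getvalue())
```

To save to disk instead of printing, pass a file opened with open("seeds.csv", "w", newline="") in place of the in-memory buffer.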
  • 44.  Running crawls/capture web archives  Do initial &/or oversee quality assurance of crawls  Coordinate efforts & field questions  Technical elements of web archiving  Web archiving policy  Permissions processing  Needs assessment  User profiles and use cases  Value and usage assessment? (later) Domain / expertise of web archivists
  • 45.  Would there be a need to limit access to what you are collecting? Why or why not?  Are there any privacy or intellectual property rights issues that can be anticipated?  Is it necessary to ask permission of the site owner to archive their website?  Are there any ethical implications of your collecting? Considering policies: permissions, privacy and access
  • 46.  The US Copyright Act gives libraries no explicit exception for web archiving  As of 2015 Columbia University Libraries’ policy was to request permission from website owners to harvest their websites and provide access to archived versions  Permission request email sent to contact info from website  If no response after 2-3 weeks, follow-up request with notification of intent to archive website  If no response, proceed with archiving  Rarely denied permission to collect and will respect a takedown notice Permissions
  • 47.  Tracking nominations: we used Google Sheets  Tracking permissions: Basecamp; Google Sheets for now, relational database later  Tracking progress: Basecamp, Google Sheets  Tracking QA results: Google Forms (feeds into Google spreadsheets) Considering project tracking & tools CCWA as example (shared access needed)
  • 48. Acquiring & ensuring quality of collected websites
  • 49. Challenges: media files & images (using QA tools) http://wayback.archive-it.org/4019/20151026122753/http://www.kotekan.com/design.html
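
The capture URL on slide 49 appears to follow the pattern host / collection number / 14-digit timestamp / original URL. Assuming that pattern holds, a short script can pull the pieces apart, which is handy when logging QA results. The function name parse_capture_url and the regular expression are illustrative and not part of any Archive-It tool.

```python
import re
from datetime import datetime

# Assumed pattern: http(s)://host/<collection>/<YYYYMMDDhhmmss>/<original URL>
CAPTURE_RE = re.compile(r"^https?://[^/]+/(\d+)/(\d{14})/(.+)$")


def parse_capture_url(url: str):
    """Split a capture URL into (collection, capture time, original URL)."""
    match = CAPTURE_RE.match(url)
    if match is None:
        raise ValueError(f"not a recognised capture URL: {url}")
    collection, stamp, original = match.groups()
    captured_at = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    return collection, captured_at, original


print(parse_capture_url(
    "http://wayback.archive-it.org/4019/20151026122753/"
    "http://www.kotekan.com/design.html"
))
```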
  • 51.  Having more complete information  Fidelity perceived as correlating with accurate representation of the resource and the information contained therein  Perfection is not attainable but better is better Does this take time? YES Why bother?
  • 53. Use cases Photo credit: Anna Perricci
  • 54. Who are the web archives for? Are they being used? Could we encourage more effective use?
  • 55. Cataloging & Quality Assurance  Cataloging / Metadata assignment essential to discoverability  Quality assurance testing  See QA procedural reference guide from NYARC http://wiki.nyarc.org/web-archiving/quality-assurance/ Photo credit: Anna Perricci
  • 56. Cataloging expertise  Alex Thurman (web archivist and skilled cataloger) & Russell Merritt (with decades of experience cataloging music resources) made high quality records for CAUSEWAY & CCWA  Bibliographic assistant added metadata to Archive-It
  • 57.
  • 58.
  • 59.  Records can be released to WorldCat  A query can be built for OCLC WorldShare to obtain the MARC records for CCWA and CAUSEWAY  The records can be delivered in a batch one time or periodically on an ongoing basis Importing records via OCLC WorldShare
  • 60.  Archive-it.org site-level metadata (All thematic collections, DCMI, copied from MARC records if possible)  CLIO collection-level MARC records  CLIO site-level MARC records  Document-level MARC records  Human Rights Web Archive portal on CUL website (using metadata extracted from MARC records) Description for archived websites: examples from Columbia
  • 62.  Columbia University resource: Guidelines for Preservable Websites  https://library.columbia.edu/bts/web_resources_collection/guidelines_for_preservable_websites.html  Stanford resource: Archivability  https://library.stanford.edu/projects/web-archiving/archivability  Site creators might care about web archiving particularly if practical steps, best practices and potential benefits to them are made clear Best Practices for site creators: work with website creators & guidelines
