1. UBC LIBRARY WEB ARCHIVING
Presentation to UBC Library Community
LARISSA RINGHAM, LIBRARIAN, DIGITAL PROJECTS
CAROLINA ROMAN AMIGO, STUDENT LIBRARIAN, DIGITAL PROJECTS
2.
TODAY…
1. Why web archiving?
2. Benchmarking/research project
3. Web archiving at UBC
4. How Archive-It works
5. What’s next?
5.
WEB ARCHIVING BENCHMARKING – CANADIAN UNIVERSITIES
NUMBER OF COLLECTIONS PER INSTITUTION ON ARCHIVE-IT
• 18 University of Victoria
• 17 University of Alberta
• 10 University of Toronto
• 6 University of British Columbia
• 6 Wilfrid Laurier University
• 4 University of Winnipeg
• 4 University of Manitoba
• 3 University of Waterloo
• 3 Simon Fraser University
• 1 University of Waterloo & Toronto
• 1 University of Saskatchewan
• 1 Dalhousie University
• 1 Carleton University (MacOdrum Library)
12 Canadian universities have web archiving initiatives.
6.
COLLECTION SCOPE – TYPE OF CONTENT
• Institution-owned/affiliated websites
• Subject-specific relevant websites
• Federal/local governmental websites
• Locally relevant events
• Local organizations
• Research projects
• Local news
• Local heritage
• International events
7.
COLLECTION SCOPE – REASONS FOR ARCHIVING
• Public or scholarly interest
• Preserving institution-produced content
• Historical or geographically local significance
• At risk or to be decommissioned
• Supplements an existing collection
• Born-digital resource
8.
ACCESS TO WEB ARCHIVING COLLECTIONS
WHERE ARE THEY AVAILABLE?
• Usually under digital/archives or special collections
• Less often: under subject guides, under additional resources, or featured on the library home page
DO THEY HAVE A DEDICATED PAGE?
• A large majority has a page dedicated to the web archives initiative
HOW ARE THEY LINKED?
• Usually linked to the Archive-It institution page
• Less often: direct links to live webpages, direct links to archived webpages, or restricted-access links
9.
POLICIES ADOPTED BY THE TOP 3 WEB ARCHIVING INITIATIVES AMONG CANADIAN UNIVERSITIES
OWNERSHIP, LIABILITY, AUTHORIZATION
• Ownership remains with the website owner, and the university assumes no liability.
• Authorization is granted for educational purposes, observing copyright restrictions from website owners.
NOTIFICATION AND TAKEDOWN
• Owners are notified or asked for permission only in the case of technologically protected content.
• Takedown requests are accepted.
11.
UBC WEB ARCHIVING PROJECTS TO DATE
Federal Government Websites (pilot)
• Partnered with HSS Library
First Nations and Indigenous Communities Websites
• Partnered with Xwi7xwa Library
2015 Metro Vancouver Transportation and Transit Plebiscite
• Partnered with HSS Library
12.
UBC Asian Library Historical Websites
• Partnered with Asian Library
UBC Conferences and Events
UBC Community and Partners
• Partnered with Faculty of Education, UBC Press
22.
COMING UP NEXT…
BC Local Government Websites
• Collaborative project with UBC / UVic / SFU
UBC.ca Institutional Website
• Partnership with UBC Archives
23.
…AND ON THE HORIZON
• Metadata enhancement
• Access and discoverability
• Assessment and analytics
• Preservation with Archivematica
25.
WEB ARCHIVING: CONTENT PRIORITIES
1. Research, public, or governmental interest relevant for teaching or research
2. Historical or geographically local significance
3. Complementarity with relevant existing collections
4. Content produced by the university or affiliated organizations
27.
HOW ARE WEB ARCHIVING PROJECTS STRUCTURED?
Source: Bragg, M., & Hanna, K. (2013). The Web Archiving Life Cycle Model. Internet Archive.
28.
WEB ARCHIVING PROJECT ROLES
Stakeholder / project partner
• proposes the project
• identifies the content
• performs the final QA check
Digital Initiatives
• evaluates the project against the policy criteria
• scopes the project and assesses resource needs
• performs the archiving crawls
• performs initial QA checks
• creates and applies metadata
• makes content available
29.
SOME THINGS TO KEEP IN MIND…
Technical limitations with the Archive-It crawler, including but very much not limited to:
• JavaScript
• Silverlight
• Dynamic databases
• Password-protected content
• Streaming media
What do we mean by “web archiving”?
Collecting targeted websites and web content for preservation and access
[why it is important]
More and more content is being born digital, and not all of it is being captured
Digital preservation programs capture documents and files, but that is not all of the *web content*
Many sites are abandoned or taken down, and that content is lost forever
UBC Library, through Digital Initiatives, has been doing web archiving since 2013 using the Archive-It service from the Internet Archive (will talk more about that a bit later)
- Our archiving activities have so far been very much an off-the-side-of-the-desk activity, in response to specific, time-sensitive preservation issues. But now we want to think about the program a bit more strategically and take the initiative to the larger library community to hear what you would want to see as web archiving activity at UBC.
Now: Carolina … research
Examples of subject specific relevant websites: University of Alberta Energy/Environment Collection, and Circumpolar Collection.
Examples of Local relevant events: Alberta Floods June 2013
Public or scholarly interest: from websites relevant to research to governmental websites.
Examples of historical or geographically local significance: B.C. Teachers' Labour Dispute (2014) (UVic)
Examples of at risk or to be decommissioned webpages: Edmonton Public Library - Orphaned Collections (University of Alberta)
So, how did we get into web archiving?
In 2013, the CGI-PLN group learned that federal government websites were being replaced, and UBC took part in an initiative to archive them
It was a pilot project, and in retrospect it was not ideal for a pilot given the size of what we captured
But we were pressed for time, and had a little over a month
Almost 1 million pages archived
After that we partnered with Xwi7xwa Library to build another collection to capture First Nations and Indigenous community content
Much of it is an example of endangered community content, and a good example of what I was talking about at the beginning regarding the ephemerality of web content
These sites were largely outdated and not properly maintained, often because the community organizations were no longer in existence.
In some cases we were not successful in capturing the content, sometimes for technical reasons and in one case because the site itself was hijacked
But they are a valuable source of community content for researchers and students.
We are using Archive-It
Archive-It is a subscription web archiving service from the Internet Archive that helps organizations harvest, build, and preserve collections of digital content.
Archive-It allows us to collect and manage our collections of archived content in an openly accessible format, with full-text search available
Content is hosted and stored at the Internet Archive data centers.
[Relationship between the Internet Archive, Archive-It, Wayback Machine]
Archive-It is essentially a managed version of the Wayback Machine.
- The IA’s Wayback Machine crawls web content constantly but indiscriminately
- Archive-It allows us to build custom collections based on the needs of what we are collecting, and gives us control over what is crawled and how often
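As a side note for anyone curious: because captures end up in the Internet Archive, they can be looked up programmatically through the public Wayback Machine availability API. A minimal sketch, assuming the `archive.org/wayback/available` endpoint; the function and variable names here are our own illustration, not part of Archive-It itself:

```python
import json
from urllib.parse import urlencode

# Public Wayback Machine availability endpoint (Internet Archive).
WAYBACK_API = "https://archive.org/wayback/available"

def availability_query(url, timestamp=None):
    """Build a query URL asking whether the Wayback Machine holds a capture.

    timestamp (optional, YYYYMMDD) asks for the capture closest to that date.
    """
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return WAYBACK_API + "?" + urlencode(params)

def closest_capture(api_response):
    """Return the URL of the closest archived snapshot from the API's JSON,
    or None if no capture is available."""
    snap = json.loads(api_response).get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap and snap.get("available") else None
```

Fetching `availability_query("ubc.ca", "20130601")` with any HTTP client would return a small JSON document; `closest_capture` then pulls out the replay URL if one exists.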
I’m going to let Carolina walk us through the steps of a crawl so you can get the basic idea of the capture process.
Our collections are accessible through this institutional collection page [go to live site for demo]
Because we do not have a vast office of bodies dedicated to the task of web archiving, projects are assessed on a case-by-case basis against a set of criteria.
The web archiving program is run out of Digital Initiatives. For the most part we do not actively identify and select content ourselves; we rely on the subject-matter experts, librarians, faculty members, and others, to bring potential content to our attention.
The web archiving team consists of myself and, through to the end of August, Carolina.
Resource constraints