Progress Made and 
Lessons Learned 
through Collaborative Web 
Archiving Projects 
Anna Perricci 
Columbia University Libraries 
Archive-It Partner Meeting 2014 
November 18, 2014
Web Resources Archiving Collaboration 
• Many thanks to the Mellon Foundation 
• Building collaborations among 
– The web archiving community 
– Other research libraries 
– Users and potential users of web archives 
– Site creators
Incentive awards projects 
to advance web archiving tools 
Warcbase: Building a Scalable Web Archiving Platform on HBase 
and Hadoop. (Jimmy Lin, University of Maryland) 
Archiving Transactions Towards Uninterruptible Web Service 
(Zhiwu Xie and Edward A. Fox, Virginia Tech University)
Incentive awards projects 
to advance web archiving tools 
Visualizing Digital Collections of Web Archives (Michele 
Weigle, Old Dominion University) 
Tools for Managing Seed URLs (Michael Nelson, Old 
Dominion University)
Incentive awards projects 
to advance web archiving tools 
Perma.cc: Mitigating the 
Pervasive Problem of Link 
Rot in Scholarly Works and 
Preserving Online Content 
(Kim Dulin, The Harvard 
Library Innovation Lab) 
Free Law Project 
(providing free access to 
primary legal materials, 
developing legal research 
tools, and supporting 
academic research on legal 
corpora)
Building an efficient, coherent, and scalable 
national framework for collecting web content
https://archive-it.org/home/borrowdirect
Program Components 
• Communication and coordination 
• Seed management and harvest 
• Supplemental quality review (QA testing) 
• MARC Metadata 
• Local preservation storage (seeking solutions)
The first 18 months of collaborative collecting 
• Planning, needs assessment (interviews with stakeholders including 
Associate University Librarians for collection development at each Borrow 
Direct institution in 2013), timelines created 
• Group communication (spreadsheets, Basecamp), cultivating dialogs 
• Coordinated seed URL nominations for pilot collections (CCWA, 
CAUSEWAY), QA testing and creation of MARC records 
• Trying out workflows for optimal balance of involvement and efficient 
forward motion on projects 
• In planning stages for cost sharing and a 5-year plan for Borrow Direct/Ivy 
Plus collaborations
Collaboration with music librarians
Contemporary Composers Web Archive 
Selectors 
• Borrow Direct Music Librarians Group: music librarians at Brown, 
Columbia, Cornell, Dartmouth, Harvard, Johns Hopkins, Princeton, 
and Yale universities, MIT, and the universities of Chicago and 
Pennsylvania 
Cataloging expertise 
• Russell Merritt (cataloger specializing in music resources) 
• Kate Harcourt (Director of Original and Special Materials Cataloging) 
• Alex Thurman (Web Resources Collection Coordinator)
CCWA
Progress on CCWA & lessons learned so far 
By the numbers: 
• 11 curators participating 
• 56 sites currently available in Archive-It 
– 23 additional sites for follow up 
• 27 GB of content archived (268,519 URLs) 
• 50 MARC records in WorldCat as of 11/18/14 
– Russell Merritt (music cataloger) collaboratively developed MARC records 
for composers' websites; further cataloging of available sites through 2CUL 
Outreach 
• SAA presentation on MARC records for CCWA 
http://www.slideshare.net/annaperricci/lightning-talk-for-session-703-of-society-of-american-archivists 
• Over 30 sites tested for quality by five music librarians; 
bibliographic assistant on the grant tested all sites in collection
CCWA Permissions 
77 Composers 
Yes (37) 
No (0) 
Did not respond (35) 
No contact info (2) 
Recently died/did not 
contact (3)
Quality Assurance with music librarians
Creating MARC records for web archives 
• Creating MARC records for archived websites is standard 
practice at CUL 
– MARC records make web archives discoverable in CLIO 
(Columbia Libraries Information Online) 
• Collection level and seed level records 
• Will use Archive-It interface to add Dublin Core metadata
Anticipating wider use of MARC records 
• Records have been regularly 
released to WorldCat 
• Collaborators on cataloging 
were attentive to which 
fields will ordinarily be 
stripped out when a MARC 
record is imported to 
another institution’s OPAC
MARC records
Patron view of record in CLIO
Cataloger’s view of record in CLIO
Progress on CAUSEWAY & lessons learned 
• Curators from 9 Borrow Direct institutions (Ivies Plus Art & 
Architecture Group) 
– Lead advisors: Carole Ann Fabian and Chris Sala 
• 137 seed URLs (over 100 harvested and being released as sites 
are tested, cataloged and assigned metadata in Archive-It) 
• 51 GB of content archived (1,006,114 URLs) 
• Over 60 sites available in Archive-It with DC metadata (also all 
60+ have MARC records in CLIO) 
Outreach 
• Update sent to IVAAG soliciting feedback 
• Gave update and got feedback at semiannual IVAAG meeting 
• Presentation scheduled for ARLIS/NA 2015
CAUSEWAY Permissions 
137 Site owners 
Yes (74) 
No (3) 
Later (2) 
No contact info (2) 
Did not respond (56)
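The permission tallies for the two pilot collections can be compared as shares of all site owners contacted. The following is a minimal sketch, not part of the original project tooling; the counts are copied from the CCWA and CAUSEWAY permissions slides, and the `summarize` helper is a hypothetical name introduced for illustration.

```python
# Permission outcomes copied from the CCWA and CAUSEWAY permissions slides.
PERMISSIONS = {
    "CCWA": {
        "Yes": 37,
        "No": 0,
        "Did not respond": 35,
        "No contact info": 2,
        "Recently died/did not contact": 3,
    },
    "CAUSEWAY": {
        "Yes": 74,
        "No": 3,
        "Later": 2,
        "No contact info": 2,
        "Did not respond": 56,
    },
}


def summarize(outcomes):
    """Return (percentage share per outcome, total number of site owners)."""
    total = sum(outcomes.values())
    shares = {
        label: round(100 * count / total, 1)
        for label, count in outcomes.items()
    }
    return shares, total


if __name__ == "__main__":
    for name, outcomes in PERMISSIONS.items():
        shares, total = summarize(outcomes)
        print(f"{name}: {total} site owners")
        for label, share in shares.items():
            print(f"  {label}: {share}%")
```

The totals match the headline figures on the slides (77 composers, 137 site owners). Explicit permission was granted by roughly 48% of CCWA contacts and 54% of CAUSEWAY contacts, and in both collections most of the remainder simply did not respond.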
CAUSEWAY
Cataloging expertise brought to CAUSEWAY 
• Alex’s expertise in cataloging architecture and urban planning 
sites (built through collaboration with Chris Sala on the Avery 
collecting of web archives) equips him to make more specific 
MARC records for sites in CAUSEWAY 
• Columbia University art and architecture librarians encourage 
users to find resources via records in the OPAC so access to 
CAUSEWAY sites will likely be via the MARC records which point 
to the calendar page for archived sites 
• Alex is working with our Bibliographic Assistant, Naeema Akter 
(position funded by the grant as well) to add appropriate 
metadata for better browsing in the Archive-It interface
Early start on facets in Archive-It
CAUSEWAY goals for the 
remainder of the grant 
• Collect all nominated sites in scope, test for quality, create a MARC 
record for each archived website (by early 2015) 
• Evaluate quality and solicit feedback (ongoing) 
• Meet at ARLIS/NA and discuss progress (March 2015) 
– Anna will also give a presentation on collaborative web 
archiving projects at ARLIS/NA 
• Establish ongoing workflows and goals (2015 and onward) 
• End of pilot phase: December 2015
Project tracking: 
Basecamp & many, many spreadsheets
Pilot climate change collecting 
& lessons learned so far 
• 25 selectors from 5 institutions 
Great range of fields: 
-Wide variety of area studies (9) 
-Social science (5) 
-Science and environmental science (4) 
-Medical (1), Law (1), Special Collections (1) 
-Collection Development AUL (3), Preservation (1) 
• 127 seed websites nominated (some duplication) 
• A lot of enthusiasm for topic
What we’ve learned about 
workflows and scale 
• Distributing work does not reduce costs 
• Collaborative effort builds the project and new tasks promote 
professional growth 
• Quality assurance and cataloging are integral to the process of 
creating high-quality collections of web archives
#webarchivinghappenshere
Use cases 
Image credit: Flickr user: Nicky Jurd (CC BY 2.0)
Using the Human Rights Web Archive & learning 
from human rights scholars’ work 
(publications, citations)
Citations scraped from articles published in 
2010 in select scholarly journals
Isolating URLs from list of citations 
using Open Refine 
(approximately 10% of citations scraped have URLs in them)
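The project did this step in Open Refine. As a rough equivalent for readers who prefer a script, the sketch below pulls URLs out of citation strings with a regular expression and reports what share of citations contain one. The pattern, the function names, and the sample citations in the usage note are illustrative assumptions, not the project's actual rules or data.

```python
import re

# Assumption: a simple pattern for http(s) URLs. Real citation data may
# need a looser pattern (for example, URLs written without a scheme).
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

# Punctuation that often trails a URL at the end of a citation.
# Note: this also trims a closing bracket that is legitimately part of a URL.
TRAILING_PUNCTUATION = ".,;:)]}"


def extract_urls(citation):
    """Return every URL found in one citation string."""
    return [
        match.rstrip(TRAILING_PUNCTUATION)
        for match in URL_PATTERN.findall(citation)
    ]


def share_with_urls(citations):
    """Return the fraction of citations that contain at least one URL."""
    if not citations:
        return 0.0
    with_urls = sum(1 for citation in citations if extract_urls(citation))
    return with_urls / len(citations)
```

For example, `extract_urls("Annual report (2009), http://example.org/report.pdf.")` yields the URL without the sentence-ending period, and `share_with_urls` applied to the full list of scraped citations gives a figure comparable to the approximate 10% noted above.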
Querying Internet Archive collection (via API)
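The slide does not say which API was queried. As one possible illustration, the sketch below uses the Internet Archive's public Wayback Machine availability endpoint to check whether a cited URL has an archived snapshot near a given date. The choice of endpoint and the function names are assumptions made for this example, not a description of the project's actual method.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public Wayback Machine availability endpoint is used here
# as a stand-in; the project may have queried a different API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(url, timestamp=None):
    """Build the request URL for one cited URL.

    timestamp is an optional YYYYMMDD string; the service returns the
    snapshot closest to that date.
    """
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response):
    """Return the closest available snapshot record, or None if there is none."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest
    return None


def lookup(url, timestamp=None):
    """Query the service over the network for one cited URL."""
    with urlopen(build_query(url, timestamp), timeout=30) as reply:
        return closest_snapshot(json.load(reply))
```

Calling `lookup` once per URL isolated from the citations, with a timestamp near the article's publication date, would indicate which cited resources have a capture from around the time they were cited.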
Leveraging HRWA Solr index 
http://hrwa.cul.columbia.edu
Columbia University web resources: 
creating best practices for site creators
Wider reach with guidelines rather than 
suggesting changes on case by case basis
Web archiving initiatives 
focusing on art resources 
An initiative designed to address the “urgent need to document the 
dynamic web-based versions of auction catalogues, catalogues 
raisonnés, and scholarly research projects, as well as artist, gallery, 
and museum websites” (http://www.nyarc.org/content/web-archiving) 
Artist Files Special Interest Group
What do you want to learn 
about web archiving? 
Do you have any suggestions on how the SAA Web 
Archiving Roundtable can help you develop your 
knowledge of web archiving? 
Categories we identified based on the 33 responses: 
– Description 
– Preservation 
– Access/ Use 
– Project Management/ Collaboration 
– Appraisal/ Collection Dev/ Policy 
– Technology/ Capture/ Tools 
– Business Case/ Costs/ Best Practices
Some presentations, papers, panels & posters during grant 
• Moderated: “Web Archiving: Experiences, Perspectives and Possibilities” held at METRO on 10/20/14 
• Presentation (lightning talk): “MARC Records for the Contemporary Composers Web Archive” for the Society of 
American Archivists annual conference on 8/16/14 
URL (via Academic Commons): http://dx.doi.org/10.7916/D8028Q3S 
• Presentation: “SAA Web Archiving Roundtable Education Needs Assessment Survey Results” for the SAA Web 
Archiving Roundtable meeting at Society of American Archivists annual conference (co-presented with John Bence) 
on 8/14/14 
• Presentation: “How Collaboration Can Save [More of] the Web: Recent Progress in Collaborative Web Archiving 
Initiatives” for the METRO Conference 2014 on 1/15/14 
• Poster session: “Assessment of the Effectiveness of the Human Rights Web Archive @Columbia University” (co-presented 
with Pamela Graham) at the ACRL/NY Symposium on 12/6/13 
URL (via Academic Commons): http://dx.doi.org/10.7916/D8BG2KZ9 
• Presentation: “How Collaboration Can Save [More of] the Web: Recent Progress in Collaborative Web Archiving 
Initiatives” for the Best Practices Exchange on 11/14/13 (with Scott Reed) 
URL (via Academic Commons): http://dx.doi.org/10.7916/D8G73BNK 
• Presentation: “Web Archiving Resource Collaboration” at CrawlCamp held at 
METRO on 7/17/13
Are project elements 
on schedule & within budget? 
• So far, yes, though we have plenty of challenges and work 
ahead of us 
• Steady progress on citation analysis but it’s been much harder 
than we thought it’d be 
• Lots of room for engagement and team work including 
maintenance and coordination of cooperative efforts
Refining building materials
Modest gains
The next 12.5 months 
• Complete remainder of work called for in grant 
• Establish shared cost model for collaborative collection building 
(e.g. CCWA and CAUSEWAY) 
• Plan for scaling (maintenance and growth) 
• Codify roles for meaningful involvement in web archiving efforts 
• Contribute to professional organizations to strengthen web 
archiving efforts nationally and internationally
Credits to some of many collaborators 
• Bob Wolven, Alex Thurman, Naeema Akter 
• Pamela Graham, Kate Harcourt, Christina Harlow 
• Talia Jimenez, Stephen Davis, incentive awards oversight panel: 
Kris Carpenter, Mark Phillips, Rob Sanderson & Perry Willett 
• Elizabeth Davis, Russell Merritt & Borrow Direct music librarians 
• Carole Ann Fabian, Chris Sala, Ivies Plus Art & Architecture Group 
• Borrow Direct Associate University Librarians for Collection 
Development group 
• Climate change selectors at Borrow Direct institutions 
• Archive-It staff 
• Community for discussion and participation 
Including: NYARC, METRO, International Internet Preservation Consortium 
(IIPC), SAA Web Archiving Roundtable, ARLIS/NA Artist Files SIG
Growing web archives
Thanks! 
Anna Perricci 
alp2198@columbia.edu 
@AnnaPerricci 
Columbia University Libraries
