Northwestern digital repository initiative: platform and persistence



Introduction to and overview of digital repository projects at Northwestern University, developed for a guest lecture in the Digital Curation course at the Dominican University Graduate School of Library and Information Science. Based in part on an earlier presentation developed by Steve DiDomenico and Claire Stewart.

Published in: Education, Technology

  1. Northwestern digital repository initiative: Platform and persistence
  2. Claire Stewart, Director, Center for Scholarly Communication and Digital Curation; Head, Digital Collections, Library Technology Division, Northwestern University
  3. What is a repository and why should I care?
  4. Library as institutional memory
  5. Tweeted in 2012 by Gail Steinhart, Head of Research Services, Mann Library, Cornell University
  6. The Availability of Research Data Declines Rapidly with Article Age
     “The major cause of the reduced data availability for older papers was the rapid increase in the proportion of data sets reported as either lost or on inaccessible storage media. For papers where authors reported the status of their data, the odds of the data being extant decreased by 17% per year (Figure 1D).” [emphasis added]
     Vines, T. H., Albert, A. Y. K., Andrew, R. L., Débarre, F., Bock, D. G., Franklin, M. T., … Rennison, D. J. (2013). The Availability of Research Data Declines Rapidly with Article Age. Current Biology, 24(1), 94–97. doi:10.1016/j.cub.2013.11.014
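The 17% per-year decline quoted on slide 6 applies to the odds, not the probability, so it compounds multiplicatively from year to year. A minimal Python sketch of that arithmetic follows. The function names and the even-odds baseline in the demo loop are illustrative assumptions, not values taken from the paper.

```python
def odds_multiplier(years: float, annual_decline: float = 0.17) -> float:
    """Factor by which the odds of data being extant shrink after `years`."""
    return (1.0 - annual_decline) ** years


def odds_to_probability(odds: float) -> float:
    """Convert odds, defined as p / (1 - p), back to a probability p."""
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # Hypothetical baseline: even odds (1:1) that the data is extant at publication.
    baseline_odds = 1.0
    for years in (1, 5, 10, 20):
        odds = baseline_odds * odds_multiplier(years)
        print(f"{years:>2} years: odds multiplier {odds_multiplier(years):.3f}, "
              f"probability {odds_to_probability(odds):.3f}")
```

Because the decline acts on odds, the resulting probability depends on the starting odds, which is why the baseline has to be stated explicitly.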
  7. What is a repository and why should I care? The Repository:
     • A concept
     • All the stuff
     • A set of technologies
  8. Technologies and architecture
  9. Repository as service
     • Description and characterization: descriptive, provenance and technical metadata
     • Selection, conversion, digitization
     • Deposit and versioning
     • Interoperability, APIs for ingest, discovery
     • Access control, copyright support and other legal/regulatory compliance
     • Persistence
       – Stable, permanent links (URLs, DOIs, etc.)
       – Health of digital objects
       – Replication and dark archiving
       – Migration or emulation, virtualization
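One concrete form of the "stable, permanent links" item on slide 9 is a DOI, which resolves through the public resolver at https://doi.org/. The sketch below turns a DOI string into its resolver URL. The helper name `doi_to_url` is hypothetical, and the slides do not say which identifier scheme the Northwestern repository uses; the example DOI is the one cited on slide 6.

```python
from urllib.parse import quote


def doi_to_url(doi: str) -> str:
    """Build a doi.org resolver URL from a DOI, with or without a 'doi:' prefix."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode anything unsafe in a URL path, but keep the '/' separator.
    return "https://doi.org/" + quote(doi, safe="/")


if __name__ == "__main__":
    print(doi_to_url("doi:10.1016/j.cub.2013.11.014"))
```

The link stays valid even if the publisher moves the article, because only the resolver's record needs updating.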
  10. What’s already in our repository
  11. Maps of Africa
      • First Fedora project @ NU
      • 2006 project, internally funded
      • 116 antique maps at high resolution
  12. Maps in Fedora: METS, PREMIS, JPEG2000
  13. Archival finding aids: Archon for EAD, Fedora + Blacklight for storage and discovery, Primo syndication
  14. Winterton Collection
  15. Northwestern Books and the Book Workflow Interface
      • 2009, Mellon-funded
      • Now used for all in-house book digitization
  16. Every page of each digitized book has this information:

      Content                                    Datastream ID   MIME type        Schema/ontology
      Dublin Core metadata                       DC              text/xml         OAI_DC
      MODS metadata                              MODS            text/xml         MODS
      Relationship metadata                      RELS-EXT        text/xml         RELS-EXT
      OCR PDF file                               PDF             application/pdf
      OCR XML                                    OCR XML         text/xml         ABBYY OCR
      OCR text                                   OCR TEXT        text/plain
      Source camera image file                   ARCHV-IMG       image/jpeg
      Source technical metadata in MIX           ARCHIV-TECHMD   text/xml         MIX
      Source camera technical metadata in EXIF   ARCHV-EXIF      text/xml         Exif as XML
      Corrected image file                       PROC-IMG        image/jpeg
      Corrected image technical metadata in MIX  PROC-TECHMD     text/xml         MIX
      Delivery image JPEG2000 file               DELIV-IMG       image/jp2
      Delivery image technical metadata in MIX   DELIV-TECHMD    text/xml         MIX
      SVG for delivery mechanism                 DELIV-OPS       text/xml         SVG
      Viewer HTML                                HTML            text/html        HTML
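The per-page datastream list on slide 16 can be read as a checklist. The sketch below encodes the datastream IDs and MIME types from that slide as a Python mapping and reports which ones a page object lacks. The `missing_datastreams` helper and the idea of representing a page as a plain dictionary are assumptions for illustration; nothing here calls Fedora or any real repository API.

```python
# Datastream IDs and MIME types as listed on the slide (spelling kept as given).
EXPECTED_DATASTREAMS = {
    "DC": "text/xml",
    "MODS": "text/xml",
    "RELS-EXT": "text/xml",
    "PDF": "application/pdf",
    "OCR XML": "text/xml",
    "OCR TEXT": "text/plain",
    "ARCHV-IMG": "image/jpeg",
    "ARCHIV-TECHMD": "text/xml",
    "ARCHV-EXIF": "text/xml",
    "PROC-IMG": "image/jpeg",
    "PROC-TECHMD": "text/xml",
    "DELIV-IMG": "image/jp2",
    "DELIV-TECHMD": "text/xml",
    "DELIV-OPS": "text/xml",
    "HTML": "text/html",
}


def missing_datastreams(page: dict[str, str]) -> list[str]:
    """Return expected IDs that are absent from `page` or carry a different MIME type.

    `page` is a hypothetical in-memory stand-in: datastream ID -> MIME type.
    """
    return [
        ds_id
        for ds_id, mime in EXPECTED_DATASTREAMS.items()
        if page.get(ds_id) != mime
    ]
```

For example, a page dictionary that has every entry except `"PDF"` yields `["PDF"]`, which could flag a page whose OCR step did not finish.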
  17. By the numbers: number of objects
      As of November 2013:
      • Finding aids: 1,114
      • Digitized books: 3,491
      • Digitized book pages: 835,806
      • Image objects: 216,271
      • A few others, including 3D objects and collection objects
      A total of 1,187,414 objects in the repository. Every object has several datastreams (files, descriptive metadata, technical metadata, etc.).
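As a quick check on the figures on slide 17, the four itemized categories can be summed and compared with the stated total; the remainder is the number of objects the slide does not itemize. The variable names are illustrative, and the counts are copied from the slide.

```python
# Object counts as of November 2013, copied from the slide.
itemized = {
    "finding_aids": 1_114,
    "digitized_books": 3_491,
    "digitized_book_pages": 835_806,
    "image_objects": 216_271,
}
stated_total = 1_187_414

itemized_sum = sum(itemized.values())
not_itemized = stated_total - itemized_sum

print(f"Itemized categories: {itemized_sum:,}")
print(f"Objects not itemized on the slide: {not_itemized:,}")
```

Digitized book pages alone account for roughly 70% of the stated total, which reflects the page-level object model described on slide 16.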
  18. By the numbers: storage
      As of Feb 5, 2014: 97.1 TB of content on the repository (including digitized collections queued for ingestion) and the JPEG2000 server. The Library and NUIT purchased 200 TB of storage replicated between the Evanston and Chicago campuses (that is over 400 TB in total).
  19. Digital preservation/persistence
      • Persistent URLs
      • Mirrored storage (as of fall 2014)
      • PREMIS (preservation) metadata
      • Routine health checks for data
      • Geographically distributed storage
      • Dark archives
      • Migration/virtualization services
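"Routine health checks for data" on slide 19 usually means fixity checking: recomputing a checksum for each stored file and comparing it with the value recorded at ingest. The sketch below shows the general technique with Python's standard `hashlib`. The slides do not say which algorithm or tooling Northwestern uses, so SHA-256, the manifest layout (relative file name to hex digest), and both function names are assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fixity(manifest: dict[str, str], root: Path) -> dict[str, str]:
    """Compare recorded checksums with current ones.

    `manifest` maps a file name relative to `root` to the digest recorded at
    ingest (a hypothetical layout). Returns "ok", "changed" or "missing" per file.
    """
    report = {}
    for relative_name, expected in manifest.items():
        target = root / relative_name
        if not target.is_file():
            report[relative_name] = "missing"
        elif sha256_of(target) == expected:
            report[relative_name] = "ok"
        else:
            report[relative_name] = "changed"
    return report
```

Run on a schedule against each storage copy, a report like this shows which replica still holds a good version of a file when another copy has changed or disappeared.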
  20. Distributed storage and dark archives
      • DuraCloud
      • Amazon Glacier
      • Digital Preservation Network (DPN)
  21. Current repository projects
      • Digital Image Library (DIL)
      • Avalon
      • Hydramata
  22. Hydra
      • Northwestern joined in 2011
      • Framework for repository applications using Ruby on Rails
      • Community with 22 partners
  23. Digital Image Library (DIL)
      • 2007: Provost funded the move from Art History to the Library and expansion to other disciplines
      • 115,000 images in Hydra + Fedora
      • Moving all legacy digital collections into DIL and its Hydra counterparts in 2014-2015
  24. Avalon
      IMLS-funded project with Indiana University
      Releases:
      • 0 (July 2012)
      • 0.5 (October 2012)
      • 1.0 (May 2013)
      • 2.0 (October 2013, NU pilot)
      First NU production with R3, expected in the next month (dev/demo)
  25. Scholarly communication and digital curation
      • Options for archiving scholarly materials
      • Authors' rights, copyright help and education, open access support
      • E-science and research data life cycle
      • Digital humanities
      • Library-based publishing
      • Responding to funder requirements
  26. Hydramata (formerly Shared IR)
      Five-institution project to develop a next-generation institutional repository solution in Hydra
  27. Expanding our repository program
      • Massive storage, planning for growth, sustainability
      • Digital preservation services
        o Offsite third copy (DPN, DuraCloud, Glacier)
        o Verification services
      • Research computing
        o Research data lifecycle: how to capture metadata early? What to keep?
        o Automate deposit from Vault?
      • Shared infrastructure and services whenever possible
      • Deeper collaboration with NUIT, Research, central admin, schools
  28. Discussion and questions
      Claire Stewart, Director, Center for Scholarly Communication and Digital Curation; Head, Digital Collections, Library Technology Division, Northwestern University