NISO Webinar: Understanding Critical Elements of E-books: Part 2: Heritage Lost? Ensuring the Preservation of E-books
http://www.niso.org/news/events/2012/nisowebinars/ebooks_preservation/ Understanding Critical Elements of E-books: Acquiring, Sharing, and Preserving. Part 2: Heritage Lost? Ensuring the Preservation of E-books. May 23, 2012. Speakers: Jeremy York and Sheila Morrissey
HATHITRUST: A Shared Digital Repository. We’re Preserving the Past, What About the Present? NISO Webinar: Ensuring the Preservation of E-Books, May 23, 2012. Jeremy York, Project Librarian, HathiTrust
Outline • About HathiTrust • Preservation and Access Strategies • What about the present?
Partnership • Arizona State University • Baylor University • Boston College • Boston University • California Digital Library • Columbia University • Cornell University • Dartmouth College • Duke University • Emory University • Florida State University • Getty Research Institute • Harvard University Library • Indiana University • Johns Hopkins University • Lafayette College • Library of Congress • Massachusetts Institute of Technology • McGill University • Michigan State University • New York Public Library • New York University • North Carolina Central University • North Carolina State University • Northwestern University • The Ohio State University • The Pennsylvania State University • Princeton University • Purdue University • Stanford University • Texas A&M University • Universidad Complutense de Madrid • University of Arizona • University of Calgary • University of California (Berkeley, Davis, Irvine, Los Angeles, Merced, Riverside, San Diego, San Francisco, Santa Barbara, Santa Cruz) • The University of Chicago • University of Connecticut • University of Florida • University of Illinois • University of Illinois at Chicago • The University of Iowa • University of Maryland • University of Miami • University of Michigan • University of Minnesota • University of Missouri • University of Nebraska-Lincoln • The University of North Carolina at Chapel Hill • University of Notre Dame • University of Pennsylvania • University of Pittsburgh • University of Utah • University of Virginia • University of Washington • University of Wisconsin-Madison • Utah State University • Washington University • Yale University Library
The Name • The meaning behind the name – Hathi (hah-tee): Hindi for elephant – Big, strong – Never forgets, wise – Secure – Trustworthy
HathiTrust governance (diagram) • Strategic Advisory Board: guidance on policy, planning • 12-member Board of Governors • Executive Committee • Executive Director • Budget/finances, decision-making
Digital Repository • Launched 2008 • Initial focus on digitized book and journal content – 10,309,742 total volumes – 5,464,306 book titles – 271,119 serial titles – 3,001,018 public domain (~29%) • “Light” archive
Collections and Collaboration • Comprehensive collection - Preservation…with Access • Shared strategies – Copyright – Collection management, development – Preservation – Discovery / Use – Bibliographic Indeterminacy – Efficient user services • Public Good
Collections: Dates, Languages (chart) • Languages: English 48% • German 9% • French 7% • Spanish 5% • Chinese 4% • Russian 4% • Italian 3% • Japanese 3% • Arabic 2% • Latin 1% • Remaining languages 14%
To contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge
• Rights holders open access • Publishers deposit master files • Publish directly into the repository
jPach: Journal Publishing in HathiTrust • http://lib.umich.edu/jpach • Package of tools to enable publication of open access journals • Includes modifications to existing code base; new components to facilitate ingest, display, and discoverability of born-digital open-access journal literature • Allows integration with popular journal publishing tools such as Open Journal Systems (OJS)
Key Elements • Openness – Content must be licensed for perpetual open access • Additional formats – Fixity of bitstream guaranteed where preservation specifications cannot be developed • Allow download of content not rendered in the interface • Support articles and contextual information (lists of editors, submission requirements) • Support for revisions to content
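The "fixity of bitstream" guarantee above can be illustrated with a checksum comparison. This is only a minimal sketch: the slides do not say which digest algorithm or workflow HathiTrust uses, so SHA-256 and the function names `sha256_of_stream` and `fixity_ok` are assumptions made for illustration.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size=65536):
    """Return the SHA-256 hex digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(stream, recorded_digest):
    """True if the stream's current digest matches the one recorded at deposit."""
    return sha256_of_stream(stream) == recorded_digest


# Hypothetical deposit: record a digest once, then re-check it later.
original = b"supplementary dataset bytes"
recorded = sha256_of_stream(io.BytesIO(original))

print(fixity_ok(io.BytesIO(original), recorded))         # True: bits unchanged
print(fixity_ok(io.BytesIO(original + b"x"), recorded))  # False: bits altered
```

A check like this says nothing about whether the format can still be rendered; it only confirms that the deposited bits have not changed.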
File Format Considerations in the Preservation of e-Books. Sheila Morrissey, Senior Research Developer, Portico. NISO Webinar: Heritage Lost? Ensuring the Preservation of E-books, May 23, 2012
Portico - Third Party Preservation. Portico is among the largest community-supported digital archives in the world. Working with libraries, publishers, and funders, we preserve e-journals, e-books, and other electronic scholarly content to ensure researchers and students will have access to it in the future.
Portico - Participating Content. Over 2,000 societies and associations have committed content to Portico through 147 publisher agreements. Committed Content » E-journal titles: 13,675 » E-book titles: 129,781 » D-collections: 46
Portico - Audit and Certification In 2010, Portico became the first digital preservation service to be independently audited by the Center for Research Libraries (CRL) and subsequently certified as a trusted, reliable digital preservation solution that serves the needs of the library community.
Portico - History (timeline) • 2002: Launch of Electronic Archiving Initiative by JSTOR • 2005: Portico launched • 2006: Portico ingests initial e-journal content into the archive • 2007: Portico makes first trigger title available • 2009: Portico ingests initial e-book content into the archive • 2009: Portico fulfills first PCA claim • 2009: CRL audit of Portico begins • 2010: Portico ingests initial d-collection content
Digital Preservation. Digital preservation is the series of management policies and activities necessary to ensure the enduring usability, authenticity, discoverability, and accessibility of content over the very long term. The key goals of digital preservation include: • Usability: the intellectual content of the item must remain usable via the delivery mechanism of current technology • Authenticity: the provenance of the content must be proven and the content an authentic replica of the original • Discoverability: the content must have logical bibliographic metadata so that it can be found by end users through time • Accessibility: the content must be available for use to the appropriate community
Preservation: Legal aspects Legal right to preserve content » Not always the same as access rights » Specified in contracts » Includes embedded or supplemental files, such as images » DRM removed
Usability: Rendition and Delivery Content is rendered to support current delivery platform, i.e. web browser. … rendered & delivered … Rendition engine can be modified to meet new technology requirements.
Portico – Another Look at the History (timeline) • 2002: Launch of Electronic Archiving Initiative by JSTOR • 2005: Portico launched • 2006: Portico ingests initial e-journal content into the archive • 2007: Portico makes first trigger title available; iPhone, Kindle 1 • 2009: Portico ingests initial e-book content; Kindle 2, Nook • 2010: Portico ingests initial d-collection content; iPad 1, Nook Color • 2011: iPad 2, Kindle Fire, Nook Simple Touch, ePub3 • 2012: iPad 3
E-Book Packages in Portico Submissions: Flat directory » ONIX XML file with bibliographic metadata » One PDF file per book » Front cover image JPG files
E-Book Packages in Portico Submissions: TAR file (multiple books per file) » XML manifest file » One directory for each book, containing: proprietary XML file (3 possible versions of XML) with bibliographic metadata; subdirectory with files for front matter “chapters” (XML, PDF, OCR of PDF); subdirectory with files for regular “chapters” (XML, PDF, OCR of PDF); subdirectory with files for back matter “chapters” (XML, PDF, OCR of PDF); subdirectory with TIFF file for cover image of book
E-Book Packages in Portico Submissions: ZIP file (sometimes one book per file, sometimes multiple books) » Sometimes flat (all books at one level) » Sometimes one directory for each book » Sometimes cover images (JPG or TIFF) » Sometimes one PDF for entire book in addition to PDF for each chapter » Sometimes a manifest
E-book formats in Portico Submissions PDF » One file per chapter » One file per book TIFF » One file per page JPEG » One file per page XML » For bibliographic metadata » Proprietary » ONIX variants » NLM variants
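Because submissions vary this much in layout and format, a first processing step might simply inventory what a package contains. The sketch below is not Portico's tooling: the function `inventory_zip`, the sample file names, and the rule "any top-level directory counts as a per-book directory" are all assumptions made for illustration.

```python
import io
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def inventory_zip(zip_bytes):
    """Summarize a ZIP submission: flat or per-directory, and a count per extension."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Skip explicit directory entries; keep only file members.
        names = [n for n in archive.namelist() if not n.endswith("/")]
    paths = [PurePosixPath(n) for n in names]
    extensions = Counter(p.suffix.lower() or "(none)" for p in paths)
    # Assumption: a top-level directory corresponds to one book.
    top_dirs = sorted({p.parts[0] for p in paths if len(p.parts) > 1})
    return {"flat": not top_dirs, "book_dirs": top_dirs, "extensions": dict(extensions)}


def build_sample_package():
    """Build a small hypothetical two-book submission in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("book1/metadata.xml", "<book/>")
        archive.writestr("book1/chapter1.pdf", b"%PDF-")
        archive.writestr("book1/cover.jpg", b"")
        archive.writestr("book2/metadata.xml", "<book/>")
        archive.writestr("book2/book.pdf", b"%PDF-")
    return buffer.getvalue()


report = inventory_zip(build_sample_package())
print(report["flat"])       # False
print(report["book_dirs"])  # ['book1', 'book2']
print(report["extensions"])
```

An inventory like this only describes the package; deciding which XML dialect (ONIX, NLM, or proprietary) a metadata file uses would need a separate inspection of the file contents.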
Looking ahead: EPUB 3. EPUB 3 (http://idpf.org/epub/30) » “EPUB defines a means of representing, packaging and encoding structured and semantically enhanced Web content--including HTML5, CSS, SVG, images, and other resources--for distribution in a single-file format.”
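The "single-file format" is a ZIP-based container: the EPUB container specification requires a first, uncompressed entry named `mimetype` containing `application/epub+zip`, plus a `META-INF/container.xml` file. The following is a rough sketch of a container sanity check, not a validator; the function name `looks_like_epub` and the sample data are hypothetical.

```python
import io
import zipfile

EPUB_MIMETYPE = b"application/epub+zip"


def looks_like_epub(data):
    """Rough container check: first entry is an uncompressed 'mimetype' with the
    EPUB media type, and META-INF/container.xml is present."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            entries = archive.infolist()
            if not entries or entries[0].filename != "mimetype":
                return False
            if entries[0].compress_type != zipfile.ZIP_STORED:
                return False
            if archive.read("mimetype") != EPUB_MIMETYPE:
                return False
            return "META-INF/container.xml" in archive.namelist()
    except zipfile.BadZipFile:
        return False


def build_zip(members):
    """Build an in-memory ZIP (stored, not compressed) from (name, bytes) pairs."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
        for name, payload in members:
            archive.writestr(name, payload)
    return buffer.getvalue()


epub_like = build_zip([
    ("mimetype", EPUB_MIMETYPE),
    ("META-INF/container.xml", b"<container/>"),
])
plain_zip = build_zip([("book.pdf", b"%PDF-")])

print(looks_like_epub(epub_like))  # True
print(looks_like_epub(plain_zip))  # False
```

Passing this check says nothing about whether the content documents inside conform to EPUB 3; a full validator would be needed for that.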
Looking ahead: EPUB 3. EPUB 3 » Web standards for key component technologies » Free and open specification » Must work in at least some appliance outside publisher’s own workflow
EPUB3 Formats “Profiles” of standard formats for authoring content » XHTML5, SVG 1.1, CSS 2.1, CSS 3 Constraints (extensions to HTML5, constraints on SVG) Specs a “moving target” Conforming readers must support rendition of certain formats » Image, audio, video Defined fallbacks Globalization, Encoding, Fonts
Complications: The New “Browser Wars” Amazon » Announces it is replacing MOBI with K8 iBooks » Different mimetype » Proprietary extension of CSS Media Queries » Proprietary XML namespace » Etc.
Complications: "More What You’d Call ‘Guidelines’ Than Actual Rules” Pirates of the Caribbean: The Curse of the Black Pearl. The Walt Disney Company (2003)
Questions or Comments? Sheila Morrissey sheila.firstname.lastname@example.org @sheilaMorr www.portico.org