NISO Webinar: Metadata for Preservation: A Digital Object's Best Friend



Over the past decade, as the scholarly community’s reliance on e-content has increased, so too has the development of preservation-related digital repositories. Digital archivists and curators clearly recognized the need for descriptive, administrative, and structural metadata for each digital object in a preservation repository. However, in the early 2000s, most of the published specifications for preservation-related metadata were either implementation specific or broadly theoretical. In 2003, the Online Computer Library Center (OCLC) and Research Libraries Group (RLG) established an international working group called PREMIS (Preservation Metadata: Implementation Strategies) to develop a common core set of metadata elements for digital preservation. The first version of the PREMIS Data Dictionary for Preservation Metadata and its supporting XML schema was issued in 2005. Experience using its specifications in preservation repositories has led to several revisions, with the completion of version 2.0 in 2008. The Data Dictionary is now in version 2.2 (July 2012), and it is widely implemented in preservation repositories throughout the world in multiple domains.

  • The PREMIS in METS toolbox consists of 3 modules to help implementers: describe (generate PREMIS metadata), convert (between PREMIS and METS), validate (ensure quality metadata). Controlled vocabularies increase interoperability and consistency of metadata. The RDF/OWL ontology allows for interconnection among preservation repositories, facilitates querying the metadata, and incorporates preservation-specific controlled vocabularies. The available guidelines result in quality and consistent metadata through the conformance statement and the guidelines for using PREMIS in METS. Community working groups on specific topics include the Ontology working group and the Environment working group (to amend the data model); they are open to the preservation community at large. The PREMIS Implementers group forum allows the preservation community to participate in PREMIS development and submit change requests to the EC. The implementation registry assists new implementers in planning their preservation systems. The tools registry gives implementers tools.
  • PREMIS has had a significant impact on digital preservation activities. Its wide applicability has resulted in cost savings to institutions developing preservation repositories because they have a standard that can be used by the entire preservation community. Ongoing work makes it a dynamic resource; it continues to generate new sources of value to the implementer community.
  • Turn everything off. Make your sidebar completely empty and make sure your PC won’t shut off or down.
  • Who am I????
  • I am Mom to these 4 beautiful children …<click>
  • More pertinently, though, I am the Archive Service Product Manager. I have an MA in Library Science. I have been with JSTOR and Portico forever – I started at JSTOR in 1996. I now focus on preservation at Portico and JSTOR.<click – to standards>
  • Before we begin, I want to share my philosophy on standards. In my opinion, standards do two things really well …<click>
  • They provide a framework for thinking about a topic and making a plan. Enter the wilderness with a map.<click>
  • They are also quite valuable as interchange specifications between organizations, or even groups within a single organization.<click>
  • Fortunately for me, the PREMIS folks seem to agree. PREMIS is about a way to think about preservation metadata, about the elements and units you need to consider. In my talk today, you aren’t going to see any PREMIS XML. You are going to see quite a lot about …<click>
  • The Portico content model and an XML content wrapper that we call PMD, or the Preservation Metadata file. It is a pretty direct reflection of our content model, and we have at least one PMD file for every item we preserve. Many considerations went into the design of the Portico preservation metadata …<click>
  • Another is our definition of preservation, which is on the screen. We spent quite a while developing this definition and it really helps us focus when making preservation decisions.<click> What is Digital Preservation? Digital preservation is the series of management policies and activities necessary to ensure the enduring usability, authenticity, discoverability and accessibility of content over the very long term. The key goals of digital preservation include: usability – the intellectual content of the item must remain usable via the delivery mechanism of current technology; authenticity – the provenance of the content must be proven and the content must be an authentic replica of the original; discoverability – the content must have logical bibliographic metadata so that the content can be found by end users through time; accessibility – the content must be available for use to the appropriate community.
  • Any number of other standards influenced us, including …<click>
  • DIDL is a content model. It is very flexible. We almost used it.<click>
  • Our first preservation metadata file was METS based. We migrated to our new format a couple of years ago.<click>
  • Of course …<click>
  • And, no doubt many others that aren’t on the tip of my tongue at the moment.<click>
  • When we redesigned our preservation metadata file a couple of years back, we also drew pretty extensively on our experience. You’ll see that reflected in some areas as we talk, for example how we deal with events in our metadata file.<click>
  • The PREMIS entities and semantic units can be found embedded in the Portico content model, our metadata elements, and also in a system of registries we implement. Registries are a way for us to track things.<click>
  • A word about identifiers. PREMIS requires unique identifiers on every entity and semantic unit. At Portico we firmly believe in this philosophy, and you’ll see throughout the presentation many of the ways in which we use unique identifiers to link between elements of our content model.<click>
  • We currently preserve a number of disparate things. They have many similarities, but they also have not insignificant differences.<click>
  • One of our goals is to represent all these disparate content types in one content model and with one set of preservation metadata. We need to manage the archive and the preserved content uniformly. To put this another way, if I can’t manage these uniformly, my head may explode. So, one content model …<click>
  • So the Portico content model is pretty heavily informed by DIDL. Containers contain other containers. Our model is limited to six levels. We have content types, such as e-books, e-journals, and digitized newspapers.<click>
  • They contain one or more content sets. A content set is just a way for us to bag content together. For example … For the e-journal content type, our content set is the journal. For the e-book content type, our content set is the publisher. For the digitized newspaper content type, our content set is the collection.<click>
  • Content sets contain one or more Archival Units. These are the units of preservation. For example … For e-journals, the archival unit is the article. For e-books, the archival unit is the book. For digitized newspapers, the archival unit is the issue.<click>
  • Each archival unit may contain one or more content units. We’d use this technique if the publisher sent us an update to the full item.<click>
  • Content units contain one or more functional units. A functional unit is an intellectual entity within the item. For example, the page images of an article are a functional unit. Each figure graphic is a functional unit.<click>
  • Each functional unit can contain one or more storage units (which are essentially files). Say we receive a high res image, low res image, and thumbnail for a single figure graphic. That one figure graphic functional unit would contain three storage units.<click>
  • At any level of the Portico content model, we can apply these metadata, in which some of the PREMIS semantic units may be found.<click>
  • This entire mess of information … content model and metadata are recorded in the file we call PMD or Preservation metadata.<click>
  • It is a thing of beauty. And, I’m not even an XML geek!<click>
  • Our PMD files tightly match the content model. This is a snippet of the XML tree of a PMD file.<click>
  • Archival units …<click>
  • Contain content units …<click>
  • Which contain functional units …
  • Which contain storage units. The higher elements of our content model are encoded in the construction of the archive itself and as metadata attributes and elements within the PMD file.<click>
  • Per PREMIS, objects often have the following information associated with them. This type of information is pretty deeply embedded in the Portico PMD.<click>
  • For example, here is a snippet of information about a storage unit (or file).<click>
  • Among other things, we have an ID for this storage unit.<click>
  • And a preservation level.<click>
  • Deeper into the storage unit, we have additional information, including:<click>
  • The size of the file …<click>
  • A basic format for the file …<click>
  • A format status for the file…<click>
  • I am going to touch very briefly on registries. At Portico, we use registries as a way to consolidate information. In this case, information about formats.<click>
  • Here we have two files (this is an element within the storage unit element).<click>
  • These files each have a very specific format name.<click>
  • That name provides us with significant additional information, found in our format registry, including a description, the authority and maintenance agencies, the default file extension and more. While we do track PREMIS information on our objects, it is found in a number of different places, from embedded in the content model or PMD files to registries.<click>
  • Per PREMIS, the important semantic units for Events are on the screen. Nothing too surprising.<click>
  • That key information for events can be found in three different locations within the Portico preservation metadata file.<click>
  • This is a processing record. Processing records are relatively new features for us. When Portico first started, we designed a very flexible system that would allow us to run different elements of our workflow on different machines. As we ramped up, it became clear that our administrative costs would be lower if we limited the number of machines we managed and that we could get much greater throughput running on a single, powerful machine. Originally, we had put all the information about the machine and tools into each event record. But, with experience under our belt, it became clear that we could streamline our metadata files by consolidating this information into Processing Records.<click>
  • Here is a close-up on a processing record and a set of events that reference this processing record.<click>
  • They are tied together through that unique processing record ID. And this relationship is telling the world that the events within this event set all occurred on the ConPrepLite system in July 2010.<click>
  • Within our PMD file, events are grouped into Event Sets. These are just a set of events that happened at the same time, for the same conceptual purpose, and are associated with a single processing record. Some of the events we track are above. Nothing too exciting.<click>
  • Another change we made was to unify the format of our events. Within Portico all events now contain only elements from the above list of possible elements. These were informed by PREMIS and you’ll see a number of similarities.<click>
  • If you are going to walk away remembering one thing, remember that events (like descriptive and technical metadata) can live on any element within the content model.<click>
  • These are the semantic units for agents. In general, rights holders are primary agents within a repository.<click>
  • In addition, however, are repository systems and people that might make changes to the content.<click>
  • For example, within these three processing records are three agents that touched the content.<click>
  • Per PREMIS, the important information to consider is on the screen.<click>
  • <chuckle><click>
  • At the moment and for Portico, our rights statements are relatively straightforward. All of the Portico agreements, at the moment, are similar, and thus we do not have a need to track a variety of different clauses and commitments. We have set up a system whereby content will not enter the Portico archive until such time as we have a formal agreement in place and that agreement has been preserved in the archive.<click>
  • As with many other PREMIS entities, rights entities are embedded within our PMD file.<click>
  • Every archival unit must reference a specific agreement. That agreement has a unique ID and can be found in the archive.<click>
  • Questions?<Amy: stay on afterward.><don’t click unless you need to address 2CUL or Xref questions>
  • NISO Webinar: Metadata for Preservation: A Digital Object's Best Friend

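The nested Portico content model described in the speaker notes above (archival units contain content units, which contain functional units, which contain storage units, each carrying a unique identifier) can be sketched roughly as follows. This is only an illustrative in-memory model, not the PMD schema: the class names, the `new_id` helper, the prefix-plus-UUID identifier scheme, and the file names are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Iterator, List
import uuid


def new_id(prefix: str) -> str:
    """Return a unique identifier. The prefix-plus-UUID scheme is an assumption."""
    return f"{prefix}-{uuid.uuid4()}"


@dataclass
class StorageUnit:
    """Essentially a file."""
    filename: str
    id: str = field(default_factory=lambda: new_id("su"))


@dataclass
class FunctionalUnit:
    """An intellectual entity within the item, e.g. one figure graphic."""
    label: str
    storage_units: List[StorageUnit] = field(default_factory=list)
    id: str = field(default_factory=lambda: new_id("fu"))


@dataclass
class ContentUnit:
    """One version of the item; a publisher update would add another."""
    functional_units: List[FunctionalUnit] = field(default_factory=list)
    id: str = field(default_factory=lambda: new_id("cu"))


@dataclass
class ArchivalUnit:
    """The unit of preservation, e.g. an article, a book, or a newspaper issue."""
    title: str
    content_units: List[ContentUnit] = field(default_factory=list)
    id: str = field(default_factory=lambda: new_id("au"))


def iter_storage_units(unit: ArchivalUnit) -> Iterator[StorageUnit]:
    """Walk the containment hierarchy down to the files."""
    for content_unit in unit.content_units:
        for functional_unit in content_unit.functional_units:
            yield from functional_unit.storage_units


# One figure graphic delivered as high-res, low-res, and thumbnail renditions
# (hypothetical file names).
figure = FunctionalUnit(
    "figure 1",
    [
        StorageUnit("fig1_highres.tif"),
        StorageUnit("fig1_lowres.jpg"),
        StorageUnit("fig1_thumb.gif"),
    ],
)
article = ArchivalUnit("Example article", [ContentUnit([figure])])
files = list(iter_storage_units(article))
print(len(files), "storage units in", article.id)
```

Because every level has its own identifier, metadata and events could be attached at any level by referring to that identifier, which is the point the notes stress.
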
    1. 1. NISO Webinar: Metadata for Preservation: A Digital Object's Best Friend. February 13, 2013. Speakers: Rebecca Guenther, Amy Kirchhoff
    2. 2. Metadata for Preservation: A Digital Object’s Best Friend. Introduction to Preservation Metadata. Rebecca Squire Guenther, Library of Congress, NDMSO and Consultant, NISO Webinar, Feb. 13, 2013
    3. 3. Digital preservation: imperative and challenge • More and more of scholarly and cultural record exists in digital form; steps must be taken to secure its long-term future • Groups such as Digital Preservation Coalition, NDIIPP and National Digital Stewardship Alliance have made significant progress in raising awareness about digital preservation imperative • Gradual shift in focus from articulating problem to solving it … • Not so much “Why is digital preservation important” anymore; rather, “What must be done to achieve preservation objectives?” • Many practical challenges in implementing reliable, sustainable digital preservation programs • One key challenge: preservation metadata
    4. 4. Metadata and preservation metadata. METADATA: “Structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource”. PRESERVATION METADATA: “Metadata that supports and documents the digital preservation process”
    5. 5. Preservation metadata includes: Provenance: • Who has had custody/ownership of the digital object? Authenticity: • Is the digital object what it purports to be? Preservation Activity: • What has been done to preserve it? Technical Environment: • What is needed to render and use it? Rights Management: • What IPR must be observed? Makes digital objects self-documenting across time. (Diagram labels: Preservation Metadata; Content; 10 years on; 50 years on; Forever!)
    6. 6. Basics of preservation metadata Digital preservation concentrates on well-designed formal systems based on digital library and trusted digital repository concepts Information about what needs to be preserved and how are part of any preservation system Since items aren’t on shelves, metadata is the only mechanism for actually keeping or finding anything 3 concepts are important • Metadata about preservation of digital objects • Preservation of metadata itself to ensure that content and metadata is preserved • Use of metadata in a trusted digital repository
    7. 7. PREMIS Data Dictionary May 2005: Data Dictionary for Preservation Metadata: Final Report of the PREMIS Working Group • Version 2.0 (April 2008) • Version 2.1 (January 2011) • Version 2.2 (July 2012) • Version 3.0 expected 2013 Includes: Data Dictionary Context/assumptions Data model Usage examples Conformance XML schema to support implementation Data Dictionary: • Core set of implementable, broadly applicable preservation metadata semantic units, supported by guidelines and recommendations for management and use
    8. 8. What does PREMIS cover? Administrative metadata that supports the digital preservation process Provides information to help manage a resource for preservation purposes • Technical characteristics • Information about actions on an object • Relationships (structural and derivative) • Structural: indicates how compound objects are put together • Derivative: results of common preservation actions • Rights metadata associated with preservation In OAIS terms: • Metadata as part of SIP, AIP or DIP • Fits into Preservation Description Information (Reference, Context, Provenance, Fixity)
    9. 9. What PREMIS is and is not What PREMIS is: • Common data model for organizing/thinking about preservation metadata • A checklist for core metadata in a repository • Guidance for local implementations • Standard for exchanging information packages between repositories What PREMIS is not: • Out-of-the-box solution: need to instantiate as metadata elements in repository system • All needed metadata: excludes business rules, format-specific technical metadata, descriptive metadata for access, non-core preservation metadata • Lifecycle management of objects outside repository • Rights management: limited to permissions regarding actions taken within repository
    10. 10. PREMIS Data Model Intellectual Entities Rights Statements Objects Agents Events
    11. 11. Intellectual Entities • Set of content that is considered a single intellectual unit for purposes of management and description (e.g., a book, a photograph, a map, a database) • May include other Intellectual Entities (e.g. a website that includes a web page) • **Has one or more digital representations** • Previously not fully described in PREMIS DD, but will be in scope in version 3.0. Examples: Rabbit Run by John Updike (a book); “Maggie at the beach” (a photograph); The Library of Congress Website (a website); The Library of Congress: American Memory Home page (a web page)
    12. 12. Objects • Discrete unit of information in digital form • **Objects are what repository actually preserves** • Three types of Object: • FILE: named and ordered sequence of bytes that is known by an operating system • REPRESENTATION: set of files, including structural metadata, that, taken together, constitute a complete rendering of an Intellectual Entity • BITSTREAM: data within a file with properties relevant for preservation purposes (but needs additional structure or reformatting to be stand-alone file). Examples: chapter1.pdf (a file); chapter1.pdf + chapter2.pdf + chapter3.pdf (representation of a book w/3 chapters); TIFF file containing header and 2 images (2 bitstreams (images), each with own set of properties (semantic units): e.g., identifiers, technical metadata, inhibitors, … ). Intellectual entity will become another level of object
    13. 13. Object Example: book in two versions. Intellectual Entity: Da Vinci Code by Dan Brown. Representation 1 (page image version): File 1: page1.tiff; File 2: page2.tiff; File N: pageN.tiff; File N+1: METS.xml. Representation 2 (ebook version): File 1: book.lit
    14. 14. Events • An action that involves or impacts at least one Object or Agent associated with or known by the preservation repository • Helps document digital provenance. Can track history of Object through the chain of Events that occur during the Object's lifecycle • Determining which Events should be recorded, and at what level of granularity, is up to the repository. Examples: Validation Event: use JHOVE tool to verify that chapter1.pdf is a valid PDF file; Ingest Event: transform an OAIS SIP into an AIP; Migration Event: create a new version of an Object in an up-to-date format
    15. 15. Agents • Person, organization, or software program/system associated with an Event or a Right (permission statement) • Agents are associated only indirectly to Objects through Events or Rights • Not defined in detail in PREMIS DD; not considered core preservation metadata beyond identification. Examples: Martha Anderson (a person); Library of Congress (an organization); Dark Archive in the Sunshine State implementation (a system); JHOVE version 1.0 (a software program)
    16. 16. Rights Statements • An agreement with a rights holder that grants permission for the repository to undertake an action(s) associated with an Object(s) in the repository. • Not a full rights expression language; focuses exclusively on permissions that take the form: • Agent X grants Permission Y to the repository in regard to Object Z. Example: Priscilla Caplan grants FCLA digital repository permission to make three copies of metadata_fundamentals.pdf for preservation purposes.
    17. 17. Technical metadata pertaining to objects: Object identifier; Preservation level; Significant characteristics; Object characteristics (• fixity • format • size • creating application • inhibitors • object characteristics extension); Creating application; Original name; Storage; Environment (• software • hardware); Digital signatures; Relationships; Linking event identifier; Linking permission statement identifier
    18. 18. Semantic units pertaining to Events: provenance and preservation activity. Event identifier; Event type (e.g. capture, creation, validation, migration, fixity check); Event dateTime; Event detail; Event outcome; Event outcome detail; Linking agent identifier; Linking object identifier
    19. 19. Semantic units pertaining to Rights. Rights Statement: Rights Statement Identifier; Rights Basis; Copyright Information; License Information; Statute Information; Other Rights Information. Rights Granted: act; restriction; termOfGrant; rightsGranted. Linking Object Identifier; Linking Agent Identifier; rightsExtension
    20. 20. Semantic units pertaining to Agents Agent Identifier Agent Name Agent Type Agent Note Agent Extension linking Event Identifier Linking Rights Identifier
    21. 21. The State of PREMIS de facto standard for preservation metadata; in some countries mandated for cultural heritage repositories Was recognized by winning the Digital Preservation Award (2005) and was shortlisted for DPC Decennial award for outstanding contribution to digital preservation (2012) PREMIS implementations are appearing in many places, many contexts, many forms Experimentation has led to changes in the data dictionary and schema PREMIS Implementation fairs: attempts to consolidate implementation experiences, issues, best practices,
    22. 22. Key features of PREMIS. Developed through international consensus-making process • Mobilized community to address shared need • Shared solution to a shared need. Implementation neutral • Makes no assumptions about technology • Can be flexibly adapted for use across all sorts of institutions, digital preservation contexts, repository systems • Allows for extensibility. Supported by Maintenance Activity and Editorial Committee, under auspices of US Library of Congress • PREMIS is sustained, maintained, and evolved. Extensive outreach to implementer community • Tutorials, guides, implementation fairs, PIG Forum • “Support system” in place for PREMIS implementers
    23. 23. PREMIS Maintenance Activity Web site: • Permanent Web presence, hosted by Library of Congress • Central destination for PREMIS-related info, announcements, resources • Home of the PREMIS Implementers’ Group (PIG) discussion list PREMIS Editorial Committee: • Set directions/priorities for PREMIS development • Coordinate future revisions of Data Dictionary and XML schema • Promote implementation
    24. 24. Implementation resources Tools: • XML schema • PREMIS-in-METS toolbox • Controlled vocabularies • RDF/OWL ontology for use as Linked Data Guidelines: • PREMIS conformance statement • PREMIS & METS guidelines Community Working groups on special topics Others: • Understanding PREMIS (available in multiple languages) • PIG Forum • Implementation Registry • Tools Registry
    25. 25. Some implementers … DAITSS (Florida): a preservation repository for the use of the libraries of the public universities of Florida. Ex Libris Rosetta: a commercial digital preservation system supporting acquisition, validation, ingest, storage, management, preservation and dissemination of different types of digital objects. National Digital Newspaper Program. Archivematica: comprehensive open-source digital preservation system. National Archives of Sweden, National Archives of Scotland. Carolina Digital Repository: repository for material in electronic formats produced by members of the University of North Carolina at Chapel Hill community. British Library electronic journal archiving project. For more information see: •
    26. 26. Impact De facto international standard for preservation metadata • Part of permanent infrastructure supporting digital preservation • ISO standardization being considered Wide applicability means benefits from PREMIS extend to entire digital preservation community Ongoing work to revise/update Data Dictionary and create new supporting resources • PREMIS is a dynamic resource that continues to generate new sources of value to implementer community Stood the test of time: • Seven years after initial release, is now indispensable part of digital preservation implementations around the world • Not surpassed or replaced by other standard or resource
    27. 27. URLs, etc. PREMIS Maintenance Activity: PREMIS Data Dictionary for Preservation Metadata: Understanding PREMIS: premis.pdf PREMIS Implementation Registry PREMIS Implementers Group list
    28. 28. Metadata for Preservation: A digital object’s best friend. Implementation!
    29. 29. Amy Kirchhoff
    30. 30. Archive Service Product Manager
    31. 31. Standards
    32. 32. Standards: framework for thinking
    33. 33. Standards: framework for thinking; interchange specification
    34. 34. [The PREMIS documentation has an] emphasis on the need to know rather than the need to record or represent in any
    35. 35. ContentType > ContentSet(s) > ArchivalUnit(s) > ContentUnit(s) > FunctionalUnit(s) > StorageUnit(s)
    36. 36. Intellectual Entities Rights Statements Objects Agents Events
    37. 37. Digital preservation is the series of management policies and activities necessary to ensure the enduring usability, authenticity, discoverability and accessibility of
    38. 38. Dublin Core
    39. 39. Dublin Core; DIDL (from MPEG-21)
    40. 40. Dublin Core; DIDL (from MPEG-21); METS
    41. 41. Dublin Core; DIDL (from MPEG-21); METS; OAIS
    42. 42. Dublin Core; DIDL (from MPEG-21); METS; OAIS; …
    43. 43. Dublin Core; DIDL (from MPEG-21); METS; OAIS; …; Experience
    44. 44. 1. Content model 2. Metadata elements 3. Registries
    45. 45. Intellectual Entities Rights Statements Objects Agents Events
    46. 46. Identifiers
    47. 47. Intellectual Entities Rights Statements Objects Agents Events
    48. 48. Books; Journals; Digitized Newspapers; Digitized Documents; Supplied Files; Archive Management Docum
    49. 49. Books; Journals; Digitized Newspapers; Digitized Documents; Supplied Files; Archive Management Docum
    50. 50. ContentType(s)
    51. 51. ContentType(s) > ContentSet(s)
    52. 52. ContentType(s) > ContentSet(s) > ArchivalUnit(s)
    53. 53. ContentType(s) > ContentSet(s) > ArchivalUnit(s) > ContentUnit(s)
    54. 54. ContentType(s) > ContentSet(s) > ArchivalUnit(s) > ContentUnit(s) > FunctionalUnit(s)
    55. 55. ContentType > ContentSet(s) > ArchivalUnit(s) > ContentUnit(s) > FunctionalUnit(s) > StorageUnit(s)
    56. 56. Descriptive Metadata; Technical Metadata; Events Metadata
    57. 57. PMD
    58. 58. PMD: a thing of beauty
    59. 59. Intellectual Entities Rights Statements Objects Agents Events
    60. 60. Semantic Units for Objects: 1.1 objectIdentifier; 1.2 objectCategory; 1.3 preservationLevel; 1.4 significantProperties; 1.5 objectCharacteristics; 1.6 originalName; 1.7 storage; 1.8 environment; 1.9 signatureInformation; 1.10 relationship; 1.11 linkingEventIdentifier; 1.12 linkingIntellectualEntityIdentifier; 1.13 linkingRightsStatementIdentifier
    61. 61. Registries
    62. 62. Intellectual Entities Rights Statements Objects Agents Events
    63. 63. Semantic Units for Events: 2.1 eventIdentifier; 2.2 eventType; 2.3 eventDateTime; 2.4 eventDetail; 2.5 eventOutcomeInformation; 2.6 linkingAgentIdentifier; 2.7 linkingObjectIdentifier
    64. 64. Processing Record; Event Sets; Events
    65. 65. Some Portico Events: Edit Descriptive Metadata; Check Descriptive Metadata; Generate Descriptive Metadata; Ingest Into Archive; Create File; Generate Technical Metadata; Set Preservation Level; Generate Fixity
    66. 66. Portico Event Elements: Timestamp; Rationale; InputList; ArgList; Output; ToolWrapper; Tool Component List; Outcome; OutcomeDetailList
    67. 67. ContentType > ContentSet(s) > ArchivalUnit(s) > ContentUnit(s) > FunctionalUnit(s) > StorageUnit(s)
    68. 68. Intellectual Entities Rights Statements Objects Agents Events
    69. 69. Semantic Units for Agents: 3.1 agentIdentifier; 3.2 agentName; 3.3 agentType; 3.4 agentNote; 3.5 agentExtension; 3.6 linkingEventIdentifier; 3.7 linkingRightsStatementIdentifier
    70. 70. Intellectual Entities Rights Statements Objects Agents Events
    71. 71. Semantic Units for Rights: 4.1 rightsStatement (4.1.1 rightsStatementIdentifier; 4.1.2 rightsBasis; 4.1.3 copyrightInformation; 4.1.4 licenseInformation; 4.1.5 statuteInformation; 4.1.6 otherRightsInformation; 4.1.7 rightsGranted; 4.1.8 linkingObjectIdentifier; 4.1.9 linkingAgentIdentifier); 4.2 rightsExtension
    72. 72. Easy
    73. 73. Easy: For Portico, for the moment
    74. 74. Intellectual Entities Rights Statements Objects Agents Events
    75. 75. THANK YOU. Amy Kirchhoff, amy.kirchhoff@ithaka.org
    76. 76. NISO Webinar: Metadata for Preservation: A Digital Object's Best Friend. Questions? All questions will be posted with presenter answers on the NISO website following the webinar: NISO Webinar • February 13, 2013
    77. 77. THANK YOU. Thank you for joining us today. Please take a moment to fill out the brief online survey. We look forward to hearing from you!
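
The slides describe Events as the way to track an Object's history, with each Event linking to an Object and an Agent by identifier. A minimal sketch of that idea, assuming a simple in-memory list of events, might look like the following. The field names mirror the semantic units listed on the Events slides (eventIdentifier, eventType, eventDateTime, linkingAgentIdentifier, linkingObjectIdentifier), but the `Event` class, the `history` function, and all identifier values are hypothetical and are not part of the PREMIS XML schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Event:
    """A pared-down event record; field names follow PREMIS semantic units."""
    event_identifier: str
    event_type: str
    event_date_time: datetime
    event_outcome: str
    linking_agent_identifier: str
    linking_object_identifier: str


def history(events: List[Event], object_id: str) -> List[Event]:
    """Return the events linked to one object, oldest first."""
    return sorted(
        (e for e in events if e.linking_object_identifier == object_id),
        key=lambda e: e.event_date_time,
    )


# Hypothetical identifiers and timestamps, for illustration only.
events = [
    Event("evt-2", "validation", datetime(2013, 2, 13, 10, 5),
          "success", "agent-jhove", "obj-chapter1"),
    Event("evt-1", "creation", datetime(2013, 2, 13, 10, 0),
          "success", "agent-repository", "obj-chapter1"),
    Event("evt-3", "fixity check", datetime(2013, 2, 13, 10, 10),
          "success", "agent-repository", "obj-chapter2"),
]

for event in history(events, "obj-chapter1"):
    print(event.event_date_time.isoformat(), event.event_type,
          "by", event.linking_agent_identifier)
```

Keeping only identifiers on the event, rather than copies of the object or agent, matches the slides' point that Agents relate to Objects only indirectly, through Events or Rights.
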