Dataset Metadata, Tools and Approaches for Access and Preservation


Joan Starr's presentation at ALA Midwinter 2012 to ALCTS Intellectual Access to Preservation Metadata Interest Group

Published in: Technology, Education
  • Thank you for this opportunity to speak with you today about Dataset Metadata. Let me give special thanks to Meghan for asking me to speak. Image credits: MDB 28, davecurlee, sabarishr, rkrichardson, awsheffield, Scutter, Amy the Nurse, Anita & Greg.
  • My library: serving the 10 UC campuses (226,000 students; 134,000 faculty and staff). Working collaboratively: libraries, data centers, museums, archives, faculty and researchers. CDL has historically provided strategic, integrated technical and program services in a broad portfolio, including: groundbreaking licensing agreements; union bibliographic services; data curation & preservation tools; open access publishing services.
  • My group: The UC Curation Center is a creative partnership between the CDL, the ten UC campuses, and peer institutions in the community. A community of shared concern and practice. Provide solutions, services, resources for digital assets. Pool & distribute diverse experience, expertise, & resources.
  • Access: The researchers’ requirements, from ESIP (Earth Science Information Partners), are: to provide fair credit to those responsible (exposure); to aid scientific reproducibility (re-use); to ensure scientific transparency and reasonable accountability (verification); to aid in tracking the impact of the work (citation tracking). Preservation: easy to maintain. The funders’ requirements are for data management and … And the library’s charge is to preserve our institutions’ scholarly assets.
  • How are we going to meet these needs? If we go back to what the domains are doing… From ESIP (Earth Science Information Partners, same link): Author(s): the people or organizations responsible for the intellectual work to develop the data set. The data creators. Release Date: when the particular version of the data set was first made available for use (and potential citation) by others. Title: the formal title of the data set. Version: the precise version of the data used. Careful version tracking is critical to accurate citation. Archive and/or Distributor: the organization distributing or caring for the data, ideally over the long term. Locator/Identifier: this could be a URL but ideally it should be a persistent service, such as a DOI, Handle or ARK, that resolves to the current location of the data in question. Access Date and Time: because data can be dynamic and changeable in ways that are not always reflected in release dates and versions, it is important to indicate when on-line data were accessed. From ICPSR (Inter-University Consortium for Political and Social Research): identifier (such as the Digital Object Identifier, Uniform Resource Name URN, or Handle System).
  • What’s in common: the persistent identifier.
  • DataCite was formed in 2009 by 10 libraries and research centers with a mission: “Helping you find, access, and reuse data.” The number has now grown to 15. In addition there are 3 associate members, including the Korea Institute of Science and Technology Information, so there is a presence in Asia. California Digital Library was one of the founding members. DataCite’s primary methodology for achieving this mission: issuing DOIs (Digital Object Identifiers) for datasets.
  • DOIs are one kind of persistent identifier. But what is an identifier? An identifier is an alphanumeric string assigned to an object, and if that assignment is managed with some metadata and the object is made available over time, the identifier becomes a VERY reliable way of keeping track of that object.
  • Let’s take a look at one. So you can see that with just the identifier and a simple set of metadata, you get: location for VERIFICATION; EXPOSURE & CITATION TRACKING. (This is not an actual DOI, nor an actual study.)
  • And here’s that same DOI some time later. THE STRING NEVER CHANGES. This means it can be cited, tracked and associated with all kinds of metadata. More on that in a minute.
  • EZID is CDL’s application for offering DataCite DOIs as well as other identifiers.
  • If you go to the Home Page, you can use the UI to test EZID. CLICK for HELP TAB.
  • On the Help screen, you have the choice of creating a test ARK or DOI. [CLICK] Click the Create button. ARKs and DOIs. ARKs: flexible; case-sensitive; special features support granularity; can be deleted; inexpensive. DOIs: established brand in publishing; indexed by major A&I citation databases; DataCite policies apply; cannot be deleted; more costly. DOIs should be assigned to objects that are under good long-term management, and where there is an intention to make the object persistently available. DOIs must be registered exclusively with metadata that is available to public view. Can DOIs and ARKs work together? Yes. For example, researchers may choose to use ARKs for unpublished materials associated with an object that has been registered with a DOI. These two identifier schemes can work well together, and EZID offers them both, along with policy support consistent across both schemes.
  • EZID creates the identifier and sends you to the MANAGE tab where you have the opportunity to enter a target URL and other metadata. UI support: Dublin Kernel; Dublin Core; DataCite Kernel. API support: all of the above; full DataCite Schema.
  • When you hover over a field, it opens up for editing as you can see here. This is where you would go if you wanted to maintain the metadata or the target URL.
  • Now let’s take a look at the full DataCite Metadata set. MDS = Metadata Search. Remember, we said that any solution needed to: ALLOW the submitter to accurately describe the object so that anyone accessing knows what they are getting. ALLOW the submitter to give credit where credit is due. PROVIDE support for *data management*: format, version, rights.
  • The 5 Required properties = basic citation elements. Identifier = DOI now; in future may open up. Creator is repeatable; Name can have a nameIdentifier and schema, as in an ORCID iD. Title is repeatable and has an optional type attribute for AlternativeTitle, Subtitle, and TranslatedTitle. Publisher: “In the case of datasets, "publish" is understood to mean making the data available to the community of researchers.” IDENTIFIER = VERIFICATION. ALLOW the submitter to give credit where credit is due: EXPOSURE & CITATION TRACKING. If the Year field isn’t quite what you want, use the repeatable DATE field in the optional set.
  • Optional elements. Includes support for data management: FORMAT, VERSION, RIGHTS. In addition, some of these offer expansion of the required set. Contributor expands Creator. Date expands PublicationYear. But the distinctive strength comes from Number 12. [CLICK]
  • Optional elements. The Family Jewels = RelatedIdentifier, relationType: IsCitedBy & Cites; IsSupplementTo & IsSupplementedBy; IsContinuedBy & Continues; IsNewVersionOf & IsPreviousVersionOf; IsPartOf & HasPart; IsDocumentedBy & Documents; IsCompiledBy & Compiles; IsVariantFormOf & IsOriginalFormOf. COMING IN 2.3: IsIdenticalTo.
  • “Data Management Planning” is a popular phrase these days. As metadata and preservation librarians, I think you’ll find many of the concepts to be very familiar, if wearing new clothes. Let me tell you a little story about the life of a dataset. You start out in a laptop (or a tablet) travelling around, or under a desk. Maybe then you get emailed across the country or around the world. Years can go by as you get updated and altered. Eventually, maybe you have a day in the sun: your researcher decides to write up the results and cite you. Then, perhaps, it’s back to a server in the dark. Or, you move from server to server. Will you be forgotten?
  • That’s why we at California Digital Library have taken a life cycle approach. CDL has developed an array of tools and services ranging from the first stage of developing a data management plan, through to formal publication. We encourage researchers to assign an ID early in the process: to provide a credible data management plan for funders; to make the later stages easier; and to manage situations where changes might occur during the course of the research (a researcher changes institutions or a research team changes the location of their data, for example).
  • What difference does this make? + Keep track of datasets early in the life cycle when you’re not sure where you’re keeping things. + Get common & stable references for distributed research teams. + Citations in published papers keep working even if the data moves. + Part of data organization plans mandated by funders. Photo credits: in field: by Dave Rogers; black board: ©All rights reserved by University of California; stars: ©All rights reserved by University of California; table: by David Mellis.
  • Dublin Core application profile available for the DataCite Metadata Schema; we’ll keep it up to date and in sync. From the DCMI: “A DCAP is designed to promote interoperability within the constraints of the Dublin Core model and to encourage harmonization of usage and convergence on "emerging semantics" around its edges.” Content Service: exposes our metadata stored in the DataCite Metadata Store (MDS) using multiple formats. Alpha version: the service can be accessed at http://data.datacite.org. EZID: UI redesign; activity reporting; browse & search; enhanced persistence support; automated link checking in support of our new Tombstone pages (a web page returned for a resource no longer found at its target location of record; the tombstone may provide “last known” metadata, including the original owner). Exposure for metadata, with evidence that citations will increase (Heather Piwowar’s work): Thomson-Reuters (Web of Knowledge); Elsevier (Scopus); OAI? RSS?; Google Scholar.
  • Library as a service center: consulting, EZID, DMP, DCXL, IR. Information: pointing people to standards, tools. Helping make connections.
  • The next steps for you as individuals are to get more information and try things for yourselves.
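
The identifier example in the notes above (the same DOI string before and after the data moves) can be sketched in Python. This is a minimal illustration, not how EZID or DataCite implement identifiers: the `IdentifierRecord` class, its `update` method, and the example.org target URLs are assumptions made for this sketch. The DOI and metadata values are the illustrative (not real) ones shown in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class IdentifierRecord:
    """Hypothetical model: a fixed identifier string plus an editable target and metadata."""
    string: str                                   # assigned once; citations depend on it
    target: str                                   # current location of the object
    metadata: dict = field(default_factory=dict)  # who, what, when, etc.

    def update(self, target=None, **metadata):
        """Change the target URL and/or metadata; the identifier string is not touched."""
        if target is not None:
            self.target = target
        self.metadata.update(metadata)


record = IdentifierRecord(
    string="doi:10.9999/FK40K2GTV",                      # not an actual DOI
    target="https://example.org/bologna/catfish-data",   # placeholder URL (assumption)
    metadata={
        "creator": "Dr. Felix Kottor",
        "title": "Data for chromosomal study of catfish (Ictalurus punctatus)",
        "publisher": "University of Bologna",
        "date": "8/31/2011",
    },
)

cited_as = record.string  # what a paper would cite

# Some time later the dataset moves; only the target and metadata are maintained.
record.update(
    target="https://example.org/dryad/catfish-data",     # placeholder URL (assumption)
    publisher="Dryad Data Repository",
    date="10/01/2011",
)

print(record.string == cited_as)     # the cited string still identifies the dataset
print(record.metadata["publisher"])
```

The design point is the separation of concerns: whoever manages the identifier edits the target and metadata over time, while anything that cites the string needs no change.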

    1. Dataset Metadata: Tools and Approaches for Access and Preservation. Joan Starr, California Digital Library, January 2012. @joan_starr
    2. Dataset Metadata Tools & Approaches: Introduction; Requirements; DataCite, EZID & Identifiers; DataCite Metadata; Next steps. (Image by Brain farts (Joschua))
    3. Requirements for dataset description • Access • Preservation. (Image by barryegan (Vitor Leite))
    4. How? • Key identifying elements • Emerging recommendations • Variation among the domains
    5. How? • Key identifying elements • Emerging recommendations • Variation among the domains • In common: Persistent identifier
    6. DataCite: German National Library of Economics (ZBW); German National Library of Science and Technology (TIB); German National Library of Medicine (ZB MED); GESIS - Leibniz Institute for the Social Sciences, Germany; Australian National Data Service (ANDS); ETH Zurich, Switzerland; Canada Institute for Scientific and Technical Information (CISTI); Technical Information Center of Denmark; Institute for Scientific & Technical Information (INIST-CNRS), France; TU Delft Library, The Netherlands; The Swedish National Data Service (SNDS); The British Library, UK; California Digital Library (CDL), USA; Office of Scientific & Technical Information (OSTI), USA; Purdue University Library
    7. What is an identifier? What you see: alphanumeric string (never changes). Associated with: location of object (such as a URL). Optional: who, what, when, etc. (i.e. metadata). (Image by Joelk75)
    8. Identifier example. string: doi:10.9999/FK40K2GTV; html version: …; creator: Dr. Felix Kottor; title: Data for chromosomal study of catfish (Ictalurus punctatus); publisher: University of Bologna; date: 8/31/2011
    9. Identifier example. string: doi:10.9999/FK40K2GTV; html version: …; creator: Dr. Felix Kottor; title: Data for chromosomal study of catfish (Ictalurus punctatus); publisher: Dryad Data Repository; date: 10/01/2011
    10. EZID: long-term identifiers made easy. Take control of the management and distribution of your research, share and get credit for it, and build your reputation through its collection and documentation. Primary functions: 1. Create persistent identifiers 2. Manage identifiers over time 3. Manage associated metadata over time
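    11. (Image-only slide: EZID home page)
    12. (Image-only slide: EZID Help screen)
    13. (Image-only slide: EZID Manage tab)
    14. (Image-only slide: editing a field in EZID)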
    15. DataCite Metadata V. 2.2 • Small required set = citation elements • Optional descriptive set: – extendable lists – can refer to other standards, schemes – domain-neutral – rich ability to describe relationships to other digital objects • Metadata Search (MDS) is full-text indexed
    16. DataCite Metadata V. 2.2, Required properties: 1. Identifier (with type attribute) 2. Creator (with name identifier attributes) 3. Title (with optional type attribute) 4. Publisher 5. PublicationYear
    17. DataCite Metadata V. 2.2, Optional properties: 6. Subject (with schema attribute) 7. Contributor (with type & name identifier attributes) 8. Date (with type attribute) 9. Language 10. ResourceType (with description attribute) 11. AlternateIdentifier (with type attribute) 12. RelatedIdentifier (with type & relation type attributes) 13. Size 14. Format 15. Version 16. Rights 17. Description (with type attribute)
    18. DataCite Metadata V. 2.2, Optional properties: 6. Subject (with schema attribute) 7. Contributor (with type & name identifier attributes) 8. Date (with type attribute) 9. Language 10. ResourceType (with description attribute) 11. AlternateIdentifier (with type attribute) 12. RelatedIdentifier (with type & relation type attributes) 13. Size 14. Format 15. Version 16. Rights 17. Description (with type attribute)
    19. Data Management Planning. (Image by NASA Goddard Photo and Video)
    20. A life cycle approach: CDL Curation and Publishing Services. Create, edit, share, and save data management plans. Open source add-in for Microsoft Excel as a data collection tool. Create and manage persistent identifiers. Curation repository: store, manage, and share research data. Open access scholarly publishing services: papers, journals, books, seminars & more. An infrastructure to publish and get credit. Data Publication for sharing research data.
    21. Identifiers and data management: Track your results. Organize your data. Get more citations. Meet funder requirements.
    22. Next Steps. DataCite: • Dublin Core application profile • Content Service • Metadata v. 2.3. EZID: • UI redesign • Automated link checking • Exposure for metadata. (Image by Nicola Whitaker)
    23. Next Steps. Library: • service center • information center • your ideas here. (Image by Nicola Whitaker)
    24. For more information. EZID. EZID application: website: Home: Metadata Schema: 2.2/index.html DataCite Metadata Search:
    25. Questions? (Image by Horia Varlan) Starr: @joan_starr
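
The five required properties and the paired relation types listed in the slides above lend themselves to a small check. What follows is a hedged sketch in Python, assuming a record is held as a plain dictionary keyed by property name. The names `missing_required`, `format_citation`, and `INVERSE_RELATION`, and the citation layout itself, are hypothetical choices for illustration; they are not part of any DataCite or EZID API.

```python
# The five required properties of DataCite Metadata v. 2.2, as listed in the talk.
REQUIRED_PROPERTIES = ("Identifier", "Creator", "Title", "Publisher", "PublicationYear")

# Relation type pairs named in the talk; each type is the inverse of its partner.
RELATION_PAIRS = [
    ("IsCitedBy", "Cites"),
    ("IsSupplementTo", "IsSupplementedBy"),
    ("IsContinuedBy", "Continues"),
    ("IsNewVersionOf", "IsPreviousVersionOf"),
    ("IsPartOf", "HasPart"),
    ("IsDocumentedBy", "Documents"),
    ("IsCompiledBy", "Compiles"),
    ("IsVariantFormOf", "IsOriginalFormOf"),
]

INVERSE_RELATION = {}
for forward, backward in RELATION_PAIRS:
    INVERSE_RELATION[forward] = backward
    INVERSE_RELATION[backward] = forward


def missing_required(record):
    """Return the required property names that are absent or empty in `record`."""
    return [name for name in REQUIRED_PROPERTIES if not record.get(name)]


def format_citation(record):
    """Build a citation string from the required properties (layout is an assumption)."""
    creators = "; ".join(record["Creator"])  # Creator is repeatable, so it is a list here
    return (
        f"{creators} ({record['PublicationYear']}): {record['Title']}. "
        f"{record['Publisher']}. {record['Identifier']}"
    )


# Illustrative values from the talk's example (not an actual DOI or study).
record = {
    "Identifier": "doi:10.9999/FK40K2GTV",
    "Creator": ["Dr. Felix Kottor"],
    "Title": "Data for chromosomal study of catfish (Ictalurus punctatus)",
    "Publisher": "University of Bologna",
    "PublicationYear": "2011",
}

problems = missing_required(record)
citation = format_citation(record) if not problems else None
print(problems)
print(citation)
print(INVERSE_RELATION["IsSupplementTo"])
```

Storing each relation pair once and deriving the inverse lookup from it keeps the two directions from drifting apart when a new pair is added.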