Let me begin by introducing the overall landscape for data management. ->Range of content: increasing number, size, and diversity of content ->Range of producers and consumers: faculty, researchers, libraries, etc. ->Disruptive changes: technology, user expectations, institutional mission, resources ->The new landscape demands a range of partners and solutions. You can look at this as a problem or as an unparalleled opportunity.
Digital curation is not just about technology. ->It is a set of policies and practices focused on maintaining and adding value to digital content for use now and into the indefinite future ->It can be applied to the humanities, social sciences, and sciences ->It encompasses preservation and access --preservation ensures access *over time* --access depends upon preservation *up to a point in time*
The University of California Curation Center (UC3) is the provider of digital curation services centered at CDL. ->Providing high-quality and cost-effective digital curation services ->Developing hosted and locally deployable services ->Creating and fostering partnerships that bring together the expertise and resources of the University of California ->Showcasing campus initiatives ->Supporting a community of experts, researchers, and stakeholders
Let's zero in on the problem statement for datasets in particular. Here we see the first step in many research processes: the collection, analysis, synthesis, and interpretation of DATA.
AND NOW that data has become Information and is published...
And now that it is published, the knowledge is accessible...
...and the publication is traceable...
But the DATA IS LOST!
IN OTHER WORDS, there is a gap between published research and the underlying data. ->Published work is held by libraries ->Datasets are held by data archiving centers ->There is no effective way to link between datasets and articles ->There is no widely used method to cite or identify datasets ->There is no easy way to share or to get credit for data creation
This chart summarizes the differences between the two worlds, if you will.
We are left with a choice, dramatized here. Can we really afford to lose some of the data that is at risk?
To address this challenge, DataCite was formed in 2009 by 10 libraries and research centers.
Before DataCite, publishers and data centers had to establish one-to-one relationships.
DataCite provides a connection, a hub. And now, let me turn things over to John Kunze, who will explain just how that happens.
TRANSITION TO JOHN
This slide animates, with a few options, a vision of how DataONE would work with DataCite. It all begins and ends with the research scientist or data producer shown here at the bottom. A scientist may nurture a dataset for months or years before it gets to an archive, during which time it will need an identifier; therefore [CLICK] their institution may encourage the option of obtaining a “preservation-ready” identifier very early in that period. An identifier string that they obtain from CDL’s EZID (easy-eye-dee) service will be opaque, unique, and well suited for embedding inside a DOI, ARK, or other identifier when the time comes to deposit a copy of the data in an archive. There’s also a substantial convenience to depositors and users in not having to rename the dataset upon deposit. [CLICK] When it comes time to deposit, the scientist will upload the dataset and descriptive metadata to a DataONE Member Node, which corresponds to a DataCite regional data archive. The receiving data archive will have its own policy with regard to identifier assignment. Some archives, such as Dryad, will create and assign identifier strings reflecting institutional naming requirements. Others will exercise the option of using a preservation-ready id that arrives with the dataset, if any, or [CLICK] optionally obtaining and assigning an identifier from EZID. [CLICK] The DataONE Member Node then contacts a DataONE Coordinating Node to signal the presence of a new or updated dataset so that the DataONE metadata catalog can be updated. [CLICK] Then the Member Node will contact a DataCite member, CDL, to request registration of a dataset citation. CDL in turn will create two registrations.
[CLICK] The first is with its own EZID resolver service, which keeps a redundant record, supports ids from any identifier scheme, and provides a “shadow resolution” service for DOIs. The shadow resolution will be available to those who want to publish URLs containing DOIs with extended path parts in order to reference versioned dataset components, but who want to do so without the cost of registering a DOI for every component and with greater resolver function than the DOI/Handle infrastructure currently supports. The second registration [CLICK] that CDL creates is with the DOI infrastructure via the interface running at the TIB (Technische Informationsbibliothek, the German National Library of Science and Technology). Armed with what by now is a completely fleshed-out and ready-to-go standard DataCite citation, [CLICK] CDL returns that citation to the DataONE Member Node. [CLICK] Finally, the Member Node communicates the official citation back to the data producer, who can now update their bibliographies accordingly, notify colleagues, etc.
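To make the “shadow resolution” idea concrete, here is a minimal sketch in Python. It is not how EZID is implemented: the registry contents, the placeholder DOI, the example.org URL, and the function name `shadow_resolve` are all assumptions made for illustration. It assumes that only the base dataset identifier is registered, and that any extended path part naming a version or component is passed through to the registered target.

```python
from typing import Dict, Optional

# Hypothetical registry: only the base dataset identifier is registered.
# The DOI and target URL are placeholders, not real registrations.
REGISTRY: Dict[str, str] = {
    "doi:10.5072/FK2EXAMPLE": "https://archive.example.org/datasets/example",
}


def shadow_resolve(identifier: str,
                   registry: Dict[str, str] = REGISTRY) -> Optional[str]:
    """Resolve an identifier that may carry an unregistered path extension.

    Tries the full identifier first, then trims one trailing path segment
    at a time until a registered identifier is found. Whatever was trimmed
    is appended to the registered target URL.
    """
    candidate = identifier
    while True:
        if candidate in registry:
            extension = identifier[len(candidate):]
            return registry[candidate] + extension
        if "/" not in candidate:
            return None  # nothing registered for any prefix
        candidate = candidate.rsplit("/", 1)[0]


# A versioned component resolves without its own registration.
print(shadow_resolve("doi:10.5072/FK2EXAMPLE/v2/table1.csv"))
```

The design choice sketched here is that one registration covers every component beneath it, which is the cost saving the notes describe. A real resolver would also need rules for how the target archive interprets the passed-through path.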
->by enabling them to find, cite, and get credit for research datasets with confidence ->by providing workflows and standards for data publication ->as they extend their historic collection-building activities to datasets, allowing them to preserve their institution’s research investments ->to enrich their publications with the full story
API targeted rollout to an early adopter V 2.0 UI rollout to our University of California partners first.
Persistent Citation & Identification for Datasets: DataCite and EZID. John Kunze, Associate Director, UC3; Joan Starr, Manager, Strategic & Project Planning, CDL
Second-class citizens in the scholarly record.
Data is difficult to manage after project funding ceases. Who has it? How do I get it? What is its impact? Where is it? Libraries keep it safe. Many libraries and archives have it and will share it. I can monitor its impact. I know how to find it.
[Diagram: DataCite structure. Labels: International DOI Foundation; DataCite; Managing Agent (TIB); Member Institutions; Associate Stakeholder (e.g., library); Data Centres (data center, library, publisher); Data Researcher or Producer. Relationship labels: "Carries", "Works with", "Member".]
[Diagram: DataCite example. Participants: research scientist; DataONE Member Node data archive (e.g., Dryad); DataONE Coordinating Node metadata catalog (e.g., UNM or UCSB); CDL; CDL-hosted EZID id minting service; EZID resolver and registration service; DOI resolver and TIB registration. Arrow labels: get unique id string (shown twice); 2. metadata + URL + id; 3. citation + URL + id; 4. save full citation (opt); 5. URL plus id; 6. full citation; 7. full citation.]