You may ask: Making Data Management Easy for Whom? The researcher? Ourselves? With any luck, the answer is a little bit of both…
Serving the 10 UC campuses: 226,000 students; 134,000 faculty and staff. Working collaboratively with libraries, data centers, museums, archives, faculty and researchers. CDL has historically provided strategic, integrated technical and program services in a broad portfolio, including: groundbreaking licensing agreements; union bibliographic services; data curation & preservation tools; open access publishing services.
The UC Curation Center is a creative partnership between the CDL, the ten UC campuses, and peer institutions in the community. An evolving community of shared concern and practice; bringing together diverse experience, expertise, and resources; providing robust curation solutions. "Digital curation refers to the actions people take to maintain and add value to digital information over its lifecycle, including the processes used when creating digital content." "Digital preservation focuses on the 'series of managed activities necessary to ensure continued access to digital materials for as long as necessary.'" Source: New Roles for New Times: Digital Curation for Preservation, by Tyler Walters and Katherine Skinner, published by the Association of Research Libraries in March 2011, http://www.arl.org/news/pr/nrnt-dcreport17mar11.shtml
This is an enmeshed partnership between scholars and curators. Scholars can share because curators collect, scholars can discover because curators publish, and so forth. Curators giving scholars what they need, where/when/how they need it. Curators enhancing the discoverability of the scholarly output. AND protecting the institution’s intellectual investment, securing its intellectual legacy. SO, we want to be involved at every point in the scholarly life cycle that we can.
Or maybe we *should* drop in at every point. “Build it and they will come” doesn’t work any more.
Preservation: Curation micro-services and Merritt. To bake data curation into data creation: DCXL (Data Curation XL Plug-In). To enhance data sharing, collecting and gathering: the WAS service. To facilitate data publication, we are exploring this new Data Paper model. And behind many of these steps, the EZID service. We are engaged in a number of network-level collaborations and partnerships, but these two have particular relevance to the data management space, with DataONE focused on distributed data networks and DataCite on persistent identifiers. And lastly, we have partnered with UVA and many others to develop and launch an easy-to-use Data Management Plan Tool. So let’s take a brief look at all of these things, and while I’m there, I’ll dive more deeply into EZID, which is the service I manage.
Building Blocks: COMPONENTS focused on CURATION functions/solutions. Make each small and self-contained, so they are collectively easier to develop, maintain, and deploy. Keep the level of investment in any given service small, so they are easier to replace when they have outlived their usefulness. Limit the scope of each service, but combine them to form complex solutions. Provide public interfaces for all service interactions. Interoperability and utility to end users.
I’ve given you the link here to our Curation wiki, a place where we’re sharing specs, presentations, community contributions, some code, and so forth.
Persistent identifiers: distinguishing one object from all others.
Persistent storage: manage the secure and persistent storage of content (files).
Fixity: verify the bit-level integrity of files.
Replication: provide a globally fault-tolerant storage environment.
Characterization: determine the significant properties of stored files and the implications of those properties.
Discovery: provide an interface that supports finding the stored objects as appropriate (according to security decisions).
Transformation: transcode digital object representations from existing forms to newly required forms.
Notification: a means to inform user communities that digital content is being managed and is available for use.
Annotation: a means for user-driven enrichment of managed object description.
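To make the Fixity item concrete, here is a minimal sketch of a bit-level integrity check in Python. It is an illustration only and not taken from the UC3 micro-services code: the function names (`sha256_digest`, `verify_fixity`, `demo`) and the choice of SHA-256 are assumptions made for this example.

```python
import hashlib
import os
import tempfile


def sha256_digest(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path, recorded_digest):
    """True when the file's current digest matches the one recorded at deposit."""
    return sha256_digest(path) == recorded_digest


def demo():
    """Record a digest, check it, overwrite one byte, and check again."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"sample dataset contents\n")
        path = tmp.name
    try:
        recorded = sha256_digest(path)          # digest stored at deposit time
        before = verify_fixity(path, recorded)  # untouched file
        with open(path, "r+b") as handle:
            handle.write(b"X")                  # simulate silent corruption
        after = verify_fixity(path, recorded)   # corrupted file
    finally:
        os.remove(path)
    return before, after
```

Calling `demo()` checks the file once before and once after a single byte is overwritten; a real service would run the same comparison on a schedule against digests stored alongside each object.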
Another way to go is to choose the off-the-shelf plan. You can see here where we are with the specific services, and they are baked into Merritt, which is a curation and preservation service you can engage. This also gives you a sense of where we are on the build path with the micro-services list you just saw. If you sign up for Merritt, then you get a hosted service with the flexibility of the micro-services built in. Another “off-the-shelf” solution based on micro-services is Archivematica (Artefactual Systems in collaboration with the UNESCO Memory of the World's Subcommittee on Technology, the City of Vancouver Archives, the University of British Columbia Library, the Rockefeller Archive Center and a number of other collaborators): http://archivematica.org/wiki/index.php
Dark archive for important digital assets: pro-active preservation, but no expectation of direct end user access; UCTV.
Bright archive with direct discovery and access: provide preservation and end user access; part of grant-funded research data sustainability plan.
Preservation back-end for existing or new discovery and content management systems: preservation only; content discovery/delivery provided by well-known external systems (eScholarship, Media Hub, Open Context); investigating integration with Islandora/Drupal and Alfresco.
Integration with distributed data grids: part of sustainable and secure cyber-infrastructure (Chronopolis, DataONE member node).
Nobody thinks of Excel as a preservation-ready tool, but everybody uses it! The KEY IDEA in keeping this EASY here is: let them use the tools they are used to using. (Get out of the way of that elephant!) The Gordon & Betty Moore Foundation and Microsoft Research are funding this. Our part is requirements gathering; MS will do development. Open source plug-in.
Some ideas to better publish, share, and archive. Best ideas to be selected by project partners.
Idea here is to capture ephemeral data for future study. These figures include 2007 – 2008 pilot activity. The very high number of dark archives reflects: caution about making content public; very active new users whose content is still embargoed for rights considerations.
USE CASE: GULF OIL SPILL. “Web resources” covers entire web sites, sections of websites pertaining to the spill, and individual resources such as patent information for blowout preventers, etc. 117 of these sites were also included in the 2005 Hurricane Katrina Web archive. That archive is not yet publicly available; we hope to provide access concurrent with the oil spill archive. 400+ of these sites were selected by Louisiana State University subject experts.
The KEY IDEA here is: Put something unfamiliar (a dataset) in a familiar wrapper (a citable paper). Funded by the Gordon and Betty Moore Foundation.
Preparing the cover sheet and an assemblage of references provides some real exposure to internet search engines. We see this as a kind of graduated investment opportunity for any institution that wants to get involved in this space.
Moving from shared storage and archiving to actual publication and then ultimately to peer review, each incremental investment yields additional returns.
Let’s take a look at one of the ways that EZID can make a big difference in a researcher’s scholarly life. When that researcher goes to publish data-based results, very often there is no way to associate the data with the paper. Data is a second-class citizen. Here’s an example, for an article that is actually ABOUT DATA! You can’t find it anywhere. Nowhere on the Science Direct page is there a link to the datasets… So if you decide to go on a hunt on your own, it gets very wild and woolly. Suppose you try Google. Faculty home pages, many more PDFs of scholarly articles, but no datasets. Wikipedia pages. No datasets. If you email the faculty, you might get lucky and get a reply, and they’ll send you to their FTP site. But even then, are you sure you are looking at the right version of the data?
So here is the difference that EZID makes. Here is an example of a data set deposited with our first customer, Dryad. Dryad is an international repository of data underlying peer-reviewed articles in the basic and applied biosciences.
A service to make and manage actionable identifiers. Ids for anything: digital, physical, living, abstract. Can manage identifiers under different schemes: ARKs, DOIs, and more to come. User and programming interfaces. Partnering for replication.
EZID is a tool for extending library services… by meeting the needs of researchers. And it does so in a number of ways. Let’s step through them.
The researcher does data-intensive research and writing. No permanent home for the data, but wants to reference it now. Register + get a clickable reference. Update later when published.
Virtual team. Aging infrastructure. Permanent home is not so permanent. Register now, distribute the identifier to the entire team. When the infrastructure is replaced, update the target location. All links still work beautifully.
A widely published researcher who uses EZID doesn’t need to worry about career moves. She can move her data with her and as long as she updates her target URLs, all her citations will continue to work. Her worldwide colleagues will have no interruption of access to the data via the original identifiers.
A research team seeking a research grant from the National Science Foundation must submit a formal data management plan. One component: naming and organizing their files. EZID helps meet this need.
In addition, EZID has advantages for libraries and data centers and data publishers.
Science and technology librarians + digital humanities librarians: how to serve our faculty members whose work revolves around datasets? Treat datasets as another form of scholarly content, and work to curate and preserve these assets. Assisting libraries as they extend their historic collection-building activities to datasets, allowing them to preserve their institution’s research investments.
Often our partners in this work are campus data centers. New demands for storage. New workflows. New tools. EZID: automated workflows, standards-based processes, and a community of support.
You can try out EZID for yourself by going to this URL and clicking on the HELP tab. That will let you make test DOIs and test ARKs without an account. Contact me if you’d like a demonstration for your institution.
3 CLICKS. DataONE is an NSF-funded virtual data center for biology, ecology, and environmental sciences. DataONE has the overarching goal of building a new culture of data access and data sharing. This is an international collaboration working with scientists and librarians, as well as other stakeholders. Engaging the scientist in the data curation process; supporting the full data life cycle; encouraging data stewardship and sharing; promoting best practices; engaging citizens; developing domain-agnostic solutions.
I just wanted to show you a picture of the scope and depth of this work.
Benefits to funders and publishers.
CLICK
Benefits to researchers.
Collaborative effort: Digital Curation Centre (UK); DataONE; Smithsonian Institution; University of California Curation Center, California Digital Library; University of California Los Angeles Library; University of California Merced Library; University of California San Diego Libraries; University of Illinois, Urbana-Champaign; University of Virginia Libraries; ICPSR.
Heather Piwowar’s research: "Sharing Detailed Research Data Is Associated with Increased Citation Rate," PLoS ONE, http://www.plosone.org/article/info:doi/10.1371/journal.pone.0000308 “Publicly available data was significantly associated with a 69% increase in citations.” Examined the citation history of 85 cancer microarray clinical trial publications.
"Sharing of Data Leads to Progress on Alzheimer’s," by Gina Kolata, New York Times, published August 12, 2010; p. A1 in print on August 13, 2010.
Start a new plan (by funder). Edit a plan in process. Look at existing plans.
CLICK. Generic form layout: the current data model should make this possible for any question. Re: #3, Institutional info: your Contracts & Grants officers may have some advice or specific requests for material to include. Looking toward an early 4th quarter public release.
Specs are available now; code will come out incrementally over the summer and beyond. DCXL is a 1-year project. Data Management Plan tool: early 4th quarter. Merritt and EZID are available right now.
With all this ease, WE think that DATA CURATION LEADS TO GOOD OUTCOMES FOR RESEARCHERS. They’ll be motivated to routinely deposit, in stable public storage, data products (datasets and processing information) and the data papers that reward them with authorship credit. Data journals will spring up around disciplines, even if disciplinary data papers are scattered across geographically distributed repositories. Data products will be re-used, annotated, corrected, and precisely linked to from traditional publications. Data products will enter the scientific record instead of being lost.
While you are thinking about these matters, you can take advantage of a growing community of practice around curation issues, curation services, and the micro-services themselves. Ring leaders are our UC3 partner at UC San Diego, Declan Fleming, and our good friend Mike Giarlo at Penn State. Virtual options include: listserv, Google group, Twitter, Facebook, Chat, and wiki. Visit the website for details.
Transcript of "(Toward) Making Data Management Easy"
Making Data Management Easy<br />Toward…<br />ALA Annual 2011<br />Joan Starr<br />University of California Curation Center<br />California Digital Library<br />
HOT TOPICS DISCUSSION GROUP<br />STS Programs are sponsored by:<br />
Introductions<br />The research life cycle<br />Some examples from CDL/UC3 (curation micro-services and more!)<br />…with a focus on EZID<br />Discussion/Questions<br />
Research has a life cycle.<br />DISCOVER<br />SHARE<br />COLLECT<br />PUBLISH<br />CREATE<br />PRESERVE<br />GATHER<br />ACCESS<br />Judy Baxter, http://www.flickr.com/photos/judybaxter/9825836/<br />
Librarians can jump in at any point.<br />Ims.photo: http://www.flickr.com/photos/bigblackbox/4805557065/<br />
TOOLS & SERVICES<br />To enable data preservation<br />To bake data curation into data creation<br />To enhance data sharing, collecting and gathering<br />To facilitate data publication<br />PARTNERSHIPS<br />To promote data discovery and access<br />To help researchers comply with new requirements<br />What this means for Data Management<br />DISCOVER<br />SHARE<br />COLLECT<br />PUBLISH<br />CREATE<br />PRESERVE<br />GATHER<br />ACCESS<br />Judy Baxter, http://www.flickr.com/photos/judybaxter/9825836/<br />
Merritt repository<br />Dark archive for preservation masters<br />Integration with distributed data grids<br />Bright archive for preservation and end-user access<br />Preservation back-end for existing discovery services<br />
DCXL: Data Curation Excel<br />WHY EXCEL? <br />CON: poor feature set and scalability compared to DBMSs<br />PRO: ubiquity, familiarity, ease-of-use<br />Cody Simms: http://www.flickr.com/photos/jcodysimms/246023851<br />
What an Excel add-in could do<br />Permit standardized column headers<br />Versioning and standard date formats<br />Auto-archiving and persistent id assignment<br />“Speed bumps” to discourage macros et al.<br />NOTE: This will be released as OPEN SOURCE!<br />
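As a rough illustration of the first two ideas on this slide (standardized column headers and standard date formats), here is a small Python sketch. It is not DCXL code: the function names, the snake_case header convention, the ISO 8601 output, and the list of accepted input formats are all assumptions chosen for the example. In particular, reading `05/06/2010` as month/day/year is an assumption a real tool would need to confirm with the user.

```python
import re
from datetime import datetime

# Input formats this sketch accepts (an assumption, not a DCXL requirement).
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def standardize_header(header):
    """Lower-case a column header and join its words with underscores."""
    cleaned = re.sub(r"[^0-9a-z]+", "_", header.strip().lower())
    return cleaned.strip("_")


def standardize_date(text):
    """Return the date in ISO 8601 form (YYYY-MM-DD), trying each known format."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def standardize_row(row):
    """Apply both rules to a {header: value} row whose 'date' column holds a date."""
    result = {standardize_header(key): value for key, value in row.items()}
    if "date" in result:
        result["date"] = standardize_date(result["date"])
    return result
```

For example, `standardize_row({" Date ": "05/05/2010", "Site Name": "Gulf"})` yields a row keyed by `date` and `site_name` with the date rewritten as `2010-05-05`.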
Web Archiving Service snapshot<br />Stats: Since January 2007<br />21 organizations using service<br />4,681 sites captured<br />44,468 captures run<br />26.4 terabytes<br />100 + archives under construction<br />35 archives published<br />In partnership with the IIPC consortium of national libraries.<br />
Archiving the Gulf oil spill<br />Improving support for collaboration<br />
The Data Paper Model<br />Minimal: a cover sheet and a set of links to archived artifacts<br />Best practice: citation elements (including persistent identifier)<br />Kevin Steele: http://www.flickr.com/photos/kevinsteele/20631162 /<br />
The Data Paper Model<br />Cover sheet with citation data<br />title, date, authors, abstract, and persistent identifier (DOI, ARK, etc.)<br />
A data journal<br />Incorporation of elements to enrich discovery, re-use, and archiving<br />Discipline specific<br />Peer reviewed<br />The Data Paper Model<br />
Create a persistent identifier: DOI or ARK<br />Add object location<br />Add metadata<br />Update object location<br />Update object metadata<br />
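The five steps on this slide can be sketched as a toy, in-memory registry in Python. This is an illustration of the idea only, not the EZID API: the class `IdentifierRegistry`, its method names, the example identifier, and the example.org URLs are all hypothetical. The point it shows is that the identifier stays fixed while its target location and metadata change behind it.

```python
class IdentifierRegistry:
    """Toy registry mapping a persistent identifier to a location and metadata."""

    def __init__(self):
        self._records = {}

    def create(self, identifier, location, metadata=None):
        """Register a new identifier with its object location and metadata."""
        if identifier in self._records:
            raise ValueError(f"{identifier} is already registered")
        self._records[identifier] = {
            "location": location,
            "metadata": dict(metadata or {}),
        }

    def update_location(self, identifier, location):
        """Point an existing identifier at a new object location."""
        self._records[identifier]["location"] = location

    def update_metadata(self, identifier, **fields):
        """Add or overwrite metadata fields for an existing identifier."""
        self._records[identifier]["metadata"].update(fields)

    def resolve(self, identifier):
        """Return the current location for an identifier."""
        return self._records[identifier]["location"]

    def describe(self, identifier):
        """Return a copy of the metadata for an identifier."""
        return dict(self._records[identifier]["metadata"])


# Placeholder identifier and URLs, for illustration only.
registry = IdentifierRegistry()
dataset_id = "ark:/99999/fk4example"
registry.create(dataset_id, "http://old-server.example.org/data.csv",
                {"title": "Field measurements"})
registry.update_location(dataset_id, "http://new-home.example.org/data.csv")
registry.update_metadata(dataset_id, status="published")
```

Anyone holding `dataset_id` resolves to the new location after the move, which is the behavior the career-move and aging-infrastructure scenarios rely on.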
Meeting researcher needs<br />Early in the research life cycle<br />Working on a federated team<br />Making a career move<br />Meeting funder requirements<br />
Early in the research life cycle<br />+<br />Data-intensive research<br />Writing up the results<br />Where’s the data?<br />What if I move it?<br />With EZID: all your references, citations, links, etc. will be stable!<br />by Dave Rogers http://www.flickr.com/photos/dave-rogers/2815036285/<br />
Meeting funder requirements<br />+<br /><ul><li>Grantor requirements for data management plan</li></ul>Data-intensive research<br />What do we put here?<br />How do we track the data?<br />With EZID: track your data from capture to publication and beyond.<br />By David Mellis, http://www.flickr.com/photos/mellis/7675610/<br />
Working with Libraries & Data Centers<br />Libraries<br />Extending an historic role<br />Data Centers & Publishers<br />Providing workflows and standards <br />
EZID: Meeting library needs<br />+<br /><ul><li>New kinds of scholarly output
Working at the Network Level<br />enable new science and knowledge creation through universal access to data about life on earth and the environment that sustains it<br />1. Build on existing cyberinfrastructure<br />2. Create new cyberinfrastructure<br />3. Create new communities of practice<br />
DataONE’s new infrastructure<br />https://www.dataone.org/<br />
Data Management Plan Tool<br />https://bitbucket.org/dmptool/main/wiki/Home<br />Collaborative effort<br />Funders’ data management/sharing policies<br />Journals’ (Nature, Science, and PLoS) data sharing requirements. <br />Researchers<br />Distributing research results leads to increased citations (Piwowar et al., 2007)<br />A shared, common data set may help researchers collaborate and accelerate discoveries (NY Times, 2010). <br />Better organization, leading to easier preservation<br />Cultivate quality and efficiency<br /> Thanks to Jeffrey Loo, Chemical Informatics Librarian, UCB <br />
Home screen: once logged in, the user is presented with a view of their work and options<br />1.<br />2.<br />3.<br />University of California <br />Libraries<br />
1.<br />2.<br />3.<br />University of California <br />Libraries<br />
Data Management Plan Tool</li></ul>From CDL/UC3<br />DISCOVER<br />SHARE<br />COLLECT<br />PUBLISH<br />CREATE<br />PRESERVE<br />GATHER<br />ACCESS<br />Judy Baxter, http://www.flickr.com/photos/judybaxter/9825836/<br />
Summary: Just how easy is it for you?<br />Build your own (Curation micro-services)<br />specs<br />code<br />Open source tools<br />DCXL <br />Data Management Plan tool<br />Off the shelf options<br />Merritt<br />EZID<br />WAS<br />liquidnight: http://www.flickr.com/photos/liquidnight/3101493460/<br />
& how easy is it for researchers?<br />For organizing their data<br />DCXL, EZID<br />To keep their data safe<br />Merritt, Micro-services<br />To help them get grants <br />Data Management Plan tool<br />To help get their work noticed<br />EZID, Data Papers<br />To help them find other data<br />EZID, Data Papers<br />TOOLS!<br />liquidnight: http://www.flickr.com/photos/liquidnight/3101493460/<br />
CURATECamp: unconference events connecting practitioners & technologists interested in digital curation and data management.<br />Next f2f event: August 15 – 16, 2011<br />Stanford University, Palo Alto, California<br />http://www.regonline.com/Register/Checkin.aspx?EventID=953543 <br />http://groups.google.com/group/digital-curation<br />http://curatecamp.org/<br />But wait, there’s more: Community!<br />courtesy of Oxnard Public Library, http://content.cdlib.org/ark:/13030/kt6c600758<br />
and more information!<br />UC Curation Center<br />http://www.cdlib.org/uc3<br />firstname.lastname@example.org<br />EZID<br />http://n2t.net/ezid/ <br />Micro-services<br />http://www.cdlib.org/uc3/curation<br />http://groups.google.com/group/digital-curation <br />UC3/CDL<br />Stephen Abrams David Loy<br />Patricia Cruse Lisa Colvin<br />Scott Fisher Mark Reyes <br />Erik Hetzner Tracy Seneca <br />Greg Janée Joan Starr<br />John Kunze Marisa Strong<br />Margaret Low Perry Willett<br />
…and here’s how to find me.<br />Joan Starr<br />email@example.com<br />@joan_starr<br />http://www.slideshare.net/joanstarr<br />
Image credits for Opening Slide<br />Optical Shop, Adam Reeder, http://www.flickr.com/photos/adamreeder/5379477315<br />Streetcar, Adam Reeder, http://www.flickr.com/photos/adamreeder/5379459127<br />Jazz Gumbo, Adam Reeder, http://www.flickr.com/photos/adamreeder/5380083448<br />Boat, Adam Reeder, http://www.flickr.com/photos/adamreeder/5379429155<br />Garden, ncpttmedia, http://www.flickr.com/photos/ncpttmedia/4008605841<br />Shutters, OZinOH, http://www.flickr.com/photos/75905404@N00/379444291<br />