(Toward) Making Data Management Easy

Data Management Presentation at ALA Annual to ACRL STS Hot Topics mtg

  • You may ask: Making Data Management Easy for Whom? The researcher? Ourselves? With any luck, the answer is a little bit of both…
  • Serving the 10 UC campuses: 226,000 students; 134,000 faculty and staff. Working collaboratively with libraries, data centers, museums, archives, faculty and researchers. CDL has historically provided strategic, integrated technical and program services in a broad portfolio, including: groundbreaking licensing agreements; union bibliographic services; data curation & preservation tools; open access publishing services.
  • The UC Curation Center is a creative partnership between the CDL, the ten UC campuses, and peer institutions in the community. An evolving community of shared concern and practice; bringing together diverse experience, expertise, and resources; providing robust curation solutions. "Digital curation refers to the actions people take to maintain and add value to digital information over its lifecycle, including the processes used when creating digital content." "Digital preservation focuses on the 'series of managed activities necessary to ensure continued access to digital materials for as long as necessary.'" Source: New Roles for New Times: Digital Curation for Preservation, by Tyler Walters and Katherine Skinner, published by the Association of Research Libraries in March 2011, http://www.arl.org/news/pr/nrnt-dcreport17mar11.shtml
  • This is an enmeshed partnership between scholars and curators. Scholars can share because curators collect, scholars can discover because curators publish, and so forth. Curators giving scholars what they need, where/when/how they need it. Curators enhancing the discoverability of the scholarly output. AND protecting the institution’s intellectual investment, securing its intellectual legacy. SO, we want to be involved at every point in the scholarly life cycle that we can.
  • Or maybe we *should* drop in at every point. “Build it and they will come” doesn’t work any more.
  • Preservation: Curation microservices and Merritt. To bake data curation into data creation: DCXL (Data Curation XL Plug-In). To enhance data sharing, collecting and gathering: the WAS service. To facilitate data publication, we are exploring this new Data Paper model. And behind many of these steps, the EZID service. We are engaged in a number of network-level collaborations and partnerships, but these two have particular relevance to the data management space, with DataONE focused on distributed data networks and DataCite on persistent identifiers. And lastly, we have partnered with UVA and many others to develop and launch an easy-to-use Data Management Plan Tool. So let’s take a brief look at all of these things, and while I’m there, I’ll dive more deeply into EZID, which is the service I manage.
  • Building Blocks: COMPONENTS focused on CURATION functions/solutions. Make each small and self-contained, so they are collectively easier to develop, maintain, and deploy. Keep the level of investment in any given service small, so they are easier to replace when they have outlived their usefulness. Limit the scope of each service, but combine them to form complex solutions. Provide public interfaces for all service interactions. Interoperability and utility to end users.
  • I’ve given you the link here to our Curation wiki, a place where we’re sharing specs, presentations, community contributions, some code, and so forth.
    Persistent identifiers: distinguishing one object from all others.
    Persistent storage: manage the secure and persistent storage of content (files).
    Fixity: verify the bit-level integrity of files.
    Replication: provide a globally fault-tolerant storage environment.
    Characterization: determining the significant properties of stored files and implications of those properties.
    Discovery: provide an interface that supports finding the stored objects as appropriate (according to security decisions).
    Transformation: transcodes digital object representations from existing forms to newly required forms.
    Notification: a means to inform user communities that digital content is being managed and is available for use.
    Annotation: a means for user-driven enrichment of managed object description.
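To make the fixity idea concrete, here is a minimal Python sketch of a bit-level integrity check. It is an illustration only and not the Merritt or UC3 fixity service: the function names, the choice of SHA-256, and the sample file are all assumptions made for this example.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path: Path, recorded_digest: str) -> bool:
    """True if the file's current digest matches the one recorded at deposit."""
    return sha256_of(path) == recorded_digest


def demo() -> tuple:
    """Record a digest, check it, then change one byte and check again."""
    with tempfile.TemporaryDirectory() as tmp:
        data_file = Path(tmp) / "dataset.csv"  # hypothetical deposited file
        data_file.write_bytes(b"site,temp\nA,12.5\n")
        recorded = sha256_of(data_file)        # digest stored at deposit time
        intact = verify_fixity(data_file, recorded)
        data_file.write_bytes(b"site,temp\nA,12.6\n")  # simulate corruption
        after_change = verify_fixity(data_file, recorded)
    return intact, after_change


if __name__ == "__main__":
    print(demo())
```

The design point is that the digest is recorded once, at deposit, and every later check compares against that stored value rather than against a freshly computed one.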
  • Another way to go is to choose the off-the-shelf plan. You can see here where we are with the specific services, and they are baked into Merritt, which is a curation and preservation service you can engage. This also gives you a sense of where we are on the build path with the micro-services list you just saw. If you sign up for Merritt, then you get a hosted service with the flexibility of the micro-services built in. Another “off-the-shelf” solution based on micro-services is Archivematica (Artefactual Systems in collaboration with the UNESCO Memory of the World's Subcommittee on Technology, the City of Vancouver Archives, the University of British Columbia Library, the Rockefeller Archive Center and a number of other collaborators): http://archivematica.org/wiki/index.php
  • Dark archive for important digital assets: pro-active preservation, but no expectation of direct end user access. UCTV.
    Bright archive with direct discovery and access: provide preservation and end user access. Part of grant-funded research data sustainability plan.
    Preservation back-end for existing or new discovery and content management systems: preservation only; content discovery/delivery provided by well-known external systems (eScholarship, Media Hub, Open Context). Investigating integration with Islandora/Drupal and Alfresco.
    Integration with distributed data grids: part of sustainable and secure cyber-infrastructure (Chronopolis, DataONE member node).
  • Nobody thinks of Excel as a preservation-ready tool, but everybody uses it! The KEY IDEA in keeping this EASY here is: let them use the tools they are used to using. (Get out of the way of that elephant!) Gordon & Betty Moore Foundation + Microsoft Research are funding this. Our part is requirements gathering; MS will do development. Open source plug-in.
  • Some ideas to better publish, share, and archive. Best ideas to be selected by project partners.
  • Idea here is to capture ephemeral data for future study. These figures include 2007 – 2008 pilot activity. Very high number of dark archives reflects: caution about making content public; very active new users whose content is still embargoed for rights considerations.
  • USE CASE: GULF OIL SPILL. “Web resources” means entire web sites, sections of websites pertaining to the spill, and individual resources such as patent information for blowout preventers etc. 117 of these sites were also included in the 2005 Hurricane Katrina Web archive. That archive is not yet publicly available; we hope to provide access concurrent with the oil spill archive. 400+ of these sites were selected by Louisiana State University subject experts.
  • The KEY IDEA here is: Put something unfamiliar (a dataset) in a familiar wrapper (a citable paper). Funded by the Gordon and Betty Moore Foundation.
  • Preparing the cover sheet and an assemblage of references provides some real exposure to internet search engines. We see this as a kind of graduated investment opportunity for any institution that wants to get involved in this space.
  • Moving from shared storage and archiving to actual publication and then ultimately to peer review, each incremental investment yields additional returns.
  • Let’s take a look at one of the ways that EZID can make a big difference in a researcher’s scholarly life. When that researcher goes to publish data-based results, very often, there is no way to associate the data with the paper. Data is a second class citizen. Here’s an example, for an article that is actually ABOUT DATA! You can’t find it anywhere. Nowhere on the Science Direct page is there a link to the datasets… So if you decide to go on a hunt on your own, it gets very wild and wooly. Suppose you try Google. Faculty home pages, many more PDFs of scholarly articles, but no data sets. Wikipedia pages. No datasets. If you email the faculty, you might get lucky and get a reply, and they’ll send you to their FTP site: but even then, are you sure you are looking at the right version of the data?
  • So here is the difference that EZID makes. Here is an example of a data set deposited with our first customer, Dryad. Dryad is an international repository of data underlying peer-reviewed articles in the basic and applied biosciences.
  • A service to make and manage actionable identifiers. Ids for anything: digital, physical, living, abstract. Can manage identifiers under different schemes: ARKs, DOIs, and more to come. User and programming interfaces. Partnering for replication.
  • EZID is a tool for extending library services… by meeting the needs of researchers. And it does so in a number of ways. Let’s step through them.
  • The researcher does data-intensive research and writing. No permanent home for data, but wants to reference it now. Register + get clickable reference. Update later when published.
  • Virtual team. Aging infrastructure. Permanent home is not so permanent. Register now, distribute the identifier to entire team. When infrastructure is replaced, update target location. All links still work beautifully.
  • A widely published researcher who uses EZID doesn’t need to worry about career moves. She can move her data with her, and as long as she updates her target URLs, all her citations will continue to work. Her worldwide colleagues will have no interruption of access to the data via the original identifiers.
  • A research team seeking a research grant from the National Science Foundation must submit a formal data management plan. One component: naming and organizing their files. EZID helps meet this need.
  • In addition, EZID has advantages for libraries and data centers and data publishers.
  • Science and technology librarians + digital humanities librarians: How to serve our faculty members whose work revolves around datasets? Another form of scholarly content, and work to curate and preserve these assets. Assisting libraries as they extend their historic collection-building activities to datasets, allowing them to preserve their institution’s research investments.
  • Often our partners in this work are campus data centers. New demands for storage. New workflows. New tools. EZID: automated workflows, standards-based processes, and a community of support.
  • You can try out EZID for yourself by going to this URL and clicking on the HELP tab. That will let you make test DOIs and test ARKs without an account. Contact me if you’d like a demonstration for your institution.
  • 3 CLICKS. DataONE is an NSF funded, virtual data center for biology, ecology, and environmental sciences. DataONE has the overarching goal of building a new culture of data access and data sharing. This is an international collaboration working with scientists and librarians, as well as other stakeholders. Engaging the scientist in the data curation process; supporting the full data life cycle; encouraging data stewardship and sharing; promoting best practices; engaging citizens; developing domain agnostic solutions.
  • I just wanted to show you a picture of the scope and depth of this work.
  • Benefits to funders and publishers. CLICK. Benefits to researchers.
    Digital Curation Centre (UK); DataONE; Smithsonian Institution; University of California Curation Center, California Digital Library; University of California Los Angeles Library; University of California Merced Library; University of California San Diego Libraries; University of Illinois, Urbana-Champaign; University of Virginia Libraries; ICPSR.
    Heather Piwowar’s research: Sharing Detailed Research Data Is Associated with Increased Citation Rate, PLoS ONE. “Publicly available data was significantly associated with a 69% increase in citations.” Examined citation history of 85 cancer microarray clinical trial publications.
    http://www.plosone.org/article/info:doi/10.1371/journal.pone.0000308
    Sharing of Data Leads to Progress on Alzheimer’s, by Gina Kolata, New York Times, published August 12, 2010; p. A1 in print on August 13, 2010.
  • Start a new plan (by funder). Edit a plan in-process. Look at existing plans.
  • CLICK. Generic form layout: current data model should make this possible for any question. Re: #3 – Institutional info – your Contracts & Grants officers may have some advice or specific requests for material to include. Looking toward an early 4th quarter public release.
  • Specs are available now; code will come out incrementally over the summer and beyond. DCXL is a 1-year project. Data Management Plan tool: early 4th quarter. Merritt and EZID are available right now.
  • With all this ease, WE think that DATA CURATION LEADS TO GOOD OUTCOMES FOR RESEARCHERS. They’ll be motivated routinely to deposit in stable public storage data products (datasets and processing information) and the data papers that reward them with authorship credit. Data journals will spring up around disciplines, even if disciplinary data papers are scattered across geographically distributed repositories. Data products will be re-used, annotated, corrected, and precisely linked to from traditional publications. Data products will enter the scientific record instead of being lost.
  • While you are thinking about these matters, you can take advantage of a growing community of practice around curation issues, curation services, and the microservices themselves. Ring leaders are our UC3 partner at UC San Diego, Declan Fleming, and our good friend Mike Giarlo at Penn State. Virtual options include: listserv, Google group, Twitter, Facebook, Chat, and wiki. Visit the website for details.

Transcript

  • 1. Making Data Management Easy
    Toward…
    ALA Annual 2011
    Joan Starr
    University of California Curation Center
    California Digital Library
  • 2. HOT TOPICS DISCUSSION GROUP
    STS Programs are sponsored by:
  • 3. Introductions
    The research life cycle
    Some examples from CDL/UC3 (curation micro-services and more!)
    …with a focus on EZID
    Discussion/Questions
  • 4. California Digital Library (CDL)
  • 5.
  • 6. Research has a life cycle.
    DISCOVER
    SHARE
    COLLECT
    PUBLISH
    CREATE
    PRESERVE
    GATHER
    ACCESS
    Judy Baxter, http://www.flickr.com/photos/judybaxter/9825836/
  • 7. Librarians can jump in at any point.
    Ims.photo: http://www.flickr.com/photos/bigblackbox/4805557065/
  • 8. TOOLS & SERVICES
    To enable data preservation
    To bake data curation into data creation
    To enhance data sharing, collecting and gathering
    To facilitate data publication
    PARTNERSHIPS
    To promote data discovery and access
    To help researchers comply with new requirements
    What this means for Data Management
    DISCOVER
    SHARE
    COLLECT
    PUBLISH
    CREATE
    PRESERVE
    GATHER
    ACCESS
    Judy Baxter, http://www.flickr.com/photos/judybaxter/9825836/
  • 9. TOOLS & SERVICES
    Micro-services & Merritt
    DCXL
    WAS
    Data Paper model
    EZID
    PARTNERSHIPS
    DataONE & DataCite
    Data Management Plan Tool
    Examples from CDL & UC3
    DISCOVER
    SHARE
    COLLECT
    PUBLISH
    CREATE
    PRESERVE
    GATHER
    ACCESS
    Judy Baxter, http://www.flickr.com/photos/judybaxter/9825836/
  • 10. Curation Micro-services
    Individual
    small & self-contained
    components
    in custom combinations
    can solve complex problems.
    photo by Joan Starr
  • 11. Building blocks
    Persistent identifiers
    Persistent storage
    Fixity
    Replication
    Characterization
    Discovery
    Transformation
    Notification
    Annotation
    https://confluence.ucop.edu/display/Curation/Home
    Windell Oskay: http://www.flickr.com/photos/oskay/265899811
  • 12. Merritt is: Micro-services “Off the Shelf”
    Persistent identifiers
    Persistent storage
    Fixity
    Replication
    Characterization
    Discovery
    Transformation
    Notification
    Annotation
    http://www.cdlib.org/services/uc3/merritt
    EZID
    CAN/Pairtree/Dflat/ReDD
    Version2
    Fixity
    Replication
    JHOVE2
    XTF
  • 13. Merritt repository
    Dark archive for preservation masters
    Integration with distributed data grids
    Bright archive for preservation and end-user access
    Preservation back-end for existing discovery services
  • 14. TOOLS & SERVICES
    • Micro-services & Merritt
    DCXL
    WAS
    Data Paper model
    EZID
    PARTNERSHIPS
    DataONE & DataCite
    Data Management Plan Tool
    From CDL/UC3
    PRESERVE
    Judy Baxter, http://www.flickr.com/photos/judybaxter/9825836/
  • 15. DCXL: Data Curation Excel
    WHY EXCEL?
    CON: poor feature set and scalability compared to DBMSs
    PRO: ubiquity, familiarity, ease-of-use
    Cody Simms: http://www.flickr.com/photos/jcodysimms/246023851
  • 16. What an Excel add-in could do
    Permit standardized column headers
    Versioning and standard date formats
    Auto-archiving and persistent id assignment
    “Speed bumps” to discourage macros et al.
    NOTE: This will be released as OPEN SOURCE!
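As a rough illustration of what "standardized column headers" and "standard date formats" could mean in practice, here is a small Python sketch. It is not DCXL code (the add-in had not been released at the time of this talk); the function names, the snake_case header convention, and the two accepted input date formats are assumptions chosen for the example.

```python
import re
from datetime import datetime

# Input date formats this sketch accepts (an assumption for illustration).
ACCEPTED_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def standardize_header(header: str) -> str:
    """Lower-case a column header and collapse punctuation/spaces to '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", header.strip().lower())
    return cleaned.strip("_")


def standardize_date(value: str) -> str:
    """Return the date as ISO 8601 (YYYY-MM-DD), or raise ValueError."""
    for fmt in ACCEPTED_DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date: {value!r}")


if __name__ == "__main__":
    print(standardize_header("  Sample Date (UTC) "))
    print(standardize_date("6/27/2011"))
```

Rejecting unrecognized dates with an error, rather than guessing, keeps ambiguous values from being silently rewritten before archiving.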
  • 17. TOOLS & SERVICES
    • Micro-services & Merritt
    • DCXL
    WAS
    Data Paper model
    EZID
    PARTNERSHIPS
    DataONE & DataCite
    Data Management Plan Tool
    From CDL/UC3
    CREATE
    PRESERVE
    Judy Baxter, http://www.flickr.com/photos/judybaxter/9825836/
  • 19. Web Archiving Service snapshot
    Stats: Since January 2007
    21 organizations using service
    4,681 sites captured
    44,468 captures run
    26.4 terabytes
    100+ archives under construction
    35 archives published
    In partnership with the IIPC consortium of national libraries.
  • 20. Archiving the Gulf oil spill
    Improving support for collaboration
    946 sites
    8,400+ captures
    1.3 TB
    Began May 5
  • 21. TOOLS & SERVICES
    Data Paper model
    EZID
    PARTNERSHIPS
    DataONE & DataCite
    Data Management Plan Tool
    From CDL/UC3
    SHARE
    COLLECT
    CREATE
    PRESERVE
    GATHER
    Judy Baxter, http://www.flickr.com/photos/judybaxter/9825836/
  • 24. The Data Paper Model
    Minimal: a cover sheet and a set of links to archived artifacts
    Best practice: citation elements (including persistent identifier)
    Kevin Steele: http://www.flickr.com/photos/kevinsteele/20631162/
  • 25. The Data Paper Model
    Cover sheet with citation data
    title, date, authors, abstract, and persistent identifier (DOI, ARK, etc.)
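One way to picture the cover sheet is as a small record holding the citation elements named on the slide, plus the links to archived artifacts. The Python sketch below is an assumption-laden illustration: the class name, field names, citation layout, and the placeholder identifier and URL are invented for this example and are not a CDL specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CoverSheet:
    """Citation elements for a data paper (field names are illustrative)."""
    title: str
    date: str                      # ISO 8601 date string, e.g. "2011-06-27"
    authors: List[str]
    abstract: str
    identifier: str                # persistent identifier (DOI, ARK, etc.)
    artifact_links: List[str] = field(default_factory=list)

    def citation(self) -> str:
        """Build a simple one-line citation (the layout is an assumption)."""
        names = "; ".join(self.authors)
        year = self.date[:4]
        return f"{names} ({year}). {self.title}. {self.identifier}"


if __name__ == "__main__":
    sheet = CoverSheet(
        title="Example stream temperature dataset",
        date="2011-06-27",
        authors=["A. Researcher", "B. Colleague"],
        abstract="Placeholder abstract.",
        identifier="doi:10.5072/FK2EXAMPLE",          # placeholder identifier
        artifact_links=["https://example.org/data/stream-temps.csv"],
    )
    print(sheet.citation())
```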
  • 26. A data journal
    Incorporation of elements to enrich discovery, re-use, and archiving
    Discipline specific
    Peer reviewed
    The Data Paper Model
  • 27. TOOLS & SERVICES
    EZID
    PARTNERSHIPS
    DataONE & DataCite
    Data Management Plan Tool
    From CDL/UC3
    SHARE
    COLLECT
    PUBLISH
    CREATE
    PRESERVE
    GATHER
    Judy Baxter, http://www.flickr.com/photos/judybaxter/9825836/
  • 31. An article about data, but no data
  • 32. And then the hunt for the data…
    FTP site
  • 33. The EZID difference: data linked…
  • 34. …to the scholarly publication
  • 35. Create a persistent identifier: DOI or ARK
    Add object location
    Add metadata
    Update object location
    Update object metadata
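The five operations on this slide can be modeled with a toy in-memory registry. The Python sketch below only illustrates why citations stay stable when the object moves; it is not the EZID API, and the class, method names, identifier string, and URLs are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Record:
    """What the registry stores for one identifier."""
    target: str
    metadata: Dict[str, str] = field(default_factory=dict)


class IdentifierRegistry:
    """Toy stand-in for an identifier service (not the real EZID interface)."""

    def __init__(self) -> None:
        self._records: Dict[str, Record] = {}

    def create(self, identifier: str, target: str) -> None:
        """Create a persistent identifier and add the object location."""
        if identifier in self._records:
            raise ValueError(f"{identifier} already exists")
        self._records[identifier] = Record(target)

    def update_target(self, identifier: str, new_target: str) -> None:
        """Update the object location; the identifier itself never changes."""
        self._records[identifier].target = new_target

    def update_metadata(self, identifier: str, **fields: str) -> None:
        """Add or update descriptive metadata for the object."""
        self._records[identifier].metadata.update(fields)

    def resolve(self, identifier: str) -> str:
        """Return the current location for a cited identifier."""
        return self._records[identifier].target


if __name__ == "__main__":
    registry = IdentifierRegistry()
    ark = "ark:/99999/fk4example"  # placeholder identifier for illustration
    registry.create(ark, "https://old-lab.example.org/data/set1")
    registry.update_metadata(ark, title="Example dataset", creator="A. Researcher")
    registry.update_target(ark, "https://new-campus.example.org/data/set1")
    print(registry.resolve(ark))
```

Because papers cite the identifier rather than the URL, only the registry entry has to change when the data moves.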
  • 36. Meeting researcher needs
    Early in the research life cycle
    Working on a federated team
    Making a career move
    Meeting funder requirements
  • 37. Early in the research life cycle
    +
    Data-intensive research
    Writing up the results
    Where’s the data?
    What if I move it?
    With EZID: all your references, citations, links, etc. will be stable!
    by Dave Rogers http://www.flickr.com/photos/dave-rogers/2815036285/
  • 38. Working on a federated team
    +
    Data-intensive research
    Regional research center
    +
    Aging infrastructure
    Where’s the data?
    We have to move it!
    With EZID: all your references, citations, links, etc. will be stable!
    ©All rights reserved by University of California, http://www.flickr.com/photos/universityofcalifornia/5405812887
  • 39. Making a career move
    +
    • Researcher(s) on the move
    Data-intensive research
    I know where my data is
    and I’m taking it with me!
    With EZID: all your references, citations, links, etc. will be stable!
    ©All rights reserved by University of California,
    http://www.flickr.com/photos/universityofcalifornia/5406308654
  • 40. Meeting funder requirements
    +
    • Grantor requirements for data management plan
    Data-intensive research
    What do we put here?
    How do we track the data?
    With EZID: track your data from capture to publication and beyond.
    By David Mellis, http://www.flickr.com/photos/mellis/7675610/
  • 41. Working with Libraries & Data Centers
    Libraries
    Extending an historic role
    Data Centers & Publishers
    Providing workflows and standards
  • 42. EZID: Meeting library needs
    +
    • New kinds of scholarly output
    • Continued need to build collections
    How do we keep track of all this new stuff?
    With EZID: you can extend your historic activities & preserve your institution’s research investment.
    ©All rights reserved by University of California, http://www.flickr.com/photos/universityofcalifornia/5098256828
  • 44. EZID: Meeting data center needs
    +
    • New demands for storage
    • Changing landscape
    They want what?
    When?
    With EZID: use simple tools, and easy workflows. Work with international standards.
    ©All rights reserved by University of California,
    http://www.flickr.com/photos/universityofcalifornia/5325618610
  • 46. http://n2t.net/ezid/
  • 47. TOOLS & SERVICES
    PARTNERSHIPS
    DataONE & DataCite
    Data Management Plan Tool
    Examples from CDL/UC3
    SHARE
    COLLECT
    PUBLISH
    CREATE
    PRESERVE
    GATHER
    Judy Baxter, http://www.flickr.com/photos/judybaxter/9825836/
  • 52. Working at the Network Level
    enable new science and knowledge creation through universal access to data about life on earth and the environment that sustains it
    1. Build on existing cyberinfrastructure
    2. Create new cyberinfrastructure
    3. Create new communities of practice
  • 53. DataONE’s new infrastructure
    https://www.dataone.org/
  • 54. TOOLS & SERVICES
    PARTNERSHIPS
    • DataONE & DataCite
    Data Management Plan Tool
    From CDL/UC3
    DISCOVER
    SHARE
    COLLECT
    PUBLISH
    CREATE
    PRESERVE
    GATHER
    ACCESS
    Judy Baxter, http://www.flickr.com/photos/judybaxter/9825836/
  • 59. Data Management Plan Tool
    https://bitbucket.org/dmptool/main/wiki/Home
    Collaborative effort
    Funders’ data mgmt/sharing policies
    Journals’ (Nature, Science, and PLoS) data sharing requirements.
    Researchers
    Distributing research results leads to increased citations (Piwowar et al., 2007)
    A shared, common data set may help researchers collaborate and accelerate discoveries (NY Times, 2010).
    Better organization, leading to easier preservation
    Cultivate quality and efficiency
    Thanks to Jeffrey Loo, Chemical Informatics Librarian, UCB
  • 60. Home screen: once logged in, the user is presented with a view of their work and options
    1.
    2.
    3.
    University of California
    Libraries
  • 61. 1.
    2.
    3.
    University of California
    Libraries
  • 62. TOOLS & SERVICES
    PARTNERSHIPS
    • DataONE & DataCite
    • Data Management Plan Tool
    From CDL/UC3
    DISCOVER
    SHARE
    COLLECT
    PUBLISH
    CREATE
    PRESERVE
    GATHER
    ACCESS
    Judy Baxter, http://www.flickr.com/photos/judybaxter/9825836/
  • 68. Summary: Just how easy is it for you?
    Build your own (Curation micro-services)
    specs
    code
    Open source tools
    DCXL
    Data Management Plan tool
    Off the shelf options
    Merritt
    EZID
    WAS
    liquidnight: http://www.flickr.com/photos/liquidnight/3101493460/
  • 69. & how easy is it for researchers?
    For organizing their data
    DCXL , EZID
    To keep their data safe
    Merritt, Micro-services
    To help them get grants
    Data Management Plan tool
    To help get their work noticed
    EZID, Data Papers
    To help them find other data
    EZID, Data Papers
    TOOLS!
    liquidnight: http://www.flickr.com/photos/liquidnight/3101493460/
  • 70. CURATECamp: unconference events connecting practitioners & technologists interested in digital curation and data management.
    Next f2f event: August 15 – 16, 2011
    Stanford University, Palo Alto, California
    http://www.regonline.com/Register/Checkin.aspx?EventID=953543
    http://groups.google.com/group/digital-curation
    http://curatecamp.org/
    But wait, there’s more: Community!
    courtesy of Oxnard Public Library, http://content.cdlib.org/ark:/13030/kt6c600758
  • 71. and more information!
    UC Curation Center
    http://www.cdlib.org/uc3
    uc3@ucop.edu
    EZID
    http://n2t.net/ezid/
    Micro-services
    http://www.cdlib.org/uc3/curation
    http://groups.google.com/group/digital-curation
    UC3/CDL
    Stephen Abrams
    Patricia Cruse
    Scott Fisher
    Erik Hetzner
    Greg Janée
    John Kunze
    Margaret Low
    David Loy
    Lisa Colvin
    Mark Reyes
    Tracy Seneca
    Joan Starr
    Marisa Strong
    Perry Willett
  • 72. …and here’s how to find me.
    Joan Starr
    joan.starr@ucop.edu
    @joan_starr
    http://www.slideshare.net/joanstarr
  • 73. Image credits for Opening Slide
    Optical Shop, Adam Reeder, http://www.flickr.com/photos/adamreeder/5379477315
    Streetcar, Adam Reeder, http://www.flickr.com/photos/adamreeder/5379459127
    Jazz Gumbo, Adam Reeder, http://www.flickr.com/photos/adamreeder/5380083448
    Boat, Adam Reeder, http://www.flickr.com/photos/adamreeder/5379429155
    Garden, ncpttmedia, http://www.flickr.com/photos/ncpttmedia/4008605841
    Shutters, OZinOH, http://www.flickr.com/photos/75905404@N00/379444291