10-31-13 “Researcher Perspectives of Data Curation” Presentation Slides


“Hot Topics: The DuraSpace Community Webinar Series,” Series Six: “Research Data in Repositories,” curated by David Minor, Research Data Curation Program, UC San Diego Library. Webinar 3: “Researcher Perspectives of Data Curation”
Presented by: David Minor, Research Data Curation Program, UC San Diego Library, Dick Norris, Professor, Scripps Institution of Oceanography & Rick Wagner, Data Scientist, San Diego Supercomputer Center.

Published in: Technology, Education

  1. 1. Hot Topics Web Seminar Series: Research Data in Repositories The UC San Diego Experience Third Webinar: The Researcher Perspective
  2. 2. Reminder: General Series Info • First webinar: Intro and Framing: UC San Diego decisions and planning • Second Webinar: Deep dive into technology and metadata • Third Webinar: The perspective from researchers, next steps
  3. 3. Reminder: General Series Info Slides and presentations from previous webinars are available for download! http://www.duraspace.org/hot-topics
  4. 4. Your esteemed presenters … First webinar: David Minor – Program Director, Research Data Curation Declan Fleming - Chief Technology Strategist Second webinar: Declan Fleming - Chief Technology Strategist Arwen Hutt - Metadata Librarian Matt Critchlow - Manager of Development and Web Services Third webinar: David Minor – Program Director, Research Data Curation Dick Norris – Professor, Scripps Institution of Oceanography Rick Wagner – Data Scientist at San Diego Supercomputer Center
  5. 5. Today we will … Discuss how researchers have approached curation and data management
  6. 6. Reminder: UCSD Research Data Curation Pilots • The Brain Observatory • NSF OpenTopography Facility • Levantine Archaeology Laboratory • Scripps Institution of Oceanography Geological Collections • The Laboratory for Computational Astrophysics
  7. 7. Reminder: UCSD Research Data Curation Pilots • The Brain Observatory • NSF OpenTopography Facility • Levantine Archaeology Laboratory • Scripps Institution of Oceanography Geological Collections • The Laboratory for Computational Astrophysics
  8. 8. Richard Norris Professor at Scripps Institution of Oceanography
  9. 9. Rick Wagner High Performance Computing Manager at the San Diego Supercomputer Center Ph.D. Candidate within The Laboratory for Computational Astrophysics
  10. 10. SIO Geological Collections • Curator: Dick Norris • Collections Manager: Alexandra Hangsterfer • Part of the International Marine and Lacustrine Geological Collections • With collections at Columbia, Oregon State, Woods Hole, USGS and more
  11. 11. Our Collection: Sediment cores and rocks recovered from the oceans & long-lived lakes Reef sediment-Panama Salton Sea-CA
  12. 12. How we get them… Mostly by Sea (Ship, Cruise, Leg) But also by Land (Country, Locality, Lat/Long)
  13. 13. Collection events Recovering a Gravity Core to collect seafloor sediments Deploying a Dredge to collect seafloor rocks
  14. 14. A collection event is an Object and includes: • Specimen(s) • Latitude/Longitude • Ship name and cruise number • Text descriptions • Thin-sections • Images, field notes, publications • Location in the repository • International Geological Sample Number
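The object described on this slide could be modeled roughly as follows. This is a minimal sketch, not the schema used at UC San Diego: the class name `CollectionEvent`, every field name, and the coordinate checks are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CollectionEvent:
    """One sampling event treated as a single object (hypothetical field names)."""

    igsn: str            # sample identifier (IGSN)
    ship_name: str
    cruise_number: str
    latitude: float      # decimal degrees, north positive (assumed convention)
    longitude: float     # decimal degrees, east positive (assumed convention)
    specimens: List[str] = field(default_factory=list)
    text_description: str = ""
    thin_sections: List[str] = field(default_factory=list)
    attachments: List[str] = field(default_factory=list)  # images, field notes, publications
    repository_location: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject coordinates that cannot be geo-referenced.
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError(f"latitude out of range: {self.latitude}")
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError(f"longitude out of range: {self.longitude}")
```

Keeping the specimen, its location, and its associated materials in one record is what lets a single identifier stand for the whole sampling event.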
  15. 15. The Sediment Core Collection Archive and Working halves of ~7000 cores from the world’s oceans Typically 3-5 sections/core + core photos, chemical data and sampling history The IODP Core collection, Bremen Germany
  16. 16. The Marine Rock collection… • ~4000 dredge sites worldwide • In an 8000 sq ft building • Volcanic rocks, manganese nodules, reef rock
  17. 17. Our data resides with NGDC… • NOAA’s National Geophysical Data Center • And IGSNs with Lamont’s SESAR
  18. 18. NGDC searches on ships, repositories, sampling systems, and locations But no keyword search, automated data input, ways to link associated data, returns on nearest search terms, sampling history, etc….
  19. 19. What the Community Wants • A unified National geo-referenced system • Exploratory search by nearest word and map-based system • Links to associated data types (images, text, data, references…) • All data types linked by IGSNs • Data entry through web forms with publication by curators
  20. 20. What we did with RCI • Identified one type of object – Based on sampling events – Ship-Cruise-Sampling device-Sample number – Geo-referenced – Includes associated materials: text description, images, chemical data, references, records of sampling event, sampling records, storage location • NGDC records imported into UC Library system • Records searchable by any word in a record
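"Searchable by any word in a record" can be illustrated with a small inverted index. This is only a sketch of the general technique, not the search implementation in the UC Library system; the function names and the tokenization rule (lowercased runs of letters and digits) are assumptions.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Split text into lowercased alphanumeric words (assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map every word to the set of record IDs whose text contains it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for record_id, text in records.items():
        for word in tokenize(text):
            index[word].add(record_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the IDs of records that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return matches
```

Because the index is built from the full record text, a ship name, a sampling device, or a word from the free-text description all become valid search terms without separate fields for each.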
  21. 21. What’s next? • NSF-sponsored SESAR (System for Earth Sample Registration) – Created the International GeoSample Number – http://www.geosamples.org/ • NSF-sponsored workshop: – Digital Environment for Sample Curation (June 2013) – http://www.geosamples.org/news/descwebinarmaterials • NSF “EarthCube” initiative
  22. 22. CyberInfrastructure needs (from DESC) • Offline data entry at sea or in the field • DESC should respect data moratoriums (typically 2 years, if collected with NSF grants) • Automated release to public at close of moratorium • Secure login-based data serving for project scientists • Flexible search and access for users to view public archive (view by location name, type, bounding region) and associated data • Flexible sample request submission
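The moratorium and automated-release needs on this slide reduce to a date comparison. The sketch below assumes the moratorium runs for a whole number of years from the collection date and that a February 29 start date releases on March 1; both rules, and the function names, are assumptions rather than DESC or NSF policy.

```python
from datetime import date


def release_date(collected_on: date, moratorium_years: int = 2) -> date:
    """Date on which a record leaves its moratorium (assumed rule: N years after collection)."""
    target_year = collected_on.year + moratorium_years
    try:
        return collected_on.replace(year=target_year)
    except ValueError:
        # Collected on Feb 29 and the target year is not a leap year.
        return date(target_year, 3, 1)


def is_public(collected_on: date, today: date, moratorium_years: int = 2) -> bool:
    """True once the moratorium has closed and the record may be served publicly."""
    return today >= release_date(collected_on, moratorium_years)
```

A scheduled job could call `is_public` for each record and flip its visibility, which is one way to get release without manual curator action.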
  23. 23. More cyberInfrastructure needs • Display stored datasets and images hosted on other servers (as in other repositories) • Connections with Standard Visualization Tools Such as Corelyzer, Correlator, PSICAT, CoreRef, GMT, GeoMapApp • Sampling database should be easily accessible by researchers to submit requests • Automatically updated by repository (personnel) to reflect samples sent to the researchers • Way of entering historical sampling information
  24. 24. These are general issues for Natural History Collections • Most museums have similar issues to us – Geo-Referenced collections – Mix of physical specimens, images, text descriptions, sampling data, and affiliated data files – Many have home-grown databases that are not interoperable with other museums (Image: Fish from the SIO Marine Vertebrates Collection)
  25. 25. Natural History Collections • Need controlled vocabularies but flexibility to search on variants – Since nobody agrees on common vocabularies • Value in cross-referencing to related collections – Such as samples (geology, biology, water) collected on a cruise with ship track, sea floor maps… – Presently working on “Rolling deck to Repository” NSF project
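One way to combine a controlled vocabulary with flexible search on variants is to map each known variant to a preferred term. The sketch below is illustrative only: the table `VARIANT_TO_PREFERRED` and its entries are made up for the example and do not come from any actual collection vocabulary.

```python
from typing import Dict, Optional

# Hypothetical variant table: every key is a spelling a user might type,
# every value is the single preferred term stored in the catalog.
VARIANT_TO_PREFERRED: Dict[str, str] = {
    "manganese nodule": "manganese nodule",
    "manganese nodules": "manganese nodule",
    "mn nodule": "manganese nodule",
    "gravity core": "gravity core",
    "gravity corer": "gravity core",
    "dredge": "dredge",
    "dredge haul": "dredge",
}


def to_preferred(term: str) -> Optional[str]:
    """Return the preferred term for a variant, or None if the variant is unknown."""
    key = " ".join(term.lower().split())  # fold case and collapse whitespace
    return VARIANT_TO_PREFERRED.get(key)
```

Records are indexed under the preferred term only, while searches pass through `to_preferred` first, so users need not agree on a single spelling.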
  26. 26. LCA PI: Mike Norman Current and Past Students: Many
  27. 27. Research group focusing on numerical modeling of complex astrophysical processes: cosmology, galaxy formation, turbulence, radiation hydrodynamics, magneto-hydrodynamics, … Image credit: NASA, IoA, A. Fabian et al.
  28. 28. Our simulations are large, based on the current definition of “large” (we grow with the technology). Typical results are 1-100 TB.
  29. 29. This work is costly in terms of both the computer time and human effort, and we see a benefit to the science community in sharing. (Citations are nice, too.) http://bit.ly/sB30f1 http://bit.ly/IzTVV2 http://bit.ly/IE4iFd http://bit.ly/HFYLQJ
  30. 30. Prior Sharing Efforts Participation in the Virtual Observatory • Standards for simulation metadata, search, and retrieval • An odd fit beside the “pure” astronomy projects and data centers • But, it meant we weren’t starting from scratch in terms of describing our data Started the curation effort very curious about how much of this previous work would translate to library space Also wanted stable platform for data hosting (e.g., not a closet server)
  31. 31. Curation Process Several steps: • Choosing the pilot dataset • Cleaning up simulation cruft • Identifying related publications • Adding historical documents (proposals, reports, etc.) • Organizing various data groups • Simulations are a collection of datasets from various points in time, so we needed a description for each type of digital object in each dataset • Bundle, checksum, and handoff Decided near the end to replicate the metadata record to a second site as a test of its portability. Image credit: By E.gordienko (Own work) [CC-BY-SA-3.0 (http://creativecommons.org/licenses/by-sa/3.0) or GFDL (http://www.gnu.org/copyleft/fdl.html)], via Wikimedia Commons
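The "bundle, checksum, and handoff" step can be sketched as building a manifest of file digests that the receiving side recomputes to verify the transfer. The slides do not say which hash or tooling was used; SHA-256, the function names, and the manifest layout below are assumptions.

```python
import hashlib
from pathlib import Path
from typing import Dict


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-terabyte outputs never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root, POSIX style) to its SHA-256 digest."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
```

After the handoff, running `build_manifest` on the received copy and comparing the two dictionaries shows whether any file was lost or altered in transit.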
  32. 32. http://bit.ly/17yTc1n
  33. 33. Final result: • Datasets from a high-resolution cosmology simulation held at UCSD • Viewable both at UCSD, and via the Online Archive of California • Raw simulation data and various analysis results accessible over HTTP
  34. 34. Some thoughts: • When it comes to metadata formats, libraries are like any other science domain and speak their own language • If you have a highly specialized, domain-specific metadata dialect or language, you may need an additional discovery service • If not, it’s a good starting point • We’re working on repeating this process on our own for another simulation
  35. 35. Next steps at UC San Diego Move from pilot services to a scalable series of processes. Work with additional researchers in same domains. Work with new domains. Broaden lifecycle management mindset on campus.
  36. 36. Questions? Rick Wagner - rpwagner@sdsc.edu Richard Norris - rnorris@ucsd.edu David Minor - dminor@ucsd.edu http://www.duraspace.org/hot-topics