Duraspace Hot Topics Series 6: Metadata and Repository Services


Presented by Declan Fleming, Arwen Hutt, and Matt Critchlow. The second in a three-part webinar series on Research Data Curation at UC San Diego, part of the larger Research Cyberinfrastructure initiative.

Published in: Technology, Education

  1. 1. Hot Topics Web Seminar Series: Research Data in Repositories The UC San Diego Experience Second Webinar: Metadata and Repository Services for Research Data Curation
  2. 2. General Series Intro • First webinar: Intro and Framing: UC San Diego decisions and planning • Second Webinar: Deep dive into technology and metadata • Third Webinar: The perspective from researchers, next steps
  3. 3. Your esteemed presenters … First webinar: David Minor – Program Director, Research Data Curation Declan Fleming - Chief Technology Strategist Second webinar: Declan Fleming - Chief Technology Strategist Arwen Hutt - Metadata Librarian Matt Critchlow - Manager of Development and Web Services Third webinar: Dick Norris – Professor, Scripps Institution of Oceanography Rick Wagner – Data Scientist at San Diego Supercomputer Center
  4. 4. Today we will … • Discuss real-world researcher interaction • Document how metadata and files combine to make digital objects • Describe the DAMS data model and how it supports complex research objects • Detail the technology driving the DAMS • Point to the future
  5. 5. Working with Researchers: Pilots • The Brain Observatory • NSF OpenTopography Facility • Levantine Archaeology Laboratory • Scripps Institution of Oceanography Geological Collections • The Laboratory for Computational Astrophysics
  6. 6. Working with Researchers: Process • Introductory meeting • Metadata point person • Ongoing discussions • One-on-one work Iterative, collaborative, customized, experimental…pilot!
  7. 7. Working with Researchers: Data management • Collocation • Clean up • Identifiers • Metadata
  8. 8. Working with Researchers: What is an object? • What are the boundaries of a discrete set or subset of data? • What is required to make the data intelligible, usable and reusable? • What needs to be preserved? • What do they want to display and/or share? • What do they want to be able to refer to or cite?
  9. 9. Working with Researchers: What is an object? [Diagram labels: Brain or Slice; Site or Artifact; etc.]
  10. 10. Working with Researchers: Takeaways • They are the subject experts • There are a lot of broad-level similarities • But no such thing as one size fits all
  11. 11. We want a new data model… • One that is flexible and accommodates disparate metadata from a variety of sources • While promoting consistency within the data store • One that supports relationships within and between objects • One that is more community-engaged, both sharing our vocabularies and technology and using others' shared vocabularies and technologies • One that supports improved management of objects and metadata
  12. 12. DAMS Data Model Development Process • Five people, in a room, 16 hours a week for 4 months • Worked through existing data, use case scenarios, known data requirements, investigated known ontologies, etc. • Lots and lots and lots of discussion • Utilizes MADS (Metadata Authority Description Schema) • Results = a data dictionary and an OWL ontology • Living document
  13. 13. DAMS Data Model: Flexibility • The data model provides enough flexibility that we can accommodate a wide variety of data within the schema – Vocabularies – Use of “types” or “display labels” to distinguish specific subtypes of a data field – Flexible structures and relationships – Extensible
  14. 14. DAMS Data Model: Consistency • But enough consistency that searching and display rules do not need to be customized for each individual collection of material – Rules can be applied at the level of the broader concept • As well as establishing the organizational structure necessary for maintaining consistency over time – Evaluation and approval of modifications
  15. 15. DAMS Data Model: Relationships • It allows us to create a number of different relationships – Collections and sub-collections – Collections and objects – Objects and components (complex hierarchical objects) – Other related resources internal or external to the DAMS [Figure: complex object example]
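The "objects and components" relationship on this slide can be pictured as a tree. The following is a minimal sketch, not the DAMS implementation: the `Component` class, its fields, and the sample titles and file names are all hypothetical, chosen only to echo the "brain or slice" question from the earlier slide.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Component:
    """One node of a complex object (hypothetical structure, not the DAMS schema)."""
    title: str
    files: List[str] = field(default_factory=list)
    components: List["Component"] = field(default_factory=list)


def walk(node: Component, depth: int = 0) -> Iterator[Tuple[int, Component]]:
    """Yield (depth, component) pairs depth-first, parents before children."""
    yield depth, node
    for child in node.components:
        yield from walk(child, depth + 1)


# Hypothetical sample data: an object with two components, one of them nested.
brain = Component(
    title="Brain specimen",
    components=[
        Component(title="Slice 1", files=["slice-0001.tif"]),
        Component(
            title="Slice 2",
            files=["slice-0002.tif"],
            components=[Component(title="Region of interest", files=["roi.tif"])],
        ),
    ],
)

# Flatten the hierarchy into an outline of (depth, title) pairs.
outline = [(depth, c.title) for depth, c in walk(brain)]
```

A recursive structure like this lets display or indexing code treat any depth of nesting uniformly, which is one plausible reading of "complex hierarchical objects".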
  16. 16. DAMS Data Model: Vocabularies • Allow management of local & community vocabularies – Vocabulary terms as entities – Ability to encode authority data (vocabulary source, value uri, etc.) as well as sameAs relationships between the same term expressed in multiple sources – Ability to update authority records as community vocabularies become more formalized.
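To make "vocabulary terms as entities" concrete, here is a small sketch under stated assumptions. The `Term` and `Vocabulary` classes, their method names, and the `example.org` URIs are hypothetical illustrations; the actual model is defined by the DAMS ontology and MADS, not by this code.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Term:
    """A vocabulary term held as its own entity (hypothetical fields)."""
    local_id: str
    label: str
    source: Optional[str] = None      # vocabulary source, once known
    value_uri: Optional[str] = None   # URI of the term in that source
    same_as: Set[str] = field(default_factory=set)  # same term in other sources


class Vocabulary:
    """In-memory registry of terms, keyed by local identifier."""

    def __init__(self) -> None:
        self._terms: Dict[str, Term] = {}

    def add(self, term: Term) -> None:
        self._terms[term.local_id] = term

    def link_same_as(self, local_id: str, uri: str) -> None:
        """Record that another source expresses the same term at `uri`."""
        self._terms[local_id].same_as.add(uri)

    def formalize(self, local_id: str, source: str, value_uri: str) -> None:
        """Update the authority record once a community vocabulary covers the term."""
        term = self._terms[local_id]
        term.source = source
        term.value_uri = value_uri

    def find_by_uri(self, uri: str) -> Optional[Term]:
        """Look a term up by its authority URI or any sameAs URI."""
        for term in self._terms.values():
            if uri == term.value_uri or uri in term.same_as:
                return term
        return None


# Hypothetical usage: a local term that later gains authority data.
vocab = Vocabulary()
vocab.add(Term(local_id="term-1", label="Foraminifera"))
vocab.link_same_as("term-1", "http://example.org/vocab-a/foraminifera")
vocab.formalize("term-1", source="vocab-b", value_uri="http://example.org/vocab-b/12345")
match = vocab.find_by_uri("http://example.org/vocab-a/foraminifera")
```

Because objects would point at the term entity rather than copy its label, updating the authority data in one place is enough when a vocabulary becomes more formalized.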
  17. 17. DAMS Data Model: Management • One that supports improved management of objects and metadata – Authority management of vocabulary terms – Event metadata!
  18. 18. DAMS Architecture
  19. 19. Preservation: Chronopolis Current DAMS Process 1. Create BagIt bags for all objects 2. Host via HTTP(S) 3. Bags are retrieved and ingested into Chronopolis DAMS4 Process 1. Create BagIt bags for changed (Δ) objects only, identified using Event metadata 2. Host via HTTP(S) or enqueue on messaging queue for ingestion
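The DAMS4 step "create bags for Δ objects using Event metadata" could look roughly like the sketch below. It is an illustration only: the event tuples, object IDs, and function names are hypothetical, and the bag is a bare-minimum BagIt layout (a `bagit.txt` declaration, a `data/` payload directory, and a SHA-256 manifest) written with the standard library rather than with whatever tooling the DAMS actually uses.

```python
import hashlib
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, Iterable, List, Tuple


def changed_since(events: Iterable[Tuple[str, datetime]], cutoff: datetime) -> List[str]:
    """Return sorted IDs of objects with at least one event at or after `cutoff`."""
    return sorted({obj_id for obj_id, when in events if when >= cutoff})


def write_bag(bag_dir: Path, payload: Dict[str, bytes]) -> None:
    """Write a minimal BagIt bag: payload under data/, a manifest, and bagit.txt."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for name in sorted(payload):
        content = payload[name]
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )


# Hypothetical event log: (object ID, event timestamp).
events = [
    ("obj-1", datetime(2013, 9, 1, tzinfo=timezone.utc)),
    ("obj-2", datetime(2013, 10, 5, tzinfo=timezone.utc)),
    ("obj-3", datetime(2013, 10, 9, tzinfo=timezone.utc)),
    ("obj-2", datetime(2013, 10, 10, tzinfo=timezone.utc)),
]
last_run = datetime(2013, 10, 1, tzinfo=timezone.utc)
delta = changed_since(events, last_run)  # only objects touched since the last run

# Bag the first changed object into a throwaway directory and read back its manifest.
with tempfile.TemporaryDirectory() as tmp:
    bag_dir = Path(tmp) / delta[0]
    write_bag(bag_dir, {"metadata.rdf": b"<rdf/>"})
    manifest_text = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
```

Bagging only the delta keeps each transfer proportional to what changed instead of to the size of the whole repository.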
  20. 20. Storage
  21. 21. Storage: EMC Isilon 72NL • Storage for Library Collections • 1 cluster of 5 nodes • 1 node = 36 x 2TB drives • Total current usable storage of 320TB • OneFS
  22. 22. Storage: OpenStack Storage For Research Data Collections Testing: • Performance versus Local Storage • Large Files (up to 1TB) – Segmenting files > 5GB – Lexical order bug fix: segment names 1,10,2 -> 0001,0002,…0010 • Rackspace CloudFiles API vs. OpenStack REST API Testing Notes: https://libraries.ucsd.edu/blogs/dams/openstack-testing-notes/
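The lexical-order fix on this slide is easy to see in a few lines. This is a sketch, assuming segments are named by a 1-based counter and that "5GB" means 5 GiB; `segment_ranges`, `segment_name`, and the sample base name are hypothetical and do not come from the DAMS code.

```python
from typing import List, Tuple

# Threshold from the slide, interpreted here as 5 GiB (an assumption).
SEGMENT_LIMIT = 5 * 1024**3


def segment_ranges(total_size: int, segment_size: int = SEGMENT_LIMIT) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs that cover `total_size` bytes in order."""
    if total_size < 0 or segment_size <= 0:
        raise ValueError("total_size must be >= 0 and segment_size must be > 0")
    return [
        (offset, min(segment_size, total_size - offset))
        for offset in range(0, total_size, segment_size)
    ]


def segment_name(base: str, index: int, width: int = 4) -> str:
    """Name a segment with a zero-padded index so names sort in numeric order."""
    return f"{base}/{index:0{width}d}"


# Unpadded counters sort lexically as 1, 10, 2, ... which scrambles the segments.
unpadded = [str(i) for i in range(1, 11)]
scrambled = sorted(unpadded)

# Zero-padded counters sort lexically in the same order as numerically.
padded = [segment_name("bigfile.dat", i) for i in range(1, 11)]
in_order = sorted(padded) == padded
```

Any store that reassembles a large file by listing its segments in name order depends on this padding, as long as the width is large enough for the highest segment number.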
  23. 23. DAMS Repository
  24. 24. DAMS Repository Core Repository Application: Create, Read, Update, Delete (CRUD) Uses: Jena, ActiveMQ, JHOVE, Apache Tika, FFMPEG, ImageMagick Manages: • Metadata Triplestore • Storage • Solr
  25. 25. DAMS Repository: Metadata Triplestore
  26. 26. DAMS Repository: Metadata Triplestore Triplestore was: AllegroGraph Triplestore is: PostgreSQL DB + Jena • Schema: (ID), Parent, Subject, Predicate, Object Jena Usage: • Core/RDF API – Parsing, loading, updating, serializing RDF • ARQ API – SPARQL queries
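A relational table shaped like the schema on this slide might be sketched as follows. Several things here are assumptions: SQLite stands in for PostgreSQL so the example runs anywhere, the `parent` column is read as "the top-level object a statement belongs to", and the identifiers and predicates are made-up examples. Jena's RDF parsing and SPARQL support are not reproduced.

```python
import sqlite3
from typing import List, Tuple

# In-memory SQLite database standing in for the PostgreSQL store (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE triples (
        id        INTEGER PRIMARY KEY,
        parent    TEXT NOT NULL,   -- assumed: top-level object owning the statement
        subject   TEXT NOT NULL,
        predicate TEXT NOT NULL,
        object    TEXT NOT NULL
    )
    """
)
conn.execute("CREATE INDEX triples_parent_idx ON triples (parent)")


def add_triple(parent: str, subject: str, predicate: str, obj: str) -> None:
    """Insert one statement, tagged with the object it belongs to."""
    conn.execute(
        "INSERT INTO triples (parent, subject, predicate, object) VALUES (?, ?, ?, ?)",
        (parent, subject, predicate, obj),
    )


def describe(parent: str) -> List[Tuple[str, str, str]]:
    """Return every (subject, predicate, object) stored under one parent object."""
    rows = conn.execute(
        "SELECT subject, predicate, object FROM triples WHERE parent = ? ORDER BY id",
        (parent,),
    )
    return rows.fetchall()


# Hypothetical statements: an object, plus a component nested inside it.
add_triple("obj:1", "obj:1", "ex:title", "Brain specimen")
add_triple("obj:1", "obj:1", "ex:hasComponent", "obj:1/1")
add_triple("obj:1", "obj:1/1", "ex:title", "Slice 1")
add_triple("obj:2", "obj:2", "ex:title", "Sediment core")

brain_statements = describe("obj:1")
```

With a parent column, all statements for an object, including those about its nested components, can be fetched with one indexed lookup instead of walking the graph.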
  27. 27. DAMS Repository: REST API
  28. 28. Hydra Framework Source: https://wiki.duraspace.org/display/hydra/Technical+Framework+and+its+Parts
  29. 29. DAMS Repository: Fedora API-ish
  30. 30. Fedora API – Next PID
  31. 31. Fedora API – Next PID
  32. 32. DAMS Manager
  33. 33. DAMS Manager Java application using Spring MVC framework • Collection Management – Metadata Ingest and Export – File Ingest – Derivative Generation – Solr indexing by Collection • Administrative Reporting and Statistics
  34. 34. DAMS Hydra Head
  35. 35. DAMS Hydra Head
  36. 36. DAMS Hydra Head: Blacklight
  37. 37. RDF in Hydra
  38. 38. RDF in Hydra: (Read) Nested Attributes
  39. 39. RDF in Hydra: (Create) Nested Attributes
  40. 40. DAMS Hydra Head: Complex Objects
  41. 41. Next Steps Beta Release: Late October Production Release: January Future: • Sufia/Curate Integration for administrative functionality • Additional Linked Data Integration and Crosswalks – Schema.org, OpenURL, Dublin Core, ResourceSync • Fedora4
  42. 42. More Information • DAMS Overview: https://github.com/ucsdlib/dams/wiki/DAMS-Manual • DAMS Hydra Head: https://github.com/ucsdlib/damspas • DAMS Ontology: https://github.com/ucsdlib/dams/tree/master/ontology • DAMS REST API: https://github.com/ucsdlib/dams/wiki/REST-API • Hot Topics Series 3: Get a Head on the Repository with Hydra: http://duraspace.org/hot-topics • Hydra Technical Overview: https://wiki.duraspace.org/display/hydra/Technical+Framework+and+its+Parts • OneFS Technical Overview: http://www.emc.com/collateral/hardware/white-papers/h10719-isilon-onefs-technical-overview-wp.pdf • Isilon Overview: http://www.emc.com/collateral/software/data-sheet/h10541-ds-isilon-platform.pdf
  43. 43. Coming Up Next Final Webinar (October 31) The researcher perspective from two of our pilot participants Dick Norris – Professor, Scripps Institution of Oceanography Rick Wagner – Data Scientist at San Diego Supercomputer Center
  44. 44. Questions? Thanks! Declan Fleming @declan | dfleming@ucsd.edu Arwen Hutt @arwenh | ahutt@ucsd.edu Matt Critchlow @mattcritchlow | mcritchlow@ucsd.edu