10-15-13 “Metadata and Repository Services for Research Data Curation” Presentation Slides

“Hot Topics: The DuraSpace Community Webinar Series,” Series Six: “Research Data in Repositories.” Curated by David Minor, Research Data Curation Program, UC San Diego Library. Webinar 2: “Metadata and Repository Services for Research Data Curation”

Presented by Declan Fleming, Chief Technology Strategist; Arwen Hutt, Metadata Librarian; and Matt Critchlow, Manager of Development and Web Services, UC San Diego Library.

Published in: Education

  1. 1. Hot Topics: The DuraSpace Community Webinar Series, Series Six: “Research Data in Repositories” Curated by David Minor October 15, 2013 Hot Topics: DuraSpace Community Webinar Series
  2. 2. Webinar 2: Metadata & Repository Services for Research Data Curation Presented by: Declan Fleming, Chief Technology Strategist, UC San Diego Library; Matt Critchlow, Manager of Development and Web Services, UC San Diego Library; Arwen Hutt, Metadata Librarian, UC San Diego Library. October 15, 2013
  3. 3. Hot Topics Web Seminar Series: Research Data in Repositories The UC San Diego Experience Second Webinar: Metadata and Repository Services for Research Data Curation
  4. 4. General Series Intro • First webinar: Intro and Framing: UC San Diego decisions and planning • Second Webinar: Deep dive into technology and metadata • Third Webinar: The perspective from researchers, next steps
  5. 5. Your esteemed presenters … First webinar: David Minor – Program Director, Research Data Curation; Declan Fleming – Chief Technology Strategist. Second webinar: Declan Fleming – Chief Technology Strategist; Arwen Hutt – Metadata Librarian; Matt Critchlow – Manager of Development and Web Services. Third webinar: Dick Norris – Professor, Scripps Institution of Oceanography; Rick Wagner – Data Scientist at San Diego Supercomputer Center
  6. 6. Today we will … • Discuss real-world researcher interaction • Document how metadata and files combine to make digital objects • Describe the DAMS data model and how it supports complex research objects • Detail the technology driving the DAMS • Point to the future
  7. 7. Working with Researchers: Pilots • The Brain Observatory • NSF OpenTopography Facility • Levantine Archaeology Laboratory • Scripps Institution of Oceanography Geological Collections • The Laboratory for Computational Astrophysics
  8. 8. Working with Researchers: Process • Introductory meeting • Metadata point person • Ongoing discussions • One-on-one work Iterative, collaborative, customized, experimental…pilot!
  9. 9. Working with Researchers: Data management • Collocation • Clean up • Identifiers • Metadata
  10. 10. Working with Researchers: What is an object? • What are the boundaries on a discrete set or subset of data? • What is required to make the data intelligible, usable and reusable? • What needs to be preserved? • What do they want to display and/or share? • What do they want to be able to refer to or cite?
  11. 11. Working with Researchers: What is an object? [Diagram labels: Brain or Slice; Site or Artifact; Etc…]
  12. 12. Working with Researchers: Takeaways • They are the subject experts • There are a lot of broad-level similarities • But there is no such thing as one size fits all
  13. 13. We want a new data model… • One that is flexible and accommodates disparate metadata from a variety of sources • While promoting consistency within the data store • One that supports relationships within and between objects • One that is more community engaged, both sharing vocabularies and technology, and utilizing others' shared vocabularies and technologies • One that supports improved management of objects and metadata
  14. 14. DAMS Data Model Development Process • Five people, in a room, 16 hours a week for 4 months • Worked through existing data, use case scenarios, known data requirements, investigated known ontologies, etc. • Lots and lots and lots of discussion • Utilizes MADS (Metadata Authority Description Schema) • Results = a data dictionary and an OWL ontology • Living document
  15. 15. DAMS Data Model: Flexibility • The data model provides enough flexibility that we can accommodate a wide variety of data within the schema – Vocabularies – Use of “types” or “display labels” to distinguish specific subtypes of a data field – Flexible structures and relationships – Extensible
  16. 16. DAMS Data Model: Consistency • But enough consistency that searching and display rules do not need to be customized for each individual collection of material – Rules can be applied at the level of the broader concept • As well as establishing the organizational structure necessary for maintaining consistency over time – Evaluation and approval of modifications
  17. 17. DAMS Data Model: Relationships • It allows us to create a number of different relationships – Collections and sub-collections – Collections and objects – Objects and components (complex hierarchical objects) – Other related resources internal or external to the DAMS [Figure: complex object example]
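
The "objects and components" relationship on this slide can be pictured as a small tree. The following is a minimal sketch, not the DAMS ontology itself: the class name `Node`, its fields, and the brain/slice identifiers are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for an object or component in a complex object."""
    identifier: str
    title: str
    files: List[str] = field(default_factory=list)
    components: List["Node"] = field(default_factory=list)

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "Node"]]:
        """Yield (depth, node) pairs in depth-first order."""
        yield depth, self
        for child in self.components:
            yield from child.walk(depth + 1)


# Illustrative data only: a brain object with two slice components,
# one of which has a nested component of its own.
brain = Node("obj-1", "Brain", components=[
    Node("obj-1/1", "Slice 1", files=["slice1.tif"]),
    Node("obj-1/2", "Slice 2", files=["slice2.tif"],
         components=[Node("obj-1/2/1", "Region of interest")]),
])

# Print the hierarchy as an indented outline.
for depth, node in brain.walk():
    print("  " * depth + node.title)
```

Because components are themselves nodes, the same structure could hold a site with artifacts or any other nested research object without changing the traversal.
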
  18. 18. DAMS Data Model: Vocabularies • Allow management of local & community vocabularies – Vocabulary terms as entities – Ability to encode authority data (vocabulary source, value uri, etc.) as well as sameAs relationships between the same term expressed in multiple sources – Ability to update authority records as community vocabularies become more formalized.
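
One way to picture "vocabulary terms as entities" with authority data and sameAs links is sketched below. It is only an illustration under stated assumptions: the `Term` class, its field names, the vocabulary source names, and the `example.org` URIs are hypothetical and are not taken from the DAMS data dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Term:
    """Hypothetical vocabulary term stored as its own entity."""
    local_id: str
    label: str
    source: Optional[str] = None      # vocabulary source (assumed field name)
    value_uri: Optional[str] = None   # URI of the term in that source
    same_as: Set[str] = field(default_factory=set)  # local ids of equivalent terms


def link_same_as(terms: Dict[str, Term], a: str, b: str) -> None:
    """Record a symmetric sameAs relationship between two terms."""
    terms[a].same_as.add(b)
    terms[b].same_as.add(a)


def equivalents(terms: Dict[str, Term], start: str) -> Set[str]:
    """Return every term reachable from `start` through sameAs links."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(terms[current].same_as - seen)
    seen.discard(start)
    return seen


# Illustrative data: one local term and the same concept in two other sources.
terms = {
    "t1": Term("t1", "Oceanography", source="local"),
    "t2": Term("t2", "Oceanography", source="vocab-a",
               value_uri="http://example.org/vocab-a/oceanography"),
    "t3": Term("t3", "Oceanography", source="vocab-b",
               value_uri="http://example.org/vocab-b/oceanography"),
}
link_same_as(terms, "t1", "t2")
link_same_as(terms, "t2", "t3")
print(sorted(equivalents(terms, "t1")))
```

Keeping the term as a separate entity means an authority record can be updated in one place, for example by filling in `value_uri` once a community vocabulary publishes one.
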
  19. 19. DAMS Data Model: Management • One that supports improved management of objects and metadata – Authority management of vocabulary terms – Event metadata!
  20. 20. DAMS Architecture
  21. 21. Preservation: Chronopolis. Current DAMS Process: 1. Create BagIt bags for all objects 2. Host via HTTP(S) 3. Bags are retrieved and ingested into Chronopolis. DAMS4 Process: 1. Create BagIt bags for changed (Δ) objects using Event metadata 2. Host via HTTP(S) or enqueue on messaging queue for ingestion
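
The DAMS4 idea of bagging only changed objects can be sketched as follows. This is a rough illustration, not the DAMS implementation: the `Event` record, the `changed_since` and `write_bag` helpers, and the sample data are assumptions. The bag layout (a `bagit.txt` declaration, a `data/` payload directory, and a checksum manifest) follows the BagIt specification in its simplest form.

```python
import hashlib
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Dict, Iterable, List, NamedTuple


class Event(NamedTuple):
    """Hypothetical event record: which object changed, how, and when."""
    object_id: str
    event_type: str
    timestamp: datetime


def changed_since(events: Iterable[Event], last_run: datetime) -> List[str]:
    """Return sorted ids of objects with at least one event after last_run."""
    return sorted({e.object_id for e in events if e.timestamp > last_run})


def write_bag(bag_dir: Path, payload: Dict[str, bytes]) -> Path:
    """Write a minimal bag: bagit.txt, data/ payload, and a SHA-256 manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}\n")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "".join(manifest_lines), encoding="utf-8"
    )
    return bag_dir


# Illustrative events; only objects touched after the last run get a new bag.
events = [
    Event("obj-1", "ingest", datetime(2013, 9, 1)),
    Event("obj-2", "metadata update", datetime(2013, 10, 10)),
    Event("obj-3", "ingest", datetime(2013, 10, 12)),
    Event("obj-2", "file replaced", datetime(2013, 10, 14)),
]
changed = changed_since(events, last_run=datetime(2013, 10, 1))

with tempfile.TemporaryDirectory() as tmp:
    for object_id in changed:
        write_bag(Path(tmp) / object_id, {"payload.bin": object_id.encode()})
    bag_files = sorted(
        p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*") if p.is_file()
    )
print(changed)
print(bag_files)
```

Selecting on event timestamps avoids regenerating a bag for every object on each run, which is the difference between the two processes listed on the slide.
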
  22. 22. Storage
  23. 23. Storage: EMC Isilon 72NL • Storage for Library Collections • 1 cluster of 5 nodes • 1 node = 36 x 2TB drives • Total current usable storage of 320TB • OneFS 7.0.2.1
  24. 24. Storage: OpenStack Storage For Research Data Collections Testing: • Performance versus Local Storage • Large Files (up to 1TB) – Segmenting files > 5GB – Lexical order bug fix: 1,10,2 -> 0001,0002,…0010 • Rackspace CloudFiles API VS OpenStack REST API Testing Notes: https://libraries.ucsd.edu/blogs/dams/openstack-testing-notes/
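
The lexical-order fix mentioned on this slide comes down to zero-padding segment numbers so that a string sort matches numeric order. Below is a minimal sketch of that idea plus a simple segment planner. The function names and the four-digit width are assumptions based on the slide's `0001,0002,…0010` example; no actual upload call is shown.

```python
from typing import List, Tuple


def segment_name(index: int, width: int = 4) -> str:
    """Zero-pad a 1-based segment number so lexical order equals numeric order."""
    if index < 1 or index >= 10 ** width:
        raise ValueError(f"segment index {index} does not fit in {width} digits")
    return f"{index:0{width}d}"


def plan_segments(total_size: int, segment_size: int) -> List[Tuple[str, int, int]]:
    """Split a file of total_size bytes into (name, offset, length) segments."""
    if total_size < 0 or segment_size <= 0:
        raise ValueError("total_size must be >= 0 and segment_size must be > 0")
    plan = []
    offset = 0
    index = 1
    while offset < total_size:
        length = min(segment_size, total_size - offset)
        plan.append((segment_name(index), offset, length))
        offset += length
        index += 1
    return plan


# Unpadded names sort as strings, so "10" lands before "2".
unpadded = [str(i) for i in range(1, 11)]
padded = [segment_name(i) for i in range(1, 11)]
print(sorted(unpadded)[:3])
print(sorted(padded) == padded)

# A 12-byte "file" split into 5-byte segments (tiny numbers for readability).
print(plan_segments(12, 5))
```

For a real file the segment size would be set at or below the 5GB threshold named on the slide; the small numbers here only keep the output readable.
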
  25. 25. DAMS Repository
  26. 26. DAMS Repository Core Repository Application: Create, Read, Update, Delete (CRUD) Uses: Jena, ActiveMQ, JHOVE, Apache Tika, FFMPEG, ImageMagick Manages: • Metadata Triplestore • Storage • Solr
  27. 27. DAMS Repository: Metadata Triplestore
  28. 28. DAMS Repository: Metadata Triplestore Triplestore was: AllegroGraph Triplestore is: PostgreSQL DB + Jena • Schema: (ID), Parent, Subject, Predicate, Object Jena Usage: • Core/RDF API – Parsing, loading, updating, serializing RDF • ARQ API – SPARQL queries
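
A relational table with the columns listed on this slide can be sketched as follows. This is an illustration only: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, does not involve Jena, and the meaning given to `parent` (the id of the enclosing object), the `ex:` predicates, and the `match` helper are all assumptions.

```python
import sqlite3

# In-memory SQLite database standing in for the PostgreSQL store.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE triples (
        id        INTEGER PRIMARY KEY,
        parent    TEXT,            -- assumed: id of the enclosing object
        subject   TEXT NOT NULL,
        predicate TEXT NOT NULL,
        object    TEXT NOT NULL
    )
    """
)

# Illustrative triples; the ex: predicates are placeholders, not DAMS terms.
rows = [
    ("obj-1", "obj-1", "ex:title", "Brain"),
    ("obj-1", "obj-1", "ex:hasComponent", "obj-1/1"),
    ("obj-1", "obj-1/1", "ex:title", "Slice 1"),
]
conn.executemany(
    "INSERT INTO triples (parent, subject, predicate, object) VALUES (?, ?, ?, ?)",
    rows,
)


def match(conn, subject=None, predicate=None, obj=None):
    """Return (subject, predicate, object) rows matching a triple pattern.

    A value of None acts as a wildcard for that position.
    """
    clauses, params = [], []
    for column, value in (("subject", subject), ("predicate", predicate), ("object", obj)):
        if value is not None:
            clauses.append(f"{column} = ?")
            params.append(value)
    where = (" WHERE " + " AND ".join(clauses)) if clauses else ""
    query = "SELECT subject, predicate, object FROM triples" + where + " ORDER BY id"
    return conn.execute(query, params).fetchall()


print(match(conn, predicate="ex:title"))
print(match(conn, subject="obj-1"))
```

A single-pattern lookup like this is the simplest building block of a SPARQL query; in the architecture described on the slide, that query layer is provided by Jena's ARQ API rather than hand-written SQL.
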
  29. 29. DAMS Repository: REST API
  30. 30. Hydra Framework Source: https://wiki.duraspace.org/display/hydra/Technical+Framework+and+its+Parts
  31. 31. DAMS Repository: Fedora API-ish
  32. 32. Fedora API – Next PID
  33. 33. Fedora API – Next PID
  34. 34. DAMS Manager
  35. 35. DAMS Manager Java application using Spring MVC framework • Collection Management – Metadata Ingest and Export – File Ingest – Derivative Generation – Solr indexing by Collection • Administrative Reporting and Statistics
  36. 36. DAMS Hydra Head
  37. 37. DAMS Hydra Head
  38. 38. DAMS Hydra Head: Blacklight
  39. 39. RDF in Hydra
  40. 40. RDF in Hydra: (Read) Nested Attributes
  41. 41. RDF in Hydra: (Create) Nested Attributes
  42. 42. DAMS Hydra Head: Complex Objects
  43. 43. Next Steps Beta Release: Late October; Production Release: January. Future: • Sufia/Curate Integration for administrative functionality • Additional Linked Data Integration and Crosswalks – Schema.org, OpenURL, Dublin Core, ResourceSync • Fedora 4
  44. 44. More Information • DAMS Overview: https://github.com/ucsdlib/dams/wiki/DAMS-Manual • DAMS Hydra Head: https://github.com/ucsdlib/damspas • DAMS Ontology: https://github.com/ucsdlib/dams/tree/master/ontology • DAMS REST API: https://github.com/ucsdlib/dams/wiki/REST-API • Hot Topics Series 3: Get a Head on the Repository with Hydra: http://duraspace.org/hot-topics • Hydra Technical Overview: https://wiki.duraspace.org/display/hydra/Technical+Framework+and+its+Parts • OneFS Technical Overview: http://www.emc.com/collateral/hardware/white-papers/h10719-isilon-onefs-technical-overview-wp.pdf • Isilon Overview: http://www.emc.com/collateral/software/data-sheet/h10541-ds-isilon-platform.pdf
  45. 45. Coming Up Next Final Webinar (October 31) The researcher perspective from two of our pilot participants Dick Norris – Professor, Scripps Institution of Oceanography Rick Wagner – Data Scientist at San Diego Supercomputer Center
  46. 46. Questions? Thanks! Declan Fleming @declan | dfleming@ucsd.edu Arwen Hutt @arwenh | ahutt@ucsd.edu Matt Critchlow @mattcritchlow | mcritchlow@ucsd.edu
