Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Introduction to DATS v2.2 - NIH May 2017

493 views

Published on

Overview of DATS development and status to date; DATS underpins the NIH BD2K DataMed index.

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

Introduction to DATS v2.2 - NIH May 2017

  1. 1. Supported by the NIH grant 1U24 AI117966-01 to UCSD PI , Co-Investigators at: The annotated with schema.org Susanna-Assunta Sansone, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra Oxford e-Research Centre, University of Oxford, UK bioCADDIE - DATS Workshop, Bethesda, 7 May 2017
  2. 2. The model
  3. 3. What is ? Like the JATS (Journal Article Tag Suite) is used by PubMed to index literature, a DATS (DatA Tag Suite) is needed for a scalable way to index data sources in the DataMed prototype
  4. 4. Where do I find the documentation? Like the JATS (Journal Article Tag Suite) is used by PubMed to index literature, a DATS (DatA Tag Suite) is needed for a scalable way to index data sources in the DataMed prototype
  5. 5. Mar15 Jun15 Dec15 Jun16 Aug15 May16 Sep16 Mar17 bioCADDIE team’s work on DATS Our community engagement: input, feedback and links May17
  6. 6. Mar15 Jun15 Dec15 Jun16 Aug15 May16 Sep16 Mar17 bioCADDIE team’s work on DATS Our community engagement: input, feedback and links Phase 1 Phase 2 Phase 3 Design and development Continued evaluation & consolidation May17 Evaluation & iterative refinement
  7. 7. Mar15 Jun15 Dec15 Jun16 Aug15 May16 Sep16 Mar17 bioCADDIE team’s work on DATS Our community engagement: input, feedback and links Phase 1 Phase 2 Phase 3 Design and development • Use cases collection= George, ExcTeam • Method development (SOP); competency questions; metadata mapping and definition= Susanna, Alejandra, Philippe SOP and metadata strawman <DATS> name May17 Metadata specification V1.0 with JSON schema Use cases workshop WG3 formed; telecons start; dissemination via WG7 formed; telecons start Evaluation & iterative refinement Continued evaluation & consolidation
  8. 8. Mar15 Jun15 Dec15 Jun16 Aug15 May16 Sep16 Mar17 bioCADDIE team’s work on DATS Our community engagement: input, feedback and links Phase 1 Phase 2 Phase 3 Design and development • Use cases collection= George, ExcTeam • Method development (SOP); competency questions; metadata mapping and definition= Susanna, Alejandra, Philippe SOP and metadata strawman <DATS> name DATS v1.1 May17 DATS v2.0 (with access metadata, WG7) Metadata specification V1.0 with JSON schema Use cases workshop 1st DATS workshop WG3 formed; telecons start; dissemination via WG7 formed; telecons start Evaluation & iterative refinement • DATS specification, serializations, refinement= (Susanna), Alejandra, Philippe and CoreDevTeam, also based on community feedback Continued evaluation & consolidation
  9. 9. Mar15 Jun15 Dec15 Jun16 Aug15 May16 Sep16 Mar17 bioCADDIE team’s work on DATS Our community engagement: input, feedback and links Phase 1 Phase 2 Phase 3 Design and development • Use cases collection= George, ExcTeam • Method development (SOP); competency questions; metadata mapping and definition= Susanna, Alejandra, Philippe SOP and metadata strawman <DATS> name DATS v1.1 May17 DATS v2.0 (with access metadata, WG7) DATS v2.1 (schema.org JSON-LD) DATS v2.2 Metadata specification V1.0 with JSON schema Use cases workshop 1st DATS workshop WG3 formed; telecons start; dissemination via 2nd DATS workshop WG7 formed; telecons start WG12 formed; telecons start Evaluation & iterative refinement • DATS specification, serializations, refinement= (Susanna), Alejandra, Philippe and CoreDevTeam, also based on community feedback Continued evaluation & consolidation • Alignment with other community efforts; documentation and curation guidelines = Susanna, Alejandra, Philippe, Jared, ExcTeam and CoreDevTeam
  10. 10. Mar15 Jun15 Dec15 Jun16 Aug15 May16 Sep16 Mar17 Use cases workshop bioCADDIE team’s work on DATS Our community engagement: input, feedback and links Phase 1 Phase 2 Phase 3 Design and development • Use cases collection= George, ExcTeam • Method development (SOP); competency questions; metadata mapping and definition= Susanna, Alejandra, Philippe Evaluation & iterative refinement • DATS specification, serializations, refinement= (Susanna), Alejandra, Philippe and CoreDevTeam, also based on community feedback Continued evaluation & consolidation • Alignment with other community efforts; documentation and curation guidelines = Susanna, Alejandra, Philippe, Jared, ExcTeam and CoreDevTeam 1st DATS workshop SOP and metadata strawman WG3 formed; telecons start; dissemination via <DATS> name DATS v1.1 May17 2nd DATS workshop DATS v2.0 (with access metadata, WG7) WG7 formed; telecons start DATS v2.1 (schema.org JSON-LD) DATS v2.2 primarily metadata modelers primarily implementers Metadata specification V1.0 with JSON schema WG12 formed; telecons start
  11. 11. ❖ Enabling discoverability: find and access datasets ❖ Focusing on surfacing key metadata descriptors, such as ✧ information and relations between authors, datasets, publication, funding sources, nature of biological signal and perturbation etc. ✧ Not the perfect model to represent the experimental details ✧ the level of details and metadata needed to ensure interoperability and reusability are left to the indexed databases ❖ Better than just having keywords ✧ we have aimed to have maximum coverage of use cases with minimal number of data elements and relations What is supposed to do and be?
  12. 12. Metadata elements identified by combining the two complementary approaches USE CASES: top-down approach SCHEMAS: bottom-up approach The development process in a nutshell (v1.0, v1.1, v2.0, v2.1, v2.2)
  13. 13. Extracting requirements from use cases ❖ Selected competency questions ✧ representative set collected from: use cases workshop, white paper, submitted by the community and from NIH and Phil Bourne’s ADDS office ✧ key metadata elements processed: abstracted, color-coded and terms binned binned as Material, Process, Information, Properties; relation identified top-down approach
  14. 14. bottom-up approach Standing on the shoulders of giants ❖ schema.org ❖ DataCite ❖ RIF-CS ❖ W3C HCLS dataset descriptions (mapping of many models including DCAT, PROV, VOID, Dublin Core) ❖ Project Open Metadata (used by HealthData.gov is being added in this new iteration) ❖ ……(full list in the DATS specification) ❖ ISA ❖ BioProject ❖ BioSample ❖ ……(full list in the DATS specification) ❖ MiNIML ❖ PRIDE-ml ❖ MAGE-tab ❖ GA4GH metadata schema ❖ SRA xml ❖ CDISC SDM / element of BRIDGE model ❖ ……(full list in the DATS specification)
  15. 15. Convergence of elements extracted from competency questions and existing (generic and biomedical) data models (incl. DataCite, DCAT, schema.org, HCLS dataset, RIF- CS, ISA-Tab, SRA- xml etc.) model for scalable indexing Adoption of elements extracted from and from core entities extended entities
  16. 16. ❖ The descriptors for each metadata element (Entity), include ✧ Property (describing the Entity), Definition (of each Entity and Property), Value(s) (allowed for each Property) Key features of
  17. 17. ❖ The descriptors for each metadata element (Entity), include ✧ Property (describing the Entity), Definition (of each Entity and Property), Value(s) (allowed for each Property) ❖ We have defined a set of core and extended entities ✧ Core elements are generic and applicable to any type of datasets, like the JATS can describe any type of publication. ✧ Extended elements includes an additional elements, some of which are specific for life, environmental and biomedical science domains ✧ this set can be further extended as needed Key features of
  18. 18. ❖ The descriptors for each metadata element (Entity), include ✧ Property (describing the Entity), Definition (of each Entity and Property), Value(s) (allowed for each Property) ❖ We have defined a set of core and extended entities ✧ Core elements are generic and applicable to any type of datasets, like the JATS can describe any type of publication. ✧ Extended elements includes an additional elements, some of which are specific for life, environmental and biomedical science domains ✧ this set can be further extended as needed ❖ Entities are not mandatory, in both core and extended set ✧ An entity is used only when applicable to the dataset to be described ✧ In that case, only few of its properties are defined as mandatory Key features of
  19. 19. ❖ Dataset, a core entity catering for any unit of information ✧ archived experimental datasets, which do not change after deposition to the repository => examples available for dbGAP, GEO, ClinicalTrials.org ✧ datasets in reference knowledge bases, describing dynamic concepts, such as “genes”, whose definition morphs over time => examples available for UniProt ❖ Dataset entity is also linked to other digital research objects ✧ Software and Data Standard, which are also part of the NIH Commons, but the focus on other discovery indexes and therefore are not described in detail in this model General design of the
  20. 20. core and extended elements
  21. 21. Of the 20 core elements none is mandatory
  22. 22. Only few properties of the 20 core elements are mandatory
  23. 23. ❖ What is the dataset about? ✧ Material ❖ How was the dataset produced ? Which information does it hold? ✧ Dataset / Data Type with its Information, Method, Platform, Instrument ❖ Where can a dataset be found? ✧ Dataset, Distribution, Access objects (links to License) ❖ When was the datasets produced, released etc.? ✧ Dates to specify the nature of an event {create, modify, start, end...} and its timestamp ❖ Who did the work, funded the research, hosts the resources etc.? ✧ Person, Organization and their roles, Grant Core elements provide the basic info
  24. 24. Interlinking to other indexes
  25. 25. also follows the W3C Data on the Web Best Practices DATS follows these, which also recommend DatasetDistribution https://www.w3.org/TR/dwbp
  26. 26. Serializations and use of schema.org ❖ DATS model in JSON schema, serialized as: ✧ JSON* format, and ✧ JSON-LD** with vocabulary from schema.org ✧ serializations in other formats can also be done, as / if needed ❖ Benefits for DataMed and databases index by DataMed ✧ Increased visibility (by both popular search engines), accessibility (via common query interfaces) and possibly improved ranking ❖ Extending schema.org ✧ Submitted to their tracker missing DATS core elements ✧ Coordinating via the bioschemas.org initiative (ELIXIR is also part of) the extension of schema.org for life science * JavaScript Object Notation ** JavaScript Object Notation for Linked Data

×