NIH-Funded Model Enables Discovery of Biomedical Datasets
1. Supported by the NIH grant 1U24 AI117966-01 to UCSD
PI , Co-Investigators at:
Alejandra Gonzalez-Beltran, Susanna-Assunta Sansone, Philippe Rocca-Serra
Oxford e-Research Centre, University of Oxford, UK
Smart Descriptions & Smarter Vocabularies (SDSVoc)
30 November-1 December 2016, CWI Amsterdam Science Park
The model: dataset descriptions for data discovery in DataMed
2. Just as JATS (Journal Article Tag Suite) is used by PubMed to index literature,
DATS (DatA Tag Suite) is needed as a scalable way to
index data sources in the DataMed prototype
http://datamed.org
NIH BD2K Data Discovery Index prototype
Biomedical and healthcare datasets
3. open community-driven development &
documentation
http://biocaddie.org/workgroup-3-group-links
http://github.com/biocaddie/WG3-MetadataSpecifications
http://tiny.cc/datswebinar
4. What is the model's remit?
• Enabling discoverability: find and access datasets available in multiple repositories
• Focusing on surfacing key metadata descriptors, such as
  ◦ information about, and relations between, datasets, creators, publications, funding sources, nature of biological signal and perturbation, etc.
• Not the perfect model to represent the experimental details
  ◦ the level of detail and metadata needed to ensure interoperability and reusability is left to the indexed databases
  ◦ We have aimed for maximum coverage of use cases with a minimal number of data elements and relations
  ◦ Only very few properties are required
  ◦ Follow Best Practices for Data on the Web
5. The development process in a nutshell
Metadata elements identified by combining two complementary approaches:
• USE CASES: top-down approach
• SCHEMAS: bottom-up approach
Model serialized as JSON schemas and mapped to schema.org (v1.0, v1.1, v2.0, v2.1)
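As a rough illustration of what a model serialized as JSON schemas with only a few required properties can look like, the sketch below defines a small JSON-Schema-style description and checks an instance against it using only the Python standard library. The property names (`title`, `types`, `creators`, `description`) and the helpers `missing_required` and `wrong_types` are hypothetical and are not taken from the actual DATS schemas.

```python
import json

# Hypothetical, heavily simplified schema in JSON Schema style.
# Property names are illustrative assumptions, not the real DATS schema.
DATASET_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "types": {"type": "array"},
        "creators": {"type": "array"},
        "description": {"type": "string"},
    },
    # Only a small subset is required; everything else is optional.
    "required": ["title", "types"],
}

# Map the JSON type names used above to Python types.
_JSON_TYPES = {"string": str, "array": list, "object": dict}


def missing_required(instance, schema):
    """Return the required property names that are absent from the instance."""
    return [name for name in schema["required"] if name not in instance]


def wrong_types(instance, schema):
    """Return names of present properties whose value has the wrong JSON type."""
    bad = []
    for name, spec in schema["properties"].items():
        if name in instance and not isinstance(instance[name], _JSON_TYPES[spec["type"]]):
            bad.append(name)
    return bad


record = json.loads('{"title": "Example dataset", "creators": []}')
print(missing_required(record, DATASET_SCHEMA))
print(wrong_types(record, DATASET_SCHEMA))
```

A real deployment would more likely rely on a full JSON Schema validator; the hand-written checks here only cover the `required` and `type` keywords.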
6. Using an existing model?
Bottom-up approach: considered multiple models; mapped and analyzed these:
Generic models
• schema.org
• DataCite
• RIF-CS (Registry Interchange Format – Collection and Services)
• W3C HCLS dataset descriptions (mapping of many models including Dublin Core, DCAT, PROV, VoID, VoID-ext)
• Project Open Metadata (used by HealthData.gov)
Life science / biomedical models
• ISA (Investigation/Study/Assay)
• BioProject
• BioSample
• MiNIML
• PRIDE-ml
• MAGE-tab
• GA4GH metadata schema
• SRAxml
• CDISC SDM / elements of the BRIDG model
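One way to picture the bottom-up mapping is as a crosswalk from each existing model's term to a shared element name. The sketch below is only an illustration under assumptions: the shared element names on the left and the helper `to_common` are hypothetical, while the right-hand terms are the title, description, and identifier properties as named in schema.org, DataCite, and DCAT (which reuses Dublin Core terms).

```python
# Hypothetical crosswalk: shared element name -> term used by each source model.
# The shared names are assumptions for illustration, not the DATS element names.
CROSSWALK = {
    "title": {
        "schema.org": "name",
        "DataCite": "title",
        "DCAT": "dct:title",
    },
    "description": {
        "schema.org": "description",
        "DataCite": "description",
        "DCAT": "dct:description",
    },
    "identifier": {
        "schema.org": "identifier",
        "DataCite": "identifier",
        "DCAT": "dct:identifier",
    },
}


def to_common(record, source_model):
    """Rename the keys of a flat source record to the shared element names.

    Keys with no entry in the crosswalk are dropped.
    """
    reverse = {
        terms[source_model]: common
        for common, terms in CROSSWALK.items()
        if source_model in terms
    }
    return {reverse[key]: value for key, value in record.items() if key in reverse}


schema_org_record = {"name": "Example dataset", "license": "CC0"}
print(to_common(schema_org_record, "schema.org"))
```

Elements that several source models share, like the three above, are natural candidates for a small common core; terms that appear in only one model would need a case-by-case decision.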
7. Model for scalable indexing
Convergence of elements extracted from competency questions and existing (generic and biomedical) data models (incl. DataCite, DCAT, schema.org, HCLS dataset, RIF-CS, ISA-Tab, SRA-xml etc.)
Core entities and extended entities: adoption of elements extracted from [...] and from [...], plus elements from other models (e.g. dataset/distribution/catalog from DCAT)
8. Serializations and use of schema.org
• DATS model represented as JSON schemas, instances as:
  ◦ JSON* format, and
  ◦ JSON-LD** with vocabulary from schema.org
  ◦ serializations in other formats and with other vocabularies can also be done, as and if needed
• Benefits for DataMed and databases indexed by DataMed
  ◦ Increased visibility (by popular search engines), accessibility (via common query interfaces) and possibly improved ranking
• Use and extensions of schema.org
  ◦ Submitted missing DATS elements to the schema.org tracker
  ◦ Coordinating the extension of schema.org for life science via the bioschemas.org initiative (which ELIXIR is also part of)
* JavaScript Object Notation
** JavaScript Object Notation for Linked Data
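The difference between a plain JSON instance and a JSON-LD instance using schema.org vocabulary can be sketched as follows. This is a minimal example under assumptions, not the DATS serialization itself: the input keys (`title`, `description`, `creators`) and the function `to_jsonld` are hypothetical, while `@context`, `@type`, `Dataset`, `Person`, `name`, `description`, and `creator` are standard JSON-LD keywords and schema.org terms.

```python
import json


def to_jsonld(dataset):
    """Wrap a plain JSON dataset description as JSON-LD with schema.org terms.

    The input keys are hypothetical; only the output vocabulary is schema.org.
    """
    doc = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": dataset["title"],
    }
    if "description" in dataset:
        doc["description"] = dataset["description"]
    if "creators" in dataset:
        # Each creator name becomes a schema.org Person node.
        doc["creator"] = [{"@type": "Person", "name": n} for n in dataset["creators"]]
    return doc


plain = {"title": "Example dataset", "creators": ["A. Researcher"]}
print(json.dumps(to_jsonld(plain), indent=2))
```

Embedding such a JSON-LD document in a dataset landing page is a common way to expose schema.org metadata to search engines.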
9. Implementations and documentation
Other adopters exporting DATS in their APIs
To evaluate DATS model capabilities
Work in progress: documentation and curation guidelines for adopters