Datasets with Bioschemas
Alejandra Gonzalez-Beltran(*), Philippe Rocca-Serra(*),
Susanna Sansone(*) and the bioCADDIE Team and Community
(*) Oxford e-Research Centre, University of Oxford
Bioschemas community meeting,
Harpenden, Hertfordshire, UK
8th-9th November 2016
The problem
how to describe scientific(*)
datasets to enable data discovery
(*) considering in particular
biological and biomedical datasets
Design principles
The model for data description to be designed around the
Dataset entity, i.e. a unit of information stored by a data
repository:
● Archived experimental datasets which do not change after
deposition to the repository; e.g. dbGaP, GEO,
ClinicalTrials.gov
● Datasets in reference knowledge bases, describing dynamic
concepts, such as “genes” whose definition morphs over
time; e.g. UniProt
Additionally:
● A dataset and related datasets may be available in multiple
repositories
● A dataset may be available in multiple forms
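The design principles above can be sketched as a small data structure. This is only an illustrative sketch, not the DATS model itself: the class names, field names and example values below are assumptions chosen to show a Dataset entity that can be archived or evolving, held by several repositories, and offered in several forms.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Distribution:
    """One concrete form of a dataset, as served by one repository."""
    repository: str    # name of the hosting repository
    access_url: str    # where this form can be retrieved
    media_format: str  # e.g. "text/csv" or "application/json"


@dataclass
class Dataset:
    """The central entity: a unit of information stored by a data repository."""
    identifier: str
    title: str
    # True for archived datasets that are frozen after deposition,
    # False for knowledge-base entries whose content evolves over time.
    is_archived: bool
    distributions: List[Distribution] = field(default_factory=list)
    related_identifiers: List[str] = field(default_factory=list)

    def repositories(self) -> List[str]:
        """Repositories that hold at least one form of this dataset."""
        return sorted({d.repository for d in self.distributions})

    def formats(self) -> List[str]:
        """Distinct forms (media formats) in which the dataset is available."""
        return sorted({d.media_format for d in self.distributions})


# Hypothetical example values, for illustration only.
example = Dataset(
    identifier="example:0001",
    title="Example expression dataset",
    is_archived=True,
    distributions=[
        Distribution("Repository A", "https://example.org/a/0001.csv", "text/csv"),
        Distribution("Repository B", "https://example.org/b/0001.json", "application/json"),
    ],
    related_identifiers=["example:0002"],
)

print(example.repositories())
print(example.formats())
```

Keeping the forms and locations in a separate Distribution record means the same Dataset description can be reused wherever the data is mirrored or re-encoded.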
Best practices for data on the web
https://www.w3.org/TR/dwbp
The approach: convergence of metadata elements derived from use
cases (top-down) and models/schemas (bottom-up)
top-down approach bottom-up approach
(v1.0, v1.1, v2.0, v2.1)
Model serialised as JSON schemas/JSON-LD with schema.org annotations
Evaluation through
Mapping 50+ repositories
Available through
DataCite API,
mapped to Data citation metadata
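As a rough illustration of what a JSON-LD serialisation with schema.org annotations can look like, the sketch below builds a minimal dataset description. It uses only schema.org terms (Dataset, DataDownload, name, description, identifier, distribution, contentUrl, encodingFormat); the helper function and all example values are hypothetical, and this is not the DATS serialisation itself.

```python
import json


def dataset_to_jsonld(title, description, identifier, download_url, encoding_format):
    """Build a minimal schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": title,
        "description": description,
        "identifier": identifier,
        "distribution": [
            {
                "@type": "DataDownload",
                "contentUrl": download_url,
                "encodingFormat": encoding_format,
            }
        ],
    }


# Hypothetical example values, for illustration only.
doc = dataset_to_jsonld(
    title="Example expression dataset",
    description="Illustrative record showing the JSON-LD shape.",
    identifier="example:0001",
    download_url="https://example.org/data/0001.csv",
    encoding_format="text/csv",
)

print(json.dumps(doc, indent=2))
```

A document of this shape can be embedded in a repository landing page inside a script element of type application/ld+json.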
Extracting requirements from use cases
❖ Selected competency questions
✧ representative set collected from: use cases workshop, white
paper, submitted by the community and from NIH and Phil
Bourne’s ADDS office
✧ key metadata elements processed: abstracted, color-coded and
terms binned as Material, Process, Information,
Properties; relations identified
Mapping existing metadata schemas/models
❖ schema.org
❖ DataCite
❖ RIF-CS
❖ W3C HCLS dataset descriptions (mapping of many models including DCAT, PROV,
VOID, Dublin Core)
❖ Project Open Metadata (used by HealthData.gov; being added in this new
iteration)
❖ ISA
❖ BioProject
❖ BioSample
❖ MINiML
❖ PRIDE-ml
❖ MAGE-TAB
❖ GA4GH metadata schema
❖ SRA xml
❖ CDISC SDM / elements of the BRIDG model
bottom-up approach
https://biosharing.org/collection/bioCADDIE
https://github.com/biocaddie/WG3-MetadataSpecifications/
DATS: DatA Tag Suite
Core entities
Biomedical extension
Just as JATS (Journal Article Tag Suite) is used by
PubMed to index literature,
DATS (DatA Tag Suite) is needed as a scalable way to
index data sources in the DataMed prototype
DATS: community-driven model
https://biocaddie.org/group/working-groups
DATS core
DATS core and biomedical extension
Mapping DATS to schema.org
✧ Missing elements (needed by DATS) submitted to
the tracker. Roughly 80% of DATS entities and
properties can be mapped (although the alignment is
imperfect or less precise); the remaining 20%
constitute major gaps
✧ Tracking the evolution of schema.org and its related
Health and Life Science extension (the latter
focuses on clinical studies)
https://goo.gl/nNTeW1
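The coverage figure quoted for the DATS to schema.org mapping can be read as a simple ratio of mapped properties to all properties. The sketch below shows that arithmetic on a toy mapping table. The table is an assumption made up for illustration: it is not the actual mapping, the property names on the left are placeholders, and the ten entries were chosen only so that the ratio is easy to follow.

```python
# Toy mapping table (NOT the real mapping): source property -> schema.org
# property, or None where no schema.org counterpart is assumed to exist.
MAPPING = {
    "title": "name",
    "description": "description",
    "identifier": "identifier",
    "creators": "creator",
    "keywords": "keywords",
    "licenses": "license",
    "distributions": "distribution",
    "spatialCoverage": "spatialCoverage",
    "hypotheticalGapA": None,  # placeholder for an unmapped property
    "hypotheticalGapB": None,  # placeholder for an unmapped property
}


def coverage(mapping):
    """Fraction of source properties that have a target property."""
    mapped = [source for source, target in mapping.items() if target is not None]
    return len(mapped) / len(mapping)


def gaps(mapping):
    """Source properties with no target, i.e. candidates for the tracker."""
    return sorted(source for source, target in mapping.items() if target is None)


print(f"coverage: {coverage(MAPPING):.0%}")
print("gaps:", gaps(MAPPING))
```

A ratio like this says nothing about how precise each individual alignment is, which is why the imperfect matches are called out separately above.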
https://developers.google.com/search/docs/data-types/datasets
Adoption and integration
DataMed
https://datamed.org/
DATS exported by
https://github.com/datacite/spinone/issues/3
● An API endpoint that returns
DataCite metadata in DATS
format is a work in progress:
http://api.datacite.org/dats
● The DataCite Metadata Schema allows
for a RelatedIdentifier with
the HasMetadata relation type;
this allows linking to the DATS
metadata from a DataCite
metadata record
Martin Fenner
DataCite
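The linking mechanism described above can be sketched by building a relatedIdentifier element with the HasMetadata relation type. This is a minimal sketch using Python's standard library: the target URL is a hypothetical placeholder, the helper name is an assumption, and the DataCite XML namespace and the enclosing relatedIdentifiers container are omitted for brevity.

```python
import xml.etree.ElementTree as ET


def has_metadata_link(dats_url):
    """Build a relatedIdentifier element that points at a DATS record."""
    element = ET.Element(
        "relatedIdentifier",
        {
            "relatedIdentifierType": "URL",
            "relationType": "HasMetadata",
        },
    )
    element.text = dats_url
    return element


# Hypothetical location of a DATS description, for illustration only.
link = has_metadata_link("https://example.org/dats/example-0001.json")
print(ET.tostring(link, encoding="unicode"))
```

In a full DataCite record this element would sit inside the relatedIdentifiers container next to any other related identifiers.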
DATS and DCIP supplement mapping
Citation metadata for repositories’ landing page
Mapping between OmicsDI and DATS
❖ Overlap with DATS core elements
✧ In red: some DATS extended elements
Development of a Bioschemas specification for datasets
Using the DATS experience for the Bioschemas dataset
specification
Next step...
