Dats nih-dccpc-kc7-april2018-prs-uoxf

DATS - Data Tag Suite: model overview
Philippe Rocca-Serra, Alejandra Gonzalez-Beltran, Susanna-Assunta Sansone
{philippe.rocca-serra,alejandra.gonzalez-beltran,susanna-assunta.sansone}@oerc.ox.ac.uk
Oxford e-Research Centre, Department of Engineering, University of Oxford, UK
NIH DCPPC - KC7, crosscut metadata model subgroup; April 20th, 2018
Developed as a community effort, as part of the NIH BD2K bioCADDIE grant (1U24 AI117966-01)
An activity of the NIH Data Commons’ Oxygen (1OT3OD025462-01) and Phosphorus (1OT3OD025459-01) Teams in KC7

What is DATS?
JATS (Journal Article Tag Suite) underpins PubMed for literature indexing;
DATS (DAta Tag Suite) is the data model for indexing data sources
(used by DataMed, but not limited to it)
TIP: click the GitHub octocat on a slide to go from this slide deck to the relevant files/documents in the DATS specification(s)

Where do I find the documentation?
doi:10.1038/ng.3864 (2017)
doi:10.1038/sdata.2017.59 (2017)

bioCADDIE team - iterative development
[Timeline figure; date markers: Mar15, Jun15, Aug15, Dec15, May16, Jun16, Sep16, Mar17, May17]
Phase 1: Design and development
Phase 2: Evaluation & iterative refinement
Phase 3: Continued evaluation & consolidation
Milestones: SOP and metadata strawman; Metadata specification V1.0 with JSON schema; DATS v1.1; DATS v2.0 (with access metadata, WG7); DATS v2.1 (schema.org JSON-LD); DATS v2.2
Our community engagement (input, feedback and links): Use cases workshop; 1st DATS workshop; WG3 formed, telecons start, dissemination; 2nd DATS workshop; WG7 formed, telecons start; WG12 formed, telecons start
Participants: primarily metadata modelers; primarily implementers

What was DATS supposed to do and be?
❖ Enabling discoverability: find and access datasets
❖ Focusing on surfacing key metadata descriptors, such as
✧ information about, and relations between, authors, datasets, publications, funding sources, nature of biological signal and perturbation etc.
✧ Not the perfect model to represent all experimental details, but with enough capability to capture essential descriptors
✧ the domain-specific level of detail and metadata belongs to the realm of specialized databases
❖ Better than just having keywords
✧ we have aimed for maximum coverage of use cases with a minimal number of data elements and relations

The development process in a nutshell (v1.0, v1.1, v2.0, v2.1, v2.2)
Metadata elements identified by combining the two complementary approaches
USE CASES: top-down approach
SCHEMAS: bottom-up approach

Building DATS by alignment: bottom-up approach
(standing on the shoulders of giants)
❖ BioProject
❖ BioSample
❖ MINiML
❖ PRIDE-ml
❖ MAGE-TAB
❖ GA4GH metadata schema
❖ SRA XML
❖ ISA
❖ CDISC SDM / elements of the BRIDG model
❖ ……(full list in the DATS specification)
❖ DataCite
❖ RIF-CS
❖ W3C HCLS dataset descriptions
❖ (mapping of many models including DCAT, PROV, VoID, Dublin Core)
❖ Project Open Metadata (used by HealthData.gov; being added in this new iteration)
❖ schema.org

Building DATS from query cases
Convergence of elements extracted from competency questions and existing (generic and biomedical) data models (incl. DataCite, DCAT, schema.org, HCLS dataset, RIF-CS, ISA-Tab, SRA-xml etc.)
[Figure labels: adoption of elements; core entities; extended entities]

Capturing the nature of a dataset
Database of Reference Knowledge
Storing knowledge about “The building blocks”
Archive of Experiments
Storing “The signal”
[acquisition, analysis, reverse engineering]

Defining boundaries for DATS
● Get all datasets on COPD where transcription profiling and spirometry were measured in cohorts of Southern Europeans
A query to retrieve datasets for further analysis.
● Get all genes whose expression in human lung tissue is elevated following exposure to diesel particulates
A query about findings, for hypothesis generation.
DATS could represent the collection of statements as a dataset, but how the statements are actually structured is beyond the current scope of DATS.
Complementarity with Biolinks - @cmungall

DATS fundamentals
❖ What is the dataset about?
✧ Material, Data
❖ How was the dataset produced? Which information does it hold?
✧ Dataset / Data Type with its Dimension, Method/Technology, Instrument
❖ Where can a dataset be found?
✧ Dataset, Distribution, Access objects (links to License, Formats)
❖ When was the dataset produced, released etc.?
✧ Dates to specify the nature of an event {create, modify, start, end...} and its timestamp
❖ Who did the work, funded the research, hosts the resources etc.?
✧ Person, Organisation and their roles, Grant

Relevant use cases:
assembling synthetic cohorts
DATS objects highlight

Counting things (I):
tracking patient and specimen relationships
Relationships between materials matter:
❖ Assessing sample / specimen origin and patient identity
❖ In the context of longitudinal studies and repeated-measures designs, where samples are collected or variables measured several times over the course of a study
Ease of use and compatibility with biomedical ontologies, owing to ‘familiarity and awareness’ of DO, GO, UBERON and the like.
Note: if the underlying model isn’t rich enough (as observed when mapping from a broad range of resources), accurate mapping from a primary resource into DATS may prove difficult

Counting things (II):
groups and sizes in the context of studies
For all datasets characterising “signal”, the ability to identify, list and characterise
study populations matters, as does the ability to capture descriptors for
‘treatment’ or ‘perturbations’

Dealing with
spatial and temporal
properties of a dataset

Tracking dataset spatial and temporal properties
Where & When
Query: “Get all datasets collected between 1945 and
1968 in University Hospitals from Japan and Korea”
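A minimal sketch of how such a "where & when" query could be evaluated once the temporal and spatial descriptors have been indexed. The record keys (`collected`, `country`, `site_type`) and the sample records are hypothetical simplifications, not DATS attribute names.

```python
from datetime import date

# Hypothetical, flattened index records; keys and values are made up
# for illustration and are not the DATS attribute names.
datasets = [
    {"title": "A", "collected": date(1950, 3, 1), "country": "Japan",
     "site_type": "University Hospital"},
    {"title": "B", "collected": date(1972, 6, 15), "country": "Japan",
     "site_type": "University Hospital"},
    {"title": "C", "collected": date(1960, 1, 20), "country": "Korea",
     "site_type": "University Hospital"},
    {"title": "D", "collected": date(1955, 9, 9), "country": "France",
     "site_type": "University Hospital"},
]


def matches(record, start_year, end_year, countries, site_type):
    """True when the record falls inside the time window and the places."""
    return (
        start_year <= record["collected"].year <= end_year
        and record["country"] in countries
        and record["site_type"] == site_type
    )


hits = [
    d["title"]
    for d in datasets
    if matches(d, 1945, 1968, {"Japan", "Korea"}, "University Hospital")
]
print(hits)
```

The point of the sketch is only that the query needs both a date descriptor and a place descriptor on each dataset to be answerable.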

Spatial information
❖ DATS uses the entity ‘Place’ to report geolocation information for a
Dataset (and other entities)
Place: entity with the following attributes
-name.
-description.
-coordinates.
-geometry. {values from geoJSON}
-postalAddress.
dats.Dataset spatialCoverage dats.Place
dats.Material spatialCoverage dats.Place
dats.Organization location dats.Place
dats.Activity location dats.Place
dats.Place relates to Feature in GeoJSON, GeoLocation in DataCite and Place in schema.org
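As an illustration of the Place entity, the sketch below builds a Place-like object from the attribute names listed on this slide and attaches it to a dataset through `spatialCoverage`. The value shapes (a GeoJSON type name for `geometry`, a longitude/latitude pair for `coordinates`, a list for `spatialCoverage`), the `make_place` helper and the sample values are assumptions made for the example.

```python
import json

# Geometry type names defined by the GeoJSON specification.
GEOJSON_GEOMETRY_TYPES = {
    "Point", "MultiPoint", "LineString", "MultiLineString",
    "Polygon", "MultiPolygon", "GeometryCollection",
}


def make_place(name, geometry, coordinates, description="", postal_address=""):
    """Build a Place-like dict using the attribute names from the slide."""
    if geometry not in GEOJSON_GEOMETRY_TYPES:
        raise ValueError(f"not a GeoJSON geometry type: {geometry!r}")
    return {
        "name": name,
        "description": description,
        "coordinates": coordinates,
        "geometry": geometry,
        "postalAddress": postal_address,
    }


# Hypothetical place and dataset; all values are made up for illustration.
place = make_place("Example University Hospital", "Point", [139.76, 35.71])
dataset = {"title": "Example cohort dataset", "spatialCoverage": [place]}
print(json.dumps(dataset, indent=2))
```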

Measuring things:
Supporting the description of variables -
dimensions and their relation to datasets

Dimensions
1. DATS.Dimension: meant to report what the data points in a dataset are about, their nature and their units.
2. DATS.Dimension should be typed (categorical, continuous)
3. DATS.Dimension is used from:
○ DATS.Material.characteristics.Dimension
○ DATS.DataAcquisition.measures.Dimension

Dimension: an example
{
  "@type": "Dimension",
  "identifier": {
    "identifier": "AQ5",
    "identifierSource": ""
  },
  "name": {
    "valueIRI": "",
    "value": "Current marital status"
  },
  "types": [{"value": "categorical", "valueIRI": ""}],
  "values": [
    "1 Married",
    "2 Widowed",
    "3 Separated",
    "4 Divorced",
    "5 Never married",
    "-9 Missing"
  ],
  "partOf": [
    "Dataset-33581-0001.json"
  ],
  "extraProperties": [
    {
      "category": "landingPage",
      "values": [
        "http://www.icpsr.umich.edu/icpsrweb/ICPSR/ssvd/studies/33581/datasets/0001/variables/AQ5"
      ]
    }
  ]
}
/json-instances/ICPSR-33581/Dimension-33581-0001-AQ5.json
Credits to Matthew Richardson / Sanda Ionescu (ICPSR)
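The `values` of a categorical Dimension such as the one above are stored as "code label" strings. A small sketch of how a consumer might turn them into a lookup table, assuming every entry follows the "integer code, space, label" pattern seen in this ICPSR example; the helper name is hypothetical.

```python
def parse_coded_values(values):
    """Split strings such as '1 Married' into a {code: label} mapping."""
    coded = {}
    for item in values:
        code, label = item.split(" ", 1)  # split on the first space only
        coded[int(code)] = label
    return coded


# Values copied from the Dimension example (variable AQ5).
values = [
    "1 Married",
    "2 Widowed",
    "3 Separated",
    "4 Divorced",
    "5 Never married",
    "-9 Missing",
]
codes = parse_coded_values(values)
print(codes[5])
```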

Dimensions
❖ Ongoing discussion to augment DATS.Dimension in order to:
✧ provide summary statistics (min, max, mode, median, mean…)
✧ link to group information
❖ under development and evaluation
❖ tightly tied to consent, access and terms of use issues
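To make the summary-statistics proposal concrete, the sketch below computes the statistics named on this slide for one continuous variable. How (or whether) they would be attached to DATS.Dimension was still under discussion, so the dictionary layout, the `summarise` helper and the sample observations are purely illustrative.

```python
import statistics


def summarise(observations):
    """Compute the summary statistics listed on the slide."""
    return {
        "min": min(observations),
        "max": max(observations),
        "mode": statistics.mode(observations),
        "median": statistics.median(observations),
        "mean": statistics.mean(observations),
    }


# Made-up observations for a hypothetical continuous dimension (age in years).
ages = [34, 41, 41, 52, 67]
summary = summarise(ages)
print(summary)
```

Publishing such aggregates instead of individual values is one reason the slide ties this work to consent, access and terms of use.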

Tracking what the dataset is about
Note: alignment with
biomedical ontologies
(e.g. OBO foundry)

Objects to support the description of variables:
dimensions and their relation to datasets
A study schedules a data acquisition event, which measures a variable about some material that is input to the event; the resulting dataset has dimensions as parts

Datasets can contain Datasets
Distinguish between results from measurement (output of data acquisitions)
and results from data transformations (output of data analysis)
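A rough sketch of walking a dataset that contains other datasets while keeping track of how each part was produced. The keys `title`, `hasPart` and `producedBy`, and the plain-string values used for `producedBy`, are assumptions chosen for the example rather than a statement of the DATS schema.

```python
def walk(dataset, depth=0):
    """Yield (depth, title, producedBy) for a dataset and its nested parts."""
    yield depth, dataset["title"], dataset.get("producedBy")
    for part in dataset.get("hasPart", []):
        yield from walk(part, depth + 1)


# Hypothetical nested dataset: one part comes from a measurement,
# the other from a transformation of that measurement.
collection = {
    "title": "Study collection",
    "hasPart": [
        {"title": "Raw measurements", "producedBy": "data acquisition"},
        {"title": "Normalised values", "producedBy": "data analysis"},
    ],
}

rows = list(walk(collection))
for depth, title, produced_by in rows:
    print("  " * depth + title, f"({produced_by})" if produced_by else "")
```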

Condition of access:
licenses, access conditions
and distributions

How can the dataset be accessed?
Access entity:
❖ landing page; access URL
❖ methods of access (e.g. download, service)
❖ authorization requirements
❖ authentication requirements

How can the dataset be used? (licenses)
❖ The dataset should be associated with one or more licenses, which
determine the terms of use of the dataset
❖ Licenses are legal documents giving official permission to do something
with the resource
❖ DATS supports recording licenses’ identifiers, names, versions and creators.
More details about the licenses are expected to be retrieved from external resources.

Where can the dataset be found?
What data standards does the dataset conform to?
How can the dataset be accessed?
DATS allows reporting which reporting guidelines, formats/models and terminologies the dataset complies with or uses

Distribution: an example
A distribution is a specific available form of a dataset
(e.g. the dataset in a specific format or specific endpoint)

Relating datasets to databases and data
standards

“Housekeeping elements”:
Identification, publication, organization,
people, and grant

Object identification
primary identifier (0..1)
alternative and related identifiers (0..n)

Object identification
https://biocaddie.org/group/working-group/working-group-2-data-identifiers-recommendation

Object identification: guidelines
master/examples/Uniprot-P77967.json
"identifier": {
  "identifier": "uniprot:P77967",
  "identifierSource": "uniprot"
},
"alternateIdentifiers": [
  {
    "alternateIdentifier": "PIR:S74805",
    "identifierSource": "PIR"
  }
],
"relatedIdentifiers": [
  {
    "identifier": "PANTHER:PTHR11455:SF22",
    "identifierSource": "PANTHER",
    "relationType": "family and domain database"
  }
]
❖ "identifier": primary identifier of the dataset; can be a string, but ideally an IRI. The identifier source is the organization/namespace responsible for creating/hosting it (here using “compact URIs”)
❖ "alternateIdentifiers": identifiers of the dataset other than the primary one, and their sources
❖ "relatedIdentifiers": identifiers of related resources, useful to allow cross-references with other complementary resources
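The identifiers in this example are written as compact URIs of the form `prefix:local-id`. A minimal sketch, using hypothetical helper names, of splitting such an identifier and checking that its prefix agrees with the declared `identifierSource`; the case-insensitive comparison is an assumption.

```python
def split_curie(curie):
    """Split 'prefix:local-id' at the first colon."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a compact URI: {curie!r}")
    return prefix, local_id


def source_matches(identifier_obj):
    """Check that the CURIE prefix agrees with the declared identifierSource."""
    prefix, _ = split_curie(identifier_obj["identifier"])
    return prefix.lower() == identifier_obj["identifierSource"].lower()


# Objects copied from the UniProt example above.
primary = {"identifier": "uniprot:P77967", "identifierSource": "uniprot"}
related = {
    "identifier": "PANTHER:PTHR11455:SF22",
    "identifierSource": "PANTHER",
    "relationType": "family and domain database",
}

print(split_curie(primary["identifier"]))
print(source_matches(primary), source_matches(related))
```

Splitting at the first colon only keeps local identifiers that themselves contain colons, such as the PANTHER one, intact.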

Tracking dataset producer identity
Who
(and acknowledging funders)

Tracking bibliographic information
(and acknowledging funders)

Tracking bibliographic information
❖ Distinction between primary publication(s) and other citations
❖ For published work, a PubMed identifier or DOI will suffice
✧ Rely on dedicated APIs to recover the publication metadata needed for indexing/search, which can then be included in DATS automatically
"primaryPublications": [
  {
    "identifier": {
      "identifier": "https://www.ncbi.nlm.nih.gov/pubmed/7762914",
      "identifierSource": "pubmed"
    },
    "alternateIdentifiers": [
      {
        "identifier": "http://dx.doi.org/10.7326/0003-4819-123-1-199507010-00007",
        "identifierSource": "doi"
      }
    ]
  }
]
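Before a publication API can be queried, the bare PubMed ID or DOI has to be recovered from the identifier URLs. The sketch below does only that step, offline, for the two identifiers in the example; the `local_id` helper is hypothetical and no particular publication API is assumed.

```python
from urllib.parse import urlparse


def local_id(identifier_obj):
    """Return the bare PubMed ID or DOI from a URL-style identifier."""
    url = identifier_obj["identifier"]
    source = identifier_obj["identifierSource"]
    path = urlparse(url).path
    if source == "pubmed":
        # e.g. /pubmed/7762914 -> 7762914
        return path.rstrip("/").rsplit("/", 1)[-1]
    if source == "doi":
        # e.g. /10.7326/0003-... -> 10.7326/0003-...
        return path.lstrip("/")
    return url  # unknown source: leave the identifier untouched


# Identifier objects copied from the primaryPublications example.
pubmed = {
    "identifier": "https://www.ncbi.nlm.nih.gov/pubmed/7762914",
    "identifierSource": "pubmed",
}
doi = {
    "identifier": "http://dx.doi.org/10.7326/0003-4819-123-1-199507010-00007",
    "identifierSource": "doi",
}

print(local_id(pubmed))
print(local_id(doi))
```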

Tools to handle DATS documents
❖ Validators / Schema compliance testing
❖ DataMed Transformation Language
https://biocaddie.org/sites/default/files/d7/project/1869/biocaddie-ahm-ingestion-2017sep.pdf
[Jeff Grethe]
https://github.com/biocaddie/WG3-MetadataSpecifications

Lessons learned
❖ Identification of a Dataset:
➢ Identifying what is a dataset for a particular source is crucial
for setting up an indexing pipeline to DataMed
❖ Use of DATS Dimensions (a high-level representation of
quantitative or qualitative properties of an entity)
➢ E.g. in OMOP CDM there was a need to split a single entity
into its procedural (mapped to DATS.DataAcquisition) and
its variable information (mapped to DATS.Dimension)
❖ Documentation
➢ The available DATS documentation was useful; more would be better
❖ Support infrastructure
➢ For the future, include more examples and validation
infrastructure

Serializations and use of schema.org
❖ DATS model in JSON schema, serialized as:
➢ JSON* format, and
➢ JSON-LD** with vocabulary from schema.org
■ serializations in other formats can also be done, as / if needed
❖ Benefits for DataMed and databases indexed by DataMed
➢ Increased visibility (to popular search engines), accessibility (via common query interfaces) and possibly improved ranking
❖ Extending/influencing schema.org
➢ Submitted missing DATS core elements to their tracker
➢ Coordinating the extension of schema.org for life science via the bioschemas.org initiative (which ELIXIR is also part of)
* JavaScript Object Notation
** JavaScript Object Notation for Linked Data
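A minimal sketch of the difference between the two serializations: the JSON-LD form is the same JSON object plus an `@context` that maps its keys to vocabulary terms. The context below is illustrative only; the specific key-to-term mappings and the `to_jsonld` helper are assumptions, not the context files distributed with DATS.

```python
import json

# Illustrative context: these mappings are assumptions made for the example.
CONTEXT = {
    "sdo": "https://schema.org/",
    "Dataset": "sdo:Dataset",
    "title": "sdo:name",
    "description": "sdo:description",
}


def to_jsonld(dats_obj, type_name):
    """Wrap a plain JSON object with an @context and an @type."""
    doc = {"@context": CONTEXT, "@type": type_name}
    doc.update(dats_obj)
    return doc


# Hypothetical plain-JSON dataset description.
plain = {"title": "Example dataset", "description": "Made-up description"}
doc = to_jsonld(plain, "Dataset")
print(json.dumps(doc, indent=2))
```

Because the payload keys are unchanged, a consumer that ignores `@context` can still read the document as ordinary JSON.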

Influence on schema.org evolution
https://developers.google.com/search/docs/data-types/datasets

Acknowledgements
(for the bioCADDIE phase of DATS)
doi:10.1038/sdata.2017.59

Work ongoing in the
DCPPC crosscut metadata model
subgroup
❖ Creating DATS examples: https://github.com/dcppc/data-stewards
✧ Oxygen team: https://github.com/dcppc/data-stewards/issues/12
✧ Phosphorus team: in progress
✧ ……
