This document discusses efforts to standardize data templates for the Human Immunology Project Consortium (HIPC) to make HIPC data more findable, accessible, interoperable, and reusable (FAIR). Currently, HIPC data is inconsistently formatted and named. The authors propose mapping HIPC data submission templates to ontologies to semantically normalize the data and facilitate data integration and querying. They have mapped template elements like assay types and value sets to ontologies like the Ontology for Biomedical Investigations and are working with ontology groups to refine and improve the mappings. The goal is to incorporate this ontology-linked metadata approach into the CEDAR and ImmPort databases.
Standardization of the HIPC Data Templates: The Story So Far
1. Standardization of the HIPC Data Templates: The Story So Far
Ahmad C. Bukhari, Ph.D., Kei-Hoi Cheung, Ph.D. and Steven H. Kleinstein, Ph.D.
Yale University, School of Medicine
User Group
(HIPC)
2. ● An important resource for raw data and protocols from clinical trials,
mechanistic studies and novel methods for cellular and molecular
measurements
● Provides templates and standard operating procedures to facilitate data
representation and transfer.
● Provides a variety of tools for data access and manipulation
[Figure labels: ImmPort; SQL dump for local hosting]
3. Human Immunology Project Consortium (HIPC)
● Well-characterized human cohorts are studied using a variety of modern
analytic tools including multiplex transcriptional, cytokine, and proteomic
assays.
● HIPC submitted data is an important subset of the ImmPort database
● Submitted HIPC data is not standardized.
● Inconsistent naming and data reporting
4. Our aim is to make HIPC data FAIR
● Findability
○ Finding a large variety of related datasets is an important step to knowledge discovery
● Accessibility
○ A growing number of datasets are being submitted to public repositories such as ImmPort.
These datasets can be accessed through different methods, including web-based search, bulk
download and API access
● Interoperability
○ Data mining/analysis often requires multiple datasets to be integrated within a single repository
or across multiple repositories
● Reusability
○ Entering enough metadata as part of the data submission process facilitates data reuse
❖ FAIR is a set of Digital Object Compliance principles that describe the properties of digital objects,
defined under the NIH Commons initiative
5. Current practices towards data FAIRness
● Minimum information standards (checklists) specify the minimum amount of
information (metadata) needed for reporting results in a reproducible and
reusable fashion. For example,
○ MIAME: Minimum information about a microarray experiment
○ MIAPE: Minimum Information About a Proteomics Experiment
● Scientific communities have developed templates incorporating detailed
checklists of the metadata needed to describe particular types of
experimental data sources.
● Standard identifiers/terminologies/ontologies have been created for different
domains
6.
7. We propose an ontological mapping for the
ImmPort data submission templates.
● Ontology term mapping makes it possible to achieve semantic normalization across
different repositories.
● Ontologically annotated datasets allow context-aware queries and data
integration
● Mapping to controlled vocabularies, relationships and rules facilitates
run-time data validation.
● These help achieve data FAIRness.
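The points above on semantic normalization and run-time validation can be illustrated with a minimal Python sketch. It is not the ImmPort or CEDAR implementation: the vocabulary contents, the `Term` class, the `validate` function, and the `example.org` IRIs are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Term:
    label: str  # preferred label of the term
    iri: str    # placeholder IRI, not a real ontology identifier


# Hypothetical controlled vocabulary for a "measurement technique" column.
# Keys are normalized spellings; two spellings may point at the same term.
VOCABULARY = {
    "elisa": Term("ELISA", "http://example.org/term/elisa"),
    "flow cytometry": Term("flow cytometry", "http://example.org/term/flow-cytometry"),
    "flowcytometry": Term("flow cytometry", "http://example.org/term/flow-cytometry"),
}


def normalize(value: str) -> str:
    """Collapse whitespace and case so spelling variants compare equal."""
    return " ".join(value.split()).lower()


def validate(value: str) -> Optional[Term]:
    """Return the mapped term, or None if the value is not in the vocabulary."""
    return VOCABULARY.get(normalize(value))


# Inconsistently named submissions resolve to one term or are rejected.
for raw in ["ELISA", "  Flow  Cytometry", "Flowcytometry", "western blot"]:
    term = validate(raw)
    status = term.iri if term else "REJECTED: not in controlled vocabulary"
    print(f"{raw!r} -> {status}")
```

In this sketch, differently spelled values resolve to the same term (semantic normalization), and a value outside the vocabulary is rejected at submission time (run-time validation).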
8. Ontology mapping of templates
[Figure: mapping workflow with six numbered steps. Labels: Concept mapper; Ontology Recommender; A collection of ontologies; OBI, OBO, Cell, PR; Terms Suggestion; Expert Verification; Suggested Alteration; Finalizing Mapping; Retrieve annotation (concept URI, definitions, etc.); Incorporate into CEDAR and ImmPort]
10. Our mapping strategy
• For certain value sets such as cell populations and cytokines, the Concept mapper (CM)
maps the values to domain-specific ontologies such as the Cell Ontology (CL) and
the Protein Ontology (PR)
• For other elements, CM maps them to terms in the Ontology for
Biomedical Investigations (OBI)
• For elements that do not have matches in OBI, we map these elements to
terms in ontologies ranked highest by the OBO Foundry
• For elements that do not have any ontology term matches, we perform
a manual search in BioPortal and other available repositories for these missing
terms.
• We work closely with individual ontology groups (e.g., CL, OBI) to fill the
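The tiered strategy in this slide can be sketched as a fallback lookup. This is only an illustration under stated assumptions, not the actual concept mapper: the lookup tables, the ontology names `ONTO_A` and `ONTO_B`, every `EX_...` identifier, and the `map_element` function are hypothetical. The slide does not say what happens when a domain ontology lacks a value, so the sketch assumes such values go straight to manual search.

```python
from typing import Dict, List, Optional, Tuple

# Toy lookup tables keyed by lowercase label. All identifiers are placeholders.
CL_TERMS = {"b cell": "EX_CL_0001"}
PR_TERMS = {"interleukin-6": "EX_PR_0001"}
OBI_TERMS = {"biosample type": "EX_OBI_0001"}

# Other OBO Foundry ontologies in rank order (hypothetical names and contents).
RANKED_OBO: List[Tuple[str, Dict[str, str]]] = [
    ("ONTO_A", {"study time collected": "EX_A_0001"}),
    ("ONTO_B", {"study time collected": "EX_B_0001", "age unit": "EX_B_0002"}),
]

# Value sets that are routed to a domain-specific ontology.
DOMAIN_ONTOLOGIES = {
    "cell population": ("CL", CL_TERMS),
    "cytokine": ("PR", PR_TERMS),
}


def map_element(label: str, value_set: Optional[str] = None) -> Optional[Tuple[str, str]]:
    """Return (ontology, term id), or None when a manual search is needed."""
    key = label.strip().lower()
    # Tier 1: values from designated value sets go to their domain ontology.
    if value_set in DOMAIN_ONTOLOGIES:
        name, terms = DOMAIN_ONTOLOGIES[value_set]
        # Assumption: no fallback to OBI for these values.
        return (name, terms[key]) if key in terms else None
    # Tier 2: all other elements are tried against OBI first.
    if key in OBI_TERMS:
        return ("OBI", OBI_TERMS[key])
    # Tier 3: fall back to other ontologies, highest rank first.
    for name, terms in RANKED_OBO:
        if key in terms:
            return (name, terms[key])
    # Tier 4: nothing matched, so the element is queued for manual search.
    return None


print(map_element("B cell", "cell population"))
print(map_element("Biosample Type"))
print(map_element("study time collected"))
print(map_element("unknown column"))
```

Returning `None` rather than raising keeps unmatched elements in a single list that curators can work through, which mirrors the manual-search step described above.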
11. Template elements mapped to ontologies
• Assay types (e.g., gene expression, flow cytometry, ELISA,
HAI, Luminex)
• Template types (e.g., human subject, biosample)
• Column names (e.g., biosample type, measurement
technique)
• Value sets (e.g., set of cell populations, set of measurement
techniques)
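One way to picture the four kinds of mapped elements listed above is as records that pair a template element with an ontology term. The sketch below is an assumed data structure for illustration only; the class names, the choice of ontology per record, and the `EX_...` identifiers are hypothetical and do not reproduce the actual mapping files.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class ElementKind(Enum):
    ASSAY_TYPE = "assay type"
    TEMPLATE_TYPE = "template type"
    COLUMN_NAME = "column name"
    VALUE_SET_ENTRY = "value set entry"


@dataclass(frozen=True)
class MappedElement:
    kind: ElementKind  # which part of the template the label comes from
    label: str         # text as it appears in the template
    ontology: str      # ontology the label was mapped to
    term_id: str       # placeholder identifier, not a real term


# Example records built from the element names mentioned in the slide.
MAPPINGS = [
    MappedElement(ElementKind.ASSAY_TYPE, "ELISA", "OBI", "EX_0001"),
    MappedElement(ElementKind.TEMPLATE_TYPE, "biosample", "OBI", "EX_0002"),
    MappedElement(ElementKind.COLUMN_NAME, "biosample type", "OBI", "EX_0003"),
    MappedElement(ElementKind.COLUMN_NAME, "measurement technique", "OBI", "EX_0004"),
    MappedElement(ElementKind.VALUE_SET_ENTRY, "B cell", "CL", "EX_0005"),
]


def count_by_kind(mappings: Iterable[MappedElement]) -> Counter:
    """Count mapped elements per kind, e.g. to summarize mapping coverage."""
    return Counter(m.kind for m in mappings)


for kind, n in count_by_kind(MAPPINGS).items():
    print(f"{kind.value}: {n}")
```

Keeping the element kind on every record makes it straightforward to tabulate how many elements of each kind have been mapped.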
12. Mapping Statistics
Assay Type                  # Templates  # Sub-Templates  # Concept  # Value Set
Microarray gene expression  6            10               113        209
Flow cytometry              6            -                67         262
ELISA                       2            -                39         602
HAI                         2            -                37         117
Luminex                     7            -                102        1032
General                     6            -                115        190
13. [Figure labels: OBI (shown three times); Newly added]
A device that moves charged particles through a .... OBI_0001121
A cytometry assay in which the presence of molecules OBI_0002115
14. CEDAR helps to generate ontology-linked metadata
Use case: CEDAR immunology data submission templates
15. CEDAR has employed our suggested mapping
[Figure annotations: Map to cell term in Cell Ontology; Manual mapping to “assay” in OBI; Automatic mapping with NCIT; Automatic mapping with OBI]
https://cedar.metadatacenter.net
16. Future plan
• Refine the mapping of new assay types with an updated
algorithm.
• Map clinical metadata to ontology terms.
• Incorporate our ontology-term mapping approach into
CEDAR and ImmPort
• Submit missing terms to relevant ontologies (e.g., OBI)
17. Acknowledgment
• ImmPort
• Jeff Wiser, Patrick Dunn
• Yale
• Hailong Meng, Subhasis Mohanty
• Cell Ontology
• Alex Diehl
• NCBO BioPortal and CEDAR
• Mark Musen, John Graybeal, Martin O’Connor
• OBI
• Bjoern Peters