Cohort studies, which recruit groups of individuals who share common characteristics and follow them over a period of time, are a robust and essential method in biomedical research for understanding the links between risk factors and diseases. Through questionnaires, medical assessments, and other interactions, voluminous and complex data are collected about the study participants. While cohort studies present a treasure trove of data, the data is often not FAIR (findable, accessible, interoperable and reusable). First, due to the sensitive and private nature of medical information, cohort data are often access controlled. Due to the lack of information about the studies (metadata), often one needs to dig deep to know what data is available in a cohort study. Therefore, many cohort datasets suffer from the findable and accessible issues. Second, often data collection is performed with instruments and data specifications tailored to the study. As a result, combining data across cohorts, even ones with similar characteristics, is difficult, making interoperability and reusability a challenge. In this presentation, we will explore several informatics techniques, such as the use of ontology, to make cohort data more FAIR. We will also consider the implications of making cohort data more open and the ethical and governance issues associated with open science benefit sharing.
This webinar is part of the “How FAIR are you” webinar series and hackathon, which aim at increasing and facilitating the uptake of FAIR approaches into software, training materials and cohort data, to facilitate responsible and ethical data and resource sharing and implementation of federated applications for data analysis.
The CINECA webinar series aims to discuss ways to address common challenges and share best practices in the field of cohort data analysis, as well as disseminate CINECA project results. All CINECA webinars include an audience Q&A session during which attendees can ask questions and make suggestions. Please note that all webinars are recorded and available for later viewing.
This webinar took place on 17th February 2021 and is part of the CINECA webinar series.
For previous and upcoming CINECA webinars see:
https://www.cineca-project.eu/webinars
CINECA webinar slides: Making cohort data FAIR
1. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 825775
Making Cohort data FAIR
Presenter: William Hsiao (Simon Fraser University)
Host: Marta Lloret Llinares (EMBL-EBI)
4. The challenges:
Common Infrastructure for National Cohorts in Europe, Canada and Africa
The vision:
Accelerating disease research and improving health by facilitating transcontinental human data exchange
This project has received funding from the Canadian Institutes of Health Research under grant agreement #404896
Stay informed: @CinecaProject | www.cineca-project.eu
5. Context for the webinar
• CINECA “How FAIR are you?” webinar series and hackathon:
• https://www.cineca-project.eu/news-events-all/how-fair-are-you-webinar-series-and-hackathon
• Webinar series (Jan-April):
• Making cohort data FAIR
• FAIR software tools
• Practically FAIR
• How to make training FAIR
• Ethics/ELSI considerations
• Hackathon: 28-29th April, 4 hours per day
• 3 streams: cohort data, software, training materials
6. Today’s presenter
Dr. William Hsiao joined the Faculty of Health Sciences at Simon Fraser University as an
associate professor in September 2020. He is also an affiliated scientist at the BC Centre for
Disease Control Public Health Laboratory (BCCDC PHL) and at Canada’s Michael Smith Genome
Sciences Centre.
Prior to joining FHS, Dr. Hsiao was the chief bioinformatician and a senior scientist at the
BCCDC PHL for 8 years and a clinical associate professor at the University of British Columbia.
Dr. Hsiao obtained his PhD in the Department of Molecular Biology and Biochemistry at Simon
Fraser University followed by a postdoctoral fellowship at the Institute for Genome Sciences at
University of Maryland School of Medicine. His research focused on microbial genomics and
metagenomics.
At the BCCDC PHL, he incorporated data science and knowledge engineering into his research and developed expertise in
public health data sharing, integration, and harmonization. With experience conducting and applying genomics and data
science research both in academia and in government laboratories, Dr. Hsiao has developed a special perspective on
integrating basic and applied research to improve our public health system.
7. Making Cohort data FAIR
William Hsiao, PhD
Associate Professor, Faculty of Health Sciences
Simon Fraser University
8. Cohort Studies
• A cohort is a group of people who share a common characteristic or experience within a defined period
• A cohort study typically is observational in nature and follows a large
number of participants over a long period of time (longitudinal)
• Useful when intervention (e.g. RCT) is not possible or not ethical
• Exposure to risk factors is measured or collected
• Multiple outcomes can be observed (e.g. from large population-based
cohort)
• Valuable for scientific understanding of disease causation
• In CINECA, ~10 cohorts are part of the consortium
• A goal of CINECA is to enable federated queries and analyses of the varying
and wide-ranging datasets from the 10 CINECA cohorts.
9. Construction of Cohorts
• Define study question(s)/scope and identify and consent the study
subjects (cohort)
• Obtain baseline data on the exposure(s) to the risk factor(s) of
interest (e.g. obese vs. lean, genetic allelic variation)
• Subclassify the cohort into groups with, without, or with less exposure to the risk factor(s)
• Follow-up with the participants to measure the outcomes using data
collection instruments (questionnaire, physical measures, cognitive
measures, biosamples, etc.)
• Analyze the data to see if outcomes correlate with the risk factor(s); infer causality
10. Examples of Cohorts in CINECA
• CHILD: The CHILD Cohort Study is a prospective longitudinal birth
cohort study in Canada; ~3000 participating families (trios);
multidisciplinary; multi-modal (questionnaires, biological samples,
home assessments, clinical assessments)
• UK-Biobank: population based biomedical database; 500,000
participants from UK (national cohort); able to mobilize for COVID-19
related research by linking host genetics to disease status
• H3Africa: a pan-African consortium to build research program and
infrastructure in genomic medicine on the African continent;
biospecimens and data collection for research
11. Challenges with Cohort Studies
• Due to the (usually) prospective, large sample-size, and longitudinal
nature, cohort studies can be expensive and labour-intensive
• Data collection instruments and variables are usually specific to a
study, and data may not be interoperable and machine readable (e.g.
free text, custom pick-list, non-standardized units)
• The most valuable data are often the most sensitive and access controlled
• Consent process may restrict the re-use of cohort data
• Broad consent is encouraged as long as it is informed and allows withdrawal at any time
12. Discovery of Cohort Database Content (F/A)
• Cohort profiles are often descriptive text that is human readable but not machine readable
• Similarly, data fields and study protocols are accessible openly but
only in human readable formats (free text or tables)
Cohort profile – CHILD Cohort Study (childstudy.ca)
13. Access to Data is Controlled to Protect Privacy
Data access – CHILD Cohort Study (childstudy.ca)
14. Minimal Metadata for Cohorts
• One solution is to make cohort metadata and aggregated data searchable by machines by encoding the data in common formats
• E.g. GO FAIR’s Framework for FAIRification has a Metadata for
Machines (M4M) component
• Assess community specific metadata practices
• Using the FAIR principles to define metadata elements
• Formulate these decisions as machine-actionable templates
• Register these templates so they are FAIR and accessible by
machine API
• Resulting in domain-specific, community-built, FAIR metadata schemas
Metadata for Machines - GO FAIR (go-fair.org)
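As a concrete illustration of what a machine-actionable metadata template can look like, the sketch below expresses one as a Python dict plus a minimal validator. This is a hypothetical example, not the actual M4M template format; all field names, required keys, and allowed values are invented for illustration.

```python
# Illustrative machine-actionable metadata template (hypothetical fields, not
# the real M4M format): a schema dict plus a minimal conformance checker.

COHORT_METADATA_TEMPLATE = {
    "required": ["cohort_name", "country", "sample_size", "design"],
    "fields": {
        "cohort_name": {"type": str},
        "country": {"type": str},
        "sample_size": {"type": int},
        # A controlled pick-list keeps values interoperable across cohorts
        "design": {"type": str,
                   "allowed": ["longitudinal", "cross-sectional", "case-control"]},
    },
}

def validate(record, template):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for key in template["required"]:
        if key not in record:
            problems.append(f"missing required field: {key}")
    for key, value in record.items():
        spec = template["fields"].get(key)
        if spec is None:
            problems.append(f"unknown field: {key}")
            continue
        if not isinstance(value, spec["type"]):
            problems.append(f"{key}: expected {spec['type'].__name__}")
        elif "allowed" in spec and value not in spec["allowed"]:
            problems.append(f"{key}: {value!r} not in controlled list")
    return problems

record = {"cohort_name": "CHILD", "country": "Canada",
          "sample_size": 3000, "design": "longitudinal"}
print(validate(record, COHORT_METADATA_TEMPLATE))  # → []
```

Registering templates like this behind a machine API is what lets software, rather than a human reading free text, discover what a cohort contains.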
15. A Minimal Metadata Model for
CINECA
• A minimal metadata model is being developed for CINECA
• Based on surveys of common data fields in 10 cohort studies (no assessment of
actual data, but consulted data dictionaries to capture acceptable data values)
• Supplemented with common data fields from additional data catalogues
• Supplemented with “use cases”
• Each variable grouped into a broad category (e.g. socio-demographic and economic
characteristics, diseases, and lifestyle and behaviours)
• Maelstrom Research data standards were used as the basis for the categorization –
promote compatibility with existing data catalogues
• Each variable is mapped to an ontology term (work in progress)
• Metadata model encoded as a new “application ontology” called GECKO (Genomics
Cohorts Knowledge Ontology) - http://www.obofoundry.org/ontology/gecko.html
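The variable-to-category-to-term mapping described above can be sketched as a simple lookup table. The variable names, categories, and CURIE-style IDs below are placeholders invented for illustration, not actual GECKO mappings.

```python
# Hypothetical sketch of mapping cohort variables to broad categories and
# ontology terms. IDs of the form "GECKO:EX_000N" are placeholders, not
# real GECKO identifiers.

VARIABLE_MAP = {
    "smoking_status":   {"category": "lifestyle and behaviours",
                         "term": "GECKO:EX_0001"},  # placeholder ID
    "household_income": {"category": "socio-demographic and economic characteristics",
                         "term": "GECKO:EX_0002"},  # placeholder ID
    "asthma_diagnosis": {"category": "diseases",
                         "term": "GECKO:EX_0003"},  # placeholder ID
}

def variables_in_category(category):
    """List cohort variables grouped under a given broad category."""
    return sorted(v for v, m in VARIABLE_MAP.items() if m["category"] == category)

print(variables_in_category("diseases"))  # → ['asthma_diagnosis']
```

Once every cohort publishes such a mapping against a shared ontology, a federated query can ask for "disease" variables across all ten cohorts without knowing each study's local field names.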
17. Harmonization of metadata (-IR)
• Data collection instruments are usually developed in silos (only
considering the study scope and current research needs)
• This leads to incompatibilities when trying to aggregate datasets
across studies at both the cohort level (study attributes) and at the
individual data (participant attributes)
• As a result, data are:
• Not Interoperable
• Not (easily) reusable
18. Similar Epi/Clinical data needs across agencies
Questions re: diseases, symptoms, clinical events, dates of onset
CDC Atlanta,
BC CDC, WHO
22. Maelstrom Data Harmonization Guideline
Fortier, Isabel, et al. "Maelstrom Research guidelines for rigorous retrospective data harmonization." International journal of epidemiology (2017) 46 (1): 103-105.
23. Semantic Web and Ontology
• The term Semantic Web refers to a set of standards and best practices for
sharing data, and the semantics (meanings) of that data, over the web.
Semantic Web data can be processed directly or indirectly by machines, so
humans and computers can work in cooperation.
• An ontology is a way to structure knowledge in a machine-readable way with
defined terms and relationships.
• An ontology provides a set of vocabularies that can be used to model a
domain, i.e. the types of objects and concepts that exist in a specific area
and the relationships between them (Gruber TR, 1993).
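The idea of "defined terms and relationships" can be made concrete with a toy example: an ontology reduced to machine-readable (subject, relation, object) statements. The terms and relations below are illustrative and do not come from any real ontology.

```python
# Toy ontology as (subject, relation, object) triples; terms are illustrative.

TRIPLES = [
    ("cohort study", "is_a", "observational study"),
    ("birth cohort study", "is_a", "cohort study"),
    ("cohort study", "has_part", "study participant"),
]

def ancestors(term, triples):
    """Follow is_a links upward to collect all broader terms."""
    result = set()
    frontier = [term]
    while frontier:
        t = frontier.pop()
        for s, rel, o in triples:
            if s == t and rel == "is_a" and o not in result:
                result.add(o)
                frontier.append(o)
    return result

print(sorted(ancestors("birth cohort study", TRIPLES)))
# → ['cohort study', 'observational study']
```

Because the relationships are explicit, a program can infer that anything true of "cohort study" also applies to "birth cohort study", which is exactly the kind of reasoning a human does when reading a free-text cohort profile.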
24. Ontology
● Ontologies provide standard and linkable IDs for
○ names for form and tabular data fields
○ terms (choices) for categorical variables
● Modelling processes into data structures may be subjective: Who is the
model for, and to what granularity?
● Upper level ontologies (e.g. BFO) bring some harmony but do not
guarantee full interoperability
25. Ontology Creation and Re-use
● No silos = build the ontology as a suite of interoperable modules
structured to provide berths for future ontologies as needed
● No short half-life = start not with data (which changes rapidly) but
with the entities (things, processes, attributes, ...) the data are used
to describe
● No reinventing the wheel = follow a method tested in over 300
ontology-building initiatives and documented in ISO 21838
(https://www.iso.org/standard/71954.html), leveraging existing
resources wherever possible
● E.g. the GECKO ontology reuses terms whenever possible
26. Strengths of Ontology and the Semantic Web
• Formal representation of objects, concepts and relationships among them
• Shared understanding [language] for communicating cohort information
• Overcomes semantic heterogeneity
• Ontologies are interpretable by humans and by computer programs
• Information integration: multiple information resources are combined
using ontologies to match concepts with similar meaning
• Ontology embedded into software: object-oriented implementations
(e.g. Java classes) generated from classes in the ontology…
27. GEEM: platform for building ontology-based data specifications
http://genepio.org/geem/form.html#GENEPIO:0002083
28. Challenges with Harmonization
• There is a balance between accepting only precisely uniform variables,
which makes pooling straightforward (e.g. exact questions or standard
operating procedures) but limits the potential to integrate multiple
studies, and accepting a certain level of heterogeneity across
participating studies that provide similar but not necessarily identical
data
• E.g. broader terms used to aggregate more detailed terms
• E.g. same term used for different scales
https://www.maelstrom-research.org/about-harmonization/guidelines/data-processing-methods
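The "broader terms aggregate more detailed terms" approach can be sketched as a small set of harmonization rules that map each study's local answer values onto a common, coarser scale. The studies, questions, and value mappings below are hypothetical examples, not rules from any real harmonization project.

```python
# Hypothetical harmonization rules: Study A asked smoking status as yes/no,
# Study B used a 3-level scale; both are mapped to a broader common term.

HARMONIZATION_RULES = {
    "study_A": {"yes": "ever smoker", "no": "never smoker"},
    "study_B": {"current": "ever smoker", "former": "ever smoker",
                "never": "never smoker"},
}

def harmonize(study, value):
    """Map a study-specific value onto the broader harmonized term."""
    return HARMONIZATION_RULES[study].get(value, "unmappable")

print(harmonize("study_A", "yes"))     # → ever smoker
print(harmonize("study_B", "former"))  # → ever smoker
```

The trade-off in the slide is visible here: pooling on "ever/never smoker" works across both studies, but Study B's distinction between current and former smokers is lost in the harmonized variable.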
29. Benefit Sharing and Privacy Concerns
• Making cohort data FAIR can mean broader and easier access to data
• However, high-income countries (HICs), with more research capacity, can
usually take better advantage of such broader access than low- and
middle-income countries (LMICs)
• Food for thought (discussion): how do we protect the interests of
study participants and enable benefit sharing?
30. Questions?
Title: Making Cohort data FAIR
Presenter: William Hsiao
Please write your questions in the
questions window of the GoToWebinar
application
31. Next CINECA webinars
Title: FAIR Software tools
Presenter: Carlos Martinez
Date: Wed 24th February 2021
Time: 3:00 PM GMT / 4:00 PM CET
Registration and details:
https://www.cineca-project.eu/news-events-all/fair-software-tools
Title: Practically Fair
Presenter: Andrew Stubbs
Date: Thurs 4th March 2021
Time: 3:00 PM GMT / 4:00 PM CET
Registration and details:
https://www.cineca-project.eu/news-events-all/practically-fair