Cohort studies, which recruit groups of individuals who share common characteristics and follow them over a period of time, are a robust and essential method in biomedical research for understanding the links between risk factors and diseases. Through questionnaires, medical assessments, and other interactions, voluminous and complex data are collected about the study participants. While cohort studies present a treasure trove of data, the data is often not FAIR (findable, accessible, interoperable and reusable). First, due to the sensitive and private nature of medical information, cohort data are often access controlled. Due to the lack of information about the studies (metadata), often one needs to dig deep to know what data is available in a cohort study. Therefore, many cohort datasets suffer from the findable and accessible issues. Second, often data collection is performed with instruments and data specifications tailored to the study. As a result, combining data across cohorts, even ones with similar characteristics, is difficult, making interoperability and reusability a challenge. In this presentation, we will explore several informatics techniques, such as the use of ontology, to make cohort data more FAIR. We will also consider the implications of making cohort data more open and the ethical and governance issues associated with open science benefit sharing.
This webinar is part of the “How FAIR are you” webinar series and hackathon, which aim at increasing and facilitating the uptake of FAIR approaches into software, training materials and cohort data, to facilitate responsible and ethical data and resource sharing and implementation of federated applications for data analysis.
The CINECA webinar series aims to discuss ways to address common challenges and share best practices in the field of cohort data analysis, as well as disseminate CINECA project results. All CINECA webinars include an audience Q&A session during which attendees can ask questions and make suggestions. Please note that all webinars are recorded and available for later viewing.
This webinar took place on 17th February 2021 and is part of the CINECA webinar series.
For previous and upcoming CINECA webinars see:
https://www.cineca-project.eu/webinars
CINECA webinar slides: Making cohort data FAIR
1. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 825775
Making Cohort data FAIR
Presenter: William Hsiao (Simon Fraser University)
Host: Marta Lloret Llinares (EMBL-EBI)
4. The challenges:
Common Infrastructure for National Cohorts in Europe, Canada and Africa
The vision:
Accelerating disease research and improving health by facilitating transcontinental human data exchange
This project has received funding from the Canadian Institutes of Health Research under grant agreement #404896
Stay informed: @CinecaProject | www.cineca-project.eu
5. Context for the webinar
• CINECA “How FAIR are you?” webinar series and hackathon:
• https://www.cineca-project.eu/news-events-all/how-fair-are-you-webinar-series-and-hackathon
• Webinar series (Jan-April):
• Making cohort data FAIR
• FAIR software tools
• Practically FAIR
• How to make training FAIR
• Ethics/ELSI considerations
• Hackathon: 28-29th April, 4 hours per day
• 3 streams: cohort data, software, training materials
6. Today’s presenter
Dr. William Hsiao joined the Faculty of Health Sciences at Simon Fraser University as an
associate professor in September 2020. He is also an affiliated scientist at the BC Centre for
Disease Control Public Health Laboratory (BCCDC PHL) and at Canada’s Michael Smith Genome
Sciences Centre.
Prior to joining FHS, Dr. Hsiao was the chief bioinformatician and a senior scientist at the
BCCDC PHL for 8 years and a clinical associate professor at the University of British Columbia.
Dr. Hsiao obtained his PhD in the Department of Molecular Biology and Biochemistry at Simon
Fraser University followed by a postdoctoral fellowship at the Institute for Genome Sciences at
University of Maryland School of Medicine. His research focused on microbial genomics and
metagenomics.
At the BCCDC PHL, he incorporated data science and knowledge engineering into his research and developed expertise in
public health data sharing, integration, and harmonization. With experience conducting and applying genomics and data
science research both in academia and in government laboratories, Dr. Hsiao has developed a special perspective on
integrating basic and applied research to improve our public health system.
7. Making Cohort data FAIR
William Hsiao, PhD
Associate Professor, Faculty of Health Sciences
Simon Fraser University
8. Cohort Studies
• A cohort is a group of people who share a common characteristic or experience within a defined period
• A cohort study typically is observational in nature and follows a large
number of participants over a long period of time (longitudinal)
• Useful when intervention (e.g. RCT) is not possible or not ethical
• Exposure to risk factors is measured or collected
• Multiple outcomes can be observed (e.g. from large population-based
cohort)
• Valuable for scientific understanding of disease causation
• In CINECA, ~10 cohorts are part of the consortium
• A goal of CINECA is to enable federated queries and analyses of the varying
and wide-ranging datasets from the 10 CINECA cohorts.
9. Construction of Cohorts
• Define study question(s)/scope and identify and consent the study
subjects (cohort)
• Obtain baseline data on the exposure(s) to the risk factor(s) of
interest (e.g. obese vs. lean, genetic allelic variation)
• Subclassify the cohort into groups with, without, or with less exposure to the risk factor(s)
• Follow-up with the participants to measure the outcomes using data
collection instruments (questionnaire, physical measures, cognitive
measures, biosamples, etc.)
• Analyze the data to see if outcomes correlate with the risk factor(s); infer causality
10. Examples of Cohorts in CINECA
• CHILD: The CHILD Cohort Study is a prospective longitudinal birth
cohort study in Canada; ~3000 participating families (trios);
multidisciplinary; multi-modal (questionnaires, biological samples,
home assessments, clinical assessments)
• UK-Biobank: population based biomedical database; 500,000
participants from UK (national cohort); able to mobilize for COVID-19
related research by linking host genetics to disease status
• H3Africa: a pan-African consortium to build research program and
infrastructure in genomic medicine on the African continent;
biospecimens and data collection for research
11. Challenges with Cohort Studies
• Due to the (usually) prospective, large sample-size, and longitudinal
nature, cohort studies can be expensive and labour-intensive
• Data collection instruments and variables are usually specific to a
study, and data may not be interoperable and machine readable (e.g.
free text, custom pick-list, non-standardized units)
• The most valuable data are often the most sensitive and access controlled
• Consent process may restrict the re-use of cohort data
• Broad consent is encouraged as long as it is informed and allows withdrawal at any time
12. Discovery of Cohort Database Content (F/A)
• Cohort profiles are often descriptive text that is human readable but not machine readable
• Similarly, data fields and study protocols are accessible openly but
only in human readable formats (free text or tables)
Cohort profile – CHILD Cohort Study (childstudy.ca)
13. Access to Data is Controlled to Protect Privacy
Data access – CHILD Cohort Study (childstudy.ca)
14. Minimal Metadata for Cohorts
• One solution is to make cohort metadata and aggregated data searchable by machines by encoding the data in common formats
• E.g. GO FAIR’s Framework for FAIRification has a Metadata for
Machines (M4M) component
• Assess community specific metadata practices
• Using the FAIR principles to define metadata elements
• Formulate these decisions as machine-actionable templates
• Register these templates so they are FAIR and accessible by
machine API
• Resulting in domain-specific, community-built, FAIR metadata schemas
Metadata for Machines - GO FAIR (go-fair.org)
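As a concrete illustration of what a machine-actionable metadata template can look like, the sketch below expresses one as a Python dict plus a minimal validator. This is a hypothetical example, not the actual M4M template format; all field names, required keys, and allowed values are invented for illustration.

```python
# Illustrative machine-actionable metadata template (hypothetical fields, not
# the real M4M format): a schema dict plus a minimal conformance checker.

COHORT_METADATA_TEMPLATE = {
    "required": ["cohort_name", "country", "sample_size", "design"],
    "fields": {
        "cohort_name": {"type": str},
        "country": {"type": str},
        "sample_size": {"type": int},
        # A controlled pick-list keeps values interoperable across cohorts
        "design": {"type": str,
                   "allowed": ["longitudinal", "cross-sectional", "case-control"]},
    },
}

def validate(record, template):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for key in template["required"]:
        if key not in record:
            problems.append(f"missing required field: {key}")
    for key, value in record.items():
        spec = template["fields"].get(key)
        if spec is None:
            problems.append(f"unknown field: {key}")
            continue
        if not isinstance(value, spec["type"]):
            problems.append(f"{key}: expected {spec['type'].__name__}")
        elif "allowed" in spec and value not in spec["allowed"]:
            problems.append(f"{key}: {value!r} not in controlled list")
    return problems

record = {"cohort_name": "CHILD", "country": "Canada",
          "sample_size": 3000, "design": "longitudinal"}
print(validate(record, COHORT_METADATA_TEMPLATE))  # → []
```

Registering templates like this behind a machine API is what lets software, rather than a human reading free text, discover what a cohort contains.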
15. A Minimal Metadata Model for
CINECA
• A minimal metadata model is being developed for CINECA
• Based on surveys of common data fields in 10 cohort studies (no assessment of
actual data, but consulted data dictionaries to capture acceptable data values)
• Supplemented with common data fields from additional data catalogues
• Supplemented with “use cases”
• Each variable grouped into a broad category (e.g. socio-demographic and economic
characteristics, diseases, and lifestyle and behaviours)
• Maelstrom Research data standards were used as the basis for the categorization –
promote compatibility with existing data catalogues
• Each variable is mapped to an ontology term (work in progress)
• Metadata model encoded as a new “application ontology” called GECKO (Genomics
Cohorts Knowledge Ontology) - http://www.obofoundry.org/ontology/gecko.html
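The variable-to-category-to-term mapping described above can be sketched as a simple lookup table. The variable names, categories, and CURIE-style IDs below are placeholders invented for illustration, not actual GECKO mappings.

```python
# Hypothetical sketch of mapping cohort variables to broad categories and
# ontology terms. IDs of the form "GECKO:EX_000N" are placeholders, not
# real GECKO identifiers.

VARIABLE_MAP = {
    "smoking_status":   {"category": "lifestyle and behaviours",
                         "term": "GECKO:EX_0001"},  # placeholder ID
    "household_income": {"category": "socio-demographic and economic characteristics",
                         "term": "GECKO:EX_0002"},  # placeholder ID
    "asthma_diagnosis": {"category": "diseases",
                         "term": "GECKO:EX_0003"},  # placeholder ID
}

def variables_in_category(category):
    """List cohort variables grouped under a given broad category."""
    return sorted(v for v, m in VARIABLE_MAP.items() if m["category"] == category)

print(variables_in_category("diseases"))  # → ['asthma_diagnosis']
```

Once every cohort publishes such a mapping against a shared ontology, a federated query can ask for "disease" variables across all ten cohorts without knowing each study's local field names.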
17. Harmonization of metadata (-IR)
• Data collection instruments are usually developed in silos (only
considering the study scope and current research needs)
• This leads to incompatibilities when trying to aggregate datasets
across studies at both the cohort level (study attributes) and at the
individual data (participant attributes)
• As a result, data are:
• Not Interoperable
• Not (easily) reusable
18. Similar Epi/Clinical data needs across agencies
Questions re: diseases, symptoms, clinical events, dates of onset
CDC Atlanta,
BC CDC, WHO
22. Maelstrom Data Harmonization Guideline
Fortier, Isabel, et al. "Maelstrom Research guidelines for rigorous retrospective data harmonization." International journal of epidemiology (2017) 46 (1): 103-105.
23. Semantic Web and Ontology
• The term Semantic Web refers to a set of standards and best practices for
sharing data, and the semantics (meanings) of that data, over the web.
Semantic Web data can be processed directly or indirectly by machines, so
humans and computers can work in cooperation.
• An ontology is a way to structure knowledge in a machine-readable way with
defined terms and relationships.
• An ontology provides a set of vocabularies that can be used to model a
domain, i.e. the types of objects and concepts that exist in a specific area
and the relationships between them (Gruber TR, 1993).
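The idea of "defined terms and relationships" can be made concrete with a toy example: an ontology reduced to machine-readable (subject, relation, object) statements. The terms and relations below are illustrative and do not come from any real ontology.

```python
# Toy ontology as (subject, relation, object) triples; terms are illustrative.

TRIPLES = [
    ("cohort study", "is_a", "observational study"),
    ("birth cohort study", "is_a", "cohort study"),
    ("cohort study", "has_part", "study participant"),
]

def ancestors(term, triples):
    """Follow is_a links upward to collect all broader terms."""
    result = set()
    frontier = [term]
    while frontier:
        t = frontier.pop()
        for s, rel, o in triples:
            if s == t and rel == "is_a" and o not in result:
                result.add(o)
                frontier.append(o)
    return result

print(sorted(ancestors("birth cohort study", TRIPLES)))
# → ['cohort study', 'observational study']
```

Because the relationships are explicit, a program can infer that anything true of "cohort study" also applies to "birth cohort study", which is exactly the kind of reasoning a human does when reading a free-text cohort profile.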
24. Ontology
● Ontologies provide standard and linkable IDs for
○ names for form and tabular data fields
○ terms (choices) for categorical variables
● Modelling processes into data structures may be subjective: Who is the
model for, and to what granularity?
● Upper level ontologies (e.g. BFO) bring some harmony but do not
guarantee full interoperability
25. Ontology Creation and Re-use
● No silos = build the ontology as a suite of interoperable modules
structured to provide berths for future ontologies as needed
● No short half-life = start not with data (which changes rapidly) but
with the entities (things, processes, attributes, ...) the data are used
to describe
● No reinventing the wheel = follow a method tested in over 300
ontology-building initiatives and documented in ISO 21838
(https://www.iso.org/standard/71954.html), leveraging existing
resources wherever possible
● E.g. the GECKO ontology reuses terms whenever possible
26. Strengths of Ontology and the Semantic Web
• Formal representation of objects, concepts and relationships among them
• Shared understanding [language] for communicating cohort information
• Overcomes semantic heterogeneity
• Ontologies are interpretable by humans and by computer programs
• Information integration: multiple information resources are combined
using ontologies to match concepts with similar meaning
• Ontology embedded into software: object-oriented implementations
(e.g. Java classes) generated from classes in the ontology…
27. GEEM: platform for building ontology-based data specifications
http://genepio.org/geem/form.html#GENEPIO:0002083
28. Challenges with Harmonization
• There is a balance between accepting only precisely uniform variables,
which makes pooling straightforward (e.g. exact questions or standard
operating procedures) but limits the potential to integrate multiple
studies, and accepting a certain level of heterogeneity across
participating studies that provide similar but not necessarily identical
data
• E.g. broader terms used to aggregate more detailed terms
• E.g. same term used for different scales
https://www.maelstrom-research.org/about-harmonization/guidelines/data-processing-methods
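The "broader terms aggregate more detailed terms" approach can be sketched as a small set of harmonization rules that map each study's local answer values onto a common, coarser scale. The studies, questions, and value mappings below are hypothetical examples, not rules from any real harmonization project.

```python
# Hypothetical harmonization rules: Study A asked smoking status as yes/no,
# Study B used a 3-level scale; both are mapped to a broader common term.

HARMONIZATION_RULES = {
    "study_A": {"yes": "ever smoker", "no": "never smoker"},
    "study_B": {"current": "ever smoker", "former": "ever smoker",
                "never": "never smoker"},
}

def harmonize(study, value):
    """Map a study-specific value onto the broader harmonized term."""
    return HARMONIZATION_RULES[study].get(value, "unmappable")

print(harmonize("study_A", "yes"))     # → ever smoker
print(harmonize("study_B", "former"))  # → ever smoker
```

The trade-off in the slide is visible here: pooling on "ever/never smoker" works across both studies, but Study B's distinction between current and former smokers is lost in the harmonized variable.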
29. Benefit Sharing and Privacy Concerns
• Making cohort data FAIR can mean broader and easier access to data
• However, high-income countries (HICs), with more research capacity, can
usually take better advantage of such broader access than low- and
middle-income countries (LMICs)
• Food for thought (discussion): how do we protect the interests of
study participants and enable benefit sharing?
30. Questions?
Title: Making Cohort data FAIR
Presenter: William Hsiao
Please write your questions in the
questions window of the GoToWebinar
application
31. Next CINECA webinars
Title: FAIR Software tools
Presenter: Carlos Martinez
Date: Wed 24th February 2021
Time: 3:00 PM GMT / 4:00 PM CET
Registration and details:
https://www.cineca-project.eu/news-events-all/fair-software-tools
Title: Practically Fair
Presenter: Andrew Stubbs
Date: Thurs 4th March 2021
Time: 3:00 PM GMT / 4:00 PM CET
Registration and details:
https://www.cineca-project.eu/news-events-all/practically-fair