
International perspective for sharing publicly funded medical research data


Presentation by Olivier Salvado, CSIRO, to the 'Unlocking value from publicly funded Clinical Research Data' workshop, co-hosted by ARDC and CSIRO at ANU on 6 March 2019.


  1. International perspective for sharing publicly funded medical research data. Olivier Salvado, March 6th.
  2. EU report “Facts and Figures for open research data”
  3. EU report “Facts and Figures for open research data”
  4. Large medical data collection to enable AI
     • deCODE (private company): “In our gene discovery work in Iceland, we have gathered genotypic and medical data from more than 160,000 volunteer participants, comprising well over half of the adult population.”
     • UK Biobank: “It is following the health and well-being of 500,000 volunteer participants and provides health information, which does not identify them, to approved researchers in the UK and overseas, from academia and industry.” Genome-wide genetic data is available for 488,000 UK Biobank participants.
     • NIH precision medicine study: NIH’s 1-million-volunteer precision medicine study announces first pilot projects. President Obama’s 1-million-person long-term health study is getting started; a Google company, Verily, will offer technical help.
     • AstraZeneca: why big pharma wants to collect 2 million genomes. Five months after announcing its intentions to gather genome sequences from 2 million people, pharmaceutical giant AstraZeneca has selected geneticist David Goldstein to head up the task.
     • China embraces precision medicine on a massive scale: the Chinese Academy of Sciences has issued invitations to apply for funding for projects under the China Precision Medicine Initiative (PMI), a US$9.2-billion, 15-year project announced during the National People’s Congress sessions in March 2016.
     (Cohort scale shown on slide: 200,000; 500,000; 1,000,000; 2,000,000; 10,000,000.)
  5. Example of a large cohort study with multi-modal clinical datasets: the Australian Imaging Biomarkers and Lifestyle (AIBL) study
     • Aims: observe evolution and understand Alzheimer’s pathology; define main biomarkers for the Alzheimer’s pathway; focus on early detection; commercial partnerships.
     • Data streams: cognitive functions, blood biomarkers, genomics, demographics and lifestyle, clinical biomarkers.
     • Cohort: 1500+ subjects over 60 (HC, MCI, AD), followed 2006–2018 at baseline, 18, 36, 54, 72, 90 and 108 months.
     • Imaging: MRI — T1W (anatomy), T2W (CSF and structures), FLAIR (white matter), SWI (venous tree), DWI (connectivity); PET — amyloid with 11C and 18F markers; retina — optical and fluorescent imaging of blood vessels and amyloid.
     • Large research group with 10+ world-leading chief investigators.
  6. The Australian Imaging Biomarkers and Lifestyle study
     • Cohort: 1500+ subjects over 60 (HC, MCI, AD), 8 time points (baseline, 18, 36, 54, 72, 90, 108 and 126 months), 2006–2019.
     • Clinical data: cognition tests (260 variables); pathology (260); metals (70); plasma (15); RBM (150); demographics (55); medical history (110); physical activity (30); nutrients (329).
     • Challenges: multiple data sources; sparse data acquisition; dirty data with human entries; asynchronous data collection; millions of records; data anonymisation.
     • Data centrally managed and quality controlled, with an online portal to query and download.
     • Whole genome sequencing data volumes (e.g. Biobank): project-level variant call data — pVCF file in PLINK format (~100 GB); sample-level variant call data — VCF files for 50,000 exomes (~5 TB); sample-level aligned sequence data — CRAM files for 50,000 exomes (~50 TB).
     • Imaging data volumes: each 1h MRI session (e.g. ADNI) typically comprises 5–10 scans; each scan is a 3D dataset of at least ~10 MB; per session ~100 MB to ~10 GB (raw data not included).
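The data volumes quoted above can be sanity-checked with simple arithmetic. A minimal sketch, using only the slide's approximate figures (per-scan size, scans per session, cohort size, and exome totals are rough estimates from the slide, not measured values):

```python
# Back-of-the-envelope storage estimates from the slide's figures.
# All inputs are the slide's approximations, not measured values.

GB = 1          # work in gigabytes
TB = 1024 * GB

# MRI (ADNI-style): 5-10 scans per 1h session, each scan ~10 MB or more,
# so a session spans roughly ~100 MB to ~10 GB depending on modalities.
scans_per_session = 10
mb_per_scan = 10
session_min_gb = scans_per_session * mb_per_scan / 1024   # ~0.1 GB lower bound

# AIBL-scale cohort: 1500+ subjects, 8 time points
subjects, timepoints = 1500, 8
imaging_low = subjects * timepoints * session_min_gb      # lower bound, GB

# Whole genome sequencing (per the slide, 50,000 exomes):
exomes = 50_000
cram_total = 50 * TB                  # aligned CRAM files: ~50 TB
cram_per_exome_gb = cram_total / exomes

print(f"imaging lower bound: {imaging_low:.0f} GB")
print(f"CRAM per exome: {cram_per_exome_gb:.2f} GB")
```

Even the lower bound for the imaging stream alone exceeds a terabyte, which is why the slide emphasises centrally managed, quality-controlled storage with a query/download portal rather than ad hoc file sharing.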
  7. Examples of large observational studies using a centralised data repository open for research (timeline 2000–2020):
     • ADNeT – 4,000 subjects – 5y – $18M
     • ADNI – 1,000 subjects – 17y – $140M
     • UK Biobank – 500,000 subjects – 30y – £67M (initial)
     • HCP – ~1,000 subjects – 8y – US$38M
     • AIBL – 2,000 subjects – 12y – $12M
     • PISA – 500 subjects – 5y – $6.5M
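The studies above are easier to compare on a per-subject basis. A quick sketch using the slide's figures; note this is illustrative only: no currency conversion is applied, and the unqualified "$" amounts (ADNeT, AIBL, PISA) are assumed to be AUD since these are Australian studies, which the slide does not state explicitly:

```python
# Funding per subject for the studies listed above (slide figures).
# Amounts stay in each study's own currency; AUD for the Australian
# studies is an assumption, not stated on the slide.
studies = {
    "ADNeT":      (4_000,   18e6,  "AUD"),
    "ADNI":       (1_000,  140e6,  "USD"),
    "UK Biobank": (500_000, 67e6,  "GBP"),   # initial funding only
    "HCP":        (1_000,   38e6,  "USD"),
    "AIBL":       (2_000,   12e6,  "AUD"),
    "PISA":       (500,     6.5e6, "AUD"),
}

for name, (subjects, budget, ccy) in studies.items():
    print(f"{name:>10}: {budget / subjects:>10,.0f} {ccy}/subject")
```

Per-subject cost spans roughly three orders of magnitude, from UK Biobank's broad, relatively shallow baseline assessment to ADNI's intensive longitudinal imaging, which is one reason the studies adopt such different sharing infrastructures.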
  8. Innovative models: new models of collaboration
     • UK Biobank – 500,000 subjects: genetics + phenotyping; restricted open-access model; aggregation of contributions.
     • ENIGMA – self-organised consortium of 300 scientists from 185 institutions in 33 countries; 30,000+ subjects: genetics + imaging; shared methods. (“ENIGMA: crowdsourcing meets neuroscience”, Lancet Neurology, 2015.)
  9. FAIR principles: worldwide uptake for open, sharable research data
     • gaining maximum potential from data assets
     • increasing the visibility and citations of research
     • improving the reproducibility and reliability of research
     • staying aligned with international standards and approaches
     • attracting new partnerships with researchers, business, policy and broader communities
     • enabling new research questions to be answered
     • using new innovative research approaches and tools
     • achieving maximum impact from research.
     Note: no explicit mention of data quality.
  10. National Institutes of Health (NIH): data sharing requirement since 2003 for publicly funded research
     • In NIH's view, all data should be considered for data sharing. Data should be made as widely and freely available as possible while safeguarding the privacy of participants and protecting confidential and proprietary data.
     • To facilitate data sharing, investigators submitting a research application requesting $500,000 or more of direct costs in any single year to NIH on or after October 1, 2003 are expected to include a plan for sharing final research data for research purposes, or to state why data sharing is not possible.
     • Given the breadth and variety of science that NIH supports, neither the precise content of the data documentation, nor the formatting, presentation, or transport mode for data is stipulated: what is sensible in one field or one study may not work at all for others.
     • NIH may make data sharing an explicit term and condition of subsequent awards, and expects the timely release and sharing of data no later than the acceptance for publication of the main findings from the final dataset.
     • Methods for data sharing (according to NIH): under the auspices of the PI; data archive; data enclave; mixed-mode sharing.
  11. NIH-funded data sharing in practice: 17 specific NIH data sharing policies, resulting in 82 data repositories.
     Example policies (institute or centre / policy name):
     • NHGRI: ENCODE Consortia Data Release, Data Use, and Publication Policies
     • NHLBI: NHLBI Policy for Data Sharing from Clinical Trials and Epidemiological Studies
     • NIA: Alzheimer's Disease Genetics Sharing Plan
     • NIA: Alzheimer’s Disease Neuroimaging Initiative (ADNI) Data Sharing and Publication Policy
     • ….
     Example repositories (ICO / repository / description):
     • Common Fund: International Mouse Phenotyping Consortium (IMPC) – phenotype data on knockout mouse lines.
     • NCI: The Cancer Imaging Archive (TCIA) – a service which de-identifies and hosts a large archive of medical images of cancer, accessible for public download.
     • NEI: eyeGENE® – the eyeGENE® Biorepository and corresponding database contain family history and clinical eye exam data from subjects enrolled in the eyeGENE® Program, coupled to clinical-grade DNA samples.
     • NIA: The National Institute on Aging Genetics of Alzheimer’s Disease Data Storage Site (NIAGADS) – a national genetics data repository facilitating access to genotypic and phenotypic data for Alzheimer's disease (AD).
     • ….
  12. Europe: Horizon 2020 data sharing policy (similar to the US, using the FAIR principles; see the EU Guidelines on FAIR Data Management in Horizon 2020)
     • The Commission is running a flexible pilot under Horizon 2020 called the Open Research Data Pilot (ORD pilot).
     • The ORD pilot aims to improve and maximise access to and re-use of research data generated by Horizon 2020 projects, taking into account the need to balance openness with protection of scientific information, commercialisation and Intellectual Property Rights (IPR), privacy concerns, security, and data management and preservation questions.
     • A Data Management Plan (DMP) describes the data management life cycle for the data to be collected, processed and/or generated by a Horizon 2020 project. A DMP is required for all projects participating in the extended ORD pilot, unless they opt out of the ORD pilot; projects that opt out are still encouraged to submit a DMP on a voluntary basis.
     • Discipline-specific research data management in the life sciences: bio-informatics via ELIXIR and FORCE11/RDA FAIRsharing.
     • ELIXIR services: Core Data Resources (European data resources of fundamental importance to research in the life sciences, committed to long-term preservation of data); ELIXIR Deposition Databases (repositories recommended for the deposition of life sciences experimental data); and a database services listing (updated as Nodes finalise or review their Service Delivery Plans; see How countries join).
  13. Summary
     • FAIR principles are universally embraced.
     • Most funding agencies request data to be shared; some provide funding to cover costs, but responsibility lies with investigators.
     • Numerous domain-specific data repositories exist, with limited access.
     • Some very large studies provide centralised open infrastructure as part of their mission.
     • Clinical trials are more constrained and managed separately.
     Main types of data infrastructure:
     • Locally managed by individual investigators.
     • Directory of studies with limited query ability; each study needs to be contacted (e.g. GAAIN).
     • Data buckets centrally managed for a specific domain/application (e.g. NIH, CSIRO DAP, ADRC).
     • Large centralised open repository (e.g. LONI, UK Biobank).
     • Large federated system: locally managed with global access.
     • Novel model: locally managed data, with analysis software sent to the data.