CHI MMTC Integrating Public and Private Data


Presentation given at the CHI Molecular Medicine Tri-Conference in San Francisco, Feb 27th, 2009


    1. Integrating Public and Private Data With a Focus on Genome-Wide Association Studies
       Hans-Martin Will, Ph.D., Sr. Director, Head of Genomics R&D, Rosetta Biosoftware, 02-27-2009
    2. Rosetta Biosoftware
       • Provider of commercial informatics data-management and analysis solutions with 10 years of commercial presence
       • Enabling biopharmaceutical, academic and government organizations with solutions to drive research and innovative science forward
       • Seattle, London and Tokyo
       • Close collaborations with customers, Merck scientists, and FDA
    3. Core of this Presentation
       • Integrating Public Data Sets as Valuable Resource into Internal R&D Efforts with a Focus on Genome-Wide Association Studies (GWAS)
       • Challenges and Lessons Learned from Integrating a Lot of Structurally Similar Data Sets
    4. • For statistical geneticists, biologists, and genetics data producers/service providers
       • Is a scalable repository to organize, analyze, and mine genomics study data
       • Integrates data from the public domain with your proprietary data across technologies
       • Is built on an open platform that integrates your analysis tools of choice while avoiding time spent on data formatting
       • Designed to work in conjunction with other Rosetta Biosoftware products to maintain your prior investments
    5. Genome Browser + Google-style Search + Many Integrated Genomics Data Sources
    6. Genome-Wide Association Studies: First Successes and Adoption Curve
       • Race for the first major GWAS (type 2 diabetes) won by a group from the Genome Quebec centre at McGill University in collaboration with Imperial College London. Tested 400,000 markers; identified associations with 2 genes.
       • 5 months later, WTCCC carried out genome-wide association studies for 7 common diseases with a known significant familial component: coronary heart disease, type 1 diabetes, type 2 diabetes, rheumatoid arthritis, Crohn's disease, bipolar disorder and hypertension.
       • WTCCC study has become an “instant classic”; in less than 1 year it has been referenced by 346 publications [source: WTCCC].
    7. Impact: WTCCC Citations Over Time
       [Chart: new publications per quarter]
    8. • Over 16,000 significant and suggestive SNPs found in the analysis
       • Only 194 SNPs reported in the study publication
    9. Currently Available GWAS Data
       • “In the past 2 years, there has been a dramatic increase in genomic discoveries involving complex, non-Mendelian diseases, with nearly 100 loci for as many as 40 common diseases robustly identified and replicated in genome-wide association (GWA) studies” [T.A. Pearson, T.A. Manolio, JAMA 2008;299(11):1335-1344]
       • Number of participants profiled to date in the public domain is approaching 100,000 (70,000 individuals in dbGaP alone)
       • Hundreds of millions of dollars are spent on GWAS in the public domain. Expenditures in the private domain (big pharma, hospitals, consumer healthcare companies) are likely at the same or a higher level
    10. GWA Studies Available from NCBI dbGaP Portal
       Aging; Alzheimer Disease; Amyotrophic Lateral Sclerosis; Angina Pectoris; Asthma; Atherosclerosis; Atrial Fibrillation; Atrial Flutter; ADD with Hyperactivity; Bipolar Disorder; Bone Density; Bones of Lower Extremity; Brain Infarction; Brain Ischemia; Bulbar Palsy, Progressive; Cardiomyopathies; Cardiovascular Disease; Cardiovascular System; Cataract; Cerebrovascular Accident; Cerebrovascular Disease; Cerebrovascular Disorders; Cholesterol; Congestive Heart Failure; Coronary Artery Bypass; Coronary Disease; Death; Dementia; Diabetes Mellitus; Diabetes Mellitus, Type 2; Diabetic Nephropathy; Fatty Liver; Glaucoma; Heart Diseases; Heart Failure; Heart Failure, Congestive; Heart Valve Prosthesis; Heart Valve Prosthesis Implantation; Hormone Replacement Therapies; Hypertension; Hypertrophy, Left Ventricular; Hysterectomy; Intermittent Claudication; Intracranial Aneurysm; Intracranial Arteriovenous Malformations; Ischemic Attack, Transient; Lupus Erythematosus, Systemic; Macular Degeneration; Major Depressive Disorder; Menopause; Metacarpal Bones; Motor Neuron Disease; Muscles, Skeletal; Muscular Atrophy, Spinal; Muscular Disease; Myocardial Infarction; Neuroblastoma; Obesity; Osteoarthritis; Osteoporosis; Parkinson Disease; Psoriasis; Psoriatic Arthritis; Pulmonary Disease; Retinopathy; Rhabdomyolysis; Risk Factors; Schizophrenia; Sleep; Sleep Apnea Syndromes; Sleep Apnea, Obstructive; Smoking; Stroke
    11. Primary GWAS Publications
       [Chart: total number of publications by year, 2005-2009, with WTCCC labeled. Data from NHGRI GWAS Catalogue (Jan 2009)]
    12. How can Internal Research Benefit from these Data?
       • Validation of internal findings
       • Use result sets as gateway into research literature
       • Extend biological context into other disease areas
       • Enrich internal data sets for meta-analyses
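The last bullet on slide 12 (enriching internal data for meta-analyses) can be illustrated with a minimal sketch of fixed-effect, inverse-variance weighting. This is one common way to pool per-SNP effect estimates across studies, not a method stated in the presentation. The function name `fixed_effect_meta` and all numbers are hypothetical.

```python
import math


def fixed_effect_meta(estimates):
    """Pool per-study (beta, se) pairs by inverse-variance weighting.

    beta is a per-allele effect estimate (for example a log odds ratio)
    and se is its standard error.
    Returns (pooled_beta, pooled_se, two_sided_p).
    """
    # Each study is weighted by the inverse of its variance.
    weights = [1.0 / se ** 2 for _, se in estimates]
    total_weight = sum(weights)
    pooled_beta = sum(w * beta for w, (beta, _) in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    # Two-sided p-value from the standard normal distribution.
    z = pooled_beta / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, p_value


# Hypothetical estimates for one SNP: an internal study and a public study.
internal_study = (0.20, 0.10)
public_study = (0.18, 0.06)
beta, se, p = fixed_effect_meta([internal_study, public_study])
print(f"pooled beta={beta:.3f}, se={se:.3f}, p={p:.2e}")
```

The sketch assumes both studies report the effect for the same allele on the same scale. The heterogeneity slides later in the deck suggest that this harmonization is where much of the practical effort goes.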
    13. Example: Increase the Statistical Power of Association Studies
       • Original report by Fung et al.: Lancet Neurology 2006; 911-916
         • Parkinson’s study contained 276 cases vs. 276 controls
         • Identified 26 SNPs with association p-value of <0.0001
       • Expanded Study
         • Use Illumina iControl DB Study 67 and 64 data to increase power of the study
         • Powered study: 267 cases vs. 1,641 controls
         • Identified 114 SNPs with association p-value of <0.0001
       [Chart: power* (0 to 100) vs. number of control individuals (270, 500, 750, 1000)]
       * Assumes disease allele frequency of 0.50. Calculated with CaTS software from: Skol AD, Scott LJ, Abecasis GR and Boehnke M, Nat. Genet. (2006) 38:209-13
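The effect on slide 13 of adding controls while the case count stays fixed can be sketched with a normal-approximation power calculation for a two-sided allelic test. This is not the CaTS calculation used for the slide. The function `allelic_power` and the odds ratio of 1.5 are assumptions for illustration; the allele frequency of 0.50 and the 0.0001 threshold come from the slide.

```python
import math
from statistics import NormalDist


def allelic_power(n_cases, n_controls, control_af=0.50, odds_ratio=1.5, alpha=1e-4):
    """Approximate power of a two-sided allelic test (two-proportion z-test).

    control_af is the allele frequency in controls; odds_ratio is a
    hypothetical per-allele odds ratio (an assumption, not from the slide).
    """
    p0 = control_af
    # Allele frequency in cases implied by the allelic odds ratio.
    case_odds = odds_ratio * p0 / (1.0 - p0)
    p1 = case_odds / (1.0 + case_odds)
    # Each individual contributes two alleles.
    a1, a0 = 2 * n_cases, 2 * n_controls
    pooled = (a1 * p1 + a0 * p0) / (a1 + a0)
    se_null = math.sqrt(pooled * (1.0 - pooled) * (1.0 / a1 + 1.0 / a0))
    se_alt = math.sqrt(p1 * (1.0 - p1) / a1 + p0 * (1.0 - p0) / a0)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return NormalDist().cdf((abs(p1 - p0) - z_crit * se_null) / se_alt)


# Fixed number of cases, growing number of controls (x-axis values from the slide).
for controls in (270, 500, 750, 1000, 1641):
    print(controls, round(allelic_power(267, controls), 3))
```

Under these assumptions the power rises with every added control but with diminishing returns, because the variance contributed by the fixed case group does not shrink.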
    14. Example: Increase the Statistical Power of GWAS
       [Diagram comparing published results, results re-analyzed using PLINK, and combined data using PLINK. Labels: Extension of data set; Excluded from analysis; Difference in methods; Cases; Controls (both data sets). Counts shown: 2, 110, 32, 1, 1, 16, 8]
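Slide 14 compares published results with a re-analysis. As a minimal sketch of the kind of per-SNP case/control test behind such a comparison, the following computes a 1-degree-of-freedom allelic chi-square test from allele counts. It is not PLINK's implementation, and the slide does not say which test or options were used. The function name and the counts are hypothetical.

```python
import math


def allelic_chi_square(case_a, case_b, control_a, control_b):
    """1-df chi-square test on a 2x2 table of allele counts.

    case_a / case_b are counts of allele A / allele B in cases;
    control_a / control_b are the same counts in controls.
    Returns (chi2, p_value).
    """
    row_cases = case_a + case_b
    row_controls = control_a + control_b
    col_a = case_a + control_a
    col_b = case_b + control_b
    if min(row_cases, row_controls, col_a, col_b) == 0:
        raise ValueError("every row and column total must be non-zero")
    n = row_cases + row_controls
    chi2 = n * (case_a * control_b - case_b * control_a) ** 2 / (
        row_cases * row_controls * col_a * col_b
    )
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical allele counts for one SNP.
chi2, p = allelic_chi_square(330, 204, 1641, 1641)
print(f"chi2={chi2:.2f}, p={p:.2e}")
```

Because the statistic depends on the counts in both groups, the same SNP can fall on either side of a fixed p-value threshold when the control set, quality filters, or test change, which is one reason a re-analysis can differ from published results.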
    15. Integration Across the Data Pyramid
       [Pyramid spanning public data and private data; level of abstraction increases toward the top]
       • Knowledge
       • Integrative Results: Pathways, Networks
       • Domain Specific Analysis Results: Associations, Correlations, Clusters
       • Domain Specific Raw Assay Data: Clinical Measurements, Genotypes, Expression Profiles, Sequences, …
    16. Challenges
       • Privacy Issues
       • Data Heterogeneity
       • IT Support
    17. Privacy Issues
       • Privacy Concerns
         • Research participants are sensitive about personal information
           • Presence of disease risk
           • Paternity
           • Ancestry
       • Freedom of Research
         • Potential for public benefit of genetic research is widely acknowledged
         • General agreement that protecting privacy is critical
           • (Genetic Information Nondiscrimination Act)
         • Risks must be mitigated and balanced with the potential benefits
         • Currently, this results in a lengthy approval process for access to data
    18. [Slide contains no text]
    19. Challenges: Data Heterogeneity
       • Plethora of data formats
         • Lack of standards for certain domains
         • Too many competing standards
         • MAGE-ML, SOFT, MINiML, MAGE-TAB, ISA-TAB, …
       • Taxonomies and vocabularies
         • Overlapping scope (LOINC vs. MedDRA)
         • Inconsistent and incomplete use of vocabularies (“comments”)
       • Statistical methodology
         • What does a p-value express?
         • How do findings translate across studies?
    20. Data Deluge
    21. STATISTICIAN REQUIRED!
       Source: WTCCC Supporting Material
    22. Challenges: IT Support
       • Data Transfer: Policies and infrastructure are designed to prevent large-scale data access. Example: GSK cell line data, > 2 days transfer through corporate firewall
       • Data Storage: Resources already stretched by internal data generation. Example: internal replication of all public data
       • Data Processing: Reprocessing pipeline requires extensive compute resources (cluster)
    23. Preliminary Summary
       • Publicly available data are an underutilized resource of great value
       • Heterogeneity of formats, annotations and methodology constitutes a substantial hurdle to integrating these data into research
       • Organizations seeking to create in-house compendia based on publicly available data need to be prepared for significant investments in staff and infrastructure
    24. Preview: FDA SNPTrack
       • What is it?
         • Public repository and publicly available client for deposit of and access to GWAS data
         • Infrastructure for submission and review of (voluntary) genomic data submissions
       • What are the objectives?
         • Open collaboration around best practices (complements MAQC)
         • Platform for exchange of and access to data of interest
         • Enablement of large-scale meta-analysis
       • More details will be provided by Weida Tong later today
       ♯ Currently operating under LOI
    25. Public FDA SNPTrack Portal As Collaborative Effort across the Industry
       • Public SNPTrack Portal
         • GWAS Data Sets
         • GWAS Result Sets
         • GWAS Methods
       • Common Data Formats and Quality Standards
       • Common Data Analysis Methodology
       • Users: Academic Researcher, FDA Reviewer, BioPharma Researcher
    26. Summary & Outlook
       • Publicly available data are an underutilized resource of great value
       • Heterogeneity of formats, annotations and methodology constitutes a substantial hurdle to integrating these data into research
       • Efforts such as the development of the FDA SNPTrack System and the MAQC facilitate collaboration and discussions driving towards data harmonization
       • Industry-wide effort is needed to effectively solve these issues
    27. Acknowledgements
       • FDA
         • Weida Tong
         • Hong Fang
         • Joshua Xu
         • Steve Harris
       • Merck Research Labs
         • Jason Johnson
         • Andrey Loboda
       • Rosetta Biosoftware
         • Asa Oudes
         • Carol Preisig
         • Kristen Stoops
         • Michael Rosenberg
         • Yelena Shevelenko