2. Integrating Public and Private Data
With a Focus on Genome-Wide Association Studies
Hans-Martin Will, Ph.D.
Sr. Director, Head of Genomics R&D
Rosetta Biosoftware
02-27-2009
3. Rosetta Biosoftware
• Provider of commercial informatics
data-management and analysis
solutions with 10 years of commercial
presence
• Enabling
biopharmaceutical, academic and
government organizations with
solutions to drive research and
innovative science forward
• Seattle, London and Tokyo
• Close collaborations with
customers, Merck scientists, and
FDA
4. Core of this Presentation
Integrating Public Data Sets as Valuable Resource
into Internal R&D Efforts
with a Focus on Genome-Wide Association Studies (GWAS)
Challenges and Lessons Learned from
Integrating a lot of Structurally Similar Data Sets
5. • For statistical geneticists, biologists, and genetics data
producers/service providers
• Is a scalable repository to organize, analyze, and mine genomics study
data
• Integrates data from the public domain with your proprietary data
across technologies
• Is built on an open platform that integrates your analysis tools of
choice while avoiding time spent on data formatting
• Designed to work in conjunction with other Rosetta Biosoftware
products to maintain your prior investments
6. Genome Browser
+
Google-style Search
+
Many Integrated Genomics Data Sources
7. Genome Wide Association Studies
First Successes and Adoption Curve
• Race for the first major GWAS (type 2 diabetes) won by a group from
Genome Quebec centre in McGill University in collaboration with
Imperial College London. Tested 400,000 markers; identified
associations with 2 genes.
• 5 months later WTCCC carried out genome-wide association studies
for 7 common diseases with a known significant familial component:
coronary heart disease, type 1 diabetes, type 2 diabetes, rheumatoid
arthritis, Crohn's disease, bipolar disorder and hypertension.
• WTCCC study has become an “instant classic”; in less than 1 year it
has been referenced by 346 publications [source: WTCCC].
9. Over 16,000 significant and suggestive SNPs found in the analysis
Only 194 SNPs reported in the study publication
10. Currently Available GWAS Data
• “In the past 2 years, there has been a dramatic increase in genomic
discoveries involving complex, non-Mendelian diseases, with nearly
100 loci for as many as 40 common diseases robustly identified and
replicated in genome-wide association (GWA) studies” [T.A.
Pearson, T.A. Manolio, JAMA 2008, 299(11):1335-1344]
• Number of participants profiled to date in the public domain is
approaching 100,000 (70,000 individuals in dbGAP alone)
• Hundreds of millions of dollars are spent on GWAS in the public domain.
Expenditures in the private domain (big pharma, hospitals, consumer
healthcare companies) are likely at the same or higher level
11. GWA Studies Available from NCBI dbGAP
Portal
Aging; Coronary Disease; Metacarpal Bones
Alzheimer Disease; Death; Motor Neuron Disease
Amyotrophic Lateral Sclerosis; Dementia; Muscles, Skeletal
Angina Pectoris; Diabetes Mellitus; Muscular Atrophy, Spinal
Asthma; Diabetes Mellitus, Type 2; Muscular Disease
Atherosclerosis; Diabetic Nephropathy; Myocardial Infarction
Atrial Fibrillation; Fatty Liver; Neuroblastoma
Atrial Flutter; Glaucoma; Obesity
ADD with Hyperactivity; Heart Diseases; Osteoarthritis
Bipolar Disorder; Heart Failure; Osteoporosis
Bone Density; Heart Failure, Congestive; Parkinson Disease
Bones of Lower Extremity; Heart Valve Prosthesis; Psoriasis
Brain Infarction; Heart Valve Prosthesis Implantation; Psoriatic Arthritis
Brain Ischemia; Hormone Replacement Therapies; Pulmonary Disease
Bulbar Palsy, Progressive; Hypertension; Retinopathy
Cardiomyopathies; Hypertrophy, Left Ventricular; Rhabdomyolysis
Cardiovascular Disease; Hysterectomy; Risk Factors
Cardiovascular System; Intermittent Claudication; Schizophrenia
Cataract; Intracranial Aneurysm; Sleep
Cerebrovascular Accident; Intracranial Arteriovenous Malformations; Sleep Apnea Syndromes
Cerebrovascular Disease; Ischemic Attack, Transient; Sleep Apnea, Obstructive
Cerebrovascular Disorders; Lupus Erythematosus, Systemic; Smoking
Cholesterol; Macular Degeneration; Stroke
Congestive Heart Failure; Major Depressive Disorder
Coronary Artery Bypass; Menopause
12. Primary GWAS Publications
[Chart: total number of primary GWAS publications by year, 2005 to 2009 (y-axis: 0 to 200), with a "WTCCC" annotation]
Data from NHGRI GWAS Catalogue (Jan 2009)
13. How can Internal Research Benefit from these Data?
• Validation of internal findings
• Use result sets as gateway into research literature
• Extend biological context into other disease areas
• Enrich internal data sets for meta-analyses
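One common way to enrich an internal result with public results is a fixed-effect, inverse-variance meta-analysis of per-study effect estimates. The sketch below is an illustration only, not a method described in this presentation: `fixed_effect_meta` is a hypothetical helper, and the effect sizes and standard errors are made-up values for a single SNP.

```python
import math
from statistics import NormalDist


def fixed_effect_meta(effects, std_errors):
    """Pool per-study effect estimates (e.g. log odds ratios) for one SNP.

    Each study is weighted by the inverse of its variance, so more
    precise studies contribute more to the pooled estimate.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, effects)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled / pooled_se
    # Two-sided p-value under a standard normal reference distribution.
    p_value = 2.0 * NormalDist().cdf(-abs(z))
    return pooled, pooled_se, p_value


# Hypothetical numbers: one internal study and one public study.
internal = (0.30, 0.12)  # (log odds ratio, standard error), assumed
public = (0.25, 0.08)    # (log odds ratio, standard error), assumed

pooled, pooled_se, p_value = fixed_effect_meta(
    [internal[0], public[0]], [internal[1], public[1]]
)
print(f"pooled log OR = {pooled:.3f}, SE = {pooled_se:.3f}, p = {p_value:.2e}")
```

A fixed-effect model assumes that all studies estimate the same underlying effect; when study populations or methods differ substantially, a random-effects model may be the more appropriate choice.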
14. Example: Increase the Statistical Power of Association Studies
• Original report by Fung et al.: Lancet Neurology 2006; 911-916
• Parkinson’s study contained 276 cases vs. 276 controls
• Identified 26 SNPs with association p-value of <0.0001
• Expanded Study
• Use Illumina iControl DB Study 67 and 64 data to increase power of
the study
• Powered study: 267 cases vs. 1,641 controls
• Identified 114 SNPs with association p-value of <0.0001
[Chart: power* (y-axis: 0 to 100) vs. number of control individuals (x-axis: 270, 500, 750, 1000)]
* Assumes disease allele frequency of 0.50. Calculated with CaTS software from: Skol AD, Scott LJ, Abecasis GR and Boehnke M, Nat. Genet. (2006) 38:209-13
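The effect of adding controls on power can be illustrated with a rough normal-approximation sketch of a two-sided allelic test. This is not the CaTS calculation behind the chart: the allelic odds ratio of 1.5 and the significance level of 0.0001 are assumptions chosen for illustration, and `allelic_test_power` is a hypothetical helper. The allele frequency of 0.50 and the sample sizes are taken from the slide.

```python
from statistics import NormalDist


def allelic_test_power(n_cases, n_controls, control_freq=0.50,
                       odds_ratio=1.5, alpha=1e-4):
    """Approximate power of a two-sided test comparing allele frequencies.

    odds_ratio and alpha are illustrative assumptions, not values
    taken from the study.
    """
    norm = NormalDist()
    # Allele frequency in cases implied by the assumed allelic odds ratio.
    case_freq = (odds_ratio * control_freq
                 / (1.0 - control_freq + odds_ratio * control_freq))
    # Each individual contributes two alleles.
    variance = (case_freq * (1.0 - case_freq) / (2 * n_cases)
                + control_freq * (1.0 - control_freq) / (2 * n_controls))
    shift = abs(case_freq - control_freq) / variance ** 0.5
    z_crit = norm.inv_cdf(1.0 - alpha / 2.0)
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


# Fixed number of cases, growing control panel.
for n_controls in (270, 500, 750, 1000, 1641):
    power = allelic_test_power(267, n_controls)
    print(f"{n_controls:>5} controls: power ~ {power:.2f}")
```

Under these assumptions power rises as controls are added, but with diminishing returns, because the variance contributed by the fixed number of cases does not shrink.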
15. Example: Increase the Statistical Power of GWAS
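[Figure: overlap diagram of SNP result sets. Sets: published results; re-analyzed using PLINK; combined data using PLINK. Other labels: cases; controls (both data sets). Annotated regions: extension of data set; excluded from analysis; difference in methods. Region counts shown in the figure: 2, 110, 32, 1, 1, 16, 8.]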
16. Integration Across the Data Pyramid
Knowledge
Public Data Private Data
Level of Abstraction
Integrative Results
Pathways, Networks
Domain Specific Analysis Results
Associations, Correlations, Clusters
Domain Specific Raw Assay Data
Clinical Measurements, Genotypes, Expression Profiles,
Sequences, …
17. Challenges
• Privacy Issues
• Data Heterogeneity
• IT Support
18. Privacy Issues
Privacy Concerns
• Research participants are sensitive about
personal information
• Presence of disease risk
• Paternity
• Ancestry
Freedom of Research
• Potential for public benefit of
genetic research is widely
acknowledged
• General agreement that protecting privacy is critical
(Genetic Information Nondiscrimination Act)
• Risks must be mitigated and balanced with the potential benefits
• Currently, this results in a lengthy approval process for access to data
20. Challenges: Data Heterogeneity
Plethora of data formats
• Lack of standards for certain domains
• Too many competing standards
• MAGE-ML, SOFT, MiniML, MAGE-TAB, ISA-TAB, …
Taxonomies and vocabularies
• Overlapping scope (LOINC vs. MedDRA)
• Inconsistent and incomplete use of vocabularies
(“comments”)
Statistical methodology
• What does a p-value express?
• How do findings translate across studies?
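What a p-value expresses depends heavily on how many tests were run alongside it. As a small worked illustration, the sketch below combines the marker count quoted earlier in this presentation (400,000) with the p < 0.0001 cutoff from the Parkinson's example; pairing these two numbers is an assumption made only for illustration, and both helper functions are hypothetical. At that cutoff, about 40 markers would be expected to pass by chance alone if no marker were truly associated, while a Bonferroni correction at a family-wise level of 0.05 would require p below 1.25e-7.

```python
def bonferroni_threshold(n_tests, family_alpha=0.05):
    """Per-test significance level that keeps the family-wise error
    rate at or below family_alpha."""
    return family_alpha / n_tests


def expected_false_positives(n_tests, per_test_alpha):
    """Expected number of tests passing the cutoff when every null
    hypothesis is true."""
    return n_tests * per_test_alpha


n_markers = 400_000      # marker count quoted earlier in the presentation
reporting_cutoff = 1e-4  # p-value cutoff used in the Parkinson's example

print("Bonferroni per-test threshold:", bonferroni_threshold(n_markers))
print("Expected chance findings at the cutoff:",
      expected_false_positives(n_markers, reporting_cutoff))
```

Bonferroni is conservative for GWAS because neighboring markers are correlated, so the threshold here is a lower bound on what a less strict correction would allow.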
22. STATISTICIAN
REQUIRED !
Source: WTCCC Supporting Material
23. Challenges: IT Support
Data Transfer
• Policies and infrastructure are designed to prevent large-scale data access
• GSK cell line data: > 2 days transfer through corporate firewall
Data Storage
• Resources already stretched by internal data generation
• Internal replication of all public data
Data Processing
• Reprocessing pipeline requires extensive compute resources (cluster)
24. Preliminary Summary
• Publicly available data are an underutilized resource of great value
• Heterogeneity of formats, annotations and methodology constitutes a
substantial hurdle to integrating these data into research
• Organizations seeking to create in-house compendia based on
publicly available data need to be prepared for significant investments
in staff and infrastructure
25. Preview: FDA SNPTrack
• What is it?
• Public repository and publicly-
available client for deposit of and
access to GWAS data
• Infrastructure for submission and
review of (voluntary) genomic data
submissions
• What are the objectives?
• Open collaboration around best practices (complements MAQC)
• Platform for exchange of and access to data of interest
• Enablement of large-scale meta-analysis
• More details will be provided by Weida Tong later today
[Figure labels: Collaboration; SNPTrack]
♯ Currently operating under LOI
26. Public FDA SNPTrack Portal
As Collaborative Effort across the Industry
Public SNPTrack Portal
GWAS Data Sets
GWAS Result Sets
GWAS Methods
Common Data Formats and Quality
Standards
Common Data Analysis Methodology
Academic Researcher
FDA Reviewer
BioPharma Researcher
27. Summary & Outlook
• Publicly available data are an underutilized resource of great value
• Heterogeneity of formats, annotations and methodology constitutes a
substantial hurdle to integrating these data into research
• Efforts such as the development of the FDA SNPTrack System and
the MAQC facilitate collaboration and discussions driving towards data
harmonization
• Industry-wide effort is needed to effectively solve these issues
28. Acknowledgements
• FDA
– Weida Tong
– Hong Fang
– Joshua Xu
– Steve Harris
• Merck Research Labs
– Jason Johnson
– Andrey Loboda
• Rosetta Biosoftware
– Asa Oudes
– Carol Preisig
– Kristen Stoops
– Michael Rosenberg
– Yelena Shevelenko