CHI MMTC Integrating Public and Private Data

Integrating Public and Private Data
With a Focus on Genome-Wide Association Studies

Hans-Martin Will, Ph.D.
Sr. Director, Head of Genomics R&D
Rosetta Biosoftware

02-27-2009

Rosetta Biosoftware

• Provider of commercial informatics
data-management and analysis
solutions with 10 years of commercial
presence

• Enabling biopharmaceutical, academic and
government organizations with
solutions to drive research and
innovative science forward

Seattle, London and Tokyo

• Close collaborations with
customers, Merck scientists, and
FDA

3

Core of this Presentation

Integrating Public Data Sets as Valuable Resource
into Internal R&D Efforts
with a Focus on Genome-Wide Association Studies (GWAS)

Challenges and Lessons Learned from
Integrating a lot of Structurally Similar Data Sets

4

• For statistical geneticists, biologists, and genetics data
producers/service providers
• Is a scalable repository to organize, analyze, and mine genomics study
data
• Integrates data from the public domain with your proprietary data
across technologies
• Is built on an open platform that integrates your analysis tools of
choice while avoiding time spent on data formatting
• Designed to work in conjunction with other Rosetta Biosoftware
products to maintain your prior investments

5

Genome Browser
+
Google-style Search
+
Many Integrated Genomics Data Sources

6

Genome Wide Association Studies
First Successes and Adoption Curve

• Race for the first major GWAS (type 2 diabetes) won by a group from the
Genome Quebec centre at McGill University in collaboration with
Imperial College London. Tested 400,000 markers; identified
associations with 2 genes.

• Five months later, the WTCCC carried out genome-wide association studies
for 7 common diseases with a known significant familial component:
coronary heart disease, type 1 diabetes, type 2 diabetes, rheumatoid
arthritis, Crohn's disease, bipolar disorder and hypertension.

• The WTCCC study has become an “instant classic”; in less than 1 year it
has been referenced by 346 publications [source: WTCCC].

7

Impact: WTCCC Citations Over Time
New Publications Per Quarter

8

Over 16,000 significant and suggestive SNPs found in the analysis

Only 194 SNPs reported in the study publication

9

Currently Available GWAS Data

• “In the past 2 years, there has been a dramatic increase in genomic
discoveries involving complex, non-Mendelian diseases, with nearly
100 loci for as many as 40 common diseases robustly identified and
replicated in genome-wide association (GWA) studies” [T.A. Pearson,
T.A. Manolio, JAMA 2008, 299(11):1335-1344]

• Number of participants profiled to date in the public domain is
approaching 100,000 (70,000 individuals in dbGaP alone)

• Hundreds of millions of dollars are spent on GWAS in the public domain.
Expenditures in the private domain (big pharma, hospitals, consumer
healthcare companies) are likely at the same or higher level

10

GWA Studies Available from NCBI dbGaP Portal

Aging; Alzheimer Disease; Amyotrophic Lateral Sclerosis; Angina Pectoris;
Asthma; Atherosclerosis; Atrial Fibrillation; Atrial Flutter;
ADD with Hyperactivity; Bipolar Disorder; Bone Density;
Bones of Lower Extremity; Brain Infarction; Brain Ischemia;
Bulbar Palsy, Progressive; Cardiomyopathies; Cardiovascular Disease;
Cardiovascular System; Cataract; Cerebrovascular Accident;
Cerebrovascular Disease; Cerebrovascular Disorders; Cholesterol;
Congestive Heart Failure; Coronary Artery Bypass; Coronary Disease; Death;
Dementia; Diabetes Mellitus; Diabetes Mellitus, Type 2;
Diabetic Nephropathy; Fatty Liver; Glaucoma; Heart Diseases; Heart Failure;
Heart Failure, Congestive; Heart Valve Prosthesis;
Heart Valve Prosthesis Implantation; Hormone Replacement Therapies;
Hypertension; Hypertrophy, Left Ventricular; Hysterectomy;
Intermittent Claudication; Intracranial Aneurysm;
Intracranial Arteriovenous Malformations; Ischemic Attack, Transient;
Lupus Erythematosus, Systemic; Macular Degeneration;
Major Depressive Disorder; Menopause; Metacarpal Bones;
Motor Neuron Disease; Muscles, Skeletal; Muscular Atrophy, Spinal;
Muscular Disease; Myocardial Infarction; Neuroblastoma; Obesity;
Osteoarthritis; Osteoporosis; Parkinson Disease; Psoriasis;
Psoriatic Arthritis; Pulmonary Disease; Retinopathy; Rhabdomyolysis;
Risk Factors; Schizophrenia; Sleep; Sleep Apnea Syndromes;
Sleep Apnea, Obstructive; Smoking; Stroke

11

Primary GWAS Publications

[Chart: Total Number of Publications (axis 0 to 200) by year, 2005 to 2009; the WTCCC study is marked on the curve]

Data from NHGRI GWAS Catalogue (Jan 2009)

12

How can Internal Research Benefit from these Data?

• Validation of internal findings

• Use result sets as gateway into research literature

• Extend biological context into other disease areas

• Enrich internal data sets for meta-analyses
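The last point, enriching internal data for meta-analyses, can be illustrated with a minimal sketch. It assumes each study (public or internal) reports, for the same SNP, an effect estimate such as a log odds ratio and its standard error, and it combines them with a standard inverse-variance fixed-effect model. The function name and the example numbers are hypothetical and do not come from the presentation.

```python
import math
from statistics import NormalDist


def fixed_effect_meta(estimates):
    """Combine per-study (beta, standard_error) pairs for one SNP.

    Uses inverse-variance weighting (fixed-effect model).
    Returns (combined_beta, combined_se, two_sided_p_value).
    """
    if not estimates:
        raise ValueError("need at least one study")
    if any(se <= 0 for _, se in estimates):
        raise ValueError("standard errors must be positive")

    # Each study is weighted by the inverse of its variance.
    weights = [1.0 / (se * se) for _, se in estimates]
    total_weight = sum(weights)

    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total_weight
    se = math.sqrt(1.0 / total_weight)

    # Two-sided p-value from the normal approximation.
    z = beta / se
    p_value = 2.0 * NormalDist().cdf(-abs(z))
    return beta, se, p_value


if __name__ == "__main__":
    # Hypothetical log odds ratios and standard errors for one SNP.
    internal_study = (0.18, 0.09)
    public_study = (0.22, 0.07)
    beta, se, p = fixed_effect_meta([internal_study, public_study])
    print(f"combined beta={beta:.3f} se={se:.3f} p={p:.2e}")
```

A fixed-effect model assumes that all studies estimate the same underlying effect; when study populations or methods differ substantially, a random-effects model may be more appropriate.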

13

Example: Increase the Statistical Power of Association Studies

• Original report by Fung et al.: Lancet Neurology 2006; 911-916
• Parkinson’s study contained 276 cases vs. 276 controls
• Identified 26 SNPs with association P-value of <0.0001

• Expanded Study
• Use Illumina iControl DB Study 67 and 64 data to increase power of
the study
• Powered study: 267 cases vs. 1,641 controls
• Identified 114 SNPs with association p-value of p<0.0001

[Chart: Power* (axis 0 to 100) vs. number of control individuals (270, 500, 750, 1000)]

* Assumes disease allele frequency of 0.50. Calculated with CaTS software
from: Skol AD, Scott LJ, Abecasis GR and Boehnke M, Nat. Genet. (2006)
38:209-13
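The effect of adding controls on power can be approximated with a short sketch. This is not the CaTS calculation cited on the slide; it is a simple normal-approximation power formula for a two-sided allelic test that compares allele frequencies between cases and controls. The odds ratio of 1.5 is a hypothetical value chosen for illustration, and the significance level of 0.0001 simply mirrors the reporting threshold used above. The allele frequency of 0.50 is taken from the slide footnote.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def case_allele_freq(control_freq, odds_ratio):
    """Allele frequency in cases implied by an allelic odds ratio."""
    odds = odds_ratio * control_freq / (1.0 - control_freq)
    return odds / (1.0 + odds)


def allelic_test_power(n_cases, n_controls, control_freq, odds_ratio, alpha):
    """Approximate power of a two-sided allelic test (normal approximation).

    Each individual contributes two alleles, so the allele counts are
    2 * n_cases and 2 * n_controls.
    """
    p0 = control_freq
    p1 = case_allele_freq(p0, odds_ratio)
    variance = p1 * (1.0 - p1) / (2 * n_cases) + p0 * (1.0 - p0) / (2 * n_controls)
    z_effect = abs(p1 - p0) / variance ** 0.5
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return _STD_NORMAL.cdf(z_effect - z_crit) + _STD_NORMAL.cdf(-z_effect - z_crit)


if __name__ == "__main__":
    # 267 cases as in the expanded study; odds ratio 1.5 is an assumption.
    for n_controls in (270, 500, 750, 1000, 1641):
        power = allelic_test_power(267, n_controls, 0.50, 1.5, 1e-4)
        print(f"{n_controls:5d} controls -> approximate power {power:.2f}")
```

With the number of cases fixed, the gain from each additional control shrinks as the control group grows, because the variance contributed by the cases becomes the limiting term.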

14

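Example: Increase the Statistical Power of GWAS

[Figure: diagram comparing three result sets: published results; re-analyzed using PLINK; combined data using PLINK. Annotations: "Extension of data set", "Excluded from analysis", "Difference in methods". Legend: cases; controls (both data sets). Counts shown in the diagram (assignment to regions not recoverable): 2, 110, 32, 1, 1, 16, 8]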

15

Integration Across the Data Pyramid

[Pyramid diagram: Public Data and Private Data shown side by side; the vertical axis is Level of Abstraction. Levels from top to bottom:]

• Knowledge
• Integrative Results: Pathways, Networks
• Domain Specific Analysis Results: Associations, Correlations, Clusters
• Domain Specific Raw Assay Data: Clinical Measurements, Genotypes,
Expression Profiles, Sequences, …

16

Challenges

• Privacy Issues
• Data Heterogeneity
• IT Support

17

Privacy Issues

Privacy Concerns
• Research participants are sensitive about
personal information
• Presence of disease risk
• Paternity
• Ancestry

Freedom of Research
• Potential for public benefit of
genetic research is widely
acknowledged

• General agreement that protecting privacy is critical
• Genetic Information Non-discrimination Act (GINA)
• Risks must be mitigated and balanced with the potential benefits
• Currently, this results in a lengthy approval process for access to data
18

Challenges: Data Heterogeneity

Plethora of data formats
• Lack of standards for certain domains
• Too many competing standards
• MAGE-ML, SOFT, MiniML, MAGE-TAB, ISA-TAB, …

Taxonomies and vocabularies
• Overlapping scope (LOINC vs. MedDRA)
• Inconsistent and incomplete use of vocabularies
(“comments”)

Statistical methodology
• What does a p-value express?
• How do findings translate across studies?
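One reason p-values do not translate directly across studies is that they depend on sample size as well as effect size. The sketch below assumes a normal approximation in which the standard error of an effect estimate shrinks with the square root of the sample size; the function names, the effect size, and the sample sizes are hypothetical and chosen only to show the pattern.

```python
from statistics import NormalDist


def two_sided_p(beta, se):
    """Two-sided p-value for an effect estimate under a normal approximation."""
    return 2.0 * NormalDist().cdf(-abs(beta / se))


def se_for_sample_size(se_ref, n_ref, n):
    """Scale a reference standard error to another sample size.

    Assumes the standard error is proportional to 1 / sqrt(n).
    """
    return se_ref * (n_ref / n) ** 0.5


if __name__ == "__main__":
    beta = 0.15               # hypothetical log odds ratio, identical in all studies
    se_ref, n_ref = 0.10, 500  # hypothetical reference study
    for n in (500, 2000, 8000):
        se = se_for_sample_size(se_ref, n_ref, n)
        print(f"n={n:5d}  se={se:.3f}  p={two_sided_p(beta, se):.2e}")
```

The same underlying effect yields very different p-values at different sample sizes, which is why comparing effect estimates with their standard errors is usually more informative across studies than comparing p-values alone.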

20

STATISTICIAN REQUIRED!

Source: WTCCC Supporting Material

22

Challenges: IT Support

Data Transfer
• Policies and infrastructure are designed to prevent large-scale data access
• GSK cell line data: > 2 days transfer through corporate firewall

Data Storage
• Resources already stretched by internal data generation
• Internal replication of all public data

Data Processing
• Reprocessing pipeline requires extensive compute resources (cluster)

23

Preliminary Summary

• Publicly available data are an underutilized resource of great value

• Heterogeneity of formats, annotations and methodology constitutes a
substantial hurdle to integrating these data into research

• Organizations seeking to create in-house compendia based on
publicly available data need to be prepared for significant investments
in staff and infrastructure

24

Preview: FDA SNPTrack

• What is it?
• Public repository and publicly-available client for deposit of and
access to GWAS data
• Infrastructure for submission and review of (voluntary) genomic data
submissions

• What are the objectives?
• Open collaboration around best practices (complements MAQC)
• Platform for exchange of and access to data of interest
• Enablement of large-scale meta-analysis

• More details will be provided by Weida Tong later today

[Diagram labels: Collaboration, SNPTrack]

♯ Currently operating under LOI

25

Public FDA SNPTrack Portal
As Collaborative Effort across the Industry

Public SNPTrack Portal
• GWAS Data Sets
• GWAS Result Sets
• GWAS Methods

Common Data Formats and Quality Standards
Common Data Analysis Methodology

Participants: Academic Researcher, FDA Reviewer, BioPharma Researcher

26

Summary & Outlook

• Publicly available data are an underutilized resource of great value

• Heterogeneity of formats, annotations and methodology constitutes a
substantial hurdle to integrating these data into research

• Efforts such as the development of the FDA SNPTrack System and
the MAQC facilitate collaboration and discussions driving towards data
harmonization

• Industry-wide effort is needed to effectively solve these issues

27

Acknowledgements

• FDA
– Weida Tong
– Hong Fang
– Joshua Xu
– Steve Harris

• Merck Research Labs
– Jason Johnson
– Andrey Loboda

• Rosetta Biosoftware
– Asa Oudes
– Carol Preisig
– Kristen Stoops
– Michael Rosenberg
– Yelena Shevelenko

28
